Model selection procedures for high dimensional genomic data


  • Allan John Motyer
  • Sally Galbraith
  • Susan R Wilson



model selection, lasso, genomic data


Many complex diseases are thought to be caused by multiple genetic variants. Recent advances in genotyping technology allow investigators of a complex disease to obtain data for a massive number of candidate genetic variants. Typically, each candidate variant is tested individually for an association with the disease. We instead approach the problem as one of model selection for high dimensional data. We propose a two-stage method: penalised maximum likelihood estimation first provides a reasonably sized set of variants for inclusion in the model, and stepwise regression on this set then yields the final model. Penalised maximum likelihood estimation is performed with both the lasso and a more recently developed method known as the hyperlasso, with smoothing parameters chosen by cross-validation. Compared to the lasso, the hyperlasso has a penalty function that favours sparser solutions with less shrinkage of the variables that are included in the model; however, this comes at extra computational cost. We apply the method to a large genomic data set from a previously published mouse obesity study and use resample model averaging to assess model performance.

References
  • Kristin A. Ayers and Heather J. Cordell. SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genetic Epidemiology, 34:879--891, 2010. doi:10.1002/gepi.20543
  • David J. Balding. A tutorial on statistical methods for population association studies. Nature Reviews Genetics, 7:781--791, 2006. doi:10.1038/nrg1916
  • Christopher S. Carlson, Michael A. Eberle, Mark J. Rieder, Qian Yi, Leonid Kruglyak, and Deborah A. Nickerson. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet., 74:106--120, 2004.
  • Seoae Cho, Kyunga Kim, Young Jin Kim, Jong-Keuk Lee, Yoon Shin Cho, Jong-Young Lee, Bok-Ghee Han, Heebal Kim, Jurg Ott, and Taesung Park. Joint identification of multiple genetic variants via elastic-net variable selection in a genome-wide association analysis. Annals of Human Genetics, 74:416--428, 2010. doi:10.1111/j.1469-1809.2010.00597.x
  • European Bioinformatics Institute.
  • Jianqing Fan and Jinchi Lv. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20:101--148, 2010.
  • Anatole Ghazalpour, Sudheer Doss, Bin Zhang, Susanna Wang, Christopher Plaisier, Ruth Castellanos, Alec Brozell, Eric E. Schadt, Thomas A. Drake, Aldons J. Lusis, and Steve Horvath. Integrating genetic and network analysis to characterize genes related to mouse weight. PLoS Genetics, 2:e130, 2006.
  • I. Gradshteyn and I. Ryzhik. Tables of Integrals, Series and Products: Corrected and Enlarged Edition. Academic Press, New York, 1980.
  • J. E. Griffin and P. J. Brown. Bayesian adaptive lassos with non-convex penalization. Technical report, University of Kent, 2007.
  • Clive J. Hoggart, John C. Whittaker, Maria De Iorio, and David J. Balding. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genetics, 4:e1000130, 2008. doi:10.1371/journal.pgen.1000130
  • B. Maher. Personal genomes: The case of the missing heritability. Nature, 456:18--21, 2008. doi:10.1038/456018a
  • T. A. Manolio et al. Finding the missing heritability of complex diseases. Nature, 461:747--753, 2009. doi:10.1038/nature08494
  • Mark I. McCarthy, Goncalo R. Abecasis, Lon R. Cardon, David B. Goldstein, Julian Little, John P. A. Ioannidis, and Joel N. Hirschhorn. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics, 9:356--369, 2008. doi:10.1038/nrg2344
  • Nicolai Meinshausen and Peter Buehlmann. Stability selection. Journal of the Royal Statistical Society, Series B, 72:417--473, 2010. doi:10.1111/j.1467-9868.2010.00740.x
  • R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. ISBN 3-900051-07-0.
  • R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267--288, 1996.
  • William Valdar, Christopher C. Holmes, Richard Mott, and Jonathan Flint. Mapping in structured populations by resample model averaging. Genetics, 182:1263--1277, 2009. doi:10.1534/genetics.109.100727
  • Susanna Wang, Nadir Yehya, Eric E. Schadt, Hui Wang, Thomas A. Drake, and Aldons J. Lusis. Genetic and genomic analysis of a fat mass trait with complex inheritance reveals marked sex specificity. PLoS Genetics, 2:e15, 2006. doi:10.1371/journal.pgen.0020015
  • E. T. Whittaker. On the functions associated with the parabolic cylinder in harmonic analysis. Proc. London Math. Soc., 35:417--427, 1902. doi:10.1112/plms/s1-35.1.417
  • Jian Yang, Beben Benyamin, Brian P. McEvoy, Scott Gordon, Anjali K. Henders, Dale R. Nyholt, et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genetics, 42:565--569, 2010. doi:10.1038/ng.608
  • Gang Zheng, Jonathan Marchini, and Nancy L. Geller. Introduction to the special issue: Genome-wide association studies. Statistical Science, 24:387, 2009. doi:10.1214/09-STS310
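
The two-stage procedure described in the abstract (penalised screening followed by stepwise regression) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: it uses squared-error loss on synthetic data rather than the mouse data, it covers only the lasso (not the hyperlasso), the stepwise stage is forward-only with a BIC stopping rule, and the function names `lasso_cd`, `cv_error` and `forward_stepwise` are hypothetical.

```python
import math
import random

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1."""
    n = len(y)
    cols = list(zip(*X))
    beta = [0.0] * len(cols)
    resid = list(y)                      # current residual y - X beta
    scale = [sum(v * v for v in c) / n for c in cols]
    for _ in range(n_iter):
        for j, c in enumerate(cols):
            if scale[j] == 0.0:
                continue
            rho = sum(a * r for a, r in zip(c, resid)) / n + scale[j] * beta[j]
            # Soft-thresholding update for coefficient j.
            new = math.copysign(max(abs(rho) - lam, 0.0), rho) / scale[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * a for r, a in zip(resid, c)]
                beta[j] = new
    return beta

def predict(beta, row):
    return sum(b * x for b, x in zip(beta, row))

def cv_error(X, y, lam, k=5):
    """Mean squared prediction error from k-fold cross-validation."""
    n, err = len(y), 0.0
    for f in range(k):
        train = [i for i in range(n) if i % k != f]
        beta = lasso_cd([X[i] for i in train], [y[i] for i in train], lam)
        err += sum((y[i] - predict(beta, X[i])) ** 2
                   for i in range(n) if i % k == f)
    return err / n

def rss(X, y, cols):
    """Residual sum of squares of a least-squares fit on the given columns."""
    if not cols:
        return sum(v * v for v in y)
    Xs = [[row[j] for j in cols] for row in X]
    beta = lasso_cd(Xs, y, 0.0)          # lam = 0 gives an unpenalised fit
    return sum((yi - predict(beta, r)) ** 2 for yi, r in zip(y, Xs))

def forward_stepwise(X, y, candidates):
    """Forward selection over the screened set; BIC is an assumed criterion."""
    n = len(y)
    bic = lambda cols: n * math.log(rss(X, y, cols) / n) + len(cols) * math.log(n)
    chosen, best = [], bic([])
    while True:
        scores = [(bic(chosen + [j]), j) for j in candidates if j not in chosen]
        if not scores:
            break
        score, j = min(scores)
        if score >= best:
            break
        best = score
        chosen.append(j)
    return chosen

def centre(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

# Synthetic data (an assumption): 60 samples, 30 variables, signal in 0 and 3.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(60)]
y = [2.0 * r[0] - 1.5 * r[3] + rng.gauss(0, 0.5) for r in X]
X = [list(r) for r in zip(*[centre(c) for c in zip(*X)])]
y = centre(y)

# Stage 1: lasso with the penalty chosen by cross-validation.
lams = [0.8, 0.4, 0.2, 0.1, 0.05]
best_lam = min(lams, key=lambda lam: cv_error(X, y, lam))
screened = [j for j, b in enumerate(lasso_cd(X, y, best_lam)) if b != 0.0]

# Stage 2: stepwise regression restricted to the screened variables.
final = forward_stepwise(X, y, screened)
print("lambda:", best_lam, "screened:", screened, "final:", final)
```

The point of the design is that the penalised fit only needs to reduce thousands of candidates to a manageable set; the refit in the second stage is unpenalised, so the retained coefficients are not shrunk.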





Proceedings Computational Techniques and Applications Conference