New Book: Computational Methods for Genetics of Complex Traits
My new book, "Computational Methods for Genetics of Complex Traits", has been published as part of the Advances in Genetics series by Academic Press. Here is a summary and outline. Thanks to all the authors who made this possible.
Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits. [Amazon]
Marylyn D. Ritchie, William S. Bush, Genome Simulation: Approaches for Synthesizing In Silico Datasets for Human Genomics, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 1-24.
Abstract: Simulated data is a necessary first step in the evaluation of new analytic methods because in simulated data the true effects are known. To successfully develop novel statistical and computational methods for genetic analysis, it is vital to simulate datasets consisting of single nucleotide polymorphisms (SNPs) spread throughout the genome at a density similar to that observed by new high-throughput molecular genomics studies. In addition, the simulation of environmental data and effects will be essential to properly formulate risk models for complex disorders. Data simulations are often criticized because they are much less noisy than natural biological data, as it is nearly impossible to simulate the multitude of possible sources of natural and experimental variability. However, simulating data in silico is the most straightforward way to test the true potential of new methods during development. Thus, advances that increase the complexity of data simulations will permit investigators to better assess new analytical methods. In this work, we will briefly describe some of the current approaches for the simulation of human genomics data describing the advantages and disadvantages of the various approaches. We will also include details on software packages available for data simulation. Finally, we will expand upon one particular approach for the creation of complex, human genomic datasets that uses a forward-time population simulation algorithm: genomeSIMLA. Many of the hallmark features of biological datasets can be synthesized in silico; still much research is needed to enhance our capabilities to create datasets that capture the natural complexity of biological datasets.
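To give a feel for what "forward-time population simulation" means, here is a minimal sketch of a single SNP drifting through generations under a Wright-Fisher-style model. It is a toy illustration only and not genomeSIMLA's algorithm; the function name, parameters, and the choice of a neutral, constant-size population are all assumptions made for this example.

```python
import random


def simulate_allele_frequency(pop_size, start_freq, generations, seed=0):
    """Toy forward-time simulation of one biallelic SNP (hypothetical helper).

    Each generation, every one of the 2 * pop_size chromosomes draws its
    allele at random from the previous generation's allele frequency.
    No selection, mutation, recombination, or population growth is modeled.
    """
    rng = random.Random(seed)
    n_chrom = 2 * pop_size  # diploid individuals
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        # Count chromosomes that inherit the tracked allele.
        count = sum(1 for _ in range(n_chrom) if rng.random() < freq)
        freq = count / n_chrom
        trajectory.append(freq)
    return trajectory


if __name__ == "__main__":
    # Frequencies wander by chance alone (genetic drift).
    print(simulate_allele_frequency(pop_size=50, start_freq=0.3, generations=20))
```

Because the true generating process is known here, any method run on such data can be checked against it, which is the point the abstract makes about simulated data.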
Holger Schwender, Ingo Ruczinski, Logic Regression and Its Extensions, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 25-45.
Abstract: Logic regression is an adaptive classification and regression procedure, initially developed to reveal interacting single nucleotide polymorphisms (SNPs) in genetic association studies. In general, this approach can be used in any setting with binary predictors, when the interaction of these covariates is of primary interest. Logic regression searches for Boolean (logic) combinations of binary variables that best explain the variability in the outcome variable, and thus, reveals variables and interactions that are associated with the response and/or have predictive capabilities. The logic expressions are embedded in a generalized linear regression framework, and thus, logic regression can handle a variety of outcome types, such as binary responses in case-control studies, numeric responses, and time-to-event data. In this chapter, we provide an introduction to the logic regression methodology, list some applications in public health and medicine, and summarize some of the direct extensions and modifications of logic regression that have been proposed in the literature.
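The idea of searching for Boolean combinations of binary predictors can be sketched in a few lines. The example below is a deliberately simplified stand-in: it exhaustively scores two-variable AND/OR expressions (with optional negation) by misclassification count, whereas logic regression proper searches over larger logic trees inside a generalized linear model. The function name and the scoring rule are assumptions for illustration.

```python
from itertools import combinations


def best_pair_expression(X, y):
    """Find the two-variable Boolean expression that best matches y.

    X is a list of rows of 0/1 predictors, y a list of 0/1 outcomes.
    Returns (errors, i, negate_i, operator, j, negate_j) for the first
    expression with the fewest misclassified rows. Hypothetical helper,
    not the search used by logic regression itself.
    """
    n_vars = len(X[0])
    ops = {"AND": lambda a, b: a and b, "OR": lambda a, b: a or b}
    best = None
    for i, j in combinations(range(n_vars), 2):
        for neg_i in (False, True):
            for neg_j in (False, True):
                for name, op in ops.items():
                    errors = 0
                    for row, label in zip(X, y):
                        a = bool(row[i]) != neg_i  # flip the value if negated
                        b = bool(row[j]) != neg_j
                        errors += int(op(a, b) != bool(label))
                    if best is None or errors < best[0]:
                        best = (errors, i, neg_i, name, j, neg_j)
    return best


if __name__ == "__main__":
    # Outcome generated as x0 AND (NOT x2); x1 is noise.
    X = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    y = [int(r[0] == 1 and r[2] == 0) for r in X]
    print(best_pair_expression(X, y))
```

With real SNP data the predictors would typically be binary codings of genotypes (for example, dominant and recessive indicators), and the exhaustive loop would be replaced by a stochastic search.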
Melanie A. Wilson, James W. Baurley, Duncan C. Thomas, David V. Conti, Complex System Approaches to Genetic Analysis: Bayesian Approaches, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 47-71.
Abstract: Genetic epidemiology is increasingly focused on complex diseases involving multiple genes and environmental factors, often interacting in complex ways. Although standard frequentist methods still have a role in hypothesis generation and testing for discovery of novel main effects and interactions, Bayesian methods are particularly well suited to modeling the relationships in an integrated 'systems biology' manner. In this chapter, we provide an overview of the principles of Bayesian analysis and their advantages in this context and describe various approaches to applying them for both model building and discovery in a genome-wide setting. In particular, we highlight the ability of Bayesian methods to construct complex probability models via a hierarchical structure and to account for uncertainty in model specification by averaging over large spaces of alternative models.
Yan V. Sun, Multigenic Modeling of Complex Disease by Random Forests, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 73-99.
Abstract: The genetics and heredity of complex human traits have been studied for over a century. Many genes have been implicated in these complex traits. Genome-wide association studies (GWAS) were designed to investigate the association between common genetic variation and complex human traits using high-throughput platforms that measured hundreds of thousands of common single-nucleotide polymorphisms (SNPs). GWAS have successfully identified many novel genetic loci associated with complex traits using a univariate regression-based approach. Even for traits with a large number of identified variants, only a small fraction of the interindividual variation in risk phenotypes has been explained. In biological systems, protein, DNA, RNA, and metabolites frequently interact with each other to perform their biological functions and to respond to environmental factors. The complex interactions among genes and between the genes and environment may partially explain the 'missing heritability.' Traditional regression-based methods are limited in their ability to address the complex interactions among the hundreds of thousands of SNPs and their environmental context, owing to both modeling and computational challenges. Random Forests (RF), a powerful machine learning method, is regarded as a useful alternative for capturing the complex interaction effects in GWAS data and potentially addressing the genetic heterogeneity underlying these complex traits within a computationally efficient framework. In this chapter, the features of prediction and variable selection, and their applications in genetic association studies, are reviewed and discussed. Additional improvements of the original RF method are warranted to make its applications in GWAS more successful.
Jason H. Moore, Detecting, Characterizing, and Interpreting Nonlinear Gene-Gene Interactions Using Multifactor Dimensionality Reduction, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 101-116.
Abstract: Human health is a complex process that is dependent on many genes, many environmental factors, and chance events that are perhaps not measurable with current technology or are simply unknowable. Success in the design and execution of population-based association studies to identify those genetic and environmental factors that play an important role in human disease will depend on our ability to embrace, rather than ignore, complexity in the genotype-to-phenotype mapping relationship for any given human ecology. We review here three general computational challenges that must be addressed. First, data mining and machine learning methods are needed to model nonlinear interactions between multiple genetic and environmental factors. Second, filter and wrapper methods are needed to identify attribute interactions in large and complex solution landscapes. Third, visualization methods are needed to help interpret computational models and results. We provide here an overview of the multifactor dimensionality reduction (MDR) method that was developed for addressing each of these challenges.
Robert Culverhouse, The Restricted Partition Method, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 117-139.
Abstract: For many complex traits, the bulk of the phenotypic variation attributable to genetic factors remains unexplained, even after well-powered genome-wide association studies. Among the multiple possible explanations for the 'missing' variance, joint effects of multiple genetic variants are a particularly appealing target for investigation: they are well documented in biology and can often be evaluated using existing data. The first two sections of this chapter discuss these and other concerns that led to the development of the Restricted Partition Method (RPM). The RPM is an exploratory tool designed to investigate, in a model-agnostic manner, joint effects of genetic and environmental factors contributing to quantitative or dichotomous phenotypes. The method partitions multilocus genotypes (or genotype-environmental exposure classes) into statistically distinct 'risk' groups, then evaluates the resulting model for phenotypic variance explained. It is sensitive to factors whose effects are apparent only in a joint analysis, and which would therefore be missed by many other methods. The third section of the chapter provides details of the RPM algorithm and walks the reader through an example. The final sections of the chapter discuss practical issues related to the use of the method. Because exhaustive pairwise or higher order analyses of many SNPs are computationally burdensome, much of the discussion focuses on computational issues. The RPM proved to be practical for a large candidate gene analysis, consisting of over 40,000 SNPs, using a desktop computer. Because the algorithm and software lend themselves to distributed processing, larger analyses can easily be split among multiple computers.
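The last step the abstract mentions, evaluating a partition of multilocus genotypes for phenotypic variance explained, can be illustrated with a small sketch. It covers only that scoring step for a quantitative trait, using the ordinary between-group sum of squares over the total; how the RPM actually builds its partition is described in the chapter and is not reproduced here. The function name and the genotype labels are hypothetical.

```python
def variance_explained(phenotypes, genotypes, partition):
    """Fraction of phenotypic variance explained by a genotype partition.

    phenotypes: quantitative trait values, one per individual.
    genotypes:  multilocus genotype label for each individual.
    partition:  dict mapping each genotype label to a risk-group label.
    Illustrative scoring only; it does not construct the partition.
    """
    groups = {}
    for value, genotype in zip(phenotypes, genotypes):
        groups.setdefault(partition[genotype], []).append(value)

    grand_mean = sum(phenotypes) / len(phenotypes)
    total_ss = sum((v - grand_mean) ** 2 for v in phenotypes)
    # Between-group sum of squares: group size times squared mean offset.
    between_ss = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in groups.values()
    )
    return between_ss / total_ss if total_ss else 0.0


if __name__ == "__main__":
    phenotypes = [1.0, 1.0, 3.0, 3.0]
    genotypes = ["AA/BB", "AA/Bb", "aa/BB", "aa/bb"]
    partition = {"AA/BB": "low", "AA/Bb": "low", "aa/BB": "high", "aa/bb": "high"}
    print(variance_explained(phenotypes, genotypes, partition))
```

A partition that lumps every genotype into one group scores zero, so comparing scores across candidate partitions shows how much a particular grouping contributes.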
Peter Holmans, Statistical Methods for Pathway Analysis of Genome-Wide Data for Association with Complex Genetic Traits, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 141-179.
Abstract: A number of statistical methods have been developed to test for associations between pathways (collections of genes related biologically) and complex genetic traits. Pathway analysis methods were originally developed for analyzing gene expression data, but recently methods have been developed to perform pathway analysis on genome-wide association study (GWAS) data. The purpose of this review is to give an overview of these methods, enabling the reader to gain an understanding of what pathway analysis involves, and to select the method most suited to their purposes. This review describes the various types of statistical methods for pathway analysis, detailing the strengths and weaknesses of each. Factors influencing the power of pathway analyses, such as gene coverage and choice of pathways to analyze, are discussed, as well as various unresolved statistical issues. Finally, a list of computer programs for performing pathway analysis on genome-wide association data is provided.
Reagan J. Kelly, Jennifer A. Smith, Sharon L.R. Kardia, Providing Context and Interpretability to Genetic Association Analysis Results Using the KGraph, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 181-193.
Abstract: The KGraph is a data visualization system that has been developed to display the complex relationships between the univariate and bivariate associations among an outcome of interest, a set of covariates, and a set of genetic variations such as single-nucleotide polymorphisms (SNPs). It allows for easy simultaneous viewing and interpretation of genetic associations, correlations among covariates and SNPs, and information about the replication and cross-validation of these associations. The KGraph allows the user to more easily investigate multicollinearity and confounding through visualization of the multidimensional correlation structure underlying genetic associations. It emphasizes gene-environment interactions, gene-gene interactions, and correlations, all important components of the complex genetic architecture of most human traits. The KGraph was designed for use in gene-centric studies, but can be integrated into association analysis workflows as well. The software is available at http://www.epidkardia.sph.umich.edu/software/kgrapher