Epistasis Blog

From the Artificial Intelligence Innovation Lab at Cedars-Sinai Medical Center (www.epistasis.org)

Sunday, November 28, 2010

Our Latest MDR Papers

Here are a few of our recently published papers on Multifactor Dimensionality Reduction (MDR). The first is a review, while the next three report extensions of MDR: one for survival analysis, one for covariate adjustment, and a robust approach for the rare situation in which only a few genotype combinations contribute information. This work was supported by NIH grants R01 LM009012, R01 LM010098 and R01 AI59694.

Moore JH. Detecting, characterizing, and interpreting nonlinear gene-gene interactions using multifactor dimensionality reduction. Adv Genet. 2010;72:101-16. [PubMed]

Gui J, Moore JH, Kelsey KT, Marsit CJ, Karagas MR, Andrew AS. A novel survival multifactor dimensionality reduction method for detecting gene-gene interactions with application to bladder cancer prognosis. Hum Genet. 2010 Oct 28, in press. [PubMed]

Gui J, Andrew AS, Andrews P, Nelson HM, Kelsey KT, Karagas MR, Moore JH. A simple and computationally efficient sampling approach to covariate adjustment for multifactor dimensionality reduction analysis of epistasis. Hum Hered. 2010;70(3):219-25. [PubMed]

Gui J, Andrew AS, Andrews P, Nelson HM, Kelsey KT, Karagas MR, Moore JH. A robust multifactor dimensionality reduction method for detecting gene-gene interactions with application to the genetic analysis of bladder cancer susceptibility. Ann Hum Genet. 2010 Nov 22, in press. [PubMed]

Monday, November 22, 2010

Pathway-Based GWAS Analysis

I am a big fan of pathway-based approaches to the analysis of GWAS data, and the review below looks like a nice overview. This area needs more attention and is much more likely to pay off than the one-SNP-at-a-time approach that has dominated the field.

Wang K, Li M, Hakonarson H. Analysing biological pathways in genome-wide association studies. Nat Rev Genet. 2010 Dec;11(12):843-854. [PubMed]


Abstract: Genome-wide association (GWA) studies have typically focused on the analysis of single markers, which often lacks the power to uncover the relatively small effect sizes conferred by most genetic variants. Recently, pathway-based approaches have been developed, which use prior biological knowledge on gene function to facilitate more powerful analysis of GWA study data sets. These approaches typically examine whether a group of related genes in the same functional pathway are jointly associated with a trait of interest. Here we review the development of pathway-based approaches for GWA studies, discuss their practical use and caveats, and suggest that pathway-based approaches may also be useful for future GWA studies with sequencing data.
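
To make the idea of testing a group of related genes jointly a bit more concrete, here is a minimal sketch of one of the simplest pathway approaches, a hypergeometric over-representation test. It is not taken from the review. It assumes gene-level significance calls have already been made, it ignores gene size and linkage disequilibrium, and the function names and toy gene labels are hypothetical.

```python
from math import comb


def hypergeom_tail(n_genes, n_pathway, n_significant, n_overlap):
    """P(overlap >= n_overlap) when n_significant genes are drawn without
    replacement from n_genes, of which n_pathway belong to the pathway."""
    upper = min(n_pathway, n_significant)
    total = comb(n_genes, n_significant)
    tail = sum(
        comb(n_pathway, k) * comb(n_genes - n_pathway, n_significant - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def pathway_enrichment(all_genes, pathway_genes, significant_genes):
    """Return the overlapping genes and the one-sided enrichment p-value."""
    universe = set(all_genes)
    pathway = set(pathway_genes) & universe
    hits = set(significant_genes) & universe
    overlap = pathway & hits
    p_value = hypergeom_tail(len(universe), len(pathway), len(hits), len(overlap))
    return sorted(overlap), p_value


# Toy example with made-up gene labels (not real data).
all_genes = [f"g{i}" for i in range(20)]
pathway_genes = ["g0", "g1", "g2", "g3", "g4"]
significant_genes = ["g0", "g1", "g2", "g10"]
overlap, p_value = pathway_enrichment(all_genes, pathway_genes, significant_genes)
print(overlap, round(p_value, 4))
```

A small p-value here only says that the pathway holds more significant genes than expected by chance. The methods covered in the review go well beyond this kind of simple counting.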

Wednesday, November 17, 2010

New Book: Computational Methods for Genetics of Complex Traits

My new book on "Computational Methods for Genetics of Complex Traits" has been published as part of the Advances in Genetics series by Academic Press. Here is a summary and outline. Thanks to all the authors who made this possible.

Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits. [Amazon]

Chapter 1

Marylyn D. Ritchie, William S. Bush, Genome Simulation: Approaches for Synthesizing In Silico Datasets for Human Genomics, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 1-24.

Abstract: Simulated data is a necessary first step in the evaluation of new analytic methods because in simulated data the true effects are known. To successfully develop novel statistical and computational methods for genetic analysis, it is vital to simulate datasets consisting of single nucleotide polymorphisms (SNPs) spread throughout the genome at a density similar to that observed by new high-throughput molecular genomics studies. In addition, the simulation of environmental data and effects will be essential to properly formulate risk models for complex disorders. Data simulations are often criticized because they are much less noisy than natural biological data, as it is nearly impossible to simulate the multitude of possible sources of natural and experimental variability. However, simulating data in silico is the most straightforward way to test the true potential of new methods during development. Thus, advances that increase the complexity of data simulations will permit investigators to better assess new analytical methods. In this work, we will briefly describe some of the current approaches for the simulation of human genomics data describing the advantages and disadvantages of the various approaches. We will also include details on software packages available for data simulation. Finally, we will expand upon one particular approach for the creation of complex, human genomic datasets that uses a forward-time population simulation algorithm: genomeSIMLA. Many of the hallmark features of biological datasets can be synthesized in silico; still much research is needed to enhance our capabilities to create datasets that capture the natural complexity of biological datasets.
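
As a rough illustration of why simulation is useful (the true effects are known by construction), here is a minimal sketch that draws independent SNP genotypes under Hardy-Weinberg equilibrium and assigns case-control status from a penetrance function. This is not genomeSIMLA and not a forward-time simulation: it has no linkage disequilibrium, and the function names, allele frequencies, and the purely epistatic penetrance model are all assumptions made for the example.

```python
import random


def draw_genotype(maf, rng):
    """Number of minor alleles (0, 1 or 2) under Hardy-Weinberg equilibrium."""
    return int(rng.random() < maf) + int(rng.random() < maf)


def simulate_case_control(n_subjects, mafs, penetrance, rng):
    """Simulate independent SNPs and a binary disease status.

    `penetrance` maps a tuple of genotypes to P(disease | genotypes).
    """
    genotypes, status = [], []
    for _ in range(n_subjects):
        row = tuple(draw_genotype(maf, rng) for maf in mafs)
        genotypes.append(row)
        status.append(1 if rng.random() < penetrance(row) else 0)
    return genotypes, status


def xor_penetrance(row):
    """Hypothetical epistatic model: disease only when exactly one of the
    first two SNPs is heterozygous; the remaining SNPs are noise."""
    return 1.0 if (row[0] == 1) != (row[1] == 1) else 0.0


rng = random.Random(2010)  # fixed seed so the run is repeatable
genotypes, status = simulate_case_control(2000, [0.5, 0.5, 0.2], xor_penetrance, rng)
print(sum(status), "cases out of", len(status))
```

Because the penetrance function is known, any method run on these data can be scored on whether it recovers the first two SNPs and ignores the third.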

Chapter 2

Holger Schwender, Ingo Ruczinski, Logic Regression and Its Extensions, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 25-45.

Abstract: Logic regression is an adaptive classification and regression procedure, initially developed to reveal interacting single nucleotide polymorphisms (SNPs) in genetic association studies. In general, this approach can be used in any setting with binary predictors, when the interaction of these covariates is of primary interest. Logic regression searches for Boolean (logic) combinations of binary variables that best explain the variability in the outcome variable, and thus, reveals variables and interactions that are associated with the response and/or have predictive capabilities. The logic expressions are embedded in a generalized linear regression framework, and thus, logic regression can handle a variety of outcome types, such as binary responses in case-control studies, numeric responses, and time-to-event data. In this chapter, we provide an introduction to the logic regression methodology, list some applications in public health and medicine, and summarize some of the direct extensions and modifications of logic regression that have been proposed in the literature.
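
To give a feel for what a search over Boolean combinations of binary predictors looks like, here is a toy sketch. It is a simplification and not the logic regression algorithm itself: as I understand it, the real method searches over logic trees with simulated annealing inside a generalized linear model, whereas this sketch exhaustively scores two-literal AND/OR expressions by classification accuracy. All names are hypothetical.

```python
from itertools import combinations, product

OPERATORS = {
    "and": lambda p, q: p and q,
    "or": lambda p, q: p or q,
}


def literals(n_predictors):
    """Every predictor and its complement, as (index, negated) pairs."""
    return [(i, neg) for i in range(n_predictors) for neg in (False, True)]


def literal_value(literal, row):
    """Truth value of a literal for one observation."""
    index, negated = literal
    value = bool(row[index])
    return not value if negated else value


def best_logic_pair(X, y):
    """Exhaustively score 'L1 and L2' and 'L1 or L2' over all literal pairs.

    Returns (accuracy, operator, literal_1, literal_2) for the first best fit.
    """
    best = None
    for a, b in combinations(literals(len(X[0])), 2):
        if a[0] == b[0]:
            continue  # skip pairing a predictor with its own complement
        for name, op in OPERATORS.items():
            correct = sum(
                op(literal_value(a, row), literal_value(b, row)) == bool(label)
                for row, label in zip(X, y)
            )
            accuracy = correct / len(y)
            if best is None or accuracy > best[0]:
                best = (accuracy, name, a, b)
    return best


# Toy data: the outcome is "predictor 0 AND NOT predictor 2".
X = list(product((0, 1), repeat=3))
y = [int(row[0] == 1 and row[2] == 0) for row in X]
best = best_logic_pair(X, y)
print(best)
```

An exhaustive search like this is only feasible for a handful of predictors, which is one reason a stochastic search is used in practice.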

Chapter 3

Melanie A. Wilson, James W. Baurley, Duncan C. Thomas, David V. Conti, Complex System Approaches to Genetic Analysis: Bayesian Approaches, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 47-71.

Abstract: Genetic epidemiology is increasingly focused on complex diseases involving multiple genes and environmental factors, often interacting in complex ways. Although standard frequentist methods still have a role in hypothesis generation and testing for discovery of novel main effects and interactions, Bayesian methods are particularly well suited to modeling the relationships in an integrated 'systems biology' manner. In this chapter, we provide an overview of the principles of Bayesian analysis and their advantages in this context and describe various approaches to applying them for both model building and discovery in a genome-wide setting. In particular, we highlight the ability of Bayesian methods to construct complex probability models via a hierarchical structure and to account for uncertainty in model specification by averaging over large spaces of alternative models.

Chapter 4

Yan V. Sun, Multigenic Modeling of Complex Disease by Random Forests, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 73-99.

Abstract: The genetics and heredity of complex human traits have been studied for over a century. Many genes have been implicated in these complex traits. Genome-wide association studies (GWAS) were designed to investigate the association between common genetic variation and complex human traits using high-throughput platforms that measured hundreds of thousands of common single-nucleotide polymorphisms (SNPs). GWAS have successfully identified many novel genetic loci associated with complex traits using a univariate regression-based approach. Even for traits with a large number of identified variants, only a small fraction of the interindividual variation in risk phenotypes has been explained. In biological systems, protein, DNA, RNA, and metabolites frequently interact to each other to perform their biological functions, and to respond to environmental factors. The complex interactions among genes and between the genes and environment may partially explain the 'missing heritability.' The traditional regression-based methods are limited to address the complex interactions among the hundreds of thousands of SNPs and their environmental context by both the modeling and computational challenge. Random Forests (RF), one of the powerful machine learning methods, is regarded as a useful alternative to capture the complex interaction effects among the GWAS data, and potentially address the genetic heterogeneity underlying these complex traits using a computationally efficient framework. In this chapter, the features of prediction and variable selection, and their applications in genetic association studies are reviewed and discussed. Additional improvements of the original RF method are warranted to make the applications in GWAS to be more successful.

Chapter 5

Jason H. Moore, Detecting, Characterizing, and Interpreting Nonlinear Gene-Gene Interactions Using Multifactor Dimensionality Reduction, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 101-116.

Abstract: Human health is a complex process that is dependent on many genes, many environmental factors and chance events that are perhaps not measurable with current technology or are simply unknowable. Success in the design and execution of population-based association studies to identify those genetic and environmental factors that play an important role in human disease will depend on our ability to embrace, rather than ignore, complexity in the genotype to phenotype mapping relationship for any given human ecology. We review here three general computational challenges that must be addressed. First, data mining and machine learning methods are needed to model nonlinear interactions between multiple genetic and environmental factors. Second, filter and wrapper methods are needed to identify attribute interactions in large and complex solution landscapes. Third, visualization methods are needed to help interpret computational models and results. We provide here an overview of the multifactor dimensionality reduction (MDR) method that was developed for addressing each of these challenges.
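
For readers new to MDR, here is a minimal sketch of its core constructive induction step: each multilocus genotype combination is labeled high risk or low risk by comparing its case:control ratio to a threshold, which collapses several loci into one binary attribute. The sketch assumes a balanced-accuracy score on the training data and a threshold equal to the overall case:control ratio. It leaves out the cross-validation and permutation testing used in a real analysis, and the function names are hypothetical rather than taken from the MDR software.

```python
from collections import Counter
from itertools import combinations, product


def mdr_risk_table(genotypes, status, loci):
    """Label each genotype combination at `loci` as high (1) or low (0) risk."""
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    threshold = n_cases / n_controls  # overall case:control ratio
    cases, controls = Counter(), Counter()
    for row, y in zip(genotypes, status):
        cell = tuple(row[i] for i in loci)
        if y == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1
    # Cross-multiply so cells with zero controls need no special handling.
    return {
        cell: int(cases[cell] > threshold * controls[cell])
        for cell in set(cases) | set(controls)
    }


def balanced_accuracy(genotypes, status, loci, table):
    """Mean of sensitivity and specificity for the pooled risk attribute."""
    tp = tn = fp = fn = 0
    for row, y in zip(genotypes, status):
        predicted = table.get(tuple(row[i] for i in loci), 0)
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


def best_pair(genotypes, status):
    """Score every pair of loci and return (balanced accuracy, loci)."""
    scored = []
    for loci in combinations(range(len(genotypes[0])), 2):
        table = mdr_risk_table(genotypes, status, loci)
        scored.append((balanced_accuracy(genotypes, status, loci, table), loci))
    return max(scored)


# Toy data: status depends on an XOR of loci 0 and 1; locus 2 is noise.
genotypes = list(product((0, 1), repeat=3))
status = [row[0] ^ row[1] for row in genotypes]
print(best_pair(genotypes, status))
```

Neither locus 0 nor locus 1 shows any effect on its own in this toy example, which is exactly the kind of nonlinear interaction the pooled attribute is meant to capture.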

Chapter 6

Robert Culverhouse, The Restricted Partition Method, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 117-139.

Abstract: For many complex traits, the bulk of the phenotypic variation attributable to genetic factors remains unexplained, even after well-powered genome-wide association studies. Among the multiple possible explanations for the 'missing' variance, joint effects of multiple genetic variants are a particularly appealing target for investigation: they are well documented in biology and can often be evaluated using existing data. The first two sections of this chapter discuss these and other concerns that led to the development of the Restricted Partition Method (RPM). The RPM is an exploratory tool designed to investigate, in a model agnostic manner, joint effects of genetic and environmental factors contributing to quantitative or dichotomous phenotypes. The method partitions multilocus genotypes (or genotype-environmental exposure classes) into statistically distinct 'risk' groups, then evaluates the resulting model for phenotypic variance explained. It is sensitive to factors whose effects are apparent only in a joint analysis, and which would therefore be missed by many other methods. The third section of the chapter provides details of the RPM algorithm and walks the reader through an example. The final sections of the chapter discuss practical issues related to the use of the method. Because exhaustive pairwise or higher order analyses of many SNPs are computationally burdensome, much of the discussion focuses on computational issues. The RPM proved to be practical for a large candidate gene analysis, consisting of over 40,000 SNPs, using a desktop computer. Because the algorithm and software lend themselves to distributed processing, larger analyses can easily be split among multiple computers.
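
The partition-then-evaluate idea in this abstract can be sketched roughly as follows. This is a simplified stand-in, not the published RPM: a fixed gap between group means (the hypothetical `min_gap` parameter) replaces the statistical comparison that decides whether two groups are distinct, and significance testing is omitted. The genotype labels and trait values are made up.

```python
def mean(values):
    return sum(values) / len(values)


def restricted_partition(values_by_genotype, min_gap):
    """Greedily merge the two groups with the closest means until every
    remaining pair of groups differs by at least `min_gap`.

    Returns a list of (genotype_labels, trait_values) groups.
    """
    groups = [([g], list(v)) for g, v in sorted(values_by_genotype.items())]
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                gap = abs(mean(groups[i][1]) - mean(groups[j][1]))
                if best is None or gap < best[0]:
                    best = (gap, i, j)
        gap, i, j = best
        if gap >= min_gap:
            break  # all remaining groups are treated as distinct
        merged = (groups[i][0] + groups[j][0], groups[i][1] + groups[j][1])
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups


def variance_explained(groups):
    """Fraction of trait variance explained by group membership (R^2)."""
    all_values = [x for _, values in groups for x in values]
    grand_mean = mean(all_values)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_within = sum(
        (x - mean(values)) ** 2 for _, values in groups for x in values
    )
    return 1 - ss_within / ss_total


# Toy two-locus data (made up): the trait depends on locus A only.
trait = {
    ("AA", "BB"): [1.0, 1.2],
    ("AA", "Bb"): [1.1, 0.9],
    ("Aa", "BB"): [3.0, 3.2],
    ("Aa", "Bb"): [2.9, 3.1],
}
groups = restricted_partition(trait, min_gap=0.5)
r_squared = variance_explained(groups)
print(len(groups), round(r_squared, 3))
```

Merging genotypes whose means are indistinguishable keeps the number of 'risk' groups small, so the variance explained is not inflated by fitting every genotype cell separately.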

Chapter 7

Peter Holmans, Statistical Methods for Pathway Analysis of Genome-Wide Data for Association with Complex Genetic Traits, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 141-179.

Abstract: A number of statistical methods have been developed to test for associations between pathways (collections of genes related biologically) and complex genetic traits. Pathway analysis methods were originally developed for analyzing gene expression data, but recently methods have been developed to perform pathway analysis on genome-wide association study (GWAS) data. The purpose of this review is to give an overview of these methods, enabling the reader to gain an understanding of what pathway analysis involves, and to select the method most suited to their purposes. This review describes the various types of statistical methods for pathway analysis, detailing the strengths and weaknesses of each. Factors influencing the power of pathway analyses, such as gene coverage and choice of pathways to analyze, are discussed, as well as various unresolved statistical issues. Finally, a list of computer programs for performing pathway analysis on genome-wide association data is provided.

Chapter 8

Reagan J. Kelly, Jennifer A. Smith, Sharon L.R. Kardia, Providing Context and Interpretability to Genetic Association Analysis Results Using the KGraph, In: Jay C. Dunlap and Jason H. Moore, Editor(s), Advances in Genetics, Academic Press, 2010, Volume 72, Computational Methods for Genetics of Complex Traits, Pages 181-193.

Abstract: The KGraph is a data visualization system that has been developed to display the complex relationships between the univariate and bivariate associations among an outcome of interest, a set of covariates, and a set of genetic variations such as single-nucleotide polymorphisms (SNPs). It allows for easy simultaneous viewing and interpretation of genetic associations, correlations among covariates and SNPs, and information about the replication and cross-validation of these associations. The KGraph allows the user to more easily investigate multicollinearity and confounding through visualization of the multidimensional correlation structure underlying genetic associations. It emphasizes gene-environment interactions, gene-gene interactions, and correlations, all important components of the complex genetic architecture of most human traits. The KGraph was designed for use in gene-centric studies, but can be integrated into association analysis workflows as well. The software is available at http://www.epidkardia.sph.umich.edu/software/kgrapher

Wednesday, November 10, 2010

The Complex Genetic Architecture of the Metabolome

Here is yet another organismal study demonstrating the complexity of the mapping relationship between genotype and phenotype. The authors find that metabolite levels are canalized yet sensitive to environmental perturbation, and they recommend gene-environment interaction analysis for GWAS. If Arabidopsis thaliana is this complex, why would we expect Homo sapiens to be any simpler? Further, this kind of complexity exists at the endophenotype level. Now plug all this metabolite variation into the many additional layers of biochemistry and physiology that map genotype variation to disease susceptibility.

Chan EKF, Rowe HC, Hansen BG, Kliebenstein DJ (2010) The Complex Genetic Architecture of the Metabolome. PLoS Genet 6(11): e1001198 [PLoS]


Abstract: Discovering links between the genotype of an organism and its metabolite levels can increase our understanding of metabolism, its controls, and the indirect effects of metabolism on other quantitative traits. Recent technological advances in both DNA sequencing and metabolite profiling allow the use of broad-spectrum, untargeted metabolite profiling to generate phenotypic data for genome-wide association studies that investigate quantitative genetic control of metabolism within species. We conducted a genome-wide association study of natural variation in plant metabolism using the results of untargeted metabolite analyses performed on a collection of wild Arabidopsis thaliana accessions. Testing 327 metabolites against >200,000 single nucleotide polymorphisms identified numerous genotype–metabolite associations distributed non-randomly within the genome. These clusters of genotype–metabolite associations (hotspots) included regions of the A. thaliana genome previously identified as subject to recent strong positive selection (selective sweeps) and regions showing trans-linkage to these putative sweeps, suggesting that these selective forces have impacted genome-wide control of A. thaliana metabolism. Comparing the metabolic variation detected within this collection of wild accessions to a laboratory-derived population of recombinant inbred lines (derived from two of the accessions used in this study) showed that the higher level of genetic variation present within the wild accessions did not correspond to higher variance in metabolic phenotypes, suggesting that evolutionary constraints limit metabolic variation. While a major goal of genome-wide association studies is to develop catalogues of intraspecific variation, the results of multiple independent experiments performed for this study showed that the genotype–metabolite associations identified are sensitive to environmental fluctuations. Thus, studies of intraspecific variation conducted via genome-wide association will require analyses of genotype by environment interaction. Interestingly, the network structure of metabolite linkages was also sensitive to environmental differences, suggesting that key aspects of network architecture are malleable.

Saturday, November 06, 2010

Top 10 Tips for Getting an R01 Funded by the National Library of Medicine

I just returned from serving on the Biomedical Library and Informatics Review Committee (BLIRC) for the National Library of Medicine (NLM). Here are 10 important things to keep in mind when writing an R01 for the NLM. These are all based on my experience serving on BLIRC over the past year. My bias is toward bioinformatics and computational biology; a clinical or library informaticist might have a different perspective. It is always a good idea to talk with your program officer before writing and submitting a grant.

1) Articulate an important and timely informatics question. Be forward-thinking. Know what is hot and what is going to be hot. Make sure that answering your particular scientific question will have an impact on biomedical research or clinical practice.

2) Propose novel informatics methods. Innovation is very important. Know the literature and where your new method fits in. If you are having trouble coming up with a truly innovative approach, you might try combining existing methods in innovative ways. This is less exciting but much better than an incremental improvement on an existing approach.

3) Avoid purely applied software engineering projects. In other words, don't focus your grant only on building a database, web server or software package. The majority of the grant must be focused on novel algorithms or methods. NLM is looking for new informatics methods. They sometimes have separate RFAs for resource development grants (e.g., the G08 mechanism).

4) Compare your algorithm or method to the state of the art in the field. Don't just propose a new algorithm or method. You need to have a baseline approach to compare it to. How do you know that your novel method is going to work better than what people are currently using?

5) A solid plan for how you will evaluate your novel informatics method is critical. How will you know whether your approach is truly working better than the state of the art in the field? Be very specific about how you will evaluate your approach and what the criteria are for concluding it is indeed working.

6) Application to real data is important. Simulation studies are necessary but not sufficient. Describe the biomedical data you will analyze and how you will improve your method based on results. Don't forget the details of how you will actually do the analysis. What significance criteria will you use?

7) Provide as many details as possible about your novel informatics algorithm or method within the space constraints. Reviewers are unlikely to give you the benefit of the doubt, especially if you are a junior investigator with a poor track record. Tell the reviewers exactly how you are going to develop, extend, modify, apply and evaluate your informatics approach.

8) Be productive! Reviewers want to see a good paper trail from your previous faculty, postdoc and graduate student research. Your reviewers need to be convinced that, if you are awarded a grant, you will actually make a contribution to the literature. It is well worth those extra evenings and weekends to get your papers submitted.

9) Innovation and approach have the biggest impact on your final score. The NLM did a factor analysis of scores for significance, innovation, approach, investigator and environment and their relationship with overall impact score. Innovation and approach had the highest correlation with the overall score. I agree with this completely based on my experience serving on BLIRC.

10) Make sure you have good collaborators with real effort budgeted to cover your weaknesses. It is often the case that a junior investigator will add a well-established senior investigator to the grant thinking the name recognition will help. This does not help and is seen as a negative if the senior person does not have real effort budgeted on the grant. Make sure your senior collaborator can contribute at least 5% effort and preferably 10% or more. Otherwise, no one will believe that the senior person will actually do any real work.