Epistasis Blog

From the Artificial Intelligence Innovation Lab at Cedars-Sinai Medical Center (www.epistasis.org)

Thursday, June 19, 2008

Estimation of Distribution Algorithms (EDA)

Our paper on the use of simple Estimation of Distribution Algorithms (EDA) for the genome-wide genetic analysis of epistasis has been accepted for publication in Lecture Notes in Computer Science as part of the Sixth Annual Conference on Ant Colony Optimization and Swarm Intelligence (ANTS) to be held in September in Belgium. Our EDA was implemented as an Ant Colony Optimization (ACO) algorithm, which is inspired by the successful strategies (e.g. pheromones) that ants use to forage for food. In practice, ACO maintains a simple matrix of probabilities for choosing different SNPs for analysis that is updated as different solutions are evaluated. ACO was appealing for this problem because it inherently uses heuristic information or expert knowledge to update probabilities. We are in the process of adding this simple univariate EDA to the open-source MDR software package that is freely available from here. This new version (v. 2.0) will be available later this summer.

Greene CS, White BC, Moore JH. Ant colony optimization for genome-wide genetic analysis. Lecture Notes in Computer Science, in press (2008).


In human genetics it is now feasible to measure large numbers of DNA sequence variations across the human genome. Given current knowledge about biological networks and disease processes it seems likely that disease risk can best be modeled by interactions between biological components, which can be examined as interacting DNA sequence variations. The machine learning challenge is to effectively explore interactions in these datasets to identify combinations of variations which are predictive of common human diseases. Ant colony optimization (ACO) is a promising approach to this problem. The goal of this study is to examine the usefulness of ACO for problems in this domain and to develop a prototype of an expert knowledge guided probabilistic search wrapper. We show that an ACO approach is not successful in the absence of expert knowledge but is successful when expert knowledge is supplied through the pheromone updating rule.
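
For readers who want to see the idea in code, here is a minimal sketch of a univariate ACO-style search over pairs of SNPs. It is not the implementation from the paper or from the MDR software. The function name aco_snp_search, the parameters, the evaporation rate, and the way expert weights scale the pheromone deposit are all assumptions made for illustration, and the toy scoring function stands in for a real model evaluation such as MDR accuracy.

```python
import random


def aco_snp_search(n_snps, score_fn, expert_weights=None, n_ants=20,
                   n_iterations=50, evaporation=0.1, seed=0):
    """Toy univariate ACO search over SNP pairs (illustrative assumptions only)."""
    if n_snps < 2:
        raise ValueError("need at least two SNPs to form a pair")
    rng = random.Random(seed)
    snps = range(n_snps)
    # One pheromone value per SNP; selection probability is proportional to it.
    pheromone = [1.0] * n_snps
    # Hypothetical expert-knowledge weights (1.0 means "no prior preference").
    weights = list(expert_weights) if expert_weights is not None else [1.0] * n_snps
    best_pair, best_score = None, float("-inf")

    for _ in range(n_iterations):
        evaluated = []
        for _ in range(n_ants):
            # Each "ant" samples two distinct SNPs according to the pheromone.
            first = rng.choices(snps, weights=pheromone)[0]
            second = first
            while second == first:
                second = rng.choices(snps, weights=pheromone)[0]
            pair = tuple(sorted((first, second)))
            score = score_fn(pair)
            evaluated.append((pair, score))
            if score > best_score:
                best_pair, best_score = pair, score
        # Evaporate, then deposit pheromone scaled by score and expert weight.
        pheromone = [(1.0 - evaporation) * p for p in pheromone]
        for pair, score in evaluated:
            for snp in pair:
                pheromone[snp] += score * weights[snp]
    return best_pair, best_score, pheromone


def toy_score(pair):
    # Stand-in for a real model evaluation; SNPs 3 and 7 "interact".
    return 1.0 if pair == (3, 7) else 0.1


best_pair, best_score, pheromone = aco_snp_search(10, toy_score)
```

Scores are assumed to be non-negative so that the pheromone values stay valid as sampling weights.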

Monday, June 16, 2008

Bateson and Epistasis

William Bateson first used the term 'epistasis' to describe distortions of Mendelian segregation ratios that were due to one gene masking the effects of another. This is discussed in his 1909 book "Mendel's Principles of Heredity", which has now been reprinted by Cosimo Classics and is available through Amazon.com.

Thursday, June 12, 2008

A computationally efficient hypothesis testing method for epistasis analysis using MDR

Our paper on using extreme value distributions (EVD) to significantly reduce the number of permutations needed to assess the significance of an MDR model has been accepted for publication in Genetic Epidemiology. We show that as few as 20 permutations are needed to preserve power and type I error, thus reducing the computational burden of the standard 1000-fold permutation test by 50-fold. This will play an important role in using MDR for the analysis of genome-wide association studies (GWAS).

I would like to thank the anonymous referees of the paper, who went above and beyond the call of duty to help us improve it. We are very grateful, and I wish all reviews could be this complete.

Pattin, K.A., White, B.C., Barney, N., Gui, J., Nelson, H.H., Kelsey, K.R., Andrew, A.S., Karagas, M.R., Moore, J.H. A computationally efficient hypothesis testing method for epistasis analysis using multifactor dimensionality reduction. Genetic Epidemiology, in press (2008).


Multifactor dimensionality reduction (MDR) was developed as a nonparametric and model-free data mining method for detecting, characterizing, and interpreting epistasis in the absence of significant main effects in genetic and epidemiologic studies of complex traits such as disease susceptibility. The goal of MDR is to change the representation of the data using a constructive induction algorithm to make nonadditive interactions easier to detect using any classification method such as naïve Bayes or logistic regression. Traditionally, MDR constructed variables have been evaluated with a naïve Bayes classifier that is combined with 10-fold cross validation to obtain an estimate of predictive accuracy or generalizability of epistasis models. Traditionally, we have used permutation testing to statistically evaluate the significance of models obtained through MDR. The advantage of permutation testing is that it controls for false-positives due to multiple testing. The disadvantage is that permutation testing is computationally expensive. This is an important issue that arises in the context of detecting epistasis on a genome-wide scale. The goal of the present study was to develop and evaluate several alternatives to large-scale permutation testing for assessing the statistical significance of MDR models. Using data simulated from 70 different epistasis models, we compared the power and type I error rate of MDR using a 1000-fold permutation test with hypothesis testing using an extreme value distribution (EVD). We find that this new hypothesis testing method provides a reasonable alternative to the computationally expensive 1000-fold permutation test and is 50 times faster. We then demonstrate this new method by applying it to a genetic epidemiology study of bladder cancer susceptibility that was previously analyzed using MDR and assessed using a 1000-fold permutation test.
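
To make the EVD idea concrete, here is a rough sketch of how a p-value could be estimated from a small number of permutations. It assumes a Gumbel (type I extreme value) distribution fitted by the method of moments, which is only one possible choice and may not match the estimator used in the paper. The function names fit_gumbel and evd_p_value are hypothetical, and the null accuracies in the usage lines are made-up numbers for illustration, not results.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(null_scores):
    """Method-of-moments Gumbel fit: returns (location, scale)."""
    mean = statistics.fmean(null_scores)
    sd = statistics.stdev(null_scores)
    if sd == 0:
        raise ValueError("null scores must not all be identical")
    scale = sd * math.sqrt(6) / math.pi
    location = mean - EULER_GAMMA * scale
    return location, scale


def evd_p_value(observed, null_scores):
    """Upper-tail probability of the observed score under the fitted Gumbel."""
    location, scale = fit_gumbel(null_scores)
    z = (observed - location) / scale
    # 1 - exp(-exp(-z)), written with expm1 to keep precision for tiny p-values.
    return -math.expm1(-math.exp(-z))


# Illustrative only: accuracies of the best model from 20 permuted datasets.
null_accuracies = [0.52, 0.54, 0.51, 0.55, 0.53, 0.52, 0.56, 0.53, 0.54, 0.52,
                   0.55, 0.53, 0.51, 0.54, 0.53, 0.52, 0.55, 0.54, 0.53, 0.52]
p = evd_p_value(0.62, null_accuracies)
```

The saving quoted in the post follows from the permutation counts alone: 20 permutations in place of 1000 is a 50-fold reduction in the number of MDR runs.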

Monday, June 09, 2008

MDR Applications List Updated

I have updated the list of papers I know about that apply the MDR method to a genetic or epidemiologic study of a human disease or clinical endpoint. The list currently has 93 applied papers and can be found here in the May 29, 2006 entry of the Epistasis Blog.

You can carry out a PubMed search for MDR papers here.

Saturday, June 07, 2008

Contingency Table Measures and MDR

A new paper by Bush et al. in BMC Bioinformatics explores the use of different measures of contingency table patterns with MDR. We will consider whether some of these should be added to the open-source MDR software package.

Bush WS, Edwards TL, Dudek SM, McKinney BA, Ritchie MD. Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction. BMC Bioinformatics. 2008 May 16;9:238.

BACKGROUND: Multifactor Dimensionality Reduction (MDR) has been introduced previously as a non-parametric statistical method for detecting gene-gene interactions. MDR performs a dimensional reduction by assigning multi-locus genotypes to either high- or low-risk groups and measuring the percentage of cases and controls incorrectly labelled by this classification - the classification error. The combination of variables that produces the lowest classification error is selected as the best or most fit model. The correctly and incorrectly labelled cases and controls can be expressed as a two-way contingency table. We sought to improve the ability of MDR to detect gene-gene interactions by replacing classification error with a different measure to score model quality. RESULTS: In this study, we compare the detection and power of MDR using a variety of measures for two-way contingency table analysis. We simulated 40 genetic models, varying the number of disease loci in the model (2 - 5), allele frequencies of the disease loci (.2/.8 or .4/.6) and the broad-sense heritability of the model (.05 - .3). Overall, detection using NMI was 65.36% across all models, and specific detection was 59.4% versus detection using classification error at 62% and specific detection was 52.2%. CONCLUSION: Of the 10 measures evaluated, the likelihood ratio and normalized mutual information (NMI) are measures that consistently improve the detection and power of MDR in simulated data over using classification error. These measures also reduce the inclusion of spurious variables in a multi-locus model. Thus, MDR, which has already been demonstrated as a powerful tool for detecting gene-gene interactions, can be improved with the use of alternative fitness functions.
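
As a small illustration of what swapping the fitness measure means, the sketch below scores the same two-way contingency table with classification error and with one common form of normalized mutual information (mutual information divided by the entropy of the true case/control status). The normalization used by Bush et al. may differ, so treat this definition and the function names as assumptions.

```python
import math


def classification_error(tp, fp, fn, tn):
    """Fraction of cases and controls that the high/low-risk labels get wrong."""
    total = tp + fp + fn + tn
    return (fp + fn) / total


def normalized_mutual_information(tp, fp, fn, tn):
    """Mutual information between true status and predicted risk group,
    divided by the entropy of true status (an assumed normalization)."""
    # Rows: true status (case, control). Columns: predicted (high, low risk).
    table = [[tp, fn], [fp, tn]]
    total = tp + fp + fn + tn
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]

    mutual_info = 0.0
    for i in range(2):
        for j in range(2):
            n = table[i][j]
            if n > 0:
                mutual_info += (n / total) * math.log2(
                    n * total / (row_sums[i] * col_sums[j]))

    entropy_true = -sum((r / total) * math.log2(r / total)
                        for r in row_sums if r > 0)
    return mutual_info / entropy_true if entropy_true > 0 else 0.0


# Illustrative counts: 40 cases and 40 controls labelled correctly, 10 each wrong.
error = classification_error(40, 10, 10, 40)
nmi = normalized_mutual_information(40, 10, 10, 40)
```

Note the opposite directions: a better model has lower classification error but higher NMI, so a search that minimizes one would need to maximize the other.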

Monday, June 02, 2008

Exploiting the Proteome for the Genome-Wide Genetic Analysis of Epistasis

Our review paper on "Exploiting the Proteome for the Genome-Wide Genetic Analysis of Epistasis in Common Human Diseases" has been accepted for publication in Human Genetics. The corrected online version of the paper can be found here. This paper explores protein-protein interaction databases as a source of expert knowledge that can be used to help guide stochastic search algorithms such as genetic programming in their effort to detect epistasis in genome-wide association studies.

Pattin, K.A., Moore, J.H. Exploiting the Proteome for the Genome-Wide Genetic Analysis of Epistasis in Common Human Diseases. Human Genetics, in press (2008).


One of the central goals of human genetics is the identification of loci with alleles or genotypes that confer increased susceptibility. The availability of dense maps of single-nucleotide polymorphisms (SNPs) along with high-throughput genotyping technologies has set the stage for routine genome-wide association studies that are expected to significantly improve our ability to identify susceptibility loci. Before this promise can be realized, there are some significant challenges that need to be addressed. We address here the challenge of detecting epistasis or gene-gene interactions in genome-wide association studies. Discovering epistatic interactions in high dimensional datasets remains a challenge due to the computational complexity resulting from the analysis of all possible combinations of SNPs. One potential way to overcome the computational burden of a genome-wide epistasis analysis would be to devise a logical way to prioritize the many SNPs in a dataset so that the data may be analyzed more efficiently and yet still retain important biological information. One of the strongest demonstrations of the functional relationship between genes is protein-protein interaction. Thus, it is plausible that the expert knowledge extracted from protein interaction databases may allow for a more efficient analysis of genome-wide studies as well as facilitate the biological interpretation of the data. In this review we will discuss the challenges of detecting epistasis in genome-wide genetic studies and the means by which we propose to apply expert knowledge extracted from protein interaction databases to facilitate this process. We explore some of the fundamentals of protein interactions and the databases that are publicly available.
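
One very simple way to picture this kind of prioritization is to restrict the pairwise analysis to SNPs whose genes are listed as interacting in a protein-protein interaction database. The sketch below is only an illustration of that idea, not the approach proposed in the review. The function name, the SNP identifiers, and the gene names are all hypothetical placeholders.

```python
from itertools import combinations


def prioritized_snp_pairs(snp_to_gene, interactions):
    """Return SNP pairs whose (distinct) genes appear as an interacting pair."""
    interacting = {frozenset(pair) for pair in interactions}
    pairs = []
    for snp_a, snp_b in combinations(sorted(snp_to_gene), 2):
        genes = frozenset((snp_to_gene[snp_a], snp_to_gene[snp_b]))
        # Skip SNPs in the same gene and gene pairs with no known interaction.
        if len(genes) == 2 and genes in interacting:
            pairs.append((snp_a, snp_b))
    return pairs


# Hypothetical SNP-to-gene mapping and a single known protein interaction.
snp_to_gene = {"snp1": "GENE_A", "snp2": "GENE_B",
               "snp3": "GENE_C", "snp4": "GENE_A"}
known_interactions = [("GENE_A", "GENE_B")]
candidate_pairs = prioritized_snp_pairs(snp_to_gene, known_interactions)
```

This sketch still enumerates every pair before filtering, which would not scale to a genome-wide dataset. A practical version would build the candidate pairs directly from the interaction list.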