Epistasis Blog

From the Artificial Intelligence Innovation Lab at Cedars-Sinai Medical Center (www.epistasis.org)

Tuesday, April 24, 2007

The Ubiquitous Nature of Epistasis

Our 2003 paper in Human Heredity on the ubiquity of epistasis in determining susceptibility to common human diseases has exceeded 100 citations. This invited paper was written as part of a special issue that followed a statistical genetics workshop at the Mathematics Institute in Oberwolfach, Germany in February of 2003. The workshop was organized by Chris Amos, Max Baur and Helmut Schafer. Details about the workshop can be found here. A group photo can be found here. I am in the back middle of the photo. Note the snow. It snowed nonstop for the three days we were there.

Here is the paper. Email me for a pdf.

Moore JH. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered. 2003;56(1-3):73-82. [PubMed]

There is increasing awareness that epistasis or gene-gene interaction plays a role in susceptibility to common human diseases. In this paper, we formulate a working hypothesis that epistasis is a ubiquitous component of the genetic architecture of common human diseases and that complex interactions are more important than the independent main effects of any one susceptibility gene. This working hypothesis is based on several bodies of evidence. First, the idea that epistasis is important is not new. In fact, the recognition that deviations from Mendelian ratios are due to interactions between genes has been around for nearly 100 years. Second, the ubiquity of biomolecular interactions in gene regulation and biochemical and metabolic systems suggests that the relationship between DNA sequence variations and clinical endpoints is likely to involve gene-gene interactions. Third, positive results from studies of single polymorphisms typically do not replicate across independent samples. This is true for both linkage and association studies. Fourth, gene-gene interactions are commonly found when properly investigated. We review each of these points and then review an analytical strategy called multifactor dimensionality reduction for detecting epistasis. We end with ideas of how hypotheses about biological epistasis can be generated from statistical evidence using biochemical systems models. If this working hypothesis is true, it suggests that we need a research strategy for identifying common disease susceptibility genes that embraces, rather than ignores, the complexity of the genotype to phenotype relationship.

Sunday, April 22, 2007

MDR Analysis in Imbalanced Datasets

Our paper on modifications to MDR for imbalanced datasets has been published in Genetic Epidemiology.

Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Williams SM, Moore JH. A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet Epidemiol. 2007 May;31(4):306-15. [PubMed]

Multifactor dimensionality reduction (MDR) was developed as a method for detecting statistical patterns of epistasis. The overall goal of MDR is to change the representation space of the data to make interactions easier to detect. It is well known that machine learning methods may not provide robust models when the class variable (e.g. case-control status) is imbalanced and accuracy is used as the fitness measure. This is because most methods learn patterns that are relevant for the larger of the two classes. The goal of this study was to evaluate three different strategies for improving the power of MDR to detect epistasis in imbalanced datasets. The methods evaluated were: (1) over-sampling that resamples the smaller class with replacement until the data are balanced, (2) under-sampling that randomly removes subjects from the larger class until the data are balanced, and (3) balanced accuracy [(sensitivity+specificity)/2] as the fitness function with and without an adjusted threshold. These three methods were compared using simulated data with two-locus epistatic interactions of varying heritability (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4) and minor allele frequency (0.2, 0.4) that were embedded in 100 replicate datasets of varying sample sizes (400, 800, 1600). Each dataset was generated with different ratios of cases to controls (1:1, 1:2, 1:4). We found that the balanced accuracy function with an adjusted threshold significantly outperformed both over-sampling and under-sampling and fully recovered the power. These results suggest that balanced accuracy should be used instead of accuracy for the MDR analysis of epistasis in imbalanced datasets.
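
For readers who want to see why plain accuracy misleads at a 1:4 ratio, here is a minimal Python sketch. It is not the code from the paper or from the MDR software. The function names, the toy counts, and in particular the definition of the "adjusted threshold" as the overall ratio of cases to controls in the dataset are assumptions made for illustration, and the sketch scores the model on the same data it was built from, with no cross-validation.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of subjects classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """(sensitivity + specificity) / 2, with 1 = case and 0 = control."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def mdr_style_predict(genotypes, y_true, threshold):
    """Label a genotype cell high-risk (1) when cases/controls > threshold.

    Hypothetical helper: a simplified stand-in for the MDR pooling step.
    """
    cases = Counter(g for g, t in zip(genotypes, y_true) if t == 1)
    controls = Counter(g for g, t in zip(genotypes, y_true) if t == 0)
    # Compare cases > threshold * controls to avoid dividing by zero.
    return [1 if cases[g] > threshold * controls[g] else 0 for g in genotypes]


# Toy dataset (made up): 20 cases and 80 controls in two genotype cells.
cell_a, cell_b = ("AA", "Bb"), ("AA", "BB")
genotypes = [cell_a] * 35 + [cell_b] * 65
status = [1] * 15 + [0] * 20 + [1] * 5 + [0] * 60

# Threshold of 1 (suited to balanced data) vs. the assumed adjusted threshold.
adjusted = sum(status) / (len(status) - sum(status))  # 20 / 80
pred_fixed = mdr_style_predict(genotypes, status, threshold=1.0)
pred_adjusted = mdr_style_predict(genotypes, status, threshold=adjusted)

for name, pred in [("fixed", pred_fixed), ("adjusted", pred_adjusted)]:
    print(name, accuracy(status, pred), balanced_accuracy(status, pred))
```

In this toy example the fixed threshold calls every subject a control, which scores well on accuracy but no better than chance on balanced accuracy. The adjusted threshold flags the cell enriched for cases, which lowers accuracy slightly but raises balanced accuracy.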