Epistasis Blog

From the Artificial Intelligence Innovation Lab at Cedars-Sinai Medical Center (www.epistasis.org)

Friday, September 30, 2011

Gene-environment interaction in psychiatric research

This is a nice new critical review exploring issues of power and replication for detecting gene-environment interactions in psychiatric genetics. This paper makes a number of very nice points and is worth reading. A few things to keep in mind. First, the power issues discussed make the important assumption that the effect size of gene-environment interactions will be as small as or smaller than that of main effects. This may not be true in some circumstances. Second, I don't expect gene-environment interactions, or most genetic effects for that matter, to replicate under a complex systems model. There are many good biological reasons why an interaction effect detected in one population might not replicate in a second independent population. See our 2009 PLoS One paper by Greene et al. for an explanation. Also, I find their distinction of candidate gene-environment interaction (cG×E) to be a bit strange.

Duncan LE, Keller MC. A Critical Review of the First 10 Years of Candidate Gene-by-Environment Interaction Research in Psychiatry. Am J Psychiatry, in press (2011). [PubMed]


Objective: Gene-by-environment interaction (G×E) studies in psychiatry have typically been conducted using a candidate G×E (cG×E) approach, analogous to the candidate gene association approach used to test genetic main effects. Such cG×E research has received widespread attention and acclaim, yet cG×E findings remain controversial. The authors examined whether the many positive cG×E findings reported in the psychiatric literature were robust or if, in aggregate, cG×E findings were consistent with the existence of publication bias, low statistical power, and a high false discovery rate. Method: The authors conducted analyses on data extracted from all published studies (103 studies) from the first decade (2000-2009) of cG×E research in psychiatry. Results: Ninety-six percent of novel cG×E studies were significant compared with 27% of replication attempts. These findings are consistent with the existence of publication bias among novel cG×E studies, making cG×E hypotheses appear more robust than they actually are. There also appears to be publication bias among replication attempts because positive replication attempts had smaller average sample sizes than negative ones. Power calculations using observed sample sizes suggest that cG×E studies are underpowered. Low power along with the likely low prior probability of a given cG×E hypothesis being true suggests that most or even all positive cG×E findings represent type I errors. Conclusions: In this new era of big data and small effects, a recalibration of views about groundbreaking findings is necessary. Well-powered direct replications deserve more attention than novel cG×E findings and indirect replications.
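
The abstract's argument that low power plus a low prior probability makes most positive findings type I errors can be checked with the standard positive predictive value calculation. This is only a minimal sketch: the priors, the power values, and the 0.05 significance level are illustrative assumptions, not numbers taken from Duncan and Keller.

```python
def positive_predictive_value(prior, power, alpha=0.05):
    """Probability that a statistically significant finding is a true positive.

    prior: assumed probability that the tested hypothesis is true
    power: probability of detecting the effect when it is real
    alpha: type I error rate of the test
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Illustrative scenarios only (hypothetical priors and power values).
scenarios = [(0.01, 0.10), (0.01, 0.80), (0.10, 0.80)]
for prior, power in scenarios:
    ppv = positive_predictive_value(prior, power)
    print(f"prior={prior:.2f} power={power:.2f} PPV={ppv:.3f}")
```

Under these assumed values, a 1% prior with 10% power leaves only about 2% of significant results as true positives, and even 80% power with the same prior leaves about 14%. Raising the prior matters at least as much as raising the power.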

Thursday, September 15, 2011

Characterizing Genetic Interactions in Human Disease Association Studies Using Statistical Epistasis Networks

Our paper on using network science to study the genetic architecture of disease susceptibility has been published.

Hu T, Sinnott-Armstrong NA, Kiralis JW, Andrew AS, Karagas MR, Moore JH. Characterizing Genetic Interactions in Human Disease Association Studies Using Statistical Epistasis Networks. BMC Bioinformatics. 2011 Sep 12;12(1):364. [BMC]


Background: Epistasis is recognized as ubiquitous in the genetic architecture of complex traits such as disease susceptibility. Experimental studies in model organisms have revealed extensive evidence of biological interactions among genes. Meanwhile, statistical and computational studies in human populations have suggested non-additive effects of genetic variation on complex traits. Although these studies form a baseline for understanding the genetic architecture of complex traits, to date they have only considered interactions among a small number of genetic variants. Our goal here is to use network science to determine the extent to which non-additive interactions exist beyond small subsets of genetic variants. We infer statistical epistasis networks to characterize the global space of pairwise interactions among approximately 1500 Single Nucleotide Polymorphisms (SNPs) spanning nearly 500 cancer susceptibility genes in a large population-based study of bladder cancer.

Results: The statistical epistasis network was built by linking pairs of SNPs if their pairwise interactions were stronger than a systematically derived threshold. Its topology clearly differentiated this real-data network from networks obtained from permutations of the same data under the null hypothesis that no association exists between genotype and phenotype. The network had a significantly higher number of hub SNPs and, interestingly, these hub SNPs were not necessarily with high main effects. The network had a largest connected component of 39 SNPs that was absent in any other permuted-data networks. In addition, the vertex degrees of this network were distinctively found following an approximate power-law distribution and its topology appeared scale-free.

Conclusions: In contrast to many existing techniques focusing on high main-effect SNPs or models of several interacting SNPs, our network approach characterized a global picture of gene-gene interactions in a population-based genetic data. The network was built using pairwise interactions, and its distinctive network topology and large connected components indicated joint effects in a large set of SNPs. Our observations suggested that this particular statistical epistasis network captured important features of the genetic architecture of bladder cancer that have not been described previously.
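
For readers who want a feel for the construction described in the Results, here is a rough sketch of the thresholding step and the two topology summaries mentioned (vertex degree and largest connected component). The SNP names, interaction scores, and threshold are made up for illustration, and the function names are hypothetical. This is not the code used in the paper, which derives its threshold systematically and compares against permuted data.

```python
from collections import deque


def build_network(pair_scores, threshold):
    """Link two SNPs when their pairwise interaction score exceeds the threshold."""
    adjacency = {}
    for (snp_a, snp_b), score in pair_scores.items():
        if score > threshold:
            adjacency.setdefault(snp_a, set()).add(snp_b)
            adjacency.setdefault(snp_b, set()).add(snp_a)
    return adjacency


def largest_component(adjacency):
    """Return the largest connected set of SNPs, found by breadth-first search."""
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical interaction scores for six made-up SNPs.
scores = {
    ("rs1", "rs2"): 0.030, ("rs2", "rs3"): 0.025, ("rs2", "rs4"): 0.021,
    ("rs5", "rs6"): 0.022, ("rs1", "rs5"): 0.004, ("rs3", "rs6"): 0.002,
}
network = build_network(scores, threshold=0.02)
degrees = {snp: len(neighbors) for snp, neighbors in network.items()}
component = largest_component(network)
print(sorted(degrees.items()), sorted(component))
```

In this toy example rs2 is the hub (degree 3) and sits in the largest component. The same two summaries, recomputed on networks built from phenotype-permuted data, would give the null distribution to compare against.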


Tuesday, September 13, 2011

HyperCube Rule Mining

This looks like a neat rule-based machine learning method for association studies. Let me know if you try it.

Loucoubar C, Paul R, Bar-Hen A, Huret A, Tall A, et al. (2011) An Exhaustive, Non-Euclidean, Non-Parametric Data Mining Tool for Unraveling the Complexity of Biological Systems – Novel Insights into Malaria. PLoS ONE 6(9): e24085. [PLoS]


Complex, high-dimensional data sets pose significant analytical challenges in the post-genomic era. Such data sets are not exclusive to genetic analyses and are also pertinent to epidemiology. There has been considerable effort to develop hypothesis-free data mining and machine learning methodologies. However, current methodologies lack exhaustivity and general applicability. Here we use a novel non-parametric, non-euclidean data mining tool, HyperCubeH, to explore exhaustively a complex epidemiological malaria data set by searching for over density of events in m-dimensional space. Hotspots of over density correspond to strings of variables, rules, that determine, in this case, the occurrence of Plasmodium falciparum clinical malaria episodes. The data set contained 46,837 outcome events from 1,653 individuals and 34 explanatory variables. The best predictive rule contained 1,689 events from 148 individuals and was defined as: individuals present during 1992–2003, aged 1–5 years old, having hemoglobin AA, and having had previous Plasmodium malariae malaria parasite infection ≤10 times. These individuals had 3.71 times more P. falciparum clinical malaria episodes than the general population. We validated the rule in two different cohorts. We compared and contrasted the HyperCubeH rule with the rules using variables identified by both traditional statistical methods and non-parametric regression tree methods. In addition, we tried all possible sub-stratified quantitative variables. No other model with equal or greater representativity gave a higher Relative Risk. Although three of the four variables in the rule were intuitive, the effect of number of P. malariae episodes was not. HyperCubeH efficiently sub-stratified quantitative variables to optimize the rule and was able to identify interactions among the variables, tasks not easy to perform using standard data mining methods. Search of local over density in m-dimensional space, explained by easily interpretable rules, is thus seemingly ideal for generating hypotheses for large datasets to unravel the complexity inherent in biological systems.

Monday, September 05, 2011

An R Package Implementation of Multifactor Dimensionality Reduction

A new R package for our Multifactor Dimensionality Reduction (MDR) method is available.

Winham SJ, Motsinger-Reif AA. An R Package Implementation of Multifactor Dimensionality Reduction. BioData Min. 2011 Aug 16;4(1):24. [PubMed]



A breadth of high-dimensional data is now available with unprecedented numbers of genetic markers and data-mining approaches to variable selection are increasingly being utilized to uncover associations, including potential gene-gene and gene-environment interactions. One of the most commonly used data-mining methods for case-control data is Multifactor Dimensionality Reduction (MDR), which has displayed success in both simulations and real data applications. Additional software applications in alternative programming languages can improve the availability and usefulness of the method for a broader range of users.


We introduce a package for the R statistical language to implement the Multifactor Dimensionality Reduction (MDR) method for nonparametric variable selection of interactions. This package is designed to provide an alternative implementation for R users, with great flexibility and utility for both data analysis and research. The 'MDR' package is freely available online at http://www.r-project.org/. We also provide data examples to illustrate the use and functionality of the package.


MDR is a frequently-used data-mining method to identify potential gene-gene interactions, and alternative implementations will further increase this usage. We introduce a flexible software package for R users.
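
For anyone new to MDR, the core idea can be sketched in a few lines: pool each multilocus genotype combination into a "high-risk" or "low-risk" group according to its case-to-control ratio, then score how well that one-dimensional grouping separates cases from controls. The sketch below is written in Python for illustration and does not use or mirror the R package's interface. The function names and the toy data are hypothetical, and it reports training accuracy only, whereas a real MDR analysis would use cross-validation.

```python
from collections import Counter
from itertools import combinations


def cell_of(row, snps):
    """Multilocus genotype of one individual at the chosen SNP columns."""
    return tuple(row[s] for s in snps)


def high_risk_cells(genotypes, status, snps):
    """Genotype cells whose case:control ratio exceeds the overall ratio.

    Assumes the data contain at least one case and one control.
    """
    cases, controls = Counter(), Counter()
    for row, is_case in zip(genotypes, status):
        (cases if is_case else controls)[cell_of(row, snps)] += 1
    n_cases = sum(status)
    threshold = n_cases / (len(status) - n_cases)
    cells = set(cases) | set(controls)
    return {c for c in cells if cases[c] > threshold * controls[c]}


def balanced_accuracy(genotypes, status, snps, high_risk):
    """Mean of sensitivity and specificity for the high/low-risk classifier."""
    hits = Counter()
    for row, is_case in zip(genotypes, status):
        hits[(bool(is_case), cell_of(row, snps) in high_risk)] += 1
    sensitivity = hits[(True, True)] / (hits[(True, True)] + hits[(True, False)])
    specificity = hits[(False, False)] / (hits[(False, False)] + hits[(False, True)])
    return (sensitivity + specificity) / 2


def best_model(genotypes, status, order):
    """Exhaustively score every SNP combination of the given order."""
    scored = []
    for snps in combinations(range(len(genotypes[0])), order):
        cells = high_risk_cells(genotypes, status, snps)
        scored.append((balanced_accuracy(genotypes, status, snps, cells), snps))
    return max(scored)


# Toy data: disease status depends on SNPs 0 and 1 jointly; SNP 2 is noise.
genotypes = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1),
             (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1)]
status = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = case, 0 = control
print(best_model(genotypes, status, order=2))
```

In the toy data neither SNP 0 nor SNP 1 has a main effect, yet the pair classifies every individual correctly, which is exactly the kind of pattern MDR was designed to pick up.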

Thursday, September 01, 2011

The 24/7 Lab - Does Creativity Suffer?

There was an interesting piece in Nature recently about Dr. Quiñones-Hinojosa and his promotion of the 24/7 lab. He selects people for his lab that he can motivate to work around the clock. He claims that this intense work ethic yields results. Here is a quote:

>>>Quiñones-Hinojosa credits his professional rise to his resilience and a seemingly limitless capacity for hard work. "When you go that extra step, you are training your brain like an athlete," he says. And the fact that his group has published 113 articles in the past six years and holds 13 funding grants is not, he says, because he is brighter or better connected than colleagues. "It's just a matter of volume," he says. "The key is we submit a couple of dozen grant applications a year, and we learn from our mistakes."<<<

I certainly credit my own success to hard work and long hours. However, I have a very different approach to running my lab. I believe that successful research is about much more than productivity. Productivity must not be achieved at the expense of creativity. I am willing to bet that the intense pressure that Quiñones-Hinojosa inflicts on his staff and students stifles their ability to think creatively. Instead of trying some crazy new idea, they are intensely focused on getting the next experiment done so Quiñones-Hinojosa doesn't think they are slacking. I firmly believe that hard work must be balanced with fun and time for creative thought. The role of the PI is to set a good example by working hard, but at the same time to establish a relaxed work environment where innovation can flourish and staff and students aren't afraid to try new things. I have always liked the Google work philosophy and programs such as their 'day to play'. My experience has been that good staff and students work harder naturally when they are allowed to express themselves creatively. Some of our best work has come from people in my lab trying crazy ideas.

Here is a followup comment by Overbaugh posted in Nature.