Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Friday, March 29, 2013

Four tips for success in graduate school and beyond

The following are my words of wisdom for succeeding in graduate school and beyond. They are based on my own experience building a career in biomedical research and on training more than a dozen graduate students in my research lab.

1) Find your passion.

The most important predictor of success is finding the subject area that you are most passionate about. Graduate school requires total immersion and lots of hard work. The graduate students who struggle most with the demands of graduate school tend to be the ones who are either there for the wrong reasons (e.g. didn't want to get a job, couldn't get into medical school, parents applied pressure) or the ones who end up working in a lab with an uninteresting research area. The bottom line is that you need to find a research area that excites you and gets you out of bed day in and day out.

2) Don't be afraid to fail.

This is so important for your entire career. Failure is part of learning and part of doing cutting-edge research. You need to be able to fail, learn from the experience, brush it off, put it behind you and move on to the next challenge. All successful people fail over and over. It is a healthy part of the process. I have seen a number of students and faculty over the years who were afraid to fail, and they often became paralyzed. I have had great success with research funding from the NIH. However, this is not because every one of my grants succeeds. It is because I submit lots of grants, many of which are never funded.

3) Seek out mentors.

I owe a lot of my success to the wonderful mentors I have had at every step of my career, going all the way back to high school. You can't succeed without help, no matter how smart you are. Faculty life, research and research funding are challenging in so many different ways. It is impossible to know the ins and outs of everything you need to do, from navigating the politics of faculty life to the obscure details of how the NIH and its many institutes operate. Someone with experience has to be there to help you and give advice. For students, it is important to know that people will be much more willing to help you and devote time to you if they see your passion, hard work and dedication.

4) Work your ass off.

The final tip is that there is no substitute for hard work. It is not enough to be smart. I have seen many very smart graduate students and faculty fail or not reach their full potential due to lack of hard work. I personally believe that success in graduate school, postdoctoral research and faculty life requires total immersion. You need to live and breathe your research. This is not as hard as it sounds if you have found your passion. People who are passionate about what they do don't see it as work. It is much easier to do the things you don't enjoy doing if the payoff means you can do more of what you love. Also, as noted above, mentors are more likely to go to bat for you if they see you going the extra mile.

Thursday, March 21, 2013

Gene-based testing of interactions in association studies of quantitative traits

Great new paper with a nice approach to detecting epistasis.

Ma L, Clark AG, Keinan A. Gene-based testing of interactions in association studies of quantitative traits. PLoS Genet. 2013 Feb;9(2):e1003321. [PubMed]


Various methods have been developed for identifying gene-gene interactions in genome-wide association studies (GWAS). However, most methods focus on individual markers as the testing unit, and the large number of such tests drastically erodes statistical power. In this study, we propose novel interaction tests of quantitative traits that are gene-based and that confer advantage in both statistical power and biological interpretation. The framework of gene-based gene-gene interaction (GGG) tests combines marker-based interaction tests between all pairs of markers in two genes to produce a gene-level test for interaction between the two. The tests are based on an analytical formula we derive for the correlation between marker-based interaction tests due to linkage disequilibrium. We propose four GGG tests that extend the following value combining methods: minimum value, extended Simes procedure, truncated tail strength, and truncated value product. Extensive simulations point to correct type I error rates of all tests and show that the two truncated tests are more powerful than the other tests in cases of markers involved in the underlying interaction not being directly genotyped and in cases of multiple underlying interactions. We applied our tests to pairs of genes that exhibit a protein-protein interaction to test for gene-level interactions underlying lipid levels using genotype data from the Atherosclerosis Risk in Communities study. We identified five novel interactions that are not evident from marker-based interaction testing and successfully replicated one of these interactions, between and , in an independent sample from the Multi-Ethnic Study of Atherosclerosis. We conclude that our GGG tests show improved power to identify gene-level interactions in existing, as well as emerging, association studies.
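
To make the idea of "value combining" concrete, here is a minimal sketch of three of the combining statistics named in the abstract (minimum value, Simes, truncated product) applied to a list of marker-pair p-values. It assumes the tests are independent, which the paper explicitly does not: the GGG tests correct for correlation due to linkage disequilibrium, and that correction is not reproduced here. The function names, the threshold `tau` and the example p-values are hypothetical.

```python
import math

def min_p(pvals):
    """Smallest p-value among all marker-pair tests (unadjusted statistic)."""
    return min(pvals)

def simes_p(pvals):
    """Plain Simes combined p-value: min over i of n * p_(i) / i (sorted p-values)."""
    n = len(pvals)
    ranked = sorted(pvals)
    return min(n * p / (i + 1) for i, p in enumerate(ranked))

def truncated_product_stat(pvals, tau=0.05):
    """Truncated product statistic: product of the p-values at or below tau.
    Returns 1.0 when no p-value passes the threshold."""
    return math.prod(p for p in pvals if p <= tau)

# Hypothetical p-values for the marker pairs between two genes.
pair_pvals = [0.001, 0.02, 0.3, 0.8]
print(min_p(pair_pvals), simes_p(pair_pvals), truncated_product_stat(pair_pvals))
```

The minimum and truncated product values are statistics rather than p-values; turning them into gene-level p-values requires a null distribution, which is where the correlation formula derived in the paper comes in.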

Saturday, March 16, 2013

Alternative definitions of epistasis

This is a classic paper on epistasis from an evolutionary point of view. It is one of the few papers that discuss the relationship between linkage disequilibrium and epistasis.

Michael J. Wade, R. G. Winther, A. F. Agrawal, C. J. Goodnight. 2001. Alternative definitions of epistasis: dependence and interaction. Trends in Ecology & Evolution 16: 498-504 [PDF]


Although epistasis is at the center of the Fisher-Wright debate, biologists not involved in the controversy are often unaware that there are actually two different formal definitions of epistasis. We compare concepts of genetic independence in the two theoretical traditions of evolutionary genetics, population genetics and quantitative genetics, and show how independence of gene action (represented by the multiplicative model of population genetics) can be different from the absence of gene interaction (represented by the linear additive model of quantitative genetics). The two formulations converge with weak selection but not with strong selection or, for multiple loci, when the aggregated interaction terms are not negligible. As a result of the different formulations of gene interaction, the presence or absence of linkage disequilibrium does not necessarily indicate the presence or absence of fitness epistasis. Indeed, linkage disequilibrium is generated in ‘additive’ models in quantitative genetics whenever two (or more) loci experience simultaneous selection. As a research strategy, it is often practical, for theoretical or experimental reasons, to minimize gene interaction by assuming independence of gene action in regard to fitness, or by assuming linear additive effects of multiple loci on a phenotype. However, minimizing the role of epistasis in theoretical investigations hinders our understanding of the origins of diversity and the evolution of complex phenotypes.
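
The difference between the two definitions can be seen with a little arithmetic. The sketch below assumes a simple haploid two-locus model (my simplification, not taken from the paper) in which the loci act independently on a multiplicative fitness scale, and then measures the interaction term that a linear additive model would report. The function name and the selection coefficients are hypothetical.

```python
def additive_scale_interaction(s, t):
    """Interaction seen by an additive model when gene action is multiplicative.

    Fitnesses: w_ab = 1, w_Ab = 1 + s, w_aB = 1 + t, w_AB = (1 + s) * (1 + t).
    Additive-scale epistasis is w_AB - w_Ab - w_aB + w_ab, which equals s * t.
    """
    w_ab = 1.0
    w_Ab = 1.0 + s
    w_aB = 1.0 + t
    w_AB = (1.0 + s) * (1.0 + t)
    return w_AB - w_Ab - w_aB + w_ab

# Weak versus strong selection at both loci.
for s in (0.01, 0.5):
    print(s, additive_scale_interaction(s, s))
```

Because the interaction term is the product s * t, it is about 0.0001 when both coefficients are 0.01 and 0.25 when both are 0.5. This is the sense in which the two formulations converge under weak selection but not under strong selection.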

Wednesday, March 13, 2013

Role of genetic heterogeneity and epistasis in bladder cancer susceptibility and outcome: a learning classifier system approach

This open-access paper reports our successful application of learning classifier systems (LCS) to the genetic analysis of bladder cancer. This method is promising because it can detect both epistasis and genetic or locus heterogeneity. First read our comprehensive review of LCS. We also have a more recent overview of our own LCS approach in Computational Intelligence Magazine.

Urbanowicz RJ, Andrew AS, Karagas MR, Moore JH. Role of genetic heterogeneity and epistasis in bladder cancer susceptibility and outcome: a learning classifier system approach. J Am Med Inform Assoc 20, 603-612 (2013). [PubMed] [PDF]


BACKGROUND AND OBJECTIVE: Detecting complex patterns of association between genetic or environmental risk factors and disease risk has become an important target for epidemiological research. In particular, strategies that provide multifactor interactions or heterogeneous patterns of association can offer new insights into association studies for which traditional analytic tools have had limited success.

MATERIALS AND METHODS: To concurrently examine these phenomena, previous work has successfully considered the application of learning classifier systems (LCSs), a flexible class of evolutionary algorithms that distributes learned associations over a population of rules. Subsequent work dealt with the inherent problems of knowledge discovery and interpretation within these algorithms, allowing for the characterization of heterogeneous patterns of association. Whereas these previous advancements were evaluated using complex simulation studies, this study applied these collective works to a 'real-world' genetic epidemiology study of bladder cancer susceptibility.

RESULTS AND DISCUSSION: We replicated the identification of previously characterized factors that modify bladder cancer risk, namely single nucleotide polymorphisms from a DNA repair gene and smoking. Furthermore, we identified potentially heterogeneous groups of subjects characterized by distinct patterns of association. Cox proportional hazard models comparing clinical outcome variables between the cases of the two largest groups yielded a significant, meaningful difference in survival time in years (survivorship). A marginally significant difference in recurrence time was also noted. These results support the hypothesis that an LCS approach can offer greater insight into complex patterns of association.

CONCLUSIONS: This methodology appears to be well suited to the dissection of disease heterogeneity, a key component in the advancement of personalized medicine.

Tuesday, March 12, 2013

Multifactor dimensionality reduction reveals a three-locus epistatic interaction associated with susceptibility to pulmonary tuberculosis

In this brief paper we show evidence for a three-way epistatic interaction among SNPs from several excellent candidate genes for tuberculosis. This is an example of how three-way interactions can be detected and characterized in genetic association data.

Collins RL, Hu T, Wejse C, Sirugo G, Williams SM, Moore JH. Multifactor dimensionality reduction reveals a three-locus epistatic interaction associated with susceptibility to pulmonary tuberculosis. BioData Min. 2013 Feb 18;6(1):4. [PubMed]


BACKGROUND: Identifying high-order genetic associations with non-additive (i.e. epistatic) effects in population-based studies of common human diseases is a computational challenge. Multifactor dimensionality reduction (MDR) is a machine learning method that was designed specifically for this problem. The goal of the present study was to apply MDR to mining high-order epistatic interactions in a population-based genetic study of tuberculosis (TB).

RESULTS: The study used a previously published data set consisting of 19 candidate single-nucleotide polymorphisms (SNPs) in 321 pulmonary TB cases and 347 healthy controls from Guinea-Bissau in Africa. The ReliefF algorithm was applied first to generate a smaller set of the five most informative SNPs. MDR with 10-fold cross-validation was then applied to look at all possible combinations of two, three, four and five SNPs. The MDR model with the best testing accuracy (TA) consisted of SNPs rs2305619, rs187084, and rs11465421 (TA = 0.588) in PTX3, TLR9 and DC-sign, respectively. A general 1000-fold permutation test of the null hypothesis of no association confirmed the statistical significance of the model (p = 0.008). An additional 1000-fold permutation test designed specifically to test the linear null hypothesis that the association effects are only additive confirmed the presence of non-additive (i.e. nonlinear) or epistatic effects (p = 0.013). An independent information-gain measure corroborated these results with a third-order epistatic interaction that was stronger than any lower-order associations.

CONCLUSIONS: We have identified statistically significant evidence for a three-way epistatic interaction that is associated with susceptibility to TB. This interaction is stronger than any previously described one-way or two-way associations. This study highlights the importance of using machine learning methods that are designed to embrace, rather than ignore, the complexity of common diseases such as TB. We recommend future studies of the genetics of TB take into account the possibility that high-order epistatic interactions might play an important role in disease susceptibility.
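
For readers new to MDR, the core step can be sketched in a few lines: pool each multilocus genotype combination into a high-risk or low-risk group according to its case:control ratio, then score the resulting one-dimensional classifier. This is a simplified illustration and not our MDR software. It omits the ReliefF filter, the 10-fold cross-validation and the permutation tests used in the paper, and the function names and toy data are hypothetical.

```python
from collections import Counter

def mdr_fit(genotypes, labels, snps, threshold=1.0):
    """Label each multilocus genotype cell high-risk (1) or low-risk (0).

    genotypes: list of tuples, one genotype code per SNP for each subject
    labels:    list of 1 (case) or 0 (control)
    snps:      indices of the SNPs that form the candidate model
    A cell is high-risk when its cases exceed threshold * controls.
    """
    cases, controls = Counter(), Counter()
    for g, y in zip(genotypes, labels):
        cell = tuple(g[i] for i in snps)
        if y == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1
    return {cell: int(cases[cell] > threshold * controls[cell])
            for cell in set(cases) | set(controls)}

def balanced_accuracy(model, genotypes, labels, snps):
    """Mean of sensitivity and specificity; unseen cells default to low-risk.
    Assumes the data contain at least one case and one control."""
    tp = tn = fp = fn = 0
    for g, y in zip(genotypes, labels):
        pred = model.get(tuple(g[i] for i in snps), 0)
        if y == 1:
            tp += pred
            fn += 1 - pred
        else:
            fp += pred
            tn += 1 - pred
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

# Toy data: disease status is the XOR of SNP 0 and SNP 1; SNP 2 is noise.
toy_genotypes = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)] * 5
toy_labels = [0, 1, 1, 0] * 5
for snps in [(0,), (0, 1)]:
    model = mdr_fit(toy_genotypes, toy_labels, snps)
    print(snps, balanced_accuracy(model, toy_genotypes, toy_labels, snps))
```

In the toy data neither SNP predicts disease on its own, but the two-SNP model classifies every subject correctly. The accuracy here is computed on the same data used to fit the model, so it is a training accuracy; the testing accuracy reported in the paper comes from held-out cross-validation folds.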

Monday, March 11, 2013

ViSEN: Methodology and software for visualization of statistical epistasis networks

This is a short paper describing our new ViSEN software. This builds on work from the previous few posts on detecting three-way epistasis. Our ViSEN software package is freely available from SourceForge.

Hu T, Chen Y, Kiralis JW, Moore JH. ViSEN: Methodology and Software for Visualization of Statistical Epistasis Networks. Genet Epidemiol., in press 2013 [PubMed]


The nonlinear interaction effect among multiple genetic factors, i.e. epistasis, has been recognized as a key component in understanding the underlying genetic basis of complex human diseases and phenotypic traits. Due to the statistical and computational complexity, most epistasis studies are limited to interactions with an order of two. We developed ViSEN to analyze and visualize epistatic interactions of both two-way and three-way. ViSEN not only identifies strong interactions among pairs or trios of genetic attributes, but also provides a global interaction map that shows neighborhood and clustering structures. This visualized information could be very helpful to infer the underlying genetic architecture of complex diseases and to generate plausible hypotheses for further biological validations. ViSEN is implemented in Java and freely available at https://sourceforge.net/projects/visen/.

Sunday, March 10, 2013

Statistical epistasis networks reduce the computational complexity of searching three-locus genetic models

In this paper we show how building a network based on pairwise epistatic relationships can reduce the computational complexity of searching for three-locus interactions. This was presented by my postdoc, Ting Hu, at the 2013 Pacific Symposium on Biocomputing.

Hu T, Andrew AS, Karagas MR, Moore JH. Statistical epistasis networks reduce the computational complexity of searching three-locus genetic models. Pac Symp Biocomput. 2013:397-408. [PubMed]


The rapid development of sequencing technologies makes thousands to millions of genetic attributes available for testing associations with various biological traits. Searching this enormous high-dimensional data space imposes a great computational challenge in genome-wide association studies. We introduce a network-based approach to supervise the search for three-locus models of disease susceptibility. Such statistical epistasis networks (SEN) are built using strong pairwise epistatic interactions and provide a global interaction map to search for higher-order interactions by prioritizing genetic attributes clustered together in the networks. Applying this approach to a population-based bladder cancer dataset, we found a high susceptibility three-way model of genetic variations in DNA repair and immune regulation pathways, which holds great potential for studying the etiology of bladder cancer with further biological validations. We demonstrate that our SEN-supervised search is able to find a small subset of three-locus models with significantly high associations at a substantially reduced computational cost.
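
The general idea of a network-supervised search can be sketched as follows: keep only the strong pairwise interactions as edges, then restrict the three-locus search to trios that are connected in the resulting network instead of enumerating every possible trio. The connected-trio rule, the cutoff and the scores below are assumptions made for illustration; the paper's actual prioritization of clustered attributes is not reproduced here.

```python
from itertools import combinations

def build_network(pair_scores, cutoff):
    """Adjacency sets that keep only pairs whose interaction score >= cutoff."""
    adj = {}
    for (a, b), score in pair_scores.items():
        if score >= cutoff:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj

def candidate_trios(adj):
    """Trios in which one attribute is linked to both of the others."""
    trios = set()
    for center, neighbors in adj.items():
        for u, v in combinations(sorted(neighbors), 2):
            trios.add(tuple(sorted((center, u, v))))
    return trios

# Hypothetical pairwise interaction scores among five SNPs.
pair_scores = {
    ("A", "B"): 0.030, ("B", "C"): 0.025, ("A", "D"): 0.001,
    ("C", "D"): 0.002, ("D", "E"): 0.040,
}
network = build_network(pair_scores, cutoff=0.02)
print(candidate_trios(network))
```

With five SNPs an exhaustive search would evaluate all ten possible trios, while the network in this toy example leaves a single candidate trio to test. The saving grows quickly as the number of attributes increases, at the cost of missing any three-way effect whose members show no strong pairwise interaction.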

Saturday, March 09, 2013

An information-gain approach to detecting three-way epistatic interactions in genetic association studies

We present in this paper a new method for estimating three-way epistatic interactions in genetic association studies.

Hu T, Chen Y, Kiralis JW, Collins RL, Wejse C, Sirugo G, Williams SM, Moore JH. An information-gain approach to detecting three-way epistatic interactions in genetic association studies. J Am Med Inform Assoc. 2013 Feb 18. [PubMed]

BACKGROUND: Epistasis has been historically used to describe the phenomenon that the effect of a given gene on a phenotype can be dependent on one or more other genes, and is an essential element for understanding the association between genetic and phenotypic variations. Quantifying epistasis of orders higher than two is very challenging due to both the computational complexity of enumerating all possible combinations in genome-wide data and the lack of efficient and effective methodologies.

OBJECTIVES: In this study, we propose a fast, non-parametric, and model-free measure for three-way epistasis.

METHODS: Such a measure is based on information gain, and is able to separate all lower order effects from pure three-way epistasis.

RESULTS: Our method was verified on synthetic data and applied to real data from a candidate-gene study of tuberculosis in a West African population. In the tuberculosis data, we found a statistically significant pure three-way epistatic interaction effect that was stronger than any lower-order associations.

CONCLUSION: Our study provides a methodological basis for detecting and characterizing high-order gene-gene interactions in genetic association studies.
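
As a rough illustration of what "separating lower order effects" means, the sketch below computes a three-way information gain by taking the mutual information between three attributes jointly and the phenotype, then subtracting the three main effects and the three pairwise gains. This is one standard decomposition written with plug-in entropy estimates; the exact estimator and any significance testing in the paper may differ, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Plug-in Shannon entropy in bits of a list of hashable values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def mutual_info(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def ig2(a, b, d):
    """Pairwise information gain: I(A,B;D) - I(A;D) - I(B;D)."""
    return mutual_info(list(zip(a, b)), d) - mutual_info(a, d) - mutual_info(b, d)

def ig3(a, b, c, d):
    """Three-way gain: joint information minus pairwise gains and main effects."""
    joint = mutual_info(list(zip(a, b, c)), d)
    pairs = ig2(a, b, d) + ig2(a, c, d) + ig2(b, c, d)
    mains = mutual_info(a, d) + mutual_info(b, d) + mutual_info(c, d)
    return joint - pairs - mains

# Toy data: the phenotype is the parity (XOR) of three binary attributes.
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
A = [r[0] for r in rows]
B = [r[1] for r in rows]
C = [r[2] for r in rows]
D = [a ^ b ^ c for a, b, c in rows]
print(mutual_info(A, D), ig2(A, B, D), ig3(A, B, C, D))
```

In the parity example no single attribute and no pair of attributes carries any information about the phenotype, so all of the information (one bit) appears in the three-way term. This is the "pure" three-way pattern that lower-order analyses would miss.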