Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Thursday, May 16, 2019

Accessible AI for Automated Machine Learning

We released our open-source PennAI software for automated machine learning this week. See the Penn Medicine press release, the GitHub repository for the source code, and the PennAI website for more information. We think this will bring machine learning technology to novice users.

Monday, April 22, 2019

Automated discovery of test statistics

This was a fun proof-of-principle paper we did on using genetic programming to discover test statistics. We showed that, starting from general principles, we could rediscover the two-sample t-test. This opens the door to the discovery of new test statistics for unsolved problems.
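For readers who want a reminder of the target, here is a minimal sketch of the classic two-sample t statistic with pooled variance. It is not the genetic programming system from the paper; the function name `two_sample_t` and the example measurements are made up for illustration.

```python
import math
from statistics import mean, variance


def two_sample_t(x, y):
    """Student's two-sample t statistic, assuming equal variances in both groups."""
    nx, ny = len(x), len(y)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / nx + 1 / ny))


# Hypothetical measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.2, 4.4, 4.1, 4.8, 4.5]
t_stat = two_sample_t(group_a, group_b)
print(f"t = {t_stat:.3f} on {len(group_a) + len(group_b) - 2} degrees of freedom")
```

The statistic is a difference in means scaled by its estimated standard error, which is the kind of compact expression a symbolic search could plausibly assemble from basic building blocks.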

Sunday, March 31, 2019

How to increase our belief in discovered statistical interactions via large-scale association studies?

Our new paper with Dr. Kristel van Steen on approaches for improving evidence for statistical interactions.

Van Steen K, Moore JH. How to increase our belief in discovered statistical interactions via large-scale association studies? Hum Genet. 2019 [PubMed] [Human Genetics]


The understanding that differences in biological epistasis may impact disease risk, diagnosis, or disease management stands in wide contrast to the unavailability of widely accepted large-scale epistasis analysis protocols. Several choices in the analysis workflow will impact false-positive and false-negative rates. One of these choices relates to the exploitation of particular modelling or testing strategies. The strengths and limitations of these need to be well understood, as well as the contexts in which these hold. This will contribute to determining the potentially complementary value of epistasis detection workflows and is expected to increase replication success with biological relevance. In this contribution, we take a recently introduced regression-based epistasis detection tool as a leading example to review the key elements that need to be considered to fully appreciate the value of analytical epistasis detection performance assessments. We point out unresolved hurdles and give our perspectives towards overcoming these.

Friday, March 01, 2019

Testing the assumptions of parametric linear models: the need for biological data mining in disciplines such as human genetics

This editorial is in response to some claims that an observed linear relationship between relative pair trait correlation and IBD genetic sharing is indicative of a simple additive genetic architecture dominated by independent genetic effects. As we show here, you could observe this pattern under a genetic architecture dominated by epistasis.

Moore JH, Mackay TFC, Williams SM. Testing the assumptions of parametric linear models: the need for biological data mining in disciplines such as human genetics. BioData Min. 2019 Feb 11;12:6. [PubMed] [BioData Mining]


All data science methods have specific assumptions that are made in order for their inferences to be valid. Some assumptions impact statistical significance testing and some influence the models themselves. For example, a fundamental assumption of linear regression is that the relationship between the independent and dependent variables is additive such that a unit increase in one leads to a unit increase in the other with some error that can be modeled using a normal distribution. The presence of a nonlinear relationship between the variables violates this assumption and can lead to inaccurate inferences. We demonstrate this here using a simple example from human genetics and then end with some thoughts about the role of biological data mining in revealing nonlinear relationships between variables.
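To make the point concrete, here is a toy sketch. It is not the example from the editorial: it assumes two binary loci, a balanced sample, and a deterministic XOR-style penetrance, all chosen purely for illustration. Under those assumptions each locus has a marginal regression slope of zero even though the two loci jointly determine the trait.

```python
from itertools import product
from statistics import mean


def ols_slope(x, y):
    """Least-squares slope of y regressed on a single predictor x."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var


# Hypothetical balanced sample: every two-locus genotype appears 25 times.
genotypes = list(product([0, 1], repeat=2)) * 25
locus_a = [a for a, _ in genotypes]
locus_b = [b for _, b in genotypes]

# Toy XOR penetrance: the trait is present only when exactly one locus carries the variant.
trait = [a ^ b for a, b in genotypes]

# Centered product term that encodes the interaction between the two loci.
interaction = [(a - 0.5) * (b - 0.5) for a, b in genotypes]

slope_a = ols_slope(locus_a, trait)
slope_b = ols_slope(locus_b, trait)
slope_ab = ols_slope(interaction, trait)
print("marginal slopes:", slope_a, slope_b)
print("interaction slope:", slope_ab)
```

Because the two loci are uncorrelated in this balanced design, a purely additive model sees nothing, while the interaction term captures the entire signal.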

Wednesday, February 13, 2019

Preparing next-generation scientists for biomedical big data: artificial intelligence approaches

Our paper on how to prepare next-gen scientists for big data is out. We outline here a curriculum focused on precision medicine, data science, and artificial intelligence.

Moore JH, Boland MR, Camara PG, Chervitz H, Gonzalez G, Himes BE, Kim D, Mowery DL, Ritchie MD, Shen L, Urbanowicz RJ, Holmes JH. Preparing next-generation scientists for biomedical big data: artificial intelligence approaches. Per Med. 2019 [PubMed] [PerMed]


Personalized medicine is being realized by our ability to measure biological and environmental information about patients. Much of these data are being stored in electronic health records yielding big data that presents challenges for its management and analysis. Here, we review several areas of knowledge that are necessary for next-generation scientists to fully realize the potential of biomedical big data. We begin with an overview of big data and its storage and management. We then review statistics and data science as foundational topics followed by a core curriculum of artificial intelligence, machine learning and natural language processing that are needed to develop predictive models for clinical decision making. We end with some specific training recommendations for preparing next-generation scientists for biomedical big data.

Wednesday, January 02, 2019

Analysis validation has been neglected in the Age of Reproducibility

Our paper on the use of simulation to help improve analysis validation and results reproducibility.

Lotterhos KE, Moore JH, Stapleton AE. Analysis validation has been neglected in the Age of Reproducibility. PLoS Biol. 2018 Dec 10;16(12):e3000070. [PubMed] [PLoS Biology]


Increasingly complex statistical models are being used for the analysis of biological data. Recent commentary has focused on the ability to compute the same outcome for a given dataset (reproducibility). We argue that a reproducible statistical analysis is not necessarily valid because of unique patterns of nonindependence in every biological dataset. We advocate that analyses should be evaluated with known-truth simulations that capture biological reality, a process we call "analysis validation." We review the process of validation and suggest criteria that a validation project should meet. We find that different fields of science have historically failed to meet all criteria, and we suggest ways to implement meaningful validation in training and practice.

Wednesday, November 21, 2018

Generalized multifactor dimensionality reduction approaches to identification of genetic interactions underlying ordinal traits

I love seeing new extensions and modifications to our MDR method. Here is a new one from Dr. Lou.

Hou TT, Lin F, Bai S, Cleves MA, Xu HM, Lou XY. Generalized multifactor dimensionality reduction approaches to identification of genetic interactions underlying ordinal traits. Genet Epidemiol, in press (2018)


The manifestation of complex traits is influenced by gene–gene and gene–environment interactions, and the identification of multifactor interactions is an important but challenging undertaking for genetic studies. Many complex phenotypes such as disease severity are measured on an ordinal scale with more than two categories. A proportional odds model can improve statistical power for these outcomes, when compared to a logit model either collapsing the categories into two mutually exclusive groups or limiting the analysis to pairs of categories. In this study, we propose a proportional odds model‐based generalized multifactor dimensionality reduction (GMDR) method for detection of interactions underlying polytomous ordinal phenotypes. Computer simulations demonstrated that this new GMDR method has a higher power and more accurate predictive ability than the GMDR methods based on a logit model and a multinomial logit model. We applied this new method to the genetic analysis of low‐density lipoprotein (LDL) cholesterol, a causal risk factor for coronary artery disease, in the Multi‐Ethnic Study of Atherosclerosis, and identified a significant joint action of the CELSR2, SERPINA12, HPGD, and APOB genes. This finding provides new information to advance the limited knowledge about genetic regulation and gene interactions in metabolic pathways of LDL cholesterol. In conclusion, the proportional odds model‐based GMDR is a useful tool that can boost statistical power and prediction accuracy in studying multifactor interactions underlying ordinal traits.

Wednesday, October 24, 2018

Statistical Inference Relief (STIR) feature selection

Happy to be a collaborator on this paper to add inference to the ReliefF method for feature selection. We have done a lot of work on this algorithm that is capable of detecting epistasis.

Le TT, Urbanowicz RJ, Moore JH, McKinney BA. Statistical Inference Relief (STIR) feature selection. Bioinformatics. 2018 Sep 18, in press


Relief is a family of machine learning algorithms that uses nearest-neighbors to select features whose association with an outcome may be due to epistasis or statistical interactions with other features in high-dimensional data. Relief-based estimators are non-parametric in the statistical sense that they do not have a parameterized model with an underlying probability distribution for the estimator, making it difficult to determine the statistical significance of Relief-based attribute estimates. Thus, a statistical inferential formalism is needed to avoid imposing arbitrary thresholds to select the most important features.

We reconceptualize the Relief-based feature selection algorithm to create a new family of STatistical Inference Relief (STIR) estimators that retains the ability to identify interactions while incorporating sample variance of the nearest neighbor distances into the attribute importance estimation. This variance permits the calculation of statistical significance of features and adjustment for multiple testing of Relief-based scores. Specifically, we develop a pseudo t-test version of Relief-based algorithms for case-control data.
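The idea can be sketched roughly as follows. This is a simplified illustration and not the authors' implementation (their code is linked with the paper). It assumes a fixed k nearest neighbors, Manhattan distance, and a pooled-variance t-like score comparing nearest-miss and nearest-hit differences for each attribute. The function name `relief_t_scores` and the simulated data are hypothetical.

```python
import math
import random
from statistics import mean, variance


def relief_t_scores(X, y, k=1):
    """t-like Relief score per attribute (simplified sketch, fixed-k neighbors)."""
    n, p = len(X), len(X[0])
    hit_diffs = [[] for _ in range(p)]
    miss_diffs = [[] for _ in range(p)]
    for i in range(n):
        # Rank all other instances by Manhattan distance to instance i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sum(abs(a - b) for a, b in zip(X[i], X[j])),
        )
        hits = [j for j in others if y[j] == y[i]][:k]     # nearest same-class neighbors
        misses = [j for j in others if y[j] != y[i]][:k]   # nearest other-class neighbors
        for a in range(p):
            hit_diffs[a].extend(abs(X[i][a] - X[j][a]) for j in hits)
            miss_diffs[a].extend(abs(X[i][a] - X[j][a]) for j in misses)
    scores = []
    for a in range(p):
        h, m = hit_diffs[a], miss_diffs[a]
        # Pooled variance of hit and miss differences, as in a two-sample t statistic.
        pooled = ((len(h) - 1) * variance(h) + (len(m) - 1) * variance(m)) / (len(h) + len(m) - 2)
        scores.append((mean(m) - mean(h)) / math.sqrt(pooled * (1 / len(h) + 1 / len(m))))
    return scores


# Hypothetical case-control data: attribute 0 tracks the class label, attribute 1 is noise.
rng = random.Random(0)
y = [i % 2 for i in range(60)]
X = [[label + rng.gauss(0, 0.2), rng.gauss(0, 0.5)] for label in y]
scores = relief_t_scores(X, y)
print([round(s, 2) for s in scores])
```

A large positive score means an attribute differs more between nearest misses than between nearest hits. Putting that difference on a t-like scale is what makes significance testing and multiple-testing adjustment possible.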

We demonstrate the statistical power and control of type I error of the STIR family of feature selection methods on a panel of simulated data that exhibits properties reflected in real gene expression data, including main effects and network interaction effects. We compare the performance of STIR when the adaptive radius method is used as the nearest neighbor constructor with STIR when the fixed-k nearest neighbor constructor is used. We apply STIR to real RNA-Seq data from a study of major depressive disorder and discuss STIR's straightforward extension to genome-wide association studies.

Code and data available at http://insilico.utulsa.edu/software/STIR.

Thursday, September 20, 2018

The complex underpinnings of genetic background effects

A nice new paper on epistasis in yeast.

Mullis MN, Matsui T, Schell R, Foree R, Ehrenreich IM. The complex underpinnings of genetic background effects. Nat Commun. 2018 Sep 17;9(1):3548. [PubMed]

Genetic interactions between mutations and standing polymorphisms can cause mutations to show distinct phenotypic effects in different individuals. To characterize the genetic architecture of these so-called background effects, we genotype 1411 wild-type and mutant yeast cross progeny and measure their growth in 10 environments. Using these data, we map 1086 interactions between segregating loci and 7 different gene knockouts. Each knockout exhibits between 73 and 543 interactions, with 89% of all interactions involving higher-order epistasis between a knockout and multiple loci. Identified loci interact with as few as one knockout and as many as all seven knockouts. In mutants, loci interacting with fewer and more knockouts tend to show enhanced and reduced phenotypic effects, respectively. Cross-environment analysis reveals that most interactions between the knockouts and segregating loci also involve the environment. These results illustrate the complicated interactions between mutations, standing polymorphisms, and the environment that cause background effects.

Saturday, September 01, 2018

Analysis of Epistasis in Natural Traits Using Model Organisms

A nice new essay in Trends in Genetics.

Campbell RF, McGrath PT, Paaby AB. Analysis of Epistasis in Natural Traits Using Model Organisms. Trends Genet. 2018 [PubMed]


Identification of statistical epistasis in natural populations remains challenging due to the relationship between allele frequency and statistical power.

Artificial populations have been constructed in model organisms to detect statistical epistasis between two regions of the genome; however, it is difficult to use these results to understand how epistasis operates in natural populations.

Studies of focal perturbations in defined genetic backgrounds suggest that natural selection can influence the types of nonadditive relationships that exist.