Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Thursday, September 20, 2018

The complex underpinnings of genetic background effects

A nice new paper on epistasis in yeast.

Mullis MN, Matsui T, Schell R, Foree R, Ehrenreich IM. The complex underpinnings of genetic background effects. Nat Commun. 2018 Sep 17;9(1):3548. [PubMed]

Genetic interactions between mutations and standing polymorphisms can cause mutations to show distinct phenotypic effects in different individuals. To characterize the genetic architecture of these so-called background effects, we genotype 1411 wild-type and mutant yeast cross progeny and measure their growth in 10 environments. Using these data, we map 1086 interactions between segregating loci and 7 different gene knockouts. Each knockout exhibits between 73 and 543 interactions, with 89% of all interactions involving higher-order epistasis between a knockout and multiple loci. Identified loci interact with as few as one knockout and as many as all seven knockouts. In mutants, loci interacting with fewer and more knockouts tend to show enhanced and reduced phenotypic effects, respectively. Cross-environment analysis reveals that most interactions between the knockouts and segregating loci also involve the environment. These results illustrate the complicated interactions between mutations, standing polymorphisms, and the environment that cause background effects.

Saturday, September 01, 2018

Analysis of Epistasis in Natural Traits Using Model Organisms

A nice new essay in Trends in Genetics

Campbell RF, McGrath PT, Paaby AB. Analysis of Epistasis in Natural Traits Using Model Organisms. Trends Genet. 2018 [PubMed]


Identification of statistical epistasis in natural populations remains challenging due to the relationship between allele frequency and statistical power.

Artificial populations have been constructed in model organisms to detect statistical epistasis between two regions of the genome; however, it is difficult to use these results to understand how epistasis operates in natural populations.

Studies of focal perturbations in defined genetic backgrounds suggest that natural selection can influence the types of nonadditive relationships that exist.
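One way to see the first point above, on allele frequency and power, is to count how many individuals are expected in each two-locus genotype class. The sketch below is only an illustration: it assumes Hardy-Weinberg proportions and independence between the two loci, and the function names are made up for this example rather than taken from the essay.

```python
def genotype_freqs(q):
    """Hardy-Weinberg genotype frequencies for minor allele frequency q.

    Keys are the number of minor alleles carried (0, 1, or 2).
    """
    p = 1.0 - q
    return {0: p * p, 1: 2 * p * q, 2: q * q}


def expected_two_locus_counts(q1, q2, n):
    """Expected sample count in each of the nine two-locus genotype classes.

    Assumes the two loci are independent (no linkage disequilibrium).
    """
    f1 = genotype_freqs(q1)
    f2 = genotype_freqs(q2)
    return {(g1, g2): n * f1[g1] * f2[g2] for g1 in f1 for g2 in f2}


# Two loci with minor allele frequency 0.1 in a sample of 10,000.
counts = expected_two_locus_counts(0.1, 0.1, 10_000)
for genotype, count in sorted(counts.items()):
    print(genotype, round(count, 1))
```

Under these assumptions the double minor homozygote class is expected to hold about one individual out of 10,000, so any interaction effect that depends on that class is estimated from almost no data.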

Wednesday, August 15, 2018

PennAI - A System for Accessible Artificial Intelligence

Our paper on PennAI has finally been published in the proceedings of the Genetic Programming Theory and Practice XV workshop.


While artificial intelligence (AI) has become widespread, many commercial AI systems are not yet accessible to individual researchers nor the general public due to the deep knowledge of the systems required to use them. We believe that AI has matured to the point where it should be an accessible technology for everyone. We present an ongoing project whose ultimate goal is to deliver an open source, user-friendly AI system that is specialized for machine learning analysis of complex data in the biomedical and health care domains. We discuss how genetic programming can aid in this endeavor, and highlight specific examples where genetic programming has automated machine learning analyses in previous projects.

Monday, July 23, 2018

New Papers on ReliefF for Feature Selection

We have two new papers out on ReliefF for feature selection. ReliefF is a machine learning method that can detect epistasis.

Urbanowicz RJ, Meeker M, La Cava W, Olson RS, Moore JH. Relief-based feature selection: Introduction and review. J Biomed Inform. 2018 Jul 18. [PubMed] [JBI]

Urbanowicz RJ, Olson RS, Schmitt P, Meeker M, Moore JH. Benchmarking relief-based feature selection methods for bioinformatics data mining. J Biomed Inform. 2018 Jul 17. [PubMed] [JBI]
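For readers new to this family of methods, here is a minimal sketch of why Relief-style scoring can pick up epistasis. It implements the scoring rule of the original Relief algorithm (one nearest hit and one nearest miss per instance), not ReliefF itself, which uses k nearest neighbors and handles noisy and multi-class data. It is not code from either paper; the function names and the toy XOR dataset are assumptions made for this example.

```python
def hamming(x, y):
    """Number of positions at which two discrete feature vectors differ."""
    return sum(a != b for a, b in zip(x, y))


def relief_weights(samples, labels):
    """Score features with the basic Relief update, using every instance once.

    A feature loses weight when it differs from the nearest same-class
    neighbor (hit) and gains weight when it differs from the nearest
    other-class neighbor (miss).
    """
    m = len(samples)
    weights = [0.0] * len(samples[0])
    for i, (x, label) in enumerate(zip(samples, labels)):
        others = [j for j in range(m) if j != i]
        hits = [j for j in others if labels[j] == label]
        misses = [j for j in others if labels[j] != label]
        if not hits or not misses:
            continue  # no valid neighbor pair for this instance
        hit = samples[min(hits, key=lambda j: hamming(x, samples[j]))]
        miss = samples[min(misses, key=lambda j: hamming(x, samples[j]))]
        for f in range(len(weights)):
            weights[f] -= (x[f] != hit[f]) / m
            weights[f] += (x[f] != miss[f]) / m
    return weights


# Toy data: the class is the XOR of features 0 and 1; feature 2 is noise.
samples = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [a ^ b for a, b, c in samples]
print(relief_weights(samples, labels))
```

In this toy case neither interacting feature has a marginal effect on the class, yet both end up with positive weights while the noise feature is scored negatively, because the scores come from comparing neighbors across all features at once.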

Sunday, June 17, 2018

Leveraging epigenomics and contactomics data to investigate SNP pairs in GWAS

Our new paper on the incorporation of expert knowledge about epigenomics and chromatin looping for modeling epistasis has been published in Human Genetics.

Manduchi E, Williams SM, Chesi A, Johnson ME, Wells AD, Grant SFA, Moore JH. Leveraging epigenomics and contactomics data to investigate SNP pairs in GWAS. Hum Genet. 2018 May;137(5):413-425. [PubMed]


Although Genome Wide Association Studies (GWAS) have led to many valuable insights into the genetic bases of common diseases over the past decade, the issue of missing heritability has surfaced, as the discovered main effect genetic variants found to date do not account for much of a trait's predicted genetic component. We present a workflow, integrating epigenomics and topologically associating domain data, aimed at discovering trait-associated SNP pairs from GWAS where neither SNP achieved independent genome-wide significance. Each analyzed SNP pair consists of one SNP in a putative active enhancer and another SNP in a putative physically interacting gene promoter in a trait-relevant tissue. As a proof-of-principle case study, we used this approach to identify focused collections of SNP pairs that we analyzed in three independent Type 2 diabetes (T2D) GWAS. This approach led us to discover 35 significant SNP pairs, encompassing both novel signals and signals for which we have found orthogonal support from other sources. Nine of these pairs are consistent with eQTL results, two are consistent with our own capture C experiments, and seven involve signals supported by recent T2D literature.

Thursday, May 03, 2018

AI researchers allege that machine learning is alchemy

This is a really nice piece in Science on the limitations and challenges of machine learning. Highly recommended reading.

"Without deep understanding of ... basic tools needed to build & train new algorithms, researchers creating AIs resort to hearsay, like medieval alchemists. 'People gravitate around cargo-cult practices,' relying on 'folklore & magic spells.'"

Monday, April 30, 2018

Improving machine learning reproducibility in genetic association studies with proportional instance cross validation (PICV)

Our new paper describes a resampling method for improving the reproducibility of machine learning in the context of cross validation.

Piette ER, Moore JH. Improving machine learning reproducibility in genetic association studies with proportional instance cross validation (PICV). BioData Min. 2018 Apr 19;11:6. [PubMed]

Background: Machine learning methods and conventions are increasingly employed for the analysis of large, complex biomedical data sets, including genome-wide association studies (GWAS). Reproducibility of machine learning analyses of GWAS can be hampered by biological and statistical factors, particularly so for the investigation of non-additive genetic interactions. Application of traditional cross validation to a GWAS data set may result in poor consistency between the training and testing data set splits due to an imbalance of the interaction genotypes relative to the data as a whole. We propose a new cross validation method, proportional instance cross validation (PICV), that preserves the original distribution of an independent variable when splitting the data set into training and testing partitions.

Results: We apply PICV to simulated GWAS data with epistatic interactions of varying minor allele frequencies and prevalences and compare performance to that of a traditional cross validation procedure in which individuals are randomly allocated to training and testing partitions. Sensitivity and positive predictive value are significantly improved across all tested scenarios for PICV compared to traditional cross validation. We also apply PICV to GWAS data from a study of primary open-angle glaucoma to investigate a previously-reported interaction, which fails to significantly replicate; PICV however improves the consistency of testing and training results.

Conclusions: Application of traditional machine learning procedures to biomedical data may require modifications to better suit intrinsic characteristics of the data, such as the potential for highly imbalanced genotype distributions in the case of epistasis detection. The reproducibility of genetic interaction findings can be improved by considering this variable imbalance in cross validation implementation, such as with PICV. This approach may be extended to problems in other domains in which imbalanced variable distributions are a concern.
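The core idea in the abstract, keeping the distribution of a chosen variable the same in every partition, can be sketched as a stratified fold assignment. This is a rough illustration of the idea and not the authors' implementation: the function name `proportional_folds`, the round-robin dealing, and the toy genotype counts are all assumptions made for this example.

```python
import random
from collections import defaultdict


def proportional_folds(strata, k, seed=0):
    """Split sample indices into k folds, preserving stratum proportions.

    `strata` holds one hashable, sortable value per sample, for example a
    (genotype_at_snp1, genotype_at_snp2) tuple for a SNP pair of interest.
    """
    rng = random.Random(seed)
    by_value = defaultdict(list)
    for idx, value in enumerate(strata):
        by_value[value].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # carries over between strata so leftovers spread evenly
    for value in sorted(by_value):
        members = by_value[value]
        rng.shuffle(members)
        for j, idx in enumerate(members):
            folds[(offset + j) % k].append(idx)
        offset += len(members)
    return folds


# Toy data: a common, an intermediate, and a rare two-locus genotype.
strata = [(0, 0)] * 60 + [(0, 1)] * 30 + [(1, 1)] * 10
for fold in proportional_folds(strata, k=5):
    print(sorted(strata[i] for i in fold).count((1, 1)), "rare genotypes in fold")
```

With purely random allocation, a fold could easily end up with none of the ten rare genotypes; dealing each genotype class across the folds in turn guarantees every fold sees its share.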

Wednesday, April 25, 2018

Collective feature selection to identify crucial epistatic variants

Nice new paper from Marylyn Ritchie's group on feature selection for epistasis analysis.

Verma SS, Lucas A, Zhang X, Veturi Y, Dudek S, Li B, Li R, Urbanowicz R, Moore JH, Kim D, Ritchie MD. Collective feature selection to identify crucial epistatic variants. BioData Min. 2018 Apr 19;11:5. [PubMed]

Machine learning methods have gained popularity and practicality in identifying linear and non-linear effects of variants associated with complex disease/traits. Detection of epistatic interactions still remains a challenge due to the large number of features and relatively small sample size as input, thus leading to the so-called "short fat data" problem. The efficiency of machine learning methods can be increased by limiting the number of input features. Thus, it is very important to perform variable selection before searching for epistasis. Many methods have been evaluated and proposed to perform feature selection, but no single method works best in all scenarios. We demonstrate this by conducting two separate simulation analyses to evaluate the proposed collective feature selection approach.

Through our simulation study we propose a collective feature selection approach to select features that are in the "union" of the best performing methods. We explored various parametric, non-parametric, and data mining approaches to perform feature selection. We choose our top performing methods to select the union of the resulting variables based on a user-defined percentage of variants selected from each method to take to downstream analysis. Our simulation analysis shows that non-parametric data mining approaches, such as MDR, may work best under one simulation criteria for the high effect size (penetrance) datasets, while non-parametric methods designed for feature selection, such as Ranger and Gradient boosting, work best under other simulation criteria. Thus, using a collective approach proves to be more beneficial for selecting variables with epistatic effects also in low effect size datasets and different genetic architectures. Following this, we applied our proposed collective feature selection approach to select the top 1% of variables to identify potential interacting variables associated with Body Mass Index (BMI) in ~44,000 samples obtained from Geisinger's MyCode Community Health Initiative (on behalf of DiscovEHR collaboration).

In this study, we were able to show that selecting variables using a collective feature selection approach could help in selecting true positive epistatic variables more frequently than applying any single method for feature selection via simulation studies. We were able to demonstrate the effectiveness of collective feature selection along with a comparison of many methods in our simulation analysis. We also applied our method to identify non-linear networks associated with obesity.
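The "union" step described in the abstract is simple to sketch: take a user-defined top percentage of variables from each method and keep anything that any method selected. The code below is a minimal illustration of that step only, not the paper's pipeline; the function names, method labels, and scores are hypothetical.

```python
import math


def top_fraction(scores, fraction):
    """Names of the highest-scoring features, keeping at least one."""
    n_keep = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_keep])


def collective_selection(method_scores, fraction):
    """Union of the top `fraction` of features chosen by each method.

    `method_scores` maps a method name to a {feature: importance} dict.
    """
    selected = set()
    for scores in method_scores.values():
        selected |= top_fraction(scores, fraction)
    return selected


# Hypothetical importance scores from two feature selection methods.
method_scores = {
    "method_a": {"snp1": 0.9, "snp2": 0.8, "snp3": 0.1, "snp4": 0.0},
    "method_b": {"snp1": 0.2, "snp2": 0.1, "snp3": 0.9, "snp4": 0.7},
}
print(sorted(collective_selection(method_scores, fraction=0.25)))
```

Here each method contributes its single best variable, so a variable that only one method ranks highly still makes it through to downstream analysis.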

Tuesday, March 27, 2018

Tips for cloud computing

Epistasis analysis is by nature computationally challenging. Here are some tips for working cloud computing into your analytical pipeline. 

Cole BS, Moore JH. Eleven quick tips for architecting biomedical informatics workflows with cloud computing. PLoS Comput Biol. 2018 Mar 29;14(3):e1005994. doi: 10.1371/journal.pcbi.1005994. [PLOS]

Cloud computing has revolutionized the development and operations of hardware and software across diverse technological arenas, yet academic biomedical research has lagged behind despite the numerous and weighty advantages that cloud computing offers. Biomedical researchers who embrace cloud computing can reap rewards in cost reduction, decreased development and maintenance workload, increased reproducibility, ease of sharing data and software, enhanced security, horizontal and vertical scalability, high availability, a thriving technology partner ecosystem, and much more. Despite these advantages that cloud-based workflows offer, the majority of scientific software developed in academia does not utilize cloud computing and must be migrated to the cloud by the user. In this article, we present 11 quick tips for architecting biomedical informatics workflows on compute clouds, distilling knowledge gained from experience developing, operating, maintaining, and distributing software and virtualized appliances on the world's largest cloud. Researchers who follow these tips stand to benefit immediately by migrating their workflows to cloud computing and embracing the paradigm of abstraction.

Sunday, February 18, 2018

Effect of genetic architecture on the prediction accuracy of quantitative traits in samples of unrelated individuals

Nice paper from Trudy Mackay et al. I had the pleasure of talking to her about this paper at the last EDGE workshop.

Morgante F, Huang W, Maltecca C, Mackay TFC. Effect of genetic architecture on the prediction accuracy of quantitative traits in samples of unrelated individuals. Heredity (Edinb). 2018 [PubMed]


Predicting complex phenotypes from genomic data is a fundamental aim of animal and plant breeding, where we wish to predict genetic merits of selection candidates; and of human genetics, where we wish to predict disease risk. While genomic prediction models work well with populations of related individuals and high linkage disequilibrium (LD) (e.g., livestock), comparable models perform poorly for populations of unrelated individuals and low LD (e.g., humans). We hypothesized that low prediction accuracies in the latter situation may occur when the genetic architecture of the trait departs from the infinitesimal and additive architecture assumed by most prediction models. We used simulated data for 10,000 lines based on sequence data from a population of unrelated, inbred Drosophila melanogaster lines to evaluate this hypothesis. We show that, even in very simplified scenarios meant as a stress test of the commonly used Genomic Best Linear Unbiased Predictor (G-BLUP) method, using all common variants yields low prediction accuracy regardless of the trait genetic architecture. However, prediction accuracy increases when predictions are informed by the genetic architecture inferred from mapping the top variants affecting main effects and interactions in the training data, provided there is sufficient power for mapping. When the true genetic architecture is largely or partially due to epistatic interactions, the additive model may not perform well, while models that account explicitly for interactions generally increase prediction accuracy. Our results indicate that accounting for genetic architecture can improve prediction accuracy for quantitative traits.