Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Wednesday, November 21, 2018

Generalized multifactor dimensionality reduction approaches to identification of genetic interactions underlying ordinal traits

I love seeing new extensions and modifications to our MDR method. Here is a new one from Dr. Lou.

Hou TT, Lin F, Bai S, Cleves MA, Xu HM, Lou XY. Generalized multifactor dimensionality reduction approaches to identification of genetic interactions underlying ordinal traits. Genet Epidemiol, in press (2018)


The manifestation of complex traits is influenced by gene–gene and gene–environment interactions, and the identification of multifactor interactions is an important but challenging undertaking for genetic studies. Many complex phenotypes such as disease severity are measured on an ordinal scale with more than two categories. A proportional odds model can improve statistical power for these outcomes, when compared to a logit model either collapsing the categories into two mutually exclusive groups or limiting the analysis to pairs of categories. In this study, we propose a proportional odds model‐based generalized multifactor dimensionality reduction (GMDR) method for detection of interactions underlying polytomous ordinal phenotypes. Computer simulations demonstrated that this new GMDR method has a higher power and more accurate predictive ability than the GMDR methods based on a logit model and a multinomial logit model. We applied this new method to the genetic analysis of low‐density lipoprotein (LDL) cholesterol, a causal risk factor for coronary artery disease, in the Multi‐Ethnic Study of Atherosclerosis, and identified a significant joint action of the CELSR2, SERPINA12, HPGD, and APOB genes. This finding provides new information to advance the limited knowledge about genetic regulation and gene interactions in metabolic pathways of LDL cholesterol. In conclusion, the proportional odds model‐based GMDR is a useful tool that can boost statistical power and prediction accuracy in studying multifactor interactions underlying ordinal traits.
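For readers new to GMDR, the pooling step that gives the method its name can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes each individual already has a numeric score (for example, a residual from a null model), and it does not show how such scores would be derived from a proportional odds model. The function name `label_cells` and the threshold of 0 are assumptions made for the example.

```python
from collections import defaultdict


def label_cells(genotypes, scores, threshold=0.0):
    """Pool multilocus genotype cells into 'high' and 'low' groups.

    genotypes: one tuple per individual, e.g. (genotype at SNP1, genotype at SNP2)
    scores:    one precomputed numeric score per individual (hypothetical input)
    """
    totals = defaultdict(float)
    for combo, score in zip(genotypes, scores):
        totals[tuple(combo)] += score
    # A cell whose summed score reaches the threshold is labeled 'high'.
    return {combo: ("high" if total >= threshold else "low")
            for combo, total in totals.items()}


cells = label_cells(
    genotypes=[(0, 0), (0, 0), (1, 2), (1, 2)],
    scores=[0.5, -0.2, -0.4, 0.1],
)
```

The many genotype combinations collapse into a single two-level variable, which can then be evaluated by cross validation.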

Wednesday, October 24, 2018

Statistical Inference Relief (STIR) feature selection

Happy to be a collaborator on this paper, which adds statistical inference to the ReliefF method for feature selection. We have done a lot of work on this algorithm, which is capable of detecting epistasis.

Le TT, Urbanowicz RJ, Moore JH, McKinney BA. Statistical Inference Relief (STIR) feature selection. Bioinformatics. 2018 Sep 18, in press


Relief is a family of machine learning algorithms that uses nearest-neighbors to select features whose association with an outcome may be due to epistasis or statistical interactions with other features in high-dimensional data. Relief-based estimators are non-parametric in the statistical sense that they do not have a parameterized model with an underlying probability distribution for the estimator, making it difficult to determine the statistical significance of Relief-based attribute estimates. Thus, a statistical inferential formalism is needed to avoid imposing arbitrary thresholds to select the most important features.

We reconceptualize the Relief-based feature selection algorithm to create a new family of STatistical Inference Relief (STIR) estimators that retains the ability to identify interactions while incorporating sample variance of the nearest neighbor distances into the attribute importance estimation. This variance permits the calculation of statistical significance of features and adjustment for multiple testing of Relief-based scores. Specifically, we develop a pseudo t-test version of Relief-based algorithms for case-control data.

We demonstrate the statistical power and control of type I error of the STIR family of feature selection methods on a panel of simulated data that exhibits properties reflected in real gene expression data, including main effects and network interaction effects. We compare the performance of STIR when the adaptive radius method is used as the nearest neighbor constructor with STIR when the fixed-k nearest neighbor constructor is used. We apply STIR to real RNA-Seq data from a study of major depressive disorder and discuss STIR's straightforward extension to genome-wide association studies.

Code and data available at http://insilico.utulsa.edu/software/STIR.
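The idea of turning Relief's hit and miss differences into a t-like statistic can be illustrated with a small sketch. This is a simplified toy under stated assumptions, not the STIR implementation linked above: it uses a fixed-k neighbor search with Manhattan distance, a pooled-variance two-sample statistic, and the hypothetical function name `stir_like_scores`.

```python
import math
from statistics import mean, variance


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def stir_like_scores(X, y, k=1):
    """Return one t-like score per attribute (larger = more class-relevant)."""
    n, p = len(X), len(X[0])
    hit_diffs = [[] for _ in range(p)]
    miss_diffs = [[] for _ in range(p)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: manhattan(X[i], X[j]))
        hits = [j for j in others if y[j] == y[i]][:k]      # same class
        misses = [j for j in others if y[j] != y[i]][:k]    # other class
        for a in range(p):
            hit_diffs[a] += [abs(X[i][a] - X[j][a]) for j in hits]
            miss_diffs[a] += [abs(X[i][a] - X[j][a]) for j in misses]
    scores = []
    for a in range(p):
        h, m = hit_diffs[a], miss_diffs[a]
        # Pooled variance of the neighbor differences, as in a two-sample t-test.
        pooled = ((len(m) - 1) * variance(m) + (len(h) - 1) * variance(h)) \
            / (len(m) + len(h) - 2)
        se = math.sqrt(pooled * (1 / len(m) + 1 / len(h)))
        scores.append((mean(m) - mean(h)) / se if se > 0 else 0.0)
    return scores


# Toy data: attribute 0 separates the classes, attribute 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [1.0, 0.8], [1.1, 0.2], [1.2, 0.7]]
y = [0, 0, 0, 1, 1, 1]
scores = stir_like_scores(X, y, k=1)
```

Because the score is standardized by the variance of the neighbor differences, it can in principle be compared against a reference distribution rather than an arbitrary cutoff, which is the motivation given in the abstract.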

Thursday, September 20, 2018

The complex underpinnings of genetic background effects

A nice new paper on epistasis in yeast.

Mullis MN, Matsui T, Schell R, Foree R, Ehrenreich IM. The complex underpinnings of genetic background effects. Nat Commun. 2018 Sep 17;9(1):3548. [PubMed]

Genetic interactions between mutations and standing polymorphisms can cause mutations to show distinct phenotypic effects in different individuals. To characterize the genetic architecture of these so-called background effects, we genotype 1411 wild-type and mutant yeast cross progeny and measure their growth in 10 environments. Using these data, we map 1086 interactions between segregating loci and 7 different gene knockouts. Each knockout exhibits between 73 and 543 interactions, with 89% of all interactions involving higher-order epistasis between a knockout and multiple loci. Identified loci interact with as few as one knockout and as many as all seven knockouts. In mutants, loci interacting with fewer and more knockouts tend to show enhanced and reduced phenotypic effects, respectively. Cross-environment analysis reveals that most interactions between the knockouts and segregating loci also involve the environment. These results illustrate the complicated interactions between mutations, standing polymorphisms, and the environment that cause background effects.

Saturday, September 01, 2018

Analysis of Epistasis in Natural Traits Using Model Organisms

A nice new essay in Trends in Genetics.

Campbell RF, McGrath PT, Paaby AB. Analysis of Epistasis in Natural Traits Using Model Organisms. Trends Genet. 2018 [PubMed]


Identification of statistical epistasis in natural populations remains challenging due to the relationship between allele frequency and statistical power.

Artificial populations have been constructed in model organisms to detect statistical epistasis between two regions of the genome; however, it is difficult to use these results to understand how epistasis operates in natural populations.

Studies of focal perturbations in defined genetic backgrounds suggest that natural selection can influence the types of nonadditive relationships that exist.

Wednesday, August 15, 2018

PennAI - A System for Accessible Artificial Intelligence

Our paper on PennAI has finally been published as part of the Genetic Programming Theory and Practice XV workshop book.


While artificial intelligence (AI) has become widespread, many commercial AI systems are not yet accessible to individual researchers nor the general public due to the deep knowledge of the systems required to use them. We believe that AI has matured to the point where it should be an accessible technology for everyone. We present an ongoing project whose ultimate goal is to deliver an open source, user-friendly AI system that is specialized for machine learning analysis of complex data in the biomedical and health care domains. We discuss how genetic programming can aid in this endeavor, and highlight specific examples where genetic programming has automated machine learning analyses in previous projects.

Monday, July 23, 2018

New Papers on ReliefF for Feature Selection

We have two new papers out on ReliefF for feature selection. ReliefF is a machine learning method that can detect epistasis.

Urbanowicz RJ, Meeker M, La Cava W, Olson RS, Moore JH. Relief-based feature selection: Introduction and review. J Biomed Inform. 2018 Jul 18. [PubMed] [JBI]

Urbanowicz RJ, Olson RS, Schmitt P, Meeker M, Moore JH. Benchmarking relief-based feature selection methods for bioinformatics data mining. J Biomed Inform. 2018 Jul 17. [PubMed] [JBI]

Sunday, June 17, 2018

Leveraging epigenomics and contactomics data to investigate SNP pairs in GWAS

Our new paper on the incorporation of expert knowledge about epigenomics and chromatin looping for modeling epistasis has been published in Human Genetics.

Manduchi E, Williams SM, Chesi A, Johnson ME, Wells AD, Grant SFA, Moore JH. Leveraging epigenomics and contactomics data to investigate SNP pairs in GWAS. Hum Genet. 2018 May;137(5):413-425. [PubMed]


Although Genome Wide Association Studies (GWAS) have led to many valuable insights into the genetic bases of common diseases over the past decade, the issue of missing heritability has surfaced, as the discovered main effect genetic variants found to date do not account for much of a trait's predicted genetic component. We present a workflow, integrating epigenomics and topologically associating domain data, aimed at discovering trait-associated SNP pairs from GWAS where neither SNP achieved independent genome-wide significance. Each analyzed SNP pair consists of one SNP in a putative active enhancer and another SNP in a putative physically interacting gene promoter in a trait-relevant tissue. As a proof-of-principle case study, we used this approach to identify focused collections of SNP pairs that we analyzed in three independent Type 2 diabetes (T2D) GWAS. This approach led us to discover 35 significant SNP pairs, encompassing both novel signals and signals for which we have found orthogonal support from other sources. Nine of these pairs are consistent with eQTL results, two are consistent with our own capture C experiments, and seven involve signals supported by recent T2D literature.

Thursday, May 03, 2018

AI researchers allege that machine learning is alchemy

This is a really nice piece in Science on the limitations and challenges of machine learning. Highly recommended reading.

"Without deep understanding of ... basic tools needed to build & train new algorithms, researchers creating AIs resort to hearsay, like medieval alchemists. 'People gravitate around cargo-cult practices,' relying on 'folklore & magic spells.'"

Monday, April 30, 2018

Improving machine learning reproducibility in genetic association studies with proportional instance cross validation (PICV)

Here is our new paper on using resampling methods to improve the reproducibility of machine learning in the context of cross validation.

Piette ER, Moore JH. Improving machine learning reproducibility in genetic association studies with proportional instance cross validation (PICV). BioData Min. 2018 Apr 19;11:6. [PubMed]

Background: Machine learning methods and conventions are increasingly employed for the analysis of large, complex biomedical data sets, including genome-wide association studies (GWAS). Reproducibility of machine learning analyses of GWAS can be hampered by biological and statistical factors, particularly so for the investigation of non-additive genetic interactions. Application of traditional cross validation to a GWAS data set may result in poor consistency between the training and testing data set splits due to an imbalance of the interaction genotypes relative to the data as a whole. We propose a new cross validation method, proportional instance cross validation (PICV), that preserves the original distribution of an independent variable when splitting the data set into training and testing partitions.

Results: We apply PICV to simulated GWAS data with epistatic interactions of varying minor allele frequencies and prevalences and compare performance to that of a traditional cross validation procedure in which individuals are randomly allocated to training and testing partitions. Sensitivity and positive predictive value are significantly improved across all tested scenarios for PICV compared to traditional cross validation. We also apply PICV to GWAS data from a study of primary open-angle glaucoma to investigate a previously-reported interaction, which fails to significantly replicate; PICV however improves the consistency of testing and training results.

Conclusions: Application of traditional machine learning procedures to biomedical data may require modifications to better suit intrinsic characteristics of the data, such as the potential for highly imbalanced genotype distributions in the case of epistasis detection. The reproducibility of genetic interaction findings can be improved by considering this variable imbalance in cross validation implementation, such as with PICV. This approach may be extended to problems in other domains in which imbalanced variable distributions are a concern.
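The core idea, keeping the distribution of one variable the same in every partition, can be sketched as a stratified fold assignment. This is a minimal illustration and not the PICV implementation from the paper; the function name `proportional_folds` and the use of a single label per individual (for example, a two-locus genotype combination) are assumptions made for the example.

```python
import random
from collections import defaultdict


def proportional_folds(strata, n_folds=5, seed=0):
    """Assign individuals to folds so each level of `strata` is spread evenly.

    strata: one label per individual, e.g. the genotype combination at the
            interacting loci (hypothetical input).
    Returns a list of folds, each a list of individual indices.
    """
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for idx, level in enumerate(strata):
        by_level[level].append(idx)

    folds = [[] for _ in range(n_folds)]
    next_fold = 0
    for level in sorted(by_level):
        members = by_level[level]
        rng.shuffle(members)
        # Deal members of this level out one fold at a time, so per-level
        # counts differ by at most one between folds.
        for idx in members:
            folds[next_fold].append(idx)
            next_fold = (next_fold + 1) % n_folds
    return folds


# A rare interaction genotype carried by 10 of 100 individuals.
strata = ["aabb"] * 10 + ["other"] * 90
folds = proportional_folds(strata, n_folds=5)
```

With purely random allocation, a rare genotype combination can end up concentrated in one or two folds; dealing each level out separately avoids that.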

Wednesday, April 25, 2018

Collective feature selection to identify crucial epistatic variants

Nice new paper from Marylyn Ritchie's group on feature selection for epistasis analysis.

Verma SS, Lucas A, Zhang X, Veturi Y, Dudek S, Li B, Li R, Urbanowicz R, Moore JH, Kim D, Ritchie MD. Collective feature selection to identify crucial epistatic variants. BioData Min. 2018 Apr 19;11:5. [PubMed]

Machine learning methods have gained popularity and practicality in identifying linear and non-linear effects of variants associated with complex disease/traits. Detection of epistatic interactions still remains a challenge due to the large number of features and relatively small sample size as input, thus leading to the so-called "short fat data" problem. The efficiency of machine learning methods can be increased by limiting the number of input features. Thus, it is very important to perform variable selection before searching for epistasis. Many methods have been evaluated and proposed to perform feature selection, but no single method works best in all scenarios. We demonstrate this by conducting two separate simulation analyses to evaluate the proposed collective feature selection approach.

Through our simulation study we propose a collective feature selection approach to select features that are in the "union" of the best performing methods. We explored various parametric, non-parametric, and data mining approaches to perform feature selection. We choose our top performing methods to select the union of the resulting variables based on a user-defined percentage of variants selected from each method to take to downstream analysis. Our simulation analysis shows that non-parametric data mining approaches, such as MDR, may work best under one simulation criteria for the high effect size (penetrance) datasets, while non-parametric methods designed for feature selection, such as Ranger and Gradient boosting, work best under other simulation criteria. Thus, using a collective approach proves to be more beneficial for selecting variables with epistatic effects also in low effect size datasets and different genetic architectures. Following this, we applied our proposed collective feature selection approach to select the top 1% of variables to identify potential interacting variables associated with Body Mass Index (BMI) in ~ 44,000 samples obtained from Geisinger's MyCode Community Health Initiative (on behalf of DiscovEHR collaboration).

In this study, we were able to show that selecting variables using a collective feature selection approach could help in selecting true positive epistatic variables more frequently than applying any single method for feature selection via simulation studies. We were able to demonstrate the effectiveness of collective feature selection along with a comparison of many methods in our simulation analysis. We also applied our method to identify non-linear networks associated with obesity.
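The "union" step described in the abstract is simple to sketch. The example below is an illustration only, not the authors' pipeline: it assumes each method has already produced an importance score per variant, and the names `top_fraction`, `collective_selection`, and the method and SNP labels are hypothetical.

```python
import math


def top_fraction(scores, fraction):
    """Return the top `fraction` of variants by score (at least one)."""
    n_keep = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_keep])


def collective_selection(method_scores, fraction):
    """Union of the top-ranked variants from every feature selection method."""
    selected = set()
    for scores in method_scores.values():
        selected |= top_fraction(scores, fraction)
    return selected


method_scores = {
    "method_a": {"snp1": 0.9, "snp2": 0.1, "snp3": 0.2, "snp4": 0.3},
    "method_b": {"snp1": 0.1, "snp2": 0.8, "snp3": 0.2, "snp4": 0.3},
}
selected = collective_selection(method_scores, fraction=0.25)
```

Here each method nominates its single best variant, and the union keeps both `snp1` and `snp2`, even though neither method ranks both highly.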