Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Tuesday, May 16, 2017

Detecting Statistical Interactions from Neural Network Weights

A new preprint on detecting interactions from deep learning models:

Detecting Statistical Interactions from Neural Network Weights 

Michael Tsang, Dehua Cheng, Yan Liu
University of Southern California 
May 16, 2017 


Interpreting deep neural networks can enable new applications for predictive modeling where both accuracy and interpretability are required. In this paper, we examine the underlying structure of a deep neural network to interpret the statistical interactions it captures. Our key observation is that any input features that interact with each other must follow strongly weighted connections to a common hidden unit before the final output. We propose a novel framework for detecting feature interactions of arbitrary order by interpreting neural network weights. Our framework, which we call Neural Interaction Detector (NID), accurately identifies meaningful interactions without an exhaustive search on an exponential solution space of interaction candidates. Empirical evaluation on both synthetic and real-world data showed the effectiveness of NID, which can uncover interactions omitted by other methods in orders of magnitude less time.
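
To make the key observation concrete, here is a toy sketch in Python. It is not the authors' NID implementation. It assumes a network with a single hidden layer and scores each pair of inputs by summing, over hidden units, the smaller of the two input weight magnitudes scaled by the magnitude of that unit's output weight. The function name, the scoring rule, and the example weights are all illustrative assumptions.

```python
from itertools import combinations


def pairwise_interaction_strengths(hidden_weights, output_weights):
    """Score every input pair from the weights of a one-hidden-layer network.

    hidden_weights[h][i] is the weight from input i to hidden unit h.
    output_weights[h] is the weight from hidden unit h to the output.
    This scoring rule is a simplification chosen for illustration.
    """
    n_inputs = len(hidden_weights[0])
    scores = {}
    for i, j in combinations(range(n_inputs), 2):
        total = 0.0
        for unit_weights, out_weight in zip(hidden_weights, output_weights):
            # A pair only scores at a hidden unit if BOTH inputs feed it strongly.
            shared = min(abs(unit_weights[i]), abs(unit_weights[j]))
            total += abs(out_weight) * shared
        scores[(i, j)] = total
    return scores


# Made-up weights: inputs 0 and 1 share hidden unit 0, input 2 mostly uses unit 1.
hidden_weights = [
    [2.0, 1.5, 0.0],
    [0.1, 0.0, 3.0],
]
output_weights = [1.0, 0.5]

scores = pairwise_interaction_strengths(hidden_weights, output_weights)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0], scores[ranked[0]])
```

In this toy example the pair (0, 1) ranks first because both inputs have large weights into the same hidden unit, which is the pattern the abstract describes.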

Saturday, May 06, 2017

Accessible Artificial Intelligence

We are developing an open-source and user-friendly AI system for machine learning analysis of data. We have posted a preprint on arXiv. This paper will be presented and published as part of the 2017 Genetic Programming Theory and Practice (GPTP) Workshop that will take place later this month. Our project was written up in a nice piece published by Motherboard.

Monday, May 01, 2017

Elected Fellow of the American Statistical Association

It is a great honor to be elected as a Fellow of the American Statistical Association. I have a Master's degree in statistics and have worked my entire career at the interface between stats and computer science. It means a lot to be recognized by my stats peers. Thanks!

Friday, April 28, 2017

Machine Learning Benchmark Data

We have posted a set of data that can be used as machine learning benchmarks. This repository contains the code and data for a large, curated set of benchmarks for evaluating supervised machine learning algorithms. These data sets cover a broad range of applications, and include binary and multi-class problems, as well as combinations of categorical, ordinal, and continuous features. There are no missing values in these data sets.

Wednesday, April 26, 2017

Improving the reproducibility of epistasis

Here is our new paper on using resampling methods to improve the reproducibility of epistasis results in genetic association studies. Email me for a PDF if you can't get through the Springer paywall.

Piette E.R., Moore J.H. (2017) Improving the Reproducibility of Genetic Association Results Using Genotype Resampling Methods. In: Squillero G., Sim K. (eds) Applications of Evolutionary Computation. EvoApplications 2017. Lecture Notes in Computer Science, vol 10199. Springer


Replication may be an inadequate gold standard for substantiating the significance of results from genome-wide association studies (GWAS). Successful replication provides evidence supporting true results and against spurious findings, but various population attributes contribute to observed significance of a genetic effect. We hypothesize that failure to replicate an interaction observed to be significant in a GWAS of one population in a second population is sometimes attributable to differences in minor allele frequencies, and resampling the replication dataset by genotype to match the minor allele frequencies of the discovery data can improve estimates of the interaction significance. We show via simulation that resampling of the replication data produced results more concordant with the discovery findings. We recommend that failure to replicate GWAS results should not immediately be considered to refute previously-observed findings and conversely that replication does not guarantee significance, and suggest that datasets be compared more critically in biological context.
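
The resampling idea can be sketched for a single biallelic SNP as follows. This is only an illustration and not the procedure from the paper: it assumes genotypes are coded as minor allele counts (0, 1, 2), that target genotype proportions follow Hardy-Weinberg equilibrium at the discovery minor allele frequency, and that individuals are drawn with replacement within each genotype class. The function names and the example data are hypothetical.

```python
import random


def minor_allele_frequency(genotypes):
    """Minor allele frequency for genotypes coded as minor allele counts."""
    return sum(genotypes) / (2 * len(genotypes))


def resample_to_match_maf(genotypes, target_maf, n_samples, seed=0):
    """Return indices of resampled individuals whose MAF approximates target_maf.

    Assumes Hardy-Weinberg proportions for the target (an assumption of this
    sketch) and samples with replacement within each genotype class.
    """
    rng = random.Random(seed)
    by_genotype = {g: [] for g in (0, 1, 2)}
    for index, genotype in enumerate(genotypes):
        by_genotype[genotype].append(index)

    p = target_maf
    target_freqs = {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}

    chosen = []
    for genotype, freq in target_freqs.items():
        count = round(n_samples * freq)
        if count == 0:
            continue
        if not by_genotype[genotype]:
            raise ValueError(f"no individuals with genotype {genotype} to sample")
        chosen.extend(rng.choices(by_genotype[genotype], k=count))
    return chosen


# Hypothetical replication data with MAF 0.25; the discovery MAF is taken to be 0.40.
replication = [0] * 60 + [1] * 30 + [2] * 10
indices = resample_to_match_maf(replication, target_maf=0.40, n_samples=100)
resampled = [replication[i] for i in indices]
print(minor_allele_frequency(replication), minor_allele_frequency(resampled))
```

Returning indices rather than genotypes lets the phenotype and the other SNPs of each selected individual be carried along, which is needed before re-testing an interaction.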

Wednesday, April 05, 2017

Version 0.7 of TPOT released on GitHub

Version 0.7 of our Tree-Based Pipeline Optimization Tool (TPOT) for automated machine learning is now available for download. New features include the ability to customize TPOT using a config file and the ability of TPOT to make use of multiple CPUs for parallel processing.
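
As a rough sketch of how these two features might be combined from Python, the snippet below defines a small custom configuration and passes it to the optimizer along with a request to use all CPUs. It assumes the `config_dict` and `n_jobs` arguments of `TPOTClassifier` as I understand them from the 0.x releases, so check the documentation for your installed version. The particular operators and parameter values are arbitrary choices for illustration.

```python
# Custom configuration: keys are import paths of operators TPOT may use,
# values map each hyperparameter to the candidate values to search over.
tpot_config = {
    "sklearn.naive_bayes.GaussianNB": {},
    "sklearn.tree.DecisionTreeClassifier": {
        "criterion": ["gini", "entropy"],
        "max_depth": list(range(1, 11)),
    },
}


def build_optimizer():
    """Create a TPOT optimizer restricted to the operators above.

    The import lives inside the function so the configuration can be
    inspected even where TPOT is not installed.
    """
    from tpot import TPOTClassifier

    return TPOTClassifier(
        generations=5,
        population_size=20,
        n_jobs=-1,  # assumed to mean "use all available CPUs"
        config_dict=tpot_config,
    )


print(sorted(tpot_config))
```

Calling `build_optimizer().fit(X_train, y_train)` would then start the search, with `X_train` and `y_train` standing in for your own data.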

Thursday, March 30, 2017

Variant Set Enrichment: an R package to identify disease-associated functional genomic regions

Variant Set Enrichment (VSE) is an R package to calculate the enrichment of a set of disease-associated variants across functionally annotated genomic regions, consequently highlighting the mechanisms important in the etiology of the disease studied.
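
The general idea of an enrichment calculation can be sketched in a few lines of Python. This is not the VSE package or its algorithm. It is a toy that assumes a single chromosome, counts how many disease-associated variants fall inside annotated regions, and compares that count to counts from null variant sets using a z-score. All names, positions, and null sets below are made up.

```python
from statistics import mean, stdev


def count_overlaps(positions, regions):
    """Count variant positions that fall inside any (start, end) region."""
    return sum(
        any(start <= pos <= end for start, end in regions) for pos in positions
    )


def enrichment_z_score(observed, null_counts):
    """Standardize the observed overlap against the null distribution."""
    return (observed - mean(null_counts)) / stdev(null_counts)


# Hypothetical annotated regions and variant positions on one chromosome.
regions = [(100, 200), (500, 600)]
disease_variants = [120, 150, 550, 900]
null_sets = [
    [10, 300, 700, 950],
    [110, 320, 710, 940],
    [20, 330, 520, 960],
    [30, 340, 720, 970],
]

observed = count_overlaps(disease_variants, regions)
null_counts = [count_overlaps(variants, regions) for variants in null_sets]
z = enrichment_z_score(observed, null_counts)
print(observed, null_counts, round(z, 2))
```

A real analysis would use many more null sets, matched to the disease-associated variants, and would handle multiple chromosomes and linkage disequilibrium, none of which this sketch attempts.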

Ahmed M, Sallari RC, Guo H, Moore JH, He HH, Lupien M. Variant Set Enrichment: an R package to identify disease-associated functional genomic regions. BioData Min. 2017 Feb 22;10:9. [PDF]

Saturday, February 25, 2017

Relief Based Algorithms in Python

We have released a Python package for carrying out ReliefF-based feature selection that can be used for epistasis analysis using machine learning methods. Our ReBATE package is on GitHub. We have also released a version of this code that is compatible with the scikit-learn machine learning library in Python. This is also available on GitHub.

For more information about ReliefF for epistasis analysis we recommend our book chapter on the subject.

Moore JH. Epistasis analysis using ReliefF. Methods Mol Biol. 2015;1253:315-25.
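
For readers new to this family of algorithms, here is a minimal sketch of the original Relief weight update in plain Python. It is not the ReBATE code or its API, and it uses a single nearest hit and nearest miss rather than the k neighbors of ReliefF. The toy data set is an assumption chosen so that the class depends only on the XOR of the first two features, a purely epistatic pattern, while the third feature is noise.

```python
def relief_weights(X, y):
    """Basic Relief for discrete features using Hamming distance.

    For each instance, features that differ from the nearest same-class
    neighbor (hit) are penalized and features that differ from the nearest
    other-class neighbor (miss) are rewarded.
    """
    n_instances = len(X)
    n_features = len(X[0])
    weights = [0.0] * n_features

    def distance(a, b):
        return sum(u != v for u, v in zip(a, b))

    for i, (xi, yi) in enumerate(zip(X, y)):
        hits = [x for j, (x, label) in enumerate(zip(X, y)) if j != i and label == yi]
        misses = [x for x, label in zip(X, y) if label != yi]
        nearest_hit = min(hits, key=lambda x: distance(xi, x))
        nearest_miss = min(misses, key=lambda x: distance(xi, x))
        for f in range(n_features):
            weights[f] -= (xi[f] != nearest_hit[f]) / n_instances
            weights[f] += (xi[f] != nearest_miss[f]) / n_instances
    return weights


# All combinations of three binary features; class = feature 0 XOR feature 1.
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [a ^ b for a, b, _ in X]

weights = relief_weights(X, y)
print(weights)
```

The noise feature ends up with the lowest weight even though neither interacting feature has a main effect on its own, which is why this family of methods is attractive for epistasis analysis.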

Saturday, January 14, 2017

Buffering mechanisms that protect an embryo’s development from detrimental effects of genetic variation

This news piece mentions a new article in Nature providing evidence for the buffering of mutations during development. Here is the citation and the abstract. A nice example of epistasis.

Cannavò E, Koelling N, Harnett D, Garfield D, Casale FP, Ciglar L, Gustafson HE, Viales RR, Marco-Ferreres R, Degner JF, Zhao B, Stegle O, Birney E, Furlong EE. Genetic variants regulating expression levels and isoform diversity during embryogenesis. Nature. 2016 Dec 26. doi: 10.1038/nature20802. [Epub ahead of print] PubMed PMID: 28024300.


Embryonic development is driven by tightly regulated patterns of gene expression, despite extensive genetic variation among individuals. Studies of expression quantitative trait loci (eQTL) indicate that genetic variation frequently alters gene expression in cell-culture models and differentiated tissues. However, the extent and types of genetic variation impacting embryonic gene expression, and their interactions with developmental programs, remain largely unknown. Here we assessed the effect of genetic variation on transcriptional (expression levels) and post-transcriptional (3' RNA processing) regulation across multiple stages of metazoan development, using 80 inbred Drosophila wild isolates, identifying thousands of developmental-stage-specific and shared QTL. Given the small blocks of linkage disequilibrium in Drosophila, we obtain near base-pair resolution, resolving causal mutations in developmental enhancers, validated transcription-factor-binding sites and RNA motifs. This fine-grain mapping uncovered extensive allelic interactions within enhancers that have opposite effects, thereby buffering their impact on enhancer activity. QTL affecting 3' RNA processing identify new functional motifs leading to transcript isoform diversity and changes in the lengths of 3' untranslated regions. These results highlight how developmental stage influences the effects of genetic variation and uncover multiple mechanisms that regulate and buffer expression variation during embryogenesis.

Thursday, January 12, 2017

Use of Information Measures and Their Approximations to Detect Predictive Gene-Gene Interaction

There is a neat paper that just appeared in the journal Entropy. The authors show how entropy-based methods can detect certain kinds of interactions that are not found with logistic regression. This builds on our previous work (e.g. Moore 2006, Hu 2013) introducing and expanding entropy as a useful metric for epistasis analysis in human genetics. We have recently reviewed these methods here. Others have reviewed these approaches here.
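
As a small illustration of the entropy-based view, the sketch below computes an interaction information gain for two attributes A and B with respect to a class C, defined here as I(A,B;C) - I(A;C) - I(B;C). That definition is my reading of the general approach and is not taken from the Entropy paper. The XOR data are a made-up example in which neither attribute is informative alone.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits of a sequence of discrete values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def interaction_information_gain(a, b, c):
    """Information about c gained from a and b jointly beyond their separate effects."""
    joint = list(zip(a, b))
    return mutual_information(joint, c) - mutual_information(a, c) - mutual_information(b, c)


# Class is the XOR of the two attributes: a pure interaction with no main effects.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
c = [0, 1, 1, 0]

print(mutual_information(a, c), mutual_information(b, c))
print(interaction_information_gain(a, b, c))
```

Here each attribute alone shares zero bits with the class, yet the pair carries one full bit, so the gain is positive. A model with only main effects would miss this pattern.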