Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Monday, July 24, 2017

Automated Machine Learning (AutoML)

I posted a new website providing general information about AutoML and our own PennAI and TPOT projects. Feel free to contact me if you have something you think should be listed.

Tuesday, June 20, 2017


In my role as Director of the Penn Institute for Biomedical Informatics, I am leading a project to develop an accessible artificial intelligence system called PennAI. More info can be found on the PennAI website, which we launched to provide updates and info about the method and software. We are looking forward to using this in my research lab for data science.

Sunday, June 18, 2017

Is GWAS a hoax?

There is a great new Cell paper out from Jonathan Pritchard's group making the point that nearly the entire genome is connected to genes that impact risk of common human diseases. The implication is that genome-wide association studies (GWAS) are mostly finding incidental variants that happen to be involved in gene regulation or other genomic processes that only indirectly impact disease and thus might not make good drug targets. The identification of drug targets was the 'new' reason for doing GWAS, replacing the old reason that focused on predicting risk, which we now know doesn't work so well. The Cell paper is important, but the point that it is really making is that disease risk is about systems and pathways rather than individual variants. This has been the focus of systems biology and complex adaptive systems all along. Human geneticists are perhaps finally waking up to this. I have written extensively about a complex systems approach to human genetics for 20 years, as partially documented in this blog. Here is a piece in Nature about this Cell paper and a thoughtful blog post by Dr. Ken Weiss from Penn State. Hopefully this is an indication that the univariate approach to human genetics is finally over.

Friday, June 02, 2017

Automated Machine Learning Competition

Today we launched a Kaggle-based competition for our tree-based pipeline optimization tool (TPOT) method for automated machine learning (AutoML). More information can be found here. Please spread the word!

Friday, May 26, 2017

The NIH rule of 21

I have worked my ass off my entire career publishing nearly 500 scientific papers (H-index = 73), training more than 50 undergraduate researchers, graduating 20 PhD students, training numerous postdocs, employing dozens of technical staff, mentoring dozens of faculty, and providing extensive service to the NIH as a good citizen. Despite all of this, Francis Collins and the NIH want to take grants away from me because I have not been productive enough. By all accounts this seems very likely. As you can tell I am quite steamed about this. I am not sure I will ever be able to forgive them should this come to pass [UPDATE: this policy was abandoned - see link below]. I have helped my institution respond to this. Not sure it will make a difference.

Here is the announcement from the NIH.

Here is a report on changes in response to the concerns of the research community.

Here is the announcement from the NIH about their plans to back away from this policy.

Here is information about the NIH Next Generation Researcher Initiative that seems like a much more sensible solution to the problem.

Tuesday, May 16, 2017

Detecting Statistical Interactions from Neural Network Weights

A new preprint on detecting interactions from deep learning models:

Detecting Statistical Interactions from Neural Network Weights 

Michael Tsang, Dehua Cheng, Yan Liu
University of Southern California 
May 16, 2017 


Interpreting deep neural networks can enable new applications for predictive modeling where both accuracy and interpretability are required. In this paper, we examine the underlying structure of a deep neural network to interpret the statistical interactions it captures. Our key observation is that any input features that interact with each other must follow strongly weighted connections to a common hidden unit before the final output. We propose a novel framework for detecting feature interactions of arbitrary order by interpreting neural network weights. Our framework, which we call Neural Interaction Detector (NID), accurately identifies meaningful interactions without an exhaustive search on an exponential solution space of interaction candidates. Empirical evaluation on both synthetic and real-world data showed the effectiveness of NID, which can uncover interactions omitted by other methods in orders of magnitude less time.
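
To make the key observation in the abstract concrete, here is a minimal Python sketch for a network with a single hidden layer. It is a simplified illustration, not the authors' NID procedure: the function names, the use of the minimum absolute weight to score a feature pair at each hidden unit, and the scaling by the magnitude of the hidden-to-output weight are all assumptions made for this example.

```python
from itertools import combinations


def pairwise_interaction_scores(W, z):
    """Score feature pairs by their shared strong connections to hidden units.

    W[h][i] is the weight from input feature i to hidden unit h.
    z[h] is the weight from hidden unit h to the output.
    Returns a dict mapping (i, j) to a score summed over hidden units.
    (Hypothetical helper; the scoring rule is an assumption.)
    """
    n_features = len(W[0])
    scores = {pair: 0.0 for pair in combinations(range(n_features), 2)}
    for w_h, z_h in zip(W, z):
        for i, j in scores:
            # A pair is only as strongly tied to this unit as its weaker member,
            # so both features need a large weight into the same hidden unit.
            scores[(i, j)] += abs(z_h) * min(abs(w_h[i]), abs(w_h[j]))
    return scores


def rank_interactions(W, z):
    """Return feature pairs ordered from strongest to weakest score."""
    scores = pairwise_interaction_scores(W, z)
    return sorted(scores, key=scores.get, reverse=True)


# Toy weights: features 0 and 1 both feed strongly into hidden unit 0.
W = [[2.0, 1.5, 0.0],
     [0.0, 0.1, 3.0]]
z = [1.0, -0.5]
print(rank_interactions(W, z))
```

In this toy example the pair (0, 1) ranks first because both features have large weights into the same hidden unit, while feature 2 only shares a weakly weighted unit with feature 1. No exhaustive search over candidate interactions is needed; the ranking is read off the weights.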

Saturday, May 06, 2017

Accessible Artificial Intelligence

We are developing an open-source and user-friendly AI system for machine learning analysis of data. We call this PennAI. We have posted a preprint on arXiv. This paper will be presented and published as part of the 2017 Genetic Programming Theory and Practice Workshop (GPTP) that will take place later this month. Our project was written up in a nice piece published by Motherboard.

Monday, May 01, 2017

Elected Fellow of the American Statistical Association

It is a great honor to be elected as a Fellow of the American Statistical Association. I have a Master's degree in statistics and have worked my entire career at the interface between stats and computer science. It means a lot to be recognized by my stats peers. Thanks!

Here is a list of all 62 newly elected fellows for 2017.

Friday, April 28, 2017

Machine Learning Benchmark Data

We have posted a set of data that can be used as machine learning benchmarks. This repository contains the code and data for a large, curated set of benchmarks for evaluating supervised machine learning algorithms. These data sets cover a broad range of applications, and include binary and multi-class problems, as well as combinations of categorical, ordinal, and continuous features. There are no missing values in these data sets.

Wednesday, April 26, 2017

Improving the reproducibility of epistasis

Our new paper on using resampling methods to improve the reproducibility of epistasis results in genetic association studies is out. Email me for a PDF if you can't get through the Springer paywall.

Piette E.R., Moore J.H. (2017) Improving the Reproducibility of Genetic Association Results Using Genotype Resampling Methods. In: Squillero G., Sim K. (eds) Applications of Evolutionary Computation. EvoApplications 2017. Lecture Notes in Computer Science, vol 10199. Springer


Replication may be an inadequate gold standard for substantiating the significance of results from genome-wide association studies (GWAS). Successful replication provides evidence supporting true results and against spurious findings, but various population attributes contribute to observed significance of a genetic effect. We hypothesize that failure to replicate an interaction observed to be significant in a GWAS of one population in a second population is sometimes attributable to differences in minor allele frequencies, and resampling the replication dataset by genotype to match the minor allele frequencies of the discovery data can improve estimates of the interaction significance. We show via simulation that resampling of the replication data produced results more concordant with the discovery findings. We recommend that failure to replicate GWAS results should not immediately be considered to refute previously-observed findings and conversely that replication does not guarantee significance, and suggest that datasets be compared more critically in biological context.
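
The idea of resampling a replication dataset by genotype can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method from the paper: it handles a single variant rather than an interacting pair, codes genotypes as 0, 1, or 2 copies of the minor allele, derives target genotype frequencies from the discovery minor allele frequency using Hardy-Weinberg proportions, and samples with replacement. All function names are hypothetical.

```python
import random
from collections import Counter


def minor_allele_frequency(genotypes):
    """Genotypes are coded as the number of minor alleles (0, 1, or 2)."""
    return sum(genotypes) / (2 * len(genotypes))


def hwe_genotype_freqs(maf):
    """Expected genotype frequencies under Hardy-Weinberg (an assumption here)."""
    q = maf
    p = 1.0 - q
    return {0: p * p, 1: 2 * p * q, 2: q * q}


def resample_to_target_maf(genotypes, target_maf, n, seed=0):
    """Return indices of n individuals drawn with replacement so that the
    resampled genotype frequencies approximate those implied by target_maf.

    If a genotype class is absent from the data, its share of the target
    cannot be drawn and the remaining classes absorb it.
    """
    rng = random.Random(seed)
    target = hwe_genotype_freqs(target_maf)
    observed = Counter(genotypes)
    # Each individual's weight spreads its class's target mass evenly
    # across all individuals carrying that genotype.
    weights = [target[g] / observed[g] for g in genotypes]
    return rng.choices(range(len(genotypes)), weights=weights, k=n)


# Toy replication data with MAF = 0.10, resampled toward a discovery MAF of 0.30.
replication = [0] * 810 + [1] * 180 + [2] * 10
idx = resample_to_target_maf(replication, target_maf=0.30, n=20000, seed=1)
resampled = [replication[i] for i in idx]
print(minor_allele_frequency(replication), minor_allele_frequency(resampled))
```

The returned indices, rather than genotypes, are what a real analysis would need, since the phenotype and other variants for each resampled individual have to come along before the interaction is re-tested.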