Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Saturday, January 13, 2018

News piece on Gene Medic

Here is a news piece on my new Atari 2600 game Gene Medic that appeared in the Daily Pennsylvanian.

Monday, January 01, 2018

Gene Medic - a retro edutainment game for the Atari 2600

I am pleased to announce the release of my new retro edutainment game of genome medicine for the Atari 2600 video computer system (VCS). The game is called Gene Medic and the goal is to edit a patient's mutations to restore health. You can find information about the game along with the binary and source code here.

Saturday, December 02, 2017

Relief-Based Feature Selection Methods

We have made multiple improvements to Relief-based methods for feature selection. The power of these approaches is that they are capable of detecting non-additive interactions without exhaustively searching combinations of features. We have posted two new papers on arXiv documenting our latest work in this area. The first paper is a review while the second presents some new results. The code for these approaches can be found on GitHub.
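To give a feel for why this works, here is a minimal sketch of the original Relief idea in its simplest form. It is not the code from our GitHub repository, and the toy data, the function names (`hamming`, `relief`), and the Hamming distance are assumptions made for illustration. The class label is the XOR of the first two features, so neither feature is informative on its own, and the third feature is pure noise.

```python
from itertools import product


def hamming(x, y):
    """Number of features on which two instances differ."""
    return sum(a != b for a, b in zip(x, y))


def relief(X, y):
    """Basic Relief for discrete features and a binary class label.

    For every instance, find its nearest neighbor with the same class
    (the hit) and with the other class (the miss). A feature is penalized
    when it differs from the hit and rewarded when it differs from the miss.
    """
    n = len(X)
    weights = [0.0] * len(X[0])
    for i, (xi, yi) in enumerate(zip(X, y)):
        others = [j for j in range(n) if j != i]
        hit = min((j for j in others if y[j] == yi),
                  key=lambda j: hamming(xi, X[j]))
        miss = min((j for j in others if y[j] != yi),
                   key=lambda j: hamming(xi, X[j]))
        for f in range(len(weights)):
            weights[f] -= (xi[f] != X[hit][f]) / n
            weights[f] += (xi[f] != X[miss][f]) / n
    return weights


# Toy data (illustrative): every combination of three binary features.
X = list(product([0, 1], repeat=3))
# Class is the XOR of features 0 and 1; feature 2 is irrelevant.
y = [a ^ b for a, b, _ in X]

weights = relief(X, y)
print(weights)
```

The two interacting features end up with higher weights than the noise feature, even though no pair of features is ever evaluated jointly. The neighbor search does the work: the nearest miss tends to differ on a feature that matters for the class, whatever the form of the interaction.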

Thursday, November 09, 2017

We are hiring postdocs

We are looking to hire 2-3 postdocs in 2018. Projects include automated machine learning (AutoML) and artificial intelligence methods for the analysis of biomedical data. Email me if interested.

For more information see http://automl.info, http://pennai.org, and http://epistasis.org.

Wednesday, October 25, 2017

50% of GWAS hits for breast cancer fail to replicate

A new paper in Nature reports 65 new loci identified using genome-wide association studies in a multi-site sample of more than 100,000 subjects. Some of these loci look interesting and will likely yield some new insights into breast cancer. However, there is one sentence in this paper that I think deserves more discussion:

"Of the 102 loci that have previously been associated with breast cancer in Europeans, 49 showed evidence of association with breast cancer in the OncoArray dataset at p < 5 × 10^-8."

Fewer than half of the previous hits (49 of 102) replicated at the genome-wide significance level. I am surprised that the paper does not address this substantial lack of replication in any detail. Relaxing the significance threshold to 0.05 yields a much higher replication rate.
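For anyone who wants to check the arithmetic, here is a small sketch that uses only the two counts from the quoted sentence. The function name `replication_rate` is made up for illustration.

```python
def replication_rate(replicated, total):
    """Fraction of previously reported loci that replicated."""
    return replicated / total


# Counts taken from the quoted sentence: 49 of 102 previously reported loci.
rate = replication_rate(49, 102)
print(f"Replicated at genome-wide significance: {rate:.1%}")  # 48.0%
print(f"Did not replicate at that threshold: {1 - rate:.1%}")  # 52.0%
```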

The replicability of GWAS hits in breast cancer would make a great discussion topic for students.

Friday, October 20, 2017

Incorporation of Biological Knowledge Into the Study of Gene-Environment Interactions

A nice review on GxE interactions.

Ritchie MD, Davis JR, Aschard H, Battle A, Conti D, Du M, Eskin E, Fallin MD, Hsu L, Kraft P, Moore JH, Pierce BL, Bien SA, Thomas DC, Wei P, Montgomery SB. Incorporation of Biological Knowledge Into the Study of Gene-Environment Interactions. Am J Epidemiol. 2017 Oct 1;186(7):771-777. [PubMed]


A growing knowledge base of genetic and environmental information has greatly enabled the study of disease risk factors. However, the computational complexity and statistical burden of testing all variants by all environments has required novel study designs and hypothesis-driven approaches. We discuss how incorporating biological knowledge from model organisms, functional genomics, and integrative approaches can empower the discovery of novel gene-environment interactions and discuss specific methodological considerations with each approach. We consider specific examples where the application of these approaches has uncovered effects of gene-environment interactions relevant to drug response and immunity, and we highlight how such improvements enable a greater understanding of the pathogenesis of disease and the realization of precision medicine.

Monday, September 11, 2017

Data-driven Advice for Applying Machine Learning to Bioinformatics Problems

As the bioinformatics field grows, it must keep pace not only with new data but with new algorithms. In this arXiv paper, we contribute a thorough analysis of 13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. We present a number of statistical and visual comparisons of algorithm performance and quantify the effect of model selection and algorithm tuning for each algorithm and dataset. The analysis culminates in the recommendation of five algorithms with hyperparameters that maximize classifier performance across the tested problems, as well as general guidelines for applying machine learning to supervised classification problems.

Saturday, August 26, 2017

Reproducibility of research results in the context of a complex system

I have always believed that it is unrealistic to expect research results to replicate across studies when the underlying biology is complex. This is a great piece in Nature that highlights this very point within the context of C. elegans. I highly recommend that anyone interested in results replication and research reproducibility read it.

Monday, July 24, 2017

Automated Machine Learning (AutoML)

I posted a new website providing general information about AutoML and our own PennAI and TPOT projects. Feel free to contact me if you have something you think should be listed.

Tuesday, June 20, 2017


In my role as Director of the Penn Institute for Biomedical Informatics, I am leading a project to develop an accessible artificial intelligence system called PennAI. More information can be found on the PennAI website, which was launched to provide updates about the method and software. I am looking forward to using this in my research lab for data science.