Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Friday, April 28, 2017

Machine Learning Benchmark Data

We have posted a set of data that can be used as machine learning benchmarks. The repository contains the code and data for a large, curated set of benchmarks for evaluating supervised machine learning algorithms. These data sets cover a broad range of applications and include binary and multi-class problems, as well as combinations of categorical, ordinal, and continuous features. There are no missing values in these data sets.

Wednesday, April 26, 2017

Improving the reproducibility of epistasis

Our new paper describes using resampling methods to improve the reproducibility of epistasis results in genetic association studies. Email me for a PDF if you can't get through the Springer paywall.

Piette E.R., Moore J.H. (2017) Improving the Reproducibility of Genetic Association Results Using Genotype Resampling Methods. In: Squillero G., Sim K. (eds) Applications of Evolutionary Computation. EvoApplications 2017. Lecture Notes in Computer Science, vol 10199. Springer.


Replication may be an inadequate gold standard for substantiating the significance of results from genome-wide association studies (GWAS). Successful replication provides evidence supporting true results and against spurious findings, but various population attributes contribute to observed significance of a genetic effect. We hypothesize that failure to replicate an interaction observed to be significant in a GWAS of one population in a second population is sometimes attributable to differences in minor allele frequencies, and resampling the replication dataset by genotype to match the minor allele frequencies of the discovery data can improve estimates of the interaction significance. We show via simulation that resampling of the replication data produced results more concordant with the discovery findings. We recommend that failure to replicate GWAS results should not immediately be considered to refute previously-observed findings and conversely that replication does not guarantee significance, and suggest that datasets be compared more critically in biological context.
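The abstract does not spell out the resampling procedure, so the following is only a rough sketch of the general idea and not the method from the paper. It assumes a single biallelic SNP with genotypes coded as minor-allele counts (0, 1, 2), and it assumes the target genotype proportions follow Hardy-Weinberg expectations for the discovery minor allele frequency. The function names (resample_to_maf and friends) are hypothetical, and the sketch handles one locus only, whereas the paper is concerned with interactions.

```python
import random
from collections import defaultdict


def minor_allele_frequency(samples):
    """MAF of a list of (genotype, phenotype) pairs; genotype = minor-allele count."""
    return sum(genotype for genotype, _ in samples) / (2 * len(samples))


def hwe_genotype_freqs(maf):
    """Expected genotype frequencies under Hardy-Weinberg (an assumption here)."""
    major = 1.0 - maf
    return {0: major * major, 1: 2 * major * maf, 2: maf * maf}


def resample_to_maf(samples, target_maf, n_out, seed=0):
    """Resample with replacement, by genotype class, to approximate target_maf.

    Illustrative only: n_out should be reasonably large so that rounding the
    per-genotype counts does not distort the resulting frequency.
    """
    if not 0.0 <= target_maf <= 0.5:
        raise ValueError("target_maf must lie in [0, 0.5]")
    rng = random.Random(seed)

    # Group the replication samples by genotype class.
    by_genotype = defaultdict(list)
    for sample in samples:
        by_genotype[sample[0]].append(sample)

    # How many individuals to draw from each genotype class.
    freqs = hwe_genotype_freqs(target_maf)
    counts = {g: round(freqs[g] * n_out) for g in (1, 2)}
    counts[0] = n_out - counts[1] - counts[2]  # remainder goes to genotype 0

    resampled = []
    for genotype in (0, 1, 2):
        if counts[genotype] <= 0:
            continue
        if not by_genotype[genotype]:
            raise ValueError(f"no samples with genotype {genotype} to draw from")
        resampled.extend(rng.choices(by_genotype[genotype], k=counts[genotype]))
    rng.shuffle(resampled)
    return resampled


# Toy replication data with a lower MAF than the (assumed) discovery MAF of 0.3.
replication = [(0, 0)] * 81 + [(1, 1)] * 18 + [(2, 1)] * 1
matched = resample_to_maf(replication, target_maf=0.3, n_out=1000)
print(minor_allele_frequency(replication), minor_allele_frequency(matched))
```

Sampling with replacement within each genotype class keeps every individual's phenotype attached to its genotype, so only the genotype mix changes. Rare genotype classes get drawn many times over, which is one reason results from a sketch like this should be read cautiously.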

Wednesday, April 05, 2017

Version 0.7 of TPOT released on GitHub

Version 0.7 of our Tree-Based Pipeline Optimization Tool (TPOT) for automated machine learning is now available for download. New features include the ability to customize TPOT using a config file and the ability of TPOT to make use of multiple CPUs for parallel processing.
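As a rough illustration of what a custom configuration might look like, here is a small sketch. It assumes the dictionary format described in the TPOT documentation, where keys are import paths of scikit-learn operators and values list the hyperparameter settings TPOT may explore, and it assumes the config_dict and n_jobs parameters of TPOTClassifier. The particular operators and ranges below are made up for the example, so check the documentation for the version you have installed.

```python
# Hypothetical custom configuration for illustration only.
# Keys: import paths of operators TPOT is allowed to use.
# Values: hyperparameter names mapped to the candidate values to search.
tpot_config = {
    'sklearn.naive_bayes.GaussianNB': {},
    'sklearn.tree.DecisionTreeClassifier': {
        'criterion': ['gini', 'entropy'],
        'max_depth': range(1, 11),
        'min_samples_split': range(2, 21),
    },
    'sklearn.neighbors.KNeighborsClassifier': {
        'n_neighbors': range(1, 101),
        'weights': ['uniform', 'distance'],
        'p': [1, 2],
    },
}

# Assumed usage (requires TPOT and scikit-learn; X_train and y_train are
# placeholders for your own data):
#
# from tpot import TPOTClassifier
# tpot = TPOTClassifier(generations=5, population_size=20,
#                       config_dict=tpot_config,  # restrict the search space
#                       n_jobs=4)                 # evaluate pipelines on 4 CPUs
# tpot.fit(X_train, y_train)
```

Restricting the search to a few operators is mostly useful when you already know which model families are sensible for your data, or when you want runs to finish faster.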