Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Sunday, February 18, 2018

Effect of genetic architecture on the prediction accuracy of quantitative traits in samples of unrelated individuals

Nice paper from Trudy Mackay et al. I had the pleasure of talking to her about this paper at the last EDGE workshop.

Morgante F, Huang W, Maltecca C, Mackay TFC. Effect of genetic architecture on the prediction accuracy of quantitative traits in samples of unrelated individuals. Heredity (Edinb). 2018 [PubMed]


Predicting complex phenotypes from genomic data is a fundamental aim of animal and plant breeding, where we wish to predict genetic merits of selection candidates; and of human genetics, where we wish to predict disease risk. While genomic prediction models work well with populations of related individuals and high linkage disequilibrium (LD) (e.g., livestock), comparable models perform poorly for populations of unrelated individuals and low LD (e.g., humans). We hypothesized that low prediction accuracies in the latter situation may occur when the genetic architecture of the trait departs from the infinitesimal and additive architecture assumed by most prediction models. We used simulated data for 10,000 lines based on sequence data from a population of unrelated, inbred Drosophila melanogaster lines to evaluate this hypothesis. We show that, even in very simplified scenarios meant as a stress test of the commonly used Genomic Best Linear Unbiased Predictor (G-BLUP) method, using all common variants yields low prediction accuracy regardless of the trait genetic architecture. However, prediction accuracy increases when predictions are informed by the genetic architecture inferred from mapping the top variants affecting main effects and interactions in the training data, provided there is sufficient power for mapping. When the true genetic architecture is largely or partially due to epistatic interactions, the additive model may not perform well, while models that account explicitly for interactions generally increase prediction accuracy. Our results indicate that accounting for genetic architecture can improve prediction accuracy for quantitative traits.
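For readers who have not worked with G-BLUP, here is a minimal sketch of the additive predictor the abstract stress-tests. It is not the authors' code. The function names, the 0/1/2 genotype coding, the toy simulation, and the fixed heritability h2 are all assumptions made for illustration; a real analysis would estimate the variance components (for example by REML) rather than fix them.

```python
import numpy as np

def genomic_relationship(M):
    """Genomic relationship matrix (VanRaden-style) from an
    individuals x variants matrix of allele counts coded 0/1/2 (assumed coding)."""
    M = np.asarray(M, dtype=float)
    p = M.mean(axis=0) / 2.0                  # allele frequency of each variant
    Z = M - 2.0 * p                           # centre each variant at zero
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

def gblup_predict(G, y_train, train_idx, test_idx, h2=0.5):
    """Predict phenotypes of test individuals from training phenotypes.
    h2 is an assumed heritability, not an estimated one."""
    lam = (1.0 - h2) / h2                     # residual / genetic variance ratio
    G_tt = G[np.ix_(train_idx, train_idx)]    # training x training relationships
    G_st = G[np.ix_(test_idx, train_idx)]     # test x training relationships
    mu = y_train.mean()
    alpha = np.linalg.solve(G_tt + lam * np.eye(len(train_idx)), y_train - mu)
    return mu + G_st @ alpha

def simulate(n=200, m=500, h2=0.5, seed=0):
    """Toy data: unrelated individuals, purely additive trait (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    M = rng.binomial(2, 0.3, size=(n, m)).astype(float)
    g = M @ rng.normal(size=m)                # additive genetic values
    g = (g - g.mean()) / g.std()              # scale genetic variance to 1
    e = rng.normal(scale=np.sqrt((1.0 - h2) / h2), size=n)
    return M, g + e

if __name__ == "__main__":
    M, y = simulate()
    G = genomic_relationship(M)
    train, test = np.arange(150), np.arange(150, 200)
    y_hat = gblup_predict(G, y[train], train, test)
    print("prediction accuracy (r):", np.corrcoef(y_hat, y[test])[0, 1])
```

The model has no interaction terms at all, which is why the paper asks what happens when the true architecture is epistatic.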

Saturday, January 13, 2018

News piece on Gene Medic

Here is a news piece on my new Atari 2600 game Gene Medic that appeared in the Daily Pennsylvanian.

Thursday, January 11, 2018

A heuristic method for simulating open-data of arbitrary complexity that can be used to compare and evaluate machine learning methods

A new version of our HIBACHI approach for simulating more realistic data.

Moore JH, Shestov M, Schmitt P, Olson RS. A heuristic method for simulating open-data of arbitrary complexity that can be used to compare and evaluate machine learning methods. Pac Symp Biocomput. 2018;23:259-267. [PDF]

A central challenge of developing and evaluating artificial intelligence and machine learning methods for regression and classification is access to data that illuminates the strengths and weaknesses of different methods. Open data plays an important role in this process by making it easy for computational researchers to access real data for this purpose. Genomics has in some examples taken a leading role in the open data effort starting with DNA microarrays. While real data from experimental and observational studies is necessary for developing computational methods, it is not sufficient. This is because it is not possible to know what the ground truth is in real data. This must be accompanied by simulated data where the balance between signal and noise is known and can be directly evaluated. Unfortunately, there is a lack of methods and software for simulating data with the kind of complexity found in real biological and biomedical systems. We present here the Heuristic Identification of Biological Architectures for simulating Complex Hierarchical Interactions (HIBACHI) method and prototype software for simulating complex biological and biomedical data. Further, we introduce new methods for developing simulation models that generate data that specifically allows discrimination between different machine learning methods.

Wednesday, January 10, 2018

Leveraging putative enhancer-promoter interactions to investigate two-way epistasis in Type 2 Diabetes GWAS

We presented this paper at the 2018 Pacific Symposium on Biocomputing. This is an effort to incorporate functional genomics annotations into epistasis analysis in regulatory regions.

Manduchi E, Chesi A, Hall MA, Grant SFA, Moore JH. Leveraging putative enhancer-promoter interactions to investigate two-way epistasis in Type 2 Diabetes GWAS. Pac Symp Biocomput. 2018;23:548-558. [PDF]

We utilized evidence for enhancer-promoter interactions from functional genomics data in order to build biological filters to narrow down the search space for two-way Single Nucleotide Polymorphism (SNP) interactions in Type 2 Diabetes (T2D) Genome Wide Association Studies (GWAS). This has led us to the identification of a reproducible statistically significant SNP pair associated with T2D. As more functional genomics data are being generated that can help identify potentially interacting enhancer-promoter pairs in larger collections of tissues/cells, this approach has implications for investigation of epistasis from GWAS in general.

Monday, January 01, 2018

Gene Medic - a retro edutainment game for the Atari 2600

I am pleased to announce the release of my new retro edutainment game of genome medicine for the Atari 2600 video computer system (VCS). The game is called Gene Medic and the goal is to edit a patient's mutations to restore health. You can find information about the game along with the binary and source code here.

Wednesday, December 20, 2017

PMLB: a large benchmark suite for machine learning evaluation and comparison

The paper describing our machine learning benchmark data has been published.

Olson RS, La Cava W, Orzechowski P, Urbanowicz RJ, Moore JH. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Min. 2017 Dec 11;10:36. [PDF]

BACKGROUND: The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. As such, selecting and curating specific benchmarks remains an unnecessary burden on machine learning practitioners and data scientists.

RESULTS: The present study introduces an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies. We compare meta-features among the current set of benchmark datasets in this resource to characterize the diversity of available data. Finally, we apply a number of established machine learning methods to the entire benchmark suite and analyze how datasets and algorithms cluster in terms of performance. From this study, we find that existing benchmarks lack the diversity to properly benchmark machine learning algorithms, and there are several gaps in benchmarking problems that still need to be considered.

CONCLUSIONS: This work represents another important step towards understanding the limitations of popular benchmarking suites and developing a resource that connects existing benchmarking standards to more diverse and efficient standards in the future.

Saturday, December 02, 2017

Relief-Based Feature Selection Methods

We have made multiple improvements to Relief-based methods for feature selection. The power of these approaches is that they are capable of detecting non-additive interactions without a combinatorial algorithm. We have posted two new papers on arXiv documenting our latest work in this area. The first paper is a review while the second presents some new results. The code for these approaches can be found on GitHub.
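To show why a Relief-style score can pick up an interaction without enumerating feature pairs, here is a minimal sketch of the original Relief idea (nearest hit and nearest miss) on a toy XOR problem. It is not the code from our GitHub repository and not one of the newer variants discussed in the papers. The function name, the Manhattan distance, and the toy data are assumptions made for illustration, and the sketch assumes a binary outcome with at least two samples per class.

```python
import numpy as np

def relief_scores(X, y, n_iter=None, seed=0):
    """Basic Relief feature weights for a binary outcome (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, m = X.shape
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0                     # avoid dividing by zero for constant features
    Xs = (X - lo) / span                      # scale every feature to [0, 1]
    n_iter = n if n_iter is None else n_iter
    w = np.zeros(m)
    for i in rng.choice(n, size=n_iter, replace=n_iter > n):
        dist = np.abs(Xs - Xs[i]).sum(axis=1)  # Manhattan distance to every sample
        dist[i] = np.inf                       # never pick the sample itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class neighbour
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class neighbour
        # Reward features that differ from the miss, penalise those that differ from the hit.
        w += (np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])) / n_iter
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.integers(0, 2, size=(400, 5))     # five binary features
    y = X[:, 0] ^ X[:, 1]                     # outcome is a pure XOR of features 0 and 1
    for j, score in enumerate(relief_scores(X, y)):
        print(f"feature {j}: {score:+.3f}")
```

Neither feature 0 nor feature 1 has a marginal effect in this toy data, yet each is scored using only one nearest hit and one nearest miss per sampled instance. No pair of features is ever tested explicitly.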

Thursday, November 09, 2017

We are hiring postdocs

We are looking to hire 2-3 postdocs in 2018. Projects include automated machine learning (AutoML) and artificial intelligence methods for the analysis of biomedical data. Email me if interested.

For more information see http://automl.info, http://pennai.org, and http://epistasis.org.

Wednesday, October 25, 2017

50% of GWAS hits for breast cancer fail to replicate

A new paper in Nature reports 65 new loci identified using genome-wide association studies in a multi-site sample of more than 100,000 subjects. Some of these loci look interesting and will likely yield some new insights into breast cancer. However, there is one sentence in this paper that I think deserves more discussion:

"Of the 102 loci that have previously been associated with breast cancer in Europeans, 49 showed evidence of association with breast cancer in the OncoArray dataset at p < 5 × 10^-8."

Fewer than half of the previous hits replicated at a genome-wide significance level. I am surprised that this paper doesn't address this substantial lack of replication in any detail. Dropping the significance level to 0.05 yields a much higher replication rate.
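The arithmetic behind that statement is simple enough to check directly. The sketch below uses only the two counts from the quoted sentence; the function name is made up for illustration.

```python
def replication_summary(previous_loci, replicated):
    """Return (fraction replicated, fraction not replicated)."""
    rate = replicated / previous_loci
    return rate, 1.0 - rate

if __name__ == "__main__":
    # Counts taken from the quoted sentence: 49 of 102 loci reached p < 5e-8.
    rate, missed = replication_summary(102, 49)
    print(f"replicated at genome-wide significance: {rate:.1%}")   # 48.0%
    print(f"not replicated at that threshold: {missed:.1%}")       # 52.0%
```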

The replicability of GWAS hits in breast cancer would make a great discussion topic for students.

Friday, October 20, 2017

Incorporation of Biological Knowledge Into the Study of Gene-Environment Interactions

A nice review on GxE interactions.

Ritchie MD, Davis JR, Aschard H, Battle A, Conti D, Du M, Eskin E, Fallin MD, Hsu L, Kraft P, Moore JH, Pierce BL, Bien SA, Thomas DC, Wei P, Montgomery SB. Incorporation of Biological Knowledge Into the Study of Gene-Environment Interactions. Am J Epidemiol. 2017 Oct 1;186(7):771-777. [PubMed]


A growing knowledge base of genetic and environmental information has greatly enabled the study of disease risk factors. However, the computational complexity and statistical burden of testing all variants by all environments has required novel study designs and hypothesis-driven approaches. We discuss how incorporating biological knowledge from model organisms, functional genomics, and integrative approaches can empower the discovery of novel gene-environment interactions and discuss specific methodological considerations with each approach. We consider specific examples where the application of these approaches has uncovered effects of gene-environment interactions relevant to drug response and immunity, and we highlight how such improvements enable a greater understanding of the pathogenesis of disease and the realization of precision medicine.