Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Monday, December 30, 2019

Getting started with TPOT for automated machine learning

A great post from Dr. Trang Le on how to get started with automated machine learning (AutoML) with our Tree-Based Pipeline Optimization Tool (TPOT) in Python.

Thursday, December 26, 2019

The Human Pancreas Analysis Program (HPAP)

We are assisting with bioinformatics support for the Human Pancreas Analysis Program (HPAP), which consists of two interlocking, collaborative projects at three institutions. The projects seek to provide comprehensive molecular profiling, in unprecedented detail, of the pancreatic islet at various stages of type 1 diabetes (T1D) pathogenesis: pre-diabetic (positive islet autoantibodies), recent onset, and T1D of durations less than 10 years.
In the past decade, there have been dramatic advances in our ability to phenotype and molecularly profile human cells and tissues. HIRN-HPAP will develop and apply these new technologies to study cells and tissues relevant to the beta cell loss in T1D with unprecedented resolution, including at the genomic, epigenomic, protein, and functional levels. Here we will employ state-of-the-art technologies to determine all aspects of pancreatic islet cell and immune cell biology as it pertains to the pathogenesis of type 1 diabetes. We will profile both the endocrine and immune systems with multiple modalities, and make the vast data accumulated available through the highly accessible PANC-DB, which will be developed through the project. These extensive and high-quality datasets will be made available to the HIRN and the diabetes research community at large for further discovery.

Wednesday, November 20, 2019

Automated machine learning analysis of metabolomics data

We have expanded our TPOT automated machine learning method (AutoML) to metabolomics data.

Orlenko A, Kofink D, Lyytikäinen LP, Nikus K, Mishra P, Kuukasjärvi P, Karhunen PJ, Kähönen M, Laurikka JO, Lehtimäki T, Asselbergs FW, Moore JH. Model selection for metabolomics: predicting diagnosis of coronary artery disease using automated machine learning (AutoML). Bioinformatics. 2019 Nov 8. [PubMed]


Selecting the optimal machine learning (ML) model for a given dataset is often challenging. Automated ML (AutoML) has emerged as a powerful tool for enabling the automatic selection of ML methods and parameter settings for the prediction of biomedical endpoints. Here, we apply the tree-based pipeline optimization tool (TPOT) to predict angiographic diagnoses of coronary artery disease (CAD). With TPOT, ML models are represented as expression trees, and optimal pipelines are discovered using a stochastic search method called genetic programming. We provide some guidelines for TPOT-based ML pipeline selection and optimization based on various clinical phenotypes and high-throughput metabolic profiles in the Angiography and Genes Study (ANGES).

We analyzed nuclear magnetic resonance (NMR)-derived lipoprotein and metabolite profiles in the ANGES cohort with the goal of identifying the role of non-obstructive CAD patients in CAD diagnostics. We compared TPOT-generated ML pipelines with selected ML classifiers, optimized with a grid search approach, on two phenotypic CAD profiles. TPOT generated ML pipelines that outperformed grid-search-optimized models across multiple performance metrics, including balanced accuracy and area under the precision-recall curve. With the selected models, we demonstrated that the phenotypic profile that distinguishes non-obstructive CAD patients from patients without CAD is associated with higher precision, suggesting a discrepancy in the underlying processes between these phenotypes.
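
For readers unfamiliar with balanced accuracy, one of the metrics named above, here is a minimal sketch of how it can be computed: it is the mean of the per-class recalls, so a model cannot score well simply by favoring the majority class. The function name and the toy labels are illustrative assumptions and are not taken from the paper or from TPOT.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (illustrative helper, not the paper's code)."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # Number of samples whose true label is c
        n_c = sum(1 for t in y_true if t == c)
        # Number of those samples that were predicted as c
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / n_c)
    return sum(recalls) / len(recalls)


# Toy, imbalanced labels (hypothetical): 8 cases, 2 controls
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain_accuracy, balanced)
```

On these toy labels the plain accuracy looks high because most samples belong to one class, while the balanced accuracy is pulled down by the poorly recalled minority class.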

TPOT is freely available via http://epistasislab.github.io/tpot/

Wednesday, October 09, 2019

Embracing study heterogeneity for finding genetic interactions in large-scale research consortia

New collaborative paper in Genetic Epidemiology with Dr. Yong Chen

Liu Y, Huang J, Urbanowicz RJ, Chen K, Manduchi E, Greene CS, Moore JH, Scheet P, Chen Y. Embracing study heterogeneity for finding genetic interactions in large-scale research consortia. Genet Epidemiol. 2019 Oct 4. [PubMed]


Genetic interactions have been recognized as a potentially important contributor to the heritability of complex diseases. Nevertheless, due to small effect sizes and stringent multiple-testing correction, identifying genetic interactions in complex diseases is particularly challenging. To address the above challenges, many genomic research initiatives collaborate to form large-scale consortia and develop open access to enable sharing of genome-wide association study (GWAS) data. Despite the perceived benefits of data sharing from large consortia, a number of practical issues have arisen, such as privacy concerns on individual genomic information and heterogeneous data sources from distributed GWAS databases. In the context of large consortia, we demonstrate that the heterogeneously appearing marginal effects over distributed GWAS databases can offer new insights into genetic interactions for which conventional methods have had limited success. In this paper, we develop a novel two-stage testing procedure, named phylogenY-based effect-size tests for interactions using first 2 moments (YETI2), to detect genetic interactions through both pooled marginal effects, in terms of averaging site-specific marginal effects, and heterogeneity in marginal effects across sites, using a meta-analytic framework. YETI2 can not only be applied to large consortia without shared personal information but also can be used to leverage underlying heterogeneity in marginal effects to prioritize potential genetic interactions. We investigate the performance of YETI2 through simulation studies and apply YETI2 to bladder cancer data from dbGaP.
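
To make the two quantities mentioned above more concrete (a pooled marginal effect and heterogeneity in marginal effects across sites), here is a minimal sketch using standard fixed-effect meta-analysis formulas: an inverse-variance-weighted average and Cochran's Q. This is an assumption chosen for illustration, not the YETI2 procedure itself, and the site-level effect sizes below are made up.

```python
def pooled_effect_and_q(betas, ses):
    """Inverse-variance pooled effect, its standard error, and Cochran's Q.

    betas: per-site marginal effect estimates
    ses:   per-site standard errors
    (Generic meta-analysis statistics, not the YETI2 tests.)
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = (1.0 / total) ** 0.5
    # Q measures how much site-level effects scatter around the pooled effect
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    return pooled, pooled_se, q


# Hypothetical SNP whose marginal effect flips sign between sites
betas = [0.2, -0.2, 0.2, -0.2]
ses = [0.1, 0.1, 0.1, 0.1]
pooled, pooled_se, q = pooled_effect_and_q(betas, ses)
print(pooled, pooled_se, q)
```

In this toy case the pooled effect is essentially zero while Q is large, which is the kind of pattern where heterogeneity across sites carries information that the averaged marginal effect misses.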

Wednesday, August 14, 2019

Alternative measures of association for GWAS

Manduchi E, Orzechowski PR, Ritchie MD, Moore JH. Exploration of a diversity of computational and statistical measures of association for genome-wide genetic studies. BioData Min. 2019 Jul 9;12:14. [PubMed] [BioData Mining]
The principal line of investigation in Genome Wide Association Studies (GWAS) is the identification of main effects, that is, individual Single Nucleotide Polymorphisms (SNPs) which are associated with the trait of interest, independent of other factors. A variety of methods have been proposed to this end, mostly statistical in nature and differing in assumptions and type of model employed. Moreover, for a given model, there may be multiple choices for the SNP genotype encoding. As an alternative to statistical methods, machine learning methods are often applicable. Typically, for a given GWAS, a single approach is selected and utilized to identify potential SNPs of interest. Even when multiple GWAS are combined through meta-analyses within a consortium, each GWAS is typically analyzed with a single approach and the resulting summary statistics are then utilized in meta-analyses.
In this work we use as case studies a Type 2 Diabetes (T2D) and a breast cancer GWAS to explore a diversity of applicable approaches spanning different methods and encoding choices. We assess similarity of these approaches based on the derived ranked lists of SNPs and, for each GWAS, we identify a subset of representative approaches that we use as an ensemble to derive a union list of top SNPs. Among these are SNPs which are identified by multiple approaches as well as several SNPs identified by only one or a few of the less frequently used approaches. The latter include SNPs from established loci and SNPs which have other supporting lines of evidence in terms of their potential relevance to the traits.

Not every main effect analysis method is suitable for every GWAS, but for each GWAS there are typically multiple applicable methods and encoding options. We suggest a workflow for a single GWAS, extensible to multiple GWAS from consortia, where representative approaches are selected among a pool of suitable options, to yield a more comprehensive set of SNPs, potentially including SNPs that would typically be missed with the most popular analyses, but that could provide additional valuable insights for follow-up.
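
The idea of deriving a union list of top SNPs from several representative approaches can be sketched as follows. This is a minimal illustration under stated assumptions: the approach names, SNP identifiers, and the cutoff `k` are hypothetical, and the paper's actual selection of representative approaches is not reproduced here.

```python
def union_of_top_snps(rankings, k):
    """Collect the top-k SNPs from each approach and record who found them.

    rankings: dict mapping approach name -> list of SNP IDs, best first
    Returns a dict mapping SNP ID -> list of approaches that ranked it in
    their top k. (Illustrative helper, not the paper's code.)
    """
    support = {}
    for approach, ranked in rankings.items():
        for snp in ranked[:k]:
            support.setdefault(snp, []).append(approach)
    return support


# Hypothetical ranked lists from three approaches
rankings = {
    "additive_logistic": ["snp_a", "snp_b", "snp_c", "snp_d"],
    "recessive_logistic": ["snp_b", "snp_e", "snp_a", "snp_c"],
    "random_forest": ["snp_f", "snp_b", "snp_d", "snp_a"],
}
support = union_of_top_snps(rankings, k=2)

# SNPs found by several approaches versus by a single approach
shared = sorted(s for s, found_by in support.items() if len(found_by) > 1)
unique = sorted(s for s, found_by in support.items() if len(found_by) == 1)
print(shared, unique)
```

Keeping track of which approaches support each SNP makes it easy to separate consensus hits from SNPs that only a less frequently used approach would surface.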

Wednesday, July 24, 2019

Scaling tree-based automated machine learning

Le TT, Fu W, Moore JH. Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics, in press (2019). [PubMed] [Bioinformatics]

MOTIVATION: Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming to recommend an optimized analysis pipeline for the data scientist's prediction problem. However, like other AutoML systems, TPOT may reach computational resource limits when working on big data such as whole-genome expression data.

RESULTS: We introduce two new features implemented in TPOT that help increase the system's scalability: Feature Set Selector and Template. Feature Set Selector (FSS) provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. FSS increases TPOT's efficiency on big data by slicing the entire dataset into smaller sets of features and allowing genetic programming to select the best subset in the final pipeline. Template enforces type constraints with strongly typed genetic programming and enables the incorporation of FSS at the beginning of each pipeline. Consequently, FSS and Template help reduce TPOT computation time and may provide more interpretable results. Our simulations show TPOT-FSS significantly outperforms a tuned XGBoost model and standard TPOT implementation. We apply TPOT-FSS to real RNA-Seq data from a study of major depressive disorder. Independent of the previous study that identified significant association with depression severity of two modules, TPOT-FSS corroborates that one of the modules is largely predictive of the clinical diagnosis of each individual.
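
The slicing step that FSS performs can be illustrated with a small, library-free sketch: given named subsets of features, restrict the data matrix to the columns of one subset. This does not use TPOT's own operator; the function name, gene names, and module names are hypothetical, and the search over subsets (done by genetic programming in TPOT) is not shown.

```python
def select_feature_set(X, feature_names, subset):
    """Return X restricted to the columns named in `subset`.

    X:             list of rows (each row is a list of feature values)
    feature_names: column names, in the same order as the columns of X
    subset:        names of the features to keep
    (Illustrative helper, not TPOT's implementation.)
    """
    index = {name: j for j, name in enumerate(feature_names)}
    cols = [index[name] for name in subset]
    return [[row[j] for j in cols] for row in X]


# Hypothetical expression matrix: 2 samples, 4 genes
feature_names = ["gene1", "gene2", "gene3", "gene4"]
X = [
    [0.1, 1.0, 5.0, 2.0],
    [0.3, 0.0, 6.0, 1.0],
]

# Hypothetical predefined feature sets, e.g. gene modules
feature_sets = {
    "module_a": ["gene1", "gene3"],
    "module_b": ["gene2", "gene4"],
}

X_module_a = select_feature_set(X, feature_names, feature_sets["module_a"])
print(X_module_a)
```

Because each candidate pipeline only sees one subset of columns, downstream steps work on a much smaller matrix, and the subset chosen in the final pipeline can itself be interpreted.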

AVAILABILITY: Detailed simulation and analysis code needed to reproduce the results in this study is available at https://github.com/lelaboratoire/tpot-fss. Implementation of the new TPOT operators is available at https://github.com/EpistasisLab/tpot.

Tuesday, June 18, 2019

Workflows for regulome and transcriptome-based prioritization of genetic variants

Manduchi E, Hemerich D, van Setten J, Tragante V, Harakalova M, Pei J, Williams SM, van der Harst P, Asselbergs FW, Moore JH. A comparison of two workflows for regulome and transcriptome-based prioritization of genetic variants associated with myocardial mass. Genet Epidemiol. 2019 Sep;43(6):717-726. [PubMed] [Genetic Epi]

A typical task arising from main effect analyses in a Genome Wide Association Study (GWAS) is to identify single nucleotide polymorphisms (SNPs), in linkage disequilibrium with the observed signals, that are likely causal variants and the affected genes. The affected genes may not be those closest to associating SNPs. Functional genomics data from relevant tissues are believed to be helpful in selecting likely causal SNPs and interpreting implicated biological mechanisms, ultimately facilitating prevention and treatment in the case of a disease trait. These data are typically used post GWAS analyses to fine‐map the statistically significant signals identified agnostically by testing all SNPs and applying a multiple testing correction. The number of tested SNPs is typically in the millions, so the multiple testing burden is high. Motivated by this, in this study we investigated an alternative workflow, which consists in utilizing the available functional genomics data as a first step to reduce the number of SNPs tested for association. We analyzed GWAS on electrocardiographic QRS duration using these two workflows. The alternative workflow identified more SNPs, including some residing in loci not discovered with the typical workflow. Moreover, the latter are corroborated by other reports on QRS duration. This indicates the potential value of incorporating functional genomics information at the onset in GWAS analyses.
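
The multiple testing argument above can be made concrete with a small worked example. This sketch assumes a Bonferroni correction and uses hypothetical SNP counts (one million tested genome-wide, fifty thousand after filtering on functional genomics data); neither the correction method nor the counts are taken from the paper.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


alpha = 0.05

# Typical workflow: test every SNP, then fine-map (hypothetical count)
genome_wide = bonferroni_threshold(alpha, 1_000_000)

# Alternative workflow: pre-filter SNPs with functional data, then test
# (hypothetical count)
prefiltered = bonferroni_threshold(alpha, 50_000)

print(genome_wide, prefiltered, prefiltered / genome_wide)
```

With these assumed counts the per-SNP threshold is twenty times less stringent after filtering, which is one way the alternative workflow could recover signals that miss genome-wide significance.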

Thursday, May 16, 2019

Accessible AI for Automated Machine Learning

We released our open-source PennAI software for automated machine learning this week. Here is the Penn Medicine press release. Here is the GitHub link to the source code. More info can be found at the PennAI website. We think this will bring machine learning technology to novice users.

Monday, April 22, 2019

Automated discovery of test statistics

This was a fun proof-of-principle paper we did on using genetic programming to discover test statistics. We showed that, starting from general principles, we could re-discover the two-sample t-test. This opens the door to the discovery of new test statistics for unsolved problems.
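
For reference, here is a minimal sketch of the statistic being re-discovered. It assumes the classic pooled-variance (Student) form of the two-sample t-test; the post does not say which variant was recovered, and the sample values are made up.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic (Student's form, assumed)."""
    nx, ny = len(x), len(y)
    # Pooled estimate of the common variance from both samples
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    # Difference in means divided by its estimated standard error
    return (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))


# Hypothetical samples
x = [1, 2, 3, 4, 5]
y = [2, 3, 4, 5, 6]
t_stat = two_sample_t(x, y)
print(t_stat)
```

The statistic is a difference in means scaled by its standard error, which is the kind of compact expression a genetic programming search over arithmetic building blocks can plausibly assemble.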

Sunday, March 31, 2019

How to increase our belief in discovered statistical interactions via large-scale association studies?

Our new paper with Dr. Kristel van Steen on approaches for improving evidence for statistical interactions.

Van Steen K, Moore JH. How to increase our belief in discovered statistical interactions via large-scale association studies? Hum Genet. 2019 [PubMed] [Human Genetics]


The understanding that differences in biological epistasis may impact disease risk, diagnosis, or disease management stands in wide contrast to the unavailability of widely accepted large-scale epistasis analysis protocols. Several choices in the analysis workflow will impact false-positive and false-negative rates. One of these choices relates to the exploitation of particular modelling or testing strategies. The strengths and limitations of these need to be well understood, as well as the contexts in which these hold. This will contribute to determining the potentially complementary value of epistasis detection workflows and is expected to increase replication success with biological relevance. In this contribution, we take a recently introduced regression-based epistasis detection tool as a leading example to review the key elements that need to be considered to fully appreciate the value of analytical epistasis detection performance assessments. We point out unresolved hurdles and give our perspectives towards overcoming these.