Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Wednesday, August 14, 2019

Alternative measures of association for GWAS

Manduchi E, Orzechowski PR, Ritchie MD, Moore JH. Exploration of a diversity of computational and statistical measures of association for genome-wide genetic studies. BioData Min. 2019 Jul 9;12:14. [PubMed] [BioData Mining]

The principal line of investigation in Genome Wide Association Studies (GWAS) is the identification of main effects, that is, individual Single Nucleotide Polymorphisms (SNPs) which are associated with the trait of interest, independent of other factors. A variety of methods have been proposed to this end, mostly statistical in nature and differing in assumptions and type of model employed. Moreover, for a given model, there may be multiple choices for the SNP genotype encoding. As an alternative to statistical methods, machine learning methods are often applicable. Typically, for a given GWAS, a single approach is selected and utilized to identify potential SNPs of interest. Even when multiple GWAS are combined through meta-analyses within a consortium, each GWAS is typically analyzed with a single approach and the resulting summary statistics are then utilized in meta-analyses.

In this work we use as case studies a Type 2 Diabetes (T2D) and a breast cancer GWAS to explore a diversity of applicable approaches spanning different methods and encoding choices. We assess similarity of these approaches based on the derived ranked lists of SNPs and, for each GWAS, we identify a subset of representative approaches that we use as an ensemble to derive a union list of top SNPs. Among these are SNPs which are identified by multiple approaches as well as several SNPs identified by only one or a few of the less frequently used approaches. The latter include SNPs from established loci and SNPs which have other supporting lines of evidence in terms of their potential relevance to the traits.

Not every main effect analysis method is suitable for every GWAS, but for each GWAS there are typically multiple applicable methods and encoding options. We suggest a workflow for a single GWAS, extensible to multiple GWAS from consortia, where representative approaches are selected among a pool of suitable options, to yield a more comprehensive set of SNPs, potentially including SNPs that would typically be missed with the most popular analyses, but that could provide additional valuable insights for follow-up.
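
To make the ensemble idea concrete, here is a minimal Python sketch of comparing ranked SNP lists and building a union list of top SNPs. It is only an illustration under stated assumptions: the approach names, the SNP IDs, the cutoff `k`, and the use of Jaccard overlap as the similarity measure are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def top_k(ranked, k):
    """Return the first k SNP IDs of a ranked list (best first)."""
    return ranked[:k]


def jaccard_top_k(a, b, k):
    """Jaccard overlap of the top-k sets of two ranked lists (an assumed similarity measure)."""
    sa, sb = set(top_k(a, k)), set(top_k(b, k))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def union_of_top_snps(rankings, k):
    """Map every SNP found in any top-k list to the approaches that selected it."""
    support = defaultdict(list)
    for name, ranked in rankings.items():
        for snp in top_k(ranked, k):
            support[snp].append(name)
    return {snp: sorted(names) for snp, names in support.items()}


# Hypothetical ranked lists from three hypothetical approaches.
rankings = {
    "logistic_additive": ["rs1", "rs2", "rs3", "rs4"],
    "logistic_recessive": ["rs2", "rs5", "rs1", "rs3"],
    "random_forest": ["rs6", "rs1", "rs2", "rs7"],
}

union = union_of_top_snps(rankings, k=2)
for snp, names in sorted(union.items()):
    print(snp, "selected by", ", ".join(names))
```

In this toy input, `rs5` and `rs6` each enter the union through a single approach, which is the kind of SNP a one-method analysis would miss.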

Wednesday, July 24, 2019

Scaling tree-based automated machine learning

Le TT, Fu W, Moore JH. Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics, in press (2019). [PubMed] [Bioinformatics]

MOTIVATION: Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming to recommend an optimized analysis pipeline for the data scientist's prediction problem. However, like other AutoML systems, TPOT may reach computational resource limits when working on big data such as whole-genome expression data.

RESULTS: We introduce two new features implemented in TPOT that help increase the system's scalability: Feature Set Selector and Template. Feature Set Selector (FSS) provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. FSS increases TPOT's efficiency in application on big data by slicing the entire dataset into smaller sets of features and allowing genetic programming to select the best subset in the final pipeline. Template enforces type constraints with strongly typed genetic programming and enables the incorporation of FSS at the beginning of each pipeline. Consequently, FSS and Template help reduce TPOT computation time and may provide more interpretable results. Our simulations show TPOT-FSS significantly outperforms a tuned XGBoost model and standard TPOT implementation. We apply TPOT-FSS to real RNA-Seq data from a study of major depressive disorder. Independent of the previous study that identified significant association with depression severity of two modules, TPOT-FSS corroborates that one of the modules is largely predictive of the clinical diagnosis of each individual.
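
The feature-set idea can be sketched in a few lines of plain Python. This is not TPOT's API or its implementation: the function names, the toy expression values, the module names, and the scoring rule are all hypothetical, and an exhaustive loop over subsets stands in for the genetic programming search that TPOT actually uses.

```python
def select_feature_set(rows, feature_names, subset):
    """Slice the dataset down to the columns named in one predefined feature subset."""
    idx = [feature_names.index(f) for f in subset]
    return [[row[i] for i in idx] for row in rows]


def mean_gap_score(rows, labels):
    """Toy score: absolute gap between classes in the per-sample mean of the selected features."""
    per_sample = [sum(r) / len(r) for r in rows]
    pos = [v for v, y in zip(per_sample, labels) if y == 1]
    neg = [v for v, y in zip(per_sample, labels) if y == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


def best_subset(rows, labels, feature_names, subsets, score):
    """Score every named subset and return the best name plus all scores.

    Exhaustive search is used here for clarity; it replaces, and does not
    reproduce, the evolutionary search described in the abstract.
    """
    scored = {
        name: score(select_feature_set(rows, feature_names, feats), labels)
        for name, feats in subsets.items()
    }
    return max(scored, key=scored.get), scored


# Hypothetical data: g1 and g2 carry the signal, g3 and g4 do not.
feature_names = ["g1", "g2", "g3", "g4"]
rows = [
    [1.0, 1.2, 0.5, 0.1],
    [0.9, 1.1, 0.1, 0.6],
    [0.1, 0.2, 0.4, 0.2],
    [0.2, 0.0, 0.2, 0.5],
]
labels = [1, 1, 0, 0]
subsets = {"module_A": ["g1", "g2"], "module_B": ["g3", "g4"]}

best, scored = best_subset(rows, labels, feature_names, subsets, mean_gap_score)
print(best, scored)
```

Because each candidate pipeline only sees one slice of the columns, the downstream model works on far fewer features, and the chosen subset name itself is an interpretable result.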

AVAILABILITY: Detailed simulation and analysis code needed to reproduce the results in this study is available at https://github.com/lelaboratoire/tpot-fss. Implementation of the new TPOT operators is available at https://github.com/EpistasisLab/tpot.

Tuesday, June 18, 2019

Workflows for regulome and transcriptome-based prioritization of genetic variants

Manduchi E, Hemerich D, van Setten J, Tragante V, Harakalova M, Pei J, Williams SM, van der Harst P, Asselbergs FW, Moore JH. A comparison of two workflows for regulome and transcriptome-based prioritization of genetic variants associated with myocardial mass. Genet Epidemiol. 2019 Sep;43(6):717-726. [PubMed] [Genetic Epi]

A typical task arising from main effect analyses in a Genome Wide Association Study (GWAS) is to identify single nucleotide polymorphisms (SNPs), in linkage disequilibrium with the observed signals, that are likely causal variants and the affected genes. The affected genes may not be those closest to associating SNPs. Functional genomics data from relevant tissues are believed to be helpful in selecting likely causal SNPs and interpreting implicated biological mechanisms, ultimately facilitating prevention and treatment in the case of a disease trait. These data are typically used post GWAS analyses to fine‐map the statistically significant signals identified agnostically by testing all SNPs and applying a multiple testing correction. The number of tested SNPs is typically in the millions, so the multiple testing burden is high. Motivated by this, in this study we investigated an alternative workflow, which consists in utilizing the available functional genomics data as a first step to reduce the number of SNPs tested for association. We analyzed GWAS on electrocardiographic QRS duration using these two workflows. The alternative workflow identified more SNPs, including some residing in loci not discovered with the typical workflow. Moreover, the latter are corroborated by other reports on QRS duration. This indicates the potential value of incorporating functional genomics information at the onset in GWAS analyses.
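
The effect of filtering first on the multiple testing burden can be illustrated with a small Python sketch. All numbers here are hypothetical (SNP IDs, p-values, the counts of tested SNPs), and Bonferroni correction is assumed as the multiple testing procedure, which the abstract does not specify.

```python
def bonferroni_hits(pvalues, n_tests, alpha=0.05):
    """Return the SNPs whose p-value falls below the Bonferroni threshold alpha / n_tests."""
    threshold = alpha / n_tests
    return {snp for snp, p in pvalues.items() if p < threshold}


# Hypothetical association p-values for four SNPs.
pvalues = {"rsA": 1e-9, "rsB": 3e-7, "rsC": 2e-8, "rsD": 0.01}

# Hypothetical set of SNPs supported by functional genomics data.
annotated = {"rsB", "rsC"}

# Typical workflow: test everything (assumed 1,000,000 SNPs), then interpret.
typical = bonferroni_hits(pvalues, n_tests=1_000_000)

# Alternative workflow: keep only annotated SNPs (assumed 50,000), then test.
filtered = {snp: p for snp, p in pvalues.items() if snp in annotated}
alternative = bonferroni_hits(filtered, n_tests=50_000)

print("typical:", sorted(typical))
print("alternative:", sorted(alternative))
```

In this toy case `rsB` is significant only under the smaller burden, while `rsA` is lost because it was filtered out before testing, so the two workflows are complementary rather than nested.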

Thursday, May 16, 2019

Accessible AI for Automated Machine Learning

We released our open-source PennAI software for automated machine learning this week, along with a Penn Medicine press release and the source code on GitHub. More information can be found at the PennAI website. We think this will bring machine learning technology to novice users.

Monday, April 22, 2019

Automated discovery of test statistics

This was a fun proof-of-principle paper we did on using genetic programming to discover test statistics. We showed that, starting from general principles, we could re-discover the two-sample t-test. This opens the door to the discovery of new test statistics for unsolved problems.
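
For reference, here is one common form of the two-sample t statistic written in Python. The post does not say which variant was re-discovered, so the pooled-variance (equal variance) Student's version below is an assumption, and the function name and sample values are hypothetical.

```python
import math
from statistics import mean, variance


def two_sample_t(x, y):
    """Student's two-sample t statistic with pooled variance (assumes equal variances)."""
    nx, ny = len(x), len(y)
    # Pooled estimate of the common variance from the two sample variances.
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))


# Hypothetical samples.
t = two_sample_t([1, 2, 3], [2, 3, 4])
print(t)
```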

Sunday, March 31, 2019

How to increase our belief in discovered statistical interactions via large-scale association studies?

Our new paper with Dr. Kristel van Steen on approaches for improving evidence for statistical interactions.

Van Steen K, Moore JH. How to increase our belief in discovered statistical interactions via large-scale association studies? Hum Genet. 2019 [PubMed] [Human Genetics]


The understanding that differences in biological epistasis may impact disease risk, diagnosis, or disease management stands in wide contrast to the unavailability of widely accepted large-scale epistasis analysis protocols. Several choices in the analysis workflow will impact false-positive and false-negative rates. One of these choices relates to the exploitation of particular modelling or testing strategies. The strengths and limitations of these need to be well understood, as well as the contexts in which these hold. This will contribute to determining the potentially complementary value of epistasis detection workflows and is expected to increase replication success with biological relevance. In this contribution, we take a recently introduced regression-based epistasis detection tool as a leading example to review the key elements that need to be considered to fully appreciate the value of analytical epistasis detection performance assessments. We point out unresolved hurdles and give our perspectives towards overcoming these.

Friday, March 01, 2019

Testing the assumptions of parametric linear models: the need for biological data mining in disciplines such as human genetics

This editorial is in response to some claims that an observed linear relationship between relative pair trait correlation and IBD genetic sharing is indicative of a simple additive genetic architecture dominated by independent genetic effects. As we show here, you could observe this pattern under a genetic architecture dominated by epistasis.

Moore JH, Mackay TFC, Williams SM. Testing the assumptions of parametric linear models: the need for biological data mining in disciplines such as human genetics. BioData Min. 2019 Feb 11;12:6. [PubMed] [BioData Mining]


All data science methods have specific assumptions that are made in order for their inferences to be valid. Some assumptions impact statistical significance testing and some influence the models themselves. For example, a fundamental assumption of linear regression is that the relationship between the independent and dependent variables is additive such that a unit increase in one leads to a unit increase in the other with some error that can be modeled using a normal distribution. The presence of a nonlinear relationship between the variables violates this assumption and can lead to inaccurate inferences. We demonstrate this here using a simple example from human genetics and then end with some thoughts about the role of biological data mining in revealing nonlinear relationships between variables.
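
A toy Python sketch shows how an additive view can miss a purely interactive effect. This is not the example from the editorial: the two binary loci, the XOR rule for the trait, and the balanced genotype table are hypothetical choices made only to keep the arithmetic transparent.

```python
from itertools import product
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


# Balanced table of two hypothetical binary loci.
genotypes = list(product([0, 1], repeat=2))
locus_a = [g[0] for g in genotypes]
locus_b = [g[1] for g in genotypes]

# The trait is set entirely by the interaction: present when exactly one locus is 1.
trait = [a ^ b for a, b in genotypes]

print("locus A vs trait:", pearson(locus_a, trait))
print("locus B vs trait:", pearson(locus_b, trait))
```

Each locus has zero linear correlation with the trait, so a model restricted to additive main effects would find nothing, even though the pair of loci determines the trait exactly.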

Wednesday, February 13, 2019

Preparing next-generation scientists for biomedical big data: artificial intelligence approaches

Our paper on how to prepare next-gen scientists for big data is out. We outline here a curriculum focused on precision medicine, data science, and artificial intelligence.

Moore JH, Boland MR, Camara PG, Chervitz H, Gonzalez G, Himes BE, Kim D, Mowery DL, Ritchie MD, Shen L, Urbanowicz RJ, Holmes JH. Preparing next-generation scientists for biomedical big data: artificial intelligence approaches. Per Med. 2019 [PubMed] [PerMed]


Personalized medicine is being realized by our ability to measure biological and environmental information about patients. Much of these data are being stored in electronic health records yielding big data that presents challenges for its management and analysis. Here, we review several areas of knowledge that are necessary for next-generation scientists to fully realize the potential of biomedical big data. We begin with an overview of big data and its storage and management. We then review statistics and data science as foundational topics followed by a core curriculum of artificial intelligence, machine learning and natural language processing that are needed to develop predictive models for clinical decision making. We end with some specific training recommendations for preparing next-generation scientists for biomedical big data.

Wednesday, January 02, 2019

Analysis validation has been neglected in the Age of Reproducibility

Our paper on the use of simulation to help improve analysis validation and results reproducibility.

Lotterhos KE, Moore JH, Stapleton AE. Analysis validation has been neglected in the Age of Reproducibility. PLoS Biol. 2018 Dec 10;16(12):e3000070. [PubMed] [PLoS Biology]


Increasingly complex statistical models are being used for the analysis of biological data. Recent commentary has focused on the ability to compute the same outcome for a given dataset (reproducibility). We argue that a reproducible statistical analysis is not necessarily valid because of unique patterns of nonindependence in every biological dataset. We advocate that analyses should be evaluated with known-truth simulations that capture biological reality, a process we call "analysis validation." We review the process of validation and suggest criteria that a validation project should meet. We find that different fields of science have historically failed to meet all criteria, and we suggest ways to implement meaningful validation in training and practice.

Wednesday, November 21, 2018

Generalized multifactor dimensionality reduction approaches to identification of genetic interactions underlying ordinal traits

I love seeing new extensions and modifications to our MDR method. Here is a new one from Dr. Lou.

Hou TT, Lin F, Bai S, Cleves MA, Xu HM, Lou XY. Generalized multifactor dimensionality reduction approaches to identification of genetic interactions underlying ordinal traits. Genet Epidemiol, in press (2018)


The manifestation of complex traits is influenced by gene–gene and gene–environment interactions, and the identification of multifactor interactions is an important but challenging undertaking for genetic studies. Many complex phenotypes such as disease severity are measured on an ordinal scale with more than two categories. A proportional odds model can improve statistical power for these outcomes, when compared to a logit model either collapsing the categories into two mutually exclusive groups or limiting the analysis to pairs of categories. In this study, we propose a proportional odds model‐based generalized multifactor dimensionality reduction (GMDR) method for detection of interactions underlying polytomous ordinal phenotypes. Computer simulations demonstrated that this new GMDR method has a higher power and more accurate predictive ability than the GMDR methods based on a logit model and a multinomial logit model. We applied this new method to the genetic analysis of low‐density lipoprotein (LDL) cholesterol, a causal risk factor for coronary artery disease, in the Multi‐Ethnic Study of Atherosclerosis, and identified a significant joint action of the CELSR2, SERPINA12, HPGD, and APOB genes. This finding provides new information to advance the limited knowledge about genetic regulation and gene interactions in metabolic pathways of LDL cholesterol. In conclusion, the proportional odds model‐based GMDR is a useful tool that can boost statistical power and prediction accuracy in studying multifactor interactions underlying ordinal traits.
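
For orientation, a proportional odds model for an ordinal outcome Y with J ordered categories is commonly written as below. This is one standard parameterization (sign conventions vary between texts and software) and is not necessarily the exact form used in the paper; the symbols are chosen here for illustration.

```latex
\[
  \operatorname{logit} P(Y \le j \mid \mathbf{x})
  = \log \frac{P(Y \le j \mid \mathbf{x})}{P(Y > j \mid \mathbf{x})}
  = \alpha_j - \boldsymbol{\beta}^{\top}\mathbf{x},
  \qquad j = 1, \dots, J - 1,
\]
```

Here the intercepts \(\alpha_1 \le \dots \le \alpha_{J-1}\) differ by category while the coefficient vector \(\boldsymbol{\beta}\) is shared across all cut points, which is what lets the model use the full ordinal scale rather than collapsing it into two groups.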