Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Saturday, December 19, 2015

Identifying Gene-Gene Interactions that are Highly Associated with Body Mass Index

A nice example of the application of our quantitative multifactor dimensionality reduction (QMDR) method to the study of gene-gene interaction associated with obesity.

De R, Verma SS, Drenos F, Holzinger ER, Holmes MV, Hall MA, Crosslin DR, Carrell DS, Hakonarson H, Jarvik G, Larson E, Pacheco JA, Rasmussen-Torvik LJ, Moore CB, Asselbergs FW, Moore JH, Ritchie MD, Keating BJ, Gilbert-Diamond D. Identifying gene-gene interactions that are highly associated with Body Mass Index using Quantitative Multifactor Dimensionality Reduction (QMDR). BioData Min. 2015 Dec 14;8:41. [PDF]


BACKGROUND: Despite heritability estimates of 40-70% for obesity, less than 2% of its variation is explained by Body Mass Index (BMI) associated loci that have been identified so far. Epistasis, or gene-gene interaction, is a plausible source to explain portions of the missing heritability of BMI.

METHODS: Using genotypic data from 18,686 individuals across five study cohorts - ARIC, CARDIA, FHS, CHS, MESA - we filtered SNPs (Single Nucleotide Polymorphisms) using two parallel approaches. SNPs were filtered either on the strength of their main effects of association with BMI, or on the number of knowledge sources supporting a specific SNP-SNP interaction in the context of BMI. Filtered SNPs were specifically analyzed for interactions that are highly associated with BMI using QMDR (Quantitative Multifactor Dimensionality Reduction). QMDR is a nonparametric, genetic model-free method that detects non-linear interactions associated with a quantitative trait.
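
The pooling step that MDR-style methods apply to a quantitative trait can be sketched as follows. This is a simplified illustration, not the published QMDR implementation. It assumes each two-SNP genotype cell is labeled "high" or "low" by comparing its mean trait value with the overall mean, and that the two pooled groups are then compared with a Welch t statistic. The function name and the toy data are hypothetical, and cross-validation and permutation testing are omitted.

```python
from math import sqrt
from statistics import mean, stdev


def qmdr_t_statistic(snp_a, snp_b, trait):
    """Pool two-locus genotype cells into high/low groups and compare them."""
    overall = mean(trait)

    # Collect trait values for every observed genotype combination.
    cells = {}
    for a, b, y in zip(snp_a, snp_b, trait):
        cells.setdefault((a, b), []).append(y)

    # A cell is "high" if its mean trait exceeds the overall mean.
    high, low = [], []
    for values in cells.values():
        (high if mean(values) > overall else low).extend(values)

    if len(high) < 2 or len(low) < 2:
        return 0.0  # not enough data in one group to compare

    # Welch t statistic between the pooled high and low groups.
    se = sqrt(stdev(high) ** 2 / len(high) + stdev(low) ** 2 / len(low))
    return (mean(high) - mean(low)) / se if se > 0 else float("inf")


# Toy data (hypothetical): the trait is raised only when the two SNPs differ.
snp_a = [0, 0, 1, 1] * 5
snp_b = [0, 1, 0, 1] * 5
trait = [20 + 10 * (a != b) + 0.1 * (i % 3)
         for i, (a, b) in enumerate(zip(snp_a, snp_b))]
print(qmdr_t_statistic(snp_a, snp_b, trait))
```

Because the cells are pooled by their observed means rather than by a fixed genetic model, a pattern like this one, where neither SNP matters on its own, still yields a large statistic.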

RESULTS: We identified seven novel, epistatic models with a Bonferroni corrected p-value of association < 0.1. Prior experimental evidence helps explain the plausible biological interactions highlighted within our results and their relationship with obesity. We identified interactions between genes involved in mitochondrial dysfunction (POLG2), cholesterol metabolism (SOAT2), lipid metabolism (CYP11B2), cell adhesion (EZR), cell proliferation (MAP2K5), and insulin resistance (IGF1R). Moreover, we found an 8.8 % increase in the variance in BMI explained by these seven SNP-SNP interactions, beyond what is explained by the main effects of an index FTO SNP and the SNPs within these interactions. We also replicated one of these interactions and 58 proxy SNP-SNP models representing it in an independent dataset from the eMERGE study.

CONCLUSION: This study highlights a novel approach for discovering gene-gene interactions by combining methods such as QMDR with traditional statistics.

Monday, December 07, 2015

New $4.4 Million Research Project Targets Obesity in Pennsylvania

Here is the press release for our new state grant with Geisinger Clinic and Penn State University to study obesity in Pennsylvania. We will be developing deep learning methods for phenotype mining in electronic health record data.

Some press from the Philadelphia Business Journal.

Tuesday, November 03, 2015

gammaMAXT: A Fast Multiple-Testing Correction Algorithm

A great new algorithm for multiple testing correction in the context of gene-gene interaction analysis.

Lishout FV, Gadaleta F, Moore JH, Wehenkel L, Steen KV. gammaMAXT: a fast multiple-testing correction algorithm. BioData Min. 2015 Nov 20;8:36. [PDF]


BACKGROUND: The purpose of the MaxT algorithm is to provide a significance test algorithm that controls the family-wise error rate (FWER) during simultaneous hypothesis testing. However, the requirements in terms of computing time and memory of this procedure are proportional to the number of investigated hypotheses. The memory issue was solved in 2013 by Van Lishout's implementation of MaxT, which makes the memory usage independent of the size of the dataset. This algorithm is implemented in MBMDR-3.0.3, software that can effectively identify genetic interactions for a variety of SNP-SNP based epistasis models. On the other hand, that implementation turned out to be less suitable for genome-wide interaction analysis studies, due to the prohibitive computational burden.

RESULTS: In this work we introduce gammaMAXT, a novel implementation of the maxT algorithm for multiple testing correction. The algorithm was implemented in software MBMDR-4.2.2, as part of the MB-MDR framework to screen for SNP-SNP, SNP-environment or SNP-SNP-environment interactions at a genome-wide level. We show that, in the absence of interaction effects, test-statistics produced by the MB-MDR methodology follow a mixture distribution with a point mass at zero and a shifted gamma distribution for the top 10% of the strictly positive values. We show that the gammaMAXT algorithm has a power comparable to MaxT and maintains FWER, but requires less computational resources and time. We analyze a dataset composed of 10^6 SNPs and 1000 individuals within one day on a 256-core computer cluster. The same analysis would take about 10^4 times longer with MBMDR-3.0.3.
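
For readers unfamiliar with maxT, here is a minimal sketch of the classic single-step, permutation-based version of the idea. It is not the MB-MDR or gammaMAXT implementation: the test statistic (absolute Pearson correlation), the function names, and the toy data are assumptions chosen for illustration. The sketch shows why cost grows with the number of hypotheses, since every permutation recomputes every statistic.

```python
import random
from math import sqrt
from statistics import mean


def abs_corr(x, y):
    """Absolute Pearson correlation, used here as a stand-in test statistic."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / sqrt(sxx * syy)


def maxt_adjusted_pvalues(features, trait, n_perm=200, seed=0):
    """Single-step maxT: compare each observed statistic with the
    permutation distribution of the maximum statistic over all tests."""
    rng = random.Random(seed)
    observed = [abs_corr(f, trait) for f in features]
    exceed = [0] * len(features)
    shuffled = list(trait)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any feature-trait association
        max_stat = max(abs_corr(f, shuffled) for f in features)
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    # Add-one correction keeps the adjusted p-values strictly positive.
    return [(1 + e) / (1 + n_perm) for e in exceed]


# Toy data (hypothetical): the first feature tracks the trait, the second does not.
trait = [float(i) for i in range(20)]
features = [list(trait), [float(i % 2) for i in range(20)]]
print(maxt_adjusted_pvalues(features, trait))
```

Using the maximum over all tests as the reference distribution is what controls the FWER. The abstract's approach of modeling the tail of the null distribution with a gamma distribution is not reproduced here.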

CONCLUSIONS: These results are promising for future GWAIs. However, the proposed gammaMAXT algorithm offers a general significance assessment and multiple testing approach, applicable to any context that requires performing hundreds of thousands of tests. It offers new perspectives for fast and efficient permutation-based significance assessment in large-scale (integrated) omics studies.

Saturday, October 17, 2015

An Independent Filter for Gene Set Testing Based on Spectral Enrichment

Our new paper on gene set enrichment analysis.

Frost HR, Li Z, Asselbergs FW, Moore JH. An Independent Filter for Gene Set Testing Based on Spectral Enrichment. IEEE/ACM Trans Comput Biol Bioinform. 2015 Sep-Oct;12(5):1076-86. [PubMed] [IEEE]


Gene set testing has become an indispensable tool for the analysis of high-dimensional genomic data. An important motivation for testing gene sets, rather than individual genomic variables, is to improve statistical power by reducing the number of tested hypotheses. Given the dramatic growth in common gene set collections, however, testing is often performed with nearly as many gene sets as underlying genomic variables. To address the challenge to statistical power posed by large gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. The SGSF method uses as a filter statistic the p-value measuring the statistical significance of the association between each gene set and the sample principal components (PCs), taking into account the significance of the associated eigenvalues. Because this filter statistic is independent of standard gene set test statistics under the null hypothesis but dependent under the alternative, the proportion of enriched gene sets is increased without impacting the type I error rate. As shown using simulated and real gene expression data, the SGSF algorithm accurately filters gene sets unrelated to the experimental outcome, resulting in significantly increased gene set testing power.

Sunday, October 04, 2015

ExSTraCS 2.0: Description and Evaluation of a Scalable Learning Classifier System

Our latest paper on the development, evaluation, and application of learning classifier systems (LCS) to the detection and characterization of genetic effects that are heterogeneous. This work has been led by my former graduate student and postdoc Dr. Ryan Urbanowicz. See his web page for more information.

Urbanowicz RJ, Moore JH. ExSTraCS 2.0: Description and Evaluation of a Scalable Learning Classifier System. Evol Intell. 2015 Sep;8(2):89-116. [PubMed] [Springer]


Algorithmic scalability is a major concern for any machine learning strategy in this age of 'big data'. A large number of potentially predictive attributes is emblematic of problems in bioinformatics, genetic epidemiology, and many other fields. Previously, ExSTraCS was introduced as an extended Michigan-style supervised learning classifier system that combined a set of powerful heuristics to successfully tackle the challenges of classification, prediction, and knowledge discovery in complex, noisy, and heterogeneous problem domains. While Michigan-style learning classifier systems are powerful and flexible learners, they are not considered to be particularly scalable. For the first time, this paper presents a complete description of the ExSTraCS algorithm and introduces an effective strategy to dramatically improve learning classifier system scalability. ExSTraCS 2.0 addresses scalability with (1) a rule specificity limit, (2) new approaches to expert knowledge guided covering and mutation mechanisms, and (3) the implementation and utilization of the TuRF algorithm for improving the quality of expert knowledge discovery in larger datasets. Performance over a complex spectrum of simulated genetic datasets demonstrated that these new mechanisms dramatically improve nearly every performance metric on datasets with 20 attributes and made it possible for ExSTraCS to reliably scale up to perform on related 200 and 2000-attribute datasets. ExSTraCS 2.0 was also able to reliably solve the 6, 11, 20, 37, 70, and 135 multiplexer problems, and did so in similar or fewer learning iterations than previously reported, with smaller finite training sets, and without using building blocks discovered from simpler multiplexer problems. Furthermore, ExSTraCS usability was made simpler through the elimination of previously critical run parameters.
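
The multiplexer sizes listed in the abstract follow the pattern k + 2^k, where k address bits select one of 2^k data bits. A small sketch of the benchmark function is given below. It assumes the common convention that the address bits come first. The function name is hypothetical and this is not code from ExSTraCS.

```python
def multiplexer(bits, k):
    """Return the data bit selected by the first k (address) bits."""
    if len(bits) != k + 2 ** k:
        raise ValueError("expected k address bits followed by 2**k data bits")
    # Read the address bits as a binary number, most significant bit first.
    address = int("".join(str(b) for b in bits[:k]), 2)
    return bits[k + address]


# Problem sizes for k = 2..7 address bits.
sizes = [k + 2 ** k for k in range(2, 8)]
print(sizes)  # [6, 11, 20, 37, 70, 135]

# 6-bit example: address bits 0,1 select data bit 1 of [1, 0, 1, 1].
print(multiplexer([0, 1, 1, 0, 1, 1], k=2))  # 0
```

The problem is a useful benchmark for interaction detection because no single bit predicts the output on its own. The class depends on the address bits and the selected data bit jointly.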

Friday, September 25, 2015

A Genome-Wide Association Analysis Reveals Epistatic Cancellation of Additive Genetic Variance for Root Length in Arabidopsis thaliana

Great new paper in PLOS Genetics documenting epistasis in plants.

Lachowiec J, Shen X, Queitsch C, Carlborg Ö. A Genome-Wide Association Analysis Reveals Epistatic Cancellation of Additive Genetic Variance for Root Length in Arabidopsis thaliana. PLoS Genet. 2015 Sep 23;11(9):e1005541. [PubMed]


Efforts to identify loci underlying complex traits generally assume that most genetic variance is additive. Here, we examined the genetics of Arabidopsis thaliana root length and found that the genomic narrow-sense heritability for this trait in the examined population was statistically zero. The low amount of additive genetic variance that could be captured by the genome-wide genotypes likely explains why no associations to root length could be found using standard additive-model-based genome-wide association (GWA) approaches. However, as the broad-sense heritability for root length was significantly larger, and primarily due to epistasis, we also performed an epistatic GWA analysis to map loci contributing to the epistatic genetic variance. Four interacting pairs of loci were revealed, involving seven chromosomal loci that passed a standard multiple-testing corrected significance threshold. The genotype-phenotype maps for these pairs revealed epistasis that cancelled out the additive genetic variance, explaining why these loci were not detected in the additive GWA analysis. Small population sizes, such as in our experiment, increase the risk of identifying false epistatic interactions due to testing for associations with very large numbers of multi-marker genotypes in few phenotyped individuals. Therefore, we estimated the false-positive risk using a new statistical approach that suggested half of the associated pairs to be true positive associations. Our experimental evaluation of candidate genes within the seven associated loci suggests that this estimate is conservative; we identified functional candidate genes that affected root development in four loci that were part of three of the pairs. The statistical epistatic analyses were thus indispensable for confirming known, and identifying new, candidate genes for root length in this population of wild-collected A. thaliana accessions. We also illustrate how epistatic cancellation of the additive genetic variance explains the insignificant narrow-sense and significant broad-sense heritability by using a combination of careful statistical epistatic analyses and functional genetic experiments.
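
A toy two-locus example makes the idea of epistatic cancellation concrete. The genotype-phenotype map below is hypothetical and is not taken from the paper. It assumes homozygous lines with two alleles per locus (coded 0 and 1) and equal frequencies for the four genotype classes.

```python
from statistics import mean, pvariance

# Hypothetical genotype-phenotype map: the trait is high only when
# both loci carry the same allele.
phenotype = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}


def marginal_effect(locus):
    """Difference in mean phenotype between the two alleles at one locus."""
    with_allele_1 = mean(v for g, v in phenotype.items() if g[locus] == 1)
    with_allele_0 = mean(v for g, v in phenotype.items() if g[locus] == 0)
    return with_allele_1 - with_allele_0


print(marginal_effect(0), marginal_effect(1))  # 0.0 0.0
print(pvariance(phenotype.values()))           # 0.25
```

Each locus has a marginal effect of zero, so an additive single-locus scan sees nothing, yet the genotype determines the phenotype completely. In this toy map all of the genetic variance is epistatic, which is the pattern of zero narrow-sense and nonzero broad-sense heritability described in the abstract.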

Sunday, September 13, 2015

The role of visualization and 3-D printing in biological data mining

Visualization and visual analytics are the future of informatics. We explore here the role of visualization and 3-D printing in biological data mining with application to statistical epistasis networks.

Weiss TL, Zieselman A, Hill DP, Diamond SG, Shen L, Saykin AJ, Moore JH; Alzheimer’s Disease Neuroimaging Initiative. The role of visualization and 3-D printing in biological data mining. BioData Min. 2015 Aug 5;8:22. [PDF]


Biological data mining is a powerful tool that can provide a wealth of information about patterns of genetic and genomic biomarkers of health and disease. A potential disadvantage of data mining is the volume and complexity of the results, which can often be overwhelming. It is our working hypothesis that visualization methods can greatly enhance our ability to make sense of data mining results. More specifically, we propose that 3-D printing has an important role to play as a visualization technology in biological data mining. We provide here a brief review of 3-D printing along with a case study to illustrate how it might be used in a research setting.

We present as a case study a genetic interaction network associated with grey matter density, an endophenotype for late onset Alzheimer's disease, as a physical model constructed with a 3-D printer. The synergy or interaction effects of multiple genetic variants were represented through a color gradient of the physical connections between nodes. The digital gene-gene interaction network was then 3-D printed to generate a physical network model.

The physical 3-D gene-gene interaction network provided an easily manipulated, intuitive and creative way to visualize the synergistic relationships between the genetic variants and grey matter density in patients with late onset Alzheimer's disease. We discuss the advantages and disadvantages of this novel method of biological data mining visualization.

Tuesday, August 04, 2015

The role of artificial intelligence in precision medicine

Human health is the result of the interplay between many genetic factors, many environmental factors, and the complexity of our biological hierarchy from gene regulation to biochemical pathways to physiological systems. Understanding this complex genetic architecture is key for precision medicine since combinations of etiological factors naturally define small subgroups of subjects with the same risk for disease or treatment outcome. I have written extensively about this throughout my career in peer-reviewed publications and on this blog.

I gave an invited talk on this topic a few weeks ago at the "Leveraging Big Data and Knowledge to Fight Disease" symposium held at the New York Academy of Sciences in New York City. I spoke about our work on using artificial intelligence (AI) and machine learning for identifying combinations of risk factors from big data to advance our national precision medicine agenda. Rebecca Harrington from Popular Science magazine wrote this piece about the symposium and mentioned our work several times. Our EMERGENT algorithm is able to generate machine learning models of disease susceptibility that can take any mathematical form while at the same time learning the best way to do so. This latter feature moves the algorithm from the machine learning space to AI because it mimics how humans solve problems using their expert knowledge about both biological and quantitative sciences. Our latest published work about this algorithm can be found here. Email me for a reprint.

Some of my general thoughts about this topic can be found in a recent open-access editorial in BioData Mining.

Monday, June 15, 2015

Contingency and entrenchment in protein evolution under purifying selection

Great paper in PNAS by Dr. Joshua Plotkin.

Shah P, McCandlish DM, Plotkin JB. Contingency and entrenchment in protein evolution under purifying selection. Proc Natl Acad Sci U S A. 2015 [PDF]


The phenotypic effect of an allele at one genetic site may depend on alleles at other sites, a phenomenon known as epistasis. Epistasis can profoundly influence the process of evolution in populations and shape the patterns of protein divergence across species. Whereas epistasis between adaptive substitutions has been studied extensively, relatively little is known about epistasis under purifying selection. Here we use computational models of thermodynamic stability in a ligand-binding protein to explore the structure of epistasis in simulations of protein sequence evolution. Even though the predicted effects on stability of random mutations are almost completely additive, the mutations that fix under purifying selection are enriched for epistasis. In particular, the mutations that fix are contingent on previous substitutions: Although nearly neutral at their time of fixation, these mutations would be deleterious in the absence of preceding substitutions. Conversely, substitutions under purifying selection are subsequently entrenched by epistasis with later substitutions: They become increasingly deleterious to revert over time. Our results imply that, even under purifying selection, protein sequence evolution is often contingent on history and so it cannot be predicted by the phenotypic effects of mutations assayed in the ancestral background.

Monday, April 06, 2015

Spectral gene set enrichment (SGSE)

Our new spectral gene set enrichment (SGSE) method has been published.

Frost HR, Li Z, Moore JH. Spectral gene set enrichment (SGSE). BMC Bioinformatics. 2015 Mar 3;16:70. [PubMed]


Gene set testing is typically performed in a supervised context to quantify the association between groups of genes and a clinical phenotype. In many cases, however, a gene set-based interpretation of genomic data is desired in the absence of a phenotype variable. Although methods exist for unsupervised gene set testing, they predominantly compute enrichment relative to clusters of the genomic variables with performance strongly dependent on the clustering algorithm and number of clusters.

We propose a novel method, spectral gene set enrichment (SGSE), for unsupervised competitive testing of the association between gene sets and empirical data sources. SGSE first computes the statistical association between gene sets and principal components (PCs) using our principal component gene set enrichment (PCGSE) method. The overall statistical association between each gene set and the spectral structure of the data is then computed by combining the PC-level p-values using the weighted Z-method with weights set to the PC variance scaled by Tracy-Widom test p-values. Using simulated data, we show that the SGSE algorithm can accurately recover spectral features from noisy data. To illustrate the utility of our method on real data, we demonstrate the superior performance of the SGSE method relative to standard cluster-based techniques for testing the association between MSigDB gene sets and the variance structure of microarray gene expression data.
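
The weighted Z-method mentioned above combines p-values by converting each to a standard normal score and taking a weighted sum. A minimal sketch follows, assuming one-sided p-values strictly between 0 and 1. The function name is hypothetical and the weights are placeholders: SGSE's actual weights (PC variance scaled by Tracy-Widom test p-values) are not computed here.

```python
from math import sqrt
from statistics import NormalDist


def weighted_z(p_values, weights):
    """Combine one-sided p-values with the weighted Z-method."""
    nd = NormalDist()
    # Convert each p-value to a Z score; small p-values give large Z.
    z_scores = [nd.inv_cdf(1 - p) for p in p_values]
    combined = (sum(w * z for w, z in zip(weights, z_scores))
                / sqrt(sum(w * w for w in weights)))
    return 1 - nd.cdf(combined)


# Placeholder weights: the first p-value counts three times as much.
print(weighted_z([0.01, 0.9], [3.0, 1.0]))
print(weighted_z([0.01, 0.9], [1.0, 3.0]))
```

Dividing by the square root of the sum of squared weights keeps the combined score standard normal under the null hypothesis, so down-weighting a component shrinks its influence without distorting the scale of the result.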

Unsupervised gene set testing can provide important information about the biological signal held in high-dimensional genomic data sets. Because it uses the association between gene sets and sample PCs to generate a measure of unsupervised enrichment, the SGSE method is independent of cluster or network creation algorithms and, most importantly, is able to utilize the statistical significance of PC eigenvalues to ignore elements of the data most likely to represent noise.