Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Tuesday, August 04, 2015

The role of artificial intelligence in precision medicine

Human health is the result of the interplay between many genetic factors, many environmental factors, and the complexity of our biological hierarchy from gene regulation to biochemical pathways to physiological systems. Understanding this complex genetic architecture is key for precision medicine since combinations of etiological factors naturally define small subgroups of subjects with the same risk for disease or treatment outcome. I have written extensively about this throughout my career in peer-reviewed publications and on this blog.

I gave an invited talk on this topic a few weeks ago at the "Leveraging Big Data and Knowledge to Fight Disease" symposium held at the New York Academy of Sciences in New York City. I spoke about our work on using artificial intelligence (AI) and machine learning for identifying combinations of risk factors from big data to advance our national precision medicine agenda. Rebecca Harrington from Popular Science magazine wrote this piece about the symposium and mentioned our work several times. Our EMERGENT algorithm is able to generate machine learning models of disease susceptibility that can take any mathematical form while at the same time learning the best way to do so. This latter feature moves the algorithm from the machine learning space to AI because it mimics how humans solve problems using their expert knowledge about both biological and quantitative sciences. Our latest published work about this algorithm can be found here. Email me for a reprint.

Some of my general thoughts about this topic can be found in a recent open-access editorial in BioData Mining.

Monday, June 15, 2015

Contingency and entrenchment in protein evolution under purifying selection

Great paper in PNAS by Dr. Joshua Plotkin.

Shah P, McCandlish DM, Plotkin JB. Contingency and entrenchment in protein evolution under purifying selection. Proc Natl Acad Sci U S A. 2015 [PDF]


The phenotypic effect of an allele at one genetic site may depend on alleles at other sites, a phenomenon known as epistasis. Epistasis can profoundly influence the process of evolution in populations and shape the patterns of protein divergence across species. Whereas epistasis between adaptive substitutions has been studied extensively, relatively little is known about epistasis under purifying selection. Here we use computational models of thermodynamic stability in a ligand-binding protein to explore the structure of epistasis in simulations of protein sequence evolution. Even though the predicted effects on stability of random mutations are almost completely additive, the mutations that fix under purifying selection are enriched for epistasis. In particular, the mutations that fix are contingent on previous substitutions: Although nearly neutral at their time of fixation, these mutations would be deleterious in the absence of preceding substitutions. Conversely, substitutions under purifying selection are subsequently entrenched by epistasis with later substitutions: They become increasingly deleterious to revert over time. Our results imply that, even under purifying selection, protein sequence evolution is often contingent on history and so it cannot be predicted by the phenotypic effects of mutations assayed in the ancestral background.
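As a small illustration of what "additive" versus "epistatic" means here, the sketch below measures pairwise epistasis as the deviation of a double mutant from the sum of its single-mutant effects. The function name and the stability values are hypothetical and are not taken from the paper.

```python
def epistasis(w_ref, w_a, w_b, w_ab):
    """Deviation of the double mutant from the additive expectation.

    w_ref is the reference background, w_a and w_b are the single
    mutants, and w_ab is the double mutant (all in the same units).
    """
    additive_expectation = (w_a - w_ref) + (w_b - w_ref)
    observed = w_ab - w_ref
    return observed - additive_expectation


# Hypothetical stability effects in arbitrary units (illustrative only).
additive = epistasis(0.0, -1.0, -2.0, -3.0)    # effects simply add up
contingent = epistasis(0.0, -1.0, -2.0, -0.5)  # B costs far less once A is present

print(additive, contingent)
```

A value of zero means the two mutations act independently; a nonzero value means the effect of one depends on the presence of the other, which is the kind of dependence the abstract describes as contingency.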

Monday, April 06, 2015

Spectral gene set enrichment (SGSE)

Our new spectral gene set enrichment (SGSE) method has been published.

Frost HR, Li Z, Moore JH. Spectral gene set enrichment (SGSE). BMC Bioinformatics. 2015 Mar 3;16:70. [PubMed]


Gene set testing is typically performed in a supervised context to quantify the association between groups of genes and a clinical phenotype. In many cases, however, a gene set-based interpretation of genomic data is desired in the absence of a phenotype variable. Although methods exist for unsupervised gene set testing, they predominantly compute enrichment relative to clusters of the genomic variables with performance strongly dependent on the clustering algorithm and number of clusters.

We propose a novel method, spectral gene set enrichment (SGSE), for unsupervised competitive testing of the association between gene sets and empirical data sources. SGSE first computes the statistical association between gene sets and principal components (PCs) using our principal component gene set enrichment (PCGSE) method. The overall statistical association between each gene set and the spectral structure of the data is then computed by combining the PC-level p-values using the weighted Z-method with weights set to the PC variance scaled by Tracy-Widom test p-values. Using simulated data, we show that the SGSE algorithm can accurately recover spectral features from noisy data. To illustrate the utility of our method on real data, we demonstrate the superior performance of the SGSE method relative to standard cluster-based techniques for testing the association between MSigDB gene sets and the variance structure of microarray gene expression data.

Unsupervised gene set testing can provide important information about the biological signal held in high-dimensional genomic data sets. Because it uses the association between gene sets and sample PCs to generate a measure of unsupervised enrichment, the SGSE method is independent of cluster or network creation algorithms and, most importantly, is able to utilize the statistical significance of PC eigenvalues to ignore elements of the data most likely to represent noise.
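The combination step mentioned in the abstract, the weighted Z-method, can be sketched in a few lines. This is a generic version of that method under stated assumptions, not the SGSE implementation: the function name is hypothetical, the p-values are treated as one-sided and strictly between 0 and 1, and the weights are supplied by the caller (in SGSE they would be PC variances scaled by Tracy-Widom p-values, which this sketch does not compute).

```python
from math import sqrt
from statistics import NormalDist


def weighted_z(p_values, weights):
    """Combine one-sided p-values with the weighted Z-method.

    Each p-value is converted to a standard normal score, the scores are
    averaged with the given weights, and the result is mapped back to a
    p-value. Inputs must satisfy 0 < p < 1.
    """
    norm = NormalDist()
    z_scores = [norm.inv_cdf(1.0 - p) for p in p_values]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sqrt(sum(w * w for w in weights))
    return 1.0 - norm.cdf(numerator / denominator)


# Hypothetical PC-level p-values for one gene set, with made-up weights.
pc_p_values = [0.01, 0.20, 0.60]
pc_weights = [5.0, 2.0, 0.5]
combined_p = weighted_z(pc_p_values, pc_weights)
print(combined_p)
```

Because the first PC carries the largest weight in this toy example, its small p-value dominates the combined result, which mirrors the idea of down-weighting PCs that are likely to represent noise.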

Tuesday, March 24, 2015

Biomedical Informatics Faculty Positions at the University of Pennsylvania

I recently moved my research lab to the Perelman School of Medicine at the University of Pennsylvania where I serve as Director of the Institute for Biomedical Informatics (IBI). One of my goals is to increase the faculty base in informatics. We are recruiting faculty across all ranks and across the spectrum of biomedical informatics including bioinformatics, translational bioinformatics, clinical informatics, clinical research informatics, consumer health informatics, and public health informatics. More information can be found here.

Tuesday, February 24, 2015

Great feature selection method for detecting epistasis using random forests

This is a really neat approach that is worth exploring for using machine learning methods such as random forests for the detection and modeling of statistical epistasis in genetic studies of human health.

Holzinger ER, Szymczak S, Dasgupta A, Malley J, Li Q, Bailey-Wilson JE. Variable selection method for the identification of epistatic models. Pac Symp Biocomput. 2015;20:195-206. [PDF]


Standard analysis methods for genome wide association studies (GWAS) are not robust to complex disease models, such as interactions between variables with small main effects. These types of effects likely contribute to the heritability of complex human traits. Machine learning methods that are capable of identifying interactions, such as Random Forests (RF), are an alternative analysis approach. One caveat to RF is that there is no standardized method of selecting variables so that false positives are reduced while retaining adequate power. To this end, we have developed a novel variable selection method called relative recurrency variable importance metric (r2VIM). This method incorporates recurrency and variance estimation to assist in optimal threshold selection. For this study, we specifically address how this method performs in data with almost completely epistatic effects (i.e. no marginal effects). Our results show that with appropriate parameter settings, r2VIM can identify interaction effects when the marginal effects are virtually nonexistent. It also outperforms logistic regression, which has essentially no power under this type of model when the number of potential features (genetic variants) is large. (All Supplementary Data can be found here: http://research.nhgri.nih.gov/manuscripts/Bailey-Wilson/r2VIM_epi/).
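To see why single-variant tests struggle with "almost completely epistatic effects", consider a deliberately simplified toy model: two binary loci with equally frequent genotypes, where disease occurs only when the two loci differ (an XOR pattern). The sketch below is an assumption-laden illustration of that idea, not the r2VIM method or the simulation design used in the paper.

```python
from itertools import product


def xor_penetrance(g1, g2):
    """Hypothetical purely epistatic model: risk is 1 when the loci differ."""
    return 1.0 if g1 != g2 else 0.0


# All two-locus genotype combinations, assumed equally frequent.
genotypes = list(product([0, 1], repeat=2))


def marginal_risk(locus, value):
    """Average risk among genotypes carrying `value` at `locus`."""
    rows = [g for g in genotypes if g[locus] == value]
    return sum(xor_penetrance(*g) for g in rows) / len(rows)


marginals = {(locus, v): marginal_risk(locus, v) for locus in (0, 1) for v in (0, 1)}
joint = {g: xor_penetrance(*g) for g in genotypes}

print(marginals)  # every single-locus risk is identical
print(joint)      # the two-locus table separates cases perfectly
```

Each locus considered alone shows the same risk for both alleles, so a one-variant-at-a-time test sees nothing, while the joint genotype determines the outcome completely.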


Friday, February 20, 2015

Is Big Data a 21st Century Maginot Line?

We have just published this open access editorial in BioData Mining on whether 'big data' is a 21st century Maginot line. This is relevant because we as scientists sometimes let the data define the research questions rather than the other way around. As the size and complexity of data grow, we may find ourselves asking simpler and simpler questions, only some of which are important to advancing our understanding of human health and disease.

Huang X, Jennings SF, Bruce B, Buchan A, Cai L, Chen P, Cramer CL, Guan W, Hilgert UK, Jiang H, Li Z, McClure G, McMullen DF, Nanduri B, Perkins A, Rekepalli B, Salem S, Specker J, Walker K, Wunsch D, Xiong D, Zhang S, Zhang Y, Zhao Z, Moore JH. Big data - a 21st century science Maginot Line? No-boundary thinking: shifting from the big data paradigm. BioData Min. 2015 Feb 6;8:7. [PDF]

See also our previous related essay on 'no boundary thinking' in bioinformatics.

Huang X, Bruce B, Buchan A, Congdon CB, Cramer CL, Jennings SF, Jiang H, Li Z, McClure G, McMullen R, Moore JH, Nanduri B, Peckham J, Perkins A, Polson SW, Rekepalli B, Salem S, Specker J, Wunsch D, Xiong D, Zhang S, Zhao Z. No-boundary thinking in bioinformatics research. BioData Min. 2013 Nov 6;6(1):19. [PDF]


Saturday, January 31, 2015

Epistasis: Methods and Protocols

Our new edited volume on epistasis.

This volume presents a valuable and readily reproducible collection of established and emerging techniques on modern genetic analyses. Chapters focus on statistical or data mining analyses, genetic architecture, the burden of multiple testing, genetic variance, measuring epistasis, multifactor dimensionality reduction, and ReliefF. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, lists of the necessary materials and reagents, step-by-step, readily reproducible laboratory protocols, and key tips on troubleshooting and avoiding known pitfalls.

Saturday, January 03, 2015

Heuristic identification of biological architectures for simulating complex hierarchical genetic interactions

Moore JH, Amos R, Kiralis J, Andrews PC. Heuristic identification of biological architectures for simulating complex hierarchical genetic interactions. Genet Epidemiol. 2015 Jan;39(1):25-34. [PubMed]


Simulation plays an essential role in the development of new computational and statistical methods for the genetic analysis of complex traits. Most simulations start with a statistical model using methods such as linear or logistic regression that specify the relationship between genotype and phenotype. This is appealing due to its simplicity and because these statistical methods are commonly used in genetic analysis. It is our working hypothesis that simulations need to move beyond simple statistical models to more realistically represent the biological complexity of genetic architecture. The goal of the present study was to develop a prototype genotype-phenotype simulation method and software that are capable of simulating complex genetic effects within the context of a hierarchical biology-based framework. Specifically, our goal is to simulate multilocus epistasis or gene-gene interaction where the genetic variants are organized within the framework of one or more genes, their regulatory regions and other regulatory loci. We introduce here the Heuristic Identification of Biological Architectures for simulating Complex Hierarchical Interactions (HIBACHI) method and prototype software for simulating data in this manner. This approach combines a biological hierarchy, a flexible mathematical framework, a liability threshold model for defining disease endpoints, and a heuristic search strategy for identifying high-order epistatic models of disease susceptibility. We provide several simulation examples using genetic models exhibiting independent main effects and three-way epistatic effects.
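The liability threshold idea mentioned in the abstract can be illustrated with a toy example: compute a liability score from a multilocus genotype and call a subject affected when the score exceeds a threshold. Everything below is a hypothetical sketch (the product form of the liability, the threshold, and the function names are assumptions) and is not the HIBACHI software or its mathematical framework.

```python
def liability(genotype):
    """Hypothetical liability: product of three genotype codes (0, 1, or 2)."""
    g1, g2, g3 = genotype
    return g1 * g2 * g3


def affected(genotype, threshold=1):
    """Liability threshold rule: affected when liability exceeds the threshold."""
    return liability(genotype) > threshold


# Illustrative three-locus genotypes coded as copies of the minor allele.
examples = [(1, 1, 1), (2, 1, 1), (2, 2, 0)]
statuses = [affected(g) for g in examples]
print(statuses)
```

In this toy model a subject carrying two copies at two loci but none at the third is unaffected, so disease status depends on all three loci jointly rather than on any one of them alone.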

Tuesday, December 16, 2014

SNP characteristics predict replication success in association studies

Gorlov IP, Moore JH, Peng B, Jin JL, Gorlova OY, Amos CI. SNP characteristics predict replication success in association studies. Hum Genet. 2014 Dec;133(12):1477-86. [PubMed]


Successful independent replication is the most direct approach for distinguishing real genotype-disease associations from false discoveries in genome-wide association studies (GWAS). Selecting SNPs for replication has been primarily based on P values from the discovery stage, although additional characteristics of SNPs may be used to improve replication success. We used disease-associated SNPs from more than 2,000 published GWASs to identify predictors of SNP reproducibility. SNP reproducibility was defined as a proportion of successful replications among all replication attempts. The study reporting association for the first time was considered to be discovery and all consequent studies targeting the same phenotype replications. We found that -Log(P), where P is a P value from the discovery study, is the strongest predictor of the SNP reproducibility. Other significant predictors include type of the SNP (e.g., missense vs intronic SNPs) and minor allele frequency. Features of the genes linked to the disease-associated SNP also predict SNP reproducibility. Based on empirically defined rules, we developed a reproducibility score (RS) to predict SNP reproducibility independently of -Log(P). We used data from two lung cancer GWAS studies as well as recently reported disease-associated SNPs to validate RS. Minus Log(P) outperforms RS when the very top SNPs are selected, while RS works better with relaxed selection criteria. In conclusion, we propose an empirical model to predict SNP reproducibility, which can be used to select SNPs for validation and prioritization.
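The two quantities at the center of the abstract are simple to compute. The sketch below assumes base-10 logarithms for -Log(P), which the abstract does not state, and uses hypothetical function names and made-up counts; it does not implement the reproducibility score (RS) itself.

```python
from math import log10


def reproducibility(successes, attempts):
    """Proportion of successful replications among all replication attempts."""
    if attempts <= 0:
        raise ValueError("at least one replication attempt is required")
    return successes / attempts


def minus_log_p(p_value):
    """-log10 of the discovery-stage P value (base 10 is an assumption)."""
    return -log10(p_value)


# Hypothetical SNP: discovery P = 1e-8, replicated in 3 of 4 follow-up studies.
snp_reproducibility = reproducibility(3, 4)
snp_strength = minus_log_p(1e-8)
print(snp_reproducibility, snp_strength)
```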

Tuesday, November 04, 2014

The effects of recombination on phenotypic exploration and robustness in evolution

Hu T, Banzhaf W, Moore JH. The effects of recombination on phenotypic exploration and robustness in evolution. Artif Life. 2014 Fall;20(4):457-70. [IEEE]


Recombination is a commonly used genetic operator in artificial and computational evolutionary systems. It has been empirically shown to be essential for evolutionary processes. However, little has been done to analyze the effects of recombination on quantitative genotypic and phenotypic properties. The majority of studies only consider mutation, mainly due to the more serious consequences of recombination in reorganizing entire genomes. Here we adopt methods from evolutionary biology to analyze a simple, yet representative, genetic programming method, linear genetic programming. We demonstrate that recombination has less disruptive effects on phenotype than mutation, that it accelerates novel phenotypic exploration, and that it particularly promotes robust phenotypes and evolves genotypic robustness and synergistic epistasis. Our results corroborate an explanation for the prevalence of recombination in complex living organisms, and help elucidate a better understanding of the evolutionary mechanisms involved in the design of complex artificial evolutionary systems and intelligent algorithms.