Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Sunday, October 04, 2015

ExSTraCS 2.0: Description and Evaluation of a Scalable Learning Classifier System

Our latest paper covers the development, evaluation, and application of learning classifier systems (LCS) for detecting and characterizing heterogeneous genetic effects. This work was led by my former graduate student and postdoc Dr. Ryan Urbanowicz. See his web page for more information.

Urbanowicz RJ, Moore JH. ExSTraCS 2.0: Description and Evaluation of a Scalable Learning Classifier System. Evol Intell. 2015 Sep;8(2):89-116. [PubMed] [Springer]


Algorithmic scalability is a major concern for any machine learning strategy in this age of 'big data'. A large number of potentially predictive attributes is emblematic of problems in bioinformatics, genetic epidemiology, and many other fields. Previously, ExSTraCS was introduced as an extended Michigan-style supervised learning classifier system that combined a set of powerful heuristics to successfully tackle the challenges of classification, prediction, and knowledge discovery in complex, noisy, and heterogeneous problem domains. While Michigan-style learning classifier systems are powerful and flexible learners, they are not considered to be particularly scalable. For the first time, this paper presents a complete description of the ExSTraCS algorithm and introduces an effective strategy to dramatically improve learning classifier system scalability. ExSTraCS 2.0 addresses scalability with (1) a rule specificity limit, (2) new approaches to expert knowledge guided covering and mutation mechanisms, and (3) the implementation and utilization of the TuRF algorithm for improving the quality of expert knowledge discovery in larger datasets. Performance over a complex spectrum of simulated genetic datasets demonstrated that these new mechanisms dramatically improve nearly every performance metric on datasets with 20 attributes and made it possible for ExSTraCS to reliably scale up to perform on related 200- and 2000-attribute datasets. ExSTraCS 2.0 was also able to reliably solve the 6-, 11-, 20-, 37-, 70-, and 135-bit multiplexer problems, and did so in similar or fewer learning iterations than previously reported, with smaller finite training sets, and without using building blocks discovered from simpler multiplexer problems. Furthermore, ExSTraCS usability was made simpler through the elimination of previously critical run parameters.
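
For readers who have not met the multiplexer benchmark mentioned in the abstract, here is a minimal sketch of the standard definition: the first k bits form an address that selects one of the remaining 2^k data bits, so the problem sizes are k + 2^k. The function name and the convention of placing the address bits first are assumptions made for illustration; this is not code from ExSTraCS.

```python
def multiplexer(bits, k):
    """Return the data bit selected by the first k address bits.

    `bits` is a sequence of 0/1 values of length k + 2**k.
    Placing the address bits first is an assumed convention.
    """
    if len(bits) != k + 2 ** k:
        raise ValueError("expected k + 2**k bits")
    # Read the address bits as a binary number (most significant bit first).
    address = int("".join(str(b) for b in bits[:k]), 2)
    # The data bits start right after the k address bits.
    return bits[k + address]


# Problem sizes for k = 2..7 address bits.
sizes = [k + 2 ** k for k in range(2, 8)]
print(sizes)

# 6-bit example: address bits 0,1 select data bit number 1.
print(multiplexer([0, 1, 1, 0, 1, 1], 2))
```

The class of an instance depends on a different subset of attributes for each address, which is why the benchmark is a useful stand-in for heterogeneous, interacting effects.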

Friday, September 25, 2015

A Genome-Wide Association Analysis Reveals Epistatic Cancellation of Additive Genetic Variance for Root Length in Arabidopsis thaliana

Great new paper in PLOS Genetics documenting epistasis in plants.

Lachowiec J, Shen X, Queitsch C, Carlborg Ö. A Genome-Wide Association Analysis Reveals Epistatic Cancellation of Additive Genetic Variance for Root Length in Arabidopsis thaliana. PLoS Genet. 2015 Sep 23;11(9):e1005541. [PubMed]


Efforts to identify loci underlying complex traits generally assume that most genetic variance is additive. Here, we examined the genetics of Arabidopsis thaliana root length and found that the genomic narrow-sense heritability for this trait in the examined population was statistically zero. The low amount of additive genetic variance that could be captured by the genome-wide genotypes likely explains why no associations to root length could be found using standard additive-model-based genome-wide association (GWA) approaches. However, as the broad-sense heritability for root length was significantly larger, and primarily due to epistasis, we also performed an epistatic GWA analysis to map loci contributing to the epistatic genetic variance. Four interacting pairs of loci were revealed, involving seven chromosomal loci that passed a standard multiple-testing corrected significance threshold. The genotype-phenotype maps for these pairs revealed epistasis that cancelled out the additive genetic variance, explaining why these loci were not detected in the additive GWA analysis. Small population sizes, such as in our experiment, increase the risk of identifying false epistatic interactions due to testing for associations with very large numbers of multi-marker genotypes in few phenotyped individuals. Therefore, we estimated the false-positive risk using a new statistical approach that suggested half of the associated pairs to be true positive associations. Our experimental evaluation of candidate genes within the seven associated loci suggests that this estimate is conservative; we identified functional candidate genes that affected root development in four loci that were part of three of the pairs. The statistical epistatic analyses were thus indispensable for confirming known, and identifying new, candidate genes for root length in this population of wild-collected A. thaliana accessions.
We also illustrate how epistatic cancellation of the additive genetic variance explains the insignificant narrow-sense and significant broad-sense heritability by using a combination of careful statistical epistatic analyses and functional genetic experiments.
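
The cancellation described in the abstract can be illustrated with a toy two-locus example. This is a minimal sketch and not the paper's analysis: the genotype-phenotype map, the allele frequency of 0.5, and the assumption that the two loci are independent and fully inbred (each locus coded 0 or 1) are all hypothetical choices made so the arithmetic is easy to follow.

```python
from itertools import product

# Hypothetical two-locus genotype-phenotype map for inbred lines.
# The phenotype is high only when the two loci carry different alleles.
phenotype = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}
freq = 0.5  # assumed frequency of allele 1 at both (independent) loci


def genotype_prob(genotype):
    """Probability of a two-locus genotype under independence."""
    prob = 1.0
    for allele in genotype:
        prob *= freq if allele == 1 else 1.0 - freq
    return prob


genotypes = list(product((0, 1), repeat=2))
mean = sum(genotype_prob(g) * phenotype[g] for g in genotypes)
total_var = sum(genotype_prob(g) * (phenotype[g] - mean) ** 2 for g in genotypes)


def additive_effect(locus):
    """Regression slope of phenotype on the allele at one locus."""
    cov = sum(
        genotype_prob(g) * (g[locus] - freq) * (phenotype[g] - mean)
        for g in genotypes
    )
    return cov / (freq * (1.0 - freq))


effects = [additive_effect(locus) for locus in range(2)]
# With independent loci the additive variance is the sum of per-locus terms.
additive_var = sum(b ** 2 * freq * (1.0 - freq) for b in effects)
epistatic_var = total_var - additive_var

print("additive effects:", effects)
print("total genetic variance:", total_var)
print("additive variance:", additive_var)
print("epistatic variance:", epistatic_var)
```

Under these assumptions each locus has no marginal effect, so an additive single-locus scan sees nothing even though the pair explains all of the genetic variance.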

Sunday, September 13, 2015

The role of visualization and 3-D printing in biological data mining

Visualization and visual analytics are the future of informatics. We explore here the role of visualization and 3-D printing in biological data mining with application to statistical epistasis networks.

Weiss TL, Zieselman A, Hill DP, Diamond SG, Shen L, Saykin AJ, Moore JH; Alzheimer’s Disease Neuroimaging Initiative. The role of visualization and 3-D printing in biological data mining. BioData Min. 2015 Aug 5;8:22. [PDF]


Biological data mining is a powerful tool that can provide a wealth of information about patterns of genetic and genomic biomarkers of health and disease. A potential disadvantage of data mining is that the volume and complexity of the results can often be overwhelming. It is our working hypothesis that visualization methods can greatly enhance our ability to make sense of data mining results. More specifically, we propose that 3-D printing has an important role to play as a visualization technology in biological data mining. We provide here a brief review of 3-D printing along with a case study to illustrate how it might be used in a research setting.

We present as a case study a genetic interaction network associated with grey matter density, an endophenotype for late onset Alzheimer's disease, as a physical model constructed with a 3-D printer. The synergy or interaction effects of multiple genetic variants were represented through a color gradient of the physical connections between nodes. The digital gene-gene interaction network was then 3-D printed to generate a physical network model.

The physical 3-D gene-gene interaction network provided an easily manipulated, intuitive and creative way to visualize the synergistic relationships between the genetic variants and grey matter density in patients with late onset Alzheimer's disease. We discuss the advantages and disadvantages of this novel method of biological data mining visualization.

Tuesday, August 04, 2015

The role of artificial intelligence in precision medicine

Human health is the result of the interplay between many genetic factors, many environmental factors, and the complexity of our biological hierarchy from gene regulation to biochemical pathways to physiological systems. Understanding this complex genetic architecture is key for precision medicine since combinations of etiological factors naturally define small subgroups of subjects with the same risk for disease or treatment outcome. I have written extensively about this throughout my career in peer-reviewed publications and on this blog.

I gave an invited talk on this topic a few weeks ago at the "Leveraging Big Data and Knowledge to Fight Disease" symposium held at the New York Academy of Sciences in New York City. I spoke about our work on using artificial intelligence (AI) and machine learning for identifying combinations of risk factors from big data to advance our national precision medicine agenda. Rebecca Harrington from Popular Science magazine wrote this piece about the symposium and mentioned our work several times. Our EMERGENT algorithm is able to generate machine learning models of disease susceptibility that can take any mathematical form while at the same time learning the best way to do so. This latter feature moves the algorithm from the machine learning space to AI because it mimics how humans solve problems using their expert knowledge about both biological and quantitative sciences. Our latest published work about this algorithm can be found here. Email me for a reprint.

Some of my general thoughts about this topic can be found in a recent open-access editorial in BioData Mining.

Monday, June 15, 2015

Contingency and entrenchment in protein evolution under purifying selection

Great paper in PNAS by Dr. Joshua Plotkin.

Shah P, McCandlish DM, Plotkin JB. Contingency and entrenchment in protein evolution under purifying selection. Proc Natl Acad Sci U S A. 2015 [PDF]


The phenotypic effect of an allele at one genetic site may depend on alleles at other sites, a phenomenon known as epistasis. Epistasis can profoundly influence the process of evolution in populations and shape the patterns of protein divergence across species. Whereas epistasis between adaptive substitutions has been studied extensively, relatively little is known about epistasis under purifying selection. Here we use computational models of thermodynamic stability in a ligand-binding protein to explore the structure of epistasis in simulations of protein sequence evolution. Even though the predicted effects on stability of random mutations are almost completely additive, the mutations that fix under purifying selection are enriched for epistasis. In particular, the mutations that fix are contingent on previous substitutions: Although nearly neutral at their time of fixation, these mutations would be deleterious in the absence of preceding substitutions. Conversely, substitutions under purifying selection are subsequently entrenched by epistasis with later substitutions: They become increasingly deleterious to revert over time. Our results imply that, even under purifying selection, protein sequence evolution is often contingent on history and so it cannot be predicted by the phenotypic effects of mutations assayed in the ancestral background.

Monday, April 06, 2015

Spectral gene set enrichment (SGSE)

Our new spectral gene set enrichment (SGSE) method has been published.

Frost HR, Li Z, Moore JH. Spectral gene set enrichment (SGSE). BMC Bioinformatics. 2015 Mar 3;16:70. [PubMed]


Gene set testing is typically performed in a supervised context to quantify the association between groups of genes and a clinical phenotype. In many cases, however, a gene set-based interpretation of genomic data is desired in the absence of a phenotype variable. Although methods exist for unsupervised gene set testing, they predominantly compute enrichment relative to clusters of the genomic variables with performance strongly dependent on the clustering algorithm and number of clusters.

We propose a novel method, spectral gene set enrichment (SGSE), for unsupervised competitive testing of the association between gene sets and empirical data sources. SGSE first computes the statistical association between gene sets and principal components (PCs) using our principal component gene set enrichment (PCGSE) method. The overall statistical association between each gene set and the spectral structure of the data is then computed by combining the PC-level p-values using the weighted Z-method with weights set to the PC variance scaled by Tracy-Widom test p-values. Using simulated data, we show that the SGSE algorithm can accurately recover spectral features from noisy data. To illustrate the utility of our method on real data, we demonstrate the superior performance of the SGSE method relative to standard cluster-based techniques for testing the association between MSigDB gene sets and the variance structure of microarray gene expression data.

Unsupervised gene set testing can provide important information about the biological signal held in high-dimensional genomic data sets. Because it uses the association between gene sets and sample PCs to generate a measure of unsupervised enrichment, the SGSE method is independent of cluster or network creation algorithms and, most importantly, is able to utilize the statistical significance of PC eigenvalues to ignore elements of the data most likely to represent noise.
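
The combining step named in the abstract, the weighted Z-method, can be sketched in a few lines. This is a generic illustration of that method using only the Python standard library, not the SGSE implementation: the function name is hypothetical, the p-values and weights in the example are made up, and deriving the weights from PC variance and Tracy-Widom p-values is left out entirely.

```python
from math import sqrt
from statistics import NormalDist


def weighted_z(p_values, weights):
    """Combine one-sided p-values with the weighted Z-method.

    Each p-value must lie strictly between 0 and 1, otherwise
    NormalDist.inv_cdf raises an error.
    """
    if len(p_values) != len(weights):
        raise ValueError("p_values and weights must have the same length")
    std_normal = NormalDist()
    # Convert each p-value to a Z-score (small p gives a large positive Z).
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    # Weighted sum of Z-scores, rescaled to unit variance.
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / sqrt(
        sum(w * w for w in weights)
    )
    return 1.0 - std_normal.cdf(combined_z)


# Hypothetical PC-level p-values for one gene set, with made-up weights.
pc_p_values = [0.001, 0.5]
print(weighted_z(pc_p_values, [10.0, 1.0]))  # weight on the significant PC
print(weighted_z(pc_p_values, [1.0, 10.0]))  # weight on the null PC
```

Shifting weight toward a PC judged to be noise pulls the combined p-value toward that PC's result, which is the behaviour the weighting scheme in the abstract relies on.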

Tuesday, March 24, 2015

Biomedical Informatics Faculty Positions at the University of Pennsylvania

I recently moved my research lab to the Perelman School of Medicine at the University of Pennsylvania where I serve as Director of the Institute for Biomedical Informatics (IBI). One of my goals is to increase the faculty base in informatics. We are recruiting faculty across all ranks and across the spectrum of biomedical informatics including bioinformatics, translational bioinformatics, clinical informatics, clinical research informatics, consumer health informatics, and public health informatics. More information can be found here.

Tuesday, February 24, 2015

Great feature selection method for detecting epistasis using random forests

This is a really neat approach that is worth exploring for using machine learning methods such as random forests for the detection and modeling of statistical epistasis in genetic studies of human health.

Holzinger ER, Szymczak S, Dasgupta A, Malley J, Li Q, Bailey-Wilson JE. Variable selection method for the identification of epistatic models. Pac Symp Biocomput. 2015;20:195-206. [PDF]


Standard analysis methods for genome wide association studies (GWAS) are not robust to complex disease models, such as interactions between variables with small main effects. These types of effects likely contribute to the heritability of complex human traits. Machine learning methods that are capable of identifying interactions, such as Random Forests (RF), are an alternative analysis approach. One caveat to RF is that there is no standardized method of selecting variables so that false positives are reduced while retaining adequate power. To this end, we have developed a novel variable selection method called relative recurrency variable importance metric (r2VIM). This method incorporates recurrency and variance estimation to assist in optimal threshold selection. For this study, we specifically address how this method performs in data with almost completely epistatic effects (i.e. no marginal effects). Our results show that with appropriate parameter settings, r2VIM can identify interaction effects when the marginal effects are virtually nonexistent. It also outperforms logistic regression, which has essentially no power under this type of model when the number of potential features (genetic variants) is large. (All Supplementary Data can be found here: http://research.nhgri.nih.gov/manuscripts/Bailey-Wilson/r2VIM_epi/).
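
To give a feel for what recurrency-based selection might look like, here is a loose sketch of the general idea and not the published r2VIM procedure. It assumes that each random forest run yields one importance score per variable, that scores are scaled by the absolute value of the smallest score in that run (taken as an estimate of noise), and that a variable is kept only if its scaled score clears a threshold in every run. The function name, the threshold values, and all numbers below are hypothetical.

```python
def recurrent_selection(runs, factor=1.0):
    """Keep variables whose scaled importance is >= factor in every run.

    `runs` is a list of dicts mapping variable name to importance score,
    one dict per random forest run (scores supplied by the caller).
    """
    selected = None
    for importances in runs:
        # Assumed noise estimate: magnitude of the smallest score in this run.
        baseline = abs(min(importances.values()))
        if baseline == 0:
            raise ValueError("minimum importance is zero; cannot scale")
        passing = {
            name for name, score in importances.items()
            if score / baseline >= factor
        }
        # Recurrency: a variable must pass in all runs seen so far.
        selected = passing if selected is None else selected & passing
    return selected or set()


# Made-up importance scores from two runs.
runs = [
    {"snp1": 0.9, "snp2": 0.8, "noise1": -0.2, "noise2": 0.25},
    {"snp1": 1.0, "snp2": 0.7, "noise1": 0.1, "noise2": -0.3},
]
print(sorted(recurrent_selection(runs, factor=1.0)))
print(sorted(recurrent_selection(runs, factor=3.0)))
```

Raising the factor trades power for fewer false positives, which is the threshold choice the abstract says the method is meant to assist.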

Friday, February 20, 2015

Is Big Data a 21st Century Maginot Line?

We have just published this open access editorial in BioData Mining on whether 'big data' is a 21st century Maginot line. This is relevant because we as scientists sometimes let the data define the research questions rather than the other way around. As the size and complexity of data grow, we may find ourselves asking simpler and simpler questions, only some of which are important to advancing our understanding of human health and disease.

Huang X, Jennings SF, Bruce B, Buchan A, Cai L, Chen P, Cramer CL, Guan W, Hilgert UK, Jiang H, Li Z, McClure G, McMullen DF, Nanduri B, Perkins A, Rekepalli B, Salem S, Specker J, Walker K, Wunsch D, Xiong D, Zhang S, Zhang Y, Zhao Z, Moore JH. Big data - a 21st century science Maginot Line? No-boundary thinking: shifting from the big data paradigm. BioData Min. 2015 Feb 6;8:7. [PDF]

See also our previous related essay on 'no boundary thinking' in bioinformatics.

Huang X, Bruce B, Buchan A, Congdon CB, Cramer CL, Jennings SF, Jiang H, Li Z, McClure G, McMullen R, Moore JH, Nanduri B, Peckham J, Perkins A, Polson SW, Rekepalli B, Salem S, Specker J, Wunsch D, Xiong D, Zhang S, Zhao Z. No-boundary thinking in bioinformatics research. BioData Min. 2013 Nov 6;6(1):19. [PDF]

Saturday, January 31, 2015

Epistasis: Methods and Protocols

Our new edited volume on epistasis.

This volume presents a valuable and readily reproducible collection of established and emerging techniques on modern genetic analyses. Chapters focus on statistical or data mining analyses, genetic architecture, the burden of multiple testing, genetic variance, measuring epistasis, multifactor dimensionality reduction, and ReliefF. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, lists of the necessary materials and reagents, step-by-step, readily reproducible laboratory protocols, and key tips on troubleshooting and avoiding known pitfalls.