Epistasis Blog

From the Artificial Intelligence Innovation Lab at Cedars-Sinai Medical Center (www.epistasis.org)

Saturday, January 30, 2010

Whole Genome Association Study of Brain-Wide Imaging Phenotypes

We did a neat cluster analysis in this paper. Combining GWAS data with brain imaging phenotypes is a challenge.

Shen L, Kim S, Risacher SL, Nho K, Swaminathan S, West JD, Foroud T, Pankratz N, Moore JH, Sloan CD, Huentelman MJ, Craig DW, Dechairo BM, Potkin SG, Jack CR Jr, Weiner MW, Saykin AJ; Alzheimer’s Disease Neuroimaging Initiative. Whole Genome Association Study of Brain-Wide Imaging Phenotypes for Identifying Quantitative Trait Loci in MCI and AD: A Study of the ADNI Cohort. Neuroimage. 2010 Jan 22. [PubMed] PMID: 20100581.


A genome-wide, whole brain approach to investigate genetic effects on neuroimaging phenotypes for identifying quantitative trait loci is described. The Alzheimer's Disease Neuroimaging Initiative 1.5T MRI and genetic dataset was investigated using voxel-based morphometry (VBM) and FreeSurfer parcellation followed by genome wide association studies (GWAS). 142 measures of grey matter (GM) density, volume, and cortical thickness were extracted from baseline scans. GWAS, using PLINK, were performed on each phenotype using quality controlled genotype and scan data including 530,992 of 620,903 single nucleotide polymorphisms (SNPs) and 733 of 818 participants (175 AD, 354 amnestic mild cognitive impairment, MCI, and 204 healthy controls, HC). Hierarchical clustering and heat maps were used to analyze the GWAS results and associations are reported at two significance thresholds (p < 10^-7 and p < 10^-6). As expected, SNPs in the APOE and TOMM40 genes were confirmed as markers strongly associated with multiple brain regions. Other top SNPs were proximal to the EPHA4, TP63 and NXPH1 genes. Detailed image analyses of rs6463843 (flanking NXPH1) revealed reduced global and regional GM density across diagnostic groups in TT relative to GG homozygotes. Interaction analysis indicated that AD patients homozygous for the T allele showed differential vulnerability to right hippocampal GM density loss. NXPH1 codes for a protein implicated in promotion of adhesion between dendrites and axons, a key factor in synaptic integrity, the loss of which is a hallmark of AD. A genome wide, whole brain search strategy has the potential to reveal novel candidate genes and loci warranting further investigation and replication.

Tuesday, January 19, 2010

Genetics of diabetes reveals biology but does not improve prediction

I very much enjoyed this blog posting on www.phgfoundation.org. They discuss a new paper published in the British Medical Journal (below) that shows traditional risk factors do a much better job of predicting Type II Diabetes than 20 published SNPs. A quote from the post: "By assessing the area under the receiver operator characteristic curve (a plot of sensitivity versus 1-specificity, where a value of 1.0 represents a perfect test and 0.5 represents a useless test), the traditional models significantly outperformed the genetic model (around 0.75 versus 0.54), and their performance was not substantially improved by the addition of genetic risk factors." This comes as no surprise to me because the genetic studies that led to this test were all based on single-locus analyses that completely ignore the underlying complexity of this common disease. It is my working hypothesis that we will not be able to use genetics to predict disease risk until we embrace, rather than ignore, the complexity of the genetic architecture of common human diseases. We commented on this in a 2007 letter to Science (also below).

Talmud PJ, Hingorani AD, Cooper JA, Marmot MG, Brunner EJ, Kumari M, Kivimäki M, Humphries SE. Utility of genetic and non-genetic risk factors in prediction of type 2 diabetes: Whitehall II prospective cohort study. BMJ. 2010 Jan 14;340:b4838. doi: 10.1136/bmj.b4838. [PubMed] PMID: 20075150.

Williams SM, Canter JA, Crawford DC, Moore JH, Ritchie MD, Haines JL. Problems with genome-wide association studies. Science. 2007 Jun 29;316(5833):1840-2. [PubMed] PMID: 17605173.
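
For readers who want to see what the area under the ROC curve in the quote above actually measures, here is a minimal sketch. It uses the standard rank-based identity: the AUC equals the probability that a randomly chosen case gets a higher risk score than a randomly chosen control, with ties counted as one half. The function name `roc_auc` and the toy scores are made up for illustration and are not taken from the Whitehall II paper.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    scores: predicted risk for each individual (higher means more at risk)
    labels: 1 for a case, 0 for a control
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0      # case correctly ranked above control
            elif c == k:
                wins += 0.5      # a tie counts as a coin flip
    return wins / (len(cases) * len(controls))


# Toy data, invented for illustration only.
labels = [1, 1, 1, 0, 0, 0]
perfect = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]  # every case outranks every control
useless = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # the score carries no information

print(roc_auc(perfect, labels))  # 1.0, a perfect test
print(roc_auc(useless, labels))  # 0.5, a useless test
```

On this scale, the reported values of roughly 0.75 for the traditional models and 0.54 for the genetic model sit between those two extremes, with the genetic model close to the uninformative end.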

Saturday, January 16, 2010

Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application

This is a very nice paper. Hint for students: there might be a research project in there.

Cantor RM, Lange K, Sinsheimer JS. Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application. Am J Hum Genet. 2010 Jan 8;86(1):6-22. [PubMed]


Genome-wide association studies (GWAS) have rapidly become a standard method for disease gene discovery. A substantial number of recent GWAS indicate that for most disorders, only a few common variants are implicated and the associated SNPs explain only a small fraction of the genetic risk. This review is written from the viewpoint that findings from the GWAS provide preliminary genetic information that is available for additional analysis by statistical procedures that accumulate evidence, and that these secondary analyses are very likely to provide valuable information that will help prioritize the strongest constellations of results. We review and discuss three analytic methods to combine preliminary GWAS statistics to identify genes, alleles, and pathways for deeper investigations. Meta-analysis seeks to pool information from multiple GWAS to increase the chances of finding true positives among the false positives and provides a way to combine associations across GWAS, even when the original data are unavailable. Testing for epistasis within a single GWAS study can identify the stronger results that are revealed when genes interact. Pathway analysis of GWAS results is used to prioritize genes and pathways within a biological context. Following a GWAS, association results can be assigned to pathways and tested in aggregate with computational tools and pathway databases. Reviews of published methods with recommendations for their application are provided within the framework for each approach. Copyright © 2010 The American Society of Human Genetics.
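
As a small illustration of the meta-analysis idea in the abstract above (pooling per-study summary statistics when the original data are unavailable), here is a sketch of a fixed-effect, inverse-variance weighted meta-analysis for a single SNP. This is one common textbook approach and is not necessarily the one the review recommends; the function name `fixed_effect_meta` and the effect sizes are hypothetical.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted fixed-effect meta-analysis for one SNP.

    betas: per-study effect estimates (for example, log odds ratios)
    ses:   per-study standard errors of those estimates
    Returns the pooled estimate, its standard error, the z statistic,
    and a two-sided p-value from the normal approximation.
    """
    weights = [1.0 / se ** 2 for se in ses]          # precise studies count more
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))           # two-sided normal tail
    return beta, se, z, p


# Hypothetical summary statistics for the same SNP in three studies.
betas = [0.10, 0.12, 0.08]
ses = [0.05, 0.06, 0.04]

beta, se, z, p = fixed_effect_meta(betas, ses)
print(f"pooled beta={beta:.4f} se={se:.4f} z={z:.2f} p={p:.2g}")
```

Each study in this toy example has z = 2 on its own, which would not stand out in a genome-wide scan, but the pooled estimate is more precise than any single study. A fixed-effect model assumes the true effect is the same in every study, which is an assumption worth checking before trusting the pooled p-value.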

Wednesday, January 13, 2010

Epistatic Interactions

I don't agree with everything in this paper but it does provide some useful information.

VanderWeele, Tyler J. (2010) "Epistatic Interactions," Statistical Applications in Genetics and Molecular Biology: Vol. 9 : Iss. 1, Article 1. [PDF]


The term "epistasis" is sometimes used to describe some form of statistical interaction between genetic factors and is alternatively sometimes used to describe instances in which the effect of a particular genetic variant is masked by a variant at another locus. In general statistical tests for interaction are of limited use in detecting "epistasis" in the sense of masking. It is, however, shown that there are relations between empirical data patterns and epistasis that have not been previously noted. These relations can sometimes be exploited to empirically test for "epistatic interactions" in the sense of the masking of the effect of a particular genetic variant by a variant at another locus.

Tuesday, January 12, 2010

Multifactor Dimensionality Reduction for Graphics Processing Units Enables Genome-wide Testing of Epistasis in Sporadic ALS

Our new paper on using MDR on GPUs for GWAS analysis of epistasis has been accepted for publication in Bioinformatics. A preprint will be available soon. The GPU-MDR software is available from our website.

Casey S. Greene, Nicholas A. Sinnott-Armstrong, Daniel S. Himmelstein, Paul J. Park, Jason H. Moore, and Brent T. Harris. Multifactor Dimensionality Reduction for Graphics Processing Units Enables Genome-wide Testing of Epistasis in Sporadic ALS. Bioinformatics, in press (2010).


Motivation: Epistasis, the presence of gene-gene interactions, has been hypothesized to be at the root of many common human diseases, but current genome-wide association studies largely ignore its role. Multifactor dimensionality reduction (MDR) is a powerful model-free method for detecting epistatic relationships between genes but computational costs have made its application to genomewide data difficult. Graphics processing units (GPUs), the hardware responsible for rendering computer games, are powerful parallel processors. Using GPUs to run MDR on a genome-wide dataset allows for statistically rigorous testing of epistasis. Results: The implementation of MDR for GPUs (MDRGPU) includes core features of the widely used Java software package, MDR. This GPU implementation allows for large scale analysis of epistasis at a dramatically lower cost than the standard CPU based implementations. As a proof-of-concept, we applied this software to a genome-wide study of sporadic amyotrophic lateral sclerosis (ALS). We discovered a statistically significant two-SNP classifier and subsequently replicated the significance of these two SNPs in an independent study of ALS. MDRGPU makes the large scale analysis of epistasis tractable and opens the door to statistically rigorous testing of interactions in genome-wide datasets. Availability: MDRGPU is open source and available free of charge from http://www.sourceforge.net/projects/mdr.
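
For readers new to MDR, the core step can be sketched in a few lines of plain Python. For a given pair of SNPs, each two-locus genotype cell is labeled high-risk when its case:control ratio exceeds the ratio in the whole sample, and the resulting one-dimensional high/low variable is scored as a classifier. This is a simplified CPU sketch and not the MDRGPU code: it scores on the training data only and leaves out the cross-validation and permutation testing a real analysis would need. The function name `mdr_accuracy` and the toy data are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations, product


def mdr_accuracy(genotypes, status, snp_pair):
    """Balanced accuracy of the MDR high-risk/low-risk rule for one SNP pair.

    genotypes: one tuple of genotype codes per individual
    status:    1 for a case, 0 for a control
    snp_pair:  indices of the two SNPs to combine
    """
    a, b = snp_pair
    case_counts, control_counts = Counter(), Counter()
    for g, y in zip(genotypes, status):
        cell = (g[a], g[b])
        if y == 1:
            case_counts[cell] += 1
        else:
            control_counts[cell] += 1

    n_cases = sum(case_counts.values())
    n_controls = sum(control_counts.values())
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")

    # A cell is high-risk if its case:control ratio exceeds the overall ratio.
    threshold = n_cases / n_controls
    cells = set(case_counts) | set(control_counts)
    high_risk = {c for c in cells
                 if case_counts[c] > threshold * control_counts[c]}

    sensitivity = sum(case_counts[c] for c in high_risk) / n_cases
    specificity = sum(n for c, n in control_counts.items()
                      if c not in high_risk) / n_controls
    return 0.5 * (sensitivity + specificity)


# Toy data: disease status is the XOR of SNP 0 and SNP 1; SNP 2 is noise.
genotypes = list(product((0, 1), repeat=3))
status = [g[0] ^ g[1] for g in genotypes]

# Exhaustively score every pair, as MDR does on a much larger scale.
best = max(combinations(range(3), 2),
           key=lambda pair: mdr_accuracy(genotypes, status, pair))
print(best, mdr_accuracy(genotypes, status, best))
```

In the toy data neither SNP 0 nor SNP 1 predicts status on its own, yet the pair classifies perfectly. The exhaustive loop over pairs is where the cost comes from: the number of pairs grows with the square of the number of SNPs, and since each pair can be scored independently the work parallelizes naturally.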

Friday, January 08, 2010

A novel approach to simulate gene-environment interactions in complex diseases

This looks interesting and perhaps useful. Let me know if you try it.

Amato R, Pinelli M, D'Andrea D, Miele G, Nicodemi M, Raiconi G, Cocozza S. A novel approach to simulate gene-environment interactions in complex diseases. BMC Bioinformatics. 2010 Jan 5;11(1):8. [PubMed]


BACKGROUND: Complex diseases are multifactorial traits caused by both genetic and environmental factors. They represent the most part of human diseases and include those with largest prevalence and mortality (cancer, heart disease, obesity, etc.). Despite of a large amount of information that have been collected about both genetic and environmental risk factors, there are relatively few examples of studies on their interactions in epidemiological literature. One reason can be the incomplete knowledge of the power of statistical methods designed to search for risk factors and their interactions in this data sets. An improvement in this direction would lead to a better understanding and description of gene-environment interaction. To this aim, a possible strategy is to challenge the different statistical methods against data sets where the underlying phenomenon is completely known and fully controllable, like for example simulated ones. RESULTS: We present a mathematical approach that models gene-environment interactions. By this method it is possible to generate simulated populations having gene-environment interactions of any form, involving any number of genetic and environmental factors and also allowing non-linear interactions as epistasis. In particular, we implemented a simple version of this model in a Gene-Environment iNteraction Simulator (GENS), a tool designed to simulate case-control data sets where a one gene-one environment interaction influences the disease risk. The main effort has been to allow user to describe characteristics of population by using standard epidemiological measures and to implement constraints to make the simulator behavior biologically meaningful. CONCLUSIONS: By the multi-logistic model implemented in GENS it is possible to simulate case-control samples of complex disease where gene-environment interactions influence the disease risk. The user has a full control of the main characteristics of the simulated population and a Monte Carlo process allows random variability. A Knowledge-based approach reduces the complexity of the mathematical model by using reasonable biological constraints and makes the simulation more understandable in biological terms. Simulated data sets can be used for the assessment of novel statistical methods or for the evaluation of the statistical power when designing a study.
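
To make the general idea concrete, here is a minimal sketch of simulating a case-control sample in which one gene and one environmental exposure interact. It is not the GENS model: it uses a single ordinary logistic risk function with an interaction term, draws genotypes under Hardy-Weinberg proportions, and samples by Monte Carlo until the requested numbers of cases and controls are reached. The function name `simulate_case_control` and every parameter name and default value are assumptions chosen for illustration.

```python
import math
import random


def simulate_case_control(n_cases, n_controls, maf=0.3, exposure_freq=0.4,
                          b0=-3.0, b_g=0.0, b_e=0.0, b_ge=1.5, seed=0):
    """Simulate (genotype, exposure) pairs for cases and controls.

    Risk model (an illustrative assumption, not the GENS model):
        logit P(disease) = b0 + b_g*g + b_e*e + b_ge*g*e
    where g is the number of risk alleles (0, 1, 2) and e is a binary exposure.
    With b_g = b_e = 0, the gene only matters in exposed individuals.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        # Two independent allele draws give Hardy-Weinberg genotype proportions.
        g = (rng.random() < maf) + (rng.random() < maf)
        e = int(rng.random() < exposure_freq)
        logit = b0 + b_g * g + b_e * e + b_ge * g * e
        risk = 1.0 / (1.0 + math.exp(-logit))
        affected = rng.random() < risk
        if affected and len(cases) < n_cases:
            cases.append((g, e))
        elif not affected and len(controls) < n_controls:
            controls.append((g, e))
    return cases, controls


cases, controls = simulate_case_control(500, 500)


def exposed_carrier_fraction(sample):
    """Fraction of individuals who carry a risk allele and are exposed."""
    return sum(1 for g, e in sample if g > 0 and e == 1) / len(sample)


print(exposed_carrier_fraction(cases), exposed_carrier_fraction(controls))
```

Because the true model is known, a data set like this can be handed to any interaction-detection method to see whether it recovers the effect, which is the use case the abstract describes.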

Thursday, January 07, 2010

Bioinformatics Strategies for Genome-Wide Association Studies (GWAS)

Our new review on bioinformatics strategies for GWAS analysis has been published in Bioinformatics. We focus in this paper on methods that are designed to embrace, rather than ignore, the complexity of common human diseases.

Moore, J.H., Asselbergs, F.W., Williams, S.M. Bioinformatics strategies for genome-wide association studies. Bioinformatics (2010). [PDF]


Motivation: The sequencing of the human genome has made it possible to identify an informative set of more than one million single nucleotide polymorphisms (SNPs) across the genome that can be used to carry out genome-wide association studies (GWAS). The availability of massive amounts of GWAS data has necessitated the development of new biostatistical methods for quality control, imputation, and analysis issues including multiple testing. This work has been successful and has enabled the discovery of new associations that have been replicated in multiple studies. However, it is now recognized that most SNPs discovered via GWAS have small effects on disease susceptibility and thus may not be suitable for improving healthcare through genetic testing. One likely explanation for the mixed results of GWAS is that the current biostatistical analysis paradigm is by design agnostic or unbiased in that it ignores all prior knowledge about disease pathobiology. Further, the linear modeling framework that is employed in GWAS often considers only one SNP at a time thus ignoring their genomic and environmental context. There is now a shift away from the biostatistical approach toward a more holistic approach that recognizes the complexity of the genotype-phenotype relationship that is characterized by significant heterogeneity and gene-gene and gene-environment interaction. We argue here that bioinformatics has an important role to play in addressing the complexity of the underlying genetic basis of common human diseases. The goal of this review is to identify and discuss those GWAS challenges that will require computational methods.

Wednesday, January 06, 2010

New Papers for EvoBIO'10 and EvoCOMPLEX'10

We have four new papers that have been accepted for publication and presentation as part of the EvoBIO'10 and EvoCOMPLEX'10 conferences in Istanbul, Turkey. I hope to see you there!

Payne, J.L., Moore, J.H. Sexual Recombination in Self-Organizing Interaction Networks. Lecture Notes in Computer Science, in press (2010). EvoCOMPLEX'10


We build on recent advances in the design of self-organizing interaction networks by introducing a sexual variant of an existing asexual, mutation-limited algorithm. Both the asexual and sexual variants are tested on benchmark optimization problems with varying levels of problem difficulty, deception, and epistasis. Specifically, we investigate algorithm performance on Massively Multimodal Deceptive Problems and NK Landscapes. In the former case, we find that sexual recombination improves solution quality for all problem instances considered; in the latter case, sexual recombination only improves solution quality for problem instances with intermediate levels of epistasis. We conclude that sexual recombination in self-organizing interaction networks may improve solution quality in problem domains with deception or a moderate degree of epistatic interactions.

Greene, C.S., Himmelstein, D.S., Moore, J.H. A Model Free Method to Generate Human Genetics Datasets with Complex Gene-Disease Relationships. Lecture Notes in Computer Science, in press (2010). EvoBIO'10


A goal of human genetics is to discover genetic factors that influence individuals’ susceptibility to common diseases. Most common diseases are thought to result from the joint failure of two or more interacting components instead of single component failures. This greatly complicates both the task of selecting informative genetic variations and the task of modeling interactions between them. We and others have previously developed algorithms to detect and model the relationships between these genetic factors and disease. Previously these methods have been evaluated with datasets simulated according to pre-defined genetic models. Here we develop and evaluate a model free evolution strategy to generate datasets which display a complex relationship between individual genotype and disease susceptibility. We show that this model free approach is capable of generating a diverse array of datasets with distinct gene-disease relationships for an arbitrary interaction order and sample size. We specifically generate six-hundred pareto fronts; one for each independent run of our algorithm. In each run the predictiveness of single genetic variation and pairs of genetic variations have been minimized, while the predictiveness of third, fourth, or fifth order combinations is maximized. This method and the resulting datasets will allow the capabilities of novel methods to be tested without pre-specified genetic models. This could improve our ability to evaluate which methods will succeed on human genetics problems where the model is not known in advance. We further make freely available to the community the entire pareto-optimal front of datasets from each run so that novel methods may be rigorously evaluated. These 56,600 datasets are available from http://discovery.dartmouth.edu/model_free_data/.
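
The notion of a Pareto front used in the abstract above can be shown with a short sketch. Each candidate dataset gets two scores: how predictive single SNPs and pairs are (to be minimized) and how predictive the higher-order combination is (to be maximized). A dataset is on the front if no other dataset is at least as good on both scores and strictly better on one. The function name `pareto_front` and the scores are hypothetical, and this shows only the filtering step, not the evolution strategy from the paper.

```python
def pareto_front(points):
    """Return the non-dominated points.

    Each point is (low_order_signal, high_order_signal); the first value is
    minimized and the second is maximized.
    """
    def dominates(p, q):
        no_worse = p[0] <= q[0] and p[1] >= q[1]
        strictly_better = p[0] < q[0] or p[1] > q[1]
        return no_worse and strictly_better

    return [q for q in points
            if not any(dominates(p, q) for p in points)]


# Hypothetical scores for five candidate datasets.
candidates = [(0.10, 0.90), (0.20, 0.80), (0.05, 0.70),
              (0.30, 0.95), (0.10, 0.60)]

front = pareto_front(candidates)
print(front)
```

Here (0.20, 0.80) and (0.10, 0.60) are dropped because (0.10, 0.90) beats or matches them on both scores. The three survivors represent different trade-offs, which is why releasing the whole front rather than a single best dataset gives a more varied test bed.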

Greene, C.S., Himmelstein, D.S., Kiralis, J., Moore, J.H. The Informative Extremes: Using Both Nearest and Farthest Individuals Can Improve Relief Algorithms in the Domain of Human Genetics. Lecture Notes in Computer Science, in press (2010). EvoBIO'10


A primary goal of human genetics is the discovery of genetic factors that influence individual susceptibility to common human diseases. This problem is difficult because common diseases are likely the result of joint failure of two or more interacting components instead of single component failures. Efficient algorithms that can detect interacting attributes are needed. The Relief family of machine learning algorithms, which use nearest neighbors to weight attributes, are a promising approach. Recently an improved Relief algorithm called Spatially Uniform ReliefF (SURF) has been developed that significantly increases the ability of these algorithms to detect interacting attributes. Here we introduce an algorithm called SURF* which uses distant instances along with the usual nearby ones to weight attributes. The weighting depends on whether the instances are nearby or distant. We show this new algorithm significantly outperforms both ReliefF and SURF for genetic analysis in the presence of attribute interactions. We make SURF* freely available in the open source MDR software package. MDR is a cross-platform Java application which features a user friendly graphical interface.
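
As background for the abstract above, here is a minimal sketch of the original Relief idea that ReliefF, SURF and SURF* build on. For each individual, find the nearest neighbor with the same status (the hit) and the nearest neighbor with the other status (the miss). An attribute loses weight when it differs from the hit and gains weight when it differs from the miss. This is plain Relief only; the SURF distance threshold and the SURF* treatment of distant instances are not shown. The function name `relief_weights` and the toy data are assumptions made for illustration.

```python
from itertools import product


def relief_weights(X, y):
    """Basic Relief attribute weights for discrete attributes.

    X: list of attribute tuples, one per individual
    y: class label (for example 1 = case, 0 = control) per individual
    Assumes every individual has at least one other individual in its own
    class and at least one in the other class.
    """
    n, m = len(X), len(X[0])

    def distance(a, b):
        # Number of attributes at which two individuals differ.
        return sum(1 for u, v in zip(a, b) if u != v)

    weights = [0.0] * m
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for a in range(m):
            # Differing from a same-class neighbor is evidence against a.
            weights[a] -= (X[i][a] != X[hit][a]) / n
            # Differing from an other-class neighbor is evidence for a.
            weights[a] += (X[i][a] != X[miss][a]) / n
    return weights


# Toy data: status is the XOR of attributes 0 and 1; attribute 2 is noise.
X = list(product((0, 1), repeat=3))
y = [x[0] ^ x[1] for x in X]

weights = relief_weights(X, y)
print(weights)
```

In this toy example the two interacting attributes receive positive weights and the noise attribute a negative one, even though neither interacting attribute has any effect on its own. That sensitivity to interactions without enumerating attribute pairs is what makes the Relief family attractive for filtering SNPs.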

Penrod, N.M., Greene, C.S., Granizo-MacKenzie, D., Moore, J.H. Artificial Immune Systems for Epistasis Analysis in Human Genetics. Lecture Notes in Computer Science, in press (2010). EvoBIO'10


Modern genotyping techniques have allowed the field of human genetics to generate vast amounts of data, but analysis methodologies have not been able to keep pace with this increase. In order to allow personal genomics to play a vital role in modern health care, analysis methods capable of discovering high order interactions that contribute to an individual’s risk of disease must be developed. An artificial immune system (AIS) is a method which maps well to this problem and has a number of appealing properties. By considering many attributes simultaneously, it may be able to effectively and efficiently detect epistasis, that is non-additive gene-gene interactions. This situation of interacting genes is currently very difficult to detect without biological insight or statistical heuristics. Even with these approaches, at low heritability, these approaches have trouble distinguishing genetic signal from noise. The AIS also has a compact solution representation which can be rapidly evaluated. Finally the AIS approach, by iteratively developing an antibody which ignores irrelevant genotypes, may be better able to differentiate signal from noise than machine learning approaches like ReliefF which struggle at small heritabilities. Here we develop a basic AIS and evaluate it on very low heritability datasets. We find that the basic AIS is not robust to parameter settings but that, at some parameter settings, it performs very effectively. We use the settings where the strategy succeeds to suggest a path towards a robust AIS for human genetics. Developing an AIS which succeeds across many parameter settings will be critical to prepare this method for widespread use.

Tuesday, January 05, 2010

Computational Human Genetics and the Dartmouth Neukom Institute

Our work on computational methods for the genetic analysis of common human diseases is supported by the Neukom Institute for Computational Science at Dartmouth College. The following are videos of me and Ryan Urbanowicz from my lab talking about our research supported by the Neukom Institute.