Epistasis Blog

From the Artificial Intelligence Innovation Lab at Cedars-Sinai Medical Center (www.epistasis.org)

Monday, May 24, 2010

A screening methodology based on Random Forests to improve the detection of gene-gene interactions

This is an interesting new paper that proposes using random forests (RF) to filter SNPs before MDR modeling of gene-gene interactions. We have seen similar results with our ReliefF-based algorithms. Removing noisy SNPs prior to MDR modeling cuts down the number of combinations that need to be evaluated, thus reducing the chances of overfitting. I am guessing RF works well on smaller numbers of SNPs but will not scale to GWAS when the interacting loci have no marginal effects.
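To give a rough sense of how much a filter can shrink the search, here is a minimal Python sketch that counts the SNP combinations an exhaustive search would have to evaluate before and after filtering. It is an illustration only and is not taken from the paper; the marker counts (1000 before filtering, 100 after) and the function name are assumptions chosen for the example.

```python
from math import comb


def n_combinations(n_snps, order):
    """Number of distinct SNP sets of size `order` drawn from `n_snps` markers."""
    return comb(n_snps, order)


# Hypothetical panel sizes: 1000 SNPs before filtering, 100 after.
for n_snps in (1000, 100):
    for order in (2, 3):
        print(n_snps, "SNPs, order", order, "->", n_combinations(n_snps, order))

# 1000 SNPs, order 2 -> 499500
# 1000 SNPs, order 3 -> 166167000
# 100 SNPs, order 2 -> 4950
# 100 SNPs, order 3 -> 161700
```

Under these assumed numbers, a tenfold reduction in markers cuts the pairwise search roughly a hundredfold and the three-way search roughly a thousandfold.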

De Lobel L, Geurts P, Baele G, Castro-Giner F, Kogevinas M, Van Steen K. A screening methodology based on Random Forests to improve the detection of gene-gene interactions. Eur J Hum Genet. 2010 [PubMed]


The search for susceptibility loci in gene-gene interactions imposes a methodological and computational challenge for statisticians because of the large dimensionality inherent to the modelling of gene-gene interactions or epistasis. In an era in which genome-wide scans have become relatively common, new powerful methods are required to handle the huge amount of feasible gene-gene interactions and to weed out false positives and negatives from these results. One solution to the dimensionality problem is to reduce data by preliminary screening of markers to select the best candidates for further analysis. Ideally, this screening step is statistically independent of the testing phase. Initially developed for small numbers of markers, the Multifactor Dimensionality Reduction (MDR) method is a nonparametric, model-free data reduction technique to associate sets of markers with optimal predictive properties to disease. In this study, we examine the power of MDR in larger data sets and compare it with other approaches that are able to identify gene-gene interactions. Under various interaction models (purely and not purely epistatic), we use a Random Forest (RF)-based prescreening method, before executing MDR, to improve its performance. We find that the power of MDR increases when noisy SNPs are first removed, by creating a collection of candidate markers with RFs. We validate our technique by extensive simulation studies and by application to asthma data from the European Committee of Respiratory Health Study II.
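For readers unfamiliar with MDR, the core step can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it scores a single SNP pair on the full data set by labeling each two-locus genotype cell high risk when its case:control ratio exceeds the overall ratio, then reporting balanced accuracy. Full MDR also searches over all candidate combinations and uses cross-validation, both omitted here. The function name, the 0/1/2 genotype coding, and the toy data are assumptions made for the example.

```python
from collections import Counter


def mdr_pair_accuracy(snp_a, snp_b, status):
    """Balanced accuracy of the MDR high/low-risk rule for one SNP pair.

    snp_a, snp_b: genotype codes per subject (assumed coding 0/1/2).
    status: 1 for case, 0 for control.
    """
    cases, controls = Counter(), Counter()
    for a, b, y in zip(snp_a, snp_b, status):
        (cases if y == 1 else controls)[(a, b)] += 1

    n_cases = sum(cases.values())
    n_controls = sum(controls.values())
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")

    # A genotype cell is high risk when its case:control ratio
    # exceeds the overall case:control ratio in the data.
    threshold = n_cases / n_controls
    cells = set(cases) | set(controls)
    high_risk = {c for c in cells if cases[c] > threshold * controls[c]}

    sensitivity = sum(cases[c] for c in high_risk) / n_cases
    specificity = sum(n for c, n in controls.items() if c not in high_risk) / n_controls
    return 0.5 * (sensitivity + specificity)


# Toy data: disease status is the XOR of two SNPs (no marginal effects).
snp_a = [0, 0, 1, 1, 0, 0, 1, 1]
snp_b = [0, 1, 0, 1, 0, 1, 0, 1]
noise = [0, 0, 0, 0, 1, 1, 1, 1]
status = [0, 1, 1, 0, 0, 1, 1, 0]

print(mdr_pair_accuracy(snp_a, snp_b, status))  # 1.0
print(mdr_pair_accuracy(noise, snp_b, status))  # 0.5
```

In the toy data neither SNP predicts status on its own, yet the interacting pair is classified perfectly while the pair containing the noise SNP does no better than chance. A prescreening step would aim to drop markers like the noise SNP before this search is run.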

Tuesday, May 18, 2010

Missing heritability and strategies for finding the underlying causes of common diseases

I participated in the following viewpoint piece in Nature Reviews Genetics. A number of good points are made by each author. I was hoping the piece would be a bit more controversial.

Evan E. Eichler, Jonathan Flint, Greg Gibson, Augustine Kong, Suzanne M. Leal, Jason H. Moore and Joseph H. Nadeau. Missing heritability and strategies for finding the underlying causes of common diseases. Nature Reviews Genetics 11, 446-450 (2010). [Nature] [PubMed]


Although recent genome-wide studies have provided valuable insights into the genetic basis of human disease, they have explained relatively little of the heritability of most complex traits, and the variants identified through these studies have small effect sizes. This has led to the important and hotly debated issue of where the ‘missing heritability’ of complex diseases might be found. Here, seven leading geneticists offer their opinion about where this heritability is likely to lie, what this could tell us about the underlying genetic architecture of common diseases and how this could inform research strategies for uncovering genetic risk factors.

Tuesday, May 11, 2010

Postdoctoral Position in Computational Genetics at Vanderbilt

This looks like a good opportunity.

The Program in Computational Genomics in the CHGR at Vanderbilt University has an immediate opening for a post-doctoral fellow to pursue new and exciting research in human genetics. The successful candidate will have a Ph.D. degree (or equivalent) in genetics, human genetics, epidemiology, computational biology, bioinformatics, biostatistics, or related field. The successful candidate will work as part of an established research team and will have access to several large genome-wide association study (GWAS) datasets and numerous follow-up studies for association and copy number variation. Both established and evolving methods to detect and characterize single and multi-locus effects will be applied, and rich phenotypic data will permit analysis of discrete and quantitative traits. The candidate will integrate data from linkage, association, CNV, and re-sequencing studies along with knowledge of gene networks to identify susceptibility genes. He/She will also have the opportunity to conduct research in methods development in the study of gene-gene and gene-environment interactions for complex disease. In addition, the candidate will have the opportunity to interact with numerous senior investigators in multiple fields.

The CHGR is an interdisciplinary center with over 40 faculty representing numerous clinical and basic science departments. It has a highly interactive research program organized into three thematic programs: Disease Gene Discovery, Computational Genomics, and Translational Genetics. The CHGR has substantial core facilities for family and patient ascertainment; DNA banking, genotyping, and sequencing; and computational genomics, data management, and data analysis. It occupies over 14,000 sf of newly appointed wet and dry lab space. The CHGR faculty and staff enjoy the substantial benefits of the collaborative Vanderbilt atmosphere. More information about the specific CHGR post-doctoral positions can be found at: http://chgr.mc.vanderbilt.edu/chgr-careers/postdoc.

Interested candidates should forward their C.V., a description of their research interests, and three letters of reference (preferably by email) by June 30, 2010, to:

Dr. Marylyn Ritchie, PhD
c/o Maria Comer
Center for Human Genetics Research
Vanderbilt University
519 Light Hall
Nashville, TN 37232-0700
Email: maria.comer@vanderbilt.edu
Tel: 615-322-7909
Fax: 615-343-8619

Tuesday, May 04, 2010

Bioinformatics, Genomics and Alzheimer's Disease

This is a nice example of how bioinformatics and genomics can be used together to study a complex problem like Alzheimer's disease. Studies like this one will be useful for providing the kind of biological knowledge we need to guide machine learning analysis of gene-gene interactions in genome-wide association studies.

Gómez Ravetti M, Rosso OA, Berretta R, Moscato P. Uncovering molecular biomarkers that correlate cognitive decline with the changes of hippocampus' gene expression profiles in Alzheimer's disease. PLoS One. 2010 Apr 13;5(4):e10153. [PLoS One]


BACKGROUND: Alzheimer's disease (AD) is characterized by a neurodegenerative progression that alters cognition. On a phenotypical level, cognition is evaluated by means of the MiniMental State Examination (MMSE) and the post-mortem examination of Neurofibrillary Tangle count (NFT) helps to confirm an AD diagnostic. The MMSE evaluates different aspects of cognition including orientation, short-term memory (retention and recall), attention and language. As there is a normal cognitive decline with aging, and death is the final state on which NFT can be counted, the identification of brain gene expression biomarkers from these phenotypical measures has been elusive. METHODOLOGY/PRINCIPAL FINDINGS: We have reanalysed a microarray dataset contributed in 2004 by Blalock et al. of 31 samples corresponding to hippocampus gene expression from 22 AD subjects of varying degree of severity and 9 controls. Instead of only relying on correlations of gene expression with the associated MMSE and NFT measures, and by using modern bioinformatics methods based on information theory and combinatorial optimization, we uncovered a 1,372-probe gene expression signature that presents a high-consensus with established markers of progression in AD. The signature reveals alterations in calcium, insulin, phosphatidylinositol and wnt-signalling. Among the most correlated gene probes with AD severity we found those linked to synaptic function, neurofilament bundle assembly and neuronal plasticity. CONCLUSIONS/SIGNIFICANCE: A transcription factors analysis of 1,372-probe signature reveals significant associations with the EGR/KROX family of proteins, MAZ, and E2F1. The gene homologous of EGR1, zif268, Egr-1 or Zenk, together with other members of the EGR family, are consolidating a key role in the neuronal plasticity in the brain. These results indicate a degree of commonality between putative genes involved in AD and prion-induced neurodegenerative processes that warrants further investigation.