Epistasis Blog

From the Artificial Intelligence Innovation Lab at Cedars-Sinai Medical Center (www.epistasis.org)

Friday, February 26, 2010

The Genetic Interpretation of Area under the ROC Curve

Two very interesting new papers. Both are must-reads.

Wray NR, Yang J, Goddard ME, Visscher PM. The Genetic Interpretation of Area under the ROC Curve in Genomic Profiling. PLoS Genet. 2010 6(2): e1000864. [PLoS]

Wray NR, Goddard ME. Multi-locus models of genetic risk of disease. Genome Med. 2010 Feb 2;2(2):10. [PubMed]

Monday, February 22, 2010

Maximal conditional chi-square importance in random forests

Interesting new paper. Nice to see the conditioning on other SNPs.

Wang M, Chen X, Zhang H. Maximal conditional chi-square importance in random forests. Bioinformatics. 2010. [PubMed]


MOTIVATION: High-dimensional data are frequently generated in genome-wide association studies (GWAS) and other studies. It is important to identify features such as single nucleotide polymorphisms (SNPs) in GWAS that are associated with a disease. Random forests represent a very useful approach for this purpose, using a variable importance score. This importance score has several shortcomings. We propose an alternative importance measure to overcome those shortcomings. RESULTS: We characterized the effect of multiple SNPs under various models using our proposed importance measure in random forests, which uses maximal conditional chi-square (MCC) as a measure of association between a SNP and the trait conditional on other SNPs. Based on this importance measure, we employed a permutation test to estimate empirical p-values of SNPs. Our method was compared to a univariate test and the permutation test using the Gini and permutation importance. In simulation, the proposed method performed consistently superior to the other methods in identifying of risk SNPs. In a genome-wide association study of age-related macular degeneration, the proposed method confirmed two significant SNPs (at the genomewide adjusted level of 0.05). Further analysis showed that these two SNPs conformed with a heterogeneity model. Compared with the existing importance measures, the MCC importance measure is more sensitive to complex effects of risk SNPs by utilizing conditional information on different SNPs. The permutation test with the MCC importance measure provides an efficient way to identify candidate SNPs in GWAS and facilitates the understanding of the etiology between genetic variants and complex diseases. CONTACT: heping.zhang@yale.edu.
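For readers new to the idea of an empirical p-value from a permutation test, here is a minimal sketch in Python. To be clear, this is not the authors' MCC importance measure and it involves no random forest: it uses a plain univariate Pearson chi-square statistic between one SNP and a binary trait, and shuffles the trait labels to build the null distribution. The function names, the toy genotype and phenotype vectors, and the choice of 1000 permutations are all assumptions made for illustration.

```python
import random
from collections import Counter


def chi_square(genotypes, phenotypes):
    """Pearson chi-square statistic for the genotype-by-phenotype table."""
    n = len(genotypes)
    cell = Counter(zip(genotypes, phenotypes))  # observed cell counts
    row = Counter(genotypes)                    # genotype margins
    col = Counter(phenotypes)                   # phenotype margins
    stat = 0.0
    for g in row:
        for p in col:
            expected = row[g] * col[p] / n
            observed = cell.get((g, p), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def permutation_p_value(genotypes, phenotypes, n_perm=1000, seed=0):
    """Empirical p-value: share of label shuffles at least as extreme as observed."""
    rng = random.Random(seed)
    observed = chi_square(genotypes, phenotypes)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any SNP-trait link, keep both margins
        if chi_square(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (invented): genotype coded 0/1/2, phenotype 0 = control, 1 = case.
toy_genotypes = [0] * 40 + [1] * 40 + [2] * 20
toy_phenotypes = [0] * 35 + [1] * 5 + [0] * 20 + [1] * 20 + [0] * 3 + [1] * 17

if __name__ == "__main__":
    print("chi-square:", round(chi_square(toy_genotypes, toy_phenotypes), 2))
    print("empirical p:", permutation_p_value(toy_genotypes, toy_phenotypes))
```

Swapping a different importance score in for `chi_square` leaves the permutation logic unchanged, which is presumably why the same test can be run with Gini, permutation, or MCC importance.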

Wednesday, February 03, 2010

The GenEpi Toolbox

The GenEpi Toolbox looks useful. Has anyone tried it?

Here is a recent paper discussing this new bioinformatics resource for genetic epidemiology.

Coassin S, Brandstätter A, Kronenberg F. Lost in the space of bioinformatic tools: A constantly updated survival guide for genetic epidemiology. The GenEpi Toolbox. Atherosclerosis. 2009 Oct 29. [Epub ahead of print] [PubMed] PMID:19963217.


Genome-wide association studies (GWASs) led to impressive advances in the elucidation of genetic factors underlying complex phenotypes and diseases. However, the ability of GWAS to identify new susceptibility loci in a hypothesis-free approach requires tools to quickly retrieve comprehensive information about a genomic region and analyze the potential effects of coding and non-coding SNPs in a candidate gene region. Furthermore, once a candidate region is chosen for resequencing and fine-mapping studies, the identification of several rare mutations is likely and requires strong bioinformatic support to properly evaluate and prioritize the found mutations for further analysis. Due to the variety of regulatory layers that can be affected by a mutation, a comprehensive in-silico evaluation of candidate SNPs can be a demanding and very time-consuming task. Although many bioinformatic tools that significantly simplify this task were made available in the last years, their utility is often still unknown to researchers not intensively involved in bioinformatics. We present a comprehensive guide of 64 tools and databases to bioinformatically analyze gene regions of interest to predict SNP effects. In addition, we discuss tools to perform data mining of large genetic regions, predict the presence of regulatory elements, make in-silico evaluations of SNPs effects and address issues ranging from interactome analysis to graphically annotated proteins sequences. Finally, we exemplify the use of these tools by applying them to hits of a recently performed GWAS. Taken together a combination of the discussed tools are summarized and constantly updated in the web-based "GenEpi Toolbox" (http://genepi_toolbox.i-med.ac.at) and can help to get a glimpse at the potential functional relevance of both large genetic regions and single nucleotide mutations which might help to prioritize the next steps.

Tuesday, February 02, 2010

Genetic Heterogeneity and Cancer

The following paper raises the important issue of genetic heterogeneity. It is a nice paper because it addresses the complexity of genetic architecture. However, its coverage of the earlier literature is thin: note how few of its references predate 2000. Genetic heterogeneity is not a new idea, and it would have been nice if the authors had given the reader a historical perspective on this important phenomenon.

Galvan A, Ioannidis JP, Dragani TA. Beyond genome-wide association studies: genetic heterogeneity and individual predisposition to cancer. Trends Genet. 2010 Jan 25. [Epub ahead of print] [PubMed] PMID: 20106545.


Genome-wide association studies (GWAS) using population-based designs have identified many genetic loci associated with risk of a range of complex diseases including cancer; however, each locus exerts a very small effect and most heritability remains unexplained. Family-based pedigree studies have also suggested tentative loci linked to increased cancer risk, often characterized by pedigree-specificity. However, comparison between the results of population- and family-based studies shows little concordance. Explanations for this unidentified genetic 'dark matter' of cancer include phenotype ascertainment issues, limited power, gene-gene and gene-environment interactions, population heterogeneity, parent-of-origin-specific effects, and rare and unexplored variants. Many of these reasons converge towards the concept of genetic heterogeneity that might implicate hundreds of genetic variants in regulating cancer risk. Dissecting the dark matter is a challenging task. Further insights can be gained from both population association and pedigree studies.

Monday, February 01, 2010

An Open Access Database of Genome-wide Association Results

Ran across this paper today. Might be useful for those interested in reanalysis of GWAS data.

Johnson AD, O'Donnell CJ. An open access database of genome-wide association results. BMC Med Genet. 2009 Jan 22;10:6. [PubMed] PMID: 19161620; PubMed Central PMCID: PMC2639349.

BACKGROUND: The number of genome-wide association studies (GWAS) is growing rapidly leading to the discovery and replication of many new disease loci. Combining results from multiple GWAS datasets may potentially strengthen previous conclusions and suggest new disease loci, pathways or pleiotropic genes. However, no database or centralized resource currently exists that contains anywhere near the full scope of GWAS results. METHODS: We collected available results from 118 GWAS articles into a database of 56,411 significant SNP-phenotype associations and accompanying information, making this database freely available here. In doing so, we met and describe here a number of challenges to creating an open access database of GWAS results. Through preliminary analyses and characterization of available GWAS, we demonstrate the potential to gain new insights by querying a database across GWAS. RESULTS: Using a genomic bin-based density analysis to search for highly associated regions of the genome, positive control loci (e.g., MHC loci) were detected with high sensitivity. Likewise, an analysis of highly repeated SNPs across GWAS identified replicated loci (e.g., APOE, LPL). At the same time we identified novel, highly suggestive loci for a variety of traits that did not meet genome-wide significant thresholds in prior analyses, in some cases with strong support from the primary medical genetics literature (SLC16A7, CSMD1, OAS1), suggesting these genes merit further study. Additional adjustment for linkage disequilibrium within most regions with a high density of GWAS associations did not materially alter our findings. Having a centralized database with standardized gene annotation also allowed us to examine the representation of functional gene categories (gene ontologies) containing one or more associations among top GWAS results. 
Genes relating to cell adhesion functions were highly over-represented among significant associations (p < 4.6 x 10(-14)), a finding which was not perturbed by a sensitivity analysis. CONCLUSION: We provide access to a full gene-annotated GWAS database which could be used for further querying, analyses or integration with other genomic information. We make a number of general observations. Of reported associated SNPs, 40% lie within the boundaries of a RefSeq gene and 68% are within 60 kb of one, indicating a bias toward gene-centricity in the findings. We found considerable heterogeneity in information available from GWAS suggesting the wider community could benefit from standardization and centralization of results reporting.
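For anyone thinking about reanalysis, the "genomic bin-based density analysis" mentioned in the abstract can be approximated along these lines. This is only a rough sketch under my own assumptions, not the authors' procedure: the 1 Mb bin width, the function names, and the toy records are all invented here, and the real database will have its own format and fields.

```python
from collections import Counter


def bin_density(associations, bin_size=1_000_000):
    """Count reported associations per (chromosome, bin index).

    `associations` is an iterable of (chromosome, base-pair position) pairs.
    """
    counts = Counter()
    for chrom, pos in associations:
        counts[(chrom, pos // bin_size)] += 1
    return counts


def densest_bins(associations, bin_size=1_000_000, top=3):
    """Return the `top` bins with the most associations, densest first."""
    return bin_density(associations, bin_size).most_common(top)


# Toy records (invented placeholders, not real GWAS hits).
toy_hits = [
    ("chrA", 31_200_000),
    ("chrA", 31_450_000),
    ("chrA", 31_900_000),
    ("chrB", 45_400_000),
    ("chrB", 45_410_000),
    ("chrC", 1_000),
]

if __name__ == "__main__":
    for (chrom, index), count in densest_bins(toy_hits):
        start = index * 1_000_000
        print(f"{chrom}:{start}-{start + 999_999}  {count} associations")
```

Fixed-width bins are the simplest choice. The abstract notes that additional adjustment for linkage disequilibrium did not materially change the findings, and this sketch makes no such adjustment.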