Epistasis Blog

From the Artificial Intelligence Innovation Lab at Cedars-Sinai Medical Center (www.epistasis.org)

Monday, August 31, 2009

An Algorithm for Multilocus Genomic Interactions

This looks like an interesting new paper. I found this while searching PubMed for entropy AND interaction.

Miller DJ, Zhang Y, Yu G, Liu Y, Chen L, Langefeld CD, Herrington D, Wang Y. An Algorithm for Learning Maximum Entropy Probability Models of Disease Risk That Efficiently Searches and Sparingly Encodes Multilocus Genomic Interactions. Bioinformatics. 2009 Jul 16. [PubMed]


MOTIVATION: In both genome-wide association studies (GWAS) and pathway analysis, the modest sample size relative to the number of genetic markers presents formidable computational, statistical, and methodological challenges for accurately identifying markers/interactions and for building phenotype-predictive models. RESULTS: We address these objectives via maximum entropy conditional probability modeling (MECPM), coupled with a novel model structure search. Unlike neural networks and support vector machines (SVMs), MECPM makes explicit and is determined by the interactions that confer phenotype-predictive power. Our method identifies both a marker subset and the multiple k-way interactions between these markers. Additional key aspects are: i) evaluation of a select subset of up to 5-way interactions while retaining relatively low complexity; ii) flexible SNP coding (dominant, recessive) within each interaction; iii) no mathematical interaction form assumed; iv) model structure and order selection based on the Bayesian Information Criterion, which fairly compares interactions at different orders and automatically sets the experiment-wide significance level; v) MECPM directly yields a phenotype-predictive model. MECPM was compared to a panel of methods on data sets with up to 1000 SNPs and up to 8 embedded penetrance function (i.e., ground-truth) interactions, including a 5-way, involving less than 20 SNPs. MECPM achieved improved sensitivity and specificity for detecting both ground-truth markers and interactions, compared with previous methods. AVAILABILITY: http://www.cbil.ece.vt.edu/ResearchOngoingSNP.htm CONTACT: djmiller@engr.psu.edu.
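
The role of the Bayesian Information Criterion in item (iv) can be illustrated with a small sketch. It uses the textbook definition BIC = k ln(n) - 2 ln(L), where lower is better. The function name, the parameter counts and the log-likelihood values are hypothetical and are not taken from the paper or the MECPM software.

```python
from math import log

def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Textbook BIC: penalty k*ln(n) minus twice the log-likelihood (lower is better)."""
    return n_params * log(n_samples) - 2.0 * log_likelihood

# Hypothetical comparison: a pairwise-interaction model versus a richer
# three-way model that fits slightly better but uses more parameters.
n = 1000
bic_pairwise = bic(log_likelihood=-650.0, n_params=3, n_samples=n)
bic_threeway = bic(log_likelihood=-648.0, n_params=6, n_samples=n)

best = "pairwise" if bic_pairwise < bic_threeway else "three-way"
print(f"pairwise BIC={bic_pairwise:.1f}, three-way BIC={bic_threeway:.1f}, preferred: {best}")
```

In this made-up case the small gain in likelihood from the higher-order model does not pay for its extra parameters, which is the sense in which the penalty lets models of different interaction order be compared on one scale.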

Friday, August 28, 2009

The Status of the P Versus NP Problem

There is a great review in the latest Communications of the ACM on the P vs. NP problem. For an introduction to the topic, see the Wikipedia article here. This is very relevant to the detection of gene-gene interactions in genome-wide association studies.

Fortnow L. The Status of the P Versus NP Problem. Communications of the ACM Vol. 52 No. 9, Pages 78-86 (2009). [ACM]

Fortnow concludes with "The P versus NP problem has gone from an interesting problem related to logic to perhaps the most fundamental and important mathematical question of our time, whose importance only grows as computers become more powerful and widespread."
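
The relevance to gene-gene interactions can be made concrete by counting how many SNP combinations an exhaustive search would have to visit. The sketch below is only a counting exercise, not a complexity proof, and the figure of 500,000 SNPs is an illustrative assumption about the size of a genome-wide panel. The function name is hypothetical.

```python
from math import comb

def n_interaction_models(n_snps: int, order: int) -> int:
    """Number of distinct SNP subsets of size `order` among `n_snps` markers."""
    return comb(n_snps, order)

# Assumed panel size, for illustration only.
N_SNPS = 500_000

for order in (1, 2, 3, 4):
    count = n_interaction_models(N_SNPS, order)
    print(f"{order}-way combinations: {count:.3e}")
```

Each additional interaction order multiplies the search space by roughly the number of SNPs, which is why exhaustive evaluation beyond pairs quickly becomes impractical and why heuristic or filter-based searches are attractive.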

Thursday, August 27, 2009

Epistasis Replication

The following paper shows evidence for replication of a gene-gene interaction in Alzheimer disease. I anticipate more papers like this will be published as studies start to test for epistasis in multiple samples.

Combarros O, van Duijn CM, Hammond N, Belbin O, Arias-Vasquez A, Cortina-Borja M, Lehmann MG, Aulchenko YS, Schuur M, Kolsch H, Heun R, Wilcock GK, Brown K, Kehoe PG, Harrison R, Coto E, Alvarez V, Deloukas P, Mateo I, Gwilliam R, Morgan K, Warden DR, Smith AD, Lehmann DJ. Replication by the Epistasis Project of the interaction between the genes for IL-6 and IL-10 in the risk of Alzheimer's disease. J Neuroinflammation. 2009 Aug 23;6(1):22. [PubMed]


BACKGROUND: Chronic inflammation is a characteristic of Alzheimer's disease (AD). An interaction associated with the risk of AD has been reported between polymorphisms in the regulatory regions of the genes for the pro-inflammatory cytokine, interleukin-6 (IL-6, gene: IL6), and the anti-inflammatory cytokine, interleukin-10 (IL-10, gene: IL10).

METHODS: We examined this interaction in the Epistasis Project, a collaboration of 7 AD research groups, contributing DNA samples from 1,757 cases of AD and 6,295 controls.

RESULTS: We replicated the interaction. For IL6 rs2069837 AA x IL10 rs1800871 CC, the synergy factor (SF) was 1.63 (95% confidence interval: 1.10-2.41, p = 0.01), controlling for centre, age, gender and apolipoprotein E epsilon4 (APOEepsilon4) genotype. Our results were consistent between North Europe (SF = 1.7, p = 0.03) and North Spain (SF = 2.0, p = 0.09). Further replication may require a meta-analysis. However, association due to linkage disequilibrium with other polymorphisms in the regulatory regions of these genes cannot be excluded.

CONCLUSIONS: We suggest that dysregulation of both IL-6 and IL-10 in some elderly people, due in part to genetic variations in the two genes, contributes to the development of AD. Thus, inflammation facilitates the onset of sporadic AD.
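
The abstract reports a synergy factor (SF) without defining it. A minimal sketch follows, assuming the usual definition of SF as the odds ratio for carrying both risk genotypes divided by the product of the odds ratios for carrying each one alone, all relative to carriers of neither. The counts are invented for illustration and are not the Epistasis Project data. The published estimate was also adjusted for centre, age, gender and APOE genotype, which this unadjusted calculation does not attempt.

```python
def odds_ratio(cases_exposed, controls_exposed, cases_ref, controls_ref):
    """Unadjusted odds ratio of an exposed group relative to a reference group."""
    return (cases_exposed * controls_ref) / (controls_exposed * cases_ref)

def synergy_factor(or_both, or_first_only, or_second_only):
    """Assumed definition: SF = OR(both) / (OR(first only) * OR(second only))."""
    return or_both / (or_first_only * or_second_only)

# Hypothetical case/control counts by genotype group.
ref_cases, ref_controls = 100, 200          # neither risk genotype
or_first = odds_ratio(60, 100, ref_cases, ref_controls)   # first genotype only
or_second = odds_ratio(75, 100, ref_cases, ref_controls)  # second genotype only
or_both = odds_ratio(90, 50, ref_cases, ref_controls)     # both genotypes

sf = synergy_factor(or_both, or_first, or_second)
print(f"OR first={or_first:.2f}, OR second={or_second:.2f}, "
      f"OR both={or_both:.2f}, SF={sf:.2f}")
```

Under this definition, SF above 1 means the joint effect exceeds what the two separate effects would predict if they simply multiplied, and SF below 1 means it falls short.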

Wednesday, August 26, 2009

Sarah Pendergrass, Ph.D.

Sarah Pendergrass from the Whitfield Lab and from my lab successfully defended her Ph.D. today. The title of her dissertation is "Gene Expression Subsets and Biomarkers in the Genome-Wide Expression Profiles of Systemic Sclerosis". Nice job Sarah!

Sarah is off to the Center for Human Genetics Research at Vanderbilt University where she will be doing a postdoc with Drs. Dana Crawford and Marylyn Ritchie.

Her dissertation chapters include the following papers:

Pendergrass SA, Whitfield ML, Gardner H. Understanding systemic sclerosis through gene expression profiling. Curr Opin Rheumatol. 2007 Nov;19(6):561-7. [PubMed]

Pendergrass SA, Farina G, Lemaire R, Lafyatis RA, Whitfield ML. Biomarkers of pulmonary arterial hypertension in limited systemic sclerosis. Submitted.

Pendergrass SA, Lemaire R, Lafyatis RA, Whitfield ML. Reproducible and stable subsets in serial skin biopsies taken from patients treated in an open-label trial of rituximab. In Preparation.

Tuesday, August 25, 2009

Validation and Assessment of Machine Learning Methods

This is an interesting new paper that was brought to my attention by Dr. Rick Riolo at the University of Michigan.

Pers TH, Albrechtsen A, Holst C, Sørensen TI, Gerds TA. The validation and assessment of machine learning: a game of prediction from high-dimensional data. PLoS One. 2009 Aug 4;4(8):e6287. [PubMed]


In applied statistics, tools from machine learning are popular for analyzing complex and high-dimensional data. However, few theoretical results are available that could guide to the appropriate machine learning tool in a new application. Initial development of an overall strategy thus often implies that multiple methods are tested and compared on the same set of data. This is particularly difficult in situations that are prone to over-fitting where the number of subjects is low compared to the number of potential predictors. The article presents a game which provides some grounds for conducting a fair model comparison. Each player selects a modeling strategy for predicting individual response from potential predictors. A strictly proper scoring rule, bootstrap cross-validation, and a set of rules are used to make the results obtained with different strategies comparable. To illustrate the ideas, the game is applied to data from the Nugenob Study where the aim is to predict the fat oxidation capacity based on conventional factors and high-dimensional metabolomics data. Three players have chosen to use support vector machines, LASSO, and random forests, respectively.
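
The comparison scheme described in the abstract can be sketched in a few lines. This is not the authors' code: it assumes a binary outcome, uses the Brier score as one example of a strictly proper scoring rule, and scores each bootstrap-trained model on the subjects left out of that bootstrap sample. The two "players" and the toy data are hypothetical.

```python
import random
from statistics import mean

def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))

def bootstrap_cv(xs, ys, fit, n_boot=200, seed=0):
    """Train on bootstrap samples, score on out-of-bag subjects, average the scores."""
    rng = random.Random(seed)  # same seed gives every strategy the same splits
    n = len(ys)
    scores = []
    for _ in range(n_boot):
        train = [rng.randrange(n) for _ in range(n)]
        in_bag = set(train)
        test = [i for i in range(n) if i not in in_bag]
        if not test:
            continue  # no held-out subjects in this replicate
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        probs = [predict(xs[i]) for i in test]
        scores.append(brier_score(probs, [ys[i] for i in test]))
    return mean(scores)

def fit_prevalence(xs, ys):
    """Player 1: ignore predictors, always predict the training prevalence."""
    p = mean(ys)
    return lambda x: p

def fit_by_group(xs, ys):
    """Player 2: predict the training event rate within each predictor value."""
    overall = mean(ys)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    rates = {x: mean(v) for x, v in groups.items()}
    return lambda x: rates.get(x, overall)

# Toy data: one binary predictor that is informative about the outcome.
xs = [0] * 20 + [1] * 20
ys = [0] * 16 + [1] * 4 + [1] * 16 + [0] * 4

score_null = bootstrap_cv(xs, ys, fit_prevalence)
score_group = bootstrap_cv(xs, ys, fit_by_group)
print(f"prevalence-only: {score_null:.3f}, by-group: {score_group:.3f}")
```

Fixing the random seed gives both strategies identical training and test splits, which is one simple way to keep the comparison fair in the spirit of the game's rules.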

Monday, August 24, 2009


A new paper on MDR-PDT has been published. The original MDR-PDT paper that shows how MDR can be used to detect epistasis in general pedigrees can be found here.

Edwards TL, Torstensen E, Dudek S, Martin ER, Ritchie MD. A cross-validation procedure for general pedigrees and matched odds ratio fitness metric implemented for the multifactor dimensionality reduction pedigree disequilibrium test. Genet Epidemiol., in press. [PubMed]


As genetic epidemiology looks beyond mapping single disease susceptibility loci, interest in detecting epistatic interactions between genes has grown. The dimensionality and comparisons required to search the epistatic space and the inference for a significant result pose challenges for testing epistatic disease models. The multifactor dimensionality reduction-pedigree disequilibrium test (MDR-PDT) was developed to test for multilocus models in pedigree data. In the present study we rigorously tested MDR-PDT with new cross-validation (CV) (both 5- and 10-fold) and omnibus model selection algorithms by simulating a range of heritabilities, odds ratios, minor allele frequencies, sample sizes, and numbers of interacting loci. Power was evaluated using 100, 500, and 1,000 families, with minor allele frequencies 0.2 and 0.4 and broad-sense heritabilities of 0.005, 0.01, 0.03, 0.05, and 0.1 for 2- and 3-locus purely epistatic penetrance models. We also compared the prediction error (PE) measure of effect with a predicted matched odds ratio (MOR) for final model selection and testing. We report that the CV procedure is valid with the permutation test, MDR-PDT performs similarly with 5- and 10-fold CV, and that the MOR is more powerful than PE as the fitness metric for MDR-PDT.

Sunday, August 16, 2009

Why are scientists so dull?

This is a really interesting editorial. I think there is some truth to it.

Charlton BG. Why are modern scientists so dull? How science selects for perseverance and sociability at the expense of intelligence and creativity. Med Hypotheses. 2009 Mar;72(3):237-43. [PubMed]

QUESTION: why are so many leading modern scientists so dull and lacking in scientific ambition? ANSWER: because the science selection process ruthlessly weeds-out interesting and imaginative people. At each level in education, training and career progression there is a tendency to exclude smart and creative people by preferring Conscientious and Agreeable people. The progressive lengthening of scientific training and the reduced independence of career scientists have tended to deter vocational 'revolutionary' scientists in favour of industrious and socially adept individuals better suited to incremental 'normal' science. High general intelligence (IQ) is required for revolutionary science. But educational attainment depends on a combination of intelligence and the personality trait of Conscientiousness; and these attributes do not correlate closely. Therefore elite scientific institutions seeking potential revolutionary scientists need to use IQ tests as well as examination results to pick-out high IQ 'under-achievers'. As well as high IQ, revolutionary science requires high creativity. Creativity is probably associated with moderately high levels of Eysenck's personality trait of 'Psychoticism'. Psychoticism combines qualities such as selfishness, independence from group norms, impulsivity and sensation-seeking; with a style of cognition that involves fluent, associative and rapid production of many ideas. But modern science selects for high Conscientiousness and high Agreeableness; therefore it enforces low Psychoticism and low creativity. Yet my counter-proposal to select elite revolutionary scientists on the basis of high IQ and moderately high Psychoticism may sound like a recipe for disaster, since resembles a formula for choosing gifted charlatans and confidence tricksters. A further vital ingredient is therefore necessary: devotion to the transcendental value of Truth. 
Elite revolutionary science should therefore be a place that welcomes brilliant, impulsive, inspired, antisocial oddballs - so long as they are also dedicated truth-seekers.

Saturday, August 15, 2009

Biodefense Bioinformatics

My NIH/NIAID R01 (AI59694) that supports our development of algorithms and software for detecting and characterizing gene-gene interactions has been renewed for four years of funding. The abstract and specific aims are below.


Infectious bioterrorism agents such as smallpox and anthrax represent a critical public health concern. Important goals of biodefense research include the development of predictors of pathogenicity of bioterrorism agents for rapid response and the prediction of clinical outcomes such as adverse events following vaccination. Our success in these biodefense endeavors will depend critically on the bioinformatics methods and software that are available for making sense of high-dimensional data generated by technologies such as DNA microarrays and mass spectrometry. The goal of this research program is to continue the development, evaluation, distribution and support of our successful open-source Multifactor Dimensionality Reduction (MDR) software package for identifying combinations of genetic and environmental predictors of clinically important biodefense outcomes. We will first evaluate new methods from our research group and those that have been proposed by other research groups and assess the best approaches for inclusion in new versions of the MDR software (AIM 1). The inclusion of new methods such as stochastic search algorithms for genome-wide analysis and linear models for continuous endpoints will ensure that the MDR software stays on the cutting edge. Second, we propose to develop a web server that biodefense researchers can use as a source of expert knowledge in the form of gene weights that are generated from biochemical pathways, Gene Ontology (GO), chromosomal location and protein-protein interactions, for example (AIM 2). Expert knowledge files generated by the web server will be used by the MDR software to prioritize single nucleotide polymorphisms (SNPs) for interaction analysis in genome-wide association studies or GWAS. These additions will ensure that MDR is ready for application to GWAS that are now commonplace. We will then apply these methods to GWAS data from an ongoing study of adverse events following vaccination for smallpox (AIM 3). 
Finally, we will identify opportunities to address other important bioterrorism research questions with our software that are consistent with the research objectives of the NIAID/NIH (AIM 4). All bioinformatics methods and tools will be provided in a timely manner for free as open-source software.

AIM 1. Develop, extend, evaluate, distribute and support the open-source Multifactor Dimensionality Reduction (MDR) software package for the identification, characterization and interpretation of gene-gene interactions that are associated with discrete clinical outcomes such as adverse events following vaccination for smallpox. We propose in the next phase of this biodefense research program to extend, improve and update the MDR software package by adding new MDR-related algorithms from our research group and from other research groups. We will evaluate newly developed algorithms and then assess each for inclusion in a new version of the MDR software package. This will ensure the MDR software stays on the cutting edge and is ready for genome-wide genetic analysis.

AIM 2. Develop and make available a web server for weighting SNPs and genes using expert knowledge in the form of biochemical pathways, Gene Ontology (GO), chromosomal location and protein-protein interactions for use by the MDR software to prioritize SNPs for interaction analysis in genome-wide association studies (GWAS). This new resource will provide biodefense researchers an easy to use web interface for selecting a source of expert knowledge (e.g. GO) and the appropriate weights for each gene that can then be loaded in MDR and used to prioritize SNPs for interaction analysis.

AIM 3. Apply MDR to a genome-wide association study (GWAS) of adverse events following smallpox vaccination. We will apply these software packages and methods to a GWAS of adverse events following vaccination for smallpox that includes approximately 500,000 SNPs measured using the Illumina BeadArray platform in a detection sample of 103 volunteers and a replication sample of 60 volunteers that are part of an ongoing NIAID/NIH-sponsored trial to evaluate the Aventis Pasteur Smallpox Vaccine (APSV).

AIM 4. Explore other important biodefense applications of MDR. We will extend the range of MDR applications by applying these methods to other problems such as the prediction of pathogenicity of different Variola virus isolates, the prediction of metabolic features of bacteria using DNA sequence information and the prediction of human immune response endpoints. These new applications will play an important role in the refinement of our methods and software to ensure they are of general use to the biodefense research community and the broader biomedical research community.

Friday, August 14, 2009

Casey S. Greene, Ph.D.

Casey Greene from my lab successfully defended his Ph.D. today. The title of his dissertation is "Relief-based bioinformatics methods for the analysis of epistasis in genetic association studies". Nice job Casey!

Casey is off to the Lewis-Sigler Institute for Integrative Genomics at Princeton University where he will be doing a postdoc with Dr. Olga Troyanskaya.

His dissertation chapters include the following papers:

Greene, C.S., Penrod, N.M., Williams, S.M., Moore, J.H. Failure to replicate a genetic association may provide important clues about genetic architecture. PLoS One 4, e5639 (2009). [PubMed]

Greene, C.S., Kiralis, J., Moore, J.H. Nature-inspired algorithms for the genetic analysis of epistasis in common human diseases: A theoretical assessment of wrapper vs. filter approaches. Proceedings of the IEEE Congress on Evolutionary Computation, pp. 800-807 (2009). [IEEE]

Greene, C.S., Penrod, N.M., Kiralis, J., Moore, J.H. Spatially uniform ReliefF (SURF) for computationally-efficient filtering of gene-gene interactions. BioData Mining, in press (2009).

Greene, C.S., Kiralis, J., Moore, J.H. The informative extremes: Using both nearest and farthest neighbors can improve Relief algorithms in the domain of human genetics. in review.

Thursday, August 13, 2009

Inferring gene networks: dream or nightmare?

The following is an interesting review of the Dialogue for Reverse Engineering Assessments and Methods (DREAM2) Reverse Engineering Competition 2007. We have thought about entering this competition. More information about DREAM can be found here.

Baralla A, Mentzen WI, de la Fuente A. Inferring gene networks: dream or nightmare? Ann N Y Acad Sci. 2009 Mar;1158:246-56. [PubMed]


Inferring gene networks is a daunting task. We here describe several algorithms we used in the Dialogue for Reverse Engineering Assessments and Methods (DREAM2) Reverse Engineering Competition 2007: an algorithm based on first-order partial correlation for discovering BCL6 targets in Challenge 1 and an algorithm using nonlinear optimization with winning performance in Challenge 3. After the gold standards for the challenges were released, the performance of alternative variants of the algorithms could be evaluated. The DREAM competition taught us some strong lessons. Amazingly, simpler methods performed in general better than more advanced, theoretically motivated approaches. Also, the challenges strongly showed that inferring gene networks requires controlled experimentation using a well-defined experimental design. Analyzing data obtained through merging many unrelated datasets indeed resulted in weak performances of all algorithms, while algorithms that explicitly took the experimental design into account performed best.
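
The abstract names first-order partial correlation without giving details, so the sketch below shows only the textbook formula: the correlation between two genes after removing the linear effect of a third. How the authors applied it in Challenge 1 is not described here, and the expression values are invented.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical expression profiles: x and y both track the regulator z,
# plus independent noise, so they correlate without a direct link.
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [2, 1, 4, 3, 6, 5, 8, 7]
y = [2, 3, 2, 3, 6, 7, 6, 7]

print(f"corr(x, y) = {pearson(x, y):.2f}")
print(f"partial corr(x, y | z) = {partial_corr(x, y, z):.2f}")
```

In this toy case the marginal correlation between x and y is high, while the partial correlation given z is close to zero, which is the pattern used to distinguish indirect from direct relationships.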

Wednesday, August 12, 2009

Treasure Your Marginal Genetic Effects

The prologue of the book about Bateson that I mentioned in my August 4th, 2009 post below has a great quote:

"Nevertheless, if I may throw out a word of counsel to beginners, it is: Treasure your exceptions! When there are none, the work gets so dull that no one cares to carry it further. Keep them always uncovered and in sight. Exceptions are like the rough brickwork of a growing building which tells that there is more to come and shows where the next construction is to be." William Bateson, 1908

Bateson was likely referring to the exceptions to Mendelian principles of heredity (e.g. epistasis). We now know that the exceptions Bateson was referring to are the norm in human genetics. GWAS has very clearly shown that genetic variants with large marginal effects are indeed the exception and thus should be treasured. If it weren't for ApoE there would be a lot less excitement for the role of genetics in complex diseases.

Monday, August 10, 2009

Postdoctoral Studies in Quantitative Biomedical Sciences at Dartmouth

I have been awarded a new NIH training grant (R25 CA134286) to support postdoctoral students at the interface between bioinformatics, biostatistics and epidemiology with a focus on cancer research. The abstract from the grant can be found here. I have included the advertisement below. Let me know if you might be interested.

Dartmouth Medical School invites applications for a new postdoctoral training and career development program designed to cross-train scientists in the fields of bioinformatics, biostatistics and epidemiology for cancer research in the biomedical sciences. The Training Program for Quantitative Biomedical Sciences in Cancer at Dartmouth is supported by the Cancer Education and Career Development program of the National Cancer Institute. Trainees with doctorates in diverse biomedical sciences will choose a secondary focus area among the three core disciplines and participate in a combination of structured, group learning activities and individually designed mentored research opportunities. Stipends, course tuition and certificates of training are provided.

Applicants must possess a PhD, combination PhD/MD, or MD degree. Applicants with a PhD in one of the three core disciplines (bioinformatics, biostatistics and epidemiology) are encouraged to apply for the purpose of receiving training in one of the other two disciplines. Highly qualified applicants with doctoral degrees in other biomedical sciences or in clinical medicine are also eligible and are encouraged to apply. Candidates who are current or former PIs on NIH Small Grants (R03s) or Exploratory/Developmental Grants (R21s) are eligible. Individuals appointed to the program must be citizens or non-citizen nationals of the United States (U.S.), or must have been lawfully admitted to the U.S. for permanent residence. Individuals on temporary visas are not eligible. Candidates are appointed for at least 2 years and can be supported for up to 3 years.

Founded in 1797, Dartmouth Medical School draws on the resources of Dartmouth College, Dartmouth-Hitchcock Medical Center and the Norris Cotton Cancer Center to support broad interdisciplinary programs in biomedical research, education, patient care and service. The school is located in the Upper Valley region of New Hampshire, which offers idyllic landscapes and recreation, outstanding schools and cultural activities, and accessibility to major northeastern cities, including Boston (2.5 hours' drive) and Montreal (3 hours' drive).

Submissions should include a letter describing the background and interests of the applicant, curriculum vitae, and names and contact information for three references.

Applicant materials should be mailed or e-mailed to:
Training Program for Quantitative Biomedical Sciences in Cancer
Dartmouth Medical School
Attention: Vicki Sayarath
7927 Rubin Building
One Medical Center Drive
Lebanon, New Hampshire 03756

Dartmouth Medical School is an affirmative action/equal opportunity employer and encourages women and minority candidates to apply.

Sunday, August 09, 2009

The Pathologies of Big Data

There is an interesting article by Adam Jacobs in the new Communications of the ACM (Aug. 09, v.52, #8, pp. 36-44) on "The Pathologies of Big Data".

"Here’s the big truth about big data in traditional databases: it’s easier to get the data in than out."

"To achieve acceptable performance for highly order-dependent queries on truly large data, one must be willing to consider abandoning the purely relational database model for one that recognizes the concept of inherent ordering of data down to the implementation level."

Wednesday, August 05, 2009

Systems genetics analysis of cancer susceptibility

An interesting new review focusing on interactions.

Quigley D, Balmain A. Systems genetics analysis of cancer susceptibility: from mouse models to humans. Nat Rev Genet. 2009, in press. [PubMed]


Genetic studies of cancer susceptibility have shown that most heritable risk cannot be explained by the main effects of common alleles. This may be due to unknown gene-gene or gene-environment interactions and the complex roles of many genes at different stages of cancer. Studies using mouse models of cancer suggest that methods that integrate genetic analysis and genomic networks with knowledge of cancer biology can help to extend our understanding of heritable cancer susceptibility.

Tuesday, August 04, 2009

The Science and Life of William Bateson

I just ordered a copy of the following biography of William Bateson, who coined the term epistasis.

Cock, A.G., Forsdyke, D.R. Treasure Your Exceptions: The Science and Life of William Bateson. Springer (2008). [Amazon]

There is a nice review of this book by Wade in Evolution.

Wade MJ. William Bateson: Variation, heredity, and speciation. Evolution, in press (2009). [PubMed]

Monday, August 03, 2009

The genome-centric concept: resynthesis of evolutionary theory

A call to move beyond the 'one gene at a time approach' to the study of evolution. Very relevant to human genetics. I found this paper by searching PubMed for "complex system". I also love this journal.

Heng HH. The genome-centric concept: resynthesis of evolutionary theory. Bioessays. 2009 May;31(5):512-25. [PubMed]

Modern biology has been heavily influenced by the gene-centric concept. Paradoxically, this very concept--on which bioresearch is based--is challenged by the success of gene-based research in terms of explaining evolutionary theory. To overcome this major roadblock, it is essential to establish new theories, to not only solve the key puzzles presented by the gene-centric concept, but also to provide a conceptual framework that allows the field to grow. This paper discusses a number of paradoxes and illustrates how they can be addressed by the genome-centric concept in order to further resynthesize evolutionary theory. In particular, methodological breakthroughs that analyze genome evolution are discussed. The multiple interactions among different levels of a complex system provide the key to understanding the relationship between self-organization and natural selection. Darwinian natural selection applies to the biological level due to its unique genetic and heterogeneous features, but does not simply or directly apply to either the lower non-living level or higher intellectual society level. At the complex bio-system level, the genome context (the entire package of genes and their genomic physical relationship or genomic topology), not the individual genes, defines the system and serves as the principle selection platform for evolution.