Epistasis Blog

From the Artificial Intelligence Innovation Lab at Cedars-Sinai Medical Center (www.epistasis.org)

Friday, July 31, 2009

Biological vs. Statistical Epistasis

There is a new paper in PLoS Genetics by Clayton that highlights the challenges of making biological inferences from statistical models of interaction. I was surprised to see our 2006 paper in the Journal of Theoretical Biology cited as an example of confusing mathematical and biological interaction. Clayton interpreted our paper as saying that we can make causal statements from statistical models. Quite to the contrary, we highlight in our paper the enormous challenges faced when trying to make inferences about the biology happening at the cellular level from a statistical model summarizing population-level data. He also misinterpreted our use of information theory in this paper. We very clearly state in this paper and many others that entropy measures are useful for "statistical" interpretation. We never say anywhere that this is any type of biological interpretation. Clayton should have read and cited our 2005 BioEssays paper that goes through the difference between biological and statistical epistasis in great detail.

I see the Clayton paper as a defense of the status quo statistical approach to genetic association studies. I think he missed an important opportunity here to recognize, as Snyder did in 1951 (see previous post), the value of looking at genetic association data from multiple different points of view using multiple different statistical and computational methods. After all, there is no free lunch.
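As an aside, the purely "statistical" flavor of an entropy-based interaction measure is easy to illustrate with a toy sketch (the variable names and the XOR penetrance model below are mine, not from either paper): two SNPs can each carry zero information about disease status on their own while the pair together is fully informative. That is a statement about the population-level data, not about cellular biology.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (bits) of a list of hashable outcomes."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def mutual_info(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Toy XOR model: disease status depends only on the combination of two SNPs.
snp1 = [0, 0, 1, 1] * 25
snp2 = [0, 1, 0, 1] * 25
status = [a ^ b for a, b in zip(snp1, snp2)]

# Neither SNP is informative alone, but the pair is fully informative:
ig_a = mutual_info(snp1, status)                    # 0 bits
ig_b = mutual_info(snp2, status)                    # 0 bits
ig_ab = mutual_info(list(zip(snp1, snp2)), status)  # 1 bit
interaction = ig_ab - ig_a - ig_b                   # positive = synergy
print(ig_a, ig_b, ig_ab, interaction)
```

The positive interaction term flags statistical synergy between the two loci; nothing in the calculation licenses a claim about how the two gene products interact in the cell.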

Clayton DG. Prediction and interaction in complex disease genetics: experience in type 1 diabetes. PLoS Genet. 2009 Jul;5(7):e1000540. [PubMed]

Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006 Jul 21;241(2):252-61. [PubMed]

Moore JH, Williams SM. Traversing the conceptual divide between biological and statistical epistasis: systems biology and a more modern synthesis. Bioessays. 2005 Jun;27(6):637-46. [PubMed]

Thursday, July 30, 2009

Modified Two-Factor Ratios - OR - Why Did it Take Us 50 Years to Embrace Complexity in Human Genetics?

I have been reading the 1940 second edition of "The Principles of Heredity" by Laurence Snyder, Ph.D. (1901-1986). He has a nice chapter on Modified Two-Factor Ratios where he goes through in great detail deviations from the 9:3:3:1 Mendelian ratios that are due to epistasis.

"When a factor of one pair masks the expression of the factors of another pair, it is said to be epistatic to the other pair....The noun formed from the adjective epistatic is epistasis. Epistasis is thus the same effect between factors of two different pairs that dominance is between factors of two different pairs that dominance is between two alleles"

He goes on to describe in detail dominant epistasis (12:3:1), recessive epistasis (9:3:4), dominant and recessive epistasis (13:3), duplicate recessive epistasis (9:7), duplicate dominant epistasis (15:1), and incompletely duplicate epistasis (9:6:1).
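These modified ratios are easy to verify by enumeration. Here is a minimal sketch (the function names are mine) that recovers the 9:3:4 recessive-epistasis ratio from the 16 equally likely gamete combinations of an AaBb x AaBb cross, where aa masks whatever the B locus would otherwise express:

```python
from collections import Counter
from itertools import product

def dihybrid_offspring():
    """Enumerate the 16 equally likely gamete combinations of an AaBb x AaBb cross."""
    gametes = ["AB", "Ab", "aB", "ab"]
    for g1, g2 in product(gametes, repeat=2):
        # genotype at each locus, e.g. ("Aa", "Bb")
        yield ("".join(sorted([g1[0], g2[0]])), "".join(sorted([g1[1], g2[1]])))

def recessive_epistasis(locus_a, locus_b):
    """aa masks the B locus entirely, giving the 9:3:4 ratio."""
    if locus_a == "aa":
        return "masked"  # 4/16
    if "B" in locus_b:
        return "A_B_"    # 9/16
    return "A_bb"        # 3/16

counts = Counter(recessive_epistasis(a, b) for a, b in dihybrid_offspring())
print(counts)  # 9 A_B_, 3 A_bb, 4 masked
```

Swapping in a different classification rule reproduces each of the other ratios Snyder lists (e.g. merging the A_bb and masked classes gives the 9:7 duplicate recessive ratio).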

He has an interesting chapter toward the end of the book on The Inheritance of Mental Traits in Man where he discusses the role of genetics in musical ability and intelligence, for example. It is important to note that there was a general recognition at this time that these traits were complex and influenced by many genes and the environment. He cites a study by Philiptschenko (1927) suggesting that musical ability is influenced by four pairs of genes with modifying effects. He later suggests that intelligence is influenced by the environment and many variable genes.

It is interesting that as early as the 1930s there was a general recognition that epistasis is an important phenomenon and that human traits were likely due to multiple environmental and genetic factors. We lost this complex thinking during the reductionist molecular revolution and are only now, after the failure of genome-wide association studies (GWAS), starting to come back to it. I find it really intriguing that geneticists in the early 1900s had more insight into the complexity of human traits than many of us do now.

Consider, for example, a 1951 American Journal of Human Genetics paper by Snyder on "Old and New Pathways in Human Genetics". In this paper he suggests that "...if human genetics is to progress along fresh pathways, the traditional atomistic [i.e. single gene] approach must be supplemented by new methods which will provide information on multifactorial inheritance". He further states that "We must be able to analyze genetic variability without recourse to classical single-gene analyses". It is hard to believe that more than 50 years later the "atomistic" approach still dominates human genetics.

As a side note, I love the last paragraph of Snyder's paper (below). He recognized more than 50 years ago the importance of team science. Why did it take the rest of the field so long to come around to this?

"The human genetic studies of the future must be cooperative efforts. Only by teamwork involving scientists from many areas can the understanding of the genetics of man be expected to advance appreciably. To those of you in related fields who are willing to lend your aid and advice to such teams, it may be confidently promised that in direct proportion to the data and information thus provided there will emerge a deeper and more significant understanding of human biology, and recurrent new practical ways in which to use the information for the improvement of the health and welfare of all mankind."

Wednesday, July 29, 2009

Complex Systems and Networks

The July 24, 2009 issue of Science has a special section on Complex Systems and Networks with multiple interesting reviews and perspectives including a review of transcriptional networks. See the table of contents here.

I like this quote they use from Martin Luther King Jr.: "We are caught in an inescapable network of mutuality...Whatever affects one directly, affects all indirectly."

Tuesday, July 28, 2009

EvoBIO'10 - First Call for Papers

EvoBIO 2010

The 8th annual European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics.

Istanbul, Turkey, 7-9 April 2010


Submission deadline 4th November 2009


The European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics is a multidisciplinary conference that brings together researchers in Bioinformatics who apply advanced techniques from Evolutionary Computation, Machine Learning, and Data Mining to important problems in molecular biology, proteomics, genomics and genetics. The primary focus of the conference is to present the latest advances in these approaches for Bioinformatics and to provide a forum for the discussion of new research directions.

Topics of interest include but are not limited to:

- biomarker discovery
- cell simulation and modeling
- ecological modeling
- fluxomics
- gene networks
- high-throughput biotechnology
- metabolomics
- microarray analysis
- phylogeny
- protein interaction
- proteomics
- sequence analysis and alignment
- biological networks analysis
- systems biology

EvoBIO 2010 will be the 8th annual European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. The conference will be held in conjunction with the EuroGP (13th European Conference on Genetic Programming), EvoCOP 2010 (10th European Conference on Evolutionary Computation in Combinatorial Optimisation) and EvoApplication 2010, the specialist conference on a range of evolutionary computation topics and applications.

We are interested in papers in three major areas:

1) Full research articles (maximum 12 pages) describing new methodologies, approaches, and/or applications (oral or poster presentation)

2) System Demonstrations (maximum 6 pages) outlining the nature of the system and describing why the demonstration is likely to be of interest to the conference. Demonstrations of interest include systems under development or in active use in research or practice domains. Selected demo submissions may be asked to give an oral presentation in the conference sessions.

3) Short reports (maximum 6 pages) describing new methodologies, approaches, and/or applications (poster presentation)

Each accepted paper will be presented orally or as a poster at the conference and will be printed in the proceedings published by Springer Verlag in the LNCS series.

Saturday, July 25, 2009

Adaptively weighted association statistics

This looks interesting.

LeBlanc M, Kooperberg C. Adaptively weighted association statistics. Genet Epidemiol. 2009 Jul;33(5):442-52. [PubMed]


We investigate methods for testing gene-disease outcome associations in situations where the genetic relationship potentially varies among subjects with differing environmental or clinical attributes. We propose a strategy which modestly increases multiple testing by evaluating weighted test statistics which focus (or enrich) association tests within subgroups and use a Monte-Carlo method, based on simulating from the approximate large sample distribution of the statistics, to control type 1 error. We also introduce a stage-wise calculated test statistic which allows more complex weighting on multiple environmental variables. Results from simulation studies confirm improved power of the proposed approaches compared to marginal testing in many situations.

Tuesday, July 21, 2009

Maximum Entropy Conditional Probability Modeling

This new paper by Miller et al. looks interesting.

Miller DJ, Zhang Y, Yu G, Liu Y, Chen L, Langefeld CD, Herrington D, Wang Y. An Algorithm for Learning Maximum Entropy Probability Models of Disease Risk That Efficiently Searches and Sparingly Encodes Multilocus Genomic Interactions. Bioinformatics, in press, 2009. [PubMed]


MOTIVATION: In both genome-wide association studies (GWAS) and pathway analysis, the modest sample size relative to the number of genetic markers presents formidable computational, statistical, and methodological challenges for accurately identifying markers/interactions and for building phenotype-predictive models. RESULTS: We address these objectives via maximum entropy conditional probability modeling (MECPM), coupled with a novel model structure search. Unlike neural networks and support vector machines (SVMs), MECPM makes explicit and is determined by the interactions that confer phenotype-predictive power. Our method identifies both a marker subset and the multiple k-way interactions between these markers. Additional key aspects are: i) evaluation of a select subset of up to 5-way interactions while retaining relatively low complexity; ii) flexible SNP coding (dominant, recessive) within each interaction; iii) no mathematical interaction form assumed; iv) model structure and order selection based on the Bayesian Information Criterion, which fairly compares interactions at different orders and automatically sets the experiment-wide significance level; v) MECPM directly yields a phenotype-predictive model. MECPM was compared to a panel of methods on data sets with up to 1000 SNPs and up to 8 embedded penetrance function (i.e., ground-truth) interactions, including a 5- way, involving less than 20 SNPs. MECPM achieved improved sensitivity and specificity for detecting both ground-truth markers and interactions, compared with previous methods. AVAILABILITY: http://www.cbil.ece.vt.edu/ResearchOngoingSNP.htm CONTACT: djmiller@engr.psu.edu.

Monday, July 20, 2009

Fostering Innovation in a University Setting

I was asked today by another faculty member what universities can do to foster innovative research. Innovation is usually defined as the act of introducing something new [e.g. Dictionary.com]. According to Wikipedia, innovation may be incremental, radical or revolutionary and differs from invention in that it represents an idea that has been successfully applied to some problem. My own personal opinion is that 'significant' innovation is usually characterized by a 'radical' new approach to a particular problem. In molecular biology, PCR was truly innovative because it allowed investigators to pursue new and important research questions that were otherwise not feasible. However, I don't see the many incremental derivatives of PCR (e.g. rtPCR) as innovative because they are all fundamentally based on the same innovative idea.

What can universities do to foster innovation from their faculty? Here are some initial ideas. Send me your suggestions. I will update this list over the next few days.

1) Provide recurring discretionary money. Unfortunately, the NIH peer-review system does not encourage or reward innovative thinking. Most research proposals that are funded by the NIH are those that present incremental advances on previous ideas. Using the molecular biology example from above, a grant proposing to develop PCR would have a much harder time getting funded than a grant proposing to develop rtPCR once PCR had already been established. It is much easier to convince your peers that an incremental advance on an existing idea will work than a new idea. Conventional wisdom says that you need to have 1/4 to 1/2 the research already done to convince the NIH reviewers that you can actually do the work. By that time, the idea is no longer innovative. An important way universities can foster innovative research is to provide talented faculty with recurring discretionary funds that they can use to pursue the kind of innovative ideas that the NIH doesn't typically fund. The best way to do this is to establish endowed chairs that return 90% or more of the interest back to the investigator. Some universities do this and some do not.

2) Require or encourage faculty to take sabbaticals at other universities. I am a firm believer that innovation is stimulated by a change in scenery. Universities should require and pay their faculty to take short sabbaticals (e.g. one month) at least once every two years and long sabbaticals (6-12 months) every five years. Ideally sabbaticals would be taken at other universities where the investigator would get exposed to new faculty and new research environments. Our ability to innovate is significantly influenced by our local environment. Alternatively, the short sabbatical could be replaced by hosting visiting professors for one month. No university official has ever recommended that I take a sabbatical of any kind.

3) Require or encourage faculty to attend multiple scientific conferences. Departments and centers should encourage their faculty to attend at least 4-5 scientific conferences each year in a diversity of different disciplines. Those of us in biomedical research should be attending conferences in economics or meteorology in addition to cell biology and genetics. Innovation often comes from seeing how others solve complex problems. Knowing the state of the art in your own field only encourages incremental science. This could be facilitated by the department or institution paying for their faculty to attend one conference per year that is in a radically different discipline.

4) Require or encourage graduate students to take courses in other disciplines. Graduate students can be a wonderful source of innovation and we need to provide them with the same opportunities for stimulating creative thought. One way to do this is to require them to take a course in a completely different area of their choosing and give them graduate-level credit for it. For example, a graduate student in cell biology could take a graduate-level course in music, psychology, art or economics. Allowing a graduate student to be innovative greatly influences the level of innovation in the research lab as a whole. I require all my students to take at least one year of additional coursework in a different area. One of my students is working on a Ph.D. in Genetics and an M.S. in Computer Science at the same time. This ensures they can speak multiple languages and also ensures there is a constant flow of new ideas back to the lab.

5) Provide institutional recognition for innovative research. It is critical that faculty who successfully develop innovative ideas are appropriately rewarded. This can come in the form of promotion, annual awards from the institution, additional discretionary research dollars or salary increases, for example. The challenge of course is knowing when an innovative idea has been developed and then proactively recognizing it. Institutions should not wait until an innovative faculty member threatens to leave to provide recognition.

Friday, July 17, 2009


Accelerating Epistasis Analysis with Consumer Graphics Hardware

Our technical note on adapting MDR to run on a Graphics Processing Unit (GPU) has been accepted for publication in BMC Research Notes. The source code (mdrgpu) is available on sourceforge.net. The benchmarking results are VERY impressive.

Sinnott-Armstrong, N.A., Greene, C.S., Cancare, F., Moore, J.H. Accelerating epistasis analysis in human genetics with consumer graphics hardware. BMC Research Notes 2, 149 (2009). [PubMed]


Background: Human geneticists are now capable of measuring more than one million DNA sequence variations from across the human genome. The new challenge is to develop computationally feasible methods capable of analyzing these data for associations with common human disease, particularly in the context of epistasis. Epistasis describes the situation where multiple genes interact in a complex non-linear manner to determine an individual's disease risk and is thought to be ubiquitous for common diseases. Multifactor Dimensionality Reduction (MDR) is an algorithm capable of detecting epistasis. An exhaustive analysis with MDR is often computationally expensive, particularly for high order interactions. This challenge has previously been met with parallel computation and expensive hardware. The option we examine here exploits commodity hardware designed for computer graphics. In modern computers Graphics Processing Units (GPUs) have more memory bandwidth and computational capability than Central Processing Units (CPUs) and are well suited to this problem. Advances in the video game industry have led to an economy of scale creating a situation where these powerful components are readily available at very low cost. Here we implement and evaluate the performance of the MDR algorithm on GPUs. Of primary interest are the time required for an epistasis analysis and the price to performance ratio of available solutions.

Findings: We found that using MDR on GPUs consistently increased performance per machine over both a feature rich Java software package and a C++ cluster implementation. The performance of a GPU workstation running a GPU implementation reduces computation time by a factor of 160 compared to an 8-core workstation running the Java implementation on CPUs. This GPU workstation performs similarly to 150 cores running an optimized C++ implementation on a Beowulf cluster. Furthermore this GPU system provides extremely cost effective performance while leaving the CPU available for other tasks. The GPU workstation containing three GPUs costs $2000 while obtaining similar performance on a Beowulf cluster requires 150 CPU cores which, including the added infrastructure and support cost of the cluster system, cost approximately $82,500.

Conclusions: Graphics hardware based computing provides a cost effective means to perform genetic analysis of epistasis using MDR on large datasets without the infrastructure of a computing cluster.
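For readers unfamiliar with MDR, its central constructive step can be sketched in a few lines (a toy illustration under simplifying assumptions, not the published implementation): each multilocus genotype cell is labeled high- or low-risk by comparing its case:control ratio to the sample-wide ratio, which collapses many loci into a single one-dimensional attribute.

```python
from collections import defaultdict

def mdr_classify(genotypes, status):
    """Toy MDR step: label each multilocus genotype cell high-risk if its
    case:control ratio exceeds the sample-wide ratio.
    genotypes: list of tuples (one genotype combination per subject)
    status:    list of 0/1 disease labels
    """
    cells = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for g, s in zip(genotypes, status):
        cells[g][s] += 1
    overall = sum(status) / (len(status) - sum(status))  # cases per control
    return {g for g, (ctrl, case) in cells.items()
            if ctrl == 0 or case / ctrl > overall}

# XOR-style epistasis: risk depends on the pair, not on either SNP alone.
geno = [(0, 0), (0, 1), (1, 0), (1, 1)] * 10
stat = [a ^ b for a, b in geno]
print(mdr_classify(geno, stat))  # high-risk cells are (0, 1) and (1, 0)
```

The expensive part in practice is repeating this over every combination of loci with cross-validation, which is exactly the exhaustive search the GPU implementation accelerates.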

Thursday, July 16, 2009

Can genes predict drug responses?

The Summer 2009 issue of Biomedical Computation Review has a nice article by Dr. Chandra Shekhar on pharmacogenetics. Discussed is a 2009 paper in the New England Journal of Medicine on warfarin dosing. I like this article because there are several statements about the importance of interactions. Also, my former student, Dr. Marylyn Ritchie, is quoted. For more information about the detection of epistasis or gene-gene interaction in pharmacologic studies see our 2005 paper in Nature Reviews Drug Discovery and our 2008 paper in Current Pharmacogenomics and Personalized Medicine.

Also in this issue of Biomedical Computation Review (see pp. 2-3) is a discussion of whether grant applications for the development and maintenance of biomedical software should compete head to head with basic research applications. There are good points on both sides of this argument. I don't have a problem with software grants competing with basic research grants as long as the reviewers are qualified to review both types.

Wednesday, July 15, 2009

GWAS Analysis Using Gene Ontology

Peter Holmans has published a very nice paper in the American Journal of Human Genetics on using Gene Ontology to analyze genome-wide association study (GWAS) data. See also my Dec. 6, 2008 post on our paper by Askland et al. that approaches the problem in the same way.

Holmans P, Green EK, Pahwa JS, Ferreira MA, Purcell SM, Sklar P; Wellcome Trust Case-Control Consortium, Owen MJ, O'Donovan MC, Craddock N. Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am J Hum Genet. 2009 Jul;85(1):13-24. [PubMed]


We present a method for testing overrepresentation of biological pathways, indexed by gene-ontology terms, in lists of significant SNPs from genome-wide association studies. This method corrects for linkage disequilibrium between SNPs, variable gene size, and multiple testing of nonindependent pathways. The method was applied to the Wellcome Trust Case-Control Consortium Crohn disease (CD) data set. At a general level, the biological basis of CD is relatively well known for a complex genetic trait, and it thus acted as a test of the method. The method, known as ALIGATOR (Association LIst Go AnnoTatOR), successfully detected biological pathways implicated in CD. The method was also applied to a meta-analysis of bipolar disorder, and it implicated the modulation of transcription and cellular activity, including that which occurs via hormonal action, as an important player in pathogenesis.
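ALIGATOR's corrections for linkage disequilibrium, gene size, and nonindependent pathways are what make it novel; the underlying overrepresentation idea, though, is the familiar one-sided hypergeometric test. A minimal sketch on made-up numbers (the function name and all counts are hypothetical):

```python
from math import comb

def hypergeom_overrep_p(hits, draws, term_genes, total_genes):
    """P(X >= hits) when drawing `draws` significant genes at random from
    `total_genes`, of which `term_genes` are annotated with the GO term
    (one-sided hypergeometric overrepresentation test)."""
    return sum(comb(term_genes, k) * comb(total_genes - term_genes, draws - k)
               for k in range(hits, min(draws, term_genes) + 1)) / comb(total_genes, draws)

# 20,000 genes total, 100 significant, a GO term with 200 genes,
# 8 of which appear in the significant list (about 1 expected by chance):
p = hypergeom_overrep_p(8, 100, 200, 20000)
print(p)  # a very small p-value, consistent with overrepresentation
```

Applying this test naively to SNP-derived gene lists is exactly what overstates significance; correlated SNPs and large genes inflate the hit counts, which is the problem the paper's permutation-based corrections address.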

Tuesday, July 07, 2009

Diversity and Complexity in DNA Recognition by Transcription Factors

Why did anyone think it would be so simple? Let's first assume complexity and not be surprised when we find it.

Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X, Kuznetsov H, Wang CF, Coburn D, Newburger DE, Morris Q, Hughes TR, Bulyk ML. Diversity and complexity in DNA recognition by transcription factors. Science. 2009 Jun 26;324(5935):1720-3. [PubMed]


Sequence preferences of DNA binding proteins are a primary mechanism by which cells interpret the genome. Despite the central importance of these proteins in physiology, development, and evolution, comprehensive DNA binding specificities have been determined experimentally for only a few proteins. Here, we used microarrays containing all 10-base pair sequences to examine the binding specificities of 104 distinct mouse DNA binding proteins representing 22 structural classes. Our results reveal a complex landscape of binding, with virtually every protein analyzed possessing unique preferences. Roughly half of the proteins each recognized multiple distinctly different sequence motifs, challenging our molecular understanding of how proteins interact with their DNA binding sites. This complexity in DNA recognition may be important in gene regulation and in the evolution of transcriptional regulatory networks.