Epistasis Blog

From the Artificial Intelligence Innovation Lab at Cedars-Sinai Medical Center (www.epistasis.org)

Sunday, August 31, 2008

MDR 2.0 alpha

An alpha version of MDR 2.0 will be ready next week for testing. If you would like to try it out, send me an email. Feedback is required!

This new version will load and analyze a genome-wide association study (GWAS) using stochastic search algorithms. The key to a successful GWAS using MDR is expert or domain-specific knowledge. This can be in the form of LOD scores from a linkage analysis, ReliefF scores for the SNPs, or any other number you can assign to each SNP, such as one derived from biochemical pathways or Gene Ontology. I am a firm believer that expert knowledge of this type is the only thing that will make a GWAS of epistasis technically feasible.

The first stochastic search algorithm provided in MDR is a simple estimation of distribution algorithm (EDA). It assigns a probability to each SNP and updates it based on the quality of the models the SNP appears in and/or the expert knowledge you provide. These probabilities are then used to pick SNPs for MDR models. Our recent paper at the ANTS conference describes this algorithm. See my June 19, 2008 post for a short description of the paper. Email me for a preprint. The final version of the paper will be available soon.
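
Here is a minimal Python sketch of the general idea, not the code in MDR 2.0. The function names, the additive weight update, the parameter values and the toy scoring function are all assumptions made for illustration; in MDR the score would come from evaluating an actual MDR model.

```python
import random


def sample_model(rng, weights, k):
    """Draw k distinct SNP indices, each with probability proportional to its weight."""
    available = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        pick = rng.choices(available, weights=[weights[i] for i in available], k=1)[0]
        chosen.append(pick)
        available.remove(pick)
    return tuple(sorted(chosen))


def eda_search(n_snps, score_model, expert_weights=None, model_size=2,
               n_models=50, n_iterations=20, learning_rate=0.5, seed=0):
    """Toy EDA over SNP subsets; returns (best_model, best_score, weights)."""
    rng = random.Random(seed)
    # Expert knowledge (e.g. ReliefF or LOD scores, assumed positive) seeds the
    # sampling weights; without it every SNP starts out equally likely.
    weights = list(expert_weights) if expert_weights is not None else [1.0] * n_snps
    best_model, best_score = None, float("-inf")
    for _ in range(n_iterations):
        scored = []
        for _ in range(n_models):
            model = sample_model(rng, weights, model_size)
            scored.append((score_model(model), model))
        scored.sort(reverse=True)
        if scored[0][0] > best_score:
            best_score, best_model = scored[0]
        # Reward the SNPs that appeared in the better half of this batch
        # (an assumed update rule; scores are assumed to be non-negative).
        for score, model in scored[: n_models // 2]:
            for snp in model:
                weights[snp] += learning_rate * score
    return best_model, best_score, weights


def toy_score(model):
    """Hypothetical stand-in for an MDR accuracy estimate: SNPs 3 and 7 interact."""
    hits = len({3, 7} & set(model))
    return {0: 0.5, 1: 0.55, 2: 0.9}[hits]


best_model, best_score, weights = eda_search(10, toy_score)
print(best_model, best_score)
```

Passing a list of per-SNP scores as expert_weights biases the very first batch of models toward the SNPs you already have reason to suspect, which is the role expert knowledge plays in the search.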

Friday, August 15, 2008

Epistasis Analysis in E. coli: eSGA

This looks interesting. These types of high-throughput experimental methods will be essential for the study of both biological and statistical epistasis. The lack of such methods is currently a major bottleneck.

Butland G, Babu M, Díaz-Mejía JJ, Bohdana F, Phanse S, Gold B, Yang W, Li J, Gagarinova AG, Pogoutse O, Mori H, Wanner BL, Lo H, Wasniewski J, Christopolous C, Ali M, Venn P, Safavi-Naini A, Sourour N, Caron S, Choi JY, Laigle L, Nazarians-Armavil A, Deshpande A, Joe S, Datsenko KA, Yamamoto N, Andrews BJ, Boone C, Ding H, Sheikh B, Moreno-Hagelseib G, Greenblatt JF, Emili A.

eSGA: E. coli synthetic genetic array analysis. Nat Methods. 2008 Aug 1. [Epub ahead of print] [PubMed]

Physical and functional interactions define the molecular organization of the cell. Genetic interactions, or epistasis, tend to occur between gene products involved in parallel pathways or interlinked biological processes. High-throughput experimental systems to examine genetic interactions on a genome-wide scale have been devised for Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans and Drosophila melanogaster, but have not been reported previously for prokaryotes. Here we describe the development of a quantitative screening procedure for monitoring bacterial genetic interactions based on conjugation of Escherichia coli deletion or hypomorphic strains to create double mutants on a genome-wide scale. The patterns of synthetic sickness and synthetic lethality (aggravating genetic interactions) we observed for certain double mutant combinations provided information about functional relationships and redundancy between pathways and enabled us to group bacterial gene products into functional modules.

Friday, August 08, 2008

MDR 1.2.5

We fixed a small bug in the MDR attribute construction and released version 1.2.5 on SourceForge. If the number of attribute combinations exceeds the number of rows, MDR switches from a dense matrix to a sparse matrix. The bug was in the code that fills the sparse matrix: it would throw a null pointer exception and the attribute would silently not be created. This will probably not affect anyone.
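
The dense-to-sparse switch can be sketched roughly as follows. This is Python rather than MDR's Java, and it is not the actual MDR code: the names count_cells and cell_index, the 0/1/2 genotype coding and the [controls, cases] cell layout are assumptions for illustration only.

```python
def cell_index(genotypes, n_levels):
    """Map a genotype combination to a single cell number (mixed-radix encoding)."""
    index = 0
    for genotype in genotypes:
        index = index * n_levels + genotype
    return index


def count_cells(rows, n_levels=3):
    """Count [controls, cases] per genotype combination.

    rows is a list of (genotypes, status) pairs, where genotypes is a tuple of
    integers in range(n_levels) and status is 0 (control) or 1 (case).
    Returns ("dense", list) or ("sparse", dict) depending on table size.
    """
    n_attributes = len(rows[0][0])
    n_combinations = n_levels ** n_attributes
    if n_combinations <= len(rows):
        # Dense: every possible cell is allocated up front.
        dense = [[0, 0] for _ in range(n_combinations)]
        for genotypes, status in rows:
            dense[cell_index(genotypes, n_levels)][status] += 1
        return "dense", dense
    # Sparse: more possible cells than rows, so only observed cells are stored.
    sparse = {}
    for genotypes, status in rows:
        # setdefault creates the cell the first time a combination is seen.
        sparse.setdefault(cell_index(genotypes, n_levels), [0, 0])[status] += 1
    return "sparse", sparse


kind, table = count_cells([((0, 1), 1), ((0, 1), 0), ((2, 2), 1)])
print(kind, table)
```

With three rows and nine possible two-SNP combinations the sketch takes the sparse path, so at most three cells are ever stored instead of nine.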

Friday, August 01, 2008

Does Complexity Matter?

Our paper on the development and evaluation of a computational evolution system for solving complex problems in human genetics will be published by Springer as a chapter in a new book that will be out later this year or early in 2009. I just received the page proofs. Please email me if you would like a preprint. The book (details below) is a set of contributed chapters by attendees of the Genetic Programming Theory and Practice (GPTP) Workshop held by the Center for the Study of Complex Systems at the University of Michigan this past spring. This work builds on the work discussed in my Dec. 23, 2007 blog post.

Moore, J.H., Greene, C.S., Andrews, P., White, B.C. Does complexity matter? Artificial evolution, computational evolution and the genetic analysis of common human diseases. In: Genetic Programming Theory and Practice VI, in press, Springer (2008).


Common human diseases are complex and likely the result of nonlinear interactions between multiple different DNA sequence variations. One goal of human genetics is to use data mining and machine learning methods to identify sets of discrete genetic attributes that are predictive of discrete measures of health in human population data. A variety of different computational intelligence methods based on artificial evolution have been developed and applied in this domain. While artificial evolution approaches such as genetic programming show promise, they are only loosely based on real biological and evolutionary processes. It has recently been suggested that a new paradigm is needed where “artificial evolution” is transformed to “computational evolution” by incorporating more biological and evolutionary complexity into existing algorithms. Computational evolution systems have been proposed as more likely to solve problems of interest to biologists and biomedical researchers. To test this hypothesis, we developed a prototype computational evolution system for the analysis of human genetics data capable of evolving operators of arbitrary complexity. Preliminary results suggest that more complex operators result in better solutions. Here we introduce modifications including a simpler stack-based solution representation, the ability to maintain and use an archive of solution building blocks, and a simpler set of solution operator building blocks capable of learning to use pre-processed expert knowledge. A parameter sweep suggests that operators that can use expert knowledge or archival information outperform those that cannot. This study supports the idea that complexity matters and thus the consideration of computational evolution for bioinformatics problem-solving in the domain of human genetics.

Figure 1. Visual overview of our computational evolution system for discovering symbolic discriminant functions that differentiate disease subjects from healthy subjects using information about single nucleotide polymorphisms (SNPs). The hierarchical structure is shown on the left while some specific examples at each level are shown in the middle. At the lowest level (D) is a grid of solutions. Each solution consists of a list of functions and their arguments (e.g. X1 is an attribute) that are evaluated using a stack (denoted by ST in the solution). The next level up (C) is a grid of solution operators that each consist of some combination of the ADD, DELETE and COPY functions, each with its respective set of probabilities that define whether expert knowledge from ReliefF (denoted by E in the probability pie) or the archive (denoted by A in the probability pie) is used instead of a random generator (denoted by R in the probability pie). ReliefF scores are derived by pre-processing the data (E) and are stored for use by the system (F). The attribute archive (G) is derived from the frequency with which each attribute occurs among solutions in the population. The top two levels of the hierarchy (A and B) exist to generate variability in the operators that modify the solutions. This system allows operators of arbitrary complexity to modify solutions. Note that we used 18x18 and 36x36 grids of 324 and 1296 solutions, respectively, in the present study. A 12x12 grid is shown here as an example.
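
To make the lowest level of the figure more concrete, here is a small Python sketch of a stack-evaluated solution and a frequency-based attribute archive. It is not the system described in the chapter. The function set, the rule that ST pops the previous result (or uses 0 when the stack is empty) and the normalization of the archive to proportions are assumptions chosen to keep the example short.

```python
import operator
from collections import Counter

# Assumed function set; the real system's functions may differ.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(solution, subject):
    """Evaluate one solution for one subject and return the top of the stack.

    solution is a list of (function_name, [arg1, arg2]) pairs. An argument is
    either an attribute name such as "X1" or "ST", meaning "pop the stack".
    subject maps attribute names to numeric genotype codes.
    """
    stack = []
    for name, args in solution:
        values = []
        for arg in args:
            if arg == "ST":
                # Assumption: an empty stack contributes 0.
                values.append(stack.pop() if stack else 0)
            else:
                values.append(subject[arg])
        stack.append(FUNCTIONS[name](*values))
    return stack[-1] if stack else 0


def attribute_archive(population):
    """Proportion of attribute slots occupied by each attribute across all solutions."""
    counts = Counter(
        arg
        for solution in population
        for _, args in solution
        for arg in args
        if arg != "ST"
    )
    total = sum(counts.values())
    return {attribute: n / total for attribute, n in counts.items()}


subject = {"X1": 1, "X2": 2, "X3": 0}
solution_a = [("+", ["X1", "X2"]), ("*", ["ST", "X2"])]
solution_b = [("-", ["X1", "X3"])]
print(evaluate(solution_a, subject))
print(attribute_archive([solution_a, solution_b]))
```

An operator that draws from the archive would pick attributes in proportion to these frequencies, while one that draws from expert knowledge would use the stored ReliefF scores in the same way.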