Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Wednesday, May 31, 2006

Genome-Wide Genetic Analysis using Genetic Programming

Our paper on "Exploiting Expert Knowledge in Genetic Programming for Genome-Wide Genetic Analysis" by Moore and White has been peer-reviewed and accepted for presentation at the Parallel Problem Solving from Nature conference in Iceland in September. This paper will be published in the Lecture Notes in Computer Science series from Springer. Here is the abstract:

Moore JH and White BC. Exploiting Expert Knowledge in Genetic Programming for Genome-Wide Genetic Analysis. Lecture Notes in Computer Science, in press (2006)


Human genetics is undergoing an information explosion. The availability of chip-based technology facilitates the measurement of thousands of DNA sequence variations from across the human genome. The challenge is to sift through these high-dimensional datasets to identify combinations of interacting DNA sequence variations that are predictive of common diseases. We have previously developed and evaluated a genetic programming (GP) approach to attribute selection and classification in this domain. We showed that GP is no better than a simple random search when classification accuracy is used as the fitness function. We then showed that including pre-processed estimates of attribute quality (i.e. expert knowledge) using Tuned ReliefF (TuRF) in a multiobjective fitness function that also includes accuracy significantly improves the performance of GP over that of random search. The goal of this paper was to develop and evaluate a GP approach that uses expert knowledge such as TuRF scores during selection to ensure trees with good building blocks are being recombined and reproduced. We simulated genetic datasets of varying effect size (i.e. signal strength) in which the disease model consists of two interacting DNA sequence variations that exhibit no independent effects on class (i.e. epistasis). We show here that using expert knowledge to select trees performs as well as a multiobjective fitness function but requires only a tenth of the population size. This study demonstrates that GP may be a useful computational discovery tool in this domain. This study raises important questions about the general utility of GP for these types of problems, the importance of data pre-processing, and the importance of expert knowledge. We anticipate this study will provide an important baseline for future studies investigating the usefulness of GP as a general computational discovery tool for large-scale genetic studies.
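
For readers new to the idea of a disease model with "no independent effects", here is a minimal sketch in Python. It is not the simulation code from the paper. The penetrance table, the allele frequency of 0.5, and the function name marginal_penetrance are all assumptions chosen for illustration. The sketch shows that when disease risk depends only on the combination of genotypes at two variations, the risk averaged over either variation alone can be identical for every genotype, so a one-variation-at-a-time analysis sees no signal.

```python
# Hypothetical penetrance table: P(disease | genotype at SNP A, genotype at SNP B).
# Rows index SNP A (AA, Aa, aa); columns index SNP B (BB, Bb, bb).
# These values are illustrative assumptions, not taken from the paper.
PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]

# Genotype frequencies under Hardy-Weinberg proportions, assuming an
# allele frequency of 0.5 at both SNPs.
GENOTYPE_FREQ = [0.25, 0.5, 0.25]


def marginal_penetrance(table, freq, axis):
    """Average disease risk per genotype of one SNP, weighted over the other SNP."""
    if axis == 0:
        # Marginals for SNP A: weight each row over the SNP B genotypes.
        return [sum(p * f for p, f in zip(row, freq)) for row in table]
    # Marginals for SNP B: weight each column over the SNP A genotypes.
    n_cols = len(table[0])
    return [
        sum(table[i][j] * freq[i] for i in range(len(freq)))
        for j in range(n_cols)
    ]


marginal_a = marginal_penetrance(PENETRANCE, GENOTYPE_FREQ, axis=0)
marginal_b = marginal_penetrance(PENETRANCE, GENOTYPE_FREQ, axis=1)

print("Marginal risk by SNP A genotype:", marginal_a)
print("Marginal risk by SNP B genotype:", marginal_b)
```

Every marginal risk comes out the same (up to floating-point rounding), even though the joint table varies between 0 and 0.1. That flat marginal profile is what makes such models hard for a search guided by accuracy alone, and it is why an attribute-quality estimate that is sensitive to interactions can be useful as expert knowledge.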

This work is funded by NIH grants R01 LM009012 (PI-Moore) and R01 AI59694 (PI-Moore).

