A computational evolution system for open-ended automated learning of complex genetic relationships
I will be giving a talk on the following topic at IGES on Tuesday morning. I will also be presenting this as a poster at ASHG.
A computational evolution system for open-ended automated learning of complex genetic relationships.
Jason H. Moore, Doug Hill, Casey S. Greene
The failure of genome-wide association studies to reveal the genetic architecture of common diseases suggests that it is time we embrace, rather than ignore, the complexity of the genotype-to-phenotype mapping relationship that is characterized by epistasis, plastic reaction norms, heterogeneity, and other phenomena such as epigenetics. The extreme complexity of the problem suggests that simple linear models and other approaches that assume simplicity are unlikely to capture the full spectrum of genetic effects. To this end, we have developed an open-ended computational evolution system (CES) that makes no assumptions about the underlying genetic model and can learn, through evolution by natural selection, how to solve a particular genetic modeling problem. This is accomplished by providing the basic mathematical building blocks (e.g., +, -, *, /, LOG, <, >, =, AND, OR, NOT) for models that can take any shape or form, and the basic building blocks for algorithmic functions (e.g., ADD, DELETE, COPY) that can manipulate genetic models in a manner dependent on expert statistical and biological knowledge or prior modeling experience. We have previously demonstrated that our CES approach has excellent power to detect epistatic relationships in genome-wide data across a wide range of heritabilities and sample sizes (Moore et al. 2008, 2009). We have also shown that this system can learn to utilize one of many sources of expert knowledge, thus providing an important clue as to how the system solves the problem (Greene et al. 2009). Here, we add a layer to our CES approach that introduces noise into the training data (5%, 10%, 15%, and 20%) to drive the discovery process toward models that are more likely to generalize. Using simulated epistatic relationships in genome-wide data, we show that the CES produces significantly smaller models (P<0.001), thus reducing false positives and overfitting while maintaining a power of 97% to 100%. These results are important because they show how introducing noise into the training data can yield more parsimonious models and reduce overfitting without the need for computationally expensive cross-validation. This study is an important step toward a paradigm of genetic analysis that makes few assumptions about a very complex genetic architecture.
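To make the noise-injection idea more concrete, here is a minimal Python sketch of flipping a fraction of case/control labels in a simulated genome-wide SNP dataset at the noise levels mentioned in the abstract. The function name `inject_label_noise` and the surrounding workflow are illustrative assumptions for this post, not the actual CES implementation.

```python
import numpy as np

def inject_label_noise(labels, noise_fraction, rng=None):
    """Flip a random fraction of binary case/control labels (0/1).

    Generic sketch of class-label noise injection; the noise mechanism
    inside CES may differ in detail.
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels).copy()
    n_flip = int(round(noise_fraction * labels.size))
    flip_idx = rng.choice(labels.size, size=n_flip, replace=False)
    labels[flip_idx] = 1 - labels[flip_idx]
    return labels

# Simulated genome-wide data: SNP genotypes coded 0/1/2 and case-control status.
rng = np.random.default_rng(0)
n_subjects, n_snps = 1000, 500
genotypes = rng.integers(0, 3, size=(n_subjects, n_snps))
status = rng.integers(0, 2, size=n_subjects)

# One plausible workflow (an assumption, not necessarily how CES does it):
# evolve candidate models against the noisy labels at each level so that
# only parsimonious, generalizable models survive selection.
for noise in (0.05, 0.10, 0.15, 0.20):
    noisy_status = inject_label_noise(status, noise, rng)
    flipped = np.mean(noisy_status != status)
    print(f"noise level {noise:.0%}: {flipped:.1%} of labels flipped")
```

The idea behind this kind of sketch is that a model able to fit the flipped labels exactly is almost certainly overfitting, so selection under noisy training data tends to favor smaller models, consistent with the parsimony result reported in the abstract.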