MDR for DNA Sequence Analysis
Our paper on using MDR for DNA sequence analysis has been accepted for publication in BioData Mining. This paper shows how our MDR method and software can be used for other data mining questions in bioinformatics.
Eric Arehart, Scott Gleim, Bill White, John Hwa and Jason H. Moore. Multifactor Dimensionality Reduction analysis identifies specific nucleotide patterns promoting genetic polymorphisms. BioData Mining, in press (2009).
The fidelity of DNA replication serves as the nidus for both genetic evolution and genomic instability fostering disease. Single nucleotide polymorphisms (SNPs) constitute more than 80% of the genetic variation between individuals. A new theory regarding DNA replication fidelity has emerged in which selectivity is governed by base-pair geometry and by interactions between the selected nucleotide, the complementary strand and the polymerase active site. We hypothesize that certain sequence combinations in the flanking regions of SNP fragments may predispose toward mutation.
We assembled a dataset from the Broad Institute as a first attempt at testing the hypothesis that flanking region motifs are associated with mutagenesis (n = 2,194). We expanded our inquiry by assembling another dataset of human SNPs and their flanking sequences (n = 29,967) collected from the National Center for Biotechnology Information (NCBI) database, along with a control set of human sequences randomly selected from the NCBI database (n = 909,364). The relationship between DNA sequence and mutation type was modeled using the novel multifactor dimensionality reduction (MDR) approach. MDR was originally developed to detect synergistic interactions between multiple SNPs that are predictive of disease susceptibility.
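To make the idea of applying MDR to flanking sequences concrete, here is a minimal sketch of the core MDR step for pairs of sequence positions. It is an illustration only, not the authors' MDR software. The function name `mdr_best_pair`, the 0/1 labeling (1 = SNP sequence, 0 = control sequence) and the use of balanced accuracy are assumptions made for the example. It also omits the cross-validation and permutation testing that a real analysis would need, and it assumes both classes are non-empty.

```python
from collections import Counter
from itertools import combinations


def mdr_best_pair(rows, labels):
    """Toy MDR over all pairs of sequence positions (hypothetical helper).

    rows   : equal-length nucleotide strings (e.g. flanking sequences)
    labels : 1 for a case (SNP) sequence, 0 for a control sequence
    Returns (balanced_accuracy, (pos_i, pos_j), high_risk_cells).
    """
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    threshold = n_case / n_ctrl  # overall case:control ratio

    best = None
    for i, j in combinations(range(len(rows[0])), 2):
        case_counts, ctrl_counts = Counter(), Counter()
        for row, y in zip(rows, labels):
            cell = (row[i], row[j])  # one multifactor "cell"
            if y == 1:
                case_counts[cell] += 1
            else:
                ctrl_counts[cell] += 1

        # Dimensionality reduction: collapse every observed cell into a
        # single binary attribute (high risk vs. low risk).
        high = {
            cell
            for cell in set(case_counts) | set(ctrl_counts)
            if case_counts[cell] > threshold * ctrl_counts[cell]
        }

        # Score the reduced attribute by balanced accuracy.
        true_pos = sum(case_counts[c] for c in high)
        true_neg = sum(n for c, n in ctrl_counts.items() if c not in high)
        bal_acc = 0.5 * (true_pos / n_case + true_neg / n_ctrl)

        if best is None or bal_acc > best[0]:
            best = (bal_acc, (i, j), high)
    return best
```

Called on a list of aligned flanking sequences and their labels, the function returns the position pair whose joint nucleotide identities best separate cases from controls, together with the nucleotide combinations flagged as high risk. Because each cell is judged on the joint identity of both positions, the sketch can pick up patterns that neither position shows on its own.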
The present study represents the first use of this computational methodology for modeling nonlinear patterns in molecular genetics. We discovered six significant models in the smaller Broad Institute dataset. We also found significant models (p << 0.001) for each SNP type examined in the larger NCBI dataset. Importantly, we also discovered a consistent motif of flanking region sites that predisposed to SNP genesis, and found that this motif was elongated or truncated depending on the SNP type examined. The MDR approach was able to effectively discern single sites within SNP fragments and their respective identities, as well as their collective contribution to SNP genesis.