Genome-Wide Genetic Analysis with MDR
Our invited review paper on the use of multifactor dimensionality reduction (MDR) to detect epistasis on a genome-wide scale has been accepted for publication in a new book titled "Knowledge Discovery and Data Mining: Challenges and Realities with Real World Data" (Zhu and Davidson, editors) to be published by IGI. The original call for papers can be found here. The paper reviews the latest developments in using filter and wrapper approaches to apply MDR to datasets with thousands of SNPs. It also reviews our application of information theory and graph-based methods for interpreting MDR results.
Moore, J.H. Genome-wide analysis of epistasis using multifactor dimensionality reduction: feature selection and construction in the domain of human genetics. In: Zhu, Davidson (eds.) Knowledge Discovery and Data Mining: Challenges and Realities with Real World Data, IGI, in press.
Abstract
Human genetics is an evolving discipline that is being driven by rapid advances in technologies that make it possible to measure enormous quantities of genetic information. An important goal of human genetics is to understand the mapping relationship between interindividual variation in DNA sequences (i.e. the genome) and variability in disease susceptibility (i.e. the phenotype). The focus of the present study is the detection and characterization of nonlinear interactions among DNA sequence variations in human populations using data mining and machine learning methods. We first review the concept difficulty and then review a multifactor dimensionality reduction (MDR) approach that was developed specifically for this domain. We then present some ideas about how to scale the MDR approach to datasets with thousands of attributes (i.e. genome-wide analysis). Finally, we end with some ideas about how nonlinear genetic models might be statistically interpreted to facilitate making biological inferences.
This work was supported by NIH R01s AI59694 and LM009012 (PI-Moore).
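For readers new to MDR, the core idea mentioned in the abstract can be illustrated with a small sketch. This is not code from the paper: it is a minimal Python illustration of the commonly described MDR step, in which each multilocus genotype combination is labeled high-risk or low-risk by comparing its case:control ratio with the overall ratio, and SNP pairs are then ranked by how well that single new attribute classifies the data. The function names, the toy dataset, the use of balanced accuracy, and the rule that ties count as low-risk are all assumptions made for illustration, and cross-validation is omitted for brevity.

```python
from collections import Counter
from itertools import combinations


def mdr_high_risk_cells(genotypes, labels, snps):
    """Return the multilocus genotype cells labeled high-risk for `snps`."""
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    cases, controls = Counter(), Counter()
    for row, y in zip(genotypes, labels):
        cell = tuple(row[s] for s in snps)
        if y == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1
    high_risk = set()
    for cell in set(cases) | set(controls):
        # cases/controls > n_cases/n_controls, cross-multiplied to avoid
        # dividing by zero; ties are treated as low-risk (an assumption).
        if cases[cell] * n_controls > controls[cell] * n_cases:
            high_risk.add(cell)
    return high_risk


def balanced_accuracy(genotypes, labels, snps, high_risk):
    """Mean of sensitivity and specificity; assumes both classes are present."""
    tp = tn = fp = fn = 0
    for row, y in zip(genotypes, labels):
        predicted = 1 if tuple(row[s] for s in snps) in high_risk else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


def best_pair(genotypes, labels):
    """Exhaustively score every pair of SNPs and return (accuracy, pair)."""
    scored = []
    for snps in combinations(range(len(genotypes[0])), 2):
        high_risk = mdr_high_risk_cells(genotypes, labels, snps)
        scored.append((balanced_accuracy(genotypes, labels, snps, high_risk), snps))
    return max(scored)


# Hypothetical toy data: three binary-coded SNPs; an individual is a case
# only when SNP 0 and SNP 1 differ (a purely interactive, XOR-like model).
toy_genotypes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
toy_labels = [1 if a != b else 0 for a, b, c in toy_genotypes]

if __name__ == "__main__":
    print(best_pair(toy_genotypes, toy_labels))
```

On the toy data, neither SNP 0 nor SNP 1 is informative alone, but the pair separates cases from controls perfectly. The exhaustive loop over pairs is also what becomes expensive with thousands of SNPs, which is the motivation for the filter and wrapper approaches the paper reviews.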
1 Comment:
I attended your tutorial on bioinformatics, and your GP application really interests me. Here are several questions and comments that I didn't have a chance to ask:
1. After the GP returns the function expression, you analyze and interpret it. How do you know that the expression is entirely correct? Usually part (actually a large part) of a GP solution consists of introns or garbage. If your analysis takes that garbage into account, it leads to false assumptions.
2. I completely agree that the problem you're trying to solve is very difficult, especially since the data used for fitness contains so much noise (hence a fitness of .60+ is actually quite good; I doubt it's possible to reach .80+). As a matter of fact, an audience member asked why it's not capable of reaching > .9. I strongly doubt that any non-trivial GP program can reach such a good fitness (how can one know that such a perfect solution exists in the first place?). In addition, I think that none of the current linear methods will be capable of finding the answer evolved by GP (unless the solution is very simple). A suggestion: now that GP gives you some answer, can you keep collecting or refining the data to "improve" that answer? By that I mean including that GP answer in the initial population. Furthermore, can you use the answer evolved by GP as a fitness function to find additional input data? This is a co-evolutionary method (e.g., improving the result quality by improving the input arguments and vice versa).
3. One thing that concerns me is the extreme 'fuzziness' of the database (e.g., 40% of the answers contradict the other 60%). Assuming that is true, i.e., that any random set of data will be at that level of fuzziness, what is the point of attempting to find any prediction function? You know 40% of the results are false.
4. A technique I have often found helpful in GP is to supply some expert knowledge as input (not the same thing as the expert knowledge terminology): basically, have some advance knowledge of what the expression should look like and force all the chromosomes into that shape before testing them against the database. This greatly reduces the computing time, which was one of the concerns from the audience. Furthermore, due to the noise level in the database, you can make it less sensitive to the fitness calculation ...
I still have some more suggestions and comments if you're interested. Could you please send me:
1. your tutorial presentation (not on CD)
2. the GP paper you submitted
Once again, I found your method very encouraging. Best of luck in your future research. Thank you,
--
tvn (nguyenthanhvuh@gmail.com)