MDR 101 - Part 1 - Missing Data
This is the first in a series of posts on how to use the open-source multifactor dimensionality reduction (MDR) method and software package to detect, characterize and interpret epistasis (non-additive gene-gene interaction) in human population-based studies of health and disease susceptibility. I am working on a book about MDR and these posts will serve as a warm-up. Your feedback is greatly appreciated.
If you are new to MDR I suggest you start with the basic description of the method on Wikipedia. There are three papers that you should read to get the basics. You might start with the original paper describing MDR by Ritchie et al. (2001). Then, I suggest you read my review in Expert Review of Molecular Diagnostics (2004). For the latest ideas about MDR you should read our recent paper in the Journal of Theoretical Biology (2006). When describing MDR in your presentations and publications please use the definitions of the method provided in that 2006 paper. For an up-to-date review of MDR and epistasis in general see our 2009 paper in the American Journal of Human Genetics and our 2010 paper in Bioinformatics. Finally, I highly recommend our MDR review chapter that appeared in a 2010 volume of Advances in Genetics. This is the most recent overview of the MDR method.
Perhaps the most common question I get about using MDR is how to deal with missing data. The MDR software requires a complete dataset with no missing values for analysis. There are three common approaches for handling missing genotypes.
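First, you can simply remove subjects (rows) or SNPs (columns) until you have a complete dataset, that is, one in which every remaining subject has a genotype for every remaining SNP. This is probably the least desirable option since a handful of missing genotypes scattered across rows and columns can force you to throw away a large fraction of your data. In general, I avoid throwing any data away. The next two options are usually preferable.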
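To make the cost of this first option concrete, here is a minimal Python sketch. It is only an illustration and not part of the MDR software: the function names are hypothetical, and it assumes genotypes are coded 0, 1, 2 with `None` standing in for a missing value.

```python
def drop_incomplete_subjects(rows):
    """Keep only subjects (rows) that have no missing genotype (None)."""
    return [row for row in rows if None not in row]


def drop_incomplete_snps(rows):
    """Keep only SNPs (columns) that are observed in every subject."""
    n_snps = len(rows[0])
    keep = [j for j in range(n_snps) if all(row[j] is not None for row in rows)]
    return [[row[j] for j in keep] for row in rows]


# Toy data: 4 subjects x 4 SNPs, with just two missing genotypes.
genotypes = [
    [0, 1, 2, 0],
    [1, None, 0, 2],
    [2, 1, None, 1],
    [0, 0, 1, 1],
]

by_subject = drop_incomplete_subjects(genotypes)
by_snp = drop_incomplete_snps(genotypes)

print(len(by_subject), "of", len(genotypes), "subjects kept")
print(len(by_snp[0]), "of", len(genotypes[0]), "SNPs kept")
```

In this toy example two missing genotypes out of sixteen are enough to remove half of the subjects, or half of the SNPs, depending on which way you trim.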
Second, you can encode the missing genotypes with a new level that is not used to code your genotypes. For example, if your genotypes are coded 0,1,2 you could code your missing genotypes with a '4' or a '9'. With this option, MDR treats your missing genotypes as a fourth level, thus incorporating the missingness information into the model. This is probably an acceptable option as long as you have few missing genotypes and they are missing at random across cases and controls (not a bad idea to test this assumption!).
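The recoding, and one simple way to check the missing-at-random assumption, can be sketched in Python as follows. This is an illustration under stated assumptions rather than anything the MDR software does: the function names are hypothetical, `None` marks a missing genotype, status is coded 1 for cases and 0 for controls, and a two-sided Fisher's exact test on a 2x2 table (missing vs. observed, by status) is just one reasonable choice of test for a single SNP.

```python
from math import comb


def recode_missing(rows, missing_code=9):
    """Replace missing genotypes (None) with an otherwise unused level."""
    return [[missing_code if g is None else g for g in row] for row in rows]


def missingness_table(snp, status):
    """Count (missing, observed) genotypes in cases (1) and controls (0)."""
    table = {1: [0, 0], 0: [0, 0]}
    for g, s in zip(snp, status):
        table[s][0 if g is None else 1] += 1
    return table[1][0], table[1][1], table[0][0], table[0][1]


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum over all tables at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# One SNP across 8 subjects: 4 cases followed by 4 controls.
snp = [0, None, 1, None, 2, None, 1, 0]
status = [1, 1, 1, 1, 0, 0, 0, 0]

recoded = recode_missing([snp])[0]
p_value = fisher_exact_two_sided(*missingness_table(snp, status))
print("recoded SNP:", recoded)
print("p-value for missingness vs. status:", p_value)
```

A small p-value for a SNP would suggest that missingness differs between cases and controls, in which case the extra level could carry an artifactual signal into your MDR models. If you test many SNPs this way, remember that you are doing multiple tests.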
Third, you can impute the missing genotypes. This is usually what I recommend to users and collaborators. Here, you use a statistical model to predict the missing genotypes. Our MDR Data Tool will perform a simple frequency-based imputation. That is, it will fill in missing genotypes with the most common genotype for that SNP. This is performed on a SNP-by-SNP basis. Again, this is a reasonable option if you have few missing genotypes and they are missing at random across cases and controls. If you want to get fancy, you can use a multivariate imputation approach that takes into consideration patterns in the data such as linkage disequilibrium (LD). The R software has nice imputation procedures. We used one of these methods in our recent paper on bladder cancer (see Andrew et al. Carcinogenesis 2006). The nice thing about this third option is that you are left with a complete dataset (nothing is thrown away) and you don't have any extra levels in your MDR models to worry about.
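The frequency-based idea is simple enough to sketch in a few lines of Python. This is not the MDR Data Tool's code, only a minimal illustration of the same idea: the function name is hypothetical, `None` marks a missing genotype, and the handling of ties (the genotype seen first wins) and of SNPs with no observed genotypes (left untouched) are my own assumptions.

```python
from collections import Counter


def impute_most_common(rows):
    """Fill missing genotypes (None) with each SNP's most common genotype."""
    imputed = [list(row) for row in rows]  # copy so the input is not modified
    n_snps = len(rows[0])
    for j in range(n_snps):
        observed = [row[j] for row in rows if row[j] is not None]
        if not observed:
            continue  # assumption: leave a SNP alone if it is entirely missing
        # Assumption: on ties, the genotype encountered first is used.
        most_common = Counter(observed).most_common(1)[0][0]
        for row in imputed:
            if row[j] is None:
                row[j] = most_common
    return imputed


# Toy data: 4 subjects x 2 SNPs.
genotypes = [
    [0, None],
    [0, 2],
    [1, 2],
    [None, 1],
]

completed = impute_most_common(genotypes)
for row in completed:
    print(row)
```

Each SNP is handled independently of the others, which is exactly why this approach ignores LD. A multivariate method would instead use the genotypes at neighboring SNPs to inform the prediction.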
If you have additional questions about imputing I suggest you consult an expert in this area or read a book such as Little and Rubin's 'Statistical Analysis with Missing Data, 2nd Ed'. There is also a 2009 paper in Genetic Epidemiology on MDR and missing data. A preferable approach might be to use genotype imputation methods developed for genome-wide association studies (GWAS).
Don't forget that all of our MDR software including the MDR Data Tool is open-source. If you have a favorite imputation procedure, port it to Java and send it to us. We would be happy to include it in a future release.
This section was last updated on January 20, 2013.