MDR 101 - Part 3 - Analysis
After addressing quality control issues and filtering your SNPs to a reasonable subset, it is time for an MDR analysis. First, make sure you are using the latest version of the open-source MDR software. Follow the link on www.epistasis.org for the newest version.
The first step is to load your file in the Analysis tab. Check the sample file distributed with the software for the data format. If you successfully load the data you can view it with the View Datafile button. The Datafile Information area will show the name of the file, its path, the number of subjects (i.e. instances), the number of SNPs (i.e. attributes/variables), and the ratio of cases to controls. Note that the current version of MDR will only load datasets with up to about 10^6 SNPs. A future version will allow you to load bigger datasets. Running MDR on more than 100 or 1000 SNPs will probably require a fast C version optimized for high-performance computing. Download libmdr from here if you want to write a C program.
Once you are sure you have loaded the right data you are ready to run an MDR analysis. Feel free to use the default settings to run a quick MDR analysis by pushing the Run Analysis button. Go to the Configuration Tab to change the settings. Here is what each setting is used for.
Random Seed: The random seed controls how the dataset is divided into different parts if cross-validation is used. Running MDR using the same random seed should give the same results from run to run as long as the other settings are the same. Feel free to run MDR using different random seeds to see how the results differ when your data is divided differently. Across 10 runs you should get similar results if you have a detectable signal. For publication purposes you really only need to report results from one seed.
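To make the role of the seed concrete, here is a minimal Python sketch of a seeded cross-validation split. This is not the code the MDR software uses; the function name assign_folds and the round-robin dealing are assumptions made for illustration only.

```python
import random

def assign_folds(n_subjects, n_folds=10, seed=0):
    """Return a list giving each subject's fold index (0..n_folds-1).

    Hypothetical helper for illustration; not part of the MDR software.
    """
    rng = random.Random(seed)      # local generator: the seed alone determines the split
    order = list(range(n_subjects))
    rng.shuffle(order)
    folds = [0] * n_subjects
    for position, subject in enumerate(order):
        folds[subject] = position % n_folds   # deal shuffled subjects out round-robin
    return folds

same_a = assign_folds(400, n_folds=10, seed=1)
same_b = assign_folds(400, n_folds=10, seed=1)
other = assign_folds(400, n_folds=10, seed=2)

print(same_a == same_b)   # same seed, same partition
print(same_a == other)    # a different seed reshuffles the subjects
```

The point of the sketch is only that the partition is a deterministic function of the seed, which is why rerunning with the same seed and settings reproduces the same results.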
Attribute Count Range: This tells MDR the order of the interactions to be considered. The default settings will return the best 1-, 2-, 3-, and 4-attribute models. In general, I rarely go above a 5-way interaction unless the dataset is very large (e.g. n greater than 2000 instances).
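A quick way to see why the interaction order matters is to count what an exhaustive search has to evaluate. The Python sketch below is only arithmetic, not MDR code; the function names are made up, and the three-genotypes-per-SNP figure assumes biallelic SNPs with no missing-data category.

```python
from math import comb

def search_size(n_snps, min_order=1, max_order=4):
    """Number of attribute combinations at each order in an exhaustive search."""
    return {k: comb(n_snps, k) for k in range(min_order, max_order + 1)}

def genotype_cells(order, genotypes_per_snp=3):
    """Number of multilocus genotype cells for one combination.

    Assumes 3 genotypes per SNP (an assumption for this example).
    """
    return genotypes_per_snp ** order

for n_snps in (10, 50, 1000):
    sizes = search_size(n_snps)
    print(n_snps, "SNPs:", sizes, "total", sum(sizes.values()))

for order in range(1, 6):
    cells = genotype_cells(order)
    print(order, "way:", cells, "cells,",
          round(2000 / cells, 1), "subjects per cell at n=2000")
```

With 10 SNPs the default 1- to 4-attribute range covers 385 combinations, while with 50 SNPs the 4-way models alone number 230,300. A 5-way model under the three-genotype assumption has 243 cells, which is why higher orders only make sense with large sample sizes.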
Cross-Validation Count: This tells MDR how many pieces to divide your dataset into for cross-validation (CV). The default is 10-fold CV. This is what I almost always use. A 5-fold CV is ok too and will run faster.
Paired Analysis: This is used for datasets in which the cases and controls are paired or correlated in some way (i.e. matched). If this is checked MDR will keep the pairs together during cross-validation. Use this option if you have a 1:1 matched case-control study or relatives such as discordant sib-pairs. We don't have an option for n:1 matching where n is greater than 1.
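Keeping pairs together simply means that the unit being assigned to a cross-validation fold is the matched pair rather than the individual. Here is a minimal Python sketch of that idea; it is not how the MDR software is implemented, and assign_paired_folds and the subject IDs are hypothetical.

```python
import random

def assign_paired_folds(pairs, n_folds=10, seed=0):
    """Assign each (case_id, control_id) pair to a single fold.

    Returns a dict mapping every subject ID to its fold index.
    Hypothetical helper for illustration; not part of the MDR software.
    """
    rng = random.Random(seed)
    order = list(range(len(pairs)))
    rng.shuffle(order)                     # shuffle pairs, not individuals
    fold_of = {}
    for position, pair_index in enumerate(order):
        case_id, control_id = pairs[pair_index]
        fold = position % n_folds
        fold_of[case_id] = fold            # both members of the pair
        fold_of[control_id] = fold         # land in the same fold
    return fold_of

# Made-up IDs for 50 matched pairs
pairs = [(f"case{i}", f"ctrl{i}") for i in range(50)]
fold_of = assign_paired_folds(pairs, n_folds=10, seed=1)

print(all(fold_of[c] == fold_of[k] for c, k in pairs))
```

Because a pair is never split between the training and testing portions, the correlation within a pair cannot leak information from training into testing.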
Tie Cells: This tells MDR how to treat genotype combinations for which there are an equal number of cases and controls. Set this to whatever you feel comfortable with. I usually use the default unless a collaborator wants something different.
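To show where ties come up, here is a small Python sketch of the labeling step for a single genotype combination. It is a simplification under stated assumptions: the function name label_cell, the threshold parameter, and the tie values "high", "low", and "unknown" are my own labels for this example and are not necessarily the option names used in the software.

```python
def label_cell(n_cases, n_controls, threshold=1.0, tie="high"):
    """Label one genotype-combination cell as high or low risk.

    threshold is the case:control ratio separating high from low risk
    (1.0 for a balanced dataset). tie decides what to return when the
    cell sits exactly on the threshold; an empty cell (0 cases,
    0 controls) is also treated as a tie in this sketch.
    """
    if n_cases > threshold * n_controls:
        return "high"
    if n_cases < threshold * n_controls:
        return "low"
    return tie

print(label_cell(12, 5))                  # more cases than controls
print(label_cell(3, 9))                   # more controls than cases
print(label_cell(7, 7, tie="unknown"))    # equal counts: the tie rule decides
```

Only the last call is affected by the tie setting, so the choice matters most when many cells are sparsely populated.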
Compute Fitness Landscape: Checking this box will tell MDR to keep track of every combination that it evaluates. These can then be viewed in the Landscape Tab in the analysis window. This is very useful for identifying the second or third best model. Sometimes the second best model is a better predictor. Keep in mind that the landscape option uses a lot of memory. If you have lots of SNPs in your dataset it could cause MDR to stop due to lack of memory. This is why it is off by default. It might be better to use the Top Models option described next.
Track Top Models: MDR will by default keep track of and report the top n models and then show these in the Top Models Tab of the Analysis.
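The memory advantage of tracking only the top n models over a full landscape can be sketched in a few lines of Python. This is an illustration, not the MDR implementation; top_models is a hypothetical name and the accuracies below are made-up numbers.

```python
import heapq

def top_models(scored_models, n=5):
    """Return the n best (attributes, accuracy) entries, best first.

    heapq.nlargest consumes the iterable while holding only n entries,
    so memory stays bounded however many models are evaluated.
    """
    return heapq.nlargest(n, scored_models, key=lambda item: item[1])

# Made-up models and accuracies, for illustration only
models = [
    (("SNP1",), 0.55),
    (("SNP1", "SNP4"), 0.62),
    (("SNP2", "SNP3"), 0.60),
    (("SNP1", "SNP2", "SNP4"), 0.64),
]

for attributes, accuracy in top_models(models, n=2):
    print(attributes, accuracy)
```

A landscape, by contrast, has to store one entry for every combination evaluated.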
Search Type: MDR carries out an exhaustive search over all possible combinations by default. If this takes too long, you might try running a random search. You can specify the total number of evaluations to perform or you can tell it how long you want to wait (e.g. 2 hours). You can also select a forced analysis if you want to evaluate one specific combination of attributes. I use the forced analysis option a lot. I use it to obtain an unbiased estimate of the testing accuracy when the cross-validation consistency (CVC) is less than 10 or to get an estimate of the testing accuracy for the second or third best model identified using the Landscape. We have also included a probabilistic search algorithm called an estimation of distribution algorithm (EDA) that assigns probabilities to each SNP for selection. These probabilities can be modified using expert knowledge to bias the search.
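The difference between the exhaustive, random, and forced search types comes down to which attribute combinations get evaluated. The Python sketch below generates the candidates for each under simple assumptions; the function names and SNP labels are hypothetical, the random search here may revisit a combination, and the EDA is not covered.

```python
import itertools
import random

def exhaustive_search(attributes, order):
    """Yield every combination of the given order."""
    return itertools.combinations(attributes, order)

def random_search(attributes, order, n_evaluations, seed=0):
    """Yield n_evaluations randomly chosen combinations (repeats possible)."""
    rng = random.Random(seed)
    for _ in range(n_evaluations):
        # sample without replacement within one combination
        yield tuple(sorted(rng.sample(attributes, order)))

def forced_search(combination):
    """Yield only the one combination the user asked for."""
    yield tuple(combination)

snps = [f"SNP{i}" for i in range(1, 11)]   # made-up labels for 10 SNPs

exhaustive = list(exhaustive_search(snps, 2))
sampled = list(random_search(snps, 2, n_evaluations=5, seed=1))
forced = list(forced_search(["SNP2", "SNP7"]))

print(len(exhaustive), "combinations in the exhaustive 2-way search")
print(len(sampled), "combinations in the random search")
print(forced)
```

In each case every candidate would then be scored in the same way; only the list of candidates changes.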
Once you have finished setting options in the Configuration Tab you are ready to begin your MDR analysis by pushing the Run Analysis button in the Analysis Tab. The Progress Completed bar will give you an idea of how long you need to wait. A dataset with 10-50 SNPs shouldn't take more than a few seconds or a few minutes to analyze. A dataset with 10 SNPs and 400 subjects takes less than four seconds to run on a dual-processor Dell desktop with the default settings. A dataset with 15 SNPs and 600 subjects takes about 10 seconds to run with the default settings. With 50 or more SNPs you may need to wait an hour or more depending on the number of CPUs your computer has. Make sure threading is turned on!
Note that we added a new tab in 2010 for covariate adjustment. This allows you to select a covariate such as age or weight and adjust for that effect using a simple sampling algorithm. A paper describing this method was published in Human Heredity. We also published a 2011 paper in Annals of Human Genetics that describes a robust MDR method that uses a statistical test to determine whether each genotype combination is high-risk or low-risk. This option is now in the Configuration Tab.
My next posts will discuss how you pick a best model, how you evaluate statistical significance, and how you interpret the results. More advanced MDR analysis methods will also be covered.
This post was last updated on January 20, 2013.