Epistasis Blog

From the Artificial Intelligence Innovation Lab at Cedars-Sinai Medical Center (www.epistasis.org)

Sunday, November 26, 2006

MDR 101 - Part 2 - Filtering

MDR has traditionally carried out an exhaustive search of all possible 2-way, 3-way, up to n-way combinations of attributes (e.g. SNPs). This works well for candidate gene studies with 100 or fewer SNPs because the combinations can be completely enumerated in a reasonable amount of time on a desktop computer. As a side note, it is important to remember that the MDR software is designed to take advantage of PCs with multiple processors. Thus, if you have a PC with two processors that have threading turned on, MDR will run approximately 4x faster than on a single CPU with no threading. Make sure the next PC you buy has one or more multi-core chips in it!

When you have more than 100 SNPs an exhaustive search may not be practical unless you are willing to wait days or even weeks for a run to finish. When the number of SNPs exceeds 10,000 an exhaustive search of all 3-way and higher combinations may be infeasible, even with a parallel computer. As GWAS and whole-genome sequence data become ubiquitous there will be an increasing interest in using MDR to carry out a genome-wide analysis of epistasis. As a side note, we have released a C library (libMDR) to help you write fast MDR software for high-performance computing. We have also released a version of MDR that will run on GPUs (MDR-GPU). This GPU method was described in a 2009 paper in BMC Research Notes and a 2010 Bioinformatics paper. We have also reviewed the general use of GPU computing for bioinformatics and computational biology here.
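To make the combinatorics concrete, here is a quick back-of-the-envelope count of how many combinations an exhaustive search must evaluate at different study sizes (the specific SNP counts are just illustrative):

```python
from math import comb

# Number of k-way attribute combinations an exhaustive MDR search
# must evaluate for n SNPs: C(n, k) = n! / (k! * (n-k)!)
for n in (100, 10_000, 1_000_000):
    for k in (2, 3):
        print(f"n={n:>9,}, k={k}: {comb(n, k):,} combinations")
```

For 100 SNPs there are only 4,950 pairs, but for 10,000 SNPs there are already over 166 billion 3-way combinations, which is why the exhaustive strategy breaks down at genome-wide scale.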

We are actively developing two different strategies for genome-wide analysis using MDR. The first general strategy is to select a subset of 100 or 1000 SNPs, for example, from the list of 1,000,000 or more variants. This is the filter approach. The second approach is to use a stochastic search algorithm such as simulated annealing or genetic programming to search for an optimal combination. This is the wrapper approach. In this essay we will cover the filter approach since it is more of a pre-processing step. Previous blog posts (e.g. March 14, June 12, Sept. 12) have talked about our wrapper approaches, and we will discuss them in more detail in a later MDR 101 post.

There are many different filters that can be applied to SNP data to reduce the number of attributes that need to be systematically evaluated using MDR. The first approach is to use statistical and computational algorithms to select SNPs that are most likely to interact. We have been evaluating and developing modifications to the Relief family of algorithms for this purpose. Relief was originally developed by Kira and Rendell (1992) and later extended in the ReliefF algorithm by Kononenko (1994). A good review of these algorithms can be found in Robnik-Sikonja and Kononenko (2003). The nice thing about ReliefF is that it can assign a single quality value to each SNP in a way that captures dependencies or interactions between SNPs. We have developed several extensions of ReliefF that improve the power to filter interacting SNPs. Our first approach is called Tuned ReliefF or TuRF. A paper describing this approach was published in Lecture Notes in Computer Science. A second approach is called Evaporative ReliefF. This approach was developed in collaboration with Dr. Brett McKinney and was published in Bioinformatics. A newer and much more powerful approach is called Spatially Uniform ReliefF or SURF. SURF is described in a 2009 BioData Mining paper. A newer SURF algorithm, SURF*, was published in 2010 in Lecture Notes in Computer Science. The key is to find a fast algorithm that can assign high quality values to SNPs involved in non-additive interactions without explicitly modeling the interactions, which would lead back to the combinatorial problem that MDR faces. You don't want to use univariate approaches such as chi-square unless you explicitly want to condition on main effects; interactions in the absence of statistically significant main effects are likely to be missed. For a review of filters and wrappers for epistasis analysis see our 2010 Bioinformatics paper.
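To give a flavor of the Relief idea, here is a minimal sketch of a Relief-style attribute weighter for a binary phenotype. This is a toy illustration of the core nearest-neighbor weighting scheme, not the published ReliefF, TuRF, or SURF implementations (those differ in neighbor selection, weighting, and iteration):

```python
import numpy as np

def relief_scores(X, y, k=10):
    """Toy Relief-style scores for categorical SNP genotypes (0/1/2)
    and a binary phenotype y. An attribute gets a higher weight when
    it differs more between an instance and its near 'misses'
    (opposite class) than between the instance and its near 'hits'
    (same class). This rewards attributes whose effects depend on
    genetic context without enumerating combinations."""
    n, p = X.shape
    w = np.zeros(p)
    for i in range(n):
        # Hamming-style distance from instance i to all others
        d = (X != X[i]).sum(axis=1)
        d[i] = n + 1  # exclude self from the neighbor search
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        hits = same[np.argsort(d[same])][:k]
        misses = diff[np.argsort(d[diff])][:k]
        # reward disagreement with misses, penalize it with hits
        w += (X[misses] != X[i]).mean(axis=0) - (X[hits] != X[i]).mean(axis=0)
    return w / n
```

Running this on simulated data and ranking SNPs by score is the essence of the filter step: keep the top-scoring subset and hand only that subset to MDR.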

The MDR software currently allows you to select ReliefF, TuRF, SURF, SURF*, chi-square, and odds ratio as filters. For ReliefF, it is probably best to set the number of nearest neighbors to 25% of your total sample size. This gives a lot more power than the default of 10 that most people use. SURF* will have the most power unless your case-control ratio is imbalanced. Feel free to play with the settings if you want. Note that the odds ratio filter reports the 'best' odds ratio using several different reference groups. It should thus be seen as a 'maximized' odds ratio. For the chi-square test, the results can be viewed as either the chi-square values or the p-values. For all the filters, you can select the size of the subset you want by specifying the number of SNPs you want, the percentage you want, or the subset that meets some threshold criterion (e.g. a p-value cutoff). As an exercise, compare the list of best SNPs returned by ReliefF and chi-square. Those SNPs that have high ReliefF scores but non-significant chi-square scores are candidates for nonlinear interactions. In the papers on atrial fibrillation by Tsai et al. (2004) and Moore et al. (2006), the ACE I/D and T174M polymorphisms detected by MDR have the highest ReliefF scores of any polymorphism in the dataset but have non-significant chi-square values and thus would have been missed by conditioning on main effects. The M235T polymorphism has a main effect and is detected by both ReliefF and chi-square.
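For the univariate side of that exercise, here is what the chi-square filter computes for a single SNP: a Pearson chi-square statistic on the 3x2 genotype-by-status table. The counts below are made up for illustration; a SNP like this, with a small chi-square value, is exactly the kind that ReliefF can still rank highly if it participates in an interaction:

```python
# Pearson chi-square statistic for one SNP's 3x2 table of
# genotype (rows: 0/1/2) by case-control status (columns).
def chi_square(table):
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # under independence
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical counts: [cases, controls] for genotypes 0, 1, 2
table = [[120, 110], [95, 100], [35, 40]]
print(round(chi_square(table), 3))  # -> 0.896, far from significant at 2 df
```

With 2 degrees of freedom this statistic is nowhere near significance, so a purely univariate filter would discard this SNP even if its effect were strong in combination with another locus.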

In addition to the algorithmic approach, we think it will be very useful to reduce the number of SNPs examined by using domain-specific knowledge about biochemical pathways, Gene Ontology annotations, chromosomal location, published relationships, etc. Why not use the years of accumulated knowledge about the disease you are studying and the genes likely to be involved? We think an optimal filter will combine algorithms such as ReliefF with knowledge about biological systems. As a side note, we have previously used pathway information to help organize our MDR analyses. See Williams et al. (2004) for an example. We review the use of biological knowledge in GWAS in our 2010 Bioinformatics paper. Marylyn Ritchie's group at Penn State has created a bioinformatics tool called BioFilter that can select SNPs prior to MDR analysis. We have also recently published a paper in BioData Mining carrying out gene set enrichment analysis of MDR results in GWAS data for ALS. This study found replicable pathway effects that were missed by previous GWAS.

In summary, statistical and biological filters can help reduce the number of SNPs that need to be exhaustively analyzed by MDR. The advantage of the filter approach is that it is computationally tractable. The disadvantage is that your list of filtered SNPs is only as good as the filter that is being used. Surely some good SNPs will be excluded from the analysis.

This section was last updated on January 20, 2013.

Sunday, November 12, 2006

MDR 101 - Part 1 - Missing Data

This is the first of a series of posts on how to use the open-source multifactor dimensionality reduction (MDR) method and software package to detect, characterize and interpret epistasis (non-additive gene-gene interaction) in human population-based studies of health and disease susceptibility. I am working on a book about MDR and these posts will serve as a warm up. Your feedback is greatly appreciated.

If you are new to MDR I suggest you start with the basic description of the method on Wikipedia. There are three papers that you should read to get the basics. You might start with the original paper describing MDR by Ritchie et al. (2001). Then, I suggest you read my review in Expert Review of Molecular Diagnostics (2004). For the latest ideas about MDR you should read our recent paper in the Journal of Theoretical Biology (2006). When describing MDR in your presentations and publications please use the definitions of the method provided in this latter paper. For an up-to-date review of MDR and epistasis in general see our 2009 paper in the American Journal of Human Genetics and our 2010 paper in Bioinformatics. Finally, I highly recommend our MDR review chapter that appeared in a 2010 volume of Advances in Genetics. This is the most recent overview of the MDR method.

Perhaps the most common question I get about using MDR is how to deal with missing data. The MDR software requires a complete dataset with no missing values for analysis. There are three common approaches for handling missing genotypes.

First, you can simply remove subjects (rows) or SNPs (columns) until you have a complete dataset with no missing values. This is probably the least desirable option since you usually end up throwing away half your data. In general, I avoid throwing any data away. The next two options are usually preferable.

Second, you can encode the missing genotypes with a new level that is not used to code your genotypes. For example, if your genotypes are coded 0, 1, 2 you could code your missing genotypes with a '4' or a '9'. With this option, MDR treats your missing genotypes as a fourth level, thus incorporating the missingness into the model. This is probably an acceptable option as long as you have few missing genotypes and they are missing at random across cases and controls (not a bad idea to test this assumption!).
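The recoding itself is a one-liner. A minimal sketch, assuming missing genotypes are stored as None and '9' is chosen as the extra level:

```python
# Recode missing genotypes (stored here as None) as the level 9,
# so the filter/MDR step can treat 'missing' as a fourth category.
genotypes = [0, 1, None, 2, None, 1]  # one SNP across six subjects
recoded = [9 if g is None else g for g in genotypes]
print(recoded)  # -> [0, 1, 9, 2, 9, 1]
```

Whatever value you pick, just make sure it cannot collide with a real genotype code anywhere in your file.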

Third, you can impute the missing genotypes. This is usually what I recommend to users and collaborators. Here, you use a statistical model to predict the missing genotypes. Our MDR Data Tool will perform a simple frequency-based imputation. That is, it will fill in missing genotypes with the most common genotype for that SNP. This is performed on a SNP-by-SNP basis. Again, this is a reasonable option if you have few missing genotypes and they are missing at random across cases and controls. If you want to get fancy, you can use a multivariate imputation approach that takes into consideration patterns in the data such as linkage disequilibrium (LD). The R software has nice imputation procedures. We used one of these methods in our recent paper on bladder cancer (see Andrew et al. Carcinogenesis 2006). The nice thing about this third option is that you are left with a complete dataset (nothing is thrown away) and you don't have any extra levels in your MDR models to worry about.
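Frequency-based imputation is simple enough to sketch in a few lines. This mimics the idea behind the MDR Data Tool's per-SNP mode imputation (the None coding for missing values is my assumption, not the tool's file format):

```python
from collections import Counter

def impute_mode(column):
    """Fill missing genotypes (None) in one SNP column with the most
    common observed genotype for that SNP -- simple frequency-based
    imputation, performed SNP by SNP."""
    observed = [g for g in column if g is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if g is None else g for g in column]

snp = [0, 1, 1, None, 2, 1, None]
print(impute_mode(snp))  # genotype 1 is most common, so it fills the gaps
```

Note this ignores LD and case-control status entirely; that is exactly the limitation the multivariate approaches mentioned above are meant to address.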

If you have additional questions about imputation I suggest you consult an expert in this area or read a book such as Little and Rubin's 'Statistical Analysis with Missing Data, 2nd Ed'. There is also a 2009 paper in Genetic Epidemiology on MDR and missing data. A preferable approach might be to use genotype imputation methods developed for genome-wide association studies (GWAS).

Don't forget that all of our MDR software including the MDR Data Tool is open-source. If you have a favorite imputation procedure, port it to Java and send it to us. We would be happy to include it in a future release.

This section was last updated on January 20, 2013.

Thursday, November 09, 2006

EvoBIO 2007 ***Deadline Nov. 15th***

The paper submission deadline for EvoBIO has been extended to Nov. 15th. There is still time to submit a paper!

Friday, November 03, 2006

Genome-Wide Association Analysis of Lung Diseases

I am in Ft. Lauderdale today at the American Physiological Society (APS) conference on Physiological Genomics and Proteomics of Lung Disease. I was invited to speak about our work on "Exploiting Expert Knowledge for Genome-Wide Genetic Analysis". The technical details covered in the talk have been covered previously in this blog. These novel methods for detecting epistasis on a genome-wide scale will eventually be applied to our study of adult respiratory distress syndrome (ARDS) in patients with traumatic injury. The ascertainment and genetic analysis of ARDS following traumatic injury is funded by a five-year grant from the NIH (R01-HD047447, PI-Moore). We just finished the second year of the study and have successfully recruited over 1,000 subjects for the study. We have exceeded our three-year recruitment goal of n=750 after only two years.

Wednesday, November 01, 2006

Computational Evolution

Evolutionary computing is a computational intelligence methodology that is inspired by how nature solves problems through evolution by natural selection. We have employed a variety of evolutionary computing methods such as genetic programming in our recent work for modeling epistasis (see last post on SyMod) and for detecting epistasis on a genome-wide scale with MDR (see previous posts in May and June).

A new paper by Banzhaf et al. in Nature Reviews Genetics proposes a new field called computational evolution that builds on past successes with artificial evolution.

Banzhaf W, Beslon G, Christensen S, Foster JA, Kepes F, Lefort V, Miller JF, Radman M, Ramsden JJ. Guidelines: From artificial evolution to computational evolution: a research agenda. Nat Rev Genet. 2006 Sep;7(9):729-35. [PubMed]

Computational scientists have developed algorithms inspired by natural evolution for at least 50 years. These algorithms solve optimization and design problems by building solutions that are 'more fit' relative to desired properties. However, the basic assumptions of this approach are outdated. We propose a research programme to develop a new field: computational evolution. This approach will produce algorithms that are based on current understanding of molecular and evolutionary biology and could solve previously unimaginable or intractable computational and biological problems.

For background reading on evolutionary computing see:

Foster JA. Evolutionary computation. Nat Rev Genet. 2001 Jun;2(6):428-36. [PubMed]

Evolution does not require DNA, or even living organisms. In computer science, the field known as 'evolutionary computation' uses evolution as an algorithmic tool, implementing random variation, reproduction and selection by altering and moving data within a computer. This harnesses the power of evolution as an alternative to the more traditional ways to design software or hardware. Research into evolutionary computation should be of interest to geneticists, as evolved programs often reveal properties - such as robustness and non-expressed DNA - that are analogous to many biological phenomena.

If you are interested in using these methods for genetic analysis you might consider attending the Genetic and Evolutionary Computing Conference (GECCO-2007) next year in London and the European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics (EvoBIO) next year in Valencia, Spain. See my posts below for information about submitting a paper to either conference.