Epistasis Blog

From the Artificial Intelligence Innovation Lab at Cedars-Sinai Medical Center (www.epistasis.org)

Thursday, December 21, 2006


We are working on a new project called BioSymphony that will facilitate biomedical research collaborations by predicting teams of investigators that should be working together. Visit the BioSymphony website for more details. Check back here for updates.

Wednesday, December 20, 2006

Software Update

We have been busy preparing two new open-source software packages for alpha testing. Both are being internally tested now and should be ready for external alpha testing in January. I will post to this blog when they are ready. You can also check www.epistasis.org, www.symbolicmodeler.org and www.exploratoryvisualanalysis.org for updates.

Symbolic Modeler (SyMod)

SyMod is an open-source software package for symbolic modeling that can be used for both discrete and continuous endpoints. The advantage of SyMod is that no assumptions are made about the functional form of the model. The user supplies a list of attributes and a list of mathematical functions (e.g. +, - ,*, /, <, >, =, AND, OR, NOT, LOG, Min, Max, etc...) that are used by the software as building blocks to contruct symbolic discriminant functions (for discrete endpoints) or symbolic regression functions (for continuous endpoints). SyMod uses a stochastic search algorithm called genetic programming to identify the optimal symbolic model(s). An advantage of SyMod is the ability to load and use expert knowledge to help guide the search. This feature will be included in the first alpha release and is being tested now. Keep an eye out for our upcoming paper on "Symbolic modeling of epistasis" by Moore et al. that will appear in an early 2007 issue of Human Heredity. This paper outlines an important five-step process for symbolic modeling.

Exploratory Visual Analysis (EVA)

EVA is a database and GUI for storing, managing, and visualizing statistical analysis results. The database stores p-values and other statistics in a database along with annotated information about Gene Ontology, biochemical pathway, chromosomal location, etc. from NCBI and Ensembl. The GUI allows the user to visually explore the statistical results in real time in the context of the biological knowledge about each gene. The goal of EVA is to facilitate the identification of biologically meaningful patterns of statistical results that aren't possible simply by browsing an Excel spreadsheet with 40,000 or more p-values. We have had a prototype for EVA for several years and will be releasing soon an open-source Java version. If you are interested you might read several recent papers on EVA including our paper at the Pacific Symposium on Biocomputing in 2005 and our 2006 paper in Oncology Reports.

Saturday, December 16, 2006

2007 GECCO

The 2007 Genetic and Evolutionary Computing Conference (GECCO) will be held July 7-11 University College London in London, England. This is one of my favorite conferences because it covers computer science methods that are inspired by biology and covers their application to biological problems. Thus, GECCO is an example of truly interdisciplinary conference with a bidirectional exchange of ideas between computer scientists and biologists. I particularly enjoy the conference because the contributers and participants are not afraid to challenge the traditional problem-solving paradigm. Science can not move forward without exploring the fringes of different disciplines. Genetic epidemiology has much to gain from the ideas presented at GECCO.

I am chairing (with Clare Bates Congdon) the Biological Applications track at GECCO in 2007. The call for papers can be found here. Papers are welcome on any topic that combines genetic and evolutionary computing methods with biological or biomedical problem-solving. The paper submission deadline is January 17, 2007.

I will also be giving a tutorial on bioinformatics at GECCO in addition to organizing and chairing the SoftGEC workshop mentioned in my previous post. If you can't make the paper submission deadline for the regular tracks I recommend you explore submitting something for one of the many workshops. The deadline for the workshops papers is March 23, 2007. The workshops usually provide a more informal forum for presenting your work and getting critical feedback. Note that my SoftGEC workshop will only have invited speakers this year.

I highly recommend the conference if you are interested in computational methods and genetics. If you are new to genetic and evolutionary computing I recommend you attend the free tutorials and workshops the first two days of the conference (July 7 and 8). The tutorials are usually very good. See you in London!

Tuesday, December 12, 2006

2007 SoftGEC Workshop

I am organizing and chairing the 2007 Workshop on Open-Source Software in Genetic and Evolutionary Computing (SoftGEC 2007) at the GECCO 2007 conference in London in July. Click here for more information.

Monday, December 04, 2006

MDR 101 - Part 5 - Interpretation

MDR is a data mining approach that uses constructive induction or attribute construction to facilitate the detection of interactions in the absence of main effects. A common criticism of data mining and machine learning methods is that they take inputs and produce an output with little understanding of the 'black box' in the middle that processes the information provided. The black box phenomenon can make models discovered by data mining methods difficult to interpret. We have placed a priority recently on providing tools that provide a statistical interpretation of MDR models that we hope will improve our ability to develop a biological interpretation.

The first thing I do is to study the graphical model of the best MDR model. If the model only involves two attributes the interpretation is fairly simple. The first question to ask is whether the distribution of high-risk (dark-shaded) and low-risk (light-shaded) genotype combinations looks nonlinear. That is, do they vary within and between rows and columns? A strong main effect will show up as a column and/or row that is all high-risk or all low-risk. In other words, the effect of the genotype doesn't vary across the other genotypes. This of course becomes more difficult in the higher dimensions which is why we now rely on information theory approaches.

An important concept in data mining is information gain which is based on measures of entropy. That is, how much information is gained about case-control status from knowledge about genotypes at one or more SNPs? Stated another way, how much entropy in case-control status is removed by considering genotype? For the purposes of an MDR analysis we would like to know how much information about case-control status is gained by combining two or more SNPs using the MDR attribute construction function. More specifically, do we gain information above and beyond that provided by each SNP individually? This is what Jakulin and others call interaction information. We have taken these ideas and methods and implemented the interaction dendrogram in the MDR software for interpreting MDR models. For more details please see our paper in the Journal of Theoretical Biology from 2006. Additional information about interaction information can be found in our 2011 Genetic Epidemiology paper and several in press papers that extend these methods to 3-way interactions (will post these soon).

The idea is simple. If combining two or more SNPs using MDR gives a positive information gain then there is evidence for a synergistic interaction. If the combination of SNPs gives a negative information gain then information is lost which happens when there is redundancy or correlation (e.g. linkage disequilibrium). If there is no gain or loss then you can conclude the SNPs have independent effects. The dendrogram (default) in the Entropy tab is constructed in the following way. First we compute the information gain for each SNP in the summary table and then each pairwise MDR combination. These information gain values are then inverted and a distance matrix constructed such that pairs of SNPs with stronger interactions have a smaller distance. This distance matrix is then used to carry out a hierarchical cluster analysis resulting in an interaction dendrogram. The shorter the line connecting two attributes the stronger the interaction. The color of the line indicates the type of interaction. Red and orange suggest there is a synergistic relationship (i.e. epistasis). Yellow suggests independence. Green and blue suggest redundancy or correlation. Thus, you can very quickly scan the dendrogram to identify the epistasis effects in your MDR analysis. Also try the interaction graphs. These are better than the dendrograms in that they also show the marginal effects. A limitation of this analysis is that it only considers pairs of attributes. We will implement higher-order entropy analyses in a future version of the MDR software. I highly recommend that you include a dendrogram along with your best model in your presentations and publications.

Note that a useful way to present the information analysis is in an interaction graph. These can be selected from the lower left of the Entropy tab window. Graphs have the advantage of visualizing the main effects and the interactions simultaneously. The graphs also show all the connections instead of the best that is shown in the dendrogram. We are releasing soon an MDR version 3.0 that will include additional network analysis tools.

Biological interpretation of these models will most likely be more difficult than the computation that went into the analysis. See our paper in Bioessays from 2005 for a discussion about biological inference. For biological interpretation I recommend using bioinformatics tools such as STRING for protein-protein interactions and data integration tools such as IMP that infer biological relationships across hundreds or thousands of experimental data sets from microarray studies, for example.

This post was last updated on January 20, 2013.

Sunday, December 03, 2006

MDR 101 - Part 4 - Results

Once you have run MDR you are ready to take a look at the results. I usually start with the Summary Table in the Analysis tab. There are four columns in the table. The first column gives the best model for each order examined. There should be four rows for the default settings corresponding to the best 1-, 2-, 3-, and 4-locus models. The Training Accuracy is reported in the second column. This is the average accuracy across the n cross-validation (CV) intervals. This is what is used to choose each of the best models on column one. You can see the second best model, for example, in the Landscape Tab if you have that option turned on. The best model for each CV interval is then applied to the testing portion of the data to derive the Testing Accuracy (TA) in column three. The accuracy estimates are averaged across all CV intervals. The TA is what I use to pick a best model. Ideally, you want to see the TA go up as the order of the model increases and then start going down at some point. The model with the highest TA is what I consider the 'best' overall model. According to the CV analysis, the model with the highest TA is the model that is most likely to generalize to independent data. The reason the TA starts going down again in higher order models is because at some point false positives are added to the model thus decreasing its predictive ability. Overfitting can be seen in the Training Accuracy which continues to rise as additional attributes are added to the model.

As a rule of thumb I like to see a TA of at least 0.55 for a model to be considered interesting. Anything over 0.60 is almost always statistically significant. I rarely see a TA over 0.70 in real data. If you get a TA closer to 1.0 you might check the data. This is highly unusual unless you are studying a mutation with a huge effect size. Remember, a TA of 0.50 is what you would expect if you were to predict case-control status by flipping a coin.

Note that accuracy is the proportion of instances or subjects that are correctly classified as being a case or a control, for example. A more formal definition is (TP + TN)/(TP + TN + FP + FN) where TP are true positives, TN are true negatives, FP are false positives and FN are false negatives. Accuracy works great as long as the number of cases and controls are balanced or equal. We now use balanced accuracy which we have shown is better for datasets that have one class (e.g. controls) that is bigger than the other. Here balanced accuracy is defined as (sensitivity + specificity)/2 where sensitivity = TP/(TP+FN) and specificity = TN/(FP+TN). This gives an accuracy estimate that is not biased by the larger class. We combine this with a threshold for defining high-risk and low-risk genotype combinations that is equal to the ratio of cases to controls in the dataset. Note that accuracy and balanced accuracy are the same when the dataset is balanced. We have a 2007 paper in Genetic Epidemiology that shows this approach has good power in imbalanced datasets.

The final column of the Summary Table lists the CV Consistency or CVC. This is a statistic that we developed to record the number of times MDR found the same model as it divided up the data into different segments. A true signal should be seen regardless of how you divide the data. Good model usually have a CVC of around 9 or 10 assuming 10-fold CV. The CVC is useful in two ways. First, it can be used to help decide on which is the best model if two or more models have similar TAs. Second, a CVC of less than 10 indicates that the TA is biased because at least one of the CV intervals will have a TA estimated from a different model. In this case I run a forced analysis (see Part 3) with the best attributes to get an unbiased estimate of the TA. I do this with any model that has a CVC less than 10 and report the unbiased estimates.

We highly recommend permutation testing to assess the statistical significance of MDR models. First, you can try a general permutation test that randomizes the case-control labels to derive an empirical distribution of the TA under the null hypothesis of no association. This will tell you whether the model is significant but will not directly provide evidence for interaction. We have a 2010 paper in the Pacific Symposium on Biocomputing that discusses how to use permutation testing to specifically test the null hypothesis of linearity. A significant p-value here reflects interaction that is independent of any marginal effects. We have a 2009 paper in Genetic Epidemiology that discusses how to use an extreme value distribution (EVD) to reduce the total number of permutation that you need to run. The MDR permutation testing module (MDRpt) is available for download.

There are up to six tabs that will be visible at the bottom of the Analysis tab that can be used to further assess and interpret your results. Some of these will be mentioned in this lesson. Interpretation will be covered later.

The Graphical Model tab shows the distribution of cases (left bars) and controls (right bars) for each genotype or level combination. The high-risk cells are shaded dark grey while the low-risk cells are shaded light grey. The distribution shown corresponds to the model selected in the Summary Table. This figure can be saved to a file for use in a presentation or manuscript. Note that MDR combines the high-risk and low-risk combinations into a new single attribute using constructive induction. It is the new single attribute that is statistically investigated. You can view the distribution of cases and controls for this single MDR attribute by going to the Attribute Construction tab at the top of the software. Once there, select the SNPs in your best model by holding down the control button and left clicking your mouse. Once the right SNPs are selected push the Construct button. This will add the new single MDR attribute to you dataset. Now do a forced analysis with that single constructed attribute and you will be able to see the statistics for the analysis of that variable along with the graphical model. In our newer papers we are putting the graphical model for the constructed attribute next to the graphical model given in the default output to show the MDR attribute construction process. This is helpful for readers to see what MDR is really doing. We will talk later about how you can include this as part of your analysis process.

The Best Model tab gives some additional statistics for the best model selected in the Summary Table. Once you have selected a best model I think it is acceptable to report the stats on the whole dataset. At this point you are not concerned about overfitting as long as you are picking your best model using the TA as discussed above. Note that the odds ratio is estimated by using the low-risk group as the reference. Also note that the chi-square p-value is probably not accurate since I haven't taken the time to calculate what the actual degrees of freedom should be for an MDR attribute. It is probably greater than one but less than that required to estimate the full effect of the SNPs. Thus, I usually don't report the chi-square values.

The IF-THEN rules is a different way of looking at the MDR model. These are the rules that are used to pool high-risk and low-risk genotypes. I have not seen anyone report these.

The CV Results tab gives the statistics and the best model identified in each CV interval. Sometimes it is interesting to see what MDR is picking when CVC is less than 10. We will discuss the Entropy tab in more detail in a later post. This is used to statistically interpret your MDR models. The dendrogram, for example, uses information theory and cluster analysis to identify synergistic, independent, and redundant genetic effects. This has become a critical part of an MDR analysis since it provides for the first time real evidence for synergistic interactions or epistasis in an MDR model. Please try to include this in your papers. Otherwise, you don't really know if what you are reporting is really epistasis. This information along with the explicit test of epistasis method described above will convince your readers that you are actually detecting non-additive effects.

If you checked the Compute Fitness Landscape option in the Configuration tab you should see a Landscape tab. This is where you can explore all the models evaluated by MDR. This is very helpful for seeing what the second, third, and even fourth best models are. Try the zoom feature by left clicking and holding the button down while dragging the mouse pointer across the plot to select an area to zoom to. If you zoom far enough in you should find black dots. Mouse over the black dots to see what the specific combination is. Once you find an interesting second best model you can then go and run a forced analysis to see how the TA compares to your best model. Sometimes it is a better predictor. We have done this in several recent papers. I think it is fine to do this without too much concern with overfitting as long as you limit your forced analysis to just a few second or third best models. Also note that there is a Track Top Models option in the Configuration tab that shows how many of the best models to show in the Top Models tab of the Analysis. This is handy so MDR doesn't need to compute the whole fitness landscape.

This post was last updated on January 20, 2013.

Saturday, December 02, 2006

MDR 101 - Part 3 - Analysis

After addressing quality control issues and filtering your SNPs to a reasonable subset it is time for an MDR analysis. First, make sure you are using the latest version of the open-source MDR software. Follow the link on www.epistasis.org for the newest version.

The first step is to load your file in the Analysis tab. Check the sample file distributed with the software for the data format. If you successfully load the data you can view it with the View Datafile button. The Datafile Information area will show the name of the file, the path, and number of subjects (i.e. instances), the number of SNPs (i.e. attributes/variables), and the ratio of cases to controls. Note that the current version of MDR will only load datasets with up to about 10^6 SNPs. A future version will allow you to load bigger datasets. Running MDR on more than 100 or 1000 SNPs will probably require a fast C version optimized for high-performance computing. Download libmdr from here if you want to write a C program.

Once you are sure you have loaded the right data you are ready to run an MDR analysis. Feel free to use the default settings to run a quick MDR analysis by pushing the Run Analysis button. Go to the Configuration Tab to change the settings. Here is what each setting is used for.

Random Seed: The random seed controls how the dataset is divided into different parts if cross-validation is used. Running MDR using the same random seed should give the same results from run to run as long as the other settings are the same. Feel free to run MDR using different random seeds to see how the results differ when your data is divided differently. Across 10 runs you should get similar results if you have a detectable signal. You really only need to do one seed for publication purposes.

Attribute Count Range: This tells MDR the order of the interactions to be considered. The default settings will return the best 1-, 2-, 3-, and 4-attribute models. In general, I rarely go above a 5-way interaction unless the dataset is very large (e.g. greater n=2000 instances).

Cross-Validation Count: This tells MDR how many pieces to divide your dataset into for cross-validation (CV). The default is 10-fold CV. This is what I almost always use. A 5-fold CV is ok too and will run faster.

Paired Analysis: This is used for datasets in which the cases and controls are paired or correlated in some way (i.e. matched). If this is checked MDR will keep the pairs together during cross-validation. Use this option if you have a 1:1 matched case control study or relatives such as discordant sib-pairs. We don't have an option for n:1 matching where n is greater than 1.

Tie Cells: This tells MDR how to treat genotype combinations for which there are an equal number of cases and controls. Set this to whatever you feel comfortable with. I usually use the default unless a collaborator wants something different.

Compute Fitness Landscape: Checking this box will tell MDR to keep track of every combination that it evaluates. These can then be viewed in the Landscape Tab in the analysis window. This is very useful for identifying the second or third best model. Sometimes the second best model is a better predictor. Keep in mind that the landscape option uses a lot of memory. If you have lots of SNPs in your dataset it could cause MDR to stop due to lack of memory. This is why it is off by default. It might be better to use the Top Models option described next.

Track Top Models: MDR will by default keep track of and report the top n models and then show these in the Top Models Tab of the Analysis.

Search Type: MDR carries out an exhaustive search over all possible combinations by default. If this takes too long, you might try running a random search. You can specify the total number of evaluations to perform or you can tell it how long you want to wait (e.g. 2 hours). You can also select a forced analysis if you want to evaluate one specific combination of attributes. I use the forced analysis option a lot. I use it to obtain an unbiased estimate of the testing accuracy when the cross-validation consistency (CVC) is less than 10 or to get an estimate of the testing accuracy for the second or third best model identified using the Landscape. We have also included a probabilistic search algorithm called an estimation of distribution algorithm (EDA) that assigns probabilities to each SNP for selection. These probabilities can be modified using expert knowledge to bias the search.

Once you have finished setting options in the Configuration Tab you are ready to begin your MDR analysis by pushing the Run Analysis button in the Analysis Tab. The Progress Completed bar will give you an idea of how long you need to wait. A dataset with 10-50 SNPs shouldn't take more than a few seconds or a few minutes to analyze. A dataset with 10 SNPs and 400 subjects takes less than four seconds to run on a dual-processor Dell desktop with the default settings. A dataset with 15 SNPs and 600 subjects takes about 10 seconds to run with the default settings. With 50 or more SNPs you may need to wait an hour or more depending on the number of CPUs your computer has. Make sure threading is turned on!

Note that we have added a new tab in 2010 for covariate adjustment. This allows you to select a covariate such as age or weight and adjust for that effect using a simple sampling algorithm. A paper describing this method was published in Human Heredity. We also published a 2011 paper in Annals of Human Genetics that describes a robust MDR method that uses a statistical test to determine whether each genotype combination is high-risk or low risk. This option is now in the configuration tab.

My next posts will discuss how you pick a best model, how you evaluate statistical significance, and how you interpret the results. More advanced MDR analysis methods will also be covered.

This post was last updated on January 20, 2013.