Epistasis Blog

From the Artificial Intelligence Innovation Lab at Cedars-Sinai Medical Center (www.epistasis.org)

Tuesday, February 22, 2011

Gene-Gene Interaction Analysis Using ReliefF and MDR

Here are two conference papers exploring the properties of ReliefF and MDR for detecting gene-gene interactions. These are both a bit difficult to read but there are some useful ideas presented. Both build on our previous work with the ReliefF family of algorithms. I am not sure whether the first one is relevant, however, given our recent work with spatially uniform ReliefF (SURF) that takes all neighbors within a certain distance. For background reading, see our 2010 Bioinformatics paper that reviews this work.

Pengyi Yang, Joshua WK Ho, Yee Hwa Yang, Bing B Zhou. Gene-gene interaction filtering with ensemble of filters. BMC Bioinformatics 2011, 12(Suppl 1):S10 [PDF]


Background. Complex diseases are commonly caused by multiple genes and their interactions with each other. Genome-wide association (GWA) studies provide us the opportunity to capture those disease associated genes and gene-gene interactions through panels of SNP markers. However, a proper filtering procedure is critical to reduce the search space prior to the computationally intensive gene-gene interaction identification step. In this study, we show that two commonly used SNP-SNP interaction filtering algorithms, ReliefF and tuned ReliefF (TuRF), are sensitive to the order of the samples in the dataset, giving rise to unstable and suboptimal results. However, we observe that the ‘unstable’ results from multiple runs of these algorithms can provide valuable information about the dataset. We therefore hypothesize that aggregating results from multiple runs of the algorithm may improve the filtering performance.

Results. We propose a simple and effective ensemble approach in which the results from multiple runs of an unstable filter are aggregated based on the general theory of ensemble learning. The ensemble versions of the ReliefF and TuRF algorithms, referred to as ReliefF-E and TuRF-E, are robust to sample order dependency and enable a more informative investigation of data characteristics. Using simulated and real datasets, we demonstrate that both the ensemble of ReliefF and the ensemble of TuRF can generate a much more stable SNP ranking than the original algorithms. Furthermore, the ensemble of TuRF achieved the highest success rate in comparison to many state-of-the-art algorithms as well as traditional χ2-test and odds ratio methods in terms of retaining gene-gene interactions.
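To make the ensemble idea concrete, here is a rough sketch in Python. It is not the authors' ReliefF-E code. It assumes a simplified ReliefF (Hamming distance on 0/1/2 genotypes, binary labels, k nearest hits and misses) and assumes the runs are combined by averaging scores; the abstract does not say how the paper aggregates them. In this sketch, order sensitivity can only come from tie-breaking among equidistant neighbors. The function names `relieff_scores` and `ensemble_relieff` are made up for illustration.

```python
import random

def relieff_scores(X, y, k=2):
    """Simplified ReliefF for discrete genotypes and binary class labels."""
    n, m = len(X), len(X[0])
    weights = [0.0] * m
    for i in range(n):
        # Rank the other samples by Hamming distance. sorted() is stable, so
        # equidistant neighbors are taken in sample order (assumed tie rule).
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sum(a != b for a, b in zip(X[i], X[j])),
        )
        hits = [j for j in others if y[j] == y[i]][:k]
        misses = [j for j in others if y[j] != y[i]][:k]
        for a in range(m):
            # Penalize SNPs that differ from nearby same-class samples...
            for j in hits:
                weights[a] -= (X[i][a] != X[j][a]) / (n * len(hits))
            # ...and reward SNPs that differ from nearby other-class samples.
            for j in misses:
                weights[a] += (X[i][a] != X[j][a]) / (n * len(misses))
    return weights

def ensemble_relieff(X, y, n_runs=10, k=2, seed=0):
    """Average ReliefF scores over runs with randomly permuted sample order."""
    rng = random.Random(seed)
    order = list(range(len(X)))
    total = [0.0] * len(X[0])
    for _ in range(n_runs):
        rng.shuffle(order)
        scores = relieff_scores([X[i] for i in order], [y[i] for i in order], k)
        total = [t + s for t, s in zip(total, scores)]
    return [t / n_runs for t in total]

# Toy data: SNP 0 tracks the class, SNP 1 is constant, SNP 2 is unrelated.
y = [0, 0, 0, 0, 1, 1, 1, 1]
X = [[label, 0, i % 2] for i, label in enumerate(y)]
print(ensemble_relieff(X, y))
```

On this toy dataset the ties do not change the scores, so the ensemble and a single run agree. On real SNP data with many equidistant neighbors, the individual runs would be expected to differ, which is where averaging should help.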

Can Yang, Xiang Wan, Zengyou He, Qiang Yang, Hong Xue, Weichuan Yu. The choice of null distributions for detecting gene-gene interactions in genome-wide association studies. BMC Bioinformatics 2011, 12(Suppl 1):S26 [PDF]


Background. In genome-wide association studies (GWAS), the number of single-nucleotide polymorphisms (SNPs) typically ranges between 500,000 and 1,000,000. Accordingly, detecting gene-gene interactions in GWAS is computationally challenging because it involves hundreds of billions of SNP pairs. Stage-wise strategies are often used to overcome the computational difficulty. In the first stage, fast screening methods (e.g. Tuning ReliefF) are applied to reduce the whole SNP set to a small subset. In the second stage, sophisticated modeling methods (e.g., multifactor-dimensionality reduction (MDR)) are applied to the subset of SNPs to identify interesting interaction models and the corresponding interaction patterns. In the third stage, the significance of the identified interaction patterns is evaluated by hypothesis testing.
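The "hundreds of billions of SNP pairs" figure is easy to check, since the number of unordered pairs among n SNPs is n(n-1)/2. A small Python check follows; the helper name `count_snp_pairs` is just for illustration.

```python
from math import comb

def count_snp_pairs(n_snps):
    """Number of unordered SNP pairs, n * (n - 1) / 2."""
    return comb(n_snps, 2)

for n_snps in (500_000, 1_000_000):
    print(f"{n_snps:,} SNPs -> {count_snp_pairs(n_snps):,} pairs")
```

That works out to roughly 125 billion pairs at 500,000 SNPs and roughly 500 billion at 1,000,000.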

Results. In this paper, we show that this stage-wise strategy could be problematic in controlling the false positive rate if the null distribution is not appropriately chosen. This is because screening and modeling may change the null distribution used in hypothesis testing. In our simulation study, we use some popular screening methods and the popular modeling method MDR as examples to show the effect of the inappropriate choice of null distributions. To choose appropriate null distributions, we suggest to use the permutation test or testing on the independent data set. We demonstrate their performance using synthetic data and a real genome wide data set from an Age-related Macular Degeneration (AMD) study.

Conclusions. The permutation test or testing on the independent data set can help choosing appropriate null distributions in hypothesis testing, which provides more reliable results in practice.
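Here is a minimal Python sketch of the permutation idea, under some loud assumptions. The statistic is a stand-in (the largest case/control difference in mean genotype across SNPs) rather than MDR, and the function names are hypothetical. The point it tries to illustrate is that the "pick the best SNP" step is repeated inside every permutation, so the null distribution reflects the selection.

```python
import random

def max_assoc_stat(X, y):
    """Screen and test in one step: the largest absolute case/control
    difference in mean genotype over all SNPs (a stand-in statistic)."""
    cases = [row for row, label in zip(X, y) if label == 1]
    controls = [row for row, label in zip(X, y) if label == 0]
    best = 0.0
    for a in range(len(X[0])):
        mean_case = sum(row[a] for row in cases) / len(cases)
        mean_ctrl = sum(row[a] for row in controls) / len(controls)
        best = max(best, abs(mean_case - mean_ctrl))
    return best

def permutation_p_value(X, y, n_perm=200, seed=0):
    """Permutation p-value where the whole pipeline is rerun per shuffle."""
    rng = random.Random(seed)
    observed = max_assoc_stat(X, y)
    labels = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break the genotype/phenotype link
        if max_assoc_stat(X, labels) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (exceed + 1) / (n_perm + 1)

# Toy data: SNP 0 separates cases from controls, SNP 1 is unrelated.
y = [1] * 10 + [0] * 10
X = [[2 * label, i % 2] for i, label in enumerate(y)]
print(permutation_p_value(X, y))
```

Comparing the selected statistic against a null that ignores the selection step is the situation the abstract warns about.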

Friday, February 18, 2011

A genome-wide screen of gene-gene interactions for rheumatoid arthritis susceptibility

This is a nice example of a genome-wide epistasis analysis. Nice that their interactions replicate. It would be interesting to know how many of their interactions that didn't replicate are real. There are many very good reasons why an interaction effect would not replicate in an independent sample, especially if it is from a different study or population.

Liu C, Ackerman HH, Carulli JP. A genome-wide screen of gene-gene interactions for rheumatoid arthritis susceptibility. Hum Genet. 2011 Jan 6. [Epub ahead of print] PubMed PMID: 21210282. [PubMed]


The objective of the study was to identify interacting genes contributing to rheumatoid arthritis (RA) susceptibility and identify SNPs that discriminate between RA patients who were anti-cyclic citrullinated protein positive and healthy controls. We analyzed two independent cohorts from the North American Rheumatoid Arthritis Consortium. A cohort of 908 RA cases and 1,260 controls was used to discover pairwise interactions among SNPs and to identify a set of single nucleotide polymorphisms (SNPs) that predict RA status, and a second cohort of 952 cases and 1,760 controls was used to validate the findings. After adjusting for HLA-shared epitope alleles, we identified and replicated seven SNP pairs within the HLA class II locus with significant interaction effects. We failed to replicate significant pairwise interactions among non-HLA SNPs. The machine learning approach "random forest" applied to a set of SNPs selected from single-SNP and pairwise interaction tests identified 93 SNPs that distinguish RA cases from controls with 70% accuracy. HLA SNPs provide the most classification information, and inclusion of non-HLA SNPs improved classification. While specific gene-gene interactions are difficult to validate using genome-wide SNP data, a stepwise approach combining association and classification methods identifies candidate interacting SNPs that distinguish RA cases from healthy controls.

Friday, February 11, 2011

Epistatic Interactions in Genetic Regulation of t-PA and PAI-1 Levels in a Ghanaian Population

A new paper from our lab on epistasis analysis for QTLs.

Penrod NM, Poku KA, Vaughn DE, Asselbergs FW, Brown NJ, Moore JH, Williams SM. Epistatic Interactions in Genetic Regulation of t-PA and PAI-1 Levels in a Ghanaian Population. PLoS One. 2011 Jan 31;6(1):e16639. [PubMed] [PLoS]


The proteins, tissue plasminogen activator (t-PA) and plasminogen activator inhibitor 1 (PAI-1), act in concert to balance thrombus formation and degradation, thereby modulating the development of arterial thrombosis and excessive bleeding. PAI-1 is upregulated by the renin-angiotensin system (RAS), specifically by angiotensin II, the product of angiotensin converting enzyme (ACE) cleavage of angiotensin I, which is produced by the cleavage of angiotensinogen (AGT) by renin (REN). ACE indirectly stimulates the release of t-PA which, in turn, activates the corresponding fibrinolytic system. Single polymorphisms in these pathways have been shown to significantly impact plasma levels of t-PA and PAI-1 differently in Ghanaian males and females. Here we explore the involvement of epistatic interactions between the same polymorphisms in central genes of the RAS and fibrinolytic systems on plasma t-PA and PAI-1 levels within the same population (n = 992). Statistical modeling of pairwise interactions was done using two-way ANOVA between polymorphisms in the ETNK2, RENIN, ACE, PAI-1, t-PA, and AGT genes. The most significant interactions that associated with t-PA levels were between the ETNK2 A6135G and the REN T9435C polymorphisms in females (p = 0.006) and the REN T9435C and the TPA I/D polymorphisms (p = 0.005) in males. The most significant interactions for PAI-1 levels were with REN T9435C and the TPA I/D polymorphisms (p = 0.001) in females, and the association of REN G6567T with the TPA I/D polymorphisms (p = 0.032) in males. Our results provide evidence for multiple genetic effects that may not be detected using single SNP analysis. Because t-PA and PAI-1 have been implicated in cardiovascular disease these results support the idea that the genetic architecture of cardiovascular disease is complex. Therefore, it is necessary to consider the relationship between interacting polymorphisms of pathway specific genes that predict t-PA and PAI-1 levels.

Tuesday, February 08, 2011

Dissecting genetic networks underlying complex phenotypes: the theoretical framework

I really like the concepts presented in this paper. Right on target. Love Figure 1.

Zhang F, Zhai HQ, Paterson AH, Xu JL, Gao YM, Zheng TQ, Wu RL, Fu BY, Ali J, Li ZK. Dissecting genetic networks underlying complex phenotypes: the theoretical framework. PLoS One. 2011 Jan 20;6(1):e14541. [PLoS]


Great progress has been made in genetic dissection of quantitative trait variation during the past two decades, but many studies still reveal only a small fraction of quantitative trait loci (QTLs), and epistasis remains elusive. We integrate contemporary knowledge of signal transduction pathways with principles of quantitative and population genetics to characterize genetic networks underlying complex traits, using a model founded upon one-way functional dependency of downstream genes on upstream regulators (the principle of hierarchy) and mutual functional dependency among related genes (functional genetic units, FGU). Both simulated and real data suggest that complementary epistasis contributes greatly to quantitative trait variation, and obscures the phenotypic effects of many 'downstream' loci in pathways. The mathematical relationships between the main effects and epistatic effects of genes acting at different levels of signaling pathways were established using the quantitative and population genetic parameters. Both loss of function and "co-adapted" gene complexes formed by multiple alleles with differentiated functions (effects) are predicted to be frequent types of allelic diversity at loci that contribute to the genetic variation of complex traits in populations. Downstream FGUs appear to be more vulnerable to loss of function than their upstream regulators, but this vulnerability is apparently compensated by different FGUs of similar functions. Other predictions from the model may account for puzzling results regarding responses to selection, genotype by environment interaction, and the genetic basis of heterosis.

Monday, February 07, 2011

A Comparison of Multifactor Dimensionality Reduction and Penalized Regression

Winham S, Wang C, Motsinger-Reif AA. A Comparison of Multifactor Dimensionality Reduction and L1-Penalized Regression to Identify Gene-Gene Interactions in Genetic Association Studies. Stat Appl Genet Mol Biol. 2011;10(1):Article4. [PubMed]


Recently, the amount of high-dimensional data has exploded, creating new analytical challenges for human genetics. Furthermore, much evidence suggests that common complex diseases may be due to complex etiologies such as gene-gene interactions, which are difficult to identify in high-dimensional data using traditional statistical approaches. Data-mining approaches are gaining popularity for variable selection in association studies, and one of the most commonly used methods to evaluate potential gene-gene interactions is Multifactor Dimensionality Reduction (MDR). Additionally, a number of penalized regression techniques, such as Lasso, are gaining popularity within the statistical community and are now being applied to association studies, including extensions for interactions. In this study, we compare the performance of MDR, the traditional lasso with L1 penalty (TL1), and the group lasso for categorical data with group-wise L1 penalty (GL1) to detect gene-gene interactions through a broad range of simulations. We find that each method has both advantages and disadvantages, and relative performance is context dependent. TL1 frequently over-fits, identifying false positive as well as true positive loci. MDR has higher power for epistatic models that exhibit independent main effects; for both Lasso methods, main effects tend to dominate. For purely epistatic models, GL1 has the best performance for lower minor allele frequencies, but MDR performs best for higher frequencies. These results provide guidance of when each approach might be best suited for detecting and characterizing interactions with different mechanisms.
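For readers who have not seen MDR, here is a stripped-down Python sketch of its core step applied to a purely epistatic toy model. It is not the MDR software or the authors' simulation setup. It pools each multilocus genotype cell into high or low risk by comparing its case:control ratio with the overall ratio, then scores the model by balanced accuracy. Cross-validation and permutation testing are left out, and the function names are made up for illustration.

```python
from collections import Counter
from itertools import combinations, product

def mdr_balanced_accuracy(X, y, snps):
    """Balanced accuracy of the MDR high/low-risk rule for the given SNPs."""
    cases = Counter(tuple(row[s] for s in snps)
                    for row, label in zip(X, y) if label == 1)
    controls = Counter(tuple(row[s] for s in snps)
                       for row, label in zip(X, y) if label == 0)
    n_cases, n_controls = sum(cases.values()), sum(controls.values())
    threshold = n_cases / n_controls  # overall case:control ratio
    true_pos = true_neg = 0
    for cell in set(cases) | set(controls):
        # High risk if the cell's case:control ratio meets the threshold
        # (ties counted as high risk, an assumption of this sketch).
        if cases[cell] >= threshold * controls[cell]:
            true_pos += cases[cell]
        else:
            true_neg += controls[cell]
    return 0.5 * (true_pos / n_cases + true_neg / n_controls)

def best_mdr_pair(X, y):
    """Exhaustively score every SNP pair and return the best one."""
    pairs = combinations(range(len(X[0])), 2)
    return max(pairs, key=lambda pair: mdr_balanced_accuracy(X, y, pair))

# Toy XOR model: status depends on SNP 0 and SNP 1 jointly, with no
# marginal effect for either one; SNP 2 is unrelated.
X = [list(g) for g in product((0, 1), repeat=3)]
y = [row[0] ^ row[1] for row in X]

print(best_mdr_pair(X, y))
print(mdr_balanced_accuracy(X, y, (0, 1)))
print(mdr_balanced_accuracy(X, y, (0,)))
```

In this toy, neither SNP 0 nor SNP 1 does better than chance on its own, but the pair classifies perfectly. That is the kind of pattern where a main-effect-driven method would have little to work with.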