Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Wednesday, April 20, 2016

A new permutation testing method for epistasis analysis using random forests

Li J, Malley JD, Andrew AS, Karagas MR, Moore JH. Detecting gene-gene interactions using a permutation-based random forest method. BioData Min. 2016 Apr 6;9:14. [PDF]


Identifying gene-gene interactions is essential to understand disease susceptibility and to detect genetic architectures underlying complex diseases. Here, we aimed at developing a permutation-based methodology relying on a machine learning method, random forest (RF), to detect gene-gene interactions. Our approach called permuted random forest (pRF) which identified the top interacting single nucleotide polymorphism (SNP) pairs by estimating how much the power of a random forest classification model is influenced by removing pairwise interactions.

We systematically tested our approach on a simulation study with datasets possessing various genetic constraints including heritability, number of SNPs, sample size, etc. Our methodology showed high success rates for detecting the interaction SNP pair. We also applied our approach to two bladder cancer datasets, which showed consistent results with well-studied methodologies, such as multifactor dimensionality reduction (MDR) and statistical epistasis network (SEN). Furthermore, we built permuted random forest networks (PRFN), in which we used nodes to represent SNPs and edges to indicate interactions.

We successfully developed a scale-invariant methodology to detect pure gene-gene interactions based on permutation strategies and the machine learning method random forest. This methodology showed great potential to be used for detecting gene-gene interactions to study underlying genetic architectures in a scale-free way, which could be benefit to uncover the complex disease mechanisms.

Tuesday, April 05, 2016

Using knowledge of human evolution to help detect genetic associations with common diseases

Question: How to extend this to epistasis analysis?

Huang M, Graham BE, Zhang G, Harder R, Kodaman N, Moore JH, Muglia L, Williams SM. Evolutionary triangulation: informing genetic association studies with evolutionary evidence. BioData Min. 2016 Apr 2;9:12. [PDF]


Genetic studies of human diseases have identified many variants associated with pathogenesis and severity. However, most studies have used only statistical association to assess putative relationships to disease, and ignored other factors for evaluation. For example, evolution is a factor that has shaped disease risk, changing allele frequencies as human populations migrated into and inhabited new environments. Since many common variants differ among populations in frequency, as does disease prevalence, we hypothesized that patterns of disease and population structure, taken together, will inform association studies. Thus, the population distributions of allelic risk variants should reflect the distributions of their associated diseases. Evolutionary Triangulation (ET) exploits this evolutionary differentiation by comparing population structure among three populations with variable patterns of disease prevalence. By selecting populations based on patterns where two have similar rates of disease that differ substantially from a third, we performed a proof of principle analysis for this method. We examined three disease phenotypes, lactase persistence, melanoma, and Type 2 diabetes mellitus. We show that for lactase persistence, a phenotype with a simple genetic architecture, ET identifies the key gene, lactase. For melanoma, ET identifies several genes associated with this disease and/or phenotypes related to it, such as skin color genes. ET was less obviously successful for Type 2 diabetes mellitus, perhaps because of the small effect sizes in known risk loci and recent environmental changes that have altered disease risk. Alternatively, ET may have revealed new genes involved in conferring disease risk for diabetes that did not meet nominal GWAS significance thresholds. We also compared ET to another method used to filter for phenotype associated genes, population branch statistic (PBS), and show that ET performs better in identifying genes known to associate with diseases appropriately distributed among populations. Our results indicate that ET can filter association results to improve our ability to discover disease loci.