Tuesday, February 24, 2015

Great feature selection method for detecting epistasis using random forests

This is a really neat approach, well worth exploring, for using machine learning methods such as random forests to detect and model statistical epistasis in genetic studies of human health.

Holzinger ER, Szymczak S, Dasgupta A, Malley J, Li Q, Bailey-Wilson JE. Variable selection method for the identification of epistatic models. Pac Symp Biocomput. 2015;20:195-206. [PDF]

Abstract

Standard analysis methods for genome wide association studies (GWAS) are not robust to complex disease models, such as interactions between variables with small main effects. These types of effects likely contribute to the heritability of complex human traits. Machine learning methods that are capable of identifying interactions, such as Random Forests (RF), are an alternative analysis approach. One caveat to RF is that there is no standardized method of selecting variables so that false positives are reduced while retaining adequate power. To this end, we have developed a novel variable selection method called relative recurrency variable importance metric (r2VIM). This method incorporates recurrency and variance estimation to assist in optimal threshold selection. For this study, we specifically address how this method performs in data with almost completely epistatic effects (i.e. no marginal effects). Our results show that with appropriate parameter settings, r2VIM can identify interaction effects when the marginal effects are virtually nonexistent. It also outperforms logistic regression, which has essentially no power under this type of model when the number of potential features (genetic variants) is large. (All Supplementary Data can be found here: http://research.nhgri.nih.gov/manuscripts/Bailey-Wilson/r2VIM_epi/).
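To make the "recurrency" idea concrete, here is a minimal sketch of how a selection rule of this kind could look. The abstract does not spell out the computation, so the scaling rule (dividing each run's importance scores by the absolute value of that run's smallest score, used as a rough noise estimate), the threshold `factor`, the function names, and the SNP scores are all assumptions made for illustration, not necessarily the authors' exact procedure. The sketch also assumes every run reports the same set of variables.

```python
from typing import Dict, List


def relative_importances(run: Dict[str, float]) -> Dict[str, float]:
    """Scale one run's importance scores by the size of its smallest score.

    Assumption: the smallest (typically negative) importance in a run is
    treated as a rough estimate of the noise level for that run.
    """
    noise = abs(min(run.values()))
    if noise == 0:
        raise ValueError("cannot scale: the smallest importance in this run is zero")
    return {name: score / noise for name, score in run.items()}


def select_recurrent(runs: List[Dict[str, float]], factor: float = 3.0) -> List[str]:
    """Keep variables whose relative importance is >= factor in every run."""
    if not runs:
        return []
    scaled = [relative_importances(run) for run in runs]
    return sorted(
        name for name in scaled[0]
        if all(s[name] >= factor for s in scaled)
    )


# Made-up importance scores from three hypothetical random forest runs.
runs = [
    {"snp1": 0.030, "snp2": 0.028, "snp3": 0.004, "snp4": -0.005},
    {"snp1": 0.026, "snp2": 0.031, "snp3": -0.006, "snp4": 0.019},
    {"snp1": 0.033, "snp2": 0.027, "snp3": 0.002, "snp4": -0.004},
]
print(select_recurrent(runs, factor=3.0))
```

In this toy input, `snp4` clears the threshold in only one of the three runs, so it is dropped. Requiring a variable to recur across runs is what guards against one-off false positives, while raising or lowering `factor` trades power against the false positive rate.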

Labels: epistasis, machine learning, random forests

Friday, February 20, 2015

Is Big Data a 21st Century Maginot Line?

We have just published this open access editorial in BioData Mining on whether 'big data' is a 21st century Maginot Line. This is relevant because we as scientists sometimes let the data define the research questions rather than the other way around. As the size and complexity of data grow, we may find ourselves asking simpler and simpler questions, only some of which are important to advancing our understanding of human health and disease.

Huang X, Jennings SF, Bruce B, Buchan A, Cai L, Chen P, Cramer CL, Guan W, Hilgert UK, Jiang H, Li Z, McClure G, McMullen DF, Nanduri B, Perkins A, Rekepalli B, Salem S, Specker J, Walker K, Wunsch D, Xiong D, Zhang S, Zhang Y, Zhao Z, Moore JH. Big data - a 21st century science Maginot Line? No-boundary thinking: shifting from the big data paradigm. BioData Min. 2015 Feb 6;8:7. [PDF]

See also our previous related essay on 'no boundary thinking' in bioinformatics.

Huang X, Bruce B, Buchan A, Congdon CB, Cramer CL, Jennings SF, Jiang H, Li Z, McClure G, McMullen R, Moore JH, Nanduri B, Peckham J, Perkins A, Polson SW, Rekepalli B, Salem S, Specker J, Wunsch D, Xiong D, Zhang S, Zhao Z. No-boundary thinking in bioinformatics research. BioData Min. 2013 Nov 6;6(1):19. [PDF]

Labels: big data, bioinformatics

Epistasis Blog
