Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Monday, April 30, 2018

Improving machine learning reproducibility in genetic association studies with proportional instance cross validation (PICV)

Our new paper on using resampling methods to improve reproducibility of machine learning in the context of cross validation.

Piette ER, Moore JH. Improving machine learning reproducibility in genetic association studies with proportional instance cross validation (PICV). BioData Min. 2018 Apr 19;11:6. [PubMed]

Background: Machine learning methods and conventions are increasingly employed for the analysis of large, complex biomedical data sets, including genome-wide association studies (GWAS). Reproducibility of machine learning analyses of GWAS can be hampered by biological and statistical factors, particularly so for the investigation of non-additive genetic interactions. Application of traditional cross validation to a GWAS data set may result in poor consistency between the training and testing data set splits due to an imbalance of the interaction genotypes relative to the data as a whole. We propose a new cross validation method, proportional instance cross validation (PICV), that preserves the original distribution of an independent variable when splitting the data set into training and testing partitions.

Results: We apply PICV to simulated GWAS data with epistatic interactions of varying minor allele frequencies and prevalences and compare performance to that of a traditional cross validation procedure in which individuals are randomly allocated to training and testing partitions. Sensitivity and positive predictive value are significantly improved across all tested scenarios for PICV compared to traditional cross validation. We also apply PICV to GWAS data from a study of
primary open-angle glaucoma to investigate a previously-reported interaction, which fails to significantly replicate; PICV however improves the consistency of testing and training results.

Conclusions: Application of traditional machine learning procedures to biomedical data may require modifications to better suit intrinsic characteristics of the data, such as the potential for highly imbalanced genotype distributions in the case of epistasis detection. The reproducibility of genetic interaction findings can be improved by considering this variable imbalance in cross validation implementation, such as with PICV. This approach may be extended to problems in other domains in which imbalanced variable distributions are a concern.

Wednesday, April 25, 2018

Collective feature selection to identify crucial epistatic variants

Nice new paper from Marylyn Ritchie's group on feature selection for epistasis analysis.

Verma SS, Lucas A, Zhang X, Veturi Y, Dudek S, Li B, Li R, Urbanowicz R, Moore JH, Kim D, Ritchie MD. Collective feature selection to identify crucial epistatic variants. BioData Min. 2018 Apr 19;11:5. [Pubmed]

Machine learning methods have gained popularity and practicality in identifying linear and non-linear effects of variants associated with complex disease/traits. Detection of epistatic interactions still remains a challenge due to the large number of features and relatively small sample size as input, thus leading to the so-called "short fat data" problem. The efficiency of machine learning methods can be increased by limiting the number of input features. Thus, it is very important to perform variable selection before searching for epistasis. Many methods have been evaluated and proposed to perform feature selection, but no single method works best in all scenarios. We demonstrate this by conducting two separate simulation analyses to evaluate the proposed collective feature selection approach.

Through our simulation study we propose a collective feature selection approach to select features that are in the "union" of the best performing methods. We explored various parametric, non-parametric, and data mining approaches to perform feature selection. We choose our top performing methods to select the union of the resulting variables based on a user-defined percentage of variants selected from each method to take to downstream analysis. Our simulation analysis shows that non-parametric data mining approaches, such as MDR, may work best under one simulation criteria for the high effect size (penetrance) datasets, while non-parametric methods designed for feature selection, such as Ranger and Gradient boosting, work best under other simulation criteria. Thus, using a collective approach proves to be more beneficial for selecting variables with epistatic effects also in low effect size datasets and different genetic architectures. Following this, we applied our proposed collective feature selection approach to select the top 1% of variables to identify potential interacting variables associated with Body Mass Index (BMI) in ~‚ÄČ44,000 samples obtained from Geisinger's MyCode Community Health Initiative (on behalf of DiscovEHR collaboration).

In this study, we were able to show that selecting variables using a collective feature selection approach could help in selecting true positive epistatic variables more frequently than applying any single method for feature selection via simulation studies. We were able to demonstrate the effectiveness of collective feature selection along with a comparison of many methods in our simulation analysis. We also applied our method to identify non-linear networks associated with obesity.