Epistasis Blog

Ten important roles for academic leaders to promote equity, diversity, and inclusion in data science

2021-03-31T13:03:00.001-04:00

Our new editorial on equity, diversity, and inclusion in data science is out in BioData Mining.

Ten simple rules for writing a paper about scientific software

2020-12-01T16:17:00.001-05:00

Romano JD, Moore JH. Ten simple rules for writing a paper about scientific software. PLoS Comput Biol. 2020 Nov 12;16(11):e1008390. doi: 10.1371/journal.pcbi.1008390. PMID: 33180774; PMCID: PMC7660560. [PubMed] [PLoS Comp Bio]

Abstract

Papers describing software are an important part of computational fields of scientific research. These "software papers" are unique in a number of ways, and they require special consideration to improve their impact on the scientific community and their efficacy at conveying important information. Here, we discuss 10 specific rules for writing software papers, covering some of the different scenarios and publication types that might be encountered, and important questions from which all computational researchers would benefit by asking along the way.

Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses

2020-11-04T16:14:00.001-05:00

Manduchi E, Fu W, Romano JD, Ruberto S, Moore JH. Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses. BMC Bioinformatics. 2020 Oct 1;21(1):430. doi: 10.1186/s12859-020-03755-4. PMID: 32998684; PMCID: PMC7528347. [PubMed] [BMC Bioinformatics]

Abstract

Background: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis.

Results: We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids 'leakage' during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj.

Conclusions: In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field.
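The leakage-avoiding idea in the Results can be sketched in a few lines. This is a minimal illustration, not the TPOT extension itself: it assumes a single covariate and a simple linear adjustment, and the function names and toy numbers are hypothetical. The key point is that the adjustment model is fit on the training fold only and then applied unchanged to the held-out fold.

```python
from statistics import mean

def fit_covariate_model(covariate, feature):
    """Fit feature ~ intercept + slope * covariate by ordinary least squares."""
    c_bar, f_bar = mean(covariate), mean(feature)
    sxx = sum((c - c_bar) ** 2 for c in covariate)
    sxy = sum((c - c_bar) * (f - f_bar) for c, f in zip(covariate, feature))
    slope = sxy / sxx
    return f_bar - slope * c_bar, slope

def residualize(covariate, feature, model):
    """Subtract the covariate's fitted contribution from the feature."""
    intercept, slope = model
    return [f - (intercept + slope * c) for c, f in zip(covariate, feature)]

# Hypothetical toy data: age is the covariate, expr is one feature.
age = [30, 40, 50, 60, 70, 80]
expr = [1.0, 1.5, 2.1, 2.4, 3.0, 3.6]

# One cross-validation split (indices chosen by hand for illustration).
train_idx, test_idx = [0, 1, 2, 3], [4, 5]
train_age = [age[i] for i in train_idx]
train_expr = [expr[i] for i in train_idx]
test_age = [age[i] for i in test_idx]
test_expr = [expr[i] for i in test_idx]

# Fit on the training fold only, so no held-out information leaks in.
model = fit_covariate_model(train_age, train_expr)
train_resid = residualize(train_age, train_expr, model)
test_resid = residualize(test_age, test_expr, model)
```

Fitting the adjustment on the full data set before splitting would let the held-out samples influence the coefficients, which is the kind of leakage the abstract refers to.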

Ten important roles for academic leaders in data science

2020-10-30T16:11:00.001-04:00

Moore JH. Ten important roles for academic leaders in data science. BioData Min. 2020 Oct 26;13:18. doi: 10.1186/s13040-020-00228-5. PMID: 33117434; PMCID: PMC7586691. [PubMed] [BioData Mining]

Abstract

Data science has emerged as an important discipline in the era of big data and biological and biomedical data mining. As such, we have seen a rapid increase in the number of data science departments, research centers, and schools. We review here ten important leadership roles for a successful academic data science chair, director, or dean. These roles include the visionary, executive, cheerleader, manager, enforcer, subordinate, educator, entrepreneur, mentor, and communicator. Examples specific to leadership in data science are given for each role.

Evaluating recommender systems for AI-driven biomedical informatics

2020-09-09T16:09:00.002-04:00

La Cava W, Williams H, Fu W, Vitale S, Srivatsan D, Moore JH. Evaluating recommender systems for AI-driven biomedical informatics. Bioinformatics. 2020 Aug 7:btaa698. doi: 10.1093/bioinformatics/btaa698. Epub ahead of print. PMID: 32766825. [PubMed] [Bioinformatics]

Abstract

Motivation: Many researchers with domain expertise are unable to easily apply machine learning to their bioinformatics data due to a lack of machine learning and/or coding expertise. Methods that have been proposed thus far to automate machine learning mostly require programming experience as well as expert knowledge to tune and apply the algorithms correctly. Here, we study a method of automating biomedical data science using a web-based platform that uses AI to recommend model choices and conduct experiments. We have two goals in mind: first, to make it easy to construct sophisticated models of biomedical processes; and second, to provide a fully automated AI agent that can choose and conduct promising experiments for the user, based on the user's experiments as well as prior knowledge. To validate this framework, we experiment with hundreds of classification problems, comparing to state-of-the-art, automated approaches. Finally, we use this tool to develop predictive models of septic shock in critical care patients.

Results: We find that matrix factorization-based recommendation systems outperform meta-learning methods for automating machine learning. This result mirrors the results of earlier recommender systems research in other domains. The proposed AI is competitive with state-of-the-art automated machine learning methods in terms of choosing optimal algorithm configurations for datasets. In our application to prediction of septic shock, the AI-driven analysis produces a competent machine learning model (AUROC 0.85 +/- 0.02) that performs on par with state-of-the-art deep learning results for this task, with much less computational effort.

Availability: PennAI is available free of charge and open-source. It is distributed under the GNU public license (GPL) version 3.

Supplementary information: Software and experiments are available from epistasislab.github.io/pennai.

Electronic health records and polygenic risk scores for predicting disease risk

2020-08-26T15:58:00.002-04:00

Li R, Chen Y, Ritchie MD, Moore JH. Electronic health records and polygenic risk scores for predicting disease risk. Nat Rev Genet. 2020 Aug;21(8):493-502. doi: 10.1038/s41576-020-0224-1. Epub 2020 Mar 31. PMID: 32235907. [PubMed] [Nature Reviews]

Abstract

Accurate prediction of disease risk based on the genetic make-up of an individual is essential for effective prevention and personalized treatment. Nevertheless, to date, individual genetic variants from genome-wide association studies have achieved only moderate prediction of disease risk. The aggregation of genetic variants under a polygenic model shows promising improvements in prediction accuracies. Increasingly, electronic health records (EHRs) are being linked to patient genetic data in biobanks, which provides new opportunities for developing and applying polygenic risk scores in the clinic, to systematically examine and evaluate patient susceptibilities to disease. However, the heterogeneous nature of EHR data brings forth many practical challenges along every step of designing and implementing risk prediction strategies. In this Review, we present the unique considerations for using genotype and phenotype data from biobank-linked EHRs for polygenic risk prediction.
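For readers new to the idea, aggregating variants under a polygenic model usually means a weighted sum of risk-allele counts. Here is a minimal sketch; the weights and genotypes are made up for illustration and are not taken from the Review.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * d for w, d in zip(weights, dosages))

# Hypothetical per-variant weights (for example, GWAS log odds ratios).
weights = [0.10, 0.05, 0.20]
# Hypothetical patient: risk-allele counts at the same three variants.
patient = [2, 0, 1]

score = polygenic_risk_score(patient, weights)
```

In practice the weights come from GWAS summary statistics, and the raw score is typically compared against a reference population before it is interpreted.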

A brief introduction to my artificial intelligence and machine learning research program - YouTube

2020-08-10T16:20:00.001-04:00

A 12-minute overview of my artificial intelligence and machine learning research program [YouTube]

treeheatr: an R package for interpretable decision tree visualizations

2020-07-30T16:04:00.002-04:00

Le TT, Moore JH. treeheatr: an R package for interpretable decision tree visualizations. Bioinformatics. 2020 Jul 23:btaa662. doi: 10.1093/bioinformatics/btaa662. Epub ahead of print. PMID: 32702108. [PubMed] [Bioinformatics]

Abstract

Summary: treeheatr is an R package for creating interpretable decision tree visualizations with the data represented as a heatmap at the tree's leaf nodes. The integrated presentation of the tree structure along with an overview of the data efficiently illustrates how the tree nodes split up the feature space and how well the tree model performs. This visualization can also be examined in depth to uncover the correlation structure in the data and importance of each feature in predicting the outcome. Implemented in an easily installed package with a detailed vignette, treeheatr can be a useful teaching tool to enhance students' understanding of a simple decision tree model before diving into more complex tree-based machine learning methods.

Availability: The treeheatr package is freely available under the permissive MIT license at https://trang1618.github.io/treeheatr and https://cran.r-project.org/package=treeheatr. It comes with a detailed vignette that is automatically built with GitHub Actions continuous integration.

Supplementary information: Supplementary data are available at Bioinformatics online.

Ideas for how informaticians can get involved with COVID-19 research

2020-05-27T16:02:00.001-04:00

Moore JH, Barnett I, Boland MR, Chen Y, Demiris G, Gonzalez-Hernandez G, Herman DS, Himes BE, Hubbard RA, Kim D, Morris JS, Mowery DL, Ritchie MD, Shen L, Urbanowicz R, Holmes JH. Ideas for how informaticians can get involved with COVID-19 research. BioData Min. 2020 May 12;13:3. doi: 10.1186/s13040-020-00213-y. PMID: 32419848; PMCID: PMC7216865. [PubMed] [BioData Mining]

Abstract

The coronavirus disease 2019 (COVID-19) pandemic has had a significant impact on population health and wellbeing. Biomedical informatics is central to COVID-19 research efforts and for the delivery of healthcare for COVID-19 patients. Critical to this effort is the participation of informaticians who typically work on other basic science or clinical problems. The goal of this editorial is to highlight some examples of COVID-19 research areas that could benefit from informatics expertise. Each research idea summarizes the COVID-19 application area, followed by an informatics methodology, approach, or technology that could make a contribution. It is our hope that this piece will motivate and make it easy for some informaticians to adopt COVID-19 research projects.

Using simulations to understand the relationship between epistasis and observed GWAS findings

2020-02-05T15:52:00.001-05:00

Moore JH, Olson RS, Schmitt P, Chen Y, Manduchi E. How Computational Experiments Can Improve Our Understanding of the Genetic Architecture of Common Human Diseases. Artif Life. 2020 Winter;26(1):23-37. doi: 10.1162/artl_a_00308. Epub 2020 Feb 6. PMID: 32027528. [PubMed] [Artificial Life]

Abstract

Susceptibility to common human diseases such as cancer is influenced by many genetic and environmental factors that work together in a complex manner. The state of the art is to perform a genome-wide association study (GWAS) that measures millions of single-nucleotide polymorphisms (SNPs) throughout the genome followed by a one-SNP-at-a-time statistical analysis to detect univariate associations. This approach has identified thousands of genetic risk factors for hundreds of diseases. However, the genetic risk factors detected have very small effect sizes and collectively explain very little of the overall heritability of the disease. Nonetheless, it is assumed that the genetic component of risk is due to many independent risk factors that contribute additively. The fact that many genetic risk factors with small effects can be detected is taken as evidence to support this notion. It is our working hypothesis that the genetic architecture of common diseases is partly driven by non-additive interactions. To test this hypothesis, we developed a heuristic simulation-based method for conducting experiments about the complexity of genetic architecture. We show that a genetic architecture driven by complex interactions is highly consistent with the magnitude and distribution of univariate effects seen in real data. We compare our results with measures of univariate and interaction effects from two large-scale GWASs of sporadic breast cancer and find evidence to support our hypothesis that is consistent with the results of our computational experiment.
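To see why a one-SNP-at-a-time analysis can miss interactions, consider a toy two-locus model. This is a textbook-style illustration with a hypothetical penetrance table, not the heuristic simulation method described in the paper. Disease risk depends entirely on the genotype combination, yet each SNP on its own shows no effect.

```python
# Genotype frequencies under Hardy-Weinberg with allele frequency 0.5
# (order: AA, Aa, aa; the same frequencies are assumed for SNP B).
GENO_FREQ = [0.25, 0.5, 0.25]

# Hypothetical penetrance table: P(disease | genotype at A, genotype at B).
# Rows index the SNP A genotype, columns index the SNP B genotype.
PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]

def marginal_penetrance_a(table, freq):
    """Risk for each SNP A genotype, averaged over SNP B genotypes."""
    return [sum(p * f for p, f in zip(row, freq)) for row in table]

def marginal_penetrance_b(table, freq):
    """Risk for each SNP B genotype, averaged over SNP A genotypes."""
    return [sum(table[i][j] * freq[i] for i in range(3)) for j in range(3)]

marginal_a = marginal_penetrance_a(PENETRANCE, GENO_FREQ)
marginal_b = marginal_penetrance_b(PENETRANCE, GENO_FREQ)
```

Both marginal vectors are flat (every genotype carries the same average risk), so a univariate test would see nothing even though the joint genotype fully determines risk.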

Getting started with TPOT for automated machine learning

2019-12-30T13:05:00.002-05:00

A great post from Dr. Trang Le on how to get started with automated machine learning (AutoML) with our Tree-Based Pipeline Optimization Tool (TPOT) in Python.

The Human Pancreas Analysis Program (HPAP)

2019-12-26T18:23:00.001-05:00

We are assisting with bioinformatics support for the Human Pancreas Analysis Program (HPAP), which consists of two interlocking, collaborative projects at three institutions that seek to provide comprehensive molecular profiling in unprecedented detail of the pancreatic islet at various stages of type 1 diabetes (T1D) pathogenesis: pre-diabetic (positive islet autoantibodies), recent onset, and T1D of less than 10 years' duration.

In the past decade, there have been dramatic advances in our ability to phenotype and molecularly profile human cells and tissues. HIRN-HPAP will develop and apply these new technologies to study cells and tissues relevant to the beta cell loss in T1D with unprecedented resolution, including at the genomic, epigenomic, protein, and functional levels. Here we will employ state-of-the-art technologies to determine all aspects of pancreatic islet cell and immune cell biology as it pertains to the pathogenesis of type 1 diabetes. We will profile both the endocrine and immune systems with multiple modalities, and make the vast data accumulated available through the highly accessible PANC-DB, which will be developed through the project. These extensive and high-quality datasets will be made available to the HIRN and the diabetes research community at large for further discovery.

Automated machine learning analysis of metabolomics data

2019-11-20T17:53:00.000-05:00

We have expanded our TPOT automated machine learning method (AutoML) to metabolomics data.

Orlenko A, Kofink D, Lyytikäinen LP, Nikus K, Mishra P, Kuukasjärvi P, Karhunen PJ, Kähönen M, Laurikka JO, Lehtimäki T, Asselbergs FW, Moore JH. Model selection for metabolomics: predicting diagnosis of coronary artery disease using automated machine learning (AutoML). Bioinformatics. 2019 Nov 8. [PubMed]

Abstract

MOTIVATION:
Selecting the optimal machine learning (ML) model for a given dataset is often challenging. Automated ML (AutoML) has emerged as a powerful tool for enabling the automatic selection of ML methods and parameter settings for the prediction of biomedical endpoints. Here, we apply the tree-based pipeline optimization tool (TPOT) to predict angiographic diagnoses of coronary artery disease (CAD). With TPOT, ML models are represented as expression trees and optimal pipelines discovered using a stochastic search method called genetic programming. We provide some guidelines for TPOT-based ML pipeline selection and optimization based on various clinical phenotypes and high-throughput metabolic profiles in the Angiography and Genes Study (ANGES).

RESULTS:
We analyzed nuclear magnetic resonance (NMR)-derived lipoprotein and metabolite profiles in the ANGES cohort with a goal to identify the role of non-obstructive CAD patients in CAD diagnostics. We performed a comparative analysis of TPOT-generated ML pipelines with selected ML classifiers, optimized with a grid search approach, applied to two phenotypic CAD profiles. As a result, TPOT generated ML pipelines that outperformed grid search optimized models across multiple performance metrics including balanced accuracy and area under the precision-recall curve. With the selected models, we demonstrated that the phenotypic profile that distinguishes non-obstructive CAD patients from no CAD patients is associated with higher precision, suggesting a discrepancy in the underlying processes between these phenotypes.

AVAILABILITY:
TPOT is freely available via http://epistasislab.github.io/tpot/

Embracing study heterogeneity for finding genetic interactions in large-scale research consortia

2019-10-09T17:47:00.000-04:00

New collaborative paper in Genetic Epidemiology with Dr. Yong Chen.

Liu Y, Huang J, Urbanowicz RJ, Chen K, Manduchi E, Greene CS, Moore JH, Scheet P, Chen Y. Embracing study heterogeneity for finding genetic interactions in large-scale research consortia. Genet Epidemiol. 2019 Oct 4. [PubMed]

Abstract

Genetic interactions have been recognized as a potentially important contributor to the heritability of complex diseases. Nevertheless, due to small effect sizes and stringent multiple-testing correction, identifying genetic interactions in complex diseases is particularly challenging. To address the above challenges, many genomic research initiatives collaborate to form large-scale consortia and develop open access to enable sharing of genome-wide association study (GWAS) data. Despite the perceived benefits of data sharing from large consortia, a number of practical issues have arisen, such as privacy concerns on individual genomic information and heterogeneous data sources from distributed GWAS databases. In the context of large consortia, we demonstrate that the heterogeneously appearing marginal effects over distributed GWAS databases can offer new insights into genetic interactions for which conventional methods have had limited success. In this paper, we develop a novel two-stage testing procedure, named phylogenY-based effect-size tests for interactions using first 2 moments (YETI2), to detect genetic interactions through both pooled marginal effects, in terms of averaging site-specific marginal effects, and heterogeneity in marginal effects across sites, using a meta-analytic framework. YETI2 can not only be applied to large consortia without shared personal information but also can be used to leverage underlying heterogeneity in marginal effects to prioritize potential genetic interactions. We investigate the performance of YETI2 through simulation studies and apply YETI2 to bladder cancer data from dbGaP.
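The two ingredients named in the abstract, a pooled marginal effect and heterogeneity in marginal effects across sites, have standard meta-analytic counterparts. The sketch below computes the fixed-effect inverse-variance pooled estimate and Cochran's Q. It is a generic illustration under that assumption, not the YETI2 procedure, and the site-level numbers are hypothetical.

```python
def pooled_effect_and_q(effects, std_errors):
    """Inverse-variance pooled effect and Cochran's Q heterogeneity statistic."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, effects)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, effects))
    return pooled, q

# Hypothetical marginal effects for one SNP at three sites, equal precision.
site_effects = [0.30, -0.25, 0.05]
site_ses = [0.1, 0.1, 0.1]

pooled, q = pooled_effect_and_q(site_effects, site_ses)
```

Here the pooled effect is close to zero while Q is large relative to a chi-square distribution with two degrees of freedom (number of sites minus one). That is the kind of pattern where heterogeneity carries signal that the average alone would hide.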

Alternative measures of association for GWAS

2019-08-14T12:47:00.000-04:00

Manduchi E, Orzechowski PR, Ritchie MD, Moore JH. Exploration of a diversity of computational and statistical measures of association for genome-wide genetic studies. BioData Min. 2019 Jul 9;12:14. [PubMed] [BioData Mining]

Background
The principal line of investigation in Genome Wide Association Studies (GWAS) is the identification of main effects, that is individual Single Nucleotide Polymorphisms (SNPs) which are associated with the trait of interest, independent of other factors. A variety of methods have been proposed to this end, mostly statistical in nature and differing in assumptions and type of model employed. Moreover, for a given model, there may be multiple choices for the SNP genotype encoding. As an alternative to statistical methods, machine learning methods are often applicable. Typically, for a given GWAS, a single approach is selected and utilized to identify potential SNPs of interest. Even when multiple GWAS are combined through meta-analyses within a consortium, each GWAS is typically analyzed with a single approach and the resulting summary statistics are then utilized in meta-analyses.

Results
In this work we use as case studies a Type 2 Diabetes (T2D) and a breast cancer GWAS to explore a diversity of applicable approaches spanning different methods and encoding choices. We assess similarity of these approaches based on the derived ranked lists of SNPs and, for each GWAS, we identify a subset of representative approaches that we use as an ensemble to derive a union list of top SNPs. Among these are SNPs which are identified by multiple approaches as well as several SNPs identified by only one or a few of the less frequently used approaches. The latter include SNPs from established loci and SNPs which have other supporting lines of evidence in terms of their potential relevance to the traits.

Conclusions
Not every main effect analysis method is suitable for every GWAS, but for each GWAS there are typically multiple applicable methods and encoding options. We suggest a workflow for a single GWAS, extensible to multiple GWAS from consortia, where representative approaches are selected among a pool of suitable options, to yield a more comprehensive set of SNPs, potentially including SNPs that would typically be missed with the most popular analyses, but that could provide additional valuable insights for follow-up.

Scaling tree-based automated machine learning

2019-07-24T12:44:00.000-04:00

Le TT, Fu W, Moore JH. Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics, in press (2019). [PubMed] [Bioinformatics]

MOTIVATION: Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming to recommend an optimized analysis pipeline for the data scientist's prediction problem. However, like other AutoML systems, TPOT may reach computational resource limits when working on big data such as whole-genome expression data.

RESULTS: We introduce two new features implemented in TPOT that help increase the system's scalability: Feature Set Selector and Template. Feature Set Selector (FSS) provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. FSS increases TPOT's efficiency in application on big data by slicing the entire dataset into smaller sets of features and allowing genetic programming to select the best subset in the final pipeline. Template enforces type constraints with strongly typed genetic programming and enables the incorporation of FSS at the beginning of each pipeline. Consequently, FSS and Template help reduce TPOT computation time and may provide more interpretable results. Our simulations show TPOT-FSS significantly outperforms a tuned XGBoost model and standard TPOT implementation. We apply TPOT-FSS to real RNA-Seq data from a study of major depressive disorder. Independent of the previous study that identified significant association with depression severity of two modules, TPOT-FSS corroborates that one of the modules is largely predictive of the clinical diagnosis of each individual.

AVAILABILITY: Detailed simulation and analysis code needed to reproduce the results in this study is available at https://github.com/lelaboratoire/tpot-fss. Implementation of the new TPOT operators is available at https://github.com/EpistasisLab/tpot.

Workflows for regulome and transcriptome-based prioritization of genetic variants

2019-06-18T12:39:00.000-04:00

Manduchi E, Hemerich D, van Setten J, Tragante V, Harakalova M, Pei J, Williams SM, van der Harst P, Asselbergs FW, Moore JH. A comparison of two workflows for regulome and transcriptome-based prioritization of genetic variants associated with myocardial mass. Genet Epidemiol. 2019 Sep;43(6):717-726. [PubMed] [Genetic Epi]

A typical task arising from main effect analyses in a Genome Wide Association Study (GWAS) is to identify single nucleotide polymorphisms (SNPs), in linkage disequilibrium with the observed signals, that are likely causal variants and the affected genes. The affected genes may not be those closest to associating SNPs. Functional genomics data from relevant tissues are believed to be helpful in selecting likely causal SNPs and interpreting implicated biological mechanisms, ultimately facilitating prevention and treatment in the case of a disease trait. These data are typically used post GWAS analyses to fine‐map the statistically significant signals identified agnostically by testing all SNPs and applying a multiple testing correction. The number of tested SNPs is typically in the millions, so the multiple testing burden is high. Motivated by this, in this study we investigated an alternative workflow, which consists in utilizing the available functional genomics data as a first step to reduce the number of SNPs tested for association. We analyzed GWAS on electrocardiographic QRS duration using these two workflows. The alternative workflow identified more SNPs, including some residing in loci not discovered with the typical workflow. Moreover, the latter are corroborated by other reports on QRS duration. This indicates the potential value of incorporating functional genomics information at the onset in GWAS analyses.
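The multiple-testing argument is easy to make concrete. The abstract does not say which correction was used, so this sketch assumes a Bonferroni correction, and the SNP counts are hypothetical round numbers.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

# Typical workflow: test every SNP (assumed here to be one million).
genome_wide = bonferroni_threshold(0.05, 1_000_000)

# Alternative workflow: test only SNPs retained by a functional genomics
# filter (assumed here to be 50,000).
prefiltered = bonferroni_threshold(0.05, 50_000)

# How much less stringent the per-test threshold becomes.
relaxation = prefiltered / genome_wide
```

With these assumed counts the per-test threshold is 20 times less stringent after filtering, which is one way the alternative workflow could surface additional loci.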

Accessible AI for Automated Machine Learning

2019-05-16T15:08:00.005-04:00

We released our open-source PennAI software for automated machine learning this week. Here is the Penn Medicine press release. Here is the GitHub link to the source code. More info can be found at the PennAI website. We think this will bring machine learning technology to novice users.

Automated discovery of test statistics

2019-04-22T15:50:00.000-04:00

This was a fun proof-of-principle paper we did on using genetic programming to discover test statistics. We showed that, starting from general principles, we could rediscover the two-sample t-test. This opens the door to the discovery of new test statistics for unsolved problems.
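For reference, this is the target statistic: the classic pooled-variance two-sample t statistic, written out directly. It is a plain implementation of the textbook formula, not the genetic programming system from the paper, and the sample values are made up.

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(x, y):
    """Student's two-sample t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))

# Hypothetical measurements from two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 5.0]
group_b = [4.2, 4.4, 4.0, 4.7, 4.1]

t_stat = two_sample_t(group_a, group_b)
```

The statistic is compared with a t distribution with nx + ny - 2 degrees of freedom.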

How to increase our belief in discovered statistical interactions via large-scale association studies?

2019-03-31T08:41:00.004-04:00

Our new paper with Dr. Kristel van Steen on approaches for improving evidence for statistical interactions.

Van Steen K, Moore JH. How to increase our belief in discovered statistical interactions via large-scale association studies? Hum Genet. 2019 [PubMed] [Human Genetics]

Abstract

The understanding that differences in biological epistasis may impact disease risk, diagnosis, or disease management stands in wide contrast to the unavailability of widely accepted large-scale epistasis analysis protocols. Several choices in the analysis workflow will impact false-positive and false-negative rates. One of these choices relates to the exploitation of particular modelling or testing strategies. The strengths and limitations of these need to be well understood, as well as the contexts in which these hold. This will contribute to determining the potentially complementary value of epistasis detection workflows and is expected to increase replication success with biological relevance. In this contribution, we take a recently introduced regression-based epistasis detection tool as a leading example to review the key elements that need to be considered to fully appreciate the value of analytical epistasis detection performance assessments. We point out unresolved hurdles and give our perspectives towards overcoming these.

Testing the assumptions of parametric linear models: the need for biological data mining in disciplines such as human genetics

2019-03-01T08:37:00.000-05:00

This editorial is in response to some claims that an observed linear relationship between relative pair trait correlation and IBD genetic sharing is indicative of a simple additive genetic architecture dominated by independent genetic effects. As we show here, you could observe this pattern under a genetic architecture dominated by epistasis.

Moore JH, Mackay TFC, Williams SM. Testing the assumptions of parametric linear models: the need for biological data mining in disciplines such as human genetics. BioData Min. 2019 Feb 11;12:6. [PubMed] [BioData Mining]

Abstract

All data science methods have specific assumptions that are made in order for their inferences to be valid. Some assumptions impact statistical significance testing and some influence the models themselves. For example, a fundamental assumption of linear regression is that the relationship between the independent and dependent variables is additive such that a unit increase in one leads to a unit increase in the other with some error that can be modeled using a normal distribution. The presence of a nonlinear relationship between the variables violates this assumption and can lead to inaccurate inferences. We demonstrate this here using a simple example from human genetics and then end with some thoughts about the role of biological data mining in revealing nonlinear relationships between variables.

Preparing next-generation scientists for biomedical big data: artificial intelligence approaches

2019-02-13T08:30:00.000-05:00

Our paper on how to prepare next-gen scientists for big data is out. We outline here a curriculum focused on precision medicine, data science, and artificial intelligence.

Moore JH, Boland MR, Camara PG, Chervitz H, Gonzalez G, Himes BE, Kim D, Mowery DL, Ritchie MD, Shen L, Urbanowicz RJ, Holmes JH. Preparing next-generation scientists for biomedical big data: artificial intelligence approaches. Per Med. 2019 [PubMed] [PerMed]

Abstract

Personalized medicine is being realized by our ability to measure biological and environmental information about patients. Much of these data are being stored in electronic health records yielding big data that presents challenges for its management and analysis. Here, we review several areas of knowledge that are necessary for next-generation scientists to fully realize the potential of biomedical big data. We begin with an overview of big data and its storage and management. We then review statistics and data science as foundational topics followed by a core curriculum of artificial intelligence, machine learning and natural language processing that are needed to develop predictive models for clinical decision making. We end with some specific training recommendations for preparing next-generation scientists for biomedical big data.

Analysis validation has been neglected in the Age of Reproducibility

2019-01-02T08:44:00.000-05:00

Our paper on the use of simulation to help improve analysis validation and results reproducibility.

Lotterhos KE, Moore JH, Stapleton AE. Analysis validation has been neglected in the Age of Reproducibility. PLoS Biol. 2018 Dec 10;16(12):e3000070. [PubMed] [PLoS Biology]

Abstract

Increasingly complex statistical models are being used for the analysis of biological data. Recent commentary has focused on the ability to compute the same outcome for a given dataset (reproducibility). We argue that a reproducible statistical analysis is not necessarily valid because of unique patterns of nonindependence in every biological dataset. We advocate that analyses should be evaluated with known-truth simulations that capture biological reality, a process we call "analysis validation." We review the process of validation and suggest criteria that a validation project should meet. We find that different fields of science have historically failed to meet all criteria, and we suggest ways to implement meaningful validation in training and practice.

Generalized multifactor dimensionality reduction approaches to identification of genetic interactions underlying ordinal traits

2018-11-21T13:42:00.003-05:00

I love seeing new extensions and modifications to our MDR method. Here is a new one from Dr. Lou.

Hou TT, Lin F, Bai S, Cleves MA, Xu HM, Lou XY. Generalized multifactor dimensionality reduction approaches to identification of genetic interactions underlying ordinal traits. Genet Epidemiol, in press (2018)

Abstract

The manifestation of complex traits is influenced by gene–gene and gene–environment interactions, and the identification of multifactor interactions is an important but challenging undertaking for genetic studies. Many complex phenotypes such as disease severity are measured on an ordinal scale with more than two categories. A proportional odds model can improve statistical power for these outcomes, when compared to a logit model either collapsing the categories into two mutually exclusive groups or limiting the analysis to pairs of categories. In this study, we propose a proportional odds model‐based generalized multifactor dimensionality reduction (GMDR) method for detection of interactions underlying polytomous ordinal phenotypes. Computer simulations demonstrated that this new GMDR method has a higher power and more accurate predictive ability than the GMDR methods based on a logit model and a multinomial logit model. We applied this new method to the genetic analysis of low‐density lipoprotein (LDL) cholesterol, a causal risk factor for coronary artery disease, in the Multi‐Ethnic Study of Atherosclerosis, and identified a significant joint action of the CELSR2, SERPINA12, HPGD, and APOB genes. This finding provides new information to advance the limited knowledge about genetic regulation and gene interactions in metabolic pathways of LDL cholesterol. In conclusion, the proportional odds model‐based GMDR is a useful tool that can boost statistical power and prediction accuracy in studying multifactor interactions underlying ordinal traits.
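For context, the proportional odds model mentioned in the abstract is usually written in the following standard cumulative-logit form. The notation is assumed here rather than taken from the paper: Y is an ordinal outcome with categories 1 through J, x is the predictor vector, and sign conventions for the slope term vary between texts and software.

```latex
\[
\operatorname{logit} P(Y \le j \mid \mathbf{x})
  = \log \frac{P(Y \le j \mid \mathbf{x})}{P(Y > j \mid \mathbf{x})}
  = \alpha_j - \boldsymbol{\beta}^{\top}\mathbf{x},
  \qquad j = 1, \dots, J - 1 .
\]
```

Each cut point has its own intercept, but the slope vector is shared across all cut points. This lets the model use the full ordering of the categories instead of collapsing them into two groups.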

Statistical Inference Relief (STIR) feature selection

2018-10-24T13:45:00.000-04:00

Happy to be a collaborator on this paper to add inference to the ReliefF method for feature selection. We have done a lot of work on this algorithm that is capable of detecting epistasis.

Le TT, Urbanowicz RJ, Moore JH, McKinney BA. Statistical Inference Relief (STIR) feature selection. Bioinformatics. 2018 Sep 18, in press.

Abstract

MOTIVATION:
Relief is a family of machine learning algorithms that uses nearest-neighbors to select features whose association with an outcome may be due to epistasis or statistical interactions with other features in high-dimensional data. Relief-based estimators are non-parametric in the statistical sense that they do not have a parameterized model with an underlying probability distribution for the estimator, making it difficult to determine the statistical significance of Relief-based attribute estimates. Thus, a statistical inferential formalism is needed to avoid imposing arbitrary thresholds to select the most important features.

METHODS:
We reconceptualize the Relief-based feature selection algorithm to create a new family of STatistical Inference Relief (STIR) estimators that retains the ability to identify interactions while incorporating sample variance of the nearest neighbor distances into the attribute importance estimation. This variance permits the calculation of statistical significance of features and adjustment for multiple testing of Relief-based scores. Specifically, we develop a pseudo t-test version of Relief-based algorithms for case-control data.

RESULTS:
We demonstrate the statistical power and control of type I error of the STIR family of feature selection methods on a panel of simulated data that exhibits properties reflected in real gene expression data, including main effects and network interaction effects. We compare the performance of STIR when the adaptive radius method is used as the nearest neighbor constructor with STIR when the fixed-k nearest neighbor constructor is used. We apply STIR to real RNA-Seq data from a study of major depressive disorder and discuss STIR's straightforward extension to genome-wide association studies.

AVAILABILITY:
Code and data available at http://insilico.utulsa.edu/software/STIR.