Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Wednesday, March 31, 2021

Ten important roles for academic leaders to promote equity, diversity, and inclusion in data science

Our new editorial on equity, diversity, and inclusion in data science is out in BioData Mining.

Tuesday, December 01, 2020

Ten simple rules for writing a paper about scientific software

Romano JD, Moore JH. Ten simple rules for writing a paper about scientific software. PLoS Comput Biol. 2020 Nov 12;16(11):e1008390. doi: 10.1371/journal.pcbi.1008390. PMID: 33180774; PMCID: PMC7660560. [PubMed] [PLoS Comp Bio]

Abstract

Papers describing software are an important part of computational fields of scientific research. These "software papers" are unique in a number of ways, and they require special consideration to improve their impact on the scientific community and their efficacy at conveying important information. Here, we discuss 10 specific rules for writing software papers, covering some of the different scenarios and publication types that might be encountered, and important questions from which all computational researchers would benefit by asking along the way.

Wednesday, November 04, 2020

Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses

Manduchi E, Fu W, Romano JD, Ruberto S, Moore JH. Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses. BMC Bioinformatics. 2020 Oct 1;21(1):430. doi: 10.1186/s12859-020-03755-4. PMID: 32998684; PMCID: PMC7528347. [PubMed] [BMC Bioinformatics]

Abstract

Background: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis.

Results: We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids 'leakage' during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj.

Conclusions: In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field.
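
For readers who want a concrete picture of "regressing out the covariates in a manner that avoids leakage", here is a minimal sketch in plain NumPy. It is not the TPOT extension itself. The function names (`fit_covariate_adjuster`, `regress_out`, `adjusted_folds`) are hypothetical, and ordinary least squares is assumed as the adjustment model. The point it illustrates is that the adjustment coefficients are estimated on the training fold only and then applied unchanged to the held-out fold.

```python
import numpy as np

def fit_covariate_adjuster(covariates_train, features_train):
    """Least-squares fit of each feature on the covariates (plus an intercept)."""
    design = np.column_stack([np.ones(len(covariates_train)), covariates_train])
    coef, *_ = np.linalg.lstsq(design, features_train, rcond=None)
    return coef

def regress_out(covariates, features, coef):
    """Return residuals of the features after removing the fitted covariate effect."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    return features - design @ coef

def adjusted_folds(covariates, features, n_folds=5, seed=0):
    """Yield (train_idx, adjusted_train, test_idx, adjusted_test) for each fold.

    Coefficients come from the training rows only, so nothing about the
    held-out rows leaks into the adjustment.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        coef = fit_covariate_adjuster(covariates[train_idx], features[train_idx])
        yield (
            train_idx,
            regress_out(covariates[train_idx], features[train_idx], coef),
            test_idx,
            regress_out(covariates[test_idx], features[test_idx], coef),
        )
```

Fitting the adjustment on the full data set before splitting would let held-out samples influence the coefficients, which is the kind of leakage the abstract refers to.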

Friday, October 30, 2020

Ten important roles for academic leaders in data science

Moore JH. Ten important roles for academic leaders in data science. BioData Min. 2020 Oct 26;13:18. doi: 10.1186/s13040-020-00228-5. PMID: 33117434; PMCID: PMC7586691. [PubMed] [BioData Mining]

Abstract

Data science has emerged as an important discipline in the era of big data and biological and biomedical data mining. As such, we have seen a rapid increase in the number of data science departments, research centers, and schools. We review here ten important leadership roles for a successful academic data science chair, director, or dean. These roles include the visionary, executive, cheerleader, manager, enforcer, subordinate, educator, entrepreneur, mentor, and communicator. Examples specific to leadership in data science are given for each role.

Wednesday, September 09, 2020

Evaluating recommender systems for AI-driven biomedical informatics

La Cava W, Williams H, Fu W, Vitale S, Srivatsan D, Moore JH. Evaluating recommender systems for AI-driven biomedical informatics. Bioinformatics. 2020 Aug 7:btaa698. doi: 10.1093/bioinformatics/btaa698. Epub ahead of print. PMID: 32766825. [PubMed] [Bioinformatics]

Abstract

Motivation: Many researchers with domain expertise are unable to easily apply machine learning to their bioinformatics data due to a lack of machine learning and/or coding expertise. Methods that have been proposed thus far to automate machine learning mostly require programming experience as well as expert knowledge to tune and apply the algorithms correctly. Here, we study a method of automating biomedical data science using a web-based platform that uses AI to recommend model choices and conduct experiments. We have two goals in mind: first, to make it easy to construct sophisticated models of biomedical processes; and second, to provide a fully automated AI agent that can choose and conduct promising experiments for the user, based on the user's experiments as well as prior knowledge. To validate this framework, we experiment with hundreds of classification problems, comparing to state-of-the-art, automated approaches. Finally, we use this tool to develop predictive models of septic shock in critical care patients.

Results: We find that matrix factorization-based recommendation systems outperform meta-learning methods for automating machine learning. This result mirrors the results of earlier recommender systems research in other domains. The proposed AI is competitive with state-of-the-art automated machine learning methods in terms of choosing optimal algorithm configurations for datasets. In our application to prediction of septic shock, the AI-driven analysis produces a competent machine learning model (AUROC 0.85 +/- 0.02) that performs on par with state-of-the-art deep learning results for this task, with much less computational effort.

Availability: PennAI is available free of charge and open-source. It is distributed under the GNU public license (GPL) version 3.

Supplementary information: Software and experiments are available from epistasislab.github.io/pennai.
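
To make the idea of a matrix factorization-based recommender for machine learning a bit more tangible, here is a toy sketch. It is not PennAI's code. It assumes a small table of scores with one row per dataset and one column per algorithm configuration, with `np.nan` marking pairs that have not been run, and it uses regularized alternating least squares as one common way to factorize such a table. The names `factorize` and `recommend` are hypothetical.

```python
import numpy as np

def factorize(scores, rank=2, reg=0.1, n_iters=50, seed=0):
    """Factorize a (datasets x algorithms) score table with missing entries.

    Uses regularized alternating least squares on the observed entries only.
    """
    observed = ~np.isnan(scores)
    rng = np.random.default_rng(seed)
    n_rows, n_cols = scores.shape
    row_f = rng.normal(scale=0.1, size=(n_rows, rank))
    col_f = rng.normal(scale=0.1, size=(n_cols, rank))
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        # Update each dataset's factors with the algorithm factors held fixed.
        for i in range(n_rows):
            a = col_f[observed[i]]
            row_f[i] = np.linalg.solve(a.T @ a + ridge, a.T @ scores[i, observed[i]])
        # Update each algorithm's factors with the dataset factors held fixed.
        for j in range(n_cols):
            a = row_f[observed[:, j]]
            col_f[j] = np.linalg.solve(a.T @ a + ridge, a.T @ scores[observed[:, j], j])
    return row_f, col_f

def recommend(scores, row_f, col_f, dataset):
    """Return the index of the untried algorithm with the highest predicted score."""
    predicted = row_f[dataset] @ col_f.T
    predicted[~np.isnan(scores[dataset])] = -np.inf  # skip pairs already run
    return int(np.argmax(predicted))
```

A call such as `recommend(scores, *factorize(scores), dataset=0)` would then suggest the next configuration to try on the first dataset, based only on how other datasets and algorithms have scored so far.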

Wednesday, August 26, 2020

Electronic health records and polygenic risk scores for predicting disease risk

Li R, Chen Y, Ritchie MD, Moore JH. Electronic health records and polygenic risk scores for predicting disease risk. Nat Rev Genet. 2020 Aug;21(8):493-502. doi: 10.1038/s41576-020-0224-1. Epub 2020 Mar 31. PMID: 32235907. [PubMed] [Nature Reviews]

Abstract

Accurate prediction of disease risk based on the genetic make-up of an individual is essential for effective prevention and personalized treatment. Nevertheless, to date, individual genetic variants from genome-wide association studies have achieved only moderate prediction of disease risk. The aggregation of genetic variants under a polygenic model shows promising improvements in prediction accuracies. Increasingly, electronic health records (EHRs) are being linked to patient genetic data in biobanks, which provides new opportunities for developing and applying polygenic risk scores in the clinic, to systematically examine and evaluate patient susceptibilities to disease. However, the heterogeneous nature of EHR data brings forth many practical challenges along every step of designing and implementing risk prediction strategies. In this Review, we present the unique considerations for using genotype and phenotype data from biobank-linked EHRs for polygenic risk prediction.
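
As a small illustration of what "aggregation of genetic variants under a polygenic model" usually means in practice, a polygenic risk score is commonly computed as a weighted sum of allele dosages, with per-variant effect sizes as the weights. The sketch below assumes that additive form. The function name `polygenic_risk_score` and all numbers are made up for illustration and do not come from the Review.

```python
import numpy as np

def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of risk-allele dosages (0, 1, or 2 per variant) for each person."""
    dosages = np.asarray(dosages, dtype=float)
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    if dosages.shape[1] != effect_sizes.shape[0]:
        raise ValueError("need one effect size per variant")
    return dosages @ effect_sizes

# Two hypothetical people genotyped at three hypothetical variants.
dosages = [[0, 1, 2],
           [2, 0, 1]]
effect_sizes = [0.10, -0.05, 0.20]  # illustrative weights, e.g. log odds ratios
scores = polygenic_risk_score(dosages, effect_sizes)  # approximately [0.35, 0.40]
```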

Monday, August 10, 2020

A brief introduction to my artificial intelligence and machine learning research program

A 12-minute overview of my artificial intelligence and machine learning research program [YouTube]

Thursday, July 30, 2020

treeheatr: an R package for interpretable decision tree visualizations

Le TT, Moore JH. treeheatr: an R package for interpretable decision tree visualizations. Bioinformatics. 2020 Jul 23:btaa662. doi: 10.1093/bioinformatics/btaa662. Epub ahead of print. PMID: 32702108. [PubMed] [Bioinformatics]

Abstract

Summary: treeheatr is an R package for creating interpretable decision tree visualizations with the data represented as a heatmap at the tree's leaf nodes. The integrated presentation of the tree structure along with an overview of the data efficiently illustrates how the tree nodes split up the feature space and how well the tree model performs. This visualization can also be examined in depth to uncover the correlation structure in the data and importance of each feature in predicting the outcome. Implemented in an easily installed package with a detailed vignette, treeheatr can be a useful teaching tool to enhance students' understanding of a simple decision tree model before diving into more complex tree-based machine learning methods.

Availability: The treeheatr package is freely available under the permissive MIT license at https://trang1618.github.io/treeheatr and https://cran.r-project.org/package=treeheatr. It comes with a detailed vignette that is automatically built with GitHub Actions continuous integration.

Supplementary information: Supplementary data are available at Bioinformatics online.

Wednesday, May 27, 2020

Ideas for how informaticians can get involved with COVID-19 research

Moore JH, Barnett I, Boland MR, Chen Y, Demiris G, Gonzalez-Hernandez G, Herman DS, Himes BE, Hubbard RA, Kim D, Morris JS, Mowery DL, Ritchie MD, Shen L, Urbanowicz R, Holmes JH. Ideas for how informaticians can get involved with COVID-19 research. BioData Min. 2020 May 12;13:3. doi: 10.1186/s13040-020-00213-y. PMID: 32419848; PMCID: PMC7216865. [PubMed] [BioData Mining]

Abstract

The coronavirus disease 2019 (COVID-19) pandemic has had a significant impact on population health and wellbeing. Biomedical informatics is central to COVID-19 research efforts and for the delivery of healthcare for COVID-19 patients. Critical to this effort is the participation of informaticians who typically work on other basic science or clinical problems. The goal of this editorial is to highlight some examples of COVID-19 research areas that could benefit from informatics expertise. Each research idea summarizes the COVID-19 application area, followed by an informatics methodology, approach, or technology that could make a contribution. It is our hope that this piece will motivate and make it easy for some informaticians to adopt COVID-19 research projects.

Wednesday, February 05, 2020

Using simulations to understand the relationship between epistasis and observed GWAS findings

Moore JH, Olson RS, Schmitt P, Chen Y, Manduchi E. How Computational Experiments Can Improve Our Understanding of the Genetic Architecture of Common Human Diseases. Artif Life. 2020 Winter;26(1):23-37. doi: 10.1162/artl_a_00308. Epub 2020 Feb 6. PMID: 32027528. [PubMed] [Artificial Life]

Abstract

Susceptibility to common human diseases such as cancer is influenced by many genetic and environmental factors that work together in a complex manner. The state of the art is to perform a genome-wide association study (GWAS) that measures millions of single-nucleotide polymorphisms (SNPs) throughout the genome followed by a one-SNP-at-a-time statistical analysis to detect univariate associations. This approach has identified thousands of genetic risk factors for hundreds of diseases. However, the genetic risk factors detected have very small effect sizes and collectively explain very little of the overall heritability of the disease. Nonetheless, it is assumed that the genetic component of risk is due to many independent risk factors that contribute additively. The fact that many genetic risk factors with small effects can be detected is taken as evidence to support this notion. It is our working hypothesis that the genetic architecture of common diseases is partly driven by non-additive interactions. To test this hypothesis, we developed a heuristic simulation-based method for conducting experiments about the complexity of genetic architecture. We show that a genetic architecture driven by complex interactions is highly consistent with the magnitude and distribution of univariate effects seen in real data. We compare our results with measures of univariate and interaction effects from two large-scale GWASs of sporadic breast cancer and find evidence to support our hypothesis that is consistent with the results of our computational experiment.
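
To see how an interaction can hide from a one-SNP-at-a-time analysis, consider the following toy calculation. It is not the heuristic simulation method from the paper. It assumes two independent SNPs in Hardy-Weinberg equilibrium with allele frequency 0.5 and a hypothetical XOR-like penetrance table, and it shows the extreme case in which disease risk depends entirely on the genotype combination while each SNP on its own shows no effect at all.

```python
import numpy as np

# Genotype frequencies (AA, Aa, aa) under Hardy-Weinberg equilibrium
# with an allele frequency of 0.5.
genotype_freq = np.array([0.25, 0.50, 0.25])

# Hypothetical penetrance table: probability of disease for each genotype pair.
# Rows are SNP 1 genotypes, columns are SNP 2 genotypes.
penetrance = np.array([
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
])

def marginal_penetrance(table, freq_snp1, freq_snp2):
    """Average the joint penetrance over the other SNP, assuming independence."""
    snp1 = table @ freq_snp2   # risk by SNP 1 genotype
    snp2 = freq_snp1 @ table   # risk by SNP 2 genotype
    return snp1, snp2

snp1_marginal, snp2_marginal = marginal_penetrance(penetrance, genotype_freq, genotype_freq)
prevalence = genotype_freq @ penetrance @ genotype_freq
```

Under these assumptions every marginal risk equals the overall prevalence, so a univariate test would detect nothing even though risk ranges from 0 to 0.1 across genotype combinations. Real architectures are presumably less extreme, which leaves room for the small but non-zero univariate effects discussed in the abstract.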