tag:blogger.com,1999:blog-104430502024-03-23T15:12:36.155-04:00Epistasis BlogFrom the Artificial Intelligence Innovation Lab at Cedars-Sinai Medical Center (www.epistasis.org)Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.comBlogger595125tag:blogger.com,1999:blog-10443050.post-58387470341497931262021-03-31T13:03:00.001-04:002021-03-31T13:03:26.361-04:00Ten important roles for academic leaders to promote equity, diversity, and inclusion in data science<p> Our new editorial on equity, diversity, and inclusion in data science is out in <a href="https://biodatamining.biomedcentral.com/articles/10.1186/s13040-021-00256-9" target="_blank">BioData Mining</a>.</p>Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-68778931109134463242020-12-01T16:17:00.001-05:002021-02-06T16:19:16.111-05:00Ten simple rules for writing a paper about scientific software<div style="text-align: left;"><span style="font-family: inherit;">Romano JD, Moore JH. Ten simple rules for writing a paper about scientific software. PLoS Comput Biol. 2020 Nov 12;16(11):e1008390. doi: 10.1371/journal.pcbi.1008390. PMID: 33180774; PMCID: PMC7660560. [<a href="https://pubmed.ncbi.nlm.nih.gov/33180774/" target="_blank">PubMed</a>] [<a href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008390" target="_blank">PLoS Comp Bio</a>]</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br />Abstract</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br />Papers describing software are an important part of computational fields of scientific research. These "software papers" are unique in a number of ways, and they require special consideration to improve their impact on the scientific community and their efficacy at conveying important information. 
Here, we discuss 10 specific rules for writing software papers, covering some of the different scenarios and publication types that might be encountered, and important questions from which all computational researchers would benefit by asking along the way.</span></div>Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-39588157299864785862020-11-04T16:14:00.001-05:002021-02-06T16:16:54.068-05:00Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses<div style="text-align: left;"><span style="font-family: inherit;">Manduchi E, Fu W, Romano JD, Ruberto S, Moore JH. Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses. BMC Bioinformatics. 2020 Oct 1;21(1):430. doi: 10.1186/s12859-020-03755-4. PMID: 32998684; PMCID: PMC7528347. [<a href="https://pubmed.ncbi.nlm.nih.gov/32998684/" target="_blank">PubMed</a>] [<a href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03755-4" target="_blank">BMC Bioinformatics</a>]</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br />Abstract</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br />Background: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. 
Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis.</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br />Results: We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids 'leakage' during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj</span><span style="font-family: inherit;">.</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br />Conclusions: In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field.</span></div>Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-87575873137145823452020-10-30T16:11:00.001-04:002021-02-06T16:14:16.679-05:00Ten important roles for academic leaders in data science<p>Moore JH. Ten important roles for academic leaders in data science. BioData Min. 2020 Oct 26;13:18. doi: 10.1186/s13040-020-00228-5. PMID: 33117434; PMCID: PMC7586691. 
[<a href="https://pubmed.ncbi.nlm.nih.gov/33117434/" target="_blank">PubMed</a>] [<a href="https://biodatamining.biomedcentral.com/articles/10.1186/s13040-020-00228-5" target="_blank">BioData Mining</a>]</p><p>Abstract</p><p>Data science has emerged as an important discipline in the era of big data and biological and biomedical data mining. As such, we have seen a rapid increase in the number of data science departments, research centers, and schools. We review here ten important leadership roles for a successful academic data science chair, director, or dean. These roles include the visionary, executive, cheerleader, manager, enforcer, subordinate, educator, entrepreneur, mentor, and communicator. Examples specific to leadership in data science are given for each role.</p>Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-76728026460418521712020-09-09T16:09:00.002-04:002021-02-06T16:11:32.965-05:00Evaluating recommender systems for AI-driven biomedical informatics<div style="text-align: left;"><span style="font-family: inherit;">La Cava W, Williams H, Fu W, Vitale S, Srivatsan D, Moore JH. Evaluating recommender systems for AI-driven biomedical informatics. Bioinformatics. 2020 Aug 7:btaa698. doi: 10.1093/bioinformatics/btaa698. Epub ahead of print. PMID: 32766825. [<a href="https://pubmed.ncbi.nlm.nih.gov/32766825/" target="_blank">PubMed</a>] [<a href="https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa698/5885079" target="_blank">Bioinformatics</a>]</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br />Abstract</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br />Motivation: Many researchers with domain expertise are unable to easily apply machine learning to their bioinformatics data due to a lack of machine learning and/or coding expertise. 
Methods that have been proposed thus far to automate machine learning mostly require programming experience as well as expert knowledge to tune and apply the algorithms correctly. Here, we study a method of automating biomedical data science using a web-based platform that uses AI to recommend model choices and conduct experiments. We have two goals in mind: first, to make it easy to construct sophisticated models of biomedical processes; and second, to provide a fully automated AI agent that can choose and conduct promising experiments for the user, based on the user's experiments as well as prior knowledge. To validate this framework, we experiment with hundreds of classification problems, comparing to state-of-the-art, automated approaches. Finally, we use this tool to develop predictive models of septic shock in critical care patients.</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br />Results: We find that matrix factorization-based recommendation systems outperform meta-learning methods for automating machine learning. This result mirrors the results of earlier recommender systems research in other domains. The proposed AI is competitive with state-of-the-art automated machine learning methods in terms of choosing optimal algorithm configurations for datasets. In our application to prediction of septic shock, the AI-driven analysis produces a competent machine learning model (AUROC 0.85 +/- 0.02) that performs on par with state-of-the-art deep learning results for this task, with much less computational effort.</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br />Availability: PennAI is available free of charge and open-source. It is distributed under the GNU public license (GPL) version 3.</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br />Supplementary information: Software and experiments are available from epistasislab.github.io/pennai.</span></div>Jason H. 
Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-51634149784449363062020-08-26T15:58:00.002-04:002021-02-06T16:07:37.581-05:00Electronic health records and polygenic risk scores for predicting disease risk<p><span style="background-color: white; color: #212121;"><span style="font-family: inherit;">Li R, Chen Y, Ritchie MD, Moore JH. Electronic health records and polygenic risk scores for predicting disease risk. Nat Rev Genet. 2020 Aug;21(8):493-502. doi: 10.1038/s41576-020-0224-1. Epub 2020 Mar 31. PMID: 32235907. [<a href="https://pubmed.ncbi.nlm.nih.gov/32235907/" target="_blank">PubMed</a>] [<a href="https://www.nature.com/articles/s41576-020-0224-1" target="_blank">Nature Reviews</a>]</span></span></p><p><span style="background-color: white; color: #212121;"><span style="font-family: inherit;">Abstract</span></span></p><p><span style="font-family: inherit;"><span style="background-color: white; color: #212121;"><span>Accurate prediction of disease risk based on the genetic make-up of an individual is essential for effective prevention and personalized treatment. Nevertheless, to date, individual genetic variants from genome-wide association studies have achieved only moderate prediction of disease risk. The aggregation of genetic variants under a polygenic model shows promising improvements in prediction accuracies. Increasingly, electronic health records (EHRs) are being linked to patient genetic data in biobanks, which provides new opportunities for developing and applying polygenic risk scores in the clinic, to systematically examine and evaluate patient susceptibilities to disease. However, the heterogeneous nature of EHR data brings forth many practical challenges along every step of designing and implementing risk prediction strategies. 
In this Review, we present the unique considerations for using genotype and phenotype data from biobank-linked EHRs for polygenic risk prediction</span></span><span face="BlinkMacSystemFont, -apple-system, "Segoe UI", Roboto, Oxygen, Ubuntu, Cantarell, "Fira Sans", "Droid Sans", "Helvetica Neue", sans-serif" style="background-color: white; color: #212121;">.</span></span></p>Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-67263747083404558102020-08-10T16:20:00.001-04:002021-02-06T16:21:51.970-05:00A brief introduction to my artificial intelligence and machine learning research program - YouTube<p> A 12-minute overview of my artificial intelligence and machine learning research program [<a href="https://www.youtube.com/watch?v=_JujMHBy7t4" target="_blank">YouTube</a>]</p>Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-72138234143622819112020-07-30T16:04:00.002-04:002021-02-06T16:07:53.038-05:00treeheatr: an R package for interpretable decision tree visualizations<div style="box-sizing: inherit; line-height: 1.5; margin: 1.2rem 0px; text-align: left;"><span><span style="background-color: white; color: #212121; font-family: inherit;">Le TT, Moore JH. treeheatr: an R package for interpretable decision tree visualizations. Bioinformatics. 2020 Jul 23:btaa662. doi: 10.1093/bioinformatics/btaa662. Epub ahead of print. PMID: 32702108. 
[<a href="https://pubmed.ncbi.nlm.nih.gov/32702108/" target="_blank">PubMed</a>] [<a href="https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa662/5875600" target="_blank">Bioinformatics</a>]</span></span></div><div style="box-sizing: inherit; line-height: 1.5; margin: 1.2rem 0px; text-align: left;"><span><span style="background-color: white; color: #212121; font-family: inherit;">Abstract</span></span></div><div style="box-sizing: inherit; line-height: 1.5; margin: 1.2rem 0px; text-align: left;"><span style="font-family: inherit;"><strong class="sub-title" style="box-sizing: inherit;">Summary: </strong>treeheatr is an R package for creating interpretable decision tree visualizations with the data represented as a heatmap at the tree's leaf nodes. The integrated presentation of the tree structure along with an overview of the data efficiently illustrates how the tree nodes split up the feature space and how well the tree model performs. This visualization can also be examined in depth to uncover the correlation structure in the data and importance of each feature in predicting the outcome. Implemented in an easily installed package with a detailed vignette, treeheatr can be a useful teaching tool to enhance students' understanding of a simple decision tree model before diving into more complex tree-based machine learning methods.<br /><strong class="sub-title" style="box-sizing: inherit;">Availability: </strong>The treeheatr package is freely available under the permissive MIT license at https://trang1618.github.io/treeheatr and https://cran.r-project.org/package=treeheatr. It comes with a detailed vignette that is automatically built with GitHub Actions continuous integration.<br /><strong class="sub-title" style="box-sizing: inherit;">Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</span></div>Jason H. 
Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-77876964609686955162020-05-27T16:02:00.001-04:002021-02-06T16:04:22.743-05:00Ideas for how informaticians can get involved with COVID-19 research<div style="text-align: left;"><span style="background-color: white; color: #212121;"><span style="font-family: inherit;">Moore JH, Barnett I, Boland MR, Chen Y, Demiris G, Gonzalez-Hernandez G, Herman DS, Himes BE, Hubbard RA, Kim D, Morris JS, Mowery DL, Ritchie MD, Shen L, Urbanowicz R, Holmes JH. Ideas for how informaticians can get involved with COVID-19 research. BioData Min. 2020 May 12;13:3. doi: 10.1186/s13040-020-00213-y. PMID: 32419848; PMCID: PMC7216865. [<a href="https://pubmed.ncbi.nlm.nih.gov/32419848/" target="_blank">PubMed</a>] [<a href="https://biodatamining.biomedcentral.com/articles/10.1186/s13040-020-00213-y" target="_blank">BioData Mining</a>]</span></span></div><div style="text-align: left;"><span style="background-color: white; color: #212121;"><span style="font-family: inherit;"><br /></span></span><span style="background-color: white; color: #212121;"><span style="font-family: inherit;">Abstract</span></span></div><div style="text-align: left;"><span style="background-color: white; color: #212121;"><span style="font-family: inherit;"><br /></span></span><span style="background-color: white; color: #333333;"><span style="font-family: inherit;">The coronavirus disease 2019 (COVID-19) pandemic has had a significant impact on population health and wellbeing. Biomedical informatics is central to COVID-19 research efforts and for the delivery of healthcare for COVID-19 patients. Critical to this effort is the participation of informaticians who typically work on other basic science or clinical problems. The goal of this editorial is to highlight some examples of COVID-19 research areas that could benefit from informatics expertise. 
Each research idea summarizes the COVID-19 application area, followed by an informatics methodology, approach, or technology that could make a contribution. It is our hope that this piece will motivate and make it easy for some informaticians to adopt COVID-19 research projects.</span></span></div>Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-38986846355658579762020-02-05T15:52:00.001-05:002021-02-06T15:57:23.683-05:00Using simulations to understand the relationship between epistasis and observed GWAS findings<p><span style="background-color: white; color: #212121;"><span style="font-family: inherit;">Moore JH, Olson RS, Schmitt P, Chen Y, Manduchi E. How Computational Experiments Can Improve Our Understanding of the Genetic Architecture of Common Human Diseases. Artif Life. 2020 Winter;26(1):23-37. doi: 10.1162/artl_a_00308. Epub 2020 Feb 6. PMID: 32027528. [<a href="https://pubmed.ncbi.nlm.nih.gov/32027528/" target="_blank">PubMed</a>] [<a href="https://www.mitpressjournals.org/doi/10.1162/artl_a_00308" target="_blank">Artificial Life</a>]</span></span></p><p><span style="background-color: white; color: #212121; font-family: inherit;">Abstract</span></p><div class="abstract-content selected" id="enc-abstract" style="background-color: white; box-sizing: inherit; clear: left; color: #212121;"><p style="box-sizing: inherit; line-height: 1.5; margin: 1.2rem 0px;"><span style="font-family: inherit;">Susceptibility to common human diseases such as cancer is influenced by many genetic and environmental factors that work together in a complex manner. The state of the art is to perform a genome-wide association study (GWAS) that measures millions of single-nucleotide polymorphisms (SNPs) throughout the genome followed by a one-SNP-at-a-time statistical analysis to detect univariate associations. This approach has identified thousands of genetic risk factors for hundreds of diseases. 
However, the genetic risk factors detected have very small effect sizes and collectively explain very little of the overall heritability of the disease. Nonetheless, it is assumed that the genetic component of risk is due to many independent risk factors that contribute additively. The fact that many genetic risk factors with small effects can be detected is taken as evidence to support this notion. It is our working hypothesis that the genetic architecture of common diseases is partly driven by non-additive interactions. To test this hypothesis, we developed a heuristic simulation-based method for conducting experiments about the complexity of genetic architecture. We show that a genetic architecture driven by complex interactions is highly consistent with the magnitude and distribution of univariate effects seen in real data. We compare our results with measures of univariate and interaction effects from two large-scale GWASs of sporadic breast cancer and find evidence to support our hypothesis that is consistent with the results of our computational experiment.</span></p></div>Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-24757306039538300992019-12-30T13:05:00.002-05:002019-12-30T13:05:36.135-05:00Getting started with TPOT for automated machine learningA <a href="https://trang.page/2019/11/05/tpot-where-do-i-start/" target="_blank">great post</a> from Dr. Trang Le on how to get started with automated machine learning (AutoML) with our Tree-Based Pipeline Optimization Tool (TPOT) in Python.Jason H. 
Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-36919326982040567642019-12-26T18:23:00.001-05:002019-12-26T18:24:17.324-05:00The Human Pancreas Analysis Program (HPAP)<div style="background-color: white; box-sizing: border-box; color: #3d4545; font-family: "Source Sans Pro", sans-serif; font-size: 14px; margin-bottom: 10px;">
We are assisting with the bioinformatics support for <a href="https://hpap.pmacs.upenn.edu/" target="_blank">The Human Pancreas Analysis Program</a> (HPAP), which consists of two interlocking, collaborative projects at three institutions that seek to provide comprehensive molecular profiling in unprecedented detail of the pancreatic islet at various stages of type 1 diabetes (T1D) pathogenesis: pre-diabetic (positive islet autoantibodies), recent onset, and T1D of durations less than 10 years.</div>
<div style="background-color: white; box-sizing: border-box; color: #3d4545; font-family: "Source Sans Pro", sans-serif; font-size: 14px; margin-bottom: 10px;">
In the past decade, there have been dramatic advances in our ability to phenotype and molecularly profile human cells and tissues. HIRN-HPPAP will develop and apply these new technologies to study cells and tissues relevant to the beta cell loss in T1D with unprecedented resolution, including at the genomic, epigenomic, protein, and functional levels. Here we will employ state-of-the-art technologies to determine all aspects of pancreatic islet cell and immune cell biology as it pertains to the pathogenesis of type 1 diabetes. We will profile both the endocrine and immune systems with multiple modalities, and make the vast data accumulated available through the highly accessible PANC-DB, which will be developed through the project. These extensive and high-quality datasets will be made available to the <a href="https://hirnetwork.org/" target="_blank">HIRN</a> and the diabetes research community at large for further discovery.</div>
Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-78948854473972000792019-11-20T17:53:00.000-05:002019-12-26T17:53:49.390-05:00Automated machine learning analysis of metabolomics data<span style="font-family: inherit;">We have expanded our <a href="http://epistasislab.github.io/tpot/" target="_blank">TPOT</a> automated machine learning method (<a href="http://automl.info/" target="_blank">AutoML</a>) to metabolomics data.</span><br />
<span style="font-family: inherit;"><br />Orlenko A, Kofink D, Lyytikäinen LP, Nikus K, Mishra P, Kuukasjärvi P, Karhunen PJ, Kähönen M, Laurikka JO, Lehtimäki T, Asselbergs FW, Moore JH. Model selection for metabolomics: predicting diagnosis of coronary artery disease using automated machine learning (AutoML). <i>Bioinformatics</i>. 2019 Nov 8. [<a href="https://www.ncbi.nlm.nih.gov/pubmed/31702773" target="_blank">PubMed</a>]</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Abstract</span><br />
<span style="font-family: inherit;"><br />MOTIVATION:<br />Selecting the optimal machine learning (ML) model for a given dataset is often challenging. Automated ML (AutoML) has emerged as a powerful tool for enabling the automatic selection of ML methods and parameter settings for the prediction of biomedical endpoints. Here, we apply the tree-based pipeline optimization tool (TPOT) to predict angiographic diagnoses of coronary artery disease (CAD). With TPOT, ML models are represented as expression trees and optimal pipelines are discovered using a stochastic search method called genetic programming. We provide some guidelines for TPOT-based ML pipeline selection and optimization based on various clinical phenotypes and high-throughput metabolic profiles in the Angiography and Genes Study (ANGES).</span><br />
<span style="font-family: inherit;"><br />RESULTS:<br />We analyzed nuclear magnetic resonance (NMR)-derived lipoprotein and metabolite profiles in the ANGES cohort with a goal to identify the role of non-obstructive CAD patients in CAD diagnostics. We performed a comparative analysis of TPOT-generated ML pipelines with selected ML classifiers, optimized with a grid search approach, applied to two phenotypic CAD profiles. As a result, TPOT generated ML pipelines that outperformed grid search optimized models across multiple performance metrics including balanced accuracy and area under the precision-recall curve. With the selected models, we demonstrated that the phenotypic profile that distinguishes non-obstructive CAD patients from no CAD patients is associated with higher precision, suggesting a discrepancy in the underlying processes between these phenotypes.</span><br />
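For readers unfamiliar with one of the metrics named above: balanced accuracy is the mean of the per-class recalls, which makes it less flattering than plain accuracy when one class dominates. The sketch below is only an illustration in plain Python; the labels are invented for the example and are not ANGES data, and the function name is our own.

```python
def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall for the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        # Number of samples whose true class is c
        n_c = sum(1 for t in y_true if t == c)
        # Number of those samples that were predicted as c
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / n_c)
    return sum(recalls) / len(recalls)


# Toy, imbalanced labels (invented for illustration only)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, y_pred))
```

On these toy labels, plain accuracy rewards the majority class while balanced accuracy penalizes the missed minority case, which is why the two numbers differ.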
<span style="font-family: inherit;"><br />AVAILABILITY:<br />TPOT is freely available via <a href="http://epistasislab.github.io/tpot/">http://epistasislab.github.io/tpot/</a></span>Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-42862615670767084752019-10-09T17:47:00.000-04:002019-12-26T17:48:31.422-05:00Embracing study heterogeneity for finding genetic interactions in large-scale research consortia<span style="font-family: inherit;">New collaborative paper in <i>Genetic </i></span><i>Epidemiology</i><span style="font-family: inherit;"> with Dr. Yong Chen</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Liu Y, Huang J, Urbanowicz RJ, Chen K, Manduchi E, Greene CS, Moore JH, Scheet P, Chen Y. Embracing study heterogeneity for finding genetic interactions in large-scale research consortia. Genet Epidemiol. 2019 Oct 4. [<a href="https://www.ncbi.nlm.nih.gov/pubmed/31583758" target="_blank">PubMed</a>]</span><br />
<span style="font-family: inherit;"><br />
Abstract</span><br />
<span style="font-family: inherit;"><br />
Genetic interactions have been recognized as a potentially important contributor to the heritability of complex diseases. Nevertheless, due to small effect sizes and stringent multiple-testing correction, identifying genetic interactions in complex diseases is particularly challenging. To address the above challenges, many genomic research initiatives collaborate to form large-scale consortia and develop open access to enable sharing of genome-wide association study (GWAS) data. Despite the perceived benefits of data sharing from large consortia, a number of practical issues have arisen, such as privacy concerns on individual genomic information and heterogeneous data sources from distributed GWAS databases. In the context of large consortia, we demonstrate that the heterogeneously appearing marginal effects over distributed GWAS databases can offer new insights into genetic interactions for which conventional methods have had limited success. In this paper, we develop a novel two-stage testing procedure, named phylogenY-based effect-size tests for interactions using first 2 moments (YETI2), to detect genetic interactions through both pooled marginal effects, in terms of averaging site-specific marginal effects, and heterogeneity in marginal effects across sites, using a meta-analytic framework. YETI2 can not only be applied to large consortia without shared personal information but also can be used to leverage underlying heterogeneity in marginal effects to prioritize potential genetic interactions. We investigate the performance of YETI2 through simulation studies and apply YETI2 to bladder cancer data from dbGaP.</span>Jason H. 
Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-85427741375696080772019-08-14T12:47:00.000-04:002019-10-21T12:48:56.750-04:00Alternative measures of association for GWAS<div style="background-color: white; box-sizing: inherit; color: #333333; margin-bottom: 1.5em; padding: 0px;">
<span style="font-family: inherit;">Manduchi E, Orzechowski PR, Ritchie MD, Moore JH. Exploration of a diversity of computational and statistical measures of association for genome-wide genetic studies. BioData Min. 2019 Jul 9;12:14. [<a href="https://www.ncbi.nlm.nih.gov/pubmed/31320928" target="_blank">PubMed</a>] [<a href="https://biodatamining.biomedcentral.com/articles/10.1186/s13040-019-0201-4" target="_blank">BioData Mining</a>]</span></div>
<div style="background-color: white; box-sizing: inherit; color: #333333; margin-bottom: 1.5em; padding: 0px;">
<span style="font-family: inherit;">Background</span><br />
<span style="background-color: transparent; font-family: inherit;">The principal line of investigation in Genome Wide Association Studies (GWAS) is the identification of main effects, that is, individual Single Nucleotide Polymorphisms (SNPs) which are associated with the trait of interest, independent of other factors. A variety of methods have been proposed to this end, mostly statistical in nature and differing in assumptions and type of model employed. Moreover, for a given model, there may be multiple choices for the SNP genotype encoding. As an alternative to statistical methods, machine learning methods are often applicable. Typically, for a given GWAS, a single approach is selected and utilized to identify potential SNPs of interest. Even when multiple GWAS are combined through meta-analyses within a consortium, each GWAS is typically analyzed with a single approach and the resulting summary statistics are then utilized in meta-analyses.</span></div>
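To make the point about genotype encoding concrete, here is a minimal Python sketch of a few commonly used encodings, assuming genotypes are stored as minor-allele counts (0, 1, or 2). The scheme names and the helper function are our own illustration; the paper may consider a different or larger set of encodings.

```python
# Each encoding maps a minor-allele count (0, 1, 2) to a model input value.
ENCODINGS = {
    "additive": lambda g: g,                # effect grows with each minor allele
    "dominant": lambda g: int(g >= 1),      # one copy is enough
    "recessive": lambda g: int(g == 2),     # two copies are required
    "heterozygous": lambda g: int(g == 1),  # only the heterozygote differs
}


def encode(genotypes, scheme):
    """Apply the named encoding to a list of minor-allele counts."""
    return [ENCODINGS[scheme](g) for g in genotypes]


# Toy genotypes for one SNP across five subjects (invented for illustration)
genotypes = [0, 1, 2, 1, 0]
for scheme in ENCODINGS:
    print(scheme, encode(genotypes, scheme))
```

The same SNP yields a different predictor under each scheme, so the same association test can rank it differently depending on the encoding chosen.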
Results<br />
<span style="font-family: inherit;">In this work we use as case studies a Type 2 Diabetes (T2D) and a breast cancer GWAS to explore a diversity of applicable approaches spanning different methods and encoding choices. We assess similarity of these approaches based on the derived ranked lists of SNPs and, for each GWAS, we identify a subset of representative approaches that we use as an ensemble to derive a union list of top SNPs. Among these are SNPs which are identified by multiple approaches as well as several SNPs identified by only one or a few of the less frequently used approaches. The latter include SNPs from established loci and SNPs which have other supporting lines of evidence in terms of their potential relevance to the traits.</span><br />
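The idea of combining ranked SNP lists into a union of top hits can be sketched in a few lines of Python. This is a simplified illustration, not the paper's procedure: the approach names, SNP labels, and cutoff k are all hypothetical, and the paper's step of choosing representative approaches by list similarity is omitted.

```python
from collections import Counter


def union_of_top_snps(ranked_lists, k):
    """Count, for each SNP, how many approaches place it in their top k."""
    votes = Counter()
    for ranking in ranked_lists.values():
        votes.update(ranking[:k])
    return dict(votes)


# Hypothetical ranked lists (best SNP first) from three hypothetical approaches
ranked_lists = {
    "approach_1": ["snpA", "snpB", "snpC"],
    "approach_2": ["snpB", "snpA", "snpD"],
    "approach_3": ["snpE", "snpB", "snpA"],
}

votes = union_of_top_snps(ranked_lists, k=2)
union_list = sorted(votes)
found_by_one = sorted(s for s, n in votes.items() if n == 1)
print(union_list, found_by_one)
```

SNPs with a single vote correspond to those "identified by only one or a few" approaches, which a single-method analysis would likely miss.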
<br />
Conclusions<br />
<span style="font-family: inherit;">Not every main effect analysis method is suitable for every GWAS, but for each GWAS there are typically multiple applicable methods and encoding options. We suggest a workflow for a single GWAS, extensible to multiple GWAS from consortia, where representative approaches are selected among a pool of suitable options, to yield a more comprehensive set of SNPs, potentially including SNPs that would typically be missed with the most popular analyses, but that could provide additional valuable insights for follow-up.</span>Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-64369536838194537762019-07-24T12:44:00.000-04:002019-10-21T12:48:25.560-04:00Scaling tree-based automated machine learning<span style="font-family: inherit;">Le TT, Fu W, Moore JH. Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics, in press (2019). [<a href="https://www.ncbi.nlm.nih.gov/pubmed/31165141" target="_blank">PubMed</a>] [<a href="https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz470/5511404" target="_blank">Bioinformatics</a>]</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">MOTIVATION: Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming to recommend an optimized analysis pipeline for the data scientist's prediction problem. However, like other AutoML systems, TPOT may reach computational resource limits when working on big data such as whole-genome expression data.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">RESULTS: We introduce two new features implemented in TPOT that help increase the system's scalability: Feature Set Selector and Template. Feature Set Selector (FSS) provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. FSS increases TPOT's efficiency in application on big data by slicing the entire dataset into smaller sets of features and allowing genetic programming to select the best subset in the final pipeline. Template enforces type constraints with strongly typed genetic programming and enables the incorporation of FSS at the beginning of each pipeline. Consequently, FSS and Template help reduce TPOT computation time and may provide more interpretable results. Our simulations show TPOT-FSS significantly outperforms a tuned XGBoost model and standard TPOT implementation. We apply TPOT-FSS to real RNA-Seq data from a study of major depressive disorder. Independent of the previous study that identified significant association with depression severity of two modules, TPOT-FSS corroborates that one of the modules is largely predictive of the clinical diagnosis of each individual.</span><br />
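The idea behind the Feature Set Selector can be sketched in a few lines of plain Python. This is a toy illustration, not TPOT's implementation: the feature names, the module groupings, and the scoring rule are hypothetical, and an exhaustive loop stands in for the genetic programming search that TPOT actually uses to pick a subset.<br />

```python
# Toy dataset: each row maps a (hypothetical) gene name to a binary expression value.
rows = [
    {"g1": 1, "g2": 1, "g3": 0, "g4": 1},
    {"g1": 1, "g2": 1, "g3": 1, "g4": 0},
    {"g1": 0, "g2": 0, "g3": 1, "g4": 0},
    {"g1": 0, "g2": 0, "g3": 0, "g4": 1},
]
labels = [1, 1, 0, 0]

# Predefined feature sets (assumed groupings, e.g. gene modules).
feature_sets = {
    "module_A": ["g1", "g2"],
    "module_B": ["g3", "g4"],
}


def slice_features(data, names):
    """Keep only the columns that belong to one feature set."""
    return [[row[name] for name in names] for row in data]


def toy_score(X, y):
    """Accuracy of a deliberately simple rule: predict 1 if the row mean exceeds 0.5."""
    predictions = [1 if sum(x) / len(x) > 0.5 else 0 for x in X]
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


# Score each feature set separately and keep the best one.
scores = {
    name: toy_score(slice_features(rows, cols), labels)
    for name, cols in feature_sets.items()
}
best = max(scores, key=scores.get)

print(scores)
print("selected feature set:", best)
```

Because each candidate is evaluated on a slice of the data rather than on every feature at once, the downstream model sees far fewer columns, and the selected subset itself points to where the signal lies.<br />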
<span style="font-family: inherit;"><br />AVAILABILITY: Detailed simulation and analysis code needed to reproduce the results in this study is available at <a href="https://github.com/lelaboratoire/tpot-fss">https://github.com/lelaboratoire/tpot-fss</a>. Implementation of the new TPOT operators is available at <a href="https://github.com/EpistasisLab/tpot">https://github.com/EpistasisLab/tpot</a>.</span>Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-53737047518113150512019-06-18T12:39:00.000-04:002019-10-21T12:39:47.034-04:00Workflows for regulome and transcriptome-based prioritization of genetic variants<span style="font-family: inherit;">Manduchi E, Hemerich D, van Setten J, Tragante V, Harakalova M, Pei J, Williams SM, van der Harst P, Asselbergs FW, Moore JH. A comparison of two </span><span style="font-family: inherit;">workflows for regulome and transcriptome-based prioritization of genetic variants associated with myocardial mass. Genet Epidemiol. 2019 Sep;43(6):717-726. [</span><a href="https://www.ncbi.nlm.nih.gov/pubmed/31145509" style="font-family: inherit;" target="_blank">PubMed</a><span style="font-family: inherit;">] [</span><a href="https://onlinelibrary.wiley.com/doi/full/10.1002/gepi.22215" style="font-family: inherit;" target="_blank">Genetic Epi</a><span style="font-family: inherit;">]</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="background-color: white; color: #1c1d1e;"><span style="font-family: inherit;">A typical task arising from main effect analyses in a Genome Wide Association Study (GWAS) is to identify single nucleotide polymorphisms (SNPs), in linkage disequilibrium with the observed signals, that are likely causal variants and the affected genes. The affected genes may not be those closest to associating SNPs. Functional genomics data from relevant tissues are believed to be helpful in selecting likely causal SNPs and interpreting implicated biological mechanisms, ultimately facilitating prevention and treatment in the case of a disease trait. These data are typically used post GWAS analyses to fine‐map the statistically significant signals identified agnostically by testing all SNPs and applying a multiple testing correction. The number of tested SNPs is typically in the millions, so the multiple testing burden is high. Motivated by this, in this study we investigated an alternative workflow, which consists in utilizing the available functional genomics data as a first step to reduce the number of SNPs tested for association. We analyzed GWAS on electrocardiographic QRS duration using these two workflows. The alternative workflow identified more SNPs, including some residing in loci not discovered with the typical workflow. Moreover, the latter are corroborated by other reports on QRS duration. This indicates the potential value of incorporating functional genomics information at the onset in GWAS analyses.</span></span>Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-27577290478365054392019-05-16T15:08:00.005-04:002019-05-16T15:08:47.990-04:00Accessible AI for Automated Machine LearningWe released our open-source PennAI software for automated machine learning this week. 
Here is the Penn Medicine <a href="https://www.pennmedicine.org/news/news-releases/2019/may/penn-medicine-releases-free-self-service-ai-tool-for-data-analytics" target="_blank">press release</a>. Here is the <a href="https://github.com/EpistasisLab/pennai" target="_blank">GitHub link</a> to the source code. More info can be found at the <a href="http://pennai.org/" target="_blank">PennAI website</a>. We think this will bring machine learning technology to novice users.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJxZ2oddlLk2DFhhSCRU1dOyple5y1TbbFFUKynAerIqtEGIzsz4-8sCPx7eahjYEQ-g7wTvtin_PBx9vZ1Gfv_ddDtmTl43WWayMlgWHxqG84YutXJilgzMmNFr3ltLitsgJKCQ/s1600/PennAI+1.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="404" data-original-width="400" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJxZ2oddlLk2DFhhSCRU1dOyple5y1TbbFFUKynAerIqtEGIzsz4-8sCPx7eahjYEQ-g7wTvtin_PBx9vZ1Gfv_ddDtmTl43WWayMlgWHxqG84YutXJilgzMmNFr3ltLitsgJKCQ/s400/PennAI+1.jpg" width="395" /></a></div>
<br />Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-35031963594527528102019-04-22T15:50:00.000-04:002019-05-16T15:50:15.961-04:00Automated discovery of test statisticsThis was a <a href="https://link.springer.com/article/10.1007%2Fs10710-018-9338-z" target="_blank">fun proof-of-principle paper</a> we did on using genetic programming to discover test statistics. We showed that, starting from general principles, we could rediscover the two-sample t-test. This opens the door to the discovery of new test statistics for unsolved problems.Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-29449322265088381202019-03-31T08:41:00.004-04:002019-03-31T08:41:56.576-04:00How to increase our belief in discovered statistical interactions via large-scale association studies?<span style="font-family: inherit;">Our new paper with Dr. Kristel van Steen on approaches for improving evidence for statistical interactions.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Van Steen K, Moore JH. How to increase our belief in discovered statistical interactions via large-scale association studies? Hum Genet. 2019 [<a href="https://www.ncbi.nlm.nih.gov/pubmed/30840129" target="_blank">PubMed</a>] [<a href="https://link.springer.com/article/10.1007%2Fs00439-019-01987-w" target="_blank">Human Genetics</a>]</span><br />
<br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Abstract</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="background-color: white;"><span style="font-family: inherit;">The understanding that differences in biological epistasis may impact disease risk, diagnosis, or disease management stands in wide contrast to the unavailability of widely accepted large-scale epistasis analysis protocols. Several choices in the analysis workflow will impact false-positive and false-negative rates. One of these choices relates to the exploitation of particular modelling or testing strategies. The strengths and limitations of these need to be well understood, as well as the contexts in which these hold. This will contribute to determining the potentially complementary value of epistasis detection workflows and is expected to increase replication success with biological relevance. In this contribution, we take a recently introduced regression-based epistasis detection tool as a leading example to review the key elements that need to be considered to fully appreciate the value of analytical epistasis detection performance assessments. We point out unresolved hurdles and give our perspectives towards overcoming these.</span></span>Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-60441094871155818532019-03-01T08:37:00.000-05:002019-03-31T08:37:45.116-04:00Testing the assumptions of parametric linear models: the need for biological data mining in disciplines such as human geneticsThis editorial is in response to some claims that an observed linear relationship between relative pair trait correlation and IBD genetic sharing is indicative of a simple additive genetic architecture dominated by independent genetic effects. As we show here, you could observe this pattern under a genetic architecture dominated by epistasis.<br />
<br />
Moore JH, Mackay TFC, Williams SM. Testing the assumptions of parametric linear models: the need for biological data mining in disciplines such as human genetics. BioData Min. 2019 Feb 11;12:6. [<a href="https://www.ncbi.nlm.nih.gov/pubmed/30792817" target="_blank">PubMed</a>] [<a href="https://biodatamining.biomedcentral.com/articles/10.1186/s13040-019-0194-z" target="_blank">BioData Mining</a>]<br />
<br />
Abstract<br />
<br />
All data science methods have specific assumptions that are made in order for their inferences to be valid. Some assumptions impact statistical significance testing and some influence the models themselves. For example, a fundamental assumption of linear regression is that the relationship between the independent and dependent variables is additive such that a unit increase in one leads to a unit increase in the other with some error that can be modeled using a normal distribution. The presence of a nonlinear relationship between the variables violates this assumption and can lead to inaccurate inferences. We demonstrate this here using a simple example from human genetics and then end with some thoughts about the role of biological data mining in revealing nonlinear relationships between variables.<br />
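A small numerical example makes the point concrete. The sketch below is not the example from the editorial; it uses made-up data in which the outcome depends on the predictor only through its square, and fits an ordinary least-squares line by hand.<br />

```python
def ols_fit(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Fraction of variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Illustrative data: y is fully determined by x, but the relationship is nonlinear.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

slope, intercept = ols_fit(x, y)
r2 = r_squared(x, y, slope, intercept)

print(slope, intercept, r2)
```

Here the fitted slope and the explained variance are both zero even though the outcome is a deterministic function of the predictor, so a linear model alone would suggest there is no relationship at all.<br />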
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUKzD5Zj2X2-1uoJqehwt8-hpM2ZdvAVsUmMsEiZg3L10geARjY4rlLhrD7ZFw6iOsuCmN3GIKOfT2KeTXR-9it9Z6lA70oj8jZBQjCywsvwKAdyANsfZt41P9Vem3tsn6uRJA1A/s1600/13040_2019_194_Fig1_HTML.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="496" data-original-width="709" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUKzD5Zj2X2-1uoJqehwt8-hpM2ZdvAVsUmMsEiZg3L10geARjY4rlLhrD7ZFw6iOsuCmN3GIKOfT2KeTXR-9it9Z6lA70oj8jZBQjCywsvwKAdyANsfZt41P9Vem3tsn6uRJA1A/s320/13040_2019_194_Fig1_HTML.jpg" width="320" /></a></div>
<br />Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-48942305096008507232019-02-13T08:30:00.000-05:002019-03-31T08:30:56.144-04:00Preparing next-generation scientists for biomedical big data: artificial intelligence approaches<span style="font-family: inherit;">Our paper on how to prepare next-gen scientists for big data is out. We outline here a curriculum focused on precision medicine, data science, and artificial intelligence.</span><br />
<span style="font-family: inherit;"><br />Moore JH, Boland MR, Camara PG, Chervitz H, Gonzalez G, Himes BE, Kim D, Mowery DL, Ritchie MD, Shen L, Urbanowicz RJ, Holmes JH. Preparing next-generation scientists for biomedical big data: artificial intelligence approaches. Per Med. 2019 [<a href="https://www.ncbi.nlm.nih.gov/pubmed/30760118" target="_blank">PubMed</a>] [<a href="https://www.futuremedicine.com/doi/abs/10.2217/pme-2018-0145" target="_blank">PerMed</a>]</span><br />
<span style="font-family: inherit;"><br />Abstract</span><br />
<span style="font-family: inherit;"><br /><span style="background-color: white;">Personalized medicine is being realized by our ability to measure biological and environmental information about patients. Much of these data are being stored in electronic health records yielding big data that presents challenges for its management and analysis. Here, we review several areas of knowledge that are necessary for next-generation scientists to fully realize the potential of biomedical big data. We begin with an overview of big data and its storage and management. We then review statistics and data science as foundational topics followed by a core curriculum of artificial intelligence, machine learning and natural language processing that are needed to develop predictive models for clinical decision making. We end with some specific training recommendations for preparing next-generation scientists for biomedical big data.</span></span>Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-44454014910749140872019-01-02T08:44:00.000-05:002019-03-31T08:44:55.661-04:00Analysis validation has been neglected in the Age of ReproducibilityOur paper on the use of simulation to help improve analysis validation and results reproducibility.<br />
<br />
Lotterhos KE, Moore JH, Stapleton AE. Analysis validation has been neglected in the Age of Reproducibility. PLoS Biol. 2018 Dec 10;16(12):e3000070. [<a href="https://www.ncbi.nlm.nih.gov/pubmed/30532167" target="_blank">PubMed</a>] [<a href="https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000070" target="_blank">PLoS Biology</a>]<br />
<br />Abstract<br />
<br />Increasingly complex statistical models are being used for the analysis of biological data. Recent commentary has focused on the ability to compute the same outcome for a given dataset (reproducibility). We argue that a reproducible statistical analysis is not necessarily valid because of unique patterns of nonindependence in every biological dataset. We advocate that analyses should be evaluated with known-truth simulations that capture biological reality, a process we call "analysis validation." We review the process of validation and suggest criteria that a validation project should meet. We find that different fields of science have historically failed to meet all criteria, and we suggest ways to implement meaningful validation in training and practice.Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-22244888207086377042018-11-21T13:42:00.003-05:002018-11-21T13:42:47.545-05:00Generalized multifactor dimensionality reduction approaches to identification of genetic interactions underlying ordinal traits<span style="font-family: inherit;">I love seeing new extensions and modifications to our MDR method. Here is a new one from Dr. Lou.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Hou TT, Lin F, Bai S, Cleves MA, Xu HM, Lou XY. Generalized multifactor dimensionality reduction approaches to identification of genetic interactions underlying ordinal traits. Genet Epidemiol, in press (2018)</span><br />
<span style="font-family: inherit;"><br />Abstract</span><br />
<span style="font-family: inherit;"><br />
The manifestation of complex traits is influenced by gene–gene and gene–environment interactions, and the identification of multifactor interactions is an important but challenging undertaking for genetic studies. Many complex phenotypes such as disease severity are measured on an ordinal scale with more than two categories. A proportional odds model can improve statistical power for these outcomes, when compared to a logit model either collapsing the categories into two mutually exclusive groups or limiting the analysis to pairs of categories. In this study, we propose a proportional odds model‐based generalized multifactor dimensionality reduction (GMDR) method for detection of interactions underlying polytomous ordinal phenotypes. Computer simulations demonstrated that this new GMDR method has a higher power and more accurate predictive ability than the GMDR methods based on a logit model and a multinomial logit model. We applied this new method to the genetic analysis of low‐density lipoprotein (LDL) cholesterol, a causal risk factor for coronary artery disease, in the Multi‐Ethnic Study of Atherosclerosis, and identified a significant joint action of the CELSR2, SERPINA12, HPGD, and APOB genes. This finding provides new information to advance the limited knowledge about genetic regulation and gene interactions in metabolic pathways of LDL cholesterol. In conclusion, the proportional odds model‐based GMDR is a useful tool that can boost statistical power and prediction accuracy in studying multifactor interactions underlying ordinal traits.</span>Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0tag:blogger.com,1999:blog-10443050.post-9162408587650235192018-10-24T13:45:00.000-04:002018-11-21T13:46:41.433-05:00 Statistical Inference Relief (STIR) feature selectionHappy to be a collaborator on this paper to add inference to the ReliefF method for feature selection. 
We have done a lot of work on this algorithm that is capable of detecting epistasis.<br />
<br />
Le TT, Urbanowicz RJ, Moore JH, McKinney BA. Statistical Inference Relief (STIR) feature selection. Bioinformatics. 2018 Sep 18, in press<br />
<br />Abstract<br />
<br />
MOTIVATION:<br />Relief is a family of machine learning algorithms that uses nearest-neighbors to select features whose association with an outcome may be due to epistasis or statistical interactions with other features in high-dimensional data. Relief-based estimators are non-parametric in the statistical sense that they do not have a parameterized model with an underlying probability distribution for the estimator, making it difficult to determine the statistical significance of Relief-based attribute estimates. Thus, a statistical inferential formalism is needed to avoid imposing arbitrary thresholds to select the most important features.<br />
<br />
METHODS:<br />We reconceptualize the Relief-based feature selection algorithm to create a new family of STatistical Inference Relief (STIR) estimators that retains the ability to identify interactions while incorporating sample variance of the nearest neighbor distances into the attribute importance estimation. This variance permits the calculation of statistical significance of features and adjustment for multiple testing of Relief-based scores. Specifically, we develop a pseudo t-test version of Relief-based algorithms for case-control data.<br />
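<br />
To give a feel for how variance enters the score, here is a minimal sketch of a pooled two-sample t-statistic comparing one attribute's differences to nearest misses (opposite class) and nearest hits (same class). The numbers are invented, and the pooled-variance form is an assumption made for illustration; see the STIR code linked in the post for the authors' actual estimator.<br />

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(miss_diffs, hit_diffs):
    """Two-sample t-statistic with pooled variance: (mean miss - mean hit) / standard error."""
    n_m, n_h = len(miss_diffs), len(hit_diffs)
    pooled_var = (
        (n_m - 1) * variance(miss_diffs) + (n_h - 1) * variance(hit_diffs)
    ) / (n_m + n_h - 2)
    standard_error = sqrt(pooled_var * (1 / n_m + 1 / n_h))
    return (mean(miss_diffs) - mean(hit_diffs)) / standard_error


# Hypothetical per-neighbor differences for a single attribute.
miss_diffs = [0.8, 0.6, 0.7, 0.9]  # differences to nearest neighbors of the other class
hit_diffs = [0.2, 0.1, 0.3, 0.2]   # differences to nearest neighbors of the same class

t_stat = pooled_t(miss_diffs, hit_diffs)
print(round(t_stat, 2))
```

A large positive value indicates that the attribute differs more between classes than within a class, relative to the spread of those differences, and unlike a raw Relief score it can be referred to a t distribution to obtain a p-value.<br />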
<br />
RESULTS:<br />We demonstrate the statistical power and control of type I error of the STIR family of feature selection methods on a panel of simulated data that exhibits properties reflected in real gene expression data, including main effects and network interaction effects. We compare the performance of STIR when the adaptive radius method is used as the nearest neighbor constructor with STIR when the fixed-k nearest neighbor constructor is used. We apply STIR to real RNA-Seq data from a study of major depressive disorder and discuss STIR's straightforward extension to genome-wide association studies.<br />
<br />
AVAILABILITY:<br />Code and data available at http://insilico.utulsa.edu/software/STIR.Jason H. Moore, Ph.D.http://www.blogger.com/profile/07692025646640606430noreply@blogger.com0