Epistasis Blog

From the Computational Genetics Laboratory at the University of Pennsylvania (www.epistasis.org)

Wednesday, April 05, 2017

Version 0.7 of TPOT released on GitHub

Version 0.7 of our Tree-Based Pipeline Optimization Tool (TPOT) for automated machine learning is now available for download. New features include the ability to customize TPOT using a config file and the ability of TPOT to make use of multiple CPUs for parallel processing.
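For readers who have not tried these two features, here is a minimal sketch of how they might be used from Python. It assumes the classic TPOT interface, in which `TPOTClassifier` accepts `config_dict` and `n_jobs` arguments; check the documentation for the version you have installed. The operators, the hyperparameter values, and the `build_tpot` helper are illustrative choices, not part of the release.

```python
# Search space in the dictionary format used by classic TPOT configurations:
# each key is the import path of a scikit-learn operator, and each value maps
# a hyperparameter name to the list of values TPOT is allowed to try.
tpot_config = {
    "sklearn.naive_bayes.GaussianNB": {},
    "sklearn.tree.DecisionTreeClassifier": {
        "criterion": ["gini", "entropy"],
        "max_depth": list(range(1, 11)),
    },
    "sklearn.linear_model.LogisticRegression": {
        "penalty": ["l2"],
        "C": [0.01, 0.1, 1.0, 10.0],
    },
}


def build_tpot(n_jobs=4):
    """Hypothetical helper: build a TPOT classifier restricted to tpot_config.

    TPOT is imported lazily so the configuration above can be inspected
    even on a machine where TPOT is not installed.
    """
    from tpot import TPOTClassifier

    return TPOTClassifier(
        generations=5,
        population_size=20,
        n_jobs=n_jobs,            # number of CPUs used to evaluate pipelines
        config_dict=tpot_config,  # restrict the search to the operators above
        random_state=42,
        verbosity=2,
    )
```

Calling `build_tpot().fit(X_train, y_train)` would then run the search on your own training data. Restricting the configuration to a few fast operators is a simple way to keep run time down on large data sets.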

Thursday, March 30, 2017

Variant Set Enrichment: an R package to identify disease-associated functional genomic regions

Variant Set Enrichment (VSE) is an R package to calculate the enrichment of a set of disease-associated variants across functionally annotated genomic regions, consequently highlighting the mechanisms important in the etiology of the disease studied.

Ahmed M, Sallari RC, Guo H, Moore JH, He HH, Lupien M. Variant Set Enrichment: an R package to identify disease-associated functional genomic regions. BioData Min. 2017 Feb 22;10:9. [PDF]

Saturday, February 25, 2017

Relief Based Algorithms in Python

We have released a Python package for carrying out ReliefF-based feature selection that can be used for epistasis analysis using machine learning methods. Our ReBATE package is on GitHub. We have also released a version of this code that is compatible with the scikit-learn machine learning library in Python. This is also available on GitHub.

For more information about ReliefF for epistasis analysis we recommend our book chapter on the subject.

Moore JH. Epistasis analysis using ReliefF. Methods Mol Biol. 2015;1253:315-25.
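To give a feel for why Relief-style scores are useful for epistasis, here is a minimal sketch in plain Python. It is not the ReBATE code and it is simpler than ReliefF: it uses only the nearest hits and misses, averaging over ties, with Hamming distance on discrete features. The toy data set, in which class is the XOR of the first two features and the third feature is noise, is an assumption made for illustration.

```python
from itertools import product


def hamming(u, v):
    """Number of positions at which two discrete feature vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def relief_weights(X, y):
    """Basic Relief feature weights for discrete features.

    For each instance, a feature is rewarded when it differs from the nearest
    instances of the other class (misses) and penalized when it differs from
    the nearest instances of the same class (hits). Tied nearest neighbors
    are averaged. Assumes every class has at least two instances.
    """
    n_features = len(X[0])
    weights = [0.0] * n_features
    for i, (xi, yi) in enumerate(zip(X, y)):
        hits = [x for j, (x, c) in enumerate(zip(X, y)) if j != i and c == yi]
        misses = [x for x, c in zip(X, y) if c != yi]
        for group, sign in ((hits, -1.0), (misses, 1.0)):
            d_min = min(hamming(xi, x) for x in group)
            nearest = [x for x in group if hamming(xi, x) == d_min]
            for f in range(n_features):
                mean_diff = sum(xi[f] != x[f] for x in nearest) / len(nearest)
                weights[f] += sign * mean_diff
    return [w / len(X) for w in weights]


# Toy data: every combination of three binary features.
# Class is the XOR of features 0 and 1; feature 2 is pure noise.
X = [list(g) for g in product((0, 1), repeat=3)]
y = [g[0] ^ g[1] for g in X]
weights = relief_weights(X, y)
```

Neither of the first two features has a main effect in this data set, yet both receive higher weights than the noise feature. That sensitivity to interactions is what makes this family of methods attractive as a filter for epistasis analysis.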

Saturday, January 14, 2017

Buffering mechanisms that protect an embryo’s development from detrimental effects of genetic variation

This news piece mentions a new article in Nature providing evidence for the buffering of mutations during development. Here is the citation and the abstract. A nice example of epistasis.

Cannavò E, Koelling N, Harnett D, Garfield D, Casale FP, Ciglar L, Gustafson HE, Viales RR, Marco-Ferreres R, Degner JF, Zhao B, Stegle O, Birney E, Furlong EE. Genetic variants regulating expression levels and isoform diversity during embryogenesis. Nature. 2016 Dec 26. doi: 10.1038/nature20802. [Epub ahead of print] PubMed PMID: 28024300.


Embryonic development is driven by tightly regulated patterns of gene expression, despite extensive genetic variation among individuals. Studies of expression quantitative trait loci (eQTL) indicate that genetic variation frequently alters gene expression in cell-culture models and differentiated tissues. However, the extent and types of genetic variation impacting embryonic gene expression, and their interactions with developmental programs, remain largely unknown. Here we assessed the effect of genetic variation on transcriptional (expression levels) and post-transcriptional (3' RNA processing) regulation across multiple stages of metazoan development, using 80 inbred Drosophila wild isolates, identifying thousands of developmental-stage-specific and shared QTL. Given the small blocks of linkage disequilibrium in Drosophila, we obtain near base-pair resolution, resolving causal mutations in developmental enhancers, validated transcription-factor-binding sites and RNA motifs. This fine-grain mapping uncovered extensive allelic interactions within enhancers that have opposite effects, thereby buffering their impact on enhancer activity. QTL affecting 3' RNA processing identify new functional motifs leading to transcript isoform diversity and changes in the lengths of 3' untranslated regions. These results highlight how developmental stage influences the effects of genetic variation and uncover multiple mechanisms that regulate and buffer expression variation during embryogenesis.

Thursday, January 12, 2017

Use of Information Measures and Their Approximations to Detect Predictive Gene-Gene Interaction

There is a neat paper that just appeared in the journal Entropy. The authors show how entropy-based methods can detect certain kinds of interactions that are not found with logistic regression. This builds on our previous work (e.g. Moore 2006, Hu 2013) introducing and expanding entropy as a useful metric for epistasis analysis in human genetics. We have recently reviewed these methods here. Others have reviewed these approaches here.
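As a rough illustration of the idea, here is a minimal sketch of one common entropy-based interaction measure: the information gained about case-control status from two SNPs jointly, beyond what each provides alone. This is a generic formulation and not necessarily the estimator used in the Entropy paper. The function names and the XOR toy data are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def interaction_gain(a, b, c):
    """Information about c from (a, b) jointly, minus the two main effects.

    IG(A;B;C) = I(A,B;C) - I(A;C) - I(B;C). Positive values indicate synergy.
    """
    joint = list(zip(a, b))
    return (mutual_information(joint, c)
            - mutual_information(a, c)
            - mutual_information(b, c))


# Toy data: two binary SNPs whose XOR determines case-control status.
snp_a = [0, 0, 1, 1] * 25
snp_b = [0, 1, 0, 1] * 25
status = [a ^ b for a, b in zip(snp_a, snp_b)]

gain = interaction_gain(snp_a, snp_b, status)
```

In this toy example each SNP alone carries no information about status, so a main-effects model has nothing to find, while the pair together explains status completely.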

Sunday, January 01, 2017

10 tips for success as a tenure-track faculty member in informatics or data science

Here are my 10 tips for success as a tenure-track faculty member in informatics or data science. These tips are based on my experience rising through the ranks at Vanderbilt, Dartmouth, and Penn. I will present these at the 2017 Pacific Symposium on Biocomputing (PSB). I will add to them over time.

1) Knock the chip off of your shoulder

Over the years I have seen many young faculty start a tenure-track position at the Assistant Professor level with a big chip on their shoulder (I definitely had one). The phenotype is often (but not exclusively) someone from a top university and a top research lab. These individuals usually have publications in top journals such as Science and Nature which is why they are competitive for a faculty position. These are accomplished individuals with a healthy ego and very high expectations for themselves. Sometimes these individuals either think they don't need help getting their career off the ground or are too afraid to ask because it might be a sign of weakness. Early career scientists with this kind of chip on their shoulder can be the most challenging to mentor because they are either not receptive to advice or reject it when it is given. By the time they realize they need help it is often too late with their tenure decision looming large. My advice is to be humble and seek out advice early and often from multiple mentors (see #2 below). Academia is a complex landscape and the tenure clock ticks very quickly.

2) Find good mentors

An important component of a successful career is a long list of mentors that can guide you through different parts of the complex academic landscape. You need mentors to give you advice about your research, about how to play the federal funding game, about how to navigate university politics, about how to balance work and life, about checking the boxes for promotion and tenure, about defusing stressful situations, and about building a career (the long vision). The best mentors are those that have loads of experience in one or more of these areas. The problem is that the most experienced mentors are often busy with their own success. It is sometimes the case that a department will assign junior faculty one or more mentors to monitor progress and give advice. This can be rare in academia. As such, you should seek out several mentors in your first year. In fact, this should be on the high-priority list for your first three to six months. Do not put it off. Ask around to find out who the good mentors are. Ask them out to lunch. Come prepared with a list of questions. Establish a friendly relationship. Good mentors will make the time. Bad mentors will say they are too busy. I can attribute my own success to a long string of really awesome mentors going back to high school and college.

3) Establish your own research program

It is often the case that informatics or data science faculty are pushed to be consultants and collaborators because the demand for their skills is so high in the era of big data. Establishing collaborations is very important and you want a reputation as a good collaborator. Quantitative faculty that hide in their office and spend most of their time on their own work are not very well liked by researchers generating big data they know nothing about. However, you don't want collaboration to consume all your time because it is also true that faculty who only collaborate, and who don't have their own NIH R01s, are not seen as equals by other R01 PIs. Thus, it is important to develop your own research projects early on leading to NIH grant submissions as PI. A trick that has worked well for me over the years is to spin collaborative projects into my own research projects. One of my strategies is to first help an investigator publish a paper by performing a standard informatics analysis. Once the paper has been accepted, I approach the investigator and ask if I can do a more imaginative analysis of the same data and publish a paper with me as senior author and the collaborator's team as co-authors. Most investigators with data are more than happy to see the data generate additional publications once they have their own senior author paper. Be sure to agree on authorship first and get it in writing. This is important if it becomes something that might get published in a top journal. Views on authorship change with impact factor of the journal. I have seen it happen more than once. This is a nice approach because you already know the data and the research question. It is usually easy to apply more complex computational methods when you are the one interpreting the results and writing the paper. This way, your collaborative effort is also effort towards your own paper that will help your tenure case. Further, you can then spin this paper into an R01 submission as PI. A win-win for everyone.

4) Be productive

This is so very important. The publish or perish mantra is so very true. There is no substitute for productivity unless you are one of the very few Assistant Professors in the country that are able to publish a Science, Nature, and Cell paper in five years before you apply for tenure. A CV with 30 publications looks a lot better than one with 10. This is clearly important for promotion but also helps in several other important ways. First, getting promoted to Associate Professor is partly about establishing a national reputation. Publishing lots of papers gets your name out there. Doing it early helps those papers get cited. Second, the NIH likes to count publications when reporting on the impact of grants in their portfolio. They love high-profile publications they can brag about but at the end of the day it is about numbers. Further, reviewers like to see a steady stream of publications from a researcher. This signals that they work hard and are likely to produce if given funding. Make sure there are no gaps in your CV without a really good explanation like maternity or medical leave. A year with few or no publications looks really bad. The strategy outlined in #3 above can help boost productivity. I am also a big fan of book chapters and essays. These can help pad your CV, help boost citation of your own work, and help generate text you can reuse in your grant applications. Even though these publications don't count much for promotion they do make your total body of work look bigger and help get your name out there as an expert. Also, essays and reviews tend to be highly cited. This is a good activity in your first year or two while you are generating results for data papers. Computer science and bioinformatics conferences are also a great way to get work out quickly. I like to get new ideas out fast through CS conferences and then take more time to flesh them out in a follow-up journal paper.

Productivity also applies to grants. Junior faculty should be in constant grant-writing mode until they land their first two R01s as PI. Try not to miss a deadline. As soon as one grant is submitted start working on the next one. Queue them up months ahead of time. The difference between writing and submitting one R01 per year and three R01s per year might be five or six weekends of hard work. Submitting three R01s per year will dramatically improve your chances of funding. This is much easier for informaticians because we can adapt our methods to many different problems, many different RFAs, and many different institutes. The key is finding study sections and reviewers that like your work. Try different review panels. Try different disease areas. See what works. I have my list of go-to review panels and panels that I will never send a grant to. Trial and error is the only way. Also, informatics grants are a bit different than others. Make sure you are getting grant advice from other informaticians. I have previously published a blog post about my top 10 tips for getting an R01 funded by the NLM. Not only does writing and submitting lots of grants improve your chances of getting funded but it generates lots of text and grant-writing experience that makes each subsequent grant easier to write. The process becomes faster and faster with more and more experience. It is painful at first but will pay off down the road. Also, your mentors will work harder for you if they know you are making a good effort.

5) Choose students and staff carefully

Choosing good students and staff early in your career is SO important. One good student can make the difference between promotion and no promotion. One bad student or staff can eat up huge amounts of time that will take you away from being productive. The selection process is a good topic to take up with mentors early on. Get advice from senior faculty about how they pick good people to work with.

6) Show up

Part of the promotion and tenure process is being a good citizen. You want people to know that you are committed to your department, center, school, and university. Show up for faculty meetings and department or school social events. Network with your colleagues. Be engaged. Be interested. This will all pay off later when it comes time for your colleagues to vote on your promotion. Networking is also very important at a national level. Go to several conferences and workshops per year. Make an effort to present something even if it is only a poster. Introduce yourself to people. Ask about their work. Get their advice. Part of promotion is establishing a national reputation. You must promote yourself and your work every chance you get at national events. Use social media effectively to promote yourself. This is an art. Self-promotion is hard for some people but is so very critical. Some of the people you meet might be asked to write letters commenting on your success for the promotion and tenure committee. The kiss of death is when someone says "I don't know this person or their work".

7) Pay attention to university politics

No one likes university politics but they exist and must be understood. Politics usually don't impact junior faculty but very well could. You don't want politics to consume your every thought but it is a good idea to watch and learn. Mentors can help with this. Ask your mentors to explain how the department works, what the dean does, how decisions are made, etc. You will find yourself in the middle of university politics sooner or later. Understanding politics as much as you can will help you down the road.

8) Learn to play the job offer game

The hard truth about faculty life is that we love our universities but they typically don't love us back. Decisions are made all the time that do not take into account our individual well being. It is important to keep in mind that we are free agents and are able to move from university to university as better opportunities arise. Sometimes moving is the only way to get more resources, a higher salary, and/or more space. When faculty move there is usually some factor pushing them out and a lure that attracts them to a new place. Moving is hard and disruptive so the process of getting job offers and threatening to leave should not be taken lightly. It is physically and emotionally demanding. However, most successful faculty move to a new university about once every five to 10 years. The upside is resources and salary but also a change in environment can be stimulating like a sabbatical. You find yourself with new opportunities and ideas. In my experience, the single factor that sends most faculty onto the job market is lack of recognition and/or lack of respect. For example, it might be hard to swallow no big raise the year you land two R01s and publish a paper in Science. The important thing to keep in mind is that administrators are almost always busy putting out multiple fires or emergencies that consume their time. They aren't paying attention to your success because there are 10 other things more critical at the moment (e.g. other good faculty threatening to leave). This is where mentors can be very helpful. Ask people you trust about how the system works. When is it ok to consider an offer from another university? How should offers be handled? How should they be communicated? What demands should be made? There is a right and wrong process to this 'game' that faculty play with their institutions. Regardless, it is serious business and should be approached as such.

9) Learn the federal funding system

Having good ideas and writing good grants is only half the funding battle. To be successful you must understand how the NIH (or the NSF) works. The NIH is a complex and heterogeneous organization that takes time and experience to understand. Ask your mentors about their funding strategies. Each will have different important tidbits of information that they gleaned from years of experience. The ins and outs of the funding game are extensive.

10) Work your backside off

Academia is more challenging than ever with reduced funding levels, institutional budget cuts, increased demands on faculty time, threats to academic freedom, etc. What all of this means is that you must work even harder to be successful. Taking weekends and evenings off is a luxury that will not lead to faculty success. With that said, working hard is not just about long hours. It is about using your time efficiently and making good decisions. When you are working make sure you are writing papers and grants. Make sure you are planning and strategizing. Make sure you are engaging your mentors. Make sure you are building your CV for promotion every single day. The more efficiently you work the easier it is to justify time off for R&R. I work harder now than I ever have in my career but I also work much more efficiently. In some ways, it gets much easier as you advance. This is why senior successful people seem to get so much done.

Tuesday, December 20, 2016

Complex systems analysis of bladder cancer susceptibility reveals a role for decarboxylase activity in two genome-wide association studies

An example of combining epistasis analysis and pathway analysis to get more value out of GWAS.

Cheng S, Andrew AS, Andrews PC, Moore JH. Complex systems analysis of bladder cancer susceptibility reveals a role for decarboxylase activity in two genome-wide association studies. BioData Min. 2016 Dec 12;9:40. [PDF]


Bladder cancer is a common disease with a complex etiology that is likely due to many different genetic and environmental factors. The goal of this study was to embrace this complexity using a bioinformatics analysis pipeline designed to use machine learning to measure synergistic interactions between single nucleotide polymorphisms (SNPs) in two genome-wide association studies (GWAS) and then to assess their enrichment within functional groups defined by Gene Ontology. The significance of the results was evaluated using permutation testing and those results that replicated between the two GWAS data sets were reported.

In the first step of our bioinformatics pipeline, we estimated the pairwise synergistic effects of SNPs on bladder cancer risk in both GWAS data sets using the Multifactor Dimensionality Reduction (MDR) machine learning method that is designed specifically for this purpose. Statistical significance was assessed using a 1000-fold permutation test. Each single SNP was assigned a p-value based on its strongest pairwise association. Each SNP was then mapped to one or more genes using a window of 500 kb upstream and downstream from each gene boundary. This window was chosen to capture as many regulatory variants as possible. Using Exploratory Visual Analysis (EVA), we then carried out a gene set enrichment analysis at the gene level to identify those genes with an overabundance of significant SNPs relative to the size of their mapped regions. Each gene was assigned to a biological functional group defined by Gene Ontology (GO). We next used EVA to evaluate the overabundance of significant genes in biological functional groups. Our study yielded one GO category, carboxy-lyase activity (GO:0016831), that was significant in analyses from both GWAS data sets. Interestingly, only the gamma-glutamyl carboxylase (GGCX) gene from this GO group was significant in both the detection and replication data, highlighting the complexity of the pathway-level effects on risk. The GGCX gene is expressed in the bladder, but has not been previously associated with bladder cancer in univariate GWAS. However, there is some experimental evidence that carboxy-lyase activity might play a role in cancer and that genes in this pathway should be explored as drug targets. This study provides a genetic basis for that observation.

Our machine learning analysis of genetic associations in two GWAS for bladder cancer identified numerous associations with pairs of SNPs. Gene set enrichment analysis found aggregation of risk-associated SNPs in genes and significant genes in GO functional groups. This study supports a role for decarboxylase protein complexes in bladder cancer susceptibility. Previous research has implicated decarboxylases in bladder cancer etiology; however, the genes that we found to be significant in the detection and replication data are not known to have direct influence on bladder cancer, suggesting some novel hypotheses. This study highlights the need for a complex systems approach to the genetic and genomic analysis of common diseases such as cancer.
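For readers unfamiliar with MDR, the core idea of the first pipeline step can be sketched in a few lines of Python. This is a simplified illustration and not the code used in the paper: it scores each SNP pair on the full data set and omits the cross-validation and permutation testing that a real analysis needs. The function name, the binary genotype coding, and the XOR toy data are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations, product


def mdr_balanced_accuracy(X, y, pair):
    """Score one SNP pair with the core MDR pooling step.

    Each two-locus genotype combination (cell) is labeled high risk when its
    case:control ratio meets or exceeds the overall ratio in the data. The
    resulting one-dimensional high/low attribute is scored by balanced
    accuracy. Assumes y contains at least one case (1) and one control (0).
    """
    cases, controls = defaultdict(int), defaultdict(int)
    for row, label in zip(X, y):
        cell = (row[pair[0]], row[pair[1]])
        (cases if label == 1 else controls)[cell] += 1
    n_cases, n_controls = sum(cases.values()), sum(controls.values())
    threshold = n_cases / n_controls
    cells = set(cases) | set(controls)
    high_risk = {c for c in cells if cases[c] >= threshold * controls[c]}
    sensitivity = sum(cases[c] for c in high_risk) / n_cases
    specificity = sum(controls[c] for c in cells - high_risk) / n_controls
    return (sensitivity + specificity) / 2


# Toy data: three binary SNPs; case status is the XOR of SNPs 0 and 1.
X = [list(g) for g in product((0, 1), repeat=3)]
y = [g[0] ^ g[1] for g in X]

scores = {pair: mdr_balanced_accuracy(X, y, pair)
          for pair in combinations(range(3), 2)}
best_pair = max(scores, key=scores.get)
```

In the toy data only the interacting pair separates cases from controls; pairs that include the noise SNP score no better than chance. In a GWAS setting the same scoring is repeated over a very large number of SNP pairs, which is why the permutation test matters.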

Wednesday, December 14, 2016

A global test for gene-gene interactions based on random matrix theory

Frost HR, Amos CI, Moore JH. A global test for gene-gene interactions based on random matrix theory. Genet Epidemiol. 2016 Dec;40(8):689-701. [PubMed]


Statistical interactions between markers of genetic variation, or gene-gene interactions, are believed to play an important role in the etiology of many multifactorial diseases and other complex phenotypes. Unfortunately, detecting gene-gene interactions is extremely challenging due to the large number of potential interactions and ambiguity regarding marker coding and interaction scale. For many data sets, there is insufficient statistical power to evaluate all candidate gene-gene interactions. In these cases, a global test for gene-gene interactions may be the best option. Global tests have much greater power relative to multiple individual interaction tests and can be used on subsets of the markers as an initial filter prior to testing for specific interactions. In this paper, we describe a novel global test for gene-gene interactions, the global epistasis test (GET), that is based on results from random matrix theory. As we show via simulation studies based on previously proposed models for common diseases including rheumatoid arthritis, type 2 diabetes, and breast cancer, our proposed GET method has superior performance characteristics relative to existing global gene-gene interaction tests. A glaucoma GWAS data set is used to demonstrate the practical utility of the GET method.

Tuesday, November 29, 2016

Modifiers of the Genotype–Phenotype Map: Hsp90 and Beyond

A great review on the role of Hsp90 and its epistatic effects.

Schell R, Mullis M, Ehrenreich IM. Modifiers of the Genotype-Phenotype Map: Hsp90 and Beyond. PLoS Biol. 2016 Nov 10;14(11):e2001015. [PLoS]


Disruption of certain genes alters the heritable phenotypic variation among individuals. Research on the chaperone Hsp90 has played a central role in determining the genetic basis of this phenomenon, which may be important to evolution and disease. Key studies have shown that Hsp90 perturbation modifies the effects of many genetic variants throughout the genome. These modifications collectively transform the genotype–phenotype map, often resulting in a net increase or decrease in heritable phenotypic variation. Here, we summarize some of the foundational work on Hsp90 that led to these insights, discuss a framework for interpreting this research that is centered upon the standard genetics concept of epistasis, and propose major questions that future studies in this area should address.

Wednesday, November 23, 2016

Identifying significant gene-environment interactions using a combination of screening testing and hierarchical false discovery rate control

Frost HR, Shen L, Saykin AJ, Williams SM, Moore JH; Alzheimer's Disease Neuroimaging Initiative. Identifying significant gene-environment interactions using a combination of screening testing and hierarchical false discovery rate control. Genet Epidemiol. 2016 Nov;40(7):544-557. [PubMed]


Although gene-environment (G×E) interactions play an important role in many biological systems, detecting these interactions within genome-wide data can be challenging due to the loss in statistical power incurred by multiple hypothesis correction. To address the challenge of poor power and the limitations of existing multistage methods, we recently developed a screening-testing approach for G×E interaction detection that combines elastic net penalized regression with joint estimation to support a single omnibus test for the presence of G×E interactions. In our original work on this technique, however, we did not assess type I error control or power and evaluated the method using just a single, small bladder cancer data set. In this paper, we extend the original method in two important directions and provide a more rigorous performance evaluation. First, we introduce a hierarchical false discovery rate approach to formally assess the significance of individual G×E interactions. Second, to support the analysis of truly genome-wide data sets, we incorporate a score statistic-based prescreening step to reduce the number of single nucleotide polymorphisms prior to fitting the first stage penalized regression model. To assess the statistical properties of our method, we compare the type I error rate and statistical power of our approach with competing techniques using both simple simulation designs as well as designs based on real disease architectures. Finally, we demonstrate the ability of our approach to identify biologically plausible SNP-education interactions relative to Alzheimer's disease status using genome-wide association study data from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
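As background for the false discovery rate step mentioned in the abstract, here is a minimal sketch of the plain Benjamini-Hochberg procedure in Python. To be clear, this is the standard single-level procedure and not the hierarchical approach developed in the paper. The p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Sort the p-values, find the largest rank k whose p-value is at most
    (k / m) * q, and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    rejected = set(order[:cutoff_rank])
    return [i in rejected for i in range(m)]


# Hypothetical p-values for ten candidate interactions.
p_values = [0.041, 0.001, 0.212, 0.008, 0.039, 0.060, 0.074, 0.205, 0.042, 0.216]
significant = benjamini_hochberg(p_values, q=0.05)
```

With these numbers only the two smallest p-values survive at a 5% false discovery rate, even though five of the ten are below 0.05 on their own.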