This is an interesting new paper.
Levin TC, Glazer AM, Pachter L, Brem RB, Eisen MB. Exploring the genetic basis of variation in gene predictions with a synthetic association study. PLoS One. 2010 Jul 29;5(7):e11645. [PubMed]
Abstract
Identifying DNA polymorphisms that affect molecular processes like transcription, splicing, or translation typically requires genotyping and experimentally characterizing tissue from large numbers of individuals, which remains expensive and time consuming. Here we introduce an alternative strategy: a "synthetic association study" in which we computationally predict molecular phenotypes on artificial genomes containing randomly sampled combinations of polymorphic alleles, and perform a classical association study to identify genotypes underlying variation in these computationally predicted annotations. We applied this method to characterize the effects on gene structure of 32,792 single-nucleotide polymorphisms between two strains of the antibiotic producing fungus Penicillium chrysogenum. Although these SNPs represent only 0.1 percent of the nucleotides in the genome, they collectively altered 1.8 percent of predicted gene models between these strains. To determine which SNPs or combinations of SNPs were responsible for this variation, we predicted protein-coding genes in 500 intermediate genomes, each identical except for randomly chosen alleles at each SNP position. Of 30,468 gene models in the genome, 557 varied across these 500 genomes. 226 of these polymorphic gene models (40%) were perfectly correlated with individual SNPs, all of which were within or immediately proximal to the affected gene.
The genetic architectures of the other 321 were more complex, with several examples of SNP epistasis that would have been difficult to predict a priori. We expect that many of the SNPs that affect computational gene structure reflect a biologically unrealistic sensitivity of the gene prediction algorithm to sequence changes, and we propose that genome annotation algorithms could be improved by minimizing their sensitivity to natural polymorphisms. However, many of the SNPs we identified are likely to affect transcript structure in vivo, and the synthetic association study approach can be easily generalized to any computed genome annotation to uncover relationships between genotype and important molecular phenotypes.
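To make the idea concrete, here is a rough Python sketch of what a synthetic association study could look like on a toy scale. It is not the authors' pipeline: the function names, the five-SNP genome, and especially `toy_predict_gene_models` (a made-up stand-in for a real gene prediction program) are all assumptions for illustration. It builds random mixtures of two parental genotypes, "predicts" gene models on each, and then asks which varying gene models are perfectly correlated with a single SNP.

```python
import random


def make_synthetic_genotypes(n_snps, n_genomes, seed=0):
    """Pick a random parental allele (0 or 1) at every SNP in every genome."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n_snps)] for _ in range(n_genomes)]


def toy_predict_gene_models(genotype):
    """Hypothetical stand-in for a gene predictor (NOT a real tool).

    Returns one of two alternative gene models (0 or 1) per gene.
    """
    return {
        "geneA": genotype[2],                # controlled by a single SNP
        "geneB": genotype[0] ^ genotype[3],  # depends on two SNPs (toy epistasis)
        "geneC": 0,                          # never changes
    }


def perfectly_associated_snps(genotypes, phenotypes):
    """Return indices of SNPs whose allele tracks the phenotype in every genome."""
    hits = []
    for j in range(len(genotypes[0])):
        column = [g[j] for g in genotypes]
        same = all(c == p for c, p in zip(column, phenotypes))
        flipped = all(c != p for c, p in zip(column, phenotypes))
        if same or flipped:
            hits.append(j)
    return hits


def synthetic_association(n_snps=5, n_genomes=500, seed=0):
    """Run the toy study: predict on each synthetic genome, then associate."""
    genotypes = make_synthetic_genotypes(n_snps, n_genomes, seed)
    predictions = [toy_predict_gene_models(g) for g in genotypes]
    results = {}
    for gene in predictions[0]:
        phenotypes = [p[gene] for p in predictions]
        if len(set(phenotypes)) == 1:
            continue  # gene model is identical in all genomes; nothing to map
        results[gene] = perfectly_associated_snps(genotypes, phenotypes)
    return results


if __name__ == "__main__":
    print(synthetic_association())
```

In this toy setup, `geneA` maps to SNP 2 alone, `geneB` varies but has no single perfectly correlated SNP because its model depends on a combination of two SNPs, and the invariant `geneC` is dropped. Those are the same three categories the abstract describes, just at a tiny scale.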