
Research Project: SoyBase and the Legume Clade Database

Location: Corn Insects and Crop Genetics Research

Title: Foster thy young: enhanced prediction of orphan genes in assembled genomes

Author
LI, JING - Iowa State University
SINGH, URMINDER - Iowa State University
BHANDARY, PRIYANKA - Iowa State University
Campbell, Jacqueline
ARENDSEE, ZEB - Iowa State University
SEETHARAM, ARUN - Iowa State University
WURTELE, EVE SYRKIN - Iowa State University

Submitted to: Nucleic Acids Research
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 12/2/2021
Publication Date: 12/20/2021
Citation: Li, J., Singh, U., Bhandary, P., Campbell, J.D., Arendsee, Z., Seetharam, A., Wurtele, E. 2021. Foster thy young: enhanced prediction of orphan genes in assembled genomes. Nucleic Acids Research. 50(7):e37. https://doi.org/10.1093/nar/gkab1238.
DOI: https://doi.org/10.1093/nar/gkab1238

Interpretive Summary: Identifying genes from DNA sequences is difficult, complex and time-consuming. Groups working on different species do not use the same methods for identifying genes, and the protocols they publish are often incomplete and difficult to understand. Gene prediction software often relies on sequence similarities between species to identify genes. However, this approach fails to identify genes unique to a given species (orphan genes), which may provide organisms with a way to quickly respond to changing environments. We have developed a new pipeline that integrates the sequences of active messages produced by genes with traditional gene prediction software. We used this method to predict genes in the genome sequence of Arabidopsis, the gold standard of plant genomes, with well-documented genes and orphan genes. Combining the new pipeline with traditional gene prediction software identified 68% of the documented orphan genes in Arabidopsis. We also created detailed, easy-to-understand instructions so that others can use this method to identify genes.

Technical Abstract: Proteins encoded by newly emerged genes ("orphan genes") share no sequence similarity with proteins in any other species. They provide organisms with a reservoir of genetic elements to quickly respond to changing selection pressures. Here, we systematically assess the ability of five gene annotation pipelines to accurately predict genes in genomes according to phylostratal origin. BRAKER and MAKER are existing, popular ab initio tools that infer gene structures by machine learning. Direct Inference is an evidence-based pipeline we developed to predict gene structures from alignments of RNA-Seq data. The BIND pipeline integrates the ab initio predictions of BRAKER with Direct Inference; MIND combines Direct Inference and MAKER predictions. We use highly curated Arabidopsis and yeast annotations as gold-standard benchmarks, and cross-validate in rice. Each pipeline under-predicts orphan genes (identifying as few as 11% under one prediction scenario); increasing RNA-Seq diversity greatly improves prediction efficacy. BIND yields the best predictions overall, identifying 68% of annotated orphan genes and 99% of ancient genes in Arabidopsis. We provide a lightweight, flexible, reproducible solution to improve gene prediction.
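
To illustrate what "identifying 68% of annotated orphan genes" means in practice, the following minimal Python sketch computes the fraction of gold-standard genes recovered in each phylostratum. It is not the authors' evaluation code. The function name, the gene identifiers and the toy data are hypothetical, and the sketch assumes that matching predicted gene models to annotated genes has already been done, so that a prediction is represented simply by the identifier of the annotated gene it recovers.

```python
from collections import defaultdict


def recall_by_phylostratum(annotated, recovered):
    """Fraction of annotated genes recovered, per phylostratum.

    annotated: dict mapping gene ID -> phylostratum label (gold standard)
    recovered: set of annotated gene IDs that a pipeline predicted
    (both structures are illustrative assumptions, not from the paper)
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for gene_id, stratum in annotated.items():
        totals[stratum] += 1
        if gene_id in recovered:
            hits[stratum] += 1
    return {stratum: hits[stratum] / totals[stratum] for stratum in totals}


# Toy, made-up data: two orphan genes and two ancient genes.
annotated = {"g1": "orphan", "g2": "orphan", "g3": "ancient", "g4": "ancient"}
# IDs not present in the annotation (here "g9") are ignored by the recall.
recovered = {"g1", "g3", "g4", "g9"}

recall = recall_by_phylostratum(annotated, recovered)
for stratum, value in sorted(recall.items()):
    print(f"{stratum}: {value:.0%}")
```

Reporting recall separately for each phylostratum, rather than as a single genome-wide figure, is what exposes the gap between orphan and ancient genes described in the abstract.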