|Schroeder, Steven - Steve|
Submitted to: BMC Genomics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 12/15/2015
Publication Date: 12/29/2015
Citation: Whitacre, L.K., Tizioto, P.C., Kim, J., Sonstegard, T., Schroeder, S.G., Alexander, L.J., Medrano, J.F., Schnabel, R.D., Taylor, J.F., Decker, J.E. 2015. What's in your next-generation sequence data? An exploration of unmapped DNA and RNA sequence reads from the bovine reference individual. BMC Genomics. 16:1114.
Interpretive Summary: Next-generation sequencing technologies have enabled the production of vast quantities of sequencing data. The first step in analyzing these data is often to map the sequencing reads to the reference genome of the organism being studied. Unfortunately, a fraction of the sequencing reads commonly fails to map to the reference genome. We undertook this study to identify the nature of these unmapped reads. Starting with DNA and RNA sequenced from the cow reference animal, we analyzed the reads that did not map to the cow reference genome. Many of these reads represent vertebrate sequences that are absent, incomplete, or misassembled in the reference. Other reads represent invertebrate species. This work demonstrates two major benefits of exploring unmapped reads. First, they can be used to identify shortcomings in the reference sequence, and second, they can identify parasitic and commensal organisms.
Technical Abstract: BACKGROUND: Next-generation sequencing projects commonly commence by aligning reads to a reference genome assembly. While improvements in alignment algorithms and computational hardware have greatly enhanced the efficiency and accuracy of alignments, a significant percentage of reads often remain unmapped. RESULTS: We generated de novo assemblies of unmapped reads from the DNA and RNA sequencing of the Bos taurus reference individual and identified the closest matching sequence to each contig by alignment to the NCBI non-redundant nucleotide database using BLAST. As expected, many of these contigs represent vertebrate sequence that is absent, incomplete, or misassembled in the UMD3.1 reference assembly. However, numerous additional contigs represent invertebrate species. Most prominent were several species of Spirurid nematodes and a blood-borne parasite, Babesia bigemina. These species are either not present in the US or are not known to infect taurine cattle and the reference animal appears to have been host to unsequenced sister species. CONCLUSIONS: We demonstrate the importance of exploring unmapped reads to ascertain sequences that are either absent or misassembled in the reference assembly and for detecting sequences indicative of parasitic or commensal organisms.
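The step of identifying the closest matching sequence for each contig can be illustrated with a minimal sketch. It assumes the BLAST results were written in the standard 12-column tabular format (BLAST+ `-outfmt 6`) and that "closest match" means the hit with the highest bit score; the abstract does not state which output format or ranking criterion the authors used. The function name `best_hit_per_contig` and the contig and subject identifiers in the usage example are hypothetical.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hit_per_contig(handle):
    """Return {contig_id: (subject_id, bitscore)} for the top-scoring hit.

    Assumption: the highest bit score defines the closest match. Ties keep
    the hit that appears first in the file.
    """
    best = {}
    reader = csv.DictReader(
        handle, fieldnames=BLAST_COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        contig = row["qseqid"]
        # Skip comment lines (for example from -outfmt 7) and truncated rows.
        if not contig or contig.startswith("#") or row["bitscore"] is None:
            continue
        score = float(row["bitscore"])
        current = best.get(contig)
        if current is None or score > current[1]:
            best[contig] = (row["sseqid"], score)
    return best


# Usage with made-up hits: contig_1 has two hits, contig_2 has one.
example = (
    "contig_1\tsubject_A\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180.0\n"
    "contig_1\tsubject_B\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t150.0\n"
    "contig_2\tsubject_C\t90.0\t80\t8\t0\t1\t80\t1\t80\t1e-20\t90.5\n"
)
hits = best_hit_per_contig(io.StringIO(example))
```

Grouping the retained hits by the taxonomy of the subject sequence would then separate vertebrate contigs from invertebrate ones; that lookup is omitted here because it depends on external taxonomy data.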