Heaton, Michael - Mike
Submitted to: Faculty of 1000 Biology
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 11/10/2013
Publication Date: 2/10/2014
Citation: Kalbfleisch, T.S., Heaton, M.P. 2014. Mapping whole genome shotgun sequence and variant calling in mammalian species without their reference genomes. F1000 Research 2:244. doi: 10.12688/f1000research.2-244.v2.
Interpretive Summary: Genomics research in mammals is shedding new light on complex diseases caused by a combination of genetic and environmental factors. The keystone of genomics research for any species is the fully assembled DNA sequence of its entire genome. High quality reference genome sequences are now available for humans, model species, and economically important agricultural animals. Comparisons between these species have provided unique insights into mammalian gene function. However, among the 300 or so ruminants that have survived to the present, the full DNA sequence is available for only 2% of these species, leaving researchers to guess at the content of the remaining 98%. Although reference sequences will eventually be developed for additional hoof stock, the resources required to develop a quality reference genome may be unattainable for most species for at least another decade. In this work, we showed how to use the existing high quality reference genome from one species (cattle) to interpret whole genome information obtained for another species (sheep). These two species are thought to have diverged approximately 15 to 30 million years ago. The results provide a proof-of-principle for interpreting whole genome information from any of the 98% of ruminant species that presently do not have their own genome project. Comparisons of whole genome information between domestic and exotic ruminant species are expected to provide unique insights into gene function and new basic research information and opportunities for advancing our scientific understanding.
Technical Abstract: Genomics research in mammals has produced reference genome sequences that are essential for identifying variation associated with disease. High quality reference genome sequences are now available for humans, model species, and economically important agricultural animals. Comparisons between these species have provided unique insights into mammalian gene function. However, the number of species with reference genomes is small compared to the number needed for studying molecular evolutionary relationships in the tree of life. For example, among the even-toed ungulates there are approximately 300 species whose phylogenetic relationships have been calculated in the 10k trees project. Only six of these have reference genomes: cattle, swine, sheep, goat, water buffalo, and bison. Although reference sequences will eventually be developed for additional hoof stock, the resources in terms of time, money, infrastructure, and expertise required to develop a quality reference genome may be unattainable for most species for at least another decade. In this work we mapped 35 Gb of next-generation sequence data from a Katahdin sheep to its own species’ reference genome (Ovis aries Oar3.1) and to that of a species that diverged 15 to 30 million years ago (Bos taurus UMD3.1). In total, 56% of reads covered 76% of UMD3.1 to an average depth of 6.8 reads per site. A total of 83 million variants were identified, of which 78 million were homozygous and likely represent interspecies nucleotide differences. Excluding genome repeat regions and sex chromosomes, approximately 3.7 million heterozygous sites were identified in this animal vs. bovine UMD3.1, representing polymorphisms occurring in sheep. Of these, 41% could be readily mapped to orthologous positions in ovine Oar3.1, with 80% corroborated as heterozygous. These variant sites, identified via interspecies mapping, could be used for comparative genomics, disease association studies, and ultimately to understand mammalian gene function.
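The interpretation rule and the counts reported in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `classify_site` and `summarize_reported_counts` are hypothetical, the counts are the rounded figures quoted above, and the sketch assumes that the 80% corroboration rate applies to the 41% of heterozygous sites that mapped to Oar3.1.

```python
from typing import Dict, Tuple

# Rounded figures quoted in the abstract (sheep reads mapped to bovine UMD3.1).
TOTAL_VARIANTS = 83_000_000
HOMOZYGOUS_VARIANTS = 78_000_000
HET_SITES_FILTERED = 3_700_000   # heterozygous, outside repeats and sex chromosomes
FRACTION_MAPPED_TO_OAR = 0.41    # heterozygous sites placed at orthologous Oar3.1 positions
FRACTION_CORROBORATED = 0.80     # ASSUMPTION: taken as a share of the mapped sites


def classify_site(ref: str, genotype: Tuple[str, str]) -> str:
    """Hypothetical helper: interpret one diploid genotype called against
    a reference genome from a different species."""
    first, second = genotype
    if first != second:
        # Two different alleles in one animal: variation within its own species.
        return "candidate_polymorphism"
    if first != ref:
        # Both alleles differ from the other species' reference.
        return "likely_interspecies_difference"
    return "matches_reference"


def summarize_reported_counts() -> Dict[str, float]:
    """Derive approximate site counts from the reported percentages."""
    mapped = HET_SITES_FILTERED * FRACTION_MAPPED_TO_OAR
    corroborated = mapped * FRACTION_CORROBORATED
    return {
        "homozygous_share": HOMOZYGOUS_VARIANTS / TOTAL_VARIANTS,
        "het_mapped_to_oar": mapped,
        "het_corroborated": corroborated,
        "corroborated_share_of_het": corroborated / HET_SITES_FILTERED,
    }


if __name__ == "__main__":
    print(classify_site("A", ("G", "G")))
    print(classify_site("A", ("A", "G")))
    for name, value in summarize_reported_counts().items():
        print(f"{name}: {value:,.3f}")
```

Under these assumptions, roughly 1.5 million heterozygous sites map to Oar3.1 and about 1.2 million are corroborated, which is about one third of the 3.7 million heterozygous sites called against UMD3.1.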