Research Project: Genomes to Phenomes in Beef Cattle Research

Location: Genetics and Animal Breeding

Title: A vision of how low-coverage sequence data should contribute to genetic evaluation in the future

Authors:
Thallman, Richard
Borgert, Jacqueline
Engle, Bailey
Keele, John
Snelling, Warren
Gondro, Cedric - Michigan State University
Kuehn, Larry

Submitted to: Journal of Animal Science
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 6/5/2025
Publication Date: 9/5/2025
Citation: Thallman, R.M., Borgert, J.E., Engle, B.N., Keele, J.W., Snelling, W.M., Gondro, C., Kuehn, L.A. 2025. A vision of how low-coverage sequence data should contribute to genetic evaluation in the future. Journal of Animal Science. 103. Article skaf294. https://doi.org/10.1093/jas/skaf294.
DOI: https://doi.org/10.1093/jas/skaf294

Interpretive Summary: Low-pass sequencing refers to sequencing the DNA of animals at low cost and using bioinformatics software to impute that sequence to full genomic sequence. It has been proposed as an alternative to the current standard genotyping technology. At least one commercial product based on low-pass sequencing is available for cattle. Two concerns limit commercial adoption of the technology: 1) the cost of storing the enormous amount of data it generates and 2) whether that additional data will result in improved accuracy of genetic evaluation. The objective of this work is to present a vision for how low-pass sequencing technology could be implemented in the future. A format in which to store the results of low-pass sequencing is proposed. This format should require orders of magnitude less storage space than the approach currently in use. A new model is also proposed, based on knowledge of the biology underlying the transformation of genomic variation into important traits for livestock production. The authors argue that this model would make better use of the information in genomic sequence than current genetic evaluation models. Changes and further advancements in the storage and modeling of genomic data and effects will provide opportunities to increase prediction accuracy of breeding values.

Technical Abstract: Low-coverage sequencing refers to sequencing DNA of individuals to a low depth of coverage (e.g., 0.5X) and imputing that sequence to a genomic sequence based on reference haplotypes from individuals sequenced to a high depth of coverage (e.g., ≥10X). It has been proposed as an alternative to genotyping by single-nucleotide polymorphism (SNP) arrays. At least one commercial product based on it is available for agricultural species. Two concerns limit adoption in its current form: 1) the cost of storing the huge volume of data it generates and 2) whether that additional data will result in improved accuracy of genetic evaluation. This work envisions future implementation of low-coverage sequencing to reduce storage costs and enhance genetic evaluations by leveraging the additional information in the full sequence of the pangenome to account for more genetic variation. We propose addressing the storage issue by representing the genomic sequence of an individual in a pair of haplotype arrays, with each element pointing to an enumerated haplotype of the sequence within one of approximately 50,000 defined genome segments. Assuming 60 million genomic variants, the infrastructure required to translate the identifier of any enumerated haplotype into its genomic sequence would require less than 10 gigabytes of binary storage. Each haplotype array element would require 2 bytes, so the marginal binary storage required to represent the genomic sequence of an individual would be about 200 kilobytes (KB), similar to the genotypes from a SNP array with 200,000 markers. This assumes no pedigree and no ambiguity of the imputation, though the latter is unrealistic. Strategies to minimize and, when necessary, to manage and efficiently represent ambiguity are proposed. The genomic sequence of an individual could be stored in about 1 KB (binary) if both parents have unambiguous sequences stored as described above.
The proposed system for representing the pangenome includes algorithms for read mapping and imputation intended to leverage all known genetic variation in the target population. It is also designed to use sequencing reads generated for imputing the genomic sequence of new individuals to identify unrecognized mutations, crossovers, and structural variants, thus continuously improving the genome representation, especially if widespread use of low-coverage sequencing in livestock industries is realized. This could make improved genetic merit and management of livestock feasible without computational burden.
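
The storage arithmetic in the Technical Abstract can be checked with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `toy_dictionary`, `decode`, `new_haplotype_array`, and `individual_bytes` are hypothetical, the toy sequences are placeholders, and the only figures taken from the abstract are the roughly 50,000 genome segments and the 2 bytes per haplotype array element.

```python
from array import array

N_SEGMENTS = 50_000  # approximate number of genome segments (from the abstract)

# Hypothetical toy dictionary: for each segment, the enumerated haplotype
# sequences observed in the population. Real segments would span many
# variants; these short strings are placeholders for illustration only.
toy_dictionary = {
    0: ["ACGT", "ACGA", "TCGT"],
    1: ["GG", "GA"],
}


def decode(segment, haplotype_id):
    """Translate an enumerated haplotype identifier into its sequence."""
    return toy_dictionary[segment][haplotype_id]


def new_haplotype_array(n_segments=N_SEGMENTS):
    """One haplotype array: one unsigned 16-bit identifier per segment.

    Typecode 'H' is an unsigned short, which is 2 bytes on common platforms.
    """
    return array("H", [0] * n_segments)


def individual_bytes(n_segments=N_SEGMENTS):
    """Marginal storage for one individual: a pair of haplotype arrays."""
    first = new_haplotype_array(n_segments)
    second = new_haplotype_array(n_segments)
    return first.itemsize * len(first) + second.itemsize * len(second)


if __name__ == "__main__":
    print(decode(0, 1))        # sequence of haplotype 1 in segment 0
    print(individual_bytes())  # 2 arrays x 50,000 elements x 2 bytes
```

With these figures the pair of arrays occupies 2 × 50,000 × 2 = 200,000 bytes, which matches the abstract's estimate of about 200 KB per individual. A 2-byte identifier can distinguish at most 65,536 enumerated haplotypes per segment. The sketch ignores imputation ambiguity and pedigree, as the abstract's 200 KB figure does.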