Submitted to: Genetics Selection Evolution
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 2/27/2017
Publication Date: 3/7/2017
Citation: Van Raden, P.M., Tooker, M.E., O'Connell, J.O., Cole, J.B., Bickhart, D.M. 2017. Selecting sequence variants to improve genomic predictions for dairy cattle. Genetics Selection Evolution. 49:32.
Interpretive Summary: The accuracy of a genomic prediction for an animal’s performance can be improved by including information from more DNA variants. Nearly 40 million variants have been identified from whole-genome sequence data for over 1,500 bulls, and several strategies for imputing these variants to additional animals and using them in genetic evaluation for economic traits show potential. However, imputing, selecting, and predicting effects for millions of DNA variants and many thousands of animals requires efficient computation, because computational costs could exceed any marginal benefits from adding more variants. This study 1) compared accuracy of genomic prediction from sequence data, array data, combined data, and different variant types; 2) tested prediction methods using simulated data before applying them to actual sequence data imputed for a large reference population; and 3) investigated efficiency of computing strategies for even larger genotyped populations. Accuracy of genomic predictions improved when selected sequence variants were added, and the gains were similar with simulated and actual data for the same population. The increase in selection reliability of 2.7 percentage points (64.7 to 67.4%) would add about $3 million per year to national genetic progress, and these annual gains would be permanent and cumulative. Foreign breeders would also gain, either directly by using the new genotyping arrays or indirectly by selecting breeding stock from the improved U.S. population. The initial cost of generating the U.S. sequence data for the 88 dairy bulls that were contributed to the 1000 Bull Genomes project was only $132,000. The return on investment from this research is high and is greatly increased by data sharing.
Technical Abstract: Millions of genetic variants have been identified by population-scale sequencing projects, but subsets are needed for routine genomic predictions or to include on genotyping arrays. Methods of selecting sequence variants were compared using both simulated sequence genotypes and actual data from run 5 (July 2015) of the 1000 Bull Genomes Project. Candidate sequence variants within or near genes for 444 Holstein animals were combined with high-density (HD) imputed genotypes for 26,970 progeny-tested Holstein bulls. Test 1 included single nucleotide polymorphisms (SNPs) for 481,904 candidate sequence variants, with 107,471 in exons, 9,422 in splice sites, 35,242 in untranslated regions at the beginning and end of genes, and 329,769 upstream or downstream of genes. Test 2 also included 249,966 insertions and deletions (INDELs). After merging sequence variants with 312,614 HD SNPs and editing, Test 1 included 762,588 variants, and Test 2 included 1,003,453. Imputation quality from findhap was assessed by keeping 404 of the sequenced animals in the reference population and randomly choosing 40 animals as a test set. The sequence genotypes of the test animals were reduced to the subset in common with HD genotypes and then imputed back to sequence. The percentage of correctly imputed variants averaged 97.2% across all chromosomes in Test 1 and 97.0% in Test 2. Total time required to prepare, edit, and impute the sequence variants for 27,235 animals was about 5 days using fewer than 20 threads. Computation of genomic predictions using deregressed evaluations from August 2011 for 33 traits and 19,575 bulls required about 3 days with 33 threads. Predictions were tested using 2015 data of 3,983 U.S. bulls with daughters that were first phenotyped after August 2011.
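The imputation check described above (reduce the test animals' sequence genotypes to the HD subset, impute back, compare with the true genotypes) can be illustrated with a small sketch. This is not findhap: the imputer below is a deliberately naive stand-in that fills each missing site with the most common reference genotype, and the function names, the toy genotypes, and the choice to score only the masked sites are all assumptions made for illustration.

```python
from collections import Counter

def mask_to_array_subset(genotypes, array_sites):
    """Keep genotypes at array (HD) sites; mark every other site as missing."""
    return [g if i in array_sites else None for i, g in enumerate(genotypes)]

def impute_by_reference_mode(masked, reference):
    """Toy stand-in imputer (NOT findhap): fill each missing site with the
    most common genotype observed at that site in the reference animals."""
    filled = []
    for i, g in enumerate(masked):
        if g is None:
            counts = Counter(animal[i] for animal in reference)
            g = counts.most_common(1)[0][0]
        filled.append(g)
    return filled

def percent_correct(true_genotypes, imputed_genotypes, array_sites):
    """Percentage of masked sites whose imputed genotype equals the truth.
    Scoring only the masked sites is an assumption of this sketch."""
    masked_sites = [i for i in range(len(true_genotypes)) if i not in array_sites]
    hits = sum(true_genotypes[i] == imputed_genotypes[i] for i in masked_sites)
    return 100.0 * hits / len(masked_sites)

# Hypothetical toy data: genotypes coded 0/1/2 at five sites.
reference = [
    [0, 1, 2, 0, 1],
    [0, 1, 2, 1, 1],
    [0, 2, 2, 0, 1],
]
test_animal = [0, 1, 2, 0, 0]   # true sequence genotypes of one test animal
array_sites = {0, 2}            # sites also present on the HD array

masked = mask_to_array_subset(test_animal, array_sites)
imputed = impute_by_reference_mode(masked, reference)
pct = percent_correct(test_animal, imputed, array_sites)
print(f"{pct:.1f}% of masked sites imputed correctly")
```

In practice the same comparison would be run per chromosome and averaged, with a haplotype-based imputation program in place of the toy imputer.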
Many sequence variants had larger estimated effects than nearby HD markers, but prediction reliability improved by only 0.6 percentage points in Test 1 when sequence SNPs were added to HD SNPs, and it was only 0.4 points higher than with HD SNPs alone in Test 2 when both sequence SNPs and INDELs were included. However, selecting the 16,648 candidate SNPs with the largest estimated effects and adding those to the 60,671 SNPs used in routine evaluations improved reliability by 2.7 percentage points on average across traits (67.4% vs. 64.7%); for comparison, parent-average reliability was 35.2%. Genomic prediction reliabilities improved when adding selected sequence variants; gains were similar with simulated and actual data for the same population. With many genotyped animals and many data sources, computing strategies must efficiently balance costs of imputation, selection, and prediction when millions of variants are available.
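The selection step (keep the candidate variants with the largest estimated effects and add them to the routine marker set) might be sketched as follows. The ranking by absolute effect for a single trait, the function names, and the variant IDs are assumptions for illustration; the abstract does not specify how effects were ranked or combined across the 33 traits.

```python
def select_top_variants(effects, n_select):
    """Return the IDs of the n_select variants with the largest absolute
    estimated effect (single-trait ranking is an assumption of this sketch)."""
    ranked = sorted(effects, key=lambda v: abs(effects[v]), reverse=True)
    return ranked[:n_select]

def merge_marker_sets(routine_ids, selected_ids):
    """Append selected variants to the routine set, skipping duplicates
    and preserving the original order of the routine markers."""
    seen = set(routine_ids)
    merged = list(routine_ids)
    for variant in selected_ids:
        if variant not in seen:
            merged.append(variant)
            seen.add(variant)
    return merged

# Hypothetical estimated effects for four candidate sequence variants.
effects = {"seq_a": 0.02, "seq_b": -0.31, "seq_c": 0.15, "seq_d": -0.01}
routine = ["chip_1", "chip_2", "seq_c"]   # hypothetical routine marker IDs

top = select_top_variants(effects, n_select=2)
merged = merge_marker_sets(routine, top)

# Reliability gain reported in the abstract, in percentage points.
gain = round(67.4 - 64.7, 1)
print(top, merged, gain)
```

Ranking on the absolute value treats large negative and large positive effects alike, which is why `seq_b` is chosen ahead of `seq_c` in the toy data.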