O'CONNELL, JEFFREY - University of Maryland
Submitted to: Interbull Annual Meeting Proceedings
Publication Type: Proceedings
Publication Acceptance Date: 8/8/2015
Publication Date: 8/19/2015
Citation: Van Raden, P.M., O'Connell, J.R. 2015. Strategies to choose from millions of imputed sequence variants. Interbull Annual Meeting Proceedings. Interbull Bulletin 49:10–13.
Interpretive Summary: Millions of sequence variants are now known in cattle, but routine genomic predictions cannot afford to include all of them for all animals, so subsets must be chosen for prediction or for genotyping arrays. Strategies to select variants and impute genotypes were tested using 26,984 simulated bulls, of which 1,000 had 30 million sequence variants and the rest had simulated array genotypes with the same pedigree and densities as actual Holstein bulls. Computing strategies were developed to efficiently balance the costs of imputing, selecting, and predicting effects of the variants. Total computing time was only a few days using 10 to 20 processors. Large gains in reliability were demonstrated if the true causative variants were identified or if advanced bioinformatic tools could identify regions of DNA likely to contain them. Large reference populations are needed in either case because most mutations have very small effects. The methods tested here will be applied to select variants using the actual bull sequences.
Technical Abstract: Millions of sequence variants are known, but subsets are needed for routine genomic predictions or to include on genotyping arrays. Variant selection and imputation strategies were tested using 26 984 simulated reference bulls, of which 1 000 had 30 million sequence variants, 773 had 600 000 markers, 24 863 had 60 000 markers, and 348 had 12 000 markers. Edits that required minor allele frequency (MAF) >0.01 and linkage disequilibrium <0.95, while keeping all 0.5 million variants in or near genes, reduced the list to 8.4 million variants, which were then imputed for all bulls. Strategies were compared for choosing the variants that were most significant or had the largest estimated variances or effect sizes for five independent traits, using single or multiple regression. Reliability of prediction averaged 28.4% from parent average, 77.8% from 60 000 markers, 80.1% from 600 000 markers, 85.0% from 60 000 markers plus the best 25 000 selected sequence variants, or 87.2% using only the 10 000 imputed true quantitative trait loci (QTLs) with no weight on the markers. Genome-wide association (GWA) was faster for selecting variants, but multiple regression was more reliable. With many genotyped animals and many data sources, computing strategies must efficiently balance the costs of imputing, selecting, and predicting when millions of variants are available.
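The edits named in the Technical Abstract (MAF >0.01, linkage disequilibrium <0.95, and keeping all variants in or near genes) can be illustrated with a minimal Python sketch. It is not the authors' software. The function name `select_variants`, the toy data, and several choices are assumptions made for illustration: linkage disequilibrium is measured here as the absolute correlation between allele counts, each variant is compared against every previously kept variant rather than only nearby ones, and variants flagged as in or near genes are exempt from both edits.

```python
import numpy as np


def select_variants(genotypes, in_gene, maf_min=0.01, ld_max=0.95):
    """Return indices of variants kept after MAF and LD edits (hypothetical helper).

    genotypes: (animals, variants) array of allele counts coded 0, 1, 2.
    in_gene:   boolean array; True marks variants in or near genes, which
               are always kept (an assumption about how the edits combine).
    """
    genotypes = np.asarray(genotypes, dtype=float)
    in_gene = np.asarray(in_gene, dtype=bool)
    n_animals, n_variants = genotypes.shape

    # Minor allele frequency from mean allele count.
    freq = genotypes.mean(axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)

    # Standardize columns so that z_j . z_k / n is the correlation.
    # Monomorphic columns become all zeros (correlation treated as 0).
    centered = genotypes - genotypes.mean(axis=0)
    sd = centered.std(axis=0)
    z = centered / np.where(sd > 0, sd, 1.0)

    kept = []
    for j in range(n_variants):
        if in_gene[j]:
            kept.append(j)  # genic variants bypass both edits
            continue
        if maf[j] <= maf_min:
            continue  # fails the MAF edit
        if kept:
            r = z[:, kept].T @ z[:, j] / n_animals
            if np.max(np.abs(r)) >= ld_max:
                continue  # redundant with a variant already kept
        kept.append(j)
    return kept


# Toy data: 8 animals, 5 variants (made up for illustration).
demo_genotypes = np.array([
    [0, 0, 0, 2, 0],
    [1, 1, 0, 0, 1],
    [2, 2, 0, 1, 2],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 2, 0],
    [2, 2, 0, 1, 2],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 2, 0],
])
demo_in_gene = np.array([False, False, False, False, True])
kept = select_variants(demo_genotypes, demo_in_gene)
print(kept)
```

In the toy data, the second variant duplicates the first and the third is monomorphic, so both are dropped; the fifth also duplicates the first but is retained because it is flagged as genic. At the scale described in the abstract (30 million variants), an all-pairs comparison like this would be impractical, so a real pipeline would presumably restrict comparisons to nearby variants.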