Publication #275543

Title: Imputation of missing genotypes from sparse to high density using long-range phasing

Author
DAETWYLER, HANS - Department Of Primary Industries
Wiggans, George
HAYES, BEN - Department Of Primary Industries
WOOLLIAMS, JOHN - Roslin Institute
GODDARD, MIKE - Department Of Primary Industries

Submitted to: Genetics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 6/9/2011
Publication Date: 9/1/2011
Citation: Daetwyler, H.D., Wiggans, G.R., Hayes, B.J., Woolliams, J.A., Goddard, M.E. 2011. Imputation of missing genotypes from sparse to high density using long-range phasing. Genetics. 189:317-327.

Interpretive Summary: Marker genotyping arrays from sparse to high density are now available for many species. The resulting genotypes from high-throughput methods are “unphased,” and, therefore, the paternal or maternal source of each allele is unknown. Knowledge of parental origin or haplotype information can be useful in the analysis of complex traits. A phasing algorithm (ChromoPhase) was developed that utilizes the characteristic that related individuals share potentially long chromosome segments that trace to a common ancestor. ChromoPhase also imputes missing genotypes in individuals genotyped at a lower marker density when more densely genotyped relatives are available. In simulated data with a marker density approximately equivalent to the Illumina BovineSNP50K BeadChip currently used by the dairy industry, 99.9% of loci were correctly phased, and, when imputing from 100 to 1,500 markers, more than 87% of missing genotypes were correctly imputed. ChromoPhase also was tested with a real Holstein cattle data set to impute BovineSNP50K genotypes in animals with a sparse Bovine3K genotype. In those data, 92% of genotypes were correctly imputed in animals with a genotyped sire. Accuracy of genomic predictions was evaluated with the dense, sparse, and imputed simulated data sets, and reduction in genomic evaluation accuracy was modest even with imperfectly imputed genotype data. Imputation of missing genotypes, and potentially full genome sequence, is feasible using long-range phasing.

Technical Abstract: Related individuals share potentially long chromosome segments that trace to a common ancestor. A phasing algorithm (ChromoPhase) that utilizes this characteristic of finite populations was developed to phase large sections of a chromosome. In addition to phasing, ChromoPhase imputes missing genotypes in individuals genotyped at lower marker density when more densely genotyped relatives are available. ChromoPhase uses a pedigree to collect an individual’s (the proband) surrogate parents and offspring and uses genotypic similarity to identify its genomic surrogates. The algorithm then cycles through the relatives and genomic surrogates one at a time to find shared chromosome segments. Once a segment has been identified, any missing information in the proband is filled in with information from the relative. ChromoPhase was tested using a simulated population of 400 individuals at a marker density of 1,500 markers per Morgan (M), which is approximately equivalent to the Illumina BovineSNP50K BeadChip. In simulated data, 99.9% of loci were correctly phased, and, when imputing from 100 to 1,500 markers, more than 87% of missing genotypes were correctly imputed. Performance increased when the number of generations available in the pedigree increased but was reduced when the sparse genotype contained fewer loci. However, in simulated data, ChromoPhase correctly imputed at least 12% more genotypes than fastPHASE (another phasing algorithm), depending on sparse marker density. ChromoPhase also was tested with a real Holstein cattle data set to impute BovineSNP50K genotypes in animals with a sparse Bovine3K genotype. In those data, 92% of genotypes were correctly imputed in animals with a genotyped sire. Accuracy of genomic predictions was evaluated with the dense, sparse, and imputed simulated data sets, and reduction in genomic evaluation accuracy was modest even with imperfectly imputed genotype data. Imputation of missing genotypes, and potentially full genome sequence, is feasible using long-range phasing.
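The fill-in step described in the abstract (cycle through relatives one at a time, find a shared segment, copy missing information into the proband) can be illustrated with a toy sketch. This is not the ChromoPhase implementation. It assumes, for simplicity, that each relative is already represented by a single phased haplotype coded 0/1 with `None` for a missing allele, and it treats a run of agreeing markers as "shared". The function names, the `min_informative` threshold, and the example data are all hypothetical. The last function computes the fraction of masked alleles that were recovered correctly, which is one plausible reading of the "correctly imputed" percentages quoted above.

```python
from typing import List, Optional, Tuple

Allele = Optional[int]  # 0 or 1; None marks a missing allele (assumed coding)


def shared_segments(proband: List[Allele], relative: List[Allele],
                    min_informative: int = 3) -> List[Tuple[int, int]]:
    """Find runs where proband and relative agree at every marker observed in both.

    A run spans from its first to its last agreeing marker (end exclusive) and is
    kept only if at least `min_informative` markers support it (hypothetical rule).
    """
    if len(proband) != len(relative):
        raise ValueError("haplotypes must cover the same markers")
    segments = []
    first = last = None
    informative = 0
    for i, (p, r) in enumerate(zip(proband, relative)):
        if p is None or r is None:
            continue  # uninformative marker: neither supports nor breaks a run
        if p == r:
            if first is None:
                first = i
            last = i
            informative += 1
        else:
            # A disagreement ends the current run.
            if informative >= min_informative:
                segments.append((first, last + 1))
            first = last = None
            informative = 0
    if informative >= min_informative:
        segments.append((first, last + 1))
    return segments


def impute(proband: List[Allele], relatives: List[List[Allele]],
           min_informative: int = 3) -> List[Allele]:
    """Cycle through relatives one at a time, filling gaps inside shared runs."""
    current = list(proband)
    for relative in relatives:
        for start, end in shared_segments(current, relative, min_informative):
            for i in range(start, end):
                if current[i] is None:
                    current[i] = relative[i]  # stays None if the relative lacks it too
    return current


def imputation_accuracy(truth: List[int], sparse: List[Allele],
                        imputed: List[Allele]) -> float:
    """Fraction of originally missing alleles that now equal the true allele."""
    masked = [i for i, s in enumerate(sparse) if s is None]
    if not masked:
        return float("nan")
    correct = sum(1 for i in masked if imputed[i] == truth[i])
    return correct / len(masked)


if __name__ == "__main__":
    # Made-up data for illustration only.
    truth = [0, 1, 1, 0, 1, 0, 0, 1]
    sparse = [0, None, 1, None, 1, None, 0, None]
    relatives = [[0, 1, 1, 0, 1, 1, 1, 0], [1, 0, 1, 0, 1, 0, 0, 1]]
    result = impute(sparse, relatives)
    print(result, imputation_accuracy(truth, sparse, result))
```

In this sketch a gap is filled only when it lies between agreeing markers, so alleles at the flanks of a run are left missing rather than guessed. The method described in the abstract additionally uses the pedigree and genotypic similarity to choose which relatives and genomic surrogates to compare, which the sketch does not model.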