Research Project: Enhancing Breeding of Small Grains through Improved Bioinformatics

Location: Plant, Soil and Nutrition Research

Title: Imputation of unordered markers and the impact on genomic selection accuracy

Author
RUTKOSKI, JESSICA - Cornell University - New York
Poland, Jesse
Jannink, Jean-Luc
SORRELLS, MARK - Cornell University - New York

Submitted to: Genes, Genomes, and Genomics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 12/28/2012
Publication Date: 3/1/2013
Publication URL: https://doi.org/10.1534/g3.112.005363
Citation: Rutkoski, J., Poland, J.A., Jannink, J., Sorrells, M. 2013. Imputation of unordered markers and the impact on genomic selection accuracy. Genes, Genomes, and Genomics. 3(3):427-439.

Interpretive Summary: Genomic selection, a breeding method that promises to accelerate rates of genetic gain, requires dense, genome-wide marker data. Genotyping-by-sequencing can generate the large number of markers required. However, these markers typically have a large proportion of missing data. Missing data can be filled in by a process called imputation, but existing algorithms were developed for species whose genomes have been sequenced, where the sequence allows markers to be ordered. Algorithms suited for unordered markers have not been rigorously evaluated. Using four empirical datasets, we evaluated and characterized four imputation methods (k-nearest neighbors, singular value decomposition, random forest regression, and expectation maximization) in terms of their imputation accuracies and the factors affecting accuracy. The effect on genomic selection accuracy of excluding markers with a large proportion of missing data was also examined. Our results show that imputation of unordered markers can be accurate, especially when linkage disequilibrium between markers is high and genotyped individuals are related. Of the methods evaluated, random forest regression imputation produced superior accuracy. All four imputation methods we evaluated led to improved genomic selection accuracies when the level of missing data was high. Including rather than excluding markers with a large proportion of missing data nearly always led to greater genomic selection accuracies. We conclude that high levels of missing data in dense marker sets are not a major obstacle for genomic selection, even when marker order is not known.

Technical Abstract: Genomic selection, a breeding method that promises to accelerate rates of genetic gain, requires dense, genome-wide marker data. Genotyping-by-sequencing can generate a large number of de novo markers. However, without a reference genome, these markers are unordered and typically have a large proportion of missing data. Because marker imputation algorithms were developed for species with a reference genome, algorithms suited for unordered markers have not been rigorously evaluated. Using four empirical datasets, we evaluate and characterize four such imputation methods (k-nearest neighbors, singular value decomposition, random forest regression, and expectation maximization imputation) in terms of their imputation accuracies and the factors affecting accuracy. The effect of imputation method on genomic selection accuracy is assessed in comparison with mean imputation. The effect on genomic selection accuracy of excluding markers with a large proportion of missing data is also examined. Our results show that imputation of unordered markers can be accurate, especially when linkage disequilibrium between markers is high and genotyped individuals are related. Of the methods evaluated, random forest regression imputation produced superior accuracy. In comparison with mean imputation, all four imputation methods we evaluated led to higher genomic selection accuracies when the level of missing data was high. Including rather than excluding markers with a large proportion of missing data nearly always led to greater genomic selection accuracies. We conclude that high levels of missing data in dense marker sets are not a major obstacle for genomic selection, even when marker order is not known.
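
As a rough illustration of the contrast drawn above between mean imputation and a neighbor-based method that needs no marker order, the following Python sketch fills missing calls in a small marker matrix both ways. It is not the authors' implementation. The genotype coding (-1, 0, 1 with None for missing), the function names, the toy data, and the choice of other individuals as neighbors, with distance computed only over jointly observed markers, are all assumptions made for this example.

```python
from math import sqrt


def mean_impute(genotypes):
    """Replace each missing call (None) with the mean observed call at that marker."""
    filled = [row[:] for row in genotypes]
    for j in range(len(genotypes[0])):
        observed = [row[j] for row in genotypes if row[j] is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        for row in filled:
            if row[j] is None:
                row[j] = mean
    return filled


def distance(a, b):
    """Root mean squared difference over markers observed in both individuals."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")
    return sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(genotypes, k=2):
    """Fill each missing call with the average call of the k most similar
    individuals that were genotyped at that marker. No marker order is used."""
    fallback = mean_impute(genotypes)  # used only when no donor is available
    filled = [row[:] for row in genotypes]
    for i, row in enumerate(genotypes):
        for j, call in enumerate(row):
            if call is not None:
                continue
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(genotypes)
                if m != i and other[j] is not None
            )
            donors = [d for d in donors if d[0] != float("inf")][:k]
            if donors:
                filled[i][j] = sum(value for _, value in donors) / len(donors)
            else:
                filled[i][j] = fallback[i][j]
    return filled


# Toy data: rows are individuals, columns are unordered markers.
genotypes = [
    [1, 1, -1, 1],
    [1, 1, -1, None],
    [-1, -1, 1, -1],
    [-1, None, 1, -1],
]

print(mean_impute(genotypes)[1][3])
print(knn_impute(genotypes, k=1)[1][3])
```

In this toy matrix the second individual matches the first at every jointly observed marker, so the neighbor-based fill borrows the first individual's call, whereas mean imputation returns the marker average regardless of relatedness. This mirrors the abstract's observation that imputation of unordered markers works best when genotyped individuals are related.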