RUTKOSKI, JESSICA - Cornell University - New York
SORRELLS, MARK - Cornell University - New York
Submitted to: Genetics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 12/28/2012
Publication Date: 3/1/2013
Citation: Rutkoski, J., Poland, J.A., Jannink, J., Sorrells, M. 2013. Imputation of unordered markers and the impact on genomic selection accuracy. Genetics. 3(3):427-39.
Interpretive Summary: Next-generation sequencing can be used to generate low-cost DNA markers for use in plant breeding, even in species with complex genomes such as wheat. When sequencing is used to generate DNA markers, missing data are common, and the missing marker data points must be imputed (estimated) before building genetic models to predict yield and agronomic performance. Unlike maize or rice, for example, wheat and many other important crop species do not yet have a ‘reference genome sequence’. A reference genome sequence enables ordering of the DNA markers and simplifies the process of imputing missing data points. To address the problem of imputing missing data points in wheat and other ‘non-model’ species, we explored imputation algorithms for unordered DNA markers. Using data from barley, maize and wheat, we tested the imputation accuracy of four algorithms: k-nearest neighbors, singular value decomposition, random forest regression, and expectation maximization. The random forest algorithm consistently produced the most accurate imputation results. Based on this study, suitable algorithms such as random forest can be used to impute missing data in datasets of unordered DNA markers. The imputed datasets can then be used to generate genomic selection models with good accuracy in predicting yield and agronomic performance.
Technical Abstract: Genomic selection (GS), a breeding method that promises to accelerate rates of genetic gain, requires dense, genome-wide marker data. Sequence-based genotyping methods can generate large numbers of markers de novo. However, without a reference genome, these markers are unordered and typically have a large proportion of missing data. Because marker imputation algorithms were developed for species with a reference genome, algorithms suited for unordered markers have not been rigorously evaluated. Using four empirical datasets, we evaluate and characterize four such imputation methods (k-nearest neighbors, singular value decomposition, random forest regression, and expectation maximization imputation) in terms of their imputation accuracies and the factors affecting accuracy. The effect of imputation method on GS accuracy is assessed in comparison with mean imputation. The effect of excluding markers with a large proportion of missing data on GS accuracy is also examined. Our results show that imputation of unordered markers can be accurate, especially when linkage disequilibrium between markers is high and genotyped individuals are related. Of the methods evaluated, random forest regression imputation produced superior accuracy. In comparison with mean imputation, all four imputation methods we evaluated led to higher GS accuracies when the level of missing data was high. Including rather than excluding markers with a large proportion of missing data nearly always led to greater GS accuracies. We conclude that high levels of missing data in dense marker sets are not a major obstacle for genomic selection, even when marker order is not known.
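To make the contrast between mean imputation and a neighbor-based method concrete, the following is a minimal sketch and not the implementation used in the study. It assumes markers are coded 0/1/2, missing calls are stored as `None`, rows are individuals and columns are markers, and the neighbors are individuals rather than markers. The function names (`mean_impute`, `knn_impute`, `distance`) and the toy matrix are hypothetical and chosen only for illustration. Neither function uses marker order, which is the property that matters when no reference genome is available.

```python
from math import sqrt


def mean_impute(matrix):
    """Replace each missing call (None) with the observed mean of its marker (column)."""
    n_markers = len(matrix[0])
    col_means = []
    for j in range(n_markers):
        observed = [row[j] for row in matrix if row[j] is not None]
        # Assumption: a marker with no observed calls is filled with 0.0.
        col_means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]


def distance(a, b):
    """Root mean squared difference over markers observed in both individuals."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")
    return sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(matrix, k=2):
    """Fill each missing call with the average call of the k most similar
    individuals that were genotyped at that marker."""
    fallback = mean_impute(matrix)
    result = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other individuals with an observed call at marker j.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(matrix)
                if m != i and other[j] is not None
            )
            donors = [d for d in donors if d[0] != float("inf")][:k]
            if donors:
                result[i][j] = sum(call for _, call in donors) / len(donors)
            else:
                # No comparable individual: fall back to the marker mean.
                result[i][j] = fallback[i][j]
    return result


# Toy data: two pairs of closely related individuals, one missing call in each pair.
genotypes = [
    [0, 0, 2, 2],
    [0, 0, 2, None],
    [2, 2, 0, 0],
    [2, None, 0, 0],
]

mean_filled = mean_impute(genotypes)
knn_filled = knn_impute(genotypes, k=1)
```

In this toy matrix the nearest-neighbor fill copies the call of the closest relative, while the mean fill uses the marker average across all individuals regardless of relatedness. That mirrors, in a very simplified way, why relatedness among genotyped individuals can help imputation of unordered markers.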