Research Project: Improving Control of Stripe Rusts of Wheat and Barley through Characterization of Pathogen Populations and Enhancement of Host Resistance

Location: Wheat Health, Genetics, and Quality Research

Title: Classification and regression models for genomic selection of skewed phenotypes: A case for disease resistance in winter wheat (Triticum aestivum L.)

Author
MERRICK, LANCE - WASHINGTON STATE UNIVERSITY
LOZADA, DENNIS - NEW MEXICO STATE UNIVERSITY
Chen, Xianming
CARTER, ARRON - WASHINGTON STATE UNIVERSITY

Submitted to: Frontiers in Genetics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 1/19/2022
Publication Date: 2/23/2022
Citation: Merrick, L.F., Lozada, D.N., Chen, X., Carter, A.H. 2022. Classification and regression models for genomic selection of skewed phenotypes: A case for disease resistance in winter wheat (Triticum aestivum L.). Frontiers in Genetics. 13. Article 835781. https://doi.org/10.3389/fgene.2022.835781.
DOI: https://doi.org/10.3389/fgene.2022.835781

Interpretive Summary: Most genomic prediction models are linear regression models that assume continuous and normally distributed phenotypes, but responses to diseases such as stripe rust are commonly recorded on ordinal scales and as percentages. Disease severity (SEV) and infection type (IT) data in germplasm screening nurseries generally do not meet these assumptions. In this regard, researchers may ignore the lack of normality, transform the phenotypes, use generalized linear models, or use supervised learning algorithms and classification models that place no restriction on the distribution of response variables and are less sensitive when modeling ordinal scores. The goal of this research was to compare classification and regression genomic selection models for skewed phenotypes using stripe rust SEV and IT in winter wheat. We compared regression and classification prediction models using two training populations: breeding lines phenotyped in four years (2016-2018 and 2020) and a diversity panel phenotyped in four years (2013-2016). The prediction models used 19,861 genotyping-by-sequencing single-nucleotide polymorphism markers. Overall, square-root transformed phenotypes using rrBLUP and support vector machine regression models displayed the highest combination of accuracy and relative efficiency across the regression and classification models. Further, a classification system based on support vector machine and ordinal Bayesian models with a 2-Class scale for SEV reached the highest class accuracy of 0.99. This study showed that breeders can use linear and non-parametric regression models within their own breeding lines over combined years to accurately predict skewed phenotypes.

Technical Abstract: Most genomic prediction models are linear regression models that assume continuous and normally distributed phenotypes, but responses to diseases such as stripe rust (caused by Puccinia striiformis f. sp. tritici) are commonly recorded on ordinal scales and as percentages. Disease severity (SEV) and infection type (IT) data in germplasm screening nurseries generally do not meet these assumptions. In this regard, researchers may ignore the lack of normality, transform the phenotypes, use generalized linear models, or use supervised learning algorithms and classification models that place no restriction on the distribution of response variables and are less sensitive when modeling ordinal scores. The goal of this research was to compare classification and regression genomic selection models for skewed phenotypes using stripe rust SEV and IT in winter wheat. We compared regression and classification prediction models using two training populations: breeding lines phenotyped in four years (2016-2018 and 2020) and a diversity panel phenotyped in four years (2013-2016). The prediction models used 19,861 genotyping-by-sequencing single-nucleotide polymorphism markers. Overall, square-root transformed phenotypes using rrBLUP and support vector machine regression models displayed the highest combination of accuracy and relative efficiency across the regression and classification models. Further, a classification system based on support vector machine and ordinal Bayesian models with a 2-Class scale for SEV reached the highest class accuracy of 0.99. This study showed that breeders can use linear and non-parametric regression models within their own breeding lines over combined years to accurately predict skewed phenotypes.
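
Illustrative sketch (not the authors' pipeline): the square-root transformation followed by a ridge-type marker regression can be outlined in a few lines of Python. Everything below is an assumption made for illustration: the data are simulated, the function names (fit_ridge, predict, simulate, holdout_accuracy) are hypothetical, and the ridge penalty lam is fixed by hand, whereas rrBLUP estimates the equivalent shrinkage from variance components. Accuracy is taken here as the Pearson correlation between predicted and observed transformed severity in a hold-out set, which is one common definition and may differ from the one used in the paper.

```python
import numpy as np


def fit_ridge(Z, y, lam):
    """Ridge regression on centered markers, solved in the n x n dual form.

    lam is a hand-picked penalty (assumption); rrBLUP would estimate the
    equivalent shrinkage from variance components instead.
    """
    col_means = Z.mean(axis=0)
    Zc = Z - col_means
    mu = y.mean()
    n = Z.shape[0]
    # u = Zc' (Zc Zc' + lam I)^-1 (y - mu); cheaper than p x p when p >> n
    alpha = np.linalg.solve(Zc @ Zc.T + lam * np.eye(n), y - mu)
    return mu, col_means, Zc.T @ alpha


def predict(Z_new, mu, col_means, u):
    """Predict phenotypes for new lines from estimated marker effects."""
    return mu + (Z_new - col_means) @ u


def simulate(n=200, p=500, n_causal=20, seed=0):
    """Simulate 0/1/2 marker scores and a right-skewed severity on 0-100."""
    rng = np.random.default_rng(seed)
    Z = rng.integers(0, 3, size=(n, p)).astype(float)
    beta = np.zeros(p)
    beta[rng.choice(p, n_causal, replace=False)] = rng.normal(size=n_causal)
    g = Z @ beta
    g = (g - g.mean()) / g.std()
    latent = g + rng.normal(scale=0.5, size=n)
    # Exponentiating the latent value gives a right-skewed severity score
    sev = np.clip(5.0 * np.exp(latent), 0.0, 100.0)
    return Z, sev


def holdout_accuracy(Z, sev, lam=100.0, n_test=50, seed=1):
    """Pearson r between predicted and observed sqrt(severity) in a hold-out set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(sev))
    test, train = idx[:n_test], idx[n_test:]
    y = np.sqrt(sev)  # square-root transformation of the skewed phenotype
    mu, col_means, u = fit_ridge(Z[train], y[train], lam)
    pred = predict(Z[test], mu, col_means, u)
    return float(np.corrcoef(pred, y[test])[0, 1])


if __name__ == "__main__":
    Z, sev = simulate()
    print("hold-out accuracy (Pearson r):", round(holdout_accuracy(Z, sev), 3))
```

Predictions from this sketch are on the square-root scale; squaring them returns values on the original severity scale. The dual (n x n) form is used only because marker counts typically far exceed the number of lines.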