
Research Project: Enhanced Alfalfa Germplasm and Genomic Resources for Yield, Quality, and Environmental Protection

Location: Plant Science Research

Title: Predictions from algorithmic modeling result in better decisions than from data modeling for soybean iron deficiency chlorosis

Author
Xu, Zhanyou
KUREK, ANDREOMAR - Iowa State University
Cannon, Steven
BEAVIS, WILLIAM - Iowa State University

Submitted to: PLOS ONE
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 4/21/2021
Publication Date: 7/9/2021
Citation: Xu, Z., Kurek, A., Cannon, S.B., Beavis, W.D. 2021. Predictions from algorithmic modeling result in better decisions than from data modeling for soybean iron deficiency chlorosis. PLoS ONE. 16(7). Article e0240948. https://doi.org/10.1371/journal.pone.0240948.
DOI: https://doi.org/10.1371/journal.pone.0240948

Interpretive Summary: Plant breeding requires repeatedly selecting superior individuals over many years of evaluation. These selections depend on statistical evaluation of breeding material, because yearly performance measurements are imperfect: they are subject to measurement error and natural variation. There are two general statistical approaches for making these selections: “data modeling” and “algorithmic modeling.” Data modeling has been the most commonly used approach; however, with improvements in data collection methods, algorithmic modeling has become the preferred method because of its high prediction accuracy and its ability to extract useful information from hidden patterns in the data. Using the example of iron deficiency chlorosis in soybean (which causes leaf yellowing and low seed yield), this research compares the accuracy and efficiency of the two statistical approaches. We conclude that algorithmic modeling approaches (“machine learning” methods in particular) outperform data modeling approaches, both in prediction accuracy and in extracting decisive information for data-driven agriculture. We also conclude that specific machine learning algorithms should be used to select the top-performing lines and to discard the worst-performing lines for the trait of interest. These results will help breeders apply appropriate methods and algorithms to make more accurate selections in plant breeding, thereby speeding up the breeding process and more efficiently producing new, improved varieties for farmers and consumers.

Technical Abstract: Soybean iron deficiency chlorosis (IDC) is one of the major yield-reducing factors in the U.S. upper Midwest. Marker-assisted selection (MAS) has not been successful in breeding for IDC resistance, due to the complexity of the trait. Genomic prediction has been extensively applied to increase selection accuracy for continuous numeric traits such as yield and plant height. For ordinal data types such as IDC, which is typically scored on a scale from 1 (resistant) to 9 (susceptible), genomic prediction methods have not been systematically compared. Data modeling methods include logistic regression and ridge-regression BLUP (rrBLUP), and algorithmic modeling methods include neural networks, decision trees, Bayesian inference, and support vector machines. The objectives of this research were: 1) to select the best model as measured by discard specificity, selection sensitivity, selection precision, overall prediction accuracy, and the receiver operating characteristic (ROC) curve; and 2) to compare prediction accuracies between data modeling approaches (rrBLUP and logistic regression) and algorithmic modeling methods (random forest (RF), gradient boosting machine (GBM), support vector machine (SVM), K-nearest neighbors (KNN), Naïve Bayes (NB), and artificial neural network (ANN)). We found that among the eight tested models, SVM produced the highest discard specificity for de-selecting IDC-susceptible lines, while RF had the highest selection sensitivity for selecting IDC-resistant lines and the highest overall accuracy. For the soybean IDC ordinal data type, algorithmic modeling outperformed data modeling methods.
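
The evaluation metrics named above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that ordinal IDC scores are reduced to a binary select/discard decision at a hypothetical cutoff (scores at or below the cutoff count as resistant and are selected), and that the metrics follow the standard confusion-matrix definitions. The function name `selection_metrics`, the cutoff value of 3, and the example scores are all illustrative assumptions, not values from the study.

```python
from typing import Dict, Sequence


def selection_metrics(
    observed: Sequence[int],
    predicted: Sequence[int],
    cutoff: int = 3,  # hypothetical threshold; not taken from the study
) -> Dict[str, float]:
    """Compute select/discard metrics from observed and predicted IDC scores.

    A line is treated as "select" (resistant) when its score is <= cutoff
    and as "discard" (susceptible) otherwise. This binarization is an
    assumption made for illustration only.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")

    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        obs_select = obs <= cutoff
        pred_select = pred <= cutoff
        if obs_select and pred_select:
            tp += 1  # resistant line correctly selected
        elif pred_select:
            fp += 1  # susceptible line wrongly selected
        elif obs_select:
            fn += 1  # resistant line wrongly discarded
        else:
            tn += 1  # susceptible line correctly discarded

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN instead of raising when a class is absent.
        return numerator / denominator if denominator else float("nan")

    return {
        "selection_sensitivity": ratio(tp, tp + fn),
        "discard_specificity": ratio(tn, tn + fp),
        "selection_precision": ratio(tp, tp + fp),
        "overall_accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }


# Made-up scores on the 1-9 scale, for demonstration only.
observed_scores = [1, 2, 3, 5, 7, 9, 2, 8]
predicted_scores = [1, 3, 5, 6, 8, 9, 4, 2]
metrics = selection_metrics(observed_scores, predicted_scores)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, a model with high discard specificity rarely keeps susceptible lines, while a model with high selection sensitivity rarely throws away resistant ones, which is why the two metrics can favor different algorithms.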