Location: Grassland Soil and Water Research Laboratory
Title: Monitoring legume nutrition with machine learning: The impact of splits in training and testing data
Author:
H K, CHINMAYI - Oak Ridge Institute For Science And Education (ORISE)
Flynn, Kyle
BAATH, GURJINDER - Agrilife Research
Gowda, Prasanna
Northup, Brian
Ashworth, Amanda
Submitted to: Applied Soft Computing
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 4/16/2026
Publication Date: 4/30/2025
Citation: H K, C., Flynn, K.C., Baath, G., Gowda, P.H., Northup, B.K., Ashworth, A.J. 2025. Monitoring legume nutrition with machine learning: The impact of splits in training and testing data. Applied Soft Computing. https://doi.org/10.1016/j.asoc.2025.113186.
DOI: https://doi.org/10.1016/j.asoc.2025.113186

Interpretive Summary: Hyperspectral remote sensing has the potential to revolutionize crop management. This study investigates how training-testing data splits and sample sizes affect predictions of the nutritional and biophysical/biochemical characteristics of three legumes. We used hyperspectral data in conjunction with machine learning [ML] to examine attributes of the three legumes. Several ML techniques were applied, and results indicate that larger sample sizes lead to better model performance, emphasizing the need for adequate training data. Across all ML models, peak performance occurred at training-testing split ratios of 70:30 to 80:20. The study provides insights into optimal sampling strategies and modeling techniques for using hyperspectral remote sensing as a tool in precision agriculture.

Technical Abstract: Hyperspectral remote sensing has the potential to revolutionize crop management. This study investigates how training-testing data splits and sample sizes affect predictions of the nutritional and biophysical/biochemical characteristics of three legumes. We used in-situ hyperspectral and pseudo-satellite-based data (CHIME) in conjunction with machine learning [ML] to examine attributes of the three legumes. Several ML techniques were applied, including weighted k-nearest neighbors [KKNN], support vector machines [SVM], and random forest [RF]. Results indicate that larger sample sizes lead to better model performance, emphasizing the need for adequate training data. The KKNN approach was the most robust for in-situ data, while RF handled variations in train-test splits effectively for CHIME data. CHIME data outperformed in-situ data with small datasets. Across all ML models, peak performance occurred at train-test split ratios of 70:30 to 80:20. The study provides insights into optimal sampling strategies and modeling techniques for using hyperspectral remote sensing as a tool in precision agriculture.
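
The kind of experiment described in the abstract, scoring a model at several train-test split ratios, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the data are synthetic random "band" values rather than hyperspectral or CHIME measurements, the function names (`make_synthetic`, `split`, `weighted_knn_predict`, `evaluate_splits`) are hypothetical, and the inverse-distance-weighted k-nearest-neighbor regressor is a simple stand-in, not the KKNN implementation used in the study.

```python
import math
import random


def make_synthetic(n, n_bands=5, seed=0):
    """Build synthetic data: rows of random 'band' values and a noisy linear trait."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(n_bands)]
    X, y = [], []
    for _ in range(n):
        row = [rng.random() for _ in range(n_bands)]
        X.append(row)
        y.append(sum(w * v for w, v in zip(weights, row)) + rng.gauss(0.0, 0.05))
    return X, y


def split(X, y, train_frac, seed=0):
    """Shuffle indices and split them into training and testing subsets."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_frac * len(idx)))
    tr, te = idx[:cut], idx[cut:]
    return ([X[i] for i in tr], [y[i] for i in tr],
            [X[i] for i in te], [y[i] for i in te])


def weighted_knn_predict(X_train, y_train, x, k=5):
    """Predict with inverse-distance-weighted k-NN regression (illustrative stand-in)."""
    nearest = sorted((math.dist(x, xt), yt) for xt, yt in zip(X_train, y_train))[:k]
    w = [1.0 / (d + 1e-9) for d, _ in nearest]  # small constant avoids division by zero
    return sum(wi * yi for wi, (_, yi) in zip(w, nearest)) / sum(w)


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def evaluate_splits(X, y, fractions=(0.5, 0.6, 0.7, 0.8, 0.9), repeats=5, k=5):
    """Return the mean test RMSE for each training fraction over repeated shuffles."""
    results = {}
    for frac in fractions:
        scores = []
        for r in range(repeats):
            X_tr, y_tr, X_te, y_te = split(X, y, frac, seed=r)
            preds = [weighted_knn_predict(X_tr, y_tr, x, k) for x in X_te]
            scores.append(rmse(y_te, preds))
        results[frac] = sum(scores) / len(scores)
    return results


if __name__ == "__main__":
    X, y = make_synthetic(200)
    for frac, score in evaluate_splits(X, y).items():
        print(f"train fraction {frac:.1f}: mean RMSE {score:.4f}")
```

Repeating each split with several shuffles and averaging the error is one way to separate the effect of the ratio from the luck of a single partition. Results on this synthetic data say nothing about which ratio is best for real legume spectra.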
