Research Project: Development of Enhanced Tools and Management Strategies to Support Sustainable Agricultural Systems and Water Quality

Location: Grassland Soil and Water Research Laboratory

Title: Monitoring legume nutrition with machine learning: The impact of splits in training and testing data

Author
H K, Chinmayi - Oak Ridge Institute for Science and Education (ORISE)
Flynn, Kyle
Baath, Gurjinder - Agrilife Research
Gowda, Prasanna
Northup, Brian
Ashworth, Amanda

Submitted to: Applied Soft Computing
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 4/16/2025
Publication Date: 4/30/2025
Citation: H K, C., Flynn, K.C., Baath, G., Gowda, P.H., Northup, B.K., Ashworth, A.J. 2025. Monitoring legume nutrition with machine learning: The impact of splits in training and testing data. Applied Soft Computing. https://doi.org/10.1016/j.asoc.2025.113186.
DOI: https://doi.org/10.1016/j.asoc.2025.113186

Interpretive Summary: Hyperspectral remote sensing has the potential to revolutionize crop management. This study investigates how training-testing data splits and sample sizes affect the prediction of nutritional and biophysical/biochemical characteristics in three legumes. We used hyperspectral data in conjunction with machine learning [ML] to examine these attributes. Several ML techniques were applied, and results indicate that larger sample sizes lead to better model performance, emphasizing the need for adequate training data. Across all ML models, peak performance occurred at training-testing split ratios of 70:30 to 80:20. The study provides insights into optimal sampling strategies and modeling techniques for using hyperspectral remote sensing as a tool in precision agriculture.

Technical Abstract: Hyperspectral remote sensing has the potential to revolutionize crop management. This study investigates how training-testing data splits and sample sizes affect the prediction of nutritional and biophysical/biochemical characteristics in three legumes. We used in-situ hyperspectral and pseudo-satellite-based data (CHIME) in conjunction with machine learning [ML] to examine these attributes. Three ML techniques were applied: weighted k-nearest neighbors [KKNN], support vector machines [SVM], and random forest [RF]. Results indicate that larger sample sizes lead to better model performance, emphasizing the need for adequate training data. The KKNN approach was the most robust for in-situ data, while RF handled variations in train-test splits effectively for CHIME data. CHIME data outperformed in-situ data when datasets were small. Across all ML models, peak performance occurred at train-test split ratios of 70:30 to 80:20. The study provides insights into optimal sampling strategies and modeling techniques for using hyperspectral remote sensing as a tool in precision agriculture.
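
Illustrative sketch (not the authors' code): the kind of experiment the abstract describes can be outlined as a loop over train-test split ratios and sample sizes, recording test error for each combination. The Python example below is a minimal sketch under loud assumptions: the "spectra" are synthetic, the trait values are invented, and a simple inverse-distance-weighted k-nearest-neighbors regressor stands in for the KKNN, SVM, and RF models named above. All function names (make_synthetic_spectra, weighted_knn_predict, rmse_for_split, compare_splits) are hypothetical.

```python
import math
import random
import statistics


def make_synthetic_spectra(n_samples, n_bands=20, seed=0):
    """Generate toy 'reflectance' vectors and a target trait (all values synthetic)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        trait = rng.uniform(10.0, 30.0)  # hypothetical trait value, not real data
        # Assumption: each band responds linearly to the trait, plus Gaussian noise.
        bands = [0.01 * trait * (b + 1) / n_bands + rng.gauss(0.0, 0.005)
                 for b in range(n_bands)]
        X.append(bands)
        y.append(trait)
    return X, y


def weighted_knn_predict(X_train, y_train, x, k=5):
    """Predict with inverse-distance-weighted k-nearest neighbors."""
    nearest = sorted((math.dist(x, xt), yt) for xt, yt in zip(X_train, y_train))[:k]
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * yt for w, (_, yt) in zip(weights, nearest)) / sum(weights)


def rmse_for_split(X, y, train_fraction, seed):
    """Shuffle the samples, split at train_fraction, and return the test RMSE."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_train = int(round(train_fraction * len(idx)))
    train, test = idx[:n_train], idx[n_train:]
    X_tr = [X[i] for i in train]
    y_tr = [y[i] for i in train]
    sq_err = [(weighted_knn_predict(X_tr, y_tr, X[i]) - y[i]) ** 2 for i in test]
    return math.sqrt(statistics.fmean(sq_err))


def compare_splits(n_samples, fractions=(0.5, 0.6, 0.7, 0.8, 0.9), repeats=10):
    """Average the test RMSE over repeated random splits for each train fraction."""
    X, y = make_synthetic_spectra(n_samples)
    return {
        f: statistics.fmean([rmse_for_split(X, y, f, seed) for seed in range(repeats)])
        for f in fractions
    }


if __name__ == "__main__":
    # Compare a small and a larger synthetic dataset across split ratios.
    for n in (50, 200):
        results = compare_splits(n)
        print(f"n={n}", {f: round(r, 2) for f, r in results.items()})
```

Because the data are synthetic, the numbers this prints say nothing about legumes or about the 70:30 to 80:20 range reported above; the sketch only shows the shape of the comparison. Averaging over repeated random splits is a design choice made here so that a single lucky or unlucky partition does not dominate the result, which matters most at small sample sizes.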