Research Project: Genetic and Physiological Mechanisms Underlying Complex Agronomic Traits in Grain Crops

Location: Plant Genetics Research

Title: Predicting phenotypes from genetic, environment, management, and historical data using CNNs

Author
Washburn, Jacob
CIMEN, EMRE - Eskisehir Osmangazi University
RAMSTEIN, GUILLAUME - Aarhus University
REEVES, TIMOTHY - Cornell University - New York
O'BRIANT, PATRICK - Cornell University - New York
MCLEAN, GREG - University Of Queensland
COOPER, MARK - University Of Queensland
HAMMER, GRAEME - University Of Queensland
Buckler, Edward - Ed

Submitted to: Theoretical and Applied Genetics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 8/18/2021
Publication Date: 8/27/2021
Citation: Washburn, J.D., Cimen, E., Ramstein, G., Reeves, T., O'Briant, P., McLean, G., Cooper, M., Hammer, G., Buckler IV, E.S. 2021. Predicting phenotypes from genetic, environment, management, and historical data using CNNs. Theoretical and Applied Genetics. 134:3997–4011. https://doi.org/10.1007/s00122-021-03943-7.
DOI: https://doi.org/10.1007/s00122-021-03943-7

Interpretive Summary: Predicting phenotypes from a combination of genetic and environmental (including human-imposed management) conditions is a long-standing scientific challenge with practical implications for agriculture, medicine, and conservation. Convolutional Neural Network (CNN) models show promise for prediction under many complex scenarios, but are only beginning to enter the life sciences. This manuscript explores the use of these models in agricultural prediction and demonstrates that, under the conditions and datasets used, a CNN model can outperform standard genomic prediction while also offering increased data flexibility and various tools for interpretation.

Technical Abstract: Predicting phenotypes from genetic (G), environmental (E), and management (M) conditions is a long-standing challenge with implications for agriculture, medicine, and conservation. Most methods reduce the factors in a dataset (feature engineering) in a subjective and potentially oversimplified manner. Deep neural networks such as Multilayer Perceptrons (MLP) and Convolutional Neural Networks (CNN) can overcome this by allowing the data itself to determine which factors are most important. CNN models were developed for predicting agronomic yield from a combination of replicated trials and historical yield survey data. The CNN models were more accurate than standard methods when tested on held-out G, E, and M data (r=0.5 vs. r=0.4), but performed slightly worse than standard methods when only G was held out (r=0.74 vs. r=0.80). Pre-training on historical data increased accuracy compared with training on trial data alone. Saliency map analysis indicated the CNN has "learned" to prioritize many factors of known agricultural importance.
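
The two evaluation settings reported in the abstract (holding out G, E, and M together versus holding out G alone) can be illustrated with a minimal sketch. This is not the authors' pipeline: the record layout, the field names (`genotype`, `environment`), and the helper names `holdout_split` and `pearson_r` are hypothetical, chosen only to show how a held-out split and a Pearson correlation (the r values quoted above) might be computed.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def holdout_split(records, held_out):
    """Split records into (train, test) for a held-out evaluation.

    `held_out` maps a field name to the set of values reserved for testing.
    A record goes to the test set only if every listed field is held out,
    and to the training set only if none is. Records that share some but
    not all held-out values with the test set are dropped, so the test set
    stays unseen along every listed factor.
    """
    train, test = [], []
    for rec in records:
        flags = [rec[field] in values for field, values in held_out.items()]
        if all(flags):
            test.append(rec)
        elif not any(flags):
            train.append(rec)
    return train, test


# Hypothetical toy grid: 4 genotypes observed in 4 environments.
records = [
    {"genotype": f"g{g}", "environment": f"e{e}"}
    for g in range(4)
    for e in range(4)
]

# Setting 1: genotype and environment are both unseen at test time.
train_ge, test_ge = holdout_split(
    records, {"genotype": {"g3"}, "environment": {"e3"}}
)

# Setting 2: only the genotype is unseen; its environments appear in training.
train_g, test_g = holdout_split(records, {"genotype": {"g3"}})
```

In use, a model would be fit on the training records and `pearson_r` applied to its predictions and the observed yields of the test records. Dropping the partially overlapping records in the first setting is one possible design choice; it trades training data for a stricter test of generalization.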