Location: Plant Genetics Research
Title: Predicting phenotypes from genetic, environment, management, and historical data using CNNs
|CIMEN, EMRE - Eskisehir Osmangazi University|
|RAMSTEIN, GUILLAUME - Aarhus University|
|REEVES, TIMOTHY - Cornell University|
|O'BRIANT, PATRICK - Cornell University|
|MCLEAN, GREG - University Of Queensland|
|COOPER, MARK - University Of Queensland|
|HAMMER, GRAEME - University Of Queensland|
|Buckler, Edward - Ed|
Submitted to: Theoretical and Applied Genetics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 8/18/2021
Publication Date: 8/27/2021
Citation: Washburn, J.D., Cimen, E., Ramstein, G., Reeves, T., O'Briant, P., McLean, G., Cooper, M., Hammer, G., Buckler IV, E.S. 2021. Predicting phenotypes from genetic, environment, management, and historical data using CNNs. Theoretical and Applied Genetics. 134:3997–4011. https://doi.org/10.1007/s00122-021-03943-7.
Interpretive Summary: Predicting phenotypes from a combination of genetic and environmental (including human-imposed management) conditions is a long-standing scientific challenge with practical implications for agriculture, medicine, and conservation. Convolutional Neural Network (CNN) models show promise for prediction under many complex scenarios, but are only beginning to enter the life sciences. This manuscript explores the use of these models in agricultural prediction and demonstrates that, under the conditions and datasets used, a CNN model can outperform standard genomic prediction while also offering increased data flexibility and various tools for interpretation.
Technical Abstract: Predicting phenotypes from genetic (G), environmental (E), and management (M) conditions is a long-standing challenge with implications for agriculture, medicine, and conservation. Most methods reduce the factors in a dataset (feature engineering) in a subjective and potentially oversimplified manner. Deep neural networks such as Multilayer Perceptrons (MLP) and Convolutional Neural Networks (CNN) can overcome this by allowing the data itself to determine which factors are most important. CNN models were developed for predicting agronomic yield from a combination of replicated trials and historical yield survey data. The CNN models were more accurate than standard methods when tested on held-out G, E, and M data (r=0.5 vs r=0.4), but performed slightly worse than standard methods when only G was held out (r=0.74 vs r=0.80). Pre-training on historical data increased accuracy compared to trial data alone. Saliency map analysis indicated the CNN has “learned” to prioritize many factors of known agricultural importance.
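
The idea behind a saliency map can be illustrated with a minimal sketch: score each input factor by how strongly the model's prediction changes when that factor is nudged. The example below is an assumption-laden illustration, not the authors' implementation. It uses central finite differences in place of the backpropagated gradients typically used with a CNN, and `toy_model`, `weights`, and the three input values are hypothetical stand-ins for a trained network and its G, E, and M inputs.

```python
import numpy as np


def saliency(predict, x, eps=1e-4):
    """Approximate |d predict / d x_i| for each input by central differences."""
    x = np.asarray(x, dtype=float)
    grads = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = eps
        # Slope of the prediction along input i, holding the others fixed.
        grads.flat[i] = (predict(x + step) - predict(x - step)) / (2 * eps)
    return np.abs(grads)


# Hypothetical stand-in for a trained model (NOT the CNN from the paper):
# a single tanh unit whose first input has no influence on the output.
weights = np.array([0.0, 2.0, -0.5])


def toy_model(x):
    return float(np.tanh(x @ weights))


x = np.array([0.1, 0.2, 0.3])        # hypothetical input factors
sal = saliency(toy_model, x)
ranking = np.argsort(sal)[::-1]      # input indices, most influential first
print(sal, ranking)
```

Inputs with larger saliency values are those the model relies on most for this particular prediction; in the toy model the second input dominates and the first contributes nothing, mirroring its weights.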