Author
JEONG, JIGHAN - University Of Washington
RESOP, JONATHAN - University Of Maryland
MUELLER, NATHAN - University Of Minnesota
Fleisher, David
KYUNGDAHM, YUN - Harvard University
BUTLER, ETHAN - University Of Minnesota
Timlin, Dennis
SHIM, KYO-MOON - National Academy Of Agricultural Science
GERBER, JAMES - University Of Minnesota
Reddy, Vangimalla
SOO-HYUNG, KIM - University Of Washington
Submitted to: PLOS ONE
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 5/19/2016
Publication Date: 6/7/2016
Citation: Jeong, J., Resop, J., Mueller, N., Fleisher, D.H., Kyungdahm, Y., Butler, E.E., Timlin, D.J., Shim, K., Gerber, J.S., Reddy, V., Soo-Hyung, K. 2016. Random Forests for Global and Regional Crop Yield Predictions. PLoS One. 11(6):1-15.

Interpretive Summary: Predicting how crops grow in different regions of the world is important for understanding how secure and safe our food supply is. Different kinds of mathematical models are used to make these predictions, and researchers need to keep improving the accuracy of these tools so that crop growth predictions are as realistic as possible. In this project, scientists used historical crop yield data from different regions of the world to determine whether newer modeling methods could produce more accurate yield predictions than older approaches. The datasets used in the research contained the yields of corn, potato, and wheat from parts of the United States and Europe. Two different modeling methods were tested. The results indicated that a newer method called random forest regression could make more accurate forecasts than an older approach, as long as the scientists developing the model were careful about how they selected the data. This new approach was no more difficult to use than the older one, suggesting that scientists and food security planners could use random forest regression to improve their regional food assessments.

Technical Abstract: Traditional regression models have limitations when applied to predicting crop yield responses at multiple spatial scales. An alternative modeling method, Random Forest (RF) regression, was used to predict crop yield responses for wheat, maize, and potato at regional scales. This RF regression approach was compared with multiple linear regression (MLR) models for the same yield responses.
Two regional datasets were used: the so-called 'wheat mega-environment 6' for wheat and the 'eastern seaboard region' for maize and potato. Each dataset was randomly split into training and test sets. RF regression outperformed the MLR models in predicting the yield of all three crops in the selected regions, as determined by pseudo correlation coefficients and root mean square errors (RMSE). For example, correlation coefficients for RF averaged 0.93, 0.86, and 0.87 for wheat, potato, and corn, respectively, compared to MLR values of 0.59, 0.33, and 0.36 for the same three crops. Similarly, RMSE for the RF regressions was 0.22, 1.69, and 1.35 tons per hectare, compared to MLR values of 0.67, 4.64, and 3.63 tons per hectare. However, RF regression exhibited a slight tendency to over-fit when the range of the target data distribution was entirely contained within that of the training data. This resulted in a loss of accuracy for the RF regression approach when predicting yield responses at the extreme ends of the input data set. Overall, the results suggested that RF regression is an effective machine-learning method for crop yield modeling at regional scales as compared to MLR models. However, careful selection of the training data is needed to minimize the tendency to over-fit.
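
The random training/test split and the two evaluation statistics named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the 30% test fraction, and the seed are hypothetical, and the Pearson coefficient stands in for the "pseudo correlation coefficients", which the abstract does not define.

```python
import math
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Randomly split rows into (train, test) lists.

    The 30% test fraction and the seed are illustrative assumptions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def rmse(observed, predicted):
    """Root mean square error, in the units of the yield data (e.g. t/ha)."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values.

    Assumption: used here as a stand-in for the abstract's
    'pseudo correlation coefficient'.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


if __name__ == "__main__":
    # Made-up yields (t/ha) for illustration only; not data from the study.
    observed = [3.0, 4.0, 5.0, 6.0]
    predicted = [2.5, 4.5, 5.0, 6.0]
    print("RMSE:", rmse(observed, predicted))
    print("r:", pearson_r(observed, predicted))
```

In a comparison like the one described, both statistics would be computed on the held-out test set for each model (RF and MLR), so that a lower RMSE and a higher correlation indicate better out-of-sample prediction.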