Research article - Peer-reviewed, 2022

Potential Use of Data-Driven Models to Estimate and Predict Soybean Yields at National Scale in Brazil

Monteiro, Leonardo A.; Ramos, Rafael M.; Battisti, Rafael; Soares, Johnny R.; Oliveira, Julianne C.; Figueiredo, Gleyce K. D. A.; Lamparelli, Rubens A. C.; Nendel, Claas; Lana, Marcos Alberto


Large-scale assessment of crop yields plays a fundamental role in agricultural planning and in achieving food security goals. In this study, we evaluated the robustness of data-driven models for estimating soybean yields at 120 days after sowing (DAS) in the main producing regions of Brazil, and evaluated the reliability of the "best" data-driven model as a tool for early prediction of soybean yields in an independent year. Our methodology explicitly describes a general approach for combining publicly available databases and building data-driven models (multiple linear regression, MLR; random forests, RF; and support vector machines, SVM) to predict yields at large scales using gridded weather and soil data. We filtered out counties with missing or suspicious yield records, resulting in a crop yield database containing 3450 records (23 years × 150 "high-quality" counties). RF and SVM had similar results in the calibration and validation steps, whereas MLR showed the poorest performance. Our analysis revealed the potential of data-driven models for predicting soybean yields at large scales in Brazil around one month before harvest (i.e. at 90 DAS). Using a well-trained RF model to predict crop yield for a specific year at 90 DAS, the RMSE ranged from 303.9 to 1055.7 kg ha⁻¹, representing a relative error (rRMSE) between 9.2 and 41.5%. Although we demonstrated robust data-driven models for yield prediction at large scales in Brazil, there is still room to improve their accuracy. The inclusion of explanatory variables related to the crop (e.g. growing degree-days, flowering dates) and the environment (e.g. remotely sensed vegetation indices, number of dry and heat days during the cycle), as well as outputs from process-based crop simulation models (e.g. biomass, leaf area index and plant phenology), are potential strategies to improve model accuracy.
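The two error metrics quoted in the abstract can be illustrated with a minimal sketch. It assumes that rRMSE is the RMSE divided by the mean observed yield and expressed as a percentage, which is a common definition but is not confirmed by the abstract. The function names and the yield values are hypothetical and are not taken from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, in the same unit as the inputs (e.g. kg/ha)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rrmse(observed, predicted):
    """Relative RMSE in percent (assumed definition: RMSE / mean observed * 100)."""
    mean_observed = sum(observed) / len(observed)
    return rmse(observed, predicted) / mean_observed * 100.0


# Hypothetical county yields in kg/ha, for illustration only.
observed_yield = [3000.0, 3200.0, 2800.0]
predicted_yield = [3100.0, 3000.0, 2900.0]

print(f"RMSE:  {rmse(observed_yield, predicted_yield):.1f} kg/ha")
print(f"rRMSE: {rrmse(observed_yield, predicted_yield):.1f} %")
```

Normalizing by the mean observed yield makes errors comparable between regions with different yield levels, which would explain why the abstract reports both an absolute and a relative range.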


Keywords

Large-scale analysis; Machine learning approaches; Public databases; Geospatial and temporal variability; Climatic and soil variables

Published in

International Journal of Plant Production
2022, volume: 16, number: 4, pages: 691-703
Publisher: Springer

Authors' information

Monteiro, Leonardo A.
Food and Agriculture Organization of the United Nations (FAO)
Monteiro, Leonardo A.
University of Kentucky
Ramos, Rafael M.
University UNIEURO
Battisti, Rafael
Universidade Federal de Goiás
Soares, Johnny R.
Universidade Estadual de Campinas
Oliveira, Julianne C.
Chalmers University of Technology
Figueiredo, Gleyce K. D. A.
Universidade Estadual de Campinas
Lamparelli, Rubens A. C.
Ctr Energy Planning NIPE
Nendel, Claas
Leibniz-Zentrum für Agrarlandschaftsforschung (ZALF)
Nendel, Claas
University of Potsdam
Nendel, Claas
Czech Academy of Sciences
Lana, Marcos Alberto
Swedish University of Agricultural Sciences, Department of Crop Production Ecology

Sustainable Development Goals

SDG2 Zero hunger

UKÄ Subject classification

Agricultural Science
