Research Project: MaizeGDB - Database and Computational Resources for Maize Genetics, Genomics, and Breeding Research

Location: Corn Insects and Crop Genetics Research

Title: Why do some predicted protein structures fold poorly? Benchmarking AlphaFold, ESMFold, and Boltz in maize

Author
Haley, Olivia - Oak Ridge Institute for Science and Education (ORISE)
Tibbs-Cortes, Laura - Oak Ridge Institute for Science and Education (ORISE)
Hayford, Rita - Oak Ridge Institute for Science and Education (ORISE)
Harding, Stephen - Oak Ridge Institute for Science and Education (ORISE)
Woodhouse, Margaret
Cannon, Ethalinda
Gardiner, Jack - University of Missouri
Portwood II, John
Sen, Taner
Kim, Hye-Seon
Andorf, Carson

Submitted to: bioRxiv
Publication Type: Pre-print Publication
Publication Acceptance Date: 7/6/2025
Publication Date: 7/6/2025
Citation: Haley, O., Tibbs-Cortes, L., Hayford, R.K., Harding, S., Woodhouse, M.H., Cannon, E.K., Gardiner, J.M., Portwood II, J.L., Sen, T.Z., Kim, H., Andorf, C.M. 2025. Why do some predicted protein structures fold poorly? Benchmarking AlphaFold, ESMFold, and Boltz in maize. bioRxiv. Article 2025.07.05.663230. https://doi.org/10.1101/2025.07.05.663230.
DOI: https://doi.org/10.1101/2025.07.05.663230

Interpretive Summary: Modern artificial intelligence methods can now predict the 3D shapes of proteins without relying on expensive and time-consuming lab experiments. These tools are having a transformative effect on biology and agricultural research. However, most artificial intelligence models are trained using human, animal, and microorganism data. This poses a problem for agricultural research, where understanding plant proteins is key to developing crops that yield more food and resist disease. In this study, USDA-ARS scientists tested five of the most advanced protein prediction programs on over 400 important corn genes. These genes are well-studied and play key roles in how the plant grows and responds to pathogens and pests. The results revealed that the tools often performed poorly when predicting the structure of corn-specific proteins. Some predictions were physically impossible, and others did not match known experimental data. This work shows that protein structure prediction for crops can produce inaccurate or misleading results. By identifying the limitations of current tools, the study lays a foundation for building better models tailored to plants. Improved prediction tools will support more innovative agricultural research, leading to stronger, more resilient crops and a more secure food supply.

Technical Abstract: Protein structure prediction tools have significantly reduced the time and cost to generate protein structures and accelerated protein discovery and design. However, plant proteins are underrepresented in the sequence and structural datasets used to train these programs. To quantify the downstream impact of this deficiency, we benchmarked five structure-prediction programs (AlphaFold 2, AlphaFold 3, ESMFold, Boltz-1, and Boltz-2) across 417 well-characterized Zea mays genes. These "classical" genes represent a set of well-studied genes with known genetic and phenotypic effects. We generated structures for each gene using these programs and compared how sequence, structural, and evolutionary conservation affected the structures' confidence and geometric features. Proteins lacking conserved sequence and/or structural domains had on average 25% to 43% lower confidence scores than proteins having both domains. Proteome-wide phylostratigraphy revealed that species-specific proteins had substantially lower confidence scores than proteins conserved among angiosperms and eukaryotes. Boltz-1 and ESMFold produced the most structures with severe geometry issues, including overlapping atoms and unlikely bond angles. We also compared computational and experimental alignments of Arabidopsis, maize, rice, wheat, and soybean proteins from the Protein Data Bank and identified structures showing incongruences with experimental data. This study challenges the assertion that protein folding has been completely "solved" and urges more investigation into benchmarking and standardized evaluation frameworks to improve model performance and assessment in agricultural crops.
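
The confidence comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes the confidence score is a per-residue value stored in the B-factor column of a PDB-format file, as AlphaFold's PDB outputs do with pLDDT; the abstract itself does not name the metric. The helper `demo_ca_line` and all numbers below are hypothetical and exist only to make the example runnable.

```python
from statistics import mean


def residue_confidences(pdb_text):
    """Return one confidence value per residue from PDB-format text.

    Assumption: the per-residue confidence sits in the B-factor field
    (columns 61-66), and one C-alpha atom represents each residue.
    """
    scores = []
    for line in pdb_text.splitlines():
        # Atom name occupies columns 13-16 of an ATOM record.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def percent_lower(reference_scores, other_scores):
    """Percent by which the mean of other_scores falls below the reference mean."""
    ref = mean(reference_scores)
    return 100.0 * (ref - mean(other_scores)) / ref


def demo_ca_line(resseq, score):
    """Hypothetical helper: build a column-correct C-alpha ATOM record."""
    return (
        f"ATOM  {resseq:>5}  CA  ALA A{resseq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{score:6.2f}"
    )


# Toy data, not values from the study.
both_domains = residue_confidences(
    "\n".join(demo_ca_line(i, s) for i, s in enumerate([90.0, 80.0], start=1))
)
lacking_domains = residue_confidences(
    "\n".join(demo_ca_line(i, s) for i, s in enumerate([60.0, 42.0], start=1))
)
drop = percent_lower(both_domains, lacking_domains)
```

With these toy inputs the group means are 85 and 51, so `drop` is a 40% reduction, the same kind of relative difference the abstract reports as 25% to 43%. Averaging per protein first and then across proteins, rather than pooling residues, would be the more natural choice on a real gene set.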