Location: Sustainable Perennial Crops Laboratory
Title: Integrated phenotypic analysis, predictive modeling, and identification of novel trait-associated loci in a diverse Theobroma cacao collection
Author
Baek, Insuck
CHA, MINYEOK - Orise Fellow
LIM, SEUNGHYUN - Orise Fellow
Irish, Brian
Oh, Sookyung
UPADHYAY, RAKESH - Bowie State University
BHATT, JISHNU - Orise Fellow
Kim, Moon
Meinhardt, Lyndel
Park, Sunchung
Ahn, Ezekiel

Submitted to: BMC Plant Biology
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 7/25/2025
Publication Date: 8/9/2025
Citation: Baek, I., Cha, M., Lim, S., Irish, B.M., Oh, S., Upadhyay, R., Bhatt, J., Kim, M.S., Meinhardt, L.W., Park, S., Ahn, E.J. 2025. Integrated phenotypic analysis, predictive modeling, and identification of novel trait-associated loci in a diverse Theobroma cacao collection. BMC Plant Biology. https://doi.org/10.1186/s12870-025-07128-y.
DOI: https://doi.org/10.1186/s12870-025-07128-y

Interpretive Summary: Cacao, the source of chocolate, is a vital crop for millions of smallholder farmers; however, its production is hindered by diseases and the impacts of climate change. To develop better cacao varieties, breeders require a comprehensive understanding of the genes that control key traits, such as yield and disease resistance. This study offers a comprehensive examination of a diverse collection of 173 cacao trees in Puerto Rico, utilizing modern science to link their physical characteristics with their genetic makeup. Our research employed multiple approaches to discover innovative methods for enhancing cacao. We developed highly accurate machine learning models that can predict a tree’s yield potential solely from examining its pods and infection levels, providing a powerful new tool for breeders. We also compared our data with collections from Trinidad and Colombia, finding that some traits, such as the number of seeds in a pod, are stable across different environments, while also identifying intriguing links between seed size and disease susceptibility. Most importantly, by integrating genetic data, we identified a suite of novel genetic markers associated with key horticultural traits. We identified several specific genes associated with total pod number, disease infection rate, and overall yield. A key finding was a single marker on chromosome 5 that influences both pod number and yield, located in a gene that helps plants respond to stress.
This research provides cacao breeders and scientists with robust predictive models and a new set of genetic markers to accelerate the development of high-yielding, resilient cacao varieties. Ultimately, these tools can help secure the future of chocolate production and support the livelihoods of farmers worldwide.

Technical Abstract: The genetic improvement of Theobroma cacao L. is essential for sustainable production but requires a deep understanding of its complex trait architecture. This study performed an integrated characterization of a diverse collection of 173 cacao accessions evaluated in Puerto Rico, combining multi-year phenotypic data with comparative genomics and machine learning (ML). Analyses of 28 accessions common to a published Trinidad dataset revealed significant correlations between genetic cluster membership (based on Bekele et al. K = 7 groups) and horticultural traits, including a positive correlation between 'AMAZ,IMC' membership and Yield (r = 0.50). Comparative analysis between the Puerto Rico and Trinidad environments identified stable traits, such as 'Number of seeds' (correlation = 0.50), and key G×E interactions, including a potential trade-off between seed mass and infection rate. A Neural Boosted ML model accurately predicted yield (validation R² > 0.99), identifying 'Total pods', 'Infection rate', and 'Pod weight' as the most influential predictors. Furthermore, a genome-wide association study (GWAS) on 28 common accessions identified multiple significant marker-trait associations (FDR < 0.01). Key associations included: TcSNP475 (in a putative zinc finger SAP gene, Tc05_t008610) with both 'Total pods' and 'Yield'; TcSNP508 (in a cysteine protease gene, Tc08v2_g002970) with 'Infection rate'; and TcSNP483 (in an ATP synthase subunit gene, Tc03v2_g019660) with 'Yield'.
This study provides a robust phenotypic and correlational landscape of this germplasm, delivers highly accurate ML-based yield predictors, and identifies a suite of novel, validated genetic markers for key horticultural traits. These integrated findings offer valuable resources for advancing cacao breeding programs through marker-assisted selection and genomic prediction.
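
For readers unfamiliar with the FDR threshold quoted for the marker-trait associations, the following minimal sketch shows how a false discovery rate cutoff can be applied to a set of GWAS p-values. The abstract does not state which FDR procedure was used; the Benjamini-Hochberg step-up method here is an assumption chosen because it is the most common one, and the function name `benjamini_hochberg` and the p-values are hypothetical illustrations, not data from the study.

```python
def benjamini_hochberg(p_values, alpha=0.01):
    """Flag tests significant at FDR level alpha (Benjamini-Hochberg step-up).

    Assumption: this is an illustrative procedure, not necessarily the one
    used in the study. Returns a list of booleans in the input order.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Every test ranked at or below max_k is declared significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            significant[idx] = True
    return significant


# Hypothetical p-values for five markers (not taken from the paper).
example_p_values = [0.00001, 0.0004, 0.03, 0.2, 0.6]
print(benjamini_hochberg(example_p_values, alpha=0.01))
```

With these made-up values, only the two smallest p-values pass the 0.01 threshold. Because the procedure is step-up, a marker can be declared significant even when a smaller-ranked p-value misses its own per-rank cutoff, as long as some larger rank passes.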
