Location: Sustainable Perennial Crops Laboratory
Title: Optimizing the genomic bit budget: An information-theoretic framework for trait-aware genotyping and precision breeding in Theobroma cacao
Author:
Ahn, Ezekiel
Baek, Insuck
KANDPAL, LALIT - Orise Fellow
Kirubakaran, Silvas
LIM, SEUNGHYUN - Orise Fellow
BHATT, JISHNU - Orise Fellow
Kim, Moon
Park, Sunchung
Meinhardt, Lyndel
Submitted to: Horticulture Research
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 3/12/2026
Publication Date: 3/19/2026
Citation: Ahn, E.J., Baek, I., Kandpal, L., Kirubakaran, S.J., Lim, S., Bhatt, J., Kim, M.S., Park, S., Meinhardt, L.W. 2026. Optimizing the genomic bit budget: An information-theoretic framework for trait-aware genotyping and precision breeding in Theobroma cacao. Horticulture Research. https://doi.org/10.1093/hr/uhag106.
DOI: https://doi.org/10.1093/hr/uhag106
Interpretive Summary: Cacao, the source of chocolate, suffers from a global "identity crisis" where trees in gene banks are frequently mislabeled or duplicated, making it difficult for breeders to find the best varieties. To solve this, we treated the cacao genome not as a biological mystery, but as a data storage problem with a limited "budget." We developed "CacaoCipher," a minimalist digital fingerprint that shrinks over 500 genetic markers down to just 32 essential "letters." Unlike a standard ID card that only tells you a name, this smart barcode was designed using a "genomic bit budget" strategy: we "spent" our limited data allowance to ensure the barcode captures not only the plant's unique identity and ancestry but also hidden signals predicting how much chocolate it can produce. We proved that this tiny 32-marker code is powerful enough to distinguish hundreds of trees and can even predict yield across different environments, from Trinidad to Puerto Rico. This research provides gene bank curators, chocolate manufacturers, and plant breeders with a low-cost, high-tech tool to clean up collections, authenticate premium beans, and deliver reliable, high-yielding trees to farmers.
Technical Abstract: Routine genotyping in Theobroma cacao is often hindered by the cost and computational redundancy of high-density arrays; to address this, we introduce an information-theoretic framework that treats marker selection as an allocation of a finite "genomic bit budget".
We compressed a 536-SNP dataset into a minimalist 32-SNP "CacaoCipher" barcode, identifying k=32 as the critical saturation point that maximizes information density (R = 0.259 bits/locus) while maintaining >95% identification accuracy even under simulated genotyping error rates of 5%. Beyond identity, we implemented a "trait-aware" selection algorithm that optimized the bit budget allocation, resulting in a 39–65% increase in marker-trait associations for Pod Index and anthocyanin intensity compared to random panels, without compromising forensic resolution. Cross-environmental validation confirmed that barcode coordinates derived from Trinidad (ICGT) significantly correlated with yield phenotypes in an independent Puerto Rican (TARS) trial (r = 0.49), demonstrating that the barcode encodes stable, agronomically relevant gradients. Decomposition of the barcode’s total entropy (~37 bits) revealed a strategic information partition: 58% for unique ID, 32% for ancestral structure, and 10% for yield potential. This partition establishes a generalized template for designing low-cost, multi-purpose genomic codes in clonally propagated crops.
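The "bit budget" idea in the abstract can be illustrated with a small sketch. The abstract does not describe the selection procedure itself, so the code below substitutes a simple stand-in: greedy selection of loci by joint Shannon entropy, stopping once additional markers add no information (a toy analogue of a saturation point). The function names (`locus_entropy`, `joint_entropy`, `greedy_panel`) and the four-accession genotype matrix are hypothetical and are not taken from the paper. The final lines only restate the reported partition (58%, 32%, 10% of roughly 37 bits) as approximate bit counts.

```python
import math
from collections import Counter

def locus_entropy(calls):
    """Shannon entropy (bits) of a list of hashable genotype calls."""
    n = len(calls)
    return -sum((c / n) * math.log2(c / n) for c in Counter(calls).values())

def joint_entropy(genotypes, loci):
    """Entropy (bits) of the multi-locus barcode built from the chosen loci."""
    barcodes = [tuple(g[i] for i in loci) for g in genotypes]
    return locus_entropy(barcodes)

def greedy_panel(genotypes, k):
    """Illustrative stand-in, not the authors' algorithm: add the locus that
    raises joint entropy the most, until k loci are chosen or gain is zero."""
    chosen, current = [], 0.0
    remaining = set(range(len(genotypes[0])))
    while remaining and len(chosen) < k:
        best = max(sorted(remaining),
                   key=lambda i: joint_entropy(genotypes, chosen + [i]))
        gain = joint_entropy(genotypes, chosen + [best]) - current
        if gain <= 1e-12:
            break  # saturation: no remaining locus adds information
        chosen.append(best)
        remaining.remove(best)
        current += gain
    return chosen, current

# Hypothetical toy data: 4 accessions x 4 SNPs, genotypes coded 0/1/2.
# Loci 2 and 3 are monomorphic and therefore carry zero bits.
toy_genotypes = [
    (0, 0, 0, 1),
    (0, 2, 0, 1),
    (2, 0, 0, 1),
    (2, 2, 0, 1),
]
panel, bits = greedy_panel(toy_genotypes, k=3)

# Reported partition of the ~37-bit barcode, converted to approximate bits.
TOTAL_BITS = 37  # approximate total entropy stated in the abstract
shares = {"unique ID": 0.58, "ancestral structure": 0.32, "yield potential": 0.10}
budget = {name: round(TOTAL_BITS * share, 1) for name, share in shares.items()}
```

In this toy case the selection stops after two loci even though three were requested, because the two informative SNPs already separate all four accessions. A trait-aware variant would add a term to the selection score that rewards association with a phenotype; that extension is not shown here because the abstract does not specify how the trade-off was weighted.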
