Location: Sustainable Perennial Crops Laboratory
Title: Machine learning reveals signatures of transposable element activity driving agronomic and phenolic traits in mutagenized sorghum
Author:
Ahn, Ezekiel
Oh, Sookyung
BOTKIN, JACOB - University of Minnesota
BHATT, JISHNU - ORISE Fellow
LEE, DONGO - ORISE Fellow
MAGILL, CLINT - Texas A&M University
Submitted to: Discover Plants
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 8/12/2025
Publication Date: 9/10/2025
Citation: Ahn, E.J., Oh, S., Botkin, J., Bhatt, J., Lee, D., Magill, C. 2025. Machine learning reveals signatures of transposable element activity driving agronomic and phenolic traits in mutagenized sorghum. Discover Plants. 2. Article 265. https://doi.org/10.1007/s44372-025-00342-w.
DOI: https://doi.org/10.1007/s44372-025-00342-w
Interpretive Summary: Sorghum is a globally important cereal crop, valued for its resilience to heat and drought and its versatility for food, feed, and, increasingly, energy. To optimize sorghum for these uses, we need to understand the genes that control important traits like biomass yield and the levels of specific chemical compounds called phenolics, which can affect processing efficiency. This study investigated the genetic basis of these key traits in a diverse collection of 96 sorghum types, including specialized mutant lines, using their genetic background information (single nucleotide polymorphism markers, or SNPs) and measured characteristics. We used both standard statistical methods and advanced machine learning (computer-based) techniques to find links between the sorghum plants' genes and their traits. The standard method found significant gene associations for only one specific phenolic compound (luteolinidin diglucoside). However, the machine learning approaches identified numerous potential genetic marker locations (SNPs) that appear important for influencing most of the traits studied, including plant height, yield, soluble solids, total phenolics, and several individual phenolic compounds. Interestingly, the gene locations highlighted by machine learning were often different from those found using the standard method or reported in previous studies on this same dataset.
These newly identified markers are located near genes thought to be involved in regulating plant processes, metabolism, and transport. This research is important because it provides an expanded and refined list of candidate genes potentially controlling critical biomass and biochemical traits in sorghum. This knowledge can guide sorghum breeders and geneticists in developing improved varieties tailored for specific bioeconomy applications, such as enhanced biofuel yield or optimized chemical composition. This work benefits researchers studying plant genetics and those working to develop sustainable feedstocks for a greener economy.
Technical Abstract: This study investigated the genetic architecture of four agronomic traits (Heading Date, Plant Height, Soluble Solids Content, Dry Yield) and seven phenolic measurements (Total Phenolic Content and six individual compounds) in a diverse panel of 96 sorghum (Sorghum bicolor L.) genotypes, including radiation-induced mutants. We re-analyzed publicly available phenotypic and genotypic data (192,040 SNPs) using an integrated approach comparing single-SNP linear regression (RS) with machine learning (ML) models (Bootstrap Forest, BF; Boosted Tree, BT). Because dataset limitations (n << p) restricted predictive modeling, ML analyses focused on SNP explanatory importance scores (thresholds: BF = 0.01, BT = 0.05) derived from models trained on the full dataset using JMP Pro 17. Single-SNP analysis yielded significant associations (FDR < 0.01) exclusively for luteolinidin diglucoside, implicating SNPs near genes like SbRio.02G410900 (Peroxidase), SbRio.03G385800 (Multicopper oxidase), and SbRio.10G012200 (Methyl-CpG DNA binding). In contrast, BF and BT models prioritized numerous SNPs across most traits. Notably, top-ranked SNPs identified via ML were often distinct from RS or previous GWAS findings.
Examples include Chr6_25435000 (near F-box gene SbRio.06G044200) for Dry Yield, Chr1_75027213 (SbRio.01G506800, 1,4-dihydroxy-2-naphthoate phytyltransferase) for Soluble Solids Content, and Chr4_65777513 (SbRio.04G379500, Pleckstrin-homology domain) for Total Phenolic Content and luteolinidin. Strong concordance was observed between BF/BT SNP rankings and between SNP metrics for chemically related traits. These findings demonstrate that ML-based SNP importance ranking complements traditional association methods in moderately sized, diverse populations, providing a refined list of potentially novel candidate loci near genes with putative functions in regulation, metabolism, and transport relevant to sorghum biomass and biochemical traits. This work provides valuable targets for functional validation and marker-assisted breeding to optimize sorghum for bioeconomy applications.
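
The two selection rules named in the abstract (FDR < 0.01 for single-SNP regression; importance thresholds of 0.01 for BF and 0.05 for BT) can be illustrated with a minimal Python sketch. This is not the authors' JMP Pro 17 workflow. It assumes a Benjamini-Hochberg adjustment for the FDR step (the abstract does not name the procedure) and treats a score at or above the threshold as passing. The function names, SNP labels, and numeric values are hypothetical placeholders, not results from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the abstract reports "FDR < 0.01" without naming the
    procedure; BH is used here only as a common choice.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_by_fdr(p_by_snp, alpha=0.01):
    """Keep SNPs whose adjusted p-value is below alpha."""
    snps = list(p_by_snp)
    adjusted = benjamini_hochberg([p_by_snp[s] for s in snps])
    return [s for s, q in zip(snps, adjusted) if q < alpha]


def select_by_importance(score_by_snp, threshold):
    """Rank SNPs by importance score and keep those at or above threshold.

    Assumption: ">=" is used; the abstract gives only the threshold values.
    """
    ranked = sorted(score_by_snp.items(), key=lambda kv: kv[1], reverse=True)
    return [snp for snp, score in ranked if score >= threshold]


# Hypothetical inputs for illustration only.
p_values = {"snp_a": 0.00001, "snp_b": 0.004, "snp_c": 0.03, "snp_d": 0.5}
importance = {"snp_a": 0.12, "snp_b": 0.03, "snp_c": 0.008}

fdr_hits = select_by_fdr(p_values, alpha=0.01)
bf_hits = select_by_importance(importance, threshold=0.01)  # BF threshold
bt_hits = select_by_importance(importance, threshold=0.05)  # BT threshold
print(fdr_hits, bf_hits, bt_hits)
```

With 192,040 SNPs and 96 genotypes, the multiple-testing adjustment is far more stringent than in this four-SNP toy case, which may help explain why the regression approach flagged a single trait while the importance ranking surfaced candidates across most traits.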
