Location: Egg and Poultry Production Safety Research Unit
Title: Using core genome and machine learning for serovar prediction in Salmonella enterica subspecies I strains
Authors:
Li, Xiang
Oladeinde, Adelumola
Rothrock Jr, Michael
CHUNG, TAE JUNG - Department Of Energy
HAKEEM, WALID GHAZI AL - Department Of Energy

Submitted to: FEMS Microbiology Letters
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 4/8/2025
Publication Date: 4/21/2025
Citation: Li, X., Oladeinde, A.A., Rothrock Jr, M.J., Chung, T., Hakeem, W. 2025. Using core genome and machine learning for serovar prediction in Salmonella enterica subspecies I strains. FEMS Microbiology Letters. https://doi.org/10.1093/femsle/fnaf040.
DOI: https://doi.org/10.1093/femsle/fnaf040

Interpretive Summary: We studied Salmonella, a bacterium that often causes food poisoning, in two main ways. First, we created a computer program that can identify different types of Salmonella by studying their DNA. Using over 1,000 Salmonella samples, our program identified different Salmonella types with more than 90% accuracy. Second, we looked at how Salmonella's important genes work together and change over time. We found that genes that help Salmonella cause disease and fight off antibiotics are carefully preserved by the bacteria, which shows how important these genes are for bacterial survival. This research helps us better understand how Salmonella works and provides a new, reliable way to identify different types of Salmonella, which could help track and control food poisoning outbreaks in the future.

Technical Abstract: This study presents a dual investigation of Salmonella enterica subspecies I, focusing on serovar prediction and core genome characteristics. Using two comprehensive databases, panX (500 strains, 85 serovars) and NCBI Pathogen Detection (575 strains, 62 serovars), we evaluated supervised machine learning approaches for serovar classification based on core genome dissimilarity data. Among the four tested algorithms, the Random Forest model performed best, achieving 90.3% accuracy with the panX dataset and 95.3% with the NCBI dataset, and it was particularly effective when trained on more than 50% of the available data. When combined with hierarchical clustering validation, our approach achieved 100% prediction accuracy on the simulated data.
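The classification setup described above can be illustrated with a minimal sketch, assuming scikit-learn and NumPy are available. It is not the authors' pipeline: the data are synthetic, Euclidean distance between made-up strain profiles stands in for core genome dissimilarity, and the names `make_synthetic_dissimilarity` and `evaluate` are hypothetical. Each strain is represented by its dissimilarity to the training strains, and a Random Forest is scored at several training fractions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def make_synthetic_dissimilarity(n_serovars=5, per_serovar=20, n_loci=50, seed=0):
    """Return (D, labels): a symmetric dissimilarity matrix and serovar labels.

    Synthetic stand-in for core genome dissimilarity data, not real genomes.
    """
    rng = np.random.default_rng(seed)
    # One random "center" profile per serovar, plus per-strain noise.
    centers = rng.normal(0.0, 5.0, size=(n_serovars, n_loci))
    labels = np.repeat(np.arange(n_serovars), per_serovar)
    profiles = centers[labels] + rng.normal(0.0, 1.0, size=(labels.size, n_loci))
    # Pairwise Euclidean distance between strain profiles.
    diff = profiles[:, None, :] - profiles[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    return D, labels


def evaluate(D, labels, train_fraction, seed=0):
    """Train a Random Forest on a stratified subset and return test accuracy."""
    idx = np.arange(labels.size)
    train_idx, test_idx = train_test_split(
        idx, train_size=train_fraction, stratify=labels, random_state=seed
    )
    # Features: each strain's dissimilarity to the training strains only,
    # so no information about test strains enters the training features.
    X_train = D[np.ix_(train_idx, train_idx)]
    X_test = D[np.ix_(test_idx, train_idx)]
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train, labels[train_idx])
    return accuracy_score(labels[test_idx], model.predict(X_test))


if __name__ == "__main__":
    D, labels = make_synthetic_dissimilarity()
    for frac in (0.3, 0.5, 0.7):
        acc = evaluate(D, labels, frac)
        print(f"train fraction {frac:.1f}: accuracy {acc:.3f}")
```

Restricting the feature columns to training strains is a design choice of this sketch; it mirrors the situation where a new isolate is compared only against known reference strains.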
Parallel analysis of panX core genome characteristics revealed that pathogenicity-related genes (including sseA, invA, mgtC, phoP, phoQ, and sitA) exhibited similar phylogenetic topologies distinct from the core genome species tree, suggesting shared evolutionary histories. Notably, all identified core antibiotic resistance genes and virulence factors showed evidence of negative selection, indicating their essential role in bacterial survival. This study not only presents a promising machine learning-based alternative for Salmonella serovar classification, particularly valuable when analyzing newly identified serovars alongside known reference strains, but also provides insights into the evolutionary dynamics of core virulence-associated genes, contributing to our understanding of Salmonella genomic architecture and pathogenicity.
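The abstract does not state how selection was assessed. Negative (purifying) selection is commonly inferred when amino-acid-changing (nonsynonymous) substitutions are rare relative to silent (synonymous) ones, that is, when dN/dS is below 1. The sketch below only illustrates that distinction by counting raw codon differences between two aligned coding sequences. It is a simplification, not a dN/dS estimator, because it does not normalize by the number of synonymous and nonsynonymous sites. The function name `count_substitutions` and the example sequences are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_substitutions(seq_a, seq_b):
    """Count (synonymous, nonsynonymous) codon differences.

    Assumes two aligned, in-frame coding sequences of equal length that
    contain only A, C, G and T (no gaps or ambiguity codes). A codon pair
    is classified solely by whether the encoded amino acid changes.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


if __name__ == "__main__":
    # Made-up sequences: GCT->GCC and AAA->AAG are silent, GGT->GAT is not.
    print(count_substitutions("ATGGCTAAAGGT", "ATGGCCAAGGAT"))  # (2, 1)
```

A gene under strong purifying selection would typically show far more silent than amino-acid-changing differences across many strain pairs. A real analysis would use a site-normalized estimator over a full alignment.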
