Location: Genomics and Bioinformatics Research
Project Number: 6066-21310-006-004-I
Project Type: Interagency Reimbursable Agreement
Start Date: Sep 1, 2019
End Date: Aug 31, 2024
Rationale: Salmonellosis sickens 1.2 million people and has an estimated economic impact of $152 billion annually in the United States. Epidemiological investigations of Salmonella outbreaks rely on multiple, time-consuming rounds of culturing Salmonella and identifying it by antisera typing, multi-locus sequence typing (MLST) or genome sequencing to trace an outbreak to its source. Metagenomics has the potential to cut the time needed to identify a strain from 4 days to 1 or 2 days, but current molecular typing methods such as MLST do not work well on fragmented metagenomic data.
Overall goal: We will develop local software and a web application to analyze, report and track Salmonella detected by metagenomics using novel population genomics-based methods.
Our approach identifies Salmonella strains and their properties directly from metagenomic data. During training, a consensus sequence will be built for each of the 334 core genes, and a large database of approximately 56,000 genomes by ~900,000 single nucleotide polymorphisms (SNPs) will be created. From that data, a locality-sensitive hashing (LSH) index will be built to rapidly identify the closest strains by weighted cosine similarity. To predict strain phenotypes, machine learning models will be created by first reducing dimensionality and sparsity with a variational autoencoder, then training feed-forward multilayer neural networks with dropout to predict the desired attributes. During sample analysis, reads will be rapidly matched to the core genes by k-mer search to reduce the search space. The matched reads will be error-corrected, merged and aligned to the reference, and SNPs will be called. The nearest matching strains will then be identified by LSH search. Other strain attributes, such as serotype, MLST type, antimicrobial resistance and pathogenicity, will be predicted by machine learning models trained on the input genomes.
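The LSH lookup by weighted cosine similarity described above can be illustrated with a minimal sketch. The project summary does not say which LSH family is used, so this example assumes random-hyperplane (sign) hashing, a standard choice for cosine similarity, and assumes each strain is encoded as a 0/1 vector over SNP sites. The class name HyperplaneLSH, the function weighted_cosine, the per-SNP weights and the strain names are all hypothetical and chosen only for illustration. Weighted cosine similarity is handled by scaling each dimension by the square root of its weight before hashing, which is equivalent to ordinary cosine similarity on the scaled vectors.

```python
import math
import random
from collections import defaultdict


def weighted_cosine(x, y, w):
    """Cosine similarity with a per-dimension weight w[i] (assumed non-negative)."""
    dot = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    nx = math.sqrt(sum(wi * xi * xi for wi, xi in zip(w, x)))
    ny = math.sqrt(sum(wi * yi * yi for wi, yi in zip(w, y)))
    return dot / (nx * ny) if nx and ny else 0.0


class HyperplaneLSH:
    """Illustrative random-hyperplane LSH index (hypothetical, not the project's code)."""

    def __init__(self, dim, weights, n_bits=8, n_tables=4, seed=0):
        rng = random.Random(seed)
        self.weights = list(weights)
        # Scaling by sqrt(w) turns weighted cosine into plain cosine.
        self.sqrt_w = [math.sqrt(wi) for wi in self.weights]
        # One set of n_bits random hyperplanes per hash table.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
            for _ in range(n_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.vectors = {}

    def _signature(self, x, planes):
        scaled = [s * xi for s, xi in zip(self.sqrt_w, x)]
        # Each bit records which side of a hyperplane the scaled vector falls on.
        return tuple(
            sum(p * v for p, v in zip(plane, scaled)) >= 0.0 for plane in planes
        )

    def add(self, name, x):
        self.vectors[name] = list(x)
        for planes, table in zip(self.planes, self.tables):
            table[self._signature(x, planes)].append(name)

    def query(self, x, k=1):
        # Collect strains sharing a bucket with the query in any table,
        # then rank only those candidates by exact weighted cosine similarity.
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table.get(self._signature(x, planes), []))
        ranked = sorted(
            candidates,
            key=lambda n: weighted_cosine(x, self.vectors[n], self.weights),
            reverse=True,
        )
        return ranked[:k]


if __name__ == "__main__":
    snp_weights = [1.0, 2.0, 1.0, 0.5, 1.0, 1.0]  # hypothetical per-SNP weights
    index = HyperplaneLSH(dim=6, weights=snp_weights, n_bits=4, n_tables=3, seed=1)
    index.add("strain_A", [1, 0, 1, 0, 0, 1])
    index.add("strain_B", [0, 1, 0, 1, 1, 0])
    print(index.query([1, 0, 1, 0, 0, 1], k=1))
```

Using several hash tables raises the chance that a true near neighbor shares at least one bucket with the query, at the cost of more memory. At the scale described (about 56,000 genomes by ~900,000 SNPs) a practical implementation would need sparse or vectorized arithmetic rather than the pure-Python loops shown here.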