Location: Virus and Prion Research
Title: classLog: Logistic regression for the classification of genetic sequences
ZELLER, MICHAEL - Duke-Nus Medical School
ARENDSEE, ZEBULUN - Oak Ridge Institute For Science And Education (ORISE)
SMITH, GAVIN - Duke-Nus Medical School
Submitted to: Frontiers in Virology
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 11/6/2023
Publication Date: 12/4/2023
Citation: Zeller, M.A., Arendsee, Z.W., Smith, G.J., Anderson, T.K. 2023. classLog: Logistic regression for the classification of genetic sequences. Frontiers in Virology. https://doi.org/10.3389/fviro.2023.1215012.
Interpretive Summary: As the cost of sequencing has fallen, diagnostic labs have begun generating large amounts of genetic sequence data. Traditional approaches to classifying these sequences can be time-consuming, requiring curated reference data and a subject-area expert to interpret phylogenetic trees. We overcome this bottleneck with a simple and intuitive machine learning classifier, classLog, that automatically builds a prediction model that can be applied to classify ‘unknown’ sequence data. The prediction model is portable, does not need to be retrained when new sequence data are generated, and requires no specialized training to implement. Together, these advancements reduce the time and computational resources required to classify an infectious agent from genetic sequence data. Further, classifications are given with a probability score, which reduces the need for interpretation, and the scoring system allows for the rapid detection of unknown samples that require further investigation. The classLog algorithm was validated with swine H1 influenza A virus (IAV) sequences and a porcine reproductive and respiratory syndrome virus (PRRSv) dataset under simulated conditions of sequence degradation. The classifier achieved near-perfect accuracy with real data, and >85% accuracy for the PRRSv test set using 10% of the generated features with 20% sequence degradation. classLog achieved 95% accuracy for the swine H1 IAV HA dataset using 5% of total features with 20% sequence degradation. This software is publicly accessible and will increase researchers’ ability to classify endemic circulating viruses and speed response efforts by helping diagnosticians rapidly identify new viral variants.
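The simulated sequence degradation and the probability-score screening described above can be illustrated with a minimal Python sketch. This is not classLog's code: the helper names `degrade_sequence` and `flag_uncertain`, the use of "X" as a masking character, the 0.8 threshold, and the sample and clade names are all assumptions made for illustration, and the summary does not state how degradation was actually simulated.

```python
import random

def degrade_sequence(seq, fraction, rng, mask="X"):
    """Mask a given fraction of positions in a sequence (hypothetical helper).

    Masking with "X" is an assumption; other degradation schemes are possible.
    """
    n_mask = round(len(seq) * fraction)
    positions = set(rng.sample(range(len(seq)), n_mask))
    return "".join(mask if i in positions else ch for i, ch in enumerate(seq))

def flag_uncertain(predictions, threshold=0.8):
    """Return sample IDs whose reported class probability is below `threshold`."""
    return [sample_id for sample_id, (_, prob) in predictions.items()
            if prob < threshold]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy 100-residue sequence built from the 20 standard amino acid letters.
original = "ACDEFGHIKLMNPQRSTVWY" * 5
degraded = degrade_sequence(original, 0.20, rng)

# Made-up (label, probability) pairs standing in for classifier output.
predictions = {
    "sample_1": ("clade_A", 0.99),
    "sample_2": ("clade_B", 0.55),
}
needs_review = flag_uncertain(predictions)
```

In this sketch, samples whose probability falls below the threshold are the ones set aside for further investigation; the threshold itself would need to be chosen for the dataset at hand.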
Technical Abstract: Sequencing and phylogenetic classification have become common tasks in human and animal diagnostic laboratories. It is routine to sequence pathogens to identify genetic variations of diagnostic significance and to use these data in real-time genomic contact tracing and surveillance. Under this paradigm, unprecedented volumes of data are generated that require rapid analysis to provide meaningful inference. We present a machine learning logistic regression pipeline that can assign classifications to genetic sequence data. The pipeline implements an intuitive and customizable approach to developing a trained prediction model that runs in linear time, generating accurate output more rapidly than other classification methods. Our approach was benchmarked on porcine reproductive and respiratory syndrome virus (PRRSv) and swine H1 influenza A virus (IAV) datasets. Trained classifiers were tested against sequences and against simulated datasets in which sequence quality was artificially degraded by 0, 10, 20, 30, and 40%. When applied to poor-quality sequence data, the classifier achieved >85% and 95% accuracy for the PRRSv and swine H1 IAV HA datasets, respectively, and this increased to near-perfect accuracy when using the full dataset. The model also identifies the amino acid positions used to determine genetic clade identity through a feature selection ranking within the model. These positions can be mapped onto a maximum-likelihood phylogenetic tree, allowing for the inference of clade-defining mutations. Our approach is implemented as a Python package with code available at https://github.com/flu-crew/classLog
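The general idea of a logistic regression classifier over sequence positions, with a feature ranking derived from the fitted model, can be sketched in plain Python. This is a toy illustration under stated assumptions, not classLog's implementation or API: it uses a binary (two-clade) model trained with simple stochastic gradient descent, one-hot (position, residue) features, and made-up five-residue sequences and labels. The function names `one_hot`, `train`, and `predict_proba` are hypothetical.

```python
import math

def one_hot(seqs):
    """Encode aligned sequences as binary (position, residue) features."""
    features = sorted({(i, ch) for s in seqs for i, ch in enumerate(s)})
    index = {f: j for j, f in enumerate(features)}
    rows = []
    for s in seqs:
        row = [0.0] * len(features)
        for i, ch in enumerate(s):
            row[index[(i, ch)]] = 1.0
        rows.append(row)
    return rows, features

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability that feature vector x belongs to class 1."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

def train(rows, labels, epochs=500, lr=0.5):
    """Fit binary logistic regression with per-sample gradient updates."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y
            bias -= lr * err
            weights = [w - lr * err * v for w, v in zip(weights, x)]
    return weights, bias

# Toy aligned amino acid sequences; only position 2 (0-based) differs
# consistently between the two made-up clades.
seqs = ["MKTAL", "MKTVL", "MKSAL", "MKSVL"]
labels = [1, 1, 0, 0]  # 1 = "clade_A", 0 = "clade_B" (hypothetical names)

rows, features = one_hot(seqs)
weights, bias = train(rows, labels)

# Rank features by absolute weight; the top feature indicates the
# position the model relies on most to separate the clades.
ranked = sorted(zip(features, weights), key=lambda fw: abs(fw[1]), reverse=True)
top_feature = ranked[0][0]  # a (position, residue) pair
probs = [predict_proba(weights, bias, x) for x in rows]
```

Ranking by absolute weight is one simple way to surface informative positions; the abstract does not specify which ranking method the package uses, so this choice is an assumption.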