
Title: AN ECOINFORMATICS TOOL FOR MICROBIAL COMMUNITY STUDIES: SUPERVISED CLASSIFICATION OF AMPLICON LENGTH HETEROGENEITY (ALH) PROFILES OF 16S RRNA

Author
YANG, CHENGYONG - FLORIDA INTERNAT'L UNIV
MILLS, DEETTA - FLORIDA INTERNAT'L UNIV
MATHEE, KALAI - FLORIDA INTERNAT'L UNIV
WANG, YONG - FLORIDA INTERNAT'L UNIV
JAYACHANDRAN, KRISH - FLORIDA INTERNAT'L UNIV
SIKAROODI, MASOUMEH - GEORGE MASON UNIVERSITY
GILLEVET, PATRICK - GEORGE MASON UNIVERSITY
Entry, James
NARASIMHAN, GIRI - FLORIDA INTERNAT'L UNIV

Submitted to: Journal of Microbiological Methods
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 6/24/2005
Publication Date: 9/1/2005
Citation: Yang, C., Mills, D., Mathee, K., Wang, Y., Jayachandran, K., Sikaroodi, M., Gillevet, P., Entry, J.A., Narasimhan, G. 2005. An ecoinformatics tool for microbial community studies: Supervised classification of amplicon length heterogeneity (ALH) profiles of 16S rRNA. Journal of Microbiological Methods. 65:49-62.

Interpretive Summary: Support vector machines (SVM) and K-nearest neighbors (KNN) are two computational machine learning tools that perform supervised classification. They were used to identify and compare different types of microbial communities. Amplicon length heterogeneity (ALH) profiles from several hypervariable regions of 16S rRNA of eubacterial communities from Idaho agricultural soil samples and from Chesapeake Bay marsh sediments were analyzed separately. The profiles from all available hypervariable regions were concatenated to obtain a combined profile, which was then provided to the SVM and KNN classifiers. We hypothesized that after a learning phase using feature vectors from labeled ALH profiles, both classifiers would be able to predict the labels of previously unseen samples. The resulting classifiers predicted the labels of the Idaho soil samples with high accuracy. They were less accurate for the Chesapeake Bay sediments, suggesting greater similarity among the bay's microbial communities at the sampled sites. The profiles obtained from the V1+V2 region were more informative than those obtained from any other single region. However, combining them with profiles from the V1 region (with or without the profiles from the V3 region) resulted in the most accurate classification of the samples. The addition of profiles from the V9 region appeared to confound the classifiers. Our results show that SVM and KNN classifiers can be effectively applied to distinguish between eubacterial communities from different ecosystems based only on their ALH profiles.
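
The concatenate-then-classify workflow described above can be illustrated with a minimal sketch. It covers only the KNN half; the SVM is omitted. The summary does not state the distance metric, the value of K, or the data layout, so the Euclidean distance, k=3, the three-bin profiles, the site labels, and all function names below are assumptions made purely for illustration, and the numbers are invented toy values rather than data from the study.

```python
import math
from collections import Counter

# Hypothetical region order used to build the combined profile.
REGION_ORDER = ["V1", "V1+V2", "V3"]


def concatenate_profiles(region_profiles, region_order):
    """Join per-region ALH profiles (relative peak abundances) into one feature vector."""
    vector = []
    for region in region_order:
        vector.extend(region_profiles[region])
    return vector


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k training vectors closest to the query."""
    distances = sorted(
        (euclidean(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Invented toy profiles: three length bins per region, two hypothetical sites.
TRAINING = [
    ({"V1": [0.70, 0.20, 0.10], "V1+V2": [0.60, 0.30, 0.10], "V3": [0.50, 0.40, 0.10]}, "site_A"),
    ({"V1": [0.60, 0.30, 0.10], "V1+V2": [0.70, 0.20, 0.10], "V3": [0.60, 0.30, 0.10]}, "site_A"),
    ({"V1": [0.65, 0.25, 0.10], "V1+V2": [0.65, 0.25, 0.10], "V3": [0.55, 0.35, 0.10]}, "site_A"),
    ({"V1": [0.10, 0.20, 0.70], "V1+V2": [0.10, 0.30, 0.60], "V3": [0.10, 0.40, 0.50]}, "site_B"),
    ({"V1": [0.10, 0.30, 0.60], "V1+V2": [0.10, 0.20, 0.70], "V3": [0.10, 0.30, 0.60]}, "site_B"),
    ({"V1": [0.10, 0.25, 0.65], "V1+V2": [0.10, 0.25, 0.65], "V3": [0.10, 0.35, 0.55]}, "site_B"),
]

# Learning phase: KNN simply stores the labeled combined profiles.
train_vectors = [concatenate_profiles(p, REGION_ORDER) for p, _ in TRAINING]
train_labels = [label for _, label in TRAINING]

# Prediction phase: classify a previously unseen (also invented) sample.
unseen = {"V1": [0.68, 0.22, 0.10], "V1+V2": [0.62, 0.28, 0.10], "V3": [0.55, 0.35, 0.10]}
prediction = knn_predict(train_vectors, train_labels, concatenate_profiles(unseen, REGION_ORDER))
print(prediction)
```

Changing `REGION_ORDER` to include or drop a region is one way to compare region combinations in such a sketch, which is the kind of comparison the summary reports for V1, V1+V2, V3, and V9.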
