Location: Immunity and Disease Prevention Research
Title: TaxaHFE: a machine learning approach to collapse microbiome datasets using taxonomic structure
OLIVER, ANDREW - ORISE Fellow
Submitted to: bioRxiv
Publication Type: Pre-print Publication
Publication Acceptance Date: 6/7/2023
Publication Date: 6/7/2023
Citation: Oliver, A., Lemay, D.G. 2023. TaxaHFE: a machine learning approach to collapse microbiome datasets using taxonomic structure. bioRxiv. https://doi.org/10.1101/2023.06.06.543755.
Interpretive Summary: Investigating the relationship between diet, the microbiome, and health often necessitates the use of high-throughput methods such as DNA sequencing combined with dietary recalls. The datasets generated by these methods are feature-rich and require special consideration in their analysis. One common method employed is machine learning (ML), but ML algorithms can suffer from the "curse of dimensionality" due to many features describing relatively few samples. Feature engineering can help address this problem by preprocessing the data prior to ML model evaluation. Feature engineering can exploit a common thread between microbiome and dietary data, which is the hierarchical structure of the features themselves. We introduce a method for hierarchical feature engineering called TaxaHFE, which dynamically collapses hierarchical data based on taxonomic information together with information gain, maximizing the information contained at various taxonomic levels while reducing redundancy in the feature space. We demonstrate its utility on microbiome data and hierarchical food data represented by taxonomic trees and show that TaxaHFE often improves the performance of ML models while simultaneously increasing interpretability of the models.
Technical Abstract: Background: DNA sequencing and dietary recalls are powerful, high-throughput approaches for studying the relationship between nutrition, the microbiome, and health. However, both data types share a problem and a potential solution: the output is high-dimensional (the problem), and the features are related to one another hierarchically, i.e., taxonomically (the solution). Although machine learning (ML) can be a powerful tool for combing through high-dimensional data, it can suffer when too many features describe too few samples (i.e., p >> n). To reduce dimensionality for ML applications, and thereby increase model performance and interpretability, we sought to exploit the hierarchical relationships between features using an algorithmic approach to feature preprocessing. Results: Using six previously published datasets, we show that TaxaHFE results in an 85% reduction in the number of features (s.d. = 14.2%) compared to using the most complete taxonomy. Comparing ML models built on the most resolved taxonomic level (e.g., species) against models built on TaxaHFE-preprocessed features showed that the TaxaHFE models achieved an average increase of 3.7% in area under the receiver operating characteristic curve (ROC-AUC). Conclusions: Here we present a tool for dynamically collapsing hierarchical data (such as taxonomies in microbiome data) to reduce feature-rich datasets. The primary strengths of this method are threefold: 1) a dramatic decrease in the number of features, 2) an increase in the performance of machine learning models after preprocessing with TaxaHFE, and 3) the ability to use both categorical and continuous dependent variables. Future work should examine the utility of TaxaHFE on other hierarchically represented data, such as dietary data represented by food trees.
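Illustrative sketch: the general idea of collapsing features along a taxonomy can be shown with a small Python example. This is not the TaxaHFE implementation. It is a simplified stand-in that assumes a bottom-up rule in which a child taxon is kept only if it is more strongly associated with the outcome than its parent, and it swaps in absolute Pearson correlation where the summaries describe an information-gain criterion. The function names (`pearson_abs`, `collapse_by_taxonomy`), the `|` lineage separator, and the toy abundance values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def pearson_abs(x, y):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def collapse_by_taxonomy(leaf_abundance, outcome, sep="|"):
    """Pick one set of non-redundant taxa per lineage (simplified rule).

    leaf_abundance maps a full lineage string such as
    'Phylum|Genus|Species' to a list of per-sample abundances.
    """
    node_abundance = {}
    children = defaultdict(set)
    # Sum each leaf into every ancestor so parents hold pooled abundance.
    for lineage, values in leaf_abundance.items():
        parts = lineage.split(sep)
        for depth in range(1, len(parts) + 1):
            node = sep.join(parts[:depth])
            pooled = node_abundance.get(node, [0.0] * len(values))
            node_abundance[node] = [a + b for a, b in zip(pooled, values)]
            if depth > 1:
                children[sep.join(parts[: depth - 1])].add(node)

    # Assumption: association score stands in for information gain.
    score = {n: pearson_abs(v, outcome) for n, v in node_abundance.items()}

    def winners(node):
        # Leaves have no children and always represent themselves.
        if node not in children:
            return [node]
        below = [w for c in sorted(children[node]) for w in winners(c)]
        # Keep descendants only if they strictly beat the parent's score.
        better = [w for w in below if score[w] > score[node]]
        return better if better else [node]

    roots = sorted(n for n in node_abundance if sep not in n)
    chosen = [w for root in roots for w in winners(root)]
    return {n: node_abundance[n] for n in chosen}


# Toy data: six samples, binary outcome, four species-level features.
outcome = [0, 0, 0, 1, 1, 1]
leaf_abundance = {
    "Bacteroidetes|Bacteroides|B_x": [1, 0, 2, 3, 2, 4],
    "Bacteroidetes|Bacteroides|B_y": [0, 2, 0, 2, 3, 2],
    "Firmicutes|Lactobacillus|L_a": [1, 2, 1, 5, 6, 5],
    "Firmicutes|Lactobacillus|L_b": [5, 6, 5, 1, 2, 1],
}
selected = collapse_by_taxonomy(leaf_abundance, outcome)
print(sorted(selected))
```

In this toy input the two Bacteroides species are each weaker than their pooled abundance, so they collapse upward into a single feature, while the two Lactobacillus species move in opposite directions and cancel when pooled, so they are kept at species level. Four input features become three, which illustrates on a very small scale how a taxonomy-aware collapse can shrink the feature space without discarding the most informative level.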