Location: Immunity and Disease Prevention Research
Title: SAMSA2: a standalone metatranscriptome analysis pipeline
|WESTREICH, SAMUEL - University Of California, Davis|
|TREIBER, MICHELLE - University Of California, Davis|
|MILLS, DAVID - University Of California, Davis|
Submitted to: BMC Bioinformatics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 9/28/2017
Publication Date: 5/21/2018
Citation: Westreich, S.T., Treiber, M.L., Mills, D.A., Lemay, D.G. 2018. SAMSA2: a standalone metatranscriptome analysis pipeline. BMC Bioinformatics. 19:175. https://doi.org/10.1186/s12859-018-2189-z.
Interpretive Summary: The study of complex microbial communities is an area of rapid growth in biology. It is now possible to determine all microbial genes expressed in a sample (the metatranscriptome) using sequencing technologies and software to analyze the results. The metatranscriptome provides a measure of microbial activity and function in a sample: which microbes are present and what they are doing. This paper describes a new software package to analyze metatranscriptomes. The software package, called SAMSA2, is the second version of the Simple Analysis of Metatranscriptomes through Sequence Annotation (SAMSA) software. SAMSA2 has been redesigned for independent use on a supercomputing cluster or large workstation and to work with additional reference databases. SAMSA2 is available, along with extensive documentation, at https://github.com/transcript/samsa2.
Technical Abstract:
Background: The study of complex microbial communities is an area of rapid growth in biology. Decreasing sequencing costs encourage pursuit of metagenomes and metatranscriptomes (all microbial DNA and transcripts, respectively). However, these large and complex datasets require bioinformatics pipelines capable of providing rapid and accurate results. Additionally, as metagenomic and metatranscriptomic analyses are applied to a wider range of environments, analysis pipelines must support the ability to alter their reference databases to stay relevant and match the environment.
Results: Here we present SAMSA2, an upgrade to the original SAMSA pipeline that has been redesigned for use on a supercomputing cluster and offers increased functionality, speed, and ease of incorporation of custom reference databases. This pipeline performs quality-control assessment, removes rRNAs, uses DIAMOND to increase annotation speed against either standard or customized reference databases, and uses custom Python and R scripting to reduce the results and provide downstream analyses for figure generation and identification of significantly differing activity levels at the species level. SAMSA2 is available with detailed documentation and example input and output files, along with examples of master scripts for full pipeline execution.
Conclusions: Using publicly available example data, we demonstrate that SAMSA2 is a rapid and efficient metatranscriptome pipeline for analyzing large paired-end RNA-seq datasets in a supercomputing cluster environment. SAMSA2 provides simplified output that can be examined directly or used for further analyses, and its reference databases may be upgraded, altered, or customized to fit the specifics of any experiment.
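Illustrative sketch: The abstract describes a step in which custom Python scripting reduces the annotation results into simplified output. The sketch below shows one way such a reduction could look; it is not taken from the SAMSA2 code base. It assumes the annotation results are in BLAST-style tab-separated format, where the first column is the read (query) ID and the second is the matched reference (subject) ID, and that only the first hit listed per read is kept. The function names `count_hits` and `to_percentages` and the sample read and reference names are hypothetical.

```python
import io
from collections import Counter


def count_hits(lines):
    """Count reads per reference ID from BLAST-style tabular rows.

    Assumption: column 1 is the read ID and column 2 is the reference ID.
    Only the first row seen for each read is counted.
    """
    counts = Counter()
    seen_reads = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip malformed rows
        read_id, reference_id = fields[0], fields[1]
        if read_id in seen_reads:
            continue  # a hit for this read was already counted
        seen_reads.add(read_id)
        counts[reference_id] += 1
    return counts


def to_percentages(counts):
    """Return (reference, count, percent of counted reads), most abundant first."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [(ref, n, 100.0 * n / total) for ref, n in counts.most_common()]


# Hypothetical input: four rows, with read1 reported twice.
sample = (
    "read1\trefA\t98.0\n"
    "read2\trefA\t95.5\n"
    "read3\trefB\t91.2\n"
    "read1\trefB\t80.1\n"
)
counts = count_hits(io.StringIO(sample))
table = to_percentages(counts)
for reference, n, percent in table:
    print(f"{percent:.2f}\t{n}\t{reference}")
```

A per-reference table of this kind is the sort of simplified output that could be examined directly or passed to downstream statistical comparison; the actual SAMSA2 scripts and their output layout are documented at https://github.com/transcript/samsa2.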