Submitted to: Bioinformatics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 2/6/2012
Publication Date: 2/13/2012
Citation: Bunge, J., Woodard, L., Bohning, D., Foster, J.A., Connolly, S., Allen, H.K. 2012. Estimating population diversity with CatchAll. Bioinformatics. 28(7):1045-1047. Available: http://bioinformatics.oxfordjournals.org/content/28/7/1045.long.
Interpretive Summary: Next-generation sequencing has produced massive quantities of data that are in great need of robust statistical analysis tools. In particular, estimating total population sizes from sample sequences remains a challenge. We present a program called CatchAll that estimates total population sizes with ease and speed. Frequency count data from any population can be analyzed, including bacterial and phage diversity counts. The program uses modern statistical approaches, applying up to 12 different methods and models, and reports the best overall result to the user. Importantly, CatchAll also offers a unique mathematical approach to discount potential outliers in a dataset. Graphical display of the outputs has been optimized in an Excel-based spreadsheet program.
Technical Abstract: The massive quantity of data produced by next-generation sequencing has created a pressing need for advanced statistical tools, in particular for analysis of bacterial and phage communities. Here we address estimating the total diversity in a population – the species richness. This is an important statistical problem with a rich literature, but to date only relatively simple methods have been implemented in readily available software. There is a need for a software package employing modern, computationally intensive statistical procedures, with error terms, goodness-of-fit assessments, and robustness comparisons. The same methods also apply to estimating the total size of a population. We present CatchAll, a fast, easy-to-use, platform-independent software package that uses optimized numerical searching to compute maximum likelihood estimates for finite-mixture models, linear regression-based models with non-diagonal weight matrices, and all existing coverage-based nonparametric methods, while accounting for outlier detection/deletion and other data-analytic considerations. Given sample “frequency count” data, CatchAll computes 12 different diversity estimates and compares the results via a model-selection algorithm, providing the user with a best overall choice. In addition, CatchAll derives model-based discounted estimates of total diversity to adjust for possibly uncertain low-frequency counts. It is accompanied by an Excel-based graphical display spreadsheet program.
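Illustrative sketch: To make the notion of “frequency count” data concrete, the short Python example below builds frequency counts from per-species abundances and applies two textbook nonparametric richness estimators: a Good-Turing coverage-based estimate and the bias-corrected Chao1 estimate. It is a minimal sketch and not CatchAll's implementation; the function names and the sample abundances are hypothetical, and the record above does not state which specific estimators CatchAll reports.

```python
from collections import Counter


def frequency_counts(abundances):
    """Return {j: f_j}, where f_j is the number of species seen exactly j times.

    `abundances` is a list of per-species observation counts (hypothetical input).
    """
    return dict(Counter(a for a in abundances if a > 0))


def good_turing_richness(freq):
    """Coverage-based estimate: observed species divided by estimated coverage.

    Coverage is estimated as 1 - f_1 / n (Good-Turing), where n is the
    total number of individuals in the sample.
    """
    observed = sum(freq.values())
    n = sum(j * f for j, f in freq.items())
    coverage = 1 - freq.get(1, 0) / n
    if coverage <= 0:
        raise ValueError("every species is a singleton; coverage estimate is zero")
    return observed / coverage


def chao1(freq):
    """Bias-corrected Chao1 estimate: observed + f_1 (f_1 - 1) / (2 (f_2 + 1))."""
    observed = sum(freq.values())
    f1 = freq.get(1, 0)
    f2 = freq.get(2, 0)
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical sample: nine observed species with these abundances.
abundances = [1, 1, 1, 1, 2, 2, 3, 5, 10]
freq = frequency_counts(abundances)
print(freq)                        # four singletons, two doubletons, ...
print(good_turing_richness(freq))  # coverage-based richness estimate
print(chao1(freq))                 # bias-corrected Chao1 estimate
```

Both estimators depend heavily on the singleton count f_1, which is one reason low-frequency counts of uncertain reliability matter for the final estimate.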