Submitted to: Pacific Symposium on Biocomputing (PSB)
Publication Type: Peer reviewed journal
Publication Acceptance Date: 9/15/2011
Publication Date: 1/3/2012
Citation: Bunge, J., Bohning, D., Allen, H.K., Foster, J. 2012. Estimating population diversity with unreliable low frequency counts. Pacific Symposium on Biocomputing (PSB) [serial online]. 17:203-212. Available: http://psb.stanford.edu/psb-online/proceedings/psb12/abstracts/2012_p203.html.
Interpretive Summary: Enumerating the members of a population is a common problem in ecology because the total census must be extrapolated from a sampled (counted) subset. This problem is particularly acute for microbial populations, whose members are too small to be counted by eye and are therefore counted by molecular methods. Molecular methods are subject to various technical biases, one of which is an inflated number of low-frequency (also thought of as high-diversity) observations, particularly those resulting from high-throughput DNA sequencing technologies. Here we present statistical approaches that correct for these inflated low-frequency observations. We show that one method in particular, fitting a parametric mixture model and deleting the highest-diversity component, is readily employed in the analysis of phage metagenomic data. We find that these statistical corrections depend on their underlying assumptions, but that such methods can be useful nonetheless.
Technical Abstract: We consider the classical population diversity estimation scenario based on frequency count data (the number of classes or taxa represented once, twice, etc. in the sample), but with the proviso that the lowest frequency counts, especially the singletons, may not be reliably observed. This arises especially in data derived from modern high-throughput DNA sequencing, where errors may cause sequences to be incorrectly assigned to new taxa instead of being matched to existing, observed taxa. We look at a spectrum of methods for addressing this issue, focusing in particular on fitting a parametric mixture model and deleting the highest-diversity component; we also consider regarding the data as left-censored and effectively pooling two or more low frequency counts. We find that these purely statistical "downstream" corrections will depend strongly on their underlying assumptions, but that such methods can be useful nonetheless.
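The frequency count data described in the abstract, and the idea of pooling the lowest counts as left-censored, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the sample abundances, the function names `frequency_counts` and `pool_low_counts`, and the cutoff value are all hypothetical, and the sketch covers only the data preparation step, not the model fitting or the diversity estimate itself.

```python
from collections import Counter


def frequency_counts(abundances):
    """Return {j: number of taxa observed exactly j times}, for j >= 1."""
    counts = Counter(a for a in abundances if a > 0)
    return dict(sorted(counts.items()))


def pool_low_counts(freq, cutoff):
    """Treat frequencies 1..cutoff as left-censored.

    Only the total number of taxa at or below the cutoff is kept;
    the individual low counts (e.g. singletons) are no longer trusted.
    """
    pooled = sum(n for j, n in freq.items() if j <= cutoff)
    rest = {j: n for j, n in freq.items() if j > cutoff}
    return pooled, rest


# Hypothetical sample: number of sequences assigned to each observed taxon.
abundances = [1, 1, 1, 1, 1, 2, 2, 3, 5, 8]

freq = frequency_counts(abundances)              # 5 singletons, 2 doubletons, ...
pooled, rest = pool_low_counts(freq, cutoff=2)   # pool singletons and doubletons

print(freq)
print(pooled, rest)
```

In this sketch the unreliable singleton and doubleton counts are replaced by their sum, so a downstream model would only need to account for the combined total rather than each count separately.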