Submitted to: Journal of Theoretical Biology
Publication Type: Peer-reviewed journal
Publication Acceptance Date: 11/5/2003
Publication Date: 4/7/2004
Citation: Lamboy, W.F., Moreno-Hagelsieb, G. 2004. A new method of solution for the occupancy problem and its application to operon size prediction. Journal of Theoretical Biology. 227:315-322.
Interpretive Summary: As the number of completely sequenced prokaryotic genomes has increased, so has the need for genome annotation, the two most important aspects of which are the identification of genes and the determination of their organization into single-gene and multi-gene transcription units. Consequently, estimation of the number of single-gene and multi-gene transcription units in a genome is an important part of the annotation procedure. In this work we develop a new, simple technique for solving the mathematical problems involved in making these estimates, and use those results to compute the average number of transcription units containing a specific number of genes that we would expect to find in a genome. Such estimates assist the annotation process and facilitate comparison of transcription unit sizes between different genomes.
Technical Abstract: The problem of estimating the expected number of transcription units containing a specific number of genes arises in the context of operon size prediction in prokaryotic genomes, where operons are defined to be transcription units containing two or more genes. It turns out that this problem is mathematically identical to the balls-in-urns occupancy problem in probability theory. In that problem, a fixed number of indistinguishable balls are randomly placed in a known number of distinguishable urns, subject to the restriction that no urns may remain empty, and an estimate is desired for the expected number of urns containing a specific number of balls. In this paper we present a new, simple technique for solving the occupancy problem when empty urns are allowed and extend it to the case when each urn must contain the same non-zero minimum number of balls. Treating transcription units as equivalent to urns, and genes as equivalent to balls, we then use that result to solve the problem of estimating the expected number of transcription units that contain a specific number of genes. The ability to make such estimates provides a probabilistic foundation for the comparison of operon predictions within and across genomes.
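The occupancy calculation described in the abstract can be illustrated with a minimal sketch. It rests on an assumption the abstract does not state: that every distinguishable arrangement of indistinguishable balls in distinguishable urns is equally likely. Under that assumption a standard stars-and-bars count gives the expectation directly; this is not necessarily the technique developed in the paper. The function name `expected_urns_with_k` and its parameters are hypothetical.

```python
from fractions import Fraction
from math import comb


def expected_urns_with_k(n_balls, n_urns, k, min_per_urn=0):
    """Expected number of urns holding exactly k balls.

    ASSUMPTION (not taken from the abstract): all arrangements of
    indistinguishable balls in distinguishable urns are equally likely.
    """
    # Give every urn its required minimum first; only the remaining
    # "free" balls are distributed without restriction.
    free = n_balls - min_per_urn * n_urns
    extra = k - min_per_urn
    if n_urns < 1 or free < 0:
        raise ValueError("no valid arrangement exists")
    if extra < 0 or extra > free:
        return Fraction(0)
    if n_urns == 1:
        # A single urn must hold every ball.
        return Fraction(1 if extra == free else 0)
    # Total arrangements of the free balls over all urns.
    total = comb(free + n_urns - 1, n_urns - 1)
    # Arrangements in which one chosen urn holds exactly `extra` free balls.
    one_urn_fixed = comb(free - extra + n_urns - 2, n_urns - 2)
    # By linearity of expectation, sum the per-urn probability over all urns.
    return Fraction(n_urns * one_urn_fixed, total)


# Genes as balls, transcription units as urns: 5 genes in 3 units,
# each unit holding at least one gene.
print(expected_urns_with_k(5, 3, 1, min_per_urn=1))  # prints 3/2
```

With 5 genes in 3 transcription units and at least one gene per unit, the six possible arrangements contain nine single-gene units in total, so the expected number is 3/2. Returning a `Fraction` keeps the result exact for small inputs; converting with `float()` is enough for genome-scale counts.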