Schnell II, Raymond
Submitted to: Tree Genetics & Genomics
Publication Type: Peer reviewed journal
Publication Acceptance Date: 11/7/2005
Publication Date: 3/29/2006
Citation: Cervantes-Martinez, C., Brown, J.S., Motamayor, J.C., Zhang, D., Schnell II, R.J. 2006. A computer simulation study on the number of loci and trees required to estimate genetic variability in cacao (Theobroma cacao L.). Tree Genetics & Genomics.
Interpretive Summary: The use of molecular markers to verify the identity of samples from plant cultivars, animal varieties or individuals, and humans is widely known. The numbers of markers (representing genetic loci, or genes) recommended for determining identities and distinguishing among them have been well established, especially for human forensics. Relatively few markers are required when the markers used typically have high numbers of alleles, such as simple sequence repeat (SSR) markers, and any supplementary information, such as species or racial membership, can be used in conjunction with the marker data for higher precision. Genetic researchers also use measures of genetic variability, either to quantify variability in absolute terms or to study related populations by quantifying their variability relative to one another. When applied to related populations, these measures can provide evidence of unusual or unique mutations, or of bottlenecks that may have occurred in one or more populations. Such measures are functions of the alleles at individual loci, and SSR markers, among other types of molecular markers, have been used to estimate them. The numbers of alleles necessary for accurate estimates of these statistics have not, however, previously been rigorously determined. Many scientific publications using these measures have justified the number of markers used by heuristic and very questionable methods, and have used as few as one or two markers per chromosome, or fewer. The purpose of this paper was to investigate, using computer simulation and Monte Carlo methods, the number of markers needed for accurate and precise estimates of these measures.
Relatively few markers (10 to 15 in cacao, for example) can suffice to identify and distinguish cultivars (clones). However, our work in this paper found that increasing the number of markers per chromosome (100 centimorgans, cM) from one to five increased precision by more than 50%, and a further increase to ten markers per chromosome (100 cM) increased precision by 70%. Markers used in such analyses of germplasm are generally selected to include only polymorphic (variable, or informative) markers, and some researchers further restrict the selection to markers with estimates of genetic variability above a certain threshold. Our results indicated that this practice can inflate the estimates of these statistics, at times resulting in biases over 100%. If relative values of variability are of primary interest, comparisons remove the bias. If, however, the absolute estimate of these statistics is of primary interest, it is extremely important to include non-polymorphic regions of the genome when sampling it with SSR markers. The final discussion in this report compares and contrasts different marker sampling schemes in order to determine a reasonable number of markers to use for measuring variability in populations.
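The inflation described above can be illustrated with a minimal sketch. It is not the simulation used in the paper: the locus generator, the share of monomorphic loci, and the 0.5 threshold are all hypothetical choices made only for illustration. Variability is measured here as gene diversity (expected heterozygosity), one minus the sum of squared allele frequencies at a locus.

```python
import random

def expected_heterozygosity(freqs):
    """Gene diversity at one locus: 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)

def random_locus(rng, max_alleles=8, monomorphic_prob=0.4):
    """Hypothetical locus: monomorphic with some probability,
    otherwise 2..max_alleles alleles with random frequencies."""
    if rng.random() < monomorphic_prob:
        return [1.0]
    k = rng.randint(2, max_alleles)
    weights = [rng.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]

def selection_bias(n_loci=1000, threshold=0.5, seed=1):
    """Compare the mean over all loci with the mean over loci whose
    diversity exceeds an (assumed) selection threshold."""
    rng = random.Random(seed)
    h = [expected_heterozygosity(random_locus(rng)) for _ in range(n_loci)]
    genome_mean = sum(h) / len(h)
    selected = [x for x in h if x > threshold]
    selected_mean = sum(selected) / len(selected)
    relative_bias = (selected_mean - genome_mean) / genome_mean
    return genome_mean, selected_mean, relative_bias

genome_mean, selected_mean, relative_bias = selection_bias()
print(f"all loci: {genome_mean:.3f}  selected loci: {selected_mean:.3f}  "
      f"relative bias: {100 * relative_bias:.0f}%")
```

Because the selected loci are by construction the most variable ones, their mean always exceeds the genome-wide mean in this sketch. The size of the bias depends entirely on the assumed mix of monomorphic and polymorphic loci.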
Technical Abstract: A current method for measuring genetic diversity among and within populations and germplasm collections is based on statistics derived from allelic frequencies estimated at polymorphic molecular marker loci. The true mean and variance of any measure of genetic variability of the entire genome can only be approached accurately by considering all possible genes. However, only a portion of the genome is studied in practice, using a limited set of polymorphic markers. The objective of this study was to use Monte Carlo methods to investigate the accuracy and precision of the most common genetic variability and population structure estimators, as estimated from simple sequence repeat (SSR) markers in cacao (Theobroma cacao L.) populations. Computer-simulated genomes of replicate populations were generated with initial allele frequencies obtained from SSR data of the Trinitario cacao genetic group. Estimators of genetic variability were studied as a function of the number of trees and loci sampled. The results showed that relatively small random samples of trees suffice to achieve consistent estimates. In contrast, very large random samples of loci per linkage group were required to enable reliable inferences about the whole genome. The precision of the estimates increased by more than 50% when the sample size was raised from one to five loci per linkage group of 100 cM in length. The use of fewer, highly polymorphic loci to analyze genetic variability led to estimates with substantially smaller variance, but with an upward bias.
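The effect of the number of loci per linkage group on precision can be sketched with a small Monte Carlo experiment. This is an illustration under stated assumptions, not a reconstruction of the study: the number of linkage groups (10), the number of loci per group (200), and the beta distribution used for per-locus gene diversity are hypothetical, and loci are sampled independently with no linkage modeled.

```python
import random
import statistics

def simulate_genome(rng, n_groups=10, loci_per_group=200):
    """Hypothetical genome: one gene-diversity value per locus,
    grouped by linkage group (distribution chosen arbitrarily)."""
    return [[rng.betavariate(1.0, 2.0) for _ in range(loci_per_group)]
            for _ in range(n_groups)]

def estimate_mean_diversity(genome, m, rng):
    """Mean gene diversity from m randomly sampled loci per linkage group."""
    sampled = [h for group in genome for h in rng.sample(group, m)]
    return sum(sampled) / len(sampled)

def sd_of_estimates(genome, m, n_reps=2000, seed=0):
    """Standard deviation of the estimate across Monte Carlo replicates."""
    rng = random.Random(seed)
    estimates = [estimate_mean_diversity(genome, m, rng) for _ in range(n_reps)]
    return statistics.stdev(estimates)

genome = simulate_genome(random.Random(42))
sd = {m: sd_of_estimates(genome, m) for m in (1, 5, 10)}
for m in (5, 10):
    reduction = 100 * (1 - sd[m] / sd[1])
    print(f"{m} loci per group: SD reduced by {reduction:.0f}% relative to 1 locus")
```

Under simple random sampling of independent loci, the standard deviation of the mean shrinks roughly as 1/sqrt(m), which corresponds to reductions of about 55% for five loci and 68% for ten. That is the same order as the gains reported above, although the sketch says nothing about how the authors' simulation produced them.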