Submitted to: PLoS One
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: December 8, 2008
Publication Date: January 27, 2009
Citation: Reeves, P.A., Richards, C.M. 2009. Accurate inference of subtle population structure (and other genetic discontinuities) using principal coordinates. PLoS ONE. 4(1): e4269.
Interpretive Summary: Accurate inference of cryptic population structure from genotypic data is essential for any description of natural genetic variation. In the past, model-based Bayesian statistical methods have been used with much success; however, these methods are so computationally intensive as to be intractable for large genomic data sets. Furthermore, inferring the natural number of subpopulations in a data set is difficult using Bayesian approaches. We have developed an analytical strategy called PCO-MC (“principal coordinate-modal clustering”) that permits accurate inference of both the number of subpopulations and the assignment of individuals to those subpopulations. PCO-MC requires orders of magnitude less computational effort than Bayesian methods. PCO-MC uses data from all principal coordinate axes simultaneously to calculate a hyperdimensional probability density function, or “density landscape”, from which cluster number and membership are determined using a valley-seeking algorithm. A statistical test to determine whether a particular cluster is significantly distinct from others is also available. Using simulated data, we have shown that PCO-MC accurately infers cluster number and membership when population substructure is subtle and the number of loci considered is large. Use of PCO-MC will permit more accurate and more rapid circumscription of relevant units of biodiversity in germplasm collections using molecular genotype data. The PCO-MC approach may also improve linkage disequilibrium mapping studies, where population structure must be explicitly known in order to control for spurious associations between a locus and a phenotype.
Technical Abstract: Accurate inference of genetic discontinuities between populations is an essential component of intraspecific biodiversity and evolution studies, as well as associative genetics. The most widely used methods to infer population structure are model-based Bayesian MCMC procedures that minimize Hardy-Weinberg and linkage disequilibrium within subpopulations. These methods have proven useful, but suffer from large computational requirements and a dependence on modeling assumptions that may not be met in real data sets. Here we describe the development of a new approach, PCO-MC, which couples an ordination method, principal coordinate analysis, to a non-parametric clustering procedure, modal clustering, for the inference of population structure from multilocus genotype data. PCO-MC uses data from all principal coordinate axes simultaneously to calculate a hyperdimensional probability density function (or “density landscape”), from which the number of subpopulations and the membership within those subpopulations are determined using a valley-seeking algorithm. Using extensive simulations, we show that this approach outperforms a Bayesian MCMC procedure when many loci (e.g., 100) are sampled, but that the Bayesian procedure is superior with few loci (e.g., 10). When presented with sufficient data, PCO-MC accurately delineated subpopulations with population Fst values as low as 0.03 (G’st > 0.2), whereas the limit of resolution of the Bayesian approach was approximately Fst = 0.05 (G’st > 0.35). We draw a distinction between population structure inference for describing biodiversity and population structure inference for Type I error control in associative genetics. We suggest that discrete assignments, such as those produced by PCO-MC, are appropriate for circumscribing units of biodiversity, whereas expression of population structure as a continuous variable is more useful for case-control correction in structured association studies.
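The two-stage idea described above (ordination by principal coordinate analysis, then clustering on a density landscape built from all retained axes) can be illustrated with a minimal Python sketch. This is not the authors' PCO-MC implementation: the function names `pcoa` and `modal_clusters`, the Gaussian kernel, the `bandwidth` parameter, and the simple uphill-linking rule are all assumptions chosen for illustration, and the uphill linking is a stand-in for the valley-seeking algorithm named in the abstract. The synthetic points and Euclidean distances are likewise placeholders for real multilocus genotype distances.

```python
import numpy as np


def pcoa(dist):
    """Classical principal coordinate analysis of a square distance matrix.

    Uses Gower double-centering and keeps every axis with a clearly
    positive eigenvalue, so all informative axes are retained.
    """
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1e-9 * eigvals[0]         # drop zero/negative axes
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])


def modal_clusters(coords, bandwidth):
    """Hypothetical mode-seeking clustering on a kernel density estimate.

    Each point is linked to the densest point within `bandwidth` of it;
    following the links uphill leads to a local density mode, and points
    sharing a mode share a cluster label.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    # Gaussian kernel density estimate at every point (unnormalized).
    density = np.exp(-sq_dist / (2.0 * bandwidth ** 2)).sum(axis=1)

    n = len(coords)
    parent = np.arange(n)
    for i in range(n):
        near = np.flatnonzero(sq_dist[i] <= bandwidth ** 2)
        best = near[np.argmax(density[near])]
        if density[best] > density[i]:         # strictly uphill: no cycles
            parent[i] = best

    roots = np.empty(n, dtype=int)
    for i in range(n):
        j = i
        while parent[j] != j:                  # climb to the local mode
            j = parent[j]
        roots[i] = j
    _, labels = np.unique(roots, return_inverse=True)
    return labels


# Synthetic stand-in for genotype data: two well-separated groups.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.2, (25, 3)),
                    rng.normal(5.0, 0.2, (25, 3))])
dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)

coords = pcoa(dist)
labels = modal_clusters(coords, bandwidth=2.0)
print("clusters found:", len(set(labels)))
```

In this sketch the number of clusters is not supplied in advance; it falls out of the number of density modes, which mirrors the abstract's point that cluster number and membership are both read from the density landscape. The result depends strongly on the assumed `bandwidth`, and the sketch includes nothing corresponding to the statistical test of cluster distinctness mentioned in the interpretive summary.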
