BOWKER, CHERYL - Colorado State University
FETTIG, CHRISTA - Colorado State University
TEMBROCK, LUKE - Colorado State University
Submitted to: bioRxiv
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 10/3/2016
Publication Date: 10/13/2016
Citation: Reeves, P.A., Bowker, C.L., Fettig, C.E., Tembrock, L.R., Richards, C.M. 2016. Effect of error and missing data on population structure inference using microsatellite data. bioRxiv. https://doi.org/10.1101/080630.
Interpretive Summary: Rapidly evolving genomic regions called “microsatellite” DNA, or “simple sequence repeats” (SSRs), are commonly used to estimate genetic differences between populations of organisms. All SSR data sets contain some amount of error, and most contain some missing data. Eliminating missing data and errors is time-consuming and expensive. This study quantifies the effects of these data aberrations on one type of analysis important to germplasm collection management: population structure inference. We simulated data sets that exhibited a range of population structures, akin to what might be observed in nature. We then modified these data sets using models developed to mimic the characteristic features of SSR error and missing data. Data sets were analyzed before and after the introduction of error and missing data, and the number of correct and incorrect population clusters recovered was tabulated. We made three principal discoveries: 1) Missing data negatively affect population structure inference more than erroneous data; 2) Some analytical methods become more accurate as error increases; 3) The magnitude of genetic admixture is overestimated when data sets contain genotyping errors. These discoveries lead to the following practical recommendations: 1) The percentage of a data matrix that consists of missing data should be limited to ~2%. This will allow researchers to retain most of the resolving power of their data while not incurring the extra costs associated with completing a data matrix; 2) For analyses that use simple genetic distance calculations, the error rate should be limited to ~4% of scored genotypes; 3) Model-based population structure inference methods handle genotyping error well. Among these, we recommend using admixture models to identify distinct genetic clusters, but caution against their use for estimating genetic admixture, because the estimate may be artificially elevated.
This study will improve our ability to understand the distribution of genetic variation in wild relatives of crop species by making genotypic data acquisition more efficient and data analysis more accurate.
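The ~2% missing-data and ~4% error-rate limits recommended above can be checked before an analysis is run. The following is a minimal sketch, not the authors' code. It assumes a hypothetical representation in which a data set is a list of rows (individuals), each row holds one genotype per locus as a pair of allele sizes, and `None` marks a missing genotype. Because the error rate cannot be read from a single matrix, the sketch further assumes that the same samples were genotyped twice and uses the disagreement rate between replicates as a rough proxy. All function and variable names are illustrative.

```python
MAX_MISSING = 0.02  # ~2% missing-data limit from the summary above
MAX_ERROR = 0.04    # ~4% error limit for simple genetic distance analyses


def missing_fraction(matrix):
    """Fraction of genotype cells that are missing (None)."""
    total = sum(len(row) for row in matrix)
    missing = sum(1 for row in matrix for g in row if g is None)
    return missing / total if total else 0.0


def replicate_error_rate(first, second):
    """Fraction of genotypes scored in both replicates that disagree.

    Allele order within a genotype is ignored. This is only a proxy for
    the true error rate (an assumption of this sketch).
    """
    compared = mismatched = 0
    for row_a, row_b in zip(first, second):
        for a, b in zip(row_a, row_b):
            if a is None or b is None:
                continue  # cannot compare a missing genotype
            compared += 1
            if sorted(a) != sorted(b):
                mismatched += 1
    return mismatched / compared if compared else 0.0


# Two hypothetical genotyping runs: 2 individuals x 3 loci.
run1 = [[(100, 104), (200, 200), None],
        [(100, 100), (200, 204), (150, 152)]]
run2 = [[(104, 100), (200, 200), (150, 150)],
        [(100, 100), (200, 200), (150, 152)]]

print("missing fraction within limit:", missing_fraction(run1) <= MAX_MISSING)
print("replicate mismatch within limit:", replicate_error_rate(run1, run2) <= MAX_ERROR)
```

In this toy example one of six cells is missing and one of five comparable genotypes disagrees, so both checks fail. Real data sets would be far larger.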
Technical Abstract: Missing data and genotyping errors are common in microsatellite data sets. We used simulated data to quantify the effect of these data aberrations on the accuracy of population structure inference. Data sets with complex, randomly generated population histories were simulated under the coalescent. Models describing the characteristic patterns of missing data and genotyping error in real microsatellite data sets were developed, then used to modify the simulated data sets. Performance of ordination, tree-based, and model-based methods of population structure inference was evaluated before and after data set modifications. The ability to recover correct population clusters decreased as missing data increased. The rate of decrease was similar among analytical procedures, thus no single analytical approach was preferable. For every 1% of a data matrix that contained missing genotypes, 2–4% fewer correct clusters were found. For every 1% of a matrix that contained erroneous genotypes, 1–2% fewer correct clusters were found using ordination and tree-based methods. Model-based procedures that minimize the deviation from Hardy-Weinberg equilibrium in order to assign individuals to clusters performed better as genotyping error increased. We attribute this surprising result to the inbreeding-like nature of microsatellite genotyping error, where heterozygous genotypes are mischaracterized as homozygous. We show that genotyping error elevates estimates of the level of genetic admixture. Overall, missing data negatively impact population structure inference more than typical genotyping errors.
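The "inbreeding-like" effect of genotyping error described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the error model developed in the study: it draws genotypes at a single locus with four equally frequent alleles under random mating, then rewrites each heterozygote as a homozygote with a hypothetical probability of 10%. The allele sizes, sample size, error rate, and all function names are illustrative. The inbreeding coefficient is estimated in the standard way as F = 1 - Ho/He, where Ho is observed heterozygosity and He is the Hardy-Weinberg expectation.

```python
import random


def observed_heterozygosity(genotypes):
    """Proportion of non-missing genotypes that are heterozygous."""
    scored = [g for g in genotypes if g is not None]
    if not scored:
        return 0.0
    return sum(1 for a, b in scored if a != b) / len(scored)


def expected_heterozygosity(genotypes):
    """Hardy-Weinberg expectation, 1 - sum(p_i^2), from allele counts."""
    counts = {}
    for g in genotypes:
        if g is None:
            continue
        for allele in g:
            counts[allele] = counts.get(allele, 0) + 1
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def inbreeding_coefficient(genotypes):
    """F = 1 - Ho/He; positive values indicate a heterozygote deficit."""
    he = expected_heterozygosity(genotypes)
    return 1.0 - observed_heterozygosity(genotypes) / he if he else 0.0


def apply_dropout(genotypes, rate, rng):
    """Score each heterozygote as a homozygote with probability `rate`."""
    out = []
    for g in genotypes:
        if g is not None and g[0] != g[1] and rng.random() < rate:
            kept = rng.choice(g)  # one allele survives, the other drops out
            g = (kept, kept)
        out.append(g)
    return out


rng = random.Random(1)           # fixed seed so the run is repeatable
alleles = [100, 102, 104, 106]   # hypothetical allele sizes
clean = [(rng.choice(alleles), rng.choice(alleles)) for _ in range(2000)]
noisy = apply_dropout(clean, rate=0.10, rng=rng)

print("F before error:", round(inbreeding_coefficient(clean), 3))
print("F after error: ", round(inbreeding_coefficient(noisy), 3))
```

Because the simulated error only ever converts heterozygotes to homozygotes, observed heterozygosity falls while allele frequencies barely change, so the estimated F rises, which is the same signature inbreeding would produce.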