
Title: Evaluating imputation algorithms for low-depth genotyping-by-sequencing (GBS) data

Author
CHEN, ARIEL - Cornell University
HAMBLIN, MARTHA - Cornell University
Jannink, Jean-Luc

Submitted to: PLOS ONE
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 7/25/2016
Publication Date: 8/18/2016
Citation: Chen, A.W., Hamblin, M.T., Jannink, J. 2016. Evaluating imputation algorithms for low-depth genotyping-by-sequencing (GBS) data. PLoS One. 11(8):e0160733.

Interpretive Summary: For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an economical approach for surveying variants at the genome level. Although affordable, HTS-derived datasets suffer from high rates of sequencing error, alignment errors, and missing data. All of these introduce considerable noise and uncertainty into variant discovery and genotype calling, and make meaningful analysis of the data difficult. In the research article “Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data,” we focus on the issue of missing data and set out to answer two questions: 1) can we reuse existing imputation methods developed by the human genetics community to impute missing genotypes in datasets from non-human species, such as Manihot esculenta, and 2) are these methods, developed and optimized to impute ascertained variants, amenable to imputation of missing genotypes at GBS-derived variants? The number of studies using low- to medium-depth HTS data will likely grow in the future. We believe this article will provide readers with some insight into strategies for imputing missing genotypes at HTS-derived variants in non-model organisms.

Technical Abstract: Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from high rates of sequencing error, alignment errors, and missing data, all of which introduce considerable noise and uncertainty into variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in how one can accurately infer, or impute, missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we reuse existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species, and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable to imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. Because NEXTGEN currently imputes missing genotypes in its datasets using a LASSO-penalized linear regression method (denoted ‘glmnet’), we selected glmnet to serve as the benchmark imputation method.
We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and calculating the sample Pearson correlation between observed and imputed genotype dosages at the site and individual level; computation time served as a second metric for comparison. We then set out to examine factors affecting imputation accuracy, such as levels of missing data, read depth, minor allele frequency (MAF), and reference panel composition.
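
The masking procedure described above can be illustrated with a short sketch. This is not the authors' pipeline: the function names, the individuals-by-sites array layout with NaN marking missing calls, and the `mask_fraction` parameter are all assumptions made for illustration. The additive-means imputer is a deliberately simple placeholder standing in for Beagle or glmnet, neither of which is called here.

```python
import numpy as np


def mask_genotypes(dosages, mask_fraction, rng):
    """Hide a random subset of the observed entries.

    dosages: individuals x sites array of allele dosages (0 to 2), NaN = missing.
    Returns the masked copy and a boolean array marking the hidden entries.
    """
    observed = ~np.isnan(dosages)
    hidden = observed & (rng.random(dosages.shape) < mask_fraction)
    masked = dosages.copy()
    masked[hidden] = np.nan
    return masked, hidden


def impute_additive_means(masked):
    """Placeholder imputer (NOT Beagle or glmnet): site mean plus individual offset."""
    grand = np.nanmean(masked)
    site = np.nanmean(masked, axis=0, keepdims=True)
    indiv = np.nanmean(masked, axis=1, keepdims=True)
    estimate = np.clip(site + indiv - grand, 0.0, 2.0)
    # Keep observed calls as they are; fill only the missing entries.
    return np.where(np.isnan(masked), estimate, masked)


def pearson(x, y):
    """Sample Pearson correlation; NaN when undefined (n < 2 or zero variance)."""
    if x.size < 2 or np.std(x) == 0 or np.std(y) == 0:
        return np.nan
    return float(np.corrcoef(x, y)[0, 1])


def masked_accuracy(truth, imputed, hidden):
    """Correlate true and imputed dosages over the hidden entries only."""
    n_indiv, n_sites = truth.shape
    per_site = np.array([
        pearson(truth[hidden[:, j], j], imputed[hidden[:, j], j])
        for j in range(n_sites)
    ])
    per_indiv = np.array([
        pearson(truth[i, hidden[i]], imputed[i, hidden[i]])
        for i in range(n_indiv)
    ])
    return per_site, per_indiv


def demo(seed=0):
    """Toy run on random dosages; real GBS data would replace `truth`."""
    rng = np.random.default_rng(seed)
    truth = rng.integers(0, 3, size=(40, 30)).astype(float)
    masked, hidden = mask_genotypes(truth, 0.3, rng)
    imputed = impute_additive_means(masked)
    per_site, per_indiv = masked_accuracy(truth, imputed, hidden)
    return np.nanmean(per_site), np.nanmean(per_indiv)
```

Accuracy is computed only over the entries that were hidden, so genotypes the imputer was allowed to see do not inflate the estimate. Sites or individuals whose hidden entries have zero variance are reported as NaN, since the correlation is undefined there.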