
Title: Capturing haplotypes in germplasm core collections

Authors: Richards, Christopher; Reeves, Patrick

Submitted to: Genetic Resources and Crop Evolution
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 10/24/2017
Publication Date: 12/1/2017
Citation: Richards, C.M., Reeves, P.A. 2017. Capturing haplotypes in germplasm core collections. Genetic Resources and Crop Evolution. doi:10.1007/s10722-017-0549-6.

Interpretive Summary: Genebank managers often need to identify the accessions within their collections that contain the most diversity or carry the most useful alleles. They use this information to develop “core subsets,” which are small, representative subsets of the broader germplasm collection. A variety of data sets are currently used to make core subsets, including small numbers of genetic markers distributed across the genome, morphological traits, and geospatial information about the collection site. “Genomewide SNP” (single nucleotide polymorphism) data sets offer a way to measure collection diversity and locate alleles of interest, but these data sets are often too large for current software to analyze, and there are no best practices to guide data analysis and utilization. This work develops new software that runs in parallel on a supercomputer to create core subsets from genomewide SNP data sets. Here, we compare the composition of core subsets for different computer-simulated haplotype block structures (the haplotype block being the fundamental unit of genetic inheritance). We show that the maximum number of accessions necessary to capture a given proportion of the collection diversity can be determined using this simulation approach. Further, we show that haplotypic variation is estimated better by genomewide SNP data sets than by other data sets, such as geospatial and environmental parameters. The natural history of the species and the improvement status of the collection affect how well geospatial and environmental data estimate diversity. These results indicate that commonly used surrogate measures of genetic diversity, such as geographic proximity or similarity of the collection site environment, do not correlate well with actual genetic diversity.

Technical Abstract: Genomewide data sets of single nucleotide polymorphisms (SNPs) offer great potential to improve ex situ conservation. Two factors impede their use for producing core collections. First, due to the large number of SNPs, the assembly of collections that maximize diversity may be intractable using existing, serial software algorithms. Second, the effect of the natural partitioning of the genome into linked regions, or haplotype blocks, on the optimization of collections and on the capture of diversity is unknown. To address the first problem, we report the development of a parallel computer program, M+, for identifying optimized core collections from arbitrarily large genotypic data sets on high-performance computing systems. With respect to the second problem, we use three exemplar data sets to show that, as haplotype block length increases, the number of accessions necessary to capture a predetermined proportion of genomewide haplotypic variation also increases. This relationship is asymptotic, such that the minimum haplotype block length suitable for assembling core collections can be empirically determined, and the number of accessions necessary to capture a given percentage of the haplotypic diversity present in the entire collection can be estimated, even when true haplotype structure is unknown. Additionally, we test whether simple geographic or environmental information can be used to produce core collections with elevated genomewide haplotypic diversity. We find this opportunity to be limited and dependent on natural history and improvement status.
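
The link between haplotype block length and core size can be illustrated with a small Python sketch. This is not the M+ program and makes no claim about how M+ works internally. It is a simple serial greedy heuristic, and the toy data, the haploid 0/1 SNP coding, the fixed-length blocks, and the names `haplotypes`, `greedy_core`, and `TOY` are all assumptions introduced only for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy format: accession name -> haploid SNP calls coded 0/1.
Genotypes = Dict[str, List[int]]
Haplotype = Tuple[int, Tuple[int, ...]]  # (block index, alleles within the block)


def haplotypes(snps: List[int], block_len: int) -> Set[Haplotype]:
    """Cut a SNP vector into consecutive fixed-length blocks (an assumption;
    real haplotype blocks vary in length) and return the block haplotypes."""
    return {
        (start // block_len, tuple(snps[start:start + block_len]))
        for start in range(0, len(snps), block_len)
    }


def greedy_core(genotypes: Genotypes, block_len: int, target: float) -> List[str]:
    """Add accessions one at a time, always taking the one that contributes
    the most not-yet-captured haplotypes, until the requested proportion of
    all distinct haplotypes in the collection is captured."""
    if not 0 < target <= 1:
        raise ValueError("target must be in the interval (0, 1]")
    per_accession = {name: haplotypes(s, block_len) for name, s in genotypes.items()}
    all_haps: Set[Haplotype] = set().union(*per_accession.values())
    captured: Set[Haplotype] = set()
    core: List[str] = []
    while len(captured) < target * len(all_haps):
        # Sorting makes tie-breaking deterministic (alphabetical by name).
        candidates = sorted(set(per_accession) - set(core))
        best = max(candidates, key=lambda name: len(per_accession[name] - captured))
        core.append(best)
        captured |= per_accession[best]
    return core


# Made-up collection of five accessions genotyped at four SNPs.
TOY: Genotypes = {
    "A": [0, 0, 0, 0],
    "B": [0, 0, 1, 1],
    "C": [1, 1, 0, 0],
    "D": [1, 1, 1, 1],
    "E": [0, 1, 0, 1],
}

if __name__ == "__main__":
    for block_len in (1, 2, 4):
        core = greedy_core(TOY, block_len, target=1.0)
        print(f"block length {block_len}: {len(core)} accessions -> {core}")
```

On this toy collection, capturing all haplotypes takes 2 accessions with single-SNP blocks, 3 with two-SNP blocks, and all 5 with one four-SNP block, which matches the direction of the trend described in the abstract. A greedy search does not guarantee the smallest possible core, and the toy data say nothing about the asymptotic behavior reported for the real data sets.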