Publication : USDA ARS

Research Project: Improving Crop Efficiency Using Genomic Diversity and Computational Modeling

Location: Plant, Soil and Nutrition Research

Title: The Practical Haplotype Graph, a platform for storing and using pangenomes for imputation

Author

	Bradbury, Peter
	Casstevens, Terry - Cornell University
	Jensen, Sarah - Cornell University
	Johnson, Lynn - Cornell University
	Miller, Zachary - Cornell University
	Monier, Brandon - Cornell University
	Romay, Maria - Cornell University
	Song, Baoxing - Cornell University
	Buckler, Edward (Ed)

Submitted to: bioRxiv
Publication Type: Pre-print Publication
Publication Acceptance Date: 8/28/2021
Publication Date: 8/28/2021
Citation: Bradbury, P., Casstevens, T., Jensen, S.E., Johnson, L.C., Miller, Z.R., Monier, B., Romay, M.C., Song, B., Buckler IV, E.S. 2021. The Practical Haplotype Graph, a platform for storing and using pangenomes for imputation. bioRxiv. 2021.08.27.457652. https://doi.org/10.1101/2021.08.27.457652.
DOI: https://doi.org/10.1101/2021.08.27.457652

Interpretive Summary: A genome is the entire DNA sequence of an individual. Individuals within a species or population can vary considerably in genome content, so a single reference genome represents that diversity inadequately, yet using several genomes together is challenging. Plant species can be particularly difficult: for example, as much as 40% of the genome may differ between two maize lines. A system for organizing and using information from multiple genomes, sometimes called a pangenome, would therefore be very useful. The Practical Haplotype Graph (PHG) addresses this problem by dividing a single reference genome into a large number of biologically meaningful intervals, or reference ranges, and then organizing additional genomes by finding the sequences in them that match those reference ranges. The PHG provides software for doing this and a database for storing the resulting information. In addition, the PHG uses that information to impute full genomic sequence for new samples from a relatively small amount of DNA sequence or from sets of genetic markers. This gives research and breeding programs a method for generating individual genotypes at a low cost per sample. This paper describes the design of the PHG, reports its performance in terms of speed and data storage efficiency, and uses simulated data to evaluate imputation accuracy. It cites additional papers that report the use of the PHG for maize, sorghum, wheat, and cassava. It also describes the tools available and under development for using and evaluating the data the PHG generates.

Technical Abstract: Motivation: Pangenomes provide novel insights for population and quantitative genetics, genomics, and breeding that are not available from studying a single reference genome; a species is better represented by a pangenome, or collection of genomes. Unfortunately, managing and using pangenomes for genomically diverse species is computationally and practically challenging. We developed a trellis graph representation anchored to the reference genome that represents most pangenomes well and can be used to impute complete genomes from low density sequence or variant data. Results: The Practical Haplotype Graph (PHG) is a pangenome pipeline, database (PostgreSQL and SQLite), data model (Java, Kotlin, or R), and Breeding API (BrAPI) web service. The PHG has already accurately represented diversity in four major crops, including maize, one of the most genomically diverse species, with up to 1000-fold data compression. Using simulated data, we show that, even at 0.1X coverage, with appropriate reads and sequence alignment, imputation results in extremely accurate haplotype reconstruction. The PHG is a platform and environment for the understanding and application of genomic diversity. Availability: All resources listed here are freely available. The PHG Docker image used to generate the simulation results is available on Docker Hub (https://hub.docker.com/) as maizegenetics/phg:0.0.27. PHG source code is at https://bitbucket.org/bucklerlab/practicalhaplotypegraph/src/master/. The code used for the analysis of simulated data is at https://bitbucket.org/bucklerlab/phg-manuscript/src/master/. The PHG database of NAM parent haplotypes is in the CyVerse data store (https://de.cyverse.org/de/) and is named /iplant/home/shared/panzea/panGenome/PHG_db_maize/phg_v5Assemblies_20200608.db.
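
Illustrative sketch (not taken from the PHG code base): the abstract describes a trellis graph anchored to reference ranges and the imputation of a sample's haplotypes from sparse reads, but it does not specify the algorithm. The Python example below assumes a Viterbi-style dynamic program as one plausible way to pick a path through such a trellis. The haplotype labels, read counts, `switch_penalty`, and `pseudocount` are all hypothetical values chosen for illustration.

```python
from math import log

# Toy trellis: one list of candidate haplotype IDs per reference range.
# Hypothetical data, not drawn from any PHG database.
TRELLIS = [
    ["A", "B"],       # reference range 0
    ["A", "B", "C"],  # reference range 1
    ["A", "C"],       # reference range 2
]

# Hypothetical counts of a new sample's reads that match each haplotype
# in each reference range (sparse, as with low-coverage sequencing).
READ_COUNTS = [
    {"A": 3, "B": 0},
    {"A": 1, "B": 0, "C": 0},
    {"A": 2, "C": 1},
]


def impute_path(trellis, read_counts, switch_penalty=2.0, pseudocount=0.5):
    """Return the highest-scoring haplotype path through the trellis.

    Each range scores a haplotype by the log of its smoothed share of
    reads; changing haplotype between adjacent ranges costs
    `switch_penalty` (an assumed, illustrative parameter).
    """
    def emission(r, hap):
        counts = read_counts[r]
        total = sum(counts.values()) + pseudocount * len(trellis[r])
        return log((counts.get(hap, 0) + pseudocount) / total)

    # best[hap] = (score of the best path ending in hap, that path)
    best = {h: (emission(0, h), [h]) for h in trellis[0]}
    for r in range(1, len(trellis)):
        nxt = {}
        for h in trellis[r]:
            candidates = [
                (score - (0.0 if prev == h else switch_penalty), path)
                for prev, (score, path) in best.items()
            ]
            score, path = max(candidates, key=lambda c: c[0])
            nxt[h] = (score + emission(r, h), path + [h])
        best = nxt
    return max(best.values(), key=lambda c: c[0])[1]
```

Calling `impute_path(TRELLIS, READ_COUNTS)` on the toy data returns `['A', 'A', 'A']`: the single read supporting haplotype C in the last range is not enough to outweigh the switch penalty. A real implementation would derive the scores from read mappings and recombination expectations rather than from fixed constants.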
