Research Project: Improving Crop Efficiency Using Genomic Diversity and Computational Modeling

Location: Plant, Soil and Nutrition Research

Title: The Practical Haplotype Graph, a platform for storing and using pangenomes for imputation.

Author
BRADBURY, PETER - Former ARS Employee
CASSTEVENS, TERRY - Cornell University
JENSEN, SARA - Syngenta
JOHNSON, LYNN - Cornell University
MILLER, ZACHARY - Cornell University
MONIER, BRANDON - Cornell University
ROMAY, MARIA - Cornell University
SONG, BAOXING - Cornell University
Buckler, Edward - Ed

Submitted to: Bioinformatics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 6/22/2022
Publication Date: 6/24/2022
Citation: Bradbury, P.J., Casstevens, T., Jensen, S.E., Johnson, L.C., Miller, Z.R., Monier, B., Romay, M.C., Song, B., Buckler IV, E.S. 2022. The Practical Haplotype Graph, a platform for storing and using pangenomes for imputation. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac410.
DOI: https://doi.org/10.1093/bioinformatics/btac410

Interpretive Summary: A genome is the entire DNA sequence of an individual. Individuals within a species or population can vary considerably in genome content. A single reference genome represents that diversity inadequately, but using several genomes together is challenging. Plant species can be particularly difficult: for example, as much as 40% of the genome may differ between two maize lines. A system for organizing and using information from multiple genomes, sometimes called a pangenome, would therefore be very useful. The Practical Haplotype Graph (PHG) addresses this problem by dividing a single reference genome into a large number of biologically meaningful intervals, or reference ranges, and then organizing additional genomes by finding the sequences in them that match each reference range. The PHG provides software for doing this and a database for storing the resulting information. In addition, the PHG uses that information to impute full genomic sequence for new samples from a relatively small amount of DNA sequence or from sets of genetic markers. This gives research or breeding programs a method to generate individual genotypes at a low cost per sample. This paper describes the design of the PHG, reports its performance in terms of speed and data storage efficiency, and uses simulated data to evaluate imputation accuracy. It cites additional papers that report the use of the PHG for maize, sorghum, wheat, and cassava. It also describes the tools available and under development for using and evaluating the data the PHG generates.

Technical Abstract: Motivation: Pangenomes provide novel insights for population and quantitative genetics, genomics, and breeding that are not available from studying a single reference genome. A species is better represented by a pangenome, or collection of genomes, than by any one reference. Unfortunately, managing and using pangenomes for genomically diverse species is computationally and practically challenging. We developed a trellis graph representation anchored to the reference genome that represents most pangenomes well and can be used to impute complete genomes from low-density sequence or variant data.
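
The idea of a trellis anchored to reference ranges can be illustrated with a toy example. The sketch below is not the PHG's code or API, and this page does not describe the PHG's actual imputation algorithm. It assumes each reference range offers one candidate haplotype per donor genome, that evidence for a new sample arrives as a count of reads matching each candidate, and that a generic Viterbi search with a fixed penalty for switching donors picks the path. The function name `impute_path`, the `switch_penalty` parameter, and the donor labels "A" and "B" are all hypothetical.

```python
from math import log


def impute_path(read_counts, switch_penalty=2.0):
    """Pick one haplotype per reference range for a sparsely sequenced sample.

    read_counts: list with one dict per reference range, mapping a donor
    label to the number of sample reads matching that donor's haplotype.
    Returns a list of donor labels, one per reference range.
    """
    if not read_counts:
        return []

    def step_cost(prev, cur):
        # Staying on the same donor is free; switching donors is penalized.
        return 0.0 if prev == cur else switch_penalty

    # score[h]: best log-score of any path ending on donor h at this range.
    # log(count + 1) is an illustrative emission score, not a fitted model.
    score = {h: log(c + 1) for h, c in read_counts[0].items()}
    backpointers = []

    for counts in read_counts[1:]:
        new_score, pointers = {}, {}
        for h, c in counts.items():
            best_prev = max(score, key=lambda p: score[p] - step_cost(p, h))
            new_score[h] = score[best_prev] - step_cost(best_prev, h) + log(c + 1)
            pointers[h] = best_prev
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state to recover the full path.
    path = [max(score, key=score.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Three reference ranges; a single stray read supports donor "B" in the middle.
noisy = [{"A": 5, "B": 0}, {"A": 0, "B": 1}, {"A": 4, "B": 0}]
smoothed = impute_path(noisy)

# Four reference ranges with strong evidence for a real switch from "A" to "B".
recombinant = [{"A": 9, "B": 0}, {"A": 9, "B": 0}, {"A": 0, "B": 9}, {"A": 0, "B": 9}]
switched = impute_path(recombinant)
```

In this toy setting the switch penalty keeps the path on donor "A" through the weakly supported middle range of the first sample, while the strong evidence in the second sample is enough to justify a switch. A real system would need a principled emission and transition model rather than the fixed values assumed here.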