Publication: USDA ARS

Publication #374771

Research Project: Mapping Crop Genome Functions for Biology-Enabled Germplasm Improvement

Location: Plant, Soil and Nutrition Research

Title: Highly accurate HiFi long read sequencing data for five complex genome samples

Author

	HON, TING - Pacific Biosciences Inc
	MARS, KRISTIN - Pacific Biosciences Inc
	YOUNG, GREG - Pacific Biosciences Inc
	TSAI, YU-CHIH - Pacific Biosciences Inc
	KAURALIS, JOSEPH - Pacific Biosciences Inc
	LANDOLIN, JANE - Ravel Biotechnology
	MAURER, NICHOLAS - University Of California Santa Cruz
	KUDRNA, DAVID - Arizona Genomics Institute
	HARDIGAN, MICHAEL - University Of California, Davis
	STEINER, CYNTHIA - Beckman Research Institute
	KNAPP, STEVE - University Of California, Davis
	Ware, Doreen
	SHAPIRO, BETH - University Of California Santa Cruz
	PELUSO, PAUL - Pacific Biosciences Inc
	RANK, DAVID - Pacific Biosciences Inc

Submitted to: Scientific Data - Nature
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 10/27/2020
Publication Date: 10/27/2020
Citation: Hon, T., Mars, K., Young, G., Tsai, Y., Kauralis, J., Landolin, J.M., Maurer, N., Kudrna, D., Hardigan, M.A., Steiner, C.C., Knapp, S., Ware, D., Shapiro, B., Peluso, P., Rank, D.R. 2020. Highly accurate HiFi long read sequencing data for five complex genome samples. Scientific Data - Nature. 7. Article e399. https://doi.org/10.1038/s41597-020-00743-4.
DOI: https://doi.org/10.1038/s41597-020-00743-4

Interpretive Summary: Benchmark data sets are needed to validate and improve genome assembly algorithms. In this paper we present deep-coverage PacBio HiFi sequencing reads for mouse, frog, corn, and strawberry genomes, with average read lengths of 10-25 kb and accuracy greater than 99.5%. We also include a data set from a mock microbial community metagenome. These data sets can be used without restriction to develop new algorithms that support assembly and analysis of complex genome structure and evolution.

Technical Abstract: The PacBio® HiFi sequencing method yields highly accurate long-read sequencing datasets whose reads average 10-25 kb in length with accuracies greater than 99.5%. These accurate long reads improve results for complex applications such as single nucleotide and structural variant detection, genome assembly, assembly of difficult polyploid or highly repetitive genomes, and assembly of metagenomes. Currently, sample datasets are needed both to evaluate the benefits of these long accurate reads and to develop bioinformatic tools, including genome assemblers, variant callers, and haplotyping algorithms. We present deep-coverage HiFi datasets for five complex samples, including two inbred model genomes, Mus musculus and Zea mays, and two outbred complex genomes, the octoploid Fragaria × ananassa and the anuran Rana muscosa. Additionally, we release sequence data from a mock metagenome community. The datasets reported here can be used without restriction to develop new algorithms and explore complex genome structure and evolution. Data were generated on the PacBio Sequel II instrument.
