Location: Corn Insects and Crop Genetics Research
Title: Data from polishCLR: Example input genome assemblies
Submitted to: Ag Data Commons
Publication Type: Database / Dataset
Publication Acceptance Date: 2/9/2022
Publication Date: 2/9/2022
Citation: Stahlke, A.R., Coates, B.S. 2022. Data from polishCLR: Example input genome assemblies. Ag Data Commons. https://doi.org/10.15482/USDA.ADC/1524676.
Interpretive Summary: High-quality genome assemblies for pest insects are important for investigating how different genomic regions contribute to the difficulty of controlling the damage these insects cause to crops and livestock. Some sequencing methods that generate long stretches of DNA sequence are prone to error. These errors require correction before the data are used in downstream genome assembly applications. A dataset of corrected long reads generated using a new software application was deposited in the National Agricultural Library, Ag Data Commons. This dataset and the associated methods provide guidance to government, university, and industry stakeholders who are interested in generating genome assemblies.
Technical Abstract: We developed a publicly available, flexible, and reproducible workflow to produce the best possible de novo, chromosome-scale genome assembly from error-prone continuous long read (CLR) data. The workflow, called polishCLR, is containerized so that it can be run on any conventional high-performance computing system. This dataset provides example input data, in the form of primary contig assemblies, to test and reproduce the demonstrated utility of our workflow. The polishCLR workflow can be initiated from three input cases: Case 1, an unresolved primary assembly with associated contigs, the output of FALCON 2-asm: p_ctg.fasta and a_ctg.fasta; Case 2, a haplotype-resolved but unpolished set, the output of FALCON-Unzip 3-unzip: all_p_ctg.fasta and all_h_ctg.fasta; and Case 3, a haplotype-resolved, CLR long-read, Arrow-polished set of primary and alternate contigs, the output of FALCON-Unzip 4-polish: cns_p_ctg.fasta and cns_h_ctg.fasta. These example data are the input contig assemblies for the pest Helicoverpa zea. The contigs were built from 49.89 Gb of raw Pacific Biosciences (PacBio) CLR data generated from a single male of the H. zea HzStark_Cry1AcR strain.
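The three input cases above differ only in which pair of FASTA files is present, so the case can be inferred from the file names before a run is set up. The following is a minimal sketch of that check, not part of polishCLR itself: the function name `detect_input_case`, the `INPUT_CASES` table, and the assumption that both files of a pair sit in one directory are hypothetical and used here only for illustration. Only the six file names come from the abstract.

```python
from pathlib import Path
from typing import Optional

# File-name pairs for each input case, as listed in the abstract.
# The dictionary itself is an illustrative construct, not a polishCLR API.
INPUT_CASES = {
    1: ("p_ctg.fasta", "a_ctg.fasta"),          # FALCON 2-asm
    2: ("all_p_ctg.fasta", "all_h_ctg.fasta"),  # FALCON-Unzip 3-unzip
    3: ("cns_p_ctg.fasta", "cns_h_ctg.fasta"),  # FALCON-Unzip 4-polish
}


def detect_input_case(assembly_dir) -> Optional[int]:
    """Return the highest-numbered case whose two files both exist.

    Assumes (hypothetically) that both files of a pair are in
    `assembly_dir`. Returns None when no complete pair is found.
    """
    directory = Path(assembly_dir)
    # Check Case 3 first so the most processed contig set is preferred.
    for case in sorted(INPUT_CASES, reverse=True):
        if all((directory / name).is_file() for name in INPUT_CASES[case]):
            return case
    return None
```

For example, calling `detect_input_case("my_assembly")` on a directory that holds only `all_p_ctg.fasta` and `all_h_ctg.fasta` would return `2`, and a directory with an incomplete pair would return `None`.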