Research Project: Advancing Molecular Pest Management, Diagnostics, and Eradication of Fruit Flies and Invasive Species

Location: Tropical Pest Genetics and Molecular Biology Research Unit

Title: CCS-consensuser: A haplotype-aware consensus generator for PacBio amplicon sequences

Author
CONGRAINS, CARLOS - University Of Hawaii
Bremer, Forest
DUPUIS, JULIAN - University Of Kentucky
BARR, NORMAN - Animal And Plant Health Inspection Service (APHIS)
GARZÓN-ORDUÑA, IVONNE - The National Autonomous University Of Mexico
RUBINOFF, DANIEL - University Of Hawaii
DOORENWEERD, CAMIEL - University Of Hawaii
SAN JOSE, MICHAEL - University Of Hawaii
MORRIS, KIMBERLEY - University Of Hawaii
Kauwe, Angela
Geib, Scott

Submitted to: Molecular Ecology Resources
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 3/24/2025
Publication Date: 4/4/2025
Citation: Congrains, C., Bremer, F., Dupuis, J.R., Barr, N.B., Garzón-Orduña, I.J., Rubinoff, D., Doorenweerd, C., San Jose, M., Morris, K., Kauwe, A., Geib, S. 2025. CCS-consensuser: A haplotype-aware consensus generator for PacBio amplicon sequences. Molecular Ecology Resources. Article e14113. https://doi.org/10.1111/1755-0998.14113.
DOI: https://doi.org/10.1111/1755-0998.14113

Interpretive Summary: Advances in sequencing technology have revolutionized diverse fields of science. One relatively new technique is long-read sequencing, which cost-effectively generates millions of fragments, each several thousand bases long (10,000-20,000 bp or longer). This method is also flexible: it can be applied to produce complete genome assemblies, as well as to investigate the genetic variation in specific gene(s) or genomic region(s). For the latter application, we highlight the technique referred to as amplicon sequencing, which consists of generating multiple copies of the target region(s) via PCR during library preparation, thus effectively mitigating the higher error rates associated with these sequencing platforms. The ultimate goal in these experiments is to obtain a single, error-free sequence for a sample (or a set of sequences if phasing or copy number analysis is necessary). However, no software is currently available to create consensus sequences, in the absence of a reference, from these collections of independent long reads obtained using PacBio circular consensus sequencing (CCS). To address this gap, we have developed CCS-consensuser, an end-to-end pipeline that generates consensus sequences from amplicon sequencing using PacBio CCS. Our results indicate that the outcomes of this software are highly concordant with other sequencing approaches (Sanger and Illumina) and highly accurate when compared to simulated data. The primary advantage of our method is the ability to obtain multiple final sequences per sample, enabling the investigation of heteroplasmy in mtDNA (mtDNA variants within a single organism), the identification of cross-contamination, and the resolution of the phase of nuclear genes in diploid organisms. As a result, our strategy holds promise for a wide range of applications in biology that have been challenging to assess using traditional techniques.

Technical Abstract: DNA sequencing technology has undergone substantial improvements in recent years, to the extent that Third Generation Sequencing platforms are capable of generating long reads at massive scale. Amplicon sequencing has been among the most popular techniques due to its wide application in diverse fields of biological sciences. However, there is a lack of software specifically designed to analyze intra-individual genetic variation using amplicon long-read data. Here, we present CCS-consensuser, an end-to-end pipeline that generates consensus sequences from amplicon sequencing using high-fidelity reads produced by PacBio circular consensus sequencing (CCS). We evaluated concordance between the results produced using CCS + CCS-consensuser and those from other sequencing platforms (Illumina and Sanger), as well as accuracy using a simulated dataset. This assessment showed that CCS amplicon data coupled with CCS-consensuser can produce high-quality sequences (PHRED > 30) with high levels of concordance between approaches (up to 94.94% for Sanger and 92.61% for Illumina) and accuracy (up to 95.75%). Furthermore, our pipeline can be used to detect heteroplasmy in mtDNA, cross-contamination, and resolve the phase of nuclear genes in diploid organisms. These results not only support its potential for application in studies using haploid data such as DNA barcoding, but also demonstrate its unique capacity to explore within-individual haplotype variation. Therefore, our strategy shows promise for a broad range of applications in biology and medicine that have been challenging to assess using traditional techniques.
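
For readers unfamiliar with the PHRED scale cited above, the short sketch below converts between a PHRED quality score Q and the per-base error probability p using the standard definition Q = -10 * log10(p). It is only an illustration of what the PHRED > 30 threshold means and is not part of CCS-consensuser; the function names and the 1,500 bp amplicon length in the usage note are hypothetical.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Per-base error probability implied by a PHRED score (p = 10^(-Q/10))."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """PHRED score implied by a per-base error probability (Q = -10*log10(p))."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def expected_errors(q: float, length_bp: int) -> float:
    """Expected number of erroneous bases in a sequence of uniform quality q.

    Assumes every base has the same quality, which is a simplification.
    """
    return phred_to_error_prob(q) * length_bp
```

Under this definition, PHRED 30 corresponds to an error probability of 0.001, or about one expected error per 1,000 bases. For a hypothetical 1,500 bp amplicon of uniform quality, `expected_errors(30, 1500)` is therefore roughly 1.5 bases.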