Location: Tropical Pest Genetics and Molecular Biology Research Unit
Title: CCS-consensuser: A haplotype-aware consensus generator for PacBio amplicon sequences
Author
CONGRAINS, CARLOS - University Of Hawaii
Bremer, Forest
DUPUIS, JULIAN - University Of Kentucky
BARR, NORMAN - Animal And Plant Health Inspection Service (APHIS)
GARZÓN-ORDUÑA, IVONNE - The National Autonomous University Of Mexico
RUBINOFF, DANIEL - University Of Hawaii
DOORENWEERD, CAMIEL - University Of Hawaii
SAN JOSE, MICHAEL - University Of Hawaii
MORRIS, KIMBERLEY - University Of Hawaii
Kauwe, Angela
Geib, Scott
Submitted to: Molecular Ecology Resources
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 3/24/2025
Publication Date: 4/4/2025
Citation: Congrains, C., Bremer, F., Dupuis, J.R., Barr, N.B., Garzón-Orduña, I.J., Rubinoff, D., Doorenweerd, C., San Jose, M., Morris, K., Kauwe, A., Geib, S. 2025. CCS-consensuser: A haplotype-aware consensus generator for PacBio amplicon sequences. Molecular Ecology Resources. Article e14113. https://doi.org/10.1111/1755-0998.14113.
DOI: https://doi.org/10.1111/1755-0998.14113

Interpretive Summary: Advances in sequencing technology have revolutionized diverse fields of science. One relatively new technique is long-read sequencing, which cost-effectively generates millions of fragments, each several thousand bases long (10,000-20,000 bp or longer). The method is flexible: it can be used to produce complete genome assemblies, as well as to investigate genetic variation in specific gene(s) or genomic region(s). For the latter application, we highlight amplicon sequencing, which consists of generating multiple copies of the target region(s) via PCR during library preparation, thus effectively mitigating the higher error rates associated with these sequencing platforms. The ultimate goal of these experiments is to obtain a single, error-free sequence for a sample (or a set of sequences if phasing or copy number analysis is necessary). However, no software is currently available to create consensus sequences, in the absence of a reference, from these collections of independent long reads obtained using PacBio circular consensus sequencing (CCS). To address this gap, we have developed CCS-consensuser, an end-to-end pipeline that generates consensus sequences from amplicon sequencing using PacBio CCS. Our results indicate that the outcomes of this software are highly concordant with other sequencing approaches (Sanger and Illumina) and highly accurate when compared to simulated data. The primary advantage of our method is the ability to obtain multiple final sequences per sample, enabling the investigation of heteroplasmy in mtDNA (mtDNA variants within a single organism), the identification of cross-contamination, and the resolution of the phase of nuclear genes in diploid organisms. As a result, our strategy holds promise for a wide range of applications in biology that have been challenging to assess using traditional techniques.

Technical Abstract: DNA sequencing technology has improved substantially in recent years, to the extent that Third Generation Sequencing platforms can now generate long reads at massive scale. Amplicon sequencing has been among the most popular techniques because of its wide application across the biological sciences. However, there is a lack of software specifically designed to analyze intra-individual genetic variation using amplicon long-read data. Here, we present CCS-consensuser, an end-to-end pipeline that generates consensus sequences from amplicon sequencing using high-fidelity reads produced by PacBio circular consensus sequencing (CCS). We evaluated the concordance of results produced using CCS + CCS-consensuser with those from other sequencing platforms (Illumina and Sanger), as well as accuracy using a simulated dataset. This assessment showed that CCS amplicon data coupled with CCS-consensuser can produce high-quality sequences (PHRED > 30) with high levels of concordance between approaches (up to 94.94% for Sanger and 92.61% for Illumina) and accuracy (up to 95.75%). Furthermore, our pipeline can be used to detect heteroplasmy in mtDNA, detect cross-contamination, and resolve the phase of nuclear genes in diploid organisms. These results not only support its potential for application in studies using haploid data such as DNA barcoding, but also demonstrate its unique capacity to explore within-individual haplotype variation. Therefore, our strategy shows promise for a broad range of applications in biology and medicine that have been challenging to assess using traditional techniques.
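
As a reading aid for the quality threshold quoted in the abstract (PHRED > 30): under the standard Phred definition, a quality score Q corresponds to a per-base error probability P = 10^(-Q/10), so Q30 means roughly one expected error per 1,000 bases. The short Python sketch below only illustrates that conversion. It is not part of CCS-consensuser, the function names are hypothetical, and the 1,500 bp amplicon length in the usage lines is an assumed value chosen purely for illustration.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to a per-base error probability.

    Uses the standard definition P = 10 ** (-Q / 10).
    """
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert a per-base error probability P back to a Phred score.

    Uses the standard definition Q = -10 * log10(P).
    """
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


def expected_errors(q: float, length_bp: int) -> float:
    """Expected number of erroneous bases in a sequence of length_bp bases,
    assuming every base has the same Phred quality q (a simplification)."""
    return phred_to_error_prob(q) * length_bp


if __name__ == "__main__":
    # Hypothetical amplicon length, used only for illustration.
    amplicon_length_bp = 1500
    print(phred_to_error_prob(30))
    print(expected_errors(30, amplicon_length_bp))
```

Treating every base as having the same quality is a simplification; real consensus sequences carry per-base qualities, so the expected error count would be a sum over positions.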
