Research Project: Development of Assembly Pipelines for Long-read Sequencing Data

Location: Stored Product Insect and Engineering Research

Project Number: 3020-43000-033-010-S
Project Type: Non-Assistance Cooperative Agreement

Start Date: Sep 15, 2015
End Date: Sep 14, 2020

Objective:
To collaborate on the production of pipelines for the accurate and economical assembly of long-read data.

Approach:
Functional genomics is a promising but challenging tool for developing new insect control methods. Analyses often depend on a complete, well-annotated reference genome, which has been lacking for storage pests because of the difficulty of obtaining genetic material and assembling sequences. With the advent of long-read sequencing, such as the technology developed by Pacific Biosciences (PacBio), obtaining sufficient quantities of genetic material is now more economical, but the challenge of assembly remains. Assembling long-read data is not trivial, because sophisticated algorithms are needed to correct errors in the sequences.
Our previous experience working with Nimbix has been extremely positive. They have worked closely with our bioinformatics scientists and staff to successfully assemble our first draft genome for a stored product insect using exclusively PacBio long-read data. Working with Nimbix scientists and software developers, we made three different assemblies from approximately 56x coverage of a 476 Mb genome, using an error correction program called MHAP and different assembly parameters. Each assembly used increasing amounts of computer time and yielded different assembly statistics (such as N50), and we worked closely with the team to adjust the control scripts and to troubleshoot. We are now evaluating each assembly to determine how best to implement the pipeline for future insect long-read data assemblies.
Under this agreement, we will continue to evaluate and test our current pipeline, which is based on MHAP and the Celera Assembler, with additional datasets. As our future research plans include additional genome assemblies, we will partner with Nimbix to provide the computer infrastructure to successfully assemble long-read data and develop new pipelines.
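For readers unfamiliar with the figures quoted above, the following is a minimal Python sketch of how N50 and sequencing coverage are commonly computed. It is an illustration only, not part of the project's pipeline; the function names and the example contig lengths are hypothetical, and the only values taken from the text are the approximate 56x coverage and the 476 Mb genome size.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly: the length of the shortest contig in
    the smallest set of longest contigs that covers at least half of the
    total assembled bases."""
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    # Walk the contigs from longest to shortest, accumulating bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def bases_for_coverage(genome_size_bp, coverage):
    """Return the number of sequenced bases implied by a coverage depth."""
    return genome_size_bp * coverage


# Hypothetical contig lengths, used only to illustrate the calculation.
example_contigs = [80, 70, 50, 40, 30, 20]
example_n50 = n50(example_contigs)

# Figures from the text: roughly 56x coverage of a 476 Mb genome.
sequenced_bases = bases_for_coverage(476_000_000, 56)
```

Under these assumptions, the example contigs give an N50 of 70, and 56x coverage of a 476 Mb genome corresponds to roughly 26.7 Gb of raw sequence. A higher N50 generally indicates a more contiguous assembly, which is one reason it is used to compare assemblies built with different parameters.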
The algorithms for analyzing long-read data are constantly evolving as the technology provides progressively longer reads. Therefore, we will work with Nimbix to develop new pipelines for long-read assembly based on currently available algorithms. Nimbix has dedicated resources to the development of these pipelines in order to offer user-friendly, web-based interfaces to customers who need an economical and reliable computing resource for long-read data assembly.