
Title: The need for an assembly pilot project

Kuhn, David
SASKI, CHRIS - Clemson University
FELTUS, F - Clemson University
HAIMINEN, NIINA - International Business Machines Corporation (IBM)
MAIN, DORRIE - Washington State University
MAY, GREG - National Center For Genome Resources
Schnell II, Raymond
MOTAMAYOR, J - Mars, Inc
MOCKAITIS, KEITHANNE - Indiana University
Scheffler, Brian
SHAPIRO, HOWARD - Mars, Inc

Submitted to: Plant and Animal Genome Conference
Publication Type: Abstract Only
Publication Acceptance Date: 1/10/2010
Publication Date: 1/15/2010
Citation: Kuhn, D.N., Saski, C., Feltus, F.A., Haiminen, N., Main, D., May, G.D., Schnell II, R.J., Motamayor, J.C., Mockaitis, K., Scheffler, B.E., Shapiro, H. 2010. The need for an assembly pilot project. Plant and Animal Genome Conference. 1.

Interpretive Summary: Theobroma cacao, the source of cocoa beans for chocolate, is an important tropical agricultural commodity. It is affected by a number of fungal pathogens and insect pests, and growers also face concerns about yield and quality. We are trying to find molecular genetic markers linked to disease resistance and other economically important traits to support a marker-assisted selection (MAS) breeding program for cacao and to ensure a reliable supply of cocoa for the US confectionery industry. Currently there are about 500 molecular genetic markers for cacao, and we are taking advantage of the cacao genome sequencing project to expand that number to more than 50,000 single-nucleotide polymorphism (SNP) markers. The assembly pilot project will facilitate the completion of the cacao genome sequencing project. We will use the SNP markers to improve the resolution of our current genetic maps and to find associations between specific SNPs and advantageous traits such as disease resistance or higher yield. Our results are important to scientists trying to understand the mechanism of disease resistance and, eventually, to cacao farmers, who will benefit from superior disease-resistant and more productive cultivars produced through our MAS breeding program.

Technical Abstract: Progress has been rapid since the cacao genome sequencing project began in June 2008: the physical map has been completed, and approximately 10x coverage of the genome has been accumulated as Titanium 454 sequence data from Matina1-6, the highly homozygous Amelonado tree chosen for the project. Our IBM collaborators have been analyzing the currently available sequence assembly software and benchmarking it with synthetic datasets of various sizes and error rates. Serious concerns have been raised about whether a genome the size of cacao (n=10, ~460 Mb) can be assembled de novo from 454 sequence data. The current assembly of the 454 data (version 3) has 171,816 contigs (296 Mb), whereas the physical map produced at CUGI has only 295 contigs (representing >90% of the genome), 109 of which are anchored to the genetic recombination map. A pilot assembly project has been proposed that uses the pooled BACs from the minimum tiling path of a single physical map contig (~3 Mb), covering a region of the cacao genome that contains several disease resistance and horticultural QTLs, to determine whether de novo assembly of a region of that size is possible from 454 sequence data. In addition, a subset of the BACs (~1 Mb) will be Sanger sequenced. To test the assembly pipeline, a synthetic dataset will be prepared with a distribution of read lengths and error rates that reflects those typically found in 454 sequence data. Successful assembly at the pilot scale will provide a strategy for completing the assembly of the genome sequence represented by the physical map.
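
The synthetic dataset described above could be prepared along the following lines. This is only an illustrative sketch, not the project's actual pipeline: the function names `simulate_reads` and `add_homopolymer_errors` are hypothetical, and the read-length parameters (mean 400 bases, standard deviation 80) and the per-base error rate are assumed values chosen for illustration. The error model assumes that errors are insertions or deletions of one base within a homopolymer run, the error type most commonly associated with 454 data.

```python
import random


def add_homopolymer_errors(seq, indel_rate, rng):
    """Lengthen or shorten homopolymer runs by one base at random.

    Assumption: the chance of an error grows with run length
    (indel_rate * run length, capped at 1.0).
    """
    out = []
    i = 0
    while i < len(seq):
        # Find the end of the current run of identical bases.
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        run = j - i
        if rng.random() < min(1.0, indel_rate * run):
            run += rng.choice((-1, 1))  # one-base deletion or insertion
        out.append(seq[i] * run)
        i = j
    return "".join(out)


def simulate_reads(reference, coverage, mean_len=400, sd_len=80,
                   indel_rate=0.01, seed=0):
    """Sample reads from `reference` until `coverage` x its length is drawn.

    All default parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    target_bases = int(coverage * len(reference))
    reads, sampled = [], 0
    while sampled < target_bases:
        # Draw a read length from a normal distribution, clamped so it is
        # at least 50 bases and never longer than the reference.
        length = int(rng.gauss(mean_len, sd_len))
        length = min(max(length, 50), len(reference))
        start = rng.randint(0, len(reference) - length)
        fragment = reference[start:start + length]
        reads.append(add_homopolymer_errors(fragment, indel_rate, rng))
        sampled += length
    return reads
```

Because the true origin of every simulated read is known, reads generated this way from a known reference (for example, a finished BAC sequence) can be passed to an assembler and the resulting contigs compared against that reference. Varying `coverage` and `indel_rate` would give datasets of different sizes and error rates for benchmarking.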