Next-generation sequencing (NGS) technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to undertake hundreds of projects to decode the genomes of previously unsequenced organisms. The sequence data generated by one of these projects may contain billions of short DNA sequences ("reads"), each 100 to 150 nucleotides in length. These reads must be assembled de novo before most genome analyses can begin.
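To make the de novo assembly problem concrete, here is a minimal toy sketch (not any of the assemblers discussed in the talk): it greedily merges the pair of reads sharing the longest suffix-prefix overlap. Real NGS assemblers use overlap or de Bruijn graphs and must cope with sequencing errors and repeats; the function names and the minimum-overlap threshold are illustrative assumptions only.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that matches a prefix of `b`
    (at least `min_len` long); 0 if there is no such overlap."""
    best = 0
    for k in range(min_len, min(len(a), len(b)) + 1):
        if a.endswith(b[:k]):
            best = k
    return best

def greedy_assemble(reads):
    """Toy greedy assembly: repeatedly merge the best-overlapping pair.
    Returns the list of contigs left when no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, bi, bj = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b)
                    if olen > best_len:
                        best_len, bi, bj = olen, i, j
        if best_len == 0:
            break  # no overlaps left; remaining reads stay separate contigs
        merged = reads[bi] + reads[bj][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (bi, bj)]
        reads.append(merged)
    return reads
```

For example, three overlapping reads drawn from the sequence `ATGCGTACGTT` reassemble into a single contig: `greedy_assemble(["ATGCGT", "GCGTAC", "TACGTT"])` yields `["ATGCGTACGTT"]`. The quadratic pairwise comparison here is exactly what makes assembling billions of reads so demanding in practice.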
The short lengths and enormous numbers of raw reads make NGS genome assembly extremely challenging. Several new whole-genome assemblers have been developed to meet this challenge, but the quality of most NGS assemblies falls far short of that of assemblies produced with earlier Sanger technology. In this talk, I will describe our recent experience assembling several large genomes that were sequenced with Illumina short-read technology. I will also discuss our recent evaluation of several leading assembly algorithms on four different Illumina data sets. Three overarching conclusions will become apparent: first, that data quality has a dramatic effect on the quality of an assembled genome, independent of the assembly software; second, that the contiguity of an assembly varies enormously among different assemblers; and third, that the correctness of assemblies varies widely and unexpectedly among the leading assembly programs. Finally, I will discuss our strategy for sequencing and assembling the genome of the pine tree, Pinus taeda, which is eight times as large as the human genome.
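Assembly contiguity, the second point above, is conventionally summarized by the N50 statistic (the abstract does not name a metric; N50 is assumed here as the standard one): the largest length L such that contigs of length at least L together cover half of the total assembly. A minimal sketch:

```python
def n50(contig_lengths):
    """N50: smallest contig length L such that contigs of length >= L
    account for at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for L in lengths:
        running += L
        if running * 2 >= total:
            return L
    return 0
```

For instance, an assembly with contigs of 100, 50, 30, and 20 kb has an N50 of 100 kb; fragmenting the same total into many small contigs drives the N50 down, which is why the statistic separates assemblers so sharply.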