Haiminen, Niina
Parida, Laxmi
Rigoutsos, Isidore
Submitted to: PLoS One
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: August 1, 2011
Publication Date: September 7, 2011
Repository URL: https://doi.org/10.1371/journal.pone.0024182
Citation: Haiminen, N., Kuhn, D.N., Parida, L., Rigoutsos, I. 2011. Evaluation of Methods for de novo Genome assembly from High-throughput Sequencing Reads Reveals Dependencies that Affect the Quality of the Results. PLoS One. 6(9): e24182.
Interpretive Summary: Recent developments in high-throughput sequencing technology have made low-cost sequencing an attractive approach for many genome analysis tasks. For plants without a complete genome sequence, however, it is difficult to determine what kind of sequence data, and how much of it, must be collected to assemble a genome correctly. We created synthetic datasets of short reads (~100 nt) from already sequenced genomes of different sizes and with different amounts of repetitive sequence, and used them to test publicly available assembly programs. Our benchmarks can be used to roughly estimate the amount and type of sequencing coverage necessary to assemble a genome and, hence, the cost of a genome sequencing project.
Technical Abstract: Recent developments in high-throughput sequencing technology have made low-cost sequencing an attractive approach for many genome analysis tasks. Increasing read lengths, improving quality, and the production of ever larger numbers of usable sequences per instrument run continue to make whole-genome assembly an appealing target application. In this paper we evaluate the feasibility of de novo genome assembly from short reads (≤ 100 nucleotides) through a detailed study involving genomic sequences of various lengths in conjunction with several of the currently available assembly programs. Our analysis indicates that the choice of assembler can have a significant effect on the quality of assembly results. Our empirical computational analysis shows that one can, in principle, determine which sequencing coverage will provide the best assembly in terms of size and correctness, provided that the attributes of the target genome, the assembly program, the expected read length, and the error rate are known.
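As a rough illustration of how sequencing coverage relates to the amount of data collected, the sketch below uses the standard relation coverage = (number of reads × read length) / genome size. It is not the benchmarking procedure from the paper; the function names and the example genome size, read length, and target coverage are hypothetical values chosen only for illustration.

```python
import math


def expected_coverage(num_reads: int, read_length_nt: int, genome_size_nt: int) -> float:
    """Average per-base coverage: total sequenced bases divided by genome size."""
    if num_reads < 0 or read_length_nt <= 0 or genome_size_nt <= 0:
        raise ValueError("read length and genome size must be positive")
    return num_reads * read_length_nt / genome_size_nt


def reads_for_coverage(genome_size_nt: int, read_length_nt: int, coverage: float) -> int:
    """Smallest number of reads whose total length reaches the target coverage."""
    if genome_size_nt <= 0 or read_length_nt <= 0 or coverage <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(coverage * genome_size_nt / read_length_nt)


if __name__ == "__main__":
    # Hypothetical example: a 1 Mb genome, 100 nt reads, 30x target coverage.
    n = reads_for_coverage(genome_size_nt=1_000_000, read_length_nt=100, coverage=30)
    print(n, expected_coverage(n, 100, 1_000_000))
```

Multiplying the resulting read count by a per-read price would give a first-order cost estimate. The coverage that actually yields the best assembly depends on the genome's repeat content, the assembler, and the error rate, as the abstract notes.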