|Chester, Nikki - Non ARS Employee|
|Hatfield, Jason - Non ARS Employee|
Submitted to: Meeting Abstract
Publication Type: Abstract Only
Publication Acceptance Date: 7/15/2013
Publication Date: 9/5/2013
Citation: Suarez, D.L., Chester, N., Hatfield, J. 2013. Sequencing artifacts in the type A influenza database and attempts to correct them [abstract]. In: Proceedings of Options for Control of Influenza VIII, September 5-10, 2013, Cape Town, South Africa. p. 96.
Technical Abstract: Type A influenza virus causes a wide range of disease in both man and animals, and considerable efforts have been made to study and sequence large numbers of isolates. Currently there are over 236,000 gene sequences representing over 50,000 strains in publicly available databases. However, the quality of the sequences submitted to these public databases is determined by the contributor, and many sequence errors are present in the databases. These errors can affect sequence analysis and require significant curation by individual researchers to obtain usable data. As part of a high school class project, bioinformatics analysis was performed on all six internal gene segments of influenza A virus. Sequences longer than the generally accepted lengths of the individual segments were selected from the Influenza Research Database website, with the hypothesis that these sequences would contain an error. Because of the greater variability in sequence length of the hemagglutinin and neuraminidase genes, only the six internal genes were examined. Multiple sequence alignments were performed for each segment, and viral sequences were divided into those with additional sequence upstream or downstream of the conserved non-coding ends of the gene segment and those with additional sequence internally. Specific attention was placed on sequences with additional nucleotides upstream or downstream of the highly conserved non-coding ends of the viral segments. The virus sequences were evaluated to try to determine the probable type of error associated with each sequence. A total of 1081 sequences met these criteria when the database was queried in March 2012, which represents a 0.82% rate of potential errors.
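The screening step described above can be sketched roughly as follows. This is a minimal illustration, not the procedure the authors used (they worked from multiple sequence alignments). The segment lengths in `EXPECTED_LENGTH` are assumed values close to those of A/Puerto Rico/8/1934, the terminal motifs are the commonly cited conserved ends written in positive-sense cDNA orientation, and the function name `screen_sequence` is hypothetical.

```python
import re

# Assumed full-length segment sizes in nucleotides (illustrative values only;
# the abstract does not state the cut-offs that were actually used).
EXPECTED_LENGTH = {"PB2": 2341, "PB1": 2341, "PA": 2233,
                   "NP": 1565, "M": 1027, "NS": 890}

# Conserved segment termini in positive-sense cDNA orientation (assumption).
START = re.compile(r"AGC[AG]AAAGCAGG")
END = re.compile(r"CCTTGTTTCTACT")


def screen_sequence(segment, seq):
    """Return None for a normal-length sequence, else a dict describing
    any extra nucleotides outside the conserved termini."""
    seq = seq.upper()
    if len(seq) <= EXPECTED_LENGTH[segment]:
        return None  # not longer than expected, so not flagged

    first = START.search(seq)          # first occurrence of the 5' motif
    last = None
    for last in END.finditer(seq):     # keep the last occurrence of the 3' motif
        pass

    upstream = seq[:first.start()] if first else ""
    downstream = seq[last.end():] if last else ""
    if upstream or downstream:
        category = "terminal_extra"
    else:
        # Over-length but nothing found outside the termini: the extra
        # nucleotides are internal, or a terminus could not be located.
        category = "internal_or_unresolved"
    return {"category": category, "upstream": upstream, "downstream": downstream}
```

Exact motif matching is a simplification: a sequence with a mutated or truncated terminus falls into the unresolved category here, whereas an alignment would still place it.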
Three types of errors were commonly observed: non-influenza primer sequence was not removed from the sequence; Taq polymerase added an adenine at the end of the PCR product; and the PCR product was cloned and plasmid sequence was included in the sequence. Internal insertions of nucleotides were also commonly observed, but in many cases it was unclear whether the sequence was correct or actually contained an error, and therefore these sequences were not evaluated further. Students contacted some of the sequence submitters to alert them that a sequence had been identified as likely having an error and, if known, what type of error it probably was. Submitters were asked to review the sequence and, if they agreed that it was incorrect, to revise it in the public databases. A total of 215 sequences, or 22.8% of the suspect sequences, were corrected in the public databases in the past year, in part because of the student project. Examination of the sequence database in 2013 showed that an additional 138 sequences with possible errors were added to GenBank in the last year. A new class of students once again contacted submitters and requested that they review and potentially correct the sequences, with at least partial success in getting sequences corrected. The integrity of the public sequence databases is largely dependent on the scientists who submit the sequence information. Some review by GenBank and other databases provides some identification of errors, primarily by removing plasmid sequence from submissions, but because of the complexity and variability of the sequences submitted, the error identification programs are limited. The identification of 138 new problematic sequences in the last year highlights that the error detection system is not working as well as it should.
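As a rough illustration of how the three terminal error types might be told apart from the extra nucleotides alone, the sketch below applies a simple heuristic. It is not the authors' method. The function name `guess_error_type` and the 40-nucleotide cut-off between a primer tail and plasmid sequence are assumptions chosen only for the example; a real check would compare the extra sequence against known primer and vector sequences.

```python
# Assumed, arbitrary cut-off: extra sequence longer than this is treated as
# more likely plasmid than primer. Not taken from the abstract.
MAX_PRIMER_TAIL = 40


def guess_error_type(upstream, downstream):
    """Suggest a probable error type from the nucleotides found outside
    the conserved segment termini (both arguments are plain strings)."""
    upstream = upstream.upper()
    downstream = downstream.upper()

    if not upstream and not downstream:
        return "no terminal extra sequence"
    # A single extra adenine after the 3' terminus is consistent with the
    # non-templated A that Taq polymerase adds to PCR products.
    if not upstream and downstream == "A":
        return "possible Taq-added adenine"
    if max(len(upstream), len(downstream)) <= MAX_PRIMER_TAIL:
        return "possible unremoved primer sequence"
    return "possible plasmid sequence"
```

A label produced this way is only a prompt for the submitter to review the record, which matches how the students used their findings.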