
Title: Sequencing artifacts in the type A influenza database and attempts to correct them

Authors: Marbut, Lauren; Suarez, David; Chester, Nikki (Athens Academy)

Submitted to: International Symposium on Avian Influenza
Publication Type: Abstract Only
Publication Acceptance Date: 4/1/2015
Publication Date: 4/12/2015
Citation: Marbut, L.A., Suarez, D.L., Chester, N. 2015. Sequencing artifacts in the type A influenza database and attempts to correct them [abstract]. 9th International Symposium on Avian Influenza. p. 92.

Technical Abstract: Currently, over 300,000 type A influenza gene sequences representing over 50,000 strains are available in public databases. However, the quality of submitted sequences is determined by the contributor, and the databases contain many sequence errors that can affect the results of sequence analysis. As part of a high school class project, bioinformatics analysis was performed on all eight gene segments of influenza A virus to identify these errors and try to have them corrected. Using the Influenza Research Database website, suspect sequences were identified that had non-influenza sequence added to the end of the submitted sequence. Three types of errors were commonly identified: non-influenza primer sequence was not removed from the sequence; Taq polymerase added an adenine at the end of the PCR product; and the PCR product was cloned and plasmid sequence was included in the sequence. Internal insertions of nucleotides were also commonly observed, but in many cases it was unclear whether the sequence was correct. A total of 2094 suspect sequences were identified during the 3 years of the project, which represents a 0.73% rate of potential errors. Students contacted some of the sequence submitters to alert them that a sequence had been identified as likely containing an error and, if known, what type of error it probably was, and asked the submitter to correct the problem. A total of 283 sequences, or 13.5% of the suspect sequences, were corrected in the public databases, in part because of the student project.
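
The abstract does not describe how the terminal errors were detected beyond the use of the Influenza Research Database website, so the following is only an illustrative sketch of one way such a screen could work, not the project's actual procedure. It assumes the widely reported conserved termini of influenza A segments in cDNA orientation (AGCAAAAGCAGG or AGCGAAAGCAGG at the start, CCTTGTTTCTACT at the end) and reports any nucleotides that fall outside them. The function name `flag_terminal_extras` and the report layout are hypothetical.

```python
import re

# Assumed conserved terminal motifs (cDNA / mRNA-sense orientation).
# Position 4 of the start motif is A or G depending on the segment.
START_MOTIF = re.compile(r"AGC[AG]AAAGCAGG")
END_MOTIF = "CCTTGTTTCTACT"


def flag_terminal_extras(seq):
    """Report nucleotides lying outside the assumed conserved termini.

    Heuristic only: uses the first start-motif match and the last
    end-motif match, so a partial sequence with an internal match
    could be flagged incorrectly.
    """
    seq = seq.upper()
    report = {"leading": "", "trailing": "", "notes": []}

    start = START_MOTIF.search(seq)
    if start is None:
        report["notes"].append("start motif not found; sequence may be partial")
    elif start.start() > 0:
        report["leading"] = seq[:start.start()]
        report["notes"].append("extra sequence before start terminus")

    end = seq.rfind(END_MOTIF)
    if end == -1:
        report["notes"].append("end motif not found; sequence may be partial")
    else:
        trailing = seq[end + len(END_MOTIF):]
        report["trailing"] = trailing
        if trailing == "A":
            # Matches the Taq polymerase artifact named in the abstract.
            report["notes"].append("single trailing A (possible Taq addition)")
        elif trailing:
            report["notes"].append(
                "extra sequence after end terminus (possible primer or plasmid)"
            )
    return report


if __name__ == "__main__":
    # Made-up toy sequence, not a real database record.
    toy = "AGCAAAAGCAGG" + "ATGGAGAGAATAAAAGAACTAAG" + "CCTTGTTTCTACT" + "A"
    print(flag_terminal_extras(toy))
```

A screen like this can only flag candidates. Distinguishing leftover primer from plasmid sequence, or judging internal insertions, would still need manual review, as the abstract notes for the ambiguous cases.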