Turkett Jr., William
Submitted to: Bioinformatics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 11/21/2004
Publication Date: 4/1/2005
Citation: Rose, J.R., Turkett Jr., W.H., Oroian, I.C., Laegreid, W.W., Keele, J.W. 2005. Correlation of amino acid preference and mammalian viral genome type. Bioinformatics 21(8):1349-1357.
Interpretive Summary: Molecular diagnostic methods that could be employed in an outbreak of animal or human viral disease include sequencing viral nucleic acid directly from samples taken from infected individuals. Current methods for analyzing such sequences rely on comparison to known viral sequences stored in a database. However, if the infection is caused by a novel virus, or if the virus is highly variable, such approaches may not resolve the identity of the virus. This study examined higher-level information that can be derived from viral sequences in order to classify them by virus genome type (double-stranded DNA, positive-stranded RNA, retroviral, etc.). Mathematical models were derived that accurately predict the viral genome type from as few as 300 bases of viral genome sequence, with more accurate classification from 600 bases of sequence. This approach may be useful in the event of an outbreak of disease caused by an unknown virus.
Technical Abstract: Motivation: In an outbreak of a disease caused by an initially unknown pathogen, the ability to characterize anonymous sequences before the pathogen is isolated and cultured will be helpful. We show that viral sequences can be classified by genome type (dsDNA, ssDNA, ssRNA positive strand, ssRNA negative strand, retroid) using amino acid distribution. Results: We describe an analysis of amino acid preference in mammalian viruses. The study was carried out at the genome level and at two shorter sequence levels: short (300 amino acids) and medium length (660 amino acids). The analysis indicates a correlation between amino acid preference and the viral genome types dsDNA, ssDNA, ssRNA positive strand, ssRNA negative strand and retroid. We investigated three models of amino acid preference. The simplest, 1-AAP, is a normalized description of the frequency of individual amino acids in genomes of a given viral genome type. A slightly more complex model, the ordered pair amino acid preference model (2-AAP), characterizes genomes of different viral genome types by the frequency of ordered pairs of amino acids. The most complex and most accurate model is the ordered triple amino acid preference model (3-AAP), which is based on ordered triples of amino acids. The results demonstrate that mammalian viral genome types differ in their amino acid preference.
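The three preference models can be illustrated with a short Python sketch. It is not the authors' implementation, and the abstract does not specify how the tuples are counted. The function name `aap_profile` is hypothetical, and the sketch assumes that ordered pairs and triples are counted over overlapping windows of adjacent residues and normalized by the total number of windows.

```python
from collections import Counter


def aap_profile(protein: str, k: int = 1) -> dict[str, float]:
    """Return normalized frequencies of ordered k-tuples of amino acids.

    k=1, 2, 3 loosely correspond to the 1-AAP, 2-AAP and 3-AAP models.
    Assumption (not stated in the abstract): tuples are adjacent residues
    taken from overlapping windows along the sequence.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    protein = protein.upper()
    # Count every window of k consecutive residues.
    counts = Counter(protein[i:i + k] for i in range(len(protein) - k + 1))
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than k: no tuples to count.
        return {}
    # Normalize so that the frequencies sum to 1.
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    toy = "MKKM"  # toy sequence for illustration only, not real viral data
    print(aap_profile(toy, 1))
    print(aap_profile(toy, 2))
```

A profile of this kind, computed for sequences of each genome type, could then be compared against the profile of an unknown sequence. How that comparison and the classification were actually performed is not described in the abstract.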