Submitted to: Bioinformatics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 5/9/2007
Publication Date: 5/9/2007
Citation: Feng, J., Naiman, D., Cooper, B. 2007. Probability-Based Pattern Recognition and Statistical Framework for Randomization: Modeling Tandem Mass Spectrum/Peptide Sequence False Match Frequencies. Bioinformatics. 23:2210-2217.
Interpretive Summary: Mass spectrometry is a technology that is used to find the weight, or mass, of single molecules. Accurate masses can be used to determine the number and make-up of atoms in a molecule. Consequently, it is possible to identify molecules based on mass alone. Thus, mass spectrometry has wide applications in the identification of molecules, including proteins. Before proteins are analyzed, they are broken into smaller peptide pieces using enzymes or through molecular collisions with gas atoms. The fragments are analyzed by the mass spectrometer and the peptide masses identified. Commercially available software is used to convert the mass information into peptide identifications. When working with large datasets, there is some random chance that the software makes mistakes when making peptide identifications. Such mistakes lead to misidentification of proteins. By creating decoy or nonsense databases and searching them, the error rate can be modeled and predicted. It is shown here how to create a decoy database that is more suitable for modeling than a prevailing method. The results show that the best decoy is one that closely resembles the target. This model will enable government, academic and private researchers to more confidently identify bacterial, human, animal, plant and other proteins by mass spectrometry.
Technical Abstract: Estimating and controlling the frequency of false matches between a peptide tandem mass spectrum and candidate peptide sequences is an issue pervading proteomics research. To solve this problem, we designed an unsupervised pattern recognition algorithm for detecting patterns of various lengths in large sequence datasets. The patterns found by this algorithm in a protein sequence database were used to create decoy databases using a Monte Carlo sampling algorithm. Searching these decoy databases led to the prediction of false positive rates for spectrum/peptide sequence matches. This method, independent of instrumentation, database-search software and samples, is shown to provide better estimation of false positive identification rates than a prevailing reverse database searching method. The pattern detection algorithm can also be used to analyze large sequence datasets for other biological studies and is likely to be relevant for applications such as cryptology and information compression.
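The general decoy-search idea described in the abstract can be illustrated with a minimal Python sketch. It is not the pattern-based Monte Carlo method of the paper, whose details are not given here. It shows only the prevailing reverse-database baseline and a simple decoy-based estimate of the false match rate. The function names, the score lists and the assumption that the target and decoy databases are the same size are all hypothetical, chosen for illustration.

```python
def reverse_decoy(proteins):
    """Build a reverse decoy database by reversing each protein sequence.

    This is the baseline reverse-database approach, not the pattern-based
    decoy construction described in the abstract.
    """
    return [seq[::-1] for seq in proteins]


def estimate_false_match_rate(target_scores, decoy_scores, threshold):
    """Estimate the fraction of target matches at or above `threshold`
    that are expected to be false.

    Assumption (hypothetical): the decoy database is the same size as the
    target database, so the number of decoy hits approximates the number
    of false target hits at the same score threshold.
    """
    target_hits = sum(score >= threshold for score in target_scores)
    decoy_hits = sum(score >= threshold for score in decoy_scores)
    if target_hits == 0:
        return 0.0  # no accepted matches, so nothing to be false
    return min(1.0, decoy_hits / target_hits)


if __name__ == "__main__":
    # Made-up sequences and scores, for illustration only.
    print(reverse_decoy(["PEPTIDE", "MKVLA"]))
    print(estimate_false_match_rate([5.0, 4.0, 3.0, 2.0], [3.0, 1.0], 3.0))
```

Under these assumptions, a decoy that resembles the target more closely should yield decoy score distributions closer to those of true false matches, which is the property the abstract reports for its pattern-based decoys.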