Publication : USDA ARS

Publication #221999

Title: Combined Dynamic Arrays for Storing and Searching Semi-Ordered Tandem Mass Spectrometry Data

Author

	FENG, JIAN - JOHNS HOPKINS
	NAIMAN, DANIEL - JOHNS HOPKINS
	Cooper, Bret

Submitted to: Journal of Computational Biology
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 4/4/2008
Publication Date: 5/1/2008
Citation: Feng, J., Naiman, D., Cooper, B. 2008. Combined Dynamic Arrays for Storing and Searching Semi-Ordered Tandem Mass Spectrometry Data. Journal of Computational Biology. 15:457-468.

Interpretive Summary: Mass spectrometry is a technology used to measure the weight, or mass, of individual molecules. An accurate mass can be used to determine the number and make-up of atoms in a molecule, so it is possible to identify molecules based on mass alone. Thus, mass spectrometry is widely applied to the identification of molecules, including proteins. Before proteins are analyzed, they are broken into smaller peptide pieces using enzymes or through molecular collisions with gas atoms. The fragments are analyzed by the mass spectrometer, and the peptide masses are determined. Commercially available software converts the mass information into peptide identifications. However, the peptides must also be reassembled into proteins if the proteins are to be understood. Many incorrect combinations of peptides can be made, but only one combination truly represents the identified protein or proteins. Assembling peptides into proteins is akin to assembling a jigsaw puzzle that has only one real solution. Since peptide jigsaw puzzles cannot be assembled by hand, computers are used to model the possible combinations. The computer algorithm described here organizes peptide and protein information efficiently so that the computations for assembling the puzzle run quickly. The algorithm will enable government, academic and private researchers to implement software packages that efficiently organize and sort large mass spectrometry data sets or other large data sets, such as public record files.

Technical Abstract: Bioinformatics analysis of tandem mass spectrometry data requires efficient storage and sorting of semi-ordered data sets. To solve this problem, a new data structure based on dynamic arrays was designed and implemented in an algorithm that parses semi-ordered data produced by Mascot, a separate software program that matches peptide tandem mass spectra to protein sequences in a database. By accommodating the special features of these large data sets, the Combined Dynamic Array provides efficient searching and insertion operations. On real data sets, operations using this new data structure are hundreds of times faster than operations using binary tree and red-black tree structures. The difference becomes more significant as the data set size grows. This data structure may be useful for improving the speed of other related types of protein assembly software, or of other software that operates on data sets with similar semi-ordered features.
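
The abstract does not describe the internal layout of the Combined Dynamic Array, so the following is not the authors' structure. It is only a minimal Python sketch of the general idea that a sorted dynamic array can exploit semi-ordered input: keys that arrive in ascending order are appended cheaply, and only out-of-order keys pay for a binary search and a shifting insert. The class name `SemiOrderedArray`, its methods, and the sample keys are all hypothetical.

```python
from bisect import bisect_left


class SemiOrderedArray:
    """Illustrative sorted key/value store backed by two parallel lists.

    Hypothetical sketch only; this is NOT the Combined Dynamic Array
    from the paper, whose internals the abstract does not give.
    """

    def __init__(self):
        self._keys = []    # kept in ascending order at all times
        self._values = []  # _values[i] belongs to _keys[i]

    def insert(self, key, value):
        # Fast path: semi-ordered input mostly arrives in ascending order,
        # so a new largest key is appended in amortized constant time.
        if not self._keys or key > self._keys[-1]:
            self._keys.append(key)
            self._values.append(value)
            return
        # Slow path: binary search for the position of an out-of-order key.
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value       # key already present: overwrite
        else:
            self._keys.insert(i, key)     # shifts later elements right
            self._values.insert(i, value)

    def find(self, key, default=None):
        # Binary search: O(log n) comparisons on the sorted key list.
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default

    def __len__(self):
        return len(self._keys)


if __name__ == "__main__":
    store = SemiOrderedArray()
    # Mostly ascending keys with one out-of-order key and one repeat.
    for k in [1, 2, 3, 7, 5, 8, 2]:
        store.insert(k, k * 10)
    print(len(store), store.find(5), store.find(4))
```

In this sketch the cost of an insert depends on how far the input departs from sorted order, which is the kind of property a structure designed for semi-ordered data would be expected to exploit. A balanced tree, by contrast, pays a logarithmic cost on every insert regardless of input order.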