Biocomputational Tools for Analysis of Complex Agricultural Genomes
Genomics and Bioinformatics Research Unit
2012 Annual Report
1a. Objectives (from AD-416):
Advances in biotechnology have led to tremendous increases in biomolecular data. For example, over the last thirty years the number of nucleotides in GenBank, an online DNA/protein sequence repository, has doubled approximately every 18 months. Analysis and utilization of these exponentially increasing quantities of biomolecular data have required a closer association of biology with high-performance computing. The single-processor bioinformatics tools written in the last few years are already proving inadequate for deriving biological information from large data sets in a timely fashion. Moreover, such huge volumes of data have created a need for more powerful visualization tools that can translate digital data into intuitive graphical formats. We will generate new data analysis and visualization tools specifically designed for use on cluster supercomputers. Parallelized programs provide the built-in scalability required by the rapidly growing computational biology community.
1b. Approach (from AD-416):
We will develop high-throughput analysis pipelines for rapidly and accurately integrating genomic, transcriptomic, proteomic, metabolomic, and phenotypic data for species of importance to U.S. agriculture. Research will focus on expediting the association of genotype with phenotype while defining the biomolecular interactions that link the two. Unlike most existing bioinformatics tools, our algorithms and pipelines will employ parallel processing and other high-performance computing (HPC) principles from their inception, thus permitting computing resources to be scaled to meet the storage and memory needs of a wide array of projects. In addition to de novo tool development, we will work to upgrade existing tools using HPC concepts. An important component of our work will be the development of effective ways to visualize complex relationships among diverse data sets. To make our analyzed data as accessible and understandable as possible, we will use gene ontology (GO) techniques to annotate and “cross-link” molecular data.
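As a rough illustration of what designing for parallel processing "from inception" can mean, the following minimal Python sketch splits a set of sequence reads into independent chunks, tallies nucleotides per chunk, and merges the partial results. It is not taken from the project's pipelines: the function names (chunked, count_bases, parallel_base_counts, gc_fraction) and the sample reads are hypothetical, and a standard-library thread pool stands in for cluster workers. In CPython, threads do not speed up CPU-bound work, so a real deployment would presumably pass a process pool or a cluster-aware executor instead.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_bases(reads):
    """Map step: tally characters (bases) for one chunk of reads."""
    counts = Counter()
    for read in reads:
        counts.update(read.upper())
    return counts


def parallel_base_counts(reads, executor, chunk_size=2):
    """Scatter chunks to the executor's workers, then merge partial tallies."""
    total = Counter()
    for partial in executor.map(count_bases, chunked(reads, chunk_size)):
        total.update(partial)  # reduce step: add the per-chunk counts
    return total


def gc_fraction(counts):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    acgt = sum(counts[base] for base in "ACGT")
    return (counts["G"] + counts["C"]) / acgt if acgt else 0.0


# Hypothetical toy input; real pipelines would stream reads from FASTA/FASTQ.
reads = ["ACGT", "GGCC", "ATAT", "ccgg", "NNAT"]
with ThreadPoolExecutor(max_workers=2) as pool:
    totals = parallel_base_counts(reads, pool)
```

Because each chunk is processed independently and only small tallies are merged, the same structure can in principle be scaled by adding workers; the executor is passed in as a parameter so that the worker backend can be swapped without touching the analysis code.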
Current genome assembly programs require more random access memory (RAM) than is simultaneously accessible in supercomputers such as those at Mississippi State University’s High Performance Computing Collaboratory (HPC2), where this project is being conducted. To expedite construction of computational biology pipelines, three high-RAM computer clusters were purchased. Two of these clusters have 0.5 terabytes (TB) of shared RAM, while the third has 0.25 TB. The high-RAM computers were integrated into the HPC2 system, and popular genome assembly and analysis algorithms have been installed on these machines. These algorithms are being tested and compared. While setting up working pipelines using existing tools is the first priority, the overarching goal is to adapt variants of these scripts so that they can take advantage of more typical high-performance computing architectures.