Page Banner

United States Department of Agriculture

Agricultural Research Service

Title: Wolfpac: Building a High-Performance Distributed Computing Network for Phylogenetic Analysis Using "obsolete" Computational Resources

Authors
item Reeves, Patrick
item Friedman, Philip - CSU, DEPT. OF NATURAL SCI
item Richards, Christopher

Submitted to: Applied Bioinformatics
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: May 28, 2004
Publication Date: July 1, 2005
Citation: Reeves, P.A., Friedman, P.H., Richards, C.M. 2005. Wolfpac: Building a high-preformance distributed computing network for phylogenetic analysis using "obsolete" computational resources. Applied Bioinformatics 4:61-64.

Interpretive Summary: Mathematically complex computer-intensive procedures are necessary in order to reconstruct historical relationships among species or populations from genetic data. Rigorous analyses of large data sets may require months or even years of processing time on a single computer. This software allows a researcher to use many computers simultaneously, during times of the day when they would otherwise be idle to dramatically reduce the overall time necessary to complete an analysis. The software can be made available to the research community at large.

Technical Abstract: The reconstruction of phylogenetic trees under the maximum parsimony criterion is an Nondeterministic Polynomal time problem (Foulds and Graham, 1982). For unrooted, bifurcating trees, the number of possible solutions increases as (2n-5)!/{2n-3[(n-3)!]} where n is the number of terminals; thus, a significant computational effort is required to discover optimal solutions. Other commonly used optimality criteria, such as maximum likelihood, share a similar level of computational complexity (Sanderson and Kim, 2000). While heuristic methods have been developed to expedite searches, local optima, and islands of equivalent optima are common occurrences on both the parsimony and likelihood surface (Maddison, 1991; Salter, 2001). Therefore the use of simple hill climbing algorithms alone cannot guarantee that global optima will be discovered (Sanderson and Kim, 2000). In order to increase the probability of finding all globally optimal trees, the use of multiple independent searches, with random starting points for each search, has been suggested (Maddison, 1991; Salter, 2001). For very large data sets (e.g. > 500 terminals), the computer processor time necessary to complete multiple independent searches can be prohibitive. While refinements in search strategy (Salter and Pearl, 1997; Goloboff, 1999; Nixon, 1999), new search algorithms (Dopazo and Carazo, 1997; Lewis, 1998), and parallel computing approaches (Snell et al., 2000; Brauer et al., 2002) have shown promise for decreasing overall search times, the continually increasing size of phylogenetic data sets has resulted in a situation where the computational resources available to a typical researcher may not be adequate to complete rigorous analyses in a timely manner. In order to facilitate rapid and thorough searches of phylogenetic tree space, we have developed a distributed computing environment, wolfPAC, that utilizes the batch processing capability of the phylogenetic analysis software PAUP* (Swofford, 1999) to perform multiple, independent searches on numerous, networked Macintosh computers. The wolfPAC analysis environment is a hierarchically organized network of processors that communicate via AppleScript using the Apple File Protocol (AFP) over an IP connection (Figure 1A). Core elements include a server side directory structure which mediates job acquisition and acts as a repository for result files generated by the client processors. Two AppleScripts are used in conjunction with third-party script scheduling software to trigger a job query from the client, commence a search, and establish search duration. Searches may be scheduled to utilize recurring idle periods (e.g. overnight) in large Macintosh computer labs. In addition to streamlining the initiation and termination of PAUP* runs on numerous computers, wolfPAC offers features which facilitate sharing of computational resources among researchers. First, because it operates over IP, users can queue jobs and recover results from any Macintosh with internet access. Second, jobs may be submitted with two levels of priority. A researcher with a local wolfPAC may permit colleagues to access secondary priority processing time with the knowledge that, should the need arise, the privilege can be usurped by submission of a primary priority job. An optional, third priority level is available to provide anonymous user access when no higher priority jobs have been submitted. This allows a wolfPAC to be shared with the phylogenetics research community at large when it is not in use by its owner. Existing parallel processing systems for phylogenetic analysis have shown decreasing marginal performance as processors are added (Snell, 2000; Brauer et al., 2002). Systems which use a "massively serial" approach should exhibit near-linear scaling properties for replicated procedures (until a fixed minimum time, equal to the mean time necessary to complete a single replicate, is reached). Because all instructions are resident in client RAM, and no data are exchanged between processors during a search, limitations based on network speed and the timing of data exchange operations (I/O) are obviated. Serial approaches have a logistical drawback, however, in that large amounts of output generated from the client processors must be interpreted subsequent to a search. The time required for post processing procedures will typically be negligible compared to that of the original search effort. AppleScripts are provided with wolfPAC to facilitate two common phylogenetic procedures--generation of a maximum parsimony tree, and assembly of a bootstrap tree--using results obtained from numerous processors. An efficient means for processing results for the myriad of other analytical procedures is left to the user. To verify the utility of wolfPAC, we have compared the rate at which shortest trees were recovered from the 500 taxon "Zilla" data set (Rice et al., 1997) with a variety of published searches (Figure 1B). A 20 processor wolfPAC using overnight idle-time and a simple search strategy yielded the shortest known tree for Zilla (length=16218) approximately every 55 hours. The estimated minimum time necessary to recover length=16218 trees using this strategy was 1.4 hours, achievable by adding more processors to the wolfPAC. When using the parsimony ratchet (Nixon, 1999), a more sophisticated search strategy, shortest trees were recovered after 18 minutes, representing a six to 70 fold improvement over published single processor methods. When using recurring idle periods, searches are triggered to commence at a specific clock time. We examined whether the values obtained from the system clock and used by PAUP* to seed replicated searches were non random, resulting in a redundant search effort (i.e. multiple processors search the same tree space). In approximately 5000 independent searches conducted over a two month period, seed values were randomly distributed from zero to the maximum value of 2147483646 used by PAUP*. Redundant searches should therefore be extremely uncommon in the wolfPAC environment. While it is unclear whether any hardware or software improvements will permit timely recovery of globally optimal trees for very large data sets when using computationally slow maximum likelihood algorithms (Sanderson and Kim, 2000), the use of a distributed data processing environment such as wolfPAC should encourage thorough searches of the parsimony tree space, and allow rapid evaluation of statistical support for phylogenies when using resampling methods such as the bootstrap (Felsenstein, 1985). Furthermore, wolfPAC should facilitate parsimony analysis of large, genetic marker-based, intraspecific data sets, where no generally accepted model of evolution exists.

Last Modified: 10/24/2014
Footer Content Back to Top of Page