Challenges in Executing Large Parameter Sweep Studies across Widely Distributed Computing Environments

Edward Walker
Texas Advanced Computing Center
The University of Texas at Austin
Austin, Texas, USA
1-512-232-6579
[email protected]

Chona Guiang
Texas Advanced Computing Center
The University of Texas at Austin
Austin, Texas, USA
1-512-471-7880
[email protected]

Due to the increasing size of some of these parameter sweep studies, network-connected distributed HPC (High Performance Computing) environments like the NSF TeraGrid [1] are increasingly used to aggregate resources from multiple sites to perform these studies. However, several difficulties limit how productively the scientific researcher can use these distributed environments.

ABSTRACT
Large parameter sweep studies are common in a broad range of scientific disciplines. However, many challenges exist in supporting this type of computation in a widely distributed computing environment. These challenges exist because contributing sites in a federated distributed computing environment usually expose only a very narrow resource-sharing interface. This paper looks at the challenges encountered by parameter sweep studies using two concrete application examples. The paper also shows how a system for building personal clusters on demand has been developed to solve many of these problems.

First, many sites only export a resource interface suitable for single job submission. For example, the Globus GRAM [2] interface, used by many TeraGrid sites, allows a user to submit jobs to an HPC cluster through a daemon or web service running on a gateway node. In many cases, the gateway node is the shared front end of the HPC cluster, where multiple concurrent users log in to port, compile, and analyze results. Submitting many hundreds or thousands of jobs in a parameter study through this service is therefore a problem, not only because it impacts other users of the system on the head node, but also because the single gateway node becomes a scalability bottleneck for the computational study.

Categories and Subject Descriptors
J.0 [General]: Computer Applications – General.

General Terms
Algorithms, Design.

Second, some of these parameter sweep studies need to run for many days or even weeks. Currently, the interfaces exported by the resource-contributing sites provide no support for tolerating transient faults in the wide-area network (WAN) connection or the periodic reboots that may occur at the gateway node. Such support is important if very large-scale, long-running parameter sweep studies are to be reliably farmed out to resources across widely distributed HPC clusters.

Keywords
Resource management, workflow, cluster computing.

1. INTRODUCTION
Parameter sweep studies are a frequent occurrence in scientific computation across a broad range of disciplines. This computation modality is often used to examine the properties of some physical model over a parameter space in search of solutions satisfying some optimality criteria. Hence, with this computation modality, multiple instances of the model calculation need to be created, dispatched, and managed by the scientist.

Third, no distributed file system is provided through the exported interface of the resource-contributing sites. Instead, a simple GridFTP [3] file transfer/copy interface is often exported. To augment this interface, different file transfer gateway nodes and staging directories are used at each HPC site, requiring jobs to be aware of these differences in order to stage files correctly into and out of the HPC clusters. In parameter sweep studies where hundreds or thousands of files need to be moved, this simple approach can be tedious, complicated, and error-prone.

A central challenge in large parameter sweep studies is the management of these multiple model calculations, submitted as job or workflow ensembles. Managing these ensembles is difficult because they may be composed of hundreds or thousands of jobs or workflows orchestrating the calculation. Also, some of these ensembles require extended periods to execute, on the order of days or weeks.

This paper studies two concrete applications that are currently performing parameter sweep studies across distributed HPC environments. The first use case documents a team of material scientists who are populating a public scientific database of hypothetical zeolite crystalline structures [4]. These zeolite crystalline structures are a class of porous materials widely used as catalysts and ion-exchangers in many important applications and processes in science and industry. The goal of the material science researchers was to run computer simulations to identify zeolite structures that potentially have real counterparts in nature. These

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CLADE’07, June 25, 2007, Monterey, California, USA. Copyright 2007 ACM 978-1-59593-714-8/07/0006…$5.00.


cell sizes (a, b, c, α, β, γ ) to explore the range of hypothetical zeolite crystalline structures. To perform a thorough search for potential structures, the search was further refined by examining the discrete jumps in the lengths and angles defining a unit cell. For each unit cell size the density of tetrahedral atoms and the number of crystallographically unique tetrahedral atoms were varied to further extend the search through crystallographic space.

structures could then be added to a public database of zeolites, enabling the collective community of material scientists to synthesize and experiment with novel materials, applications and processes. The second use case documents a team of chemists who are developing a novel method for calculating the absolute binding energies of ligands to proteins. Accurate and efficient estimates of relative and absolute binding energies are necessary for the formulation of physically meaningful scoring functions in ligand-to-receptor site docking. Molecular docking, which seeks to identify the structures formed between proteins and potential drug candidates, is an important component in structure-based computational drug design. The computational work described here involves comprehensive sampling of the ligand-protein complex in an aqueous environment through molecular dynamics (MD) simulations, with the solvent molecules treated explicitly.

2.2 Absolute Binding Energy Calculations Structure-based drug design seeks to find drug candidates that have a strong affinity to the receptor sites of target proteins, because such compounds are likely to enhance or significantly decrease the enzymatic activity of concern. Absolute binding energies are therefore essential in quantifying the physical interactions between a receptor site and ligand. Calculation of binding energies, however, is challenging because of the computationally prohibitive cost associated with ensuring adequate sampling of physical conformations that make important contributions to the Gibbs free energy.

The application use cases highlight the computational challenges associated with parameter sweep studies, such as data management, fault tolerance for long-running simulations, and maximizing throughput by aggregating distributed HPC resources. Section 2 describes the application use cases in more detail. Section 3 enumerates the computational challenges faced by the researchers, while Section 4 describes the solutions developed to address these problems. We close in Section 5 with a summary of accomplishments, conclusions and plans for future work.

Various techniques have been proposed for calculating the binding energies of ligand-protein complexes, such as free energy perturbation, thermodynamic integration and the slow-growth method [6]. Although these methods calculate the entropic and enthalpic contributions to free energy in a manner that is exact in principle, their actual implementation requires large amounts of computational power to study biologically important ligand complexes.

2. APPLICATION USE CASES

In this work, the free energy change was calculated for the binding of benzamidine to trypsin in solution. The structure of the benzamidine-trypsin complex is shown in Figure 1. Sampling of the state space was performed by running molecular dynamics (MD) simulations and using the double annihilation method [7], whereby the dynamics of two processes were simulated: the “disappearance” of the ligand in water and the disappearance of the ligand in the solvated ligand-protein complex. Extinction of the ligand is implemented by gradually turning off ligand-solvent and ligand-protein interactions.
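To make the bookkeeping of the double annihilation method concrete, the binding free energy can be assembled from the two simulated annihilation legs via a thermodynamic cycle. The following is a sketch of that relation under the usual sign conventions; standard-state and restraint corrections, which are not discussed here, are omitted.

```latex
% Thermodynamic cycle for double annihilation (sketch only; sign
% conventions and standard-state corrections are omitted):
\begin{equation}
  \Delta G_{\mathrm{bind}} \;\approx\;
      \Delta G_{\mathrm{annihilate}}^{\,\mathrm{ligand\ in\ water}}
    \;-\;
      \Delta G_{\mathrm{annihilate}}^{\,\mathrm{ligand\ in\ complex}}
\end{equation}
```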

In this section we look at the scientific goals of the two use cases explored in this paper. The first is the hypothetical crystal structure search project, and the second is the determination of absolute binding energies for ligand to protein binding.

2.1 Hypothetical Crystal Structure Search
Zeolites are crystalline microporous materials that have found a wide range of uses in industrial applications. They are used as catalysts, molecular sieves, and ion-exchangers, and are expected to be of importance in a wide range of nanoscale applications. A typical example is ZSM-5, used as a cracking co-catalyst in the refinement of crude oil. Classical zeolites are aluminosilicates. The basic building block is a TO4 tetrahedron. Usually T = Si, although substitution of the silicon with aluminum, phosphorus, or other metals is common. The tetrahedral species is commonly denoted by T when one is concerned with structural, rather than chemical, properties of the zeolite. From these simple tetrahedral building blocks, a wide range of porous topologies can be constructed. Roughly 180 framework structures have been reported to date [5]. There is therefore a tremendous demand for new zeolite structures with novel properties. The main goal of the hypothetical zeolite project is to generate topologies that are chemically feasible and predicted to be of industrial importance, enabling material scientists to design targeted materials with these new properties.

Figure 1. Benzamidine (shown with space-filled atoms) bound to trypsin, water molecules omitted from figure. This image was generated using Protein Workshop, from the PDB file courtesy of the RCSB Protein Data Bank.

To discover new chemically feasible structures, the project used simulation methods to explore the space of all possible zeolite structures. The team was intent on exploring a range of possible unit


computation. Prior solutions developed at TACC involved writing an MPI wrapper, where each MPI task makes a call to system() to spawn a process that performs the actual serial computation. This was unsatisfactory since the individual serial tasks could not be queried, controlled, or resubmitted in case of error. Furthermore, such a solution is not completely portable, since not all MPI implementations support calls to fork() or system() inside an MPI task. A different solution was required to enable the massively parallel but independent jobs to be equally favored by the local job schedulers at each site.
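For illustration, a minimal sketch of this kind of MPI wrapper is shown below. It reads one command line per rank from a task file (the file name and format are hypothetical) and hands each command to system(), which is exactly the call whose behavior inside an MPI task is not guaranteed to be portable.

```c
/* mpi_wrapper.c -- sketch of the MPI wrapper approach described above.
 * Each MPI task runs one serial command taken from a task file
 * (one command per line; file name and format are illustrative). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    char cmd[4096] = "";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    FILE *fp = fopen(argv[1], "r");          /* e.g. tasks.txt */
    if (fp == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* Read up to and including the line belonging to this rank. */
    int line = -1;
    while (line < rank && fgets(cmd, sizeof(cmd), fp) != NULL)
        line++;
    fclose(fp);

    if (line == rank) {
        cmd[strcspn(cmd, "\n")] = '\0';
        /* The serial simulation runs as a child process; this call to
         * system() is the part not supported by every MPI implementation,
         * and the child cannot be individually queried or resubmitted. */
        int rc = system(cmd);
        printf("rank %d: \"%s\" exited with status %d\n", rank, cmd, rc);
    }

    MPI_Barrier(MPI_COMM_WORLD);             /* finish only when all tasks do */
    MPI_Finalize();
    return 0;
}
```

Launched as, say, mpirun -np 64 ./mpi_wrapper tasks.txt, the batch scheduler sees a single 64-way parallel job even though every rank runs an independent serial simulation, which is precisely why the individual tasks are invisible to it.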

In practice, this involves carrying out MD simulations at different values of a coupling parameter λ, which quantifies the strength of ligand interactions with other species. The software package Amber [8] will be employed for the MD calculations, using the Amoeba force field [9] to describe the energetics of all intra- and intermolecular interactions, as well as interactions with the solvent molecules which are treated explicitly in this work. MD simulations will be performed at constant temperature using the weak coupling method [10], up to a maximum simulation time of 2.5 ns with some initial time devoted to equilibration. Estimates of the free energy difference between adjacent simulations were calculated using the Bennett acceptance ratio [11].
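As a point of reference, the free energy difference between two adjacent λ windows is commonly written with Bennett's estimator in a form like the following. This is a sketch of the standard expression, not a statement of the exact equations used in this study.

```latex
% Bennett acceptance ratio between adjacent windows i and i+1 (sketch).
% U_i and U_{i+1} are the potential energies of the two windows evaluated
% on the same configuration, f(x) = 1/(1 + e^x) is the Fermi function, and
% the constant C is chosen self-consistently; at the self-consistent C the
% two averages are equal and \Delta F = C.
\begin{equation}
  \Delta F_{i \to i+1} \;=\;
     k_B T \,\ln
     \frac{\bigl\langle f\bigl(\beta\,(U_i - U_{i+1} + C)\bigr)\bigr\rangle_{i+1}}
          {\bigl\langle f\bigl(\beta\,(U_{i+1} - U_i - C)\bigr)\bigr\rangle_{i}}
     \;+\; C,
  \qquad \beta = \frac{1}{k_B T}.
\end{equation}
```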

Second, some sites limit the number of jobs a user can concurrently submit to the local queue. For example, at TACC this limit is 40 jobs per user. Therefore, enabling as many simulations as possible to run per job submission was critical to ensure the scientific study could complete in a timely manner.

3. IDENTIFIED CHALLENGES
This section articulates the challenges identified by the application use cases.

Challenge 3: Simulation jobs had wide variability in their execution times.

Challenge 1: Millions of computer simulations need to be run.

In the first application use case, the researchers needed to optimize a zeolite figure of merit for each parameter set proposed in the study. The researchers used a biased Monte Carlo simulated annealing method that needs to be run for all possible crystal configurations. The project's goal was therefore to submit millions of simulations to cover this large parameter space. However, submitting millions of simulations to a job scheduler queue puts a considerable burden on any system and runs against many site-specific policies at HPC sites [13], [14]. Also, monitoring these jobs and checking their output results takes a considerable amount of effort by the researchers. Therefore, some means of throttling the simulation job submissions, taking remedial actions when faults occur, and checking output results when jobs complete were considered critical requirements.

In the first application use case, each simulation job was composed of a series of simulated annealing computations over a set of 100 seeds. Because of the pseudo-random nature of the algorithm, the run time for each job could be anywhere from a few minutes to 10 hours. In some cases, the simulation would never terminate. The majority of jobs, however, were expected to complete within three to four hours. This huge variability in run times caused some significant problems. First, each simulation job needed to request a CPU resource for 10 hours to ensure the worst-case run-time requirement could be met. However, if a job ran for only a few minutes, the remaining time left on the CPU could not be used by other simulation jobs with the same run-time requirement. It was therefore critical to re-factor the computation into smaller sub-jobs to reduce the variability of the expected job run times.

Challenge 2: Each simulation job was serial in nature.

In the first application use case, the material science researchers developed their simulated annealing algorithm to solve their multivariable, nonlinear system. The simulated annealing algorithm allows a series of random “near-by” solutions to be generated over a period of decreasing simulated “temperature”, with solutions of increasing optimality generated over time. This annealing process is intrinsically serial in nature because each step depends on the result of the previous step. Hence, although collectively the entire job ensemble was embarrassingly parallel, each job in the simulation was serial in nature.

Second, some simulation jobs would never terminate. The researchers therefore had to implement additional processing to detect these jobs and terminate them early. The additional processing involved closely monitoring the run time of the simulated annealing computation for each seed, and terminating the job if the seed run times indicated that the job would exceed its 10-hour run-time limit. A side effect of this additional processing was to further prevent running the seed simulations in parallel, because the run times of earlier seeds determined whether later seeds could run.

The second use case involves running multiple molecular dynamics simulations, each one characterized by a different value of the coupling parameter λ and, optionally, a set of energy parameters. The system under study may be the solvated ligand or the solvated complex, but in each case the job is run serially and independently of other jobs.

In the second use case, each MD simulation is not guaranteed to yield stable free energy values by the end of the 2.5 ns physical simulation time. For example, fluctuations may arise in the solvated ligand simulations as the ligand disappears, caused by attractive forces among the solvent particles that cause intermolecular distances to shrink to values corresponding to the strongly repulsive region of the potential energy. In this case, the simulation needs to be terminated and an adjustment made to the potential energy function to soften the repulsive barrier and avoid the artificially high forces that cause the energy to diverge.

The serial nature of each job posed a challenge to the researchers. First, many sites have schedulers configured to favor parallel jobs over serial ones [12]. Furthermore, some scheduling policies increase the priority of jobs based on their parallel job size, i.e. larger parallel jobs are favored over smaller ones. The rationale behind these scheduling policies is to favor jobs engaged in large computations and ensure that these are not unduly penalized by other smaller jobs in the job queue. However, the hypothetical zeolite researchers and the chemists performing ligand binding calculations were constrained by the intrinsic serial nature of their

Challenge 4: Simulation jobs required non-standard shared libraries to execute correctly.

In the second application use case, there was a need to export non-standard shared math libraries for the application to run efficiently


jobs, examine job output as they complete, diagnose problems when they occur, and leverage tools (like workflow tools) that have been developed for these systems.

at a remote site. Some of the dynamics calculations require significant physical simulation times to achieve stable values for the free energy and any inefficiency in the MD calculation gets magnified in light of the tremendous number of iterations involved in each calculation. A significant part of guaranteeing efficient code performance requires the use of high performance BLAS and LAPACK libraries, in addition to highly optimized intrinsics libraries. The use of tuned software typically involves linking in shared libraries to ensure that the most recent version of dynamic libraries is used at run-time. There is no guarantee, however, that these dynamic libraries will be installed in the execution machines. Moreover, it is infeasible to determine in advance which systems include these libraries. Restricting execution to only those systems that contain the shared libraries is impractical as it will severely limit the number of potential execution hosts. Setting up a correct run-time environment therefore requires that these libraries be staged over to the computational backend. When a large number of shared libraries are required, it becomes cumbersome and inefficient to transfer the files over to the execution machine.

MyCluster provides a robust infrastructure for acquiring resources even in the presence of transient faults. The system does this by deploying a single semi-autonomous agent at each contributing HPC cluster to provision resources. These agents provision resources by submitting parallel job proxies to the local scheduler; when the proxies run, they contribute CPU resources back to the user's cluster. The agents are also able to recover autonomously after periodic site reboots and to survive transient network outages across the wide-area network, allowing the provisioning infrastructure to remain unperturbed while experiments are running. Finally, MyCluster allows jobs to access files transparently from the submission host across remote sites. It does this through the XUFS [20] WAN distributed file system, which allows unmodified applications to mount the submission directory directly on the compute nodes on which the jobs run. Jobs need not perform explicit steps to stage files through gateway nodes into staging directories at each HPC cluster; they can access files transparently, as if they were using the shared distributed file system of a typical cluster configuration.

Challenge 5: Output file analysis needs to be performed during the simulation.

Running each MD simulation required close monitoring of the output file. First, early detection of free energy convergence shortens the simulation time and enables higher throughput, since other calculations can be started earlier. Conversely, simulations that exhibit divergent behavior near the maximum simulation time should be terminated to avoid wasting additional computational cycles. Although the post-processing analysis itself only involves straightforward parsing of output files, the absence of a shared file system between the submission and execution platforms poses logistical difficulties. Even if the resource management system were to stream standard output back to the submission host, the MD application does not write to stdout; it sends its output to a file that is specified as a program argument at run-time.

4.2 Hypothetical Crystal Structure Search
In the first application use case, the research team used MyCluster to aggregate distributed resources at NCSA (National Center for Supercomputing Applications), SDSC (San Diego Supercomputer Center), ANL (Argonne National Laboratory) and TACC (Texas Advanced Computing Center) into personal computing laboratories for performing the calculations required by their scientific study. The HPC clusters used at these contributing sites were shared multi-user systems, supporting hundreds of other researchers across a diverse range of scientific disciplines, running jobs of multiple modalities with different job size and run-time requirements.

4. IMPLEMENTED SOLUTIONS

4.2.1 Creating personal clusters with job proxies

In this section we describe the solutions these two application use cases developed to enable their large parameter sweep studies. In both use cases, the solution adopted was based on the MyCluster system [13][14][15]. MyCluster is a production software service provided by TACC (Texas Advanced Computing Center) to the University of Texas scientific research community and to computational science researchers on the NSF TeraGrid [1]. The system provides the ability to aggregate resources from distributed HPC clusters into personal clusters created on demand. The system also allows the merging of distributed compute resources into local departmental clusters.

MyCluster allowed the researchers to create a personal Condor cluster from a workstation at TACC. The workstation served as the submission point for all jobs submitted across the multiple systems on the TeraGrid [21]. The researchers configured MyCluster to submit parallel job proxies, with job sizes ranging from 20 to 64 CPUs, to the HPC clusters at NCSA, SDSC, ANL and TACC. These job proxies, each submitted with a run-time requirement of 24 hours, then contributed CPUs to the personal cluster when they were run by the local scheduler at an HPC cluster. The CPUs from the job proxies remained part of the personal Condor cluster until the job proxies contributing them were terminated by the HPC cluster scheduler. To allow the Condor scheduler in the personal cluster to correctly assign jobs to CPUs with sufficient time left to complete them, the job proxies advertised their remaining time in the HPC cluster in a TimeToLive ClassAd [22]. Jobs could then request CPUs in the personal cluster with a TimeToLive requirement of at least the expected run time of the job.
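In a Condor submit description, this kind of matching reduces to an ordinary requirements expression against the advertised attribute. The sketch below is illustrative only; the executable name and file names are assumptions, and only the TimeToLive attribute comes from the description above.

```
# Submit description for one simulation sub-job (names are illustrative).
universe     = vanilla
executable   = run_anneal_block.sh
arguments    = $(Process)
# Only match slots whose job proxy still has at least 2 hours (7200 s) to live.
requirements = (TimeToLive >= 7200)
output       = anneal_$(Process).out
error        = anneal_$(Process).err
log          = anneal.log
queue 1
```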

4.1 MyCluster Overview
MyCluster is a system for provisioning resources into personal clusters, or into local departmental clusters, on demand. Users are able to select, in the various production and prototype versions of the system, either a Condor [16][17], OpenPBS [18] or Sun Grid Engine (SGE) [19] interface for interacting with their jobs running across the remote HPC sites. Similarly, a user can request the system to provision resources into a local Condor, SGE or OpenPBS departmental cluster. The system thus provides a rich interface for the research projects conducting their parameter sweep studies. These commodity interfaces allow the teams to monitor individual


The DAGMan tool was configured to keep only 350 to 500 jobs at a time in the queue of the personal Condor cluster. This prevented the client workstation from being overwhelmed by having the personal Condor cluster unnecessarily schedule thousands of jobs simultaneously.
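One way to impose such a limit is DAGMan's own job throttle. Assuming, for illustration, that the level-1 workflow lives in a file named level1.dag, something like the following keeps no more than roughly 400 jobs submitted at once:

```
condor_submit_dag -maxjobs 400 level1.dag
```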

4.2.2 Defining multi-level workflows
The materials science researchers created nested workflows for executing their simulation experiments. These nested workflows orchestrated and throttled jobs through the personal Condor cluster. A visual representation of this nested workflow is shown in Figure 2. The level-1 workflow was composed of an ensemble of job chains, each chain representing a pair of jobs in a level-2 workflow. The level-2 workflow was composed of a job for scheduling the simulation, and another for post-processing the output of the simulation. The node responsible for scheduling the simulation spawned a level-3 workflow composed of a chain of five jobs. These five jobs collectively represented the simulated annealing computation for 100 seeds.
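In DAGMan terms, each level-2 chain is just a two-node DAG. A hedged sketch of what one such generated definition might look like is shown below; all file names are hypothetical, standing in for the files generated per space-group member.

```
# level2_member.dag -- one simulate/post-process chain from the level-1 ensemble.
# simulate_member.submit schedules the level-3 seed chain; postprocess_member.submit
# runs locally on the client workstation to validate the results.
JOB simulate simulate_member.submit
JOB postproc postprocess_member.submit
PARENT simulate CHILD postproc
```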

Screen snapshots of a personal Condor cluster created from the workstation at TACC are shown in Figure 3 and Figure 4. MyCluster was used to provision CPUs from HPC clusters at NCSA, SDSC, ANL and TACC. Specifically, the systems used are listed in Table 1.

Table 1. Multi-user TeraGrid systems used for provisioning resources into personal clusters

TeraGrid Site   HPC System     Architecture
NCSA            tungsten       IA-32
NCSA            mercury        IA-64
NCSA            cobalt         IA-64
SDSC            tg-login       IA-64
ANL             tg-login       IA-64
ANL             tg-login-viz   IA-32
TACC            lonestar       X86_64

The personal Condor cluster, shown in the snapshots, was created to process space group 15_1, which was composed of 30,000 members. This particular experiment was conducted for a period of over a week from 11-Oct-2006 to 23-Oct-2006.

Figure 2. Example of multi-level nested workflow.

Each job in the level-3 workflow computed over 20 seed values, for an expected run time of 2 hours per job. Therefore each job requested a CPU resource in the personal cluster with at least 2 hours in its TimeToLive value. Multiple level-3 workflow jobs were therefore able to run on each CPU acquired by the job proxies.

Figure 3 shows the expanding and shrinking Condor cluster over time, acquiring IA-32, IA-64 and X86_64 CPU resources for the workflow jobs.

Each level-3 workflow job also measured the run time of the simulated annealing computation for each seed, to ensure early detection of simulations not expected to terminate. If four consecutive seeds resulted in computation times exceeding the expected run time per seed (six minutes), the job was terminated and a special exit code was returned to the level-2 workflow. The level-2 workflow then terminated and an error message was logged. However, if the level-3 workflow completed successfully, the post-processing job of the level-2 workflow was triggered. This post-processing job was then scheduled locally on the client workstation to check the validity of the computed results. If an error was detected in the computed results, an error message was logged; otherwise, the result files were archived. The level-2 workflow then exited.
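A minimal sketch of that per-seed guard is shown below in C. The anneal executable name, its command-line interface, and the special exit code are assumptions for illustration; the six-minute threshold, the four-consecutive-seeds rule, and the 20 seeds per sub-job follow the description above.

```c
/* seed_guard.c -- sketch of the per-seed run-time guard described above.
 * Assumed interface: "./anneal <seed>" runs one simulated annealing seed. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define SEED_LIMIT_SECS   (6 * 60)  /* expected run time per seed          */
#define MAX_SLOW_SEEDS    4         /* consecutive slow seeds before abort */
#define SEEDS_PER_JOB     20        /* seeds handled by one level-3 job    */
#define ABORT_EXIT_CODE   42        /* hypothetical code read by level-2   */

static void run_seed(int seed)
{
    char arg[32];
    snprintf(arg, sizeof(arg), "%d", seed);

    pid_t pid = fork();
    if (pid == 0) {
        execl("./anneal", "anneal", arg, (char *)NULL);
        _exit(127);                 /* exec failed */
    }
    int status;
    waitpid(pid, &status, 0);
}

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int first_seed = atoi(argv[1]);
    int slow = 0;

    for (int seed = first_seed; seed < first_seed + SEEDS_PER_JOB; seed++) {
        time_t start = time(NULL);
        run_seed(seed);
        time_t elapsed = time(NULL) - start;

        /* Count consecutive seeds that exceed the expected per-seed time. */
        slow = (elapsed > SEED_LIMIT_SECS) ? slow + 1 : 0;
        if (slow >= MAX_SLOW_SEEDS) {
            fprintf(stderr, "seed %d: %d consecutive slow seeds, aborting\n",
                    seed, slow);
            return ABORT_EXIT_CODE; /* level-2 workflow logs an error */
        }
    }
    return 0;                       /* success: post-processing job runs */
}
```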

4.2.3 Executing workflows
The team of materials science researchers grouped each experimental run into space groups. Each space group described a collection of potential crystalline structures within a range of possible cell sizes, and was typically composed of between 6,000 and 30,000 members. A command-line tool was created to parse each space group and generate the level-1, level-2 and level-3 workflow definition files in the Condor DAGMan tool format [23]. A personal Condor cluster was then created from the TACC workstation, and the level-1 workflows were submitted to the DAGMan tool running in this personal cluster.

Figure 3. An expanding and shrinking Condor cluster aggregating IA-32, IA-64 and X86_64 processors.

Figure 4 shows the jobs in the Condor queue. The total number of jobs in the queue never exceeded 380 jobs. This was because the


dynamically mount the submission host directory on the compute nodes of both clusters. This ensured that the required run-time environment, in particular the non-standard shared math libraries, was available to the executables when they ran. It also ensured that the users were able to continually monitor the simulation output files to enable early detection of non-terminating jobs.

DAGMan tool was configured to throttle the submission of jobs to the Condor scheduler as previously described. This prevented the over-consumption of the local workstation CPU resource.

4.3.1 XUFS distributed filesystem overview
MyCluster is integrated with the XUFS WAN distributed file system [20]. XUFS allows users, or jobs, to dynamically mount a remote directory on a local machine entirely in user space. The system uses interposition techniques to redirect file access system calls for a remote file to a staged copy of the file in a local cache directory. XUFS stages remote files using an efficient striped file-transfer protocol, and implements last-close-wins synchronization to maintain cache consistency. XUFS is implemented as a user-space file server and a shared object, libxufs.so. The shared object interposes the personal distributed file system overlay behavior into an application's run-time environment using the shared object preloading feature that exists in most UNIX variants. This preloading feature allows XUFS to overload functions in the system shared libraries with functions implementing the desired overlay behavior.

Figure 4. Running and pending jobs in the personal Condor cluster

Figure 6 illustrates how XUFS provides transparent access to a remote file for the open system call. When the user first accesses a file, the open system call is invoked. The XUFS shared object, preloaded into the run-time environment, intercepts the open call (1) and determines if the remote file needs to be fetched from the file server. If the file needs to be fetched, the remote file server is contacted, and the file is copied into a local cache directory (2). Finally, the shared object opens the local cached file copy and returns the file descriptor to the user (3). All subsequent access to the file is then redirected to this local cached file copy.
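As an illustration of the interposition mechanism (not of XUFS internals), the fragment below shows how a preloaded shared object can wrap open(2) and redirect it to a locally cached copy. The fetch_to_cache() helper is a placeholder for the staging protocol described above, and the build commands are assumptions.

```c
/* intercept_open.c -- illustration of LD_PRELOAD-style interposition in the
 * spirit of libxufs.so.  Build and use (assumed):
 *   gcc -shared -fPIC -o libintercept.so intercept_open.c -ldl
 *   LD_PRELOAD=./libintercept.so ./application
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>

static int (*real_open)(const char *, int, ...);

/* Placeholder for the XUFS behavior: contact the user-space file server,
 * stage the remote file into the local cache, and return the cached path. */
static const char *fetch_to_cache(const char *path)
{
    return path;   /* stub: pass the path through unchanged */
}

int open(const char *path, int flags, ...)
{
    int mode = 0;

    if (real_open == NULL)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    if (flags & O_CREAT) {          /* open(2) takes a mode only with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, int);
        va_end(ap);
    }

    /* Steps (1)-(3) from the text: intercept the call, stage the file into
     * the cache, then open the local copy and return its descriptor. */
    return real_open(fetch_to_cache(path), flags, mode);
}
```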

Finally, Figure 5 shows a hypothetical zeolite crystalline structure that was discovered by the computational study. The study has discovered over three million such structures to date.

Figure 5. A newly discovered zeolite crystal from the parameter sweep study.

4.3 Absolute Binding Energy Calculation
In the second application use case, the researchers used MyCluster to provision resources from the terascale HPC cluster at TACC to a departmental Condor cluster. The researchers already had their simulation jobs running in the departmental Condor cluster, and wanted to continue using their Condor job scripts. MyCluster was used to augment the local departmental Condor cluster with additional resources provided by the TACC cluster, allowing them to continue using their submission scripts and tools.

Figure 6. A user accessing a remote file through XUFS

4.3.2 Enabling personal global file access
In the binding energy calculations, the MD simulations were performed using the sander executable that is part of the Amber suite of molecular modeling software. Sander was compiled on a local workstation using commercial compilers, and was dynamically linked against commercial math libraries and compiler intrinsic libraries. Although these libraries are installed on the TACC HPC cluster, they are not available on the departmental Condor cluster compute nodes. Exporting the local workstation file system through XUFS facilitated a consistent execution environment across the

The researchers used MyCluster to overlay a personal Condor cluster over both the departmental cluster and the TACC cluster. This allowed them to leverage many of the virtual cluster-wide services provided by MyCluster. In particular, one service that has proved useful is the ability for their jobs to


personal Condor cluster, eliminating the need to reconfigure the environment at run time to ensure access to the required shared libraries on the execution system.

Different sets of input files residing in distinct subdirectories (corresponding to different λ values for the ligand and ligand-protein complex systems) are generated on the local workstation as well, on an as-needed basis depending on intermediate simulation results. While it is possible to transfer files from the local workstation to the personal Condor cluster, data management and organization becomes more complicated as simulations at different λ values and energy parameters are carried out. Fortunately, the use of XUFS enabled the local workstation directory to be mounted remotely on the Condor system, such that MD simulations on the Condor system can proceed as if they were running on a global file system. This eliminates the need to transfer input files at the beginning of the simulation and to transfer data back to the local workstation at job completion for later post-processing.

More importantly, this allowed close monitoring of output files at run time. At present, Condor only stages output files on job exit or eviction, so there is no mechanism within Condor to monitor output files while a job is running. Using XUFS for constant monitoring of output files is a capability that will be exploited in future plans for more automated job management. Through XUFS, output monitoring will make it possible to automatically terminate jobs when the free energy value for a simulation satisfies a set of specified convergence criteria. Conversely, jobs with unconverged free energies near the simulation time limit of 2.5 ns can be killed in order to prevent further loss of CPU cycles and to free the encumbered CPUs so that other MD simulations may start.

Figure 7. Recursive workflow for the long-running MD simulation. The DAG submit file consists of the JOB labeled “sander” and a POST script to resubmit the DAG submit file.

5. CONCLUSIONS
This paper examines the challenges involved in conducting large parameter sweep studies in a widely distributed computing environment. We examined two concrete applications: a project searching for feasible zeolite crystalline structures and a project calculating the binding energies of ligands to proteins. We examined the specific challenges both applications faced when running in a distributed HPC environment, and we described our solution using the MyCluster system. MyCluster provides a rich interface for users to submit, manage and monitor jobs across distributed resources. The system expands the traditional, narrow interfaces exposed by sites in a distributed computing environment, allowing users to mimic local job submission semantics, leverage legacy scripts and tools, access files transparently through a distributed file system, and survive transient faults. Future extensions to the MyCluster system will allow parallel job ensembles to be managed as well as serial jobs. Also, a web portal interface [24] is under development for MyCluster to enable the wider science gateways community to leverage the rich and robust infrastructure provided by the system. The experience gained from supporting these new user communities will enable us to improve our current capabilities, as well as understand the challenges involved in other computational usage modalities.

4.3.3 Defining recurrent workflows
The longest-running MD simulation, corresponding to the dynamics of the solvated ligand-protein complex for 2.5 ns, takes more than eight days of wallclock time to complete. However, the maximum run-time limit of the CPU resources provisioned from the TACC terascale cluster with MyCluster is 24 hours. Thus, jobs running on CPUs with a TimeToLive of only 24 hours (see Section 4.2.1) need to be checkpointed and resubmitted to compute over the required eight-day computing period. Note that this restart capability is desirable even when running in dedicated mode on the departmental cluster, to ensure that jobs prematurely terminated by a system failure can recover gracefully from the last saved state. Management of this recurring workflow is implemented using Condor DAGMan [23] to specify the recursive nature of the job cycle. The DAG job specification file consists of two parts: a JOB specification line, which points to the path of the Condor job file for the actual MD simulation, and a SCRIPT specification line, which contains the path to a post-processing script that renames the restart file and resubmits the DAG job using the condor_submit_dag command. A diagram illustrating this workflow is shown in Figure 7.
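Concretely, the recurrent DAG described above can be written in two lines of the DAGMan format. The file names below are illustrative; the POST script is assumed to rename the Amber restart file and call condor_submit_dag on this same DAG file.

```
# sander.dag -- one 24-hour segment of the long-running MD simulation.
# sander.submit is the Condor job file for the MD run; resubmit_sander.sh
# renames the restart file and resubmits this DAG with condor_submit_dag.
JOB sander sander.submit
SCRIPT POST sander resubmit_sander.sh
```

Each pass through the DAG therefore advances the simulation by one TimeToLive-limited segment until the full 2.5 ns trajectory is complete.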

6. ACKNOWLEDGMENTS
We would like to thank Prof. Michael Deem and Prof. David Earl, the principal investigators of the hypothetical zeolite crystal search project. We would also like to thank Prof. Pengyu Ren and Dian Jiao, who are performing the ligand binding energy calculations. Finally, we would like to acknowledge funding from NSF (sub-contracted from 0503697) which has made this work possible.


7. REFERENCES
[1] NSF TeraGrid, http://www.teragrid.org
[2] I. Foster and C. Kesselman, "Globus: A Metacomputing Infrastructure Toolkit", Intl. Journal Supercomputing Applications, 11(2), pp. 115-128, 1997.
[3] GridFTP, http://www.globus.org/grid_software/data/gridftp.php
[4] D. J. Earl and M. W. Deem, "Toward a Database of Hypothetical Zeolite Structures", Eduardo Glandt special issue, Industrial and Eng. Chem. Research, 54, 2006, pp. 5449-5454.
[5] International Zeolite Association, http://www.iza-online.org
[6] A. R. Leach, "Molecular Modeling: Principles and Applications", Second Edition, pp. 564-569, Pearson/Prentice Hall, Harlow, England, 2001.
[7] W. L. Jorgensen, J. K. Buckner, S. Boudon and J. Tirado-Rives, "Efficient Calculation of Absolute Free Energies of Binding by Computer Simulations – Applications to the Methane Dimer in Water", Journal of Chemical Physics, 89, 1988, pp. 3742-3746.
[8] The Amber Molecular Dynamics Package, http://amber.scripps.edu/
[9] P. Ren and J. W. Ponder, "Polarizable Atomic Multipole Water Model for Molecular Mechanics Simulation", Journal of Physical Chemistry B, 107, 2003, pp. 5933-5947.
[10] S. C. Harvey, R. K. Tan and T. E. Cheatham III, "The Flying Ice Cube: Velocity Rescaling in Molecular Dynamics Leads to Violation of Energy Equipartition", Journal of Computational Chemistry, 19, 1998, pp. 726-740.
[11] C. H. Bennett, "Efficient Estimation of Free Energy Differences from Monte Carlo Data", Journal of Computational Physics, 22, 1976, pp. 245-268.
[12] TeraGrid site scheduling policies, http://www.teragrid.org/userinfo/guide_tgpolicy.html
[13] E. Walker, J. P. Gardner, V. Litvin, and E. L. Turner, "Personal Adaptive Clusters as Containers for Scientific Jobs", accepted for publication in Cluster Computing, Springer.
[14] E. Walker, J. P. Gardner, V. Litvin, and E. L. Turner, "Creating Adaptive Clusters in User-Space for Managing Scientific Jobs in a Widely Distributed Environment", in Proc. of the IEEE Workshop on Challenges of Large Applications in Distributed Environments (CLADE 2006), Paris, July 2006.
[15] MyCluster TeraGrid User Guide, http://www.teragrid.org/userinfo/jobs/gridshell.php
[16] Condor, High Throughput Computing Environment, http://www.cs.wisc.edu/condor/
[17] M. Litzkow, M. Livny, and M. Mutka, "Condor – A Hunter of Idle Workstations", in Proc. of the International Conference on Distributed Computing Systems, pp. 104-111, June 1988.
[18] Portable Batch System, http://www.openpbs.org
[19] Sun Grid Engine, http://gridengine.sunsource.net/
[20] E. Walker, "A Distributed File System for a Wide-Area High Performance Computing Infrastructure", in Proc. of the 3rd USENIX Workshop on Real, Large Distributed Systems (WORLDS '06), Seattle, Nov 2006.
[21] NSF TeraGrid Compute and Visualization Resources, http://www.teragrid.org/userinfo/hardware/resources.php
[22] R. Raman and M. Livny, "Matchmaking: Distributed Resource Management for High Throughput Computing", in Proc. of the 7th IEEE Symposium on High Performance Distributed Computing, July 28-31, 1998.
[23] Condor DAGMan, http://www.cs.wisc.edu/condor/dagman/
[24] E. Roberts, M. Dahan, E. Walker and J. Boisseau, "MyCluster Ensemble Manager: Ensemble Jobs in the TeraGrid User Portal", in Proc. of TeraGrid '07, Madison, June 2007.
