Master’s Project Proposal: High Performance Computing with Microsoft Project

Leor E. Dilmanian
Rochester Institute of Technology, Rochester, NY 14623
Email: [email protected]

Submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Computer Science at the Rochester Institute of Technology

Supervised by: Gregor von Laszewski
Rochester Institute of Technology
Service Oriented Cyberinfrastructure Lab
Building 74-1076
Lomb Memorial Drive
Rochester, NY 14623
[email protected]

Read by: James Heliotis
Department of Computer Science
Rochester Institute of Technology
102 Lomb Memorial Drive
Rochester, New York 14623-5608 U.S.A.

June 15, 2009
Contents
Abstract
Introduction
Definitions
    TeraGrid
    Workflows
    Quality of Service
    Wall Clock Time
Previous Work
Functional Requirements
    Use Cases
    Microsoft Project
    Console
    Workflow Engine
Proposal
Deliverables
Bibliography
Abstract
Many scientific researchers find it difficult to perform large-scale analysis. One major concern for these end users is their productivity in designing and executing the applications that perform the analysis. These applications benefit greatly from concurrent execution in a distributed computing environment. The productivity of some of these researchers will be hindered if they must first learn a command line tool or a text-based programming language; they simply wish to focus on science rather than learn another tool. Recently, frameworks and workflow systems have been used to create and run these applications. However, these tools are still difficult to learn. We present Cyberaide Project and show how it leverages the Microsoft Project software package to create and run workflows on the Grid. We propose several extensions to this tool and an evaluation to determine its overall usability.
Introduction
The development and execution of large-scale analytical applications is complicated, especially for the scientific researcher. Several issues are involved in running such experiments. The researcher should be able to easily specify how to run compute- or data-intensive jobs that perform large-scale scientific experiments or analysis. The scientist should also be able to easily orchestrate various activities (tasks). In research computing, we find that the activities of an analysis often translate into executions of applications on the command line. For example, a biologist may be interested in running a BLAST [1] search in order to scan DNA. We assume that a researcher without the technical experience will view programming languages and terminals as a huge inconvenience. For our biologist, we would like to eliminate the need to manually log into supercomputing sites, transfer data among them, handle faults and more.

We observe that a scientific analysis consists of multiple activities (such as comparing the nucleotides of DNA sequences), each requiring some form of input (a threshold, a DNA search sequence and a database of sequences). Each activity also performs some operation on the data (searching for a match) and produces an output result (the matches of interest). The results produced by some activities are used as input by other activities in order to accomplish a larger goal (finding a human/mouse homologue).

For our purposes, each activity is an execution of an application (such as BLAST) on some high-end computational resource. For example, Indiana University maintains and provides public access to an IBM e1350. It is a distributed, shared memory cluster running 3072 processors over 768 compute nodes, with a peak performance of 30.6 teraflops and 6 terabytes of main memory. On it, they have installed a database with access to various data sets for protein and nucleotide sequences. Once the BLAST search activity is complete, the output data may be forwarded to some other institution for further processing. For example, we can forward our data to the San Diego Supercomputer Center.

In this proposal, we consider only the set of networked computational resources found on the TeraGrid [2, 3]. Resources hosted on the TeraGrid typically include clusters, massively parallel processing machines, shared memory symmetric multiprocessing machines, storage devices, visualization machines, and more.

In order to automate the steps performed in an analysis, it is first necessary to determine the activities that need to be carried out and to coordinate them. We call this “planning”. We will also be managing multiple activities which may be executing concurrently. Each activity, or task, will be assigned a resource for execution. We call this “scheduling”.

In recent years, workflow tools and methodologies have been used to solve this problem [4, 5]. These tools attempt to simplify the task of running jobs on resources for the end user. We refer to these tools as “Grid workflow systems”. These systems typically expose a graphical user interface to the end user for creating applications. Even so, we assume it is still very difficult for researchers to use such tools. It is our goal to create a tool which researchers would want to use. We want to improve the productivity of researchers running scientific applications.

We introduce Cyberaide Project. It is currently a very basic first attempt at proving a unique idea.
It demonstrates that we can design workflows for distributed systems in the context of project management software. To summarize the results so far, we were able to execute only a small workflow on a few local machines supporting access via SSH. We illustrate how we leverage Microsoft Project as a software component and use it to plan and schedule jobs which will be run on a Grid. We also illustrate the user experience. The first stage is design. In this stage the user does the planning: the user uses Microsoft Project to define tasks and their temporal dependencies. For small workflows, the end user might also decide to schedule manually, that is, to assign resources to tasks by hand.

We will also discuss how Cyberaide Project can be used with other components to run and track jobs in execution. When the user is done designing a workflow, he or she should be able to run it. We will provide this feature, and the ability to check the status of a workflow. To do this, we will integrate Cyberaide Project with a workflow engine. Cyberaide Project will demonstrate its ability to seamlessly authenticate, execute jobs and transfer data on the TeraGrid. The end product should be a tool that researchers can put to practical use. We will use several pre-existing components in our software system to accomplish our goal, and we provide details on how and why they are used.
Definitions
TeraGrid
The TeraGrid [2, 3] is a free, open, scientific research infrastructure funded by the NSF. It combines various resources from eleven partner sites using high performance network connections. Resources contributed for use on the Grid can be described as massively parallel machines (such as clusters, SMP and MPP machines), large storage, visualization and more [6]. Although each donating institution maintains its own resources, they expose uniform access to these resources using a common set of web services found in the Globus Toolkit [7]. The Globus Toolkit provides services not just for computing, but also storage, user collaboration, security, data management, system monitoring, resource management, accounting and more. Application developers wanting to create grid-enabled clients can access these web services from Commodity Grid Toolkits such as the Java CoG Kit [8]. These toolkits provide access from various operating systems, programming languages and distributed object frameworks. Thus, commodity grid toolkits also provide uniform access to the many independently administered resources via a convenient API.
Workflows
Formally, we begin by defining a particular type of workflow [9]. A workflow ($W$) is the set of tasks ($T$), dependencies ($D$), resources ($R$) and a mapping for resource assignments ($A$):

$$W = (T, D, R, A)$$

where a task is a unit of work consisting of a set of attribute values,

$$T = \{ t_i \mid t_i = [B_{i,1}, \ldots, B_{i,n}] \}$$

and a dependency is used to describe the execution ordering between two tasks:

$$D = \{ d \mid d = (t_1, t_2),\ t_1 \in T,\ t_2 \in T \}$$

The graph $G = (T, D)$ is a directed acyclic graph, and creating this graph is called “planning” (see footnote 1). $A(t)$ is a function mapping each task to a resource (see footnote 2). Creating this function $A(t)$ is called “scheduling”.

$$A : T \to R$$

Additionally, each resource consists of a list of attribute values:

$$R = \{ r_i \mid r_i = [C_{i,1}, \ldots, C_{i,m}] \}$$
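The definition of $W = (T, D, R, A)$ can be made concrete with a small data structure. The following Python sketch is only an illustration under our own assumptions: the class name, the attribute keys and the task and resource names are hypothetical and are not part of Cyberaide Project. It checks that $G = (T, D)$ is acyclic (planning) and that $A$ maps every task to one known resource (scheduling).

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """W = (T, D, R, A) with attribute dictionaries for tasks and resources."""
    tasks: dict = field(default_factory=dict)        # T: task name -> attributes B
    deps: set = field(default_factory=set)           # D: (t1, t2), t1 runs before t2
    resources: dict = field(default_factory=dict)    # R: resource name -> attributes C
    assignment: dict = field(default_factory=dict)   # A: task name -> resource name

    def is_planned(self):
        """True if every dependency names known tasks and G = (T, D) is acyclic."""
        if any(a not in self.tasks or b not in self.tasks for a, b in self.deps):
            return False
        indegree = {t: 0 for t in self.tasks}
        for _, b in self.deps:
            indegree[b] += 1
        ready = [t for t, n in indegree.items() if n == 0]
        visited = 0
        while ready:  # Kahn's algorithm: remove tasks with no pending predecessor
            t = ready.pop()
            visited += 1
            for a, b in self.deps:
                if a == t:
                    indegree[b] -= 1
                    if indegree[b] == 0:
                        ready.append(b)
        return visited == len(self.tasks)

    def is_scheduled(self):
        """True if A maps every task to a known resource."""
        return all(self.assignment.get(t) in self.resources for t in self.tasks)


# Hypothetical two-task workflow; names and attributes are made up for illustration.
w = Workflow(
    tasks={"blast": {"executable": "blast"}, "filter": {"executable": "filter.sh"}},
    deps={("blast", "filter")},
    resources={"cluster-a": {"processors": 3072}},
    assignment={"blast": "cluster-a", "filter": "cluster-a"},
)
print(w.is_planned(), w.is_scheduled())
```

Because `assignment` is a dictionary, each task can map to at most one resource, which matches the function form of $A$ used in this paper.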
Quality of Service
When executing workflows, we are often concerned about what is referred to as “quality of service” [5]. This term describes the quality of execution on a resource. We assume that each resource can provide a certain level of quality with respect to one or more quality of service parameters. Typical quality of service parameters can be categorized in terms of performance, cost, trust and reliability. For our purposes, we simply assume that one or more of the attributes $C_{i,x}$ describe the quality of a resource $r_i$, and that our end user requires the workflow $W$ to adhere to a policy $P$ describing its quality. Quality of service issues can therefore be addressed through scheduling [10]: the scheduling function $A_P$ attempts to optimize $W$ with respect to $P$.

$$A_P : T \to R$$
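One way to picture a policy-driven scheduling function $A_P$ is a greedy selection over resource attributes. The sketch below is a simplification under stated assumptions: the attribute names (`processors`, `cost`), the resource names and the per-task greedy choice are ours, and a real scheduler such as those surveyed in [10] would optimize over the whole workflow rather than one task at a time.

```python
def schedule(tasks, resources, policy):
    """Build A_P: for each task pick the admissible resource with the lowest policy score."""
    assignment = {}
    for task, attrs in tasks.items():
        # A resource is admissible if it offers at least the requested processors.
        candidates = [
            name for name, quality in resources.items()
            if quality["processors"] >= attrs["processors"]
        ]
        if not candidates:
            raise ValueError(f"no resource can run task {task!r}")
        assignment[task] = min(candidates, key=lambda name: policy(resources[name]))
    return assignment


# Hypothetical quality attributes C_{i,x}: processor count and cost per hour.
resources = {
    "big-cluster": {"processors": 3072, "cost": 5.0},
    "small-cluster": {"processors": 64, "cost": 1.0},
}
tasks = {
    "blast": {"processors": 512},
    "filter": {"processors": 8},
}


def cheapest(quality):
    """Policy P: prefer the resource with the lowest cost."""
    return quality["cost"]


plan = schedule(tasks, resources, cheapest)
print(plan)
```

Swapping `cheapest` for another scoring function changes the policy $P$ without touching the scheduling loop.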
Wall Clock Time
One important measure of the quality of a workflow is referred to as “wall-clock time” [11]. This measure of quality is not the property of any single task or resource, but rather a property of the entire workflow. It indicates the perceived duration of an entire workflow from start to finish. According to [11], the expected duration ($\tau$) of a running job can be determined by the state of the queue and several administrative decisions. To summarize the results, we simply state that for application $\alpha$ running on $q$ processors,

$$\tau^{\alpha}(q) = \tau^{\alpha}_{wait}(q) + \tau^{\alpha}_{io}(q) + \tau^{\alpha}_{calc}(q)$$

the total expected time to run application $\alpha$ on $q$ processors can be determined using the amount of time waiting in a particular queue $\tau^{\alpha}_{wait}(q)$, the total amount of time transferring data for a particular job $\tau^{\alpha}_{io}(q)$ and the total amount of time performing calculation $\tau^{\alpha}_{calc}(q)$. We do not discuss these performance measurements but instead simply state that we can obtain them using the Network Weather Service [12].

Given a scheduled workflow $W$, each task $t_i \in T$ will have an attribute called “estimated duration” as described above. Then, the wall-clock time of a workflow can be determined using the chain of activities from start to end having the longest estimated duration. One method used to find this path or chain is called the critical path method [13]. It can be used to determine which activities are critical to the timely completion of a project (workflow).

Footnote 1. The graph $G$ can also be referred to as a partial ordering of tasks.
Footnote 2. Sometimes, $A(t)$ is a relation which maps a task into more than one resource. However, we will not use this definition in this paper.
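The wall-clock estimate described in this section, the longest chain of estimated durations through the task graph, can be sketched in Python as follows. The task names and the numbers are hypothetical, and the three time components are supplied by hand here; in the proposed system they would come from a forecasting service rather than from constants.

```python
def estimated_duration(wait, io, calc):
    """tau(q) = tau_wait(q) + tau_io(q) + tau_calc(q), all in the same time unit."""
    return wait + io + calc


def wall_clock_time(durations, deps):
    """Return (length, path) of the longest chain of tasks through the DAG."""
    preds = {t: [] for t in durations}
    succs = {t: [] for t in durations}
    for a, b in deps:
        preds[b].append(a)
        succs[a].append(b)

    # Topological order (Kahn's algorithm).
    pending = {t: len(p) for t, p in preds.items()}
    ready = [t for t, n in pending.items() if n == 0]
    order = []
    while ready:
        t = ready.pop()
        order.append(t)
        for s in succs[t]:
            pending[s] -= 1
            if pending[s] == 0:
                ready.append(s)
    if len(order) != len(durations):
        raise ValueError("dependencies contain a cycle")

    # Earliest finish time of each task; remember the predecessor that determines it.
    finish, best_pred = {}, {}
    for t in order:
        best_pred[t] = max(preds[t], key=lambda p: finish[p], default=None)
        start = finish[best_pred[t]] if best_pred[t] is not None else 0.0
        finish[t] = start + durations[t]

    # Walk back from the task that finishes last to recover the critical path.
    last = max(finish, key=finish.get)
    path = [last]
    while best_pred[path[-1]] is not None:
        path.append(best_pred[path[-1]])
    return finish[last], path[::-1]


# Hypothetical estimates in minutes: (queue wait, data transfer, calculation).
durations = {
    "stage-in": estimated_duration(0, 10, 0),
    "blast": estimated_duration(30, 5, 60),
    "annotate": estimated_duration(5, 5, 20),
    "report": estimated_duration(0, 2, 3),
}
deps = [("stage-in", "blast"), ("stage-in", "annotate"),
        ("blast", "report"), ("annotate", "report")]
total, critical_path = wall_clock_time(durations, deps)
print(total, critical_path)
```

In this made-up example the chain through `blast` dominates, so shortening `annotate` would not change the wall-clock time of the workflow.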
Previous Work
We begin by discussing previous work that has been done by others in this area. Much work has already been done on designing and executing workflows for distributed computing, including the early metacomputers [11]. We refer to many of the newer systems as “Grid workflow systems” [4]. They include Triana [14], Kepler [15], Pegasus [16], Karajan [17] and Condor [18]. One workflow system of particular interest, which is currently being developed, is Trident [19]. This system allows users to share workflows through a workflow repository. Trident can also schedule jobs on Windows HPC clusters. There are also many Grid computing toolkits currently available. These toolkits provide easy access to Grid services through a convenient API, and they are available for different programming languages and distributed object frameworks. We will be using the Java Commodity Grid Kit [8].
Functional Requirements
This section briefly describes what our requirements are, why they are needed, and how we will fulfill them. We begin by describing some use cases. We will then describe some components in greater detail.
Use Cases
1. The end user will design a workflow using Cyberaide Project. The end user can interactively use a console, a graphical user interface, or both.
2. When the user is done creating a workflow, he or she will submit it to a workflow engine. The workflow engine coordinates execution of a workflow.
3. The workflow engine will use the Commodity Grid Kit [8] to simplify the task of executing jobs on machines.
Microsoft Project
For the implementation of Cyberaide Project, we require that our tool be easy for the researcher to use. Our strategy to meet this requirement is to customize and re-use well known “commodity” software. The same strategy has been practiced by many Grid projects such as the Java Commodity Grid Kit [8]. In doing so, we can minimize the effort needed for the scientist to learn a new tool. One such software package used for project management is Microsoft Project.

Microsoft Project and its graphical user interface provide a feature-rich environment with powerful visualization capabilities. For example, end users have several task and resource management views defined for them. Each view is an interface which portrays different information. Such views include resource graphs, calendars, charts, and usage diagrams. Within Cyberaide Project, the Microsoft Project environment has been customized for Grid-based workflows.

Figure 1 shows a workflow and highlights the well known features of Microsoft Project in the “Gantt Chart” view. This view consists of the task table in the left window pane and the Gantt chart in the right window pane. Within this view, one can visualize the workflow using the Gantt chart, and easily modify it using the task table (spreadsheet). The Gantt chart is currently configured to convey some progress tracking and resource assignment information along with the workflow. The task table has many custom fields, or columns, defined for execution parameters (task attributes). When changes are made to the workflow, adjustments are automatically calculated and displayed. Invalid changes to the workflow are rejected. Cyberaide Project provides the ability to model deadlines, advance reservations, recurring tasks, and task splits. Graphical indicators can be used to display custom notes about task execution on the Grid or warnings related to resource leveling and deadlines. Additionally, Visual Studio Tools for Office can be used to create custom forms for the end user.

Figure 1. Using Microsoft Project to build and execute a Grid based workflow.

Using both standard and custom fields associated with tasks and resources, the software package can be used to store and monitor various quality of service and execution parameters. A rich set of features assists in multiple-project management and collaboration. These features include resource sharing, inter-project dependencies, and sub-workflows. Resource sharing and discovery can be accomplished using built-in support from Microsoft Project, if not from other information services. Namely, it relies on Active Directory [20], the Address Book [21], or a Microsoft Project Server [22] to discover and share resources. Additional tools of interest include built-in data analysis for quality of service prediction and visual reports for the sharing of historical information.

The features of Microsoft Project have been compared to those of other leading project management solutions in a review [23]. Based on the review, we predict that end users will be familiar with the graphical components exposed in Microsoft Project. The well known spreadsheet interfaces are found in many office suites and make it convenient to enter information without a mouse. Microsoft Project is quick to learn because of the layout of its user interface. The interface appropriately hides a large volume of information and functionality while dividing information into various sections (views). The one major complaint about Microsoft Project is that it is neither free nor open source. We have outlined the many advantages of using Microsoft Project. It received the highest overall rating in the review for its user friendliness and its ability to assist in collaboration and in resource and project management. Beyond its favorable ratings, we have our own reasons for using Microsoft Project.
We chose to implement Cyberaide Project with Microsoft Project because it meets our requirements. It is a well known “commodity” tool and the de facto industry standard. A considerable portion of the hundreds of project management solutions comply with its standards. Therefore, we believe that Cyberaide Project can successfully re-use any compatible project management solution to design and execute workflows. Since it contains the powerful and user friendly graphical user interface we describe in this section, we predict that users will not have much difficulty adopting it. One major aspect is the spreadsheet interface. It is the same one found in office applications such as Excel. We believe the end user will spend less time learning the new interface and more time being productive.

Our abstract definition of a workflow is fully supported by functionality in Microsoft Project.
1. Each task in our abstract definition translates directly to a task row in the Microsoft Project spreadsheet.
2. Each dependency in our workflow is called a predecessor in Microsoft Project.
3. Microsoft Project will verify that our workflow is a “directed acyclic graph”.
4. Each resource in our abstract definition also translates to a row on the resource spreadsheet.
5. Each attribute can be implemented as a custom field.
6. Each task can be assigned to one or more resources using the resources field. We can add additional constraints to ensure that each task is assigned to exactly one resource.
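The correspondence listed above can be illustrated by translating task rows back into the abstract workflow. This Python sketch assumes, hypothetically, that the task table has been exported as a list of dictionaries with the columns `ID`, `Name`, `Predecessors` and `Resource Names`, and that predecessor entries are comma-separated task IDs with an optional dependency-type suffix that is ignored here. It does not use any Microsoft Project API, and an entry that does not start with a task ID is not handled.

```python
import re


def rows_to_workflow(rows):
    """Translate exported task rows into (tasks, deps, assignment)."""
    names = {row["ID"]: row["Name"] for row in rows}
    tasks, deps, assignment = set(names.values()), set(), {}
    for row in rows:
        entries = (e.strip() for e in row["Predecessors"].split(","))
        for entry in filter(None, entries):
            pred_id = int(re.match(r"\d+", entry).group())  # drop any type/lag suffix
            deps.add((names[pred_id], row["Name"]))
        assigned = [r.strip() for r in row["Resource Names"].split(",") if r.strip()]
        # Enforce the extra constraint: exactly one resource per task.
        if len(assigned) != 1:
            raise ValueError(f"task {row['Name']!r} must have exactly one resource")
        assignment[row["Name"]] = assigned[0]
    return tasks, deps, assignment


# Hypothetical exported rows; task and resource names are made up.
rows = [
    {"ID": 1, "Name": "stage-in", "Predecessors": "", "Resource Names": "cluster-a"},
    {"ID": 2, "Name": "blast", "Predecessors": "1", "Resource Names": "cluster-a"},
    {"ID": 3, "Name": "report", "Predecessors": "1,2FS", "Resource Names": "cluster-b"},
]
tasks, deps, assignment = rows_to_workflow(rows)
print(sorted(deps))
```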
Console
Cyberaide Project is actually a console application. When it is started, it begins by running an instance of Microsoft Project. Our console application behaves like a shell. It prompts the user for commands (described in Table 1). However, these commands are not regular commands found in most operating systems. They will manipulate the running instance of Microsoft Project. This interface will allow one to integrate scripts or interactively create a workflow. For example, when a command is issued to add a task to the workflow, the corresponding changes are reflected within the graphical user interface of Microsoft Project.
Table 1. List of selected Cyberaide Commands

Command     Description
task        Add a task to the project.
map         Map a task to a resource.
rmtask      Remove a task from the project.
edit        Edit the attributes of a task.
dep         Add a dependency from one task to another.
rdep        Remove a dependency from one task to another.
res         Add a resource to the project.
unmap       Remove a task/resource mapping.
rmres       Remove a resource from the project.
run-all     Run the project on the Grid.
save        Save the project.
load        Load a project.
listtasks   List all tasks in the project with a valid name.
listres     List all resources in the project with a valid name.
find        Find a task by its name.
Once again, we have implemented commands for all the operations that our user will need to create and run a workflow. We will allow the user to interactively modify workflows from directly within Microsoft Project, or the command line interface.
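The shell-like behavior of the console can be sketched as a small command dispatcher. The Python example below is a toy model under our own assumptions: it keeps the workflow in memory instead of driving a running instance of Microsoft Project, it covers only a few of the commands, and the argument order for each command (for example `dep FIRST SECOND`) is hypothetical, since the proposal does not specify the command syntax.

```python
import shlex


class ConsoleSketch:
    """Minimal command dispatcher over an in-memory workflow (no Microsoft Project)."""

    def __init__(self):
        self.tasks, self.resources = set(), set()
        self.deps, self.mapping = set(), {}
        self.commands = {
            "task": self.add_task, "rmtask": self.remove_task,
            "res": self.add_resource, "dep": self.add_dep,
            "map": self.map_task, "listtasks": self.list_tasks,
        }

    def add_task(self, name):
        self.tasks.add(name)
        return f"added task {name}"

    def remove_task(self, name):
        # Removing a task also drops its mapping and any dependency that mentions it.
        self.tasks.discard(name)
        self.mapping.pop(name, None)
        self.deps = {d for d in self.deps if name not in d}
        return f"removed task {name}"

    def add_resource(self, name):
        self.resources.add(name)
        return f"added resource {name}"

    def add_dep(self, first, second):
        if first not in self.tasks or second not in self.tasks:
            return "error: unknown task"
        self.deps.add((first, second))
        return f"{second} now depends on {first}"

    def map_task(self, task, resource):
        if task not in self.tasks or resource not in self.resources:
            return "error: unknown task or resource"
        self.mapping[task] = resource
        return f"mapped {task} to {resource}"

    def list_tasks(self):
        return "\n".join(sorted(self.tasks))

    def execute(self, line):
        """Parse one command line and run the matching handler."""
        words = shlex.split(line)
        if not words:
            return ""
        handler = self.commands.get(words[0])
        if handler is None:
            return f"error: unknown command {words[0]}"
        try:
            return handler(*words[1:])
        except TypeError:
            return f"error: wrong number of arguments for {words[0]}"


console = ConsoleSketch()
for line in ["task blast", "task report", "res cluster-a",
             "dep blast report", "map blast cluster-a"]:
    print(console.execute(line))
```

Because every command passes through `execute`, the same entry point serves both interactive use and scripts that feed lines to the console.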
Workflow Engine
In this proposal, we will not go into great detail about our workflow engine. Rather, we simply state that we will use the Karajan workflow engine to execute the workflow created by Cyberaide Project. We will support the functionality to convert our workflow from one or more project file formats to the XML-based Karajan language. Karajan will then orchestrate the execution of our workflow while handling issues of fault tolerance, data transfer, security and more. It is important to note that, as we stated earlier, each task or activity in our workflow translates into a program execution involving data flow among resources (see footnote 3). Therefore, we require that in our graphical user interface, these job execution parameters are specified as attributes of a task. They can also be specified as options on the command line.

Footnote 3. This assumption may not always apply.
Proposal
In this section we propose several extensions that will enhance Cyberaide Project [24]. We will begin by integrating new CoG Kit components for authentication, reliable file transfer, and job execution on the Grid. In turn, we must also verify that we can monitor the status of jobs running on the TeraGrid and display that status in the Microsoft Project user interface. We also expect to create workflows that accurately model not only the duration of a running job, but also the durations of file transfers and queue wait times.
1. Cyberaide Project does not model sub-workflows. We will support the ability to do this.
2. Cyberaide Project is not integrated with a Grid computing toolkit. It will be integrated with the Java CoG Kit. It will then be tested to see if we can automate file transfers, authentication, and job execution on the TeraGrid. As an alternative, the Cyberaide Mediator [25] may be used instead of directly accessing the Java CoG Kit. This project is currently being maintained by Fugang Wang and may need to be extended. Any contributions to this work will be clearly documented.
3. Cyberaide Project has not been tested with a real e-Science workflow. A real e-Science workflow will be created using Cyberaide Project.
4. Cyberaide Project does not use the Network Weather Service to accurately model a workflow. We will use the Network Weather Service to determine the wall-clock time of a workflow.
5. Cyberaide Project does not execute using Karajan. We will provide a user option to compile a workflow into the Karajan language.
6. Performance measurements will then be taken to determine the quality of Cyberaide Project.
Deliverables
The deliverables will include a report, full code, and installation instructions.
Bibliography
[1] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, “Basic local alignment search tool,” J. Mol. Biol., vol. 215, no. 3, pp. 403–410, 1990.
[2] P. Beckman, “Building the TeraGrid,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 363, no. 1833, pp. 1715–1728, 2005.
[3] C. Catlett, “The TeraGrid: A primer,” TeraGrid Project, USA, 2002.
[4] I. Taylor, E. Deelman, D. Gannon, and M. Shields, Eds., Workflows for e-Science: Scientific Workflows for Grids. Springer, 2007, ISBN: 978-1-84628-519-6.
[5] I. Taylor, E. Deelman, D. Gannon, and M. Shields, “Workflows for e-Science,” 2006.
[6] “TeraGrid [About],” Web Page, 2009. [Online]. Available: http://www.teragrid.org/about/
[7] I. Foster, “The anatomy of the grid: Enabling scalable virtual organizations,” International Journal of High Performance Computing Applications, vol. 15, no. 3, pp. 200–222, August 2001, a brief introduction to the grid. [Online]. Available: http://www.gl.iit.edu/database/frame/compendex.htm
[8] G. von Laszewski, J. Gawor, P. Lane, N. Rehn, M. Russell, and K. Jackson, “Features of the Java Commodity Grid Kit,” Concurrency and Computation: Practice and Experience, vol. 14, pp. 1045–1055, 2002. [Online]. Available: http://www.mcs.anl.gov/∼gregor/papers/vonLaszewski--cog-features.pdf
[9] R. Prodan and T. Fahringer, “Dynamic scheduling of scientific workflow applications on the grid: a case study,” in Proceedings of the 2005 ACM symposium on Applied computing. ACM New York, NY, USA, 2005, pp. 687–694.
[10] J. Yu, R. Buyya, and C. Tham, “QoS-based scheduling of workflow applications on service grids,” in Proc. of the 1st IEEE International Conference on e-Science and Grid Computing (e-Science05), Melbourne, Australia, 2005.
[11] G. von Laszewski, “A Loosely Coupled Metacomputer: Cooperating Job Submissions Across Multiple Supercomputing Sites,” Concurrency, Experience, and Practice, vol. 11, no. 5, pp. 933–948, Dec. 1999, the initial version of this paper was available in 1996. [Online]. Available: http://www.mcs.anl.gov/∼gregor/papers/vonLaszewski--CooperatingJobs.pdf
[12] R. Wolski, N. Spring, and J. Hayes, “The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing,” Journal of Future Generation Computing Systems, vol. 15, no. 5-6, pp. 757–768, 1999.
[13] J. Kelley Jr and M. Walker, “Critical-path planning and scheduling,” AIEE-IRE, pp. 160–173, 1959.
[14] I. Taylor, M. Shields, I. Wang, and A. Harrison, “Visual Grid Workflow in Triana,” Journal of Grid Computing, vol. 3, no. 3, pp. 153–169, 2005.
[15] B. Ludascher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. Lee, J. Tao, and Y. Zhao, “Scientific workflow management and the Kepler system,” Concurrency and Computation: Practice and Experience, vol. 18, no. 10, pp. 1039–1065, 2006.
[16] E. Deelman, “Pegasus: A framework for mapping complex scientific workflows onto distributed systems,” Scientific Programming, vol. 13, no. 3, pp. 219–237, 2005.
[17] G. von Laszewski and M. Hategan, “Grid Workflow - An Integrated Approach,” in Technical Report., Argonne National Laboratory, Argonne National Laboratory, 9700 S. Cass Ave., Argonne, IL 60440, 2005. [Online]. Available: http://www.mcs.anl.gov/∼gregor/papers/vonLaszewski-workflow-draft.pdf
[18] J. Frey, T. Tannenbaum, I. Foster et al., “Condor-G: A Computation Management Agent for Multi-Institutional Grids,” Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10), 2001.
[19] R. Barga, J. Jackson, N. Araujo, D. Guo, N. Gautam, and Y. Simmhan, “The Trident Scientific Workflow Workbench,” in eScience, 2008. eScience’08. IEEE Fourth International Conference on, 2008, pp. 317–318.
[20] “Active Directory Domain Services,” Web Page, 2009. [Online]. Available: http://msdn.microsoft.com/en-us/library/aa362244(VS.85).aspx
[21] “Windows Contacts,” Web Page, 2009. [Online]. Available: http://msdn.microsoft.com/en-us/library/ms735779(VS.85).aspx
[22] “Microsoft Project Server 2007: Getting Started with a New Platform for Developers.” [Online]. Available: http://msdn.microsoft.com/enus/library/bb456485.aspx#officepj2007platform EventsDataSets
[23] “Project Management Software Review 2008,” Web Page. [Online]. Available: http://project-management-software-review.toptenreviews.com/
[24] G. von Laszewski and L. E. Dilmanian, “e-Science Project and Experiment Management with Microsoft Project,” in GCE08 at SC08. Austin, TX: IEEE, Nov. 16 2008. [Online]. Available: http://code.google.com/p/cyberaide/source/browse/trunk/papers/08-project/vonLaszewski-08-project-ieee.pdf
[25] G. von Laszewski, F. Wang, A. Younge, X. He, Z. Guo, and M. Pierce, “Cyberaide javascript: A javascript commodity grid kit,” in GCE08 at SC’08. Austin, TX: IEEE, Nov. 16 2008. [Online]. Available: http://cyberaide.googlecode.com/svn/trunk/papers/08-javascript/vonLaszewski-08-javascript.pdf