Haldia Institute of Technology

Department of Computer Science and Informatics
ICARE Complex, P.O. Hatiberia, Haldia, West Bengal - 721657
http://www.hit-haldia.net
INDIA

Review of Parallel Computing
Report on Final Year Minor Project, August-November 2007

Submitted by:

Amit Kumar Saha
B.Tech (CSE, 7th Semester)
College Roll Number: 04/CSE/20
University Roll Number: 10301041022
Department of Computer Science and Informatics
http://amitsaha.in.googlepages.com

Supervisors:

Mr. Soumen Paul, Mr. Shubabrata Barman
Department of Computer Science and Informatics
Haldia Institute of Technology
Haldia, INDIA


Acknowledgements

The subject of Parallel Computing is largely confined to research labs and postgraduate classrooms. For an undergraduate student, working on a project in Parallel Computing required my supervisors to place considerable faith in my capabilities. I am therefore wholly indebted to my supervisors, Mr. S. Paul and Mr. S. Barman, for giving me the opportunity to work on this project. Open Source tools were used for the project, and the people on the mailing lists, especially Gleb Natapov and Jeff Squyres (Open MPI) and Jim (PVM), were very generous in answering all my queries, even the occasional naive one.



Contents

1 Introduction  6
  1.1 Parallelism and Computing  6
  1.2 Parallel Computing  6
  1.3 Some General Parallel Terminology  8
2 Parallel Programming Models  10
  2.1 Shared Memory  10
  2.2 Threads  11
  2.3 Message Passing  12
  2.4 Data Parallel  13
  2.5 Other Models  14
    2.5.1 Hybrid  14
    2.5.2 Single Program Multiple Data (SPMD)  14
    2.5.3 Multiple Program Multiple Data (MPMD)  15
3 Parallel Programming Techniques  15
  3.1 Common Parallel Programming Paradigms  15
    3.1.1 Crowd Computations  16
    3.1.2 Tree Computations  16
  3.2 Workload Allocation  16
    3.2.1 Data Decomposition  16
    3.2.2 Function Decomposition  18
4 Experiment Setup  18
5 Parallel Virtual Machine  19
  5.1 PVM Overview  19
  5.2 The PVM System  19
  5.3 Using PVM  23
    5.3.1 Setting up PVM  23
    5.3.2 Starting PVM  24
    5.3.3 Running PVM Programs  26
  5.4 PVM User Interface  27
    5.4.1 Message Passing in PVM  28
  5.5 Demos  32
    5.5.1 Master-Slave Demo  32
6 Message Passing Interface  37
  6.1 Introduction  37
  6.2 MPI Implementations  37
  6.3 Open MPI  37
    6.3.1 Using Open MPI  38
    6.3.2 MPI User Interface  40
  6.4 Demos  42
    6.4.1 Hello World of MPI  42
  6.5 Parallel Search using MPI  43
    6.5.1 Problem Description  43
    6.5.2 Algorithm  43
    6.5.3 Code Listing  44
    6.5.4 Results  47
    6.5.5 Comparison with Sequential Search  47
  6.6 Parallel Search using MPI - Modification  48
    6.6.1 Problem Statement  48
    6.6.2 Code Listing  48
    6.6.3 Results  54
7 Parallel Image Processing  55
  7.1 Parallel Processing for Computer Vision and Image Understanding  55
  7.2 Parallel Implementation for Image Rotation Using Parallel Virtual Machine (PVM)  56
  7.3 Image Filtering on .NET-based Desktop Grids  56
  7.4 Towards Efficient Parallel Image Processing on Cluster Grids using GIMP  56
  7.5 A user friendly framework for Parallel Image Processing  56
8 Conclusions  57
9 Future Work  57
10 References  58

Abstract

A wide variety of choices are currently available for parallel programming. Among these, the Message Passing Interface (MPI) and the Parallel Virtual Machine (PVM) are popular message passing systems which provide user-level libraries to enable parallel computation. The goal of this project is to gain practical experience of Parallel Computing using the above-mentioned systems and, in the process, to set up a parallel computing platform that can be used for implementing parallel algorithms.


1 Introduction

1.1 Parallelism and Computing

A parallel computer is a set of processors that are able to work cooperatively to solve a computational problem. This definition is broad enough to include parallel supercomputers that have hundreds or thousands of processors, networks of workstations, multiple-processor workstations, and embedded systems. Parallel computers are interesting because they offer the potential to concentrate computational resources—whether processors, memory, or I/O bandwidth—on important computational problems. Parallelism has sometimes been viewed as a rare and exotic subarea of computing, interesting but of little relevance to the average programmer. A study of trends in applications, computer architecture, and networking shows that this view is no longer tenable. Parallelism is becoming ubiquitous, and parallel programming is becoming central to the programming enterprise.

1.2 Parallel Computing

Traditionally, software has been written for serial computation:
• To be run on a single computer having a single Central Processing Unit (CPU);
• A problem is broken into a discrete series of instructions;
• Instructions are executed one after another;
• Only one instruction may execute at any moment in time.

In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
• To be run using multiple CPUs;
• A problem is broken into discrete parts that can be solved concurrently;
• Each part is further broken down into a series of instructions;
• Instructions from each part execute simultaneously on different CPUs.

The compute resources can include:
• A single computer with multiple processors;
• An arbitrary number of computers connected by a network;
• A combination of both.

The computational problem usually demonstrates characteristics such as the ability to be:
• Broken apart into discrete pieces of work that can be solved simultaneously;


• Executed as multiple program instructions at any moment in time;
• Solved in less time with multiple compute resources than with a single compute resource.

Parallel computing is an evolution of serial computing that attempts to emulate what has always been the state of affairs in the natural world: many complex, interrelated events happening at the same time, yet within a sequence. Some examples:
• Planetary and galactic orbits
• Weather and ocean patterns
• Tectonic plate drift
• Rush hour traffic in LA
• Automobile assembly lines
• Daily operations within a business
• Building a shopping mall
• Ordering a hamburger at the drive-through.

Traditionally, parallel computing has been considered to be "the high end of computing" and has been motivated by numerical simulations of complex systems and "Grand Challenge Problems" such as:
• weather and climate
• chemical and nuclear reactions
• biological processes, the human genome
• geological and seismic activity
• mechanical devices - from prosthetics to spacecraft
• electronic circuits
• manufacturing processes

Today, commercial applications are providing an equal or greater driving force in the development of faster computers. These applications require the processing of large amounts of data in sophisticated ways. Example applications include:
• parallel databases, data mining
• oil exploration
• web search engines, web-based business services
• computer-aided diagnosis in medicine


• management of national and multi-national corporations
• advanced graphics and virtual reality, particularly in the entertainment industry
• networked video and multi-media technologies
• collaborative work environments

Ultimately, parallel computing is an attempt to maximize the infinite but seemingly scarce commodity called time.

1.3 Some General Parallel Terminology

Like everything else, parallel computing has its own "jargon". Some of the more commonly used terms associated with parallel computing are listed below. Most of these will be discussed in more detail later.
• Task: A logically discrete section of computational work. A task is typically a program or program-like set of instructions that is executed by a processor.
• Parallel Task: A task that can be executed by multiple processors safely (i.e., it yields correct results).
• Serial Execution: Execution of a program sequentially, one statement at a time. In the simplest sense, this is what happens on a one-processor machine. However, virtually all parallel tasks will have sections of a parallel program that must be executed serially.
• Parallel Execution: Execution of a program by more than one task, with each task being able to execute the same or a different statement at the same moment in time.
• Shared Memory: From a strictly hardware point of view, describes a computer architecture where all processors have direct (usually bus-based) access to common physical memory. In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same logical memory locations regardless of where the physical memory actually exists.
• Distributed Memory: In hardware, refers to network-based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.
• Communications: Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications regardless of the method employed.


• Synchronization: The coordination of parallel tasks in real time, very often associated with communications. Often implemented by establishing a synchronization point within an application where a task may not proceed further until another task (or tasks) reaches the same or a logically equivalent point. Synchronization usually involves waiting by at least one task, and can therefore cause a parallel application's wall clock execution time to increase.
• Granularity: In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
  - Coarse: relatively large amounts of computational work are done between communication events.
  - Fine: relatively small amounts of computational work are done between communication events.
• Observed Speedup: Observed speedup of a code which has been parallelized, defined as
      speedup = (wall-clock time of serial execution) / (wall-clock time of parallel execution).
  This is one of the simplest and most widely used indicators of a parallel program's performance.
• Parallel Overhead: The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel overhead can include factors such as:
  - Task start-up time
  - Synchronizations
  - Data communications
  - Software overhead imposed by parallel compilers, libraries, tools, the operating system, etc.
  - Task termination time
• Massively Parallel: Refers to the hardware that comprises a given parallel system - having many processors. The meaning of "many" keeps increasing, but currently BG/L pushes this number to 6 digits.
• Scalability: Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more processors. Factors that contribute to scalability include:
  - Hardware - particularly memory-CPU bandwidths and network communications
  - The application algorithm
  - Parallel overhead
  - Characteristics of your specific application and coding


2 Parallel Programming Models

There are several parallel programming models in common use:
• Shared Memory
• Threads
• Message Passing
• Data Parallel
• Hybrid

Parallel programming models exist as an abstraction above hardware and memory architectures. Although it might not seem apparent, these models are NOT specific to a particular type of machine or memory architecture. In fact, any of these models can (theoretically) be implemented on any underlying hardware. Two examples:
1. Shared memory model on a distributed memory machine: the Kendall Square Research (KSR) ALLCACHE approach. Machine memory was physically distributed, but appeared to the user as a single shared memory (global address space). Generically, this approach is referred to as "virtual shared memory". Note: although KSR is no longer in business, there is no reason to suggest that a similar implementation will not be made available by another vendor in the future.
2. Message passing model on a shared memory machine: MPI on the SGI Origin. The SGI Origin employed the CC-NUMA type of shared memory architecture, where every task has direct access to global memory. However, the ability to send and receive messages with MPI, as is commonly done over a network of distributed memory machines, is not only implemented but is very commonly used.

Which model to use is often a combination of what is available and personal choice. There is no "best" model, although there certainly are better implementations of some models over others. The following sections describe each of the models mentioned above, and also discuss some of their actual implementations.

2.1 Shared Memory

• In the shared-memory programming model, tasks share a common address space, which they read and write asynchronously.
• Various mechanisms such as locks and semaphores may be used to control access to the shared memory.
• An advantage of this model from the programmer's point of view is that the notion of data "ownership" is lacking, so there is no need to specify explicitly the communication of data between tasks. Program development can often be simplified.

• An important disadvantage in terms of performance is that it becomes more difficult to understand and manage data locality.

Implementations:
• On shared memory platforms, the native compilers translate user program variables into actual memory addresses, which are global.
• No common distributed memory platform implementations currently exist. However, as mentioned previously in the Overview section, the KSR ALLCACHE approach provided a shared memory view of data even though the physical memory of the machine was distributed.

2.2 Threads

In the threads model of parallel programming, a single process can have multiple, concurrent execution paths. Perhaps the simplest analogy that can be used to describe threads is the concept of a single program that includes a number of subroutines. Threads are commonly associated with shared memory architectures and operating systems.

Implementations:
• From a programming perspective, threads implementations commonly comprise:
  1. A library of subroutines that are called from within parallel source code
  2. A set of compiler directives embedded in either serial or parallel source code
  In both cases, the programmer is responsible for determining all parallelism.
• Threaded implementations are not new in computing. Historically, hardware vendors have implemented their own proprietary versions of threads. These implementations differed substantially from each other, making it difficult for programmers to develop portable threaded applications.
• Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP.
• POSIX Threads
  1. Library based; requires parallel coding
  2. Specified by the IEEE POSIX 1003.1c standard (1995)
  3. C language only
  4. Commonly referred to as Pthreads
  5. Most hardware vendors now offer Pthreads in addition to their proprietary threads implementations.


  6. Very explicit parallelism; requires significant programmer attention to detail
• OpenMP
  1. Compiler directive based; can use serial code
  2. Jointly defined and endorsed by a group of major computer hardware and software vendors. The OpenMP Fortran API was released on October 28, 1997. The C/C++ API was released in late 1998.
  3. Portable / multi-platform, including Unix and Windows NT platforms
  4. Available in C/C++ and Fortran implementations
  5. Can be very easy and simple to use - provides for "incremental parallelism"
• Microsoft has its own implementation for threads, which is not related to the UNIX POSIX standard or to OpenMP.
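To make the library-based Pthreads approach concrete, the following minimal sketch (written for this review, not taken from the project code; the thread count and array size are arbitrary) creates a few worker threads that each fill a private slice of a shared array and then waits for them to finish. It can be compiled with cc -pthread.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];

/* Each worker fills its own contiguous slice of the shared array. */
static void *worker(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + (N / NTHREADS);

    for (long i = lo; i < hi; i++)
        data[i] = i * 0.5;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    printf("data[N-1] = %f\n", data[N - 1]);
    return 0;
}

Because every thread shares the process address space, no explicit communication is needed; the programmer is still responsible for ensuring that threads do not write to overlapping locations.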

2.3 Message Passing

The message passing model demonstrates the following characteristics:
• A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines.
• Tasks exchange data through communications by sending and receiving messages.
• Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation (see the sketch at the end of this section).

Implementations:
• From a programming perspective, message passing implementations commonly comprise a library of subroutines that are embedded in source code. The programmer is responsible for determining all parallelism.
• Historically, a variety of message passing libraries have been available since the 1980s. These implementations differed substantially from each other, making it difficult for programmers to develop portable applications.
• In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations.
• Part 1 of the Message Passing Interface (MPI) was released in 1994. Part 2 (MPI-2) was released in 1996. Both MPI specifications are available on the web at www.mcs.anl.gov/Projects/mpi/standard.html.


• MPI is now the "de facto" industry standard for message passing, replacing virtually all other message passing implementations used for production work. Most, if not all, of the popular parallel computing platforms offer at least one implementation of MPI. A few offer a full implementation of MPI-2.
• For shared memory architectures, MPI implementations usually don't use a network for task communications. Instead, they use shared memory (memory copies) for performance reasons.
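As a concrete illustration of the matching send/receive requirement, here is a minimal MPI fragment (written for this review, not one of the project's listings; it assumes at least two processes are launched, e.g. with mpirun -np 2): process 0 sends one integer to process 1, which posts the matching receive.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* The send must be matched by a receive on the destination task. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}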

2.4 Data Parallel

The data parallel model demonstrates the following characteristics:
• Most of the parallel work focuses on performing operations on a data set. The data set is typically organized into a common structure, such as an array or a cube.
• A set of tasks works collectively on the same data structure; however, each task works on a different partition of the same data structure.
• Tasks perform the same operation on their partition of work, for example, "add 4 to every array element".
• On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures the data structure is split up and resides as "chunks" in the local memory of each task.

Implementations:
• Programming with the data parallel model is usually accomplished by writing a program with data parallel constructs. The constructs can be calls to a data parallel subroutine library or compiler directives recognized by a data parallel compiler.
• Fortran 90 and 95 (F90, F95): ISO/ANSI standard extensions to Fortran 77.
  - Contains everything that is in Fortran 77
  - New source code format; additions to the character set
  - Additions to program structure and commands
  - Variable additions - methods and arguments
  - Pointers and dynamic memory allocation added
  - Array processing (arrays treated as objects) added
  - Recursive and new intrinsic functions added
  - Many other new features
  Implementations are available for most common parallel platforms.


• High Performance Fortran (HPF): Extensions to Fortran 90 to support data parallel programming.
  - Contains everything in Fortran 90
  - Directives to tell the compiler how to distribute data added
  - Assertions that can improve optimization of generated code added
  - Data parallel constructs added (now part of Fortran 95)
  Implementations are available for most common parallel platforms.
• Compiler Directives: Allow the programmer to specify the distribution and alignment of data. Fortran implementations are available for most common parallel platforms.
• Distributed memory implementations of this model usually have the compiler convert the program into standard code with calls to a message passing library (usually MPI) to distribute the data to all the processes. All message passing is done invisibly to the programmer.
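The implementations listed above are Fortran-centric. Purely as an illustration for this review (not something used in the project), the section's own "add 4 to every array element" operation can also be written in C with an OpenMP directive, letting the compiler partition the loop iterations among the available threads:

#include <stdio.h>

#define N 1000

int main(void)
{
    int a[N];

    for (int i = 0; i < N; i++)
        a[i] = i;

    /* Each thread works on its own partition of the iteration space. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] += 4;

    printf("a[0]=%d a[N-1]=%d\n", a[0], a[N - 1]);
    return 0;
}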

2.5 Other Models

Other parallel programming models besides those previously mentioned certainly exist, and will continue to evolve along with the ever changing world of computer hardware and software. Only three of the more common ones are mentioned here.

2.5.1 Hybrid

• In this model, any two or more parallel programming models are combined. Currently, a common example of a hybrid model is the combination of the message passing model (MPI) with either the threads model (POSIX threads) or the shared memory model (OpenMP). This hybrid model lends itself well to the increasingly common hardware environment of networked SMP machines. Another common example of a hybrid model is combining data parallel with message passing. As mentioned in the data parallel model section previously, data parallel implementations (F90, HPF) on distributed memory architectures actually use message passing to transmit data between tasks, transparently to the programmer.

2.5.2 Single Program Multiple Data (SPMD)

• SPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming models.
• A single program is executed by all tasks simultaneously. At any moment in time, tasks can be executing the same or different instructions within the same program. SPMD programs usually have the necessary logic programmed into them to allow different tasks to branch or conditionally execute only those parts of the program they are designed to execute. That is, tasks do not necessarily have to execute the entire program - perhaps only a portion of it.
• All tasks may use different data.

2.5.3 Multiple Program Multiple Data (MPMD)

• Like SPMD, MPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming models.
• MPMD applications typically have multiple executable object files (programs). While the application is being run in parallel, each task can be executing the same or a different program as the other tasks.
• All tasks may use different data.
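A minimal SPMD-style sketch, shown here for illustration only and assuming an MPI environment such as the Open MPI installation described later: every task runs the same program and branches on its own rank to decide which part of the work to perform.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Only the first task executes the coordination branch. */
        printf("master: %d tasks are running this same program\n", size);
    } else {
        /* All other tasks execute the worker branch of the same binary. */
        printf("worker %d: doing my share of the computation\n", rank);
    }

    MPI_Finalize();
    return 0;
}

An MPMD application, by contrast, would launch several different executables, each of which could contain only the branch relevant to its role.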

3 Parallel Programming Techniques

3.1 Common Parallel Programming Paradigms

Parallel computing using a system such as PVM may be approached from three fundamental viewpoints, based on the organization of the computing tasks. Within each, different workload allocation strategies are possible and will be discussed later in this chapter. The first and most common model for PVM applications can be termed "crowd" computing: a collection of closely related processes, typically executing the same code, perform computations on different portions of the workload, usually involving the periodic exchange of intermediate results. This paradigm can be further subdivided into two categories:
• The master-slave (or host-node) model, in which a separate "control" program termed the master is responsible for process spawning, initialization, collection and display of results, and perhaps timing of functions. The slave programs perform the actual computation involved; they either are allocated their workloads by the master (statically or dynamically) or perform the allocations themselves.
• The node-only model, where multiple instances of a single program execute, with one process (typically the one initiated manually) taking over the non-computational responsibilities in addition to contributing to the computation itself.

The second model supported by PVM is termed a "tree" computation. In this scenario, processes are spawned (usually dynamically as the computation progresses) in a tree-like manner, thereby establishing a tree-like, parent-child relationship (as opposed to crowd computations, where a star-like relationship exists). This paradigm, although less commonly used, is an extremely natural fit to applications where the total workload is not known a priori, for example, in branch-and-bound algorithms, alpha-beta search, and recursive "divide-and-conquer" algorithms.


The third model, which we term "hybrid", can be thought of as a combination of the tree model and the crowd model. Essentially, this paradigm possesses an arbitrary spawning structure: that is, at any point during application execution, the process relationship structure may resemble an arbitrary and changing graph. We note that these three classifications are made on the basis of process relationships, though they frequently also correspond to communication topologies. Nevertheless, in all three, it is possible for any process to interact and synchronize with any other. Further, as may be expected, the choice of model is application dependent and should be selected to best match the natural structure of the parallelized program.

3.1.1 Crowd Computations

Crowd computations typically involve three phases. The first is the initialization of the process group; in the case of node-only computations, dissemination of group information and problem parameters, as well as workload allocation, is typically done within this phase. The second phase is computation. The third phase is the collection of results and display of output; during this phase, the process group is disbanded or terminated.

3.1.2 Tree Computations

As mentioned earlier, tree computations typically exhibit a tree-like process control structure which also conforms to the communication pattern in many instances. To illustrate this model, we consider a parallel sorting algorithm that works as follows. One process (the manually started process in PVM) possesses (inputs or generates) the list to be sorted. It then spawns a second process and sends it half the list. At this point, there are two processes each of which spawns a process and sends them one-half of their already halved lists. This continues until a tree of appropriate depth is constructed. Each process then independently sorts its portion of the list, and a merge phase follows where sorted sublists are transmitted upwards along the tree edges, with intermediate merges being done at each node.
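The following sketch shows one way such a tree computation could be organized with PVM calls. It is illustrative only and is not part of the project's code: the executable name treesort, the list size, and the one-child-per-task splitting are assumptions made for brevity, so the resulting process structure is a chain of halved lists rather than a full binary tree.

/* treesort.c - illustrative sketch: each task keeps half of its list,
 * spawns one child for the other half, then merges the sorted halves
 * and returns the result to its parent. */
#include <stdio.h>
#include <stdlib.h>
#include "pvm3.h"

#define TAG_DATA   1
#define TAG_RESULT 2

static int cmp(const void *a, const void *b)
{
    return (*(const int *)a - *(const int *)b);
}

/* Merge two sorted arrays a[n] and b[m] into out[n+m]. */
static void merge(int *a, int n, int *b, int m, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < n && j < m)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < n) out[k++] = a[i++];
    while (j < m) out[k++] = b[j++];
}

int main(void)
{
    int parent = pvm_parent();
    int n, *list;

    if (parent == PvmNoParent) {
        /* Root task: generate the list to be sorted. */
        n = 16;
        list = malloc(n * sizeof(int));
        for (int i = 0; i < n; i++)
            list[i] = rand() % 100;
    } else {
        /* Child task: receive its share of the list from the parent. */
        pvm_recv(parent, TAG_DATA);
        pvm_upkint(&n, 1, 1);
        list = malloc(n * sizeof(int));
        pvm_upkint(list, n, 1);
    }

    int half = n / 2, child = 0, spawned = 0;
    if (half > 0) {
        /* Spawn one child and hand it the second half of the list. */
        spawned = pvm_spawn("treesort", (char **)0, PvmTaskDefault, "", 1, &child);
        if (spawned == 1) {
            int rest = n - half;
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&rest, 1, 1);
            pvm_pkint(list + half, rest, 1);
            pvm_send(child, TAG_DATA);
        }
    }

    int mine = (spawned == 1) ? half : n;
    qsort(list, mine, sizeof(int), cmp);

    int *sorted = list;
    if (spawned == 1) {
        /* Receive the child's sorted portion and merge it with our own. */
        int m;
        pvm_recv(child, TAG_RESULT);
        pvm_upkint(&m, 1, 1);
        int *theirs = malloc(m * sizeof(int));
        pvm_upkint(theirs, m, 1);
        sorted = malloc(n * sizeof(int));
        merge(list, mine, theirs, m, sorted);
    }

    if (parent == PvmNoParent) {
        for (int i = 0; i < n; i++)
            printf("%d ", sorted[i]);
        printf("\n");
    } else {
        /* Send the merged, sorted list back up the tree. */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&n, 1, 1);
        pvm_pkint(sorted, n, 1);
        pvm_send(parent, TAG_RESULT);
    }

    pvm_exit();
    return 0;
}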

3.2 Workload Allocation

Two general methodologies are commonly used. The first, termed data decomposition or partitioning, assumes that the overall problem involves applying computational operations or transformations on one or more data structures and, further, that these data structures may be divided and operated upon. The second, called function decomposition, divides the work based on different operations or functions. In a sense, the PVM computing model supports both function decomposition (fundamentally different tasks perform different operations) and data decomposition (identical tasks operate on different portions of the data).

3.2.1 Data Decomposition

As a simple example of data decomposition, consider the addition of two vectors, A[1..N] and B[1..N], to produce the result vector, C[1..N].

If we assume that P processes are working on this problem, data partitioning involves the allocation of N/P elements of each vector to each process, which computes the corresponding N/P elements of the resulting vector. This data partitioning may be done either "statically", where each process knows a priori (at least in terms of the variables N and P) its share of the workload, or "dynamically", where a control process (e.g., the master process) allocates subunits of the workload to processes as and when they become free. The principal difference between these two approaches is "scheduling". With static scheduling, individual process workloads are fixed; with dynamic scheduling, they vary as the computation progresses. In most multiprocessor environments, static scheduling is effective for problems such as the vector addition example; however, in the general PVM environment, static scheduling is not necessarily beneficial. The reason is that PVM environments based on networked clusters are susceptible to external influences; therefore, a statically scheduled, data-partitioned problem might encounter one or more processes that complete their portion of the workload much faster or much slower than the others. This situation could also arise when the machines in a PVM system are heterogeneous, possessing varying CPU speeds and different memory and other system attributes.

In a real execution of even this trivial vector addition problem, an issue that cannot be ignored is input and output. In other words, how do the processes described above receive their workloads, and what do they do with the result vectors? The answer to these questions depends on the application and the circumstances of a particular run, but in general:
1. Individual processes generate their own data internally, for example, using random numbers or statically known values. This is possible only in very special situations or for program testing purposes.
2. Individual processes independently input their data subsets from external devices. This method is meaningful in many cases, but possible only when parallel I/O facilities are supported.
3. A controlling process sends individual data subsets to each process. This is the most common scenario, especially when parallel I/O facilities do not exist. Further, this method is also appropriate when input data subsets are derived from a previous computation within the same application.

The third method of allocating individual workloads is also consistent with dynamic scheduling in applications where interprocess interactions during computations are rare or nonexistent. However, nontrivial algorithms generally require intermediate exchanges of data values, and therefore only the initial assignment of data partitions can be accomplished by these schemes. For example, consider the following data partitioning method for multiplying two matrices A and B. A group of processes is first spawned, using the master-slave or node-only paradigm. This set of processes is considered to form a mesh; the matrices to be multiplied are divided into subblocks, also forming a mesh. Each subblock of the A and B matrices is placed on the corresponding process, by utilizing one of the data decomposition and workload allocation strategies listed above. During computation, subblocks need to be forwarded or exchanged between processes, thereby transforming the original allocation map.

At the end of the computation, however, result matrix subblocks are situated on the individual processes, in conformance with their respective positions on the process grid, and consistent with a data partitioned map of the resulting matrix C. The foregoing discussion illustrates the basics of data decomposition. In a later chapter, example programs highlighting details of this approach will be presented.
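To make the N/P arithmetic concrete, here is a small sequential sketch (written for this review; in a real PVM or MPI program each task would execute only its own range) of the static block partitioning used for the vector addition example, with the remainder elements handed to the last task:

#include <stdio.h>

#define N 10

/* Compute the half-open index range [lo, hi) owned by task 'id' of 'p' tasks. */
static void block_range(int id, int p, int n, int *lo, int *hi)
{
    int chunk = n / p;
    *lo = id * chunk;
    *hi = (id == p - 1) ? n : *lo + chunk;
}

int main(void)
{
    double a[N], b[N], c[N];
    int p = 4;   /* number of cooperating tasks (assumed for the example) */

    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2 * i;
    }

    /* The per-task ranges are shown sequentially here for clarity only. */
    for (int id = 0; id < p; id++) {
        int lo, hi;
        block_range(id, p, N, &lo, &hi);
        for (int i = lo; i < hi; i++)
            c[i] = a[i] + b[i];
        printf("task %d handled elements [%d, %d)\n", id, lo, hi);
    }

    return 0;
}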

3.2.2 Function Decomposition

Parallelism in distributed-memory environments such as PVM may also be achieved by partitioning the overall workload in terms of different operations. The most obvious example of this form of decomposition is with respect to the three stages of typical program execution, namely, input, processing, and output of results. In function decomposition, such an application may consist of three separate and distinct programs, each one dedicated to one of the three phases. Parallelism is obtained by concurrently executing the three programs and by establishing a "pipeline" (continuous or quantized) between them. Note, however, that in such a scenario, data parallelism may also exist within each phase. The term is generally used to signify partitioning and workload allocation by function within the computational phase. Typically, application computations contain several different subalgorithms - sometimes on the same data (the MPSD or multiple-program single-data scenario), sometimes in a pipelined sequence of transformations, and sometimes exhibiting an unstructured pattern of exchanges.
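A minimal sketch of such an input-processing-output pipeline, expressed here with MPI ranks purely for illustration (the report itself does not implement this example; it assumes the program is started with at least three processes): rank 0 produces values, rank 1 transforms them, and rank 2 consumes the results.

#include <mpi.h>
#include <stdio.h>

/* Illustrative function decomposition: one pipeline stage per task. */
int main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < 5; i++) {
        if (rank == 0) {               /* input stage */
            value = i;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {        /* processing stage */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            value = value * value;
            MPI_Send(&value, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
        } else if (rank == 2) {        /* output stage */
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("result %d: %d\n", i, value);
        }
    }

    MPI_Finalize();
    return 0;
}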

4 Experiment Setup

For the purpose of this project the following computers have been used. Their hardware and software configurations are as follows:
1. Host - ubuntu-desktop-1
   Hardware Configuration - Pentium 4 Core 2 Duo, 512 MB RAM
   Operating System - Ubuntu Linux 7.04, kernel 2.6.20-15-generic
   Parallel Software Configuration - Open MPI, PVM
2. Host - debian-desktop-1
   Hardware Configuration - Pentium 4 Core 2 Duo, 512 MB RAM
   Operating System - Debian Linux 4, kernel 2.6.18
   Parallel Software Configuration - Open MPI, PVM
3. Host - ubuntu-desktop-2
   Hardware Configuration - Pentium 4 Core 2 Duo, 512 MB RAM
   Operating System - Ubuntu Linux 7.04, kernel 2.6.20-15-generic
   Parallel Software Configuration - Open MPI, PVM
4. Host - ubuntu-desktop-3
   Hardware Configuration - Pentium 4 Core 2 Duo, 512 MB RAM
   Operating System - Ubuntu Linux 7.04, kernel 2.6.20-15-generic
   Parallel Software Configuration - Open MPI, PVM

5 Parallel Virtual Machine

5.1 PVM Overview

The PVM software provides a unified framework within which parallel programs can be developed in an efficient and straightforward manner using existing hardware. PVM enables a collection of heterogeneous computer systems to be viewed as a single parallel virtual machine. PVM transparently handles all message routing, data conversion, and task scheduling across a network of incompatible computer architectures. The PVM computing model is simple yet very general, and accommodates a wide variety of application program structures. The programming interface is deliberately straightforward, thus permitting simple program structures to be implemented in an intuitive manner. The user writes his application as a collection of cooperating tasks. Tasks access PVM resources through a library of standard interface routines. These routines allow the initiation and termination of tasks across the network as well as communication and synchronization between tasks. The PVM message-passing primitives are oriented towards heterogeneous operation, involving strongly typed constructs for buffering and transmission. Communication constructs include those for sending and receiving data structures as well as high-level primitives such as broadcast, barrier synchronization, and global sum. PVM tasks may possess arbitrary control and dependency structures. In other words, at any point in the execution of a concurrent application, any task in existence may start or stop other tasks or add or delete computers from the virtual machine. Any process may communicate and/or synchronize with any other. Any specific control and dependency structure may be implemented under the PVM system by appropriate use of PVM constructs and host language control-flow statements. Owing to its ubiquitous nature (specifically, the virtual machine concept) and also because of its simple but complete programming interface, the PVM system has gained widespread acceptance in the high-performance scientific computing community.

5.2 The PVM System

PVM is an integrated set of software tools and libraries that emulates a general-purpose, flexible, heterogeneous concurrent computing framework on interconnected computers of varied architecture. The overall objective of the PVM system is to enable such a collection of computers to be used cooperatively for concurrent or parallel computation. Detailed descriptions and discussions of the concepts, logistics, and methodologies involved in this network-based computing process are contained in the PVM book. Briefly, the principles upon which PVM is based include the following:
• User-configured host pool: The application's computational tasks execute on a set of machines that are selected by the user for a given run of the PVM program. Both single-CPU machines and hardware multiprocessors (including shared-memory and distributed-memory computers) may be part of the host pool.


The host pool may be altered by adding and deleting machines during operation (an important feature for fault tolerance).
• Translucent access to hardware: Application programs either may view the hardware environment as an attributeless collection of virtual processing elements or may choose to exploit the capabilities of specific machines in the host pool by positioning certain computational tasks on the most appropriate computers.
• Process-based computation: The unit of parallelism in PVM is a task (often but not always a Unix process), an independent sequential thread of control that alternates between communication and computation. No process-to-processor mapping is implied or enforced by PVM; in particular, multiple tasks may execute on a single processor.
• Explicit message-passing model: Collections of computational tasks, each performing a part of an application's workload using data, functional, or hybrid decomposition, cooperate by explicitly sending and receiving messages to one another. Message size is limited only by the amount of available memory.
• Heterogeneity support: The PVM system supports heterogeneity in terms of machines, networks, and applications. With regard to message passing, PVM permits messages containing more than one datatype to be exchanged between machines having different data representations.
• Multiprocessor support: PVM uses the native message-passing facilities on multiprocessors to take advantage of the underlying hardware. Vendors often supply their own optimized PVM for their systems, which can still communicate with the public PVM version.

The PVM system is composed of two parts. The first part is a daemon, called pvmd3 and sometimes abbreviated pvmd, that resides on all the computers making up the virtual machine. (An example of a daemon program is the mail program that runs in the background and handles all the incoming and outgoing electronic mail on a computer.) Pvmd3 is designed so that any user with a valid login can install this daemon on a machine. When a user wishes to run a PVM application, he first creates a virtual machine by starting up PVM. The PVM application can then be started from a Unix prompt on any of the hosts. Multiple users can configure overlapping virtual machines, and each user can execute several PVM applications simultaneously.

The second part of the system is a library of PVM interface routines. It contains a functionally complete repertoire of primitives that are needed for cooperation between tasks of an application. This library contains user-callable routines for message passing, spawning processes, coordinating tasks, and modifying the virtual machine.

The PVM computing model is based on the notion that an application consists of several tasks. Each task is responsible for a part of the application's computational workload. Sometimes an application is parallelized along its functions; that is, each task performs a different function, for example, input, problem setup, solution, output, and display. This process is often called functional parallelism.

A more common method of parallelizing an application is called data parallelism. In this method all the tasks are the same, but each one only knows and solves a small part of the data. This is also referred to as the SPMD (single-program multiple-data) model of computing. PVM supports either method, or a mixture of the two. Depending on their functions, tasks may execute in parallel and may need to synchronize or exchange data, although this is not always the case.

The PVM system currently supports the C, C++, and Fortran languages. This set of language interfaces has been included based on the observation that the predominant majority of target applications are written in C and Fortran, with an emerging trend of experimenting with object-based languages and methodologies.

The C and C++ language bindings for the PVM user interface library are implemented as functions, following the general conventions used by most C systems, including Unix-like operating systems. To elaborate, function arguments are a combination of value parameters and pointers as appropriate, and function result values indicate the outcome of the call. In addition, macro definitions are used for system constants, and global variables such as errno and pvm_errno are the mechanism for discriminating between multiple possible outcomes. Application programs written in C and C++ access PVM library functions by linking against an archival library (libpvm3.a) that is part of the standard distribution.

Fortran language bindings are implemented as subroutines rather than as functions. This approach was taken because some compilers on the supported architectures would not reliably interface Fortran functions with C functions. One immediate implication of this is that an additional argument is introduced into each PVM library call for status results to be returned to the invoking program. Also, library routines for the placement and retrieval of typed data in message buffers are unified, with an additional parameter indicating the datatype. Apart from these differences (and the standard naming prefixes - pvm_ for C, and pvmf for Fortran), a one-to-one correspondence exists between the two language bindings. Fortran interfaces to PVM are implemented as library stubs that in turn invoke the corresponding C routines, after casting and/or dereferencing arguments as appropriate. Thus, Fortran applications are required to link against the stubs library (libfpvm3.a) as well as the C library.

All PVM tasks are identified by an integer task identifier (TID). Messages are sent to and received from TIDs. Since TIDs must be unique across the entire virtual machine, they are supplied by the local pvmd and are not user chosen. Although PVM encodes information into each TID, the user is expected to treat TIDs as opaque integer identifiers. PVM contains several routines that return TID values so that the user application can identify other tasks in the system.

There are applications where it is natural to think of a group of tasks, and there are cases where a user would like to identify his tasks by the numbers 0 to (p - 1), where p is the number of tasks. PVM includes the concept of user-named groups. When a task joins a group, it is assigned a unique "instance" number in that group. Instance numbers start at 0 and count up. In keeping with the PVM philosophy, the group functions are designed to be very general and transparent to the user. For example, any PVM task can join or leave any group at any time without having to inform any other task in the affected groups.
Also, groups can overlap, and tasks can broadcast messages to groups of which they are not a member.

Details of the available group functions are given in the PVM documentation. To use any of the group functions, a program must be linked with libgpvm3.a.

The general paradigm for application programming with PVM is as follows. A user writes one or more sequential programs in C, C++, or Fortran 77 that contain embedded calls to the PVM library. Each program corresponds to a task making up the application. These programs are compiled for each architecture in the host pool, and the resulting object files are placed at a location accessible from machines in the host pool. To execute an application, a user typically starts one copy of one task (usually the "master" or "initiating" task) by hand from a machine within the host pool. This process subsequently starts other PVM tasks, eventually resulting in a collection of active tasks that then compute locally and exchange messages with each other to solve the problem. Note that while the above is a typical scenario, as many tasks as appropriate may be started manually. As mentioned earlier, tasks interact through explicit message passing, identifying each other with a system-assigned, opaque TID.

#include "pvm3.h"

main()
{
    int cc, tid, msgtag;
    char buf[100];

    printf("i'm t%x\n", pvm_mytid());

    cc = pvm_spawn("hello_other", (char**)0, 0, "", 1, &tid);

    if (cc == 1) {
        msgtag = 1;
        pvm_recv(tid, msgtag);
        pvm_upkstr(buf);
        printf("from t%x: %s\n", tid, buf);
    } else
        printf("can't start hello_other\n");

    pvm_exit();
}

Listing 1 - PVM program hello.c

Shown in Listing 1 is the body of the PVM program hello, a simple example that illustrates the basic concepts of PVM programming. This program is intended to be invoked manually; after printing its task id (obtained with pvm_mytid()), it initiates a copy of another program called hello_other using the pvm_spawn() function. A successful spawn causes the program to execute a blocking receive using pvm_recv(). After receiving the message, the program prints the message sent by its counterpart, as well as its task id; the buffer is extracted from the message using pvm_upkstr(). The final pvm_exit() call dissociates the program from the PVM system.

#include "pvm3.h"

main()
{
    int ptid, msgtag;
    char buf[100];

    ptid = pvm_parent();

    strcpy(buf, "hello, world from ");
    gethostname(buf + strlen(buf), 64);
    msgtag = 1;
    pvm_initsend(PvmDataDefault);
    pvm_pkstr(buf);
    pvm_send(ptid, msgtag);

    pvm_exit();
}

Listing 2 - PVM program hello_other.c

Listing 2 is the "slave" or spawned program; its first PVM action is to obtain the task id of the "master" using the pvm_parent() call. This program then obtains its hostname and transmits it to the master using the three-call sequence: pvm_initsend() to initialize the send buffer; pvm_pkstr() to place a string, in a strongly typed and architecture-independent manner, into the send buffer; and pvm_send() to transmit it to the destination process specified by ptid, "tagging" the message with the number 1.

5.3 Using PVM

5.3.1 Setting up PVM

One of the reasons for PVM's popularity is that it is simple to set up and use. PVM does not require special privileges to be installed; anyone with a valid login on the hosts can do so. In addition, only one person at an organization needs to get and install PVM for everyone at that organization to use it.

PVM uses two environment variables when starting and running, and each PVM user needs to set them. The first variable is PVM_ROOT, which is set to the location of the installed pvm3 directory. The second variable is PVM_ARCH, which tells PVM the architecture of this host and thus which executables to pick from the PVM_ROOT directory. The easiest method is to set these two variables in your .bashrc file. It is recommended that the user set PVM_ARCH by appending to .bashrc the content of the file $PVM_ROOT/lib/bashrc.stub. The stub should be placed after PATH and PVM_ROOT are defined. This stub automatically determines the PVM_ARCH for the host and is particularly useful when the user shares a common file system (such as NFS) across several different architectures. Here is a sample PVM-configured .bashrc file; only the PVM-relevant lines are shown:

#PVM Configuration

export PVM_ROOT=/usr/lib/pvm3
#
# append this file to your .bashrc to set path according to machine
# type. you may wish to use this for your own programs (edit the last
# part to point to a different directory f.e. ~/bin/_$PVM_ARCH).
#
if [ -z $PVM_ROOT ]; then
    if [ -d ~/pvm3 ]; then
        export PVM_ROOT=~/pvm3
    else
        echo "Warning - PVM_ROOT not defined"
        echo "To use PVM, define PVM_ROOT and rerun your .bashrc"
    fi
fi

if [ -n $PVM_ROOT ]; then
    export PVM_ARCH=`$PVM_ROOT/lib/pvmgetarch`
    #
    # uncomment one of the following lines if you want the PVM commands
    # directory to be added to your shell path.
    #
    # export PATH=$PATH:$PVM_ROOT/lib               # generic
    # export PATH=$PATH:$PVM_ROOT/lib/$PVM_ARCH     # arch-specific
    #
    # uncomment the following line if you want the PVM executable
    # directory to be added to your shell path.
    #
    # export PATH=$PATH:$PVM_ROOT/bin/$PVM_ARCH
fi

5.3.2 Starting PVM

Before we go over the steps to compile and run parallel PVM programs, you should be sure you can start up PVM and configure a virtual machine. On any host on which PVM has been installed you can type:

amit@ubuntu-desktop-1:~$ pvm

and you should get back a PVM console prompt, signifying that PVM is now running on this host. Now, add hosts to your virtual machine by typing at the console prompt:

pvm> add hostname

We shall now add the hosts debian-desktop-1 and ubuntu-desktop-3 to the virtual machine, which already contains the local host ubuntu-desktop-1.

pvm> add debian-desktop-1
add debian-desktop-1

The authenticity of host 'debian-desktop-1 (10.10.0.186)' can't be established.
RSA key fingerprint is cf:99:5b:e7:72:53:4a:33:02:23:8e:93:1b:79:b7:a8.
Are you sure you want to continue connecting (yes/no)? yes
amit@debian-desktop-1's password:
1 successful
                HOST     DTID
    debian-desktop-1    80000
pvm> conf
conf
2 hosts, 1 data format
                HOST     DTID     ARCH   SPEED       DSIG
    ubuntu-desktop-1    40000    LINUX    1000 0x00408841
    debian-desktop-1    80000    LINUX    1000 0x00408841
pvm> add ubuntu-desktop-3
add ubuntu-desktop-3
The authenticity of host 'ubuntu-desktop-3 (10.10.2.181)' can't be established.
RSA key fingerprint is 0d:f7:fe:a2:79:32:b4:33:4a:eb:b5:62:e3:fc:44:5e.
Are you sure you want to continue connecting (yes/no)? yes
amit@ubuntu-desktop-3's password:
1 successful
                HOST     DTID
    ubuntu-desktop-3    c0000

Now, type conf at the PVM console to view the PVM configuration:

pvm> conf
conf
3 hosts, 1 data format
                HOST     DTID     ARCH   SPEED       DSIG
    ubuntu-desktop-1    40000    LINUX    1000 0x00408841
    debian-desktop-1    80000    LINUX    1000 0x00408841
    ubuntu-desktop-3    c0000    LINUX    1000 0x00408841

Points to Note:
1. The hostnames have to be mapped to their appropriate IPv4 addresses by configuring the /etc/hosts file.
2. PVM configuration key:
   HOST : Hostname
   DTID : Global PVM task identifier
   ARCH : Host architecture
   SPEED : Relative speed of operation
   DSIG : Data signature of the host
3. The PVM configuration can also be seen using XPVM.


4. XPVM can also be used for all other PVM-related tasks - adding hosts, launching new tasks, deleting a host. More information is available at [3].

5.3.3 Running PVM Programs

To compile and build PVM programs, PVM ships with a make-like utility called aimk. Listing - 5 is a sample run of aimk, which builds the two PVM programs hello and hello_other:

amit@ubuntu-desktop-1:/usr/lib/pvm3/examples$ sudo aimk hello hello_other
making in LINUX/ for LINUX
cc -g -I/usr/lib/pvm3/include -DSYSVSIGNAL -DNOWAIT3 -DRSHCOMMAND=\"/usr/bin/rsh\" -DNEEDE
/usr/lib/pvm3/examples/hello.c: In function 'main':
/usr/lib/pvm3/examples/hello.c:55: warning: incompatible implicit declaration of built-in
mv hello /usr/lib/pvm3/bin/LINUX
cc -g -I/usr/lib/pvm3/include -DSYSVSIGNAL -DNOWAIT3 -DRSHCOMMAND=\"/usr/bin/rsh\" -DNEEDE
/usr/lib/pvm3/examples/hello_other.c: In function 'main':
/usr/lib/pvm3/examples/hello_other.c:42: warning: incompatible implicit declaration of bui
/usr/lib/pvm3/examples/hello_other.c:43: warning: incompatible implicit declaration of bui
/usr/lib/pvm3/examples/hello_other.c:50: warning: incompatible implicit declaration of bui
mv hello_other /usr/lib/pvm3/bin/LINUX

Listing - 5

Execute the program hello (Listing - 6):

amit@ubuntu-desktop-1:/usr/lib/pvm3/bin/LINUX$ ./hello
i'm t40003

from t80001: hello, world from debian-desktop-1
amit@ubuntu-desktop-1:/usr/lib/pvm3/bin/LINUX$ ./hello
i'm t40004
from tc0001: hello, world from ubuntu-desktop-3
amit@ubuntu-desktop-1:/usr/lib/pvm3/bin/LINUX$ ./hello
i'm t40005
can't start hello_other
amit@ubuntu-desktop-1:/usr/lib/pvm3/bin/LINUX$ ./hello
i'm t40007
from t80002: hello, world from debian-desktop-1

Listing - 6

5.4 PVM User Interface

In PVM 3 all PVM tasks are identified by an integer supplied by the local pvmd. In the following descriptions this task identifier is called TID. It is similar to the process ID (PID) used in the Unix system and is assumed to be opaque to the user, in that its value has no special significance to him. In fact, PVM encodes information into the TID for its own internal use.

All the PVM routines are written in C. C++ applications can link to the PVM library. Fortran applications can call these routines through a Fortran 77 interface supplied with the PVM 3 source. This interface translates arguments, which are passed by reference in Fortran, to their values if needed by the underlying C routines. The interface also takes into account Fortran character string representations and the various naming conventions that different Fortran compilers use to call C functions.

The PVM communication model assumes that any task can send a message to any other PVM task and that there is no limit to the size or number of such messages. While all hosts have physical memory limitations that limit potential buffer space, the communication model does not restrict itself to a particular machine's limitations and assumes sufficient memory is available. The PVM communication model provides asynchronous blocking send, asynchronous blocking receive, and nonblocking receive functions. In our terminology, a blocking send returns as soon as the send buffer is free for reuse, and an asynchronous send does not depend on the receiver calling a matching receive before the send can return. There are options in PVM 3 that request that data be transferred directly from task to task. In this case, if the message is large, the sender may block until the receiver has called a matching receive. A nonblocking receive immediately returns with either the data or a flag that the data has not arrived, while a blocking receive returns only when the data is in the receive buffer. In addition to these point-to-point communication functions, the model supports multicast to a set of tasks and broadcast to a user-defined group of tasks. There are also functions to perform global max, global sum, etc., across a user-defined group of tasks. Wildcards can be specified in the receive for the source and label, allowing either or both of these contexts to be ignored. A routine can be called to return information about received messages.


The PVM model guarantees that message order is preserved. If task 1 sends message A to task 2, and then task 1 sends message B to task 2, message A will arrive at task 2 before message B. Moreover, if both messages arrive before task 2 does a receive, then a wildcard receive will always return message A.

Message buffers are allocated dynamically. Therefore, the maximum message size that can be sent or received is limited only by the amount of available memory on a given host. There is only limited flow control built into PVM 3.3. PVM may give the user a "can't get memory" error when the sum of incoming messages exceeds the available memory, but PVM does not tell other tasks to stop sending to this host.

5.4.1 Message Passing in PVM

Sending a message comprises three steps in PVM. First, a send buffer must be initialized by a call to pvm_initsend() or pvm_mkbuf(). Second, the message must be "packed" into this buffer using any number and combination of pvm_pk*() routines. (In Fortran all message packing is done with the pvmfpack() subroutine.) Third, the completed message is sent to another process by calling the pvm_send() routine, or multicast with the pvm_mcast() routine.

A message is received by calling either a blocking or nonblocking receive routine and then "unpacking" each of the packed items from the receive buffer. The receive routines can be set to accept any message, or any message from a specified source, or any message with a specified message tag, or only messages with a given message tag from a given source. There is also a probe function that returns whether a message has arrived, but does not actually receive it. If required, other receive contexts can be handled by PVM 3. The routine pvm_recvf() allows users to define their own receive contexts that will be used by the subsequent PVM receive routines.

Message Buffers

int bufid = pvm_initsend( int encoding )

If the user is using only a single send buffer (and this is the typical case) then pvm_initsend() is the only required buffer routine. It is called before packing a new message into the buffer. The routine pvm_initsend() clears the send buffer and creates a new one for packing a new message. The encoding scheme used for this packing is set by encoding. The new buffer identifier is returned in bufid. The encoding options are as follows:
• PvmDataDefault - XDR encoding is used by default because PVM cannot know whether the user is going to add a heterogeneous machine before this message is sent. If the user knows that the next message will be sent only to a machine that understands the native format, then he can use PvmDataRaw encoding and save on encoding costs.
• PvmDataRaw - no encoding is done. Messages are sent in their original format. If the receiving process cannot read this format, it will return an error during unpacking.

28

is called, the items are copied directly out of the user’s memory. This option decreases the number of times the message is copied at the expense of requiring the user to not modify the items between the time they are packed and the time they are sent. One use of this option would be to call pack once and modify and send certain items (arrays) multiple times during an application. The following message buffer routines are required only if the user wishes to manage multiple message buffers inside an application. Multiple message buffers are not required for most message passing between processes. In PVM 3 there is one active send buffer and one active receive buffer per process at any given moment. The developer may create any number of message buffers and switch between them for the packing and sending of data. The packing, sending, receiving, and unpacking routines affect only the active buffers. int bufid = pvm mkbuf( int encoding ) The routine pvm mkbuf creates a new empty send buffer and specifies the encoding method used for packing messages. It returns a buffer identifier bufid. int info = pvm freebuf( int bufid ) The routine pvm freebuf() disposes of the buffer with identifier bufid. This should be done after a message has been sent and is no longer needed. Call pvm mkbuf() to create a buffer for a new message if required. Neither of these calls is required when using pvm initsend(), which performs these functions for the user. int bufid = pvm_getsbuf( void ) int bufid = pvm_getrbuf( void ) pvm getsbuf() returns the active send buffer identifier. pvm getrbuf() returns the active receive buffer identifier. int oldbuf int oldbuf

= pvm_setsbuf( int bufid ) = pvm_setrbuf( int bufid )

These routines set the active send (or receive) buffer to bufid, save the state of the previous buffer, and return the previous active buffer identifier oldbuf. If bufid is set to 0 in pvm setsbuf() or pvm setrbuf(), then the present buffer is saved and there is no active buffer. This feature can be used to save the present state of an application’s messages so that a math library or graphical interface which also uses PVM messages will not interfere with the state of the application’s buffers. After they complete, the application’s buffers can be reset to active. It is possible to forward messages without repacking them by using the message buffer routines. This is illustrated by the following fragment. bufid oldid info info

= = = =

pvm_recv( src, tag pvm_setsbuf( bufid pvm_send( dst, tag pvm_freebuf( oldid

); ); ); );
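The initsend/pack/send and recv/unpack sequence described above can be condensed into a short, self-contained sketch. The following program is not part of the project code (the file name send_recv_demo.c is purely illustrative); it assumes PVM 3 is installed and the virtual machine is running, and the task simply sends a message to itself so that the example stays self-contained. In a real program the destination TID would normally come from pvm_spawn() or pvm_parent().

/* send_recv_demo.c - minimal illustration of the PVM pack/send/receive
 * sequence.  The task sends a message to itself purely for illustration.
 */
#include <stdio.h>
#include "pvm3.h"

int main(void)
{
    int mytid = pvm_mytid();          /* enroll in PVM, get own TID     */
    int nums[3] = { 10, 20, 30 };
    int recvd[3];

    /* Step 1: initialise the send buffer (XDR encoding by default).   */
    pvm_initsend(PvmDataDefault);
    /* Step 2: pack the data into the active send buffer.              */
    pvm_pkint(nums, 3, 1);            /* 3 ints, stride 1               */
    /* Step 3: send the packed message, labelled with message tag 42.  */
    pvm_send(mytid, 42);

    /* Blocking receive of any message with tag 42, then unpack it.    */
    pvm_recv(-1, 42);
    pvm_upkint(recvd, 3, 1);
    printf("received %d %d %d\n", recvd[0], recvd[1], recvd[2]);

    pvm_exit();                       /* leave the virtual machine      */
    return 0;
}

Such a sketch would be compiled against the PVM library (for example, cc send_recv_demo.c -lpvm3) and run while the virtual machine described in Section 5.3 is up.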


Packing Data

Each of the following C routines packs an array of the given data type into the active send buffer. They can be called multiple times to pack data into a single message. Thus, a message can contain several arrays, each with a different data type. C structures must be passed by packing their individual elements. There is no limit to the complexity of the packed messages, but an application should unpack the messages exactly as they were packed. Although this is not strictly required, it is a safe programming practice. The arguments for each of the routines are a pointer to the first item to be packed, nitem which is the total number of items to pack from this array, and stride which is the stride to use when packing. A stride of 1 means a contiguous vector is packed, a stride of 2 means every other item is packed, and so on. An exception is pvm_pkstr() which by definition packs a NULL terminated character string and thus does not need nitem or stride arguments.

int info = pvm_pkbyte(   char   *cp, int nitem, int stride )
int info = pvm_pkcplx(   float  *xp, int nitem, int stride )
int info = pvm_pkdcplx(  double *zp, int nitem, int stride )
int info = pvm_pkdouble( double *dp, int nitem, int stride )
int info = pvm_pkfloat(  float  *fp, int nitem, int stride )
int info = pvm_pkint(    int    *np, int nitem, int stride )
int info = pvm_pklong(   long   *np, int nitem, int stride )
int info = pvm_pkshort(  short  *np, int nitem, int stride )
int info = pvm_pkstr(    char   *cp )
int info = pvm_packf(    const char *fmt, ... )

PVM also supplies a packing routine, pvm_packf(), that uses a printf-like format expression to specify what data to pack and how to pack it into the send buffer. All variables are passed as addresses if count and stride are specified; otherwise, variables are assumed to be values.

Sending and Receiving Data

int info = pvm_send( int tid, int msgtag )
int info = pvm_mcast( int *tids, int ntask, int msgtag )

The routine pvm_send() labels the message with an integer identifier msgtag and sends it immediately to the process TID. The routine pvm_mcast() labels the message with an integer identifier msgtag and broadcasts the message to all tasks specified in the integer array tids (except itself). The tids array is of length ntask.

int info = pvm_psend( int tid, int msgtag, void *vp, int cnt, int type )

The routine pvm_psend() packs and sends an array of the specified datatype to the task identified by TID. In C the type argument can be any of the following:

PVM_STR     PVM_BYTE    PVM_SHORT   PVM_INT    PVM_LONG    PVM_USHORT
PVM_FLOAT   PVM_CPLX    PVM_DOUBLE  PVM_DCPLX  PVM_UINT    PVM_ULONG

PVM contains several methods of receiving messages at a task. There is no function matching in PVM; for example, a pvm_psend() does not have to be matched with a pvm_precv(). Any of the following routines can be called for any incoming message, no matter how it was sent (or multicast).

int bufid = pvm_recv( int tid, int msgtag )

This blocking receive routine will wait until a message with label msgtag has arrived from TID. A value of -1 in msgtag or TID matches anything (wildcard). It then places the message in a new active receive buffer that is created. The previous active receive buffer is cleared unless it has been saved with a pvm_setrbuf() call.

int bufid = pvm_nrecv( int tid, int msgtag )

If the requested message has not arrived, then the nonblocking receive pvm_nrecv() returns bufid = 0. This routine can be called multiple times for the same message to check whether it has arrived, while performing useful work between calls. When no more useful work can be performed, the blocking receive pvm_recv() can be called for the same message. If a message with label msgtag has arrived from TID, pvm_nrecv() places this message in a new active receive buffer (which it creates) and returns the ID of this buffer. The previous active receive buffer is cleared unless it has been saved with a pvm_setrbuf() call. A value of -1 in msgtag or TID matches anything (wildcard).

int bufid = pvm_probe( int tid, int msgtag )

If the requested message has not arrived, then pvm_probe() returns bufid = 0. Otherwise, it returns a bufid for the message, but does not "receive" it. This routine can be called multiple times for the same message to check whether it has arrived, while performing useful work between calls. In addition, pvm_bufinfo() can be called with the returned bufid to determine information about the message before receiving it.

int bufid = pvm_trecv( int tid, int msgtag, struct timeval *tmout )

PVM also supplies a timeout version of receive. Consider the case where a message is never going to arrive (because of error or failure); the routine pvm_recv() would block forever. To avoid such situations, the user may wish to give up after waiting for a fixed amount of time. The routine pvm_trecv() allows the user to specify a timeout period. If the timeout period is set very large, then pvm_trecv() acts like pvm_recv(). If the timeout period is set to zero, then pvm_trecv() acts like pvm_nrecv(). Thus, pvm_trecv() fills the gap between the blocking and nonblocking receive functions.

int info = pvm_bufinfo( int bufid, int *bytes, int *msgtag, int *tid )

The routine pvm_bufinfo() returns msgtag, source TID, and length in bytes of the message identified by bufid. It can be used to determine the label and source of a message that was received with wildcards specified.

int info = pvm_precv( int tid, int msgtag, void *vp, int cnt, int type,
                      int *rtid, int *rtag, int *rcnt )

The routine pvm_precv() combines the functions of a blocking receive and unpacking the received buffer. It does not return a bufid. Instead, it returns the actual values of TID, msgtag, and cnt.

int (*old)() = pvm_recvf( int (*new)(int buf, int tid, int tag) )

The routine pvm_recvf() modifies the receive context used by the receive functions and can be used to extend PVM. The default receive context is to match on source and message tag. This can be modified to any user-defined comparison function.

Unpacking Data

The following C routines unpack (multiple) data types from the active receive buffer. In an application they should match their corresponding pack routines in type, number of items, and stride. nitem is the number of items of the given type to unpack, and stride is the stride.

int info = pvm_upkbyte(   char   *cp, int nitem, int stride )
int info = pvm_upkcplx(   float  *xp, int nitem, int stride )
int info = pvm_upkdcplx(  double *zp, int nitem, int stride )
int info = pvm_upkdouble( double *dp, int nitem, int stride )
int info = pvm_upkfloat(  float  *fp, int nitem, int stride )
int info = pvm_upkint(    int    *np, int nitem, int stride )
int info = pvm_upklong(   long   *np, int nitem, int stride )
int info = pvm_upkshort(  short  *np, int nitem, int stride )
int info = pvm_upkstr(    char   *cp )

int info = pvm_unpackf( const char *fmt, ... )

The routine pvm_unpackf() uses a printf-like format expression to specify what data to unpack and how to unpack it from the receive buffer.
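Since pvm_psend() and pvm_precv() combine packing with sending and receiving with unpacking, the earlier sketch can be written more compactly. The following fragment is again only an illustration, under the same assumptions as before (the task messages itself; psend_precv_demo.c is a hypothetical file name):

/* psend_precv_demo.c - the same exchange as before, but using the
 * combined pack-and-send / receive-and-unpack routines.
 */
#include <stdio.h>
#include "pvm3.h"

int main(void)
{
    int mytid = pvm_mytid();
    int nums[3] = { 10, 20, 30 };
    int recvd[3];
    int rtid, rtag, rcnt;             /* filled in by pvm_precv()       */

    /* Pack and send 3 ints with tag 42 in a single call.              */
    pvm_psend(mytid, 42, nums, 3, PVM_INT);

    /* Blocking receive and unpack; also reports sender, tag and count.*/
    pvm_precv(mytid, 42, recvd, 3, PVM_INT, &rtid, &rtag, &rcnt);
    printf("got %d ints with tag %d from tid %d\n", rcnt, rtag, rtid);

    pvm_exit();
    return 0;
}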

5.5 Demos

5.5.1 Master-Slave Demo

We have a Master task which, on execution, spawns a number of slave (worker) tasks. If N is the number of hosts in the virtual machine, the master spawns a small multiple of N tasks (T = N * 3, capped at 32, as can be seen in the listing below). Listing 7 shows the code listing of the master task, master1.c:

static char rcsid[] =
    "$Id: master1.c,v 1.4 1997/07/09 13:25:09 pvmsrc Exp $";

/*
 *  PVM version 3.4:  Parallel Virtual Machine System
 *  University of Tennessee, Knoxville TN.
 *  Oak Ridge National Laboratory, Oak Ridge TN.
 *  Emory University, Atlanta GA.
 *  Authors: J. J. Dongarra, G. E. Fagg, M. Fischer,
 *           G. A. Geist, J. A. Kohl, R. J. Manchek, P. Mucci,
 *           P. M. Papadopoulos, S. L. Scott, and V. S. Sunderam
 *  (C) 1997 All Rights Reserved
 *
 *                          NOTICE
 *
 *  Permission to use, copy, modify, and distribute this software and
 *  its documentation for any purpose and without fee is hereby granted
 *  provided that the above copyright notice appear in all copies and
 *  that both the copyright notice and this permission notice appear in
 *  supporting documentation.
 *
 *  Neither the Institutions (Emory University, Oak Ridge National
 *  Laboratory, and University of Tennessee) nor the Authors make any
 *  representations about the suitability of this software for any
 *  purpose.  This software is provided ``as is'' without express or
 *  implied warranty.
 *
 *  PVM version 3 was funded in part by the U.S. Department of Energy,
 *  the National Science Foundation and the State of Tennessee.
 */

#include <stdio.h>
#include "pvm3.h"

#define SLAVENAME "slave1"

main()
{
    int mytid;                   /* my task id */
    int tids[32];                /* slave task ids */
    int n, nproc, numt, i, who, msgtype, nhost, narch;
    float data[100], result[32];
    struct pvmhostinfo *hostp;

    /* enroll in pvm */
    mytid = pvm_mytid();

    /* Set number of slaves to start */
    pvm_config( &nhost, &narch, &hostp );
    nproc = nhost * 3;
    if( nproc > 32 ) nproc = 32 ;
    printf("Spawning %d worker tasks ... " , nproc);

    /* start up slave tasks */
    numt = pvm_spawn(SLAVENAME, (char**)0, 0, "", nproc, tids);
    if( numt < nproc ){
        printf("\n Trouble spawning slaves. Aborting. Error codes are:\n");
        for( i=numt ; i<nproc ; i++ )
            printf("TID %d %d\n", i, tids[i]);
        pvm_exit();
        exit(1);
    }
    printf("SUCCESSFUL\n");

    /* Begin User Program */
    n = 100;
    /* initialize_data( data, n ); */
    /* ... initialise data[0..n-1], pack nproc, tids, n and data into a
       message, multicast it to the slaves with pvm_mcast(tids, nproc, 0),
       then collect one result message (msgtag 5) from each slave before
       calling pvm_exit() ... */
}

The corresponding slave task, slave1.c, unpacks the broadcast data, computes its share of the work and returns its result to the master:
/*
 *  PVM version 3.4:  Parallel Virtual Machine System
 *  University of Tennessee, Knoxville TN.
 *  Oak Ridge National Laboratory, Oak Ridge TN.
 *  Emory University, Atlanta GA.
 *  Authors: J. J. Dongarra, G. E. Fagg, M. Fischer,
 *           G. A. Geist, J. A. Kohl, R. J. Manchek, P. Mucci,
 *           P. M. Papadopoulos, S. L. Scott, and V. S. Sunderam
 *  (C) 1997 All Rights Reserved
 *
 *                          NOTICE
 *
 *  Permission to use, copy, modify, and distribute this software and
 *  its documentation for any purpose and without fee is hereby granted
 *  provided that the above copyright notice appear in all copies and
 *  that both the copyright notice and this permission notice appear in
 *  supporting documentation.
 *
 *  Neither the Institutions (Emory University, Oak Ridge National
 *  Laboratory, and University of Tennessee) nor the Authors make any
 *  representations about the suitability of this software for any
 *  purpose.  This software is provided ``as is'' without express or
 *  implied warranty.
 *
 *  PVM version 3 was funded in part by the U.S. Department of Energy,
 *  the National Science Foundation and the State of Tennessee.
 */

#include <stdio.h>
#include "pvm3.h"

main()
{
    int mytid;       /* my task id */
    int tids[32];    /* task ids */
    int n, me, i, nproc, master, msgtype;
    float data[100], result;
    float work();

    /* enroll in pvm */
    mytid = pvm_mytid();

    /* Receive data from master */
    msgtype = 0;
    pvm_recv( -1, msgtype );
    pvm_upkint(&nproc, 1, 1);
    pvm_upkint(tids, nproc, 1);
    pvm_upkint(&n, 1, 1);
    pvm_upkfloat(data, n, 1);

    /* Determine which slave I am (0 -- nproc-1) */
    for( i=0; i<nproc ; i++ )
        if( mytid == tids[i] ){ me = i; break; }

    result = work( me, n, data, tids, nproc );

    /* Send result to master */
    pvm_initsend( PvmDataDefault );
    pvm_pkint( &me, 1, 1 );
    pvm_pkfloat( &result, 1, 1 );
    msgtype = 5;
    master = pvm_parent();
    pvm_send( master, msgtype );

    /* Program finished. Exit PVM before stopping */
    pvm_exit();
}

float work(me, n, data, tids, nproc )
/* Simple example: slaves exchange data with left neighbor (wrapping) */
    int me, n, *tids, nproc;
    float *data;
{
    int i, dest;
    float psum = 0.0;
    float sum = 0.0;

    /* ... accumulate a partial sum over data[], exchange it with the
       left neighbour (wrapping around via tids[]) using PVM messages,
       and return the combined value ... */
}

6

Message Passing Interface

6.1

Introduction

MPI stands for "Message Passing Interface". It is a library of functions (in C) or subroutines (in Fortran) that you insert into source code to perform data communication between processes. MPI was developed over two years of discussions led by the MPI Forum, a group of roughly sixty people representing some forty organizations. The MPI-1 standard was defined in the spring of 1994.

• This standard specifies the names, calling sequences, and results of subroutines and functions to be called from Fortran 77 and C, respectively. All implementations of MPI must conform to these rules, thus ensuring portability. MPI programs should compile and run on any platform that supports the MPI standard.

• The detailed implementation of the library is left to individual vendors, who are thus free to produce optimized versions for their machines.

• Implementations of the MPI-1 standard are available for a wide variety of platforms.

An MPI-2 standard has also been defined. It provides for additional features not present in MPI-1, including tools for parallel I/O, C++ and Fortran 90 bindings, and dynamic process management. At present, some MPI implementations include portions of the MPI-2 standard but the full MPI-2 is not yet available.

6.2

MPI Implementations

MPI is a specification, not an implementation. Over the years, various implementations have emerged. The most notable ones are:

• Open MPI
• MPICH
• LAM/MPI

More details may be found in [4, 5, 6]. For the purpose of this project, Open MPI has been used. Henceforth, MPI refers to Open MPI.

6.3

Open MPI

The Open MPI Project is an open source MPI-2 implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from across the High Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers, and computer science researchers.


6.3.1

Using Open MPI

Building Open MPI

Open MPI uses a traditional configure script paired with "make" to build. A typical install follows the pattern:

amit@ubuntu-desktop-1$ ./configure [...options...]
amit@ubuntu-desktop-1$ make all install

Checking Your Open MPI Installation

The "ompi_info" command can be used to check the status of your Open MPI installation (located in <prefix>/bin/ompi_info). Running it with no arguments provides a summary of information about your Open MPI installation. Note that the ompi_info command is extremely helpful in determining which components are installed as well as listing all the run-time settable parameters that are available in each component (as well as their default values). The following options may be helpful:

--all        Show a *lot* of information about your Open MPI installation.
--parsable   Display all the information in an easily grep/cut/awk/sed-able format.
--param <framework> <component>
             A <framework> of "all" and a <component> of "all" will show all
             parameters to all components. Otherwise, the parameters of all the
             components in a specific framework, or just the parameters of a
             specific component, can be displayed by using an appropriate
             <framework> and/or <component> name.

Here is a sample output:

amit@ubuntu-desktop-1:~/mpi-exec$ ompi_info
                 Open MPI: 1.2.3
    Open MPI SVN revision: r15136
                 Open RTE: 1.2.3
    Open RTE SVN revision: r15136
                     OPAL: 1.2.3
        OPAL SVN revision: r15136
                   Prefix: /usr/local
  Configured architecture: i686-pc-linux-gnu
            Configured by: amit
            Configured on: Wed Aug 29 13:48:38 IST 2007
           Configure host: ubuntu-desktop-1
                 Built by: root
                 Built on: Wed Aug 29 13:57:39 IST 2007
               Built host: ubuntu-desktop-1
               C bindings: yes
             C++ bindings: yes
       Fortran77 bindings: no
       Fortran90 bindings: no
  Fortran90 bindings size: na
               C compiler: gcc
      C compiler absolute: /usr/bin/gcc
             C++ compiler: g++
    C++ compiler absolute: /usr/bin/g++
       Fortran77 compiler: none
   Fortran77 compiler abs: none
       Fortran90 compiler: none
   Fortran90 compiler abs: none
              C profiling: yes
            C++ profiling: yes
      Fortran77 profiling: no
      Fortran90 profiling: no
           C++ exceptions: no
           Thread support: posix (mpi: no, progress: no)
   Internal debug support: no
      MPI parameter check: runtime
 Memory profiling support: no
 Memory debugging support: no
          libltdl support: yes
    Heterogeneous support: yes
  mpirun default --prefix: yes

Compiling Open MPI Applications

Open MPI provides "wrapper" compilers that should be used for compiling MPI applications:

C:          mpicc
C++:        mpiCC (or mpic++ if your filesystem is case-insensitive)
Fortran 77: mpif77
Fortran 90: mpif90

For example:

amit@ubuntu-desktop-1:~/mpi-codes$ mpicc hello_world_mpi.c -o hello_world_mpi -g
amit@ubuntu-desktop-1:~/mpi-codes$

All the wrapper compilers do is add a variety of compiler and linker flags to the command line and then invoke a back-end compiler. To be specific: the wrapper compilers do not parse source code at all; they are solely command-line manipulators and have nothing to do with the actual compilation or linking of programs. The end result is an MPI executable that is properly linked to all the relevant libraries.

Running Open MPI Applications

Open MPI supports both mpirun and mpiexec (they are exactly equivalent). For example:

amit@ubuntu-desktop-1:~/ mpirun -np 2 hello_world_mpi

or

amit@ubuntu-desktop-1:~/ mpiexec -np 1 hello_world_mpi : -np 1 hello_world_mpi

are equivalent. The rsh launcher accepts a -hostfile parameter (the option "-machinefile" is equivalent); you can specify a -hostfile parameter indicating a standard mpirun-style hostfile (one hostname per line):

amit@ubuntu-desktop-1:~/ mpirun -hostfile my_hostfile -np 2 hello_world_mpi

If you intend to run more than one process on a node, the hostfile can use the "slots" attribute. If "slots" is not specified, a count of 1 is assumed. For example, using the following hostfile:

node1.example.com
node2.example.com
node3.example.com slots=2
node4.example.com slots=4

amit@ubuntu-desktop-1:~/ mpirun -hostfile my_hostfile -np 8 hello_world_mpi

will launch MPI_COMM_WORLD rank 0 on node1, rank 1 on node2, ranks 2 and 3 on node3, and ranks 4 through 7 on node4. Other starters, such as the batch scheduling environments, do not require hostfiles (and will ignore the hostfile if it is supplied). They will also launch as many processes as slots have been allocated by the scheduler if no "-np" argument has been provided. For example, running an interactive SLURM job with 8 processors:

amit@ubuntu-desktop-1:~/ srun -n 8 -A
amit@ubuntu-desktop-1:~/ mpirun a.out

The above command will launch 8 copies of a.out in a single MPI_COMM_WORLD on the processors that were allocated by SLURM.

Notes:

1. Communicators and Groups: MPI uses objects called communicators and groups to define which collection of processes may communicate with each other. Most MPI routines require you to specify a communicator as an argument. MPI_COMM_WORLD is the predefined communicator which includes all of your MPI processes.

2. Rank: Within a communicator, every process has its own unique, integer identifier assigned by the system when the process initializes. A rank is sometimes also called a "process ID". Ranks are contiguous and begin at zero. The rank is used by the programmer to specify the source and destination of messages, and it is often used conditionally by the application to control program execution (if rank == 0 do this / if rank == 1 do that).

6.3.2 MPI User Interface

Environment Management Routines

Several of the more commonly used MPI environment management routines are described below.

MPI_Init: Initializes the MPI execution environment. This function must be called in every MPI program, must be called before any other MPI functions, and must be called only once in an MPI program. For C programs, MPI_Init may be used to pass the command line arguments to all processes, although this is not required by the standard and is implementation dependent.

MPI_Init (*argc,*argv)

MPI_Comm_size: Determines the number of processes in the group associated with a communicator. Generally used within the communicator MPI_COMM_WORLD to determine the number of processes being used by your application.

MPI_Comm_size (comm,*size)

MPI_Comm_rank: Determines the rank of the calling process within the communicator. Initially, each process will be assigned a unique integer rank between 0 and (number of processes - 1) within the communicator MPI_COMM_WORLD. This rank is often referred to as a task ID. If a process becomes associated with other communicators, it will have a unique rank within each of these as well.

MPI_Comm_rank (comm,*rank)

MPI_Abort: Terminates all MPI processes associated with the communicator. In most MPI implementations it terminates ALL processes regardless of the communicator specified.

MPI_Abort (comm,errorcode)

MPI_Get_processor_name: Gets the name of the processor on which the command is executed. Also returns the length of the name. The buffer for "name" must be at least MPI_MAX_PROCESSOR_NAME characters in size. What is returned into "name" is implementation dependent - it may not be the same as the output of the "hostname" or "host" shell commands.

MPI_Get_processor_name (*name,*resultlength)

Point-to-point Communication Routines

The more commonly used MPI blocking message passing routines are described below.

MPI_Send: Basic blocking send operation. The routine returns only after the application buffer in the sending task is free for reuse. Note that this routine may be implemented differently on different systems. The MPI standard permits the use of a system buffer but does not require it. Some implementations may actually use a synchronous send to implement the basic blocking send.

MPI_Send (*buf,count,datatype,dest,tag,comm)

MPI_Recv: Receives a message and blocks until the requested data is available in the application buffer in the receiving task.

MPI_Recv (*buf,count,datatype,source,tag,comm,*status)

The more commonly used MPI non-blocking message passing routines are described below.

MPI_Isend: Identifies an area in memory to serve as a send buffer. Processing continues immediately without waiting for the message to be copied out from the application buffer. A communication request handle is returned for handling the pending message status. The program should not modify the application buffer until subsequent calls to MPI_Wait or MPI_Test indicate that the non-blocking send has completed.

MPI_Isend (*buf,count,datatype,dest,tag,comm,*request)

MPI_Irecv: Identifies an area in memory to serve as a receive buffer. Processing continues immediately without actually waiting for the message to be received and copied into the application buffer. A communication request handle is returned for handling the pending message status. The program must use calls to MPI_Wait or MPI_Test to determine when the non-blocking receive operation completes and the requested message is available in the application buffer.

MPI_Irecv (*buf,count,datatype,source,tag,comm,*request)
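The following small sketch (not taken from the project code) shows how the non-blocking pair is typically completed with MPI_Wait; the tag value and buffer are arbitrary, and it assumes the program is started with exactly two processes:

/* nonblocking_demo.c - sketch of MPI_Isend/MPI_Irecv completed by MPI_Wait.
 * Run with two processes, e.g.:  mpirun -np 2 nonblocking_demo
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value = 0;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 99;
        /* Post the send and continue immediately; the buffer must not be
         * modified until MPI_Wait reports that the send has completed.  */
        MPI_Isend(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD, &request);
        /* ... useful work could overlap with the communication here ... */
        MPI_Wait(&request, &status);
    } else if (rank == 1) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, &request);
        /* ... useful work ... */
        MPI_Wait(&request, &status);   /* data is valid only after this */
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}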

6.4

Demos

6.4.1

Hello World of MPI

Consider the following “Hello World” MPI program:

#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
    int myrank, size;

    MPI_Init(&argc, &argv);                    /* Initialize MPI                     */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);    /* Get my rank                        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* Get the total number of processors */
    printf("Processor %d of %d: Hello World!\n", myrank, size);

    MPI_Finalize();                            /* Terminate MPI                      */
    return 0;
}

Results:

amit@ubuntu-desktop-1:~/mpi-exec$ mpirun --np 3 --hostfile mpi-host-file HellMPI
Processor 0 of 3: Hello World!
Processor 2 of 3: Hello World!
Processor 1 of 3: Hello World!

Notes:

1. The executable has to be placed in the same directory on all the hosts, e.g. /home/amit/mpi-exec.
2. Password-less login has to be set up to all the hosts.
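The mpi-host-file passed to mpirun above is an ordinary mpirun-style hostfile of the kind described in Section 6.3.1; a minimal example, with hypothetical hostnames, would be:

ubuntu-desktop-1
ubuntu-desktop-2
ubuntu-desktop-3

One host per line is sufficient here; a slots=N attribute can be appended to a line when more than one process should run on that host.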


6.5 Parallel Search using MPI

6.5.1 Problem Description

The program implements a parallel search of an extremely large (several thousand element) integer array. It finds all occurrences of a certain integer, called the target, and writes all the array indices where the target was found to an output file. In addition, the program reads both the target value and all the array elements from an input file.

6.5.2 Algorithm

PROGRAM parallel_search

INTEGER rank, error
CALL MPI_INIT(error)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, error)

if (rank == 0) then
    read in target value from input data file
    send target value to processor 1
    send target value to processor 2
    send target value to processor 3
    read in integer array b from input file
    send first third of array to processor 1
    send second third of array to processor 2
    send last third of array to processor 3
    while (not done)
        receive target indices from any of the slaves
        write target indices to the output file
    end while
else
    receive the target from processor 0
    receive my sub_array from processor 0
    for each element in my subarray
        if ( element value == target ) then
            convert local index into global index
            send global index to processor 0
        end if
    end loop
    send message to processor 0 indicating my search is done
end if

CALL MPI_FINALIZE(error)
END PROGRAM parallel_search

6.5.3 Code Listing

/* Parallel Search Implementation */

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main (int argc, char **argv)
{
    int i, myrank, loc = 0, num, num_p, recv_target_index, GI;
    MPI_Status status;
    FILE *handle;
    int data[300];         /* array to store the data from file in MASTER process */
    int partial_data[100]; /* array to store the 1/3 array in SLAVES */
    int target;            /* Value to be found */

    MPI_Init(&argc, &argv);                  /* Initialize MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* Get rank */
    printf("My Rank is %d\n", myrank);

    /* Get/Send the target value to each processor */
    if( myrank == 0 )
    {
        /* Get the target value from the user */
        fflush(stdout);
        printf("Enter the number to search\n");
        scanf("%d", &target);
        fflush(stdout);

        /* Send the target value to each slave node (ranks 1..3) */
        for(i = 1; i <= 3; i++)
        {
            MPI_Send(&target, 1, MPI_INT, i, 9, MPI_COMM_WORLD);
        }

        /* Read in integer values into the array from the file [data.data] */
        handle = fopen("data.data", "r");
        if(handle == NULL)
        {
            printf("Error Opening file");
            exit(0);
        }
        while(!feof(handle))
        {
            fscanf(handle, "%d", &num);
            data[loc++] = num;
            printf("\n %d", num);
        }

        /* Send 1/3 of the array to Processor 1 */
        MPI_Send(&data[0],   100, MPI_INT, 1, 11, MPI_COMM_WORLD);
        /* Send the next 1/3 of the array to Processor 2 */
        MPI_Send(&data[100], 100, MPI_INT, 2, 11, MPI_COMM_WORLD);
        /* Send the last 1/3 of the array to Processor 3 */
        MPI_Send(&data[200], 100, MPI_INT, 3, 11, MPI_COMM_WORLD);

        /* Receive target indices from any of the slaves */
        printf("Waiting for slaves..");
        num_p = 0;
        do {
            MPI_Recv(&recv_target_index, 1, MPI_INT, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == 52)
            {
                /* the master receives a message with tag 52 when a slave
                   has finished its search */
                num_p++;
                /* write to a file [found.data] the processor and index info */
            }
            else
            {
                /* a message with tag 19 carries a global index where the
                   target was found */
                printf("Target %d found in Processor at INDEX %d in processor %d\n",
                       target, recv_target_index, status.MPI_SOURCE);
            }
        } while(num_p != 3);
    }
    else /* if rank != 0 */
    {
        /* Receive the target value from the master processor */
        MPI_Recv(&target, 1, MPI_INT, MPI_ANY_SOURCE, 9, MPI_COMM_WORLD, &status);
        /* Receive the 1/3 array from the master processor for the search */
        MPI_Recv(partial_data, 100, MPI_INT, MPI_ANY_SOURCE, 11, MPI_COMM_WORLD, &status);

        for(i = 0; i < 100; i++)
        {
            if(partial_data[i] == target)   /* If the target is found */
            {
                GI = (myrank-1)*100 + i;    /* Convert local index to global index */
                MPI_Send(&GI, 1, MPI_INT, 0, 19, MPI_COMM_WORLD);
            }
        }
        /* Tell the master that this slave has finished its search */
        MPI_Send(&i, 1, MPI_INT, 0, 52, MPI_COMM_WORLD);
    }

    MPI_Finalize();     /* Terminate MPI */
    return 0;
}

6.5.4

Results

A sample output of the “Parallel Search Algorithm” implementation is shown below:

Target 25 found in Processor at INDEX 97 in processor 1
Target 25 found in Processor at INDEX 204 in processor 1
Target 25 found in Processor at INDEX 398 in processor 1
Target 25 found in Processor at INDEX 505 in processor 1
Target 25 found in Processor at INDEX 197 in processor 2
Target 25 found in Processor at INDEX 304 in processor 2
Target 25 found in Processor at INDEX 498 in processor 2
Target 25 found in Processor at INDEX 605 in processor 2

6.5.5

Comparison with Sequential Search

To demonstrate the speedup obtained by using a parallel search algorithm, I measured the time taken to search for an item in a list of numbers and compared it against the time taken by its sequential counterpart.

1. Result Set - 1

Number of data items -> 1200
No of Nodes used     -> 3

Sequential Search -> 0m2.712s (In case of successful search)
Sequential Search -> 0m1.712s (In case of successful search)
Sequential Search -> 0m1.256s (In case of successful search)
Sequential Search -> 0m1.848s (In case of successful search)
                     _________
                     0m1.882s  (average)

Sequential Search -> 0m1.624s (In case of unsuccessful search)

Parallel Search   -> 0m1.673s (In case of successful search)
Parallel Search   -> 0m1.354s (In case of successful search)
Parallel Search   -> 0m1.194s (In case of successful search)
Parallel Search   -> 0m1.170s (In case of successful search)
                     _________
                     0m1.348s  (average)

Parallel Search   -> 0m1.202s (In case of unsuccessful search)

2. Result Set - 2

Number of data items -> 2400
No of Nodes used     -> 3

Sequential Search -> 0m0.952s (In case of successful search)
Sequential Search -> 0m0.801s (In case of successful search)
Sequential Search -> 0m1.600s (In case of successful search)
Sequential Search -> 0m1.392s (In case of successful search)
                     _________
                     0m1.181s  (average)

Sequential Search -> 0m1.504s (In case of unsuccessful search)

Parallel Search   -> 0m0.834s (In case of successful search)
Parallel Search   -> 0m0.894s (In case of successful search)
Parallel Search   -> 0m1.158s (In case of successful search)
Parallel Search   -> 0m0.811s (In case of successful search)
                     _________
                     0m0.921s  (average)

Parallel Search   -> 0m1.355s (In case of unsuccessful search)
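From the averaged timings of the successful searches listed above, the observed speedup of the parallel search over its sequential counterpart, S = T_sequential / T_parallel, works out (my calculation from the figures above) to roughly:

\[
  S_{1200} = \frac{1.882\,\mathrm{s}}{1.348\,\mathrm{s}} \approx 1.40,
  \qquad
  S_{2400} = \frac{1.181\,\mathrm{s}}{0.921\,\mathrm{s}} \approx 1.28
\]

Both values are well below the ideal factor of 3 for three nodes, which is to be expected here since the measured times also include reading the data file and the interactive input, in addition to the message-passing overhead.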

6.6 Parallel Search using MPI - Modification

6.6.1 Problem Statement

Consider two 2-node clusters. The problem is to implement a parallel search for two numbers in the given data set such that, during the first iteration, the search for the first number is restricted to the first cluster and the search for the second number to the second cluster; the assignment is reversed during the second iteration.

6.6.2 Code Listing

/* Parallel Search Implementation
 * Search for 2 numbers in the array
 * uses 2 clusters - 2 nodes each
 * 2*2 = 4 nodes in the cluster
 */

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main (int argc, char **argv)
{
    int i, j, myrank, loc = 0, num, num_p = 0, recv_target_index, GI, iter;
    int cluster1[2], cluster2[2], cluster;
    int FOUND = 0, search = 1, tries = 1, temp = 0;
    MPI_Status status;
    MPI_Request req;
    FILE *handle, *handle1;
    int data[2400];        /* array to store the data from file in MASTER process */
    int partial_data[600]; /* array to store the 1/4 array in SLAVES */
    int target, target1, target2;       /* Values to be found */
    int target1_slave, target2_slave;

    MPI_Init(&argc, &argv);                  /* Initialize MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* Get rank */
    printf("Ranks in the cluster %d\n", myrank);

    /* Get/Send the target values to each processor */
    if( myrank == 0 )
    {
        printf("Initializing 4-node Cluster\n");
        printf("This cluster is logically divided into 2- 2-node clusters\n\n"
               "*****************************************\n");

        /*
        printf("Enter the node ranks to form Cluster 1 (rank1, rank2) \n");
        scanf("%d %d",&cluster1[0], &cluster1[1]);
        printf("nodes in cluster 1 %d %d\n\n", cluster1[0], cluster1[1]);

        printf("Enter the node ranks to form Cluster 2 (rank3, rank4) \n");
        scanf("%d %d",&cluster2[0], &cluster2[1]);
        */

        /* Get the target values from the user */
        fflush(stdout);
        printf("Enter the numbers (num1, num2) to search\n");
        scanf("%d %d", &target1, &target2);
        fflush(stdout);

        cluster1[0] = 1; cluster1[1] = 2;
        cluster2[0] = 3; cluster2[1] = 4;

        /* Read in integer values into the array from the file [data.data] */
        handle = fopen("data.data", "r");
        if(handle == NULL)
        {
            printf("Error Opening file");
            exit(0);
        }
        while(!feof(handle))
        {
            fscanf(handle, "%d", &num);
            data[loc++] = num;
        }

        /* Send one quarter of the data to each of the 4 slave nodes */
        MPI_Send(&data[0],    600, MPI_INT, 1, 11, MPI_COMM_WORLD);
        printf("Sent 1st 1/4\n\n");
        MPI_Send(&data[600],  600, MPI_INT, 2, 11, MPI_COMM_WORLD);
        printf("Sent 2nd 1/4\n\n");
        MPI_Send(&data[1200], 600, MPI_INT, 3, 11, MPI_COMM_WORLD);
        printf("Sent 3rd 1/4\n\n");
        MPI_Send(&data[1800], 600, MPI_INT, 4, 11, MPI_COMM_WORLD);
        printf("Sent 4th 1/4\n\n");

        //FLAG:
        for(num = 1; num <= 2; num++)
        {
            if(num == 2)   /* second iteration: swap the targets between clusters */
            {
                temp = target1; target1 = target2; target2 = temp;
            }
            printf("Starting %d iteration of cluster search\n"
                   "***************************\n", num);
            search = 1;
            FOUND = 0;

            /* Search for target 1 in Cluster 1, search for target 2 in Cluster 2 */
            for(i = 1; i <= 2; i++)   /* number of nodes per cluster */
            {
                MPI_Send(&target1, 1, MPI_INT, cluster1[i-1],  9, MPI_COMM_WORLD);
                MPI_Send(&target2, 1, MPI_INT, cluster2[i-1], 10, MPI_COMM_WORLD);
            }

            fflush(stdout);
            printf("Target values %d , %d sent\n", target1, target2);
            printf("Searching for %d in Cluster 1 and %d in cluster 2 \n", target1, target2);

            /* Receive target indices from any of the slaves */
            printf("Waiting for slaves..\n\n");
            num_p = 0;

            do {
                //printf("Num P %d\n", num_p);
                MPI_Recv(&recv_target_index, 1, MPI_INT, MPI_ANY_SOURCE,
                         MPI_ANY_TAG, MPI_COMM_WORLD, &status);

                if (status.MPI_TAG == 52)
                {
                    /* the master receives a message with tag 52 when a slave
                       has finished its search */
                    num_p++;
                    if(recv_target_index == 0)
                    {
                        printf("Target not found\n\n");
                        search = 0;
                    }
                }
                else
                {
                    if (status.MPI_TAG == 19)     /* index reported by cluster 1 */
                    {
                        fflush(stdout);
                        if (status.MPI_SOURCE == 1 || status.MPI_SOURCE == 2)
                            cluster = 1;
                        printf("Target %d found in Processor at INDEX %d in processor %d on cluster %d\n\n",
                               target1, recv_target_index, status.MPI_SOURCE, cluster);
                    }
                    if (status.MPI_TAG == 20)     /* index reported by cluster 2 */
                    {
                        fflush(stdout);
                        if (status.MPI_SOURCE == 3 || status.MPI_SOURCE == 4)
                            cluster = 2;
                        printf("Target %d found in Processor at INDEX %d in processor %d on cluster %d \n\n",
                               target2, recv_target_index, status.MPI_SOURCE, cluster);
                    }
                }
            } while(num_p != 4);

            /*
            if(search==0 && ++tries <2){
                temp=target1; target1=target2; target2=temp;
                num_p=0;
                printf("Values of tries %d\n\n",tries);
                goto FLAG;
            }
            */
            /*
            if(search==0 && ++tries<=2){
                temp=target1; target1=target2; target2=temp;
                goto FLAG;
            }
            */
        } /* end of for (the two iterations) */

        MPI_Finalize();
        printf("---------------------------------------\nPress Ctrl + C to Exit\n\n");
    }
    else /* if rank != 0 */
    {
        MPI_Recv(&partial_data, 600, MPI_INT, MPI_ANY_SOURCE, 11, MPI_COMM_WORLD, &status);

FLAG:
        /* Receive the target value from the master processor */
        //printf("%d %d %d %d\n\n", cluster1[0], cluster1[1], cluster2[0], cluster2[1]);
        if (myrank == 1 || myrank == 2)
        {
            //printf("In cluster 1 Rank %d\n\n", myrank);
            MPI_Recv(&target1, 1, MPI_INT, MPI_ANY_SOURCE,  9, MPI_COMM_WORLD, &status);
        }
        else
        {
            MPI_Recv(&target2, 1, MPI_INT, MPI_ANY_SOURCE, 10, MPI_COMM_WORLD, &status);
            //printf("In cluster 2 Rank %d\n\n", myrank);
        }

        FOUND = 0;   /* reset the result flag for this iteration */

        //printf("Searching in child\n\n");
        for(i = 0; i <= 599; i++)
        {
            /*
            if(myrank == cluster1[0] || myrank == cluster1[1])
                target = target1;
            else
                target = target2;
            */
            //printf("looking for %d %d\n\n", target1_slave, target2_slave);
            if((partial_data[i] == target1 && myrank == 1) ||
               (partial_data[i] == target1 && myrank == 2))
            {
                GI = (myrank-1)*600 + i;   /* Convert local index to global index */
                //printf("target 1 found in %d \n\n", myrank);
                MPI_Send(&GI, 1, MPI_INT, 0, 19, MPI_COMM_WORLD);   /* for cluster 1 */
                FOUND = 1;
            }
            if((partial_data[i] == target2 && myrank == 3) ||
               (partial_data[i] == target2 && myrank == 4))
            {
                GI = (myrank-1)*600 + i;   /* Convert local index to global index */
                //printf("target 2 found in %d\n\n", myrank);
                MPI_Send(&GI, 1, MPI_INT, 0, 20, MPI_COMM_WORLD);   /* for cluster 2 */
                FOUND = 1;
            }
        } /* end of for */

        MPI_Send(&FOUND, 1, MPI_INT, 0, 52, MPI_COMM_WORLD);
        goto FLAG;
    }

    /* Terminate MPI */
    //MPI_Finalize();
    return 0;
}

6.6.3 Results

amit@ubuntu-desktop-1:~/mpi-exec$ mpirun --hostfile mpi-host-file --np 5 ParallelSearch
Ranks in the cluster 1
Ranks in the cluster 3
Ranks in the cluster 0
Initializing 4-node Cluster
This cluster is logically divided into 2- 2-node clusters
*****************************************
Enter the numbers (num1, num2) to search
Ranks in the cluster 2
Ranks in the cluster 4
12 13
Sent 1st 1/4
Sent 2nd 1/4
Sent 3rd 1/4
Sent 4th 1/4
Starting 1 iteration of cluster search
***************************
Target values 12 , 13 sent
Searching for 12 in Cluster 1 and 13 in cluster 2
Waiting for slaves..
Target 12 found in Processor at INDEX 11 in processor 1 on cluster 1
Target not found
Target not found
Target not found
Starting 2 iteration of cluster search
***************************
Target values 13 , 12 sent
Searching for 13 in Cluster 1 and 12 in cluster 2
Waiting for slaves..
Target not found
Target 13 found in Processor at INDEX 12 in processor 1 on cluster 1
Target not found
Target not found

7

Parallel Image Processing

As part of this project, I reviewed some of the existing work that has been done to implement parallel image processing. This review helped me gain valuable insights into the scope for improvement in the currently existing efforts, which in turn proved helpful in proposing a new framework for parallel image processing, described later on.

7.1

Parallel Processing for Computer Vision and Image Understanding

Problems in computer vision are computationally intensive. At the same time, most image processing algorithms involve similar computation on all the pixels of the image. This makes intensive image processing a potential candidate for parallelization. More details can be found in [7].


7.2

Parallel Implementation for Image Rotation Using Parallel Virtual Machine (PVM)

This work describes the implementation of parallel image rotation using the Parallel Virtual Machine (PVM) [11].

7.3

Image Filtering on .NET-based Desktop Grids

Image filtering is the use of computer graphics algorithms to enhance the quality of digital images or to extract information about their content. However, rendering very large digital images on a single machine is a performance bottleneck. To address this, the authors propose parallelising this application on a desktop Grid environment, using the Alchemi Desktop Grid environment; the resulting framework is referred to as ImageGrid. ImageGrid allows the parallel execution of linear digital filter algorithms on images. Acceptable speedup was observed as a result of parallelising filtering through ImageGrid. The tests were run on different data sets by varying the dimension of the images and the complexity of the filters. The results demonstrate the potential of Grid computing for desktop applications, and show that the speedup obtained is more consistent for large images and complex filters. This is described in [8].

7.4

Towards Efficient Parallel Image Processing on Cluster Grids using GIMP

As it is not realistic to expect that all users, especially specialists in the graphics business, will use complex low-level parallel programs to speed up image processing, the authors developed a plugin for the highly acclaimed GIMP which makes it possible to invoke a series of filter operations in a pipeline, in parallel, on a set of images loaded by the plugin. They present the software developments, test scenarios and experimental results on cluster grid systems possibly featuring single-processor and SMP nodes and being used by other users at the same time. Behind the GUI, the plugin invokes a smart DAMPVM cluster grid shell which spawns processes on the best nodes in the cluster, taking into account their loads, including other user processes. This makes it possible to select the fastest nodes for the stages in the pipeline. They show by experiment that the approach prevents scenarios in which other user processes, or even slightly more loaded processors, become the bottleneck of the whole pipeline. The parallel mapping is completely transparent to the end user, who interacts only with the GUI. The results achieved with the GIMP plugin using the smart cluster grid shell, as well as a simple round-robin scheduling, are presented, and the former solution proves to be superior. The details are described in [9].

7.5

A user friendly framework for Parallel Image Processing

Bridging the gap between the domains of parallel computing and image processing is an intriguing challenge for researchers from each domain. Software tools are necessary to provide the user with a friendly platform to take advantage of parallel computing.


The present work is an in-progress effort towards this goal and proposes a plugin-based, user-friendly framework for parallel image processing tasks. This user-friendly framework for parallel image processing is described in [10].

8

Conclusions

As was mentioned at the very outset, the main goal of this project work was to gain valuable hands-on experience of working with message passing systems - PVM and MPI. As is apparent from this report, I have successfully been able to accomplish my goal of gaining intermediate experience with programming using PVM and MPI. I also implemented parallel counterparts of common algorithms, such as a parallel search algorithm. In addition, I studied some of the literature related to “Parallel Image Processing”, which gave me valuable working knowledge of how parallel computing can be applied in domains such as image processing. Taking due note of the possible improvements to the current state of work, I proposed the idea of a parallel image processing framework, described earlier, which will enable graphic artists to take advantage of parallel computing without learning the intricacies of the latter.

9

Future Work

The successful completion of this project has equipped me with the necessary parallel computing skills to take on more application-oriented tasks in the future. Such knowledge will naturally prove helpful while working on research areas such as Parallel and Distributed Artificial Intelligence. In the future, I shall work on parallel computing in Intelligent Systems, which is one of my research interests.


10

References

1. Ian Foster, Designing and Building Parallel Programs.
2. Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, Vaidy Sunderam, PVM: Parallel Virtual Machine - A Users' Guide and Tutorial for Networked Parallel Computing.
3. XPVM: A Graphical Console and Monitor for PVM, http://www.netlib.org/utk/icl/xpvm/xpvm.html
4. Open MPI: Open Source High Performance Computing, http://www.open-mpi.org/
5. MPICH - A Portable Implementation of MPI, http://www-unix.mcs.anl.gov/mpi/mpich1/
6. LAM/MPI Parallel Computing, http://www.lam-mpi.org/
7. Choudhary, A., Ranka, S., "Parallel Processing for Computer Vision and Image Understanding".
8. Christian Vecchiola, Krishna Nadiminti and Rajkumar Buyya, "Image Filtering on .NET-based Desktop Grids".
9. Pawel Czarnul, Andrzej Ciereszko and Marcin Fraczak, "Towards Efficient Parallel Image Processing on Cluster Grids using GIMP".
10. Saha, Amit Kumar, "Towards a User Friendly Parallel Image Processing Framework".
11. J. Hinks, S. A. Amin, "Parallel Implementation For Image Rotation Using Parallel Virtual Machine (PVM)".

