High Performance Computing For senior undergraduate students

Lecture 6: Principles of Parallel Algorithms 08.11.2016

Dr. Mohammed Abdel-Megeed Salem Scientific Computing Department Faculty of Computer and Information Sciences Ain Shams University

Outline
• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
  – Processes Versus Processors

• Decomposition Techniques
  – Recursive Decomposition
  – Data Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition


Preliminaries: Decomposition, Tasks, and Dependency Graphs
• The first step in developing a parallel algorithm is to decompose the problem into tasks that can be executed concurrently. Tasks are programmer-defined units of computation.
• A given problem may be decomposed into tasks in many different ways.
• Tasks may be of the same size or of different sizes.


Example: Multiplying a Dense Matrix with a Vector
Consider the multiplication of a dense n x n matrix A with a vector b to yield another vector y.

Computation of each element of the output vector y is independent of the other elements. Based on this, a dense matrix-vector product can be decomposed into n tasks. The figure highlights the portion of the matrix and vector accessed by Task 1.
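As a hedged illustration (this code is not from the slides; the names and the use of OpenMP are assumptions), a minimal C sketch of this decomposition, with one task per element of y:

```c
/* Dense matrix-vector product y = A*b with one task per output element.
 * A is an n x n matrix stored row-major as a flat array.
 * Task i reads row i of A and all of b, and writes only y[i],
 * so all n tasks are independent. */
void matvec(const double *A, const double *b, double *y, long n)
{
    #pragma omp parallel for   /* the n tasks may run concurrently */
    for (long i = 0; i < n; i++) {
        double sum = 0.0;
        for (long j = 0; j < n; j++)
            sum += A[i * n + j] * b[j];
        y[i] = sum;
    }
}
```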


Example: Multiplying a Dense Matrix with a Vector
• Observations: While tasks share data (namely, the vector b), they do not have any control dependencies; i.e., no task needs to wait for the (partial) completion of any other.
• All tasks are of the same size in terms of number of operations.
• Is this the maximum number of tasks we could decompose this problem into?


Example: Database Query Processing
Consider the execution of the query:

MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")

on the following database:

ID#    Model    Year   Color   Dealer   Price
4523   Civic    2002   Blue    MN       $18,000
3476   Corolla  1999   White   IL       $15,000
7623   Camry    2001   Green   NY       $21,000
9834   Prius    2001   Green   CA       $18,000
6734   Civic    2001   White   OR       $17,000
5342   Altima   2001   Green   FL       $19,000
3845   Maxima   2001   Blue    NY       $22,000
8354   Accord   2000   Green   VT       $18,000
4395   Civic    2001   Red     CA       $17,000
7352   Civic    2002   Red     WA       $18,000


Example: Database Query Processing
The execution of the query can be divided into subtasks in various ways. Each task can be thought of as generating an intermediate table of entries that satisfy a particular clause.

Decomposing the given query into a number of tasks. Edges in this graph denote that the output of one task is needed to accomplish the next.


Example: Database Query Processing
• A decomposition can be illustrated in the form of a directed graph with nodes corresponding to tasks and edges indicating that the result of one task is required for processing the next. Such a graph is called a task dependency graph.


Example: Database Query Processing
Note that the same problem can be decomposed into subtasks in other ways as well.

An alternate decomposition of the given problem into subtasks, along with their data dependencies. Different task decompositions may lead to significant differences with respect to their eventual parallel performance.


Granularity of Task Decompositions
• The number of tasks into which a problem is decomposed determines its granularity.
• Decomposition into a large number of tasks results in a fine-grained decomposition; decomposition into a small number of tasks results in a coarse-grained decomposition.

A coarse-grained counterpart to the dense matrix-vector product example: each task now corresponds to the computation of three elements of the result vector.
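A hedged sketch of this coarser decomposition (again an assumption, not from the slides): the same product computed by ntasks tasks, each owning a contiguous block of rows of y (three rows per task when n = 3 * ntasks):

```c
/* Coarse-grained dense matrix-vector product: ntasks tasks, each
 * computing a contiguous block of elements of y. */
void matvec_blocked(const double *A, const double *b, double *y,
                    long n, long ntasks)
{
    #pragma omp parallel for
    for (long t = 0; t < ntasks; t++) {
        long lo = t * n / ntasks;        /* first row owned by task t */
        long hi = (t + 1) * n / ntasks;  /* one past its last row     */
        for (long i = lo; i < hi; i++) {
            double sum = 0.0;
            for (long j = 0; j < n; j++)
                sum += A[i * n + j] * b[j];
            y[i] = sum;
        }
    }
}
```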


Degree of Concurrency
• The number of tasks that can be executed in parallel is the degree of concurrency of a decomposition.
• The maximum degree of concurrency is the maximum number of such tasks at any point during execution. What is the maximum degree of concurrency of the database query examples?
• The average degree of concurrency is the average number of tasks that can be processed in parallel over the execution of the program.
• The degree of concurrency increases as the decomposition becomes finer in granularity, and vice versa.


Critical Path Length
• A directed path in the task dependency graph represents a sequence of tasks that must be processed one after the other.
• The longest such path determines the shortest time in which the program can be executed in parallel.
• The length of the longest path in a task dependency graph is called the critical path length.
• The ratio of the total amount of work to the critical path length is the average degree of concurrency, as illustrated below.
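As a hedged numeric illustration (the numbers are hypothetical, not taken from this lecture's figures): for a task dependency graph containing 63 units of total work whose longest path costs 27 units,

\[
\text{average degree of concurrency} = \frac{\text{total work}}{\text{critical path length}} = \frac{63}{27} \approx 2.33
\]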


Critical Path Length
Consider the task dependency graphs of the two database query decompositions:

• What are the critical path lengths for the two task dependency graphs?
• If each task takes 10 time units, what is the shortest parallel execution time for each decomposition?
• How many processors are needed in each case to achieve this minimum parallel execution time?
• What is the maximum degree of concurrency?
• What is the average degree of concurrency?


Limits on Parallel Performance
• It would appear that the parallel time can be made arbitrarily small by making the decomposition finer in granularity.
• There is an inherent bound on how fine the granularity of a computation can be. For example, in the case of multiplying a dense matrix with a vector, there can be no more than n² concurrent tasks (one for each scalar multiplication).
• Concurrent tasks may also have to exchange data with other tasks. This results in communication overhead. The tradeoff between the granularity of a decomposition and the associated overheads often determines performance bounds.


Task Interaction Graphs
• Subtasks generally exchange data with others in a decomposition.
  – For example, even in the trivial decomposition of the dense matrix-vector product, if the vector is not replicated across all tasks, they will have to communicate elements of the vector.

• The graph of tasks (nodes) and their interactions/data exchange (edges) is referred to as a task interaction graph.

• Note that task interaction graphs represent data dependencies, whereas task dependency graphs represent control dependencies.


Sparse matrix-vector multiplication
• Consider the problem of computing the product y = Ab of a sparse n x n matrix A with a dense n x 1 vector b.
• A matrix is considered sparse when a significant number of its entries are zero and the locations of the non-zero entries do not conform to a predefined structure or pattern.
• Arithmetic operations involving sparse matrices can often be optimized significantly by avoiding computations involving the zeros.
  – For instance, while computing the ith entry of the product vector, we need to compute the products A[i, j] x b[j] for only those values of j for which A[i, j] ≠ 0. For example, y[0] = A[0, 0].b[0] + A[0, 1].b[1] + A[0, 4].b[4] + A[0, 8].b[8].
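A minimal sketch (an assumption, not part of the slides) of how such zeros are skipped in practice, using the common compressed sparse row (CSR) storage format:

```c
/* Sparse matrix-vector product y = A*b with A in CSR format.
 * row_ptr[i] .. row_ptr[i+1]-1 index the stored nonzeros of row i;
 * col[k] and val[k] give the column and value of the k-th nonzero. */
void spmv_csr(const long *row_ptr, const long *col, const double *val,
              const double *b, double *y, long n)
{
    for (long i = 0; i < n; i++) {
        double sum = 0.0;
        /* Only columns j with A[i, j] != 0 are visited; e.g. if row 0
         * has nonzeros in columns 0, 1, 4, 8, this loop computes
         * y[0] = A[0,0]*b[0] + A[0,1]*b[1] + A[0,4]*b[4] + A[0,8]*b[8]. */
        for (long k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * b[col[k]];
        y[i] = sum;
    }
}
```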


Sparse matrix-vector multiplication
• Assign the computation of element y[i] of the output to Task i, which also owns row A[i, *] of the matrix and element b[i] of the input vector.
• BUT… the computation of y[i] requires access to many elements of b that are owned by other tasks, so Task i must get these elements from the appropriate locations (the message-passing paradigm).
• With the ownership of b[i], Task i also inherits the responsibility of sending b[i] to all the other tasks that need it for their computation.
  – For example, Task 4 must send b[4] to Tasks 0, 5, 8, and 9, and must get b[0], b[5], b[8], and b[9] to perform its own computation. The resulting task-interaction graph is shown on the next slide.


Sparse matrix-vector multiplication
[Figure: task-interaction graph for the sparse matrix-vector product]


Outline
• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
  – Processes Versus Processors

• Decomposition Techniques
  – Recursive Decomposition
  – Data Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition


3.1.3/ 3.1.4 Processes and Mapping
• In general, the number of tasks in a decomposition exceeds the number of processing elements available.
• We refer to the mapping as being from tasks to processes, as opposed to processors, because we aggregate tasks into processes and rely on the system to map these processes to physical processors. We use the term process simply to denote a collection of tasks and associated data.
• For this reason, a parallel algorithm must also provide a mapping of tasks to processes.


Processes and Mapping
• The mechanism by which tasks are assigned to processes for execution is called mapping.
  – For example, four processes could be assigned the task of computing one submatrix of C each in the matrix-multiplication computation.
• Mappings are determined by both the task dependency and task interaction graphs.
  – Task dependency graphs can be used to ensure that work is equally spread across all processes at any point.
  – Task interaction graphs can be used to make sure that processes need minimum interaction with other processes.


Processes and Mapping
An appropriate mapping must minimize parallel execution time by:
• Mapping independent tasks to different processes.
• Assigning tasks on the critical path to processes as soon as they become available.
• Minimizing interaction between processes by mapping tasks with dense interactions to the same process.
Note: These criteria often conflict with each other. For example, a decomposition into one task (or no decomposition at all) minimizes interaction but does not result in a speedup at all! Can you think of other such conflicting cases?


Processes and Mapping: Example

Mapping tasks in the database query decomposition to processes. These mappings were arrived at by viewing the dependency graph in terms of levels (no two nodes in a level have dependencies). Tasks within a single level are then assigned to different processes.


Processes and Mapping: Example

• A maximum of four processes can be employed, since the maximum degree of concurrency is only four. The last three tasks can be mapped arbitrarily.
• It makes sense to map tasks connected by an edge onto the same process, because this prevents an inter-task interaction from becoming an inter-process interaction. For example, in Figure (b), if Task 5 is mapped onto process P2, then both processes P0 and P1 will need to interact with P2. In the current mapping, only a single interaction between P0 and P1 suffices.

Outline
• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
  – Processes Versus Processors

• Decomposition Techniques
  – Recursive Decomposition
  – Data Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition


3.2 Decomposition Techniques
• So how does one decompose a task into various subtasks?
• While there is no single recipe that works for all problems, we present a set of commonly used techniques that apply to broad classes of problems. These include:
  – recursive decomposition
  – data decomposition
  – exploratory decomposition
  – speculative decomposition


3.2.1 Recursive Decomposition
• Generally suited to problems that are solved using the divide-and-conquer strategy.
• A given problem is first decomposed into a set of sub-problems.
• These sub-problems are recursively decomposed further until a desired granularity is reached.


Recursive Decomposition: Example
A classic example of a divide-and-conquer algorithm to which we can apply recursive decomposition is quicksort.

• Consider the problem of sorting a sequence A of n elements using the quicksort algorithm.
• Quicksort is a divide-and-conquer algorithm that starts by selecting a pivot element x and then partitions the sequence A into two subsequences A0 and A1 such that all the elements in A0 are smaller than x and all the elements in A1 are greater than or equal to x.
• Each of the subsequences A0 and A1 is sorted by recursively calling quicksort.
• The recursion terminates when each subsequence contains only a single element. A code sketch follows below.
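A minimal C sketch of this recursive decomposition (an assumption, not from the slides; OpenMP tasks stand in for the independent subtasks):

```c
static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Quicksort over A[lo..hi]: partition around pivot x, then sort the
 * two subsequences A0 (elements < x) and A1 (elements >= x) as two
 * independent tasks. */
void quicksort(int *A, long lo, long hi)
{
    if (lo >= hi) return;           /* subsequence of length <= 1 */
    int x = A[hi];                  /* pivot: last element        */
    long i = lo;
    for (long j = lo; j < hi; j++)  /* partition into A0 and A1   */
        if (A[j] < x) swap(&A[i++], &A[j]);
    swap(&A[i], &A[hi]);            /* pivot lands at position i  */
    #pragma omp task                /* A0 is an independent subtask */
    quicksort(A, lo, i - 1);
    #pragma omp task                /* A1 is an independent subtask */
    quicksort(A, i + 1, hi);
    #pragma omp taskwait
}
```

A caller would typically start the recursion inside `#pragma omp parallel` followed by `#pragma omp single`, so that one thread creates the root task and the remaining threads pick up subtasks as they are spawned.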


Recursive Decomposition: Example

In this example, once the list has been partitioned around the pivot, each sublist can be processed concurrently (i.e., each sublist represents an independent subtask). This can be repeated recursively.


Recursive Decomposition: Example
We define a task as the work of partitioning a given subsequence. Therefore, Figure 3.8 also represents the task graph for the problem. Initially, there is only one sequence (i.e., the root of the tree), and we can use only a single process to partition it. The completion of the root task results in two subsequences (A0 and A1, corresponding to the two nodes at the first level of the tree), and each one can be partitioned in parallel. Similarly, the concurrency continues to increase as we move down the tree.


Recursive Decomposition: Example
The problem of finding the minimum number in a given list can be fashioned as a divide-and-conquer algorithm.

procedure SERIAL_MIN (A, n)
begin
  min := A[0];
  for i := 1 to n − 1 do
    if (A[i] < min) min := A[i];
  endfor;
  return min;
end SERIAL_MIN


Recursive Decomposition: Example
We can rewrite the loop as follows:

procedure RECURSIVE_MIN (A, n)
begin
  if ( n = 1 ) then
    min := A[0];
  else
    lmin := RECURSIVE_MIN ( A, n/2 );
    rmin := RECURSIVE_MIN ( &(A[n/2]), n - n/2 );
    if (lmin < rmin) then
      min := lmin;
    else
      min := rmin;
    endelse;
  endelse;
  return min;
end RECURSIVE_MIN
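A runnable C rendering of this pseudocode (a sketch, assuming A holds ints; the two recursive calls are independent and could themselves be spawned as concurrent tasks):

```c
/* Recursively find the minimum of A[0..n-1] by splitting the array
 * in half, solving each half, and combining the two results. */
int recursive_min(const int *A, long n)
{
    if (n == 1)
        return A[0];
    long half = n / 2;
    int lmin = recursive_min(A, half);             /* left half  */
    int rmin = recursive_min(A + half, n - half);  /* right half */
    return (lmin < rmin) ? lmin : rmin;
}
```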


Recursive Decomposition: Example
• The code in the previous slide can be decomposed naturally using a recursive decomposition strategy.
• Finding the minimum number in the set {4, 9, 1, 7, 8, 11, 2, 12}.
• The task dependency graph associated with this computation is as follows:


3.2.2 Data Decomposition
• Identify the data on which computations are performed.
• Partition this data across various tasks.
• This partitioning induces a decomposition of the problem.
• Data can be partitioned in various ways; this critically impacts the performance of a parallel algorithm.


Data Decomposition: Output Data Decomposition
• Often, each element of the output can be computed independently of the others (simply as a function of the input).
• A partition of the output across tasks decomposes the problem naturally.


Output Data Decomposition: Example
Consider the problem of multiplying two n x n matrices A and B to yield matrix C. Using a 2 x 2 block partitioning of the matrices, the output matrix C can be partitioned into four tasks as follows:

Task 1: C1,1 = A1,1 B1,1 + A1,2 B2,1
Task 2: C1,2 = A1,1 B1,2 + A1,2 B2,2
Task 3: C2,1 = A2,1 B1,1 + A2,2 B2,1
Task 4: C2,2 = A2,1 B1,2 + A2,2 B2,2
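A hedged C sketch of this output decomposition (names and the OpenMP usage are assumptions): each of the four tasks computes one n/2 x n/2 block of C, and since the tasks write disjoint blocks they can run concurrently.

```c
/* Compute block (bi, bj) of C = A*B for n x n row-major matrices
 * partitioned into 2 x 2 blocks of size h = n/2 (n assumed even). */
static void block_task(const double *A, const double *B, double *C,
                       long n, long bi, long bj)
{
    long h = n / 2;
    for (long i = bi * h; i < (bi + 1) * h; i++)
        for (long j = bj * h; j < (bj + 1) * h; j++) {
            double sum = 0.0;   /* C[i][j]: full row-times-column sum */
            for (long k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

/* One task per output block; the four tasks touch disjoint parts of C. */
void matmul_four_tasks(const double *A, const double *B, double *C, long n)
{
    #pragma omp parallel for collapse(2)
    for (long bi = 0; bi < 2; bi++)
        for (long bj = 0; bj < 2; bj++)
            block_task(A, B, C, n, bi, bj);
}
```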


Output Data Decomposition: Example
A partitioning of output data does not result in a unique decomposition into tasks. For example, for the same problem as in the previous slide, with an identical output data distribution, we can derive the following two (other) decompositions:

Decomposition I                          Decomposition II

Task 1: C1,1 = A1,1 B1,1                 Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1          Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,1 B1,2                 Task 3: C1,2 = A1,2 B2,2
Task 4: C1,2 = C1,2 + A1,2 B2,2          Task 4: C1,2 = C1,2 + A1,1 B1,2
Task 5: C2,1 = A2,1 B1,1                 Task 5: C2,1 = A2,2 B2,1
Task 6: C2,1 = C2,1 + A2,2 B2,1          Task 6: C2,1 = C2,1 + A2,1 B1,1
Task 7: C2,2 = A2,1 B1,2                 Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2          Task 8: C2,2 = C2,2 + A2,2 B2,2


Output Data Decomposition: Example
• Consider the problem of counting the instances of given itemsets in a database of transactions.
• We are given a set T containing n transactions and a set I containing m itemsets.
• Find the number of times that each itemset in I appears in all the transactions.


Output Data Decomposition: Example
• The computation can be decomposed into two tasks by partitioning the output into two parts and having each task compute its half of the frequencies. Here the itemsets input has also been partitioned.
• The primary motivation for the decomposition is to have each task independently compute the subset of frequencies assigned to it.


Output Data Decomposition: Example
From the previous example, the following observations can be made:
• If the database of transactions is replicated across the processes, each task can be independently accomplished with no communication.
• If the database is partitioned across processes as well (for reasons of memory utilization), each task first computes partial counts. These counts are then aggregated at the appropriate task.


Input Data Partitioning
• In many cases, this is the only natural decomposition because the output is not clearly known a priori (e.g., the problem of finding the minimum in a list, sorting a given list, etc.).
• A task is associated with each input data partition. The task performs as much of the computation as it can with its part of the data. Subsequent processing combines these partial results.


Input Data Partitioning: Example
In the database counting example, the input (i.e., the transaction set) can be partitioned. This induces a task decomposition in which each task generates partial counts for all itemsets. These are combined subsequently for aggregate counts, as in the sketch below.
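A hedged C sketch of this input partitioning (the bitmask encoding of transactions and all names are illustrative assumptions): each task scans its own share of the transactions and produces partial counts for every itemset, and the partial counts are then aggregated.

```c
#include <stdlib.h>

/* Count how many of the n transactions contain each of the m itemsets.
 * Transactions and itemsets are encoded as bitmasks of items; an
 * itemset occurs in a transaction when all of its bits are present.
 * The transactions are split into ntasks partitions; each task fills
 * its own row of partial counts, which are aggregated at the end. */
void count_itemsets(const unsigned *trans, long n,
                    const unsigned *itemsets, long m,
                    long ntasks, long *count)
{
    long *partial = calloc((size_t)(ntasks * m), sizeof *partial);
    #pragma omp parallel for
    for (long t = 0; t < ntasks; t++) {
        long lo = t * n / ntasks, hi = (t + 1) * n / ntasks;
        for (long i = lo; i < hi; i++)       /* task t's transactions */
            for (long j = 0; j < m; j++)
                if ((trans[i] & itemsets[j]) == itemsets[j])
                    partial[t * m + j]++;    /* partial count */
    }
    for (long j = 0; j < m; j++) {           /* aggregation step */
        count[j] = 0;
        for (long t = 0; t < ntasks; t++)
            count[j] += partial[t * m + j];
    }
    free(partial);
}
```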


Contacts
High Performance Computing, 2016/2017
Dr. Mohammed Abdel-Megeed M. Salem
Faculty of Computer and Information Sciences, Ain Shams University
Abbassia, Cairo, Egypt
Tel.: +2 011 1727 1050
Email: [email protected]
Web: https://sites.google.com/a/fcis.asu.edu.eg/salem

