High Performance Computing
For senior undergraduate students

Lecture 9: Analytical Modeling of Parallel Systems
20.12.2016

Dr. Mohammed Abdel-Megeed Salem
Scientific Computing Department, Faculty of Computer and Information Sciences, Ain Shams University

Outlook
• Sources of Overhead in Parallel Programs
  – Interprocess Interaction, Idling, Excess Computation
• Performance Metrics for Parallel Systems
  – Execution Time
  – Total Parallel Overhead
  – Speedup
  – Efficiency
  – Cost
• The Effect of Granularity on Performance
  – Adding n numbers on p processing elements
  – Adding n numbers cost-optimally


Terms

• Interprocess interactions: processing elements need to communicate with each other.
• Idling: processes may idle because of load imbalance, synchronization, or a serial component of the work.
• Excess computation: computation performed by the parallel program but not by the serial version.


Execution Time
• The serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer.
• The parallel runtime is the time that elapses from the moment the first processor starts to the moment the last processor finishes execution.
• We denote the serial runtime by TS and the parallel runtime by TP.



Performance Metrics for Parallel Systems: Total Parallel Overhead
• Let Tall be the total time collectively spent by all the processing elements: Tall = p TP, where p is the number of processing elements.
• TS is the serial time.
• The total overhead is To = p TP - TS (computed in the sketch below).
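To make the definition concrete, here is a minimal Python sketch (the helper name and sample timings are hypothetical, not from the slides):

```python
def total_overhead(t_serial, t_parallel, p):
    """Total overhead T_o = p * T_P - T_S: the time collectively spent
    by all processing elements beyond what the serial algorithm needs."""
    return p * t_parallel - t_serial

# Hypothetical timings: 100 s serially, 30 s on p = 4 processing elements.
print(total_overhead(100.0, 30.0, 4))  # 4 * 30 - 100 = 20 s of overhead
```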


Performance Metrics for Parallel Systems: Speedup
• Speedup (S) is the ratio of the time taken to solve a problem on a single processing element to the time required to solve the same problem on a parallel computer with p identical processing elements: S = TS / TP (see the sketch below).
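A one-line sketch of this definition, reusing the hypothetical timings from above:

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_S / T_P."""
    return t_serial / t_parallel

print(speedup(100.0, 30.0))  # ~3.33 on 4 processing elements
```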


Performance Metrics: Efficiency
• Efficiency is a measure of the fraction of time for which a processing element is usefully employed.
• Mathematically, it is given by E = S / p = TS / (p TP) (see the sketch below).
• Following the bounds on speedup, efficiency can be as low as 0 and as high as 1.
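Continuing the hypothetical timings from the earlier sketches:

```python
def efficiency(t_serial, t_parallel, p):
    """Efficiency E = S / p = T_S / (p * T_P)."""
    return t_serial / (p * t_parallel)

# ~0.83: on average each of the 4 PEs is usefully employed 83% of the time.
print(efficiency(100.0, 30.0, 4))
```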


Performance Metrics: Efficiency
• The speedup of adding n numbers on n processing elements is given by S = Θ(n / log n).
• Efficiency is given by E = S / p = Θ((n / log n) / n) = Θ(1 / log n), as illustrated numerically below.
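A quick numeric illustration of how this efficiency decays as n grows (constant factors hidden inside the Θ terms are ignored):

```python
import math

# Adding n numbers on n PEs: S ~ n / log2(n), E ~ 1 / log2(n).
for n in (64, 1024, 2**20):
    print(f"n={n:>8}: S ~ {n / math.log2(n):10.1f}, E ~ {1 / math.log2(n):.3f}")
```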


Cost of a Parallel System
• Cost is the product of the parallel runtime and the number of processing elements used (p x TP).
• Cost reflects the sum of the time that each processing element spends solving the problem.
• A parallel system is said to be cost-optimal if the cost of solving a problem on a parallel computer is asymptotically identical to the serial cost.
• Since efficiency E = TS / (p TP), cost-optimal systems have E = Θ(1) (see the sketch below).
• Cost is sometimes referred to as work or the processor-time product.
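A minimal sketch of the cost definition (the function name is ours):

```python
def cost(p, t_parallel):
    """Cost (work, processor-time product): p * T_P."""
    return p * t_parallel

# Cost-optimality means cost grows no faster than T_S,
# i.e. E = T_S / (p * T_P) = Theta(1).
print(cost(4, 30.0))  # 120 processor-seconds for the hypothetical timings
```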



Cost of a Parallel System: Example
Consider the problem of adding n numbers on n processing elements.
• We have TP = Θ(log n) (for p = n).
• The cost of this system is given by p TP = Θ(n log n).
• Since the serial runtime of this operation is Θ(n), the algorithm is not cost-optimal; the sketch below shows the cost/serial ratio growing with n.
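A short numeric check that the cost/serial-work ratio grows without bound, so the system cannot be cost-optimal:

```python
import math

# p * T_P = Theta(n log n) versus T_S = Theta(n): the ratio is log n.
for k in (10, 15, 20):
    n = 2**k
    print(f"n=2^{k:2}: cost/serial ~ log2(n) = {math.log2(n):.0f}")  # grows with n
```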


Impact of Non-Cost Optimality
Consider a sorting algorithm that uses n processing elements to sort a list of n numbers in time Θ((log n)^2).
• Since the serial runtime of a (comparison-based) sort is Θ(n log n), the speedup and efficiency of this algorithm are given by Θ(n / log n) and Θ(1 / log n), respectively.
• The p TP product of this algorithm is Θ(n (log n)^2).
• This algorithm is not cost-optimal, but only by a factor of log n.


Impact of Non-Cost Optimality
• If p < n, assigning the n tasks to p processing elements gives TP = n (log n)^2 / p.
  – This follows from the fact that if n processing elements take time (log n)^2, then one processing element would take time n (log n)^2, and p processing elements would take time n (log n)^2 / p.
• The corresponding speedup of this formulation is S = TS / TP = n log n / (n (log n)^2 / p) = p / log n.
• This speedup goes down as the problem size n is increased for a given p!
• Consider the problem of sorting 1024 numbers (n = 1024, log n = 10) on 32 processing elements. The expected speedup is only p / log n = 3.2. This number gets worse as n increases: for n = 10^6, log n ≈ 20 and the speedup is only 1.6 (verified in the sketch below).
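The slide's numbers are easy to reproduce (the slide rounds log2(10^6) ≈ 19.93 up to 20):

```python
import math

p = 32
for n in (1024, 10**6):
    print(f"n={n:>7}: S ~ {p / math.log2(n):.2f}")
# n=   1024: S ~ 3.20
# n=1000000: S ~ 1.61
```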



Effect of Granularity on Performance
• Often, using fewer processing elements improves the performance of parallel systems.
• Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down a parallel system.
• A naive way of scaling down is to think of each processor in the original case as a virtual processor and to assign virtual processors equally to the scaled-down physical processors.


Effect of Granularity on Performance
• Since the number of processing elements decreases by a factor of n / p, the computation at each processing element increases by a factor of n / p:
  – each processing element now performs the work of n / p processing elements.
• The communication cost need not increase by this factor, since some of the virtual processors assigned to the same physical processor can exchange data locally. This is the basic reason for the improvement from building granularity.


Building Granularity
• Consider the problem of adding n numbers on p processing elements such that p < n, and both n and p are powers of 2.
• Use the parallel algorithm for n processors, except that, in this case, we think of them as virtual processors.
• Each of the p processors is now assigned n / p virtual processors.
• The first log p of the log n steps of the original algorithm are simulated in (n / p) log p steps on p processing elements.
• The subsequent log n - log p steps do not require any communication (see the step-count sketch below).
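A sketch of this step accounting (a serial simulation of the counts, not a real parallel run; the function name is ours):

```python
import math

def naive_scaled_steps(n, p):
    """Steps when p physical PEs each simulate n/p virtual PEs:
    the first log p steps cost n/p sub-steps each; the remaining
    log n - log p steps are purely local."""
    return (n // p) * int(math.log2(p)) + (int(math.log2(n)) - int(math.log2(p)))

print(naive_scaled_steps(1024, 32))  # (1024/32)*5 + 5 = 165 ~ Theta((n/p) log p)
```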


Building Granularity
• The overall parallel execution time of this parallel system is Θ((n / p) log p).
• The cost is Θ(n log p), which is asymptotically higher than the Θ(n) cost of adding n numbers sequentially. Therefore, this parallel system is not cost-optimal.


Building Granularity
Can we build granularity in this example in a cost-optimal fashion?
• Each processing element locally adds its n / p numbers in time Θ(n / p).
• The p partial sums on p processing elements can then be added in time Θ(log p).

[Figure: a cost-optimal way of computing the sum of 16 numbers using four processing elements; the sketch below simulates it.]
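A serial sketch of this two-phase scheme (it assumes p divides n and p is a power of 2):

```python
def cost_optimal_sum(nums, p):
    """Phase 1: each of the p PEs adds its n/p local numbers (Theta(n/p)).
    Phase 2: the p partial sums are combined pairwise in log p
    tree-reduction steps (Theta(log p)). Simulated serially here."""
    chunk = len(nums) // p
    partial = [sum(nums[i * chunk:(i + 1) * chunk]) for i in range(p)]
    while len(partial) > 1:
        partial = [partial[i] + partial[i + 1] for i in range(0, len(partial), 2)]
    return partial[0]

print(cost_optimal_sum(list(range(16)), 4))  # 120, matching the 16-number figure
```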


Building Granularity
• The parallel runtime of this algorithm is TP = Θ(n / p + log p).
• The cost is p TP = Θ(n + p log p).
• This is cost-optimal so long as n = Ω(p log p)! (Checked numerically in the sketch below.)
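A quick numeric check of the cost-optimality condition (p is a sample value of our choosing):

```python
import math

# cost = Theta(n + p log p); once n >= p log p the n term dominates
# and cost / T_S = cost / Theta(n) stays bounded.
p = 64
for n in (p * int(math.log2(p)), 2**20):  # n = p log p, then n >> p log p
    print(f"n={n:>7}: cost/n ~ {(n + p * math.log2(p)) / n:.2f}")
# n=    384: cost/n ~ 2.00
# n=1048576: cost/n ~ 1.00
```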


Contacts
High Performance Computing, 2016/2017
Dr. Mohammed Abdel-Megeed M. Salem
Faculty of Computer and Information Sciences, Ain Shams University
Abbassia, Cairo, Egypt
Tel.: +2 011 1727 1050
Email: [email protected]
Web: https://sites.google.com/a/fcis.asu.edu.eg/salem

1Cloud computing and Distributed Systems (CLOUDS) Laboratory. Department .... use of Cloud computing in computational science is still limited, but ..... Linux based systems. Being a .... features such as support for file transfer and resource.