Even Partition Communication Optimization Efficient Implementation

Parallel Algorithm Principles Pangfeng Liu National Taiwan University

March 6, 2015

Principles

There are three basic principles for improving the efficiency of parallel computing:
- Even partition
- Communication reduction
- Efficient implementation

Even Workload Distribution Proper Granularity

Partition

Partition is an essential parallel algorithm design technique. As in a sequential divide-and-conquer algorithm, the problem is first partitioned (divided) into sub-problems. Unlike a sequential divide-and-conquer algorithm, a parallel algorithm solves (conquers) the sub-problems in parallel. Some communication may be necessary, since the sub-problems may depend on each other or may need to transfer data among themselves. Finally, the answers from the individual sub-problems are combined into the final answer.

Partition

Partition is the first step in a divide-and-conquer algorithm. One can partition the data, which is called “data partition”, or one can partition the main loop of the computation, which is called “loop partition”. The partition has a significant impact on the overall performance.

Discussion

Give an example of divide-and-conquer computation.

Partition Principles

There are two important issues in partitioning:
- Even workload distribution
- Proper granularity

Even Workload Distribution

We want to distribute the workload among processors so that the maximum workload among processors is minimized. The execution time of a parallel program is the execution time of the slowest processor involved, which is usually the processor that has the maximum workload. This is the “makespan” of the execution. Note that we are interested in the makespan, not the sum or average of the execution times of the processors.
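
To make the makespan concrete, here is a minimal sketch. The function name `makespan` and the task times are illustrative assumptions, not part of the lecture. It compares two ways of placing the same six tasks on two processors.

```python
def makespan(assignment):
    """Return the finishing time of the slowest processor.

    `assignment` has one entry per processor; each entry is a list of
    (hypothetical) task execution times assigned to that processor.
    """
    return max(sum(tasks) for tasks in assignment)


# The same six tasks (total work 16) placed in two different ways.
uneven = [[4, 4, 3, 3], [1, 1]]   # loads 14 and 2
even = [[4, 3, 1], [4, 3, 1]]     # loads 8 and 8

print(makespan(uneven))  # 14
print(makespan(even))    # 8
```

The sum of the work is 16 in both cases; only the maximum per-processor load changes.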

Idle vs. Busy

Uneven distribution of workload leaves some processors idle while others are busy. If every processor is busy all the time, then the workload is evenly distributed.

Workload Estimation

In order to distribute the workload evenly, one needs to predict the workload accurately. For data parallel computation, one can associate the computation with the data. If we further assume that the computation workload on every data item is about the same, then we can estimate the workload by counting the number of data items assigned to each processor.
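
Under the equal-cost assumption above, balancing reduces to giving every processor nearly the same number of items. The sketch below is one simple way to do that; the helper name `block_sizes` is hypothetical.

```python
def block_sizes(num_items, num_procs):
    """Split `num_items` so that per-processor counts differ by at most one."""
    base, extra = divmod(num_items, num_procs)
    # The first `extra` processors take one additional item each.
    return [base + 1 if rank < extra else base for rank in range(num_procs)]


print(block_sizes(32, 2))  # [16, 16]
print(block_sizes(10, 4))  # [3, 3, 2, 2]
```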

Workload Estimation

For task parallel computation, we must predict the workload of sub-problems. It is difficult to estimate the workload of tasks, so profiling or programmer intervention is necessary.

Discussion

Give an example to illustrate the importance of even workload distribution.

Granularity

The granularity is the basic unit of partitioning. For data parallel computation, it is the smallest chunk of data used when assigning data to processors. For task parallel computation, it is the smallest chunk of work used when assigning tasks to processors. Recall that we can always refine a step of our algorithm into finer steps.

The Size

It is easier to balance the workload if the granularity is small, because it is easier to distribute a set of objects evenly if we can cut them into small pieces. However, there will be much more overhead, not only in assigning these chunks to processors, because the mapping table will be larger, but also in scheduling and synchronizing the processors, because the number of these operations will increase. More details on communication follow later.
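
The trade-off can be illustrated with a small simulation. This is a sketch under stated assumptions: the item costs are made up, and chunks are handed greedily to the least-loaded processor, which is one possible scheduling policy rather than the lecture's method. The number of chunks stands in for the assignment and scheduling overhead.

```python
import heapq


def chunk(item_costs, size):
    """Group consecutive items into chunks of `size` and sum their costs."""
    return [sum(item_costs[i:i + size]) for i in range(0, len(item_costs), size)]


def greedy_makespan(chunk_costs, num_procs):
    """Give each chunk to the currently least-loaded processor."""
    loads = [0] * num_procs
    heapq.heapify(loads)
    for cost in chunk_costs:
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + cost)
    return max(loads)


# Hypothetical workload: 16 items, the first four are expensive (total 44).
items = [8, 8, 8, 8] + [1] * 12
for size in (8, 4, 1):
    chunks = chunk(items, size)
    # chunk size, number of chunks to manage, resulting makespan
    print(size, len(chunks), greedy_makespan(chunks, 4))
# prints:
# 8 2 36
# 4 4 32
# 1 16 11
```

With the smallest chunks the makespan reaches the ideal 44 / 4 = 11, but there are 16 chunks to assign instead of 2.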

Fine and Coarse

Fine-grain parallelism partitions data/tasks into very small pieces, then assigns them to processors for processing. It is suitable for systems that can spawn a large number of threads at low cost, e.g., GPUs.

Coarse-grain parallelism partitions data/tasks into very large pieces, then assigns them to processors for processing. It is suitable for systems that can spawn only a limited number of threads and where thread creation is expensive, e.g., CPUs.

Discussion

Give an example to illustrate the importance of granularity in partitioning workload.

Synchronization Data Locality

Communication Reduction

Communication is inevitable because multiple processors are working on the same problem. Communication is overhead – it does not appear in a sequential computation. Communication should be reduced.

Principles

There are two basic principles to reduce communication:
- Low synchronization overheads
- Data locality

Synchronization

Synchronization is inevitable in parallel and distributed computing because we need to coordinate the processors. The following slides discuss:
- Barrier synchronization
- Before/after synchronization
- Access synchronization

Barrier Synchronization

A computation may proceed in stages: all processors need to finish a stage before going on to the next stage. This is usually called a barrier synchronization. For example, all processors must combine their partial answers into the final answer. A barrier usually involves all processors.
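
A two-stage computation with a barrier can be sketched with Python threads standing in for processors. The function name `staged_sum` and the data are illustrative assumptions; `threading.Barrier` is the standard-library barrier.

```python
import threading


def staged_sum(data, num_workers=4):
    barrier = threading.Barrier(num_workers)
    partial = [0] * num_workers   # written in stage 1
    totals = [0] * num_workers    # written in stage 2

    def worker(rank):
        # Stage 1: each worker sums its own share of the data.
        partial[rank] = sum(data[rank::num_workers])
        # Nobody enters stage 2 until every worker has finished stage 1.
        barrier.wait()
        # Stage 2: all partial sums are now complete and safe to read.
        totals[rank] = sum(partial)

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals


print(staged_sum(list(range(100))))  # [4950, 4950, 4950, 4950]
```

Without the `barrier.wait()` call, a fast worker could read partial sums that slower workers have not written yet.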

Before/After Synchronization

In task parallelism one computation may need to precede another. You need to cook dinner before you can eat it. This may be referred to as before/after synchronization. This usually involves two processors – one processor finishes a computation, then notifies the other processor to proceed.
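
The dinner example can be sketched with two threads and a `threading.Event` used as the notification. This is an illustrative stand-in for whatever mechanism a real system would use; the function and variable names are hypothetical.

```python
import threading


def dinner():
    ready = threading.Event()
    log = []

    def cook():
        log.append("cook dinner")
        ready.set()        # notify the waiting thread that dinner is ready

    def eat():
        ready.wait()       # block until cook() has signalled
        log.append("eat dinner")

    eater = threading.Thread(target=eat)
    chef = threading.Thread(target=cook)
    eater.start()          # started first on purpose; it still has to wait
    chef.start()
    eater.join()
    chef.join()
    return log


print(dinner())  # ['cook dinner', 'eat dinner']
```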

Access Synchronization

Many processors may need to access a shared variable in a shared-memory multiprocessor. This is not an issue for a distributed-memory multicomputer, since the computers do not share memory.

If the memory access is not synchronized properly, a race condition may occur.
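
As a small sketch of access synchronization, the threads below increment one shared counter, and a lock makes each increment a critical section. Threads stand in for processors, and the names are illustrative assumptions.

```python
import threading


def count_with_lock(num_threads=4, increments=10000):
    counter = {"value": 0}      # the shared variable
    lock = threading.Lock()

    def add():
        for _ in range(increments):
            with lock:          # critical section: one thread at a time
                counter["value"] += 1

    threads = [threading.Thread(target=add) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


print(count_with_lock())  # 40000
```

The read-modify-write in `counter["value"] += 1` is not atomic, so removing the lock permits lost updates, although a given run may still happen to produce the right total.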

Synchronization Mechanism

Many parallel programming environments provide mechanisms for programs to specify synchronization explicitly. The synchronization should be efficient. The synchronization should be scalable, i.e., it should be efficient even if the number of processors involved is large.

Discussion

Give an example for each kind of synchronization described earlier.

Synchronization Mechanism

One can use message passing or shared memory to implement barrier synchronization within the same computer. One can use signals, an inter-process communication mechanism, to implement before/after synchronization within the same computer. One can use busy waiting or semaphores to implement the critical section for accessing shared variables. If processors on different computers are involved in the synchronization, one needs to use a network protocol to implement it.

Synchronization Optimization

The number of stages should be reduced. The synchronization should be efficient. The granularity should be carefully chosen to balance the overhead in synchronization and workload distribution. A fine-grain parallel computation is hard to synchronize, but easy to have even workload. A coarse-grain parallel computation is easy to synchronize, but hard to have even workload.

Discussion

Describe the inter-process communication (IPC) mechanisms that you are aware of.

Data Locality

Locality is the tendency of a program to access data/instructions in proximity. When a program accesses a data item or instruction, it is very likely that it will access the same data/instruction in the near future, or that it will access nearby data/instructions in the near future. Computer architecture exploits locality for performance.

Temporal Locality

When a program accesses a data item or instruction, it is very likely that it will access the same data/instruction again in the near future. If we cache this data/instruction in fast storage, then it is very likely that we will be able to access it fast. Data and instructions are cached in data and instruction caches for performance. The CPU first tries to get the data from the cache. If it is found, the CPU uses it; otherwise the CPU gets the data from memory. There could be several levels of caches.

Performance

The performance comes from the difference in access speed between memory and cache, and from the probability of finding the data/instruction in cache. If we can find the data/instruction in cache with high probability, i.e., with a high cache hit rate, then the performance will be improved. If the temporal locality is good, which means the same data/instruction is likely to be used again in the near future, then we have good performance.
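
The effect of the hit rate can be sketched with a simple two-level cost model. The model and the numbers are assumptions made for illustration: a hit costs `cache_time`, a miss costs `memory_time`, and real hierarchies have more levels.

```python
def average_access_time(hit_rate, cache_time, memory_time):
    """Expected time per access under a simple two-level model.

    Assumption: a hit costs `cache_time`, a miss costs `memory_time`.
    """
    return hit_rate * cache_time + (1.0 - hit_rate) * memory_time


# Hypothetical costs: 1 time unit for cache, 100 for memory.
for hit_rate in (0.5, 0.9, 0.99):
    print(hit_rate, average_access_time(hit_rate, 1, 100))
```

With these numbers, raising the hit rate from 0.9 to 0.99 cuts the average access time from about 10.9 to about 1.99 time units.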

In the Near Future?

What do we mean by “in the near future”? The capacity of a cache is extremely limited. When we access a data item or instruction, we have to place it into the cache for possible later references. If the cache is full, then some data/instructions have to be removed to make space for the incoming ones. “In the near future” means that when we want to access the data/instruction we placed into the cache again, it is still there, i.e., it has not yet been removed to make room for other data/instructions.
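
A tiny cache simulation shows what “near” means in practice. The sketch assumes a least-recently-used (LRU) replacement policy and a capacity of at least one item; the lecture does not name a policy, so LRU is an assumption here, as are the access sequences.

```python
from collections import OrderedDict


def count_hits(accesses, capacity):
    """Count cache hits for an LRU cache holding `capacity` items."""
    cache = OrderedDict()
    hits = 0
    for addr in accesses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[addr] = True
    return hits


soon = [1, 1, 2, 2, 3, 3, 4, 4]   # each reuse comes immediately
late = [1, 2, 3, 4, 1, 2, 3, 4]   # each reuse comes after three other items

print(count_hits(soon, 2))  # 4
print(count_hits(late, 2))  # 0
print(count_hits(late, 4))  # 4
```

Both sequences reuse every item once, but with room for only two items the second sequence reuses each item too late: it has already been evicted.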

Other Applications

Hard disks maintain a small cache for data stored on the disk. Operating systems maintain a disk cache for frequently accessed data on disk. A translation lookaside buffer (TLB) is a cache for frequently accessed entries in the page table.

Discussion

Give an example of temporal locality.

Spatial Locality

When a program accesses a data item or instruction, it is likely to access nearby data/instructions in the near future. If we cache the nearby data/instructions in fast storage, then it is very likely that we will be able to access them fast. Parallel processing focuses on spatial data locality.

Cache Line

Modern computer architecture does not cache data individually; instead it caches data/instructions in units of cache lines. A cache line consists of consecutive data/instructions in memory. Nearby data/instructions are automatically cached for spatial locality. Parallel programmers preserve data locality at a much higher “data level” when partitioning the data into chunks for processing.

Data Level Locality

When we assign data to processors for processing, we not only want to distribute them evenly, we also want to preserve spatial data locality. That means when we want to process a data item, the required data is nearby. What is required data? What is “nearby”?

Required Data

When we process a data item, we usually need other data. For example, suppose we want to compute vector C, which is the sum of two vectors A and B. We need A_i and B_i to compute C_i, so A_i and B_i are the required data of C_i.

Owner

We usually follow an “owner computes” rule. If a processor is the owner of a data item, i.e., the data item is assigned to this processor, then it is responsible for the computation on this data item. The rule is simple and straightforward. On rare occasions we will not follow the “owner computes” rule.

Placement

If the length of the vector is 32 and we have two processors, how do we assign data to processors? Intuitively, we can place the first 16 elements of A, B, and C on one processor, and the rest on the other processor. The workload of computing C is evenly distributed, because each processor computes 16 elements of C. When a processor computes a C_i, it can get all the required data from its own memory.

Wrong Placement

Again, suppose the length of the vector is 32 and we have two processors. This time we place the first 16 elements of A and B and the last 16 elements of C on one processor, and the rest on the other processor. The workload of computing C is still evenly distributed, because each processor computes 16 elements of C. However, when a processor computes a C_i, it cannot get any of the required data from its own memory. Is this good?
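
The two placements of the 32-element vectors can be compared by counting remote reads under the owner-computes rule. This is a minimal sketch; the function name and the list-of-owners representation are illustrative assumptions.

```python
def remote_accesses(owner_a, owner_b, owner_c):
    """Count remote reads for C[i] = A[i] + B[i] under owner-computes.

    Each argument maps index i to the processor that stores that element;
    the owner of C[i] performs the addition.
    """
    remote = 0
    for i, proc in enumerate(owner_c):
        remote += (owner_a[i] != proc) + (owner_b[i] != proc)
    return remote


n = 32
first_half = [0 if i < n // 2 else 1 for i in range(n)]  # first 16 on P0
swapped = [1 - p for p in first_half]                    # last 16 on P0

# Placement 1: A, B, and C are all split the same way.
print(remote_accesses(first_half, first_half, first_half))  # 0
# Placement 2: C is split the opposite way from A and B.
print(remote_accesses(first_half, first_half, swapped))     # 64
```

Both placements give each processor 16 elements of C to compute, so the workload is equally balanced; only the number of remote reads differs.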

Nearby

“Nearby” means in the same processor. We can access required data in the memory of the same processor at memory bandwidth. We can access required data in the memory of another processor only at network bandwidth. Memory bandwidth is much larger than network bandwidth.

Local vs. Remote

We use local memory to mean the memory of the same processor, and remote memory to mean the memory of other processors. We conclude that local memory is much faster than remote memory. This distinction applies only to distributed-memory multicomputers.

Goal

If most of the required data is “nearby”, then we have good performance. That is, we want to make sure that most of the required data is nearby, i.e., in local memory, when we partition data among processors for computation. Note that we say “most” because sometimes it is impossible to partition data so that all data access is local.

Discussion

Give an example of spatial locality.

Matrix Multiplication

We multiply matrices A and B to get C. The required data of C_ij are the i-th row of A and the j-th column of B. If we insist that all required data must be in local memory, then everything will be in one processor! This is against the principle of even workload distribution.

Proof

C_ij has to be in the same processor as the i-th row of A and the j-th column of B. C_kl has to be in the same processor as the k-th row of A and the l-th column of B. Then C_kj has to be in the same processor as the k-th row of A and the j-th column of B.

Proof

This implies that C_ij and C_kj have to be in the same processor, because they are both in the same processor as the j-th column of B. Similarly, C_kj and C_kl have to be in the same processor, because they are both in the same processor as the k-th row of A. We conclude that C_ij must be in the same processor as C_kl, for any i, j, k, and l. Finally, all data will be in the same processor, which is bad.

Best Effort

If most of the required data is in local memory, then we have good performance. We would like to increase the percentage of accesses that go to local memory, on a best-effort basis. The data has to be carefully partitioned to preserve locality.

Communication-to-Computation Ratio

Another way to understand data locality is through the communication-to-computation ratio. The amount of computation is roughly the same across different data partitionings. The amount of communication is proportional to the amount of remote data, because local data do not incur communication. If the communication-to-computation ratio is small, then we have small communication overheads, which means we have good data locality.

Discussion

Give an example of good locality and another example of bad locality for the same problem, due to different partitioning methods.

Surface to Volume Ratio

Sometimes we use a surface-to-volume ratio to explain communication-to-computation ratio. We now consider the entire data as an object, and data partitioning is a way to cut the object into pieces.

Neighbors

In many computations the required data are the neighboring data. In an array, the neighbors of an element are those whose indices differ from the element's by 1. In a graph, the neighbors of a node are the nodes adjacent to it. In a graphics computation, the neighbors of a pixel are the pixels adjacent to it.

Neighbors

In a table for dynamic programming, the value of an element is usually determined by elements whose indices differ from its own by 1. In a page ranking algorithm, the value of a node is determined by its neighboring nodes. In a graphics relaxation problem, the new value of a pixel is determined by its eight neighbors.

Discussion

Give an example of computation that uses neighbors.

Pieces

We can use the “volume” of a piece to represent the number of data items in a piece, which in turn represents the amount of computation. We assume that the amount of workload is about the same for all data items.

We can also use the “surface area” of a piece to represent the number of required data items for a piece, which in turn represents the amount of communication. We assume that the required data are on the surface of the pieces.

Surface-to-volume Ratio

Now we can easily relate the communication-to-computation ratio to the surface-to-volume ratio. If we want a small communication-to-computation ratio, then we must partition the data into pieces that have a small surface-to-volume ratio. Surface area is communication. Volume is computation.

Discussion

Give an example of surface-to-volume ratio. If an object has a large surface-to-volume ratio, is it easier or harder to cool down? How does that relate to communication costs?

An Example

We are given a 32 by 32 by 32 matrix, and we would like to update each cell to be the average of its six neighbors, using 8 processors. We have two choices:
- Cut the matrix into eight 16 by 16 by 16 cubes.
- Cut the matrix into eight 4 by 32 by 32 slates.

Cubes

The volume of a cube is 16 × 16 × 16 = 4k. The surface area of a cube is 6 × 16 × 16 = 1.5k. The surface-to-volume ratio is 1.5/4 = 3/8. This means that for the computation on each data item, a processor needs to access remote memory 3/8 times on average.

Slates

The volume of a slate is 4 × 32 × 32 = 4k. The surface area of a slate is 2 × 32 × 32 + 4 × 32 × 4 = 2.5k. The surface-to-volume ratio is 2.5/4 = 5/8. This means that for the computation on each data item, a processor needs to access remote memory 5/8 times on average, which is more than the 3/8 obtained when cutting into cubes.
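
The arithmetic for both shapes can be checked with a few lines of code. As in the slides, this sketch counts all six faces of a piece, including faces on the outer boundary of the matrix; the function name is an illustrative assumption.

```python
from fractions import Fraction


def surface_to_volume(x, y, z):
    """Surface-to-volume ratio of an x by y by z box, as an exact fraction."""
    volume = x * y * z
    surface = 2 * (x * y + y * z + x * z)
    return Fraction(surface, volume)


print(surface_to_volume(16, 16, 16))  # 3/8
print(surface_to_volume(4, 32, 32))   # 5/8
```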

Lessons

The surface-to-volume ratio is a reasonable estimate of the communication-to-computation ratio. It is intuitive to partition the data into chunks so that the surface, i.e., communication, is minimized. For example, if we partition the data in a checkerboard pattern, the surface-to-volume ratio will be very large, and data locality will be poor.

Discussion

Describe the difference in sizes of similar animals that live in tropical or Arctic area.

Synchronization Data Transfer

Efficiency

How to synchronize processors efficiently?
- Global synchronization
- Point-to-point synchronization

How to transfer data efficiently?
- Batch mode message passing
- Overlap communication with computation
- Exploit the memory hierarchy

Global Synchronization

Reduction
- Every processor has a value for the solution of its sub-problem, and we want to compute the sum of these values.
- Every processor has a value for the solution of its sub-problem, and we want to compute the minimum of these values.
- A reduction also serves as a barrier synchronization.

Barrier synchronization
- One can think of a barrier synchronization as a special form of reduction in which no value is exchanged.

Tree Optimization

We can ask one processor to coordinate the synchronization. This is inherently sequential, and the coordinator is the bottleneck.

Or we can organize the process as a tree. We partition the processors into two subsets. The two subsets recursively synchronize themselves in parallel. Finally, the two subsets synchronize with each other. More details in later lectures.
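
The tree organization can be sketched as a sequential simulation that counts rounds. It is written bottom-up, pairing neighbors in each round, which builds the same kind of tree as the recursive split described above. The function name is hypothetical and the input is assumed to be non-empty.

```python
def tree_reduce(values, combine):
    """Combine one value per processor in a tree; return (result, rounds)."""
    values = list(values)
    rounds = 0
    while len(values) > 1:
        # All pairs within one round are independent and could run in parallel.
        paired = [combine(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2 == 1:
            paired.append(values[-1])   # the odd one out waits for the next round
        values = paired
        rounds += 1
    return values[0], rounds


print(tree_reduce(range(1, 9), lambda a, b: a + b))  # (36, 3)
print(tree_reduce([5, 2, 9, 1, 7], min))             # (1, 3)
```

Eight values need 3 parallel rounds in the tree, whereas a single coordinator would perform 7 combining steps one after another.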

Two Party Synchronization

In a multiprocessor environment, a critical section or semaphore may not be the best synchronization solution. Unlike in a uniprocessor environment, the overhead of a critical section or semaphore is very high in a multiprocessor environment. Therefore we sometimes prefer spin-locks in a multiprocessor environment, e.g., for Linux kernel data structures.

Transfer Efficiency

In many low-level parallel programming environments (e.g., OpenCL, CUDA, or MPI), programmers can explicitly control how data is transferred among processors. In these environments the programmer can apply the following techniques to improve data transfer efficiency:
- Batch mode message sending
- Overlap computation with communication
- Exploit the memory hierarchy

Batch Mode

Many message passing systems are built on top of network protocols such as TCP/IP. These protocols have a fixed start-up overhead, e.g., to establish a connection in TCP/IP. If we send a large amount of data through a connection, then the start-up overhead is amortized over the data being transferred, which means we should transfer data in large quantities.
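
The amortization argument can be sketched with a linear cost model: a fixed start-up cost per message plus a per-item cost. The model, the function name, and the numbers are assumptions for illustration, not measurements.

```python
def transfer_time(num_items, batch_size, startup, per_item):
    """Total time to send `num_items` in messages of `batch_size` items."""
    num_messages = -(-num_items // batch_size)   # ceiling division
    return num_messages * startup + num_items * per_item


# Hypothetical costs: 100 time units to start a message, 1 per item.
print(transfer_time(1000, 1, 100, 1))     # 101000
print(transfer_time(1000, 100, 100, 1))   # 2000
print(transfer_time(1000, 1000, 100, 1))  # 1100
```

The per-item cost is the same in all three cases; only the number of start-ups changes.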

Overlap Communication with Computation

It is beneficial to have a large number of threads so that when a thread is waiting for data, other threads can use CPU resources for computation. For example, on a GPU the large number of running threads can hide memory latency, i.e., when a thread is waiting for memory, other threads can use the ALUs for computation. This requires a large number of threads and a flexible scheduler to schedule them. This relieves the burden on the cache. More details in later lectures.

Exploit Memory Hierarchy

In some parallel programming environments (e.g., CUDA and OpenCL), the programmer is free to move data within the memory hierarchy. The processing units of a GPU have fast, small local memory and share a slow, large global memory. CUDA and OpenCL programmers must explicitly move data between global and local memory to achieve performance. This is a tedious and error-prone process. More details in later lectures.
