Optimizing GPGPU Kernel Summation for Performance and Energy Efficiency
Jiajun Wang, Ahmed Khawaja, George Biros, Andreas Gerstlauer, Lizy K. John
The University of Texas at Austin

Introduction
• Fundamental problem in computational physics, statistics, and machine learning tasks.
• What is a kernel, and what is kernel summation?

[Figure panels: Kernel; Kernel Summation]

Related Work
• Tree codes, fast multipole methods, Ewald sums
  ++ Scale to billions or trillions of points by reducing complexity from $O(N^2)$ to $O(N \log N)$
  -- Work only for low-dimensional problems (two or three dimensions); not suitable for statistics or machine learning tasks
• High-dimensional tasks rely on General Matrix-Matrix Multiplication (GEMM)

Kernel Summation Steps

• Notation:
  ◦ M: number of points in target set
  ◦ N: number of points in source set
  ◦ K: dimension
• Inputs:
  ◦ Target matrix A: M-by-K
  ◦ Source matrix B: K-by-N
  ◦ Weight vector W: N-by-1
• Gaussian Kernel: $\mathcal{K}(\alpha, \beta) = \exp\left(-\frac{\lVert \alpha - \beta \rVert_2^2}{2}\right)$
• Output:
  ◦ Vector V: M-by-1
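The GEMM-based formulation used in the rest of the deck follows from expanding the squared distance in the Gaussian kernel exponent. A short derivation, assuming $\alpha$ is a target point (a row of A) and $\beta$ is a source point (a column of B), and assuming the exponent is the squared Euclidean distance:

```latex
\begin{aligned}
\lVert \alpha - \beta \rVert_2^2
  &= (\alpha - \beta)^{\top} (\alpha - \beta) \\
  &= \lVert \alpha \rVert_2^2 + \lVert \beta \rVert_2^2 - 2\, \alpha^{\top} \beta
\end{aligned}
```

The cross term $\alpha^{\top}\beta$ for every target/source pair is exactly one entry of the product $A \times B$, which is why the bulk of the work can be expressed as a matrix multiplication.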

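Kernel Summation Steps
1. $C \leftarrow A \times B$ (GEMM)
2. $R_{i,j} \leftarrow \mathit{squareA}_{i,j} + \mathit{squareB}_{i,j} - 2C_{i,j}$ (embarrassingly parallel)
3. $U_{i,j} \leftarrow \exp\left(-\frac{R_{i,j}}{2}\right)$ (embarrassingly parallel)
4. $V \leftarrow U \times W$ (GEMV)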
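The four steps can be sketched as a plain, unfused reference implementation. This is a minimal illustration in pure Python rather than the GPU code from the talk. The function name `kernel_summation` and the helper arrays `sq_a` and `sq_b` are hypothetical, and the sketch assumes that squareA and squareB hold the squared norms of the rows of A and of the columns of B.

```python
import math


def kernel_summation(A, B, W):
    """Unfused Gaussian kernel summation.

    A: M-by-K list of rows (targets), B: K-by-N list of rows (sources stored
    as columns), W: length-N weights. Returns V, a list of length M.
    """
    M, K, N = len(A), len(B), len(B[0])

    # Step 1 (GEMM): C = A x B, materialised as a full M-by-N matrix.
    C = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
         for i in range(M)]

    # Assumed meaning of squareA / squareB: squared norms of each point.
    sq_a = [sum(x * x for x in A[i]) for i in range(M)]
    sq_b = [sum(B[k][j] * B[k][j] for k in range(K)) for j in range(N)]

    # Step 2 (element-wise): squared pairwise distances.
    R = [[sq_a[i] + sq_b[j] - 2.0 * C[i][j] for j in range(N)]
         for i in range(M)]

    # Step 3 (element-wise): Gaussian kernel evaluation.
    U = [[math.exp(-R[i][j] / 2.0) for j in range(N)] for i in range(M)]

    # Step 4 (GEMV): weighted sum over the source points.
    return [sum(U[i][j] * W[j] for j in range(N)) for i in range(M)]
```

Note that C, R, and U are each M-by-N intermediates. Writing them out and reading them back between steps is the memory traffic that a fused implementation avoids.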

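Implementation Based on cuBLAS
1. $C \leftarrow A \times B$ (call cuBLAS)
2. $R_{i,j} \leftarrow \mathit{squareA}_{i,j} + \mathit{squareB}_{i,j} - 2C_{i,j}$
3. $U_{i,j} \leftarrow \exp\left(-\frac{R_{i,j}}{2}\right)$
4. $V \leftarrow U \times W$
++ Fast. Easy to use
-- Sacrifices data locality
-- Wastes energy on DRAM accesses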

High L2 MPKI indicates opportunity for fusion

[Figure: L2 MPKI (y-axis, 0 to 150) for M = 1024, 131072, and 524288 at each of K = 32, 64, 128, and 256]

We propose Fused Kernel Summation
FOR each thread DO
1. reg_μC = GEMM(mem_subA, mem_subB)
2. reg_μC = Gaus(mem_subA2, mem_B2, reg_μC)
3. mem_subV = Summation(reg_μC, mem_subW)
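The effect of fusion can be illustrated with a small pure-Python sketch. It is not the CUDA kernel from the talk: the function name `fused_kernel_summation` and the tile sizes are hypothetical, and the loops over tiles stand in for thread blocks. The point it shows is that each product entry is consumed by the Gaussian and the weighted sum while it is still in a local variable (the analogue of `reg_μC`), so no M-by-N intermediate is ever stored.

```python
import math


def fused_kernel_summation(A, B, W, tile_m=2, tile_n=2):
    """Gaussian kernel summation without M-by-N intermediates.

    A: M-by-K, B: K-by-N, W: length N. Returns V of length M.
    """
    M, K, N = len(A), len(B), len(B[0])

    # Assumed meaning of mem_subA2 / mem_B2: precomputed squared norms.
    sq_a = [sum(x * x for x in row) for row in A]
    sq_b = [sum(B[k][j] * B[k][j] for k in range(K)) for j in range(N)]

    V = [0.0] * M
    for i0 in range(0, M, tile_m):            # tile rows (one "block" row)
        for j0 in range(0, N, tile_n):        # tile columns
            for i in range(i0, min(i0 + tile_m, M)):
                partial = 0.0
                for j in range(j0, min(j0 + tile_n, N)):
                    # Step 1: one entry of the GEMM result, kept local.
                    c = sum(A[i][k] * B[k][j] for k in range(K))
                    # Step 2: Gaussian kernel applied immediately.
                    u = math.exp(-(sq_a[i] + sq_b[j] - 2.0 * c) / 2.0)
                    # Step 3: weighted summation into a running partial.
                    partial += u * W[j]
                V[i] += partial               # combine this tile's share
    return V
```

Only the length-M output and the two squared-norm vectors are kept beyond a single tile, which is the locality benefit the slide argues for.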

GPU Background
• Thread: organized in thread blocks
• Thread block: executed by an SM (Streaming Multiprocessor)
• Warp: basic scheduling unit; 32 threads that execute the same instruction in lock-step
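How threads in a block group into warps can be sketched in a few lines. The helper name `warp_and_lane` is hypothetical. The sketch assumes the usual CUDA convention that threads are linearised with the x index varying fastest and that consecutive linear IDs fill one warp after another.

```python
WARP_SIZE = 32  # warp width stated on the slide


def warp_and_lane(thread_idx, block_dim):
    """Map a thread's (x, y, z) index within its block to (warp id, lane id)."""
    x, y, z = thread_idx
    dim_x, dim_y, _dim_z = block_dim
    # Linear thread ID within the block, x fastest (assumed convention).
    linear = x + y * dim_x + z * dim_x * dim_y
    return linear // WARP_SIZE, linear % WARP_SIZE
```

For a 16-by-16 thread block, for example, each warp covers two consecutive rows of threads, which matters when deciding which threads touch which shared memory addresses in the same instruction.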


GEMM Algorithm Overview
• A thread block (bx, by) computes $\mathit{submatrixC}_{bx,by} = \mathit{submatrixA}_{by} \times \mathit{submatrixB}_{bx}$
• Carefully select submatrix size and thread block size
• Overlap memory read latency with computation
• Rearrange data location for fast access
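The block decomposition can be sketched as a blocked matrix multiply in pure Python. This is an illustration of the tiling idea only: the function name `tiled_gemm`, the tile sizes, and the dictionaries that stand in for shared memory and registers are all hypothetical, and the sketch does not model latency overlap or data rearrangement.

```python
def tiled_gemm(A, B, tile=2, k_step=2):
    """Blocked C = A x B; each (bx, by) pair plays the role of a thread block."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for by in range(0, M, tile):              # selects a row panel of A
        for bx in range(0, N, tile):          # selects a column panel of B
            rows = range(by, min(by + tile, M))
            cols = range(bx, min(bx + tile, N))
            # Per-block accumulators (stand-in for per-thread registers).
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for k0 in range(0, K, k_step):
                ks = range(k0, min(k0 + k_step, K))
                # Stage one slice of each panel (stand-in for shared memory).
                sub_a = {(i, k): A[i][k] for i in rows for k in ks}
                sub_b = {(k, j): B[k][j] for k in ks for j in cols}
                for (i, j) in acc:
                    acc[(i, j)] += sum(sub_a[(i, k)] * sub_b[(k, j)]
                                       for k in ks)
            # Write the finished tile of C back out.
            for (i, j), value in acc.items():
                C[i][j] = value
    return C
```

Each staged element of `sub_a` and `sub_b` is reused across a whole tile of outputs, which is why the choice of submatrix size affects how much data must be fetched from slower memory.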

Shared memory (SMEM)
• Programmer-managed cache
• 32 banks that can be accessed simultaneously
• Accesses are serialized when there is a bank conflict
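A small model makes the bank-conflict rule concrete. The function name `conflict_degree` is hypothetical. The sketch assumes 4-byte bank words with successive words mapped to successive banks, and assumes that threads reading the same word are served by a broadcast rather than serialized; the 32-bank figure comes from the slide.

```python
NUM_BANKS = 32   # from the slide
WORD_BYTES = 4   # assumption: each bank serves one 4-byte word at a time


def conflict_degree(byte_addresses):
    """Number of serialized passes needed for one warp-wide access.

    byte_addresses: the shared memory byte address touched by each thread.
    """
    words_per_bank = {}
    for addr in byte_addresses:
        word = addr // WORD_BYTES
        # Distinct words in the same bank must be served one after another.
        words_per_bank.setdefault(word % NUM_BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())
```

Under these assumptions, a warp reading consecutive 4-byte values gets a degree of 1, a stride of 32 values gets a degree of 32, and a stride of 33 values gets back to 1, which is the motivation for padding rows when choosing a shared memory data mapping.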

Shared Memory Data Mapping



Fused Kernel Summation
FOR each thread DO
1. reg_μC = GEMM(mem_subA, mem_subB)
2. reg_μC = Gaus(mem_subA2, mem_B2, reg_μC)
3. mem_subV = Summation(reg_μC, mem_subW)
  ◦ Intra thread level: Register → Shared memory
  ◦ Intra thread block level: Shared memory → DRAM
  ◦ Inter thread block level: DRAM → DRAM
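The three summation levels can be mimicked with a staged reduction in pure Python. The function name `three_level_sum` and its parameters are hypothetical, the lists only stand in for registers, shared memory, and DRAM, and the slides do not say how partial results from different thread blocks are combined, so the final pass here is an assumption.

```python
def three_level_sum(values, threads_per_block=4, items_per_thread=2):
    """Sum a list in three stages that mirror the levels on the slide."""
    per_block = threads_per_block * items_per_thread
    block_results = []                        # stand-in for DRAM
    for b0 in range(0, len(values), per_block):
        block = values[b0:b0 + per_block]
        # Intra-thread: each thread sums its own items (register), then
        # publishes one partial to the block's shared buffer.
        shared = [sum(block[t0:t0 + items_per_thread])
                  for t0 in range(0, len(block), items_per_thread)]
        # Intra-block: combine the shared partials, write one value out.
        block_results.append(sum(shared))
    # Inter-block: combine the per-block values held in "DRAM".
    return sum(block_results)
```

Each level shrinks the amount of data that has to move to the next, slower level of the memory hierarchy.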

Evaluation
• Infrastructure
  ◦ Evaluated on an NVIDIA GTX970
  ◦ Profiling tool: nvprof
  ◦ cuBLAS library version 7.0
• Experiments
  ◦ Fused: fuses our own GEMM implementation with the kernel evaluation and the summation routine.
  ◦ CUDA-Unfused: calls our own GEMM implementation, followed by the kernel evaluation and the summation routine.
  ◦ cuBLAS-Unfused: calls the cuBLAS GEMM function, followed by the kernel evaluation and the summation routine.

Performance Comparison
[Figure: speedup (y-axis, 0 to 4) of Fused vs. CUDA-Unfused and Fused vs. cuBLAS-Unfused, for M = 1024, 131072, and 524288 at each of K = 32, 64, 128, and 256]
  ◦ Fused outperforms cuBLAS-Unfused by up to 1.8X when dimension K < 128.

Influence on memory
• Fused optimization reduces memory transactions
  ◦ Fused reduces 50% of L2 accesses in cuBLAS-Unfused
  ◦ Fused reduces 90% of DRAM accesses in cuBLAS-Unfused
[Fig. a: L2 accesses normalized to cuBLAS-Unfused, for Fused and CUDA-Unfused, with M = 1024, 131072, and 524288 at each of K = 32, 64, 128, and 256]
[Fig. b: DRAM accesses normalized to cuBLAS-Unfused, for Fused and CUDA-Unfused, with M = 1024, 131072, and 524288 at each of K = 32, 64, 128, and 256]

Energy Savings Comparison
• Energy savings of Fused compared to cuBLAS-Unfused
  ◦ For the same K, energy savings grow as M increases
  ◦ The amount of energy saved by fusion depends strongly on the value of K

Energy Breakdown
• 80% reduction in DRAM access energy (8% to 24% of total energy)
[Figure: energy in J (y-axis, 0 to 40), broken down into Compute, SMEM, L2, and DRAM for cuBLAS-Unfused, Fused, and CUDA-Unfused, with M = 1024, 131072, and 524288 at each of K = 32, 64, 128, and 256]

Summary
• Presented a fused approach to implementing kernel summation on a state-of-the-art GPU.
  ◦ Fusion improves locality and reduces memory accesses.
  ◦ Fusion improves the overall performance of kernel summation by up to 1.8X.
  ◦ From the energy perspective, fused kernel summation saves up to 33% of total energy across the dimensions tested.

Thanks!

Lab of Computer Architecture
http://users.ece.utexas.edu/~ljohn/publications.html
