Fast Multiplication in Binary Fields on GPUs via Register Cache
Mark Silberstein (Technion), Eli Ben-Sasson, Matan Hamilis, Eran Tromer
ICS 2016

Brief
• Optimization methodology – register cache: replace shared memory with registers
• Target applications: those that use shared memory to cache their input (e.g., stencil)
• Our case: binary field multiplication
• Result: 50% speedup over the shared-memory baseline, 138x over a single CPU core using Intel's CLMUL instruction

Background: execution hierarchy on NVIDIA GPUs
[Figure: a GPU kernel consists of thread blocks, and each thread block consists of threads.]

Background: memory and execution hierarchy on NVIDIA GPUs
[Figure: global GPU memory serves the whole kernel, each thread block has its own shared memory, and each thread has private registers.]

Warps: not part of the programming model
[Figure: within each thread block, threads are grouped into warps; warps sit between thread blocks and threads in the execution hierarchy.]

Why warp-centric programming
• MIMD divergence-free programming across warps
• SIMD-optimized lock-step execution within a warp
• "Free" synchronization among a warp's threads

Missing layer: warp cache?
[Figure: the hierarchy provides global memory for the kernel, shared memory per thread block, and registers per thread, but no dedicated storage layer for the warp.]

Missing layer: warp cache?
Question: efficient data sharing among warp threads?
[Figure: the same hierarchy, with a question mark at the warp level.]

Shuffle: warp-level intrinsics
Reading another thread's registers: shuffle(SourceThreadID, OutputRegister)
Example – three threads, each holding r equal to its own thread ID:
  Thread 0: r = 0, out = shuffle(2, r) → OUT = 2
  Thread 1: r = 1, out = shuffle(0, r) → OUT = 0
  Thread 2: r = 2, out = shuffle(0, r) → OUT = 0
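In CUDA, the abstract shuffle above maps to the warp shuffle intrinsics. A minimal sketch using the modern __shfl_sync form (the ICS 2016-era code would have used the older __shfl; the kernel name and setup here are illustrative only):

__global__ void shuffle_demo(int *out) {
    int lane = threadIdx.x % 32;            // lane ID within the warp
    int r = lane;                           // each lane holds its own ID in a register
    // Every lane reads lane 0's copy of r: a warp-wide broadcast, no shared memory.
    int v = __shfl_sync(0xffffffff, r, 0);
    out[threadIdx.x] = v;                   // all lanes write 0
}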

Shuffle vs. shared memory
• No __syncthreads overhead
• Significantly higher bandwidth

Shuffle vs. shared memory
• No __syncthreads overhead
• Significantly higher bandwidth
Challenge: programming complexity! It requires application-specific algorithm modifications.

This work: a general technique to replace input shared memory with shuffle
[Figure: a thread block with its shared memory; the threads of each warp hold data in their private registers.]

This work: a general technique to replace input shared memory with shuffle
[Figure: the same diagram, with the warp's registers now labeled "register cache" in place of shared memory.]

Outline
• Code transformation example: 1-d k-stencil
• General methodology
• Binary field multiplication
• Evaluation

1-d k-stencil (k = 1)
Each output element is the sum of the corresponding input element and its k neighbors on each side.
[Figure: input 0 1 2 3 4 5; with k = 1, each window of three adjacent elements sums to 3, 6, 9, 12 respectively.]

1-d 1-stencil: shared memory
[Figure: the input 0..5 is read from global memory into a shared array s[]; after __syncthreads(), thread Ti computes s[i] + s[i+1] + s[i+2]; the results 3, 6, 9, 12 are written back to global memory.]
Phases: read → __syncthreads() → compute → write output

1-d 1-stencil: shared memory
[Figure: the same diagram as on the previous slide.]
Phases: read → __syncthreads() → compute → write output
Goal: eliminate the shared memory accesses
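For concreteness, a minimal shared-memory 1-d stencil kernel might look as follows. This is a sketch, not the authors' code: the block size, halo handling, and zero-padding at the array ends are simplifying assumptions.

// One block produces BLOCK outputs; it loads BLOCK + 2*K inputs (halo included).
#define K 1
#define BLOCK 256

__global__ void stencil_shared(const int *in, int *out, int n) {
    __shared__ int s[BLOCK + 2 * K];
    int gid = blockIdx.x * BLOCK + threadIdx.x;

    // Read phase: each thread loads one element; the first K threads also load the halos.
    if (gid < n) s[threadIdx.x + K] = in[gid];
    if (threadIdx.x < K) {
        s[threadIdx.x] = (gid >= K) ? in[gid - K] : 0;
        s[threadIdx.x + K + BLOCK] = (gid + BLOCK < n) ? in[gid + BLOCK] : 0;
    }
    __syncthreads();

    // Compute phase: sum the 2k+1 window from shared memory, then write the output.
    if (gid < n) {
        int acc = 0;
        for (int i = -K; i <= K; ++i) acc += s[threadIdx.x + K + i];
        out[gid] = acc;
    }
}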

1. Determine warp input (assume 4 threads per warp)
[Figure: the global memory input holds elements 0..9.]

1. Determine warp input (assume 4 threads per warp)
[Figure: warp 0's input window covers elements 0..5 and warp 1's covers elements 4..9; because of the stencil halo, adjacent warps' input windows overlap.]

2. Assign input to owner thread
[Figure: the warp's input elements 0..5 in global memory are to be assigned to the warp's threads T0..T3.]

2. Assign input to owner thread
[Figure: each thread Ti stores its share of the input in a small per-thread array rc[] – the register cache.]

2. Assign input to owner thread
Distribute the input cyclically: element i goes to thread (i mod 4).
  T0: rc[0] = [0], rc[1] = [4]
  T1: rc[0] = [1], rc[1] = [5]
  T2: rc[0] = [2]
  T3: rc[0] = [3]
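A sketch of this distribution step (illustrative names and sizes; the real code uses 32-thread warps and whatever input size the application dictates):

// Cyclic distribution of a warp's input into per-thread register arrays.
// With 4 threads and 6 inputs: T0 <- {0,4}, T1 <- {1,5}, T2 <- {2}, T3 <- {3}.
#define WARP 4
#define WARP_INPUT 6
#define RC_SLOTS ((WARP_INPUT + WARP - 1) / WARP)

__device__ void distribute(const int *warp_in, int lane, int rc[RC_SLOTS]) {
    for (int s = 0; s < RC_SLOTS; ++s) {
        int idx = s * WARP + lane;               // element owned by this lane
        rc[s] = (idx < WARP_INPUT) ? warp_in[idx] : 0;
    }
}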

Some thread inputs are remote!
[Figure: T0 must compute [0]+[1]+[2], but after the cyclic distribution [1] and [2] live in the registers of T1 and T2 – they are not available locally.]

We define new communication primitives
• Receive(src_tid, remote_reg) – receive the data stored by thread src_tid in its variable remote_reg
• Publish(local_reg) – publish the local data stored in variable local_reg
• For one thread to Receive, another has to Publish!

2. Communication phase: Receive
Register cache contents – T0: [0],[4]; T1: [1],[5]; T2: [2]; T3: [3].
T0 needs to compute [0]+[1]+[2].

2. Communication phase: Receive
Receive(src, what): T0 starts with the element it owns – v = R(T0, rc[0]).

2. Communication phase: Publish
Every Receive is matched by a Publish of the requested register; for T0: v = R(T0, rc[0]); P(rc[0]).

2. Communication phase: Publish
Round 1 – each thread receives the first element it owns and publishes it:
  T0: v = R(T0, rc[0]); P(rc[0])
  T1: v = R(T1, rc[0]); P(rc[0])
  T2: v = R(T2, rc[0]); P(rc[0])
  T3: v = R(T3, rc[0]); P(rc[0])

3. Computation phase
Each thread accumulates the value it received: _ac += v.
  T0: v = R(T0, rc[0]); P(rc[0]); _ac += v
  T1: v = R(T1, rc[0]); P(rc[0]); _ac += v
  T2: v = R(T2, rc[0]); P(rc[0]); _ac += v
  T3: v = R(T3, rc[0]); P(rc[0]); _ac += v
T0 now holds _ac = [0] and still needs [1].

2. Communication phase: Receive
Round 2 – each thread asks for the next element of its window:
  T0: v = R(T1, rc[0])
  T1: v = R(T2, rc[0])
  T2: v = R(T3, rc[0])
  T3: v = R(T0, rc[1])

2. Communication phase: Publish
Round 2 – each thread publishes the register that some other thread asked for:
  T0: v = R(T1, rc[0]); P(rc[1])
  T1: v = R(T2, rc[0]); P(rc[0])
  T2: v = R(T3, rc[0]); P(rc[0])
  T3: v = R(T0, rc[1]); P(rc[0])

3. Computation phase
Round 2 – each thread again accumulates: _ac += v.
T0 now holds _ac = [0] + [1] and still needs [2].

4. Write the result to global memory
Round 3 repeats Receive, Publish, Compute:
  T0: v = R(T2, rc[0]); P(rc[1]); _ac += v
  T1: v = R(T3, rc[0]); P(rc[1]); _ac += v
  T2: v = R(T0, rc[1]); P(rc[0]); _ac += v
  T3: v = R(T1, rc[1]); P(rc[0]); _ac += v
T0 now holds _ac = [0] + [1] + [2] and writes it to global memory; the other threads do the same for their windows.

Receive + Publish = shuffle
T0's three rounds as Receive–Publish pairs:
  Round 1: v = R(T0, rc[0]); P(rc[0])
  Round 2: v = R(T1, rc[0]); P(rc[1])
  Round 3: v = R(T2, rc[0]); P(rc[1])

Receive + Publish = shuffle
Each Receive–Publish pair collapses into a single shuffle of the published register:
  Round 1: v = R(T0, rc[0]); P(rc[0])  →  pub_idx = 0; src = 0; v = shuffle(src, rc[pub_idx])
  Round 2: v = R(T1, rc[0]); P(rc[1])  →  pub_idx = 1; src = 1; v = shuffle(src, rc[pub_idx])
  Round 3: v = R(T2, rc[0]); P(rc[1])  →  pub_idx = 1; src = 2; v = shuffle(src, rc[pub_idx])
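Putting the pieces together, a register cache version of the 1-d 1-stencil can be written with warp shuffles only. The sketch below uses a simpler one-element-per-lane distribution (each warp loads 32 consecutive inputs and produces the 30 interior outputs) rather than the cyclic distribution above, and the modern __shfl_up_sync/__shfl_down_sync intrinsics; it illustrates the technique, not the authors' exact kernel.

__global__ void stencil_regcache(const int *in, int *out, int n) {
    const unsigned FULL = 0xffffffffu;
    int lane = threadIdx.x % 32;
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;

    // Each warp's input window: 32 consecutive elements, overlapping the next
    // warp's window by 2 so that every interior output can be produced.
    int base = warp * 30;                  // 30 outputs per warp
    int idx  = base + lane;
    int rc   = (idx < n) ? in[idx] : 0;    // register cache: one element per lane

    // Communication phase: fetch the neighbors from adjacent lanes via shuffle.
    int left  = __shfl_up_sync(FULL, rc, 1);
    int right = __shfl_down_sync(FULL, rc, 1);

    // Computation phase: interior lanes combine their window and write the output.
    if (lane >= 1 && lane <= 30 && idx >= 1 && idx + 1 < n)
        out[idx] = left + rc + right;
}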

Performance benefits for k-stencil
Up to 76%!
See the paper for further analysis of the benefits and limitations.

Summary: Register Cache
• Start from a shared memory-based implementation
• Identify the input of each warp
• Distribute the data among the warp's threads
• Split the code into multiple phases
  – Communication phase: Publish – Receive
  – Computation phase
• Transform Publish–Receive pairs into shuffle

Part 2: multiplication in large binary fields GF(2^n)
• Binary field multiplication is a computational bottleneck in many applications
  – Security, storage
• Typical scenario: multiply many pairs of field elements
• Main kernel: convolution of binary vectors of size n
• x86 CPUs have a dedicated CLMUL (carry-less multiplication) instruction
  – Ivy Bridge: 14 cycles, 2 convolutions

Binary convolution
[Figure: a step-by-step example of convolving two binary vectors v1 and v2 – at each output position, v2 is aligned against v1, the aligned bits are ANDed, and the resulting partial products are combined with XOR (carry-less multiplication, no carries).]
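The binary convolution of two 32-bit vectors is exactly a carry-less multiplication; a plain shift-AND-XOR reference version (not the paper's bit-sliced GPU kernel) is shown below.

// Carry-less multiplication: the binary convolution of two 32-bit vectors,
// producing a 63-bit result. Each set bit of b contributes a shifted copy of a,
// and the copies are combined with XOR instead of addition (no carries).
__host__ __device__ unsigned long long clmul32(unsigned int a, unsigned int b) {
    unsigned long long acc = 0;
    for (int i = 0; i < 32; ++i)
        if ((b >> i) & 1u)
            acc ^= (unsigned long long)a << i;
    return acc;
}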

Challenges - Solutions
• Bit-level operations
• Load balancing between warp threads
• Scaling to large fields

Challenges - Solutions
• Bit-level operations
  – Bit slicing: compute 32 convolutions in a single thread
• Load balancing between warp threads
  – Algorithmic trick to achieve divergence-free execution
• Scaling to large fields
  – Use the register cache to free shared memory and scale better
See the paper for details.
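Bit slicing stores bit k of 32 independent operands in the 32 bit lanes of one machine word, so each bitwise instruction advances all 32 convolutions at once. A minimal sketch for 32 simultaneous 8-bit convolutions (an illustrative layout and width, not the paper's exact data layout):

// Bit-sliced binary convolution of 32 independent 8-bit operand pairs.
// a_bits[k] holds bit k of operand a for all 32 instances (one instance per
// bit lane of the word); b_bits[k] likewise. prod_bits[r] receives bit r of
// every 15-bit convolution result.
__host__ __device__ void bitsliced_conv8(const unsigned int a_bits[8],
                                         const unsigned int b_bits[8],
                                         unsigned int prod_bits[15]) {
    for (int r = 0; r < 15; ++r) {
        unsigned int acc = 0;
        for (int k = 0; k < 8; ++k) {
            int j = r - k;
            if (j >= 0 && j < 8)
                acc ^= a_bits[k] & b_bits[j];  // AND = bit product, XOR = carry-less sum
        }
        prod_bits[r] = acc;
    }
}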

Performance
• CPU baseline: CLMUL intrinsic (via the popular NTL library)
• NVIDIA K80: 138x faster than the CPU
[Chart: throughput of the register cache vs. shared memory implementations.]

Performance
• CPU baseline: CLMUL intrinsic (via the popular NTL library)
• NVIDIA K80: 138x faster than the CPU
• The register cache version is 50% faster than the shared memory version
[Chart: the same throughput comparison, highlighting the 50% gap between register cache and shared memory.]

Conclusions
• Register cache: a general technique for replacing shared memory with shuffle
• Applied it to fast binary field multiplication
• The register cache improved application performance by 50%
• Total: 138x over CPU CLMUL for fields of size 2^32

Source code: https://github.com/HamilM/GpuBinFieldMult
Further questions: [email protected]
