Fast Multiplication in Binary Fields on GPUs via Register Cache
Mark Silberstein, Technion
Eli Ben-Sasson, Matan Hamilis, Eran Tromer
ICS 2016

Brief
• Optimization methodology — register cache: replace shared memory with registers
• Target applications: those that use shared memory to cache their input (e.g., stencil)
• Our case: binary field multiplication
• Result: 50% speedup over the shared-memory baseline; 138x over a single CPU core using Intel's CLMUL instruction

Background: execution hierarchy on NVIDIA GPUs
[Diagram: a GPU kernel is composed of thread blocks; each thread block is composed of threads.]

Background: memory and execution hierarchy on NVIDIA GPUs
[Diagram: global GPU memory is shared by the whole kernel; each thread block has its own shared memory; each thread has private registers.]

Warps: not part of the programming model
[Diagram: within each thread block, threads are grouped into warps; warps sit between the per-block shared memory and the per-thread registers.]

Why warp-centric programming
● MIMD programming across warps
● Divergence-free, SIMD-optimized lock-step execution within a warp
● "Free" synchronization among the threads of a warp

Missing layer: warp cache?
[Diagram: the memory hierarchy has a gap — thread blocks have shared memory and threads have registers, but warps have no dedicated caching layer.]
Question: can warp threads share data efficiently among themselves?

Shuffle: warp-level intrinsics
Reading another thread's registers: shuffle(SourceThreadID, OutputRegister)

Example (three threads, each holding r = its own thread ID):
  Thread 0: r=0, out = shuffle(2, r) → out = 2
  Thread 1: r=1, out = shuffle(0, r) → out = 0
  Thread 2: r=2, out = shuffle(0, r) → out = 0
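The shuffle semantics above can be modeled on the CPU. A minimal sketch (plain Python, my own function name — not the CUDA intrinsic): every thread contributes its register value, and each thread reads the value contributed by the lane it names.

```python
def warp_shuffle(contributed, src):
    """Model of a warp-wide shuffle: contributed[i] is the register value
    thread i passes in; src[i] is the lane thread i wants to read from.
    Returns, per thread, the value contributed by its source lane."""
    return [contributed[s] for s in src]

# Three threads, each holding r = its own ID, as in the slide.
r = [0, 1, 2]
out = warp_shuffle(r, src=[2, 0, 0])  # T0 reads T2; T1 and T2 read T0
print(out)  # [2, 0, 0]
```

Note that every lane participates in the same shuffle, which is why the lock-step warp execution makes this a single instruction with no explicit synchronization.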

Shuffle vs. shared memory
● No __syncthreads overhead
● Significantly higher bandwidth
Challenge: programming complexity! Requires application-specific algorithm modifications.

This work: a general technique to replace shared memory (for input data) with shuffle
[Diagram: the per-block shared memory used to cache the input is replaced by a per-warp "register cache" built from the warp threads' registers.]

Outline
● Code transformation example: 1-d k-stencil
● General methodology
● Binary field multiplication
● Evaluation

1-d k-stencil (k = 1)
Each output element is the sum of the input element and its k neighbors on each side.
  Input:  0 1 2 3 4 5
  Output: 3 6 9 12   (0+1+2 = 3, 1+2+3 = 6, 2+3+4 = 9, 3+4+5 = 12)
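Stated directly, the computation above is a sliding-window sum; a minimal reference sketch (plain Python, the function name is mine):

```python
def k_stencil(data, k):
    """1-d k-stencil: each output is the sum of a (2k+1)-wide window."""
    return [sum(data[i:i + 2 * k + 1]) for i in range(len(data) - 2 * k)]

print(k_stencil([0, 1, 2, 3, 4, 5], k=1))  # [3, 6, 9, 12]
```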

1-d 1-stencil: shared memory implementation
[Diagram: the input 0..5 is read from global memory into a shared array s[]; threads T0..T3 compute s[0]+s[1]+s[2], s[1]+s[2]+s[3], s[2]+s[3]+s[4], s[3]+s[4]+s[5] = 3, 6, 9, 12 and write the results back to global memory.]
Structure: Read → __syncthreads() → Compute → Write output
Goal: eliminate the shared memory accesses.

1. Determine warp input (assume 4 threads/warp)
[Diagram: for input 0..9, the array is split between warps: warp 0 takes the elements for its four outputs plus the halo (0-5), warp 1 takes the next chunk (4-9).]

2. Assign input to an owner thread (cyclic distribution)
Each thread stores its share of the warp input in a private array rc — the register cache. Element i goes to thread i mod 4, at local index i / 4:
  T0: rc[0]=0, rc[1]=4
  T1: rc[0]=1, rc[1]=5
  T2: rc[0]=2
  T3: rc[0]=3
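The cyclic distribution above can be sketched in a couple of lines (plain Python, naming is mine):

```python
def distribute_cyclic(data, warp_size=4):
    """Cyclic (round-robin) distribution: element i goes to thread
    i % warp_size, at local register-cache index i // warp_size."""
    rc = [[] for _ in range(warp_size)]
    for i, x in enumerate(data):
        rc[i % warp_size].append(x)
    return rc

print(distribute_cyclic([0, 1, 2, 3, 4, 5]))  # [[0, 4], [1, 5], [2], [3]]
```

The cyclic layout keeps the per-thread register arrays balanced and makes the owner of element i computable from i alone, which matters later when Receive/Publish indices must be derived without communication.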

Some thread inputs are remote!
T0 must compute [0]+[1]+[2], but [1] and [2] live in the registers of T1 and T2 — not directly accessible.

We define new communication primitives
● Receive(src_tid, remote_reg) — receive the data stored by thread src_tid in its variable remote_reg
● Publish(local_reg) — publish the local data stored in variable local_reg
● For one thread to Receive, another has to Publish!

Communication and computation phases
Execution alternates a communication phase (each thread issues one Receive and one Publish) with a computation phase (each thread accumulates the received value into _ac). For the 1-stencil, three rounds complete the sum.

Round 1 — each thread receives its own first element:
  T0: v=R(T0,rc[0]); P(rc[0]); _ac+=v
  T1: v=R(T1,rc[0]); P(rc[0]); _ac+=v
  T2: v=R(T2,rc[0]); P(rc[0]); _ac+=v
  T3: v=R(T3,rc[0]); P(rc[0]); _ac+=v
After round 1, T0 holds _ac=[0] and still needs [1] and [2].

Round 2 — each thread receives its right neighbor's element; T0 publishes rc[1] because T3 needs element 4:
  T0: v=R(T1,rc[0]); P(rc[1]); _ac+=v
  T1: v=R(T2,rc[0]); P(rc[0]); _ac+=v
  T2: v=R(T3,rc[0]); P(rc[0]); _ac+=v
  T3: v=R(T0,rc[1]); P(rc[0]); _ac+=v
After round 2, T0 holds _ac=[0]+[1] and still needs [2].

Round 3 — each thread receives the element two positions to its right; T0 and T1 publish rc[1]:
  T0: v=R(T2,rc[0]); P(rc[1]); _ac+=v
  T1: v=R(T3,rc[0]); P(rc[1]); _ac+=v
  T2: v=R(T0,rc[1]); P(rc[0]); _ac+=v
  T3: v=R(T1,rc[1]); P(rc[0]); _ac+=v

4. Write result to global memory — each thread Tj now holds _ac=[j]+[j+1]+[j+2] (T0: 3, T1: 6, T2: 9, T3: 12).

Receive + Publish = shuffle
Each Receive/Publish pair maps to a single shuffle: the thread contributes rc[pub_idx] and reads the value contributed by thread src. For T0's three rounds:
  v=R(T0,rc[0]); P(rc[0]) → pub_idx=0; src=0; v=shuffle(src, rc[pub_idx])
  v=R(T1,rc[0]); P(rc[1]) → pub_idx=1; src=1; v=shuffle(src, rc[pub_idx])
  v=R(T2,rc[0]); P(rc[1]) → pub_idx=1; src=2; v=shuffle(src, rc[pub_idx])
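Putting the phases together, the full transformation can be modeled in plain Python (a CPU sketch under my own naming; the real implementation issues CUDA shuffles). Each round, thread j computes its src and pub_idx from its ID and the round number alone, contributes rc[pub_idx], reads from lane src, and accumulates:

```python
def warp_shuffle(contributed, src):
    """contributed[j]: value thread j passes in; src[j]: lane j reads from."""
    return [contributed[s] for s in src]

def stencil_register_cache(data, k=1, warp_size=4):
    """Register-cache 1-d k-stencil for one warp (Python model).
    Thread j owns data[j], data[j+warp_size], ... in its private rc array."""
    rc = [data[j::warp_size] for j in range(warp_size)]
    ac = [0] * warp_size
    for i in range(2 * k + 1):  # one communication+compute round per window tap
        # Thread j needs element j+i, owned by lane (j+i) % warp_size.
        src = [(j + i) % warp_size for j in range(warp_size)]
        # Thread j publishes the rc entry its (unique) receiver needs.
        pub = [(((j - i) % warp_size) + i) // warp_size for j in range(warp_size)]
        contributed = [rc[j][pub[j]] for j in range(warp_size)]
        v = warp_shuffle(contributed, src)   # Receive + Publish = one shuffle
        ac = [a + x for a, x in zip(ac, v)]  # computation phase
    return ac

print(stencil_register_cache([0, 1, 2, 3, 4, 5]))  # [3, 6, 9, 12]
```

For k=1 and 4 threads this reproduces the publish pattern of the slides: round 1 everyone publishes rc[0]; round 2 only T0 publishes rc[1]; round 3 T0 and T1 publish rc[1].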

Performance benefits for k-stencil
Up to 76%!
See the paper for further analysis of benefits and limitations.

Summary: Register Cache
• Start from a shared memory-based implementation
• Identify the input of each warp
• Distribute the input among the warp's threads
• Split execution into multiple phases:
  – Communication phase: Publish/Receive
  – Computation phase
• Transform Publish/Receive pairs into shuffles

Part 2: multiplication in large binary fields GF(2^n)
● Binary field multiplication — a computational bottleneck in many applications
  – Security, storage
● Typical scenario: multiply many pairs of field elements
● Main kernel: convolution of binary vectors of size n
● x86 CPUs: dedicated CLMUL instruction
  – Ivy Bridge: 14 cycles for 2 convolutions
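To make the task concrete: multiplication in GF(2^n) is a carry-less (XOR-based) multiplication of the two bit vectors, followed by reduction modulo an irreducible polynomial. A minimal Python sketch (my own code; the GF(2^32) reduction polynomial shown, x^32+x^7+x^3+x^2+1, is one standard low-weight choice and an assumption — not necessarily the one used in the paper):

```python
def clmul(a, b):
    """Carry-less multiplication: integers viewed as polynomials over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a      # XOR replaces addition: no carries
        a <<= 1
        b >>= 1
    return r

def gf2n_mul(a, b, n=32, poly=(1 << 32) | 0b10001101):
    """Multiply in GF(2^n): clmul, then reduce modulo the field polynomial."""
    r = clmul(a, b)
    for i in range(r.bit_length() - 1, n - 1, -1):  # long division over GF(2)
        if r >> i & 1:
            r ^= poly << (i - n)
    return r

print(hex(gf2n_mul(1 << 31, 2)))  # x^31 * x = x^32 ≡ x^7+x^3+x^2+1 -> 0x8d
```

This is exactly what CLMUL accelerates on the CPU; the GPU implementation has to build the same convolution out of word-level bitwise operations.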

Binary convolution
Input: v1 = 1 0 1, v2 = 1 1 1.
[Diagram: v2 slides across v1; at each offset, the overlapping bits are ANDed pairwise and the products are XORed to form one output bit.]
Output: 1 1 0 1 1 — the carry-less product of the two bit vectors.
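The sliding AND/XOR walkthrough above is a direct definition; a compact reference sketch (plain Python, helper name is mine):

```python
def bin_convolve(v1, v2):
    """Binary convolution: out[k] = XOR over all i+j==k of (v1[i] AND v2[j])."""
    out = [0] * (len(v1) + len(v2) - 1)
    for i, a in enumerate(v1):
        for j, b in enumerate(v2):
            out[i + j] ^= a & b
    return out

print(bin_convolve([1, 0, 1], [1, 1, 1]))  # [1, 1, 0, 1, 1]
```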

Challenges
● Bit-level operations
● Load balancing between warp threads
● Scaling to large fields

Challenges — solutions
● Bit-level operations → bit slicing: compute 32 convolutions in a single thread
● Load balancing between warp threads → algorithmic trick to achieve divergence-free execution
● Scaling to large fields → use the register cache to free shared memory and scale better
See the paper for details.
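The bit-slicing idea can be sketched in a few lines of Python (illustrative only, my own naming): pack bit j of 32 independent problem instances into a single machine word, so every bitwise word operation advances all 32 convolutions at once.

```python
MASK = (1 << 32) - 1  # one 32-bit word = 32 lanes, lane i = problem instance i

def sliced_convolve(v1_words, v2_words):
    """Bit-sliced binary convolution: v1_words[j] packs coefficient j of v1
    for 32 independent instances; each & and ^ below acts on all 32 lanes."""
    out = [0] * (len(v1_words) + len(v2_words) - 1)
    for i, a in enumerate(v1_words):
        for j, b in enumerate(v2_words):
            out[i + j] ^= a & b  # 32 AND/XOR lane operations in one word op
    return [w & MASK for w in out]

# Lane 0 holds v1=101, v2=111; lane 1 holds v1=111, v2=111 (LSB-first packing).
print(sliced_convolve([0b11, 0b10, 0b11], [0b11, 0b11, 0b11]))  # [3, 1, 2, 1, 3]
```

Unpacking lane 0 of the result gives 1 1 0 1 1 (the example above) and lane 1 gives 1 0 1 0 1, i.e., 111 × 111 carry-less — both computed by the same word operations.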

Performance
● CPU baseline: CLMUL intrinsic (via the popular NTL library)
● NVIDIA K80: 138x faster than the CPU baseline
● Register cache: 50% faster than the shared-memory version
[Chart: throughput of the register-cache vs. shared-memory implementations; the register-cache bar is about 50% higher.]

Conclusions
● Register cache: a general technique for replacing shared memory with shuffle
● Applied it to fast binary field multiplication
● The register cache improved application performance by 50%
● Total: 138x over CPU CLMUL for fields of size 2^32

Source code: https://github.com/HamilM/GpuBinFieldMult
Further questions: [email protected]
