Fast Multiplication in Binary Fields on GPUs via Register Cache
Eli Ben-Sasson, Matan Hamilis, Mark Silberstein, Eran Tromer (Technion)
ICS 2016
Brief
• Optimization methodology: a register cache, replacing shared memory with registers
• Target applications: those that use shared memory to cache their input (e.g., stencils)
• Our case: binary field multiplication
• Result: 50% speedup over the shared-memory baseline, 138x over a single CPU core with Intel's CLMUL instruction
Background: execution hierarchy on NVIDIA GPUs

[Figure: a GPU kernel is composed of thread blocks, each composed of threads.]
Background: memory and execution hierarchy on NVIDIA GPUs

[Figure: global GPU memory is shared by the whole kernel; each thread block has its own shared memory; each thread has its own private registers.]
Warps: not part of the programming model

[Figure: inside each thread block, threads are grouped into warps; each thread in a warp keeps its own registers.]
Why warp-centric programming?
● MIMD, divergence-free programming across warps
● SIMD-optimized lock-step execution within a warp
● "Free" synchronization among the threads of a warp
Missing layer: a warp cache?

[Figure: the memory hierarchy again, with a question mark where a per-warp cache level would sit, between the block's shared memory and the threads' registers.]

Question: efficient data sharing among warp threads?
Shuffle: warp-level intrinsics

Reading another thread's registers: shuffle(SourceThreadID, SourceRegister)

[Figure: threads 0, 1, 2 each hold R = 0, 1, 2 in a register. Thread 0 executes out = shuffle(2, r) and gets OUT = 2; threads 1 and 2 execute out = shuffle(0, r) and get OUT = 0.]
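To make the intrinsic concrete, here is a minimal CUDA sketch of reading another lane's register. It uses the modern __shfl_sync form (this 2016 talk predates the _sync suffix; older CUDA used __shfl(r, src)); the kernel name and output buffer are illustrative assumptions.

```cuda
// Minimal warp-shuffle sketch: every lane reads lane 2's register.
__global__ void shuffle_demo(int *out) {
    int lane = threadIdx.x % 32;
    int r = lane;                            // each thread holds its lane ID in a register
    int v = __shfl_sync(0xffffffffu, r, 2);  // read 'r' as evaluated by lane 2
    out[threadIdx.x] = v;                    // every lane writes 2
}
```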
Shuffle vs. shared memory
● No __syncthreads overhead
● Significantly higher bandwidth

Challenge: programming complexity! Requires application-specific algorithm modifications.
This work: a general technique to replace shared memory for input with shuffle

[Figure: the missing per-warp level is filled in: a register cache, built from the registers of a warp's threads, replaces the block's shared memory for caching input.]
Outline
● Code transformation example: 1-d k-stencil
● General methodology
● Binary field multiplication
● Evaluation
1-d k-stencil

Each output is the sum of a window of 2k+1 adjacent input elements.

[Figure: k = 1 example. Input 0 1 2 3 4 5; each output sums three adjacent inputs: 0+1+2=3, 1+2+3=6, 2+3+4=9, 3+4+5=12.]
1-d 1-stencil: shared memory

[Figure: input 0..5 is staged from global memory into shared memory s[]; threads T0..T3 compute s[0]+s[1]+s[2], s[1]+s[2]+s[3], s[2]+s[3]+s[4], s[3]+s[4]+s[5] and write 3, 6, 9, 12 back to global memory.]

Structure: Read, __syncthreads(), Compute, Write output.

Goal: eliminate the shared memory accesses.
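For reference, a minimal CUDA sketch of the shared-memory version, following the same Read / __syncthreads / Compute / Write structure; the block size and halo handling are illustrative assumptions, not the paper's exact code.

```cuda
#define BLOCK 256  // assumed launch: blockDim.x == BLOCK

// 1-d 1-stencil: out[i] = in[i] + in[i+1] + in[i+2]
__global__ void stencil1_smem(const int *in, int *out, int n) {
    __shared__ int s[BLOCK + 2];            // BLOCK outputs need BLOCK+2 inputs
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    // Read: each thread stages one element; two threads also stage the halo
    if (gid < n) s[tid] = in[gid];
    if (tid < 2 && gid + BLOCK < n) s[tid + BLOCK] = in[gid + BLOCK];

    __syncthreads();                        // make the input window visible block-wide

    // Compute and write output
    if (gid < n - 2)
        out[gid] = s[tid] + s[tid + 1] + s[tid + 2];
}
```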
1. Determine warp input (assume 4 threads/warp)

[Figure: global memory input 0..9. Warp 0 consumes elements 0..5 and warp 1 consumes elements 4..9; for k = 1, the input windows of adjacent warps overlap by 2k = 2 elements.]
2. Assign input to an owner thread

[Figure: the warp's 6 inputs are distributed among the 4 threads into a per-thread register array rc: T0 holds rc[0]=[0], rc[1]=[4]; T1 holds rc[0]=[1], rc[1]=[5]; T2 holds rc[0]=[2]; T3 holds rc[0]=[3].]

Cyclic distribution: thread i owns input elements i, i+4, i+8, ...
Some thread inputs are remote!

[Figure: T0 must compute [0]+[1]+[2], but only [0] and [4] are in its own registers; [1] and [2] live in T1's and T2's registers and are not directly accessible.]
We define new communication primitives
● Receive(src_tid, remote_reg): receive the data stored by thread src_tid in its variable remote_reg
● Publish(local_reg): publish the local data stored in variable local_reg
● For one thread to Receive, another has to Publish!
The transformed code runs in rounds. In round j, thread i needs input element [i+j], owned by thread (i+j) mod 4; it Receives that element from its owner while Publishing the register that its own requester, thread (i-j) mod 4, needs.

2. Communication phase: Receive

[Figure: round 0. Every thread Receives its own first element: Ti executes v=R(Ti, rc[0]).]

2. Communication phase: Publish

[Figure: round 0. Every thread Publishes rc[0], satisfying the pending Receives: P(rc[0]) on all threads.]

3. Computation phase

[Figure: every thread accumulates the received value: _ac += v. T0 now holds _ac = [0] and still needs [1].]

Later rounds shift the sources. In round 1, T0 Receives from T1, T1 from T2, T2 from T3, and T3 Receives rc[1] from T0 (element [4]); a thread whose second element is requested Publishes rc[1] instead of rc[0]. After the Compute step, T0 holds _ac = [0]+[1] and still needs [2]; round 2 completes the window.

4. Write result to global memory

[Figure: after three Receive/Publish/Compute rounds, every thread holds its full sum ([0]+[1]+[2] for T0) and writes it to global memory.]
Receive + Publish = shuffle

Each Receive/Publish pair is implemented as a single shuffle: the thread computes the index of the register it must publish (pub_idx) and the thread it must read from (src), then issues one call:

v = shuffle(src, rc[pub_idx])

[Figure: e.g., v=R(T0,rc[0]); P(rc[0]) becomes pub_idx=0; src=0; v=shuffle(src, rc[pub_idx]), and v=R(T1,rc[0]); P(rc[1]) becomes pub_idx=1; src=1; v=shuffle(src, rc[pub_idx]).]
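Putting the transformation together, here is a hedged CUDA sketch of the register-cache 1-stencil for a full 32-thread warp: cyclic distribution into per-thread registers, and one __shfl_sync per round fusing Publish and Receive. Variable names and boundary handling are illustrative; it assumes blockDim.x is a multiple of 32.

```cuda
#define WARP 32

// Register-cache 1-stencil: out[i] = in[i] + in[i+1] + in[i+2]
__global__ void stencil1_rc(const int *in, int *out, int n) {
    int lane = threadIdx.x % WARP;
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = gid - lane;                  // first input element of this warp

    // Distribute (cyclic): lane owns in[base+lane]; lanes 0,1 also own the halo
    int rc0 = (base + lane < n) ? in[base + lane] : 0;
    int rc1 = (lane < 2 && base + lane + WARP < n) ? in[base + lane + WARP] : 0;

    int ac = 0;
    for (int j = 0; j < 3; ++j) {
        // Round j: this lane needs in[base+lane+j], owned by lane (lane+j)%WARP.
        // Symmetrically, its requester needs rc1 from it exactly when lane < j.
        int pub = (lane < j) ? rc1 : rc0;   // Publish
        int src = (lane + j) % WARP;        // Receive source
        ac += __shfl_sync(0xffffffffu, pub, src);
    }
    if (gid < n - 2) out[gid] = ac;
}
```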
Performance benefits for k-stencil

[Figure: speedup of the register-cache k-stencil over the shared-memory version, up to 76%.]

See the paper for further analysis of the benefits and limitations.
Summary: Register Cache
• Start from a shared memory-based implementation
• Identify the input of each warp
• Distribute the input among the warp's threads
• Split the code into phases:
  – Communication phase: Publish/Receive
  – Computation phase
• Transform each Publish/Receive pair into a shuffle
Part 2: multiplication in large binary fields GF(2^n)
● Binary field multiplication is a computational bottleneck in many applications
  – Security, storage
● Typical scenario: multiply many pairs of field elements
● Main kernel: convolution of binary vectors of size n
● x86 CPUs have a special CLMUL instruction
  – Ivy Bridge: 14 cycles, 2 convolutions
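For context, a host-side C++ sketch of what the CPU baseline builds on: one 64x64-bit carry-less multiplication via the CLMUL instruction (NTL uses it internally). Compile with -mpclmul; the wrapper name is an illustrative assumption.

```cpp
#include <emmintrin.h>
#include <wmmintrin.h>   // _mm_clmulepi64_si128 (PCLMULQDQ)

// Carry-less multiply of two 64-bit polynomials; returns the 128-bit product.
static inline __m128i clmul64(unsigned long long a, unsigned long long b) {
    __m128i va = _mm_set_epi64x(0, (long long)a);
    __m128i vb = _mm_set_epi64x(0, (long long)b);
    return _mm_clmulepi64_si128(va, vb, 0x00);  // multiply the low 64-bit halves
}
```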
Binary convolution

Slide one input vector across the other: at each offset, AND the aligned bits and XOR the partial products together to produce one output bit (carry-less multiplication).

[Figure: step-by-step binary convolution of input vectors v1 and v2, producing one output bit per shift via AND of aligned bits and XOR accumulation.]
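As a scalar reference, a minimal sketch of binary convolution of two 32-bit words as shift-AND-XOR; the GPU implementation replaces this serial loop with warp-parallel, bit-sliced logic. The function name is illustrative.

```cuda
// Binary convolution (carry-less multiply) of two 32-bit operands.
__host__ __device__ inline unsigned long long clmul32(unsigned a, unsigned b) {
    unsigned long long acc = 0;
    for (int i = 0; i < 32; ++i)
        if ((b >> i) & 1u)                      // bit i of b set:
            acc ^= (unsigned long long)a << i;  // XOR in a shifted copy of a
    return acc;                                 // 63-bit product
}
```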
Challenges and solutions
● Bit-level operations → bit slicing: compute 32 convolutions in a single thread
● Load balancing between warp threads → an algorithmic trick to achieve divergence-free execution
● Scaling to large fields → use the register cache to free shared memory and scale better

See the paper for details; a sketch of the bit-slicing layout follows.
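To illustrate the bit-slicing idea (a sketch of the layout, not the paper's kernel): transpose 32 operands so that word i holds bit i of every operand; one bitwise AND or XOR on sliced words then advances 32 independent convolutions at once.

```cuda
// Bit-slice 32 operands: bit k of slice[i] == bit i of a[k].
__host__ __device__ inline void bitslice32(const unsigned a[32],
                                           unsigned slice[32]) {
    for (int i = 0; i < 32; ++i) {
        unsigned s = 0;
        for (int k = 0; k < 32; ++k)
            s |= ((a[k] >> i) & 1u) << k;   // gather bit i of every operand
        slice[i] = s;
    }
}
```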
Performance
● CPU baseline: the CLMUL intrinsic (via the popular NTL library)
● NVIDIA K80: 138x faster than the CPU

[Figure: throughput of the register-cache vs. shared-memory implementations; the register cache is 50% faster.]
Conclusions
● Register cache: a general technique for replacing shared memory with shuffle
● Applied to fast binary field multiplication
● The register cache improved application performance by 50%
● Total: 138x over CPU CLMUL for fields of size 2^32

Source code: https://github.com/HamilM/GpuBinFieldMult
Further questions: [email protected]