Simulative Analysis of a Multidimensional Torus-based Reconfigurable Cluster for Molecular Dynamics
HUCAA’14
Abhijeet Lawande, Ph.D. Candidate
Hanchao Yang, M.S. Student
Dr. Alan George, Professor of ECE
Dr. Herman Lam, Associate Professor of ECE
Introduction
Goal: Model the performance of FPGA clusters with an inter-FPGA network
• Scope: Communication-intensive molecular dynamics (MD), specifically the core 3D FFT kernel
  • The longest-running single task in MD is the 3D FFT kernel (FFT followed by IFFT)
• Scope: Novo-G# system, our in-house FPGA cluster
  • Novo-G reconfigurable supercomputer augmented with a 3D torus interconnect between FPGAs
Approach:
• Build a discrete-event simulation model for distributed 3D FFT
• Build and validate a system model for a direct-connected FPGA cluster
• Predict Novo-G# performance for the 3D FFT kernel
Novo-G Reconfigurable Supercomputer
Developed and deployed at CHREC
Successfully used to accelerate apps from multiple domains
• Bioinformatics: Smith-Waterman, BLAST, isotope pattern calculator, and others
• Image processing: Image segmentation, stereo vision, and others
• Computational finance: Options pricing
At a fraction of the cost, size, power, cooling, etc. of high-end conventional supercomputers
Supports a broad range of apps, tools, and systems research tasks in CHREC
2012 Schwarzkopf Prize
CHREC & Novo-G recognized with 2012 Alexander Schwarzkopf Prize for Technology Innovation @ NSF
48 PROCStar III boards 192 Stratix III E260 FPGAs
48 PROCStar IV boards 192 Stratix IV E530 FPGAs
Multi-FPGA Systems
How do scientific computing apps scale on accelerator-based systems?
Poor performance scaling of large-scale apps due to communication latency
• Molecular dynamics, computational fluid dynamics, machine learning, linear algebra, etc.
• Performance at scale is impacted by communication between kernels on FPGAs
• Traditionally, communication is handled by the host (3 hops)
[Figure: Traditional accelerator-based HPC system: CPUs and FPGAs in Node 1 and Node 2, connected through an IB switch]
Solution: Augment the system with a direct inter-FPGA interconnect
• Allows low-latency communication directly between FPGA kernels
• High-bandwidth, distributed networks improve both app and system scalability
Novo-G# (Novo-jee-sharp): Novo-G ProceV Upgrade w/ 3D Torus
3D torus connectivity between FPGAs
• 32 GiDEL ProceV (Stratix V D8)
• 4x4x2 3D torus or 5D hypercube
• 6 Rx-Tx links per FPGA
• 4x 10 Gbps per link
• Data-link layer: SerialLite III protocol
  • Full-duplex, CRC32 protection, in-band or out-of-band flow control
• Physical layer: Interlaken protocol
  • 64B/67B encoding, multi-lane sync.
[Figure: ProceV board with Stratix V D8 device, QSFP+ daughterboard (3x QSFP+ ports, 4x10 Gbps channels each), CXP port underneath (12x10 Gbps channels), and 8-lane PCI Express Gen3]
[Figure: CXP to 3-QSFP cable (provides connectivity for 3D torus)]
[Figure: 2x4x4 torus (can be expanded further)]
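As a rough illustration of the torus connectivity listed on this slide, the following Python sketch computes the six wraparound neighbors of one FPGA in a 4x4x2 torus, plus the raw per-link and per-FPGA bandwidth implied by 6 links of 4x 10 Gbps. The function and constant names are hypothetical, and the bandwidth figures are raw line rates that ignore 64B/67B encoding and protocol overhead.

```python
def torus_neighbors(node, dims):
    """Return the six neighbors (X+, X-, Y+, Y-, Z+, Z-) of `node` in a 3D torus."""
    neighbors = {}
    for axis, name in enumerate("XYZ"):
        for step, sign in ((1, "+"), (-1, "-")):
            coord = list(node)
            # Modulo arithmetic models the wraparound link at the torus edge.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors[name + sign] = tuple(coord)
    return neighbors


DIMS = (4, 4, 2)           # torus dimensions from the slide
LINKS_PER_FPGA = 6         # Rx-Tx links per FPGA
CHANNELS_PER_LINK = 4      # 4x 10 Gbps per link
CHANNEL_RATE_GBPS = 10

link_gbps = CHANNELS_PER_LINK * CHANNEL_RATE_GBPS
aggregate_gbps = LINKS_PER_FPGA * link_gbps

print(torus_neighbors((3, 0, 1), DIMS))
print("raw Gbps per link:", link_gbps, "raw Gbps per FPGA:", aggregate_gbps)
```

In this sketch, a dimension of size 2 (Z here) makes the + and - links of a node reach the same neighbor.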
Molecular Dynamics (MD)
Comm. and computation in MD
[Figure: MD timestep flowchart: Initialize particle positions and velocities -> Calculate forces on each particle -> Update state for each particle -> Save particle state and increment timestep]
• Problem decomposition on multiple processors
  • Particle decomposition, spatial decomposition, and force decomposition
• Range-limited force computation
  • Neighbor lists/cutoff radius used to restrict computation and communication
• Long-range force computation
  • Requires pairwise distance computation among all pairs of atoms (comm. intensive)
  • Traditionally, computation optimized by transforming to frequency domain (3D FFT)
[Figure: Spatial decomposition]
Anton – ASIC System for MD
Developed by D.E. Shaw Research [1,2]
• 64 - 512 tightly coupled processing nodes @ 485 MHz
• 3D torus network with 50.6 Gbps link bandwidth
• Inter- and intra-ASIC networks with specialized routing for common MD comm. patterns
• Autonomous DMA engines to offload comm. tasks
[Figure: Anton ASIC architecture: six link adapters (X+, X-, Y+, Y-, Z+, Z-) at 50.6 Gbps, six routers on an intra-node ring network, flexible subsystem with GCs, and HTIS (High-Throughput Interaction Subsystem)]
System modeling
System modeling in VisualSim
• Hierarchical model, divided into channel and node models
• Node model describes behavior of FPGA/ASIC elements
  • Processing elements
  • Intra-node network
  • Memory access
• Channel model describes inter-node comm.
  • Node connections
  • Transition & propagation delays
  • Packet queues
[Figure: Top-level model, comprising init. blocks, output display blocks, channel (inter-node) models, and FPGA (intra-node) models]
Application modeling
Application modeling in VisualSim
• 3D FFT app modeled as script running in a Virtual Machine block
  • Block triggers for each token on its input ports
• Model can be verified (sanity check) using actual FFT computation
  • Output is checked against the MATLAB FFT function

Script pseudocode:
  INIT:
    Distribute initial FFT data tokens to each node
  BEGIN:
    If input_token corresponds to current stage
      Wait until all data elements arrive
      If Verify_mode then
        Compute local FFT
      End If
      For all data elements in the node
        Generate message for next stage
        Determine destination and index
        Output token with computation delay
      End For
    End If
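To make the token-triggered behavior of the script more concrete, here is a minimal Python sketch of a per-node block that buffers tokens until all data elements of the current stage have arrived and then emits next-stage tokens with a computation delay. It is not VisualSim code: the class name, token fields, the `route` callback, and the delay and element counts in the usage lines are all hypothetical, and the optional Verify_mode FFT computation is omitted.

```python
from collections import defaultdict


class FFTStageBlock:
    """Hypothetical stand-in for the per-node application script."""

    def __init__(self, node_id, elements_per_stage, compute_delay_ns, route):
        self.node_id = node_id
        self.elements_per_stage = elements_per_stage
        self.compute_delay_ns = compute_delay_ns
        self.route = route              # (node_id, stage, index) -> (dest, index)
        self.stage = 0
        self.pending = defaultdict(list)

    def on_token(self, token, now_ns):
        """Triggered once per input token; returns the tokens to send out."""
        self.pending[token["stage"]].append(token)
        outputs = []
        # Wait until all data elements of the current stage have arrived.
        while len(self.pending[self.stage]) >= self.elements_per_stage:
            for tok in self.pending.pop(self.stage):
                # Determine destination and index for the next stage.
                dest, index = self.route(self.node_id, self.stage, tok["index"])
                outputs.append({
                    "stage": self.stage + 1,
                    "dest": dest,
                    "index": index,
                    "time_ns": now_ns + self.compute_delay_ns,
                })
            self.stage += 1
        return outputs


# Placeholder values: 2 elements per stage, 42 ns delay, 4-node ring routing.
block = FFTStageBlock(node_id=0, elements_per_stage=2, compute_delay_ns=42,
                      route=lambda node, stage, index: ((node + 1) % 4, index))
print(block.on_token({"stage": 0, "index": 0}, now_ns=0))
print(block.on_token({"stage": 0, "index": 1}, now_ns=10))
```

Tokens that arrive early for a later stage are simply buffered in this sketch until the block reaches that stage.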
Decomposition of 3D FFT
• 3D FFT can be decomposed into three 1D-FFT stages, one per dimension
• For a multi-node system, initial data and intermediate outputs need to be exchanged before the next stage of computation
• Following diagram shows communication & computation for 32×32×32 FFT decomposed on 4×4×4 system
[Figure: 3D decomposition of data and FFT stages. Panels in order: Initial Distribution; Comm.: X-Fold; Stage 1: X-FFT; Comm.: Y-Fold; Stage 2: Y-FFT; Comm.: XZ-Turn; Stage 3: Z-FFT. Axes: X, Y, Z. Legend: Communication (source node, destination node); Data Distribution (data element: initial, X-FFT, Y-FFT, Z-FFT)]
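The claim that a 3D FFT splits into three 1D-FFT stages can be checked numerically on a tiny grid. The sketch below uses a naive pure-Python DFT (not an FFT, and unrelated to the FPGA cores) on a single node, so the inter-node exchanges between stages are not modeled; all function names and the 4×4×4 test grid are assumptions chosen for illustration.

```python
import cmath
import itertools


def dft_1d(values):
    """Naive O(n^2) 1D DFT; adequate for a tiny demonstration."""
    n = len(values)
    return [sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def line_key(axis, a, b, i):
    """Build an (x, y, z) key whose coordinate along `axis` is i."""
    idx = [a, b]
    idx.insert(axis, i)
    return tuple(idx)


def dft_along_axis(grid, axis, n):
    """Apply dft_1d to every 1D line of an n*n*n grid along one axis."""
    out = {}
    for a, b in itertools.product(range(n), repeat=2):
        line = dft_1d([grid[line_key(axis, a, b, i)] for i in range(n)])
        for i in range(n):
            out[line_key(axis, a, b, i)] = line[i]
    return out


def dft_3d_staged(grid, n):
    """Three 1D stages, one per dimension (X, then Y, then Z)."""
    for axis in (0, 1, 2):
        grid = dft_along_axis(grid, axis, n)
    return grid


def dft_3d_direct(grid, n):
    """Reference: the 3D DFT definition evaluated as a direct triple sum."""
    def w(p):
        return cmath.exp(-2j * cmath.pi * p / n)
    return {(kx, ky, kz): sum(v * w(x * kx + y * ky + z * kz)
                              for (x, y, z), v in grid.items())
            for kx, ky, kz in itertools.product(range(n), repeat=3)}


N = 4
grid = {(x, y, z): float(x + 2 * y + 3 * z)
        for x, y, z in itertools.product(range(N), repeat=3)}
staged = dft_3d_staged(grid, N)
direct = dft_3d_direct(grid, N)
print("max |staged - direct| =", max(abs(staged[k] - direct[k]) for k in grid))
```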
Decomposition of 3D FFT
Notation used:
• Notation below used to depict FFT comm. for given problem & system size
• Used here to derive comm. patterns for VisualSim model & find optimizations for same

32×32×32 FFT on 8×8×8 system
• Original spatial distribution: $(x_4x_3x_2,\ y_4y_3y_2,\ z_4z_3z_2)\ .\ x_1\ .\ x_0y_1y_0z_1z_0$
• After comm. over x-dimension: $(y_1y_0z_0,\ y_4y_3y_2,\ z_4z_3z_2)\ .\ z_1\ .\ x_4x_3x_2x_1x_0$
• After comm. over x- & y-dimensions: $(x_1x_0z_0,\ x_4x_3x_2,\ z_4z_3z_2)\ .\ z_1\ .\ y_4y_3y_2y_1y_0$
• After comm. over x- & z-dimensions: $(x_1x_0y_0,\ x_4x_3x_2,\ y_4y_3y_2)\ .\ y_1\ .\ z_4z_3z_2z_1z_0$

32×32×32 FFT on 4×4×4 system
• Original spatial distribution: $(x_4x_3,\ y_4y_3,\ z_4z_3)\ .\ x_2x_1y_2z_2\ .\ x_0y_1y_0z_1z_0$
• After comm. over x-dimension: $(y_2y_1,\ y_4y_3,\ z_4z_3)\ .\ y_0z_2z_1z_0\ .\ x_4x_3x_2x_1x_0$
• After comm. over x- & y-dimensions: $(x_2x_1,\ x_4x_3,\ z_4z_3)\ .\ x_0z_2z_1z_0\ .\ y_4y_3y_2y_1y_0$
• After comm. over z-dimension: $(x_2x_1,\ x_4x_3,\ y_4y_3)\ .\ x_0y_2y_1y_0\ .\ z_4z_3z_2z_1z_0$
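One way to read the notation is as three dot-separated bit fields of the global (x, y, z) index: the parenthesized triple gives the node coordinates, and the last field is the 5-bit position along the dimension being transformed. That reading, and the interpretation of the middle field as the index of a 1D FFT within the node, are assumptions inferred from the bit counts rather than something stated on the slide. Under those assumptions, the Python sketch below locates an element for two of the 8×8×8 layouts; the helper names are hypothetical.

```python
def bits_to_int(spec, x, y, z):
    """Pack the named bits (e.g. "y1 y0 z0", MSB first) of x, y, z into an int."""
    coords = {"x": x, "y": y, "z": z}
    value = 0
    for name in spec.split():
        axis, bit = name[0], int(name[1:])
        value = (value << 1) | ((coords[axis] >> bit) & 1)
    return value


def locate(layout, x, y, z):
    """Return (node coordinates, middle-field index, element index)."""
    node_specs, fft_spec, elem_spec = layout
    node = tuple(bits_to_int(s, x, y, z) for s in node_specs)
    return node, bits_to_int(fft_spec, x, y, z), bits_to_int(elem_spec, x, y, z)


# 32x32x32 FFT on an 8x8x8 system, transcribed from the slide.
ORIGINAL_8 = (("x4 x3 x2", "y4 y3 y2", "z4 z3 z2"), "x1", "x0 y1 y0 z1 z0")
AFTER_X_8 = (("y1 y0 z0", "y4 y3 y2", "z4 z3 z2"), "z1", "x4 x3 x2 x1 x0")

print(locate(ORIGINAL_8, 13, 6, 21))
print(locate(AFTER_X_8, 13, 6, 21))
```

After the x-dimension exchange, all 32 x-values that share the same (y, z) land on one node in one slot, which is what a local 1D FFT along x needs.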
Anton modeling
Model Parameters:
• Parameters derived from published papers on Anton
• Comm. patterns for 3D FFT are derived as described earlier

TABLE I. MODELING PARAMETERS FOR ANTON MACHINE
Parameter | Value | Reference
System Frequency | 485 MHz | [3]
Internal bandwidth | 124.2 Gbit/s | [2]
External bandwidth | 50.6 Gbit/s | [2]
Synchronization delay | 42 ns | [6]
Packet writing delay | 36 ns | [6]
Wire delay, x | 4 ns | [6]
Wire delay, y | 8 ns | [6]
Wire delay, z | 10 ns | [6]
Transceiver delay | 20 ns | [6]
FFT calculation time, 1 GC | 137 cycles | [4]
FFT calculation time, 4 GCs | 75 cycles | [4]
Anton modeling
• Intra-node ring network is reduced to a delay model
  • Simulating 6 routers with multiple queues each greatly increases required resources
  • Delay model uses latencies from parameters table to model each pair of network endpoints
  • Four queues used to reduce contention in the network

TABLE I. ROUTING LATENCIES (ns) FOR ANTON MACHINE
(rows: destination; columns: source direction)
Destination | X+ | X- | Y+ | Y- | Z+ | Z- | Processing Slice
X+ | - | 31 | 25 | 25 | 19 | 19 | 19
X- | 31 | - | 19 | 19 | 25 | 25 | 25
Y+ | 25 | 19 | - | 13 | 25 | 25 | 31
Y- | 25 | 19 | 13 | - | 19 | 19 | 31
Z+ | 19 | 25 | 25 | 19 | - | 13 | 25
Z- | 19 | 25 | 25 | 19 | 13 | - | 25
Processing Slice | 19 | 25 | 31 | 31 | 25 | 25 | -
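A delay model of this kind amounts to a table lookup per pair of endpoints. The following Python sketch encodes the routing latencies from the table above and returns the latency for a given source and destination; the names and the "PS" abbreviation for the processing slice are hypothetical, and how the simulation combines this value with wire and transaction delays is not shown here.

```python
ENDPOINTS = ("X+", "X-", "Y+", "Y-", "Z+", "Z-", "PS")  # PS = processing slice

# Rows are destinations, columns are sources, values in ns (None on the diagonal).
ROUTING_LATENCY_NS = (
    (None, 31, 25, 25, 19, 19, 19),
    (31, None, 19, 19, 25, 25, 25),
    (25, 19, None, 13, 25, 25, 31),
    (25, 19, 13, None, 19, 19, 31),
    (19, 25, 25, 19, None, 13, 25),
    (19, 25, 25, 19, 13, None, 25),
    (19, 25, 31, 31, 25, 25, None),
)


def routing_latency_ns(source, destination):
    """Look up the intra-node routing latency between two endpoints."""
    latency = ROUTING_LATENCY_NS[ENDPOINTS.index(destination)][ENDPOINTS.index(source)]
    if latency is None:
        raise ValueError("no table entry for identical source and destination")
    return latency


print(routing_latency_ns("X+", "X-"))
print(routing_latency_ns("PS", "Y+"))
```

The transcribed table is symmetric, so swapping source and destination gives the same latency.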
[Figure: Channel model. Six channels (X+, X-, Y+, Y-, Z+, Z-), each with an input queue, a transaction delay, and a wire delay (4 ns for X, 8 ns for Y, 10 ns for Z), connected to the intra-node model]
[Figure: Node model. Application block with computation time, queue, and routing delay; internal ring structure with Layer2 router and X+, X-, Y, and Z link adapters; annotated delays of 7, 13, 19, 25, 31, 36, 42, and 43 ns; inputs from the channel model and outputs to display blocks]
Anton modeling
Model Validation
• Model is validated using published Anton data from [4]
• Parallel strategy: FFTs can be executed on a single GC or on 4 GCs operating in parallel

TABLE I. MODEL VERIFICATION WITH ANTON DATA
System Size | FFT Size | Parallel strategy | Anton exec. time (µs) | Measured model time (µs) | Error
8x8x8 | 32x32x32 | 1FFT:4GCs | 3.7 | 3.5 | 5.4%
 | 32x32x32 | 1FFT:1GC | 4.0 | 3.75 | 6.25%
 | 64x64x64 | 1FFT:1GC | 13.2 | 12.7 | 3.7%
 | 16x16x16 | 1FFT:1GC | 2.4 | 2.55 | 6.25%
4x4x4 | 32x32x32 | 2FFTs:1GC | 10.5 | 10.0 | 4.8%
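The error column can be recomputed from the two time columns as the relative difference between model time and Anton time. The Python sketch below does this with the rounded values from the table, so a result may differ in the last digit from the slide (which may have used unrounded times); the variable and function names are hypothetical.

```python
# (FFT size, parallel strategy, Anton time in us, model time in us), from the table.
ROWS = [
    ("32x32x32", "1FFT:4GCs", 3.7, 3.5),
    ("32x32x32", "1FFT:1GC", 4.0, 3.75),
    ("64x64x64", "1FFT:1GC", 13.2, 12.7),
    ("16x16x16", "1FFT:1GC", 2.4, 2.55),
    ("32x32x32", "2FFTs:1GC", 10.5, 10.0),
]


def percent_error(reference_us, model_us):
    """Relative error of the model time against the published Anton time."""
    return abs(model_us - reference_us) / reference_us * 100.0


for fft_size, strategy, anton_us, model_us in ROWS:
    print(f"{fft_size:>9} {strategy:<10} {percent_error(anton_us, model_us):.2f}%")

worst = max(percent_error(a, m) for _, _, a, m in ROWS)
print(f"worst-case error: {worst:.2f}%")
```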
Novo-G# model
Modeling the 3D torus
• Top-level and channel models remain the same
• Updated node model shown here
[Figure: Novo-G# node model. Application block with computation time; internal structure with Layer2 switch and Layer3 router; transaction delays (36 ns, 44 ns); connections from/to channel blocks and to output display]
Novo-G# model
3D torus modeling
• Structural change: Anton intra-node ring replaced by centralized router
• Novo-G# parameters are derived from various sources:
  • Propagation delay, channel rate from h/w benchmarks
  • FFT core data from Altera MegaCore instantiation and synthesis
  • Router data from ModelSim simulation

TABLE I. MODELING PARAMETERS FOR PROTOTYPE SYSTEM
Parameter | Value | Notes
System Frequency | 250 MHz | Derived from Altera MegaCore data for Stratix V devices; N = FFT size
FFT latency | N cycles |
Num_cores | 16 |
From H/W benchmarks:
Propagation delay | 20 ns | From roundtrip latency
Channel rate | 10 Gbps | Data rate per channel of a link
Channel width | 4 | No. of physical channels per link
From H/W simulation:
Internal latency for completed packets | 11 cycles | Altera FIFOs in critical data path contribute 3 cycles each (optimized for frequency)
Internal latency for incomplete packets | 11 cycles |
Write packet initiation delay | 9 cycles |
Simulation parameters:
System Size | - | From 2x2x2 to 4x4x4
Packet buffer length | 129 |
Input/Output queue length | |
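Several of these parameters are given in clock cycles or per channel, so a quick conversion helps when relating them to the nanosecond delays used in the model. The Python sketch below converts cycle counts to time at the 250 MHz system frequency and multiplies channel rate by channel width to get the raw link rate; the names are hypothetical, and the N = 32 case assumes a single 32-point 1D FFT on one core.

```python
SYSTEM_FREQ_MHZ = 250      # system frequency from the table
CHANNEL_RATE_GBPS = 10     # data rate per channel of a link
CHANNEL_WIDTH = 4          # physical channels per link


def cycles_to_ns(cycles, freq_mhz=SYSTEM_FREQ_MHZ):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles * 1000 / freq_mhz


link_rate_gbps = CHANNEL_RATE_GBPS * CHANNEL_WIDTH

print("internal packet latency (11 cycles):", cycles_to_ns(11), "ns")
print("write packet initiation (9 cycles):", cycles_to_ns(9), "ns")
print("1D FFT latency for N = 32 (N cycles):", cycles_to_ns(32), "ns")
print("raw link rate:", link_rate_gbps, "Gbps")
```

The 11-cycle and 9-cycle values convert to 44 ns and 36 ns, which appear to correspond to the delay annotations in the Novo-G# node-model figure.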
Novo-G# performance
Model predictions
• Novo-G# model used to predict 3D FFT kernel execution times for various configurations
• Model prediction approx. 2x speed of Anton at 1/8 system size
  • Anton: 32³: 7.4 µs, 64³: 26.4 µs

TABLE I. PREDICTED 3D FFT KERNEL EXECUTION TIMES (µs)
(rows: FFT size; columns: system size)
FFT Size | 2×2×2 | 2×4×2 | 2×4×4 | 4×4×4 | 4×8×4
16×16×16 | 1.753 | 1.574 | 1.509 | 1.570 | 1.678
32×32×32 | 9.997 | 8.563 | 5.749 | 3.943 | 3.302
64×64×64 | 75.94 | 64.47 | 42.31 | 26.11 | 17.71
128×128×128 | 603.5 | 511.8 | 334.8 | 207.2 | 136.7

• Graph of % utilization shows sharp decline which levels off at 20-30%
• Diminishing returns: Tradeoff between lowest absolute execution time and utilization
[Figure: % ideal computation time to total time (axis from 0 to 0.7) versus system size (2x2x2, 2x4x2, 2x4x4, 4x4x4, 4x8x4), one series per FFT size (128×128×128, 64×64×64, 32×32×32, 16×16×16)]
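The predicted times can be compared against the quoted Anton times and against each other with a few lines of arithmetic. The Python sketch below computes the ratio of Anton time to predicted Novo-G# time and a simple strong-scaling efficiency relative to the 2×2×2 system. The efficiency definition and all names are assumptions for illustration (it is not the "% ideal computation time" metric plotted on the slide), and the inputs are the rounded table values.

```python
SYSTEMS = ["2x2x2", "2x4x2", "2x4x4", "4x4x4", "4x8x4"]
PREDICTED_US = {            # FFT size per dimension -> predicted time per system (us)
    16: [1.753, 1.574, 1.509, 1.570, 1.678],
    32: [9.997, 8.563, 5.749, 3.943, 3.302],
    64: [75.94, 64.47, 42.31, 26.11, 17.71],
    128: [603.5, 511.8, 334.8, 207.2, 136.7],
}
ANTON_US = {32: 7.4, 64: 26.4}  # Anton times quoted on the slide


def node_count(system):
    """Number of nodes in a system written as "AxBxC"."""
    a, b, c = (int(part) for part in system.split("x"))
    return a * b * c


def speedup_vs_anton(fft_size, system):
    """Ratio of the quoted Anton time to the predicted Novo-G# time."""
    return ANTON_US[fft_size] / PREDICTED_US[fft_size][SYSTEMS.index(system)]


def scaling_efficiency(fft_size, system):
    """Speedup over the 2x2x2 prediction divided by the increase in node count."""
    times = PREDICTED_US[fft_size]
    speedup = times[0] / times[SYSTEMS.index(system)]
    return speedup / (node_count(system) / node_count(SYSTEMS[0]))


print(f"32^3 on 4x4x4 vs Anton: {speedup_vs_anton(32, '4x4x4'):.2f}x")
print(f"128^3 scaling efficiency on 4x8x4: {scaling_efficiency(128, '4x8x4'):.2f}")
```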
Novo-G# performance
Improvement in performance over Anton is attributed to:
• Stratix V devices use 28-nm fabrication
  • Better computation density than Anton
  • Better clock frequencies than previous FPGA generations
• Higher computational density => smaller system => more data reuse between FFT stages
  • Tradeoff is that smaller systems require multiple FFT rounds per stage, increasing total run time
• Novo-G# network performs better with coarse-grained communication
  • Larger data blocks used to reduce total no. of packets
  • Communication pattern reorganized to reduce network contention
MD Communication Characterization
Communication Requirements for FPGA-Centric Molecular Dynamics [7]
Assumptions
• Bidirectional 3D torus of FPGAs
• 1 FPGA = 8 CPU cores
• Range-limited computation data up to 3 hops away
• Long-range computation requires all-to-all communication
[Figure: Communication requirements for problem sizes of 100k atoms and 1M atoms, each annotated with ~40]
Bandwidth available on Novo-G# surpasses requirements for given problem sizes
• Novo-G# provides 40 Gbps/link in 6 directions
[Figure: Novo-G upgrade in progress: 8 ProceV nodes]
Conclusions
• 3D FFT kernel execution on Anton was modeled successfully using VisualSim
  • <7% error over 2 different system sizes and 3 FFT sizes
• Benchmarking & simulation yielded data for Novo-G# 3D torus network
  • Data used to model Novo-G# and predict 3D FFT execution time on system
• Prediction approx. 2x speed of Anton machine
  • Favorably compares to other large-scale solutions like BlueGene/L and QCDOC
References
1. D. E. Shaw, J. C. Chao, M. P. Eastwood, J. Gagliardo, J. P. Grossman, C. R. Ho, D. J. Ierardi, I. Kolossváry, J. L. Klepeis, T. Layman, C. McLeavey, M. M. Deneroff, M. A. Moraes, R. Mueller, E. C. Priest, Y. Shan, J. Spengler, M. Theobald, B. Towles, S. C. Wang, R. O. Dror, J. S. Kuskin, R. H. Larson, J. K. Salmon, C. Young, B. Batson, and K. J. Bowers, “Anton, a special-purpose machine for molecular dynamics simulation,” in Proceedings of the 34th annual international symposium on Computer architecture - ISCA ’07, 2007, p. 1.
2. D. E. Shaw, J. C. Chao, M. P. Eastwood, J. Gagliardo, J. P. Grossman, C. R. Ho, D. J. Ierardi, I. Kolossváry, J. L. Klepeis, T. Layman, C. McLeavey, M. M. Deneroff, M. A. Moraes, R. Mueller, E. C. Priest, Y. Shan, J. Spengler, M. Theobald, B. Towles, S. C. Wang, R. O. Dror, J. S. Kuskin, R. H. Larson, J. K. Salmon, C. Young, B. Batson, and K. J. Bowers, “Anton, a special-purpose machine for molecular dynamics simulation,” Commun. ACM, vol. 51, no. 7, p. 91, Jul. 2008.
3. D. E. Shaw, K. J. Bowers, E. Chow, M. P. Eastwood, D. J. Ierardi, J. L. Klepeis, J. S. Kuskin, R. H. Larson, K. Lindorff-Larsen, P. Maragakis, M. A. Moraes, R. O. Dror, S. Piana, Y. Shan, B. Towles, J. K. Salmon, J. P. Grossman, K. M. Mackenzie, J. A. Bank, C. Young, M. M. Deneroff, and B. Batson, “Millisecond-scale molecular dynamics simulations on Anton,” in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC ’09, 2009, no. c, p. 1.
4. C. Young, J. A. Bank, R. O. Dror, J. P. Grossman, J. K. Salmon, and D. E. Shaw, “A 32x32x32, spatially distributed 3D FFT in four microseconds on Anton,” in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC ’09, 2009, no. c, p. 1.
5. J. S. Kuskin, C. Young, J. P. Grossman, B. Batson, M. M. Deneroff, R. O. Dror, and D. E. Shaw, “Incorporating flexibility in Anton, a specialized machine for molecular dynamics simulation,” in 2008 IEEE 14th International Symposium on High Performance Computer Architecture, 2008, pp. 343–354.
6. R. O. Dror, J. P. Grossman, K. M. Mackenzie, B. Towles, E. Chow, J. K. Salmon, C. Young, J. A. Bank, B. Batson, M. M. Deneroff, J. S. Kuskin, R. H. Larson, M. A. Moraes, and D. E. Shaw, “Exploiting 162-Nanosecond End-to-End Communication Latency on Anton,” in 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010, pp. 1–12.
7. M. A. Khan and M. C. Herbordt, “Communication Requirements for FPGA-Centric Molecular Dynamics,” in Proceedings of the Symposium on Application Accelerators for High Performance Computing, 2012.
QUESTIONS?
[Figure: Side-by-side comparison of the Anton model and the Novo-G# model. Top-level models: init. blocks, output display blocks, and channel (inter-node) models. Channel models: six channels (X+, X-, Y+, Y-, Z+, Z-), each with queue, transaction delay, and wire delay (4 ns for X, 8 ns for Y, 10 ns for Z). Node models: Anton uses an internal ring structure with Layer2 router and link adapters; Novo-G# uses an internal structure with Layer2 switch and Layer3 router]
Protocol & Network layers
3D-torus FPGA architecture and network services
Services available through RTL design
• Network layer (Layer 3 Router)
  • Dimension order routing
  • Collective routing
  • Source data buffering
• Data-link layer (Layer 2 Switch)
  • Physical addressing
  • Packet switching
  • Congestion control
Services available through IP
• Data-link layer (Protocol IP)
  • Data framing
  • Error detection (CRC)
• Physical layer (Transceivers)
  • Clock recovery
  • Line coding
  • Multi-lane sync.
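Dimension order routing, listed above as a network-layer service, can be sketched in a few lines: a packet fully resolves one dimension before moving to the next, taking the shorter way around each ring of the torus. The slide gives no implementation details, so the X-then-Y-then-Z order, the tie-breaking toward the + direction, and all names in this Python sketch are assumptions.

```python
def ring_step(src, dst, size):
    """Return +1, -1, or 0: the shorter direction around a ring of `size` nodes."""
    if src == dst:
        return 0
    forward = (dst - src) % size
    # Ties between the two directions are broken toward the + direction.
    return 1 if forward <= size - forward else -1


def dimension_order_route(src, dst, dims):
    """List the nodes visited from src to dst, resolving X, then Y, then Z."""
    path = [tuple(src)]
    cur = list(src)
    for axis in range(3):
        while cur[axis] != dst[axis]:
            step = ring_step(cur[axis], dst[axis], dims[axis])
            cur[axis] = (cur[axis] + step) % dims[axis]
            path.append(tuple(cur))
    return path


route = dimension_order_route((0, 0, 0), (3, 2, 1), (4, 4, 2))
print(route)
print("hops:", len(route) - 1)
```

In this example the X hop uses the wraparound link from 0 to 3 instead of three forward hops.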