Department of Computer Science, Institute of Computer Engineering
Integration of an Easy Accessible, Highly Scalable, Energy Efficient Multi-FPGA-Based Hardware Accelerator in Common Cluster Infrastructures
HUCAA 2013
Oliver Knodel
[email protected]
Lyon, 1 October 2013
Motivation
• Reconfigurable hardware (e.g. FPGAs) is an interesting opportunity to change a system’s behavior and to evaluate new system concepts.
• To satisfy a wide range of requirements in a computing cluster, a flexible integration of FPGAs is necessary.
⇒ Objective: Combining the needs of a wide range of applications in a single scalable system.
1 October 2013
Integration of FPGAs in Common Cluster Infrastructures
Slide 2 of 23
Agenda
01 Introduction
02 HPC and FPGAs – Current Relation
03 Creating a Flexible Platform
04 Reconfigurable Cluster Architecture
   - System Design
   - Programming Concept
05 Conclusions
01 Introduction
Field-Programmable Gate Array (FPGA)
• Reconfigurable integrated circuit.
• Different types of parallelism:
  • Bit-Level: e.g. 1,024 bit arithmetic.
  • Pipelining: e.g. filter algorithms with several hundred stages.
  • Program-Level: e.g. up to 1,000 concurrent computation cores.
• Energy-efficient alternative to GPUs.
⇒ Applications requiring simple computation cores and data structures are highly suitable for FPGAs.
01 Introduction
FPGA Architecture (e.g. Xilinx Virtex-7 X690T)
[Figure: FPGA architecture. Labels: user design; Boolean computation unit (slice) with lookup tables (6-ary functions) and D flip-flops (1 bit); fast-forward; memory; I/O banks; high-speed I/O; PCI Express endpoint (PCIe 3.0 x8); DDR3 channel; 10G Ethernet; clock (typ. 50 to 250 MHz).]
02 HPC and FPGAs – Current Relation
I. FPGA Coprocessor:
[Figure: CPUs connected to an FPGA accelerator via PCIe or Ethernet.]
a) Bio science:
   - Massively parallel short-read mapping on FPGAs [1].
   - Sequence alignment (Smith & Waterman).
b) Image processing:
   - 3D point triangulation for structure-from-motion.
   - Filter algorithms for the ATLAS detector at CERN [2].
c) Low latency network processing:
   - High-Frequency-Trading / TCP-Offload-Engines [3].
02 HPC and FPGAs – Current Relation
II. Pure Multi-FPGA Systems:
[Figure: CPUs connected via PCIe to an accelerator made up of several FPGAs.]
   - ASIC prototyping.
   - Multicore and Network-on-Chip [4].
   - Simulation of large neuronal networks [5].
III. Integration of FPGAs in HPC Clusters:
[Figure: two nodes, each with CPUs and an FPGA accelerator attached via PCIe.]
   - Simple nodes with an additional GPU or FPGA coprocessor.
   - Examples are the Axel and the QP systems [6, 7].
03 Creating a Flexible Platform
The Challenge
Each computing application has different requirements:
• FPGAs as hardware accelerators tightly coupled with a host processor.
• Dozens of FPGAs aggregated to a multi-FPGA system without a closely coupled host processor.
03 Creating a Flexible Platform
The Idea
Combining the main advantages in one single reconfigurable architecture:
• Integration and centralization of distributed FPGAs in a common cluster infrastructure.
• Separate high-speed interconnects between the FPGA accelerators.
• Allocation of resources by a batch system.
04 Reconfigurable Cluster Architecture
Hardware – Overview
• One additional FPGA for each node.
• The FPGAs are connected by a separate grid or cube topology.
• The FPGAs form a shared memory.
[Figure: four nodes (Node 1 to Node 4), each with CPUs and one FPGA attached via PCIe 3.0, joined by the cluster interconnect and attached to the SAN. The four FPGAs are linked with each other and together form the Shared Accelerator Memory Space.]
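The grid or cube wiring between the FPGAs can be illustrated with a small sketch. It assumes a regular mesh without wrap-around links and addresses each FPGA by integer coordinates. The function name `mesh_neighbours` and the coordinate scheme are illustrative assumptions and not part of the presented system.

```python
from itertools import product


def mesh_neighbours(coord, shape):
    """Return the coordinates directly linked to `coord` in a mesh of size `shape`.

    Works for 2-D grids and 3-D cubes alike.
    Assumption: no wrap-around (torus) links.
    """
    neighbours = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            pos = coord[axis] + step
            if 0 <= pos < size:
                # Replace only the coordinate along the current axis.
                neighbours.append(coord[:axis] + (pos,) + coord[axis + 1:])
    return neighbours


# Link table for a 2 x 2 grid of FPGAs, e.g. one FPGA per node in a four-node cluster.
grid_links = {c: mesh_neighbours(c, (2, 2)) for c in product(range(2), repeat=2)}
```

In a 2 x 2 grid every FPGA has two direct links; an FPGA in the interior of a 3 x 3 x 3 cube would have six.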
04 Reconfigurable Cluster Architecture
Hardware – FPGA Accelerator
The basic design sufficient for most applications consists of:
• User-reconfigurable tiles,
• Simple PCIe-interface to the host (DMA and GPIO),
• Lightweight protocol for inter-FPGA communications,
• Interface for on- and off-chip memories,
• …
[Figure: FPGA accelerator design. Eight tiles and the on-chip memory are joined by an on-chip interconnect; further blocks are the memory controller, debug & trace, the PCI-Express endpoint to the host system, Serial-Attached SCSI, Serial-ATA, 10 GBit Ethernet and 40 GBit Ethernet.]
04 Reconfigurable Cluster Architecture
Hardware – Inter-FPGA communication
• Serial high-speed links connect the FPGAs.
• The underlying network is based on a protocol featuring virtual channels and priorities:

Byte    Bits 63–48   Bits 47–32   Bits 31–16    Bits 15–0
0       TypeID       Source       Destination   ChannelID    (Header)
8       Additional header data
16      Payload
24      … (up to 64 KiB)
...
n-8     Optional                  Checksum (CRC-32)          (Footer)

• MPI and shared memory operations are possible.
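The packet layout above can be made concrete with a minimal sketch. Several details are assumptions made only for illustration, since the slide does not specify them: big-endian byte order, four 16-bit header fields in the listed order, exactly one 64-bit word of additional header data, zero padding of the payload to a full word, a zeroed "optional" footer field, and a CRC-32 computed over everything before the footer. The function names are hypothetical.

```python
import struct
import zlib

WORD = 8                 # bytes per 64-bit word
MAX_PAYLOAD = 64 * 1024  # payload limit stated on the slide


def build_packet(type_id, source, destination, channel_id, payload, extra_header=0):
    """Assemble header, additional header word, payload and footer (assumed layout)."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload exceeds 64 KiB")
    header = struct.pack(">HHHH", type_id, source, destination, channel_id)
    extra = struct.pack(">Q", extra_header)
    padding = b"\x00" * (-len(payload) % WORD)  # align payload to 64 bits
    body = header + extra + payload + padding
    footer = struct.pack(">II", 0, zlib.crc32(body))  # optional field left at zero
    return body + footer


def parse_header(packet):
    """Return (type_id, source, destination, channel_id) from the first word."""
    return struct.unpack(">HHHH", packet[:WORD])


def check_packet(packet):
    """Recompute the CRC-32 over the body and compare it with the footer."""
    body, footer = packet[:-WORD], packet[-WORD:]
    _, crc = struct.unpack(">II", footer)
    return crc == zlib.crc32(body)


packet = build_packet(type_id=1, source=2, destination=3, channel_id=4, payload=b"hello")
```

The channel ID in the header is what a receiver would use to separate virtual channels; how priorities are encoded is not stated on the slide and is therefore left out of the sketch.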
04 Reconfigurable Cluster Architecture
Software Concept
The integration of the batch and monitoring system provides an efficient handling of FPGA resources:
• Dynamic allocation of FPGA resources,
• Flexible assignment between CPUs and FPGAs,
• Dynamic FPGA reconfiguration,
• Multi-user concept,
• Accounting,
• Partitioning.
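How a batch system might hand out FPGAs dynamically and keep per-user accounting can be sketched with a toy model. The class `FpgaPool` and its methods are hypothetical and do not describe the actual batch system. Reconfiguration, partitioning and queueing are deliberately left out.

```python
class FpgaPool:
    """Toy model of a batch system allocating FPGAs to jobs (illustrative only)."""

    def __init__(self, fpga_ids):
        self.free = set(fpga_ids)
        self.jobs = {}        # job name -> set of allocated FPGA ids
        self.accounting = {}  # user -> number of FPGAs granted so far

    def allocate(self, job, user, count):
        """Grant `count` free FPGAs to `job`, or return None if too few are free."""
        if count > len(self.free):
            return None  # a real scheduler would queue the job instead
        granted = set(sorted(self.free)[:count])
        self.free -= granted
        self.jobs[job] = granted
        self.accounting[user] = self.accounting.get(user, 0) + count
        return granted

    def release(self, job):
        """Return the FPGAs of a finished job to the free pool."""
        self.free |= self.jobs.pop(job)


pool = FpgaPool(range(4))
first = pool.allocate("job-a", "alice", 3)   # three of four FPGAs
second = pool.allocate("job-b", "bob", 2)    # only one FPGA left, request fails
pool.release("job-a")
third = pool.allocate("job-b", "bob", 2)     # succeeds after the release
```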
04 Reconfigurable Cluster Architecture
Software Concept
The integration of the batch and monitoring system provides an efficient handling of FPGA resources:
[Figure: three partitionings of a four-node cluster (Node 0 to Node 3, each with CPU, RAM and FPGA on the cluster interconnect, the FPGAs sharing accelerator memory). The partitionings are arranged along two axes: size of an autonomous user design and number of concurrent user designs.]
04 Reconfigurable Cluster Architecture
Software Concept – Software Stack
[Figure: software stack, from top to bottom:
- User Application: C/C++ App, SystemC App, HDL App, OpenCL App.
- Management and I/O Node: Resource Management, Job Scheduling, Accounting, Libraries & Tools.
- Platform Libraries: C/C++, VHDL.
- Hybrid Computing Node: Platform API, Runtime and Driver; an executable runs on each processor and a bitfile (or kernel) on each FPGA.]
04 Reconfigurable Cluster Architecture
Programming Opportunities
• The environment provides the libraries necessary to communicate with the basic FPGA design.
• Different opportunities for programmers:
  • C/C++ libraries calling FPGA-kernels, realized with optimized VHDL-designs.
  • Own HDL or SystemC designs.
  • High-Level-Synthesis (HLS) to generate designs.
⇒ With the OpenCL model, applications for heterogeneous platforms can be developed and executed.
04 Reconfigurable Cluster Architecture
OpenCL
• The basic FPGA design can be mapped to the OpenCL model [8].
• Memory transfers and kernel calls can be written in C.
• The kernels can be implemented in HDL or HLS.
⇒ Mapping of the FPGA device to the OpenCL model
[Figure: the FPGA accelerator design (tiles, on-chip memory, interconnect, memory controller, debug & trace, PCI-Express endpoint, Serial-Attached SCSI, Serial-ATA, 10 GBit and 40 GBit Ethernet) shown next to its OpenCL view: a host with host memory, and an FPGA board with DDR3 memory containing FPGA Tile 1 to FPGA Tile 8, each with PEs, registers and on-chip memory.]
04 Reconfigurable Cluster Architecture
OpenCL
Over the inter-FPGA network, a direct network connection between the OpenCL devices can be established:
• Bypassing the communication through the host.
• Evaluating the impact of different sized OpenCL devices.
[Figure: Node 0 and Node 1, each consisting of a host with host memory and an FPGA board with DDR3 memory and FPGA Tile 1 to FPGA Tile 8 (PEs, registers, on-chip memory). The FPGA boards of the two nodes are connected directly.]
05 Conclusions
Summary
• Each computing application has different architecture requirements.
• Combination of the main advantages in one single reconfigurable architecture without typical bottlenecks.
• The presented system allows the simulation and the test of new system architectures on real hardware.
05 Conclusions
Outlook – Computing Cloud
• FPGAs are mainly used as prototyping platforms for ASICs, SoCs, …
• Why not offer a reconfigurable system for tests on real hardware (RaaS)?
[Figure: RC3E, the Reconfigurable Common Computing Cloud Environment. Layers from top to bottom: Network Cloud API; Cloud Infrastructure Management Layer; Gateway and Service Layer; Hybrid Computing Layer with Sub-Cluster A and Sub-Cluster B.]
References
[1] Knodel, Oliver, et al.: "Next-generation massively parallel short-read mapping on FPGAs." Application-Specific Systems, Architectures and Processors (ASAP), 2011 IEEE International Conference on. IEEE, 2011.
[2] Straessner, Arno, et al.: "Entwurf und Implementierung von parametrierbaren Filteralgorithmen für die digitale Ausleseelektronik des Flüssig-Argon-Kalorimeters des ATLAS-Detektors am CERN."
[3] Sadoghi, Mohammad, et al.: "Efficient event processing through reconfigurable hardware for algorithmic trading." Proceedings of the VLDB Endowment 3.1-2 (2010): 1525-1528.
[4] Wawrzynek, John, et al.: "RAMP: Research accelerator for multiple processors." Micro, IEEE 27.2 (2007): 46-57.
[5] Jung, S., et al.: "Hardware implementation of a real-time neural network controller with a DSP and an FPGA for nonlinear systems." Industrial Electronics, IEEE Transactions on, 2007.
[6] Tsoi, K., et al.: "Axel: a heterogeneous cluster with FPGAs and GPUs." in Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays, ACM, 2010.
[7] Showerman, M., et al.: "QP: A heterogeneous multi-accelerator cluster." in Proc. 10th LCI International Conference on High-Performance Clustered Computing, 2009.
[8] Czajkowski, T.S., et al.: "From OpenCL to high-performance hardware on FPGAs." in Proc. Field Programmable Logic and Applications (FPL), 2012.