Department of Computer Science, Institute of Computer Engineering

Integration of an Easy Accessible, Highly Scalable, Energy Efficient Multi-FPGA-Based Hardware Accelerator in Common Cluster Infrastructures HUCAA 2013

Oliver Knodel

Lyon, 1 October 2013

Motivation
•  Reconfigurable hardware (e.g. FPGAs) offers an opportunity to change a system’s behavior and to evaluate new system concepts.
•  To satisfy a wide range of requirements in a computing cluster, a flexible integration of FPGAs is necessary.

⇒ Objective: Combining the needs of a wide range of applications in a single scalable system.

1 October 2013

Integration of FPGAs in Common Cluster Infrastructures

Slide 2 of 23

Agenda
01 Introduction
02 HPC and FPGAs – Current Relation
03 Creating a Flexible Platform
04 Reconfigurable Cluster Architecture
-  System Design
-  Programming Concept
05 Conclusions

01 Introduction
Field-Programmable Gate Array (FPGA)
•  Reconfigurable integrated circuit.
•  Different types of parallelism:
   -  Bit-level: e.g. 1,024-bit arithmetic.
   -  Pipelining: e.g. filter algorithms with several hundred stages.
   -  Program-level: e.g. up to 1,000 concurrent computation cores.
•  Energy-efficient alternative to GPUs.

⇒ Applications requiring simple computation cores and data structures are highly suitable for FPGAs.

01 Introduction
FPGA Architecture (e.g. Xilinx Virtex-7 X690T)

[Figure: FPGA architecture. The user design is built from Boolean computation units (slices), each combining lookup tables (6-ary functions) with D flip-flops (1 bit). Further labels: fast-forward, memory, I/O banks, high-speed I/O, PCI Express endpoint. Clock: typ. 50 - 250 MHz. External interfaces: PCIe 3.0 x8, DDR3 channel, 10G Ethernet.]
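As a rough illustration of the pipelining form of parallelism named in the introduction, the following sketch computes how long a fully pipelined design needs to process a stream of inputs. It assumes an idealized pipeline that accepts one new input per clock cycle and never stalls; the function names and the example numbers (300 stages, one million inputs) are hypothetical, and only the 50 - 250 MHz clock range is taken from the slides.

```python
def pipeline_cycles(stages: int, items: int) -> int:
    """Clock cycles until the last of `items` inputs leaves a `stages`-deep pipeline.

    Assumption: ideal pipeline, one new input per cycle, no stalls.
    The first result appears after `stages` cycles; every further
    result follows one cycle later.
    """
    if stages < 1:
        raise ValueError("a pipeline needs at least one stage")
    if items <= 0:
        return 0
    return stages + items - 1


def pipeline_time_s(stages: int, items: int, clock_hz: float) -> float:
    """Wall-clock time in seconds for the same idealized pipeline."""
    return pipeline_cycles(stages, items) / clock_hz


if __name__ == "__main__":
    # Hypothetical filter with 300 stages processing one million samples.
    for clock_mhz in (50, 250):
        seconds = pipeline_time_s(300, 1_000_000, clock_mhz * 1e6)
        print(f"{clock_mhz} MHz: {seconds * 1e3:.3f} ms")
```

The point of the sketch is that the pipeline depth only adds a fixed start-up latency; for long input streams the throughput is close to one result per clock cycle regardless of the number of stages.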

02 HPC and FPGAs – Current Relation
I.  FPGA Coprocessor:

a)  Bio science:
-  Massively parallel short-read mapping on FPGAs [1].
-  Sequence alignment (Smith & Waterman).

b)  Image processing:
-  3D point triangulation for structure-from-motion.
-  Filter algorithms for the ATLAS detector at CERN [2].

c)  Low latency network processing:
-  High-Frequency-Trading / TCP-Offload-Engines [3].

[Figure: two CPUs connected to an FPGA accelerator via PCIe or Ethernet.]

02 HPC and FPGAs – Current Relation
II.  Pure Multi-FPGA Systems:
-  ASIC prototyping.
-  Multicore and Network-on-Chip [4].
-  Simulation of large neuronal networks [5].

[Figure: an accelerator consisting of four FPGAs, connected to CPUs via PCIe.]

III. Integration of FPGAs in HPC Clusters:
-  Simple nodes with an additional GPU or FPGA coprocessor.
-  Examples are the Axel and the QP systems [6, 7].

[Figure: two nodes, each with two CPUs and an FPGA accelerator attached via PCIe.]

03 Creating a Flexible Platform
The Challenge
Each computing application has different requirements:
•  FPGAs as hardware accelerators tightly coupled with a host processor.
•  Dozens of FPGAs aggregated to a multi-FPGA system without a closely coupled host processor.

03 Creating a Flexible Platform
The Idea
Combining the main advantages in one single reconfigurable architecture:
•  Integration and centralization of distributed FPGAs in a common cluster infrastructure.
•  Separate high-speed interconnects between the FPGA accelerators.
•  Allocation of resources by a batch system.

04 Reconfigurable Cluster Architecture
Hardware – Overview
•  One additional FPGA for each node.
•  The FPGAs are connected by a separate grid or cube topology.
•  The FPGAs form a shared memory.

[Figure: four nodes (Node 1 to Node 4), each with several CPUs and one FPGA attached via PCIe 3.0. The nodes are linked by the cluster interconnect and a SAN; the four FPGAs are linked to each other and together form the shared accelerator memory space.]
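The grid or cube topology of the separate FPGA network can be sketched as follows. This is a minimal illustration, not the system's routing logic: it assumes a plain mesh without wrap-around links (the slides do not say whether the links wrap around), and the names `neighbours` and `hops` are hypothetical.

```python
def neighbours(coord, shape):
    """Direct link partners of the FPGA at `coord` in a mesh of size `shape`.

    `coord` and `shape` are tuples of equal length: two entries for a
    grid, three for a cube. Assumption: no wrap-around links.
    """
    result = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            pos = coord[axis] + step
            if 0 <= pos < size:
                result.append(coord[:axis] + (pos,) + coord[axis + 1:])
    return result


def hops(a, b):
    """Minimum number of links between two FPGAs in such a mesh."""
    return sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    print(neighbours((0, 0), (2, 2)))            # corner of a 2 x 2 grid
    print(len(neighbours((1, 1, 1), (3, 3, 3))))  # centre of a 3 x 3 x 3 cube
    print(hops((0, 0, 0), (2, 2, 2)))
```

Under these assumptions a cube gives every inner FPGA six direct partners instead of four, which shortens the longest path for the same number of devices.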

04 Reconfigurable Cluster Architecture
Hardware – FPGA Accelerator
The basic design sufficient for most applications consists of:
•  User-reconfigurable tiles,
•  Simple PCIe interface to the host (DMA and GPIO),
•  Lightweight protocol for inter-FPGA communication,
•  Interface for on- and off-chip memories,
•  …

[Figure: FPGA accelerator block diagram. Eight tiles and the on-chip memories are attached to an interconnect, together with debug & trace logic and a memory controller. External interfaces: PCI Express endpoint to the host system, 10 GBit Ethernet, 40 GBit Ethernet, Serial-ATA and Serial-Attached SCSI.]

04 Reconfigurable Cluster Architecture
Hardware – Inter-FPGA Communication
•  Serial high-speed links connect the FPGAs.
•  The underlying network is based on a protocol featuring virtual channels and priorities:

Byte    Bits 63-48   Bits 47-32   Bits 31-16    Bits 15-0
0       TypeID       Source       Destination   ChannelID     (Header)
8       Additional header data
16      Payload
24      … (up to 64 KiB)
...
n-8     Optional | Checksum (CRC-32)                          (Footer)

•  MPI and shared memory operations are possible.
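The packet layout above can be sketched in software as follows. This is a minimal sketch under several assumptions that the slides do not confirm: big-endian byte order, four 16-bit header fields, a footer split into a 32-bit optional word followed by the 32-bit checksum, the CRC-32 variant implemented by Python's `zlib.crc32`, a checksum computed over header and payload, and zero padding of the payload to a multiple of 8 bytes. The function names are hypothetical.

```python
import struct
import zlib

MAX_PAYLOAD = 64 * 1024  # "up to 64 KiB" from the slide


def build_packet(type_id, source, destination, channel_id, payload, extra_header=0):
    """Assemble header, payload and footer into one byte string."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload exceeds 64 KiB")
    # Assumption: the payload is zero-padded to whole 64-bit words.
    padded = payload + b"\x00" * (-len(payload) % 8)
    # Byte 0: TypeID, Source, Destination, ChannelID (16 bit each, assumed).
    # Byte 8: additional header data (one 64-bit word).
    header = struct.pack(">HHHHQ", type_id, source, destination, channel_id, extra_header)
    body = header + padded
    # Assumption: the CRC-32 covers header and payload.
    checksum = zlib.crc32(body) & 0xFFFFFFFF
    footer = struct.pack(">II", 0, checksum)  # optional word left at zero
    return body + footer


def parse_packet(packet):
    """Split a packet into its fields and verify the checksum."""
    if len(packet) < 24 or len(packet) % 8:
        raise ValueError("malformed packet length")
    body = packet[:-8]
    type_id, source, destination, channel_id, extra = struct.unpack_from(">HHHHQ", packet, 0)
    optional, checksum = struct.unpack_from(">II", packet, len(packet) - 8)
    if zlib.crc32(body) & 0xFFFFFFFF != checksum:
        raise ValueError("checksum mismatch")
    return {
        "type_id": type_id,
        "source": source,
        "destination": destination,
        "channel_id": channel_id,
        "extra_header": extra,
        "payload": body[16:],  # still includes any zero padding
        "optional": optional,
    }


if __name__ == "__main__":
    packet = build_packet(1, 2, 3, 4, b"hello")
    print(len(packet), parse_packet(packet)["payload"])
```

The layout shown on the slide carries no explicit payload length, so the parser in this sketch returns the padded payload; a real design would need to convey the length elsewhere, for example in the additional header data.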

04 Reconfigurable Cluster Architecture
Software Concept
The integration of the batch and monitoring system provides efficient handling of FPGA resources:
•  Dynamic allocation of FPGA resources,
•  Flexible assignment between CPUs and FPGAs,
•  Dynamic FPGA reconfiguration,
•  Multi-user concept,
•  Accounting,
•  Partitioning.
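How a batch system might hand out FPGAs to jobs can be sketched with a toy allocator. This is an illustration only and not the resource manager used in the presented system: the class `FpgaPool`, its methods and the first-fit policy are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FpgaPool:
    """Toy bookkeeping of which FPGA belongs to which job (hypothetical)."""
    total: int
    owner: dict = field(default_factory=dict)  # FPGA index -> job id

    def free(self):
        """Indices of all FPGAs that are currently unassigned."""
        return [i for i in range(self.total) if i not in self.owner]

    def allocate(self, job_id, count):
        """Assign `count` free FPGAs to `job_id`; return None if too few are free."""
        candidates = self.free()
        if count > len(candidates):
            return None  # a real batch system would queue the job instead
        chosen = candidates[:count]
        for index in chosen:
            self.owner[index] = job_id
        return chosen

    def release(self, job_id):
        """Return all FPGAs of a finished job to the pool."""
        for index in [i for i, job in self.owner.items() if job == job_id]:
            del self.owner[index]


if __name__ == "__main__":
    pool = FpgaPool(total=4)
    print(pool.allocate("job-a", 2))  # one partition with two FPGAs
    print(pool.allocate("job-b", 3))  # does not fit next to job-a
    pool.release("job-a")
    print(pool.allocate("job-b", 3))
```

The same pool can thus serve one large design spanning all FPGAs or several smaller designs side by side, which is the partitioning trade-off between design size and number of concurrent users.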

04 Reconfigurable Cluster Architecture
Software Concept

[Figure: three partitionings of the same four-node cluster (Node 0 to Node 3, each with CPU, RAM and FPGA, linked by the cluster interconnect). The FPGAs and their RAM are grouped into shared accelerator memories. Axes: size of an autonomous user design, number of concurrent user designs.]

04 Reconfigurable Cluster Architecture
Software Concept – Software Stack

[Figure: software stack.
-  User application: C/C++ app, SystemC app, HDL app, OpenCL app.
-  Management and I/O node: resource management, job scheduling, accounting, libraries & tools.
-  Platform libraries: C/C++, VHDL.
-  Hybrid computing node: platform API, runtime and driver; executable and bitfile (or kernel) running on processor and FPGA.]

04 Reconfigurable Cluster Architecture
Programming Opportunities
•  The environment provides the libraries necessary to communicate with the basic FPGA design.
•  Different opportunities for programmers:
   -  C/C++ libraries calling FPGA kernels, realized with optimized VHDL designs.
   -  Own HDL or SystemC designs.
   -  High-Level Synthesis (HLS) to generate designs.

⇒ With the OpenCL model, applications for heterogeneous platforms can be developed and executed.

04 Reconfigurable Cluster Architecture
OpenCL
•  The basic FPGA design can be mapped to the OpenCL model [8].
•  Memory transfers and kernel calls can be written in C.
•  The kernels can be implemented in HDL or HLS.

⇒ Mapping of the FPGA device to the OpenCL model

[Figure: the FPGA accelerator block diagram (tiles, on-chip memory, interconnect, memory controller, external interfaces) next to its OpenCL view: the host with host memory; the FPGA board with its DDR3 memory; FPGA Tile 1 to FPGA Tile 8, each with on-chip memory and several processing elements (PE) with registers.]
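One plausible reading of this mapping, written down as a lookup table, is sketched below. The OpenCL terms on the right are the standard names from the OpenCL platform and memory model; the pairing with the blocks in the figure is an assumption inferred from the figure labels, and `pes_per_tile` is a hypothetical parameter because the slides do not state how many processing elements a tile contains.

```python
# Assumed correspondence between blocks in the figure and OpenCL terms.
FPGA_TO_OPENCL = {
    "host": "host",
    "host memory": "host memory",
    "FPGA board": "compute device",
    "FPGA tile": "compute unit",
    "PE": "processing element",
    "DDR3 memory on the FPGA board": "global memory",
    "on-chip memory": "local memory",
    "registers": "private memory",
}


def device_summary(tiles=8, pes_per_tile=3):
    """Count the OpenCL entities that one FPGA board would expose.

    `tiles=8` follows the figure (FPGA Tile 1 to FPGA Tile 8);
    `pes_per_tile` is a hypothetical value chosen for illustration.
    """
    return {
        "compute_devices": 1,
        "compute_units": tiles,
        "processing_elements": tiles * pes_per_tile,
    }


if __name__ == "__main__":
    for block, term in FPGA_TO_OPENCL.items():
        print(f"{block:32s} -> {term}")
    print(device_summary())
```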

04 Reconfigurable Cluster Architecture
OpenCL
Over the inter-FPGA network, a direct connection between the OpenCL devices can be established:
•  Bypassing the communication through the host.
•  Evaluating the impact of differently sized OpenCL devices.

[Figure: OpenCL views of Node 0 and Node 1, each with a host, host memory and an FPGA board (DDR3 memory, FPGA Tile 1 to FPGA Tile 8 with on-chip memory, processing elements and registers). The FPGA boards of the two nodes are connected directly.]

05 Conclusions
Summary
•  Each computing application has different architecture requirements.
•  Combination of the main advantages in one single reconfigurable architecture without typical bottlenecks.
•  The presented system allows the simulation and testing of new system architectures on real hardware.

05 Conclusions
Outlook – Computing Cloud
•  FPGAs are mainly used as prototyping platforms for ASICs, SoCs, …
•  Why not offer a reconfigurable system for tests on real hardware (RaaS)?

[Figure: Reconfigurable Common Computing Cloud Environment (RC3E). Layers: network cloud API; cloud infrastructure management layer; gateway and service layer; hybrid computing layer with Sub-Cluster A and Sub-Cluster B.]

References
[1]  Knodel, Oliver, et al.: "Next-generation massively parallel short-read mapping on FPGAs." Application-Specific Systems, Architectures and Processors (ASAP), 2011 IEEE International Conference on. IEEE, 2011.
[2]  Straessner, Arno, et al.: "Entwurf und Implementierung von parametrierbaren Filteralgorithmen für die digitale Ausleseelektronik des Flüssig-Argon-Kalorimeters des ATLAS-Detektors am CERN."
[3]  Sadoghi, Mohammad, et al.: "Efficient event processing through reconfigurable hardware for algorithmic trading." Proceedings of the VLDB Endowment 3.1-2 (2010): 1525-1528.
[4]  Wawrzynek, John, et al.: "RAMP: Research accelerator for multiple processors." Micro, IEEE 27.2 (2007): 46-57.
[5]  Jung, S., et al.: "Hardware implementation of a real-time neural network controller with a DSP and an FPGA for nonlinear systems." Industrial Electronics, IEEE Transactions on, 2007.
[6]  Tsoi, K., et al.: "Axel: a heterogeneous cluster with FPGAs and GPUs." In Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays, ACM, 2010.
[7]  Showerman, M., et al.: "QP: A heterogeneous multi-accelerator cluster." In Proc. 10th LCI International Conference on High-Performance Clustered Computing, 2009.
[8]  Czajkowski, T.S., et al.: "From OpenCL to high-performance hardware on FPGAs." In Proc. Field Programmable Logic and Applications (FPL), 2012.
