A Programmable Parallel Accelerator for Learning and Classification
Hari Cadambi, Abhinandan Majumdar, Michela Becchi, Srimat Chakradhar and Hans Peter Graf
NEC Laboratories America, Princeton, New Jersey, USA.

www.nec-labs.com

Our Target Application Domain: Examples

Intelligent processing of large amounts of data. Learning and classification are used increasingly.
• SEMANTIC SEARCH
• FACE RECOGNITION IN A CROWD

Motivation for Accelerator
• Wider range of apps: digital pathology, cognitive databases…
• Increasing data → increasing computational load
• Stringent performance constraints
  – Semantic Search → search millions of documents in a few milliseconds
  – Crowd face tracking → analyze VGA+ moving images in real-time

• These trends justify investigating specialized architectural support for these workloads


This Paper
• Proposed MAPLE, an accelerator with new architectural features for learning / classification
• Proposed a programming model and mapping strategy
• What MAPLE is not:
  – A general-purpose engine
  – A replacement for GPUs or any other processor

• What this paper tries to do:
  – Suggest and evaluate new architectural features for these workloads, and how they could be programmed
  – Can an existing processor (CPU or GPU) use these features? Maybe…


How We Went About Designing MAPLE
Profiled representative workloads → identified computational bottlenecks → identified the structure and a common set of primitives → architected MAPLE → simulator and architectural exploration → prototype

Workload Analysis
Applications: Semantic Search; Image Segmentation and Recognition; Digital Pathology

Algorithm | Computational Bottleneck
SSI       | Dot-products and array ranking: 99%
CNN       | 1D, 2D, 3D convolutions: 99%
K-means   | Minimum Euclidean distance: 96%
SVM       | Large matrix-vector multiplication: 85-95%
GLVQ      | Minimum Euclidean distance: 99%

But is there a common structure?
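To make one of these kernels concrete, here is a minimal NumPy sketch (my illustration, not code from the paper) of the minimum-Euclidean-distance bottleneck shared by K-means and GLVQ: for every input vector, compute its distance to each reference vector and keep the closest. Even this kernel already has the dense-linear-algebra-followed-by-reduction shape discussed on the next slide.

```python
import numpy as np

def nearest_reference(points, refs):
    """For each point, return the index of the closest reference vector.

    points: (N, D) input vectors
    refs:   (C, D) reference vectors (cluster centers / prototypes)
    """
    # Pairwise squared Euclidean distances, shape (N, C):
    # ||p - r||^2 = ||p||^2 - 2 p.r + ||r||^2
    d2 = (
        (points ** 2).sum(axis=1, keepdims=True)
        - 2.0 * points @ refs.T
        + (refs ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)   # reduction over the dense result

# Hypothetical sizes for illustration: 10,000 points, 64 dims, 16 references
pts = np.random.rand(10_000, 64).astype(np.float32)
ctr = np.random.rand(16, 64).astype(np.float32)
labels = nearest_reference(pts, ctr)
```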


Common Structure: Example
• Semantic Search: given N documents and Q concurrent queries, find the top K matching documents
• Computational bottleneck:
  [Figure: documents DOC 1 … DOC N are multiplied against QUERY 1 … QUERY Q (MATMUL), producing a large interim result (1-2 GB for 2-4M documents), which an ARRAY RANK step reduces to OUT 1 … OUT Q]
• Common structure: dense LA → large intermediate result → second operation (array rank, min/max, sum)
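A minimal sketch of this structure for semantic search (NumPy with assumed shapes, not the accelerator code): score every document against every query with a matrix multiply, then rank each query's score column to keep the top K. The full N×Q score matrix is materialized here, which is exactly the large interim result highlighted above.

```python
import numpy as np

def search_naive(docs, queries, k):
    """docs: (N, D) document vectors; queries: (D, Q) query vectors.

    Returns (k, Q) indices of the top-k scoring documents per query.
    """
    scores = docs @ queries                           # first op: dense matmul, interim result (N, Q)
    # second op: array rank -- top-k rows per column
    top = np.argpartition(-scores, k, axis=0)[:k]     # unordered top-k per query
    order = np.take_along_axis(scores, top, axis=0).argsort(axis=0)[::-1]
    return np.take_along_axis(top, order, axis=0)     # sorted, best first

docs = np.random.rand(100_000, 128).astype(np.float32)     # hypothetical sizes
queries = np.random.rand(128, 32).astype(np.float32)
best = search_naive(docs, queries, k=10)                    # (10, 32)
```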

MAPLE Architecture
[Figure: the common structure — dense LA on operands A and B producing a large interim result, followed by a second op (array rank, find min/max, sum) — mapped onto MAPLE: operand A is streamed from off-chip memory, operand B is distributed into on-chip memory, a 2D PE array performs the dense LA, and a distributed smart memory performs the second op, streaming the reduced result to off-chip memory]
Distributed smart memory:
1. In-memory processing of the second op
2. Operates in parallel with the PE array
3. Reduces off-chip writes
The first and second ops are chained.
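To make the chaining idea concrete, here is a hedged software analogue (plain Python/NumPy, not MAPLE's actual dataflow): operand A is streamed in blocks, and each block's partial result is immediately consumed by the second op (a running top-k per query), so the large interim result is never held or written back in full.

```python
import numpy as np

def search_chained(docs, queries, k, block=4096):
    """Stream docs in blocks; keep only a running top-k per query column.

    Software analogue of chaining the dense op with the smart-memory op:
    the (N, Q) interim result exists only one block at a time.
    """
    Q = queries.shape[1]
    best_scores = np.full((k, Q), -np.inf, dtype=np.float32)
    best_ids = np.zeros((k, Q), dtype=np.int64)
    for start in range(0, docs.shape[0], block):
        chunk = docs[start:start + block]
        scores = chunk @ queries                      # first op on one streamed block
        ids = np.arange(start, start + chunk.shape[0])[:, None].repeat(Q, axis=1)
        # second op: merge the block's scores into the running top-k
        all_scores = np.vstack([best_scores, scores])
        all_ids = np.vstack([best_ids, ids])
        top = np.argpartition(-all_scores, k, axis=0)[:k]
        best_scores = np.take_along_axis(all_scores, top, axis=0)
        best_ids = np.take_along_axis(all_ids, top, axis=0)
    return best_ids, best_scores
```

Chaining in software only saves memory traffic; on MAPLE the ranking additionally runs in the smart memory in parallel with the next block's dense op.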


PE Array and Smart Memory
• Simple vector processing elements; all PEs operate in SIMD mode
[Figure: a 2D array of PE/MEM pairs; operand A is streamed from off-chip memory across the array; each row of PEs feeds a SMART MEMORY block, and the smart memories stream the reduced result back to off-chip memory]

An Example: Semantic Search
Setup (same in every step): 4 documents, 4 concurrent queries, need the best 2 matches per query. The document vectors sit in MAPLE's off-chip memory; each PE holds one query column in its local MEM; each chain ends in a smart memory that keeps a running Top-2.

Steps (one slide per step in the original deck):
1. BROADCAST DOC 1: the first document vector is streamed from off-chip memory and broadcast to the PEs.
2. GENERATE ROW 1: each PE computes the dot-product of DOC 1 with its query, producing row 1 of the score matrix.
3. SEND ROW 1 TO SMART MEM, BROADCAST DOC 2: row 1 is pushed into the smart memories while the next document is broadcast.
4. GENERATE ROW 2: the PEs compute the scores of DOC 2 against the queries.
5. BROADCAST DOC 3, SMART MEM STARTS RANKING: ranking of the rows received so far overlaps with the remaining dot-products.
6. After the last document, each smart memory holds the Top-2 matching documents for its queries.
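A toy simulation of this walkthrough (hypothetical data, since the exact values in the figures are not recoverable; Python, same 4-document, 4-query, Top-2 setup):

```python
import numpy as np

np.random.seed(0)
docs = np.random.randint(1, 10, size=(4, 2))     # 4 documents, 2-element vectors (hypothetical)
queries = np.random.randint(1, 10, size=(2, 4))  # 4 query columns (hypothetical)
K = 2

# Each "PE" owns one query column; each "smart memory" keeps a running Top-K.
top_scores = np.full((K, 4), -np.inf)
top_docs = np.zeros((K, 4), dtype=int)

for d, doc in enumerate(docs):                   # stream/broadcast one document at a time
    row = doc @ queries                          # PEs: one score per query (one matrix row)
    for q in range(4):                           # smart memories: insert into running Top-K
        if row[q] > top_scores[:, q].min():
            worst = top_scores[:, q].argmin()
            top_scores[worst, q] = row[q]
            top_docs[worst, q] = d

print(top_docs)      # best-2 document indices per query (unordered)
print(top_scores)
```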

Alternate Mappings: More Resources
[Figure: the same 4-document, 4-query, Top-2 example mapped onto twice as many PE chains]
Performance doubled: each chain produces 2 output columns.

Alternate Mappings: Fewer Resources
• A query column does not fit in one PE's memory
[Figure: the same example with each query column split across two PEs (QU 1, QU 2), so partial results must be combined]
Specifying all this splitting, etc. could be a nightmare!

Typical Conversation with an ML Domain Expert
• Us: “Do you want to program our accelerator?”
• Them: “Why would we do that?”
• Us: “Performance.”
• Them: “We don’t even like CUDA (or even C for that matter). Why should we program your accelerator?”
• Us: “Um, because we’re colleagues…?”
Programmers don’t easily accept new accelerators!


Specifying Semantic Search

First Op = matmul
Streaming matrix = A
Streaming matrix rows = D    // number of documents
Streaming matrix cols = …
On-chip matrix = B
On-chip matrix rows = …
On-chip matrix cols = Q      // number of concurrent queries
Second Op = arrayRank(K)

Flow: specification → MAPPING (given the problem size and architecture parameters, automatically does the data mapping) → ASSEMBLY GENERATION → MAPLE assembly.
The compiler can also explore the design space.
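A rough sketch of what the mapping step might look like in software (hypothetical Python; the parameter names and cost model are mine, not the real compiler's): given the problem size and the architecture parameters, decide how many query columns fit in each PE's memory, whether a column must be split across PEs, and how many passes over the streamed matrix are needed.

```python
import math

def map_problem(num_docs, num_queries, doc_dim, k,
                num_chains=64, pes_per_chain=8, pe_mem_words=2048):
    """Toy data-mapping step (hypothetical parameters, not MAPLE's compiler)."""
    total_pes = num_chains * pes_per_chain
    # How many query columns fit in one PE's local memory?
    cols_per_pe = max(1, pe_mem_words // doc_dim)
    # If a single column does not fit, split it across PEs (the
    # "fewer resources" mapping) and combine partial sums downstream.
    col_split = 1 if doc_dim <= pe_mem_words else math.ceil(doc_dim / pe_mem_words)
    pes_needed = math.ceil(num_queries / cols_per_pe) * col_split
    passes = math.ceil(pes_needed / total_pes)   # extra passes over the streamed documents
    return {
        "cols_per_pe": cols_per_pe,
        "column_split_factor": col_split,
        "passes_over_documents": passes,
        "smart_mem_op": f"arrayRank({k})",
    }

print(map_problem(num_docs=2_000_000, num_queries=32, doc_dim=1024, k=128))
```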

Architectural Design Choices
[Figure: three example configurations of PE/MEM chains fed by off-chip memory (stream A) and terminated by smart memories]
• 2 chains, 2 PEs per chain
• 1 chain, 4 PEs per chain
• 2 off-chip memory banks, 2 chains per core, 2 PEs per chain


Prototype for Experiments
• MAPLE on a Virtex-5 FPGA: 512 PEs at 125 MHz, 64 chains, 2 memory banks
• Host: 2.5 GHz quad-core Xeon
• GPU for comparison: Tesla, 1.3 GHz, 128 cores
• API for transferring assembly, matrix data, …

Results for Semantic Search
[Plot: milliseconds per query vs. number of documents (256K, 512K, 1M) for the MAPLE prototype (125 MHz) vs. a 2.5 GHz quad-core Xeon with 4 threads; 32 concurrent queries, ranking top 32]
• For 2M documents, 32 concurrent queries, top K = 128:
  – 2.5 GHz Xeon 4-core: 52 ms/query
  – C870 Tesla GPU: 11.4 ms/query
  – MAPLE prototype: 3.76 ms/query
• Why is MAPLE faster?
  – PE-smart memory chaining: performs both first and second op in parallel
  – In-memory processing: fewer off-chip accesses


Results for Conv. Neural Networks
[Plot: milliseconds per frame for a CNN for face recognition, 2.5 GHz quad-core Xeon (4 threads) vs. MAPLE prototype (125 MHz)]
• 2.5 GHz Xeon 4-core: 6 fps
• C870 Tesla GPU: 9.5 fps
• MAPLE prototype: 13 fps

Results for SVM Training
• Compared to the optimized GPU implementation from UCB, the MAPLE prototype is 2-6x slower
• Why?
  – SVM training has a large matrix-vector multiplication, which is memory-bandwidth limited (about 1 compute op per fetch)
  – MAPLE cannot match the GPU’s memory bandwidth (GDDR5, etc.)
  – But MAPLE is only an FPGA prototype, while the GPU is a custom processor
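A back-of-the-envelope check of why matrix-vector multiplication is bandwidth bound (my arithmetic, not from the slides): an M×N single-precision matvec does 2MN flops but must fetch MN matrix elements (4MN bytes), so attainable performance is roughly bandwidth / 2 flops per byte, regardless of compute resources.

```python
def matvec_bandwidth_bound(rows, cols, bandwidth_gb_s):
    """Roofline-style estimate for a single-precision matrix-vector multiply.

    Matrix traffic dominates; 2 flops per matrix element fetched.
    """
    bytes_moved = rows * cols * 4          # matrix traffic, ignoring the small vectors
    flops = 2 * rows * cols                # one multiply + one add per element
    seconds = bytes_moved / (bandwidth_gb_s * 1e9)
    return flops / seconds / 1e9           # attainable GFLOP/s

# Hypothetical bandwidth numbers, for illustration only:
print(matvec_bandwidth_bound(100_000, 1_000, bandwidth_gb_s=10))   # ~5 GFLOP/s
print(matvec_bandwidth_bound(100_000, 1_000, bandwidth_gb_s=100))  # ~50 GFLOP/s
```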


Summary and Conclusions
• Looked into new architectural features for learning and classification
  – Systematically analyzed representative workloads
  – Identified bottlenecks and a common structure
• Prototyped the accelerator system: showed promising speedups
• Future work
  – Use of such accelerators in low-power systems (e.g., with Atom as the host processor)
  – Embedded learning and classification

Thank You!


Questions (reviewer comments)
• Memory model
• GPU with 128 PEs, MAPLE has 512 – fair comparison?
• Specification holes?
• What about other apps besides SSI? K-means? CNN? Other?
• Could I not rewrite my application (on a GPU) to avoid interim storage?
• How much of the performance win comes from computation, and how much from reducing memory accesses? What fraction of peak is achieved in each case?
• How were the CPU and GPU optimized?



