A Programmable Parallel Accelerator for Learning and Classification
Hari Cadambi, Abhinandan Majumdar, Michela Becchi, Srimat Chakradhar and Hans Peter Graf
NEC Laboratories America, Princeton, New Jersey, USA.

www.nec-labs.com

Our Target Application Domain: Examples

Intelligent processing of large amounts of data. Learning and classification are used increasingly.
• SEMANTIC SEARCH
• FACE RECOGNITION IN A CROWD

Motivation for Accelerator
• Wider range of apps: digital pathology, cognitive databases…
• Increasing data → increasing computational load
• Stringent performance constraints
  – Semantic Search → search millions of documents in a few milliseconds
  – Crowd face tracking → analyze VGA+ moving images in real-time

• These trends justify investigating specialized architectural support for these workloads


This Paper
• Proposed MAPLE, an accelerator with new architectural features for learning / classification
• Proposed a programming model and mapping strategy
• What MAPLE is not:
  – A general-purpose engine
  – A replacement for GPUs or any other processor

• What this paper tries to do:
  – Suggest and evaluate new architectural features for these workloads, and how they could be programmed
  – Can an existing processor (CPU or GPU) use these features? Maybe…


How We Went About Designing MAPLE
Profiled representative workloads → identified computational bottlenecks → identified the structure and a common set of primitives → architected MAPLE → simulator and architectural exploration → prototype

Workload Analysis
Applications: Semantic Search; Image Segmentation and Recognition; Digital Pathology

Algorithm | Computational Bottleneck
SSI       | Dot-products and array ranking: 99%
CNN       | 1D, 2D, 3D convolutions: 99%
K-means   | Minimum Euclidean distance: 96%
SVM       | Large matrix-vector multiplication: 85-95%
GLVQ      | Minimum Euclidean distance: 99%

But is there a common structure?
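To make one of these kernels concrete, here is a minimal NumPy sketch (my illustration, not code from the paper) of the minimum-Euclidean-distance bottleneck shared by K-means and GLVQ: for every input vector, compute its distance to each reference vector and keep the closest. Even this kernel already has the dense-linear-algebra-followed-by-reduction shape discussed on the next slide.

```python
import numpy as np

def nearest_reference(points, refs):
    """For each point, return the index of the closest reference vector.

    points: (N, D) input vectors
    refs:   (C, D) reference vectors (cluster centers / prototypes)
    """
    # Pairwise squared Euclidean distances, shape (N, C):
    # ||p - r||^2 = ||p||^2 - 2 p.r + ||r||^2
    d2 = (
        (points ** 2).sum(axis=1, keepdims=True)
        - 2.0 * points @ refs.T
        + (refs ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)   # reduction over the dense result

# Hypothetical sizes for illustration: 10,000 points, 64 dims, 16 references
pts = np.random.rand(10_000, 64).astype(np.float32)
ctr = np.random.rand(16, 64).astype(np.float32)
labels = nearest_reference(pts, ctr)
```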


Common Structure: Example
• Semantic Search: given N documents and Q concurrent queries, find the top K matching documents
• Computational bottleneck:
  [Figure: documents DOC 1 … DOC N are multiplied against QUERY 1 … QUERY Q (MATMUL), producing a large interim result (1-2 GB for 2-4M documents), which an ARRAY RANK step reduces to OUT 1 … OUT Q]
• Common structure: dense LA → large intermediate result → second operation (array rank, min/max, sum)
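A minimal sketch of this structure for semantic search (NumPy with assumed shapes, not the accelerator code): score every document against every query with a matrix multiply, then rank each query's score column to keep the top K. The full N×Q score matrix is materialized here, which is exactly the large interim result highlighted above.

```python
import numpy as np

def search_naive(docs, queries, k):
    """docs: (N, D) document vectors; queries: (D, Q) query vectors.

    Returns (k, Q) indices of the top-k scoring documents per query.
    """
    scores = docs @ queries                           # first op: dense matmul, interim result (N, Q)
    # second op: array rank -- top-k rows per column
    top = np.argpartition(-scores, k, axis=0)[:k]     # unordered top-k per query
    order = np.take_along_axis(scores, top, axis=0).argsort(axis=0)[::-1]
    return np.take_along_axis(top, order, axis=0)     # sorted, best first

docs = np.random.rand(100_000, 128).astype(np.float32)     # hypothetical sizes
queries = np.random.rand(128, 32).astype(np.float32)
best = search_naive(docs, queries, k=10)                    # (10, 32)
```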

MAPLE Architecture
[Figure: the common structure — dense LA on operands A and B producing a large interim result, followed by a second op (array rank, find min/max, sum) — mapped onto MAPLE: operand A is streamed from off-chip memory, operand B is distributed into on-chip memory, a 2D PE array performs the dense LA, and a distributed smart memory performs the second op, streaming the reduced result to off-chip memory]
Distributed smart memory:
1. In-memory processing of the second op
2. Operates in parallel with the PE array
3. Reduces off-chip writes
The first and second ops are chained.
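To make the chaining idea concrete, here is a hedged software analogue (plain Python/NumPy, not MAPLE's actual dataflow): operand A is streamed in blocks, and each block's partial result is immediately consumed by the second op (a running top-k per query), so the large interim result is never held or written back in full.

```python
import numpy as np

def search_chained(docs, queries, k, block=4096):
    """Stream docs in blocks; keep only a running top-k per query column.

    Software analogue of chaining the dense op with the smart-memory op:
    the (N, Q) interim result exists only one block at a time.
    """
    Q = queries.shape[1]
    best_scores = np.full((k, Q), -np.inf, dtype=np.float32)
    best_ids = np.zeros((k, Q), dtype=np.int64)
    for start in range(0, docs.shape[0], block):
        chunk = docs[start:start + block]
        scores = chunk @ queries                      # first op on one streamed block
        ids = np.arange(start, start + chunk.shape[0])[:, None].repeat(Q, axis=1)
        # second op: merge the block's scores into the running top-k
        all_scores = np.vstack([best_scores, scores])
        all_ids = np.vstack([best_ids, ids])
        top = np.argpartition(-all_scores, k, axis=0)[:k]
        best_scores = np.take_along_axis(all_scores, top, axis=0)
        best_ids = np.take_along_axis(all_ids, top, axis=0)
    return best_ids, best_scores
```

Chaining in software only saves memory traffic; on MAPLE the ranking additionally runs in the smart memory in parallel with the next block's dense op.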


PE Array and Smart Memory
• Simple vector processing elements; all PEs operate in SIMD mode
[Figure: a 2D array of PE/MEM pairs; operand A is streamed from off-chip memory across the array; each row of PEs feeds a SMART MEMORY block, and the smart memories stream the reduced result back to off-chip memory]

An Example: Semantic Search
Setup (same in every step): 4 documents, 4 concurrent queries, need the best 2 matches per query. The document vectors sit in MAPLE's off-chip memory; each PE holds one query column in its local MEM; each chain ends in a smart memory that keeps a running Top-2.

Steps (one slide per step in the original deck):
1. BROADCAST DOC 1: the first document vector is streamed from off-chip memory and broadcast to the PEs.
2. GENERATE ROW 1: each PE computes the dot-product of DOC 1 with its query, producing row 1 of the score matrix.
3. SEND ROW 1 TO SMART MEM, BROADCAST DOC 2: row 1 is pushed into the smart memories while the next document is broadcast.
4. GENERATE ROW 2: the PEs compute the scores of DOC 2 against the queries.
5. BROADCAST DOC 3, SMART MEM STARTS RANKING: ranking of the rows received so far overlaps with the remaining dot-products.
6. After the last document, each smart memory holds the Top-2 matching documents for its queries.
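A toy simulation of this walkthrough (hypothetical data, since the exact values in the figures are not recoverable; Python, same 4-document, 4-query, Top-2 setup):

```python
import numpy as np

np.random.seed(0)
docs = np.random.randint(1, 10, size=(4, 2))     # 4 documents, 2-element vectors (hypothetical)
queries = np.random.randint(1, 10, size=(2, 4))  # 4 query columns (hypothetical)
K = 2

# Each "PE" owns one query column; each "smart memory" keeps a running Top-K.
top_scores = np.full((K, 4), -np.inf)
top_docs = np.zeros((K, 4), dtype=int)

for d, doc in enumerate(docs):                   # stream/broadcast one document at a time
    row = doc @ queries                          # PEs: one score per query (one matrix row)
    for q in range(4):                           # smart memories: insert into running Top-K
        if row[q] > top_scores[:, q].min():
            worst = top_scores[:, q].argmin()
            top_scores[worst, q] = row[q]
            top_docs[worst, q] = d

print(top_docs)      # best-2 document indices per query (unordered)
print(top_scores)
```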

Alternate Mappings: More Resources
[Figure: the same 4-document, 4-query, Top-2 example mapped onto twice as many PE chains]
Performance doubled: each chain produces 2 output columns.

Alternate Mappings: Fewer Resources
• A query column does not fit in one PE's memory
[Figure: the same example with each query column split across two PEs (QU 1, QU 2), so partial results must be combined]
Specifying all this splitting, etc. could be a nightmare!

Typical Conversation with an ML Domain Expert
• Us: “Do you want to program our accelerator?”
• Them: “Why would we do that?”
• Us: “Performance.”
• Them: “We don’t even like CUDA (or even C for that matter). Why should we program your accelerator?”
• Us: “Um, because we’re colleagues…?”
Programmers don’t easily accept new accelerators!


Specifying Semantic Search

First Op = matmul
Streaming matrix = A
Streaming matrix rows = D    // number of documents
Streaming matrix cols = …
On-chip matrix = B
On-chip matrix rows = …
On-chip matrix cols = Q      // number of concurrent queries
Second Op = arrayRank(K)

Flow: specification → MAPPING (given the problem size and architecture parameters, automatically does the data mapping) → ASSEMBLY GENERATION → MAPLE assembly.
The compiler can also explore the design space.
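A rough sketch of what the mapping step might look like in software (hypothetical Python; the parameter names and cost model are mine, not the real compiler's): given the problem size and the architecture parameters, decide how many query columns fit in each PE's memory, whether a column must be split across PEs, and how many passes over the streamed matrix are needed.

```python
import math

def map_problem(num_docs, num_queries, doc_dim, k,
                num_chains=64, pes_per_chain=8, pe_mem_words=2048):
    """Toy data-mapping step (hypothetical parameters, not MAPLE's compiler)."""
    total_pes = num_chains * pes_per_chain
    # How many query columns fit in one PE's local memory?
    cols_per_pe = max(1, pe_mem_words // doc_dim)
    # If a single column does not fit, split it across PEs (the
    # "fewer resources" mapping) and combine partial sums downstream.
    col_split = 1 if doc_dim <= pe_mem_words else math.ceil(doc_dim / pe_mem_words)
    pes_needed = math.ceil(num_queries / cols_per_pe) * col_split
    passes = math.ceil(pes_needed / total_pes)   # extra passes over the streamed documents
    return {
        "cols_per_pe": cols_per_pe,
        "column_split_factor": col_split,
        "passes_over_documents": passes,
        "smart_mem_op": f"arrayRank({k})",
    }

print(map_problem(num_docs=2_000_000, num_queries=32, doc_dim=1024, k=128))
```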

Architectural Design Choices
[Figure: three example configurations of PE/MEM chains fed by off-chip memory (stream A) and terminated by smart memories]
• 2 chains, 2 PEs per chain
• 1 chain, 4 PEs per chain
• 2 off-chip memory banks, 2 chains per core, 2 PEs per chain


Prototype for Experiments
• MAPLE on a Virtex-5 FPGA: 512 PEs at 125 MHz, 64 chains, 2 memory banks
• Host: 2.5 GHz quad-core Xeon
• GPU for comparison: Tesla, 1.3 GHz, 128 cores
• API for transferring assembly, matrix data, …

Results for Semantic Search
[Plot: milliseconds per query vs. number of documents (256K, 512K, 1M) for the MAPLE prototype (125 MHz) vs. a 2.5 GHz quad-core Xeon with 4 threads; 32 concurrent queries, ranking top 32]
• For 2M documents, 32 concurrent queries, top K = 128:
  – 2.5 GHz Xeon 4-core: 52 ms/query
  – C870 Tesla GPU: 11.4 ms/query
  – MAPLE prototype: 3.76 ms/query
• Why is MAPLE faster?
  – PE-smart memory chaining: performs both first and second op in parallel
  – In-memory processing: fewer off-chip accesses


Results for Conv. Neural Networks
[Plot: milliseconds per frame for a CNN for face recognition, 2.5 GHz quad-core Xeon (4 threads) vs. MAPLE prototype (125 MHz)]
• 2.5 GHz Xeon 4-core: 6 fps
• C870 Tesla GPU: 9.5 fps
• MAPLE prototype: 13 fps

Results for SVM Training
• Compared to the optimized GPU implementation from UCB, the MAPLE prototype is 2-6x slower
• Why?
  – SVM training has a large matrix-vector multiplication, which is memory-bandwidth limited (about 1 compute op per fetch)
  – MAPLE cannot match the GPU’s memory bandwidth (GDDR5, etc.)
  – But MAPLE is only an FPGA prototype, while the GPU is a custom processor
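A back-of-the-envelope check of why matrix-vector multiplication is bandwidth bound (my arithmetic, not from the slides): an M×N single-precision matvec does 2MN flops but must fetch MN matrix elements (4MN bytes), so attainable performance is roughly bandwidth / 2 flops per byte, regardless of compute resources.

```python
def matvec_bandwidth_bound(rows, cols, bandwidth_gb_s):
    """Roofline-style estimate for a single-precision matrix-vector multiply.

    Matrix traffic dominates; 2 flops per matrix element fetched.
    """
    bytes_moved = rows * cols * 4          # matrix traffic, ignoring the small vectors
    flops = 2 * rows * cols                # one multiply + one add per element
    seconds = bytes_moved / (bandwidth_gb_s * 1e9)
    return flops / seconds / 1e9           # attainable GFLOP/s

# Hypothetical bandwidth numbers, for illustration only:
print(matvec_bandwidth_bound(100_000, 1_000, bandwidth_gb_s=10))   # ~5 GFLOP/s
print(matvec_bandwidth_bound(100_000, 1_000, bandwidth_gb_s=100))  # ~50 GFLOP/s
```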


Summary and Conclusions
• Looked into new architectural features for learning and classification
  – Systematically analyzed representative workloads
  – Identified bottlenecks and a common structure
• Prototyped the accelerator system: showed promising speedups
• Future work
  – Use of such accelerators in low-power systems (e.g., with Atom as the host processor)
  – Embedded learning and classification

Thank You!


Questions (reviewer comments)
• Memory model
• GPU with 128 PEs, MAPLE has 512 – fair comparison?
• Specification holes?
• What about other apps besides SSI? K-means? CNN? Other?
• Could I not rewrite my application (on a GPU) to avoid interim storage?
• How much of the performance win comes from computation, and how much from reducing memory accesses? What fraction of peak is achieved in each case?
• How were the CPU and GPU optimized?



