A Parallel Accelerator for Semantic Search
Abhinandan Majumdar, Srihari Cadambi, Srimat Chakradhar and Hans Peter Graf
NEC Laboratories America, Princeton, New Jersey, USA

www.nec-labs.com

Semantic Search
• Intelligent searching and ranking of semantically similar documents from a large database.

[1]

[1] Ding, Li; Finin, Tim; Joshi, Anupam; Pan, Rong; Cost, R. Scott; Peng, Yun; Reddivari, Pavan; Doshi, Vishal C.; Sachs, Joel (2004). "Swoogle: A Search and Metadata Engine for the Semantic Web". Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management (CIKM). ACM. pp. 652–659.


Supervised Semantic Indexing (SSI)
• A recently proposed algorithm for semantic search.
• Ranks a large set of documents based on semantic similarity to text-based queries.


Presentation Roadmap
• Background on SSI and its computational bottlenecks
• Motivation: Why a specialized accelerator for SSI?
• Architecture of the SSI Accelerator
• Programming the SSI Accelerator
• Architectural Exploration and FPGA Prototype
• SSI Accelerator Performance


The Big Picture of the Application
[Figure: An SSI server receives Q concurrent queries against a database of D documents (D = 1–2M) and, for each of the Q queries, returns the k most semantically similar documents (k = 32, 64, …).]

SSI Algorithm
• Each document and query is represented by a vector, with each vector element a product of Term Frequency (TF) and Inverse Document Frequency (IDF).
• Multiplying these document and query vectors by a weight matrix (learned during training) produces the matrices used in the SSI matching algorithm.
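The TF–IDF representation described above can be sketched in a few lines of pure Python. This is an illustrative sketch only: the toy corpus, whitespace tokenization, and the exact TF/IDF normalization are assumptions for the example, not the paper's preprocessing.

```python
import math

def tfidf_vectors(docs):
    """Return the vocabulary and one TF-IDF vector per document."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(tokenized)
    # IDF: log(N / number of documents containing the term)
    df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}
    idf = {w: math.log(n / df[w]) for w in vocab}
    vectors = []
    for doc in tokenized:
        tf = {w: doc.count(w) / len(doc) for w in vocab}  # term frequency
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors

vocab, vecs = tfidf_vectors(["the cat sat", "the dog sat", "the cat ran"])
```

A term that occurs in every document ("the" here) gets IDF 0 and hence weight 0 in every vector, which is exactly how TF–IDF suppresses uninformative terms.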

SSI Matching Algorithm (consumes 99% of total execution time, profiled on a 2.5 GHz Quad-core Xeon):

Documents (D × c) × Queries (c × Q) → Intermediate data array (D × Q) → Ranking → Top-K (k × Q)

D: # of documents; c: document length; Q: # of concurrent queries; k: top-k elements
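The matching pipeline above (a D × c by c × Q multiply followed by per-query ranking) can be written as a short functional reference in pure Python. The numbers are arbitrary; this mirrors the slide's data flow, not the accelerator's fixed-point implementation.

```python
import heapq

def ssi_match(docs, queries, k):
    """docs: D x c matrix, queries: c x Q matrix -> top-k doc indices per query.

    Step 1 (MATMUL): form the intermediate D x Q score matrix.
    Step 2 (RANKING): per query column, keep the k highest-scoring documents.
    """
    D, c, Q = len(docs), len(docs[0]), len(queries[0])
    scores = [[sum(docs[i][t] * queries[t][j] for t in range(c))
               for j in range(Q)] for i in range(D)]
    return [heapq.nlargest(k, range(D), key=lambda i: scores[i][j])
            for j in range(Q)]

docs = [[4, 2], [9, 5], [1, 3], [1, 6]]   # D = 4 documents, c = 2
queries = [[1, 0], [0, 1]]                # c = 2, Q = 2 queries
topk = ssi_match(docs, queries, k=2)
```

Note that the intermediate D × Q matrix grows with the document count, which is the source of the 128 MB interim result the accelerator is designed to avoid writing off-chip.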


Motivation for the SSI Accelerator
• Stringent performance constraints:
  – Must search and rank millions of documents in a few milliseconds, as dictated by application-level quality-of-service.
  – State-of-the-art computing platforms do not deliver the required performance.

Motivation for the SSI Accelerator – An Example
Must return the top-64 documents within a few milliseconds per query (SSI with 64 queries on 2M documents, which generates 128 MB of intermediate data).
• 2.5 GHz Quad-core Xeon (12 MB L2 cache): 61 ms/query. Limited fine-grained parallelism; no software-controlled caching policy.
• 1.3 GHz 240-core Tesla GPU (16 KB per CUDA block): 9.5 ms/query. Smaller on-chip memory leads to frequent off-chip accesses.
→ Massively parallel on-chip processors + in-memory processing → high performance.

This Paper
• Presents the architecture of the FPGA-based SSI accelerator.
• Compares its performance against optimized parallel implementations on multi-cores and GPUs.
• What the accelerator is not:
  – A replacement for the Xeon or the GPU.
  – An accelerator for other application domains.
• What this paper tries to do:
  – Suggest and evaluate new architectural features for SSI, and show how the accelerator can be programmed.
  – Ask whether an existing processor (CPU or GPU) could use these features. Maybe…

Presentation Roadmap
• Background on SSI and its computational bottlenecks
• Motivation: Why a specialized accelerator for SSI?
• Architecture of the SSI Accelerator
• Programming the SSI Accelerator
• Architectural Exploration and FPGA Prototype
• SSI Accelerator Performance

Architecture of the SSI Accelerator
• Semantic search: given D documents and Q concurrent queries, find the top-k matching documents.
SSI core computation:
• MATMUL (fine-grained parallelism): documents are streamed from off-chip and queries are distributed from off-chip into on-chip memory; a 2D PE array multiplies them, producing a large interim result.
• ARRAY RANK (distributed on-chip smart memory):
  1. In-memory processing of array ranking.
  2. Operates in parallel with the PE array.
  3. Reduces off-chip writes.
Only the reduced result is streamed to off-chip memory.
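The architectural point above, ranking each interim row on-chip as it is produced instead of materializing the full D × Q matrix, can be mimicked in software. This is a behavioral sketch: the per-query min-heap stands in for the smart memory, and queries are given as row vectors for readability.

```python
import heapq

def streaming_ssi(doc_stream, queries, k):
    """Consume documents one at a time; keep only k candidates per query.

    Mirrors the MATMUL -> ARRAY RANK overlap: only one row of the interim
    D x Q result exists at a time, so no large result is ever stored.
    """
    heaps = [[] for _ in queries]        # per-query min-heaps of (score, doc_id)
    for doc_id, doc in enumerate(doc_stream):
        row = [sum(a * b for a, b in zip(doc, q)) for q in queries]  # one interim row
        for j, s in enumerate(row):
            if len(heaps[j]) < k:
                heapq.heappush(heaps[j], (s, doc_id))
            elif s > heaps[j][0][0]:     # beats the current k-th best: replace it
                heapq.heapreplace(heaps[j], (s, doc_id))
    return [sorted(h, reverse=True) for h in heaps]

top = streaming_ssi([[4, 2], [9, 5], [1, 3], [1, 6]], [[1, 0], [0, 1]], k=2)
```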

PE Array and Smart Memory
[Figure: Documents are streamed from off-chip into an input local store and broadcast across H chains of PEs; queries are distributed from off-chip into per-PE memories (MEM). Each chain of M/2 PEs feeds a smart memory, and results are streamed to off-chip memory.]
• Simple vector processing elements, each implemented using 2 FPGA DSPs.
• All PEs operate in SIMD mode.


SSI Mapping – Dynamic Parallelism
4 documents, 4 concurrent queries, need the best 2 matches.
[Figure, step 1: The four queries are distributed from the input local store, one per PE memory, and document 1 is broadcast serially down the chains.]

SSI Mapping – Dynamic Parallelism (continued)
[Figure, step 2: The PEs generate row 1 of the intermediate matrix, i.e. the scores of document 1 against all four queries.]

SSI Mapping – Dynamic Parallelism (continued)
[Figure, step 3: Row 1 is sent to the smart memories while document 2 is broadcast serially.]

SSI Mapping – Dynamic Parallelism (continued)
[Figure, step 4: The PEs generate row 2 while the smart memories hold row 1.]

SSI Mapping – Dynamic Parallelism (continued)
[Figure, step 5: Document 3 is broadcast serially; the smart memories start ranking the rows received so far.]

SSI Mapping – Dynamic Parallelism (continued)
[Figure, step 6: Streaming, multiplication, and ranking proceed in parallel until the smart memories hold the top-2 documents for each query.]

SSI Mapping: Dynamic Parallelism
4 documents, 2 concurrent queries, need the best 2 matches.
[Figure: With only two queries, the queries are duplicated across each chain so that every PE memory is occupied.]

SSI Mapping: Dynamic Parallelism (continued)
[Figure: With queries duplicated across the chain, documents 1 and 2 are broadcast together. Performance is doubled and PE utilization is full.]

SSI Mapping: Dynamic Parallelism
• Dynamic parallelism logically reorganizes the PEs to allow multiple levels of parallelism.
• Maximum parallelism depends on the number of DSPs in a chain.
• The parallelism mode is determined after the FPGA is programmed; it is set by the assembler based on the number of SSI queries and the processor layout. The mode:
  – configures the bandwidth of the broadcast bus;
  – configures the smart memories to rank for one or multiple queries;
  – duplicates or distributes the queries.
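The assembler's mode choice can be caricatured by a small function. The function name and the exact policy are assumptions, not the paper's algorithm; it captures only the idea from the preceding slides that when Q is smaller than the number of PE memories, queries are duplicated and several documents can be broadcast per step.

```python
def parallelism_mode(pe_mems_per_chain, num_chains, num_queries):
    """Pick how many documents are broadcast together per step.

    Each PE memory holds one query. When Q is smaller than the number of
    PE memories, queries are duplicated across the spare memories, letting
    several documents stream down the chains in the same step.
    """
    total_mems = pe_mems_per_chain * num_chains
    if num_queries > total_mems:
        raise ValueError("queries must be tiled across multiple passes")
    docs_per_broadcast = total_mems // num_queries
    duplication = docs_per_broadcast      # each query is copied this many times
    return docs_per_broadcast, duplication

# Slide example: 2 chains x 2 PE memories. Four queries -> documents are
# broadcast one at a time; two queries -> two documents per broadcast.
```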

Programming the SSI Accelerator

// Allocate FPGA memory
SSIMalloc (doc_fpga, size…);
SSIMalloc (query_fpga, size…);
SSIMalloc (asm_fpga, size…);
SSIMalloc (topk_fpga, size…);

// Transfer documents from host to the SSI Accelerator memory
SSIMemCpy (doc_fpga, docs, size…, HOST2FPGA);
// Transfer queries from host to the SSI Accelerator memory
SSIMemCpy (query_fpga, queries, size…, HOST2FPGA);

// Generate assembly: the mapping step takes the problem size and the
// architectural parameters and produces the parallelism mode, the data
// mapping, and the assembly instructions
asm = GenSSIASM(N,Q,K,H,M…);
// Transfer the assembly from host to the SSI Accelerator memory
SSIMemCpy (asm_fpga, asm, size…, HOST2FPGA);

// Execute SSI and poll until completion
while (SSIExecute()!=SSI_FINISH);

// Transfer top-k values from SSI Accelerator memory to host
SSIMemCpy (topk, topk_fpga, size…, FPGA2HOST);

SSI Accelerator – Internal Components
Processing element:
• Vector FU: a 16-bit signed fixed-point multiplier built from two FPGA DSPs.
• PE local store: one dual-ported on-chip BRAM; streaming data arrives M words at a time from PE i-1 and is forwarded to PE i+1, with the width set by the parallelization mode.
Smart memory:
• FILTER: compares each incoming value against a THRESHOLD register; values that cannot enter the top-k are discarded immediately.
• LIST: holds the current candidate (VAL, ADDR) pairs.
• SCAN/UPDATE: when a value passes the filter, the list is scanned and updated. The scan stalls processing, but the stall is hidden behind the next data being processed; the stall probability decreases over time and is invariant of k.
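The filter-then-scan behavior of the smart memory can be modeled in software. The class name and bookkeeping below are invented for illustration; the point is that a cheap threshold compare rejects most values, and only a value that beats the current k-th best triggers the costly scan (the case that stalls the FPGA pipeline), which is why the stall probability falls as the threshold rises.

```python
class SmartMemory:
    """Keep the top-k (value, address) pairs from a stream of scores."""

    def __init__(self, k):
        self.k = k
        self.entries = []                # (value, address), unsorted like a BRAM list
        self.threshold = float("-inf")   # current k-th best value
        self.scans = 0                   # how often the slow scan path ran

    def push(self, value, address):
        if len(self.entries) < self.k:
            self.entries.append((value, address))
            if len(self.entries) == self.k:
                self.threshold = min(v for v, _ in self.entries)
            return
        if value <= self.threshold:      # filtered out: no scan needed
            return
        self.scans += 1                  # slow path: scan the list, evict the minimum
        i = min(range(self.k), key=lambda j: self.entries[j][0])
        self.entries[i] = (value, address)
        self.threshold = min(v for v, _ in self.entries)

sm = SmartMemory(k=2)
for addr, val in enumerate([4, 9, 1, 1, 12, 3]):
    sm.push(val, addr)
```

In this run only one of six values triggers a scan; the rest are rejected by the threshold compare alone.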

Presentation Roadmap
• Background on SSI and its computational bottlenecks
• Motivation: Why a specialized accelerator for SSI?
• Architecture of the SSI Accelerator
• Programming the SSI Accelerator
• Architectural Exploration and FPGA Prototype
• SSI Accelerator Performance

Alpha Data ADM-XRC-5T2 Board •

One Virtex-5 SX240T FPGA. – –

• • •

516 on-chip 36 Kbits Block RAMs. 1056 DSPs.

Data Memory: Four 512 MB DDR2 SDRAMs with B/W of 8 words/cycle. Instruction Memory: Two 256 MB DDR2 SSRAMs with B/W of 4 words/cycle. Host FPGA connection through 66 MHz, 64 bit PCI-X interface.

DRAM

DRAM

PCI-X FPGA DRAM

DRAM

25

SSI Accelerator Prototype
• Resource constraints and accelerator requirements:
  – An instruction memory requires one SRAM (2 available) → two cores, each with its own instruction memory.
  – A data memory requires one DRAM (4 available) → each core uses two data memories, one for input and one for output.
  – A PE requires one BRAM (516 available) and two DSPs (1056 available) → 256 total PEs using 256 BRAMs and 512 DSPs (256 DSPs/core); the remaining BRAMs serve the smart memories, the input local store, etc.
[Figure: four off-chip DDR2 data banks connect through switches to cores 1 and 2, with high-bandwidth bank-to-core communication but low-bandwidth bank-to-bank communication; two off-chip SRAM instruction banks, loaded from the host, feed the cores.]
With 256 DSPs per core, what is the optimal layout of the PE matrix?

PE Array Layout?
[Figure: With 256 DSPs per core, the layout ranges from a single wide chain of 256 DSPs (widest memory interface) to 256 narrow chains of one DSP each; in general, H chains of M DSPs with H × M = 256.]

Core Architecture Exploration
• Using a C++ simulator, determine the optimal number of chains (H) and DSPs per chain (M) for the 256 DSPs in a core.
• Simulate SSI for 1024 queries ranking 2M documents.
[Plot: SSI performance in cycles (billions) versus DSPs per chain (1, 4, 16, 64, 256), for k = 32, 64, and 128. Performance is insensitive to k. The optimum is 8 DSPs/chain with 32 chains: shorter chains give lower parallelism, longer chains process faster than the off-chip bandwidth can feed them, and a chain length of 8 exactly matches the off-chip bandwidth of 8 words/cycle.]
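Why a chain length of 8 balances the design can be seen with a toy model (an assumption for illustration, not taken from the paper): suppose a chain of M DSPs can consume up to M words per cycle from the broadcast bus, while off-chip memory delivers 8 words per cycle.

```python
# Toy balance model: min(M, BW) words are actually processed per cycle.
BW = 8  # off-chip bandwidth in words/cycle

def dsp_utilization(m, bw=BW):
    """Fraction of a chain's DSP-cycles doing useful work."""
    return min(m, bw) / m

def bandwidth_utilization(m, bw=BW):
    """Fraction of delivered off-chip words the chain can absorb."""
    return min(m, bw) / bw

# M = 8 is the only chain length at which neither resource idles.
best = [m for m in (1, 4, 8, 16, 64, 256)
        if dsp_utilization(m) == 1.0 and bandwidth_utilization(m) == 1.0]
```

Shorter chains waste delivered bandwidth; longer chains starve their DSPs, matching the trend in the simulated plot.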

FPGA Prototype and Resource Utilization
• Number of DSPs per chain in a core (M) = 8.
• Number of chains in a core (H) = 32.
• Number of cores = 2.
• Operating frequency = 125 MHz.
• On-chip memory = 1.7 MB.
• Block RAM usage = 74%.
• DSP usage = 48%.

Presentation Roadmap
• Background on SSI and its computational bottlenecks
• Motivation: Why a specialized accelerator for SSI?
• Architecture of the SSI Accelerator
• Programming the SSI Accelerator
• Architectural Exploration and FPGA Prototype
• SSI Accelerator Performance

SSI Accelerator Performance
• Baseline:
  – The Xeon implementation uses multi-threaded BLAS.
  – The GPU implementation uses CUBLAS, with CUDA threads computing the top-k on a set of documents without incoherent memory loads/stores.
• Assumptions:
  – Exclude the one-time transfer of documents from the host to the FPGA.
  – Include the transfer time of queries and results between host and FPGA.
  – Include the FPGA setup time for storing documents from off-chip into the input local store and for distributing the queries.

Summary and Conclusion
• What we presented:
  – An FPGA-based programmable semantic search accelerator.
  – A performance comparison of the accelerator against a multi-core CPU and a GPU.
• Key takeaways:
  – The combination of in-memory processing and massively parallel PEs makes the SSI accelerator fast.
  – Logical reorganization of the PEs allows multiple levels of parallelism.
• Future work:
  – Using this accelerator as an energy-efficient server (with an Atom as host), especially when deployed in clusters and data centers.


Thank You

Q&A…?


Questions
• Can you perform both the matmul and the array ranking on-chip on a GPU, even when shared memory is large?
• The FPGA memory bandwidth is low; how do you deal with that?
• Have you thought about multiple GPUs?

