A Parallel Accelerator for Semantic Search
Abhinandan Majumdar, Srihari Cadambi, Srimat Chakradhar and Hans Peter Graf
NEC Laboratories America, Princeton, New Jersey, USA
www.nec-labs.com
Semantic Search
• Intelligent searching and ranking of semantically similar documents from a large database. [1]
[1] Li Ding, Tim Finin, Anupam Joshi, Rong Pan, R. Scott Cost, Yun Peng, Pavan Reddivari, Vishal C. Doshi, and Joel Sachs (2004). "Swoogle: A Search and Metadata Engine for the Semantic Web". Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management (CIKM). ACM. pp. 652–659.
Supervised Semantic Indexing (SSI)
• A recently proposed algorithm for semantic search.
• Ranks a large set of documents based on their semantic similarity to text-based queries.
Presentation Roadmap
• Background on SSI and its computational bottlenecks
• Motivation: Why a specialized accelerator for SSI?
• Architecture of the SSI Accelerator
• Programming the SSI Accelerator
• Architectural Exploration and FPGA Prototype
• SSI Accelerator Performance
The Big Picture of the Application
[Diagram: Q concurrent queries arrive at the SSI server, which searches D documents (D = 1–2M) and returns the k most semantically similar documents per query (k = 32, 64, …).]
SSI Algorithm
• Each document and query is represented by a vector, with each vector element a product of Term Frequency (TF) and Inverse Document Frequency (IDF).
• These document and query vectors, when multiplied by a weight matrix (generated by training), produce the matrices used in the SSI matching algorithm.
[Diagram: SSI matching algorithm — the document array (D x c) is multiplied by the query matrix (c x Q); the intermediate data (D x Q) is then ranked to produce the top-k result (k x Q). D: # of documents; c: document length; Q: # of concurrent queries; k: top-k elements.]
• This matching step consumes 99% of total execution time (profiled on a 2.5 GHz quad-core Xeon).
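To make the matching step concrete, here is a minimal C sketch of the two SSI kernels: the matrix multiplication and the per-query top-k ranking. The sizes, data and selection code are illustrative only, not the accelerator's actual implementation.

```c
#include <assert.h>

#define D 4  /* number of documents          */
#define C 3  /* document (vector) length     */
#define Q 2  /* number of concurrent queries */
#define K 2  /* top-k matches per query      */

/* Multiply the D x C document array by the C x Q query matrix,
 * producing the D x Q intermediate score matrix. */
void ssi_matmul(int docs[D][C], int queries[C][Q], int scores[D][Q]) {
    for (int d = 0; d < D; d++)
        for (int q = 0; q < Q; q++) {
            scores[d][q] = 0;
            for (int c = 0; c < C; c++)
                scores[d][q] += docs[d][c] * queries[c][q];
        }
}

/* For one query column, select the indices of the K highest scores
 * (plain selection here; the accelerator ranks in-memory). */
void ssi_topk(int scores[D][Q], int q, int topk[K]) {
    int used[D] = {0};
    for (int k = 0; k < K; k++) {
        int best = -1;
        for (int d = 0; d < D; d++)
            if (!used[d] && (best < 0 || scores[d][q] > scores[best][q]))
                best = d;
        used[best] = 1;
        topk[k] = best;
    }
}
```

Because each query column is ranked independently, the ranking can run alongside the ongoing matrix multiplication, which is what the accelerator exploits.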
Motivation for the SSI Accelerator
• Stringent performance constraints: the accelerator must search and rank millions of documents in a few milliseconds, as dictated by application-level quality-of-service.
• State-of-the-art computing platforms do not deliver the required performance.
Motivation for the SSI Accelerator – An Example
Must find the top-64 documents in a few milliseconds per query (SSI with 64 queries on 2M documents, which generates 128 MB of intermediate data):
• 2.5 GHz quad-core Xeon (12 MB L2 cache): 61 ms/query. Limited fine-grained parallelism; no software-controlled caching policy; frequent off-chip accesses.
• 1.3 GHz 240-core Tesla GPU (16 KB per CUDA block): 9.5 ms/query. Smaller on-chip memory; frequent off-chip accesses.
• High performance calls for massively parallel on-chip processors combined with in-memory processing.
This Paper
• Presents the architecture of the FPGA-based SSI accelerator.
• Compares its performance against optimized parallel implementations on multi-cores and GPUs.
• What the accelerator is not:
– A replacement for the Xeon or the GPU.
– An accelerator for other application domains.
• What this paper tries to do:
– Suggest and evaluate new architectural features for SSI, and show how the accelerator can be programmed.
– Can an existing processor (CPU or GPU) use these features? Maybe…
Architecture of the SSI Accelerator
• Semantic search: given D documents and Q concurrent queries, find the top-k matching documents.
[Diagram: SSI core computation. Documents are streamed and queries are distributed from off-chip memory into on-chip memory. A 2D PE array performs the matrix multiplication (fine-grained parallelism), producing a large interim result. A distributed on-chip smart memory performs the array ranking in-memory, operating in parallel with the PE array; this reduces the interim result on-chip and cuts off-chip writes, so only the reduced result is streamed back to off-chip memory.]
PE Array and Smart Memory
[Diagram: Documents are streamed from off-chip into an input local store and broadcast to H chains of PEs; queries are distributed from off-chip into the per-PE memories (MEM). Each chain of M/2 PEs feeds a smart memory, which streams the ranked result to off-chip memory.]
• Simple vector processing elements, each implemented using 2 FPGA DSPs.
• All PEs operate in SIMD mode.
SSI Mapping – Dynamic Parallelism
Example: 4 documents, 4 concurrent queries, need the best 2 matches.
[Diagram: the 4 x 4 document array is multiplied by the 4 x 4 query matrix; each PE holds one query in its local memory. Step 1: DOC 1 is broadcast serially from the input local store to all PEs.]
[Step 2: each PE multiplies DOC 1 against its query, generating row 1 of the intermediate result.]
[Step 3: row 1 is sent to the smart memories while DOC 2 is broadcast serially.]
[Step 4: the PEs generate row 2 of the intermediate result.]
[Step 5: DOC 3 is broadcast serially while the smart memories start ranking the rows already produced.]
[Step 6: matmul and ranking continue in parallel; each smart memory keeps only the top 2 scores per query.]
SSI Mapping: Dynamic Parallelism
Example: 4 documents, 2 concurrent queries, need the best 2 matches.
[Diagram: with only 2 queries, each query is duplicated across the chain so that every PE holds a query.]
[With the queries duplicated, DOC 1 and DOC 2 are broadcast together: performance is doubled and PE utilization is full.]
SSI Mapping: Dynamic Parallelism
• Dynamic parallelism logically reorganizes the PEs to allow multiple levels of parallelism.
• The maximum parallelism depends on the number of DSPs in a chain.
• The parallelism mode is determined after the FPGA is programmed; the assembler sets it based on the number of SSI queries and the processor layout. It configures the bandwidth of the broadcast bus, whether queries are duplicated or distributed across a chain, and whether each smart memory ranks for one or multiple queries.
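The effect of dynamic parallelism in the walkthrough above can be captured by a small helper (our naming, not the paper's): with fewer queries than PEs in a chain, queries are duplicated and correspondingly more documents are broadcast per step.

```c
#include <assert.h>

/* Hypothetical helper illustrating the dynamic-parallelism rule: given
 * the number of PEs in a chain and the number of concurrent queries,
 * return how many documents can be broadcast together. When every PE
 * holds a distinct query, documents go out one at a time; when queries
 * are duplicated across the chain, the broadcast widens accordingly. */
int docs_per_broadcast(int pes_per_chain, int num_queries) {
    if (num_queries >= pes_per_chain)
        return 1;                        /* serial broadcast */
    return pes_per_chain / num_queries;  /* wider broadcast  */
}
```

With 4 PEs this gives 1 document per step for 4 queries and 2 documents per step for 2 queries, matching the doubled performance in the 2-query example.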
Programming the SSI Accelerator
// Allocate SSI accelerator (FPGA) memory
SSIMalloc (doc_fpga, size…);
SSIMalloc (query_fpga, size…);
SSIMalloc (asm_fpga, size…);
SSIMalloc (topk_fpga, size…);
// Transfer documents from host to the SSI accelerator memory
SSIMemCpy (doc_fpga, docs, size…, HOST2FPGA);
// Transfer queries from host to the SSI accelerator memory
SSIMemCpy (query_fpga, queries, size…, HOST2FPGA);
// Generate assembly from the problem size (N, Q, K) and architecture parameters (H, M)
asm = GenSSIASM(N,Q,K,H,M…);
// Transfer the generated assembly from host to the SSI accelerator memory
SSIMemCpy (asm_fpga, asm, size…, HOST2FPGA);
// Execute SSI and poll until completion
while (SSIExecute()!=SSI_FINISH);
// Transfer top-k values from SSI accelerator memory to host
SSIMemCpy (topk, topk_fpga, size…, FPGA2HOST);
The mapping step takes the problem size and the architectural parameters, chooses a parallelism mode and data mapping, and emits the assembly instructions.
SSI Accelerator – Internal Components
• Processing element: a vector functional unit built from two FPGA DSPs (16-bit signed fixed-point multiply), with an M-word local store held in one dual-ported on-chip BRAM; data streams in from PE i-1 and out to PE i+1.
• Smart memory: a threshold filter drops most streaming scores immediately; surviving (value, address) pairs are scanned, compared, and inserted into a sorted top-k list. An insertion stalls the processing, but the stall is hidden behind the next data being processed; the stall probability decreases over time and is invariant of k.
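A software sketch of the smart memory's ranking path (names and structure are ours; the hardware keeps the list in BRAM): most streaming scores fail the threshold compare and are dropped without touching the list, which is why the stall probability decreases as the list fills with high scores.

```c
#include <assert.h>

#define SM_K 2  /* top-k entries kept per query */

typedef struct {
    int val[SM_K];   /* scores, sorted in descending order */
    int addr[SM_K];  /* document indices of those scores   */
    int count;       /* entries currently held             */
} smart_mem;

void sm_init(smart_mem *sm) { sm->count = 0; }

/* Insert one streamed (score, doc) pair. Returns 1 if the insertion
 * path was taken (a stall in hardware, hidden behind the next data),
 * or 0 if the threshold filter rejected the score immediately. */
int sm_insert(smart_mem *sm, int score, int doc) {
    if (sm->count == SM_K && score <= sm->val[SM_K - 1])
        return 0;  /* below the current k-th best: filtered */
    int i = (sm->count < SM_K) ? sm->count++ : SM_K - 1;
    while (i > 0 && sm->val[i - 1] < score) {  /* shift smaller entries down */
        sm->val[i] = sm->val[i - 1];
        sm->addr[i] = sm->addr[i - 1];
        i--;
    }
    sm->val[i] = score;
    sm->addr[i] = doc;
    return 1;
}
```

The insertion cost here is bounded by the list length, but the threshold compare in front of it runs in constant time regardless of k, consistent with the slide's "invariant of k" claim.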
Alpha Data ADM-XRC-5T2 Board
• One Virtex-5 SX240T FPGA:
– 516 on-chip 36 Kbit Block RAMs.
– 1056 DSPs.
• Data memory: four 512 MB DDR2 SDRAMs with a bandwidth of 8 words/cycle.
• Instruction memory: two 256 MB DDR2 SSRAMs with a bandwidth of 4 words/cycle.
• Host–FPGA connection through a 66 MHz, 64-bit PCI-X interface.
[Diagram: the FPGA connects to the four DRAMs and to the host over PCI-X.]
SSI Accelerator Prototype
• Resource constraints and accelerator requirements:
– An instruction memory requires one SRAM (2 available): two cores, each with its own instruction memory.
– A data memory requires one DRAM (4 available): each core uses two data memories, one for input and one for output.
– A PE requires one BRAM (516 available) and two DSPs (1056 available): 256 total PEs with 256 BRAMs and 512 DSPs (256 DSPs/core); the remaining BRAMs serve the smart memories, input local store, etc.
[Diagram: four off-chip DDR2 data banks connect to the two cores through switches, with high-bandwidth bank-to-core communication but low-bandwidth bank-to-bank communication; two off-chip SRAM instruction banks are loaded from the host.]
• With 256 DSPs/core, what is the optimal layout of the PE matrix?
PE Array Layout?
[Diagram: with 256 DSPs per core, the layout ranges from one wide chain of 256 DSPs to 256 narrow chains of 1 DSP each; in general, H chains of M DSPs each, with H x M = 256.]
Core Architecture Exploration
• Using a C++ simulator, determine the optimal number of chains (H) and DSPs/chain (M) for the 256 DSPs in a core.
• Simulate SSI for 1024 queries ranking 2M documents.
[Plot: SSI performance in cycles (billions) vs. DSPs per chain (1 to 256) for k = 32, 64 and 128. Performance is insensitive to k. A smaller chain length means lower parallelism; a longer chain processes faster than off-chip bandwidth can feed it. The sweet spot is 8 DSPs/chain and 32 chains, where the chain length matches the off-chip bandwidth of 8 words/cycle.]
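The exploration's conclusion suggests a simple sizing rule, sketched below under our own naming (not the paper's): match the chain length to the off-chip bandwidth, then spend the remaining DSP budget on chains.

```c
#include <assert.h>

/* Hypothetical sizing rule distilled from the exploration: a chain
 * longer than the off-chip bandwidth (in words/cycle) starves, while a
 * shorter one leaves parallelism unused, so set M to the bandwidth and
 * derive the number of chains H from the per-core DSP budget. */
int chain_length(int offchip_words_per_cycle) {
    return offchip_words_per_cycle;
}
int num_chains(int dsps_per_core, int dsps_per_chain) {
    return dsps_per_core / dsps_per_chain;
}
```

For the prototype's 8 words/cycle and 256 DSPs per core this yields M = 8 and H = 32, the configuration actually built.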
FPGA Prototype and Resource Utilization
• Number of DSPs/chain in a core (M) = 8.
• Number of chains in a core (H) = 32.
• Number of cores = 2.
• Operating frequency = 125 MHz.
• On-chip memory = 1.7 MB.
• Block RAM usage = 74%.
• DSP usage = 48%.
SSI Accelerator Performance
• Baseline:
– The Xeon implementation uses multi-threaded BLAS.
– The GPU implementation uses CUBLAS, with CUDA threads computing the top-k over a set of documents without incoherent memory loads/stores.
• Assumptions:
– Exclude the one-time transfer of the documents from the host to the FPGA.
– Include the transfer time of queries and results between the host and the FPGA.
– Include the FPGA setup time for storing the documents from off-chip into the input local store and distributing the queries.
Summary and Conclusion
• What we presented:
– An FPGA-based programmable semantic search accelerator.
– A performance comparison of the accelerator against a multi-core and a GPU.
• Key takeaways:
– The combination of in-memory processing and massively parallel PEs makes the SSI accelerator fast.
– Logical reorganization of the PEs allows multiple levels of parallelism.
• Future work:
– Using this accelerator as an energy-efficient server (with an Atom as host), especially when deployed in clusters and data centers.
Thank You
Q&A…?
Questions
• Can you perform both the matmul and the array ranking on-chip on a GPU, given a large enough shared memory?
• The FPGA memory bandwidth is low; how do you deal with that?
• Have you thought about multiple GPUs?