An Efficient Synchronization Technique for Multiprocessor Systems on-Chip
Matteo Monchiero, Gianluca Palermo, Cristina Silvano and Oreste Villa
GRAPES Computer Architecture Group
Politecnico di Milano – Italy
{monchier, gpalermo, silvano, ovilla}@elet.polimi.it

MEDEA 2005

Outline
  Introduction
  Background
  Synchronization-operation Buffer (SB)
  Target Architecture
  Experimental Results
  Concluding Remarks


Introduction
Multi-core architecture: IBM Power5, Intel Montecito, Sun Niagara,…
  Dealing with interconnect latency

Multiprocessor System-on-Chip (MPSoC): Philips Nexperia, Intel IXP2850,...
  Heterogeneous cores integrated in a complete system
  Low-cost and low-power
  Programmed with ad hoc techniques

Focus of this paper: Synchronization for MPSoCs
  Embedded homogeneous multiprocessor (CMP)
  Busy-wait techniques (spin locks and barriers)

We propose a low-complexity and low-power optimization of busy-wait synchronization


Background
Basic sync. constructs for shared memory parallel programming

Spin lock
  Used to protect critical sections
    while (!test&set(L)) {}
    /* critical section code */
    release(L);

Barrier
  Used to synchronize all threads
    B++;
    while (B < procs_num) {}

    event(B, proc_num)

Busy-wait, relying on polling of a shared variable
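
The two busy-wait constructs above can be sketched in portable C11 as follows. This is only an illustration under stated assumptions: the type and function names (spin_lock_t, spin_try_acquire, barrier_wait, ...) are hypothetical and do not come from the slides, the barrier is single-use, and the slide's B++ is assumed to be an atomic increment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names: the slides only show test&set(L), release(L) and B++. */
typedef struct { atomic_flag flag; } spin_lock_t;

static void spin_init(spin_lock_t *l) { atomic_flag_clear(&l->flag); }

/* Returns true if the lock was free and is now held by the caller.
   atomic_flag_test_and_set returns the PREVIOUS value of the flag. */
static bool spin_try_acquire(spin_lock_t *l) {
    return !atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire);
}

static void spin_acquire(spin_lock_t *l) {
    /* Busy-wait: every failed iteration is another access to the shared flag. */
    while (!spin_try_acquire(l)) { }
}

static void spin_release(spin_lock_t *l) {
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}

typedef struct { atomic_int count; int procs_num; } barrier_t;

static void barrier_init(barrier_t *b, int procs_num) {
    assert(procs_num > 0);
    atomic_init(&b->count, 0);
    b->procs_num = procs_num;
}

/* Single-use barrier: B++ followed by polling until all threads arrived. */
static void barrier_wait(barrier_t *b) {
    atomic_fetch_add(&b->count, 1);                      /* B++ (atomic) */
    while (atomic_load(&b->count) < b->procs_num) { }    /* polling loop */
}
```

Each waiting thread keeps polling the shared variable, which is the traffic the rest of the talk aims to remove.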


Proposed Solution
HW buffer for locks and events
  Synchronization-operation Buffer (SB)

Weak consistency model
  Memory read/write sequential ordering only for synchronization data
  All the data can be cached without needing a coherence protocol, while synchronization variables are managed by the SB
  Cache invalidation required for shared data at each synchronization point

SB is in the Memory Controller and provides local management of spinning operations
  Independent of caches


Locks
  The SB is a circular buffer
  Insertion in the SB on acquire_lock
  On release_lock, ownership of the lock is updated
  The number of memory references and network transactions per lock acquisition is O(1)
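
A minimal software model of this behavior, for a single lock, might look like the sketch below. It is not the authors' hardware design: the entry layout, the FIFO grant order, the buffer size SB_MAX_CORES, and all identifiers are assumptions made for illustration. Only the circular buffer, insertion on acquire_lock, and the ownership update on release_lock come from the slide.

```c
#include <assert.h>

#define SB_MAX_CORES 16     /* assumption: at most one pending entry per core */
#define SB_NO_OWNER  (-1)

typedef struct {
    int waiters[SB_MAX_CORES]; /* circular buffer of core ids; head = owner */
    int head;                  /* index of the current owner */
    int count;                 /* owner plus queued requesters */
} sb_lock_t;

static void sb_lock_init(sb_lock_t *s) { s->head = 0; s->count = 0; }

static int sb_lock_owner(const sb_lock_t *s) {
    return s->count ? s->waiters[s->head] : SB_NO_OWNER;
}

/* acquire_lock: insert the requester at the tail of the circular buffer.
   Returns 1 if the lock is granted immediately, 0 if the core must wait
   for a grant message. One request per acquisition, no remote polling. */
static int sb_acquire_lock(sb_lock_t *s, int core) {
    assert(s->count < SB_MAX_CORES);
    s->waiters[(s->head + s->count) % SB_MAX_CORES] = core;
    s->count++;
    return s->count == 1;
}

/* release_lock: drop the current owner and update ownership.
   Returns the new owner to notify, or SB_NO_OWNER if nobody is waiting. */
static int sb_release_lock(sb_lock_t *s) {
    assert(s->count > 0);
    s->head = (s->head + 1) % SB_MAX_CORES;
    s->count--;
    return sb_lock_owner(s);
}
```

In this model each acquisition costs one insertion and at most one grant notification, independent of how many cores are waiting, which is consistent with the O(1) claim.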


Events
  event signals when variable is equal to event_value
  On event, insertion in the SB
  On each write, the SB is searched and the event is possibly signaled
  When an event is signaled, the number of network transactions is O(P); on event, the number of memory references is O(1)
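
The event mechanism can be modeled in software roughly as below. This is a sketch under assumptions, not the actual SB logic: the entry fields, the size EB_ENTRIES, the return codes, the integer address, and the immediate-signal check against the current value are all hypothetical. The slide only states that event inserts into the SB and that every write triggers a search.

```c
#include <assert.h>

#define EB_ENTRIES 16   /* assumption: one entry per core */

typedef struct { int valid; int core; int addr; int event_value; } eb_entry_t;
typedef struct { eb_entry_t e[EB_ENTRIES]; } event_buffer_t;

static void eb_init(event_buffer_t *eb) {
    for (int i = 0; i < EB_ENTRIES; i++) eb->e[i].valid = 0;
}

/* event(addr, event_value) issued by `core`; `current` is the value the
   variable holds now. Returns 1 if the condition already holds (signal at
   once), 0 if the request was inserted in the buffer, -1 if it is full. */
static int eb_event(event_buffer_t *eb, int core, int addr,
                    int event_value, int current) {
    if (current == event_value) return 1;
    for (int i = 0; i < EB_ENTRIES; i++) {
        if (!eb->e[i].valid) {
            eb->e[i] = (eb_entry_t){ 1, core, addr, event_value };
            return 0;
        }
    }
    return -1;
}

/* Called on every write to a synchronization variable: search the buffer
   and collect the cores whose event is now satisfied. Returns how many
   cores must be signaled (up to P, hence O(P) network transactions). */
static int eb_on_write(event_buffer_t *eb, int addr, int new_value,
                       int *signaled) {
    int n = 0;
    for (int i = 0; i < EB_ENTRIES; i++) {
        eb_entry_t *x = &eb->e[i];
        if (x->valid && x->addr == addr && x->event_value == new_value) {
            signaled[n++] = x->core;
            x->valid = 0;   /* entry is consumed once signaled */
        }
    }
    return n;
}
```

A barrier for three cores then becomes: each core writes B++ and calls event(B, 3); the third write wakes the two cores already waiting, and the third core's own event is satisfied immediately.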


Target Architecture
Separate private and shared memory space
  On-chip shared memory
  Off-chip main memory (private memories and instructions)
Addressing space
  Non-cacheable shared pages (synchronization)
  Cacheable shared pages
  Private pages (cacheable)
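
The three page classes and how they are treated could be summarized in code as follows. This is a sketch only: the enum and predicate names are hypothetical, and the invalidation rule is taken from the weak consistency model stated on the Proposed Solution slide rather than from any implementation detail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of the page types listed on the slide. */
typedef enum {
    PAGE_PRIVATE,           /* private pages: cacheable */
    PAGE_SHARED_CACHEABLE,  /* shared data pages: cacheable, no coherence protocol */
    PAGE_SHARED_SYNC        /* synchronization pages: non-cacheable */
} page_kind_t;

/* Only synchronization pages bypass the caches. */
static bool page_is_cacheable(page_kind_t k) {
    return k != PAGE_SHARED_SYNC;
}

/* Synchronization variables are managed by the SB in the memory controller. */
static bool page_handled_by_sb(page_kind_t k) {
    return k == PAGE_SHARED_SYNC;
}

/* Without a coherence protocol, cached shared data must be invalidated
   at each synchronization point; private data is never shared. */
static bool page_needs_invalidate_at_sync(page_kind_t k) {
    return k == PAGE_SHARED_CACHEABLE;
}
```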


Experimental Setup
GRAPES
  Cycle-based MPSoC simulation framework
  C++ & SystemC
  SimIt ARM Simulators (5-stage pipeline)
  System-level power model
  Available upon request (mailing list on http://savane.elet.polimi.it/projects/grapes)

Benchmarks
  FFT, LU-1, LU-2, Radix from SPLASH-2
  Matrix, Norm

PARMACS parallel programming model


Traffic Profile
Aggregate network bandwidth, Norm benchmark
  w/o SB: sync. traffic is dominant
  w/ SB: barrier peaks are reduced


Performance Evaluation
  w/ caches and w/o caches for shared pages
  w/ SB and w/o SB
  w/ sync. pages caching and coherence protocol (directory based)
SB improves performance on sync. intensive applications


Energy Evaluation
Effects on
  Memory: accesses reduction
  Interconnect: bandwidth reduction
  Processors: reduction of execution time


Application Scaling
  Scaling of the number of cores: 4, 8 and 16
  SB enables good scaling, while the other configurations do not
  [Figure: scaling for FFT and LU-2, w/o SB and w/ SB]

Concluding Remarks
We presented a HW-implemented synchronization mechanism based on the idea of locally managing locks and events
  Integrated in an MPSoC
  No coherence protocol overhead
  Good performance and energy efficiency

General approach for embedded systems
  Using an ad hoc sync. engine is often a must for embedded systems


Thanks
www.elet.polimi.it/upload/monchier
[email protected]


Related Work
Few works about programming models for MPSoCs
  Paulin et al. [CODES’04]
  Loghi & Poncino [DATE’05]

Optimization of busy-wait synchronization for MP
  Mellor-Crummey & Scott [ToCS’91]

Queuing locks
  Software
    MCS lock
  Hardware
    Anderson [ToPDS’90]
    QOLB (Kagi et al. [CAN’97])

All these solutions rely on cache locality

Transactional Systems
  Rajwar & Goodman [IEEE MICRO’03]


Implementation Issues
SB implemented as Lock Buffer (LB) + Event Buffer (EB)
  For P cores: P entries for the LB and P entries for the EB

Critical operation is the Associative Search (AS) phase
  On lock release Ö interconnect latency
  Gate-level synthesis normalized with respect to 512KB on-chip RAM
