An Efficient Synchronization Technique for Multiprocessor Systems on-Chip
Matteo Monchiero, Gianluca Palermo, Cristina Silvano and Oreste Villa
GRAPES Computer Architecture Group, Politecnico di Milano, Italy
{monchier, gpalermo, silvano, ovilla}@elet.polimi.it

MEDEA 2005

Outline
  Introduction
  Background
  Synchronization-operation Buffer (SB)
  Target Architecture
  Experimental Results
  Concluding Remarks


Introduction
Multi-core architectures: IBM Power5, Intel Montecito, Sun Niagara, ...
  Dealing with interconnect latency

Multiprocessor Systems-on-Chip (MPSoCs): Philips Nexperia, Intel IXP2850, ...
  Heterogeneous cores integrated in a complete system
  Low-cost and low-power
  Programmed with ad hoc techniques

Focus of this paper: synchronization for MPSoCs
  Embedded homogeneous chip multiprocessor (CMP)
  Busy-wait techniques (spin locks and barriers)

We propose a low-complexity, low-power optimization of busy-wait synchronization


Background
Basic synchronization constructs for shared-memory parallel programming

Spin lock: used to protect critical sections
    while (!test&set(L)) {}    /* spin until test&set acquires L */
    /* critical section code */
    release(L);

Barrier: used to synchronize all threads
    B++;
    while (B < procs_num) {}   /* spin until every thread has arrived */

  Event form of the barrier wait: event(B, procs_num)

Both constructs are busy-wait: they rely on polling a shared variable
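The two busy-wait constructs above can be sketched in portable C11 as follows. This is a minimal illustration, not the code used in the paper: the type and function names (spin_lock_t, barrier_t, and so on) are hypothetical, the barrier is single-use, and C11 atomics stand in for whatever test&set primitive the target platform provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names; the slides only show test&set(L), release(L) and B++. */
typedef struct { atomic_flag held; } spin_lock_t;

static void spin_lock_init(spin_lock_t *l) { atomic_flag_clear(&l->held); }

/* One polling attempt: true if the caller now owns the lock. */
static bool spin_try_acquire(spin_lock_t *l) {
    /* test_and_set returns the previous state: false means the lock was free. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void spin_acquire(spin_lock_t *l) {
    while (!spin_try_acquire(l)) { /* busy-wait: every attempt polls shared state */ }
}

static void spin_release(spin_lock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Single-use counting barrier, mirroring "B++; while (B < procs_num) {}". */
typedef struct { atomic_int arrived; int procs_num; } barrier_t;

static void barrier_init(barrier_t *b, int procs_num) {
    atomic_init(&b->arrived, 0);
    b->procs_num = procs_num;
}

/* Atomic B++; returns the number of arrivals including the caller. */
static int barrier_arrive(barrier_t *b) {
    return atomic_fetch_add(&b->arrived, 1) + 1;
}

static bool barrier_done(barrier_t *b) {
    return atomic_load(&b->arrived) >= b->procs_num;
}

static void barrier_wait(barrier_t *b) {
    barrier_arrive(b);
    while (!barrier_done(b)) { /* busy-wait on the shared counter */ }
}
```

Each iteration of either loop reads a shared variable, which on an MPSoC without coherent caches turns into interconnect and memory traffic. That polling traffic is what the rest of the talk targets.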


Proposed Solution
Hardware buffer for locks and events: the Synchronization-operation Buffer (SB)

Weak consistency model
  Sequential ordering of memory reads/writes is enforced only for synchronization data
  All other data can be cached without a coherence protocol, while synchronization variables are managed by the SB
  Shared data must be invalidated in the caches at each synchronization point

The SB sits in the memory controller and manages spinning operations locally
  Independent of the caches
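The invalidation rule of the weak consistency model can be illustrated with a toy cache model. Everything here is an assumption made for the sketch: the cache_t layout, the LINES size, and the names cache_fill and cache_sync_point do not come from the slides, and the model presumes write-through caches so that dropping a line never loses data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LINES 8   /* toy cache size, assumed for illustration */

typedef enum { PAGE_PRIVATE, PAGE_SHARED } page_kind_t;

typedef struct {
    bool        valid;
    page_kind_t kind;   /* address-space region the line was fetched from */
    uint32_t    tag;
} cache_line_t;

typedef struct { cache_line_t line[LINES]; } cache_t;

/* Load a line into slot idx (the placement policy is outside this sketch). */
static void cache_fill(cache_t *c, int idx, uint32_t tag, page_kind_t kind) {
    assert(idx >= 0 && idx < LINES);
    c->line[idx] = (cache_line_t){ true, kind, tag };
}

/* Run at every synchronization point (lock or barrier): drop cached shared
 * data so that later reads fetch fresh copies from shared memory.
 * Private lines stay valid. Returns the number of lines dropped. */
static int cache_sync_point(cache_t *c) {
    int dropped = 0;
    for (int i = 0; i < LINES; i++) {
        if (c->line[i].valid && c->line[i].kind == PAGE_SHARED) {
            c->line[i].valid = false;
            dropped++;
        }
    }
    return dropped;
}
```

Because consistency is only needed at synchronization points, this software-visible invalidation replaces a hardware coherence protocol for ordinary shared data.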


Locks
  The SB is a circular buffer
  acquire_lock inserts the request in the SB
  release_lock updates the ownership of the lock
  O(1) memory references and network transactions per lock acquisition
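A software model of the lock path may help make the O(1) claim concrete. The sketch below assumes a single lock and at most one outstanding request per core; lock_buffer_t, lb_acquire, lb_release and the value of P are hypothetical names and parameters, not the hardware interface described in the paper.

```c
#include <assert.h>
#include <stdbool.h>

#define P        4      /* number of cores, assumed */
#define NO_OWNER (-1)

typedef struct {
    int slot[P];   /* core ids in arrival order */
    int head;      /* index of the current owner */
    int count;     /* owner plus waiters */
} lock_buffer_t;

static void lb_init(lock_buffer_t *lb) { lb->head = 0; lb->count = 0; }

static int lb_owner(const lock_buffer_t *lb) {
    return lb->count ? lb->slot[lb->head] : NO_OWNER;
}

/* acquire_lock: one insertion at the tail. Returns true if the lock is
 * granted immediately; otherwise the core waits to be notified. */
static bool lb_acquire(lock_buffer_t *lb, int core) {
    assert(lb->count < P);   /* each core has at most one pending request */
    lb->slot[(lb->head + lb->count) % P] = core;
    lb->count++;
    return lb->count == 1;
}

/* release_lock: advance the head. Returns the new owner to notify,
 * or NO_OWNER if nobody is waiting. */
static int lb_release(lock_buffer_t *lb) {
    assert(lb->count > 0);
    lb->head = (lb->head + 1) % P;
    lb->count--;
    return lb_owner(lb);
}
```

In this model a waiting core sends one request and receives one grant, instead of polling the lock variable across the interconnect, which is consistent with the constant number of transactions per acquisition stated above.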


Events
  An event is signaled when the variable equals event_value
  On event, the request is inserted in the SB
  On each write, the SB is searched and the event is signaled if the condition holds
  When an event is signaled, the number of network transactions is O(P); on event, the number of memory references is O(1)
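The event path can be modeled in the same spirit. This sketch assumes one buffer slot per core and assumes the controller compares the current value of the variable when a request arrives, so that the last core to reach a barrier is not left waiting; event_buffer_t, eb_wait, eb_on_write and P are hypothetical names, not taken from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define P 4   /* number of cores, assumed */

typedef struct { bool valid; uint32_t addr; uint32_t event_value; } eb_entry_t;
typedef struct { eb_entry_t entry[P]; } event_buffer_t;   /* one slot per core */

static void eb_init(event_buffer_t *eb) {
    for (int c = 0; c < P; c++) eb->entry[c].valid = false;
}

/* event(addr, event_value) issued by `core`. current_value is the value the
 * memory controller holds for addr. Returns true if the condition already
 * holds; otherwise performs one O(1) insertion and returns false. */
static bool eb_wait(event_buffer_t *eb, int core, uint32_t addr,
                    uint32_t event_value, uint32_t current_value) {
    if (current_value == event_value) return true;
    eb->entry[core] = (eb_entry_t){ true, addr, event_value };
    return false;
}

/* A write to addr observed by the memory controller: search the buffer and
 * return a bitmask of the cores to signal (up to P notifications). */
static unsigned eb_on_write(event_buffer_t *eb, uint32_t addr, uint32_t new_value) {
    unsigned signalled = 0;
    for (int c = 0; c < P; c++) {
        eb_entry_t *e = &eb->entry[c];
        if (e->valid && e->addr == addr && e->event_value == new_value) {
            signalled |= 1u << c;
            e->valid = false;
        }
    }
    return signalled;
}
```

For a barrier, each core writes B and then issues event(B, procs_num). Only the final write triggers notifications, one per waiting core, which matches the O(P) network transactions per signaled event.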


Target Architecture
Separate private and shared memory spaces
  On-chip shared memory
  Off-chip main memory (private memories and instructions)
Address space
  Non-cacheable shared pages (synchronization)
  Cacheable shared pages
  Private pages (cacheable)


Experimental Setup
GRAPES
  Cycle-based MPSoC simulation framework
  C++ and SystemC
  SimIt-ARM simulators (5-stage pipeline)
  System-level power model
  Available upon request (mailing list at http://savane.elet.polimi.it/projects/grapes)

Benchmarks
  FFT, LU-1, LU-2, Radix from SPLASH-2
  Matrix, Norm

PARMACS parallel programming model


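Traffic Profile
Aggregate network bandwidth, Norm benchmark
  w/o SB: synchronization traffic is dominant
  w/ SB: the bandwidth peaks caused by barriers are reduced
[Figure: aggregate network bandwidth for the Norm benchmark, w/o SB and w/ SB]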


Performance Evaluation
Configurations compared
  With and without caches for shared pages
  With and without the SB
  With caching of synchronization pages and a directory-based coherence protocol
The SB improves performance on synchronization-intensive applications


Energy Evaluation
Effects on
  Memory: fewer accesses
  Interconnect: lower bandwidth
  Processors: shorter execution time


Application Scaling
Number of cores scaled to 4, 8 and 16
The SB enables good scaling, while the other configurations do not
[Figure: scaling of FFT and LU-2, w/o SB and w/ SB]

Concluding Remarks
We presented a hardware synchronization mechanism based on managing locks and events locally
  Integrated in an MPSoC
  No coherence protocol overhead
  Good performance and energy efficiency

General approach for embedded systems
  An ad hoc synchronization engine is often a must for embedded systems


Thanks www.elet.polimi.it/upload/monchier [email protected]


Related Work
Few works on programming models for MPSoCs
  Paulin et al. [CODES’04]
  Loghi & Poncino [DATE’05]

Optimization of busy-wait synchronization for multiprocessors
  Mellor-Crummey & Scott [ToCS’91]

Queuing locks
  Software: MCS lock
  Hardware: Anderson [ToPDS’90], QOLB (Kagi et al. [CAN’97])

All these solutions rely on cache locality

Transactional systems
  Rajwar & Goodman [IEEE MICRO’03]


Implementation Issues
The SB is implemented as a Lock Buffer (LB) plus an Event Buffer (EB)
  For P cores: P entries for the LB and P entries for the EB

The critical operation is the Associative Search (AS) phase
  On lock release => interconnect latency
  Gate-level synthesis, normalized with respect to a 512KB on-chip RAM
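The associative search on lock release can be pictured with a small model of a P-entry buffer that serves several locks at once. This is only a sketch under stated assumptions: one slot per core, an arrival counter used to pick the oldest waiter, and the names lock_table_t, lt_enqueue and lt_next_owner are all hypothetical. The sequential loop stands in for a lookup that hardware would perform over all entries in parallel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define P       4      /* number of cores, hence number of entries (assumed) */
#define NO_CORE (-1)

typedef struct { bool valid; uint32_t lock_addr; uint64_t arrival; } lt_entry_t;
typedef struct { lt_entry_t entry[P]; uint64_t clock; } lock_table_t;

static void lt_init(lock_table_t *t) {
    for (int c = 0; c < P; c++) t->entry[c].valid = false;
    t->clock = 0;
}

/* Record that `core` is waiting for the lock at lock_addr. */
static void lt_enqueue(lock_table_t *t, int core, uint32_t lock_addr) {
    t->entry[core] = (lt_entry_t){ true, lock_addr, t->clock++ };
}

/* Associative search on release: find the oldest waiter on lock_addr,
 * remove its entry and return its core id, or NO_CORE if there is none. */
static int lt_next_owner(lock_table_t *t, uint32_t lock_addr) {
    int best = NO_CORE;
    for (int c = 0; c < P; c++) {
        const lt_entry_t *e = &t->entry[c];
        if (e->valid && e->lock_addr == lock_addr &&
            (best == NO_CORE || e->arrival < t->entry[best].arrival))
            best = c;
    }
    if (best != NO_CORE) t->entry[best].valid = false;
    return best;
}
```

Since a core can block on at most one lock at a time, P entries are enough for P cores, and the search cost is bounded by P regardless of how many locks the application uses.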

