THREAD PROGRESS EQUALIZATION: DYNAMICALLY ADAPTIVE POWER AND PERFORMANCE OPTIMIZATION OF MULTI-THREADED APPLICATIONS

Yatish Turakhia (1), Guangshuo Liu (2), Siddharth Garg (3), Diana Marculescu (2)
(1) Stanford University, (2) Carnegie Mellon University, (3) New York University

Motivation

Fine-grained micro-architectural adaptation of cores can emulate heterogeneous processors and provides a compelling alternative to heterogeneous multi-core processing in the “dark silicon” era.

[Figure: assuming 50% dark silicon; labels: “2-wide” 66% dark silicon, “4-wide” 33% dark silicon]

Determining the optimal configuration for each core within a fixed power budget, however, is challenging for three key reasons:
1. The solution must be scalable to tens to hundreds of cores
2. The power/performance relationship between core configurations is particularly complex for microarchitectural adaptation
3. There is no clear performance metric to optimize for in multi-threaded applications

Key Observations:
1. Threads synchronizing on a barrier must arrive at the barrier at the same time to best utilize the power budget
2. Differences in arrival times of threads can be explained by: (i) IPC heterogeneity and (ii) Instruction Count (IC) heterogeneity
3. The number of instructions that each thread executes between barriers, relative to other threads, remains roughly the same

[Figure: thread timelines between barrier 2 and barrier 3, showing stalled threads and the critical thread. Panels: “IPC Heterogeneity” (same # of instructions) and “IPC + Instruction Count Heterogeneity” (different # of instructions). Benchmark labels: FFT, water.sp]

Current Approaches

MaxBIPS
• Maximizes sum-IPS across all threads
• No notion of thread criticality
• Good objective for thread-pool/map-reduce; poor for barrier workloads requiring load balancing
• Optimization NP-hard; not scalable

Criticality Stacks (CS)
• Threads stalling least have high criticality values
• Correctly identifies critical threads but cannot determine how much to accelerate each thread
• Works well for Big-Little configurations; poor for multiple micro-arch configurations
• Simple optimization; scalable

[Figure annotations: “Does not equalize IPC”; “Overaccelerates critical thread”]

Our Approach

Goal: Optimal reconfiguration for multi-threaded workloads under power constraints

TPEq Objective:

$$\min \max \left( x_{ij} \times W_i / \mathrm{IPC}_{ij} \right)$$

Here $x_{ij} \times W_i / \mathrm{IPC}_{ij}$ is the execution time (clock cycles) of thread i in configuration j, and the max is the execution time of the critical thread.

1. TPEq Optimization Procedure

All threads are set to the lowest configuration. While power remains, determine the most critical thread and set it to the next highest configuration.

TPEq optimization procedure:
    c_i ← 0; ∀i ∈ [1, N]
    while (P_current < P_max)
        q ← critical thread
        c_q ← c_q + 1

Key Observation: The TPEq optimization problem is in P: O(MN log N)

2. TPEq Implementation
• CPI-stack based performance and power predictors
• History-based IC predictor; updated at every synchronization stall event
• TPEq implemented in OS and invoked on a timer interrupt
• Optimal configuration determined and control passed back to user code

Experimental Setup

Our evaluation was performed on the Sniper multi-core simulator (with McPAT support) for an x86 processor with 16 cores, each of which was dynamically adaptive with the following 5 configurations:

Config         1    2    3    4    5
Disp. width    1    2    2    4    4
ROB Size       16   32   64   64   128
Integer ALUs   1    3    3    6    6

L1-I and L1-D: 128 KB, 8-way
L2: private 256 KB, 8-way
L3: 8 MB shared per 4 cores, 16-way
Freq: 3.5 GHz, Voltage: 1.0 V, 22 nm node
Power Budget: 80 W

Results

Barrier synchronization based benchmarks:
• Best performing technique on barrier-based benchmarks
• 5% and 11% average improvement over CS on IC-homogeneous (HO) and IC-heterogeneous (HT) workloads

Remaining benchmarks:
• Within reasonable bound of best-performing technique on non-barrier workloads: Thread pool (TP), Pipeline parallel (PP), Map-reduce (MR)
• Most generalizable of known techniques
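
The greedy TPEq optimization procedure (start every thread at the lowest configuration, then repeatedly move the most critical thread to the next configuration while power remains) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `tpeq_assign` and the inputs `work` (predicted instruction counts), `ipc` and `power` (per-thread, per-configuration predictions) and `p_max` are hypothetical, the two early-exit conditions are assumptions the poster does not spell out, and the heap is one way to find the critical thread that is consistent with the stated O(MN log N) bound.

```python
import heapq

def tpeq_assign(work, ipc, power, p_max):
    """Greedy sketch: repeatedly upgrade the predicted critical thread.

    work[i]     -- predicted instructions of thread i until the next barrier
    ipc[i][j]   -- predicted IPC of thread i in configuration j
    power[i][j] -- predicted power of thread i in configuration j
    p_max       -- chip power budget
    """
    n = len(work)
    config = [0] * n  # every thread starts at the lowest configuration
    p_current = sum(power[i][0] for i in range(n))
    # Max-heap (negated keys) of predicted cycles to the next barrier.
    heap = [(-work[i] / ipc[i][0], i) for i in range(n)]
    heapq.heapify(heap)
    while heap and p_current < p_max:
        _, q = heapq.heappop(heap)  # q is the critical (slowest) thread
        nxt = config[q] + 1
        if nxt >= len(ipc[q]):
            break  # assumption: stop once the critical thread is maxed out
        extra = power[q][nxt] - power[q][config[q]]
        if p_current + extra > p_max:
            break  # assumption: stop if the upgrade would exceed the budget
        config[q] = nxt
        p_current += extra
        heapq.heappush(heap, (-work[q] / ipc[q][nxt], q))
    return config, p_current

# Toy example with three threads and three configurations (made-up numbers).
ipc = [[1.0, 2.0, 4.0] for _ in range(3)]
power = [[1.0, 2.0, 4.0] for _ in range(3)]
config, used = tpeq_assign([100, 400, 200], ipc, power, p_max=7.5)
print(config, used)  # [0, 2, 1] 7.0
```

In this toy run all three threads end up with a predicted 100 cycles to the barrier, which is the equalization the objective aims for. Each thread is upgraded at most M - 1 times and each heap operation costs O(log N), which matches the complexity quoted above.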
