THREAD PROGRESS EQUALIZATION: DYNAMICALLY ADAPTIVE POWER AND PERFORMANCE OPTIMIZATION OF MULTI-THREADED APPLICATIONS
Yatish Turakhia (1), Guangshuo Liu (2), Siddharth Garg (3), Diana Marculescu (2)
(1) Stanford University, (2) Carnegie Mellon University, (3) New York University

Motivation
Fine-grained micro-architectural adaptation of cores can emulate heterogeneous processors and provides a compelling alternative to heterogeneous multi-core processing in the “dark silicon” era.

[Figure: adaptive core configurations, assuming 50% dark silicon. Labels: “2-wide” (66% dark silicon), “4-wide” (33% dark silicon).]

Determining the optimal configuration for each core within a fixed power budget, however, is challenging for three key reasons:
1. The solution must be scalable to tens to hundreds of cores.
2. The power/performance relationship between core configurations is particularly complex for microarchitectural adaptation.
3. There is no clear performance metric to optimize for in multi-threaded applications.

Key Observations:
1. Threads synchronizing on a barrier must arrive at the barrier at the same time to best utilize the power budget.
2. Differences in arrival times of threads can be explained by (i) IPC heterogeneity and (ii) Instruction Count (IC) heterogeneity.
3. The number of instructions that each thread executes between barriers, relative to other threads, remains roughly the same.

[Figure: thread timelines between barriers (labels: water.sp, barrier 2, barrier 3, stalled threads, critical thread). Panels: "IPC Heterogeneity" (same # of instructions) and "IPC + Instruction Count Heterogeneity" (different # of instructions).]

Current Approaches

MaxBIPS
• Maximizes sum-IPS across all threads
• No notion of thread criticality
• Good objective for thread-pool/map-reduce; poor for barrier workloads requiring load balancing
• Optimization NP-hard; not scalable

Criticality Stacks
• Threads stalling least have high criticality values
• Correctly identifies critical threads but cannot determine how much to accelerate each thread
• Works well for Big-Little configurations; poor for multiple micro-arch configurations
• Simple optimization; scalable

[Figure (FFT): annotations "Does not equalize IPC" and "Overaccelerates critical thread".]
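The mismatch between a sum-IPS objective and barrier workloads can be illustrated with a small sketch. This is not the MaxBIPS or Criticality Stacks algorithm; it is a brute-force toy under stated assumptions: a thread's barrier arrival time is modeled as instruction count divided by IPC, all cores run at the same clock so IPS is proportional to IPC, and every name and number below (`IPC`, `POWER`, `BUDGET`, `INSTR`) is made up for illustration.

```python
from itertools import product

# Hypothetical per-thread IPC for configuration 0 (low) and 1 (high).
IPC = [[1.0, 2.0],    # thread 0 gains a lot from the high configuration
       [0.5, 0.8]]    # thread 1 is slow and gains little
POWER = [1, 2]        # assumed power cost of configuration 0 and 1
BUDGET = 3            # only one thread can use the high configuration
INSTR = [1000, 1000]  # same instruction count between barriers


def arrival_times(alloc):
    """Cycles each thread needs to reach the barrier: IC / IPC."""
    return [INSTR[i] / IPC[i][c] for i, c in enumerate(alloc)]


def feasible_allocations():
    """All per-thread configuration choices that fit the power budget."""
    for alloc in product(range(len(POWER)), repeat=len(IPC)):
        if sum(POWER[c] for c in alloc) <= BUDGET:
            yield alloc


def best_sum_ipc():
    """Allocation maximizing summed IPC (a stand-in for a sum-IPS objective)."""
    return max(feasible_allocations(),
               key=lambda a: sum(IPC[i][c] for i, c in enumerate(a)))


def best_barrier_time():
    """Allocation minimizing the arrival time of the slowest (critical) thread."""
    return min(feasible_allocations(), key=lambda a: max(arrival_times(a)))


if __name__ == "__main__":
    for name, alloc in [("sum-IPC", best_sum_ipc()),
                        ("barrier", best_barrier_time())]:
        print(name, alloc, max(arrival_times(alloc)))
```

In this toy case the sum-IPC objective upgrades thread 0, which leaves the barrier time at 2000 cycles, while the min-max objective upgrades the slower thread 1 and reaches the barrier in 1250 cycles.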
Goal: Optimal reconfiguration for multi-threaded workloads under power constraints
Our Approach

1. TPEq Optimization Procedure

TPEq Objective:

$$\min \; \max_i \left( x_{ij} \times \frac{W_i}{\mathrm{IPC}_{ij}} \right)$$

The inner term is the execution time (clock cycles) of thread i in configuration j; its maximum over all threads is the execution time of the critical thread.

TPEq optimization procedure:
    c_i ← 0, ∀ i ∈ [1, N]         (all threads set to lowest config.)
    while (P_current < P_max)     (while power remaining)
        q ← critical thread       (determine most critical thread)
        c_q ← c_q + 1             (set to next highest config.)

Key Observation: TPEq optimization problem is in P: O(MN log N)

2. TPEq Implementation
• CPI-stack based performance and power predictors
• History-based IC predictor; updated at every synchronization stall event
• TPEq implemented in OS and invoked on a timer interrupt
• Optimal configuration determined and control passed back to user code

Experimental Setup
Our evaluation was performed on the Sniper multi-core simulator (with McPAT support) for an x86 processor with 16 cores, each of which was dynamically adaptive with the following 5 configurations:

[Table "Config": the 5 core configurations; entries not preserved in this text.]

Power Budget: 80W

Results
Barrier synchronization based benchmarks:
• Best performing technique on barrier-based benchmarks
• 5% and 11% average improvement over Criticality Stacks (CS) on IC-homogeneous (HO) and IC-heterogeneous (HT) workloads

Remaining benchmarks:
• Within reasonable bound of best-performing technique on non-barrier workloads: thread pool (TP), pipeline parallel (PP), MapReduce (MR)
• Most generalizable of known techniques
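The TPEq optimization procedure (start every thread at the lowest configuration, then repeatedly step up the most critical thread while power remains) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that W_i is the number of instructions thread i must execute before the next barrier, that predicted IPC and power per thread and configuration are given as plain tables, and that the loop stops when the critical thread cannot be upgraded without exceeding the budget or is already at its highest configuration. The function name `tpeq_greedy` and its parameters are hypothetical. A heap keeps the critical-thread lookup logarithmic in the number of threads.

```python
import heapq


def tpeq_greedy(work, ipc, power, p_max):
    """Greedy configuration assignment in the spirit of the TPEq procedure.

    work[i]     : assumed instruction count W_i of thread i
    ipc[i][j]   : predicted IPC of thread i in configuration j
    power[i][j] : predicted power of thread i in configuration j
    p_max       : power budget
    Returns (config, p_current), where config[i] is the chosen index c_i.
    """
    n = len(work)
    config = [0] * n                                # c_i <- 0 for all threads
    p_current = sum(power[i][0] for i in range(n))
    # Max-heap on predicted execution time W_i / IPC_ij (negated for heapq).
    heap = [(-work[i] / ipc[i][0], i) for i in range(n)]
    heapq.heapify(heap)

    while heap:
        _, q = heap[0]                              # q <- critical thread
        nxt = config[q] + 1
        if nxt >= len(ipc[q]):
            break                                   # assumption: stop at top config
        extra = power[q][nxt] - power[q][config[q]]
        if p_current + extra > p_max:
            break                                   # assumption: never exceed budget
        config[q] = nxt                             # c_q <- c_q + 1
        p_current += extra
        heapq.heapreplace(heap, (-work[q] / ipc[q][nxt], q))
    return config, p_current


if __name__ == "__main__":
    # Made-up example: thread 1 has twice the instructions of the others.
    work = [1000, 2000, 1000]
    ipc = [[0.5, 1.0, 2.0]] * 3
    power = [[1, 2, 4]] * 3
    config, used = tpeq_greedy(work, ipc, power, p_max=8)
    print(config, used)
    print([work[i] / ipc[i][c] for i, c in enumerate(config)])
```

With these made-up inputs the thread with the larger instruction count ends up in the highest configuration and all three threads are predicted to reach the barrier after the same number of cycles, which is the equalization the procedure aims for.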