Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
The Computational Sprinting Game Songchun Fan∗ , Seyed Majid Zahedi∗ , Benjamin C. Lee {songchun.fan, seyedmajid.zahedi, benjamin.c.lee}@duke.edu [∗ Co-First Authors] Computational Sprinting • Supply extra power to enhance performance for short durations • Activate more cores, boost voltage/frequency 2 / 25 0 de nai c ve gr isio ad n ie n sv t l m kmine ea ar n co rre a s pa lat ls ge ion ra nk tr i c an c gl e 4 3 2 1 1.0 0.5 0.0 Average Temperature (°C) 1.5 0 de nai c ve gr isio ad n ie n sv t l m kmine ea ar n co rre a s pa lat ls ge ion ra nk tri c an c gl e 5 Normalized Power 6 de nai c ve gr isio ad n ie n sv t l m kmine ea ar n co rre a s pa lat ls ge ion ra nk tr i c an c gl e Normalized Speedup Computational Sprinting • Supply extra power to enhance performance for short durations • Activate more cores, boost voltage/frequency Non−sprinting Sprinting 50 40 30 20 10 2 / 25 Sprinting Architecture • Power for sprints supplied by shared rack • Heat from sprints absorbed by thermal packages Fig. www.fortlax.se and Raghavan, Arun, et al. ”Computational sprinting on a hardware/software testbed.” 3 / 25 Power Emergencies 3600 Long-delay Duration of Current Draw (sec) Conventional Tripping Short Circuit Non-deterministic 120 • Current ↑ with sprinters To le 2 Ptrip=0 ra nc e Ptrip=1 Ba nd Tripped 0 .1 Not Tripped 1 • Sprints may trip breaker • Time ↑ with sprint duration • Risk ↑ with current, time 2 3 5 10 20 Current Normalized to Rated Current Fig. Fu, Wang, and Lefurgy. ”How much power oversubscription is safe and allowed in data centers.” 4 / 25 Uninterruptible Power Supplies • When sprints trip breaker, draw on batteries • When sprints complete, recharge batteries Fig. www.amper-ecuador.com 5 / 25 Example – Private Clouds • Applications compute on servers that share power • Processors sprint independently • Processors sprint selfishly for performance Fig. Google, www.lasknet.net 6 / 25 Sprinting Management When should processors sprint? • Phases with higher performance from sprints • But sprints prohibited as chip cools Which processors should sprint? • Processors that benefit most from sprints • But sprints prohibited as batteries recover 7 / 25 Management Desiderata Individual Performance • Sprints account for phase behavior • Sprints now constrain future sprints System Stability • Sprints account for others’ sprinting strategies • Sprints risk power emergencies 8 / 25 Sprinting Strategy • Optimize sprints given constraints • Sprint, wait ∆cooling for chip cooling • Sprint, wait ∆recovery for rack recovery if breaker trips 8 ● ● Utility from Sprint ● 7 ? ● 6 5 ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5 10 15 20 25 ● 30 Epoch 9 / 25 Sprinting Strategy • Optimize sprints given constraints • Sprint, wait ∆cooling for chip cooling • Sprint, wait ∆recovery for rack recovery if breaker trips × ×× ×× ×× ×× 8 ● ● Utility from Sprint ● 7 ● ● ● ● 6 ● ● 5 ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5 10 15 20 25 ● 30 Epoch 9 / 25 Game Theory Study strategic agents • Agents selfishly maximize individual utility Optimize responses • Response maximizes utility, given others’ strategies Find equilibrium • State where all agents play their best responses 10 / 25 Sprinting Game States • Active – can sprint • Cooling – cannot sprint, chip cooling • Recovery – cannot sprint, batteries recharging Actions • Sprint or not, when active Strategies • Agent’s state, app’s phase, history, ... • Others’ strategies, utilities, and states, ... 11 / 25 Mean Field Equilibrium (MFE) Challenges • Large system with many agents • Complex strategies and many competitors • Intractable optimization for best response Solution • Abstract many agents with statistical distributions • Optimize agents’ strategies against expectations 12 / 25 Equilibrium Strategy Agents maximize expected value of (not) sprinting • Current state • Utility from sprinting, u • Probability of tripping, Ptrip Agents employ threshold strategy • If active and u ≥ uT , then sprint 13 / 25 Find Equilibrium – Offline • Initialize probability of breaker trip Ptrip • Given Ptrip , optimize threshold strategy uT • Given uT , estimate number of sprinters N 0 • Given N, update probability Ptrip 0 • Iterate if Ptrip 6= Ptrip 14 / 25 Execute Strategy – Online If active and u ≥ uT , then sprint 15 / 25 Linear Regression Density 0.10 0.20 PageRank 2 3 4 5 Utility from Sprint 6 0.00 Density 0.0 0.1 0.2 0.3 0.4 Sprinting Thresholds 0 5 10 Utility from Sprint 15 • Thresholds are optimal and diverse • Agents behave strategically to maximize performance 16 / 25 Management Architecture Coordinator Alg 1 file Pro gy ate Str User User Agent Predictor Executor Engine Task User Agent Agent Predictor Executor Engine . . . Task Predictor Executor Engine Task • Offline: coordinator profiles utility, optimizes thresholds • Online: predictors estimate sprint utility • Online: agents apply threshold strategy • Online: executor adapts computation 17 / 25 Experimental Methodology Sprinting • 3 cores @1.2GHz → 12 cores @ 2.7GHz Workloads • Apache Spark • Spark engine dynamically schedules tasks on active cores Performance Metric • Tasks completed per second (TPS) Simulation Method • R-based simulator using traces of Spark computation 18 / 25 Management Policies Greedy • Sprint if neither cooling nor recovering Exponential Back-off • Sprint if neither cooling nor recovering • Wait randomly for U[0, 2k ] epochs after k th trip Cooperative Threshold • Enforce globally optimized threshold Equilibrium Threshold • Announce decentralized, strategic threshold 19 / 25 Case for Equilibria + Equilibrium Cooperative Performance Stability + • Cooperative (+): maximize global performance • Equilibrium (+): remove incentives to deviate 20 / 25 Case for Equilibria + + - Equilibrium Cooperative Performance Stability • Cooperative (+): maximize global performance • Equilibrium (+): remove incentives to deviate • Cooperative (–): enforce strategies globally 20 / 25 Case for Equilibria + + + - Equilibrium Cooperative Performance Stability • Cooperative (+): maximize global performance • Equilibrium (+): remove incentives to deviate • Cooperative (–): enforce strategies globally • Equilibrium (+): maximize individual performance 20 / 25 300 600 0 300 600 300 600 0 Exponential Backoff Cooperative Threshold 300 600 0 Greedy Equilibirum Threshold 0 Number of Sprinting Users Sprinting Behavior 0 200 400 Epoch Index 600 800 1000 21 / 25 Greedy Exponential Backoff Equilibrium Threshold Cooperative Threshold cc tri an gle co als rre lat io pa n ge ra nk na iv de e cis ion gr ad ien t sv m lin ea km r ea ns 0 1 2 3 4 5 6 Performance (Normalized to Greedy) Sprinting Performance • Greedy – aggressive, incurs emergencies • Exponential – conservative, untimely sprints • Equilibrium – strategic, produces equilibrium • Cooperative – optimal, requires enforcement 22 / 25 Game States 100% Active (not sprinting) Local cooling Global recovery Sprinting 75% 50% 25% 0% Greedy Exponential Equilibrium Cooperative • Greedy – time in recovery • Exponential – untimely sprints • Equilibrium – timely sprints • Cooperative – timely sprints 23 / 25 Conclusion Management with game theory • Agents sprint according to threshold – inexpensive • Agents have no incentives to deviate – stable • Agents optimize response – high performance Future directions • Use game theory to manage scarce resources • E.g., big/small processors, accelerators 24 / 25 Thank you Questions? 25 / 25