The Computational Sprinting Game
Songchun Fan∗, Seyed Majid Zahedi∗, Benjamin C. Lee
{songchun.fan, seyedmajid.zahedi, benjamin.c.lee}@duke.edu
[∗ Co-First Authors]
Computational Sprinting
• Supply extra power to enhance performance for short durations
• Activate more cores, boost voltage/frequency
[Figure: Normalized Speedup, Normalized Power, and Average Temperature (°C), non-sprinting vs. sprinting, for benchmarks naive, decision, gradient, svm, linear, kmeans, als, correlation, pagerank, cc, triangle]
2 / 25
Sprinting Architecture
• Power for sprints supplied by shared rack
• Heat from sprints absorbed by thermal packages
Fig. www.fortlax.se and Raghavan, Arun, et al. "Computational sprinting on a hardware/software testbed."
3 / 25
Power Emergencies
• Sprints may trip breaker
• Current ↑ with sprinters
• Time ↑ with sprint duration
• Risk ↑ with current, time
[Figure: circuit breaker trip curve. x-axis: Current Normalized to Rated Current (1 to 20); y-axis: Duration of Current Draw (sec) (0.1 to 3600). Regions: Not Tripped (Ptrip=0), Tolerance Band (non-deterministic), Tripped (Ptrip=1). Tripping modes: Long-delay, Conventional, Short Circuit.]
Fig. Fu, Wang, and Lefurgy. "How much power oversubscription is safe and allowed in data centers."
4 / 25
Uninterruptible Power Supplies
• When sprints trip breaker, draw on batteries
• When sprints complete, recharge batteries
Fig. www.amper-ecuador.com
5 / 25
Example – Private Clouds
• Applications compute on servers that share power
• Processors sprint independently
• Processors sprint selfishly for performance
Fig. Google, www.lasknet.net
6 / 25
Sprinting Management
When should processors sprint?
• Phases with higher performance from sprints
• But sprints prohibited as chip cools
Which processors should sprint?
• Processors that benefit most from sprints
• But sprints prohibited as batteries recover
7 / 25
Management Desiderata
Individual Performance
• Sprints account for phase behavior
• Sprints now constrain future sprints
System Stability
• Sprints account for others’ sprinting strategies
• Sprints risk power emergencies
8 / 25
Sprinting Strategy
• Optimize sprints given constraints
• Sprint, wait ∆cooling for chip cooling
• Sprint, wait ∆recovery for rack recovery if breaker trips
[Figure: Utility from Sprint (y-axis, 5 to 8) versus Epoch (x-axis, 0 to 30), one point per epoch, with × marks on selected epochs]
9 / 25
Game Theory
Study strategic agents
• Agents selfishly maximize individual utility
Optimize responses
• Response maximizes utility, given others’ strategies
Find equilibrium
• State where all agents play their best responses
10 / 25
Sprinting Game
States
• Active – can sprint
• Cooling – cannot sprint, chip cooling
• Recovery – cannot sprint, batteries recharging
Actions
• Sprint or not, when active
Strategies
• Agent’s state, app’s phase, history, ...
• Others’ strategies, utilities, and states, ...
11 / 25
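The three states and the sprint action can be sketched as a small state machine. This is a minimal illustration, not the authors' implementation: the fixed waiting times `cooling_epochs` and `recovery_epochs` (standing in for ∆cooling and ∆recovery), the rule that a breaker trip sends every agent to recovery, and all class and method names are assumptions made for the example.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"      # can sprint
    COOLING = "cooling"    # cannot sprint, chip cooling
    RECOVERY = "recovery"  # cannot sprint, batteries recharging


class Agent:
    """Hypothetical per-processor agent with fixed waiting times (assumed)."""

    def __init__(self, cooling_epochs=3, recovery_epochs=10):
        self.state = State.ACTIVE
        self.wait = 0
        self.cooling_epochs = cooling_epochs
        self.recovery_epochs = recovery_epochs

    def step(self, sprint, tripped):
        """Advance one epoch and return True if the agent sprinted.

        sprint:  the agent's chosen action (only honored when active)
        tripped: whether the shared breaker tripped during this epoch
        """
        sprinted = self.state is State.ACTIVE and sprint
        if tripped:
            # Assumed: a trip forces the agent into rack-level recovery.
            self.state, self.wait = State.RECOVERY, self.recovery_epochs
        elif sprinted:
            self.state, self.wait = State.COOLING, self.cooling_epochs
        elif self.state is not State.ACTIVE:
            self.wait -= 1
            if self.wait == 0:
                self.state = State.ACTIVE
        return sprinted
```

A strategy then only has to decide the value of `sprint` in each epoch in which the agent is active; the state machine enforces the cooling and recovery constraints.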
Mean Field Equilibrium (MFE)
Challenges
• Large system with many agents
• Complex strategies and many competitors
• Intractable optimization for best response
Solution
• Abstract many agents with statistical distributions
• Optimize agents’ strategies against expectations
12 / 25
Equilibrium Strategy
Agents maximize expected value of (not) sprinting
• Current state
• Utility from sprinting, u
• Probability of tripping, Ptrip
Agents employ threshold strategy
• If active and u ≥ uT, then sprint
13 / 25
Find Equilibrium – Offline
• Initialize probability of breaker trip Ptrip
• Given Ptrip, optimize threshold strategy uT
• Given uT, estimate number of sprinters N
• Given N, update probability P′trip
• Iterate if P′trip ≠ Ptrip
14 / 25
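The offline loop above is a fixed-point iteration on Ptrip, and a rough sketch may make the data flow concrete. Everything beyond the loop structure is an assumption made for illustration and is not stated on the slides: the linear trip model inside a tolerance band (`n_min`, `n_max`), the per-epoch probabilities of remaining in cooling (`p_c`) and in recovery (`p_r`), the discount factor `delta`, the value-iteration best response, and the damped update.

```python
from statistics import mean


def trip_probability(n, n_min, n_max):
    """Assumed breaker model: 0 below n_min sprinters, 1 above n_max,
    linear inside the tolerance band."""
    if n <= n_min:
        return 0.0
    if n >= n_max:
        return 1.0
    return (n - n_min) / (n_max - n_min)


def best_threshold(p_trip, utils, delta=0.9, p_c=0.5, p_r=0.9, iters=300):
    """Value iteration over the active/cooling/recovery values; returns u_T."""
    v_a = v_c = v_r = 0.0
    for _ in range(iters):
        after_sprint = delta * ((1 - p_trip) * v_c + p_trip * v_r)
        after_wait = delta * ((1 - p_trip) * v_a + p_trip * v_r)
        new_a = mean(max(u + after_sprint, after_wait) for u in utils)
        new_c = delta * ((1 - p_trip) * (p_c * v_c + (1 - p_c) * v_a)
                         + p_trip * v_r)
        new_r = delta * (p_r * v_r + (1 - p_r) * v_a)
        v_a, v_c, v_r = new_a, new_c, new_r
    # Sprinting beats waiting exactly when u >= after_wait - after_sprint.
    return delta * (1 - p_trip) * (v_a - v_c)


def expected_sprinters(u_t, p_trip, utils, n_agents,
                       p_c=0.5, p_r=0.9, iters=500):
    """Mean-field estimate of N: agents that are active and above threshold."""
    p_s = mean(1.0 if u >= u_t else 0.0 for u in utils)
    a, c, r = 1.0, 0.0, 0.0  # shares of agents per state
    ok = 1 - p_trip
    for _ in range(iters):   # power iteration toward the stationary shares
        a, c, r = (
            ok * (a * (1 - p_s) + c * (1 - p_c)) + r * (1 - p_r),
            ok * (a * p_s + c * p_c),
            p_trip * (a + c) + r * p_r,
        )
    return n_agents * a * p_s


def find_equilibrium(utils, n_agents, n_min, n_max,
                     damping=0.5, tol=1e-4, max_rounds=200):
    p_trip = 0.5                                       # initialize Ptrip
    for _ in range(max_rounds):
        u_t = best_threshold(p_trip, utils)            # optimize u_T
        n = expected_sprinters(u_t, p_trip, utils, n_agents)  # estimate N
        p_new = trip_probability(n, n_min, n_max)      # update P'trip
        if abs(p_new - p_trip) < tol:                  # stop when P'trip ≈ Ptrip
            break
        p_trip += damping * (p_new - p_trip)           # damped step (assumed)
    return p_trip, u_t
```

For example, `find_equilibrium([1, 2, 3, 4, 5, 6, 7, 8], n_agents=1000, n_min=250, n_max=750)` returns a trip probability and a threshold for a toy utility profile. With a discrete utility sample the iteration is not guaranteed to converge, so `max_rounds` bounds the loop.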
Execute Strategy – Online
If active and u ≥ uT, then sprint
15 / 25
Sprinting Thresholds
[Figure: two panels, Linear Regression and PageRank; x-axis: Utility from Sprint; y-axis: Density]
• Thresholds are optimal and diverse
• Agents behave strategically to maximize performance
16 / 25
Management Architecture
[Figure: a Coordinator (Alg 1) exchanges Profiles and Strategies with per-user Agents; each User has an Agent, a Predictor, and an Executor Engine running a Task]
• Offline: coordinator profiles utility, optimizes thresholds
• Online: predictors estimate sprint utility
• Online: agents apply threshold strategy
• Online: executor adapts computation
17 / 25
Experimental Methodology
Sprinting
• 3 cores @ 1.2GHz → 12 cores @ 2.7GHz
Workloads
• Apache Spark
• Spark engine dynamically schedules tasks on active cores
Performance Metric
• Tasks completed per second (TPS)
Simulation Method
• R-based simulator using traces of Spark computation
18 / 25
Management Policies
Greedy
• Sprint if neither cooling nor recovering
Exponential Back-off
• Sprint if neither cooling nor recovering
• Wait randomly for U[0, 2^k] epochs after k-th trip
Cooperative Threshold
• Enforce globally optimized threshold
Equilibrium Threshold
• Announce decentralized, strategic threshold
19 / 25
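The policies above differ only in their decision rule, which the following sketch makes explicit. The function names, the string state labels, and the choice to draw an integer number of epochs are assumptions made for illustration. The cooperative and equilibrium policies are shown as sharing one threshold rule and differing only in how `u_t` is obtained, which is how the slide describes them.

```python
import random


def greedy_should_sprint(state):
    """Greedy: sprint whenever neither cooling nor recovering."""
    return state == "active"


def backoff_wait_epochs(k, rng=random):
    """Exponential back-off: epochs to wait after the k-th trip.

    Drawn uniformly from the integers 0..2**k (inclusive), which is an
    assumed discretization of U[0, 2^k].
    """
    return rng.randint(0, 2 ** k)


def threshold_should_sprint(state, u, u_t):
    """Cooperative or equilibrium threshold: sprint if active and u >= u_t.

    The two policies would differ only in where u_t comes from: a globally
    optimized value that is enforced, or a strategic value that is announced.
    """
    return state == "active" and u >= u_t
```

For instance, after the third trip `backoff_wait_epochs(3)` returns a wait between 0 and 8 epochs, while a threshold agent with `u_t = 6` sprints at `u = 7` but not at `u = 5`.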
Case for Equilibria
              Equilibrium   Cooperative
Performance        +             +
Stability          +             –
• Cooperative (+): maximize global performance
• Equilibrium (+): remove incentives to deviate
• Cooperative (–): enforce strategies globally
• Equilibrium (+): maximize individual performance
20 / 25
Sprinting Behavior
[Figure: Number of Sprinting Users (0 to 600) versus Epoch Index (0 to 1000), one panel each for Greedy, Exponential Backoff, Equilibrium Threshold, and Cooperative Threshold]
21 / 25
Sprinting Performance
[Figure: Performance (Normalized to Greedy) for Greedy, Exponential Backoff, Equilibrium Threshold, and Cooperative Threshold across benchmarks cc, triangle, als, correlation, pagerank, naive, decision, gradient, svm, linear, kmeans]
• Greedy – aggressive, incurs emergencies
• Exponential – conservative, untimely sprints
• Equilibrium – strategic, produces equilibrium
• Cooperative – optimal, requires enforcement
22 / 25
Game States
[Figure: share of time (0% to 100%) in each state (Active (not sprinting), Local cooling, Global recovery, Sprinting) under Greedy, Exponential, Equilibrium, and Cooperative]
• Greedy – time in recovery
• Exponential – untimely sprints
• Equilibrium – timely sprints
• Cooperative – timely sprints
23 / 25
Conclusion
Management with game theory
• Agents sprint according to threshold – inexpensive
• Agents have no incentives to deviate – stable
• Agents optimize response – high performance
Future directions
• Use game theory to manage scarce resources
• E.g., big/small processors, accelerators
24 / 25
Thank you
Questions?
25 / 25