Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
The Computational Sprinting Game
Songchun Fan∗ , Seyed Majid Zahedi∗ , Benjamin C. Lee
{songchun.fan, seyedmajid.zahedi, benjamin.c.lee}@duke.edu
[∗ Co-First Authors]
Computational Sprinting
• Supply extra power to enhance performance for short durations
• Activate more cores, boost voltage/frequency
2 / 25
0
de nai
c ve
gr isio
ad n
ie
n
sv t
l m
kmine
ea ar
n
co
rre a s
pa lat ls
ge ion
ra
nk
tr i c
an c
gl
e
4
3
2
1
1.0
0.5
0.0
Average Temperature (°C)
1.5
0
de nai
c ve
gr isio
ad n
ie
n
sv t
l m
kmine
ea ar
n
co
rre a s
pa lat ls
ge ion
ra
nk
tri c
an c
gl
e
5
Normalized Power
6
de nai
c ve
gr isio
ad n
ie
n
sv t
l m
kmine
ea ar
n
co
rre a s
pa lat ls
ge ion
ra
nk
tr i c
an c
gl
e
Normalized Speedup
Computational Sprinting
• Supply extra power to enhance performance for short durations
• Activate more cores, boost voltage/frequency
Non−sprinting
Sprinting
50
40
30
20
10
2 / 25
Sprinting Architecture
• Power for sprints supplied by shared rack
• Heat from sprints absorbed by thermal packages
Fig. www.fortlax.se and Raghavan, Arun, et al. ”Computational sprinting on a hardware/software testbed.”
3 / 25
Power Emergencies
3600
Long-delay
Duration of Current Draw (sec)
Conventional
Tripping
Short Circuit
Non-deterministic
120
• Current ↑ with sprinters
To
le
2
Ptrip=0
ra
nc
e
Ptrip=1
Ba
nd
Tripped
0 .1
Not Tripped
1
• Sprints may trip breaker
• Time ↑ with sprint duration
• Risk ↑ with current, time
2 3
5
10
20
Current Normalized to Rated Current
Fig. Fu, Wang, and Lefurgy. ”How much power oversubscription is safe and allowed in data centers.”
4 / 25
Uninterruptible Power Supplies
• When sprints trip breaker,
draw on batteries
• When sprints complete,
recharge batteries
Fig. www.amper-ecuador.com
5 / 25
Example – Private Clouds
• Applications compute on servers that share power
• Processors sprint independently
• Processors sprint selfishly for performance
Fig. Google, www.lasknet.net
6 / 25
Sprinting Management
When should processors sprint?
• Phases with higher performance from sprints
• But sprints prohibited as chip cools
Which processors should sprint?
• Processors that benefit most from sprints
• But sprints prohibited as batteries recover
7 / 25
Management Desiderata
Individual Performance
• Sprints account for phase behavior
• Sprints now constrain future sprints
System Stability
• Sprints account for others’ sprinting strategies
• Sprints risk power emergencies
8 / 25
Sprinting Strategy
• Optimize sprints given constraints
• Sprint, wait ∆cooling for chip cooling
• Sprint, wait ∆recovery for rack recovery if breaker trips
8
●
●
Utility from Sprint
●
7
?
●
6
5
●
0
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
● ●
●
●
5
10
15
20
25
●
30
Epoch
9 / 25
Sprinting Strategy
• Optimize sprints given constraints
• Sprint, wait ∆cooling for chip cooling
• Sprint, wait ∆recovery for rack recovery if breaker trips
×
××
××
××
××
8
●
●
Utility from Sprint
●
7
●
●
● ●
6
●
●
5
●
0
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
● ●
●
●
5
10
15
20
25
●
30
Epoch
9 / 25
Game Theory
Study strategic agents
• Agents selfishly maximize individual utility
Optimize responses
• Response maximizes utility, given others’ strategies
Find equilibrium
• State where all agents play their best responses
10 / 25
Sprinting Game
States
• Active – can sprint
• Cooling – cannot sprint, chip cooling
• Recovery – cannot sprint, batteries recharging
Actions
• Sprint or not, when active
Strategies
• Agent’s state, app’s phase, history, ...
• Others’ strategies, utilities, and states, ...
11 / 25
Mean Field Equilibrium (MFE)
Challenges
• Large system with many agents
• Complex strategies and many competitors
• Intractable optimization for best response
Solution
• Abstract many agents with statistical distributions
• Optimize agents’ strategies against expectations
12 / 25
Equilibrium Strategy
Agents maximize expected value of (not) sprinting
• Current state
• Utility from sprinting, u
• Probability of tripping, Ptrip
Agents employ threshold strategy
• If active and u ≥ uT , then sprint
13 / 25
Find Equilibrium – Offline
• Initialize probability of breaker trip Ptrip
• Given Ptrip , optimize threshold strategy uT
• Given uT , estimate number of sprinters N
0
• Given N, update probability Ptrip
0
• Iterate if Ptrip
6= Ptrip
14 / 25
Execute Strategy – Online
If active and u ≥ uT , then sprint
15 / 25
Linear Regression
Density
0.10 0.20
PageRank
2
3
4
5
Utility from Sprint
6
0.00
Density
0.0 0.1 0.2 0.3 0.4
Sprinting Thresholds
0
5
10
Utility from Sprint
15
• Thresholds are optimal and diverse
• Agents behave strategically to maximize performance
16 / 25
Management Architecture
Coordinator
Alg 1
file
Pro
gy
ate
Str
User
User
Agent
Predictor
Executor Engine
Task
User
Agent
Agent
Predictor
Executor Engine
. . .
Task
Predictor
Executor Engine
Task
• Offline: coordinator profiles utility, optimizes thresholds
• Online: predictors estimate sprint utility
• Online: agents apply threshold strategy
• Online: executor adapts computation
17 / 25
Experimental Methodology
Sprinting
• 3 cores @1.2GHz → 12 cores @ 2.7GHz
Workloads
• Apache Spark
• Spark engine dynamically schedules tasks on active cores
Performance Metric
• Tasks completed per second (TPS)
Simulation Method
• R-based simulator using traces of Spark computation
18 / 25
Management Policies
Greedy
• Sprint if neither cooling nor recovering
Exponential Back-off
• Sprint if neither cooling nor recovering
• Wait randomly for U[0, 2k ] epochs after k th trip
Cooperative Threshold
• Enforce globally optimized threshold
Equilibrium Threshold
• Announce decentralized, strategic threshold
19 / 25
Case for Equilibria
+
Equilibrium Cooperative
Performance
Stability
+
• Cooperative (+): maximize global performance
• Equilibrium (+): remove incentives to deviate
20 / 25
Case for Equilibria
+
+ -
Equilibrium Cooperative
Performance
Stability
• Cooperative (+): maximize global performance
• Equilibrium (+): remove incentives to deviate
• Cooperative (–): enforce strategies globally
20 / 25
Case for Equilibria
+ +
+ -
Equilibrium Cooperative
Performance
Stability
• Cooperative (+): maximize global performance
• Equilibrium (+): remove incentives to deviate
• Cooperative (–): enforce strategies globally
• Equilibrium (+): maximize individual performance
20 / 25
300 600 0
300 600
300 600 0
Exponential Backoff
Cooperative Threshold
300 600 0
Greedy
Equilibirum Threshold
0
Number of Sprinting Users
Sprinting Behavior
0
200
400
Epoch Index
600
800
1000
21 / 25
Greedy
Exponential Backoff
Equilibrium Threshold
Cooperative Threshold
cc
tri
an
gle
co als
rre
lat
io
pa n
ge
ra
nk
na
iv
de e
cis
ion
gr
ad
ien
t
sv
m
lin
ea
km r
ea
ns
0
1
2
3
4
5
6
Performance (Normalized to Greedy)
Sprinting Performance
• Greedy – aggressive, incurs emergencies
• Exponential – conservative, untimely sprints
• Equilibrium – strategic, produces equilibrium
• Cooperative – optimal, requires enforcement
22 / 25
Game States
100%
Active (not sprinting)
Local cooling
Global recovery
Sprinting
75%
50%
25%
0%
Greedy Exponential Equilibrium Cooperative
• Greedy – time in recovery
• Exponential – untimely sprints
• Equilibrium – timely sprints
• Cooperative – timely sprints
23 / 25
Conclusion
Management with game theory
• Agents sprint according to threshold – inexpensive
• Agents have no incentives to deviate – stable
• Agents optimize response – high performance
Future directions
• Use game theory to manage scarce resources
• E.g., big/small processors, accelerators
24 / 25
Thank you
Questions?
25 / 25