Power Management in Large Server Clusters

High-Performance Power-Aware Computing
Vincent W. Freeh
Computer Science
NCSU
[email protected]
1
Acknowledgements
 NCSU
 Tyler K. Bletsch
 Mark E. Femal
 Nandini Kappiah
 Feng Pan
 Daniel M. Smith
 U of Georgia
 Robert Springer
 Barry Rountree
 Prof. David K. Lowenthal
2
The case for power management
 Eric Schmidt, Google CEO:
“it’s not speed but power—low power, because data centers can consume as much electricity as a small city.”
 Power/energy consumption becoming key issue
 Power limitations
 Energy = Heat; Heat dissipation is costly
 Non-trivial amount of money
 Consequence
 Excessive power consumption limits performance
 Fewer nodes can operate concurrently
 Goal
 Increase power/energy efficiency
 More performance per unit power/energy
3
 power ∝ frequency × voltage²
 How: CPU scaling
 Reduce frequency & voltage
 Reduce power & performance
 Energy/power gears
 Frequency-voltage pair
 Power-performance setting
 Energy-time tradeoff
 Why CPU scaling?
 Large power consumer
 Mechanism exists
[Figure: power and application throughput versus frequency/voltage under CPU scaling]
4
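The relation above can be made concrete with a short calculation. A minimal C sketch, using the seven frequency-voltage gears listed on the methodology slide below and ignoring static power and the rest of the system, of the relative dynamic CPU power each gear implies:

/* Relative dynamic CPU power per gear, assuming power ~ frequency * voltage^2.
 * Gear values are the seven settings from the methodology slide; static power
 * and non-CPU components are ignored in this sketch. */
#include <stdio.h>

int main(void) {
    const double freq_mhz[] = {2000, 1800, 1600, 1400, 1200, 1000, 800};
    const double volt[]     = {1.5, 1.4, 1.35, 1.3, 1.2, 1.1, 1.0};
    const int ngears = 7;
    double base = freq_mhz[0] * volt[0] * volt[0];   /* gear 0 = full speed */
    for (int g = 0; g < ngears; g++) {
        double rel = (freq_mhz[g] * volt[g] * volt[g]) / base;
        printf("gear %d: %4.0f MHz @ %.2f V -> %.2f of full dynamic power\n",
               g, freq_mhz[g], volt[g], rel);
    }
    return 0;
}

At the lowest gear this gives roughly 0.18 of full dynamic power, which is why CPU scaling is attractive when the CPU is not the bottleneck.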
Is CPU scaling a win?
[Figure: node power over time at the full gear; Psystem = PCPU + Pother, with energies ECPU and Eother accumulated over runtime T]
5
Is CPU scaling a win?
[Figure: full gear versus reduced gear; lowering PCPU shrinks ECPU (benefit) but stretches the runtime from T to T+ΔT, so Eother grows (cost)]
6
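The two figures compare energies rather than powers: scaling wins only if the CPU energy saved outweighs the extra energy the rest of the system burns during the added runtime. A minimal C sketch of that comparison; all power and time values are illustrative placeholders, not measurements from the slides:

/* Is scaling a win?  Total energy at the full gear versus a reduced gear,
 * following the benefit/cost picture above.  Numbers are illustrative. */
#include <stdio.h>

int main(void) {
    double p_cpu_full = 60.0, p_cpu_reduced = 35.0;  /* watts (assumed)      */
    double p_other    = 120.0;                       /* rest of the node (W) */
    double t = 100.0, dt = 5.0;                      /* seconds (assumed)    */

    double e_full    = (p_cpu_full    + p_other) * t;
    double e_reduced = (p_cpu_reduced + p_other) * (t + dt);

    /* Benefit: CPU energy saved.  Cost: extra energy over the added dt. */
    printf("E_full = %.0f J, E_reduced = %.0f J -> %s\n",
           e_full, e_reduced,
           e_reduced < e_full ? "scaling wins" : "scaling loses");
    return 0;
}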
Our work
 Exploit bottlenecks
 Application waiting on bottleneck resource
 Reduce power consumption (non-critical resource)
 Generally CPU not on critical path
 Bottlenecks we exploit
 Intra-node (memory)
 Inter-node (load imbalance)
 Contributions
 Impact studies [HPPAC ’05] [IPDPS ’05]
 Varying gears/nodes [PPoPP ’05] [PPoPP ’06 (submitted)]
 Leveraging load imbalance [SC ’05]
7
Methodology
 Cluster used: 10 nodes, AMD Athlon-64
 Processor supports 7 frequency-voltage settings (gears)
Frequency (MHz)  2000  1800  1600  1400  1200  1000  800
Voltage (V)       1.5   1.4  1.35   1.3   1.2   1.1  1.0
 Measure
 Wall clock time (gettimeofday system call)
 Energy (external power meter)
8
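A minimal C sketch of the wall-clock measurement named above, using gettimeofday(); the energy side comes from an external power meter in the actual setup and is not modeled here:

/* Wall-clock timing with gettimeofday(), as named on the methodology slide. */
#include <stdio.h>
#include <sys/time.h>

static double now_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void) {
    double start = now_seconds();
    /* ... run the benchmark region here ... */
    double elapsed = now_seconds() - start;
    printf("wall clock time: %.6f s\n", elapsed);
    return 0;
}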
NAS Parallel Benchmarks
9
CG – 1 node
[Plot: CG energy and time at 2000 MHz versus 800 MHz; +1% time, -17% energy]
Not CPU bound:
•Little time penalty
•Large energy savings
10
EP – 1 node
[Plot: EP energy and time at the reduced gear; +11% time, -3% energy]
CPU bound:
•Big time penalty
•No (little) energy savings
11
Operations per miss (OPM)
CG: 8.60
BT: 79.6
SP: 49.5
EP: 844
12
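A later slide notes that OPM is computed from retired micro-ops and L2 misses; a minimal C sketch of that ratio, with the counter deltas passed in as plain numbers since the slides do not say which hardware-counter interface is used:

/* Operations per miss (OPM) = retired micro-ops / L2 cache misses,
 * computed from counter deltas measured around a block of code. */
#include <stdio.h>

static double opm(unsigned long long uops, unsigned long long l2_misses) {
    return l2_misses ? (double)uops / (double)l2_misses : 0.0;
}

int main(void) {
    /* Illustrative deltas only; the measured OPM values above are
     * CG 8.60, BT 79.6, SP 49.5, EP 844. */
    printf("OPM = %.2f\n", opm(86000000ULL, 10000000ULL));
    return 0;
}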
Multiple nodes – EP
[Plot: EP speedup and normalized energy versus node count; S2 = 2.0, S4 = 4.0, S8 = 7.9, E ≈ 1.02]
Perfect speedup: E constant as N increases
13
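The slides plot speedup S_N and normalized energy E_N without defining them; a plausible reading, consistent with the remark that perfect speedup keeps E constant, is speedup relative to the one-node run and total cluster energy normalized to the one-node energy (an assumption, not stated on the slide):

% Assumed definitions, not given on the slide
S_N = \frac{T_1}{T_N}, \qquad
E_N = \frac{E_{\text{cluster}}(N)}{E_{\text{cluster}}(1)}
% With perfect speedup, T_N = T_1/N, so unchanged per-node power gives
% E_{\text{cluster}}(N) = N \cdot P_{\text{node}} \cdot T_N = P_{\text{node}} \cdot T_1,
% i.e. E_N \approx 1 (cf. E = 1.02 for EP above).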
Multiple nodes – LU
[Plot: LU speedup and normalized energy versus node count; S2 = 1.9, E2 = 1.03; S4 = 3.3, E4 = 1.15; S8 = 5.3, E8 = 1.16; at gear 2, S8 = 5.8, E8 = 1.28]
Good speedup: E-T tradeoff as N increases
14
Phases
16
Phases: LU
17
Phase detection
 First, divide the program into blocks
 All code in a block executes in the same gear
 Block boundaries
MPI operations
Points where OPM is expected to change
 Then, merge adjacent blocks into phases (see the sketch below)
 Merge if memory pressure is similar
Use OPM
|OPMi – OPMi+1| is small
 Merge if a block is small (short execution time)
 Note, in future:
 Leverage large body of phase detection research
[Kennedy & Kremer 1998] [Sherwood, et al 2002]
18
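A minimal C sketch of the block-merging rule above: adjacent blocks join the same phase when their OPM values are close or when a block is too short to be worth its own gear. The thresholds and the Block layout are assumptions for illustration, not values from the slides:

/* Merge adjacent blocks into phases when memory pressure (OPM) is similar
 * or when a block is very short.  Thresholds are illustrative. */
#include <math.h>
#include <stdio.h>

typedef struct {
    double opm;      /* operations per L2 miss in this block      */
    double seconds;  /* measured execution time of this block     */
    int    phase;    /* phase id assigned by the merge pass       */
} Block;

static void merge_into_phases(Block *b, int n, double opm_eps, double min_seconds) {
    int phase = 0;
    for (int i = 0; i < n; i++) {
        if (i > 0 &&
            fabs(b[i].opm - b[i - 1].opm) > opm_eps &&  /* OPM differs        */
            b[i].seconds >= min_seconds)                /* and block is large */
            phase++;                                    /* start a new phase  */
        b[i].phase = phase;
    }
}

int main(void) {
    Block blocks[] = {{8.6, 2.0, 0}, {9.1, 1.8, 0}, {80.0, 3.0, 0}, {78.5, 0.01, 0}};
    merge_into_phases(blocks, 4, 5.0, 0.1);
    for (int i = 0; i < 4; i++)
        printf("block %d -> phase %d\n", i, blocks[i].phase);
    return 0;
}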
Data collection
 Use MPI-jack
 Pre and post hooks
 For example
[Diagram: MPI-jack code sits between the MPI application and the MPI library and is used for program tracing and gear shifting]
 Gather profile data during execution
 Define MPI-jack hook for every MPI operation
 Insert pseudo MPI call at end of loops
 Information collected:
 Type of call and location (PC)
 Status (gear, time, etc.)
 Statistics (uops and L2 misses for OPM calculation)
19
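MPI-jack's own hook API is not shown in the slides; as a stand-in, here is a minimal C sketch of the same pre/post-hook idea using the standard MPI profiling (PMPI) interface, wrapping MPI_Barrier to record time spent blocking:

/* Pre/post hooks around an MPI call, in the spirit of MPI-jack's hooks but
 * written against the standard PMPI profiling interface.  The wrapper
 * overrides the library's MPI_Barrier and forwards to PMPI_Barrier. */
#include <mpi.h>
#include <stdio.h>

static double blocked_seconds = 0.0;        /* aggregated across calls      */

int MPI_Barrier(MPI_Comm comm) {
    double start = MPI_Wtime();             /* pre-hook: record entry time  */
    int rc = PMPI_Barrier(comm);            /* forward to the real barrier  */
    blocked_seconds += MPI_Wtime() - start; /* post-hook: accumulate        */
    return rc;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);            /* goes through the wrapper     */
    printf("rank %d blocked %.6f s in MPI_Barrier\n", rank, blocked_seconds);
    MPI_Finalize();
    return 0;
}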
Example: bt
20
Comparing two schedules
 What is the “best” schedule?
 Depends on user
 User supplies “better” function
bool better(i, j)
 Several metrics can be used
 Energy-delay
 Energy-delay squared [Cameron et al. SC2004]
21
Slope metric
[Figure: schedules i and j plotted in the energy-time (E-T) plane, with a line of slope "limit"]
 Project uses slope
 Energy-time tradeoff
 Slope = -1 ⇒ energy savings = time delay (energy-delay product)
 User defines the limit
 Limit = 0 ⇒ minimize energy
 Limit = -∞ ⇒ minimize time
 If slope < limit, then better
 We do not advocate this metric over others
22
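A minimal C sketch of the better() predicate from the previous slide using this slope metric. The slope formula (change in normalized energy over change in normalized time between the two schedules) is an assumption consistent with the figure, not a formula given on the slide:

/* "Better" predicate comparing two schedules with the slope metric.
 * Assumes slope = dE/dT between schedules in normalized E-T space. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { double time, energy; } Schedule;  /* normalized T and E */

static bool better(Schedule i, Schedule j, double limit) {
    double dt = j.time - i.time;
    if (dt <= 0.0)                        /* j is no slower: better if it  */
        return j.energy <= i.energy;      /* also uses no more energy      */
    double slope = (j.energy - i.energy) / dt;
    return slope < limit;                 /* steep energy drop per delay   */
}

int main(void) {
    Schedule full = {1.00, 1.00}, reduced = {1.01, 0.83};  /* CG-like numbers */
    printf("reduced better? %s\n", better(full, reduced, -1.5) ? "yes" : "no");
    return 0;
}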
Example: bt
   Solution   Slope   < -1.5?
1  00 → 01    -11.7   true
2  01 → 02    -1.78   true
3  02 → 03    -1.19   false
4  02 → 12    -1.44   false
02 is the best
23
Benefit of multiple gears: mg
24
Current work: no. of nodes, gear/phase
25
Load imbalance
26
Node bottleneck
 Best course is to keep load balanced
 Load balancing is hard
 Slow down a node that is not on the critical path
 How to tell whether a node is critical?
 Suppose a barrier
All nodes must arrive before any leave
No benefit to arriving early
 Measure block time
 Assume it is (mostly) the same between iterations
 Assumptions
 Iterative application
 Past predicts future
27
Example
[Diagram: in iteration k a node finishes early and waits (slack) at each synchronization point; for iteration k+1 the predicted slack and predicted time t give a reduced performance target of (t − slack)/t instead of 1]
 Reduced performance & power ⇒ energy savings
28
Measuring slack
 Blocking operations
 Receive
 Wait
 Barrier
 Measure with MPI-jack
 Too frequent
 Can be hundreds or thousands per second
 Aggregate slack for one or more iterations
 Computing slack, S
 Measure times for computing and blocking phases
T = C1 + B1 + C2 + B2 + … + Cn + Bn
 Compute aggregate slack
S = (B1+B2+…+Bn)/T
29
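A minimal C sketch of the aggregate-slack computation above, assuming per-iteration compute and blocking times have already been collected (for example by hooks like the barrier wrapper sketched earlier); the sample times are illustrative:

/* Aggregate slack S = (B1 + B2 + ... + Bn) / T with
 * T = C1 + B1 + ... + Cn + Bn, as defined on this slide. */
#include <stdio.h>

static double aggregate_slack(const double *compute, const double *block, int n) {
    double b_total = 0.0, t_total = 0.0;
    for (int k = 0; k < n; k++) {
        b_total += block[k];
        t_total += compute[k] + block[k];
    }
    return t_total > 0.0 ? b_total / t_total : 0.0;
}

int main(void) {
    double compute[] = {0.90, 0.85, 0.95};  /* C1..Cn in seconds (assumed) */
    double block[]   = {0.10, 0.15, 0.05};  /* B1..Bn in seconds (assumed) */
    printf("slack S = %.2f\n", aggregate_slack(compute, block, 3));
    return 0;
}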
Communication slack
[Chart: measured communication slack for Aztec, Sweep3d, and CG]
 Slack
 Use net slack
 Varies between nodes
 Each node individually determines slack
 Varies between applications
 Reduction to find min slack
30
Shifting
 When to reduce performance?
 When there is enough slack
 When to increase performance?
 When application performance suffers
 Create high and low limits for slack (see the sketch below)
 Need damping
 Dynamically learn
 Not the same for all applications
 Range starts small
 Increase if necessary
[Diagram: measured slack over time T with high and low limits: enough slack ⇒ reduce gear, in range ⇒ same gear, too little slack ⇒ increase gear]
31
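A minimal C sketch of the shifting policy described above: shift to a slower gear when slack exceeds the high limit, shift to a faster gear when it falls below the low limit, and widen the range when a shift immediately undoes the previous one (the damping). Gear numbering, thresholds, and the widening rule are assumptions for illustration:

/* Slack-driven gear shifting with a learned dead band.  Gear 0 is taken
 * to be the fastest gear; thresholds and the widening rule are assumed. */
#include <stdio.h>

#define MAX_GEAR 6

typedef struct {
    int    gear;
    double low, high;   /* dead band: shift up below low, down above high */
    int    last_shift;  /* +1 slowed down, -1 sped up, 0 no shift         */
} Shifter;

static void update(Shifter *s, double slack) {
    int shift = 0;
    if (slack > s->high && s->gear < MAX_GEAR) { s->gear++; shift = +1; }
    else if (slack < s->low && s->gear > 0)    { s->gear--; shift = -1; }

    /* Damping: if this shift undoes the previous one, widen the band. */
    if (shift != 0 && shift == -s->last_shift) {
        s->low  *= 0.9;
        s->high *= 1.1;
    }
    s->last_shift = shift;
}

int main(void) {
    Shifter s = {0, 0.05, 0.15, 0};
    double slack_trace[] = {0.20, 0.18, 0.03, 0.20, 0.10};  /* per iteration */
    for (int k = 0; k < 5; k++) {
        update(&s, slack_trace[k]);
        printf("iteration %d: slack %.2f -> gear %d (band %.3f..%.3f)\n",
               k, slack_trace[k], s.gear, s.low, s.high);
    }
    return 0;
}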
Aztec gears
32
Performance
[Charts: performance results for Sweep3d and Aztec]
33
Synthetic benchmark
34
Summary
 Contributions
 Improved energy efficiency of HPC applications
 Found a simple metric for phase boundary location
 Developed a simple, effective linear-time algorithm for determining proper gears
 Leveraged load imbalance
 Future work
 Reduce the sampling interval to a handful of iterations
 Reduce algorithm time with modeling and prediction
 Develop AMPERE
a message-passing environment for reducing energy
http://fortknox.csc.ncsu.edu:osr/
[email protected] [email protected]
35
End
36