High-Performance Power-Aware Computing
Vincent W. Freeh, Computer Science, NCSU ([email protected])

Acknowledgements
- NCSU: Tyler K. Bletsch, Mark E. Femal, Nandini Kappiah, Feng Pan, Daniel M. Smith
- University of Georgia: Robert Springer, Barry Rountree, Prof. David K. Lowenthal

The case for power management
Eric Schmidt, Google CEO: "it's not speed but power—low power, because data centers can consume as much electricity as a small city."
- Power/energy consumption is becoming a key issue:
  - Power limitations
  - Energy = heat, and heat dissipation is costly
  - A non-trivial amount of money
- Consequence: excessive power consumption limits performance; fewer nodes can operate concurrently
- Goal: increase power/energy efficiency, i.e. more performance per unit of power/energy

How: CPU scaling
- power ∝ frequency × voltage²
- Reducing frequency and voltage reduces both power and performance
- Energy/power "gears": each gear is a frequency-voltage pair, i.e. a power-performance setting and an energy-time tradeoff
- Why CPU scaling? The CPU is a large power consumer, and the mechanism already exists
[Figure: CPU scaling lowers frequency/voltage, trading application throughput for reduced power]

Is CPU scaling a win?
[Figure: at the full gear, power vs. time: CPU power P_CPU (energy E_CPU) above other system power P_other (energy E_other) over run time T]

Is CPU scaling a win?
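The power relation above (power ∝ frequency × voltage²) can be sketched numerically. This is an illustrative model only, using the gear table from the methodology slides, not measured cluster data:

```python
# Relative dynamic CPU power per gear, using the model P ∝ f * V^2.
# Gear table (frequency in MHz, core voltage in V) from the test cluster.
GEARS = [(2000, 1.5), (1800, 1.4), (1600, 1.35), (1400, 1.3),
         (1200, 1.2), (1000, 1.1), (800, 1.0)]

def relative_power(freq_mhz, volts, base=GEARS[0]):
    """Dynamic power of a gear relative to the top gear."""
    f0, v0 = base
    return (freq_mhz * volts ** 2) / (f0 * v0 ** 2)

for f, v in GEARS:
    print(f"{f} MHz @ {v} V -> {relative_power(f, v):.2f}x top-gear power")
```

At the lowest gear (800 MHz, 1.0 V) this model predicts roughly 18% of top-gear dynamic power, which is why even a modest slowdown can yield a large energy saving when the CPU is not the bottleneck.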
[Figure: at a reduced gear, CPU power P_CPU drops (the benefit) while the run stretches from T to T+ΔT, so the other system power P_other is drawn for longer (the cost); the win depends on E_CPU saved vs. E_other added]

Our work
- Exploit bottlenecks: when the application is waiting on a bottleneck resource, reduce power on the non-critical resource; generally the CPU is not on the critical path
- Bottlenecks we exploit: intra-node (memory) and inter-node (load imbalance)
- Contributions: impact studies [HPPAC '05] [IPDPS '05]; varying gears/nodes [PPoPP '05] [PPoPP '06 (submitted)]; leveraging load imbalance [SC '05]

Methodology
- Cluster used: 10 nodes, AMD Athlon-64
- The processor supports 7 frequency-voltage settings (gears):
  Frequency (MHz): 2000  1800  1600  1400  1200  1000  800
  Voltage (V):     1.5   1.4   1.35  1.3   1.2   1.1   1.0
- Measurements: wall-clock time (gettimeofday system call) and energy (external power meter)

NAS benchmarks

CG – 1 node
- 800 MHz vs. 2000 MHz: +1% time, -17% energy
- Not CPU bound: little time penalty, large energy savings

EP – 1 node
- 800 MHz vs. 2000 MHz: +11% time, -3% energy
- CPU bound: big time penalty, little or no energy savings

Operations per miss (OPM)
- CG: 8.60, BT: 79.6, SP: 49.5, EP: 844

Multiple nodes – EP
- S2 = 2.0, S4 = 4.0, S8 = 7.9; E = 1.02
- Perfect speedup: normalized energy E stays constant as N increases

Multiple nodes – LU
- S2 = 1.9 (E2 = 1.03), S4 = 3.3 (E4 = 1.15), S8 = 5.3 (E8 = 1.16)
- Gear 2: S8 = 5.8, E8 = 1.28
- Good speedup: an energy-time tradeoff appears as N increases

Phases

Phases: LU

Phase detection
- First, divide the program into blocks; all code in a block executes in the same gear. Block boundaries: MPI operations, and points where an OPM change is expected.
- Then, merge adjacent blocks into phases:
  - Merge if memory pressure is similar (using OPM): |OPM_i − OPM_{i+1}| is small
  - Merge if the block is small (short time)
- Note, in the future: leverage the large body of phase-detection research [Kennedy & Kremer 1998] [Sherwood, et al. 2002]

Data collection
- Use MPI-jack: pre and post hooks around the MPI library, used for program tracing and gear shifting
- Gather profile data during execution: define an MPI-jack hook for every MPI operation; insert a pseudo MPI call at the end of loops
- Information collected: type of call and location (PC); status (gear, time, etc.); statistics
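The block-merging rule from the phase-detection slide can be sketched as follows. The OPM tolerance and minimum block time are hypothetical tuning parameters, not values from the talk:

```python
# Merge adjacent blocks into phases when their memory pressure (OPM,
# operations per L2 miss) is similar, or when a block is too short to matter.
def merge_phases(blocks, opm_tol=10.0, min_time=0.01):
    """blocks: list of (time_seconds, uops, l2_misses) tuples."""
    phases = []
    for t, uops, misses in blocks:
        if phases:
            pt, puops, pmisses = phases[-1]
            opm, popm = uops / misses, puops / pmisses
            if abs(opm - popm) <= opm_tol or t < min_time:
                # Similar memory pressure (or negligible block): merge.
                phases[-1] = (pt + t, puops + uops, pmisses + misses)
                continue
        phases.append((t, uops, misses))
    return phases

# Two memory-bound blocks followed by two compute-bound blocks
blocks = [(1.0, 860, 100), (1.0, 900, 100),        # OPM 8.6, 9.0
          (1.0, 79600, 1000), (1.0, 80000, 1000)]  # OPM 79.6, 80.0
print(len(merge_phases(blocks)))  # 2
```

The merged phase keeps summed uops and misses, so its OPM is the weighted average of its constituent blocks, as the aggregate-counter measurement would produce.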
  (uops and L2 misses, for the OPM calculation)

Example: bt

Comparing two schedules
- What is the "best" schedule? It depends on the user, who supplies a "better" function: bool better(i, j)
- Several metrics can be used: energy-delay, energy-delay squared [Cameron et al. SC2004]

Slope metric
- This project uses the slope of the energy-time tradeoff, against a user-defined limit
- Slope = -1: energy savings equals time delay (the energy-delay product)
- Limit = 0: minimize energy; limit = -∞: minimize time
- If the slope is below the limit, the new schedule is better
- We do not advocate this metric over the others

Example: bt (limit = -1.5)
  Step  From  To   Slope   Slope < -1.5?
  1     00    01   -11.7   true
  2     01    02   -1.78   true
  3     02    03   -1.19   false
  4     02    12   -1.44   false
  Schedule 02 is the best

Benefit of multiple gears: mg

Current work: number of nodes, gear/phase

Load imbalance

Node bottleneck
- The best course is to keep the load balanced, but load balancing is hard
- Instead, slow down a node if it is not critical. How to tell whether a node is critical? Consider a barrier: all nodes must arrive before any leave, so there is no benefit to arriving early
- Measure block time, and assume it is (mostly) the same between iterations
- Assumptions: iterative application; the past predicts the future

Example
[Figure: predicted vs. actual synchronization points across iterations k and k+1; at full speed performance = 1, with slack absorbed performance = (t − slack)/t]
- Reduced performance and power yield energy savings

Measuring slack
- Blocking operations: receive, wait, barrier; measured with MPI-jack
- They are too frequent to act on individually (there can be hundreds or thousands per second), so aggregate slack over one or more iterations
- Computing slack S: measure the times of the computing and blocking phases, T = C1 + B1 + C2 + B2 + … + Cn + Bn, and compute the aggregate slack S = (B1 + B2 + … + Bn)/T

Communication slack
[Figure: slack for Aztec, Sweep3d, and CG]
- Use net slack: it varies between nodes (each node individually determines its slack) and between applications
- A reduction finds the minimum slack

Shifting slack
- When to reduce performance? When there is enough slack
- When to increase performance?
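The slope test above can be made concrete. The normalized time/energy pairs here are invented for illustration; the talk reports only the resulting slopes:

```python
# Compare two schedules by the slope of the energy-time tradeoff:
# schedule j is "better" than i if moving from i to j trades time for
# energy at a rate steeper than the user-defined limit.
def better(i, j, limit=-1.0):
    """i, j: (normalized_time, normalized_energy) pairs."""
    ti, ei = i
    tj, ej = j
    slope = (ej - ei) / (tj - ti)  # energy change per unit of time change
    return slope < limit

# A 2% slowdown that saves 10% energy: slope = -0.10 / 0.02 = -5.0
print(better((1.00, 1.00), (1.02, 0.90), limit=-1.5))  # True
```

With limit = 0 any energy reduction is accepted regardless of slowdown (minimize energy); as the limit tends to -∞ no slowdown is ever accepted (minimize time), matching the two extremes on the slide.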
  When application performance suffers
- Create high and low limits for slack; damping is needed
- Learn the limits dynamically: they are not the same for all applications; the range starts small and increases if necessary
[Figure: slack over time T divided into reduce-gear, same-gear, and increase-gear regions]

Aztec gears

Sweep3d and Aztec performance

Synthetic benchmark

Summary
- Contributions:
  - Improved the energy efficiency of HPC applications
  - Found a simple metric for phase-boundary location
  - Developed a simple, effective linear-time algorithm for determining proper gears
  - Leveraged load imbalance
- Future work:
  - Reduce the sampling interval to a handful of iterations
  - Reduce algorithm time with modeling and prediction
  - Develop AMPERE, a message-passing environment for reducing energy

http://fortknox.csc.ncsu.edu:osr/
[email protected], [email protected]

End
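The aggregate-slack formula (S = (B1 + … + Bn)/T) and the high/low-limit shifting policy above can be sketched together. The limit values and the gear-numbering convention are assumptions for illustration, not the talk's tuned parameters:

```python
# Aggregate slack over an iteration, then shift gears with hysteresis:
# plenty of slack -> a lower gear is safe; too little -> speed back up.
def aggregate_slack(compute_times, block_times):
    """S = (B1 + ... + Bn) / T, where T includes compute and block time."""
    total = sum(compute_times) + sum(block_times)
    return sum(block_times) / total

def next_gear(gear, slack, low=0.05, high=0.20, max_gear=6):
    """Gear 0 is fastest; higher gear numbers mean lower frequency."""
    if slack > high and gear < max_gear:
        return gear + 1            # plenty of slack: slow down
    if slack < low and gear > 0:
        return gear - 1            # performance suffering: speed up
    return gear                    # inside the dead band: hold

s = aggregate_slack([1.0, 1.0], [0.3, 0.3])  # S = 0.6 / 2.6 ≈ 0.23
print(next_gear(0, s))  # slack above the high limit: shift down a gear
```

The dead band between the low and high limits provides the damping the slide calls for: a node near the boundary does not oscillate between gears on every iteration, and a learning scheme could widen the band per application as described.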