High-Performance Power-Aware Computing
Vincent W. Freeh, Computer Science, NCSU ([email protected])

Acknowledgements
- NCSU: Tyler K. Bletsch, Mark E. Femal, Nandini Kappiah, Feng Pan, Daniel M. Smith
- University of Georgia: Robert Springer, Barry Rountree, Prof. David K. Lowenthal

The case for power management
Eric Schmidt, Google CEO: "it's not speed but power—low power, because data centers can consume as much electricity as a small city."
- Power/energy consumption is becoming a key issue:
  - Power limitations
  - Energy = heat, and heat dissipation is costly
  - A non-trivial amount of money
- Consequence: excessive power consumption limits performance; fewer nodes can operate concurrently
- Goal: increase power/energy efficiency, i.e. more performance per unit of power/energy

How: CPU scaling
- power ∝ frequency × voltage²
- Reducing frequency and voltage reduces both power and performance
- Energy/power "gears": each gear is a frequency-voltage pair, i.e. a power-performance setting and an energy-time tradeoff
- Why CPU scaling? The CPU is a large power consumer, and the mechanism already exists
[Figure: CPU scaling lowers frequency/voltage, trading application throughput for reduced power]

Is CPU scaling a win?
[Figure: at the full gear, power vs. time: CPU power P_CPU (energy E_CPU) above other system power P_other (energy E_other) over run time T]

Is CPU scaling a win?
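The power relation above (power ∝ frequency × voltage²) can be sketched numerically. This is an illustrative model only, using the gear table from the methodology slides, not measured cluster data:

```python
# Relative dynamic CPU power per gear, using the model P ∝ f * V^2.
# Gear table (frequency in MHz, core voltage in V) from the test cluster.
GEARS = [(2000, 1.5), (1800, 1.4), (1600, 1.35), (1400, 1.3),
         (1200, 1.2), (1000, 1.1), (800, 1.0)]

def relative_power(freq_mhz, volts, base=GEARS[0]):
    """Dynamic power of a gear relative to the top gear."""
    f0, v0 = base
    return (freq_mhz * volts ** 2) / (f0 * v0 ** 2)

for f, v in GEARS:
    print(f"{f} MHz @ {v} V -> {relative_power(f, v):.2f}x top-gear power")
```

At the lowest gear (800 MHz, 1.0 V) this model predicts roughly 18% of top-gear dynamic power, which is why even a modest slowdown can yield a large energy saving when the CPU is not the bottleneck.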
[Figure: at a reduced gear, CPU power P_CPU drops (the benefit) while the run stretches from T to T+ΔT, so the other system power P_other is drawn for longer (the cost); the win depends on E_CPU saved vs. E_other added]

Our work
- Exploit bottlenecks: when the application is waiting on a bottleneck resource, reduce power on the non-critical resource; generally the CPU is not on the critical path
- Bottlenecks we exploit: intra-node (memory) and inter-node (load imbalance)
- Contributions: impact studies [HPPAC '05] [IPDPS '05]; varying gears/nodes [PPoPP '05] [PPoPP '06 (submitted)]; leveraging load imbalance [SC '05]

Methodology
- Cluster used: 10 nodes, AMD Athlon-64
- The processor supports 7 frequency-voltage settings (gears):
  Frequency (MHz): 2000  1800  1600  1400  1200  1000  800
  Voltage (V):     1.5   1.4   1.35  1.3   1.2   1.1   1.0
- Measurements: wall-clock time (gettimeofday system call) and energy (external power meter)

NAS benchmarks

CG – 1 node
- 800 MHz vs. 2000 MHz: +1% time, -17% energy
- Not CPU bound: little time penalty, large energy savings

EP – 1 node
- 800 MHz vs. 2000 MHz: +11% time, -3% energy
- CPU bound: big time penalty, little or no energy savings

Operations per miss (OPM)
- CG: 8.60, BT: 79.6, SP: 49.5, EP: 844

Multiple nodes – EP
- S2 = 2.0, S4 = 4.0, S8 = 7.9; E = 1.02
- Perfect speedup: normalized energy E stays constant as N increases

Multiple nodes – LU
- S2 = 1.9 (E2 = 1.03), S4 = 3.3 (E4 = 1.15), S8 = 5.3 (E8 = 1.16)
- Gear 2: S8 = 5.8, E8 = 1.28
- Good speedup: an energy-time tradeoff appears as N increases

Phases

Phases: LU

Phase detection
- First, divide the program into blocks; all code in a block executes in the same gear. Block boundaries: MPI operations, and points where an OPM change is expected.
- Then, merge adjacent blocks into phases:
  - Merge if memory pressure is similar (using OPM): |OPM_i − OPM_{i+1}| is small
  - Merge if the block is small (short time)
- Note, in the future: leverage the large body of phase-detection research [Kennedy & Kremer 1998] [Sherwood, et al. 2002]

Data collection
- Use MPI-jack: pre and post hooks around the MPI library, used for program tracing and gear shifting
- Gather profile data during execution: define an MPI-jack hook for every MPI operation; insert a pseudo MPI call at the end of loops
- Information collected: type of call and location (PC); status (gear, time, etc.); statistics
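The block-merging rule from the phase-detection slide can be sketched as follows. The OPM tolerance and minimum block time are hypothetical tuning parameters, not values from the talk:

```python
# Merge adjacent blocks into phases when their memory pressure (OPM,
# operations per L2 miss) is similar, or when a block is too short to matter.
def merge_phases(blocks, opm_tol=10.0, min_time=0.01):
    """blocks: list of (time_seconds, uops, l2_misses) tuples."""
    phases = []
    for t, uops, misses in blocks:
        if phases:
            pt, puops, pmisses = phases[-1]
            opm, popm = uops / misses, puops / pmisses
            if abs(opm - popm) <= opm_tol or t < min_time:
                # Similar memory pressure (or negligible block): merge.
                phases[-1] = (pt + t, puops + uops, pmisses + misses)
                continue
        phases.append((t, uops, misses))
    return phases

# Two memory-bound blocks followed by two compute-bound blocks
blocks = [(1.0, 860, 100), (1.0, 900, 100),        # OPM 8.6, 9.0
          (1.0, 79600, 1000), (1.0, 80000, 1000)]  # OPM 79.6, 80.0
print(len(merge_phases(blocks)))  # 2
```

The merged phase keeps summed uops and misses, so its OPM is the weighted average of its constituent blocks, as the aggregate-counter measurement would produce.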
  (uops and L2 misses, for the OPM calculation)

Example: bt

Comparing two schedules
- What is the "best" schedule? It depends on the user, who supplies a "better" function: bool better(i, j)
- Several metrics can be used: energy-delay, energy-delay squared [Cameron et al. SC2004]

Slope metric
- This project uses the slope of the energy-time tradeoff, against a user-defined limit
- Slope = -1: energy savings equals time delay (the energy-delay product)
- Limit = 0: minimize energy; limit = -∞: minimize time
- If the slope is below the limit, the new schedule is better
- We do not advocate this metric over the others

Example: bt (limit = -1.5)
  Step  From  To   Slope   Slope < -1.5?
  1     00    01   -11.7   true
  2     01    02   -1.78   true
  3     02    03   -1.19   false
  4     02    12   -1.44   false
  Schedule 02 is the best

Benefit of multiple gears: mg

Current work: number of nodes, gear/phase

Load imbalance

Node bottleneck
- The best course is to keep the load balanced, but load balancing is hard
- Instead, slow down a node if it is not critical. How to tell whether a node is critical? Consider a barrier: all nodes must arrive before any leave, so there is no benefit to arriving early
- Measure block time, and assume it is (mostly) the same between iterations
- Assumptions: iterative application; the past predicts the future

Example
[Figure: predicted vs. actual synchronization points across iterations k and k+1; at full speed performance = 1, with slack absorbed performance = (t − slack)/t]
- Reduced performance and power yield energy savings

Measuring slack
- Blocking operations: receive, wait, barrier; measured with MPI-jack
- They are too frequent to act on individually (there can be hundreds or thousands per second), so aggregate slack over one or more iterations
- Computing slack S: measure the times of the computing and blocking phases, T = C1 + B1 + C2 + B2 + … + Cn + Bn, and compute the aggregate slack S = (B1 + B2 + … + Bn)/T

Communication slack
[Figure: slack for Aztec, Sweep3d, and CG]
- Use net slack: it varies between nodes (each node individually determines its slack) and between applications
- A reduction finds the minimum slack

Shifting slack
- When to reduce performance? When there is enough slack
- When to increase performance?
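The slope test above can be made concrete. The normalized time/energy pairs here are invented for illustration; the talk reports only the resulting slopes:

```python
# Compare two schedules by the slope of the energy-time tradeoff:
# schedule j is "better" than i if moving from i to j trades time for
# energy at a rate steeper than the user-defined limit.
def better(i, j, limit=-1.0):
    """i, j: (normalized_time, normalized_energy) pairs."""
    ti, ei = i
    tj, ej = j
    slope = (ej - ei) / (tj - ti)  # energy change per unit of time change
    return slope < limit

# A 2% slowdown that saves 10% energy: slope = -0.10 / 0.02 = -5.0
print(better((1.00, 1.00), (1.02, 0.90), limit=-1.5))  # True
```

With limit = 0 any energy reduction is accepted regardless of slowdown (minimize energy); as the limit tends to -∞ no slowdown is ever accepted (minimize time), matching the two extremes on the slide.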
  When application performance suffers
- Create high and low limits for slack; damping is needed
- Learn the limits dynamically: they are not the same for all applications; the range starts small and increases if necessary
[Figure: slack over time T divided into reduce-gear, same-gear, and increase-gear regions]

Aztec gears

Sweep3d and Aztec performance

Synthetic benchmark

Summary
- Contributions:
  - Improved the energy efficiency of HPC applications
  - Found a simple metric for phase-boundary location
  - Developed a simple, effective linear-time algorithm for determining proper gears
  - Leveraged load imbalance
- Future work:
  - Reduce the sampling interval to a handful of iterations
  - Reduce algorithm time with modeling and prediction
  - Develop AMPERE, a message-passing environment for reducing energy

http://fortknox.csc.ncsu.edu:osr/
[email protected], [email protected]

End
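The aggregate-slack formula (S = (B1 + … + Bn)/T) and the high/low-limit shifting policy above can be sketched together. The limit values and the gear-numbering convention are assumptions for illustration, not the talk's tuned parameters:

```python
# Aggregate slack over an iteration, then shift gears with hysteresis:
# plenty of slack -> a lower gear is safe; too little -> speed back up.
def aggregate_slack(compute_times, block_times):
    """S = (B1 + ... + Bn) / T, where T includes compute and block time."""
    total = sum(compute_times) + sum(block_times)
    return sum(block_times) / total

def next_gear(gear, slack, low=0.05, high=0.20, max_gear=6):
    """Gear 0 is fastest; higher gear numbers mean lower frequency."""
    if slack > high and gear < max_gear:
        return gear + 1            # plenty of slack: slow down
    if slack < low and gear > 0:
        return gear - 1            # performance suffering: speed up
    return gear                    # inside the dead band: hold

s = aggregate_slack([1.0, 1.0], [0.3, 0.3])  # S = 0.6 / 2.6 ≈ 0.23
print(next_gear(0, s))  # slack above the high limit: shift down a gear
```

The dead band between the low and high limits provides the damping the slide calls for: a node near the boundary does not oscillate between gears on every iteration, and a learning scheme could widen the band per application as described.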