
Power Management

Introduction to Basics

Background Reading
• http://en.wikipedia.org/wiki/CPU_power_dissipation
• http://en.wikipedia.org/wiki/CMOS#Power:_switching_and_leakage
• http://www.xbitlabs.com/articles/cpu/display/core-i5-2500t-2390ti3-2100t-pentium-g620t.html
• http://www.cpu-world.com/info/charts.html
• Goal: Understand
  - The sources of power dissipation in combinational and sequential circuits
  - Power vs. energy
  - Options for controlling power/energy dissipation

Moore's Law
Goal: Sustain Performance Scaling
• Performance scaled with number of transistors
• Dennard scaling*: power scaled with feature size
[Figure: Moore's Law plot, from wikipedia.org]
*R. Dennard, et al., "Design of ion-implanted MOSFETs with very small physical dimensions," IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, pp. 256-268, Oct. 1974.

Where Does the Power Go in CMOS?
• Dynamic power consumption
  - Caused by switching transitions → cost of switching state
• Static power consumption
  - Caused by leakage currents in the absence of any switching activity
• Power consumption per transistor changes with each technology generation
  - No longer reducing at the same rate
  - What happens to power density?
[Figures: CMOS inverter (Vdd, PMOS, Vin, Vout, NMOS, Ground); AMD Trinity APU]

n-channel MOSFET
[Figure: n-channel MOSFET with gate, drain, source, body, oxide thickness tox and channel length L]
• Vgs < Vt: transistor off (Vt is the threshold voltage)
• Vgs > Vt: transistor on
• Impact of threshold voltage
  - Higher Vt: slower switching speed, lower leakage
  - Lower Vt: faster switching speed, higher leakage
• Actual physics is more complex but this will do for now!

Charge as a State Variable
[Figure: logic circuit with nodes a, b, c, x, y]
For computation we should be able to identify whether each of the variables (a, b, c, x, y) is in a '1' or a '0' state. We could have used any physical quantity to do that:
• Voltage
• Current
• Electron spin
• Orientation of magnetic field
• ...
All nodes have some capacitance associated with them. We choose voltage to distinguish between a '0' and a '1'.
• Logic 1: Cap is charged
• Logic 0: Cap is discharged

Abstracting Energy Behavior
• How can we abstract energy consumption for a digital device?
• Consider the energy cost of charge transfer
[Figure: CMOS inverter (Vin = 0 gives Vout = 1, Vin = 1 gives Vout = 0); the PMOS and NMOS transistors are modeled as on/off resistances and the output as a capacitance]

Switch from One State to Another
To perform computation, we need to switch from one state to another.
• Logic 1: Cap is charged. Connect the cap to VCC through an ON PMOS.
• Logic 0: Cap is discharged. Connect the cap to GND through an ON NMOS.
The logic dictates whether a node capacitor will be charged or discharged.

Power vs. Energy
[Figure: two power (watts) vs. time profiles with power levels P0, P1, P2; same energy = area under the curve]
• Power is the rate of expenditure of energy
  - One joule/sec = one watt
• Both profiles use the same amount of energy at different rates or power

Dynamic Power vs. Dynamic Energy
• Dynamic power: consider the rate at which switching (energy dissipation) takes place
[Figure: CMOS inverter with load capacitance CL and supply current iDD; input voltage waveform over period T; output capacitor charging and discharging]
• Activity factor α = fraction of total capacitance that switches each cycle

$P_{dynamic} = \alpha \left( \frac{C_L}{2} \right) V_{dd}^2 F$

$Delay = k \, C \, \frac{V_{dd}}{(V_{dd} - V_t)^2}$

Energy-Delay Interaction
[Figure: energy (or power) and delay plotted against VDD; the energy-delay product (EDP) is the target of optimization]
• Delay decreases with supply voltage but energy/power increases (same $P_{dynamic}$ and $Delay$ expressions as above)

Static Power
• Technology scaling has caused transistors to become smaller and smaller. As a result, static power has become a substantial portion of the total power.
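Returning briefly to the dynamic power and delay expressions on the Energy-Delay Interaction slide: the trade-off can be explored numerically. The following is a minimal sketch, not part of the original slides. The constants `ALPHA`, `C_LOAD`, `K_DELAY` and `V_T` are assumed values chosen only for illustration. Under this simplified model the energy-delay product is proportional to Vdd^3 / (Vdd - Vt)^2, which has its minimum at Vdd = 3 Vt.

```python
# Sketch of the energy-delay trade-off under the slide's simplified model.
# All constants below are assumptions for illustration, not measured values.
ALPHA = 0.1      # activity factor (assumed)
C_LOAD = 1e-9    # switched capacitance C_L in farads (assumed)
K_DELAY = 1e-9   # technology constant k (assumed)
V_T = 0.3        # threshold voltage in volts (assumed)


def switching_energy(vdd):
    """Energy per cycle, i.e. P_dynamic / F = alpha * (C_L / 2) * Vdd^2."""
    return ALPHA * (C_LOAD / 2.0) * vdd ** 2


def gate_delay(vdd):
    """Delay = k * C * Vdd / (Vdd - Vt)^2; the model only holds for Vdd > Vt."""
    if vdd <= V_T:
        raise ValueError("model requires Vdd > Vt")
    return K_DELAY * C_LOAD * vdd / (vdd - V_T) ** 2


def energy_delay_product(vdd):
    """Composite metric that penalizes both high energy and long delay."""
    return switching_energy(vdd) * gate_delay(vdd)


def best_vdd(candidates):
    """Return the candidate supply voltage with the smallest EDP."""
    return min(candidates, key=energy_delay_product)


if __name__ == "__main__":
    sweep = [i / 100 for i in range(40, 151)]  # 0.40 V to 1.50 V in 10 mV steps
    print("EDP-optimal Vdd on this grid:", best_vdd(sweep))
```

Raising Vdd above the optimum buys speed at a cubic cost in energy-delay terms; lowering it toward Vt makes delay blow up faster than energy falls.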
[Figure: MOSFET (gate, source, drain) showing gate leakage, junction leakage and sub-threshold leakage]

$P_{static} = V_{dd} \times I_{static}$

Static Energy-Delay Interaction
[Figure: leakage and delay plotted against threshold voltage Vth; MOSFET with oxide thickness tox and channel length L]

$Delay = k \, C \, \frac{V_{dd}}{(V_{dd} - V_t)^2}$

• Static energy increases exponentially with decrease in threshold voltage
• Delay increases with threshold voltage

Higher Level Blocks
[Figure: CMOS gates built from transistor networks with inputs A, B and C]

Temperature Dependence
• As temperature increases, static power increases [1]

$P_{static} = V_{dd} \times N \times K_{design} \times I_{leakage}$

  - $V_{dd}$: supply voltage
  - $N$: number of transistors
  - $I_{leakage}$: technology-dependent normalized leakage current, $I_{leakage} = F(Temp)$
[1] J. Butts and G. Sohi, "A Static Power Model for Architects," MICRO 2000

The World Today
• Yesterday: scaling to minimize time (max F)

$P_{dynamic} = \alpha \left( \frac{C_L}{2} \right) V_{dd}^2 F$

• Maximum performance (minimum time) is too expensive in terms of power
• Today: trade/balance performance for power efficiency

Technology Factors Affecting Power
• Transistor size
  - Affects capacitance ($C_L$)
• Rise times and fall times (delay)
  - Affects short circuit power (not in this course)
• Threshold voltage
  - Affects leakage power
• Temperature
  - Affects leakage power
• Switching activity
  - Frequency (F) and number of switching transistors (α)

Low Power Design: Options?
• Reduce Vdd
  - Increases gate delay
  - Note that this means it reduces the frequency of operation of the processor!
• Compensate by reducing threshold voltage?
  - Increase in leakage power
• Reduce frequency
  - Computation takes longer to complete
  - Consumes more energy (but less power) if voltage is not scaled

Example
AMD Trinity A10-5800 APU: 100W TDP

CPU P-state   Voltage (V)   Freq (MHz)   Managed by
Pb0           1             2400         HW only (boost)
Pb1           0.875         1800         HW only (boost)
P0            0.825         1600         SW visible
P1            0.812         1400         SW visible
P2            0.787         1300         SW visible
P3            0.762         1100         SW visible
P4            0.75          900          SW visible

Optimizing Power vs. Energy
• Maximize battery life → minimize energy
• Thermal envelopes → minimize peak power
Example:

What About Wires?
Lumped RC model
[Figure: line resistance $R_{line}$ between two capacitors of $\frac{1}{2} C_{line}$ each]

$R_{line} = r \times l$   (r: resistance per unit length)

$C_{line} = c \times l$   (c: capacitance per unit length)

$t = \frac{1}{2} r c \, l^2$

• We will not directly address delay or energy expended in the interconnect in this class
  - Simple architecture model: lump the energy/power with the source component

Power Management Basics

Parallelism and Power
[Figures: IBM Power5 (source: IBM); AMD Trinity (source: forwardthinking.pcmag.com)]
• How much of the chip area is devoted to compute?
• Run many cores slower. Why does this reduce power?

The Power Wall

$P = \alpha C V_{dd}^2 f + V_{dd} I_{leak}$

• Power per transistor scales with frequency but also scales with Vdd
  - Lower Vdd can be compensated for with increased pipelining to keep throughput constant
  - Power per transistor is not the same as power per area → power density is the problem!
  - Multiple units can be run at lower frequencies to keep throughput constant, while saving power

What is the Problem?
[Figure: Mukhopadhyay and Yalamanchili (2009); based on scaling using Pentium-class cores]
• While Moore's Law continues, scaling phenomena have changed
• Power densities are increasing with each generation

ITRS Roadmap for Logic Devices
[Figure from: "ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems," P. Kogge, et al., 2008]

What are my Options?
1. Better technology
   - Manufacturing
   - Better devices (FinFET)
   - New devices → non-CMOS? → this is the future
2. Be more efficient: activity management
   - Clock gating: dynamic energy/power
   - Power gating: static energy/power
   - Power state management: both
3. Improved architecture
   - Simpler pipelines
4. Parallelism

Activity Management
Clock gating
• Turn off clock to a block of logic
• Eliminate unnecessary transitions/activity
• Clock distribution power
Power gating
• Turn off power to a block of logic, e.g., core
• No leakage
[Figures: clock-gated combinational logic (input, clk, cond); power gate transistor between Vdd and Core 0 / Core 1]

Multiple Voltage Frequency Domains
Intel Sandy Bridge Processor
• Cores and ring in one DVFS domain
• Graphics unit in another DVFS domain
• Cores and portion of cache can be gated off
From E. Rotem et al., HotChips 2011

Processor Power States
• Performance states: P-states
  - Operate at different voltage/frequencies (recall the delay-voltage relationship)
  - Lower voltage → lower leakage
  - Lower frequency → lower power (not the same as energy!)
  - Lower frequency → longer execution time
• Idle states: C-states
  - Sleep states
  - Differ in how much state is saved
• SW or HW managed transitions between states!

Example of P-states
AMD Trinity A10-5800 APU: 100W TDP
• Pb0 and Pb1 are HW only (boost); P0 to P4 are SW visible (voltages and frequencies as in the Example table above)
• Software managed power states
• Changing power states is not free

Example of P-states
From: http://www.intel.com/content/www/us/en/processors/core/2nd-gen-core-family-mobile-vol-1-datasheet.html

Management Knobs
• Each core can be in any one of multiple states
• How do I decide what state to set each core to?
  - Who decides? HW? SW?
• How do I decide when I can turn off a core?
• What am I saving? Static energy or dynamic energy?
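The note that lower frequency means lower power but not necessarily lower energy can be checked against the Trinity A10-5800 P-state table given earlier. The sketch below is an illustration only. It assumes dynamic power scales as V^2 f with a constant activity factor and capacitance, that the workload needs a fixed number of cycles, and it ignores leakage entirely. The function name `relative_to` is hypothetical.

```python
# P-state table values (name, volts, MHz) are taken from the slide deck.
P_STATES = [
    ("Pb0", 1.000, 2400),
    ("Pb1", 0.875, 1800),
    ("P0", 0.825, 1600),
    ("P1", 0.812, 1400),
    ("P2", 0.787, 1300),
    ("P3", 0.762, 1100),
    ("P4", 0.750, 900),
]


def relative_to(reference, table=P_STATES):
    """Return {state: (power, energy, runtime)} normalized to `reference`.

    Assumptions (not from the slides): dynamic power only, P ~ V^2 * f,
    and a workload that needs the same cycle count in every state.
    """
    _, ref_v, ref_f = next(row for row in table if row[0] == reference)
    result = {}
    for name, volts, mhz in table:
        power = (volts ** 2 * mhz) / (ref_v ** 2 * ref_f)  # V^2 * f scaling
        runtime = ref_f / mhz                              # fixed cycle count
        energy = power * runtime                           # reduces to V^2 scaling
        result[name] = (power, energy, runtime)
    return result


if __name__ == "__main__":
    for state, (power, energy, runtime) in relative_to("Pb0").items():
        print(f"{state:4s} power={power:.2f} energy={energy:.2f} time={runtime:.2f}")
```

Under these assumptions dynamic energy falls only with the square of the voltage, so a state that cuts frequency much more than voltage saves far more power than energy. Once leakage is included, the longer run time adds static energy, which is why the answer to "what am I saving?" depends on the state chosen.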
Power Management
• Software controlled power management
  - Optimize power and/or energy
  - Orchestrated by the operating system or application libraries
  - Industry standard interfaces for power management
    o Advanced Configuration and Power Interface (ACPI)
      https://www.acpica.org/
      http://www.acpi.info/
• Hardware power management
  - Optimized power/energy
  - Failsafe operation, e.g., protect against thermal emergencies

Boosting
Intel Sandy Bridge
• Exploit package physics
  - Temperature changes on the order of milliseconds
• Use the thermal headroom
[Figure: power vs. time with labels: max power, TDP, turbo boost region (10s of seconds), low power (build up thermal credits)]

Power Gating
Intel Sandy Bridge Processor
• Turn off components that are not being used
  - Lose all state information
• Costs of powering down
• Costs of powering up
• Smart shutdown
  - Models to guide decisions

Parallelism
• Concurrency + lower frequency → greater energy efficiency
[Figure: one core with cache vs. four cores with caches]
Example:
• 4X #cores
• 0.75X voltage
• 0.5X frequency
• 1X power (dynamic term: 4 × 0.75^2 × 0.5 ≈ 1.13, i.e., roughly unchanged)
• 2X in performance

$P = \alpha C V_{dd}^2 f + V_{dd} I_{leak}$

Simplify Core Design
• Support for branch prediction, schedulers, etc. consumes more energy per instruction
• Can fit many more simpler cores on a die
[Figures: AMD Bulldozer core; ARM A7 core (arm.com)]

Metrics
• Power efficiency
  - MIPS/watt
  - Ops/watt
• Energy efficiency
  - Joules/instruction
  - Joules/op
• Composite
  - Energy-delay product
  - Energy-delay^2 product
Why are these useful?
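One way to see why the composite metrics are useful is to compute all of them for two design points. The sketch below is illustrative only. The `Run` class and the two sample runs are hypothetical and the numbers are invented for the example, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One execution of a fixed workload (all values are hypothetical)."""
    instructions: float
    seconds: float
    joules: float

    @property
    def watts(self):
        # Average power is energy divided by time.
        return self.joules / self.seconds

    @property
    def mips_per_watt(self):
        # Power efficiency: millions of instructions per second, per watt.
        return (self.instructions / 1e6 / self.seconds) / self.watts

    @property
    def joules_per_instruction(self):
        # Energy efficiency.
        return self.joules / self.instructions

    @property
    def edp(self):
        # Energy-delay product: weighs energy and delay equally.
        return self.joules * self.seconds

    @property
    def ed2p(self):
        # Energy-delay^2 product: weighs delay more heavily than energy.
        return self.joules * self.seconds ** 2


if __name__ == "__main__":
    fast = Run(instructions=1e9, seconds=1.0, joules=10.0)  # 10 W design point
    slow = Run(instructions=1e9, seconds=2.0, joules=4.0)   # 2 W design point
    for name, run in (("fast", fast), ("slow", slow)):
        print(name, run.watts, run.mips_per_watt, run.edp, run.ed2p)
```

In this invented example the slow run wins on energy, MIPS/watt and EDP, yet the fast run wins on ED^2. The choice of metric encodes how much a designer is willing to pay in delay for an energy saving. Note also that MIPS/watt is just the reciprocal of joules/instruction up to a constant, so the first two groups of metrics rank designs identically.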
Thermal Issues
• Heat can cause damage to the chip
  - Need failsafe operation
• Thermal fields change the physical characteristics
  - Leakage current and therefore power increases
  - Delay increases
  - Device degradation becomes worse
• Cooling solution determines the permitted power dissipation

Thermal Design Power (TDP)
• This is the maximum power at which the part is designed to operate
  - Dictates the design of the cooling system
    o Max temperature → Tjmax
  - Typically fixed by worst case workload
• Parts are typically operating below the TDP
• Opportunities for turbo mode?
[Figures: AMD Trinity APU; http://ecs.vancouver.wsu.edu/thermofluids-research]

Limits on Performance
• Thermal design power (TDP)
  - Determines the cooling solution & package limits
• Performance depends on effective utilization of this thermal headroom
• Convert thermal headroom to higher performance through boosting
[Figure: heat sink (www.legitreviews.com); temperature, power and instructions/cycle vs. time for a workload, with labels thermal headroom, boost power, HW boost states and SW visible states]

Trinity TDP
Source: http://www.anandtech.com/show/6347/amd-a10-5800k-a8-5600k-review-trinity-on-the-desktop-part-2

Coordinated Energy Management in Heterogeneous Processors
SC13
Indrani Paul (1,2), Vignesh Ravi (1), Srilatha Manne (1), Manish Arora (1,3), Sudhakar Yalamanchili (2)
(1) Advanced Micro Devices, Inc.
(2) Georgia Institute of Technology
(3) University of California, San Diego

Goal
• Goal:
  - Optimize energy efficiency under power and performance constraints in a heterogeneous processor
• Outline:
  - Problem
  - State-of-the-Art Power Management
  - HPC Application Characteristics and Frequency Sensitivity
  - Run-time Coordinated Energy Management
  - Results

State-of-the-art Heterogeneous Processor
• Accelerated processing unit (APU)
  - Multi-threaded CPU cores
  - Graphics processing unit (GPU): 384 AMD Radeon™ cores
  - Shared Northbridge → access to overlapping CPU/GPU physical address spaces
• Many resources are shared between the CPU and GPU, for example memory hierarchy, power, and thermal capacity

Programming Model
[Figure: user application (host tasks, GPU tasks) on top of OpenCL™ or other software stack, operating system, and APU hardware (CPU, GPU)]
• Each OpenCL kernel: N-dimensional range, a grid of threads, each operating over a data partition
• Coupled programming model → offload compute intensive tasks to the GPU

CPU-GPU Phase Behavior in an Exascale Proxy Application (Lulesh)
• CPU-GPU coupled execution → time-varying redistribution of compute intensity
• Energy efficient operation → coordinated distribution of power to CPU vs. GPU
• Coordinated power states → sensitivity of performance to CPU and GPU power state (frequency)
  - Need to characterize ROI: return (performance) on investment (power)

Challenge: CPU-GPU Coupling Effects
[Figure: coupling effects between host tasks and GPU tasks in the user application: direct performance coupling, indirect performance coupling through shared resources, performance constraint; coordinated energy management; axes: performance, power efficiency]
• HPC applications have uncompromising performance requirements!
• Need more efficient energy management

State of the Art Power Management

State-of-the-art: Bi-directional Application Power Management (BAPM)
• Chip is divided into BAPM-controlled thermal entities (TEs): CU0 TE, CU1 TE, GPU TE
Power management algorithm
1. Calculate digital estimate of power consumption
2. Convert power to temperature: RC network model for heat transfer
3. Assign new power budgets to TEs based on temperature headroom
4. TEs locally control (boost) their own DVFS states to maximize performance

Power Management
• Performance and energy efficiency depend on effective utilization of power and thermal headroom
• Convert thermal headroom to higher performance through boost
[Figure: APU die temperature and thermal headroom vs. time; APU performance (instructions/cycle) vs. time; CPU DVFS states Pb0, Pb1 (HW only, boost) and P0, P1, P2, ..., Pmin (SW visible); GPU DVFS states High, Medium, Low (HW only)]

Key Observations
• Overall application performance is a function of both the CPU and the GPU
• State of the practice: manage to thermal limits by locally boosting when power and thermal headroom are available → utilize all of the available headroom
• Pitfall: boosting may not lead to proportional performance improvement → energy inefficient
• Need a concept of performance sensitivity to power states

Application Characteristics

Frequency Sensitivity of GPU Kernels
[Figure: % increase in run-time (0% to 160%) at DVFS-high, DVFS-med and DVFS-low for each miniMD kernel (Total, Force, Neighbor, Comm, Other) under GPU DVFS]
• Some kernels are more sensitive to GPU frequency than others → more power efficient

Sensitivity of GPU Kernel Execution to CPU Frequency
[Figure: % increase in run-time (0% to 50%) at CPU P-states P0 to P4 for each miniMD kernel (Total, Force, Neighbor, Comm, Other) under CPU DVFS]
• Some kernels are more tightly coupled to the CPU's performance
• Smaller kernels such as Comm have high overheads in launching and feeding the GPU

Sensitivity to Shared Resource Interference
[Figure: miniMD Neighbor kernel; normalized metric (0.00 to 1.00) for memory bandwidth breakdown (GPU_Mem_BW/Pb1, CPU_Mem_BW/Pb0) and CPU DVFS residency]
• Performance actually limited by GPU memory demand
• Power management locally boosts CPU to highest DVFS states
• Wasted energy → power inefficient
• Need online estimates of sensitivity to interference

Computation and Control Divergence
[Figure: graph algorithm BFS; percentage metric (0.00 to 0.80) for GPU_freq_sensitivity (measured) and GPU_ALUBusy%]
• GPU_freq_sensitivity: unit performance gain for unit frequency increase
• GPU_ALUBusy%: measured hardware compute utilization
• Control divergence → increased thread serialization → increased frequency sensitivity

Key Observations
• HPC applications exhibit varying degrees of CPU and GPU frequency sensitivities due to
  - Control divergence
  - Interference at shared resources
  - Performance coupling between CPU and GPU
• Efficient energy management requires metrics that can predict frequency sensitivity (power) in heterogeneous processors
• Sensitivity metrics drive the coordinated setting of CPU and GPU power states

Energy Management

Performance Metrics for APU Frequency Sensitivity
• Metric categories: GPU compute, interference, performance coupling, CPU compute
• Linear regression model using the above metrics to compute measur

DynaCo: Run-time System for Coordinated Energy Management
[Figure: performance metric monitor → CPU-GPU frequency sensitivity computation → CPU-GPU power state decision]

GPU frequency sensitivity   CPU frequency sensitivity   Decision
High                        Low                         Shift power to GPU
High                        High                        Proportional power allocation
Low                         High                        Shift power to CPU
Low                         Low                         Reduce power of both CPU and GPU

• DynaCo-1levelTh: lowest CPU DVFS state limited to P2
• DynaCo-multilevelTh: lowest CPU DVFS state allowed to use up to Pmin based on degree of performance coupling

Key Observations
• Coordinated CPU-GPU execution
• Linear combination of three key high level performance metrics proposed to model APU frequency sensitivity behavior
• Run-time coordinated energy management scheme DynaCo to manage CPU and GPU DVFS states dynamically based on measured frequency sensitivities

Experimental Set-Up
• Trinity A10-5800 APU: 100W TDP
• CPU: managed by HW or SW

CPU P-state   Voltage (V)   Freq (MHz)   Managed by
Pb0           1             2400         HW only (boost)
Pb1           0.875         1800         HW only (boost)
P0            0.825         1600         SW visible
P1            0.812         1400         SW visible
P2            0.787         1300         SW visible
P3            0.762         1100         SW visible
P4            0.75          900          SW visible

• GPU: managed by sending software messages through driver layer

GPU P-state   Freq (MHz)
GPU-high      800
GPU-med       633
GPU-low       304

• DynaCo implemented as a run-time software policy overlaid on top of BAPM in real hardware

Benchmarks

Benchmark   Problem size
miniMD      32 x 32 x 32 elements
miniFE      100 x 100 x 100 elements
Lulesh      100 x 100 x 100 elements
Sort        2,097,152 elements
Stencil2D   4,096 x 4,096 elements
S3D         SHOC default for integrated GPU
BFS         1,000,000 nodes

Energy Efficiency (ED^2 product)
[Figure: normalized ED^2 product (0.20 to 1.00) for DynaCo-1levelTh, DynaCo-multilevelTh and Ideal-static]
• Average energy efficiency improvement of 24% and 30% with DynaCo-1levelTh and DynaCo-multilevelTh respectively

Execution Time Impact
[Figure: increase in run-time (0.90 to 1.06) for DynaCo-1levelTh, DynaCo-multilevelTh, Ideal-static and Baseline]
• Average performance slowdown of 0.78% and 1.61% with DynaCo-1levelTh and DynaCo-multilevelTh respectively

Power Savings
[Figure: power savings (0% to 60%) for DynaCo-1levelTh, DynaCo-multilevelTh and Ideal-static]
• Average power savings of 24% and 31% with DynaCo-1levelTh and DynaCo-multilevelTh respectively

Conclusions
• Note effects of shared resource interference, control divergence and performance coupling on energy management for HPC applications
• Importance and scope of frequency sensitivity in characterizing energy behaviors in tightly coupled heterogeneous architecture
• Dynamic power shifting: shift power to the entity that can best utilize it

Cooperative Boosting: Needy versus Greedy Power Management
Indrani Paul (1,2), Srilatha Manne (1), Manish Arora (1,3), W. Lloyd Bircher (1), Sudhakar Yalamanchili (2)
June 2013
(1) Advanced Micro Devices, Inc.
(2) Georgia Institute of Technology
(3) University of California, San Diego

Goal & Outline
• Goal:
  - Optimize performance under power and thermal constraints in heterogeneous architecture
• Outline:
  - State-of-the-Art Power and Thermal Management
  - Thermal Coupling
  - Performance Coupling
  - Cooperative Boosting
  - Results

State-of-the-art Heterogeneous Processor
• Accelerated processing unit (APU)
  - Multi-threaded CPU cores
  - Graphics processing unit (GPU): 384 AMD Radeon™ cores
  - Shared Northbridge → access to overlapping CPU/GPU physical address spaces
• Many resources are shared between the CPU and GPU, for example memory hierarchy, power, and thermal capacity

What is Thermal Design Power?
• Thermal design power (TDP)
  - Upper bound for the sustainable power draw
  - Determines the cooling solution and package limits
  - Usually set by determining worst-case execution profile
• Performance depends on effective utilization of thermal headroom
[Figure: heat sink (www.legitreviews.com); instructions/cycle vs. time]

Key Observations
• Power and thermals are shared resources in a heterogeneous processor → thermal coupling
• Overall application performance is a function of both the CPU and the GPU → performance coupling
• State of the practice: managing to thermal limits by locally boosting when thermal headroom is available → utilize all of the headroom!
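The "boost whenever headroom is available" practice described above can be sketched with a single lumped thermal node. This is a deliberately simplified illustration and is not BAPM, which the slides describe as using an RC network across several thermal entities. The thermal resistance, capacitance, ambient temperature, temperature limit and boost power below are all assumed values. Only the 19 W base power echoes a TDP figure from the deck.

```python
def step_temperature(temp_c, power_w, dt_s, r_th=2.0, c_th=5.0, ambient_c=45.0):
    """One forward-Euler step of C * dT/dt = P - (T - T_ambient) / R.

    r_th (K/W), c_th (J/K) and ambient_c are assumed values for illustration.
    """
    heat_out_w = (temp_c - ambient_c) / r_th
    return temp_c + dt_s * (power_w - heat_out_w) / c_th


def greedy_boost(steps, dt_s=0.1, limit_c=90.0, boost_w=30.0, base_w=19.0,
                 ambient_c=45.0):
    """Greedy local policy: run at boost power whenever below the limit.

    Returns a list of (power_w, temp_c) samples, one per time step.
    """
    temp_c = ambient_c
    trace = []
    for _ in range(steps):
        power_w = boost_w if temp_c < limit_c else base_w
        temp_c = step_temperature(temp_c, power_w, dt_s, ambient_c=ambient_c)
        trace.append((power_w, temp_c))
    return trace


if __name__ == "__main__":
    trace = greedy_boost(steps=400)
    first_throttle = next(i for i, (p, _) in enumerate(trace) if p == 19.0)
    print("boost sustained for", first_throttle * 0.1, "seconds")
    print("peak temperature", max(t for _, t in trace))
```

With these assumed parameters the node starts cold, boosts until its stored thermal headroom is used up, and then oscillates around the limit. A greedy policy of this kind consumes all headroom regardless of whether the extra frequency actually improves application performance, which is the inefficiency the following slides examine.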
Thermal Coupling

Thermal Signatures: CPU & GPU
Steady-state thermal fields produced by BAPM on a 19W AMD Trinity APU
[Figure: thermal maps for a high-power GPU benchmark and for a high-power CPU benchmark with idle GPU]
• Worst-case GPU: 19.7 W
• Worst-case CPU: 18.8 W
• Higher thermal density of CPUs → steeper thermal gradients
• Faster consumption of thermal headroom on the CPU

Thermal Time Constant
[Figure: two panels of normalized peak temperature (0.6 to 1.05) vs. time (1 to 301 sec) for CPU temp and GPU temp: running a 100% CPU workload with the GPU idle; running a 100% GPU workload (CPU cycles only to feed the GPU)]
• Idle GPU temperature rose by ~20°C
• Significant rise in temperature of the idle component due to thermal coupling and pollution from the active components within a die
• CPU consumes thermal headroom more rapidly (4X faster) → GPU can sustain higher power boosts longer

Thermal Coupling: Headroom Availability
[Figure: CPU & GPU relative power (GPU Pow, CPU CU0 Pow, CPU CU1 Pow; 0.5 to 3.5) and peak die temperature (0 to 1) vs. time (seconds), with regions labeled thermal coupling and temp throttling]
• CPU power is limited, GPU running at max DVFS state

Thermal Coupling: Consumption of Thermal Headroom
[Figure: same plot as on the previous slide]
• 6°C rise in GPU temperature once CPU power limit was removed and both CUs were allowed to boost

Thermal Coupling: Thermal Throttling
[Figure: same plot as on the previous slide]
• Minimize detrimental effects of thermal coupling by capping maximum CPU P-state → P-state limiting

Residency in Different Power States
[Figure: three panels of % DVFS residency (GPU-low, GPU-med, GPU-high; 0% to 100%) and normalized peak temperature (0.8 to 1) vs. time (1 to 251): BAPM; capping the max CPU DVFS state at P2; capping the max CPU DVFS state at P4]

Key Observations
• Thermal signatures differ between CPU and GPU
  - Heterogeneity in physical properties
• High thermal density leads to faster consumption of thermal headroom in the CPU cores
• Significant thermal coupling from active to idle components
• Near the thermal limit, boosting based on available thermal headroom introduces inefficiencies
  - Reduce the CPU P-state limit

Performance Coupling

Programming Model
[Figure: user application (host tasks, GPU tasks) on top of OpenCL™ or other software stack, operating system, and APU hardware (CPU, GPU)]
• Each OpenCL kernel: N-dimensional range, a grid of threads, each operating over a data partition
• Coupled programming model → offload compute intensive tasks to the GPU

Managing Thermals for Performance-coupled Applications
[Figure, built up over three slides: Binary Search; % DVFS residency (GPU-low, GPU-med, GPU-high; 0% to 100%), speedup and normalized GPU active time (normalized metric, 0.2 to 1.2) vs. CPU P-state limit; the final build marks a "CPU thermally limiting" region and a "CPU performance limiting" region]

P-state Sensitivity
[Figure: Needle; % DVFS residency (0% to 100%) and normalized metric (0.8 to 1.3) vs. CPU P-state limit]

Determining Critical CPU P-state
• Find the inflection point in performance as a function of CPU P-state → critical P-state
• Critical P-state is determined by interference (CPU vs. GPU) in the memory system
[Figure: % DVFS residency (GPU-low, GPU-med, GPU-high), speedup and normalized GPU active time vs. CPU P-state limit (Baseline, Pb1, P0 to P4), with the critical CPU P-state limit marked; % increase over baseline (-40% to 20%) in memory bandwidth and performance vs. CPU P-state limit (Pb1, P0 to P4)]

Key Observations
• Performance coupling: CPU-GPU performance dependency
• Balance between detrimental effects of thermal coupling and needs of performance coupling
• CPU critical P-state limit is determined by performance coupling and thermal coupling
• GPU memory bandwidth gradients as a function of CPU frequency along with CPU IPC serve as a measure of performance coupling

Cooperative Boosting

Cooperative Boosting (CB)
• Overlaid on top of BAPM; invoked periodically when thermal coupling is detrimental, i.e., when the thermal limit is approached

Experimental Set-up
• Trinity A8-4555M APU: 19W TDP
• CPU: managed by HW or SW

CPU P-state   Voltage (V)   Freq (MHz)   Managed by
Pb0           1             2400         HW only (boost)
Pb1           0.875         1800         HW only (boost)
P0            0.825         1600         SW visible
P1            0.812         1400         SW visible
P2            0.787         1300         SW visible
P3            0.762         1100         SW visible
P4            0.75          900          SW visible

• GPU: managed by HW only
  - GPU-high: 423 MHz
  - GPU-med: 320 MHz
• Cooperative Boosting implemented as a system software policy overlaid on top of BAPM in real hardware

Benchmarks

Benchmark (description)             Problem size                                     Type
NDL (Needleman-Wunsch)              4096x4096 data points, 1K iterations             Performance-coupled
HS (HotSpot)                        1024x1024 data points, 100K iterations           Performance-coupled
BF (BoxFilter SAT)                  1Kx1K input image, 6x6 filter, 10K iterations    Performance-coupled
FAH (Folding at Home)               Synthesis of large protein: spectrin             Performance-coupled
BS (Binary Search)                  4096 inputs, 256 segments, 1M iterations         Performance-coupled
Viewdle (Haar facial recognition)   Image 1920x1080, 2K iterations                   Performance-coupled
Lbm (CPU2006)                       4 threads, Ref input                             CPU-centric
Gcc (CPU2006)                       4 threads, Ref input                             CPU-centric

Performance Improvement with Cooperative Boosting
[Figure: speedup (0.50 to 1.40) for P0, P2, P4, CB and Baseline across NDL, HS, BF, FAH, BS, Viewdle, Lbm, Perl and MEAN; data labels in extraction order: 1.36, 1.28, 1.10, 1.13, 1.10, 1.10, 1.04, 1.00, 0.99]
• Static P-state limiting requires profiling and a priori information of workload
• An average of 15% performance gain for performance-coupled applications with CB

Power Savings
[Figure: % of power savings over baseline (0% to 40%) with CB across NDL, HS, BF, FAH, BS, Viewdle, Lbm, Gcc and MEAN]
• Average 10% power savings across performance-coupled applications
• 5°C reduction in peak temperature for BS → large percentage of leakage power savings

Energy*Delay^2
[Figure: normalized metric (0.00 to 3.00) for P0, P2, P4, CB and Baseline across NDL, HS, BF, FAH, BS, Viewdle, Lbm, Gcc and MEAN]
• Average 33% energy-delay^2 savings across performance-coupled applications

Conclusions
• Effects of thermal and performance coupling on performance
  - Applications with high GPU compute-to-load ratio are more susceptible to detrimental effects of thermal coupling
  - Emergent balanced workloads with split CPU-GPU computation are tightly performance-coupled
• Cooperative Boosting (CB): balance effects of thermal coupling with needs of performance coupling
  - Shifts power to CPU only when needed
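The idea of a critical CPU P-state can be made concrete with a small selection routine. This is a hedged sketch of one plausible reading of "find the inflection point", not the Cooperative Boosting implementation. The function name `critical_p_state`, the tolerance parameter and the sample speedups are all hypothetical. The rule used here is: among the P-state limits whose measured speedup is within a tolerance of the best, pick the lowest-frequency one, since that leaves the most thermal headroom for the GPU.

```python
# CPU P-state limits ordered from highest to lowest frequency, as in the slides.
P_STATE_ORDER = ["Pb1", "P0", "P1", "P2", "P3", "P4"]


def critical_p_state(speedup_by_limit, tolerance=0.01):
    """Pick the deepest P-state limit whose speedup is within `tolerance` of the best.

    speedup_by_limit maps a P-state limit name to a measured speedup over the
    baseline. The selection rule is an assumption made for this sketch.
    """
    if not speedup_by_limit:
        raise ValueError("need at least one measurement")
    best = max(speedup_by_limit.values())
    candidates = [
        state for state in P_STATE_ORDER
        if state in speedup_by_limit and speedup_by_limit[state] >= best - tolerance
    ]
    # The last candidate has the lowest CPU frequency and power.
    return candidates[-1]


if __name__ == "__main__":
    # Invented numbers: speedup rises while the CPU is thermally limiting the
    # GPU, then falls once the CPU becomes the performance bottleneck.
    measured = {"Pb1": 1.00, "P0": 1.05, "P1": 1.08, "P2": 1.10, "P3": 1.02, "P4": 0.85}
    print("critical CPU P-state limit:", critical_p_state(measured))
```

A static choice like this needs per-workload profiling, which the slides name as the drawback of static P-state limiting. A run-time policy would have to re-evaluate the choice as the workload's phases change.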
