Download Thermal Management Issues (MICRO-35 Tutorial)

Overview 1. 2. 3. 4. 5. 6. 7. 8. 9. Motivation (Kevin) Thermal issues (Kevin) Power modeling (David) Thermal management (David) Optimal DTM (Lev) Clustering (Antonio) Power distribution (David) What current chips do (Lev) HotSpot (Kevin) 1 Power modeling • Research Power Simulators – – – – – • Wattch – Brooks and Martonosi ISCA2000 SimplePower – Vijaykrishnan et al (Penn State) ISCA2000 TEMPEST – Dhodapkar et al (Intel/Wisconsin) PowerAnalyzer – Umich/Colorado AccuPower – SUNY Binghamton Industry Power Simulators – – IBM PowerTimer – Brooks and Bose PACS2000 Intel ALPS – Gunther, et al. 2 Power: The Basics • Dynamic power vs. Static power – Dynamic: “switching” power – Static: “leakage” power – Dynamic power dominates, but static power increasing in importance – Trends in each • Static power: steady, per-cycle energy cost • Dynamic power: capacitive and shortcircuit • Capacitive power: charging/discharging at transitions from 01 and 10 • Short-circuit power: power due to brief short-circuit current during transitions. • Mostly focus on capacitive, but recent work on others 3 Capacitive Power dissipation Vdd Vin Vout CL Capacitance: Function of wire length, transistor size Supply Voltage: Has been dropping with successive fab generations Power ~ ½ CV2Af Activity factor: How often, on average, do wires switch? Clock frequency: Increasing… 4 Short-Circuit Power Dissipation ISC VIN VOUT CL • Short-Circuit Current caused by finiteslope input signals • Direct Current Path between VDD and GND when both NMOS and PMOS transistors are conducting 5 Leakage Power VIN VOUT ISub IGate CL IDSub  k  e  qVT akaT • Subthreshold currents grow exponentially with increases in temperature, decreases in threshold voltage 6 Modeling Hierarchy and Tool Flow Energy Models microarch level refine, update RTL level gate-level ckt-level set of workloads Performance Test Cases Early analytical performance models Trace/exec-driven, cycle-accurate simulation models Microarch parms/specs RTL MODEL (VHDL) (Architectural) Sim Test Cases RTL sim gate-level model (if synthesized) Circuit-level (hierarchical) netlist model edit/debug edit/debug Bitvector test cases ckt sim, extract edit/tune/ debug Design rules layout-level Layout-level physical design model Cap extract, sim design rule check, validate 7 Analysis Abstraction Levels Abstraction Level Analysis Analysis Analysis Analysis Energy Capacity Accuracy Speed Resources Savings Most Worst Fastest Least Least Best Slowest Most Most Application Behavioral Architectural (RTL) Logic (Gate) Transistor (Circuit) Least 8 Power/Performance abstractions • Low-level: – Hspice – PowerMill • Medium-Level: – RTL Models • Architecture-level: – – – – – PennState SimplePower Intel Tempest Princeton Wattch IBM PowerTimer Umich/Colorado PowerAnalyzer 9 Low-level models: Hspice • Extracted netlists from circuit/layout descriptions – Diffusion, gate, and wiring capacitance is modeled • Analog simulation performed – Detailed device models used – Large systems of equations are solved – Can estimate dynamic and leakage power dissipation within a few percent – Slow, only practical for 10-100K transistors • PowerMill (Synopsys) is similar but about 10x faster 10 Medium-level models: RTL • Logic simulation obtains switching events for every signal • Structural VHDL or verilog with zero or unit-delay timing models • Capacitance estimates performed – Device Capacitance • Gate sizing estimates performed, similar to synthesis – Wiring Capacitance • Wire load estimates performed, similar to placement and routing • Switching event and capacitance estimates provide dynamic power estimates 11 Architecture level models Power ~ ½ CV2Af • Bottom-up Approach: – Estimate “CV2f” via analytical models – Tools: Wattch, PowerAnalyzer, Tempest (mixedmode) • Top-Down Approach – Estimate “CV2f” via empirical measurements – Tools: PowerTimer, AccuPower, Most Industrial Tools • Estimate “A” via statistics from architectural-performance simulators 12 Analytical Models: Capacitance • Requires modeling wire length and estimating transistor sizes • Related to RC Delay analysis for speed along critical path – But capacitance estimates require summing up all wire lengths, rather than only an accurate estimate of the longest one. 13 Register File: Capacitance Analysis Bit Decoders Pre-Charge Bit Cell Access Transistors (N1) Wordlines Cell (Number of Entries) Sense Amps Bitlines (Data Width of Entries) Number of Ports Number of Ports Cwordline  CdiffcapW ordlineDriver  NumberBitlines * CgatecapN1  Wordlinele ngth * Cmetal Cbitline  CdiffcapPchg  NumberWordlines * CdiffcapN1  Bitlinelen gth * Cmetal 14 Register File Model: Accuracy Error Rates Wordline(r) Wordline(w) Bitline(r) Bitline(w) Gate 1.11 -6.37 2.82 -10.96 Diff 0.79 0.79 -10.58 -10.60 InterConn. 15.06 -10.68 -19.59 7.98 Total 8.02 -7.99 -10.91 -5.96 (Numbers in Percent) • Validated against a register file schematic used in internal Intel design • Compared capacitance values with estimates from a layout-level Intel tool • Interconnect capacitance had largest errors – Model neglects poly connections – Differences in wire lengths -- difficult to tell wire distances of schematic nodes 15 Different Circuit Design Styles • RTL and Architectural level power estimation requires the tool/user to perform circuit design style assumptions – Static vs. Dynamic logic – Single vs. Double-ended bitlines in register files/caches – Sense Amp designs – Transistor and buffer sizings • Generic solutions are difficult because many styles are popular • Within individual companies, circuit design styles may be fixed 16 Clock Gating: What, why, when? Clock Gated Clock Gate • Dynamic Power is dissipated on clock transitions • Gating off clock lines when they are unneeded reduces activity factor • But putting extra gate delays into clock lines increases clock skew • End results: – Clock gating complicates design analysis but saves power. 17 Wattch: An Overview Wattch’s Design Goals • Flexibility • Planning-stage info • Speed • Modularity • Reasonable accuracy Overview of Features • Parameterized models for different CPU units – Can vary size or design style as needed • Abstract signal transition models for speed – Can select different conditional clocking and input transition models as needed • Based on SimpleScalar (has been ported to many simulators) • Modular: Can add new models for new units studied 18 Unit Modeling Bitline Activity Number of Active Ports Number of entries Data width of entries # Read Ports Parameterized Register File Power Model Power Estimate # Write Ports Modeling Capacitance Modeling Activity Factor • Models depend on • Use cycle-level simulator to structure, bitwidth, design determine number and type style, etc. of accesses • E.g., may model – reads, writes, how many capacitance of a register ports file with bitwidth & number • Abstract model of bitline of ports as input activity parameters 19 One Cycle in Wattch Power (Units Accessed) Fetch Dispatch Issue/Execute Writeback/ Commit  I-cache  Bpred  Rename Table  Inst. Window  Reg. File        Result Bus  Reg File  Bpred Performance  Cache Hit?  Bpred Lookup?  Inst. Window Full? Inst. Window Reg File ALU D-Cache Load/St Q Dependencies Satisfied?  Resources?  Commit Bandwidth? • On each cycle: – determine which units are accessed – model execution time issues – model per-unit energy/power based on which units used and how many ports. 20 Units Modeled by Wattch  Array Structures  Caches, Reg Files, Map/Bpred tables  Content-Addressable Memories (CAMs)  TLBs, Issue Queue, Reorder Buffer  Complex combinational blocks  ALUs, Dependency Check  Clocking network  Global Clock Drivers, 21 Local Buffers PowerTimer • IBM Tool First Develop During Summer of 2000 – Continued Development: 2001 => Today – Methodology Applied to Research and Product Power-Performance Simulators with IBM – Currently in Beta-Release – Working towards Full Academic Release 22 PowerTimer: Empirical Power Clock Tree 10% L3 Tags 2% IDU FXU 3% 4% Other 10% IFU 6% ISU 10% L2 23% CIU 4% FBC 3% Map Tables 43% Issue Queues 32% Completion Table 9% Dispatch 6% LSU 19% GX ZIO RAS 1% 4% 5% Core Buffer 1% FPU 5% Pre-silicon, POWER4-like superscalar design 23 Processor Power Density 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 IFU IDU ISU FXU LSU FPU L2 L3 Tag BHT Icache FXIssueQ Pre-silicon, POWER4-like superscalar design Originally presented at PACS2002 24 PowerTimer Circuit Power Data (Macros) SubUnit Power = f(SF, uArch, Tech) Tech Parms Compute Sub-Unit Power uArch Parms Power AF/SF Data Program Executable or Trace Architectural Performance Simulator CPI 25 PowerTimer: Energy Models • Energy models for uArch structures formed by summation of circuit-level macro data Energy Models Sub-Units (uArch-level Structures) Power=C1*SF+HoldPower SF Data Power=C2*SF+HoldPower Macro1 Macro2 Power=Cn*SF+HoldPower MacroN Power Estimate 26 Empirical Estimates with CPAM • Estimate power under “Input Hold” and “Input Switching” Modes • Input Hold: All Macro Inputs (Except Clocks) Held – Can also collect data for Clock Gate Signals • Input Switching: Apply Random Switching Patterns with 50% Switching on Input Pins Macro Inputs Macro • 0% Switching (Hold Power) • 50% Switching Power 27 Example Unit • Made up of 5 macros 800 700 macro1 macro2 macro3 macro4 macro5 total mW 600 500 400 300 200 100 0 0 10 20 30 40 50 SF 28 PowerTimer: Models f(SF) Assumption: Power linearly dependent on Switching Factor This separates Clock Power and Switching Power 1400 1200 Switching Power Unit1 Unit2 Unit3 Unit4 Unit5 mW 1000 800 600 400 Clock Power 200 0 0 10 20 30 40 50 SF At 0% SF, Power = Clock Power (significant without clock gating) 29 Key Activity Data 1400 Changes in SF 1200 Unit1 Unit2 Unit3 Unit4 Unit5 mW 1000 Changes in AF 800 600 400 200 0 0 10 20 30 40 50 SF • SF => Moves along the Switching Power Curve – Estimated on a per-unit basis from RTL Analysis • AF => Moves along the Clock Power Curve – Extracted from Microarchitectural Statistics (Turandot) 30 Microarchitectural Statistics • Stats are very similar to tracking used in Wattch, etc • Differences: – Clock Gating Modes (3 modes) – Customized Scaling Based on Circuit Style (4 styles) • Clock Gating Modes: – – – – P_constrained = P_unconstrained (not clock-gateable) P_constrained_1 = AF * (Pclock + Plogic) (common) P_constrained_2 = AF * Pclock + Plogic (rare) P_constrained_3 = Pclock + AF * Plogic (very rare) • Scaling Based on Circuit Styles – AF_1 = #valid Gating) – AF_2 = #valid - #stalls Stall Gating) – AF_3 = #writes updates) – AF_4 = #writes + #reads (Latch-and-Mux, No Stall (Latch-and-Mux, With (Arrays that only gate (Arrays, RAM Macros) 31 Clock Gating: Valid-Bit Gating • Latch-Based Structures: Execute Pipelines, Issue Queues Clock V V V V V V 32 Clock Gating Modes • P_constrained_1 = AF * (Pclock + Plogic) clock valid Plogic Pclock • P_constrained_2 = AF * Pclock + Plogic clock valid Selection Logic Pclock Plogic 33 Valid-bit Gating, Stalls? • Option 1: Stalls cannot be gated clk valid Data From Previous Pipestage Stall From Previous Pipestage Data For Next Pipestage • Option 2: Stalls can be gated clk valid Data From Previous Pipestage Stall From Previous Pipestage Data For Next Pipestage 34 Scaling: Array Structures • Option 1: Reads and Writes Eligible to Gate for Power Write Bitline Read Bitline read_wordline_active read_gate write_wordline_active write_gate Cell read_data write_gate write_data 35 Scaling: Array Structures • Option 2: Only Writes Eligible to Gate for Power Write Bitline read_entry_n read_entry_2 write_wordline_active write_gate read_entry_1 read_data Cell read_entry_0 write_gate write_data 36 12 Clock Gating Modes Gating Mode Valid Valid+ Stalls Writes Writes+ Reads Gat Gate e Cloc Both k Gate Examples Logic 0 No No No No No No No Control Logic, Buffers, Small Macros 1 Yes No No No Yes No No 2 No Yes No No Yes No No Issue Queues, Execute Pipelines 3 No No Yes No Yes No No Caches 4 No No No Yes Yes No No Some Queues 5 Yes No No No No Yes No CAMs, Selection Logic 6 No Yes No No No Yes No 7 No No Yes No No Yes No No Known macros 8 No No No Yes No Yes No No Known macros 9 Yes No No No No No Yes No Known macros 10 No Yes No No No No Yes No Known macros 11 No No Yes No No No Yes No Known macros 12 No No No Yes No No Yes No Known macros PowerTimer Observations • PowerTimer works well for POWER4like estimates and derivatives – Scale base microarchitecture quite well – E.g. optimal power-performance pipelining study – Lack of run-time, bit-level SF not seen as a problem within IBM (seen as noise) • Chip bit-level SFs are quite low (5-15%) • Most (60-70%) power is dissipated while maintaining state (arrays, latches, clocks) • Much state is not available in early-stage timers 38 Comparing models: Flexibility • Flexibility necessary for certain studies – Resource tradeoff analysis – Modeling different architectures • Purely analytical tools provides fullyparameterizable power models – Within this methodology, circuit design styles could also be studied • PowerTimer scales power models in a user-defined manner for individual subunits – Constrained to structures and circuit-styles currently in the library • Perhaps Mixed Mode tools could be very useful 39 Comparing models: Accuracy • PowerTimer -- Based on validation of individual pieces – Extensive validation of the performance model (AFs) – Power estimates from circuits are accurate – Circuit designers must vouch for clock gating scenarios – Certain assumptions will limit accuracy or require more in-depth analysis • Analytical Tools – Inherent Issues • Analytical estimates cannot be as accurate as SPICE analysis (“C” estimates, CV2 approximation) – Practical Issues • Without industrial data, must estimate transistor sizing, bits per structure, circuit choices 40 Comparing models: Speed • Performance simulation is slow enough! • Post-Processing vs. Run-Time Estimates • Wattch’s per-cycle power estimates: roughly 30% overhead – Post-processing (per-program power estimates) would be much faster (minimal overhead) • PowerTimer allows both no overhead postprocessing and run-time analysis for certain studies (di/dt, thermal) – Some clock gating modes may require run-time analysis • Third Option: Bit Vector Dumps – Flexible Post-Processing  Huge Output Files 41 Static Power Dissipation • Static power: dissipation due to leakage current • Growing worse because as Vdd scales, Vth must also scale to maintain switching speeds – “Off” transistors are no longer fully cut off 42 Leakage Power • The fraction of leakage power is increasing exponentially with each generation • Also exponentially dependent on temperature Increasing ratio across generations Static power/ Dynamic Power 70 Percentage 60 50 40 30 20 10 373 368 363 358 353 348 343 338 333 328 323 318 313 308 303 298 0 Temperature(K) 180nm 130nm 100nm 90nm 80nm 70nm Source: Sankaranarayanan et al, University of Virginia, based on ITRS 2001 data 43 Static Power - HotLeakage • Main mechanisms for leakage current – Subthreshold (Berkely predictive model): I leakage   0  COX Vdd    Vth0  Voff W ab*(Vdd Vdd0 ) 2  vt   e  vt  1  e  exp    L  n  vt       – Gate • Igate = Igate0 * exp(a*(tox-tox0)) * exp(b*(vdd-vdd0)) • There is also a weak temp dependence, ignored here • Gate leakage has essentially been ignored – New gate insulation materials may reduce urgency of problem, eg recent Intel announcement • R. Chau, Technology@intel Magazine. www.intel.com • Gate-induced drain leakage (GIDL) occurs at negative gate voltages and high Vdd or high values of reverse body bias – New leakage path opens! 44 Effects of Parameter Variations • Ioff depends exponentially on Vth • There is a large fluctuation of Ioff from die to die and from gate to gate • Controlling Vth is difficult in nanometer scale – Drain-induced barrier lowering • Channel length is not constant • Exacerbated in sub-100nm devices – Discrete dopant effects • In a very small channel, small number of dopants • Presence of these dopants and random fluctuation of their number, lead to changes in Vth from device to device • Process variation affects – Gate length (Ldrawn) – Gate oxide thickness (Tox) – Channel dose (Nsub) Srivastava et al, ISLPED 2002 45 Static Power Modeling Leakage – Butts and Sohi (MICRO-33) • Pstatic = Vcc · N · kdesign · Îleak • Îleak determined by circuit simulation, kdesign empirically • Key contribution: separate technology from design – HotLeakage (UVA TR CS-2003-05, DATE’04) • Extension of Butts & Sohi approach: scalable with Vdd, Vth, Temp, and technology node; adds gate leakage • Îleak determined by BSIM3 subthreshold equation and BSIM4 gate-leakage equations, giving an analytical expression that accounts for dependence on factors that may change at runtime, namely Vdd, Vth, and Temp • kdesign replaced by separate factors for N- and P-type transistors • kdesign also exponentially dependent on Vdd and Tox, linearly dependent on Temp • Currently integrated with SimpleScalar/Wattch for caches • http://lava.cs.virginia.edu/HotLeakage 46 Static Power • Modeling Leakage (cont.) – Su et al, IBM (ISLPED’03) • Similar approach to HotLeakage – but they observe that modeling the change in leakage allows linearization of the equations – Many, many other papers on various aspects of modeling different aspects of leakage • Most focus on subthreshold • Few suggest how to model leakage in microarchitecture simulations 47 Power modeling summary • Wattch provides excellent relative accuracy – Underestimates full chip power (some units not modeled, etc) • PowerTimer models based on circuit-level power analysis – Inaccuracy is introduced in SF/AF and scaling assumptions • Modeling leakage is essential for thermal research – “Thermal runaway” is even possible 48 Overview 1. 2. 3. 4. 5. 6. 7. 8. 9. Motivation (Kevin) Thermal issues (Kevin) Power modeling (David) Thermal management (David) Optimal DTM (Lev) Clustering (Antonio) Power distribution (David) What current chips do (Lev) HotSpot (Kevin) 49 Existing Work • Research Ideas – DEETM – Huang and Torrellas MICRO2000 – DTM – Brooks and Martonosi HPCA2001 – Control-Theoretic DTM – Skadron, Abdelzaher, Stan HPCA2002 – Thermal Scheduling – Cai, Lim, Daasch WCED2002 • Commercial Products – PowerPC G3 Microprocessor – Pentium III – Pentium 4 50 Overview • Hard to optimize power-performance at design time for all cases • Forces conservative choices for issues like cooling, current delivery, resource sizes • Want to explore dynamic power optimizations for run-time power management – – – – – – Dynamic Voltage/Frequency Scaling [Burd, 2000] Dynamic Hardware Resizing [Albonesi, 1999] Fetch Throttling [Sanchez, 1997] Global Clock Gating [Gunther, 2001] Speculation Control [Manne, 1998] Dynamic Thermal Management [Brooks, 2001][Huang, 2000] 51 Relative Power Important to optimize P & T early 12FO4 5 4.5 4 3.5 3 2.5 2 1.5 1 0.5 0 tradeoff via changing Vdd and HI tradeoff via changing frequency tradeoff via changing pipeline depth 14FO4 18FO4 0.7 0.8 0.9 Maximum Power Budget 23FO4 1 1.1 1.2 1.3 Relative 1/Performance 1.4 1.5 1.6 52 Dynamic Thermal Management • Goal: – Provide dynamic techniques to cool chip when needed – Exploit natural variations due to different applications, phase behavior, … – Allow designers to target average, rather than worst-case behavior • Design Decisions: – Mechanism & policy for triggering response? – What should response be? – How to select DTM trigger levels? 53 Power consumption impacts cost CPU From: Gunther, et al. “Managing the Impact of Increasing Microprocessor Power Consumption,” Intel Technology Journal, Q1, 2001 • System costs associated with power dissipation: – Thermal control cost • Heatsinks, fans – Power delivery • Power supply • Decoupling caps… 54 Average and Worst Case Power 100 Max 80 Avg 60 40 20 0 Alpha 21264 Intel PPro • System costs are constrained by worst case power dissipation • Average case power dissipation can often be much lower – Aggressive Clock Gating – Applications variations – Underutilized resources • Not enough ILP • Floating point units during integer code execution • Currently about a 30% difference • Likely to further diverge… 55 Dynamic Thermal Management Temperature Designed for Cooling Capacity w/out DTM Designed for Cooling Capacity w/ DTM System Cost Saving DTM Trigger Level DTM Disabled DTM/Response Engaged Time 56 DTM: Definitions Trigger Reached Turn Response On Initiation Response Delay Delay Check Temp Policy Delay Check Temp Turn Response Off Shutoff Delay  Initiation Delay – OS interrupt/handler  Response Delay – Invocation time (e.g. adjust clock)  Policy Delay – Number of cycles engaged  Shutoff Delay – Disabling time (e.g. re-adjust clock) 57 DTM: When, How, and What Trigger Mechanism: When do we enable DTM techniques? Initiation Mechanism: How do we enable technique? Response Mechanism: What technique do we enable? 58 DTM: Trigger Mechanisms • Mechanism: How to  Policy: When to begin deduce temperature? responding?  Trigger level set too high: • Direct approach: Packaging cost will be high Temperature sensors Little advantage providing feedback  Trigger level set too low – Implemented in some Frequent triggering causes PowerPC chips (G3, G4) performance to suffer [Sanchez, 1997]  Choose trigger level to exploit – Sensor quantity, placement, and precision difference between average and worst-case power. will be discussed later • Other indirect approaches possible 59 DTM: Initiation Mechanisms • Operating system or microarchitectural control? – Hardware support can significantly reduce performance penalty • Policy Delay Settings – For Volt/Freq scaling, much of the performance penalty can be attributed to enabling/disabling – Increasing policy delay reduces overhead; smarter initiation techniques would help as well 60 DTM: Response Mechanisms • Scaling Techniques – Clock Frequency Scaling [Intel Pentium 4] – Voltage and Frequency Scaling – Temperature-tracking frequency scaling[Skadron03] • Adjusts frequency to account for T-dep. of switching speed • Microarchitectural Techniques – Speculation Control [Manne98] – Low-Power Cache Techniques [Huang00] – – – – – – • Hierarchical Responses Decode Throttling [Sanchez97] Fetch Toggling [Brooks01] Feedback controlled Fetch Gating [Skadron02] Migrating Computation [Skadron03] Dual Pipelines [Lim02] Hybrid DTM [Skadron04] 61 Dynamic Voltage/Frequency Scale • Voltage Scheduler predicts workload requirements • Set frequency/voltage to nearoptimal, energy savings • Burd, et al., ISSCC2000 – 5MHz @ 1.2V: 6 MIPS, 2.8mW – 80MHz @ 3.8V: 85 MIPS, 460mW – 70us 1.2V <-> 3.8V • Transmeta Crusoe – Commercial implementation (500700MHz, 1.2-1.6V) 62 Temperature-Tracking Frequency Temperature affects : • Transistor threshold and mobility • Ion, Ioff, Igate, delay • ITRS: 85°C for high-performance, 110°C for embedded! • So adjust frequency as f(T) -- TTDFS Ioff Ion NMOS 63 Speculation Control • Manne et al. (ISCA ’98) – Branch confidence estimator used to determine whether to speculate – Pipeline gating based on confidence estimation – 38% reduction in wrong-path instructions with ~1% performance loss • But Parikh et al. (HPCA ’02) found much smaller savings; ED product is zero or negative – Significant energy savings only come with significant loss of performance – This is because many instructions are squashed early in the pipeline, so reduction in wrong-path instructions is not a useful metric – Benefit is actually a function of prediction accuracy • Only for very badly predicted programs do you get benefit • Well-predicted programs suffer 64 Dynamic Hardware Resizing • Complexity Adaptive Processors • Based on application characteristics – Underutilized structures may be reduced with minimal performance impact – Resize Caches, Issue Queues, etc. – Resize => Reduce Capacitance => Reduce Energy – Of course, this only helps manage heat if it reduces heat dissipation within hot spots • And does so for a sufficiently long duration 65 DEETM • Dynamic Energy Efficiency and Temperature Management • Slack algorithm detects if slowdown can be tolerated – If so, invoke techniques to reduce energy • Temperature algorithm – If temperature limit is reached, invokes techniques • Techniques considered – Filter Cache, Voltage Scaling, etc. 66 Control-theoretic DTM • Fetch toggling – disable fetch every N cycles – 4/5, 2/3, 1/2, 1/3, 1/5, … IF ID EX MEM WB IF ID EX MEM WB – How to set the fetch rate? • (Assume idealized temperature sensing) 67 Feedback-Control of Fetch Toggling • Formal feedback control setpoint e Controller measured T u[k] Actuator: I-fetch toggling P Thermal dynamics T Temp. sensor Integral ctrl: u[k] = u[k-1] + Kc · e[k-1] Original work used PID • easy to compute • toggling = f(m) • optimality: will discuss shortly 68 Formal Feedback Control • Regulatory control problem: hold value to a specified setpoint – Example: temperature – Proved that PID controller will not allow temperature to exceed setpoint by more than 0.02° • Max power dissipation, thermal dynamics, sampling rate  max overshoot • This precision is excessive but illustrates the value of formal feedback control theory 69 0% 25% MEAN 30% bzip vortex perlbmk eon parser fma3d facerec crafty equake art mesa gcc Percent Loss in Performance Performance Loss • Performance loss reduced by 65% toggle1 PID 20% 15% 10% 5% 70 Migrating Computation • When one unit overheats, migrate its functionality to a distant, spare unit (MC) – – – – Spare register file (Skadron et al. 2003) Separate core (CMP) (Heo et al. 2003) Microarchitectural clusters etc. • Raises many interesting issues – – – – Cost-benefit tradeoff for that area Use both resources (scheduling) Extra power for long-distance communication Floorplanning • This is probably one of the richest opportunities for DTM 71 Migrating Computation – Reg File 72 Thermal Scheduling (Cai 2002) Primary FE DE Secondary DE OOP EX Majority mobile apps with performance requirements RF IOP Text email, caller-id, reminder and other none high performance w/ anywhere-anytime requested apps • Primary pipeline: maximal performance, complex pipeline structure • Second pipeline: Minimum power and energy consumption, very simple in order structure and target mobile anywhereanytime applications. • Transparent to OS and applications • Maximal utilizing on die clock/power gating for energy saving 73 Scheduling Algorithm (Cai 2002) T1  TL & T2  TL T1 > TL || T2 > TL Temperature (C) Tmax TH TS1 S1 T1  TL S4 T1 < TH T1  TH S2 T1  TL T1  TH TL Ta tcycle tcool theat TS2 T1 > TL & T2  TH S3 T1 > TL & T2 < TH Time (s) S1: Normal Operation (Primary Pipeline) S2: Stall Fetch & Clear Pipeline S3: Alternate Operation (Secondary Pipeline) S4: Disable Clock or Scale F-V 74 Hybrid DTM • DVS is attractive because of its cubic advantage – P  V2f – This factor dominates when DTM must be aggressive – But changing DVS setting can be costly • Resynchronize PLL • Sensitive to sensor noise  spurious changes • “ILP techniques” are attractive because they can use instruction level parallelism to hide/reduce impact of DTM – Only effective when DTM is mild • So use both! – Need to find “crossover point” 75 Hybrid DTM, cont. • Combine fetch gating with DVS – When DVS is better, use it – Otherwise use fetch gating – Determined by magnitude of temperature overshoot – Crossover at FG duty cycle of 3 – FG has low overhead: helps reduce cost of sensor noise 1.4 FG Hyb Slowdown DVS 1.3 1.2 1.2 1.1 Slowdown 1.3 1.1 1.0 1.0 20 5 Duty Cycle 2 20 15 10 Duty Cycle 5 0 76 Hybrid DTM, cont. • DVS doesn’t need more than two settings for thermal control – Lower voltage cools chip faster • FG by itself does need multiple duty cycles and hence requires PI control • But in a hybrid configuration, FG does not require PI control – FG is only used at mild DTM settings – Can pick one fixed duty cycle • This is beneficial because feedback control is vulnerable to noise 77 Simulation Details • 85°C maximum temperature – Guard band requires a trigger threshold of 81.8° • Ambient temperature (inside computer case): 45°C • Rpackage = 0.8 K/W (old package model) – 0.7 K/W necessary if DTM not available • Die thickness: 0.5mm • Currently neglecting interface material • 9 SPEC2000 benchmarks, both integer and FP – 4 hover near 81.8°C, rest are above • SimpleScalar/Wattch, modified to model pipeline and power of an Alpha 21364 as closely as possible • Scaled to 130nm, 1.3V, 3.0 GHz 78 Performance Comparison • TT-DFS is best but can’t prevent excess temperature – • Hybrid technique reduces DTM cost by 25% vs. DVS (DVS overhead important) A substantial portion of MC’s benefit comes from the altered floorplan, which separates hot units 1.40 Slowdown Factor • Suitable for use with aggressive clock rates at low temp. 1.359 1.270 1.30 1.231 1.20 1.112 1.10 1.045 1.00 TTDFS DVS FG Hyb MC 79 Floorplan Impact 2 cycle penalty LSQ moves: no penalty 80 Conclusions so far • • DTM can be used to reduce cooling costs Proper modeling is required – HotSpot is publicly available at http://lava.cs.virginia.edu/HotSpot • • ILP matters Hybrid techniques beneficial – Merge advantages of different schemes – Simplify control • • Architectural techniques important in thermal design Growing use of clusters and redundant units opens an incredibly rich design space 81 DTM: Summary and Key Issues • Dynamic optimizations translate max-power problem to averagepower problem • Heightens importance of averagepower techniques like clock gating • Key Issues: – Initiation interval – Collection of possible response mechanisms – Sensor accuracy and data fusion – Hot spots 82

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Thermal Management Issues (MICRO-35 Tutorial)