Power-Aware and Temperature-Aware Architecture
Kevin Skadron, LAVA/HotSpot Lab, Dept. of Computer Science, University of Virginia, Charlottesville, VA
[email protected]
© 2006, Kevin Skadron

"Cooking-Aware" Computing?

Thermal Packaging is Expensive
• Nvidia GeForce 5900 (Source: Tech-Report.com)
• Nvidia GeForce 7900 (Source: http://www.ixbt.com/video2/images/g71/7900gtx-front.jpg)
• Source: Gordon Bell, "A Seymour Cray perspective," http://www.research.microsoft.com/users/gbell/craytalk/

"Moore's Law" for Power?
[Figure: max power (W, log scale) vs. process generation, 1.5 µm down to 0.13 µm, for i386 through Pentium 4. Source: Intel]
• Reasons: higher frequencies, more "stuff"

Leakage – A Growing Problem
Source: N. S. Kim et al., "Leakage Current: Moore's Law Meets Static Power," IEEE Computer, Dec. 2003.
• The fraction of leakage power is increasing exponentially
• Leakage is also exponentially dependent on temperature
• This is bad for designs with idle logic, e.g. multi-core processors, specialized functional units, lots of storage, etc.
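The exponential temperature dependence of subthreshold leakage can be sketched numerically. This is a minimal sketch: the threshold voltage and subthreshold slope factor below are illustrative assumptions, not values for any real process.

```python
import math

K_B_EV = 8.617e-5   # Boltzmann constant (eV/K)
VTH = 0.30          # assumed threshold voltage (V) - illustrative only
N_FACTOR = 1.5      # assumed subthreshold slope factor - illustrative only

def rel_subthreshold_leakage(temp_k):
    """Relative subthreshold leakage ~ exp(-Vth / (n * kT/q))."""
    v_thermal = K_B_EV * temp_k       # thermal voltage kT/q, in volts
    return math.exp(-VTH / (N_FACTOR * v_thermal))

# Leakage grows several-fold between room temperature and 100 C:
ratio = rel_subthreshold_leakage(373.15) / rel_subthreshold_leakage(300.0)
print(f"leakage at 100C / leakage at 27C = {ratio:.1f}x")
```

Even this toy model shows why idle-but-powered logic is increasingly costly: a chip running hot leaks several times more than the same chip at room temperature.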
Inter-Related Design Objectives
[Diagram: Vdd, Vth, frequency, and area jointly determine throughput/performance, dynamic power, and leakage power (exponentially dependent on Vth and on temperature), which in turn drive cost and reliability]
• Performance gains increasingly require gains in cooling efficiency and power efficiency

ITRS Projections (ITRS 2005)
Year                          2003   2006   2010   2013   2016
Tech node (nm)                100    70     45     32     22
Vdd (high perf) (V)           1.2    1.1    1.0    0.9    0.8
Vdd (low power) (V)           1.0    0.9    0.7    0.6    0.5   (2001 roadmap: was 0.4)
Frequency (high perf) (GHz)   3.0    6.8    15.1   23.0   39.7
Max power (W):
  High-perf w/ heatsink       149    180    198    198    198   (2001 roadmap: was 288)
  Cost-performance            80     98     119    137    151
  Hand-held                   2.1    3.0    3.0    3.0    3.0
• These are targets; it is doubtful that they are feasible
• Growth in power density means cooling costs continue to grow

Hitting the Power Wall
• Intel canceled the Pentium 4 microarchitecture in part due to power limits
• Couldn't keep raising clock frequency
  • Non-ideal power scaling: Vdd scaling limited due to leakage (Vth)
• General-purpose CPU community shifting to replicating cores
  • Slow growth in frequency
  • Reduces growth in power density – but not total heat flux
  • Programming model an open question
• In-order or out-of-order cores?
  • Our early results suggest OO is often superior
• How many threads per core?
  • Sun, for example, puts 4 threads per core on its 8-core T2000 to hide memory latency
  • This comes at the expense of single-thread latency

Multi-Core Isn't Enough
• High degrees of integration still max out heat removal
• Core type and core count must be selected to maximize power efficiency
• Simply replicating cores and then trying to scale Vdd and frequency will not work

Talk Outline
• Different philosophies of power-aware design
  • Energy-efficient vs. low-power vs. temperature-aware
• Power management techniques: dynamic, static
• Thermal issues: factors to consider, DTM techniques, architectural modeling
• Summary of important challenges

Metrics
• Power – design for power delivery
  • Average power, instantaneous power, peak power
• Energy – low-power design
  • Energy (MIPS/W)
• Power-aware/energy-efficient design
  • Energy-delay product (MIPS²/W)
  • Energy-delay² product (MIPS³/W) – voltage independent!
• Temperature – temperature-aware design
  • On-chip temperature: correlated with localized power density
  • Enclosure/rack/data-center cooling

Circuit Techniques
• Transistor sizing
• Dynamic vs. static logic
• Signal and clock gating
• Circuit restructuring
• Low-power caches, register files, queues
• These typically reduce the capacitance being switched

Clock Gating, Signal Gating
"Disabling a functional block when it is not required for an extended period"
• Implementation: a simple gate that replaces one buffer in the clock tree
• Signal gating is similar; it helps avoid glitches
• Delay is generally not a concern except at fine granularities
• Choice of circuit design and clock-gating style can have a dramatic effect on temperature distribution

Circuit Restructuring
• Parallelize (can reduce frequency)
• Pipeline (tolerate smaller, longer-latency circuitry)
• Reorder inputs so that the most active input is closest to the output (reduces switched capacitance)
• Restructure gates (equivalent functions are not equivalent in switched capacitance)
Example: parallelizing while maintaining throughput. One logic block at Vdd, freq = 1: throughput 1, power 1, area 1, power density 1. Two logic blocks in parallel at Vdd/2, freq = 0.5: throughput 1, power 0.25, area 2, power density 0.125.
Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004

Cache Design
[Figure: read/write energy (pJ) by component – row/column decoders, wordlines, tag bitlines & sense amps (TBLSA), data bitlines & sense amps (DBLSA), I/O buses, other – for a 16 KB, 0.25 µm cache. Villa et al., MICRO 2000]
• Caccess = R · C · Ccell (R rows, C columns)
• Reducing power means reducing switched capacitance, voltage swing, activity factor, or frequency

Cache Design (cont.)
• Banked organization
  • Targets switched capacitance: Caccess = R · C · Ccell / B
  • Dividing the bit line has the same effect; likewise for wordlines
• Reducing voltage swings
  • Sense amplifiers detect Vdiff across the bitlines
  • A read can complete as soon as Vdiff is detected
  • Limiting the voltage swing saves a fraction of power
• Pulsed word lines
  • Enable the word line only for the time needed to discharge the bitcell voltage
  • The designer must estimate the access time and implement a pulse generator

Architectural-Level Techniques
Prevalent:
• Sleep modes
• Pipeline depth
• Energy-efficient front end
  • Branch prediction accuracy is a major determinant of pipeline activity, so spending more power in the branch predictor can be worthwhile if it improves accuracy
Growing or imminent:
• Integration (e.g. multiple cores)
• Multi-threading
• Dynamic voltage/frequency scaling
• Multiple-clock-domain architectures (similar to GALS)
• Power islands
• Encoding/compression
  • Can reduce both switched capacitance and cross talk
• Application-specific hardware
  • Co-processors, functional units, etc.
• Compiler techniques

Optimal Pipeline Depth
• Increased power and diminishing returns vs. increased throughput
• The optimum is around 5-10 stages, 15-30 FO4
[Figure: BIPS vs. pipeline depth for single-issue and 4-wide-issue machines]
Srinivasan et al., MICRO-35; Hartstein and Puzak, ACM TACO, Dec.
2004.

Multi-threading
• Do more useful work per unit time; amortize overhead and leakage
• Switch-on-event MT
  • Switch on cache misses, etc. (Ex: Sun T2000 "throughput computing")
  • Can even rotate among threads every instruction (Tera/Cray)
• Simultaneous multithreading/Hyper-Threading
  • For superscalar – eliminate wasted issue slots
  • Intel Pentium 4, IBM POWER5, Alpha 21464

Architectural-Level Techniques (revisited)
• The same list as above; we now turn to the growing or imminent techniques, starting with the limits of dynamic voltage/frequency scaling

Multiple Clock Domain Architecture
• Multiple voltage/clock domains inside the processor
• Globally-asynchronous locally-synchronous (GALS) clocking style
• Independent voltage/frequency scaling in each domain
• Synchronizers ensure correct inter-domain communication
• Good for domains that are loosely coupled anyway
  • Integer/FP units in CPUs
  • Multiple cores

Multiple Clock Domain Architecture (cont.)
• Advantages
  • Local clock design need not account for global skew
  • Each domain is limited only by its local critical path, allowing higher frequencies
  • Separate voltage regulators allow finer-grain energy control
  • The frequency/voltage of each domain can be tailored to its dynamic requirements
  • Clock power is reduced
• Drawbacks
  • Complexity and penalty of synchronizers
  • Feasibility of multiple voltage regulators

Simple Example of MCD in GPUs
• T is performance; ED² and E are energy-efficiency metrics
• All normalized to the default case with no MCD
• The higher the leakage, the more DVS pays off

Static Power Dissipation
• Static power: dissipation due to leakage current
• Exponentially dependent on T, Vdd, Vth
• Most important sources of static power: subthreshold leakage and gate leakage
• We will focus on subthreshold
• Gate leakage has essentially been ignored – new gate insulation materials may solve the problem

Thermal Runaway
• Leakage-temperature feedback can form a positive feedback loop
• Temperature increases → leakage increases → temperature increases → leakage increases → …
Source: www.usswisconsin.org

A Smorgasbord (of leakage-control techniques)
• Transistor sizing
• Multiple Vth
• Dynamic threshold voltage – reverse body bias – Transmeta Efficeon
  • Transmeta uses runtime compilation and load monitoring to select thresholds
• Stack effect
• Sleep transistors
• DVS – coarse or fine grained
• Low-leakage
caches, register files, queues
• Hurry up and wait
  • Low leakage: maintain the minimum possible V and f
  • High leakage: use high V/f to finish work quickly, then go to sleep

Leakage Control
• Body bias (Vbp/Vbn): 2-10X reduction
• Stack effect: 5-10X reduction
• Sleep transistor: 2-1000X reduction
Source: Shekhar Borkar, keynote presentation, MICRO-37, 2004

Sleep Transistors
• Recent work suggests that a properly sized, low-Vth footer transistor can preserve enough leakage to keep the cell active (Li et al., PACT'02; Agarwal et al., DAC'02)
• Great care must be taken when switching back to full voltage: noise can flip bits
• Extra latency may be necessary when re-activating
• Similar to the principles of sub-threshold computing
  • Ex: sensor motes for wireless sensor networks
  • Concerns about susceptibility to SEU

Low-Leakage Caches
• Gated-Vdd/Vss (Powell et al., ISLPED'00; Kaxiras et al., ISCA-28)
  • Uses a sleep transistor on Vdd/ground for each cache line
  • Typically considered non-state-preserving, but recent work (Agarwal et al., DAC'02) suggests that gated-Vss may preserve state
  • Many algorithms for determining when to gate – may want to make the decay interval temperature-dependent
  • Simplest (Kaxiras et al., ISCA-28): two-bit access counter and decay interval
  • Workload-adaptive decay intervals are hard
• Drowsy cache (Flautner et al., ISCA-29)
  • Uses dual supply voltages: normal Vdd and a low Vdd close to the threshold voltage
  • State-preserving, but requires an extra cycle to wake up – two extra cycles if tags are decayed

A Smorgasbord (repeated from above: sizing, multiple Vth, body bias, stack effect, sleep transistors, DVS, low-leakage storage structures, hurry up and wait)

Thermal Issues – Outline
• Arguments for dynamic thermal management
• Factors to consider, such as reliability
• Brief discussion of DTM techniques
• Architectural modeling
• Sensing

Worst-Case Design Leads to Over-Design
• Average-case temperature is lower than the worst case
  • Aggressive clock gating
  • Application variations
  • Underutilized resources, e.g. FP units during integer code, vertex units during fill-bound regions
• Currently a 20-40% difference → reduced target power density (TDP) → reduced cooling cost
Source: Gunther et al., ITJ 2001

Temporal, Spatial Variations
[Figure: temperature variation of SPEC applu over time; localized hot spots dictate the cooling solution]

Application Variations
• Wide variation across applications
• Architectural and technology trends are making it worse, e.g. simultaneous multithreading (SMT)
[Figure: peak temperature (Kelvin, roughly 370-420 K) for gzip, mcf, swim, mgrid, applu, eon, and mesa, single-threaded (ST) vs. SMT]

Temperature-Aware Design
• Worst-case design is wasteful
• Power management is not sufficient for chip-level thermal management
  • Must target blocks with high power density, when they are hot
• Spreading heat helps
  • Even if energy is not affected
  • Even if average temperature goes up
  • This also helps reduce leakage

Dynamic Thermal Management
[Figure: temperature vs. time; designing the package for the cooling capacity needed with DTM, rather than without it, yields a system cost savings; DTM engages when temperature crosses the trigger level. Source: David Brooks 2002]

DTM
• Worst-case design of the external cooling solution is wasteful
• Yet safe temperatures must be maintained when the worst case happens
• Thermal monitors allow a tradeoff between cost and performance:
Cheaper package – more triggers, less performance
Expensive package – no triggers, full performance

Role of Architecture?
Dynamic thermal management (DTM):
• Automatic hardware response when temperature exceeds cooling capacity
• Cut power density at runtime, on demand
• Trade reduced costs for occasional performance loss
• Architecture is a natural granularity for thermal management
  • Activity and temperature are correlated within architectural units
  • A DTM response can target the hottest unit: permits a finer-tuned response than the OS or package
• Modern architectures offer rich opportunities for remapping computation
  – e.g., CMPs/SoCs, graphics processors, tiled architectures
  – e.g., the register file
• DTM will intermittently affect performance

Existing DTM Implementations
• Intel Pentium 4: global clock gating with a shut-down fail-safe
• Intel Pentium M: dynamic voltage scaling
• Transmeta Crusoe: dynamic voltage scaling
• IBM POWER5: probably fetch gating
• ACPI: OS-configurable combination of passive and active cooling
• These solutions sacrifice time (slower or stalled execution) to reduce power density
• Better: a solution in "space"
  • Tradeoff between exacerbating leakage (more idle logic) and reducing leakage (lower temperatures)

Alternative: Migrating Computation
(This is only a simplistic illustrative example.)

Space vs. Time
• Moving the hotspot, rather than throttling it, reduces performance overhead by almost 60%
[Figure: slowdown factors – DVS 1.359, FG 1.270, Hyb 1.231, MC 1.112]
• The greater the replication and spread, the greater the opportunities

Granularity of DTM
• Subunit (single queue entry, register, etc.)
• Structure (queue, register file, ALU, etc.)
• Core
• Tradeoffs:
  • Subunit: lots of replication and low migration cost, but not spread out
  • Structure: yuck – copy stalls required, and throttling is hard to avoid
  • Core: lots of replication and good spread, but high migration cost, and local hotspots remain
    – But if threads are short, scheduling can achieve thermal load balancing without migration
• The greater the replication and spread, the greater the opportunities
• The shorter the threads, the more flexibility

Thermal Consequences
Temperature affects:
• Circuit performance
• Circuit power (leakage)
• IC reliability
• IC and system packaging cost
• Environment

Performance and Leakage
Temperature affects:
• Transistor threshold and mobility
• Subthreshold leakage, gate leakage
• Ion, Ioff, Igate, delay
• ITRS: 85°C for high-performance, 110°C for embedded!

Reliability
The Arrhenius equation: MTF = A·exp(Ea / (k·T))
• MTF: mean time to failure at temperature T
• A: empirical constant
• Ea: activation energy
• k: Boltzmann's constant
• T: absolute temperature
Failure mechanisms:
• Electromigration
• Dielectric breakdown
• Mechanical stress
• Negative bias temperature instability (NBTI)

Reliability as f(T)
• Reliability criteria (e.g., DTM thresholds) are typically based on worst-case assumptions
• But actual behavior is often not worst case, so aging occurs more slowly
• This means the DTM design is over-engineered!
• We can exploit this banked reliability, e.g. for DTM or frequency (spend vs. bank)

Reliability-Aware DTM
[Figure: average slowdown (0.00-0.16) across several package/cooling configurations, comparing a conventional DTM controller (DTM_controller) against reliability-aware DTM (DTM_reliability)]

Thermal Issues – Outline
• Arguments for dynamic thermal management
• Factors to consider, such as reliability
• Brief discussion of DTM techniques
• Architectural modeling
• Sensing

Heat Mechanisms
• Conduction is the main mechanism in a single chip
  • Conduction is proportional to the temperature difference and surface area
• Convection is the main mechanism in racks, data centers, etc.

Simplistic Steady-State Model
• All thermal transfer: R = k/A – power density matters!
• Ohm's law for thermals (steady state): V = I·R → T = P·R
• T_hot = P·Rth + T_amb
• Ways to reduce T_hot:
  – reduce P (power-aware)
  – reduce Rth (packaging, spread heat)
  – reduce T_amb (Alaska?)
  – maybe also take advantage of transients (Cth)

Simplistic Dynamic Thermal Model
Electrical-thermal duality: V ↔ temperature (T), I ↔ power (P), R ↔ thermal resistance (Rth), C ↔ thermal capacitance (Cth), RC ↔ time constant
• KCL differential equation: I = C·dV/dt + V/R
• Difference equation: ΔV = (I/C)·Δt − (V/(RC))·Δt
• Thermal domain: ΔT = (P/Cth)·Δt − (T/(Rth·Cth))·Δt, where T = T_hot − T_amb
• One can compute stepwise changes in temperature at any granularity at which one can get P, T, R, C

Thermal Resistance
• Θ = r·t / A = t / (k·A)

Thermal Capacitance
• Cth = V · Cp · ρ
• ρ(aluminum) = 2,710 kg/m³; Cp(aluminum) = 875 J/(kg·°C); V = t·A = 0.000025 m³
• Cbulk = V · Cp · ρ = 59.28 J/°C

Thermal Issues Summary
• Temperature affects performance, power, and reliability
• Architecture level: conduction only
  • Very crude approximation of convection as an equivalent resistance
  • Convection: too complicated – need CFD!
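The steady-state relation T_hot = P·Rth + T_amb and the bulk thermal capacitance Cth = V·Cp·ρ are easy to check numerically. The aluminum numbers are the ones from the slide; the 50 W / 0.5 K/W operating point is a made-up example:

```python
# Steady-state thermal "Ohm's law": T_hot = P * Rth + T_amb
def steady_state_temp(power_w, rth_k_per_w, t_amb_c):
    return power_w * rth_k_per_w + t_amb_c

# Hypothetical operating point: 50 W through 0.5 K/W above a 45 C ambient
print(steady_state_temp(50.0, 0.5, 45.0))   # 70.0 C

# Bulk thermal capacitance Cth = V * Cp * rho (aluminum block from the slide)
RHO_AL = 2710.0        # density, kg/m^3
CP_AL = 875.0          # specific heat, J/(kg*C)
VOLUME = 0.000025      # t * A, m^3
c_bulk = VOLUME * CP_AL * RHO_AL
print(round(c_bulk, 2))  # 59.28 J/C, matching the slide
```

The same two quantities are all a forward-Euler step of the dynamic model needs: ΔT = (P/Cth)·Δt − (T/(Rth·Cth))·Δt.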
• Radiation: can be ignored
• Use compact models for the package
• Power density is key
• Temporal and spatial variation are key
• Hot spots drive thermal design

Thermal Modeling
• Want a fine-grained, dynamic model of temperature
  • At a granularity architects can reason about
  • That accounts for adjacency and the package
  • That does not require detailed designs
  • That is fast enough for practical use
• HotSpot – a compact model based on thermal R, C
  • Parameterized to automatically derive a model for various architectures, power models, floorplans, and thermal packages

Dynamic Compact Thermal Model
Electrical-thermal duality: V ↔ temperature (T), I ↔ power (P), R ↔ thermal resistance (Rth), C ↔ thermal capacitance (Cth), RC ↔ time constant (Rth·Cth)
• Kirchhoff Current Law differential equation: I = C·dV/dt + V/R
• Thermal domain: P = Cth·dT/dt + T/Rth, where T = T_hot − T_amb
• At finer granularities of P, Rth, Cth: P and T are vectors, and Rth, Cth are circuit matrices

Example System
• Heat sink, heat spreader, IC package, die, interface material, pins, PCB

Surface-to-Surface Contacts
• Not negligible: heat crowding
• Thermal greases/epoxies (can "pump out")
• Phase-change films (undergo a transition from solid to semisolid with the application of heat)
Source: R. Remsburg, Ed., Thermal Design of Electronic Equipment, CRC Press, 2001

Our Model (Lateral and Vertical)
• Interface material (not shown)

HotSpot
• Time evolution of temperature is driven by unit activities and power dissipations averaged over 10K cycles
• Power dissipations can come from any power simulator and act as "current sources" in the RC circuit (the 'P' vector in the equations)
• Simulation overhead in Wattch/SimpleScalar: < 1%
• Requires models of:
  • Floorplan: important for adjacency
  • Package: important for spreading and time constants
• The R and C matrices are derived from the above

Validation
• Validated and calibrated using FEM simulations, FPGA measurements, and MICRED test chips
  • 9x9 array of power dissipators and sensors
  • Compared to HotSpot configured with the same grid and package
  • Within 7% for both steady-state and transient step response
• The interface material (chip/spreader) matters

Sensors
Caveat emptor: we are not well-versed in sensor design; the following is a digest of information we have been able to collect from industry sources and the research literature.

Desirable Sensor Characteristics
• Small area
• Low power
• High accuracy and linearity
• Easy access and low access time
• Fast response time (slew rate)
• Easy calibration
• Low sensitivity to process and supply noise

Types of Sensors
(In approximate order of increasing ease to build)
• Thermocouples – voltage output
  • Junction between wires of different materials; the voltage at the terminals is proportional to Tref − Tjunction
  • Often used for external measurements
• Thermal diodes – voltage output
  • Biased p-n junction; the voltage drop for a known current is temperature-dependent
• Biased resistors (thermistors) – voltage output
  • Voltage drop for a known current is temperature-dependent – you can also think of this as a varying R
  • Example: 1 kΩ metal "snake"
• BiCMOS, CMOS – voltage or current output
  • Rely on a reference voltage or current generated from a band-gap reference circuit; current-based designs often depend on the temperature dependence of the threshold
• 4T RAM cell – decay time is temperature-dependent [Kaxiras et al., ISLPED'04]

Sensors: Problem Issues
• Poor control of CMOS transistor parameters
• Noisy environment: cross talk, ground noise, power supply noise
• These can be reduced by making the sensor larger
  • But this increases power dissipation, and we may want many sensors

"Reasonable" Values
• Based on conversations with engineers at Sun, Intel, and HP (Alpha)
• Linearity: not a problem for the range of temperatures of interest
• Slew rate: < 1 μs
  • The time it takes for the physical sensing process (e.g., current) to reach equilibrium
• Sensor bandwidth: << 1 MHz, probably 100-200 kHz
  • This is the sampling rate; 100 kHz = 10 μs
  • Limited by the slew rate but also by the A/D – consider digitization using a counter

"Reasonable" Values: Precision
• Mid 1980s: < 0.1° was possible
• Precision:
  • ± 3° is very reasonable
  • ± 2° is reasonable
  • ± 1° is feasible but expensive
  • < ± 1° is really hard
• Power: 10s of mW
• The limited precision of the G3 sensor seems to have been a design choice involving the digitization

Calibration
• Accuracy vs. precision – analogous to mean vs.
stdev
• Calibration deals with accuracy
• The main issue is to reduce inter-die variations in offset
• Typically requires per-part testing and configuration
• Basic idea: measure the offset, store it, then subtract it from dynamic measurements

Dynamic Offset Cancellation
• A rich area of research
• Build a circuit to continuously, dynamically detect the offset and cancel it
• Typically uses an op-amp
• Advantage: adapts to changing offsets
• Disadvantage: more complex circuitry

Role of Precision
• Suppose the junction temperature is J, the max sensor variation is S, and the offset is O; then the thermal-emergency threshold must be set at T = J − S − O
• Spatial gradients: if sensors cannot be located exactly at hotspots, the measured temperature may be G° lower than the true hotspot, so T = J − S − O − G

Rate of Change of Temperature
• Our FEM simulations suggest a maximum of 0.1° in about 25-100 μs
• This is for power density < 1 W/mm², die thickness between 0.2 and 0.7 mm, and contemporary packaging
• This means slew rate is not an issue – but sampling rate is!

Sensors Summary
• Sensor precision cannot be ignored
  • Reducing the operating threshold by 1-2 degrees will affect performance
• Precision of 1° is conceivable but expensive – maybe reasonable for a single sensor or a few
• Precision of 2-3° is reasonable even for a moderate number of sensors
• Power and area are probably negligible from the architecture standpoint
• Sampling period <= 10-20 μs

Massive Multi-Core Design Space
• # cores
• Pipeline depth
• Pipeline width
• In-order vs. out-of-order
• Cache per core
• Core-to-core interconnect fabric
• All dependent on temperature constraints!

Wither Core Type?
(Source: Christopher Reeve Homepage, http://www.chrisreevehomepage.com/)
Hot spot?
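The interaction between core count and a thermal cap can be illustrated with a toy sweep. Everything numeric here is invented for illustration (the ambient, package resistance, per-core power, and contention model are assumptions, not the models behind the BIPS curves in the talk):

```python
# Toy design-space sweep: pick the core count that maximizes throughput,
# optionally subject to a steady-state thermal cap T_hot <= T_LIMIT.
T_AMB, RTH, T_LIMIT = 45.0, 0.35, 85.0   # assumed ambient (C), K/W, cap (C)

def throughput(cores):
    # Diminishing returns from shared-resource contention (assumed model)
    return cores * (1.0 - 0.03 * (cores - 1))

def chip_power(cores):
    return 8.0 + 12.0 * cores            # assumed uncore + per-core watts

def best_config(thermal_cap=True):
    best = None
    for cores in range(1, 17):
        t_hot = T_AMB + RTH * chip_power(cores)   # steady-state model
        if thermal_cap and t_hot > T_LIMIT:
            continue                               # infeasible design point
        cand = (throughput(cores), cores)
        best = max(best, cand) if best else cand
    return best[1]

print(best_config(thermal_cap=False), best_config(thermal_cap=True))
```

Even this crude model reproduces the qualitative point of the next slides: the unconstrained optimum maxes out the core count, while the thermal cap moves the optimum to a smaller, cooler configuration.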
Cores may also be heterogeneous, with a few powerful cores and very many small cores.

Impact of Thermal Constraints
• Thermal limits change the optimal pipeline width as core count increases
[Figure: BIPS vs. core count (2-20) for 2MB/18FO4 2-wide vs. 4-wide configurations]

Impact of Thermal Constraints (cont.)
• Pipeline depth, which is often fixed early in the design, can impact multi-core performance dramatically
• Thermal limits favor shallower pipelines
[Figure: BIPS vs. core count for 2MB and 4MB caches at 12/18/24/30 FO4, without and with thermal constraints]

Workload Sensitivity
• CPU-bound and memory-bound applications want different resources
• 26-53% performance loss if the best configurations are switched!
[Figure: BIPS vs. core count for CPU-bound and memory-bound workloads, 2-16 MB caches, 12-30 FO4 pipelines; 400 mm² die, cheap thermal package]

Summary
• Reviewed current techniques for managing dynamic power, leakage power, and temperature
• A major obstacle with architectural techniques is the difficulty of predicting the performance impact
• Spread heat in space, not time
• Continuing integration makes power and thermal constraints even more important
• Optimal multi-core design is dependent on thermal considerations
• Security challenges

...or email me: [email protected]

More Info
http://www.cs.virginia.edu/~skadron
LAVA Lab

Backup Slides
• These slides are an assortment that wouldn't fit in the talk but were kept to answer questions or provide more info

Hot Chips are No Longer Cool!
[Figure: power density (W/cm², log scale) vs. process generation, 1.5 µm to 0.07 µm, for i386 through Pentium 4, with rocket nozzle, nuclear reactor, and hot plate shown for comparison]
Source: "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies," Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.

ITRS Quotes – Thermal Challenges
• For small dies with high pad count, high power density, or high frequency, "operating temperature, etc. for these devices exceed the capabilities of current assembly and packaging technology."
• "Thermal envelopes imposed by affordable packaging discourage very deep pipelining."
• Intel recently canceled its NetBurst microarchitecture – press reports suggest thermal envelopes were a factor

Dynamic Power Consumption
• Power dissipated due to switching activity: a capacitance CL is charged and discharged
• Ec = ½·CL·V², Ed = ½·CL·V², charged/discharged at frequency f
• P = a·CL·V²·f

Transistor Sizing
• Transistor sizing plays an important role in reducing power
• For a buffer chain C0, C1, …, CN with stage ratio K = Ci/Ci−1:
  • Delay ~ K / ln K
  • Power ~ K / (K−1)
• The optimum K for both power and delay must be pursued

Signal Gating
"Techniques to mask unwanted switching activity from propagating forward, causing unnecessary power dissipation"
• Implementation: a simple gate, a tristate buffer, ...
• A control signal is needed; generating it requires additional logic
• Especially helps prevent power dissipation due to glitches

Different Implementations and Corresponding Clock Gating Choices
• Latch-mux design vs. SRAM design

DVS "Critical Power Slope"
• It may be more efficient not to use DVS, and to run at the highest possible frequency, then go into a sleep mode!
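The dynamic-power relation P = a·CL·V²·f makes the parallelizing example from the circuit-restructuring slide easy to verify: two blocks at half Vdd and half frequency deliver the same throughput at a quarter of the power (all quantities normalized):

```python
def dynamic_power(activity, cap, vdd, freq):
    """P = a * C_L * Vdd^2 * f (normalized units)."""
    return activity * cap * vdd**2 * freq

base = dynamic_power(1.0, 1.0, 1.0, 1.0)          # one block: power 1
parallel = 2 * dynamic_power(1.0, 1.0, 0.5, 0.5)  # two blocks at Vdd/2, f/2

print(parallel / base)        # 0.25 of the power, at the same throughput
print((parallel / 2) / base)  # 0.125 power density (the area doubled)
```

The quadratic dependence on Vdd is what gives parallelization (and DVS) its leverage: halving frequency alone only halves power, but halving voltage along with it quarters it again.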
• Depends on the power dissipation in sleep mode vs. the power dissipation at the lowest voltage
• This has been formalized as the critical power slope (Miyoshi et al., ICS'02):
  • m_critical = (P_fmin − P_idle) / f_min
  • If the actual slope m = (P_f − P_fmin) / (f − f_min) < m_critical, then it is more energy-efficient to run at the highest frequency, then go to sleep
• Switching overheads must be taken into account

Application-Specific Hardware
• Specialized logic is usually much lower power
• Co-processors – Ex: TCP/IP offload, codecs, etc.
• Functional units – Ex: Intel SSE, specialized arithmetic (e.g., graphics), etc.
• Custom instructions in configurable cores (e.g., Tensilica)
• Specific example: Zoran ER4525 – cell phone
  • ARM microcontroller, no DSP!
  • Video capture and pre/post processing, video codec, 2D/3D rendering, video display, security

Gate Leakage
• Not clear if new oxide materials will arrive in time
• Any technique that reduces Vdd helps
• Otherwise it seems difficult to develop architectural techniques that directly attack gate leakage
  • In fact, very little work has been done in this area
• One example: domino gates (Hamzaoglu & Stan, ISLPED'02)
  • Replace the traditional NMOS pull-down network with a PMOS pull-up network
  • Gate leakage is greater in NMOS than in PMOS
  • But the PMOS domino gate is slower
• Note: the gate oxide is so thin that it is especially prone to manufacturing variations

Static Power – Modeling
• Butts and Sohi (MICRO-33)
  – Pstatic = Vcc · N · kdesign · Îleak
  – Îleak determined by circuit simulation, kdesign empirically
  – Key contribution: separates technology from design
• HotLeakage (UVA TR CS-2003-05, DATE'04)
  – Extension of the Butts & Sohi approach: scalable with Vdd, Vth, temperature, and technology node; adds gate leakage
  – Îleak determined by the BSIM3 subthreshold equation and BSIM4 gate-leakage equations, giving an analytical expression that accounts for dependence on factors that may change at runtime, namely Vdd, Vth, and temperature
  – kdesign replaced by separate factors for N- and P-type transistors
  – kdesign also exponentially dependent on Vdd and Tox, linearly dependent on temperature
  – Currently integrated with SimpleScalar/Wattch for caches

Static Power – Modeling (cont.)
• Su et al., IBM (ISLPED'03)
  – Similar approach to HotLeakage – but they observe that modeling the change in leakage allows linearization of the equations
• Many, many other papers on various aspects of modeling leakage
  – Most focus on subthreshold
  – Few suggest how to model leakage in microarchitecture simulations

Performance Comparison
• TT-DFS is best but can't prevent excess temperature
  • Suitable for use with aggressive clock rates at low temperatures
• The hybrid technique reduces DTM cost by 25% vs. DVS (DVS overhead is important)
• A substantial portion of MC's benefit comes from the altered floorplan, which separates hot units
[Figure: slowdown factors – TT-DFS 1.045, DVS 1.359, FG 1.270, Hyb 1.231, MC 1.112]

EM Model
• Life consumption rate: R(t) = (1/θ_th)·exp(−Ea / (k·T(t))), with θ_th constant
• Failure occurs at t_failure when the accumulated damage reaches one: ∫₀^(t_failure) R(t) dt = 1
• Apply in a "lumped" fashion at the granularity of microarchitecture units, just like RAMP [Srinivasan et al.]

Carnot Efficiency
• Note that in all cases, heat transfer is proportional to ΔT
• This is one reason energy "harvesting" in computers is probably not cost-effective
  • ΔT w.r.t. ambient is << 100°
  • For example, with a 25 W processor, the thermoelectric effect yields only ~50 mW (Solbrekken et al., ITHERM'04)
• This is also why Peltier coolers are not energy-efficient
  • 10% efficiency, vs.
30% for a refrigerator

Thermal Modeling
• Want a fine-grained, dynamic model of temperature
– At a granularity architects can reason about
– That accounts for adjacency and the package
– That does not require detailed designs
– That is fast enough for practical use
• HotSpot – a compact model based on thermal R, C
• Parameterized to automatically derive a model based on various…
– Architectures
– Power models
– Floorplans
– Thermal packages

Temperature equations
• Fundamental RC differential equation
– P = C dT/dt + T/R
• Steady state: dT/dt = 0, so P = T/R
• When R and C are network matrices
– Steady state: T = R × P
– Modified transient equation: dT/dt + (RC)⁻¹ T = C⁻¹ P
• HotSpot software mainly solves these two equations

Our Model (Lateral and Vertical)
(Figure: RC network derived from material and geometric properties; interface material not shown)

Transient solution
• Solves differential equations of the form dT/dt + AT = B
– In HotSpot, A is constant (RC) but B depends on the power dissipation
• Solution: assume constant average power dissipation within an interval (10K cycles) and call RK4 at the end of each interval
• In RK4, the current temperature (at t) is advanced in very small steps (t+h, t+2h, ...) until the next interval (10K cycles)
• RK-"4" because the error term is 4th order, i.e., O(h⁴)

Transient solution, contd.
• The 4th-order error has to be within the required precision
• The step size (h) has to be small enough even for the maximum slope of the temperature evolution curve
• The transient solution of the differential equation is of the form Ae^(−Bt), with A and B dependent on the RC network
• Thus the maximum value of the slope (A×B) and the step size are computed accordingly

HotSpot
• Time evolution of temperature is driven by unit activities and power dissipations averaged over 10K cycles
• Power dissipations can come from any power simulator; they act as "current sources" in the RC circuit
• Simulation overhead in Wattch/SimpleScalar: < 1%
• Requires models of
– Floorplan: important for adjacency
– Package: important for spreading and time constants

Notes
• HotSpot currently measures temperatures in the silicon
– But that is also where most sensors measure
• Temperature continues to rise through each layer of the die
– Temperature in upper-level metal is considerably higher
– Interconnect model to be released soon!

HotSpot Summary
• HotSpot is a simple, accurate, and fast architecture-level thermal model for microprocessors
• Over 850 downloads since June '03
• Ongoing active development – architecture-level floorplanning will be available soon
• Download site: http://lava.cs.virginia.edu/HotSpot
• Mailing list: www.cs.virginia.edu/mailman/listinfo/hotspot

Hybrid DTM
• DVS is attractive because of its cubic advantage
– P ∝ V²f
– This factor dominates when DTM must be aggressive
– But changing the DVS setting can be costly: resynchronizing the PLL, and sensitivity to sensor noise causing spurious changes
• Fetch gating is attractive because it can use instruction-level parallelism to reduce the impact of DTM
– Only effective when DTM is mild
• So use both!
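The transient-solution scheme from the slides above (hold power constant over each 10K-cycle interval and integrate dT/dt + (RC)⁻¹T = C⁻¹P with RK4) can be sketched as follows. This is a minimal illustration, not HotSpot's implementation: the two-block network and all R, C, and P values are invented for the example.

```python
import numpy as np

def rk4_step(T, h, A, B):
    """One 4th-order Runge-Kutta step for dT/dt = B - A @ T."""
    f = lambda x: B - A @ x
    k1 = f(T)
    k2 = f(T + 0.5 * h * k1)
    k3 = f(T + 0.5 * h * k2)
    k4 = f(T + h * k3)
    return T + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def simulate_interval(T, P, R, C, interval, h):
    """Advance temperatures over one power-sampling interval.

    P (watts per block) is held constant over the interval, as HotSpot
    does with its 10K-cycle averages.  From the modified transient
    equation dT/dt + (RC)^-1 T = C^-1 P:  A = (RC)^-1, B = C^-1 P.
    """
    A = np.linalg.inv(R @ C)
    B = np.linalg.solve(C, P)
    n = max(1, int(np.ceil(interval / h)))  # small steps t+h, t+2h, ...
    for _ in range(n):
        T = rk4_step(T, interval / n, A, B)
    return T

# Illustrative 2-block network (values made up, not from HotSpot)
R = np.array([[2.0, 0.5],
              [0.5, 3.0]])        # thermal resistances (K/W)
C = np.diag([0.01, 0.02])         # thermal capacitances (J/K)
P = np.array([25.0, 10.0])        # per-block power (W)

T = simulate_interval(np.zeros(2), P, R, C, interval=0.5, h=1e-4)
# After several RC time constants, T approaches the steady state T = R @ P
```

As a sanity check, integrating long enough drives T to the steady-state solution T = R × P from the steady-state equation on the previous slide.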
Migrating Computation
• When one unit overheats, migrate its functionality to a distant, spare unit (MC)
– Spare register file (Skadron et al. 2003)
– Separate core (CMP) (Heo et al. 2003)
– Microarchitectural clusters, etc.
• Raises many interesting issues
– Cost-benefit tradeoff for that area
– Using both resources (scheduling)
– Extra power for long-distance communication
– Floorplanning

Hybrid DTM, cont.
• Combine fetch gating with DVS
– When DVS is better, use it; otherwise use fetch gating
– Determined by the magnitude of the temperature overshoot
– Crossover at an FG duty cycle of 3
– FG has low overhead: helps reduce the cost of sensor noise
(Figure: slowdown vs. duty cycle for FG, DVS, and Hyb)

Hybrid DTM, cont.
• DVS doesn't need more than two settings for thermal control
– Lower voltage cools the chip faster
• FG by itself does need multiple duty cycles and hence requires PI control
• But in a hybrid configuration, FG does not require PI control
– FG is only used at mild DTM settings
– Can pick one fixed duty cycle
– This is beneficial because feedback control is vulnerable to noise

Sensors
• Almost half of DTM overhead is due to
– Guard banding due to offset errors and the lack of co-located sensors
– Spurious sensor readings due to noise
• Need localized, fine-grained sensing
– But these may be imprecise
• Need new sensor designs that are cheap and can be used liberally – co-locate with hotspots
– Many sensor designs look promising
• Need new data-fusion techniques to reduce imprecision, possibly combining heterogeneous sensors

Impact of Physical Constraints
• Thermal constraints shift the optimum toward fewer and simpler cores
• CPU-bound programs still want aggressive superscalar cores despite throttling – but not deeply pipelined
• You can still have lots of cores
• Mem-bound programs want
narrow cores, lots of L2
– They will be severely throttled (e.g., up to 45% voltage reduction and 75% frequency reduction)
– But you still win by adding cores until throttling outweighs the benefit of an additional core
• Preliminary results suggest that OO cores are always preferable: they are more efficient in terms of BIPS/area
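The "add cores until throttling outweighs the benefit" tradeoff can be illustrated with a toy model; every constant below is hypothetical, not from the talk. Cores share a fixed power budget, each core pays a fixed leakage cost, dynamic power scales roughly as V³ when f tracks V (P ∝ V²f), and below a Vdd floor the core is duty-cycle throttled instead:

```python
# Toy model of core count vs. thermal throttling (all numbers are
# illustrative assumptions, not measurements from the talk).
def throughput(n, budget=100.0, p_static=5.0, p_dyn_max=45.0,
               v_floor=0.5):
    """Relative throughput of n cores sharing a fixed power budget."""
    p_dyn = (budget - n * p_static) / n   # dynamic power left per core
    if p_dyn <= 0.0:
        return 0.0                        # leakage alone exhausts the budget
    ratio = min(p_dyn / p_dyn_max, 1.0)
    v = ratio ** (1.0 / 3.0)              # V (and f) fraction from P ~ V^3
    if v >= v_floor:
        return n * v                      # throughput ~ n_cores * f
    # Below the Vdd floor, throttle by duty cycle at the floor setting
    p_floor = p_dyn_max * v_floor ** 3
    duty = p_dyn / p_floor
    return n * v_floor * duty

# Throughput rises with core count, peaks, then collapses as per-core
# leakage and throttling dominate.
best = max(range(1, 21), key=throughput)
```

Under these assumed constants the optimum lands at a moderate core count; the point is the shape of the curve, not the specific numbers.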