© 2004, Kevin Skadron and Jose Gonzalez

Power-Aware Design for High-Performance Processors
A Tutorial at HPCA-2004
Kevin Skadron, University of Virginia
Jose Gonzalez, Intel Labs Barcelona

Roadmap
• Introduction & Trends
• Dynamic Power Dissipation
  – Sources, modeling, reduction techniques
• Static Power Dissipation
  – Sources, modeling, reduction techniques
• Summary

Introduction
• Power: work done per unit time (watts)
• Energy: total work (joules)
• Why is power a concern in current processors?
  – Increased market demand for consumer electronics powered by batteries; battery life is a selling point
  – Electricity and cooling costs for large data centers are becoming substantial
    • 5-25% of data center income (cf. Rajamony & Bianchini tutorial, ICS’02)
  – Government energy-efficiency requirements
    • (e.g. Energy Star in US)
  – Electricity costs for large ISPs are becoming substantial
  – Packaging and cooling costs (due to the increase in power density) are becoming prohibitive
  – Power dissipation may reach technology limits
  – Current delivery is becoming expensive

Metrics
• Some different power metrics & fallacies:
  – Reducing power does not always save energy
    • $E = \int P \, dt$
    • If you reduce power but increase execution time, energy may go up
  – Also note that reducing power does not always reduce temperature
  – Sustained power density limits thermal design/packaging
    • approx. same as thermal design power
    • note that on-chip temperatures and total heat production are somewhat different concerns

Metrics
• Power
  – Average power
  – Power density map
• Energy
  – Energy (MIPS/W)
  – Energy-Delay product (MIPS^2/W)
  – Energy-Delay^2 product (MIPS^3/W) – voltage independent!
(Zyuban, GVLSI’02)
• Temperature
  – Average temperature
  – Peak temperature
  – Temperature map
    • Does not necessarily match power density map
• No good figures of merit for trading off thermal efficiency against performance, area, or energy efficiency

Power Dissipation
• Dynamic power dissipation
  – Due to switching activity
• Static power dissipation
  – Due to leakage current – major paths are:
    • Subthreshold leakage: exponentially dependent on Vdd, Vth, Temp
    • Gate leakage: exponentially dependent on Vdd, Tox

Power Dissipation
• Total power actually consists of
  – Switching power
  – Short-circuit power
  – Leakage power

Big Picture - Trends
• Data on current power dissipation for various chips
• Distribution of power within a typical processor
• Trends in power
  – Scaling trends in power dissipation
  – Trends in leakage power
• Trends in battery life

Power Dissipation

Processor          Clock rate
Alpha 21364        1.15 GHz
AMD Opteron        2.2 GHz
HP PA8700          870 MHz
IBM Power 4        1.7 GHz
Intel Itanium 2    1.5 GHz
Intel Xeon         3.2 GHz
MIPS R14000        600 MHz

Power (max): 110 W for the Alpha 21364; the remaining values appear on the slide as 86 W, 75 W, 130 W, 86 W, 16 W, 100 W
Source: Microprocessor Report

Power Dissipation Breakdown
[Figure: Alpha 21264 power breakdown by unit: global clock network, instruction issue units, caches, FP execution units, int. execution units, mem. management unit, I/O, miscellaneous]
Source: Gowan et al.
“Power Considerations in the design of the alpha 21264 microprocessor”, DAC 1998

Effects of Technology Scaling on Power Dissipation
• Feature size is scaling down
• Frequency is increasing
• Vdd is not scaled down at the same rate as feature size
• Active capacitance increases
• Area increases due to microarchitecture improvements
• Per-generation figures as listed on the slide, in order: at least 30% (ideal scaling: decreases by 30%); 25% (ideal scaling: decreases by 50%); ~2x (ideal scaling: decreases by 30%); 30%; 0-10% (ideal scaling); 30%
• Ideal scaling: $P \propto C V^2 f$ → $0.7^2 \approx 0.5$ reduction
• Observed scaling → 2 – 2.5x increase
• Power density becomes a problem!
  – Especially since the power density is non-uniform

Power Evolution
[Figure: max power (watts, log scale from 1 to 100) versus process generation (1.5 μm to 0.13 μm) for the i386, i486, Pentium®, Pentium® w/MMX tech., Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4. Source: Intel]

Trends in Power Density
[Figure: power density (watts/cm^2, log scale from 1 to 1000) versus process generation (1.5 μm to 0.07 μm) for the i386 through the Pentium® 4, with reference levels for a hot plate, a nuclear reactor, and a rocket nozzle]
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp. Micro32 conference keynote, 1999.

ITRS Projections (ITRS 2001)

Year                            2003   2006   2010   2013   2016
Tech node (nm)                   100     70     45     32     22
Vdd (high perf) (V)              1.0    0.9    0.6    0.5    0.4
Vdd (low power) (V)              1.1    1.0    0.8    0.7    0.6
Frequency (high perf) (GHz)      3.1    5.6   11.5   19.3   28.8
Max power (W)
  High-perf w/ heatsink          160    180    218    251    288
  Cost-performance                85     98    120    138    158
  Hand-held                      3.2    3.5    3.0    3.0    3.0

• These are targets
• Based on historical trends, the high-performance power targets seem optimistic
• Intel papers suggest that in the 45-75W range, cooling costs $1/W; but then rate of increase goes up: $2, $3/W, maybe more!
(Borkar, IEEE Micro ’99, Gunther et al, ITJ ’01)

Leakage Power
• The fraction of leakage power is increasing exponentially with each generation
• Also exponentially dependent on temperature
[Figure: static power / dynamic power (percentage, 0 to 70) versus temperature (298 K to 373 K) for the 180nm, 130nm, 100nm, 90nm, 80nm, and 70nm nodes, showing an increasing ratio across generations. Source: Skadron et al, University of Virginia]

Trends in Battery Technology
• Battery lifetime is increasing perhaps 8-10%/yr. (Powers, Proc. of IEEE 1995)
• Not keeping up with rate of growth in energy consumption
• Source: Rabaey 1995, cited in Irwin et al, “Low Power Design Methodologies, Hardware and Software Issues”, tutorial at PACT 2000

Roadmap
• Introduction & Trends
• Dynamic Power Dissipation
  – Sources, modeling, reduction techniques
• Static Power Dissipation
  – Sources, modeling, reduction techniques
• Summary

Dynamic Power Dissipation
• Roadmap
  – Sources of dynamic power dissipation
  – Modeling dynamic power
  – Circuit- and architecture-domain techniques to reduce power

Dynamic Power Consumption
• Power dissipated due to switching activity
• A capacitance is charged and discharged
  – 0→1 transition (charge from Vdd): $E_c = \frac{1}{2} C_L V^2$
  – 1→0 transition (discharge): $E_d = \frac{1}{2} C_L V^2$
• Charge/discharge at the frequency f: $P = C_L V^2 f$
• Note that energy consumed from battery is $C_L V^2$ and is drawn upon charging

Dynamic Power Dissipation
• Equation: $P = a \, C_L \, V_{dd}^2 \, f$
  – a: activity factor
    • Depends on the processor architecture
  – $C_L$: capacitance of the circuit
    • Depends on the design style, number of transistors, transistor sizing, etc.
  – $V_{dd}$: operating voltage
  – f: frequency

Dynamic Power Modelling
• $P = a \, C_L \, V^2 f$
• Information needed
  – Activity counters in each unit
  – Energy dissipated per
access
[Figure: modeling flow: the configuration feeds a performance model, which produces performance metrics and activity; the activity feeds a power model, which produces power metrics]
• For precision, “a” (# of signal transitions) should be measured or at least estimated with a probabilistic model
• More commonly, a = 0.5 is assumed

Dynamic Power Modelling
• Activity counters
  – Performance model is used
  – Counters for: cache access, FU usage, register file, ...
• Energy per access
  – Analytically: calculating capacitances as function of size, ports, etc.
    • Example: cache access: decoder, precharge transistors, bitline, cell access, wordline, sense amplifiers ...
    • Wattch (Brooks et al, ISCA 2000)
    • Cacti
  – Empirically: using low level designs and applying “virus” tests
    • Virus test: microbenchmark that stresses a particular unit
    • ALPS (Gunther et al, ITJ, 2001)
  – Circuit-extracted model
    • PowerTimer – IBM Power4 (Brooks et al, PACS’00)
    • AccuPower – parameterized, based on SPICE measurements of actual layouts (SUNY Binghamton, Ponomarev et al, DATE’02)
    • PowerAnalyzer – StrongARM (Michigan, assoc. w/ SimpleScalar)
• Many of these ignore the actual number of signal transitions

Circuit-Level Techniques
• Transistor sizing
• Signal and clock gating
• Circuit restructuring
• Low power caches
• Low power register files
• Issue queue
• These typically reduce the capacitance being switched

Transistor Sizing
• Transistor sizing plays an important role in reducing power
[Figure: chain of stages with capacitances $C_0, C_1, \ldots, C_{N-1}, C_N$]
• $K = C_i / C_{i-1}$
• $\text{Delay} \sim a \, (K / \ln K)$
• $\text{Power} \sim K / (K - 1)$
• Optimum K for both power and delay must be pursued

Signal Gating
• “techniques to mask unwanted switching activities from propagating forward, causing unnecessary power dissipation”
[Figure: gating element with a ctrl input and an output]
• Implementation
  – Simple gate
  – Tristate buffer
  – ...
• Control signal needed
  – Generation requires additional logic
• Identification of signals to be gated
  – Clock
  – Address bus
• Also helps to prevent power dissipation due to glitches

Clock Gating
• “Disabling a functional block when it is not required for an extended period”
[Figure: a gate controlled by a ctrl signal in the clock path to a functional unit]
• Implementation
  – Simple gate that replaces one buffer in the clock tree
  – Delay is generally not a concern
• Decision
  – Architectural level

Circuit Restructuring
• Pipeline (can reduce frequency)
• Parallelize (can reduce frequency)
• Reorder inputs so that most active input is closest to output (reduces switched capacitance)
• Restructure gates (equivalent functions are not equivalent in switched capacitance)
• Energy-efficient flip-flops and latches

Cache Design
[Figure: SRAM array with R rows and C columns, showing bitlines, wordlines, row decoder, column decoder, and sense amps]
[Figure: bar chart (scale 0 to 80) with read and write bars for each component: decoder, wordlines, TBLSA, DBLSA, I/O buses, other]
• $C_{access} = R \cdot C \cdot C_{cell}$
• Reducing power
  – Switched capacitance
  – Voltage swing
  – Activity factor
  – Frequency
• TBLSA: tag bitlines & sense amp.
• DBLSA: data bitlines and sense amp.
• Cache parameters: 16 KB cache, 0.25 μm
• Villa et al, MICRO 2000

Cache Design
• Banked organization
  – Targets switched capacitance
  – $C_{access} = R \cdot C \cdot C_{cell} / B$
• Dividing word line
  – Same effect for wordlines
• Reducing voltage swings
  – Sense amplifiers used to detect Vdiff across bitlines
  – Read operation can be curtailed as soon as Vdiff is detected
  – Limiting voltage swing saves a fraction of power
• Pulse word lines
  – Enabling the word line for the time needed to discharge bitcell voltage
  – Designer needs to estimate access time and implement a pulse generator

Low Power Register File Design
• RFs usually use single-ended bitlines
• Lot of zeros fetched from the RF
• Modified storage cell
  – Bitline connections are modified to eliminate bitline discharge when reading a zero
• Tseng and Asanovic, ICSD, 2000
• Zyuban and Kogge, ISLPED 1998

Efficient Issue Queue
• Constitutes a high fraction of the overall power
  – >25% for some authors
[Figure: wakeup logic for one issue queue entry: result tags (Tag 1 ... Tag w) are compared against each source operand tag, and the comparator outputs are ORed to set that operand's RDY bit]

Efficient Issue Queue
• Useful comparison
  – Empty entries and ready entries consume energy
    • Wakeup of empty entries can be disabled: gating off precharge logic using valid bit
    • Wakeup of ready sources can be disabled: gating off precharge logic using ready bit
  – Folegnani and Gonzalez, ISCA 2001
• Energy-efficient comparators
  – Traditional comparators dissipate energy on a mismatch in any bit position.
  – 10%-20% of source operands match each cycle
  – Solution: comparators that dissipate energy on a match
  – Kucuk et al, ISLPED 2001

Architectural-Level Techniques
• Encoding/compression
• Energy-efficient front end
• Energy-efficient caches
• Asymmetric processors
• Dynamic Voltage/Frequency scaling
• Multi clock domain architectures (similar to GALS)
• Pipeline gating
• Compiler techniques
• Sleep modes
• These typically take advantage of locality or slack

Bus Invert Encoding
• Reduce power of parallel synchronous signals
• Idea: minimize the number of transitions
  – (Stan & Burleson, IEEE Trans. on VLSI, 1995)
• Sender examines the current and the next values
  – Decides whether to send the true or the complement signal
  – Additional polarity signal is sent along with data
• Example

                         Sending next data    Sending NOT (next data)
  Current data           110011101            110011101
  Value sent             000100110            111011001
  Number of transitions  7                    2

Dynamic Zero Compression
• Zero Indicator Bit (ZIB) added to each byte
  – Enabled if a zero is stored in cache
  – On a read access, bitline discharge is prevented by disabling local wordline
  – On a write, if the byte is zero, just ZIB is written
• Circuit modifications
  – Zero-detection and store bus drivers
  – Wordline gating: 8-bit data is driven by the associated ZIB
  – Sense amps: modified to drive a zero if ZIB active
• Drawbacks
  – 9% area increase, 2-gate delay increase
• Results
  – 26% energy reduction in the data cache, 10% in the instruction cache
• Villa et al, MICRO 2000

Exploiting Narrow Width Operands
• High percentage of integer operations require <16 bits
• Difficult for the compiler to know the actual operand size
• Variability for the same instruction in successive instances
• Clock gating is used to partially disable the FU
[Figure: 64-bit integer FU whose operands A and B are each held in a low latch (bits 0-15) and a high latch (bits 16-63); zero-detect logic produces a zero48 signal that is ANDed with clk for the high latches and selects 0 for the upper bits of the 64-bit result]
• Brooks and Martonosi, HPCA 1999

Energy-Efficient Front End: Branch Prediction
• Parikh et al, HPCA’02, IEEE Trans. Computers ‘04
• Branch prediction accuracy is a major determinant of pipeline activity -> spending more power in the branch predictor can be worthwhile if it improves accuracy
• Branch predictors can be designed to reduce power, eg
  – Banking
  – Gate off unnecessary accesses (“prediction probe detector”)

Energy Efficient Front End: Register Renaming
• RAT often implemented as a multiported register file indexed by logical register, returns physical register
• Liu and Lu, MICRO’00; Kucuk et al, PATMOS’03
  – Hierarchical RAT: top level is a cache of the full table
  – Prevent lookup of sources that will be supplied by a freshly renamed instruction in the same rename group
  – Filter cache
• Could instead organize as an associative lookup in a table organized by physical register with dissipate-on-match comparator (Ergin et al, ICCD’02)

Energy-Efficient Caches
• Filter cache
  – Small L0 cache filters many accesses to L1, allows an L1 with fewer ports (Kin et al, MICRO-30)
• Banks
• Selective cache ways (Albonesi, MICRO-32)
  – Ways in a set associative cache can be disabled if not needed
  – Many variations of this approach
• Staggering number of papers on this topic
  – Exploit victim
cache, load-store queue
  – Clever cache organizations (eg combining banks w/ high assoc, specialized caches, etc.)
  – See recent proceedings of VLSI, architecture conferences, esp. ISLPED

Asymmetric Processors
• Processors have different “versions” of the same resource, with different power/latency
  – Fast, power-hungry resources are allocated to critical instructions
  – Slow, low-power resources are allocated to non-critical instructions
• Criticality predictor is needed!!!

Asymmetric Processors
• Reducing power of functional units
  – 2 sets of functional units
  – 2 sets of instruction queues
  – Criticality predictor
• Critical instructions
  – In-order queue: critical path is usually a serial chain of dependent instructions
  – Fast functional units
• Non-critical instructions
  – OoO queue
  – Slow functional units
• Seng et al, MICRO 2001

Dual Speed Pipelines
[Figure: pipeline with fetch, decode, register file, and commit stages, a fast pipeline and a slow pipeline, and a criticality predictor]
• Slow pipeline works at half the frequency
• Criticality predictor is a key component to keep energy-efficiency
• No communications penalties
• Pyreddy and Tyson, WCED 2001

Dynamic Voltage/Frequency Scaling
• Allow the device to dynamically adapt the voltage (and the frequency)
  – Already implemented in many processors
• Tradeoff between power reductions and delay increase
  – $P \sim V_{dd}^2$
  – Delay $\sim V_{dd} / (V_{dd} - V_{th})^k$, so $f \sim (V_{dd} - V_{th})^k / V_{dd}$
• Implementation
  – Voltage regulator MUST BE energy-efficient
  – Predict future processor utilization and adjust frequency/voltage to maximize power reduction while keeping performance

Transmeta™ LongRun™
• Crusoe processor can configure itself*
  – Voltage changes in steps of 25 mV (depending on the voltage regulator)
  – Frequency changes in steps of 33 MHz
  – From 1.6 V, 600 MHz to 1.2 V, 300 MHz (2001)
• Management
  – Implemented in the Code Morphing™ software layer
  – Idle time of the system is sampled to
determine performance demands
• Thermal extension
  – May be a form of thermal throttling
  – Expands the thermal budget of the processor
* Source: http://www.transmeta.com

Transmeta™ LongRun™
• Idle time
  – Voltage drops to minimum
• On-line activity
  – Voltage rises to maximum
• Real-time activity
  – Voltage adjusted to meet requirements
  – DVD player
    • 24 frames/second
Source: Transmeta

Intel SpeedStep® Configuration*
• From 0.844 V (600 MHz) to 1.48 V (1.7 GHz)
• 100 μs delay
• Voltage-frequency switching separation
[Figure: timeline of transitions: no change, voltage transition, frequency transition; frequency transition, voltage transition]
* Source: http://www.intel.com

Intel SpeedStep® Configuration
• Clock partitioning
  – Core clock
  – Bus clock (sequencer and interrupt interface)
• Event blocking
  – Interrupts, pin events and snoop requests are not lost

Voltage Scheduling
• Real-time problem will be discussed later
• For non-real-time workloads, goal is to improve energy efficiency
• This is hard, because it is difficult to predict an arbitrary workload’s future needs without deadline information
• Instead, try to schedule processes and voltages to reduce idle time
  – eg, Weiser et al, OSDI-1

Sleep Modes
• ACPI: Advanced Configuration and Power Interface
  – Developed by Microsoft, HP, Toshiba, Phoenix and Intel
  – Establishes interfaces for OS-directed power management
  – Replaces APM, MPS APIs and PnP BIOS
• Defines
  – Hardware registers
  – BIOS interfaces
  – System and device power states
Source: ACPI overview, http://www.acpi.info

DVS “Critical Power Slope”
• It may be more efficient not to use DVS, and to run at the highest possible frequency, then go into a sleep mode!
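To make the race-to-sleep claim concrete, here is a minimal sketch that compares the energy of the two strategies over one period. The operating points, power figures, and function names are hypothetical and chosen only for illustration; switching overheads are ignored.

```python
def period_energy_joules(work_cycles, period_s, freq_hz, active_w, sleep_w):
    """Energy over one period: run at freq_hz until the work is done, then sleep."""
    busy_s = work_cycles / freq_hz
    if busy_s > period_s:
        raise ValueError("work does not fit in the period at this frequency")
    return active_w * busy_s + sleep_w * (period_s - busy_s)


def compare_strategies(sleep_w):
    """Return (dvs_energy, race_to_sleep_energy) for hypothetical operating points."""
    work_cycles = 3.0e8   # hypothetical: cycles of work per period
    period_s = 1.0        # hypothetical: length of one period
    # Hypothetical low point: 300 MHz at 1.5 W; hypothetical high point: 600 MHz at 2.6 W.
    dvs = period_energy_joules(work_cycles, period_s, 300e6, 1.5, sleep_w)
    race = period_energy_joules(work_cycles, period_s, 600e6, 2.6, sleep_w)
    return dvs, race


if __name__ == "__main__":
    for sleep_w in (0.05, 0.5):
        dvs, race = compare_strategies(sleep_w)
        winner = "race-to-sleep" if race < dvs else "DVS"
        print(f"sleep power {sleep_w} W: DVS {dvs:.3f} J, race {race:.3f} J -> {winner}")
```

With a deep sleep state the high-frequency run wins in this example, while a shallow sleep state tips the balance back toward DVS.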
• Depends on power dissipation in sleep mode, and on power dissipation at lowest voltage
• This has been formalized as the critical power slope (Miyoshi et al, ICS’02):
  – $m_{critical} = (P_{f_{min}} - P_{idle}) / f_{min}$
  – If the actual slope $m = (P_f - P_{f_{min}}) / (f - f_{min}) < m_{critical}$, then it is more energy efficient to run at the highest frequency, then go to sleep
• Switching overheads must be taken into account

Multi Clock Domain Architecture
• Multiple clock domains inside the processor
  – Globally-asynchronous locally-synchronous (GALS) clock style
• Independent voltage/frequency scaling
• Synchronizers to ensure inter-domain communication

Multi Clock Domain Architecture
• Advantages
  – Local clock design is not aware of global skew
  – Each domain limited by its local critical path, allowing higher frequencies
  – Different voltage regulators allow for a finer-grain energy control
  – Frequency/voltage of each domain can be tailored to its dynamic requirements
  – Clock power is reduced
• Drawbacks
  – Complexity and penalty of synchronizers
  – Feasibility of multiple voltage regulators

Multi Clock Domain Architecture
• Synchronization
[Figure: CLK1 and CLK2 waveforms with edges labeled 1 to 4 and the interval T between them]
  – Src runs with CLK1, dst with CLK2
  – Src writes at T1
  – If T > Ts then dst can use the data at T2
  – If T < Ts then dst can use the data at T3
• Semeraro et al, ISCA 2003

Multi Clock Domain Architecture
• Domains must be carefully chosen
  – Small cost on communications
  – Re-using existing structures
• Example: 5 domains
  – Front-end
  – Integer unit
  – FP unit
  – On-chip cache unit
  – Main memory

Multi Clock Domain Architecture
[Figure: processor partitioned into clock domains: front-end (fetch, L1 i-cache, IFQ, branch predict, dispatch, rename), integer (IIQ, int. register file, int. FUs), floating point (FIQ, fp. register file, fp. FUs), memory (LSQ, L1 d-cache, unified L2 cache), and main memory]
• Magklis et al, ISCA 2003

Multi Clock Domain Architecture
• Dynamic voltage/frequency scaling in each domain
  – Reconfiguration points must be chosen
• Off-line: “shaker” algorithm
  – Aggressive oracle algorithm with good results
  – Uses detailed dynamic execution trace to find frequencies
  – It is not practical, requires future knowledge of this precise dynamic run
• On-line: Attack-decay
  – Interval-based hardware algorithm
  – Transparent to the application, minimal overhead
  – More conservative, achieves 75% efficiency of off-line
• Profile-based
  – Use profiling to associate frequencies with parts of the code
  – When these points in the code are reached during a dynamic run then change frequencies

Gating/Throttling
• Gating: disable some of the stages of the processor
  – To reduce useless activity: after a branch misprediction
    • Manne et al, ISCA 1998
  – Effectiveness is heavily dependent on accuracy of branch confidence predictor
    • Parikh et al, HPCA’02
• Throttling: slow down some processor stage when it is predicted that the performance will not be reduced
  – Branch misprediction
  – Long latency load miss
  – IPC reduction in general
  – Baniasadi and Moshovos, ISLPED 2001

Selective Throttling for Control Speculation
• Control speculation increases power dissipation (28%)
  – Energy wasted by mispredicted instructions
• Selective throttling of fetch/decode
  – Based on branch confidence
• Gating of selection stage
  – Instructions that likely belong to a mispredicted path
• 9% Energy-Delay improvement
[Figure: bar chart of speedup, power savings, energy savings, and E-D improvement (%, scale 0 to 30) for oracle fetch, oracle decode, and oracle select]
• Aragon et al, HPCA 2003

Co-Adaptive Instruction Fetch and Issue
• Fetch gating based on issue queue utilization
  – Fetch is stopped if close parallelism is present
  – Rather than using instruction window usage
  – Just instructions from the head
of the IQ are issued
  – To match the size of the window residing in the IQ to application’s ILP
• Fetch gating combined with dynamic issue queue adaptation
• 20% energy-delay improvement
• Buyuktosunoglu et al, ISCA 2003

Compiler Techniques for Low Power
• Good reference: tutorial by Kremer, PLDI’03
• Traditional compiler optimizations often improve energy efficiency
  – eg, register allocation, CSE, tiling for cache hit rate
• But some compiler optimizations waste energy
  – eg, aggressive speculation
• Energy efficiency of code sequences is highly dependent on microarchitecture
  – eg, free slot in a VLIW word

Compiler Techniques for Low Power, cont.
• Compiler-guided DVS
  – v1: reduce voltage while meeting real-time deadlines
  – v2: reduce voltage in memory-bound program regions
    • Hsu and Kremer, ISLPED’01, PLDI’03
    • Xie et al, PLDI’03
• Dynamic resource configuration/hibernation
  – Deactivate modules when they won’t be used for a long time (>> sleep/wakeup time)
    • Heath et al, PACT’02
• Profile/compiler-guided adaptation
  – eg, profile-guided MCD adaptation mentioned earlier (Magklis et al, ISCA’03)
  – eg, subroutine-guided (“positional”) adaptation (Huang et al, ISCA’03)
    • Uses a hierarchy of low-power modes
• Much work in this area – this only touches the surface

Power Savings for Real Time Systems
• Soft vs. hard real time
• Periodic vs. aperiodic
  – Periodic tasks are especially important in control systems
• Most work has focused on DVS scheduling
• Examples
  – MPEG playback
  – Web server

DVS for Multimedia Apps (soft real-time approach)
• MM apps must process every frame within a time limit
  – If idle time, then there is some slack
  – IPC is constant across frames of the same type
  – Slow down the processor to meet deadlines
• 2 phases
  – Profiling
    • Determines max. number of insts.
can be executed for each conf
    • Sorts that list
  – Adaptation
    • Predicts the number of instructions to be executed in the next interval
    • Uses the lowest energy hardware configuration that fulfills requirements
• Hughes et al, MICRO 2001

DVS for Multimedia Apps (hard real-time approach)
• Buffering decoded frames provides a control point to enforce deadlines using feedback control
• Dead-zone proportional-integral controller sets DVS to maintain queue occupancy
• No profiling or other prior knowledge about stream is needed
• If queue becomes empty, “panic” model forces highest speed
[Figure: frame queue occupancy with a dead zone in the middle; frequency is decreased on one side of the dead zone and increased on the other]
• Lu et al, ICCD 2003

DVS for Web Servers
• Basic idea: load balance, then do DVS to reclaim slack (Elnozahy et al, PACS’02)
  – But it may be more profitable to cluster requests onto fewer nodes and put some to sleep
  – Even on single nodes, it may be profitable to briefly defer requests, then batch them at the highest frequency before going to sleep (Elnozahy et al, USITS’03)
• To provide delay guarantees requires feedback control (Sharma et al, RTSS 2001)
  – A natural and effective control point is synthetic utilization
    • Combines true utilization with real-time schedulability

Other Approaches
• Almost all RT algorithms attempt to reclaim slack
  – Exploit direct knowledge of task execution times or utilization
  – Program checkpoints: check performance relative to deadline and adjust DVS accordingly
• Episode detection (Flautner et al, MOBICOM’01)
  – Identify interactive and periodic events, schedule accordingly
• VISA (Anantaraman et al, ISCA’03)
  – Model a superscalar (unpredictable) processor as a predictable scalar processor to perform RT analysis and scheduling, then reduce DVS setting when superscalar processor runs faster than predicted
  – Use program checkpoints to check progress/slack

Short-Circuit Power
• Main
solutions are
  – Reduce rise/fall times
    • Tradeoff: reducing rise/fall times requires stronger drivers, more dynamic power
  – Reduce capacitance being switched

Roadmap
• Introduction & Trends
• Dynamic Power Dissipation
  – Sources, modeling, reduction techniques
• Static Power Dissipation
  – Sources, modeling, reduction techniques
• Summary

Static Power Dissipation
• Static power: dissipation due to leakage current
  – Growing worse because Vth is not scaling as fast as Vdd
• Roadmap
  – Most important sources of static power: subthreshold leakage and gate leakage
  – Inter-process variation
  – Trends
  – Modeling leakage power
  – Circuit/architectural-level techniques

Static Power
• Main mechanisms for leakage current
  – Subthreshold (Berkeley predictive model):
    $I_{leakage} = \mu_0 \, C_{OX} \, \frac{W}{L} \, e^{b (V_{dd} - V_{dd0})} \, v_t^2 \left(1 - e^{-V_{dd}/v_t}\right) \exp\left(\frac{-V_{th0} - V_{off}}{n \, v_t}\right)$
  – Gate
    • $I_{gate} = I_{gate0} \cdot e^{a (t_{ox} - t_{ox0})} \cdot e^{b (V_{dd} - V_{dd0})}$
• We will focus on subthreshold
  – Gate leakage has essentially been ignored
  – New gate insulation materials may solve problem, eg recent Intel announcement
    • R. Chau, Technology@intel Magazine.
www.intel.com
• Gate-induced drain leakage (GIDL) occurs at negative gate voltages and high Vdd or high values of reverse body bias

Effects of Parameter Variations
• Ioff depends exponentially on Vth
  – There is a large fluctuation of Ioff from die to die and from gate to gate
• Controlling Vth is difficult in nanometer scale
  – Drain-induced barrier lowering
    • Channel length is not constant
    • Exacerbated in sub-100nm devices
  – Discrete dopant effects
    • In a very small channel, small number of dopants
    • Presence of these dopants, and random fluctuation of their number, lead to changes in Vth from device to device
• Process variation affects
  – Gate length (Ldrawn)
  – Gate oxide thickness (Tox)
  – Channel dose (Nsub)
• Srivastava et al, ISLPED 2002

Static Power
• Motivation
  – Growing relative to dynamic power dissipation: soon 50% of total power
  – Exponentially dependent on Temp, Vth, Vdd
  – Natural target for optimization: idle transistors
[Figure: static power / dynamic power (percentage, 0 to 70) versus temperature (298 K to 373 K) for the 180nm, 130nm, 100nm, 90nm, 80nm, and 70nm nodes, showing an increasing ratio across generations. Source: Skadron et al, University of Virginia]

Static Power
• Modeling leakage
  – Butts and Sohi (MICRO-33)
    • $P_{static} = V_{cc} \cdot N \cdot k_{design} \cdot \hat{I}_{leak}$
    • $\hat{I}_{leak}$ determined by circuit simulation, $k_{design}$ empirically
    • Key contribution: separate technology from design
  – HotLeakage (UVA TR CS-2003-05, DATE’04)
    • Extension of Butts & Sohi approach: scalable with Vdd, Vth, Temp, and technology node; adds gate leakage
    • $\hat{I}_{leak}$ determined by BSIM3 subthreshold equation and BSIM4 gate-leakage equations, giving an analytical expression that accounts for dependence on factors that may change at runtime, namely Vdd, Vth, and Temp
    • $k_{design}$ replaced by separate factors for N- and P-type transistors
    • $k_{design}$ also exponentially dependent on Vdd
and Tox, linearly dependent on Temp
    • Currently integrated with SimpleScalar/Wattch for caches

Static Power
• Modeling leakage (cont.)
  – Su et al, IBM (ISLPED’03)
    • Similar approach to HotLeakage, but they observe that modeling the change in leakage allows linearization of the equations
  – Many, many other papers on various aspects of modeling leakage
    • Most focus on subthreshold
    • Few suggest how to model leakage in microarchitecture simulations

Circuit/architectural level techniques
• Transistor sizing
• Dual Vth
• DVS
• Dynamic threshold voltage – reverse body bias
• Sleep transistors
• Low leakage caches/branch predictors
• Low leakage register file
• Low leakage issue queue
• Low leakage ALUs
• Techniques for reducing gate leakage
• What else?

Transistor sizing, Dual-Vth
• Transistor sizing
  – Reducing W/L reduces leakage: use smallest possible transistors
  – Leakage-performance tradeoff
• Dual-Vth
  – High-threshold transistors dramatically reduce leakage: use low-Vth on critical paths, high-Vth elsewhere
  – Often suggested in caches: many possible permutations
• DVS
  – Leakage is exponentially dependent on Vdd, so DVS reduces leakage

Dynamic Threshold Voltage
• Adjust threshold voltage dynamically
  – Also called reverse body bias (RBB), auto-backgate-controlled multi-threshold CMOS (ABB-MTCMOS) (Nii et al, ISLPED’98)
  – Apply negative voltage to body: requires larger VGS to establish channel, so it raises Vth
• Engage RBB for idle transistors
  – Preserves state
• Requires twin-well process; more expensive to manufacture
• Limited by GIDL
• Can also be used at testing to adjust circuit properties and reduce parameter variations

Sleep Transistors
• Add a high-Vth transistor between the circuit and either/both power rails – the sleep transistor
  – Also referred to as a “header” (to Vdd) or “footer” (to
ground)
The high-Vth transistor cuts off most leakage
In fact, a properly sized, lower-Vth footer transistor can preserve enough leakage to keep the cell active (Li et al, PACT’02; Agarwal et al, DAC’02)
Great care must be taken when switching back to full voltage: noise can flip bits
Extra latency may be necessary when reactivating

Low-Leakage Caches
Gated-Vdd/Vss (Powell et al, ISLPED’00; Kaxiras et al, ISCA-28)
  • Uses sleep transistor on Vdd/ground for each cache line
  • Typically considered non-state-preserving, but recent work (Agarwal et al, DAC’02) suggests that gated-Vss may preserve state
  • Many algorithms for determining when to gate
  • Simplest (Kaxiras et al, ISCA-28): two-bit access counter and decay interval
  • Adaptive decay intervals: hard
Drowsy cache (Flautner et al, ISCA-29)
  • Uses dual supply voltages: normal Vdd and a low Vdd close to the threshold voltage
  • State preserving, but requires an extra cycle to wake up; two extra cycles if tags are decayed
State preservation using leakage currents (Li et al, PACT’02; Agarwal et al, DAC’02)
  • Similar to gated-Vss but designed to keep supply voltage high enough to preserve state (100-120 mV)

Low-Leakage Caches, cont.
Comparison (Parikh, Li, et al, WDDD’03, DATE’04)
Compared non-state-preserving gated-Vss with state-preserving drowsy cache
If gating is state-preserving, it wins because it essentially eliminates subthreshold and gate leakage
  • Unless wakeup time is significantly longer than with drowsy
Otherwise, drowsy cache typically has an advantage because it is state preserving; no L2 accesses needed on “induced misses”
But induced misses are rare, so for a reasonable range of on-chip L2 penalties (< 8 cycles in our studies), gating can still be superior

Low-Leakage Caches, cont: 4T Cells
4-transistor (4T) cells
  • Eliminates two transistors connected to Vdd
  • Naturally decays over time
  • Refreshes upon access
  • When decayed, force default output
  • Up to 33% smaller than equivalent 6T
  • Decays quickly (8K cycles at 1 GHz)
  • Leak only as much energy as is deposited
[Figure: 6T (left) and 4T (right) circuit diagrams]
4T-based branch predictors, caches (Hu, Juang, et al, ISLPED’02, CA-Letters’02)
  • Non-state-preserving
  • Decay rate: temperature-dependent; can be adjusted with passives
  • Eliminates decay state bits

Low-Leakage Caches, cont: Other Techniques
RBB (Nii et al, ISLPED’98)
  • Back bias cache lines that are idle; can use the same decay counters as gated-Vdd/Vss
Leakage-biased bitlines (Heo et al, ISCA-29)
  • Disable precharge and let the bitlines float: they will settle to a value that minimizes leakage
  • Can only be applied to idle subbanks and requires accurate prediction of which subbank will be accessed
Huge variety of other techniques; this is only an overview of some of the major ones

Register Files
In general, state-preserving techniques for caches may work for register files too
Leakage-biased bitlines work here too
  • Register file divided into subbanks
Alvandpour et al, Intel, ISLPED’01
  • Uses dual Vth and a conditional keeper
  • “Keeper” used on dynamic circuits to
counteract voltage droop due to leakage; they constitute a static pull-up path
  • Dynamic circuits arise in the muxes due to multiporting
  • “Conditional” keeper technique uses two cascaded keepers; one is fixed and the other is only engaged when needed to drive an output (requires careful timing analysis)
  • Access transistors and keepers are high-Vt

ALUs
Usually dual-Vt domino logic
Area & speed
Sleep transistors can be used, but they have a cost
Dynamic nodes are discharged
Can be used if worthwhile
Dropsho et al, MICRO 2002

Other Techniques
Queues (eg, issue queues)
  • Various occupancy-based or rate-matching techniques have been proposed for issue queue resizing. Deactivating queue entries reduces leakage
  • eg, Ponomarev et al, MICRO-34
Compiler techniques
  • When the compiler knows that regions are idle, they can be deactivated
  • eg, Zhang et al, MICRO-35

Gate Leakage
Any technique that reduces Vdd reduces gate leakage
Otherwise it seems difficult to develop architecture techniques that directly attack gate leakage
In fact, very little work has been done in this area
One example: domino gates (Hamzaoglu & Stan, ISLPED’02)
  • Replace traditional NMOS pull-down network with a PMOS pull-up network
  • Gate leakage is greater in NMOS than PMOS
  • But PMOS domino gate is slower

Roadmap
Introduction & Trends
Dynamic Power Dissipation
  • Sources, modeling, reduction techniques
Static Power Dissipation
  • Sources, modeling, reduction techniques
Summary

Other Power-Related Issues
Thermal
Managing on-chip temperatures (as opposed to average heat dissipation) is not just a matter of reducing average power density
Spatial and temporal variation
  • Spatial: hot spots; must reduce power density in the right places
  • Temporal: must reduce power when chip is hot; this is often when there is less slack
Must model temperature directly
  • Average power metrics do not accurately predict temperature (Skadron et al, ISCA’03)

Other Power-Related Issues
Voltage stability (dI/dt)
Inductance means that abrupt changes in current can cause voltage droop
This can be addressed with decoupling capacitance, but the required capacitance is becoming expensive
Grochowski et al, HPCA’02; Joseph et al, HPCA’03

Roadmap
Introduction & Trends
Dynamic Power Dissipation
  • Sources, modeling, reduction techniques
Static Power Dissipation
  • Sources, modeling, reduction techniques
Summary

Summary
Power dissipation is becoming a huge concern
  • Total power budget
  • Power density (thermal)
  • Energy consumption & battery life
Power dissipation
  • Switching
  • Short-circuit
  • Leakage
Power modeling crucial
  • Academia: accurate research
  • Industry: detect hot spots on time to meet POR

Summary
Reducing dynamic power
Circuits perspective
  • Energy-effective access (reducing capacitance or driving voltage)
  • Gating
Architectural perspective
  • Decreasing activity factor
  • Pipeline gating
  • Adjusting voltage/frequency to meet application requirements
Reducing static power
  • Dual Vth
  • Non-state-preserving vs. state-preserving techniques
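
Illustrative sketch (not from the original slides): the simplest gating policy mentioned under Low-Leakage Caches, a two-bit access counter per line plus a decay interval (Kaxiras et al, ISCA-28), might be modeled roughly as below. The class and parameter names (DecayCache, tick_period, induced_misses) are hypothetical, and the model assumes that a global tick increments every line's counter, that an access resets it, and that a saturated counter gates the line off. It shows only the non-state-preserving versus induced-miss tradeoff, not a real cache.

```python
"""Minimal sketch of a cache-decay policy (illustrative; all names are hypothetical)."""
from dataclasses import dataclass
from typing import List

COUNTER_MAX = 3  # a two-bit saturating counter holds 0..3


@dataclass
class Line:
    counter: int = 0      # global ticks seen since the last access
    powered: bool = True  # False once the sleep transistor has gated the line


class DecayCache:
    def __init__(self, num_lines: int, tick_period: int) -> None:
        # tick_period (cycles between global ticks) is an assumed parameter;
        # a line is gated on the (COUNTER_MAX + 1)-th tick without an access.
        self.lines: List[Line] = [Line() for _ in range(num_lines)]
        self.tick_period = tick_period
        self.cycle = 0
        self.induced_misses = 0

    def access(self, index: int) -> bool:
        """Touch a line; return False if the access is an induced miss."""
        line = self.lines[index]
        was_powered = line.powered
        if not was_powered:
            self.induced_misses += 1  # contents were lost: refetch from L2
        line.powered = True
        line.counter = 0
        return was_powered

    def advance(self, cycles: int) -> None:
        """Advance time, firing a global tick every tick_period cycles."""
        for _ in range(cycles):
            self.cycle += 1
            if self.cycle % self.tick_period == 0:
                self._global_tick()

    def _global_tick(self) -> None:
        for line in self.lines:
            if not line.powered:
                continue
            if line.counter == COUNTER_MAX:
                line.powered = False  # counter saturated: gate the line off
            else:
                line.counter += 1

    def powered_fraction(self) -> float:
        """Fraction of lines still leaking at full supply."""
        return sum(line.powered for line in self.lines) / len(self.lines)


if __name__ == "__main__":
    cache = DecayCache(num_lines=2, tick_period=100)
    cache.advance(250)   # global ticks at cycles 100 and 200
    cache.access(0)      # line 0 is reused, so its counter resets
    cache.advance(150)   # global ticks at cycles 300 and 400
    print(cache.powered_fraction())  # line 1 has decayed, line 0 has not
    print(cache.access(1))           # touching line 1 is now an induced miss
```

In this model, a shorter tick_period gates idle lines sooner (less leakage) but raises the number of induced misses, which is the tradeoff the comparison slide discusses.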