Power Management in High
Performance Processors through
Dynamic Resource Adaptation and
Multiple Sleep Mode Assignments
Houman Homayoun
National Science Foundation Computing Innovation Fellow
Department of Computer Science
University of California San Diego
Outline – Multiple Sleep Mode

- Brief overview of a state-of-the-art superscalar processor
- Introducing the idea of multiple sleep mode design
- Architectural control of multiple sleep modes
- Results
- Conclusions

Copyright © 2010 Houman Homayoun
University of California San Diego
Superscalar Architecture

[Figure: block diagram of a superscalar pipeline — Fetch → Decode → Rename → Dispatch → Issue → Execute → Write-Back, with the ROB, Reservation Station, Instruction Queue, Load/Store Queue, Logical and Physical Register Files, and the functional units (F.U.).]
On-chip SRAMs+CAMs and Power

- On-chip SRAMs+CAMs in high-performance processors are large:
  - Branch Predictor
  - Reorder Buffer
  - Instruction Queue
  - Instruction/Data TLB
  - Load and Store Queue
  - L1 Data Cache
  - L1 Instruction Cache
  - L2 Cache
- Together they take more than 60% of the chip area budget (see the Pentium M processor die photo, courtesy of intel.com).
- They dissipate a significant portion of their power via leakage.
Techniques to Address Leakage in SRAM+CAM

- Circuit
  - Gated-Vdd, Gated-Vss
  - Voltage Scaling (DVFS)
  - ABB-MTCMOS
  - Forward Body Biasing (FBB), RBB
  - Sleepy Stack
  - Sleepy Keeper
- Architecture
  - Way Prediction, Way Caching, Phased Access
    - Predict or cache recently accessed ways; read the tag first
  - Drowsy Cache
    - Keeps cache lines in a low-power state, with data retention
  - Cache Decay
    - Evicts lines not used for a while, then powers them down
  - Applying DVS, Gated-Vdd, Gated-Vss to the memory cell
    - Much architectural support exists to do this.
Sleep Transistor Stacking Effect

- Subthreshold current is an inverse exponential function of the threshold voltage:

  V_T = V_T0 + γ (√(2φ_F + V_SB) − √(2φ_F))

- Stacking transistor N with sleep transistor slpN:
  - When both transistors are off, the source-to-body voltage (V_M) of transistor N increases, which reduces its subthreshold leakage current.
- Drawbacks: rise time, fall time, wakeup delay, area, dynamic power, instability
[Figure: an inverter supplied from vdd drives load C_L through NMOS transistor N (gate voltage V_gn), which is stacked on footer sleep transistor slpN (gate voltage V_gslpn) connected to vss; V_M is the virtual-ground node between N and slpN.]
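The exponential dependence above can be sketched numerically. The parameter values below (V_T0, the subthreshold slope factor n, the thermal voltage) are illustrative assumptions of ours, not figures from the talk:

```python
import math

def subthreshold_leakage(vt, vt0=0.3, n=1.5, thermal_v=0.026):
    """Relative subthreshold current, I ~ exp(-(VT - VT0) / (n * vT)).

    Shows why raising the effective threshold voltage (as the stacking
    effect does through the body bias V_SB) shrinks leakage exponentially.
    """
    return math.exp(-(vt - vt0) / (n * thermal_v))

# A 100 mV rise in effective VT cuts leakage by more than 10x.
baseline = subthreshold_leakage(0.30)
stacked = subthreshold_leakage(0.40)
print(stacked / baseline)  # ≈ 0.077
```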
Wakeup Latency

- To benefit the most from the leakage savings of stacking sleep transistors:
  - keep the bias voltage of the NMOS sleep transistor as low as possible (and that of the PMOS as high as possible)
- Drawback: impact on the wakeup latency of the circuit (sleep transistor wakeup delay + sleep signal propagation delay)
- Controlling the gate voltage of the sleep transistors trades off the reduction in leakage power (savings) against the circuit wakeup delay (overhead):
  - Increasing the gate voltage of the footer sleep transistor reduces the virtual ground voltage (V_M)
Wakeup Delay vs. Leakage Power Reduction

[Figure: normalized leakage power (left axis, 0–1) and normalized wake-up delay (right axis, 1.0–5.0) versus the (footer, header) gate bias voltage pair, from (0, 1) through (0.05, 0.96), (0.1, 0.93), (0.15, 0.89), (0.20, 0.85), (0.25, 0.80), to (0.30, 0.75), illustrating the trade-off between the wakeup overhead and the leakage power saving.]

- Increasing the bias voltage increases the leakage power while decreasing the wakeup delay overhead.
Multiple Sleep Modes Specifications

On-chip SRAM multiple sleep mode normalized leakage power savings:

Mode     | BPRED | FRF  | IRF  | IL1  | DL1  | L2   | DTLB | ITLB
basic-lp | 0.29  | 0.21 | 0.21 | --   | --   | --   | 0.25 | 0.25
lp       | 0.43  | 0.31 | 0.31 | 0.37 | 0.37 | --   | 0.34 | 0.34
aggr-lp  | 0.55  | 0.58 | 0.58 | 0.48 | 0.48 | 0.44 | 0.49 | 0.49
ultra-lp | 0.67  | 0.65 | 0.65 | 0.69 | 0.64 | 0.63 | 0.57 | 0.57

- Wakeup delay varies from 1 to more than 10 processor cycles (at 2.2 GHz).
- Large wakeup power overhead for large SRAMs.
- Need to find periods of infrequent access.
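Because each mode transition costs wakeup energy and delay, sleeping only pays off over a sufficiently long idle period. A minimal sketch of that break-even reasoning, with all quantities in illustrative units of our own choosing:

```python
def sleep_pays_off(idle_cycles, leakage_saved_per_cycle,
                   wakeup_energy, wakeup_cycles):
    """True when the leakage energy saved over an idle period exceeds the
    energy overhead of the mode transition, and the period is long enough
    to hide the wakeup delay."""
    energy_saved = idle_cycles * leakage_saved_per_cycle
    return energy_saved > wakeup_energy and idle_cycles > wakeup_cycles

# A short idle period does not justify an aggressive sleep mode...
print(sleep_pays_off(20, 1.0, 100.0, 10))    # False
# ...but a long period of infrequent access does.
print(sleep_pays_off(5000, 1.0, 100.0, 10))  # True
```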
Reducing Leakage in SRAM Peripherals

- Maximize the leakage reduction:
  - put the SRAM into the ultra low power mode
  - adds a few cycles to the SRAM access latency
    - significantly reduces performance
- Minimize the performance degradation:
  - put the SRAM into the basic low power mode
  - requires near-zero wakeup overhead
    - no noticeable leakage power reduction
Motivation for Dynamically Controlling Sleep Mode

- Large leakage reduction benefit → ultra and aggressive low power modes
- Low performance impact benefit → basic-lp mode
- Periods of frequent access → basic-lp mode
- Periods of infrequent access → ultra and aggressive low power modes

Dynamically adjust the sleep power mode.
Architectural Motivations

- A load miss in the L1/L2 caches takes a long time to service
  - prevents dependent instructions from being issued
- When dependent instructions cannot issue
  - performance is lost
- At the same time, energy is lost as well!
  - This is an opportunity to save energy
Multiple Sleep Mode Control Mechanism

[Figure: state machine over the four power modes. The unit starts in basic-lp; three pending DL1 misses move it to lp (pending DL1 misses), an L2 miss moves it on to aggr-lp (pending L2 miss/es), and a processor stall moves it to ultra-lp. When the L2 miss is serviced or flushed the unit returns to lp, and once all pending DL1 misses are serviced and the processor continues, it returns to basic-lp.]

General state machine to control power mode transitions:
- An L2 cache miss or multiple DL1 misses triggers a power mode transition.
- The general algorithm may not deliver optimal results for all units.
  - We modified the algorithm for individual on-chip SRAM-based units to maximize the leakage reduction at NO performance cost.
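The general transition scheme can be sketched as a table-driven state machine. The transition set below is our reading of the figure's edge labels, not a definitive specification:

```python
# Power modes, from least to most aggressive: basic-lp, lp, aggr-lp, ultra-lp.
TRANSITIONS = {
    ("basic-lp", "three_pending_dl1_misses"): "lp",
    ("lp", "l2_miss"): "aggr-lp",
    ("aggr-lp", "processor_stall"): "ultra-lp",
    ("ultra-lp", "l2_miss_serviced_or_flushed"): "lp",
    ("lp", "all_dl1_misses_serviced"): "basic-lp",
}

def next_mode(mode, event):
    """Follow a labeled edge; unlisted events leave the mode unchanged."""
    return TRANSITIONS.get((mode, event), mode)

mode = "basic-lp"
for event in ("three_pending_dl1_misses", "l2_miss", "processor_stall"):
    mode = next_mode(mode, event)
print(mode)  # ultra-lp
```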
Branch Predictor

[Table: instructions per branch (IPB) for the SPEC benchmarks; IPB ranges from roughly 4 for branch-heavy codes to more than 300 for some floating-point benchmarks.]

- 1 out of every 9 fetched instructions in the integer benchmarks, and 1 out of every 63 fetched instructions in the floating-point benchmarks, accesses the branch predictor.
- Always putting the branch predictor in a deep low power mode (lp, ultra-lp or aggr-lp) and waking it up on access causes
  - noticeable performance degradation for some benchmarks.
Observation: Branch Predictor Access Pattern

[Figure: IPB per 512-cycle interval over 1M cycles for swim (up to ~350) and equake (up to ~30) — the distribution of the number of branches per 512-instruction interval.]

- Within a benchmark there is significant variation in Instructions Per Branch (IPB).
- Once the IPB drops (increases) significantly, it may remain low (high) for a long period of time.
Branch Predictor Peripherals Leakage Control

- Can identify a high IPB period once the first low-IPB interval is detected.
  - The number of fetched branches is counted every 512 cycles; once the number of branches is found to be less than a certain threshold (24 in this work), a high IPB period is identified. The IPB is then predicted to remain high for the next twenty 512-cycle intervals (10K cycles).
- Branch predictor peripherals transition from basic-lp mode to lp mode when a high IPB period is identified.
- During pre-stall and stall periods the branch predictor peripherals transition to aggr-lp and ultra-lp mode, respectively.
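The interval-based monitor above can be sketched in a few lines. The 512-cycle interval, the 24-branch threshold, and the twenty-interval (10K-cycle) prediction window come from the slide; the class structure and names are ours:

```python
INTERVAL_CYCLES = 512       # sampling interval
BRANCH_THRESHOLD = 24       # fewer branches than this => high IPB
PREDICTION_INTERVALS = 20   # predict high IPB for 20 intervals (10K cycles)

class IPBMonitor:
    def __init__(self):
        self.high_ipb_intervals_left = 0

    def end_of_interval(self, branches_fetched):
        """Called every 512 cycles with that interval's branch count."""
        if branches_fetched < BRANCH_THRESHOLD:
            # Few branches -> high IPB; predict it stays high for 10K cycles.
            self.high_ipb_intervals_left = PREDICTION_INTERVALS
        elif self.high_ipb_intervals_left > 0:
            self.high_ipb_intervals_left -= 1

    def predictor_mode(self):
        # During a predicted high-IPB period the branch predictor
        # peripherals move from basic-lp to lp mode.
        return "lp" if self.high_ipb_intervals_left > 0 else "basic-lp"
```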
Leakage Power Reduction

[Figure: branch predictor peripheral leakage power reduction (0–40%) for each SPEC benchmark, broken down by the basic-lp, lp, aggr-lp and ultra-lp modes.]

Noticeable contribution of the ultra and basic low power modes.
Outline – Resource Adaptation

- Why are the IQ, ROB, and RF major power dissipators?
- Study of processor resource utilization during the service time of L2/multiple L1 misses
- An architectural approach to dynamically adjusting the size of resources during cache miss periods for power conservation
- Results
- Conclusions
Instruction Queue

- The Instruction Queue (IQ) is a CAM-like structure which holds instructions until they can be issued:
  - Set entries for newly dispatched instructions
  - Read entries to issue instructions to functional units
  - Wake up instructions waiting in the IQ once a result is ready
  - Select instructions for issue when the number of available instructions exceeds the processor issue limit (issue width)
- Main complexity: the wakeup logic
Logical View of Instruction Queue

[Figure: one IQ entry's wakeup logic — result tags broadcast on taglines tagIW0–tagIW3 are compared bit-by-bit (tag00–tag03) against the entry's source tag on four pre-charged matchlines, whose OR sets the entry's ready bit.]

- At each cycle, the match lines are pre-charged high
  - to allow the individual bits associated with an instruction tag to be compared with the results broadcast on the taglines.
- Upon a mismatch, the corresponding matchline is discharged. Otherwise, the match line stays at Vdd, which indicates a tag match.
- At each cycle, up to 4 instructions are broadcast on the taglines,
  - so four sets of one-bit comparators are needed for each one-bit cell.
- All four matchlines must be ORed together to detect a match on any of the broadcast tags. The result of the OR sets the ready bit of the instruction's source operand.

No need to always have such an aggressive wakeup/issue width!
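Behaviorally, the matchline-and-OR circuit amounts to comparing each waiting source tag against every broadcast result tag. A sketch of that behavior (the function and names are ours, not the talk's):

```python
def wakeup(waiting_tags, broadcast_tags, ready):
    """waiting_tags: one source-operand tag per IQ entry.
    broadcast_tags: result tags broadcast this cycle (up to 4 ports).
    ready: mutable list of ready bits, set when any matchline survives."""
    assert len(broadcast_tags) <= 4  # issue-width-limited broadcast ports
    for i, tag in enumerate(waiting_tags):
        # "tag in broadcast_tags" plays the role of ORing the 4 matchlines.
        if not ready[i] and tag in broadcast_tags:
            ready[i] = True
    return ready
```

For example, broadcasting tags [7, 1, 2, 5] wakes only the entry waiting on tag 7.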
ROB and Register File

- The ROB and the register file are multi-ported SRAM structures with several functionalities:
  - Setting entries for up to IW instructions in each cycle,
  - Releasing up to IW entries during the commit stage in a cycle, and
  - Flushing entries during branch recovery.

Power breakdown:

Component               | Dynamic Power | Leakage Power
bitline and memory cell | 58%           | 63%
data output driver      | 29%           | 15%
decode                  | 8%            | 11%
wordline                | 1%            | 8%
sense amp               | 4%            | 3%
Architectural Motivations

- A load miss in the L1/L2 caches takes a long time to service
  - prevents dependent instructions from being issued
- When dependent instructions cannot issue
  - after a number of cycles the instruction window (ROB, Instruction Queue, Store Queue, Register Files) is full
  - the processor issue stalls and performance is lost
- At the same time, energy is lost as well!
  - This is an opportunity to save energy
- Scenario I: L2 cache miss period
- Scenario II: three or more pending DL1 cache misses
How Architecture Can Help Reduce Power in the ROB, Register File and Instruction Queue

[Figure: issue rate decrease per SPEC benchmark under Scenario I and Scenario II, 0–100%.]

- Scenario I: the issue rate drops by more than 80%.
- Scenario II: the issue rate drop is 22% for the integer benchmarks and 32.6% for the floating-point benchmarks.
- Significant issue width decrease!
How Architecture Can Help Reduce Power in the ROB, Register File and Instruction Queue

- ROB occupancy grows significantly during Scenarios I and II for the integer benchmarks: 98% and 61% on average.
- The increase in ROB occupancy for the floating-point benchmarks is smaller: 30% and 25% on average for Scenarios I and II.

ROB occupancy increase (%):

Benchmark | Scen. I | Scen. II     Benchmark | Scen. I | Scen. II
bzip2     | 165.0   | 88.6         applu     | 13.8    | -4.9
crafty    | 179.6   | 63.6         apsi      | 46.6    | 18.2
gap       | 6.6     | 61.7         art       | 31.7    | 56.9
gcc       | 97.7    | 43.9         equake    | 49.8    | 38.1
gzip      | 152.9   | 41.0         facerec   | 87.9    | 14.1
mcf       | 42.2    | 40.6         galgel    | 30.9    | 34.4
parser    | 31.3    | 102.3        lucas     | -0.7    | 54.0
twolf     | 81.8    | 58.8         mgrid     | 8.8     | 5.6
vortex    | 118.7   | 57.8         swim      | -4.3    | 11.4
vpr       | 96.6    | 55.7         wupwise   | 40.2    | 24.4
INT avg   | 98.2    | 61.4         FP avg    | 30.5    | 25.2
How Architecture can help reducing power in ROB,
Register File and Instruction Queue
nonnonnonnonRegister File Scenario I
Scenario I
Scenario II
Scenario II
Scenario I
Scenario I
Scenario II
Scenario
occupancy
IRF
FRF
IRF
FRF
II FRF
IRF
FRF
IRF
bzip2
crafty
gap
gcc
gzip
mcf
parser
twolf
vortex
vpr
INT average
applu
apsi
art
equake
facerec
galgel
lucas
mgrid
swim
wupwise
FP average
74.4
83.4
46.2
46.3
45.1
40.8
37.4
58.7
70.9
63.9
55.3
6.0
16.1
35.4
34.2
52.6
50.4
21.7
5.9
23.3
26.3
26.6
28.8
31.9
41.1
21.2
27.2
29.3
29.8
32.3
31.1
29.0
29.2
5.6
18.3
25.0
27.4
22.5
27.4
23.8
6.2
27.8
28.8
20.9
0.0
0.1
0.1
0.2
0.0
1.0
0.0
2.6
0.3
7.8
1.1
76.6
65.7
36.2
16.1
50.0
41.8
47.7
90.0
77.1
53.5
56.5
0.0
0.0
0.7
0.1
0.0
1.1
0.0
2.1
0.2
8.6
1.2
64.8
37.6
30.7
7.1
28.9
48.7
44.0
80.7
78.1
28.7
44.7
56.6
51.4
65.8
28.7
39.8
46.8
57.0
46.0
52.4
66.4
50.3
1.7
15.8
23.0
32.7
30.3
32.1
41.7
1.9
29.7
40.5
24.0
30.7
32.2
42.9
24.0
27.2
36.4
29.8
29.8
35.0
41.0
32.0
6.2
17.9
29.0
29.4
38.4
26.0
22.1
6.4
23.1
26.9
22.1
0.0
0.0
0.6
0.0
0.0
3.2
0.1
2.5
0.2
8.7
1.4
77.3
58.8
42.9
21.0
48.1
61.0
29.7
96.7
87.1
38.0
56.2
0.0
0.0
0.5
0.1
0.0
0.1
0.0
2.0
0.2
8.3
1.0
73.7
43.6
6.3
9.6
35.0
44.2
47.0
87.2
76.2
42.2
46.0
IRF occupancy always grows for both scenarios when experimenting
with integer benchmarks. a similar case is for FRF when running
floating-point benchmarks and only during scenario II
Copyright © 2010 Houman Homayoun
University of California San Diego
25
Proposed Architectural Approach

- Adaptive resource resizing during cache miss periods:
  - Reduce the issue and wakeup width of the processor during L2 miss service time.
  - Increase the size of the ROB and RF during L2 miss service time or when at least three DL1 misses are pending.
- Simple resizing scheme: reduce to half size. Not necessarily optimal for individual units, but a simple scheme to implement in the circuit!
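The issue/wakeup-width side of this policy can be sketched in a few lines. The full width of 4 is an assumption matching the 4-wide tag broadcast mentioned earlier; the halving and the trigger conditions come from the slides:

```python
FULL_ISSUE_WIDTH = 4  # assumed baseline issue/wakeup width

def issue_width(l2_miss_pending, pending_dl1_misses):
    """Halve the issue/wakeup width while an L2 miss (Scenario I) or at
    least three DL1 misses (Scenario II) are outstanding; otherwise run
    at full width."""
    if l2_miss_pending or pending_dl1_misses >= 3:
        return FULL_ISSUE_WIDTH // 2
    return FULL_ISSUE_WIDTH
```

For example, during an L2 miss the processor issues at width 2 instead of 4, shrinking the active wakeup/select logic.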
Results

- Small performance loss (~1%)
- 15~30% dynamic and leakage power reduction

[Figures: per-benchmark dynamic and leakage power reduction for the ROB and issue queue (up to ~45%); dynamic and leakage power reduction for the INT and FP register files (up to ~40%); and IPC degradation (below 6%).]
Conclusions

- Introduced the idea of multiple sleep mode design
- Applied multiple sleep modes to on-chip SRAMs
  - Find periods of low activity for state transitions
- Introduced the idea of resource adaptation
- Applied resource adaptation to on-chip SRAMs+CAMs
  - Find periods of low activity for state transitions
- Similar adaptive techniques can be applied to other energy-hungry resources in the processor
  - Multiple sleep mode functional units