A Centralized Cache Miss Driven Technique to Improve Processor Power Dissipation
Houman Homayoun, Avesta Makhzan, Jean-Luc Gaudiot, Alex Veidenbaum
University of California, Irvine
[email protected]
IC-SAMOS 2008

Outline
- Introduction: why are the IQ, ROB and RF major power dissipators?
- A study of processor resource utilization during the service of an L2 miss or multiple L1 misses
- An architectural approach: dynamically adjusting the size of resources during cache miss periods to conserve power
- Hardware modifications + circuit assists to implement the approach
- Experimental results
- Conclusions

Superscalar Architecture
[Figure: superscalar pipeline — Fetch, Decode, Rename, Dispatch, Issue, Execute, Write-Back — with the ROB, reservation stations, instruction queue, load/store queue, logical and physical register files, and functional units]

Instruction Queue
The instruction queue is a CAM-like structure which holds instructions until they can be issued:
- Set entries for newly dispatched instructions
- Read entries to issue instructions to the functional units
- Wake up instructions waiting in the IQ once a result is ready
- Select instructions for issue when the number of ready instructions exceeds the processor issue limit (the issue width)
Main complexity: the wakeup logic.

Circuit Implementation of the Instruction Queue
[Figure: matchline circuit — taglines tag00-tag03 and tagIW0-tagIW3, one-bit comparators per cell, pre-charge transistors, and matchlines 1-4 ORed into the ready bit]
- At each cycle, the matchlines are pre-charged high to allow the individual bits of an instruction tag to be compared with the results broadcast on the taglines.
- Upon a mismatch, the corresponding matchline is discharged; otherwise the matchline stays at Vdd, which indicates a tag match.
- Since up to 4 instructions are broadcast on the taglines each cycle, four sets of one-bit comparators are needed for each one-bit cell.
- All four matchlines must be ORed together to detect a match on any of the broadcast tags.
- The result of the OR sets the ready bit of the instruction's source operand.
There is no need to always have such an aggressive wakeup/issue width!

Instruction Queue Matchline Power Dissipation
- Matchline discharge is the dominant activity, responsible for more than 58% of the energy consumed in the instruction queue.
- Because a matchline runs across the entire width of the instruction queue, it has a large wire capacitance; the diffusion capacitance of the one-bit comparators makes the equivalent matchline capacitance even larger.
- Pre-charging and discharging this large capacitance is responsible for the majority of the power in the instruction queue.
- A broadcast tag has, on average, one dependent instruction in the instruction queue; discharging all the other matchlines causes significant power dissipation.

ROB and Register File
The ROB and the register file are multi-ported SRAM structures with several functions:
- Setting entries for up to IW instructions in each cycle,
- Releasing up to IW entries per cycle during the commit stage, and
- Flushing entries during branch recovery.
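As an illustration, the wakeup behavior described above can be sketched in a few lines of Python. This is a simplified model, not the authors' implementation — entry counts, tag values and class names are hypothetical — and it is instrumented to count matchline discharges, which shows why almost every pre-charged matchline discharges each cycle when a broadcast tag wakes up only about one dependent:

```python
# Hypothetical sketch of the CAM-style wakeup logic (our naming, not the
# authors' RTL). Each IQ entry compares its source-operand tags against up
# to IW result tags broadcast per cycle; a match keeps the matchline at
# Vdd and sets the ready bit, a mismatch discharges the matchline.

ISSUE_WIDTH = 4  # up to 4 tags broadcast per cycle, as in the talk

class IQEntry:
    def __init__(self, src_tags):
        self.src_tags = list(src_tags)        # source-operand register tags
        self.ready = [False] * len(src_tags)  # set by the matchline OR

class InstructionQueue:
    def __init__(self):
        self.entries = []
        self.discharges = 0  # matchlines that mismatched and discharged

    def wakeup(self, broadcast_tags):
        assert len(broadcast_tags) <= ISSUE_WIDTH
        for e in self.entries:
            for i, tag in enumerate(e.src_tags):
                for b in broadcast_tags:  # one matchline per broadcast port
                    if tag == b:
                        e.ready[i] = True     # match: matchline stays at Vdd
                    else:
                        self.discharges += 1  # mismatch: matchline discharges
```

With one broadcast tag and a single dependent in a four-entry queue, seven of the eight pre-charged matchlines discharge — the energy cost the resizing technique targets.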
Circuit-Level Implementation of an SRAM ROB and Register File
[Figure: SRAM column — bitlines, address decoder and wordline drivers, local wordlines, memory cells, sense amps and output drivers]
- The majority of the power (both leakage and dynamic) is dissipated in the bitlines and memory cells.
- Bitline leakage accumulates with the memory cell leakage, which flows through two off pass transistors.
- Bitline dynamic power is determined by the equivalent bitline capacitance: N * (diffusion capacitance of the pass transistors) + wire capacitance (usually ~10% of the total diffusion capacitance), where N is the total number of rows.
[Figure: dynamic and leakage power breakdown across bitlines/sense amps, wordlines, decode logic, and input/output drivers]
The bitline is the major power dissipator: 58% of the dynamic power and 63% of the leakage power.

System Description
- L1 I-cache: 128KB, 64 bytes/line, 2 cycles
- L1 D-cache: 128KB, 64 bytes/line, 2 cycles, 2 R/W ports
- L2 cache: 4MB, 8-way, 64 bytes/line, 20 cycles
- Issue: 4-way out-of-order
- Branch predictor: 64K-entry g-share, 4K-entry BTB
- Reorder buffer: 96 entries
- Instruction queue: 64 entries (32 INT and 32 FP)
- Register file: 128 integer and 128 floating-point
- Load/store queue: 32-entry load and 32-entry store
- Arithmetic units: 4 integer, 4 floating-point
- Complex units: 2 INT, 2 FP multiply/divide
- Pipeline: 15 cycles (some stages are multi-cycle)

Simulation Environment
- The processor clock frequency is 2GHz.
- SPEC2K benchmarks were compiled with the Compaq compiler for the Alpha 21264 processor using the -O4 flag, and executed with the reference data sets.
- The architecture was simulated using an extensively modified version of SimpleScalar 4.0 (sim-mase).
- The benchmarks were fast-forwarded for 2 billion instructions, then fully simulated for 2 billion instructions.
- A modified version of CACTI 4 was used to estimate the power of the ROB and the register files in 65nm technology.
- The power in the
instruction queue was evaluated using Spice and the TSMC 65nm technology, with Vdd at 1.08V.

Architectural Motivation
- A load miss in the L1/L2 caches takes a long time to service.
- Dependent instructions cannot issue; after a number of cycles the instruction window (ROB, instruction queue, store queue, register files) fills up, which prevents further instructions from entering it.
- The processor issue stalls and performance is lost. At the same time, energy is lost as well!
- This is an opportunity to save energy.
- Scenario I: the L2 cache miss period.
- Scenario II: three or more pending DL1 cache misses.

How the Architecture Can Help Reduce Power in the ROB, Register File and Instruction Queue
[Figure: issue rate decrease per SPEC2K benchmark (INT and FP) during Scenarios I and II]
- Scenario I: the issue rate drops by more than 80%.
- Scenario II: the issue rate drops by 22% for integer benchmarks and 32.6% for floating-point benchmarks.
- A significant issue width decrease!
- ROB occupancy grows significantly during Scenarios I and II for integer benchmarks: by 98% and 61% on average.
- The increase in ROB occupancy for floating-point benchmarks is smaller: 30% and 25% on average for Scenarios I and II.
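The two miss windows above can be captured by a small piece of bookkeeping logic. The sketch below is hypothetical (the naming and the priority of Scenario I over Scenario II when both hold are our assumptions, not stated in the talk):

```python
# Hypothetical detector for the two resizing windows (our naming):
#   Scenario I  = an L2 cache miss is being serviced
#   Scenario II = three or more DL1 misses are pending

class MissTracker:
    DL1_THRESHOLD = 3  # pending DL1 misses that trigger Scenario II

    def __init__(self):
        self.l2_pending = 0   # outstanding L2 misses
        self.dl1_pending = 0  # outstanding DL1 misses

    def scenario(self):
        # Assumption: an L2 miss dominates, since its service time is far
        # longer than a DL1 miss, so Scenario I is checked first.
        if self.l2_pending > 0:
            return "I"
        if self.dl1_pending >= self.DL1_THRESHOLD:
            return "II"
        return None  # no resizing window active
```

In a simulator such counters would be incremented on miss and decremented on fill; the resizing signals described next would be asserted whenever `scenario()` is not `None`.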
ROB occupancy increase (%):

Benchmark    Scenario I  Scenario II    Benchmark    Scenario I  Scenario II
bzip2           165.0        88.6       applu           13.8        -4.9
crafty          179.6        63.6       apsi            46.6        18.2
gap               6.6        61.7       art             31.7        56.9
gcc              97.7        43.9       equake          49.8        38.1
gzip            152.9        41.0       facerec         87.9        14.1
mcf              42.2        40.6       galgel          30.9        34.4
parser           31.3       102.3       lucas           -0.7        54.0
twolf            81.8        58.8       mgrid            8.8         5.6
vortex          118.7        57.8       swim            -4.3        11.4
vpr              96.6        55.7       wupwise         40.2        24.4
INT average      98.2        61.4       FP average      30.5        25.2

How the Architecture Can Help Reduce Power in the ROB, Register File and Instruction Queue

Register file occupancy (%) during each scenario and outside it (non-):

             --------- Scenario I ---------   --------- Scenario II --------
Benchmark     IRF   non-IRF    FRF   non-FRF   IRF   non-IRF    FRF   non-FRF
bzip2        74.4     28.8     0.0     0.0    56.6     30.7     0.0     0.0
crafty       83.4     31.9     0.1     0.0    51.4     32.2     0.0     0.0
gap          46.2     41.1     0.1     0.7    65.8     42.9     0.6     0.5
gcc          46.3     21.2     0.2     0.1    28.7     24.0     0.0     0.1
gzip         45.1     27.2     0.0     0.0    39.8     27.2     0.0     0.0
mcf          40.8     29.3     1.0     1.1    46.8     36.4     3.2     0.1
parser       37.4     29.8     0.0     0.0    57.0     29.8     0.1     0.0
twolf        58.7     32.3     2.6     2.1    46.0     29.8     2.5     2.0
vortex       70.9     31.1     0.3     0.2    52.4     35.0     0.2     0.2
vpr          63.9     29.0     7.8     8.6    66.4     41.0     8.7     8.3
INT average  55.3     29.2     1.1     1.2    50.3     32.0     1.4     1.0
applu         6.0      5.6    76.6    64.8     1.7      6.2    77.3    73.7
apsi         16.1     18.3    65.7    37.6    15.8     17.9    58.8    43.6
art          35.4     25.0    36.2    30.7    23.0     29.0    42.9     6.3
equake       34.2     27.4    16.1     7.1    32.7     29.4    21.0     9.6
facerec      52.6     22.5    50.0    28.9    30.3     38.4    48.1    35.0
galgel       50.4     27.4    41.8    48.7    32.1     26.0    61.0    44.2
lucas        21.7     23.8    47.7    44.0    41.7     22.1    29.7    47.0
mgrid         5.9      6.2    90.0    80.7     1.9      6.4    96.7    87.2
swim         23.3     27.8    77.1    78.1    29.7     23.1    87.1    76.2
wupwise      26.3     28.8    53.5    28.7    40.5     26.9    38.0    42.2
FP average   26.6     20.9    56.5    44.7    24.0     22.1    56.2    46.0

- IRF occupancy always grows during both scenarios for the integer benchmarks.
- A similar case holds for the FRF when running floating-point benchmarks, but only during Scenario II.

Proposed Architectural Approach
Adaptive resource resizing during cache miss periods:
- Reduce the issue and the wakeup width of the processor during L2 miss service time.
- Increase the size of the ROB during L2 miss service time or when at least three DL1 misses are pending.
- Reduce the IRF size when running floating-point benchmarks; similarly, reduce the FRF size when running integer benchmarks.
- The same algorithm applied to the ROB is applied to the IRF when running integer benchmarks and to the FRF when running floating-point benchmarks.
- Simple resizing scheme: reduce to half size. Not necessarily optimal for individual units, but simple to implement.

Reducing Issue/Wakeup Width
[Figure: (a) baseline and (b) modified matchline pre-charge, with select tagline drivers and an SLP (sleep) signal gating the wordline pre-charge]
- Avoid pre-charging half of the matchlines during L2 cache miss service time.
- Worst-case scenario: more than half of the taglines broadcast tags during the L2 miss period while only half of the matchlines are active; a small 8-entry auxiliary broadcast buffer handles this case.

Reducing ROB and Register File Size
- Use the divided bitline technique, proposed for SRAM memory design, to reduce the bitline capacitance and hence its dynamic power.
[Figure: conventional vs divided bitline — a segment-select transistor connects each bitline segment to the main bitline]
- Bitline capacitance = N * (diffusion capacitance of the pass transistors) + wire capacitance.
- Divided bitline capacitance = M * diffusion capacitance + wire capacitance, where M is the number of rows in the active segment.
- Turn off the entire unused partition by applying the gated-Vdd technique to the partition's memory cells and wordline drivers.
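The capacitance expressions above lend themselves to a quick back-of-the-envelope model. The per-row diffusion capacitance below is an illustrative assumption (the talk only fixes the ~10% wire estimate), so the absolute numbers are not the paper's; the point is how the dynamic power ratio scales with segment size:

```python
# Sketch of the divided-bitline capacitance model from the slide:
#   C_full    = N * C_diff + C_wire
#   C_divided = M * C_diff + C_wire   (M rows in the active segment)
# C_DIFF_FF is an assumed per-row pass-transistor diffusion capacitance;
# the wire term is taken as ~10% of the full diffusion capacitance, per
# the estimate earlier in the talk, and does not shrink with the segment.

C_DIFF_FF = 0.4       # assumed diffusion capacitance per row (fF)
WIRE_FRACTION = 0.10  # wire cap ~10% of the total diffusion cap

def bitline_cap_fF(active_rows, total_rows):
    c_wire = WIRE_FRACTION * total_rows * C_DIFF_FF  # spans the full column
    return active_rows * C_DIFF_FF + c_wire

def dynamic_power_ratio(total_rows, segment_rows):
    # Dynamic power scales with switched capacitance (P ~ C * Vdd^2 * f),
    # so the divided/full power ratio is just the capacitance ratio.
    return (bitline_cap_fF(segment_rows, total_rows)
            / bitline_cap_fF(total_rows, total_rows))

# Halving the 96-entry ROB's active rows leaves ~55% of the bitline's
# dynamic power, slightly above one half because the wire capacitance
# is unaffected by segmentation.
ratio = dynamic_power_ratio(96, 48)
```

This is why the measured savings later in the talk sit below an ideal 50%: the wire component of the bitline, and the peripheral circuits, do not scale down with the active partition.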
[Figure: sense-amp output path gated by SS (segment select) and DS (downsizing signal)]

Simulation Results
[Figure: per-benchmark power (dynamic/leakage) reduction for the ROB, the instruction queue, and the INT/FP register files, and IPC degradation, across SPEC2K INT and FP benchmarks]
- Performance loss is 0.9% for integer benchmarks and 2.2% for floating-point benchmarks.
- The average dynamic and leakage power savings are 26% and 30% respectively for the IRF, and 20% and 24% for the FRF.
- 24% dynamic power reduction in the instruction queue for FP benchmarks and 11% for integer benchmarks.
- 19% dynamic power reduction and 23% leakage power savings for the ROB.
Conclusions
Reducing L2 cache leakage power:
- An architectural study during L2 cache miss service time.
- A study of the breakdown of leakage in the L2 cache shows that the peripheral circuits leak considerably.
- An architectural approach for deciding when to turn the L2 cache on/off reduces leakage power while conserving performance: 20+% power savings with ~2% performance degradation.
- Circuit assists, with minimal modifications and transition overhead.
Reducing reorder buffer, instruction queue and register file power:
- A study of processor resource utilization during the service of an L2 miss or multiple L1 misses.
- An architectural approach that dynamically adjusts the size of resources during cache miss periods to conserve power.
- Hardware modifications + circuit assists to implement the approach.
- Applying similar adaptive techniques to other energy-hungry resources in the processor.