Download slides - Duke People

CHAPTER 4: Optimizing Capacitance and Switching Activity to Reduce Dynamic Power
SECTIONS 1-7
By Astha Chawla

Introduction
- C (capacitance) and A (switching activity) are intertwined.
- P = V^2 x f x C_effective
- ILP + frequency increase => power problem!
- Factors affecting A:
  - Complexity of the processor
  - Exploitation of parallelism
  - Bit-width of its structures, etc.
  - Optimized at the architectural and microarchitectural level
  - Can be changed by run-time optimizations
- Factors affecting C:
  - Size of a processor's structures
  - Organization to exploit locality
  - Manipulated at the circuit and process technology level
  - Fixed at design time

Excess Switching Activity
- Idle-unit switching activity: triggered by clock transitions in unused portions of hardware.
- Idle-width switching activity: mismatch between the implemented and the actually needed width of processor structures.
- Idle-capacity switching activity: arises when a program does not use the provided hardware architectures in their entirety.
- Parallel switching activity: activity expended in parallel for performance.
- Cacheable switching activity: repetitive switching activity; convert computing activity to cache lookups.
- Speculative switching activity: speculatively executing incorrect instructions is wasted activity.
- Value-dependent switching activity: power consumed depends on the actual data values.

Capacitance
- Does not change dynamically.
- Total capacitance = capacitance of transistors + capacitance of wires.
- Burd and Brodersen: C_L = C_W + C_fixed
- Low-power architectural techniques require partitioning:
  - Wire partitioning
  - Bit-line segmentation

IDLE-UNIT SWITCHING ACTIVITY
- No effect on computation.
- Clock gating:
  - Static logic: to eliminate switching, it is enough to prevent the inputs from changing.
  - Dynamic logic: power can be consumed even if the inputs to the circuit do not change.
- Precomputation: aims to derive a precomputation circuit for a logic block.
  - Multiplexed precomputation architecture: F(x=0), F(x=1)
- Guarded evaluation: aims to shut down part of the original circuit.

Deterministic clock gating
- Gating the clock to processor structures when they are known to be idle.
- Power savings and improved EDP (energy-delay product) without performance loss.
- Clock gating examples:
  - IBM's POWER5
    - Reduction in switching power > 25%
    - Implements fine-grain gating domains
  - Intel's XScale processors
    - Implement three power-saving modes: Idle, Standby, Sleep
    - Cuts down power consumption by 30%

Idle-Width Switching Activity: Core
- Arises from a mismatch between the designed bit-width of a processor and the actual bit-width needed in frequently occurring operations.
- Dynamically detect narrow-width operands (16 bits wide or less).
- Narrow operands are abundant in integer and multimedia applications.
- Approaches:
  - Value gating: disabling the unused width.
    - Disables switching in unused parts of the ALU if both operands are narrow.
    - Significant power savings
  - Operation packing: packs more than one narrow-width operation into the full width of the hardware.
    - Improves performance without significant power overhead.
    - Speculative operation packing
  - Significance compression: compresses non-significant bits.
    - Byte-serial pipeline

Idle-Width Switching Activity: Caches
- Dynamic zero compression: accesses only significant bits.
  - Only compresses zero bytes, marked by a zero indicator bit.
- Frequent value compression: dictionary loaded with the frequent values of a program.
  - Simple
  - Most efficient compression mechanism
- Frequent value cache: a cache line contains compressed and uncompressed words.
  - First array: holds the 8 low-order bits.
  - Second array: holds the remaining 24 high-order bits.

Packing Compressed Cache Lines
- Without packing, space freed by compression remains empty.
- Packing increases cache utilization: indirect power savings.
- Packing techniques:
  - Variable packing: packs a variable number of cache lines into cache frames.
    - Expensive
  - Fixed packing: a preset number of cache lines are packed.
    - Reduced opportunities for compression
- Compression cache:
  - Uses frequent value compression.
  - Does not attempt to pack cache lines into frames.
  - A frame holds either two compressed lines or one uncompressed line.
- Significance compression cache: lines are compressed using sign compression.
- Instruction compression

IDLE-CAPACITY SWITCHING ACTIVITY
- Wasted activity related to out-of-order execution.
- Processor resources are overprovisioned to support high instruction throughput.
- Power inefficiency of out-of-order processors:
  - Energy-per-instruction growth: E_i ~ (IW)^γ

Resource Partitioning
- Cannot afford the latency of very long wires.
- Partitioned by placing buffers.
- Aimed at the size vs. speed trade-off.
- Wire partitioning:
  - Wire delay is proportional to R x C.
  - Breaking a wire into k segments improves delay by k^2.
  - Total energy increases exponentially with k.
  - Replacing buffers with tristate devices.

IDLE-CAPACITY SWITCHING ACTIVITY: INSTRUCTION QUEUE
- Resizable IQ, mix of CAM and SRAM
- Readiness feedback control
  - Adjusts IQ size based on the activity of its entries.
  - The decision-making scheme has a safety mechanism.
- Occupancy feedback control
  - IQ, LSQ, ROB
  - Occupancy of a structure is the appropriate feedback control metric.
- Logical resizing without partitioning
  - IQ organized as a circular FIFO buffer.
  - The size is limited logically by limiting the part that can be allocated to new entries.
- ILP-contribution feedback control
- Instruction queue collapsing

IDLE-CAPACITY SWITCHING ACTIVITY: CORE
- Dynamically changing the width of an 8-issue processor to 6-issue or 4-issue.
- 6-issue processor: half of a cluster is disabled.
- 4-issue processor: one whole cluster is disabled.
- Appropriate functional units are clock gated.
- Decisions are made at the end of the sampling window.

THANK YOU!
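
Supplementary sketch for the idle-width (core) material above: value gating depends on detecting at run time that both operands of an operation are narrow. The Python below is only an illustration under stated assumptions, not the mechanism of any particular processor. The 64-bit datapath, the rule that a word counts as narrow when its upper 48 bits are all zeros or all ones, and the names `is_narrow`, `can_gate_upper_alu`, and `gated_fraction` are all hypothetical; only the 16-bit threshold comes from the slides.

```python
# Sketch of narrow-width operand detection for value gating.
# Assumptions (not from the slides): a 64-bit datapath, and a simplified
# detector that treats a word as narrow when its upper 48 bits are
# all zeros or all ones.

WORD_BITS = 64     # assumed datapath width
NARROW_BITS = 16   # narrow-width threshold quoted in the slides


def is_narrow(value: int) -> bool:
    """Return True if the upper bits of the word are all zeros or all ones."""
    word = value & ((1 << WORD_BITS) - 1)        # two's-complement view of the value
    upper = word >> NARROW_BITS                  # the 48 high-order bits
    all_ones = (1 << (WORD_BITS - NARROW_BITS)) - 1
    return upper == 0 or upper == all_ones


def can_gate_upper_alu(a: int, b: int) -> bool:
    """The upper part of the ALU may be gated only if both operands are narrow."""
    return is_narrow(a) and is_narrow(b)


def gated_fraction(operand_pairs) -> float:
    """Fraction of operations in a trace whose upper ALU bits could be gated."""
    pairs = list(operand_pairs)
    if not pairs:
        return 0.0
    gated = sum(1 for a, b in pairs if can_gate_upper_alu(a, b))
    return gated / len(pairs)


if __name__ == "__main__":
    trace = [(3, 7), (-1, 12), (1 << 20, 5), (70000, -70000)]
    print(gated_fraction(trace))
```

In this made-up trace, the first two operand pairs are narrow and the last two are not, so half of the operations would qualify for gating. A real design would perform the zero/ones detection in hardware on the operand bits rather than in software.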
