
CHAPTER 4
Optimizing Capacitance and Switching Activity to Reduce Dynamic Power
SECTIONS 1-7
By Astha Chawla
Introduction

C and A are intertwined

$P = V^2 \times f \times C_{\text{effective}}$

ILP + Frequency increase => Power problem!!
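
As a rough illustration of the power equation above, the sketch below evaluates it for made-up operating points. It assumes the common decomposition C_effective = A x C (activity factor times physical capacitance), which the slide only hints at by calling C and A intertwined; the function name and all numeric values are hypothetical.

```python
def dynamic_power(voltage, frequency, capacitance, activity):
    """Dynamic power P = V^2 * f * C_effective.

    Assumption (not stated on the slide): C_effective = activity * capacitance.
    """
    c_effective = activity * capacitance
    return voltage ** 2 * frequency * c_effective


# Hypothetical operating point: 1.0 V, 2 GHz, 1 nF switched capacitance.
base = dynamic_power(voltage=1.0, frequency=2e9, capacitance=1e-9, activity=0.5)

# Halving the activity factor halves dynamic power in this model.
half_activity = dynamic_power(voltage=1.0, frequency=2e9, capacitance=1e-9, activity=0.25)

# Lowering the voltage has a quadratic effect.
low_voltage = dynamic_power(voltage=0.5, frequency=2e9, capacitance=1e-9, activity=0.5)
```

Under this model, reducing A or C gives a linear saving, which is why the rest of the chapter targets the individual sources of excess switching.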

Factors affecting A:


Complexity of the processor

Exploitation of parallelism

Bit-width of its structures etc.

Optimized at the architectural and microarchitectural level

Can be changed by run-time optimizations
Factors affecting C:

Size of a processor’s structure

Organization to exploit locality

Manipulated at the circuit and process technology level

Fixed at design time
Excess Switching Activity.

Idle-Unit switching activity:
 Triggered by clock transitions in unused portions of hardware.

Idle-width switching activity:
 Mismatch in the implemented and the actual width of processor structures.

Idle-capacity switching activity:
 When a program does not use the provided hardware architectures in their entirety.

Parallel switching activity:
 Activity expended in parallel for performance

Cacheable switching activity:
 Repetitive switching activity, convert computing activity to cache lookups

Speculative switching activity:
 Speculatively executing incorrect instructions is wasted activity

Value-dependent switching activity:
 Power consumed depends on the actual data values.
Capacitance

Does not change dynamically

Total capacitance = Capacitance of transistors + capacitance of wires.

Burd and Brodersen: $C_L = C_W + C_{\text{fixed}}$

Low power architectural techniques require partitioning:

Wire partitioning

Bit-line segmentation
IDLE-UNIT SWITCHING ACTIVITY.

No effect on computation

Clock gating

Static logic: To eliminate switching, it is enough to prevent the inputs from changing.

Dynamic logic: Power can be consumed even if the inputs to the circuit do not change.


Precomputation: aims to derive a precomputation circuit for a logic block

Multiplexed precomputation architecture: the function is split by Shannon expansion into the cofactors F(x=0) and F(x=1), and x selects between them.
Guarded evaluation: aims to shut down part of the original circuit.
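
The multiplexed precomputation idea can be sketched in software: the logic block is split into the two cofactors F(x=0) and F(x=1), and the input x decides which one is evaluated, so the other can be held idle. The majority function used here is a hypothetical example block, not one taken from the slides, and the evaluation counter merely stands in for switching activity.

```python
from itertools import product


def majority(x, a, b):
    """Example logic block F(x, a, b): true when at least two inputs are true."""
    return (x and a) or (x and b) or (a and b)


def f_x0(a, b):
    """Cofactor F(x=0): with x false, majority reduces to a AND b."""
    return a and b


def f_x1(a, b):
    """Cofactor F(x=1): with x true, majority reduces to a OR b."""
    return a or b


def multiplexed(x, a, b, evaluations):
    """Evaluate only the cofactor selected by x; the other one stays idle."""
    if x:
        evaluations["f_x1"] += 1
        return f_x1(a, b)
    evaluations["f_x0"] += 1
    return f_x0(a, b)


evaluations = {"f_x0": 0, "f_x1": 0}
matches = all(
    majority(x, a, b) == multiplexed(x, a, b, evaluations)
    for x, a, b in product([False, True], repeat=3)
)
```

Across all eight input combinations the multiplexed version agrees with the original block, while each cofactor is exercised for only half of the inputs.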
Deterministic clock gating

Gating the clock to the processor structures when they are known to be idle

Saves power and improves the energy-delay product (EDP) without performance loss.
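
A minimal sketch of why deterministic clock gating saves switching activity: if it is known in advance which cycles a unit is idle, its clock only needs to toggle in the busy cycles. The busy trace below is invented for illustration, and counting clocked cycles is only a proxy for clock power.

```python
def clocked_cycles(busy_trace, gated):
    """Count the cycles in which a unit's clock toggles.

    busy_trace: list of booleans, True when the unit does useful work.
    gated: if True, the clock is suppressed in cycles known to be idle.
    """
    if gated:
        return sum(1 for busy in busy_trace if busy)
    return len(busy_trace)


# Hypothetical 8-cycle trace for one functional unit.
trace = [True, False, False, True, True, False, False, False]

ungated = clocked_cycles(trace, gated=False)
gated = clocked_cycles(trace, gated=True)
saved_fraction = 1 - gated / ungated
```

Because the idle cycles are known rather than predicted, the busy cycles are unaffected, which matches the claim of no performance loss.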

Clock gating examples:


IBM’s Power 5

Reduction in switching power > 25%

Implements fine-grain gating domain
Intel’s Xscale processors

Implements three power-saving modes: Idle, Standby, Sleep

Cuts down power consumption by 30%
Idle-Width switching activity: Core

Arises from a mismatch between the designed bit-width of a processor and the actual bit-width needed in frequently occurring operations

Hardware dynamically detects narrow-width operands (16 bits wide or less).

Narrow-width operands are abundant in integer and multimedia applications

Approaches:



Value gating: disabling the unused width.

Disabling switching in unused parts of ALU if both operands are narrow.

Significant power savings
Operation packing: Packs more than one narrow-width operation in the full width of hardware

Improves performance without significant power overhead.

Speculative operation packing.
Significance compression: Compresses non-significant bits.

Byte serial pipeline.
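
Narrow-width detection and value gating can be sketched as follows. The 16-bit threshold comes from the slide; the 64-bit datapath width, the function names, and the treatment of sign-extended negative values as narrow are assumptions made for this illustration.

```python
WIDTH = 64    # full datapath width in bits (assumed)
NARROW = 16   # narrow-width threshold from the text


def is_narrow(value, narrow_bits=NARROW, width=WIDTH):
    """True if value fits in narrow_bits as a two's-complement number.

    That is the case when every bit from the narrow sign bit upward is
    identical: all zeros (small positive) or all ones (small negative).
    """
    value &= (1 << width) - 1                      # view as a width-bit pattern
    upper = value >> (narrow_bits - 1)             # narrow sign bit and above
    all_ones = (1 << (width - narrow_bits + 1)) - 1
    return upper == 0 or upper == all_ones


def active_alu_bits(a, b):
    """Value gating: only the low NARROW bits switch if both operands are narrow."""
    if is_narrow(a) and is_narrow(b):
        return NARROW
    return WIDTH
```

For example, `active_alu_bits(5, -3)` reports 16 active bits, while `active_alu_bits(5, 1 << 20)` needs the full 64-bit datapath.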
Idle-Width switching activity: Caches

Dynamic zero compression: accesses only significant bits


Only compresses zero bytes, each marked by a zero indicator bit
Frequent value compression: dictionary loaded with the frequent values of a program.

Simple

Most efficient compression mechanism

Frequent value cache: cache line contains compressed and uncompressed words.

First array: holds 8 low-order bits.

Second array: holds remaining 24 high-order bits
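
The dynamic zero compression idea from this slide can be sketched in a few lines: each byte of a word gets a zero indicator bit, and only the nonzero bytes need to be accessed. This is a software model under assumed parameters (32-bit words, byte granularity, little-endian byte order); in hardware the indicator bit suppresses the access to the zero byte rather than packing data.

```python
def dzc_encode(word, num_bytes=4):
    """Return (zero indicator bits, nonzero bytes), lowest byte first."""
    indicators = []
    stored = []
    for i in range(num_bytes):
        byte = (word >> (8 * i)) & 0xFF
        if byte == 0:
            indicators.append(1)      # zero byte: only the indicator is kept
        else:
            indicators.append(0)
            stored.append(byte)       # nonzero byte: must be accessed in full
    return indicators, stored


def dzc_decode(indicators, stored):
    """Rebuild the word from the indicator bits and the nonzero bytes."""
    word = 0
    remaining = iter(stored)
    for i, is_zero in enumerate(indicators):
        if not is_zero:
            word |= next(remaining) << (8 * i)
    return word


indicators, stored = dzc_encode(0x00000012)
```

A small value such as 0x12 touches a single data byte here, so three of the four byte accesses are avoided.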
Packing compressed cache lines

Without packing, the space freed by compression remains empty.

Packing increases cache utilization: indirect power savings.

Packing techniques:

Variable packing: packs a variable number of cache lines into cache frames.



Expensive
Fixed packing: a preset number of cache lines is packed

Reduced opportunities for compression

Compression cache:

Uses frequent value compression

Does not attempt to pack cache lines into frames

Frame holds either two compressed or one uncompressed line.

Significance compression cache: lines are compressed using sign compression
Instruction compression.
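
Frequent value compression, which the compression cache above relies on, can be sketched as a small dictionary lookup: words found in the dictionary are replaced by a short index, and everything else is stored uncompressed. The dictionary size, the tuple encoding, and the sample data are hypothetical choices for this illustration.

```python
from collections import Counter


def build_dictionary(words, size):
    """Pick the most frequent values observed in a profile of the program's data."""
    return [value for value, _ in Counter(words).most_common(size)]


def fvc_encode(word, dictionary):
    """Encode a word as a dictionary index if frequent, else keep it raw."""
    if word in dictionary:
        return ("idx", dictionary.index(word))
    return ("raw", word)


def fvc_decode(encoded, dictionary):
    """Invert fvc_encode."""
    kind, payload = encoded
    return dictionary[payload] if kind == "idx" else payload


# Hypothetical profile: 0xFFFFFFFF and 0 dominate.
profile = [0, 0, 0, 1, 1, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 42]
dictionary = build_dictionary(profile, size=2)
encoded = [fvc_encode(w, dictionary) for w in profile]
decoded = [fvc_decode(e, dictionary) for e in encoded]
```

With a two-entry dictionary an index needs a single bit plus a flag, so seven of the ten sample words shrink to a couple of bits each.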
IDLE-CAPACITY SWITCHING ACTIVITY

Wasted activity related to out-of-order execution

Processor resources are overprovisioned to support high instruction throughput.

Power inefficiency of out-of-order processors:

Energy-per-instruction growth: $E_i \sim (IW)^{\gamma}$, where IW is the issue width.
Resource partitioning.

Cannot afford latency of very long wires.

Partitioned by placing buffers

Aimed at size vs speed trade-off.

Wire partitioning

Wire delay proportional to R x C .

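Breaking the wire into k segments improves the delay of each segment by $k^2$ (the k segments in series improve the total wire delay by about k, before buffer delay is added)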

Total energy increases with k, since every added buffer contributes its own switched capacitance.

Replacing buffers with tristate devices.
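
The delay argument for wire partitioning can be checked with a small model. It assumes resistance and capacitance both grow linearly with wire length, so an unbroken wire's delay grows with the square of its length; the unit constants and the optional per-buffer delay are hypothetical.

```python
def wire_delay(length, r_per_unit=1.0, c_per_unit=1.0):
    """RC delay of an unbroken wire: (r * L) * (c * L), quadratic in length."""
    return (r_per_unit * length) * (c_per_unit * length)


def segmented_delay(length, k, buffer_delay=0.0):
    """Delay of the same wire broken into k equal, buffered segments in series."""
    per_segment = wire_delay(length / k)
    return k * (per_segment + buffer_delay)


full = wire_delay(8)                        # unbroken wire of length 8
one_segment = wire_delay(8 / 4)             # one of k = 4 segments
split = segmented_delay(8, k=4)             # all four segments, ideal buffers
with_buffers = segmented_delay(8, k=4, buffer_delay=1.0)
```

In this model each segment's delay is 1/k^2 of the unbroken wire, the series of k segments is 1/k of it, and real buffer delay eats into that gain, which is the size versus speed trade-off the slide mentions.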
IDLE-CAPACITY SWITCHING ACTIVITY: INSTRUCTION QUEUE.

Resizable IQ, mix of CAM and SRAM

Readiness feedback control


Adjust IQ size based on the activity of its entries.

Decision making scheme has a safety mechanism.
Occupancy feedback control

Applies to the instruction queue (IQ), load/store queue (LSQ), and reorder buffer (ROB).

Occupancy of a structure is the appropriate feedback control metric.

Logical resizing without partitioning

IQ organized as a circular FIFO buffer.

Limiting the size logically by limiting the part that can be allocated to new entries

ILP-contribution feedback control

Instruction queue collapsing
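
Occupancy feedback control can be sketched as a periodic resizing decision driven by the average occupancy measured over a sampling interval. The partition size, the size limits, and both thresholds below are hypothetical; the slides do not give concrete values.

```python
def resize_queue(current_size, avg_occupancy, partition=8, min_size=8, max_size=64):
    """Return the queue size to use for the next interval.

    Shrink by one partition when the average occupancy would still fit in
    the smaller queue; grow by one partition when the queue is nearly full.
    """
    smaller = current_size - partition
    if avg_occupancy <= smaller and smaller >= min_size:
        return smaller
    if avg_occupancy >= current_size - 1 and current_size + partition <= max_size:
        return current_size + partition
    return current_size


# Hypothetical samples: a lightly used queue shrinks, a saturated one grows.
shrunk = resize_queue(64, avg_occupancy=20.0)
grown = resize_queue(32, avg_occupancy=31.5)
unchanged = resize_queue(32, avg_occupancy=28.0)
```

The same decision function could in principle be applied to the IQ, LSQ, or ROB, since it only looks at occupancy and not at what the entries contain.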
IDLE-CAPACITY SWITCHING ACTIVITY: CORE

Dynamically changing the width of an 8-issue processor to 6 or 4-issue.

6-issue processor: half of a cluster is disabled

4-issue processor: one whole cluster is disabled

Appropriate functional units are clock gated.

Decisions made at the end of the sampling window
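
The end-of-window decision can be sketched as a simple threshold test on the instructions per cycle (IPC) observed during the sampling window. The IPC thresholds and the use of IPC as the sole trigger are assumptions for illustration; the slides state only that the width changes between 8, 6, and 4 and which parts are disabled.

```python
# What is disabled at each issue width, as described on the slide.
DISABLED_AT_WIDTH = {
    8: "nothing disabled",
    6: "half of a cluster disabled",
    4: "one whole cluster disabled",
}


def choose_issue_width(issued_instructions, window_cycles, low_ipc=3.0, high_ipc=5.0):
    """Pick the issue width for the next window from the measured IPC.

    low_ipc and high_ipc are hypothetical thresholds.
    """
    ipc = issued_instructions / window_cycles
    if ipc < low_ipc:
        return 4
    if ipc < high_ipc:
        return 6
    return 8


# Hypothetical 256-cycle sampling windows.
low = choose_issue_width(500, 256)
medium = choose_issue_width(1100, 256)
high = choose_issue_width(1600, 256)
```

A window with little available parallelism drops to 4-issue, where `DISABLED_AT_WIDTH[4]` indicates that a whole cluster can be clock gated.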
THANK YOU!