© 2004, Kevin Skadron and Jose Gonzalez
Power-Aware Design for
High-Performance Processors
A Tutorial at HPCA-2004
Kevin Skadron
Jose Gonzalez
University of Virginia
Intel Labs Barcelona
Roadmap

Introduction & Trends
Dynamic Power Dissipation
  Sources, modeling, reduction techniques
Static Power Dissipation
  Sources, modeling, reduction techniques
Summary

Introduction

Power: work done per unit time (watts)
Energy: total work (joules)

Why is power a concern in current processors?
  Increased market demand for consumer electronics powered by batteries; battery life is a selling point
  Electricity and cooling costs for large data centers are becoming substantial
    • 5-25% of data center income (cf. Rajamony & Bianchini tutorial, ICS’02)
  Government energy-efficiency requirements
    • (e.g., Energy Star in the US)
  Electricity costs for large ISPs are becoming substantial
  Packaging and cooling costs (due to the increase in power density) are becoming prohibitive
  Power dissipation may reach technology limits
  Current delivery is becoming expensive

Metrics

Some different power metrics & fallacies:
  Reducing power does not always save energy
    • Energy = $\int P\,dt$
    • If you reduce power but increase execution time, energy may go up
  Also note that reducing power does not always reduce temperature
  Sustained power density limits thermal design/packaging
      – approx. same as thermal design power
      – note that on-chip temperatures and total heat production are somewhat different concerns

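The first fallacy can be checked with a few lines of arithmetic. The sketch below uses made-up numbers (the two `Config` entries are illustrative assumptions, not measurements) and treats power as constant over the run, so the integral reduces to P × t.

```python
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    power_w: float      # average power in watts (hypothetical value)
    exec_time_s: float  # execution time in seconds (hypothetical value)

def energy_joules(cfg: Config) -> float:
    # With constant power, Energy = integral of P dt = P * t.
    return cfg.power_w * cfg.exec_time_s

baseline = Config("baseline", power_w=50.0, exec_time_s=10.0)
low_power = Config("low-power", power_w=40.0, exec_time_s=14.0)

for cfg in (baseline, low_power):
    print(f"{cfg.name}: {cfg.power_w} W for {cfg.exec_time_s} s = {energy_joules(cfg)} J")
```

In this example the low-power configuration draws 20% less power but runs 40% longer, so it ends up consuming more energy.
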
Metrics

Power
  Average power
  Power density map
Energy
  Energy (MIPS/W)
  Energy-Delay product (MIPS²/W)
  Energy-Delay² product (MIPS³/W) – voltage independent! (Zyuban, GVLSI’02)
Temperature
  Average temperature
  Peak temperature
  Temperature map
    • Does not necessarily match power density map
No good figures of merit for trading off thermal efficiency against performance, area, or energy efficiency

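A rough way to see why the energy-delay² product is called voltage independent is a first-order model in which energy per task scales as Vdd² and delay scales as 1/Vdd. That model is an assumption made here for illustration (it ignores the threshold voltage and leakage); the function name and normalization are hypothetical.

```python
def metrics(vdd: float, e0: float = 1.0, d0: float = 1.0):
    """Return (energy, energy-delay, energy-delay^2) at supply voltage vdd.

    Assumed first-order model (not from the slides): energy per task
    scales as Vdd^2 and delay scales as 1/Vdd, normalized at Vdd = 1.0.
    """
    energy = e0 * vdd ** 2
    delay = d0 / vdd
    return energy, energy * delay, energy * delay ** 2

for vdd in (0.8, 1.0, 1.2):
    e, ed, ed2 = metrics(vdd)
    print(f"Vdd={vdd:.1f}  E={e:.3f}  ED={ed:.3f}  ED^2={ed2:.3f}")
```

Under these assumptions energy and energy-delay both change with the supply voltage, while energy-delay² stays constant, which is what makes it useful for comparing designs that run at different voltages.
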
Power Dissipation

Dynamic power dissipation
  Due to switching activity
Static power dissipation
  Due to leakage current – major paths are:
    • Subthreshold leakage
      – Exponentially dependent on Vdd, Vth, Temp
    • Gate leakage
      – Exponentially dependent on Vdd, Tox

Power Dissipation

Total power actually consists of
  Switching power
  Short-circuit power
  Leakage power

Big Picture - Trends

Data on current power dissipation for various chips
Distribution of power within a typical processor
Scaling trends in power dissipation
Trends in leakage power
Trends in battery life

Power Dissipation

Processor          Clock rate
Alpha 21364        1.15 GHz
AMD Opteron        2.2 GHz
HP PA8700          870 MHz
IBM Power 4        1.7 GHz
Intel Itanium 2    1.5 GHz
Intel Xeon         3.2 GHz
MIPS R14000        600 MHz

Power (Max): 110W (Alpha 21364); remaining values, order not recoverable: 86 W, 75W, 130W, 86W, 16W, 100W
Source: Microprocessor Report

Power Dissipation Breakdown

Alpha 21264
[Figure: power breakdown by unit: global clock network, instruction issue units, caches, FP execution units, int. execution units, mem. management unit, I/O, miscellaneous]
Source: Gowan et al., “Power Considerations in the Design of the Alpha 21264 Microprocessor”, DAC 1998

Effects of Technology Scaling on Power Dissipation

Feature size is scaling down
  30%
Frequency is increasing
  ~2x (Ideal scaling: gate delay decreases by 30%)
Vdd is not scaled down at the same rate as feature size
  0-10% (Ideal scaling: 30%)
Active capacitance increases
  at least 30% (Ideal scaling: decreases by 30%)
Area increases due to microarchitecture improvements
  25% (Ideal scaling: decreases by 50%)
Ideal scaling: $P \propto C V^2 f$ → $0.7^2$ reduction $\approx 0.5$
Observed scaling → 2 – 2.5x increase
Power density becomes a problem!
  Especially since the power density is non-uniform

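The ideal and observed figures can be reproduced with the relation P ~ C·V²·f. The sketch below is a rough check, assuming that under ideal scaling capacitance and Vdd each shrink by 30% while frequency rises by 1/0.7, and that the observed per-generation trends are capacitance up 30%, Vdd down 0-10%, and frequency up 2x. The pairing of those percentages with each quantity is my reading of the slide, and the function name is hypothetical.

```python
def power_ratio(cap: float, vdd: float, freq: float) -> float:
    """Relative change in P ~ C * V^2 * f for one technology generation."""
    return cap * vdd ** 2 * freq

# Ideal scaling: C and Vdd shrink by 30%, frequency rises by 1/0.7.
ideal = power_ratio(cap=0.7, vdd=0.7, freq=1 / 0.7)

# Observed trends (assumed mapping): C up 30%, Vdd down 0-10%, f ~2x.
observed_low = power_ratio(cap=1.3, vdd=0.9, freq=2.0)
observed_high = power_ratio(cap=1.3, vdd=1.0, freq=2.0)

print(f"ideal: {ideal:.2f}x")
print(f"observed: {observed_low:.2f}x to {observed_high:.2f}x")
```

The ideal case gives about 0.7² ≈ 0.5, and the observed case lands a little above 2x, in line with the 2 – 2.5x increase quoted on the slide.
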
Power Evolution

[Figure: max power (Watts, log scale from 1 to 100) versus process generation (1.5µm, 1µm, 0.8µm, 0.6µm, 0.35µm, 0.25µm, 0.18µm, 0.13µm) for the i386, i486, Pentium®, Pentium® w/MMX tech., Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4. Source: Intel]

Trends in Power Density

[Figure: power density (Watts/cm², log scale from 1 to 1000) versus process generation (1.5µm down to 0.07µm) for the i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and Pentium® 4, with reference levels for a hot plate, a nuclear reactor, and a rocket nozzle]
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp. Micro32 conference keynote, 1999.

ITRS Projections

Year                          2003   2006   2010   2013   2016
Tech node (nm)                 100     70     45     32     22
Vdd (high perf) (V)            1.0    0.9    0.6    0.5    0.4
Vdd (low power) (V)            1.1    1.0    0.8    0.7    0.6
Frequency (high perf) (GHz)    3.1    5.6   11.5   19.3   28.8
Max power (W)
  High-perf w/ heatsink        160    180    218    251    288
  Cost-performance              85     98    120    138    158
  Hand-held                    3.2    3.5    3.0    3.0    3.0

ITRS 2001
  These are targets
  Based on historical trends, the high-performance power targets seem optimistic
  Intel papers suggest that in the 45-75W range, cooling costs $1/W; but then the rate of increase goes up: $2, $3/W, maybe more! (Borkar, IEEE Micro ’99; Gunther et al, ITJ ’01)


Leakage Power

The fraction of leakage power is increasing exponentially with each generation
Also exponentially dependent on temperature
[Figure: static power / dynamic power (percentage, 0 to 70) versus temperature (298 K to 373 K) for 180nm, 130nm, 100nm, 90nm, 80nm, and 70nm; the ratio increases across generations. Source: Skadron et al, University of Virginia]

Trends in Battery Technology

Battery lifetime is increasing perhaps 8-10%/yr.
(Powers, Proc. of IEEE 1995)

Not keeping up with rate of growth in energy
consumption
Source: Rabaey 1995, cited in Irwin et al, “Low Power Design Methodologies, Hardware and Software Issues”,
tutorial at PACT 2000

Roadmap

Introduction & Trends
Dynamic Power Dissipation
  Sources, modeling, reduction techniques
Static Power Dissipation
  Sources, modeling, reduction techniques
Summary

Dynamic Power Dissipation

Roadmap
  Sources of dynamic power dissipation
  Modeling dynamic power
  Circuit- and architecture-domain techniques to reduce power

Dynamic Power Consumption

Power dissipated due to switching activity
  A capacitance is charged and discharged
    • 0→1 transition (charging from Vdd): $E_c = \frac{1}{2} C_L V^2$
    • 1→0 transition (discharging): $E_d = \frac{1}{2} C_L V^2$
  Charge/discharge at the frequency f: $P = C_L V^2 f$
  Note that the energy consumed from the battery is $C_L V^2$ and is drawn upon charging

Dynamic Power Dissipation

Equation: $P = a \cdot C_L \cdot V_{dd}^2 \cdot f$
  a: Activity factor
    • Depends on the processor architecture
  $C_L$: Capacitance of the circuit
    • Depends on the design style, number of transistors, transistor sizing, etc.
  $V_{dd}$: Operating voltage
  f: Frequency

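As a quick numerical illustration of the equation, the sketch below evaluates it at a hypothetical operating point. The capacitance, voltage, frequency, and activity factor are assumed values chosen only to give round numbers.

```python
def dynamic_power(a: float, c_load: float, vdd: float, freq: float) -> float:
    """Dynamic power P = a * C_L * Vdd^2 * f, in watts."""
    return a * c_load * vdd ** 2 * freq

# Hypothetical operating point: 20 nF of total switched capacitance,
# 1.2 V supply, 2 GHz clock, activity factor 0.5.
base = dynamic_power(a=0.5, c_load=20e-9, vdd=1.2, freq=2e9)
# Lowering Vdd by 10% at the same frequency cuts power by 19%.
scaled = dynamic_power(a=0.5, c_load=20e-9, vdd=1.08, freq=2e9)
print(f"{base:.1f} W -> {scaled:.1f} W ({100 * (1 - scaled / base):.0f}% lower)")
```

Because the voltage term is squared, a 10% reduction in Vdd alone removes 19% of the dynamic power, which is why voltage is the most attractive knob in the equation.
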
Dynamic Power Modelling

$P = a \cdot C_L \cdot V^2 \cdot f$
Information needed
  Activity counters in each unit
  Energy dissipated per access
[Figure: the configuration feeds a performance model, which produces performance metrics and activity; the activity feeds a power model, which produces power metrics]
For precision, “a” (# of signal transitions) should be measured or at least estimated with a probabilistic model
More commonly, a = 0.5 is assumed

Dynamic Power Modelling

Activity counters
  Performance model is used
  Counters for: cache access, FU usage, register file, ...
Energy per access
  Analytically: calculating capacitances as a function of size, ports, etc.
    Example: cache access: decoder, precharge transistors, bitline, cell access, wordline, sense amplifiers, ...
    • Wattch (Brooks et al, ISCA 2000)
    • Cacti
  Empirically: using low-level designs and applying “virus” tests
    • Virus test: microbenchmark that stresses a particular unit
    • ALPS (Gunther et al, ITJ, 2001)
Circuit-extracted model
  PowerTimer – IBM Power4 (Brooks et al, PACS’00)
  AccuPower – parameterized, based on SPICE measurements of actual layouts (SUNY Binghamton, Ponomarev et al, DATE’02)
  PowerAnalyzer – StrongARM (Michigan, assoc. w/ SimpleScalar)
Many of these ignore the actual number of signal transitions

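The counter-based approach can be sketched in a few lines: multiply each unit's access count by an energy per access and divide by execution time. The unit names and per-access energies below are hypothetical placeholders, not values taken from Wattch, Cacti, or any of the tools listed.

```python
# Hypothetical per-access energies in nanojoules; a real model would derive
# these analytically (e.g., from capacitance estimates) or empirically.
ENERGY_PER_ACCESS_NJ = {"icache": 0.9, "dcache": 1.1, "regfile": 0.4, "int_alu": 0.3}

def average_power_w(activity: dict, cycles: int, freq_hz: float) -> float:
    """Average dynamic power from activity counters.

    activity maps unit name -> number of accesses counted by the
    performance model over `cycles` cycles at clock frequency `freq_hz`.
    """
    energy_j = sum(ENERGY_PER_ACCESS_NJ[u] * 1e-9 * n for u, n in activity.items())
    exec_time_s = cycles / freq_hz
    return energy_j / exec_time_s

counters = {"icache": 1_000_000, "dcache": 400_000, "regfile": 2_500_000, "int_alu": 800_000}
print(f"{average_power_w(counters, cycles=1_000_000, freq_hz=1e9):.2f} W")
```

Like many of the models above, this sketch charges a fixed energy per access and so ignores the actual number of signal transitions.
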
Circuit-Level Techniques

Transistor sizing
Signal and clock gating
Circuit restructuring
Low power caches
Low power register files
Issue queue
These typically reduce the capacitance being switched

Transistor Sizing

Transistor sizing plays an important role in reducing power
[Figure: chain of buffer stages with capacitances $C_0, C_1, \ldots, C_{N-1}, C_N$ and stage ratio $K = C_i / C_{i-1}$]
Delay ~ a (K / ln K)
Power ~ K / (K-1)
An optimum K for both power and delay must be pursued

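The two proportionalities can be explored with a textbook first-order model of a tapered buffer chain. The model is an assumption made for illustration: stage delay is taken as proportional to K, and the chain must drive a fixed load ratio C_N / C_0 (the default of 1000 is arbitrary).

```python
import math

def chain_metrics(k: float, c_ratio: float = 1000.0):
    """Relative delay and switched capacitance of a tapered buffer chain.

    Assumptions (a textbook first-order model, not taken from the slides):
    each stage is K times larger than the previous one, stage delay is
    proportional to K, and the chain must drive a load c_ratio = C_N / C_0.
    """
    n_stages = math.log(c_ratio) / math.log(k)   # stages needed to reach C_N
    delay = n_stages * k                          # ~ K / ln K
    # Sum of C_0..C_N relative to C_N (finite geometric series) ~ K / (K - 1).
    switched_cap = (1 - k ** -(n_stages + 1)) / (1 - 1 / k)
    return delay, switched_cap

for k in (2.0, math.e, 4.0, 8.0):
    d, c = chain_metrics(k)
    print(f"K={k:.2f}  delay={d:.1f}  switched C / C_N={c:.3f}")
```

In this model delay is smallest near K = e, while switched capacitance keeps falling as K grows, so a K somewhat above the delay optimum trades a small delay penalty for lower power.
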
Signal Gating

“Techniques to mask unwanted switching activities from propagating forward, causing unnecessary power dissipation”

Implementation
  Simple gate
  Tristate buffer
  ...
Control signal needed
  Generation requires additional logic
Identification of signals to be gated
  Clock
  Address bus
Also helps to prevent power dissipation due to glitches
[Figure: a gate with inputs “signal” and “ctrl” producing “Output”]

Clock Gating

“Disabling a functional block when it is not required for an extended period”

Implementation
  Simple gate that replaces one buffer in the clock tree
  Delay is generally not a concern
Decision
  Architectural level
[Figure: “signal” and “ctrl” feeding a gate that drives a functional unit]

Circuit Restructuring

Pipeline (can reduce frequency)
Parallelize (can reduce frequency)
Reorder inputs so that the most active input is closest to the output (reduces switched capacitance)
Restructure gates (equivalent functions are not equivalent in switched capacitance)
Energy-efficient flip-flops and latches

Cache Design

[Figure: cache array with R rows and C cols, row decoder, wordlines, bitlines, column decoder, and sense amps]
$C_{access} = R \cdot C \cdot C_{cell}$
Reducing power
  Switched capacitance
  Voltage swing
  Activity factor
  Frequency
[Figure: read and write breakdown (axis from 0 to 80) by component: decoder, W lines, DBLSA, TBLSA, I/O buses, other]
TBLSA: Tag bitlines and sense amp.
DBLSA: Data bitlines and sense amp.
Cache parameters: 16 KB cache, 0.25 μm
Villa et al, MICRO 2000

Cache Design

Banked organization
  Targets switched capacitance
  $C_{access} = R \cdot C \cdot C_{cell} / B$
Dividing word line
  Same effect for wordlines
Reducing voltage swings
  Sense amplifiers used to detect Vdiff across bitlines
  Read operation can be curtailed as soon as Vdiff is detected
  Limiting voltage swing saves a fraction of power
Pulse word lines
  Enabling the word line only for the time needed to discharge the bitcell voltage
  Designer needs to estimate access time and implement a pulse generator

Low Power Register File Design

RFs usually use single-ended bitlines
Modified storage cell
  A lot of zeros are fetched from the RF
  Bitline connections are modified to eliminate bitline discharge when reading a zero
Tseng and Asanovic, ICSD, 2000
Zyuban and Kogge, ISLPED 1998

Efficient Issue Queue

Constitutes a high fraction of the overall power
  >25% according to some authors
[Figure: issue queue entry in which each operand tag (Oprnd) is compared (comp) against result tags Tag 1 ... Tag w; the comparator outputs are ORed to set that operand's RDY bit]

Efficient Issue Queue

Useful comparison
  Empty entries and ready entries consume energy
    • Wakeup of empty entries can be disabled
      – Gating off precharge logic using valid bit
    • Wakeup of ready sources can be disabled
      – Gating off precharge logic using ready bit
  Folegnani and Gonzalez, ISCA 2001
Energy-efficient comparators
  Traditional comparators dissipate energy on a mismatch in any bit position
  10%-20% of source operands match each cycle
  Solution: comparators that dissipate energy on a match
  Kucuk et al, ISLPED 2001

Architectural-Level Techniques

Encoding/compression
Energy-efficient front end
Energy-efficient caches
Asymmetric processors
Dynamic Voltage/Frequency scaling
Multi clock domain architectures (similar to GALS)
Pipeline gating
Compiler techniques
Sleep modes

These typically take advantage of locality or slack

Bus Invert Encoding

Reduce power of parallel synchronous signals
Idea: minimize the number of transitions
    • (Stan & Burleson, IEEE Trans. on VLSI, 1995)
  Sender examines the current and the next values
  Decides whether to send the true or the complement signal
  Additional polarity signal is sent along with data
Example
  Current data:          110011101
  Next data:             000100110
  Number of transitions: 7
  Current data:          110011101
  NOT (Next data):       111011001
  Number of transitions: 2

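The sender's decision can be written as a short sketch. The function names are hypothetical, words are modeled as bit strings for readability, and the transition on the polarity wire itself is ignored, which is a simplification.

```python
def transitions(a: str, b: str) -> int:
    """Number of bit positions that differ between two equal-width words."""
    return sum(x != y for x, y in zip(a, b))

def invert(word: str) -> str:
    return "".join("1" if bit == "0" else "0" for bit in word)

def bus_invert(current: str, nxt: str):
    """Return (word to drive, polarity bit) for the next bus cycle.

    Sends the complement when that toggles fewer wires. The transition
    on the polarity wire itself is ignored here for simplicity.
    """
    inverted = invert(nxt)
    if transitions(current, inverted) < transitions(current, nxt):
        return inverted, 1
    return nxt, 0

current, nxt = "110011101", "000100110"
word, polarity = bus_invert(current, nxt)
print(transitions(current, nxt), transitions(current, word), word, polarity)
```

For the example words on the slide, driving the complement toggles 2 data wires where the true value would toggle most of the bus, so the sender sets the polarity bit.
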
Dynamic Zero Compression

Zero Indicator Bit (ZIB) added to each byte
  Enabled if a zero is stored in the cache
  On a read access, bitline discharge is prevented by disabling the local wordline
  On a write, if the byte is zero, just the ZIB is written
Circuit modifications
  Zero-detection and store bus drivers
  Wordline gating: 8-bit data is driven by the associated ZIB
  Sense amps: modified to drive a zero if ZIB is active
Drawbacks
  9% area increase, 2-gate delay increase
Results
  26% energy reduction in the data cache, 10% in the instruction cache
Villa et al, MICRO 2000

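A purely functional model of the ZIB idea is sketched below. It only tracks which bytes would be skipped and says nothing about the circuit-level energy; the function names and the sample cache line are assumptions for illustration.

```python
def encode(data: bytes):
    """Return a list of (zib, payload) pairs, one per byte.

    zib is 1 when the byte is zero; in that case only the ZIB would be
    written, so the payload is None.
    """
    return [(1, None) if b == 0 else (0, b) for b in data]

def read(stored) -> bytes:
    # A set ZIB makes the sense amps drive zero without touching the byte.
    return bytes(0 if zib else payload for zib, payload in stored)

def byte_accesses_saved(stored) -> float:
    """Fraction of byte-wide array accesses skipped thanks to the ZIB."""
    return sum(zib for zib, _ in stored) / len(stored)

line = bytes([0x00, 0x00, 0x12, 0x00, 0x34, 0x00, 0x00, 0x00])
stored = encode(line)
assert read(stored) == line
print(f"{byte_accesses_saved(stored):.0%} of byte accesses skipped")
```

The benefit depends entirely on how many stored bytes are zero, which is why the reported savings differ between data and instruction caches.
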

Exploiting Narrow Width Operands

High percentage of integer operations require <16 bits
  Difficult for the compiler to know the actual operand size
  Variability for the same instruction in successive instances
  Clock gating is used to partially disable the FU
[Figure: 64-bit operands A and B are held in low latches (bits 0-15) and high latches (bits 16-63); zero detection on the upper 48 bits produces zero48, which is ANDed with clk to gate the high latches of the integer FU and selects 0 for the upper bits of the 64-bit result]
Brooks and Martonosi, HPCA 1999

Energy-Efficient Front End: Branch Prediction

Branch prediction
  Parikh et al, HPCA’02, IEEE Trans. Computers ’04
  Branch prediction accuracy is a major determinant of pipeline activity -> spending more power in the branch predictor can be worthwhile if it improves accuracy
  Branch predictors can be designed to reduce power, e.g.
    • Banking
    • Gate off unnecessary accesses (“prediction probe detector”)

Energy-Efficient Front End: Register Renaming

RAT often implemented as a multiported register file indexed by logical register, returns physical register
Liu and Lu, MICRO’00
  Hierarchical RAT: top level is a cache of the full table
Kucuk et al, PATMOS’03
  Prevent lookup of sources that will be supplied by a freshly renamed instruction in the same rename group
  Filter cache
Could instead organize as an associative lookup in a table organized by physical register with dissipate-on-match comparator (Ergin et al, ICCD’02)

Energy-Efficient Caches

Filter cache
  Small L0 cache filters many accesses to L1, allows an L1 with fewer ports (Kin et al, MICRO-30)
Banks
Selective cache ways (Albonesi, MICRO-32)
  Ways in a set-associative cache can be disabled if not needed
  Many variations of this approach
Staggering number of papers on this topic
  Exploit victim cache, load-store queue
  Clever cache organizations (e.g., combining banks w/ high assoc., specialized caches, etc.)
  See recent proceedings of VLSI and architecture conferences, esp. ISLPED

Asymmetric Processors

Processors have different “versions” of the same resource, with different power/latency
Fast, power-hungry resources are allocated to critical instructions
Slow, low-power resources are allocated to non-critical instructions
Criticality predictor is needed!

Asymmetric Processors

Reducing power of functional units
  2 sets of functional units
  2 sets of instruction queues
  Criticality predictor
Critical instructions
  In-order queue: critical path is usually a serial chain of dependent instructions
  Fast functional units
Non-critical instructions
  OoO queue
  Slow functional units
Seng et al, MICRO 2001

Dual Speed Pipelines

[Figure: pipeline with Fetch, Decode, criticality predictor, fast pipeline, slow pipeline, Reg File, and Commit]
Slow pipeline works at half the frequency
Criticality predictor is a key component for energy efficiency
No communication penalties
Pyreddy and Tyson, WCED 2001

Dynamic Voltage/Frequency Scaling

Allow the device to dynamically adapt the voltage (and the frequency)
  $P \sim V_{dd}^2$
  $f \sim (V_{dd} - V_{th})^k / V_{dd}$
  Tradeoff between power reductions and delay increase
Already implemented in many processors
Implementation
  Voltage regulator
    • MUST BE energy-efficient
  Predict future processor utilization and adjust frequency/voltage to maximize power reduction while keeping performance

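The tradeoff between power reduction and delay increase can be illustrated with the alpha-power-law form of the frequency relation, in which the achievable frequency scales as (Vdd − Vth)^k / Vdd. The threshold voltage, the exponent k, and the two supply voltages below are hypothetical values chosen for illustration.

```python
def max_frequency(vdd: float, vth: float = 0.3, k: float = 1.5, scale: float = 1.0) -> float:
    """Relative maximum frequency, f ~ (Vdd - Vth)^k / Vdd (alpha-power law)."""
    return scale * (vdd - vth) ** k / vdd

def dynamic_energy_per_cycle(vdd: float) -> float:
    """Relative dynamic energy per cycle, E ~ Vdd^2 (so P ~ Vdd^2 * f)."""
    return vdd ** 2

v_hi, v_lo = 1.2, 0.9
slowdown = max_frequency(v_hi) / max_frequency(v_lo)
saving = 1 - dynamic_energy_per_cycle(v_lo) / dynamic_energy_per_cycle(v_hi)
print(f"{v_hi} V -> {v_lo} V: {slowdown:.2f}x slower, {saving:.0%} less dynamic energy per cycle")
```

With these assumed parameters, dropping the supply voltage costs less in speed than it saves in dynamic energy per cycle, which is the motivation for running at the lowest voltage that still meets the performance target.
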
Transmeta™ LongRun™

Crusoe processor can configure itself*
  Voltage changes in steps of 25 mV (depending on the voltage regulator)
  Frequency changes in steps of 33 MHz
  From 1.6 V, 600 MHz to 1.2 V, 300 MHz (2001)
Management
  Implemented in the Code Morphing™ software layer
  Idle time of the system is sampled to determine performance demands
Thermal extension
  May be a form of thermal throttling
  Expands the thermal budget of the processor
* Source: http://www.transmeta.com

Transmeta™ LongRun™

Idle time
  Voltage drops to minimum
On-line activity
  Voltage rises to maximum
Real-time activity
  Voltage adjusted to meet requirements
  DVD player
    • 24 frames/second
Source: Transmeta

Intel SpeedStep®

Configuration*
  From 0.844 V (600 MHz) to 1.48 V (1.7 GHz)
  100 μs delay
  Voltage-frequency switching separation
[Figure: timeline of transitions labeled No Change, Volt. Transition, Freq. Transition, Volt. Transition, Freq. Transition]
* Source: http://www.intel.com

Intel SpeedStep®

Configuration
  Clock partitioning
    • Core clock
    • Bus clock (sequencer and interrupt interface)
  Event blocking
    • Interrupts, pin events and snoop requests are not lost

Voltage Scheduling

Real-time problem will be discussed later
For non-real-time workloads, the goal is to improve energy efficiency
This is hard, because it is difficult to predict an arbitrary workload’s future needs without deadline information
Instead, try to schedule processes and voltages to reduce idle time
  e.g., Weiser et al, OSDI-1

Sleep Modes

ACPI: Advanced Configuration and Power Interface
  Developed by Microsoft, HP, Toshiba, Phoenix and Intel
  Establishes interfaces for OS-directed power management
  Replaces APM, MPS APIs and PnP BIOS
  Defines
    • Hardware registers
    • BIOS interfaces
    • System and device power states
Source: ACPI overview, http://www.acpi.info

DVS “Critical Power Slope”

It may be more efficient not to use DVS, and to run at the highest possible frequency, then go into a sleep mode!
  Depends on power dissipation in sleep mode
  And power dissipation at lowest voltage
This has been formalized as the critical power slope (Miyoshi et al, ICS’02):
  $m_{critical} = (P_{f_{min}} - P_{idle}) / f_{min}$
  If the actual slope $m = (P_f - P_{f_{min}}) / (f - f_{min}) < m_{critical}$, then it is more energy efficient to run at the highest frequency, then go to sleep
Switching overheads must be taken into account

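The comparison can be worked through numerically. The sketch below uses a hypothetical platform (all powers and frequencies are made-up values) and ignores switching overheads, which the slide notes must be accounted for in practice.

```python
def critical_slope(p_fmin: float, p_idle: float, f_min: float) -> float:
    return (p_fmin - p_idle) / f_min

def actual_slope(p_f: float, p_fmin: float, f: float, f_min: float) -> float:
    return (p_f - p_fmin) / (f - f_min)

def energy(cycles: float, deadline_s: float, f: float, p_f: float, p_idle: float) -> float:
    """Energy to run `cycles` at frequency f, then sleep until the deadline."""
    busy = cycles / f
    return p_f * busy + p_idle * (deadline_s - busy)

# Hypothetical platform: powers in watts, frequencies in Hz.
f_min, f_max = 300e6, 600e6
p_idle, p_fmin, p_fmax = 0.1, 1.0, 1.6

m_crit = critical_slope(p_fmin, p_idle, f_min)
m = actual_slope(p_fmax, p_fmin, f_max, f_min)
e_slow = energy(300e6, 1.0, f_min, p_fmin, p_idle)   # DVS: stretch to the deadline
e_fast = energy(300e6, 1.0, f_max, p_fmax, p_idle)   # race at f_max, then sleep
print(m < m_crit, e_slow, e_fast)
```

Here the actual slope is below the critical slope, and the energy calculation agrees: finishing early at the highest frequency and sleeping for the rest of the period uses less energy than stretching the work at the lowest frequency.
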
Multi Clock Domain Architecture

Multiple clock domains inside the processor
Globally-asynchronous locally-synchronous (GALS) clock style
Independent voltage/frequency scaling
Synchronizers to ensure inter-domain communication

Multi Clock Domain Architecture

Advantages
  Local clock design is not aware of global skew
  Each domain is limited by its local critical path, allowing higher frequencies
  Different voltage regulators allow for finer-grain energy control
  Frequency/voltage of each domain can be tailored to its dynamic requirements
  Clock power is reduced
Drawbacks
  Complexity and penalty of synchronizers
  Feasibility of multiple voltage regulators

Multi Clock Domain Architecture

Synchronization
[Figure: timing diagram of CLK1 and CLK2 with edges numbered 1 to 4 and an interval T]
  Src runs with CLK1, dst with CLK2
  Src writes at T1
  If T > Ts then dst can use the data at T2
  If T < Ts then dst can use the data at T3
Semeraro et al, ISCA 2003

Multi Clock Domain Architecture

Domains must be carefully chosen
  Small cost on communications
  Re-using existing structures
Example
  5 domains
    • Front-end
    • Integer unit
    • FP unit
    • On-chip cache unit
    • Main memory

Multi Clock Domain Architecture

[Figure: CPU clock domains: Front-end (fetch, L1 i-cache, IFQ, branch predict, dispatch, rename); Integer (IIQ, int. register file, int. FUs); Floating Point (FIQ, fp. register file, fp. FUs); Memory (LSQ, L1 d-cache, L2 unified cache); Main Memory]
Magklis et al, ISCA 2003

Multi Clock Domain Architecture

Dynamic voltage/frequency scaling in each domain
Reconfiguration points must be chosen
  Off-line “shaker” algorithm
    • Aggressive oracle algorithm with good results
    • Uses detailed dynamic execution trace to find frequencies
    • It is not practical: it requires future knowledge of this precise dynamic run
  On-line attack-decay
    • Interval-based hardware algorithm
    • Transparent to the application, minimal overhead
    • More conservative, achieves 75% of the efficiency of the off-line algorithm
  Profile-based
    • Use profiling to associate frequencies with parts of the code
    • When these points in the code are reached during a dynamic run, change frequencies

Gating/Throttling

Gating: disable some of the stages of the processor
  To reduce useless activity: after a branch misprediction
    • Manne et al, ISCA 1998
  Effectiveness is heavily dependent on accuracy of branch confidence predictor
    • Parikh et al, HPCA’02
Throttling: slow down some processor stage when it is predicted that the performance will not be reduced
  Branch misprediction
  Long latency load miss
  IPC reduction in general
  Baniasadi and Moshovos, ISLPED 2001


Selective Throttling for Control Speculation

Control speculation increases power dissipation (28%)
  Energy wasted by mispredicted instructions
Selective throttling of fetch/decode
  Based on branch confidence
Gating of selection stage
  Instructions that likely belong to a mispredicted path
9% Energy-Delay improvement
[Figure: speedup, power savings, energy savings, and E-D improvement (%, 0 to 30) for oracle fetch, oracle decode, and oracle select]
Aragon et al, HPCA 2003

Co-Adaptive Instruction Fetch and Issue

Fetch gating based on issue queue utilization
  Rather than using instruction window usage
Fetch is stopped if close parallelism is present
  Just instructions from the head of the IQ are issued
Fetch gating combined with dynamic issue queue adaptation
  To match the size of the window residing in the IQ to the application’s ILP
20% energy-delay improvement
Buyuktosunoglu et al, ISCA 2003

Compiler Techniques for Low Power

Good reference: tutorial by Kremer, PLDI’03
Traditional compiler optimizations often improve energy efficiency
  e.g., register allocation, CSE, tiling for cache hit rate
But some compiler optimizations waste energy
  e.g., aggressive speculation
Energy efficiency of code sequences is highly dependent on microarchitecture
  e.g., free slot in a VLIW word

Compiler Techniques for Low Power, cont.

Compiler-guided DVS
  v1: reduce voltage while meeting real-time deadlines
  v2: reduce voltage in memory-bound program regions
    • Hsu and Kremer, ISLPED’01, PLDI’03
    • Xie et al, PLDI’03
Dynamic resource configuration/hibernation
  Deactivate modules when they won’t be used for a long time (>> sleep/wakeup time)
    • Heath et al, PACT’02
Profile/compiler-guided adaptation
  e.g., profile-guided MCD adaptation mentioned earlier (Magklis et al, ISCA’03)
  e.g., subroutine-guided (“positional”) adaptation (Huang et al, ISCA’03)
    • Uses a hierarchy of low-power modes
Much work in this area – this only touches the surface

Power Savings for Real Time Systems

Soft vs. hard real time
Periodic vs. aperiodic
  Periodic tasks are especially important in control systems
Most work has focused on DVS scheduling
Examples
  MPEG playback
  Web server

DVS for Multimedia Apps (soft real-time approach)

MM apps must process every frame within a time limit
  If there is idle time, then there is some slack
  IPC is constant across frames of the same type
  Slow down the processor to meet deadlines
2 phases
  Profiling
    • Determines the max. number of insts. that can be executed for each configuration
    • Sorts that list
  Adaptation
    • Predicts the number of instructions to be executed in the next interval
    • Uses the lowest-energy hardware configuration that fulfills requirements
Hughes et al, MICRO 2001

DVS for Multimedia Apps (hard real-time approach)

Buffering decoded frames provides a control point to enforce deadlines using feedback control
  Dead-zone proportional-integral controller sets DVS to maintain queue occupancy
  No profiling or other prior knowledge about the stream is needed
  If the queue becomes empty, “panic” model forces highest speed
[Figure: queue occupancy with a dead zone, a region labeled “decrease frequency” on one side and a region labeled “increase frequency” on the other]
Lu et al, ICCD 2003

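A dead-zone proportional-integral controller of this kind might look like the sketch below. It is not the controller from Lu et al: the class name, gains, target occupancy, dead-zone width, and frequency limits are all hypothetical, and the direction of control (a fuller queue of decoded frames allows a lower frequency) is an assumption consistent with the figure labels.

```python
class DeadZonePI:
    """Dead-zone proportional-integral controller for a frame queue.

    Gains, the target occupancy and the dead-zone width are hypothetical
    tuning values chosen for illustration.
    """

    def __init__(self, target=8, dead_zone=2, kp=0.05, ki=0.01, f_min=0.3, f_max=1.0):
        self.target, self.dead_zone = target, dead_zone
        self.kp, self.ki = kp, ki
        self.f_min, self.f_max = f_min, f_max
        self.integral = 0.0
        self.freq = f_max            # relative frequency, 1.0 = full speed

    def update(self, occupancy: int) -> float:
        if occupancy == 0:           # "panic": queue drained, go to full speed
            self.integral = 0.0
            self.freq = self.f_max
            return self.freq
        error = self.target - occupancy   # > 0 means the queue is too empty
        if abs(error) <= self.dead_zone:  # inside the dead zone: leave DVS alone
            return self.freq
        self.integral += error
        delta = self.kp * error + self.ki * self.integral
        self.freq = min(self.f_max, max(self.f_min, self.freq + delta))
        return self.freq

ctrl = DeadZonePI()
for occ in (8, 12, 14, 9, 3, 0):
    print(occ, round(ctrl.update(occ), 3))
```

The dead zone keeps the controller from changing the DVS setting on every small fluctuation in occupancy, which matters because each voltage/frequency transition has a cost.
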
DVS for Web Servers

Basic idea: load balance, then do DVS to reclaim slack (Elnozahy et al, PACS’02)
  But it may be more profitable to cluster requests onto fewer nodes and put some to sleep
Even on single nodes, it may be profitable to briefly defer requests, then batch them at the highest frequency before going to sleep (Elnozahy et al, USITS’03)
Providing delay guarantees requires feedback control (Sharma et al, RTSS 2001)
  A natural and effective control point is synthetic utilization
    • Combines true utilization with real-time schedulability

Other Approaches

Almost all RT algorithms attempt to reclaim slack
  Episode detection (Flautner et al, MOBICOM’01)
    • Identify interactive and periodic events, schedule accordingly
  Program checkpoints – check performance relative to deadline and adjust DVS accordingly
  Exploit direct knowledge of task execution times or utilization
VISA (Anantaraman et al, ISCA’03)
  Model a superscalar (unpredictable) processor as a predictable scalar processor to perform RT analysis and scheduling, then reduce the DVS setting when the superscalar processor runs faster than predicted
  Use program checkpoints to check progress/slack

Short-Circuit Power

Main solutions are
  Reduce rise/fall times
    • Tradeoff: reducing rise/fall times requires stronger drivers, more dynamic power
  Reduce capacitance being switched

Roadmap

Introduction & Trends
Dynamic Power Dissipation
  Sources, modeling, reduction techniques
Static Power Dissipation
  Sources, modeling, reduction techniques
Summary

Static Power Dissipation

Static power: dissipation due to leakage current
Growing worse because Vth is not scaling as fast as Vdd
Roadmap
  Most important sources of static power: subthreshold leakage and gate leakage
  Inter-process variation
  Trends
  Modeling leakage power
  Circuit/architectural-level techniques

Static Power

Main mechanisms for leakage current
  Subthreshold (Berkeley predictive model):
    $I_{leakage} = \mu_0 \cdot C_{ox} \cdot \frac{W}{L} \cdot e^{b (V_{dd} - V_{dd0})} \cdot v_t^2 \cdot \left(1 - e^{-V_{dd}/v_t}\right) \cdot \exp\left(\frac{-|V_{th0}| - V_{off}}{n \cdot v_t}\right)$
  Gate
    • $I_{gate} = I_{gate0} \cdot \exp(a (t_{ox} - t_{ox0})) \cdot \exp(b (V_{dd} - V_{dd0}))$
We will focus on subthreshold
Gate leakage has essentially been ignored
  New gate insulation materials may solve the problem, e.g., recent Intel announcement
    • R. Chau, Technology@Intel Magazine. www.intel.com
Gate-induced drain leakage (GIDL) occurs at negative gate voltages and high Vdd or high values of reverse body bias

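To get a feel for how the two leakage expressions behave, the sketch below evaluates them in relative units. The functional forms follow the subthreshold and gate equations on the slide as I read them; every default parameter value is a placeholder rather than a fitted process parameter, and the sign of `a` in the gate term is an assumption chosen so that a thinner oxide leaks more.

```python
import math

K_B_OVER_Q = 8.617e-5  # Boltzmann constant / electron charge, in V/K

def subthreshold_leakage(vdd, vth, temp_k, w_over_l=1.0, mu0_cox=1.0,
                         b=2.0, vdd0=1.0, voff=-0.08, n=1.5):
    """Relative subthreshold leakage current.

    All default parameter values are placeholders, not fitted to any process.
    """
    vt = K_B_OVER_Q * temp_k                       # thermal voltage kT/q
    return (mu0_cox * w_over_l * math.exp(b * (vdd - vdd0)) * vt ** 2
            * (1 - math.exp(-vdd / vt))
            * math.exp((-abs(vth) - voff) / (n * vt)))

def gate_leakage(vdd, tox, igate0=1.0, a=-10.0, tox0=1.2, b=3.0, vdd0=1.0):
    """Relative gate leakage; a < 0 so that thinner oxide leaks more (assumed)."""
    return igate0 * math.exp(a * (tox - tox0)) * math.exp(b * (vdd - vdd0))

cold = subthreshold_leakage(vdd=1.0, vth=0.25, temp_k=300)
hot = subthreshold_leakage(vdd=1.0, vth=0.25, temp_k=360)
print(f"subthreshold leakage grows {hot / cold:.1f}x from 300 K to 360 K")
```

Even with placeholder parameters the qualitative behavior is visible: subthreshold leakage rises with temperature and with supply voltage, and falls as the threshold voltage is raised.
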
Effects of Parameter Variations

Ioff depends exponentially on Vth
There is a large fluctuation of Ioff from die to die and from gate to gate
Controlling Vth is difficult at the nanometer scale
  Drain-induced barrier lowering
    • Channel length is not constant
    • Exacerbated in sub-100nm devices
  Discrete dopant effects
    • In a very small channel, there is a small number of dopants
    • The presence of these dopants and random fluctuation of their number lead to changes in Vth from device to device
Process variation affects
  Gate length (Ldrawn)
  Gate oxide thickness (Tox)
  Channel dose (Nsub)
Srivastava et al, ISLPED 2002

Static Power

Motivation
  Growing relative to dynamic power dissipation: soon 50% of total power
  Exponentially dependent on Temp, Vth, Vdd
  Natural target for optimization: idle transistors
[Figure: static power / dynamic power (percentage, 0 to 70) versus temperature (298 K to 373 K) for 180nm, 130nm, 100nm, 90nm, 80nm, and 70nm; the ratio increases across generations. Source: Skadron et al, University of Virginia]

© 2004, Kevin Skadron and Jose Gonzalez
Static Power

Modeling Leakage

Butts and Sohi (MICRO-33)
• Pstatic = Vcc · N · kdesign · Îleak
• Îleak determined by circuit simulation, kdesign empirically
• Key contribution: separate technology from design

HotLeakage (UVA TR CS-2003-05, DATE’04)
• Extension of Butts & Sohi approach: scalable with Vdd, Vth,
Temp, and technology node; adds gate leakage
• Îleak determined by BSIM3 subthreshold equation and BSIM4
gate-leakage equations, giving an analytical expression that
accounts for dependence on factors that may change at
runtime, namely Vdd, Vth, and Temp
• kdesign replaced by separate factors for N- and P-type
transistors
• kdesign also exponentially dependent on Vdd and Tox, linearly
dependent on Temp
• Currently integrated with SimpleScalar/Wattch for caches
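The Butts and Sohi product form is easy to evaluate once the per-device leakage is known. The sketch below pairs it with a simplified temperature rescaling of Îleak, assuming Ileak ~ vT^2 * exp(-Vth / (n * vT)); this is an illustrative stand-in, not the full BSIM3/BSIM4 expressions used by HotLeakage, and all numeric inputs (transistor count, kdesign, reference leakage, n) are hypothetical.

```python
import math

K_OVER_Q = 8.617333262e-5  # Boltzmann constant / electron charge, in V/K

def static_power(vcc, n_transistors, k_design, i_leak):
    """Butts-Sohi form: Pstatic = Vcc * N * kdesign * Ileak (watts)."""
    return vcc * n_transistors * k_design * i_leak

def scale_leakage(i_leak_ref, temp_k, temp_ref_k, vth, n=1.5):
    """Rescale a reference per-device leakage to another temperature.

    Assumes Ileak ~ vT^2 * exp(-Vth / (n * vT)) with vT = kT/q, and holds
    Vth fixed for simplicity (in real devices Vth also shifts with temperature).
    """
    vt = K_OVER_Q * temp_k
    vt_ref = K_OVER_Q * temp_ref_k

    def shape(v):
        return v * v * math.exp(-vth / (n * v))

    return i_leak_ref * shape(vt) / shape(vt_ref)

# Hypothetical inputs: 1M transistors, kdesign = 1.2, 20 nA per device at 300 K.
i_hot = scale_leakage(20e-9, temp_k=358.0, temp_ref_k=300.0, vth=0.3)
p_cool = static_power(1.0, 1_000_000, 1.2, 20e-9)
p_hot = static_power(1.0, 1_000_000, 1.2, i_hot)
print(p_cool, p_hot)
```

Keeping the technology term (Îleak) separate from the design term (kdesign) means only the first needs to be recomputed when temperature or voltage changes at runtime.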
Static Power

Modeling Leakage (cont.)

Su et al, IBM (ISLPED’03)
• Similar approach to HotLeakage – but they observe that
modeling the change in leakage allows linearization of the
equations

Many, many other papers on modeling different aspects of
leakage
• Most focus on subthreshold
• Few suggest how to model leakage in microarchitecture
simulations
Circuit/architectural level techniques

Transistor sizing
Dual Vth
DVS
Dynamic threshold voltage – reverse body bias
Sleep transistors
Low leakage caches/branch predictors
Low leakage register file
Low leakage issue queue
Low leakage ALUs
Techniques for reducing gate leakage
What else?
Transistor sizing, Dual-Vth

Transistor sizing
• Reducing W/L reduces leakage: use smallest possible
transistors
• Leakage-performance tradeoff

Dual-Vth
• High-threshold transistors dramatically reduce
leakage: use low-Vth on critical paths, high-Vth
elsewhere
• Often suggested in caches: many possible
permutations

DVS
• Leakage is exponentially dependent on Vdd, so
DVS reduces leakage
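The dual-Vth rule of thumb (low-Vth on critical paths, high-Vth elsewhere) can be sketched as a simple slack test. Everything here is hypothetical: the `Gate` record, the gate names, the 20 ps delay penalty, and the 10x leakage reduction are illustrative assumptions, and treating each gate's slack independently ignores gates that share a path.

```python
from dataclasses import dataclass

@dataclass
class Gate:
    name: str
    slack_ps: float      # timing slack available with a low-Vth device
    leak_low_nw: float   # leakage if built with low Vth

def assign_vth(gates, delay_penalty_ps, leak_reduction):
    """Use high Vth wherever the slack absorbs the slowdown; low Vth otherwise."""
    plan, total_leak = {}, 0.0
    for g in gates:
        if g.slack_ps >= delay_penalty_ps:
            plan[g.name] = "high"
            total_leak += g.leak_low_nw / leak_reduction
        else:
            plan[g.name] = "low"
            total_leak += g.leak_low_nw
    return plan, total_leak

gates = [
    Gate("decoder", 0.0, 10.0),      # on the critical path
    Gate("tag_cmp", 5.0, 10.0),      # too little slack for a slower device
    Gate("cell_array", 40.0, 100.0), # plenty of slack
]
plan, leak = assign_vth(gates, delay_penalty_ps=20.0, leak_reduction=10.0)
print(plan, leak)
```

In this toy case the structure with the most leakage also has the most slack, which is the situation that makes dual-Vth attractive for caches.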
Dynamic Threshold Voltage

Adjust threshold voltage dynamically







Also called reverse body bias (RBB), auto backgate-controlled multi-threshold CMOS (ABB-MTCMOS)
(Nii et al, ISLPED’98)
Apply negative voltage to body: requires larger VGS to
establish channel, so it raises Vth
Engage RBB for idle transistors
Preserves state
Requires twin-well process; more expensive to
manufacture
Limited by GIDL
Can also be used at testing to adjust circuit properties
and reduce parameter variations
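A rough sketch of why reverse body bias raises Vth, using the textbook body-effect relation Vth = Vth0 + gamma * (sqrt(2*phiF + Vsb) - sqrt(2*phiF)). The body-effect coefficient, surface potential, bias voltage, and swing factor below are assumed values for illustration only; short-channel devices typically show a weaker body effect than this simple form suggests.

```python
import math

def vth_with_body_bias(vth0, v_sb, gamma=0.4, two_phi_f=0.7):
    """Threshold voltage (V) for source-to-body reverse bias v_sb >= 0.

    gamma (V^0.5) and two_phi_f (V) are assumed, illustrative constants.
    """
    return vth0 + gamma * (math.sqrt(two_phi_f + v_sb) - math.sqrt(two_phi_f))

def leakage_reduction(delta_vth, n=1.5, v_thermal=0.0259):
    """Factor by which subthreshold leakage drops for a Vth increase.

    Assumes Ioff ~ exp(-Vth / (n * vT)) with an assumed swing factor n.
    """
    return math.exp(delta_vth / (n * v_thermal))

vth_active = vth_with_body_bias(0.25, v_sb=0.0)  # no body bias while active
vth_idle = vth_with_body_bias(0.25, v_sb=0.5)    # RBB engaged while idle
print(vth_active, vth_idle, leakage_reduction(vth_idle - vth_active))
```

Because only the body voltage changes, the cell contents are untouched, which is why the technique preserves state.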
Sleep Transistors

Add a high-Vth transistor between the
circuit and either/both power rails – the
sleep transistor



Also referred to as a “header” (to Vdd) or
“footer” (to ground)
The high-Vth transistor cuts off most
leakage
In fact, a properly sized, lower-Vth
footer transistor can preserve enough
leakage to keep the cell active (Li et
al, PACT’02; Agarwal et al, DAC’02)


Great care must be taken when switching
back to full voltage: noise can flip bits
Extra latency may be necessary when reactivating
Low-Leakage Caches

Gated-Vdd/Vss (Powell et al, ISLPED’00; Kaxiras et al, ISCA-28)
• Uses sleep transistor on Vdd/ground for each cache line
• Typically considered non-state-preserving, but recent work (Agarwal et al,
DAC’02) suggests that gated-Vss may preserve state
• Many algorithms for determining when to gate
• Simplest (Kaxiras et al, ISCA-28): Two-bit access counter and decay
interval
• Adaptive decay intervals - hard

Drowsy cache (Flautner et al, ISCA-29)
• Uses dual supply voltages: normal Vdd and a low Vdd close to the
threshold voltage
• State preserving, but requires an extra cycle to wake up – two extra
cycles if tags are decayed

State preservation using leakage currents (Li et al, PACT’02; Agarwal
et al, DAC’02)
• Similar to gated-Vss but designed to keep supply voltage high enough to
preserve state (100-120 mV)
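The two-bit access counter with a decay interval can be sketched as follows. This is an illustrative model under stated assumptions, not the published implementation: it assumes a global tick every quarter of the decay interval advances each line's counter, any access resets it, and a line is gated once its counter has saturated. The class and function names are hypothetical.

```python
class DecayLine:
    """One cache line with a two-bit decay counter."""

    def __init__(self):
        self.counter = 0     # two-bit value, 0..3
        self.powered = True

    def on_access(self):
        """Reset the counter; return True if the line had been gated off."""
        was_off = not self.powered
        self.counter = 0
        self.powered = True
        return was_off       # True = induced miss for non-state-preserving gating

    def on_global_tick(self):
        """Called once every quarter decay interval."""
        if not self.powered:
            return
        if self.counter == 3:
            self.powered = False   # counter saturated: gate the line
        else:
            self.counter += 1

def simulate(access_cycles, decay_interval, total_cycles):
    """Return (cycles spent gated off, number of induced misses) for one line."""
    line, induced, off_cycles = DecayLine(), 0, 0
    tick_period = decay_interval // 4
    accesses = set(access_cycles)
    for cycle in range(1, total_cycles + 1):
        if cycle in accesses and line.on_access():
            induced += 1
        if cycle % tick_period == 0:
            line.on_global_tick()
        if not line.powered:
            off_cycles += 1
    return off_cycles, induced

print(simulate([100, 200, 9000], decay_interval=4000, total_cycles=10000))
```

A shorter decay interval gates lines sooner and saves more leakage, at the price of more induced misses; that tension is what the adaptive schemes try to manage.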
Low Leakage Caches, cont.

Comparison (Parikh, Li, et al, WDDD’03, DATE’04)


Compared non-state-preserving gated-Vss with state-preserving
drowsy cache
If gating is state-preserving, it wins because it essentially
eliminates subthreshold and gate leakage
• Unless wakeup time is significantly longer than with drowsy


Otherwise, drowsy cache typically has an advantage because it
is state preserving; no L2 accesses needed on “induced misses”
But induced misses are rare, so for a reasonable range of on-chip L2 penalties (< 8 cycles in our studies), gating can still be
superior
Low-Leakage Caches, cont: 4T Cells

4 transistor cells [ 4T ]
• Eliminates two transistors connected to Vdd
• Naturally decays over time
• Refreshes upon access
• When decayed, force default output
• Up to 33% smaller than equivalent 6T
• Decays quickly [8K cycles at 1 GHz]
• Leak only as much energy as is deposited

[Figure: 6T (left) and 4T (right) circuit diagrams]

4T-based branch predictors, caches
• Hu, Juang, et al, ISLPED’02, CA-Letters’02
• Non state-preserving
• Decay rate: temperature-dependent; can be adjusted with passives
• Eliminates decay state bits
Low-Leakage Caches, cont:
Other Techniques

RBB (Nii et al, ISLPED’98)
• Back bias cache lines that are idle – can use the same
decay counters as gated-Vdd/Vss

Leakage-biased bitlines (Heo et al, ISCA-29)
• Disable precharge and let the bitlines float: they will
settle to a value that minimizes leakage
• Can only be applied to idle subbanks and requires
accurate prediction of which subbank will be accessed

Huge variety of other techniques – this is only an
overview of some of the major ones
Register Files

In general, state-preserving techniques for
caches may work for register files too
Leakage-biased bitlines work here too


Register file divided into subbanks
Alvandpour et al, Intel, ISLPED’01

Uses dual Vth and a conditional keeper
• “Keepers” are used on dynamic circuits to counteract voltage
droop due to leakage – they constitute a static pull-up path
• Dynamic circuits arise in the muxes due to multiporting
• “Conditional” keeper technique uses two cascaded keepers;
one is fixed and the other only engaged when needed to
drive an output – requires careful timing analysis

Access transistors and keepers are high-Vt/
ALUs

Usually Dual-VT domino logic


Area & Speed
Sleep transistors can be used, but at a cost


Dynamic nodes are discharged
Can be used if worthwhile
Dropsho et al, MICRO 2002
Other Techniques

Queues (eg, issue queues)




Various occupancy-based or rate-matching
techniques have been proposed for issue queue
resizing.
Deactivating queue entries reduces leakage
eg, Ponomarev et al, MICRO-34
Compiler techniques


When the compiler knows that regions are idle, they can
be deactivated
eg, Zhang et al, MICRO-35
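For the occupancy-based issue queue resizing mentioned above, a generic sketch of one resizing decision per sampling interval might look like this. It is not the scheme from any of the cited papers; the function name, chunk size, size limits, and the 5% "queue full" threshold are all hypothetical.

```python
def resize_queue(active, avg_occupancy, full_fraction, chunk=8,
                 min_size=8, max_size=64, full_threshold=0.05):
    """Return how many queue entries to keep powered for the next interval.

    active        -- entries currently powered
    avg_occupancy -- mean number of valid entries over the last interval
    full_fraction -- fraction of cycles in which the queue was full
    """
    if full_fraction > full_threshold and active < max_size:
        return min(max_size, active + chunk)   # queue was a bottleneck: grow
    if avg_occupancy <= active - chunk and active > min_size:
        return max(min_size, active - chunk)   # a whole chunk sat idle: shrink
    return active

print(resize_queue(64, 20.0, 0.0))   # mostly empty queue
print(resize_queue(32, 30.0, 0.2))   # queue frequently full
```

Entries in deactivated chunks can then be power-gated, which is where the leakage saving comes from.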
Gate Leakage




Any technique that reduces Vdd
Otherwise it seems difficult to develop architecture
techniques that directly attack gate leakage
In fact, very little work has been done in this area
One example: domino gates (Hamzaoglu & Stan,
ISLPED’02)



Replace traditional NMOS pull-down network with a PMOS pull-up network
Gate leakage is greater in NMOS than PMOS
But PMOS domino gate is slower
Roadmap

Introduction & Trends
Dynamic Power Dissipation
• Sources, modeling, reduction techniques
Static Power Dissipation
• Sources, modeling, reduction techniques
Summary
Other Power-Related Issues

Thermal


Managing on-chip temperatures (as opposed to
average heat dissipation) is not just a matter of
reducing average power density
Spatial and temporal variation
• Spatial: hot spots—must reduce power density in the right
places
• Temporal: must reduce power when chip is hot


This is often when there is less slack
Must model temperature directly
• Average power metrics do not accurately predict temperature
(Skadron et al, ISCA’03)
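To see why average power does not predict temperature, consider a minimal lumped thermal RC sketch: a single thermal resistance and capacitance, integrated with forward Euler. The model and its parameter values (1 K/W, 0.02 J/K, 45 C ambient, 1 ms step) are assumptions for illustration and far simpler than a real thermal model with multiple blocks and package layers.

```python
def simulate_temperature(power_trace_w, r_th=1.0, c_th=0.02, t_amb=45.0, dt=0.001):
    """Forward-Euler integration of C*dT/dt = P - (T - Tamb)/R.

    Returns the temperature (same units as t_amb) after each time step.
    """
    temp, temps = t_amb, []
    for p in power_trace_w:
        temp += dt * (p - (temp - t_amb) / r_th) / c_th
        temps.append(temp)
    return temps

steady = [30.0] * 400                        # constant 30 W
bursty = ([60.0] * 100 + [0.0] * 100) * 2    # same 30 W average, delivered in bursts

steady_peak = max(simulate_temperature(steady))
bursty_peak = max(simulate_temperature(bursty))
print(steady_peak, bursty_peak)
```

Both traces have the same average power, yet the bursty one reaches a much higher peak temperature in this model, so a thermal limit has to be checked against temperature itself.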
Other Power-Related Issues

Voltage stability (dI/dt)



Inductance means that abrupt changes in current can
cause voltage droop
This can be addressed with decoupling capacitance,
but required capacitance is becoming expensive
Grochowski et al, HPCA’02; Joseph et al, HPCA’03
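The size of the dI/dt problem follows from V = L * dI/dt. The short sketch below evaluates that for a linear current ramp; the inductance, current step, and ramp time are hypothetical, and the effect of decoupling capacitance is deliberately not modeled.

```python
def inductive_droop(inductance_h, delta_i_a, delta_t_s):
    """Magnitude of V = L * dI/dt for a linear current ramp (volts)."""
    return inductance_h * delta_i_a / delta_t_s

# Hypothetical: 10 pH effective supply inductance, 20 A step in 2 ns.
droop = inductive_droop(10e-12, 20.0, 2e-9)
print(f"{droop * 1000:.0f} mV")
```

At a supply near 1 V, a droop of this size is a significant fraction of the rail, which is why abrupt activity changes such as waking many gated units at once are a concern.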
Roadmap

Introduction & Trends
Dynamic Power Dissipation
• Sources, modeling, reduction techniques
Static Power Dissipation
• Sources, modeling, reduction techniques
Summary
Summary

Power dissipation is becoming a huge concern
• Total power budget
• Power density (thermal)
• Energy consumption & battery life

Power dissipation
• Switching
• Short-circuit
• Leakage

Power modeling crucial
• Academia: accurate research
• Industry: detect hot spots on time to meet POR
Summary

Reducing dynamic power

Circuits perspective
• Energy-effective access (reducing capacitance or driving
voltage)
• Gating

Architectural perspective
• Decreasing activity factor
• Pipeline gating
• Adjusting voltage/frequency to meet application requirements

Reducing static power
• Dual Vth
• Non-state-preserving vs. state-preserving techniques
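As a closing illustration of adjusting voltage and frequency, the standard switching-power relation P = a * C * V^2 * f can be evaluated for two operating points. The activity factor, capacitance, voltages, and frequencies here are hypothetical round numbers.

```python
def dynamic_power(activity, capacitance_f, vdd, freq_hz):
    """Switching power P = a * C * Vdd^2 * f (watts)."""
    return activity * capacitance_f * vdd ** 2 * freq_hz

base = dynamic_power(0.5, 1e-9, 1.2, 2e9)
scaled = dynamic_power(0.5, 1e-9, 0.9, 1.5e9)  # voltage and frequency both cut by 25%
print(base, scaled, scaled / base)
```

Scaling voltage and frequency together gives a roughly cubic reduction in switching power, and since leakage also depends on Vdd, the same adjustment helps static power too.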