ELEC 516 VLSI System Design and Design Automation, Spring 2010
Lecture 9 - Low Power Digital CMOS Design
Reading Assignment: Rabaey, Chapters 4, 7, and 11
Note: some of the figures in this slide set are adapted from the slide set of “Digital Integrated Circuits” by Rabaey, Copyright UCB 2002
ELEC516/10 Lecture 9
Why worry about power? -- Heat Dissipation
Power is a very big concern in today's advanced technologies (besides the low power required for portable applications).
Power produces heat on the chip, which has to be carried off through the chip socket, leading to expensive packaging solutions.
Motivation for low power design
• Packaging costs
• Power supply rail design
• Chip and system cooling costs
• Noise immunity and system reliability
• Battery life (in portable systems)
• Environmental concerns
– ICT equipment accounted for 10% of total US commercial energy usage in 2010 and may reach 20% by 2020
– Energy Star compliant systems
Why worry about power? -- Portability
[Figure: nominal battery capacity (Watt-hours / lb, 0 to 50) versus year (65 to 95) for Nickel-Cadmium, Ni-Metal Hydride, and Rechargeable Lithium batteries; inset shows a battery of 40+ lbs.]
Applications: multimedia terminals, laptop computers, digital cellular telephony
Expected battery lifetime increase over next 5 years: 30-40%
Why worry about power? -- Standby Power
Year                   2002  2005  2008  2011  2014
Power supply Vdd (V)   1.5   1.2   0.9   0.7   0.6
Threshold VT (V)       0.4   0.4   0.35  0.3   0.25
Drain leakage will increase as VT decreases to maintain noise margins and meet frequency demands, leading to excessive battery draining standby power consumption.
[Figure: standby power as a percentage of total power (0% to 50%) versus year (2000 to 2008), with data labels 12W, 88W, 400W, 1.7KW, and 8KW; annotated "…and phones leaky!". Source: Borkar, De, Intel]
Power and Energy Figures of Merit
• Power consumption in Watts
– determines battery life in hours
• Peak power
– determines power ground wiring designs
– sets packaging limits
– impacts signal noise margin and reliability analysis
• Energy efficiency in Joules
– energy is power integrated over time (power is the rate at which energy is consumed)
• Energy = power * delay
– Joules = Watts * seconds
– lower energy number means less power to perform a computation at the same frequency
Power versus Energy
Power is the height of the curve: a lower power design could simply be slower.
Energy is the area under the curve: the two approaches require the same energy.
[Figure: power (Watts) versus time for Approach 1 and Approach 2, shown once to compare curve height and once to compare area.]
PDP and EDP
• Power-delay product (PDP) = $P_{av} \cdot t_p = (C_L V_{DD}^2)/2$
– PDP is the average energy consumed per switching event (Watts * sec = Joule)
– lower power design could simply be a slower design
• Energy-delay product (EDP) = $PDP \cdot t_p = P_{av} \cdot t_p^2$
– EDP is the average energy consumed multiplied by the computation time required
– takes into account that one can trade increased delay for lower energy/operation (e.g., via supply voltage scaling that increases delay, but decreases energy consumption)
– allows one to understand tradeoffs better
[Figure: normalized energy, delay, and energy-delay (0 to 15) versus Vdd (V) from 0.5 to 2.5.]
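The PDP and EDP definitions can be explored numerically. The sketch below is only illustrative: it assumes the first-order delay model that appears later in this lecture (Td = CL * Vdd / I with I ~ (Vdd - Vt)^2), and the values Vt = 0.7 V, CL = 1, and the current factor k = 1 are placeholder assumptions, not figures from the slide.

```python
def delay(vdd, c_load=1.0, v_t=0.7, k=1.0):
    """First-order gate delay: Td = CL * Vdd / I, with I = k * (Vdd - Vt)^2.

    v_t and k are assumed illustrative values.
    """
    if vdd <= v_t:
        raise ValueError("this simple model is only meaningful for Vdd > Vt")
    return c_load * vdd / (k * (vdd - v_t) ** 2)


def pdp(vdd, c_load=1.0):
    """Power-delay product: average energy per switching event, CL * Vdd^2 / 2."""
    return c_load * vdd ** 2 / 2


def edp(vdd, c_load=1.0, v_t=0.7, k=1.0):
    """Energy-delay product: PDP multiplied by the gate delay."""
    return pdp(vdd, c_load) * delay(vdd, c_load, v_t, k)


if __name__ == "__main__":
    for vdd in (1.0, 1.5, 2.0, 2.5, 5.0):
        print(f"Vdd={vdd:.1f} V  PDP={pdp(vdd):.3f}  delay={delay(vdd):.3f}  EDP={edp(vdd):.3f}")
```

Under this assumed model the energy keeps falling as Vdd drops, while the EDP has a minimum at a moderate supply voltage, which is the tradeoff the EDP metric is meant to expose.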
Where Does Power Go in CMOS?
• Dynamic power
– Charge-discharge current:
• Dominant source of power ($C V^2$ per transition)
– Short circuit current (both NMOS and PMOS on during a transition)
• <10% of charge/discharge current if transitions are fast
• Subthreshold leakage (transistors not OFF completely)
– Becoming important: 10-30% of active power in <0.18 µm technologies
– Diode leakage from reverse-biased source and drain diodes (negligible)
– Gate leakage (no longer negligible due to very thin gate oxide)
Dynamic Power Consumption
[Figure: CMOS inverter with supply Vdd, input Vin, output Vout, and load capacitance CL.]
Energy/transition = $C_L V_{dd}^2$
Power = Energy/transition * f = $C_L V_{dd}^2 f$
Not a function of transistor sizes!
Need to reduce CL, Vdd, and f to reduce power.
Dynamic Power Consumption - Revisited
Power = Energy/transition * transition rate
= $C_L V_{dd}^2 f_{0\to1}$
= $C_L V_{dd}^2 P_{0\to1} f$
= $C_{EFF} V_{dd}^2 f$
Power dissipation is data dependent: it is a function of switching activity.
$C_{EFF}$ = effective capacitance = $C_L P_{0\to1}$
Power Consumption is Data Dependent
Example: Static 2 Input NOR Gate
Assume: P(A=1) = 1/2, P(B=1) = 1/2
Then: P(Out=1) = 1/4
P(0→1) = P(Out=0) * P(Out=1) = 3/4 * 1/4 = 3/16
CEFF = 3/16 * CL
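The NOR gate calculation can be reproduced by enumerating the truth table. This is a minimal sketch that assumes statistically independent inputs and no temporal correlation, as the slide does; the function names are made up for illustration.

```python
from fractions import Fraction
from itertools import product


def output_one_probability(gate, input_probs):
    """P(Out=1) for a gate with independent inputs, by truth-table enumeration."""
    total = Fraction(0)
    for bits in product((0, 1), repeat=len(input_probs)):
        weight = Fraction(1)
        for bit, p_one in zip(bits, input_probs):
            weight *= p_one if bit else 1 - p_one
        if gate(*bits):
            total += weight
    return total


def static_transition_probability(gate, input_probs):
    """P(0->1) = P(Out=0) * P(Out=1) for a static CMOS gate."""
    p_one = output_one_probability(gate, input_probs)
    return (1 - p_one) * p_one


def nor2(a, b):
    return int(not (a or b))


if __name__ == "__main__":
    half = Fraction(1, 2)
    print(output_one_probability(nor2, [half, half]))         # 1/4
    print(static_transition_probability(nor2, [half, half]))  # 3/16
```

The same helper works for any gate expressed as a Python function of its input bits, so other basic gates can be checked the same way.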
Transition Probabilities for Basic Gates
Transition Probability of 2-input NOR Gate
Inter-signal Correlations
• Determining switching activity is complicated by the fact that signals exhibit correlation in space and time
– reconvergent fan-out
[Figure: inputs A and B (signal probability 0.5 each) drive a gate with output X; X and B reconverge at a second gate with output Z. Annotation at X: (1-0.5)(1-0.5) x (1-(1-0.5)(1-0.5)) = 3/16. Annotation at Z, ignoring the correlation: (1 - 3/16 x 0.5) x (3/16 x 0.5) = 0.085.]
Reconvergent: P(Z=1) = P(B=1) · P(X=1 | B=1)
• Have to use conditional probabilities
Logic Restructuring
• Logic restructuring: changing the topology of a logic network to reduce transitions
AND: P0→1 = P0 x P1 = (1 - PA PB) x PA PB
[Figure: two implementations of a 4-input AND with all input probabilities 0.5. Chain: A, B → W (3/16); W, C → X (7/64); X, D → F (15/256). Tree: A, B → Y (3/16); C, D → Z (3/16); Y, Z → F (15/256). Example: (1-0.25)*0.25 = 3/16.]
Chain implementation has a lower overall switching activity than the tree implementation for random inputs
Ignores glitching effects
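The chain-versus-tree comparison can be checked with a few lines of arithmetic. The sketch below assumes independent inputs with signal probability 0.5 and ignores glitching, as the slide does; the node names W, X, Y, Z follow the figure labels.

```python
from fractions import Fraction


def and_node(p_a, p_b):
    """Return (P(out=1), P(0->1)) for a 2-input AND with independent inputs."""
    p_one = p_a * p_b
    return p_one, (1 - p_one) * p_one


half = Fraction(1, 2)

# Chain: W = A.B, X = W.C, F = X.D
p_w, t_w = and_node(half, half)
p_x, t_x = and_node(p_w, half)
_, t_f_chain = and_node(p_x, half)
chain_activity = t_w + t_x + t_f_chain

# Tree: Y = A.B, Z = C.D, F = Y.Z
p_y, t_y = and_node(half, half)
p_z, t_z = and_node(half, half)
_, t_f_tree = and_node(p_y, p_z)
tree_activity = t_y + t_z + t_f_tree

if __name__ == "__main__":
    print("chain:", t_w, t_x, t_f_chain, "total", chain_activity)
    print("tree: ", t_y, t_z, t_f_tree, "total", tree_activity)
```

Summing the per-node transition probabilities gives 91/256 for the chain and 111/256 for the tree, consistent with the slide's conclusion for random inputs.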
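Input Ordering
[Figure: a 3-input AND built from two 2-input gates, with P(A=1) = 0.5, P(B=1) = 0.2, P(C=1) = 0.1.
Ordering 1: X = A·B, F = X·C; P0→1(X) = (1 - 0.5 x 0.2) x (0.5 x 0.2) = 0.09
Ordering 2: X = B·C, F = X·A; P0→1(X) = (1 - 0.2 x 0.1) x (0.2 x 0.1) = 0.0196]
Beneficial to postpone the introduction of signals with a high transition rate (signals with signal probability close to 0.5)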
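As a quick check of the input-ordering numbers, the sketch below evaluates the internal node X for both orderings. It assumes independent inputs and 2-input AND gates, matching the expressions on the slide; the helper name is hypothetical.

```python
def and_transition_probability(p_a, p_b):
    """P(0->1) at the output of a 2-input AND with independent inputs."""
    p_one = p_a * p_b
    return (1 - p_one) * p_one


P_A, P_B, P_C = 0.5, 0.2, 0.1

# Ordering 1: the high-activity signal A enters first (X = A.B)
x_activity_a_first = and_transition_probability(P_A, P_B)
# Ordering 2: A is postponed to the last gate (X = B.C)
x_activity_a_last = and_transition_probability(P_B, P_C)
# The final output F = A.B.C has the same activity in both orderings
f_activity = and_transition_probability(P_A * P_B, P_C)

if __name__ == "__main__":
    print(f"X with A first: {x_activity_a_first:.4f}")
    print(f"X with A last:  {x_activity_a_last:.4f}")
    print(f"F (either way): {f_activity:.4f}")
```

Only the internal node changes between the two orderings, so the saving comes entirely from keeping the signal with probability near 0.5 out of the first gate.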
How about Dynamic Circuits?
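[Figure: precharged dynamic gate with clock φ, precharge transistor Mp, evaluate transistor Me, inputs In1 to In3 driving the pull-down network (PDN), and output Out.]
Power is only dissipated when Out = 0!
CEFF = P(Out=0) * CL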
2-input NOR Gate
Example: Dynamic 2 Input NOR Gate
Assume:
P(A=1) = 1/2
P(B=1) = 1/2
Then:
P(Out=0) = 3/4
CEFF = 3/4 * CL
Switching Activity Is Always Higher in Dynamic Circuits
Transition Probabilities for Dynamic Gates
Switching Activity for Precharged Dynamic Gates
P0→1 = P0
Glitching in Static CMOS Networks
• Gates have a nonzero propagation delay resulting in spurious
transitions or glitches (dynamic hazards)
– glitch: node exhibits multiple transitions in a single cycle
before settling to the correct logic value
[Figure: logic network with inputs A, B, C, internal node X, and output Z; waveforms of X and Z for the input change ABC = 101 → 000 under a unit-delay model.]
Example: Adder Circuit
[Figure: chain of adder cells Add0 to Add15 with carry input Cin and sum outputs S0 to S15, and a plot of sum output voltage (Volts, 0.0 to 4.0) versus time (ns, 0 to 10) for Cin and selected sum bits (S1, S10, S15, and others).]
How to Cope with Glitching?
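[Figure: two implementations computing F1, F2, and F3 from inputs arriving at time 0: a chain in which signals arrive at times 0, 1, and 2, and a balanced arrangement in which the inputs of each gate arrive together.]
Equalize lengths of timing paths through design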
Balanced Delay Paths to Reduce Glitching
• Glitching is due to a mismatch in the path lengths in the logic network; if all input signals of a gate change simultaneously, no glitching occurs
[Figure: chain versus balanced implementation of F1, F2, and F3, with signal arrival times marked.]
• So equalize the lengths of timing paths through logic
Glitch Reduction by Pipelining
• Glitches depend on the logic depth of the circuit - gates
deeper in the logic network are more prone to glitching
– arrival times of the gate inputs are more spread due to
delay imbalances
– usually affected more by primary input switching
• Reduce logic depth by adding pipeline registers
– additional energy used by the clock and pipeline registers
[Figure: processor pipeline with Fetch, Decode, Execute, Memory, and WriteBack stages (PC, I$, MAR, MDR, D$), separated by pipeline stage isolation registers clocked by clk.]
Short Circuit Power Consumption
[Figure: CMOS inverter with input Vin, output Vout, load capacitance CL, and short-circuit current Isc.]
Finite slope of the input signal causes a direct current path between VDD and GND for a short period of time during switching when both the NMOS and PMOS transistors are conducting.
Determinants of Short Circuit Current
$E_{sc} = t_{sc} V_{DD} I_{peak} P_{0\to1}$
$P_{sc} = t_{sc} V_{DD} I_{peak} f_{0\to1}$
• tsc: determined by the duration and slope of the input signal
• Ipeak determined by
– the saturation current of the P and N transistors, which depends on their sizes, process technology, temperature, etc.
– strong function of the ratio between input and output slopes
• a function of CL
Impact of CL on Psc
[Figure: two inverters (Vin, Vout, CL), one with a large and one with a small capacitive load.]
Large capacitive load: Isc ≈ 0. Output fall time significantly larger than input rise time.
Small capacitive load: Isc ≈ Imax. Output fall time substantially smaller than the input rise time.
Ipeak as a Function of CL
[Figure: short-circuit current (x 10^-4, from -0.5 to 2.5) versus time (sec, x 10^-10, from 0 to 6) for CL = 20 fF, 100 fF, and 500 fF, with a 500 psec input slope.]
When load capacitance is small, Ipeak is large.
Short circuit dissipation is minimized by matching the rise/fall times of the input and output signals - slope engineering.
Static Power Consumption
[Figure: gate drawing a static current Istat from Vdd with Vin = 5V; output Vout, load CL.]
Pstat = P(In=1) * Vdd * Istat
• Dominates over dynamic consumption
• Not a function of switching frequency
Leakage (Static) Power Consumption
[Figure: inverter drawing Ileakage from VDD, with drain junction leakage, gate leakage, and sub-threshold current paths marked; output Vout.]
Sub-threshold current is the dominant factor.
All increase exponentially with temperature!
Power consideration – leakage current
$I_{sub} = K \cdot e^{(V_{gs} - V_{th})q/nkT} \cdot \left(1 - e^{-V_{ds}q/kT}\right)$
K: technology constant; q: electronic charge; k: Boltzmann constant
n: nonlinearity constant (between 1 and 2); T: temperature
Leakage as a Function of VT
• Continued scaling of supply voltage and the subsequent scaling of threshold voltage will make subthreshold conduction a dominant component of power dissipation.
• A 90 mV/decade VT roll-off means that each 255 mV increase in VT gives roughly 3 orders of magnitude reduction in leakage (but adversely affects performance)
[Figure: ID (A), log scale from 10^-12 to 10^-2, versus VGS (V) from 0 to 1, for VT = 0.1V and VT = 0.4V.]
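The roll-off rule of thumb can be turned into a one-line calculation. This is a sketch assuming an ideal exponential subthreshold characteristic with the 90 mV/decade slope quoted above; the function name is made up for illustration.

```python
def leakage_reduction_factor(delta_vt, slope_per_decade=0.090):
    """Factor by which subthreshold leakage drops when VT rises by delta_vt volts.

    Assumes a constant subthreshold slope (volts per decade of current).
    """
    return 10 ** (delta_vt / slope_per_decade)


if __name__ == "__main__":
    for delta_vt in (0.090, 0.255, 0.270):
        factor = leakage_reduction_factor(delta_vt)
        print(f"VT + {delta_vt * 1000:.0f} mV -> leakage reduced about {factor:.0f}x")
```

With these assumptions, 255 mV corresponds to a little under three decades (about 2.8), and a full three decades needs 270 mV.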
Sub-Threshold in MOS
[Figure: ID versus VGS for VT = 0.2 and VT = 0.6.]
Lower bound on threshold to prevent leakage
Low Power Design Space
• The dynamic power consumption equation reveals
the three degrees of freedom inherent in the low
power design space:
– Voltage
– Physical capacitance
– Data activity
• Optimization for power entails an attempt to reduce
one or more of these factors. Interactions among
these factors complicate the optimization problem.
• Deep sub-micron design - need to minimize leakage
and sub-threshold current
Power Reduction Strategy (I)
• Voltage Reduction
– 5V -> 3.3V -> 2.5V -> 1.8V -> 1.0V
– Mixed supplies in system and/or on chip, using the minimum voltage for different chips or for functions within a chip, together with on-chip voltage converters if required.
• Low-voltage circuit techniques are required to give good performance even at low voltages
• Less noisy structures and better signal integrity handling are required
• Lower Vth process is required to maintain good transistor speed
Reducing Vdd
$P \times t_d = E_t = C_L V_{dd}^2$
$\frac{E(V_{dd}=2)}{E(V_{dd}=5)} = \frac{C_L (2)^2}{C_L (5)^2}$, so $E(V_{dd}=2) \approx 0.16\,E(V_{dd}=5)$
[Figure: normalized power-delay product (log scale, 0.03 to 1.5) versus Vdd (volts, 1 to 5) for a 51-stage ring oscillator and an 8-bit adder, showing quadratic dependence.]
Strong function of voltage ($V^2$ dependence).
Relatively independent of logic function and style.
Power-delay product improves with lowering VDD.
Lower Vdd Increases Delay
$T_d = \frac{C_L V_{dd}}{I}$, with $I \sim (V_{dd} - V_t)^2$
$\frac{T_d(V_{dd}=2)}{T_d(V_{dd}=5)} = \frac{(2)(5 - 0.7)^2}{(5)(2 - 0.7)^2} \approx 4$
[Figure: normalized delay (1.00 to 7.50) versus Vdd (volts, 2.00 to 6.00) in a 2.0 µm technology for a multiplier, clock generator, ring oscillator, microcoded DSP chip, adder, and adder (SPICE).]
Relatively independent of logic function and style.
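The two ratios quoted for scaling the supply from 5 V to 2 V can be reproduced directly from the first-order models on these slides (energy proportional to Vdd^2, delay proportional to Vdd / (Vdd - Vt)^2 with Vt = 0.7 V). The function names below are illustrative only.

```python
def energy_ratio(v_low, v_high):
    """E(v_low) / E(v_high) for energy proportional to CL * Vdd^2."""
    return (v_low / v_high) ** 2


def delay_ratio(v_low, v_high, v_t=0.7):
    """Td(v_low) / Td(v_high) for Td proportional to Vdd / (Vdd - Vt)^2."""
    def relative_delay(vdd):
        return vdd / (vdd - v_t) ** 2
    return relative_delay(v_low) / relative_delay(v_high)


if __name__ == "__main__":
    print(f"energy ratio 2V/5V: {energy_ratio(2.0, 5.0):.2f}")
    print(f"delay ratio  2V/5V: {delay_ratio(2.0, 5.0):.2f}")
```

The energy drops to 16% while the delay grows by a factor slightly above 4, which is why the lost speed has to be recovered architecturally.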
Lowering the Threshold
[Figure: delay versus Vdd (with 2Vt marked), and ID versus VGS for Vt = 0 and Vt = 0.2.]
Reduces the speed loss, but increases leakage
Interesting design approach: design for PLeakage == PDynamic
What Threshold Voltage to Use?
• Energy vs. Vt for a fixed throughput
• An “optimal” Vt/Vdd point trades switching and
leakage energy
Stack Effect
• Leakage is a function of the circuit topology and the value of the inputs
$V_T = V_{T0} + \gamma\left(\sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|}\right)$
where VT0 is the threshold voltage at VSB = 0; VSB is the source-bulk (substrate) voltage; γ is the body-effect coefficient
[Figure: 2-input gate with inputs A and B, output Out, and intermediate node VX between the stacked transistors.]
A  B  VX            ISUB
0  0  VT ln(1+n)    VGS = VBS = -VX
0  1  0             VGS = VBS = 0
1  0  VDD - VT      VGS = VBS = 0
1  1  0             VSG = VSB = 0
• Leakage is least when A = B = 0
• Leakage reduction due to stacked transistors is called the stack effect
Leakage as a Function of Design Time VT
• Reducing the VT increases the sub-threshold leakage current (exponentially)
– 90mV reduction in VT increases leakage by an order of magnitude
• But, reducing VT decreases gate delay (increases performance)
[Figure: ID (A) versus VGS (V) from 0 to 1, for VT = 0.1V and VT = 0.4V.]
• Determine the critical path(s) at design time and use low VT devices on the transistors on those paths for speed. Use a high VT on the other logic for leakage control.
– A careful assignment of VT's can reduce the leakage by as much as 80%
Dual-Thresholds Inside a Logic Block
• Minimum energy consumption is achieved if all logic
paths are critical (have the same delay)
• Use lower threshold on timing-critical paths
– Assignment can be done on a per gate or transistor basis;
no clustering of the logic is needed
– No level converters are needed
Variable VT (ABB) at Run Time
• $V_T = V_{T0} + \gamma\left(\sqrt{|-2\phi_F + V_{SB}|} - \sqrt{|-2\phi_F|}\right)$
• For an n-channel device, the substrate is normally tied to ground (VSB = 0)
• A negative bias on VSB causes VT to increase
• Adjusting the substrate bias at run time is called adaptive body-biasing (ABB)
– Requires a dual well fab process
[Figure: VT (V), 0.4 to 0.9, versus VSB (V), -2.5 to 0.]
Techniques for Burst Mode Computation
• Multiple VT technology (disable high VT devices during idle periods), e.g. [Sakata-93], [Mutoh-93]
• Substrate bias controlled variable VT devices (increase VT during idle periods), e.g. [Seta-95]
[Figure: left, low VT logic between Vdd and ground, gated by high VT transistors driven by SLEEP; right, a gate with input Vin and output Vout whose p and n substrate biases are switched between on and standby values.]
Multiple-Threshold Circuits
• Previous approaches cannot reduce leakage power during active mode.
• Use dual Vt CMOS logic, low Vt for critical paths and high Vt for non-critical paths; problem - large threshold swing
• Triple Vt CMOS circuit to reduce the sub-threshold swing [Fujii et al. ISSCC-98]
– For high speed, low power active mode, low and medium Vt are used for critical and non-critical paths, respectively.
– For standby mode, a high Vt MOSFET is inserted between the supply rail and the virtual supply rail.
Multiple-Threshold Circuits
Power considerations – reduce leakage
• Leakage is proportional to device width
– Use the smallest devices off the critical path.
• Leakage drops with stacked devices (drain voltage divider)
– Use stacked transistors off the critical path.
• Leakage drops with increasing channel length
– Slightly increase L off the critical path.
• Use a dual VT process providing two threshold voltages
– Use high VT transistors off the critical path (low VT on the critical path).
Power consideration – reduce leakage
• Switch off critical path transistors when not needed.
• Stand-by mode between supply and virtual supply lines
• Stand-by vectors
– Apply input vector which minimizes leakage.
– Achieved using Mux
Power gating transistor Sizing
• The effect of power gating transistor size
– As the size decreases, logic performance also
decreases.
– As the size increases, leakage current and chip
area also increase.
– Proper sizing is very important.
– The power gating transistor should be sized so that performance degradation stays within 2%.
[Figure: low Vt logic supplied from VDD, with a high Vt switch (driven by Control) between the logic and GND. Vop = VDD - ΔV, where ΔV must be kept within the 2% performance degradation budget.]
Power Reduction Strategy (II)
• Reducing capacitance
– Process scaling and better integration, with smaller
capacitances in more aggressive processes
– Improved devices and interconnect technology
– Efficient clock generation and distribution
– Good memory hierarchy
– In-place optimization using a library containing ranges
of gates with different strengths, through replacement
of gates to use the optimum drive in the critical paths
and minimum drive elsewhere
Reducing Effective Capacitance
Global bus architecture
Local bus architecture
Shared Resources incur Switching Overhead
Power consideration – Reduce Capacitance
• Reduce switched capacitance C
– Careful transistor sizing, transistor ordering, tighter and more compact layout
– Hierarchical architecture; add transmission gates (TG) to isolate buses
– Segmented structures
[Figure: shared bus driven by A or B when sending values to C; a switch is inserted to isolate the bus segment when B is sending to C.]
Power Reduction Strategy (III)
• Reducing activities
– Lowering operating frequency
– Using power management strategies, such as gated clocks and power-down of non-operational units
– Reducing switching activity: Power = Energy/transition * transition rate = $C_L V_{dd}^2 f_{0\to1} = C_L V_{dd}^2 P_{0\to1} f = C_{eff} V_{dd}^2 f$
• Power dissipation is data dependent and hence is a function of switching activity.
• $P_{0\to1}$ is the switching probability
• Effective switching capacitance: $C_{eff} = C_L P_{0\to1}$
Factors Affecting Power Consumption - Revisited
• Degrees of freedom in the low power design space:
– Voltage
– Physical capacitance
– Data activity
• Power minimization approaches:
– Run at minimum allowable voltage
– Minimize effective switching capacitances
System Level - Power Down Techniques
Operating states:
• Active or Full-On (fastest clock)
• Standby (slow clock)
• Suspend or Sleep (slow clock or shut down)
[Figure: microprocessor with an activity analyzer.]
Dynamic Power as a Function of VDD
• Decreasing the VDD decreases dynamic energy consumption (quadratically)
• But, increases gate delay (decreases performance)
[Figure: curve plotted versus VDD (V) from 0.8 to 2.4, with a vertical axis from 1 to 5.5.]
• Determine the critical path(s) at design time and use high VDD for the transistors on those paths for speed. Use a lower VDD on the other gates, especially those that drive large capacitances (as this yields the largest energy benefits).
Multiple VDD Considerations
• How many VDD? – Two is becoming common
– Many chips already have two supplies (one for core and one for I/O)
• When combining multiple supplies, level converters are required whenever a module at the lower supply drives a gate at the higher supply (step-up)
– If a gate supplied with VDDL drives a gate at VDDH, the PMOS never turns off
• The cross-coupled PMOS transistors do the level conversion
• The NMOS transistors operate on a reduced supply
– Level converters are not needed for a step-down change in voltage
– Overhead of level converters can be mitigated by doing conversions at register boundaries and embedding the level conversion inside the flipflop
[Figure: level converter between the VDDL and VDDH domains with input Vin and output Vout.]
Dual-Supply Inside a Logic Block
• Minimum energy consumption is achieved if all logic paths
are critical (have the same delay)
• Clustered voltage-scaling
– Each path starts with VDDH and switches to VDDL (gray logic
gates) when delay slack is available
– Level conversion is done in the flipflops at the end of the
paths
Power Conscious Behavioral Design
• Run functional units at the minimum allowed voltage
while satisfying the timing constraints
• Parallelize or pipeline data-path, memory and controllers
to compensate for throughput loss due to reduced
supply voltage
• Power down functional units which are not in use; put in “dynamic power management” capability
• Avoid centralized resources (controllers, functional blocks, global busses, etc.) as much as possible
• Map functions to hardware so that inter-chip
communication is minimized
• Schedule and bind operations to functional units so as
to reduce the activity of the input operands
• Reorder operands to reduce switching activity; Keep the
inputs to an idle unit unchanged
Low Power Datapath Design - reducing
the supply voltage
• Reducing supply voltage has quadratic effect on
power saving, but a negative effect on performance.
• Performance can be gained back by logical and
architectural optimizations, e.g. lookahead adder
instead of ripple-carry adder, using parallelism to
increase performance
Power Considerations – reduce f and VDD
• Reducing frequency does not save energy, just reduces
rate at which it is consumed
– Power is lower but system must run longer
• Reducing supply voltage is very effective (scaling the voltage by 0.5 scales energy/transition by 0.25).
• Dropping the voltage reduces performance in terms of speed (need to recover the performance using parallelism).
• Trade surplus performance for lower energy by reducing the supply voltage until performance is just as required
Power considerations – Dynamic Voltage Scaling
Can run at lower voltage and hence improve power consumption quadratically.
– 8-bit adder/comparator (Chandrakasan et al.):
• Consumes Pref at 40 MHz, 5 V, with area = 0.53 mm2
– Two parallel interleaved architecture:
• Consumes 0.36 Pref at 20 MHz, 2.9 V, with area = 1.80 mm2
– One pipelined architecture:
• Consumes 0.39 Pref at 40 MHz, 2.9 V, with area = 0.69 mm2
– Pipelined and parallel:
• Consumes 0.2 Pref at 20 MHz, 2.0 V, with area = 1.96 mm2
Minimizing the power consumption using parallelism
• Reference Design
Critical path = 25 ns, clock frequency = 40 MHz
Supply voltage = 5 V
$P_{ref} = C_{ref} V_{ref}^2 f_{ref}$
• Parallel implementation of the reference design
New critical path = 50 nsec
Cpar = 2.15 Cref, Vpar = 2.9 V, fpar = 0.5 fref
$P_{par} = C_{par} V_{par}^2 f_{par} = (2.15\,C_{ref})(0.58\,V_{ref})^2 \left(\frac{f_{ref}}{2}\right) \approx 0.36\,P_{ref}$
Pipelined Data Path
• Critical path delay is reduced to max[Tadder, Tcomparator].
• Keeping the clock rate constant (fpipe = fref), the voltage can be dropped to Vpipe = Vref/1.7 while maintaining the original throughput
• Capacitance is slightly higher: Cpipe = 1.15 Cref
• $P_{pipe} = (1.15\,C_{ref})(V_{ref}/1.7)^2 f_{ref} \approx 0.39\,P_{ref}$
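The parallel and pipelined power estimates both follow from P = C * V^2 * f with each factor expressed relative to the reference design. A small sketch (the helper name is made up) makes the arithmetic explicit:

```python
def relative_power(cap_scale, volt_scale, freq_scale):
    """Power relative to the reference: (C/Cref) * (V/Vref)^2 * (f/fref)."""
    return cap_scale * volt_scale ** 2 * freq_scale


V_REF = 5.0  # reference supply voltage from the slides, in volts

# Parallel: 2.15x capacitance, 2.9 V supply, half the clock frequency
p_parallel = relative_power(2.15, 2.9 / V_REF, 0.5)

# Pipelined: 1.15x capacitance, Vref/1.7 supply, same clock frequency
p_pipelined = relative_power(1.15, 1 / 1.7, 1.0)

if __name__ == "__main__":
    print(f"parallel:  {p_parallel:.3f} x Pref")
    print(f"pipelined: {p_pipelined:.3f} x Pref")
```

The pipelined result evaluates to just under 0.40 with Vref/1.7 (about 2.94 V); using the rounded 2.9 V instead gives roughly 0.39, the value quoted on the slide.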
How low a voltage can be used
• Capacitance overhead starts to dominate at “high”
levels of parallelism and results in an optimum
voltage.
Voltage as a design variable
Adapting Voltage to Workload yields cubic reduction
Multiple supply voltages – Filter Example
[Figure: scheduled data-flow graph of a filter over control steps 1 to 10, with multiplications (*) and additions (+) assigned to 5V, 3V, and 2.4V supplies.]
Power(5V) / Power(5V, 3V, 2.4V) = 1.5 [Raje95]
Using Multiple Voltages
[Figure: blocks C1, C2, and C3 on supplies Vdd1, Vdd2, and Vdd3, with the critical path and a non-critical path marked.] [Horowitz-95]
Processor Usage Model
[Figure: desired throughput versus time, showing compute-intensive and low-latency processes, background and high-latency processes, and a ceiling set by the top speed of the processor.]
Single-user system: not always computing
• System optimizations:
– Maximize peak throughput
– Minimize average energy/operation
– Maximize computation per battery life
Scale Supply Voltage with fclk
Dynamic Voltage Scaling Implementation
[Figure: control loop in which a frequency detector compares Mfref (+) with fout (-), followed by a loop filter and a battery-fed DC-DC converter that produces Vdd; Vdd supplies the VCO (which generates fout) and the µP.]
• VCO: ring oscillator which matches the µP critical path
• DC-DC: performs D/A, converts battery to regulated Vdd
• Provides both voltage regulation and clock generation.
Dynamic Voltage Scaling in Practice
Fixed throughput:
– Throughput = 8 MIPS, energy/op = 0.24 nJ/inst. (1.92 mW)
– Throughput = 100 MIPS, energy/op = 2.2 nJ/inst. (220 mW)
Occasionally demand peak throughput:
– Peak throughput = 100 MIPS, average energy/op = 0.24 nJ/inst. (~1.8 mW)
DVS is only advantageous when a majority of computation is performed at low throughput
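The fixed-throughput power figures are simply throughput multiplied by energy per instruction. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def average_power_mw(mips, nj_per_instruction):
    """Average power in mW for a given throughput (MIPS) and energy per instruction (nJ).

    MIPS * nJ/inst = 1e6 inst/s * 1e-9 J/inst = mW.
    """
    return mips * nj_per_instruction


low = average_power_mw(8, 0.24)     # low-throughput operating point
high = average_power_mw(100, 2.2)   # peak-throughput operating point

if __name__ == "__main__":
    print(f"8 MIPS   @ 0.24 nJ/inst: {low:.2f} mW")
    print(f"100 MIPS @ 2.2 nJ/inst:  {high:.0f} mW")
    print(f"energy per instruction ratio: {2.2 / 0.24:.1f}x")
```

The roughly 9x gap in energy per instruction between the two operating points is what DVS exploits when most of the work can run at low throughput.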
Low-voltage Switching Regulator
[Stratakos-94]
• Arbitrary Vdd (< Vin) generated using the buck converter
– Vdd = Vin * duty cycle at node X
• Chief sources of inefficiency:
– Conduction loss ($I^2 R$)
– Switching loss ($C_x V_{in}^2 f_s$ and $L_s I^2 f_s$)
– Gate-drive loss ($C_g V_{in}^2 f_s$)
Adaptive Power Supply Voltage
[Figure: self-timed processor between input and output FIFOs and registers (REG); a control block adjusts the power supply output Vdd(t).]
• Exploit data dependent computation times to vary the supply [Nielsen94]
Variable Supply Voltage Control Scheme
Voltage Scheduling for variable
workload system
• Voltage scheduling under timing constraints
• Example [Ishihara-98]
– Energy consumption of a processor:
• 10nJ/cycle at 2.5V
• 25nJ/cycle at 4 V
• 40nJ/cycle at 5V
– maximum clock frequencies:
• 50MHz at 5V, 40MHz at 4V, 25MHz at 2.5V
– The application needs 1000M cycles to finish, and the timing constraint is 25 sec.
Different Voltage Schedules
[Figure: energy consumption (∝ Vdd^2) versus time (sec, 0 to 25) for three schedules, with the 25 sec timing constraint marked.]
(A) 1000M cycles at 50 MHz (5.0 V): 40 J
(B) 750M cycles at 50 MHz (5.0 V), then 250M cycles at 25 MHz (2.5 V): 32.5 J
(C) 1000M cycles at 40 MHz (4.0 V): 25 J
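The three schedules can be evaluated from the per-cycle energies and maximum clock frequencies given in the [Ishihara-98] example. The sketch below is an illustrative calculation only; the data structure and function names are assumptions, not part of the original example.

```python
# Operating points from the example: Vdd -> (energy in nJ/cycle, max clock in MHz)
OPERATING_POINTS = {5.0: (40, 50), 4.0: (25, 40), 2.5: (10, 25)}


def evaluate(schedule):
    """Return (energy in J, run time in s) for a list of (vdd, mega_cycles) segments."""
    energy_j = 0.0
    time_s = 0.0
    for vdd, mega_cycles in schedule:
        nj_per_cycle, max_mhz = OPERATING_POINTS[vdd]
        energy_j += mega_cycles * nj_per_cycle / 1000  # Mcycles * nJ = mJ; /1000 -> J
        time_s += mega_cycles / max_mhz                # Mcycles / MHz = seconds
    return energy_j, time_s


SCHEDULES = {
    "A": [(5.0, 1000)],
    "B": [(5.0, 750), (2.5, 250)],
    "C": [(4.0, 1000)],
}

if __name__ == "__main__":
    for name, schedule in SCHEDULES.items():
        energy_j, time_s = evaluate(schedule)
        print(f"({name}) energy = {energy_j:.1f} J, time = {time_s:.0f} s")
```

Schedule (A) finishes early but costs 40 J, while (B) and (C) both use the full 25 sec; running at one intermediate voltage (C) costs less energy than mixing a high and a low voltage (B).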
Reduce Power Further by Buffering
Two samples processed every two sample periods
-> increased latency
Example of Buffering
[Figure: Blocks 1 to 4 processed over successive sample periods Tsample. Without buffering, the supply is Vdd = 5 in one period and Vdd = 2.5 in the next; with buffering, Blocks 1, 2 and Blocks 3, 4 all run at Vdd = 3.75.]
Without buffering: Energy/sample = $\frac{C(5)^2 + \frac{1}{2}C(2.5)^2}{2} \approx 14C$
With buffering: Energy/sample = $\frac{2(0.75)\,C(3.75)^2}{2} \approx 10.5C$
Voltage Island Concept
[Figure: supplies Vdd1 and Vdd2 derived from Vddo through switches, feeding IP1 (low VT logic) and IP2 (logic) under a power management unit.]
• Trade off power for delay by running functional blocks at different voltages
• Can use a mix of low and high Vt to balance performance and leakage
• Switch off inactive blocks to reduce leakage power
• Requires IP standards for power management, clock gating, etc.
[Figure: delay vs. voltage; delay (ps), 0 to 30, versus Vdd, 0.7 to 1.3, for standard Vt and low Vt. Source: Bergamaschi]
E.g.: Telecom ASIC with 1.0/1.2 V islands saved 16% active power and 50% standby power
*Slide from Prof Kyung of KAIST
Power Management
[Figure: chip floorplan with blocks (ROM, DSP, RLM 1 to RLM 3, microcontroller, analog, monitor logic, and memory arrays) assigned to supplies Vdd 1 to Vdd 5 and surrounded by I/Os, VReg, and Gnd. Memory arrays on Vdd 3 use low Vt device arrays optimized for low active power; memory arrays on Vdd 4 use high Vt device arrays (also labeled as optimized for low active power).]
• Independently controlled domain power switches
• Multiple on-chip voltage islands
• On-chip voltage regulators
*Slide from Prof Kyung of KAIST
Controlling VDD and VTH for low power
                Active          Stand-by
Multiple VTH    Dual-VTH        MTCMOS
Variable VTH    VTH hopping     VTCMOS
Multiple VDD    Dual-VDD        Boosted gate MOS
Variable VDD    VDD hopping
Side labels: software-hardware cooperation, technology-circuit cooperation
• MTCMOS: Multi-Threshold CMOS
• VTCMOS: Variable Threshold CMOS
• Multiple: spatial assignment
• Variable: temporal assignment
*Slide from Prof Kyung of KAIST
RTL-level optimization - Reducing effective switching activity
• General Principle: Avoid Waste.
– Application-specific processing
– Resource sharing/Locality of reference
– Data representations
– Preservation of Data correlations
– architectural restructuring
– Distributed processing
– Demand-driven/data-driven computation
Application Specific Processing
Application Specific Processing Reduces
“Implementation Overhead”
Eliminating Redundant Computation
[Figure: filter stages clocked at fs producing Out, with a Power Down control.]
• Dynamically vary the number of operations per sample.
• Trade power consumption and filter quality [Ludwig96]
Eliminating Redundant Computation
Switched capacitance reduction ~= (peak number of operations) / (average number of operations) ~= 2
Strong function of signal statistics
Reducing switching activity
• Multiplexing multiple operations on a single
hardware unit can have detrimental effect on the
power consumption, because the switching activity
may be increased, e.g. shared bus
Bus Multiplexing
• Buses are a significant source of power dissipation due
to high switching activities and large capacitive loading
– 15% of total power in Alpha 21064
– 30% of total power in Intel 80386
• Share long data buses with time multiplexing (S1 uses
even cycles, S2 odd)
[Figure: sources S1 and S2 sending to destinations D1 and D2, shown with separate buses and with one time-multiplexed shared bus.]
• But what if data samples are correlated (e.g., sign bits)?
Reducing the Effective Capacitance
• Circuit and Logic Style - select a circuit style with low
capacitance and/or switching activity, e.g. 8-bit adder
Reducing Glitching Activity
• Some circuit structures can be the cause of
spurious transients, e.g. a 16-bit ripple carry adder
• Glitches can be reduced by selecting structures that have balanced signal paths, e.g. trees.
• The Brent-Kung lookahead adder and the Wallace tree multiplier both have this property, which makes them more attractive for low power.
Data Representation
[Figure: panels comparing two's complement and sign magnitude representations.]
• Sign-extension activity significantly reduced using sign magnitude representation.
• An accumulator example: sign magnitude datapath switches 30% less capacitance for uniformly distributed inputs
Data Representation – Accumulator Example
Sign magnitude datapath switches 30% less capacitance for uniformly distributed inputs
Two's Complement vs. Sign-Magnitude
Two's complement datapath has a significantly higher switching activity
Bus encoding to reduce switching activity
• Minimizing temporal bit transition activity by data representation
• Bit encoding:
– Active high encoding
• high-level voltage for 1, low-level voltage for 0
– Transition-based encoding
• voltage change identifies logic 1, no voltage change identifies logic 0
• Word encoding
– assign patterns of 1's and 0's to each word of information
– Non-redundant codes vs. redundant codes
• Examples of low power coding
– Limited-weight code
– Gray code
– One-hot code
– Bus-invert code
Example of Bit Encoding
• Reducing the average no. of transitions on a data bus
• Transition-based encoding may limit the no. of transitions
for non-equiprobable input lines
• Let p(0) > p(1) and assume no temporal correlation exists. If
active-high encoding is used, the average no. of
transitions is
$p(0)\,p(1) + p(1)\,p(0) = 2\,(1 - p(1))\,p(1)$
• For transition-based encoding, it is simply p(1). If p(1) <<
p(0), transition-based is better than active-high encoding.
• To reduce transitions, the input patterns are transformed
in such a way that p(0) and p(1) of each input
line become as different as possible, and then
a transition-based encoding is applied before the data is
transmitted.
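The comparison above can be checked with a short Python sketch. The function names are hypothetical, and the line is assumed to start at level 0; the two formulas are the ones stated on this slide.

```python
def avg_active_high(p1):
    """Average transitions per cycle for active-high encoding,
    assuming no temporal correlation: p(0)p(1) + p(1)p(0)."""
    p0 = 1 - p1
    return p0 * p1 + p1 * p0

def avg_transition_based(p1):
    """Transition-based encoding toggles the line once per logic 1."""
    return p1

def transition_encode(bits, level=0):
    """Toggle the line level for each 1, hold it for each 0."""
    out = []
    for b in bits:
        level ^= b
        out.append(level)
    return out

def transition_decode(levels, level=0):
    """A change of level decodes to 1, no change decodes to 0."""
    out = []
    for v in levels:
        out.append(v ^ level)
        level = v
    return out

for p1 in (0.1, 0.25, 0.5):
    print(p1, avg_active_high(p1), avg_transition_based(p1))
```

For p(1) below 0.5 the factor 2(1 - p(1)) is greater than 1, so transition-based encoding gives fewer transitions; at p(1) = 0.5 the two are equal.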
Word encoding
• One-hot coding
• Gray coding - good for sequential data, e.g.
addressing for microprocessor [Su-94]
• Disadvantage of Gray code - it is only good for the
address bus, not for the data bus, and additional
conversion circuitry is needed
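As a minimal sketch of the benefit for sequential addresses, the Python below converts binary to Gray code with the standard XOR-shift formula and counts bit flips over an assumed 4-bit address sequence 0 to 15 (the sequence and helper names are chosen only for illustration).

```python
def to_gray(b):
    """Binary to Gray code: adjacent integers differ in exactly one bit."""
    return b ^ (b >> 1)

def from_gray(g):
    """Gray code back to binary (the extra conversion circuitry)."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

def transitions(words):
    """Count bit flips between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

addresses = list(range(16))                      # sequential 4-bit addresses
binary_flips = transitions(addresses)            # 26 flips
gray_flips = transitions([to_gray(a) for a in addresses])  # 15 flips
print(binary_flips, gray_flips)
```

Gray code gives exactly one flip per address increment, whereas plain binary averages close to two; for random (data bus) traffic this advantage disappears.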
Word encoding (cont.)
• Bus-invert encoding - use redundancy to save power.
• Given an n-bit data bus with 2^n patterns to be represented,
assume all the patterns are equiprobable with no temporal
correlation. With p(0) = p(1) = 0.5, the average no. of transitions is
n/2 per cycle, while the worst-case no. of transitions is n.
• Bus-invert coding - add an extra line, I, to the bus and
compare consecutive patterns before transmission. Two
cases:
– If the Hamming distance between the two patterns ≤ n/2, the
current pattern is transmitted as it is and I is set to 0.
– If the Hamming distance between the two patterns > n/2, the
current pattern is first inverted and then transmitted, and I is set
to 1.
• Max. no. of transitions is limited to n/2 and the average no. of
transitions is reduced by up to 25%
• Drawback - extra line I to indicate whether a pattern has been
inverted.
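The two cases can be sketched as a small encoder and decoder in Python. This is an illustrative sketch, not a reference implementation: the bus is assumed to start at all zeros, the Hamming distance is taken against the pattern currently on the data lines, the function names are hypothetical, and only data-line transitions are counted (the I line can add one more).

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def bus_invert_encode(words, n):
    """Return a list of (data_lines, I) pairs for an n-bit bus."""
    mask = (1 << n) - 1
    bus = 0                      # assumed initial state of the data lines
    encoded = []
    for w in words:
        if hamming(bus, w) > n / 2:
            bus, inv = ~w & mask, 1   # send the inverted pattern, I = 1
        else:
            bus, inv = w, 0           # send the pattern as it is, I = 0
        encoded.append((bus, inv))
    return encoded

def bus_invert_decode(encoded, n):
    """Undo the inversion wherever I = 1."""
    mask = (1 << n) - 1
    return [(d ^ mask) if inv else d for d, inv in encoded]

def max_data_transitions(encoded):
    """Worst-case number of data-line flips in any single cycle."""
    prev, worst = 0, 0
    for d, _ in encoded:
        worst = max(worst, hamming(prev, d))
        prev = d
    return worst

words = [0x00, 0xFF, 0x0F, 0xF0, 0xAA, 0x55, 0x01]   # made-up 8-bit traffic
enc = bus_invert_encode(words, 8)
print(enc, max_data_transitions(enc))
```

Whenever more than half of the lines would flip, sending the complement flips fewer than half instead, which is where the n/2 bound comes from.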
Conditional Inversion Coding
Bus encoding for cross coupling cap.
• Cross-coupling cap dominates for very deep submicron technology, e.g. < 70nm
• Bus model
– Stand-alone cap. Cs
– Cross-coupling cap. Cc
• Stand-alone switching
– Applies to a single bit line
– 0-1 transition
• Cross-coupling switching
– Occurs on adjacent wires
– Four types of coupling transition
[Figure: bus model with bit 0, bit 1 and bit 2, each wire with a stand-alone cap. Cs and a cross-coupling cap. Cc between adjacent wires; the four coupling transition types (Type 1 to Type 4) for H/L transitions on adjacent wires]
Permutation Based Address code
• Rearrange the physical order of the bit lines
[Figure: bit lines reordered between Sender and Receiver]
• Works efficiently on the address bus (40%), but not on the
instruction bus (4%)
– Correlation is lower
– Permutation is fixed
Dynamic Reconfigurable Bus
Encoding Scheme for Instruction Bus
Overview of the Scheme
• Encoding (on the computer)
– Encoding during compilation
– Bit lines reordering
– Decoding information stored as header
• Decoding (on the target processor)
– Decoding information is loaded to the LUT first
– Instruction is called
– Decoding information stored in the LUT will control the Cross_bar
– Instructions enter the Cross_bar to be decoded
[Figure: target processor with MEM, CACHE, Mem_bus, Cache_bus, Address_bus, Look-Up Table, Decoder (Cross_bar), processing element and CPU core; bit lines B1 ... Bn and B1 ... Bm]
Demand/Data-driven operation
• Clocking strategy
– Gated clocking
• System Power Down
• Computing Paradigm
Logic Level Power Down Technique
• Activity driven - precomputation-based sequential logic
optimization
– Selectively precompute the output logic values of the circuit
one clock cycle before they are required, and use the
precomputed values to reduce internal switching activity in
the succeeding cycle
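[Figure: original circuit (registers R1 and R2, logic block A, output f) and modified circuit with added precomputation logic g1 and g2, two FFs and a load-enable signal LE]
It is required that
g1 = 1 -> f = 1
g2 = 1 -> f = 0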
An example: n-bit comparator
• This circuit compares two n-bit numbers A and B and
computes the function A > B
• In general, precomputation works best when there are a
small number of complex functions corresponding to
the logic block A
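One common way to precompute the comparator, sketched here as an assumption rather than as the circuit on the slide, is to look only at the most significant bits of unsigned A and B: if they differ, the result is already known and the remaining input bits need not be loaded. The function names below are hypothetical.

```python
def precompute(a, b, n):
    """Precomputation predicates from the MSBs of unsigned n-bit a and b.

    g1 = 1 guarantees f = (a > b) = 1; g2 = 1 guarantees f = 0.
    """
    msb = n - 1
    a_msb = (a >> msb) & 1
    b_msb = (b >> msb) & 1
    g1 = a_msb & (1 - b_msb)     # A has MSB 1, B has MSB 0
    g2 = (1 - a_msb) & b_msb     # A has MSB 0, B has MSB 1
    return g1, g2

def fraction_disabled(n):
    """Exhaustively check the two implications and return the fraction of
    input pairs for which the lower-order bits would not need to be loaded."""
    disabled = 0
    total = 0
    for a in range(1 << n):
        for b in range(1 << n):
            g1, g2 = precompute(a, b, n)
            f = int(a > b)
            if g1:
                assert f == 1
            if g2:
                assert f == 0
            disabled += g1 | g2
            total += 1
    return disabled / total

print(fraction_disabled(4))
```

For uniformly distributed inputs the MSBs differ half of the time, so under this assumption the bulk of the comparator is idle in about half of the cycles.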