* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Lecture 9 – Low Power Design
Resistive opto-isolator wikipedia , lookup
Power factor wikipedia , lookup
Wireless power transfer wikipedia , lookup
Three-phase electric power wikipedia , lookup
Power inverter wikipedia , lookup
Standby power wikipedia , lookup
Audio power wikipedia , lookup
Immunity-aware programming wikipedia , lookup
Electrification wikipedia , lookup
Electric power system wikipedia , lookup
Power over Ethernet wikipedia , lookup
Variable-frequency drive wikipedia , lookup
Opto-isolator wikipedia , lookup
Pulse-width modulation wikipedia , lookup
Distributed generation wikipedia , lookup
Stray voltage wikipedia , lookup
Life-cycle greenhouse-gas emissions of energy sources wikipedia , lookup
Electrical substation wikipedia , lookup
Amtrak's 25 Hz traction power system wikipedia , lookup
Surge protector wikipedia , lookup
History of electric power transmission wikipedia , lookup
Power electronics wikipedia , lookup
Power MOSFET wikipedia , lookup
Power engineering wikipedia , lookup
Buck converter wikipedia , lookup
Power supply wikipedia , lookup
Voltage optimisation wikipedia , lookup
Alternating current wikipedia , lookup
ELEC 516 VLSI System Design and Design Automation Spring 2010 Lecture 9 - Low Power Digital CMOS Design Reading Assignment: Rabaey: Chapter 4 and 7,11 Note: some of the figures in this slide set are adapted from the slide set of “ Digital Integrated Circuits” by Rabaey, Copyright UCB 2002 1 ELEC516/10 Lecture 9 Why worry about power? -- Heat Dissipation Besides low power required for portable applications !!!! Power is a very big concern in today’s advanced technologies. Power produces heat on the chip which has to be carried off through the chip socket expensive packaging solutions 2 ELEC516/10 Lecture 9 Motivation for low power design • • • • • • Packaging costs Power supply rail design Chip and system cooling costs Noise immunity and system reliability Battery life (in portable systems) Environmental concerns – ICT equipment accounted for 10% of total US commercial energy usage in 2010 and may reach 20% by 2020 – Energy Star compliant systems 3 ELEC516/10 Lecture 9 BATTERY (40+ lbs) Nominal Capacity (Watt-hours / lb) Why worry about power — Portability 50 Rechargable Lithium 40 Ni-Metal Hydride 30 20 Nickel-Cadium 10 0 65 70 75 80 85 90 95 Year Multimedia Terminals Laptop Computers Expected Battery Lifetime increase over next 5 years: 30-40% Digital Cellular Telephony 4 ELEC516/10 Lecture 9 Why worry about power? -- Standby Power Year 2002 2005 2008 2011 2014 Power supply Vdd (V) 1.5 1.2 0.9 0.7 0.6 Threshold VT (V) 0.4 0.4 0.35 0.3 0.25 Drain leakage will increase as VT decreases to maintain noise margins and meet frequency demands, leading to excessive battery draining standby power consumption. 8KW 50% …and phones leaky! Standby Power 40% 1.7KW 30% 20% 400W 88W 12W 10% 0% 2000 5 2002 2004 2006 2008 Source: Borkar, De Intel ELEC516/10 Lecture 9 Power and Energy Figures of Merit • Power consumption in Watts – determines battery life in hours • Peak power – determines power ground wiring designs – sets packaging limits – impacts signal noise margin and reliability analysis • Energy efficiency in Joules – rate at which power is consumed over time • Energy = power * delay – Joules = Watts * seconds – lower energy number means less power to perform a computation at the same frequency 6 ELEC516/10 Lecture 9 Power versus Energy Power is height of curve Watts Lower power design could simply be slower Approach 1 Approach 2 Watts time Energy is area under curve Two approaches require the same energy Approach 1 Approach 2 time 7 ELEC516/10 Lecture 9 PDP and EDP • Power-delay product (PDP) = Pav * tp = (CLVDD2)/2 – PDP is the average energy consumed per switching event (Watts * sec = Joule) – lower power design could simply be a slower design Energy-delay product (EDP) = PDP * tp = Pav * tp2 l l EDP is the average energy consumed multiplied by the computation time required takes into account that one can trade increased delay for lower energy/operation (e.g., via supply voltage scaling that increases delay, but decreases energy consumption) Energy-Delay (normalized) 15 energy-delay 10 energy 5 delay 0 0.5 l 8 allows one to understand tradeoffs better 1 1.5 2 2.5 Vdd (V) ELEC516/10 Lecture 9 Where Does Power Go in CMOS? • Dynamic power – Charge-discharge current: • Dominant source of power (CV2 per transition) – Short circuit current (both NMOS and PMOS on during transit) • <10% of c/d current if transitions are fast • Subthreshold leakage (transistors not OFF completely) – Becoming important 10-30% active power in <0.18um techn – Diode leakage from reverse source and drain diodes (neglig) – Gate leakage (no longer negligible due to very thin gate oxide) 9 ELEC516/10 Lecture 9 Dynamic Power Consumption Vdd Vin Vout CL Energy/transition = C Power = Energy/transition * L *V dd 2 f = CL * V dd 2 *f Not a function of transistor sizes! Need to reduce C L , V dd , and f to reduce power. 10 ELEC516/10 Lecture 9 Dynamic Power Consumption - Revisited Power = Energy/transition * transition rate = C L * V dd 2 * f 0 1 = C L * V dd 2 * P 0 1 * f = C EFF * V dd 2 *f Power Dissipation is Data Dependent Function of Switching Activity C EFF = Effective Capacitance = C 11 L * P 0 1 ELEC516/10 Lecture 9 Power Consumption is Data Dependent Example: Static 2 Input NOR Gate Assume: P(A=1) = 1/2 P(B=1) = 1/2 Then: P(Out=1) = 1/4 P(01) = P(Out=0).P(Out=1) = 3/4 X 1/4 = 3/16 CEFF = 3/16 * CL 12 ELEC516/10 Lecture 9 Transition Probabilities for Basic Gates 13 ELEC516/10 Lecture 9 Transition Probability of 2-input NOR Gate 14 ELEC516/10 Lecture 9 Inter-signal Correlations • Determining switching activity is complicated by the fact that signals exhibit correlation in space and time – reconvergent fan-out (1-0.5)(1-0.5)x(1-(1-0.5)(1-0.5)) = 3/16 0.5 A 0.5 B X Z (1- 3/16 x 0.5) x (3/16 x 0.5) = 0.085 Reconvergent P(Z=1) = P(B=1) & P(A=1 | B=1) • 16 Have to use conditional probabilities ELEC516/10 Lecture 9 Logic Restructuring Logic restructuring: changing the topology of a logic network to reduce transitions AND: P01 = P0 x P1 = (1 - PAPB) x PAPB 3/16 0.5 A B 0.5 (1-0.25)*0.25 = 3/16 7/64 W X 15/256 C F 0.5 D 0.5 0.5 A 0.5 B 0.5 C 0.5 D Y 15/256 F Z 3/16 Chain implementation has a lower overall switching activity than the tree implementation for random inputs Ignores glitching effects 17 ELEC516/10 Lecture 9 Input Ordering (1-0.5x0.2)x(0.5x0.2)=0.09 0.5 A B 0.2 X C 0.1 F 0.2 B C 0.1 (1-0.2x0.1)x(0.2x0.1)=0.0196 X A 0.5 F Beneficial to postpone the introduction of signals with a high transition rate (signals with signal probability close to 0.5) 19 ELEC516/10 Lecture 9 How about Dynamic Circuits? VDD f In1 In2 In3 Mp Out PDN f Me Power is Only Dissipated when Out=0! CEFF = P(Out=0).C L 20 ELEC516/10 Lecture 9 2-input NOR Gate Example: Dynamic 2 Input NOR Gate Assume: P(A=1) = 1/2 P(B=1) = 1/2 Then: P(Out=0) = 3/4 CEFF = 3/4 * CL Switching Activity Is Always Higher in Dynamic Circuits 21 ELEC516/10 Lecture 9 Transition Probabilities for Dynamic Gates Switching Activity for Precharged Dynamic Gates P0 1 = P0 22 ELEC516/10 Lecture 9 Glitching in Static CMOS Networks • Gates have a nonzero propagation delay resulting in spurious transitions or glitches (dynamic hazards) – glitch: node exhibits multiple transitions in a single cycle before settling to the correct logic value A B X Z C ABC 101 000 X Z 24 Unit Delay ELEC516/10 Lecture 9 Example : Adder Circuit Cin Add0 Add1 Sum Output Voltage, Volts S0 Add2 Add14 S2 S14 S1 4.0 Add15 S15 4 S15 6 2.0 3 S10 Cin 5 S1 2 0.0 0 5 10 Time, ns 25 ELEC516/10 Lecture 9 How to Cope with Glitching? 0 F1 0 1 F2 0 0 2 F3 0 0 F1 1 F3 0 0 F2 1 Equalize Lengths of Timing Paths Through Design 26 ELEC516/10 Lecture 9 Balanced Delay Paths to Reduce Glitching Glitching is due to a mismatch in the path lengths in the logic network; if all input signals of a gate change simultaneously, no glitching occurs 0 0 0 0 F1 0 0 1 F1 1 F2 2 F3 0 0 F3 F2 1 So equalize the lengths of timing paths through logic 27 ELEC516/10 Lecture 9 Glitch Reduction by Pipelining • Glitches depend on the logic depth of the circuit - gates deeper in the logic network are more prone to glitching – arrival times of the gate inputs are more spread due to delay imbalances – usually affected more by primary input switching • Reduce logic depth by adding pipeline registers – additional energy used by the clock and pipeline registers Memory D$ WriteBack MDR Execute MAR I$ Decode Instruction PC Fetch pipeline stage isolation register clk 28 ELEC516/10 Lecture 9 Short Circuit Power Consumption Vin Isc Vout CL Finite slope of the input signal causes a direct current path between VDD and GND for a short period of time during switching when both the NMOS and PMOS transistors are conducting. 29 ELEC516/10 Lecture 9 Short Circuit Currents Determinates Esc = tsc VDD Ipeak P01 Psc = tsc VDD Ipeak f01 • Duration and slope of the input signal, tsc • Ipeak determined by – the saturation current of the P and N transistors which depend on their sizes, process technology, temperature, etc. – strong function of the ratio between input and output slopes • a function of CL 30 ELEC516/10 Lecture 9 Impact of CL on Psc Isc 0 Vin Isc Imax Vout CL 31 Vin Vout CL Large capacitive load Small capacitive load Output fall time significantly larger than input rise time. Output fall time substantially smaller than the input rise time. ELEC516/10 Lecture 9 Ipeak as a Function of CL 2.5 x 10-4 CL = 20 fF 2 When load capacitance is small, Ipeak is large. 1.5 1 CL = 100 fF 0.5 0 0 -0.5 2 Short circuit dissipation is minimized by CL = 500 fF matching the rise/fall times of the input and 4 6 x 10-10 output signals - slope engineering. time (sec) 500 psec input slope 32 ELEC516/10 Lecture 9 Static Power Consumption Vdd Istat Vout Vin=5V CL Pstat = P(In=1) .Vdd . Istat • Dominates over dynamic consumption • Not a function of switching frequency 33 ELEC516/10 Lecture 9 Leakage (Static) Power Consumption VDD Ileakage Vout Drain junction leakage Gate leakage Sub-threshold current Sub-threshold current is the dominant factor. All increase exponentially with temperature! 34 ELEC516/10 Lecture 9 Power consideration – leakage current I sub K e (Vgs Vth )q / nkT (1 e Vds q / kT ) K: technology constant; q: electronic charge; k: Boltzman constant N: nonlinearity constant (between 1 and 2); T: Temperature 35 ELEC516/10 Lecture 9 Leakage as a Function of VT Continued scaling of supply voltage and the subsequent scaling of threshold voltage will make subthreshold conduction a dominate component of power dissipation. 10-2 ID (A) 10-7 VT=0.4V VT=0.1V An 90mV/decade VT roll-off - so each 255mV increase in VT gives 3 orders of magnitude reduction in leakage (but adversely affects performance) 10-12 0 0.2 0.4 0.6 0.8 1 VGS (V) 36 ELEC516/10 Lecture 9 Sub-Threshold in MOS I D V T =0.2 V T =0.6 V GS Lower Bound on Threshold to Prevent Leakage 37 ELEC516/10 Lecture 9 Low Power Design Space • The dynamic power consumption equation reveals the three degrees of freedom inherent in the low power design space: – Voltage – Physical capacitance – Data activity • Optimization for power entails an attempt to reduce one or more of these factors. Interactions among these factors complicate the optimization problem. • Deep sub-micron design - need to minimize leakage and sub-threshold current 38 ELEC516/10 Lecture 9 Power Reduction Strategy (I) • Voltage Reduction – 5V ->3.3 V -> 2.5V ->1.8V-> 1.0V – Mixed supplies in system and/or on chip, by using the minimum voltage for different chips of functions within a chip, together with on-chip voltage converters if required. • Low-voltage circuit techniques are required to give good performances even with low voltages • Less noisy structures and better signal integrity handling is required • Lower Vth process is required to maintain good transistor speed performances 39 ELEC516/10 Lecture 9 NORMALIZED POWER-DELAY PRODUCT Reducing Vdd 1.5 P x t d = E t = C L * V dd 2 1.00 0.70 0.50 0.30 0.20 0.15 quadratic dependence 0.1 E (Vdd=2) E (Vdd=5) = (C L) * (2) 2 (C L) * (5) 2 51 stage ring oscillator 0.07 E (Vdd=2) 0.16 E (Vdd =5) 0.05 8-bit adder 0.03 1 2 5 Vdd (volts) Strong function of voltage (V 2 dependence). Relatively independent of logic function and style. Power Delay Product Improves with lowering V 40 DD . ELEC516/10 Lecture 9 Lower Vdd Increases Delay 7.50 7.00 multiplier 2.0mm technology clock generator L * V dd Td = 6.50 NORMALIZED DELAY C I 6.00 5.50 5.00 I ~ (Vdd - Vt)2 4.50 4.00 3.50 ring oscillator 3.00 2.50 2.00 1.50 1.00 (2) * (5 - 0.7)2 Td(Vdd=2) microcoded DSP chip Td(Vdd=5) adder adder (SPICE) 2.00 4.00 V dd (volts) = (5) * (2 - 0.7)2 4 6.00 Relatively independent of logic function and style. 41 ELEC516/10 Lecture 9 Lowering the Threshold Delay I 2V t Vdd D Vt = 0 Vt = 0.2 VGS Reduces the Speed Loss, But Increases Leakage Interesting Design Approach: DESIGN FOR PLeakage == PDynamic 42 ELEC516/10 Lecture 9 What Threshold Voltage to Use? • Energy vs. Vt for a fixed throughput • An “optimal” Vt/Vdd point trades switching and leakage energy 43 ELEC516/10 Lecture 9 Stack Effect • Leakage is a function of the circuit topology and the value of the inputs VT = VT0 + (|-2fF + VSB| - |-2fF|) where VT0 is the threshold voltage at VSB = 0; VSB is the sourcebulk (substrate) voltage; is the body-effect coefficient A B Out A VX B 44 A B VX ISUB 0 0 VT ln(1+n) VGS=VBS= -VX 0 1 0 VGS=VBS=0 1 0 VDD-VT VGS=VBS=0 1 1 0 VSG=VSB=0 • Leakage is least when A = B = 0 • Leakage reduction due to stacked transistors is called the stack effect ELEC516/10 Lecture 9 • Reducing the VT increases the sub-threshold leakage current (exponentially) – 90mV reduction in VT increases leakage by an order of magnitude • But, reducing VT decreases gate delay (increases performance) ID (A) Leakage as a Function of Design Time VT VT=0.4V VT=0.1V 0 0.2 0.4 0.6 0.8 1 VGS (V) • Determine the critical path(s) at design time and use low VT devices on the transistors on those paths for speed. Use a high VT on the other logic for leakage control. – A careful assignment of VT’s can reduce the leakage by as much as 80% 45 ELEC516/10 Lecture 9 Dual-Thresholds Inside a Logic Block • Minimum energy consumption is achieved if all logic paths are critical (have the same delay) • Use lower threshold on timing-critical paths – Assignment can be done on a per gate or transistor basis; no clustering of the logic is needed – No level converters are needed 46 ELEC516/10 Lecture 9 Variable VT (ABB) at Run Time For an n-channel device, the substrate is normally tied to ground (VSB = 0) • VT = VT0 + (|-2fF + VSB| - |-2fF|) A negative bias on VSB causes VT to increase Adjusting the substrate bias at run time is called adaptive body-biasing (ABB) Requires a dual well fab process l 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 -2.5 47 -2 -1.5 -1 VSB (V) -0.5 0 ELEC516/10 Lecture 9 Techniques for Burst Mode Computation V dd SLEEP High VT on Low V T Vin on SLEEP High VT Multiple VT Technology - V >0 + p standby Vout standby + V <0 - n Substrate Bias Controlled Variable VT Devices (Disable high VT devices during idle periods) (increase VT during idle periods) e.g. [Sakata-93], [Mutoh-93] 48 e.g. [Seta-95] ELEC516/10 Lecture 9 Multiple-Threshold Circuits • Previous approaches cannot reduce leakage power during active mode. • Use dual Vt CMOS logic, low Vt for critical paths and high Vt for non-critical paths, problem - large threshold swing • Triple Vt CMOS circuit to reduce the sub-thresold swing [Fujii et. al. ISSCC-98] – for high speed low power active mode, low and medium Vt are used for critical and non-critical paths, respectively. – For standby mode, high Vt MOSFET is inserted between the supply rail and the virtual supply rail. 49 ELEC516/10 Lecture 9 Multiple-Threshold Circuits 50 ELEC516/10 Lecture 9 Power considerations – reduce leakage • Leakage proportional to device width – Use smallest devices for critical path. • Leakage drops with stacked devices (drain voltage divider) – Use stacked transistors for critical path. • Leakage drops with increasing channel length – Slightly increase L for critical Path • Use dual VT process providing two threshold VT – Use high VT transistors for critical path 51 ELEC516/10 Lecture 9 Power consideration – reduce leakage • Switch off critical path transistors when not needed. • Stand-by mode between supply and virtual supply lines • Stand-by vectors – Apply input vector which minimizes leakage. – Achieved using Mux 52 ELEC516/10 Lecture 9 Power gating transistor Sizing • The effect of power gating transistor size – As the size decreases, logic performance also decreases. – As the size increases, leakage current and chip area also increase. – Proper sizing is very important. – power gating transistor size should be decided within 2% performance degradation. VDD Low Vt Switch Control High Vt Vop = VDD - V V must be sized within 2% performance degradation . GND 53 ELEC516/10 Lecture 9 Power Reduction Strategy (II) • Reducing capacitance – Process scaling and better integration, with smaller capacitances in more aggressive processes – Improved devices and interconnect technology – Efficient clock generation and distribution – Good memory hierarchy – In-place optimization using a library containing ranges of gates with different strengths, through replacement of gates to use the optimum drive in the critical paths and minimum drive elsewhere 54 ELEC516/10 Lecture 9 Reducing Effective Capacitance Global bus architecture Local bus architecture Shared Resources incur Switching Overhead 55 ELEC516/10 Lecture 9 Power consideration – Reduce Capacitance • Reduce switched capacitance C – Careful transistor sizing, transistor ordering, tighter and more compact layout – Hierarchical architecture and add TG to isolate buses – Segmented structures Shared bus driven by A or B when Sending values to C Insert switch to isolate bus segment when B sending to C 56 ELEC516/10 Lecture 9 Power Reduction Strategy (III) • Reducing activities – lowering operating frequency – Using power management strategies, such as Gated clocks, Power-Down of non-operational units – Reduce switching activities, Power = Energy/transition * transition rate = CL Vdd2 f 01 CL Vdd2 P01 f Ceff Vdd2 f • Power dissipation is data dependent and hence is a function of switching activity. • P0->1 is the switching probability • Effective switching capacitance = Ceff=CL* P0->1 57 ELEC516/10 Lecture 9 Factors Affecting Power Consumption - Revisited • Degree of freedom for low power design space: – Voltage – Physical capacitance – Data activity • Power minimization approaches: – Run at minimum allowable voltage – Minimize effective switching capacitances 58 ELEC516/10 Lecture 9 System Level - Power Down Techniques Operating States: Active or Full-On (fastest clock) Standby (slow clock) Micro Processor Suspend or Sleep (slow clock or shut down) Activity Analyzer 59 ELEC516/10 Lecture 9 Dynamic Power as a Function of VDD 5.5 • Decreasing the VDD decreases dynamic energy consumption (quadratically) • But, increases gate delay (decreases performance) 5 4.5 4 3.5 3 2.5 2 1.5 1 0.8 1 1.2 1.4 1.6 1.8 2 2.2 2.4 VDD (V) • Determine the critical path(s) at design time and use high VDD for the transistors on those paths for speed. Use a lower VDD on the other gates, especially those that drive large capacitances (as this yields the largest energy benefits). 60 ELEC516/10 Lecture 9 Multiple VDD Considerations • How many VDD? – Two is becoming common – Many chips already have two supplies (one for core and one for I/O) • When combining multiple supplies, level converters are required whenever a module at the lower supply drives a gate at the higher supply (step-up) – If a gate supplied with VDDL drives a gate at VDDH, the PMOS never turns off VDDH • The cross-coupled PMOS transistors do the level conversion • The NMOS transistor operate on a Vout VDDL reduced supply Vin – Level converters are not needed for a step-down change in voltage – Overhead of level converters can be mitigated by doing conversions at register boundaries and embedding the level conversion inside the flipflop 61 ELEC516/10 Lecture 9 Dual-Supply Inside a Logic Block • Minimum energy consumption is achieved if all logic paths are critical (have the same delay) • Clustered voltage-scaling – Each path starts with VDDH and switches to VDDL (gray logic gates) when delay slack is available – Level conversion is done in the flipflops at the end of the paths 62 ELEC516/10 Lecture 9 Power Conscious Behavioral Design • Run functional units at the minimum allowed voltage while satisfying the timing constraints • Parallelize or pipeline data-path, memory and controllers to compensate for throughput loss due to reduced supply voltage • Power down functional units which are not is use; Put in “dynamic power management” capability • Avoid centralized resources (controllers, functional blocks, global busses, etc.) as much possible • Map functions to hardware so that inter-chip communication is minimized • Schedule and bind operations to functional units so as to reduce the activity of the input operands • Reorder operands to reduce switching activity; Keep the inputs to an idle unit unchanged 63 ELEC516/10 Lecture 9 Low Power Datapath Design - reducing the supply voltage • Reducing supply voltage has quadratic effect on power saving, but a negative effect on performance. • Performance can be gained back by logical and architectural optimizations, e.g. lookahead adder instead of ripple-carry adder, using parallelism to increase performance 64 ELEC516/10 Lecture 9 Power Considerations – reduce f and VDD • Reducing frequency does not save energy, just reduces rate at which it is consumed – Power is lower but system must run longer • Reducing supply voltage is very effective (reduce voltage by 0.5 improves energy/transition by 0.25). • Dropping the voltage will result in reduced performance in terms of speed (need to recover the performance using parallelism). • Trade surplus performance for lower energy by reducing the supply voltage until performance are as required 65 ELEC516/10 Lecture 9 Power considerations – Dynamic voltage Scaling Can run at lower voltage and hence improves quadraticaly power comsp. – 8-bit adder/comparator: (Chandrakasan er. Al.) • consumes Pref at 40Mhz at 5V with Area= 0.53mm2 – Two parallel interleaved architecture: • Consumes 0.36Pref at 20MHz, 2.9V with Area=1.80mm2 – One Pipelined architecture: • Consumes 0.39Pref at 40MHz, 2.9V with Area=0.69mm2 – Pipelined and parallel • Consumes 0.2Pref at 20MHz, 2.0V with Area=1.96mm2 66 ELEC516/10 Lecture 9 Minimizing the power consumption using parallelism • Reference Design Critical path = 25ns, clock frequency = 40MHz Supply voltage = 5V 2 f Pref Cref Vref ref 67 ELEC516/10 Lecture 9 • Parallel implementation of the reference design New critical path = 50nsec, Cpar->2.15 Cref, Vpar ->2.9V, fpar ->0.5 fref f ref 2 2 Ppar C parV par f par (2.15Cref )(0.58Vref ) ( ) 0.36 Pref 68 2 ELEC516/10 Lecture 9 Pipelined Data Path • Critical path delay is less => max[Tadder, Tcomparator]. • Keeping clock rate constant: fpipe=fref, Voltage can be dropped to Vpipe = Vref/1.7, while maintaining the original througput • Capacitance slightly higher: Cpipe=1.15Cref • Ppipe=(1.15Cref)(Vref/1.7)2fref » 0.39Pref 69 ELEC516/10 Lecture 9 How low a voltage can be used • Capacitance overhead starts to dominate at “high” levels of parallelism and results in an optimum voltage. 70 ELEC516/10 Lecture 9 Voltage as a design variable Adapting Voltage to Workload yields cubic reduction 71 ELEC516/10 Lecture 9 Multiple supply voltages – Filter Example 1 2 3 4 5 6 7 8 9 10 * * * * * * 2.4V + * * + + + * + + 5V * * + + + * + + + 3V + + + + Power (5V)/Power(5V,3V,2.4V) = 1.5 [Raje95] 72 ELEC516/10 Lecture 9 Using Multiple Voltages Vdd3 Vdd2 Vdd1 C2 C1 C3 Vdd1Vdd2 Vdd3 Vdd2 Vdd1 critical path non-critical [Horowitz-95] Vdd1Vdd2 73 ELEC516/10 Lecture 9 Processor Usage Model Desired Throughput Compute-intensive and Low-latency processes Ceiling: set by top Speed of the processor Single-user system not always computing • System Optimizations: time Background and highlatency processes – Maximize Peak Throughput – Minimize Average Energy/operation – Maximize computation per battery life 74 ELEC516/10 Lecture 9 Scale Supply Voltage with fclk 75 ELEC516/10 Lecture 9 Dynamic Voltage Scaling Implementation Frequency detector Mfref + Loop Filter - battery DC-DC VCO fout Vdd mP • VCO: ring oscillator which matches mP critical path • DC-DC: perform D/A, converts battery to regulated Vdd • Provides both voltage regulation and clock generation. 76 ELEC516/10 Lecture 9 Dynamic Voltage Scaling in Practice Fixed Throughput, Energy/operation Throughput = 8MIPS Energy/ops = 0.24nJ/inst. 1.92mW Throughput = 100MIPS Energy/ops = 2.2nJ/inst. 220mW Occasionally Demand Peak Throughput Peak Throughput = 100MIPS Average Energy/op. = 0.24nJ/inst (~1.8mW) DVS only advantageous when a majority of computation is performed at low throughput 77 ELEC516/10 Lecture 9 Low-voltage Switching Regulator Page 17 [Stratakos-94] • Arbitrary Vdd (<Vin) generated using the Buck converter – Vdd = Vin Duty Cycle at Node X • Chief sources of inefficiencies: – Conduction loss (I2R) – Switching loss (CxVin2fs and LsI2fs) – Gate-drive loss (CgVin2fs) 78 ELEC516/10 Lecture 9 Adaptive Power Supply Voltage Self-timed Processor REG Power Supply Vdd(t) FIFO FIFO REG Control • Exploit Data Dependent Computation Times To vary the Supply [Nielsen94] 79 ELEC516/10 Lecture 9 Variable Supply Voltage Control Scheme 80 ELEC516/10 Lecture 9 Voltage Scheduling for variable workload system • Voltage scheduling under timing constraints • Example [Ishihara-98] – Energy consumption of a processor: • 10nJ/cycle at 2.5V • 25nJ/cycle at 4 V • 40nJ/cycle at 5V – maximum clock frequencies: • 50MHz at 5V, 40MHz at 4V, 25MHz at 2.5V – Given that an application needs 1000M cycles to finish and the timing constraint is 25sec. 81 ELEC516/10 Lecture 9 Energy consumption ( Vdd2) Different Voltage Schedules 40J 1000Mcycles 50MHz 5.02 0 5.02 10 15 32.5J 750Mcycles 50MHz (A) 20 25 250Mcycles 25MHz Time(sec) (B) 2.52 0 5 5.02 4.02 10 15 20 25 25J 1000Mcycles 40MHz 0 82 5 Timing constraint 5 10 15 Time(sec) (C) 20 25 Time(sec) ELEC516/10 Lecture 9 Reduce Power Further by Buffering Two samples processed every two sample periods -> increased latency 83 ELEC516/10 Lecture 9 Example of Buffering Tsample Tsample Block 1 Block 2 Block 3 Block 4 Vdd =5 Block 1, 2 Vdd =3.75 Vdd =2.5 Block 1, 2 Vdd =3.75 Vdd =5 Block 3,4 Vdd =3.75 Vdd =2.5 Block 3,4 Vdd =3.75 1 C (5) C (2.5) 2 2 Energysample 2 14C 2 84 2(0.75)C (3.75) 2 Energysample 2 10.5C ELEC516/10 Lecture 9 Voltage Island Concept Vdd1 Vdd2 Vddo SWITCH SWITCH Logic Low VT Logic Trade off power for delay by running functional blocks at different voltages Can use mix of Low and High Vt to balance performance and leakage Switch off inactive blocks to reduce leakage power Requires IP standards for power management, clock gating, etc. Delay vs. Voltage 30 Std. Vt 25 IP2 Power Management Unit E.g.: Telecom ASIC with 1.0/1.2 V islands saved : 16 % active power 50 % standby power 85 *Slide from Prof Kyung of KAIST 20 Ddelay (ps) IP1 Low Vt 15 10 5 0 0.7 0.8 0 .9 1.0 1.1 Voltage (Vdd) 1.2 1.3 Source from Bergamaschi ELEC516/10 Lecture 9 Power Management I/O’s, VReg, Gnd ROM Vdd1 DSP Vdd 2 Vdd 1 RLM 1 RLM 2 Microcontroller Vdd 2 ROM Vdd 1 Analog Vdd 5 Memory Arrays Vdd 3 Low Vt device arrays Optimized for low active power RLM 3 Monitor Logic Vdd 4 I/O’s, VReg, Gnd I/O’s, VReg, Gnd I/O’s, VReg, Gnd Memory Arrays Vdd 3 Low Vt device arrays Optimized for low active power Memory Arrays Vdd 4 High Vt device arrays Optimized for low active power Independently controlled domain power switches Multiple On-Chip Voltage Islands On-Chip Voltage Regulators 86 *Slide from Prof Kyung of KAIST ELEC516/10 Lecture 9 Controlling VDD and VTH for low power Active Stand-by Multiple VTH Dual-VTH MTCMOS Variable VTH VTH hopping VTCMOS Multiple VDD Dual-VDD Boosted gate MOS Variable VDD VDD hopping Software-hardware cooperation Technology-circuit cooperation MTCMOS : Multi-Threshold CMOS VTCMOS : Variable Threshold CMOS Multiple : spatial assignment Variable : temporal assignment 87 *Slide from Prof Kyung of KAIST ELEC516/10 Lecture 9 RTL-level optimization-Reducing effective switching activity • General Principle: Avoid Waste. – Application-specific processing – Resource sharing/Locality of reference – Data representations – Preservation of Data correlations – architectural restructuring – Distributed processing – Demand-driven/data-driven computation 88 ELEC516/10 Lecture 9 Application Specific Processing Application Specific Processing Reduces “Implementation Overhead” 89 ELEC516/10 Lecture 9 Eliminating Redundant Computation Power Down …. fs fs fs fs fs Out 90 • Dynamically vary the number of operations per sample. • Trade power consumption and filter quality [Ludwig96] ELEC516/10 Lecture 9 Eliminating Redundant Computation Peak Number of Operations Switched Capacitance Reduction ~= Average Number of Operations ~=2 Strong Function of Signal Statistics 91 ELEC516/10 Lecture 9 Reducing switching activity • Multiplexing multiple operations on a single hardware unit can have detrimental effect on the power consumption, because the switching activity may be increased, e.g. shared bus 92 ELEC516/10 Lecture 9 Bus Multiplexing • Buses are a significant source of power dissipation due to high switching activities and large capacitive loading – 15% of total power in Alpha 21064 – 30% of total power in Intel 80386 • Share long data buses with time multiplexing (S1 uses even cycles, S2 odd) S1 S2 D1 S1 D1 D2 S2 D2 • But what if data samples are correlated (e.g., sign bits)? 93 ELEC516/10 Lecture 9 Reducing the Effective Capacitance • Circuit and Logic Style - select a circuit style with low capacitance and/or switching activity, e.g. 8-bit adder 94 ELEC516/10 Lecture 9 Reducing Glitching Activity • Some circuit structures can be the cause of spurious transients, e.g. a 16-bit ripple carry adder • Glitches can be reduced by selecting structures that have balanced signal paths, e.g. tree. • The Brent-Kung lookahead adder and Wallace tree multiplier both have this properties, thus more power attractive. 95 ELEC516/10 Lecture 9 Data Representation Two’s complement Sign Magnitude • Sign-extension activity significantly reduced using sign magnitude representation. • An accumulator example: sign magnitude datapath switches 30% less capacitance for uniformly distributed inputs 96 ELEC516/10 Lecture 9 Data Representation – Accumulator Example Sign magnitude datapath switches 30% less capacitance for uniformly distributed inputs 97 ELEC516/10 Lecture 9 Two’s Complement vs. Sign-Magnitude Two’s complement datapath has a significantly higher switching activity 98 ELEC516/10 Lecture 9 Bus encoding to reduce switching activity • Minimizing temporal bit transition activity by data representation • Bit encoding: – Active high encoding • high-level voltage for 1, low-level voltage for 0 – Transition-based encoding • voltage change identifies logic 1, no voltage range identifies logic 0 • Word encoding – assign patterns of 1’s and 0’s to each word of information – Non-redundant codes vs. redundant codes • Example of low power coding – Limited-weight code – Gray code – One-hot code – Bus-invert code ELEC516/10 Lecture 9 99 Example of Bit Encoding • Reducing average no. of switching on a data bus • transition-based encoding may limit the no. of transitions for non equiprobable input lines • Let p(0) > p(1) and no temporal correlation exists. If active-high encoding is used, the average no. of transition is p(0) p(1) p(1) p(0) 2(1 p(1)) p(1) • For transition-based encoding, it is simply p(1). If p(1) << p(0), transition-based is better than active-high encoding. • To reduce transition, the input patterns are transformed in such a way that the p(0) and p(1) prob. of each input line becomes as different as possible, and then to apply a transition-based encoding before the data is transmitted. 100 ELEC516/10 Lecture 9 Word encoding • One-hot coding •Gray coding - good for sequential data, e.g. addressing for microprocessor [Su-94] 101 •Disadvantage of Gray code - only good for address bus, not for data bus, additional conversion circuitry is needed ELEC516/10 Lecture 9 Word encoding (cont.) • Bus-invert encoding - use redundancy to save power. • Given a n bit data bus with 2n patterns to be represented and assumes all the patterns are equiprobable and no temporal correlation. p(0) = p(1) = 0.5, the average no. of transition is n/2 per cycle, while the worst case transition is n. • Bus-invert coding - add an extra line to the bus, I, and then comparing the consecutive patterns before transmission. Tw cases – - If the Hamming distance between the two patterns =< n/2, the current pattern is transmitted as it is and I is set to 0. – - if the Hamming distance between the two patterns > n/2, the current pattern is first inverted and then transmitted, and I is se to 1. • Max. transition is limited to n/2 and average transition is reduced by 25% • Drawback - extra line I to indicate whether a pattern has been inverted. 102 ELEC516/10 Lecture 9 Conditional Inversion Coding 103 ELEC516/10 Lecture 9 Bus encoding for cross coupling cap. • Cross-coupling cap dominates for very deep submicron technology, e.g. < 70nm • Bus model bit 0 – Stand alone cap. Cs – Cross coupling cap. Cc Cc bit 1 • Stand alone switching – Apply to single bit line – 0-1 transition Cs Cc Cs bit 2 Cs • Cross coupling switching – Occurs on adjacent wires – Four types of coupling transition H --> L H --> L Type 1 104 0 L --> H H --> L H --> L L --> L L --> H Type 2 Type 3 Type 4 2 0 1 H --> L ELEC516/10 Lecture 9 Permutation Based Address code • Rearrange the physical order of the bit lines Sender Receiver • Work efficiently on address bus (40%), but not on instruction bus (4%) – Correlation is lower – Permutation is fixed 105 ELEC516/10 Lecture 9 Dynamic Reconfigurable Bus Encoding Scheme for Instruction Bus 106 ELEC516/10 Lecture 9 Overview of the Scheme Decoding information stored as header Encoding Encoding during compilation Bit lines reordering Computer Decoding Target Processor Processing Element MEM . . . Bn 107 CACHE Mem_bus B1 . . Bm Look-Up Table Cache_bus Decoder (Cross_bar) Decoding information are loaded to LUT first Instruction is called Address_bus B1 Target Mem Processor CPU Core Decoding information stored in LUT will control the Cross_bar Instructions enter Cross_bar to decode ELEC516/10 Lecture 9 Demand/Data-driven operation • Clocking strategy – Gated clocking • System Power Down • Computing Paradigm 108 ELEC516/10 Lecture 9 Logic Level Power Down Technique • Activity driven - precomputation-based sequential logic optimization – Selectively precompute the output logic values of the circuit one clock cycle before they are required, and use the precomputed values to reduce internal switching activity in the succeeding cycle f R1 A R1 A R2 R2 LE Original Circuit 109 g1 FF g2 FF It is required that g1 = 1 -> f = 1 g2 = 1 -> f = 0 Modified Circuit ELEC516/10 Lecture 9 An example: n-bit comparator • This circuit compares two n-bit numbers A & and computes the function A > B • In general, precomputation works best when there are a small number of complex functions corresponding to the logic block A 110 ELEC516/10 Lecture 9