Low-Power Design Techniques in Digital Systems
Prof. Vojin G. Oklobdzija
University of California
November 19, 2003

Outline of the Talk
• Power trends in VLSI
• Scaling theory and predictions
• Research efforts in power reduction
• Efficiency measures and design guidelines
• Latches and Flip-Flops for Low-Power
  – Dual-Edge FFs
  – SOI
• Conclusion: Low-Power perspective

Power trends in VLSI

“CMOS circuits dissipate little power by nature. So believed circuit designers.” (Kuroda-Sakurai, 95)
[Figure: Power (W), logarithmic scale from 0.01 to 100, versus year, 1980 to 1995; trend: x4 / 3 years]
“By the year 2000 power dissipation of high-end ICs will exceed the practical limits of ceramic packages, even if the supply voltage can be feasibly reduced.”
(* Taken from Sakurai’s ISSCC 2001 presentation)

Gloom and Doom predictions
[Figures on two slides. Source: Shekhar Borkar, Intel]

Power versus Year (taken from ISSCC, uP Report, Hot-Chips)
[Figure: Power (Watts), 0 to 80, versus year, 1995.5 to 2000.5; series: RISC, x86, Consumer, Dec Alpha, each with an exponential fit]
• High-end growing at 25% / year: y = 2E-222 e^(0.2574x)
• x86 at 15% / year: y = 2E-124 e^(0.1442x)
• Consumer (low-end) at 13% / year: y = 6E-109 e^(0.126x)
• RISC at 12% / year: y = 3E-97 e^(0.1131x)

VDD, Power and Current Trend
[Figure: Voltage [V], Power per chip [W] and VDD current [A] versus year, 1998 to 2014]
Source: International Technology Roadmap for Semiconductors, 1999 update, sponsored by the Semiconductor Industry Association in cooperation with European Electronic Component Association (EECA), Electronic Industries Association of Japan (EIAJ), Korea Semiconductor Industry Association (KSIA), and Taiwan Semiconductor Industry Association (TSIA)
(* Taken from Sakurai’s ISSCC 2001 presentation)

Power Delivery Problem (not just California)
Your car starter!
Source: Shekhar Borkar, Intel

Trend in L di/dt
• di/dt is roughly proportional to I * f, where I is the chip’s current and f is the clock frequency, or equivalently I * Vdd * f / Vdd = P * f / Vdd, where P is the chip’s power.
• The trend is: P rises, f rises, Vdd falls; on-chip L [trend arrow missing]; package L slightly decreases.
• Therefore, L di/dt fluctuation increases significantly.
(* Taken from Norman Chang, HP)

ISPEC^2/Watt vs Feature Size (microns)
[Figure: ISPEC^2/Watt, logarithmic scale from 1 to 100, versus feature size, 0.00 to 0.60 microns; fit: y = 0.3733 x^(-2.5778)]
• Energy-Delay product is improving more than 2x / generation
• Saving Grace!

ISPEC^2/Watt vs Year
[Figure: ISPEC^2/Watt, 0 to 100, versus year, 1995 to 2001; series: Consumer, x86, Server]
• x86 efficiency improving dramatically: 4X / generation
• High-end processors: efficiency not improving
• Average improving 3X / generation

Scaling theory and predictions

The power dissipation has increased 1000 times over the 15 years and is exceeding 70 Watts.

Scaling principles:
1. A “constant field scaling” theory [Dennard] assumes that device voltages as well as device dimensions are scaled by a scaling factor x (>1), resulting in a constant electric field in a device:
   • power density remains constant
   • circuit performance can be improved in terms of:
     – density: x^2
     – speed: x
     – power: 1/x^2
     – power-delay product: 1/x^3
   Limitless progress in CMOS is promised with this scaling scenario.
2. In practice neither the supply voltage nor the threshold voltage had been scaled until 1990, leading to the theory of “constant voltage scaling”, which assumes a constant voltage. This assumption yields:
   • speed improvement by x^2
   • power density increases rapidly, by x^3

Constant field scaling is not realistic; x^0.5 is satisfactory. However, even with that the power dissipation would exceed ECL by 2001: a new philosophy is required!
(* Taken from Sakurai and Kuroda, IEICE 95 paper)

High-Performance View Point on Power
*taken from Ron Preston, DEC Alpha
P = k C V^2 f:
• Shrinking to the new technology (30% reduction in λ)
  – C decreases by 30%
  – f increases by a factor of 1/0.7, i.e. by 43%
  – Pnew = 0.7 x (1/0.7) x Pold = Pold (No change in power!)
• New design:
  – Double the number of devices
  – Pnew = 2 x 0.7 x (1/0.7) x Pold = 2 x Pold (Power doubles!)
• Scale Vdd by 30% in the new design:
  – Pnew = 2 x 0.7 x (1/0.7) x (0.7)^2 x Pold = 0.98 x Pold ≈ Pold (Power stays constant!)

High-Performance View Point on Power: Reality
*taken from Ron Preston, DEC Alpha

Chip     λ       Vdd     Freq.    Power
21164    0.5u    3.3V    300MHz   50W
21264    0.35u   2.0V    600MHz   72W
Change   -30%    -39%    +100%    +44%

Paradigm changes: more aggressive circuits, toggle rate increasing, out-of-order and speculative execution.
What to expect:
• Power will be limited by the package and cooling techniques
• Frequency will be determined by the power: as high as the package can take!

Research Efforts in Low-Power Design
Psw = k CL Vcc^2 fCLK
Reduce the active load:
• Minimize the circuits
• Use more efficient design
• Charge recycling
• More efficient layout
Technology scaling:
• The highest win
• Thresholds should scale
• Leakage starts to bite
• Dynamic voltage scaling
Reduce switching activity:
• Conditional clock
• Conditional precharge
• Switching off inactive blocks
• Conditional execution
Run it slower:
• Use parallelism
• Fewer pipeline stages
• Use double-edge flip-flop

Reducing the Power Dissipation
• The power dissipation can be minimized by reducing:
  – supply voltage
  – load capacitance
  – switching activity
• Reducing the supply voltage brings a quadratic improvement.
• Reducing the load capacitance contributes to the improvement of both power dissipation and circuit speed.

Voltage Scaling
There are three means to maintain the throughput:
• Reduce Vth to improve circuit speed.
• Introduce parallel and pipelined architecture while using slower device speeds (this assumes a limitless number of transistors; in reality the transistor density is only increasing by 60% per year).
• Prepare multiple supply voltages and, for each cluster of circuits, choose the lowest supply voltage that satisfies the speed requirement. (This needs a good level converter that exhibits small delay, consumes little power and occupies small area.)

Is there an optimal design point?

Power Dissipation and Circuit Delay
Power: $P = p_t \cdot f_{CLK} \cdot C_L \cdot V_{DD}^2 + I_0 \cdot 10^{-V_{th}/S} \cdot V_{DD}$
Delay: $\text{Delay} = \dfrac{k \cdot Q}{I} = \dfrac{k \cdot C_L \cdot V_{DD}}{(V_{DD} - V_{th})^{\alpha}}$, with $\alpha = 1.3$
[Figure: Power (W, x 10^-4) and Delay (s, x 10^-10) plotted against VDD and Vth, with operating points A and B marked]
(* Taken from T. Sakurai)

Power-Delay Product, Energy-Delay Product
Lowest voltage, highest threshold: no optimum
(*from Sakurai, Kuroda, IEICE 95 paper)
• Power-Delay Product is a misleading measure; it will always favor a processor that operates at lower frequency.
• Energy-Delay is more adequate, but Energy-Delay^2 should be used.

Power-Delay Product, Energy-Delay Product
Horowitz, Indermaur, Gonzalez argue against Power-Delay, SLPE’94

Energy-Delay**2
(*courtesy of Prof. T. Sakurai)

Energy-Delay Product vs. Energy-Delay**2
Nowka, Hofstee, Carpenter of IBM argue against Energy-Delay as a design efficiency measure (private communication)

Energy-Delay Product vs. Energy-Delay**2
• Optimal point: (due to Vth being fixed?)
• The same design should have relatively the same efficiency

Example: PowerPC

Feature             601+          604            620           Diff.
Frequency MHz       100           100            133 (100)     same
CMOS Process        .5u 5-metal   .5u 4-metal    .5u 4-metal   ~same
Cache Total         32KB Cache    16K+16K Cache  64K           ~same
Load/Store Unit     No            Yes            Yes
Dual Integer Unit   No            Yes            Yes
Register Renaming   No            Yes            Yes
Peak Issue          2 + Br        4 Insts        4 Insts       ~double
Transistors         2.8 Million   3.6 Million    6.9 Million   +30% / +146%
SPECint92           105           160            225 (169)     +50% / +61%
SPECfp92            125           165            300 (225)     +30% / +80%
Power               4W            13W            30W (22.5W)   +225% / +463%
Spec/Watt           26.5/31.2     12.3/12.7      7.5/10        -115% / -252%
PF=Watt/Freq**3     4.0E-6        13.0E-6        12.8E-6
(PF/Trans)*E12      1.43          3.61           1.86
IPC                 1.05          1.6            1.69
PE=Watt/Spec**3     3.46E-6       3.17E-6        2.63E-6
PE*IPC**3 (*E6)     4.01          12.98          12.69

Feature                  Digital 21164   MIPS 10000   PowerPC 620   HP 8000      Sun Ultra-Sparc
Freq                     500 MHz         200 MHz      200 MHz       180 MHz      250 MHz
Pipeline Stages          7               5-7          5             7-9          6-9
Issue Rate               4               4            4             4            4
Out-of-Order Exec.       6 lds           32           16            56           none
Register Renam. (int/FP) none/8          32/32        8/8           56           none
Transistors/
  Logic transistors      9.3M/1.8M       5.9M/2.3M    6.9M/2.2M     3.9M*/3.9M   3.8M/2.0M
SPEC95 (Intg/FlPt)       12.6/18.3       8.9/17.2     9/9           10.8/18.3    8.5/15
Power                    25W             30W          30W           40W          20W
SpecInt/Watt             0.5             0.3          0.3           0.27         0.43
1/Energy*Delay           6.4             2.6          2.7           2.9          3.6
Watt/Freq**3             0.2E-6          3.75E-6      3.75E-6       6.86E-6      1.28E-6
(PF/Trans)*E12           0.022           0.64         0.54          1.76         0.34
(PF/LTrans)*E12          0.11            1.63         1.7           1.76         0.64
Watt/Spec**3             12.5E-3         42.5E-3      41.5E-3       31.7E-3      32.5E-3

Sensitivity to Vth fluctuation
[Figure: Normalized delay, 0.6 to 1.8, versus VTH (V) for VDD = 1.0 V, 1.5 V, 3.0 V and 5.0 V, with ΔVTH = ±0.15 V and ±0.05 V]
(* Taken from T. Sakurai)

Use of Different Circuit Families

Capacitance Reduction
The load capacitance is the sum of:
• gate capacitance
• diffusion capacitance
• routing capacitance
Using a small number of transistors, or small transistors, reduces the gate capacitance and the diffusion capacitance. Pass-transistor logic may have an advantage because it comprises fewer transistors and exhibits smaller stray capacitance than conventional static CMOS logic.
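The slides above lean on the switching-power relation P = k C V^2 f: power is linear in capacitance and frequency and quadratic in supply voltage. The short Python sketch below re-runs the shrink arithmetic from the "High-Performance View Point on Power" slide under that relation. The function name `relative_power` and its parameter names are illustrative assumptions, not something taken from the talk, and k is assumed unchanged between the old and new designs.

```python
def relative_power(c_scale=1.0, v_scale=1.0, f_scale=1.0, device_scale=1.0):
    """Return P_new / P_old for P = k * C * V^2 * f.

    All arguments are ratios (new / old). `device_scale` multiplies the
    switched capacitance to model a change in the number of devices.
    The constant k is assumed to be the same before and after.
    """
    return device_scale * c_scale * v_scale ** 2 * f_scale


SHRINK = 0.7  # 30% reduction in feature size, as on the slide

ratios = {
    # C drops to 0.7x, f rises to (1/0.7)x
    "shrink only": relative_power(c_scale=SHRINK, f_scale=1 / SHRINK),
    # same shrink, but twice as many devices
    "shrink, 2x devices": relative_power(
        c_scale=SHRINK, f_scale=1 / SHRINK, device_scale=2
    ),
    # same as above, with Vdd also scaled to 0.7x
    "shrink, 2x devices, Vdd x0.7": relative_power(
        c_scale=SHRINK, v_scale=SHRINK, f_scale=1 / SHRINK, device_scale=2
    ),
}

for name, ratio in ratios.items():
    print(f"{name}: P_new / P_old = {ratio:.2f}")
```

The third case comes out at 0.98 rather than exactly 1, which is why the slide's "power stays constant" is best read as an approximation. The same function also shows the two levers named on the "Reducing the Power Dissipation" slide: halving `c_scale` halves the power, while halving `v_scale` cuts it to a quarter.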
Pass-Transistor Logic

Pass-Transistor Logic: CVSL, CPL, SRPL, DSL, DPL, DCVSPG

SAPL: Sense-Amplifying Pass-transistor Logic
• All nodes are first discharged and then evaluated by inputs.
• Outputs are 100mV above GND.

Where does the power go?

Power use is different from chip to chip:
(*from Sakurai, Kuroda, IEICE 95 paper)
• MPU1 is a low-end microprocessor
• MPU2 is a high-end CPU with large cache
• ASSP1 is an MPEG-2 decoder
• ASSP2 is an ATM switch

Design Example: Strong Arm 110
*from D. Dobberpuhl
Two power modes: idle and sleep
Power:
• 0.5W using 1.1V internal PS: 184 Dhrystone MIPS @ 162MHz
• 1.1W using 2V internal PS: 245 Dhrystone MIPS @ 215MHz
Power breakdown:
• I-Cache: 27%
• D-Cache: 16%
• I-Unit: 18%
• Exec-Unit: 8%
• I-MMU: 9%
• D-MMU: 8%
• Clock: 10%
• Others: 4% (PLL < 1%)

Design Example: Strong Arm 110
[Figures on two slides, *from D. Dobberpuhl]
However, leakage current starts to affect stand-by power.

Controlling both: VDD and VTH for low power

Controlling VDD and VTH for low power
Low power → Low VDD → Low speed → Low VTH → High leakage → VDD-VTH control

            Multiple VTH   Variable VTH   Multiple VDD       Variable VDD
Active      Dual-VTH       VTH hopping    Dual-VDD           VDD hopping
Stand-by    MTCMOS         VTCMOS         Boosted gate MOS

Software-hardware cooperation; technology-circuit cooperation
*) MTCMOS: Multi-Threshold CMOS
*) VTCMOS: Variable Threshold CMOS
• Multiple: spatial assignment
• Variable: temporal assignment
(* from Prof. T. Sakurai)

Dual-VTH concept
• Low-VTH circuit (high leakage): critical paths
• High-VTH circuit (low leakage): non-critical paths
(* from Prof. T. Sakurai)

Clustered Voltage Scaling for Multiple VDD’s
[Figure: conventional design versus CVS structure, each a network of FFs with the critical path marked; the CVS structure uses level-shifting F/Fs, and the lower-VDD portion is shown as shaded]
Once VL is applied to a logic gate, VL is applied to subsequent logic gates until the F/Fs, to eliminate DC current paths. The F/Fs restore VH.
M. Takahashi et al., “A 60mW MPEG4 Video Codec Using Clustered Voltage Scaling with Variable Supply-Voltage Scheme,” ISSCC, pp. 36-37, Feb. 1998.
(* from Prof. T. Sakurai)

If you don’t need to hustle, VDD should be as low as possible
• VDD should be lowered to the minimum level which ensures the real-time operation.
• Energy consumption is proportional to the square of VDD.
[Figure: Normalized power, 0.0 to 1.0, versus normalized workload, 0.0 to 1.0; curves: Variable Vdd, Fixed Vdd]
(* from Prof. T. Sakurai)

Measured voltage waveforms
[Figure: VDD waveform switching between VDDmax and VDDmin, and the Sleep signal, over 1 sync frame (200ms)]
• VDDmax = 8% on average
• Sleep = 6% on average
(* from Prof. T. Sakurai)

Measured power characteristics
[Figure: Power P [W], 0 to 1, versus supply voltage VDD [V], 0 to 2; operating points at ƒ=200MHz and ƒ=100MHz]
• Time for VDDmax: 8%, at 0.8W
• Time for VDDmin: 86%, at 0.16W (down to 1/5)
• Time for sleep: 6%, at 0.07W
Total power = 0.8W x 0.08 + 0.16W x 0.86 + 0.07W x 0.06 ≈ 0.2W
VDD hopping can cut down power consumption to 1/4.
(* from Prof. T. Sakurai)

Simulation results
[Figure, two panels: MPEG-2 video decoding and VSELP speech encoding. Normalized power P/PFIX versus transition delay TTD (ms), 0.0 to 1.0. Curves: RPC: 2 levels (f, f/2); RPC: 3 levels (f, f/2, f/3); RPC: 4 levels (f, f/2, f/3, f/4); RPC: infinite levels; post-simulation analysis]
(* from Prof. T. Sakurai)

Aggressive Voltage Scaling
*Taken from Kuroda
If we can dynamically scale Vdd and Vth the advantage is obvious.

Example

TransMeta Example
[Figures on four slides]
*Taken from Doug Laird’s presentation, January 19th 2000
• “Code Morphing” is another contributor to power reduction since it eliminates unnecessary external memory access.

Latches and Flip-Flops for Low-Power

Simulation Condition and Testbench
[Figure: test bench schematic with Data In and Clock inputs, the flip-flop under test (D, Clk, Q and complementary Q terminals) and 14X minimum-size inverters]
Timing
• Total FF overhead is setup + clock-to-output time
• Circuit optimization towards tD-Q
• Clock skew robustness obtained from observing the DQ curve
• $t_d = \min_{t_{DCLK}} (t_{DQ})$
Power-Delay Product
• Overall performance parameter at fixed frequency
• $PDP = t_d \cdot P_{diss}$; at fixed f, PDP ranks designs the same way as EDP, since the energy per cycle is Pdiss / f

Flip-Flop Performance Comparison
[Figure: test bench with Data and Clock inputs driving the flip-flop (D, Clk, Q and complementary Q); loads of 200fF, 200fF and 50fF]
• Delay is (minimum D-Q): Clk-Q + Setup time
• Total power consumed
  – internal power
  – data power
  – clock power
• Measured for four cases
  – no activity (0000… and 1111…)
  – maximum activity (0101010..)
  – average activity (random sequence)

OLD TEST BENCH:
• Total Power = Drivers Power + Test Unit Power
• PDP-Optimized = Equal Trade-off on Power and Delay
• Improper Load on Drivers
NEW TEST BENCH:
• Drivers: Fixed Gain and Driving Test Unit Only
• Data-to-Output Delay
• PD2P Optimized = Best for Constant-Field Scaling

Comparison in terms of speed and EDPtot
Technology: 0.2u, Vdd=2V, T=20°C, measured @ 100MHz
• Delay:
  – below 200ps
    • SDFF: 187ps
    • HLFF: 199ps
    • K-6 ETL: 200ps
  – 200-300ps
    • PowerPC latch: 266ps
    • 21264 Alpha FF: 272ps
    • Strong Arm FF: 275ps
    • mC2MOS latch: 292ps
  – above 500ps
    • SSTC latch: 592ps
    • DSTC latch: 629ps
    • SSTC* latch: 898ps
    • DSTC* latch: 1060ps
• PDPtot @100MHz:
  – below 30fJ
    • PowerPC latch: 28fJ
  – 30-50fJ
    • HLFF: 29fJ
    • SDFF: 39fJ
    • mC2MOS latch: 40fJ
    • 21264 Alpha FF: 43fJ
    • Strong Arm FF: 45fJ
  – 50-70fJ
    • K-6 ETL: 70fJ
  – above 70fJ
    • SSTC latch: 95fJ
    • DSTC latch: 125fJ

Delay comparison
[Figure: Delay [ps], 0 to 350, for SDFF, HLFF, K6, PowerPC, Alpha 21264 FF, Strong Arm FF, mC2MOS]
• F-F design brings the fastest structures

Delay comparison
[Figure, two panels: Delay [ps], 0 to 700, for K6, SA-F/F, StrongArm, SSTC, DSTC; Delay [ps], 0 to 350, for SDFF, HLFF, PowerPC, mC2MOS]
• F-F design brings the fastest structures

Overall ranking
[Figure: PDPtot [fJ], 0 to 350, @100MHz, for HLFF, SDFF, PowerPC, mC2MOS, Strong Arm, Alpha 21264, K6, SSTC, DSTC, SSTC*, DSTC*; activity = 0.25, equal transition probability]
• EDPtot accepted as the overall cost function
• The proposed “low-power” latches from Yuan & Svensson do not show an advantage over the other presented structures (the optimization was not properly done and is yet to be repeated under a different setup)

Overall ranking, zoomed
[Figure: PDPtot [fJ], 0 to 80, for HLFF, SDFF, PowerPC, mC2MOS, Strong Arm, Alpha 21264, K6; activity = 0.25, equal transition probability]
• Real signals have activity between 0 and 1.0
• Precharged hybrid structures are the fastest but their power consumption strongly depends on the probability of “ones”
• More “ones” above the point

Overall performance
[Figure, two panels: PDPtot [fJ], 0 to 60, for SA-F/F, HLFF, SDFF, PowerPC, mC2MOS; PDPtot [fJ], 0 to 160, for Strong Arm110, K6, SSTC, DSTC; activity = 0.5, equal transition probability]

Conventional Clk-Q vs. minimum D-Q
[Figure, two panels: Total power [uW], 0 to 400, versus Delay [ps], 150 to 650; Total power [uW], 0 to 400, versus Clk-Q delay [ps], 100 to 350. Points: HLFF, SDFF, PowerPC, Strong Arm FF, Alpha 21264 FF, mC2MOS latch, K6 ETL, SSTC, DSTC]
• Hidden positive setup time
• Degradation of Clk-Q

Internal Power distribution
[Figure: Internal Power [uW], 0 to 400, per data pattern (random, activity=0.5; …01010101…, activity=1; …11111111…, activity=0; …00000000…, activity=0) for HLFF, SDFF, PowerPC 603 latch, mC2MOS latch, StrongARM FF, Alpha 21264 FF, K6 ETL]
• Four sequences characterize the boundaries for internal power consumption
  – …010101…: maximum
  – random, equal transition probability: average
  – …111111…: precharge activity
  – …000000…: leakage + internal clock processing

Comparison of Clock power consumption
[Figure: Local clock power consumption [uW], 0 to 50, for DSTC MS latch, SSTC MS latch, K6 ETL, StrongArm FF, SA-F/F, mC2MOS, PowerPC MS latch, SDFF, HLFF]

Using Dual-Edge Flip-Flop
(run at ½ of the frequency and save on the power consumed in the clock distribution tree)

Dual-Edge vs. Single-Edge Flip-Flops Comparison
[Figure, two panels: Delay [ps], 0 to 350, and Total Power [uW], 0 to 400, for DETFF1, DETFF2, DETFF3, SDFF, HLFF, PowerPC]
• Fujitsu 0.18u process; clock frequency 500MHz (250MHz for Dual-Edge FFs)
• Data activity ratio α = 0.5
• VDD = 1.8V
• Temp = 25°C

Dual-Edge vs. Single-Edge Flip-Flops Comparison
[Figure, three panels: Internal Power [uW], 0 to 300; Clock Power [uW], 0 to 60; Data Power [uW], 0 to 25; for DETFF1, DETFF2, DETFF3, SDFF, HLFF, PowerPC, under the same conditions as the previous slide]

Silicon on Insulator (SOI) Technology

SOI Comparison
[Figure, five panels: Delay [ps], Internal Power [uW], Clock Power [uW], Total Power [uW], EDP [fJ] @500MHz, for HAL, PowPC, HLFF, SDFF, SAFF, SA 110]
F = 1GHz, α = 0.5, Le = 0.08 um, VDD = 1.3V, T = 25°C

In conclusion…
What can we expect that low power will bring to us?
• Wearable Computer
• Digital Ink
• Implantable Computer
• Bluetooth

Year 2010
• Extrapolation of the trend with some saturation
• Many important and interesting applications: Home, Entertainment, Office, Translation, Health care
Year 2020???
• More assembly technique: 3D
Year 2110
• Combination of bio and semiconductor

Mosquito
• Brain: ultra small volume, small number of neuron cells, extremely low power, real time image processing, (artificial) intelligence, 3D flight control
• Sensor: infrared, humidity, CO2
• Bio-computer: long lifetime by DNA manipulation