Circuits and Architectures to Deliver Low Power and High Speed Systems
By: Jabulani Nyathi
Washington State University, School of EECS
April 30, 2009

Outline
- CMOS Scaling
  - Its benefits and
  - The challenges it brings about
- Various Techniques for Limiting Leakage Currents
  - Their shortfalls
- Bridging the Speed-Power Gap
  - The Tunable Body Biasing Scheme
- Emerging Devices and Technologies
- Concluding Remarks

CMOS Scaling and its Benefits
Aggressive CMOS scaling has been a very positive development, allowing:
- Fast switching devices, thus high-speed computing.
- Massive integration due to miniaturization
  - No longer do we need multiple chips to implement a microprocessor and its peripherals
  - In fact, we can now have multiple computing elements on a single die, resulting in a system on a chip.

CMOS Scaling and its Challenges
CMOS scaling results in:
- Increased leakage currents (5X/node) and
- Increased dynamic power dissipation.
The interconnect does not scale as fast as the transistor, thus
- Highly integrated designs require elaborate clock distribution schemes.
- IPs within a System on a Chip would be difficult to synchronize with a single clock source.

Scaling Implications
[Figure: Module1 and Module2 with scaled local interconnects, linked by global interconnects]

Dynamic vs. Leakage Power

Research Motivation
- Desire to bridge the Speed-Power Gap by exploring the feasibility of optimizing devices to operate effectively in both sub-threshold and above threshold voltages.
- Emerging Technologies that are Ultra-Low power can benefit from increased speed.
  - Wearable computers, sensor networks, implantable medical technology
- Emphasis on design for energy-efficiency

Existing Low Power Design Approaches
Solve energy dissipation problem from a region of operation standpoint
- Sub-threshold design
  - DTMOS: shows a 5.5 times increase in current
  - SBB: 4.4 times frequency increase
- Above threshold (Super-threshold) design
  - Dynamic threshold provides energy efficiency
  - MTCMOS: high and low threshold devices
  - VT Scheme: reduce power by 50% using ABB and “sleep”/“active” modes
  - Architectural Gating Techniques: 45% of total power

DTMOS/SBB Output Voltage Clamping
[Figure: output voltage for SBB, DTMOS, TBB, and Traditional; marked levels 1.8 V and 600 mV]

Proposed Approach
Change approach to include all possible operating regions: Tunable Body Biasing (TBB)
- Sub-threshold and super-threshold operation bridged
- Ultra-low energy and low speed or high energy and high speed
- Utilize body biasing to improve performance of sub-threshold operation
- Target increased performance at sub-threshold and slightly above threshold.
- Save energy by eliminating idle time and process continuously with variable power supplies (perform just in time task completion)
Target applications
- Mobile, battery operated (power constrained), variable processing devices
- Cell phones, PDAs, notebooks, wireless sensors, embedded systems, ASICs, medical technology, etc.
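
The just-in-time idea in the proposed approach can be sketched as choosing, for a given deadline, the lowest-energy supply voltage that still finishes on time. What follows is a minimal illustration, not the author's implementation. The three operating points are the TBB LFSR delay and energy figures reported in the Regions of Operation table later in these slides; the `OperatingPoint` type, the `pick_operating_point` helper, and the cycle counts and deadlines are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    vdd_v: float      # supply voltage in volts
    period_ns: float  # minimum clock period in nanoseconds
    energy_fj: float  # energy per cycle in femtojoules


# TBB LFSR figures quoted in the slides' "Regions of Operation" table.
TBB_LFSR_POINTS = (
    OperatingPoint(vdd_v=1.8, period_ns=0.6, energy_fj=437.0),
    OperatingPoint(vdd_v=0.75, period_ns=4.5, energy_fj=73.6),
    OperatingPoint(vdd_v=0.25, period_ns=4500.0, energy_fj=22.8),
)


def pick_operating_point(points, cycles, deadline_ns):
    """Return the lowest-energy point that finishes `cycles` by `deadline_ns`.

    Returns None when no point is fast enough (hypothetical helper).
    """
    feasible = [p for p in points if cycles * p.period_ns <= deadline_ns]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p.energy_fj)


if __name__ == "__main__":
    # Illustrative workload: 1000 cycles under three different deadlines.
    for deadline_ns in (1_000.0, 10_000.0, 10_000_000.0):
        choice = pick_operating_point(TBB_LFSR_POINTS, 1_000, deadline_ns)
        print(f"deadline {deadline_ns:>12.0f} ns -> {choice}")
```

A tight deadline forces the 1.8 V point, while a relaxed one lets the same work run at 250 mV for far less energy per cycle, which is the trade the slide describes as eliminating idle time.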
TBB Implementation Goals
- Attain ON state current gain while minimizing OFF state leakage current increase
- Highlight advantages of sub-threshold operation while allowing super-threshold operation if needed
- Control bulk terminal to tunable potentials depending on VDD and desired region of operation

MOS Bulk Control Circuits
- Multiplexer-based approach
- Two transistors per bulk control circuit
- Utilizes Vthn0

TBB Bulk Control Circuits
TBB MOS Bulk Control Signal

VDD                  pMOS Bulk      nMOS Bulk
VSS < VDD ≤ Vthn0    VSS            VDD
VDD > Vthn0          VDD - Vthn0    Vthn0

- Relies on passing of good/poor logic “1” and logic “0” properties of pass-transistors
- Requires external control signals SubVt and SubVt_b

TBB Bulk Control Circuit Simulation
- Super-threshold: pBulk = VDD - Vthn0
- Sub-threshold: pBulk = 0 V

Device Optimization
- TBB encourages varying supply voltages
- How will devices be sized for optimal operation at any supply voltage?
- Maintain symmetric switching
- Examine inverter at varying supply voltages

Device Optimization (Switching Point)

VDD         Ideal Inverter Threshold    Simulated Inverter Threshold    Percent Variation
1.8 V       900 mV                      900 mV                          0.0%
1.0 V       500 mV                      498 mV                          0.4%
376.2 mV    188.1 mV                    198.7 mV                        5.6%
188.1 mV    94.05 mV                    108.6 mV                        13.4%

Sub-threshold Noise Margins
- Noise margins are significant for proper logic levels
- TBB and Traditional static CMOS inverter have comparable noise margins
- TBB VIH is 12.5% worse
- TBB VIL is 14.3% better

Propagation Delay
[Figure: average switching delay (ns) of the transmission gate, inverter, two-input NAND, two-input NOR, and two-input XOR under Traditional, SBB, TBB, and DTMOS body biasing; static CMOS at Vdd = Vthn0 with varying body biasing]

Gate    Traditional Delay    TBB Delay    % Decrease
TG      98 ns                14 ns        86
Inv     125 ns               20 ns        84
NAND    133 ns               18 ns        86
NOR     163 ns               25 ns        85
XOR     289 ns               40 ns        89

Review of SubVth Circuits Benefits
So far, the presentation has shown:
- TBB requires control of MOS bulks to span the operating regions of interest.
- Implementation is successful.
Study of simple logic gates showed:
- TBB gives a dramatic speed increase (up to 7x)
- Static CMOS design style is suitable for sub-threshold and super-threshold operation
- Sizing of efficient devices for the TBB approach is possible
However, how will a complex system perform?
- Design with previous knowledge (logic style, sizing)
- Analyze post-layout simulations

Complex System-on-Chip Design Using TBB
Work addresses the challenges of
- Global Interconnect Delays
- Clock distribution
- Synchronization of unrelated clocks and
- Power dissipation

Conclusion
- TBB scheme has been devised to span all regions of operation from ultra-low power to high-speed.
  - New kind of body biasing
  - Forward-biasing causes exponential sub-threshold current gain
  - Focus on sub-threshold and slightly above threshold to utilize leakage
- Bulk control circuits are effective
  - Leads to 7 times frequency increase in simple logic gates
  - 4% area and 8.9% power dissipation increase
- Static CMOS is ideal overall design style
- Device sizing at either sub-threshold or super-threshold allows efficient operation with variable supply voltages

Concluding Remarks
- Allowing tunable operation allows the designer to choose operating point (kHz, MHz, GHz); energy dissipation is affected.
  - Other schemes do not offer this flexibility
- TBB can lead to significant energy savings
- LFSR results show TBB gives:
  - Maximal 5.7 times speed increase (sub-threshold)
  - Comparable energy at super-threshold and favorable at sub-threshold
  - Favorable EDP at all operating regions
  - Operate at the same speed with less energy dissipation
- Idle state leakage current can be minimized by collapsing the supply voltage

Integrating Research Into Instruction
[Figure labels: Data Path Circuits, Memory Design, Sub-System, ROUTER CHIP]

Incorporating Research into Instruction
- A long term objective is to place some of the integrated chips on development boards such as those Digilent Inc produces.
- The integrated chips become part of a system and can be used in some of our low level courses.
- Most important is the use of these programmable boards to showcase the research outcomes, particularly to visiting prospective students.
- A sample development board: [Figure]

Questions and Comments Welcome!

Multiple Clock Domain Synchronization
[Figure: computational modules grouped into synchronous islands that communicate isochronously through a MicroNetwork. The relationship between $f_{fast}$ and $f_{slow}$ is given by $n$: $n = 1$ (equal clocks), $n \in \mathbb{Z}$ (rational clocks), $n \in \mathbb{Q}$ (arbitrary clocks)]

Reducing Interconnect Delays
- Improved latency and bandwidth
- Global interconnects are pipelined at or near the rate of computation

Sources of Power Consumption
$P_{total} = P_{static} + P_{dynamic} + P_{\text{short circuit}}$
$P_{static} = P_{leakage} + P_{DC}$
$P_{dynamic} = V_{dd} \cdot V_{swing} \cdot f_{clk} \cdot C_{load}$
$P_{\text{short circuit}} = V_{swing} \cdot I_{\text{avg short circuit}}$
- Most straightforward method to reduce power consumption from any source is to reduce VDD
- Controlling frequency directly manipulates dynamic power
- Controlling device threshold manipulates leakage current, affecting leakage and short circuit power.

Distributed FIFO Control Circuitry

Traditional vs. Tunable Body Biasing
Delay and frequency are for LocalClock2.

          Traditional Body Biasing                   Tunable Body Biasing                       Tunable BB % diff
Vdd (V)   delay (ps)   freq (GHz)   current (uA)     delay (ps)   freq (GHz)   current (uA)     freq    current
1         111.2        9            3100             103.1        9.7          2988             7.8     -3.6
0.7       172.55       5.8          1240             177.7        5.6          1042             -3.4    -16
0.35      1354.5       0.7383       71               1438         0.6954       72.9             -5.8    -2.7
0.2       96700        0.0103       2.81             16640        0.0601       5.051            483     79.8

The synchronizer/buffer shows an increase in performance at sub-threshold voltages when using tunable body biasing.

Tunable Body Biasing

                                            Current (uA)               Power (uW)
                 Vdd (V)   Max Freq (GHz)   Peak     Avg      Idle     Peak      Avg       Idle
Traditional      1         4                5597     2382     8.696    5597      2382      8.696
Body Biasing     0.7       2                2222     803.4    4.873    1555.4    562.38    3.411
                 0.35      0.125            131.1    35.58    1.468    45.885    12.453    0.514
                 0.2       0.01             7.452    2.895    1.349    1.49      0.579     0.27
Tunable          1         4                5140     2460     9.54     5140      2460      9.54
Body Biasing     0.7       2                2050     833      4.423    1435      583.1     3.096
                 0.35      0.167            132      39.8     1.589    46.2      13.93     0.556
                 0.2       0.015            9.468    4.03     1.239    1.894     0.806     0.248

Pursuit of Low Power Operation
- It is likely that not all IP blocks in a SoC need to operate at high speed
- Power dissipation for those IP blocks could be reduced by operating at a lower voltage
- TBB offers the possibility to dynamically operate at either sub-threshold or super-threshold voltages

Variable Voltage SoC
[Figure: computational modules in synchronous islands, each with its own supply (Vdd1 to Vdd5), communicating isochronously through a MicroNetwork]
- Consider a SoC with 50 IP blocks, each requiring communication at a rate of 10 MHz
- Each IP could operate at sub-threshold levels
- The channel could operate at super-threshold voltages while the IP blocks are in sub-threshold

Idle vs Operating Power

          Idle                          Operating
Vdd (V)   Current (uA)   Power (uW)     Current (uA)   Power (uW)
1         16.9           16.9           2988           2988
0.7       5.3            3.71           1042           729.4
0.35      1.5            0.525          72.9           25.52
0.2       0.925          0.185          5.051          1.01

During idle periods, it is advantageous to reduce leakage current by reducing the power
supply voltage or increasing the threshold voltage (e.g. bulk voltage manipulation)

Speed at Varying VDD
[Figure: Delay Comparison of a TBB and Traditional LFSR; minimum clock period (ns, log scale) vs. supply voltage (V); series: TBB Delay, Traditional Delay]
- TBB 5.7x faster at 376.2 mV
- TBB 20% faster at 1.8 V

Energy-delay Product
[Figure: Energy Delay Product for TBB with Control 8-Bit LFSR; energy delay product (ns*fJ, log scale) vs. supply voltage (V); series: TBB Energy-delay Product, Traditional Energy-delay Product]
- EDP of TBB outperforms Traditional at ALL operating regions, significantly in super-threshold

Regions of Operation
[Figure: Delay vs. Energy Dissipation Tradeoff for TBB LFSR; clock period (ns) and energy dissipation (fJ) vs. supply voltage (V); series: TBB Delay, TBB Energy Dissipation]
- 1.1 GHz with 3.85 nJ/cycle
- 222.2 MHz with 103 fJ/cycle
- 3.9 MHz with 0.6 fJ/cycle

Contributions of this work
- Proposed scheme alleviates the communication bottleneck and offers a way to synchronize SoC multiple clocks
  - Perform data transfers up to 10 GHz
- Proposed scheme maintains high performance under the influence of any clock skew
  - 6.5 GHz for any process corner and any skew
- Low power FIFO scheme with a small impact on area when used in SoCs with many modules

Contributions of this work
- Process corners have a minor impact on performance, resulting in a 10% reduction of speed
- The optimal voltage for minimum energy consumption per transaction is at 2Vth
- Introduction of TBB to address leakage and dynamic power dissipation
  - 500% increase in performance at sub-threshold voltages with a modest 80% increase in power
  - 5-10% less power dissipation than traditional body biasing

Summary of Proposed FIFO Scheme
Linear FIFO scheme that addresses
- Signal propagation across communication channel
- Successful Synchronization
  - Synchronizes equal, rational & arbitrary clocks
- 6.5 GHz sustained performance after process corner analysis using 3 stages.
Compared to CN scheme
- Sustained throughput over long distances
- Fewer devices per stage, fewer stages needed
- 25% higher performance, 12% lower power
Operates at both super- and sub-threshold voltages
- Lower instantaneous power demands from local clocks (less di/dt)
- Optimal energy per transaction at 0.7 V in a 65 nm process
- Sub-threshold reduces power by 3 orders of magnitude
- Tunable Body Biasing provides 50% increased performance in sub-threshold while maintaining super-threshold operation

TBB Scalability
- At 90 nm, the % difference is much less
- At 180 nm, TBB sub-threshold static power % is large

                                    180 nm                                  90 nm
Body Biasing and                    Total Average       Static Power        Total Average       Static Power
Operating Region                    Power Dissipation   Contribution [%]    Power Dissipation   Contribution [%]
Traditional in Sub-threshold        193 pW              0.1%                13.1 nW             1.8%
Traditional in Super-threshold      39.6 μW             Negligible          22.1 μW             Negligible
TBB in Sub-threshold                1430 pW             25.2%               20.4 nW             6.1%
TBB in Super-threshold              39.4 μW             0.000034%           22.1 μW             0.0025%

- 180 nm: Total TBB sub-threshold power is large
- 90 nm: Total TBB sub-threshold power isn’t so large

LFSR Energy vs. Frequency
[Figure: TBB and Traditional LFSR Energy Dissipation vs Frequency; energy dissipation [fJ] vs. frequency [MHz]; series: Traditional Energy, TBB Energy]

TBB Implementation Cont.
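
The scalability table is easier to compare across nodes once each static-power percentage is turned into an absolute figure and TBB is set against traditional biasing. The sketch below does this for the sub-threshold rows only. The numbers are the table's; the dictionary layout and the helper names `static_power_w` and `tbb_total_power_ratio` are assumptions for illustration.

```python
# Sub-threshold rows of the TBB scalability table.
# Each entry: (total average power in watts, static contribution as a fraction).
SUB_THRESHOLD = {
    "180 nm": {"traditional": (193e-12, 0.001), "tbb": (1430e-12, 0.252)},
    "90 nm": {"traditional": (13.1e-9, 0.018), "tbb": (20.4e-9, 0.061)},
}


def static_power_w(total_w, static_fraction):
    """Absolute static power implied by a total power and its static share."""
    return total_w * static_fraction


def tbb_total_power_ratio(node):
    """Total TBB power divided by total traditional power at one node."""
    rows = SUB_THRESHOLD[node]
    return rows["tbb"][0] / rows["traditional"][0]


if __name__ == "__main__":
    for node, rows in SUB_THRESHOLD.items():
        tbb_static = static_power_w(*rows["tbb"])
        print(
            f"{node}: TBB/traditional total power = "
            f"{tbb_total_power_ratio(node):.2f}x, "
            f"TBB static power = {tbb_static:.3e} W"
        )
```

On these figures TBB draws roughly 7.4 times the traditional sub-threshold power at 180 nm but only about 1.6 times at 90 nm, which is the narrowing the slide points to.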
Logic Gate Analysis (Power)
[Figure: Power Dissipation vs Supply Voltage; power dissipation [nW], log scale; supply voltages 0.25, 0.3762, 0.75, and 1.8; series: Traditional CMOS Power, TBB CMOS Power]

Inverter Power Dissipation

VDD       Power Dissipation [fW]   Average Power [nW]   Maximum Frequency [MHz]   Period [ns]
0.3262    8.27                     3.5                  0.416                     2400.0
0.4262    11.41                    30.0                 2.6                       380.0
0.5643    15.64                    651.6                41.7                      24.0
1.8       82.30                    68.60                833.3                     1.2

VDD       Power Dissipation [fW]   Average Power [nW]   Maximum Frequency [MHz]   Period [ns]
0.3262    8.52                     22.4                 2.6                       380.0
0.4262    13.00                    259.8                20.                       50.0
0.5643    15.13                    2102.0               138.9                     7.2
1.8       81.47                    81.5                 1000.                     1.0

Logic Gate Analysis (Energy)
[Figure: Energy Dissipation vs Supply Voltage; energy dissipation [fJ]; supply voltages 0.25, 0.3762, 0.75, and 1.8 V; series: Traditional CMOS Energy, TBB CMOS Energy]

Logic Gate Analysis (EDP)
[Figure: EDP vs Power Supply; EDP [fJ*ns]; supply voltages 0.25, 0.3762, 0.75, and 1.8 V; series: Traditional CMOS EDP, TBB CMOS EDP]

Logic Gate Analysis (Fan-in)
[Figure: propagation delay [ns] vs. number of inputs (one to four); series: Traditional NAND, TBB NAND, Traditional NOR, TBB NOR]

Logic Gate Analysis (Logic Styles)
[Figure: Energy Dissipation vs Supply Voltage; energy dissipated [fJ]; supply voltages 0.5*Vthn, 0.75*Vthn, Vthn - 50 mV, Vthn, Vthn + 50 mV, 1.5*Vthn; series: Traditional Pseudo-nMOS Energy, TBB Pseudo-nMOS Energy]

Power Comparison of a TBB and Traditional LFSR
[Figure: LFSR Power Dissipation; average power dissipation (uW) vs. supply voltage (V); series: TBB Power, Traditional Power]

Device Optimization (Optimal Region)
[Figure: Delay vs. Energy Dissipation Tradeoff for TBB LFSR; clock period (ns) and energy dissipation (fJ) vs. supply voltage (V); series: TBB Delay, TBB Energy Dissipation]

Regions of Operation

                    Super-threshold (1.8 V)     Sub-threshold (250 mV)      Optimal (750 mV)
Design              Delay (ns)   Energy (fJ)    Delay (ns)   Energy (fJ)    Delay (ns)   Energy (fJ)
Traditional LFSR    0.7          437.6          20000        105            7            74.1
TBB LFSR            0.6          437            4500         22.8           4.5          73.6
Frequency range     GHz                         kHz                         MHz

Logic Gate Results
Results Highlights
- TBB, SBB, and DTMOS increase speed up to 7 times in sub-threshold
- Static CMOS has best overall logic style performance
- Pseudo-nMOS, Domino, and pass-transistor still are valuable in niche situations
- TBB and Traditional Noise Margins are comparable
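
The energy-delay product claim can be checked against the Regions of Operation table by taking EDP as delay times energy per cycle. This is a small sketch under that assumption. The values are the table's; the data layout and the function names `edp_fj_ns` and `edp_improvement` are made up for the example.

```python
# (delay in ns, energy in fJ) per design and region,
# copied from the Regions of Operation table.
REGIONS = {
    "super-threshold (1.8 V)": {"traditional": (0.7, 437.6), "tbb": (0.6, 437.0)},
    "sub-threshold (250 mV)": {"traditional": (20000.0, 105.0), "tbb": (4500.0, 22.8)},
    "optimal (750 mV)": {"traditional": (7.0, 74.1), "tbb": (4.5, 73.6)},
}


def edp_fj_ns(delay_ns, energy_fj):
    """Energy-delay product in fJ*ns."""
    return delay_ns * energy_fj


def edp_improvement(region):
    """Traditional EDP divided by TBB EDP; values above 1 favor TBB."""
    designs = REGIONS[region]
    return edp_fj_ns(*designs["traditional"]) / edp_fj_ns(*designs["tbb"])


if __name__ == "__main__":
    for region in REGIONS:
        print(f"{region}: traditional/TBB EDP = {edp_improvement(region):.2f}")
```

With these numbers the ratio is above 1 in all three regions: about 1.2 at 1.8 V, about 1.6 at 750 mV, and about 20 at 250 mV.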