
Microprocessor and DSP Technologies for the Nanoscale Era
Seminar 2

Ram Kumar Krishnamurthy
Microprocessor Research Labs
Intel Corporation, Hillsboro, OR
[email protected]
July 11, 2005

Outline
• General Technology and Circuit Challenges Beyond 65nm:
  • Switching and active leakage energy
  • Leakage tolerance and robustness
  • On-chip interconnect scaling
  • Process parameter variations and tolerance
  • Execution core thermal/power density
• Emerging trends in wireless and embedded DSP industry
• Circuit solutions:
  • Active and standby leakage power reduction strategies
  • Multi-supply design: switching + leakage power benefits
  • Energy-efficient arithmetic circuit technologies
  • HW accelerators for specialized DSP applications

Technology Scaling 101
[Figure: MOSFET cross-section (gate, source, drain, body, Tox, L) before and after one generation of scaling]
• Before scaling: dimensions (L, Tox) = 1, area = 1, delay = 1, frequency = 1
• After scaling: dimensions (L, Tox) = 0.7, area = 0.7 × 0.7 = 0.49, delay = 0.7, frequency = 1/0.7 = 1.43

Leakage vs. Switching Power
[Figure: active power and active leakage power (Watts, 0 to 250) vs. technology (250nm, 180nm, 130nm, 90nm, 65nm)]
• From a µP perspective, but true for DSPs too
• Ioff increase 3-5X per generation
• Active leakage power > 50% of total power

On-chip Interconnect Performance
[Figure: delay (ns, log scale 0.001 to 1.0) vs. technology node (250 to 50 nm); curves for interconnect delay and typical gate delay, each annotated "30% per generation"]
• RC/mm increases 40-60% per generation
• Local inter-gate wires dominate critical-path delays
• Global wire lengths not scaling by 0.7x

Process Variation Tolerance
[Figure: normalized frequency (0.9 to 1.4) vs. normalized leakage (0 to 20) for 180nm CMOS at 110°C, showing a 30% frequency spread and a 20X leakage spread]
[Figure: number of dies (0 to 200) vs. normalized IOFF (0 to 7), with the fast corner marked]
• Significant variation in IOFF (hence Fmax spread)
• Worsening with process scaling
• Excess leakage dies: lack in robustness
• Low leakage dies: over-designed for robustness
Process parameter variation tolerant design techniques

DSP Application Demands
[Figure: capability vs. time for voice, data and applications; upper capability tiers > 200 MIPS and > 100 MIPS]
[Figure, continued: implementation labels (ASIC, DSP, hardware assist); generations 2G (GSM, PDC, IS-95), 2.5G (GPRS, EDGE, IS-95B), 3G (WCDMA); lowest capability tier < 50 MIPS; time axis 2001, 2003, 2005]
• Smart cell-phones: $2B in ’02 → $15B in ’06!
• Huge demand for high-performance DSPs

Multimedia, Graphics, Enterprise…
[Figure: capability vs. time; regions labeled Multimedia, Graphics, Enterprise and OS, Services and Apps]
• 2001 (< 50 MIPS, 4+ MB Flash, 0.5+ MB RAM): simple user interface, calendar, notepad
• 2003 (> 100 MIPS, 16+ MB Flash, 8+ MB RAM): color screen, audio, graphics, email, full OS, GUI, browser, suite of apps
• 2005 (200+ MIPS, 64+ MB Flash, 16+ MB RAM): speech recognition, multimedia, large files and applications, secure remote access, full OS and user interface, browser, suite of apps
• Market is hungry for DSP MIPS (if you deliver, they will use it!)

Typical Performance Requirements
[Figure: MHz per task (log scale, 10 to 1000) vs. total required memory (< 4MB, 8MB, 16MB, 32MB, 64MB). Tasks shown: ASCII browser; graphical browser (small screen); voice; 128-bit encryption and decryption; robust handwriting recognition; MP3 playback; MPEG 4 playback; MP3 encode; Pocket PC. MHz bands shown: 8-12, 25-50, 40-80, 80-100, 150-200, 200-300 MHz]

So, How Do We Meet This Surging Demand Within Given Power Envelope?
[Figure: energy efficiency in MOPS/mW (log scale, 0.1 to 1000) vs. flexibility (coverage)]
• Dedicated HW, ASIC
• Configurable processor/logic; Berkeley’s Pleiades: 10-80 MOPS/mW
• Digital signal processors or other ASIPs: 1-2 MOPS/mW
• Embedded processors; SA110: 0.4 MOPS/mW
Courtesy: Prof. J. Rabaey, UC Berkeley
• Energy vs. Flexibility Trade-off

Energy and Area Efficiency
Courtesy: Prof. Teresa Meng, Stanford

MOPS/mW Distinction: General-purpose vs. Dedicated
Courtesy: Prof. B.
Brodersen, UC Berkeley, ISSCC’02
• DSP functions are more throughput-oriented
• Amenable to parallelism and pipelining (better power-performance optimization)

Emerging Trends in DSP Industry
[Figure: normalized power efficiency (log scale, 1 to 1000) vs. flexibility, placing specialized hardware, programmable DSP, and embedded/µP]
• Specialized hardware: best energy efficiency, enables algorithm tuning + low clock rates
• Microprocessors: best flexibility
Prof. L. Clark, CICC 2002 [2]

Emerging Trends in DSP Industry
[Figure: normalized power efficiency (log scale, 1 to 1000) vs. flexibility, placing specialized hardware, programmable DSP, and embedded/µP]
• Specialized hardware: best energy efficiency, enables algorithm tuning + low clock rates
• Microprocessors add specialized HW and coprocessors with DSP functionality
• Microprocessors: best flexibility

Example Case Study
IBM 32b PowerPC Processor, Nowka et al, ISSCC 2002 [3]
• 153-380MHz, 53-500mW in 180nm CMOS, 1.0-1.8V
• 5.84M transistors, 36mm2
• Dedicated DES encryption and speech processing accelerators
[Figure label: Encryption and Speech Processing Specialized HW]

Emerging Trends in DSP Industry
[Figure: normalized power efficiency (log scale, 1 to 1000) vs. flexibility, placing specialized hardware, programmable DSP, and embedded/µP]
• Specialized hardware: best energy efficiency, enables algorithm tuning + low clock rates
• DSPs add microcontroller functionality and specialized HW accelerators
• Microprocessors: best flexibility

Example Case Study
TI VLIW DSP with 1MB L2 cache, Agarwala et al, ISSCC 2002 [4]
• 600MHz, 4.8GOPS, 718mW in 130nm CMOS, 1.2V
• Dedicated Viterbi and Turbo decoding co-processor HW
• 64M transistors
• Integrated DMA controller, PCI, 1MB L2, 16K I$ & D$
[Figure label: Viterbi and Turbo Co-processors]

Specialized Hardware Accelerators
• Specialized (fixed function) hardware is 10-100x more efficient than general purpose processors: Why?
  – Trades hardware for power
  – Allows very low clock rates
  – Essential for some wireless functions
  – Viterbi and Turbo decoding, speech recognition, encryption
• Allows custom algorithms and coefficients to limit power
  – Use shifts instead of multiplies
• Cost is flexibility
  – Fixed algorithms and coefficients
  – As new applications and wireless standards emerge, is this enough?
  – How does this cover the application space?

Reconfigurable Processors
FPGA – Fine Grain Reconfigurable Fabric
• Fine-grain gate-level functions
• Array of MUXes to implement any N-input boolean function
• Speed sacrificed for generality
Coarse Grain Reconfigurable Fabric
• Moderate grain function blocks
• Collections of Add, Mpy, Mux, …
• Interconnect overhead is moderate to low
• If functions and connectivity are known, can be highly optimized
Courtesy: Prof. F. Kurdahi, UCI

Generic Reconfigurable Architecture
[Figure: array of fine/coarse grain datapath tiles and registers, with configuration control]

How Do Reconfigurable Processors Work?
• Execute one algorithm/protocol at any given time
• Each algorithm is ‘configured’ from the building blocks
• Time between subsequent configurations: ~1-10ms
• Configuration Control unit decides which algorithm to execute when
[Figure: time axis showing Protocol 1 configured, then Protocol 2]

How Do Reconfigurable Processors Work?
• Execute one algorithm/protocol at any given time
• Each algorithm is ‘configured’ from the building blocks
• Time between subsequent configurations: ~1-10ms
• Configuration Control unit decides which algorithm to execute when
[Figure: time axis showing Protocol 3 configured]

Standby Leakage Reduction: Sleep Transistor design
• Motivation: Cut off power supply in sleep-mode
• Insert “sleep” transistor between main supply and functional unit’s supply rails
• Latches tied to main supply rails: retain state
[Figure: functional unit between virtual Vcc and virtual Vss rails, each gated by a sleep transistor]

                          MTCMOS    Boosted Sleep    Non-Boosted Sleep
Sleep-TR size             5.1%      2.3%             3.2%
Leakage power reduction   1450X     3130X            11.5X
Virtual supply bounce     60 mV     59 mV            58 mV

Standby leakage benefit for 5% delay penalty

Switching + Leakage Reduction: Forward Body Bias
[Figure: body terminals Vbp and Vbn biased relative to Vdd; normalized total power vs. frequency (GHz, 0.6 to 1.4) at 110°C for ZBB and FBB (500mV), with Vcc swept 1, 1.05, 1.1 … 1.5V and the 1.1V and 1.2V points marked]
[Figure: FBB/ZBB leakage ratio (0 to 30) vs. frequency (GHz, 0.6 to 1.4) at 27°C]
• 20% power reduction at 1GHz
• 8% higher frequency at iso-power
• 20X lower idle-mode leakage
A. Keshavarzi et al, 2002 Symp. VLSI Circuits [6]

Multi-Vcc Usage Model
[Figure: four voltage regulator modules (VRM1 to VRM4), each supplying its own core rail (VCCcore1 to VCCcore4)]
• Optimize performance and power with parallelism and voltage

Switching + Leakage Reduction: Multi Supply Design
• Active leakage benefit with lower supply voltage
• Exponential subthreshold and gate leakage reduction
R. Krishnamurthy et al, 2002 Symp. VLSI Circuits [7]
[Figure: measured normalized subthreshold and gate leakage (0 to 100) vs. voltage (0 to 1.5 V) in a 1.2V, 130nm process]
[Figure: normalized leakage energy of a 130nm L1 cache vs. VCC (0 to 1.4 V) at worst-case and nominal corners, annotated 79%]

Adaptive Vcc: Variation-tolerant Circuits
• Motive: change Vcc adaptively to reduce impact of parameter variations
• Large Fmax vs.
leakage spread (worsening with scaling)
• Lower Vdd on leakage-limited circuits (subject to stability limits)
• Higher Vdd on speed-limited circuits (subject to reliability limits)
[Figure: normalized frequency (0.9 to 1.4) vs. leakage (0 to 20), showing a 30% frequency spread and a 20X leakage spread]
[Figure: die count (0% to 100%) vs. frequency bin (0.85 to 1.05) for fixed Vdd (1.05V) and adaptive Vdd (20mV resolution)]
[Figure: die count (0% to 100%) vs. frequency bin (0.85 to 1.05) for adaptive Vdd + body bias and adaptive Vdd + WID body bias; 5.3 mm x 4.5 mm die with 21 sub-sites within 1 die]
J. Tschanz et al, 2002 Symp. VLSI Circuits

Viterbi Decoder Organization
[Figure: encoded bits → Branch Metric Unit (BMU) → branch error → Path Metric Unit (PMU) → transitions → Traceback Unit (TBU) → decoded bits]
• BMU calculates errors for all branches
• PMU accumulates errors and outputs transitions with minimum error
• TBU traces minimum error path back to get best estimate of original input
One of the most performance and power critical algorithms in wireless baseband DSP

90nm CMOS Implementation
• 90nm dual-Vt 7-metal CMOS technology
• 64-state radix-2 design: 40mW at 500Mbps, 1.2V
[Figure: layout with PM memory, BMU, ACS, TB control, TB memory, path memory, and traceback blocks; ACS 230µm x 210µm, traceback 260µm x 510µm]
M. Anders et al, 2004 VLSI Circuits Symp. [10]

Summary
Technology        90nm dual-Vt CMOS
ACS area          230µm x 210µm (0.048 mm2)
Traceback area    260µm x 510µm (0.133 mm2)
Viterbi states    64-state
ACS precision     10 bits
Radix-2
max. TB length    96 symbols
M. Anders et al, 2004 VLSI Circuits Symp.
[10]
• Fastest reported 64-state Viterbi accelerator
  – Total power at 2 GHz (500Mbps) is 40mW (1.2V)
• Lowest power 802.11a implementation
  – Total power at 216 MHz (54Mbps) is 5mW (0.7V)

Streaming Media Accelerators: 32-bit MAC [ISSCC’03]
• 5GHz 32-bit multiply-accumulate unit
• Targeted for special purpose streaming processors/graphics
[Figure labels: multiplier, CLK, aligner, accumulate, normalize, FIFOs & scan]

Die Area        1.32 x 1.57 mm2
Process         90nm CMOS
Interconnect    1 poly, 7 metal
Transistors     230K
Frequency       5GHz
Maximum Vcc     1.5V
Chip Power      1.2W @ 1.2V
Pad Count       75

S. Vangal et al, ISSCC’03 [11]

32-bit MAC Architecture Overview
[Figure: FIFOs A, B, and C with FIFO control feeding a 32-bit multiplier (x) and accumulator (+) in the MAC loop; scan registers with scan in and scan out]
• Single-cycle 5GHz 32-bit MAC loop
• New Multiplier and Accumulator ALU circuit techniques

TCP/IP Off-load Accelerator [ISSCC’03]
• 10GHz TCP/IP offload accelerator unit
• Targeted for 10Gbps Ethernet packet processing accel.

Core Area       2.23 x 3.54mm2
Process         90nm dual-VT CMOS
Interconnect    1 poly, 7 metal
Transistors     260K
Frequency       10GHz
Max Vcc         1.5V
Chip power      1.9W @1.2V
Pad count       306

Y.
Hoskote et al, ISSCC’03 [12]

10GHz TCP/IP Execution Core
[Figure: block diagram with input, TCB, CLB, ROB, ALU, working register, Rcv buffer, scratch registers, pipelined ALU, PC, IR, decode, Instr ROM, next/branch/start address selection, ALU output, and the 10GHz sparse-tree ALU]
• At-speed packet processing execution core for 10Gbps

Conclusions
• Several Technology and Circuit Challenges Beyond 65nm
  • Switching and active leakage energy
  • Leakage tolerance and robustness
  • On-chip interconnect scaling
  • Process parameter variations and tolerance
  • Execution core thermal/power density
• Emerging trends in DSP industry
  • Specialized hardware accelerators and co-processors
  • Reconfigurable engines
• Circuit solutions:
  • Active and standby leakage power reduction strategies
  • Multi-supply design: switching + leakage power benefits
  • Energy-efficient arithmetic circuit technologies
  • DSP HW accelerators for Viterbi, Streaming media, TCP/IP

References
[1] R. Krishnamurthy et al, “High-performance and low-power challenges for sub-70nm microprocessor circuits”, IEEE Custom Integrated Circuits Conference 2002, pp. 125-128.
[2] L. Clark et al, “Trends and challenges for wireless embedded DSPs”, IEEE Custom Integrated Circuits Conference 2003, pp. 171-176.
[3] K. Nowka et al, “A 0.9 V to 1.95 V dynamic voltage-scalable and frequency-scalable 32 b PowerPC processor”, ISSCC 2002, pp. 340-341.
[4] S. Agarwala et al, “A 600 MHz VLIW DSP”, ISSCC 2002, pp. 56-57.
Reconfigurable processors:
1. http://brass.cs.berkeley.edu/
2. http://www.eng.uci.edu/comp.arch/
3. http://www.pactcorp.com/xneu/px_xpphw.html
[5] J. Tschanz et al, “Design optimizations of a high performance microprocessor using combinations of dual-Vt allocation and transistor sizing”, Symposium on VLSI Circuits 2002, pp. 218-219.
[6] A. Keshavarzi et al, “Forward body bias for microprocessors in 130nm technology generation and beyond”, Symposium on VLSI Circuits 2002, pp. 312-315.
[7] R.
Krishnamurthy et al, “Dual supply voltage clocking for 5 GHz 130 nm integer execution core”, Symposium on VLSI Circuits 2002, pp. 128-129.
[8] S. Mathew et al, “A 4 GHz 130 nm address generation unit with 32-bit sparse-tree adder core”, Symposium on VLSI Circuits 2002, pp. 126-127.
[9] S. Mathew et al, “A 4GHz 300mW 64b integer execution ALU with dual supply voltages in 90nm CMOS”, ISSCC 2004, pp. 162-163.
[10] M. Anders et al, “A 64-state 2GHz 500Mbps 40mW Viterbi accelerator in 90nm CMOS”, Symposium on VLSI Circuits 2004, pp. 174-175.
[11] S. Vangal et al, “A 5 GHz floating point multiply-accumulator in 90 nm dual Vt CMOS”, ISSCC 2003, pp. 334-335.
[12] Y. Hoskote et al, “A 10GHz TCP/IP offload accelerator for 10Gb/s Ethernet in 90nm CMOS”, ISSCC 2003, pp. 258-259.
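
Supplementary sketch: Viterbi decoder organization

The BMU / PMU / TBU split described in the Viterbi Decoder Organization slide can be illustrated in software. The following is a minimal sketch under stated assumptions, not the 90nm accelerator of [10]: it assumes a small rate-1/2 convolutional code with constraint length K = 3 and generators 7 and 5 (octal), hard-decision Hamming-distance branch metrics, and a block that is flushed with K-1 zero bits. The function names `encode` and `viterbi_decode` are hypothetical. A 64-state trellis would correspond to K = 7; the slides do not give the generator polynomials for that design.

```python
def parity(x):
    """Return the XOR of all bits of x."""
    return bin(x).count("1") & 1


def encode(bits, K=3, gens=(0b111, 0b101)):
    """Rate-1/len(gens) convolutional encoder (assumed code, for illustration)."""
    state = 0  # K-1 most recent input bits
    symbols = []
    for b in bits:
        reg = (b << (K - 1)) | state  # newest bit in the top position
        symbols.append(tuple(parity(reg & g) for g in gens))
        state = reg >> 1  # drop the oldest bit
    return symbols


def viterbi_decode(symbols, K=3, gens=(0b111, 0b101)):
    """Hard-decision Viterbi decoder organized as BMU, PMU (ACS), and TBU."""
    n_states = 1 << (K - 1)
    inf = float("inf")
    path_metric = [0] + [inf] * (n_states - 1)  # encoder starts in state 0
    history = []  # per symbol: surviving (previous state, input bit) per state

    for received in symbols:
        new_metric = [inf] * n_states
        survivors = [None] * n_states
        for state in range(n_states):
            if path_metric[state] == inf:
                continue  # state not reachable yet
            for b in (0, 1):
                reg = (b << (K - 1)) | state
                expected = tuple(parity(reg & g) for g in gens)
                # BMU: branch error is the Hamming distance to the received symbol
                branch = sum(e != r for e, r in zip(expected, received))
                # PMU: add, compare, select the minimum-error transition
                nxt = reg >> 1
                candidate = path_metric[state] + branch
                if candidate < new_metric[nxt]:
                    new_metric[nxt] = candidate
                    survivors[nxt] = (state, b)
        path_metric = new_metric
        history.append(survivors)

    # TBU: trace the minimum-error path back to recover the input bits
    state = min(range(n_states), key=lambda s: path_metric[s])
    decoded = []
    for survivors in reversed(history):
        state, b = survivors[state]
        decoded.append(b)
    decoded.reverse()
    return decoded


if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0, 1, 0] + [0, 0]  # payload plus K-1 flush bits
    received = encode(message)
    # Inject a single bit error into one received symbol
    received[3] = (received[3][0] ^ 1, received[3][1])
    print(viterbi_decode(received) == message)
```

The inner loop over states and input bits plays the role of the add-compare-select (ACS) step named on the implementation slide, and the stored `history` plays the role of the traceback memory; a hardware design would bound that memory with a fixed traceback length rather than storing the whole block.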

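Supplementary sketch: scaling arithmetic

The per-generation numbers on the Technology Scaling 101 and Leakage vs. Switching Power slides can be checked with a few lines of arithmetic. This is a sketch under assumptions: it reads the 0.49 on the scaling slide as area (0.7 × 0.7), and it compounds the quoted 3-5X Ioff increase uniformly over the four node transitions shown on the leakage plot (250nm to 65nm), which the slides do not state explicitly. The function names are hypothetical.

```python
def scale_one_generation(s=0.7):
    """Ideal scaling factors for one generation with linear shrink s."""
    return {
        "dimension": s,        # gate length and oxide thickness
        "area": s * s,         # 0.7 x 0.7, assumed meaning of the 0.49 label
        "delay": s,            # gate delay tracks the linear shrink
        "frequency": 1.0 / s,  # inverse of delay
    }


def leakage_growth(generations, low=3, high=5):
    """Compound a low-to-high per-generation Ioff increase (assumed uniform)."""
    return low ** generations, high ** generations


if __name__ == "__main__":
    nodes = ["250nm", "180nm", "130nm", "90nm", "65nm"]
    for name, factor in scale_one_generation().items():
        print(f"{name}: {factor:.2f}")
    low, high = leakage_growth(len(nodes) - 1)
    print(f"Ioff growth from {nodes[0]} to {nodes[-1]}: {low}X to {high}X")
```

Compounding is the reason the leakage bullet matters: even the low end of the quoted range grows much faster across generations than the 1.43X frequency gain per generation.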