Evolution of Chip Design
ECE 111, Spring 2011

A Brief History
• 1958: First integrated circuit
  – Flip-flop using two transistors
  – Built by Jack Kilby at Texas Instruments (image courtesy Texas Instruments)
• 2010:
  – Intel Core i7 microprocessor: 2.3 billion transistors
  – 64 Gb Flash memory: > 16 billion transistors [Trinh09] (© 2009 IEEE)
Source: David Harris, CMOS VLSI Design Lecture Slides

Annual Sales
• > 10^19 transistors manufactured in 2008
  – 1 billion for every human on the planet
Source: David Harris, CMOS VLSI Design Lecture Slides

Feature Size
• Minimum feature size shrinking 30% every 2-3 years
Source: David Harris, CMOS VLSI Design Lecture Slides

NRE Mask Costs
Source: MIT Lincoln Labs, M. Fritze, October 2002

Subwavelength Lithography Challenges
Source: Raul Camposano, 2003

The Designer's Escalating Problem
Source: Raul Camposano, 2003

Wire Delays and Noise Problems Dramatically Complicate Design
[Figure: the chip area reachable in one clock cycle shrinks sharply between the 180 nm and 45 nm nodes]
• Unstructured "place and route" standard-cell methodologies will break down

ASIC NRE Costs Not Justified for Many Applications
• Forecast: by 2010, a complex ASIC will have an NRE cost of over $40M = $28M (NRE design cost) + $12M (NRE mask cost)
• Many "ASIC" applications will not have the volume to justify a $40M NRE cost
  – e.g. a $30 IC with a 33% margin would require sales of 4M units (× $10 profit/IC) just to recoup the $40M NRE cost

Case for Programmable Solutions
• Can amortize high NRE costs across many applications
  – e.g. microprocessors, DSPs, FPGAs
• Complex ASICs today require 18+ months vs. ~4 months for the same function on a DSP
  – e.g. Voice-over-IP chip vs. Voice-over-IP on a DSP
  – The "design time" gap will widen dramatically
• Many applications simply require "programmability", e.g. cell phones
  – multiple modes
  – evolving standards
  – evolving features, differentiation

… But …
• Advanced applications and algorithms (e.g. the latest video games, broadband wireless, …) require enormous computation power
  – 100s to 1000s of GOPS
• And very high efficiency
  – 100s of MOPS/mW (GOPS/W)
  – 10s of GOPS/$
• Existing microprocessors, DSPs, and FPGAs don't come close

Why Are Conventional Processor Architectures Inefficient?
• e.g. Intel Itanium II
  – 6-way integer unit: < 2% of die area
  – Cache logic: > 50% of die area
• Most of the chip is there to keep those 6 integer units running at "peak" rate
• Main issue: the ratio of external DRAM latency (50 ns) to the internal clock period (0.25 ns) is 200:1
• In theory, > 300 ALUs (tens of thousands in the future) could fit in the same die area, but how do we keep them "busy"?

Why Are ASICs So Efficient?
• Parallelism: millions of gates operating in parallel
• Locality: fed by dedicated "local" wires and memories
Source: Bill Dally, 2003

• A 20 MIPS CPU in 1987: a few thousand gates
• The billion-transistor chip of 2007
Source: Anant Agarwal, MIT, NOCS 2009 Keynote

Tilera's TILEPro64™ Processor (90 nm)
Multicore Performance
• Number of tiles: 64
• Cache-coherent distributed cache: 5 MB
• Operations @ 750 MHz (32-, 16-, 8-bit): 144 / 192 / 384 BOPS
• Bisection bandwidth: 2 Terabits per second
Power Efficiency
• Power per tile (depending on app): 170–300 mW
• Core power for H.264 encode (64 tiles): 12 W
• Clock speed: up to 866 MHz
I/O and Memory Bandwidth
• I/O bandwidth: 40 Gbps
• Main memory bandwidth: 200 Gbps
Programming (product reality)
• ANSI standard C
• SMP Linux programming
• Stream programming
Source: Anant Agarwal, MIT, NOCS 2009 Keynote

Tile Processor Block Diagram: A Complete System on a Chip
[Block diagram: each tile pairs a processor (register file, pipelines P0/P1/P2, L1I/L1D caches, ITLB/DTLB, 2D DMA, L2 cache) with a switch carrying the MDN, TDN, UDN, IDN, and STN networks; around the edge sit four DDR2 memory controllers, XAUI and PCIe MAC/PHYs with serdes, GbE 0/1, UART, HPI, JTAG, I2C, SPI, and flexible I/O]
Source: Anant Agarwal, MIT, NOCS 2009 Keynote

What Does the Future Look Like?
• Corollary of Moore's law: the number of cores will double every 18 months

             '02   '05   '08   '11   '14
  Research    16    64   256  1024  4096
  Industry     4    16    64   256  1024

• 1K cores by 2014! Are we ready?
• (Cores minimally big enough to run a self-respecting OS!)
Source: Anant Agarwal, MIT, NOCS 2009 Keynote

Massively Parallel Processing On-a-Chip
• 64 tiles × 8 ALUs = 512 ALUs @ 2 GHz → 1000 GOPS = 1 TOPS
• Parallelism + locality
[Diagram: DDR DRAM interfaces, on-chip SRAM, and register files, with bandwidths of 2 GB/s, 32 GB/s, and 544 GB/s]

Bandwidth Hierarchy Is Key

  Application          Memory BW   Global RF BW   Local RF BW
  Depth Extractor      0.80 GB/s     18.45 GB/s   210.85 GB/s
  MPEG Encoder         0.47 GB/s      2.46 GB/s   121.05 GB/s
  Polygon Rendering    0.78 GB/s      4.06 GB/s   102.46 GB/s
  QR Decomposition     0.46 GB/s      3.67 GB/s   234.57 GB/s

Source: Bill Dally, 2003

IBM/Sony/Toshiba Cell Processor
• Used in the PlayStation 3
• 4.6 GHz 64-bit dual-threaded PowerPC
• 8 SIMD engines × 7 ALUs = 56 ALUs @ 4.6 GHz = 256 GFLOPS
• Terabit on-chip ring network
• Terabit external memory I/O (0.5 Tb/s) and chip-to-chip I/O (0.5 Tb/s)
• 90 nm process, 234 million transistors, 221 mm² die

NVIDIA GeForce 8800
• 8 clusters × 16 ALUs = 128 ALUs
• 32-bit on-chip CPU
• Terabit external memory I/O (0.7 Tb/s)
• 1.35 GHz clock
• 90 nm process, 681 million transistors
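The break-even arithmetic on the "ASIC NRE Costs" slide can be checked with a short sketch (the helper name `breakeven_units` is made up for illustration; the $40M NRE, $30 price, and 33% margin figures come from the slide):

```python
def breakeven_units(nre_cost, unit_price, margin):
    """Units that must be sold to recoup NRE, with profit = price * margin."""
    profit_per_unit = unit_price * margin  # $30 * 1/3 = $10 profit per IC
    return nre_cost / profit_per_unit

# Slide's figures: $40M NRE, $30 IC, ~33% margin
units = breakeven_units(40e6, 30, 1/3)
print(f"{units / 1e6:.0f}M units")  # prints "4M units", matching the slide
```

The point of the exercise is that even a healthy per-unit profit amortizes a $40M fixed cost only at volumes many "ASIC" applications never reach.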
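The core-count table on the "What Does the Future Look Like?" slide follows directly from its stated corollary of Moore's law (doubling every 18 months). A small sketch reproduces the table; the function name `cores` is illustrative, and the 2002 starting points (16 research cores, 4 industry cores) are taken from the slide:

```python
def cores(start_cores, start_year, year, doubling_months=18):
    """Project core count, doubling every `doubling_months` months."""
    months = (year - start_year) * 12
    return start_cores * 2 ** (months // doubling_months)

for year in (2002, 2005, 2008, 2011, 2014):
    # 2014 row: 4096 research cores, 1024 industry cores, as on the slide
    print(year, cores(16, 2002, year), cores(4, 2002, year))
```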
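The peak-throughput figures quoted for the tiled chip, the Cell, and the GeForce 8800 all come from the same simple product: ALU count × clock rate. A hedged sketch (the helper `peak_ops` is made up; it assumes one operation per ALU per cycle, which is how the slides' numbers work out):

```python
def peak_ops(alus, clock_hz):
    """Peak operation rate assuming one op per ALU per cycle."""
    return alus * clock_hz

# Tiled chip from the slides: 64 tiles x 8 ALUs = 512 ALUs @ 2 GHz
print(peak_ops(64 * 8, 2e9) / 1e12)  # ~1 TOPS, as the slide claims
# Cell: 8 SIMD engines x 7 ALUs = 56 ALUs @ 4.6 GHz
print(peak_ops(8 * 7, 4.6e9) / 1e9)  # ~258 GFLOPS (slide rounds to 256)
```

Sustaining anywhere near these peaks is the hard part; that is exactly the bandwidth-hierarchy argument the Dally slide makes.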