University of Tehran, Electrical and Computer Engineering School
Design of ASIC CMOS Systems Course

MPSoC
Presented by: Mahdi Hamzeh
Instructor: Dr. S.M. Fakhraie
Spring 2006
This is a class presentation. All data are copyrighted to the respective authors listed in the references and have been used here for educational purposes only.

Outline
- Introduction
- A Power-Efficient High-Throughput 32-Thread SPARC Processor
- A 16-Core RISC Microprocessor with Network Extensions
- A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache
- A 2.6GHz Dual-Core 64b x86 Microprocessor with DDR2 Memory Support
- Conclusion

Introduction
- Multi-processor systems-on-chip (MPSoCs) pose many new challenges to the design of embedded systems [5]
- MPSoCs are becoming a necessary way to balance performance, power, and reliability while maintaining the maximum degree of flexibility [6]
- Energy-efficient design is strongly required in MPSoCs [6], for both dynamic power and leakage power

A Power-Efficient High-Throughput 32-Thread SPARC Processor
Features:
- High throughput, optimized for performance per watt
- Concurrent execution of 32 threads: 8 symmetrical, 4-way multithreaded, 64b SPARC cores
- 16KB ICache and 8KB DCache per core
- High-bandwidth, low-latency cache/memory hierarchy
- Single-issue, 6-stage pipeline in each core, designed to maximize pipeline utilization
- 134GB/s CPU-to-cache crossbar
- Pipelined, shared 3MB L2 cache: 4 banks, 12-way set-associative, 153.6GB/s
- Four 144b DDR2 DIMM channels at 400MT/s (mega-transfers/s) delivering 25.6GB/s (see the worked numbers after this section)
[Figures: processor block diagram; Niagara processor micrograph and overview [1]]

A Power-Efficient High-Throughput 32-Thread SPARC Processor (Contd.)
Thread features:
- 4 threads are interleaved per cycle in each core with zero thread-switch cost
- When any thread is blocked by a cache miss or branch penalty, the other threads issue instructions more frequently, effectively hiding the miss latency of the first thread
- Measured IPC (instructions per cycle): 5.76, with an actual L2 latency of 20.9 CPU cycles and a memory latency of 106ns, on the Java Business Benchmark (SPECjbb)
- Pipeline efficiency: 72% (5.76 out of a maximum of 8)
Clocking and thermal:
- A balanced H-tree scheme is used to distribute the global clock
- Thermal gradient of only 7°C at 63W; worst-case junction temperature is 66°C (compared to a typical Tj of 105°C, reliability improves by 5×)
Technology:
- 90nm CMOS process with 9 layers of Cu interconnect
- 378mm² die, 279M transistors, packaged in a flip-chip ceramic LGA with 1933 pins
- Power dissipation is 63W at 1.2V and 1.2GHz
- Library cells are static CMOS with a 1.5 P/N width ratio
[Figure: chip power consumption, 63W [1]]

Components
Peak power control:
- Active power- and temperature-control mechanisms allow threads and cores to be dynamically scheduled or idled
- Clock-gating techniques include coarse-grain gating to disable selected cores and fine-grain gating to disable about 30% of the datapath flops on average
Clock domains:
- The PLL generates 3 ratioed clock domains: CPU/crossbar/L2 cache, memory interface, and system interface
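As a quick check on the figures quoted above, the sketch below reproduces the memory-bandwidth and pipeline-utilization arithmetic. It assumes each 144b DDR2 channel carries 128 data bits plus 16 ECC bits (the usual 144b ECC DIMM organization); that split is an assumption, not something stated in the slides.

```python
# Sanity-check arithmetic for the Niagara figures quoted above.
# Assumption: each 144b DDR2 channel = 128 data bits (16 B) + 16 ECC bits,
# the usual organization of a 144b ECC DIMM channel.

channels = 4
data_bytes_per_transfer = 128 // 8     # 16 B of data per transfer per channel
transfers_per_s = 400e6                # 400 MT/s

memory_bw = channels * data_bytes_per_transfer * transfers_per_s
print(f"DDR2 bandwidth: {memory_bw / 1e9:.1f} GB/s")           # -> 25.6 GB/s

# Pipeline utilization: measured chip IPC of 5.76 against a peak of
# 8 instructions per cycle (8 single-issue cores).
ipc_measured, ipc_peak = 5.76, 8
print(f"Pipeline utilization: {ipc_measured / ipc_peak:.0%}")   # -> 72%
```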
Components (Contd.)
On-chip L2 cache:
- 12-way set-associative, divided into 4 independent banks
- The banks operate concurrently to read out up to 256B
- Each sub-bank supplies 16B with 2-cycle throughput, providing a maximum data-array read bandwidth of 153.6GB/s (see the sketch after this section)
[Figure: L2 cache data array floorplan and interlocking clock header [1]]

Design methodology:
- Hold-time methodology based on metal-programmable delay buffers, allowing the top-level route to be frozen while hold violations are still being resolved
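The L2 figures above hang together arithmetically. The short check below assumes the 4 banks contain 16 sub-banks in total, an inference from the 256B concurrent read-out (256B / 16B per sub-bank); the exact sub-banking is not spelled out in the slides.

```python
# Consistency check for the shared L2 cache bandwidth quoted above.
# Assumption: 16 sub-banks in total, inferred from the 256 B concurrent
# read-out (256 B / 16 B supplied per sub-bank).

clock_hz = 1.2e9            # CPU/L2 clock frequency
sub_banks = 256 // 16       # 16 sub-banks (inferred)
bytes_per_access = 16       # each sub-bank supplies 16 B
cycles_per_access = 2       # 2-cycle throughput per sub-bank

l2_read_bw = sub_banks * bytes_per_access / cycles_per_access * clock_hz
print(f"L2 data-array read bandwidth: {l2_read_bw / 1e9:.1f} GB/s")  # -> 153.6 GB/s
```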
A 16-Core RISC Microprocessor with Network Extensions
Features:
- ICache (each core): 32kB
- DCache (each core): 8kB
- L2 cache: 1MB
- Number of MIPS cores: 16
- Number of metal layers: 9
- Process: 0.13μm CMOS
- Voltage: 1.2V
- Frequency: 600MHz
- Power: 25W (power for an individual processor is 450mW at 600MHz)
- Number of transistors: 180 million
[Figure: chip plot [2]]

A 16-Core RISC Microprocessor with Network Extensions (Contd.)
- Targeted for layer-4 through layer-7 network applications
- Designed for power efficiency
Hardwired components:
- Security engines
- Network function accelerators
- Memory/network/bus controllers
- Most of the silicon area is dedicated to the 16 RISC processors and the 1MB L2 cache; the remaining area is occupied by network coprocessors and physical interfaces
Chip interfaces:
- 64b 133MHz PCI/PCI-X
- 144b 800MHz DDR2
- 36b 600MHz low-latency DRAM interface
- Miscellaneous and general-purpose I/Os
[Figure: peak performance [2]]

Architecture:
- Each RISC core can issue two MIPS instructions per cycle in order
- 32kB 4-way set-associative virtual instruction cache
- The execute unit consists of two pipelines: the first handles all instructions, while the second handles only ALU/insert/extract/shift/move instructions
- The memory section consists of an 8kB fully associative DCache, a 2kB write buffer, and a 32-entry (64-page) unified translation look-aside buffer (TLB)
- A multiplication/division unit, in addition to support for the standard MIPS instructions
- Cryptographic operations are accelerated by dedicated units supporting 3DES, AES, MD5, SHA-1/256/512, and GF2
- The 16 processors share a 1MB fully coherent write-back L2 cache
[Figure: RISC processor [2]]

Power:
- Aggressive clock gating of all place-and-route and custom islands
- Some blocks present natural exclusivity, and hardware enforces that exclusivity to reduce peak power (for example, in the execution unit only the ALU or the shifter needs to be enabled)
- Power efficiency is approximately 2000 MIPS/W
- Global clock distribution power is <1W at 1.2V and 600MHz, with a skew of <50ps

Design methodology:
- A combination of an industry-standard synthesis and place-and-route flow for the control blocks and full-custom schematic/layout design for the datapath-style units
- Global clock distribution is full custom and consists of a power-efficient, variable-density grid that minimizes total metal capacitance while maintaining low-resistance paths to the heaviest clock loads
- Local conditional clocks are two gain stages from the global clock and are designed on an ad-hoc basis
- Global floorplanning and wiring is done with an in-house tool that handles routing as well as optimal placement of repeaters, local clock drivers, and decoupling capacitance
[Figure: chip floorplan [3]]

A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache
Features:
- Two 64b cores, each running two threads
- 16MB unified L3 cache
- Each core has a unified 1MB L2 cache
- 1.328B transistors
- Frequency: 3.0GHz
- 435mm² die
- 1.25V core supply
- Worst-case power dissipation is 165W; a typical server workload is 110W
- Process: 65nm with 8 copper interconnect layers and a low-k carbon-doped oxide (k=2.9) inter-level dielectric
- Flip-chip (C4) attached to a 12-layer (4-4-4) organic package with an integrated heat spreader
- The package has 604 pins (238 are signal pins; the rest are power and ground)
[Figure: die micrograph [3]]

Components
L3 cache:
- Built from 256 data sub-arrays (64kB each) and 32 redundancy sub-arrays (68kB each)
- A data sub-array stores 32 bits; a redundancy sub-array stores 34 bits
- The 6T memory-cell bit size is 0.624μm²
- The physical address is 40b wide
Clock and PLL:
- 3 PLLs
- The uncore clock is distributed through a balanced tree embedded in nine vertical spines
- De-skew circuits controlled by on-die fuses reduce the uncore clock skew to less than 11ps
[Figure: clock distribution map [3]]

Components (Contd.)
- Level shifters are used between voltage domains
- DFT and debug features: scan observability registers (scan-out), I/O loopback and an I/O test generator (IBIST), on-die clock shrink, …
Power:
- Only 0.8% of all L3 cache array blocks are powered up for each cache access (see the check after this section)
- To reduce L3 cache leakage, NMOS sleep transistors are implemented in the SRAM sub-arrays and PMOS power-gating devices in the cache periphery (both saving about 3W of leakage)
Supply voltages:
- Three supply voltages: one for the two cores, a separate supply for the L3 cache together with the associated control logic, and a third for the FSB I/O circuits
- The design uses longer-Le devices (10% longer than nominal) in non-timing-critical paths to reduce subthreshold leakage (about 54% of the transistor width in the cores and 76% of the transistor width in the uncore, excluding cache arrays)
[Figures: voltage domains and power breakdown; L3 cache sleep circuit and shut-off mode [3]]
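The L3 capacity and access-power figures above can be cross-checked as below. Treating the 0.8% figure as roughly 2 of the 256 data sub-arrays being enabled per access is an inference for illustration, not a statement from the paper.

```python
# Consistency check for the Xeon L3 cache figures quoted above.

data_sub_arrays = 256
sub_array_kb = 64
print(f"L3 data capacity: {data_sub_arrays * sub_array_kb // 1024} MB")  # -> 16 MB

# The quoted 0.8% of array blocks powered per access is roughly
# 2 out of the 256 data sub-arrays (an inference, for illustration).
print(f"2 / 256 sub-arrays = {2 / 256:.1%}")                             # -> 0.8%
```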
A 2.6GHz Dual-Core 64b x86 Microprocessor with DDR2 Memory Support
Features:
- Process technology: 90nm triple-Vt, partially-depleted SOI with 9-layer Cu metallization and dual gate-oxide thicknesses
- 2 CPU cores
- Die area: 220mm², of which 77.4mm² is L2 cache
- Transistor count: 243M (134M in the L2 arrays, 13M in the L1 arrays)
- L1 instruction cache: 64kB per core, parity protected
- L1 data cache: 64kB per core, ECC protected
- L2 cache: 1MB per core, ECC protected
- Memory interface: 128b DDR2-800, 12.8GB/s
- Frequency: 2.6GHz
- 1.35V core supply
- Power dissipation: 95W
- The chip implements the Pacifica architecture for hardware support of virtualization
- The design has 7% frequency margin and 10% voltage margin at its operating point
[Figure: die micrograph [4]]

Components:
- 2 Hammer cores
- On-chip DDR2 memory controller
- 3 identical PLLs: 2 provide clocks for the 3 HyperTransport links, and the third provides a clock for the memory controller and both cores

Clock distribution:
- A balanced H-tree drives the clock signal from the PLL to the final clock buffers
- Worst-case clock skew is 21ps

Power:
- Fine-grained clock gating reduces the load on the clock grid and reduces power consumption
- The clock grids over the 2 cores can be enabled separately
- In low-power operating modes, the clock grids over the CPU cores are disabled and the clock grid over the memory controller runs at 1/256th the frequency of the system clock
- The grid provides a low-resistance path to all clock receivers, so clock drivers do not have to be retuned based on loading at the end of the design cycle
- Reducing the supply voltage from 1.35V to 1.1V achieves a three-fold reduction in static leakage and a 47% reduction in dynamic power, at a cost of 20% in frequency
[Figure: static leakage versus frequency [4]]

Conclusion
- Power-hungry techniques like memory speculation, out-of-order execution, and predication are not needed to achieve the desired performance [1]
- Extensive use of simple static CMOS circuits improves robustness [2]
- Designed for power efficiency, which is a key requirement for MPSoC embedded applications [3]

References
[1] A. S. Leon, J. L. Shin, K. W. Tam, W. Bryg, F. Schumacher, P. Kongetira, W. Weisner, A. Strong, "A Power-Efficient High-Throughput 32-Thread SPARC Processor", International Solid-State Circuits Conference, February 2006.
[2] V. Yalala, D. Brasili, D. Carlson, A. Hughes, A. Jain, T. Kiszely, K. Kodandapani, A. Varadharajan, T. Xanthopoulos, "A 16-Core RISC Microprocessor with Network Extensions", International Solid-State Circuits Conference, February 2006.
[3] S. Rusu, S. Tam, H. Muljono, D. Ayers, J. Chang, "A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache", International Solid-State Circuits Conference, February 2006.
[4] M. Golden, S. Arekapudi, G. Dabney, M. Haertel, S. Hale, L. Herlinger, Y. Kim, K. McGrath, V. Palisetti, M. Singh, "A 2.6GHz Dual-Core 64b x86 Microprocessor with DDR2 Memory Support", International Solid-State Circuits Conference, February 2006.
[5] K. C. Chang, J. S. Shen, T. F. Chen, "Evaluation and Design Trade-Offs Between Circuit-Switched and Packet-Switched NOCs for Application-Specific SOCs", 43rd Design Automation Conference, July 2006.
[6] I. Issenin, E. Brockmeyer, B. Durinck, N. Dutt, "Multiprocessor System-on-Chip Data Reuse Analysis for Exploring Customized Memory Hierarchies", 43rd Design Automation Conference, July 2006.

Questions?