A Power-Efficient High-Throughput 32-Thread SPARC Processor
Negar Esmaeilie Falah
Instructor: Prof. M. Fakhraiee
Class presentation adapted from ISSCC 2006 / Session 5 / Processors / 5.1

Outline
• Motivation
• Architecture Overview
• Performance / Power
• Physical Implementation
• Integer Register File
• L2 Cache
• Conclusion

Motivation
• Commercial server applications
  – High thread level parallelism (TLP)
  – Low instruction level parallelism (ILP)
• Major concerns:
  – Power
  – Cooling
  – Space

The Niagara SPARC Processor
• New architecture and new pipeline to achieve throughput and performance/watt
• Many small, simple cores
  – Shallow single issue pipeline
  – Small L1 caches
• Fine-grain multithreading within core
• L2 cache shared across all cores
• High bandwidth memory sub-system

Architecture Features
• CPU with 32 threads to exploit TLP
• 8 cores/chip with 4 threads/core to hide memory and pipeline stalls
  – Shared pipeline to reuse resources
  – Shared L2 cache for efficient data sharing among threads
• High bandwidth memory sub-system to increase throughput:
  – Highly associative banked L2 cache
  – High bandwidth crossbar to L2 cache
  – High bandwidth to DRAM

Processor Block Diagram [1]
[Figure: block diagram. Blocks shown: Sparc 0-7, Crossbar, Floating Point Unit, L2 Bank 0-3, DRAM Control Channel 0-3, Control Register Interface, JTAG, Clock & Test Unit, JBUS System Interface, SSI ROM Interface. External interfaces: four DDR2 channels (144 bits @ 400 MT/s each), JBUS (200 MHz), SSI (50 MHz).]

Micrograph and Overview [1]
[Figure: die micrograph. Labeled blocks: SPARC Core 0-7, L2 Data Bank 0-3, L2Tag Bank 0-3, L2 Buff Bank 0-3, DRAM Ctl 0,2, DRAM Ctl 1,3, DDR2_0 to DDR2_3, Crossbar, FPU, IO Bridge, JBUS, CLK / Test Unit.]
Features:
• 8 64-bit multithreaded SPARC cores
• Shared 3MB L2 cache
• 16KB I-cache per core
• 8KB D-cache per core
• 4 144-bit DDR2 channels
• 3.2 GB/sec JBUS I/O
Technology:
• 90nm CMOS process
• 9LM Cu interconnect
• 63 Watts @ 1.2GHz/1.2V
• Die size: 378 mm²
• 279M transistors
• Flip-chip ceramic LGA

SpecJBB Execution Efficiency [1]
[Figure: per-thread cycle breakdown over a cycle axis (0, 4, 8). Legend: Compute, Pipeline Latency, Pipeline Conflict, Memory Latency.]
• Single threaded: idle time 3.79 cycles, efficiency = 1 / (1 + 3.79) = 21%
• Four threaded: idle time 1.56 cycles, efficiency = 4 / (4 + 1.56) = 72%

Power
• Power efficient architecture
  – Single issue, in-order six stage pipeline
  – No speculation, predication or branch prediction
  – Small cores can operate at lower frequency while achieving high throughput performance
• Thermal monitoring
  – Peak power closer to average power
  – Control issue rate within the cores
  – Halt idle threads
  – Optimize thread distribution across cores for performance or power under limited workload

Chip power consumption: 63W [1]

H-Tree Clock Distribution [3]

Cool Threads Advantages [1]
[Figure: die temperature map. Labeled temperatures: 59°C (five locations), 66°C (two locations), 107°C (one location).]
• Improved reliability with lower and more uniform junction temperatures
  – Increased lifetime
  – Total failure rate reduced by ~8X (vs 105°C)
• Optimized performance/reliability trade-off
  – Frequency guardbands due to CHC, NBTI, etc. reduced by > 55%
  – Reduced design margins (EM/NBTI)
  – Less variation across die

Physical Design
• Fully static cell based design methodology
  – Many replicated blocks
  – Custom design only for SRAMs, analogue and IOs
  – Increased chip robustness and test coverage
• Clock distribution combines H-tree and buffered tree
• All SRAMs testable through the scan chain

Statistics
  Transistors        279 Million
  Standard Cells     2.2 Million
  Flops              400,614 (32 scan chains)
  Repeaters          161,000
  Memory             20 macros, 416 instances
  Memory Bits        35.6 Million
  Decoupling Caps    710 nF

Integer Register File Overview
• One register file required per thread
• Supports standard SPARC window RF
• Highly integrated cell structure to support 4 threads while saving area and power
  – 8 windows of 32 entries
  – 3 read ports + 2 write ports for active window
  – Read/write: single cycle throughput / 1-cycle latency
• Swaps are pipelined across threads for save / restore operations
  – Swaps block within a thread but not across threads for optimal CMT performance
  – 3 cycle latency with single cycle throughput

IRF Swaps Across Thread [1]
[Figure: timing diagram of three back-to-back swap requests (Swap #1, #2, #3) issued by Thread 1, Thread 2 and Thread 3 against Clk.]
• Conventional swap: SAVE then RSTO for each thread in turn; swap requests fulfilled every 2 cycles
• Internal pipelined swap: DEC, SAVE, RSTO stages overlapped across threads; swap requests fulfilled every cycle with a fixed 3-cycle latency

L2 Cache
• High bandwidth 3MB shared Level 2 cache
  – Four 750KB independent banks
  – 12-way set associative
  – 16B read and write operations
  – 2 cycle throughput with 8 cycle latency
  – Direct communication to DRAM and JBus
  – Maximum bandwidth of 153.6GB/s
• Reverse-mapped directory
  – CAM based directory contains L1 cache tags instead of L2 tags to reduce area

Crossbar [1]
• 8 cores communicate with L2, FPU and Ctl Register Interface
• 134.4 GB/s data BW
• 3 stage pipeline: request, arbitrate, transmit
• 2 queue entries per source/destination pair
• Arbiter prioritizes requests by age
• Standard cell macros with semi-custom route

L2 Data Array [1]
[Figure: L2 data bank floorplan. Labels: Logical Sub-Bank 0-3, 64KB Array, 32KB Array, way9 panel, way10 panel, way11 panel, Datapath Unit, Interface, 128b Data buses.]
• Each 750KB bank divided into 4 sub-banks
• Each sub-bank reads 16B independently
• 12 16KB panels per sub-bank
• Each panel contains data for 1 of the 12 ways
• 12 64KB custom macros per bank

L2 Data Clock Header Design [1]
• Special clock header design allows
  – Sub-bank and panel level gating to minimize non-active power
  – Only 1-4 panels activated out of 48 panels in a bank
  – Interlocking scheme for 2-cycle throughput
[Figure: clock header circuit. Labels: L2 Clk, Enable, Dyn FF, sbank_en, panel_en, way_select, access_done, set, reset, po_reset.]

Conclusion
• New CMT architecture developed to address commercial workload requirements
• 32 threads to hide instruction latency in a short and simple pipeline
• Large bandwidth instead of high frequency to deliver target performance at low power
• Cooler and more uniform chip temperature to enhance performance/reliability trade-off
• Circuits designed for high bandwidth and low power to support multithreading

References
[1] Ana Sonia Leon, Jinuk Luke Shin, Kenway W. Tam, William Bryg, Francis Schumacher, Poonacha Kongetira, David Weisner, Allan Strong, “A Power-Efficient High-Throughput 32-Thread SPARC Processor”, 2006.
[2] P. Kongetira, “A 32-Way Multithreaded SPARC Processor,” 16th Hot Chips Symp., Aug., 2004.
[3] Magdy A. El-Moursy and Eby G. Friedman, “Exponentially Tapered H-Tree Clock Distribution Networks”, 2004.
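
Worked example: SpecJBB execution efficiency
The efficiency figures on the SpecJBB Execution Efficiency slide can be reproduced from the idle times it quotes. The sketch below assumes, as the slide's arithmetic suggests, that each thread contributes one compute cycle per interval and that efficiency is compute cycles divided by compute plus idle cycles. The function name pipeline_efficiency is illustrative and does not come from the paper.

```python
def pipeline_efficiency(threads: int, idle_cycles: float) -> float:
    """Fraction of pipeline cycles spent computing.

    Assumption: each thread supplies one compute cycle, so the pipeline
    does `threads` useful cycles and then idles for `idle_cycles`.
    """
    return threads / (threads + idle_cycles)


# Idle times quoted on the slide.
single = pipeline_efficiency(threads=1, idle_cycles=3.79)
four = pipeline_efficiency(threads=4, idle_cycles=1.56)

print(f"single-threaded: {single:.0%}")
print(f"four-threaded:   {four:.0%}")
```

With the slide's numbers this rounds to 21% for one thread and 72% for four threads, matching the quoted efficiencies.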
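
Worked example: L2 data array geometry
The panel and macro counts on the L2 Data Array and L2 Data Clock Header Design slides can be cross-checked against the 3MB total. This is a consistency check only, assuming binary units (1 KB = 1024 bytes); under that assumption each bank comes out at 768 KB, which suggests the slides' "750KB" is a rounded 0.75 MB. All constant names are illustrative.

```python
KB = 1024  # assumption: binary kilobytes

BANKS = 4
SUB_BANKS_PER_BANK = 4
PANELS_PER_SUB_BANK = 12      # one panel per way
PANEL_BYTES = 16 * KB
MACROS_PER_BANK = 12
MACRO_BYTES = 64 * KB

panels_per_bank = SUB_BANKS_PER_BANK * PANELS_PER_SUB_BANK
bank_bytes_from_panels = panels_per_bank * PANEL_BYTES
bank_bytes_from_macros = MACROS_PER_BANK * MACRO_BYTES
total_bytes = BANKS * bank_bytes_from_panels

# Clock gating: the slides state that at most 4 of the panels in a bank
# are active for an access.
max_active_fraction = 4 / panels_per_bank

print(f"panels per bank: {panels_per_bank}")
print(f"bank size from panels: {bank_bytes_from_panels // KB} KB")
print(f"bank size from macros: {bank_bytes_from_macros // KB} KB")
print(f"total L2 size: {total_bytes // (KB * KB)} MB")
print(f"max active panel fraction: {max_active_fraction:.1%}")
```

The 48 panels per bank agree with the clock header slide, and both the panel view and the macro view give the same bank size.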
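
Worked example: IRF swap timing
The difference between the conventional and the pipelined register file swap on the IRF Swaps Across Thread slide can be illustrated with a toy timing model. It is a sketch under stated assumptions, not the actual circuit behaviour: requests arrive back to back from different threads, a conventional swap completes one request every 2 cycles, and a pipelined swap accepts one request per cycle and completes it 3 cycles later. Both function names are hypothetical.

```python
def conventional_finish_cycles(requests: int, cycles_per_swap: int = 2) -> list[int]:
    """Cycle in which each back-to-back swap completes when swaps are serialized."""
    return [cycles_per_swap * k for k in range(1, requests + 1)]


def pipelined_finish_cycles(requests: int, latency: int = 3) -> list[int]:
    """Cycle in which each swap completes when one request is accepted per cycle.

    Request k enters the pipeline in cycle k and completes at the end of
    cycle k + latency - 1 (fixed latency, single-cycle throughput).
    """
    return [k + latency - 1 for k in range(1, requests + 1)]


# Three back-to-back requests, as in the slide's diagram.
conventional = conventional_finish_cycles(3)
pipelined = pipelined_finish_cycles(3)

print("conventional:", conventional)
print("pipelined:   ", pipelined)
```

In this model the pipelined scheme has a longer latency for the first request but overtakes the conventional scheme as soon as several threads issue swaps in a row, which is the case the slides identify as important for CMT performance.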
