Design and Implementation of the POWER5 Microprocessor

J. Clabes¹, J. Friedrich¹, M. Sweet¹, J. DiLullo¹, S. Chu¹, D. Plass², J. Dawson², P. Muench², L. Powell¹, M. Floyd¹, B. Sinharoy², M. Lee¹, M. Goulet¹, J. Wagoner¹, N. Schwarz¹, S. Runyon¹, G. Gorman¹, P. Restle³, R. Kalla¹, J. McGill¹, S. Dodson¹

¹ IBM System Group, Austin, TX
² IBM System Group, Poughkeepsie, NY
³ IBM Research, Yorktown Heights, NY

Outline
• Project Objective
• Microarchitecture Changes
• Implementation Overview
• Design Enablers
• Integration Challenges
• Timing and Hardware Performance
• Power Efficiency
• Summary

POWER5™ Chip Objectives
• Build on POWER4™ base
• Maintain binary and structural compatibility
• Deliver superior performance
• Enhance and extend SMP scalability
• Provide additional server flexibility
• Enhance reliability, availability, serviceability (RAS) attributes
• Deliver power efficient design

Simultaneous Multithreading in POWER5 Chip
• Each chip appears as a 4-way SMP to software
• Processor resources optimized for enhanced SMT performance
• Software controlled thread priority
• Dynamic feedback of runtime behavior to adjust priority
• Dynamic switching between single and multithreaded mode
[Figure: execution units FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL. Panel "Single Threaded Operation": thread 0 active. Panel "Simultaneous Multi-Threading": thread 0 and thread 1 active.]

Modifications to POWER4 System Structure
• Reduced L3 latency
• Larger SMPs
• Faster access to memory
• Number of chips cut in half
[Figure: system block diagram with blocks labeled P, L2, Fab Ctl, L3 Cntrl, L3, Mem Ctl, and Memory.]

POWER5 Chip Overview
• Technology: 130 nm lithography, SOI, Cu wiring
• 276M transistors
• 389 mm² die size
• Two 8-way superscalar SMT cores
• Memory subsystem with 1.9 MB L2 cache, L3 directory, and memory controller on chip
• Extensive RAS support
• High-speed elastic bus interface

ERAT and D-Cache Array Design Changes
• System performance vs. area trade-off
• ERAT: fully associative, implemented as Sum-Address CAM
• D-cache: 4-way associativity
• Result: 2-3% performance gain with improved wireability at 5% area cost
[Figure: POWER4 (2-way ERAT, 2-way D-cache) vs. POWER5 (128-way ERAT CAM, 4-way D-cache) arrays with compare and hit logic.]

L2 and I-Cache Array Design Changes
• SMT drives thread level parallelism
• Improved associativity on L2 cache (10-way) and I-cache (2-way)
• L2 access shifted by ½ cycle, avoiding extensive array redesign
• High-speed latch with compare on I-cache access path

2nd Generation Elastic Interface Design
• EI-II performance improvements
  • Runs over 2 GHz in laboratory: headroom on IO frequencies
  • Allows bus frequencies to continue scaling with processor frequency
• Optimizes Vref at T0 by level forwarding
• Maintains guardband via periodic self calibration
[Figure: error-free data valid window around Vref, with early and late guardbands on either side of the functional data.]

Implementation of Engineered Buses and IO Wires
• Pre-planned and custom routed buses
  • ~50K engineered wires at chip level
  • ~2X of POWER4 chip
• Custom buffer insertion process
  • ~250K buffers/inverters
  • 2.5X of POWER4 chip
• Wire and bus characterization
  • Noise tolerance
  • Impact of coupling on delay
  • Inductance analysis
• IO performance driven routing
  • 5 Ω resistance limit on chip
  • Fully shielded (single-ended design)

Dual Clock Distribution
• Main Clock Grid (91 buffers)
  • 1 full chip buffer
  • 1 central chip buffer
  • 3 half chip buffers
  • 6 quadrant buffers
  • 80 sector buffers
• Memory Clock Domain (4 buffers)
  • 1 central chip buffer
  • 3 sector buffers
  • Asynchronous to main mesh
• Clock metrics
  • Total nominal skew: 18 ps
  • Local skew: 9 ps
  • Slew rate from 30 - 70%: 52 - 71 ps
  • Latency PLL to LCB: 777 ps
  • Duty cycle control: ±25 ps
  • Switching power: 10.5 W at 1.08 V and 2 GHz; 9.5 W at 1.08 V and 1.8 GHz

Chip Timing and Shmoo Plot
• Timing Closure
  • Timing model analysis
  • 690K scannable M/S latches
  • 180K non-scan mid-cycle latches
  • 6.75M timing checks
  • TAT 19 hours
  • Sort mode (functional/scan/lbist)
  • Early mode (functional/scan)
• Shmoo Plot
[Figure: pass/fail regions; frequency (GHz), 1.5 to 2.4, vs. voltage (V), 1.05 to 1.40, at 25 °C.]

Power Efficient Design Implementation
• DC power mitigation
  • Leverage triple Vt technology
    • Decrease low Vt usage by 90%
    • Increase high Vt usage by 30%
  • Leverage triple Tox technology
    • Thick Tox usage for decoupling capacitors
• AC power mitigation
  • Minimal usage of dynamic circuits
  • Reduce loading on clock mesh
  • Incorporation of dynamic clock gating

Device width by Vt    POWER4    POWER5
high Vt               26.2%     33.9%
normal Vt             69.4%     65.7%
low Vt                 4.4%      0.4%

Dynamic Clock Gating Implementation
• Cycle-to-cycle clock control (~1/2 cycle path)
• Cycle-predict clock control (~full cycle path)
• Approach allows aggressive use of clock gating to conserve power
[Figure: two gating schemes, each showing mesh clock, global disable, local disable, scan-only latches, gating logic, dynamic stop, clock enable, gated c1 clock, and C2 latches; the cycle-predict scheme adds an MS latch.]

Improved Power Efficiency
• AC power reduction by ≥ 25%
• DC power reduction by ≥ 50%
• Total power reduction by > 33% for numerically intensive workload
[Figure: "Relative Power Impact of Power Saving Measures". Bars for: design without power saving features, clock gating added, thick oxide added, HVT added, LVT removed. Components: AC power, gate leakage power, channel leakage power. Vertical axis 0.0 to 1.5.]

Thermal Protection
• Temperature at Instruction Sequencing Unit
• Stage 1 throttling engaged @ 84 °C and disabled @ 81 °C
[Figure: temperature (°C), 79 to 86, vs. time (s), 0 to 2, with over-temperature and recovery-temperature levels marked.]

Summary
• First dual core SMT microprocessor
• Extended SMP to 64-way
• Operating in laboratory
• Power dynamically managed with no performance penalty
• Implementation permits future technology scalability from circuit and power perspective
• Innovative approach leveraging technology with system focus for high performance in a power efficient design
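
The thermal protection slide gives two thresholds for stage 1 throttling: engaged at 84 °C and disabled at 81 °C. A minimal sketch of that hysteresis is shown below, assuming a simple polled temperature reading. The slides do not describe the hardware mechanism, so the function name `throttle_states`, the inclusive comparisons, and the sample trace are all hypothetical.

```python
ENGAGE_C = 84.0   # over-temperature threshold quoted on the slide
RELEASE_C = 81.0  # recovery-temperature threshold quoted on the slide


def throttle_states(temps_c, engage=ENGAGE_C, release=RELEASE_C):
    """Return the throttle state (True = throttling) after each sample.

    Hypothetical helper: throttling turns on when a sample reaches the
    engage threshold and stays on until a sample falls to the release
    threshold, so readings between the two keep the previous state.
    Treating both thresholds as inclusive is an assumption.
    """
    throttling = False
    states = []
    for t in temps_c:
        if not throttling and t >= engage:
            throttling = True
        elif throttling and t <= release:
            throttling = False
        states.append(throttling)
    return states


# Made-up temperature trace (deg C), for illustration only.
trace = [80.0, 83.0, 84.0, 83.0, 82.0, 81.0, 82.5]
print(throttle_states(trace))
```

The 3 °C gap between the two thresholds is what keeps the throttle from toggling on every small temperature fluctuation around a single set point.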
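
The device-width percentages on the power efficient design slide can be checked against the stated changes in low-Vt and high-Vt usage. The calculation below is only a consistency check on the slide's numbers. It assumes the percentages compare directly between the two chips (it ignores any change in total device width), the assignment of each percentage to a Vt class is read from the flattened pie-chart labels, and the helper name `relative_change` is illustrative.

```python
# Share of total device width by threshold voltage, as read from the
# POWER4 and POWER5 pie charts (label assignment is an assumption).
power4 = {"high Vt": 26.2, "normal Vt": 69.4, "low Vt": 4.4}
power5 = {"high Vt": 33.9, "normal Vt": 65.7, "low Vt": 0.4}


def relative_change(old, new):
    """Relative change from old to new, as a percentage of old."""
    return (new - old) / old * 100.0


for kind in power4:
    change = relative_change(power4[kind], power5[kind])
    print(f"{kind}: {power4[kind]}% -> {power5[kind]}% ({change:+.1f}%)")
```

Under these assumptions the low-Vt share falls by about 91% and the high-Vt share rises by about 29%, in line with the 90% and 30% figures quoted on the slide.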