Design and Implementation of the POWER5 Microprocessor
J. Clabes¹, J. Friedrich¹, M. Sweet¹, J. DiLullo¹, S. Chu¹, D. Plass², J. Dawson², P. Muench², L. Powell¹, M. Floyd¹, B. Sinharoy², M. Lee¹, M. Goulet¹, J. Wagoner¹, N. Schwarz¹, S. Runyon¹, G. Gorman¹, P. Restle³, R. Kalla¹, J. McGill¹, S. Dodson¹
¹IBM System Group, Austin, TX
²IBM System Group, Poughkeepsie, NY
³IBM Research, Yorktown Heights, NY
Outline
• Project Objective
• Microarchitecture Changes
• Implementation Overview
• Design Enablers
• Integration Challenges
• Timing and Hardware Performance
• Power Efficiency
• Summary
Project…
POWER5™ Chip Objectives
Build on POWER4™ base
• Maintain binary and structural compatibility
• Deliver superior performance
• Enhance and extend SMP scalability
• Provide additional server flexibility
• Enhance reliability, availability, serviceability (RAS) attributes
• Deliver power efficient design
Microarchitecture…
Simultaneous Multithreading in POWER5 Chip
• Each chip appears as a 4-way SMP to software
• Processor resources optimized for enhanced SMT performance
• Software controlled thread priority
• Dynamic feedback of runtime behavior to adjust priority
• Dynamic switching between single and multithreaded mode
[Figure: Single Threaded Operation. Execution units FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL; Thread 0 active.]
[Figure: Simultaneous Multi-Threading. Execution units FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL; Thread 0 active and Thread 1 active.]
Modifications to POWER4 System Structure
[Figure: system structure block diagrams with blocks labeled P, L2, Fab Ctl, L3, L3 Cntrl, Mem Ctl, and Memory. Callouts: reduced L3 latency; larger SMPs; faster access to memory; number of chips cut in half.]
Implementation…
POWER5 Chip Overview
• Technology: 130nm lithography, SOI, Cu wiring
• 276M transistors
• 389 mm² die size
• Two 8-way superscalar SMT cores
• Memory subsystem with 1.9MB L2-Cache, L3 directory and memory controller on chip
• Extensive RAS support
• High-speed elastic bus interface
Design…
ERAT and D-Cache Array Design Changes
• System performance vs. area trade-off
ERAT: Fully associative, implemented as Sum-Address CAM
D-cache: 4-way associativity
Result: 2-3% performance gain with improved wireability at 5% area cost
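The slide names a Sum-Address CAM but does not describe its circuit. As a hedged illustration of the general idea, the sketch below tests whether base + offset equals a stored tag without running a carry chain, using the well-known carry-free check for A + B = K, and then applies that test to every entry as a fully associative lookup would. The function names, the 16-bit width, and the example values are assumptions for illustration, not details of the POWER5 ERAT.

```python
def sum_matches(a: int, b: int, k: int, width: int = 16) -> bool:
    """Return True if (a + b) mod 2**width == k, without performing the addition.

    Illustrative only: the bit width and names are assumptions, not POWER5 values.
    """
    mask = (1 << width) - 1
    a, b, k = a & mask, b & mask, k & mask
    # Carry-in that each bit position would need for its sum bit to equal k's bit.
    required_carry_in = a ^ b ^ k
    # Carry-out each bit position produces if it receives that required carry-in.
    carry_out = (a & b) | ((a ^ b) & ~k)
    # Match when every bit's carry-out is exactly what the next bit requires
    # (bit 0 must require no carry-in; the top carry-out is discarded).
    return required_carry_in == ((carry_out << 1) & mask)


def cam_lookup(entries: list[int], base: int, offset: int, width: int = 16) -> list[int]:
    """Fully associative lookup: compare base + offset against every stored entry."""
    return [i for i, tag in enumerate(entries) if sum_matches(base, offset, tag, width)]


print(cam_lookup([0x10, 0x25, 0x30], base=0x20, offset=0x05))
```

Every entry is checked against the same operand pair, so in hardware the comparisons can proceed in parallel; the Python loop only models the result, not the timing.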
[Figure: POWER4 vs. POWER5 array organization. POWER4: 2-way ERAT and 2-way DCache. POWER5: 128-way ERAT CAM and 4-way DCache. Array blocks are labeled 64 entries or 128 entries, with comparators (=) feeding the hit logic.]
Design…
L2 and I-Cache Array Design Changes
• SMT drives thread level parallelism
  – Improved associativity on L2-Cache (10-way) and I-Cache (2-way)
  – L2 access shifted by ½ cycle, avoiding extensive array redesign
  – High speed latch with compare on I-Cache access path
Design…
2nd Generation Elastic Interface Design
• EI-II performance improvements
  – Runs over 2 GHz in laboratory, giving headroom on IO frequencies
  – Allows bus frequencies to continue scaling with processor frequency
[Figure: data eye around Vref showing the error free data valid window, with the functional data bounded by early and late guardbands.]
Optimizes Vref at T0 by level forwarding
Maintains guardband via periodic self calibration
Integration…
Implementation of Engineered Buses and IO Wires
• Pre-planned and custom routed buses
  – ~50K engineered wires at chip level
  – ~2X of POWER4 chip
• Custom buffer insertion process
  – ~250K buffer/inverters
  – 2.5X of POWER4 chip
• Wire and bus characterization
  – Noise tolerance
  – Impact of coupling on delay
  – Inductance analysis
• IO performance driven routing
  – 5 Ω resistance limit on chip
  – Fully shielded (single ended design)
Integration…
Dual Clock Distribution
Clock distribution parameters:
  total nominal skew: 18ps
  local skew: 9ps
  slew rate from 30 - 70%: 52 - 71ps
  latency PLL to LCB: 777ps
  duty cycle control: ±25ps
  switching power @ 1.08V and 1.8GHz: 9.5W
  switching power @ 1.08V and 2GHz: 10.5W
Main Clock Grid (91 Buffers)
• 1 full chip buffer
• 1 central chip buffer
• 3 half chip buffers
• 6 quadrant buffers
• 80 sector buffers
Memory Clock Domain (4 Buffers)
• 1 central chip buffer
• 3 sector buffers
• asynchronous to main mesh
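As a quick consistency check on the clock distribution figures, the sketch below tallies the listed buffers against the stated totals and compares the two switching power values with first-order linear frequency scaling at fixed voltage. Pairing 10.5W with the 2GHz operating point is an assumption drawn from the slide layout, and the variable names are illustrative.

```python
# Buffer counts as listed on the slide.
main_grid = {"full chip": 1, "central chip": 1, "half chip": 3, "quadrant": 6, "sector": 80}
memory_domain = {"central chip": 1, "sector": 3}

main_total = sum(main_grid.values())        # slide states 91 buffers
memory_total = sum(memory_domain.values())  # slide states 4 buffers

# First-order dynamic power scales linearly with frequency at fixed voltage.
power_at_1p8ghz_w = 9.5
scaled_to_2ghz_w = power_at_1p8ghz_w * 2.0 / 1.8

print(main_total, memory_total, round(scaled_to_2ghz_w, 2))
```

Under that assumption the scaled value lands close to the 10.5W shown, which is what linear scaling of switching power with clock frequency would predict.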
Timing…
Chip Timing and Shmoo Plot
• Timing Closure
• Timing Model Analysis
  – 690K scannable M/S latches
  – 180K non-scan mid-cycle latches
  – 6.75M timing checks
  – TAT 19 hours
  – Sort mode (functional/scan/lbist)
  – Early mode (functional/scan)
[Figure: Shmoo Plot. Pass and Fail regions over Frequency (GHz), 1.5 to 2.4, versus Voltage (Volt) at 25 °C, 1.05 to 1.40.]
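For readers unfamiliar with shmoo plots, the sketch below shows how such a pass/fail grid is typically produced: sweep voltage and frequency, run a test at each point, and record the outcome. The axis ranges are taken from the plot; the `toy_device` pass criterion is a purely hypothetical stand-in and is not measured POWER5 data.

```python
from typing import Callable


def shmoo(passes: Callable[[float, float], bool],
          voltages: list[float], freqs_ghz: list[float]) -> list[str]:
    """Return one text row per frequency (highest first): 'P' for pass, '.' for fail."""
    rows = []
    for f in sorted(freqs_ghz, reverse=True):
        cells = "".join("P" if passes(v, f) else "." for v in voltages)
        rows.append(f"{f:.1f} {cells}")
    return rows


# Axis ranges from the plot: 1.05 to 1.40 V in 0.05 V steps, 1.5 to 2.4 GHz in 0.1 GHz steps.
voltages = [round(1.05 + 0.05 * i, 2) for i in range(8)]
freqs_ghz = [round(1.5 + 0.1 * i, 1) for i in range(10)]


def toy_device(v: float, f_ghz: float) -> bool:
    """Hypothetical model, NOT POWER5 data: maximum frequency rises linearly with voltage."""
    return f_ghz <= 1.5 + 2.0 * (v - 1.05)


rows = shmoo(toy_device, voltages, freqs_ghz)
for row in rows:
    print(row)
```

In a real measurement `passes` would run the functional, scan, or LBIST pattern on hardware at the given operating point rather than evaluate a formula.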
Power…
Power Efficient Design Implementation
• DC power mitigation
  – Leverage triple Vt technology
    • Decrease low Vt usage by 90%
    • Increase high Vt usage by 30%
  – Leverage triple Tox technology
    • Thick Tox usage for decoupling capacitors
• AC power mitigation
  – Minimal usage of dynamic circuits
  – Reduce loading on clock mesh
  – Incorporation of dynamic clock gating
Device width by threshold voltage (pie charts):
              high Vt   normal Vt   low Vt
  POWER4      26.2%     69.4%       4.4%
  POWER5      33.9%     65.7%       0.4%
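The pie-chart percentages can be checked against the "decrease low Vt usage by 90%" and "increase high Vt usage by 30%" bullets. The sketch below computes the relative change in each device-width share from POWER4 to POWER5. The assignment of percentages to high, normal, and low Vt is an assumption inferred from the chart legend and the bullets, since the extracted slide does not label each slice directly.

```python
# Device-width shares in percent, read from the pie charts.
# The label-to-slice mapping is an assumption inferred from the legend and bullets.
power4 = {"high Vt": 26.2, "normal Vt": 69.4, "low Vt": 4.4}
power5 = {"high Vt": 33.9, "normal Vt": 65.7, "low Vt": 0.4}


def relative_change(old: float, new: float) -> float:
    """Relative change from old to new, as a percentage of old."""
    return (new - old) / old * 100.0


for kind in power4:
    print(f"{kind}: {relative_change(power4[kind], power5[kind]):+.1f}%")
```

With this mapping, high Vt width grows by roughly 29% and low Vt width shrinks by roughly 91%, which agrees with the rounded figures in the bullets.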
Power…
Dynamic Clock Gating Implementation
[Figure: two clock gating schemes. Both show the mesh clock, a global disable, a local disable held in scan-only latches, gating logic producing a dynamic stop, and an enable for the gated c1 clock driving C2 latches. Panel 1: cycle-to-cycle clock control (~1/2 cycle path). Panel 2: cycle-predict clock control (~full cycle path), which additionally shows an MS latch.]
• Approach allows aggressive use of clock gating to conserve power
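To illustrate why dynamic clock gating lowers AC power, the sketch below counts how many clock pulses reach a group of latches when the mesh clock is blocked in cycles where a disable or stop signal is asserted. The signal names follow the figure labels, but the active-high polarity, the OR combination, and the example activity pattern are assumptions, not the POWER5 circuit.

```python
from dataclasses import dataclass


@dataclass
class GatingInputs:
    global_disable: bool = False
    local_disable: bool = False   # held in scan-only latches, per the figure labels
    dynamic_stop: bool = False    # produced by the gating logic each cycle


def clock_enabled(sig: GatingInputs) -> bool:
    """Assumed polarity: any asserted disable or stop blocks the gated clock."""
    return not (sig.global_disable or sig.local_disable or sig.dynamic_stop)


def delivered_pulses(stop_per_cycle: list[bool]) -> int:
    """Count cycles in which the gated clock actually pulses the latches."""
    return sum(clock_enabled(GatingInputs(dynamic_stop=s)) for s in stop_per_cycle)


# Hypothetical activity: the unit is idle (stop asserted) in 6 of 10 cycles.
stops = [True, True, False, False, True, True, True, False, True, False]
pulses = delivered_pulses(stops)
print(f"{pulses} of {len(stops)} clock pulses delivered")
```

Each suppressed pulse avoids charging the local clock load and the latch inputs for that cycle, which is where the AC power saving comes from.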
Power…
Improved Power Efficiency
• AC power reduction by ≥ 25%
• DC power reduction by ≥ 50%
Total power reduction by > 33% for numerically intensive workloads
[Figure: Impact of Power Saving Measures. Stacked bars of relative power (0.0 to 1.5), split into AC Power, Gate Leakage Power, and Channel Leakage Power, for: design without power saving features; clock gating added; thick oxide added; HVT added; LVT removed.]
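The three percentages are linked by how the original power splits between AC and DC, which the slide does not state. As a rough worked example, the sketch below assumes total power is the sum of AC and DC power and takes the reductions at exactly their quoted bounds (25%, 50%, 33%), then solves for the AC share that makes the numbers consistent. The function names and the exact-bound assumption are illustrative only.

```python
def total_reduction(ac_share: float, ac_reduction: float, dc_reduction: float) -> float:
    """Fractional total power reduction when total = AC + DC before the savings."""
    dc_share = 1.0 - ac_share
    return ac_share * ac_reduction + dc_share * dc_reduction


def ac_share_for(total: float, ac_reduction: float, dc_reduction: float) -> float:
    """AC share of the original power that yields the given total reduction."""
    return (dc_reduction - total) / (dc_reduction - ac_reduction)


share = ac_share_for(0.33, 0.25, 0.50)
print(round(share, 2), round(total_reduction(share, 0.25, 0.50), 2))
```

Under these assumptions the AC share comes out near two thirds of the original power, which is plausible for a numerically intensive workload where switching activity dominates.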
Power…
Thermal Protection
Temperature at Instruction Sequencing Unit
Stage 1 Throttling Engaged @ 84 °C & Disabled @ 81 °C
[Figure: temperature (°C), 79 to 86, versus time (s), 0 to 2, with the over-temperature and recovery-temperature thresholds marked.]
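The separate engage and disable temperatures describe hysteresis: throttling turns on at the upper threshold and stays on until the temperature falls to the lower one, which avoids rapid toggling near a single trip point. A minimal sketch of that behavior follows, using the two thresholds from the slide. Whether the comparisons are inclusive, the function name, and the sample temperatures are assumptions for illustration.

```python
ENGAGE_C = 84.0   # over-temperature threshold from the slide
RELEASE_C = 81.0  # recovery-temperature threshold from the slide


def throttle_states(temps_c: list[float], throttled: bool = False) -> list[bool]:
    """Return the throttle state after each temperature sample, with hysteresis."""
    states = []
    for t in temps_c:
        if not throttled and t >= ENGAGE_C:      # inclusive comparison is an assumption
            throttled = True
        elif throttled and t <= RELEASE_C:       # inclusive comparison is an assumption
            throttled = False
        states.append(throttled)
    return states


# Hypothetical samples, not measured data.
samples = [80.0, 83.0, 84.0, 85.0, 83.0, 82.0, 81.0, 80.0, 83.5]
print(throttle_states(samples))
```

Note that 83 °C and 82 °C yield different states depending on history: unthrottled on the way up, still throttled on the way down.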
Summary…
Summary
• First dual core SMT microprocessor
• Extended SMP to 64-way
• Operating in laboratory
• Power dynamically managed with no performance penalty
• Implementation permits future technology scalability from circuit and power perspective
• Innovative approach leveraging technology with system focus for high performance in a power efficient design