CS 7810
Lecture 22
Processor Case Studies,
The Microarchitecture of the Pentium 4 Processor
G. Hinton et al.
Intel Technology Journal
Q1, 2001
Clock Frequencies
• Aggressive clocks => little work per pipeline stage
=> deep pipelines => low IPC, large buffers, high
power, high complexity, low efficiency
• A 50% increase in clock speed yields only a ~30%
increase in performance
[Figure: performance vs. pipeline depth, for mispredict latencies of 10 and 20 cycles]
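The 50%-clock-for-30%-performance trade-off can be illustrated with a toy CPI-adder model (my own sketch, not from the slides): a faster clock implies a deeper pipeline, which raises the branch-mispredict penalty and lowers IPC. All numbers below are illustrative assumptions.

```python
# Hedged sketch: a toy model relating clock speed, mispredict penalty,
# and performance. base_ipc and mispredict rate are assumed values.

def performance(freq_ghz, mispredict_cycles, base_ipc=2.0,
                mispredicts_per_instr=0.01):
    """Instructions per ns = frequency / effective CPI.

    Effective CPI = ideal CPI plus the stall cycles contributed by
    branch mispredicts (a simple CPI-adder model).
    """
    cpi = 1.0 / base_ipc + mispredicts_per_instr * mispredict_cycles
    return freq_ghz / cpi

base = performance(2.0, mispredict_cycles=10)          # shallow pipeline
fast = performance(3.0, mispredict_cycles=20)          # 50% faster clock, deeper pipe
print(f"speedup: {fast / base:.2f}x")                  # speedup: 1.29x
```

With these assumed parameters the 1.5x clock ratio delivers only about a 1.29x speedup, matching the slide's 50%-to-30% figure.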
Deep Pipelines
Variable Clocks
• The fastest clock period equals the time for an
ALU operation plus bypass (this clock runs at
twice the main processor clock)
• Different parts of the chip operate at slower
clocks to simplify pipeline design (e.g., RAMs)
Microarchitecture Overview
Front End
• ITLB, RAS, decoder
• Trace Cache: contains 12K µops (roughly equivalent
to an 8K-16KB I-cache), saves 3 pipe stages, reduces power
• Front-end BTB accessed on a trace cache miss
and smaller Trace-cache BTB to detect next
trace line – no details on branch pred algo
• Microcode ROM: implements µop translation for
complex instructions
Execution Engine
• Allocator: resource (regs, IQ, LSQ, ROB) manager
• Rename: 8 logical regs are renamed to 128 phys
regs; ROB (126 entries) only stores pointers
(Pentium 4) and not the actual reg values (unlike
P6) – simpler design, less power
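The pointer-based renaming scheme above can be sketched as follows. This is a minimal, assumed structure (names like `rat` and `free_list` are mine, not Intel's): a Register Alias Table maps the 8 logical registers onto 128 physical registers, and the ROB records only physical-register pointers, never data values.

```python
# Hedged sketch of P4-style renaming: logical regs map to physical
# regs via a Register Alias Table (RAT); the ROB holds only pointers.
# Structure and names are assumed for illustration.

NUM_LOGICAL, NUM_PHYSICAL = 8, 128

class Renamer:
    def __init__(self):
        # Initially, logical reg i lives in physical reg i.
        self.rat = list(range(NUM_LOGICAL))
        self.free_list = list(range(NUM_LOGICAL, NUM_PHYSICAL))
        self.rob = []  # entries: (dest_logical, new_phys, old_phys)

    def rename(self, dest, srcs):
        """Rename one instruction; return (phys_dest, phys_srcs)."""
        phys_srcs = [self.rat[s] for s in srcs]  # read current mappings
        new_phys = self.free_list.pop(0)         # allocate a phys reg
        old_phys = self.rat[dest]                # remember prior mapping
        self.rat[dest] = new_phys
        self.rob.append((dest, new_phys, old_phys))
        return new_phys, phys_srcs

    def retire(self):
        """At retirement, the overwritten mapping's phys reg is freed."""
        _, _, old_phys = self.rob.pop(0)
        self.free_list.append(old_phys)

r = Renamer()
d, s = r.rename(dest=0, srcs=[1, 2])  # e.g. r0 = r1 + r2
print(d, s)                            # 8 [1, 2]
```

Because values live only in the physical register file, no result data is ever copied out of the ROB at retirement, which is the simpler, lower-power design the slide contrasts with P6.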
• Two queues (memory and non-memory) and
multiple schedulers (select logic) – can issue six
instrs/cycle
Schedulers
• 3GHz clock speed = time for a 16-bit add and bypass
NetBurst
• 3GHz ALU clock = time for a 16-bit add and bypass
to itself (area is kept to a minimum)
• Used by 60-70% of all µops in integer programs
• Staggered addition – speeds up execution of
dependent instrs – an add takes three cycles
• Early computation of lower 16 bits => early
initiation of cache access
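Staggered addition can be sketched as computing the low 16 bits in one fast-clock phase and the high 16 bits (plus carry) in the next, so the low half is available early to a dependent operation such as a cache-address computation. A simplified sketch, not the actual circuit:

```python
# Hedged sketch of staggered 32-bit addition (low 16 bits first,
# then high 16 bits plus carry), as described for the double-pumped
# P4 ALU. Details are simplified for illustration.

def staggered_add(a, b):
    lo = (a & 0xFFFF) + (b & 0xFFFF)     # phase 1: low half
    carry = lo >> 16
    # (lo & 0xFFFF) is ready one fast-clock phase early, e.g. to
    # start a dependent cache access on the low address bits.
    hi = ((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF) + carry  # phase 2
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

print(hex(staggered_add(0x0001FFFF, 0x00000001)))  # 0x20000
```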
Detailed Microarchitecture
Data Cache
• 4-way 8KB cache; 2-cycle load-use latency for
integer instrs and 6-cycle latency for fp instrs
• Distance between load scheduler and execution
is longer than load latency
• Speculative issue of load-dependent instrs and
selective replay
• Store buffer (24 entries) to forward results to loads
(48 entries) – no details on load issue algo
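Store-to-load forwarding through the store buffer can be sketched as a load first searching the in-flight stores (youngest matching store wins) before falling back to the data cache. This is an assumed, word-granularity simplification; the real hardware must also handle partial overlaps and access sizes.

```python
# Hedged sketch of store-to-load forwarding: a load checks in-flight
# stores youngest-first before reading the data cache. Word-granularity
# address matching only, for illustration.

def load(addr, store_buffer, cache):
    """store_buffer: list of (addr, value) in program order (oldest first)."""
    for st_addr, st_val in reversed(store_buffer):  # youngest store first
        if st_addr == addr:
            return st_val          # forwarded from the store buffer
    return cache.get(addr, 0)      # otherwise read the data cache

cache = {0x100: 7}
sbuf = [(0x100, 42), (0x200, 5), (0x100, 99)]
print(load(0x100, sbuf, cache))  # 99 (youngest store to 0x100 wins)
print(load(0x300, sbuf, cache))  # 0  (miss in both buffer and cache)
```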
Cache Hierarchy
• 256KB 8-way L2; 7-cycle latency; new operation
every two cycles
• Stream prefetcher from memory to L2 – stays
256 bytes ahead
• 3.2GB/s system bus: 64-bit wide bus at 400MHz
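The quoted bus bandwidth follows directly from the stated width and transfer rate, as a quick check shows:

```python
# Sanity check of the stated system-bus bandwidth:
# a 64-bit bus at 400M transfers/s moves 8 bytes * 400e6 per second.
width_bytes = 64 // 8
transfers_per_sec = 400e6       # quad-pumped 100 MHz bus
bandwidth = width_bytes * transfers_per_sec
print(bandwidth / 1e9)          # 3.2 (GB/s)
```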
Performance Results
Quick Facts
• November 2000: Willamette, 0.18µm, Al interconnect,
42M transistors, 217mm², 55W, 1.5GHz
• February 2004: Prescott, 0.09µm, Cu interconnect,
125M transistors, 112mm², 103W, 3.4GHz
Improvements
• Willamette (2000) → Prescott (2004)
• L1 data cache: 8KB → 16KB
• L2 cache: 256KB → 1MB
• Pipeline stages: 20 → 31
• Frequency: 1.5GHz → 3.4GHz
• Technology: 0.18µm → 0.09µm
Pentium M
• Based on the P6 microarchitecture
• Lower design complexity (some inefficiencies
persist, such as copying register values from ROB
to architected register file)
• Improves on P4 branch predictor
PM Changes to P6, cont.
• Intel has not released the exact length of the pipeline.
• Known to be somewhere between the P4 (20 stage)
and the P3 (10 stage). Rumored to be 12 stages.
• Trades slightly lower clock frequency (than P4) for better
performance per clock, lower branch misprediction penalties, …
Banias
• 1st version
• 77 million transistors, 23 million more than P4
• 1 MB on-die Level 2 cache
• 400 MHz FSB (quad-pumped 100 MHz)
• 130 nm process
• Frequencies between 1.3 and 1.7 GHz
• Thermal Design Power of 24.5 watts
http://www.intel.com/pressroom/archive/photos/centrino.htm
Dothan
• Launched May 10, 2004
• 140 million transistors
• 2 MB Level 2 cache
• 400 or 533 MHz FSB
• Frequencies between 1.0 and 2.26 GHz
• Thermal Design Power of 21 W (400 MHz FSB) to 27 W
http://www.intel.com/pressroom/archive/photos/centrino.htm
Branch Prediction
• Longer pipelines mean higher penalties for
mispredicted branches
• Improvements result in added performance
and hence less energy spent per instruction
retired
Branch Prediction in Pentium M
• Enhanced version of Pentium 4 predictor
• Two branch predictors added that run in
tandem with P4 predictor:
– Loop detector
– Indirect branch detector
• 20% lower misprediction rate than PIII
resulting in up to 7% gain in real
performance
Branch Prediction
Based on diagram found here: http://www.cpuid.org/reviews/PentiumM/index.php
Loop Detector
• A predictor that always predicts taken for a loop
branch will always mispredict on the loop's last
iteration
• Detector analyzes branches for loop behavior
• Benefits a wide variety of program types
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm
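The loop detector's idea can be sketched as: learn a branch's trip count on the first pass through the loop, then predict "not taken" on the exit iteration of later passes. The structure below is assumed for illustration, not Intel's implementation.

```python
# Hedged sketch of a loop detector: learn the trip count, then
# predict "not taken" on the exit iteration. Assumed design.

class LoopDetector:
    def __init__(self):
        self.count = 0      # taken outcomes seen in the current run
        self.limit = None   # learned trip count (taken outcomes per loop)

    def predict(self):
        if self.limit is not None and self.count == self.limit:
            return False    # predicted exit iteration: not taken
        return True         # predicted loop-back: taken

    def update(self, taken):
        if taken:
            self.count += 1
        else:
            self.limit = self.count  # (re)learn the trip count
            self.count = 0

det = LoopDetector()
outcomes = ([True] * 9 + [False]) * 2   # 10-iteration loop, run twice
hits = 0
for taken in outcomes:
    hits += (det.predict() == taken)
    det.update(taken)
print(hits)  # 19: only the first pass's exit branch is mispredicted
```

A plain always-taken predictor would miss both exit branches; the detector misses only the first, before the trip count is learned.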
Indirect Branch Predictor
• Picks targets based on global control-flow history
• Benefits programs compiled to branch to calculated
addresses
http://www.intel.com/technology/itj/2003/volume07issue02/art03_pentiumm/p05_branch.htm
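An indirect-branch target predictor can be sketched as a table of targets indexed by a hash of the branch PC and the global history register, so the same indirect branch can learn a different target for each control-flow context. Table size and hash are assumptions, not the documented design.

```python
# Hedged sketch of an indirect-branch target predictor: targets are
# cached in a table indexed by (PC xor global history), so one branch
# can learn a different target per control-flow context. Assumed design.

TABLE_BITS = 10

class IndirectPredictor:
    def __init__(self):
        self.table = {}   # index -> predicted target address
        self.ghist = 0    # global branch-history register

    def _index(self, pc):
        return (pc ^ self.ghist) & ((1 << TABLE_BITS) - 1)

    def predict(self, pc):
        return self.table.get(self._index(pc))  # None on a table miss

    def update(self, pc, target, taken_bit):
        self.table[self._index(pc)] = target
        self.ghist = ((self.ghist << 1) | taken_bit) & ((1 << TABLE_BITS) - 1)

p = IndirectPredictor()
p.update(pc=0x40, target=0x1000, taken_bit=0)  # history unchanged here
print(hex(p.predict(0x40)))  # 0x1000: same context sees the learned target
```

Because the index mixes in `ghist`, the same PC reached along a different branch history indexes a different entry, which is what lets calculated-address branches predict context-dependent targets.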
Benchmark
Battery Life
UltraSPARC IV
• CMP with 2 UltraSPARC IIIs – speedups of 1.6
and 1.14 for swim and lucas (static parallelization)
• UltraSPARC III : 4-wide, 16 queue entries, 14
pipeline stages
• 4KB branch predictor – 95% accuracy, 7-cycle
penalty
• 2KB prefetch buffer between L1 and L2
Alpha 21364
• Tournament predictor – local and global; 36Kb
• Issue queue (20-Int, 15-FP), 4-wide Int, 2-wide FP
• Two clusters, each with 2 FUs and a copy of the
80-entry register file