EE 333
Fall 2006
Computer Organization
Lecture 20
Pipelining: “bucket brigade”
MIPS pipeline & control
Pentium 4 architecture
Lillevik 333f06-l20
University of Portland
School of Engineering
1
Pipelining overview
• Pipelining
– Increased performance through parallel
operations
– Goal: complete several operations at the same
time
• Hazards
– Conditions which inhibit parallel operations
– Techniques exist to minimize the problem
A laundry pipeline
• To do laundry: wash, dry, fold, put away
• Each step takes 30 minutes, but for four students …
• Laundry done at 2 AM
Let’s speed it up (pipeline)
• Move one load from one step to the next
• But start the next load before first is complete
• Takes only until 9:30 PM – party time !!
Bucket Brigade
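The laundry timing above can be checked with a short sketch (a minimal Python illustration, not course material; the step length, step count, and load count are taken from the slides):

```python
# Laundry pipeline: 4 loads, 4 steps (wash, dry, fold, put away),
# 30 minutes per step, as stated on the slides.
STEP_MIN = 30
STEPS = 4
LOADS = 4

# Serial: each load finishes all steps before the next load starts.
serial_min = LOADS * STEPS * STEP_MIN           # 480 min = 8 hours

# Pipelined: a new load enters the washer every step time, so the
# last load finishes (STEPS + LOADS - 1) step times after the start.
pipelined_min = (STEPS + LOADS - 1) * STEP_MIN  # 210 min = 3.5 hours

print(serial_min / 60, pipelined_min / 60)      # 8.0 3.5
```

Starting at 6 PM, 8 hours serial ends at 2 AM, while 3.5 hours pipelined ends at 9:30 PM, matching the slides.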
Speedup
• Speedup
– Ratio of serial time to parallel
– Metric to compare advantages of parallel operations
S = Tseries / Tparallel
Find the laundry speedup?
A computer pipeline
• Assume
– Instructions require multiple clocks to complete
– Each instruction follows approximately the
same steps (stages)
• Method
– Start initial instruction on first clock
– On following clocks start subsequent
instructions
MIPS instruction steps/stages
1. IF: Fetch instruction from memory
2. ID: Read registers while decoding instruction
3. EX: Execute the operation or calculate an
address
4. MEM: Access an operand in data memory
5. WB: Write the result into a register
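As a rough illustration (not from the course materials), the stage each instruction occupies on each clock can be tabulated in Python:

```python
# Tabulate stage occupancy per clock for a 5-stage MIPS pipeline.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(instr, clock):
    """Stage occupied by instruction `instr` (0-based) at `clock` (1-based),
    assuming instruction i enters IF on clock i + 1."""
    idx = clock - 1 - instr
    return STAGES[idx] if 0 <= idx < len(STAGES) else "--"

for i in range(5):
    row = [stage_of(i, c) for c in range(1, 10)]
    print(f"instr {i + 1}: " + " ".join(f"{s:>3}" for s in row))
```

The printed table shows the first instruction finishing WB on clock 5, the same clock on which the fifth instruction starts IF.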
MIPS pipeline
IF
 First instruction ends
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
Fifth instruction starts 
Lillevik 333f06-l20
University of Portland
School of Engineering
WB
9
Find the MIPS pipeline speedup?
Assume five instructions
What about a large program?
Series

  Tseries = N · Tn = T1 + … + TN = 5N

Pipelined

  Tparallel = 5 + 1 + … + 1 = 5 + (N − 1)

Speedup

  S = Tseries / Tparallel = 5N / (5 + (N − 1)) = 5 / (5/N + 1 − 1/N)

  lim (N → ∞) S = 5 / (0 + 1 − 0) = 5
Speedup of pipeline with p stages?
MIPS pipelined datapath
Pipeline registers added to datapath
Pipelined Control

Control is generated during instruction decode and carried forward in the
pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB):
• EX-stage signals are used in EX, then dropped
• M-stage signals are saved for the MEM stage (carried through EX/MEM)
• WB-stage signals are saved for the WB stage (carried through MEM/WB)

Signals used in a later stage are determined by the instruction in IF/ID.
Datapath & pipelined control
Pentium 4 pipeline
• Twenty stages long
• Theoretical speedup of 20
• Hazards (forced sequential operations) reduce
speedup
– Some instructions executed “out of order” to
avoid hazard
– Multiple (optimistic) pipelines created, one
selected to create result, other data discarded
Early Pentium 4
• Socket 423/478
• 42 M transistors, 0.18 and 0.13 µm technology
• 2.0 GHz core frequency, ~60 W
• Integrated heat spreader, built-in thermal monitor
NetBurst Architecture
• Faster system bus
• Advanced transfer cache
• Advanced dynamic execution (execution
trace cache, enhanced branch prediction)
• Hyper pipelined technology
• Rapid execution engine
• Enhanced floating point and multi-media
(SSE2)
Architecture Overview
Front Side Bus
FSB Bandwidth
• Clocked at 100 MHz, quad “pumped” (clocks A and B)
• 128 B cache lines, 64-bit (8 B) accesses
• Split transactions, pipelined
• External bandwidth: 100 M × 8 B × 4 = 3.2 GB/s
• Makes better use of bus bandwidth
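The bandwidth arithmetic from the bullet above, spelled out in Python (a simple check, not course code):

```python
# FSB peak bandwidth: 100 MHz bus clock, 8 B per transfer, quad pumped
# (4 transfers per clock), as given on the slide.
clock_hz = 100e6
bytes_per_transfer = 8
transfers_per_clock = 4

bandwidth = clock_hz * bytes_per_transfer * transfers_per_clock
print(bandwidth / 1e9, "GB/s")  # 3.2 GB/s
```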
L2 Advanced Transfer Cache
Full-Speed L2 Cache
• Depth of 256 KB
• Eight-way set associative, 128 B line
• Wide instruction & data interface of 256 bits (32 B)
• Read latency of 7 clocks, but …
• Clocked at core frequency (2.0 GHz)
• Internal bandwidth, 32 x 2.0 G = 64 GB/s
• Optimizes data transfers to/from memory
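The internal bandwidth figure follows directly from the interface width and clock; a quick Python check (an illustration, not course code):

```python
# L2 internal bandwidth: 256-bit (32 B) interface clocked at the
# 2.0 GHz core frequency, as given on the slide.
interface_bytes = 32
core_hz = 2.0e9

bandwidth = interface_bytes * core_hz
print(bandwidth / 1e9, "GB/s")  # 64.0 GB/s
```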
L1 Data Cache
• Depth of 8 KB
• Four-way set associative, 64 B line
• Read latency of 2 clocks, but …
• Dual ported for one load & one store per clock
• Supports advanced pre-fetch algorithm
Dynamic Execution
Trace Cache & Branch Prediction
• Replaces traditional L1 instruction cache
• Trace cache contains ~12K decoded
instructions (micro-operations), removes
decode latency
• Improved branch prediction algorithm,
eliminates 33% of P3 mis-predictions
(pipeline stalls)
• Keeps correct instructions executing
Execution Engine
Hyper Pipelined Technology
• Execution pipeline contains 20 stages
– Out-of-order, speculative execution unit
– 126 instructions “in flight”
– Includes 48 loads, 24 stores
• Rapid execution engine
– 2 ALUs, 2X clocked (one instruction in ½ clock)
– 2 AGUs, 2X clocked
• Results in higher throughput and reduced latency
Streaming SIMD Extensions
• FPU and MMX
  – 128-bit format
  – AGU data movement register
• SSE2 (extends MMX and SSE)
  – 144 new instructions
  – DP floating-point
  – Integer
  – Cache and memory management
• Performance increases across broad range of applications
Find the laundry speedup?
S = Tseries / Tparallel = 8 hrs / 3.5 hrs ≈ 2.29
Find the MIPS pipeline speedup?
Assume five instructions
Tseries = 5 · N = 5 + 5 + 5 + 5 + 5 = 25
Tpipe = 5 + 1 + 1 + 1 + 1 = 9

S = Tseries / Tpipe = 25/9 ≈ 2.8
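The same count can be derived from per-instruction completion times; a small Python sketch (an illustration, not course code):

```python
# In the 5-stage pipeline, instruction i (0-based) finishes WB on clock 5 + i,
# so five instructions complete on clocks 5 through 9.
finish = [5 + i for i in range(5)]

t_pipe = finish[-1]    # 9 clocks, pipelined
t_series = 5 * 5       # 25 clocks, one instruction at a time
print(t_pipe, t_series, round(t_series / t_pipe, 2))  # 9 25 2.78
```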
Speedup of pipeline with p stages?
Series

  Tseries = N · Tn = T1 + … + TN = pN

Parallel

  Tparallel = p + 1 + … + 1 = p + (N − 1)

Speedup

  S = Tseries / Tparallel = pN / (p + (N − 1)) = p / (p/N + 1 − 1/N)

  lim (N → ∞) S = p / (0 + 1 − 0) = p