SUPERSCALAR
DESIGN PRIME
Zhao Zhang
CprE 381, Computer Organization and Assembly-Level Programming, Fall 2012
Original slides from CprE 581, Advanced Computer Architecture
History of Superscalar Design
First appearance in 1960s
• Scoreboarding
• Tomasulo Algorithm
Popular use since 1990s
• SGI MIPS processors
• Sun UltraSPARC
• DEC Alpha 21x64 series
• Intel/AMD processors
Now appearing in embedded processors
• Cortex-A9: Two-way, limited out-of-order
• Cortex-A15: Three-way, close to Intel/AMD design
Why Superscalar
Get more performance than scalar pipeline
Superscalar Techniques:
Deep pipeline
Multi-issue
Branch prediction
Register renaming
Out-of-order Execution
Speculative Execution
Memory disambiguation
Code Example
for (i = 0; i < 1000; i++)
    X[i] = X[i] + b;

; loop body, initialization not shown
; R4: &X[i], R5: &X[1000] (X + 1000*4), R6: b
Loop: LW   R8, 0(R4)      ; load X[i]; R4 holds &X[i]
      ADD  R8, R8, R6     ; X[i] = X[i] + b
      SW   R8, 0(R4)      ; store X[i]
      ADDI R4, R4, 4      ; next element
      SLT  R9, R4, R5     ; R9 = (R4 < R5)
      BNE  R9, R0, Loop   ; loop again while R4 < R5
Frontend and Backend
Frontend: In-order fetch, decode, and rename
Backend: Out-of-order issue, execute/writeback, in-order commit
Frontend may send “junk” instructions to the backend
• Junk instructions occur with branch mis-prediction or exceptions
• Design goal: Minimize the percentage of “junk” instructions
Backend must be able to detect and handle “junk” instructions
• Flush junk instructions upon detection
• In-order commit (retire) so that junk instructions won’t affect the “architectural state”
• Dozens of cycles likely for handling a branch mis-prediction
Frontend and Backend
[Figure: pipeline diagram with the frontend and backend marked, from “Cortex-A9 Processor Microarchitecture”, slide 6]
The Multi-Issue Factor
Multi-issue affects all pipeline stages: In the same cycle,
• N inst. are fetched: Usually from one I-cache block
• N inst. are decoded: Multiple decoders
• N inst. are renamed: Multi-ported renaming table, detecting intra-group dependence
In the backend
• Up to N inst. are scheduled: Multi-ported queue with broadcast
• N inst. read register file: Multi-ported register file
• M inst. are executed at functional units: Multiple functional units
• N inst. write back register values: Multi-ported register file
• N inst. are committed: Multi-banked reorder buffer, also involves the rename table
Note: “N” is not necessarily the same value across pipeline stages
Frontend: Branch Prediction
Branch prediction is critical to reducing “junk” instructions
Poor branch prediction performance is a “disaster”:
SPECint programs have on average ~15% branches
• Every 100 instructions contain 15 branches
• Assume 10% mis-prediction => 1.5 branch mis-predictions
• Assume 20-cycle mis-prediction penalty => 30 lost cycles
• Assume IPC=3.0 => 33.3 cycles for executing 100 inst
• 30 lost cycles on top of 33.3 useful cycles => ~90% loss from a 10% mis-prediction rate
• Mis-prediction penalty is workload-dependent, and can be significantly longer than 20 cycles
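The arithmetic above can be checked with a short sketch. The function name and parameter names are illustrative, not from the slides; the default values are the slide's assumptions (15% branches, 10% mis-prediction rate, 20-cycle penalty, IPC of 3.0).

```python
def misprediction_overhead(n_inst=100, branch_frac=0.15, mispredict_rate=0.10,
                           penalty_cycles=20, ipc=3.0):
    """Return (base_cycles, lost_cycles, relative_loss) for n_inst instructions."""
    branches = n_inst * branch_frac              # 15 branches per 100 instructions
    mispredicts = branches * mispredict_rate     # 1.5 mis-predictions
    lost_cycles = mispredicts * penalty_cycles   # 30 cycles lost to flushes
    base_cycles = n_inst / ipc                   # 33.3 cycles of useful execution
    return base_cycles, lost_cycles, lost_cycles / base_cycles


base, lost, loss = misprediction_overhead()
print(f"{base:.1f} base cycles, {lost:.1f} lost cycles, {loss:.0%} loss")
# prints: 33.3 base cycles, 30.0 lost cycles, 90% loss
```

Under these assumptions the effective IPC drops from 3.0 to about 100 / 63.3 ≈ 1.6.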
Frontend: Branch Prediction
Branch prediction is made every cycle
• Otherwise, instruction flow stops
• It’s done in parallel with instruction fetch
The backend sends back feedback about past predictions
[Figure: single-cycle loop. Pred-PC feeds the instruction cache (which outputs INST) and the target, branch, and return address predictors; the predictors produce the next Pred-PC and receive feedback from the backend]
Frontend: Branch Prediction
Three components in simple design
Branch Target Buffer (BTB): What’s the branch target?
Branch History Table (BHT): Is the branch taken or not?
Return Address Stack (RAS)
• Function return is a special type of branch instruction
• There are multiple valid branch targets for the return
How BTB and BHT work in general
• Bet the same patterns will repeat
• Use only PC and past branch outcome history in the prediction
Frontend: Branch Prediction
Branch Target Buffer with combined Branch History Table
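[Figure: each BTB entry holds a Branch PC, a Predicted PC, and extra prediction state bits (see later). The PC of the instruction in FETCH is compared (=?) against the stored Branch PC.]
• No: branch not predicted, proceed normally (Next PC = PC+4)
• Yes: instruction is a branch; use the predicted PC as the next PC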
From slides of CprE 581 Computer Systems Architecture
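A minimal sketch of the lookup described above, assuming a direct-mapped table indexed by the word address and a single state bit per entry. The class name `BTB`, the table size, and the example addresses are illustrative assumptions, not details from the slides.

```python
class BTB:
    """Direct-mapped branch target buffer with one prediction state bit per entry."""

    def __init__(self, n_entries=16):
        self.n_entries = n_entries
        self.entries = [None] * n_entries   # each entry: (branch_pc, predicted_pc, state_bit)

    def _index(self, pc):
        return (pc >> 2) % self.n_entries   # instructions are 4 bytes, word-aligned

    def predict(self, pc):
        """Return the predicted next PC for the instruction fetched at pc."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc and entry[2] == 1:
            return entry[1]                 # hit, state bit says taken: use predicted PC
        return pc + 4                       # miss or not taken: proceed normally

    def update(self, pc, target, taken):
        """Feedback from the backend: record the branch and its last outcome."""
        self.entries[self._index(pc)] = (pc, target, 1 if taken else 0)


LW_PC, BNE_PC = 0x100, 0x114                # hypothetical addresses of the loop's LW and BNE
btb = BTB()
print(hex(btb.predict(BNE_PC)))             # 0x118: no entry yet, fall through
btb.update(BNE_PC, LW_PC, taken=True)
print(hex(btb.predict(BNE_PC)))             # 0x100: predicted taken, back to LW
```

A real BTB would store only a partial tag and may be set-associative; the full-PC compare here keeps the sketch short.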
Frontend: Branch Prediction
First time fetching at BNE: Predicted as Not Taken

Inst   Branch PC   Predicted PC   State bit   Prediction
LW     --          --             0           => NT, right
ADD    --          --             0           => NT, right
SW     --          --             0           => NT, right
ADDI   --          --             0           => NT, right
SLT    --          --             0           => NT, right
BNE    --          --             0           => NT, WRONG

Loop: LW   R8, 0(R4)      ; load X[i]; R4 holds &X[i]
      ADD  R8, R8, R6     ; X[i] = X[i] + b
      SW   R8, 0(R4)      ; store X[i]
      ADDI R4, R4, 4      ; next element
      SLT  R9, R4, R5     ; end of array?
      BNE  R9, R0, Loop   => mis-prediction on 1st fetch
Frontend: Branch Prediction
Inst   Branch PC   Predicted PC   State bit
LW     --          --             0
ADD    --          --             0
SW     --          --             0
ADDI   --          --             0
SLT    --          --             0
BNE    BNE-PC      LW-PC          1

What happens after the mis-prediction
1. The frontend starts fetching junk instructions, probably dozens of them
2. The backend detects the mis-prediction, flushes the backend pipeline, and notifies the frontend about the mis-predicted branch
3. The frontend updates the BTB/BHT, filling in BNE-PC and LW-PC and changing the prediction state bit
4. The frontend restarts fetching from LW-PC
Frontend: Branch Prediction
2nd time fetching at BNE: Predicted as Taken, jump to LW-PC

Inst   Branch PC   Predicted PC   State bit   Prediction
LW     --          --             0           => NT, right
ADD    --          --             0           => NT, right
SW     --          --             0           => NT, right
ADDI   --          --             0           => NT, right
SLT    --          --             0           => NT, right
BNE    BNE-PC      LW-PC          1           => Taken, RIGHT

Loop: LW   R8, 0(R4)      ; load X[i]; R4 holds &X[i]
      ADD  R8, R8, R6     ; X[i] = X[i] + b
      SW   R8, 0(R4)      ; store X[i]
      ADDI R4, R4, 4      ; next element
      SLT  R9, R4, R5     ; end of array?
=>    BNE  R9, R0, Loop   ;
Frontend: Branch Prediction
Inst   Branch PC   Predicted PC   State bit
LW     --          --             0
ADD    --          --             0
SW     --          --             0
ADDI   --          --             0
SLT    --          --             0
BNE    BNE-PC      LW-PC          0

Last time fetching at BNE-PC, predicted as Taken
• It’s wrong because the loop will exit
This time, the prediction state bit is changed to 0
• Next time the prediction outcome on BNE-PC is Not Taken
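The walkthrough above can be replayed with a small sketch of a 1-bit predictor for the BNE alone. It assumes the loop's BNE executes 1000 times (taken 999 times, then not taken at loop exit) and that the state bit starts at 0; the function name is illustrative.

```python
def run_one_bit(outcomes, state=0):
    """Replay branch outcomes through a 1-bit predictor.

    Returns the indices of mis-predicted executions and the final state bit.
    """
    mispredicted = []
    for i, taken in enumerate(outcomes):
        predicted_taken = (state == 1)
        if predicted_taken != taken:
            mispredicted.append(i)
        state = 1 if taken else 0       # feedback: remember the last outcome
    return mispredicted, state


outcomes = [True] * 999 + [False]       # BNE: taken 999 times, then loop exit
missed, final_state = run_one_bit(outcomes)
print(missed, final_state)              # [0, 999] 0
```

Only the first and the last execution of the BNE are mis-predicted, and the state bit ends at 0, matching the slides.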
Branch Prediction State Bit
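General Form
1. Access: look up the state using the PC
2. Predict: output T/NT
3. Feedback: T/NT updates the state
1-bit prediction
• State 1: Predict Taken; feedback T keeps state 1, feedback NT moves to state 0
• State 0: Predict Not Taken; feedback NT keeps state 0, feedback T moves to state 1
From CprE 581, Computer Systems Architecture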
Branch History Table
Branch direction prediction is usually more challenging
• BHT can be separated from BTB (often the case)
• 2-bit or 3-bit states are usually used
• BHT can be organized in two levels to predict on correlation between branches
• BHT can have sophisticated organizations to further improve accuracy
Return Address Stack: Works on return instructions, simple and effective (not to be discussed more)
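To illustrate why more than one state bit helps, here is a sketch of the standard saturating-counter scheme (not necessarily the exact state machine used in any of the processors named earlier). It assumes the 1000-iteration loop is itself executed 10 times, so the BNE sees 999 taken outcomes followed by one not-taken outcome, repeated.

```python
def count_mispredictions(outcomes, bits):
    """Count mis-predictions of an n-bit saturating counter starting at state 0."""
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)         # predict taken when state >= threshold
    state = 0
    misses = 0
    for taken in outcomes:
        if (state >= threshold) != taken:
            misses += 1
        # Feedback: count up on taken, down on not taken, saturating at both ends.
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return misses


one_run = [True] * 999 + [False]
outcomes = one_run * 10                 # the whole loop is re-entered 10 times
print(count_mispredictions(outcomes, bits=1))   # 20
print(count_mispredictions(outcomes, bits=2))   # 12
```

The 1-bit predictor misses twice per run of the loop (entry and exit). The 2-bit counter needs two misses to warm up, then misses only at each loop exit, because a single not-taken outcome does not flip its prediction.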
Frontend: Register Renaming
Consider two loop iterations: they conflict on register usage (R8, R4, R9), so they cannot be executed in parallel as written, even though they are mostly independent
      LW   R8, 0(R4)      ; load X[i]; R4 holds &X[i]
      ADD  R8, R8, R6     ; X[i] = X[i] + b
      SW   R8, 0(R4)      ; store X[i]
      ADDI R4, R4, 4      ; next element
      SLT  R9, R4, R5     ; end of array?
      BNE  R9, R0, Loop   ;

      LW   R8, 0(R4)      ; load X[i]; R4 holds &X[i]
      ADD  R8, R8, R6     ; X[i] = X[i] + b
      SW   R8, 0(R4)      ; store X[i]
      ADDI R4, R4, 4      ; next element
      SLT  R9, R4, R5     ; end of array?
      BNE  R9, R0, Loop   ;
Frontend: Register Renaming
Rename architectural registers to physical registers, remove false dependence and keep true dependence
      LW   P32, 0(P4)     ; load X[i]; P4 holds &X[i]
      ADD  P33, P32, P6   ; X[i] = X[i] + b
      SW   P33, 0(P4)     ; store X[i]
      ADDI P34, P4, 4     ; next element
      SLT  P35, P34, P5   ; end of array?
      BNE  P35, P0, Loop  ;

      LW   P36, 0(P34)    ; load X[i]; P34 holds &X[i]
      ADD  P37, P36, P6   ; X[i] = X[i] + b
      SW   P37, 0(P34)    ; store X[i]
      ADDI P38, P34, 4    ; next element
      SLT  P39, P38, P5   ; end of array?
      BNE  P39, P0, Loop  ;
Frontend: Register Renaming
How the design works:
• There is a register mapping table that maps architectural registers to physical registers
• There is a queue of free physical registers
• Every instruction with an output register is assigned an unused, free physical register
• Another mapping table is used to recover from a mis-predicted path
• There are a number of design variants in real processors
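The mapping table and free queue can be sketched as follows. This is a simplified model under stated assumptions: 32 architectural registers initially mapped to P0..P31, free physical registers P32..P63 handed out in order, and no recycling of freed registers. The tuple encoding of instructions and the function name are illustrative.

```python
from collections import deque


def rename(program, n_arch=32, n_phys=64):
    """Rename (op, dst, srcs) tuples; returns the renamed program and final map."""
    map_table = {f"R{i}": f"P{i}" for i in range(n_arch)}       # arch -> phys
    free_list = deque(f"P{i}" for i in range(n_arch, n_phys))   # unused phys regs
    renamed = []
    for op, dst, srcs in program:
        # Sources read the current mapping first (keeps true dependences).
        new_srcs = [map_table.get(s, s) for s in srcs]          # immediates pass through
        new_dst = None
        if dst is not None:
            new_dst = free_list.popleft()                       # fresh register per output
            map_table[dst] = new_dst                            # later readers see the new name
        renamed.append((op, new_dst, new_srcs))
    return renamed, map_table


body = [
    ("LW",   "R8", ["R4"]),
    ("ADD",  "R8", ["R8", "R6"]),
    ("SW",   None, ["R8", "R4"]),
    ("ADDI", "R4", ["R4", "4"]),
    ("SLT",  "R9", ["R4", "R5"]),
    ("BNE",  None, ["R9", "R0"]),
]
renamed, final_map = rename(body * 2)                           # two loop iterations
for op, dst, srcs in renamed:
    print(op, dst, srcs)
print(final_map["R4"], final_map["R8"], final_map["R9"])        # P38 P37 P39
```

With these assumptions the output reproduces the P32..P39 assignment of the renaming example: the second iteration reads P34 for R4 and writes P36..P39, so it no longer conflicts with the first.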
Frontend: Register Renaming
The roles of register renaming:
• Remove register name dependence, keep true data dependence, so that more instructions can be safely reordered
• Help backend implement speculative execution, as junk instructions cannot affect the input of good instructions
• A younger instruction writes to a newly assigned physical register, so it cannot affect the input of older instructions
• A good instruction is always older than any junk instruction
Backend: Out-Of-Order Scheduling
Common Design: Issue Queue
Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
LW    yes    P32  P4    yes     0x0   yes     1    1
ADD   yes    P33  P32   no      P6    yes     2    -
SW    yes    --   P33   no      P4    yes     3    2
ADDI  yes    P34  P4    yes     0x4   yes     4    -
SLT   yes    P35  P34   no      P5    yes     5    -
BNE   yes    --   P35   no      P0    yes     6    -
Backend: Out-Of-Order Scheduling
Schedule: Select ready instructions, broadcast their tag (dst) to all other instructions for matching

Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
LW    yes    P32  P4    yes     0x0   yes     1    1
ADD   yes    P33  P32   no      P6    yes     2    -
SW    yes    --   P33   no      P4    yes     3    2
ADDI  yes    P34  P4    yes     0x4   yes     4    -
SLT   yes    P35  P34   no      P5    yes     5    -
BNE   yes    --   P35   no      P0    yes     6    -
Backend: Out-Of-Order Scheduling
After LW and ADDI are issued, assume no new instructions
Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
--    no     --   --    --      --    --      --   --
ADD   yes    P33  P32   yes     P6    yes     2    -
SW    yes    --   P33   no      P4    yes     3    2
--    no     --   --    --      --    --      --   --
SLT   yes    P35  P34   yes     P5    yes     5    -
BNE   yes    --   P35   no      P0    yes     6    -
Backend: Out-Of-Order Scheduling
After ADD and SLT are issued, assume no new instructions
Op    busy?  dst  src1  ready?  src2  ready?  ROB  LSQ
--    no     --   --    --      --    --      --   --
--    no     --   --    --      --    --      --   --
SW    yes    --   P33   yes     P4    yes     3    2
--    no     --   --    --      --    --      --   --
--    no     --   --    --      --    --      --   --
BNE   yes    --   P35   yes     P0    yes     6    -
Backend: Out-Of-Order Scheduling
How the design works
• Instructions are sent to the issue queue after renaming
• A select logic chooses up to N instructions, all dependence-free, to be executed
• The tags of the selected instructions are broadcast to all other queue entries
• A wakeup logic clears the dependence of other instructions on the selected instructions
Two major design variants: Issue Queue vs. Reservation Station
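The select and wakeup steps can be sketched as a loop over the queue. This is a simplified model, assuming every instruction (including the load) produces its result one cycle after issue, oldest-first selection, and an issue width of 2; the function name and tuple encoding are illustrative.

```python
def schedule(queue, ready_regs, width=2):
    """Issue (op, dst, srcs) entries out of order; returns ops issued per cycle."""
    ready = set(ready_regs)             # physical registers whose values are available
    waiting = list(queue)               # kept in program order (oldest first)
    cycles = []
    while waiting:
        # Select: up to `width` of the oldest entries whose sources are all ready.
        selected = [inst for inst in waiting
                    if all(src in ready for src in inst[2])][:width]
        if not selected:
            raise RuntimeError("no instruction can issue")
        # Wakeup: broadcast each selected dst tag so dependents become ready.
        for inst in selected:
            waiting.remove(inst)
            if inst[1] is not None:
                ready.add(inst[1])
        cycles.append([inst[0] for inst in selected])
    return cycles


issue_queue = [
    ("LW",   "P32", ["P4"]),
    ("ADD",  "P33", ["P32", "P6"]),
    ("SW",   None,  ["P33", "P4"]),
    ("ADDI", "P34", ["P4"]),
    ("SLT",  "P35", ["P34", "P5"]),
    ("BNE",  None,  ["P35", "P0"]),
]
cycles = schedule(issue_queue, ready_regs=["P0", "P4", "P5", "P6"])
print(cycles)   # [['LW', 'ADDI'], ['ADD', 'SLT'], ['SW', 'BNE']]
```

The issue order matches the queue snapshots in the slides: LW and ADDI first, then ADD and SLT, then SW and BNE. A real scheduler would also account for operation latencies and functional-unit availability.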
Backend: Register Read, Data Forwarding and Writeback
[Figure: pipeline stages Issue (scheduling), Reg-Read, Execute, Writeback. The issue queue feeds the register file; operands pass through a forwarding network to the functional units: Load, Store, Int, Mult, Div, Other]
Note: In reservation-station design, register-read happens before instruction scheduling
Reorder Buffer and In-Order Commit
[Figure: the reorder buffer as a circular queue with head and tail pointers, shown in three states as entries are freed and allocated]
Reorder Buffer and In-Order Commit
“Architectural Register State” changes in program order
Junk instructions may produce values, but their values never appear in the “Architectural Register State”
• Junk instructions will be flushed upon detection
Instructions enter and leave ROB in program order
[Figure: reorder buffer entry fields: Branch or L/W?, Dest arch reg, Dest phy reg, Exceptions?, Program Counter, Ready?]
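A minimal sketch of in-order commit, assuming a ROB entry carries only an op name, destination registers, a ready flag, and a mis-prediction flag. For simplicity the flush happens when the mis-predicted branch commits; real designs typically flush as soon as the backend detects the mis-prediction. The class and field names are illustrative.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()          # left end = head (oldest), right end = tail

    def allocate(self, op, arch_dst=None, phys_dst=None):
        """Enter the ROB at the tail, in program order."""
        entry = {"op": op, "arch_dst": arch_dst, "phys_dst": phys_dst,
                 "ready": False, "mispredicted": False}
        self.entries.append(entry)
        return entry

    def commit(self, arch_map):
        """Retire ready entries from the head only, updating the architectural map."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            entry = self.entries.popleft()
            if entry["arch_dst"] is not None:
                arch_map[entry["arch_dst"]] = entry["phys_dst"]
            committed.append(entry["op"])
            if entry["mispredicted"]:
                self.entries.clear()    # everything younger is junk: flush it
                break
        return committed


arch_map = {"R4": "P4", "R8": "P8", "R9": "P9"}
rob = ReorderBuffer()
iter1 = [rob.allocate("LW", "R8", "P32"), rob.allocate("ADD", "R8", "P33"),
         rob.allocate("SW"), rob.allocate("ADDI", "R4", "P34"),
         rob.allocate("SLT", "R9", "P35"), rob.allocate("BNE")]
junk = [rob.allocate("LW", "R8", "P36"), rob.allocate("ADD", "R8", "P37")]

iter1[3]["ready"] = True                # ADDI finishes first, out of order
first = rob.commit(arch_map)            # nothing retires: LW at the head is not ready
for entry in iter1 + junk:
    entry["ready"] = True               # junk instructions may produce values too
iter1[5]["mispredicted"] = True         # the BNE turns out to be mis-predicted
second = rob.commit(arch_map)
print(first, second)                    # [] ['LW', 'ADD', 'SW', 'ADDI', 'SLT', 'BNE']
print(arch_map)                         # {'R4': 'P34', 'R8': 'P33', 'R9': 'P35'}
```

The junk LW and ADD were ready, yet P36 and P37 never reach the architectural map, which is the property the slide describes.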
Recall the Renaming Example
Consider two loop iterations: Rename architectural registers to physical registers, remove false dependence and keep true dependence
      LW   P32, 0(P4)     ; load X[i]; P4 holds &X[i]
      ADD  P33, P32, P6   ; X[i] = X[i] + b
      SW   P33, 0(P4)     ; store X[i]
      ADDI P34, P4, 4     ; next element
      SLT  P35, P34, P5   ; end of array?
      BNE  P35, P0, Loop  ;

      LW   P36, 0(P34)    ; load X[i]; P34 holds &X[i]
      ADD  P37, P36, P6   ; X[i] = X[i] + b
      SW   P37, 0(P34)    ; store X[i]
      ADDI P38, P34, 4    ; next element
      SLT  P39, P38, P5   ; end of array?
      BNE  P39, P0, Loop  ;
Architectural Register State
Instruction stream (two loop iterations; the second iteration is on the mis-predicted path):
      LW   R8, 0(R4)
      ADD  R8, R8, R6
      SW   R8, 0(R4)
      ADDI R4, R4, 4
      SLT  R9, R4, R5
      BNE  R9, R0, Loop

      LW   R8, 0(R4)      ; mis-predicted path
      ADD  R8, R8, R6
      SW   R8, 0(R4)
      ADDI R4, R4, 4
      SLT  R9, R4, R5
      BNE  R9, R0, Loop

Architectural and speculative register mappings, in three snapshots:

           Snapshot 1        Snapshot 2        Snapshot 3
Arch reg   arch.   spec.     arch.   spec.     arch.   spec.
R0         P0      P0        P0      P0        P0      P0
R4         P4      P4        P4      P34       P34     P38
R5         P5      P5        P5      P5        P5      P5
R6         P6      P6        P6      P6        P6      P6
R8         P8      P8        P8      P33       P33     P37
R9         P9      P9        P9      P35       P35     P39
Summary
What we have learned
• In-order frontend vs. out-of-order backend
• Branch prediction to keep instruction flow
• Register renaming to remove name dependence and support speculative execution
• Out-of-order scheduling with issue queue
• In-order commit with re-order buffer
What we haven’t learned yet
• Memory disambiguation using load queue and store queue
• Details of complex real processors