9th Lecture
 Branch prediction (rest)
 Predication
 Intel Pentium II/III
 Intel Pentium 4
1
Hybrid Predictors
 The second strategy of McFarling is to combine multiple separate branch
predictors, each tuned to a different class of branches.
 Two or more predictors and a predictor selection mechanism are necessary
in a combining or hybrid predictor.
 McFarling: combination of a two-bit predictor and a gshare two-level adaptive predictor,
 Young and Smith: a compiler-based static branch prediction combined with a two-level adaptive type,
 and many more combinations!
 Hybrid predictors are often better than single-type predictors (a sketch of a combining predictor follows below).
2
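As an illustration, here is a minimal C sketch of such a combining predictor: a bimodal (two-bit) component, a gshare component, and a two-bit selector counter that chooses between them. Table sizes, history length and the update policy are assumptions for illustration, not McFarling's exact configuration.

#include <stdint.h>
#include <stdbool.h>

#define ENTRIES 4096                 /* illustrative table size (power of two) */

static uint8_t bimodal[ENTRIES];     /* 2-bit saturating counters, indexed by PC  */
static uint8_t gshare[ENTRIES];      /* 2-bit counters, indexed by PC xor history */
static uint8_t chooser[ENTRIES];     /* 2-bit selector: >=2 means "trust gshare"  */
static uint16_t ghist;               /* global branch history register            */

static void sat_update(uint8_t *c, bool taken)
{
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}

bool predict(uint32_t pc, bool *bim_pred, bool *gsh_pred)
{
    uint32_t bi = pc % ENTRIES;
    uint32_t gi = (pc ^ ghist) % ENTRIES;
    *bim_pred = bimodal[bi] >= 2;
    *gsh_pred = gshare[gi] >= 2;
    return (chooser[bi] >= 2) ? *gsh_pred : *bim_pred;   /* selector picks a component */
}

void update(uint32_t pc, bool taken, bool bim_pred, bool gsh_pred)
{
    uint32_t bi = pc % ENTRIES;
    uint32_t gi = (pc ^ ghist) % ENTRIES;
    /* train the selector only when the two components disagree */
    if (bim_pred != gsh_pred)
        sat_update(&chooser[bi], gsh_pred == taken);
    sat_update(&bimodal[bi], taken);
    sat_update(&gshare[gi], taken);
    ghist = (uint16_t)((ghist << 1) | taken);             /* shift in the outcome */
}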
Simulations of Grunwald 1998
Table 1.1: SAg, gshare and McFarling's combining predictor

Application  committed      conditional    taken      misprediction rate (%)
             instructions   branches       branches   SAg    gshare   combining
             (in millions)  (in millions)  (%)
compress      80.4           14.4          54.6       10.1   10.1      9.9
gcc          250.9           50.4          49.0       12.8   23.9     12.2
perl         228.2           43.8          52.6        9.2   25.9     11.4
go           548.1           80.3          54.5       25.6   34.4     24.1
m88ksim      416.5           89.8          71.7        4.7    8.6      4.7
xlisp        183.3           41.8          39.5       10.3   10.2      6.8
vortex       180.9           29.1          50.1        2.0    8.3      1.7
jpeg         252.0           20.0          70.0       10.3   12.5     10.4
mean         267.6           46.2          54.3        8.6   14.5      8.1
3
Results
 Simulation of Keeton et al. 1998 using an OLTP (online transaction processing) workload on a PentiumPro multiprocessor reported a misprediction rate of 14% with a branch instruction frequency of about 21%.
 The speculative execution factor, given by the number of instructions
decoded divided by the number of instructions committed, is 1.4 for the
database programs.
 Two different conclusions may be drawn from these simulation results:
 Branch predictors should be further improved
 and/or branch prediction is only effective if the branch is predictable.
 If a branch outcome is dependent on irregular data inputs, the branch
often shows an irregular behavior.
 Question: Confidence of a branch prediction?
4
4.3.4 Predicated Instructions and Multipath Execution
- Confidence Estimation
 Confidence estimation is a technique for assessing the quality of a
particular prediction.
 Applied to branch prediction, a confidence estimator attempts to assess the
prediction made by a branch predictor.
 A low-confidence branch is a branch that frequently changes its direction in an irregular way, making its outcome hard to predict or even unpredictable.
 Four classes possible:
 correctly predicted with high confidence C(HC),
 correctly predicted with low confidence C(LC),
 incorrectly predicted with high confidence I(HC), and
 incorrectly predicted with low confidence I(LC).
5
Implementation of a confidence estimator
 Information from the branch prediction tables is used:
 Use of saturation counter information to construct a confidence estimator:
 speculate more aggressively when the confidence level is higher.
 Use of a miss distance counter table (MDC):
 Each time a branch is predicted, the value in the MDC is compared to a threshold.
If the value is above the threshold, the branch is considered to have high confidence, and low confidence otherwise.
 A small number of branch history patterns typically leads to correct predictions in a PAs predictor scheme.
The confidence estimator assigns high confidence to a fixed set of patterns and low confidence to all others.
 Confidence estimation can be used for speculation control, thread switching in multithreaded processors, or multipath execution (a sketch of an MDC-based estimator follows below).
6
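A minimal C sketch of an MDC-based confidence estimator: a branch is classified as high confidence once it has been predicted correctly several times in a row since its last misprediction. The table size and threshold are illustrative assumptions, not a specific published design.

#include <stdint.h>
#include <stdbool.h>

#define MDC_ENTRIES   1024     /* assumed table size            */
#define MDC_THRESHOLD 4        /* assumed high-confidence limit */
#define MDC_MAX       15       /* 4-bit saturating counter      */

static uint8_t mdc[MDC_ENTRIES];   /* miss distance counters    */

/* High confidence if the branch has been predicted correctly at
 * least MDC_THRESHOLD times since its last misprediction.       */
bool high_confidence(uint32_t pc)
{
    return mdc[pc % MDC_ENTRIES] >= MDC_THRESHOLD;
}

/* Called when the branch resolves: reset on a miss, count up otherwise. */
void mdc_update(uint32_t pc, bool mispredicted)
{
    uint8_t *c = &mdc[pc % MDC_ENTRIES];
    if (mispredicted)      *c = 0;
    else if (*c < MDC_MAX) (*c)++;
}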
Predicated Instructions
 Provide predicated or conditional instructions and one or more predicate registers.
 Predicated instructions use a predicate register as an additional input operand.
 The Boolean result of a condition test is recorded in a (one-bit) predicate register.
 Predicated instructions are fetched, decoded and placed in the instruction window like non-predicated instructions.
 It depends on the processor architecture how far a predicated instruction proceeds speculatively in the pipeline before its predication is resolved:
 A predicated instruction executes only if its predicate is true; otherwise the instruction is discarded. In this case predicated instructions are not executed before the predicate is resolved.
 Alternatively, as reported for Intel's IA-64 ISA, the predicated instruction may be executed, but commits only if the predicate is true; otherwise the result is discarded.
7
Predication Example
if (x == 0) {             /* branch b1 */
    a = b + c;
    d = e - f;
}
g = h * i;                /* instruction independent of branch b1 */

After if-conversion with predicated instructions:

(Pred = (x == 0))         /* branch b1: Pred is set to true if x equals 0 */
if Pred then a = b + c;   /* The operations are only performed */
if Pred then d = e - f;   /* if Pred is set to true */
g = h * i;                /* instruction independent of branch b1 */
8
Predication
 Able to eliminate a branch and therefore the associated branch prediction
 increasing the distance between mispredictions.
 The run length of a code block is increased  better compiler scheduling.
 Predication affects the instruction set, adds a port to the register file, and
complicates instruction execution.
 Predicated instructions that are discarded still consume processor
resources; especially the fetch bandwidth.
 Predication is most effective when control dependences can be completely
eliminated, such as in an if-then with a small then body.
 The use of predicated instructions is limited when the control flow
involves more than a simple alternative sequence.
9
Eager (Multipath) Execution
 Execution proceeds down both paths of a branch, and no prediction is made.
 When a branch resolves, all operations on the non-taken path are discarded.
 Oracle execution: eager execution with unlimited resources
 gives the same theoretical maximum performance as a perfect branch prediction
 With limited resources, the eager execution strategy must be employed carefully.
 A mechanism is required that decides when to employ prediction and when eager execution: e.g. a confidence estimator
 Rarely implemented (IBM mainframes) but some research projects:
 Dansoft processor, Polypath architecture, selective dual path execution, simultaneous
speculation scheduling, disjoint eager execution
10
[Figure: speculation trees with cumulative path probabilities (example prediction accuracy 0.7) for
(a) single path speculative execution,
(b) full eager execution,
(c) disjoint eager execution]
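The numbers in the figure can be reproduced directly: with a uniform prediction accuracy of 0.7 (the figure's example value), the predicted path after k unresolved branches has probability 0.7^k, while the side path forked at depth k has probability 0.3 * 0.7^(k-1). Disjoint eager execution therefore always assigns the next unit of resources to the most probable not-yet-followed path. A small C sketch:

#include <stdio.h>

int main(void)
{
    const double p = 0.7;          /* example prediction accuracy from the figure */
    double predicted = 1.0;

    for (int depth = 1; depth <= 6; depth++) {
        predicted *= p;            /* probability of staying on the predicted path */
        double side = predicted / p * (1.0 - p);   /* side path forked at this depth */
        printf("depth %d: predicted path %.2f, side path %.2f\n",
               depth, predicted, side);
    }
    return 0;                      /* prints 0.70/0.30, 0.49/0.21, 0.34/0.15, ... */
}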
4.3.5 Prediction of Indirect Branches
 Indirect branches, which transfer control to an address stored in a register, are harder to predict accurately.
 Indirect branches occur frequently in machine code compiled from object-oriented programs like C++ and Java programs.
 One simple solution is to update the PHT to include the branch target addresses (a sketch of such a target table follows below).
12
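A hedged sketch of the idea: for indirect branches the prediction table stores the last observed target address instead of a taken/not-taken counter; here it is indexed by the branch address xor-ed with a global history register (table size and hashing are illustrative assumptions):

#include <stdint.h>

#define TGT_ENTRIES 512                 /* illustrative table size */

static uint32_t target_cache[TGT_ENTRIES];
static uint16_t path_hist;              /* global (path) history   */

/* Predict the target of an indirect branch at 'pc'. Returns 0 if no
 * target has been recorded yet (the caller then falls back, e.g. stalls). */
uint32_t predict_target(uint32_t pc)
{
    return target_cache[(pc ^ path_hist) % TGT_ENTRIES];
}

/* On resolution, store the actual target and fold it into the history. */
void update_target(uint32_t pc, uint32_t actual_target)
{
    target_cache[(pc ^ path_hist) % TGT_ENTRIES] = actual_target;
    path_hist = (uint16_t)((path_hist << 2) ^ (actual_target & 0x3));
}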
Branch handling techniques and implementations
Technique                               Implementation examples
No branch prediction                    Intel 8086
Static prediction
  always not taken                      Intel i486
  always taken                          Sun SuperSPARC
  backward taken, forward not taken     HP PA-7x00
  semistatic with profiling             early PowerPCs
Dynamic prediction:
  1-bit                                 DEC Alpha 21064, AMD K5
  2-bit                                 PowerPC 604, MIPS R10000, Cyrix 6x86 and M2, NexGen 586
  two-level adaptive                    Intel PentiumPro, Pentium II, AMD K6, Athlon
Hybrid prediction                       DEC Alpha 21264
Predication                             Intel/HP Merced and most signal processors, e.g. ARM processors, TI TMS320C6201, and many others
Eager execution (limited)               IBM mainframes: IBM 360/91, IBM 3090
Disjoint eager execution                none yet
13
High-Bandwidth Branch Prediction
 Future microprocessors will require more than one prediction per cycle to start speculation over multiple branches in a single cycle,
 e.g. the GAg predictor is independent of the branch address.
 When multiple branches are predicted per cycle, then instructions must be fetched from multiple target addresses per cycle, complicating I-cache access.
 Possible solution: trace cache in combination with next-trace prediction.
 Most likely a combination of branch handling techniques will be applied,
 e.g. a multi-hybrid branch predictor combined with support for context switching, indirect jumps, and interference handling.
14
The Intel P5 and P6 family
Year  Type                 Transistors  Technology  Clock      Issue  Word    L1 cache         L2 cache
                           (x 1000)     (µm)        (MHz)             format
1993  Pentium               3100        0.8         66          2     32-bit  2 x 8 kB         -
1994  Pentium               3200        0.6         75-100      2     32-bit  2 x 8 kB         -
1995  Pentium               3200        0.6/0.35    120-133     2     32-bit  2 x 8 kB         -
1996  Pentium               3300        0.35        150-166     2     32-bit  2 x 8 kB         -
1997  Pentium MMX           4500        0.35        200-233     2     32-bit  2 x 16 kB        -
1998  Mobile Pentium MMX    4500        0.25        200-233     2     32-bit  2 x 16 kB        -
1995  PentiumPro            5500        0.35        150-200     3     32-bit  2 x 8 kB         256/512 kB
1997  PentiumPro            5500        0.35        200         3     32-bit  2 x 8 kB         1 MB
1998  Intel Celeron         7500        0.25        266-300     3     32-bit  2 x 16 kB        -
1998  Intel Celeron        19000*       0.25        300-333     3     32-bit  2 x 16 kB        128 kB
1997  Pentium II            7000        0.25        233-450     3     32-bit  2 x 16 kB        256 kB/512 kB
1998  Mobile Pentium II     7000        0.25        300         3     32-bit  2 x 16 kB        256 kB/512 kB
1998  Pentium II Xeon       7000        0.25        400-450     3     32-bit  2 x 16 kB        512 kB/1 MB
1999  Pentium II Xeon       7000        0.25        450         3     32-bit  2 x 16 kB        512 kB/2 MB
1999  Pentium III           8200        0.25        450-1000    3     32-bit  2 x 16 kB        512 kB
1999  Pentium III Xeon      8200        0.25        500-1000    3     32-bit  2 x 16 kB        512 kB
2000  Pentium 4            42000        0.18        1500        3     32-bit  8 kB / 12k µOps  256 kB

* including L2 cache
15
Micro-Dataflow in PentiumPro 1995
 ... The flow of the Intel Architecture instructions is predicted and these instructions are decoded into micro-operations (µops), or series of µops, and these µops are register-renamed, placed into an out-of-order speculative pool of pending operations, executed in dataflow order (when operands are ready), and retired to permanent machine state in source program order. ...
 R.P. Colwell, R.L. Steck: A 0.6-µm BiCMOS Processor with Dynamic Execution, International Solid State Circuits Conference, Feb. 1995.
16
PentiumPro and Pentium II/III
 The Pentium II/III processors use the same dynamic execution microarchitecture as the other members of the P6 family.
 This three-way superscalar, pipelined micro-architecture features a decoupled, multi-stage superpipeline, which trades less work per pipestage for more stages.
 The Pentium II/III processor has twelve stages with a pipestage time 33 percent less than the Pentium processor, which helps achieve a higher clock rate on any given manufacturing process.
 A wide instruction window using an instruction pool.
 Optimized scheduling requires the fundamental “execute” phase to be replaced by decoupled “issue/execute” and “retire” phases. This allows instructions to be started in any order but always be retired in the original program order.
 Processors in the P6 family may be thought of as three independent engines coupled with an instruction pool.
17
Pentium® Pro Processor and Pentium II/III
Microarchitecture
18
[Figure: Pentium II/III block diagram — external bus, bus interface unit, L2 cache, instruction fetch unit (with I-cache), branch target buffer, instruction decode unit, microcode instruction sequencer, register alias table, reservation station unit, functional units, reorder buffer & retirement register file, memory reorder buffer, memory interface unit, D-cache unit]
Pentium II/III: The In-Order Section
 The instruction fetch unit (IFU) accesses a non-blocking I-cache; it contains the Next IP unit.
 The Next IP unit provides the I-cache index (based on inputs from the BTB), trap/interrupt status, and branch-misprediction indications from the integer FUs.
 Branch prediction:
 two-level adaptive scheme of Yeh and Patt,
 the BTB contains 512 entries, maintains branch history information and the predicted branch target address.
 Branch misprediction penalty: at least 11 cycles, on average 15 cycles.
 The instruction decoder unit (IDU) is composed of three separate decoders.
20
Pentium II/III: The In-Order Section (Continued)
 A decoder breaks the IA-32 instruction down into µops, each comprising an opcode, two source operands, and one destination operand. These µops are of fixed length.
 Most IA-32 instructions are converted directly into single µops (by any of the three decoders),
 some instructions are decoded into one to four µops (by the general decoder),
 more complex instructions are used as indices into the microcode instruction sequencer (MIS), which will generate the appropriate stream of µops.
 The µops are sent to the register alias table (RAT) where register renaming is performed,
i.e., the logical IA-32 based register references are converted into references to physical registers (a renaming sketch follows below).
 Then, with added status information, µops continue to the reorder buffer (ROB, 40 entries) and to the reservation station unit (RSU, 20 entries).
21
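A minimal sketch of the renaming step performed in the RAT, assuming for illustration that the reorder buffer entries act as the physical registers (sizes simplified, free-list and overflow handling omitted; this is not the actual P6 circuit):

#define ARCH_REGS 8        /* eight IA-32 general purpose registers          */
#define ROB_SIZE  40       /* the P6 reorder buffer has 40 entries           */

static int rat[ARCH_REGS]; /* ROB entry of the latest producer, -1 = in RRF  */
static int rob_tail;       /* next ROB entry to allocate (overflow ignored)  */

typedef struct { int src1, src2, dest_rob; } renamed_uop;

void rat_init(void)
{
    for (int i = 0; i < ARCH_REGS; i++)
        rat[i] = -1;       /* all registers initially hold committed values  */
}

/* Rename one uop: sources are redirected to the ROB entries of their
 * producers (or -1 if the committed value in the RRF is current), and
 * the destination register is mapped to a freshly allocated ROB entry. */
renamed_uop rename(int src1_arch, int src2_arch, int dest_arch)
{
    renamed_uop u;
    u.src1 = rat[src1_arch];
    u.src2 = rat[src2_arch];
    u.dest_rob = rob_tail;
    rat[dest_arch] = rob_tail;              /* later readers see this entry  */
    rob_tail = (rob_tail + 1) % ROB_SIZE;
    return u;
}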
The Fetch/Decode Unit
[Figure: (a) in-order section — I-cache, instruction fetch unit with Next_IP and branch target buffer, alignment, instruction decode unit, register alias table; (b) instruction decoder unit (IDU) — two simple decoders and one general decoder plus the microcode instruction sequencer, translating IA-32 instructions into µops]
22
The Out-of-Order Execute Section
 When the µops flow into the ROB, they effectively take a place in program order.
 µops also go to the RSU, which forms a central instruction window with 20 reservation stations (RS), each capable of hosting one µop.
 µops are issued to the FUs according to dataflow constraints and resource availability, without regard to the original ordering of the program (see the issue sketch below).
 After completion the result goes to two different places, RSU and ROB.
 The RSU has five ports and can issue at a peak rate of 5 µops each cycle.
23
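A minimal sketch of this dataflow issue policy, with the reservation station entries reduced to the fields needed for the selection decision (an illustration, not the real RSU logic):

#include <stdbool.h>

#define RS_ENTRIES 20                 /* RSU holds 20 reservation stations */

typedef struct {
    bool valid;                       /* entry occupied                     */
    bool src1_ready, src2_ready;      /* operand availability               */
    int  port;                        /* execution port this uop needs      */
} rs_entry;

static rs_entry rsu[RS_ENTRIES];

/* Return the index of an issuable uop for 'free_port', or -1 if none.
 * Program order is ignored: only dataflow and resource constraints count. */
int select_for_issue(int free_port)
{
    for (int i = 0; i < RS_ENTRIES; i++) {
        rs_entry *e = &rsu[i];
        if (e->valid && e->src1_ready && e->src2_ready && e->port == free_port)
            return i;
    }
    return -1;
}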
Latencies and throughput for Pentium II/III FUs

RSU Port  FU                           Latency  Throughput
0         Integer arithmetic/logical   1        1
          Shift                        1        1
          Integer mul                  4        1
          Floating-point add           3        1
          Floating-point mul           5        0.5
          Floating-point div           long     nonpipelined
          MMX arithmetic/logical       1        1
          MMX mul                      3        1
1         Integer arithmetic/logical   1        1
          MMX arithmetic/logical       1        1
          MMX shift                    1        1
2         Load                         3        1
3         Store address                3        1
4         Store data                   1        1
24
Issue/Execute Unit
[Figure: the reservation station unit (to/from the reorder buffer) issues over Port 0 (integer, floating-point and MMX functional units), Port 1 (integer, MMX and jump functional units), Port 2 (load functional unit), Port 3 (store functional unit) and Port 4 (store functional unit)]
The In-Order Retire Section
 A µop can be retired
 if its execution is completed,
 if it is its turn in program order,
 and if no interrupt, trap, or misprediction occurred.
 Retirement means taking data that was speculatively created and writing it into the retirement register file (RRF).
 Three µops per clock cycle can be retired (a sketch of the retirement rule follows below).
26
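A minimal sketch of the retirement rule stated above, with the ROB entry reduced to the fields needed for the decision (an illustration, not the actual retirement hardware):

#include <stdbool.h>

#define ROB_SIZE 40

typedef struct {
    bool valid, completed, fault;     /* fault covers interrupt/trap/misprediction */
    int  dest_arch;                   /* architectural register to update in RRF   */
    long value;                       /* speculative result                         */
} rob_entry;

static rob_entry rob[ROB_SIZE];
static int rob_head;                  /* oldest uop in program order */
static long rrf[8];                   /* retirement register file    */

/* Retire up to three completed uops from the head of the ROB per cycle. */
void retire_cycle(void)
{
    for (int n = 0; n < 3; n++) {
        rob_entry *e = &rob[rob_head];
        if (!e->valid || !e->completed || e->fault)
            break;                    /* must retire strictly in program order */
        rrf[e->dest_arch] = e->value; /* speculative result becomes permanent  */
        e->valid = false;
        rob_head = (rob_head + 1) % ROB_SIZE;
    }
}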
Retire Unit
[Figure: retire unit — reservation station unit and memory interface unit (to/from D-cache) feeding the retirement register file, to/from the reorder buffer]
27
The Pentium II/III Pipeline
[Figure: the Pentium II/III pipeline, shown in three panels (a)-(c) — fetch and predecode (BTB0, BTB1, IFU0-IFU2: BTB and I-cache access), decode (IDU0, IDU1), register renaming (RAT), reorder buffer read (ROB read), issue (RSU, ports 0-4), execution and completion, reorder buffer write-back (ROB write), retirement (RRF)]
28
Pentium® Pro Processor Basic Execution Environment
[Figure: eight 32-bit general purpose registers, six 16-bit segment registers, the 32-bit EFLAGS register, the 32-bit EIP (instruction pointer) register, and an address space from 0 to 2^32 - 1; the address space can be flat or segmented]
29
Application Programming Registers
30
Pentium III
31
Pentium II/III summary and offspring
 Pentium III in 1999, initially at 450 MHz (0.25 micron technology), former code name Katmai
 two 32 kB caches, faster floating-point performance
 Coppermine is a shrink of the Pentium III down to 0.18 micron.
32
Pentium 4
 Was announced for mid-2000 under the code name
Willamette
 native IA-32 processor with Pentium III processor core
 running at 1.5 GHz
 42 million transistors
 0.18 µm
 20 pipeline stages (integer pipeline), IF and ID not included
 trace execution cache (TEC) for the decoded µOps
 NetBurst micro-architecture
33
Pentium 4 Features
Rapid Execution Engine:
 Intel: “Arithmetic Logic Units (ALUs) run at twice the
processor frequency”
 Fact: two ALUs running at processor frequency, connected with a multiplexer running at twice the processor frequency
Hyper Pipelined Technology:
 Twenty-stage pipeline to enable high clock rates
 Frequency headroom and performance scalability
34
Advanced Dynamic Execution
 Very deep, out-of-order, speculative execution engine
 Up to 126 instructions in flight (3 times larger than the Pentium III processor)
 Up to 48 loads and 24 stores in pipeline (2 times larger than the Pentium III processor)
 Branch prediction
 based on µOPs
 4K entry branch target array (8 times larger than the Pentium III processor)
 new algorithm (not specified) that reduces mispredictions by about one third compared to the gshare predictor of the P6 generation
35
First level caches
 12k µOP Execution Trace Cache (~100 k)
 Execution Trace Cache that removes decoder latency
from main execution loops
 Execution Trace Cache integrates path of program
execution flow into a single line
 Low latency 8 kByte data cache with 2 cycle latency
36
Second level caches
 Included on the die
 size: 256 kB
 Full-speed, unified 8-way 2nd-level on-die Advance
Transfer Cache
 256-bit data bus to the level 2 cache
 Delivers ~45 GB/s data throughput (at 1.4 GHz
processor frequency)
 Bandwidth and performance increases with processor
frequency
37
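As a plausibility check of the quoted figure: a 256-bit bus moves 32 bytes per transfer, and 32 B x 1.4 GHz ≈ 44.8 GB/s, i.e. the stated ~45 GB/s; the bandwidth therefore scales linearly with the processor frequency.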
NetBurst Micro-Architecture
38
Streaming SIMD Extensions 2 (SSE2) Technology
 SSE2 extends MMX and SSE technology with the addition of 144 new instructions, which include support for:
 128-bit SIMD integer arithmetic operations,
 128-bit SIMD double-precision floating-point operations,
 cache and memory management operations.
 Further enhances and accelerates video, speech, encryption, image and photo processing (see the intrinsics example below).
39
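For illustration, these 128-bit SIMD integer operations are available to C programmers through the SSE2 intrinsics in <emmintrin.h>; a minimal example adding four packed 32-bit integers (variable names are arbitrary):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_set_epi32(4, 3, 2, 1);     /* pack four 32-bit integers */
    __m128i b = _mm_set_epi32(40, 30, 20, 10);
    __m128i sum = _mm_add_epi32(a, b);         /* one 128-bit SIMD addition */

    int out[4];
    _mm_storeu_si128((__m128i *)out, sum);     /* unaligned store to memory */
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* 11 22 33 44 */
    return 0;
}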
400 MHz Intel NetBurst micro-architecture system
bus
 Provides 3.2 GB/s throughput (3 times faster than the
Pentium III processor).
 Quad-pumped 100MHz scalable bus clock to achieve
400 MHz effective speed.
 Split-transaction, deeply pipelined.
 128-byte lines with 64-byte accesses.
40
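As a plausibility check: the data bus is 64 bits wide, so quad-pumping the 100 MHz bus clock gives 400 million 8-byte transfers per second, i.e. 8 B x 400 M/s = 3.2 GB/s.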
Pentium 4 data types
41
Pentium 4
42
Pentium 4 offspring
Foster
 Pentium 4 with external L3 cache and DDR-SDRAM support
 intended for servers
 clock rate 1.7 - 2 GHz
 to be launched in Q2/2001
Northwood
 0.13 µm technology
 new 478-pin socket
43