9th Lecture: Branch Prediction (rest), Predication, Intel Pentium II/III, Intel Pentium 4

Hybrid Predictors

The second strategy of McFarling is to combine multiple separate branch predictors, each tuned to a different class of branches. A combining or hybrid predictor needs two or more component predictors and a predictor selection mechanism. Examples:
- McFarling: combination of a two-bit predictor and a gshare two-level adaptive predictor
- Young and Smith: a compiler-based static branch prediction combined with a two-level adaptive predictor
- and many more combinations!
Hybrid predictors are often better than single-type predictors.

Simulations of Grunwald 1998

Table 1.1: SAg, gshare and McFarling's combining predictor

                committed     conditional  taken     misprediction rate (%)
Application     instructions  branches     branches  SAg    gshare  combining
                (millions)    (millions)   (%)
compress        80.4          14.4         54.6      10.1   10.1    9.9
gcc             250.9         50.4         49.0      12.8   23.9    12.2
perl            228.2         43.8         52.6      9.2    25.9    11.4
go              548.1         80.3         54.5      25.6   34.4    24.1
m88ksim         416.5         89.8         71.7      4.7    8.6     4.7
xlisp           183.3         41.8         39.5      10.3   10.2    6.8
vortex          180.9         29.1         50.1      2.0    8.3     1.7
jpeg            252.0         20.0         70.0      10.3   12.5    10.4
mean            267.6         46.2         54.3      8.6    14.5    8.1

Results

A simulation by Keeton et al. 1998 using an OLTP (online transaction processing) workload on a PentiumPro multiprocessor reported a misprediction rate of 14% with a branch instruction frequency of about 21%. The speculative execution factor, given by the number of instructions decoded divided by the number of instructions committed, is 1.4 for the database programs.

Two different conclusions may be drawn from these simulation results: branch predictors should be further improved, and/or branch prediction is only effective if the branch is predictable. If a branch outcome depends on irregular data inputs, the branch often shows irregular behavior.

Question: What is the confidence of a branch prediction?
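McFarling's combining scheme can be sketched in a few lines of C. This is a minimal illustration, not the lecture's reference design: the table sizes, the 1024-entry direct-mapped indexing, and the function names are my own assumptions. A per-branch chooser counter is trained toward whichever component predictor was right.

```c
#include <stdint.h>

/* Hypothetical parameters -- the lecture gives no concrete table sizes. */
#define ENTRIES 1024
#define MASK (ENTRIES - 1)

static uint8_t bimodal[ENTRIES]; /* 2-bit counters, indexed by PC          */
static uint8_t gshare[ENTRIES];  /* 2-bit counters, indexed by PC^history  */
static uint8_t chooser[ENTRIES]; /* 2-bit counters: which component to use */
static uint16_t history;         /* global branch history register         */

static void sat_update(uint8_t *c, int taken)
{
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}

/* Predict a branch, then update all tables with the real outcome.
   Returns the prediction (1 = taken). */
int predict_and_update(uint32_t pc, int taken)
{
    uint32_t bi = pc & MASK;
    uint32_t gi = (pc ^ history) & MASK;
    int p_bim = bimodal[bi] >= 2;
    int p_gsh = gshare[gi] >= 2;
    int pred  = (chooser[bi] >= 2) ? p_gsh : p_bim;

    /* Train the chooser toward whichever component was right. */
    if (p_bim != p_gsh)
        sat_update(&chooser[bi], p_gsh == taken);
    sat_update(&bimodal[bi], taken);
    sat_update(&gshare[gi], taken);
    history = (uint16_t)(((history << 1) | (taken != 0)) & MASK);
    return pred;
}
```

After a few repetitions of a strongly biased branch, the two-bit counters saturate and the prediction locks onto the bias, which is exactly why hybrid predictors beat either component on mixed workloads.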
4.3.4 Predicated Instructions and Multipath Execution

Confidence Estimation

Confidence estimation is a technique for assessing the quality of a particular prediction. Applied to branch prediction, a confidence estimator attempts to assess the prediction made by a branch predictor. A low-confidence branch is a branch that frequently changes its direction in an irregular way, making its outcome hard to predict or even unpredictable.

Four classes are possible:
- correctly predicted with high confidence C(HC)
- correctly predicted with low confidence C(LC)
- incorrectly predicted with high confidence I(HC)
- incorrectly predicted with low confidence I(LC)

Implementation of a Confidence Estimator

Information from the branch prediction tables is used:
- Use of saturating counter information to construct a confidence estimator: speculate more aggressively when the confidence level is higher.
- Use of a miss distance counter table (MDC): each time a branch is predicted, the value in the MDC is compared to a threshold. If the value is above the threshold, the branch is considered to have high confidence, and low confidence otherwise.
- A small number of branch history patterns typically leads to correct predictions in a PAs predictor scheme. The confidence estimator assigns high confidence to a fixed set of patterns and low confidence to all others.

Confidence estimation can be used for speculation control, thread switching in multithreaded processors, or multipath execution.

Predicated Instructions

Provide predicated or conditional instructions and one or more predicate registers. Predicated instructions use a predicate register as an additional input operand. The Boolean result of a condition test is recorded in a (one-bit) predicate register. Predicated instructions are fetched, decoded, and placed in the instruction window like non-predicated instructions.
How far a predicated instruction proceeds speculatively in the pipeline before its predicate is resolved depends on the processor architecture:
- A predicated instruction executes only if its predicate is true; otherwise the instruction is discarded. In this case predicated instructions are not executed before the predicate is resolved.
- Alternatively, as reported for Intel's IA-64 ISA, the predicated instruction may be executed, but it commits only if the predicate is true; otherwise the result is discarded.

Predication Example

if (x == 0) {   /* branch b1 */
    a = b + c;
    d = e - f;
}
g = h * i;      /* instruction independent of branch b1 */

becomes:

Pred = (x == 0)          /* branch b1: Pred is set to true if x equals 0 */
if Pred then a = b + c;  /* the operations are only performed            */
if Pred then d = e - f;  /* if Pred is set to true                       */
g = h * i;               /* instruction independent of branch b1         */

Predication

- Able to eliminate a branch and therefore the associated branch prediction, increasing the distance between mispredictions.
- The run length of a code block is increased, enabling better compiler scheduling.
- Predication affects the instruction set, adds a port to the register file, and complicates instruction execution.
- Predicated instructions that are discarded still consume processor resources, especially fetch bandwidth.
- Predication is most effective when control dependences can be completely eliminated, such as in an if-then with a small then body.
- The use of predicated instructions is limited when the control flow involves more than a simple alternative sequence.

Eager (Multipath) Execution

Execution proceeds down both paths of a branch, and no prediction is made. When a branch resolves, all operations on the non-taken path are discarded. Oracle execution, i.e. eager execution with unlimited resources, gives the same theoretical maximum performance as perfect branch prediction. With limited resources, the eager execution strategy must be employed carefully.
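With limited resources, one way to ration speculation is to follow only the most probable paths, which is the idea behind disjoint eager execution. A small sketch (my own illustration, not from the lecture): with a fixed per-branch taken-probability, the not-taken side path that leaves the main path after k correctly followed branches has cumulative probability p^k * (1 - p).

```c
/* Illustrative sketch: cumulative probability of the side path that
   forks off after k branches on a main path with per-branch
   taken-probability P_TAKEN (the 0.7 annotating the lecture's tree). */
#define P_TAKEN 0.7

double alt_path_prob(int k)
{
    double p = 1.0;                 /* probability of reaching branch k */
    for (int i = 0; i < k; i++)
        p *= P_TAKEN;
    return p * (1.0 - P_TAKEN);     /* ...times taking the other side   */
}
/* k = 0,1,2,3,4 gives .30, .21, .15, .10, .07 -- disjoint eager
   execution forks these side paths in decreasing-probability order. */
```

Sorting all unexplored side paths by this value and forking the best ones first spends limited execution resources where a misprediction is most likely to be recovered.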
A mechanism is required that decides when to employ prediction and when eager execution, e.g. a confidence estimator. Eager execution is rarely implemented (IBM mainframes), but there are several research projects: Dansoft processor, Polypath architecture, selective dual path execution, simultaneous speculation scheduling, disjoint eager execution.

(Figure: (a) single path speculative execution, (b) full eager execution, (c) disjoint eager execution; tree edges are annotated with cumulative path probabilities such as .7, .49, .34, .24, .21, .15, .10, .07, .05.)

4.3.5 Prediction of Indirect Branches

Indirect branches, which transfer control to an address stored in a register, are harder to predict accurately. Indirect branches occur frequently in machine code compiled from object-oriented programs like C++ and Java programs. One simple solution is to update the PHT to include the branch target addresses.

Branch Handling Techniques and Implementations

Technique                              Implementation examples
No branch prediction                   Intel 8086
Static prediction:
  always not taken                     Intel i486
  always taken                         Sun SuperSPARC
  backward taken, forward not taken    HP PA-7x00
  semistatic with profiling            early PowerPCs
Dynamic prediction:
  1-bit                                DEC Alpha 21064, AMD K5
  2-bit                                PowerPC 604, MIPS R10000,
                                       Cyrix 6x86 and M2, NexGen 586
  two-level adaptive                   Intel PentiumPro, Pentium II,
                                       AMD K6, Athlon
Hybrid prediction                      DEC Alpha 21264
Predication                            Intel/HP Merced and most signal
                                       processors, e.g. ARM processors,
                                       TI TMS320C6201, and many others
Eager execution (limited)              IBM mainframes: IBM 360/91, IBM 3090
Disjoint eager execution               none yet

High-Bandwidth Branch Prediction

Future microprocessors will require more than one prediction per cycle, starting speculation over multiple branches in a single cycle; e.g. a GAg predictor is independent of the branch address. When multiple branches are predicted per cycle, instructions must be fetched from multiple target addresses per cycle, complicating I-cache access.
Possible solution: a trace cache in combination with next-trace prediction. Most likely a combination of branch handling techniques will be applied, e.g. a multi-hybrid branch predictor combined with support for context switching, indirect jumps, and interference handling.

The Intel P5 and P6 Family

Year  Type                  Transistors  Technology  Clock     Issue  Word    L1 cache        L2 cache
                            (x1000)      (µm)        (MHz)            format
1993  Pentium (P5)          3100         0.8         66        2      32-bit  2 x 8 kB        -
1994  Pentium               3200         0.6         75-100    2      32-bit  2 x 8 kB        -
1995  Pentium               3200         0.6/0.35    120-133   2      32-bit  2 x 8 kB        -
1996  Pentium               3300         0.35        150-166   2      32-bit  2 x 8 kB        -
1997  Pentium MMX           4500         0.35        200-233   2      32-bit  2 x 16 kB       -
1998  Mobile Pentium MMX    4500         0.25        200-233   2      32-bit  2 x 16 kB       -
1995  PentiumPro (P6)       5500         0.35        150-200   3      32-bit  2 x 8 kB        256/512 kB
1997  PentiumPro            5500         0.35        200       3      32-bit  2 x 8 kB        1 MB
1998  Intel Celeron         7500         0.25        266-300   3      32-bit  2 x 16 kB       -
1998  Intel Celeron         19000*       0.25        300-333   3      32-bit  2 x 16 kB       128 kB
1997  Pentium II            7000         0.25        233-450   3      32-bit  2 x 16 kB       256 kB/512 kB
1998  Mobile Pentium II     7000         0.25        300       3      32-bit  2 x 16 kB       256 kB/512 kB
1998  Pentium II Xeon       7000         0.25        400-450   3      32-bit  2 x 16 kB       512 kB/1 MB
1999  Pentium II Xeon       7000         0.25        450       3      32-bit  2 x 16 kB       512 kB/2 MB
1999  Pentium III           8200         0.25        450-1000  3      32-bit  2 x 16 kB       512 kB
1999  Pentium III Xeon      8200         0.25        500-1000  3      32-bit  2 x 16 kB       512 kB
2000  Pentium 4 (NetBurst)  42000        0.18        1500      3      32-bit  8 kB/12k µops   256 kB
* including L2 cache

Micro-Dataflow in PentiumPro (1995)

"... The flow of the Intel Architecture instructions is predicted and these instructions are decoded into micro-operations (µops), or series of µops, and these µops are register-renamed, placed into an out-of-order speculative pool of pending operations, executed in dataflow order (when operands are ready), and retired to permanent machine state in source program order. ..."

R. P. Colwell, R. L. Steck: A 0.6 µm BiCMOS Processor with Dynamic Execution, International Solid State Circuits Conference, Feb. 1995.
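The quoted "executed in dataflow order (when operands are ready)" can be made concrete with a toy model. This is my own sketch, not Intel's implementation: µops carry two source registers and one destination, and a µop may issue as soon as both sources have been produced, regardless of program order.

```c
#include <stdint.h>

/* Toy dataflow-order execution (illustrative sizes, my own model). */
#define NREGS 16

typedef struct { int dst, src1, src2, done; } Uop;

/* ready[r] != 0 means register r has been produced.
   Returns the number of sweeps (rough "cycles") needed to drain the
   pool, or -1 if some uop's inputs can never be satisfied.
   Simplification: results forward within the same sweep. */
int execute_dataflow(Uop *pool, int n, int *ready)
{
    int cycles = 0, remaining = n;
    while (remaining > 0) {
        int progress = 0;
        for (int i = 0; i < n; i++) {
            Uop *u = &pool[i];
            if (!u->done && ready[u->src1] && ready[u->src2]) {
                ready[u->dst] = 1;   /* result becomes available */
                u->done = 1;
                remaining--;
                progress = 1;
            }
        }
        cycles++;
        if (!progress) return -1;    /* deadlock: missing inputs */
    }
    return cycles;
}
```

A dependence chain drains one link per sweep, while independent µops complete together, which is exactly the dataflow behavior the P6 instruction pool exploits.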
PentiumPro and Pentium II/III

The Pentium II/III processors use the same dynamic execution microarchitecture as the other members of the P6 family. This three-way superscalar, pipelined microarchitecture features a decoupled, multi-stage superpipeline, which trades less work per pipestage for more stages. The Pentium II/III processor has twelve stages with a pipestage time 33 percent less than the Pentium processor, which helps achieve a higher clock rate on any given manufacturing process. A wide instruction window is provided by an instruction pool. Optimized scheduling requires the fundamental "execute" phase to be replaced by decoupled "issue/execute" and "retire" phases. This allows instructions to be started in any order but always be retired in the original program order. Processors in the P6 family may be thought of as three independent engines coupled with an instruction pool.

Pentium Pro Processor and Pentium II/III Microarchitecture

(Block diagram: external bus and L2 cache attach to the bus interface unit; the instruction fetch unit (with I-cache) and branch target buffer feed the instruction decode unit and microcode instruction sequencer; the register alias table feeds the reservation station unit and functional units; the memory reorder buffer, D-cache unit, and memory interface unit handle memory; results retire through the reorder buffer and retirement register file.)

Pentium II/III: The In-Order Section

The instruction fetch unit (IFU) accesses a non-blocking I-cache; it contains the Next IP unit. The Next IP unit provides the I-cache index (based on inputs from the BTB), trap/interrupt status, and branch-misprediction indications from the integer FUs. Branch prediction uses the two-level adaptive scheme of Yeh and Patt; the BTB contains 512 entries and maintains branch history information and the predicted branch target address.
The branch misprediction penalty is at least 11 cycles, on average 15 cycles. The instruction decode unit (IDU) is composed of three separate decoders.

Pentium II/III: The In-Order Section (Continued)

A decoder breaks an IA-32 instruction down into µops, each comprising an opcode, two source operands, and one destination operand. These µops are of fixed length. Most IA-32 instructions are converted directly into single µops (by any of the three decoders); some instructions are decoded into one to four µops (by the general decoder); more complex instructions are used as indices into the microcode instruction sequencer (MIS), which generates the appropriate stream of µops. The µops are sent to the register alias table (RAT), where register renaming is performed, i.e., the logical IA-32 register references are converted into references to physical registers. Then, with added status information, µops continue to the reorder buffer (ROB, 40 entries) and to the reservation station unit (RSU, 20 entries).

The Fetch/Decode Unit

(Figure: (a) the in-order section: instruction fetch unit with Next_IP and branch target buffer, followed by the instruction decode unit and register alias table; (b) the instruction decode unit (IDU): an alignment stage feeds two simple decoders and one general decoder, producing up to three µops per cycle; the microcode instruction sequencer supplies µop streams for complex instructions.)

The Out-of-Order Execute Section

When the µops flow into the ROB, they effectively take a place in program order. µops also go to the RSU, which forms a central instruction window with 20 reservation stations (RS), each capable of hosting one µop. µops are issued to the FUs according to dataflow constraints and resource availability, without regard to the original program order. After completion, the result goes to two different places, the RSU and the ROB. The RSU has five ports and can issue at a peak rate of 5 µops each cycle.
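The RAT's renaming step can be sketched in a few lines. This is a toy model with my own sizes and names, not the P6 circuit: sources read the current logical-to-physical mapping, and each destination is given a fresh physical register (allocated here from a simple wrapping counter; a real P6 allocates ROB entries).

```c
/* Toy register alias table: maps NLOG logical registers to physical
   registers (sizes illustrative; NPHYS echoes the 40-entry ROB). */
#define NLOG  8
#define NPHYS 40

static int rat[NLOG];
static int next_phys;

void rat_reset(void)
{
    for (int i = 0; i < NLOG; i++)
        rat[i] = i;                  /* identity mapping at start */
    next_phys = NLOG;
}

/* Rename one uop "dst <- src1 op src2": sources read the current
   mapping, the destination gets a fresh physical register, and the
   mapping is updated so younger uops see the new name. */
void rename_uop(int dst, int src1, int src2,
                int *p_dst, int *p_src1, int *p_src2)
{
    *p_src1 = rat[src1];
    *p_src2 = rat[src2];
    *p_dst  = next_phys++ % NPHYS;
    rat[dst] = *p_dst;
}
```

Renaming turns write-after-write and write-after-read hazards into independent physical registers, so only true dataflow dependences remain to constrain the RSU.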
Latencies and Throughput for Pentium II/III FUs

RSU Port  FU                          Latency  Throughput
0         Integer arithmetic/logical  1        1
0         Shift                       1        1
0         Integer mul                 4        1
0         Floating-point add          3        1
0         Floating-point mul          5        0.5
0         Floating-point div          long     nonpipelined
0         MMX arithmetic/logical      1        1
0         MMX mul                     3        1
1         Integer arithmetic/logical  1        1
1         MMX arithmetic/logical      1        1
1         MMX shift                   1        1
2         Load                        3        1
3         Store address               3        1
4         Store data                  1        1

(Figure: the issue/execute unit: the reservation station unit, to/from the reorder buffer, issues over five ports. Port 0: integer, MMX, and floating-point functional units; Port 1: integer, MMX, and jump functional units; Port 2: load functional unit; Port 3: store functional unit; Port 4: store functional unit.)

The In-Order Retire Section

A µop can be retired if its execution is completed, if it is its turn in program order, and if no interrupt, trap, or misprediction occurred. Retirement means taking data that was speculatively created and writing it into the retirement register file (RRF). Three µops per clock cycle can be retired.
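The three retirement conditions above can be sketched as a small in-order retire loop over a circular ROB. This is my own illustrative model (names and the head/tail interface are assumptions), showing how one incomplete µop at the ROB head blocks everything younger than it.

```c
/* Toy in-order retirement from a circular reorder buffer. */
#define ROB_SIZE     40   /* matches the P6 ROB             */
#define RETIRE_WIDTH 3    /* up to three uops per cycle     */

typedef struct { int completed; } RobEntry;

/* One retire cycle: retire completed uops strictly from the head,
   in program order, up to RETIRE_WIDTH of them.  Returns how many
   retired; *head is advanced past them. */
int retire_cycle(RobEntry *rob, int *head, int tail)
{
    int n = 0;
    while (n < RETIRE_WIDTH && *head != tail && rob[*head].completed) {
        /* here the speculative result would be copied to the RRF */
        *head = (*head + 1) % ROB_SIZE;
        n++;
    }
    return n;
}
```

Because retirement stops at the first incomplete entry, results reach the retirement register file in source program order even though execution happened in dataflow order.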
(Figure: the retire unit: the reservation station unit and memory interface unit, to/from the reorder buffer and D-cache, write into the retirement register file.)

(Figure: the Pentium II/III pipeline: (a) fetch and predecode (BTB0/BTB1 access, I-cache access IFU0-IFU2), decode (IDU0/IDU1), register renaming (RAT), reorder buffer read; (b) issue (RSU, ports 0-4), execution and completion, reorder buffer write-back; (c) retirement (ROB read, RRF write).)

Pentium Pro Processor Basic Execution Environment

The basic execution environment comprises eight 32-bit general purpose registers, six 16-bit segment registers, the 32-bit EFLAGS register, the 32-bit EIP (instruction pointer) register, and an address space from 0 to 2^32 - 1. The address space can be flat or segmented.

(Figures: application programming registers; Pentium III.)

Pentium II/III Summary and Offsprings

Pentium III in 1999, initially at 450 MHz (0.25 micron technology); former code name Katmai. Two 32 kB caches, faster floating-point performance. Coppermine is a shrink of the Pentium III down to 0.18 micron.
Pentium 4

- Was announced for mid-2000 under the code name Willamette
- Native IA-32 processor with Pentium III processor core, running at 1.5 GHz
- 42 million transistors, 0.18 µm
- 20 pipeline stages (integer pipeline), IF and ID not included
- Trace execution cache (TEC) for the decoded µops
- NetBurst micro-architecture

Pentium 4 Features

Rapid Execution Engine:
- Intel: "Arithmetic Logic Units (ALUs) run at twice the processor frequency"
- Fact: two ALUs running at processor frequency, connected with a multiplexer running at twice the processor frequency

Hyper Pipelined Technology:
- Twenty-stage pipeline to enable high clock rates
- Frequency headroom and performance scalability

Advanced Dynamic Execution

- Very deep, out-of-order, speculative execution engine
- Up to 126 instructions in flight (3 times more than the Pentium III processor)
- Up to 48 loads and 24 stores in the pipeline (2 times more than the Pentium III processor)
- Branch prediction based on µops
- 4K-entry branch target array (8 times larger than the Pentium III processor)
- New algorithm (not specified) reduces mispredictions by about one third compared to the G-share of the P6 generation

First-Level Caches

- 12k-µop execution trace cache (~100 k)
- The execution trace cache removes decoder latency from the main execution loops
- The execution trace cache integrates the path of program execution flow into a single line
- Low-latency 8 kB data cache with 2-cycle latency

Second-Level Caches

- Included on the die: 256 kB full-speed, unified, 8-way 2nd-level on-die Advanced Transfer Cache
- 256-bit data bus to the level 2 cache
- Delivers ~45 GB/s data throughput (at 1.4 GHz processor frequency)
- Bandwidth and performance increase with processor frequency

NetBurst Micro-Architecture

(Figure: NetBurst micro-architecture.)

Streaming SIMD Extensions 2 (SSE2) Technology

SSE2 extends MMX and SSE technology with 144 new instructions, which include support for:
- 128-bit SIMD integer arithmetic operations
- 128-bit SIMD double-precision floating-point operations
- Cache and memory management operations

SSE2 further enhances and accelerates video, speech, encryption, image, and photo processing.

400 MHz Intel NetBurst Micro-Architecture System Bus

- Provides 3.2 GB/s throughput (3 times faster than the Pentium III processor)
- Quad-pumped 100 MHz scalable bus clock to achieve 400 MHz effective speed
- Split-transaction, deeply pipelined
- 128-byte lines with 64-byte accesses

(Figures: Pentium 4 data types; Pentium 4.)

Pentium 4 Offsprings

Foster
- Pentium 4 with external L3 cache and DDR-SDRAM support, provided for servers
- Clock rate 1.7 - 2 GHz, to be launched in Q2/2001

Northwood
- 0.13 µm technology
- New 478-pin socket
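The 128-bit SIMD integer arithmetic that SSE2 adds can be illustrated with compiler intrinsics. A minimal sketch assuming an SSE2-capable x86 compiler; the helper name add4 is my own, but the intrinsics are the standard ones from <emmintrin.h>:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Add two vectors of four 32-bit integers with a single 128-bit
   SSE2 packed-add; unaligned loads/stores keep the sketch simple. */
void add4(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

One such instruction does the work of four scalar adds, which is the kind of speedup SSE2 brings to the video and image workloads listed above.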