Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
(Recap) Control Hazards COMP381 by M. Hamdi 1 Control (Branch) Hazard A: beqz r2, label B: --------label: P: Problem: The outcome of the branch (taken/not taken) is determined deep in the pipeline Do not know whether to execute B or execute P? Solutions: • Stall pipeline for required number of cycles • Methods to reduce the branch penalty: 1. Speculatively predict branch outcome and recover if mispredicted 2. Delayed Branch: Always execute instruction(s) following the branch COMP381 by M. Hamdi 2 Branch Hazards: Stall pipeline A: beqz r2, label stall stall stall B ----label: P A B P 1 IF 2 ID Execution sequence: A, NOP, NOP, NOP, B ….. or A, NOP, NOP, NOP, P 3 EX 4 MEM 5 WB 6 7 8 9 IF ID EX MEM WB COMP381 by M. Hamdi 3 Reducing Branch Penalty 1. Stall scheme had a branch penalty of 3 cycles. CPI assuming branch frequency of 30% is 1.9 (with no data stalls) 2. Predicting branch not taken: • Speculatively fetch and execute in-line instructions following the branch • If prediction incorrect flush pipeline of speculated instructions • Convert these instructions to NOPs by clearing pipeline registers. • Fetch instructions from the branch target COMP381 by M. Hamdi 4 Predict branch not taken B IF A ID T=2 C IF B ID T=3 D IF C IF D T=1 EX MEM WB A EX MEM WB ID B EX A MEM WB ID C EX B MEM Branch actually not taken: No stalls T=4 E COMP381 by M. Hamdi A WB 5 Predict branch not taken B IF A ID T=2 C IF B ID T=3 D IF C ID T=1 EX MEM WB A EX MEM WB B EX MEM WB A Branch actually taken: FLUSH pipeline Make B,C,D NOPS T=4 P IF ID EX COMP381 by M. Hamdi MEM A WB 6 Reducing Branch Stall Cycles 1- Find out whether a branch is taken earlier in the pipeline. 2- Compute the taken PC earlier in the pipeline. In MIPS: – In MIPS branch instructions BEQZ, BNE, test a register for equality to zero. – This can be completed in the ID cycle by moving the zero test into that cycle. – Both PCs (taken and not taken) must be computed early. – Requires an additional adder because the current ALU is not useable until EX cycle. – This results in just a single cycle stall on branches. COMP381 by M. Hamdi 7 Reducing Branch Penalty (contd … ) 3. Predicting branch taken: • Speculatively fetch and execute instructions at the branch target – But haven’t calculated branch target address in MIPS • MIPS still incurs 1 cycle branch penalty • Other machines: branch target known before outcome COMP381 by M. Hamdi 8 Reducing Branch Penalty (contd … ) 4. Delayed Branch: Branch delay slots: In-line instructions following a branch instruction conditional branch instruction sequential successor1 sequential successor2 …….. sequential successorn branch target if taken Always executed: Usually 1 delay slot (short pipelines) • Compiler tries to put useful instructions in delay slot to mask delay COMP381 by M. Hamdi 9 Delayed Branch • Instruction in branch delay slot is always executed • Compiler (tries to) move a useful instruction into delay slot. (a) From before the Branch: Always helpful when possible ADD R1, R2, R3 BEQZ R2, L1 DELAY SLOT L1: BEQZ R2, L1 ADD R1, R2, R3 L1: • If the ADD instruction were: ADD R2, R1, R3 the move would not be possible COMP381 by M. Hamdi 10 Delayed Branch (b) From the Target: Helps when branch is taken. May duplicate instructions ADD R2, R1, R3 ADD R2, R1, R3 BEQZ R2, L1 DELAY SLOT L1: SUB R4, R5, R6 L2: BEQZ R2, L2 SUB R4, R5, R6 L1: SUB R4, R5, R6 L2: Instructions between BEQ and SUB (in fall through) must not use R4. COMP381 by M. Hamdi 11 Delayed Branch ( c ) From Fall Through: Helps when branch is not taken. ADD R2, R1, R3 BEQZ R2, L1 DELAY SLOT SUB R4, R5, R6 - ADD R2, R1, R3 BEQZ R2, L1 SUB R4, R5, R6 - L1: L1: Instructions at target (L1 and after) must not use R4 till set again. • Cancelling (Nullifying) Branch: Branch instruction indicates direction of prediction. If mispredicted the instruction in the delay slot is cancelled. Greater flexibility for compiler to schedule instructions. COMP381 by M. Hamdi 12 Canceling branch • Cancelling (Nullifying) Branch: – Branch instruction indicates direction of prediction. – Mispredicted branch “annuls” the instruction in delay slot. Instruction behaves as a no-op. – Gives more leverage to compiler to select instructions to fill delay slots. COMP381 by M. Hamdi 13 Delayed Branch • Limitations of delayed branch – Compiler may not find appropriate instructions to fill delay slots. Then it fills delay slots with noops. – Visible architectural feature – likely to change with new implementations • Pipeline structure is exposed to compiler. Need to know how many delay slots. COMP381 by M. Hamdi 14 Delayed Branch • Compiler effectiveness for single branch delay slot: – Fills about 60% of branch delay slots – About 80% of instructions executed in branch delay slots useful in computation – About 50% (60% x 80%) of slots usefully filled • Delayed Branch downside: As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slot – Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches – Growth in available transistors has made dynamic approaches relatively cheaper COMP381 by M. Hamdi 15 Dynamic Branch Prediction • Builds on the premise that history matters – Observe the behavior of branches in previous instances and try to predict future branch behavior – Try to predict the outcome of a branch early on in order to avoid stalls – Branch prediction is critical for multiple issue processors • In an n-issue processor, branches will come n times faster than a single issue processor COMP381 by M. Hamdi 16 Basic Branch Predictor • Use a 1-bit branch predictor buffer or branch history table • 1 bit of memory stating whether the branch was recently taken or not • Bit entry updated each time the branch instruction is executed T Prediction Taken Not Taken State 1 0 Branch outcome Taken Not Taken 1 0 1 0 NT NT State 1 State 0 Predict Taken Predict Not Taken T COMP381 by M. Hamdi 17 1-bit Branch Prediction Buffer Problem – even simplest branches are mispredicted twice LD R1, #5 Loop: LD R2, 0(R5) ADD R2, R2, R4 STORE R2, 0(R5) ADD R5, R5, #4 SUB R1, R1, #1 BNEZ R1, Loop First time: prediction = 0 but the branch is taken change prediction to 1 miss Time 2, 3, 4: prediction = 1 and the branch is taken Time 5: prediction = 1 but the branch is not taken change prediction to 0 miss COMP381 by M. Hamdi 18 Dynamic Branch Prediction Accuracy COMP381 by M. Hamdi 19