Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CS:APP Chapter 4 Computer Architecture Pipelined Implementation Part II Randal E. Bryant Carnegie Mellon University http://csapp.cs.cmu.edu CS:APP Overview Make the pipelined processor work! Data Hazards Instruction having register R as source follows shortly after instruction having register R as destination Common condition, don’t want to slow down pipeline Control Hazards Mispredict conditional branch Our design predicts all branches as being taken Naïve pipeline executes two extra instructions Getting return address for ret instruction Naïve pipeline executes three extra instructions Making Sure It Really Works –2– What if multiple special cases happen simultaneously? CS:APP Pipeline Stages W_icode, W_valM W_valE, W_valM, W_dstE, W_dstM W valM Fetch Memory Addr, Data Select current PC Read instruction Compute incremented PC Data Data memory memory M_icode, M_Bch, M_valA M Bch valE CC CC Execute ALU ALU aluA, aluB Decode E Read program registers valA, valB Execute Decode D Instruction Instruction memory memory Fetch PC –3– B valP icode, ifun, rA, rB, valC Read or write data memory Write Back A Register Register M file file E Write back Operate ALU Memory d_srcA, d_srcB valP PC PC increment increment predPC f_PC F Update register file CS:APP Write back PIPE- Hardware W icode valE Pipeline registers hold intermediate values from instruction execution Forward (Upward) Paths Values passed from one stage to next Cannot jump past stages write Memory Data Data memory memory data in Addr M_valA M_Bch M icode Bch valE valA dstE dstM e_Bch Execute E icode ALU fun. ALU ALU CC CC ifun ALU A ALU B valC valA valB dstE dstM srcA srcB d_srcA d_srcB Select A Decode D Fetch icode d_rvalA A dstE dstM srcA srcB W_valM B Register Register M file file E e.g., valC passes through decode dstE dstM data out read Mem. control valM ifun rA rB Instruction Instruction memory memory valC W_valE valP PC PC increment increment Predict PC f_PC M_valA Select PC F –4– W_valM predPC CS:APP Data Dependencies: 2 Nop’s # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D F E D F M E D F W M E D F 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 0x00e: addl %edx,%eax 0x010: halt 6 7 8 9 10 W M E D F W M E D W M E W M W Cycle 6 W R[ %eax] R[ 3 ] 3 %eax • • • D valA R[ %edx] = 10 valB R[ %eax] = 0 –5– Error CS:APP Data Dependencies: No Nop # demo-h0.ys 1 2 3 4 5 6 7 8 0x000: irmovl $10,%edx F D F E D F M E D F W M E D W M E W M W 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax 0x00e: halt Cycle 4 M M_valE = 10 M_dstE = %edx E e_valE 0 + 3 = 3 E_dstE = %eax D valA R[ %edx] = 0 valB R[ %eax] = 0 –6– Error CS:APP Stalling for Data Dependencies # demo-h2.ys 1 2 3 4 5 6 0x000: irmovl $10,%edx F D F E D F M E D F W M E D W M E 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop bubble 0x00e: addl %edx,%eax 0x010: halt –7– F D F 7 8 9 10 11 W M E D F W M E D W M E W M W If instruction follows too closely after one that writes register, slow it down Hold instruction in decode Dynamically inject nop into execute stage CS:APP Write back Stall Condition W icode valE dstE dstM data out read Mem. control Source Registers valM write Memory Data Data memory memory data in Addr M_valA M_Bch srcA and srcB of current instruction in decode stage Destination Registers dstE and dstM fields Instructions in execute, memory, and write-back stages Special Case Don’t stall for register ID 8 M –8– Bch valE valA dstE dstM dstE dstM srcA dstE dstM srcA e_Bch Execute E icode ALU fun. ALU ALU CC CC ifun ALU A ALU B valC valA valB srcB d_srcA d_srcB Select A Decode d_rvalA A srcB W_valM B Register RegisterM file file W_valE E D Fetch icode ifun rA rB Instruction Instruction memory memory valC valP PC PC increment increment Predict PC f_PC M_valA Select PC Indicates absence of register operand icode F W_valM predPC CS:APP Detecting Stall Condition # demo-h2.ys 1 2 3 4 5 6 0x000: irmovl $10,%edx F D F E D F M E D F W M E D W M E 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop bubble 0x00e: addl %edx,%eax 0x010: halt F D F 7 8 9 10 11 W M E D F W M E D W M E W M W Cycle 6 W W_dstE = %eax W_valE = 3 • • • D –9– srcA = %edx srcB = %eax CS:APP Stalling X3 # demo-h0.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D F E D M E W M E 0x006: irmovl $3,%eax bubble bubble 6 W M E bubble 0x00c: addl %edx,%eax F 0x00e: halt D F D F D F 7 8 9 10 11 W M E D F W M E D W M E W M W Cycle 6 W Cycle 5 W_dstE = %eax M Cycle 4 M_dstE = %eax E • • • D E_dstE = %eax D – 10 – srcA = %edx srcB = %eax srcA = %edx srcB = %eax • • • D srcA = %edx srcB = %eax CS:APP What Happens When Stalling? # demo-h0.ys Cycle 8 4 5 6 7 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax 0x00e: halt Write Back Memory Execute Decode Fetch 0x000: irmovl 0x006: bubble $10,%edx $3,%eax 0x000: irmovl 0x006: bubble $10,%edx $3,%eax 0x006: addl 0x00c: irmovl bubble %edx,%eax $3,%eax 0x00c: halt 0x00e: addl %edx,%eax 0x00e: halt Stalling instruction held back in decode stage Following instruction stays in fetch stage Bubbles injected into execute stage Like dynamically generated nop’s Move through later stages – 11 – CS:APP Implementing Stalling W_dstM W_dstE W icode valE valM dstE dstM valE valA dstE dstM M_dstM M_dstE M icode Bch E_dstM Pipe control logic E_dstE E_bubble E icode ifun valC valA valB dstE dstM srcA srcB d_srcB d_srcA srcB D_icode D_stall D F_stall F srcA icode ifun rA rB valC valP predPC Pipeline Control – 12 – Combinational logic detects stall condition Sets mode signals for how pipeline registers should update CS:APP Pipeline Register Modes Input = y Normal Output = x x stall =0 Output = x x stall =1 – 13 – Output = x x stall =0 y Rising clock Output = x x bubble =0 Input = y Bubble Output = y bubble =0 Input = y Stall Rising clock Rising clock n o p Output = nop bubble =1 CS:APP Data Forwarding Naïve Pipeline Register isn’t written until completion of write-back stage Source operands read from register file in decode stage Needs to be in register file at start of stage Observation Value generated in execute or memory stage Trick – 14 – Pass value directly from generating instruction to decode stage Needs to be available at end of decode stage CS:APP Data Forwarding Example # demo-h2.ys 1 2 3 4 5 0x000: irmovl $10,%edx F D F E D F M E D F W M E D F 0x006: irmovl $3,%eax 0x00c: nop 0x00d: nop 0x00e: addl %edx,%eax 0x010: halt – 15 – irmovl in writeback stage Destination value in W pipeline register Forward as valB for decode stage 6 7 8 9 10 W M E D F W M E D W M E W M W Cycle 6 W R[ %eax] 3 W_dstE = %eax W_valE = 3 • • • D srcA = %edx srcB = %eax valA R[ %edx] = 10 valB W_valE = 3 CS:APP Bypass Paths W_icode, W_valM W_valE, W_valM, W_dstE, W_dstM W_valE W_valM W Decode Stage m_valM Forwarding logic selects Memory valA and valB Normally from register file Addr, Data M_valE M Forwarding: get valA or valB from later pipeline Execute stage Execute: valE Memory: valE, valM Write back: valE, valM e_valE Bch CC CC ALU ALU E_valA, E_valB, E_srcA, E_srcB Forwarding Sources Data Data memory memory M_icode, M_Bch, M_valA E valA, valB Forward d_srcA, d_srcB Decode A B Register Register M file file E Write back – 16 – D icode, ifun, rA, rB, valC valP CS:APP valP Data Forwarding Example #2 # demo-h0.ys 1 2 3 4 5 6 7 8 0x000: irmovl $10,%edx F D F E D F M E D F W M E D W M E W M W 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax 0x00e: halt Register %edx Generated by ALU during previous cycle Forward from memory as valA Register %eax – 17 – Value just generated by ALU Forward from execute as valB Cycle 4 M M_dstE = %edx M_valE = 10 E E_dstE = %eax e_valE 0 + 3 = 3 D srcA = %edx srcB = %eax valA M_valE = 10 valB e_valE = 3 CS:APP Implementing Forwarding W_valE Write back W_valM W icode valE valM dstE dstM data out read Mem. control Data Data memory memory write Memory m_valM Add additional feedback paths from E, M, and W pipeline registers into decode stage Create logic blocks to select from multiple sources for valA and valB in decode stage data in Addr M_Bch M icode M_valA M_valE Bch valE valA dstE dstM e_Bch e_valE ALU ALU CC CC Execute E icode ifun ALU fun. ALU A ALU B valC valA valB dstE dstM srcA srcB d_srcA d_srcB dstE dstM srcA srcB Sel+Fwd A Decode Fwd B A W_valM B Register Register M file file W_valE E D icode – 18 – Fetch ifun rA rB Instruction Instruction memory memory valC valP PC PC increment increment Predict PC CS:APP f_PC M_valA Select Implementing Forwarding W_valE W_valM valE valM dstE dstM data out read Mem. control m_valM Data Data memory memory write data in Addr M_valA M_valE Bch valE valA dstE dstM e_Bch e_valE ALU ALU CC CC ALU fun. ALU A ALU B valC valA valB dstE dstM srcA srcB d_srcA d_srcB dstE dstM srcA srcB Sel+Fwd A Fwd B A W_valM B Register Register M file file W_valE E rA rB Instruction Instruction memory –valC 19 – CS:APP valP PC PC increment ## What should be the A value? int new_E_valA = [ # Use incremented PC D_icode in { ICALL, IJXX } : D_valP; # Forward valE from execute d_srcA == E_dstE : e_valE; # Forward valM from memory d_srcA == M_dstM : m_valM; # Forward valE from memory d_srcA == M_dstE : M_valE; # Forward valM from write back d_srcA == W_dstM : W_valM; # Forward valE from write back d_srcA == W_dstE : W_valE; # Use value read from register file 1 : d_rvalA; ]; Predict PC Limitation of Forwarding # demo-luh.ys 0x000: 0x006: 0x00c: 0x012: 0x018: 0x01e: 0x020: 1 2 4 irmovl $128,%edx F D E M irmovl $3,%ecx F D E rmmovl %ecx, 0(%edx) F D irmovl $10,%ebx F mrmovl 0(%edx),%eax # Load %eax addl %ebx,%eax # Use %eax halt Load-use dependency 3 Value needed by end of decode stage in cycle 7 Value read from memory in memory stage of cycle 8 5 6 W M E D F W M E D F 7 8 9 10 11 W M E D F W M E D W M E W M W Cycle 7 Cycle 8 M M M_dstE = %ebx M_valE = 10 M_dstM = %eax m_valM M[128] = 3 • • • D valA M_valE = 10 valB R[%eax] = 0 – 20 – Error CS:APP Avoiding Load/Use Hazard # demo-luh.ys 1 2 3 4 5 6 0x000: irmovl $128,%edx 0x006: irmovl $3,%ecx F D F E D M E W M W 0x00c: rmmovl %ecx, 0(%edx) F D 0x012: irmovl $10,%ebx F 0x018: mrmovl 0(%edx),%eax # Load %eax bubble E D F 0x01e: addl %ebx,%eax # Use %eax 0x020: halt Stall using instruction for one cycle Can then pick up loaded value by forwarding from memory stage 7 8 M E W M W D E F D F 9 10 M E W M W D F E D M E 11 W M 12 W Cycle 8 W W_dstE = %ebx W_valE = 10 M M_dstM = %eax m_valM M[128] = 3 • • • D – 21 – valA W_valE = 10 valB m_valM = 3 CS:APP data out read Mem. control Data Data memory memory write Memory m_valM Detecting Load/Use Hazard data in Addr M_Bch M icode M_valA M_valE Bch valE valA dstE dstM e_Bch e_valE ALU ALU CC CC Execute E icode ifun ALU fun. ALU A ALU B valC valA valB dstE dstM srcA srcB d_srcA d_srcB dstE Sel +Fwd A Decode D icode Condition Fetch Load/Use Hazard F – 22 – rA rB Instruction Instruction memory memory valC Trigger srcA srcB Fwd B A ifun dstM W_valM B Register RegisterM file file E W_valE valP PC PC increment increment Predict PC f_PC M_valA E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } Select PC W_valM predPC CS:APP Control for Load/Use Hazard # demo-luh.ys 1 2 3 0x000: 0x006: 0x00c: 0x012: 0x018: 4 irmovl $128,%edx F D E M irmovl $3,%ecx F D E rmmovl %ecx, 0(%edx) F D irmovl $10,%ebx F mrmovl 0(%edx),%eax # Load %eax bubble 0x01e: addl %ebx,%eax # Use %eax 0x020: halt Stall instructions in fetch and decode stages Inject bubble into execute stage Condition Load/Use Hazard – 23 – 5 6 7 W M E D F W M E D W M E F D F 8 9 10 11 12 W M E D F W M E D W M E W M W F D E M W stall stall bubble normal normal CS:APP Branch Misprediction Example demo-j.ys 0x000: xorl %eax,%eax 0x002: jne t 0x007: irmovl $1, %eax 0x00d: nop 0x00e: nop 0x00f: nop 0x010: halt 0x011: t: irmovl $3, %edx 0x017: irmovl $4, %ecx 0x01d: irmovl $5, %edx – 24 – # Not taken # Fall through # Target (Should not execute) # Should not execute # Should not execute Should only execute first 8 instructions CS:APP Handling Misprediction # demo-j.ys 1 2 3 4 5 6 0x000: xorl %eax,%eax F 0x002: jne target # Not taken D F E D F M E D W M W E M W D F E D F M E D 0x011: t: irmovl $2,%edx # Target bubble 0x017: irmovl $3,%ebx # Target+1 irmovl $1,%eax # Fall through 0x00d: nop 8 9 10 W M E W M W F bubble 0x007: 7 Predict branch as taken Fetch 2 instructions at target Cancel when mispredicted – 25 – Detect branch not-taken in execute stage On following cycle, replace instructions in execute and decode by bubbles No side effects have occurred yet CS:APP W_valE Write back W_valM W icode valE valM dstE dstM Detecting Mispredicted Branch data out read Mem. control Data Data memory memory write Memory m_valM data in Addr M_Bch M icode M_valA M_valE Bch valE valA dstE dstM e_Bch e_valE ALU ALU CC CC Execute E icode ifun ALU fun. ALU A ALU B valC valA valB dstE dstM dstE dstM srcA srcB d_srcA d_srcB Sel +Fwd A Condition srcB Fwd B Trigger Decode srcA A W_valM B Register RegisterM file file E Mispredicted Branch E_icode = IJXX & !e_Bch D Fetch icode ifun rA rB Instruction Instruction memory memory valC W_valE valP PC PC increment increment Predict PC f_PC M_valA – 26 – Select PC F predPC W_valM CS:APP Control for Misprediction # demo-j.ys 1 2 3 4 5 6 0x000: xorl %eax,%eax F 0x002: jne target # Not taken D F E D F M E D W M W E M W D F E D F M E D 0x011: t: irmovl $2,%edx # Target bubble 0x017: irmovl $3,%ebx # Target+1 0x007: irmovl $1,%eax # Fall through 0x00d: nop F Mispredicted Branch normal – 27 – 8 9 10 W M E W M W F bubble Condition 7 D E M W bubble bubble normal normal CS:APP demo-retb.ys Return Example 0x000: 0x006: 0x00b: 0x011: 0x020: 0x020: 0x026: 0x027: 0x02d: 0x033: 0x039: 0x100: 0x100: – 28 – irmovl Stack,%esp call p irmovl $5,%esi halt .pos 0x20 p: irmovl $-1,%edi ret irmovl $1,%eax irmovl $2,%ecx irmovl $3,%edx irmovl $4,%ebx .pos 0x100 Stack: # Initialize stack pointer # Procedure call # Return point # procedure # # # # Should Should Should Should not not not not be be be be executed executed executed executed # Stack: Stack pointer Previously executed three additional instructions CS:APP Correct Return Example # demo-retb 0x026: ret F bubble D F bubble bubble 0x00b: irmovl $5,%esi # Return As ret passes through pipeline, stall at fetch stage E D F M E D F W M E D F W M E D W M E W M W W valM = 0x0b While in decode, execute, and memory stage – 29 – Inject bubble into decode stage Release stall when reach write-back stage • • • F valC 5 rB %esi CS:APP Detecting Return M_Bch M icode M_valE Bch valE valA dstE dstM e_Bch e_valE ALU ALU CC CC Execute E icode ifun ALU fun. ALU A ALU B valC valA valB dstE dstM srcA srcB d_srcA d_srcB dstE dstM srcA Sel+Fwd A Decode D – 30 – icode Fwd B A B Register Register M file file E ifun rA rB valC srcB W_valM W_valE valP Condition Trigger Processing ret IRET in { D_icode, E_icode, M_icode } CS:APP Control for Return # demo-retb 0x026: ret F D F bubble bubble bubble 0x00b: M E D F irmovl $5,%esi # Return Condition Processing ret – 31 – E D F W M E D F W M E D W M E W M W F D E M W stall bubble normal normal normal CS:APP Special Control Cases Detection Condition Trigger Processing ret IRET in { D_icode, E_icode, M_icode } Load/Use Hazard E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } Mispredicted Branch E_icode = IJXX & !e_Bch Action (on next cycle) Condition F D E M W Processing ret stall bubble normal normal normal Load/Use Hazard stall stall bubble normal normal bubble bubble normal normal Mispredicted Branch normal – 32 – CS:APP Implementing Pipeline Control W icode valE valM dstE dstM valE valA dstE dstM M_icode M icode Bch e_Bch CC CC E_dstM E_icode Pipe control logic E_bubble E icode ifun valC valA valB dstE dstM srcA srcB d_srcB d_srcA srcB D_icode srcA D_bubble – 33 – D_stall D F_stall F icode ifun rA rB valC valP predPC Combinational logic generates pipeline control signals Action occurs at start of following cycle CS:APP Initial Version of Pipeline Control bool F_stall = # Conditions for a load/use hazard E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } || # Stalling at fetch while ret passes through pipeline IRET in { D_icode, E_icode, M_icode }; bool D_stall = # Conditions for a load/use hazard E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB }; bool D_bubble = # Mispredicted branch (E_icode == IJXX && !e_Bch) || # Stalling at fetch while ret passes through pipeline IRET in { D_icode, E_icode, M_icode }; bool E_bubble = # Mispredicted branch (E_icode == IJXX && !e_Bch) || # Load/use hazard E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB}; – 34 – CS:APP Control Combinations Load/use M E D Load Use Mispredict M E D ret 1 M JXX E ret D Combination A ret 2 M E D ret bubble ret 3 M E D ret bubble bubble Combination B Special cases that can arise on same clock cycle Combination A Not-taken branch ret instruction at branch target Combination B – 35 – Instruction that reads from memory to %esp Followed by ret instruction CS:APP E icode ifun valC valA valB dstE dstM srcA dstE dstM srcA srcB d_srcA d_srcB Select A Decode A Control Combination A D Mispredict M E D ifun rA rB Fetch srcB W_valM B Register RegisterM file file W_valE E valC valP ret 1 M JXX E ret D Combination A Condition icode d_rvalA Instruction Instruction memory memory PC PC increment increment Predict PC f_PC M_valA Select PC F W_valM predPC F D E M W stall bubble normal normal normal Mispredicted Branch normal bubble bubble normal normal Combination bubble bubble normal normal Processing ret – 36 – stall Should handle as mispredicted branch Stalls F pipeline register But PC selection logic will be using M_valM anyhow CS:APP Control Combination B ret 1 Load/use M E D M E D Load Use ret Combination B Condition F D E M W Processing ret stall bubble normal normal normal Load/Use Hazard stall stall bubble normal normal Combination stall bubble + bubble stall normal normal – 37 – Would attempt to bubble and stall pipeline register D Signaled by processor as pipeline error CS:APP Handling Control Combination B ret 1 Load/use M E D M E D Load Use ret Combination B Condition F D E M W Processing ret stall bubble normal normal normal Load/Use Hazard stall stall bubble normal normal Combination stall stall bubble normal normal – 38 – Load/use hazard should get priority ret instruction should be held in decode stage for additional cycle CS:APP Corrected Pipeline Control Logic bool D_bubble = # Mispredicted branch (E_icode == IJXX && !e_Bch) || # Stalling at fetch while ret passes IRET in { D_icode, E_icode, M_icode # but not condition for a load/use && !(E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB Condition through pipeline } hazard }); F D E M W Processing ret stall bubble normal normal normal Load/Use Hazard stall stall bubble normal normal Combination stall stall bubble normal normal – 39 – Load/use hazard should get priority ret instruction should be held in decode stage for additional cycle CS:APP Pipeline Summary Data Hazards Most handled by forwarding No performance penalty Load/use hazard requires one cycle stall Control Hazards Cancel instructions when detect mispredicted branch Two clock cycles wasted Stall fetch stage while ret passes through pipeline Three clock cycles wasted Control Combinations Must analyze carefully First version had subtle bug Only arises with unusual instruction combination – 40 – CS:APP