CS 161 Computer Architecture
Introduction to Advanced Architectures
Lecture 13
Instructor: L.N. Bhuyan
www.cs.ucr.edu/~bhuyan
Adapted from notes by Dave Patterson (http.cs.berkeley.edu/~patterson)
1999 ©UCB

Stages of Execution in Pipelined MIPS
5-stage instruction pipeline:
1) I-fetch: Fetch Instruction, Increment PC
2) Decode: Decode Instruction, Read Registers
3) Execute:
   Mem-reference: Calculate Address
   R-format: Perform ALU Operation
4) Memory:
   Load: Read Data from Data Memory
   Store: Write Data to Data Memory
5) Write Back: Write Data to Register

Pipelined Execution Representation
[Figure: five instructions overlapped in time, each flowing through IFtch, Dcd, Exec, Mem, WB]
°To simplify the pipeline, every instruction takes the same number of steps, called stages
°One clock cycle per stage

Review: Single-cycle Datapath for MIPS
[Figure: PC and Instruction Memory (Imem) in Stage 1, Registers in Stage 2, ALU in Stage 3, Data Memory (Dmem) in Stage 4, register write-back in Stage 5]
°Use the datapath figure to represent the pipeline: IFtch = IM, Dcd = Reg, Exec = ALU, Mem = DM, WB = Reg

Graphical Pipeline Representation
[Figure: Load, Add, Store, Sub, and Or instructions flowing through IM, Reg, ALU, DM, Reg across clock cycles; a right-half-highlighted register stage means read, a left-half-highlighted stage means write]

Required Changes to Datapath
° Introduce registers to separate the 5 stages by putting IF/ID, ID/EX, EX/MEM, and MEM/WB registers in the datapath.
° The next PC value is computed in the 3rd stage, but we need to bring in the next instruction in the next cycle – move the PCSrc mux to the 1st stage.
° The branch address is computed in the 3rd stage. With pipelining, the PC value has changed! We must carry the PC value along with the instruction. Width of the IF/ID register = (IR)+(PC) = 64 bits.
° For the lw instruction, we need the write-register address at stage 5. But the IR is now occupied by another instruction! So we must carry the IR destination field along as we move through the stages. See the connection in the figure.
Length of the ID/EX register = (Reg1)+(Reg2)+(offset)+(PC)+(destn) = 133 bits

Pipelined Datapath (with Pipeline Regs) (6.2)
[Figure: the five stages – Fetch, Decode, Execute, Memory, Write Back – separated by the IF/ID (64 bits), ID/EX (133 bits), EX/MEM (102 bits), and MEM/WB (69 bits) pipeline registers; the datapath contains the PC, Imem, register file, sign extender (16 to 32 bits), shift-left-2, branch adder, ALU, and Dmem]

Pipelined Control (6.3)
• Start with the single-cycle controller
• Group control lines by the pipeline stage that needs them:
  EX: RegDst, ALUOp, ALUSrc
  Mem: Branch, MemRead, MemWrite
  WB: MemToReg, RegWrite
• Extend the pipeline registers with these control bits

Problems for Pipelining
°Hazards prevent the next instruction from executing during its designated clock cycle, limiting speedup
• Structural hazards: HW cannot support this combination of instructions (a single person to fold and put clothes away)
• Control hazards: conditional branches & other instructions may stall the pipeline, delaying later instructions (must check the detergent level before washing the next load)
• Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (matching socks in a later load)

MIPS R4000 pipeline
[Figure]

Advanced Architectural Concepts
°Can we achieve CPI < 1? (i.e., can we have IPC > 1?)
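A minimal sketch of why the answer matters, under the idealized assumption of no hazards: a single-issue 5-stage pipeline takes n + 4 cycles for n instructions, so its CPI approaches 1 from above but can never drop below 1.

```python
# Sketch: cycle counts for an ideal single-issue 5-stage pipeline
# (assumes no stalls).  After a 4-cycle fill, one instruction
# completes per cycle, so n instructions take n + 4 cycles.

STAGES = 5

def unpipelined_cycles(n):
    return STAGES * n            # one instruction at a time

def pipelined_cycles(n):
    return n + (STAGES - 1)      # fill cost, then 1 instr/cycle

def cpi(n):
    return pipelined_cycles(n) / n   # always > 1 for single issue

for n in (5, 100, 10_000):
    print(n, unpipelined_cycles(n), pipelined_cycles(n), round(cpi(n), 4))
```

Beating CPI = 1 therefore requires issuing more than one instruction per cycle, which is exactly the superscalar/ILP territory the next slides cover.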
State-of-the-Art Microprocessor
°“Superscalar” execution or Instruction Level Parallelism (ILP)
  => Deeper Pipeline => Dynamic Branch Prediction => Speculation => Recovery
°“Out-of-order” Execution => Instruction Window and Prefetch => Reorder Buffers
°“VLIW” Ex: Intel/HP Itanium

Instruction Level Parallelism (ILP)
IPC > 1
[Figure: two instructions fetched and executed side by side each cycle, giving ILP = 2]
EX: Pentium, SPARC, MIPS R10000, IBM PowerPC

HW Schemes: Instruction Parallelism
°Key idea: allow instructions behind a stall to proceed
  DIVD F0,F2,F4
  ADDD F10,F0,F8
  SUBD F12,F8,F14
• Enables out-of-order execution => out-of-order completion
• The ID stage checks for hazards; if there are none, it issues the instruction for execution. Scoreboarding dates to the CDC 6600 in 1963

How ILP Works
° Issuing multiple instructions per cycle requires fetching multiple instructions from memory per cycle => called the superscalar degree or issue width
° To find independent instructions, we must have a big pool of instructions to choose from, called the instruction buffer (IB). As the IB length increases, the complexity of the decoder (control) increases, which increases the datapath cycle time
° Prefetch instructions sequentially with an IFU that operates independently of the datapath control: fetch instruction (PC)+L, where L is the IB size, or fetch as directed by the branch predictor
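The DIVD/ADDD/SUBD example above can be sketched as a scoreboard-style RAW check (a hypothetical helper for illustration, not the actual CDC 6600 scoreboard logic): an instruction behind the stalled DIVD may issue only if it reads no register still being produced.

```python
# Sketch: RAW-dependence check that lets independent instructions
# proceed past a stall (the essence of scoreboard issue logic).

def can_issue(instr, in_flight_dests):
    """instr = (op, dest, src1, src2); issue only with no RAW hazard."""
    _, _, src1, src2 = instr
    return src1 not in in_flight_dests and src2 not in in_flight_dests

divd = ("DIVD", "F0",  "F2", "F4")   # long-latency divide, stalls
addd = ("ADDD", "F10", "F0", "F8")   # reads F0 -> must wait for DIVD
subd = ("SUBD", "F12", "F8", "F14")  # independent -> may proceed

busy = {"F0"}                         # DIVD issued, F0 not yet written
print("ADDD can issue:", can_issue(addd, busy))   # False: RAW on F0
print("SUBD can issue:", can_issue(subd, busy))   # True: out of order
```

SUBD completing before ADDD is exactly the out-of-order completion the slide mentions; a real scoreboard also tracks WAR/WAW hazards and functional-unit availability.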
Microarchitecture of an ILP-based CPU (PowerPC)
[Figure]

Very Large Instruction Word (VLIW)
IPC > 1
[Figure: each fetched VLIW word launches multiple Exec operations per cycle]
EX: Itanium

TriMedia TM32 Architecture
[Figure: 64-bit memory bus and 32-bit peripheral bus; multi-port 128 words x 32 bits register file; bypass network; 16 KB data cache and 32 KB instruction cache; five FUs fed by VLIW instruction decode and launch; compressed code in the instruction cache]

What is Multiprocessing
°Parallelism at the instruction level is limited by data dependency => speedup is limited!!
°There is abundant program-level parallelism, like Do I = 1, 1000 loops => Loop-Level Parallelism. How about employing multiple processors to execute the loops => parallel processing or multiprocessing
°With a billion transistors on a chip, we can put a few CPUs in one chip => chip multiprocessor

Hardware Multithreading
° We need a hardware multithreading technique because switching between threads in software is very time-consuming (Why?), so software switching suits I/O latencies but not main-memory accesses. Ex: multitasking
° Provide multiple PCs and register sets on the CPU so that thread switching can occur without having to store the register contents in main memory (on the stack, as is done for context switching)
° Several threads reside in the CPU simultaneously, and execution switches between the threads on main-memory accesses
° How about both multiprocessing and multithreading on a chip? => Network Processor

Hardware Multithreading
° How can we guarantee no dependencies between instructions in a pipeline?
• One way is to interleave the execution of instructions from different program threads on the same pipeline
Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:
  T1: LW r1, 0(r2)
  T2: ADD r7, r1, r4
  T3: XORI r5, r4, #12
  T4: SW 0(r7), r5
  T1: LW r5, 12(r1)

Architectural Comparisons (cont.)
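The four-thread interleave above can be sketched as a round-robin issue policy (a toy model; the thread streams are taken from the slide). With four threads on a 5-stage pipe, a thread's next instruction reads its registers only after its previous one has written back, so no interlocks or bypasses are needed.

```python
# Sketch: fine-grained hardware multithreading as round-robin issue,
# one instruction per thread per cycle.

threads = {                      # instruction streams from the slide
    "T1": ["LW r1, 0(r2)", "LW r5, 12(r1)"],
    "T2": ["ADD r7, r1, r4"],
    "T3": ["XORI r5, r4, #12"],
    "T4": ["SW 0(r7), r5"],
}

def round_robin_issue(threads, order=("T1", "T2", "T3", "T4")):
    """Interleave threads: each cycle, issue from the next live thread."""
    streams = {t: list(instrs) for t, instrs in threads.items()}
    issued = []
    while any(streams.values()):
        for t in order:
            if streams[t]:
                issued.append((t, streams[t].pop(0)))
    return issued

for tid, instr in round_robin_issue(threads):
    print(tid, instr)
```

The resulting issue order reproduces the slide's sequence: T1, T2, T3, T4, then back to T1 for its second load.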
[Figure: execution-slot occupancy over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading, with slots filled by Thread 1 through Thread 5 or left as idle slots]

Intel IXP2400 Network Processor
°XScale core replaces the StrongARM
°1.4 GHz target in a 0.13-micron process
°Nearest-neighbor routes added between microengines
°Hardware to accelerate CRC operations and random-number generation
°16-entry CAM

IBM Cell Processor
SPU: Synergistic Processor Unit
[Figure]

Chip Multiprocessors
[Figure]
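The loop-level parallelism a chip multiprocessor exploits can be sketched as splitting the iterations of a "Do I = 1, 1000" loop evenly across CPUs. This is a toy illustration: the loop body (summing i*i) is assumed, and ThreadPoolExecutor merely stands in for the on-chip processors (Python threads won't actually speed this up).

```python
# Sketch: partition independent loop iterations across "processors".
from concurrent.futures import ThreadPoolExecutor

def partial_sum(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))   # assumed loop body

def parallel_loop(n=1000, cpus=4):
    chunk = n // cpus                           # iterations per CPU
    bounds = [(1 + c * chunk, 1 + (c + 1) * chunk) for c in range(cpus)]
    with ThreadPoolExecutor(max_workers=cpus) as pool:
        return sum(pool.map(partial_sum, bounds))

print(parallel_loop())   # identical result to the serial loop
```

Because the iterations carry no data dependencies, the partitions can run on separate CPUs with only a final reduction step, which is why loop-level parallelism scales where instruction-level parallelism stalls.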