CS 162 Computer Architecture
Lecture 2: Introduction & Pipelining
Instructor: L.N. Bhuyan
www.cs.ucr.edu/~bhuyan/cs162

Review of Last Class
° MIPS Datapath
° Introduction to Pipelining
° Introduction to Instruction Level Parallelism (ILP)
° Introduction to VLIW

What is Multiprocessing?
° Parallelism at the instruction level is limited because of data dependency => speedup is limited!!
° Program-level parallelism is abundant, e.g., loop-level parallelism such as DO I = 1, 1000. How about employing multiple processors to execute the loop iterations? => Parallel processing or multiprocessing
° With a billion transistors on a chip, we can put a few CPUs on one chip => chip multiprocessor

Memory Latency Problem
Even if we increase CPU power, memory is the real bottleneck. Techniques to alleviate the memory latency problem:
1. Memory hierarchy – exploit program locality: cache memory, multilevel caches, pages and context switching
2. Prefetching – fetch the instruction/data before the CPU needs it. Good for instns because of sequential locality, so all modern processors use prefetch buffers for instns. What to do with data?
3. Multithreading – can the CPU jump to another program while it is accessing memory? It's like multiprogramming!!

Hardware Multithreading
° We need a hardware multithreading technique because switching between threads in software is very time-consuming (why?), so it is not suitable for hiding main memory (as opposed to I/O) latency. Ex: multitasking.
° Provide multiple PCs and register sets on the CPU so that a thread switch can occur without having to store the register contents in main memory (on the stack, as is done for software context switching).
° Several threads reside in the CPU simultaneously, and execution switches between the threads on a main memory access.
° How about both multiprocessing and multithreading on a chip? => Network processor

Architectural Comparisons (cont.)
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading; shading marks Thread 1 through Thread 5 and idle slots]

Intel IXP1200 Network Processor
° Initial component of the Intel Exchange Architecture (IXA)
° Chip multiprocessing – 6 microengines and a StrongARM core
° Each microengine is a 5-stage pipeline – no ILP, 4-way multithreaded
° 166 MHz fundamental clock rate
° Intel claims 2.5 Mpps IP routing for 64-byte packets
° Already the most widely used NPU – or, more accurately, the most widely admitted use

IXP1200 Chip Layout
° StrongARM processing core
° Microengines introduce a new ISA
° I/O: PCI, SDRAM, SRAM, and the IX bus (a PCI-like packet bus)
° On-chip FIFOs: 16 entries, 64 B each

IXP1200 Microengine
° 4 hardware contexts – single-issue processor; explicit, optional context switch on SRAM access (sketched below)
° Registers – all are single-ported; separate GPRs; 1536 registers total
° 32-bit ALU – can access GPR or XFER registers
° Standard 5-stage pipe
° 4 KB SRAM instruction store – not a cache!
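To make the hardware multithreading idea concrete (multiple PCs and register sets kept on chip, with a switch to another ready thread on a long-latency memory access), here is a minimal C sketch. It models the concept only, not Intel's actual microengine; the type names, the round-robin scheduling loop, and the per-context register count are invented for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CONTEXTS 4   /* IXP1200 microengine: 4 hardware contexts */
#define NUM_GPRS     32  /* per-context register count chosen only for illustration */

/* One hardware thread context: its PC and registers live on chip, so a
 * thread switch never has to save state to main memory. */
struct hw_context {
    uint32_t pc;
    uint32_t gpr[NUM_GPRS];
    int      ready;      /* 0 while an outstanding memory access is pending */
};

struct microengine {
    struct hw_context ctx[NUM_CONTEXTS];
    int current;         /* context currently issuing instructions */
};

/* On a long-latency (e.g., SRAM) access, park the current context and switch
 * to the next ready one; all state stays on chip, so the switch is cheap. */
static void switch_on_memory_access(struct microengine *me)
{
    me->ctx[me->current].ready = 0;   /* set back to 1 when the data returns */
    for (int i = 1; i <= NUM_CONTEXTS; i++) {
        int next = (me->current + i) % NUM_CONTEXTS;
        if (me->ctx[next].ready) {
            me->current = next;
            return;
        }
    }
    /* No context ready: the engine simply idles until a memory reply arrives. */
}

int main(void)
{
    struct microengine me = { .current = 0 };
    for (int i = 0; i < NUM_CONTEXTS; i++) me.ctx[i].ready = 1;

    switch_on_memory_access(&me);                    /* context 0 issues a memory access ... */
    printf("now running context %d\n", me.current);  /* ... execution moves to context 1 */
    return 0;
}
```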
Intel IXP2400 Microengine (New)
° XScale core replaces StrongARM
° 1.4 GHz target in 0.13-micron
° Nearest-neighbor routes added between microengines
° Hardware to accelerate CRC operations and random-number generation
° 16-entry CAM

MIPS Pipeline
Chapter 6, CS 161 Text

Review: Single-cycle Datapath for MIPS
[Figure: datapath divided into five stages – Stage 1: PC and Instruction Memory (Imem); Stage 2: Registers; Stage 3: ALU; Stage 4: Data Memory (Dmem); Stage 5: register write back]
° Use the datapath figure to represent the pipeline: IFtch, Dcd, Exec, Mem, WB (IM, Reg, ALU, DM, Reg)

Stages of Execution in Pipelined MIPS
5-stage instruction pipeline:
1) I-fetch: fetch instruction, increment PC
2) Decode: decode instruction, read registers
3) Execute: mem-reference: calculate address; R-format: perform ALU operation
4) Memory: load: read data from data memory; store: write data to data memory
5) Write back: write data to register

Pipelined Execution Representation
[Figure: five overlapped instructions, each going through IFtch, Dcd, Exec, Mem, WB, with program flow down the page and time across]
° To simplify the pipeline, every instruction takes the same number of steps, called stages
° One clock cycle per stage

Datapath Timing: Single-cycle vs. Pipelined
° Assume the following delays for the major functional units:
  • 2 ns for a memory access or ALU operation
  • 1 ns for a register file read or write
° Total datapath delay for single-cycle:

  Insn Type   Insn Fetch   Reg Read   ALU Oper   Data Access   Reg Write   Total Time
  beq         2 ns         1 ns       2 ns       –             –           5 ns
  R-form      2 ns         1 ns       2 ns       –             1 ns        6 ns
  sw          2 ns         1 ns       2 ns       2 ns          –           7 ns
  lw          2 ns         1 ns       2 ns       2 ns          1 ns        8 ns

° In the pipelined machine, each stage = length of the longest delay = 2 ns; 5 stages = 10 ns

Pipelining Lessons
° Pipelining doesn't help the latency (execution time) of a single task; it helps the throughput of the entire workload
° Multiple tasks operate simultaneously using different resources
° Potential speedup = number of pipe stages
° Time to "fill" the pipeline and time to "drain" it reduces speedup
° Pipeline rate is limited by the slowest pipeline stage
° Unbalanced lengths of pipe stages also reduce speedup

Single Cycle Datapath (From Ch 5)
[Figure: single-cycle datapath – PC, Imem, register file (Regs), sign extend, ALU, and Dmem, with control signals PCSrc, RegDst, RegWrite, ALUSrc, ALUOp, ALUcon, MemWrite, MemRead, and MemToReg]

Required Changes to Datapath
° Introduce registers to separate the 5 stages by putting IF/ID, ID/EX, EX/MEM, and MEM/WB registers in the datapath.
° The next PC value is computed in the 3rd stage, but we need to bring in the next instn in the next cycle – move the PCSrc mux to the 1st stage. The PC is incremented unless there is a new branch address.
° The branch address is computed in the 3rd stage. With the pipeline, the PC value has changed by then! Must carry the PC value along with the instn. Width of IF/ID register = (IR) + (PC) = 64 bits.

Changes to Datapath Contd.
° For the lw instn, we need the write-register address at stage 5. But the IR is occupied by another instn by then! So we must carry the IR destination field along as we move through the stages. See the connection in the figure and the sketch below.
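One way to picture what the IF/ID and ID/EX registers must hold, including the carried PC and the carried destination-register field just described, is as C structs. The field names here are made up for illustration; the bit widths in the comments are the ones used on these slides.

```c
#include <stdint.h>

/* IF/ID register: the fetched instruction plus the PC that must travel with
 * it for branch-address calculation: 32 + 32 = 64 bits. */
struct if_id_reg {
    uint32_t instruction;    /* 32 bits: IR */
    uint32_t pc;             /* 32 bits: carried PC */
};

/* ID/EX register: everything the Execute stage (and later stages) will need. */
struct id_ex_reg {
    uint32_t pc;             /* 32 bits: PC, still carried along */
    uint32_t reg1_data;      /* 32 bits: value read from source register 1 */
    uint32_t reg2_data;      /* 32 bits: value read from source register 2 */
    uint32_t sign_ext_imm;   /* 32 bits: sign-extended 16-bit offset */
    uint8_t  dest_reg;       /* 5 bits: destination register field, carried to stage 5 for lw */
};
```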
° Length of ID/EX register = (Reg1: 32) + (Reg2: 32) + (offset: 32) + (PC: 32) + (destination register: 5) = 133 bits
° Assignment: What are the lengths of the EX/MEM and MEM/WB registers?

Pipelined Datapath (with Pipeline Regs) (6.2)
[Figure: five-stage pipelined datapath (Fetch, Decode, Execute, Memory, Write Back) with the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers – 64, 133, 102, and 69 bits wide – between the PC/Imem, Regs/sign extend, ALU, and Dmem stages]
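As a cross-check of the widths labelled in the figure (64, 133, 102, and 69 bits), the short C program below adds up one plausible field breakdown for each pipeline register. The IF/ID and ID/EX fields follow the slides; the EX/MEM and MEM/WB breakdowns are assumptions chosen to be consistent with the figure's totals, not an official answer to the assignment.

```c
#include <stdio.h>

/* Field widths (in bits) latched in each pipeline register. */
int main(void)
{
    int if_id  = 32 /* IR */ + 32 /* PC */;                               /* = 64  */
    int id_ex  = 32 /* Reg1 */ + 32 /* Reg2 */ + 32 /* offset */
               + 32 /* PC */ + 5 /* dest reg */;                          /* = 133 */
    int ex_mem = 32 /* branch target */ + 1 /* Zero */ + 32 /* ALU result */
               + 32 /* store data (Reg2) */ + 5 /* dest reg */;           /* = 102 */
    int mem_wb = 32 /* memory read data */ + 32 /* ALU result */
               + 5 /* dest reg */;                                        /* = 69  */

    printf("IF/ID  = %d bits\n", if_id);
    printf("ID/EX  = %d bits\n", id_ex);
    printf("EX/MEM = %d bits\n", ex_mem);
    printf("MEM/WB = %d bits\n", mem_wb);
    return 0;
}
```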