EE 333 Fall 2006
Computer Organization, Lecture 20
Pipelining: "bucket brigade", MIPS pipeline & control, Pentium 4 architecture
Lillevik 333f06-l20, University of Portland School of Engineering

Pipelining overview
• Pipelining
  – Increased performance through parallel operations
  – Goal: complete several operations at the same time
• Hazards
  – Conditions which inhibit parallel operations
  – Techniques exist to minimize the problem

A laundry pipeline
• To do laundry: wash, dry, fold, put away
• Each step takes 30 minutes, but for four students ....
• Laundry done at 2 AM

Let's speed it up (pipeline)
• Move one load from one step to the next, "bucket brigade" style
• But start the next load before the first is complete
• Takes only until 9:30 PM – party time!!

Speedup
• Ratio of serial time to parallel time
• Metric to compare the advantages of parallel operations

  S = T_series / T_parallel

Find the laundry speedup?

A computer pipeline
• Assume
  – Instructions require multiple clocks to complete
  – Each instruction follows approximately the same steps (stages)
• Method
  – Start the initial instruction on the first clock
  – On following clocks, start subsequent instructions

MIPS instruction steps/stages
1. IF: Fetch instruction from memory
2. ID: Read registers while decoding the instruction
3. EX: Execute the operation or calculate an address
4. MEM: Access an operand in data memory
5. WB: Write the result into a register

MIPS pipeline

  Clock:     1    2    3    4    5    6    7    8    9
  Instr 1:   IF   ID   EX   MEM  WB                        <- first instruction ends
  Instr 2:        IF   ID   EX   MEM  WB
  Instr 3:             IF   ID   EX   MEM  WB
  Instr 4:                  IF   ID   EX   MEM  WB
  Instr 5:                       IF   ID   EX   MEM  WB    <- fifth instruction starts (clock 5)

Find the MIPS pipeline speedup? Assume five instructions.

What about a large program?
• Series:    T_series = T_1 + ... + T_N = 5N   (each instruction takes 5 clocks)
• Pipelined: T_parallel = T_1 + ... + T_N = 5 + 1 + ... + 1 = 5 + (N - 1)
• S = T_series / T_parallel = 5N / (5 + (N - 1)) = 5 / (5/N + 1 - 1/N)
• lim (N -> infinity) S = 5 / (0 + 1 - 0) = 5

Speedup of pipeline with p stages?
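The overlap in the pipeline diagram and the speedup questions above can be checked with a short simulation. The sketch below is illustrative only: the function names (pipeline_table, speedup) and the choice of 5 stages and 5 instructions are assumptions matching the MIPS example, not material from the slides, and it models an ideal hazard-free pipeline.

# Minimal sketch of an ideal (hazard-free) 5-stage pipeline, assuming the
# MIPS stages and instruction count used in the lecture example.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_table(n_instructions, stages=STAGES):
    """Return a dict: clock cycle -> list of (instruction number, stage) pairs."""
    table = {}
    for i in range(n_instructions):            # instruction i enters IF on clock i+1
        for s, name in enumerate(stages):
            cycle = i + s + 1
            table.setdefault(cycle, []).append((i + 1, name))
    return table

def speedup(n_instructions, n_stages=len(STAGES)):
    t_series = n_stages * n_instructions       # serial: every instruction uses all stages
    t_pipe = n_stages + (n_instructions - 1)   # fill the pipe once, then one completion per clock
    return t_series / t_pipe

if __name__ == "__main__":
    for cycle, work in sorted(pipeline_table(5).items()):
        print(f"clock {cycle}: " + ", ".join(f"I{i}:{st}" for i, st in work))
    print("speedup for 5 instructions:", round(speedup(5), 2))   # 25/9, about 2.8

Running it reproduces the diagram row by row (the first instruction writes back in clock 5, the fifth finishes in clock 9) and the 25/9 speedup worked out at the end of the lecture.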
MIPS pipelined datapath
• Pipeline registers added to the datapath

Pipelined control
• Control signals are generated during decode, from the instruction held in IF/ID
• Signals used in a later stage are saved in the pipeline registers and passed along:
  – EX, M, and WB fields saved in ID/EX for the EX stage
  – M and WB fields saved in EX/MEM for the MEM stage
  – WB field saved in MEM/WB for the WB stage

Datapath & pipelined control (figure)

Pentium 4 pipeline
• Twenty stages long
• Theoretical speedup of 20
• Hazards (forced sequential operations) reduce the speedup
  – Some instructions are executed "out of order" to avoid hazards
  – Multiple (optimistic) pipelines are created; one is selected to produce the result and the other data are discarded

Early Pentium 4
• Socket 423/478
• 42 M transistors, 0.18 and 0.13 µm technology
• 2.0 GHz core frequency, ~60 W
• Integrated heat spreader, built-in thermal monitor

NetBurst Architecture
• Faster system bus
• Advanced transfer cache
• Advanced dynamic execution (execution trace cache, enhanced branch prediction)
• Hyper-pipelined technology
• Rapid execution engine
• Enhanced floating point and multimedia (SSE2)

Architecture Overview (figure)

Front Side Bus (figure)

FSB Bandwidth
• Clocked at 100 MHz, quad "pumped" (four transfers per bus clock, shown as Clock A and Clock B in the timing figure)
• 128 B cache lines, 64-bit (8 B) accesses
• Split transactions, pipelined
• External bandwidth: 100 M x 8 B x 4 = 3.2 GB/s
• Makes better use of bus bandwidth

L2 Advanced Transfer Cache (figure)

Full-Speed L2 Cache
• Depth of 256 KB
• Eight-way set associative, 128 B line
• Wide instruction & data interface of 256 bits (32 B)
• Read latency of 7 clocks, but ...
• Clocked at core frequency (2.0 GHz)
• Internal bandwidth: 32 B x 2.0 G = 64 GB/s
• Optimizes data transfers to/from memory

L1 Data Cache (figure)

L1 Data Cache
• Depth of 8 KB
• Four-way set associative, 64 B line
• Read latency of 2 clocks, but ....
• Dual ported for one load and one store per clock
• Supports an advanced prefetch algorithm
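The geometry and bandwidth figures quoted above follow directly from the stated parameters. The sketch below just redoes that arithmetic; the helper names are illustrative, and the set-count formula size / (ways x line size) is standard cache bookkeeping rather than anything taken from the slides.

# Sketch: derive set counts and bandwidth from the Pentium 4 parameters above.
# Function names are assumptions; the numbers come from the slides.

def cache_sets(size_bytes, ways, line_bytes):
    """Sets in a set-associative cache: size / (associativity * line size)."""
    return size_bytes // (ways * line_bytes)

def bandwidth_bytes_per_s(clock_hz, bytes_per_transfer, transfers_per_clock=1):
    return clock_hz * bytes_per_transfer * transfers_per_clock

# L2: 256 KB, 8-way, 128 B lines, 32 B interface at the 2.0 GHz core clock
print("L2 sets:", cache_sets(256 * 1024, 8, 128))                                # 256
print("L2 bandwidth (GB/s):", bandwidth_bytes_per_s(2_000_000_000, 32) / 1e9)    # 64.0

# L1 data cache: 8 KB, 4-way, 64 B lines
print("L1 sets:", cache_sets(8 * 1024, 4, 64))                                   # 32

# FSB: 100 MHz, 8 B wide, quad pumped (4 transfers per clock)
print("FSB bandwidth (GB/s):", bandwidth_bytes_per_s(100_000_000, 8, 4) / 1e9)   # 3.2

The output matches the slides' 3.2 GB/s external and 64 GB/s internal figures, which is the point of running the L2 at the full core clock.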
Dynamic Execution (figure)

Trace Cache & Branch Prediction
• Replaces the traditional L1 instruction cache
• Trace cache contains ~12K decoded instructions (micro-operations), removing decode latency
• Improved branch prediction algorithm eliminates 33% of P3 mispredictions (pipeline stalls)
• Keeps the correct instructions executing

Execution Engine (figure)

Hyper Pipelined Technology
• Execution pipeline contains 20 stages
  – Out-of-order, speculative execution unit
  – 126 instructions "in flight"
  – Includes 48 loads and 24 stores
• Rapid execution engine
  – 2 ALUs, clocked at 2X (one instruction in ½ clock)
  – 2 AGUs, clocked at 2X
• Results in higher throughput and reduced latency

Streaming SIMD Extensions
• FPU and MMX
  – 128-bit format
  – AGU data movement register
• SSE2 (extends MMX and SSE)
  – 144 new instructions
  – DP floating-point
  – Integer
  – Cache and memory management
• Performance increases across a broad range of applications

Find the laundry speedup?
  S = T_series / T_parallel = 8 hrs / 3.5 hrs = 2.29

Find the MIPS pipeline speedup? Assume five instructions.
  T_series = I_1 + I_2 + I_3 + I_4 + I_5 = 5 + 5 + 5 + 5 + 5 = 25
  T_pipe   = I_1 + I_2 + I_3 + I_4 + I_5 = 5 + 1 + 1 + 1 + 1 = 9
  S = T_series / T_pipe = 25 / 9 = 2.8

Speedup of pipeline with p stages?
  Series:    T_series = T_1 + ... + T_N = pN
  Pipelined: T_parallel = T_1 + ... + T_N = p + 1 + ... + 1 = p + (N - 1)
  S = T_series / T_parallel = pN / (p + (N - 1)) = p / (p/N + 1 - 1/N)
  lim (N -> infinity) S = p / (0 + 1 - 0) = p
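A few lines of code make the limiting behaviour of the p-stage formula concrete. This is a minimal sketch under the same idealized assumptions as the derivation above (no hazards, stalls, or mispredictions); the function name is illustrative only.

# Sketch: ideal p-stage pipeline speedup S = pN / (p + N - 1) and its limit as N grows.

def ideal_speedup(p, n):
    """Speedup of an ideal p-stage pipeline over serial execution of n instructions."""
    return (p * n) / (p + n - 1)

if __name__ == "__main__":
    p = 5
    for n in (5, 50, 500, 5_000, 5_000_000):
        print(f"N = {n:>9}: S = {ideal_speedup(p, n):.3f}")      # approaches p = 5
    # By the same formula a 20-stage pipeline like the Pentium 4's has a theoretical
    # speedup of 20, but hazards keep the realized speedup well below that.
    print(f"20 stages, N = 1_000_000: S = {ideal_speedup(20, 1_000_000):.2f}")

For N = 5 this reproduces the 25/9 figure above, and as N grows the speedup climbs toward p, which is why deeper pipelines promise higher peak throughput.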