EE 333 Fall 2006
Computer Organization, Lecture 20
Pipelining: "bucket brigade", MIPS pipeline & control, Pentium 4 architecture
Lillevik 333f06-l20, University of Portland School of Engineering

Pipelining overview
• Pipelining
  – Increased performance through parallel operations
  – Goal: complete several operations at the same time
• Hazards
  – Conditions which inhibit parallel operations
  – Techniques exist to minimize the problem

A laundry pipeline
• To do laundry: wash, dry, fold, put away
• Each step takes 30 minutes, but for four students ....
• Laundry done at 2 AM

Let's speed it up (pipeline)
• Move one load from one step to the next, "bucket brigade" style
• But start the next load before the first is complete
• Takes only until 9:30 PM – party time!!

Speedup
• Ratio of serial time to parallel time
• Metric to compare the advantages of parallel operations

  S = T_series / T_parallel

Find the laundry speedup?

A computer pipeline
• Assume
  – Instructions require multiple clocks to complete
  – Each instruction follows approximately the same steps (stages)
• Method
  – Start the initial instruction on the first clock
  – On following clocks, start subsequent instructions

MIPS instruction steps/stages
1. IF: Fetch instruction from memory
2. ID: Read registers while decoding the instruction
3. EX: Execute the operation or calculate an address
4. MEM: Access an operand in data memory
5. WB: Write the result into a register

MIPS pipeline

  Clock:     1    2    3    4    5    6    7    8    9
  Instr 1:   IF   ID   EX   MEM  WB                        <- first instruction ends
  Instr 2:        IF   ID   EX   MEM  WB
  Instr 3:             IF   ID   EX   MEM  WB
  Instr 4:                  IF   ID   EX   MEM  WB
  Instr 5:                       IF   ID   EX   MEM  WB    <- fifth instruction starts (clock 5)

Find the MIPS pipeline speedup? Assume five instructions.

What about a large program?
• Series:    T_series = T_1 + ... + T_N = 5N   (each instruction takes 5 clocks)
• Pipelined: T_parallel = T_1 + ... + T_N = 5 + 1 + ... + 1 = 5 + (N - 1)
• S = T_series / T_parallel = 5N / (5 + (N - 1)) = 5 / (5/N + 1 - 1/N)
• lim (N -> infinity) S = 5 / (0 + 1 - 0) = 5

Speedup of pipeline with p stages?
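The overlap in the pipeline diagram and the speedup questions above can be checked with a short simulation. The sketch below is illustrative only: the function names (pipeline_table, speedup) and the choice of 5 stages and 5 instructions are assumptions matching the MIPS example, not material from the slides, and it models an ideal hazard-free pipeline.

# Minimal sketch of an ideal (hazard-free) 5-stage pipeline, assuming the
# MIPS stages and instruction count used in the lecture example.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_table(n_instructions, stages=STAGES):
    """Return a dict: clock cycle -> list of (instruction number, stage) pairs."""
    table = {}
    for i in range(n_instructions):            # instruction i enters IF on clock i+1
        for s, name in enumerate(stages):
            cycle = i + s + 1
            table.setdefault(cycle, []).append((i + 1, name))
    return table

def speedup(n_instructions, n_stages=len(STAGES)):
    t_series = n_stages * n_instructions       # serial: every instruction uses all stages
    t_pipe = n_stages + (n_instructions - 1)   # fill the pipe once, then one completion per clock
    return t_series / t_pipe

if __name__ == "__main__":
    for cycle, work in sorted(pipeline_table(5).items()):
        print(f"clock {cycle}: " + ", ".join(f"I{i}:{st}" for i, st in work))
    print("speedup for 5 instructions:", round(speedup(5), 2))   # 25/9, about 2.8

Running it reproduces the diagram row by row (the first instruction writes back in clock 5, the fifth finishes in clock 9) and the 25/9 speedup worked out at the end of the lecture.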
MIPS pipelined datapath
• Pipeline registers added to the datapath

Pipelined control
• Control signals are generated during decode, from the instruction held in IF/ID
• Signals used in a later stage are saved in the pipeline registers and passed along:
  – EX, M, and WB fields saved in ID/EX for the EX stage
  – M and WB fields saved in EX/MEM for the MEM stage
  – WB field saved in MEM/WB for the WB stage

Datapath & pipelined control (figure)

Pentium 4 pipeline
• Twenty stages long
• Theoretical speedup of 20
• Hazards (forced sequential operations) reduce the speedup
  – Some instructions are executed "out of order" to avoid hazards
  – Multiple (optimistic) pipelines are created; one is selected to produce the result and the other data are discarded

Early Pentium 4
• Socket 423/478
• 42 M transistors, 0.18 and 0.13 µm technology
• 2.0 GHz core frequency, ~60 W
• Integrated heat spreader, built-in thermal monitor

NetBurst Architecture
• Faster system bus
• Advanced transfer cache
• Advanced dynamic execution (execution trace cache, enhanced branch prediction)
• Hyper-pipelined technology
• Rapid execution engine
• Enhanced floating point and multimedia (SSE2)

Architecture Overview (figure)

Front Side Bus (figure)

FSB Bandwidth
• Clocked at 100 MHz, quad "pumped" (four transfers per bus clock, shown as Clock A and Clock B in the timing figure)
• 128 B cache lines, 64-bit (8 B) accesses
• Split transactions, pipelined
• External bandwidth: 100 M x 8 B x 4 = 3.2 GB/s
• Makes better use of bus bandwidth

L2 Advanced Transfer Cache (figure)

Full-Speed L2 Cache
• Depth of 256 KB
• Eight-way set associative, 128 B line
• Wide instruction & data interface of 256 bits (32 B)
• Read latency of 7 clocks, but ...
• Clocked at core frequency (2.0 GHz)
• Internal bandwidth: 32 B x 2.0 G = 64 GB/s
• Optimizes data transfers to/from memory

L1 Data Cache (figure)

L1 Data Cache
• Depth of 8 KB
• Four-way set associative, 64 B line
• Read latency of 2 clocks, but ....
• Dual ported for one load and one store per clock
• Supports an advanced prefetch algorithm
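The geometry and bandwidth figures quoted above follow directly from the stated parameters. The sketch below just redoes that arithmetic; the helper names are illustrative, and the set-count formula size / (ways x line size) is standard cache bookkeeping rather than anything taken from the slides.

# Sketch: derive set counts and bandwidth from the Pentium 4 parameters above.
# Function names are assumptions; the numbers come from the slides.

def cache_sets(size_bytes, ways, line_bytes):
    """Sets in a set-associative cache: size / (associativity * line size)."""
    return size_bytes // (ways * line_bytes)

def bandwidth_bytes_per_s(clock_hz, bytes_per_transfer, transfers_per_clock=1):
    return clock_hz * bytes_per_transfer * transfers_per_clock

# L2: 256 KB, 8-way, 128 B lines, 32 B interface at the 2.0 GHz core clock
print("L2 sets:", cache_sets(256 * 1024, 8, 128))                                # 256
print("L2 bandwidth (GB/s):", bandwidth_bytes_per_s(2_000_000_000, 32) / 1e9)    # 64.0

# L1 data cache: 8 KB, 4-way, 64 B lines
print("L1 sets:", cache_sets(8 * 1024, 4, 64))                                   # 32

# FSB: 100 MHz, 8 B wide, quad pumped (4 transfers per clock)
print("FSB bandwidth (GB/s):", bandwidth_bytes_per_s(100_000_000, 8, 4) / 1e9)   # 3.2

The output matches the slides' 3.2 GB/s external and 64 GB/s internal figures, which is the point of running the L2 at the full core clock.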
Dynamic Execution (figure)

Trace Cache & Branch Prediction
• Replaces the traditional L1 instruction cache
• Trace cache contains ~12K decoded instructions (micro-operations), removing decode latency
• Improved branch prediction algorithm eliminates 33% of P3 mispredictions (pipeline stalls)
• Keeps the correct instructions executing

Execution Engine (figure)

Hyper Pipelined Technology
• Execution pipeline contains 20 stages
  – Out-of-order, speculative execution unit
  – 126 instructions "in flight"
  – Includes 48 loads and 24 stores
• Rapid execution engine
  – 2 ALUs, clocked at 2X (one instruction in ½ clock)
  – 2 AGUs, clocked at 2X
• Results in higher throughput and reduced latency

Streaming SIMD Extensions
• FPU and MMX
  – 128-bit format
  – AGU data movement register
• SSE2 (extends MMX and SSE)
  – 144 new instructions
  – DP floating-point
  – Integer
  – Cache and memory management
• Performance increases across a broad range of applications

Find the laundry speedup?
  S = T_series / T_parallel = 8 hrs / 3.5 hrs = 2.29

Find the MIPS pipeline speedup? Assume five instructions.
  T_series = I_1 + I_2 + I_3 + I_4 + I_5 = 5 + 5 + 5 + 5 + 5 = 25
  T_pipe   = I_1 + I_2 + I_3 + I_4 + I_5 = 5 + 1 + 1 + 1 + 1 = 9
  S = T_series / T_pipe = 25 / 9 = 2.8

Speedup of pipeline with p stages?
  Series:    T_series = T_1 + ... + T_N = pN
  Pipelined: T_parallel = T_1 + ... + T_N = p + 1 + ... + 1 = p + (N - 1)
  S = T_series / T_parallel = pN / (p + (N - 1)) = p / (p/N + 1 - 1/N)
  lim (N -> infinity) S = p / (0 + 1 - 0) = p
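A few lines of code make the limiting behaviour of the p-stage formula concrete. This is a minimal sketch under the same idealized assumptions as the derivation above (no hazards, stalls, or mispredictions); the function name is illustrative only.

# Sketch: ideal p-stage pipeline speedup S = pN / (p + N - 1) and its limit as N grows.

def ideal_speedup(p, n):
    """Speedup of an ideal p-stage pipeline over serial execution of n instructions."""
    return (p * n) / (p + n - 1)

if __name__ == "__main__":
    p = 5
    for n in (5, 50, 500, 5_000, 5_000_000):
        print(f"N = {n:>9}: S = {ideal_speedup(p, n):.3f}")      # approaches p = 5
    # By the same formula a 20-stage pipeline like the Pentium 4's has a theoretical
    # speedup of 20, but hazards keep the realized speedup well below that.
    print(f"20 stages, N = 1_000_000: S = {ideal_speedup(20, 1_000_000):.2f}")

For N = 5 this reproduces the 25/9 figure above, and as N grows the speedup climbs toward p, which is why deeper pipelines promise higher peak throughput.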