Download www.eng.yale.edu

EENG 449bG/CPSC 439bG Computer Systems Lecture 17 Instruction Level Parallelism III (Multiple Issue Processors and Speculation) March 29, 2005 Prof. Andreas Savvides Spring 2005 http://www.eng.yale.edu/courses/2005s/een g449b 3/29/05 EENG449b/Savvides Lec 17.1 Why can Tomasulo overlap iterations of loops? • Register renaming – Multiple iterations use different physical destinations for registers (dynamic loop unrolling). • Reservation stations – Permit instruction issue to advance past integer control flow operations – Also buffer old values of registers - totally avoiding the WAR stall that we saw in the scoreboard. • Other perspective: Tomasulo building data flow dependency graph on the fly. 3/29/05 EENG449b/Savvides Lec 17.2 Tomasulo’s scheme offers 2 major advantages (1) the distribution of the hazard detection logic – – – distributed reservation stations and the CDB (Common Data Bus) If multiple instructions waiting on single result, & each instruction has other operand, then instructions can be released simultaneously by broadcast on CDB If a centralized register file were used, the units would have to read their results from the registers when register buses are available. (2) the elimination of stalls for WAW and WAR hazards 3/29/05 EENG449b/Savvides Lec 17.3 Multiple Issue Processors • Two main types: – Superscalar Processors » Issue variable number of instructions per clock cycle » Can be statically or dynamically scheduled – VLIW (Very Large Instruction Set) Processors » Issue a constant number of instructions formatted as a packet of smaller instructions » Parallelism across instructions is specifically indicated » Statically scheduled by the compiler 3/29/05 EENG449b/Savvides Lec 17.4 Multiple Issue Issues • issue packet: group of instructions from fetch unit that could potentially issue in 1 clock – If instruction causes structural hazard or a data hazard either due to earlier instruction in execution or to earlier instruction in issue packet, then instruction does not issue – 0 to N instruction issues per clock cycle, for N-issue • Performing issue checks in 1 cycle could limit clock cycle time: O(n2-n) comparisons – => issue stage usually split and pipelined – 1st stage decides how many instructions from within this packet can issue, 2nd stage examines hazards among selected instructions and those already been issued – => splitting leads to higher branch penalties => prediction accuracy important 3/29/05 EENG449b/Savvides Lec 17.5 Getting CPI < 1: Issuing Multiple Instructions/Cycle • Superscalar MIPS: 2 instructions, 1 FP & 1 anything – Fetch 64-bits/clock cycle; Int on left, FP on right – Can only issue 2nd instruction if 1st instruction issues – More ports for FP registers to do FP load & FP op in a pair Type Int. instruction FP instruction Int. instruction FP instruction Int. instruction FP instruction Pipe Stages IF ID IF ID IF IF EX MEM WB EX MEM WB ID EX MEM WB ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB • 1 cycle load delay expands to 3 instructions in SS – instruction in right half can’t use it, nor instructions in next slot 3/29/05 EENG449b/Savvides Lec 17.6 Dynamic Scheduling in Superscalar The easy way • How to issue two instructions and keep in-order instruction issue for Tomasulo? – Assume 1 integer + 1 floating point – 1 Tomasulo control for integer, 1 for floating point • Issue 2X Clock Rate, so that issue remains in order • Only loads/stores might cause dependency between integer and FP issue: – Replace load reservation station with a load queue; operands must be read in the order they are fetched – Load checks addresses in Store Queue to avoid RAW violation – Store checks addresses in Load Queue to avoid WAR,WAW 3/29/05 EENG449b/Savvides Lec 17.7 Superscalar Processors Require More Ambitious Scheduling • Need to deal with preserving exception order • Pipelining the issue stage will result in additional overheads – E.g in a superscalar pipeline the outcome of a load instruction cannot be used in the next 3 instructions – Because the issue stage is pipelined • Without more ambitious scheduling, Superscalar processors do not have an advantage • How can we extend Tomasulo’s algorithm to schedule a superscalar pipeline? 3/29/05 EENG449b/Savvides Lec 17.8 Hardware Speculation • Tomasulo had: In-order issue, out-of-order execution, and out-of-order completion • Need to “fix” the out-of-order completion aspect so that we can find precise breakpoint in instruction stream. 3/29/05 EENG449b/Savvides Lec 17.9 Relationship between precise interrupts and specultation: • Speculation is a form of guessing. • Important for branch prediction: – Need to “take our best shot” at predicting branch direction. – Go further than dynamic branch prediction – start executing • If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly: – This is exactly same as precise exceptions! • Technique for both precise interrupts/exceptions and speculation: in-order completion or commit • Tomasulo’s algorithm can be extended to support speculation 3/29/05 EENG449b/Savvides Lec 17.10 Tomasulo Recap 3/29/05 EENG449b/Savvides Lec 17.11 Improvements to Tomasulo Algorithm • Separate the bypassing of results among instructions – Allow an instruction to execute and bypass its results to other instructions – Do not allow the instruction to make changes that cannot be undone unless we know the instruction is no longer speculative – When an instruction is no longer speculative, we allow it to update the register file – instruction commit • Speculation key idea – Execute instructions out of order but commit them in order 3/29/05 EENG449b/Savvides Lec 17.12 HW support for precise interrupts • Need HW buffer for results of uncommitted instructions: reorder buffer (ROB) – 3 fields: instr, destination, value – Use reorder buffer number instead of reservation station FP when execution completes Op – Supplies operands between Queue execution complete & commit – (Reorder buffer can be operand source => more registers like RS) – Instructions commit Res Stations – Once instruction commits, FP Adder result is put into register – As a result, easy to undo speculated instructions on mispredicted branches or exceptions 3/29/05 Reorder Buffer FP Regs Res Stations FP Adder EENG449b/Savvides Lec 17.13 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”) 2.Execution—operate on operands (EX) When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”) 3.Write result—finish execution (WB) Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available. 4.Commit—update register with reorder result When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called “graduation”) 3/29/05 EENG449b/Savvides Lec 17.14 Program Counter Valid Exceptions? Result Reorder Table FP Op Queue Res Stations FP Adder Compar network Dest Reg What are the hardware complexities with reorder buffer (ROB)? Reorder Buffer FP Regs Res Stations FP Adder • Need as many ports on ROB as register file 3/29/05 EENG449b/Savvides Lec 17.15 Summary • Reservations stations: implicit register renaming to larger set of registers + buffering source operands – Prevents registers as bottleneck – Avoids WAR, WAW hazards of Scoreboard – Allows loop unrolling in HW • Not limited to basic blocks (integer units gets ahead, beyond branches) • Today, helps cache misses as well – Don’t stall for L1 Data cache miss (insufficient ILP for L2 miss?) • Lasting Contributions – Dynamic scheduling – Register renaming – Load/store disambiguation 3/29/05 EENG449b/Savvides Lec 17.16 Register renaming, virtual registers versus Reorder Buffers • Alternative to Reorder Buffer is a larger virtual set of registers and register renaming • Virtual registers hold both architecturally visible registers + temporary values – replace functions of reorder buffer and reservation station • Renaming process maps names of architectural registers to registers in virtual register set – Changing subset of virtual registers contains architecturally visible registers • Simplifies instruction commit: mark register as no longer speculative, free register with old value • Adds 40-80 extra registers: Alpha, Pentium,… – Size limits no. instructions in execution (used until commit) 3/29/05 EENG449b/Savvides Lec 17.17 How much to speculate? • Speculation Pro: uncover events that would otherwise stall the pipeline (cache misses) • Speculation Con: speculate costly if exceptional event occurs when speculation was incorrect • Typical solution: speculation allows only lowcost exceptional events (1st-level cache miss) • When expensive exceptional event occurs, (2nd-level cache miss or TLB miss) processor waits until the instruction causing event is no longer speculative before handling the event • Assuming single branch per cycle: future may speculate across multiple branches! 3/29/05 EENG449b/Savvides Lec 17.18 Limits to ILP • Conflicting studies of amount – Benchmarks (vectorized Fortran FP vs. integer C programs) – Hardware sophistication – Compiler sophistication • How much ILP is available using existing mechanisms with increasing HW budgets? • Do we need to invent new HW/SW mechanisms to keep on processor performance curve? – – – – 3/29/05 Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock Motorola AltaVec: 128 bit ints and FPs Supersparc Multimedia ops, etc. EENG449b/Savvides Lec 17.19 Limits to ILP Initial HW Model here; MIPS compilers. Assumptions for ideal/perfect machine to start: 1. Register renaming – infinite virtual registers => all register WAW & WAR hazards are avoided 2. Branch prediction – perfect; no mispredictions 3. Jump prediction – all jumps perfectly predicted 2 & 3 => machine with perfect speculation & an unbounded buffer of instructions available 4. Memory-address alias analysis – addresses are known & a store can be moved before a load provided addresses not equal Also: unlimited number of instructions issued/clock cycle; perfect caches; 1 cycle latency for all instructions (FP *,/); 3/29/05 EENG449b/Savvides Lec 17.20 Upper Limit to ILP: Ideal Machine (Figure 3.34, page 294) 160 150.1 FP: 75 - 150 140 Instruction Issues per cycle IPC 120 118.7 Integer: 18 - 60 100 75.2 80 62.6 60 54.8 40 17.9 20 0 gcc espresso li fpppp doducd tomcatv Programs 3/29/05 EENG449b/Savvides Lec 17.21 More Realistic HW: Branch Impact Figure 3.38, Page 300 60 Instruction issues per cycle IPC 50 Change from Infinite window to examine to 2000 and maximum issue of 64 instructions per clock cycle FP: 15 - 45 40 Integer: 6 - 12 30 20 10 0 gcc espresso li fpppp doducd tomcatv Program Perfect 3/29/05 Perfect Tournament Selective predictor Standard 2-bit BHT (512) Static Profile None EENG449b/Savvides No prediction Lec 17.22 More Realistic HW: Renaming Register Impact Figure 3.41, Page 304 FP: 11 - 45 70 Change 2000 instr window, 64 instr issue, 8K 2 level Prediction 60 IPC Instruction issues per cycle 50 40 Integer: 5 - 15 30 20 10 0 gcc espresso li fpppp doducd tomcatv Program Infinite 3/29/05 Infinite 256 256 128 128 64 32 64 None 32 None EENG449b/Savvides Lec 17.23 More Realistic HW: Memory Address Alias Impact 49 49 Figure 3.43, Page 306 50 Change 2000 instr window, 64 instr issue, 8K 2 level Prediction, 256 renaming registers 45 40 Instruction issues per cycle IPC 35 45 45 FP: 4 - 45 (Fortran, no heap) 30 25 Integer: 4 - 9 20 16 16 15 15 12 10 10 5 9 7 7 4 5 5 4 3 3 4 6 4 3 5 4 0 gcc espress o li fpppp doducd tomcatv Program Perfect Perfect 3/29/05 Global/stack Perfect Inspection Global/Stack perf; Inspec. heap conflicts Assem. None None EENG449b/Savvides Lec 17.24 Realistic HW for ‘00: Window Impact (Figure 3.45, Page 309) 60 IPC Instruction issues per cycle 50 Perfect disambiguation (HW), 1K Selective 52 Prediction, 16 entry return, 47 64 registers, issue as many as window 56 FP: 8 - 45 45 40 35 34 30 22 Integer: 6 - 12 20 15 15 10 10 10 10 9 13 12 12 11 11 10 8 8 6 4 6 3 17 16 14 9 6 4 22 2 15 14 12 9 8 4 9 7 5 4 3 3 6 3 3 0 gcc expresso li fpppp doducd tomcatv Program Infinite 3/29/05 256 128 Infinite 256 128 64 32 16 64 32 16 8 8 4 4 EENG449b/Savvides Lec 17.25 How to Exceed ILP Limits of this study? • WAR and WAW hazards through memory: eliminated WAW and WAR hazards through register renaming, but not in memory usage • Unnecessary dependences (compiler not unrolling loops so iteration variable dependence) • Overcoming the data flow limit: value prediction, predicting values and speculating on prediction – Address value prediction and speculation predicts addresses and speculates by reordering loads and stores; could provide better aliasing analysis, only need predict if addresses = 3/29/05 EENG449b/Savvides Lec 17.26 Workstation Microprocessors 3/2001 • Max Max Max Max Max 3/29/05 issue: 4 instructions (many CPUs) rename registers: 128 (Pentium 4) BHT: 4K x 9 (Alpha 21264B), 16Kx2 (Ultra III) Window Size (OOO): 126 intructions (Pent. 4) Pipeline: 22/24 stages (Pentium 4) Source: Microprocessor Report, www.MPRonline.com EENG449b/Savvides Lec 17.27 Conclusion • 1985-2000: 1000X performance – Moore’s Law transistors/chip => Moore’s Law for Performance/MPU • Hennessy: industry been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism and (real) Moore’s Law to get 1.55X/year – Caches, Pipelining, Superscalar, Branch Prediction, Out-of-order execution, … • ILP limits: To make performance progress in future need to have explicit parallelism from programmer vs. implicit parallelism of ILP exploited by compiler, HW? – Otherwise drop to old rate of 1.3X per year? – Less than 1.3X because of processor-memory performance gap? • Impact on you: if you care about performance, better think about explicitly parallel algorithms vs. rely on ILP? 3/29/05 EENG449b/Savvides Lec 17.28 Review: Dynamic Branch Prediction • Prediction becoming important part of scalar execution • Branch History Table: 2 bits for loop accuracy • Correlation: Recently executed branches correlated with next branch. – Either different branches – Or different executions of same branches 3/29/05 • Tournament Predictor: more resources to competitive solutions and pick between them • Branch Target Buffer: include branch address & prediction • Predicated Execution can reduce number of branches, number of mispredicted branches • Return address stack for prediction of indirect jump EENG449b/Savvides Lec 17.29 Review: Limits of ILP • 1985-2000: 1000X performance – Moore’s Law transistors/chip => Moore’s Law for Performance/MPU • Hennessy: industry been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism to get 1.55X/year – Caches, Pipelining, Superscalar, Branch Prediction, Out-of-order execution, … • ILP limits: To make performance progress in future need to have explicit parallelism from programmer vs. implicit parallelism of ILP exploited by compiler, HW? – Otherwise drop to old rate of 1.3X per year? – Less because of processor-memory performance gap? • Impact on you: if you care about performance, better think about explicitly parallel algorithms vs. rely on ILP? 3/29/05 EENG449b/Savvides Lec 17.30 Dynamic Scheduling in P6 (Pentium Pro, II, III) • Q: How pipeline 1 to 17 byte 80x86 instructions? • P6 doesn’t pipeline 80x86 instructions • P6 decode unit translates the Intel instructions into 72-bit micro-operations (~ MIPS) • Sends micro-operations to reorder buffer & reservation stations • Many instructions translate to 1 to 4 micro-operations • Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of microoperations • 14 clocks in total pipeline (~ 3 state machines) 3/29/05 EENG449b/Savvides Lec 17.31 Dynamic Scheduling in P6 Parameter 80x86 microops Max. instructions issued/clock 3 6 Max. instr. complete exec./clock 5 Max. instr. commited/clock 3 Window (Instrs in reorder buffer) 40 Number of reservations stations 20 Number of rename registers 40 No. integer functional units (FUs) 2 No. floating point FUs 1 No. SIMD Fl. Pt. FUs 1 No. memory Fus 1 load + 1 store 3/29/05 EENG449b/Savvides Lec 17.32 P6 Pipeline • 14 clocks in total (~3 state machines) • 8 stages are used for in-order instruction fetch, decode, and issue – Takes 1 clock cycle to determine length of 80x86 instructions + 2 more to create the micro-operations (uops) • 3 stages are used for out-of-order execution in one of 5 separate functional units • 3 stages are used for instruction commit Instr Fetch 16B /clk 3/29/05 16B Instr 6 uops Decode 3 Instr /clk Reserv. Reorder ExecuGraduStation Buffer tion ation Renaming units 3 uops 3 uops (5) /clk /clk EENG449b/Savvides Lec 17.33 P6 Block Diagram • IP = PC From: http://www.digitlife.com/articles/pentium4/ 3/29/05 EENG449b/Savvides Lec 17.34 Pentium III Die Photo • • • • • • • • • • • • • • • • • • 1st Pentium III, Katmai: 9.5 M transistors, 12.3 * 3/29/05 10.4 mm in 0.25-mi. with 5 layers of aluminum EBL/BBL - Bus logic, Front, Back MOB - Memory Order Buffer Packed FPU - MMX Fl. Pt. (SSE) IEU - Integer Execution Unit FAU - Fl. Pt. Arithmetic Unit MIU - Memory Interface Unit DCU - Data Cache Unit PMH - Page Miss Handler DTLB - Data TLB BAC - Branch Address Calculator RAT - Register Alias Table SIMD - Packed Fl. Pt. RS - Reservation Station BTB - Branch Target Buffer IFU - Instruction Fetch Unit (+I$) ID - Instruction Decode ROB - Reorder Buffer MS - Micro-instruction Sequencer EENG449b/Savvides Lec 17.35 P6 Performance: Stalls at decode stage I$ misses or lack of RS/Reorder buf. entry go m88ksim Instruction stream Resource capacity stalls gcc compress li ijpeg perl vortex tomcatv swim su2cor hydro2d mgrid applu turb3d apsi fpppp wave5 3/29/05 0 0.5 1 1.5 2 2.5 3 0.5 to 2.5 Stall cycles per instruction: 0.98 avg. (0.36 integer) EENG449b/Savvides Lec 17.36 P6 Performance: uops/x86 instr 200 MHz, 8KI$/8KD$/256KL2$, 66 MHz bus go m88ksim gcc compress li ijpeg perl vortex tomcatv swim su2cor hydro2d mgrid applu turb3d apsi fpppp wave5 1 3/29/05 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.2 to 1.6 uops per IA-32 instruction: 1.36 avg. (1.37 integer) EENG449b/Savvides Lec 17.37 P6 Performance: Branch Mispredict Rate go m88ksim gcc compress li ijpeg perl vortex tomcatv swim su2cor BTB miss frequency Mispredict frequency hydro2d mgrid applu turb3d apsi fpppp wave5 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 10% to 40% Miss/Mispredict ratio: 20% avg. (29% integer) 3/29/05 EENG449b/Savvides Lec 17.38 P6 Performance: Speculation rate (% instructions issued that do not commit) go m88ksim gcc compress li ijpeg perl vortex tomcatv swim su2cor hydro2d mgrid applu turb3d apsi fpppp wave5 0% 3/29/05 10% 20% 30% 40% 50% 60% 1% to 60% instructions do not commit: 20% avg (30% integer)EENG449b/Savvides Lec 17.39 P6 Performance: Cache Misses/1k instr go m88ksim gcc L1 Instruction L1 Data L2 compress li ijpeg perl vortex tomcatv swim su2cor hydro2d mgrid applu turb3d apsi fpppp wave5 0 20 40 60 80 100 120 140 160 10 to 160 Misses per Thousand Instructions: 49 avg (30 integer) 3/29/05 EENG449b/Savvides Lec 17.40 P6 Performance: uops commit/clock go m88ksim gcc compress li ijpeg perl 0 uops commit 1 uop commits 2 uops commit 3 uops commit vortex tomcatv swim su2cor hydro2d mgrid Average 0: 55% 1: 13% 2: 8% 3: 23% applu turb3d apsi fpppp wave5 0% 3/29/05 20% 40% 60% 80% Integer 0: 40% 1: 21% 2: 12% 3: 27% 100% EENG449b/Savvides Lec 17.41 P6 Dynamic Benefit? Sum of parts CPI vs. Actual CPI go m88ksim gcc compress li ijpeg uops Instruction cache stalls Resource capacity stalls Branch mispredict penalty Data Cache Stalls perl vortex tomcatv swim su2cor hydro2d mgrid applu Actual CPI Ratio of sum of parts vs. actual CPI: 1.38X avg. (1.29X integer) turb3d apsi fpppp wave5 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 0.8 to 3.8 Clock cycles per instruction: 1.68 avg (1.16 integer) 3/29/05 EENG449b/Savvides Lec 17.42 AMD Althon • Similar to P6 microarchitecture (Pentium III), but more resources • Transistors: PIII 24M v. Althon 37M • Die Size: 106 mm2 v. 117 mm2 • Power: 30W v. 76W • Cache: 16K/16K/256K v. 64K/64K/256K • Window size: 40 vs. 72 uops • Rename registers: 40 v. 36 int +36 Fl. Pt. • BTB: 512 x 2 v. 4096 x 2 • Pipeline: 10-12 stages v. 9-11 stages • Clock rate: 1.0 GHz v. 1.2 GHz • Memory bandwidth: 1.06 GB/s v. 2.12 GB/s 3/29/05 EENG449b/Savvides Lec 17.43 Pentium 4 • Still translate from 80x86 to micro-ops • P4 has better branch predictor, more FUs • Instruction Cache holds micro-operations vs. 80x86 instructions – no decode stages of 80x86 on cache hit – called “trace cache” (TC) • Faster memory bus: 400 MHz v. 133 MHz • Caches – Pentium III: L1I 16KB, L1D 16KB, L2 256 KB – Pentium 4: L1I 12K uops, L1D 8 KB, L2 256 KB – Block size: PIII 32B v. P4 128B; 128 v. 256 bits/clock • Clock rates: – Pentium III 1 GHz v. Pentium IV 1.5 GHz – 14 stage pipeline vs. 24 stage pipeline 3/29/05 EENG449b/Savvides Lec 17.44 Pentium 4 features • Multimedia instructions 128 bits wide vs. 64 bits wide => 144 new instructions – When used by programs?? – Faster Floating Point: execute 2 64-bit Fl. Pt. Per clock – Memory FU: 1 128-bit load, 1 128-store /clock to MMX regs • Using RAMBUS DRAM – Bandwidth faster, latency same as SDRAM – Cost 2X-3X vs. SDRAM • • • • 3/29/05 ALUs operate at 2X clock rate for many ops Pipeline doesn’t stall at this clock rate: uops replay Rename registers: 40 vs. 128; Window: 40 v. 126 BTB: 512 vs. 4096 entries (Intel: 1/3 improvement) EENG449b/Savvides Lec 17.45 Pentium, Pentium Pro, Pentium 4 Pipeline • Pentium (P5) = 5 stages Pentium Pro, II, III (P6) = 10 stages (1 cycle ex) Pentium 4 (NetBurst) = 20 stages (no decode) From “Pentium 4 (Partially) Previewed,” Microprocessor Report, 8/28/00 3/29/05 EENG449b/Savvides Lec 17.46 Block Diagram of Pentium 4 Microarchitecture • BTB = Branch Target Buffer (branch predictor) • I-TLB = Instruction TLB, Trace Cache = Instruction cache • RF = Register File; AGU = Address Generation Unit • "Double pumped ALU" means ALU clock rate 2X => 2X ALU F.U.s From “Pentium 4 (Partially) Previewed,” Microprocessor Report, 8/28/00 3/29/05 EENG449b/Savvides Lec 17.47 Pentium 4 Die Photo • 42M Xtors – PIII: 26M • 217 mm2 – PIII: 106 mm2 • L1 Execution Cache – Buffer 12,000 Micro-Ops • 8KB data cache • 256KB L2$ 3/29/05 EENG449b/Savvides Lec 17.48 Benchmarks: Pentium 4 v. PIII v. Althon • SPECbase2000 – Int, [email protected] GHz: 524, PIII @1GHz: 454, AMD [email protected]:? – FP, [email protected] GHz: 549, PIII @1GHz: 329, AMD [email protected]:304 • WorldBench 2000 benchmark (business) PC World magazine, Nov. 20, 2000 (bigger is better) – P4 : 164, PIII : 167, AMD Althon: 180 • • • • Quake 3 Arena: P4 172, Althon 151 SYSmark 2000 composite: P4 209, Althon 221 Office productivity: P4 197, Althon 209 S.F. Chronicle 11/20/00: "… the challenge for AMD now will be to argue that frequency is not the most important thing-- precisely the position Intel has argued while its Pentium III lagged behind the Athlon in clock speed." 3/29/05 EENG449b/Savvides Lec 17.49 Why? • • • • Instruction count is the same for x86 Clock rates: P4 > Althon > PIII How can P4 be slower? Time = Instruction count x CPI x 1/Clock rate • Average Clocks Per Instruction (CPI) of P4 must be worse than Althon, PIII • Will CPI ever get < 1.0 for real programs? 3/29/05 EENG449b/Savvides Lec 17.50 Simultaneous Multithreading (SMT) • Simultaneous multithreading (SMT): insight that dynamically scheduled processor already has many HW mechanisms to support multithreading – large set of virtual registers that can be used to hold the register sets of independent threads (assuming separate renaming tables are kept for each thread) – out-of-order completion allows the threads to execute out of order, and get better utilization of the HW Source: Micrprocessor Report, December 6, 1999 “Compaq Chooses SMT for Alpha” 3/29/05 EENG449b/Savvides Lec 17.51 SMT is coming • Just adding a per thread renaming table and keeping separate PCs – Independent commitment can be supported by logically keeping a separate reorder buffer for each thread • Compaq has announced it for future Alpha microprocessor: 21464 in 2003; others likely On a multiprogramming workload comprising a mixture of SPECint95 and SPECfp95 benchmarks, Compaq claims the SMT it simulated achieves a 2.25X higher throughput with 4 simultaneous threads than with just 1 thread. For parallel programs, 4 threads 1.75X v. 1 3/29/05 Source: Micrprocessor Report, December 6, 1999 “Compaq Chooses SMT for Alpha” EENG449b/Savvides Lec 17.52

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download www.eng.yale.edu