Performance

Performance: what is it? Measures of performance
The CPU Performance Equation:
– Execution time as the measure
– What affects execution time
– Examples
Choosing good benchmarks?
– Choosing bad benchmarks?
Amdahl's Law

Datorteknik PerformanceAnalyse bild 1

Performance is Time

Time to do the task (Execution Time)
– execution time, response time, latency
Tasks per unit time (sec, minute, ...)
– throughput, bandwidth

Performance as Response Time

Performance is most often measured as response time or execution time for some task.
“X is n times faster than Y” means
Performance(X) / Performance(Y) = Execution Time(Y) / Execution Time(X) = n
Example: Execution time of program P on X is 5 sec; on Y it is 10 sec. X is 2 times faster than Y.

What time to measure?

Elapsed time, wall-clock time:
– actual time from start to completion
– depends on CPU, system, I/O, etc.
– often used in real benchmarks
– only suitable choice when I/O is included
CPU time:
– measure/analyze CPU performance only
– may be suitable when machine is timeshared
– possibly both user and system component
– User CPU time is our focus for the first part of the course
Elapsed time = CPU time + Idle time
– usually, and assuming time is accurately accounted for

Metrics of performance

Different performance metrics are appropriate at different levels.
Levels: Application, Programming Language, Compiler, ISA, Datapath, Control, Function Units, Transistors
Metrics:
– Frames per second
– Operations per second
– (Millions of) instructions per second: MIPS
– (Millions of) floating-point operations per second: MFLOP/s
– Cycles per second (clock rate)
– Cycles per instruction

Relating Processor Metrics

CPU execution time per program = CPU clock cycles/program × Clock cycle time
                               = CPU clock cycles/program ÷ Clock rate (frequency)
CPU clock cycles/program = Instructions/program × Clock cycles per instruction
Clock cycles per instruction (CPI) is an average measurement. It depends on:
– the ISA, the implementation, and the program measured
– CPI = CPU clock cycles/program ÷ Instructions/program
– Also, instructions per clock cycle: IPC = 1 / CPI
CPU execution time = Instructions × CPI × Clock cycle time

Let’s look at the single-cycle model analytically

Static timing analysis

Component delays:
– Memories: 10 ns
– Register: 5 ns
– Adders: 10 ns
– ALU: 10 ns
Use topological sort!

[Figure: single-cycle datapath annotated with component delays. Path for lw $2 const($3): 35 ns delay.]

But that path goes through the data memory!
What if this is not a load/store?
How about an instruction that does nothing? “NOP”

[Figure: same datapath. Path for NOP: 10 ns delay.]

[Figure: same datapath. Path for Add $ra $rb $rc: 25 ns delay.]

[Figure: same datapath. Path for B label: 20 ns delay.]

35 ns for load/store but 10 ns for NOP!?
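The per-instruction delays quoted on the preceding slides (35, 25, 20 and 10 ns) can be reproduced by summing component delays along each instruction's path. The sketch below is only an illustration: the component delays come from the static timing slide, but the decomposition of each path in `PATHS` is an assumption chosen to match the slide totals (the branch path in particular could be broken down differently), and all names are hypothetical.

```python
# Component delays in ns, as listed on the static timing analysis slide.
DELAY_NS = {"memory": 10, "register": 5, "adder": 10, "alu": 10}

# ASSUMPTION: which components lie on each instruction's longest path.
# These decompositions are not given in the slides; they are picked so that
# the sums agree with the delays shown in the figures.
PATHS = {
    "lw":  ["memory", "register", "alu", "memory"],  # fetch, reg read, address, data memory
    "add": ["memory", "register", "alu"],            # fetch, reg read, compute
    "b":   ["memory", "adder"],                      # fetch, branch target adder
    "nop": ["memory"],                               # fetch only
}


def path_delay(components):
    """Sum the delays of the components along one path."""
    return sum(DELAY_NS[c] for c in components)


delays = {instr: path_delay(path) for instr, path in PATHS.items()}

# In the single-cycle model every instruction takes one cycle, so the cycle
# time must cover the slowest instruction.
single_cycle_tc = max(delays.values())

for instr, d in delays.items():
    print(f"{instr}: {d} ns")
print(f"single-cycle Tc: {single_cycle_tc} ns")
```

A full static timing analysis would find the longest path through the whole component graph, visiting components in topological order; the fixed path lists here stand in for that search.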
Amdahl’s Law: “Make the common case fast”

Amdahl's Law

Handy for evaluating the impact of a change; not tied to the CPU performance equation.
Insight: no improvement of a feature enhances performance by more than the use of the feature.
Suppose that enhancement E accelerates fraction F of a program by a factor S (the remainder of the task is unaffected):
ExecTime_E = (1 – F(1 – 1/S)) × ExecTime_without_E
           = ((1 – F) + F/S) × ExecTime_without_E
Speedup = ExecTime_without_E / ExecTime_E = 1 / ((1 – F) + F/S)
[Figure: execution time bar split into F and 1 – F before the enhancement, and into F/S and 1 – F after it.]

What if we don’t need the ALU? A branch instruction?

BUT! The single-cycle model has to accommodate the slowest instruction, even if it rarely occurs!

How much work can our structure perform?

For a program Q:
Time = Number of executed instructions × Number of cycles per instruction × Time per cycle
T = Nq * CPI * Tc

For the single-cycle model...

CPI = 1 for all instructions
Tc is determined by the slowest instruction

How to reduce T?

T = Nq * CPI * Tc
Reduce Nq: more powerful instructions!
More hardware, longer paths, cycle time goes up (slower machine)

“No free lunch”

Why designers are so well paid to optimize designs.

How to reduce T?

T = Nq * CPI * Tc
Reduce Tc: faster hardware
– Technological limits
– Cost increase is not linearly related
– Sales volume drops

How to reduce T?

T = Nq * CPI * Tc
Make CPI a function of the instruction
For example: NOP = 1 cycle, LW = 4 cycles
Chapter 5.4, the classical method

How to reduce T?

T = Nq * CPI * Tc
Make CPI a function of the instruction
– CPI goes up, but we can use an average, not the worst case
– Tc goes down: it is the time to do the longest step, not the entire instruction

Example

Branch:
– Step 1: fetch
– Step 2: new PC
Add:
– Step 1: fetch
– Step 2: decode / register fetch
– Step 3: compute and write back

Example

LW = 4 steps
Cycle time = 1/4 old time
T_LW = 4 * 1/4 old time (CPI = 4): just as slow for the lw instruction, our worst case!

But that’s not important if LW is not common!

T = Nq * CPI * 1/4 old time
CPI is averaged over the Nq executed instructions: 1.3? 1.7? Never = 4.0!

We win because of quantitative statistical properties of our programs!

What value of CPI do we use?

1.3? 1.5? 1.7?
Easy: use the average program!?

There is no such thing!

Artificial “average programs” called “benchmarks”

Are they something to trust?
What about “peak performance values”? MIPS? MFLOPS?
We have a peak at CPI = 1...
...a program of only NO-OPs!

Why Do Benchmarks?

How we evaluate performance differences
– Across and within a single system (design & variations)
What should benchmarks do?
– Represent a large class of important programs
– Behave like typical programs: improved benchmark performance => improved performance broadly
For better or worse, benchmarks shape a field
Good ones accelerate progress
Bad benchmarks hurt progress
– help real programs vs. sell machines/papers?
– Enhancements that help benchmarks may not help most programs, and vice versa.
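Amdahl's Law from earlier in the deck gives one way to see why an enhancement that helps a benchmark may not help most programs: the gain depends on the fraction F of execution time that uses the enhanced feature. The following is a minimal sketch under invented numbers; the function names and the values of F and S are hypothetical and are not taken from the slides.

```python
def exec_time_ratio(f, s):
    """Return ExecTime_E / ExecTime_without_E = (1 - F) + F/S.

    f: fraction of the original execution time that the enhancement affects.
    s: factor by which that fraction is accelerated.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return (1.0 - f) + f / s


def overall_speedup(f, s):
    """Overall speedup of the whole program: 1 / ((1 - F) + F/S)."""
    return 1.0 / exec_time_ratio(f, s)


# Hypothetical values: the same enhancement (S = 4) applied to a benchmark
# that spends 80% of its time in the enhanced feature, and to a program
# that spends only 20% of its time there.
S = 4.0
benchmark_speedup = overall_speedup(0.8, S)
typical_speedup = overall_speedup(0.2, S)

print(f"benchmark: {benchmark_speedup:.2f}x")
print(f"typical program: {typical_speedup:.2f}x")
```

Even with an arbitrarily large S, the overall speedup stays below 1 / (1 – F), which is the insight stated on the Amdahl's Law slide.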
Classes of Benchmarks

(Toy) Benchmarks
– 10-100 lines
– e.g., sieve, puzzle, quicksort
– good first programming assignments
Synthetic Benchmarks
– attempt to match average frequencies of real workloads
– e.g., Whetstone, Dhrystone
– mostly good for nothing: too artificial
Kernels
– Time-critical excerpts of real programs
– e.g., Livermore loops, Linpack
– good for micro-performance studies
Real programs
– e.g., gcc, spice, Verilog, Database, stock trading

Successful Benchmark: SPEC

Collection
1987: RISC industry (workstations) mired in “bench marketing”:
– (“That is an 8 MIPS machine, but they claim 10 MIPS!”)
EE Times + 5 companies band together to form the Systems Performance Evaluation Committee (SPEC) in 1988:
– Sun, MIPS, HP, Apollo, DEC
Create a standard list of programs, inputs, and reporting rules:
– several real programs, including OS calls
– some I/O
– rules for running and reporting

Multiple clock cycle designs:

State machines
Microprogramming
Chapter 5.4, “Computer Organization & Design”

How to reduce T?

T = Nq * CPI * Tc
Reduce the quotient cycles / instruction:
– Reduce “cycles”: multiple clock cycle design
– Increase “instruction”: execute more than one instruction per cycle!

More than one instruction per cycle?

Parallelism
– Div/mult + floating point + integer
Superscalarity
– Multiple issue etc.
Pipelining
– Of general importance
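To make the effect of a multiple clock cycle design on T = Nq * CPI * Tc concrete, the sketch below computes the average CPI for two instruction mixes. The cycle counts per instruction follow the earlier examples in the deck (NOP = 1, branch = 2 steps, add = 3 steps, LW = 4 steps), and the cycle time of 1/4 of the old time is the idealised figure used there. The instruction mixes and all names are invented for illustration only.

```python
# Cycles per instruction class in the multi-cycle design, following the
# examples in the slides: NOP = 1, branch = 2, add = 3, LW = 4.
CYCLES = {"nop": 1, "b": 2, "add": 3, "lw": 4}


def average_cpi(mix):
    """Average CPI for a dynamic instruction mix given as executed counts."""
    total_instructions = sum(mix.values())
    total_cycles = sum(CYCLES[instr] * count for instr, count in mix.items())
    return total_cycles / total_instructions


def relative_time(mix):
    """Multi-cycle execution time divided by single-cycle execution time.

    Single-cycle: T = Nq * 1 * old_time.
    Multi-cycle:  T = Nq * average CPI * old_time / 4 (idealised Tc).
    Nq and old_time cancel, leaving average CPI / 4.
    """
    return average_cpi(mix) / 4


# ASSUMPTION: hypothetical mixes (executed instruction counts), not measured data.
mix_compute = {"lw": 20, "add": 50, "b": 20, "nop": 10}
mix_load_heavy = {"lw": 60, "add": 30, "b": 10, "nop": 0}

for name, mix in [("compute", mix_compute), ("load-heavy", mix_load_heavy)]:
    print(f"{name}: average CPI = {average_cpi(mix):.2f}, "
          f"time relative to single-cycle = {relative_time(mix):.2f}")
```

The two mixes give different average CPI values for the same hardware, and both stay below the worst case of 4, so the multi-cycle design wins whenever the program is not made up of loads alone.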