Download chap1-2

CEN 316 Computer Organization and Design Assessing and Understanding Performance Mansour AL Zuair Chapter 2: Performance • • • • • Performance: – What is it: measures of performance The CPU performance equation: – Execution time as the measure – What affects execution time – Examples Popular alternative metrics – Why they don’t work Benchmarks Amdahl's law CEN316 - Chapter 1-2 2 Performance is Time Plane DC to Paris Speed Passengers Throughput (PMPH) Boeing 747 6.5 hours 610 MPH 470 286,700 BAD/Sud Concorde 3 hours 1350 MPH 132 178,200 • Time to do the task (execution time) – Execution time, response time, latency • Tasks per unit time (sec, minute, ...) – Throughput, bandwidth CEN316 - Chapter 1-2 3 Throughput and Response Time • Example – What is the effect of the following changes on throughput and response time? » Increasing processor speed? » Increasing the number of processors on the same system (multitask)? » Is there any relation between response time and throughput? » What about queuing? CEN316 - Chapter 1-2 4 Performance as Response Time • Performance is most often measured as response time or execution time for some task. • “X is n times faster than Y” means Performanc e(X) Execution Time(Y)  n Performanc e(Y) Execution Time(X) • Example – Execution time of program P » X is 5 sec » Y is 10 sec. – X is 2 times faster than Y. CEN316 - Chapter 1-2 5 What Time to Measure? • Elapsed time, wall-clock time: – Actual time from start to completion. – Depends on CPU, system, I/O, etc. – Often used in real benchmarks. – Only suitable choice when I/O is included. • CPU time: – Measure/analyze CPU performance only. – May be suitable when machine is timeshared. – Possibly both user and system component. – User CPU time is our focus for first part of course. • Elapsed time = CPU time + idle time. – Usually and assuming time is accurately accounted for. CEN316 - Chapter 1-2 6 Metrics of performance • Different performance metrics are appropriate at different levels: Application Answers per month Operations per second Programming Language Compiler ISA Datapath Control Function Units (millions) of Instructions per second – MIPS (millions) of (F.P.) operations per second – MFLOP/s Instruction Set Architecture Cycles per Instruction Cycles per second (clock rate) Transistors CEN316 - Chapter 1-2 7 Relating Processor Metrics • CPU execution time per program = CPU clock cycles/program  Clock cycle time = CPU clock cycles/program ÷ Clock rate (frequency) • CPU clock cycles/program = Instructions/program  Clock cycles Per Instruction • Clock cycles Per Instruction (CPI) is an average measurement, it depends on : – ISA, the implementation, and the program measured – CPI = CPU clock cycles/program ÷ Instructions/program – Also, Instructions per clock cycle or IPC = 1 / CPI • CPU execution time = Instructions  CPI  Clock cycle CEN316 - Chapter 1-2 8 Aspects of CPU Performance seconds instructio ns cycles seconds CPU time     program program instructio n cycle • • Instead of reporting execution time in seconds, we often use cycles Clock “ticks” indicate when to start activities (one abstraction): time • • cycle time = time between ticks = seconds per cycle clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec) A 200 Mhz. clock has a 1 200  106  109  5 nanosecond s cycle time CEN316 - Chapter 1-2 9 Example • Two machines A and B with the same ISA, • A has a clock cycle time of 1ns and a CPI=2 for certain program. • B has clock cycle time of 2ns and a CPI=1.2 for the same program. • Which machine is faster and by how much? • CPU time (A) = I x 2 x 1 ns • CPU time (B) = I x 1.2 x 2 ns CPU performanc e (A)  1.2 CPU performanc e (B) CEN316 - Chapter 1-2 10 Example of Improving performance Given the following Information – program run in 10 seconds on computer A, 4 GH clock – Build a computer that runs the same program in 6 seconds – Assumptions: » Clock rate can be increased substantially » Increasing clock rate increases the clock cycles for this program by 20% What is the target clock rate? CPU time A  10 seconds  CPU Clock Cycles Clock rate A CPU Clock Cycles cycles 4 x 109 seconds A A CPU Clock Cycles A  40 x 109 cycles CPU time B  1.2xCPU Clock Cycles Clock rate B A 1.2x40x10 9 cycles A 6 seconds  Clock rate B 1.2x40 x 109 cycles clock rate B   8GH 6 sec onds CEN316 - Chapter 1-2 11 Organizational Trade-offs • Instruction set (and hence Instruction Count), CPI, and Clock cycle time interact in complex ways: Application Programming Language Compiler ISA Instruction Mix Datapath CPI Function Units Transistors Cycle Time CEN316 - Chapter 1-2 12 Cycles Per Instruction (CPI) • CPI = average number of clock cycles per instruction – CPI = Clock Cycles / Instruction Count • Affected by both : – cost for each instruction type – the frequency of different instructions » called the instruction mix • Useful way to compute CPI (for n instruction types): n CPI   CPIi  Fi where Fi  i 1 Instructio n Count i Instructio n Count CEN316 - Chapter 1-2 13 Example comparing code segments Given the following Info from HW designer Instruction class CPI A B C 1 2 3 Code Sequence 1 2 Instruction count for instruction class A B C 2 4 1 1 2 1 Number of instructions? 5 for sequence 1 and 6 For sequence 2 Which one is faster? CPU clock cycles1 = (2x1)+ (1x2) + (2x3)=10 cycles CPU clock cycles2 = 9 cycles What is the CPI for each sequence? CPI1 = 10/5 =2 CPI2 = 9/6=1.5 CEN316 - Chapter 1-2 14 Example CPI Computation • RISC processor: register-register ISA: Instruction Type Frequency Fi Cycles CPIi CPI Contr Fi  CPIi Time % This Instr ALU Load Store Branch 50% 20% 10% 20% 1 2 2 2 0.5 0.4 0.2 0.4 33% 27% 13% 27% Typical Mix CPI = 1.5 CEN316 - Chapter 1-2 15 Using the CPU Performance Equation • Example: Consider adding ALU instructions that can have one memory operand to the MIPS ISA to produce MIPSE: – MIPSE = MIPS + ALU instrs with a memory operand. – Initial mix and cycle counts on MIPS: Instr class Load Store ALU op Branch Overall CPI Freq 30% 15% 40% 15% Cycles 2 2 1 2 MIPS CPI 0.6 0.3 0.4 0.3 1.6 – Assume: » CPI of the MIPSE instruction ALU-with-memory-instruction is 2 » Clock cycle 1.25 times the MIPS clock cycle » One half of the load instructions and a corresponding number of ALU instructions are replaced by ALU-with-memory – Which machine is faster? CEN316 - Chapter 1-2 16 Solution • Normalize mix to 100 instructions – can be easier to calculate and enhance intuition Cycles ALU 1 Load 2 Store 2 Branch 2 ALU+mem 2 Total MIPS Instruc Cycle Count Count 40 40 30 60 15 30 15 30 0 0 100 160 Instruc Delta -15 -15 15 MIPSE Instruc Cycle Count Count 25 25 15 30 15 30 15 30 15 30 85 145 MIPS execution time = 160 cycles  CCMIPS MIPSE execution time = 145 cycles  (1.25X CCMIPS) MIPSE takes ((145  1.25) / 160) times as long as MIPS MIPSE is 1.13  performance of MIPS CEN316 - Chapter 1-2 17 Alternative Performance Metrics: MIPS • Use something other than time – Often good intention to find a simple metric » bigger is better, general measure, summarizes performance • Most common metric: MIPS (Millions of Instructions Per Second): MIPS  Instructio n Count Clock Rate  6 Time 10 CPI 106 • Flaws in using MIPS – Machines with different instruction sets ? – Programs with different instruction mixes ? » dynamic frequency of instructions – Can vary inversely with performance! CEN316 - Chapter 1-2 18 Example of Problems • Consider an optimized and unoptimized version of the same program: Memory Instructions ALU Instructions Branch Instruction FP Instruction Total Instructions Unoptimize d Program 100M 100M 30M 40M 270M Optimized Program 50M 50M 30M 40M 170M CPI • Here are cycle counts for instructions Memory Cycles ALU Cycles Branch Cycles FP Cycles Per Instruction 2 1 3 5 Unoptimize d Program 200M 100M 90M 200M 2.2 Optimized Program 100M 50M 90M 200M 2.6 CEN316 - Chapter 1-2 19 Example continued • Assuming a 200 MHz clock: – MIPSunoptimized = 200/2.2 = 91 – MIPSoptimized = 200/2.6 = 77 – Performanceunoptimized > Performanceoptimized • But look at Execution time! – Execution timeunoptimized = CPI  IC / CR = 2.2  270 / 200 = 3s – Execution timeoptimized = CPI  IC / CR = 2.6  170 / 200 = 2.2s – Performanceoptimized > Performanceunoptimized • MIPS measurement is inverse of reality! CEN316 - Chapter 1-2 20 Another Alternative: MFLOPS • MFLOPS (Millions of FLoating Operations per Second): – common metric in scientific/engineering and supercomputer arenas – MFLOPS = Floating point Operations Time X 106 – Machine dependent: what is a floating point op? – often not where time is spent (i.e. not in FP operations) – at best, no better than execution time – at worst, much less informative and more deceptive CEN316 - Chapter 1-2 21 What is benchmarks? • Benchmarks – a set of programs that form a “workload” specifically chosen to measure performance • SPEC (System Performance Evaluation Cooperative) creates standard sets of benchmarks starting with SPEC89. The latest is SPEC CPU2006 which consists of 12 integer benchmarks (CINT2006) and 17 floating-point benchmarks (CFP2006). www.spec.org • There are also benchmark collections for power workloads (SPECpower_ssj2008), for mail workloads (SPECmail2008), for multimedia workloads (mediabench), … CEN316 - Chapter 1-2 22 Comparing and Summarizing Performance  How do we summarize the performance for benchmark set with a single number? l l First the execution times are normalized giving the “SPEC ratio” (bigger is faster, i.e., SPEC ratio is the inverse of execution time) The SPEC ratios are then “averaged” using the geometric mean (GM) GM = n n  SPEC ratioi i=1 • Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.)) CEN316 - Chapter 1-2 23 Amdahl's Law • Handy for evaluating impact of a change not tied to CPU performance equation • Insight: No improvement of a feature enhances performance by more than the use of the feature. • Suppose that enhancement E accelerates fraction F of a program by a factor S (remainder of the task is unaffected):   F  Execution Time enhanced   1  F      Execution Time unenhanced  S   F 1–F S= CEN316 - Chapter 1-2 24 Example on Amdahl's Law • Assume a program runs in 100 seconds on a machine whre multiply operations consumes 80 seconds of this time. How much do we have to improve the speed of multiplication if we want to run the program 5 times faster? • Solution: – Execution time after improvement = Execution time affected by improvement + Execution time unaffected A mount of improvement – 20 seconds = (80/n) + 20 seconds – What is the value of n??? CEN316 - Chapter 1-2 25 Summary : Performance • Time is the measure of computer performance! – Performance equation includes three parts; all three together determine performance • Good products created when have: – Good benchmarks – Good ways to summarize performance • Will need different performance metrics as well as a different set of applications to benchmark embedded and desktop computers, which are more focused on response time, versus servers, which are more focused on throughput • Remember Amdahl’s Law: Speedup is limited by unimproved part of program CEN316 - Chapter 1-2 26

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download chap1-2