Download PowerPoint Slide File

Performance Tuning John Black CS 425 UNR, Fall 2000 Why Go Fast? • • • • • Real Time Systems Solve Bigger Problems Less waiting time Better Simulations, Video Games, etc. Moore’s Law: Every 18 months CPU speeds double for constant cost – Has held true for last 30 years! How do we go Faster? • Find the Bottlenecks! – Hardware problems – Bad algorithms/data structures – I/O bound (disk, network, peripherals) • If none of these, we have to hand tune Hand Tuning • Use a “profiler” – 80% of time is spent in 20% of the code – Find this 20% and tune by hand – Do NOT waste time tuning code which is not a bottleneck • How can we hand-tune? Hand Tuning (cont.) • Exploit architecture-dependent features (of course this approach limits portability) • We focus on the Pentium family – Memory System – Pipeline Memory System • Modern processors have a hierarchical memory system – Memory units nearer the processor are faster but smaller Registers About 40 32-bit registers Level 1 Cache Level 2 Cache Main Memory 32K P3 has 256K or 512K Common Memory-Related Bottlenecks • Alignment: the requirement that an accessed object lie at an address which is a multiple of 16 or 32 or 64, etc. • For Pentium Pro, P2, P3, there is no penalty for misaligned data (unless you cross a cache line boundary) – Cache lines are 32 byte cells in the caches – We always read 32-bytes at a time Locality • Since we fetch 32 bytes at a time, accessing memory sequentially (or in a small neighborhood) is efficient – This is called “spatial locality” • Since cache contents are aged (with an LRU algorithm) and eventually kicked out, accessing items repeatedly is efficient – This is called “temporal locality” Digression on Big-Oh • We learn in algorithms class that we are concerned only with the asymptotic running time on a RAM machine – With new architectures this may no longer be a good way to measure performance – An O(n lg n) algorithm may be FASTER than an O(n) algorithm due to architectural considerations Pipelining • Modern processors execute many instructions in parallel to increase speed Decode Get Args Execute Retire An instruction may be getting decoded at the same time as another is getting its arguments and another is executing; this parallelism greatly speeds up the processor Is a Pentium RISC or CISC? • The Pentium instruction set is CISC (Complex Instruction Set Computer) but it actually translates into micro-ops and runs on a RISC (Reduced Instruction Set Computer) under the sheets • The RISC instructions (micro-ops) are what is actually pipelined How does this Affect Tuning? • If the pipeline is not fully utilized, we can improve performance by fixing this – Reorder instructions to reduce dependencies and “pipeline stalls” – Avoid mispredicted branches with loop unrolling – Avoid function calls with inlining The Downside • All this tuning makes code harder to write, maintain, debug, port, etc. • Assembly language may be required, which has all the above drawbacks

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download PowerPoint Slide File