CSC 480 - Multiprocessor Programming, Spring, 2012
Chapter 11 – Performance and Scalability
Dr. Dale E. Parson, week 12

Multithreading Overhead
• Multithreading always introduces overhead:
  • Costly synchronization code even without contention.
  • Synchronization delays when contention occurs.
  • Hardware synchronization (flushes at barriers).
  • Increased context switching.
  • Thread scheduling.
  • Thread creation and destruction.
• How do we keep the hardware busy?
• What are the bottlenecks?

Scalability
• The ability to improve throughput or capacity when adding computing resources -- CPUs, memory, long-term storage, or I/O bandwidth.
• The bottleneck moves somewhere else.
• Throughput measures total data processed.
• Latency measures delays in event responses.
• Sometimes the two trade against each other.
• Use a better algorithm.

Avoid premature tradeoffs
• Make it correct, then fast.
• An initial correct solution, even if slower, helps you think through the problem, gives you reference output to check against when you improve the code, and gives you some reusable code.
• It is like learning how to drive!
• Know your code libraries.

Amdahl's Law
• Speedup <= 1 / (F + (1 - F) / N)
• F is the fraction of the computation that is serial.
• N is the number of processors.
• (A worked sketch of this bound follows these slides.)
• Sources of serialization:
  • Intrinsic to the algorithm (e.g., partitioning in quicksort).
  • Synchronization (e.g., CyclicBarrier in penny – dime).
  • Access to shared data structures such as work queues.
  • Task submission, result processing, join.
  • Hardware serialization (e.g., memory, I/O bandwidth).

Context switching
• Operating system calls and complex JVM code.
• Cache and page faults to load and flush a thread's working set (of data) on a switch.
• I/O-bound threads block more often than CPU-bound threads, resulting in more context switches.
• High kernel usage (> 10%) indicates heavy scheduling activity, which may be caused by frequent blocking due to I/O or lock contention.

Memory synchronization
• Memory barriers:
  • Flush and invalidate registers, data caches, and write buffers; stall execution pipelines.
• Uncontended synchronization overhead is low.
• The JVM run-time compiler can use escape analysis to identify when a "locked" object does not escape its thread, removing the lock.
• Compiler lock coarsening can merge adjacent requests for the same lock. (See the lock-elision sketch after these slides.)

Memory sync and blocking
• Lock elision on locks to thread-confined objects is in the IBM JVM and slated for HotSpot 7.
• Synchronization creates traffic on memory buses.
• A clear, explicit Java Memory Model creates opportunities for compiler optimizations.
• Blocking overhead:
  • Spin waiting.
  • O.S. system calls and context-switching overhead.

Reducing lock contention
• Reduce the duration during which locks are held.
• Reduce the frequency of lock requests.
• Replace exclusive locks with coordination mechanisms that permit greater concurrency.
• Prefer a dataflow architecture when possible.
• Lock splitting for independent sets of shared variables guarded by a single lock (sketch below).
• Lock striping reduces the overhead of excess locks (sketch below).

Hot fields, exclusive locks
• A hot field is a single field, such as a counter, that is accessed by many threads.
• Parallelize data structures when possible.
• Merge them in a subset of threads as needed.
• Cache merged results if invalidating the cache is fast.
• Use alternatives to exclusive locks (sketch below):
  • ReadWriteLock
  • Atomic variables and volatiles
  • Queues and library classes with reduced or split locks.
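To make the Amdahl's Law bound concrete, here is a minimal sketch; the serial fraction F = 0.10 and the processor counts are illustrative values, not figures from the course:

```java
// Amdahl's Law: Speedup <= 1 / (F + (1 - F) / N)
public class AmdahlDemo {
    static double bound(double f, int n) {
        return 1.0 / (f + (1.0 - f) / n);
    }

    public static void main(String[] args) {
        double f = 0.10;                    // illustrative: 10% of the work is serial
        for (int n = 2; n <= 16; n *= 2) {
            System.out.printf("N=%2d  speedup <= %.2f%n", n, bound(f, n));
        }
        // As N grows without bound, the limit is 1/F = 10x:
        // no number of processors can beat the serial fraction.
    }
}
```

Even with 16 processors the bound is only about 6.4x; the serial fraction, not the processor count, dominates.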
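The escape-analysis and lock-elision bullets can be illustrated with a classic candidate: a StringBuffer (whose methods are synchronized) that never escapes the method creating it. Whether a particular JVM actually elides or coarsens these locks is an optimization decision, so this sketch shows the opportunity, not a guarantee:

```java
// Lock-elision candidate: the StringBuffer is thread-confined, so escape
// analysis can prove its intrinsic lock is never contended and remove it.
// Lock coarsening could instead merge the back-to-back append() locks
// into one acquire/release pair.
public class ElisionCandidate {
    public static String label(int id) {
        StringBuffer sb = new StringBuffer(); // synchronized methods...
        sb.append("task-");                   // ...but the object never
        sb.append(id);                        // escapes this thread
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(label(42));
    }
}
```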
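A minimal lock-splitting sketch, in the spirit of the ServerStatus example from Goetz's Java Concurrency in Practice: the user set and the query set are independent, so each gets its own lock object instead of both sharing one lock:

```java
// Lock splitting: two independent sets of shared state, two locks.
// Threads touching users no longer contend with threads touching queries.
import java.util.HashSet;
import java.util.Set;

public class ServerStatus {
    private final Object userLock  = new Object();
    private final Object queryLock = new Object();
    private final Set<String> users   = new HashSet<String>();
    private final Set<String> queries = new HashSet<String>();

    public void addUser(String u)     { synchronized (userLock)  { users.add(u); } }
    public void removeUser(String u)  { synchronized (userLock)  { users.remove(u); } }
    public void addQuery(String q)    { synchronized (queryLock) { queries.add(q); } }
    public void removeQuery(String q) { synchronized (queryLock) { queries.remove(q); } }
}
```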
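Lock striping extends splitting to a partitioned data structure: one lock per partition rather than one lock per logical variable. This hypothetical StripedMap (the 16-stripe count mirrors the segment count early ConcurrentHashMap used internally) synchronizes only on the stripe that owns a key:

```java
// Lock striping: N_STRIPES locks guard disjoint partitions of the map,
// so threads working in different stripes never contend.
import java.util.HashMap;
import java.util.Map;

public class StripedMap<K, V> {
    private static final int N_STRIPES = 16;
    private final Object[] locks = new Object[N_STRIPES];
    private final Map<K, V>[] buckets;

    @SuppressWarnings("unchecked")
    public StripedMap() {
        buckets = (Map<K, V>[]) new Map[N_STRIPES];
        for (int i = 0; i < N_STRIPES; i++) {
            locks[i] = new Object();
            buckets[i] = new HashMap<K, V>();
        }
    }

    private int stripe(Object key) {
        return (key.hashCode() & 0x7fffffff) % N_STRIPES; // non-negative index
    }

    public V get(K key) {
        int s = stripe(key);
        synchronized (locks[s]) { return buckets[s].get(key); }
    }

    public V put(K key, V value) {
        int s = stripe(key);
        synchronized (locks[s]) { return buckets[s].put(key, value); }
    }
}
```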
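Finally, a sketch of two of the listed alternatives to exclusive locks: an AtomicLong for a hot counter field, and a ReadWriteLock for read-mostly state. The class and field names here are illustrative:

```java
// Alternatives to an exclusive lock: AtomicLong uses a hardware
// compare-and-swap instead of blocking, and ReadWriteLock lets any
// number of readers proceed in parallel.
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class HotField {
    // Hot field: incremented by many threads, no lock needed.
    private final AtomicLong hits = new AtomicLong();
    public void recordHit() { hits.incrementAndGet(); }
    public long hits()      { return hits.get(); }

    // Read-mostly state: many concurrent readers, occasional writer.
    private final ReadWriteLock rw = new ReentrantReadWriteLock();
    private String config = "default";

    public String readConfig() {
        rw.readLock().lock();
        try { return config; } finally { rw.readLock().unlock(); }
    }

    public void writeConfig(String c) {
        rw.writeLock().lock();
        try { config = c; } finally { rw.writeLock().unlock(); }
    }
}
```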
Monitoring CPU utilization
• top, mpstat, vmstat, iostat, sar, cpustat
• Process and resource profilers and low-level activity-register data analyzers.
• See Summer of 2010 report, final page.
• Also commercial code profilers for multithreading.
• Causes of CPU underutilization:
  • Insufficient load
  • I/O-bound processing
  • External bottlenecks
  • Lock contention

Miscellaneous
• Thread pools? Yes. Object pooling? No. (Sketch below.)
• Dynamic allocation and garbage collection are cheaper and easier than application-driven object pools and application-level synchronization of those pools.
• Reducing context-switching overhead:
  • Make some portion of the application architecture CPU-intensive, and partition it so it can be multithreaded.
  • Let some partitioned set of threads handle the buffered I/O. Only they are candidates for I/O blocking.
  • This may introduce the possibility of a serial bottleneck in the I/O threads. Use appropriate queuing (sketch below).
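The "thread pools yes, object pools no" advice, sketched: a fixed-size ExecutorService reuses threads across tasks, while the short-lived Runnable objects are simply allocated and garbage collected rather than pooled:

```java
// Thread pool: reuse threads via an ExecutorService rather than creating
// one thread per task. The task objects themselves are left to the
// garbage collector; pooling such small objects would reintroduce
// application-level synchronization for little gain.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolDemo {
    public static void main(String[] args) throws InterruptedException {
        int n = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(n);
        for (int i = 0; i < 100; i++) {
            final int task = i;
            // Cheaply allocated, cheaply collected -- not pooled.
            pool.execute(new Runnable() {
                public void run() {
                    System.out.println("task " + task + " on "
                            + Thread.currentThread().getName());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```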
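A minimal sketch of the I/O-partitioning idea in the last bullets: CPU-side threads put results on a bounded BlockingQueue, and a single dedicated I/O thread drains it, so only that thread is a candidate for I/O blocking. The queue capacity of 1024 and the poison-pill shutdown are illustrative choices, not course requirements:

```java
// Partitioning sketch: workers never block on I/O; they hand results to
// a bounded queue drained by the one thread that does I/O. The bound
// gives back-pressure so the I/O thread cannot silently become an
// unbounded serial bottleneck.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class IoPartition {
    private static final String POISON = "#done"; // shutdown sentinel

    public static void main(String[] args) throws InterruptedException {
        final BlockingQueue<String> out = new ArrayBlockingQueue<String>(1024);

        Thread ioThread = new Thread(new Runnable() { // the only I/O thread
            public void run() {
                try {
                    String line;
                    while (!(line = out.take()).equals(POISON)) {
                        System.out.println(line);     // buffered I/O lives here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        ioThread.start();

        for (int i = 0; i < 10; i++) {                // CPU-side work, no I/O
            out.put("result " + i);
        }
        out.put(POISON);                              // tell the I/O thread to stop
        ioThread.join();
    }
}
```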