CS 395 Last Lecture: Summary, Anti-summary, and Final Thoughts

Summary (1): Architecture
• Modern architecture designs are driven by energy constraints
• Shortening latencies is too costly, so we use parallelism in hardware to increase potential throughput
• Some parallelism is implicit (out-of-order superscalar processing), but it has limits
• Other parallelism is explicit (vectorization and multithreading) and relies on software to unlock it

Summary (2): Memory
• Memory technologies trade off energy and cost for capacity, with SRAM registers on one end and spinning-platter hard disks on the other
• Locality (relationships between memory accesses) can help us get the best of all cases
• Caching is the hardware-only solution to capturing locality, but software-driven solutions exist too (memcache for files, etc.)

Summary (3): Software
• Want to fully occupy your hardware?
  – Express locality (tiling)
  – Vectorize (compiler or manual)
  – Multithread (e.g. OpenMP)
  – Accelerate (e.g. CUDA, OpenCL)
• Take the cost into consideration. Unless you’re optimizing in your free time, your time isn’t free.

Research Perspective (2010)
• Can we generalize and categorize the most important, generally applicable GPU computing software optimizations?
  – Across multiple architectures
  – Across many applications
• What kinds of performance trends are we seeing from successive GPU generations?
• Conclusion: GPUs aren’t special, and parallel programming is getting easier

Application Survey
• Surveyed the GPU Computing Gems chapters
• Studied the Parboil benchmarks in detail
Results:
• Eight (for now) major categories of optimization transformations (illustrative sketches of a few of them follow the list of categories below)
  – The performance impact of individual optimizations on certain Parboil benchmarks is included in the paper

1. (Input) Data Access Tiling
[Diagram: input data is either explicitly copied from DRAM into a scratchpad or implicitly copied into a cache; subsequent accesses are local]

2. (Output) Privatization
• Avoid contention by aggregating updates locally
• Requires storage resources to keep copies of data structures
[Diagram: private results are merged into local results, which are merged into global results]

Running Example: SpMV (Ax = v)
[Diagram: sparse matrix A stored as Row, Col, and Data arrays, multiplied by a dense vector x to produce v]

3. “Scatter to Gather” Transformation
[Diagram: the SpMV example reorganized so that each output element of v gathers its inputs, rather than inputs scattering contributions into v]

4. Binning

5. Regularization (Load Balancing)

6. Compaction

7. Data Layout Transformation

8. Granularity Coarsening
• Parallel execution often requires redundant work and coordination overhead
  – Merging multiple threads into one allows results to be reused, reducing redundancy
[Diagram: time for 4-way parallel vs. 2-way parallel execution, split into essential and redundant work]
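To make pattern 1 concrete, here is a minimal CUDA sketch of input data access tiling via explicit copies into scratchpad (shared memory). It uses a plain square matrix multiply in the spirit of the sgemm benchmark; the names (`sgemm_tiled`, `TILE`) are illustrative, and it assumes for brevity that the matrix dimension is a multiple of the tile size.

```cuda
#define TILE 16

// Each block stages a TILE x TILE tile of A and B into shared memory
// (explicit copy), then performs all reuse out of the scratchpad.
__global__ void sgemm_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];   // tile of A in scratchpad
    __shared__ float Bs[TILE][TILE];   // tile of B in scratchpad

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Explicit copy: each thread loads one element of each input tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Local access: all reuse now hits shared memory, not DRAM.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```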
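Privatization (pattern 2) is easiest to see with a histogram, in the spirit of the histo benchmark: each block accumulates into a private copy in shared memory and merges it into the global result once. This is a sketch under the assumption of 256 bins and 8-bit input; the kernel and array names are illustrative.

```cuda
#define NUM_BINS 256

__global__ void histo_privatized(const unsigned char *in, int n,
                                 unsigned int *global_histo)
{
    __shared__ unsigned int local_histo[NUM_BINS];   // private per-block copy

    // Clear the private copy.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local_histo[b] = 0;
    __syncthreads();

    // Threads contend only with other threads in the same block.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local_histo[in[i]], 1u);
    __syncthreads();

    // One merge per block into the shared global result.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&global_histo[b], local_histo[b]);
}
```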
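For the SpMV running example, the gather form of pattern 3 gives each output row a single owner, so no synchronization is needed on v. The slides only show Row/Col/Data arrays; the sketch below assumes a CSR-style layout (row pointers plus column indices and values), which may differ from the exact format used in the benchmark.

```cuda
// Gather-form SpMV: one thread per output row reads ("gathers") everything
// it needs, so no two threads ever write the same element of v.
__global__ void spmv_csr_gather(const int *row_ptr, const int *col_idx,
                                const float *data, const float *x,
                                float *v, int num_rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            dot += data[j] * x[col_idx[j]];
        v[row] = dot;   // single owner per output: no atomics needed
    }
}
```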
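Finally, a minimal sketch of granularity coarsening (pattern 8): in the fine-grained version every thread would recompute the same row-dependent weight; after merging a few threads into one, the weight is computed once and reused for all of the elements that thread now owns. The kernel, the 4-way factor, and the sinf-based weight are illustrative, not taken from the lecture.

```cuda
#define COARSEN 4   // illustrative coarsening factor

__global__ void scale_rows_coarsened(const float *in, float *out,
                                     int width, int height)
{
    int row  = blockIdx.y * blockDim.y + threadIdx.y;
    int col0 = (blockIdx.x * blockDim.x + threadIdx.x) * COARSEN;
    if (row >= height) return;

    // Work that every fine-grained thread would redo; done once here.
    float w = sinf(3.14159265f * (row + 0.5f) / height);

    // One thread now covers COARSEN consecutive columns of its row.
    for (int c = 0; c < COARSEN; ++c) {
        int col = col0 + c;
        if (col < width)
            out[row * width + col] = w * in[row * width + col];
    }
}
```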
How much faster do applications really get each hardware generation?

Unoptimized Code Has Improved Drastically
• Orders-of-magnitude speedups in many cases
• Hardware does not solve all problems
  – Coalescing (lbm)
  – Highly contended atomics (bfs)

Optimized Code Is Improving Faster than “Peak Performance”
• Caches capture locality that scratchpad can’t capture efficiently (spmv, stencil)
• Increased local storage capacity enables extra optimization (sad)
• Some benchmarks need atomic throughput more than flops (bfs, histo)

Optimization Still Matters
• Hardware never changes algorithmic complexity (cutcp)
• Caches do not solve layout problems for big data (lbm)
• Coarsening still makes a big difference (cutcp, sgemm)
• Many artificial performance cliffs are gone (sgemm, tpacf, mri-q)

Stuff we haven’t covered
• There are good tools for profiling code beyond good timing (cache misses, etc.). If you can’t find out why a particular piece of code is taking so long, look into hardware performance counters (a minimal timing sketch follows at the end of these notes).
• Patterns and practice: we covered some of the major optimization patterns, but only the basic ones. Many optimization patterns are algorithmic.

Fill Out Evaluations!
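As a closing aside on the profiling point above, here is a sketch of the “good timing” baseline using CUDA events; the `scale` kernel and launch configuration are placeholders. For cache misses and other hardware counters, a GPU profiler (e.g. NVIDIA’s Nsight tools) or CPU tools such as Linux perf are the usual next step.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel to be timed.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Time one launch of the kernel with CUDA events (device memory assumed
// to be allocated and initialized by the caller).
void time_scale(float *d_x, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("scale kernel: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```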