Compilation, Architectural Support, and Evaluation of SIMD Graphics Pipeline Programs on a General-Purpose CPU

Mauricio Breternitz Jr, Herbert Hum, Sanjeev Kumar
Microprocessor Research Labs, Intel Corporation
Parallel Architecture and Compilation Techniques, 2003


Graphics Applications

• Computationally intensive graphics applications are becoming increasingly popular
  ─ Computer-aided design: from airplanes to cars
  ─ Visualization of massive quantities of data
  ─ Visual simulators, e.g. training pilots
  ─ Fancier graphical user interfaces
  ─ And, of course, games
• And this trend is continuing
  ─ As high-end applications become more mainstream


Graphics Pipeline

[Diagram: 3D Application; OpenGL or DirectX; Graphics Pipeline with the stages Scene, Transform, Lighting, Clipping, Rasterization, Texture Mapping, Compositing, Display]

• Vertex Shaders
  ─ Operate on every vertex in the scene
  ─ Effects like blur, diffuse and specular reflection
• Pixel Shaders
  ─ Operate on every pixel
  ─ Effects like texturing, fog blending


Vertex and Pixel Shaders

• Need to operate millions of times a second
• Small programs
• Typically run on the graphics card
• However, most desktops do not have graphics cards that support programmable shaders
• This work focuses on running Vertex Shaders on the main CPU
• Pixel shaders have very high computational and bandwidth requirements
• Graphics applications are designed to adapt to the available features and performance


Goals

• Improve the performance of Vertex Shaders on the main CPU
  ─ Analyze the performance on today’s CPUs
  ─ Better compiler optimizations
  ─ Additional architectural support
• Identify three architectural and compiler enhancements
  ─ Significant impact on performance: roughly a factor of 2


Outline

• Motivation
• Baseline Compiler
• Three Enhancements
• Performance Evaluation
• Conclusions


Vertex Shader Programs

[Diagram: Virtual Machine]
  ─ Vertex Input: 16 x 4 registers
  ─ Temporary Registers: 12 x 4
  ─ Integer Registers: 84 x 1
  ─ Constant Memory: 256 x 4
  ─ SIMD ALU
  ─ Vertex Output: 15 x 4 registers

Example:

    dp4 oPos.x, v0, c[0]
    dp4 oPos.y, v0, c[1]
    dp4 oPos.z, v0, c[2]
    dp4 oPos.w, v0, c[3]

• Small programs (at most 256 instructions)
• SIMD instructions with xyzw components
• Mask and swizzle on each instruction
  ─ mov oD0, c[4].wzyx
• No state saved between vertices
  ─ Read-only memory & temporary registers
• Program cannot change control flow


Baseline Optimizing Compiler

• Implemented a compiler for Vertex Shaders
  ─ Input: Vertex Shader assembly
  ─ Output: optimized x86 (with SSE2)
• Started with the DirectX reference rasterizer, which is an interpreter
  ─ Used it as the front end
• Uses the Olive pattern-matching code-generator generator
• Graph-coloring based register allocator
• Loop unrolling
• List scheduler
• About 70% faster than a naïve translator
  ─ The naïve translator converts the shader into C and feeds it to a C compiler


Characteristics of Generated Code

• Mostly SIMD instructions (x86 with SSE2)
  ─ 83-99% of instructions
• Large basic blocks
  ─ Use of control flow is limited
  ─ Makes it easier to compile efficiently
• Vertex Shader assembly to x86 assembly
  ─ 10-20 times increase in the number of instructions
  ─ Example: mul r0.x_z_, v0.xyzz, v1.wwww


1. New Instructions

• Dot products are very common in shaders
• A dot product is expensive on x86
  ─ A sequence of 7 instructions in the simple case: 1 multiply, 2 add, 4 shuffle
• New dot product instructions
  ─ Compute the dot product of two source operands and store it in each word of the destination operand


2. Mask Analysis Optimization

• Traditional optimizers keep track of liveness information on a per-register basis
  ─ Shaders: often only part of the SIMD register is live
  ─ Modify the analysis to track liveness for each word of the SIMD register
• Analysis phase
  ─ Annotate the IR with additional information
  ─ During live-variable analysis, propagate the liveness mask depending on the instructions
• Optimization phase
  ─ Identify dead code
  ─ Replace some shuffle/mask instructions with a move, which might get eliminated entirely during register allocation


3. Number of Registers

• Spilling registers to memory can degrade performance
• Investigate the impact of increasing the number of registers from 8 to 16
• Why not more?
  ─ Trickier to encode in the ISA


Experimental Setup

• 10 Vertex Shaders
  ─ 8-84 instructions
  ─ Only 3 of them have loops (control)
• 2.2 GHz Pentium IV processor
  ─ Instruction counts otherwise
  ─ Break down the instructions into categories
• Measure performance by using the generated code to process an array of vertices
  ─ Compute the average


Evaluation

[Figure: normalized execution time for each vertex shader (B, CTC, L, PS, PL, PE, R, T, TS, W) under four configurations: Base, New Instructions Only, Mask Optimization Only, Both]

• New dot-product instructions: 27.4% on average (estimate)
  ─ Reduces the number of instructions by 24%
• Mask optimization: 19.5% on average
• Both: 42% on average


Evaluation Cont’d

[Figure: normalized instruction count for each vertex shader (B, CTC, L, PS, PL, PE, R, T, TS, W): Base vs. 16 Registers]

• 16 registers reduce the number of instructions by 8% on average
  ─ 35-100% of the spill instructions
• This understates the potential benefit
  ─ More registers allow more aggressive optimizations like instruction scheduling


Conclusions & Future Work

• Implemented an optimizing compiler for Vertex Shaders
• Proposed and evaluated three enhancements
  ─ Compiler: mask optimization
  ─ Architectural: new instructions & more registers
  ─ Improve performance by roughly a factor of 2
• Shaders are evolving rapidly
  ─ More like general-purpose processors
  ─ More complex model
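
Illustration for Enhancement 1 (New Instructions). The sketch below models why a four-component dot product needs a multiply followed by shuffle/add steps when only element-wise SIMD operations are available, and what a dot-product instruction that writes the result into every word of the destination would replace. It is a minimal model, not the authors' generated code: registers are plain Python 4-tuples, and the helper names (mul, add, shuffle, dp4_expanded, dp4_single) are hypothetical. The idealized sequence here uses non-destructive three-operand operations, so its instruction count is lower than the 7-instruction x86 sequence reported in the talk.

```python
# Illustrative model only: a SIMD register is a 4-tuple (x, y, z, w).
# All names below are hypothetical helpers, not part of any real ISA or library.

def mul(a, b):
    """Element-wise multiply of two 4-wide registers."""
    return tuple(p * q for p, q in zip(a, b))

def add(a, b):
    """Element-wise add of two 4-wide registers."""
    return tuple(p + q for p, q in zip(a, b))

def shuffle(a, order):
    """Return a register whose lane i holds a[order[i]]."""
    return tuple(a[i] for i in order)

def dp4_expanded(a, b, trace):
    """Dot product built from element-wise ops; result is in every lane."""
    t = mul(a, b)
    trace.append("mul")
    s = shuffle(t, (1, 0, 3, 2))      # swap lanes within each pair
    trace.append("shuffle")
    t = add(t, s)                     # lanes: (x+y, x+y, z+w, z+w)
    trace.append("add")
    s = shuffle(t, (2, 3, 0, 1))      # swap the two pairs
    trace.append("shuffle")
    t = add(t, s)                     # every lane: x+y+z+w
    trace.append("add")
    return t

def dp4_single(a, b, trace):
    """Model of a single dot-product instruction that fills every lane."""
    trace.append("dp4")
    d = sum(p * q for p, q in zip(a, b))
    return (d, d, d, d)

v0 = (1.0, 2.0, 3.0, 4.0)
c0 = (5.0, 6.0, 7.0, 8.0)
expanded_trace, single_trace = [], []
expanded = dp4_expanded(v0, c0, expanded_trace)
single = dp4_single(v0, c0, single_trace)
print(expanded, len(expanded_trace))
print(single, len(single_trace))
```

Both functions return the same register; the traces only count modeled operations and say nothing about real cycle counts.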
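
Illustration for Enhancement 2 (Mask Analysis Optimization). The sketch below shows one way per-component liveness could be propagated backward through a straight-line shader, so that write masks can be narrowed and dead instructions identified. It is an assumption-laden sketch rather than the compiler described in the talk: the tuple-based instruction format, the function name analyze, and the example program are all hypothetical, and loops are not handled.

```python
# Hypothetical instruction format: (opcode, dest_reg, write_mask, sources),
# where each source is (reg, swizzle) and swizzle is a 4-letter string over "xyzw".
COMPONENTS = "xyzw"

def analyze(program, live_out):
    """Backward per-component liveness over straight-line code.

    Returns (masks, live_in): masks[i] is the narrowed write mask of
    instruction i ("" means the instruction is dead), and live_in maps
    each register to the set of components live at program entry.
    """
    live = {reg: set(comps) for reg, comps in live_out.items()}
    masks = []
    for op, dst, mask, sources in reversed(program):
        # Only components that are both written and later read matter.
        needed = set(mask) & live.get(dst, set())
        if needed:
            # The write kills liveness of the written components...
            live[dst] = live.get(dst, set()) - set(mask)
            # ...and makes the source components it reads live.
            for reg, swizzle in sources:
                if op == "dp4":
                    used = set(swizzle)   # a dot product reads all four lanes
                else:
                    used = {swizzle[COMPONENTS.index(c)] for c in needed}
                live.setdefault(reg, set()).update(used)
        masks.append("".join(c for c in COMPONENTS if c in needed))
    masks.reverse()
    return masks, live

program = [
    ("mul", "r0", "xyzw", [("v0", "xyzw"), ("v1", "wwww")]),
    ("add", "r1", "xyzw", [("r0", "xyzw"), ("c0", "xyzw")]),
    ("mov", "r2", "xyzw", [("v2", "xyzw")]),   # r2 is never read
    ("mov", "oPos", "xy", [("r1", "xyzw")]),   # only x and y reach the output
]
masks, live_in = analyze(program, {"oPos": "xyzw"})
print(masks)
```

A per-register analysis would keep the first two instructions at full width because r0 and r1 are read later; tracking each component separately shows that only their x and y words are needed, and that the write to r2 is dead.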