Introduction to Parallel Rendering: Sorting, Chromium, and MPI
Mengxia Zhu, Spring 2006

Parallel Rendering
• The graphics rendering process is computationally intensive.
• Parallel computation is a natural way to achieve higher performance.
• There are two levels of parallelism:
  • Functional parallelism – pipelining
  • Data parallelism – multiple results computed at the same time

Rendering Pipeline
• [figure: the stages of the graphics rendering pipeline]

Data Parallel Algorithms
• There are many taxonomies for categorizing parallel algorithms:
  • image space vs. object space
  • shared-memory vs. distributed-memory architectures
  • MPI, OpenMP, ...
• We need a uniform framework to study and understand parallel rendering.

Sorting in Rendering
• Rendering can be viewed as a sorting process: a sort from object coordinates to screen coordinates.
• The key procedure is calculating the effect of each primitive on each pixel.
• This view lets us study computational and communication costs.

Sorting Categories
• The location of this "sort" determines the structure of the parallel algorithm:
  • Sort-first: during geometry processing; distributes "raw" primitives
  • Sort-middle: between geometry processing and rasterization; distributes screen-space primitives
  • Sort-last: during rasterization; distributes pixels/fragments

Sorting cont.
• A landmark paper: S. Molnar et al., "A Sorting Classification of Parallel Rendering," IEEE Computer Graphics & Applications, 1994.
• [figure: the sort-first, sort-middle, and sort-last pipelines, each drawn as chains of geometry (G) and rasterization (R) processors, with sort-last feeding a compositing stage (C)]

Sort First
• Primitives are initially assigned arbitrarily.
• A pre-transformation determines which screen regions each primitive covers.
• Primitives are then redistributed over the network to the correct renderer.
• From that point on, the renderer performs the work of the entire pipeline for that primitive.

Sort First cont.
• Screen space is partitioned into non-overlapping 2D tiles, each rendered independently by a tightly coupled pair of geometry and rasterization processors (a minimal classification sketch follows the Analysis Terms below).
• The sub-images of the 2D tiles are composited without depth comparison.

Analysis Terms
• Assume a dataset containing n_r raw primitives with average size a_r.
• Primitives that result from tessellation are called display primitives. If T is the tessellation ratio, there are n_d = T·n_r of these, with average size a_d = a_r/T. If there is no tessellation, T = 1, n_d = n_r, and a_d = a_r.
• Assume an image containing A pixels, with S samples computed per pixel.
• Assume all primitives lie within the viewing frustum.
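To make the sort-first pre-transformation step concrete, here is a small C sketch of classifying a primitive's screen-space bounding box to the tiles (and hence renderers) it overlaps. The screen size, tile grid, and all names such as BBox and classify are illustrative assumptions, not from the slides.

/* A minimal sketch (not from the slides) of the sort-first
 * classification step: map a primitive's screen-space bounding box
 * to the non-overlapping 2D tiles, and hence renderers, it covers. */
#include <stdio.h>

#define SCREEN_W 1024
#define SCREEN_H 768
#define TILES_X  4            /* TILES_X * TILES_Y renderers */
#define TILES_Y  2

typedef struct { float xmin, ymin, xmax, ymax; } BBox;

/* Emit every tile index the box overlaps; a primitive straddling
 * a tile border is redistributed to more than one renderer. */
static void classify(const BBox *b, void (*emit)(int tile)) {
    const int tw = SCREEN_W / TILES_X, th = SCREEN_H / TILES_Y;
    int tx0 = (int)(b->xmin / (float)tw), tx1 = (int)(b->xmax / (float)tw);
    int ty0 = (int)(b->ymin / (float)th), ty1 = (int)(b->ymax / (float)th);
    if (tx0 < 0) tx0 = 0;
    if (ty0 < 0) ty0 = 0;
    if (tx1 > TILES_X - 1) tx1 = TILES_X - 1;
    if (ty1 > TILES_Y - 1) ty1 = TILES_Y - 1;
    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++)
            emit(ty * TILES_X + tx);
}

static void print_tile(int tile) { printf("send to renderer %d\n", tile); }

int main(void) {
    /* A box crossing the tile boundary at x = 256 lands on two tiles. */
    BBox b = { 200.0f, 100.0f, 300.0f, 180.0f };
    classify(&b, print_tile);    /* prints renderers 0 and 1 */
    return 0;
}

A primitive that overlaps several tiles is sent to each of them, so some work is duplicated; this overlap traffic is part of the communication cost that the analysis terms above are used to quantify.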
Sort-First Analysis
• Pros:
  • Low communication requirements when tessellation or oversampling is high, or when inter-frame coherence is exploited
  • Each processor implements the entire rendering pipeline for its screen region
• Cons:
  • Susceptible to load imbalance when primitives clump into a few regions
  • Exploiting coherence is difficult

Sort Middle
• Primitives are initially assigned arbitrarily.
• Primitives are fully transformed, lit, etc. by the geometry processor to which they were initially assigned.
• Transformed primitives are then distributed over the network to the rasterizer assigned to their region of the screen.

Sort Middle Analysis
• Pros:
  • Redistribution occurs at a "natural" place in the pipeline
• Cons:
  • High communication cost if the tessellation ratio T is high
  • Susceptible to load imbalance in the same way as sort-first
• Overhead:
  • Display-primitive distribution cost
  • Tessellation factor

Sort Last
• Defers sorting until the end of the pipeline – the image composition phase.
• Renderers operate independently until the visibility stage.
• Fragments are transmitted over the network to compositing processors, which resolve visibility.

Sort Last Analysis
• Pros:
  • Renderers implement the full pipeline and are independent until pixel merging
  • Less prone to load imbalance
  • Very scalable
• Cons:
  • Pixel traffic can be extremely high

Image Composition
• A naive approach is binary compositing: each disjoint pair of processors produces a new sub-image, so N/2 sub-images remain after the first stage.
• At the next level, half of the remaining processors pair up for compositing, leaving the other half idle.
• The binary-swap compositing method instead ensures that every processor participates in all stages of the process.
• The key idea: at each compositing stage, the two processors involved in a composite operation split the image plane into two pieces, and each takes responsibility for one piece.

Binary Swap Example
• The binary-swap compositing algorithm for four processors (see the MPI sketch at the end of these notes).
• [figure: binary-swap compositing stages for four processors]

Which to Choose?
• It depends. Considerations include:
  • which approach best matches the hardware capabilities
  • the number of primitives, the tessellation factor, coherence, etc.
• There are many tradeoffs.

Load Balancing
• Strategies for better load balancing:
  • Task queuing: order the task queue by decreasing task size, so that concurrency becomes finer-grained as the queue is drained.
  • Load stealing: nodes steal smaller tasks from other nodes once they have completed their own.
  • Time stamps: give each task a timeout; if a node cannot finish its task before the timeout, the remnant of the task is re-partitioned and re-distributed.
• Hierarchical data structures, such as octrees and k-d trees, are commonly used.

References
• These slides reference content from:
  • Jian Huang, University of Tennessee, Knoxville
  • William Gropp and Ewing Lusk, Argonne National Laboratory
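As referenced in the Binary Swap Example above, the following is a minimal C/MPI sketch of binary-swap compositing. It assumes details the slides leave open: a power-of-two process count, one depth value plus one packed color per fragment, and depth-test compositing that keeps the nearer sample; all names (Frag, composite, binary_swap) are illustrative.

/* A minimal binary-swap compositing sketch in C with MPI. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { float z; uint32_t rgba; } Frag;

/* Depth-test composite: keep whichever fragment is nearer. */
static void composite(Frag *mine, const Frag *theirs, int n) {
    for (int i = 0; i < n; i++)
        if (theirs[i].z < mine[i].z)
            mine[i] = theirs[i];
}

/* img holds this renderer's full-frame fragment buffer of `pixels`
 * entries. On return, this rank owns the fully composited span
 * [*offset, *offset + *span) of the final image. */
static void binary_swap(Frag *img, int pixels, int *offset, int *span,
                        MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);               /* assumed power of two */

    *offset = 0;
    *span = pixels;
    Frag *tmp = malloc(sizeof(Frag) * (size_t)(pixels / 2 + 1));

    for (int stride = 1; stride < size; stride <<= 1) {
        int partner = rank ^ stride;          /* differs in one bit */
        int half = *span / 2;
        int keep_hi = (rank & stride) != 0;   /* upper member keeps 2nd half */

        int send_off = keep_hi ? *offset        : *offset + half;
        int send_len = keep_hi ? half           : *span - half;
        int keep_off = keep_hi ? *offset + half : *offset;
        int keep_len = keep_hi ? *span - half   : half;

        /* Trade away one half of our region for the partner's copy
         * of the half we keep, then composite the two copies. */
        MPI_Sendrecv(img + send_off, (int)(send_len * sizeof(Frag)),
                     MPI_BYTE, partner, 0,
                     tmp, (int)(keep_len * sizeof(Frag)),
                     MPI_BYTE, partner, 0, comm, MPI_STATUS_IGNORE);
        composite(img + keep_off, tmp, keep_len);

        *offset = keep_off;
        *span = keep_len;
    }
    free(tmp);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { PIXELS = 1024 };                   /* toy frame buffer */
    Frag *img = malloc(sizeof(Frag) * PIXELS);
    for (int i = 0; i < PIXELS; i++) {        /* fake local render */
        img[i].z = (float)((rank * 31 + i) % 97);
        img[i].rgba = 0xFF000000u | (uint32_t)rank;
    }

    int off, span;
    binary_swap(img, PIXELS, &off, &span, MPI_COMM_WORLD);
    printf("rank %d owns composited pixels [%d, %d)\n",
           rank, off, off + span);
    free(img);
    MPI_Finalize();
    return 0;
}

At each stage a rank pairs with the partner whose rank differs in exactly one bit, trades away half of its current region, and composites the half it keeps. After log2(N) stages every rank holds a fully composited 1/N of the frame, so no processor ever sits idle; a final MPI_Gatherv can assemble the image for display.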