Introduction to Parallel Rendering: Sorting, Chromium, and MPI
Mengxia Zhu, Spring 2006

Parallel Rendering
• The graphics rendering process is computationally intensive.
• Parallel computation is a natural way to achieve higher performance.
• There are two levels of parallelism:
  • Functional parallelism – pipelining
  • Data parallelism – multiple results computed at the same time

Rendering Pipeline
• [figure: the stages of the graphics rendering pipeline]

Data Parallel Algorithms
• There are many taxonomies for categorizing parallel algorithms:
  • image space vs. object space
  • shared-memory vs. distributed-memory architectures
  • MPI, OpenMP, ...
• We need a uniform framework to study and understand parallel rendering.

Sorting in Rendering
• Rendering can be viewed as a sorting process: a sort from object coordinates to screen coordinates.
• The key procedure is calculating the effect of each primitive on each pixel.
• This view lets us study computational and communication costs.

Sorting Categories
• The location of this "sort" determines the structure of the parallel algorithm:
  • Sort-first: during geometry processing; distributes "raw" primitives
  • Sort-middle: between geometry processing and rasterization; distributes screen-space primitives
  • Sort-last: during rasterization; distributes pixels/fragments

Sorting cont.
• A landmark paper: S. Molnar et al., "A Sorting Classification of Parallel Rendering," IEEE Computer Graphics & Applications, 1994.
• [figure: the sort-first, sort-middle, and sort-last pipelines, each drawn as chains of geometry (G) and rasterization (R) processors, with sort-last feeding a compositing stage (C)]

Sort First
• Primitives are initially assigned arbitrarily.
• A pre-transformation determines which screen regions each primitive covers.
• Primitives are then redistributed over the network to the correct renderer.
• From that point on, the renderer performs the work of the entire pipeline for that primitive.

Sort First cont.
• Screen space is partitioned into non-overlapping 2D tiles, each rendered independently by a tightly coupled pair of geometry and rasterization processors (a minimal classification sketch follows the Analysis Terms below).
• The sub-images of the 2D tiles are composited without depth comparison.

Analysis Terms
• Assume a dataset containing n_r raw primitives with average size a_r.
• Primitives that result from tessellation are called display primitives. If T is the tessellation ratio, there are n_d = T·n_r of these, with average size a_d = a_r/T. If there is no tessellation, T = 1, n_d = n_r, and a_d = a_r.
• Assume an image containing A pixels, with S samples computed per pixel.
• Assume all primitives lie within the viewing frustum.
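To make the sort-first pre-transformation step concrete, here is a small C sketch of classifying a primitive's screen-space bounding box to the tiles (and hence renderers) it overlaps. The screen size, tile grid, and all names such as BBox and classify are illustrative assumptions, not from the slides.

/* A minimal sketch (not from the slides) of the sort-first
 * classification step: map a primitive's screen-space bounding box
 * to the non-overlapping 2D tiles, and hence renderers, it covers. */
#include <stdio.h>

#define SCREEN_W 1024
#define SCREEN_H 768
#define TILES_X  4            /* TILES_X * TILES_Y renderers */
#define TILES_Y  2

typedef struct { float xmin, ymin, xmax, ymax; } BBox;

/* Emit every tile index the box overlaps; a primitive straddling
 * a tile border is redistributed to more than one renderer. */
static void classify(const BBox *b, void (*emit)(int tile)) {
    const int tw = SCREEN_W / TILES_X, th = SCREEN_H / TILES_Y;
    int tx0 = (int)(b->xmin / (float)tw), tx1 = (int)(b->xmax / (float)tw);
    int ty0 = (int)(b->ymin / (float)th), ty1 = (int)(b->ymax / (float)th);
    if (tx0 < 0) tx0 = 0;
    if (ty0 < 0) ty0 = 0;
    if (tx1 > TILES_X - 1) tx1 = TILES_X - 1;
    if (ty1 > TILES_Y - 1) ty1 = TILES_Y - 1;
    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++)
            emit(ty * TILES_X + tx);
}

static void print_tile(int tile) { printf("send to renderer %d\n", tile); }

int main(void) {
    /* A box crossing the tile boundary at x = 256 lands on two tiles. */
    BBox b = { 200.0f, 100.0f, 300.0f, 180.0f };
    classify(&b, print_tile);    /* prints renderers 0 and 1 */
    return 0;
}

A primitive that overlaps several tiles is sent to each of them, so some work is duplicated; this overlap traffic is part of the communication cost that the analysis terms above are used to quantify.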
Sort-First Analysis
• Pros:
  • Low communication requirements when tessellation or oversampling is high, or when inter-frame coherence is exploited
  • Each processor implements the entire rendering pipeline for its screen region
• Cons:
  • Susceptible to load imbalance when primitives clump into a few regions
  • Exploiting coherence is difficult

Sort Middle
• Primitives are initially assigned arbitrarily.
• Primitives are fully transformed, lit, etc. by the geometry processor to which they were initially assigned.
• Transformed primitives are then distributed over the network to the rasterizer assigned to their region of the screen.

Sort Middle Analysis
• Pros:
  • Redistribution occurs at a "natural" place in the pipeline
• Cons:
  • High communication cost if the tessellation ratio T is high
  • Susceptible to load imbalance in the same way as sort-first
• Overhead:
  • Display-primitive distribution cost
  • Tessellation factor

Sort Last
• Defers sorting until the end of the pipeline – the image composition phase.
• Renderers operate independently until the visibility stage.
• Fragments are transmitted over the network to compositing processors, which resolve visibility.

Sort Last Analysis
• Pros:
  • Renderers implement the full pipeline and are independent until pixel merging
  • Less prone to load imbalance
  • Very scalable
• Cons:
  • Pixel traffic can be extremely high

Image Composition
• A naive approach is binary compositing: each disjoint pair of processors produces a new sub-image, so N/2 sub-images remain after the first stage.
• At the next level, half of the remaining processors pair up for compositing, leaving the other half idle.
• The binary-swap compositing method instead ensures that every processor participates in all stages of the process.
• The key idea: at each compositing stage, the two processors involved in a composite operation split the image plane into two pieces, and each takes responsibility for one piece.

Binary Swap Example
• The binary-swap compositing algorithm for four processors (see the MPI sketch at the end of these notes).
• [figure: binary-swap compositing stages for four processors]

Which to Choose?
• It depends. Considerations include:
  • which approach best matches the hardware capabilities
  • the number of primitives, the tessellation factor, coherence, etc.
• There are many tradeoffs.

Load Balancing
• Strategies for better load balancing:
  • Task queuing: order the task queue by decreasing task size, so that concurrency becomes finer-grained as the queue is drained.
  • Load stealing: nodes steal smaller tasks from other nodes once they have completed their own.
  • Time stamps: give each task a timeout; if a node cannot finish its task before the timeout, the remnant of the task is re-partitioned and re-distributed.
• Hierarchical data structures, such as octrees and k-d trees, are commonly used.

References
• These slides reference content from:
  • Jian Huang, University of Tennessee, Knoxville
  • William Gropp and Ewing Lusk, Argonne National Laboratory
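As referenced in the Binary Swap Example above, the following is a minimal C/MPI sketch of binary-swap compositing. It assumes details the slides leave open: a power-of-two process count, one depth value plus one packed color per fragment, and depth-test compositing that keeps the nearer sample; all names (Frag, composite, binary_swap) are illustrative.

/* A minimal binary-swap compositing sketch in C with MPI. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { float z; uint32_t rgba; } Frag;

/* Depth-test composite: keep whichever fragment is nearer. */
static void composite(Frag *mine, const Frag *theirs, int n) {
    for (int i = 0; i < n; i++)
        if (theirs[i].z < mine[i].z)
            mine[i] = theirs[i];
}

/* img holds this renderer's full-frame fragment buffer of `pixels`
 * entries. On return, this rank owns the fully composited span
 * [*offset, *offset + *span) of the final image. */
static void binary_swap(Frag *img, int pixels, int *offset, int *span,
                        MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);               /* assumed power of two */

    *offset = 0;
    *span = pixels;
    Frag *tmp = malloc(sizeof(Frag) * (size_t)(pixels / 2 + 1));

    for (int stride = 1; stride < size; stride <<= 1) {
        int partner = rank ^ stride;          /* differs in one bit */
        int half = *span / 2;
        int keep_hi = (rank & stride) != 0;   /* upper member keeps 2nd half */

        int send_off = keep_hi ? *offset        : *offset + half;
        int send_len = keep_hi ? half           : *span - half;
        int keep_off = keep_hi ? *offset + half : *offset;
        int keep_len = keep_hi ? *span - half   : half;

        /* Trade away one half of our region for the partner's copy
         * of the half we keep, then composite the two copies. */
        MPI_Sendrecv(img + send_off, (int)(send_len * sizeof(Frag)),
                     MPI_BYTE, partner, 0,
                     tmp, (int)(keep_len * sizeof(Frag)),
                     MPI_BYTE, partner, 0, comm, MPI_STATUS_IGNORE);
        composite(img + keep_off, tmp, keep_len);

        *offset = keep_off;
        *span = keep_len;
    }
    free(tmp);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { PIXELS = 1024 };                   /* toy frame buffer */
    Frag *img = malloc(sizeof(Frag) * PIXELS);
    for (int i = 0; i < PIXELS; i++) {        /* fake local render */
        img[i].z = (float)((rank * 31 + i) % 97);
        img[i].rgba = 0xFF000000u | (uint32_t)rank;
    }

    int off, span;
    binary_swap(img, PIXELS, &off, &span, MPI_COMM_WORLD);
    printf("rank %d owns composited pixels [%d, %d)\n",
           rank, off, off + span);
    free(img);
    MPI_Finalize();
    return 0;
}

At each stage a rank pairs with the partner whose rank differs in exactly one bit, trades away half of its current region, and composites the half it keeps. After log2(N) stages every rank holds a fully composited 1/N of the frame, so no processor ever sits idle; a final MPI_Gatherv can assemble the image for display.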