Shell: A Spatial Decomposition Data Structure for 3D Curve Traversal on Many-core Architectures (Regular Submission)

Kai Xiao (University of Notre Dame), Danny Z. Chen∗ (University of Notre Dame), X. Sharon Hu (University of Notre Dame), Bo Zhou (Altera Corp)

Abstract

Shared memory many-core processors such as GPUs have been extensively used in accelerating computation-intensive algorithms and applications. When porting existing algorithms from sequential or other parallel architecture models to shared memory many-core architectures, non-trivial modifications are often needed in order to match the execution patterns of the target algorithms with the characteristics of many-core architectures. 3D curve traversal is a fundamental process in many applications, and is commonly accelerated by spatial decomposition schemes captured in hierarchical data structures (e.g., kd-trees). However, curve traversal using hierarchical data structures needs to conduct repeated hierarchical searches. Such a search process is time-consuming on shared memory many-core architectures since it incurs a large number of expensive memory accesses and substantial execution divergence. In this paper, we propose a novel spatial decomposition based data structure, called Shell, which completely avoids hierarchical search for 3D curve traversal. In Shell, a structure is built on the boundary of each region in the decomposed space, which allows any curve traversing a region to find the next neighboring region to traverse using table lookup schemes, without any hierarchical search. While our 3D curve traversal approach works for other spatial decomposition paradigms and many-core processors, we illustrate it using kd-tree decomposition on GPU and compare it with the fastest known kd-tree searching algorithms for ray traversal.
Analysis and experimental results show that our approach improves ray traversal performance considerably over the kd-tree searching approaches.

Keywords: Many-core architecture, GPU, data structure, spatial decomposition, 3D curve traversal.

∗ The research of D.Z. Chen was supported in part by NSF under Grants CCF-0916606 and CCF-1217906.

1 Introduction

3D geometric scenes are involved in many applications, in which large numbers of curve traversal (e.g., ray traversal) operations are frequently conducted. Spatial decomposition based data structures have been developed on shared memory many-core architectures (e.g., general purpose graphics processing units (GPGPUs)) for accelerating curve traversal solutions. In this paper, we present a new efficient spatial decomposition based data structure for 3D curve traversal that better exploits the characteristics of shared memory many-core architectures and avoids the hierarchical searches that are commonly performed in other known spatial decomposition data structures.

Shared memory many-core processors present great opportunities to speed up computation-intensive applications by parallelization [19, 20, 22, 26]. Recent advances in GPGPUs, such as those from NVIDIA, AMD, and Intel, leverage massively parallel architectures based on single-instruction, multiple-data (SIMD) processor cores to achieve high performance [9]. In this paper, we use an NVIDIA GPU with the Fermi architecture as the model for illustration and experiments, but our solutions can be applied to other types of shared memory many-core processors, such as those with an MIMD architecture (e.g., Intel SCC [7]). Due to the specific characteristics of GPUs, sequential or parallel algorithms straightforwardly ported to GPU often suffer from a number of performance bottlenecks such as poor memory access efficiency and execution divergence, thus utilizing only a fraction of the GPU computation power [1].
A shared memory many-core processor, especially GPU, commonly contains hundreds of cores (e.g., NVIDIA GTX570 contains 480 cores [23]). All these cores are connected to one storage component such as shared cache or main memory. During execution, the memory bandwidth shared by multiple cores is often insufficient to support a large number of simultaneous memory requests. Hence, memory access efficiency on GPUs is much more crucial for performance than on traditional pipelined processors such as CPUs. For example, in the NVIDIA Fermi GPU, the latency of an off-chip memory transaction is 400-600 times longer than the fastest computational instructions. Since computational operations can be executed simultaneously by multiple cores but memory transactions need to be processed sequentially by the memory controller shared by these cores, the memory transaction latency is usually even worse. Besides memory access efficiency, execution divergence is another major performance bottleneck. The SIMD architecture used in GPU attains the most benefit when a group of cores (32-48 in NVIDIA Fermi) follows the same instruction flow. But when the execution paths running on different cores diverge due to branches (e.g., conditional statements), their parallel execution can no longer be sustained and instead serial execution takes place. Such divergent execution paths severely deteriorate performance [30]. Achieving good performance on SIMD systems requires solid understanding of both the architecture and execution patterns of the target algorithms [18]. Curve traversal is to trace the trajectory of a curve through a geometric scene. A curve can be a line segment, a ray, or a general algebraic curve. Ray traversal is a special case of curve traversal and a fundamental process in many applications, such as graphics ray tracing [2, 29] and radiation dose calculation [5, 31]. 
In this paper, we use ray traversal as an example to illustrate the problem and our solution for 3D curve traversal on many-core architectures. Our approach can be easily extended to traverse other types of curves. Since applications using ray traversal often involve large numbers of rays and repeatedly conduct the traversal process for such rays (e.g., to generate high-resolution images in graphics ray tracing), the execution speed of ray traversal is critical for such applications.

A number of data structures have been developed to partition and organize geometric objects in 3D scenes to improve the efficiency of ray traversal. One important class of such data structures is based on spatial decomposition (e.g., grids [10, 17], octrees and kd-trees [28]). A spatial decomposition scheme partitions a geometric scene into a set of regions, each containing a (small) number of objects. The subdivided regions are normally organized by a hierarchical structure which represents the geometric relationship among those regions. The kd-tree is a commonly used hierarchical structure due to its ability to match the subdivided regions with the distribution of geometric objects in the scene. It is also considered to be the data structure that provides the fastest known ray traversal speed in static scenes because of its efficient hierarchical search mechanism [32]. Quite a few algorithms have been developed for 3D ray traversal using kd-tree structures, in which repeated hierarchical searches in the tree are needed to find the neighboring regions for traversal [11, 25]. For kd-tree based ray traversal on a traditional CPU architecture, a stack is normally used to store the tree nodes along a search path. On GPU, due to the lack of stack support and the limited capacity of on-chip memory, a stack based approach typically allocates stacks in off-chip memory (e.g., global memory).
The long access latency of the off-chip memory can easily become a performance bottleneck for stack-based ray traversal on GPUs [16]. Several stack-less kd-tree ray traversal algorithms on GPU have been proposed to address this challenge. For example, Foley et al. [8] and Horn et al. [15] proposed a restart traversal scheme on kd-trees which starts the tree search from the root, and uses a push-down method to move the “root” down to the minimum subtree to be searched. Horn et al. [15] also developed a short-stack approach which builds a small circular stack in the on-chip memory of GPU, and combines it with the push-down method (PD-SS). Popov et al. [24] proposed a kd-rope algorithm [13] using the concept of neighbor links, where each leaf node in the kd-tree contains not only the region R it represents but also a set of pointers, each pointing to a minimum subtree that contains all neighboring regions touching each of the boundaries (or faces) of R. Santos et al. [25] improved the kd-rope implementation, making it the fastest known kd-tree searching ray traversal approach on GPU (called kd-rope++). Although these algorithms considerably improved the kd-tree searching 3D ray traversal performance on GPU, all of them still rely on hierarchical search to find the next traversal region when a ray crosses a region boundary (i.e., a ray exits from a region). During ray traversal, each search operation incurs reading a tree node, and search for the next neighboring traversal region can visit O(H) nodes in a kd-tree, where H is the height of the tree. Note that each reading of a visited tree node may take multiple memory transactions to obtain the entire information package for the node because each GPU memory transaction has a limited size (e.g., 128 bytes). Also, the threads for different rays may follow different sequences of visited tree nodes which can result in execution divergence. 
Hence, the performance of kd-tree searching ray traversal on GPU can suffer considerably from these issues.

We propose a new spatial decomposition based data structure, called Shell, which completely eliminates hierarchical search, hence leading to an efficient solution for 3D curve traversal on GPU. The purpose of a tree search is to find the next neighboring region when a curve (say, a ray) crosses a region boundary, using the geometric information of the curve and the regions. To avoid tree search, Shell provides a neighboring region locating mechanism based on table lookup techniques to replace hierarchical search and find the next region for any traversing ray, allowing the ray to directly access the next region’s information. Generally speaking, given the set of decomposed 3D regions in a hierarchical structure (say, a kd-tree), Shell focuses on the neighboring relationship among the regions for all leaf tree nodes, called leaf regions. For each leaf region R, the information of its neighboring leaf regions is captured in Shell by a geometric structure called arrangements, with one arrangement per boundary face of R. An arrangement is a partition of a face F of R into a set of 2D regions called cells, such that each cell C covers an area of F touching a neighboring leaf region R' of R and contains information of R'. When a ray exits R by crossing F in C, it acquires the neighboring region information of R' by accessing C. Of course, schemes for quickly finding the cell C hit by the ray are needed to avoid tree searches and other performance bottlenecks incurred by hierarchical searching data structures (for this, we apply various table lookup schemes). There are two key factors in designing the Shell table lookup schemes: the ray traversal speed and memory usage.
The ray traversal speed essentially depends on how efficiently a cell can be located when a ray crosses a region face through that cell; the memory usage is related to the total number of cells on the faces of all leaf regions in Shell (i.e., the sizes of arrangements or lookup tables for all leaf regions). Both factors are affected by the partition schemes for generating the arrangements. We seek to balance the ray traversal performance with good memory usage in Shell, and present a set of partition schemes, including (from simple to sophisticated) uniform grids, multi-level uniform grids, and compressed non-uniform grids, to deal with different neighboring region settings. Each such scheme can be viewed as an extension of the simpler ones for obtaining a good trade-off between ray traversal performance and memory usage for a more complicated neighboring region setting. We also exploit several memory accessing techniques [3, 6, 14] to further reduce the memory transactions used in Shell based ray traversal.

Given a geometric scene with N objects, suppose a kd-tree decomposition partitions it into M leaf regions with a tree height of O(log M) using O(M + N) memory. A ray traversal in such a kd-tree takes O(log M) search steps (i.e., the number of visited tree nodes or memory transactions involved) each time a ray crosses a region boundary. In comparison, using Shell, only one processing step (or O(1) memory transactions) without any tree search is needed to find the next region. The Shell memory usage is O(M + N + U), where U is the total number of cells in the arrangements for all leaf regions. Using judicious partition and compression schemes, the O(M + N + U) memory bound of Shell can be made in practice comparable to the O(M + N) memory bound of the kd-tree.
Since many applications use massive numbers of traversing rays, each of which can cross many regions in the scenes, reducing the number of memory transactions from O(log M) to O(1) per region crossing for each ray on GPU can be significant.

To demonstrate the effectiveness of our approach, we implemented our Shell data structure based on a kd-tree decomposition with all our arrangement schemes. Our experiments were conducted on ray traversals in several benchmark geometric scenes [27] used in graphics rendering. The experimental results show that our Shell approach outperforms the fastest known kd-tree searching ray traversal approaches on GPU by over 2X to 5X. Although ray traversals, kd-tree decomposition, and GPU architecture are used to illustrate our method, our Shell data structure and ray traversal algorithms can be easily extended to traversing 3D curves of other types, using different spatial decomposition schemes, and on other many-core architectures.

2 Ray Traversal Using Kd-tree Decomposition

In this section, we give a general discussion of ray traversal based on spatial decomposition (say, a kd-tree). We also briefly illustrate the PD-SS [24] and kd-rope [13] ray traversal algorithms on GPU.

Given a spatially decomposed scene, suppose we consider the regions corresponding to all leaf tree nodes, called leaf regions (e.g., in Figure 1(a), a 2D scene is decomposed into 18 leaf regions). We can use a dual graph G to represent all leaf regions, as follows: Each vertex in G corresponds to exactly one leaf region, and two vertices are connected by an edge in G if and only if the two corresponding regions share any common boundary portion. With this graph model, the ray traversal problem for a ray r is to find a path (i.e., a sequence of vertices) in G determined by the trajectory of r (i.e., the sequence of regions intersected by r in the order of the corresponding vertices on the path).
To obtain the path in G for r, a key issue, which we call the next-region issue, is, at each vertex v (for a region Rv) of the path, to find the next vertex v' (for a region Rv') on the path, i.e., after region Rv, the ray r enters the next region Rv'. This issue can be resolved by using the geometric location information of r traversing Rv and the neighboring leaf regions of Rv. In an implementation, resolving this issue means mapping a point on a boundary face of Rv (from which r exits Rv) to a memory address (which stores the information of the neighboring region Rv' traversed next by r). How to resolve the next-region issue effectively is an essential task for any ray traversal solution.

Figure 1(b) shows the kd-tree representing the spatial decomposition in Figure 1(a). Since a kd-tree organizes the relationship among its regions in a tree structure, hierarchical search is the basic process for solving the next-region issue used by any kd-tree searching ray traversal algorithm. For example, when ray1 leaves region F in Figure 1(a), the next region D (and the memory address storing the information of region D) should be found based on the exit point of ray1 from a boundary face of region F. However, a hierarchical search accesses a number of internal nodes in the kd-tree, which hurts ray traversal performance since it incurs considerable memory transactions and divergent branches.

Although considered the most efficient kd-tree searching ray traversal algorithms on GPU, PD-SS and kd-rope inherit the hierarchical search mechanism from the kd-tree structure to find neighboring regions. In the worst case, both algorithms still visit O(H) nodes to find a neighboring region, where H is the height of the kd-tree. Further, accessing each internal node takes multiple memory transactions (usually 2-4 for different kd-tree implementations), which is a considerable overhead for ray traversal on GPU.
Moreover, the numbers of accessed nodes for different rays using the same algorithm can be quite different (e.g., in PD-SS, ray1 accesses 28 nodes and ray2 accesses 20 nodes), which means that the rays can have different workloads (the numbers of tree nodes accessed). In parallel execution on GPU, since many rays are often simultaneously processed by processor cores on an SIMD architecture, the workload variance of the rays can cause execution divergence and hence deteriorate the ray traversal performance.

To eliminate the hierarchical search overhead for 3D ray traversal on GPU, we develop a new data structure for resolving the next-region issue based on table lookup schemes. In designing a GPU table lookup scheme for the next-region issue, we need to consider two key factors: (1) it should find the next neighboring region quickly (in particular, with only a few memory transactions); (2) it should not take too much GPU memory (large memory usage can limit its application to scenes with large numbers of objects). We present a series of schemes to address these two factors for various neighboring region settings.

3 The Shell Data Structure

We propose a new data structure, called Shell, for directly finding the next neighboring regions without any tree search. By avoiding hierarchical search, Shell reduces not only memory transactions but also execution divergence. Given a kd-tree decomposed scene, based on our dual graph model, Shell focuses on the leaf regions stored at all leaf nodes of the kd-tree. For each leaf region R, we put a geometric structure, called arrangements (our “tables”), on the boundary of R (intuitively, the “Shell” of region R) to capture the information of all neighboring leaf regions of R. When a ray r hits a boundary face F of R to exit from R, the arrangement for F allows r to quickly find the neighboring region of R to traverse next.
The arrangements are a key structure of Shell, which essentially provide table lookup schemes for mapping the geometric information of a ray r traversing a region R and the neighboring regions of R to a GPU memory address storing the information of the next neighboring region of R traversed by r. For each face F of every leaf region R, an arrangement A(F) partitions F into a set of 2D areas, called cells, such that each cell touches only one neighboring region of R on F. All cells of A(F) are organized by a specific data structure such that given any point p on F (e.g., the hit point of any ray), the cell of A(F) containing p can be found quickly. As a table lookup scheme, the cells of A(F) form a data table, and each cell gives (say) a pointer to its corresponding neighboring region of R. Ideally, the following features are desired from good arrangements: (1) an arbitrary ray r can quickly find the cell C containing its hit point on a face of R (say, with only a few memory transactions), and (2) the memory requirement for storing all cells is not too high. There is a trade-off between these two features. Depending on different settings of neighboring regions, we propose a set of partition and compression schemes for building arrangements to achieve good performance with respect to ray traversal speed and memory usage of Shell. Since the GPU memory architecture prefers a simple memory layout for efficient addressing, we choose arrays as the main data structure on GPU for storing arrangements based on all our partition schemes.

Figure 2(b) illustrates the Shell structure based on the spatial decomposition in Figure 2(a). Here, to illustrate the idea of a partition scheme for arrangements of neighboring regions, we use a simple uniform grid as an example to partition all region boundaries in the scene. Each leaf region is represented by a Shell-Region (SR) (e.g., see a Shell-Region in Figure 2(c)).
Every Shell-Region uses an arrangement to partition each of its boundary faces into a set of cells (e.g., the shaded small boxes around the boundary of the Shell-Region in Figure 2(c)). Each cell is called a Shell-Unit (SU), which covers an area on its boundary face and contains a pointer to a neighboring Shell-Region touching that cell on the face. In Figure 2(c), the highlighted Shell-Unit contains a pointer to Shell-Region D. The memory layout of Shell, shown in Figure 2(d), consists of an array of all Shell-Regions and an array of all Shell-Units stored in the GPU global memory. The Shell-Units of the same boundary face of a Shell-Region are all allocated as a group in consecutive memory space, and the memory address of the first Shell-Unit in the group is stored in the corresponding Shell-Region structure. Thus, any Shell-Unit can be addressed by computing its memory offset from the first Shell-Unit in the same group. When a ray r leaves a region through a face F, the neighboring region entered next by r is found by accessing the Shell-Unit in which r crosses F.

3.1 The Structure of an Individual Shell-Region

A Shell-Region contains information of a 3D leaf region and the arrangements of its six 2D boundary faces. To ensure that a Shell-Region can be quickly loaded from memory, its memory requirement should ideally be no bigger than the size of a single memory transaction in the hardware architecture (e.g., on the NVIDIA GTX570 GPU, an off-chip memory transaction can load 128 consecutive bytes into the on-chip L1 cache). Essentially, the leaf region information of a Shell-Region R consists of its geometric location and size, as well as a pointer to the list of its associated geometric objects.
As an axis-aligned box in 3D, the geometric location and size of R can be represented by the two vertices on a diagonal of the box: V0 (the vertex with the smallest coordinate in the x, y, and z directions of R) and V1 (the vertex with the largest coordinate in the x, y, and z directions of R). An arrangement of a region face consists of a geometric partition scheme subdividing the face into a set of cells (i.e., Shell-Units), and a data structure for mapping the geometric locations of these cells to the memory addresses of the corresponding Shell-Units in the GPU global memory. The memory address of the first Shell-Unit for each face is stored in its Shell-Region, which is used as the base for addressing other Shell-Units of the face.

The partition scheme determines the time and memory usage of a ray traversal algorithm using Shell. There is a trade-off between these two factors. For example, storing more Shell-Units may make the cell locating process easier and quicker, and hence speed up ray traversal. To achieve a good balance between traversal time and memory usage for different neighboring region settings, we propose a series of partition schemes, including (from simple to sophisticated) uniform grid, multi-level uniform grids, and non-uniform grid, as well as a grid compression scheme. Each scheme is built on top of the simpler ones and reduces memory usage by using slightly more computational operations for locating cells (but not more memory transactions), to handle a more complicated neighboring region setting, as presented below.

3.1.1 Uniform Grid Schemes

A uniform grid partitions each region face into a matrix of cells (pixels) of the same size. This arrangement easily maps the geometric locations of the face to the memory addresses of the pixels. Each pixel is a cell representing a Shell-Unit, which contains a pointer to the neighboring region touching this pixel.
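The uniform grid lookup described above can be sketched in a few lines. This is a minimal illustration in Python rather than the authors' GPU code: the `Face` record, its field names, the 2D (u, v) face coordinates, the row-major pixel order, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Face:
    # Hypothetical per-face record: face extent in its own 2D (u, v) frame,
    # grid resolution, and index of the face's first Shell-Unit in the flat array.
    u0: float
    v0: float
    u1: float
    v1: float
    cols: int
    rows: int
    base: int

def shell_unit_index(face, u, v):
    """Map a hit point (u, v) on the face to the index of its Shell-Unit."""
    # Scale the hit point into pixel coordinates.
    i = int((u - face.u0) / (face.u1 - face.u0) * face.cols)
    j = int((v - face.v0) / (face.v1 - face.v0) * face.rows)
    # Clamp so that points on the far edges fall into the last pixel.
    i = min(max(i, 0), face.cols - 1)
    j = min(max(j, 0), face.rows - 1)
    # Row-major offset from the first Shell-Unit of this face.
    return face.base + j * face.cols + i

# Flat Shell-Unit array; each entry is the ID of a neighboring Shell-Region.
# The first 100 entries stand in for the Shell-Units of other faces.
shell_units = [0] * 100 + [7, 7, 8, 8,
                           7, 7, 8, 8,
                           9, 9, 9, 9,
                           9, 9, 9, 9]
face = Face(u0=0.0, v0=0.0, u1=4.0, v1=4.0, cols=4, rows=4, base=100)
next_region = shell_units[shell_unit_index(face, 2.5, 0.5)]
print(next_region)
```

The lookup is pure arithmetic followed by one array read, which mirrors the base-plus-offset addressing of Shell-Units stated earlier in this section.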
Using a uniform grid for all region boundaries is the simplest approach to build a Shell structure. Figure 2(b) gives an example of this on a 2D scene with 8 regions. Such a Shell structure can be viewed as a combination of the kd-tree and uniform grid approaches. The uniform grid scheme provides a fast accessing mechanism for the Shell-Units, but it can use a large amount of memory. Therefore, it works well when each region’s face has a simple distribution of neighboring regions (say, for scenes containing well-shaped objects with a relatively regular distribution). However, for a complicated scene, the region faces often have a large number of neighbors with irregular distributions. Hence the uniform grid may have to choose the finest resolution of the neighbor distributions on all region faces as the pixel size, and the number of pixels in the grid (and the Shell memory space) can be quite large.

To address this issue, instead of using one uniform grid partition for all region faces, we can use multiple uniform grids with different levels of resolution to partition the region faces depending on different neighboring region settings. This is called the multi-level uniform grid scheme, which gives some flexibility to the region faces with simple neighbor distributions so that they can use coarse grid resolutions and thus store fewer pixels (Shell-Units). For the region faces with complicated neighbor distributions, we still must use uniform pixels of sufficiently fine resolutions. The resolution of each grid used in the multi-level uniform grid scheme needs to be stored for each face of every Shell-Region, and is used to calculate the pixel hit by a ray on the face and the memory address of the corresponding Shell-Unit. Compared with a uniform grid, the multi-level grid scheme can reduce the number of Shell-Units on the region faces and hence save memory space, with a very small time overhead.
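The memory effect of the multi-level scheme can be made concrete with a small count of Shell-Units. The sketch below is illustrative only: the face sizes and the "finest neighbor feature size" per face are made-up numbers, and it assumes that a face's pixel size can simply be set to that feature size.

```python
import math

# Hypothetical faces: (width, height, finest neighbor feature size on that face).
faces = [
    (8.0, 8.0, 8.0),   # a single neighbor covers the whole face
    (8.0, 8.0, 4.0),   # four equally sized neighbors
    (8.0, 8.0, 0.5),   # many small neighbors
]

def pixels(width, height, pixel_size):
    """Number of Shell-Units when a face is cut into square pixels."""
    return math.ceil(width / pixel_size) * math.ceil(height / pixel_size)

# Single uniform grid: every face must use the finest pixel size in the scene.
finest = min(size for _, _, size in faces)
single_level = sum(pixels(w, h, finest) for w, h, _ in faces)

# Multi-level uniform grids: each face uses its own resolution.
multi_level = sum(pixels(w, h, size) for w, h, size in faces)

print(single_level, multi_level)
```

In this toy setting the single-resolution grid stores 768 Shell-Units while the per-face resolutions store 261, at the cost of keeping one resolution value per face.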
3.1.2 Non-uniform Grid Scheme

Even with the multi-level uniform grid scheme, each region face is still partitioned into a matrix of pixels of the same size. For a boundary face touching multiple neighboring regions, the grid resolution is still restricted to what is required to distinguish the smallest ones, which can still lead to high memory usage. Observe that multiple pixels of a uniform grid on a boundary face often touch the same neighboring region and hence store multiple copies of the pointer to the same corresponding Shell-Region. To reduce such duplication, we propose a non-uniform grid scheme which merges the uniform-size pixels on each face that share the same neighboring region into a larger cell to save memory space.

A non-uniform grid is built on the partition of a uniform grid. Figure 3(a) shows a boundary face (for a 3D scene) with 16 neighboring regions (marked by heavy lines) and originally partitioned into a uniform grid. For a boundary face F with a uniform grid, we use the set of vertical or horizontal lines along the projected boundaries of the neighboring regions on F (e.g., the solid bold red boxes in Figure 3(a)) to partition F. Note that only a subset of the lines for the uniform grid (e.g., the solid green lines in Figure 3(a)) is actually aligned with such projected region boundaries on F, while the other lines for the uniform grid are needed due to the resolution restriction of the uniform grid. A non-uniform grid uses the lines aligned with the projected neighboring region boundaries to form a new grid partition, whose cells (the boxes bounded by solid green lines in Figure 3(a)) may cover multiple pixels of the uniform grid. We need a decoding mechanism to map the indices from the uniform grid to the non-uniform grid, so that any cell of the non-uniform grid (hit by a ray) can be located using the pixel indices in the uniform grid.
To support this mechanism, the partition lines in each axis of the uniform grid are indexed by a bit sequence, where each bit is set to 1 if the corresponding line is used in the non-uniform grid partition and 0 otherwise. Every bit sequence is then stored in integer format, called a coordinate integer, in the Shell-Region (e.g., Coord.x in Figure 3(a)). During ray traversal, suppose a ray r exits the Shell-Region through a cell C(i, j) in the uniform grid for a face F. Then the following is done: (1) the two coordinate integers Coord.x and Coord.y for F are obtained; (2) using Coord.x (resp., Coord.y), find the number of 1’s in the bit sequence of Coord.x (resp., Coord.y) from its left end up to position i (resp., j), denoted by i' (resp., j'). Then cell C'(i', j') is the one in the non-uniform grid of F hit by the ray r. To avoid using any branch operations (which may cause execution divergence), we design a short procedure to accomplish Step (2) (see Figure 3(b)). Note that given the two coordinate integers, the decoding process uses no memory transaction to map the pixel indices in the uniform grid to cell indices in the non-uniform grid.

3.1.3 Grid Compression Scheme

Even with a non-uniform grid scheme, the Shell structure still tends to use more memory than a kd-tree structure. A grid is commonly represented as a matrix, each element of which stores a value. On a face F with K neighboring regions partitioned into a grid of size m × n (a uniform or non-uniform grid), its matrix representation stores m × n Shell-Units that point to the K neighboring Shell-Regions of F. To further reduce the memory, we employ a grid compression scheme similar to the sparse matrix compression method [4], which improves the memory usage for F from O(m × n + K) to O(m + n + K) = O(m + n). Note that for any boundary face F, the projected shape of each neighboring region on F is an axis-parallel rectangle. Consider a non-uniform grid Gn on F.
The cells in each row or column of Gn contain pointers to a set of neighboring regions touching F. Observe that for any row Ri and any column Cj of Gn, the neighboring region R' pointed to by the cell C(i, j) in Gn appears in both the set of neighboring regions pointed to by the cells of Ri and the set of neighboring regions pointed to by the cells of Cj; further, R' is the only neighboring region appearing in both of these sets (this follows from the rectangular shapes of the projected neighboring regions on F). Based on this observation, we use the following grid compression scheme. (1) Store pointers to all neighboring Shell-Regions of F in an indexed array AF. (2) For each row (or column) of Gn, build a bit sequence as follows: the sequence has K bits, one for each neighboring Shell-Region of F, in their order as stored in the array AF; each bit is set to 1 if and only if the corresponding neighboring Shell-Region appears in that row (or column) of Gn. Every such bit sequence is stored as an integer, called a neighbor integer. Thus, we have an array of pointers to the set of K neighboring Shell-Regions of F, and two sequences of neighbor integers (one for the rows and one for the columns of Gn). Since there are K neighboring Shell-Regions touching F and Gn has m rows and n columns on F, the memory usage of the above grid compression scheme for F is clearly O(m + n + K). (Here, we assume that the number K of neighboring regions touching F is not too big, which is usually the case in practice.)
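The two bit-sequence encodings introduced so far, the coordinate integers of Section 3.1.2 and the neighbor integers above, can be exercised together on a small face. The following Python sketch is an illustration under stated assumptions, not the procedure of Figure 3(b): the 4 × 4 face and its three neighbors are invented, bits are indexed from the least significant end (the text counts from the left end), and `popcount`, `decode`, and the other helper names are hypothetical.

```python
def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")

# Hypothetical face: a 4 x 4 uniform grid whose neighbors are axis-parallel
# rectangles in pixel units, (name, x_lo, x_hi, y_lo, y_hi), upper bounds exclusive.
SIZE = 4
neighbors = [("A", 0, 2, 0, 2), ("B", 2, 4, 0, 2), ("C", 0, 4, 2, 4)]

def coordinate_integer(lo_index, hi_index):
    """Assumed convention: bit k is 1 iff the uniform-grid line separating
    pixel k-1 from pixel k lies on a projected neighbor boundary."""
    value = 0
    for rect in neighbors:
        for edge in (rect[lo_index], rect[hi_index]):
            if 0 < edge < SIZE:          # skip the outer border of the face
                value |= 1 << edge
    return value

def decode(coord, i):
    """Branch-free map from uniform pixel index i to non-uniform cell index:
    count the kept lines at positions 1..i."""
    return popcount(coord & ((1 << (i + 1)) - 1))

def neighbor_integers(coord, lo_index, hi_index):
    """One K-bit neighbor integer per non-uniform column (or row)."""
    result = [0] * (decode(coord, SIZE - 1) + 1)
    for bit, rect in enumerate(neighbors):
        for pixel in range(rect[lo_index], rect[hi_index]):
            result[decode(coord, pixel)] |= 1 << bit
    return result

coord_x, coord_y = coordinate_integer(1, 2), coordinate_integer(3, 4)
col_ints = neighbor_integers(coord_x, 1, 2)   # one integer per column
row_ints = neighbor_integers(coord_y, 3, 4)   # one integer per row

def next_region(i, j):
    """Neighbor hit at uniform pixel (i, j), found with one AND."""
    hit = col_ints[decode(coord_x, i)] & row_ints[decode(coord_y, j)]
    assert hit != 0 and (hit & (hit - 1)) == 0   # exactly one common neighbor
    return neighbors[hit.bit_length() - 1][0]

print(next_region(3, 0), next_region(1, 3))
```

The internal assertion checks the observation stated above: the row and column neighbor integers of any cell share exactly one set bit. On a GPU the bit count and the position of the set bit would typically come from hardware bit instructions rather than string operations; that substitution is not shown here.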
For a ray r that exits through a point on F, we use the following “decoding” process to find the next Shell-Region traversed by r: (1) compute the pixel indices in the uniform grid for the hit point of r on F; (2) find the cell indices (for a row Ri and a column Cj) in the non-uniform grid Gn; (3) take the neighbor integers for Ri and Cj, and perform a logical AND operation on them to identify the unique neighboring Shell-Region pointed to by the cells in both Ri and Cj. Note that both the uniform grid and the non-uniform grid of F involved in this decoding process are used only conceptually: they help organize the computation but are not structures that we maintain explicitly in Shell. Figure 4 illustrates how the grid compression scheme is applied based on the non-uniform grid in Figure 3.

3.2 GPU Memory Layout and Memory Bound of Shell

Since the GPU memory architecture favors simple memory layouts that allow easy and efficient addressing, we use arrays to store the Shell data structure. The Shell memory layout consists of an array of Shell-Regions and an array of Shell-Units, addressed by their indices (e.g., see Figure 2(c)). A Shell-Region representation contains sufficient information to address any of its Shell-Units. A leaf region in 3D has six 2D boundary faces that neighbor other leaf regions. In a Shell-Region representation, each face stores a pointer to (the first of) its corresponding group of Shell-Units and stores the structure produced by the partition and table lookup schemes for that face.
A partition scheme consists of the structure type (uniform grid or non-uniform grid, with or without grid compression), dimensions, resolutions (if using a uniform grid), coordinate integers (if using a non-uniform grid), and pointers to three arrays (for the neighbor integers of the rows and columns of a non-uniform grid and for the pointers to the neighboring Shell-Regions in the array AF, if using a grid compression scheme). The memory size of a Shell-Region representation is constant (no more than 64 bytes) and is designed to be no bigger than the length of a memory transaction in the GPU hardware (e.g., 128 bytes). Thus, a single off-chip memory transaction can load a Shell-Region representation into on-chip memory (e.g., cache) during ray traversal.

All Shell-Units on a boundary face of a Shell-Region are grouped and allocated at consecutive memory locations in the Shell-Unit array, in a predetermined sequence specified by the partition scheme for that face. For each Shell-Unit, its memory offset from the address of the first Shell-Unit in the group can be calculated from its geometric location on the face. Since the base address of each group of Shell-Units is stored in the corresponding Shell-Region, any Shell-Unit can be addressed easily. With a grid partition scheme, the Shell-Units on a face are represented as a matrix and stored as a 1D array in the GPU memory. With the grid compression scheme, the Shell-Units are the entries of the array of neighboring Shell-Region IDs (whose pointer is stored in its Shell-Region). Each Shell-Unit contains only a pointer to the neighboring Shell-Region that it touches, and hence can always be loaded by one memory transaction.

Suppose that for a 3D scene with N objects (for specific applications) represented by Shell, we need to store M Shell-Regions and U Shell-Units besides the N objects, where M is the number of leaf regions in the spatially decomposed scene and U is the total number of Shell-Units.
Then clearly Shell uses O(M + N + U) memory. Note that in the non-uniform grid compression scheme, the number of cells on each boundary face is equal to the number of its neighboring regions, and the total size of the three arrays (for neighbor integers of the rows and columns of a non-uniform grid and for pointers to the neighboring Shell-Regions) is proportional to the total size of these neighboring regions. Thus, U is proportional to the number of neighboring leaf region pairs in the spatial decomposition (i.e., the number of edges in the dual graph G).

4 Ray Traversal and Construction of Shell

In this section, we show how to use the Shell data structure effectively in ray traversal. As we pointed out in Section 2, ray traversal essentially maps the geometric location of a point on a boundary face of a region R (at which a ray r exits R) to a memory address containing information on the neighboring region R′ of R traversed next by r. Hence, the main task of Shell based ray traversal is to efficiently “decode” the information stored in the Shell structure so as to obtain the needed memory address.

4.1 Locating the Next Traversing Region

In the Shell data structure, the Shell-Unit corresponding to the cell containing the hit point of a ray r on a boundary face F of a region R stores a pointer to (i.e., the memory address of) the neighboring region of R touching that Shell-Unit; we call this Shell-Unit a target Shell-Unit and denote it by SUt. Thus, locating the next region entered by r can be accomplished in three steps: (I) find the hit point p of r on F; (II) determine the memory location of the target Shell-Unit SUt containing p; (III) access SUt to obtain the address of the Shell-Region for the next traversing region of r. Step (I) is taken by all ray traversal algorithms regardless of the data structure used, and Step (III) is trivial. Below, we elaborate on how Step (II) is done.
The Shell-Units of a boundary face F, which we call a Shell-Unit group, are stored consecutively in memory as part of an array. Hence, the memory location of SUt can be computed from the memory address of the Shell-Unit group for face F (denoted by Mbase) and the offset of SUt from the first element in the Shell-Unit group. Since Mbase has already been read into on-chip memory upon r entering the current region, the main hurdle is to determine the offset. Computing the offset is equivalent to decoding the Shell structure associated with the arrangement on F. The actual decoding method depends on the specific partition and compression schemes used for F. Our decoding solutions for two cases (the uncompressed uniform grid scheme and the compressed non-uniform grid scheme) are given in Algorithms 1 and 2 in Appendix 2. Decoding methods for other cases can be easily derived from these two algorithms.

4.2 Ray Traversal Based on Shell

The ray traversal algorithm using the Shell data structure is shown in Algorithm 3, Appendix 2. In the ideal case, locating the next traversing region using the Shell data structure takes only two memory transactions: one for accessing a Shell-Region (Line 10 of Algorithm 3) and one for a Shell-Unit (Line 9 of Algorithm 1, or Line 5 of Procedure 2 (see Figure 4(b)) used in Line 9 of Algorithm 2). However, Shell decoding for the more sophisticated schemes may need more than two memory requests. For example, the compressed non-uniform grid scheme incurs three additional memory requests: one access to the grid coordinates (Line 7 of Algorithm 2) and two accesses for decoding the target Shell-Unit (Lines 1–2 of Procedure 2 (see Figure 4(b)) used in Line 9 of Algorithm 2).
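The uniform-grid lookup of Section 4.1 and the traversal loop of Section 4.2 can be sketched on the CPU as follows. This is a minimal illustration rather than the GPU implementation evaluated in this paper: the record layouts, the OUTSIDE sentinel, the slab-style exit computation, the face ordering, and the three-region scene are assumptions introduced for the example. Each loop iteration performs the two reads of the ideal case, one from the Shell-Region array and one from the Shell-Unit array.

```python
from dataclasses import dataclass
from typing import List, Tuple

OUTSIDE = -1               # hypothetical sentinel: the ray leaves the scene
SU_ARRAY: List[int] = []   # Shell-Unit array; each entry is a Shell-Region index

@dataclass
class Face:
    base: int   # Mbase: index of this face's first Shell-Unit in SU_ARRAY
    rows: int   # grid resolution along the first in-plane axis
    cols: int   # grid resolution along the second in-plane axis (row size)

@dataclass
class ShellRegion:
    lo: Tuple[float, float, float]
    hi: Tuple[float, float, float]
    faces: List[Face]   # assumed order: -x, +x, -y, +y, -z, +z

def make_region(lo, hi, face_units) -> ShellRegion:
    """face_units: six (rows, cols, units) triples, units in row-major order."""
    faces = []
    for rows, cols, units in face_units:
        assert len(units) == rows * cols
        faces.append(Face(len(SU_ARRAY), rows, cols))
        SU_ARRAY.extend(units)   # units of one face occupy consecutive slots
    return ShellRegion(lo, hi, faces)

def next_region(sr: ShellRegion, origin, direction) -> int:
    # Step (I): exit point of the ray on the region's box (slab method).
    t_exit, face_id = float("inf"), -1
    for a in range(3):
        if direction[a] != 0.0:
            plane = sr.hi[a] if direction[a] > 0 else sr.lo[a]
            t = (plane - origin[a]) / direction[a]
            if t < t_exit:
                t_exit, face_id = t, 2 * a + int(direction[a] > 0)
    p = [origin[a] + t_exit * direction[a] for a in range(3)]
    # Step (II): pixel (i, j) on the face grid, then Maddr = Mbase + i * cols + j.
    face = sr.faces[face_id]
    u, v = [a for a in range(3) if a != face_id // 2]
    i = min(int((p[u] - sr.lo[u]) / (sr.hi[u] - sr.lo[u]) * face.rows), face.rows - 1)
    j = min(int((p[v] - sr.lo[v]) / (sr.hi[v] - sr.lo[v]) * face.cols), face.cols - 1)
    # Step (III): the Shell-Unit stores the index of the next Shell-Region.
    return SU_ARRAY[face.base + i * face.cols + j]

def traverse(regions: List[ShellRegion], start: int, origin, direction) -> List[int]:
    """Sequence of Shell-Region indices visited by the ray, starting in `start`."""
    path, sr_id = [], start
    while sr_id != OUTSIDE:
        path.append(sr_id)
        sr_id = next_region(regions[sr_id], origin, direction)
    return path

# Hypothetical scene: region 0 is the left half; regions 1 and 2 split the right half.
OUT1 = (1, 1, [OUTSIDE])
REGIONS = [
    make_region((0, 0, 0), (2, 4, 1), [OUT1, (2, 1, [1, 2]), OUT1, OUT1, OUT1, OUT1]),
    make_region((2, 0, 0), (4, 2, 1), [(1, 1, [0]), OUT1, OUT1, (1, 1, [2]), OUT1, OUT1]),
    make_region((2, 2, 0), (4, 4, 1), [(1, 1, [0]), OUT1, (1, 1, [1]), OUT1, OUT1, OUT1]),
]
```

For instance, `traverse(REGIONS, 0, (0.5, 0.5, 0.5), (1.0, 0.5, 0.0))` walks from region 0 through the lower cell of its +x face without any tree search; a face that uses a compressed non-uniform grid would replace only the offset computation in Step (II).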
By properly aligning the memory layout of the Shell-Region structure and the coordinate integers for the non-uniform grid or the neighbor integers for a compressed grid, we can serve the five sequentially issued memory accesses with only two memory transactions by using the cache. (Note that one memory transaction loads a group of data into the cache, which can satisfy multiple memory accesses if the accesses are aligned properly and the data stays in the cache [23].)

Although we are able to serve the memory accesses with two memory transactions, the limited cache size in a many-core processor can induce conflict misses, which cause part of the cached data (brought in by a memory transaction) to be replaced before it is used. This is commonly referred to as cache thrashing. For example, when the Shell-Region information is accessed, the memory transaction brings in not only the Shell-Region information but also the grid information. However, because the two reads are sequential, the grid information may be evicted by data accesses from other threads before it is read. This problem is exacerbated when a large number of threads are active simultaneously and competing for cache access. To overcome this challenge, we judiciously reduce the number of simultaneously executing threads for targeted sections of the code by controlling the number of threads to be launched. (A careful balance is needed so that this reduction does not degrade the overall performance; details are omitted due to the page limit.) The combined effort above guarantees that only two memory transactions are used for locating the next traversing region.

For comparison, consider a geometric scene decomposed into M regions and a ray traversing through K leaf regions. A kd-tree searching traversal algorithm accesses O(K × log M) tree nodes; each access takes 2–4 memory transactions depending on the kd-tree implementation.
With the Shell ray traversal algorithm, the ray accesses K Shell-Regions; each access needs only two memory transactions.

4.3 Construction of the Shell Data Structure

Before ray traversal, the Shell data structure needs to be constructed. Since the ray traversal process typically needs to be performed repeatedly for a large number of rays (e.g., in graphics ray tracing and radiation dose calculation applications), the construction is best performed as a preprocessing task, especially for static scenes. As is common practice, we accomplish this task on the CPU of a CPU+GPU platform. The construction starts by obtaining a spatial decomposition based on a cost model such as the Surface Area Heuristic (SAH) [21] and building the corresponding kd-tree. The Shell construction scheme processes each leaf region R as follows. We first identify all neighboring leaf regions touching each face of R. Every face is partitioned into a uniform grid, whose resolution is determined according to the number and distribution of its neighboring regions. Then, depending on the size of the scene and the memory capacity, the non-uniform grid partition and/or grid compression schemes are applied to each face if needed. Finally, the memory space for the Shell-Region and Shell-Units of R is allocated and initialized accordingly. The Shell data structure consists of the Shell-Regions and Shell-Units for all leaf regions of the decomposed scene. The Shell data structure is then downloaded to the GPU for ray traversal applications.

5 Evaluation

To evaluate the proposed Shell approach for ray traversal on GPU, we implemented our Shell data structure and the Shell based ray traversal algorithm. We also implemented two state-of-the-art kd-tree searching ray traversal algorithms: PD-SS and kd-rope. For kd-rope, we implemented its latest version, kd-rope++ [25], which is the fastest known kd-tree searching ray traversal approach.
Although the performance of PD-SS is slightly lower than that of kd-rope, it takes much less memory space. To ensure the quality of our PD-SS and kd-rope implementations, we tested them on graphics ray tracing applications and achieved rendering performance comparable to that published in related work (e.g., [11, 25]). The hardware platform used in our evaluation is an NVIDIA GTX570 graphics card (Fermi architecture, 480 cores, 1.6GHz core frequency, 1GB device memory).

Figure 5 shows the set of geometric scenes used in our experiments. These scenes are commonly found in studies of graphics rendering approaches. The properties of the kd-tree decomposition for these scenes are listed in Columns 2–4 of Table 1. Two different Shell data structures are built on the kd-tree decomposition of each scene. Shell-1 aims to minimize the memory usage by adopting the non-uniform grid partition and grid compression schemes whenever they can save memory space in building the arrangement for neighboring regions on each boundary face. Shell-2 aims for a good balance between memory usage and ray traversal performance: it uses those sophisticated schemes only if they reduce memory usage by a certain portion (e.g., more than 20%). All data presented for each scene are obtained by traversing a set of rays generated by graphics ray tracing to render an image of the scene at a resolution of 1024 × 1024.

Below we analyze and compare various metrics for the Shell based and the traditional kd-tree searching ray traversal algorithms. These metrics include the ray traversal speed, number of accessed nodes, number of divergent branches, and memory usage. Figure 6 compares the ray traversal performance using PD-SS, kd-rope, and our Shell based algorithms. The data in Figure 6 show that Shell achieves a speedup of 2.6X–5.1X over PD-SS and 2.2X–4.3X over kd-rope.
Furthermore, comparing the performance of Shell-1 and Shell-2, we see that a larger Shell data structure tends to lead to faster ray traversal. The Shell based ray traversal algorithm gains its performance advantage by removing many of the expensive memory accesses to internal nodes that the kd-tree searching ray traversal methods perform. As demonstrated in Figure 7, Shell based ray traversal accesses on average 4.2X and 3.5X fewer nodes than PD-SS and kd-rope, respectively. Although the kd-rope approach uses neighbor links to reduce accesses to internal nodes, it still needs to visit a number of internal nodes when a region boundary face has multiple neighboring regions.

Another factor contributing to the performance improvement of Shell based ray traversal is that the Shell data structure removes a significant amount of execution divergence. To illustrate this, Figures 8(a) and 8(b) summarize the number of branches and the number of divergent branches, respectively, for each algorithm. Shell based ray traversal incurs on average 85% fewer branches and 70% fewer divergent branches. However, Shell based ray traversal cannot completely eliminate execution divergence since different partition schemes are involved. The more sophisticated the schemes a Shell structure uses (e.g., compressed non-uniform grid), the more divergent branches it may have. This is shown by comparing the Shell-1 and Shell-2 results in Figure 8.

The memory usage of the Shell data structure depends strongly on the schemes chosen to generate the arrangements for neighboring regions on region boundaries. Our proposed partition schemes, together with the grid compression method, provide a trade-off framework for memory usage and ray traversal performance. In our experimental study, Shell-1 minimizes the memory requirement while Shell-2 aims to achieve a good balance between memory usage and ray traversal performance.
Table 1 summarizes the properties and memory usage of PD-SS, kd-rope, Shell-1, and Shell-2 for each of the geometric scenes in Figure 5. A kd-tree can be used directly by PD-SS. For kd-rope, the kd-tree needs to be augmented by storing neighbor links and a bounding box in each leaf node. As shown in the column “size(kd-rope)” in Table 1, the kd-tree supporting kd-rope uses about 3X more memory than that supporting PD-SS. The Shell-1 data structure in Table 1 applies a non-uniform grid and the grid compression scheme to as many region boundaries as possible. Its memory usage is about 40% higher than PD-SS but 60% smaller than kd-rope. The Shell-2 data structure stores more Shell-Units and uses the more sophisticated schemes only if they reduce the memory usage for a region boundary by more than 20%. For example, on a region face F, suppose a uniform grid partitions it into x cells and the non-uniform grid scheme can reduce the number of cells to y. Then Shell-2 adopts the non-uniform grid on F only if (y ÷ x) < 0.8. Although Shell-2 uses on average 30% more memory than Shell-1, its ray traversal performance is 20%–45% better than Shell-1. Furthermore, Shell-2 not only uses 50%–60% less memory than kd-rope, but also achieves a 2.2X–4.3X speedup over kd-rope (see Figure 6).

References

[1] T. Aila and S. Laine. Understanding the efficiency of ray traversal on GPUs. In Proceedings of the 1st ACM Conference on High Performance Graphics, pages 145–149, 2009.
[2] E. Bethel and M. Howison. Multi-core and many-core shared-memory parallel raycasting volume rendering optimization and tuning. International Journal of High Performance Computing Applications, 26(4):399–412, 2012.
[3] G. Blelloch, P. Gibbons, and S. Vardhan. Combinable memory-block transactions. In Proceedings of the 20th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA’08, pages 23–34, 2008.
[4] A. Buluç, J. Fineman, M. Frigo, J. Gilbert, and C. Leiserson.
Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the 21st ACM Symposium on Parallelism in Algorithms and Architectures, SPAA’09, pages 233–244, 2009.
[5] Q. Chen, M. Chen, and W. Lu. Ultrafast convolution/superposition using tabulated and exponential kernel. Medical Physics, 38:1150–1161, 2011.
[6] P. Chuong, F. Ellen, and V. Ramachandran. A universal construction for wait-free transaction friendly data structures. In Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA’10, pages 335–344, 2010.
[7] C. Clauss, S. Lankes, P. Reble, and T. Bemmerl. Evaluation and improvements of programming models for the Intel SCC many-core processor. In 2011 International Conference on High Performance Computing and Simulation, pages 525–532, 2011.
[8] T. Foley and J. Sugerman. Kd-tree acceleration structures for a GPU raytracer. In Proceedings of Graphics Hardware, pages 15–22, 2005.
[9] W. Fung, I. Sham, G. Yuan, and T. Aamodt. Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware. ACM Trans. Archit. Code Optim., 6(2):1–37, 2009.
[10] S. Guntury and P. J. Narayanan. Raytracing dynamic scenes on the GPU using grids. IEEE Transactions on Visualization and Computer Graphics, 18(1):5–16, 2012.
[11] M. Hapala and V. Havran. Review: Kd-tree traversal algorithms for ray tracing. Computer Graphics Forum, 30(1):199–213, 2011.
[12] V. Havran. Heuristic Ray Shooting Algorithms. PhD thesis, Czech Technical University, Nov. 2000.
[13] V. Havran, J. Bittner, and J. Zara. Ray tracing with rope trees. In Proceedings of Spring Conference on Computer Graphics, pages 130–139, 1998.
[14] D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA’10, pages 355–364, 2010.
[15] D. R. Horn, J. Sugerman, M. Houston, and P.
Hanrahan. Interactive k-d tree GPU ray tracing. In Proceedings of the Symposium on Interactive 3D Games and Graphics, pages 167–174, 2007.
[16] D. M. Hughes and I. S. Lim. Kd-jump: A path-preserving stackless traversal for faster isosurface raytracing on GPUs. IEEE Transactions on Visualization and Computer Graphics, 15(6):1555–1562, 2009.
[17] J. Kalojanov, M. Billeter, and P. Slusallek. Two-level grids for ray tracing on GPUs. Computer Graphics Forum, 30(2):307–314, 2011.
[18] D. Kopta, J. Spjut, E. Brunvand, and A. Davis. Efficient MIMD architectures for high-performance ray tracing. In 2010 IEEE International Conference on Computer Design, pages 9–16, 2010.
[19] Y. Krishnakumar, T. Prasad, K. Kumar, P. Raju, and B. Kiranmai. Realization of a parallel operating SIMD-MIMD architecture for image processing application. In 2011 International Conference on Computer, Communication and Electrical Technology, pages 98–102, 2011.
[20] M. Kulkarni, P. Carribault, K. Pingali, G. Ramanarayanan, B. Walter, K. Bala, and L. Chew. Scheduling strategies for optimistic parallel execution of irregular programs. In Proceedings of the 20th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA’08, pages 217–228, 2008.
[21] J. D. MacDonald and K. S. Booth. Heuristic for ray tracing using space subdivision. Visual Computer, 6:153–165, 1990.
[22] T. Mattson, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl, and S. Dighe. The 48-core SCC processor: The programmer’s view. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–11, 2010.
[23] NVIDIA Corporation. NVIDIA CUDA C programming guide version 5.0. URL: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html, 2013.
[24] S. Popov, J. Gunther, H. P. Seidel, and P. Slusallek. Stackless kd-tree traversal for high performance GPU ray tracing. Computer Graphics Forum, 26(3):415–424, 2007.
(Proc. Eurographics 2007).
[25] A. Santos, J. M. Teixeira, T. Farias, V. Teichrieb, and J. Kelner. Understanding the efficiency of kd-tree ray traversal techniques over a GPGPU architecture. International Journal of Parallel Programming, 40(3):331–352, 2012.
[26] F. Song and J. Dongarra. A scalable framework for heterogeneous GPU-based clusters. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA’12, 2012.
[27] Stanford Computer Graphics Laboratory. The Stanford 3D scanning repository. URL: http://graphics.stanford.edu/data/3Dscanrep/, 2012.
[28] J. Tsakok, W. Bishop, and A. Kennings. Kd-tree traversal techniques. In Proceedings of the IEEE Symposium on Interactive Ray Tracing, pages 190–194, 2008.
[29] I. Wald, W. R. Mark, J. Gunther, S. Boulos, T. Ize, W. Hunt, S. G. Parker, and P. Shirley. State of the art in ray tracing animated scenes. Computer Graphics Forum, 28(6):1691–1722, 2009.
[30] K. Xiao, B. Zhou, D. Z. Chen, and X. S. Hu. Efficient implementation of the 3D-DDA ray traversal algorithm on GPU and its application in radiation dose calculation. Medical Physics, 39:7619–7626, 2012.
[31] B. Zhou, X. S. Hu, and D. Z. Chen. Memory-efficient volume ray tracing on GPU for radiotherapy. In IEEE 9th Symposium on Application Specific Processors, pages 46–51, 2011.
[32] M. Zlatuska and V. Havran. Ray tracing on a GPU with CUDA – Comparative study of three algorithms. In Proceedings of Computer Graphics, Visualization and Computer Vision, pages 69–75, 2010.

A Appendix 1: Figures

Figure 1: (a) A 2D scene with 18 spatially decomposed regions and two traversing rays, ray1 and ray2. (b) The kd-tree representation of the scene in (a) with the leaf nodes of the tree corresponding to the decomposed regions in (a). (c) Sequences of regions traversed by ray1 and ray2, and the tree nodes accessed by these rays using the PD-SS and kd-rope algorithms.

Figure 2: (a) A 2D scene with 8 spatially decomposed regions.
(b) A uniform-grid based Shell structure for the decomposed scene in (a). (c) The Shell-Region for region B in (a) and one of its Shell-Units adjacent to region D. (d) The memory layout of Shell, where each element in the Shell-Region array contains a pointer to the first Shell-Unit associated with that Shell-Region, and each element in the Shell-Unit array contains a pointer to its neighboring Shell-Region (indicated by the dashed arrows).

Figure 3: (a) A non-uniform grid for a boundary face F with 16 neighboring regions. F is partitioned by all the vertical and horizontal lines into a uniform grid. The solid bold red boxes are the projected boundaries of the neighboring regions on F. The solid green lines are the lines used by the non-uniform grid. The set of 21 × 18 Shell-Units using the uniform grid on F is reduced to 8 × 7 Shell-Units using the non-uniform grid. (b) The decoding procedure for converting the cell indices from a uniform grid to a non-uniform grid. By applying this procedure, Cell(11, 12) (containing point P in (a)) in the uniform grid is converted to Cell′(4, 5) in the non-uniform grid.

Figure 4: (a) A compressed non-uniform grid for the boundary face F in Figure 3(a). The grid compression scheme stores 7 + 8 + 16 = 31 integers for F (i.e., the neighbor integers for 7 rows and 8 columns, and an array of pointers to the 16 neighboring regions). (b) The decoding procedure for obtaining the pointer to the neighboring Shell-Region touching a given cell. By applying this procedure, the pointer to the neighboring Shell-Region of Cell(4, 5) (containing point P in (a)) is found at AF[8].

Figure 5: Geometric scenes used in the experiments of this paper. From left to right: Bunny (69K objects), Dragon (871K objects), Buddha (1.08M objects), from the Stanford 3D Scanning Repository [27], and Balls5 (66K objects) used in [12].

Figure 6: The execution time of the PD-SS, kd-rope, and Shell based ray traversal on the scenes in Figure 5.
The detailed information for the data structures (including kd-tree, Shell-1, and Shell-2) is given in Table 1.

Figure 7: The total numbers of nodes (including internal and leaf nodes) accessed by all rays during the execution of PD-SS, kd-rope, and Shell based ray traversal on the scenes in Figure 5. The same set of rays is used by all these algorithms on a scene. Note that Shell-1 and Shell-2 are based on the same kd-tree decomposition, and hence the Shell based algorithms access the same number of nodes (leaf regions) using these two Shell data structures.

Figure 8: The numbers of (a) branches and (b) divergent branches during the execution of PD-SS, kd-rope, and Shell based ray traversal on the scenes in Figure 5. The Shell based algorithms incur smaller numbers of branches and divergent branches, and Shell-2 has even smaller such numbers than Shell-1 because Shell-2 uses less sophisticated schemes.

B Appendix 2: Algorithms

Algorithm 1 Locating the next traversing region using a uniform grid
 1: Input: ray R, boundary face F, Shell-Region SR.
 2: Output: target Shell-Unit SUt.
 3: compute the point P where R intersects F and exits from SR;
 4: obtain the uniform grid G for F;
 5: Mbase = the address of the Shell-Unit group for F;
 6: compute the location index (i, j) of the pixel in G containing P; /* The pixel at (i, j) represents the target Shell-Unit. */
 7: Moff = i × (row size of G) + j;
 8: Maddr = Mbase + Moff; /* Compute the memory address Maddr for the target Shell-Unit. */
 9: SUt = SU_array[Maddr]; /* Load and return the target Shell-Unit, which is the address of the next Shell-Region SRN. */
10: return SUt;

Algorithm 2 Locating the next traversing region using a compressed non-uniform grid
 1: Input: ray R, boundary face F, Shell-Region SR.
 2: Output: target Shell-Unit SUt.
 3: compute the point P where R intersects F and exits from SR;
 4: obtain the uniform grid G for F;
 5: compute the location index (i, j) of the pixel in G containing P;
 6: obtain the non-uniform grid Gn for F;
 7: obtain Coord.x and Coord.y for F;
 8: compute index (i′, j′) by calling Procedure 1 in Figure 3(b) using (i, j); /* Decode the non-uniform grid. */
 9: compute pointer ID by calling Procedure 2 in Figure 4(b) using (i′, j′); /* Decode the compressed grid. */
10: return ID; /* ID points to the next Shell-Region to traverse, which serves as the target Shell-Unit. */

Algorithm 3 Ray traversal algorithm based on Shell
 1: Input: ray R, Shell data structure with an SR (Shell-Region) array and an SU (Shell-Unit) array.
 2: Output: traversal of R.
 3: initialization, to obtain the first Shell-Region SR0 for traversing R;
 4: SR = SR0;
 5: more_region_to_traverse = true;
 6: while more_region_to_traverse do
 7:     compute the boundary face F of SR on which R exits SR;
 8:     obtain the partition and compression scheme S used on F;
 9:     use the corresponding algorithm for S to compute the target SUt for entering the next Shell-Region SRN; /* E.g., if F uses a compressed non-uniform grid, then SUt = Algorithm 2(R, F, SR). */
10:     SR = SR_array[SUt]; /* The next Shell-Region SRN is obtained and is set as SR for the next iteration. */
11:     if the traversal of R is finished then
12:         more_region_to_traverse = false; /* If the traversal of R is not finished, then continue to the next iteration. */
13:     end if
14: end while
15: end

C Appendix 3: Tables

          kd-tree    kd-tree  kd-tree                                  Shell-1  Shell-1  Shell-2  Shell-2
Scenes    leaves(a)  INs(b)   objects(c)  size(PD-SS)  size(kd-rope)  SUs(d)   size     SUs(d)   size
Bunny     154K       401K     5.5         9.2MB        30.1MB         2.8M     13.4MB   3.9M     17MB
Dragon    920K       3.5M     2.8         55.8MB       167MB          13M      75MB     19M      98MB
Buddha    1.3M       4.8M     2.4         78.1MB       241MB          20M      112MB    24M      141MB
Balls5    57K        189K     2.3         3.6MB        11.9MB         0.96M    4.7MB    1.2M     6.1MB

(a) The total number of leaf nodes (i.e., leaf regions).
(b) The total number of internal tree nodes.
(c) The average number of objects contained in each leaf node.
(d) The total number of Shell-Units.

Table 1: The properties and memory usage of the kd-tree and Shell data structures built for the scenes in Figure 5. Columns 2, 3, and 4 illustrate the properties of the kd-tree. Column 5 shows the memory usage of the kd-tree, which is used by the PD-SS algorithm. Column 6 shows the memory space used by kd-rope, which stores more information (e.g., the bounding box information and neighbor links) in the leaf nodes of the kd-tree. For the Shell-1 and Shell-2 data structures, Columns 7 and 9 illustrate the total numbers of Shell-Units that they contain, and Columns 8 and 10 show their total memory requirement, respectively.
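As a reading aid for Table 1, the following sketch recomputes relative memory footprints from the sizes listed in the table, so that they can be set against the approximate percentages quoted in Section 5. The dictionary layout and helper names are assumptions for illustration; the sizes are copied from Table 1 as printed (rounded to the precision shown there), so the resulting ratios are only as accurate as those rounded values.

```python
# Memory sizes in MB, copied from Table 1.
SIZES_MB = {
    "Bunny":  {"PD-SS": 9.2,  "kd-rope": 30.1,  "Shell-1": 13.4,  "Shell-2": 17.0},
    "Dragon": {"PD-SS": 55.8, "kd-rope": 167.0, "Shell-1": 75.0,  "Shell-2": 98.0},
    "Buddha": {"PD-SS": 78.1, "kd-rope": 241.0, "Shell-1": 112.0, "Shell-2": 141.0},
    "Balls5": {"PD-SS": 3.6,  "kd-rope": 11.9,  "Shell-1": 4.7,   "Shell-2": 6.1},
}

def ratio(scene: str, a: str, b: str) -> float:
    """Size of structure `a` divided by size of structure `b` for one scene."""
    return SIZES_MB[scene][a] / SIZES_MB[scene][b]

def mean_ratio(a: str, b: str) -> float:
    """Unweighted average of the per-scene ratios over the four scenes."""
    return sum(ratio(s, a, b) for s in SIZES_MB) / len(SIZES_MB)

if __name__ == "__main__":
    for a, b in [("kd-rope", "PD-SS"), ("Shell-1", "PD-SS"),
                 ("Shell-1", "kd-rope"), ("Shell-2", "Shell-1"),
                 ("Shell-2", "kd-rope")]:
        per_scene = ", ".join(f"{s}: {ratio(s, a, b):.2f}" for s in SIZES_MB)
        print(f"{a} / {b}: mean {mean_ratio(a, b):.2f} ({per_scene})")
```

A ratio above 1 means the first structure is larger; for example, a ratio of 1.4 corresponds to 40% more memory, and a ratio of 0.4 to 60% less.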