Shell: A Spatial Decomposition Data Structure for 3D Curve
Traversal on Many-core Architectures
(Regular Submission)
Kai Xiao
University of Notre Dame
[email protected]
Danny Z. Chen∗
University of Notre Dame
[email protected]
X. Sharon Hu
University of Notre Dame
[email protected]
Bo Zhou
Altera Corp
[email protected]
Abstract
Shared memory many-core processors such as GPUs have been extensively used in accelerating
computation-intensive algorithms and applications. When porting existing algorithms from sequential or
other parallel architecture models to shared memory many-core architectures, non-trivial modifications
are often needed in order to match the execution patterns of the target algorithms with the characteristics
of many-core architectures. 3D curve traversal is a fundamental process in many applications, and is
commonly accelerated by spatial decomposition schemes captured in hierarchical data structures (e.g.,
kd-trees). However, curve traversal using hierarchical data structures needs to conduct repeated hierarchical searches. Such a search process is time-consuming on shared memory many-core architectures since
it incurs considerable amounts of expensive memory accesses and execution divergence. In this paper,
we propose a novel spatial decomposition based data structure, called Shell, which completely avoids
hierarchical search for 3D curve traversal. In Shell, a structure is built on the boundary of each region in
the decomposed space, which allows any curve traversing in a region to find the next neighboring region
to traverse using table lookup schemes, without any hierarchical search. While our 3D curve traversal
approach works for other spatial decomposition paradigms and many-core processors, we illustrate it
using kd-tree decomposition on GPU and compare with the fastest known kd-tree searching algorithms
for ray traversal. Analysis and experimental results show that our approach improves ray traversal
performance considerably over the kd-tree searching approaches.
Keywords: Many-core architecture, GPU, data structure, spatial decomposition, 3D curve traversal.
∗
The research of D.Z. Chen was supported in part by NSF under Grants CCF-0916606 and CCF-1217906.
1 Introduction
3D geometric scenes are involved in many applications, in which large numbers of curve traversal
(e.g., ray traversal) operations are frequently conducted. Spatial decomposition based data structures have
been developed on shared memory many-core architecture (e.g., general purpose graphics processing units
(GPGPUs)) for accelerating curve traversal solutions. In this paper, we present a new efficient spatial
decomposition based data structure for 3D curve traversal that better exploits the characteristics of shared
memory many-core architectures and avoids hierarchical searches that are commonly performed in other
known spatial decomposition data structures.
Shared memory many-core processors present great opportunities to speed up computation-intensive
applications by parallelization [19, 20, 22, 26]. Recent advances in GPGPUs, such as those from NVIDIA,
AMD, and Intel, leverage massively parallel architectures based on single-instruction, multiple-data (SIMD)
processor cores to achieve high performance [9]. In this paper, we use an NVIDIA GPU with the Fermi
architecture as the model for illustration and experiments, but our solutions can be applied to other types of
shared memory many-core processors, such as those with an MIMD architecture (e.g., Intel SCC [7]).
Due to the specific characteristics of GPUs, sequential or parallel algorithms straightforwardly ported to
GPUs often suffer from a number of performance bottlenecks such as inefficient memory access and execution
divergence, thus utilizing only a fraction of the GPU computation power [1]. A shared memory many-core
processor, especially a GPU, commonly contains hundreds of cores (e.g., the NVIDIA GTX570 contains 480
cores [23]). All these cores are connected to one storage component such as shared cache or main memory.
During execution, the memory bandwidth shared by multiple cores is often insufficient to support a large
number of simultaneous memory requests. Hence, memory access efficiency on GPUs is much more crucial
for performance than on traditional pipelined processors such as CPUs. For example, in the NVIDIA Fermi
GPU, the latency of an off-chip memory transaction is 400-600 times longer than the fastest computational
instructions. Since computational operations can be executed simultaneously by multiple cores but memory
transactions need to be processed sequentially by the memory controller shared by these cores, the memory
transaction latency is usually even worse. Besides memory access efficiency, execution divergence is another
major performance bottleneck. The SIMD architecture used in GPU attains the most benefit when a group of
cores (32-48 in NVIDIA Fermi) follows the same instruction flow. But when the execution paths running on
different cores diverge due to branches (e.g., conditional statements), their parallel execution can no longer
be sustained and instead serial execution takes place. Such divergent execution paths severely deteriorate
performance [30]. Achieving good performance on SIMD systems requires solid understanding of both the
architecture and execution patterns of the target algorithms [18].
Curve traversal is to trace the trajectory of a curve through a geometric scene. A curve can be a line
segment, a ray, or a general algebraic curve. Ray traversal is a special case of curve traversal and a
fundamental process in many applications, such as graphics ray tracing [2, 29] and radiation dose calculation
[5, 31]. In this paper, we use ray traversal as an example to illustrate the problem and our solution for 3D curve
traversal on many-core architectures. Our approach can be easily extended to traverse other types of curves.
Since applications using ray traversal often involve large numbers of rays and repeatedly conduct the
traversal process for such rays (e.g., to generate high-resolution images in graphics ray tracing), the execution speed of ray traversal is critical for such applications. A number of data structures have been developed
to partition and organize geometric objects in 3D scenes to improve the efficiency of ray traversal. One
important class of such data structures is based on spatial decomposition (e.g., grids [10, 17], octrees and
kd-trees [28]). A spatial decomposition scheme partitions a geometric scene into a set of regions, each
containing a (small) number of objects. The subdivided regions are normally organized by a hierarchical
structure which represents the geometric relationship among those regions. The kd-tree is a commonly used
hierarchical structure due to its ability to match the subdivided regions with the distribution of geometric
objects in the scene. It is also considered to be the data structure that provides the fastest known ray traversal
speed in static scenes because of its efficient hierarchical search mechanism [32].
Quite a few algorithms have been developed for 3D ray traversal using kd-tree structures, in which
repeated hierarchical searches in the tree are needed to find the neighboring regions for traversal [11, 25].
For kd-tree based ray traversal on a traditional CPU architecture, a stack is normally used to store the tree
nodes along a search path. On GPU, due to the lack of stack support and limited capacity of on-chip memory,
a stack based approach typically allocates stacks to off-chip memory (e.g., global memory). The long access
latency of the off-chip memory can easily become a performance bottleneck for stack-based ray traversal
on GPUs [16]. Several stack-less kd-tree ray traversal algorithms on GPU have been proposed to address
this challenge. For example, Foley et al. [8] and Horn et al. [15] proposed a restart traversal scheme on
kd-trees which starts the tree search from the root, and uses a push-down method to move the “root” down
to the minimum subtree to be searched. Horn et al. [15] also developed a short-stack approach which builds
a small circular stack in the on-chip memory of GPU, and combines it with the push-down method (PD-SS).
Popov et al. [24] proposed a kd-rope algorithm [13] using the concept of neighbor links, where each leaf
node in the kd-tree contains not only the region R it represents but also a set of pointers, each pointing to
a minimum subtree that contains all neighboring regions touching each of the boundaries (or faces) of R.
Santos et al. [25] improved the kd-rope implementation, making it the fastest known kd-tree searching ray
traversal approach on GPU (called kd-rope++).
Although these algorithms considerably improved the kd-tree searching 3D ray traversal performance on
GPU, all of them still rely on hierarchical search to find the next traversal region when a ray crosses a region
boundary (i.e., a ray exits from a region). During ray traversal, each search step reads a tree
node, and a search for the next neighboring traversal region can visit O(H) nodes in a kd-tree, where H is the
height of the tree. Note that each reading of a visited tree node may take multiple memory transactions to
obtain the entire information package for the node because each GPU memory transaction has a limited size
(e.g., 128 bytes). Also, the threads for different rays may follow different sequences of visited tree nodes
which can result in execution divergence. Hence, the performance of kd-tree searching ray traversal on GPU
can suffer considerably from these issues.
We propose a new spatial decomposition based data structure, called Shell, which completely eliminates
hierarchical search, hence leading to an efficient solution for 3D curve traversal on GPU. The purpose of tree search is to
find the next neighboring region when a curve (say, a ray) crosses a region boundary, by using the geometric
information of the curve and regions. To avoid tree search, Shell provides a neighboring region locating
mechanism based on table lookup techniques to replace hierarchical search and find the next region for any
traversing ray, allowing the ray to directly access the next region’s information.
Generally speaking, given the set of decomposed 3D regions in a hierarchical structure (say, a kd-tree),
Shell focuses on the neighboring relationship among the regions for all leaf tree nodes, called leaf regions.
For each leaf region R, the information of its neighboring leaf regions is captured in Shell by a geometric
structure called arrangements, with one arrangement per boundary face of R. An arrangement is a partition
of a face F of R into a set of 2D regions called cells, such that each cell C covers an area of F touching a
neighboring leaf region R' of R and contains information of R'. When a ray exits R by crossing F in C, it
acquires the neighboring region information of R' by accessing C. Of course, schemes for quickly finding
C (which is hit by the ray) are needed to avoid tree searches and other performance bottlenecks incurred by
hierarchical searching data structures (for this, we apply various table lookup schemes).
There are two key factors in designing the Shell table lookup schemes: the ray traversal speed and
memory usage. The ray traversal speed essentially depends on how efficiently a cell can be located when a
ray crosses a region face through that cell; the memory usage is related to the total number of cells on the
faces of all leaf regions in Shell (i.e., the sizes of arrangements or lookup tables for all leaf regions). Both
factors are affected by the partition schemes for generating the arrangements. We seek to balance the ray
traversal performance with good memory usage in Shell, and present a set of partition schemes, including
(from simple to sophisticated) uniform grids, multi-level uniform grids, and compressed non-uniform grids,
to deal with different neighboring region settings. Actually, each such scheme can be viewed as an extension
of the simpler ones for obtaining a good trade-off between ray traversal performance and memory usage
for a more complicated neighboring region setting. We also exploit several memory accessing techniques
[3, 6, 14] to further reduce memory transactions used in Shell based ray traversal.
Given a geometric scene with N objects, suppose a kd-tree decomposition partitions it into M leaf
regions with a tree height of O(log M) using O(M + N) memory. A ray traversal in such a kd-tree takes
O(log M) search steps (i.e., the number of visited tree nodes or memory transactions involved) each time
a ray crosses a region boundary. In comparison, using Shell, only one processing step (or O(1) memory
transactions) without any tree search is needed to find the next region. The Shell memory usage is
O(M + N + U), where U is the total number of cells in the arrangements for all leaf regions. Using judicious
partition and compression schemes, the O(M + N + U) memory bound of Shell can be made in practice
comparable to the O(M + N) memory bound of the kd-tree. Since many applications use massive numbers
of traversing rays each of which can cross many regions in the scenes, reducing the number of memory
transactions from O(log M) to O(1) per region crossing for each ray on GPU can be significant.
To demonstrate the effectiveness of our approach, we implemented our Shell data structure based on a
kd-tree decomposition with all our arrangement schemes. Our experiments were conducted on ray traversals
in several benchmark geometric scenes [27] used in graphics rendering. The experimental results show that
our Shell approach outperforms the fastest known kd-tree searching ray traversal approaches on GPU by
over 2X to 5X. Although ray traversals, kd-tree decomposition, and GPU architecture are used to illustrate
our method, our Shell data structure and ray traversal algorithms can be easily extended to traversing 3D
curves of other types, using different spatial decomposition schemes, and on other many-core architectures.
2 Ray Traversal Using Kd-tree Decomposition
In this section, we give a general discussion of ray traversal based on spatial decomposition (say, a
kd-tree). We also briefly illustrate the PD-SS [15] and kd-rope [13] ray traversal algorithms on GPU.
Given a spatially decomposed scene, suppose we consider the regions corresponding to all leaf tree
nodes, called leaf regions (e.g., in Figure 1(a), a 2D scene is decomposed into 18 leaf regions). We can use
a dual graph G to represent all leaf regions, as follows: Each vertex in G corresponds to exactly one leaf
region, and two vertices are connected by an edge in G if and only if the two corresponding regions share
any common boundary portion. With this graph model, the ray traversal problem for a ray r is to find a path
(i.e., a sequence of vertices) in G determined by the trajectory of r (i.e., the sequence of regions intersected
by r in the order of the corresponding vertices on the path). To obtain the path in G for r, a key issue, which
we call the next-region issue, is, at each vertex v (for a region R_v) of the path, to find the next vertex v' (for a
region R_{v'}) on the path, i.e., after region R_v, the ray r enters the next region R_{v'}. This issue can be resolved
by using the geometric location information of r traversing in R_v and the neighboring leaf regions of R_v.
In implementation, resolving this issue means mapping a point on a boundary face of R_v (from which r exits
R_v) to a memory address (which stores the information of the neighboring region R_{v'} traversed next by r).
How to resolve the next-region issue effectively is an essential task for any ray traversal solution.
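
The dual graph model can be sketched in a few lines. The following is a minimal 2D illustration, assuming that leaf regions are axis-aligned boxes given as (xmin, ymin, xmax, ymax) tuples and that "sharing a common boundary portion" means touching along a segment of positive length; the names share_boundary and dual_graph and the four sample regions are hypothetical and do not come from the paper.

```python
from itertools import combinations

def share_boundary(a, b):
    """True if 2D boxes a and b touch along a boundary segment of positive length."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Overlap length of the two boxes' projections on each axis.
    x_overlap = min(ax1, bx1) - max(ax0, bx0)
    y_overlap = min(ay1, by1) - max(ay0, by0)
    touch_vertical = (ax1 == bx0 or bx1 == ax0) and y_overlap > 0
    touch_horizontal = (ay1 == by0 or by1 == ay0) and x_overlap > 0
    return touch_vertical or touch_horizontal

def dual_graph(regions):
    """Build the adjacency sets of the dual graph G over named leaf regions."""
    graph = {name: set() for name in regions}
    for (name_a, box_a), (name_b, box_b) in combinations(regions.items(), 2):
        if share_boundary(box_a, box_b):
            graph[name_a].add(name_b)
            graph[name_b].add(name_a)
    return graph

# Hypothetical decomposition of a 6 x 4 scene into four leaf regions.
regions = {
    "A": (0, 0, 2, 4),
    "B": (2, 0, 4, 2),
    "C": (2, 2, 4, 4),
    "D": (4, 0, 6, 4),
}
graph = dual_graph(regions)
```

In this sketch a ray that starts in A and ends in D must pass through B or C, since A and D are not adjacent in the graph; deciding which one at each exit point is exactly the next-region issue.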
Figure 1(b) shows the kd-tree representing the spatial decomposition in Figure 1(a). Since a kd-tree
organizes the relationship among its regions in a tree structure, hierarchical search is the basic process for
solving the next-region issue used by any kd-tree searching ray traversal algorithm. For example, when
ray1 leaves region F in Figure 1(a), the next region D (and the memory address storing the information of
region D) should be found based on the exit point of ray1 from a boundary face of region F .
However, a hierarchical search accesses a number of internal nodes in the kd-tree, which impacts ray
traversal performance since it incurs considerable memory transactions and divergent branches. Although
considered as the most efficient kd-tree searching ray traversal algorithms on GPU, PD-SS and kd-rope
inherit the hierarchical search mechanism from the kd-tree structure to find neighboring regions. In the
worst case, both algorithms still visit O(H) nodes to find a neighboring region, where H is the height
of the kd-tree. Further, accessing each internal node takes multiple memory transactions (usually 2-4 for
different kd-tree implementations), which is a considerable overhead for ray traversal on GPU. Moreover,
the numbers of accessed nodes between different rays using the same algorithm can be quite different (e.g.,
in PD-SS, ray1 accesses 28 nodes and ray2 accesses 20 nodes), which means that the rays can have different
workloads (the numbers of tree nodes accessed). In parallel execution on GPU, since many rays are often
simultaneously processed by processor cores on an SIMD architecture, the workload variance of the rays
can cause execution divergence and hence deteriorate the ray traversal performance.
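
The workload variance described above can be seen in a toy hierarchical search. The sketch below is only an illustration of a plain root-to-leaf descent, not an implementation of PD-SS or kd-rope; the dictionary-based tree, the function locate_leaf, and the sample points are assumptions made for this example.

```python
def locate_leaf(node, point):
    """Descend from node to the leaf containing point.

    Returns (leaf_name, nodes_visited); the count includes the leaf itself.
    """
    visited = 1
    while "leaf" not in node:
        # Pick the child on the side of the splitting plane that holds the point.
        child = "left" if point[node["axis"]] < node["split"] else "right"
        node = node[child]
        visited += 1
    return node["leaf"], visited

# Hypothetical unbalanced 2D kd-tree with four leaf regions A to D.
tree = {
    "axis": 0, "split": 2.0,
    "left": {"leaf": "A"},
    "right": {
        "axis": 1, "split": 2.0,
        "left": {"leaf": "B"},
        "right": {
            "axis": 0, "split": 3.0,
            "left": {"leaf": "C"},
            "right": {"leaf": "D"},
        },
    },
}

shallow = locate_leaf(tree, (1.0, 1.0))
deep = locate_leaf(tree, (3.5, 3.0))
```

Here one query visits 2 nodes and the other visits 4. If the two queries belonged to threads executing in lockstep, the thread with the shorter search would sit idle while the other finishes its descent.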
To eliminate the hierarchical search overhead for 3D ray traversal on GPU, we develop a new data
structure for resolving the next-region issue based on table lookup schemes. In designing a GPU table
lookup scheme for the next-region issue, we need to consider two key factors: (1) it should find the
next neighboring region quickly (in particular, with only a few memory transactions); (2) it should not take too
much GPU memory (large memory usage can limit its application to scenes with large numbers of objects).
We present a series of schemes to address these two factors for various neighboring region settings.
3 The Shell Data Structure
We propose a new data structure, called Shell, for directly finding the next neighboring regions without
any tree search. By avoiding hierarchical search, Shell reduces not only memory transactions but also
execution divergence. Given a kd-tree decomposed scene, based on our dual graph model, Shell focuses on
the leaf regions stored at all leaf nodes of the kd-tree. For each leaf region R, we put a geometric structure,
called arrangements (our “tables”), on the boundary of R (intuitively, the “Shell” of region R) to capture the
information of all neighboring leaf regions of R. When a ray r hits a boundary face F of R to exit from R,
the arrangement for F allows r to quickly find the neighboring region of R to traverse next.
The arrangements are a key structure of Shell, which essentially provide table lookup schemes for
mapping the geometric information of a ray r traversing in a region R and the neighboring regions of R
to a GPU memory address storing the information of the next neighboring region of R traversed by r. For
each face F of every leaf region R, an arrangement A(F ) partitions F into a set of 2D areas, called cells,
such that each cell touches only one neighboring region of R on F . All cells of A(F ) are organized by
a specific data structure such that given any point p on F (e.g., the hit point of any ray), the cell of A(F )
containing p can be found quickly. As a table lookup scheme, the cells of A(F ) form a data table, and each
cell gives (say) a pointer to its corresponding neighboring region of R. Ideally, the following features are
desired from good arrangements: (1) An arbitrary ray r can quickly find the cell C containing its hit point
on a face of R (say, with only a few memory transactions), and (2) the memory requirement for storing all
cells is not too high. There is a trade-off between these two features. Depending on different settings of
neighboring regions, we propose a set of partition and compression schemes for building arrangements to
achieve good performance with respect to ray traversal speed and memory usage of Shell. Since the GPU
memory architecture prefers simple memory layout for efficient addressing, we choose arrays as the main
data structure in GPU for storing arrangements based on all our partition schemes.
Figure 2(b) illustrates the Shell structure based on the spatial decomposition in Figure 2(a). Here, to
illustrate the idea of a partition scheme for arrangements of neighboring regions, we use a simple uniform
grid as an example to partition all region boundaries in the scene. Each leaf region is represented by a Shell-Region (SR) (e.g., see a Shell-Region in Figure 2(c)). Every Shell-Region uses an arrangement to partition
each of its boundary faces into a set of cells (e.g., the shaded small boxes around the boundary of the
Shell-Region in Figure 2(c)). Each cell is called a Shell-Unit (SU), which covers an area on its boundary
face and contains a pointer to a neighboring Shell-Region touching that cell on the face. In Figure 2(c),
the highlighted Shell-Unit contains a pointer to Shell-Region D. The memory layout of Shell, shown in
Figure 2(d), consists of an array of all Shell-Regions and an array of all Shell-Units stored in the GPU
global memory. The Shell-Units of the same boundary face of a Shell-Region are all allocated as a group
in consecutive memory space, and the memory address of the first Shell-Unit in the group is stored in the
corresponding Shell-Region structure. Thus, any Shell-Unit can be addressed by computing its memory
offset from the first Shell-Unit in the same group. When a ray r leaves a region through a face F , the
neighboring region entered next by r is found by accessing the Shell-Unit in which r crosses F .
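
The two-array layout and base-plus-offset addressing described above can be sketched as follows. This is a minimal illustration in which Python lists stand in for arrays in GPU global memory; the names shell_units, shell_regions, face_base, and next_region, as well as the sample contents, are hypothetical.

```python
# Flat array of Shell-Units. Each entry holds the index of a neighboring Shell-Region.
# The units of one face are stored contiguously as a group.
shell_units = [
    1, 1, 2, 2,   # group for face 0 of region 0 (four units)
    0, 0,         # group for face 0 of region 1 (two units)
]

# Array of Shell-Regions. Each region records, per face, the index of the first
# Shell-Unit of that face's group (only one face per region is filled in here).
shell_regions = [
    {"name": "A", "face_base": {0: 0}},
    {"name": "B", "face_base": {0: 4}},
    {"name": "C", "face_base": {}},
]

def next_region(region_idx, face, unit_offset):
    """Return the index of the region entered through the given unit of a face."""
    base = shell_regions[region_idx]["face_base"][face]
    return shell_units[base + unit_offset]

entered = next_region(0, 0, 2)            # leave region A through unit 2 of face 0
entered_name = shell_regions[entered]["name"]
```

Finding the next region takes one read of the Shell-Region (to get the base) and one read of the Shell-Unit, with no tree nodes in between.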
3.1 The Structure of an Individual Shell-Region
A Shell-Region contains information of a 3D leaf region and the arrangements of its six 2D boundary
faces. To ensure that a Shell-Region can be quickly loaded from memory, its memory requirement should
ideally be no bigger than the size of a single memory transaction in the hardware architecture (e.g., on the
NVIDIA GTX570 GPU, an off-chip memory transaction can load 128 consecutive bytes into the on-chip L1
cache). Essentially, the leaf region information of a Shell-Region R consists of its geometric location and
size, as well as a pointer to the list of its associated geometric objects. As an axis-aligned box in 3D, the
geometric location and size of R can be represented by the two vertices on a diagonal of the box: V0 (the
vertex with the smallest coordinate in the x, y, and z directions of R) and V1 (the vertex with the largest
coordinate in the x, y, and z directions of R).
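
To make the size constraint concrete, the sketch below packs one possible Shell-Region record and compares it with a 128-byte transaction. The field layout (six 32-bit floats for V0 and V1, one 32-bit object-list index, and six 32-bit per-face base indices) is an assumption made for illustration and is not the layout used in the paper; a full record would also carry the per-face partition data.

```python
import struct

# Hypothetical little-endian layout: V0 (3 floats), V1 (3 floats),
# object-list index (uint32), six per-face Shell-Unit base indices (uint32 each).
SHELL_REGION_FORMAT = "<6fI6I"
TRANSACTION_BYTES = 128  # transaction size quoted in the text for the GTX570

def pack_shell_region(v0, v1, object_list, face_bases):
    """Serialize one Shell-Region record into a bytes object."""
    return struct.pack(SHELL_REGION_FORMAT, *v0, *v1, object_list, *face_bases)

record = pack_shell_region(
    v0=(0.0, 0.0, 0.0),
    v1=(1.0, 2.0, 3.0),
    object_list=7,
    face_bases=(0, 4, 8, 12, 16, 20),
)
record_size = len(record)
fits_in_one_transaction = record_size <= TRANSACTION_BYTES
```

Under these assumptions the basic fields occupy 52 bytes, which leaves the remainder of a 128-byte transaction for whatever per-face partition data the chosen arrangement scheme requires.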
An arrangement of a region face consists of a geometric partition scheme subdividing the face into a
set of cells (i.e., Shell-Units), and a data structure for mapping the geometric locations of these cells to
the memory addresses of the corresponding Shell-Units in the GPU global memory. The memory address
of the first Shell-Unit for each face is stored in its Shell-Region, which is used as the base for addressing
other Shell-Units of the face. The partition scheme determines the time and memory usage of a ray traversal
algorithm using Shell. There is a trade-off between these two factors. For example, storing more Shell-Units
may make the cell locating process easier and quicker, and hence speed up ray traversal. To achieve a good
balance between traversal time and memory usage for different neighboring region settings, we propose a
series of partition schemes, including (from simple to sophisticated) uniform grid, multi-level uniform grids,
and non-uniform grid, as well as a grid compression scheme. Each scheme is built on top of the simpler
ones and reduces memory usage by using slightly more computational operations for locating cells (but not
more memory transactions), to handle a more complicated neighboring region setting, as presented below.
3.1.1 Uniform Grid Schemes
A uniform grid partitions each region face into a matrix of cells (pixels) of the same size. This arrangement easily maps the geometric locations of the face to the memory addresses of the pixels. Each pixel is a
cell representing a Shell-Unit, which contains a pointer to the neighboring region touching this pixel.
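
The mapping from a hit point to a pixel can be sketched as below. This is a minimal illustration assuming a row-major pixel order and a face described by its 2D extent in its own plane; the function pixel_index, the sample face, and the neighbor names are hypothetical.

```python
def pixel_index(hit, face_min, face_max, cols, rows):
    """Map a 2D hit point on a face to a row-major pixel offset in a cols x rows grid."""
    u, v = hit
    u0, v0 = face_min
    u1, v1 = face_max
    # Scale into grid coordinates; clamp so a hit exactly on the far edge stays inside.
    i = min(int((u - u0) / (u1 - u0) * cols), cols - 1)
    j = min(int((v - v0) / (v1 - v0) * rows), rows - 1)
    return j * cols + i

# Hypothetical face of size 4 x 2 split into a 4 x 2 uniform grid.
# Each pixel stores the name of the neighboring region touching it.
face_units = [
    "B", "B", "C", "C",   # row 0
    "B", "B", "D", "D",   # row 1
]

offset = pixel_index((2.5, 1.5), (0.0, 0.0), (4.0, 2.0), cols=4, rows=2)
neighbor = face_units[offset]
```

The lookup costs a few arithmetic operations and one array read. Note that region B occupies four pixels here, which is the duplication that the later schemes aim to reduce.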
Using a uniform grid for all region boundaries is the simplest approach to build a Shell structure.
Figure 2(b) gives an example for this on a 2D scene with 8 regions. Actually, such a Shell structure can
be viewed as a combination of the kd-tree and uniform grid approaches.
The uniform grid scheme provides a fast accessing mechanism for the Shell-Units, but it can use a large
amount of memory. Therefore, it is good when each region’s face has a simple distribution of neighboring
regions (say, for scenes containing well-shaped objects with a relatively regular distribution). However, for
a complicated scene, the region faces often have a large number of neighbors with non-regular distributions.
Hence the uniform grid may have to choose the finest resolution of the neighbor distributions on all region
faces as the pixel size, and the number of pixels in the grid (and the Shell memory space) can be quite large.
To address this issue, instead of using one uniform grid partition for all region faces, we can use
multiple uniform grids with different levels of resolutions to partition the region faces depending on different
neighboring region settings. This is called the multi-level uniform grid scheme, which provides some
flexibility to the region faces with simple neighbor distributions so that they can use coarse grid resolutions
and thus store fewer pixels (Shell-Units). For the region faces with complicated neighbor distributions, we
still must use uniform pixels of sufficiently fine resolutions. The resolution of each grid used in the multi-level uniform grid scheme needs to be stored for each face of every Shell-Region, which is used to calculate
the pixel hit by a ray on the face and the memory address of the corresponding Shell-Unit.
Compared with a uniform grid, the multi-level grid scheme can reduce the number of Shell-Units on
the region faces and hence save memory space, with a very small time overhead.
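
A small sketch can show where the saving comes from. It assumes that each face record stores its own grid resolution and the base index of its Shell-Unit group; the field names size, res, and base, the function unit_address, and the two sample faces are hypothetical.

```python
# Two hypothetical faces of equal size: one with a single neighbor, one with a
# finer neighbor layout that needs a 4 x 4 grid.
faces = [
    {"size": (4.0, 4.0), "res": (1, 1), "base": 0},
    {"size": (4.0, 4.0), "res": (4, 4), "base": 1},
]

def unit_address(face, u, v):
    """Index of the Shell-Unit hit at (u, v), using the face's own resolution."""
    cols, rows = face["res"]
    width, height = face["size"]
    i = min(int(u / width * cols), cols - 1)
    j = min(int(v / height * rows), rows - 1)
    return face["base"] + j * cols + i

# Shell-Units needed with per-face resolutions versus one global 4 x 4 grid.
multi_level_units = sum(cols * rows for cols, rows in (f["res"] for f in faces))
single_grid_units = len(faces) * 4 * 4
```

With per-face resolutions the two faces need 17 Shell-Units instead of 32, and the only extra work per lookup is reading the stored resolution of the face.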
3.1.2 Non-uniform Grid Scheme
Even with the multi-level uniform grid scheme, each region face is still partitioned into a matrix of pixels
with the same size. For a boundary face touching multiple neighboring regions, the grid resolution is still
restricted to what is required to distinguish the smallest ones, which can still lead to high memory usage.
Observe that multiple pixels of a uniform grid on a boundary face often touch the same neighboring region
and hence store multiple copies of the pointer to the same corresponding Shell-Region. To reduce such
duplications, we propose a non-uniform grid scheme which aims to merge pixels of a uniform size on each
face sharing the same neighboring region into a larger cell to save memory space.
A non-uniform grid is built on the partition of a uniform grid. Figure 3(a) shows a boundary face
(for a 3D scene) with 16 neighboring regions (marked by heavy lines) and originally partitioned into a
uniform grid. For a boundary face F with a uniform grid, we use the set of vertical or horizontal lines along
the projected boundaries of the neighboring regions on F (e.g., the solid bold red boxes in Figure 3(a)) to
partition F . Note that only a subset of the lines for the uniform grid (e.g., the solid green lines in Figure 3(a))
is actually aligned with such projected region boundaries on F , while the other lines for the uniform grid are
needed due to the resolution restriction of the uniform grid. A non-uniform grid uses the lines aligning with
the projected neighboring region boundaries to form a new grid partition, whose cells (the boxes bounded
by solid green lines in Figure 3(a)) may cover multiple pixels of the uniform grid.
We need a decoding mechanism to map the indices from the uniform grid to the non-uniform grid, so
that any cell of the non-uniform grid (hit by a ray) can be located using the pixel indices in the uniform
grid. To support this mechanism, the partition lines in each axis of the uniform grid are indexed by a bit
sequence, where each bit is set as 1 if the corresponding line is used in the non-uniform grid partition and
0 otherwise. Every bit sequence is then stored in the integer format, called a coordinate integer, in the
Shell-Region (e.g., Coord.x in Figure 3(a)). During ray traversal, suppose a ray r exits the Shell-Region
through a cell C(i, j) in the uniform grid for a face F . Then the following is done: (1) The two coordinate
integers Coord.x and Coord.y for F are obtained; (2) using Coord.x (resp., Coord.y), find the number of
1’s in the bit sequence of Coord.x (resp., Coord.y) from its left end up to position i (resp., j), denoted by
i' (resp., j'). Then cell C'(i', j') is the one in the non-uniform grid of F hit by the ray r. To avoid using any
branch operations (which may cause execution divergence), we design a short procedure to accomplish Step
(2) (see Figure 3(b)). Note that given the two coordinate integers, the decoding process uses no memory
transaction to map the pixel indices in the uniform grid to cell indices in the non-uniform grid.
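
The decoding step can be sketched with a mask and a population count. This is not the procedure of Figure 3(b), which is not reproduced here; it is a minimal illustration under two assumptions: bit k of a coordinate integer is 1 when a non-uniform cell starts at uniform column k, and bit 0 (the least significant bit) corresponds to the left end of the face. The names make_coord and decode are hypothetical.

```python
def make_coord(starts):
    """Pack the uniform-grid indices at which a non-uniform cell starts into an integer."""
    coord = 0
    for k in starts:
        coord |= 1 << k
    return coord

def decode(coord, i):
    """Map uniform index i to a zero-based non-uniform index without branching.

    Counts the 1 bits in positions 0..i and subtracts one, so that the first
    non-uniform cell gets index 0.
    """
    mask = (1 << (i + 1)) - 1          # keeps bits 0..i
    return bin(coord & mask).count("1") - 1

# Eight uniform columns merged into four non-uniform columns that start
# at uniform columns 0, 2, 3, and 6.
coord_x = make_coord([0, 2, 3, 6])
decoded = [decode(coord_x, i) for i in range(8)]
```

The same function applied to Coord.y gives j'. On a GPU the bit count would typically map to a hardware population-count instruction, so the decoding costs a handful of arithmetic operations and no memory access once the coordinate integers are loaded.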
3.1.3 Grid Compression Scheme
Even with a non-uniform grid scheme, the Shell structure still tends to use more memory than a kd-tree
structure. A grid is commonly represented as a matrix, each element of which stores a value. On a face F with
K neighboring regions partitioned into a grid of size m × n (a uniform or non-uniform grid), its matrix
representation stores m × n Shell-Units that point to the K neighboring Shell-Regions of F . To further
reduce the memory usage, we employ a grid compression scheme similar to the sparse matrix compression method
[4], which improves the memory usage for F from O(m × n + K) to O(m + n + K) = O(m + n).
Note that for any boundary face F, the projected shape of each neighboring region on F is an axis-parallel rectangle. Consider a non-uniform grid Gn on F. The cells in each row or column of Gn contain
pointers to a set of neighboring regions touching F. Observe that for any row Ri and any column Cj of Gn,
the neighboring region R' pointed to by the cell C(i, j) in Gn appears in both the set of neighboring regions
pointed to by the cells of Ri and the set of neighboring regions pointed to by the cells of Cj; further, R' is
the only neighboring region appearing in both these two sets (this follows from the rectangular shapes of
the projected neighboring regions on F ). Based on this observation, we use the following grid compression
scheme. (1) Store pointers to all neighboring Shell-Regions of F in an indexed array AF . (2) For each row
(or column) of Gn, build a bit sequence as follows: the sequence has K bits, one for each neighboring Shell-Region of F, in their order as stored in the array AF; each bit is set as 1 if and only if the corresponding
neighboring Shell-Region appears in that row (or column) of Gn. Every such bit sequence is stored as an
integer, called a neighbor integer. Thus, we have an array of pointers to the set of K neighboring Shell-Regions of F, and two sequences of neighbor integers (one for the rows and one for the columns of Gn).
Since there are K neighboring Shell-Regions touching F and Gn has m rows and n columns on F , the
memory usage of the above grid compression scheme for F is clearly O(m + n + K). (Here, we assume
that the number K of neighboring regions touching F is not too big, which is usually the case in practice.)
For a ray r hitting a point on F to exit, we use the following “decoding” process to find the next Shell-Region traversed by r: (1) Compute the pixel indices in the uniform grid for the hit point of r on F ; (2) find
the cell indices (for a row Ri and a column Cj ) in the non-uniform grid Gn ; (3) take the neighbor integers
for Ri and Cj , and perform a bitwise AND operation on them to identify the unique neighboring Shell-Region
pointed to by the cells in both Ri and Cj . Note that both the uniform grid and non-uniform grid of F
involved in our decoding process above are only used conceptually, i.e., they are only concepts for helping
our computation but are not actual structures that we maintain explicitly in Shell. Figure 4 illustrates how
the grid compression scheme is applied based on the non-uniform grid in Figure 3.
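The compression and decoding steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation evaluated in this paper: the names compress_grid and decode are hypothetical, bit k of a neighbor integer is assumed to correspond to entry k of AF, and the non-uniform grid is passed explicitly as a matrix of region IDs although Shell itself never stores it.

```python
from typing import Dict, List, Tuple


def compress_grid(grid: List[List[int]]) -> Tuple[List[int], List[int], List[int]]:
    """Build (A_F, row neighbor integers, column neighbor integers).

    grid[i][j] is the ID of the neighboring region covering cell (i, j)
    of the non-uniform grid (hypothetical input format).
    """
    a_f: List[int] = []          # each neighboring region once, first-seen order
    position: Dict[int, int] = {}
    for row in grid:
        for region in row:
            if region not in position:
                position[region] = len(a_f)
                a_f.append(region)

    row_ints = [0] * len(grid)
    col_ints = [0] * len(grid[0])
    for i, row in enumerate(grid):
        for j, region in enumerate(row):
            bit = 1 << position[region]   # assumption: bit k <-> A_F[k]
            row_ints[i] |= bit
            col_ints[j] |= bit
    return a_f, row_ints, col_ints


def decode(a_f: List[int], row_ints: List[int], col_ints: List[int],
           i: int, j: int) -> int:
    """Return the region covering cell (i, j) via a bitwise AND."""
    common = row_ints[i] & col_ints[j]
    # For rectangular projections exactly one bit survives the AND.
    assert common != 0 and common & (common - 1) == 0
    return a_f[common.bit_length() - 1]


# A 2 x 3 non-uniform grid with K = 4 rectangular neighbors.
example = [[10, 10, 11],
           [12, 13, 11]]
A_F, ROWS, COLS = compress_grid(example)
```

Python integers are unbounded, whereas a fixed-width integer on the GPU limits K to the word size, which matches the assumption above that K is not too big.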
3.2 GPU Memory Layout and Memory Bound of Shell
Since the GPU memory architecture prefers simple memory layout schemes in order to implement easy
and efficient addressing, we use arrays to store the Shell data structure. The Shell memory layout consists
of an array of Shell-Regions and an array of Shell-Units, addressed by their indices (e.g., see Figure 2(c)).
A Shell-Region representation contains sufficient information to address any of its Shell-Units. A
leaf region in 3D has six 2D boundary faces neighboring with other leaf regions. In a Shell-Region
representation, each face stores a pointer to (the first of) its corresponding group of Shell-Units and stores the
structure produced by the partition and table lookup schemes for that face. A partition scheme consists of the
structure type (uniform grids, non-uniform grid, with or without grid compression), dimensions, resolutions
(if using uniform grids), coordinate integers (if using a non-uniform grid), and pointers to three arrays (for
the neighbor integers of the rows and columns of a non-uniform grid and for the pointers to the neighboring
Shell-Regions in the array AF , if using a grid compression scheme). The memory size of a Shell-Region
representation is constant (no more than 64 bytes) and is designed to be no bigger than the length of a
memory transaction in the GPU hardware (e.g., 128 bytes). Thus, a single off-chip memory transaction can
load a Shell-Region representation onto the on-chip memory (e.g., cache) during ray traversal.
All Shell-Units on a boundary face of a Shell-Region are grouped and allocated onto consecutive
memory locations in the Shell-Unit array based on a predetermined sequence specified by the specific
partition scheme for that face. For each Shell-Unit, its memory offset to the address of the first Shell-Unit in the group can be calculated from its geometric location on the face. Since the base address of each
group of the Shell-Units is stored in the corresponding Shell-Region, any Shell-Unit can be addressed easily.
With a grid partition scheme, Shell-Units on a face are represented as a matrix and stored as a 1D array in the
GPU memory. With the grid compression scheme, Shell-Units are in the array of neighboring Shell-Region
IDs (whose pointer is stored in its Shell-Region). Each Shell-Unit only contains a pointer to the neighboring
Shell-Region that it touches, and hence can always be loaded by one memory transaction.
Suppose for a 3D scene with N objects (for specific applications) represented by Shell, we need to store
M Shell-Regions and U Shell-Units besides the N objects, where M is the number of leaf regions in the
spatially decomposed scene and U is the total number of Shell-Units. Then clearly Shell uses O(M +N +U )
memory. Note that in the non-uniform grid compression scheme, the number of cells on each boundary
face is equal to the number of its neighboring regions, and the total size of the three arrays (for neighbor
integers of the rows and columns of a non-uniform grid and for pointers to the neighboring Shell-Regions)
is proportional to the total size of these neighboring regions. Thus, U is proportional to the number of
neighboring leaf region pairs in the spatial decomposition (i.e., the number of edges in the dual graph G).
4 Ray Traversal and Construction of Shell
In this section, we show how to use the Shell data structure effectively in ray traversal. As we pointed out
in Section 2, ray traversal essentially maps the geometric location of a point on a boundary
face of a region R (at which a ray r exits R) to a memory address containing information of the neighboring
region R′ of R traversed next by r. Hence, the main task of the Shell based ray traversal is to efficiently
“decode” the information stored in the Shell structure so as to obtain the needed memory address.
4.1 Locating the Next Traversing Region
In the Shell data structure, the Shell-Unit corresponding to the cell containing the hit point of a ray r on
a boundary face F of a region R stores a pointer to (i.e., the memory address of) the neighboring region of R
touching that Shell-Unit; we call this Shell-Unit a target Shell-Unit and denote it by SUt . Thus, locating the
next region entered by r can be accomplished in three steps: (I) Find the hit point p of r on F ; (II) determine
the memory location of the target Shell-Unit SUt containing p; (III) access SUt to obtain the address of the
Shell-Region for the next traversing region of r. Step (I) is taken by all ray traversal algorithms regardless
of the data structure used and Step (III) is trivial. Below, we elaborate how Step (II) is done.
The Shell-Units of a boundary face F , which we call a Shell-Unit group, are stored consecutively in the
memory as part of an array. Hence, the memory location of SUt can be computed based on the memory
address of the Shell-Unit group for face F (denoted as Mbase ) and the offset of SUt from the first element
in the Shell-Unit group. Since Mbase has already been read into the on-chip memory upon r entering the
current region, the main hurdle is to determine the offset. The process of computing the offset is equivalent
to decoding the Shell structure associated with the arrangement on F . The actual decoding method depends
on the specific partition and compression schemes used for F . Our decoding solutions for two cases (the
uncompressed uniform grid scheme and compressed non-uniform grid scheme) are given in Algorithms 1
and 2 in Appendix 2. Decoding methods for other cases can be easily derived from these two algorithms.
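For the uncompressed uniform grid case, the address computation in Step (II) can be illustrated by the following Python sketch. It is a simplified model under stated assumptions: the class UniformFace, its fields, and the function target_unit_address are hypothetical names, the Shell-Units of a face are assumed to be stored in row-major order, and hit points on the far edge of the face are clamped into the last row or column.

```python
from dataclasses import dataclass


@dataclass
class UniformFace:
    m_base: int    # index of the face's first Shell-Unit in the Shell-Unit array
    rows: int      # grid resolution along v
    cols: int      # grid resolution along u
    u_min: float
    v_min: float
    u_max: float
    v_max: float


def target_unit_address(face: UniformFace, u: float, v: float) -> int:
    """Map a hit point (u, v) on the face to the index of its Shell-Unit."""
    j = int((u - face.u_min) / (face.u_max - face.u_min) * face.cols)
    i = int((v - face.v_min) / (face.v_max - face.v_min) * face.rows)
    # Clamp so that hits exactly on the far boundary stay inside the grid.
    j = min(max(j, 0), face.cols - 1)
    i = min(max(i, 0), face.rows - 1)
    offset = i * face.cols + j          # row-major offset within the group
    return face.m_base + offset


face = UniformFace(m_base=5, rows=3, cols=4,
                   u_min=0.0, v_min=0.0, u_max=4.0, v_max=3.0)
address = target_unit_address(face, 2.5, 1.5)   # pixel (1, 2): 5 + 1*4 + 2
```

Reading the Shell-Unit array at the returned index then yields the index of the next Shell-Region, which is Step (III).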
4.2 Ray Traversal Based on Shell
The ray traversal algorithm using the Shell data structure is shown in Algorithm 3, Appendix 2.
In an ideal case, locating the next traversing region using the Shell data structure takes only two memory
transactions: one for accessing a Shell-Region (Line 10 of Algorithm 3) and one for a Shell-Unit (Line
9 of Algorithm 1, or Line 5 of Procedure 2 (see Figure 4(b)) used in Line 9 of Algorithm 2). However, the
Shell decoding for the more sophisticated schemes may need more than two memory requests. For example,
the compressed non-uniform grid scheme incurs three additional memory requests: one access to the grid
coordinates (Line 7 of Algorithm 2) and two accesses for decoding the target Shell-Unit (Lines 1–2 of
Procedure 2 (see Figure 4(b)) used in Line 9 of Algorithm 2). By properly aligning the memory layout
of the Shell-Region structure and the coordinate integers for the non-uniform grid or the neighbor integers
for a compressed grid, we can fulfill the five sequentially issued memory accesses by only two memory
transactions with cache. (Note that one memory transaction loads a group of data to cache, which can
satisfy multiple memory accesses if the accesses are aligned properly and the data stays in the cache [23].)
Although we are able to fulfill the memory accesses with two memory transactions, the limited cache
size in a many-core processor can induce conflict misses, which cause a part of the cached data (brought
in by a memory transaction) to be replaced before it is used. This is commonly referred to as cache
thrashing. For example, when the Shell-Region information is accessed, the memory transaction brings in
not only the Shell-Region information but also the grid information. However, due to the sequentiality of
the two reads, the grid information may be replaced by data accessed by some other threads. This problem is
exacerbated when a large number of threads are active simultaneously and competing for cache access. To
overcome this challenge, we judiciously reduce the number of simultaneously executing threads for targeted
sections of the code by controlling the number of threads to be launched. (A careful balance is needed so
that this reduction would not degrade the overall performance; details are omitted due to the page limit.)
The combined effort above guarantees that only two memory transactions are used for locating the
next traversing region. For comparison, consider a geometric scene decomposed into M regions and a ray
traversing through K leaf regions. A kd-tree searching traversal algorithm accesses O(K × log M ) tree
nodes; each access takes 2–4 memory transactions depending on the kd-tree implementation. With the Shell
ray traversal algorithm, the ray accesses K Shell-Regions; each access needs only two memory transactions.
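To make the comparison concrete, the following Python sketch evaluates both counts for given K and M. It is only an illustration: it treats the O(K × log M) bound as exactly K × ⌈log2 M⌉ node accesses, which is an assumption about the hidden constant and is not a prediction of measured behavior. The function names are hypothetical, and K = 20 is an arbitrary illustrative value; M = 154K is the leaf count of the Bunny scene in Table 1.

```python
import math


def kd_search_transactions(k_leaves: int, m_regions: int, per_node: int) -> int:
    """Estimate for hierarchical search: K * ceil(log2 M) node accesses,
    each costing per_node (2 to 4) memory transactions."""
    depth = math.ceil(math.log2(m_regions))
    return k_leaves * depth * per_node


def shell_transactions(k_leaves: int) -> int:
    """Shell: K Shell-Region accesses at two memory transactions each."""
    return 2 * k_leaves


K, M = 20, 154_000
low = kd_search_transactions(K, M, 2)     # 20 * 18 * 2 = 720
high = kd_search_transactions(K, M, 4)    # 20 * 18 * 4 = 1440
shell = shell_transactions(K)             # 2 * 20 = 40
```

The gap measured in practice is smaller than this estimate suggests, since real traversals rarely search the full tree depth for every leaf.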
4.3 Construction of the Shell Data Structure
Before ray traversal, the Shell data structure needs to be constructed. Since the ray traversal process
typically needs to be performed repeatedly for a large number of rays (e.g., in graphics ray tracing and radiation dose calculation applications), the construction is best performed as a preprocessing task, especially
for static scenes. As a common practice, we accomplish this task on the CPU of a CPU+GPU platform.
The construction starts by obtaining a spatial decomposition based on a cost model such as the Surface
Area Heuristic (SAH) [21] and building the corresponding kd-tree. The Shell construction scheme processes
each leaf region R as follows. We first identify all neighboring leaf regions touching each face of R.
Every face is partitioned into a uniform grid, whose resolution is determined according to the number and
distribution of its neighboring regions. Then depending on the size of the scene and the memory capacity,
the non-uniform grid partition and/or grid compression schemes, if needed, are applied to each face. Finally,
the memory space for the Shell-Region and Shell-Units of R is allocated and initialized accordingly. The
Shell data structure consists of the Shell-Regions and Shell-Units for all leaf regions of the decomposed
scene. The Shell data structure is then transferred to the GPU for ray traversal applications.
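One step of this construction, deriving a non-uniform grid for a face from the projected rectangles of its neighboring regions, might look as follows in Python. This is a simplified sketch rather than the construction code used in this paper: it works directly on continuous face coordinates instead of uniform-grid indices, it assumes the rectangles tile the face without gaps or overlaps, and the name build_nonuniform_grid and the input format are hypothetical.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]   # (u_min, v_min, u_max, v_max)


def build_nonuniform_grid(rects: Dict[int, Rect]):
    """Return (u_lines, v_lines, grid) where grid[i][j] is the region ID
    covering the cell between v_lines[i..i+1] and u_lines[j..j+1]."""
    # Grid lines are the distinct rectangle edges along each axis.
    u_lines = sorted({c for (u0, _, u1, _) in rects.values() for c in (u0, u1)})
    v_lines = sorted({c for (_, v0, _, v1) in rects.values() for c in (v0, v1)})

    grid: List[List[int]] = []
    for i in range(len(v_lines) - 1):
        vc = 0.5 * (v_lines[i] + v_lines[i + 1])       # cell center
        row = []
        for j in range(len(u_lines) - 1):
            uc = 0.5 * (u_lines[j] + u_lines[j + 1])
            # Assumption: exactly one rectangle contains each cell center.
            owner = next(rid for rid, (u0, v0, u1, v1) in rects.items()
                         if u0 <= uc < u1 and v0 <= vc < v1)
            row.append(owner)
        grid.append(row)
    return u_lines, v_lines, grid


# Four neighboring regions tiling a 3 x 2 face.
neighbors = {10: (0.0, 0.0, 2.0, 1.0), 11: (2.0, 0.0, 3.0, 2.0),
             12: (0.0, 1.0, 1.0, 2.0), 13: (1.0, 1.0, 2.0, 2.0)}
u_lines, v_lines, cells = build_nonuniform_grid(neighbors)
```

The resulting cell matrix is the conceptual grid Gn; under grid compression only the neighbor integers and the array AF derived from it would be kept.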
5 Evaluation
To evaluate the proposed Shell approach for ray traversal on GPU, we implemented our Shell data structure and Shell based ray traversal algorithm. We also implemented two state-of-the-art kd-tree searching ray
traversal algorithms: PD-SS and kd-rope. For kd-rope, we implemented its latest version, kd-rope++ [25],
which is the fastest known kd-tree searching ray traversal approach. Although PD-SS is
slightly slower than kd-rope, it takes much less memory space. To ensure the quality of our PD-SS
and kd-rope implementations, we tested them on graphics ray tracing applications and achieved rendering
performance results comparable to those published in related work (e.g., [11, 25]). The hardware platform
used in our evaluation is an NVIDIA GTX570 graphics card (Fermi architecture, 480 cores, 1.6GHz core
frequency, 1GB device memory). Figure 5 shows the set of geometric scenes used in our experiments. These
scenes are commonly found in the study of graphics rendering approaches. The properties of the kd-tree
decomposition for these scenes are listed in Columns 2–4 in Table 1. Two different Shell data structures are
built on the kd-tree decomposition of each scene. Shell-1 aims to minimize the memory usage by adopting
the non-uniform grid partition and grid compression schemes whenever they can save memory space in
building the arrangement for neighboring regions on each boundary face. Shell-2 tends to achieve a good
balance between memory usage and ray traversal performance, which uses those sophisticated schemes
only if they can reduce a certain portion of memory usage (e.g., more than 20%). All data presented in each
scene are obtained by traversing a set of rays generated to render an image of the scene with resolution of
1024 × 1024 from graphics ray tracing. Below we analyze and compare various metrics for the Shell based
and the traditional kd-tree searching ray traversal algorithms. These metrics include the ray traversal speed,
number of accessed nodes, number of divergent branches, and memory usage.
Figure 6 compares the ray traversal performance using PD-SS, kd-rope, and our Shell based algorithms.
The data in Figure 6 show that Shell achieves a speedup of 2.6X–5.1X over
PD-SS and 2.2X–4.3X over kd-rope. Furthermore, comparing the performance of Shell-1 and Shell-2, we see
that a larger Shell data structure tends to lead to faster ray traversal.
The Shell based ray traversal algorithm gains its performance advantage through removing many expensive memory accesses to internal nodes in the kd-tree searching ray traversal methods. As demonstrated in
Figure 7, Shell based ray traversal accesses on average 4.2X and 3.5X fewer nodes than PD-SS and kd-rope,
respectively. Although the kd-rope approach uses neighbor links to reduce accessing internal nodes, it still
needs to visit a number of internal nodes when a region boundary face has multiple neighboring regions.
Another factor contributing to the performance improvement of the Shell based ray traversal is that the
Shell data structure removes a significant amount of execution divergence. To illustrate this, Figures 8(a) and 8(b) summarize the number of branches and the number of divergent branches, respectively, for each algorithm.
Shell based ray traversal incurs on average 85% fewer branches and 70% fewer divergent branches. However, Shell
based ray traversal cannot completely eliminate execution divergence since different partition schemes are
involved. The more sophisticated schemes a Shell structure uses (e.g., compressed non-uniform grid), the
more divergent branches it may have. This is shown by comparing the Shell-1 and Shell-2 results in Figure 8.
The memory usage of the Shell data structure is strongly dependent on the schemes chosen to generate
the arrangements for neighboring regions on region boundaries. Our proposed partition schemes, together
with the grid compression method, provide a trade-off framework for memory usage and ray traversal
performance. In our experimental study, Shell-1 minimizes the memory requirement while Shell-2 aims
to achieve a good balance between memory usage and ray traversal performance.
Table 1 summarizes the properties and memory usage of PD-SS, kd-rope, Shell-1, and Shell-2 for each
of the geometric scenes in Figure 5. A kd-tree can be directly used by PD-SS. For kd-rope, the kd-tree
needs to be augmented in the leaf nodes by storing neighbor links and bounding boxes for each of them.
As shown in the column of “size(kd-rope)” in Table 1, the kd-tree supporting kd-rope uses about 3X more
memory than that supporting PD-SS. The Shell-1 data structure in Table 1 applies a non-uniform grid and
the grid compression scheme to as many region boundaries as possible. Its memory usage is about 40%
higher than PD-SS but 60% smaller than kd-rope. The Shell-2 data structure stores more Shell-Units and
uses the more sophisticated schemes only if they can save memory usage for a region boundary by more
than 20%. For example, on a region face F , suppose a uniform grid partitions it into x cells and the non-uniform grid scheme can reduce the number of cells to y. Then Shell-2 adopts the non-uniform grid on F
only if (y ÷ x) < 0.8. Although Shell-2 uses on average 30% more memory than Shell-1, its ray traversal
performance is 20%–45% better than Shell-1. Furthermore, Shell-2 not only uses 50%–60% less memory
than kd-rope, but also achieves 2.2X–4.3X speedup over kd-rope (see Figure 6).
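The Shell-2 selection rule for a single face can be written as a one-line test, sketched here in Python. The function name adopt_nonuniform_grid is hypothetical, and the 0.8 threshold is the value quoted above. The example numbers are taken from Figure 3, where a 21 × 18 uniform grid is reduced to an 8 × 7 non-uniform grid.

```python
def adopt_nonuniform_grid(x_cells: int, y_cells: int, threshold: float = 0.8) -> bool:
    """Shell-2 rule: switch to the non-uniform grid on a face only if it
    keeps fewer than `threshold` of the uniform grid's cells."""
    return (y_cells / x_cells) < threshold


# Figure 3: 21 * 18 = 378 uniform cells versus 8 * 7 = 56 non-uniform cells.
figure3_choice = adopt_nonuniform_grid(21 * 18, 8 * 7)   # 56 / 378 is about 0.15
# A face where the non-uniform grid saves only 10% of the cells.
small_saving = adopt_nonuniform_grid(100, 90)            # 0.9 is not below 0.8
```

Shell-1, by contrast, would adopt the more compact scheme in both cases, since it applies it whenever any memory is saved.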
References
[1] T. Aila and S. Laine. Understanding the efficiency of ray traversal on GPUs. In Proceedings of the 1st ACM
Conference on High Performance Graphics, pages 145–149, 2009.
[2] E. Bethel and M. Howison. Multi-core and many-core shared-memory parallel raycasting volume rendering
optimization and tuning. International Journal of High Performance Computing Applications, 26(4):399–412,
2012.
[3] G. Blelloch, P. Gibbons, and S. Vardhan. Combinable memory-block transactions. In Proceedings of the 20th
ACM Symposium on Parallelism in Algorithms and Architectures, SPAA’08, pages 23–34, 2008.
[4] A. Buluç, J. Fineman, M. Frigo, J. Gilbert, and C. Leiserson. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the 21st ACM Symposium on Parallelism
in Algorithms and Architectures, SPAA’09, pages 233–244, 2009.
[5] Q. Chen, M. Chen, and W. Lu. Ultrafast convolution/superposition using tabulated and exponential kernel.
Medical Physics, 38:1150–1161, 2011.
[6] P. Chuong, F. Ellen, and V. Ramachandran. A universal construction for wait-free transaction friendly data
structures. In Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures,
SPAA’10, pages 335–344, 2010.
[7] C. Clauss, S. Lankes, P. Reble, and T. Bemmerl. Evaluation and improvements of programming models for
the Intel SCC many-core processor. In 2011 International Conference on High Performance Computing and
Simulation, pages 525–532, 2011.
[8] T. Foley and J. Sugerman. Kd-tree acceleration structures for a GPU raytracer. In Proceedings of Graphics
Hardware, pages 15–22, 2005.
[9] W. Fung, I. Sham, G. Yuan, and T. Aamodt. Dynamic warp formation: Efficient MIMD control flow on SIMD
graphics hardware. ACM Trans. Archit. Code Optim., 6(2):1–37, 2009.
[10] S. Guntury and P. J. Narayanan. Raytracing dynamic scenes on the GPU using grids. IEEE Transactions on
Visualization and Computer Graphics, 18(1):5–16, 2012.
[11] M. Hapala and V. Havran. Review: Kd-tree traversal algorithms for ray tracing. Computer Graphics Forum,
30(1):199–213, 2011.
[12] V. Havran. Heuristic Ray Shooting Algorithms. PhD thesis, Czech Technical University, Nov. 2000.
[13] V. Havran, J. Bittner, and J. Zara. Ray tracing with rope trees. In Proceedings of Spring Conference on Computer
Graphics, pages 130–139, 1998.
[14] D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the synchronization-parallelism tradeoff.
In Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA’10, pages
355–364, 2010.
[15] D. R. Horn, J. Sugerman, M. Houston, and P. Hanrahan. Interactive k-d tree GPU ray tracing. In Proceedings of
the Symposium on Interactive 3D Games and Graphics, pages 167–174, 2007.
[16] D. M. Hughes and I. S. Lim. Kd-jump: A path-preserving stackless traversal for faster isosurface raytracing on
GPUs. IEEE Transactions on Visualization and Computer Graphics, 15(6):1555–1562, 2009.
[17] J. Kalojanov, M. Billeter, and P. Slusallek. Two-level grids for ray tracing on GPUs. Computer Graphics Forum,
30(2):307–314, 2011.
[18] D. Kopta, J. Spjut, E. Brunvand, and A. Davis. Efficient MIMD architectures for high-performance ray tracing.
In 2010 IEEE International Conference on Computer Design, pages 9–16, 2010.
[19] Y. Krishnakumar, T. Prasad, K. Kumar, P. Raju, and B. Kiranmai. Realization of a parallel operating
SIMD-MIMD architecture for image processing application. In 2011 International Conference on Computer,
Communication and Electrical Technology, pages 98–102, 2011.
[20] M. Kulkarni, P. Carribault, K. Pingali, G. Ramanarayanan, B. Walter, K. Bala, and L. Chew. Scheduling
strategies for optimistic parallel execution of irregular programs. In Proceedings of the 20th ACM Symposium
on Parallelism in Algorithms and Architectures, SPAA’08, pages 217–228, 2008.
[21] J. D. MacDonald and K. S. Booth. Heuristic for ray tracing using space subdivision. Visual Computer, 6:153–
165, 1990.
[22] T. Mattson, M. Riepen, T. Lehnig, P. Brett, W. Haas, P. Kennedy, J. Howard, S. Vangal, N. Borkar, G. Ruhl,
and S. Dighe. The 48-core SCC processor: The programmer’s view. In Proceedings of the 2010 ACM/IEEE
International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–11,
2010.
[23] NVIDIA Corporation. NVIDIA CUDA C programming guide version 5.0. URL: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html, 2013.
[24] S. Popov, J. Gunther, H. P. Seidel, and P. Slusallek. Stackless kd-tree traversal for high performance GPU ray
tracing. Computer Graphics Forum, 26(3):415–424, 2007. (Proc. Eurographics. 2007).
[25] A. Santos, J. M. Teixeira, T. Farias, V. Teichrieb, and J. Kelner. Understanding the efficiency of kd-tree ray
traversal techniques over a GPGPU architecture. International Journal of Parallel Programming, 40(3):331–
352, 2012.
[26] F. Song and J. Dongarra. A scalable framework for heterogeneous GPU-based clusters. In Proceedings of the
24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA’12, 2012.
[27] Stanford Computer Graphics Laboratory. The Stanford 3D scanning repository. URL: http://graphics.stanford.edu/data/3Dscanrep/, 2012.
[28] J. Tsakok, W. Bishop, and A. Kennings. Kd-tree traversal techniques. In Proceedings of the IEEE Symposium
on Interactive Ray Tracing, pages 190–194, 2008.
[29] I. Wald, W. R. Mark, J. Gunther, S. Boulos, T. Ize, W. Hunt, S. G. Parker, and P. Shirley. State of the art in ray
tracing animated scenes. Computer Graphics Forum, 28(6):1691–1722, 2009.
[30] K. Xiao, B. Zhou, D. Z. Chen, and X. S. Hu. Efficient implementation of the 3D-DDA ray traversal algorithm
on GPU and its application in radiation dose calculation. Medical Physics, 39:7619–7626, 2012.
[31] B. Zhou, X. S. Hu, and D. Z. Chen. Memory-efficient volume ray tracing on GPU for radiotherapy. In IEEE 9th
Symposium on Application Specific Processors, pages 46–51, 2011.
[32] M. Zlatuska and V. Havran. Ray tracing on a GPU with CUDA – Comparative study of three algorithms. In
Proceedings of Computer Graphics, Visualization and Computer Vision, pages 69–75, 2010.
A Appendix 1: Figures
Figure 1: (a) A 2D scene with 18 spatially decomposed regions and two traversing rays, ray1 and ray2. (b) The kd-tree
representation of the scene in (a) with the leaf nodes of the tree corresponding to the decomposed regions in (a). (c) Sequences of
regions traversed by ray1 and ray2, and the tree nodes accessed by these rays using the PD-SS and kd-rope algorithms.
Figure 2: (a) A 2D scene with 8 spatially decomposed regions. (b) A uniform-grid based Shell structure for the decomposed
scene in (a). (c) The Shell-Region for region B in (a) and one of its Shell-Units adjacent to region D. (d) The memory layout of
Shell, where each element in the Shell-Region array contains a pointer to the first Shell-Unit associated with that Shell-Region, and
each element in the Shell-Unit array contains a pointer to its neighboring Shell-Region (indicated by the dashed arrows).
Figure 3: (a) A non-uniform grid for a boundary face F with 16 neighboring regions. F is partitioned by all the vertical and
horizontal lines into a uniform grid scheme. The solid bold red boxes are the projected boundaries of the neighboring regions on
F . The solid green lines are the lines used by the non-uniform grid. The set of 21 × 18 Shell-Units using the uniform grid on F is
reduced to 8 × 7 Shell-Units using the non-uniform grid. (b) The decoding procedure for converting the cell indices from a uniform
grid to a non-uniform grid. By applying this procedure, Cell(11, 12) (containing point P in (a)) in the uniform grid is converted to
Cell′(4, 5) in the non-uniform grid.
Figure 4: (a) A compressed non-uniform grid for the boundary face F in Figure 3(a). The grid compression scheme stores
7 + 8 + 16 = 31 integers for F (i.e., the neighbor integers for 7 rows and 8 columns, and an array of pointers to the 16 neighboring
regions). (b) The decoding procedure for obtaining the pointer to the neighboring Shell-Region touching a given cell. By applying
this procedure, the pointer to the neighboring Shell-Region of Cell(4, 5) (containing point P in (a)) is found at AF [8].
Figure 5: Geometric scenes used in the experiments of this paper. From left to right: Bunny (69K objects), Dragon (871K
objects), Buddha (1.08M objects), from the Stanford 3D Scanning Repository [27], and Balls5 (66K objects) used in [12].
Figure 6: The execution time of the PD-SS, kd-rope, and Shell based ray traversal on the scenes in Figure 5. The detailed
information for the data structures (including kd-tree, Shell-1, and Shell-2) is given in Table 1.
Figure 7: The total numbers of nodes (including internal and leaf nodes) accessed by all rays during the execution of PD-SS,
kd-rope, and Shell based ray traversal on the scenes in Figure 5. The same set of rays is used by all these algorithms on a scene.
Note that Shell-1 and Shell-2 are based on the same kd-tree decomposition, and hence the Shell based algorithms access the same
number of nodes (leaf regions) using these two Shell data structures.
Figure 8: The numbers of (a) branches and (b) divergent branches during the execution of PD-SS, kd-rope, and Shell based ray
traversal on the scenes in Figure 5. The Shell based algorithms incur smaller numbers of branches and divergent branches, and
Shell-2 has even smaller such numbers than Shell-1 because Shell-2 uses less sophisticated schemes.
B Appendix 2: Algorithms
Algorithm 1 Locating the next traversing region using a uniform grid
1: Input: ray R, boundary face F, Shell-Region SR.
2: Output: target Shell-Unit SUt.
3: compute the point P where R intersects F and exits from SR;
4: obtain the uniform grid G for F;
5: Mbase = the address of the Shell-Unit group for F;
6: compute the location index (i, j) of the pixel in G containing P; /* The pixel at (i, j) represents the target Shell-Unit. */
7: Moff = i × (row size of G) + j;
8: Maddr = Mbase + Moff; /* Compute the memory address Maddr for the target Shell-Unit. */
9: SUt = SU_array[Maddr]; /* Load and return the target Shell-Unit, which is the address of the next Shell-Region SRN. */
10: return SUt;
Algorithm 2 Locating the next traversing region using a compressed non-uniform grid
1: Input: ray R, boundary face F, Shell-Region SR.
2: Output: target Shell-Unit SUt.
3: compute the point P where R intersects F and exits from SR;
4: obtain the uniform grid G for F;
5: compute the location index (i, j) of the pixel in G containing P;
6: obtain the non-uniform grid Gn for F;
7: obtain Coord.x and Coord.y for F;
8: compute index (i′, j′) by calling Procedure 1 in Figure 3(b) using (i, j); /* Decode the non-uniform grid. */
9: compute pointer ID by calling Procedure 2 in Figure 4(b) using (i′, j′); /* Decode the compressed grid. */
10: return ID; /* ID points to the next Shell-Region to traverse, which serves as the target Shell-Unit. */
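Line 8 of Algorithm 2 converts uniform-grid indices into non-uniform-grid indices. Procedure 1 itself is given only in Figure 3(b), so the Python sketch below is an assumed stand-in rather than a transcription of it. It supposes that the coordinate integers list, in increasing order, the first uniform-grid row (or column) index covered by each non-uniform row (or column), and it locates the containing interval by binary search. The function name to_nonuniform and the sample coordinate arrays are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


def to_nonuniform(i: int, j: int,
                  coord_rows: List[int], coord_cols: List[int]) -> Tuple[int, int]:
    """Map uniform-grid pixel (i, j) to non-uniform cell (i', j').

    coord_rows[k] is assumed to be the first uniform row index belonging
    to non-uniform row k (and likewise coord_cols for columns); both lists
    start at 0 and are strictly increasing.
    """
    i_prime = bisect_right(coord_rows, i) - 1
    j_prime = bisect_right(coord_cols, j) - 1
    return i_prime, j_prime


# Hypothetical face: non-uniform rows start at uniform rows 0, 3, 7,
# and non-uniform columns start at uniform columns 0, 2, 5, 6.
rows_start = [0, 3, 7]
cols_start = [0, 2, 5, 6]
cell = to_nonuniform(5, 4, rows_start, cols_start)   # row 5 -> 1, column 4 -> 1
```

Because m and n are small, a short linear scan over the coordinate integers would serve equally well and may suit a GPU thread better than a binary search.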
Algorithm 3 Ray traversal algorithm based on Shell
1: Input: ray R, Shell data structure with an SR (Shell-Region) array and an SU (Shell-Unit) array.
2: Output: traversal of R.
3: initialization, to obtain the first Shell-Region SR0 for traversing R;
4: SR = SR0;
5: more_region_to_traverse = true;
6: while more_region_to_traverse do
7:     compute the boundary face F of SR on which R exits SR;
8:     obtain the partition and compression scheme S used on F;
9:     use the corresponding algorithm for S to compute the target SUt for entering the next Shell-Region SRN;
       /* E.g., if F uses a compressed non-uniform grid, then SUt = Algorithm 2(R, F, SR). */
10:    SR = SR_array[SUt]; /* The next Shell-Region SRN is obtained and is set as SR for the next iteration. */
11:    if the traversal of R is finished then
12:        more_region_to_traverse = false; /* If the traversal of R is not finished, then continue to the next iteration. */
13:    end if
14: end while
15: end
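The control flow of Algorithm 3 can be illustrated with a small 2D Python model. It is a toy sketch under several assumptions and is not the GPU implementation: regions are axis-aligned boxes in the plane, every face uses an uncompressed uniform grid, the traversal ends when the ray leaves the scene (marked by the hypothetical sentinel OUTSIDE), and all class and function names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

OUTSIDE = -1   # hypothetical sentinel: no neighboring region


@dataclass
class Face:
    m_base: int   # index of the face's first Shell-Unit in su_array
    cells: int    # number of uniform cells along the face


@dataclass
class ShellRegion:
    box: Tuple[float, float, float, float]   # (x0, y0, x1, y1)
    faces: Dict[str, Face]                   # keys "x-", "x+", "y-", "y+"


def exit_face(box, origin, direction):
    """Return the face through which the ray leaves the box, and the hit point."""
    x0, y0, x1, y1 = box
    (ox, oy), (dx, dy) = origin, direction
    best_t, best = float("inf"), None
    for name, plane, o, d in (("x-", x0, ox, dx), ("x+", x1, ox, dx),
                              ("y-", y0, oy, dy), ("y+", y1, oy, dy)):
        if d == 0 or (name[1] == "+") != (d > 0):
            continue                      # ray never exits through this face
        t = (plane - o) / d
        if t < best_t:
            best_t, best = t, name
    return best, (ox + best_t * dx, oy + best_t * dy)


def traverse(sr_array: List[ShellRegion], su_array: List[int],
             start: int, origin, direction) -> List[int]:
    """Follow the ray from region `start`; return the visited region indices."""
    visited, sr = [], start
    while sr != OUTSIDE:
        visited.append(sr)
        region = sr_array[sr]
        name, (hx, hy) = exit_face(region.box, origin, direction)
        face = region.faces[name]
        x0, y0, x1, y1 = region.box
        # Position of the hit point along the face, as a fraction in [0, 1].
        frac = (hy - y0) / (y1 - y0) if name[0] == "x" else (hx - x0) / (x1 - x0)
        cell = min(int(frac * face.cells), face.cells - 1)
        sr = su_array[face.m_base + cell]   # one table lookup, no tree search
    return visited


# Regions: 0 = [0,1]x[0,2], 1 = [1,2]x[0,1], 2 = [1,2]x[1,2].
su_array = [-1, 1, 2, -1, -1,   0, -1, -1, 2,   0, -1, 1, -1]
sr_array = [
    ShellRegion((0, 0, 1, 2), {"x-": Face(0, 1), "x+": Face(1, 2),
                               "y-": Face(3, 1), "y+": Face(4, 1)}),
    ShellRegion((1, 0, 2, 1), {"x-": Face(5, 1), "x+": Face(6, 1),
                               "y-": Face(7, 1), "y+": Face(8, 1)}),
    ShellRegion((1, 1, 2, 2), {"x-": Face(9, 1), "x+": Face(10, 1),
                               "y-": Face(11, 1), "y+": Face(12, 1)}),
]
path = traverse(sr_array, su_array, 0, (0.5, 0.25), (1.0, 1.0))   # [0, 1, 2]
```

Each loop iteration reads one Shell-Region and one Shell-Unit, mirroring the two memory transactions discussed in Section 4.2.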
C Appendix 3: Tables
         kd-tree                                                        Shell-1            Shell-2
Scenes   leaves(a)  INs(b)  objects(c)  size(PD-SS)  size(kd-rope)     SUs(d)  size       SUs(d)  size
Bunny    154K       401K    5.5         9.2MB        30.1MB            2.8M    13.4MB     3.9M    17MB
Dragon   920K       3.5M    2.8         55.8MB       167MB             13M     75MB       19M     98MB
Buddha   1.3M       4.8M    2.4         78.1MB       241MB             20M     112MB      24M     141MB
Balls5   57K        189K    2.3         3.6MB        11.9MB            0.96M   4.7MB      1.2M    6.1MB
(a) The total number of leaf nodes (i.e., leaf regions).
(b) The total number of internal tree nodes.
(c) The average number of objects contained in each leaf node.
(d) The total number of Shell-Units.
Table 1: The properties and memory usage of the kd-tree and Shell data structures built for the scenes in Figure 5.
Columns 2, 3, and 4 illustrate the properties of the kd-tree. Column 5 shows the memory usage of the kd-tree, which
is used by the PD-SS algorithm. Column 6 shows the memory space used by kd-rope, which stores more information
(e.g., the bounding box information and neighbor links) in the leaf nodes of the kd-tree. For the Shell-1 and Shell-2
data structures, Columns 7 and 9 illustrate the total numbers of Shell-Units that they contain, and Columns 8 and 10
show their total memory requirement, respectively.