Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Database Operations on GPU Changchang Wu 4/18/2007 Outline • Database Operations on GPU • Point List Generation on GPU • Nearest Neighbor Searching on GPU Database Operations on GPU Design Issues • Low bandwidth between GPU and CPU • Avoid frame buffer readbacks • No arbitrary writes • Avoid data rearrangements • Programmable pipeline has poor branching • Evaluate branches using fixed function tests Design Overview • Use depth test functionality of GPUs for performing comparisons • Implements all possible comparisons <, <=, >=, >, ==, !=, ALWAYS, NEVER • Use stencil test for data validation and storing results of comparison operations • Use occlusion query to count number of elements that satisfy some condition Basic Operations Basic SQL query Select A From T Where C A= attributes or aggregations (SUM, COUNT, MAX etc) T=relational table C= Boolean Combination of Predicates (using operators AND, OR, NOT) Basic Operations • Predicates – ai op constant or ai op aj • Op is one of <,>,<=,>=,!=, =, TRUE, FALSE • Boolean combinations – Conjunctive Normal Form (CNF) expression evaluation • Aggregations – COUNT, SUM, MAX, MEDIAN, AVG Predicate Evaluation • ai op constant (d) • Copy the attribute values ai into depth buffer • Define the comparison operation using depth test • Draw a screen filling quad at depth d glDepthFunc(…) glStencilOp(fail, zfail, zpass ); Predicate Evaluation • Comparing two attributes: • ai op aj is treated as (ai – aj) op 0 • Semi-linear queries • Easy to compute with fragment shader Boolean Combinations • Expression provided as a CNF • CNF is of form (A1 AND A2 AND … AND Ak) where Ai = (Bi1 OR Bi2 OR … OR Bimi ) • CNF does not have NOT operator • If CNF has a NOT operator, invert comparison operation to eliminate NOT Eg. NOT (ai < d) => (ai >= d) • For example, compute ai within [low, high] • Evaluated as ( ai >= low ) AND ( ai <= high ) Algorithm Range Query • Compute ai within [low, high] • Evaluated as ( ai >= low ) AND ( ai <= high ) Aggregations • COUNT, MAX, MIN, SUM, AVG • No data rearrangements COUNT • Use occlusion queries to get pixel pass count • Syntax: • Begin occlusion query • Perform database operation • End occlusion query • Get count of number of attributes that passed database operation • Involves no additional overhead! MAX, MIN, MEDIAN • We compute Kth-largest number • Traditional algorithms require data rearrangements • We perform no data rearrangements, no frame buffer readbacks K-th Largest Number • By comparing and counting, determinate every bit in order of MSB to LSB Example: Parallel Max • S={10,24,37,99,192,200,200,232} • Step 1: Draw Quad at 128(10000000) • S = {10,24,37,99,192,200,200,232} • Step 2: Draw Quad at 192(11000000) • S = {10,24,37,192,200,200,232} • Step 3: Draw Quad at 224(11100000) • S = {10,24,37,192,200,200,232} • Step 4: Draw Quad at 240(11110000) • – No values pass • Step 5: Draw Quad at 232(11101000) • S = {10,24,37,192,200,200,232} • Step 6,7,8: Draw Quads at 236,234,233 – No values pass, Max is 232 Accumulator, Mean • Accumulator - Use sorting algorithm and add all the values • Mean – Use accumulator and divide by n • Interval range arithmetic • Alternative algorithm • Use fragment programs – requires very few renderings • Use mipmaps [Harris et al. 02], fragment programs [Coombe et al. 03] Accumulator • Data representation is of form ak 2k + ak-1 2k-1 + … + a0 Sum = sum(ak) 2k+ sum(ak-1) 2k-1+…+sum(a0) Current GPUs support no bit-masking operations The Algorithm >=0.5 means i-th bit is 1 Implementation • Algorithm • CPU – Intel compiler 7.1 with hyper-threading, multi-threading, SIMD optimizations • GPU – NVIDIA Cg Compiler • Hardware • Dell Precision Workstation with Dual 2.8GHz Xeon Processor • NVIDIA GeForce FX 5900 Ultra GPU • 2GB RAM Benchmarks • TCP/IP database with 1 million records and four attributes • Census database with 360K records Copy Time Predicate Evaluation Range Query Multi-Attribute Query Semi-linear Query Kth-Largest Kth-Largest Kth-Largest conditional Accumulator Analysis: Issues • Precision • Copy time • Integer arithmetic • Depth compare masking • Memory management • No Branching • No random writes Analysis: Performance • Relative Performance Gain • High Performance – Predicate evaluation, multi-attribute queries, semi-linear queries, count • Medium Performance – Kth-largest number • Low Performance - Accumulator High Performance • Parallel pixel processing engines • Pipelining • Early Z-cull • Eliminate branch mispredictions Medium Performance • Parallelism • FX 5900 has clock speed 450MHz, 8 pixel processing engines • Rendering single 1000x1000 quad takes 0.278ms • Rendering 19 such quads take 5.28ms. Observed time is 6.6ms • 80% efficiency in parallelism!! Low Performance • No gain over SIMD based CPU implementation • Two main reasons: • Lack of integer-arithmetic • Clock rate Advantages • Algorithms progress at GPU growth rate • Offload CPU work • Fast due to massive parallelism on GPUs • Algorithms could be generalized to any geometric shape • Eg. Max value within a triangular region • Commodity hardware! GPU Point List Generation • Data compaction Overall task 3D to 2D mapping Current Problem The solution Overview, Data Compaction Algorithm: Discriminator Algorithm: Histogram Builder Histogram Output Algorithm: PointList Builder PointList Output Timing Reduces a highly sparse matrix with N elements to a list of its M active entries in O(N) + M (log N) steps, Applications • Image Analysis • Feature Detection • Volume Analysis • Sparse Matrix Generation Searching • 1D Binary Search • Nearest Neighbor Search for High dimension space • K-NN Search Binary Search • Find a specific element in an ordered list • Implement just like CPU algorithm • Assuming hardware supports long enough shaders • Finds the first element of a given value v • If v does not exist, find next smallest element > v • Search algorithm is sequential, but many searches can be executed in parallel • Number of pixels drawn determines number of searches executed in parallel • 1 pixel == 1 search Binary Search • Search for v0 Initialize Search starts at center of sorted array 4 v2 >= v0 so search left half of sub-array Sorted List v0 0 v0 1 v0 2 v2 3 v2 4 v2 5 v5 6 v5 7 Binary Search • Search for v0 Initialize 4 Step 1 2 Sorted List v0 0 v0 >= v0 so search left half of sub-array v0 1 v0 2 v2 3 v2 4 v2 5 v5 6 v5 7 Binary Search • Search for v0 Initialize 4 Step 1 2 Step 2 1 Sorted List v0 0 v0 >= v0 so search left half of sub-array v0 1 v0 2 v2 3 v2 4 v2 5 v5 6 v5 7 Binary Search • Search for v0 Initialize 4 Step 1 2 Step 2 1 Step 3 0 Sorted List v0 0 At this point, we either have found v0 or are 1 element too far left One last step to resolve v0 1 v0 2 v2 3 v2 4 v2 5 v5 6 v5 7 Binary Search • Search for v0 Initialize 4 Step 1 2 Step 2 1 Step 3 0 Step 4 0 Sorted List v0 0 Done! v0 1 v0 2 v2 3 v2 4 v2 5 v5 6 v5 7 Binary Search • Search for v0 and v2 Initialize 4 Search starts at center of sorted array 4 Both searches proceed to the left half of the array Sorted List v0 0 v0 1 v0 2 v2 3 v2 4 v2 5 v5 6 v5 7 Binary Search • Search for v0 and v2 Initialize 4 4 Step 1 2 2 Sorted List v0 0 The search for v0 continues as before The search for v2 overshot, so go back to the right v0 1 v0 2 v2 3 v2 4 v2 5 v5 6 v5 7 Binary Search • Search for v0 and v2 Initialize 4 4 Step 1 2 2 Step 2 1 3 Sorted List v0 0 v0 1 v0 2 We’ve found the proper v2, but are still looking for v0 Both searches continue v2 3 v2 4 v2 5 v5 6 v5 7 Binary Search • Search for v0 and v2 Initialize 4 4 Step 1 2 2 Step 2 1 3 Step 3 0 2 Sorted List v0 0 v0 1 v0 2 Now, we’ve found the proper v0, but overshot v2 The cleanup step takes care of this v2 3 v2 4 v2 5 v5 6 v5 7 Binary Search • Search for v0 and v2 Initialize 4 4 Step 1 2 2 Step 2 1 3 Step 3 0 2 Step 4 0 3 Sorted List v0 0 v0 1 v0 2 Done! Both v0 and v2 are located properly v2 3 v2 4 v2 5 v5 6 v5 7 Binary Search Summary • Single rendering pass • Each pixel drawn performs independent search • O(log n) steps Nearest Neighbor Search • Very fundamental step in similarity search of data mining, retrieval… • Curse of dimensionality, • When dimensionality is very high, structures like k-d tree does not help • Use GPU to improve linear scan Distances • N-norm distance • Cosine distance acos(dot(x,y)) Data Representation • Use separate textures to store different dimensions. Distance Computation • Accumulating distance component of different dimensions Reduction in RGBA Reduction to find NN Results Results K-Nearest Neighbor Search • Given a sample point p, find the k points nearest p within a data set • On the CPU, this is easily done with a heap or priority queue • Can add or reject neighbors as search progresses • Don’t know how to build one efficiently on GPU • kNN-grid • Can only add neighbors… kNN-grid Algorithm sample point candidate neighbor neighbors found Want 4 neighbors kNN-grid Algorithm • Candidate neighbors must be within max search radius • Visit voxels in order of distance to sample point sample point candidate neighbor neighbors found Want 4 neighbors kNN-grid Algorithm 1 sample point candidate neighbor neighbors found Want 4 neighbors • If current number of neighbors found is less than the number requested, grow search radius kNN-grid Algorithm 2 sample point candidate neighbor neighbors found Want 4 neighbors • If current number of neighbors found is less than the number requested, grow search radius kNN-grid Algorithm 2 sample point candidate neighbor neighbors found Want 4 neighbors • Don’t add neighbors outside maximum search radius • Don’t grow search radius when neighbor is outside maximum radius kNN-grid Algorithm • Add neighbors within search radius 3 sample point candidate neighbor neighbors found Want 4 neighbors kNN-grid Algorithm • Add neighbors within search radius 4 sample point candidate neighbor neighbors found Want 4 neighbors kNN-grid Algorithm • Don’t expand search radius if enough neighbors already found 4 sample point candidate neighbor neighbors found Want 4 neighbors kNN-grid Algorithm • Add neighbors within search radius 5 sample point candidate neighbor neighbors found Want 4 neighbors kNN-grid Algorithm 6 sample point candidate neighbor neighbors found Want 4 neighbors • Visit all other voxels accessible within determined search radius • Add neighbors within search radius kNN-grid Summary 6 sample point candidate neighbor neighbors found Want 4 neighbors • Finds all neighbors within a sphere centered about sample point • May locate more than requested knearest neighbors References • Naga Govindaraju, Brandon Lloyd, Wei Wang, Ming Lin and Dinesh Manocha, Fast Computation of Database Operations using Graphics Processors http://www.gpgpu.org/s2004/slides/govindaraju.DatabaseOperations.ppt • Benjamin Bustos, Oliver Deussen, Stefan Hiller, and Daniel Keim, A Graphic Hardware Accelerated Algorithm for Nearest Neighbor Search • Gernot Ziegler, Art Tevs, Christian Theobalt, Hans-Peter Seidel, GPU Point List Generation through Histogram Pyramids http://www.mpi-inf.mpg.de/~gziegler/gpu_pointlist/ • Tim Purcell, Sorting and Searching http://www.gpgpu.org/s2005/slides/purcell.SortingAndSearching. ppt