Data Visualization And Mining Using The GPU

Sudipto Guha (Univ. of Pennsylvania)
Shankar Krishnan (AT&T Labs - Research)
Suresh Venkatasubramanian (AT&T Labs - Research)

What you will see in this tutorial:
- The GPU is fast
- The GPU is programmable
- The GPU can be used for interactive visualization
- How do we abstract the GPU?
- What is an efficient GPU program?
- How various data mining primitives are implemented
- Using the GPU for non-standard data visualization

What you will NOT see:
- Detailed programming tricks
- Hacks to improve performance
- Mechanics of GPU programming
BUT we will show you where to find all of this. A plethora of resources is now available on the web, as well as code, toolkits, examples, and more.

Schedule
1st hour:
- 1:30pm - 1:50pm: Introduction to GPUs (Shankar)
- 1:50pm - 2:30pm: Examples of GPUs in Data Analysis (Shankar)
2nd hour:
- 2:30pm - 2:50pm: Stream Algorithms on the GPU (Sudipto)
- 2:50pm - 3:20pm: Data mining primitives (Sudipto)
- 3:20pm - 3:30pm: Questions and short break
3rd hour:
- 3:30pm - 4:00pm: GPU Case Studies (Shankar)
- 4:00pm - 4:20pm: High-level software support (Shankar)
- 4:20pm - 4:30pm: Wrap-up and questions

Animusic Demo (courtesy ATI)

But I don't do graphics! Why should I care about the GPU?
Two converging trends in computing:
- The accelerated development of graphics cards: GPUs are developing faster than CPUs, and are cheap and ubiquitous.
- An increasing need for streaming computations: the original motivation came from dealing with large data sets, and streaming is also interesting for multimedia applications, image processing, visualization, etc.

What is a Stream?
- An ordered list of data items
- Each data item has the same type, like a tuple or record
- The length of the stream is potentially very large
- Examples: data records in database applications; vertex information in computer graphics; points, lines, etc.
in computational geometry.

Streaming Model
- Input is presented as a sequence
- The algorithm works in passes: it is allowed one sequential scan over the input and is not permitted to move backwards mid-scan
- Workspace is typically o(n); arbitrary computation is allowed
- Algorithm efficiency: size of the workspace and computation time

Streaming: from data driven to performance driven
- The primary motivation is computing over transient data (data driven): data over a network, sensor data, router data, etc.
- Computing over large, disk-resident data which is expensive to access (data and performance driven)
- To improve algorithm performance

How does streaming help performance?
The Von Neumann bottleneck [Figure: CPU (control unit, ALU, cache) connected to memory over a bus carrying instructions and data]:
- The CPU processes faster than memory bandwidth can feed it, and the discrepancy is getting worse
- Large caches and sophisticated prefetching strategies alleviate the bottleneck to some extent
- Caches occupy large portions of real estate in modern chip design

Trends in Hardware
[Figure: die photograph of the Intel/HP IA-64 processor (Itanium2 chip); the L3 cache arrays dominate the cache real estate]

Stream Architectures
- All input is in the form of streams
- A stream is processed by a specialized computational unit called a kernel; the kernel performs the same operations on each stream element
[Figure: a data stream flowing through a chain of kernels]
- Items are processed in a FIFO fashion
- Reduced memory latency and cache requirements
- Simplified control flow
- Data-level parallelism
- Greater computational efficiency
- Examples: CHEOPS [Rixner et al. '98] and Imagine [Kapasi et al.
'02], high-performance media applications.

GPU: A Streaming Pipelined Architecture
- Inputs are presented in streaming fashion: a processed data item passes to the next phase and does not return
- Data-level parallelism
- Limited local storage: data items essentially carry their own state
- Pipelining: each item is processed identically
- Not quite general purpose yet, but getting there
[Figure: GPU performance, from 'Stream Programming Environments' - Hanrahan, 2004]

The Graphics Pipeline
Stages: Modeling Transformations -> Illumination (Shading) -> Viewing Transformation -> Clipping -> Projection (to Screen Space) -> Scan Conversion (Rasterization) -> Visibility / Display
Input:
- Geometric model: description of all object, surface and light source geometry and transformations
- Lighting model: computational description of object and light properties and their interaction (reflection, scattering, etc.)
- Synthetic viewpoint (or camera location): eye position and viewing frustum
- Raster viewport: pixel grid onto which the image is mapped
Output: colors/intensities suitable for display (for example, a 24-bit RGB value at each pixel)

Pipeline properties (slides courtesy Durand and Cutler):
- Primitives are processed in a series of stages
- Each stage forwards its result on to the next stage
- The pipeline can be implemented in different ways
- Optimizations and additional programmability are available at some stages

The stages in detail:
- Modeling Transformations: 3D models are defined in their own coordinate system (object space); modeling transforms orient the models within a common coordinate frame (world space).
- Illumination (Shading/Lighting): vertices are lit (shaded) according to material properties, surface properties (the normal) and light sources, using a local lighting model (Diffuse, Ambient, Phong, etc.).
- Viewing Transformation: maps world space to eye space; the viewing position is transformed to the origin and the viewing direction is oriented along some axis (usually z).
- Clipping: transform to Normalized Device Coordinates (NDC); portions of the object outside the view volume (view frustum) are removed.
- Projection: the objects are projected to the 2D image plane (screen space).
- Scan Conversion (Rasterization): rasterizes objects into pixels, interpolating values (color, depth, etc.) as it goes.
- Visibility / Display: each pixel remembers the closest object (depth buffer).

Almost every step in the graphics pipeline involves a change of coordinate system. Transformations are central to understanding 3D computer graphics.

Coordinate Systems in the Pipeline
Object space -> (modeling transformations) -> world space -> (viewing transformation) -> eye space / camera space -> (clipping, projection) -> clip space (NDC) -> (scan conversion) -> screen space

Programmable Graphics Pipeline (3D API: OpenGL or Direct3D)
[Figure, courtesy the Cg Book (Fernando and Kilgard): the 3D application or game sends API commands and a vertex index stream across the CPU-GPU boundary to the GPU front end; pre-transformed vertices go to the programmable vertex processor; transformed vertices are assembled into primitives, then rasterized and interpolated into pre-transformed fragments; the programmable fragment processor produces transformed fragments, which raster and framebuffer operations write as pixel updates]

Increase in Expressive Power
- 1996: simple if-then tests via depth and stencil testing
- 1998: more complex arithmetic and lookup operations
- 2001: limited programmability in the pipeline via specialized assembly constructs
- 2002: full programmability, but only straight-line programs
- 2004: true conditionals and loops
- 2006(?): a general-purpose streaming processor? (Recall Hanrahan, 2004.)

GPU = fast co-processor?
- GPU speed is increasing at a "cubed Moore's Law" rate. This is a consequence of the data-parallel streaming aspects of the GPU.
- GPUs are cheap! Put enough together, and you can get a supercomputer.
So can we use the GPU for general-purpose computing?
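The pipeline's chain of coordinate systems can be illustrated with a toy numeric sketch. This is illustrative code, not pipeline code: real pipelines use 4x4 homogeneous matrices, whereas here the modeling and viewing transforms are plain 2D translations and the projection is a simple orthographic scale; all names and parameters are my own.

```python
def pipeline_transform(obj_vertex, model_offset, eye, half_extent, width, height):
    """Toy trip through the pipeline's coordinate systems:
    object space -> world space -> eye space -> NDC -> screen space."""
    # modeling transform: object space -> world space
    wx = obj_vertex[0] + model_offset[0]
    wy = obj_vertex[1] + model_offset[1]
    # viewing transform: world space -> eye space (the eye moves to the origin)
    ex, ey = wx - eye[0], wy - eye[1]
    # orthographic projection: eye space -> normalized device coordinates in [-1, 1]
    nx, ny = ex / half_extent, ey / half_extent
    # viewport transform: NDC -> screen-space pixel coordinates
    return ((nx + 1) / 2 * width, (ny + 1) / 2 * height)

# A vertex that lands exactly at the eye position maps to the screen center.
screen = pipeline_transform((1.0, 1.0), (2.0, 3.0), (3.0, 4.0), 2.0, 100, 100)
```

A vertex at (1,1) in object space, translated by (2,3) into world space, coincides with the eye at (3,4), so it projects to NDC (0,0) and lands at the center pixel (50, 50).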
NYT, May 26, 2003: "TECHNOLOGY; From PlayStation to Supercomputer for $50,000": the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign built a supercomputer using 70 individual Sony PlayStation 2 machines; the project required no hardware engineering other than mounting the PlayStations in a rack and connecting them with a high-speed network switch.

In Summary
- The GPU is fast, and the trend relative to the CPU is already favorable
- But is it being used in non-graphics applications?

A wealth of applications: data analysis, motion planning, particle systems, Voronoi diagrams, geometric optimization, physical simulation, linear solvers, force-field simulation, molecular dynamics, graph drawing, database queries, sorting and searching, range queries, audio, video and image processing ... and graphics too!

Computation & Visualization
The success stories involving the GPU revolve around the merger of computation and visualization:
- Linear system solvers are used for real-time physical simulation
- Voronoi diagrams allow us to perform shape analysis
- n-body computations form the basis for graph drawing and molecular modelling
For large data, visualization = analysis: instead of separate analysis tools and viz tools for interactive data analysis, the GPU combines both in one.

Example 1: Voronoi diagrams (Hoff et al., SIGGRAPH 99)
Example 2: Change detection using histograms

Other examples
Physical system simulation: fluid flow visualization for graphics and scientific computing, one of the most well-studied and successful uses of the GPU.
Image processing and analysis
- A very hot area of research in the GPU world
- Numerous packages (OpenVIDIA, the Apple dev kit) for processing images on the GPU
- Cameras mounted on robots with real-time scene processing

Other examples: SCOUT
- A GPU-based system for expressing general data visualization queries; provides a high-level data processing language
- Data is processed through 'programs'; the user can interact with the output data
- SCOUT demonstrates the effectiveness of the GPU for data visualization and the need for general GPU primitives for processing data, but its computation and visualization are decoupled.
There are also many more examples!

Central Theme of This Tutorial
- The GPU is fast, and viz is built in
- The GPU can be programmed at a high level
- Some of the most challenging aspects of computation and visualization of large data are interactivity and dynamic data; the GPU enables both.

An Extended Example: Reverse Nearest Neighbours
- An instance consists of clients and servers
- Each server "serves" those clients that it is closest to
- What is the load on a server? It is the number of clients that consider this server to be their nearest neighbour (among all servers)
- Hence the term "reverse nearest neighbour": for each server, find its reverse nearest neighbours
- Note: both servers and clients can move, or be inserted or deleted

What do we know?
- First proposed by Korn and Muthukrishnan (2000), in one dimension, static
- Easy if the servers are fixed: compute the Voronoi diagram of the servers and locate the clients in the diagram
- In general, hard to do with moving clients and servers in two dimensions

Algorithmic strategy
- For each of N clients, iterate over each of M servers and find the closest one
- For each of M servers, iterate over each of the N clients that consider it closest, and count them
The apparent "complexity" is M * N ... or is it?

Looking more closely:
- For each of N clients, iterate over each of M servers and find the closest one
- For each of M servers, iterate over each of the N clients that consider it closest, and count them.
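As a concrete sequential sketch of these two loops, the whole RNN load computation fits in a few lines of Python (illustrative code: the function name and the sample coordinates are mine, not from the tutorial):

```python
def rnn_loads(clients, servers):
    """Reverse-nearest-neighbour loads: for every client, find its
    nearest server (the first loop), then count clients per server
    (the second loop). Squared distances suffice for comparisons."""
    loads = [0] * len(servers)
    for cx, cy in clients:
        # nearest-server step: scan all M servers for this client
        best = min(range(len(servers)),
                   key=lambda i: (servers[i][0] - cx) ** 2 + (servers[i][1] - cy) ** 2)
        # counting step: credit one client to its nearest server
        loads[best] += 1
    return loads

servers = [(0.0, 0.0), (10.0, 0.0)]
clients = [(1.0, 1.0), (2.0, 0.0), (9.0, 1.0)]
loads = rnn_loads(clients, servers)   # two clients go to the first server
```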
Each of these loops can be performed in parallel.

Looking more closely:
- For each of N clients (in parallel), iterate over each of M servers and find the closest one
- For each of M servers (in parallel), iterate over each of the N clients that consider it closest, and count them
Complexity is N + M.

Demo. Demo: change the distance function.

RNN as a special case of EM
- In expectation-maximization, each iteration first determines which points are to be associated with which cluster centers (the "nearest neighbour" step)
- Then it computes a new cluster center based on the points in a cluster (the "counting" step)
- We can implement expectation-maximization on the GPU! More later.

What does this example show us?
- The GPU is fast
- The GPU is programmable
- The GPU can deal with dynamic data
- The GPU can be used for interactive visualization
- However, care is needed to get the best performance

Coming up next:
- How do we abstract the GPU?
- What is an efficient GPU program?
- How various data mining primitives are implemented
- Using the GPU for non-standard data visualization

PART II. Stream algorithms using the GPU

Programmable Graphics Pipeline (recap of the pipeline figure from Part I; courtesy the Cg Book [Fernando and Kilgard])

Think parallel
- First cut: each pixel is a processor
- Not really: there are multiple pipes, and each pipe performs bits of the whole
- The mapping is not necessarily known, and can even be load dependent, but the drivers think that way
- Order of completion is not guaranteed, i.e.
the computation cannot be assumed to proceed along "scan" lines.
- Restricted interaction between pixels: we cannot write and read the same location.

Think simple
- SIMD machine: single instruction — the same program is applied at every pixel (recall data parallelism)
- Program size is bounded, and the ability to keep state is limited
- The cost is in changing state: a pass-based model
- Cannot read and write the same memory location in a single pass

Think streams
- Kernels work as transducers: algorithms that take an input, consume it, and output a transform before looking at the next input
- Networks of kernels can be set up

Example: Dense Matrix Multiplication (Larsen/McAllester 2002; Hall, Carr & Hart 2003)
Given matrices X, Y, compute Z = XY:

    for s = 1 to n
      for t = 1 to n
        Z[s,t] = 0
        for m = 1 to n
          Z[s,t] = Z[s,t] + X[s,m] * Y[m,t]

Data structures: the arrays X, Y, Z. How are arrays implemented on GPUs? As texture memory (normally used for shading and bump maps). Fast texture lookups are built into the pipeline, and textures are the primary storage mechanism for general-purpose programming.

Control flow: hoist the m loop to the outside, so the two inner loops range over output entries:

    for m = 1 to n
      for s = 1 to n
        for t = 1 to n
          Z[s,t] = Z[s,t] + X[s,m] * Y[m,t]

From loops to pixels: the two inner loops become "for every pixel (s,t)", and the operations for all pixels are performed in "parallel" — one pass per value of m:

    for m = 1 to n            -- one rendering pass per m
      for pixel (s,t)         -- all pixels in parallel
        Z[s,t] = Z[s,t] + X[s,m] * Y[m,t]

Is that it? SIMD: each pixel (x,y) gets the parameter m and performs texture lookups:

    R1.r = lookup(myxcoord, m, X)
    R2.r = lookup(m, myycoord, Y)
    R3.r = lookup(myxcoord, myycoord, Z)
    mybuffer = R3.r + R1.r * R2.r

After each pass, the buffer is saved into Z.

Analysing theoretically: O(n) passes to multiply two n x n matrices, using O(n^2)-size textures.
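The pass-per-m structure can be simulated in plain Python. A sketch, not GPU code: the two inner loops stand in for the parallel pixels, and each iteration of the outer m loop plays the role of one rendering pass over the output "texture" Z.

```python
def gpu_style_matmul(X, Y):
    """Pass-based dense matrix multiply: the outer loop over m is the
    sequence of rendering passes; within a pass, every output pixel
    (s, t) performs the same multiply-add."""
    n = len(X)
    Z = [[0.0] * n for _ in range(n)]      # the output "texture"
    for m in range(n):                      # one rendering pass per m
        for s in range(n):                  # conceptually parallel
            for t in range(n):              # over all pixels (s, t)
                Z[s][t] += X[s][m] * Y[m][t]
    return Z

Z = gpu_style_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```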
Your favourite parallel algorithm for matrix multiplication can be mapped onto the GPU.
How about Strassen?
- The number of passes is the same, but the total work decreases
- The number of passes can be reduced to O(n^0.9) using special packing ideas [GKV04]

In practice
- Simpler implementations win out. The fastest GPU implementation uses n matrix-vector operations and packs data into 4-vectors.
- It is better than naive code using three for-loops, but not better than optimized CPU code, e.g., in linear algebra packages.
- Trendlines show that the running time of the GPU code grows linearly.
- Packing and additions, e.g. in Strassen-type algorithms, are problematic due to data movement and cache effects.

Summing up: a SIMD machine; pass-based computation; pipelined operation; the cost is in changing state; total work matters.

Revisiting RNN
Recall the two implementations; M*N distances need to be evaluated.
- First idea: M+N passes using M+N processors in parallel
- Second idea: an M x N grid to compute the distances in 1 pass. Find the min in 1 pass? How do we aggregate?

GPU hardware modelling
- The GPU is a parallel processor, where each pixel can be considered to be a single processor (almost)
- The GPU is a streaming processor, where each pixel processes a stream of data
- We can build high-level algorithms (EM) from low-level primitives (distance calculations, counting)

Operation costs
- When evaluating the cost of an operation on the GPU, standard operation counts are not applicable
- The basic unit of cost is the "rendering pass": a stream of data processed by a single program
- The virtual parallelism of the GPU means that we "pretend" that all streams can be processed simultaneously
- Assign a cost of 1 to a pass: this is analogous to external-memory models, where each disk access costs 1 unit
- Many caveats apply

An extreme example: one-pass median finding
- Median finding is a well-studied problem in streaming, with a log n pass lower bound
- Consider repeating the data O(log n) times to get a sequence of size O(n log n)
- At any point the algorithm maintains an upper bound UB and a lower bound LB.
1. The algorithm reads the next n numbers and picks a random element u between LB and UB.
2. The algorithm uses the next n numbers and finds the rank of u.
3. It then sets LB = u or UB = u depending on the counts.
4. Repeat. (Quickfind!)

Why does the example not generalize?
- Read is not the same as Write
- Total work vs. number of passes

PART II.5. Data Mining Primitives

I: Clustering
Problem definition: given N points x_j, find K points c_i to minimize
    sum_j min_i dist(x_j, c_i)^2
- Hall & Hart '03: a (K+N)-pass k-means algorithm
- What we will see: log N passes per iteration; EM and "soft assignments"

K-means, 2D version
- Basic algorithm: recall the Voronoi scenario
- Why the GPU? Motion/updates! When does the crossover happen?

K-means on the GPU
In each pass, given a set of K colors, we can assign a "color" to each point and keep track of the "mass" and "mean" of each color.
Issues: higher dimensions; extension to more general algorithms.

The RNN-type approach
- K*N distances: 1 pass
- Compute the min: logarithmically many passes
- Aggregate: logarithmically many passes
- Normalize and compute the new centers
- High D? Replace the 1 pass by O(D) passes.
- EM? Instead of the min (which is a 0/1 assignment), compute "shares". Same complexity!
Demo: Expectation Maximization

II: Wavelet transforms; image analysis
[Figure: averaging and differencing. Each pair (A, B) is replaced by the average (A+B)/2 and the detail (A-B)/2; the transform then recurses on the averages.]
DWT demo: Wang, Wong, Heng & Leung, http://www.cse.cuhk.edu.hk/~ttwong/demo/dwtgpu/dwtgpu.html

III: FFTs
The twiddle factor e^(2*pi*i*n/N) — what does it remind you of?
FFT demo: Moreland and Angel, http://www.cs.unm.edu/~kmorel/documents/fftgpu/

IV: Sorting
Networks! Can be used for quantiles and computing histograms.
GPUSort [Govindaraju et al. 2005]: basically "network computations"; each pixel performs some local piece only.

In summary, the big idea: Think parallel. Think simple.
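Returning to the wavelet section above: the averaging-and-differencing recursion can be sketched in Python. This is an illustrative, unnormalised Haar transform with names of my own choosing; it assumes the input length is a power of two. On the GPU, each level is one rendering pass, giving O(log n) passes overall.

```python
def haar_step(values):
    """One level of the (unnormalised) Haar transform: each pair (a, b)
    becomes an average (a + b) / 2 and a detail (a - b) / 2. On the GPU,
    each output pixel computes one average or one detail."""
    averages = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    details = [(values[i] - values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return averages, details

def haar_transform(values):
    """Recurse on the averages, collecting details coarse-to-fine."""
    coeffs = []
    while len(values) > 1:
        values, details = haar_step(values)
        coeffs = details + coeffs
    return values + coeffs   # overall average, then detail coefficients

result = haar_transform([9, 7, 3, 5])
```

The first level turns [9, 7, 3, 5] into averages [8, 4] with details [1, -1]; the second level turns [8, 4] into the overall average 6 with detail 2.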
Think streams.
- SIMD, pass-based computing
- The tradeoff of passes versus total cost
- "Ordered" access

Next up
We saw "number-crunching" algorithms. What are the applications in mining which are "combinatorial" in nature and yet visual? And of course, resources and how-tos.

Part III: Other GPU Case Studies and High-level Software Support for the GPU

What we have seen so far
- Computation on the GPU: how to program, the cost model, what is cheap and what is expensive
- Number-based problems, e.g., nearest neighbor (Voronoi), clustering, wavelets, sorting

What you will see
- Examples with different kinds of input data: 1. computation of depth contours, 2. graph drawing
- Languages and tools to program the GPU
- Final wrap-up

Depth Contours
[Figure: a planar point set labelled with location depths (1, 2, 3, 5, 6, 7); the highlighted point has location depth 1.]
- k-contour: the set of all points having location depth >= k
- Every n-point set has an (n/3)-contour
- The Tukey median is the deepest contour
- We wish to compute the set of all depth contours

Motivation
- A special case of a general non-parametric data analysis tool: "visualize the location, spread, correlation, skewness, and tails of the data" [Tukey et al. 1999]
- Hypothesis testing, robust statistics

Point-Hyperplane Duality
- A mapping from R^2 to R^2: points get mapped to lines and vice versa
- The point (a, b) maps to the line y = -ax + b; the line y = ax + b maps to the point (a, b)
- Incidence properties are preserved
- Contours correspond to k-levels: the k-contour is given by the convex hulls of the k- and (n-k)-levels in the dual.
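The incidence-preserving property of this duality can be checked directly in Python. A quick sketch with helper names of my own; the mapping follows the slide: a point (a, b) dualises to the line y = -ax + b, and a line y = cx + d dualises to the point (c, d).

```python
def point_to_dual_line(a, b):
    """The point (a, b) becomes the line y = -a*x + b,
    represented as a (slope, intercept) pair."""
    return (-a, b)

def line_to_dual_point(slope, intercept):
    """The line y = slope*x + intercept becomes the point (slope, intercept)."""
    return (slope, intercept)

def on_line(point, line):
    """Is the point on the line y = slope*x + intercept?"""
    slope, intercept = line
    x, y = point
    return abs(y - (slope * x + intercept)) < 1e-9

# Incidence is preserved: p lies on l iff dual(l) lies on dual(p).
p = (2.0, 3.0)
l = (5.0, -7.0)                       # the line y = 5x - 7; p lies on it
assert on_line(p, l)
assert on_line(line_to_dual_point(*l), point_to_dual_line(*p))
```

Algebraically, b = ca + d (the primal incidence) and d = -ac + b (the dual incidence) are the same equation, which is why the check succeeds in both worlds.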
Main Algorithm
- Draw a primal line for each dual pixel that lies on a dual line
- Record only (at each primal pixel) the line whose dual pixel has least depth
- Recall nearest neighbor (Voronoi) and clustering: we are using the geometry heavily
Demo: Depth Contours

Dynamic Depth Contours
- Smooth but random movement of the points
- Compute incremental depth changes using duals
- Worst-case quadratic updates might be performed; in practice, only a small number of updates occur
- Real-time speeds can be achieved for the first time

Graph Drawing
- A graph is a set of vertices together with a boolean relation between pairs of vertices that defines the edges
- Given a graph, lay out the vertices and edges in a plane based on some aesthetics
- Abstract graph example: 9 vertices, 12 edges, with adjacency lists
  v1: v2, v4; v2: v1, v3, v5; v3: v2, v6; v4: v1, v5, v7; v5: v2, v4, v6, v8; v6: v3, v5, v9; v7: v4, v8; v8: v5, v7, v9; v9: v6, v8
- After graph layout (grid3.graph), the vertices fall on a 3x3 grid, and we can visualize the relationships much better

Force-Directed Layouts
- Start with a random initial placement of the vertices
- Define an energy field on the plane using forces between vertices (both attractive and repulsive):

    F_attractive(v_i) = sum over (v_i, v_j) in E of   c * ||v_j - v_i|| * (v_j - v_i)
    F_repulsive(v_i)  = sum over j != i of   (1/c) * (v_i - v_j) / ||v_i - v_j||^2

- Evolve the system based on the energy field; local energy minimization defines the layout
- This is essentially solving an n-body problem

GPU Algorithm
Naive approach — at each iteration:
- Compute the pairwise repulsive forces and the attractive forces between edges
- The force on each vertex is the sum of the individual forces
- The resultant force defines the vertex displacement; displace the vertices and iterate
Complexity: O(n) per iteration. Bottleneck: summing the individual forces.
Optimized approach: use a logarithmic parallel summation algorithm. Complexity: O(log n) per iteration.

Parallel Summation Algorithm
[Figure: summing 5, 9, -3, 8, -7, 4, 1, 11, -6 by repeatedly adding element i+offset into element i, with the offset halving (4, 2, 1) each round: O(log n) steps.]

Demo: Graph Drawing

Software tools: GPGPU Languages
Why do we want them? To make programming GPUs easier!
- No need to know OpenGL, DirectX, or the ATI/NV extensions
- Simplify common operations
- Focus on the algorithm, not on the implementation
Sh: University of Waterloo, http://libsh.sourceforge.net
Brook: Stanford University, http://brook.sourceforge.net

Brook: Streams (slides courtesy Ian Buck)
Streams are collections of records requiring similar computation: particle positions, voxels, FEM cells, ...

    float3 positions<200>;
    float3 velocityfield<100,100,100>;

They are similar to arrays, but index operations are disallowed (no position[i]); reading and writing go through stream operators:

    streamRead(positions, p_ptr);
    streamWrite(velocityfield, v_ptr);

This encourages data parallelism.

Brook: Kernels
Kernels are functions applied to streams, similar to a for_all construct. There are no dependencies between stream elements, which encourages high arithmetic intensity:

    kernel void foo (float a<>, float b<>, out float result<>) {
      result = a + b;
    }

    float a<100>;
    float b<100>;
    float c<100>;
    foo(a, b, c);

is equivalent to:

    for (i = 0; i < 100; i++)
      c[i] = a[i] + b[i];
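The logarithmic parallel summation used in the graph-drawing section can be written out in Python. An illustrative sketch: it assumes the input length is a power of two (texture-based implementations typically pad to arrange this), and each while-iteration corresponds to one conceptual rendering pass in which all the additions are independent.

```python
def parallel_sum(values):
    """Logarithmic parallel summation: in each round, element i adds
    element i + offset into itself, and the offset halves. After
    log2(n) rounds the total sits in element 0."""
    buf = list(values)
    offset = len(buf) // 2
    while offset >= 1:
        # all adds in a round are independent, hence parallel on the GPU
        for i in range(offset):
            buf[i] = buf[i] + buf[i + offset]
        offset //= 2
    return buf[0]

total = parallel_sum([5, 9, -3, 8, -7, 4, 1, 11])   # 3 rounds for 8 values
```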
Courtesy: Ian Buck.

Brook: Reductions
Reductions compute a single value from a stream:

    reduce void sum (float a<>, reduce float r<>) {
      r += a;
    }

    float a<100>;
    float r;
    sum(a, r);

is equivalent to:

    r = a[0];
    for (int i = 1; i < 100; i++)
      r += a[i];

Only associative operations are allowed — (a+b)+c = a+(b+c), so the result is order independent: sum, multiply, max, min, OR, AND, XOR; matrix multiply.

Multi-dimensional reductions: differences in stream "shape" are resolved by the reduce function:

    float a<20>;
    float r<5>;
    sum(a, r);

is equivalent to:

    for (int i = 0; i < 5; i++) {
      r[i] = a[i*4];
      for (int j = 1; j < 4; j++)
        r[i] += a[i*4 + j];
    }

Brook: Matrix-Vector Multiply

    kernel void mul (float a<>, float b<>, out float result<>) {
      result = a * b;
    }
    reduce void sum (float a<>, reduce float result<>) {
      result += a;
    }

    float matrix<20,10>;
    float vector<10,1>;
    float tempmv<20,10>;
    float result<20,1>;

    mul(matrix, vector, tempmv);
    sum(tempmv, result);

[Figure: the matrix M times the replicated vector V gives the temporary T; row-wise reduction of T gives the result R.]

Demo: Matrix-Vector Multiplication
Demo: Bitonic Sort and Binary Search

Debugging Tools
Debugger features:
- Per-pixel register watch, including interpolants, outputs, etc.
- Breakpoints
- Fragment program interpreter: single step forwards or backwards, execute modified code on the fly
(Slide courtesy: Tim Purcell)

Shadesmith (http://graphics.stanford.edu/projects/shadesmith)
- A debugger in the spirit of imdebug: simply add a debug statement when binding shaders
- Display window with scale, bias, and component masking
- Advanced features: watch any shader register's contents without recompiling; shader single stepping (forward and backward) and breakpointing; shader source edit and reload without recompiling
Demo: Shadesmith

GPGPU.org — your first stop for GPGPU information!
- News: www.GPGPU.org
- Discussion: www.GPGPU.org/forums
- Developer resources, sample code, tutorials: www.GPGPU.org/developer
- And for open-source GPGPU software: http://gpgpu.sourceforge.net/

Summing up: what we saw in this tutorial
- The GPU is fast, outperforming the CPU, and has been shown to be effective in a variety of applications
- High memory bandwidth and parallel processing capabilities
- Interactive visualization applications: data mining, simulations, streaming operations, spatial database operations
- High-level programming and debugging support

Limitations of the GPU
- Not designed as a general-purpose processor: it is meant to extract maximum performance for the highly parallel tasks of computer graphics, and will not be suited for all applications
- Bad for "pointer chasing" tasks (word processing)
- No bit shifts or bit-wise logical operations (cryptography)
- No double-precision arithmetic yet (large-scale scientific applications)
- Unusual programming model

Conclusions: looking forward
- Increased performance
- More features: double-precision arithmetic, bit-wise operations, random bits
- Increased programmability and generality: strike the right balance between generality and performance
- Other data-parallel processors will emerge: the Cell processor by IBM, Sony and Toshiba is better suited to general-purpose stream computing

Questions? Slides can be downloaded soon from http://www.research.att.com/areas/visualization/gpgpu.html