GPUs/Data Parallel Accelerators Dezső Sima Nov. 2008 Ver. 1.0 © Dezső Sima 2008 Contents 1.Introduction 2. Basics of the SIMT execution 3. Overview of GPGPUs 4. Overview of data parallel accelerators 5. Microarchitecture and operation 5.1 Nvidia’s GPGPU line 5.2 Intel’s Larrabee 6. References 1. The emergence of GPGPUs 1. Introduction (1) Based on its FP32 computing capability and the large number of FP-units available the unified shader is a prospective candidate for speeding up HPC! GPUs with unified shader architectures also termed as GPGPUs (General Purpose GPUs) 1. Introduction (2) Figure: Peak SP FP performance of Nvidia’s GPUs vs Intel’ P4 and Core2 processors [11] 1. Introduction (3) Figure: Bandwidth values of Nvidia’s GPU’s vs Intel’s P4 and Core2 processors [11] 1. Introduction (4) Not cached Figure: Contrasting the utilization of the silicon area in CPUs and GPUs [11] 2. Basics of the SIMT execution 2. Basics of the SIMT execution (1) Main alternatives of data parallel execution Data parallel execution SIMD execution SIMT execution • One dimensional data parallel execution, • One/two dimensional data parallel execution, i.e. it performs the same operation i.e. it performs the same operation on all elements of given on all elements of given FX/FP input vectors FX/FP input arrays (vectors/matrices) • is massively multithreaded, and provides • data dependent flow control as well as • barrier synchronization Needs an FX/FP SIMD extension of the ISA E.g. 2. and 3. generation superscalars Needs an FX/FP SIMT extension of the ISA or the API GPGPUs, data parallel accelerators Figure: Main alternatives of data parallel execution 2. Basics of the SIMT execution (2) Scalar execution SIMD execution SIMT execution Domain of execution: single data elements Domain of execution: elements of vectors Domain of execution: elements of matrices (at the programming level) Figure: Domains of execution in case of scalar, SIMD and SIMT execution Remark SIMT execution is also termed as SPMD (Single_Program Multiple_Data) execution (Nvidia) 2. Basics of the SIMT execution (3) Key components of the implementation of SIMT execution • Data parallel execution • Massive multithreading • Data dependent flow control • Barrier synchronization 2. Basics of the SIMT execution (4) Data parallel execution Performed by SIMT cores SIMT cores execute the same instruction stream on a number of ALUs (i.e. all ALUs of a SIMT core perform typically the same operation). SIMT core Fetch/Decode ALU ALU ALU ALU ALU ALU ALU ALU Figure: Basic layout of a SIMT core SIMT cores are the basic building blocks of GPGPU or data parallel accelerators. During SIMT execution 2-dimensional matrices will be mapped to blocks of SIMT cores. 2. Basics of the SIMT execution (5) Remark 1 Different manufacturers designate SIMT cores differently, such as • streaming multiprocessor (Nvidia), • superscalar shader processor (AMD), • wide SIMD processor, CPU core (Intel). 2. Basics of the SIMT execution (6) Each ALU is allocated a working register set (RF) Fetch/Decode ALU ALU ALU ALU ALU ALU ALU ALU RF RF RF RF RF RF RF RF Figure: Main functional blocks of a SIMT core 2. Basics of the SIMT execution (7) SIMT ALUs perform typically, RRR operations, that is ALUs take their operands from and write the calculated results to the register set (RF) allocated to them. RF ALU Figure: Principle of operation of the SIMD ALUs 2. 
Basics of the SIMT execution (8) Remark 2 In practice, the register sets (RF) allocated to the individual ALUs are distinct portions of one sufficiently large register file. Figure: Allocation of distinct parts of a large register file as workspaces of the ALUs 2. Basics of the SIMT execution (9) Basic operation of recent SIMT ALUs
• they primarily execute SP FP MADD (single precision, i.e. 32-bit, Multiply-Add) instructions of the form a×b+c,
• they are pipelined and capable of starting a new operation every clock cycle (more precisely, every shader clock cycle); thus, without further enhancements, their peak performance is 2 SP FP operations/cycle,
• they need a few clock cycles, e.g. 4 shader cycles, to deliver the results of the SP FMADD operations to the RF.
2. Basics of the SIMT execution (10) Additional operations provided by SIMT ALUs
• Most SIMT ALUs can also execute FX operations and FX/FP conversions. E.g. Nvidia's and AMD/ATI's SIMT ALUs can execute FX add, multiply, divide and shift operations as well. By contrast, Intel preferred to use a dedicated scalar unit for FX operations alongside its SIMT ALUs (which are termed the vector unit).
2. Basics of the SIMT execution (11) Enhancements of SIMT cores Typically, beyond a number of identical SIMD ALUs (occasionally designated as the vector unit), SIMT cores also include one or more dedicated units to speed up special computations not supported by the SIMD ALUs, such as
• double precision (DP) FP operations,
• trigonometric functions, such as sin, cos, etc.
Examples: the most recent GPGPU cores of Nvidia and AMD/ATI (GT200, RV770). 2. Basics of the SIMT execution (12) Massive multithreading Multithreading is implemented by creating and managing parallel executable threads for each data element of the execution domain (the same instructions are issued for all data elements). Figure: Parallel executable threads for each element of the execution domain 2. Basics of the SIMT execution (13) Aim of multithreading Speeding up computations
• by increasing the utilization of the available computing resources when threads stall due to long-latency operations (achieved by suspending stalled threads and allocating the freed computing resources to runnable threads),
• by increasing the utilization of the available silicon area for performing computations rather than for implementing sophisticated cache systems (achieved by hiding memory access latencies through multithreading).
2. Basics of the SIMT execution (14) Effective implementation of multithreading Multithreading is effective when thread switches, called context switches, do not cause cycle penalties. This is achieved by
• providing and maintaining a separate context for each thread, and
• implementing a zero-cycle context switch mechanism.
2. Basics of the SIMT execution (15) Figure: Providing separate thread contexts (CTX) in the register file (RF) for each thread allocated for execution on a SIMT ALU 2. Basics of the SIMT execution (16) Data dependent flow control SIMT branches allow data dependent thread execution.
In SIMT processing both paths of a branch are executed such that for each path the prescribed operations are executed on all data elements which obey the data condition valid for that path (e.g. xi > 0). Example 2. Basics of the SIMT execution (17) Figure: Execution of branches [24] The given condition will be checked separately for each thread 2. Basics of the SIMT execution (18) First all ALUs meeting the condition execute the prescibed three operations, then all ALUs missing the condition execute the next two operatons Figure: Execution of branches [24] 2. Basics of the SIMT execution (19) Figure: Resuming instruction stream processing after executing a branch [24] 2. Basics of the SIMT execution (20) Barrier synchronization Allows to let complete all prior instructions before executing the next instruction. Implemented in AMD’s Intermediate Language (IL) by the fence_threads instruction [10]. Remark In the R600 ISA this instruction is coded by setting the BARRIER field of the Control Flow (CF) instruction format [7]. 3. Overview of GPGPUs 3. Overview of GPGPUs (1) Basic implementation alternatives of the SIMT execution GPGPUs Data parallel accelerators Dedicated units Programmable GPUs supporting data parallel execution with appropriate with appropriate programming environments programming environment Have display outputs E.g. Nvidia’s 8800 and GTX lines AMD’s HD 38xx, Hd48xx lines No display outputs Have larger memories than GPGPUs Nvidia’s Tesla lines AMD’s FireStream lines Figure: Basic implementation alternatives of the SIMT execution 3. Overview of GPGPUs (2) GPGPUs AMD/ATI’s line Nvidia’s line 90 nm 80 nm G80 Shrink 65 nm Shrink Enhanced arch. G92 R600 G200 55 nm RV670 Enhanced arch. Figure: Overview of Nvidia’s and AMD/ATI’s GPGPU lines RV770 3. Overview of GPGPUs (3) NVidia 10/07 11/06 G80 G92 GT200 90 nm/681 mtrs 65 nm/754 mtrs 65 nm/1400 mtrs Cores Cards 6/08 8800 GTS 96 ALUs 320-bit 8800 GTX 8800 GT GTX260 GTX280 128 ALUs 384-bit 112 ALUs 256-bit 192 ALUs 448-bit 240 ALUs 512-bit 11/07 6/07 CUDA Version 1.0 6/08 Version 1.1 Version 2.0 AMD/ATI Cores Cards 11/05 5/07 11/07 5/08 R500 R600 R670 RV770 80 nm/681 mtrs 55 nm/666 mtrs 55 nm/956 mtrs (Xbox) HD 2900XT HD 3850 HD 3870 HD 4850 HD 4870 48 ALUs 320 ALUs 512-bit 320 ALUs 256-bit 320 ALUs 256-bit 800 ALUs 256-bit 800 ALUs 256-bit 11/07 Brooks+ Brook+ 6/08 RapidMind 3870 support 2005 2006 2007 Figure: Overview of GPGPUs 2008 2009 3. Overview of GPGPUs (4) 8800 GTS 8800 GTX 8800 GT GTX 260 GTX 280 Core G80 G80 G92 GT200 GT200 Introduction 11/06 11/06 10/07 6/08 6/08 IC technology 90 nm 90 nm 65 nm 65 nm 65 nm Nr. of transistors 681 mtrs 681 mtrs 754 mtrs 1400 mtrs 1400 mtrs Die are 480 mm2 480 mm2 324 mm2 576 mm2 576 mm2 Core frequency 500 MHz 575 MHz 600 MHz 576 MHz 602 MHz No. stream proc.s 96 128 112 192 240 Shader frequency 1.2 GHz 1.35 GHz 1.512 GHz 1.242 GHz 1.296 GHz 3 3 Computation No. FP32 inst./cycle 3* (but only in a few issue cases) Peak FP32 performance 346 GLOPS 512 GLOPS 508 GLOPS 715 GLOPS 933 GLOPS Peak FP64 performance – – – – 77/76 GLOPS 1600 Mb/s 1800 Mb/s 1800 Mb/s 1998 Mb/s 2214 Mb/s Mem. interface 320-bit 384-bit 256-bit 448-bit 512-bit Mem. bandwidth 64 GB/s 86.4 GB/s 57.6 GB/s 111.9 GB/s 141.7 GB/s Mem. size 320 MB 768 MB 512 MB 896 MB 1.0 GB Mem. type GDDR3 GDDR3 GDDR3 GDDR3 GDDR3 Mem. channel 6*64-bit 6*64-bit 4*64-bit 8*64-bit 8*64-bit Mem. contr. 
Crossbar Crossbar Crossbar Crossbar Crossbar SLI SLI SLI SLI SLI PCIe x16 PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16 10 10 10 10.1 subset 10.1 subset Memory Mem. transfer rate (eff) System Multi. CPU techn. Interface MS Direct X Table: Main features of Nvidia’s GPGPUs 3. Overview of GPGPUs (5) HD 2900XT HD 3850 HD 3870 HD 4850 HD 4870 Core R600 R670 R670 RV770 RV770 Introduction 5/07 11/07 11/07 5/08 5/08 80 nm 55 nm 55 nm 55 nm 55 nm Nr. of transistors 700 mtrs 666 mtrs 666 mtrs 956 mtrs 956 mtrs Die are 408 mm2 192 mm2 192 mm2 260 mm2 260 mm2 Core frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz No. stream proc.s 320 320 320 800 800 Shader frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz 2 2 2 2 2 Peak FP32 performance 471.6 GLOPS 429 GLOPS 496 GLOPS 1000 GLOPS 1200 GLOPS Peak FP64 performance – – – 200 GLOPS 240 GLOPS 1600 Mb/s 1660 Mb/s 2250 Mb/s 2000 Mb/s 3600 Mb/s (GDDR5) 512-bit 256-bit 256-bit 265-bit 265-bit 105.6 GB/s 53.1 GB/s 720 GB/s 64 GB/s 118 GB/s Mem. size 512 MB 256 MB 512 MB 512 MB 512 MB Mem. type GDDR3 GDDR3 GDDR4 GDDR3 GDDR3/GDDR5 Mem. channel 8*64-bit 8*32-bit 8*32-bit 4*64-bit 4*64-bit Mem. contr. Ring bus Ring bus Ring bus Crossbar Crossbar Multi. CPU techn. CrossFire CrossFire X CrossFire X CrossFire X CrossFire X Interface PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16 10 10.1 10.1 10.1 10.1 IC technology Computation No. FP32 inst./cycle Memory Mem. transfer rate (eff) Mem. interface Mem. bandwidth System MS Direct X Table: Main features of AMD/ATIs GPGPUs 3. Overview of GPGPUs (6) Price relations (as of 10/2008) Nvidia GTX260 GTX280 ~ 300 $ ~ 600 $ AMD/ATI HD4850 HD4870 ~ 200 $ na 3. Overview of GPGPUs (7) Data parallel accelerators Implementation alternatives of data parallel accelerators On card implementation On-die integration Recent implementations E.g. Future implementations GPU cards Intel’s Heavendahl Data-parallel accelerator cards AMD’s Torrenza integration technology AMD’s Fusion integration technology Trend Figure: Implementation alternatives of dedicated data parallel accelerators 4. Overview of data parallel accelerators 4. Overview of data parellel accererators (1) On-card accelerators Card implementations Single cards fitting into a free PCI Ex16 slot of the host computer. E.g. Nvidia Tesla C870 Nvidia Tesla C1060 AMD FireStream 9170 Desktop implementations 1U server implementations Usually 4 cards Usually dual cards mounted into a 1U server rack, mounted into a box, connected two adapter cards connected to an that are inserted into adapter card that is inserted into a two free PCIEx16 slots of a server through two switches free PCI-E x16 slot of the and two cables. host PC through a cable. Nvidia Tesla D870 Nvidia Tesla S870 Nvidia Tesla S1070 AMD FireStream 9250 Figure:Implementation alternatives of on-card accelerators 4. Overview of data parellel accererators (2) FB: Frame Buffer Figure: Main functional units of Nvidia’s Tesla C870 card [2] 4. Overview of data parellel accererators (3) Figure: Nvida’s Tesla C870 and AMD’s FireStream 9170 cards [2], [3] 4. Overview of data parellel accererators (4) Figure: Tesla D870 desktop implementation [4] 4. Overview of data parellel accererators (5) Figure: Nvidia’s Tesla D870 desktop implementation [4] 4. Overview of data parellel accererators (6) Figure: PCI-E x16 host adapter card of Nvidia’s Tesla D870 desktop [4] 4. Overview of data parellel accererators (7) Figure: Concept of Nvidia’s Tesla S870 1U rack server [5] 4. 
Overview of data parellel accererators (8) Figure: Internal layout of Nvidia’s Tesla S870 1U rack [6] 4. Overview of data parellel accererators (9) Figure: Connection cable between Nvidia’s Tesla S870 1U rack and the adapter cards inserted into PCI-E x16 slots of the host server [6] 4. Overview of data parellel accererators (10) NVidia Tesla Card 6/07 6/08 C870 C1060 G80-based 1.5 GB GDDR3 0.519 GLOPS GT200-based 4 GB GDDR3 0.936 GLOPS 6/07 Desktop D870 G80-based 2*C870 incl. 3 GB GDDR3 1.037 GLOPS 6/07 IU Server S870 S1070 G80-based 4*C870 incl. 6 GB GDDR3 2.074 GLOPS GT200-based 4*C1060 16 GB GDDR3 3.744 GLOPS 6/07 CUDA 6/08 Version 1.0 11/07 6/08 Version 1.01 Version 2.0 2007 Figure: Overview of Nvidia’s Tesla family 2008 4. Overview of data parellel accererators (11) AMD FireStream 11/07 Card 6/08 9170 9170 RV670-based 2 GB GDDR3 500 GLOPS FP32 ~200 GLOPS FP64 Shipped 6/08 9250 9250 RV770-based 1 GB GDDR3 1 TLOPS FP32 ~300 GFLOPS FP64 Shipped 12/07 Stream Computing SDK Version 1.0 Brook+ ACM/AMD Core Math Library CAL (Computer Abstor Layer) Rapid Mind 2007 10/08 2008 Figure: Overview of AMD/ATI’s FireStream family 4. Overview of data parellel accererators (12) Nvidia Tesla cards AMD FireStream cards Core type C870 C1060 9170 9250 Based on G80 GT200 RV670 RV770 Introduction 6/07 6/08 11/07 6/08 Core frequency 600 MHz 602 MHz 800 MHz 625 MHz ALU frequency 1350 MHz 1296 GHz 800 MHz 325 MHZ 128 240 320 800 Peak FP32 performance 518 GLOPS 933 GLOPS 512 GLOPS 1 TLOPS Peak FP64 performance – – ~200 GLOPS ~250 GLOPS 1600 Gb/s 1600 Gb/s 1600 Gb/s 1986 Gb/s 384-bit 512-bit 256-bit 256-bit 768 GB/s 102 GB/s 51.2 GB/s 63.5 GB/s Mem. size 1.5 GB 4 GB 2 GB 1 GB Mem. type GDDR3 GDDR3 GDDR3 GDDR3 PCI-E x16 PCI-E 2.0x16 PCI-E 2.0x16 PCI-E 2.0x16 171 W 200 W 150 W 150 W Core No. of ALUs Memory Mem. transfer rate (eff) Mem. interface Mem. bandwidth System Interface Power (max) Table: Main features of Nvidia’s and AMD/ATI’s data parallel accelerator cards 4. Overview of data parellel accererators (13) Price relations (as of 10/2008) Nvidia Tesla C870 D870 S870 ~ 1500 $ ~ 5000 $ ~ 7500 $ C1060 ~ 1600 $ S1070 ~ 8000 $ AMD/ATI FireStream 9170 ~ 800 $ 9250 ~ 800 $ 5. Microarchitecture and operation 5.1 Nvidia’s GPGPU line 5.2 AMD/ATI’s GPGPU line 5.3 Intel’s Larrabee 5.1 Nvidia’s GPGPU line 5.1 Nvidia’s GPGPU line (1) Microarchitecture of GPUs Microarchitecture of GPGPUs 3-level microarchitectures Microarchitectures inheriting the structure of programmable GPUs E.g. Nvidia’s and AMD/ATI’s GPGPUs Two-level microarchitectures Dedicated microarchitectures a priory developed to support both graphics and HPC Intel’s Larrabee Figure: Alternative layouts of microarchitectures of GPGPUs 5.1 Nvidia’s GPGPU line (2) Host CPU North Bridge Host memory Command Processor Unit Commands Work Schedeler CB CBA CB Cores 1 CB: Core Blocks CBA: Core Block Array Cores n L1 Cache L1 Cache IN: PCI-E x 16 IF IN Data L2 Hub L2 MC m MC 2x32-bit 2x32-bit MC: Memory Controller Display c. 
Interconnection Network, Global Memory. Figure: Simplified block diagram of recent 3-level GPGPUs/data parallel accelerators (data parallel accelerators do not include display controllers) 5.1 Nvidia’s GPGPU line (3) Terminology used in these slides versus the vendors’ terms:
• C (Core) / SIMT core — Nvidia: SM (Streaming Multiprocessor), Multithreaded processor; AMD/ATI: Shader processor, Thread processor, SIMD Array, SIMD Engine, SIMD core, SIMD
• CB (Core Block) — Nvidia: TPC (Texture Processor Cluster); AMD/ATI: Multiprocessor
• CBA (Core Block Array) — Nvidia: SPA (Streaming Processor Array)
• ALU (Arithmetic Logic Unit) — Nvidia: Streaming Processor, Thread Processor, Scalar ALU; AMD/ATI: Stream Processing Unit, Stream Processor
Table: Terminologies used with GPGPUs/data parallel accelerators 5.1 Nvidia’s GPGPU line (4) Microarchitecture of Nvidia’s GPGPUs: GPGPUs based on 3-level microarchitectures. Figure: Overview of Nvidia’s and AMD/ATI’s GPGPU lines (Nvidia: 90 nm G80, shrink to 65 nm G92, enhanced architecture GT200; AMD/ATI: 80 nm R600, shrink to 55 nm RV670, enhanced architecture RV770) 5.1 Nvidia’s GPGPU line (5) G80/G92 microarchitecture 5.1 Nvidia’s GPGPU line (6) Figure: Overview of the G80 [14] 5.1 Nvidia’s GPGPU line (7) Figure: Overview of the G92 [15] 5.1 Nvidia’s GPGPU line (8) Figure: The Core Block of the G80/G92 [14], [15] 5.1 Nvidia’s GPGPU line (9) Streaming Processors: SIMT ALUs. Figure: Block diagram of the G80/G92 cores [14], [15] 5.1 Nvidia’s GPGPU line (10) Individual components of the core. The SM Register File (RF) holds 8K registers (each 4 bytes wide) and delivers 4 operands/clock; the load/store pipe can also read and write the RF. Figure: Register File [12] 5.1 Nvidia’s GPGPU line (11) Programmer’s view of the Register File. There are 8192 and 16384 registers in each SM in the G80 and the GT200, respectively. This is an implementation decision, not part of CUDA.
• Registers are dynamically partitioned across all thread blocks assigned to the SM.
• Once assigned to a thread block, a register is NOT accessible by threads in other blocks.
• Each thread in the same block only accesses registers assigned to itself.
Figure: The programmer’s view of the Register File, shown for 4 and for 3 thread blocks allocated to an SM [12] 5.1 Nvidia’s GPGPU line (12) The Constant Cache serves immediate-address and indexed-address constants. Constants are stored in DRAM and cached on chip (L1, per SM). A constant value can be broadcast to all threads in a warp, which is an extremely efficient way of accessing a value that is common to all threads in a block. Figure: The constant cache [12] 5.1 Nvidia’s GPGPU line (13) Shared Memory: each SM has 16 KB of Shared Memory, organized as 16 banks of 32-bit words. CUDA uses Shared Memory as shared storage visible to all threads in a thread block, with read and write access; it is not used explicitly in pixel shader programs. Figure: Shared Memory [12] 5.1 Nvidia’s GPGPU line (14) A program needs to manage the global, constant and texture memory spaces visible to kernels through calls to the CUDA runtime. This includes memory allocation and deallocation as well as invoking data transfers between the CPU and the GPU, as the sketch below illustrates.
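The slides themselves do not show these runtime calls, so the following is a minimal sketch, assuming the CUDA 2.0-era runtime API described in [11], of how a host program might allocate device memory, copy data between CPU and GPU, and free it again. The buffer names, the problem size N and the omission of the kernel launch are illustrative assumptions, not taken from the slides.

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const int N = 1 << 20;                  // illustrative problem size (assumption)
    const size_t bytes = N * sizeof(float);

    // Host-side (CPU) buffers
    float* hA = (float*)std::malloc(bytes);
    float* hC = (float*)std::malloc(bytes);
    for (int i = 0; i < N; ++i) hA[i] = 1.0f;

    // Device-side (GPU) buffers in global memory, allocated via the CUDA runtime
    float *dA = 0, *dC = 0;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dC, bytes);

    // Explicit CPU -> GPU transfer, managed by the program as stated above
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);

    // ... a kernel reading dA and writing dC would be launched here
    //     (see the kernel sketch later in this section) ...

    // GPU -> CPU transfer of the results, then deallocation on both sides
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dC);
    std::free(hA);
    std::free(hC);
    return 0;
}
```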
5.1 Nvidia’s GPGPU line (15) Figure: Major functional blocks of G80/G92 ALUs [14], [15] 5.1 Nvidia’s GPGPU line (16) Barrier synchronization
• used to coordinate memory accesses at synchronization points,
• at synchronization points the execution of the threads is suspended until all threads reach this point (barrier synchronization),
• synchronization is achieved by calling the __syncthreads() intrinsic function [11].
5.1 Nvidia’s GPGPU line (17) Principle of operation: based on Nvidia’s data parallel computing model. Nvidia’s data parallel computing model is specified at different levels of abstraction:
• at the Instruction Set Architecture (ISA) level (not disclosed),
• at the intermediate level, i.e. at the level of APIs (not discussed here),
• at the high-level programming language level, by means of CUDA.
5.1 Nvidia’s GPGPU line (18) CUDA [11]
• a programming language and programming environment that allows explicit data parallel execution on an attached massively parallel device (GPGPU),
• its underlying principle is to allow the programmer to target portions of the source code for execution on the GPGPU,
• defined as a set of C-language extensions.
The key element of the language is the notion of the kernel. 5.1 Nvidia’s GPGPU line (19) A kernel is specified by
• using the __global__ declaration specifier,
• a number of associated CUDA threads,
• a domain of execution (grid, blocks), given with the <<<...>>> syntax.
Execution of kernels: when called, a kernel is executed N times in parallel by N associated CUDA threads, as opposed to only once as in the case of regular C functions. 5.1 Nvidia’s GPGPU line (20) Example (see the kernel sketch below): the sample code
• adds two vectors A and B of size N and stores the result into vector C,
• by executing the invoked threads (identified by a one-dimensional index i) in parallel on the attached massively parallel GPGPU, rather than adding the vectors A and B by executing nested loops on the conventional CPU.
Remark: the thread index threadIdx is a vector of up to 3 components that identifies a thread within a one-, two- or three-dimensional thread block. 5.1 Nvidia’s GPGPU line (21) The kernel concept is enhanced by three key abstractions:
• the thread concept,
• the memory concept and
• the synchronization concept.
5.1 Nvidia’s GPGPU line (22) The thread concept is based on a three-level hierarchy of threads:
• grids,
• thread blocks,
• threads.
5.1 Nvidia’s GPGPU line (23) The hierarchy of threads: each kernel invocation on the host (kernel0<<<>>>(), kernel1<<<>>>()) is executed on the device as a grid of thread blocks (Block(i,j)). Figure: Hierarchy of threads [25] 5.1 Nvidia’s GPGPU line (24) Thread blocks and threads. Thread blocks are
• identified by two- or three-dimensional indices,
• equally shaped,
• required to execute independently, that is, they can be scheduled in any order,
• organized into a one- or two-dimensional array,
• provided with a per-block shared memory.
Threads of a thread block are
• identified by thread IDs (the thread number within a block),
• able to share data through fast shared memory,
• synchronized to coordinate memory accesses.
Threads in different thread blocks cannot communicate or be synchronized.
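The sample code referred to in slide (20) is not reproduced in the text above, so the following is a minimal sketch consistent with its description: a kernel, marked with __global__, that adds two vectors A and B of size N into C, with each of the N launched threads identified by the one-dimensional index i. The single-block launch configuration and the concrete value of N are assumptions for illustration, written in the style of the CUDA 2.0 programming guide [11].

```cpp
#include <cuda_runtime.h>

// Kernel definition: executed N times in parallel, once per associated CUDA thread.
__global__ void VecAdd(const float* A, const float* B, float* C)
{
    int i = threadIdx.x;        // one-dimensional thread index within the thread block
    C[i] = A[i] + B[i];
}

int main()
{
    const int N = 256;          // illustrative size (assumption); fits into a single thread block
    float *A, *B, *C;
    cudaMalloc((void**)&A, N * sizeof(float));
    cudaMalloc((void**)&B, N * sizeof(float));
    cudaMalloc((void**)&C, N * sizeof(float));
    // (Data initialization and host<->device transfers are omitted; see the earlier runtime sketch.)

    // Execution configuration <<<grid, block>>>: a grid of 1 thread block containing N threads.
    VecAdd<<<1, N>>>(A, B, C);

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

For problem sizes beyond the block-size limit, the same kernel would be launched over many blocks, e.g. VecAdd<<<(N + 255) / 256, 256>>>(A, B, C) with the index computed as i = blockIdx.x * blockDim.x + threadIdx.x.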
Figure: Thread blocks and threads [11] 5.1 Nvidia’s GPGPU line (25) The memory concept Threads have • • • • • private registers (R/W access) per block shared memory (R/W access) per grid global memory (R/W access) per block constant memory (R access) per TPC texture memory (R access) Shared memory is organized into banks (16 banks in version 1) The global, constant and texture memory spaces can be read from or written to by the CPU and are persistent across kernel launches by the same application. Figure: Memory concept [26] (revised) 5.1 Nvidia’s GPGPU line (26) Mapping of the memory spaces of the programming model to the memory spaces of the streaming processor A thread block is scheduled for execution to a particular multithreaded SM Streaming Multiprocessor 1 (SM 1) SMs are the fundamental processing units for CUDA thread blocks An SM incorporates 8 Execution Units (designated a Processors in the figure) Figure: Memory spaces of the SM [7] 5.1 Nvidia’s GPGPU line (27) The synchronization concept Barrier synchronization • used to coordinate memory accesses at synchronization points, • at synchronization points the execution of the threads is suspended until all threads reach this point (barrel synchronization) • synchronization is achieved by the declaration void_syncthreads(); 5.1 Nvidia’s GPGPU line (28) GT200 5.1 Nvidia’s GPGPU line (29) Figure: Block diagram of the GT200 [16] 5.1 Nvidia’s GPGPU line (30) Figure: The Core Block of the GT200 [16] 5.1 Nvidia’s GPGPU line (31) Streaming Multiprocessors: SIMT cores Figure: Block diagram of the GT200 cores [16] 5.1 Nvidia’s GPGPU line (32) Figure: Major functional blocks of GT200 ALUs [16] 5.1 Nvidia’s GPGPU line (33) Figure: Die shot of the GT 200 [17] 5.2 Intel’s Larrabee 5.2 Intel’s Larrabee (1) Larrabee Part of Intel’s Tera-Scale Initiative. • Objectives: High end graphics processing, HPC Not a single product but a base architecture for a number of different products. • Brief history: Project started ~ 2005 First unofficial public presentation: 03/2006 (withdrawn) First brief public presentation 09/07 (Otellini) [29] First official public presentations: in 2008 (e.g. at SIGGRAPH [27]) Due in ~ 2009 • Performance (targeted): 2 TFlops 5.2 Intel’s Larrabee (2) NI: New Instructions Figure: Positioning of Larrabee in Intel’s product portfolio [28] 5.2 Intel’s Larrabee (3) Figure: First public presentation of Larrabee at IDF Fall 2007 [29] 5.2 Intel’s Larrabee (4) Basic architecture Figure: Block diagram of the Larrabee [30] • Cores: In order x86 IA cores augmented with new instructions • L2 cache: fully coherent • Ring bus: 1024 bits wide 5.2 Intel’s Larrabee (5) Figure: Block diagram of Larrabee’s cores [31] 5.2 Intel’s Larrabee (6) Larrabee’ microarchitecture [27] Derived from that of the Pentium’s in order design 5.2 Intel’s Larrabee (7) Main extensions • 64-bit instructions • 4-way multithreaded (with 4 register sets) • addition of a 16-wide (16x32-bit) VU • increased L1 caches (32 KB vs 8 KB) • access to its 256 KB local subset of a coherent L2 cache • ring network to access the coherent L2 $ and allow interproc. communication. Figure: The anchestor of Larrabee’s cores [28] 5.2 Intel’s Larrabee (8) New instructions allow explicit cache control • to prefetch data into the L1 and L2 caches • to control the eviction of cache lines by reducing their priority. the L2 cache can be used as a scratchpad memory while remaining fully coherent. 
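The explicit cache control of slide (8) above is described only in general terms, and Larrabee's new instructions are not enumerated in the sources cited here; the sketch below therefore uses the standard x86 prefetch hints exposed by mainstream compilers (_mm_prefetch from <xmmintrin.h>) purely as an analogy for software-directed prefetching and reduced-priority (non-temporal) caching, not as actual Larrabee code. The look-ahead distance is a tuning assumption.

```cpp
#include <xmmintrin.h>   // _mm_prefetch and the _MM_HINT_* constants (standard x86, not Larrabee-specific)
#include <cstddef>

// Sum an array while issuing software prefetch hints, analogous in spirit to the
// explicit cache control described above: fetch data ahead of use, and optionally
// mark streamed-through data as low priority for retention (non-temporal hint).
float sum_with_prefetch(const float* data, std::size_t n)
{
    const std::size_t lookahead = 64;   // elements ahead of the current position (assumption)
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + lookahead < n) {
            // Hint: bring a future cache line into the cache hierarchy before it is needed.
            _mm_prefetch(reinterpret_cast<const char*>(data + i + lookahead), _MM_HINT_T0);
            // With _MM_HINT_NTA instead, the line would be marked for early eviction,
            // akin to reducing its priority as described in slide (8).
        }
        acc += data[i];
    }
    return acc;
}
```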
5.2 Intel’s Larrabee (9) The Scalar Unit • supports the full ISA of the Pentium (it can run existing code including OS kernels and applications) • provides new instructions, e.g. for • bit count • bit scan (it finds the next bit set within a register). 5.2 Intel’s Larrabee (10) The Vector Unit Mask registers have one bit per bit lane, to control which bits of a vector reg. or memory data are read or written and which remain untouched. VU scatter-gather instructions (load a VU vector register from 16 non-contiguous data locations from anywhere from the on die L1 cache without penalty, or store a VU register similarly). Numeric conversions 8-bit, 16-bit integer and 16 bit FP data can be read from the L1 $ or written into the L1 $, with conversion to 32-bit integers without penalty. L1 D$ becomes as an extension of the register file Figure: Block diagram of the Vector Unit [31] 5.2 Intel’s Larrabee (11) ALUs • ALUs execute integer, SP and DP FP instructions • Multiply-add instructions are available. Figure: Layout of the 16-wide vector ALU [31] 5.2 Intel’s Larrabee (12) Task scheduling performed entirely by software rather than by hardware, like in Nvidia’s or AMD/ATI’s GPGPUs. 5.2 Intel’s Larrabee (13) SP FP performance 2 operations/cycle 16 ALUs 32 operations/core At present no data available for the clock frequency or the number of cores in Larrabee. Assuming a clock frequency of 2 GHz and 32 cores SP FP performance: 2 TFLOPS 5.2 Intel’s Larrabee (14) Figure: Larrabee’s software stack (Source Intel) Larrabee’s Native C/C++ compiler allows many available apps to be recompiled and run correctly with no modifications. 6. References (1) 6. References [1]: Torricelli F., AMD in HPC, HPC07, http://www.altairhyperworks.co.uk/html/en-GB/keynote2/Torricelli_AMD.pdf [2]: NVIDIA Tesla C870 GPU Computing Board, Board Specification, Jan. 2008, Nvidia [3] AMD FireStream 9170, http://ati.amd.com/technology/streamcomputing/product_firestream_9170.html [4]: NVIDIA Tesla D870 Deskside GPU Computing System, System Specification, Jan. 2008, Nvidia, http://www.nvidia.com/docs/IO/43395/D870-SystemSpec-SP-03718-001_v01.pdf [5]: Tesla S870 GPU Computing System, Specification, Nvida, http://jp.nvidia.com/docs/IO/43395/S870-BoardSpec_SP-03685-001_v00b.pdf [6]: Torres G., Nvidia Tesla Technology, Nov. 2007, http://www.hardwaresecrets.com/article/495 [7]: R600-Family Instruction Set Architecture, Revision 0.31, May 2007, AMD [8]: Zheng B., Gladding D., Villmow M., Building a High Level Language Compiler for GPGPU, ASPLOS 2006, June 2008 [9]: Huddy R., ATI Radeon HD2000 Series Technology Overview, AMD Technology Day, 2007 http://ati.amd.com/developer/techpapers.html [10]: Compute Abstraction Layer (CAL) Technology – Intermediate Language (IL), Version 2.0, Oct. 2008, AMD 6. References (2) [11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0, June 2008, Nvidia [12]: Kirk D. & Hwu W. W., ECE498AL Lectures 7: Threading Hardware in G80, 2007, University of Illinois, Urbana-Champaign, http://courses.ece.uiuc.edu/ece498/al1/ lectures/lecture7-threading%20hardware.ppt#256,1,ECE 498AL Lectures 7: Threading Hardware in G80 [13]: Kogo H., R600 (Radeon HD2900 XT), PC Watch, June 26 2008, http://pc.watch.impress.co.jp/docs/2008/0626/kaigai_3.pdf [14]: Nvidia G80, Pc Watch, April 16 2007, http://pc.watch.impress.co.jp/docs/2007/0416/kaigai350.htm [15]: GeForce 8800GT (G92), PC Watch, Oct. 
31 2007, http://pc.watch.impress.co.jp/docs/2007/1031/kaigai398_07.pdf [16]: NVIDIA GT200 and AMD RV770, PC Watch, July 2 2008, http://pc.watch.impress.co.jp/docs/2008/0702/kaigai451.htm [17]: Shrout R., Nvidia GT200 Revealed – GeForce GTX 280 and GTX 260 Review, PC Perspective, June 16 2008, http://www.pcper.com/article.php?aid=577&type=expert&pid=3 [18]: http://en.wikipedia.org/wiki/DirectX [19]: Dietrich S., “Shader Model 3.0, April 2004, Nvidia, http://www.cs.umbc.edu/~olano/s2004c01/ch15.pdf [20]: Microsoft DirectX 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006, Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html 6. References (3) [21]: Patidar S. & al., “Exploiting the Shader Model 4.0 Architecture, Center for Visual Information Technology, IIIT Hyderabad, http://research.iiit.ac.in/~shiben/docs/SM4_Skp-Shiben-Jag-PJN_draft.pdf [22]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html [23]: Graphics Pipeline Rendering History, Aug. 22 2008, PC Watch, http://pc.watch.impress.co.jp/docs/2008/0822/kaigai_06.pdf [24]: Fatahalian K., “From Shader Code to a Teraflop: How Shader Cores Work,” Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008, [25]: Kanter D., “NVIDIA’s GT200: Inside a Parallel Processor,” 09-08-2008 [26]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 1.1, Nov. 2007, Nvidia [27]: Seiler L. & al., “Larrabee: A Many-Core x86 Architecture for Visual Computing,” ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008 [28]: Kogo H., “Larrabee”, PC Watch, Oct. 17, 2008, http://pc.watch.impress.co.jp/docs/2008/1017/kaigai472.htm [29]: Shrout R., IDF Fall 2007 Keynote, Sept. 18, 2007, PC Perspective, http://www.pcper.com/article.php?aid=453 6. References (4) [30]: Stokes J., Larrabee: Intel’s biggest leap ahead since the Pentium Pro,” Aug. 04. 2008, http://arstechnica.com/news.ars/post/20080804-larrabeeintels-biggest-leap-ahead-since-the-pentium-pro.html [31]: Shimpi A. L. C Wilson D., “Intel's Larrabee Architecture Disclosure: A Calculated First Move, Anandtech, Aug. 4. 2008, http://www.anandtech.com/showdoc.aspx?i=3367&p=2 [32]: Hester P., “Multi_Core and Beyond: Evolving the x86 Architecture,” Hot Chips 19, Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf [33]: AMD Stream Computing, User Guide, Oct. 2008, Rev. 1.2.1 http://ati.amd.com/technology/streamcomputing/ Stream_Computing_User_Guide.pdf [34]: Doggett M., Radeon HD 2900, Graphics Hardware Conf. Aug. 2007, http://www.graphicshardware.org/previous/www_2007/presentations/ doggett-radeon2900-gh07.pdf [35]: Mantor M., “AMD’s Radeon Hd 2900,” Hot Chips 19, Aug. 2007, http://www.hotchips.org/archives/hc19/2_Mon/HC19.03/HC19.03.01.pdf [36]: Houston M., “Anatomy if AMD’s TeraScale Graphics Engine,”, SIGGRAPH 2008, http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf [37]: Mantor M., “Entering the Golden Age of Heterogeneous Computing,” PEEP 2008, http://ati.amd.com/technology/streamcomputing/IUCAA_Pune_PEEP_2008.pdf 6. References (5) [38]: Kogo H., RV770 Overview, PC Watch, July 02 2008, http://pc.watch.impress.co.jp/docs/2008/0702/kaigai_09.pdf