COMP5426 Parallel and Distributed Computing: Parallel Architectures

Basic Functional Units
[Figure: a computer's basic functional units: the central processing unit, main memory, and input/output, joined by the system's interconnection, with peripherals attached via communication lines.]

Implicit Parallelism
- Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude).
- Higher levels of device integration have made available a large number of transistors.
- How best to utilize these resources? Conventionally, use them in multiple functional units and execute multiple instructions in the same cycle (instruction-level parallelism, ILP):
  - Pipelining
  - Superscalar execution

Pipelining
- Pipelining overlaps the various stages of instruction execution (e.g., fetch and execute) to achieve performance.
[Figure: timing diagram in which instructions I1, I2, I3 each pass through a fetch (F) and an execute (E) stage, with the stages of successive instructions overlapped in time.]

Pipelining
- Limitations:
  - The speed of a pipeline is eventually limited by the slowest stage; this pushes designs toward more stages, i.e., very deep pipelines.
  - However, in typical program traces, every fifth or sixth instruction is a conditional jump! This requires very accurate branch prediction.
  - The penalty of a misprediction grows with the depth of the pipeline, since a larger number of instructions will have to be flushed.

Superscalar
- Multiple redundant functional units within each CPU, so that multiple instructions can be executed on separate data items concurrently.
- Early designs: two ALUs and a single FPU.
- Modern designs have more; e.g., the PowerPC 970 includes four ALUs and two FPUs, as well as two SIMD units.

Superscalar
- The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of them to execute concurrently.
- Very Long Instruction Word (VLIW) processors instead rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.

Superscalar
- Limitations:
  - The degree of intrinsic parallelism in the instruction stream, i.e., the limited amount of instruction-level parallelism.
  - The complexity and time cost of the dispatcher and the associated dependency-checking logic.
  - The performance of the system as a whole suffers if the dispatcher is unable to keep all of the units fed with instructions.

Implicit Parallelism
- Things that affect performance (a short code sketch follows this list):
  - True data dependency: the result of one operation is an input to the next.
  - Resource dependency: two operations require the same resource.
  - Branch dependency: scheduling instructions across conditional branch statements cannot be done deterministically a priori.
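To make the first point concrete, here is a minimal sketch (not from the slides; the file name and values are made up) contrasting a true data dependency chain, which forces serial execution, with independent operations that a superscalar core can issue in the same cycle:

    // ilp_demo.cpp -- illustrative only; builds with any C++ compiler.
    #include <cstdio>

    int main() {
        double a[4] = {1.0, 2.0, 3.0, 4.0};

        // True data dependency: each operation consumes the previous
        // result, so the hardware must execute the chain serially.
        double chain = a[0];
        chain = chain * a[1];   // waits for the line above
        chain = chain + a[2];   // waits for the line above
        chain = chain * a[3];   // waits for the line above

        // No dependencies: these four operations are mutually independent,
        // so multiple functional units can execute them concurrently.
        double s0 = a[0] * 2.0;
        double s1 = a[1] * 2.0;
        double s2 = a[2] * 2.0;
        double s3 = a[3] * 2.0;

        std::printf("%f %f\n", chain, s0 + s1 + s2 + s3);
        return 0;
    }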
Multicore Technology
- Power and Heat: Intel Embraces Multicore (May 17, 2004)
- "... Intel, the world's largest chip maker, publicly acknowledged that it had hit a 'thermal wall' on its microprocessor line. As a result, the company is changing its product strategy and disbanding one of its most advanced design groups. Intel also said that it would abandon two advanced chip development projects ... Now, Intel is embarked on a course already adopted by some of its major rivals: obtaining more computing power by stamping multiple processors on a single chip rather than straining to increase the speed of a single processor ... Intel's decision to change course and embrace a 'dual core' processor structure shows the challenge of overcoming the effects of heat generated by the constant on-off movement of tiny switches in modern computers ... some analysts and former Intel designers said that Intel was coming to terms with escalating heat problems so severe they threatened to cause its chips to fracture at extreme temperatures ..." (New York Times, May 17, 2004)

Generic Multicore Chip
[Figure: several processors, each paired with nearby on-chip memory, sharing a global memory.]
- A handful of processors, each supporting ~1 hardware thread
- On-chip memory near the processors (cache, RAM, or both)
- Shared global memory space (external cache and DRAM)

Intel Core i7
[Figure]

GPU
[Figure]

GPU
- All PCs have a GPU: the chip that calculates and generates the positioning of graphics on the computer screen.
- Games typically render tens of thousands of triangles at 60 fps.
- Screen resolution is typically 1600 x 1200, and each pixel is recalculated every frame; this corresponds to processing 1600 x 1200 x 60 = 115,200,000 pixels per second.
- GPUs are designed to make these operations fast.
- Newer GPUs contain multiple cores that utilise hardware multithreading and SIMD.

IBM Cell Broadband Engine
- Multicore
- A single powerful thread per core
- Thread parallel
- Explicit communication
- Explicit synchronization

Cell Processor
- PPE: Power Processing Element
- SPE: Synergistic Processing Element
- SPU: Synergistic Processing Unit
- LS: Local Store
- MFC: Memory Flow Controller
- EIB: Element Interconnect Bus

Cell Processor: the PPE
- More or less a standard scalar processor
- Accesses the main memory through load and store instructions
- Standard L1 and L2 caches
- Capable of running scalar (non-vectorized) code fast
- Capable of running a standard operating system, e.g., Linux
- Capable of executing IEEE floating-point arithmetic (double and single precision)

Cell Processor: the SPE
- A pure vector processor
- Executes code only from its local store
- Operates only on data in its local store
- Accesses the main memory and the local stores of other SPEs only through DMA messages
- Loads and stores only 128-bit vectors
- Operates only on 128-bit vectors (scalar instructions are emulated in software)
- Supports only a single thread, with a register file of 128 vector registers at its disposal

NVIDIA GTX 280
- The GPU (a set of multiprocessors) executes many thread blocks
- Each thread block consists of many threads
- Within a thread block, threads are grouped into warps
- Each thread has per-thread registers and per-thread local memory (in DRAM)
- Each thread block has per-thread-block shared memory
- Global memory (DRAM) is accessible to all threads

NVIDIA GTX 280
- SM: Streaming Multiprocessor (more or less a core)
- SP: Streaming Processor ("scalar processor core", also known as "thread processor")
- Register file
- Shared memory
- Constant cache (read-only for the SM)
- Texture cache (read-only for the SM)

GTX 280: Streaming Multiprocessor
- Eight scalar processors (thread processors) with one instruction issue logic (SIMD)
- Long vectors (32 threads = 1 warp)
- Massively multithreaded (512 scalar hardware threads = 16 warps)
- Huge register file (8,192 scalar registers shared among all threads)
- Can load and store data to and from shared memory
- Can load and store data directly to and from the main memory

General-Purpose Computing on GPUs
- Idea:
  - Potential for very high performance at low cost
  - Architecture well suited for certain kinds of parallel applications (data parallel)
- Early challenges:
  - Architectures were highly customized to graphics problems (e.g., vertex and fragment processors)
  - Programmed using graphics-specific programming models or libraries
- Recent trends:
  - Some convergence between commodity processors and GPUs, and between their associated parallel programming models

CUDA
- Compute Unified Device Architecture, one of the first programming models to support heterogeneous architectures
- A data-parallel programming interface to the GPU:
  - The data to be operated on is discretized into independent partitions of memory
  - Each thread performs roughly the same computation on a different partition of the data
  - When appropriate, parallelism is easy to express and very efficient
- The programmer expresses:
  - Thread programs (kernels) to be launched on the GPU, and how to launch them
  - Data organization and movement between the host and the GPU
  - Synchronization, memory management, ...

CUDA Software Stack
[Figure]

CUDA Hardware Model
[Figure]

CUDA Memory Model
- Each thread can:
  - Read/write per-thread registers
  - Read/write per-thread local memory
  - Read/write per-block shared memory
  - Read/write per-grid global memory
  - Read (only) per-grid constant memory
  - Read (only) per-grid texture memory
- The host can read/write global, constant, and texture memory

CUDA Programming Model
- The GPU is viewed as a compute device that:
  - Is a coprocessor to the CPU (the host)
  - Has its own device memory
  - Runs many threads in parallel
- Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads (a kernel sketch follows below)

CUDA Programming Model
[Figure]
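As a minimal, hedged sketch of this model (the kernel name, array sizes, and launch configuration are illustrative choices, not from the slides), the following CUDA program adds two vectors: each thread handles one element of the data, and movement between host and device memory is explicit:

    // vec_add.cu -- illustrative only; compile with: nvcc vec_add.cu
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Kernel: each thread computes one element of the output vector.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Host-side data.
        float *ha = (float *)malloc(bytes);
        float *hb = (float *)malloc(bytes);
        float *hc = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        // Explicit movement between host memory and device memory.
        float *da, *db, *dc;
        cudaMalloc((void **)&da, bytes);
        cudaMalloc((void **)&db, bytes);
        cudaMalloc((void **)&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        // Launch a grid of thread blocks: 256 threads (8 warps) per block.
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(da, db, dc, n);

        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
        std::printf("c[0] = %f\n", hc[0]);  // expected: 3.000000

        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }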
Future Computer Systems
[Figure]

Challenges of Using Multicore and GPU
[Figure]

Flynn's Taxonomy

                                 | Single data stream   | Multiple data streams
  Single instruction stream      | SISD: uniprocessors  | SIMD: processor arrays,
                                 |                      | pipelined vector processors
  Multiple instruction streams   | MISD: rarely used    | MIMD: multiprocessors,
                                 |                      | multicomputers

Types of Parallelism: Johnson's Expansion of Flynn's Categories
- SISD: uniprocessors
- SIMD: array or vector processors
- MISD: rarely used
- MIMD: multiprocessors or multicomputers; Johnson expands this category along two axes, memory (global vs. distributed) and communication (shared variables vs. message passing):
  - GMSV (global memory, shared variables): shared-memory multiprocessors
  - GMMP (global memory, message passing): rarely used
  - DMSV (distributed memory, shared variables): distributed shared memory
  - DMMP (distributed memory, message passing): distributed-memory multicomputers

SIMD
- Single Instruction Multiple Data architecture
- A single instruction can operate on multiple data elements in parallel
- Relies on the highly structured nature of the underlying computations (data parallelism)
- Widely applied in multimedia processing (e.g., graphics, image, and video); a small code sketch follows below

SIMD
[Figure]
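To illustrate the SIMD idea in code (x86 SSE intrinsics are used here purely as an example; they are not part of the slides), one vector instruction below performs four scalar additions at once:

    // simd_demo.cpp -- illustrative only; requires an x86-64 C++ compiler.
    #include <cstdio>
    #include <immintrin.h>

    int main() {
        float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];

        // Scalar equivalent: four separate add instructions.
        // for (int i = 0; i < 4; ++i) c[i] = a[i] + b[i];

        __m128 va = _mm_loadu_ps(a);      // load four floats at once
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   // ONE instruction, FOUR additions
        _mm_storeu_ps(c, vc);             // store four results at once

        std::printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }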
MIMD
- MIMD architectures can be classified, based on their memory organization, into shared-memory and distributed-memory architectures (or a combination of both, i.e., distributed shared memory).
- Shared-memory multiprocessors communicate through a common memory.
- Distributed-memory multicomputers communicate through communication links.

Parallel Architectures
[Figure: processors P1, P2, ..., Pn connected through an interconnection network to a shared memory, with data streams DS1, DS2, ..., DSn.]
- Shared-memory multiprocessor: all the processors in the system can directly access all memory locations in the system.

Parallel Architectures
[Figure: processors P1, P2, ..., Pn, each with its own local memory M, connected by an interconnection network; data streams DS1, DS2, ..., DSn.]
- Distributed-memory multicomputer:
  - Processors have direct access only to their local memory.
  - Communication and synchronization take place via messages.

Parallel Architectures
[Figure: processors P1, P2, ..., Pn, each with a local memory M, connected by an interconnection network; data streams DS1, DS2, ..., DSn.]
- Distributed shared-memory multiprocessor: all the processors in the system can access all local memory locations.

Parallel Architectures
(The two styles are contrasted below; code sketches of each follow this table.)

  Shared memory                           | Distributed memory
  Explicit global data structure          | Implicit global data structure
  Decomposition of work is independent    | Decomposition of data determines
  of data layout                          | assignment of work
  Communication is implicit               | Communication is explicit
  Synchronization is explicit (needed to  | Synchronization is implicit (data is
  avoid race conditions and overwriting)  | buffered until received)
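As a hedged illustration of this contrast (OpenMP and MPI are standard APIs, but the slides do not prescribe them; the file names and values are made up), the same reduction is written twice: once with implicit communication through shared memory plus explicit synchronization, and once with explicit message passing:

    // shared_sum.cpp -- shared memory; compile with: g++ -fopenmp shared_sum.cpp
    #include <cstdio>

    int main() {
        const int n = 1000000;
        double sum = 0.0;
        // Threads communicate implicitly through the shared variable `sum`;
        // the reduction clause supplies the explicit synchronization that
        // avoids races and overwriting.
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < n; ++i) sum += 1.0 / n;
        std::printf("shared-memory sum = %f\n", sum);
        return 0;
    }

    // dist_sum.cpp -- distributed memory; compile with mpic++, run with mpirun
    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // The decomposition of the data determines the assignment of work:
        // each process sums only its own slice of the index range.
        const int n = 1000000;
        double local = 0.0;
        for (int i = rank; i < n; i += size) local += 1.0 / n;

        // Communication is explicit: partial sums travel as messages.
        double sum = 0.0;
        MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) std::printf("distributed-memory sum = %f\n", sum);
        MPI_Finalize();
        return 0;
    }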
Architecture of Interest
[Figures: architecture diagrams spanning four slides.]

Cloud Computing
- Cloud computing describes a new business model for the supply, consumption, and delivery of IT services over the Internet.
- NIST definition: cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
- Cloud computing promotes availability and consists of:
  - Five essential characteristics
  - Four deployment models
  - Three service models

Cloud Computing
- Five essential characteristics:
  - On-demand self-service
  - Broad network access
  - Resource pooling
  - Rapid elasticity
  - Measured service
- Four deployment models: private cloud, community cloud, public cloud, and hybrid cloud

Cloud Computing
- Three service models (from the end user down):
  - Software as a Service (SaaS)
  - Platform as a Service (PaaS)
  - Infrastructure as a Service (IaaS)

Cloud Computing
- Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.
- Users need not have knowledge of, expertise in, or control over the technology infrastructure in the "cloud" that supports them.
- Characteristics:
  - Virtual: software, databases, Web servers, operating systems, storage, and networking are all provided as virtual servers
  - XaaS: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS)

Cloud Computing
- Motivation: efficient resource use
  - Utilization of typical data centers: below 10-30%
  - Average lifetime of servers: approx. 3 years (CapEx)
  - Excessive operating costs (OpEx): staffing; maintenance (HW & SW); energy (both for powering and cooling)

Cost Efficiency: Provider's Perspective
- Cost reductions:
  - Economies of scale prevail: cloud service providers can achieve 75-80% cost reductions through bulk purchases
  - Efficient resource management practices: utilization improvement (server consolidation) and automated processes (reduction in staffing cost)
- Profit increases:
  - Increase in market demand
  - Quality of service (performance)

Cost Efficiency: Provider's Perspective
- Smoothing/flattening out workloads by effectively managing demand and capacity
[Figure: three plots of resources versus time (days), each showing a capacity line against a fluctuating demand curve.]

Cost Efficiency: User's Perspective
- Elasticity: utilization may often be bursty
[Figure: number of EC2 instances over time for one site: Monday, 14/April/2008, 25,000 "registered" users; Tuesday, 15/April/2008, 50,000 users on 400 instances; Friday, 18/April/2008, 250,000 users on 3,500 instances.]

Cost Efficiency: User's Perspective
- Dynamic provisioning (pay-as-you-go)
[Figure: two plots of capacity/supply versus demand over time, with and without dynamic provisioning.]

Cloud Computing
- Virtualization (e.g., VMware and Xen) provides:
  - A highly customized user environment
  - Isolation from malicious software
  - Seamless migration of services across physical machines
[Figure: traditional stack (applications on a single operating system on hardware) versus virtualized stack (multiple guest operating systems on a hypervisor on hardware).]

References
- Jack Dongarra (U. Tenn.), CS 594 slides, http://www.cs.utk.edu/~dongarra/WEBPAGES/cs594-2010.htm