Motivation for Parallelism

The speed of an application is determined by more than just processor speed: memory speed, disk speed, and network speed matter as well. Multiprocessors typically improve the aggregate speeds: memory bandwidth is improved by separate memories, multiprocessors usually have more aggregate cache memory, and each processor in a cluster can have its own disk and network adapter, improving aggregate speeds.

Constraints on the location of data are another motivation: huge data sets could be difficult, expensive, or otherwise infeasible to store in a central location, and distributed data with parallel processing is a practical solution. Communication also enables parallel applications: harnessing the computing power of distributed systems over the Internet is a popular example of parallel processing (SETI, Folding@home, ...).

Types of Parallelism: Instruction Level Parallelism (ILP)

Instructions near each other in an instruction stream could be independent. These can then execute in parallel either partially (pipelining) or fully (superscalar). Hardware is needed for dependency tracking, and the amount of ILP available per instruction stream is limited. ILP is usually considered an implicit parallelism, since the hardware automatically exploits it without programmer/compiler intervention. Programmers and compilers can, however, transform applications to expose more ILP.

ILP Example: Loop Unrolling

    for( i = 0; i < 64; i++ ) {
        vec[i] = a * vec[i];
    }

This loop has one sequence of load, multiply, store per iteration, so the amount of ILP is very limited.

    for( i = 0; i < 64; i += 4 ) {
        vec[i+0] = a * vec[i+0];
        vec[i+1] = a * vec[i+1];
        vec[i+2] = a * vec[i+2];
        vec[i+3] = a * vec[i+3];
    }

4-fold loop unrolling increases the amount of ILP exploitable by the hardware: there are now four independent load, multiply, store sequences per iteration.

Types of Parallelism: Task Level Parallelism (TLP)

Several instruction streams are independent, so parallel execution of the streams is possible. This is much more coarse-grain parallelism than ILP. It typically requires the programmer and/or compiler to decompose the application into tasks, enforce dependencies, and expose parallelism. Some experimental techniques, such as speculative multithreading, aim at removing these burdens from the programmer.

TLP Example: Quicksort

    QuickSort( A, B ):
        if( B - A < 10 ) {
            /* Base case: use a fast sort */
            FastSort( A, B );
        } else {
            /* Continue recursively */
            Partition [A,B] into [A,C] and [C+1,B];   /* Task X */
            QuickSort( A, C );                        /* Task Y */
            QuickSort( C+1, B );                      /* Task Z */
        }

Both recursive calls (tasks Y and Z) depend on the partitioning step (task X) but are mutually independent, so they can execute in parallel. A sketch of how the two calls can run as parallel tasks follows below.
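The following is a minimal sketch in C, not taken from the slides, of how the quicksort tasks above can be mapped onto POSIX threads: the slide's FastSort is replaced by a simple insertion sort, the helpers partition() and insertion_sort() are illustrative assumptions written inline so the example is self-contained, and task Y is handed to a new thread while task Z runs in the current one.

    /* Sketch only: task-parallel quicksort with POSIX threads.
     * Spawning a thread per recursive call is far too expensive in
     * practice; real code would cap the spawn depth or use a task pool. */
    #include <pthread.h>
    #include <stdio.h>

    struct range { int *a; int lo; int hi; };   /* sort a[lo..hi] */

    static void insertion_sort(int *a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            int v = a[i], j = i - 1;
            while (j >= lo && a[j] > v) { a[j + 1] = a[j]; j--; }
            a[j + 1] = v;
        }
    }

    static int partition(int *a, int lo, int hi) {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;
        return i;
    }

    static void *quicksort_task(void *arg) {
        struct range *r = (struct range *)arg;
        if (r->hi - r->lo < 10) {               /* base case: fast sort */
            insertion_sort(r->a, r->lo, r->hi);
            return NULL;
        }
        int c = partition(r->a, r->lo, r->hi);  /* Task X */
        struct range left  = { r->a, r->lo, c - 1 };
        struct range right = { r->a, c + 1, r->hi };
        pthread_t t;
        pthread_create(&t, NULL, quicksort_task, &left);  /* Task Y: new thread */
        quicksort_task(&right);                            /* Task Z: this thread */
        pthread_join(t, NULL);                  /* wait for Y before returning */
        return NULL;
    }

    int main(void) {
        int a[] = { 9, 3, 7, 1, 8, 2, 6, 5, 4, 0, 11, 10 };
        struct range all = { a, 0, (int)(sizeof a / sizeof a[0]) - 1 };
        quicksort_task(&all);
        for (int i = 0; i <= all.hi; i++) printf("%d ", a[i]);
        printf("\n");
        return 0;
    }

Compile with -lpthread. The point of the sketch is only the task decomposition: X must finish before Y and Z start, but Y and Z can proceed concurrently.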
Types of Parallelism: Data Parallelism (DP)

In many applications a collection of data is transformed in such a way that the operation on each element is largely independent of the others. A typical scenario is applying the same instruction to a whole collection of data, for example adding two arrays: the same operation (+) is applied to every pair of elements.

Superscalar and OoO Execution

Scalar processors can issue one instruction per cycle; superscalar processors can issue more than one instruction per cycle, a common feature in most modern processors. By replicating functional units and adding hardware to detect and track instruction dependencies, a superscalar processor takes advantage of Instruction Level Parallelism. A related technique, applicable to both scalar and superscalar processors, is out-of-order (OoO) execution: instructions are reordered by the hardware for better utilization of the pipeline(s). This is an excellent example of using extra transistors to speed up execution without programmer intervention. It is naturally limited by the available ILP, and also severely limited by the hardware complexity of dependency checking: in practice, 2-way superscalar architectures are common and more than 4-way is unlikely.

Vector Processors

Vector processors refer to a previously common supercomputer architecture in which a vector, a one-dimensional array of numbers, is a basic memory abstraction. Examples include the Cray 1 and the IBM 3090/VF. Consider adding two vectors. The scalar solution loops through the vectors and adds each scalar element, paying for repeated address translations and branches. The vector solution adds the vectors with a single vector addition instruction: address translation happens once and there are no branches. The independent operations enable deeper pipelines, higher clock frequencies, and multiple concurrent functional units (e.g., SIMD units). Vector processors are suitable for a certain range of applications, but traditional scalar processor designs have been successful over a larger spectrum of applications, and economic realities favor large clusters of commodity processors or small-scale SMPs.

Single Instruction Multiple Data (SIMD)

SIMD means several functional units executing the same instruction on different data streams, simultaneously and synchronized. It is a suitable architecture for many data parallel applications: matrix computations, graphics processing, image analysis, and so on. Today it is found primarily in common microprocessors and GPUs, in the form of SIMD instruction extensions such as MMX, SSE, and AltiVec. The array addition from the data parallelism example, written with such extensions, is sketched below.
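As a concrete illustration of data parallelism exploited through SIMD extensions, here is a minimal sketch in C, not taken from the slides, that adds two float arrays first with a plain scalar loop and then with SSE intrinsics. The intrinsics _mm_loadu_ps, _mm_add_ps, and _mm_storeu_ps are standard SSE calls that process four floats at a time; an x86 compiler with SSE support and an array length divisible by four are assumed to keep the example short.

    /* Sketch only: element-wise addition of two float arrays.
     * A real implementation would handle the loop remainder and
     * consider data alignment. */
    #include <xmmintrin.h>   /* SSE intrinsics */
    #include <stdio.h>

    #define N 16

    /* Scalar version: one addition per loop iteration. */
    static void add_scalar(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* SIMD version: four additions per loop iteration using SSE. */
    static void add_sse(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);    /* 4 additions in one instruction */
            _mm_storeu_ps(&c[i], vc);          /* store 4 results */
        }
    }

    int main(void) {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = (float)(N - i); }
        add_scalar(a, b, c, N);
        add_sse(a, b, c, N);
        printf("c[0] = %.1f, c[N-1] = %.1f\n", c[0], c[N - 1]);  /* both 16.0 */
        return 0;
    }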
Shared Memory Multiprocessor

Multiprocessors in which all processors share a single address space are commonly called Shared Memory Multiprocessors, also known as Symmetric MultiProcessors (SMPs). They can be classified by the access time different processors have to different memory areas. In a Uniform Memory Access (UMA) machine each processor has the same access time to memory; in a Non-Uniform Memory Access (NUMA) machine some memory is closer to a processor and access time is higher to distant memory. Furthermore, the caches may or may not be kept coherent; CC-UMA stands for Cache-Coherent Uniform Memory Access. In a bus-based UMA multiprocessor, each chip holds a processor and its cache, and all chips share a single memory over a common bus; in a NUMA multiprocessor, each processor/cache pair also has a local memory, and the memories are connected by an interconnect.

Multicore

When several processor cores are physically located in the same processor socket we refer to it as a multicore processor. A new desktop computer today is definitely a multiprocessor: both Intel and AMD now have quad-core (4 cores) processors in their product portfolios. Multicores usually have a single address space and are cache-coherent. They are very similar to SMPs, but they typically share one or more levels of cache and have more favorable inter-processor/core communication speed. A typical chip has several cores, each with a private L1 cache, sharing a common L2 cache in front of memory.

Multicore chips have several benefits: higher peak performance, power consumption control (some cores can be turned off), and increased production yield (8-core chips with a defective core can be sold with one core disabled). But there are also some potential drawbacks: memory bandwidth per core is limited by physical constraints such as the number of pins, and peak performance per thread is lower, so some inherently sequential applications may actually run slower.

Distributed Memory Multiprocessor

In contrast to machines with a single address space, machines with multiple private memories are commonly called Distributed Memory Machines. Data is exchanged between memories via messages communicated over a dedicated network. When the processor/memory pairs are physically separated, such as on different boards or in different casings, such machines are called Clusters: each node has its own processor, cache, and memory, and the nodes communicate over a network. A sketch of message-based data exchange between two such nodes is given at the end of this section.

Clusters: Past and Present

In the past, clusters were exclusively high-end machines with custom supercomputer processors and custom high-performance interconnects. These machines were very expensive and therefore limited to research and big corporations. From the '90s onwards it has become increasingly common to build clusters from off-the-shelf components: commodity processors and commodity interconnects (e.g. Ethernet with switches). The economic benefits and programmer familiarity with commodity components far outweigh the performance issues. This has helped "democratize" supercomputing: many corporations and universities have clusters today.

Networks

Access to memory on other nodes is very expensive: data must be transferred over a relatively high-latency, low-bandwidth network. Algorithms with low data locality will suffer, and high synchronization requirements will also degrade performance for the same reason. The network design is a tradeoff between conflicting goals: maximum bandwidth and low latency favor full connectivity, while low cost and power consumption favor a tree network. Switch-based networks are common today; other examples of topologies include rings, meshes, hypercubes, and trees.
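To make the message-passing model concrete, here is a minimal sketch in C using MPI, the de facto message-passing library on clusters: rank 0 sends an array to rank 1, which receives it over the network. MPI_Init, MPI_Comm_rank, MPI_Send, and MPI_Recv are standard MPI calls, but the program itself is only an illustrative assumption, not taken from the slides.

    /* Sketch only: explicit message passing between two cluster nodes
     * using MPI. Compile with mpicc and run with e.g. "mpirun -np 2".
     * Every byte moved this way crosses the network, which is why data
     * locality matters so much on distributed memory machines. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 4

    int main(int argc, char **argv) {
        int rank;
        double data[N] = { 0.0 };

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < N; i++) data[i] = i * 1.5;
            /* Send N doubles from node 0 to node 1 (message tag 42). */
            MPI_Send(data, N, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive the message; the data now lives in node 1's private memory. */
            MPI_Recv(data, N, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node 1 received: %.1f %.1f %.1f %.1f\n",
                   data[0], data[1], data[2], data[3]);
        }

        MPI_Finalize();
        return 0;
    }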