Download EECC756 - Shaaban

Introduction to Parallel Processing • Parallel Computer Architecture: Definition & Broad issues involved – • The Need And Feasibility of Parallel Computing – – – • • • • Parallel Programming Models Flynn’s 1972 Classification of Computer Architecture Current Trends In Parallel Architectures – • • • • • Scientific Supercomputing Trends CPU Performance and Technology Trends, Parallelism in Microprocessor Generations Computer System Peak FLOP Rating History/Near Future The Goal of Parallel Processing Elements of Parallel Computing Factors Affecting Parallel System Performance Parallel Architectures History – – • A Generic Parallel Computer Architecture Modern Parallel Architecture Layered Framework Shared Address Space Parallel Architectures Message-Passing Multicomputers: Message-Passing Programming Tools Data Parallel Systems Dataflow Architectures Systolic Architectures: Matrix Multiplication Systolic Array Example (PCA Chapter 1.1, 1.2) EECC756 - Shaaban #1 lec # 1 Spring 2006 3-14-2006 Parallel Computer Architecture A parallel computer (or multiple processor system) is a collection of communicating processing elements (processors) that cooperate to solve large computational problems fast by dividing such problems into parallel tasks, exploiting Thread-Level Parallelism (TLP). i.e Parallel Processing • Broad issues involved: – The concurrency and communication characteristics of parallel algorithms for a given computational problem (represented by dependency graphs) – Computing Resources and Computation Allocation: • The number of processing elements (PEs), computing power of each element and amount/organization of physical memory used. • What portions of the computation and data are allocated or mapped to each PE. – Data access, Communication and Synchronization • • • • How the processing elements cooperate and communicate. How data is shared/transmitted between processors. Abstractions and primitives for cooperation/communication. The characteristics and performance of parallel system network (System interconnects). – Parallel Processing Performance and Scalability Goals: • Maximize performance enhancement of parallelism: Maximize Speedup. – By minimizing parallelization overheads and balancing workload on processors • Scalability of performance to larger systems/problems. Processor = Programmable computing element that runs stored programs written using pre-defined instruction set Processing Elements = PEs = Processors EECC756 - Shaaban #2 lec # 1 Spring 2006 3-14-2006 A Generic Parallel Computer Architecture Parallel Machine Network (Custom or industry standard) Netw ork  A processing node Communication assist (CA) Mem Operating System Parallel Programming Environments $ Processing Nodes Network Interface (custom or industry standard) P Processing Nodes: One or more processing elements or processors per node: Custom or commercial microprocessors. Single or multiple processors per chip Each processing node contains one or more processing elements (PEs) or processor(s), memory system, plus communication assist: (Network interface and communication controller) Parallel machine network (System Interconnects). Function of a parallel machine network is to efficiently (reduce communication cost) transfer information (data, results .. ) from source node to destination node as needed to allow cooperation among parallel processing nodes to solve large computational problems divided into a number parallel computational tasks. Parallel Computer = Multiple Processor System EECC756 - Shaaban #3 lec # 1 Spring 2006 3-14-2006 The Need And Feasibility of Parallel Computing • Application demands: More computing cycles/memory needed Driving Force – – – Scientific/Engineering computing: CFD, Biology, Chemistry, Physics, ... General-purpose computing: Video, Graphics, CAD, Databases, Transaction Processing, Gaming… Mainstream multithreaded programs, are similar to parallel programs • Technology Trends: – – Number of transistors on chip growing rapidly. Clock rates expected to continue to go up but only slowly. Actual performance returns diminishing due to deeper pipelines. Increased transistor density allows integrating multiple processor cores per creating ChipMultiprocessors (CMPs) even for mainstream computing applications (desktop/laptop..). • Architecture Trends: – – – Instruction-level parallelism (ILP) is valuable (superscalar, VLIW) but limited. Increased clock rates require deeper pipelines with longer latencies and higher CPIs. Coarser-level parallelism (at the task or thread level, TLP), utilized in multiprocessor systems is the most viable approach to further improve performance. • Main motivation for development of chip-multiprocessors (CMPs) • Economics: – The increased utilization of commodity of-the-shelf (COTS) components in high performance parallel computing systems instead of costly custom components used in traditional supercomputers leading to much lower parallel system cost. • Today’s microprocessors offer high-performance and have multiprocessor support eliminating the need for designing expensive custom Pes. • Commercial System Area Networks (SANs) offer an alternative to custom more costly networks EECC756 - Shaaban #4 lec # 1 Spring 2006 3-14-2006 Why is Parallel Processing Needed? Challenging Applications in Applied Science/Engineering • • • • • • • • • • • • • • • • Astrophysics Atmospheric and Ocean Modeling Such applications have very high Bioinformatics computational and memory Biomolecular simulation: Protein folding requirements that cannot be met with single-processor architectures. Computational Chemistry Many applications contain a large Computational Fluid Dynamics (CFD) degree of computational parallelism Computational Physics Computer vision and image understanding Data Mining and Data-intensive Computing Engineering analysis (CAD/CAM) Global climate modeling and forecasting Material Sciences Military applications Driving force for High Performance Computing (HPC) Quantum chemistry and multiple processor system development VLSI design …. EECC756 - Shaaban #5 lec # 1 Spring 2006 3-14-2006 Why is Parallel Processing Needed? Scientific Computing Demands (Memory Requirement) Computational and memory demands exceed the capabilities of even the fastest current uniprocessor systems 3-5 GFLOPS for uniprocessor EECC756 - Shaaban #6 lec # 1 Spring 2006 3-14-2006 Scientific Supercomputing Trends • Proving ground and driver for innovative architecture and advanced high performance computing (HPC) techniques: – Market is much smaller relative to commercial (desktop/server) segment. – Dominated by costly vector machines starting in the 70s through the 80s. – Microprocessors have made huge gains in the performance needed for such applications: Enabled with high transistor density/chip • • • • • High clock rates. (Bad: Higher CPI?) Multiple pipelined floating point units. Instruction-level parallelism. Effective use of caches. Multiple processor cores/chip (2 cores 2002-2005, 4 end of 2006, 8 cores 2007?) However even the fastest current single microprocessor systems still cannot meet the needed computational demands. As shown in last slide • Currently: Large-scale microprocessor-based multiprocessor systems and computer clusters are replacing (replaced?) vector supercomputers that utilize custom processors. EECC756 - Shaaban #7 lec # 1 Spring 2006 3-14-2006 Uniprocessor Performance Evaluation • • • CPU Performance benchmarking is heavily program-mix dependent. Ideal performance requires a perfect machine/program match. Performance measures: – Total CPU time = T = TC / f = TC x C = I x CPI x C = I x (CPIexecution + m x k) x C TC = Total program execution clock cycles f = clock rate C = CPU clock cycle time = 1/f CPI = Cycles per instruction (in seconds) I = Instructions executed count CPIexecution = CPI with ideal memory m = Memory stall cycles per memory access k = Memory accesses per instruction – MIPS Rating = I / (T x 106) = f / (CPI x 106) = f x I /(TC x 106) (in million instructions per second) – Throughput Rate: Wp = 1/ T = f /(I x CPI) = (MIPS) x 106 /I (in programs per second) • Performance factors: (I, CPIexecution, m, k, C) are influenced by: instruction-set architecture (ISA) , compiler design, CPU micro-architecture, implementation and control, cache and memory hierarchy, program access locality, and program instruction mix and instruction dependencies. T = I x CPI x C EECC756 - Shaaban #8 lec # 1 Spring 2006 3-14-2006 Single CPU Performance Trends • The microprocessor is currently the most natural building block for multiprocessor systems in terms of cost and performance. • This is even more true with the development of cost-effective multi-core microprocessors that support TLP at the chip level. Performance 100 Supercomputers 10 Mainframes Microprocessors Minicomputers 1 0.1 1965 1970 1975 1980 1985 1990 1995 EECC756 - Shaaban #9 lec # 1 Spring 2006 3-14-2006 Microprocessor Frequency Trend 100 Intel Processor freq scales by 2X per generation IBM Power PC DEC Gate delays/clock 21264S 1,000 Mhz 21164A 21264 Pentium(R) 21064A 21164 II 21066 MPC750 604 604+ 10 Pentium Pro 601, 603 (R) Pentium(R) 100 486 386 2005 2003 2001 1999 1997 1995 1993 1991 1989 1 1987 10  Frequency doubles each generation ?  Number of gates/clock reduce by 25%  Leads to deeper pipelines with more stages (e.g Intel Pentium 4E has 30+ pipeline stages) T = I x CPI x C Gate Delays/ Clock 10,000 Realty Check: Clock frequency scaling is slowing down! (Did silicone finally hit the wall?) Why? 1- Power leakage 2- Clock distribution delays Result: Deeper Pipelines Longer stalls Higher CPI (lowers effective performance per cycle) Solution: Exploit TLP at the chip level, Chip-multiprocessor (CMPs) EECC756 - Shaaban #10 lec # 1 Spring 2006 3-14-2006 Transistor Count Growth Rate 100,000,000 Currently > 1.5 Billion  Transistors 10,000,000 Moore’s Law:     R10000       Pentium                    i80386  i80286    R3000 R2000   1,000,000 100,000 2X transistors/Chip Every 1.5 years (circa 1970) still holds Enables Thread-Level Parallelism (TLP) at the chip level: Chip-Multiprocessors (CMPs) + Simultaneous Multithreaded (SMT) processors i8086 10,000  i8080   i8008 i4004 1,000 1970 1980 1990 2000 1975 1985 1995 2005 • One billion transistors/chip reached in 2005 Solution • Transistor count grows faster than clock rate: Currently ~ 40% per year • Single-threaded uniprocessors do not efficiently utilize the increased transistor count. Limited ILP, increased size of cache EECC756 - Shaaban #11 lec # 1 Spring 2006 3-14-2006 Parallelism in Microprocessor VLSI Generations Bit-level parallelism Instruction-level 100,000,000 Thread-level (?) (ILP) (TLP) Multiple micro-operations per cycle Single-issue (multi-cycle non-pipelined) Pipelined  Superscalar /VLIW CPI <1 CPI =1 10,000,000    Not Pipelined CPI >> 1   1,000,000    R10000                 Simultaneous Multithreading SMT: e.g. Intel’s Hyper-threading Chip-Multiprocessors (CMPs) e.g IBM Power 4, 5 Intel Pentium D, Core Duo AMD Athlon 64 X2 Dual Core Opteron Sun UltraSparc T1 (Niagara) Pentium Transistors    i80386  Chip-Level Parallel Processing   i80286  100,000   R2000    R3000 Even more important due to slowing clock rate increase  i8086 10,000  i8080  i8008  Single Thread  i4004 1,000 1970 1975 1980 1985 Improving microprocessor generation performance by exploiting more levels of parallelism 1990 1995 2000 2005 EECC756 - Shaaban #12 lec # 1 Spring 2006 3-14-2006 Current Dual-Core Chip-Multiprocessor Architectures Single Die Shared L2 Cache Cores communicate using shared cache (Lowest communication latency) Examples: IBM POWER4/5 Intel Pentium Core Duo (Yonah), Conroe Sun UltraSparc T1 (Niagara) Single Die Private Caches Shared System Interface Cores communicate using on-chip Interconnects (shared system interface) Two Dice – Shared Package Private Caches Private System Interface Cores communicate over external Front Side Bus (FSB) (Highest communication latency) Examples: AMD Dual Core Opteron, Athlon 64 X2 Intel Itanium2 (Montecito) Source: Real World Technologies, http://www.realworldtech.com/page.cfm?ArticleID=RWT101405234615 Example: Intel Pentium D EECC756 - Shaaban #13 lec # 1 Spring 2006 3-14-2006 Microprocessors Vs. Vector Processors Uniprocessor Performance: LINPACK 10,000 CRAY  CRAY  Micro Micro n = 1,000 n = 100 n = 1,000 n = 100 Now about 3-5 GFLOPS per microprocessor Vector Processors  1,000  1 GFLOP  C90 (109 FLOPS) LINPACK (MFLOPS) T94    100     DEC 8200  Ymp  Xmp/416       IBM Pow er2/990 MIPS R4400 Xmp/14se   DEC Alpha  HP9000/735   DEC Alpha AXP  HP 9000/750  CRAY 1s  IBM RS6000/540 10  MIPS M/2000  Microprocessors  MIPS M/120  Sun 4/260 1 1975   1980 1985 1990 1995 2000 EECC756 - Shaaban #14 lec # 1 Spring 2006 3-14-2006 Parallel Performance: LINPACK 10,000  MPP peak  CRAY peak 1,000 LINPACK (GFLOPS) Current Top LINPACK Performance: Now about 280,600 GFLOPS = 280.6 TeraFLOPS IBM BlueGene/L 131072 (128K) processors (0.7 GHz PowerPC 440) ASCI Red  1 TeraFLOP (1012 Parago n XP/S MP (6768)  Parago n XP/S MP (1024)   T3D CM-5  FLOPS = 1000 GFLOPS) 100 T932(32)  Paragon XP/S CM-200  CM-2  Ymp/832(8) 1   C90(16) Delta 10    iPSC/860  nCUBE/2(1024) Xmp /416(4) 0.1 1985 1987 1989 1991 1993 Current ranking of top 500 parallel supercomputers in the world is found at: www.top500.org 1995 1996 EECC756 - Shaaban #15 lec # 1 Spring 2006 3-14-2006 Why is Parallel Processing Needed? LINPAK Performance Trends 10,000 10,000 CRAY  CRAY  Micro Micro n = 1,000 n = 100 n = 1,000 n = 100 LINPACK (MFLOPS) (109 FLOPS) T94  C90    100      Ymp Xmp/416  1 TeraFLOP LINPACK (GFLOPS)  1 GFLOP DEC 8200       IBM Pow er2/990 MIPS R4400 Xmp/14se (1012 FLOPS =1000 GFLOPS) 100 CM-200   Ymp/832(8) 10 1  MIPS M/2000     C90(16) Delta 10 DEC Alpha  HP9000/735   DEC Alpha AXP  HP 9000/750  IBM RS6000/540 Paragon XP/S MP (6768)  Paragon XP/S MP (1024)   T3D CM-5  T932(32)  Paragon XP/S CM-2     CRAY 1s ASCI Red  1,000  1,000  MPP peak  CRAY peak  iPSC/860  nCUBE/2(1024) Xmp /416(4)  MIPS M/120  Sun 4/260 1 1975 0.1 1985   1980 1985 1990 1995 Uniprocessor Performance 1987 1989 1991 1993 1995 1996 2000 Parallel System Performance EECC756 - Shaaban #16 lec # 1 Spring 2006 3-14-2006 Computer System Peak FLOP Rating History/Near Future Current Top Peak FP Performance: Now about 367,000 GFLOPS = 367 TeraFLOPS = 0.367 Peta FLOP IBM BlueGene/L 131072 (128K) processors (0.7 GHz PowerPC 440) Peta FLOP (1015 FLOPS = 1000 Tera FLOPS) Teraflop (1012 FLOPS = 1000 GFLOPS) EECC756 - Shaaban #17 lec # 1 Spring 2006 3-14-2006 The Goal of Parallel Processing • Goal of applications in using parallel machines: Maximize Speedup over single processor performance Speedup (p processors) = Performance (p processors) Performance (1 processor) • For a fixed problem size (input data set), performance = 1/time Speedup fixed problem (p processors) = Time (1 processor) Time (p processors) • Ideal speedup = number of processors = p Very hard to achieve Due to parallelization overheads: communication cost, dependencies ... EECC756 - Shaaban #18 lec # 1 Spring 2006 3-14-2006 The Goal of Parallel Processing • Parallel processing goal is to maximize parallel speedup: Speedup = Sequential Work on one processor Time(1) Time(p) < Max (Work + Synch Wait Time + Comm Cost + Extra Work) Parallelization overheads i.e the processor with maximum execution time • Ideal Speedup = p number of processors – Very hard to achieve: Implies no parallelization overheads and perfect load balance among all processors. • Maximize parallel speedup by: 1 2 • – Balancing computations on processors (every processor does the same amount of work) and the same amount of overheads. – Minimizing communication cost and other overheads associated with each step of parallel program creation and execution. Performance Scalability: Achieve a good speedup for the parallel application on the parallel architecture as problem size and machine size (number of processors) are increased. EECC756 - Shaaban #19 lec # 1 Spring 2006 3-14-2006 Elements of Parallel Computing Computing Problems Assign parallel computations to processors Algorithms and Data Structures Dependency analysis Mapping Hardware Architecture Programming High-level Languages Operating System Binding (Compile, Load) Applications Software Performance Evaluation e.g Speedup EECC756 - Shaaban #20 lec # 1 Spring 2006 3-14-2006 Elements of Parallel Computing 1 Computing Problems: Driving Force – Numerical Computing: Science and and engineering numerical problems demand intensive integer and floating point computations. – Logical Reasoning: Artificial intelligence (AI) demand logic inferences and symbolic manipulations and large space searches. 2 Parallel Algorithms and Data Structures – Special algorithms and data structures are needed to specify the computations and communication present in computing problems (from dependency analysis). – Most numerical algorithms are deterministic using regular data structures. – Symbolic processing may use heuristics or non-deterministic searches. – Parallel algorithm development requires interdisciplinary interaction. EECC756 - Shaaban #21 lec # 1 Spring 2006 3-14-2006 Elements of Parallel Computing 3 Hardware Resources – Processors, memory, and peripheral devices (processing nodes) form the hardware core of a computer system. – Processor connectivity (system interconnects, network), memory organization, influence the system architecture. 4 Operating Systems – Manages the allocation of resources to running processes. – Mapping to match algorithmic structures with hardware architecture and vice versa: processor scheduling, memory mapping, interprocessor communication. • Parallelism exploitation possible at: algorithm design, program writing, compilation, and run time. EECC756 - Shaaban #22 lec # 1 Spring 2006 3-14-2006 Elements of Parallel Computing 5 System Software Support – Needed for the development of efficient programs in highlevel languages (HLLs.) – Assemblers, loaders. – Portable parallel programming languages – User interfaces and tools. 6 Compiler Support – Preprocessor compiler: Sequential compiler and low-level library of the target parallel computer. – Precompiler: Some program flow analysis, dependence checking, limited optimizations for parallelism detection. – Parallelizing compiler: Can automatically detect parallelism in source code and transform sequential code into parallel constructs. EECC756 - Shaaban #23 lec # 1 Spring 2006 3-14-2006 Approaches to Parallel Programming Programmer Programmer Source code written in sequential languages C, C++ FORTRAN, LISP .. Source code written in concurrent dialects of C, C++ FORTRAN, LISP .. Parallelizing compiler Concurrency preserving compiler Parallel object code Execution by runtime system (a) Implicit Parallelism Concurrent object code (b) Explicit Parallelism Execution by runtime system EECC756 - Shaaban #24 lec # 1 Spring 2006 3-14-2006 Factors Affecting Parallel System Performance • Parallel Algorithm Related: i.e Inherent Parallelism – Available concurrency and profile, grain size, uniformity, patterns. • Dependencies between computations represented by dependency graph – Required communication/synchronization, uniformity and patterns. – Data size requirements. – Communication to computation ratio (C-to-C ratio, lower is better). • Parallel program Related: – Programming model used. – Resulting data/code memory requirements, locality and working set characteristics. – Parallel task grain size. – Assignment of tasks to processors: Dynamic or static. – Cost of communication/synchronization primitives. • Hardware/Architecture related: – – – – – Total CPU computational power available. Types of computation modes supported. Shared address space Vs. message passing. Communication network characteristics (topology, bandwidth, latency) Memory hierarchy properties. Concurrency = Parallelism EECC756 - Shaaban #25 lec # 1 Spring 2006 3-14-2006 Sequential Execution on one processor Possible Parallel Execution Schedule on Two Processors P0, P1 Dependency Graph 0 0 1 A 2 1 A 3 B C 9 7 Comm D 10 E F D 11 12 E 14 G 15 D 9 10 E Idle 12 Comm 13 14 Idle G 15 16 F 17 18 19 G 20 16 What would the speed be with 3 processors? 4 processors? 5 … ? 17 18 T2 =16 19 20 21 21 Time T1 =21 F 8 11 Comm 13 B 6 C 8 C 5 6 7 Comm 4 B 5 Idle 3 Comm 4 A 2 Assume computation time for each task A-G = 3 Assume communication time between parallel tasks = 1 Assume communication can overlap with computation Speedup on two processors = T1/T2 = 21/16 = 1.3 A simple parallel execution example Time P0 P1 EECC756 - Shaaban #26 lec # 1 Spring 2006 3-14-2006 Evolution of Computer Architecture Scalar Sequential Lookahead Functional Parallelism I/E Overlap Multiple Func. Units Pipeline Implicit Vector I/E: Instruction Fetch and Execute SIMD: Single Instruction stream over Multiple Data streams Explicit Vector Memory-to -Memory MIMD: Multiple Instruction streams over Multiple Data streams SIMD Associative Processor Processor Array Data Parallel Message Passing Register-to -Register Shared Memory MIMD Multicomputer Computer Clusters Multiprocessor Massively Parallel Processors (MPPs) EECC756 - Shaaban Parallel Architectures History Historically, parallel architectures were tied to programming models • Divergent architectures, with no predictable pattern of growth. Application Software Systolic Arrays System Software Architecture SIMD Message Passing Dataflow More on this next lecture Shared Memory EECC756 - Shaaban #28 lec # 1 Spring 2006 3-14-2006 Parallel Programming Models • Programming methodology used in coding applications (parallel) • Specifies communication and synchronization • Examples: – Multiprogramming: or Multi-tasking (not true parallel processing!) No communication or synchronization at program level. A number of independent programs running on different processors in the system. – Shared memory address space (SAS): Parallel program threads or tasks communicate using a shared memory address space (shared data in memory). – Message passing: Explicit point to point communication is used between parallel program tasks using messages. – Data parallel: More regimented, global actions on data (i.e the same operations over all elements on an array or vector) – Can be implemented with shared address space or message passing EECC756 - Shaaban #29 lec # 1 Spring 2006 3-14-2006 Flynn’s 1972 Classification of Computer (Taxonomy) Architecture • Single Instruction stream over a Single Data stream (SISD): Conventional sequential machines or uniprocessors. • Single Instruction stream over Multiple Data streams (SIMD): Vector computers, array of synchronized processing elements. • Multiple Instruction streams and a Single Data stream (MISD): Systolic arrays for pipelined execution. • Multiple Instruction streams over Multiple Data streams (MIMD): Parallel computers: Tightly coupled processors Loosely coupled processors • Shared memory multiprocessors. • Multicomputers: Unshared distributed memory, message-passing used instead (e.g clusters) Classified according to number of instruction streams (threads) and number of data streams in architecture EECC756 - Shaaban #30 lec # 1 Spring 2006 3-14-2006 Flynn’s Classification of Computer Architecture (Taxonomy) Uniprocessor Single Instruction stream over Multiple Data streams (SIMD): Vector computers, array of synchronized processing elements. Shown here: array of synchronized processing elements CU = Control Unit PE = Processing Element M = Memory Single Instruction stream over a Single Data stream (SISD): Conventional sequential machines or uniprocessors. Parallel computers or multiprocessor systems Multiple Instruction streams and a Single Data stream (MISD): Systolic arrays for pipelined execution. Classified according to number of instruction streams (threads) and number of data streams in architecture Multiple Instruction streams over Multiple Data streams (MIMD): Parallel computers: Distributed memory multiprocessor system shown EECC756 - Shaaban #31 lec # 1 Spring 2006 3-14-2006 Current Trends In Parallel Architectures • The extension of “computer architecture” to support communication and cooperation: – OLD: Instruction Set Architecture (ISA) – NEW: Communication Architecture • Defines: – Critical abstractions, boundaries, and primitives (interfaces) – Organizational structures that implement interfaces (hardware or software) • Compilers, libraries and OS are important bridges today More on this next lecture EECC756 - Shaaban #32 lec # 1 Spring 2006 3-14-2006 Modern Parallel Architecture Layered Framework CAD Database Multiprogramming Shared address Scientific modeling Message passing Parallel applications Data parallel Programming models Compilation or library Operating systems support Communication hardware Communication abstraction User/system boundary Hardware/software boundary (ISA) Physical communication medium More on this next lecture EECC756 - Shaaban #33 lec # 1 Spring 2006 3-14-2006 Shared Address Space Parallel Architectures (SAS) • Any processor can directly reference any memory location – Communication occurs implicitly as result of loads and stores • Convenient: – Location transparency – Similar programming model to time-sharing in uniprocessors • Except processes run on different processors • Good throughput on multiprogrammed workloads • Naturally provided on a wide range of platforms – Wide range of scale: few to hundreds of processors • Popularly known as shared memory machines or model – Ambiguous: Memory may be physically distributed among processing nodes. i.e Distributed shared memory multiprocessors Sometimes called Tightly-Coupled Parallel Computers EECC756 - Shaaban #34 lec # 1 Spring 2006 3-14-2006 Shared Address Space (SAS) Parallel Programming Model • Process: virtual address space plus one or more threads of control • Portions of address spaces of processes are shared Virtual address spaces for a collection of processes communicating via shared addresses Load P1 Machine physical address space Pn pr i v at e Pn P2 Common physical addresses P0 St or e Shared portion of address space Private portion of address space P2 pr i v at e P1 pr i v at e P0 pr i v at e • Writes to shared address visible to other threads (in other processes too) • Natural extension of the uniprocessor model: • Conventional memory operations used for communication • Special atomic operations needed for synchronization • OS uses shared memory to coordinate processes EECC756 - Shaaban #35 lec # 1 Spring 2006 3-14-2006 Models of Shared-Memory Multiprocessors • The Uniform Memory Access (UMA) Model: – The physical memory is shared by all processors. – All processors have equal access to all memory addresses. – Also referred to as Symmetric Memory Processors (SMPs). • Distributed memory or Non-uniform Memory Access (NUMA) Model: – Shared memory is physically distributed locally among processors. Access to remote memory is higher. • The Cache-Only Memory Architecture (COMA) Model: – A special case of a NUMA machine where all distributed main memory is converted to caches. – No memory hierarchy at each processor. EECC756 - Shaaban #36 lec # 1 Spring 2006 3-14-2006 Models of Shared-Memory Multiprocessors Uniform Memory Access (UMA) Model or Symmetric Memory Processors (SMPs). I/O devices Mem Mem Mem Mem I/O ctrl I/O ctrl Interconnect Interconnect Processor Interconnect: Bus, Crossbar, Multistage network P: Processor M: Memory C: Cache D: Cache directory Processor Network Network  M $ P M $ P  M D D D C C C P P P $ P Distributed memory or Non-uniform Memory Access (NUMA) Model Cache-Only Memory Architecture (COMA) EECC756 - Shaaban #37 lec # 1 Spring 2006 3-14-2006 Uniform Memory Access Example: Intel Pentium Pro Quad Circa 1997 CPU P-Pr o module 256-KB L2 $ Interrupt controller Bus interface Shared FSB P-Pr o module P-Pr o module PCI bridge PCI bus PCI I/O cards PCI bridge PCI bus P-Pr o bus (64-bit data, 36-bit address, 66 MHz) Memory controller MIU 1-, 2-, or 4-w ay interleaved DRAM • All coherence and multiprocessing glue in processor module • Highly integrated, targeted at high volume Bus-Based Symmetric Memory Processors (SMPs). A single Front Side Bus (FSB) is shared among processors This severely limits scalability to only 2-4 processors EECC756 - Shaaban #38 lec # 1 Spring 2006 3-14-2006 Non-Uniform Memory Access (NUMA) Example: AMD 8-way Opteron Server Node Circa 2003 Dedicated point-to-point interconnects (HyperTransport links) used to connect processors alleviating the traditional limitations of FSB-based SMP systems. Each processor has two integrated DDR memory channel controllers: memory bandwidth scales up with number of processors. NUMA architecture since a processor can access its own memory at a lower latency than access to remote memory directly connected to other processors in the system. Total 16 processor cores when dual core Opteron processors used EECC756 - Shaaban #39 lec # 1 Spring 2006 3-14-2006 Uniform Memory Access Example: SUN Enterprise P $ P $ $2 $2 CPU/mem cards Mem ctrl Bus interf ace/sw itch Gigaplane bus (256 data, 41 address, 83 MHz) I/O cards 2 FiberChannel SBUS SBUS SBUS 100bT, SCSI Bus interf ace – 16 cards of either type: processors + memory, or I/O – All memory accessed over bus, so symmetric – Higher bandwidth, higher latency bus EECC756 - Shaaban #40 lec # 1 Spring 2006 3-14-2006 Distributed Shared-Memory Multiprocessor System Example: Circa 1995-1999 Cray T3E NUMA Example External I/O P $ Mem Mem ctrl and NI More recent Cray Example: Cray X1E Supercomputer XY Sw itch Z • • • • Scale up to 2048 processors, DEC Alpha EV6 microprocessor (COTS) Custom 3D Torus point-to-point network, 480MB/s links Memory controller generates communication requests for non-local references No hardware mechanism for coherence (SGI Origin etc. provide this) Example of Non-uniform Memory Access (NUMA) EECC756 - Shaaban #41 lec # 1 Spring 2006 3-14-2006 Message-Passing Multicomputers • Comprised of multiple autonomous computers (nodes) connected via a suitable network. Industry standard System Area Network (SAN) or proprietary network • Each node consists of one or more processors, local memory, attached storage and I/O peripherals. • Local memory is only accessible by local processors in a node (no shared memory among nodes). • Inter-node communication is carried out by message passing through the connection network. • Process communication achieved using a message-passing programming environment (e.g. PVM, MPI). – Programming model more removed from basic hardware operations • Include: e.g abstracted – A number of commercial Massively Parallel Processor systems (MPPs). – Computer clusters that utilize commodity of-the-shelf (COTS) components. Also called Loosely-Coupled Parallel Computers EECC756 - Shaaban #42 lec # 1 Spring 2006 3-14-2006 Message-Passing Abstraction Tag Send (X, Q, t) Match Data Receive Y, P, t Address Y Send X, Q, t Recipient Addr ess X Local pr ocess addr ess space Sender Tag Receive (Y, P, t) Data Recipient Sender Process P • • • • • • Local pr ocess addr ess space Process Q Send specifies buffer to be transmitted and receiving process Receive specifies sending process and application storage to receive into Memory to memory copy possible, but need to name processes Optional tag on send and matching rule on receive User process names local data and entities in process/tag space too In simplest form, the send/receive match achieves pairwise synchronization event – Ordering of computations according to dependencies • Many possible overheads: copying, buffer management, protection ... EECC756 - Shaaban #43 lec # 1 Spring 2006 3-14-2006 Message-Passing Example: Intel Paragon Circa 1983 i860 i860 L1 $ L1 $ Intel Paragon node Memory bus (64-bit, 50 MHz) Mem ctrl DMA Driver Sandia’ s Intel Paragon XP/S-based Super computer 2D grid netw ork w ith processing node attached to every sw itch NI 4-w ay interleaved DRAM 8 bits, 175 MHz, bidirectional 2D grid point to point network EECC756 - Shaaban #44 lec # 1 Spring 2006 3-14-2006 Message-Passing Example: IBM SP-2 Circa 1994-1998 Pow er 2 CPU IBM SP-2 node L2 $ Memory bus General interconnection netw ork formed fom r 8-port sw itches • Made out of essentially complete RS6000 workstations Network interface integrated in I/O bus (bandwidth limited by I/O bus) MicroChannel bus NIC I/O DMA i860 Multi-stage network NI DRAM • 4-w ay interleaved DRAM Memory controller EECC756 - Shaaban #45 lec # 1 Spring 2006 3-14-2006 Message-Passing MPP Example: Circa 2005 IBM Blue Gene/L (2 processors/chip) • (2 chips/compute card) • (16 compute cards/node board) • (32 node boards/tower) • (64 tower) = 128k = 131072 (0.7 GHz PowerPC 440) processors (64k nodes) System Location: Lawrence Livermore National Laboratory Networks: 3D Torus point-to-point network Global tree 3D point-to-point network (both proprietary) 2.8 Gflops peak per processor core System (64 cabinets, 64x32x32) Cabinet (32 Node boards, 8x8x16) Node Board (32 chips, 4x4x2) 16 Compute Cards Compute Card (2 chips, 2x1x1) 180/360 TF/s 16 TB DDR Chip (2 processors) 90/180 GF/s 8 GB DDR 2.8/5.6 GF/s 4 MB 2.9/5.7 TF/s 256 GB DDR 5.6/11.2 GF/s 0.5 GB DDR Current Top Ranking LINPACK Performance: 280,600 GFLOPS = 280.6 TeraFLOPS = 0.2806 Peta FLOP Current Top Peak FP Performance: Now about 367,000 GFLOPS = 367 TeraFLOPS = 0.367 Peta FLOP EECC756 - Shaaban #46 lec # 1 Spring 2006 3-14-2006 Message-Passing Programming Tools • Message-passing programming environments include: – Message Passing Interface (MPI): • Provides a standard for writing concurrent message-passing programs. • MPI implementations include parallel libraries used by existing programming languages (C, C++). – Parallel Virtual Machine (PVM): • Enables a collection of heterogeneous computers to used as a coherent and flexible concurrent computational resource. • PVM support software executes on each machine in a userconfigurable pool, and provides a computational environment of concurrent applications. • User programs written for example in C, Fortran or Java are provided access to PVM through the use of calls to PVM library routines. EECC756 - Shaaban #47 lec # 1 Spring 2006 3-14-2006 Data Parallel Systems SIMD in Flynn taxonomy • Programming model (Data Parallel) – Operations performed in parallel on each element of data structure – Logically single thread of control, performs sequential or parallel steps Control processor – Conceptually, a processor is associated with each data element • Architectural model – Array of many simple, cheap processors each with little memory • Processors don’t sequence through instructions – Attached to a control processor that issues instructions – Specialized and general communication, cheap global synchronization • Example machines: – Thinking Machines CM-1, CM-2 (and CM-5) PE PE  PE PE PE  PE   PE PE   PE All PE are synchronized (same instruction or operation in a given cycle) – Maspar MP-1 and MP-2, PE = Processing Element EECC756 - Shaaban #48 lec # 1 Spring 2006 3-14-2006 Dataflow Architectures • Represent computation as a graph of essential data dependencies – Non-Von Neumann Architecture (Not PC-based) – Logical processor at each node, activated by availability of operands – Message (tokens) carrying tag of next instruction sent to next processor – Tag compared with others in matching store; match fires execution 1 b c e Research Dataflow machine prototypes include: Token +   a = (b +1)  (b  c) Distribution • The MIT Tagged Architecture d=ce Network f =ad • The Manchester Dataflow Machine d  Dataflow graph a  Netw ork f Token store Program store Waiting Matching Instruction f etch One Node Execute Form token Netw ork Token queue Netw ork Token Matching Token Distribution Tokens = Copies of computation results The Tomasulo approach of dynamic instruction execution utilizes dataflow driven execution engine: • The data dependency graph for a small window of instructions is constructed dynamically when instructions are issued in order of the program. •The execution of an issued instruction is triggered by the availability of its operands (data it needs) over the CDB. EECC756 - Shaaban #49 lec # 1 Spring 2006 3-14-2006 Systolic Architectures • Replace single processor with an array of regular processing elements • Orchestrate data flow for high throughput with less memory access M M PE PE • PE PE • Different from pipelining – Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory Different from SIMD: each PE may do something different • • Initial motivation: VLSI Application-Specific Integrated Circuits (ASICs) Represent algorithms directly by chips connected in regular pattern A possible example of MISD in Flynn’s Classification of Computer Architecture EECC756 - Shaaban #50 lec # 1 Spring 2006 3-14-2006 C=AX B Systolic Array Example: 3x3 Systolic Array Matrix Multiplication • Processors arranged in a 2-D grid • Each processor accumulates one element of the product b2,0 b1,0 b0,0 Alignments in time b2,1 b1,1 b0,1 b2,2 b1,2 b0,2 Columns of B Rows of A a0,2 a1,2 a2,2 a2,1 a1,1 a0,1 a0,0 a1,0 a2,0 T=0 Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/ EECC756 - Shaaban #51 lec # 1 Spring 2006 3-14-2006 Systolic Array Example: 3x3 Systolic Array Matrix Multiplication • Processors arranged in a 2-D grid • Each processor accumulates one element of the product b2,0 b1,0 Alignments in time b2,1 b1,1 b0,1 b2,2 b1,2 b0,2 b0,0 a0,2 a1,2 a2,2 a2,1 a1,1 a0,1 a0,0 a0,0*b0,0 a1,0 a2,0 T=1 Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/ EECC756 - Shaaban #52 lec # 1 Spring 2006 3-14-2006 Systolic Array Example: 3x3 Systolic Array Matrix Multiplication • Processors arranged in a 2-D grid • Each processor accumulates one element of the product Alignments in time b2,1 b1,1 b2,0 b1,0 a0,2 a0,1 a0,0*b0,0 + a0,1*b1,0 a0,0 b2,2 b1,2 b0,2 b0,1 a0,0*b0,1 b0,0 a1,2 a2,2 a2,1 a1,1 a1,0 a1,0*b0,0 a2,0 T=2 Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/ EECC756 - Shaaban #53 lec # 1 Spring 2006 3-14-2006 Systolic Array Example: 3x3 Systolic Array Matrix Multiplication • Processors arranged in a 2-D grid • Each processor accumulates one element of the product Alignments in time b2,2 b1,2 b2,1 b2,0 a0,2 a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 a0,1 b1,0 a1,2 a1,1 a1,0*b0,0 + a1,1*b1,0 a1,0 b1,1 a0,0*b0,1 + a0,1*b1,1 a0,0 b0,2 a0,0*b0,2 b0,1 a1,0*b0,1 b0,0 a2,2 a2,1 a2,0 T=3 Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/ a2,0*b0,0 EECC756 - Shaaban #54 lec # 1 Spring 2006 3-14-2006 Systolic Array Example: 3x3 Systolic Array Matrix Multiplication • Processors arranged in a 2-D grid • Each processor accumulates one element of the product Alignments in time b2,2 b1,2 b2,1 a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 a0,2 b2,0 a1,2 a1,0*b0,0 + a1,1*b1,0 + a1,2*a2,0 a1,1 b1,0 a2,2 a2,2 a2,1 a2,0*b0,0 + a2,1*b1,0 T=4 Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/ a2,0 a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1 a0,1 a0,0*b0,2 + a0,1*b1,2 b1,1 a1,0*b0,1 +a1,1*b1,1 a1,0 b0,2 a1,0*b0,2 b0,1 a2,0*b0,1 EECC756 - Shaaban #55 lec # 1 Spring 2006 3-14-2006 Systolic Array Example: 3x3 Systolic Array Matrix Multiplication • Processors arranged in a 2-D grid • Each processor accumulates one element of the product Alignments in time b2,2 a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1 a0,2 a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2 b2,1 a1,0*b0,0 + a1,1*b1,0 + a1,2*a2,0 a1,2 b2,0 a2,2 a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0 T=5 Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/ a2,1 a1,0*b0,1 +a1,1*b1,1 + a1,2*b2,1 a1,1 b1,2 a1,0*b0,2 + a1,1*b1,2 b1,1 a2,0*b0,1 + a2,1*b1,1 a2,0 b0,2 a2,0*b0,2 EECC756 - Shaaban #56 lec # 1 Spring 2006 3-14-2006 Systolic Array Example: 3x3 Systolic Array Matrix Multiplication • Processors arranged in a 2-D grid • Each processor accumulates one element of the product Alignments in time a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1 a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2 b2,2 a1,0*b0,0 + a1,1*b1,0 + a1,2*a2,0 a1,0*b0,1 +a1,1*b1,1 + a1,2*b2,1 a1,2 a1,0*b0,2 + a1,1*b1,2 + a1,2*b2,2 b2,1 a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0 T=6 Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/ a2,2 a2,0*b0,1 + a2,1*b1,1 + a2,2*b2,1 a2,1 b1,2 a2,0*b0,2 + a2,1*b1,2 EECC756 - Shaaban #57 lec # 1 Spring 2006 3-14-2006 Systolic Array Example: 3x3 Systolic Array Matrix Multiplication • Processors arranged in a 2-D grid • Each processor accumulates one element of the product Alignments in time a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 a1,0*b0,0 + a1,1*b1,0 + a1,2*a2,0 a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1 a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2 a1,0*b0,2 + a1,1*b1,2 + a1,2*b2,2 a1,0*b0,1 +a1,1*b1,1 + a1,2*b2,1 Done b2,2 a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0 T=7 Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/ a2,0*b0,1 + a2,1*b1,1 + a2,2*b2,1 a2,2 a2,0*b0,2 + a2,1*b1,2 + a2,2*b2,2 EECC756 - Shaaban #58 lec # 1 Spring 2006 3-14-2006 Source: www.top500.org EECC756 - Shaaban #59 lec # 1 Spring 2006 3-14-2006

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download EECC756 - Shaaban