Parallel Computer Architecture
• A parallel computer is a collection of processing elements that cooperate to solve large problems fast.
• Broad issues involved:
– Resource allocation:
• Number of processing elements (PEs).
• Computing power of each element.
• Amount of physical memory used.
– Data access, communication and synchronization:
• How the elements cooperate and communicate.
• How data is transmitted between processors.
• Abstractions and primitives for cooperation.
– Performance and scalability:
• Performance enhancement from parallelism: speedup.
• Scalability of performance to larger systems/problems.

EECC756 - Shaaban, lec # 1, Spring 2002, 3-12-2002

The Need and Feasibility of Parallel Computing
• Application demands: more computing cycles:
– Scientific computing: CFD, biology, chemistry, physics, ...
– General-purpose computing: video, graphics, CAD, databases, transaction processing, gaming, ...
– Mainstream multithreaded programs are similar to parallel programs.
• Technology trends:
– Number of transistors on chip growing rapidly.
– Clock rates expected to go up, but only slowly.
• Architecture trends:
– Instruction-level parallelism is valuable but limited.
– Coarser-level parallelism, as in multiprocessors, is the most viable approach.
• Economics:
– Today's microprocessors have multiprocessor support, eliminating the need to design expensive custom PEs.
– Lower parallel system cost.
– Multiprocessor systems offer a cost-effective replacement for uniprocessor systems in mainstream computing.

Scientific Computing Demand
[Figure omitted.]

Scientific Supercomputing Trends
• Proving ground and driver for innovative architecture and advanced techniques:
– Market is much smaller relative to the commercial segment.
– Dominated by vector machines starting in the 70s.
– Meanwhile, microprocessors have made huge gains in floating-point performance:
• High clock rates.
• Pipelined floating-point units.
• Instruction-level parallelism.
• Effective use of caches.
• Large-scale multiprocessors replace vector supercomputers:
– Well under way already.

Raw Uniprocessor Performance: LINPACK
[Figure: LINPACK MFLOPS, 1975-2000, for CRAY vector machines (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) at n = 100 and n = 1,000, versus microprocessors (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, MIPS R4400, HP 9000/750, HP 9000/735, DEC Alpha AXP, DEC Alpha, IBM Power2/990, DEC 8200).]

Raw Parallel Performance: LINPACK
[Figure: LINPACK GFLOPS, 1985-1996, MPP peak versus CRAY peak: Xmp/416(4), Ymp/832(8), C90(16), T932(32); nCUBE/2(1024), iPSC/860, CM-2, CM-200, Delta, CM-5, Paragon XP/S, Paragon XP/S MP (1024), Paragon XP/S MP (6768), T3D, ASCI Red.]

General Technology Trends
• Microprocessor performance increases 50% - 100% per year.
• Transistor count doubles every 3 years.
• DRAM size quadruples every 3 years.
[Figure: integer and FP performance, 1987-1992, for Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC alpha.]

Clock Frequency Growth Rate
[Figure: clock rate (MHz), 1970-2005, from i4004, i8008, i8080, i8086, i80286, i80386 through Pentium100 and R10000.]
• Currently increasing 30% per year.

Transistor Count Growth Rate
[Figure: transistor count, 1970-2005, from i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000 through Pentium and R10000.]
• 100 million transistors on chip by the early 2000's.
• Transistor count grows much faster than clock rate:
– Currently 40% per year.

System Attributes to Performance
• Performance benchmarking is program-mix dependent.
• Ideal performance requires a perfect machine/program match.
• Performance measures:
– Cycles per instruction (CPI).
– Total CPU time: T = C x t = C / f = Ic x CPI x t = Ic x (p + m x k) x t, where:
Ic = instruction count, t = CPU cycle time, f = clock rate,
C = total program clock cycles, p = processor (instruction decode/execute) cycles per instruction,
m = memory references per instruction, k = ratio between memory and processor cycle times.
– MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6) = f x Ic / (C x 10^6).
– Throughput rate: Wp = f / (Ic x CPI) = (MIPS) x 10^6 / Ic  (programs per second).
• Performance factors (Ic, p, m, k, t) are influenced by: instruction-set architecture, compiler design, CPU implementation and control, and cache and memory hierarchy.

CPU Performance Trends
• The microprocessor is currently the most natural building block for multiprocessor systems in terms of cost and performance.
[Figure: relative performance, 1965-1995, of supercomputers, mainframes, minicomputers, and microprocessors; microprocessors closing the gap.]

Parallelism in Microprocessor VLSI Generations
• Bit-level parallelism → instruction-level → thread-level (?).
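The CPU-time and MIPS formulas above can be checked numerically. The sketch below is illustrative only: the workload numbers (Ic, p, m, k, f) are made up, not measurements from the lecture.

```python
# Numeric check of T = Ic x CPI x t = C / f and MIPS = f / (CPI x 10^6).
# All parameter values are invented for illustration.

def cpu_time(Ic, CPI, t):
    """Total CPU time T = Ic x CPI x t."""
    return Ic * CPI * t

def mips_rate(f_hz, CPI):
    """MIPS rate = f / (CPI x 10^6)."""
    return f_hz / (CPI * 1e6)

Ic = 2_000_000         # instruction count (hypothetical program)
p, m, k = 1.2, 0.3, 4  # decode cycles, memory refs/instr, memory/CPU cycle ratio
CPI = p + m * k        # effective cycles per instruction = 2.4
f = 500e6              # 500 MHz clock
t = 1 / f              # cycle time

T = cpu_time(Ic, CPI, t)
assert abs(T - Ic * CPI / f) < 1e-12                  # T = C / f, with C = Ic x CPI
assert abs(mips_rate(f, CPI) - Ic / (T * 1e6)) < 1e-9 # two MIPS forms agree
```

Here T = 2e6 x 2.4 / 5e8 = 9.6 ms, so both MIPS expressions evaluate to about 208.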
[Figure: transistor count, 1970-2005, from i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000 through Pentium and R10000, annotated with the bit-level, instruction-level, and thread-level parallelism eras.]

The Goal of Parallel Computing
• Goal of applications in using parallel machines: speedup.
Speedup (p processors) = Performance (p processors) / Performance (1 processor)
• For a fixed problem size (input data set), performance = 1/time:
Speedup, fixed problem (p processors) = Time (1 processor) / Time (p processors)

Elements of Modern Computers
[Diagram: computing problems drive algorithms and data structures and the hardware architecture; mapping, programming in high-level languages, and binding (compile, load) under the operating system produce applications software; performance evaluation closes the loop.]

Elements of Modern Computers
1 Computing problems:
– Numerical computing: science and technology numerical problems demand intensive integer and floating-point computations.
– Logical reasoning: artificial intelligence (AI) demands logic inferences, symbolic manipulations and large space searches.
2 Algorithms and data structures:
– Special algorithms and data structures are needed to specify the computations and communication present in computing problems.
– Most numerical algorithms are deterministic, using regular data structures.
– Symbolic processing may use heuristics or non-deterministic searches.
– Parallel algorithm development requires interdisciplinary interaction.

Elements of Modern Computers
3 Hardware resources:
– Processors, memory, and peripheral devices form the hardware core of a computer system.
– Processor instruction set, processor connectivity, and memory organization influence the system architecture.
4 Operating systems:
– Manages the allocation of resources to running processes.
– Mapping to match algorithmic structures with hardware architecture and vice versa: processor scheduling, memory mapping, interprocessor communication.
– Parallelism exploitation at: algorithm design, program writing, compilation, and run time.

Elements of Modern Computers
5 System software support:
– Needed for the development of efficient programs in high-level languages (HLLs).
– Assemblers, loaders.
– Portable parallel programming languages.
– User interfaces and tools.
6 Compiler support:
– Preprocessor compiler: sequential compiler plus a low-level library of the target parallel computer.
– Precompiler: some program flow analysis, dependence checking, limited optimizations for parallelism detection.
– Parallelizing compiler: can automatically detect parallelism in source code and transform sequential code into parallel constructs.

Approaches to Parallel Programming
(a) Implicit parallelism: the programmer writes source code in sequential languages (C, C++, FORTRAN, LISP, ...); a parallelizing compiler produces parallel object code, executed by the runtime system.
(b) Explicit parallelism: the programmer writes source code in concurrent dialects of C, C++, FORTRAN, LISP, ...; a concurrency-preserving compiler produces concurrent object code, executed by the runtime system.

Evolution of Computer Architecture
[Diagram: scalar sequential → lookahead → functional parallelism, splitting into I/E overlap (I/E: instruction fetch and execute) and multiple functional units → pipeline → implicit vector and explicit vector; explicit vector splits into memory-to-memory and register-to-register designs, leading to SIMD (single instruction stream over multiple data streams: associative processor, processor array) and MIMD (multiple instruction streams over multiple data streams: multicomputer, multiprocessor, massively parallel processors (MPPs)).]

Parallel Architectures History
• Historically, parallel architectures were tied to programming models.
• Divergent architectures, with no predictable pattern of growth: application software, system software, and architecture evolved around systolic arrays, SIMD, dataflow, message passing, and shared memory.

Programming Models
• Programming methodology used in coding applications.
• Specifies communication and synchronization.
• Examples:
– Multiprogramming: no communication or synchronization at program level.
– Shared memory address space.
– Message passing: explicit point-to-point communication.
– Data parallel: more regimented, global actions on data; implemented with shared address space or message passing.

Flynn's 1972 Classification of Computer Architecture
• Single Instruction stream over a Single Data stream (SISD): conventional sequential machines.
• Single Instruction stream over Multiple Data streams (SIMD): vector computers, arrays of synchronized processing elements.
• Multiple Instruction streams over a Single Data stream (MISD): systolic arrays for pipelined execution.
• Multiple Instruction streams over Multiple Data streams (MIMD): parallel computers:
– Shared memory multiprocessors.
– Multicomputers: unshared distributed memory; message passing used instead.

Flynn's Classification of Computer Architecture
Fig. 1.3, page 12, in Advanced Computer Architecture: Parallelism, Scalability, Programmability, Hwang, 1993.
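Flynn's taxonomy above is just a 2 x 2 classification on instruction streams and data streams; as a small sketch (the function name and stream counts are mine, not from the lecture), it can be expressed as a lookup:

```python
# Hedged sketch: derive the Flynn class from the number of instruction
# streams and data streams a machine supports.

def flynn_class(instruction_streams: int, data_streams: int) -> str:
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"

# Examples matching the slide's categories:
assert flynn_class(1, 1) == "SISD"   # conventional sequential machine
assert flynn_class(1, 64) == "SIMD"  # array of synchronized PEs
assert flynn_class(8, 1) == "MISD"   # systolic pipelined execution
assert flynn_class(8, 8) == "MIMD"   # multiprocessor or multicomputer
```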
Current Trends in Parallel Architectures
• The extension of "computer architecture" to support communication and cooperation:
– OLD: instruction set architecture.
– NEW: communication architecture.
• Defines:
– Critical abstractions, boundaries, and primitives (interfaces).
– Organizational structures that implement interfaces (hardware or software).
• Compilers, libraries and the OS are important bridges today.

Modern Parallel Architecture Layered Framework
[Diagram, top to bottom: parallel applications (CAD, database, scientific modeling); programming models (multiprogramming, shared address, message passing, data parallel); communication abstraction (user/system boundary), supported by compilation or library and operating systems support; communication hardware (hardware/software boundary); physical communication medium.]

Shared Address Space Parallel Architectures
• Any processor can directly reference any memory location:
– Communication occurs implicitly as a result of loads and stores.
• Convenient:
– Location transparency.
– Similar programming model to time-sharing on uniprocessors:
• Except processes run on different processors.
• Good throughput on multiprogrammed workloads.
• Naturally provided on a wide range of platforms:
– Wide range of scale: few to hundreds of processors.
• Popularly known as shared memory machines or model:
– Ambiguous: memory may be physically distributed among processors.

Shared Address Space (SAS) Model
• Process: virtual address space plus one or more threads of control.
• Portions of the address spaces of processes are shared.
[Diagram: virtual address spaces for a collection of processes P0 ... Pn communicating via shared addresses; each space has a shared portion mapped to common physical addresses and a private portion (P0 private ... Pn private); loads and stores to the shared portion reference the same machine physical address space.]
• Writes to shared addresses are visible to other threads (in other processes too).
• Natural extension of the uniprocessor model:
– Conventional memory operations used for communication.
– Special atomic operations needed for synchronization.
• The OS uses shared memory to coordinate processes.

Models of Shared-Memory Multiprocessors
• The Uniform Memory Access (UMA) model:
– The physical memory is shared by all processors.
– All processors have equal access to all memory addresses.
• Distributed memory or Nonuniform Memory Access (NUMA) model:
– Shared memory is physically distributed locally among processors.
• The Cache-Only Memory Architecture (COMA) model:
– A special case of a NUMA machine where all distributed main memory is converted to caches.
– No memory hierarchy at each processor.
[Diagram: UMA — processors and I/O devices share memory modules over an interconnect (bus, crossbar, or multistage network); NUMA — a network connects nodes of processor + cache + local memory; COMA — a network connects nodes of processor + cache + cache directory. P: processor, M: memory, C: cache, D: cache directory.]

Uniform Memory Access Example: Intel Pentium Pro Quad
[Diagram: P-Pro processor modules (CPU, 256-KB L2 cache, interrupt controller, bus interface) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with a memory controller and MIU to 1-, 2-, or 4-way interleaved DRAM, and PCI bridges to PCI buses with PCI I/O cards.]
• All coherence and multiprocessing glue in the processor module.
• Highly integrated, targeted at high volume.
• Low latency and bandwidth.

Uniform Memory Access Example: SUN Enterprise
[Diagram: CPU/memory cards (two processors with L2 caches, memory controller, bus interface/switch) and I/O cards (bus interface, SBUS slots, 100bT, SCSI, 2 FiberChannel) on the Gigaplane bus (256-bit data, 41-bit address, 83 MHz).]
– 16 cards of either type: processors + memory, or I/O.
– All memory accessed over the bus, so symmetric.
– Higher bandwidth, higher latency bus.

Distributed Shared-Memory Multiprocessor System Example: Cray T3E
[Diagram: nodes (processor + cache, memory, memory controller and network interface) connected by X/Y/Z switches in a 3D torus; external I/O.]
• Scales up to 1024 processors; 480 MB/s links.
• The memory controller generates communication requests for nonlocal references.
• No hardware mechanism for coherence (SGI Origin etc. provide this).

Message-Passing Multicomputers
• Comprised of multiple autonomous computers (nodes).
• Each node consists of a processor, local memory, attached storage and I/O peripherals.
• Programming model more removed from basic hardware operations.
• Local memory is only accessible by local processors.
• A message-passing network provides point-to-point static connections among the nodes.
• Inter-node communication is carried out by message passing through the static connection network.
• Process communication is achieved using a message-passing programming environment.
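The message-passing style just described can be sketched in plain Python. This is only an illustration of the send/receive abstraction: queues stand in for the interconnect and threads stand in for processes on separate nodes (a real multicomputer has disjoint memories); all names here are mine.

```python
# Sketch of message-passing cooperation: two workers interact only through
# explicit sends and receives on channels, never through shared variables.
import queue
import threading

to_q = queue.Queue()   # channel P -> Q
to_p = queue.Queue()   # channel Q -> P

def node_q():
    """Node Q: receive X from P, compute on it, send result Y back to P."""
    x = to_q.get()     # blocks until P's send arrives (pairwise synchronization)
    to_p.put(x * 2)    # explicit communication back to P

t = threading.Thread(target=node_q)
t.start()
to_q.put(21)           # "Send X, Q": P names the destination and its local data
y = to_p.get()         # "Receive Y, P": blocks until Q's reply arrives
t.join()
print(y)               # prints 42
```

The blocking get() on each side is what gives the simplest send/receive match its pairwise-synchronization property, as the slide notes.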
Message-Passing Abstraction
[Diagram: process P executes "Send X, Q, t", naming address X in its local process address space; process Q executes "Receive Y, P, t", naming local address Y; the matched send/receive copies the data between the two local address spaces.]
• Send specifies the buffer to be transmitted and the receiving process.
• Receive specifies the sending process and the application storage to receive into.
• Memory-to-memory copy, but need to name processes.
• Optional tag on send and matching rule on receive.
• The user process names local data and entities in process/tag space too.
• In the simplest form, the send/receive match achieves a pairwise synchronization event.
• Many overheads: copying, buffer management, protection.

Message-Passing Example: IBM SP-2
[Diagram: IBM SP-2 node — POWER2 CPU, L2 cache, memory bus, memory controller with 4-way interleaved DRAM; MicroChannel I/O bus carrying the NIC (i860 network-interface processor, NI, DRAM, DMA, I/O); general interconnection network formed from 8-port switches.]
• Made out of essentially complete RS6000 workstations.
• Network interface integrated in the I/O bus (bandwidth limited by the I/O bus).

Message-Passing Example: Intel Paragon
[Diagram: Intel Paragon node — two i860 processors with L1 caches on a 64-bit, 50-MHz memory bus, memory controller, 4-way interleaved DRAM, DMA, driver, NI; 2D grid network (8 bits, 175 MHz, bidirectional) with a processing node attached to every switch. Sandia's Intel Paragon XP/S-based supercomputer.]

Message-Passing Programming Tools
• Message-passing programming libraries include:
– Message Passing Interface (MPI):
• Provides a standard for writing concurrent message-passing programs.
• MPI implementations include parallel libraries used by existing programming languages.
– Parallel Virtual Machine (PVM):
• Enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource.
• PVM support software executes on each machine in a user-configurable pool, and provides a computational environment of concurrent applications.
• User programs, written for example in C, Fortran or Java, are given access to PVM through calls to PVM library routines.

Data Parallel Systems
SIMD in Flynn's taxonomy.
• Programming model:
– Operations performed in parallel on each element of a data structure.
– Logically a single thread of control, performing sequential or parallel steps.
– Conceptually, a processor is associated with each data element.
• Architectural model:
– Array of many simple, cheap processors (PEs), each with little memory:
• Processors don't sequence through instructions.
– Attached to a control processor that issues instructions.
– Specialized and general communication, cheap global synchronization.
• Example machines:
– Thinking Machines CM-1, CM-2 (and CM-5).
– Maspar MP-1 and MP-2.

Dataflow Architectures
• Represent computation as a graph of essential dependences:
– Logical processor at each node, activated by availability of operands.
– Messages (tokens) carrying the tag of the next instruction are sent to the next processor.
– Tag compared with others in the matching store; a match fires execution.
• Example dataflow graph: a = (b + 1) x (b - c); d = c x e; f = a x d.
[Diagram: the dataflow graph for the example above, and a node's processing pipeline — token store and program store, waiting-matching, instruction fetch, execute, form token, token queue, connected through the network.]

Systolic Architectures
• Replace a single processor with an array of regular processing elements.
• Orchestrate data flow for high throughput with less memory access.
[Diagram: memories M feeding a linear array of PEs.]
• Different from pipelining:
– Nonlinear array structure, multidirectional data flow; each PE may have (small) local instruction and data memory.
• Different from SIMD: each PE may do something different.
• Initial motivation: VLSI enables inexpensive special-purpose chips.
• Represent algorithms directly by chips connected in a regular pattern.
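The dataflow firing rule described above (a node executes as soon as all of its operand tokens are available) can be sketched for the example graph a = (b + 1) x (b - c), d = c x e, f = a x d. This is a toy interpreter of my own, not a model of any real token-matching machine.

```python
# Toy dataflow evaluation: each node fires when all its input tokens exist,
# regardless of program order. Graph: a = (b+1)*(b-c); d = c*e; f = a*d.
import operator

nodes = {
    "t1": (operator.add, ["b", "one"]),  # b + 1
    "t2": (operator.sub, ["b", "c"]),    # b - c
    "a":  (operator.mul, ["t1", "t2"]),
    "d":  (operator.mul, ["c", "e"]),
    "f":  (operator.mul, ["a", "d"]),
}

def run(initial_tokens):
    """Repeatedly fire any node whose operand tokens are all available."""
    tokens = dict(initial_tokens)
    fired = True
    while fired:
        fired = False
        for name, (op, ins) in nodes.items():
            if name not in tokens and all(i in tokens for i in ins):
                tokens[name] = op(*(tokens[i] for i in ins))  # node fires
                fired = True
    return tokens

out = run({"b": 3, "c": 1, "e": 5, "one": 1})
print(out["f"])   # t1=4, t2=2 -> a=8; d=5; f=40, so this prints 40
```

Note that "a" and "d" have no dependence on each other, so a dataflow machine could fire them in parallel; the sequential loop here only simulates that availability-driven order.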