CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture)
Scalable Multiprocessors Case Studies (Chapter 7)
Copyright 2001 Mark D. Hill, University of Wisconsin-Madison
Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!
(C) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh

Massively Parallel Processor (MPP) Architectures
[Figure: node organization — processor, cache, and network interface on the memory bus; I/O bridge to an I/O bus with main memory, disk controllers, and disks; NI connects to the network]
• Network interface typically close to processor
  – Memory bus:
    » locked to specific processor architecture/bus protocol
  – Registers/cache:
    » only in research machines
• Time-to-market is long
  – processor already available, or work closely with processor designers
• Maximize performance and cost

Network of Workstations
[Figure: node organization — processor and cache on a core chip set; network interface, disk controller, and graphics controller on the I/O bus; NI raises interrupts and connects to the network]
• Network interface on I/O bus
• Standards (e.g., PCI) => longer life, faster to market
• Slow (microseconds) to access network interface
• "System Area Network" (SAN): between LAN & MPP

Transaction Interpretation
• Simplest MP: assist doesn't interpret much, if anything
  – DMA from/to buffer, interrupt or set flag on completion
• User-level messaging: get the OS out of the way
  – assist does protection checks to allow direct user access to network
  – may have minimal interpretation otherwise
• Virtual DMA: get the CPU out of the way (maybe)
  – basic protection plus address translation: user-level bulk DMA
• Global physical address space (NUMA): everything in hardware
  – complexity increases, but performance does too (if done right)
• Cache coherence: even more so
  – stay tuned
Spectrum of Designs
(Increasing HW support, specialization, intrusiveness, performance (???))
• None: physical bit stream
  – physical DMA: nCUBE, iPSC, ...
• User/System
  – User-level port: CM-5, *T
  – User-level handler: J-Machine, Monsoon, ...
• Remote virtual address
  – Processing, translation: Paragon, Meiko CS-2, Myrinet
  – Reflective memory (?): Memory Channel, SHRIMP
• Global physical address
  – Proc + memory controller: RP3, BBN, T3D, T3E
• Cache-to-cache (later)
  – Cache controller: Dash, KSR, Flash

Net Transactions: Physical DMA
[Figure: DMA channels on each node — command, address, length, and ready registers; data buffers in memory; status/interrupt to the processor]
• Physical addresses: OS must initiate transfers
  – system call per message on both ends: ouch
• Sending OS copies data to kernel buffer w/ header/trailer
  – can avoid copy if interface does scatter/gather
• Receiver copies packet into OS buffer, then interrupts
  – user message then copied (or mapped) into user space

Conventional LAN Network Interface
[Figure: NIC on the I/O bus with transceiver, controller, and DMA engine; host memory holds linked descriptor lists (addr/len/status/next) for TX and RX]

User Level Ports
[Figure: user/system data queues between processor and memory on each node; status/interrupt to the processor]
• map network hardware into user's address space
  – talk directly to network via loads & stores
• user-to-user communication without OS intervention: low latency
• protection: user/user & user/system
• DMA hard... CPU involvement (copying) becomes bottleneck
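The user-level port idea above — network hardware mapped into the user's address space, driven by plain loads and stores plus a polled status word — can be sketched in software. This is an illustrative model only: the class and register names (`UserPort`, `STATUS_TX_READY`, etc.) are hypothetical, not taken from any real machine.

```python
# Minimal simulation of a user-level network port: the NI appears in the
# user's address space as a pair of FIFOs plus a status word, so a process
# sends and receives with ordinary loads/stores and polling -- no syscalls.
from collections import deque

STATUS_TX_READY = 0x1   # output FIFO has space
STATUS_RX_READY = 0x2   # input FIFO has a message

class UserPort:
    def __init__(self, tx_depth=4):
        self.tx = deque()          # output FIFO (toward the network)
        self.rx = deque()          # input FIFO (from the network)
        self.tx_depth = tx_depth

    def status(self):              # models a load from the status register
        s = 0
        if len(self.tx) < self.tx_depth:
            s |= STATUS_TX_READY
        if self.rx:
            s |= STATUS_RX_READY
        return s

    def store_tx(self, word):      # models a store to the output port
        if len(self.tx) >= self.tx_depth:
            return False           # caller must poll status and retry
        self.tx.append(word)
        return True

    def load_rx(self):             # models a load from the input port
        return self.rx.popleft() if self.rx else None

def deliver(src, dst):
    """Stands in for the network: drain src's output into dst's input."""
    while src.tx:
        dst.rx.append(src.tx.popleft())
```

The model also shows the slide's last bullet: the CPU itself moves every word through the port, so copying cost becomes the bottleneck for large transfers.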
Case Study: Thinking Machines CM-5
• Follow-on to CM-2
  – Abandons SIMD, bit-serial processing
  – Uses off-the-shelf processors/parts
  – Focus on floating point
  – 32 to 1024 processors
  – Designed to scale to 16K processors
• Designed to be independent of specific processor node
• "Current" processor node
  – 40 MHz SPARC
  – 32 MB memory per node
  – 4 FP vector chips per node
(C) 2003, J. E. Smith

CM-5, contd.
• Vector Unit
  – Four FP processors
  – Direct connect to main memory
  – Each has a 1-word data path
  – FP unit can do scalar or vector
  – 128 MFLOPS peak; 50 MFLOPS Linpack

Interconnection Networks
• Two networks: Data and Control
• Network Interface
  – Memory-mapped functions
  – Store data => data
  – Part of address => control
  – Some addresses map to privileged functions

Interconnection Networks, contd.
• Input and output FIFO for each network
• Two data networks
• Save/restore network buffers on context switch
[Figure: CM-5 organization — processing partition (SPARC + FPU + cache + NI + vector units with DRAM controllers on the MBUS), control processors, and I/O partition, joined by diagnostics, control, and data networks]

Message Management
• Typical function implementation
  – Each function has two FIFOs (in and out)
  – Two outgoing control registers:
    » send and send_first
    » send_first initiates message
    » send sends any additional data
  – read send_ok to check successful send; else retry
  – read send_space to check space prior to send
  – incoming register receive_ok can be polled
  – read receive_length_left for message length
  – read receive for input data in FIFO order
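The send_first/send/send_ok protocol above can be modeled as a short poll-and-retry loop. The register names come from the slide; the FIFO model around them (`DataNetworkFIFO`, the `space` counter) is an illustrative stand-in for the real memory-mapped interface, not CM-5 behavior in detail.

```python
# Sketch of the CM-5-style send protocol: send_first starts a message, send
# appends further words, and software reads send_ok afterwards, retrying the
# whole message if the network FIFO rejected it.
from collections import deque

class DataNetworkFIFO:
    def __init__(self, space=2):
        self.space = space         # free slots, as reported by send_space
        self.out = deque()         # messages accepted by the network
        self._cur = None

    def send_first(self, dest, word):   # initiates a new message
        if self.space == 0:
            self._cur = None            # dropped; send_ok will report failure
            return
        self._cur = [dest, word]

    def send(self, word):               # additional data words
        if self._cur is not None:
            self._cur.append(word)

    def send_ok(self):                  # did the last message get in?
        if self._cur is None:
            return False
        self.out.append(self._cur)
        self.space -= 1
        self._cur = None
        return True

def send_message(ni, dest, words, max_tries=10):
    """Poll-and-retry loop a user program would run."""
    for _ in range(max_tries):
        ni.send_first(dest, words[0])
        for w in words[1:]:
            ni.send(w)
        if ni.send_ok():
            return True
    return False
```

Note the design point this exposes: because the interface is just memory-mapped FIFOs, failure is reported after the fact and recovery (the retry loop) is software's job.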
User Level Network Ports
[Figure: net output and input ports mapped into the virtual address space, alongside processor status registers and program counter]
• Appears to user as logical message queues plus status
• What happens if no user pop?

Control Network
• In general, every processor participates
  – A mode bit allows a processor to abstain
• Broadcasting
  – 3 functional units: broadcast, supervisor broadcast, interrupt
    » only 1 broadcast at a time
    » broadcast message is 1 to 15 32-bit words
• Combining
  – Operation types:
    » reduction
    » parallel prefix
    » parallel suffix
    » router-done (big-OR)
  – Combine message is a 32- to 128-bit integer
  – Operator types: OR, XOR, max, signed add, unsigned add
  – Operation and operator are encoded in the send_first address

Control Network, contd.
• Global Operations
  – Big OR of 1 bit from each processor
  – three independent units; one synchronous, 2 asynchronous
  – Synchronous useful for barrier synch
  – Asynchronous useful for passing error conditions independent of other Control unit functions
• Individual instructions are not synchronized
  – each processor fetches its own instructions
  – barrier synch via the control network is used between instruction blocks
  – => support for a loose form of data parallelism

Data Network
• Architecture
  – Fat-tree
  – Packet switched
  – Bandwidth to neighbor: 20 MB/sec
  – Latency in large network: 3 to 7 microseconds
  – Can support message passing or global address space
• Network interface
  – One data network functional unit
  – send_first gets destination address + tag
  – 1 to 5 32-bit chunks of data
  – tag can cause interrupt at receiver
• Addresses may be physical or relative
  – physical addresses are privileged
  – relative address is bounds-checked and translated
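The control network's combine operations described above have simple software meanings: every processor contributes one value, and each gets back either the full reduction or a scan result. The plain-Python functions below are illustrative models of what the hardware tree computes, not an interface to it.

```python
# Software meaning of the CM-5 combine operations: reduction gives every
# processor the combined value; parallel prefix gives processor i the
# combination of contributions 0..i; parallel suffix is the mirror image.
def reduce_all(values, op):
    acc = values[0]
    for v in values[1:]:
        acc = op(acc, v)
    return acc                      # every processor receives this

def parallel_prefix(values, op):
    out, acc = [], None
    for v in values:
        acc = v if acc is None else op(acc, v)
        out.append(acc)             # processor i receives prefix over 0..i
    return out

def parallel_suffix(values, op):
    return parallel_prefix(values[::-1], op)[::-1]
```

Router-done (big-OR) is just `reduce_all` with OR over one bit per processor, which is how the data-network drain phase on the next slide is detected.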
Data Network, contd.
• Typical message interchange:
  – alternate between pushing onto send-FIFO and receiving on receive-FIFO
  – once all messages are sent, assert this with combine function on control network
  – alternate between receiving more messages and testing control network
  – when router_done goes active, message-pass phase is complete

User Level Handlers
[Figure: user/system queues carrying data, address, and destination between processor and memory on each node]
• Hardware support to vector to address specified in message
  – message ports in registers
  – alternate register set for handler?
• Examples: J-Machine, Monsoon, *T (MIT), iWARP (CMU)

J-Machine
• Each node a small message-driven processor
• HW support to queue msgs and dispatch to msg handler task

J-Machine Message-Driven Processor
• MIT research project
• Targets fine-grain message passing
  – very low message overheads allow:
    » small messages
    » small tasks
• J-Machine architecture
  – 3D mesh direct interconnect
  – Global address space
  – up to 1 Mword of DRAM per node
  – MDP single chip supports processing, memory control, message passing
  – not an off-the-shelf chip

Features/Example: Combining Tree
• Each node collects data from lower levels, accumulates sum, and passes sum up tree when all inputs are done.
• Communication
  – SEND instructions send values up tree in small messages
  – On arrival, a task is created to perform COMBINE routine
• Synchronization
  – message arrival dispatches a task
  – example: combine message invokes COMBINE routine
  – presence bits (full/empty) on state
  – value set empty; reads are blocked until it becomes full

Features/Example, contd.
• Naming
  – segmented memory, and translation instructions
  – protection via segment descriptors
  – allocate a segment of memory for each combining node
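The combining-tree example above can be sketched as follows: each node sums values arriving from its children and sends the sum up only when all inputs have arrived, with a full/empty presence bit guarding the result. This is an illustrative model of the behavior, not MDP code; the class and exception names are made up.

```python
# Sketch of the J-Machine combining-tree example: message arrival dispatches
# a COMBINE task; the node forwards its sum upward once all children have
# reported; a read of a still-empty value would block (modeled as an error).
class CombineNode:
    def __init__(self, n_children, parent=None):
        self.expected = n_children
        self.received = 0
        self.acc = 0
        self.parent = parent
        self.full = False           # presence bit: empty until sum complete

    def combine(self, value):       # task dispatched on message arrival
        self.acc += value
        self.received += 1
        if self.received == self.expected:
            self.full = True        # mark full; blocked readers may proceed
            if self.parent is not None:
                self.parent.combine(self.acc)   # SEND the sum up the tree

    def read(self):
        if not self.full:
            raise RuntimeError("read of empty value would block")
        return self.acc
```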
Instruction Set
• Separate register sets for three execution levels:
  – background: no messages
  – priority 0: lower-priority messages
  – priority 1: higher-priority messages
• Each register set:
  – 4 GPRs
  – 4 Address regs.: contain segment descriptors (base + field length)
  – 4 ID registers: P0 and P1 only
  – instruction pointer: includes status info
• Addressing
  – an address is an offset + address register

Instruction Set, contd.
• Tags
  – 4-bit tag per 32-bit data value
  – data types, e.g. integer, boolean, symbol
  – system types, e.g. IP, addr, msg
  – 4 user-defined types
  – Full/empty types: Fut and Cfut
  – interrupt on access to Cfut value
• MDP can do operand type checking and interrupt on mismatch
  – => program does not have to do software checking
• Instructions
  – 17-bit format (two per word)
  – three operands
  – words not tagged as instructions are loaded into R0 automatically

Instruction Set, contd.
• Naming
  – ENTER instruction places translation into TLB
  – XLATE instruction does TLB translation
• Communication
  – SEND instructions
  – Example:
    » SEND R0,0         ; send net address, prio 0
    » SEND2 R1,R2,0     ; send two data words
    » SEND2E R3,[3,A3],0  ; send two more words

Instruction Set, contd.
• Task scheduling
  – Message at head of queue causes MDP to create a task
  – opcode loaded into IP
  – length field and message head loaded into A3
  – takes three cycles
  – low-latency messages executed directly
  – others specify a handler that locates a required method and transfers control to the method
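The MDP's tagged-operand checking described above can be modeled in a few lines: every word carries a tag, and an operation traps on a tag mismatch or on touching an unresolved future (Cfut), so programs need no software type tests. The tag encodings and names below are hypothetical, chosen only for illustration.

```python
# Model of MDP-style hardware type checking: a 4-bit tag rides with each
# 32-bit word; mismatched or Cfut operands raise a trap (an interrupt in
# the real machine) instead of requiring software checks.
TAG_INT, TAG_BOOL, TAG_CFUT = 0, 1, 7   # hypothetical tag encodings

class TypeTrap(Exception):
    """Stands in for the hardware interrupt on a tag mismatch."""

class Word:
    def __init__(self, tag, value):
        self.tag, self.value = tag, value

def add(a, b):
    # Touching a Cfut (unresolved future) word traps, as does mixing types.
    for w in (a, b):
        if w.tag == TAG_CFUT:
            raise TypeTrap("access to Cfut value")
    if a.tag != b.tag or a.tag != TAG_INT:
        raise TypeTrap("operand tag mismatch")
    return Word(TAG_INT, a.value + b.value)
```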
Network Architecture
• 3D mesh; up to 64K nodes
• No torus => faces used for I/O
• Bidirectional channels
  – channels can be turned around on alternate cycles
  – 9 bits data + 6 bits control => 9-bit phit
  – 2 phits per flit (granularity of flow control)
  – Each channel: 288 Mbps
  – Bisection bandwidth (1024 nodes): 18.4 Gbps
• Synchronous routers
  – clocked at 2X processor clock
  – => 9-bit phit per 62.5 ns
  – messages route at 1 cycle per hop

Dedicated Message Processing Without Specialized Hardware
[Figure: each node pairs a user processor with a system-level message processor (MP) sharing memory; the NI sits between the MP and the network]
• Msg processor performs arbitrary output processing (at system level)
• Msg processor interprets incoming network transactions (in system)
• User Processor <–> Msg Processor share memory
• Msg Processor <–> Msg Processor via system network transaction

Levels of Network Transaction
• User Processor stores cmd / msg / data into shared output queue
  – must still check for output queue full (or grow dynamically)
• Communication assists make transaction happen
  – checking, translation, scheduling, transport, interpretation
• Avoid system call overhead
• Multiple bus crossings likely bottleneck

Example: Intel Paragon
[Figure: each node has two 50 MHz i860XP processors (16 KB 4-way caches, 32 B blocks, MESI) sharing memory on a 400 MB/s bus — one compute processor and one message processor running handlers with send/receive DMA; a 2048 B NI connects to a 175 MB/s duplex network; separate service and I/O nodes with devices]
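The J-Machine network numbers quoted above are roughly self-consistent: if the 1024 nodes are arranged as a 16×8×8 mesh (an assumption made here to match the figures, not a documented configuration), the bisection cuts 8×8 = 64 channels, and 64 × 288 Mbps ≈ 18.4 Gbps.

```python
# Back-of-envelope check of the slide's numbers: bisecting a 16x8x8 mesh
# across its longest dimension cuts 8*8 = 64 channels; at 288 Mbps per
# channel that reproduces the quoted ~18.4 Gbps bisection bandwidth.
def bisection_bandwidth_bps(dims, channel_bps):
    x, y, z = sorted(dims, reverse=True)
    cut_channels = y * z            # channels crossing the mid-plane
    return cut_channels * channel_bps

per_channel = 288e6                 # 288 Mbps per channel (from the slide)
bb = bisection_bandwidth_bps((16, 8, 8), per_channel)
```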
Dedicated MP w/ Specialized NI: Meiko CS-2
• Integrate message processor into network interface
  – active-messages-like capability
  – dedicated threads for DMA, reply handling, simple remote memory access
  – supports user-level virtual DMA
    » own page table
    » can take a page fault, signal OS, restart
      • meanwhile, NACK other node
• Problem: processor is slow, time-slices threads
  – fundamental issue with building your own CPU

Myricom Myrinet (Berkeley NOW)
• Programmable network interface on I/O bus (Sun SBus or PCI)
  – embedded custom CPU ("LANai", ~40 MHz RISC CPU)
  – 256 KB SRAM
  – 3 DMA engines: to network, from network, to/from host memory
• Downloadable firmware executes in kernel mode
  – includes source-based routing protocol
• SRAM pages can be mapped into user space
  – separate pages for separate processes
  – firmware can define status words, queues, etc.
    » data for short messages or pointers for long ones
    » firmware can do address translation too... w/ OS help
  – poll to check for sends from user
• Bottom line: I/O bus still bottleneck, CPU could be faster

Shared Physical Address Space
• Implement SAS model in hardware w/o caching
  – actual caching must be done by copying from remote memory to local
  – programming paradigm looks more like message passing than Pthreads
    » yet, low latency & low overhead transfers thanks to HW interpretation; high bandwidth too if done right
    » result: great platform for MPI & compiled data-parallel codes
• Implementation:
  – "pseudo-memory" acts as memory controller for remote mem, converts accesses to network transactions (requests)
  – "pseudo-CPU" on remote node receives requests, performs them on local memory, sends replies
  – split-transaction or retry-capable bus required (or dual-ported mem)
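The pseudo-memory/pseudo-CPU mechanism just described can be sketched as follows: a load whose address names a remote node is captured by the pseudo-memory, shipped as a request, performed at the home node by the pseudo-CPU, and answered with a reply. The address split (node bits vs. offset bits) and all names here are illustrative, not any machine's actual layout.

```python
# Sketch of a shared-physical-address-space read: the high address bits name
# the home node; a remote access becomes a network request handled by that
# node's "pseudo-CPU", which replies with the data.
NODE_BITS, OFFSET_BITS = 11, 21     # hypothetical address split

class Node:
    def __init__(self, node_id, fabric):
        self.id = node_id
        self.mem = {}               # local memory: offset -> value
        self.fabric = fabric        # stands in for the network
        fabric[node_id] = self

    def load(self, addr):           # CPU issues an ordinary load
        node = addr >> OFFSET_BITS
        offset = addr & ((1 << OFFSET_BITS) - 1)
        if node == self.id:
            return self.mem.get(offset, 0)
        # pseudo-memory: convert the access into a network request/reply
        return self.fabric[node].pseudo_cpu_read(offset)

    def pseudo_cpu_read(self, offset):
        # pseudo-CPU at the home node performs the access and replies
        return self.mem.get(offset, 0)

fabric = {}
n0, n1 = Node(0, fabric), Node(1, fabric)
n1.mem[0x10] = 99
```

Note what the model omits: the real hardware must also keep the requesting bus alive while the reply is in flight, which is why a split-transaction or retry-capable bus is listed as a requirement.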
Case Study: Cray T3D
• Processing element nodes
• 3D torus interconnect
• Wormhole routing
• PE numbering
• Local memory
• Support circuitry
• Prefetch
• Messaging
• Barrier synch
• Fetch and inc.
• Block transfers

Processing Element Nodes
• Two processors per node
• Shared block transfer engine
  – DMA-like transfer of large blocks of data
• Shared network interface/router
• Synchronous 6.67 ns clock

Communication Links
• Signals:
  – Data: 16 bits
  – Channel control: 4 bits
    » request/response, virtual channel buffer
  – Channel acknowledge: 4 bits
    » virtual channel buffer status

Routing
• Dimension-order routing
  – may go in either + or - direction along a dimension
• Virtual channels
  – Four virtual channel buffers per physical channel
  – => two request channels, two response channels
• Deadlock avoidance
  – In each dimension, specify a "dateline" link
  – Packets that do not cross the dateline use virtual channel 0
  – Packets that cross the dateline use virtual channel 1

Packets
• Size: 8 physical units (phits)
  – 16 bits per phit
• Header:
  – routing info
  – destination processor
  – control info
  – source processor
  – memory address
• Read Request
  – header: 6 phits
• Read Response
  – header: 6 phits
  – body: 4 phits (1 word) or 16 phits (4 words)

Processing Nodes
• Processor: Alpha 21064
• Support circuitry:
  – Address interpretation
  – reads and writes (local/non-local)
  – data prefetch
  – messaging
  – barrier synch.
  – fetch and incr.
  – status
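The dateline rule above can be sketched for a single torus ring: a packet travels on virtual channel 0 until it crosses the designated wraparound (dateline) link, then uses virtual channel 1, breaking the cyclic channel dependence that could otherwise deadlock the ring. This is a simplified one-dimensional, one-direction model for illustration; the real router applies the rule per dimension in either direction.

```python
# One torus ring with the wraparound link (size-1 -> 0) as the dateline:
# hops before the dateline use VC 0; the dateline hop and any hops after it
# use VC 1. Returns a list of (node, next_node, vc) per hop.
def route_ring(src, dst, size):
    hops, vc, node = [], 0, src
    while node != dst:
        nxt = (node + 1) % size
        if nxt == 0:                # crossing the wraparound dateline link
            vc = 1                  # the dateline hop and beyond use VC 1
        hops.append((node, nxt, vc))
        node = nxt
    return hops
```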
Address Interpretation
• T3D needs:
  – 64 MB memory per node => 26 bits
  – up to 2048 nodes => 11 bits
• Alpha 21064 provides:
  – 43-bit virtual address
  – 32-bit physical address (+2 bits for mem-mapped devices)
• => Annex registers in DTB
  – external to Alpha
  – map 32-bit address onto 48 bit node + 27-bit address
  – 32 annex registers
  – annex registers also contain function info, e.g. cache / non-cache accesses
  – DTB modified via load-locked/store-conditional insts.

Reads and Writes
• Reads
  – note: no hardware cache coherence
  – may be cacheable or non-cacheable
  – two types of non-cacheable:
    » normal and atomic swap
    » swap uses "swaperand" register
  – three types of cacheable reads:
    » normal, atomic swap, cached readahead
    » cached readahead buffers and pre-reads 4-word blocks of data
• Writes
  – Alpha uses write-through, read-allocate cache
  – write counter: to help with consistency
    » increments on write request
    » decrements on ack

Data Prefetch
• Architected prefetch queue
  – 1 word wide by 16 deep
• Prefetch instruction:
  – Alpha prefetch hint => T3D prefetch
• Performance
  – Allows multiple outstanding read requests
  – (normal 21064 reads are blocking)

Messaging
• Message queues
  – 256 KB reserved space in local memory
  – => 4080 message packets + 16 overflow locations
• Sending processor:
  – Uses Alpha PAL code
  – builds message packets of 4 words, plus address of receiving node
• Receiving node
  – stores message
  – interrupts processor
  – processor sends an ack
  – processor may execute routine at address provided by message (active messages)
  – if message queue is full, a NACK is sent
  – also, error messages may be generated by support circuitry
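The message-queue numbers above are self-consistent if each queue slot occupies 64 bytes — a 4-word packet plus header/routing information padded to a fixed-size slot. The 64-byte slot size is an assumption made here purely to reconcile the arithmetic, not a documented T3D parameter.

```python
# Checking the slide's T3D message-queue arithmetic under an assumed
# 64-byte slot: 256 KB of reserved local memory holds 4096 slots, which
# the slide splits as 4080 regular packets plus 16 overflow locations.
RESERVED_BYTES = 256 * 1024
SLOT_BYTES = 64                     # assumed slot size (not documented)
slots = RESERVED_BYTES // SLOT_BYTES
regular, overflow = 4080, 16
```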
Barrier Synchronization
• For Barrier or Eureka
• Hardware implementation
  – hierarchical tree
  – bypasses in tree to limit its scope
  – masks for barrier bits to further limit scope
  – interrupting or non-interrupting

Fetch and Increment
• Special hardware registers
  – 2 per node
  – user accessible
  – used for auto-scheduling tasks (often loop iterations)

Block Transfer
• Special block transfer engine
  – does DMA transfers
  – can gather/scatter among processing elements
  – up to 64K packets, 1 or 4 words per packet
• Types of transfers
  – constant-stride read/write
  – gather/scatter
• Requires system call
  – for memory protection => big overhead

Cray T3E
• T3D Post Mortem
• T3E Overview
• Global Communication
• Synchronization
• Message passing
• Kernel performance

T3D Post Mortem
• Very high performance interconnect
  – 3D torus worthwhile
• Barrier network "overengineered"
  – Barrier synch not a significant fraction of runtime
• Prefetch queue useful; should be more of them
• Block transfer engine not very useful
  – high overhead to set up
  – yet another way to access memory
• DTB Annex difficult to use in general
  – one entry might have sufficed
  – every processor must have same mapping for physical page

T3E Overview
• Alpha 21164 processors
• Up to 2 GB per node
• Caches
  – 8K L1 and 96K L2 on-chip
  – supports 2 outstanding 64-byte line fills
  – stream buffers to enhance cache
  – only local memory is cached
  – => hardware cache coherence straightforward
• 512 (user) + 128 (system) E-registers for communication/synchronization
• One router chip per processor

T3E Overview, contd.
• Clock:
  – system (i.e. shell) logic at 75 MHz
  – proc at some multiple (initially 300 MHz)
• 3D torus interconnect
  – bidirectional links
  – adaptive multiple-path routing
  – links run at 300 MHz
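The fetch-and-increment registers described above are what makes "auto-scheduling" of loop iterations cheap: each processor atomically grabs the next iteration index until the loop is exhausted. The sketch below models the hardware register with a lock and uses threads in place of processors; all names are illustrative.

```python
# Self-scheduling a loop with a fetch-and-increment counter, as the T3D's
# per-node registers enable: workers atomically claim iteration indices.
import threading

class FetchAndInc:
    """Software model of a hardware fetch-and-increment register."""
    def __init__(self):
        self._v = 0
        self._lock = threading.Lock()

    def fetch_inc(self):
        with self._lock:
            v = self._v
            self._v += 1
            return v

def worker(fai, n_iters, results):
    while True:
        i = fai.fetch_inc()         # atomically claim the next iteration
        if i >= n_iters:
            break
        results[i] = i * i          # stand-in "loop body" for iteration i

def run(n_iters=100, n_workers=4):
    fai, results = FetchAndInc(), [None] * n_iters
    threads = [threading.Thread(target=worker, args=(fai, n_iters, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The attraction over static partitioning is load balance: iterations with uneven cost spread themselves across processors automatically.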
Global Communication: E-registers
• Extend address space
• Implement "centrifuge" function
• Memory-mapped (in I/O space)
• Operations:
  – loads/stores between E-registers and processor registers
  – Global E-register operations
    » transfer data to/from global memory
    » messaging
    » synchronization

Global Address Translation
• E-reg block holds base and mask
  – previously stored there as part of setup
• Remote memory access (mem-mapped store):
  – data bits: E-reg pointer (8) + address index (50)
  – address bits: command + src/dest E-reg

Address Translation, contd.
• Translation steps
  – Address index centrifuged with mask => virtual address + virtual PE
  – Offset added to base => vseg + seg offset
  – vseg translated => gseg + base PE
  – base PE + virtual PE => logical PE
  – logical PE through lookup table => physical routing tag
• GVA: gseg (6) + offset (32)
  – goes out over network to physical PE
    » at remote node, global translation => physical address

Global Communication: Gets and Puts
• Get: global read
  – word (32 or 64 bits)
  – vector (8 words, with stride)
  – stride in access E-reg block
• Put: global store
• Full/empty synchronization on E-regs
• Gets/puts are pipelined
  – up to 480 MB/sec transfer rate between nodes

Synchronization
• Atomic ops between E-regs and memory
  – Fetch & Inc
  – Fetch & Add
  – Compare & Swap
  – Masked Swap
• Barrier/Eureka synchronization
  – 32 BSUs per processor
  – accessible as memory-mapped registers
  – BSUs have states and operations
  – State transition diagrams
  – Barrier/Eureka trees are embedded in 3D torus
  – use highest-priority virtual channels
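The "centrifuge" step in the translation above separates an address index into a virtual PE number and a local offset according to a stored mask: bits where the mask is 1 gather into the PE number, bits where it is 0 gather into the offset, so the PE bits can sit anywhere in the index (supporting varied data distributions). The sketch below models only this bit-separation step, with an arbitrary index width.

```python
# Sketch of the T3E centrifuge: compress the mask=1 bits of the index into
# a virtual PE number and the mask=0 bits into a local offset, preserving
# relative bit order within each group.
def centrifuge(index, mask, width=16):
    pe = offset = 0
    pe_pos = off_pos = 0
    for bit in range(width):
        b = (index >> bit) & 1
        if (mask >> bit) & 1:       # mask=1 bits gather into the PE number
            pe |= b << pe_pos
            pe_pos += 1
        else:                       # mask=0 bits gather into the offset
            offset |= b << off_pos
            off_pos += 1
    return pe, offset
```

With an all-zero mask everything is offset (a purely local address); with an all-one mask everything is PE number — the stored mask picks any split in between.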
Message Passing
• Message queues
  – arbitrary number
  – created in user or system mode
  – mapped into memory space
  – up to 128 MB
• Message Queue Control Word in memory
  – tail pointer
  – limit value
  – threshold triggers interrupt
  – signal bit set when interrupt is triggered
• Message send
  – from block of 8 E-regs
  – Send command, similar to put
• Queue head management done in software
  – swap can manage queue in two segments

T3E Summary
• Messaging is done well...
  – within constraints of COTS processor <<buzzword alert>>
• Relies more on high communication and memory bandwidth than caching
  – => lower perf on dynamic irregular codes
  – => higher perf on memory-intensive codes with large communication
• Centrifuge: probably unique
• Barrier synch uses dedicated hardware but NOT a dedicated network

DEC Memory Channel (Princeton SHRIMP)
[Figure: reflective-memory mapping — a virtual page on the sender maps through physical pages to a virtual page on the receiver]
• Reflective memory
• Writes on sender appear in receiver's memory
  – send & receive regions
  – page control table
• Receive region is pinned in memory
• Requires duplicate writes; really just message buffers

Performance of Distributed Memory Machines
• Microbenchmarking
• One-way latency of small (five-word) message
  – echo test
  – round-trip time divided by 2
• Shared memory remote read
• Message-passing operations
  – see text

Network Transaction Performance
[Figure 7.31: per-message overhead (Os, Or), latency L, and gap, in microseconds, for CM-5, Paragon, CS-2, NOW, and T3D]

Remote Read Performance
[Figure 7.32: issue time, latency L, and gap, in microseconds, for remote reads on CM-5, Paragon, CS-2, NOW, and T3D]
Summary of Distributed Memory Machines
• Convergence of architectures
  – everything "looks basically the same"
  – processor, cache, memory, communication assist
• Communication assist
  – where is it? (I/O bus, memory bus, processor registers)
  – what does it know?
    » does it just move bytes, or does it perform some functions?
  – is it programmable?
  – does it run user code?
• Network transaction
  – input & output buffering
  – action on remote node