Scaling the Cray MTA
Burton Smith, Cray Inc.

1. Overview
* The MTA is a uniform shared memory multiprocessor with latency tolerance using fine-grain multithreading
  - no data caches or other memory hierarchy
  - no memory bandwidth bottleneck, absent hot spots
* Every 64-bit memory word also has a full/empty bit
  - Load and store can act as receive and send, respectively
  - The bit can also implement locks and atomic updates
* Every processor has 16 protection domains
  - One is used by the operating system
  - The rest can be used to multiprogram the processor
  - We limit the number of "big" jobs per processor

2. Multithreading on one processor
[Figure: programs running in parallel (serial code, subproblems A and B) are decomposed into concurrent threads of computation, which are mapped onto the processor's 128 hardware streams; ready instructions from the streams (some streams unused) feed a pipeline of executing instructions]

3. Multithreading on multiple processors
[Figure: the same decomposition as the previous slide, with the concurrent threads multithreaded across multiple processors]

4. Typical MTA processor utilization
[Figure: percent utilization (0-100) plotted against number of streams (0-90)]

5. Processor features
* Multithreaded VLIW with three operations per 64-bit instruction word
  - The ops are named M(emory), A(rithmetic), and C(ontrol)
* 31 general-purpose 64-bit registers per stream
* Paged program address space (4KB pages)
* Segmented data address space (8KB-256MB segments)
  - Privilege and interleaving are specified in the segment descriptor
  - Data addressability to the byte level
* Explicit-dependence lookahead
* Multiple orthogonally generated condition codes
* Explicit branch target registers
* Speculative loads
* Unprivileged traps and no interrupts at all

6. Supported data types
* 8-, 16-, 32-, and 64-bit signed and unsigned integers
* 64-bit IEEE and 128-bit "doubled precision" floating point
  - conversion to and from 32-bit IEEE is supported
* Bit vectors and matrices of arbitrary shape
* 64-bit pointer with 16 tag bits and 48 address bits
* 64-bit stream status word (SSW)
* 64-bit exception register

7. Bit vector and matrix operations
* The usual logical operations and shifts are available in both A and C versions
* t = tera_bit_tally(u)
* t = tera_bit_odd_{and,nimp,or,xor}(u,v)
* x = tera_bit_{left,right}_{ones,zeros}(y,z)
* t = tera_shift_pair_{left,right}(t,u,v,w)
* t = tera_bit_merge(u,v,w)
* t = tera_bit_mat_{exor,or}(u,v)
* t = tera_bit_mat_transpose(u)
* t = tera_bit_{pack,unpack}(u,v)

8. The memory subsystem
* Program addresses are hashed to logical addresses
  - We use an invertible matrix over GF(2)
  - The result is no stride sensitivity at all
* Logical addresses are then interleaved among physical memory unit numbers and offsets
  - The number of memory units can be a power of 2 times any factor of 315 = 5*7*9
* 1, 2, or 4 GB of memory per processor
* The memory units support 1 memory reference per cycle per processor
  - plus instruction fetches to the local processor's L2 cache

9. Memory word organization
* 64-bit words
* Big-endian partial word addressing of halfwords, quarterwords, and bytes
* 4 tag bits per word (full/empty, forward, trap 1, trap 0), with 8 more bits of SECDED for the data and four more SECDED bits for the tags
* The memory implements a 64-bit fetch-and-add operation
[Diagram: word layout showing the tag bits (full/empty, forward, trap 1, trap 0) alongside the 64-bit data value, bits 63 down to 0]

10. Synchronized memory operations
* Each word of memory has an associated full/empty bit
* Normal loads ignore this bit, and normal stores set it full
* Sync memory operations are available via data declarations
  - Sync loads atomically wait for full, then load and set empty
  - Sync stores atomically wait for empty, then store and set full
* Waiting is autonomous, consuming no processor issue cycles
  - After a while, a trap occurs and the thread state is saved
* Sync and normal memory operations usually take the same time because of this "optimistic" approach
* In any event, synchronization latency is tolerated

11. I/O Processor (IOP)
* There are as many IOPs as there are processors
* An IOP program describes a sequence of unit-stride block transfers to or from anywhere in memory
* Each IOP drives a 100MB/s (32-bit) HIPPI channel
  - both directions can be driven simultaneously
  - memory-to-memory copies are also possible
* We soon expect to be leveraging off-the-shelf buses and microprocessors as outboard devices

12. The memory network
* The current MTA memory network is a 3-D toroidal mesh with pure deflection ("hot potato") routing
* It must deliver one random memory reference per processor per cycle
  - When this condition is met, the topology is transparent
* The most expensive part of the system is its wires
  - This is a general property of high bandwidth systems
* Larger systems will need more sophisticated topologies
  - Surprisingly, network topology is not a dead subject
* Unlike wires, transistors keep getting faster and cheaper
  - We should use transistors aggressively to save wires

13. Our problem is bandwidth, not latency
* In any memory network, concurrency = latency x bandwidth
* Multithreading supplies ample memory network concurrency
* Bandwidth (not latency) limits practical MTA system size
  - and large MTA systems will have expensive memory networks
* In the future, systems will be differentiated by their bandwidths
  - even to the point of implementing uniform shared memory
* System purchasers will buy the class of bandwidth they need
* System vendors will make sure their bandwidth scales properly
* The issue is the total cost of a given amount of bandwidth: how much bandwidth is enough?
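The full/empty discipline of slide 10 can be illustrated with a small software sketch. This is a hypothetical simulation using a condition variable, not the MTA hardware interface; on the MTA the waiting is autonomous and consumes no issue cycles.

```python
import threading

class FullEmptyWord:
    """Software sketch of a memory word with a full/empty bit."""

    def __init__(self):
        self._cv = threading.Condition()
        self._full = False
        self._value = None

    def normal_load(self):
        # Normal loads ignore the full/empty bit.
        with self._cv:
            return self._value

    def normal_store(self, value):
        # Normal stores ignore the current state and set the bit full.
        with self._cv:
            self._value, self._full = value, True
            self._cv.notify_all()

    def sync_load(self):
        # Atomically wait for full, then load and set empty.
        with self._cv:
            self._cv.wait_for(lambda: self._full)
            self._full = False
            self._cv.notify_all()
            return self._value

    def sync_store(self, value):
        # Atomically wait for empty, then store and set full.
        with self._cv:
            self._cv.wait_for(lambda: not self._full)
            self._value, self._full = value, True
            self._cv.notify_all()
```

A sync store followed by a sync load behaves as a one-word send/receive channel, which is how load and store "act as receive and send" on slide 1.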
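The address hashing of slide 8, multiplication by an invertible matrix over GF(2), can be sketched as follows. The 4x4 matrix here is a hypothetical illustration, not the matrix the MTA actually uses.

```python
def gf2_hash(addr, rows):
    """Multiply the bit vector `addr` by a matrix over GF(2).

    rows[i] is row i of the matrix as a bit mask; output bit i is
    the parity of (rows[i] & addr).  An invertible matrix makes the
    map a bijection, so distinct addresses never collide.
    """
    out = 0
    for i, row in enumerate(rows):
        out |= (bin(row & addr).count("1") & 1) << i
    return out

# A unit upper-triangular (hence invertible) 4x4 example matrix.
ROWS = [0b0011, 0b0110, 0b1100, 0b1000]
```

Because the matrix is invertible, hashing the 16 possible 4-bit addresses yields a permutation of 0..15, which is what spreads any fixed stride across the memory units.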
14. Reducing the number and cost of wires
* Use on-wafer and on-board wires whenever possible
* Use the highest possible bandwidth per wire
  - DWDM is not quite economical enough yet
* Use optics (or superconductors) for long-distance interconnect to avoid the skin effect
* Use direct interconnection network topologies
  - Indirect networks waste wires
* Use symmetric (bidirectional) links for fault tolerance
  - Disabling an entire cycle preserves balance
* Leverage technologies from other markets
* Base networks on low-diameter graphs of low degree
  - bandwidth per node is proportional to degree / average distance

15. Graph symmetries
* Suppose G = (V, E) is a graph with vertex set V and directed edge set E ⊆ V x V
* G is called bidirectional when (x,y) ∈ E implies (y,x) ∈ E
  - Bidirectional links are helpful for fault reconfiguration
* An automorphism of G is a mapping σ: V → V such that (x,y) is in E if and only if (σ(x), σ(y)) is also
* G is vertex-symmetric when for any pair of vertices there is an automorphism mapping one vertex to the other
* G is edge-symmetric when for any pair of edges there is an automorphism mapping one edge to the other
* Edge and vertex symmetries help in balancing network load

16. Specific bandwidth
* Consider an n-node edge-symmetric bidirectional network with (out-)degree δ and link bandwidth ω
* Let message destinations be uniformly distributed among the nodes
  - hashing memory addresses helps guarantee this
* Let d be the average distance (in hops) between nodes
* Assume every node generates messages at bandwidth b, so the total aggregate link bandwidth available is nδω
* Then nbd ≤ nδω, and therefore b/ω ≤ δ/d
* The ratio δ/d of degree to average distance limits the ratio b/ω of injection bandwidth to link bandwidth
* We call δ/d the specific bandwidth of the network

17. Graphs with average distance ≈ degree

  Degree | Diameter = Degree | Diameter = Degree + 1
  -------|-------------------|----------------------
     3   |        20         |         38
     4   |        95         |        364
     5   |       532         |      2,734
     6   |     7,817         |     13,056
     7   |    35,154         |     93,744
     8   |   234,360         |    820,260
     9   | 3,019,632         | 15,686,400

Source: Bermond, Delorme, and Quisquater, JPDC 3 (1986), p. 433

18. Cayley graphs
* Groups are a good source of low-diameter graphs
* The vertices of a Cayley graph are the group elements
* The edges leaving a vertex are the generators of the group
  - Generator g goes from node x to node x·g
* Cayley graphs are always vertex-symmetric
  - Premultiplication by y·x⁻¹ is an automorphism taking x to y
* A Cayley graph is edge-symmetric if and only if every pair of generators is related by a group automorphism
* Example: the k-ary n-cube is a Cayley graph of (Z_k)^n
  - (Z_k)^n is the n-fold direct product of the integers modulo k
  - The 2n generators are (1,0,...,0), (-1,0,...,0), ..., (0,0,...,-1)
  - This graph is clearly edge-symmetric

19. Another example: the Star graph
* The Star graph is an edge-symmetric Cayley graph of the group S_n of permutations on n symbols
* The generators are the exchanges of the rightmost symbol with every other symbol position
* It therefore has n! vertices and degree n-1
* For moderate n, the specific bandwidth is close to 1

Vertex counts at a given degree:

  Degree | Hypercube | 8-ary n-cube | Star graph
  -------|-----------|--------------|-----------
     3   |     8     |      16      |      24
     4   |    16     |      64      |     120
     5   |    32     |     128      |     720
     6   |    64     |     512      |   5,040
     7   |   128     |   1,024      |  40,320
     8   |   256     |   4,096      | 362,880

20. The Star graph of size 4! = 24
[Figure: the 24 vertices, labeled by the permutations 0123 through 3210, connected by the three exchange generators]

21. Conclusions
* The Cray MTA is a new kind of high performance system
  - scalar multithreaded processors
  - uniform shared memory
  - fine-grain synchronization
  - simple programming
* It will scale to 64 processors in 2001 and 256 in 2002
  - future versions will have thousands of processors
* It extends the capabilities of supercomputers
  - scalar parallelism, e.g. databases
  - fine-grain synchronization, e.g. sparse linear systems
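The Star graph construction of slide 19 is easy to check directly. The following sketch (illustrative code, not from the talk) builds the graph and lets one verify the claimed vertex count and degree.

```python
from itertools import permutations

def star_graph(n):
    """Star graph on the n! permutations of n symbols.

    Each generator exchanges the rightmost symbol with one of the
    other n-1 positions, so every vertex has degree n-1.
    """
    edges = {}
    for p in permutations(range(n)):
        nbrs = []
        for i in range(n - 1):
            q = list(p)
            q[i], q[-1] = q[-1], q[i]
            nbrs.append(tuple(q))
        edges[p] = nbrs
    return edges
```

For n = 4 this gives the 24-vertex, degree-3 graph of slide 20; since each exchange generator is its own inverse, every edge is paired with its reverse, making the graph bidirectional.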
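Slide 16's specific bandwidth δ/d can be computed for the Star graph by breadth-first search; because the graph is vertex-symmetric, a single BFS from any vertex gives the true average distance. This is illustrative code under that observation, not code from the talk.

```python
from collections import deque

def star_neighbors(p):
    # Generators: exchange the rightmost symbol with position i.
    out = []
    for i in range(len(p) - 1):
        q = list(p)
        q[i], q[-1] = q[-1], q[i]
        out.append(tuple(q))
    return out

def specific_bandwidth(n):
    """Return degree / average distance for the Star graph on n symbols."""
    start = tuple(range(n))
    dist = {start: 0}
    frontier = deque([start])
    while frontier:
        p = frontier.popleft()
        for q in star_neighbors(p):
            if q not in dist:
                dist[q] = dist[p] + 1
                frontier.append(q)
    # Average distance over the other n! - 1 vertices.
    avg = sum(dist.values()) / (len(dist) - 1)
    return (n - 1) / avg
```

For n = 4 the result is close to 1, consistent with the slide's claim that moderate-size Star graphs have specific bandwidth near 1.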