CS252 Graduate Computer Architecture
Lecture 21: Multiprocessor Networks (con't)
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252
4/15/2009, cs252-S09, Lecture 21

Review: On Chip: Embeddings in Two Dimensions
[Figure: a 6 x 3 x 2 logical topology embedded in two physical dimensions]
• Embed multiple logical dimensions in one physical dimension using long wires
• When embedding a higher dimension in a lower one, either some wires are longer than others, or all wires are long

Review: Store & Forward vs Cut-Through Routing
[Figure: time-space diagrams of a 4-flit packet (flits 3 2 1 0) traveling from source to destination under store & forward and under cut-through routing]
• Store & forward time: h(n/b + Δ), or h(n/w + Δ) in cycles
• Cut-through time: n/b + hΔ, or n/w + hΔ in cycles
  (n = message size, b = channel bandwidth, w = channel width, h = hops, Δ = routing delay per hop)
• What if the message is fragmented?
• Wormhole vs virtual cut-through

Contention
• Two packets trying to use the same link at the same time
  – limited buffering
  – drop?
• Most parallel machine networks block in place
  – link-level flow control
  – tree saturation
• Closed system: offered load depends on delivered load
  – source squelching

Bandwidth
• What affects local bandwidth?
  – packet density: b x n_data/n
  – routing delay: b x n_data/(n + wΔ)
  – contention
    » at endpoints
    » within the network
• Aggregate bandwidth
  – bisection bandwidth
    » sum of the bandwidth of the smallest set of links that partition the network
  – total bandwidth of all the channels: Cb
  – suppose N hosts each issue a packet every M cycles with average distance h
    » each message occupies h channels for l = n/w cycles each
    » C/N channels are available per node
    » link utilization for store-and-forward: ρ = (hl/M channel cycles per node)/(C/N) = Nhl/MC < 1!
    » link utilization for wormhole routing?
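The review formulas above can be sketched directly. This is a minimal illustration of the store & forward vs cut-through latency expressions and the store-and-forward link-utilization bound; the function names and the example values are ours, not from the lecture.

```python
def store_and_forward_cycles(n, w, h, delta):
    """Whole packet received at each hop before forwarding: h*(n/w + delta)."""
    return h * (n / w + delta)

def cut_through_cycles(n, w, h, delta):
    """Header pipelines through; serialization is paid once: n/w + h*delta."""
    return n / w + h * delta

def sf_link_utilization(N, h, l, M, C):
    """rho = N*h*l/(M*C) from the Bandwidth slide; must stay below 1."""
    return N * h * l / (M * C)
```

For example, with n = 1024 bits, w = 32, h = 8 hops, and Δ = 2 cycles, store & forward takes 272 cycles while cut-through takes 48: the hop count multiplies the serialization time in one case but only the routing delay in the other.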
Saturation
[Figure: latency vs delivered bandwidth rises sharply at saturation; delivered bandwidth vs offered bandwidth flattens out at saturation]

How Many Dimensions?
• n = 2 or n = 3
  – Short wires, easy to build
  – Many hops, low bisection bandwidth
  – Requires traffic locality
• n >= 4
  – Harder to build, more wires, longer average length
  – Fewer hops, better bisection bandwidth
  – Can handle non-local traffic
• k-ary d-cubes provide a consistent framework for comparison
  – N = k^d
  – scale dimension (d) or nodes per dimension (k)
  – assume cut-through

Traditional Scaling: Latency Scaling with N
[Figure: average latency T(n=40) and T(n=140) vs machine size N, for d = 2, 3, 4, k = 2, and n/w]
• Assumes equal channel width
  – independent of node count or dimension
  – dominated by average distance

Average Distance
[Figure: average distance vs dimension for N = 256, 1024, 16384, 1048576]
• ave dist = d(k-1)/2
• but equal channel width is not equal cost!
• Higher dimension => more channels

In the 3D World
• For n nodes, bisection area is O(n^(2/3))
• For large n, bisection bandwidth is limited to O(n^(2/3))
  – Bill Dally, IEEE TPDS, [Dal90a]
  – For fixed bisection bandwidth, low-dimensional k-ary n-cubes are better (otherwise higher is better)
  – i.e., a few short fat wires are better than many long thin wires
  – What about many long fat wires?
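The k-ary d-cube framework above lends itself to a small calculation. A minimal sketch, assuming the slide's formulas N = k^d and average distance d(k-1)/2; the helper names are ours.

```python
def k_for(N, d):
    """Radix k such that N = k**d (scale d or k, per the slide)."""
    k = round(N ** (1.0 / d))
    if k ** d != N:
        raise ValueError("N must be a perfect d-th power")
    return k

def avg_distance(N, d):
    """Average routing distance d*(k-1)/2 from the Average Distance slide."""
    k = k_for(N, d)
    return d * (k - 1) / 2
```

For N = 1024 this reproduces the trend in the figure: d = 2 gives an average distance of 31 hops, while d = 10 (a hypercube) gives only 5.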
Dally Paper (con't)
• Equal bisection: W = 1 for the hypercube => W = k/2 for the torus
• Three wire models:
  – Constant delay, independent of length
  – Logarithmic delay with length (exponential driver tree)
  – Linear delay (speed of light/optimal repeaters)
[Figure: latency vs dimension under the logarithmic and linear delay models]

Equal Cost in k-ary n-cubes
• Equal number of nodes? Equal number of pins/wires? Equal bisection bandwidth? Equal area? Equal wire length?
• What do we know?
  – switch degree: d; diameter = d(k-1)
  – total links = Nd
  – pins per node = 2wd
  – bisection = k^(d-1) = N/k links in each direction
  – 2Nw/k wires cross the middle

Latency for Equal Width Channels
[Figure: average latency (n = 40, Δ = 2) vs dimension for N = 256, 1024, 16384, 1048576]
• total links(N) = Nd

Latency with Equal Pin Count
[Figure: average latency T(n = 40B) and T(n = 140B) vs dimension (d) for 256, 1024, 16k, and 1M nodes]
• Baseline d = 2 has w = 32 (128 wires per node)
• fix 2dw pins => w(d) = 64/d
• distance goes up with d, but channel time goes down

Latency with Equal Bisection Width
• N-node hypercube has N bisection links
• 2-d torus has 2N^(1/2)
• Fixed bisection => w(d) = N^(1/d)/2 = k/2
[Figure: average latency T(n=40) vs dimension (d) for 256, 1024, 16k, and 1M nodes]
• 1M nodes at d = 2 has w = 512!
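The two cost-equalization rules above can be written down directly. A minimal sketch, assuming the slide's rules (equal pin count w(d) = pins/(2d); equal bisection w(d) = k/2) and the cut-through model from the earlier review slide; function names and example values are ours.

```python
def equal_pin_width(d, pin_budget=128):
    """Fix 2*d*w pins per node => w(d) = pin_budget/(2d); baseline d=2 gives w=32."""
    return pin_budget / (2 * d)

def equal_bisection_width(N, d):
    """Fix bisection wires => w(d) = N**(1/d)/2 = k/2, per the slide."""
    k = round(N ** (1.0 / d))
    return k / 2

def cut_through_latency(n_bits, w, h, delta):
    """Cut-through model from the review slide: T = n/w + h*delta."""
    return n_bits / w + h * delta
```

Note how the equal-bisection rule reproduces the slide's striking data point: a 1M-node network at d = 2 gets channels 512 bits wide.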
Larger Routing Delay (with Equal Pin Count)
[Figure: average latency T(n = 140B) vs dimension (d) for 256, 1024, 16k, and 1M nodes]
• Dally's conclusions strongly influenced by assumption of small routing delay
  – Here, routing delay Δ = 20

Latency under Contention
[Figure: latency vs channel utilization for n = 4, 8, 16, 40 combined with (d=2, k=32) and (d=3, k=10)]
• Optimal packet size? Channel utilization?

Saturation
[Figure: latency vs average channel utilization for n/w = 4, 8, 16, 40]
• Fatter links shorten queueing delays

Phits per Cycle
[Figure: latency vs flits per cycle per processor for (n=8, d=3, k=10) and (n=8, d=2, k=32)]
• higher-degree network has larger available bandwidth
  – cost?

Discussion
• Rich set of topological alternatives with deep relationships
• Design point depends heavily on cost model
  – nodes, pins, area, ...
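The sensitivity to routing delay above can be reproduced with a toy sweep. A minimal sketch, assuming the equal-pin-count width rule and the cut-through latency model from the preceding slides; the function name, parameter values, and the restriction to dimensions where k is an integer are all ours.

```python
def best_dimension(N, n_bits, delta, pin_budget=128):
    """Pick the dimension d (where k = N**(1/d) is an integer) minimizing
    cut-through latency n/w + h*delta, with w = pin_budget/(2d) and
    h = d*(k-1)/2. Toy model to show how delta shifts the optimum."""
    best = None
    for d in range(2, 21):
        k = round(N ** (1.0 / d))
        if k < 2 or k ** d != N:
            continue  # only dimensions where N is a perfect d-th power
        w = pin_budget / (2 * d)
        h = d * (k - 1) / 2
        latency = n_bits / w + h * delta
        if best is None or latency < best[1]:
            best = (d, latency)
    return best[0]
```

With N = 2^20 nodes and n = 1280 bits, the optimum sits at d = 5 for Δ = 2 but moves up to d = 10 for Δ = 20: a large per-hop routing delay makes hops expensive and favors higher dimension, which is exactly the slide's caveat about Dally's conclusions.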
  – Wire length or wire delay metrics favor small dimension
  – Long (pipelined) links increase optimal dimension
• Need a consistent framework and analysis to separate opinion from design
• Optimal point changes with technology

Another Idea: Express Cubes
• Problem: low-dimensional networks have high k
  – Consequence: may have to travel many hops in a single dimension
  – Routing latency can dominate long-distance traffic patterns
• Solution: provide one or more "express" links
  – Like express trains, express elevators, etc.
    » Delay linear with distance, lower constant
    » Closer to "speed of light" in medium
    » Lower power, since no router cost
  – "Express Cubes: Improving performance of k-ary n-cube interconnection networks," Bill Dally, 1991
• Another idea: route with pass transistors through links

The Routing Problem: Local Decisions
[Figure: a switch with input ports (receiver, input buffer), a crossbar, output buffers and transmitters on the output ports, and control for routing and scheduling]
• Routing at each hop: pick the next output port!

How Do You Build a Crossbar?
[Figure: a 4x4 crossbar (inputs I0-I3, outputs O0-O3) built as a grid of crosspoints, and an alternative RAM-based realization with phase, RAM addr, Din, Dout]

Input Buffered Switch
[Figure: input ports with routing logic R0-R3 feeding a crossbar to the output ports, with a scheduler]
• Independent routing logic per input
  – FSM
• Scheduler logic arbitrates each output
  – priority, FIFO, random
• Head-of-line blocking problem

Output Buffered Switch
[Figure: input ports with routing logic R0-R3; each output port has its own buffer pool; control logic]
• How would you build a shared pool?

Output Scheduling
[Figure: input buffers R0-R3 feeding a crossbar to output ports O0-O2]
• n independent arbitration problems?
  – static priority, random, round-robin
• simplifications due to routing algorithm?
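The head-of-line blocking problem mentioned above is easy to demonstrate in miniature. A minimal sketch, assuming one packet per input and per output per cycle and FIFO queues at each input; the model and names are ours, not from the lecture.

```python
from collections import deque

def simulate_input_queued(queues):
    """Drain FIFO input queues (each entry is a destination output port).
    Each cycle, every input may send its head packet if that output has
    not already been granted this cycle. Returns cycles to drain all queues."""
    queues = [deque(q) for q in queues]
    cycles = 0
    while any(queues):
        cycles += 1
        granted = set()
        for q in queues:
            # A head packet whose output is taken blocks its whole queue,
            # even if packets behind it want a free output: HOL blocking.
            if q and q[0] not in granted:
                granted.add(q.popleft())
    return cycles
```

With input 0 holding [out0] and input 1 holding [out0, out1], draining takes 3 cycles: in cycle 1 input 1's head loses arbitration for out0 and blocks the packet behind it that wanted the idle out1. Reordering input 1 to [out1, out0] drains in 2 cycles, which is what an output-buffered switch achieves.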
• general case is max bipartite matching

Switch Components
• Output ports
  – transmitter (typically drives clock and data)
• Input ports
  – synchronizer aligns data signal with the local clock domain
  – essentially a FIFO buffer
• Crossbar
  – connects each input to any output
  – degree limited by area or pinout
• Buffering
• Control logic
  – complexity depends on routing logic and scheduling algorithm
  – determine output port for each incoming packet
  – arbitrate among inputs directed at the same output

Properties of Routing Algorithms
• Routing algorithm:
  – R: N x N -> C, which at each switch maps the destination node n_d to the next channel on the route
  – which of the possible paths are used as routes?
  – how is the next hop determined?
    » arithmetic
    » source-based port select
    » table-driven
    » general computation
• Deterministic
  – route determined by (source, dest), not intermediate state (i.e., traffic)
• Adaptive
  – route influenced by traffic along the way
• Minimal
  – only selects shortest paths
• Deadlock-free
  – no traffic pattern can lead to a situation where packets are deadlocked and never move forward

Routing Mechanism
• need to select the output port for each input packet
  – in a few cycles
• Simple arithmetic in regular topologies
  – ex: Δx, Δy routing in a grid
    » west (-x):   Δx < 0
    » east (+x):   Δx > 0
    » south (-y):  Δx = 0, Δy < 0
    » north (+y):  Δx = 0, Δy > 0
    » processor:   Δx = 0, Δy = 0
• Reduce the relative address of each dimension in order
  – dimension-order routing in k-ary d-cubes
  – e-cube routing in n-cubes

Deadlock Freedom
• How can deadlock arise?
  – necessary conditions:
    » shared resource
    » incrementally allocated
    » non-preemptible
  – think of a channel as a shared resource that is acquired incrementally
    » source buffer, then destination buffer
    » channels along a route
• How do you avoid it?
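The Δx, Δy port-selection table above maps directly to code. A minimal sketch of that arithmetic routing decision; the function name is ours, and the port names follow the slide.

```python
def dimension_order_port(src, dst):
    """Pick the next output port for Δx,Δy routing in a 2D grid:
    resolve x first, then y, then deliver to the local processor."""
    dx = dst[0] - src[0]
    dy = dst[1] - src[1]
    if dx < 0:
        return "west"
    if dx > 0:
        return "east"
    if dy < 0:
        return "south"
    if dy > 0:
        return "north"
    return "processor"
```

Applied hop by hop, this reduces the relative address one dimension at a time, which is exactly dimension-order routing.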
  – constrain how channel resources are allocated
  – ex: dimension order
• How do you prove that a routing algorithm is deadlock-free?
  – Show that the channel dependency graph has no cycles!

Consider Trees
• Why is the obvious routing on X deadlock-free?
  – butterfly?
  – tree?
  – fat tree?
• Any assumptions about the routing mechanism? Amount of buffering?

Up*-Down* Routing for General Topology
• Given any bidirectional network
• Construct a spanning tree
• Number the nodes, increasing from leaves to root
• UP = increasing node numbers
• Any source -> dest by an UP*-DOWN* route
  – up edges, single turn, down edges
  – Proof of deadlock freedom?
• Performance?
  – Some numberings and routes are much better than others
  – interacts with topology in strange ways

Turn Restrictions in X,Y
[Figure: the eight possible turns among +X, -X, +Y, -Y, with the four turns allowed by XY routing marked]
• XY routing forbids 4 of the 8 turns and leaves no room for adaptive routing
• Can you allow more turns and still be deadlock-free?
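The up*-down* rule above (up edges, a single turn, then down edges) can be checked mechanically. A minimal sketch, assuming the slide's numbering convention (numbers increase from leaves to root, so an "up" hop moves to a higher-numbered node); the function name is ours.

```python
def is_updown_legal(route, number):
    """True iff the route is a legal up*/down* path: every hop toward a
    higher-numbered node (up) precedes every hop toward a lower one (down)."""
    seen_down = False
    for a, b in zip(route, route[1:]):
        going_up = number[b] > number[a]
        if going_up and seen_down:
            return False  # a second turn (down then up) is forbidden
        if not going_up:
            seen_down = True
    return True
```

Because every legal route climbs monotonically and then descends monotonically, no cyclic wait among channels can form, which is the intuition behind the deadlock-freedom proof the slide asks for.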
Minimal Turn Restrictions in 2D
[Figure: allowed and forbidden turns among +x, -x, +y, -y for the west-first, north-last, and negative-first routing schemes]

Example Legal West-First Routes
[Figure: sample west-first routes in a 2D mesh]
• Can route around failures or congestion
• Can combine turn restrictions with virtual channels

General Proof Technique
• resources are logically associated with channels
• messages introduce dependences between resources as they move forward
• need to articulate the possible dependences that can arise between channels
• show that there are no cycles in the Channel Dependence Graph
  – find a numbering of channel resources such that every legal route follows a monotonic sequence => no traffic pattern can lead to deadlock
• the network need not be acyclic, just the channel dependence graph

Example: k-ary 2D Array
• Thm: dimension-ordered (x,y) routing is deadlock-free
• Numbering
  – +x channel (i,y) -> (i+1,y) gets number i
  – similarly for -x, with 0 as the most positive edge
  – +y channel (x,j) -> (x,j+1) gets number N+j
  – similarly for -y channels
• any routing sequence: x direction, turn, y direction is increasing
• Generalization:
  – "e-cube routing" on 3-D: X then Y then Z
[Figure: a 4x4 array with nodes 00-33 and channel numbers illustrating the numbering scheme]

Channel Dependence Graph
[Figure: the channel dependence graph for dimension-order routing on the 4x4 array; every dependence edge goes to a higher-numbered channel, so there is no cycle]

More Examples
• What about wormhole routing on a ring?
[Figure: an 8-node unidirectional ring, nodes 0-7]
• Or: a unidirectional torus of higher dimension?

Deadlock-Free Wormhole Networks?
• Basic dimension-order routing techniques don't work for unidirectional k-ary d-cubes
  – only for k-ary d-arrays (bidirectional)
  – And: dimension-ordered routing is not adaptive!
• Idea: add channels!
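The proof technique above (build the channel dependence graph, show it has no cycles) can be exercised on small examples. A minimal sketch, assuming dimension-order x-then-y routing on a bidirectional k x k array and the slide's wormhole-on-a-ring example; all function names are ours.

```python
from itertools import product

def xy_route(src, dst):
    """Dimension-order path in a 2D array: resolve x fully, then y."""
    path, (x, y) = [src], src
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

def channel_dependence_graph(k):
    """CDG for x-then-y routing on a k x k array: channel c1 depends on c2
    when some route uses c1 immediately before c2."""
    deps = {}
    nodes = list(product(range(k), repeat=2))
    for src, dst in product(nodes, nodes):
        if src == dst:
            continue
        path = xy_route(src, dst)
        chans = list(zip(path, path[1:]))
        for c1, c2 in zip(chans, chans[1:]):
            deps.setdefault(c1, set()).add(c2)
    return deps

def ring_cdg(k):
    """CDG for wormhole routing on a unidirectional k-node ring."""
    deps = {}
    for s in range(k):
        for dist in range(1, k):
            path = [(s + i) % k for i in range(dist + 1)]
            chans = list(zip(path, path[1:]))
            for c1, c2 in zip(chans, chans[1:]):
                deps.setdefault(c1, set()).add(c2)
    return deps

def has_cycle(deps):
    """Depth-first search for a cycle in the dependence graph."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def dfs(u):
        color[u] = GRAY
        for v in deps.get(u, ()):
            c = color.get(v, WHITE)
            if c == GRAY or (c == WHITE and dfs(v)):
                return True
        color[u] = BLACK
        return False
    return any(color.get(u, WHITE) == WHITE and dfs(u) for u in list(deps))
```

Running this confirms both claims from the slides: the array's CDG is acyclic (dimension-order routing is deadlock-free there), while the unidirectional ring's CDG contains a cycle through the wrap-around channel.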
  – provide multiple "virtual channels" to break the dependence cycle
  – good for bandwidth too!
[Figure: switch with input ports, output ports, and a crossbar; each physical channel is backed by multiple virtual-channel buffers]
  – Do not need to add links or crossbar, only buffer resources
• This adds nodes to the CDG; remove edges?

When Are Virtual Channels Allocated?
[Figure: hardware-efficient design for the crossbar, with virtual-channel buffers on the input and output ports]
• Two separate processes:
  – Virtual-channel allocation
  – Switch/connection allocation
• Virtual-Channel Allocation
  – Choose a route and a free output virtual channel
• Switch Allocation
  – For each incoming virtual channel, must negotiate for the switch on the outgoing pin
• In the ideal case (not highly loaded), would like to optimistically allocate a virtual channel

Breaking Deadlock with Virtual Channels
[Figure: a packet switches from the "lo" channel to the "hi" channel, breaking the dependence cycle]

Paper Discussion: Linder and Harden
• Paper: "An Adaptive and Fault Tolerant Wormhole Routing Strategy for k-ary n-cubes"
  – Daniel Linder and Jim Harden
• General virtual-channel scheme for k-ary n-cubes
  – With wrap-around paths
• Properties of the result for a unidirectional k-ary n-cube:
  – 1 virtual interconnection network
  – n+1 levels
• Properties of the result for a bidirectional k-ary n-cube:
  – 2^(n-1) virtual interconnection networks
  – n+1 levels per network

Example: Unidirectional 4-ary 2-cube
[Figure: the physical network beside its virtual network]
• Wrap-around channels are necessary but can cause deadlock
• Use VCs to avoid deadlock
• 1 level for each wrap-around

Bi-directional 4-ary 2-cube: 2 Virtual Networks
[Figure: Virtual Network 1 and Virtual Network 2]
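The "lo"/"hi" trick above can be made concrete on the ring example. A minimal sketch, assuming the common dateline scheme (a packet starts on the "lo" virtual channel and switches to "hi" after using the wrap-around channel) and a channel numbering in the spirit of the earlier proof slides; the function name and numbering are ours.

```python
def ring_route_channels(src, dst, k):
    """Channel numbers a packet acquires on a unidirectional k-node ring
    with two virtual channels per link. The channel leaving node i is
    numbered i on VC 'lo' and k+i on VC 'hi'; the packet switches lo->hi
    after using the wrap-around channel k-1 -> 0 (the dateline)."""
    chans, vc, node = [], "lo", src
    while node != dst:
        chans.append(node if vc == "lo" else k + node)
        if node == k - 1:
            vc = "hi"  # crossed the dateline: move to the hi channel class
        node = (node + 1) % k
    return chans
```

Every route now acquires strictly increasing channel numbers, so by the monotonic-numbering argument from the proof-technique slide the channel dependence graph is acyclic and the ring is deadlock-free, with no extra links, only extra buffer resources.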