CS252 Graduate Computer Architecture
Lecture 21: Multiprocessor Networks (con't)
John Kubiatowicz
Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252
4/15/2009 cs252-S09, Lecture 21

Review: On Chip: Embeddings in Two Dimensions
[Figure: a 6 x 3 x 2 logical topology embedded in two physical dimensions]
• Embed multiple logical dimensions in one physical dimension using long wires
• When embedding a higher dimension in a lower one, either some wires are longer than others, or all wires are long

Review: Store & Forward vs Cut-Through Routing
[Figure: timelines of a packet crossing h hops from Source to Dest under store-and-forward and cut-through routing]
• Time (store & forward): h(n/b + Δ), or h(n/w + D) cycles
• Time (cut-through): n/b + hΔ, or n/w + hD cycles
• What if the message is fragmented?
• Wormhole vs virtual cut-through

Contention
• Two packets trying to use the same link at the same time
– limited buffering
– drop?
• Most parallel machine networks block in place
– link-level flow control
– tree saturation
• Closed system: offered load depends on delivered load
– source squelching

Bandwidth
• What affects local bandwidth?
– packet density: b x n_data/n
– routing delay: b x n_data/(n + wΔ)
– contention
» at the endpoints
» within the network
• Aggregate bandwidth
– bisection bandwidth
» sum of the bandwidths of the smallest set of links that partition the network
– total bandwidth of all the channels: Cb
– suppose N hosts each issue a packet every M cycles with average distance h
» each message occupies h channels for l = n/w cycles each
» C/N channels available per node
» link utilization for store-and-forward: ρ = (hl/M channel cycles per node)/(C/N) = Nhl/MC < 1!
» link utilization for wormhole routing?
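The store-and-forward vs cut-through formulas above can be sketched directly. This is an illustrative model of the slides' cycle-count expressions, not code from the lecture; the example message parameters are made up.

```python
# Message latency in cycles for an h-hop route: message of n bits over
# channels of width w bits/cycle, with a per-hop routing delay of D cycles.

def store_and_forward_cycles(n, w, h, D):
    """Each hop receives the whole message before forwarding: h * (n/w + D)."""
    return h * (n / w + D)

def cut_through_cycles(n, w, h, D):
    """Only the header pays the per-hop delay; the body pipelines behind it:
    n/w + h*D."""
    return n / w + h * D

# Example: 1000-bit message, 16-bit channels, 5 hops, 2-cycle routing delay
sf = store_and_forward_cycles(1000, 16, 5, 2)  # 5 * (62.5 + 2) = 322.5
ct = cut_through_cycles(1000, 16, 5, 2)        # 62.5 + 10 = 72.5
```

Cut-through pays the serialization term n/w only once, which is why it dominates for multi-hop routes with long messages.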
Saturation
[Figure: latency vs delivered bandwidth, and delivered bandwidth vs offered bandwidth; both curves turn sharply at the saturation point]

How Many Dimensions?
• n = 2 or n = 3
– short wires, easy to build
– many hops, low bisection bandwidth
– requires traffic locality
• n >= 4
– harder to build, more wires, longer average wire length
– fewer hops, better bisection bandwidth
– can handle non-local traffic
• k-ary d-cubes provide a consistent framework for comparison
– N = k^d
– scale dimension (d) or nodes per dimension (k)
– assume cut-through

Traditional Scaling: Latency Scaling with N
[Figure: average latency T(n=40) and T(n=140) vs machine size N for d = 2, 3, 4 and k = 2, with the n/w term shown]
• Assumes equal channel width
– independent of node count or dimension
– dominated by average distance

Average Distance
[Figure: average distance vs dimension for N = 256, 1024, 16384, 1048576]
• ave dist = d(k-1)/2
• But equal channel width is not equal cost!
• Higher dimension => more channels

In the 3D World
• For n nodes, bisection area is O(n^(2/3))
• For large n, bisection bandwidth is limited to O(n^(2/3))
– Bill Dally, IEEE TPDS, [Dal90a]
– For fixed bisection bandwidth, low-dimensional k-ary n-cubes are better (otherwise higher is better)
– i.e., a few short fat wires are better than many long thin wires
– What about many long fat wires?
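The average-distance formula above makes the dimension trade-off concrete. A minimal sketch (the example node count is mine):

```python
# Average routing distance in a k-ary d-cube with N = k**d nodes,
# per the slides: ave dist = d * (k - 1) / 2.

def ave_distance(k, d):
    return d * (k - 1) / 2

# Same node count N = 4096, different dimension:
assert 64 ** 2 == 2 ** 12 == 4096
low_dim = ave_distance(64, 2)   # 2D mesh/torus, k = 64 -> 63.0 hops
high_dim = ave_distance(2, 12)  # binary hypercube, k = 2 -> 6.0 hops
```

This is the "many hops" penalty of low dimension; the next slides show it is only half the story, because equal channel width is not equal cost.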
Dally Paper (con't)
• Equal bisection: W = 1 for the hypercube, W = k/2 otherwise
• Three wire models:
– constant delay, independent of length
– logarithmic delay with length (exponential driver tree)
– linear delay (speed of light / optimal repeaters)
[Figure: latency vs dimension under the logarithmic-delay and linear-delay wire models]

Equal Cost in k-ary n-cubes
• Equal number of nodes? Equal number of pins/wires? Equal bisection bandwidth? Equal area? Equal wire length?
• What do we know?
– switch degree: d; diameter = d(k-1)
– total links = Nd
– pins per node = 2wd
– bisection = k^(d-1) = N/k links in each direction
– 2Nw/k wires cross the middle

Latency for Equal Width Channels
[Figure: average latency (n = 40, D = 2) vs dimension for N = 256, 1024, 16384, 1048576]
• total links(N) = Nd

Latency with Equal Pin Count
[Figure: average latency T(n = 40B) and T(n = 140B) vs dimension (d) for 256, 1024, 16K, and 1M nodes]
• Baseline d = 2 has w = 32 (128 wires per node)
• Fix 2dw pins => w(d) = 64/d
• Distance goes up with d, but channel time goes down

Latency with Equal Bisection Width
• An N-node hypercube has N bisection links
• A 2-d torus has 2N^(1/2)
• Fixed bisection => w(d) = N^(1/d)/2 = k/2
[Figure: average latency T(n = 40) vs dimension (d) for 256 and 1024 nodes]
• 1M nodes, d = 2 has w = 512!
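The equal-pin-count comparison can be sketched by combining the slides' pieces: fixing 2dw pins gives w(d) = pin_budget/(2d), average distance is d(k-1)/2, and cut-through latency is hD + n/w. This is my hedged reconstruction of the curves' underlying model, with a 128-pin budget matching the d = 2, w = 32 baseline:

```python
# Latency under a fixed pin budget: wider channels at low dimension,
# fewer hops at high dimension.

def latency_equal_pins(n_bits, N, d, D=2, pin_budget=128):
    k = round(N ** (1 / d))        # nodes per dimension, N = k**d
    w = pin_budget // (2 * d)      # channel width from 2*d*w = pin_budget
    h = d * (k - 1) / 2            # average distance
    return h * D + n_bits / w      # cut-through: routing + serialization

# 1M nodes, 40-byte (320-bit) messages:
t2 = latency_equal_pins(40 * 8, 2 ** 20, d=2)    # 1023 hops dominate
t10 = latency_equal_pins(40 * 8, 2 ** 20, d=10)  # narrow 6-bit channels
```

With a small routing delay D the high-dimension point wins here, which is consistent with the slides' later caveat that Dally's conclusions depend on that assumption.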
Larger Routing Delay (w/ Equal Pin Count)
[Figure: average latency T(n = 140B) vs dimension (d) for 256, 1024, 16K, and 1M nodes]
• Dally's conclusions are strongly influenced by the assumption of small routing delay
– here, routing delay D = 20

Latency under Contention
[Figure: latency vs channel utilization for n = 4, 8, 16, 40 on d = 2 (k = 32) and d = 3 (k = 10) networks]
• Optimal packet size? Channel utilization?

Saturation
[Figure: latency vs average channel utilization for n/w = 4, 8, 16, 40]
• Fatter links shorten queueing delays

Phits per Cycle
[Figure: latency vs flits per cycle per processor for (n = 8, d = 3, k = 10) and (n = 8, d = 2, k = 32)]
• A higher-degree network has larger available bandwidth
– cost?

Discussion
• Rich set of topological alternatives with deep relationships
• Design point depends heavily on the cost model
– nodes, pins, area, ...
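The saturation knee in the contention plots can be illustrated with a simple queueing-style model. This is my illustrative sketch, not a formula from the slides: unloaded latency plus an M/M/1-style waiting term that blows up as channel utilization approaches 1.

```python
# Loaded latency as a function of channel utilization rho (0 <= rho < 1):
# T = T0 * (1 + rho / (1 - rho)) = T0 / (1 - rho).

def loaded_latency(t0, rho):
    assert 0 <= rho < 1, "utilization must stay below saturation"
    return t0 * (1 + rho / (1 - rho))

lat_light = loaded_latency(40, 0.2)   # mild queueing
lat_heavy = loaded_latency(40, 0.75)  # near the saturation knee
```

Fatter links reduce the per-packet service time l = n/w, lowering rho for the same offered load, which is why the slides note that fatter links shorten queueing delays.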
– Wire length or wire delay metrics favor small dimension
– Long (pipelined) links increase the optimal dimension
• Need a consistent framework and analysis to separate opinion from design
• Optimal point changes with technology

Another Idea: Express Cubes
• Problem: low-dimensional networks have high k
– consequence: may have to travel many hops in a single dimension
– routing latency can dominate long-distance traffic patterns
• Solution: provide one or more "express" links
– like express trains, express elevators, etc.
» delay linear with distance, lower constant
» closer to "speed of light" in the medium
» lower power, since no router cost
– "Express Cubes: Improving performance of k-ary n-cube interconnection networks," Bill Dally, 1991
• Another idea: route with pass transistors through links

The Routing Problem: Local Decisions
[Figure: a switch with input ports (receiver, input buffer), a crossbar, output ports (output buffer, transmitter), and control for routing and scheduling]
• Routing at each hop: pick the next output port!

How Do You Build a Crossbar?
[Figure: a 4x4 crossbar as a grid of crosspoints connecting inputs I0-I3 to outputs O0-O3, and a RAM-based implementation that time-multiplexes the inputs through a phase-addressed RAM (addr, Din, Dout)]

Input Buffered Switch
[Figure: input ports with per-input routing logic R0-R3 feeding a crossbar, with scheduling at the outputs]
• Independent routing logic per input
– FSM
• Scheduler logic arbitrates each output
– priority, FIFO, random
• Head-of-line blocking problem

Output Buffered Switch
[Figure: input ports fan out through R0-R3 to per-output buffers, with shared control]
• How would you build a shared pool?

Output Scheduling
[Figure: input buffers R0-R3 feeding a crossbar, with independent schedulers for output ports O0-O2]
• n independent arbitration problems?
– static priority, random, round-robin
• Simplifications due to the routing algorithm?
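The head-of-line blocking problem mentioned above is easy to demonstrate. This is a toy sketch of my own (fixed-priority arbitration, single-cycle crossbar), not the lecture's design: each input is a FIFO, and only the packet at the head can bid for its output, so a blocked head stalls everything behind it.

```python
from collections import deque

def simulate(input_queues, cycles):
    """input_queues: list of deques of destination-output ids.
    Returns the number of packets delivered in each cycle."""
    delivered = []
    for _ in range(cycles):
        grants = {}  # output -> winning input (fixed priority: lowest input)
        for i, q in enumerate(input_queues):
            if q and q[0] not in grants:
                grants[q[0]] = i
        for out, i in grants.items():
            input_queues[i].popleft()
        delivered.append(len(grants))
    return delivered

# Both heads want output 0; input 1's packet for output 1 sits blocked
# behind its head even though output 1 is idle in cycle 1.
qs = [deque([0, 1]), deque([0, 1])]
result = simulate(qs, 4)  # [1, 2, 1, 0]
```

An output-buffered switch avoids this particular stall, at the cost of buffers (or speedup) at every output, which is the trade-off the next slides explore.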
• The general case is max bipartite matching

Switch Components
• Output ports
– transmitter (typically drives clock and data)
• Input ports
– synchronizer aligns the data signal with the local clock domain
– essentially a FIFO buffer
• Crossbar
– connects each input to any output
– degree limited by area or pinout
• Buffering
• Control logic
– complexity depends on the routing logic and scheduling algorithm
– determine the output port for each incoming packet
– arbitrate among inputs directed at the same output

Properties of Routing Algorithms
• Routing algorithm:
– R: N x N -> C, which at each switch maps the destination node n_d to the next channel on the route
– which of the possible paths are used as routes?
– how is the next hop determined?
» arithmetic
» source-based port select
» table driven
» general computation
• Deterministic
– route determined by (source, dest), not intermediate state (i.e., traffic)
• Adaptive
– route influenced by traffic along the way
• Minimal
– only selects shortest paths
• Deadlock free
– no traffic pattern can lead to a situation where packets are deadlocked and never move forward

Routing Mechanism
• Need to select an output port for each input packet
– in a few cycles
• Simple arithmetic in regular topologies
– ex: Δx, Δy routing in a grid
» west (-x): Δx < 0
» east (+x): Δx > 0
» south (-y): Δx = 0, Δy < 0
» north (+y): Δx = 0, Δy > 0
» processor: Δx = 0, Δy = 0
• Reduce the relative address of each dimension in order
– dimension-order routing in k-ary d-cubes
– e-cube routing in n-cubes

Deadlock Freedom
• How can deadlock arise?
– necessary conditions:
» shared resource
» incrementally allocated
» non-preemptible
– think of a channel as a shared resource that is acquired incrementally
» source buffer, then destination buffer
» channels along a route
• How do you avoid it?
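The Δx, Δy port-selection table above translates almost line for line into code. A minimal sketch of the slides' rule, with made-up coordinates in the example:

```python
# Dimension-order (x then y) port selection in a 2D grid.

def next_port(cur, dest):
    """Pick the output port for a packet at cur headed to dest."""
    dx = dest[0] - cur[0]
    dy = dest[1] - cur[1]
    if dx < 0:
        return "west"        # Dx < 0
    if dx > 0:
        return "east"        # Dx > 0
    if dy < 0:
        return "south"       # Dx = 0, Dy < 0
    if dy > 0:
        return "north"       # Dx = 0, Dy > 0
    return "processor"       # Dx = 0, Dy = 0: deliver locally

assert next_port((2, 2), (0, 5)) == "west"       # reduce x first
assert next_port((0, 2), (0, 5)) == "north"      # then y
assert next_port((0, 5), (0, 5)) == "processor"  # arrived
```

Because x is fully reduced before y, a route makes at most one turn, which is exactly what the deadlock-freedom argument later in the lecture relies on.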
– constrain how channel resources are allocated
– ex: dimension order
• How do you prove that a routing algorithm is deadlock free?
– Show that the channel dependency graph has no cycles!

Consider Trees
• Why is the obvious routing on X deadlock free?
– butterfly?
– tree?
– fat tree?
• Any assumptions about the routing mechanism? The amount of buffering?

Up*-Down* Routing for General Topology
• Given any bidirectional network
• Construct a spanning tree
• Number the nodes, increasing from leaves to root
• "Up" edges increase node numbers
• Route any source -> dest by an UP*-DOWN* route
– up edges, a single turn, down edges
– proof of deadlock freedom?
• Performance?
– some numberings and routes are much better than others
– interacts with the topology in strange ways

Turn Restrictions in X,Y
[Figure: the eight possible turns among the +Y, -X, +X, -Y directions]
• XY routing forbids 4 of the 8 turns and leaves no room for adaptive routing
• Can you allow more turns and still be deadlock free?
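The up*-down* rule can be stated as a legality check on a route. This sketch assumes the slides' numbering (node numbers increase from leaves to root, and an "up" edge moves to a higher number); the example routes are hypothetical.

```python
# A legal up*-down* route is zero or more up edges, at most one turn,
# then zero or more down edges.

def is_updown_legal(route_numbers):
    """route_numbers: spanning-tree numbers of the nodes along a route."""
    turned = False  # have we switched from going up to going down?
    for a, b in zip(route_numbers, route_numbers[1:]):
        if b > a:              # up edge
            if turned:
                return False   # up after down: a second turn, illegal
        else:                  # down edge
            turned = True
    return True

assert is_updown_legal([1, 4, 7, 5, 2])   # up*, one turn, down*
assert not is_updown_legal([1, 4, 2, 6])  # goes up again after a down edge
```

Forbidding the down-to-up turn orders the channels (all up channels before all down channels along any route), which is the core of the deadlock-freedom proof the slide asks about.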
Minimal Turn Restrictions in 2D
[Figure: the turns among +x, +y, -x, -y allowed by west-first, north-last, and negative-first routing]

Example Legal West-First Routes
[Figure: example west-first routes through a 2D mesh]
• Can route around failures or congestion
• Can combine turn restrictions with virtual channels

General Proof Technique
• Resources are logically associated with channels
• Messages introduce dependences between resources as they move forward
• Need to articulate the possible dependences that can arise between channels
• Show that there are no cycles in the channel dependence graph
– find a numbering of channel resources such that every legal route follows a monotonic sequence => no traffic pattern can lead to deadlock
• The network need not be acyclic, just the channel dependence graph

Example: k-ary 2D Array
• Thm: dimension-ordered (x,y) routing is deadlock free
• Numbering:
– the +x channel (i,y) -> (i+1,y) gets number i
– similarly for -x, with 0 as the most positive edge
– the +y channel (x,j) -> (x,j+1) gets N+j
– similarly for the -y channels
• Any routing sequence (x direction, turn, y direction) is increasing
• Generalization:
– "e-cube routing" on 3-D: X then Y then Z
[Figure: a 4x4 array, nodes 00-33, with channel numbers assigned as above]

Channel Dependence Graph
[Figure: the channel dependence graph for the 4x4 array; every edge goes from a lower-numbered channel to a higher-numbered one, so there are no cycles]

More Examples:
• What about wormhole routing on a ring?
[Figure: an 8-node unidirectional ring, nodes 0-7]
• Or: a unidirectional torus of higher dimension?

Deadlock-Free Wormhole Networks?
• Basic dimension-order routing techniques don't work for unidirectional k-ary d-cubes
– only for k-ary d-arrays (bidirectional)
– and dimension-ordered routing is not adaptive!
• Idea: add channels!
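The monotonic-numbering argument for the k-ary 2D array can be checked mechanically. A sketch under the slides' numbering, restricted to +x/+y routes for brevity (the -x/-y cases are symmetric):

```python
# Channel numbering on a k-ary 2D array: +x channel (i,y)->(i+1,y) gets i,
# +y channel (x,j)->(x,j+1) gets N+j. Dimension-order routes must then
# visit strictly increasing channel numbers, so the CDG has no cycles.

K = 4
N = K * K

def channel_number(src, dst):
    (x0, y0), (x1, y1) = src, dst
    if (x1, y1) == (x0 + 1, y0):
        return x0          # +x channel
    if (x1, y1) == (x0, y0 + 1):
        return N + y0      # +y channel
    raise ValueError("not a +x/+y channel")

def xy_route(src, dst):
    """Dimension-order route: reduce x fully, then y."""
    x, y = src
    hops = []
    while x < dst[0]:
        hops.append(((x, y), (x + 1, y))); x += 1
    while y < dst[1]:
        hops.append(((x, y), (x, y + 1))); y += 1
    return hops

nums = [channel_number(a, b) for a, b in xy_route((0, 0), (3, 3))]
assert nums == sorted(nums)  # monotonic: [0, 1, 2, 16, 17, 18]
```

Every x channel number is below N and every y channel number is at least N, so the single x-to-y turn can never create a descending step, which is the slide's increasing-sequence claim.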
– provide multiple "virtual channels" to break the dependence cycle
– good for bandwidth too!
[Figure: a switch with multiple virtual-channel buffers per input port feeding the crossbar]
– no need to add links or crossbar ports, only buffer resources
• This adds nodes to the CDG; remove edges?

When Are Virtual Channels Allocated?
[Figure: a hardware-efficient design for the crossbar, with virtual-channel buffers at the input ports]
• Two separate processes:
– virtual channel allocation
– switch/connection allocation
• Virtual channel allocation
– choose a route and a free output virtual channel
• Switch allocation
– for each incoming virtual channel, must negotiate the switch on the outgoing pin
• In the ideal case (not highly loaded), would like to optimistically allocate a virtual channel

Breaking Deadlock with Virtual Channels
[Figure: a packet switches from the "lo" to the "hi" channel as it crosses the dateline]

Paper Discussion: Linder and Harden
• Paper: "An Adaptive and Fault Tolerant Wormhole Routing Strategy for k-ary n-cubes"
– Daniel Linder and Jim Harden
• General virtual-channel scheme for k-ary n-cubes
– with wrap-around paths
• Properties of the result for a unidirectional k-ary n-cube:
– 1 virtual interconnection network
– n+1 levels
• Properties of the result for a bidirectional k-ary n-cube:
– 2^(n-1) virtual interconnection networks
– n+1 levels per network

Example: Unidirectional 4-ary 2-cube
[Figure: the physical network next to its virtual network]
• Wrap-around channels are necessary but can cause deadlock
• Use VCs to avoid deadlock
– 1 level for each wrap-around

Bidirectional 4-ary 2-cube: 2 Virtual Networks
[Figure: virtual network 1 and virtual network 2]
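The lo-to-hi channel switch in the "Breaking deadlock" slide can be sketched for the unidirectional ring from the earlier example. This is an illustrative sketch (channel-id scheme and dateline placement are my assumptions): two virtual channels per link, every packet starts on "lo", and it moves to "hi" when it crosses the dateline between node K-1 and node 0.

```python
# Dateline virtual channels on a K-node unidirectional ring. Numbering all
# lo channels below all hi channels makes every route's channel sequence
# increasing, so the channel dependence graph has no cycle.

K = 8  # ring size

def channel_id(node, vc):
    """lo channel out of node i gets id i; hi channel gets id K + i."""
    return node + (K if vc == "hi" else 0)

def route_channels(src, dst):
    chans, node, vc = [], src, "lo"
    while node != dst:
        nxt = (node + 1) % K
        chans.append(channel_id(node, vc))
        if nxt == 0:       # crossed the dateline: switch to the hi channel
            vc = "hi"
        node = nxt
    return chans

seq = route_channels(6, 2)  # wraps around the dateline: [6, 7, 8, 9]
assert seq == sorted(seq)   # monotonic channel ids => deadlock free
```

Only buffer resources are duplicated, matching the slide's point that no extra links or crossbar ports are needed; the CDG gains nodes (one per virtual channel) while the wrap-around dependence edge disappears.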