Asynchronous Interconnection Network and Communication
Chapter 3 of Casanova, et al.

Interconnection Network Topologies
• The processors in a distributed-memory parallel system are connected using an interconnection network.
• All computers have specialized coprocessors that route messages and place data in local memories
– Nodes consist of a (computing) processor, a memory, and a communications coprocessor
– Nodes are often called processors, when not ambiguous

Network Topology Types
• Static topologies
– A fixed network that cannot be changed
– Nodes connected directly to each other by point-to-point communication links
• Dynamic topologies
– Topology can change at runtime
– One or more nodes can request that direct communication be established between them
• Done using switches

Some Static Topologies
• Fully connected network (or clique)
• Ring
• Two-dimensional grid
• Torus
• Hypercube
• Fat tree

Examples of Interconnection Topologies

Static Topology Features
• Fixed number of nodes
• Degree:
– Number of edges incident to a node
• Distance between nodes:
– Length of a shortest path between two nodes
• Diameter:
– Largest distance between two nodes
• Number of links:
– Total number of edges
• Bisection width:
– Minimum number of edges that must be removed to partition the network into two disconnected networks of the same size

Classical Interconnection Network Features
• Clique (or fully connected)
– All processors are connected
– p(p-1)/2 edges
• Ring
– Very simple and very useful topology
• 2D grid
– Degree of interior processors is 4
– Not symmetric, as edge processors have different properties
– Very useful when computations are local and communications are between neighbors
– Has been heavily used previously

Classical Networks (cont.)
• 2D torus
– Easily formed from a 2D mesh by connecting matching end points
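The features above can be collected for a few of the classical static topologies. The sketch below uses the standard textbook formulas (p nodes for ring and clique, dimension n for the hypercube); the function names are illustrative, not from the text.

```python
# Standard (degree, diameter, links, bisection width) formulas for a few
# classical static topologies. p = number of nodes, n = hypercube dimension.

def ring_features(p):
    return {"degree": 2, "diameter": p // 2, "links": p, "bisection": 2}

def clique_features(p):
    return {"degree": p - 1, "diameter": 1,
            "links": p * (p - 1) // 2,
            # cutting the clique into halves removes floor(p/2)*ceil(p/2) edges
            "bisection": (p // 2) * ((p + 1) // 2)}

def hypercube_features(n):
    return {"degree": n, "diameter": n,
            "links": n * 2 ** (n - 1), "bisection": 2 ** (n - 1)}
```

Note, for example, that the clique has p(p-1)/2 links, exactly as the slide states.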
• Hypercube
– Has been extensively used
– Using its recursive definition, one can design simple but very efficient algorithms
– Has a small diameter that is logarithmic in the number of nodes
– Degree and total number of edges grow too quickly to be useful for massively parallel machines

Dynamic Topologies
• The fat tree differs from the other networks included
– The compute nodes are only at the leaves
– Nodes at higher levels do not perform computation
– Topology is a binary tree, both in 2D front view and in side view
– Provides extra bandwidth near the root
– Used by Thinking Machines Corp. on the CM-5
• Crossbar switch
– Has p² switches, which is very expensive for large p
– Can connect n processors to any one-to-one combination of n processors
– Cost rises with the number of switches, which is quadratic in the number of processors

Dynamic Topologies (cont.)
• Benes and Omega networks
– Use smaller crossbars arranged in stages
– Only crossbars in adjacent stages are connected together
– Called multi-stage networks; cheaper to build than a full crossbar
– Configuring multi-stage networks is more difficult than a crossbar
– Dynamic networks are now the most commonly used topologies

A Simple Communication Performance Model
• Assume a processor Pi sends a message of length m to Pj
– The cost to transfer the message along a network link is roughly linear in the message length
– Consequently, the cost to transfer the message along a particular route is roughly linear in m
• Let ci,j(m) denote the time to transfer this message

Hockney Performance Model for Communications
• The time ci,j(m) to transfer this message can be modeled by
ci,j(m) = Li,j + m/Bi,j = Li,j + m·bi,j
– m is the size of the message
– Li,j is the startup time, also called latency
– Bi,j is the bandwidth, in bytes per second
– bi,j is 1/Bi,j, the inverse of the bandwidth
• Proposed by Hockney in 1994 to evaluate the performance of the Intel Paragon
• Probably the most commonly used model

Hockney Performance Model (cont.)
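The Hockney model above is simple enough to state directly in code; a minimal sketch (the example parameter values are made up for illustration):

```python
def hockney_time(m, L, B):
    """Hockney model: predicted time to transfer an m-byte message
    over a link with latency L (seconds) and bandwidth B (bytes/s)."""
    return L + m / B

# Example: a 1 MB message over a link with 10 us latency and 1 GB/s bandwidth
t = hockney_time(10**6, 1e-5, 1e9)   # 0.00101 s: latency plus transfer time
```

For short messages the latency term L dominates; for long messages the m/B term dominates, which is why the model captures both regimes with only two parameters.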
• Factors that Li,j and Bi,j depend on
– Length of the route
– Communication protocol used
– Communication software overhead
– Ability to use links in parallel
– Whether links are half or full duplex
– Etc.

Store-and-Forward Protocol
• SF is a point-to-point protocol
• Each intermediate node receives and stores the entire message before retransmitting it
• Implemented in the earliest parallel machines, in which nodes did not have communications coprocessors
• Intermediate nodes are interrupted to handle messages and route them toward their destination

Store-and-Forward Protocol (cont.)
• If d(i,j) is the number of links between Pi and Pj, the formula for ci,j(m) can be rewritten as
ci,j(m) = d(i,j)(L + m/B) = d(i,j)L + d(i,j)·m·b
where L is the latency and b is the reciprocal of the bandwidth for one link
• This protocol produces poor latency and bandwidth
• The communication cost can be reduced using pipelining

Store-and-Forward Protocol Using Pipelining
• The message is split into r packets of size m/r
• The packets are sent one after another from Pi to Pj
• The first packet reaches Pj after ci,j(m/r) time units
• The remaining r−1 packets arrive in (r−1)(L + mb/r) time units
• Simplifying, the total communication time reduces to [d(i,j) − 1 + r](L + mb/r)
• Casanova et al. find the optimal value for r above

Two Cut-Through Protocols
• Common performance model:
ci,j(m) = L + d(i,j)·δ + m/B
where
– L is the one-time cost of creating a message
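Setting the derivative of [d−1+r](L + mb/r) with respect to r to zero gives r = sqrt((d−1)·m·b/L); a small sketch of that computation (parameter values are illustrative):

```python
import math

def sf_pipelined_time(d, L, m, b, r):
    """Store-and-forward time over d links with the message split
    into r packets: (d - 1 + r)(L + m*b/r)."""
    return (d - 1 + r) * (L + m * b / r)

def optimal_r(d, L, m, b):
    """r that zeroes the derivative of sf_pipelined_time with respect
    to r: L - (d-1)*m*b/r**2 = 0, so r = sqrt((d-1)*m*b/L)."""
    return math.sqrt((d - 1) * m * b / L)

# Example: 4 links, 100 us latency, 1 MB message, 1 GB/s links (b = 1e-9 s/byte)
r = optimal_r(4, 1e-4, 10**6, 1e-9)
t = sf_pipelined_time(4, 1e-4, 10**6, 1e-9, r)
```

With r = 1 this reduces to the plain store-and-forward time d(L + mb), so pipelining can only help.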
– δ is the routing management overhead
– Generally δ << L, as routing management is performed by hardware while L involves software overhead
– m/B is the time required to transmit the message through the entire route

Circuit-Switching Protocol
• The first cut-through protocol
• A route is created before the first message is sent
• The message is sent directly to the destination through this route
– The nodes used in this transmission cannot be used for any other communication while it is in progress

Wormhole (WH) Protocol
• A second cut-through protocol
• The destination address is stored in the header of the message
• Routing is performed dynamically at each node
• The message is split into small packets called flits
• If two flits arrive at the same time, flits are stored in intermediate nodes' internal registers

Point-to-Point Communication Comparisons
• Store-and-forward is not used in physical networks but only at the application level
• Cut-through protocols are more efficient
– Hide the distance between nodes
– Avoid large buffer requirements for intermediate nodes
– Almost no message loss
– For small networks, a flow-control mechanism is not needed
• Wormhole is generally preferred to circuit switching
– Latency is normally much lower

LogP Model
• Models based on the LogP model are more precise than the Hockney model
• Involves three components of communication: the sender, the network, and the receiver
– At times, some of these components may be busy while others are not
• Some parameters for LogP
– m is the message size (in bytes)
– w is the size of the packets the message is split into
– L is an upper bound on the latency
– o is the overhead
• Defined to be the time that a node is engaged in the transmission or reception of a packet

LogP Model (cont.)
• Parameters for LogP (cont.)
– g, or gap, is the minimal time interval between consecutive packet transmissions or receptions
• During this time, a node may not use the communication coprocessor (i.e., network card)
– 1/g is the communication bandwidth available per node
– P is the number of nodes in the platform
• Cost of sending m bytes with packet size w:
c(m) = 2o + L + (⌈m/w⌉ − 1)·g
• Processor occupation time on the sender and on the receiver:
o + (⌈m/w⌉ − 1)·g

Other LogP-Related Models
• LogP attempts to capture in a few parameters the characteristics of parallel platforms
• Platforms are fine-tuned and may use different protocols for short and long messages
• LogGP is an extension of LogP in which G captures the bandwidth for long messages
• pLogP is an extension of LogP in which L, o, and g depend on the message size m
– It also separates the sender overhead os and the receiver overhead or

Affine Models
• The use of the floor functions in LogP models makes them nonlinear
– This causes many problems in analytic and theoretical studies
– It has resulted in the proposal of many fully linear models
• The time that Pi is busy sending a message is expressed as an affine function of the message size
– An affine function of m has the form f(m) = a·m + b, where a and b are constants; if b = 0, then f is a linear function
• Similarly, the time Pj is busy receiving the message is expressed as an affine function of the message size
• We will postpone further coverage of affine models for the present

Modeling Concurrent Communications
• Multi-port model
– Assumes that communications are contention-free and do not interfere with each other
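The LogP cost formula above can be sketched directly; the example parameter values below are made up for illustration:

```python
import math

def logp_cost(m, w, L, o, g):
    """LogP estimate for sending an m-byte message in w-byte packets:
    2o + L + (ceil(m/w) - 1) * g.
    First packet pays o (send) + L (network) + o (receive); each of the
    remaining ceil(m/w) - 1 packets adds one gap g."""
    return 2 * o + L + (math.ceil(m / w) - 1) * g

def logp_occupation(m, w, o, g):
    """Time the sender (or receiver) itself is occupied:
    o + (ceil(m/w) - 1) * g."""
    return o + (math.ceil(m / w) - 1) * g
```

Note the difference between the two: the occupation time omits L, since during the network latency the node is free to compute.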
– A consequence is that a node may communicate with an unlimited number of nodes without any degradation in performance
– It would require a clique interconnection network to fully support
– It may simplify proofs that certain problems are hard
• If a problem is hard under ideal communication conditions, then it is hard in general
– The assumption is not realistic: communication resources are always limited
– See the Casanova text for additional information

Concurrent Communication Models (2/5)
• Bounded multi-port model
– Proposed by Hong and Prasanna
– For applications that use threads (e.g., on multicore technology), the network link can be shared by several incoming and outgoing communications
– The sum of the bandwidths allocated by the operating system to all communications cannot exceed the bandwidth of the network card
– An unbounded number of communications can take place if they share the total available bandwidth
– The model defines the bandwidth allotted to each communication
– Bandwidth sharing by the application is unusual, as it is usually handled by the operating system

Concurrent Communication Models (3/5)
• 1-port (unidirectional or half-duplex) model
– Avoids unrealistically optimistic assumptions
– Forbids concurrent communication at a node
– A node can either send data or receive it, but not simultaneously
– This model is very pessimistic, as real-world platforms can achieve some concurrent communication
– The model is simple, and it is easy to design algorithms that follow it

Concurrent Communication Models (4/5)
• 1-port (bidirectional or full-duplex) model
– Currently, most network cards are full-duplex
– Allows a single emission and a single reception simultaneously
– Introduced by Blat et al.
• Current hardware does not easily enable multiple messages to be transmitted simultaneously
• Multiple sends and receives are claimed to be eventually serialized by the single hardware port to the network
• Saif and Parashar did experimental work suggesting that asynchronous sends become serialized when message sizes exceed a few megabytes

Concurrent Communication Models (5/5)
• k-ports model
– A node may have k > 1 network cards
– This model allows a node to be involved in at most one emission and one reception on each network card
– This model is used in Chapters 4 and 5

Bandwidth Sharing
• The previous concurrent communication models only consider contention on nodes
• Other parts of the network can also limit performance
• It may be useful to determine constraints on each network link
• This type of network model is useful for performance evaluation purposes, but too complicated for algorithm design purposes
• The Casanova text evaluates algorithms using two models:
– The Hockney model, or even simplified versions (e.g., assuming no latency)
– The multi-port model (ignoring contention) or the 1-port model

Case Study: Unidirectional Ring
• We first consider a platform of p processors arranged in a unidirectional ring
• Processors are denoted Pk for k = 0, 1, …, p−1
• Each PE can find its logical index by calling My_Num()

Unidirectional Ring Basics
• A processor can determine the number of PEs by calling NumProcs()
– Both preceding functions are supported in MPI, a library implemented on most asynchronous systems
• Each processor has its own memory
• All processors execute the same program, which acts on data in their local memories
– Single Program, Multiple Data, or SPMD
• Processors communicate by message passing: explicitly sending and receiving messages

Unidirectional Ring Basics (cont. – 2/5)
• A processor sends a message using the function send(addr, m)
– addr is the memory address (in the sending processor) of the first data item to be sent
– m is the message length (i.e., the number of items to be sent)
• A processor receives a message using the function receive(addr, m)
– addr is the local address in the receiving processor where the first data item is to be stored
– If processor Pi executes a receive, then its predecessor P(i−1) mod p must execute a send
– Since each processor has a unique predecessor and successor, they do not have to be specified

Unidirectional Ring Basics (cont. – 3/5)
• A restrictive assumption is that both the send and the receive are blocking
– The participating processors then cannot continue until the communication is complete
– The blocking assumption is typical of first-generation platforms
• A classical assumption is to keep the receive blocking but to allow the send to be non-blocking
– The processor executing a send can continue while the data transfer takes place
– To implement this, one function is used to initiate the send and another to determine when the communication has finished

Unidirectional Ring Basics (cont. – 4/5)
• In algorithms, we simply indicate the blocking and non-blocking operations
• A more recently proposed assumption is that a single processor can send data, receive data, and compute simultaneously
– All three can occur only if no race condition exists
– It is convenient to think of three logical threads of control running on each processor:
• One for computing
• One for sending data
• One for receiving data
• We will usually use the less restrictive third assumption

Unidirectional Ring Basics (cont. – 5/5)
• Timings for send/receive
– We use a simplified version of the Hockney model
– The time to send or receive over one link is c(m) = L + m·b
• m is the length of the message
• L is the startup cost in seconds, due to the physical latency and the software overhead
• b is the inverse of the data transfer rate

The Broadcast Operation
• The broadcast operation allows a processor Pk to send the same message of length m to all other processors
• At the beginning of the broadcast operation, the message is stored at address addr in the memory of the sending processor, Pk
• At the end of the broadcast, the message will be stored at address addr in the memory of all processors
• All processors must call the following function:
Broadcast(k, addr, m)

Broadcast Algorithm Overview
• The message goes around the ring, from Pk to Pk+1 to Pk+2 to … to Pk−1
• We assume processor numbers are modulo p, where p is the number of processors; for example, if k = 0 and p = 8, then k−1 = p−1 = 7
• Note there is no parallelism in this algorithm, since the message advances around the ring only one processor per round
• The predecessor of Pk (i.e., Pk−1) does not send the message on to Pk

Analysis of the Broadcast Algorithm
• For the algorithm to be correct, the "receive" in Step 10 must execute before Step 11
• Running time:
– Since we have a sequence of p−1 communications, the time to broadcast a message of length m is (p−1)(L + mb)
• MPI does not typically use a ring topology for its communication primitives
– It instead uses various tree topologies that are more efficient on modern parallel computer platforms
– However, these primitives are simpler on a ring
– This prepares readers to implement primitives themselves, when that is more efficient than using the MPI primitives

Scatter Algorithm
• The scatter operation allows Pk to send a different message of length m to each processor
• Initially, Pk holds the message of length m to be sent to Pq at location addr[q]
• To keep the array of addresses uniform, space for a message from Pk to itself is also provided
• At the end of the algorithm, each processor stores its message from Pk at location msg
• The efficient way to implement this algorithm is to pipeline the messages
– The message to the most distant processor (i.e., Pk−1) is sent first, followed by the message to processor Pk−2

Discussion of the Scatter Algorithm
• In Steps 5–6, Pk successively sends messages to the other p−1 processors in decreasing order of their distance from Pk
• In Step 7, Pk stores its message to itself
• The other processors concurrently move messages along as they arrive, in Steps 9–12
• Each processor uses two buffers, with addresses tempS and tempR
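The ring broadcast above can be checked with a small simulation; it confirms the p−1 communication steps used in the running-time analysis:

```python
def ring_broadcast_steps(p, k):
    """Simulate the ring broadcast from Pk: each round, the current
    holder forwards the message to its successor. Returns the number
    of communication steps until all p processors have the message."""
    received = {k}
    current, steps = k, 0
    while len(received) < p:
        current = (current + 1) % p   # successor receives and forwards next
        received.add(current)
        steps += 1
    return steps
```

With the simplified Hockney model each step costs L + mb, so the total time is steps × (L + mb) = (p−1)(L + mb), matching the analysis.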
– This allows a processor to send one message and to receive the next message in parallel in Step 12

Discussion of the Scatter Algorithm (cont.)
• In Step 11, tempS ↔ tempR means the two addresses are swapped, so the value just received can be sent to the next processor
• When a processor receives its own message from Pk, it stops forwarding (Step 10)
• Whatever is in the receive buffer tempR at the end is stored as its message from Pk (Step 13)
• The running time of the scatter algorithm is the same as for the broadcast, namely (p−1)(L + mb)

Example for the Scatter Algorithm
• Example: In Figure 3.7, let p = 6 and k = 4
• Steps 5–6: For i = 1 to p−1 do send(addr[(k+p−i) mod p], m)
– Let PE = (k+p−i) mod p = (10 − i) mod 6
– For i = 1, PE = 9 mod 6 = 3
– For i = 2, PE = 8 mod 6 = 2
– For i = 3, PE = 7 mod 6 = 1
– For i = 4, PE = 6 mod 6 = 0
– For i = 5, PE = 5 mod 6 = 5
• Note that messages are sent to processors in the order 3, 2, 1, 0, 5
– That is, messages to the most distant processors are sent first

Example for the Scatter Algorithm (cont.)
• Example: In Figure 3.7, let p = 6 and k = 4
• Step 10: For i = 1 to (k−1−q) mod p do
• Compute (k−1−q) mod p = (3−q) mod 6 for all q; note q ≠ k, which is 4
– q = 5: i = 1 to 4, since (3−5) mod 6 = 4
• PE 5 forwards values in the loop from i = 1 to 4
– q = 0: i = 1 to 3, since (3−0) mod 6 = 3
• PE 0 forwards values from i = 1 to 3
– q = 1: i = 1 to 2, since (3−1) mod 6 = 2
• PE 1 forwards values from i = 1 to 2
– q = 2: i = 1 to 1, since (3−2) mod 6 = 1
• PE 2 is active in the loop only when i = 1

Example for the Scatter Algorithm (cont.)
– q = 3: i = 1 to 0, since (3−3) mod 6 = 0
• PE 3 precedes PE k, so it never forwards a value
• However, it receives and stores a message at Step 9
• Note that in Step 9, all processors store the first message they receive
– That means even processor k−1 receives a value to store
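The index arithmetic in the example above is easy to get wrong by hand; a small sketch reproducing the two computations from the slides:

```python
def scatter_send_order(p, k):
    """Destinations of Pk's sends in Steps 5-6: addr[(k+p-i) mod p]
    for i = 1 .. p-1, most distant processor first."""
    return [(k + p - i) % p for i in range(1, p)]

def forward_count(p, k, q):
    """Number of loop iterations in Step 10 for processor Pq (q != k):
    (k - 1 - q) mod p."""
    return (k - 1 - q) % p
```

For p = 6 and k = 4 this reproduces the send order 3, 2, 1, 0, 5 and the forwarding counts 4, 3, 2, 1, 0 for q = 5, 0, 1, 2, 3.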
All-to-All Algorithm
• This operation allows all p processors to simultaneously broadcast a message (to all PEs)
• Again, it is assumed all messages have length m
• At the beginning, each processor holds the message it wishes to broadcast at address my_message
• At the end, each processor holds an array addr of p messages, where addr[k] holds the message from Pk
• Using pipelining, the running time is the same as for a single broadcast, namely (p−1)(L + mb)

Gossip Algorithm
• The last of the classical collection of communication operations
• Each processor sends a different message to each other processor
• The gossip algorithm is Problem 3.7 in the textbook
– Note it takes 1 step for each PE to send a message to its closest neighbor, using all links
– It takes 2 steps for each PE to send a message to its 2nd-closest neighbor, using all links
– In general, it takes 1 + 2 + … + (p−1) = p(p−1)/2 steps for each PE to send messages to all other nodes, using all links of the network during each step
– The complexity is therefore p(p−1)/2 = O(p²)

Pipelined Broadcast by the kth Processor
• Longer messages can be broadcast faster if they are broken into smaller pieces
• Suppose the message is broken into r pieces of the same length
• The sender sends the pieces out in order, and has them travel simultaneously on the ring
• Initially, the pieces are stored at addresses addr[0], addr[1], …, addr[r−1]
• At the end, all pieces are stored in all processors
• At each step, when a processor receives a message piece, it also forwards the piece it previously received, if any, to its successor

Pipelined Broadcast (cont.)
• There must be p−1 communication steps for the first piece to reach the last processor, Pk−1
• It then takes r−1 more steps for the rest of the pieces to reach Pk−1
• The required time is (p + r − 2)(L + mb/r)
• The value of r that minimizes this expression can be found by setting its derivative (with respect to r) to zero and solving for r
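Carrying out that differentiation: (p+r−2)(L + mb/r) expands to (p−2)L + rL + (p−2)mb/r + mb, whose derivative L − (p−2)mb/r² vanishes at r = sqrt((p−2)mb/L). A small sketch (parameter values are illustrative):

```python
import math

def pipelined_bcast_time(p, r, L, m, b):
    """Pipelined ring broadcast time with r pieces: (p + r - 2)(L + m*b/r)."""
    return (p + r - 2) * (L + m * b / r)

def optimal_pieces(p, L, m, b):
    """r that zeroes the derivative: r = sqrt((p - 2) * m * b / L)."""
    return math.sqrt((p - 2) * m * b / L)

# Example: 8 processors, 100 us latency, 1 MB message, 1 GB/s links
r = optimal_pieces(8, 1e-4, 10**6, 1e-9)
t = pipelined_bcast_time(8, r, 1e-4, 10**6, 1e-9)
```

For large m the optimal r grows like sqrt(m), and the time approaches mb, as the next slide notes.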
• For large m, the time required tends to mb
– It does not depend on p
• This compares well to the plain broadcast time: (p−1)(L + mb)

Hypercube
• Defn: A 0-cube consists of a single vertex. For n > 0, an n-cube consists of two identical (n−1)-cubes with edges added to join matching pairs of vertices in the two (n−1)-cubes.

Hypercubes (cont.)
• Equivalent defn: An n-cube is a graph consisting of 2^n vertices numbered 0 to 2^n − 1, such that two vertices are connected if and only if their binary representations differ in exactly one bit.
• Property: The diameter and degree of an n-cube are both equal to n.
– The proof is left for the reader; it is easy using recursion.
• Hamming distance: Let A and B be two points in an n-cube. H(A,B) is the number of bits that differ between the binary labels of A and B.
• Notation: If b is a binary bit, then let b̄ = 1 − b, the bit complement of b.

Hypercube Paths
• Using binary representation, let
– A = an-1 an-2 … a2 a1 a0
– B = bn-1 bn-2 … b2 b1 b0
• WLOG, assume that A and B have different bits exactly in their last k bits.
– Having the differing bits at the end makes numbering them easier.
• Then a path from A to B can be created by complementing the differing bits one at a time, right to left:
– A = an-1 an-2 … a2 a1 a0
– Vertex 1 = an-1 an-2 … a2 a1 ā0
– Vertex 2 = an-1 an-2 … a2 ā1 ā0
– …
– B = Vertex k = an-1 an-2 … ak āk-1 … ā2 ā1 ā0

Hypercube Paths (cont.)
• Independent of which bits of A and B differ, there are k choices for the first bit to flip, k−1 choices for the next bit to flip, etc.
– This gives k! different paths from A to B.
• How many independent paths exist from A to B?
– I.e., paths with only A and B as common vertices.
• Theorem: If A and B are n-cube vertices that differ in k bits, then there exist exactly k independent paths from A to B.
• Proof: First, we show k independent paths exist.
• We build an independent path for each j with 0 ≤ j < k.

Hypercube Paths (cont.)
• Let P(j, j−1, …, 0, k−1, k−2, …, j+1) denote the path from A to B that complements the differing bits in the order j, j−1, …, 0, k−1, k−2, …, j+1:
– A = an-1 … ak ak-1 … aj+1 aj aj-1 … a1 a0
– V(1) = an-1 … ak ak-1 … aj+1 āj aj-1 … a1 a0
– V(2) = an-1 … ak ak-1 … aj+1 āj āj-1 … a1 a0
– …
– V(j+1) = an-1 … ak ak-1 … aj+1 āj āj-1 … ā1 ā0
– V(j+2) = an-1 … ak āk-1 … aj+1 āj āj-1 … ā1 ā0
– …
– B = V(k) = an-1 … ak āk-1 … āj+1 āj āj-1 … ā1 ā0

Hypercube Paths (cont.)
• Suppose the following two paths have a common vertex X other than A and B:
P(j, j−1, …, 0, k−1, k−2, …, j+1)
P(t, t−1, …, 0, k−1, k−2, …, t+1)
– Since the paths are different, and A and B differ in k bits, we may assume 0 ≤ t < j < k.
• Let A and X differ in q bits.
• To travel from A to X along either path, exactly q bits have been flipped, in that path's circular order:
– 1st path: j, j−1, …, 0, k−1, k−2, …, j+1
– 2nd path: t, t−1, …, 0, k−1, k−2, …, t+1
• This is impossible, as the first q bits flipped along the two paths cannot be exactly the same set of bits.

Hypercube Paths (cont.)
• Finally, there cannot be another path Q from A to B independent of all of the above (i.e., with no other common vertex).
– If there were, the first node of Q after A would have flipped one bit, say bit q, to agree with B.
– But then the path described earlier that flips bit q first would have a common interior vertex with this path Q.

Hypercube Routing
• XOR is the exclusive OR: the output is 1 exactly when one input is 1 and the other is 0.
• To design a route from A to B in the n-cube, we use the algorithm that always flips the rightmost bit that disagrees with B.
• The XOR of the binary representations of A and B indicates by "1"s the bits that have to be flipped.
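The flip-the-rightmost-differing-bit rule can be sketched directly (this is the routing idea from the slides, written as a standalone helper):

```python
def ecube_route(a, b):
    """Route from node a to node b in a hypercube by repeatedly
    flipping the rightmost bit in which the current node differs
    from b. Returns the list of nodes visited, including a and b."""
    route = [a]
    cur = a
    while cur != b:
        diff = cur ^ b            # XOR marks the bits still to be flipped
        lowest = diff & -diff     # isolate the rightmost differing bit
        cur ^= lowest             # flip it: move along that link
        route.append(cur)
    return route
```

Since each step clears one bit of the XOR, the route length equals the Hamming distance H(A, B).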
• For the 5-cube, if A is 10111 and B is 01110, then A XOR B is 11001, and the routing is as follows:
A = 10111 → 10110 → 11110 → 01110 = B
• This algorithm can be executed as follows:
– A XOR B = 10111 XOR 01110 = 11001, so A routes the message along link 1 (the rightmost link) to node A1 = 10110
– A1 XOR B = 10110 XOR 01110 = 11000, so node A1 routes the message along link 4 to node A2 = 11110
– A2 XOR B = 11110 XOR 01110 = 10000, so A2 routes the message along link 5 to B

Hypercube Routing (cont.)
• This routing algorithm can be used to implement a wormhole or cut-through protocol in hardware.
• Problem: If another pair of processors has already reserved one link on the desired path, the message may stall until the end of the other communication.
• Solution: Since there are multiple paths, the routers select which links to use based on a link-reservation table and message labels.
– In our example, if link 1 at node A is busy, link 4 can instead be used to forward the message to node 11111, which is on another path to B.
– If at some point the current vertex determines there is no useful link available, it will have to wait for a useful link to become available.
– If at some vertex the desired links are not available, the algorithm could use a link that extends the path length.

Gray Code
• Recursive construction of the Gray code:
– G1 = (0, 1) and has 2^1 = 2 elements
– G2 = (00, 01, 11, 10) and has 2^2 = 4 elements
– G3 = (000, 001, 011, 010, 110, 111, 101, 100) and has 2^3 = 8 elements
– etc.
• The Gray code for dimension n ≥ 1 is denoted Gn and is defined recursively: G1 = (0, 1), and for n > 1, Gn is the sequence 0Gn-1 followed by the sequence 1Gn-1^rev, where
– xGn-1 is the sequence obtained by prefixing every element of Gn-1 with x
– G^rev is the sequence obtained by listing the elements of G in reverse order

Gray Code (cont.)
– Since Gn-1 has 2^(n-1) elements and Gn consists of exactly two copies of Gn-1, Gn has 2^n elements.
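The recursive construction translates directly into code:

```python
def gray_code(n):
    """Build Gn recursively: G1 = (0, 1); for n > 1, Gn is 0*G(n-1)
    followed by 1*reversed(G(n-1))."""
    if n == 1:
        return ["0", "1"]
    prev = gray_code(n - 1)
    return ["0" + s for s in prev] + ["1" + s for s in reversed(prev)]
```

Running it reproduces the sequences listed above, e.g. gray_code(3) yields 000, 001, 011, 010, 110, 111, 101, 100.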
• Summary: The Gray code Gn is an ordered sequence of all 2^n binary codes with n digits, whose successive values differ from each other in exactly one bit.
• Notation: Let gi(r) denote the ith element of the Gray code of dimension r.
• Observation: The Gray code Gn = (g1(n), g2(n), …, g2^n(n)) forms an ordered sequence of names for all of the nodes in a 2^n ring.

Embeddings
• Defn: An embedding of a topology (e.g., ring, 2D mesh, etc.) into an n-cube is a 1-1 function f from the vertices of the topology into the n-cube.
– An embedding is said to preserve locality if the images of any two neighbors are also neighbors in the n-cube.
– If an embedding does not preserve locality, then we try to minimize the distance in the hypercube between the images of neighbors.
– An embedding is said to be onto if the range of the embedding function f is the entire n-cube.

A 2^n Ring Embedding onto the n-cube
• Theorem: There is an embedding of a ring with 2^n vertices onto an n-cube that preserves locality.
• Proof:
– Our construction of the Gray code provides an ordered sequence of binary values with n digits that can be used as names for the nodes of the ring with 2^n vertices.
• The first name (e.g., 00…0) can be given to any node of the ring.
• The sequence of Gray code names is then assigned to the ring nodes successively, in clockwise or counterclockwise order.

Embedding the Ring onto the n-cube (cont.)
• The Gray code binary numbers are identical to the names assigned to the hypercube nodes.
– Two successive nodes are connected in the n-cube, since their names differ in only one binary digit.
• So the embedding is a result of the Gray code providing an ordered sequence of n-cube names that is used to name the 2^n ring nodes successively.
• This concludes the proof that the embedding follows "from the construction of the Gray code".
• However, the following formal proof provides more details and an "after construction" argument.
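The locality claim can also be checked mechanically: consecutive Gray code words (including the wraparound pair that closes the ring) must differ in exactly one bit. A self-contained sketch:

```python
def gray_code(n):
    """Gn: 0*G(n-1) followed by 1*reversed(G(n-1))."""
    if n == 1:
        return ["0", "1"]
    prev = gray_code(n - 1)
    return ["0" + s for s in prev] + ["1" + s for s in reversed(prev)]

def preserves_locality(n):
    """True if consecutive ring positions (with wraparound) map to
    hypercube nodes at Hamming distance exactly 1."""
    g = gray_code(n)
    hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
    return all(hamming(g[i], g[(i + 1) % len(g)]) == 1 for i in range(len(g)))
```

The wraparound check matters: the last word (100…0) and the first (000…0) also differ in a single bit, so the embedding closes the ring.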
A 2^n Ring Embedding onto the n-cube (formal proof)
• The following optional formal proof is included for those who do not find the preceding argument convincing.
• Theorem: There is an embedding of a ring with 2^n vertices onto an n-cube that preserves locality.
• Proof:
– We establish the following claim: the mapping f(i) = gi(n) is an embedding of the ring onto the n-cube.
– The claim is true for n = 1, as G1 = (0, 1) and nodes 0 and 1 are connected on both the ring and the 1-cube.
– We assume the claim is true for a fixed n−1 with n > 1.
– We use the neighbor-preserving embedding f of the vertices of a ring with 2^(n-1) nodes onto the (n−1)-cube to build a similar embedding f* of a ring with 2^n vertices onto the n-cube.

Ring Embedding for the n-cube (cont.)
• Recall that the Gray code Gn is the sequence 0Gn-1 followed by the sequence 1Gn-1^rev.
• The n-cube consists of one copy of an (n−1)-cube with a 0 prefix and a second copy of an (n−1)-cube with a 1 prefix.
• By the assumption that the claim is true for n−1, the Gray code sequence 0Gn-1 provides the binary code for a ring of elements in the first (n−1)-cube, with each successive element differing in one digit from its predecessor.
• Likewise, the Gray code sequence 1Gn-1^rev provides the binary code for a ring of elements in the second copy of the (n−1)-cube, with each successive element differing in one digit from its predecessor.
• The last element of 0Gn-1 is identical to the first element of 1Gn-1^rev except for the added digit, so these two elements also differ in one bit.

2D Torus Embedding into the n-cube
• We embed a 2^r × 2^s torus into an n-cube with n = r+s by using the Cartesian product Gr × Gs of two Gray codes.
• A processor with coordinates (i,j) on the grid is mapped to the processor f(i,j) = (gi(r), gj(s)) in the n-cube.
• Recall that the map f1(i) = gi(r) is an embedding of a 2^r ring onto an r-cube "row" ring, and f2(j) = gj(s) is an embedding of a 2^s ring onto an s-cube "column" ring.
• We identify (gi(r), gj(s)) with the node of the n = r+s cube whose first r bits are given by gi(r) and whose next s bits are given by gj(s).
• Then for fixed j, the nodes f(i±1, j) are neighbors of f(i,j), since f1 is an embedding of a 2^r ring onto an r-cube.
• Likewise, for fixed i, the nodes f(i, j±1) are neighbors of f(i,j), since f2 is an embedding of a 2^s ring onto an s-cube.

Collective Communications in the Hypercube
• Purpose: Gain an overview of the complexity of collective communications for the hypercube.
– We will focus on broadcast on the hypercube.
• Assume processor 0 wants to broadcast, and consider the naïve algorithm:
– Processor 0 sends the message to all of its neighbors.
– Next, every neighbor sends the message to all of its neighbors.
– Etc.
• Redundancy in the naïve algorithm:
– The same processor receives the same message many times.
– E.g., processor 0 receives it back from all its neighbors.
– Mismatched SENDs and RECEIVEs may happen.

Improved Hypercube Broadcast
• We seek a strategy where
– Each processor receives the message only once
– The number of steps is minimal
– One or more spanning trees will be used
• Send and receive will need a parameter indicating the dimension on which the communication takes place:
– SEND(cube_link, send_addr, m)
– RECEIVE(cube_link, recv_addr, m)

Hypercube Broadcast Algorithm
• There are n steps, numbered from n−1 down to 0.
• Every processor receives the message on the link corresponding to its rightmost 1.
• A processor that has received the message forwards it on the links whose indices are smaller than that of its rightmost 1.
• At step i, every processor whose rightmost 1 is strictly larger than i forwards the message on link i.
• Let the broadcast originate with processor 0.
– Assume 0 has a fictitious 1 at position n.
– This adds an additional digit, as processors have binary digits for positions 0, 1, …, n−1.
Trace of Hypercube Broadcast for n = 4
• Since the broadcast originates with processor 0, we assume its index is 10000 (with the fictitious leading 1).
– Processor 0 sends the broadcast message on link 3 to 1000.
• Next, both processors 0000 and 1000 have their rightmost 1 in position at least three, so both send the message along link 2.
– The message goes to 0100 and 1100, respectively.
• This process continues until the last step, at which every even-numbered processor sends the message along its link 0.

Broadcasting Using the Spanning Tree Broadcast Algorithm
• Let BIT(A, b) denote the value of the bth bit of processor A.
• The algorithm for the broadcast of a message of length m by processor k is given in Algorithm 3.5.
• Since there are n steps, the execution time with the store-and-forward model is n(L + mb).
• This algorithm is valid for the 1-port model.
– At each step, a processor communicates with at most one other processor.

Broadcast Algorithm in the Hypercube (cont.)
• Observations about the algorithm's steps:
– Step 2 specifies that the algorithm's action is for processor q.
– Step 3: n is the number of binary digits used to label processors.
– Step 4: pos = q XOR k, using the exclusive OR.
• The broadcast is from Pk.
• Setting pos lets the algorithm work as if P0 were the root of the broadcast.
– Steps 5–7 set first-1 to the location of the first 1 in pos.
• Note: Steps 8–10 are the core of the algorithm.
– Step 8: phase steps through the link dimensions, higher ones first.
– Step 9: if the link dimension equals first-1, q receives the message on its first-1 link.
– Step 10: q sends the message along each dimension smaller than first-1, to the processors below it in the spanning tree.
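The spanning-tree broadcast can be simulated to confirm that every processor receives the message exactly once in n steps. This is a sketch of the rule stated above (broadcast from processor 0, with its fictitious 1 at position n), not a transcription of Algorithm 3.5:

```python
def rightmost_one(x, n):
    """Position of the rightmost 1 bit of x; processor 0 gets a
    fictitious 1 at position n."""
    if x == 0:
        return n
    return (x & -x).bit_length() - 1

def hypercube_broadcast(n):
    """Simulate the n-step broadcast from processor 0 on an n-cube.
    At step i (from n-1 down to 0), every processor holding the message
    whose rightmost 1 is strictly larger than i forwards it on link i.
    Returns a list counting how many times each processor received it."""
    p = 1 << n
    received = [0] * p
    has_msg = {0}
    for i in range(n - 1, -1, -1):
        for q in list(has_msg):                 # senders fixed at step start
            if rightmost_one(q, n) > i:
                received[q ^ (1 << i)] += 1     # forward on link i
        has_msg = {q for q in range(p) if q == 0 or received[q] > 0}
    return received
```

For n = 2 the trace matches the slides: step 1 sends 00 → 10; step 0 sends 00 → 01 and 10 → 11, so all four nodes are covered in two steps.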