Asynchronous Interconnection Networks and Communication
Chapter 3 of Casanova, et al.

Interconnection Network Topologies
• The processors in a distributed-memory parallel system are connected using an interconnection network.
• All computers have specialized coprocessors that route messages and place data in local memories
– Nodes consist of a (computing) processor, a memory, and a communications coprocessor
– Nodes are often called processors, when not ambiguous.

Network Topology Types
• Static Topologies
– A fixed network that cannot be changed
– Nodes connected directly to each other by point-to-point communication links
• Dynamic Topologies
– Topology can change at runtime
– One or more nodes can request that direct communication be established between them
• Done using switches

Some Static Topologies
• Fully connected network (or clique)
• Ring
• Two-dimensional grid
• Torus
• Hypercube
• Fat tree

Examples of Interconnection Topologies

Static Topology Features
• Fixed number of nodes
• Degree:
– Number of edges incident to a node
• Distance between nodes:
– Length of the shortest path between two nodes
• Diameter:
– Largest distance between any two nodes
• Number of links:
– Total number of edges
• Bisection width:
– Minimum number of edges that must be removed to partition the network into two disconnected networks of the same size.

Classical Interconnection Network Features
• Clique (or fully connected network)
– All pairs of processors are directly connected
– p(p-1)/2 edges
• Ring
– Very simple and very useful topology
• 2D Grid
– Degree of interior processors is 4
– Not symmetric, as edge processors have different properties
– Very useful when computations are local and communications are between neighbors
– Has been heavily used previously

Classical Networks (cont)
• 2D Torus
– Easily formed from a 2D mesh by connecting matching end points.
• Hypercube
– Has been extensively used
– Using its recursive definition, one can design simple but very efficient algorithms
– Has a small diameter that is logarithmic in the number of nodes
– Degree and total number of edges grow too quickly to be useful for massively parallel machines.

Dynamic Topologies
• The fat tree is different from the other networks included
– The compute nodes are only at the leaves.
– Nodes at higher levels do not perform computation
– Topology is a binary tree, both in the 2D front view and in the side view.
– Provides extra bandwidth near the root.
– Used by Thinking Machines Corp. on the CM-5
• Crossbar Switch
– Has p² switches, which is very expensive for large p
– Can connect p processors to any permutation of p processors
– Cost rises with the number of switches, which is quadratic in the number of processors.

Dynamic Topologies (cont)
• Benes Networks & Omega Networks
– Use smaller crossbars arranged in stages
– Only crossbars in adjacent stages are connected together.
– Called multi-stage networks; cheaper to build than a full crossbar.
– Configuring a multi-stage network is more difficult than configuring a crossbar.
– Dynamic networks are now the most commonly used topologies.

A Simple Communication Performance Model
• Assume a processor Pi sends a message of length m to Pj.
– The cost to transfer the message along one network link is roughly linear in the message length.
– Thus the cost to transfer the message along a particular route is also roughly linear in m.
• Let ci,j(m) denote the time to transfer this message.

Hockney Performance Model for Communications
• The time ci,j(m) to transfer this message can be modeled by
ci,j(m) = Li,j + m/Bi,j = Li,j + m·bi,j
– m is the size of the message
– Li,j is the startup time, also called latency
– Bi,j is the bandwidth, in bytes per second
– bi,j = 1/Bi,j is the inverse of the bandwidth
• Proposed by Hockney in 1994 to evaluate the performance of the Intel Paragon.
• Probably the most commonly used model.

Hockney Performance Model (cont.)
• Factors that Li,j and Bi,j depend on
– Length of the route
– Communication protocol used
– Communication software overhead
– Ability to use links in parallel
– Whether links are half or full duplex
– Etc.

Store and Forward Protocol
• SF is a point-to-point protocol
• Each intermediate node receives and stores the entire message before retransmitting it
• Implemented in the earliest parallel machines, in which nodes did not have communication coprocessors.
• Intermediate nodes are interrupted to handle messages and route them toward their destination.

Store and Forward Protocol (cont)
• If d(i,j) is the number of links between Pi and Pj, the formula for ci,j(m) can be rewritten as
ci,j(m) = d(i,j)(L + m·b) = d(i,j)·L + d(i,j)·m·b
where
– L is the latency and b is the reciprocal of the bandwidth for one link.
• This protocol yields poor latency and bandwidth
• The communication cost can be reduced using pipelining.

Store and Forward Protocol Using Pipelining
• The message is split into r packets of size m/r.
• The packets are sent one after another from Pi to Pj.
• The first packet reaches Pj after ci,j(m/r) time units.
• The remaining r-1 packets arrive in (r-1)(L + m·b/r) time units
• Simplifying, the total communication time reduces to
[d(i,j) - 1 + r](L + m·b/r)
• Casanova et al. find the optimal value of r for the expression above.

Two Cut-Through Protocols
• Common performance model:
ci,j(m) = L + d(i,j)·δ + m/B
where
– L is the one-time cost of creating a message
– δ is the routing management overhead
– Generally δ << L, as routing management is performed by hardware while L involves software overhead
– m/B is the time required to transmit the message through the entire route

Circuit-Switching Protocol
• First cut-through protocol
• The route is created before the first message is sent
• The message is sent directly to the destination through this route
– The nodes used in this transmission cannot be used for any other communication during this transmission

Wormhole (WH) Protocol
• A second cut-through protocol
• The destination address is stored in the header of the message.
• Routing is performed dynamically at each node.
• The message is split into small packets called flits
• If two flits arrive at the same time, flits are stored in intermediate nodes' internal registers

Point-to-Point Communication Comparisons
• Store and forward is no longer used in physical networks, but only at the application level
• Cut-through protocols are more efficient
– Hide the distance between nodes
– Avoid large buffer requirements for intermediate nodes
– Almost no message loss
– For small networks, a flow-control mechanism is not needed
• Wormhole is generally preferred to circuit switching
– Latency is normally much lower

LogP Model
• Models based on the LogP model are more precise than the Hockney model
• Involves three components of communication: the sender, the network, and the receiver
– At times, some of these components may be busy while others are not.
• Some parameters for LogP
– m is the message size (in bytes)
– w is the size of the packets the message is split into
– L is an upper bound on the latency
– o is the overhead
• Defined to be the time that a node is engaged in the transmission or reception of a packet

LogP Model (cont)
• Parameters for LogP (cont)
– g, or gap, is the minimal time interval between consecutive packet transmissions or receptions
• During this time, a node may not use the communication coprocessor (i.e., network card)
– 1/g is the communication bandwidth available per node
– P is the number of nodes in the platform
• Cost of sending m bytes with packet size w:
c(m) = 2o + L + (⌈m/w⌉ - 1)·g
• Processor occupation time on the sender/receiver:
o + (⌈m/w⌉ - 1)·g

Other LogP-Related Models
• LogP attempts to capture in a few parameters the characteristics of parallel platforms.
• Platforms are fine-tuned and may use different protocols for short and long messages
• LogGP is an extension of LogP where G captures the bandwidth for long messages
• pLogP is an extension of LogP where L, o, and g depend on the message size m.
– Also separates the sender overhead os and the receiver overhead or.

Affine Models
• The use of floor/ceiling functions in LogP-type models makes them nonlinear.
– Causes many problems in analytic and theoretical studies.
– Has resulted in the proposal of many fully linear models
• The time that Pi is busy sending a message is expressed as an affine function of the message size
– An affine function of m has the form f(m) = a·m + b, where a and b are constants. If b = 0, then f is a linear function
• Similarly, the time Pj is busy receiving the message is expressed as an affine function of the message size
• We will postpone further coverage of affine models for the present

Modeling Concurrent Communications
• Multi-port model
– Assumes that communications are contention-free and do not interfere with each other.
– A consequence is that a node may communicate with an unlimited number of nodes without any degradation in performance.
– Would require a clique interconnection network to fully support.
– May simplify proofs that certain problems are hard
• If a problem is hard under ideal communication conditions, then it is hard in general.
– The assumption is not realistic - communication resources are always limited.
– See the Casanova text for additional information.

Concurrent Communications Models (2/5)
• Bounded multi-port model
– Proposed by Hong and Prasanna
– For applications that use threads (e.g., on multicore technology), the network link can be shared by several incoming and outgoing communications.
– The sum of the bandwidths allocated to all communications cannot exceed the bandwidth of the network card.
– An unbounded number of communications can take place if they share the total available bandwidth.
– The application defines the bandwidth allotted to each communication
– Bandwidth sharing by the application is unusual, as it is usually handled by the operating system.

Concurrent Communications Models (3/5)
• 1-port (unidirectional or half-duplex) model
– Avoids unrealistically optimistic assumptions
– Forbids concurrent communication at a node.
– A node can either send data or receive it, but not simultaneously.
– This model is very pessimistic, as real-world platforms can achieve some concurrent communication.
– The model is simple, and it is easy to design algorithms that follow it.

Concurrent Communications Models (4/5)
• 1-port (bidirectional or full-duplex) model
– Currently, most network cards are full-duplex.
– Allows a single emission and a single reception simultaneously.
– Introduced by Bhat et al.
• Current hardware does not easily enable multiple messages to be transmitted simultaneously.
• Multiple sends and receives are claimed to be eventually serialized by the single hardware port to the network.
• Saif & Parashar did experimental work suggesting that asynchronous sends become serialized when message sizes exceed a few megabytes.

Concurrent Communications Models (5/5)
• k-ports model
– A node may have k > 1 network cards
– This model allows a node to be involved in at most one emission and one reception on each network card.
– This model is used in Chapters 4 & 5.

Bandwidth Sharing
• The previous concurrent communication models only consider contention at the nodes
• Other parts of the network can also limit performance
• It may be useful to determine constraints on each network link
• This type of network model is useful for performance evaluation purposes, but is too complicated for algorithm design purposes.
• The Casanova text evaluates algorithms using 2 models:
– The Hockney model, or even simplified versions (e.g., assuming no latency)
– The multi-port model (ignoring contention) or the 1-port model.

Case Study: Unidirectional Ring
• We first consider a platform of p processors arranged in a unidirectional ring.
• Processors are denoted Pk for k = 0, 1, …, p-1.
• Each PE can find its logical index by calling My_Num().

Unidirectional Ring Basics
• A processor can determine the number of PEs by calling NumProcs()
– Both preceding functions are supported in MPI, a library implemented on most asynchronous systems.
• Each processor has its own memory
• All processors execute the same program, which acts on data in their local memories
– Single Program, Multiple Data, or SPMD
• Processors communicate by message passing: explicitly sending and receiving messages.

Unidirectional Ring Basics (cont - 2/5)
• A processor sends a message using the function send(addr, m)
– addr is the memory address (in the sending processor) of the first data item to be sent
– m is the message length (i.e., number of items to be sent)
• A processor receives a message using the function receive(addr, m)
– addr is the local address in the receiving processor where the first data item is to be stored.
– If processor Pi executes a receive, then its predecessor P(i-1) mod p must execute a send.
– Since each processor has a unique predecessor and successor, they do not have to be specified

Unidirectional Ring Basics (cont - 3/5)
• A restrictive assumption is that both the send and the receive are blocking.
– Then the participating processors cannot continue until the communication is complete.
– The blocking assumption is typical of first-generation platforms
• A classical assumption is to keep the receive blocking but to allow the send to be non-blocking
– The processor executing a send can continue while the data transfer takes place.
– To implement this, one function is used to initiate the send and another to determine when the communication has finished.

Unidirectional Ring Basics (cont - 4/5)
• In algorithms, we simply indicate the blocking and non-blocking operations
• A more recently proposed assumption is that a single processor can send data, receive data, and compute simultaneously.
– All three can occur only if no race condition exists.
– It is convenient to think of three logical threads of control running on a processor:
• One for computing
• One for sending data
• One for receiving data
• We will usually use the less restrictive third assumption

Unidirectional Ring Basics (cont - 5/5)
• Timings for send/receive
– We use a simplified version of the Hockney model
– The time to send or receive a message over one link is c(m) = L + m·b
• m is the length of the message
• L is the startup cost in seconds, due to the physical latency and the software overhead
• b is the inverse of the data transfer rate.

The Broadcast Operation
• The broadcast operation allows a processor Pk to send the same message of length m to all other processors.
• At the beginning of the broadcast, the message is stored at address addr in the memory of the sending processor, Pk.
• At the end of the broadcast, the message will be stored at address addr in the memory of all processors.
• All processors must call the following function:
Broadcast(k, addr, m)

Broadcast Algorithm Overview
• The message goes around the ring from Pk to Pk+1 to Pk+2 to … to Pk-1.
• We assume processor numbers are taken modulo p, where p is the number of processors. For example, if k = 0 and p = 8, then k-1 = p-1 = 7.
• Note there is no parallelism in this algorithm, since the message advances around the ring by only one processor per step.
• The predecessor of Pk (i.e., Pk-1) does not send the message back to Pk.

Analysis of the Broadcast Algorithm
• For the algorithm to be correct, the "receive" in Step 10 must execute before Step 11.
• Running Time:
– Since we have a sequence of p-1 communications, the time to broadcast a message of length m is (p-1)(L + m·b)
• MPI does not typically use a ring topology for its communication primitives
– Instead it uses various tree topologies that are more efficient on modern parallel computing platforms.
– However, these primitives are simpler on a ring.
– This prepares readers to implement their own primitives when doing so is more efficient than using the MPI primitives.

Scatter Algorithm
• The scatter operation allows Pk to send a different message of length m to each processor.
• Initially, Pk holds the message of length m to be sent to Pq at location addr[q].
• To keep the array of addresses uniform, space for a message from Pk to itself is also provided.
• At the end of the algorithm, each processor stores its message from Pk at location msg.
• The efficient way to implement this algorithm is to pipeline the messages.
– The message to the most distant processor (i.e., Pk-1) is sent first, followed by the message to processor Pk-2.

Discussion of the Scatter Algorithm
• In Steps 5-6, Pk successively sends messages to the other p-1 processors in decreasing order of their distance from Pk.
• In Step 7, Pk stores its message to itself.
• The other processors concurrently move messages along as they arrive, in Steps 9-12.
• Each processor uses two buffers, with addresses tempS and tempR.
– This allows processors to send one message and receive the next message in parallel in Step 12.

Discussion of the Scatter Algorithm (cont)
• In Step 11, tempS ↔ tempR means the two addresses are swapped, so the received value can be sent on to the next processor.
• When a processor receives its own message from Pk, it stops forwarding (Step 10).
• Whatever is in the receive buffer tempR at the end is stored as the processor's message from Pk (Step 13).
• The running time of the scatter algorithm is the same as for the broadcast, namely (p-1)(L + m·b)

Example for the Scatter Algorithm
• Example: In Figure 3.7, let p = 6 and k = 4.
• Steps 5-6: for i = 1 to p-1 do send(addr[(k+p-i) mod p], m)
– Let PE = (k+p-i) mod p = (10 - i) mod 6
– For i=1, PE = 9 mod 6 = 3
– For i=2, PE = 8 mod 6 = 2
– For i=3, PE = 7 mod 6 = 1
– For i=4, PE = 6 mod 6 = 0
– For i=5, PE = 5 mod 6 = 5
• Note messages are sent to processors in the order 3, 2, 1, 0, 5
– That is, messages to the most distant processors are sent first.

Example for the Scatter Algorithm (cont)
• Example: In Figure 3.7, let p = 6 and k = 4.
• Step 10: for i = 1 to (k-1-q) mod p do
• Compute (k-1-q) mod p = (3-q) mod 6 for all q. Note: q ≠ k, which is 4
– q = 5: i = 1 to 4, since (3-5) mod 6 = 4
• PE 5 forwards values in the loop from i = 1 to 4
– q = 0: i = 1 to 3, since (3-0) mod 6 = 3
• PE 0 forwards values from i = 1 to 3
– q = 1: i = 1 to 2, since (3-1) mod 6 = 2
• PE 1 forwards values from i = 1 to 2
– q = 2: i = 1 to 1, since (3-2) mod 6 = 1
• PE 2 is active in the loop only when i = 1

Example for the Scatter Algorithm (cont)
– q = 3: i = 1 to 0, since (3-3) mod 6 = 0
• PE 3 precedes Pk, so it never forwards a value
• However, it receives and stores a message in Step 9
• Note that in Step 9, all processors store the first message they receive.
– That means even processor k-1 receives a value to store.
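The pipelined scatter just traced can be checked with a small round-by-round simulation. This is an illustrative sketch only (the function name and bookkeeping are mine, not the book's pseudocode); it counts hops on the unidirectional ring and ignores L and b:

```python
def ring_scatter(p, k, messages):
    """Simulate Pk sending messages[q] to each Pq on a unidirectional ring."""
    received = {k: messages[k]}   # Pk keeps its own message (Step 7)
    in_flight = []                # (current node, destination, payload)
    for i in range(1, p):         # p-1 pipelined communication rounds
        # every message already on the ring advances one hop (Steps 9-12)
        in_flight = [((q + 1) % p, dest, m) for q, dest, m in in_flight]
        # Pk injects the message for the farthest not-yet-served processor (Steps 5-6)
        dest = (k + p - i) % p
        in_flight.append(((k + 1) % p, dest, messages[dest]))
        # a message that has reached its destination is stored and leaves the ring
        for q, dest, m in in_flight:
            if q == dest:
                received[q] = m
        in_flight = [t for t in in_flight if t[0] != t[1]]
    return [received[q] for q in range(p)]
```

For p = 6 and k = 4, the injection order is 3, 2, 1, 0, 5, matching the example above, and every processor ends up with its own message after p-1 rounds.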
All-to-All Algorithm
• This operation allows all p processors to simultaneously broadcast a message (to all PEs)
• Again, it is assumed all messages have length m
• At the beginning, each processor holds the message it wishes to broadcast at address my_message.
• At the end, each processor holds an array addr of p messages, where addr[k] holds the message from Pk.
• Using pipelining, the running time is the same as for a single broadcast, namely (p-1)(L + m·b)

Gossip Algorithm
• Last of the classical collection of communication operations
• Each processor sends a different message to each other processor.
• The gossip algorithm is Problem 3.7 in the textbook.
– Note it takes 1 step for each PE to send a message to its closest neighbor, using all links.
– It takes 2 steps for each PE to send a message to its 2nd closest neighbor, using all links.
– In general, it takes 1 + 2 + … + (p-1) steps for each PE to send messages to all other nodes, using all links of the network during each step.
– This sum is p(p-1)/2 = O(p²) steps.

Pipelined Broadcast by the kth Processor
• Longer messages can be broadcast faster if they are broken into smaller pieces
• Suppose the message is broken into r pieces of the same length.
• The sender sends the pieces out in order, and they travel simultaneously on the ring.
• Initially, the pieces are stored at addresses addr[0], addr[1], …, addr[r-1].
• At the end, all pieces are stored in all processors.
• At each step, when a processor receives a message piece, it also forwards the piece it previously received, if any, to its successor.

Pipelined Broadcast (cont)
• It takes p-1 communication steps for the first piece to reach the last processor, Pk-1.
• Then it takes r-1 more steps for the rest of the pieces to reach Pk-1.
• The required time is (p + r - 2)(L + m·b/r)
• The value of r that minimizes this expression can be found by setting its derivative (with respect to r) to zero and solving for r.
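Setting the derivative of T(r) = (p + r - 2)(L + m·b/r) to zero gives the closed form r* = sqrt((p-2)·m·b/L). The sketch below checks this numerically against a brute-force search over integer r; the parameter values are arbitrary examples of mine, not from the text:

```python
from math import sqrt

def pipelined_time(r, p, L, m, b):
    """Pipelined ring broadcast time with the message split into r pieces."""
    return (p + r - 2) * (L + m * b / r)

# Example parameters: 8 nodes, 100 us latency, 1 MB message, 100 MB/s links.
p, L, m, b = 8, 1e-4, 1e6, 1e-8
r_star = sqrt((p - 2) * m * b / L)   # continuous optimum, about 24.5
best_r = min(range(1, 1001), key=lambda r: pipelined_time(r, p, L, m, b))
plain = (p - 1) * (L + m * b)        # unpipelined broadcast, for comparison
```

With these numbers the best integer split (r = 24 or 25) brings the broadcast time well below the unpipelined (p-1)(L + m·b).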
• For large m, the time required tends to m·b
– It does not depend on p.
• Compares well to the plain broadcast time: (p-1)(L + m·b)

Hypercube
• Defn: A 0-cube consists of a single vertex. For n > 0, an n-cube consists of two identical (n-1)-cubes with edges added to join matching pairs of vertices in the two (n-1)-cubes.

Hypercubes (cont)
• Equivalent defn: An n-cube is a graph consisting of 2^n vertices, numbered 0 to 2^n - 1, such that two vertices are connected if and only if their binary representations differ in a single bit.
• Property: The diameter and degree of an n-cube are both equal to n.
– The proof is left for the reader; it is easy using recursion.
• Hamming Distance: Let A and B be two points in an n-cube. H(A,B) is the number of bits that differ between the binary labels of A and B.
• Notation: If b is a binary bit, then let b' = 1-b, the bit complement of b.

Hypercube Paths
• Using binary representation, let
– A = an-1 an-2 … a2 a1 a0
– B = bn-1 bn-2 … b2 b1 b0
• WLOG, assume that A and B differ exactly in their last k bits.
– Having the differing bits at the end makes numbering them easier
• Then a path from A to B can be created by complementing one differing bit at each step, giving the following sequence of nodes:
– A = an-1 an-2 … a2 a1 a0
– Vertex 1 = an-1 an-2 … a2 a1 a0'
– Vertex 2 = an-1 an-2 … a2 a1' a0'
– …
– B = Vertex k = an-1 an-2 … ak ak-1' … a2' a1' a0'

Hypercube Paths (cont)
• Independent of which bits of A and B differ, there are k choices for the first bit to flip, k-1 choices for the next bit to flip, etc.
– This gives k! different paths from A to B.
• How many independent paths exist from A to B?
– I.e., paths with only A and B as common vertices.
Theorem: If A and B are n-cube vertices that differ in k bits, then there exist exactly k independent paths from A to B.
Proof: First, we show k independent paths exist.
• We build an independent path for each j with 0 ≤ j < k

Hypercube Paths (cont)
• Let P(j, j-1, …, 0, k-1, k-2, …, j+1) denote the path from A to B that complements the k differing bits in that order: first bit j, then j-1, down to 0, then k-1 down to j+1. Its node sequence is:
– A (no bits complemented)
– V(1): bit j complemented
– V(2): bits j, j-1 complemented
– …
– V(j+1): bits j, j-1, …, 0 complemented
– V(j+2): bits j, …, 0 and k-1 complemented
– …
– V(k-1): bits j, …, 0 and k-1, …, j+2 complemented
– B = V(k): all k differing bits complemented

Hypercube Paths (cont)
• Suppose the following two paths have a common vertex X other than A and B:
P(j, j-1, …, 0, k-1, …, j+1) and P(t, t-1, …, 0, k-1, …, t+1)
– Since the paths are different, and A and B differ in k bits, we may assume 0 ≤ t < j < k
• Let A and X differ in q bits
• To travel from A to X along either path, exactly q bits have been flipped, following each path's circular order:
– 1st path: j, j-1, …, 0, k-1, k-2, …, j+1
– 2nd path: t, t-1, …, 0, k-1, k-2, …, t+1
• This is impossible: the bits flipped form a block of q consecutive positions in the circular order, starting at j on one path and at t ≠ j on the other, and for q < k these two blocks cannot be the same set of bits.

Hypercube Paths (cont)
• Finally, there cannot be another independent path Q from A to B (i.e., one sharing no interior vertex with the k paths above)
– If there were, the first node in Q after A would have to flip one bit, say bit q, to move toward agreement with B.
– But then the path described earlier that flips bit q first would have a common interior vertex with Q.

Hypercube Routing
• XOR is the exclusive OR: the output bit is 1 exactly when the two input bits differ.
• To design a route from A to B in the n-cube, we use the algorithm that always flips the rightmost bit that disagrees with B.
• The XOR of the binary representations of A and B indicates with 1's the bits that have to be flipped.
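The rightmost-differing-bit rule can be sketched directly; this is an illustrative helper of mine, not code from the text, using the fact that A XOR B marks the bits still to flip:

```python
def xor_route(a, b):
    """Node sequence from a to b, flipping the rightmost differing bit each step."""
    path = [a]
    while a != b:
        diff = a ^ b          # 1 bits = positions where a still disagrees with b
        a ^= diff & -diff     # flip only the rightmost such bit
        path.append(a)
    return path

# 5-cube example: A = 10111, B = 01110
route = xor_route(0b10111, 0b01110)   # [10111, 10110, 11110, 01110]
```

Each consecutive pair of nodes on the returned path differs in exactly one bit, so every hop is a hypercube link, and the path length equals the Hamming distance H(A, B).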
• For the 5-cube, if A is 10111 and B is 01110, then
– A XOR B is 11001, and the routing is as follows:
A = 10111 → 10110 → 11110 → 01110 = B
• This algorithm can be executed as follows:
– A XOR B = 10111 XOR 01110 = 11001, so A routes the message along link 1 (the rightmost bit's link) to node A1 = 10110
– A1 XOR B = 10110 XOR 01110 = 11000, so node A1 routes the message along link 4 to node A2 = 11110
– A2 XOR B = 11110 XOR 01110 = 10000, so A2 routes the message along link 5 to B

Hypercube Routing (cont)
• This routing algorithm can be used to implement a wormhole or cut-through protocol in hardware.
• Problem: If another pair of processors has already reserved one link on the desired path, the message may stall until the end of the other communication.
• Solution: Since there are multiple paths, the routers select which links to use based on a link reservation table and message labels.
– In our example, if link 1 at node A is busy, then instead use link 4 to forward the message to node 11111, which is on a new path to B.
– If at some point the current vertex determines that no useful link is available, it must wait for a useful link to become available
– If at some vertex the desired links are not available, the algorithm could use a link that extends the path length

Gray Code
• Recursive construction of the Gray code:
– G1 = (0, 1) and has 2^1 = 2 elements
– G2 = (00, 01, 11, 10) and has 2^2 elements
– G3 = (000, 001, 011, 010, 110, 111, 101, 100) and has 2^3 elements
– etc.
• The Gray code for dimension n ≥ 1 is denoted Gn and is defined recursively: G1 = (0, 1), and for n > 1, Gn is the sequence 0Gn-1 followed by the sequence 1Gn-1^rev, where
– xG is the sequence obtained by prefixing every element of G with x
– G^rev is the sequence obtained by listing the elements of G in reverse order

Gray Code (cont.)
– Since Gn-1 has 2^(n-1) elements and Gn contains exactly two copies of Gn-1, Gn has 2^n elements.
• Summary: The Gray code Gn is an ordered sequence of all 2^n binary codes with n digits, in which successive values differ from each other by exactly one bit.
• Notation: Let gi(r) denote the ith element of the Gray code of dimension r.
• Observation: The Gray code Gn = (g1(n), g2(n), …, g2^n(n)) forms an ordered sequence of names for all of the nodes in a 2^n ring.

Embeddings
• Defn: An embedding of a topology (e.g., ring, 2D mesh, etc.) into an n-cube is a 1-1 function f from the vertices of the topology into the n-cube.
– An embedding is said to preserve locality if the images of any two neighbors are also neighbors in the n-cube
– If an embedding does not preserve locality, then we try to minimize the distance in the hypercube between the images of neighbors.
– An embedding is said to be onto if the range of the embedding function f is the entire n-cube.

A 2^n Ring Embedding onto the n-cube
Theorem: There is an embedding of a ring with 2^n vertices onto an n-cube that preserves locality.
Proof:
• Our construction of the Gray code provides an ordered sequence of binary values with n digits that can be used as names for the nodes of the ring with 2^n vertices
– The first name (e.g., 00…0) can be assigned to any node of the ring.
– The sequence of Gray code names is then assigned to the ring nodes successively, in clockwise or counterclockwise order

Embedding the Ring onto the n-cube (cont)
• The Gray code binary numbers are identical to the names assigned to the hypercube nodes.
– Two successive nodes are connected in the n-cube, since their names differ by only one binary digit
• So the embedding is a result of the Gray code providing an ordered sequence of n-cube names that is used to name the 2^n ring nodes successively.
• This concludes the proof that the embedding follows "from the construction of the Gray code".
• However, the following formal proof provides more details and an "after construction" argument.
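The recursive construction of Gn can be written out directly; this is a sketch of mine (codes represented as n-character bit strings), not code from the text:

```python
def gray(n):
    """G_n: the sequence 0G_{n-1} followed by 1(G_{n-1} reversed)."""
    if n == 1:
        return ["0", "1"]
    prev = gray(n - 1)
    return ["0" + s for s in prev] + ["1" + s for s in reversed(prev)]

g3 = gray(3)   # ['000', '001', '011', '010', '110', '111', '101', '100']
```

Successive codes, taken cyclically, differ in exactly one bit, which is precisely the property that makes f(i) = gi(n) a locality-preserving embedding of the 2^n ring into the n-cube.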
A 2n Ring Embedding onto the n-cube • The following optional formal proof is included for those who do not find the preceding argument convincing: Theorem: There is an embedding of a ring with 2n vertices onto an n-cube that preserves locality. Proof: • We establish the following claim: The mapping f(i) = gi(n) is an embedding of the ring onto the n-cube. • This claim is true for n = 1 as G1 = (0,1) and nodes 0 and 1 are connected on both the ring and 1-cube. • We assume above claim is true for a fixed n-1 with n1. • We use the neighbor-preserving embedding f of the vertices of a ring with 2n-1 nodes onto the (n-1)-cube to build a similar embedding f* of a ring with 2n vertices onto the n-cube. Ring Embedding for n-cube (cont) • Recall that gray code for Gn is the sequence 0Gn-1 followed by the sequence 1Gn-1rev • The n-cube consists of one copy of an (n-1)-cube with a 0 prefix and a second copy of an (n-1) cube with a 1 prefix. • By assumption that claim is true for n-1, the gray code sequence 0Gn provides the binary code for a ring of elements in the (n-1)-cube, with each successive element differing by 1 digit from its previous element. • Likewise, the gray code sequence 1Gn-1rev provides the binary code for a ring of elements in the second copy of an (n-1)-cube, with each successive element differing by on digit from its previous element.. • The last element of 0Gn-1 is identical to the first element of 1Gn-1rev except for the added digit, so these two elements also differ by one bit. 2D Torus Embedding for n-cube • We embed a 2r 2s torus onto an n-cube with n=r+s by using the cartesian product Gr Gs of two Gray codes. • A processor with coordinates (i,j) on the grid is mapped to the processor f(i,j) = (gi(r), gi(s) ) in the n-cube. • Recall the map f1(i) = gi(r), is an embedding of a 2r ring onto a r-cube “row” ring and f2(j) = gi(s) is an embedding of a 2s ring onto a s-cube “column” ring. 
• We identify (gi(r), gj(s)) with the node of the n = r+s cube whose first r bits are given by gi(r) and whose next s bits are given by gj(s).
• Then for a fixed j, f(i±1, j) are neighbors of f(i,j), since f1 is an embedding of a 2^r ring onto an r-cube.
• Likewise, for a fixed i, f(i, j±1) are neighbors of f(i,j), since f2 is an embedding of a 2^s ring onto an s-cube.

Collective Communications in the Hypercube
• Purpose: Gain an overview of the complexity of collective communications on the hypercube.
– We will focus on broadcast on the hypercube.
• Assume processor 0 wants to broadcast, and consider the naïve algorithm:
– Processor 0 sends the message to all of its neighbors
– Next, every neighbor sends the message to all of its neighbors.
– Etc.
• Redundancy in the naïve algorithm:
– The same processor receives the same message many times.
– E.g., processor 0 receives the message back from all of its neighbors.
– Mismatched SENDs and RECEIVEs may happen.

Improved Hypercube Broadcast
• We seek a strategy where
– Each processor receives the message only once
– The number of steps is minimal.
– We will use one or more spanning trees.
• Send and Receive need a parameter specifying the dimension on which the communication takes place
– SEND(cube_link, send_addr, m)
– RECEIVE(cube_link, send_addr, m)

Hypercube Broadcast Algorithm
• There are n steps, numbered from n-1 down to 0
• Each processor receives its message on the link corresponding to its rightmost 1
• A processor that has received the message forwards it on the links whose index is smaller than the position of its rightmost 1.
• At step i, every processor whose rightmost 1 is strictly larger than i forwards the message on link i.
• Let the broadcast originate with processor 0
– Assume 0 has a fictitious 1 at position n.
– This adds an additional digit, as nodes have binary digits for positions 0, 1, …, n-1.
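The schedule just described can be sketched as a small simulation; the function and helper names are mine, not the book's pseudocode. Processor 0's fictitious 1 at position n makes it forward at every step:

```python
def broadcast_schedule(n):
    """Per step, the (sender, receiver) pairs for a broadcast from node 0."""
    def rightmost_one(q):
        # position of the lowest set bit; node 0 gets the fictitious 1 at position n
        return n if q == 0 else (q & -q).bit_length() - 1

    holders, schedule = {0}, []
    for i in range(n - 1, -1, -1):            # steps n-1 down to 0
        sends = [(q, q ^ (1 << i))            # forward on link i
                 for q in sorted(holders) if rightmost_one(q) > i]
        holders |= {dst for _, dst in sends}
        schedule.append(sends)
    return schedule

sched = broadcast_schedule(4)   # sched[0] == [(0, 8)], i.e., 0000 -> 1000 on link 3
```

Each of the other 2^n - 1 processors appears exactly once as a receiver, and it receives on the link of its rightmost 1, which is the spanning-tree property the algorithm is built on.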
Trace of Hypercube Broadcast for n = 4
• Since the broadcast originates with processor 0, we assume its index is 10000 (with the fictitious leading 1)
– Processor 0 sends the broadcast message on link 3 to 1000
• Next, both processor 0000 and processor 1000 have their rightmost 1 in position at least 3, so both send the message along link 2
– The message goes to 0100 and 1100, respectively.
• This process continues until the last step, at which every even-numbered processor sends the message along its link 0.

Broadcasting Using a Spanning Tree Broadcast Algorithm
• Let BIT(A, b) denote the value of the bth bit of processor A.
• The algorithm for the broadcast of a message of length m by processor k is given in Algorithm 3.5
• Since there are n steps, the execution time with the store-and-forward model is n(L + m·b).
• This algorithm is valid for the 1-port model.
– At each step, a processor communicates with at most one other processor.

Broadcast Algorithm in Hypercube (cont)
Observations about the Algorithm Steps
2. Specifies that the algorithm action is for processor q
3. n is the number of binary digits used to label processors
4. pos = q XOR k, using "exclusive OR"
– The broadcast is from Pk
– This relabels the nodes so that the algorithm works as if P0 were the root of the broadcast
5. Steps 5-7 set "first-1" to the position of the first 1 in pos
Note: Steps 8-10 are the core of the algorithm
8. phase steps through the link dimensions, higher ones first
9. If the link dimension equals first-1, q receives the message on its first-1 link
– This is the step at which the message reaches q
10. q then sends the message along each dimension smaller than first-1, to the processors below it in the spanning tree
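The pos = q XOR k relabeling can be checked with a short sketch (names mine, not Algorithm 3.5's pseudocode): because XOR with k is an automorphism of the hypercube, the broadcast rooted at Pk behaves exactly like the root-0 broadcast traced above, and every other node still receives the message exactly once:

```python
def broadcast_links(n, k):
    """All (sender, receiver) links used when Pk broadcasts in the n-cube."""
    def rightmost_one(pos):
        # position of the lowest set bit; the root's relabeled index 0
        # gets the fictitious 1 at position n
        return n if pos == 0 else (pos & -pos).bit_length() - 1

    holders, links = {k}, []
    for i in range(n - 1, -1, -1):            # steps n-1 down to 0
        step = [(q, q ^ (1 << i)) for q in sorted(holders)
                if rightmost_one(q ^ k) > i]  # test the relabeled index pos
        links.extend(step)
        holders |= {dst for _, dst in step}
    return links

links = broadcast_links(3, 5)   # root 101 in the 3-cube: 7 links, 7 receivers
```

Every link used is a genuine hypercube edge (sender and receiver differ in one bit), and the 2^n - 1 receivers are exactly the non-root nodes, confirming the spanning-tree structure for an arbitrary root k.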