Asynchronous Interconnection Network and Communication
Chapter 3 of Casanova et al.
Interconnection Network
Topologies
• The processors in a distributed memory parallel
system are connected using an interconnection
network.
• All computers have specialized coprocessors that route messages and place data in local memories
– Nodes consist of a (computing) processor, a memory, and a communications coprocessor
– Nodes are often called processors, when not ambiguous.
Network Topology Types
• Static Topologies
– A fixed network that cannot be changed
– Nodes connected directly to each other by
point-to-point communications links
• Dynamic Topologies
– Topology can change at runtime
– One or more nodes can request direct
communication be established between them.
• Done using switches
Some Static Topologies
• Fully connected network (or clique)
• Ring
• Two-dimensional grid
• Torus
• Hypercube
• Fat tree
Examples of Interconnection Topologies
Static Topologies Features
• Fixed number of nodes
• Degree:
– Number of edges incident to a node
• Distance between nodes:
– Length of the shortest path between two nodes
• Diameter:
– Largest distance between any two nodes
• Number of links:
– Total number of edges
• Bisection Width:
– Minimum number of edges that must be removed to partition the network into two disconnected networks of the same size.
Classical Interconnection Networks Features
• Clique (or Fully Connected)
– All pairs of processors are directly connected
– p(p-1)/2 edges
• Ring
– Very simple and very useful topology
• 2D Grid
– Degree of interior processors is 4
– Not symmetric, as edge processors have different properties
– Very useful when computations are local and communications are between neighbors
– Has been heavily used in the past
Classical Networks (cont)
• 2D Torus
– Easily formed from the 2D mesh by connecting matching end points.
• Hypercube
– Has been used extensively
– Using its recursive definition, one can design simple but very efficient algorithms
– Has a small diameter that is logarithmic in the number of nodes
– Degree and total number of edges grow too quickly to be useful for massively parallel machines.
Dynamic Topologies
• The fat tree differs from the other networks included here
– The compute nodes are only at the leaves.
– Nodes at higher levels do not perform computation
– The topology is a binary tree – both in the 2D front view and in the side view.
– Provides extra bandwidth near the root.
– Used by Thinking Machines Corp. on the CM-5
• Crossbar Switch
– Has p^2 switches, which is very expensive for large p
– Can connect n processors to any permutation of n processors
– Cost rises with the number of switches, which is quadratic in the number of processors.
Dynamic Topologies (cont)
• Beneš and Omega Networks
– Use smaller crossbars arranged in stages
– Only crossbars in adjacent stages are connected together.
– Called multi-stage networks; cheaper to build than a full crossbar.
– Configuring a multi-stage network is more difficult than configuring a crossbar.
– Dynamic networks are now the most commonly used topologies.
A Simple Communications Performance Model
• Assume a processor Pi sends a message of length m to Pj.
– The cost to transfer a message along a network link is roughly linear in the message length.
– Hence the cost to transfer the message along a particular route is roughly linear in m.
• Let ci,j(m) denote the time to transfer this message.
Hockney Performance Model
for Communications
• The time ci,j(m) to transfer this message can be
modeled by
ci,j(m) = Li,j + m/Bi,j = Li,j + m·bi,j
– m is the size of the message
– Li,j is the startup time, also called the latency
– Bi,j is the bandwidth, in bytes per second
– bi,j = 1/Bi,j is the inverse of the bandwidth
• Proposed by Hockney in 1994 to evaluate the
performance of the Intel Paragon.
• Probably the most commonly used model.
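As a concrete illustration, here is a minimal Python sketch of the Hockney cost function; the function name and the example numbers are mine, not from the text:

```python
def hockney_time(m, L, B):
    # Hockney model: time to send an m-byte message over a link with
    # startup latency L (seconds) and bandwidth B (bytes/second).
    return L + m / B

# Example: 1 MB message, 10 microsecond latency, 1 GB/s bandwidth
print(hockney_time(1_000_000, 10e-6, 1e9))  # ~0.00101 s
```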
Hockney Performance Model (cont.)
• Factors that Li,j and Bi,j depend on
– Length of route
– Communication protocol used
– Communications software overhead
– Ability to use links in parallel
– Whether links are half or full duplex
– Etc.
Store and Forward Protocol
• SF is a point-to-point protocol
• Each intermediate node receives and stores the entire message before retransmitting it
• Implemented in the earliest parallel machines, in which nodes did not have communications coprocessors.
• Intermediate nodes are interrupted to handle messages and route them toward their destination.
Store and Forward Protocol (cont)
• If d(i,j) is the number of links between Pi and Pj, the formula for ci,j(m) can be rewritten as
ci,j(m) = d(i,j)(L + m/B) = d(i,j)L + d(i,j)·m·b
where
– L is the per-link latency and b is the reciprocal of the bandwidth of one link.
• This protocol produces poor latency and bandwidth.
• The communication cost can be reduced using pipelining.
Store and Forward Protocol using Pipelining
• The message is split into r packets of size m/r.
• The packets are sent one after another from Pi to Pj.
• The first packet reaches node j after ci,j(m/r) time units.
• The remaining r-1 packets arrive in (r-1)(L + mb/r) additional time units.
• Simplifying, the total communication time reduces to [d(i,j) - 1 + r][L + mb/r]
• Casanova et al. derive the optimal value of r for the expression above.
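A small sketch of this computation (helper names are mine); working out the derivative of [d - 1 + r][L + mb/r] with respect to r gives r* = sqrt((d-1)·m·b/L):

```python
import math

def sf_pipelined_time(d, m, L, b, r):
    # Pipelined store-and-forward over d links: r packets of size m/r.
    return (d - 1 + r) * (L + m * b / r)

def optimal_packets(d, m, L, b):
    # Zero of the derivative of the expression above with respect to r.
    return max(1.0, math.sqrt((d - 1) * m * b / L))
```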
Two Cut-Through Protocols
• Common performance model:
ci,j(m) = L + d(i,j)·δ + m/B
where
– L is the one-time cost of creating the message.
– δ is the routing management overhead per hop.
– Generally δ << L, as routing management is performed by hardware while L involves software overhead.
– m/B is the time required to transmit the message through the entire route.
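A one-line sketch under the same conventions (the parameter name delta is mine):

```python
def cut_through_time(d, m, L, B, delta):
    # Cut-through model: startup L, per-hop routing overhead delta,
    # and a single transmission time m/B for the whole route.
    return L + d * delta + m / B
```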
Circuit-Switching Protocol
• The first cut-through protocol
• A route is created before the first message is sent
• The message is sent directly to the destination through this route
– The nodes on this route cannot be used for any other communication while the transmission is in progress
Wormhole (WH) Protocol
• A second cut-through protocol
• The destination address is stored in the header of the message.
• Routing is performed dynamically at each node.
• The message is split into small packets called flits
• If two flits arrive at a node at the same time, flits are stored in the intermediate node's internal registers
Point-to-Point Communication Comparisons
• Store and Forward is not used in physical networks but only at the application level
• Cut-through protocols are more efficient
– Hide the distance between nodes
– Avoid large buffer requirements at intermediate nodes
– Almost no message loss
– For small networks, a flow-control mechanism is not needed
• Wormhole generally preferred to circuit
switching
– Latency is normally much lower
LogP Model
• Models based on the LogP model are more precise than the Hockney model
• Involves three components of communication – the sender, the network, and the receiver
– At times, some of these components may be busy while others are not.
• Some parameters for LogP
– m is the message size (in bytes)
– w is the size of the packets the message is split into
– L is an upper bound on the latency
– o is the overhead, defined to be the time that a node is engaged in the transmission or reception of a packet
LogP Model (cont)
• Parameters for LogP (cont)
– g or gap is the minimal time interval between
consecutive packet transmission or packet reception
• During this time, a node may not use the
communication coprocessor (i.e., network card)
– 1/g the communication bandwidth available per node
– P the number of nodes in the platform
• Cost of sending m bytes with packet size w:
c(m) = 2o + L + (⌈m/w⌉ - 1)·g
• Processor occupation time on the sender and on the receiver:
o + (⌈m/w⌉ - 1)·g
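A small Python sketch of these two formulas (function names are mine):

```python
import math

def logp_send_time(m, w, L, o, g):
    # LogP: two overheads, one latency, and a gap g between each of the
    # ceil(m/w) packet injections after the first one.
    packets = math.ceil(m / w)
    return 2 * o + L + (packets - 1) * g

def logp_occupation(m, w, o, g):
    # Time a node is busy sending (or receiving) the whole message.
    packets = math.ceil(m / w)
    return o + (packets - 1) * g
```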
Other LogP-Related Models
• LogP attempts to capture in a few parameters the characteristics of parallel platforms.
• Platforms are fine-tuned and may use different protocols for short and long messages
• LogGP is an extension of LogP where G captures the bandwidth for long messages
• pLogP is an extension of LogP where L, o, and g depend on the message size m.
– It also separates the sender overhead (os) from the receiver overhead (or).
Affine Models
• The use of the floor and ceiling functions in LogP-type models makes them nonlinear.
– Causes many problems in analytic and theoretical studies.
– Has resulted in the proposal of many fully linear models
• The time that Pi is busy sending a message is expressed as an affine function of the message size
– An affine function of m has the form f(m) = a·m + b, where a and b are constants. If b = 0, then f is a linear function.
• Similarly, the time Pj is busy receiving the message is expressed as an affine function of the message size
• We will postpone further coverage of affine models for the present
Modeling Concurrent Communications
• Multi-port model
– Assumes that communications are contention-free
and do not interfere with each other.
– A consequence is that a node may communicate with
an unlimited number of nodes without any
degradation in performance.
– Would require a clique interconnection network to
fully support.
– May simplify proofs that certain problems are hard
• If hard under ideal communications conditions, then
hard in general.
– Assumption not realistic - communication resources
are always limited.
– See Casanova text for additional information.
Concurrent Communications Models (2/5)
• Bounded Multi-port model
– Proposed by Hong and Prasanna
– For applications that use threads (e.g., on multicore architectures), the network link can be shared by several incoming and outgoing communications.
– The sum of the bandwidths allocated by the operating system to all communications cannot exceed the bandwidth of the network card.
– An unbounded number of communications can take place if they share the total available bandwidth.
– The model also defines the bandwidth allotted to each communication
– Bandwidth sharing by the application is unusual, as it is usually handled by the operating system.
Concurrent Communications Models (3/5)
• 1-port (unidirectional or half-duplex) model
– Avoids unrealistically optimistic assumptions
– Forbids concurrent communication at a node.
– A node can either send data or receive it, but not simultaneously.
– This model is rather pessimistic, as real-world platforms can achieve some communication concurrency.
– The model is simple, and it is easy to design algorithms that follow it.
Concurrent Communications Models (4/5)
• 1-port (bidirectional or full-duplex) model
– Currently, most network cards are full-duplex.
– Allows a single emission and a single reception simultaneously.
– Introduced by Bhat et al.
• Current hardware does not easily enable multiple messages to be transmitted simultaneously.
• Multiple sends and receives are claimed to be eventually serialized by the single hardware port to the network.
• Saif and Parashar report experimental work suggesting that asynchronous sends become serialized once message sizes exceed a few megabytes.
Concurrent Communications Models (5/5)
• k-ports model
– A node may have k>1 network cards
– This model allows a node to be involved in a
maximum of one emission and one reception on each
network card.
– This model is used in Chapters 4 & 5.
Bandwidth Sharing
• The previous concurrent communication models only consider contention at the nodes
• Other parts of the network can also limit performance
• It may be useful to determine constraints on each network link
• This type of network model is useful for performance-evaluation purposes, but is too complicated for algorithm design.
• The Casanova text evaluates algorithms using two models:
– the Hockney model, or even simplified versions (e.g., assuming no latency)
– the multi-port model (ignoring contention) or the 1-port model.
Case Study: Unidirectional Ring
• We first consider the platform of p processors
arranged in a unidirectional ring.
• Processors are denoted Pk for k = 0, 1, … , p-1.
• Each PE can find its logical index by calling
My_Num().
Unidirectional Ring Basics
• A processor can determine the number of PEs by calling NumProcs()
– Both of the preceding functions are supported in MPI, a message-passing library implemented on most asynchronous systems.
• Each processor has its own memory
• All processors execute the same program, which acts on the data in their local memories
– Single Program, Multiple Data (SPMD)
• Processors communicate by message passing
– explicitly sending and receiving messages.
Unidirectional Ring Basics (cont – 2/5)
• A processor sends a message using the function send(addr, m)
– addr is the memory address (in the sending process) of the first data item to be sent.
– m is the message length (i.e., the number of items to be sent)
• A processor receives a message using the function receive(addr, m)
– addr is the local address in the receiving processor where the first data item is to be stored.
– If processor Pi executes a receive, then its predecessor (P(i-1) mod p) must execute a send.
– Since each processor has a unique predecessor and successor, they do not have to be specified
Unidirectional Ring Basics (cont – 3/5)
• A restrictive assumption is that both the send and the receive are blocking.
– Then the participating processes cannot continue until the communication is complete.
– The blocking assumption is typical of 1st-generation platforms
• A classical assumption is to keep the receive blocking but to allow the send to be non-blocking
– The processor executing a send can continue while the data transfer takes place.
– To implement this, one function is used to initiate the send and another to determine when the communication has finished.
Unidirectional Ring Basics (cont – 4/5)
• In algorithms, we simply indicate the blocking
and non-blocking operations
• More recent proposed assumption is that a
single processor can send data, receive data,
and compute simultaneously.
– All three can occur only if no race condition exists.
– Convenient to think of three logical threads of control
running on a processor
• One for computing
• One for sending data
• One for receiving data
• We will usually use the less restrictive third
assumption
Unidirectional Ring Basics (cont – 5/5)
• Timings for Send/Receive
– We use a simplified version of the Hockney model
– The time to send or receive over one link is c(m) = L + mb
• m is the length of the message
• L is the startup cost in seconds, due to the physical latency and the software overhead
• b is the inverse of the data transfer rate.
The Broadcast Operation
• The broadcast operation allows a processor Pk to send the same message of length m to all other processors.
• At the beginning of the broadcast operation, the
message is stored at the address addr in the
memory of the sending process, Pk.
• At the end of the broadcast, the message will be
stored at address addr in the memory of all
processors.
• All processors must call the following function
Broadcast(k, addr, m)
Broadcast Algorithm Overview
• The message goes around the ring from processor to processor - from Pk to Pk+1 to Pk+2 to … to Pk-1.
• We assume the processor numbers are taken modulo p, where p is the number of processors. For example, if k=0 and p=8, then k-1 = p-1 = 7.
• Note there is no parallelism in this algorithm, since the message advances around the ring by only one processor per round.
• The predecessor of Pk (i.e., Pk-1) does not send the message back to Pk.
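A hedged SPMD sketch of this ring broadcast, written with mpi4py (an assumption on my part; the textbook's Algorithm 3.1, using the send()/receive() primitives above, is the authoritative version):

```python
from mpi4py import MPI  # assumes an MPI runtime; launch with mpiexec

def ring_broadcast(k, message):
    # Every processor calls this routine (SPMD); the slides' send() and
    # receive() become point-to-point calls to successor/predecessor.
    comm = MPI.COMM_WORLD
    q, p = comm.Get_rank(), comm.Get_size()
    succ, pred = (q + 1) % p, (q - 1) % p
    if q == k:
        comm.send(message, dest=succ)      # originator injects the message
    else:
        message = comm.recv(source=pred)   # receive from predecessor
        if q != (k - 1) % p:               # last processor does not forward
            comm.send(message, dest=succ)
    return message
```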
Analysis of Broadcast Algorithm
• For the algorithm to be correct, the "receive" in Step 10 must execute before the send in Step 11.
• Running Time:
– Since we have a sequence of p-1 communications, the time to broadcast a message of length m is (p-1)(L+mb)
• MPI does not typically use a ring topology for its communication primitives
– It instead uses various tree topologies that are more efficient on modern parallel platforms.
– However, these primitives are simpler on a ring.
– This prepares readers to implement their own primitives when doing so is more efficient than using the MPI primitives.
Scatter Algorithm
• The scatter operation allows Pk to send a different message of length m to each processor.
• Initially, Pk holds the message to be sent to Pq at location addr[q].
• To keep the array of addresses uniform, space for a message from Pk to itself is also provided.
• At the end of the algorithm, each processor stores its message from Pk at location msg.
• The efficient way to implement this algorithm is to pipeline the messages (see the sketch after the discussion slides below).
– The message to the most distant processor (i.e., Pk-1) is followed by the message to processor Pk-2, and so on.
Discussion of Scatter Algorithm
• In Steps 5-6, Pk successively sends messages to the other p-1 processors in order of their distance from Pk.
• In Step 7, Pk stores its own message to itself.
• The other processors concurrently move messages along as they arrive in Steps 9-12.
• Each processor uses two buffers with addresses tempS and tempR.
– This allows a processor to send one message and receive the next message in parallel in Step 12.
Discussion of Scatter Algorithm (cont)
• In Step 11, tempS ↔ tempR means the two addresses are swapped so the received value can be sent on to the next processor.
• When a processor receives its own message from Pk, it stops forwarding (Step 10).
• Whatever is in the receive buffer, tempR, at the end is stored as its message from Pk (Step 13).
• The running time of the scatter algorithm is the same as for the broadcast, namely
(p-1)(L+mb)
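A hedged mpi4py sketch of the pipelined scatter (again my rendering, not the textbook's Algorithm 3.2; sendrecv plays the role of the parallel send/receive with the tempS/tempR buffer swap):

```python
from mpi4py import MPI  # assumes an MPI runtime; launch with mpiexec

def ring_scatter(k, pieces):
    # pieces matters only on the root k: pieces[q] is the message for
    # processor q. The root sends most-distant-destination first
    # (Steps 5-6); the others forward pieces while receiving the next
    # one (Steps 9-13), so the last piece a processor receives is its own.
    comm = MPI.COMM_WORLD
    q, p = comm.Get_rank(), comm.Get_size()
    succ, pred = (q + 1) % p, (q - 1) % p
    if q == k:
        for i in range(1, p):
            comm.send(pieces[(k + p - i) % p], dest=succ)
        return pieces[k]                 # Step 7: keep own piece
    mine = comm.recv(source=pred)        # Step 9: first piece to arrive
    for _ in range((k - 1 - q) % p):
        # Forward the current piece while receiving the next one.
        mine = comm.sendrecv(mine, dest=succ, source=pred)
    return mine                          # Step 13: own piece arrives last
```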
Example for Scatter Algorithm
• Example: In Figure 3.7, let p=6 and k=4.
• Steps 5-6: For i = 1 to p-1 do send(addr[(k+p-i) mod p], m)
– Let PE = (k+p-i) mod p = (10 - i) mod 6
– For i=1, PE = 9 mod 6 = 3
– For i=2, PE = 8 mod 6 = 2
– For i=3, PE = 7 mod 6 = 1
– For i=4, PE = 6 mod 6 = 0
– For i=5, PE = 5 mod 6 = 5
• Note the messages are sent to the processors in the order 3, 2, 1, 0, 5
– That is, messages to the most distant processors are sent first.
Example for Scatter Algorithm (cont)
• Example: In Figure 3.7, let p=6 and k=4.
• Step 10: For i = 1 to (k-1-q) mod p do
• Compute: (k-1-q) mod p = (3-q) mod 6 for all q.
• Note: q ≠ k, which is 4
– q = 5 ⇒ i = 1 to 4, since (3-5) mod 6 = 4
• PE 5 forwards values in the loop from i = 1 to 4
– q = 0 ⇒ i = 1 to 3, since (3-0) mod 6 = 3
• PE 0 forwards values from i = 1 to 3
– q = 1 ⇒ i = 1 to 2, since (3-1) mod 6 = 2
• PE 1 forwards values from i = 1 to 2
– q = 2 ⇒ i = 1 to 1, since (3-2) mod 6 = 1
• PE 2 is active in the loop only when i = 1
Example for Scatter Algorithm (cont)
– q = 3 ⇒ i = 1 to 0, since (3-3) mod 6 = 0
• PE 3 precedes PE k, so it never forwards a value
• However, it receives and stores a message in Step 9
• Note that in Step 9, all processors store the first message they receive.
– That means even processor k-1 receives a value to store.
All-to-All Algorithm
• This command allows all p processors to
simultaneously broadcast a message (to all PEs)
• Again, it is assumed all messages have length m
• At the beginning, each processor holds the
message it wishes to broadcast at address
my_message.
• At the end, each processor will hold an array addr of p messages, where addr[k] holds the message from Pk.
• Using pipelining, the running time is the same as
for a single broadcast, namely
(p-1)(L+mb)
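A hedged mpi4py sketch of the pipelined all-to-all broadcast (my rendering of the idea; the index arithmetic is mine):

```python
from mpi4py import MPI  # assumes an MPI runtime; launch with mpiexec

def ring_all_to_all(my_message):
    # In round i, every processor forwards the piece it received in
    # round i-1 while receiving its predecessor's next piece.
    comm = MPI.COMM_WORLD
    q, p = comm.Get_rank(), comm.Get_size()
    succ, pred = (q + 1) % p, (q - 1) % p
    addr = [None] * p
    addr[q] = my_message
    for i in range(1, p):
        addr[(q - i) % p] = comm.sendrecv(
            addr[(q - i + 1) % p], dest=succ, source=pred)
    return addr  # addr[k] now holds the message broadcast by Pk
```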
Gossip Algorithm
• Last of the classical collection of communication operations
• Each processor sends a different message to each other processor.
• The gossip algorithm is Problem 3.7 in the textbook.
– Note it takes 1 step for each PE to send a message to its closest neighbor, using all links.
– It takes 2 steps for each PE to send a message to its 2nd-closest neighbor, using all links.
– In general, it takes 1 + 2 + … + (p-1) = p(p-1)/2 steps for each PE to send messages to all other nodes, using all links of the network during each step.
– The complexity is therefore p(p-1)/2 = O(p^2).
Pipelined Broadcast by kth processor
• Longer messages can be broadcast faster if they
are broken into smaller pieces
• Suppose they are broken into r pieces of the same
length.
• The sender sends the pieces out in order, and has
them travel simultaneously on the ring.
• Initially, the pieces are stored at addresses addr[0],
addr[1], … , addr[r-1].
• At the end, all pieces are stored in all processors.
• At each step, when a processor receives a message
piece, it also forwards the piece it previously
received, if any, to its successor.
Pipelined Broadcast (cont)
• It takes p-1 communication steps for the first piece to reach the last processor, Pk-1.
• It then takes r-1 additional steps for the remaining pieces to reach Pk-1.
• The required time is (p + r - 2)(L + mb/r)
• The value of r that minimizes this expression can be found by setting its derivative (with respect to r) to zero and solving for r.
• For large m, the time required tends to mb
– It does not depend on p.
• This compares well to the plain broadcast time: (p-1)(L+mb)
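A small sketch of this optimization (helper names are mine); setting the derivative of (p + r - 2)(L + mb/r) to zero gives r* = sqrt(m·b·(p-2)/L):

```python
import math

def pipelined_broadcast_time(p, m, L, b, r):
    # Ring broadcast of m bytes split into r equal pieces.
    return (p + r - 2) * (L + m * b / r)

def optimal_pieces(p, m, L, b):
    # Zero of the derivative with respect to r; at least one piece.
    return max(1.0, math.sqrt(m * b * (p - 2) / L))
```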
Hypercube
• Definition: A 0-cube consists of a single vertex. For n > 0, an n-cube consists of two identical (n-1)-cubes with edges added to join matching pairs of vertices in the two (n-1)-cubes.
Hypercubes (cont)
• Equivalent definition: An n-cube is a graph consisting of 2^n vertices numbered 0 to 2^n - 1 such that two vertices are connected if and only if their binary representations differ in a single bit.
• Property: The diameter and degree of an n-cube are both equal to n.
• The proof is left to the reader. It is easy using recursion.
• Hamming Distance: Let A and B be two points in an n-cube. H(A,B) is the number of bits that differ between the binary labels of A and B.
• Notation: If b is a binary bit, then let b' = 1 - b denote the bit complement of b.
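A tiny Python sketch of these definitions (function names are mine):

```python
def hamming(a, b):
    # Number of bit positions in which the labels a and b differ.
    return bin(a ^ b).count("1")

def are_neighbors(a, b):
    # Two n-cube vertices are adjacent iff they differ in exactly one bit.
    return hamming(a, b) == 1

assert hamming(0b10111, 0b01110) == 3
assert are_neighbors(0b10111, 0b10110)
```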
Hypercube Paths
• Using the binary representation, let
– A = an-1 an-2 … a2 a1 a0
– B = bn-1 bn-2 … b2 b1 b0
• WLOG, assume that A and B differ exactly in their last k bits.
– Having the differing bits at the end makes numbering them easier
• Then a path from A to B can be created by the following sequence of nodes (a prime marks a complemented bit):
– A = an-1 an-2 … a2 a1 a0
– Vertex 1 = an-1 an-2 … a2 a1 a0'
– Vertex 2 = an-1 an-2 … a2 a1' a0'
– ……
– B = Vertex k = an-1 an-2 … ak ak-1' … a2' a1' a0'
Hypercube Paths (cont)
• Independent of which bits of A and B agree,
there are k choices for first bit to flip, (k-1)
choices for next bit to flip, etc.
– This gives k! different paths from A to B.
• How many independent paths exist from A to B?
– I.e., paths with only A and B as common vertices.
Theorem: If A and B are n-cube vertices that differ
in k bits, then there exist exactly k independent
paths from A to B.
Proof: First, we show k independent paths exist.
• We build an independent path for each j with
0j<k
Hypercube Paths (cont)
• Let P(j, j-1, j-2, … , 0, k-1, k-2, … , j+1) denote the path from A to B that complements bit j first, then bits j-1, …, 0, and finally bits k-1, k-2, …, j+1, i.e., the following sequence of nodes (primes mark complemented bits):
– A = an-1 … ak ak-1 … aj+1 aj aj-1 … a1 a0
– V(1) = an-1 … ak ak-1 … aj+1 aj' aj-1 … a1 a0
– V(2) = an-1 … ak ak-1 … aj+1 aj' aj-1' aj-2 … a1 a0
– ……
– V(j+1) = an-1 … ak ak-1 … aj+1 aj' aj-1' … a1' a0'
– V(j+2) = an-1 … ak ak-1' ak-2 … aj+1 aj' aj-1' … a1' a0'
– ……
– B = V(k) = an-1 … ak ak-1' ak-2' … aj+1' aj' aj-1' … a1' a0'
Hypercube Pathways (cont)
• Suppose the following two paths have a common vertex X other than A and B:
P(j, j-1, j-2, … , 0, k-1, k-2, … , j+1)
P(t, t-1, t-2, … , 0, k-1, k-2, … , t+1)
– Since the paths are different and A and B differ in k bits, we may assume 0 ≤ t < j < k
• Let A and X differ in q bits
• To travel from A to X along either path, exactly the first q bits of that path's circular flip sequence must have been complemented:
– 1st path: j, j-1, …, 0, k-1, k-2, …, j+1
– 2nd path: t, t-1, …, 0, k-1, k-2, …, t+1
• This is impossible, as the first q entries of these two sequences cannot be exactly the same set of bits, since t ≠ j.
Hypercube Paths (cont)
• Finally, there cannot be another independent path Q from A to B (i.e., one with no other common vertex)
– If such a path existed, the first node in the path following A would have to flip one bit, say bit q, to move toward B.
– But then the path described earlier that flips bit q first would have a common interior vertex with this path Q.
Hypercube Routing
• XOR is the exclusive OR: the result bit is 1 exactly when the two input bits differ.
• To design a route from A to B in the n-cube, we use the algorithm that always flips the rightmost bit that disagrees with B.
• The XOR of the binary representations of A and B indicates by "1"s the bits that have to be flipped.
• For the 5-cube, if A is 10111 and B is 01110, then
– A XOR B is 11001, and the routing is as follows:
A = 10111 → 10110 → 11110 → 01110 = B
• This algorithm can be executed as follows:
– A XOR B = 10111 XOR 01110 = 11001, so A routes the message along link 1 (the rightmost link) to node A1 = 10110
– A1 XOR B = 10110 XOR 01110 = 11000, so node A1 routes the message along link 4 to node A2 = 11110
– A2 XOR B = 11110 XOR 01110 = 10000, so A2 routes the message along link 5 to B
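A hedged Python sketch of this rightmost-bit-first routing (the function name is mine):

```python
def route(a, b):
    # Repeatedly flip the rightmost bit of the current node that
    # disagrees with the destination b, recording each hop.
    path = [a]
    while a != b:
        diff = a ^ b            # 1-bits mark positions still to fix
        a ^= diff & -diff       # flip the rightmost differing bit
        path.append(a)
    return path

# The example from the text: 10111 -> 10110 -> 11110 -> 01110
assert route(0b10111, 0b01110) == [0b10111, 0b10110, 0b11110, 0b01110]
```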
Hypercube Routing (cont)
• This routing algorithm can be used to implement a wormhole or cut-through protocol in hardware.
• Problem: If another pair of processors has already reserved a link on the desired path, the message may stall until the end of the other communication.
• Solution: Since there are multiple paths, the routers select links based on a link-reservation table and message labels.
– In our example, if link 1 at node A is busy, then link 4 can be used instead to forward the message to node 11111, which is on another path to B.
– If at some point the current vertex determines that no useful link is available, it will have to wait for a useful link to become available
– Alternatively, if the desired links are not available, the algorithm could use a link that extends the path length
Gray Code
• Recursive construction of the Gray code:
– G1 = (0, 1) and has 2^1 = 2 elements
– G2 = (00, 01, 11, 10) and has 2^2 = 4 elements
– G3 = (000, 001, 011, 010, 110, 111, 101, 100) and has 2^3 = 8 elements
– etc.
• The Gray code for dimension n ≥ 1 is denoted Gn and is defined recursively: G1 = (0, 1), and for n > 1, Gn is the sequence 0Gn-1 followed by the sequence 1Gn-1^rev, where
– xG is the sequence obtained by prefixing every element of G with x
– G^rev is the sequence obtained by listing the elements of G in reverse order
Gray Code (cont.)
– Since Gn-1 has 2^(n-1) elements and Gn consists of exactly two copies of Gn-1, Gn has 2^n elements.
• Summary: The Gray code Gn is an ordered sequence of all 2^n binary codes with n digits whose successive values differ from each other by exactly one bit.
• Notation: Let gi(r) denote the ith element of the Gray code of dimension r.
• Observation: The Gray code Gn = (g1(n), g2(n), …, g2^n(n)) forms an ordered sequence of names for all of the nodes in a ring with 2^n nodes.
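A short Python sketch of this recursive construction (written by direct analogy with the definition above):

```python
def gray_code(n):
    # Gn = 0*G(n-1) followed by 1*G(n-1) in reverse order.
    if n == 1:
        return ["0", "1"]
    prev = gray_code(n - 1)
    return ["0" + g for g in prev] + ["1" + g for g in reversed(prev)]

codes = gray_code(3)
assert codes == ["000", "001", "011", "010", "110", "111", "101", "100"]
# Cyclically successive codes differ in exactly one bit:
assert all(sum(x != y for x, y in zip(codes[i], codes[(i + 1) % 8])) == 1
           for i in range(8))
```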
Embeddings
• Definition: An embedding of a topology (e.g., ring, 2D mesh, etc.) into an n-cube is a 1-1 function f from the vertices of the topology into the n-cube.
– An embedding is said to preserve locality if the images of any two neighbors are also neighbors in the n-cube
– If an embedding does not preserve locality, then we try to minimize the distance in the hypercube between the images of neighbors.
– An embedding is said to be onto if the range of the embedding function f is the entire n-cube.
A 2^n Ring Embedding onto the n-cube
Theorem: There is an embedding of a ring with 2^n vertices onto an n-cube that preserves locality.
Proof:
• Our construction of the Gray code provides an ordered sequence of binary values with n digits that can be used as names for the nodes of the ring with 2^n vertices
– The first name (e.g., 00…0) can be used to name any node in the ring.
– The Gray code names are then given to successive ring nodes, in clockwise or counter-clockwise order
Embedding Ring onto n-cube (cont)
• The Gray code binary numbers are identical to the names assigned to the hypercube nodes.
– Two successive n-cube nodes are connected, since they differ by only one binary digit
• So the embedding is a result of the Gray code providing an ordered sequence of n-cube names that is used to name the successive nodes of the 2^n ring.
• This concludes the proof that this embedding follows "from the construction of the Gray code".
• However, the following formal proof provides more details and an "after construction" argument.
A 2^n Ring Embedding onto the n-cube
• The following optional formal proof is included for those who do not find the preceding argument convincing:
Theorem: There is an embedding of a ring with 2^n vertices onto an n-cube that preserves locality.
Proof:
• We establish the following claim: The mapping f(i) = gi(n) is an embedding of the ring onto the n-cube.
• This claim is true for n = 1, as G1 = (0,1) and nodes 0 and 1 are connected on both the ring and the 1-cube.
• We assume the above claim is true for a fixed n-1, with n > 1.
• We use the neighbor-preserving embedding f of the vertices of a ring with 2^(n-1) nodes onto the (n-1)-cube to build a similar embedding f* of a ring with 2^n vertices onto the n-cube.
Ring Embedding for n-cube (cont)
• Recall that the Gray code Gn is the sequence 0Gn-1 followed by the sequence 1Gn-1^rev
• The n-cube consists of one copy of an (n-1)-cube with a 0 prefix and a second copy of an (n-1)-cube with a 1 prefix.
• By the assumption that the claim is true for n-1, the Gray code sequence 0Gn-1 provides the binary codes for a ring of elements in the first (n-1)-cube, with each successive element differing by one digit from the previous element.
• Likewise, the Gray code sequence 1Gn-1^rev provides the binary codes for a ring of elements in the second copy of the (n-1)-cube, with each successive element differing by one digit from the previous element.
• The last element of 0Gn-1 is identical to the first element of 1Gn-1^rev except for the added prefix digit, so these two elements also differ by one bit.
2D Torus Embedding for n-cube
• We embed a 2^r × 2^s torus onto an n-cube with n = r + s by using the Cartesian product Gr × Gs of two Gray codes.
• A processor with coordinates (i,j) on the grid is mapped to the processor f(i,j) = (gi(r), gj(s)) in the n-cube.
• Recall that the map f1(i) = gi(r) is an embedding of a 2^r ring onto an r-cube "row" ring, and f2(j) = gj(s) is an embedding of a 2^s ring onto an s-cube "column" ring.
• We identify (gi(r), gj(s)) with the node of the n = r + s cube whose first r bits are given by gi(r) and whose next s bits are given by gj(s).
• Then for a fixed j, the nodes f(i±1, j) are neighbors of f(i,j), since f1 is an embedding of a 2^r ring onto an r-cube.
• Likewise, for a fixed i, the nodes f(i, j±1) are neighbors of f(i,j), since f2 is an embedding of a 2^s ring onto an s-cube.
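A small sketch of this mapping, reusing the gray_code helper sketched earlier (the function name is mine):

```python
def torus_to_hypercube(i, j, r, s):
    # Map node (i, j) of a 2^r x 2^s torus to an (r+s)-cube label by
    # concatenating the Gray codes of the two coordinates.
    return gray_code(r)[i] + gray_code(s)[j]

# Neighbors on the torus map to hypercube neighbors:
assert torus_to_hypercube(0, 0, 2, 2) == "0000"
assert torus_to_hypercube(1, 0, 2, 2) == "0100"  # differs in one bit
```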
Collective Communications in Hypercube
• Purpose: Gain an overview of the complexity of collective communications on the hypercube.
– We will focus on broadcast on the hypercube.
• Assume processor 0 wants to broadcast, and consider the naïve algorithm:
– Processor 0 sends the message to all of its neighbors
– Next, every neighbor sends the message to all of its neighbors.
– Etc.
• Redundancy in the naïve algorithm
– The same processor receives the same message many times.
– E.g., processor 0 receives the message back from all of its neighbors.
– Mismatched SENDs and RECEIVEs may happen.
Improved Hypercube Broadcast
• We seek a strategy where
– Each processor receives the message only once
– The number of steps is minimal.
– Will use one or more spanning trees.
• Send and Receive will need parameter of which
dimension the communication takes place
– SEND( cube_link, send_addr, m )
– RECEIVE(cube_link, send_addr, m )
Hypercube Broadcast Algorithm
• There are n steps, numbered from n-1 down to 0
• Each processor receives its msg on the link corresponding to its rightmost 1
• A processor that has received the msg forwards it on the links whose indices are smaller than that of its rightmost 1.
• At step i, every processor whose rightmost 1 is in a position strictly larger than i forwards the msg on link i.
• Suppose the broadcast originates at processor 0
– Assume 0 has a fictitious 1 at position n.
– This adds an additional digit, as processor labels have binary digits for positions 0, 1, …, n-1.
Trace of Hypercube Broadcast for n=4
• Since the broadcast originates at processor 0, we treat its index as 10000
– Processor 0 sends the broadcast msg on link 3 to 1000
• Next, both processor 0000 and processor 1000 have their rightmost 1 in a position of at least three, so both send the message along link 2
– The msg goes to 0100 and 1100, respectively.
• This process continues until the last step, at which every even-numbered processor sends the msg along its link 0.
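A hedged Python simulation of this schedule from processor 0 (the data structures and names are mine; it only tracks which processors hold the message after each step):

```python
def hypercube_broadcast(n):
    # Processor 0 starts with the message and is treated as having a
    # fictitious 1 in position n.
    have = {0}
    def rightmost_one(q):
        return (q & -q).bit_length() - 1 if q else n
    for i in range(n - 1, -1, -1):           # steps n-1 down to 0
        senders = [q for q in have if rightmost_one(q) > i]
        for q in senders:
            have.add(q ^ (1 << i))           # forward along link i
    return sorted(have)

assert hypercube_broadcast(4) == list(range(16))  # all 16 nodes in 4 steps
```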
Broadcasting using Spanning Tree
Broadcast Algorithm
• Let BIT(A,b) denote the value of the bth bit of
processor A.
• The algorithm for the broadcast of a message of
length m by processor k is given in Algorithm 3.5
• Since there are n steps, the execution time with
the store-and-forward model is n(L+mb).
• This algorithm is valid for the 1-port model.
– At each step, a processor communicates with at most
one other processor.
Broadcast Algorithm in Hypercube (cont)
Observations about Algorithm Steps
2. Specifies that the algorithm action is for processor q.
3. n is the number of binary digits used to label the processors.
4. pos = q XOR k
– uses the "exclusive OR".
– The broadcast is from Pk.
– The XOR relabels the processors so the algorithm can proceed as if P0 were the root of the broadcast.
5. Steps 5-7 set "first-1" to the location of the first "1" in pos.
Note: Steps 8-10 are the core of the algorithm.
8. phase steps through the link dimensions, higher ones first.
9. When the link dimension equals first-1, q receives the message on its first-1 link.
10. q then sends the message along each dimension smaller than first-1 to the processors below it in the spanning tree.