TITLE
Broadcast
BYLINE
Jesper Larsson Träff
Department of Scientific Computing
University of Vienna
A-1090 Vienna
Austria
Robert A. van de Geijn
Department of Computer Science
The University of Texas at Austin
Austin, TX
USA
Synonym
One-to-all broadcasting
Parallel copy
Definition
Among a set of processors, a designated root processor has a data item that
needs to be communicated to all other processors. The broadcast operation
performs this collective communication.
Discussion
For the broadcast operation it is generally assumed that upon execution
all processors involved in the broadcast know the identity of the designated
root processor that is initially in possession of the data. It is also typically
assumed that the amount of data to be broadcast is known at that time.
Let p be the number of processors, numbered consecutively from 0 to p − 1.
Let processor r be the root processor that has a data item x of size n to be
communicated to the remaining p − 1 processors.
Lower bounds
A convenient model for the cost of sending a message between two nodes
in a distributed memory architecture is α + nβ, where α is the startup cost
(latency) and β is the cost per item transferred. If one assumes that a node
can send to and receive from only one other node at any given time, two lower
bounds for the broadcast can be easily justified, the first for the α term and
the second for the β term.
• ⌈log2(p)⌉α. Define a round of communication as a period during which
each node can send at most one message and receive at most one message.
In each round, the number of nodes that own message x can at most double.
Thus, a minimum of ⌈log2(p)⌉ rounds is needed to broadcast the message.
Each round costs at least α.
• nβ. If p > 1 then the message must leave the root node, requiring a
time of at least nβ.
Tree-based broadcast algorithms
The best known algorithm for broadcasting is the Minimum Spanning Tree
(MST) algorithm. Assume again that each node can send only one message
and receive only one message at a given time. The following algorithm (also
given in the entry on Collective Communication) broadcasts the message
from the root to all other nodes:
• Partition the processors into two (roughly equal) subsets.
• Send x from the root to a processor (the destination) in the subset that
does not include the root.
• Recursively broadcast x from the root and this destination processor
in their respective subsets of processors.
Under the stated assumptions the total cost of this algorithm is ⌈log2 p⌉(α +
nβ). We note that it achieves the lower bound for the α term but not for
the β term.
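To make the recursion concrete, a minimal sketch in MPI-style C follows (not the implementation of any particular library; the function name mst_bcast, the message tag, and the choice of split point and of the representative destination in the other half are illustrative assumptions):

```c
/* Sketch of the MST broadcast recursion described above. */
#include <mpi.h>

static void mst_bcast(void *buf, int count, MPI_Datatype type,
                      int root, int first, int last, MPI_Comm comm)
{
    if (first == last) return;            /* a single processor: done */

    int me;
    MPI_Comm_rank(comm, &me);

    int mid = first + (last - first) / 2; /* halves [first,mid] and [mid+1,last] */
    int other_root;                       /* representative of the other half */
    if (root <= mid) other_root = last;
    else             other_root = first;

    if (me == root)
        MPI_Send(buf, count, type, other_root, 0, comm);
    else if (me == other_root)
        MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);

    /* Recurse within the half that contains this processor. */
    if (me <= mid)
        mst_bcast(buf, count, type, (root <= mid) ? root : other_root,
                  first, mid, comm);
    else
        mst_bcast(buf, count, type, (root > mid) ? root : other_root,
                  mid + 1, last, comm);
}
```

All p processors call the routine collectively, e.g. as mst_bcast(x, n, MPI_BYTE, r, 0, p − 1, comm); in practice one would simply call MPI_Bcast and leave the algorithm choice to the library.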
Pipelining
A discussion of pipelined broadcast algorithms starts by assuming that the
root is node 0 and that communication will be from node i to node i + 1,
i = 0, . . . , p − 2. The original message, of length n, is partitioned into k
packets of length n/k each. In the first round of the algorithm the first packet
is sent from node 0 to node 1. In the second round this packet is forwarded
to node 2 while the second packet is sent to node 1. In this fashion a pipeline
is established that communicates the packets to all nodes.
The cost can be analyzed as follows: the first packet arrives at node p − 1
after p − 1 rounds. After this, another packet arrives in every round, so that
the process completes after (p − 1) + (k − 1) = p + k − 2 rounds. Since each
round costs α + (n/k)β, the total cost for the broadcast is
(p + k − 2)α + ((p + k − 2)/k) nβ.
The optimal k can be determined by differentiating and setting the result to
zero, k_opt = √((p − 2)nβ/α), taking into account that a packet must be a
multiple of a byte. If α = 0 (no startup cost) then the cost is (p + n − 2)β
if the message is pipelined one item at a time (k = n). If n is much greater
than p, this is close to the lower bound of nβ.
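A minimal sketch of this pipelined path broadcast, again in MPI-style C, is given below. It uses blocking point-to-point calls, so the overlap between forwarding packet j and receiving packet j + 1 is left to the MPI library; a production version would use nonblocking communication. The function name, the byte-granular packets, and the use of the packet index as message tag are illustrative assumptions.

```c
/* Sketch of the pipelined broadcast along the chain 0 -> 1 -> ... -> p-1. */
#include <mpi.h>

static void pipelined_bcast(char *buf, int n, int k, MPI_Comm comm)
{
    int me, p;
    MPI_Comm_rank(comm, &me);
    MPI_Comm_size(comm, &p);

    int packet = (n + k - 1) / k;            /* roughly n/k bytes per packet */
    for (int j = 0; j < k; j++) {
        char *chunk = buf + j * packet;
        int len = (j * packet + packet <= n) ? packet : n - j * packet;
        if (len <= 0) break;                 /* no data left for this packet */
        if (me > 0)                          /* receive packet j from the left */
            MPI_Recv(chunk, len, MPI_BYTE, me - 1, j, comm, MPI_STATUS_IGNORE);
        if (me < p - 1)                      /* forward packet j to the right */
            MPI_Send(chunk, len, MPI_BYTE, me + 1, j, comm);
    }
}
```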
The problem is that α and/or p are typically large for current distributed
memory architectures. Moreover, the broadcast is often used to send short
messages. Thus, it becomes important to reduce the length of the longest
path by combining pipelining with a tree algorithm. The best example of
this is the Edge-Disjoint Spanning Tree (EDST) algorithm by Johnsson and
Ho [13]. The idea behind that algorithm is that on a hypercube or
fully connected architecture with a power of two number of nodes one can
embed log2 (p) trees and alternate pipelining part of the message within each
of those trees. The effective longest path becomes log2 (p) + 1. Thus, even
though each node can only send and receive one message during a round,
the cost of this pipelined algorithm is (log2(p) + 1)(α + (n/k)β), where k can be
chosen optimally: k_opt = √((log2(p) − 1)nβ/α).
Composing other collective communications
A third technique for implementing the broadcast is to compose a scatter of
the data with a subsequent allgather, leaving the data duplicated on each
node. A scatter can be implemented much like the MST broadcast, except
that in each round only the data that are eventually destined for the subset
that does not contain the root need to be sent. Assume that the target
architecture has a power of two number of nodes. The cost of the scatter,
under the same assumptions as were used for the MST broadcast, is
approximately log2(p)α + ((p − 1)/p) nβ. An allgather on this kind of
architecture can be implemented with a cost of log2(p)α + ((p − 1)/p) nβ.
Thus, a scatter followed by an allgather implements a broadcast at a cost of
2 log2(p)α + 2((p − 1)/p) nβ, which is
within a factor of two of the lower bound for both the α and β terms.
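Expressed with the corresponding MPI collectives, the composition can be sketched as follows (assuming for simplicity that n is a multiple of p; the function name and the use of MPI_BYTE are illustrative, and some MPI libraries apply a similar composition internally for long messages):

```c
/* Sketch of the broadcast composed from a scatter followed by an allgather. */
#include <mpi.h>

static void scatter_allgather_bcast(char *buf, int n, int root, MPI_Comm comm)
{
    int me, p;
    MPI_Comm_rank(comm, &me);
    MPI_Comm_size(comm, &p);

    int blk = n / p;   /* each processor is responsible for one block of n/p bytes */

    /* Scatter the p blocks of the root's buffer; block i ends up at
       buf + i*blk on processor i (the root keeps its block in place). */
    if (me == root)
        MPI_Scatter(buf, blk, MPI_BYTE, MPI_IN_PLACE, blk, MPI_BYTE, root, comm);
    else
        MPI_Scatter(NULL, blk, MPI_BYTE, buf + me * blk, blk, MPI_BYTE, root, comm);

    /* Reassemble the complete message on every processor. */
    MPI_Allgather(MPI_IN_PLACE, blk, MPI_BYTE, buf, blk, MPI_BYTE, comm);
}
```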
Comparison
There is a clear tradeoff between the scatter-allgather broadcast and the
MST broadcast: the MST broadcast is faster by a factor of two for short
messages (n small), while the scatter-allgather broadcast is faster by a factor
of log2(p)/2 for long messages (n large). Asymptotically, as the message
length gets very large, the pipelined
algorithms are preferred. The problem is that the startup (α) is typically
much larger than the cost per item transferred (β). A second problem is
that methods that pipeline along multiple paths, like the EDST algorithm,
depend on the architecture being highly synchronous so that the pipelines
do not start interfering. Finally, generalizing them in a practical way beyond
hypercubes and fully connected architectures has been a challenge. As a result, pipelined algorithms have been successfully implemented only on highly
synchronous architectures like the original Single-Instruction Multiple-Data
(SIMD) Connection Machines.
Start of Jesper’s material.
Discussion
For the broadcast operation it is common to assume that all processors know
the identity of the designated root processor initially in possession of the
data. It is also typically assumed that the amount of data to be broadcast
is known prior to the operation by all processors. This is for instance the
common assumption in interfaces for collective operations like MPI. Let p
be the number of processors, numbered consecutively from 0 to p − 1. Let
processor r be the root processor that has a data item x of size n to be
communicated to the remaining p − 1 processors.
For simplicity, and without loss of generality it is assumed in the following
that r = 0 (for most broadcast algorithms a renumbering of the processors
suffices to establish this).
Tree-based broadcast algorithms
Assume first that each processor can send and receive messages to and from
any other processor. A natural way to broadcast the information from the
root is to organize the processors in a tree structure. As soon as a processor
receives the data item x from its parent in such a tree, the information is
sent forward to the children in the tree. Such tree structures are as shown
in Figure 1.
In the star tree (a) the root sends x to each of the other p − 1 processors,
either in sequence or simultaneously as permitted by the communication
network. The root processor is busy throughout the operation, and if the send
operations must be performed in sequence the last processor has received x
after p − 1 send operations.
In the balanced binary tree (b) the processors in disjoint subtrees can
work concurrently. The depth (maximum distance from the root to a leaf) of
the tree is ⌈log2 p⌉. If a processor can perform only one send operation at a
time, the latest processor receives x after 2⌈log2 p⌉ send operations. This can
be improved by employing slightly skewed tree structures. The ith Fibonacci
tree (c) consists of a root r with the (i − 2)th Fibonacci tree as left subtree
and the (i − 1)th Fibonacci tree as right subtree (with F0 consisting of
a single node, and F1 of a root with a single child). The depth of the ith
Fibonacci tree is i, and broadcasting in the ith Fibonacci tree requires i send
operations if only one send operation at a time is possible.
The binomial tree (d) can likewise be constructed recursively. The ith
binomial tree Bi consists of a root whose i children are the roots of the
binomial trees Bj for j = i − 1, . . . , 0.
Broadcasting in the ith binomial tree again takes i send operations. This is
equivalent to the MST construction used in many collective communication
algorithms (see entry Collective communication).
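A sketch of the resulting communication schedule in MPI-style C is given below. It uses the renumbering of processors relative to the root mentioned above (virtual ranks); the bit-mask formulation, function name, message tag, and byte-count interface are illustrative assumptions:

```c
/* Sketch of a binomial-tree broadcast, the tree of Figure 1(d). */
#include <mpi.h>

static void binomial_bcast(void *buf, int count, int root, MPI_Comm comm)
{
    int me, p;
    MPI_Comm_rank(comm, &me);
    MPI_Comm_size(comm, &p);

    int vrank = (me - root + p) % p;   /* renumber so that the root becomes 0 */

    /* Receive once from the parent: the parent of vrank is vrank with its
       lowest set bit cleared (the recursive structure of the binomial tree). */
    int mask = 1;
    while (mask < p) {
        if (vrank & mask) {
            int parent = ((vrank - mask) + root) % p;
            MPI_Recv(buf, count, MPI_BYTE, parent, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }
    /* Send to the children, roots of the subtrees Bj, largest subtree first. */
    mask >>= 1;
    while (mask > 0) {
        if (vrank + mask < p) {
            int child = (vrank + mask + root) % p;
            MPI_Send(buf, count, MPI_BYTE, child, 0, comm);
        }
        mask >>= 1;
    }
}
```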
In a path (e) each processor (except the last) has only one child. The last
processor receives the data item after p − 1 send operations, but the root has
completed its part in the broadcast operation as soon as it has sent x to its
child.
Figure 1: Commonly used broadcast trees: (a) Star, (b) Binary tree, (c)
Fibonacci trees F0, F1, F2, F3, F4, (d) Binomial trees B2, B3, B4, (e) Path.
Tree                   Rounds
Star                   p − 1
Binary tree            2⌈log2 p⌉
Fibonacci tree         ⌈logΦ p⌉, where Φ = (1 + √5)/2
Binomial tree (MST)    ⌈log2 p⌉
Path                   p − 1

Table 1: Broadcast complexity in number of rounds (for the last processor
to have received the data item) for the commonly used trees. [The Fib tree
bound needs to be checked!]
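To make the table concrete, the following small, self-contained C program tabulates the round counts for a given p (the example p = 1000 is arbitrary). The Fibonacci entry is computed directly from the tree-size recurrence |F_i| = 1 + |F_{i−1}| + |F_{i−2}|, with |F_0| = 1 and |F_1| = 2, rather than from the closed form flagged above:

```c
#include <stdio.h>

static int ceil_log2(int p)              /* smallest r with 2^r >= p */
{
    int r = 0;
    for (long long v = 1; v < p; v <<= 1) r++;
    return r;
}

static int fibonacci_rounds(int p)       /* smallest i with |F_i| >= p */
{
    long long prev = 1, cur = 2;         /* |F_0| and |F_1| */
    int i = 1;
    if (p <= 1) return 0;
    while (cur < p) {
        long long next = 1 + cur + prev; /* |F_i| = 1 + |F_{i-1}| + |F_{i-2}| */
        prev = cur; cur = next; i++;
    }
    return i;
}

int main(void)
{
    int p = 1000;                        /* example processor count */
    printf("p = %d\n", p);
    printf("star               : %d rounds\n", p - 1);
    printf("binary tree        : %d rounds\n", 2 * ceil_log2(p));
    printf("Fibonacci tree     : %d rounds\n", fibonacci_rounds(p));
    printf("binomial tree (MST): %d rounds\n", ceil_log2(p));
    printf("path               : %d rounds\n", p - 1);
    return 0;
}
```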
Trees can be embedded in fully connected networks, but also in many
other network topologies, and tree-based algorithms are therefore practically
relevant for the implementation of the broadcast operation on many parallel
systems. Assuming that a processor can be engaged in only one send or
receive operation at a time (uni-directional, single-ported communication),
and that the processors work together in synchronized rounds in which any
number of disjoint communication operations can take place concurrently,
the number of rounds required for the last processor to receive the data item
x is summarized in Table 1 for the discussed tree structures. Assuming linear,
homogeneous communication costs, such that a transfer of a data item of
size n between any two processors takes α + nβ units of time, the broadcast
time is modeled by the number of rounds times the time per transfer.
In the single-ported model the minimum number of rounds required for
all processors to have received data item x is ⌈log2 p⌉, since, starting with
the root, the number of processors having received x can at most double
per communication round. The minimum time required in the linear cost
model is therefore at least ⌈log2 p⌉(α + nβ). This lower bound is achieved by
the binomial tree broadcast.
Since at least one processor has to send x and at least one processor has
to receive x, another trivial lower bound in the linear cost model is
α + nβ. No simple tree-based broadcast algorithm achieves this.
Tree                 Best cost
Path                 (p − 2)α + 2√((p − 2)αnβ) + nβ
Binary tree
Fibonacci tree
Scatter-allgather    2⌈log2 p⌉α + 2((p − 1)/p) nβ
Round optimal        α⌈log2 p⌉ + 2√(⌈log2 p⌉ αβn) + βn

Table 2: Best possible broadcast time for pipelined tree-based broadcast
algorithms, and other algorithms.
Assuming that x is divisible, and that x can be sent in N blocks of size
n/N (assuming for simplicity that N divides n), better performance can be
achieved. By pipelining through the path, such that the blocks are sent one
after another, the root is busy for N consecutive rounds. The ith processor,
i > 0, receives the first block after i communication rounds, and a new block
in each of the following N − 1 rounds. Thus, the last processor has received
all N blocks and can terminate the broadcast after (p − 1) + (N − 1) = N + p − 2
rounds. The cost of this pipelined broadcast algorithm is (N + p − 2)(α + (n/N)β).
Minimizing this yields N_opt = √((p − 2)nβ/α) and a best possible cost of
(p − 2)α + 2√((p − 2)αnβ) + nβ.
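The minimization behind this cost can be sketched as follows (treating N as a continuous variable in the linear cost model):

```latex
T(N) = (N + p - 2)\Big(\alpha + \frac{n}{N}\beta\Big)
     = N\alpha + (p-2)\alpha + n\beta + \frac{(p-2)n\beta}{N},
\qquad
\frac{dT}{dN} = \alpha - \frac{(p-2)n\beta}{N^{2}} = 0
\;\Rightarrow\;
N_{\mathrm{opt}} = \sqrt{\frac{(p-2)n\beta}{\alpha}},
\qquad
T(N_{\mathrm{opt}}) = (p-2)\alpha + 2\sqrt{(p-2)\alpha n\beta} + n\beta .
```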
Likewise, the fixed-degree balanced binary tree and the Fibonacci tree can be
pipelined. Upon receiving a new block, each node in a fixed sequence sends this
block to its children. This scheme keeps the processors (except the root and
the leaves) busy in all rounds, either receiving or sending blocks. Best possible
broadcast costs are shown in Table 2.
Pipelining can likewise be employed for all other fixed-degree tree structures.
Trees like the binomial tree (or the star), where the node degrees depend on p,
on the other hand cannot be pipelined.
Another interesting approach to broadcast, yielding comparable costs,
but that does not rely on pipelining, divides the n data into p blocks of size
n/p and scatters these over the p processors (see entries on Collective
communication, Scatter). The processors then perform an allgather operation
to assemble the pieces into the full data item x on all processors. The cost of
this algorithm is also listed in Table 2. Note that this algorithm needs slight
modifications in the case where n is smaller than (or comparable to) p, that
is, for very large parallel systems. The blocks will in that case become too
small for the scatter operation to be possible.
Broadcasting with simultaneous trees
In the single-ported, fully connected, bidirectional (simultaneous send-receive)
model [1], in which each processor can at the same time both send a
message to another processor and receive a message from a possibly different
processor, it can be shown that broadcasting N blocks of data requires at
least N − 1 + log2 p communication rounds (lower bound).
In the linear cost model this yields a broadcast time of αdlog2 pe +
2√(⌈log2 p⌉ αβn) + βn, which is asymptotically optimal.
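This expression follows, analogously to the path case, by minimizing the product of the round lower bound and the per-round cost over the number of blocks N (a sketch, dropping ceilings and treating N as continuous):

```latex
T(N) = (N - 1 + \log_2 p)\Big(\alpha + \frac{n}{N}\beta\Big)
\;\Rightarrow\;
N_{\mathrm{opt}} = \sqrt{\frac{(\log_2 p - 1)\,n\beta}{\alpha}},
\qquad
T(N_{\mathrm{opt}}) = (\log_2 p - 1)\alpha + n\beta + 2\sqrt{(\log_2 p - 1)\,\alpha n\beta}
\approx \alpha\log_2 p + 2\sqrt{\log_2 p\,\alpha\beta n} + \beta n .
```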
In [13] the idea of using multiple, edge-disjoint spanning trees in hypercubes was introduced. With pipelining and edge-disjoint binomial trees
(EDBT) it is possible to achieve this lower bound when p is a power of two.
It was for a number of years an open problem how to achieve this for arbitrary p. The first round optimal algorithms were given in [14, 1]. A different,
explicit construction was found and described in [19]. A very elegant (and
practical) extension of the hypercube EDBT algorithm to arbitrary p was
presented in [12]. This algorithm runs the hypercube algorithm on pairs of
processors (some of these singletons).
The EDBT idea was used to achieve optimal (?) broadcast algorithms
for multidimensional meshes and tori in [20].
General graphs
The problem of determining the smallest number of rounds needed to broadcast
in an arbitrary, given graph is NP-complete in the following sense [9,
Problem ND49]. Given an undirected graph G = (V, E), a root vertex r ∈ V,
and an integer k (number of rounds), is there a sequence of vertex and edge
subsets {r} = V0, E1, V1, E2, V2, . . . , Ek, Vk = V with Vi ⊆ V, Ei ⊆ E, such
that each e ∈ Ei has one endpoint in Vi−1 and one in Vi, no two edges of
Ei share an endpoint, and Vi = Vi−1 ∪ {w | (v, w) ∈ Ei}? This holds also for
many special networks [11]. For approximation algorithms see [8].
Related Entries
Collective communication
Reduction
Message-Passing Interface (MPI)
Communication network
Single-ported communication
Bibliographic Notes and Further Reading
Some classical surveys with extensive treatment of broadcast (and allgather/gossiping)
problems under various communication and network assumptions can be
found in [10, 7]. For a survey of broadcasting in distributed systems see [6].
Fibonacci trees for broadcast were used in [4] to achieve near-optimal
broadcast times.
The classical paper [13] introduced edge-disjoint trees for broadcast in hypercubes (EDBT).
The scatter-allgather algorithm is from [2] [Is this correct?]. In [18] it is
shown how to alleviate the problem with blocks becoming too small for large
processor counts.
An interpolation between binary trees and a path that is useful when
pipelining messages of medium size, so-called fractional trees, was given in [15].
In [16] it has been shown that binary trees (which are simple to implement and
pipeline) can be used to achieve the optimal cost by using two edge-disjoint
trees instead of one.
An arguably more accurate performance model of communication networks is
the so-called LogGP model. With this model, yet other broadcast
tree structures yield the best performance [5, 17].
An algorithm that is optimal up to a lower-order term, based on using two
edge-disjoint binary trees, was recently presented in [16].
Heterogeneous systems in which different processors can have different
speeds and different communication costs pose new challenges for efficient
broadcast algorithms (as well as for other collective operations). For examples, see [3].
References
[1] Amotz Bar-Noy, Shlomo Kipnis, and Baruch Schieber. Optimal multiple
message broadcasting in telephone-like communication systems. Discrete
Applied Mathematics, 100(1–2):1–15, 2000.
[2] M. Barnett, S. Gupta, D. Payne, L. Shuler, R. van de Geijn, and
J. Watts. Interprocessor collective communication library. In Proceedings of Supercomputing 1994, Nov. 1994.
[3] Olivier Beaumont, Arnaoud Legrand, Loris Marchal, and Yves Robert.
Pipelining broadcast on heterogeneous platforms. IEEE Transactions
on Parallel and Distributed Systems, 16(4):300–313, 2005.
[4] Jehoshua Bruck, Robert Cypher, and Ching-Tien Ho. Multiple message
broadcasting with generalized fibonacci trees. In Symposium on Parallel
and Distributed Processing (SPDP), pages 424–431, 1992.
[5] David E. Culler, Richard M. Karp, David Patterson, Abhijit Sahay,
Eunice E. Santos, Klaus Erik Schauser, Ramesh Subramonian, and
Thorsten von Eicken. LogP: A practical model of parallel computation.
Communications of the ACM, 39(11):78–85, 1996.
[6] Xavier Défago, André Schiper, and Péter Urbán. Total order broadcast and multicast algorithms: Taxonomy and survey. ACM Computing
Surveys, 36(4):372–421, 2004.
[7] Pierre Fraigniaud and Emmanuel Lazard. Methods and problems of
communication in usual networks. Discrete Applied Mathematics, 53(1–
3):79–133, 1994.
[8] Pierre Fraigniaud and Sandrine Vial. Approximation algorithms for
broadcasting and gossiping. Journal of Parallel and Distributed Computing, 43:47–55, 1997.
[9] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide
to the Theory of NP-Completeness. Freeman, 1979. With an addendum,
1991.
[10] Sandra M. Hedetniemi, T. Hedetniemi, and Arthur L. Liestman. A survey of gossiping and broadcasting in communication networks. Networks,
18:319–349, 1988.
[11] Klaus Jansen and Haiko Müller. The minimum broadcast time problem for several processor networks. Theoretical Computer Science,
147(1&2):69–85, 1995.
[12] Bin Jia. Process cooperation in multiple message broadcast. Parallel
Computing, 35(12):572–580, 2009.
[13] S. Lennart Johnsson and Ching-Tien Ho. Optimum broadcasting and
personalized communication in hypercubes. IEEE Transactions on
Computers, 38(9):1249–1268, 1989.
[14] Oh-Heum Kwon and Kyung-Yong Chwa. Multiple message broadcasting
in communication networks. Networks, 26:253–261, 1995.
[15] Peter Sanders and Jop F. Sibeyn. A bandwidth latency tradeoff for
broadcast and reduction. Information Processing Letters, 86(1):33–38,
2003.
[16] Peter Sanders, Jochen Speck, and Jesper Larsson Träff. Two-tree algorithms for full bandwidth broadcast, reduction and scan. Parallel
Computing, 35:581–594, 2009.
[17] Eunice E. Santos. Optimal and near-optimal algorithms for k-item
broadcast. Journal of Parallel and Distributed Computing, 57(2):121–
139, 1999.
[18] Jesper Larsson Träff. A simple work-optimal broadcast algorithm for
message-passing parallel systems. In Recent Advances in Parallel Virtual Machine and Message Passing Interface. 11th European PVM/MPI
Users’ Group Meeting, volume 3241 of Lecture Notes in Computer Science, pages 173–180. Springer-Verlag, 2004.
[19] Jesper Larsson Träff and Andreas Ripke. Optimal broadcast for fully
connected processor-node networks. Journal of Parallel and Distributed
Computing, 68(7):887–901, 2008.
[20] Jerrell Watts and Robert A. van de Geijn. A pipelined broadcast for
multidimensional meshes. Parallel Processing Letters, 5:281–292, 1995.