Broadcast

Jesper Larsson Träff, Department of Scientific Computing, University of Vienna, A-1090 Vienna, Austria
Robert A. van de Geijn, Department of Computer Science, The University of Texas at Austin, Austin, TX, USA

Synonyms
One-to-all broadcast; parallel copy

Definition
Among a set of processors, a designated root processor has a data item that must be communicated to all other processors. The broadcast operation performs this collective communication.

Discussion
For the broadcast operation it is generally assumed that, upon execution, all processors involved in the broadcast know the identity of the designated root processor that is initially in possession of the data. It is also typically assumed that the amount of data to be broadcast is known at that time. Let p be the number of processors, numbered consecutively from 0 to p − 1, and let processor r be the root processor that has a data item x of size n to be communicated to the remaining p − 1 processors.

Lower bounds
A convenient model for the cost of sending a message between two nodes in a distributed-memory architecture is α + nβ, where α is the startup cost (latency) and β is the cost per item transferred. If one assumes that a node can send to and receive from only one other node at any given time, two lower bounds for the broadcast can easily be justified, the first for the α term and the second for the β term.
• ⌈log2 p⌉α. Define a round of communication as a period during which each node can send at most one message and receive at most one message. In each round, the number of nodes that own the message x can at most double. Thus, a minimum of ⌈log2 p⌉ rounds is needed to broadcast the message, and each round costs at least α.
• nβ. If p > 1, the message must leave the root node, requiring a time of at least nβ.

Tree-based broadcast algorithms
The best-known algorithm for broadcasting is the Minimum Spanning Tree (MST) algorithm.
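The doubling argument behind the ⌈log2 p⌉ lower bound can be checked with a short simulation. The sketch below is illustrative only (the function name doubling_rounds is ad hoc, not from any library): it simulates the best case allowed by the one-send/one-receive assumption, in which every node that has the message passes it on in every round.

```python
import math

def doubling_rounds(p):
    """Rounds needed if, in every round, each node that has the message
    sends it to one node that does not: the informed set can at most
    double per round, which is the doubling lower-bound argument."""
    informed, rounds = 1, 0
    while informed < p:
        informed = min(p, 2 * informed)  # at most doubles each round
        rounds += 1
    return rounds

for p in (2, 3, 8, 1000):
    assert doubling_rounds(p) == math.ceil(math.log2(p))
assert doubling_rounds(1) == 0  # the root alone needs no rounds
```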
Assume again that each node can send only one message and receive only one message at a given time. The following algorithm (also given in the entry on Collective Communication) broadcasts the message from the root to all other nodes:
• Partition the processors into two (roughly equal) subsets.
• Send x from the root to a processor (the destination) in the subset that does not include the root.
• Recursively broadcast x from the root and this destination processor within their respective subsets of processors.
Under the stated assumptions the total cost of this algorithm is ⌈log2 p⌉(α + nβ). It achieves the lower bound for the α term but not for the β term.

Pipelining
A discussion of pipelined broadcast algorithms starts by assuming that the root is node 0 and that communication is from node i to node i + 1, i = 0, . . . , p − 2. The original message, of length n, is partitioned into k packets of length n/k each. In the first round of the algorithm the first packet is sent from node 0 to node 1. In the second round this packet is forwarded to node 2 while the second packet is sent to node 1. In this fashion a pipeline is established that communicates the packets to all nodes. The cost can be analyzed as follows: the first packet arrives at node p − 1 after p − 1 rounds. After this, another packet arrives in every round, so that the process completes after (p − 1) + (k − 1) = p + k − 2 rounds. Since each round costs α + (n/k)β, the total cost for the broadcast is (p + k − 2)α + ((p + k − 2)/k)nβ. The optimal k can be determined by differentiating and setting the result to zero, k_opt = √((p − 2)nβ/α), taking into account that a packet must be a whole number of bytes. If α = 0 (no startup cost), the cost is (p + n − 2)β when the message is pipelined one item at a time (k = n). If n is much greater than p, this is close to the lower bound of nβ. The problem is that α and/or p are typically large for current distributed-memory architectures.
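The pipelined-path cost and the choice of k can be evaluated numerically. The following sketch assumes the linear α + nβ cost model described above; the function names are illustrative, and k_opt rounds the real-valued optimum √((p − 2)nβ/α) to a feasible integer packet count.

```python
import math

def pipelined_cost(p, n, k, alpha, beta):
    """Cost of pipelining k packets of size n/k along a path of p
    nodes: p + k - 2 rounds, each costing alpha + (n/k)*beta."""
    return (p + k - 2) * (alpha + (n / k) * beta)

def k_opt(p, n, alpha, beta):
    """Optimal packet count sqrt((p-2)*n*beta/alpha), rounded to
    whichever neighbouring integer in [1, n] gives the lower cost."""
    k = math.sqrt((p - 2) * n * beta / alpha)
    candidates = {max(1, math.floor(k)), min(n, math.ceil(k))}
    return min(candidates, key=lambda c: pipelined_cost(p, n, c, alpha, beta))

p, n, alpha, beta = 32, 10000, 5.0, 0.01
k = k_opt(p, n, alpha, beta)
# The optimized k beats both extremes: no pipelining (k = 1) and
# fully pipelined single items (k = n).
assert pipelined_cost(p, n, k, alpha, beta) <= pipelined_cost(p, n, 1, alpha, beta)
assert pipelined_cost(p, n, k, alpha, beta) <= pipelined_cost(p, n, n, alpha, beta)
```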
Moreover, the broadcast is often used to send short messages. Thus, it becomes important to reduce the length of the longest path by combining pipelining with a tree algorithm. The best example of this is the Edge-Disjoint Spanning Tree algorithm by Johnsson and Ho [13]. The idea behind that algorithm is that on a hypercube, or on a fully connected architecture with a power-of-two number of nodes, one can embed log2(p) trees and alternate pipelining parts of the message within each of those trees. The effective longest path becomes log2(p) + 1. Thus, even though each node can send and receive only one message during a round, the cost of this pipelined algorithm is (log2(p) + 1)(α + (n/k)β), where k can be chosen optimally: k_opt = √((log2(p) − 1)nβ/α).

Composing other collective communications
A third technique for implementing the broadcast is to compose a scatter of the data with a subsequent allgather, leaving the data duplicated on each node. A scatter can be implemented much like the MST broadcast, except that in each round only the data eventually destined for the subset that does not contain the root need to be sent. Assume that the target architecture has a power-of-two number of nodes. The cost of the scatter, under the same assumptions as were used for the MST broadcast, is approximately log2(p)α + ((p − 1)/p)nβ. An allgather on this kind of architecture can be implemented with a cost of log2(p)α + ((p − 1)/p)nβ. Thus, a scatter followed by an allgather implements a broadcast at a cost of 2 log2(p)α + 2((p − 1)/p)nβ, which is within a factor of two of the lower bound for both the α and β terms.

Comparison
There is a clear tradeoff between the scatter-allgather broadcast and the MST broadcast: the MST broadcast is faster by a factor of two for short messages (n small), while the scatter-allgather broadcast is faster by a factor of log2(p)/2 for long messages (n large). Asymptotically, as the message length gets very large, the pipelined algorithms are preferred.
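The comparison above can be made concrete by plugging sample parameters into the three cost formulas. This is a sketch under the linear cost model with arbitrary example values for α and β (the function names are illustrative); the pipelined path is evaluated at its real-valued optimal packet count.

```python
import math

def mst_cost(p, n, alpha, beta):
    """MST broadcast: ceil(log2 p) rounds, each alpha + n*beta."""
    return math.ceil(math.log2(p)) * (alpha + n * beta)

def scatter_allgather_cost(p, n, alpha, beta):
    """Scatter followed by allgather (p assumed a power of two)."""
    return 2 * math.log2(p) * alpha + 2 * (p - 1) / p * n * beta

def pipelined_path_cost(p, n, alpha, beta):
    """Pipelined path at the real-valued optimum k = sqrt((p-2)n*beta/alpha)."""
    k = math.sqrt((p - 2) * n * beta / alpha)
    return (p + k - 2) * (alpha + n / k * beta)

p, alpha, beta = 64, 10.0, 0.01
short, long = 1, 10**7
# Short messages: MST wins (one log2(p)*alpha term instead of two).
assert mst_cost(p, short, alpha, beta) < scatter_allgather_cost(p, short, alpha, beta)
# Long messages: scatter-allgather beats MST by about log2(p)/2.
assert scatter_allgather_cost(p, long, alpha, beta) < mst_cost(p, long, alpha, beta)
# Very long messages: the pipelined path approaches n*beta and wins.
assert pipelined_path_cost(p, long, alpha, beta) < scatter_allgather_cost(p, long, alpha, beta)
```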
The problem is that the startup cost (α) is typically much larger than the cost per item transferred (β). A second problem is that methods that pipeline along multiple paths, like the EDST algorithm, depend on the architecture being highly synchronous, so that the pipelines do not start interfering with one another. Finally, generalizing them in a practical way beyond hypercubes and fully connected architectures has been a challenge. As a result, pipelined algorithms have been successfully implemented only on highly synchronous architectures like the original Single-Instruction Multiple-Data (SIMD) Connection Machines.

The amount of data to be broadcast is commonly assumed to be known by all processors prior to the operation; this is, for instance, the standard assumption in interfaces for collective operations like MPI. For simplicity, and without loss of generality, it is assumed in the following that the root is processor r = 0 (for most broadcast algorithms a renumbering of the processors suffices to establish this).

Broadcast trees
Assume first that each processor can send messages to and receive messages from any other processor. A natural way to broadcast the information from the root is to organize the processors in a tree structure: as soon as a processor receives the data item x from its parent in such a tree, it forwards the information to its children in the tree. Commonly used tree structures are shown in Figure 1. In the star tree (a), the root sends x to each of the other p − 1 processors, either in sequence or simultaneously as permitted by the communication network.
The root processor is busy throughout the operation, and if the send operations must be performed in sequence, the last processor has received x after p − 1 send operations. In the balanced binary tree (b), the processors in disjoint subtrees can work concurrently. The depth (maximum distance from the root to a leaf) of the tree is ⌈log2 p⌉. If a processor can perform only one send operation at a time, the latest processor receives x after 2⌈log2 p⌉ send operations. This can be improved by employing slightly skewed tree structures. The ith Fibonacci tree (c) consists of a root with the (i − 2)th Fibonacci tree as left subtree and the (i − 1)th Fibonacci tree as right subtree (with F0 consisting of a single node, and F1 of a root with a single child). The depth of the ith Fibonacci tree is i, and broadcasting in the ith Fibonacci tree requires i send operations if only one send operation at a time is possible. The binomial tree (d) can likewise be constructed recursively: the ith binomial tree Bi consists of the root with i children Bj for j = i − 1, . . . , 0. Broadcasting in the ith binomial tree again takes i send operations. This is equivalent to the MST construction used in many collective communication algorithms (see entry Collective Communication). In a path (e), each processor except the last has exactly one child.

Figure 1: Commonly used broadcast trees: (a) star, (b) binary tree, (c) Fibonacci trees F0, F1, F2, F3, F4, (d) binomial trees B2, B3, B4, (e) path.

Tree                 Rounds
Star                 p − 1
Binary tree          2⌈log2 p⌉
Fibonacci tree       ⌈logΦ p⌉, where Φ = (1 + √5)/2
Binomial tree (MST)  ⌈log2 p⌉
Path                 p − 1

Table 1: Broadcast complexity, in number of rounds until the last processor has received the data item, of the commonly used trees.
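The round counts of the skewed trees can be cross-checked by constructing the trees explicitly. The following Python sketch is illustrative (nested lists as an ad-hoc tree representation; all names invented here): each node forwards to one child per round, serving the child whose subtree needs the most remaining rounds first, which is optimal for a fixed tree by a simple exchange argument.

```python
def broadcast_rounds(tree):
    """Single-ported rounds until the last descendant has the item:
    the node sends to one child per round, deepest-schedule child
    first; child in send slot s finishes after s + its own rounds."""
    sub = sorted((broadcast_rounds(c) for c in tree), reverse=True)
    return max((slot + 1 + r for slot, r in enumerate(sub)), default=0)

def binomial(i):
    """The ith binomial tree B_i: a root with children B_{i-1}, ..., B_0."""
    return [binomial(j) for j in range(i - 1, -1, -1)]

def fibonacci(i):
    """The ith Fibonacci tree: F_0 a single node, F_1 a root with one
    child, and F_i with F_{i-1} and F_{i-2} as its two subtrees."""
    if i == 0:
        return []
    if i == 1:
        return [[]]
    return [fibonacci(i - 1), fibonacci(i - 2)]

def size(tree):
    return 1 + sum(size(c) for c in tree)

for i in range(1, 8):
    # Both trees complete in i rounds; B_i spans exactly 2**i nodes,
    # while the Fibonacci tree spans fewer (its size grows as Phi**i).
    assert broadcast_rounds(binomial(i)) == i
    assert size(binomial(i)) == 2 ** i
    assert broadcast_rounds(fibonacci(i)) == i
```

Sending to the deeper subtree first is exactly what makes the skewed trees competitive: the child with the longest remaining schedule gets the earliest send slot.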
The last processor receives the data item after p − 1 send operations, but the root has completed its part in the broadcast operation as soon as it has sent x to its child.

Trees can be embedded in fully connected networks, but also in many other network topologies, and tree-based algorithms are therefore practically relevant for the implementation of the broadcast operation on many parallel systems. Assume that a processor can be engaged in only one send or receive operation at a time (uni-directional, single-ported communication), and that the processors together work in synchronized rounds in which any number of disjoint communication operations can take place concurrently. The numbers of rounds required for the last processor to receive the data item x in the discussed tree structures are summarized in Table 1. Assuming linear, homogeneous communication costs, such that a transfer of a data item of size n between any two processors takes α + nβ units of time, the broadcast time is modeled by the number of rounds times the time per transfer.

In the single-ported model the minimum number of rounds required for all processors to have received data item x is ⌈log2 p⌉, since, starting with the root, the number of processors having received x can at most double per communication round. The minimum time required in the linear cost model is therefore at least ⌈log2 p⌉(α + nβ). This lower bound is achieved by the binomial tree broadcast. Since at least one processor has to send x and at least one processor has to receive x, another trivial lower bound in the linear cost model is α + nβ. No simple tree-based broadcast algorithm achieves this.

Tree               Best possible cost
Path               (p − 2)α + nβ + 2√((p − 2)αnβ)
Binary tree        ≈ 2⌈log2 p⌉α + 2nβ (leading terms)
Fibonacci tree     ≈ ⌈logΦ p⌉α + 2nβ (leading terms)
Scatter-allgather  2⌈log2 p⌉α + 2((p − 1)/p)nβ
Round optimal      ⌈log2 p⌉α + 2√(⌈log2 p⌉αβn) + nβ

Table 2: Best possible broadcast times for pipelined tree-based broadcast algorithms and other algorithms.
Assuming that x is divisible, and that x can be sent in N blocks of size n/N (assuming for simplicity that N divides n), better performance can be achieved. By pipelining through the path, such that the blocks are sent one after another, the root is busy for N consecutive rounds. The ith processor, i > 0, receives the first block after i communication rounds, and a new block in each of the following N − 1 rounds. Thus, the last processor has received all N blocks and can terminate the broadcast after (p − 1) + (N − 1) = N + p − 2 rounds. The cost of this pipelined broadcast algorithm is (N + p − 2)(α + (n/N)β). Minimizing over N yields a best possible cost of (p − 2)α + nβ + 2√((p − 2)αnβ), attained at N = √((p − 2)nβ/α).

Likewise, the fixed-degree balanced binary tree and the Fibonacci tree can be pipelined. Upon receiving a new block, each node in a fixed sequence sends this block to its children. This scheme keeps all processors (except root and leaves) busy in all rounds, either receiving or sending blocks. Best possible broadcast costs are shown in Table 2. Pipelining can likewise be employed for all other fixed-degree tree structures. Trees like the binomial tree (or the star), where the node degrees depend on p, on the other hand cannot be pipelined.

Another interesting approach to broadcast, yielding comparable costs but not relying on pipelining, divides the n data into p blocks of size n/p and scatters these over the p processors (see entries on Collective Communication and Scatter). The processors then perform an allgather operation to assemble the pieces into the full data item x on all processors. The cost of this algorithm is also listed in Table 2. Note that this algorithm needs slight modifications in the case where n is smaller than (or comparable to) p, that is, for very large parallel systems; the blocks would in that case become too small for the scatter operation to be possible.
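The N + p − 2 round count can be confirmed by a round-by-round simulation of the pipelined path. This is a sketch with ad-hoc names, assuming each node sends and receives at most one block per round; processing nodes from the end of the path makes each block advance exactly one hop per round.

```python
def pipelined_path_rounds(p, N):
    """Simulate the pipelined path broadcast: in each round, node i
    forwards one block to node i+1 whenever it holds a block that
    node i+1 does not yet have.  Returns the round in which the last
    node has received all N blocks."""
    have = [N if i == 0 else 0 for i in range(p)]  # blocks held so far
    rounds = 0
    while have[-1] < N:
        rounds += 1
        # Descending order: each comparison sees pre-round state, so
        # all sends in a round happen simultaneously.
        for i in range(p - 2, -1, -1):
            if have[i] > have[i + 1]:
                have[i + 1] += 1
    return rounds

for p, N in ((5, 1), (5, 8), (16, 4), (2, 10)):
    assert pipelined_path_rounds(p, N) == N + p - 2
```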
Broadcasting with simultaneous trees
In the single-ported, fully connected, bidirectional send-receive model [1], in which each processor can at the same time send a message to one processor and receive a message from a possibly different processor, it can be shown that broadcasting N blocks of data requires at least N − 1 + ⌈log2 p⌉ communication rounds (lower bound). In the linear cost model this yields a broadcast time of ⌈log2 p⌉α + 2√(⌈log2 p⌉αβn) + nβ, which is optimal up to lower-order terms. In [13] the idea of using multiple, edge-disjoint spanning trees in hypercubes was introduced. With pipelining and edge-disjoint binomial trees (EDBT) it is possible to achieve this lower bound when p is a power of two. It was for a number of years an open problem how to achieve this for arbitrary p. The first round-optimal algorithms were given in [14, 1]. A different, explicit construction was found and described in [19]. A very elegant (and practical) extension of the hypercube EDBT algorithm to arbitrary p was presented in [12]; this algorithm runs the hypercube algorithm on pairs of processors (some of these singletons). The EDBT idea was used to derive efficient broadcast algorithms for multidimensional meshes and tori in [20].

General graphs
The problem of determining the smallest number of rounds needed to broadcast in an arbitrary given graph is NP-complete in the following sense [9, Problem ND49]. Given an undirected graph G = (V, E), a root vertex r ∈ V, and an integer k (number of rounds), is there a sequence of vertex and edge subsets {r} = V0, E1, V1, E2, V2, . . . , Ek, Vk = V, with Vi ⊆ V and Ei ⊆ E, such that each e ∈ Ei has one endpoint in Vi−1 and one in Vi, no two edges of Ei share an endpoint, and Vi = Vi−1 ∪ {w | (v, w) ∈ Ei}? The problem remains hard for many special networks [11]. For approximation algorithms, see [8].
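For small instances, the minimum number of rounds in an arbitrary graph can still be found by exhaustively enumerating, round by round, the matchings between informed and uninformed vertices from the definition above. The following brute-force sketch (exponential time, usable only for tiny graphs; all names are illustrative) does exactly that.

```python
def min_broadcast_rounds(p, edges, root=0):
    """Brute-force minimum broadcast time in an undirected graph on
    vertices 0..p-1: per round, each informed vertex may send to at
    most one uninformed neighbour, and no vertex receives twice."""
    adj = [set() for _ in range(p)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def successors(informed):
        """All informed-sets reachable from `informed` in one round."""
        senders = sorted(informed)
        results = set()

        def rec(i, new):
            if i == len(senders):
                results.add(informed | new)
                return
            rec(i + 1, new)  # sender i stays idle this round
            for w in adj[senders[i]]:
                if w not in informed and w not in new:
                    rec(i + 1, new | {w})

        rec(0, frozenset())
        return results

    frontier = {frozenset([root])}
    target = frozenset(range(p))
    rounds = 0
    while target not in frontier:
        rounds += 1
        frontier = {s for f in frontier for s in successors(f)}
    return rounds

# Complete graph on 5 vertices: doubling gives ceil(log2 5) = 3 rounds.
K5 = [(u, v) for u in range(5) for v in range(u + 1, 5)]
assert min_broadcast_rounds(5, K5) == 3
# Path on 5 vertices with the root at one end: 4 rounds.
P5 = [(i, i + 1) for i in range(4)]
assert min_broadcast_rounds(5, P5) == 4
# Star on 4 vertices rooted at the center: sequential sends, 3 rounds.
assert min_broadcast_rounds(4, [(0, 1), (0, 2), (0, 3)]) == 3
```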
Related Entries
Collective communication
Reduction
Message-Passing Interface (MPI)
Communication network
Single-ported communication

Bibliographic Notes and Further Reading
Some classical surveys with extensive treatment of broadcast (and allgather/gossiping) problems under various communication and network assumptions can be found in [10, 7]. For a survey of broadcasting in distributed systems, see [6]. Fibonacci trees for broadcast were used in [4] to achieve near-optimal broadcast times. The classical paper [13] introduced edge-disjoint spanning trees for broadcast in hypercubes (EDBT). The scatter-allgather algorithm is described in [2]. In [18] it is shown how to alleviate the problem of blocks becoming too small for large processor counts. An interpolation between binary trees and the path that is useful for pipelining medium-sized messages, the so-called fractional trees, was given in [15]. In [16] it is shown that binary trees, which are simple to implement and pipeline, can achieve cost that is optimal up to a lower-order term by using two edge-disjoint binary trees instead of one. An arguably more accurate performance model of communication networks is the so-called LogGP model; with this model, yet other broadcast tree structures yield the best performance [5, 17]. Heterogeneous systems, in which different processors can have different speeds and different communication costs, pose new challenges for efficient broadcast algorithms (as well as for other collective operations); for examples, see [3].

References
[1] Amotz Bar-Noy, Shlomo Kipnis, and Baruch Schieber. Optimal multiple message broadcasting in telephone-like communication systems. Discrete Applied Mathematics, 100(1–2):1–15, 2000.
[2] M. Barnett, S. Gupta, D. Payne, L. Shuler, R. van de Geijn, and J. Watts. Interprocessor collective communication library.
In Proceedings of Supercomputing 1994, November 1994.
[3] Olivier Beaumont, Arnaud Legrand, Loris Marchal, and Yves Robert. Pipelining broadcast on heterogeneous platforms. IEEE Transactions on Parallel and Distributed Systems, 16(4):300–313, 2005.
[4] Jehoshua Bruck, Robert Cypher, and Ching-Tien Ho. Multiple message broadcasting with generalized Fibonacci trees. In Symposium on Parallel and Distributed Processing (SPDP), pages 424–431, 1992.
[5] David E. Culler, Richard M. Karp, David Patterson, Abhijit Sahay, Eunice E. Santos, Klaus Erik Schauser, Ramesh Subramonian, and Thorsten von Eicken. LogP: A practical model of parallel computation. Communications of the ACM, 39(11):78–85, 1996.
[6] Xavier Défago, André Schiper, and Péter Urbán. Total order broadcast and multicast algorithms: Taxonomy and survey. ACM Computing Surveys, 36(4):372–421, 2004.
[7] Pierre Fraigniaud and Emmanuel Lazard. Methods and problems of communication in usual networks. Discrete Applied Mathematics, 53(1–3):79–133, 1994.
[8] Pierre Fraigniaud and Sandrine Vial. Approximation algorithms for broadcasting and gossiping. Journal of Parallel and Distributed Computing, 43:47–55, 1997.
[9] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, 1979. With an addendum, 1991.
[10] Sandra M. Hedetniemi, Stephen T. Hedetniemi, and Arthur L. Liestman. A survey of gossiping and broadcasting in communication networks. Networks, 18:319–349, 1988.
[11] Klaus Jansen and Haiko Müller. The minimum broadcast time problem for several processor networks. Theoretical Computer Science, 147(1&2):69–85, 1995.
[12] Bin Jia. Process cooperation in multiple message broadcast. Parallel Computing, 35(12):572–580, 2009.
[13] S. Lennart Johnsson and Ching-Tien Ho. Optimum broadcasting and personalized communication in hypercubes. IEEE Transactions on Computers, 38(9):1249–1268, 1989.
[14] Oh-Heum Kwon and Kyung-Yong Chwa.
Multiple message broadcasting in communication networks. Networks, 26:253–261, 1995.
[15] Peter Sanders and Jop F. Sibeyn. A bandwidth latency tradeoff for broadcast and reduction. Information Processing Letters, 86(1):33–38, 2003.
[16] Peter Sanders, Jochen Speck, and Jesper Larsson Träff. Two-tree algorithms for full bandwidth broadcast, reduction and scan. Parallel Computing, 35:581–594, 2009.
[17] Eunice E. Santos. Optimal and near-optimal algorithms for k-item broadcast. Journal of Parallel and Distributed Computing, 57(2):121–139, 1999.
[18] Jesper Larsson Träff. A simple work-optimal broadcast algorithm for message-passing parallel systems. In Recent Advances in Parallel Virtual Machine and Message Passing Interface. 11th European PVM/MPI Users' Group Meeting, volume 3241 of Lecture Notes in Computer Science, pages 173–180. Springer-Verlag, 2004.
[19] Jesper Larsson Träff and Andreas Ripke. Optimal broadcast for fully connected processor-node networks. Journal of Parallel and Distributed Computing, 68(7):887–901, 2008.
[20] Jerrell Watts and Robert A. van de Geijn. A pipelined broadcast for multidimensional meshes. Parallel Processing Letters, 5:281–292, 1995.