* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Parallel Euler tour and Post Ordering for Parallel Tree Accumulations
Survey
Document related concepts
Transcript
Parallel Euler tour and Post Ordering for Parallel Tree Accumulations An implementation technical report Sinan Al-Saffar & David Bader University Of New Mexico Dec. 2003 Introduction Tree accumulation is the process of aggregating data placed in tree nodes according to their tree structure. The aggregation can be a simple arithmetic operation or a more complex function. This is similar to arithmetic expression evaluation except there is one operation to be performed on the entire tree and is thus implied and internal nodes do not contain arithmetic operations. We have written a program to implement accumulations on tree data structures in parallel. We developed and tested our code on a fourteen-processor Sun SMP and implemented parallelism using the threaded SIMPLE[1] library. We report our results and describe our work in this report. To implement tree accumulation and generally other algorithms in parallel one requires input to be arranged and preprocessed in a certain way, for example a tree needs to be supplied in a post ordering[2] or preordering of its nodes. These operations are as important as the main task one is addressing since if they are not done efficiently, they become bottle-necks for performance. Some authors wave their hand and assume a certain property exists in the data after a “preprocessing step”. We address these aspects in this report in detail. We demonstrate how to obtain the Euler tour of a tree in parallel and how to use this tour to obtain the post order of the tree nodes in parallel. We then show how to perform parallel tree accumulations also in parallel using this obtained post ordering. Input: We wrote treeGen, a small program that generates random trees to be used on our runs. By a random tree we mean a tree whose nodes contain an arbitrary value and have an arbitrary (up to a specifiable max degree) number of child nodes. The arbitrary values in the nodes can be used later in tree accumulation or other operations. We only show the node numbers in the figure below. The weights or values of the nodes are irrelevant to the tour. An example input tree is shown in figure-1 below along with the output file from treeGen specifying it. The DIMACS file format is simply an edges list. c c p n e e n e e n e n e e n n n n tree file generated by treeGen format 8 7 0 53 0 1 0 2 1 14 1 3 1 4 2 27 2 5 3 2 3 3 6 3 7 4 31 5 85 6 30 7 11 6 0 1 2 4 5 7 Figure-1 (the input tree) The Euler Tour The Euler tour is a traversal of a graph such that each edge of this graph is visited exactly once. The Euler tour technique is essential for different parallel computations on graphs. For example, computing the post order of the tree which is equivalent to conducting a depth first search is inherently sequential since a processor cannot determine the rank or order of its nodes without waiting for the output of the previous processor so it can learn the starting rank of its first node. The starting rank depends on the nodes appearing before in the depth first search. A post ordering and other properties can be obtained with the Euler tour technique without the need for a depth first search. To obtain an Euler tour of the undirected tree on which we are to perform the accumulation, we transform this tree into a directed graph first by replacing each edge in the tree with two arcs opposed to each other in direction. This results in an Eulerian graph since every node in the graph will be of even degree. Figure-3 shows the input tree of figure-1 transformed in this manner. Result of the tour: The output of performing the Euler tour will be a successor array containing elements equal to twice the number of edges in the input tree (since each edge will be replaced with two directed arcs for the Euler tour). Each cell in this successor array represents an edge (the index is the edge number) which points to the edge that succeeds it in the tour. For example if Successor[2] = 7, then edge number 7 succeeds edge number 2 in the Euler tour. The successor array in figure-3 represents the Euler tour of edges: 1-4-9-13-10-14-8-5-11-32-7-12-6. Data Structures: Graphs are usually represented using an adjacency list data structure. We augment the typical adjacency list structure to facilitate the parallel computation of the successor array (the Euler tour). For this purpose we make three modification. The first two are suggested by[3]. The modifications to the adjacency list are: 1. Make the adjacency list circular 2. Add reverse pointers to each edge aÆb this pointer references the reverse edge in the graph, edge bÆa. 3. The edge list is contiguous and the location of each edge in it uniquely identifies that edge. We found this condition to be necessary for the algorithm to work. The single edge structure will thus look as follows: Figure-2 struct edge{ int val; struct edge * next; struct edge * nextR; } val nextR next nextR is the reverse pointer and val is the second node in the edge. If ‘val’ were node ‘c’ and nextR leads to an edge whose ‘val’ happens to be the value ‘a’ then we know this node represents the edge aÆc and its index in the edgeList[] array (see figure-3) is the edge number in the input tree. This is important since the successor array holding the Euler tour will refer to the tour by edge numbers. The next pointer simply points to the next edge in the list and is made circular. Our program starts out be reading the DIMACS tree file into memory and storing the tree in the adjacency list illustrated in figure-3. the program then performs discovers the Euler tour in parallel by running the following code: pardo (i, 0, numEdgesX2, 1){ succ[i].succ= edgeList[i].nextR->next- edgeList; succ[i].prefix = 0; } Note the subtracting of the edgeList value in the second line. This is why we mentioned that each edge’s index or location in memory is significant, it corresponds to the number of that edge in the tree. The successor array elements have two pointers, the successor value which points to the next edge, trivially. And the prefix value which may contain any weights to be assigned to the edges. This code is abstracted in the algorithm in figure-3 without the prefix line. The prefix or weight of each node is initialized to zero. We need the prefix value of each edge to do post ordering on the tour once it is found. Obtaining the post ordering from the tour: We assign the value of 1 to upward pointing edges and the value of 0 to the downward pointing edges. We then perform parallel prefix sums on the tree. This results in the post ordering of the nodes. If preordering were desired, it could be obtained by assigning the upwardly pointing edges 1’s and 0’s to the edges pointing downwards and then performing parallel prefix sums on the edges. The parallel prefix sums is provided by David Helman and is an implementation of the parallel prefix sums algorithm in [3]. The following code summarizes the calculation of the post order given the successor array containing the correct Euler tour and prefix sums in the prefix field: // compute post order pardo(i,0,numEdges*2,1){ post[succ[i].prefix] = edgeList[i].nextR->val; } This code block results in the post[] array containing the list of tree nodes in post order. Next Page we show an illustration of the data structures discussed so far and needed for the Euler tour and obtaining the post ordering. Edge # 2 in the tree occurs in the 2nd position in the EdgeList array. a 1 2 3 13 a 4 d 5 e 6 a 7 f 8 b 9 g 10 h f 10 g 3 12 e 14 c 7 11 d 9 2 c 5 8 b 6 b 4 1 a h b c Above: the input tree with numbered edges. d Right: its adjacency and edge lists made circular and reverse edge pointers added. e f 11 b Below: the successor array representing the Euler tour of the tree. We can obtain this in parallel as the algorithm below shows. g 12 c h 13 d AdjList[] 14 d EdgeList[] The EdgeList array above occupies contiguous memory and the index of each edge is the edge number in the graph. Successor[] 4 7 2 9 11 1 12 5 13 14 3 6 10 8 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Algorithm for parallel Euler tour using the structures above: For every edge, e, in the tree do in parallel { Successor [ e ] = ( EdgeList [ e] . reversePtr ) . nextPtr ; } Figure-3 EdgeList[12] Successor[12] = 6 Because: We follow the reverse pointer in the EdgeList[12] cell. This leads to the cell at index 7. Its next pointer goes one cell above (index 6). Equivekently the nodes read c--->a Tree Accumulation Thus far we arranged our data in a post ordering. This makes it possible to divide the data among processors in a way necessary for performing tree accumulation in parallel, the next step. To perform tree accumulation we identify the leaf nodes in the post ordering of the tree and partition those among the processors where each processor will keep pointers to its part of the shared array containing this post ordering. To perform upwards tree accumulation in parallel, each processor starts with its set of leaf nodes and aggregates the data in each node with its parent and then marks the child as being removed. The parent’s child count is decremented and the parent is flagged as a new leaf node to be processed in the next iteration when its child’s count becomes zero. Two arrays are kept, leaves0 and leaves1. Leaves0 holds the current leaves and leave1 holds the leaves for the coming iteration. The array pointers are swapped after each iteration. Repeating this process and going up the tree till the root is reached will complete the upward accumulation. Below is the sample of the code performing the above task. pardo(i,0,*numLeaves0,1){ //1. aggrigate your data with parent. parent = edgeListPtrs[leaves0[j]]->val; nodeValue[parent] = nodeValue[parent] OPR nodeValue[leaves0[j]]; //2 update the child count for that parent childCount[parent] = childCount[parent]-1; } if (childCount[parent] == 0){ // parent has become leaf block1[id].size = block1[id].size +1; leaves1[k] = parent; k++; } j++; Results We present performance results for carrying out parallel Euler tour and post ordering of the input tree on a trees of size 2^22 nodes. We ran our parallel code on three different types of trees, random, caterpillar (linear), and full trees. The results are shown below. Performance on random trees 20 Seconds 15 Accumulation 10 Euler and post order 5 0 1 2 4 6 8 10 12 14 Number of Processors One observes that our performance does achieve good speedups up to 8 processors but it seems to level off thereafter. We suspect this could be an artifact of the system we ran on since parallel expression evaluation code that is known to scale on other systems up to 64 processors had the similar speedup curved as ours shown here. Performance on full trees 14 12 Seconds 10 8 Accumulation 6 Euler and post ordering 4 2 0 1 2 4 6 8 10 Number of Processors 12 14 Performance on caterpillar trees 14 12 Seconds 10 8 Accumulation 6 Euler and Post Order 4 2 0 1 2 4 6 8 10 12 14 Number of Processors Another aspect to notice is that running times are dominated by the time it takes to construct the Euler tour and computing the post ordering thus one should not ignore these processes. In this report we demonstrated how preprocessing can be and should be conducted in parallel and how it can be used in the main task being parallelized. The choice of data structures, as always, is tightly related to the kinds of operations we wish to perform in the data. We used augmented adjacency lists to perform parallel Euler tour and post ordering. Then we used this post ordering in parallel tree accumulation. We hope this technical report will assist those trying to implement these operations in the future. References • • • [1] SIMPLE: A Methodology for Programming High Performance Algorithms on Clusters of Symmetric Multiprocessors (SMPs) (1999) David A. Bader, Joseph JaJa. [2] F. Sevilgen, S. Aluru and N. Futamura, "Distributed memory tree accumulations," Proc. Parallel and Distributed Computing Systems, pp. 389-395, 1999. [3] Introduction to Parallel Algorithms. Joseph JaJa, 1992