Amorphous Data-parallelism∗

Keshav Pingali1,2, Milind Kulkarni2, Donald Nguyen1, Martin Burtscher2, Mohammed Amber Hassan3, Mario Mendez-Lojo2, Dimitrios Prountzos1, Xin Sui3
1 Department of Computer Science, 2 Institute for Computational Engineering and Sciences, and 3 Department of Electrical and Computer Engineering, The University of Texas at Austin

∗ This work is supported in part by NSF grants 0719966, 0702353, 0615240, 0541193, 0509307, 0509324, 0426787 and 0406380, as well as grants from IBM and Intel Corporation.

Abstract

Most client-side applications running on multicore processors are likely to be irregular programs that deal with complex, pointer-based data structures such as sparse graphs and trees. However, we understand very little about the nature of parallelism in irregular algorithms, let alone how to exploit it effectively on multicore processors. In this paper, we define a generalized form of data-parallelism that we call amorphous data-parallelism and show that it is ubiquitous in irregular algorithms from a variety of disciplines including data-mining, AI, compilers, networks, and scientific computing. We also show that the concepts behind amorphous data-parallelism lead to a natural categorization of irregular algorithms; regular algorithms such as finite-differences and FFTs emerge as an important category of irregular algorithms. We argue that optimistic or speculative execution is essential for exploiting amorphous data-parallelism in many irregular algorithms, but show that the overheads of speculative execution can be reduced dramatically in most cases by exploiting algorithmic structure. Finally, we show how these insights can guide the design of programming languages, system support, and architectures for exploiting parallelism.

1. Introduction

That which has form is not real; only the amorphous endures. When you understand this, you will not return to illusion.
— Ashtavakra Gita

Table 1. Problem domains and algorithms

Application/domain: Algorithms
Data-mining: Agglomerative clustering [67], k-means [67]
Bayesian inference: Belief propagation [46], Survey propagation [46]
Compilers: Iterative dataflow algorithms [3], elimination-based algorithms [3]
Functional interpreters: Graph reduction [52], static and dynamic dataflow [6, 19]
Maxflow: Preflow-push [16], augmenting paths [16]
Minimal spanning trees: Prim's [16], Kruskal's [16], Boruvka's [22] algorithms
N-body methods: Barnes-Hut [9], fast multipole [29]
Graphics: Ray-tracing [24]
Network science: Social network maintenance [32]
System modeling: Petri-net simulation [51]
Event-driven simulation: Chandy-Misra-Bryant [48], Timewarp [35]
Meshing: Delaunay mesh generation [14], refinement [60], Metis-style graph partitioning [37]
Linear solvers: Sparse MVM [28], Sparse Cholesky factorization [27]
PDE solvers: Finite-difference codes [28]

The parallel programming community has a deep understanding of parallelism and locality in dense matrix algorithms. However, outside of computational science, most algorithms are irregular: that is, they are organized around pointer-based data structures such as trees and graphs, not dense arrays. Table 1 is a list of algorithms from important problem domains. Only finite-difference codes use dense matrices; the rest use trees and sparse graphs. Unfortunately, we currently have few insights into the structure of parallelism and locality in irregular algorithms, and this has stunted the development of techniques and tools that make it easier to produce parallel implementations of these algorithms.
Domain specialists have written parallel programs for some of the algorithms in Table 1 (see [7, 27, 34, 35, 56, 60] among others). There are also parallel graph libraries such as Boost [13] and Stapl [23]. These efforts are mostly problem-specific, and it is difficult to extract broadly applicable abstractions, principles, and mechanisms from these implementations. Automatic parallelization using points-to analysis [33] and shape analysis [26, 30] has been successful in parallelizing some irregular programs that use trees, such as n-body methods. However, most of the applications in Table 1 are organized around large sparse graphs with no particular structure. These difficulties have seemed insurmountable, so irregular algorithms remain the Cinderella of parallel programming in spite of their central role in applications.

In this paper, we argue that these problems can be solved only by changing the abstractions that we use to reason about parallelism in algorithms. The most widely used abstraction is the computation graph (also known as the dependence graph), in which nodes represent computations and edges represent dependences between computations [25]; for programs with loops, dependence edges are often labeled with additional information such as dependence distances and directions [38]. Computations that are not ordered by the transitive closure of the dependence relation can be executed in parallel. Computation graphs produced either by studying the algorithm or by compiler analysis of programs are referred to as static computation graphs. Although static computation graphs are used extensively for implementing regular algorithms, they are not adequate for modeling parallelism in irregular algorithms. As we explain in Section 2, the most important reason for this is that dependences between computations in irregular algorithms are functions of runtime data values, so they cannot be represented usefully by a static computation graph.

To address the need for better abstractions, this paper proposes a new way of thinking about irregular algorithms, which we call the operator formulation of algorithms, and introduces an associated data-centric abstraction [39] called the halograph. Our point of departure is Niklaus Wirth's aphorism "Program = Algorithm + Data Structure" [68]; instead of measuring parallelism in instrumented programs as is commonly done, we study parallelism in algorithms expressed using operations on abstract data types, independently of the concrete data structures used to represent those abstract data types. In this spirit, we replace program-centric concepts like dependence distances and directions with data-centric concepts called active nodes and neighborhoods; one important consequence of this change in perspective is that regular algorithms emerge as an important special class of irregular algorithms, in much the same way that metric spaces are an important special class of general topologies in mathematics [49]. This operator formulation of irregular algorithms also leads to a novel structural classification of irregular algorithms, discussed in Section 3. In Section 4, we introduce amorphous data-parallelism, a generalization of data-parallelism in regular algorithms, and show how halographs provide a useful abstraction for reasoning about this parallelism.
We argue that optimistic or speculative execution is the only general approach to exploiting amorphous data-parallelism, but we also show that the overhead of speculative execution can be reduced by exploiting algorithmic structure when it is present. Sections 5–7 demonstrate the practical utility of these ideas by showing how they can be used to implement the algorithms in Table 1. This discussion is organized along the lines of the structural classification of algorithms introduced in Section 3. We show how amorphous data-parallelism arises in different classes of irregular algorithms; whenever possible, we also describe the optimizations used in handwritten parallel implementations of particular algorithms, and show how these optimizations can be viewed in our framework as exploiting algorithmic structure. Finally, Section 8 discusses how these insights might guide future research in language and systems support for irregular algorithms.

2. Inadequacy of static computation graphs

The need for new algorithmic abstractions for irregular algorithms can be appreciated by considering Delaunay Mesh Refinement (DMR) [14], an irregular algorithm used extensively in finite-element meshing and graphics. The Delaunay triangulation for a set of points in a plane is the triangulation in which each triangle satisfies a certain geometric constraint called the Delaunay condition [18]. In many applications, triangles are required to satisfy additional quality constraints, and this is accomplished by a process of iterative refinement that repeatedly fixes "bad" triangles (those that do not satisfy the quality constraints) by adding new points to the mesh and re-triangulating. Refining a bad triangle by itself may violate the Delaunay property of triangles around it, so it is necessary to compute a region of the mesh called the cavity of the bad triangle, and replace all the triangles in the cavity with new triangles. Figure 1 illustrates this process; the darker-shaded triangles are "bad" triangles, and the cavities are the lighter-shaded regions around the bad triangles. Re-triangulating a cavity may generate new bad triangles, but it can be shown that at least in 2D, this iterative refinement process will ultimately terminate and produce a guaranteed-quality mesh. Different orders of processing bad triangles lead to different meshes, although all such meshes satisfy the quality constraints and are acceptable outcomes of the refinement process [14]. This is an example of don't-care non-determinism, also known as committed-choice non-determinism [20]. Figure 2 shows the pseudocode for mesh refinement. Each iteration of the while loop refines one bad triangle; we call this computation an activity.

Figure 1. Fixing bad triangles: (a) unrefined mesh; (b) refined mesh.

    Mesh mesh = // read in initial mesh
    Worklist wl<Triangle>;
    wl.add(mesh.badTriangles());
    while (wl.size() != 0) {
      Triangle t = wl.get(); // get bad triangle
      if (t no longer in mesh) continue;
      Cavity c = new Cavity(t);
      c.expand();
      c.retriangulate();
      mesh.update(c);
      wl.add(c.badTriangles());
    }

Figure 2. Pseudocode of the mesh refinement algorithm

DMR can be performed in parallel since bad triangles whose cavities do not overlap can be refined in parallel. Note that two triangles whose cavities overlap can be refined in either order but not concurrently. Unfortunately, static computation graphs are inadequate for reasoning about parallelism in this algorithm.
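To make the neighborhood constraint concrete, the following sketch shows one way the "cavities do not overlap" test could be expressed, treating a cavity simply as the set of triangles it covers. The class and method names are ours and are not part of the pseudocode in Figure 2.

    // Sketch only: an explicit overlap test between two cavities, each viewed
    // as the set of mesh triangles it covers. Names are illustrative.
    import java.util.Set;

    final class CavityOverlap {
        static <T> boolean overlaps(Set<T> cavityA, Set<T> cavityB) {
            // iterate over the smaller set and probe the larger one
            Set<T> small = cavityA.size() <= cavityB.size() ? cavityA : cavityB;
            Set<T> large = (small == cavityA) ? cavityB : cavityA;
            for (T element : small) {
                if (large.contains(element)) {
                    return true;  // overlapping cavities: refine in either order, not concurrently
                }
            }
            return false;         // disjoint cavities: the two refinements can proceed in parallel
        }
    }

Section 4 shows that an implementation does not need to perform this test explicitly: acquiring a logical lock on every element of a cavity detects exactly the same overlaps.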
In Delaunay mesh refinement, as in most irregular algorithms, dependences between activities are functions of runtime values: whether or not two activities in DMR can be executed concurrently depends on whether the cavities of the relevant triangles overlap, and this depends on the input mesh and how it has been modified by previous refinements. Since this information is known only during program execution, a static computation graph for Delaunay mesh refinement must conservatively assume that every activity might interfere with every other activity.

A second problem with computation graphs is that they do not model don't-care non-determinism. In computation graphs, computations can be executed in parallel if they are not ordered by the transitive closure of the dependence relation. In irregular algorithms like Delaunay mesh refinement, it is often the case that two activities can be done in either order but cannot be done concurrently, so they are not ordered by dependence and yet cannot be executed concurrently. Exploiting don't-care non-determinism can lead to more efficient parallel implementations, as we discuss in Section 4.

Figure 3. Conflicts in event-driven simulation

Event-driven simulation [48] illustrates a different kind of complexity exhibited by some irregular algorithms. This application simulates a network of processing stations that communicate by sending messages along FIFO links. When a station processes a message, its internal state may be updated and it may produce zero or more messages on its outgoing links. Sequential event-driven simulation is performed by maintaining a global priority queue of events called the event-list, and processing the earliest event at each step. In this irregular algorithm, activities correspond to the processing of events.

In principle, event-driven simulation can be performed in parallel by processing multiple events from the event-list simultaneously, rather than one event at a time. However, unlike in Delaunay mesh refinement, where it was legal to process bad triangles concurrently provided they were well-separated in the mesh, it may not be legal to process two events in parallel even if their stations are far apart in the network. Figure 3 demonstrates this. Suppose that in a sequential implementation, node A fires and produces a message with time-stamp 3, and then node B fires and produces a message with time-stamp 4. Notice that node C must consume the message with time-stamp 4 before it consumes the message with time-stamp 5. Therefore, in Figure 3(a), the events at A and C cannot be processed in parallel even though the stations are far apart in the network.1 However, notice that if the message from B to C in Figure 3(b) had a time-stamp greater than 5, it would have been legal to process in parallel the events at nodes A and C in Figure 3(a). To determine if two activities can be performed in parallel at a given point in the execution of the algorithm, we need a crystal ball to look into the future!

1 Like in chaos theory, "the flap of a butterfly's wings in Brazil can set off a tornado in Texas" [44].

In short, static computation graphs are inadequate abstractions for irregular algorithms for the following reasons.

1. Dependences between activities in irregular algorithms are usually complex functions of runtime data values, so they cannot be usefully captured by a static computation graph.

2. Many irregular algorithms exhibit don't-care non-determinism, but this is hard to model with computation graphs.
3. Whether or not it is safe to execute two activities in parallel at a given point in the computation may depend on activities created later in the computation. This behavior is difficult to model with computation graphs.

3. Operator formulation of irregular algorithms

To address these problems, this paper introduces a new way of thinking about irregular algorithms called the operator formulation of irregular algorithms, and a new data-centric algorithm abstraction called the halograph. Halographs can be defined for any abstract data type (ADT), but we will use the graph ADT to be concrete. The operator formulation leads to a natural classification of irregular algorithms, shown in Figure 4, which we will use in the rest of the paper.

3.1 Graphs and graph topology

We use the term graph to refer to the usual directed graph abstract data type (ADT) that is formulated in terms of (i) a set of nodes V, and (ii) a set of edges (⊆ V × V) between these nodes. Undirected edges are modeled by a pair of directed edges in the standard fashion. In some applications, nodes and edges are labeled with values. In this paper, all irregular algorithms will be written in terms of this ADT.

In some algorithms, the graphs have a special structure that can be exploited by the implementation. Trees and grids are particularly important. Grids are usually represented using dense arrays, but note that a grid is actually a sparse graph (a 2D grid point that is not on the boundary is connected to four neighbors). Square dense matrices are isomorphic to cliques, which are graphs in which there is a labeled edge between every pair of nodes. At the other extreme, we have graphs consisting of labeled nodes and no edges; these are isomorphic to sets and multisets. Some of these topologies are shown in Figure 4.

Figure 4. Classification of irregular algorithms

Figure 5. Halograph: active elements and neighborhoods

We will not discuss a particular concrete representation for the graph ADT. The implementation is free to choose a concrete representation that is best suited for a particular algorithm and machine: for example, grids and cliques may be represented using dense arrays, while sparse graphs may be represented using adjacency lists. This is similar to the approach taken in relational databases: SQL programmers use the relation ADT in writing programs, and the underlying DBMS system is free to implement the ADT using B-trees, hash tables, and other concrete data structures.2

2 Codd's Rule 8 (Physical Data Independence rule): "Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representations or access methods." [15]

3.2 Active nodes and neighborhoods

At each point during the execution of a graph algorithm, there are certain nodes or edges in the graph where computation might be performed. Performing a computation may require reading or writing other nodes and edges in the graph. The node or edge on which a computation is centered will be called an active element, and the computation itself will be called an activity. To keep the discussion simple, we assume from here on that active elements are nodes. Borrowing terminology from the literature on cellular automata [63], we refer to the set of nodes and edges that are read or written in performing the computation as the neighborhood of that active node. Figure 5 shows an undirected sparse graph in which the filled nodes represent active nodes, and shaded regions represent the neighborhoods of those active nodes. Note that in general, the neighborhood of an active node is distinct from the set of its neighbors in the graph.
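To make the data-centric view concrete, the graph ADT of Section 3.1 can be summarized by a small interface; the sketch below uses method names of our own choosing and is not the API of any particular system. A concrete class behind such an interface might use dense arrays for grids and cliques, or adjacency lists for sparse graphs, as discussed above.

    // Sketch only: a minimal directed-graph ADT in the spirit of Section 3.1.
    // N is the node handle type; NV and EV are the value (label) types on nodes
    // and edges. All method names are illustrative.
    interface Graph<N, NV, EV> {
        N addNode(NV value);                      // create a node carrying a value
        void addEdge(N src, N dst, EV value);     // undirected edges: add both directions
        void removeNode(N n);
        void removeEdge(N src, N dst);

        Iterable<N> getNodes();
        Iterable<N> getNeighbors(N n);            // targets of the outgoing edges of n

        NV getNodeValue(N n);
        void setNodeValue(N n, NV value);
        EV getEdgeValue(N src, N dst);
        void setEdgeValue(N src, N dst, EV value);
    }

Because algorithms are written against the interface rather than against a representation, the graph methods themselves are natural places to track which nodes and edges an activity touches, a fact that the execution model of Section 4 exploits.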
A halograph is a figure like Figure 5 that shows an ADT, the set of active nodes in the ADT at some point in the computation, and the associated neighborhoods. It is convenient to think of an activity as the application of an operator to the neighborhood of an active element. It is useful to distinguish between three kinds of operators, as shown in Figure 4.

• Morph: A morph operator may modify the structure of the neighborhood by adding or deleting nodes and edges, and may also update values on nodes and edges.

• Local computation: A local computation operator may update values stored on nodes and edges in the neighborhood, but does not modify the graph structure.

• Reader: A reader does not modify the neighborhood in any way.

We illustrate these concepts using a few algorithms from Table 1. The preflow-push algorithm [16] is a maxflow algorithm in directed graphs, described in detail in Section 6.2. During the execution of this algorithm, some nodes can temporarily have more flow coming into them than is going out. These are the active nodes (in fact, we borrowed the term "active node" from the literature on preflow-push). The algorithm repeatedly selects an active node n and tries to increase the flow along some outgoing edge (n → m) to eliminate the excess flow at n if possible; if it succeeds, the residual capacity on that edge is updated. Therefore, the neighborhood consists of n, its incoming and outgoing edges, and the nodes incident on the outgoing edges. The operator is a local computation operator. Increasing the flow to m may cause m to become active; if n still has excess in-flow, it remains an active node.

A second example is Delaunay mesh refinement, described in Section 2. In this application, the mesh is usually represented by a graph in which nodes represent triangles and edges represent adjacency of triangles. The active nodes in this algorithm are the nodes representing bad triangles, the operator is a morph operator, and the neighborhood of an active node is the cavity of the corresponding bad triangle.

The final example is event-driven simulation. In this application, active nodes are nodes that have messages on their input channels, and the operator is a local computation operator. Local computation algorithms will be discussed in more detail in Section 6, but we mention here that algorithms like preflow-push and event-driven simulation, in which nodes become active in a very dynamic, dataflow-like manner, are classified as data-driven local computation algorithms in that section.
On the other hand, in structure-driven local computation algorithms such as the Jacobi and Gauss-Seidel finite-difference methods, active nodes and neighborhoods are defined completely by the structure of the underlying graph.

Some algorithms may exhibit phase behavior: they use different types of operators at different points in their execution. Barnes-Hut [9] is an example: the tree-building phase uses a morph, as explained in Section 5.1, and the force computation uses a reader, as explained in Section 7.

3.3 Ordering

In general, there are many active nodes in a graph, so a sequential implementation must pick one of them and perform the appropriate computation. In some algorithms such as Delaunay mesh refinement, the implementation is allowed to pick any active node for execution. As mentioned in Section 2, this is an example of don't-care nondeterminism [20]. For some of these algorithms, the output is independent of these implementation choices, a property that is referred to as the Church-Rosser property since the most famous example of this behavior is β-reduction in λ-calculus [8]. Dataflow graph execution [6, 19] and the preflow-push algorithm [16] also exhibit this behavior. In other algorithms, the output may be different for different choices of active nodes, but all such outputs are acceptable, so the implementation can still pick any active node for execution. Delaunay mesh refinement and Petri nets [51] are examples (note that the preflow-push algorithm exhibits this kind of behavior if the algorithm outputs the minimal cut as well as the maximal flow). In contrast, some algorithms dictate an order in which active nodes must be processed. Event-driven simulation is an example: the sequential algorithm for event-driven simulation processes messages in global time-order. The order on active nodes may sometimes be a partial order.

3.4 From algorithms to programs

A natural way to program these algorithms is to use worklists to keep track of active nodes. However, programs written directly in terms of worklists, such as in Figure 2, encode a particular order of processing worklist items, and do not express the don't-care nondeterminism in irregular algorithms like Delaunay mesh refinement. To address this problem, we use the Galois programming model, which is a sequential, object-oriented programming model (such as sequential Java), augmented with two Galois set iterators [41]:

• Unordered-set iterator: for each e in Set S do B(e)
The loop body B(e) is executed for each element e of set S. The order in which iterations execute is indeterminate, and can be chosen by the implementation. There may be dependences between the iterations. When an iteration executes, it may add elements to S.

• Ordered-set iterator: for each e in OrderedSet S do B(e)
This construct iterates over an ordered set S. It is similar to the unordered set iterator above, except that a sequential implementation must choose a minimal element from set S at every iteration. When an iteration executes, it may add new elements to S.

    Mesh mesh = // read in initial mesh
    Workset ws<Triangle>;
    ws.add(mesh.badTriangles());
    for each Triangle t in ws {
      if (t no longer in mesh) continue;
      Cavity c = new Cavity(t);
      c.expand();
      c.retriangulate();
      mesh.update(c);
      ws.add(c.badTriangles());
    }

Figure 6. Delaunay mesh refinement using an unordered set iterator

The body of the iterator is the implementation of the operator for that algorithm. Note that these iterators have a well-defined sequential semantics. The unordered-set iterator permits the specification of don't-care nondeterminism. Event-driven simulation can be written using the ordered-set iterator. Iterators over multi-sets can be defined similarly. Work-sets can be implemented in many ways; for example, in a sequential implementation, unordered work-sets can be implemented using a linked list, and ordered work-sets can be implemented using a heap.
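The main way these work-sets differ from ordinary Java collections is that elements may be added while the set is being drained (Section 3.5 returns to this point). A minimal sketch of a sequential unordered work-set, with names of our own choosing, is shown below; a heap-backed variant would play the same role for the ordered-set iterator.

    // Sketch only: a sequential unordered work-set that tolerates additions
    // while it is being drained. Class and method names are illustrative.
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.function.Consumer;

    final class UnorderedWorkset<T> {
        private final Deque<T> items = new ArrayDeque<>();

        void add(T item) { items.addLast(item); }

        void addAll(Iterable<T> newItems) {
            for (T item : newItems) add(item);
        }

        // Drives "for each e in Set S do B(e)": the processing order is left to
        // the implementation, and the body may call add() to create new work.
        void forEach(Consumer<T> body) {
            while (!items.isEmpty()) {
                T item = items.pollFirst();
                body.accept(item);
            }
        }
    }

For example, the loop of Figure 6 could be driven as ws.forEach(t -> { ... }), with the body putting newly created bad triangles back via ws.add().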
The Galois system provides a data structure library, similar in spirit to the Java collections library, containing implementations of key ADTs such as graphs, trees, grids, work-sets, etc. Application programmers write algorithms in sequential Java, using the appropriate library classes and Galois set iterators. Figure 6 shows pseudocode for Delaunay mesh refinement, written using the unordered Galois set iterator. 3.5 Discussion Set iterators were first introduced in the SETL programming language [36], and can now be found in most object-oriented languages such as Java and C++. However, new elements cannot be added to sets while iterating over them, as is possible with Galois set iterators. The metaphor of operators acting on neighborhoods is reminiscent of notions in term-rewriting systems, and graph grammars in particular [21, 45]. The semantics of functional language programs are usually specified using term rewriting systems (TRS) that describe how expressions can be replaced by other expressions within the context of the functional program. The process of applying rewrite rules repeatedly to a functional language program is known as string or tree reduction. Figure 7. Graph rewriting: single-pushout approach Tree reduction can be generalized in a natural way to graph reduction by using graph grammars as rewrite rules [21, 45]. A graph rewrite rule is defined as a morphism in the category C of labeled graphs with partial graph morphisms as arrows: r: L → R, and a rewriting step is defined by a pushout [45] as shown in Figure 7. In this diagram, G is a graph, and the total morphism m: L → G, which is called a redex, identifies the portion of G that “matches” L, the left-hand side of the rewrite rule. The application of a rule r: L → R at a redex m: L → G leads to a direct derivation (r,m): G ⇒ H given by the pushout in Figure 7. The framework presented in this section is more general because operators are not limited to be a finite set of “syntactic” rewrite rules; for example, it is not clear that Delaunay mesh refinement, described in Section 5.1, can be specified using graph grammars, although some progress along these lines is reported by Verbrugge [65]. From this point of view, morph operators are partial graph morphisms in which the graph structure of L and R may be different, while a local computation operator is one in which the structure of L and R are identical, but the labels may be different. Local computation operators can be viewed as special cases of morph operators, and reader operators as special cases of local computation operators. Whether morphs and local computation operators make imperative-style in-place updates to graphs or create modified copies of the graph in a functional style is a matter of implementation. The definitions in this section can be generalized in the obvious way for algorithms that deal with multiple data structures. In that case, neighborhoods span multiple data structures, and the classification of an operator is relative to a particular data structure. For example, for the standard algorithm that multiplies dense matrices A and B to produce a dense matrix C, the operator is a local computation operator for C, and a reader for A and B. To keep the discussion simple, all algorithms in this paper will be formulated so that they operate on a single data structure. 4. 
Amorphous data-parallelism Figure 5 shows intuitively how opportunities for exploiting parallelism arise in irregular algorithms: if there are many active elements at some point in the computation, each one is a site where a processor can perform computation, subject to neighborhood and ordering constraints. When active nodes are unordered, the neighborhood constraints must ensure that the output produced by executing the activities in parallel is the same as the output produced by executing the activities one at a time in some order. For ordered active elements, this order must be the same as the ordering on active elements. D EFINITION 1. Given a set of active nodes and an ordering on active nodes, amorphous data-parallelism is the parallelism that arises from simultaneously processing active nodes, subject to neighborhood and ordering constraints. Formulating the concept of amorphous data-parallelism in terms of abstract data types permits us to discuss parallelism in irregular algorithms without reference to particular concrete representations of the abstract data types. The job of library implementors and systems designers is to provide the appropriate library classes and systems support to exploit commonly occurring patterns of paral- lelism and locality. A similar approach is central to the success of databases, where one considers parallelism in SQL programs independently of how relations are implemented using B-trees, hashtables, and other complex data structures. Amorphous data-parallelism is a generalization of conventional data-parallelism in which (i) concurrent operations may conflict with each other, (ii) activities can be created dynamically, and (iii) activities may modify the underlying data structure. Not surprisingly, the exploitation of amorphous data-parallelism is more complex than the exploitation of conventional data-parallelism. In this section, we present a baseline implementation that uses optimistic or speculative execution [54, 61, 66] to exploit amorphous dataparallelism. We then show that the overheads of speculative execution can be reduced or even eliminated by exploiting algorithmic structure if it is present. 4.1 Baseline implementation: optimistic parallel execution In the baseline execution model, the graph is stored in sharedmemory, and active nodes are processed by some number of threads. A thread picks an arbitrary active node and speculatively applies the operator to that node, making calls to the graph API to perform operations on the graph as needed. The neighborhood of an activity can be visualized as a blue ink-blot that begins at the active node and spreads incrementally whenever a graph API call is made that touches new nodes or edges in the graph. To ensure that neighborhood constraints are respected, each graph element has an associated exclusive logical lock that must be acquired by a thread before it can access that element. Locks are held until the activity terminates. If a lock cannot be acquired because it is already owned by another thread, a conflict is reported to the runtime system, which rolls back one of the conflicting activities. Lock manipulation is performed entirely by the methods in the graph class; in addition, to enable rollback, each graph API method that modifies the graph makes a copy of the data before modification. 
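The locking and rollback machinery just described can be sketched as follows; ElementLock and UndoLog are invented names, and this is a simplification of what a runtime system such as Galois actually does.

    // Sketch only: exclusive logical locks on graph elements plus an undo log,
    // in the spirit of the baseline optimistic execution model of Section 4.1.
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.concurrent.atomic.AtomicReference;

    final class ConflictException extends RuntimeException {}

    final class ElementLock {
        private final AtomicReference<Thread> owner = new AtomicReference<>();

        // Called by every graph API method before touching the element.
        void acquire() {
            Thread me = Thread.currentThread();
            if (owner.get() == me) return;          // already held by this activity
            if (!owner.compareAndSet(null, me)) {
                throw new ConflictException();      // owned by another activity: conflict
            }
        }

        void release() { owner.compareAndSet(Thread.currentThread(), null); }
    }

    final class UndoLog {
        private final Deque<Runnable> undo = new ArrayDeque<>();

        // Graph methods that modify data register an inverse action before the change.
        void recordUndo(Runnable inverse) { undo.addFirst(inverse); }

        void rollback() {
            for (Runnable inverse : undo) { inverse.run(); } // newest first
            undo.clear();
        }

        void commit() { undo.clear(); }
    }

A thread applying an operator acquires the lock of every node and edge it touches through the graph API and records an inverse action for every modification; on a ConflictException, the runtime invokes rollback() and releases the locks held by the aborted activity, and otherwise it calls commit() and releases the locks when the activity terminates.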
Logical locks provide a simple approach to ensuring that neighborhood constraints are respected, but they can be restrictive since they require neighborhoods of concurrent activities to be disjoint (such as activities i3 and i4 in Figure 5). Non-empty intersection of neighborhoods is permissible if the activities do not modify nodes and edges in the intersection for example. A more general solution than logical locks is to use commutativity conditions as in the Galois system [41]; to keep the discussion simple, we do not describe this here. If active elements are not ordered, the activity terminates when the application of the operator is complete, and all acquired locks are released. If active elements are ordered, active nodes can still be processed in any order, but they must commit in serial order. This can be implemented using a data structure similar to a reorder buffer in out-of-order processors [41]. In this case, locks are released only when the activity commits (or is rolled back by the runtime system). In regular algorithms like stencil computations and FFT, parallelism is independent of runtime values, so it is possible to give closed-form estimates of the amount of parallelism in these algorithms (possibly parameterized by the size of the input). For example, in multiplying N × N matrices, there are N 3 multiplications that can be performed in parallel. In contrast, parallelism in irregular algorithms is a function of runtime values, so closed-form estimates are not possible in general. One measure of parallelism in irregular algorithms is the number of active nodes that can be processed in parallel at each step of the algorithm for a given input, assuming that (i) there are an unbounded number of processors, (ii) an activity takes one time step to execute, (iii) the system has perfect knowledge of neighborhood and ordering constraints so it only executes activities that can complete successfully, and (iv) a maximal set of activities, subject to neighborhood and ordering constraints, is executed at each step. This is called the available parallelism at each step, and a graph showing the available parallelism at each step of execution of an irregular algorithm for a given input is called a parallelism profile. In this paper, we will present parallelism profiles produced by ParaMeter, a tool implemented on top of the Galois system [42]. They can provide substantial insight into the nature of amorphous data-parallelism, as we discuss later in this paper. In practice, handwritten parallel implementations of irregular algorithms rarely use the complex machinery described above for supporting optimistic parallel execution (the Timewarp algorithm [35] for event-driven simulation is the only exception we know to this rule). This is because most algorithms have structure that can be exploited to eliminate the need for some of these mechanisms. Sections 4.2 and 4.3 describe how algorithmic structure can be exploited to reduce the overheads of speculation and to eliminate speculation, respectively. 4.2 Exploiting structure to reduce speculative overheads The following class of operators plays a very important role in applications. D EFINITION 2. An implementation of an operator is said to be cautious if it reads all the elements in its neighborhood before it modifies any of them. A cautious operator is an operator that has a cautious implementation. 
Operators, which were defined abstractly as graph morphisms in Section 3, can usually be implemented in different ways: for example, one implementation might read node A, write to node A, read node B, and write to node B in that order, while a different implementation might perform the two reads before the writes. By Definition 2, the second implementation is cautious, but the first one is not. The operator itself is cautious because it has a cautious implementation.

Cautious operators have the useful property that the neighborhood of an active node can be determined by executing the operator up to the point where it starts to make modifications to the neighborhood. Therefore, at any point in the computation, the neighborhoods of all active nodes can be determined before any modifications are made to the graph.

4.2.1 Zero-copy implementation

For algorithms in which (i) the operator is cautious, and (ii) active elements are unordered, optimistic parallel execution can be implemented without making data copies for recovery in case of conflicts. Most of the algorithms with unordered active elements discussed in Sections 5–7 fall in this category. An efficient implementation of such algorithms is the following. Logical locks associated with nodes and edges are acquired during the read phase of the operator. If a lock cannot be acquired, there is a conflict and the computation is rolled back by the runtime system simply by releasing all locks acquired up to that point; otherwise, all locks are released when the computation terminates. We refer to this as a zero-copy implementation.

Zero-copy implementations should not be confused with two-phase locking, since under two-phase locking, updates to locked objects can be interleaved arbitrarily with acquiring locks on new objects. Strict two-phase locking eliminates the possibility of cascading rollbacks, while cautious operators are more restrictive and permit the implementation of optimistic parallel execution without the need for data copying. Notice that if active elements are ordered, data copying is needed even if the operator is cautious, since conflicting active elements with higher priority may be created after the read-only phase of an activity completes but before that activity commits.

4.2.2 Data-partitioning and lock coarsening

Figure 8. Logical partitioning

In the baseline implementation, the work-set can become a bottleneck if there are a lot of worker threads since each thread must access the work-set repeatedly to obtain work. For algorithms in which (i) the data structure topology is a general graph or grid (i.e., not a tree), and (ii) active elements are unordered, data partitioning can be used to eliminate this bottleneck and to reduce the overhead of fine-grain locking of graph elements. The graph or grid can be partitioned logically between the cores: conceptually, each core is given a color, and a contiguous region of the graph or grid is assigned that color. Instead of a centralized work-set, we can implement one work-set per region and assign work in a data-centric way: if an active element lies in region R, it is put on the work-set for region R, and it will be processed by the core associated with that region. If an active element is not near a region boundary, its neighborhood is likely to be confined to that region, so for the most part, cores can compute independently.
Instead of associating locks with nodes and edges, we associate locks with regions; if a neighborhood expands into a new region, the core processing that activity needs to acquire the lock associated with that region. To keep core utilization high, it is desirable to over-decompose the graph or grid into more regions than there are cores, increasing the likelihood that a core has work to do even if some of its regions are locked by other cores. One disadvantage of this approach is load imbalance, since it is difficult to partition the graph so that computational load is evenly distributed across the partitions, particularly because active nodes can be created dynamically. Application-specific graph partitioners might be useful; dynamic load-balancing using a strategy like work-stealing [2] is another possibility.

4.3 Exploiting structure to eliminate speculation

Speculation was needed in the baseline implementation of Section 4.1 because the execution of different activities was not coordinated in any way, a scheduling policy that we call autonomous scheduling. However, if the computation graph of an algorithm can be generated inexpensively at compile-time or at runtime, it can be used to schedule activities in a more coordinated fashion, eliminating the need for speculation. Figure 9 shows different classes of coordinated scheduling policies that can be used for such algorithms.

Figure 9. Scheduling strategies

4.3.1 Compile-time coordination

If (i) the operator is a structure-driven local computation operator, and (ii) the topology is known statically (for example, it is a grid or a clique), active nodes and neighborhoods can be determined statically, so a static computation graph can be generated, independent of the input to the program (the computation graph is usually parameterized by the size of the input). Regular algorithms such as finite-differences, FFTs, and dense matrix factorizations fall in this category. Given the importance of these algorithms in high-performance computing, it is not surprising that this class of algorithms has received a lot of attention from the research community. For example, there is an enormous literature on dependence analysis, which uses integer linear programming techniques to implicitly build computation graphs from sequential, dense matrix programs [38].

4.3.2 Just-in-time coordination

For most irregular algorithms, active nodes and neighborhoods depend on runtime values, so the best one can do is to generate a computation graph for a particular input. One way to do this for any irregular program written using the Galois model is the following: (i) make a copy of the initial state; (ii) execute the program sequentially, keeping track of active nodes and neighborhoods, and produce a computation graph; (iii) use the computation graph to re-execute the program in parallel without speculation. There are certain obvious disadvantages with this approach in general, but for some algorithms, step (i) can be eliminated and step (ii) can be simplified, so this approach becomes competitive. One instance of this general scheduling strategy is the inspector-executor approach [70].

If (i) the operator is a structure-driven local computation operator but (ii) the topology is not known statically (for example, it is a general sparse graph), we can compute active nodes and neighborhoods at runtime once the input is given, and generate a computation graph for executing the program in parallel. The code for doing this is known as the inspector, and the code that executes the application in parallel is known as the executor. This approach was used by Saltz et al. to implement sparse iterative solvers on distributed-memory computers [70]; in this application, the graph computation is a sparse matrix-vector (MVM) multiplication, and the cost of producing the communication schedule (which can be derived from the computation graph) is amortized over the multiple sparse MVMs performed by the iterative solver.

4.3.3 Runtime coordination

For morph operators, the execution of an activity may modify the neighborhoods of pending activities, so it is usually not possible to write an inspector that can produce the entire computation graph without making any modifications to the input state. For some algorithms, however, it is possible to interleave the generation of the computation graph with the execution of the application code. For algorithms in which (i) the operator is cautious and (ii) active nodes are unordered, a bulk-synchronous execution strategy is possible [64]. The algorithm is executed over several steps. In each step, (i) active nodes are determined, (ii) the neighborhoods of all active nodes are determined by partially executing the operator, (iii) an interference graph for the active nodes is constructed, (iv) a maximal independent set of nodes in the interference graph is determined, and (v) the set of independent active nodes is executed without synchronization. This process can be viewed as constructing and exploiting a computation graph level by level. This approach has been used by Gary Miller to parallelize Delaunay mesh refinement [34]. It can be used even if neighborhoods cannot be determined exactly, provided we can compute over-approximations to neighborhoods [5]. (A sketch of one such bulk-synchronous round is given at the end of this section.)

4.4 Discussion

It is likely that optimistic parallel execution is the only general-purpose approach for exploiting parallelism in algorithms like the standard algorithm for event-driven simulation, in which active nodes are ordered and interfering activities can be created dynamically.3 Since these are properties of the algorithm, this suggests that optimistic parallel execution is required for these kinds of algorithms even if they are written in a functional language. We are not aware of any discussion of this issue in the literature on functional languages. We also note that although coordinated scheduling seems attractive because there is no wasted work from mis-speculation, the overhead of producing a computation graph may limit its usefulness even for those algorithms for which it is feasible.

3 Conservative event-driven simulation [48], described in Section 6.2, uses a different algorithm.

The pattern of parallelism described in this section can be called inter-operator parallelism because it arises from applying an operator at multiple sites in the graph. There are also opportunities to exploit parallelism within a single application of an operator, which we call intra-operator parallelism. Inter-operator parallelism is the dominant parallelism pattern in problems for which neighborhoods are small compared to the overall graph, since it is likely that many activities do not conflict; in these problems, intra-operator parallelism is usually fine-grain, instruction-level parallelism. Conversely, when neighborhoods are large, activities are likely to conflict and intra-operator parallelism may be more important.
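As promised above, the following sketch shows one round of the bulk-synchronous strategy of Section 4.3.3 for an unordered algorithm with a cautious operator. The functional parameters stand in for algorithm-specific code, and all names are ours, not part of any system described in this paper.

    // Sketch only: one bulk-synchronous round. N is the active-node type.
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    final class BulkSynchronousRound<N> {
        interface NeighborhoodFn<T> { Set<T> neighborhoodOf(T active); } // read-only prefix of a cautious operator
        interface ApplyFn<T>        { List<T> apply(T active); }         // full operator; returns new active nodes

        List<N> run(List<N> active, NeighborhoodFn<N> nbhd, ApplyFn<N> op) {
            // (ii) determine every neighborhood before any graph modification
            List<Set<N>> hoods = new ArrayList<>();
            for (N a : active) hoods.add(nbhd.neighborhoodOf(a));

            // (iii)+(iv) greedy maximal independent set of the interference graph:
            // two activities interfere iff their neighborhoods intersect
            boolean[] selected = new boolean[active.size()];
            Set<N> claimed = new HashSet<>();
            for (int i = 0; i < active.size(); i++) {
                boolean clash = false;
                for (N x : hoods.get(i)) { if (claimed.contains(x)) { clash = true; break; } }
                if (!clash) { claimed.addAll(hoods.get(i)); selected[i] = true; }
            }

            // (v) execute the independent activities (in parallel in a real system);
            // deferred activities are carried over to the next round
            List<N> next = new ArrayList<>();
            for (int i = 0; i < active.size(); i++) {
                if (selected[i]) next.addAll(op.apply(active.get(i)));
                else next.add(active.get(i));
            }
            return next;
        }
    }

Each call to run corresponds to one level of the computation graph; activities deferred because of interference are simply retried in a later round.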
Inter/intra-operator parallelism is an example of nested data-parallelism [12]. In this paper, we only consider inter-operator parallelism.

In Sections 5–7, we demonstrate the practical utility of the ideas introduced in this paper by showing how they can be applied to the algorithms in Table 1. The discussion is organized according to the classification of algorithms shown in Figure 4. For some of these algorithms, we describe the optimizations used in handwritten parallel implementations and show how they can be viewed as implementations of the general-purpose optimizations described in this section.

5. Morph algorithms

In this section, we discuss amorphous data-parallelism in algorithms that may morph the structure of the graph by adding or deleting nodes and edges. Although morph operators can be viewed abstractly as replacing sub-graphs with other graphs, it is more insightful to classify them as follows.

• Refinement: Refinement operations make the graph bigger by adding new nodes and edges. In particular, algorithms that build trees top-down, such as Barnes-Hut [9] and Prim's minimal spanning tree (MST) algorithm [16], perform refinement.

• Coarsening: Coarsening operations cluster nodes or sub-graphs together, replacing them with new nodes that represent the cluster. In particular, algorithms that build trees bottom-up, such as agglomerative clustering [67] and Boruvka's MST algorithm [22], perform coarsening.

• General morph: All other operations that modify the graph structure fall in this category. Graph reduction of functional language programs [52] is an example.

In many morph algorithms, the type of the morph operator determines the shape of the parallelism profile. For example, algorithms using coarsening operators on sparse graphs often have a lot of inter-operator parallelism at the start, but the parallelism decreases as the graph becomes smaller. Conversely, refinement algorithms typically exhibit increasing inter-operator parallelism over time, until they run out of work.

Figure 10. Available parallelism in two refinement algorithms: (a) Delaunay mesh refinement; (b) Delaunay triangulation. (Plots of available parallelism versus computation step.)

    1:  Graph g = // read in input graph
    2:  Tree mst; // create empty tree
    3:  mst.setRoot(r); // r is an arbitrary vertex in g
    4:  OrderedWorkset ows<Edge>; // ordered by edge weight
    5:  ows.add(g.getEdges(r));
    6:  for each Edge e in ows {
    7:    Node s = e.source(); // s is in tree
    8:    Node t = e.target();
    9:    if (mst.contains(t)) continue;
    10:   mst.addEdge(t,s); // s becomes parent of t
    11:   for each Edge et in g.getEdges(t) {
    12:     Node u = et.target();
    13:     if (!mst.contains(u))
    14:       ows.add(et);
    15:   }
    16: }

Figure 11. Top-down tree construction: Prim's MST algorithm

5.1 Refinement

Graphs. Delaunay mesh refinement, which was described in Section 2, is an example of the use of morph operators in general graphs. Figure 10(a) shows the parallelism profile for this application for an input of 100,000 triangles. As we can see, the available parallelism increases initially because (i) several bad triangles may be created by each refinement, and (ii) as the mesh gets bigger, bad triangles may move away from each other, reducing cavity conflicts. As work is completed, the available parallelism drops off.
The topology of this application is a general graph, the operator is cautious, and active nodes are not ordered, so implementations can use autonomous scheduling with zero-copy, datapartitioning, and lock coarsening. These are the general-purpose equivalents of application-specific optimizations used in the handparallelized DMR code of Blandford et al. [11]. In this code, the concrete representation for the mesh is carefully crafted for DMR, but there is currently no way to implement this in a system like Galois that is based on a general-purpose parallel data structure library. DMR can also be implemented using run-time coordination, as in the implementation of Miller et al. [34]. Other refinement codes have similar parallelism profiles. For example, Figure 10(b) shows the parallelism profile for Delaunay triangulation, which creates the initial Delaunay mesh from a set of points. In this experiment, the input has 10,000 points. Trees Algorithms that build trees in a top-down fashion use refinement morphs. Figure 11 shows one implementation of Prim’s algorithm for computing minimum spanning trees in undirected graphs. The algorithm builds the tree top-down and illustrates the use of ordered work-sets. Initially, one node is chosen as the “root” of the tree and added to the MST (line 2), and all of its edges are added to the ordered work-set, ordered by weight (lines 4–5). In each iteration, the smallest edge, (s, t) is removed from the workset. Note that node s is guaranteed to be in the tree. If t is not in the MST, (s, t) is added to the tree (line 10), and all edges (t, u) are added to the work-set, provided u is not in the MST (lines 11–15). When the algorithm terminates, all nodes are in the MST. This algorithm has the same asymptotic complexity as the standard algorithm in textbooks [16], but the work-set contains edges rather than nodes. The standard algorithm requires a priority queue in which node priorities can be decreased dynamically. It is unclear whether it is worth generalizing the implementation of ordered-set iterators in a general-purpose system like ours to permit dynamic updates to the ordering. Graph search algorithms such as 15-puzzle solvers implicitly build spanning trees that may not be minimal spanning trees. These can be coded using an unordered set iterator. Top-down tree construction is also used in n-body methods like Barnes-Hut and fast multipole [9]. In many implementations, the tree represents a recursive spatial partitioning in which each leaf contains a single particle. Initially, the tree is a single node, representing the entire space. Particles are then inserted into the tree, splitting leaf nodes as needed to maintain the invariant that a leaf node can contain only a single particle. The work-set in this case is the set of particles, and it is unordered because particles can be inserted into the tree in any order. The operator is cautious, so a zerocopy implementation can be used. In practice, the tree-building phase of n-body methods does not take much time compared to the time for force calculations, which is described in Section 7. 5.2 Coarsening There are three main ways of doing coarsening. • Edge contraction: An edge is eliminated from the graph by fusing the two nodes at its end points and removing redundant edges from the resulting graph. • Node elimination: A node is eliminated from the graph, and edges are added as needed between each pair of its erstwhile neighbors. 
• Subgraph contraction: A sub-graph is collapsed into a single node and redundant edges are removed from the graph. Edge contraction is a special case of subgraph contraction.

Coarsening morphs are commonly used in clustering algorithms in data-mining and in algorithms that build trees bottom-up.

5.2.1 Edge contraction

In Figure 12(a), the edge (u, v) is contracted by fusing nodes u and v into a single new node labeled uv, and eliminating redundant edges. When redundant edges are eliminated, the weight on the remaining edge is adjusted in application-specific ways (in the example, the weight on edge (m, uv) may be some function of the weights on edges (m, u), (m, v) and (u, v)). In sparse graphs, each edge contraction affects a relatively small neighborhood, so in sufficiently large graphs, many edge contraction operations can happen in parallel, as shown in Figure 12(b). The darker edges in the fine graph show the edges being contracted, and the coarse graph shows the result of simultaneously contracting all selected edges.

Figure 12. Graph coarsening by edge contraction

    1:  Graph g = // read in input graph
    2:  do { // coarsen the graph by one level
    3:    Graph cg; // create empty coarse graph
    4:    for each Node n in g.getNodes() {
    5:      Node un = g.getUnmatchedNeighbor(n);
    6:      Node rc = cg.createNode();
    7:      n.setRepresentative(rc);
    8:      un.setRepresentative(rc);
    9:    }
    10:   for each Edge e in g.getEdges() {
    11:     Node rs = e.source().getRepresentative();
    12:     Node rt = e.target().getRepresentative();
    13:     if (rs != rt)
    14:       projectEdge(cg, rs, rt, e.getWeight(), +);
    15:   }
    16:   g = cg;
    17: } while (!g.coarseEnough());

    18: projectEdge(Graph g, Node s, Node t, float w, fn: float*float->float) {
    19:   if (!g.hasEdge(s,t))
    20:     g.addEdge(s,t,w);
    21:   else {
    22:     Edge e = g.getEdge(s,t);
    23:     float wOld = e.getWeight();
    24:     e.setWeight(fn(wOld, w));
    25:   }
    26: }

Figure 13. Edge contraction in Metis

Graphs. Graph coarsening by edge contraction is a key step in important algorithms for graph partitioning. For example, the Metis graph partitioner [37] applies edge contraction repeatedly until a coarse enough graph is obtained; the coarse graph is then partitioned, and the partitioning is interpolated back to the original graph. The problem of finding edges that can be contracted in parallel is related to the famous graph problem of finding maximal matchings in general graphs [16]. Efficient heuristics are known for computing maximal cardinality matchings and maximal weighted matchings in graphs of various kinds. These heuristics can be used for coordinated scheduling. In practice, simple heuristics based on randomized matchings seem to perform well, and they are well-suited for autonomous scheduling.

Figure 13 shows pseudocode for graph coarsening as implemented within the Metis graph partitioner [37], which uses randomized matching. Metis actually creates a new graph for representing the coarse graph after edge contraction. Each node n finds an unmatched neighbor un, which is n if no other unmatched neighbor exists, and a new node rc is created to represent the two nodes in the coarse graph (lines 5–8). After all the representative nodes are created, edges are created in the coarse graph; the weight of an edge in the coarse graph is the sum of the weights of the edges mapped to it.
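Figure 13 leaves getUnmatchedNeighbor abstract. The following sketch shows one way a randomized-matching helper could be written; the names, the matched-set bookkeeping, and the uniform random choice are ours, and a Metis-style implementation would typically bias the choice toward heavy edges rather than choosing uniformly.

    // Sketch only: a randomized-matching step behind getUnmatchedNeighbor.
    // 'matched' holds the nodes already claimed in the current coarsening level.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.Set;

    final class RandomMatching {
        private final Random rng = new Random();

        <T> T getUnmatchedNeighbor(T n, List<T> neighborsOfN, Set<T> matched) {
            List<T> candidates = new ArrayList<>();
            for (T m : neighborsOfN) {
                if (!matched.contains(m)) candidates.add(m);
            }
            T chosen = candidates.isEmpty() ? n : candidates.get(rng.nextInt(candidates.size()));
            matched.add(n);       // claim both endpoints so no other node matches with them
            matched.add(chosen);
            return chosen;        // equals n when no unmatched neighbor exists
        }
    }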
Graph g = // read in input graph Forest mst = g.getNodes(); Workset ws<Node> = g.getNodes(); for each Node n in ws { Node m = minWeight(g.getNeighbors(n)); Node l = g.createNode(); for each Node a in g.getNeighbors(n) { if (a == m) continue; float w = g.getEdge(n,a).getWeight(); projectEdge(g, l, a, w, min); } for each Node a in g.getNeighbors(m) { if (a == n) continue; float w = g.getEdge(m,a).getWeight(); projectEdge(g, l, a, w, min); } g.removeNode(n); g.removeNode(m); mst.addEdge(n, m); // add edge to the MST ws.add(l); // add new node back to workset } Figure 14. Edge contraction in bottom-up tree construction: Boruvka’s algorithm for minimum spanning trees Available Parallelism Figure 12. Graph coarsening by edge contraction 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 200 150 100 50 0 0 500 1000 1500 2000 Computation Step Figure 15. Available parallelism in Boruvka’s algorithm In this algorithm, active nodes are unordered and the operator is cautious. Trees Some tree-building algorithms are expressed in terms of edge contraction. Boruvka’s algorithm computes MSTs by performing edge contraction to build trees bottom-up [22]. It does so by performing successive coarsening steps. The pseudocode for the algorithm is given in Figure 14 and uses the projectEdge function from Figure 13. The active nodes are the nodes remaining in the graph, and the MST is initialized with every node in the graph forming its own, single-node tree (line 2), i.e., a forest. In each iteration, an active node n finds the minimum weight edge, (n, m) incident on it (line 5). A new node l is created to represent the merged nodes n and m. This is similar to the code in Figure 13 with the following exception: if there exist edges (n, u) and (m, v) in the fine graph, the edge (l, u) in the updated graph will have weight equal to the minimum weight of the two original edges. The edge (n, m) is then added to the MST to connect the two previously disjoint trees that contained n and m (line 19). Finally, l is added back to the work-set (line 20). This procedure continues until the graph is fully contracted. At this point, the MST structure represents the correct minimum spanning tree of the graph. Figure 15 shows the parallelism profile for Boruvka’s algorithm run on a random graph of 10,000 nodes, with each node connected to 5 others on average. We see that the amount of parallelism starts high, but as the tree is built, the number of active elements decreases and the likelihood that coarsening decisions are independent decreases correspondingly. Interestingly, Kruskal’s algorithm [16] also finds MSTs by performing edge contraction, but it iterates over an ordered work-set of edges, sorted by edge weight. 5.2.2 Node elimination Graph coarsening can also be based on node elimination. Each step removes a node from the graph and inserts edges as needed be- x x u y y z z Figure 16. Graph coarsening by node elimination tween its erstwhile neighbors to make these neighbors a clique, adjusting weights on the remaining nodes and edges appropriately. In Figure 16, node u is eliminated, and edges (x, z) and (y, z) are inserted to make {x, y, z} a clique. This is obviously a cautious operator. Node elimination is the graph-theoretic foundation of matrix factorization algorithms such as Cholesky and LU factorizations. Sparse Cholesky factorization in particular has received a lot of attention in the numerical linear algebra community [27]. 
The new edges inserted by node elimination are called fill, since they correspond to zeros in the matrix representation of the original graph that become non-zeros as a result of the elimination process. Different node elimination orders result in different amounts of fill in general, so ordering nodes for elimination to minimize fill is an important problem. The problem is NP-complete, so heuristics are used in practice. One obvious heuristic is minimum-degree ordering, a greedy ordering that always picks a node of minimum degree at each elimination step. After some number of elimination steps, the remaining graph may be quite dense; at that point, high-performance implementations switch to dense matrix techniques to exploit intra-operator parallelism. Just-in-time coordination is used to schedule computations; in the literature, the inspector phase is called symbolic factorization and the executor phase is called numerical factorization [27].

5.2.3 Sub-graph contraction

It is also possible to coarsen graphs by contracting entire sub-graphs at a time. In the compiler literature, elimination-based algorithms perform dataflow analysis on control-flow graphs by contracting "structured" sub-graphs whose dataflow behavior has a concise description. Dataflow analysis is performed on the smaller graph, and interpolation is used to determine dataflow values within the contracted sub-graph. This idea can be applied recursively to the smaller graph. Sub-graph contraction can be performed in parallel; this approach to parallel dataflow analysis has been studied by Ryder [43] and Soffa [40].

5.3 General morph

Some applications make structural updates that are neither refinements nor coarsenings, but many of these updates may nevertheless be performed in parallel.

Graphs Graph reduction of functional language programs [52] and social network maintenance [32] fall in this category.

Trees Some algorithms build trees using complicated structural manipulations that cannot be classified neatly as top-down or bottom-up construction. For example, when values are inserted into a heap or a red-black tree, the final data structure is a tree, but each insertion may perform complex manipulations of the tree [16, 59]. The structure of the final heap or red-black tree depends on the order of insertions, but for most applications any of these final structures is adequate, so this is an example of don't-care nondeterminism, and it can be expressed using an unordered set iterator. Applications that perform general morph operations on trees may benefit from transactional memory [31] because (i) unlike refinement and coarsening, a general morph operation may in the worst case modify a large portion of the tree, although the modifications made by most invocations are confined to a small portion of it, and (ii) there is no easy way to partition a tree (unlike general graphs and grids), so it is not clear how to apply the optimizations described in Section 4.2.

6. Local computation algorithms

Instead of modifying the structure of the graph, many irregular algorithms operate by labeling nodes or edges with data values and updating these values repeatedly until some termination condition is reached. Updating the data on a node or edge may require reading or writing data values in its neighborhood. In some applications, the data for the updates are supplied externally. These algorithms are called local computation algorithms in this paper.
To avoid verbosity, we assume in the rest of the discussion that values are stored only on the nodes of the graph, and we refer to an assignment of values to nodes as the state of the system. Local computation algorithms come in two flavors.

• Structure-driven algorithms: In structure-driven algorithms, active nodes and neighborhoods are determined by the graph structure (and, of course, the operator) and do not depend on the data values computed by the algorithm. Cellular automata [69] and finite-difference algorithms such as Jacobi and Gauss-Seidel iteration [28] are well-known examples.

• Data-driven algorithms: In these algorithms, updating the value at a node may trigger updates at other nodes of the graph, so nodes become active in a very data-dependent and unpredictable manner. Message-passing AI algorithms such as survey propagation [46] and algorithms for event-driven simulation [48] are examples.

6.1 Structure-driven algorithms

Grids Cellular automata are perhaps the most famous example of structure-driven computations over grids. Grids in cellular automata usually have one or two dimensions. Grid nodes represent cells of the automaton, and the state of a cell c at time t is a function of the states at time t − 1 of the cells in some neighborhood around c. A similar state update scheme is used in finite-difference methods for the numerical solution of partial differential equations (PDEs), where it is known as Jacobi iteration. In this case, the grid arises from spatial discretization of the domain of the PDE, and nodes hold values of the dependent variable of the PDE.

Figure 17 shows a Jacobi iteration that uses a grid called state and a neighborhood called the five-point stencil. Each grid point holds two values, old and new, that are updated at each time step using the stencil. The pseudocode is written so that it uses a single graph (grid); in practice, it is common to use two arrays to represent the old and new values. Many other neighborhoods (stencils) are used in cellular automata [63] and finite-difference methods.

// load initial conditions into state[].old
for time from 1 to nsteps {
  for each <i,j> in [2,n-1]x[2,n-1]
    state[i,j].new = 0.2 * (state[i,j].old
      + state[i-1,j].old + state[i+1,j].old
      + state[i,j-1].old + state[i,j+1].old);
  // copy result into old
  for each <i,j> in [2,n-1]x[2,n-1]
    state[i,j].old = state[i,j].new;
}

Figure 17. Jacobi iteration with a 5-point stencil: (a) the 5-point stencil; (b) code for the 5-point stencil on an n × n grid

A disadvantage of Jacobi iteration is that it requires two arrays for its implementation. More complex update schemes have been designed to get around this problem. Intuitively, all of these schemes blur the sharp distinction between old and new states in Jacobi iteration, so nodes are updated using a mix of old and new values. For example, red-black ordering, or more generally multi-color ordering, assigns a minimal number of colors to nodes in such a way that no node has the same color as any node in its neighborhood. Nodes of a given color therefore form an independent set that can be updated concurrently, so the single global update step of Jacobi iteration is replaced by a sequence of smaller steps, each of which performs in-place updates to all nodes of a given color. For a five-point stencil, two colors suffice because grids are two-colorable; this is the famous red-black ordering. These algorithms can obviously be expressed using unordered set iterators, with one loop for each color.
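The multi-color scheme just described is easy to express in place. The following minimal Python sketch (ours, not from the paper) performs red-black, Gauss-Seidel-style sweeps of the five-point update on a square 2-D list, using 0-based indexing; the two half-sweeps correspond to the one-loop-per-color formulation above.

def red_black_sweep(state, color):
    """One in-place half-step of the 5-point update.
    state: square 2-D list of floats; color: 0 ('red', i+j even) or 1 ('black', i+j odd).
    Only interior points are updated; the boundary holds fixed values."""
    n = len(state)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            if (i + j) % 2 == color:
                state[i][j] = 0.2 * (state[i][j]
                                     + state[i - 1][j] + state[i + 1][j]
                                     + state[i][j - 1] + state[i][j + 1])

def red_black_iteration(state, nsteps):
    """Replace each Jacobi step with two in-place half-steps, one per color."""
    for _ in range(nsteps):
        red_black_sweep(state, 0)   # update all 'red' points
        red_black_sweep(state, 1)   # then all 'black' points

Because every point of one color reads only points of the other color, all points within a half-sweep could be updated concurrently, and only a single copy of the grid is needed.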
In all these algorithms, active nodes and neighborhoods can be determined from the grid structure, which is known at compile time. As a result, compile-time scheduling can be very effective for coordinating the computations. Most parallel implementations of stencil codes partition the grid into blocks, and each processor is responsible for updating the nodes in one block. This data distribution requires less inter-processor communication than alternatives such as row- or column-cyclic distributions. Blocks are updated in parallel, with barrier synchronization between time steps. In terms of the vocabulary of Figure 9, this is coordinated, compile-time scheduling.

General graphs The richest sources of structure-driven algorithms on general graphs are iterative solvers for sparse linear systems, such as the conjugate gradient and GMRES methods. The key operation in these solvers is the sparse matrix-vector multiplication (MVM) y = Ax, in which A is an N × N sparse matrix and x and y are dense vectors. Such a matrix can be viewed as a graph with N nodes in which there is a directed edge from node i to node j with weight Aij if Aij is non-zero. The state of a node i consists of the current values of x[i] and y[i], and at each time step, y[i] is updated using the values of x at the neighbors of i. This can be expressed as an unordered iteration over the nodes of the graph, as shown in Figure 18. In practice, y and x would be represented using dense arrays (a sketch of this array formulation appears at the end of this subsection). Writing the code as in Figure 18 highlights the similarity between sparse MVM and 5-point-stencil Jacobi iteration: in both cases, the state of a node is updated by reading values at its neighbors. The key difference is that in sparse MVM, the graph structure is not known at compile time. To handle this, distributed-memory parallel implementations of sparse MVM use the inspector-executor approach [70], an instance of the just-in-time scheduling described in Section 4.3.

Graph g = // initialize edges of g from A matrix
for each Node i in g.getNodes() {
  g.getNode(i).y = 0;
  for each Node j in g.getNeighbors(i) {
    aij = g.getEdge(i,j).getWeight();
    g.getNode(i).y += aij * g.getNode(j).x;
  }
}

Figure 18. Sparse matrix-vector product

In some applications, the graph is known only during the execution of the program. The augmenting-paths algorithm for maxflow computation is an example [16]. This is an iterative algorithm in which each iteration tries to find a path from the source to the sink in which every edge has non-zero residual capacity. Once such a path is found, the global flow is augmented by an amount δ equal to the minimum residual capacity of the edges on the path, and δ is subtracted from the residual capacity of each of these edges. This operation can be performed in parallel as a structure-driven local computation. However, the structure of the path is discovered during execution, so this computation must be performed either autonomously or by coordinated scheduling at run-time.
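The dense-array formulation mentioned above usually stores A in compressed sparse row (CSR) form. The following minimal Python sketch (ours, not from the paper) is the array counterpart of the graph code in Figure 18.

def spmv_csr(rowptr, colidx, vals, x):
    """y = A*x for A in CSR form: row i's non-zeros are
    vals[rowptr[i]:rowptr[i+1]] in columns colidx[rowptr[i]:rowptr[i+1]].
    Each outer-loop iteration is an independent activity, just as each
    node update is in Figure 18."""
    n = len(rowptr) - 1
    y = [0.0] * n
    for i in range(n):                         # active node i
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[k] * x[colidx[k]]     # read neighbor j = colidx[k]
    return y

# Example: the 2x2 matrix [[4, 1], [0, 3]] times x = [1, 2] gives y = [6, 6].
y = spmv_csr([0, 2, 3], [0, 1, 1], [4.0, 1.0, 3.0], [1.0, 2.0])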
6.2 Data-driven algorithms

In some irregular algorithms, active nodes are not determined by the structure of the graph; instead, there is an initial set of active nodes, and performing an activity causes other nodes to become active, so nodes become active in an unpredictable, data-driven fashion. Examples include the preflow-push algorithm [4] for the maxflow problem, AI message-passing algorithms such as belief propagation and survey propagation [46], Petri nets [51], and discrete-event simulation [35, 48]. In other algorithms, such as some approaches to solving spin Ising models [55], the pattern of node updates is determined by externally generated data such as random numbers. We call these data-driven local computation algorithms.

Although graph topology is typically less important than in structure-driven algorithms, it can nevertheless be useful in reasoning about certain data-driven algorithms. For example, belief propagation is an exact inference algorithm on trees but only an approximate algorithm on general graphs, because of the famous loopy propagation problem [46]. Grid structure is exploited in spin Ising solvers to reduce synchronization [55]. In this section, we describe the preflow-push algorithm, which uses an unordered work-set, and discrete-event simulation, which uses an ordered work-set. Both of these algorithms work on general graphs.

Preflow-push algorithm The preflow-push algorithm was introduced earlier in Section 3.2. Active nodes are nodes that have excess in-flow at intermediate stages of the algorithm (unlike the augmenting-paths maxflow algorithm, which maintains a valid flow at all times [16]). The algorithm also maintains at each node a value called height, which is a lower bound on the distance from that node to the sink. The algorithm performs two operations, push and relabel, at active nodes to update the flow and height values respectively, until termination. The algorithm does not specify an order in which the push and relabel operations must be applied, so it can be expressed using an unordered iterator.

Figure 19. Available parallelism in preflow-push (available parallelism vs. computation step)

Figure 19 shows the available parallelism in preflow-push for a 512 × 512 grid-shaped input graph. In preflow-push, as in most local computation codes, the amount of parallelism is governed by the amount of work, since the likelihood that two neighborhoods overlap remains roughly constant throughout execution. This can be seen in the relatively stable amount of parallelism at the beginning of the computation; then, as the amount of work decreases, the parallelism decreases as well.

Event-driven simulation Event-driven simulation, which was introduced in Section 2, can be written using an ordered set iterator, where messages are three-tuples ⟨value, edge, time⟩ and the ordered set maintains messages sorted in time order. In the literature, there are two approaches to parallel event-driven simulation, called conservative and optimistic event-driven simulation [48]. Conservative event-driven simulation reformulates the algorithm so that, in addition to sending data messages, processing stations also send out-of-band messages to update the time at their neighbors. Each processing station can then operate autonomously without fear of deadlock, and there is no need to maintain an ordered event list. This is an example of algorithm reformulation that replaces an ordered work-set with an unordered work-set.
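For concreteness, the sequential ordered-work-set formulation that both parallelization strategies start from can be sketched with a binary heap. This sketch is ours; the event tuples are reordered as (time, station, value) so the heap sorts on time, and the process function standing in for a station's behavior is a placeholder.

import heapq, itertools

def simulate(initial_events, process, until):
    """Sequential event-driven simulation over an ordered work-set.
    initial_events: iterable of (time, station, value) tuples.
    process(station, value, time) -> iterable of new (time, station, value) events."""
    counter = itertools.count()              # tie-breaker so the heap never compares stations
    heap = [(t, next(counter), s, v) for (t, s, v) in initial_events]
    heapq.heapify(heap)                      # the ordered work-set, keyed on time
    while heap:
        time, _, station, value = heapq.heappop(heap)    # earliest message first
        if time > until:
            break
        for (t, s, v) in process(station, value, time):  # handling one message may create others
            heapq.heappush(heap, (t, next(counter), s, v))

The conservative scheme removes the global ordered queue altogether, while Timewarp, discussed next, keeps the ordered semantics but executes events optimistically.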
Timewarp [35] is an optimistic parallel implementation of event-driven simulation (it is probably the first optimistic parallel implementation of any algorithm in the literature). It can be viewed as an application-specific implementation of the ordered set iterators described in Section 4.1, with two key differences. First, threads are allowed to work on new events created by speculative activities that have not yet committed. This increases parallelism but opens up the possibility of cascading roll-backs. Second, instead of the global commit queue of the implementation in Section 4.1, periodic sweeps through the network update the "global virtual time" [35]. Both of these features can be implemented in a general-purpose way in a system like Galois, but more study is needed to determine whether this is worthwhile.

6.3 Discussion

Solutions to fixpoint equations: Iterative methods for computing solutions to systems of equations can usually be formulated in both structure-driven and data-driven ways. A classic example is iterative dataflow analysis in compilers. Systems of dataflow equations can be solved by iterating over all the equations in a Jacobi or Gauss-Seidel fashion until a fixed point is reached; these are structure-driven approaches. Alternatively, the classic worklist algorithm processes an equation only if its inputs have changed; this is a data-driven approach [3] (a sketch of such a worklist solver appears at the end of this section). The underlying graph in this application is the control-flow graph (or its reverse, for backward dataflow problems), and the algorithms can be parallelized in the same manner as other local computation algorithms.

Speeding up local computation algorithms: Local computation algorithms can often be made more efficient by a preprocessing step that coarsens the graph, since this speeds up the flow of information across the graph. This idea is used in the dataflow analysis of programs: an elimination-based dataflow algorithm is used first to coarsen the graph, as described in Section 5.2.3. If the resulting graph is a single node, the solution to the dataflow problem can be read off by inspection; otherwise, iterative dataflow analysis is applied to the coarse graph. In either case, the solution for the coarse graph is interpolated back into the original graph. One of the first hybrid dataflow algorithms along these lines was Allen and Cocke's interval-based algorithm for dataflow analysis [3], which finds and collapses intervals, the single-entry, multiple-exit loop structures in the control-flow graph of the program.

Some local computation algorithms can be sped up by periodically computing and exploiting global information. The global-relabel heuristic in preflow-push is an example of such an operation: preflow-push can be sped up by periodically performing a breadth-first search from the sink to update the height values with more accurate information [4]. In the extreme, global information can be used in every iteration. For example, iterative linear solvers are often sped up by preconditioning the matrix iterations with another matrix, often obtained by a graph coarsening computation such as incomplete LU or Cholesky factorization [28].
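The data-driven worklist formulation mentioned above can be sketched generically as follows. The sketch is ours and is not tied to any particular analysis: the control-flow graph shape, the transfer functions, and the join operation are placeholder parameters.

def worklist_dataflow(cfg, transfer, join, bottom):
    """Generic forward worklist dataflow solver.
    cfg: {node: [successor, ...]}; transfer(node, in_value) -> out_value;
    join(a, b) -> combined value; bottom: initial value at every node.
    A node is re-processed only when one of its inputs may have changed."""
    preds = {n: [] for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].append(n)
    out = {n: bottom for n in cfg}
    work = set(cfg)                          # initially, every node is active
    while work:
        n = work.pop()                       # unordered work-set
        in_val = bottom
        for p in preds[n]:
            in_val = join(in_val, out[p])    # combine values flowing in from predecessors
        new_out = transfer(n, in_val)
        if new_out != out[n]:                # output changed: successors become active
            out[n] = new_out
            work.update(cfg[n])
    return out

For example, for reaching definitions, the values would be sets of definitions, join would be set union, and transfer would apply the usual gen/kill sets of each node.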
7. Reader algorithms

An operator that reads a graph without modifying it in any way is a reader operator for that graph. The Google map-reduce [17] model is an example in which the fixed graph is a set or multi-set, and the map and reduce computations are performed on the underlying values.

Scene s; // read in scene description
Framebuffer fb; // image to generate
Workset ws<Ray> = // initial rays
Workset spawnedws<Ray>;
for each Ray r in ws {
  r.trace(s); // trace ray through scene
  Rays rays = r.spawnRays();
  ws.add(rays);
  spawnedws.add(rays);
}
for each Ray r in spawnedws {
  fb.update(r); // update with r's contribution
}

Figure 20. Ray-tracing

A particularly important class of algorithms in this category performs multiple traversals over a fixed graph. Each traversal maintains its own state, which is updated during the traversal, so the traversals can be performed concurrently. Force computation in n-body algorithms like Barnes-Hut [9] is an example. The construction of the octree is an example of tree refinement, as discussed in Section 5.1. Once the tree is built, the force calculation iterates over the particles, computing the force on each particle by making a top-down traversal of a prefix of the tree. The force computation step of Barnes-Hut is completely parallel, since each particle can be processed independently; the only shared data structure in this phase of the algorithm is the octree, which is not modified.

Another classic example is ray-tracing; simplified pseudocode is shown in Figure 20. Each ray is traced through the scene. If the ray hits an object, new rays may be spawned to account for reflections and refraction, and these rays must also be processed. After the rays have been traced, they are used to construct the rendered image.

The key concern in implementing such algorithms is exploiting locality. One approach to enhancing locality in algorithms such as ray-tracing is to "chunk" similar work together: rays that will propagate through the same portion of the scene are bundled together and processed simultaneously by a given processor. Because these rays require the same scene data, this approach can enhance cache locality. A similar approach can be taken in Barnes-Hut: particles in the same region of space are likely to traverse similar parts of the octree during the force computation and thus can be processed together to improve locality.

In some cases, reader algorithms can be transformed more substantially to further enhance locality. In ray-tracing, bundled rays begin propagating through the scene in the same direction but may eventually diverge. Rather than using an a priori grouping of rays, groups can be updated dynamically as rays propagate through the scene, maintaining locality throughout execution [53].
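The a priori chunking idea can be sketched as a simple spatial bucketing pass before processing. The grid-hashing scheme below is a placeholder of ours and is much cruder than the dynamic regrouping of [53]; the position and trace functions are assumptions, not part of any particular renderer.

from collections import defaultdict

def chunk_by_cell(items, position, cell_size):
    """Group work items (rays, particles, ...) by the spatial cell of their
    position, so that one chunk touches a small region of the shared scene or octree."""
    buckets = defaultdict(list)
    for item in items:
        x, y, z = position(item)
        cell = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
        buckets[cell].append(item)
    return list(buckets.values())

def process_chunks(items, position, trace, cell_size=1.0):
    """Process each spatial chunk as a unit; chunks are independent read-only
    traversals and could be handed to different processors."""
    for chunk in chunk_by_cell(items, position, cell_size):
        for item in chunk:
            trace(item)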
8. Conclusions and future work

For more than thirty years, the parallel programming community has used computation graphs to reason about parallelism in algorithms. In this paper, we argued that this program-centric abstraction is inadequate for most irregular algorithms, and that it should be replaced by a data-centric view of algorithms called the operator formulation of irregular algorithms and an associated abstraction called the halograph. This alternate view of irregular algorithms reveals that a generalized form of data-parallelism called amorphous data-parallelism is ubiquitous in irregular algorithms; however, the amount of amorphous data-parallelism in an irregular algorithm is usually dependent on runtime values and varies from point to point in the execution of the algorithm. The operator formulation of irregular algorithms also leads to a structural classification of irregular algorithms that provides insight into their patterns of amorphous data-parallelism. In particular, regular algorithms emerge as a special and important class of irregular algorithms.

Optimistic or speculative execution is essential for exploiting amorphous data-parallelism in many irregular algorithms, even if these algorithms are written in a declarative language; this overturns the commonly held belief that speculative execution is a crutch needed only when we do not have enough insight into the structure of parallelism in programs. However, the overheads of speculative execution can be reduced dramatically in most cases by exploiting algorithmic structure. Finally, we showed that many optimizations used in handwritten parallel implementations of irregular algorithms can be incorporated into a general-purpose system, since these optimizations exploit algorithmic structure and are usually not specific to particular algorithms.

In the literature, there are many efforts to identify parallelism patterns [47, 50, 57, 58]. The scope of these efforts is broader than ours: for example, they seek to categorize implementation mechanisms, such as whether task-queues or the master-worker approach is used to distribute work. Like us, Snir distinguishes between ordered and unordered work [58]. We make finer distinctions than are made in the Berkeley motifs [50]: for example, dense matrix algorithms constitute a single Berkeley motif, but we would put iterative solvers in a different category (local computation) than dense matrix factorizations (morphs), and their parallel behavior is very different.

Our classification of irregular algorithms should also be of interest to researchers working on applying machine learning techniques to program optimization problems [1, 62]. The approach taken in this line of research is to classify programs according to a set of features; current approaches use fairly simple features such as loop nesting level or depth of dependences. The classification presented in this paper provides a set of semantics-based features that might perform better than syntactic features, at least for high-level optimizations.

In ongoing research, we are investigating the following questions.

(1) The approach taken in this paper relies on a library of parallel data structure implementations, similar to the Java collections classes. Is this adequate, or do data structure implementations have to be customized to each algorithm to obtain high performance? If so, what language and systems support would make it easier to produce high-performance implementations of these data structures? Is it possible to synthesize these implementations from high-level specifications?

(2) We have a lot of insight into exploiting structure in unordered algorithms, as described in Section 4, but relatively little insight into exploiting structure in ordered algorithms. What structure is exploited by handwritten implementations of these algorithms, and how do we incorporate these ideas into a general-purpose system? Is there a general way of transforming ordered algorithms into unordered algorithms, as is done in an application-specific way for event-driven simulation?

(3) How should information needed for efficient execution, such as scheduling policies and the partitioning of data, be added to Galois programs? Are concepts from the literature on program refinements useful for doing this in a disciplined manner [10]?

(4) Current architectural mechanisms for supporting parallel programming, such as hardware transactional memory [31], are blunt instruments that do not exploit the plentiful structure in irregular algorithms.
The discussion in this paper suggests that it may be more fruitful for the architecture community to focus on particularly recalcitrant categories of irregular algorithms, such as ordered algorithms or algorithms that perform general morphs on trees. What architectural support is appropriate for these categories of algorithms?

Acknowledgements: We would like to thank Arvind (MIT), Don Batory (UT Austin), Jim Browne (UT Austin), Calin Cascaval (IBM Yorktown Heights), Jack Dennis (MIT), David Kuck (Intel), Charles Leiserson (MIT), Jayadev Misra (UT Austin), David Padua (Illinois), Burton Smith (Microsoft), Marc Snir (Illinois), and Jeanette Wing (NSF) for useful discussions.

References

[1] F. Agakov, E. Bonilla, J. Cavazos, G. Fursin, B. Franke, M.F.P. O'Boyle, M. Toussant, J. Thomson, and C. Williams. Using machine learning to focus iterative optimization. In CGO, March 2006.
[2] Kunal Agrawal, Charles E. Leiserson, Yuxiong He, and Wen Jing Hsu. Adaptive work-stealing with parallelism feedback. ACM Trans. Comput. Syst., 26(3):1–32, 2008.
[3] A. Aho, R. Sethi, and J. Ullman. Compilers: principles, techniques, and tools. Addison Wesley, 1986.
[4] Richard J. Anderson and Joao C. Setubal. On the parallel implementation of Goldberg's maximum flow algorithm. In SPAA, 1992.
[5] Christos D. Antonopoulos, Xiaoning Ding, Andrey Chernikov, Filip Blagojevic, Dimitrios S. Nikolopoulos, and Nikos Chrisochoides. Multigrain parallel Delaunay mesh generation: challenges and opportunities for multithreaded architectures. In ICS '05, 2005.
[6] Arvind and R. S. Nikhil. Executing a program on the MIT tagged-token dataflow architecture. IEEE Trans. on Computers, 39(3), 1990.
[7] D.A. Bader and G. Cong. Fast shared-memory algorithms for computing the minimum spanning forest of sparse graphs. Journal of Parallel and Distributed Computing, 66(11):1366–1378, 2006.
[8] Henk Barendregt. The Lambda Calculus, Its Syntax and Semantics. Elsevier Publishers, 1984.
[9] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(4), December 1986.
[10] Don Batory, Jia Liu, and Jacob Neal Sarvela. Refinements and multidimensional separation of concerns. In ESEC/FSE-11, 2003.
[11] Daniel Blandford, Guy Blelloch, and Clemens Kadow. Engineering a compact parallel Delaunay algorithm in 3D. In Proceedings of the Twenty-second Annual Symposium on Computational Geometry, 2006.
[12] Guy Blelloch. Programming parallel algorithms. Communications of the ACM, 39(3), March 1996.
[13] Boost C++ libraries. http://www.boost.org/.
[14] L. Paul Chew. Guaranteed-quality mesh generation for curved surfaces. In SCG, 1993.
[15] E.F. Codd. Does your DBMS run by the rules? Computerworld, October 1985.
[16] Thomas Cormen, Charles Leiserson, Ronald Rivest, and Clifford Stein, editors. Introduction to Algorithms. MIT Press, 2001.
[17] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[18] B. Delaunay. Sur la sphere vide. Izvestia Akademii Nauk SSSR, Otdelenie Matematicheskikh i Estestvennykh Nauk, 7:793–800, 1934.
[19] Jack Dennis. Dataflow ideas for supercomputers. In Proceedings of CompCon, 1984.
[20] Edsger Dijkstra. A Discipline of Programming. Prentice Hall, 1976.
[21] H. Ehrig and M. Löwe. Parallel and distributed derivations in the single-pushout approach. Theoretical Computer Science, 109:123–143, 1993.
[22] David Eppstein. Spanning trees and spanners. In J. Sack and J. Urrutia, editors, Handbook of Computational Geometry, pages 425–461. Elsevier, 1999.
[23] Ping An et al. STAPL: An adaptive, generic parallel C++ library. In LCPC, 2003.
[24] J.D. Foley, A. van Dam, S.K. Feiner, and J.F. Hughes. Computer Graphics: Principles and Practice. Addison-Wesley, 1990.
[25] D. Gajski, D. Kuck, and D. Padua. Dependence driven computation. In COMPCON 81, 1981.
[26] R. Ghiya and L. Hendren. Is it a tree, a DAG, or a cyclic graph? A shape analysis for heap-directed pointers in C. In POPL, 1996.
[27] J. R. Gilbert and R. Schreiber. Highly parallel sparse Cholesky factorization. SIAM Journal on Scientific and Statistical Computing, 13:1151–1172, 1992.
[28] Gene Golub and Charles van Loan. Matrix Computations. Johns Hopkins Press, 1996.
[29] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73:325–348, August 1987.
[30] L. Hendren and A. Nicolau. Parallelizing programs with recursive data structures. IEEE TPDS, 1(1):35–47, January 1990.
[31] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: architectural support for lock-free data structures. In ISCA, 1993.
[32] Kirsten Hildrum and Philip S. Yu. Focused community discovery. In ICDM, 2005.
[33] S. Horwitz, P. Pfeiffer, and T. Reps. Dependence analysis for pointer variables. In PLDI, 1989.
[34] Benoît Hudson, Gary L. Miller, and Todd Phillips. Sparse parallel Delaunay mesh refinement. In SPAA, 2007.
[35] David R. Jefferson. Virtual time. ACM TOPLAS, 7(3), 1985.
[36] J. T. Schwartz, R. B. K. Dewar, E. Dubinsky, and E. Schonberg. Programming with sets: An introduction to SETL. Springer-Verlag Publishers, 1986.
[37] G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 48(1):96–129, 1998.
[38] K. Kennedy and J. Allen, editors. Optimizing compilers for modern architectures. Morgan Kaufmann, 2001.
[39] Induprakas Kodukula, Nawaaz Ahmed, and Keshav Pingali. Data-centric multi-level blocking. In Proceedings of the ACM Symposium on Programming Language Design and Implementation (PLDI), June 1997.
[40] Robert Kramer, Rajiv Gupta, and Mary Lou Soffa. The combining DAG: A technique for parallel data flow analysis. IEEE Transactions on Parallel and Distributed Systems, 5(8), August 1994.
[41] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L.P. Chew. Optimistic parallelism requires abstractions. In PLDI, 2007.
[42] Milind Kulkarni, Martin Burtscher, Rajasekhar Inkulu, Keshav Pingali, and Calin Cascaval. How much parallelism is there in irregular applications? In PPoPP, 2009.
[43] Y. Lee and B. G. Ryder. A comprehensive approach to parallel data flow analysis. In Supercomputing, pages 236–247, 1992.
[44] Edward Lorenz. Predictability: does the flap of a butterfly's wings in Brazil set off a tornado in Texas? In Proceedings of the American Association for the Advancement of Science, December 1972.
[45] M. Löwe and H. Ehrig. Algebraic approach to graph transformation based on single pushout derivations. In WG '90: Proceedings of the 16th International Workshop on Graph-theoretic Concepts in Computer Science, pages 338–353, New York, NY, USA, 1991. Springer-Verlag New York, Inc.
[46] David MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[47] Timothy Mattson, Beverly Sanders, and Berna Massingill. Patterns for Parallel Programming. Addison-Wesley Publishers, 2004.
[48] Jayadev Misra. Distributed discrete-event simulation. ACM Comput. Surv., 18(1):39–65, 1986.
[49] James Munkres. Topology: a First Course. Prentice-Hall Publishers, 1975.
[50] D. Patterson, K. Keutzer, K. Asanovic, K. Yelick, and R. Bodik. Berkeley dwarfs. http://view.eecs.berkeley.edu/.
[51] J. L. Peterson. Petri nets. ACM Computing Surveys, 9(3):223–252, 1977.
[52] Simon Peyton-Jones. The Implementation of Functional Programming Languages. Prentice-Hall, 1987.
[53] M. Pharr, C. Kolb, R. Gershbein, and P. Hanrahan. Rendering complex scenes with memory-coherent ray tracing. In SIGGRAPH, 1997.
[54] Lawrence Rauchwerger and David A. Padua. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. IEEE Trans. Parallel Distrib. Syst., 10(2):160–180, 1999.
[55] Eunice E. Santos, Shuangtong Feng, and Jeffrey M. Rickman. Efficient parallel algorithms for 2-dimensional Ising spin models. In IPDPS, 2002.
[56] K. Schloegel, G. Karypis, and V. Kumar. Parallel static and dynamic multi-constraint graph partitioning. Concurrency and Computation: Practice and Experience, 14(3):219–240, 2002.
[57] David Skillicorn. Foundations of Parallel Programming. Cambridge University Press, 1994.
[58] Marc Snir. http://wing.cs.uiuc.edu/group/patterns/.
[59] Jon Solworth and Brian Reagan. Arbitrary order operations on trees. In Workshop on Languages and Compilers for Parallel Computing, 1994.
[60] D. Spielman and S.-H. Teng. Parallel Delaunay refinement: Algorithms and analyses. In 11th International Meshing Roundtable, 2002.
[61] J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry. A scalable approach to thread-level speculation. In ISCA '00: Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000.
[62] Mark Stephenson, Saman Amarasinghe, Martin Martin, and Una-May O'Reilly. Meta optimization: improving compiler heuristics with machine learning. SIGPLAN Not., 38(5):77–90, 2003.
[63] Tim Tyler. Cellular automata neighborhood survey. http://cellauto.com/neighbourhood/index.html.
[64] L. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8), 1990.
[65] Clark Verbrugge. A Parallel Solution Strategy for Irregular, Dynamic Problems. PhD thesis, McGill University, 2006.
[66] T. N. Vijaykumar, Sridhar Gopal, James E. Smith, and Gurindar Sohi. Speculative versioning cache. IEEE Trans. Parallel Distrib. Syst., 12(12):1305–1317, 2001.
[67] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Addison Wesley, 2005.
[68] Niklaus Wirth. Algorithms + Data Structures = Programs. Prentice-Hall, 1976.
[69] Stephen Wolfram. A New Kind of Science. Wolfram Media, 2002.
[70] Janet Wu, Raja Das, Joel Saltz, Harry Berryman, and Seema Hiranandani. Distributed memory compiler design for sparse problems. IEEE Transactions on Computers, 44, 1995.