Amorphous Data-parallelism∗
Keshav Pingali1,2, Milind Kulkarni2, Donald Nguyen1, Martin Burtscher2, Mohammed Amber Hassan3, Mario Mendez-Lojo2, Dimitrios Prountzos1, Xin Sui3
1 Department of Computer Science, 2 Institute for Computational Engineering and Sciences, and 3 Department of Electrical and Computer Engineering
The University of Texas at Austin
∗ This work is supported in part by NSF grants 0719966, 0702353, 0615240, 0541193, 0509307, 0509324, 0426787 and 0406380, as well as grants from IBM and Intel Corporation.
Abstract
Most client-side applications running on multicore processors are
likely to be irregular programs that deal with complex, pointer-based data structures such as sparse graphs and trees. However, we
understand very little about the nature of parallelism in irregular
algorithms, let alone how to exploit it effectively on multicore
processors.
In this paper, we define a generalized form of data-parallelism
that we call amorphous data-parallelism and show that it is ubiquitous in irregular algorithms from a variety of disciplines including
data-mining, AI, compilers, networks, and scientific computing.
We also show that the concepts behind amorphous data-parallelism
lead to a natural categorization of irregular algorithms; regular algorithms such as finite-differences and FFTs emerge as an important category of irregular algorithms. We argue that optimistic or
speculative execution is essential for exploiting amorphous data-parallelism in many irregular algorithms, but show that the overheads of speculative execution can be reduced dramatically in most
cases by exploiting algorithmic structure. Finally, we show how
these insights can guide the design of programming languages, system support, and architectures for exploiting parallelism.
1. Introduction
That which has form is not real; only the amorphous endures.
When you understand this, you will not return to illusion.
— Ashtavakra Gita

Application/domain        Algorithms
Data-mining               Agglomerative clustering [67], k-means [67]
Bayesian inference        Belief propagation [46], survey propagation [46]
Compilers                 Iterative dataflow algorithms [3], elimination-based algorithms [3]
Functional interpreters   Graph reduction [52], static and dynamic dataflow [6, 19]
Maxflow                   Preflow-push [16], augmenting paths [16]
Minimal spanning trees    Prim's [16], Kruskal's [16], Boruvka's [22] algorithms
N-body methods            Barnes-Hut [9], fast multipole [29]
Graphics                  Ray-tracing [24]
Network science           Social network maintenance [32]
System modeling           Petri-net simulation [51]
Event-driven simulation   Chandy-Misra-Bryant [48], Timewarp [35]
Meshing                   Delaunay mesh generation [14], refinement [60], Metis-style graph partitioning [37]
Linear solvers            Sparse MVM [28], sparse Cholesky factorization [27]
PDE solvers               Finite-difference codes [28]
Table 1. Problem domains and algorithms
The parallel programming community has a deep understanding
of parallelism and locality in dense matrix algorithms. However,
outside of computational science, most algorithms are irregular:
that is, they are organized around pointer-based data structures
such as trees and graphs, not dense arrays. Table 1 is a list of
algorithms from important problem domains. Only finite-difference
codes use dense matrices; the rest use trees and sparse graphs.
Unfortunately, we currently have few insights into the structure
of parallelism and locality in irregular algorithms, and this has
stunted the development of techniques and tools that make it easier
to produce parallel implementations of these algorithms.
Domain specialists have written parallel programs for some of
the algorithms in Table 1 (see [7, 27, 34, 35, 56, 60] among others). There are also parallel graph libraries such as Boost [13]
and Stapl [23]. These efforts are mostly problem-specific, and it is
difficult to extract broadly applicable abstractions, principles, and
mechanisms from these implementations. Automatic parallelization using points-to analysis [33] and shape analysis [26, 30] has been successful in parallelizing some irregular programs that use trees, such as n-body methods. However, most of the applications
in Table 1 are organized around large sparse graphs with no particular structure. These difficulties have seemed insurmountable, so
irregular algorithms remain the Cinderella of parallel programming
in spite of their central role in applications.
In this paper, we argue that these problems can be solved only
by changing the abstractions that we use to reason about parallelism
in algorithms. The most widely used abstraction is the computation graph (also known as the dependence graph) in which nodes
represent computations, and edges represent dependences between
computations [25]; for programs with loops, dependence edges are
often labeled with additional information such as dependence distances and directions [38]. Computations that are not ordered by
the transitive closure of the dependence relation can be executed
in parallel. Computation graphs produced either by studying the
algorithm or by compiler analysis of programs are referred to as
static computation graphs. Although static computation graphs are
used extensively for implementing regular algorithms, they are not
adequate for modeling parallelism in irregular algorithms. As we
explain in Section 2, the most important reason for this is that dependences between computations in irregular algorithms are functions of runtime data values, so they cannot be represented usefully
by a static computation graph.
To address the need for better abstractions, this paper proposes a
new way of thinking about irregular algorithms, which we call the
operator formulation of algorithms, and introduces an associated
data-centric abstraction [39] called the halograph. Our point of departure is Niklaus Wirth’s aphorism “Program = Algorithm + Data
Structure” [68]; instead of measuring parallelism in instrumented
programs as is commonly done, we study parallelism in algorithms
expressed using operations on abstract data types, independently
of the concrete data structures used to represent those abstract data
types. In this spirit, we replace program-centric concepts like dependence distances and directions with data-centric concepts called
active nodes and neighborhoods; one important consequence of this
change in perspective is that regular algorithms emerge as an important special class of irregular algorithms in much the same way
that metric spaces are an important special class of general topologies in mathematics [49]. This operator formulation of irregular
algorithms also leads to a novel structural classification of irregular algorithms, discussed in Section 3. In Section 4, we introduce
amorphous data-parallelism, a generalization of data-parallelism
in regular algorithms, and show how halographs provide a useful
abstraction for reasoning about this parallelism. We argue that optimistic or speculative execution is the only general approach to
exploiting amorphous data-parallelism, but we also show that the
overhead of speculative execution can be reduced by exploiting algorithmic structure when it is present.
Sections 5–7 demonstrate the practical utility of these ideas by
showing how they can be used to implement the algorithms in Table 1. This discussion is organized along the lines of the structural
classification of algorithms introduced in Section 3. We show how
amorphous data-parallelism arises in different classes of irregular
algorithms; whenever possible, we also describe the optimizations
used in handwritten parallel implementations of particular algorithms, and show how these optimizations can be viewed in our
framework as exploiting algorithmic structure. Finally, Section 8
discusses how these insights might guide future research in language and systems support for irregular algorithms.
2. Inadequacy of static computation graphs
The need for new algorithmic abstractions for irregular algorithms
can be appreciated by considering Delaunay Mesh Refinement
(DMR) [14], an irregular algorithm used extensively in finite-element meshing and graphics. The Delaunay triangulation for a
set of points in a plane is the triangulation in which each triangle
satisfies a certain geometric constraint called the Delaunay condition [18]. In many applications, triangles are required to satisfy
additional quality constraints, and this is accomplished by a process
of iterative refinement that repeatedly fixes “bad” triangles (those
that do not satisfy the quality constraints) by adding new points
to the mesh and re-triangulating. Refining a bad triangle by itself
may violate the Delaunay property of triangles around it, so it is
necessary to compute a region of the mesh called the cavity of the
bad triangle, and replace all the triangles in the cavity with new triangles. Figure 1 illustrates this process; the darker-shaded triangles
are “bad” triangles, and the cavities are the lighter shaded regions
around the bad triangles. Re-triangulating a cavity may generate
new bad triangles, but it can be shown that at least in 2D, this iterative refinement process will ultimately terminate and produce a
guaranteed-quality mesh. Different orders of processing bad triangles lead to different meshes, although all such meshes satisfy the
quality constraints and are acceptable outcomes of the refinement
process [14]. This is an example of don’t-care non-determinism,
also known as committed-choice non-determinism [20]. Figure 2
shows the pseudocode for mesh refinement. Each iteration of the
while loop refines one bad triangle; we call this computation an
activity.
[Figure 1. Fixing bad triangles: (a) unrefined mesh; (b) refined mesh]
Mesh mesh = // read in initial mesh
Worklist wl<Triangle>;
wl.add(mesh.badTriangles());
while (wl.size() != 0) {
  Triangle t = wl.get(); // get bad triangle
  if (t no longer in mesh) continue;
  Cavity c = new Cavity(t);
  c.expand();
  c.retriangulate();
  mesh.update(c);
  wl.add(c.badTriangles());
}
Figure 2. Pseudocode of the mesh refinement algorithm
DMR can be performed in parallel since bad triangles whose
cavities do not overlap can be refined in parallel. Note that two triangles whose cavities overlap can be refined in either order but not
concurrently. Unfortunately, static computation graphs are inadequate for reasoning about parallelism in this algorithm. In Delaunay mesh refinement, as in most irregular algorithms, dependences
between activities are functions of runtime values: whether or not
two activities in DMR can be executed concurrently depends on
whether the cavities of the relevant triangles overlap, and this depends on the input mesh and how it has been modified by previous
refinements. Since this information is known only during program
execution, a static computation graph for Delaunay mesh refinement must conservatively assume that every activity might interfere
with every other activity.
A second problem with computation graphs is that they do not
model don’t-care non-determinism. In computation graphs, computations can be executed in parallel if they are not ordered by
the transitive closure of the dependence relation. In irregular algorithms like Delaunay mesh refinement, it is often the case that
two activities can be done in either order but cannot be done concurrently, so they are not ordered by dependence and yet cannot be
executed concurrently. Exploiting don’t-care non-determinism can
lead to more efficient parallel implementations, as we discuss in
Section 4.
Figure 3. Conflicts in event-driven simulation
Event-driven simulation [48] illustrates a different kind of complexity exhibited by some irregular algorithms. This application
simulates a network of processing stations that communicate by
sending messages along FIFO links. When a station processes a
message, its internal state may be updated and it may produce zero
or more messages on its outgoing links. Sequential event-driven
simulation is performed by maintaining a global priority queue of
events called the event-list, and processing the earliest event at each
step. In this irregular algorithm, activities correspond to the processing of events.
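For concreteness, the sequential event-list loop can be sketched in plain Java. This is an illustration only, not code from the paper; the routing table, the fixed link delay, and the omitted station state update are assumptions of the sketch.

import java.util.*;

// Minimal sketch of sequential event-driven simulation: a global
// priority queue (the event-list) ordered by time-stamp, processing
// the earliest event at each step.
class SeqEventSim {
  static class Event implements Comparable<Event> {
    final int time; final int station; final String msg;
    Event(int time, int station, String msg) {
      this.time = time; this.station = station; this.msg = msg;
    }
    public int compareTo(Event o) { return Integer.compare(time, o.time); }
  }

  // station -> downstream stations (FIFO links); a placeholder topology
  static Map<Integer, List<Integer>> links = new HashMap<>();

  public static void main(String[] args) {
    links.put(0, List.of(1)); links.put(1, List.of(2)); links.put(2, List.of());
    PriorityQueue<Event> eventList = new PriorityQueue<>();
    eventList.add(new Event(1, 0, "start"));
    while (!eventList.isEmpty()) {
      Event e = eventList.poll();          // earliest event: the active element
      System.out.println("t=" + e.time + " station " + e.station + " handles " + e.msg);
      // processing an event may update station state and emit new events
      for (int succ : links.get(e.station))
        eventList.add(new Event(e.time + 2, succ, e.msg)); // fixed delay: an assumption
    }
  }
}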
In principle, event-driven simulation can be performed in parallel by processing multiple events from the event-list simultaneously, rather than one event at a time. However, unlike in Delaunay
mesh refinement, where it was legal to process bad triangles concurrently provided they were well-separated in the mesh, it may
not be legal to process two events in parallel even if their stations
are far apart in the network. Figure 3 demonstrates this. Suppose
that in a sequential implementation, node A fires and produces a
message with time-stamp 3, and then node B fires and produces a
message with time-stamp 4. Notice that node C must consume the
message with time-stamp 4 before it consumes the message with
time-stamp 5. Therefore, in Figure 3(a), the events at A and C cannot be processed in parallel even though the stations are far apart in the network. (Like in chaos theory, "the flap of a butterfly's wings in Brazil can set off a tornado in Texas" [44].) However, notice that if the message from B to C
in Figure 3(b) had a time-stamp greater than 5, it would have been
legal to process in parallel the events at nodes A and C in Figure 3(a). To determine if two activities can be performed in parallel
at a given point in the execution of the algorithm, we need a crystal
ball to look into the future!
In short, static computation graphs are inadequate abstractions
for irregular algorithms for the following reasons.
1. Dependences between activities in irregular algorithms are usually complex functions of runtime data values, so they cannot
be usefully captured by a static computation graph.
2. Many irregular algorithms exhibit don’t-care non-determinism,
but this is hard to model with computation graphs.
3. Whether or not it is safe to execute two activities in parallel at a
given point in the computation may depend on activities created
later in the computation. This behavior is difficult to model with
computation graphs.
3. Operator formulation of irregular algorithms
To address these problems, this paper introduces a new way of
thinking about irregular algorithms called the operator formulation
of irregular algorithms, and a new data-centric algorithm abstraction called the halograph. Halographs can be defined for any abstract data type (ADT), but we will use the graph ADT to be concrete. The operator formulation leads to a natural classification of
irregular algorithms, shown in Figure 4, which we will use in the
rest of the paper.
3.1 Graphs and graph topology
We use the term graph to refer to the usual directed graph abstract
data type (ADT) that is formulated in terms of (i) a set of nodes V ,
and (ii) a set of edges (⊆ V × V ) between these nodes. Undirected
edges are modeled by a pair of directed edges in the standard
fashion. In some applications, nodes and edges are labeled with
values. In this paper, all irregular algorithms will be written in terms
of this ADT.
In some algorithms, the graphs have a special structure that can
be exploited by the implementation. Trees and grids are particularly
important. Grids are usually represented using dense arrays, but
note that a grid is actually a sparse graph (a 2D grid point that is
not on the boundary is connected to four neighbors). Square dense
matrices are isomorphic to cliques, which are graphs in which there
is a labeled edge between every pair of nodes. At the other extreme,
we have graphs consisting of labeled nodes and no edges; these
are isomorphic to sets and multisets. Some of these topologies are
shown in Figure 4.
[Figure 4. Classification of irregular algorithms]
[Figure 5. Halograph: active elements and neighborhoods]
We will not discuss a particular concrete representation for the graph ADT. The implementation is free to choose a concrete representation that is best suited for a particular algorithm and machine:
for example, grids and cliques may be represented using dense arrays, while sparse graphs may be represented using adjacency lists.
This is similar to the approach taken in relational databases: SQL
programmers use the relation ADT in writing programs, and the
underlying DBMS system is free to implement the ADT using B-trees, hash tables, and other concrete data structures. (Codd's Rule 8, the Physical Data Independence rule: "Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representations or access methods." [15])
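As a deliberately minimal illustration of this separation, the sketch below shows a small graph ADT backed by adjacency lists; a grid or clique could implement the same interface with dense arrays. The interface and names are ours for this example, not the Galois library API.

import java.util.*;

// Sketch: one possible concrete representation (adjacency lists with
// edge labels) hiding behind a small graph ADT.
interface SimpleGraph<L> {
  int createNode();
  void addEdge(int src, int dst, L label);
  L getEdge(int src, int dst);
  Set<Integer> neighbors(int n);
}

class AdjListGraph<L> implements SimpleGraph<L> {
  private final List<Map<Integer, L>> adj = new ArrayList<>();
  public int createNode() { adj.add(new HashMap<>()); return adj.size() - 1; }
  public void addEdge(int src, int dst, L label) { adj.get(src).put(dst, label); }
  public L getEdge(int src, int dst) { return adj.get(src).get(dst); }
  public Set<Integer> neighbors(int n) { return adj.get(n).keySet(); }
}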
3.2 Active nodes and neighborhoods
At each point during the execution of a graph algorithm, there
are certain nodes or edges in the graph where computation might
be performed. Performing a computation may require reading or
writing other nodes and edges in the graph. The node or edge on
which a computation is centered will be called an active element,
and the computation itself will be called an activity. To keep the
discussion simple, we assume from here on that active elements
are nodes. Borrowing terminology from the literature on cellular
automata [63], we refer to the set of nodes and edges that are read
or written in performing the computation as the neighborhood of
that active node. Figure 5 shows an undirected sparse graph in
which the filled nodes represent active nodes, and shaded regions
represent the neighborhoods of those active nodes. Note that in
general, the neighborhood of an active node is distinct from the set
of its neighbors in the graph. A halograph is a figure like Figure 5
that shows an ADT, the set of active nodes in the ADT at some
point in the computation, and the associated neighborhoods.
It is convenient to think of an activity as the application of an
operator to the neighborhood of an active element. It is useful to
distinguish between three kinds of operators, as shown in Figure 4.
• Morph: A morph operator may modify the structure of the
neighborhood by adding or deleting nodes and edges, and may
also update values on nodes and edges.
2 Codd’s
1 Like
in chaos theory, “the flap of a butterfly’s wings in Brazil can set off a
tornado in Texas” [44].
Rule 8 (Physical Data Independence rule): “Application programs
and terminal activities remain logically unimpaired whenever any changes
are made in either storage representations or access methods”. [15]
• Local computation: A local computation operator may update
3.4
values stored on nodes and edges in the neighborhood, but does
not modify the graph structure.
• Reader: A reader does not modify the neighborhood in any way.
A natural way to program these algorithms is to use worklists to
keep track of active nodes. However, programs written directly in
terms of worklists, such as in Figure 2, encode a particular order of
processing worklist items, and do not express the don’t-care nondeterminism in irregular algorithms like Delaunay mesh refinement.
To address this problem, we use the Galois programming model,
which is a sequential, object-oriented programming model (such
as sequential Java), augmented with two Galois set iterators [41]:
We illustrate these concepts using a few algorithms from Table 1. The preflow-push algorithm [16] is a maxflow algorithm in
directed graphs, described in detail in Section 6.2. During the execution of this algorithm, some nodes can temporarily have more
flow coming into them than is going out. These are the active nodes
(in fact, we borrowed the term “active node” from the literature on
preflow-push). The algorithm repeatedly selects an active node n
and tries to increase the flow along some outgoing edge (n → m)
to eliminate the excess flow at n if possible; if it succeeds, the residual capacity on that edge is updated. Therefore, the neighborhood
consists of n, its incoming and outgoing edges, and the nodes incident on the outgoing edges. The operator is a local computation
operator. Increasing the flow to m may cause m to become active;
if n still has excess in-flow, it remains an active node.
A second example is Delaunay mesh refinement, described in
Section 2. In this application, the mesh is usually represented by a
graph in which nodes represent triangles and edges represent adjacency of triangles. The active nodes in this algorithm are the nodes
representing bad triangles, the operator is a morph operator, and the
neighborhood of an active node is the cavity of the corresponding
bad triangle.
The final example is event-driven simulation. In this application, active nodes are nodes that have messages on their input channels, and the operator is a local computation operator. Local computation algorithms will be discussed in more detail in Section 6,
but we mention here that algorithms like preflow-push and event-driven simulation in which nodes become active in a very dynamic, dataflow-like manner are classified as data-driven local computation algorithms in that section. On the other hand, in structure-driven local computation algorithms such as the Jacobi and Gauss-Seidel finite-difference methods, active nodes and neighborhoods
are defined completely by the structure of the underlying graph.
Some algorithms may exhibit phase behavior: they use different
types of operators at different points in their execution. Barnes-Hut [9] is an example: the tree-building phase uses a morph, as
explained in Section 5.1, and the force computation uses a reader,
as explained in Section 7.
3.3 Ordering
In general, there are many active nodes in a graph, so a sequential
implementation must pick one of them and perform the appropriate
computation.
In some algorithms such as Delaunay mesh refinement, the implementation is allowed to pick any active node for execution.
As mentioned in Section 2, this is an example of don't-care non-determinism [20]. For some of these algorithms, the output is independent of these implementation choices, a property that is referred
to as the Church-Rosser property since the most famous example of
this behavior is β-reduction in λ-calculus [8]. Dataflow graph execution [6, 19] and the preflow-push algorithm [16] also exhibit this
behavior. In other algorithms, the output may be different for different choices of active nodes, but all such outputs are acceptable,
so the implementation can still pick any active node for execution.
Delaunay mesh refinement and Petri nets [51] are examples (note
that the preflow-push algorithm exhibits this kind of behavior if the
algorithm outputs the minimal cut as well as the maximal flow).
In contrast, some algorithms dictate an order in which active
nodes must be processed. Event-driven simulation is an example:
the sequential algorithm for event-driven simulation processes messages in global time-order. The order on active nodes may sometimes be a partial order.
3.4 From algorithms to programs
A natural way to program these algorithms is to use worklists to keep track of active nodes. However, programs written directly in terms of worklists, such as in Figure 2, encode a particular order of processing worklist items, and do not express the don't-care non-determinism in irregular algorithms like Delaunay mesh refinement. To address this problem, we use the Galois programming model, which is a sequential, object-oriented programming model (such as sequential Java), augmented with two Galois set iterators [41]:
• Unordered-set iterator: for each e in Set S do B(e)
The loop body B(e) is executed for each element e of set S.
The order in which iterations execute is indeterminate, and can
be chosen by the implementation. There may be dependences
between the iterations. When an iteration executes, it may add
elements to S.
• Ordered-set iterator: for each e in OrderedSet S do B(e)
This construct iterates over an ordered set S. It is similar to
the unordered set iterator above, except that a sequential implementation must choose a minimal element from set S at every
iteration. When an iteration executes, it may add new elements
to S.
Mesh mesh = // read in initial mesh
Workset ws<Triangle>;
ws.add(mesh.badTriangles());
for each Triangle t in ws {
  if (t no longer in mesh) continue;
  Cavity c = new Cavity(t);
  c.expand();
  c.retriangulate();
  mesh.update(c);
  ws.add(c.badTriangles());
}
Figure 6. Delaunay mesh refinement using an unordered set iterator
The body of the iterator is the implementation of the operator for
that algorithm. Note that these iterators have a well-defined sequential semantics. The unordered-set iterator permits the specification of don’t-care nondeterminism. Event-driven simulation can be
written using the ordered-set iterator. Iterators over multi-sets can
be defined similarly. Work-sets can be implemented in many ways;
for example, in a sequential implementation, unordered work-sets
can be implemented using a linked list, and ordered work-sets can
be implemented using a heap. The Galois system provides a data
structure library, similar in spirit to the Java collections library, containing implementations of key ADTs such as graphs, trees, grids,
work-sets, etc. Application programmers write algorithms in sequential Java, using the appropriate library classes and Galois set
iterators. Figure 6 shows pseudocode for Delaunay mesh refinement, written using the unordered Galois set iterator.
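The sequential reference semantics of the two iterators can also be sketched directly in Java; this is an illustration of the semantics only, not the Galois implementation. Any collection from which elements are drawn in an arbitrary order models the unordered iterator, while the ordered iterator must always yield a minimal element.

import java.util.*;
import java.util.function.BiConsumer;

// Sketch of the sequential reference semantics of the two set iterators.
// The body may add new elements to the work-set while iterating.
class SetIterators {
  // Unordered-set iterator: elements come out in an arbitrary order.
  static <E> void forEachUnordered(Deque<E> workSet, BiConsumer<E, Deque<E>> body) {
    while (!workSet.isEmpty()) {
      E e = workSet.poll();            // any element would be legal here
      body.accept(e, workSet);         // body may call workSet.add(...)
    }
  }

  // Ordered-set iterator: a minimal element is chosen at every step.
  static <E> void forEachOrdered(PriorityQueue<E> workSet, BiConsumer<E, PriorityQueue<E>> body) {
    while (!workSet.isEmpty()) {
      E e = workSet.poll();            // minimal element w.r.t. the ordering
      body.accept(e, workSet);
    }
  }
}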
3.5 Discussion
Set iterators were first introduced in the SETL programming language [36], and can now be found in most object-oriented languages such as Java and C++. However, new elements cannot be
added to sets while iterating over them, as is possible with Galois
set iterators.
The metaphor of operators acting on neighborhoods is reminiscent of notions in term-rewriting systems, and graph grammars in
particular [21, 45]. The semantics of functional language programs
are usually specified using term rewriting systems (TRS) that describe how expressions can be replaced by other expressions within
the context of the functional program. The process of applying
rewrite rules repeatedly to a functional language program is known
as string or tree reduction.
Figure 7. Graph rewriting: single-pushout approach
Tree reduction can be generalized in a natural way to graph reduction by using graph grammars as rewrite rules [21, 45]. A graph
rewrite rule is defined as a morphism in the category C of labeled
graphs with partial graph morphisms as arrows: r: L → R, and a
rewriting step is defined by a pushout [45] as shown in Figure 7.
In this diagram, G is a graph, and the total morphism m: L → G,
which is called a redex, identifies the portion of G that “matches”
L, the left-hand side of the rewrite rule. The application of a rule r:
L → R at a redex m: L → G leads to a direct derivation (r,m): G
⇒ H given by the pushout in Figure 7. The framework presented
in this section is more general because operators are not limited
to be a finite set of “syntactic” rewrite rules; for example, it is not
clear that Delaunay mesh refinement, described in Section 5.1, can
be specified using graph grammars, although some progress along
these lines is reported by Verbrugge [65].
From this point of view, morph operators are partial graph morphisms in which the graph structure of L and R may be different,
while a local computation operator is one in which the structure
of L and R are identical, but the labels may be different. Local
computation operators can be viewed as special cases of morph
operators, and reader operators as special cases of local computation operators. Whether morphs and local computation operators
make imperative-style in-place updates to graphs or create modified
copies of the graph in a functional style is a matter of implementation.
The definitions in this section can be generalized in the obvious
way for algorithms that deal with multiple data structures. In that
case, neighborhoods span multiple data structures, and the classification of an operator is relative to a particular data structure. For
example, for the standard algorithm that multiplies dense matrices A and B to produce a dense matrix C, the operator is a local
computation operator for C, and a reader for A and B. To keep the
discussion simple, all algorithms in this paper will be formulated
so that they operate on a single data structure.
4. Amorphous data-parallelism
Figure 5 shows intuitively how opportunities for exploiting parallelism arise in irregular algorithms: if there are many active elements at some point in the computation, each one is a site where
a processor can perform computation, subject to neighborhood and
ordering constraints. When active nodes are unordered, the neighborhood constraints must ensure that the output produced by executing the activities in parallel is the same as the output produced
by executing the activities one at a time in some order. For ordered
active elements, this order must be the same as the ordering on active elements.
DEFINITION 1. Given a set of active nodes and an ordering on
active nodes, amorphous data-parallelism is the parallelism that
arises from simultaneously processing active nodes, subject to
neighborhood and ordering constraints.
Formulating the concept of amorphous data-parallelism in terms
of abstract data types permits us to discuss parallelism in irregular algorithms without reference to particular concrete representations of the abstract data types. The job of library implementors and
systems designers is to provide the appropriate library classes and
systems support to exploit commonly occurring patterns of paral-
lelism and locality. A similar approach is central to the success of
databases, where one considers parallelism in SQL programs independently of how relations are implemented using B-trees, hash tables, and other complex data structures.
Amorphous data-parallelism is a generalization of conventional
data-parallelism in which (i) concurrent operations may conflict
with each other, (ii) activities can be created dynamically, and (iii)
activities may modify the underlying data structure. Not surprisingly, the exploitation of amorphous data-parallelism is more complex than the exploitation of conventional data-parallelism. In this
section, we present a baseline implementation that uses optimistic
or speculative execution [54, 61, 66] to exploit amorphous data-parallelism. We then show that the overheads of speculative execution can be reduced or even eliminated by exploiting algorithmic
structure if it is present.
4.1 Baseline implementation: optimistic parallel execution
In the baseline execution model, the graph is stored in shared memory, and active nodes are processed by some number of
threads. A thread picks an arbitrary active node and speculatively
applies the operator to that node, making calls to the graph API to
perform operations on the graph as needed. The neighborhood of
an activity can be visualized as a blue ink-blot that begins at the
active node and spreads incrementally whenever a graph API call
is made that touches new nodes or edges in the graph. To ensure
that neighborhood constraints are respected, each graph element
has an associated exclusive logical lock that must be acquired by
a thread before it can access that element. Locks are held until the
activity terminates. If a lock cannot be acquired because it is already owned by another thread, a conflict is reported to the runtime
system, which rolls back one of the conflicting activities. Lock
manipulation is performed entirely by the methods in the graph
class; in addition, to enable rollback, each graph API method that
modifies the graph makes a copy of the data before modification.
Logical locks provide a simple approach to ensuring that neighborhood constraints are respected, but they can be restrictive since
they require neighborhoods of concurrent activities to be disjoint
(such as activities i3 and i4 in Figure 5). Non-empty intersection of
neighborhoods is permissible if, for example, the activities do not modify nodes and edges in the intersection. A more general solution
than logical locks is to use commutativity conditions as in the Galois system [41]; to keep the discussion simple, we do not describe
this here.
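The logical-lock protocol can be sketched as follows. The sketch assumes one exclusive lock per graph element, represented by an atomic owner field, and simply aborts the acquiring activity on conflict; the actual runtime also undoes graph updates using the saved copies and may instead roll back the other conflicting activity.

import java.util.*;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: exclusive logical locks on graph elements, acquired as the
// neighborhood of an activity grows; on conflict, release and abort.
class LogicalLocks {
  static class Element {                               // a node or edge of the graph
    final AtomicReference<Object> owner = new AtomicReference<>(null);
  }

  static class Activity {
    final List<Element> held = new ArrayList<>();

    // Called by every graph API method that touches an element.
    boolean acquire(Element e) {
      if (e.owner.get() == this) return true;          // already ours
      if (e.owner.compareAndSet(null, this)) {         // grab the lock
        held.add(e);
        return true;
      }
      return false;                                    // owned by another activity
    }

    void releaseAll() {                                // on commit or rollback
      for (Element e : held) e.owner.set(null);
      held.clear();
    }

    // Lock each element of the neighborhood; abort (here: just undo the
    // locks) if any element is already owned by a conflicting activity.
    boolean run(List<Element> neighborhood) {
      for (Element e : neighborhood) {
        if (!acquire(e)) { releaseAll(); return false; } // conflict: roll back
      }
      // ... apply the operator to the locked neighborhood here ...
      releaseAll();                                    // unordered case: commit now
      return true;
    }
  }
}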
If active elements are not ordered, the activity terminates when
the application of the operator is complete, and all acquired locks
are released. If active elements are ordered, active nodes can still
be processed in any order, but they must commit in serial order.
This can be implemented using a data structure similar to a reorder
buffer in out-of-order processors [41]. In this case, locks are released only when the activity commits (or is rolled back by the
runtime system).
In regular algorithms like stencil computations and FFT, parallelism is independent of runtime values, so it is possible to give
closed-form estimates of the amount of parallelism in these algorithms (possibly parameterized by the size of the input). For example, in multiplying N × N matrices, there are N^3 multiplications that can be performed in parallel. In contrast, parallelism in
irregular algorithms is a function of runtime values, so closed-form
estimates are not possible in general. One measure of parallelism
in irregular algorithms is the number of active nodes that can be
processed in parallel at each step of the algorithm for a given input, assuming that (i) there are an unbounded number of processors, (ii) an activity takes one time step to execute, (iii) the system
has perfect knowledge of neighborhood and ordering constraints so
it only executes activities that can complete successfully, and (iv)
a maximal set of activities, subject to neighborhood and ordering
constraints, is executed at each step. This is called the available
parallelism at each step, and a graph showing the available parallelism at each step of execution of an irregular algorithm for a given
input is called a parallelism profile. In this paper, we will present
parallelism profiles produced by ParaMeter, a tool implemented on
top of the Galois system [42]. They can provide substantial insight
into the nature of amorphous data-parallelism, as we discuss later
in this paper.
In practice, handwritten parallel implementations of irregular
algorithms rarely use the complex machinery described above
for supporting optimistic parallel execution (the Timewarp algorithm [35] for event-driven simulation is the only exception we
know to this rule). This is because most algorithms have structure that can be exploited to eliminate the need for some of these
mechanisms. Sections 4.2 and 4.3 describe how algorithmic structure can be exploited to reduce the overheads of speculation and to
eliminate speculation, respectively.
4.2 Exploiting structure to reduce speculative overheads
The following class of operators plays a very important role in
applications.
DEFINITION 2. An implementation of an operator is said to be
cautious if it reads all the elements in its neighborhood before it
modifies any of them.
A cautious operator is an operator that has a cautious implementation.
Operators, which were defined abstractly as graph morphisms
in Section 3, can usually be implemented in different ways: for
example, one implementation might read node A, write to node
A, read node B, and write to node B in that order, while a different
implementation might perform the two reads before the writes. By
Definition 2, the second implementation is cautious, but the first
one is not. The operator itself is cautious because it has a cautious
implementation.
Cautious operators have the useful property that the neighborhood of an active node can be determined by executing the operator
up to the point where it starts to make modifications to the neighborhood. Therefore, at any point in the computation, the neighborhoods of all active nodes can be determined before any modifications are made to the graph.
4.2.1 Zero-copy implementation
For algorithms in which (i) the operator is cautious, and (ii) active elements are unordered, optimistic parallel execution can be
implemented without making data copies for recovery in case of
conflicts. Most of the algorithms with unordered active elements
discussed in Sections 5–7 fall in this category.
An efficient implementation of such algorithms is the following.
Logical locks associated with nodes and edges are acquired during
the read phase of the operator. If a lock cannot be acquired, there is
a conflict and the computation is rolled back by the runtime system
simply by releasing all locks acquired up to that point; otherwise,
all locks are released when the computation terminates. We refer to
this as a zero-copy implementation.
Zero-copy implementations should not be confused with two-phase locking, since under two-phase locking, updates to locked
objects can be interleaved arbitrarily with acquiring locks on new
objects. Strict two-phase locking eliminates the possibility of cascading rollbacks, while cautious operators are more restrictive and
permit the implementation of optimistic parallel execution without
the need for data copying. Notice that if active elements are ordered, data copying is needed even if the operator is cautious since
conflicting active elements with higher priority may be created after the read-only phase of an activity completes but before that activity commits.
[Figure 8. Logical partitioning]
4.2.2 Data-partitioning and lock coarsening
In the baseline implementation, the work-set can become a bottleneck if there are a lot of worker threads since each thread must
access the work-set repeatedly to obtain work. For algorithms in
which (i) the data structure topology is a general graph or grid (i.e.,
not a tree), and (ii) active elements are unordered, data partitioning
can be used to eliminate this bottleneck and to reduce the overhead of fine-grain locking of graph elements. The graph or grid can
be partitioned logically between the cores: conceptually, each core
is given a color, and a contiguous region of the graph or grid is
assigned that color. Instead of a centralized work-set, we can implement one work-set per region and assign work in a data-centric
way: if an active element lies in region R, it is put on the work-set for region R, and it will be processed by the core associated
with that region. If an active element is not near a region boundary, its neighborhood is likely to be confined to that region, so for
the most part, cores can compute independently. Instead of associating locks with nodes and edges, we associate locks with regions;
if a neighborhood expands into a new region, the core processing
that activity needs to acquire the lock associated with that region.
To keep core utilization high, it is desirable to over-decompose the
graph or grid into more regions than there are cores, increasing the
likelihood that a core has work to do even if some of its regions are
locked by other cores.
One disadvantage of this approach is load imbalance since it is
difficult to partition the graph so that computational load is evenly
distributed across the partitions, particularly because active nodes
can be created dynamically. Application-specific graph partitioners might be useful; dynamic load-balancing using a strategy like
work-stealing [2] is another possibility.
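A minimal sketch of this data-centric assignment is shown below, assuming a partition id is already available for every node (for example, from a geometric or graph partitioner); the class and method names are illustrative.

import java.util.*;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: over-decompose the graph into regions; one work-list and one
// lock per region instead of per-element locks and a central work-set.
class RegionScheduler {
  private final int[] regionOf;                         // node -> region (from a partitioner)
  private final List<Deque<Integer>> regionWork = new ArrayList<>();
  private final List<ReentrantLock> regionLock = new ArrayList<>();

  RegionScheduler(int numRegions, int[] regionOf) {
    this.regionOf = regionOf;
    for (int r = 0; r < numRegions; r++) {
      regionWork.add(new ArrayDeque<>());               // per-region work-list
      regionLock.add(new ReentrantLock());              // coarse lock for the region
    }
  }

  // Active elements are placed on the work-list of their own region,
  // so the core that owns the region usually works independently.
  void addActiveNode(int node) {
    int r = regionOf[node];
    regionLock.get(r).lock();
    try { regionWork.get(r).add(node); } finally { regionLock.get(r).unlock(); }
  }

  // If a neighborhood expands into another region, the activity must also
  // acquire that region's lock; tryLock lets the core back off and pick
  // other work instead of blocking.
  boolean tryExpandInto(int region) { return regionLock.get(region).tryLock(); }

  void releaseRegion(int region) { regionLock.get(region).unlock(); }
}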
4.3 Exploiting structure to eliminate speculation
Speculation was needed in the baseline implementation of Section 4.1 because the execution of different activities was not coordinated in any way, a scheduling policy that we call autonomous
scheduling. However, if the computation graph of an algorithm can
be generated inexpensively at compile-time or at runtime, it can
be used to schedule activities in a more coordinated fashion, eliminating the need for speculation. Figure 9 shows different classes
of coordinated scheduling policies that can be used for such algorithms.
[Figure 9. Scheduling strategies]
4.3.1 Compile-time coordination
If (i) the operator is a structure-driven local computation operator,
and (ii) the topology is known statically (for example, it is a grid
or a clique), active nodes and neighborhoods can be determined
statically, so a static computation graph can be generated, independent of the input to the program (the computation graph is usually
parameterized by the size of the input). Regular algorithms such
as finite-differences, FFTs, and dense matrix factorizations fall in
this category. Given the importance of these algorithms in high-performance computing, it is not surprising that this class of algorithms has received a lot of attention from the research community.
For example, there is an enormous literature on dependence analysis, which uses integer linear programming techniques to implicitly build computation graphs from sequential, dense matrix programs [38].
4.3.2 Just-in-time coordination
For most irregular algorithms, active nodes and neighborhoods depend on runtime values, so the best one can do is to generate a
computation graph for a particular input. One way to do this for
any irregular program written using the Galois model is the following: (i) make a copy of the initial state; (ii) execute the program
sequentially, keeping track of active nodes and neighborhoods, and
produce a computation graph; (iii) use the computation graph to
re-execute the program in parallel without speculation. There are
certain obvious disadvantages with this approach in general, but
for some algorithms, step (i) can be eliminated and step (ii) can be
simplified, so this approach becomes competitive.
One instance of this general scheduling strategy is the inspector-executor approach [70]. If (i) the operator is a structure-driven local
computation operator but (ii) the topology is not known statically
(for example, it is a general sparse graph), we can compute active
nodes and neighborhoods at runtime once the input is given, and
generate a computation graph for executing the program in parallel. The code for doing this is known as the inspector, and the code
that executes the application in parallel is known as the executor.
This approach was used by Saltz et al. to implement sparse iterative
solvers on distributed-memory computers [70]; in this application,
the graph computation is a sparse matrix-vector (MVM) multiplication, and the cost of producing the communication schedule (which
can be derived from the computation graph) is amortized over the
multiple sparse MVM’s performed by the iterative solver.
4.3.3 Runtime coordination
For morph operators, the execution of an activity may modify the
neighborhoods of pending activities, so it is usually not possible to
write an inspector that can produce the entire computation graph
without making any modifications to the input state. For some
algorithms however, it is possible to interleave the generation of
the computation graph with the execution of the application code.
For algorithms in which (i) the operator is cautious and (ii) active nodes are unordered, a bulk-synchronous execution strategy
is possible [64]. The algorithm is executed over several steps. In
each step, (i) active nodes are determined, (ii) the neighborhoods
of all active nodes are determined by partially executing the operator, (iii) an interference graph for the active nodes is constructed,
(iv) a maximal independent set of nodes in the interference graph is
determined, and (v) the set of independent active nodes is executed
without synchronization. This process can be viewed as constructing and exploiting a computation graph level by level. This approach has been used by Gary Miller to parallelize Delaunay mesh refinement [34]. It can be used even if neighborhoods cannot be determined exactly, provided we can compute over-approximations to neighborhoods [5].
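One round of this bulk-synchronous strategy can be sketched generically: given the neighborhoods obtained from the read-only prefix of a cautious operator, claim elements greedily and run the resulting independent set of activities without synchronization. The greedy choice yields a maximal, not maximum, independent set; the data structures below are assumptions of the sketch.

import java.util.*;

// Sketch of one bulk-synchronous scheduling step for a cautious,
// unordered operator: activities whose neighborhoods are disjoint
// are selected greedily and can then run without synchronization.
class BulkSynchronousStep {
  // neighborhoods.get(i) = set of graph-element ids read/written by activity i
  static List<Integer> independentActivities(List<Set<Integer>> neighborhoods) {
    List<Integer> selected = new ArrayList<>();
    Set<Integer> claimed = new HashSet<>();        // elements owned by selected activities
    for (int i = 0; i < neighborhoods.size(); i++) {
      Set<Integer> nbh = neighborhoods.get(i);
      boolean conflicts = false;
      for (int elem : nbh)
        if (claimed.contains(elem)) { conflicts = true; break; }
      if (!conflicts) {                            // greedy: take it, claim its elements
        selected.add(i);
        claimed.addAll(nbh);
      }
    }
    return selected;                               // a maximal independent set
  }
}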
4.4 Discussion
It is likely that optimistic parallel execution is the only general-purpose approach for exploiting parallelism in algorithms like the standard algorithm for event-driven simulation, in which active nodes are ordered and interfering activities can be created dynamically. (Conservative event-driven simulation [48], described in Section 6.2, uses a different algorithm.) Since these are properties of the algorithm, this suggests that optimistic parallel execution is required for these kinds of algorithms even if they are written in a functional language.
We are not aware of any discussion of this issue in the literature
on functional languages. We also note that although coordinated
scheduling seems attractive because there is no wasted work from
mis-speculation, the overhead of producing a computation graph
may limit its usefulness even for those algorithms for which it is
feasible.
The pattern of parallelism described in this section can be called
inter-operator parallelism because it arises from applying an operator at multiple sites in the graph. There are also opportunities to exploit parallelism within a single application of an operator, which we call intra-operator parallelism. Inter-operator parallelism is the dominant parallelism pattern in problems for which
neighborhoods are small compared to the overall graph since it is
likely that many activities do not conflict; in these problems, intra-operator parallelism is usually fine-grain, instruction-level parallelism. Conversely, when neighborhoods are large, activities are likely to conflict and intra-operator parallelism may be more important. Inter/intra-operator parallelism is an example of nested data-parallelism [12]. In this paper, we only consider inter-operator parallelism.
In Sections 5–7, we demonstrate the practical utility of the ideas
introduced in this paper by showing how they can be applied to
the algorithms in Table 1. The discussion is organized according
to the classification of algorithms shown in Figure 4. For some of
these algorithms, we describe the optimizations used in handwritten
parallel implementations and show how they can be viewed as
implementations of the general-purpose optimizations described in
this section.
5. Morph algorithms
In this section, we discuss amorphous data-parallelism in algorithms that may morph the structure of the graph by adding or
deleting nodes and edges. Although morph operators can be viewed
abstractly as replacing sub-graphs with other graphs, it is more insightful to classify them as follows.
• Refinement: Refinement operations make the graph bigger by
adding new nodes and edges. In particular, algorithms that build
trees top-down, such as Barnes-Hut [9] and Prim’s minimal
spanning tree (MST) algorithm [16], perform refinement.
• Coarsening: Coarsening operations cluster nodes or sub-graphs
together, replacing them with new nodes that represent the
cluster. In particular, algorithms that build trees bottom-up,
such as agglomerative clustering [67] and Boruvka’s MST algorithm [22], perform coarsening.
• General morph: All other operations that modify the graph
structure fall in this category. Graph reduction of functional
language programs [52] is an example.
In many morph algorithms, the type of the morph operator determines the shape of the parallelism profile. For example, algorithms using coarsening operators on sparse graphs often have a lot of inter-operator parallelism at the start, but the parallelism decreases as the graph becomes smaller. Conversely, refinement algorithms typically exhibit increasing inter-operator parallelism over time, until they run out of work.
[Figure 10. Available parallelism in two refinement algorithms: (a) Delaunay mesh refinement; (b) Delaunay triangulation. Each plot shows available parallelism versus computation step.]
1:  Graph g = // read in input graph
2:  Tree mst; // create empty tree
3:  mst.setRoot(r); // r is an arbitrary vertex in g
4:  OrderedWorkset ows<Edge>; // ordered by edge weight
5:  ows.add(g.getEdges(r));
6:  for each Edge e in ows {
7:    Node s = e.source(); // s is in tree
8:    Node t = e.target();
9:    if (mst.contains(t)) continue;
10:   mst.addEdge(t,s); // s becomes parent of t
11:   for each Edge et in g.getEdges(t) {
12:     Node u = et.target();
13:     if (!mst.contains(u))
14:       ows.add(et);
15:   }
16: }
Figure 11. Top-down tree construction: Prim's MST algorithm
5.1 Refinement
Graphs Delaunay mesh refinement, which was described in Section 2, is an example of the use of morph operators in general
graphs. Figure 10(a) shows the parallelism profile for this application for an input of 100,000 triangles. As we can see, the available parallelism increases initially because (i) several bad triangles may be created by each refinement, and (ii) as the mesh gets
bigger, bad triangles may move away from each other, reducing
cavity conflicts. As work is completed, the available parallelism
drops off. The topology of this application is a general graph, the
operator is cautious, and active nodes are not ordered, so implementations can use autonomous scheduling with zero-copy, data-partitioning, and lock coarsening. These are the general-purpose equivalents of application-specific optimizations used in the hand-parallelized DMR code of Blandford et al. [11]. In this code, the
concrete representation for the mesh is carefully crafted for DMR,
but there is currently no way to implement this in a system like
Galois that is based on a general-purpose parallel data structure library. DMR can also be implemented using run-time coordination,
as in the implementation of Miller et al. [34].
Other refinement codes have similar parallelism profiles. For
example, Figure 10(b) shows the parallelism profile for Delaunay
triangulation, which creates the initial Delaunay mesh from a set
of points. In this experiment, the input has 10,000 points.
Trees Algorithms that build trees in a top-down fashion use refinement morphs. Figure 11 shows one implementation of Prim’s
algorithm for computing minimum spanning trees in undirected
graphs. The algorithm builds the tree top-down and illustrates the
use of ordered work-sets. Initially, one node is chosen as the “root”
of the tree and added to the MST (line 2), and all of its edges are
added to the ordered work-set, ordered by weight (lines 4–5). In
each iteration, the smallest edge (s, t) is removed from the work-set. Note that node s is guaranteed to be in the tree. If t is not in the
MST, (s, t) is added to the tree (line 10), and all edges (t, u) are
added to the work-set, provided u is not in the MST (lines 11–15).
When the algorithm terminates, all nodes are in the MST.
This algorithm has the same asymptotic complexity as the standard algorithm in textbooks [16], but the work-set contains edges
rather than nodes. The standard algorithm requires a priority queue
in which node priorities can be decreased dynamically. It is unclear
whether it is worth generalizing the implementation of ordered-set
iterators in a general-purpose system like ours to permit dynamic
updates to the ordering.
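For reference, a directly runnable version of Figure 11 using only the Java standard library might look as follows; the adjacency-map representation of the input graph is an assumption of the sketch.

import java.util.*;

// Sketch: Prim's algorithm with an edge-ordered work-set, following
// the structure of Figure 11 but using only the Java standard library.
class PrimMST {
  // g.get(u) maps each neighbor v to the weight of edge (u,v); undirected,
  // so every edge appears in both adjacency maps.
  static Map<Integer, Integer> primParents(Map<Integer, Map<Integer, Double>> g, int root) {
    Map<Integer, Integer> parent = new HashMap<>();   // the tree: child -> parent
    parent.put(root, root);
    // work-set of edges (weight, source-in-tree, target), ordered by weight
    PriorityQueue<double[]> ows = new PriorityQueue<>(Comparator.comparingDouble(e -> e[0]));
    g.get(root).forEach((v, w) -> ows.add(new double[]{w, root, v}));
    while (!ows.isEmpty()) {
      double[] e = ows.poll();
      int s = (int) e[1], t = (int) e[2];
      if (parent.containsKey(t)) continue;            // t already in the tree
      parent.put(t, s);                               // s becomes parent of t
      g.get(t).forEach((u, w) -> { if (!parent.containsKey(u)) ows.add(new double[]{w, t, u}); });
    }
    return parent;
  }
}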
Graph search algorithms such as 15-puzzle solvers implicitly
build spanning trees that may not be minimal spanning trees. These
can be coded using an unordered set iterator.
Top-down tree construction is also used in n-body methods like
Barnes-Hut and fast multipole [9]. In many implementations, the
tree represents a recursive spatial partitioning in which each leaf
contains a single particle. Initially, the tree is a single node, representing the entire space. Particles are then inserted into the tree,
splitting leaf nodes as needed to maintain the invariant that a leaf
node can contain only a single particle. The work-set in this case is
the set of particles, and it is unordered because particles can be inserted into the tree in any order. The operator is cautious, so a zero-copy implementation can be used. In practice, the tree-building
phase of n-body methods does not take much time compared to
the time for force calculations, which is described in Section 7.
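The tree-building phase can be sketched for the 2D case as insertion into a point quadtree; the full Barnes-Hut code uses an octree and also maintains centers of mass, which the sketch omits, and the sketch assumes particle positions are distinct.

// Sketch: top-down construction of a 2D point quadtree, the refinement
// morph used in the tree-building phase of n-body codes. Splitting
// maintains the invariant that a leaf holds at most one particle.
// (Assumes all inserted positions are distinct.)
class QuadTree {
  final double cx, cy, half;            // square cell: center and half-width
  double px, py; boolean hasParticle;   // the single particle of a leaf
  QuadTree[] kids;                      // null for a leaf

  QuadTree(double cx, double cy, double half) { this.cx = cx; this.cy = cy; this.half = half; }

  void insert(double x, double y) {
    if (kids != null) { child(x, y).insert(x, y); return; }   // internal node: recurse
    if (!hasParticle) { px = x; py = y; hasParticle = true; return; }
    split();                                                   // leaf is full: refine it
    child(px, py).insert(px, py);                              // re-insert the old particle
    hasParticle = false;
    child(x, y).insert(x, y);                                  // then insert the new one
  }

  private void split() {
    kids = new QuadTree[4];
    double h = half / 2;
    kids[0] = new QuadTree(cx - h, cy - h, h); kids[1] = new QuadTree(cx + h, cy - h, h);
    kids[2] = new QuadTree(cx - h, cy + h, h); kids[3] = new QuadTree(cx + h, cy + h, h);
  }

  private QuadTree child(double x, double y) {
    return kids[(x < cx ? 0 : 1) + (y < cy ? 0 : 2)];          // pick the quadrant
  }
}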
5.2 Coarsening
There are three main ways of doing coarsening.
• Edge contraction: An edge is eliminated from the graph by
fusing the two nodes at its end points and removing redundant
edges from the resulting graph.
• Node elimination: A node is eliminated from the graph, and
edges are added as needed between each pair of its erstwhile
neighbors.
• Subgraph contraction: A sub-graph is collapsed into a single
node and redundant edges are removed from the graph. Edge
contraction is a special case of subgraph contraction.
Coarsening morphs are commonly used in clustering algorithms
in data-mining and in algorithms that build trees bottom-up.
5.2.1 Edge contraction
In Figure 12(a), the edge (u, v) is contracted by fusing nodes u and
v into a single new node labeled uv, and eliminating redundant
edges. When redundant edges are eliminated, the weight on the
remaining edge is adjusted in application-specific ways (in the
example, the weight on edge (m, uv) may be some function of
the weights on edges (m, u), (m, v) and (u, v)). In sparse graphs,
each edge contraction affects a relatively small neighborhood, so
in sufficiently large graphs, many edge contraction operations can
happen in parallel as shown in Figure 12(b). The darker edges in the
fine graph show the edges being contracted, and the coarse graph
shows the result of simultaneously contracting all selected edges.
1:  Graph g = // read in input graph
2:  do { // coarsen the graph by one level
3:    Graph cg; // create empty coarse graph
4:    for each Node n in g.getNodes() {
5:      Node un = g.getUnmatchedNeighbor(n);
6:      Node rc = cg.createNode();
7:      n.setRepresentative(rc);
8:      un.setRepresentative(rc);
9:    }
10:   for each Edge e in g.getEdges() {
11:     Node rs = e.source().getRepresentative();
12:     Node rt = e.target().getRepresentative();
13:     if (rs != rt)
14:       projectEdge(cg, rs, rt, e.getWeight(), +);
15:   }
16:   g = cg;
17: } while (!g.coarseEnough());
18: projectEdge(Graph g, Node s, Node t, float w,
      fn: float*float->float) {
19:   if (!g.hasEdge(s,t))
20:     g.addEdge(s,t,w);
21:   else {
22:     Edge e = g.getEdge(s,t);
23:     float wOld = e.getWeight();
24:     e.setWeight(fn(wOld, w));
25:   }
26: }
Figure 13. Edge contraction in Metis
Graphs Graph coarsening by edge contraction is a key step in important algorithms for graph partitioning. For example, the Metis
graph partitioner [37] applies edge contraction repeatedly until a
coarse enough graph is obtained; the coarse graph is then partitioned, and the partitioning is interpolated back to the original
graph.
The problem of finding edges that can be contracted in parallel is related to the famous graph problem of finding maximal
matchings in general graphs [16]. Efficient heuristics are known for
computing maximal cardinality matchings and maximal weighted
matchings in graphs of various kinds. These heuristics can be used
for coordinated scheduling. In practice, simple heuristics based on
randomized matchings seem to perform well, and they are well-suited for autonomous scheduling. Figure 13 shows pseudocode
for graph coarsening as implemented within the Metis graph partitioner [37], which uses randomized matching.
Metis actually creates a new graph for representing the coarse
graph after edge contraction. Each node n finds an unmatched
neighbor un, which is n if no other unmatched neighbor exists, and
a new node rc is created to represent the two nodes in the coarse
graph (lines 5–8). After all the representative nodes are created,
edges are created in the coarse graph; the weight of an edge in the
coarse graph is the sum of the weights of the edges mapped to it.
1:  Graph g = // read in input graph
2:  Forest mst = g.getNodes();
3:  Workset ws<Node> = g.getNodes();
4:  for each Node n in ws {
5:    Node m = minWeight(g.getNeighbors(n));
6:    Node l = g.createNode();
7:    for each Node a in g.getNeighbors(n) {
8:      if (a == m) continue;
9:      float w = g.getEdge(n,a).getWeight();
10:     projectEdge(g, l, a, w, min);
11:   }
12:   for each Node a in g.getNeighbors(m) {
13:     if (a == n) continue;
14:     float w = g.getEdge(m,a).getWeight();
15:     projectEdge(g, l, a, w, min);
16:   }
17:   g.removeNode(n);
18:   g.removeNode(m);
19:   mst.addEdge(n, m); // add edge to the MST
20:   ws.add(l); // add new node back to workset
21: }
Figure 14. Edge contraction in bottom-up tree construction: Boruvka's algorithm for minimum spanning trees
[Figure 12. Graph coarsening by edge contraction]
[Figure 15. Available parallelism in Boruvka's algorithm (available parallelism versus computation step)]
In this algorithm, active nodes are unordered and the operator is
cautious.
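The matching step that getUnmatchedNeighbor abstracts in Figure 13 can be sketched as a randomized greedy matching; this is the standard heuristic in spirit, not Metis source code, and the adjacency-list representation is an assumption of the sketch.

import java.util.*;

// Sketch: randomized greedy matching. Nodes are visited in random order;
// each unmatched node is paired with an arbitrary unmatched neighbor,
// or matched with itself if none exists (as in Figure 13).
class RandomMatching {
  // adj.get(n) = neighbors of node n; returns match[n] = partner of n.
  static int[] match(List<List<Integer>> adj) {
    int n = adj.size();
    int[] match = new int[n];
    Arrays.fill(match, -1);
    List<Integer> order = new ArrayList<>();
    for (int i = 0; i < n; i++) order.add(i);
    Collections.shuffle(order);                // randomize the visiting order
    for (int u : order) {
      if (match[u] != -1) continue;            // already matched
      match[u] = u;                            // default: match with itself
      for (int v : adj.get(u)) {
        if (match[v] == -1) { match[u] = v; match[v] = u; break; }
      }
    }
    return match;
  }
}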
Trees Some tree-building algorithms are expressed in terms of
edge contraction. Boruvka’s algorithm computes MSTs by performing edge contraction to build trees bottom-up [22]. It does so
by performing successive coarsening steps. The pseudocode for the
algorithm is given in Figure 14 and uses the projectEdge function
from Figure 13.
The active nodes are the nodes remaining in the graph, and
the MST is initialized with every node in the graph forming its
own, single-node tree (line 2), i.e., a forest. In each iteration, an
active node n finds the minimum-weight edge (n, m) incident on
it (line 5). A new node l is created to represent the merged nodes n
and m. This is similar to the code in Figure 13 with the following
exception: if there exist edges (n, u) and (m, u) in the fine graph, the edge (l, u) in the updated graph will have weight equal to the minimum of the weights of the two original edges. The edge (n, m) is
then added to the MST to connect the two previously disjoint trees
that contained n and m (line 19). Finally, l is added back to the
work-set (line 20). This procedure continues until the graph is fully
contracted. At this point, the MST structure represents the correct
minimum spanning tree of the graph.
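For readers who want something executable, the sketch below mirrors Figure 14 in sequential Python, under the assumptions that node identifiers are integers and edge weights are distinct; each coarse edge carries the original edge that realizes its weight, so the result is reported in terms of edges of the input graph. The name boruvka_mst and the representation are ours.

def boruvka_mst(g):
    # g: dict int -> dict int -> weight (undirected, connected, distinct weights)
    cur = {n: {t: (w, (n, t)) for t, w in nbrs.items()} for n, nbrs in g.items()}
    next_id = max(g) + 1
    mst = []
    work = list(cur)
    while work:
        n = work.pop()
        if n not in cur or not cur[n]:
            continue                          # already contracted away, or isolated
        m = min(cur[n], key=lambda t: cur[n][t][0])   # minimum-weight incident edge
        mst.append(cur[n][m][1])              # record the original edge it came from
        l = next_id; next_id += 1             # fresh node representing n and m
        cur[l] = {}
        for x in (n, m):                      # project surviving edges, keeping the min
            for a, (w, e) in cur[x].items():
                if a in (n, m):
                    continue
                if a not in cur[l] or w < cur[l][a][0]:
                    cur[l][a] = (w, e)
                    cur[a][l] = (w, e)
        for x in (n, m):                      # remove n and m from the graph
            for a in list(cur[x]):
                if a in cur:
                    cur[a].pop(x, None)
        del cur[n], cur[m]
        work.append(l)                        # put the new node back on the work list
    return mst

On a connected graph the loop terminates with a single isolated node and |V| - 1 recorded MST edges.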
Figure 15 shows the parallelism profile for Boruvka’s algorithm
run on a random graph of 10,000 nodes, with each node connected
to 5 others on average. We see that the amount of parallelism starts
high, but as the tree is built, the number of active elements decreases and the likelihood that coarsening decisions are independent decreases correspondingly.
Interestingly, Kruskal’s algorithm [16] also finds MSTs by performing edge contraction, but it iterates over an ordered work-set
of edges, sorted by edge weight.
5.2.2 Node elimination
Figure 16. Graph coarsening by node elimination
Graph coarsening can also be based on node elimination. Each step removes a node from the graph and inserts edges as needed between its erstwhile neighbors to make these neighbors a clique, adjusting weights on the remaining nodes and edges appropriately.
In Figure 16, node u is eliminated, and edges (x, z) and (y, z) are
inserted to make {x, y, z} a clique. This is obviously a cautious operator. Node elimination is the graph-theoretic foundation of matrix
factorization algorithms such as Cholesky and LU factorizations.
Sparse Cholesky factorization in particular has received a lot of
attention in the numerical linear algebra community [27]. The new
edges inserted by node elimination are called fill since they correspond to zeroes in the matrix representation of the original graph
that become non-zeros as a result of the elimination process. Different node elimination orders result in different amounts of fill in
general, so ordering nodes for elimination to minimize fill is an
important problem. The problem is NP-complete, so heuristics are
used in practice. One obvious heuristic is minimal-degree ordering,
which is a greedy ordering that always picks the node with minimal degree at each elimination step. After some number of elimination steps, the remaining graph may be quite dense, and at that
point, high-performance implementations switch to dense matrix
techniques to exploit intra-operator parallelism.
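The elimination operator and the minimal-degree heuristic can be sketched as follows (sequential Python, structure only, ignoring numerical values; eliminate and min_degree_order are illustrative names rather than the interface of any factorization library). Counting the fill edges produced by an ordering is close in spirit to the symbolic factorization phase discussed below.

def eliminate(g, u):
    # g: dict node -> set of neighbors; connect u's neighbors into a clique, drop u
    nbrs = list(g.pop(u))
    fill = []
    for v in nbrs:
        g[v].discard(u)
    for i, v in enumerate(nbrs):
        for w in nbrs[i + 1:]:
            if w not in g[v]:                 # this edge is fill
                g[v].add(w); g[w].add(v)
                fill.append((v, w))
    return fill

def min_degree_order(g):
    g = {v: set(ns) for v, ns in g.items()}   # work on a copy
    order, fill = [], []
    while g:
        u = min(g, key=lambda v: len(g[v]))   # greedily pick a node of minimal degree
        order.append(u)
        fill.extend(eliminate(g, u))
    return order, fill

For the graph of Figure 16, eliminate(g, 'u') would return the two fill edges (x, z) and (y, z), in some order.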
Just-in-time coordination is used to schedule computations; in
the literature, the inspector phase is called symbolic factorization
and the executor phase is called numerical factorization [27].
5.2.3 Sub-graph contraction
It is also possible to coarsen graphs by contracting entire subgraphs at a time. In the compiler literature, elimination-based algorithms perform dataflow analysis on control-flow graphs by contracting “structured” sub-graphs whose dataflow behaviors have a
concise description. Dataflow analysis is performed on the smaller
graph, and interpolation is used to determine dataflow values within
the contracted sub-graph. This idea can be used recursively on the
smaller graph. Sub-graph contraction can be performed in parallel. This approach to parallel dataflow analysis has been studied by
Ryder [43] and Soffa [40].
5.3 General morph
Some applications make structural updates that are neither refinements nor coarsenings, but many of these updates may nevertheless
be performed in parallel.
Graphs Graph reduction of functional language programs [52]
and social network maintenance [32] fall in this category.
Trees Some algorithms build trees using complicated structural
manipulations that cannot be classified neatly as top-down or
bottom-up construction. For example, when values are inserted into
a heap or a red-black tree, the final data structure is a tree, but each
insertion may perform complex manipulations of the tree [16, 59].
The structure of the final heap or red-black tree depends on the
order of insertions, but for most applications, any of these final
structures is adequate, so this is an example of don’t-care nondeterminism, and it can be expressed using an unordered set iterator.
Applications that perform general morph operations on trees
may benefit from transactional memory [31] because (i) unlike refinement and coarsening, a general morph operation may potentially modify a large portion of the tree, but the modifications made
by many of them may be confined to a small portion of the tree, and
(ii) there is no easy way to partition a tree (unlike general graphs
and grids), so it is not clear how one applies the optimizations described in Section 4.2.
6. Local computation algorithms
Instead of modifying the structure of the graph, many irregular
algorithms operate by labeling nodes or edges with data values and
updating these values repeatedly until some termination condition
is reached. Updating the data on a node or edge may require reading
or writing data values in the neighborhood. In some applications,
data for the updates are supplied externally.
These algorithms are called local computation algorithms in this
paper. To avoid verbosity, we assume in the rest of the discussion
that values are stored only on nodes of the graph, and we refer to
an assignment of values to nodes as the state of the system. Local
computation algorithms come in two flavors.
• Structure-driven algorithms: In structure-driven algorithms, ac-
tive nodes and neighborhoods are determined by the graph
structure (and of course the operator), and do not depend on the
data values computed by the algorithm. Cellular automata [69]
and finite-difference algorithms like Jacobi and Gauss-Seidel
iteration [28] are well-known examples.
• Data-driven algorithms: In these algorithms, updating the value
at a node may trigger updates at other nodes in the graph. Therefore, nodes become active in a very data-dependent and unpredictable manner. Message-passing AI algorithms such as survey propagation [46] and algorithms for event-driven simulation [48] are examples.
6.1 Structure-driven algorithms
Grids Cellular automata are perhaps the most famous example
of structure-driven computations over grids. Grids in cellular automata usually have one or two dimensions. Grid nodes represent
cells of the automaton, and the state of a cell c at time t is a function
of the states at time t − 1 of cells in some neighborhood around c.
A similar state update scheme is used in finite-difference methods
for the numerical solution of partial differential equations (PDEs),
where it is known as Jacobi iteration. In this case, the grid arises
from spatial discretization of the domain of the PDE, and nodes
hold values of the dependent variable of the PDE. Figure 17 shows
a Jacobi iteration that uses a grid called state, and a neighborhood called the five-point stencil. Each grid point holds two values
called old and new that are updated at each time-step using the
stencil. The pseudo-code is written so that it uses a single graph
(grid); in practice, it is common to use two arrays to represent the
old and new values. Many other neighborhoods (stencils) are used
in cellular automata [63] and finite-difference methods.
A disadvantage of Jacobi iteration is that it requires two arrays
for its implementation. More complex update schemes have been
designed to get around this problem. Intuitively, all these schemes
blur the sharp distinction between old and new states in Jacobi iteration, so nodes are updated using both old and new values. For
example, red-black ordering, or more generally multi-color ordering, assigns a minimal number of colors to nodes in such a way
that no node has the same color as the nodes in its neighborhood.
Nodes of a given color therefore form an independent set that can
be updated concurrently, so the single global update step of Jacobi
iteration is replaced by a sequence of smaller steps, each of which
performs in-place updates to all nodes of a given color. For a five-point stencil, two colors suffice because grids are two-colorable, and this is the famous red-black ordering. These algorithms can obviously be expressed using unordered set iterators with one loop for each color.
Graph g = // initialize edges of g from A matrix
for each Node i in g.getNodes() {
  g.getNode(i).y = 0;
  for each Node j in g.getNeighbors(i) {
    aij = g.getEdge(i,j).getWeight();
    g.getNode(i).y += aij * g.getNode(j).x;
  }
}
Figure 18. Sparse matrix-vector product
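To illustrate the multi-color scheme described above, here is a minimal sequential NumPy sketch of a red-black sweep that updates a single array in place, using the same 0.2 averaging stencil as Figure 17; the function name is ours, and in a parallel implementation all updates of one color could proceed concurrently.

import numpy as np

def red_black_sweep(state, nsteps):
    # state: 2-D NumPy array whose boundary rows and columns are held fixed
    for _ in range(nsteps):
        for color in (0, 1):                  # one unordered loop per color
            for i in range(1, state.shape[0] - 1):
                for j in range(1, state.shape[1] - 1):
                    if (i + j) % 2 != color:
                        continue
                    # all four neighbors have the other color, so in-place update is safe
                    state[i, j] = 0.2 * (state[i, j]
                                         + state[i - 1, j] + state[i + 1, j]
                                         + state[i, j - 1] + state[i, j + 1])
    return state

grid = np.zeros((8, 8)); grid[0, :] = 1.0     # fixed "hot" boundary row
red_black_sweep(grid, 50)

The updates within a single color write disjoint nodes and can therefore run in parallel or be vectorized with array slicing.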
In all these algorithms, active nodes and neighborhoods can be
determined from the grid structure, which is known at compiletime. As a result, compile-time scheduling can be very effective for
coordinating the computations. Most parallel implementations of
stencil codes partition the grid into blocks, and each processor is
responsible for updating the nodes in one block. This data distribution requires less inter-processor communication than other distributions such as row or column cyclic distributions. Blocks are updated in parallel, with barrier synchronization used between time
steps. In terms of the vocabulary of Figure 9, this is coordinated,
compile-time scheduling.
General graphs The richest sources of structure-driven algorithms on general graphs are iterative solvers for sparse linear systems, such as the conjugate gradient and GMRES methods. The
key operation in these solvers is sparse matrix-vector multiplication (MVM), y = Ax, in which the matrix A is an N × N sparse
matrix and x and y are dense vectors. Such a matrix can be viewed
as a graph with N nodes in which there is a directed edge from
node i to node j with weight Aij if Aij is non-zero. The state of
a node i consists of the current values of x[i] and y[i], and at each
time step, y[i] is updated using the values of x at the neighbors of
i. This can be expressed as unordered iteration over the nodes in
the graph, as shown in Figure 18.
In practice, y and x would be represented using dense arrays.
Writing the code as shown in Figure 18 highlights the similarity between sparse MVM and 5-point stencil Jacobi iteration:
in both cases, the state of a node is updated by reading values
at its neighbors. The key difference is that in sparse MVM, the
graph structure is not known at compile-time. To handle this,
distributed-memory parallel implementations of sparse MVM use
the inspector-executor approach [70], an instance of just-in-time
scheduling, as described in Section 4.3.
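In implementations, the graph of Figure 18 is typically stored in a compressed format such as compressed sparse row (CSR). The Python/NumPy sketch below (our naming, not the Galois API) makes the row-per-activity structure explicit; an inspector can examine the column indices to partition rows among processors before the executor runs.

import numpy as np

def spmv_csr(indptr, indices, data, x):
    # Row i's nonzeros are data[indptr[i]:indptr[i+1]] in columns
    # indices[indptr[i]:indptr[i+1]], i.e., the out-edges of node i.
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):                   # each row/node is an independent activity
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

# A = [[4, 1, 0],
#      [0, 3, 2],
#      [1, 0, 5]]
indptr  = np.array([0, 2, 4, 6])
indices = np.array([0, 1, 1, 2, 0, 2])
data    = np.array([4.0, 1.0, 3.0, 2.0, 1.0, 5.0])
x       = np.array([1.0, 2.0, 3.0])
print(spmv_csr(indptr, indices, data, x))     # [ 6. 12. 16.]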
// load initial conditions into state[].old
for time from 1 to nsteps {
  for each <i,j> in [2,n-1]x[2,n-1]
    state[i,j].new = 0.2 * (state[i,j].old
                  + state[i-1,j].old + state[i+1,j].old
                  + state[i,j-1].old + state[i,j+1].old);
  // copy result into old
  for each <i,j> in [2,n-1]x[2,n-1]
    state[i,j].old = state[i,j].new;
}
(b) Code for 5-point stencil on n × n grid
Figure 17. Jacobi iteration with a 5-point stencil
Figure 19. Available parallelism in preflow-push
In some applications, the graph is known only during the execution of the program. The augmenting paths algorithm for maxflow computations is an example [16]. This is an iterative algorithm in which each iteration tries to find a path from source to sink in which every edge has non-zero residual capacity. Once such a path
is found, the global flow is augmented by an amount δ equal to the
minimal residual capacity of the edges on this path, and δ is subtracted from the residual capacity of each edge on the path. This operation can
be performed in parallel as a structure-driven local computation.
However, the structure of the path is discovered during execution,
so this computation must be performed either autonomously or by
coordinated scheduling at run-time.
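A sequential Python sketch of one such iteration is shown below; it finds a path by breadth-first search over the residual graph (including reverse "undo" edges), computes δ, and updates the flow along the path. It assumes no antiparallel edge pairs, and the name augment_once is ours. Repeating it until it returns 0 yields a maximum flow.

from collections import deque

def augment_once(cap, flow, s, t):
    # cap, flow: dicts keyed by directed edge (u, v); returns delta (0 if no path)
    residual = {e: cap[e] - flow.get(e, 0) for e in cap}
    for (u, v), f in flow.items():            # reverse edges allow flow to be undone
        if f > 0:
            residual[(v, u)] = residual.get((v, u), 0) + f
    parent = {s: None}
    q = deque([s])
    while q and t not in parent:              # BFS for a path of positive residual capacity
        u = q.popleft()
        for (a, b), r in residual.items():
            if a == u and r > 0 and b not in parent:
                parent[b] = u
                q.append(b)
    if t not in parent:
        return 0
    path, v = [], t
    while parent[v] is not None:              # reconstruct the path from t back to s
        path.append((parent[v], v))
        v = parent[v]
    delta = min(residual[e] for e in path)    # bottleneck residual capacity
    for (u, v) in path:
        if (u, v) in cap:
            flow[(u, v)] = flow.get((u, v), 0) + delta
        else:                                 # traversing a reverse edge cancels flow
            flow[(v, u)] -= delta
    return delta

Each augmentation reads and writes only the edges on one path, which is why, as the text notes, independent paths can be processed in parallel once their structure is known.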
6.2 Data-driven algorithms
In some irregular algorithms, active nodes are not determined by
the structure of the graph; instead, there is an initial set of active nodes, and performing an activity causes other nodes to become active, so nodes become active in an unpredictable, datadriven fashion. Examples include the preflow-push algorithm [4]
for the maxflow problem, AI message-passing algorithms such as
belief propagation and survey propagation [46], Petri nets [51],
and discrete-event simulation [35, 48]. In other algorithms, such
as some approaches to solving spin Ising models [55], the pattern
of node updates is determined by externally-generated data such
as random numbers. We call these data-driven local computation
algorithms.
Although graph topology is typically less important than in
structure-driven algorithms, it can nevertheless be useful in reasoning about certain data-driven algorithms. For example, belief propagation is an exact inference algorithm on trees, but it is an approximate algorithm when used on general graphs because of the famous
loopy propagation problem [46]. Grid structure is exploited in spin
Ising solvers to reduce synchronization [55]. In this section, we describe the preflow-push algorithm, which uses an unordered work-set, and discrete-event simulation, which uses an ordered work-set.
Both these algorithms work on general graphs.
Preflow-push algorithm The preflow-push algorithm was introduced earlier in Section 3.2. Active nodes are nodes that have excess in-flow at intermediate stages of the algorithm (unlike the augmenting paths maxflow algorithm, which maintains a valid flow at
all times [16]). The algorithm also maintains at each node a value
called height, which is a lower bound on the distance of that node to
the sink. The algorithm performs two operations, push and relabel,
at active nodes to update the flow and height values respectively,
until termination. The algorithm does not specify an order for applying the push or relabel operations, and hence it can be expressed
using an unordered iterator.
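The per-active-node operator can be sketched as follows (sequential Python; res maps directed edges to residual capacities and is assumed to contain reverse edges, adj lists a node's residual-graph neighbors, and a driver loop repeatedly removes a node from the unordered work-set and calls discharge). This is an illustration of the operator, not a complete max-flow code or the Galois API.

def discharge(u, excess, height, res, adj, source, sink, worklist):
    while excess[u] > 0:
        pushed = False
        for v in adj[u]:
            # push: send excess downhill along an edge with residual capacity
            if res.get((u, v), 0) > 0 and height[u] == height[v] + 1:
                delta = min(excess[u], res[(u, v)])
                res[(u, v)] -= delta
                res[(v, u)] = res.get((v, u), 0) + delta
                excess[u] -= delta
                excess[v] += delta
                if v not in (source, sink):
                    worklist.add(v)           # v may have just become active
                pushed = True
                if excess[u] == 0:
                    break
        if not pushed:
            # relabel: lift u just above its lowest residual-graph neighbor
            heights = [height[v] for v in adj[u] if res.get((u, v), 0) > 0]
            if not heights:
                break
            height[u] = min(heights) + 1

The neighborhood of this activity is u, its incident residual edges, and the excess and height of its neighbors; two activities conflict only when these neighborhoods overlap.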
Figure 19 shows the available parallelism in preflow-push for a 512 × 512 grid-shaped input graph. In preflow-push, as in most
local computation codes, the amount of parallelism is governed
by the amount of work, as the likelihood that two neighborhoods
overlap remains constant throughout execution. This can be seen
by the relatively stable amount of parallelism at the beginning of
computation; then, as the amount of work decreases, the parallelism
does as well.
Event-driven simulation Event-driven simulation, which was introduced in Section 2, can be written using an ordered set iterator,
where messages are three-tuples ⟨value, edge, time⟩ and the ordered set maintains messages sorted in time order.
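A minimal sequential rendering of this ordered work-set uses a priority queue keyed on time; the (time, edge, value) tuple layout and the stations callback below are our assumptions, not the interface of any particular simulator.

import heapq

def simulate(initial_events, stations, until):
    # initial_events: iterable of (time, (src, dst), value) messages;
    # stations[dst](value, time) returns a list of (delay, out_edge, out_value) messages.
    pq = list(initial_events)
    heapq.heapify(pq)                         # the ordered work-set
    while pq:
        time, (src, dst), value = heapq.heappop(pq)
        if time > until:
            break
        for delay, out_edge, out_value in stations[dst](value, time):
            heapq.heappush(pq, (time + delay, out_edge, out_value))

A speculative parallel implementation processes several of the earliest events at once and rolls back those that conflict, along the lines of the ordered set iterator implementation of Section 4.1.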
In the literature, there are two approaches to parallel event-driven simulation, called conservative and optimistic event-driven
simulation [48]. Conservative event-driven simulation reformulates
the algorithm so that in addition to sending data messages, processing stations also send out-of-band messages to update time at
their neighbors. Each processing station can operate autonomously
without fear of deadlock, and there is no need to maintain an ordered event list. This is an example of algorithm reformulation that
replaces an ordered work-set with an unordered work-set. Timewarp [35] is an optimistic parallel implementation of event-driven
simulation (it is probably the first optimistic parallel implementation of any algorithm in the literature). It can be viewed as an
application-specific implementation of ordered set iterators as described in Section 4.1, with two key differences. First, threads are
allowed to work on new events created by speculative activities
that have not yet committed. This increases parallelism, but opens
up the possibility of cascading roll-backs. Second, instead of the
global commit queue in the implementation of Section 4.1, there
are periodic sweeps through the network to update “global virtual
time” [35]. Both of these features can be implemented in a general-purpose way in a system like Galois, but more study is needed to
determine if this is worthwhile.
6.3 Discussion
Solutions to fixpoint equations: Iterative methods for computing solutions to systems of equations can usually be formulated in both
structure-driven and data-driven ways. A classic example of this is
iterative dataflow analysis in compilers. Systems of dataflow equations can be solved by iterating over all the equations using a Jacobi
or Gauss-Seidel formulation until a fixed point is reached; these are
structure-driven approaches. Alternatively, the classic worklist algorithm only processes an equation if its inputs have changed; this is a
data-driven approach [3]. The underlying graph in this application
is the control-flow graph (or its reverse for backward dataflow problems), and the algorithms can be parallelized in the same manner
as other local computation algorithms.
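A generic data-driven solver for a forward problem can be sketched as follows; the transfer and meet parameters are our naming, and a backward problem would use the reversed control-flow graph, as noted above.

from collections import deque

def worklist_fixpoint(succ, transfer, meet, init):
    # succ: node -> list of successor nodes; init: starting value for every node
    value = {n: init for n in succ}
    work = deque(succ)                        # initially every node is active
    while work:
        n = work.popleft()
        out = transfer(n, value[n])
        for s in succ[n]:
            merged = meet(value[s], out)
            if merged != value[s]:            # an input of s changed: s becomes active
                value[s] = merged
                work.append(s)
    return value

Instantiating transfer with the usual gen/kill functions and meet with set union gives, for example, the classic reaching-definitions algorithm [3].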
Speeding up local computation algorithms: Local computation
algorithms can often be made more efficient by a preprocessing step
that coarsens the graph, since this speeds up the flow of information
across the graph. This idea is used in dataflow analysis of programs.
An elimination-based dataflow algorithm is used first to coarsen the
graph, as described in Section 5.2.3. If the resulting graph is a single
node, the solution to the dataflow problem is read off by inspection;
otherwise, iterative dataflow analysis is applied to the coarse graph.
In either case, the solution for the coarse graph is interpolated back
into the original graph. One of the first hybrid dataflow algorithms
along these lines was Allen and Cocke’s interval-based algorithm
for dataflow analysis [3], which finds and collapses intervals, which
are single-entry, multiple-exit loop structures in the control-flow
graph of the program.
Some local computation algorithms can be sped up by periodically computing and exploiting global information. The global-relabel heuristic in preflow-push is an example of such an operation. Preflow-push can be sped up by periodically performing a
breadth-first search from the sink to update the height values with
more accurate information [4]. In the extreme, global information
can be used in every iteration. For example, iterative linear solvers
are often sped up by preconditioning the matrix iterations with another matrix that is often obtained by a graph coarsening computation such as incomplete LU or Cholesky factorization [28].
7. Reader algorithms
An operator that reads a graph without modifying it in any way is a
reader operator for that graph. The Google map-reduce [17] model
is an example in which the fixed graph is a set or multi-set, and map and reduce computations are performed on the underlying values.
Scene s; // read in scene description
Framebuffer fb; // image to generate
Workset ws<Ray> = // initial rays
Workset spawnedws<Ray>;
for each Ray r in ws {
  r.trace(s); // trace ray through scene
  Rays rays = r.spawnRays();
  ws.add(rays);
  spawnedws.add(rays);
}
for each Ray r in spawnedws {
  fb.update(r); // update with r's contribution
}
Figure 20. Ray-tracing
A particularly important class of algorithms in this category performs multiple traversals over a fixed graph. Each traversal maintains its own state, which is updated during the traversal, so the
traversals can be performed concurrently. Force computation in n-body algorithms like Barnes-Hut [9] is an example. The construction of the octree is an example of tree refinement, as was discussed
in Section 5.1. Once the tree is built, the force calculation iterates over particles, computing the force on each particle by making
a top-down traversal of a prefix of the tree. The force computation step of the Barnes-Hut algorithm is completely parallel since
each particle can be processed independently. The only shared data
structure in this phase of the algorithm is the octree, which is not
modified.
Another classic example is ray-tracing. The simplified pseudocode is shown in Figure 20. Each ray is traced through the scene.
If the ray hits an object, new rays may be spawned to account for
reflections and refraction; these rays must also be processed. After
the rays have been traced, they are used to construct the rendered
image.
The key concern in implementing such algorithms is exploiting
locality. One approach to enhancing locality in algorithms such as
ray-tracing is to “chunk” similar work together. Rays that will propagate through the same portion of the scene are bundled together
and processed simultaneously by a given processor. Because each
ray requires the same scene data, this approach can enhance cache
locality. A similar approach can be taken in Barnes-Hut: particles
in the same region of space are likely to traverse similar parts of
the octree during the force computation and thus can be processed
together to improve locality.
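One simple way to implement such chunking is to bucket work items by the region of space in which they start, for example by the cell of a uniform grid; this is only a sketch with illustrative names, and production ray tracers and n-body codes use more sophisticated spatial structures.

def chunk_by_region(items, position, cell=1.0):
    # Group work items (rays, particles) by the grid cell containing them, so that
    # items likely to touch the same scene or octree data are processed together.
    buckets = {}
    for it in items:
        x, y, z = position(it)
        key = (int(x // cell), int(y // cell), int(z // cell))
        buckets.setdefault(key, []).append(it)
    return list(buckets.values())             # hand each bucket to one processor

# e.g., for bundle in chunk_by_region(rays, lambda r: r.origin): trace(bundle)

Regrouping can be repeated as rays propagate, which is the dynamic scheme cited below [53].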
In some cases, reader algorithms can be more substantially
transformed to further enhance locality. In ray-tracing, bundled rays
begin propagating through the scene in the same direction but may
eventually diverge. Rather than using an a priori grouping of rays,
groups can be dynamically updated as rays propagate through a
scene, maintaining locality throughout execution [53].
8. Conclusions and future work
For more than thirty years, the parallel programming community
has used computation graphs to reason about parallelism in algorithms. In this paper, we argued that this program-centric abstraction is inadequate for most irregular algorithms, and that it should
be replaced by a data-centric view of algorithms called the operator
formulation of irregular algorithms and an associated abstraction
called the halograph. This alternate view of irregular algorithms reveals that a generalized form of data-parallelism called amorphous
data-parallelism is ubiquitous in irregular algorithms; however, the
amount of amorphous data-parallelism in an irregular algorithm is
usually dependent on runtime values and varies from point to point
in the execution of the algorithm. The operator formulation of irregular algorithms also leads to a structural classification of irregular algorithms that provides insight into the patterns of amorphous
data-parallelism in irregular algorithms. In particular, regular algorithms emerge as a special and important class of irregular algorithms. Optimistic or speculative execution is essential for exploiting amorphous data-parallelism in many irregular algorithms
even if these algorithms are written in a declarative language, overturning the commonly held belief that speculative execution is a
crutch that is needed only when we do not have enough insight
into the structure of parallelism in programs. However, the overheads of speculative execution can be reduced dramatically in most
cases by exploiting algorithmic structure. Finally, we showed that
many optimizations used in handwritten parallel implementations
of irregular algorithms can be incorporated into a general-purpose
system since these optimizations exploit algorithmic structure and
are usually not specific to particular algorithms.
In the literature, there are many efforts to identify parallelism
patterns [47, 50, 57, 58]. The scope of these efforts is broader than
ours: for example, they seek to categorize implementation mechanisms such as whether task-queues or the master-worker approach
is used to distribute work. Like us, Snir distinguishes between ordered and unordered work [58]. We make finer distinctions than
are made in the Berkeley motifs [50]. For example, dense matrix
algorithms constitute a single Berkeley motif, but we would put
iterative solvers in a different category (local computation) than
dense matrix factorizations (morphs), and their parallel behavior
is very different. Our classification of irregular algorithms should
be of interest to researchers working on applying machine learning
techniques to program optimization problems [1, 62]. The approach
taken in this line of research is to classify programs according to a
set of features. Current approaches use fairly simple features such
as loop nesting level or depth of dependences. The classification
presented in this paper provides a set of semantics-based features
that might perform better than syntactic features, at least for high-level optimizations.
In ongoing research, we are investigating the following questions.
(1) The approach taken in this paper relies on a library of parallel data structure implementations, similar to the Java collections
classes. Is this adequate or do data structure implementations have
to be customized to each algorithm to obtain high performance? If
so, what language and systems support would make it easier to produce high-performance implementations of these data structures?
Is it possible to synthesize these implementations from high-level
specifications?
(2) We have a lot of insight into exploiting structure in unordered algorithms, as described in Section 4, but we have relatively little insight into exploiting structure in ordered algorithms.
What structure is exploited by handwritten implementations of
these algorithms, and how do we incorporate these ideas into a
general-purpose system? Is there a general way of transforming
ordered algorithms into unordered algorithms, as is done in an
application-specific way for event-driven simulation?
(3) How should information needed for efficient execution, such
as scheduling policies and partitioning of data, be added to Galois
programs? Are concepts from the literature on program refinements
useful for doing this in a disciplined manner [10]?
(4) Current architectural mechanisms for supporting parallel
programming, such as hardware transactional memory [31], are
blunt instruments that do not exploit the plentiful structure in irregular algorithms. The discussion in this paper suggests that it
may be more fruitful for the architecture community to focus on
particularly recalcitrant categories of irregular algorithms such as
ordered algorithms or algorithms that perform general morphs on
trees. What architectural support is appropriate for these categories
of algorithms?
Acknowledgements: We would like to thank Arvind (MIT), Don
Batory (UT Austin), Jim Browne (UT Austin), Calin Cascaval
(IBM Yorktown Heights), Jack Dennis (MIT), David Kuck (Intel), Charles Leiserson (MIT), Jayadev Misra (UT Austin), David
Padua (Illinois), Burton Smith (Microsoft), Marc Snir (Illinois) and
Jeanette Wing (NSF) for useful discussions.
References
[1] F. Agakov, E. Bonilla, J. Cavazos, G. Fursin, B. Franke, M.F.P.
O’Boyle, M. Toussant, J. Thomson, and C. Williams. Using machine
learning to focus iterative optimization. In CGO, March 2006.
[2] Kunal Agrawal, Charles E. Leiserson, Yuxiong He, and Wen Jing
Hsu. Adaptive work-stealing with parallelism feedback. ACM Trans.
Comput. Syst., 26(3):1–32, 2008.
[3] A. Aho, R. Sethi, and J. Ullman. Compilers: principles, techniques,
and tools. Addison Wesley, 1986.
[4] Richard J. Anderson and Joao C. Setubal. On the parallel implementation of Goldberg’s maximum flow algorithm. In SPAA, 1992.
[5] Christos D. Antonopoulos, Xiaoning Ding, Andrey Chernikov, Filip
Blagojevic, Dimitrios S. Nikolopoulos, and Nikos Chrisochoides.
Multigrain parallel Delaunay mesh generation: challenges and
opportunities for multithreaded architectures. In ICS ’05, 2005.
[6] Arvind and R. S. Nikhil. Executing a program on the MIT tagged-token
dataflow architecture. IEEE Trans. on Computers, 39(3), 1990.
[7] D.A. Bader and G. Cong. Fast shared-memory algorithms for
computing the minimum spanning forest of sparse graphs. Journal of
Parallel and Distributed Computing, 66(11):1366–1378, 2006.
[8] Henk Barendregt. The Lambda Calculus, Its Syntax and Semantics.
Elsevier Publishers, 1984.
[9] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation
algorithm. Nature, 324(4), December 1986.
[10] Don Batory, Jia Liu, and Jacob Neal Sarvela. Refinements and multidimensional separation of concerns. In ESEC/FSE-11, 2003.
[11] Daniel Blandford, Guy Blelloch, and Clemens Kadow. Engineering
a compact parallel Delaunay algorithm in 3D. In Proceedings of the twenty-second Annual Symposium on Computational Geometry, 2006.
[12] Guy Blelloch. Programming parallel algorithms. Communications of
the ACM, 39(3), March 1996.
[13] Boost C++ libraries. http://www.boost.org/.
[14] L. Paul Chew. Guaranteed-quality mesh generation for curved
surfaces. In SCG, 1993.
[15] E.F. Codd. Does your DBMS run by the rules? Computerworld,
October 1985.
[16] Thomas Cormen, Charles Leiserson, Ronald Rivest, and Clifford
Stein, editors. Introduction to Algorithms. MIT Press, 2001.
[17] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on
large clusters. In OSDI, 2004.
[18] B. Delaunay. Sur la sphere vide. Izvestia Akademii Nauk SSSR,
Otdelenie Matematicheskikh i Estestvennykh Nauk, 7:793–800, 1934.
[19] Jack Dennis. Dataflow ideas for supercomputers. In Proceedings of
CompCon, 1984.
[20] Edsger Dijkstra. A Discipline of Programming. Prentice Hall, 1976.
[21] H. Ehrig and M. Löwe. Parallel and distributed derivations in the
single-pushout approach. Theoretical Computer Science, 109:123–
143, 1993.
[22] David Eppstein. Spanning trees and spanners. In J. Sack and
J. Urrutia, editors, Handbook of Computational Geometry, pages
425–461. Elsevier, 1999.
[23] Ping An et al. STAPL: An adaptive, generic parallel C++ library. In
LCPC. 2003.
[24] J.D. Foley, A. van Dam, S.K. Feiner, and J.F. Hughes. Computer
Graphics: Principles and Practice. Addison-Wesley, 1990.
[25] D. Gajski, D. Kuck, and D. Padua. Dependence driven computation.
In COMPCON 81, 1981.
[26] R. Ghiya and L. Hendren. Is it a tree, a dag, or a cyclic graph? a shape
analysis for heap-directed pointers in C. In POPL, 1996.
[27] J. R. Gilbert and R. Schreiber. Highly parallel sparse Cholesky
factorization. SIAM Journal on Scientific and Statistical Computing,
13:1151–1172, 1992.
[28] Gene Golub and Charles van Loan. Matrix computations. Johns
Hopkins Press, 1996.
[29] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations.
Journal of Computational Physics, 73:325–348, August 1987.
[30] L. Hendren and A. Nicolau. Parallelizing programs with recursive
data structures. IEEE TPDS, 1(1):35–47, January 1990.
[31] Maurice Herlihy and J. Eliot B. Moss. Transactional memory:
architectural support for lock-free data structures. In ISCA, 1993.
[32] Kirsten Hildrum and Philip S. Yu. Focused community discovery. In
ICDM, 2005.
[33] S. Horwitz, P. Pfeiffer, and T. Reps. Dependence analysis for pointer
variables. In PLDI, 1989.
[34] Benoı̂t Hudson, Gary L. Miller, and Todd Phillips. Sparse parallel
Delaunay mesh refinement. In SPAA, 2007.
[35] David R. Jefferson. Virtual time. ACM TOPLAS, 7(3), 1985.
[36] J.T.Schwartz, R.B.K. Dewar, E. Dubinsky, and E. Schonberg.
Programming with sets: An introduction to SETL. Springer-Verlag
Publishers, 1986.
[37] G. Karypis and V. Kumar. Multilevel k-way partitioning scheme
for irregular graphs. Journal of Parallel and Distributed Computing,
48(1):96–129, 1998.
[38] K. Kennedy and J. Allen, editors. Optimizing compilers for modern
architectures. Morgan Kaufmann, 2001.
[39] Induprakas Kodukula, Nawaaz Ahmed, and Keshav Pingali. Datacentric multi-level blocking. In Proceedings of the ACM Symposium
on Programming Language Design and Implementation (PLDI), June
1997.
[40] Robert Kramer, Rajiv Gupta, and Mary Lou Soffa. The combining
DAG: A technique for parallel data flow analysis. IEEE Transactions
on Parallel and Distributed Systems, 5(8), August 1994.
[41] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and
L.P. Chew. Optimistic parallelism requires abstractions. PLDI, 2007.
[42] Milind Kulkarni, Martin Burtscher, Rajasekhar Inkulu, Keshav
Pingali, and Calin Cascaval. How much parallelism is there in irregular
applications? In PPoPP, 2009.
[43] Y. Lee and B. G. Ryder. A comprehensive approach to parallel data
flow analysis. In Supercomputing, pages 236–247, 1992.
[44] Edward Lorenz. Predictability- does the flap of a butterfly’s wings
in Brazil set off a tornado in Texas? In Proceedings of the American
Association for the Advancement of Science, December 1972.
[45] M. Lowe and H. Ehrig. Algebraic approach to graph transformation
based on single pushout derivations. In WG ’90: Proceedings of the
16th international workshop on Graph-theoretic concepts in computer
science, pages 338–353, New York, NY, USA, 1991. Springer-Verlag
New York, Inc.
[46] David Mackay. Information Theory, Inference and Learning
Algorithms. Cambridge University Press, 2003.
[47] Timothy Mattson, Beverly Sanders, and Berna Massingill. Patterns
for Parallel Programming. Addison-Wesley Publishers, 2004.
[48] Jayadev Misra. Distributed discrete-event simulation. ACM Comput.
Surv., 18(1):39–65, 1986.
[49] James Munkres. Topology: a First Course. Prentice-Hall Publishers,
1975.
[50] D. Patterson, K. Keutzer, K. Asanovic, K. Yelick, and R. Bodik.
Berkeley dwarfs. http://view.eecs.berkeley.edu/.
[51] J. L. Peterson. Petri nets. ACM Computing Surveys, 9(3):223–252,
1977.
[52] Simon Peyton-Jones. The Implementation of Functional Programming
Languages. Prentice-Hall, 1987.
[53] M. Pharr, C. Kolb, R. Gershbein, and P. Hanrahan. Rendering complex
scenes with memory-coherent ray tracing. In SIGGRAPH, 1997.
[54] Lawrence Rauchwerger and David A. Padua. The LRPD test:
Speculative run-time parallelization of loops with privatization
and reduction parallelization. IEEE Trans. Parallel Distrib. Syst.,
10(2):160–180, 1999.
[55] Eunice E. Santos, Shuangtong Feng, and Jeffrey M. Rickman.
Efficient parallel algorithms for 2-dimensional Ising spin models.
In IPDPS, 2002.
[56] K. Schloegel, G. Karypis, and V. Kumar. Parallel static and dynamic
multi-constraint graph partitioning. Concurrency and Computation:
Practice and Experience, 14(3):219–240, 2002.
[57] David Skillicorn. Foundations of Parallel Programming. Cambridge
University Press, 1994.
[58] Marc Snir. http://wing.cs.uiuc.edu/group/patterns/.
[59] Jon Solworth and Brian Reagan. Arbitrary order operations on trees.
In Workshop on Languages and Compilers for Parallel Computing,
1994.
[60] D. Spielman and SH Teng. Parallel Delaunay refinement: Algorithms
and analyses. In 11th International Meshing Roundtable, 2002.
[61] J. Greggory Steffan, Christopher B. Colohan, Antonia Zhai, and
Todd C. Mowry. A scalable approach to thread-level speculation. In
ISCA ’00: Proceedings of the 27th annual international symposium on
Computer architecture, 2000.
[62] Mark Stephenson, Saman Amarasinghe, Martin Martin, and Una-May O'Reilly. Meta optimization: improving compiler heuristics with
machine learning. SIGPLAN Not., 38(5):77–90, 2003.
[63] Tim Tyler. Cellular automata neighborhood survey. http://cellauto.com/neighbourhood/index.html.
[64] L. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8), 1990.
[65] Clark Verbrugge. A Parallel Solution Strategy for Irregular, Dynamic
Problems. PhD thesis, McGill University, 2006.
[66] T. N. Vijaykumar, Sridhar Gopal, James E. Smith, and Gurindar Sohi.
Speculative versioning cache. IEEE Trans. Parallel Distrib. Syst.,
12(12):1305–1317, 2001.
[67] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to
Data Mining. Addison Wesley, 2005.
[68] Niklaus Wirth. Algorithms + Data Structures = Programs. Prentice-Hall, 1976.
[69] Stephen Wolfram. A New Kind of Science. Wolfram Media, 2002.
[70] Janet Wu, Raja Das, Joel Saltz, Harry Berryman, and Seema
Hiranandani. Distributed memory compiler design for sparse
problems. IEEE Transactions on Computers, 44, 1995.