Dynamic Node Reconfiguration in a Parallel-Distributed Environment

Michael J. Feeley, Brian N. Bershad, Jeffrey S. Chase, Henry M. Levy
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195

Abstract

Idle workstations in a network represent a significant computing potential. In particular, their processing power can be used by parallel-distributed programs that treat the network as a loosely-coupled multiprocessor. But the set of machines free to participate in such programs changes over time as users come and go from their workstations. To make full use of the available resources, a parallel-distributed program must be able to reconfigure: to adapt to changing network conditions by making use of new nodes as they become idle and by releasing nodes as they become busy. This paper describes support for dynamic node reconfiguration in Amber, an object-based system for parallel-distributed programming on a network of multiprocessors. Amber's reconfiguration facility allows a parallel application to adapt to changes in load by expanding and contracting the set of nodes on which it runs. A key characteristic of Amber's node reconfiguration is that it is handled at the user level in the Amber runtime system; it does not depend on a kernel-level process migration facility. Our experiments with Amber show that node reconfiguration can be implemented easily and efficiently in a user-level runtime library.

1 Introduction

For nearly two decades, networks of workstations have been used as inexpensive multiprocessors, providing parallel processing power at much lower cost than supercomputers, with moderate degrees of success. Several technological trends promise to increase this potential substantially in the future. Network speeds promise an order-of-magnitude increase, which will reduce the cost of inter-node communication, and shared-memory multiprocessors are gaining popularity as workstations, dramatically increasing the aggregate computational power available in the network and providing a comfortable basis for parallel programming models [Li & Hudak 89, Bennett et al. 90a, Arnould et al. 89, Chase et al. 89]. Together, these trends are making parallel-distributed programming on workstation networks more practical.

Modern parallel-distributed programming systems, however, restrict applications to run on a fixed set of hosts. Typically, a parallel-distributed program allocates its processing nodes at startup and runs on those nodes until completion. The program is unable to take advantage of additional machines as they become idle. Worse, if the owner of one of the workstations returns, he must compete for processing resources with the parallel program running on his machine; the result is degraded performance for both the program and the owner's single-node work, and an unhappy workstation owner who may be inclined to abort the program. Such conflicts are likely during a long-running computation, unless the program limits its initial allocation to a small number of nodes with predictable usage patterns, i.e., nodes likely to remain idle. This severely restricts both the potential parallelism and the applications for which the network can be used.

The goal of our research is to allow and encourage the execution of parallel-distributed programs that adapt to changes in the set of available processing nodes on the network. We call such changes, and the actions that must be taken to accommodate them, dynamic node reconfiguration. Supporting dynamic node reconfiguration allows programmers to use idle workstations according to the needs of their programs rather than the working hours of their colleagues.

This work was supported in part by the National Science Foundation (Grants No. CCR-8619663 and CCR-89076643), the Washington Technology Center, an AT&T Fellowship, and Digital Equipment Corporation (Systems Research Center, External Research Program, and DECwest Engineering).
In this paper we describe the design, implementation, performance, and impact of dynamic node reconfiguration in Amber [Chase et al. 89]. Amber differs from earlier load sharing work [Eager et al. 86, Litzkow et al. 88, Theimer & Lantz 88], which makes use of idle network workstations by redistributing independent sequential processes using kernel-level process migration [Powell & Miller 83, Theimer et al. 85, Zayas 87, Douglis & Ousterhout 87]. Our work extends these ideas to the parallel-distributed case, which involves distributed programs composed of multiple processes cooperating to obtain speedup on a single parallel algorithm. Because control over locality, decomposition, and communication patterns is critical to the performance of a parallel-distributed program, we believe that a user-level approach to reconfiguration is more appropriate than one based exclusively on kernel-level mechanisms. In Amber, programs control the load redistribution policy through reconfiguration mechanisms implemented entirely at the user level within the Amber runtime library.

This paper is organized as follows. In the next section we give an overview of the Amber system. We motivate and explain Amber's user-level approach to reconfiguration in Section 3, and in Section 4 we contrast it with an alternative approach based on kernel-level process migration. In Section 5 we discuss some implementation issues for Amber's reconfiguration facility. In Section 6 we describe the system's performance. We present our conclusions in Section 7.

2 Amber Overview

Amber is an object-oriented parallel-distributed programming system that supports threads, a distributed shared object space, object migration, and object replication [Faust 90], for programs written in a variant of C++. Amber was designed to explore the use of a distributed object model as a tool for structuring parallel programs to run efficiently on a network of DEC Firefly shared-memory multiprocessor workstations [Thacker et al. 88]. Programs are explicitly decomposed into objects and threads distributed among the nodes participating in the application. Threads interact with objects through a uniform object invocation mechanism; remote invocations are handled transparently by migrating the thread, migrating the object, or replicating the object, as determined by the object's type. Amber has been used to experiment with a number of parallel-distributed applications, including iterative point-update (e.g., the Game of Life), jigsaw puzzle [Green & Juels 86], graph search, mesh-based dynamic programming [Anderson et al. 90], simulation, and an n-body gravitational interaction. This section describes those aspects of Amber that are relevant to our approach to node reconfiguration.

Amber can be viewed as a distributed shared memory system that uses language objects rather than pages as the granularity of distribution and coherence. This eliminates the false sharing inherent in page-based distributed memory systems, and it allows coherence and distribution policies to be selected on an object basis according to how objects are used [Bennett et al. 90b]. The maximum parallelism available to an Amber program is determined by its decomposition into objects and threads, and the assignment of objects to nodes determines the distribution of load. The programmer can control object location using explicit object mobility primitives, or by selecting an adaptive "off the shelf" object placement policy implemented with additional runtime support layered on the base system [Koeman 90]. These features are particularly useful for preventing load imbalances over a changing node set; objects can be migrated to balance the load.

An Amber program runs as a collection of processes distributed across a set of network nodes, some of which may be shared-memory multiprocessors. We will refer to these processes as logical nodes. A logical node is an address space that encapsulates all of the program code, data objects, and threads for one part of an Amber application. Normally, there is one logical node on each physical node participating in the execution of an Amber application. The Amber runtime system schedules lightweight Amber threads within the address space of each logical node. On multiprocessor nodes, these threads can execute in parallel. Object invocation uses shared memory and procedure calls when an object and thread are on the same machine. Runtime support code maps object moves and remote invocations into RPC calls between the logical nodes. In this way, local communication is efficient and remote communication is transparent.

An important aspect of Amber is its sparse use of virtual memory to simplify and speed up naming of distributed objects. Amber uses direct virtual addresses as the basis for object naming in the network. The address spaces of the logical nodes are arranged in such a way that any object address denotes the same object in any logical node. References to remote objects are detected at the language level and trapped to the runtime system. Amber can globally bind objects to virtual addresses in this fashion because each program executes in a private (but distributed) object space; one Amber program cannot name or use objects created by another program. This sets Amber apart from other location-independent object systems, such as Emerald [Jul et al. 88], which provide a single global object space shared by multiple programs.
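To make the decomposition model concrete, the sketch below shows the general flavor in ordinary C++: work is packaged as objects, objects are explicitly placed on logical nodes, and threads invoke them wherever they live. The GridCluster and Node types and the MoveTo and Update calls are stand-ins invented for this sketch; they are not Amber's actual classes or primitives.

    // Sketch only: the flavor of an Amber-style decomposition into objects
    // placed explicitly on logical nodes. All names here are illustrative.
    #include <cstdio>
    #include <vector>

    struct Node { int number; };                 // a logical node, named by number

    class GridCluster {                          // one unit of work (an "object")
    public:
        explicit GridCluster(int id) : id_(id), home_{0} {}
        void MoveTo(Node n) { home_ = n; }       // stand-in for a mobility primitive
        void Update() {                          // invoked where the object lives
            std::printf("cluster %d updating on node %d\n", id_, home_.number);
        }
    private:
        int id_;
        Node home_;
    };

    int main() {
        std::vector<Node> nodes = {{0}, {1}, {2}, {3}};      // the current node set
        std::vector<GridCluster> clusters;
        for (int i = 0; i < 8; ++i) clusters.emplace_back(i);

        // The assignment of objects to nodes determines the distribution of load;
        // here the objects are simply spread round-robin over the node set.
        for (std::size_t i = 0; i < clusters.size(); ++i)
            clusters[i].MoveTo(nodes[i % nodes.size()]);

        for (auto& c : clusters) c.Update();     // threads would invoke objects in parallel
        return 0;
    }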
3 Reconfiguration in Amber

A running parallel-distributed program can respond to three types of node reconfiguration: the node set can shrink in size, stay the same size but change membership, or grow. Dealing with these changes requires care at two levels. At the system level, all of the cases require system support to transfer program state from one node to another, and to forward subsequent communication as needed. At the application level, the program must resolve load imbalances caused by growth and shrinkage of the node set; additional support is needed to react to changes in the available parallelism and communication patterns.

In Amber, reconfiguration is controlled by the runtime system, but may involve application code. The runtime system provides the mechanism for migrating state and adjusting the communication bindings, but the load balancing policy is left to object placement code at the application level. This section discusses these two levels and their interplay during node reconfiguration.

3.1 The Role of the Runtime System

Node set reconfiguration is handled at the user level without direct involvement from the operating system kernel. The program expands by migrating objects and threads to a newly created logical node running on the new physical node. It contracts by forcing a logical node to terminate after redistributing its objects and threads to the remaining logical nodes. In either case, the transfer of program state is initiated and controlled by the runtime system, possibly in cooperation with user code. The operating system kernel is not aware of the reconfiguration; it sees only a flurry of network activity, together with a process creation or death. For simple membership changes (replacing one node with another), this amounts to a user-level process migration facility for Amber logical nodes.

The runtime system is a natural place to support migration of program state. It has responsibility for managing distributed threads and maintaining coherency of the distributed shared memory, so it knows about threads and objects and can relocate them directly. It also knows details about address space usage that allow it to be frugal with network and memory bandwidth. For example, Amber does not copy data that is accessible on demand elsewhere in the network (e.g., replicated objects). It also ignores regions of sparse virtual memory that are unused or reserved for objects residing on other nodes.

3.2 The Role of the Application

The runtime system can independently handle any reconfiguration in a functionally correct way, but it relies on the application (or a higher level object placement package) to redistribute work so as to make the best possible use of the new node set. This must be done by migrating objects using standard object mobility primitives, placing them in a way that best meets the competing goals of balancing load and minimizing network communication. Achieving and maintaining a balanced assignment of work to processing nodes is the primary concern of parallel-distributed programming. The assignment must be adjusted to compensate for load imbalances that develop as the program executes, as well as for changes in the number of nodes. It depends on application-specific knowledge about the amount of computation associated with each object, and the patterns of communication between objects. The extent to which this knowledge is applied will always determine the effectiveness of the decomposition.

Our goal was to create an interface that allows load balancing policies to be tailored to meet the needs of the application. The information to drive the policy is provided by a predefined NodeSet object that serves as a channel of communication between the runtime system and the application. Table 1 lists the NodeSet operations that are visible to the application. The downcall interface is used by the application to request the current size of the node set, or to traverse the node set when distributing work. The upcall interface consists of procedures Grow and Shrink that are exported by the application and are called asynchronously by the runtime system when the size of the node set changes. Shrink is called before a node leaves the node set, to redistribute its work to the remaining nodes. Grow is called after a node is added, to migrate work to the new node.

    Upcall Interface
        void Shrink( Node departingNode )
        void Grow( Node newNode )

    Downcall Interface
        int  NodeCount()
        Node NextNode( Node referenceNode )

    Table 1: NodeSet Operations
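As a rough illustration of how an application might use this interface, the sketch below pairs Grow and Shrink upcalls with a simple round-robin placement policy driven by the NodeCount and NextNode downcalls. Only the operation signatures follow Table 1; the NodeSet storage, the Cluster and Application types, the placement policy, and main() are assumptions made for the sketch rather than Amber code.

    // Sketch only: an application-side use of the Table 1 interface.
    #include <cstdio>
    #include <vector>

    struct Node { int number; };

    class NodeSet {                               // downcall interface (Table 1)
    public:
        explicit NodeSet(std::vector<Node> nodes) : nodes_(nodes) {}
        int NodeCount() const { return static_cast<int>(nodes_.size()); }
        Node NextNode(Node reference) const {     // traverse the node set
            for (std::size_t i = 0; i < nodes_.size(); ++i)
                if (nodes_[i].number == reference.number)
                    return nodes_[(i + 1) % nodes_.size()];
            return nodes_.front();
        }
        std::vector<Node> nodes_;                 // public only to keep the sketch short
    };

    struct Cluster { int id; Node home; };

    class Application {                           // exports the upcalls
    public:
        Application(NodeSet* ns, int nclusters) : ns_(ns) {
            for (int i = 0; i < nclusters; ++i) clusters_.push_back({i, Node{0}});
            Rebalance();
        }
        void Grow(Node newNode) {                 // called after a node is added
            std::printf("node %d joined; rebalancing over %d nodes\n",
                        newNode.number, ns_->NodeCount());
            Rebalance();                          // migrate some clusters to the new node
        }
        void Shrink(Node departingNode) {         // called before a node leaves
            for (auto& c : clusters_)             // push its work to another node first
                if (c.home.number == departingNode.number)
                    c.home = ns_->NextNode(departingNode);
        }
    private:
        void Rebalance() {                        // spread clusters over the node set
            Node n = ns_->nodes_.front();
            for (auto& c : clusters_) { c.home = n; n = ns_->NextNode(n); }
        }
        NodeSet* ns_;
        std::vector<Cluster> clusters_;
    };

    int main() {
        NodeSet ns({{0}, {1}});
        Application app(&ns, 24);                 // over-decomposed into 24 clusters
        ns.nodes_.push_back({2});                 // the runtime has added node 2 ...
        app.Grow({2});                            // ... and issues the Grow upcall
        app.Shrink({1});                          // node 1 is about to leave
        return 0;
    }

In Amber itself, the work transfer inside Grow and Shrink would be expressed with the standard object mobility primitives described above, rather than by rewriting a placement field.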
4 The Process Migration Alternative

Our approach to node reconfiguration is based on migrating fine-grained program state between Amber logical nodes. Consider now an alternative approach based on a general-purpose process migration facility that migrates and schedules logical node processes without knowledge of their internal state. Simple membership changes are straightforward using kernel-level process migration (i.e., move an address space and its component threads). Expansion of the node set is made possible by decomposing the application into a larger number of operating system processes (logical nodes), allowing the use of as many physical nodes as there are logical nodes in the decomposition. A host could leave an application's node set simply by using process migration to send the local logical node processes to some other machine. Reconfiguration would be fully transparent to the application, assuming that it is sufficiently decomposed.

"Over-decomposing" at the process level is expensive because it involves heavyweight kernel abstractions
(e.g. address spaces). Over-decomposing
at the object
level is cheaper
and yields finer-grained
control over
the distribution
of work.
It also serves as a basis for
using application
knowledge to balance the load at runtime, rather than simply to effect a static decomposition into component
processes.
In addition,
the process-level
approach will tend to
perform poorly when several logical nodes are placed
on the same physical node. The system can maintain
coherency between colocated address spaces using the
same mechanisms that are used in the distributed
case
(transferring
data with RPC), but this is much more expensive than using the physically
shared memory provided by the hardware.
Overloading
a physical node with multiple
logical
nodes can also cause scheduling
problems.
For example, computations
in one logical node may block awaiting an event in another,
and overloading
may violate
assumptions
made by the user-level scheduling system
about the number of physical processors available to it.
In contrast, Amber’s user-level approach preserves the
one-to-one mapping of logical to physical nodes, placing all program state on a given physical node within a
single address space. This avoids unnecessary interprocess communication,
and it allows all Amber thread
scheduling
and synchronization
to be handled
using
lightweight
user-level mechanisms.
A final problem with the kernel-level approach is that it transfers the complete contents of a migrating address space, thereby needlessly moving data that is replicated and can be found elsewhere. Although system-level optimizations such as aggressive pre-copying [Theimer et al. 85] or lazy on-demand copying [Zayas 87] can reduce the latency of migration, the most effective optimization, no copying, can only be implemented with information kept at the user level.

5 Implementation

This section describes the implementation of node reconfiguration support in the Amber runtime system. Sections 5.2 and 5.3 focus on managing the changing node set and the locations of objects. Section 5.4 describes mechanisms for dealing with internal runtime system state that must be preserved when a node leaves the node set.

5.1 Propagating Node Set Changes

Changes to the node set are controlled by the NodeSet object, initiated by its functions RemoveNode, ReplaceNode, and AddNode. Changes can be requested by the application itself, by calling one of these functions, or from outside the program, by invoking RPC operations exported by the Amber runtime system in each logical node. In the current implementation this is done manually by means of a command interface, but we plan to implement a server that monitors workstation availability and responds to changes by issuing reconfiguration commands to Amber applications.

Removing a node from the node set involves migrating application state from the departing node to a designated replacement node. The replacement node is specified as an argument to ReplaceNode, but it is chosen arbitrarily by the runtime system for RemoveNode. To remove a node, the runtime system freezes executing threads on both the departing node and the replacement node, and suspends attempts by other nodes to communicate with the departing node. Once both nodes are idle, the runtime system on the departing node uses RPC calls to load its application state into the logical node on the replacement host. When the transfer is complete, the departing logical node terminates and the replacement node resumes execution of user threads. Blocked messages to the departing node are then forwarded to the replacement node. The performance of the running program is degraded during this procedure, but the reconfiguration activity is otherwise transparent to the application.

The node removal procedure is more complicated if the change is a RemoveNode rather than a ReplaceNode. In this case, the runtime system first notifies the application by upcalling Shrink asynchronously and waiting for a selectable time limit. This allows Shrink to begin redistributing the departing node's objects to the remaining nodes. If the time limit expires, the thread executing Shrink is paused along with the other executing threads, and migrated directly by the runtime system along with the remaining program state. When the replacement node resumes, the Shrink thread executes to completion.
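The sequence above can be made explicit in code. The sketch below is a single-process simulation of the removal steps in the order just described; the helper functions and their names are stand-ins for runtime internals and do not correspond to Amber's actual routines.

    // Sketch only: the node-removal ordering, simulated with prints.
    #include <cstdio>

    namespace sketch {
    void UpcallShrinkAndWait(int node, int limitSeconds) {
        std::printf("upcall Shrink(%d); wait up to %d s for the application\n", node, limitSeconds);
    }
    void FreezeThreads(int node)      { std::printf("freeze threads on node %d\n", node); }
    void SuspendTrafficTo(int node)   { std::printf("hold messages bound for node %d\n", node); }
    void TransferStateViaRpc(int from, int to) {
        std::printf("RPC: load objects, threads, and tables from %d into %d\n", from, to);
    }
    void ResumeThreads(int node)      { std::printf("resume user threads on node %d\n", node); }
    void ForwardBlockedMessages(int from, int to) {
        std::printf("forward held messages for %d to %d\n", from, to);
    }
    }  // namespace sketch

    // RemoveNode flavor: the application's Shrink upcall gets a bounded amount
    // of time to redistribute work before the runtime migrates whatever remains.
    void RemoveNode(int departing, int replacement, int shrinkLimitSeconds) {
        sketch::UpcallShrinkAndWait(departing, shrinkLimitSeconds);
        sketch::FreezeThreads(departing);            // includes a still-running Shrink thread
        sketch::FreezeThreads(replacement);
        sketch::SuspendTrafficTo(departing);
        sketch::TransferStateViaRpc(departing, replacement);  // departing node drives the copy
        std::printf("terminate logical node %d\n", departing);
        sketch::ResumeThreads(replacement);          // replacement inherits the departing node's work
        sketch::ForwardBlockedMessages(departing, replacement);
    }

    int main() { RemoveNode(3, 1, 5); return 0; }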
5.2 Node Set Knowledge

The Amber runtime system maintains information about the set of nodes participating in the program, including the size of the node set, the identities of the nodes, and the RPC bindings needed to communicate with them. Each logical node keeps a local copy of the tables with the latest information about the current state of the node set. Keeping perfect knowledge in a dynamic world would require the system to synchronously inform every node of changes to the node set, updating all of their local tables atomically. Instead, the NodeSet object maintains a centralized reliable copy of the tables, updating them synchronously with each change and allowing the updates to propagate lazily to the rest of the network. The Amber runtime system manages local tables as a cache over the master tables; it detects and refreshes stale information.

Each logical node is uniquely identified by a node number assigned to it when it first joins the application's node set. These are used to name the node in system data structures that store information about the node set and the locations of remote objects. Node
numbers
are frequently
transmitted
between logical
nodes as a side effect of the algorithm
for locating
objects.
A node learns of an addition
to the node
set when it receives a node number for which it has
no cached RPC binding.
It responds by obtaining
the
RPC binding for the unknown node from the NodeSet
object, and caching it locally.
Similarly,
nodes discover deletions from the node set
when their attempts to communicate
with the departed
node fail.
When a logical node leaves the node set,
its node number is inherited
by its replacement
node.
Thus the mapping from logical nodes to node numbers is many-to-one.
Reassigning
a node number in this
fashion may invalidate
RPC bindings
for that node
number cached elsewhere in the network.
An attempt
to use a stale RPC binding is caught by the Amber runtime system.
The runtime system handles the failure by querying the NodeSet object for the new binding, updating its local cache of bindings, and retrying the call.
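The cache-refresh-and-retry structure described here can be sketched as follows. The Binding and NodeSetTables types and the way staleness is simulated are assumptions made for the sketch; only the lookup, refresh, and retry sequence follows the text.

    // Sketch only: lazily cached RPC bindings with stale-binding recovery.
    #include <cstdio>
    #include <map>

    struct Binding { int host; int generation; };       // RPC binding for a node number

    struct NodeSetTables {                               // the NodeSet object's master tables
        std::map<int, Binding> current;                  // node number -> current binding
        Binding Lookup(int nodeNumber) const { return current.at(nodeNumber); }
    };

    class LogicalNode {
    public:
        explicit LogicalNode(const NodeSetTables* master) : master_(master) {}

        void CallNode(int nodeNumber) {
            if (!cache_.count(nodeNumber))               // unknown number: a node was added
                cache_[nodeNumber] = master_->Lookup(nodeNumber);
            if (!TryRpc(nodeNumber, cache_[nodeNumber])) {        // stale binding detected
                cache_[nodeNumber] = master_->Lookup(nodeNumber); // refresh the local cache
                TryRpc(nodeNumber, cache_[nodeNumber]);           // and retry the call
            }
        }
    private:
        // Stand-in for a real RPC: "fails" when the cached generation is out of date.
        bool TryRpc(int nodeNumber, const Binding& b) const {
            if (b.generation != master_->Lookup(nodeNumber).generation) return false;
            std::printf("RPC to node %d at host %d\n", nodeNumber, b.host);
            return true;
        }
        const NodeSetTables* master_;
        std::map<int, Binding> cache_;                   // lazily updated local copy
    };

    int main() {
        NodeSetTables master;
        master.current[5] = {10, 1};     // node number 5 currently bound to host 10
        LogicalNode n(&master);
        n.CallNode(5);                   // first use: fetch the binding, call succeeds
        master.current[5] = {22, 2};     // node 5 was replaced; binding changed elsewhere
        n.CallNode(5);                   // cached binding is stale: refresh and retry
        return 0;
    }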
5.3 Locating Remote Objects
Amber's scheme for finding remote objects is based on forwarding addresses [Fowler 85]. Each time an object
moves off of a node it leaves behind the node number
of its destination
as a forwarding
address.
The forwarding address may be out of date if the object moves
frequently.
In this case the object’s location can be determined
by following
a chain of forwarding
addresses,
since the object leaves a new address on each node that
it visits. If an object is referenced from a node where
it is not resident, and where no forwarding
address is
stored, the search for the object is made by following
the forwarding
chain beginning
at the node where the
object was created.
Each node caches tables associating regions of memory with node numbers; these can
be used to determine
the node on which an object was
created, given the virtual address of the object.
Node reconfiguration
complicates
management
of
forwarding
addresses.
When a node is removed from
the node set, the forwarding
addresses stored on the departing node must be preserved by migrating
them to
the replacement
node. But a migrating
forwarding
address will overwrite forwarding
addresses already stored
on the replacement
node. If the replacement
node is a
link in the chain from the departing
node to the object, then destroying
the address stored there creates
a cycle in the forwarding
graph, causing the object to
become unreachable from some nodes. To prevent this,
the forwarding
addresses associated with each object
are timestamped
with a counter that is stored with the
object and incremented
each time the object moves. A
forwarding
address is never overwritten
with an older
one; each forwarding
address on the departing
node
is compared
with its counterpart
on the replacement
node, and the older of the pair is discarded.
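The keep-the-newer rule for forwarding addresses can be written down directly. In the sketch below, the table layout and names are assumptions; the point is only that an entry is never overwritten by one with a smaller move counter, which is the rule described above.

    // Sketch only: timestamped forwarding addresses merged on node removal.
    #include <cstdio>
    #include <map>

    struct ForwardingAddress {
        int nextNode;      // node number the object moved to
        int moveCount;     // counter stored with the object, bumped on each move
    };

    using ForwardingTable = std::map<long, ForwardingAddress>;   // object address -> entry

    // Install a forwarding address, refusing to overwrite a newer one.
    void Install(ForwardingTable& table, long objectAddr, ForwardingAddress fwd) {
        auto it = table.find(objectAddr);
        if (it == table.end() || it->second.moveCount < fwd.moveCount)
            table[objectAddr] = fwd;                              // newer wins; older is discarded
    }

    // When a node departs, merge its forwarding table into the replacement's.
    void MergeOnRemoval(const ForwardingTable& departing, ForwardingTable& replacement) {
        for (const auto& entry : departing)
            Install(replacement, entry.first, entry.second);
    }

    int main() {
        ForwardingTable departing   = {{0x1000, {7, 3}}};   // object moved 3 times, now on node 7
        ForwardingTable replacement = {{0x1000, {4, 2}}};   // stale entry from an earlier move
        MergeOnRemoval(departing, replacement);
        std::printf("object 0x1000 forwards to node %d (move %d)\n",
                    replacement[0x1000].nextNode, replacement[0x1000].moveCount);
        return 0;
    }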
5.4 Activation Records
The problem for node removal is to transparently
preserve application
state on the departing
node by migrating it to a safe place where it can be found when
it is needed by the remaining
nodes. The bulk of this
state is in the form of application
objects and threads,
but there are other data structures
maintained
by the
runtime
system that also must be dealt with.
Other
nodes assume that this state exists, and expect to find
it by resolving a node number embedded in their local
data structures.
Amber preserves this state by migrating it to the replacement
node, which inherits the node
number of the departing
node.
Stack activation records are the most difficult example of this kind of state. Activation records are not objects; an Amber thread stack is a contiguous
region
of virtual memory handled in the standard way by machine instructions
for procedure
call and return.
Due
to the way that Amber remote object invocations
are
handled, a stack may have valid regions on several different nodes, associated with invocations
of objects residing on those nodes. The Amber remote invocation
code modifies the stack so that a procedure return into
a remote activation
record causes the thread to return
to the node where the actual activation
record resides.
Of course, activation
records residing on a node must
be preserved if the node is removed from the node set.
Our changes to Amber for node set reconfiguration
included code to update a list of valid stack fragments on
each remote invocation,
and to merge valid activation
records into formerly invalid regions on the replacement
node when a node set change occurs.
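A minimal sketch of the bookkeeping described here might look as follows; the field names, the per-thread container, and the merge routine are illustrative assumptions rather than Amber's actual data structures.

    // Sketch only: tracking valid stack fragments per thread and merging them
    // onto the replacement node when the node set changes.
    #include <cstdio>
    #include <vector>

    struct StackFragment {
        unsigned long base;     // start of the valid region within the thread's stack
        unsigned long length;   // bytes of valid activation records
        int node;               // node number where this fragment currently resides
    };

    struct ThreadStackMap {
        std::vector<StackFragment> fragments;        // updated on each remote invocation

        void NoteRemoteInvocation(unsigned long base, unsigned long length, int node) {
            fragments.push_back({base, length, node});
        }

        // On a node set change, fragments on the departing node are copied into
        // the formerly invalid region of the stack on the replacement node, and
        // the map is updated to match.
        void MergeAfterRemoval(int departingNode, int replacementNode) {
            for (auto& f : fragments)
                if (f.node == departingNode) {
                    std::printf("copy [%#lx, +%lu) to node %d\n", f.base, f.length, replacementNode);
                    f.node = replacementNode;
                }
        }
    };

    int main() {
        ThreadStackMap t;
        t.NoteRemoteInvocation(0x7f0000, 512, 2);    // an invocation left frames on node 2
        t.NoteRemoteInvocation(0x7f0200, 256, 5);
        t.MergeAfterRemoval(5, 2);                   // node 5 leaves; node 2 replaces it
        return 0;
    }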
6 Performance
The performance
of the node reconfiguration
mechanism can be evaluated from two different perspectives.
The first is concerned with the amount
of time that
workstation
owners must wait to reclaim their workstations from parallel-distributed
applications.
The second is concerned with the performance
of an application under reconfiguration.
6.1 Reclaiming a Workstation
When an application
receives a command
to extract
itself from a node, application
threads on the node
are stopped within
a fixed period of time specified as
an argument
to the command.
This quickly returns a
considerable
portion of the workstation
resources back
to its owner.
The time for the subsequent
migration
of application
state from the departing
node is dominated by the communication
latency of the transfer.
The primary
factors affecting
this time are the number and size of objects resident on the departing
node.
Table 2 shows the relationship
between overall migration time and the number and size of migrated objects.
Our current implementation
does not include a bulk
object transfer facility capable of packing and shipping
multiple
objects at once; the cost for moving a large
number of objects will be substantially improved once this feature is implemented.
    Number of              Object Size (bytes)
    Objects
      100        0.84    0.97    1.15    1.86     2.84
      200        1.53    1.72    2.09    3.46     5.85
      300        2.23    2.47    3.05    5.15     6.68
      400        2.90    3.23    3.95    6.77    11.51
      500        3.54    3.94    4.87    8.35    13.18

    Table 2: Total Application Migration Time (seconds)

6.2 Application Performance

The performance of an application in a dynamic node set depends on how well the application can adapt to reconfiguration. We built an adaptable version of Red/Black Successive Over-Relaxation (SOR), an iterative point-update algorithm that computes values for points in a grid based on the values of their North, South, East, and West neighbors. This section discusses the steps we took to make SOR adaptable, and presents resulting performance data.

Static node set implementations of SOR divide the grid into clusters of adjacent points, distributing one cluster to each node that is available to the application. If SOR is to adapt to changes in the number of available nodes, it must be able to redistribute its work dynamically to balance the load among currently available nodes. The approach we took to achieve this was to over-decompose the problem into 24 clusters. We chose this number so that we could get balanced distributions of clusters to 2, 4, 6 and 8 nodes. Having more than one cluster per node may result in a performance penalty (compared to one cluster per node) due to an increase in scheduling overhead and cross-cluster communication. Running SOR on a 1200*1200 grid, we found that a 24-way partitioning of the grid was 10% slower on two nodes than a minimally decomposed (2-way partitioned) version. The penalty increased to 20% when we decomposed to 64 clusters.

To effectively overlap computation and communication in SOR, threads synchronize at a barrier at key points in each iteration. To achieve the same effect when there is more than one cluster per node and thus more than one barrier, we introduced a node barrier object on which all clusters synchronize. Without this barrier, we pay an additional 15% performance penalty for over-decomposition.
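The node barrier and the round-robin spread of clusters just described can be sketched as a small self-contained simulation. The NodeBarrier class, the cluster count per node, and the printed output are assumptions for the sketch, not Amber's implementation; only the idea that every cluster on a node synchronizes at one per-node barrier in each iteration follows the text.

    // Sketch only: clusters on one node synchronizing at a per-node barrier.
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    class NodeBarrier {                        // all clusters on a node synchronize here
    public:
        explicit NodeBarrier(int parties) : parties_(parties), waiting_(0), phase_(0) {}
        void Arrive() {
            std::unique_lock<std::mutex> lock(m_);
            int phase = phase_;
            if (++waiting_ == parties_) {      // last cluster to arrive releases the rest
                waiting_ = 0;
                ++phase_;
                cv_.notify_all();
            } else {
                cv_.wait(lock, [&] { return phase_ != phase; });
            }
        }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        int parties_, waiting_, phase_;
    };

    int main() {
        const int kClustersOnThisNode = 6;     // e.g. 24 clusters spread over 4 nodes
        const int kIterations = 3;
        NodeBarrier barrier(kClustersOnThisNode);

        std::vector<std::thread> workers;
        for (int c = 0; c < kClustersOnThisNode; ++c) {
            workers.emplace_back([c, &barrier] {
                for (int iter = 0; iter < kIterations; ++iter) {
                    // ... update this cluster's red and black points here ...
                    barrier.Arrive();          // wait for the node's other clusters
                    std::printf("cluster %d finished iteration %d\n", c, iter);
                }
            });
        }
        for (auto& w : workers) w.join();
        return 0;
    }

The phase counter, rather than a simple arrival count, keeps a fast cluster from racing into the next iteration before the others have been released.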
Table 3 shows the results of our experiments with SOR using 24 clusters with a problem size of 1200 by 1200 points. We changed the size of the node set ten times, performing 30 iterations in each of the ten phases; the complete 300 iterations lasted about an hour. The second column of Table 3 shows the number of nodes participating in each phase. Between phases, the application either expanded or contracted as indicated by the table. Column 3 shows the time needed to reconfigure. For node set expansion, this includes a delay of about five seconds to start new logical node processes. Column 4 shows the time taken for the application and runtime system to relocate all of its local objects. Finally, the last two columns show, respectively, the time to execute 10 iterations of the algorithm and the total time spent in that phase of the program.

The table shows that the program adapted effectively to changes in the number of nodes. Program phases executing on larger numbers of nodes were able to complete their 30 iterations in less time than phases executing on smaller numbers of nodes. Program phases that were forced to contract back to a smaller number of nodes completed their iterations in about the same amount of time as earlier phases on the same number of nodes.

7 Conclusions

We have described a facility for adapting to changing numbers of nodes in the Amber parallel-distributed programming environment. Amber's fundamental approach to dynamic node reconfiguration involves three aspects: (1) user-level support for a distributed, shared, object memory, (2) programmer-controlled object mobility, and (3) a mechanism for communicating node set changes to the application.

We have compared this approach to the more traditional process migration alternative. In the past, process migration has been used primarily to balance the load given a collection of independent non-cooperating processes. In our environment, however, the objective is to distribute interacting components of a single parallel program. To properly balance such components requires an understanding of both the computational and communication patterns of the various components. We believe that while a parallel application could be structured as a set of processes that rely on kernel-level process migration, a user-level approach is more appropriate because:

● It yields better local performance, due to the use of a shared address space on a node.

● It provides better control of parallelism, because fine-grained scheduling is done at the user level. The kernel does not need to schedule competing threads from different processes that are part of the same application.

● It permits more effective reconfiguration, because the user level knows which data to move and which data to ignore. Also, only the application can decide how to best redistribute objects among remaining nodes in order to optimize processing and communication.
    Program   Number     Reconf.   Distribute   10 Iter.   Total Time
    Phase     of Nodes   Time      Time         Time       (30 Iter.)
      1          2        na         na          149.3       448.1
      2          8       10.6       35.8          48.0       190.6
      3          2       29.5       24.0         152.7       511.6
      4          4        6.2       23.7          79.0       267.0
      5          6        5.3       16.3          60.3       202.6
      6          4       12.4       12.7          81.4       269.6
      7          8       10.3       23.8          49.5       182.8
      8          2       28.2       23.2         149.9       501.3
      9          4        5.9       23.6          81.4       274.1
     10          2       17.6       15.9         150.6       485.5

    Table 3: 24-Cluster Adapting SOR (seconds)
Amber supports
node reconfiguration
by providing
simple mechanisms
for state migration,
together with
an interface that allows the application
to adapt to a
changing node set. We have shown how an application
can use this interface,
by overdecomposing
at the object level. We are continuing
research to determine
the
performance
effects of reconfiguration
on other paralleldistributed
programs.
8 Acknowledgments

We would like to thank Rik Littlefield, Cathy McCann, Simon Koeman, Tom Anderson, Kathy Faust, and Ed Lazowska for discussing with us the issues raised in this paper.

References

[Anderson et al. 90] Anderson, R. J., Beame, P., and Ruzzo, W. L. Low overhead parallel schedules for task graphs. In Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, pages 66-75, July 1990.

[Arnould et al. 89] Arnould, E. A., Bitz, F. J., Cooper, E. C., Kung, H. T., Sansom, R. D., and Steenkiste, P. A. The design of Nectar: A network backplane for heterogeneous multicomputers. In Proceedings of the 3rd Symposium on Architectural Support for Programming Languages and Operating Systems, Computer Architecture News, 17(2):205-216, April 1989.

[Bennett et al. 90a] Bennett, J. K., Carter, J. B., and Zwaenepoel, W. Adaptive software cache management for distributed shared memory architectures. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 125-134, May 1990.

[Bennett et al. 90b] Bennett, J. K., Carter, J. B., and Zwaenepoel, W. Munin: Distributed shared memory based on type-specific memory coherence. In Proceedings of the 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 168-176, March 1990.

[Chase et al. 89] Chase, J. S., Amador, F. G., Lazowska, E. D., Levy, H. M., and Littlefield, R. J. The Amber system: Parallel programming on a network of multiprocessors. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 147-158, December 1989.

[Douglis & Ousterhout 87] Douglis, F. and Ousterhout, J. Process migration in the Sprite operating system. In Proceedings of the 7th International Conference on Distributed Computing Systems, pages 18-25, September 1987.

[Eager et al. 86] Eager, D. L., Lazowska, E. D., and Zahorjan, J. Adaptive load sharing in homogeneous distributed systems. IEEE Transactions on Software Engineering, SE-12(5):662-675, May 1986.

[Faust 90] Faust, K. An empirical comparison of object mobility mechanisms. Master's thesis, Department of Computer Science and Engineering, University of Washington, April 1990.

[Fowler 85] Fowler, R. J. Decentralized Object Finding Using Forwarding Addresses. PhD dissertation, University of Washington, December 1985. Department of Computer Science technical report 85-12-1.

[Green & Juels 86] Green, P. and Juels, R. The jigsaw puzzle: A distributed performance test. In Proceedings of the 6th International Conference on Distributed Computing Systems, pages 288-295, May 1986.

[Jul et al. 88] Jul, E., Levy, H., Hutchinson, N., and Black, A. Fine-grained mobility in the Emerald system. ACM Transactions on Computer Systems, 6(1):109-133, February 1988.

[Koeman 90] Koeman, S. Dynamic load balancing in a distributed-parallel object-oriented environment. Master's thesis, Department of Computer Science and Engineering, University of Washington, December 1990.

[Li & Hudak 89] Li, K. and Hudak, P. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.

[Litzkow et al. 88] Litzkow, M. J., Livny, M., and Mutka, M. W. Condor - a hunter of idle workstations. In Proceedings of the 8th International Conference on Distributed Computing Systems, pages 104-111. IEEE Computer Society, June 1988.

[Powell & Miller 83] Powell, M. L. and Miller, B. P. Process migration in DEMOS/MP. In Proceedings of the 9th ACM Symposium on Operating Systems Principles, pages 110-119. ACM/SIGOPS, October 1983.

[Thacker et al. 88] Thacker, C. P., Stewart, L. C., and Satterthwaite, Jr., E. H. Firefly: A multiprocessor workstation. IEEE Transactions on Computers, 37(8):909-920, August 1988.

[Theimer & Lantz 88] Theimer, M. M. and Lantz, K. A. Finding idle machines in a workstation-based distributed system. In Proceedings of the 8th International Conference on Distributed Computing Systems, pages 112-122. IEEE Computer Society, June 1988.

[Theimer et al. 85] Theimer, M. M., Lantz, K. A., and Cheriton, D. R. Preemptable remote execution facilities for the V-system. In Proceedings of the 10th ACM Symposium on Operating Systems Principles, pages 2-12. ACM/SIGOPS, December 1985.

[Zayas 87] Zayas, E. R. Attacking the process migration bottleneck. In Proceedings of the 11th ACM Symposium on Operating Systems Principles, pages 13-24, November 1987.