Beowulf Parallel Workstation
An Instantiation of the Cluster of Workstations Paradigm
SHASHI KANTH LAKSHMIKANTHA,
Computer Science and Engineering Department,
University of Texas @ Arlington,
Texas – 76019
{[email protected]}
Abstract- This paper presents an introduction to the technology of the Cluster of Workstations (COW) paradigm by considering the example of the Beowulf Parallel Workstation, which is an instantiation of the COW paradigm. The paper presents the features expected of a COW and shows how Beowulf follows the cluster paradigm through its hardware and software architecture. Along the way, we place Beowulf among alternative technologies such as Massively Parallel Processors (MPP) and Networks of Workstations (NOW). In the final stages of the paper, we discuss the future directions of Beowulf and some other research projects based on the COW principle, and we conclude with the limitations of Beowulf and the effect of current networking trends on the Beowulf system.
TABLE OF CONTENTS
1. PHILOSOPHY OF BEOWULF
(WHY BEOWULF?)
2. CLUSTERING TECHNOLOGY
3. TAXONOMY OF PARALLEL COMPUTERS
4. BEOWULF ARCHITECTURE
5. FUTURE DIRECTIONS
6. CONCLUSIONS
7. REFERENCES
1. PHILOSOPHY OF BEOWULF
The Earth and space sciences (ESS) project is a research project within the High Performance Computing and Communications (HPCC) program of NASA [1]. One of the goals of the ESS project is to determine the applicability of massively parallel computers to the problems faced by the Earth and space sciences community. The first Beowulf was built with the intention of solving the problems posed by the large data sets associated with ESS applications [2].
In more exact terms, the goal was a
“Gigaflops Scientific Workstation”, which
could provide an alternative computing
medium to high-end workstations, symmetric
multi-processors, and scalable distributed
memory systems. Mass-market commodity microprocessors have shown a rapid increase in performance, and there is a significant pricing disparity between PCs and scientific workstations. Together, these two factors suggested that substantial performance-to-cost gains could be realized by harnessing PC technology in parallel ensembles to provide high-end capability for scientific and engineering applications.
Towards this end, NASA initiated the
Beowulf program to apply these low cost
system configurations to the computational
requirements in the Earth and space
sciences.
At Supercomputing ’96, NASA and DOE demonstrated clusters costing less than $50,000 that achieved performance greater than one Gigaflop/s [1].
2. CLUSTERING TECHNOLOGY
Simply put, a cluster is just many computers connected by a dedicated network. From [3], it is a group of computers working together to share resources or workload. A cluster usually includes some form of hardware or software integration that automatically handles sharing. Note, though, that the cluster has to be configured at installation time and whenever changes are made to the cluster over time.
There are all kinds of clusters, from disk-sharing clusters to fully redundant, fault-tolerant operating systems and hardware.
Thus, the word cluster denotes a whole
family of technologies under a common
name. To give a general idea of the types of
clusters available, consider the following
categories:
1. A group of servers that balance the processing load or user load by using a central server or router that assigns the load to different servers (see the dispatcher sketch after this list).
 An initial or central server can determine the load of the other servers and send each new request to the least loaded server.
 Alternatively, the assignment can be based on user preference information or on the request type.
2. A group of servers that act as a central
system, joining together individual
resources, in whole or in part, for use by
clients.
 The individual resources could be
ordered in some structure as a
single virtual resource – for example
the Network File System (NFS).
 The individual resources are pooled
in no particular order and assigned
jobs as they become available – for
example a printer pool or modem
pool.
3. A group of servers that execute the
exact same application at the same time
in parallel across the servers.
 This is done primarily in fault-tolerant, redundant, or replicated systems to make sure that exact or correct functions are executed as required.
 Beowulf is an example of this
category.
4. A group of servers that execute parts of
the same application across the servers
to make the computing faster.
 This is parallel or distributed computing in its purest form, and it goes well beyond simple clustering.
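To make the first category concrete, here is a minimal sketch, in C, of a central dispatcher that sends each new request to the least loaded server. The server names, the load metric, and the function names are illustrative assumptions, not part of any particular cluster product.

#include <stdio.h>

#define NSERVERS 4

/* A hypothetical view of one back-end server as seen by the dispatcher. */
typedef struct {
    const char *name;
    int active_requests;   /* stand-in for a real load metric */
} server_t;

/* Return the index of the server with the fewest active requests. */
static int pick_least_loaded(const server_t *srv, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (srv[i].active_requests < srv[best].active_requests)
            best = i;
    return best;
}

int main(void)
{
    server_t servers[NSERVERS] = {
        {"node0", 3}, {"node1", 1}, {"node2", 5}, {"node3", 2}
    };
    int target = pick_least_loaded(servers, NSERVERS);
    printf("dispatching new request to %s\n", servers[target].name);
    servers[target].active_requests++;   /* account for the new request */
    return 0;
}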
Cluster-based distributed computing vs. true distributed computing: The difference between true distributed computing environments and cluster-based distributed computing lies in how the distributed servers are interfaced together. If the servers are in a completely seamless environment, where individual node identity is not an issue for the programmer or the administrator, you have a true distributed computing environment. A series of machines that have a distributed space but also have to be managed or identified individually would be a type of cluster. A parallel database partitioned across several machines is usually considered a cluster.
3. TAXONOMY OF PARALLEL
COMPUTERS
The parallel processor technology can be
partitioned into three types. They are:
 Massively Parallel Processors (MPP)
Ex. nCube, Cray, etc
 Cluster of Workstations (COW)
Ex. Beowulf
 Network of Workstations (NOW)
Ex. Berkeley NOW
Massively Parallel Processors (MPP):
 MPPs are the largest systems. They are offered in configurations from 6 to 2,048 processors, deliver more than 2.4 TFLOPS of peak performance, and offer the greatest amount of power for parallel applications. They support large parallel workloads with up to 4 TB of central memory; industry-leading bisection bandwidth in excess of 122 GB per second speeds overall application performance, and I/O bandwidth of up to 128 GB per second delivers solutions fast.
 MPPs have the lowest-latency interconnect network of the three classes.
 Programmers are still required to worry about locality, load balancing, granularity, and communication overheads in order to obtain the best performance.
Cluster of Workstations (COW) vs. MPP:
 COWs are relatively smaller. An example system may contain around 16 processors. Achieved performance has been in the range of 10.9 Gflop/s (Caltech Beowulf) and 1.25 Gflop/s (NASA Beowulf).
 Interconnect-network latency is higher than that of an MPP.
 Programs that do not require fine-grained computation and communication can usually be ported and run effectively on Beowulf clusters.
A small note on the importance of latency in the interconnect network: in parallel jargon, fine granularity means that individual tasks are relatively small in terms of code size and execution time. The smaller the granularity, the greater the potential for parallelism and speed-up, but also the greater the overheads of synchronization and communication (data dependency). Beowulf’s network latency does not support such fine-grained parallelism. Recent advances in high-speed networking affect the latency parameter; we will return to this later in the paper. The rough model below illustrates the trade-off.
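As a back-of-envelope illustration (the numbers below are assumptions chosen for the example, not measurements from the paper), the following C snippet models parallel run time as the serial work divided across processors plus a per-message latency charge. It shows why cluster-class latency rewards coarse granularity and punishes fine granularity.

#include <stdio.h>

int main(void)
{
    double t_serial = 100.0;   /* total sequential work, in seconds (assumed) */
    double latency  = 100e-6;  /* per-message latency, Fast-Ethernet class (assumed) */
    int    p        = 16;      /* number of processors */

    /* Assume one message exchange per task: finer granularity
     * means more tasks and therefore more messages. */
    long coarse_tasks = 1000;
    long fine_tasks   = 1000000;

    double t_coarse = t_serial / p + coarse_tasks * latency;
    double t_fine   = t_serial / p + fine_tasks   * latency;

    printf("coarse-grained: %.2f s (speed-up %.1fx)\n", t_coarse, t_serial / t_coarse);
    printf("fine-grained:   %.2f s (speed-up %.1fx)\n", t_fine,   t_serial / t_fine);
    return 0;
}

With these assumed numbers, the coarse-grained job keeps nearly the full 16-way speed-up, while the fine-grained job spends more time communicating than it saves by computing in parallel.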
Network of Workstations (NOW):
Programming a NOW is usually an attempt to harvest unused cycles on an already installed base of workstations. Programs in this environment require algorithms that are extremely tolerant of load-balancing problems and large communication latencies. The load-balancing problem arises because a NOW depends on the workstations that form it having unused cycles. If the algorithm assumes that a fixed number of cycles is always available, and all the workstations in the NOW are too busy to contribute idle cycles, then the algorithm will fail unless it is tolerant of load-balancing problems. There is also unpredictability in network latency, as the network load is not determined solely by the application being run on the cluster: because the interconnection network is visible to the outside world, a portion of the network traffic may be generated by systems outside the NOW. Thus, if the algorithm is not tolerant of large communication latencies, it will fail.
COW vs. NOW:
 The nodes in a COW are dedicated to the cluster. This has the important ramification that the performance of individual nodes is not subject to external factors such as what applications are running on the component units. Note that in a NOW, the number of idle cycles contributed by a workstation depends on the applications the user is running at that particular time. Dedicated nodes therefore ease load-balancing problems.
 The interconnection network of a cluster is isolated from the external network. This reduces the unpredictability in network latency and strengthens system security, as the only authentication needed between processors is for system integrity.

In a COW, operating system parameters
can be tuned to achieve better
throughput for coarse-grain jobs. This is
not possible in a NOW as each
workstation on a NOW has a user
interacting with the workstation, and the
user of course will want a good
interactive response.

A COW provides a global process ID.
This enables a process on one node to
send signals to another process, all
within the user domain.
4. BEOWULF ARCHITECTURE
There are two constraints on the Beowulf
workstation architecture:
 It must use exclusively commodity
hardware to avoid dependence on a
single vendor.

The cost of the workstation, populated with disk and memory, should be no more than that of a high-performance scientific workstation, approximately $50,000 [2].
With this in mind, Beowulf clusters (FIGURE
1) have been assembled around every
generation of commodity CPU since the
1994 introduction of the 100MHz Intel DX4
processor. The first Beowulf was built with DX4 processors and 10Mbps Ethernet. In late 1997, a Beowulf was built using 16 200MHz P6 processors connected by Fast Ethernet and a Fast Ethernet switch. The current price/performance point in desktop architectures, Intel’s 200MHz Pentium Pro CPU, is incorporated into the new generation of Beowulf clusters. The networking equipment has changed from simple 10Mbps Ethernet to Fast Ethernet to various forms of switched Ethernet.
We will now discuss the features expected
of Beowulf and the general hardware and
software architecture used to achieve this
end.
Features expected of Beowulf: The features expected of Beowulf, to make it the intended “Gigaflops Parallel Workstation”, are as follows:
 High interconnect-network bandwidth and low-latency inter-processor communication
 High aggregate-bandwidth disk subsystems
 High floating-point performance
How is the Beowulf built to achieve the above features? Read on for the answer.
High Network Bandwidth: Beowulf achieves high bandwidth by using a multi-channel interconnection network. The interconnection network can, in general, be of two types:
 A Fast Ethernet switched network (refer to Figure 2). The maximum achievable speed here is 100Mbps. With the advent of Gigabit Ethernet, Gigabit links may replace the 100Mbps lines.
 A crossbar switch architecture (refer to Figure 3). The crossbar switches are designed so that, for non-overlapping connections, the switch acts as a point-to-point link. Due to the switched architecture, broadcast is eliminated and the network traffic due to broadcast is reduced, freeing up more bandwidth. As the switches get faster and intelligence is added to them, switches may dominate the future of networking.
High Aggregate Disk Bandwidth: Why do we need high aggregate disk bandwidth? Existing systems follow the model of a shared file server accessed through a common LAN. In this case, the same data is accessed repeatedly during a working session because the typical workstation does not have the disk capacity to hold all of the requisite data. The result is long latencies to file servers, tedious response cycles, and the burdening of shared resources; thus, performance suffers.
Coarse-grain parallelism implies that the program can follow the SPMD (Single Program, Multiple Data) model, in which the same program acts on multiple pieces of data and the results are later merged if required. For such jobs, having disk bandwidth at each node means the aggregate disk bandwidth will improve response times, reduce latencies, and reduce the load on the network. This holds only for coarse-grained jobs: if the job is fine-grained, then due to the higher latency of the network in a cluster, the programs will perform poorly.
Beowulf achieves a high aggregate disk bandwidth by placing disks on every node, which also gives it a large distributed disk space. A brief SPMD sketch follows.
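The following is a minimal SPMD sketch in C using MPI (a generic illustration under assumed data sizes, not code from the Beowulf distribution): every node runs the same program, computes a partial result from its own locally held slice of the data, and the partial results are merged at the end.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* In a real job, each node would read its slice of the data set
     * from its local disk; here we fake a partial result. */
    double local_sum = 1000.0 * (rank + 1);

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("merged result from %d nodes: %.1f\n", nprocs, global_sum);

    MPI_Finalize();
    return 0;
}

Because each node touches only its local disk until the final reduce, disk traffic scales with the number of nodes while the network carries only the small merged results, which is exactly the coarse-grained pattern described above.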
High floating-point performance: To increase
floating-point performance for CPU intensive
applications, small-scale (2 to 4 processors)
system boards are used.
One point about the Beowulf architecture is that clusters can be built to different sets of requirements, as has been done at universities such as Drexel, GMU, Clemson, and UIUC. Thus, we have a very flexible architectural framework.
FIGURE 1: BEOWULF ARCHITECTURE (a computer taps the cluster’s power; CPU 0 through CPU n are connected by a multi-channel internal interconnection network, with a link to the LAN)
FIGURE 2: FAST ETHERNET SWITCHED NETWORK (CPU 0 through CPU n connected to a Fast Ethernet switch)
FIGURE 3: CROSSBAR SWITCH NETWORK (CPUs connected through a crossbar switch)
Software Architecture of Beowulf: Until now, we have discussed the hardware architecture of Beowulf. Let us now proceed to the software architecture.
The Beowulf software architecture is called Grendel. Grendel is implemented as an add-on to the commercially available, royalty-free Linux operating system. The Beowulf distribution includes several programming environments and development libraries as individually installable packages. For example, PVM and MPI are available; System V-style IPC and POSIX threads (pthreads) are also supported.
The main features of Grendel can be classified as follows:
 Global Process ID space
- At the library level
- Independent of external libraries
 Programming Models
- PVM/MPI
- Distributed Shared Memory
 Parallel File System
Let us discuss the above features in detail.
Global Process ID (GPID) Space: Each
process running in a UNIX kernel has a
unique identifier called the Process ID (PID).
The uniqueness is limited to the single
kernel under consideration. In a parallel,
distributed context, it is often convenient for
UNIX processes to have a PID that is unique
across an entire cluster, spanning several
kernels. This can be achieved in two ways.
GPID Space – Method 1:
 This incorporates the notion of a SPMD
context of program execution, where
multiple copies of the same code run on
a collection of nodes, and share a UNIX
process ID.
 One implementation of this method is
discussed in [4]. Here, a parallel
process is made up of a number of
parallel tasks. Parallel tasks within a
single process are allocated the same
process ID and context ID, on each cell
(a cell being a processor node). A
parallel task of one process can send
signals to other parallel tasks in the
same process running on other cells or
nodes.
GPID Space – Method 2:
 The second scheme is available at the
library layer in PVM [5]. PVM provides
each task running in its virtual machine
with a task ID that is unique across all
the hosts participating in the virtual
machine. But, this method is restricted
to programs written and compiled under
the PVM library.
 Beowulf provides an implementation of
this method, called GPID-PVM. The PID
space is divided into two parts: one for
the local processes and one for the
global ones. Some local processes, like
init (in the UNIX sense), are duplicated
across the cluster. So, the process
space would be cluttered if the local
processes were also included in the
global process space.
 This requires a static allocation of non-overlapping PID ranges. This ensures that runtime PID assignment requires no inter-node communication, as each kernel allocates PIDs from its locally allotted range. A small sketch of this mapping follows.
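Here is a minimal sketch, in C, of what a static, non-overlapping PID-range scheme could look like. The range size and function names are illustrative assumptions, not the actual GPID-PVM implementation.

/* Assumed size of each node's PID slice (illustrative only). */
#define NODE_PID_RANGE 4096L

/* Map a node-local PID to a cluster-wide GPID. No inter-node
 * communication is needed: each kernel allocates only from its
 * own pre-assigned slice. */
long local_to_global_pid(int node_id, long local_pid)
{
    return node_id * NODE_PID_RANGE + local_pid;
}

/* Recover the owning node from a GPID, e.g. to route a signal
 * to the right kernel. */
int gpid_to_node(long gpid)
{
    return (int)(gpid / NODE_PID_RANGE);
}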
Programming Models: There are several
distributed programming models available
on Beowulf. A couple of the most commonly
used are PVM and MPI [6]. Also, a
distributed shared memory model is
planned.
 PVM/MPI
 These embody the message-passing paradigm for application portability: an application written using the PVM/MPI libraries can be ported to any platform for which these libraries are available. (A minimal message-passing sketch appears after this list.)
 Beowulf supports a slightly modified version of the Oak Ridge PVM [7] and an unchanged Ohio State LAM MPI package [8].
 Distributed Shared Memory
 Beowulf implements a page-based Network Virtual Memory (NVM), also known as Distributed Shared Memory (DSM). The initial implementation is based on the ZOUNDS system from Sarnoff [9].
 ZOUNDS is designed to achieve the goal of a 50-microsecond page fault in the MINI gigabit ATM interface.
 The basic idea here is to flatten out the page-fault/I/O path by putting shortcuts into the virtual memory system where needed.
 Note that the Linux kernel provides a VFS-like interface into the virtual memory system. This makes it simpler to add transparent distributed-memory backends to implicitly managed namespaces.
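As a concrete illustration of the message-passing model mentioned above, here is a minimal PVM 3 sketch in C (a generic example, not code from the Beowulf distribution; the spawned executable name is a placeholder): the parent task spawns one worker, the worker packs an integer and sends it, and the parent receives and unpacks it.

#include <pvm3.h>
#include <stdio.h>

int main(void)
{
    int mytid  = pvm_mytid();    /* enroll this task in the virtual machine */
    int parent = pvm_parent();

    if (parent == PvmNoParent) {             /* we are the spawning task */
        int child, value;
        /* "pvm_demo" is a placeholder name for this same executable */
        pvm_spawn("pvm_demo", (char **)0, PvmTaskDefault, "", 1, &child);
        pvm_recv(child, 1);                  /* wait for message tag 1 */
        pvm_upkint(&value, 1, 1);
        printf("parent %x received %d from worker %x\n", mytid, value, child);
    } else {                                 /* we are the spawned worker */
        int value = 42;
        pvm_initsend(PvmDataDefault);        /* new send buffer, default encoding */
        pvm_pkint(&value, 1, 1);
        pvm_send(parent, 1);                 /* send with message tag 1 */
    }

    pvm_exit();
    return 0;
}

The same exchange could be written with MPI_Send and MPI_Recv; which library to use is largely a question of which one the rest of the application already depends on.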
Parallel File System: The basic aims of a parallel file system [10] are:
 To allow access to storage devices in parallel from multiple nodes: disk file transfers between separate pairs of disks are done in parallel. Note that a switched network provides a point-to-point connection between nodes connected at different ports of the switch.
 To allow data on multiple storage devices to appear as a single logical file.
Beowulf systems can take advantage of a number of libraries developed to provide parallel file system interfaces to Networks of Workstations ([11], [12], [13]).
5. FUTURE DIRECTIONS
 The GPID concept is good for cluster-wide control and signaling of processes, but it does not by itself provide a global view of processes. For example, the UNIX commands ps, top, etc. will not work unmodified on the present Beowulf system. Work is underway to provide this capability. This involves using the /proc pseudo-filesystem; the Linux implementation uses /proc to present almost all system information. The work involves the conceptually simple step of combining the /proc directories of the cluster using the existing NFS capabilities.
 A page-based distributed shared memory uses the virtual memory hardware of the processor and a software-enforced ownership and consistency policy to give the illusion of a memory region shared among the processes running an application. A more conventional DSM-NVM implementation is planned, along with support for a Network Memory Server.
6. CONCLUSIONS
 Using Beowulf, we have implemented a high-performance workstation at a fairly low price.
 The operating point targeted by the Beowulf approach is scientific applications and users requiring repeated use of large data sets and large applications, with easily delineated coarse-grained parallelism.
 Coarse-grained parallelism is one of the limitations of Beowulf, but the vastly improving networking speeds promise to reduce the interconnect latency in the future and allow finer-grained programs to work as well.
 One more limitation is that Beowulf was built by and for researchers with parallel programming experience.
7. REFERENCES

References for this paper:
1. Beowulf Project at CESDIS
http://www.beowulf.org/
2. Beowulf: Introduction and Overview
http://www.beowulf.org/intro.html
3. Daniel Ridge, Donald Becker, Phillip
Merkey, “Beowulf: Harnessing the
Power of Parallelism in a Pile-of-PCs”
Proceedings, IEEE Aerospace, 1997
4. Rawn Shah, “What exactly is a cluster,
anyway?”, Article appearing in the
Connectivity columns of SunWorld,
August 1998.
http://www.sunworld.com/
5. Andrew Tridgell, Paul Mackerras, David Sitsky and David Walsh, "AP/LINUX – A Modern OS for the AP1000+", Australian National University Technical Report.
http://cap.anu.edu.au/cap/projects/linux/
6. PVM (Parallel Virtual Machine)
http://www.epm.ornl.gov/pvm/
7. The Message Passing Interface (MPI) standard
http://www-unix.mcs.anl.gov/mpi/index.html
8. Oak Ridge PVM
http://www.epm.ornl.gov/pvm/
9. LAM / MPI Parallel Computing
http://www.mpi.nd.edu/lam/
10. R.G. Minnich, "ZOUNDS: A Zero Overhead Unified Network DSM System", Sarnoff Technical Report.
ftp://ftp.sarnoff.com/pub/mnfs/www/docs/cluster.html
11. John M. May, "Parallel I/O", Lawrence Livermore National Laboratory.
http://www.llnl.gov/
12. R. Bennett, et al., "Jovian: A Framework for Optimizing Parallel I/O", Proceedings of the 1994 Scalable Parallel Libraries Conference.
13. PIOUS and the Parallel Virtual File
System, Clemson University

Other related references
1. Chance Reschke, Thomas Sterling, Daniel Ridge, Daniel Savarese, Donald Becker, Phillip Merkey, "A Design Study of Alternative Network Topologies for the Beowulf Parallel Workstation", Proceedings, High Performance and Distributed Computing, 1996.
2. Donald J. Becker, Thomas Sterling, Daniel Savarese, Bruce Fryxell, Kevin Olson, "Communication Overhead for Space Science Applications on the Beowulf Parallel Workstation", Proceedings, High Performance and Distributed Computing, 1995.
3. Donald J. Becker, Thomas Sterling, Daniel Savarese, John E. Dorband, Udaya A. Ranawak, Charles V. Packer, "Beowulf: A Parallel Workstation for Scientific Computation", Proceedings, International Conference on Parallel Processing, 1995.
4. R.G. Minnich and David J. Farber, "The Mether System: Distributed Shared Memory for SunOS 4.0".
5. Cristiana Amza, et al., "TreadMarks: Shared Memory Computing on Networks of Workstations", Rice University.
6. The Berkeley Network of Workstations
(NOW).
http://now.cs.berkeley.edu/
7. Werner Vogels, et al., "The Design and Architecture of the Microsoft Cluster Service – A Practical Approach to High Availability and Scalability", Technical Report, Microsoft Research.
8. Barak A. and La'adan O., “The MOSIX
Multicomputer Operating System for
High Performance Cluster Computing”,
Journal of Future Generation Computer
Systems, Vol. 13, No. 4-5, pp. 361-372,
March 1998.
http://www.cs.huji.ac.il/mosix/