University of Adelaide
Master of Computer Science Program
Project Proposal:
A Performance Analysis of Java Distributed Shared Memory Implementations
Kevin Fenwick
E-mail: [email protected]
Web: www.cs.adelaide.edu.au/~kfenwick/thesis
March 2001
Supervisors: Dr Paul Coddington and Dr Francis Vaughan
Abstract
The Java thread mechanism allows the exploitation of parallelism within the confines of
shared memory multiprocessors by allowing multiple threads to be mapped onto distinct
physical processors. Distributed memory machines, however, which are generally
cheaper and therefore much more common, have been unable to harness the simplicity
and elegance of this approach, their programming being reliant on more complex
message-passing techniques. Java Distributed Shared Memory (DSM) implementations
seek to redress this situation by allowing the thread model to be extended into the realm
of distributed memory multicomputers.
While a number of Java DSM systems have been implemented, no direct performance
comparison of them has yet been conducted. We will be performing such a comparison
by implementing a parallel Java benchmark suite. Tests will be conducted on a Sun
Technical Compute Farm and a Linux Beowulf cluster, both of which are large distributed memory machines with over 150 processors.
This study will determine which application types offer scalability and good
performance with which Java DSM approaches and enable us to make recommendations
on how future implementations may be improved.
1. Introduction
Recently there has been increased interest in the use of Java for high performance computing.
Indeed the Java Grande Forum, which has been expressly set up for the purpose of promoting and
developing the use of Java in large-scale, numerically intensive, or so-called ‘Grande’,
applications, is of the opinion that ‘Java has potential to be a better environment for Grande
application development than languages such as Fortran and C++’ [10].
A significant feature of Java is its thread model, which allows convenient and portable parallel
programming to be achieved by mapping threads onto physically distinct processors. While this
mechanism is a simple and elegant means of achieving medium grained parallelism, its scope is
limited to shared memory machines. At the same time the trend in high performance computing is
towards distributed memory machines based on commodity components, namely clusters of
workstations. Java Distributed Shared Memory (DSM) is a means of extending the thread model
into the distributed memory arena by implementing a shared-memory abstraction over the
underlying distributed hardware.
While Java DSM systems have been or are being developed by a number of groups, no direct
comparison of them has yet been conducted. We will be performing the first such comparison of
Java DSM systems for performance and scalability by implementing a parallel Java benchmark
suite. We have access to two large distributed memory machines on which to run the tests: a Sun
Technical Compute Farm, which is typical of a high-end commercial machine, and a Linux
Beowulf cluster, which is built entirely from commodity components. The study, which will be of
considerable interest to the Java Grande Community, will enable us to assess how well different
types of parallel algorithms are matched to particular Java DSM implementations and perhaps
provide insight into how future Java DSM systems may be improved.
In this research proposal we shall provide a brief overview of DSM, discuss the particular case of
Java DSM and then go on to discuss the benchmarking methodology that we will adopt in
assessing the various Java DSM systems.
2. Distributed Shared Memory
2.1 Introduction
Modern parallel computing architectures generally fall into one of two classes: shared memory
machines and distributed memory machines [1,3]. In the former, all processors share a single
addressable memory space; the most common example of such machines is the bus-based
symmetric multiprocessor (SMP). In the latter each processor, or a small set of processors, is
encapsulated within a semi-autonomous node, which includes an amount of local memory. Since
no node may directly address the memory of another, such machines are often termed NORMA
(NO Remote Memory Access) machines.
Shared memory machines are difficult to build, expensive and generally do not scale beyond 100
processors. Distributed memory machines, on the other hand, are much cheaper and simpler to
build and may scale to 1000s of processors. The current trend is to build such machines out of
off-the-shelf components, such as commercial workstations/PCs and commodity interconnect,
such as Fast Ethernet. These distributed memory machines are known as clusters of workstations
(COWs) or, where each node is itself an SMP, CLUMPs (clusters of SMPs).
However, while shared memory machines are more difficult to build than their distributed
memory counterparts, they are much easier to program. Processes can address the entire memory
space, and communication and synchronization are achieved with simple load-store primitives and
shared variables. In distributed memory machines, on the other hand, communication and
synchronization must be achieved by explicit message-passing, using send-receive primitives that
are normally implemented as C or Fortran libraries such as MPI (Message Passing Interface) [34].
A quandary therefore exists: shared memory machines are difficult to build but relatively easy to
program, whereas distributed memory machines are easy to build but relatively difficult to
program. DSM [5,22] is seen by its proponents as a way out of this dilemma since it aims to
provide the relative ease of shared memory programming on distributed memory systems. This is
achieved, as shown in figure 2.1, by abstracting over the distributed nature of the hardware to
present an application with the illusion of a single memory space that is shared between the
nodes. Hence, the shared memory programming model may be applied in a distributed memory
setting.
Figure 2.1 Generalized Distributed Shared Memory Scenario
[Figure: an application sits above a distributed shared memory software layer spanning the cluster nodes; each node contains a CPU, cache and local memory.]
The DSM abstraction is typically achieved as middleware or a run-time system that transparently
maps references to the virtual shared memory space onto the underlying hardware using message
passing. The performance of such systems will obviously depend heavily on the amount of
underlying network communication that is necessary to maintain this illusion.
2.2 DSM Implementations
DSM implementations may be differentiated along two lines: the structure of the shared memory
the system presents to applications and the memory consistency model the system provides. At
one extreme the shared memory provided by the system may be completely unstructured, by
which we mean that it is simply a linear array of words, and at the other the shared memory may
be highly structured as a collection of distributed objects that are visible to all processes.
Moreover, all DSM systems involve some replication between nodes in order to improve
efficiency. When data is replicated in this way, it must be kept consistent. The memory
consistency model [4] that the system follows defines the degree to which this consistency must
be maintained and therefore how the memory system can be expected to behave when presented
with a request. A weaker consistency protocol requires less communication, and hence improves performance, but imposes a stricter programming model.
We shall now briefly review some representative DSM implementations.
2.2.1 IVY
IVY, developed in the mid-1980s [5], was the prototypical DSM system. It was of the
unstructured variety, the shared memory space being carved up into fixed-size pages that could
migrate from machine to machine on demand. Essentially the system works as an extension of
standard virtual memory: when a processor tries to access a location in a page that is not resident
in its local physical memory, a page fault occurs and the necessary page must be fetched. With IVY, however, that page may be brought in not only from disk but also from another machine.
IVY was aimed squarely at solving the so-called ‘dusty deck’ problem, i.e. the wish to run existing
multiprocessor programs on distributed memory machines. These programs assume a sequentially
consistent model of memory, which is the strongest feasible form for parallel machines. This
means that there is a total order on all memory accesses i.e. all processors see the same order of
writes. Implementing a consistency protocol that assures this is extremely costly in terms of
communication and therefore performance. This performance problem is compounded by the
phenomenon of ‘false sharing’, which occurs if non-shared variables, being used by different
processors, are positioned on the same page. The page will have to be kept consistent, even
though the variables on it need not be, causing a so-called ‘ping-pong’ effect as the page is
swapped from one node to another.
2.2.2 Munin
The Munin system [6] improved on the performance of IVY by using an implementation of release
consistency instead of sequential consistency. Release consistency [25] relies on the observation
that parallel programs need to synchronize access to shared memory to avoid race effects and that
such synchronization is typically achieved with critical sections. Since only one process may be
in a critical section at a time, writes need not be made available to other processes until the
critical section is released, thus reducing the amount of communication necessary to maintain this
sort of consistency.
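To illustrate the idea in Java terms, the following sketch (ordinary synchronized Java rather than Munin's own interface) shows a critical section; under release consistency, the writes made while the lock is held need only be propagated to other processes when the lock is released.

    // Minimal sketch, assuming ordinary Java monitors stand in for DSM locks:
    // under release consistency the update to 'total' made inside the critical
    // section only has to be made visible to other processes at the release.
    public class SharedCounter {
        private final Object lock = new Object();
        private int total = 0;

        public void add(int value) {
            synchronized (lock) {   // acquire: obtain a consistent view of 'total'
                total += value;     // the write can remain local for now...
            }                       // release: ...and is published here
        }

        public int read() {
            synchronized (lock) { return total; }
        }
    }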
2.2.3 Treadmarks
The Treadmarks system [7] improves on this again by using a lazy implementation of release
consistency. The lazy implementation of release consistency manages to reduce communication
even further by only making writes available to a process that is about to acquire a critical section
rather than to all processes on the release of a critical section. Furthermore, by using a multiple-writers protocol, Treadmarks allows multiple processes to write to copies of the same page, thus
reducing the false-sharing problem.
2.2.4 Midway
Midway [8] allows the use of entry consistency, which requires that every individual shared
variable be associated with a lock. This means that multiple critical sections may be in use at the
same time, increasing the amount of available parallelism.
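The following sketch illustrates the entry consistency idea in plain Java (it is not Midway's actual interface): each shared variable is guarded by its own lock, so the two critical sections below may execute concurrently on different processors.

    // Sketch of entry consistency using a separate lock object per shared
    // variable; only the variable associated with the acquired lock needs to
    // be made consistent, so updates to x and y can proceed in parallel.
    public class EntryConsistentPair {
        private final Object xLock = new Object();
        private final Object yLock = new Object();
        private double x, y;

        public void updateX(double v) {
            synchronized (xLock) { x = v; }   // only x is brought up to date
        }

        public void updateY(double v) {
            synchronized (yLock) { y = v; }   // independent of the lock guarding x
        }
    }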
2.2.5 Linda
Linda [24] provides a highly structured view of shared memory known as a tuple-space into
which processes may insert or extract tuples, which are ordered structures consisting of typed
fields. Processes may then communicate by one process inserting a tuple into the tuple-space and another extracting a tuple that matches the one it is looking for.
2.2.6 Orca
The Orca system [26] is an object-based DSM that comprises the Orca language, compiler and
run-time system that operates on the Amoeba distributed operating system. In object-based DSM
systems, the shared memory space takes the form of an object store which, while in reality distributed across the system, appears to individual nodes to be a single object store. Any process can invoke any object’s methods regardless of where the object is located. It is the function of the object-based DSM system to make the distribution of objects transparent to individual processes.
3. Java Distributed Shared Memory
3.1 Introduction
The Java programming language has become so pervasive that it is now being actively promoted
as a tool for high performance scientific computing. The Java community has termed such
applications Grande applications and the Java Grande Forum [10] acts as a conduit for
recommendations to Sun Microsystems, the developer of Java, that will enable Java support for
Grande Applications to be improved.
Undoubtedly the initial interest in Java can be ascribed to its portability model, which is based on
the fact that Java programs are not generally compiled to native machine code but to an
intermediate byte-code. This byte-code is then interpreted by a software machine known as the
Java Virtual Machine (JVM), which has itself been ported to numerous hardware platforms.
Another interesting aspect of Java, though, is that it supports multithreading for shared memory machines as part of the language. The JVM transparently deals with the assignment of threads to processors, providing a simple and convenient method for exploiting parallelism that, when
coupled with the virtual machine paradigm, offers a path to truly portable parallel programming.
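As a concrete illustration, the short standard-Java program below (a sketch unrelated to any particular DSM system) starts several threads that compute partial sums of a series; the JVM is free to schedule each thread on a separate processor of a shared memory machine.

    // Each thread sums a strided slice of 0..n-1; the JVM decides which
    // processor runs which thread.
    public class ParallelSum {
        public static void main(String[] args) throws InterruptedException {
            final int nThreads = 4;
            final int n = 1000000;
            final long[] partial = new long[nThreads];

            Thread[] workers = new Thread[nThreads];
            for (int t = 0; t < nThreads; t++) {
                final int id = t;
                workers[t] = new Thread(new Runnable() {
                    public void run() {
                        long sum = 0;
                        for (int i = id; i < n; i += nThreads) sum += i;
                        partial[id] = sum;
                    }
                });
                workers[t].start();
            }

            long total = 0;
            for (int t = 0; t < nThreads; t++) {
                workers[t].join();          // wait for each worker, then combine
                total += partial[t];
            }
            System.out.println("sum = " + total);
        }
    }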
However, Grande Applications are typically geared towards distributed memory machines rather
than shared memory ones. Once again, DSM offers an escape route by allowing the threading
model to be extended into the domain of distributed memory.
3.2 Java Distributed Shared Memory Implementations
Java DSM implementations all provide a highly structured view of memory, a parallel Java
application being made up of interacting Java objects operating within some distributed shared
memory space. We may characterize three approaches to implementing this illusion on a
distributed memory machine, namely those that operate above the JVM level, those that operate
below the JVM level and those that operate at the JVM level. We shall call these respectively
super-JVM level DSM, sub-JVM level DSM and JVM level DSM.
3.2.1 Super-JVM level DSM
In this approach, each cluster node runs its own JVM and objects placed on each node use some
communication mechanism, such as Java’s own Remote Method Invocation (RMI), to pass
messages to one another. Systems such as JavaParty [12] and ProActive [13] take this approach,
remote objects and threads being handled at the level of the Java language.
In JavaParty [12], for example, multithreaded programs are written in a special form of Java that
includes certain compiler directives indicating that objects may be shared. This is done by
prefixing class declarations with the remote keyword. An initial precompilation phase translates
such classes into standard Java bytecode plus ‘RMI hooks’ that enable remote objects to
communicate with one another across the cluster. Each JVM must register with a central runtime
manager that coordinates their activities.
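The sketch below indicates the flavour of a JavaParty class; the remote modifier is the JavaParty extension described above, while the class and method names are purely illustrative. The code is not standard Java and is accepted only by the JavaParty precompiler, which translates it into ordinary bytecode plus RMI calls.

    // Illustrative JavaParty-style class (hypothetical names): instances of a
    // 'remote' class may live on any node of the cluster, and method calls on
    // them cross the network transparently.
    public remote class RemoteWorker {
        private double[] data;

        public void setData(double[] d) { data = d; }

        public double sum() {
            double s = 0.0;
            for (int i = 0; i < data.length; i++) s += data[i];
            return s;               // executes on whichever node holds this object
        }
    }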
JavaSpaces [14] is an implementation of the Linda model developed by Sun as a Jini [33] service.
In JavaSpaces, tuple-spaces are referred to simply as Spaces and are represented as Java objects. Tuples themselves are Java classes that implement the Entry interface.
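A minimal sketch of this model is given below; it uses the JavaSpaces interfaces as documented by Sun, but the entry class and field names are purely illustrative, and obtaining the JavaSpace reference (normally done through a Jini lookup service) is elided.

    import net.jini.core.entry.Entry;
    import net.jini.core.lease.Lease;
    import net.jini.space.JavaSpace;

    // A tuple is just a class implementing Entry with public fields and a
    // public no-argument constructor; null fields in a template act as wildcards.
    public class TaskEntry implements Entry {
        public Integer taskId;
        public double[] input;

        public TaskEntry() {}
    }

    class SpaceExample {
        static void exchange(JavaSpace space, TaskEntry task) throws Exception {
            // producer: place a task into the shared space
            space.write(task, null, Lease.FOREVER);

            // consumer: block until a matching entry arrives, then remove it
            TaskEntry template = new TaskEntry();
            TaskEntry next = (TaskEntry) space.take(template, null, Long.MAX_VALUE);
            System.out.println("took task " + next.taskId);
        }
    }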
3.2.2 Sub-JVM level DSM
Here the JVM is built on top of some lower-level DSM infrastructure, such as those discussed in section 2, which presents the JVM with a picture of shared memory. Systems such as
Java/DSM [15], Hyperion [16] and JESSICA [18] take this approach.
Java/DSM, for example, presents a modified JVM where the Java heap [11] is implemented in
Treadmarks [7], a pre-existing DSM system.
3.2.3 JVM-level DSM
Here the JVM implementation itself is distributed across the nodes of the distributed memory
machine. One system that takes this approach is cJVM [17]. This provides a parallel Java
application with a single system image of a traditional JVM. This JVM is made up of a number of
virtual machine components, one on each node of the cluster. Taken as a whole, these processes
constitute the JVM and the application is totally unaware that it is running on a cluster. In reality,
threads and objects must be distributed across the nodes of the cluster. Distributed access is
supported by a so-called master-proxy paradigm, where a proxy is a surrogate for a remote master
object through which that object can be accessed. MultiJav [19] is another system implemented as a distributed JVM.
4. Benchmarking
4.1 Introduction
The project will centre on the development of a benchmark suite of parallel Java programs. The
aim of a benchmarking suite is to provide an objective performance measure of a system. In a
scientific High Performance Computing (HPC) scenario, we are interested purely in reducing the execution time of applications; hence, the running time of programs will be our metric of
comparison.
The benchmarks that we perform will enable us to compare the performance and scalability of
some of the different types of Java DSM systems with a range of parallel applications. It is hoped
that the benchmarks will reveal useful information about the particular systems and allow us to
make recommendations for future Java DSM implementations.
4.2 Benchmarking Methodology
A benchmarking suite is composed of a number of component programs designed to test various
aspects of a system. In choosing these components, a number of considerations need to be taken
into account.
Firstly, we will need to ensure that the components test different facets of a system. In the case of
the parallel Java algorithms that we will be running on the Java DSM systems, for example, it
will be important to select benchmarks that exhibit different patterns of communication. For
example, we will want to select both applications with little inter-node communication and more communication-intensive applications.
We should also ensure that the benchmark does not take too long to run, certainly not more than
an hour for the whole suite. This is important since we will be running the suite multiple times, on
different systems.
Typically, an important consideration in choosing a benchmark is portability. Certainly, we want
to make sure we are testing the same thing on each system. One might assume that, given Java’s inherent portability, this would not present a problem. However, benchmarks for what we have termed super-JVM level systems, i.e. those that operate at the language level, will require some code modifications to run. We need to ensure that such changes do not alter the algorithm we
are testing.
A number of benchmark suites have been implemented for sequential Java, which we will wish to examine: for example, SciMark from the US National Institute of Standards and Technology (NIST) [28] and the Java Grande Forum Benchmark Suite [27,29], the latter being of particular interest in that it concentrates on Grande applications. We will also be examining other HPC benchmarks such as the Numerical Aerodynamic Simulation (NAS) benchmarks [30], Linpack [31] and ParkBench [32].
It is envisaged that the benchmark suite will consist of a cross-section of parallel Java algorithms and applications such as a Fast Fourier Transform (FFT) kernel, an LU matrix solve benchmark, an n-body application using a Barnes-Hut algorithm, a Laplace solver involving nearest-neighbour communications, and an embarrassingly parallel task farm.
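As an indication of the form these components will take, the sketch below shows the simplest of them, an embarrassingly parallel task farm, written against the standard Java thread model with the wall-clock running time recorded as discussed in section 4.1; the compute() workload is a placeholder standing in for a real kernel such as the FFT or LU solve.

    // Embarrassingly parallel task farm: tasks are independent, so workers need
    // no inter-node communication beyond writing their results.
    public class TaskFarmBenchmark {
        public static void main(String[] args) throws InterruptedException {
            final int nWorkers = Integer.parseInt(args[0]);
            final int nTasks = 1024;
            final double[] results = new double[nTasks];

            long start = System.currentTimeMillis();    // running time is the metric
            Thread[] workers = new Thread[nWorkers];
            for (int w = 0; w < nWorkers; w++) {
                final int id = w;
                workers[w] = new Thread(new Runnable() {
                    public void run() {
                        for (int t = id; t < nTasks; t += nWorkers) {
                            results[t] = compute(t);    // each task handled exactly once
                        }
                    }
                });
                workers[w].start();
            }
            for (int w = 0; w < nWorkers; w++) workers[w].join();
            System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));
        }

        // placeholder workload standing in for a real benchmark kernel
        static double compute(int seed) {
            double x = seed + 1;
            for (int i = 0; i < 100000; i++) x = Math.sqrt(x + i);
            return x;
        }
    }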
4.3 Hardware
We have access to two large distributed memory machines with distinct characteristics, which will form an ideal testing ground for our benchmarks: a commercial system from Sun and a home-brew Beowulf cluster.
4.3.1 Sun Technical Compute Farm
The Sun Technical Compute Farm, known locally as Orion [21], is a cluster of quad-processor
E420R workstations connected by a high-speed Myrinet interconnect. Each E420R node is based
around four UltraSparc II 450MHz processors with 4MB of level 2 cache, 4GB of memory and
18GB of hard disk. In total, the machine has 160 processors, 640MB of cache memory, 160GB of
RAM and 720GB of disk.
The computer, which is used primarily for computational physics problems, is a good example of
a high-end commercially produced distributed machine; at its installation in June 2000 it was the fastest computer in Australia, with a peak speed of 144 GFLOPS.
4.3.2 Linux Beowulf Cluster
Known as Perseus [22], the Beowulf cluster was constructed by the Department of High
Performance Computing at the University of Adelaide as a tool for computational chemistry. It is
a 232-processor machine consisting of 116 dual-processor PCs connected by 100 Mbit/s Fast Ethernet. One hundred nodes each contain two 500MHz Pentium IIIs, eight each contain two 400MHz Pentium IIs, and eight each contain two 450MHz Pentium IIs.
5. Conclusion
Java has gained rapid acceptance amongst the programming community, and the Java Grande Forum shows that there is considerable interest in pursuing the use of Java even in high performance computing applications. Java’s thread model offers a simple mechanism to leverage medium
grained parallelism but only on shared memory machines. To extend the elegance of this model
into the arena of distributed memory it is necessary to provide an implementation of Java DSM.
This study, which will be the first systematic and objective comparison of such systems, should
shed light on whether they offer the potential for scalability and high performance and, if so, for which types of applications. It is hoped that on completion we will be able to offer some sound
recommendations for future Java DSM directions.
References
[1]. Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 2nd Ed., Morgan Kaufmann, 1996
[2]. Tanenbaum, A., Distributed Operating Systems, Prentice Hall, 1995
[3]. Culler, Singh & Gupta, Parallel Computer Architecture, Morgan Kaufmann, 1999
[4]. Mosberger, D., Memory Consistency Models, Dept. of Computer Science, University of Arizona, 1993
[5]. Li, K. & Hudak, P., Memory Coherence in Shared Virtual Memory, ACM Trans. on Computer Systems, vol. 7, 1989
[6]. Bennett, Carter & Zwaenepoel, Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence, Proc. 2nd ACM Symposium on Principles and Practice of Parallel Programming
[7]. Amza, Cox, Dwarkadas, Keleher, Lu, Rajamony, Yu & Zwaenepoel, Treadmarks: Shared Memory Computing on Networks of Workstations, Dept. of Computer Science, Rice University
[8]. Bershad & Zekauskas, Midway: Shared Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors, CMU Report CMU-CS-91-170, 1991
[9]. Nitzberg, B. & Lo, V., Distributed Shared Memory: A Survey of Issues and Algorithms, IEEE Computer, 1991
[10]. The Java Grande Forum, http://www.javagrande.com
[11]. Venners, B., Inside the Java 2 Virtual Machine, 2nd Edition, McGraw Hill, 2000
[12]. Philippsen, M. & Zenger, M., JavaParty – Transparent Remote Objects in Java, University of Karlsruhe, Germany, 1997
[13]. Caromel, D., Klauser, W. & Vayssière, J., Towards Seamless Computing and Metacomputing in Java
[14]. Freeman, Hupfer and Arnold, JavaSpaces: Principles, Patterns and Practice, Sun Microsystems
1999
[15]. Yu W., & Cox A., Java/DSM: A Platform for Heterogeneous Computing, Concurrency: Practice and
Experience, November 1997
[16]. Antoniu G., Bougé L., Hatcher P., Macbeth M., McGuigan K., & Namyst R., Compiling
Multithreaded Java Bytecode for Distributed Execution, Proc. Euro-Par 2000, 2000
[17]. Aridor Y., Factor M. and Teperman A., Implementing Java on Clusters, IBM Research Laboratory,
Haifa, Israel, http://www.haifa.research.ibm.com
[18]. Ma M., Wang C., Lau F. & Xu Z., JESSICA: Java-Enabled Single-System-Image Computing Architecture, University of Hong Kong
[19]. Chen X. & Allan V.H., MultiJav: A Distributed Shared Memory System Based on Multiple Java
Virtual Machines, Utah State University, www.cs.usu.edu
[20]. Dowd K. & Severance C., High Performance Computing, 2nd Ed., O’Reilly 1998
[21]. Users Guide to Orion, www.physics.adelaide.edu.au/ncflgt/userguide/index.html
[22]. Perseus – a Beowulf for Computational Chemistry,
www.dhpc.adelaide.edu.au/projects/beowulf/perseus.html
[23]. Sinha, P.K., Distributed Operating Systems: Concepts and Design, IEEE Press 1996
[24]. Gelernter, D., Generative Communication in Linda, ACM Trans. on Programming Languages and Systems, 1985
[25]. Gharachorloo, Lenoski, Laudon, Gibbons, Gupta & Hennessy, Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors, Proc. 17th Ann. Int’l Symp. on Computer Architecture, 1990
[26]. Bal, H., Kaashoek, M. & Tanenbaum, A.S., Experience with Distributed Programming in Orca, Proc. Int’l Conf. on Computer Languages, IEEE, 1990
[27]. Mathew J.A., Coddington P.D. & Hawick K.A., Analysis and Development of Java Grande Benchmarks, Technical Report DHPC-063, Department of Computer Science, University of Adelaide, 1999
[28]. SciMark, a Java benchmark for scientific and numerical computing, http://math.nist.gov/scimark2/
[29]. The Java Grande Forum Benchmark Suite, http://www.epcc.ed.ac.uk/javagrande/
[30]. NAS Parallel Benchmarks, www.nas.nasa.gov
[31]. Dongarra J., Performance of Various Computers Using Standard Linear Equations Software,
http://www.netlib.org/benchmark/performance.ps
[32]. ParkBench (Parallel Kernels and Benchmarks), http://www.netlib.org/parkbench/
[33]. Jini Network Technology, an Executive Overview. http://www.javasoft.com
[34]. Gropp W., Lusk E. & Skjellum A., Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1994