University of Adelaide Master of Computer Science Program

Project Proposal: A Performance Analysis of Java Distributed Shared Memory Implementations

Kevin Fenwick
E-mail: [email protected]
Web: www.cs.adelaide.edu.au/~kfenwick/thesis
March 2001
Supervisors: Dr Paul Coddington and Dr Francis Vaughan

Abstract

The Java thread mechanism allows the exploitation of parallelism within the confines of shared memory multiprocessors by allowing multiple threads to be mapped onto distinct physical processors. Distributed memory machines, however, which are generally cheaper and therefore much more common, have been unable to harness the simplicity and elegance of this approach, their programming being reliant on more complex message-passing techniques. Java Distributed Shared Memory (DSM) implementations seek to redress this situation by allowing the thread model to be extended into the realm of distributed memory multicomputers. While a number of Java DSM systems have been implemented, no direct performance comparison of them has yet been conducted. We will perform such a comparison by implementing a parallel Java benchmark suite. Tests will be conducted on a Sun Technical Compute Farm and a Linux Beowulf cluster, both of them large distributed machines with over 150 processors each. This study will determine which application types offer scalability and good performance with which Java DSM approaches, and will enable us to make recommendations on how future implementations may be improved.

1. Introduction

Recently there has been increased interest in the use of Java for high performance computing. Indeed the Java Grande Forum, which has been expressly set up for the purpose of promoting and developing the use of Java in large-scale numerically intensive, or so-called 'Grande', applications, is of the opinion that 'Java has potential to be a better environment for Grande application development than languages such as Fortran and C++' [10].
A significant feature of Java is its thread model, which allows convenient and portable parallel programming to be achieved by mapping threads onto physically distinct processors. While this mechanism is a simple and elegant means of achieving medium grained parallelism, its scope is limited to shared memory machines. At the same time, the trend in high performance computing is towards distributed memory machines based on commodity components, namely clusters of workstations. Java Distributed Shared Memory (DSM) is a means of extending the thread model into the distributed memory arena by implementing a shared-memory abstraction over the underlying distributed hardware.

While Java DSM systems have been or are being developed by a number of groups, no direct comparison of them has yet been conducted. We will be performing the first such comparison of Java DSM systems for performance and scalability by implementing a parallel Java benchmark suite. We have access to two large distributed memory machines on which to run the tests: a Sun Technical Compute Farm, which is typical of a high-end commercial machine, and a Linux Beowulf cluster, which is built entirely from commodity components. The study, which will be of considerable interest to the Java Grande community, will enable us to assess how well different types of parallel algorithms are matched to particular Java DSM implementations and perhaps provide insight into how future Java DSM systems may be improved.

In this research proposal we shall provide a brief overview of DSM, discuss the particular case of Java DSM, and then go on to discuss the benchmarking methodology that we will adopt in assessing the various Java DSM systems.

2. Distributed Shared Memory

2.1 Introduction

Modern parallel computing architectures generally fall into one of two classes: shared memory machines and distributed memory machines [1,3].
In the former, all processors share a single addressable memory space; the most common example of such machines is the bus-based symmetric multiprocessor (SMP). In the latter, each processor, or a small set of processors, is encapsulated within a semi-autonomous node, which includes an amount of local memory. Since no node may directly address the memory of another, such machines are often termed NORMA (NO Remote Memory Access) machines.

Shared memory machines are difficult to build, expensive and generally do not scale beyond 100 processors. Distributed memory machines, on the other hand, are much cheaper and simpler to build and may scale to thousands of processors. The current trend is to build such machines out of off-the-shelf components, such as commercial workstations/PCs, and commodity interconnect, such as Fast Ethernet. These distributed memory machines are known as clusters of workstations (COWs) or, where each node is itself an SMP, CLUMPs (Clusters of SMPs).

However, while shared memory machines are more difficult to build than their distributed memory counterparts, they are much easier to program. Processes can address the entire memory space, and communication and synchronization are achieved with simple load-store primitives and shared variables. In distributed memory machines, on the other hand, communication and synchronization must be achieved by explicit message passing, using send-receive primitives that are normally implemented as C or Fortran libraries such as MPI (Message Passing Interface) [34].

A quandary therefore exists: shared memory machines are difficult to build but relatively easy to program, whereas distributed memory machines are easy to build but relatively difficult to program. DSM [5,22] is seen by its proponents as a way out of this dilemma, since it aims to provide the relative ease of shared memory programming on distributed memory systems.
This is achieved, as shown in figure 2.1, by abstracting over the distributed nature of the hardware to present an application with the illusion of a single memory space that is shared between the nodes. Hence, the shared memory programming model may be applied in a distributed memory setting.

Figure 2.1 Generalized Distributed Shared Memory Scenario: an application runs on top of a distributed shared memory software layer, which spans a set of nodes, each comprising a CPU, cache and local memory.

The DSM abstraction is typically achieved as middleware or a run-time system that transparently maps references to the virtual shared memory space onto the underlying hardware using message passing. The performance of such systems will obviously depend heavily on the amount of underlying network communication that is necessary to maintain this illusion.

2.2 DSM Implementations

DSM implementations may be differentiated along two lines: the structure of the shared memory the system presents to applications, and the memory consistency model the system provides. At one extreme the shared memory provided by the system may be completely unstructured, by which we mean that it is simply a linear array of words; at the other, the shared memory may be highly structured as a collection of distributed objects that are visible to all processes.

Moreover, all DSM systems involve some replication of data between nodes in order to improve efficiency. When data is replicated in this way, it must be kept consistent. The memory consistency model [4] that the system follows defines the degree to which this consistency must be maintained, and therefore how the memory system can be expected to behave when presented with a request. A weaker consistency model requires less communication, and hence improves performance, but demands more disciplined synchronization from the programmer.
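The trade-off between consistency and synchronization discipline can be seen in miniature within Java itself: absent synchronization, the Java memory model (like a relaxed DSM protocol) gives no guarantee that one thread observes another's writes, while lock release and acquire make them visible. The following is a minimal illustrative sketch; the class name and fields are hypothetical, not drawn from any of the systems surveyed here.

```java
// Sketch: why consistency models matter. Without synchronization, the
// Java memory model (like a weak DSM consistency protocol) does not
// guarantee a reader ever observes a writer's updates. Using a common
// lock makes writes visible on release/acquire -- the same intuition
// that underlies release consistency.
public class ConsistencyDemo {
    private int data = 0;
    private boolean ready = false;
    private final Object lock = new Object();

    // Writer: updates become visible to other threads when the lock
    // is released.
    public void produce(int value) {
        synchronized (lock) {
            data = value;
            ready = true;
        }
    }

    // Reader: acquiring the same lock guarantees that all writes made
    // before the matching release are seen; returns -1 if no value yet.
    public int consume() {
        synchronized (lock) {
            return ready ? data : -1;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConsistencyDemo d = new ConsistencyDemo();
        Thread writer = new Thread(() -> d.produce(42));
        writer.start();
        writer.join();
        System.out.println(d.consume()); // prints 42
    }
}
```

Under sequential consistency every write would have to be propagated immediately; here, as under release consistency, propagation can be deferred to the synchronization points.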
We shall now briefly review some representative DSM implementations.

2.2.1 IVY

IVY, developed in the mid-1980s [5], was the prototypical DSM system. It was of the unstructured variety, the shared memory space being carved up into fixed-size pages that could migrate from machine to machine on demand. Essentially the system works as an extension of standard virtual memory: when a processor tries to access a location in a page that is not resident in its local physical memory, a page fault occurs pending recovery of the necessary page. With IVY, however, that page may be fetched not only from disk but also from another machine.

IVY was aimed squarely at solving the so-called 'dusty deck' problem, i.e. the wish to run existing multiprocessor programs on distributed memory machines. These programs assume a sequentially consistent model of memory, which is the strongest feasible form for parallel machines. This means that there is a total order on all memory accesses, i.e. all processors see the same order of writes. Implementing a consistency protocol that assures this is extremely costly in terms of communication and therefore performance. This performance problem is compounded by the phenomenon of 'false sharing', which occurs when non-shared variables, being used by different processors, are positioned on the same page. The page will have to be kept consistent, even though the variables on it need not be, causing a so-called 'ping-pong' effect as the page is swapped from one node to another.

2.2.2 Munin

The Munin system [6] improved on the performance of IVY by using an implementation of release consistency instead of sequential consistency. Release consistency [25] relies on the observation that parallel programs need to synchronize access to shared memory to avoid race conditions, and that such synchronization is typically achieved with critical sections.
Since only one process may be in a critical section at a time, writes need not be made available to other processes until the critical section is released, thus reducing the amount of communication necessary to maintain this sort of consistency.

2.2.3 Treadmarks

The Treadmarks system [7] improves on this again by using a lazy implementation of release consistency, which reduces communication even further by only making writes available to a process that is about to acquire a critical section, rather than to all processes on the release of a critical section. Furthermore, by using a multiple-writers protocol, Treadmarks allows multiple processes to write to copies of the same page, thus reducing the false-sharing problem.

2.2.4 Midway

Midway [8] allows the use of entry consistency, which requires that every individual shared variable be associated with a lock. This means that multiple critical sections may be in use at the same time, increasing the amount of available parallelism.

2.2.5 Linda

Linda [24] provides a highly structured view of shared memory known as a tuple-space, into which processes may insert or extract tuples: ordered structures consisting of typed fields. Processes may then communicate by one inserting a tuple into the tuple-space and another extracting a tuple that matches one it is looking for.

2.2.6 Orca

The Orca system [26] is an object-based DSM comprising the Orca language, compiler and run-time system, which operates on the Amoeba distributed operating system. In object-based DSM systems, the shared memory space takes the form of an object store which, while in reality distributed across the system, appears to individual nodes to be a single object store. Any process can invoke any object's methods regardless of where the objects are located.
It is the function of the object-based DSM system to make the distribution of objects transparent to individual processes.

3. Java Distributed Shared Memory

3.1 Introduction

The Java programming language has become so pervasive that it is now being actively promoted as a tool for high performance scientific computing. The Java community has termed such applications Grande applications, and the Java Grande Forum [10] acts as a conduit for recommendations to Sun Microsystems, the developer of Java, that will enable Java support for Grande applications to be improved.

Undoubtedly the initial interest in Java can be ascribed to its portability model, which is based on the fact that Java programs are not generally compiled to native machine code but to an intermediate byte-code. This byte-code is then interpreted by a software machine known as the Java Virtual Machine (JVM), which has itself been ported to numerous hardware platforms. Another interesting aspect of Java, though, is that it supports multithreading for shared memory machines as part of the language. The JVM transparently deals with the assignment of threads to processors, providing a simple and convenient method for exploiting parallelism that, when coupled with the virtual machine paradigm, offers a path to truly portable parallel programming. However, Grande applications are typically geared towards distributed memory machines rather than shared memory ones. Once again, DSM offers an escape route by allowing the threading model to be extended into the domain of distributed memory.

3.2 Java Distributed Shared Memory Implementations

Java DSM implementations all provide a highly structured view of memory, a parallel Java application being made up of interacting Java objects operating within some distributed shared memory space.
We may characterize three approaches to implementing this illusion on a distributed memory machine: those that operate above the JVM level, those that operate below the JVM level, and those that operate at the JVM level. We shall call these, respectively, super-JVM level DSM, sub-JVM level DSM and JVM level DSM.

3.2.1 Super-JVM level DSM

In this approach, each cluster node runs its own JVM, and objects placed on each node use some communication mechanism, such as Java's own Remote Method Invocation (RMI), to pass messages to one another. Systems such as JavaParty [12] and ProActive [13] take this approach, remote objects and threads being handled at the level of the Java language. In JavaParty [12], for example, multithreaded programs are written in a special form of Java that includes certain compiler directives indicating that objects may be shared. This is done by prefixing class declarations with the remote keyword. An initial precompilation phase translates such classes into standard Java bytecode plus 'RMI hooks' that enable remote objects to communicate with one another across the cluster. Each JVM must register with a central runtime manager that coordinates their activities.

JavaSpaces [14] is an implementation of the Linda model developed by Sun as a Jini [33] service. In JavaSpaces, tuple-spaces are referred to simply as spaces and are represented as Java objects. Tuples themselves are classes that implement the Entry interface.

3.2.2 Sub-JVM level DSM

Here the JVM itself is built above some lower-level DSM infrastructure, such as those discussed in section 2, which presents the JVM with a picture of shared memory. Systems such as Java/DSM [15], Hyperion [16] and JESSICA [18] take this approach. Java/DSM, for example, presents a modified JVM where the Java heap [11] is implemented in Treadmarks [7], a pre-existing DSM system.

3.2.3 JVM-level DSM

Here the JVM implementation itself is distributed across the nodes of the distributed memory machine.
One system that takes this approach is cJVM [17], which provides a parallel Java application with a single system image of a traditional JVM. This JVM is made up of a number of virtual machine components, one on each node of the cluster. Taken as a whole, these processes constitute the JVM, and the application is totally unaware that it is running on a cluster. In reality, threads and objects must be distributed across the nodes of the cluster. Distributed access is supported by a so-called master-proxy paradigm, where a proxy is a surrogate for a remote master object through which that object can be accessed. MultiJav [19] takes a similar distributed-JVM approach.

4. Benchmarking

4.1 Introduction

The project will centre on the development of a benchmark suite of parallel Java programs. The aim of a benchmark suite is to provide an objective performance measure of a system. In a scientific High Performance Computing (HPC) scenario, we are interested purely in reducing the execution time of applications; hence, the running time of programs will be our metric of comparison. The benchmarks that we perform will enable us to compare the performance and scalability of some of the different types of Java DSM systems with a range of parallel applications. It is hoped that the benchmarks will reveal useful information about the particular systems and allow us to make recommendations for future Java DSM implementations.

4.2 Benchmarking Methodology

A benchmark suite is composed of a number of component programs designed to test various aspects of a system. In choosing these components, a number of considerations need to be taken into account. Firstly, we will need to ensure that the components test different facets of a system.
In the case of the parallel Java algorithms that we will be running on the Java DSM systems, for example, it will be important to select benchmarks that exhibit different patterns of communication: we will want both applications with little inter-node communication and more communication-intensive applications. We should also ensure that the benchmark does not take too long to run, certainly not more than an hour for the whole suite. This is important since we will be running the suite multiple times, on different systems.

Typically, an important consideration in choosing a benchmark is portability; certainly, we want to make sure we are testing the same thing on each system. One might assume that, given Java's inherent portability, this would not present a problem. However, benchmarks for what we have termed super-JVM level systems, i.e. those that operate at the language level, will require some code modifications to operate. We need to ensure that such changes do not alter the algorithm we are testing.

A number of benchmark suites have been implemented for sequential Java which we will wish to examine: for example, SciMark from the US National Institute of Standards and Technology [28] and the Java Grande Forum Benchmark Suite [27,29], the latter being of particular interest in that it concentrates on Grande applications. We will also be examining other HPC benchmarks such as the Numerical Aerodynamic Simulation (NAS) benchmarks [30], Linpack [31] and Parkbench [32].

It is envisaged that the benchmark suite will consist of a cross-section of parallel Java algorithms and applications, such as a Fast Fourier Transform (FFT) kernel, an LU matrix solve benchmark, an n-body application using a Barnes-Hut algorithm, a Laplace solver involving nearest-neighbour communications, and an embarrassingly parallel task farm.
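To make the shape of such a benchmark component concrete, the following is an illustrative sketch of an embarrassingly parallel task farm written with standard Java threads: it estimates pi by midpoint-rule numerical integration, splitting the work across worker threads and using wall-clock running time as the metric. The class and method names are hypothetical and not taken from any of the suites cited above.

```java
// Sketch of one possible benchmark component: an embarrassingly
// parallel task farm. Each worker computes a strided partial sum of
// the midpoint-rule integral of 4/(1+x^2) over [0,1], which equals pi.
public class PiTaskFarm {
    // Partial sum for worker 'id' out of 'threads' workers.
    static double partialSum(int id, int threads, int steps) {
        double h = 1.0 / steps, sum = 0.0;
        for (int i = id; i < steps; i += threads) {  // strided decomposition
            double x = h * (i + 0.5);
            sum += 4.0 / (1.0 + x * x);
        }
        return sum * h;
    }

    // Farm the work out to 'threads' Java threads and combine results.
    public static double run(int threads, int steps) {
        double[] partial = new double[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> partial[id] = partialSum(id, threads, steps));
            workers[t].start();
        }
        double total = 0.0;
        for (int t = 0; t < threads; t++) {
            try {
                workers[t].join();  // join() makes partial[t] safely visible
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            total += partial[t];
        }
        return total;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        double pi = run(4, 1_000_000);
        long elapsed = System.currentTimeMillis() - start;
        System.out.printf("pi ~= %.6f in %d ms%n", pi, elapsed);
    }
}
```

On a shared memory JVM this runs as-is; under a super-JVM level DSM system the same algorithm would need its worker objects marked for distribution (for example, JavaParty's remote keyword), which is exactly the kind of code modification whose effect on the measured algorithm must be controlled for.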
4.3 Hardware

We have access to two large distributed memory machines, with distinct characteristics, which will form an ideal testing ground for our benchmarks: a commercial system from Sun and a home-brew Beowulf cluster.

4.3.1 Sun Technical Compute Farm

The Sun Technical Compute Farm, known locally as Orion [21], is a cluster of quad-processor E420R workstations connected by a high-speed Myrinet interconnect. Each E420R node is based around four UltraSPARC II 450MHz processors, each with 4MB of level 2 cache, together with 4GB of memory and 18GB of hard disk. In total, the machine has 160 processors, 640MB of cache memory, 160GB of RAM and 720GB of disk. The computer, which is used primarily for computational physics problems, is a good example of a high-end commercially produced distributed machine; at its installation in June 2000 it was the fastest computer in Australia, with a peak speed of 144 GFLOPS.

4.3.2 Linux Beowulf Cluster

Known as Perseus [22], the Beowulf cluster was constructed by the Department of High Performance Computing at the University of Adelaide as a tool for computational chemistry. It is a 232-processor machine consisting of 116 dual-processor PCs connected by 100Mbit/s Fast Ethernet. 100 nodes each contain two 500MHz Pentium IIIs, eight each contain two 400MHz Pentium IIs, and eight each contain two 450MHz Pentium IIs.

5. Conclusion

Java has gained rapid acceptance amongst the programming community, and the Java Grande Forum shows that there is considerable interest even in pursuing the use of Java in high performance computing applications. Java's thread model offers a simple mechanism to leverage medium grained parallelism, but only on shared memory machines. To extend the elegance of this model into the arena of distributed memory, it is necessary to provide an implementation of Java DSM.
This study, which will be the first systematic and objective comparison of such systems, should shed light on whether they offer the potential for scalability and high performance and, if so, for which types of applications. It is hoped that on completion we will be able to offer some sound recommendations for future Java DSM directions.

References

[1]. Hennessy, J. & Patterson, D., Computer Architecture: A Quantitative Approach, 2nd Ed., Morgan Kaufmann, 1996
[2]. Tanenbaum, A., Distributed Operating Systems, Prentice Hall, 1995
[3]. Culler, D., Singh, J.P. & Gupta, A., Parallel Computer Architecture, Morgan Kaufmann, 1999
[4]. Mosberger, D., Memory Consistency Models, Dept. of Computer Science, University of Arizona, 1993
[5]. Li, K. & Hudak, P., Memory Coherence in Shared Virtual Memory, ACM Trans. on Computer Systems, vol. 7, 1989
[6]. Bennett, J., Carter, J. & Zwaenepoel, W., Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence, Proc. 2nd ACM Symposium on Principles and Practice of Parallel Programming
[7]. Amza, Cox, Dwarkadas, Keleher, Lu, Rajamony, Yu & Zwaenepoel, Treadmarks: Shared Memory Computing on Networks of Workstations, Dept. of Computer Science, Rice University
[8]. Bershad, B. & Zekauskas, M., Midway: Shared Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors, CMU Report CMU-CS-91-170, 1991
[9]. Nitzberg, B. & Lo, V., Distributed Shared Memory: A Survey of Issues and Algorithms, IEEE Computer, 1991
[10]. The Java Grande Forum, http://www.javagrande.com
[11]. Venners, B., Inside the Java 2 Virtual Machine, 2nd Edition, McGraw Hill, 2000
[12]. Philippsen, M. & Zenger, M., JavaParty – Transparent Remote Objects in Java, University of Karlsruhe, Germany, 1997
[13]. Caromel, D., Klauser, W. & Vayssière, J., Towards Seamless Computing and Metacomputing in Java
[14]. Freeman, E., Hupfer, S. & Arnold, K., JavaSpaces: Principles, Patterns and Practice, Sun Microsystems, 1999
[15].
Yu, W. & Cox, A., Java/DSM: A Platform for Heterogeneous Computing, Concurrency: Practice and Experience, November 1997
[16]. Antoniu, G., Bougé, L., Hatcher, P., Macbeth, M., McGuigan, K. & Namyst, R., Compiling Multithreaded Java Bytecode for Distributed Execution, Proc. Euro-Par 2000, 2000
[17]. Aridor, Y., Factor, M. & Teperman, A., Implementing Java on Clusters, IBM Research Laboratory, Haifa, Israel, http://www.haifa.research.ibm.com
[18]. Ma, M., Wang, C., Lau, F. & Xu, Z., JESSICA: Java-Enabled Single-System-Image Computing Architecture, University of Hong Kong
[19]. Chen, X. & Allan, V.H., MultiJav: A Distributed Shared Memory System Based on Multiple Java Virtual Machines, Utah State University, www.cs.usu.edu
[20]. Dowd, K. & Severance, C., High Performance Computing, 2nd Ed., O'Reilly, 1998
[21]. Users Guide to Orion, www.physics.adelaide.edu.au/ncflgt/userguide/index.html
[22]. Perseus – a Beowulf for Computational Chemistry, www.dhpc.adelaide.edu.au/projects/beowulf/perseus.html
[23]. Sinha, P.K., Distributed Operating Systems: Concepts and Design, IEEE Press, 1996
[24]. Gelernter, D., Generative Communication in Linda, ACM Trans. on Programming Languages and Systems, 1985
[25]. Gharachorloo, Lenoski, Laudon, Gibbons, Gupta & Hennessy, Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors, Proc. 17th Ann. Int'l Symp. on Computer Architecture, 1990
[26]. Bal, H., Kaashoek, M. & Tanenbaum, A.S., Experience with Distributed Programming in Orca, Proc. Int'l Conf. on Computer Languages, IEEE, 1990
[27]. Mathew, J.A., Coddington, P.D. & Hawick, K.A., Analysis and Development of Java Grande Benchmarks, Technical Report DHPC-063, Department of Computer Science, University of Adelaide, 1999
[28]. SciMark, a Java benchmark for scientific and numerical computing, http://math.nist.gov/scimark2/
[29]. The Java Grande Forum Benchmark Suite, http://www.epcc.ed.ac.uk/javagrande/
[30].
NAS Parallel Benchmarks, www.nas.nasa.gov
[31]. Dongarra, J., Performance of Various Computers Using Standard Linear Equations Software, http://www.netlib.org/benchmark/performance.ps
[32]. ParkBench (Parallel Kernels and Benchmarks), http://www.netlib.org/parkbench/
[33]. Jini Network Technology, an Executive Overview, http://www.javasoft.com
[34]. Gropp, W., Lusk, E. & Skjellum, A., Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1994