A Flexible and Efficient Message Passing Platform for Java
Eine flexible und effiziente Message Passing Plattform für Java

A Thesis presented for the Degree of Diplom-Wirtschaftsinformatiker

Written and implemented at the Faculty of Sciences, Department of Computer Science, Vrije Universiteit Amsterdam, The Netherlands.
Submitted to Fachbereich 5 Wirtschaftswissenschaften, Universität Siegen, Germany.

Author: Markus Bornemann
Student register: 542144
Address: Amselweg 7, 57392 Schmallenberg, Germany
Supervisor: Dr.-Ing. habil. Thilo Kielmann
Second reader: Prof. Dr. Roland Wismüller

Amsterdam, September 2005

Zusammenfassung

Contents

Zusammenfassung
1 Introduction
2 The Message Passing Interface
  2.1 Parallel Architectures
  2.2 MPI Concepts
  2.3 Point-to-Point Communication
    2.3.1 Blocking Communication
    2.3.2 Non-Blocking Communication
  2.4 Collective Communication
  2.5 Groups, Contexts and Communicators
  2.6 Virtual Topologies
3 The Grid Programming Environment Ibis
  3.1 Parallel Programming in Java
    3.1.1 Threads and Synchronization
    3.1.2 Remote Method Invocation
  3.2 Ibis Design
  3.3 Ibis Implementations
4 Design and Implementation of MPJ/Ibis
  4.1 Common Design Space and Decisions
  4.2 MPJ Specification
  4.3 MPJ on Top of Ibis
    4.3.1 Point-to-point Communication
    4.3.2 Groups and Communicators
  4.4 Collective Communication Algorithms
  4.5 Open Issues
5 Evaluation
  5.1 Evaluation Settings
  5.2 Micro Benchmarks
  5.3 Java Grande Benchmark Suite
    5.3.1 Section 1: Low-Level Benchmarks
    5.3.2 Section 2: Kernels
    5.3.3 Section 3: Applications
  5.4 Discussion
6 Related Work
7 Conclusion and Outlook
  7.1 Conclusion
  7.2 Outlook
Bibliography
Eidesstattliche Erklärung
List of Figures

2.1 Blocking Communication Demonstration
2.2 Non-Blocking Communication Demonstration
2.3 Broadcast Illustration
2.4 Reduce Illustration
2.5 Scatter/Gather Illustration
2.6 Allgather Illustration
2.7 Alltoall Illustration
2.8 Graph Topologies
2.9 Cartesian Topologies
3.1 Java thread synchronization example
3.2 RMI invocation example (after [14, 12])
3.3 Ibis design (redrawn from the Ibis 1.1 release)
3.4 Send and Receive Ports (after [25, 153])
4.1 Principal Classes of MPJ (after [4])
4.2 MPJ/Ibis design
4.3 MPJ/Ibis send protocol
4.4 MPJ/Ibis receive protocol
4.5 Scatter sending scheme
4.6 Flat tree view
4.7 Broadcast sending scheme
4.8 Binomial tree view
4.9 Ring sending scheme (only 1 step)
4.10 Recursive doubling illustration
5.1 Double array throughput in Ibis
5.2 Double array throughput in MPJ/Ibis and mpiJava
5.3 Object array throughput in Ibis
5.4 Object array throughput in MPJ/Ibis and mpiJava
5.5 Pingpong benchmark: arrays of doubles
5.6 Pingpong benchmark: arrays of objects
5.7 Barrier benchmark
5.8 Broadcast benchmark
5.9 Reduce benchmark
5.10 Scatter benchmark
5.11 Gather benchmark
5.12 Alltoall benchmark
5.13 Crypt speedups
5.14 LU Factorization speedups
5.15 Series speedups
5.16 Sparse matrix multiplication speedups
5.17 Successive over-relaxation speedups
5.18 Molecular dynamics speedups
5.19 Monte Carlo speedups
5.20 Raytracer speedups

List of Tables

2.1 Flynn's Taxonomy (after [24, 640])
2.2 MPI datatypes and corresponding C types
2.3 Predefined Reduce Operations
2.4 Group Construction Set Operations
4.1 MPJ send prototype (after [4])
4.2 MPJ basic datatypes (after [4])
4.3 Algorithms used in MPJ/Ibis to implement the collective operations
5.1 Latency benchmark results
Chapter 1  Introduction

Problems which are too complex to be solved by theory or too cost-intensive for practical approaches can only be handled using simulation models. Some of these problems, for example genetic sequence analysis or global climate models, are getting too large to be solved on a single-processor machine in reasonable time, or simply exceed the physical limitations provided. Usually, parallel computers providing a homogeneous networking infrastructure had been used to address these problems, but in times of low budgets more and more existing local area networks integrate multiple workstations or PCs, making them suitable for parallel computing. In contrast to computer clusters, these machines typically consist of different hardware architectures and operating systems.

The Message Passing Interface (MPI) developed by the MPI Forum has been widely accepted as a standard for parallel computation on high-performance computer clusters. Since MPI is widely used and a lot of experience has been built up, there is a growing interest in making the MPI concepts applicable to heterogeneous systems. Unfortunately, existing MPI implementations, written mostly in C and Fortran, are limited to the hardware architecture they are implemented for. In the past, portability was of minor importance. Therefore, MPI applications need to be adapted and recompiled when changing the target platform.

During the last years the Java programming language has become very important for application developers when heterogeneous, networked environments need to be addressed. Java allows the development of applications that can run on a variety of different computer architectures and operating systems without recompilation. Additionally, Java has been designed to be object-oriented, dynamic and multi-threaded. Nowadays, the performance of hardware and connectivity has increased rapidly, allowing the use of heterogeneous infrastructures, for instance those of small and medium-sized companies or the Internet, for parallel computation. Therefore, parallel applications need to match the requirements of being portable and flexible, allowing them to be executed on different architectures simultaneously without the high costs of porting software explicitly. Naturally, the ability of being flexible should not come with significant performance drawbacks.

In this thesis, a message passing platform called MPJ/Ibis is presented that combines both efficiency and flexibility. Chapter 2 gives an overview of parallel architectures in general and introduces the basic concepts of MPI. Then, Chapter 3 points out Java's abilities and limitations for parallel computing and describes the grid programming environment Ibis, which addresses Java's drawbacks. The main design and implementation details of MPJ/Ibis are described in Chapter 4, followed by a presentation of various benchmark results in Chapter 5. Finally, in Chapter 6, previous projects concerning message passing for Java are introduced with a view to flexibility and efficiency.

Chapter 2  The Message Passing Interface

The Message Passing Interface (MPI) is a standardized message passing model, not a specific implementation or product. It was designed to provide access to advanced parallel hardware for end users, library writers and tool developers. MPI was standardized by the MPI Forum in 1995 (MPI-1.1) and further developed into MPI-2, which includes MPI-1.2, in the following years.
The existence of a standardized interface makes it convenient for software developers to implement portable, parallel programs with the guaranteed functionality of a set of basic primitives. It defines the general requirements for a message passing implementation and presents detailed bindings to the C and Fortran languages as well. This introduction to MPI refers to MPI-1.1 [15] and does not claim to be a complete reference or tutorial; rather, it explains the basic principles needed for the implementation of a message passing platform for Java. Therefore the author avoids, where possible, the presentation of MPI function bindings.

2.1 Parallel Architectures

A parallel machine consists of a number of processors, which collectively solve a given problem. In contrast to single-processor machines, e.g. workstations, parallel machines are able to execute multiple instructions simultaneously. This leads to the main motivation of parallel computing: to solve problems faster.

To describe parallel computer architectures in general, Flynn's taxonomy [24, 640] characterizes four different classes of architectures. It uses the concept of streams, which are sequences of items processed by a CPU. A stream consists either of instructions or of data, which will be manipulated by the instructions. The classes are:

SISD  Single instruction, single data streams
MISD  Multiple instructions, single data streams
SIMD  Single instruction, multiple data streams
MIMD  Multiple instructions, multiple data streams

Table 2.1: Flynn's Taxonomy (after [24, 640])

The first item, SISD, describes a sequential architecture, in which a single processor operates on a single instruction stream and stores data in a single memory, e.g. a von Neumann architecture. SISD does not address any parallelization in the mentioned streams. MISD is more or less a theoretical architecture, where multiple instructions operate on single data streams simultaneously. In fact, multiple instruction streams need multiple data streams to be effective; therefore no commercial machine exists with this design. Computers working on the SIMD model manipulate multiple data streams with the same set of instructions. Usually this is done in parallel, e.g. in an array processor system. In this model all processors are synchronized by a global clock to make sure that every processor performs the same instruction in lockstep. MIMD concerns fully autonomous processors, which perform different instructions on different data. This case implies that the computation has to be done asynchronously.

Furthermore, parallel systems, i.e. SIMDs and MIMDs, differ in the way the processors are connected and thus how they communicate with each other. On the one hand, all processors may be assigned to one global memory, called shared memory. On the other hand, every processor may address its own local memory, called distributed memory.

All the paradigms above refer to hardware. In addition, there is a software equivalent to SIMD, called SPMD (single program, multiple data). In contrast to SIMD, SPMD works asynchronously: the same program runs on the processors of a MIMD system. The message passing concept matches the SPMD paradigm, where communication takes place by exchanging messages. In the message passing world, programs consist of separate processes. Each process addresses its own memory space, which is managed by the user, as is the data distribution among the processes.
2.2 MPI Concepts

MPI is based on a static process model; that means all processes within an existing MPI runtime environment enter and exit this environment simultaneously. Inside the running environment all processes are organized in groups. The actual communication occurs by accessing communicators, group-defined communication channels. Furthermore, the processes inside a group are numbered from 0 to n-1, where n is the total number of processes within the group. Those numbers are called ranks. For all members of the group, the rank number of a certain member is the same. That allows a global view of the group members.

Message Data

Assuming a process is about to send data, it has to specify what data is going to be sent. The location of this data is called the send buffer. On the other side, when data has to be received, this location is called the receive buffer. Messages may consist of contiguous data, e.g. an array of integer values. In order to avoid memory-to-memory copying, e.g. if just a smaller part of an array inside the send buffer should be transferred, the number of elements has to be specified for a message, so that the needed elements may be used directly.

Since MPI could be implemented as a library, which may be used precompiled, it cannot be assumed that a communication call has information about the datatype of variables in the communication buffer. Hence, MPI defines its own datatypes, which are attached to the message. A reduced list for the C language binding is presented in Table 2.2. A clearer reason to attach datatypes explicitly to messages is shown by the case where non-contiguous data will be submitted, e.g. a column of a two-dimensional array. For that, the user has to construct a type map, which specifies the needed elements, combined with an MPI datatype. This results in a derived datatype based on MPI datatypes.

MPI Datatypes       C Datatypes
MPI_CHAR            signed char
MPI_SHORT           signed short int
MPI_INT             signed int
MPI_LONG            signed long int
MPI_UNSIGNED_CHAR   unsigned char
...                 ...

Table 2.2: MPI datatypes and corresponding C types

Message Envelopes

To realize communication in general, MPI uses so-called message envelopes. Envelopes describe the necessary details of a message submission. These are:

• Sender
• Receiver
• Tag
• Communicator

The sender and receiver items hold the ranks of the communication partners, whereas the used group is identified by the communicator. The tag is a freely interpretable positive integer value. It may be used to distinguish different message types inside the communicator.

2.3 Point-to-Point Communication

Communication between exactly two processes is called point-to-point communication in MPI. In this way of communication both the sender and receiver are explicitly identified by the underlying communicator and therefore by their specific ranks inside the process group. MPI defines two possibilities to achieve point-to-point communication, blocking and non-blocking, which will be explained in Sections 2.3.1 and 2.3.2.

Besides that, there are three different communication modes. First, in ready send mode the message will be sent as soon as the receiver side has called the matching receive function. The matching receive has to be called before a sender process is able to send a message, otherwise this mode results in an error. Second, in synchronous send mode, the sending process first requests a matching receive function on the receiver side, and then, once a matching receive has been posted, sends the message. Third, it is possible to buffer the message before it is sent.
This mode is called buffered send; the message is first copied into a user-defined system buffer. The actual communication may then be decoupled from the sending process and moved to the runtime environment. On the receiver side these communication modes appear fully transparent. The receiving process doesn't have any influence on the communication mode used.

2.3.1 Blocking Communication

Blocking communication function calls return when the operation assigned to the function has finished. The call blocks the caller until the involved buffer may be reused.

[Figure 2.1: Blocking Communication Demonstration]

Figure 2.1 demonstrates a blocking send and the associated receive process. The sending process calls send and returns when the message has left the buffer completely. A finished send call does not imply that the message has arrived at its destination entirely, whereas the receiver blocks until the message has arrived completely.

2.3.2 Non-Blocking Communication

In contrast to blocking communication, the operations of non-blocking point-to-point communication return immediately. That allows a process to do further computation during message transfer. On the other hand, it is not possible to use the involved buffers on the sender and receiver side until the point-to-point transfer has finished. To figure out when a communication is done, MPI provides wait and test functions, allowing a process to check for a message transfer to complete (see Figure 2.2).

[Figure 2.2: Non-Blocking Communication Demonstration]

2.4 Collective Communication

On top of the point-to-point communication, MPI provides functions to communicate within groups of processes, called collective communication. Collective operations are executed by all members of the group. These operations are responsible for synchronization, broadcasting, gathering, scattering and reducing of information between the groups of processes. Some operations need a special process that sends data to, or collects data from, the other processes, which is called the root process. A collective operation always blocks the caller until all processes have called it. In the following, the most important collective operations will be introduced.

Broadcast

Figure 2.3 illustrates a broadcast among six processes. The left table shows the initial state of the buffers of all processes, one row per process. The first process is the root, which sends its input buffer to all processes including itself. In the end all processes hold a copy of the root's input buffer, as shown on the right side of Figure 2.3.

[Figure 2.3: Broadcast Illustration]

Reduce

The reduce operation combines the items of the send buffer of each process using an associative mathematical operation. The result is then sent to the root process. Figure 2.4 demonstrates reduce, where the result appears in R0. MPI defines a set of standard operations for that purpose, which are listed in Table 2.3.
B0 C0 reduce D0 E0 F0 Figure 2.4: Redue Illustration Name Meaning MPI_MAX maximum value MPI_MIN minimum value MPI_SUM sum MPI_PROD produt MPI_LAND logial and MPI_BAND bit-wise and MPI_LOR logial or MPI_BOR bit-wise or MPI_LXOR logial xor MPI_BXOR bit-wise xor MPI_MAXLOC max value and loation MPI_MINLOC min value and loation Table 2.3: Predened Redue Operations Of ourse there are restritions of whih datatypes are aepted by the dierent operations. For example it does not make sense to alulate a bit-wise or on a set of 2.4. Colletive Communiation 10 oating point values. All predened redue operations are ommutative. In addition to that, MPI allows to reate user dened redue operations, whih may or may not be ommutative. Therefore a MPI implementation should take are about the way the redue will be performed in order to respet non-ommutative behaviour. An extension to redue is the allredue operation where all proesses reeive the result, instead of just root. Satter/Gather In satter mode, the send buer of root will be split into n equal parts, where n is the number of proesses inside the group. Eah proess then reeives a distint part, whih will be determined by the rank number of the reeiving proess. processes A0 A1 A2 A3 A4 processes data data A0 A5 scatter A1 A2 A3 gather A4 A5 Figure 2.5: Satter/Gather Illustration Gather is the inverse operation of satter. All proesses send their send buers to root, where all the inoming messages will be stored in rank order into root's reeive buer. Allgather An extension to gather is allgather. As shown in gure 2.6 instead of data data A0 A0 B0 C0 D0 E0 F0 B0 A0 B0 C0 D0 E0 F0 C0 A0 B0 C0 D0 E0 F0 D0 A0 B0 C0 D0 E0 F0 E0 A0 B0 C0 D0 E0 F0 F0 A0 B0 C0 D0 E0 F0 allgather Figure 2.6: Allgather Illustration processes processes just root, all proesses of the group reeive the gathered data. 2.5. Groups, Contexts and Communiators Alltoall The last extension is alltoall. In to eah other proess. This means that the by proess j in the ith 11 alltoall jth eah proess sends distint data item sent from proess i plae of the reeive buer, where and j i is reeived are rank numbers of group proesses. processes A0 A1 A2 A3 A4 A5 A0 B0 C0 D0 E0 F0 B0 B1 B2 B3 B4 B5 A1 B1 C1 D1 E1 F1 C0 C1 C2 C3 C4 C5 A2 B2 C2 D2 E2 F2 D0 D1 D2 D3 D4 D5 A3 B3 C3 D3 E3 F3 E0 E1 E2 E3 E4 E5 A4 B4 C4 D4 E4 F4 F0 F1 F2 F3 F4 F5 A5 B5 C5 D5 E5 F5 processes data data alltoall Figure 2.7: Alltoall Illustration Other Funtions In addition to the olletive operations mentioned above, MPI provides more funtions. It is beyond the objetive of this thesis to explain all of them at this point. Only two further olletive operations will be desribed in the following. MPI allows a group of proesses to synhronize on eah other. To ahieve synhronization eah proess has to all the olletive funtion alls the barrier, barrier. When a proess it bloks until all the other proesses have alled the barrier as well. A variant to proess i allredue is the operation san. It performs a prex redution, where reeives the redution of the items from the proesses 0, ..., i. 2.5 Groups, Contexts and Communiators Initially MPI reates a group, in whih all involved proesses of the runtime environment are listed. Beneath that, it provides the ability to generate other proess groups, in order to help the programmer to ahieve a better struture of the soure ode. This is important, e.g. for library developers, to fous on ertain proesses in olletive operations to avoid synhronizing or running unrelated ode on uninvolved proesses. 
The rank of a process is always bound to a certain group, which means that a process that is part of several different groups may have different ranks inside those groups. Therefore it is necessary to obtain knowledge about a group before communication can take place. This is done via the communicators. Each group owns at least one communicator, with which the group members are able to deliver messages to each other. Each communicator is assigned to strictly one group, and the relationship between groups and communicators creates the context for the processes.

Groups

Each process belongs to at least one group, namely the initially created group represented by the communicator MPI_COMM_WORLD, which contains all processes. MPI itself does not provide a function for the explicit construction of a group; instead, a new group is produced by using reduction and mathematical set operations on existing groups, which result in a new group including the specified or calculated processes. The set operations are listed in Table 2.4, where it is assumed that the operations are executed on two different groups as arguments, called group1 and group2.

Operation      Meaning
union          all processes of group1, followed by all processes of group2 not in group1
intersection   all processes of group1 that are also in group2, ordered as in group1
difference     all processes of group1 that are not in group2, ordered as in group1

Table 2.4: Group Construction Set Operations

As mentioned above, it is also possible to reduce an existing group, using the so-called functions incl and excl. In both, the user has to specify certain ranks or a range of ranks, which will be included in or excluded from the existing group, respectively.

Intra-Communicators

Communicators always occur in relationship with groups. Since processes send and receive messages only over communicators, and because communicators represent the channels within groups, the set of communicators assigned to a process forms the closure of the system's capability of communication. Each communicator is assigned to exactly one group and each group belongs to at least one communicator. For example, two processes, each part of a different group, cannot exchange messages directly. In order to achieve this, a new group must be created with a new communicator attached. Therefore MPI provides functions to construct communicators from existing groups. Communicators concerning communication within groups are called intra-communicators.

Inter-Communicators

As shown above, creating new groups and communicators is quite inconvenient just for the purpose that processes from different groups want to exchange messages. For this special case MPI specifies inter-communicators. Those communicators are constructed by specifying two intra-communicators, between which the inter-communication takes place.

2.6 Virtual Topologies

The previously introduced intra-communicators address a linear name space, where the processes are numbered from 0 to n-1 (see Section 2.2). In some cases such a numbering does not reflect the logical communication structure. Depending on the underlying algorithm, structures like hypercubes, n-dimensional meshes, rings or common graphs appear. These logical structures are called virtual topologies in MPI, which allow an additional arrangement of the group members. Since the communication operations identify source and destination by ranks, a virtual topology calculates, for instance, the original rank of the neighbour of a certain process or of a process with specified coordinates.

A virtual topology is an optional attribute of intra-communicators and is created by corresponding MPI functions, while inter-communicators are not allowed to use virtual topologies.

Graph Topologies

Each topology may be represented by a graph, in which the nodes represent the processes and the edges represent the connections between them. A missing edge between two nodes doesn't mean that the two processes aren't able to communicate with each other; rather, the virtual topology simply neglects it. The edges are not weighted: either there is a connection between two nodes or there is not.
A virtual topology is an optional attribute to intra-ommuniators and will be reated by orresponding MPI funtions, while inter-ommuniators are not allowed to use virtual topologies. Graph Topologies Eah topology may be represented by a graph, in whih the nodes represent the proesses and the edges represent the onnetions between them. A missing edge between two nodes doesn't mean that the two proesses aren't able to ommuniate to eah other, rather the virtual topology simply neglets it. The edges are not weighted. It is that there is a onnetion between two nodes or not. 2.6. Virtual Topologies 14 In gure 2.8 two examples, a binary tree with seven proesses and a ring with eight proesses, are shown. Figure 2.8: Graph Topologies Cartesian Topologies Cartesian strutures are simpler to speify than ommon graphs, beause of their regularity. Even if a artesian raster may be desribed by a graph as well, MPI provides speial funtions to reate those strutures in order to give more onveniene to the user. Figure 2.9 shows examples of a 3-dimensional mesh with eight proesses and a 2-dimensional mesh with nine proesses. Figure 2.9: Cartesian Topologies Chapter 3 The Grid Programming Environment Ibis With the onept Write one, run everywhere, Java provides a solution to implement highly portable programs. Therefore Java soure ode will not be ompiled to native exeutables, but to a platform independent presentation, alled byteode. At runtime the byteode will be interpreted or ompiled just in time by a Java Virtual Mahine (JVM). This property made Java attrative, espeially for grid omputing, where many heterogeneous platforms are used and where portability beomes an issue with ompiled languages. It has been shown, that Javas exeution speed is ompetitive to other languages like C or Fortran [3℄. In the following, Javas abilities related to parallel programming will be pointed out and then the grid programming environment Ibis and its enhanements will be introdued. 3.1 Parallel Programming in Java In order to ahieve parallel programming in general, Java oers two dierent models out of the box. These are: 1. Multithreading for shared memory arhitetures and 2. Remote Method Invoation (RMI) for distributed memory arhitetures. 15 3.1. Parallel Programming in Java 16 The RMI model enables almost transparent ommuniation between dierent JVMs and was at rst the main point of interest for researh on high-performane omputation in Java. 3.1.1 Threads and Synhronization Conurreny has been integrated diretly into Java's language speiation using threads [13, 221-250℄. A thread is a program fragment, that an run simultane- ously to other threads similar to proesses. While a proess is responsible for the exeution of a whole program, multiple threads ould run inside that proess. The main dierene between threads and proesses is, that threads are sharing the same memory address spae, while the address spaes of proesses are stritly separated. To reate a new thread, an objet of a lass, whih extends java.lang.Thread, has to be instantiated. Sine Java does not support multiple inheritane, there is also an interfae alled java.lang.Runnable, that allows a lass that is already derived from another lass to behave as a thread. This objet will be ommitted to the onstrutor of an extra reated java.lang.Thread objet. In both ases a method alled run() has to be implemented, whih is the entry point when invoking a thread. Invoking will be done by alling the method start(). 
Because threads run in the same address space and thus share the same variables and objects, Java provides the concept of monitors to prevent data from being overwritten by other threads. All objects of the class java.lang.Object, and therefore all objects of classes derived from java.lang.Object, in contrast to primitive types, contain a so-called lock. A lock is assigned to exactly one thread, while other threads have to wait until the lock has been released. With that mutually exclusive lock (short: mutex lock) it is possible to coordinate access to objects and thus avoid conflicts.

Java doesn't allow direct access to locks. Instead, the keyword synchronized exists to label critical regions. Using synchronized, it is possible to protect a whole method or a fragment within a method. When applied to a whole method, the this pointer's lock is used; otherwise an object lock is used by passing the object reference to it.

In addition to monitors, Java supports conditional synchronization via the methods wait() and notify() (or notifyAll()). Both methods may only be called when the calling thread is the owner of the object's lock. A call to wait() causes the thread to wait until another thread invokes notify(). A more specific method is join(), which blocks the current thread until the associated thread has finished entirely.

Figure 3.1 demonstrates the wait and notify model by showing two threads that are synchronized on an object. Thread 1 sets a value on the object and waits until Thread 2 notifies it, once the value has been collected. After notification, Thread 2 sets another value on the object.

[Figure 3.1: Java thread synchronization example]
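The handoff shown in Figure 3.1 can be sketched as follows; this is a minimal illustration of synchronized, wait() and notify() only, not code from the thesis, and the class and method names are invented for the example.

    // Minimal sketch of the handoff in Figure 3.1 (illustrative names only).
    class SynchronizedSlot {
        private int value;
        private boolean full = false;

        synchronized void set(int v) throws InterruptedException {
            while (full) {          // wait until the previous value has been collected
                wait();
            }
            value = v;
            full = true;
            notify();               // wake up a thread blocked in get()
        }

        synchronized int get() throws InterruptedException {
            while (!full) {         // wait until a value is available
                wait();
            }
            full = false;
            notify();               // wake up a thread blocked in set()
            return value;
        }
    }

A producer thread calling set() and a consumer thread calling get() then alternate exactly as in the figure.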
3.1.2 Remote Method Invocation

An interesting way to address distributed architectures is the Remote Method Invocation (RMI) API [22], provided by Java since JDK 1.1. RMI extends normal Java programs with the ability to share objects over multiple JVMs. It doesn't matter where the JVMs are located, and therefore RMI programs work in heterogeneous environments.

RMI in principle is based on a client/server model, where a remote object, which should be accessible from other JVMs, is located on the server side. On the client side, a remote reference to that object appears, which allows the client to invoke methods of the remote object. Internally, remote objects can be registered in the RMI registry. Figure 3.2 shows two JVMs demonstrating a method invocation.

[Figure 3.2: RMI invocation example (after [14, 12]): the client's stub sends an invoke message to the skeleton on the server JVM, which forwards it to the remote object and returns the result]

When looking up a remote object the client gets a stub, which implements the accessors to the remote object and acts like a usual Java object. The counterpart on the server side is called the skeleton. Both are generated by the RMI compiler. By invoking a remote method, the stub marshalls the invocation including the method's arguments and sends it to the skeleton, where it is unmarshalled and forwarded to the remote object. The result from the remote object is returned in the same manner. The communication between stub and skeleton, always via TCP/IP, is synchronous. Therefore a stub always has to wait until the invocation on the remote object has finished.

The great advantage of RMI is the fact that it abstracts completely from low-level socket programming. That makes it more convenient for software developers to distribute applications in Java. On the other hand, RMI shows dramatic performance bottlenecks, since it uses Java's object serialization [23] and reflection mechanism for data marshalling. It has been evaluated [14, 11-36] that RMI's communication overhead can result in high latencies and low throughput, which in the end does not result in significant speedups for parallel applications; in fact, some applications can even become slower [14, 34].
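A minimal RMI example along the lines of Figure 3.2 is sketched below; the interface, implementation and registry name are invented for this illustration. Depending on the Java version, the stub and skeleton classes are generated by the rmic compiler (as described above) or created dynamically at runtime.

    import java.rmi.Naming;
    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.server.UnicastRemoteObject;

    // Remote interface: every remotely callable method must declare RemoteException.
    interface Compute extends Remote {
        int doSomething(int aParam) throws RemoteException;
    }

    // Server-side implementation; exporting it makes it reachable through a stub.
    class ComputeImpl extends UnicastRemoteObject implements Compute {
        ComputeImpl() throws RemoteException { super(); }
        public int doSomething(int aParam) throws RemoteException { return aParam * 2; }
    }

    public class RmiDemo {
        public static void main(String[] args) throws Exception {
            LocateRegistry.createRegistry(1099);                         // start an RMI registry
            Naming.rebind("rmi://localhost/compute", new ComputeImpl()); // register the remote object
            Compute stub = (Compute) Naming.lookup("rmi://localhost/compute");
            System.out.println(stub.doSomething(21));                    // call goes via stub and skeleton
        }
    }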
3.2 Ibis Design

As shown above, Java natively does not provide solutions to achieve portable and efficient communication for distributed memory architectures. In the following, the grid programming environment Ibis [26] (online at http://www.cs.vu.nl/ibis/) will be introduced, which addresses portability, efficiency and flexibility. Ibis has been implemented in pure Java. However, in special cases some native libraries are accessible through the Java Native Interface, which can be used to improve performance.

[Figure 3.3: Ibis design (redrawn from the Ibis 1.1 release): applications and programming models (RMI, Satin, RepMI, GMI, ProActive) on top of the IPL, which in turn is implemented on TCP, UDP, P2P, GM, Panda and Infiniband]

The main part of Ibis is the Ibis Portability Layer (IPL). It provides several simple interfaces, which are implemented by the lower layers (TCP, UDP, GM, ...). These implementations can be selected and loaded by the application at run time. For that purpose Ibis uses Java's dynamic class loader. With that, applications can run simultaneously on a variety of different machines, using optimized and specialized software where possible (e.g. Myrinet) or standard software (e.g. TCP) when necessary. Ibis applications can be deployed on machines ranging from clusters with local, high-performance networks like Myrinet or Infiniband, to grid platforms in which several remote machines communicate across the Internet. RMI (see Section 3.1.2) does not support these features.

Although it is possible, Ibis applications will typically not be implemented directly on top of the IPL. Instead they use one of the existing programming models. One of these models is a reimplementation of RMI, which allows a comparison between Ibis and the original RMI [25, 169-171] and shows Ibis' advantages. The other models will not be discussed here. Figure 3.3 shows the layered structure of the current Ibis release (Version 1.1).

Send and receive ports

To enable communication, the IPL defines send and receive ports (Figure 3.4), which provide a unidirectional message channel. These ports must be connected to each other (connection-oriented scheme). The communication starts by requesting a new message object from the send port, into which data items of any type, even objects, and of any size can be inserted. After insertion, the message is submitted by invoking send().

[Figure 3.4: Send and Receive Ports (after [25, 153]); the sender writes with m = sendPort.getMessage(); m.writeInt(3); m.writeArray(a); m.writeArray(b, 0, 100); m.writeObject(o); m.send(); m.finish(); and the receiver reads with m = receivePort.receive(); i = m.readInt(); m.readArray(a); m.readArray(b, 0, 100); o = m.readObject(); m.finish();]

Each receive port may be configured in two different ways. Firstly, messages can be received explicitly by calling the receive() primitive. This method is blocking and returns a message object from which the sent data can be extracted by the provided set of read methods (see Figure 3.4). Secondly, Ibis achieves implicit receipt with the ability to configure receive ports to generate upcalls. If an upcall takes place, a message object is passed to it. These are the only communication primitives that the IPL provides. All other patterns can be built on top of them.

3.3 Ibis Implementations

Besides flexibility, Ibis provides two important enhancements: efficient serialization and efficient communication [25, 165-169]. The message passing implementation, which will be introduced in Chapter 4, takes advantage of both.

Efficient serialization

As shown in Section 3.1.2, Java's object serialization is a performance bottleneck. Ibis circumvents it by implementing its own serialization mechanism, which is fully source compatible with the original. In general, Ibis serialization achieves performance advantages in three steps:

• Avoiding run time type inspection
• Optimizing object creation
• Avoiding data copying

This has been done by implementing a bytecode rewriter, which adds a specialized generator class to all classes implementing the serializable interface and takes over the standard serialization. Evaluations [25, 164] have pointed out that Ibis serialization outperforms the standard Java serialization by a large margin, particularly in those cases where objects are being serialized.

Efficient communication

The TCP/IP Ibis implementation uses one socket per unidirectional channel between a single send and receive port, which is kept open between individual messages. The TCP implementation of Ibis is written in pure Java, allowing an Ibis application to be compiled on a workstation and deployed directly on a grid. To speed up wide-area communication, Ibis can transparently use multiple TCP streams in parallel for a single port. Finally, it can communicate through firewalls, even without explicitly opened ports.

There are two Myrinet implementations of the IPL, built on top of the native GM (Glenn's messages) library and the Panda [2] library (Myrinet online at http://www.myri.com/, the GM driver at http://www.myri.com/scs/). Ibis offers highly efficient object serialization that first serializes objects into a set of arrays of primitive types. For each send operation, the arrays to be sent are handed as a message fragment to GM, which sends the data out without copying. On the receiving side, the typed fields are received into pre-allocated buffers; no other copies need to be made.

Chapter 4  Design and Implementation of MPJ/Ibis

The driving force in high performance computing for Java was the Java Grande Forum (JGF, online at http://www.javagrande.org) from 1999 to 2003. Before Java Grande, many MPI-like environments in Java were created without declaring any standards. Thus, a working group of the JGF proposed an MPI-like standard description, with the goal of finding a consensus API for future message-passing environments in Java [4]. To avoid naming conflicts with MPI, the proposal is called Message Passing Interface for Java (MPJ).

4.1 Common Design Space and Decisions

Since MPI is the de facto standard for message passing platforms and MPJ is mainly derived from it, the decision to create a message passing implementation matching the MPJ specification is obvious. The MPJ implementation should be flexible and efficient. Portability is guaranteed by Java, but not flexibility and efficiency. RMI and Java sockets usually use TCP for network communication. That makes them not flexible enough for high performance computing, for example on a Myrinet computer cluster. RMI, as shown in Section 3.1.2, is not efficient enough to satisfy the requirements of a message passing platform. However, Ibis provides solutions to both efficiency and flexibility, and thus offers an excellent foundation to build an MPJ implementation on top of it.
In the following this implementation is called MPJ/Ibis.

In contrast to MPI, MPJ does not address the issue of thread safety explicitly. Synchronizing multiple threads in asynchronous communication models - message passing models express it through the non-blocking communication paradigm - is not trivial. Since synchronization is still an expensive operation [8], even in cases where it is implemented but not used, MPJ/Ibis avoids the overhead of being thread safe. Multithreaded applications on top of MPJ must synchronize the entry points to MPJ/Ibis; that means only one thread is allowed to call MPJ primitives at a time. Thus, it is ensured that the requirement of getting the most efficient result is satisfied in this respect.

4.2 MPJ Specification

[Figure 4.1: Principal Classes of MPJ (after [4]): MPJ, Group, Comm with subclasses Intracomm (with Cartcomm and Graphcomm) and Intercomm, Datatype, Status, Request and Prequest, all in package mpj]

MPJ is the result of adapting the C++ MPI bindings specified in MPI-2 [16] to Java. The class specifications are built directly on the MPI infrastructure provided by MPI-1.1 [15]. It has been announced that the extensions of MPI-2, like dynamic process management, will be added in later work, but that has not been done yet. To stay conformant to the specification, these extensions are unsupported by MPJ/Ibis as well. Figure 4.1 shows the most important classes of MPJ, which will be briefly introduced in the following. All MPJ classes are organized in the package mpj.

The class MPJ is responsible for the initialisation of the whole environment, like global constants, peer connections and the default communicator COMM_WORLD. Thus, all members of MPJ are defined static. As explained in Section 2.5, processes of one group, represented by the class Group, communicate via the communicators. All communication elements are instances of the class Comm or its subclasses. For example, COMM_WORLD is an instance of the class Intracomm. The point-to-point communication operations, such as send, recv, isend and irecv, are in the Comm class. For example, the method prototype of send looks like this:

void Comm.send(Object buf, int offset, int count, Datatype datatype, int dest, int tag) throws MPJException

buf       send buffer array
offset    initial offset in send buffer
count     number of items to send
datatype  data type of each item in send buffer
dest      rank of destination
tag       message tag

Table 4.1: MPJ send prototype (after [4])

The message to be sent must consist either of an array of primitive types or of an array of objects, which has to be passed as the argument called buf. Objects need to implement the java.io.Serializable interface. The offset indicates the beginning of the message (see Table 4.1). The datatype argument is necessary to support derived datatypes in analogy to MPI (see Section 2.2). Datatypes are instances of the class Datatype and must match the base type used in buf. MPJ specifies the basic datatypes; Table 4.2 shows a list of the most important predefined types.

MPJ.BYTE  MPJ.CHAR  MPJ.BOOLEAN  MPJ.SHORT  MPJ.INT  MPJ.LONG  MPJ.FLOAT  MPJ.DOUBLE  MPJ.OBJECT  ...

Table 4.2: MPJ basic datatypes (after [4])
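As a usage sketch, the following fragment sends an integer array from rank 0 to rank 1 using the prototype above. The send signature and the constants MPJ.INT and MPJ.COMM_WORLD follow the specification as described here; the initialisation and shutdown calls, the rank accessor and the matching recv signature are assumptions made only for this example.

    import mpj.*;

    public class SendExample {
        public static void main(String[] args) throws MPJException {
            MPJ.init(args);                       // environment setup (method name assumed)
            int rank = MPJ.COMM_WORLD.rank();     // accessor name assumed
            int[] buf = new int[10];

            if (rank == 0) {
                for (int i = 0; i < buf.length; i++) buf[i] = i;
                // send 10 ints starting at offset 0 to rank 1 with tag 99
                MPJ.COMM_WORLD.send(buf, 0, buf.length, MPJ.INT, 1, 99);
            } else if (rank == 1) {
                // matching blocking receive (signature assumed to mirror send)
                MPJ.COMM_WORLD.recv(buf, 0, buf.length, MPJ.INT, 0, 99);
            }

            MPJ.finish();                         // shutdown call assumed
        }
    }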
As the send prototype shows, the send operation does not have a return value, since it is a blocking operation. In contrast to MPI, all communication operations use Java's exception handling to report errors. Additionally, non-blocking operations, such as isend or irecv, always return a Request object. Requests represent ongoing message transfers and provide several methods to obtain knowledge about the current state. Once a message is completed, those methods return a Status object, which contains detailed information about the message transfer. Prequest, an extension to Request, is the result of a prepared persistent message transfer. Persistent communication is organized in the class Comm and will not be discussed here, because it is of minor importance, even though it has been implemented in MPJ/Ibis. The collective communication operations are part of the class Intracomm, which extends Comm, so that they can use the point-to-point primitives.

4.3 MPJ on Top of Ibis

MPJ/Ibis has been divided into three layers and was built directly on top of the IPL (see Figure 4.2). The Ibis Communication Layer provides the low-level communication operations. The Base Communication Layer takes care of the basic send and receive operations specified by MPJ. It includes the blocking and non-blocking operations and the various test and wait statements. The Collective Communication Layer implements the collective operations on top of the Base Communication Layer. It is also responsible for group and communicator management. An MPJ/Ibis application is able to access both the Base Communication Layer and the Collective Communication Layer.

[Figure 4.2: MPJ/Ibis design: the application uses the Collective Communication Layer and the Base Communication Layer, which are built on the Ibis Communication Layer on top of the IPL]

4.3.1 Point-to-point Communication

Each participating process is connected explicitly to each other process using the IPL's send and receive ports. Ibis offers two mechanisms for data reception, upcalls and downcalls (see Section 3.2). Since upcalls always require an additional running thread collecting the messages, the performance of MPJ/Ibis would be affected negatively by switching between running threads. Threads cannot be avoided completely, but their number should be reduced to a minimum. Therefore, MPJ/Ibis has been designed to use downcalls. A pair of a send and a receive port to one communication partner is summarized in a Connection object, and all existing Connections are collected in a lookup table called ConnectionTable.

The IPL does not provide primitives for non-blocking communication. To support blocking and non-blocking communication, the Ibis Communication Layer has been designed to be thread safe. That allows non-blocking communication on top of the blocking communication primitives using Java threads. Multithreading cannot be avoided in this case. Furthermore, Ibis only provides communication in eager send mode; handshaking is not supported. While MPI's and MPJ's ready and synchronous send modes require some kind of handshaking in order to achieve a better buffer organization for short and large messages, the IPL implementations take care of that automatically. Therefore MPJ/Ibis does not distinguish between ready and synchronous send modes, but buffered send is supported.
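The idea of layering non-blocking operations on top of blocking primitives with Java threads can be sketched as follows. This is a simplified illustration of the concept only, not the actual MPJ/Ibis code; the Request-style class and the blockingSend placeholder are invented for the example.

    // Simplified sketch: a non-blocking isend built from a blocking send and a thread.
    class ISendRequest extends Thread {
        private final int[] buf;
        private final int dest;

        ISendRequest(int[] buf, int dest) {
            this.buf = buf;
            this.dest = dest;
        }

        public void run() {
            blockingSend(buf, dest);   // blocking primitive of the lower layer (placeholder)
        }

        void await() throws InterruptedException {
            join();                    // wait until the background transfer has finished
        }

        private void blockingSend(int[] buf, int dest) {
            // placeholder: in MPJ/Ibis this would be the blocking send of the
            // Ibis Communication Layer; left empty in this sketch
        }
    }

    // Usage: the call returns immediately while the send proceeds in the background.
    // ISendRequest req = new ISendRequest(buf, dest);
    // req.start();
    // ... other computation ...
    // req.await();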
Message Send

Internally, messages are represented by the MPJObject class, which consists of a header and a data part. The header is a one-dimensional integer array containing the message envelope values. Sending a message has been implemented in a straightforward way, as shown in Figure 4.3. Once one of the send primitives of the Base Communication Layer has been called, the calling thread checks whether another send operation is in progress. In that case, it has to wait until the previous operation has finished. Waiting, lock acquisition and release are left to Java's synchronization mechanisms. Instead of writing the MPJObject directly to the send port, which would cause unnecessary object serialization, the send operation has been divided into three steps. First, the header is written explicitly. Second, it is determined whether the message data has to be serialized into a system buffer before it can be sent. Third, depending on step two, the message data or the system buffer is written to the send port.

[Figure 4.3: MPJ/Ibis send protocol]

Message Receive

Since non-blocking communication allows the order of calling receive primitives to vary, it is necessary to add a queue for every receive port, where unexpected messages are collected. Usually, receiving messages into a queue is done by using a permanent receiver thread, which is connected to the receive port and manages the queue filling automatically. With that approach, one thread would be needed for each receive port, resulting in a slowdown of the whole system's performance due to thread switching, which grows with the number of participating processes. Another disadvantage of such a concept is that zero-copying is not possible, since all messages have to be copied out of the queue.

To avoid continuous thread switching and to allow zero-copying, the receive protocol has been designed in a way that allows a receiving primitive to connect explicitly to the receive port. The receive protocol is shown in Figure 4.4. After obtaining the lock, a receiver thread checks whether the queue contains the requested message. If the message is found, it is copied or deserialized out of the queue, depending on whether the message was buffered or not. If the message is not found inside the queue, the receiver connects to the receive port and gets the incoming message header to determine whether the incoming message is targeted at this receiver. If not, the whole message including the header is inserted into an MPJObject and moved into the queue, where it waits for the matching receiver. To give other receiving threads the chance to check the queue, the lock is temporarily released. If the message header from the receive port matches the posted receive, a non-buffered message is received directly into the receive buffer, while a buffered message has to be deserialized into the receive buffer.

[Figure 4.4: MPJ/Ibis receive protocol]
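A heavily simplified sketch of the matching logic described above is given below. It only illustrates the control flow of Figure 4.4 (check the queue first, otherwise read from the port and either deliver or enqueue); the class and method names are invented, and details such as buffered messages, locking granularity and zero-copy reception are omitted.

    // Simplified illustration of the receive matching in Figure 4.4 (invented names).
    class ReceiveQueueSketch {
        private final java.util.LinkedList queue = new java.util.LinkedList();

        synchronized MPJObjectSketch receive(int wantedSource, int wantedTag) {
            // 1. Check the queue for an already received, matching message.
            for (java.util.Iterator it = queue.iterator(); it.hasNext();) {
                MPJObjectSketch m = (MPJObjectSketch) it.next();
                if (m.source == wantedSource && m.tag == wantedTag) {
                    it.remove();
                    return m;
                }
            }
            // 2. Otherwise read from the port until the matching message arrives;
            //    non-matching messages are parked in the queue for other receivers.
            while (true) {
                MPJObjectSketch m = readFromPort();   // blocking downcall (placeholder)
                if (m.source == wantedSource && m.tag == wantedTag) {
                    return m;
                }
                queue.add(m);
            }
        }

        private MPJObjectSketch readFromPort() {
            return new MPJObjectSketch();             // placeholder for the IPL receive
        }
    }

    class MPJObjectSketch {
        int source;
        int tag;
        // header and data fields omitted in this sketch
    }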
4.3.2 Groups and Communicators

As mentioned in Section 2.5, each group of processes has a communicator that is responsible for message transfers within the group. Since communicators can share the same send and receive ports, it is mandatory to prevent mixing up messages of different communicators. The tag value, which is used by the user to distinguish different messages, is not useful for this purpose. Therefore, on creation each communicator gets a unique contextId, which allows the system to handle communication of different communicators on the same ports at the same time. When sending a message, the message header is extended by adding the communicator's contextId to it. Messages are then identified by both the tag value and the contextId.

Creating a new communicator is always a collective operation. Each process holds the value of the highest contextId that is used locally. When a new communicator is going to be created, each process creates a new temporary contextId by increasing the highest by one. To ensure that all processes use the same contextId for the new communicator, the temporary contextId is allreduced to the maximum. After that, the new communicator is created and the local system is informed that the new contextId is the highest.

4.4 Collective Communication Algorithms

The collective operations of the class Intracomm have been implemented on top of the point-to-point primitives. Since the naive approach of sending messages directly may result in high latencies, those operations should use specialized algorithms. For example, letting the root in a broadcast operation send a message to each node explicitly is inefficient, because the last receiver has to wait until the message has been sent to all the other processes. Current MPI implementations contain a vast number of different algorithms realizing the collective operations, and research on their optimization is not finished yet.

Collective Operation   Algorithm                                                       Upper Complexity Border
allgather              double ring                                                     O(n)
allgatherv             single ring                                                     O(n)
allreduce              recursive doubling                                              O((log n) + 2)
alltoall               flat tree                                                       O(n^2)
alltoallv              flat tree                                                       O(n^2)
barrier                flat tree                                                       O(2n)
broadcast              binomial tree                                                   O(log n)
gather                 flat tree                                                       O(n)
gatherv                flat tree                                                       O(n)
reduce                 commutative op: binomial tree; non-commutative op: flat tree    O(log n); O(n)
reduceScatter          phase 1: reduce; phase 2: scatterv                              commutative op: O((log n) + n); non-commutative op: O(2n)
scan                   flat tree                                                       O(n)
scatter                flat tree                                                       O(n)
scatterv               flat tree                                                       O(n)

Table 4.3: Algorithms used in MPJ/Ibis to implement the collective operations

MPJ/Ibis provides a basic set of collective algorithms, which may be extended and further optimized in future work. At least one algorithm for each operation has been implemented. Table 4.3 shows the algorithms used for all collective operations including their upper complexity borders in O-notation, where n is the number of processes involved. In accordance with the MPI specification, four collective operations have been extended to achieve more flexibility. The extended operations are called allgatherv, alltoallv, gatherv and scatterv. They allow the item sizes and buffer displacements to vary explicitly for each process. In the following, the algorithms used are introduced by way of examples.

Flat Tree

The flat tree model follows the naive approach mentioned above. The root process P0 sends to and/or receives from the other group members directly. Figure 4.5 demonstrates the scatter operation using the flat tree communication scheme with five participating processes. In each step the root sends a message containing the elements to be scattered to the next process. The number of steps needed depends linearly on the number of processes. Figure 4.6 shows a different view of the same scatter scheme, which results in a tree with n-1 leaves and exactly one parent node, namely the root.

[Figure 4.5: Scatter sending scheme]
[Figure 4.6: Flat tree view]

Typically, the other operations using the flat tree model have a complexity of O(n) as well, except alltoall and barrier. The alltoall operation has been implemented using the flat tree, but each process creates a flat tree to send messages to each other process. This results in n flat trees with an overall upper complexity border of O(n^2). The barrier operation uses the process with rank zero to gather zero-sized messages from the other processes. These messages are then scattered back to complete the operation. Two flat trees are needed, with a total cost of O(2n).
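A flat-tree scatter as described above can be sketched as follows; this is an illustration of the scheme, not MPJ/Ibis source code, and the send/recv helpers are placeholders for the blocking point-to-point primitives.

    // Flat-tree scatter sketch (illustrative only, not MPJ/Ibis source code).
    class FlatTreeScatterSketch {
        void scatter(int[] sendBuf, int[] recvBuf, int root, int rank, int size) {
            int chunk = sendBuf.length / size;   // assumes sendBuf.length is divisible by size
            if (rank == root) {
                for (int i = 0; i < size; i++) {
                    if (i == root) {
                        // the root keeps its own part locally
                        System.arraycopy(sendBuf, i * chunk, recvBuf, 0, chunk);
                    } else {
                        send(sendBuf, i * chunk, chunk, i);   // one step per destination process
                    }
                }
            } else {
                recv(recvBuf, 0, chunk, root);                // every non-root posts a single receive
            }
        }

        void send(int[] buf, int offset, int count, int dest) { /* placeholder */ }
        void recv(int[] buf, int offset, int count, int src)  { /* placeholder */ }
    }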
Binomial Tree

The broadcast follows the binomial tree model shown in Figure 4.8. Sending messages via the binomial tree structure has a complexity of O(log n). In Figure 4.7 the broadcast operation takes place with eight participating processes, where P0 is the root sending its send buffer to the other processes. After each step the number of sending processes doubles. In this example three steps are needed to execute the whole operation, while a broadcast in the flat tree model would take seven steps.

[Figure 4.7: Broadcast sending scheme]
[Figure 4.8: Binomial tree view]

The reduce operation has been implemented using a binomial tree in reverse order. After receiving a message, each process combines its send and receive buffer using the assigned operation. The result is written into the send buffer and represents the argument for the next step. In the end the reduce result appears at the root process. Since user-defined operations may be non-commutative and the binomial tree model does not follow strict ordering, this algorithm is not universally valid. In the case of non-commutative user-defined operations, the flat tree model is used instead.

Ring

In allgatherv, each process first sends the item it wants to gather to its right neighbour (the process with the next higher rank). If the rank of a process is n-1, it sends its item to the process with rank 0. In each of the following steps, each process forwards the item it has just received to its right neighbour. The execution is complete after n-1 steps, when each process has received the items of all the other processes. Figure 4.9 illustrates one step of the ring algorithm.

[Figure 4.9: Ring sending scheme (only 1 step)]

Allgather uses an extension to the ring model mentioned above, called double ring. With the double ring, all processes send the items to gather to both their right and left neighbours. Therefore the execution completes after (n - 1)/2 steps, but in each step the number of send and receive operations needed is doubled compared to the single ring. Overall this model shows the same complexity of O(n) as the ring.

Recursive Doubling

In recursive doubling, used by allreduce, in the first step all processes with a distance of 1 exchange their messages, followed by a local reduction (see Figure 4.10). In the following steps the distance is doubled each time. After log n steps the allreduce operation has finished, for the case that the number of participating processes is a power of two. For the non-power-of-two case, the number of processes performing the recursive doubling is reduced to a power of two. The remaining processes send their items to the processes of the recursive-doubling group explicitly before the doubling starts. When the recursive doubling has finished, the reduced values are sent back to the remaining processes. That causes two extra steps in the non-power-of-two case and results in a complexity of O((log n) + 2).

[Figure 4.10: Recursive doubling illustration]
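For the power-of-two case, the recursive doubling scheme can be sketched as follows; again this is only an illustration of the idea, with placeholder exchange and reduction helpers, not the MPJ/Ibis implementation.

    // Recursive doubling allreduce sketch for a power-of-two number of processes.
    // exchange(...) stands for a pairwise send/receive with the given partner,
    // and combine(...) for the (commutative) reduce operation.
    class RecursiveDoublingSketch {
        long allreduce(long localValue, int rank, int size) {
            long result = localValue;
            for (int distance = 1; distance < size; distance *= 2) {
                int partner = rank ^ distance;          // partner differs in exactly one bit
                long received = exchange(result, partner);
                result = combine(result, received);     // local reduction after each exchange
            }
            return result;                              // after log2(size) steps all ranks hold the result
        }

        long exchange(long value, int partner) { return value; /* placeholder */ }
        long combine(long a, long b)           { return a + b; /* e.g. sum */ }
    }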
[Figure 4.10: Recursive doubling illustration — eight processes, three steps]

4.5 Open Issues

Multidimensional Arrays & Derived Datatypes

Since the MPJ specification only specifies generic send and receive primitives, which expect message data of the type java.lang.Object, MPJ/Ibis has to cast explicitly to the real type and dimension of the arrays used. That makes it impossible for MPJ/Ibis to support arrays of every dimension. In contrast to the programming language C and others, for which the original MPI standard was presented, Java represents multidimensional arrays as arrays of arrays. Traversing those arrays in Java with only one pointer is impossible. Therefore MPJ/Ibis supports only one-dimensional arrays. Multidimensional arrays can either be sent as an object, or each row has to be sent explicitly.

Since Java supports derived datatypes natively in the form of Java objects, there is no real need to implement derived datatypes in MPJ/Ibis. Nevertheless, contiguous derived datatypes are supported by MPJ/Ibis to achieve the functionality of the reduce operations MINLOC and MAXLOC specified by MPJ, which need at least a pair of values inside a one-dimensional array. The other types of derived datatypes may be implemented in future work, if multidimensional arrays are supported directly.

Other Issues

Due to the time constraints of this thesis, MPJ/Ibis supports creating and splitting new communicators, but intercommunication is not implemented yet (see Section 2.5). At this moment MPJ/Ibis also does not support virtual topologies (see Section 2.6). Both may be added in future work as well.

Chapter 5

Evaluation

5.1 Evaluation Settings

MPJ/Ibis on top of Ibis version 1.1 has been evaluated on the Distributed ASCI Supercomputer 2 (DAS-2) with 72 nodes in Amsterdam. Each node consists of:

• two 1-GHz Pentium-IIIs
• 1 GB RAM
• a 20 GByte local IDE disk
• a Myrinet interface card
• a Fast Ethernet interface (on-board)

The operating system is Red Hat Enterprise Linux with kernel 2.4. Only one processor per node has been used during the evaluation. To achieve more comparable results, the following benchmarks have also been performed with mpiJava [1]. MpiJava is based on wrapping a native MPI implementation such as MPICH with the Java Native Interface (JNI). Here, mpiJava version 1.2.5 has been bound to MPICH/GM version 1.2.6 for Myrinet. For Fast Ethernet only MPICH/P4 was available on the DAS-2, which is not compatible with mpiJava. Nevertheless, the values for MPJ/Ibis on TCP over Fast Ethernet will be presented as well. Both MPJ/Ibis and mpiJava have been evaluated using Sun's JVM version 1.4.2.

For Ibis, two different modules exist to access a Myrinet network, called Net.gm and Panda. During the evaluation it was found that Net.gm in some cases causes deadlocks, due to problems in buffer reservation, when multiple objects are being transferred from multiple senders to one recipient. The Panda implementation has shown memory leaks for large message sizes, resulting in performance reduction and deadlocks. However, wherever possible the results for MPJ/Ibis on Myrinet will be presented for each benchmark. MPJ/Ibis on TCP performed stably. MpiJava on MPICH/GM in some cases performed unstably, resulting in memory overflows and broken data streams. These misbehaviours occurred randomly and could not be reproduced.

5.2 Micro Benchmarks

The micro benchmarks, which have been implemented for Ibis, MPJ/Ibis and mpiJava, first measure the round-trip latency by sending one byte back and forth. Since the message size can be neglected, the round-trip latency is divided by two to obtain the latency of a one-way data transfer.
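The structure of this round-trip measurement is roughly the following. The sketch uses mpiJava-style Send/Recv calls purely for illustration; the repetition count, the timing code and the exact method names of the benchmarks actually used in this chapter may differ.

    import mpi.MPI;

    // Structure of the one-byte round-trip (ping-pong) latency measurement,
    // written with mpiJava-style calls for illustration only.
    public class PingPongLatency {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            byte[] buf = new byte[1];
            int rounds = 10000;                        // illustrative repetition count

            long start = System.currentTimeMillis();
            for (int i = 0; i < rounds; i++) {
                if (rank == 0) {
                    MPI.COMM_WORLD.Send(buf, 0, 1, MPI.BYTE, 1, 0);
                    MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.BYTE, 1, 0);
                } else if (rank == 1) {
                    MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.BYTE, 0, 0);
                    MPI.COMM_WORLD.Send(buf, 0, 1, MPI.BYTE, 0, 0);
                }
            }
            long elapsedMs = System.currentTimeMillis() - start;

            if (rank == 0) {
                // Half of the average round-trip time approximates the one-way
                // latency, since the one-byte payload is negligible.
                double oneWayMicros = (elapsedMs * 1000.0) / rounds / 2.0;
                System.out.println("one-way latency: " + oneWayMicros + " us");
            }
            MPI.Finalize();
        }
    }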
Secondly, the micro benchmarks measure the execution time of sending an array of doubles from one node to a second node, which acknowledges the reception by sending one byte back to the sender. This is repeated for array sizes from 1 byte to 1 MB. Thirdly, the throughput of object arrays is measured, where each object contains a byte value. The measurement is repeated for different array sizes in analogy to the second step.

  Implementation        Latency [µs]
  mpiJava (MPICH/GM)    28
  Ibis (Panda)          44
  Ibis (Net.gm)         52
  MPJ/Ibis (Panda)      50
  MPJ/Ibis (Net.gm)     53
  Ibis (TCP)            113
  MPJ/Ibis (TCP)        120

Table 5.1: Latency benchmark results

Latencies

Table 5.1 shows the latency benchmark results for MPJ/Ibis, Ibis and mpiJava. On Myrinet, MPJ/Ibis and Ibis have considerably higher latencies than mpiJava. The reason for the gap between Ibis and mpiJava is beyond the scope of this thesis. Furthermore, MPJ on top of Ibis does not show considerably higher latencies than Ibis itself. Thus, the message creation overhead of MPJ/Ibis does not influence the latency by a large margin.

Throughput Double Arrays

Figures 5.1 and 5.2 show the throughput measurement results for Ibis and the message passing implementations. On TCP, Ibis and MPJ/Ibis use almost the whole bandwidth provided by Fast Ethernet for data sizes greater than 1 KB. For sizes beyond 32 KB the performance of the Panda implementation breaks down, caused by the memory leaks mentioned above. This halves the throughput for Ibis and MPJ/Ibis.

[Figure 5.1: Double array throughput in Ibis — bandwidth over array size for Ibis on TCP, Panda and Net.gm]

Net.gm does not reach the maximum of Panda, but for large data sizes (> 64 KB) it is much faster. MPJ/Ibis on top of Net.gm also outperforms mpiJava. The breakdown of mpiJava's performance is caused by MPICH/GM switching from ready-send mode to synchronous send mode at message sizes beyond 128 KB. None of the message passing implementations takes full advantage of the bandwidth provided by Myrinet.

[Figure 5.2: Double array throughput in MPJ/Ibis and mpiJava]

Throughput Object Arrays

The throughput of object arrays is not limited by the physical restrictions of the underlying network hardware. Objects have to be serialized at the sender's side and deserialized by the receiver. The performance gap between Ibis and MPJ/Ibis is small (compare Figures 5.3 and 5.4). Both use the Ibis serialization model introduced in Section 3.3, which still is a performance limiter. On the other hand, MPJ/Ibis outperforms mpiJava by a large margin, since mpiJava depends on the serialization model provided by Sun's JVM. MpiJava does not even reach 15 Mbps on Myrinet when object arrays are being transferred.

Each of the serialization models introduced has to perform duplicate detection before sending an object, so that no object needs to be transferred twice. Therefore, each object reference has to be stored in a hash table to allow a lookup whether an object to be sent has been processed before. For larger arrays the hash table becomes larger as well, causing communication slowdowns. These slowdowns have been shown by all implementations when the object array sizes grow beyond 8 KB.
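This duplicate detection can be pictured as a lookup table keyed by object identity, as in the following minimal sketch. It is not the Ibis or Sun serialization code; it only illustrates why the table, and with it the per-object bookkeeping cost, grows with the number of objects written to a stream.

    import java.util.IdentityHashMap;
    import java.util.Map;

    // Minimal sketch of reference duplicate detection during serialization.
    class WriteHandleTable {
        private final Map<Object, Integer> handles = new IdentityHashMap<>();
        private int nextHandle = 0;

        // Returns the handle of a previously written object, or -1 if the
        // object has not been seen yet (recording it in that case).
        int lookupOrRecord(Object obj) {
            Integer handle = handles.get(obj);
            if (handle != null) {
                return handle;          // already written: send only the handle
            }
            handles.put(obj, nextHandle);
            nextHandle++;
            return -1;                  // first occurrence: serialize the full object
        }
    }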
[Figure 5.3: Object array throughput in Ibis — bandwidth over array size for Ibis on TCP, Panda and Net.gm]

[Figure 5.4: Object array throughput in MPJ/Ibis and mpiJava]

5.3 Java Grande Benchmark Suite

The Java Grande Benchmark Suite (online at http://www.epcc.ed.ac.uk/javagrande/mpj/contents.html), maintained by the Edinburgh Parallel Computing Centre (EPCC), consists of three sections. Section 1 performs low-level operations: pingpong for throughput measurements and several benchmarks for the collective operations. Section 2 provides five kernel applications performing common operations that are widely used in high performance computing. In Section 3 three large applications are benchmarked. The benchmarks of Sections 2 and 3 measure execution times. Those results will be presented as relative speedup [10, 30], which is defined as follows:

    relative speedup (p processors) = runtime of the parallel algorithm on 1 processor / runtime of the parallel algorithm on p processors

Additionally, the theoretical perfect speedup is marked in each of the kernel and application benchmark figures. Originally the benchmark suite was implemented to match mpiJava's API, which is slightly different from that of MPJ. Thus, the benchmark suite has been ported to MPJ. Sections 2 and 3 contain predefined problem sizes to be solved, which were not large enough to perform efficiently on the DAS-2 once computation times become short. Where possible, the predefined problem sizes have been increased to make better use of the capacity provided by the existing hardware.

5.3.1 Section 1: Low-Level Benchmarks

The low-level benchmarks are designed to run for a fixed period of time. The number of operations executed in that time is recorded, and the performance is reported as operations/second for the barrier and as bytes/second for the other operations. Both the size and the type of the arrays transferred are varied. The type is either a double or a simple object containing a double, which allows comparing the communication overhead of sending objects and basic types.

Pingpong

The pingpong benchmark measures the bandwidth achieved when sending an array back and forth between exactly two nodes. The aggregated results of the double array and the object array benchmarks are shown in Figures 5.5 and 5.6.

[Figure 5.5: Pingpong benchmark: arrays of doubles]

[Figure 5.6: Pingpong benchmark: arrays of objects]

As can be seen in Figure 5.5, for all Myrinet implementations the achieved throughput rises in the same way up to array sizes of almost 64 KB. Beyond 64 KB, MPJ/Ibis on Net.gm keeps outperforming mpiJava, while MPJ/Ibis on Panda slows down due to the mentioned memory leaks. For object arrays mpiJava uses the serialization mechanism provided by the JVM, while MPJ/Ibis takes advantage of the Ibis serialization.
This results in higher throughputs; even at array sizes greater than 4 KB, MPJ/Ibis on TCP performs better than mpiJava. With arrays larger than 1 MB the performance advantage of the Ibis serialization almost disappears for Ibis' Myrinet implementations. Overall these results reflect the measurements of the micro benchmarks (see Section 5.2).

Barrier

MPJ/Ibis' implementation of the barrier operation is not optimal (see Figure 5.7). MPICH/GM uses the recursive doubling algorithm, which has a lower execution time than the algorithm used in MPJ/Ibis. In both MPJ/Ibis and mpiJava a zero-sized byte array is transferred in each communication step. Additionally, due to the higher latencies for zero-sized messages, the relative performance of MPJ/Ibis compared to mpiJava is slowed down further.

[Figure 5.7: Barrier benchmark — barriers/sec over the number of CPUs]

Broadcast

The broadcast benchmark performs in a similar way to the pingpong benchmark. In contrast to pingpong, this benchmark is not restricted to only two nodes, and therefore the broadcast operation has been evaluated on up to 48 nodes. MPJ/Ibis and MPICH/GM implement the broadcast operation using the same algorithm. Figure 5.8 shows the results of MPJ/Ibis and mpiJava. For double arrays mpiJava performs better than MPJ/Ibis up to eight involved nodes, caused by its higher throughput. With more participating processes the difference between the implementations working on Myrinet becomes marginally small. Broadcasting arrays of objects comes with a considerable performance advantage for MPJ/Ibis. On two processors MPJ/Ibis outperforms mpiJava by a factor of about six. When the number of nodes is increased to 48, this gap rises to a factor of about 20, caused by the more efficient Ibis serialization. Overall the results of the broadcast benchmark correspond to those of the micro and pingpong benchmarks.

[Figure 5.8: Broadcast benchmark — bandwidth over array size on 2 to 48 nodes, for double and object arrays, measured with MPJ/Ibis (Panda, Net.gm, TCP) and mpiJava (MPICH/GM)]

Reduce

The reduce benchmark only uses double arrays, since the built-in reduce operations do not support object arrays.
Here, the arrays are reduced by adding the array items using the sum operation.

[Figure 5.9: Reduce benchmark — bandwidth over the number of CPUs for double arrays of 4 and 4472 items]

In general mpiJava shows better performance (see Figure 5.9), although MPICH/GM implements the same tree algorithm as MPJ/Ibis. In both, before executing the reduce operation a temporary buffer for array reception is created at each node. Since Java, in contrast to C, fills arrays with zeros at initialization, this overhead is much higher for MPJ/Ibis. Furthermore, the smaller throughput results for MPJ/Ibis also reflect the higher latencies shown by the micro benchmarks.

Scatter

Scattering messages has been implemented in the same way in mpiJava and MPJ/Ibis. As for the reduce operation, the scatter benchmark only measures the throughput for two different sizes, but for both double and object arrays. It was not possible to run this benchmark on the Net.gm implementation for MPJ/Ibis, which caused deadlocks when more than two processes were involved. As can be seen in Figure 5.10, for small double arrays mpiJava shows less performance loss on larger numbers of nodes than MPJ/Ibis on Panda, because of the lower latency shown in Section 5.2. For larger double arrays the impact of the latency gap becomes marginally small. As expected, for object arrays MPJ/Ibis on Panda outperforms mpiJava by a factor of almost 1.9. Overall the performance of scattering messages is highly influenced by the flat tree algorithm used.

[Figure 5.10: Scatter benchmark — bandwidth over the number of CPUs for double and object arrays of 4 and 4472 items]

Gather

It was not possible to run the gather benchmark with mpiJava or with the Myrinet implementations of Ibis. With all of them this benchmark exceeded the limit of the physical memory provided by each node of the DAS-2. Only MPJ/Ibis on top of TCP worked stably. The results are presented in Figure 5.11.

[Figure 5.11: Gather benchmark — bandwidth over array size for double and object arrays, MPJ/Ibis on TCP, 2 to 32 nodes]

This benchmark functions in the same way as the broadcast benchmark. For double arrays the gather operation shows a higher performance loss on more than two nodes than broadcast. This is because of the flat tree algorithm, in which each node sending a message to the root has to wait until the root has received the message from the previous node. With a growing number of processes this leads to substantially higher latencies. For object arrays the results show an unstable behaviour of MPJ/Ibis on TCP using the gather operation with up to eight processes involved.
While MPJ/Ibis works as expected for double arrays (more processes causing less throughput), the results for object arrays lead to the assumption that the Ibis serialization used for local copies may cause inefficiencies. In each call of gather the root has to copy the items of its send buffer locally into the receive buffer. Object arrays are copied using Ibis serialization. The relative share of this local copy in the total communication overhead is much higher at a small number of processes than at larger numbers of processes. This impact should be examined further in future work.

Alltoall

As with the gather benchmark, it was not possible to perform the alltoall benchmark with mpiJava or with MPJ/Ibis on top of Panda and Net.gm. With all Myrinet implementations this benchmark produced memory overflows. Additionally, mpiJava randomly reported broken data streams.

[Figure 5.12: Alltoall benchmark — bandwidth over array size for double and object arrays, MPJ/Ibis on TCP, 2 to 32 nodes]

In MPJ/Ibis the alltoall operation uses non-blocking communication primitives, which are called simultaneously. On TCP, multiple threads competing for system resources noticeably interfere with the communication, leading to high performance losses the more processes are involved.
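The pattern of posting all transfers at once and then waiting for their completion can be sketched as follows. The non-blocking interface (isend/irecv returning a request) is a simplified stand-in assumed for this sketch and is not the MPJ/Ibis API.

    // Hypothetical non-blocking point-to-point interface, used only to
    // illustrate the alltoall pattern; not the MPJ/Ibis API.
    interface Request { void waitFor(); }

    interface NonBlockingP2P {
        int rank();
        int size();
        Request isend(double[] buf, int offset, int count, int dest);
        Request irecv(double[] buf, int offset, int count, int source);
    }

    // Every process posts one non-blocking send and one non-blocking receive
    // per peer, then waits for all of them, so the n flat trees overlap
    // instead of running one after another.
    class FlatTreeAlltoall {
        static void alltoall(NonBlockingP2P comm, double[] sendBuf,
                             double[] recvBuf, int count) {
            int n = comm.size();
            int me = comm.rank();
            Request[] pending = new Request[2 * (n - 1)];
            int k = 0;
            for (int peer = 0; peer < n; peer++) {
                if (peer == me) {
                    System.arraycopy(sendBuf, me * count, recvBuf, me * count, count);
                } else {
                    pending[k++] = comm.irecv(recvBuf, peer * count, count, peer);
                    pending[k++] = comm.isend(sendBuf, peer * count, count, peer);
                }
            }
            for (Request r : pending) {
                r.waitFor();            // completion of all transfers ends the operation
            }
        }
    }

With n-1 sends and n-1 receives outstanding per process, many transfers are active at the same time, which is exactly the source of the contention observed on TCP.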
5.3.2 Section 2: Kernels

Crypt

Crypt performs an IDEA (International Data Encryption Algorithm) encryption and decryption on a byte array with a size of 5 × 10⁷ items. Node 0 creates the array and sends it to the other nodes, where the encryption and decryption take place. After computation the involved nodes send their results back to node 0 using individual messages. The time measurement starts after sending the initialized arrays and stops when node 0 has received the last array.

[Figure 5.13: Crypt speedups]

The Crypt kernel does not scale perfectly in all cases (see Figure 5.13). However, MPJ/Ibis on top of Panda and mpiJava show the same speedups. Beyond 8 nodes both break down. As expected from the micro benchmarks, the impact of the communication overhead of Panda becomes more noticeable as the number of involved nodes grows, due to the reduced computation time per node. MPJ/Ibis on Net.gm shows the highest speedup of up to about 23 on 64 CPUs and outperforms mpiJava by a large margin. Even MPJ/Ibis on TCP shows a better performance than mpiJava on Myrinet.

LU Factorization

This kernel solves a linear system containing 6000 x 6000 items using LU factorization followed by a triangular solve. The processes exchange double and integer arrays using the broadcast operation. The time of the factorization, including communication, is measured.

[Figure 5.14: LU Factorization speedups]

All message passing implementations except MPJ/Ibis on TCP show the same speedup, though they do not scale perfectly. Due to a relatively small problem size the effect of the computation part is small. Because of memory constraints it was not possible to enlarge the problem size beyond 6000 x 6000 items.

Series

This benchmark computes the first 10⁶ Fourier coefficients of a function inside a predefined interval. Communication only takes place at the end of the computation, when each node sends its individual results (double arrays) to node 0. The performance of both computation and communication is measured.

[Figure 5.15: Series speedups]

The speedups for all implementations grow in a linear way, as shown in Figure 5.15, but they are not perfect. At 48 nodes mpiJava is slightly slower than the Myrinet implementations of MPJ/Ibis. Even MPJ/Ibis on Fast Ethernet does not scale worse than the Myrinet implementations. Executed on 64 nodes, this kernel is 40 times faster than on a single node.

Sparse Matrix Multiplication

This kernel multiplies a sparse matrix that is stored using one array of double values and two integer arrays. First, node 0 creates the matrix data and transfers it to each process. Second, when each node has completed the computation, the result is combined using allreduce (sum operation). Only the time of the second step is measured. Here, a sparse matrix with a size of 10⁶ x 10⁶ items has been used for 200 iterations. It was not possible to enlarge the matrix size because of memory restrictions.

[Figure 5.16: Sparse matrix multiplication speedups]

This benchmark has shown the smallest run times of the whole JGF benchmark suite (about 131 seconds when executed on only one node). This means that the computation time is marginally small and does not dominate the overall execution speed. Therefore the communication overhead becomes the main factor; it grows with the number of involved processes, leading to performance reduction (see Figure 5.16). All MPJ/Ibis implementations are slowed down, showing the impact of the higher latencies presented in Section 5.2. For mpiJava there is a small speedup of about 2.5. It has to be noted, however, that one of MPICH/GM's allreduce operations is erroneous. Beyond eight participating processes this benchmark reports result validation errors for mpiJava. MPICH/GM uses different algorithms for allreduce depending on data size and the number of processes. Here, at more than eight processes it switches to a different algorithm, which does not work correctly. The speedup values for this benchmark on mpiJava are therefore not representative.

Successive Over-Relaxation

This benchmark performs 100 iterations of successive over-relaxation (SOR) on a 6000 x 6000 grid. The arrays are distributed over the processes in blocks using the red-black checkerboard ordering mechanism. Only neighbouring processes exchange arrays, which consist of double values. The arrays are treated as objects, since they are two-dimensional. It was not possible to run this benchmark with MPJ/Ibis on Net.gm, due to the mentioned problems in the buffer reservation of Net.gm.

[Figure 5.17: Successive over-relaxation speedups]

Because of the Ibis serialization model, MPJ/Ibis on Panda outperforms mpiJava by a factor of two up to 32 participating processes (see Figure 5.17). Beyond 32 nodes the transmitted arrays become smaller, reducing the performance advantage by increasing the relative communication overhead.
5.3.3 Section 3: Applications

Molecular Dynamics

The Molecular Dynamics application models 27436 particles (problem size: 19) interacting under a Lennard-Jones potential in a cubic spatial volume with periodic boundary conditions. In each iteration the particles are updated using the allreduce operation with summation in the following way:

• three times with double arrays
• two times with a double value
• once with an integer value

[Figure 5.18: Molecular dynamics speedups]

As can be seen in Figure 5.18, mpiJava slightly outperforms MPJ/Ibis on the Myrinet modules. In contrast to the sparse matrix multiplication, this benchmark does not report validation errors with mpiJava, since MPICH/GM does not switch between different allreduce algorithms in this case.

Monte Carlo Simulation

This financial simulation uses Monte Carlo techniques to price products derived from an underlying asset. It generates 60000 sample time series. The results at each node are stored in object arrays of class java.util.Vector, which are sent to node 0 using the send and recv primitives. As with the successive over-relaxation kernel, it was not possible to run this benchmark using Net.gm for Ibis.

[Figure 5.19: Monte Carlo speedups]

Though no message passing implementation achieves perfect speedup, MPJ/Ibis on Panda outperforms mpiJava considerably (see Figure 5.19), since it takes advantage of the Ibis serialization mechanism. MpiJava, which depends on Sun serialization, is slightly faster than MPJ/Ibis on TCP, which is restricted by the available bandwidth of Fast Ethernet.

Raytracer

The raytracer application benchmark renders a scene containing 64 spheres at a resolution of 2000 x 2000 pixels. Each process computes a part of the scene and sends the rendered pixels to node 0 using the send and recv primitives.

[Figure 5.20: Raytracer speedups]

In all cases the raytracer scales almost perfectly. At 64 nodes mpiJava shows a slightly better performance than MPJ/Ibis. In comparison to the computation part, the communication overhead is marginally small for all message passing implementations evaluated.

5.4 Discussion

MPJ/Ibis and mpiJava have been evaluated both via micro benchmarks (Section 5.2), measuring latencies and throughputs, and via the JGF benchmark suite (Section 5.3), measuring throughputs, the collective operations, and kernel and application runtimes. The micro benchmarks have shown that MPJ/Ibis does not suffer a significant performance penalty from the MPJ layer itself. In comparison to mpiJava on MPICH/GM, MPJ/Ibis shows higher latencies, but their influence becomes smaller at growing data sizes. Particularly for object arrays, MPJ/Ibis has a great performance advantage over mpiJava on MPICH/GM. On the other hand, the effects of the relatively higher latencies of MPJ/Ibis become visible in the low-level benchmarks of Section 1 of the JGF benchmark suite. In particular, the algorithm for the barrier operation should be improved. MpiJava performs better when basic type arrays are being communicated through the different collective operations, while MPJ/Ibis has advantages when object arrays are being transferred.
The performance deficit of MPJ/Ibis for basic type arrays almost disappears in Sections 2 and 3, where more computation-intensive applications are benchmarked. For kernels and applications based on object arrays, MPJ/Ibis' speedups are considerably higher (see SOR and Monte Carlo). The only benchmarks of Sections 2 and 3 where mpiJava shows substantially higher speedups than MPJ/Ibis are the Sparse Matrix Multiplication and the Molecular Dynamics. In both applications allreduce is the main operation used for communication, where, as mentioned above, the correctness of MPICH/GM can be doubted. The evaluation has shown that MPJ/Ibis' performance is highly dependent on the underlying Ibis implementation. In conclusion, MPJ/Ibis' performance is competitive with that of mpiJava. Additionally, MPJ/Ibis comes with the advantage of full portability and flexibility, allowing MPJ/Ibis to run in heterogeneous networks, in contrast to mpiJava, which depends on a native MPI implementation.

Chapter 6

Related Work

As mentioned at the beginning of Chapter 4, a lot of research has been done to develop an MPI binding for Java, resulting in a variety of different implementations. MPJ/Ibis takes its place in this history, whose participants are introduced in the following. All projects are presented with a view to efficiency and portability.

JavaMPI

JavaMPI [17] is based on various functions using JNI to wrap MPI methods for Java. For that purpose a Java-to-C Interface generator (JCI) has been implemented to create a C stub function and a Java method declaration for each native method to be exported from the MPI library. The automatic wrapper creation resulted in an almost complete Java binding to MPI-1.1 [15] at low implementation cost. However, JavaMPI applications are not portable, since a native MPI implementation is always required for execution. The JavaMPI project is maintained by the University of Westminster (online at http://perun.hscs.wmin.ac.uk/JavaMPI/), but is no longer active. The last version was released in 2000.

jmpi

Jmpi [6], implemented at Baskent University in 1998, works on top of JPVM [7]. Both jmpi and JPVM are implemented entirely in Java. JPVM follows the concept of parallel virtual machines (PVM). The main difference [9] between MPI and PVM is that PVM is optimized for fault tolerance in heterogeneous networks, using a small set of standard communication techniques (e.g. TCP). That allows jmpi applications to be highly portable, but also limits communication performance dramatically. Here jmpi suffers from the poor performance of JPVM. The concepts of PVM will not be discussed further at this point. The jmpi project is no longer maintained.

MPIJ

The MPIJ [12] implementation is written in pure Java and runs as part of the Distributed Object Group Metacomputing Architecture (DOGMA) [11], using RMI for communication. If available on the running platform, MPIJ additionally uses native marshaling of primitive types instead of Java marshaling. DOGMA (online at http://sl.cs.byu.edu/dogma/) has been developed at Brigham Young University in 1998. Only a precompiled package of the current DOGMA implementation has been released, and due to almost non-existent documentation it could not be determined whether the current DOGMA implementation still contains MPIJ.

JMPP

JMPP [5] has been developed at the National Chiao-Tung University. In general this implementation is also built on top of RMI, resulting in performance disadvantages. To achieve more flexibility, an additional layer between the classes implementing the MPI methods and RMI has been implemented, called the Abstract Device Interface (ADI).
It abstracts completely from the underlying communication layer, allowing RMI to be replaced with other modules for more efficient communication. Currently ADI only supports RMI. Since the JMPP project is inactive, a more efficient implementation of ADI cannot be expected.

JMPI

Using RMI for communication, JMPI [19] has been implemented entirely in Java, with the advantage of full portability. Since RMI causes a high performance loss, an optimized RMI model called KaRMI [21] has been used for data transfer. KaRMI improves the performance of JMPI notably, but comes with a reduction in portability, since it has to be configured explicitly for each different JVM used, increasing the administration overhead for each JMPI application. JMPI has been developed at the University of Massachusetts (http://www.umass.edu/), but a release is not available.

CCJ

In contrast to the other projects, CCJ [20] (online at http://www.cs.vu.nl/ibis/ccj_download.html), implemented at Vrije Universiteit Amsterdam in 2003, follows a strict object-oriented approach and thus cannot be claimed to be a binding to the MPI specification. CCJ has also been built directly on top of RMI, with all the disadvantages that come along with it. Nevertheless, group communication is possible, where threads within the same thread group can exchange messages using collective operations like broadcast, gather and scatter. Alltoall is not supported here. To reach higher communication speeds it is also possible for CCJ applications to be compiled with Manta [14, pp. 37-68] (online at http://www.cs.vu.nl/~robn/manta/), a native Java compiler optimized for RMI, allowing remote method invocation on Myrinet-based networks. Manta is source-code compatible with Java version 1.1. While CCJ compiled with Manta works more efficiently, the use of a native compiler breaks Java's portability advantage. Both projects are no longer active.

mpiJava

MpiJava [1] (online at http://www.hpjava.org/mpiJava.html) is based on wrapping native MPI implementations such as MPICH with the Java Native Interface (JNI). The API is modeled very closely on the MPI-1.1 standard provided by the MPI Forum, but does not match the proposed MPJ [4] specification. MpiJava comes with a large set of documentation, including a complete API reference. Since it is widely used to enable message passing for Java, it has been chosen for the comparison with MPJ/Ibis (see Chapter 5). However, mpiJava comes with some notable disadvantages:

• compatibility issues with some native MPI implementations (e.g. MPICH/P4)
• reduced portability, since an existing native MPI library is needed for the target platform

This project is still active.

MPJ

In 2004 the Distributed Systems Group at the University of Portsmouth announced a message passing implementation matching the MPJ specification. This project is also called MPJ (online at http://dsg.port.ac.uk/projects/mpj/), but a release is not publicly available. MPJ implements an MPJ Device layer, which abstracts from the underlying communication model. For TCP-based networks it uses the java.nio package, and for Myrinet a wrapper class using JNI. Like MPJ/Ibis, this implementation is completely written in Java, but it currently supports only the point-to-point primitives, a small subset of the MPJ specification. However, some low-level benchmark results concerning basic type array transmission are presented on the project's website, with competitive results. The issue of efficient object serialization does not seem to be settled yet.
Since the MPJ project is at an early stage, more results can be expected in the future.

Summary

All of the Java message passing projects introduced in this chapter have shown disadvantages. Either an implementation is efficient but does not benefit from Java's portability, or it is highly portable but suffers from the poor performance of the underlying communication model. The only existing project that seems to provide both efficiency and flexibility is MPJ, which is still under development towards its first release and not publicly available at the moment.

Chapter 7

Conclusion and Outlook

7.1 Conclusion

In this thesis a new message passing platform for Java called MPJ/Ibis has been presented. The main focus was to implement an environment whose performance can compete with existing Java bindings of MPI (e.g. mpiJava), but without drawbacks in flexibility.

Chapter 2 introduced parallel architectures in general and the basic principles of message passing derived from the MPI-1.1 specification. In summary, the specification defines the following concepts:

• Point-to-point communication
• Groups of processes
• Collective communication
• Communication contexts
• Virtual topologies
• Derived datatypes

In Chapter 3 Java's drawbacks for parallel computation have been pointed out. Besides the great advantage of portability, Java also shows disadvantages; in particular, RMI is not flexible and efficient enough to meet the requirements of an efficient message passing environment. The grid programming environment Ibis addresses these drawbacks (see Sections 3.2 and 3.3) and has therefore been chosen as the basis on which to build a message passing implementation. The main advantages of Ibis, in short, are:

• Efficient serialization
• Efficient communication

Additionally, Ibis' flexibility allows any MPJ/Ibis application to run on clusters and grids without recompilation, by loading the appropriate communication module at runtime.

To put Ibis and MPI together in Chapter 4, the proposed MPJ specification has been taken as the basis, since it is the result of research within the Java Grande Forum and specifies a well-defined API for a Java binding of MPI. Chapter 4 also focuses on the main implementation details, particularly the point-to-point primitives, the collective operation algorithms and the context management.

In Chapter 5 MPJ/Ibis has been evaluated using micro benchmarks and the benchmark suite for MPJ implementations provided by the Java Grande Forum. During the evaluation on Myrinet networks, MPJ/Ibis has been compared with mpiJava (see Chapter 6). The low-level results have shown that MPJ/Ibis has great advantages over mpiJava when objects have to be serialized, while mpiJava moderately outperforms MPJ/Ibis when basic type arrays have to be communicated. For most kernels and applications used in the benchmarks, where the relative communication overhead is reduced, MPJ/Ibis and mpiJava have shown almost equivalent results. The benchmarks have shown that the flexibility provided by MPJ/Ibis does not come with considerable performance penalties. In summary, MPJ/Ibis can be considered a message passing platform for Java that combines competitive performance with portability, ranging from high-performance clusters to grids. It is the first known Java binding of MPI that provides both flexibility and efficiency.

7.2 Outlook

Showing the relevance of MPJ/Ibis for the message passing community, parts of this thesis have found their way into a publication (MPJ/Ibis: a Flexible and Efficient Message Passing Platform for Java, online at http://www.cs.vu.nl/ibis/papers/europvm2005.pdf), which has been accepted at the EuroPVM/MPI 2005 conference, Sorrento (Naples), Italy, and will be printed in Lecture Notes in Computer Science (LNCS) (B. Di Martino et al. (Eds.): EuroPVM/MPI 2005, LNCS Volume 3666, pp. 217-224, Springer Verlag Berlin Heidelberg, 2005).
Nevertheless, some tasks remain to be done for MPJ/Ibis. Implementing the virtual topologies and improving the collective operations (in particular the barrier) can be done in the near future. The issue of supporting the whole set of methods needed for derived datatypes is not settled yet. As pointed out in Section 4.5, it is not necessary to support derived datatypes in MPJ/Ibis, since Java objects support them natively. On the other hand, the existence of derived datatypes would ease the work of software developers who port existing MPI applications written in C, Fortran or other languages to MPJ/Ibis. Thus, it should be worked out how the issue of multidimensional arrays in Java, which in fact is the main handicap for derived datatypes, can be addressed. In 2001 the Ninja [18] project (Numerically Intensive Java), supported by IBM, proposed an extension to add truly multidimensional arrays to Java. Although the proposal has been withdrawn from Sun's Java Specification Request program (Java Specification Request: Multiarray package, online at http://jcp.org/en/jsr/detail?id=083) and thus will not be added to future Java specifications, the ambitions of the Ninja approach should be continued.

Since MPJ/Ibis depends on the underlying Ibis implementations, Ibis, and particularly Net.gm and Panda, should be improved in future work to provide more stability to MPJ/Ibis. Furthermore, it should be investigated how the MPJ specification (and thus MPJ/Ibis) can be extended towards MPI-2. Single-sided communication and dynamic process management are additional interesting aspects, especially for global grids, where fault tolerance becomes an issue.

Bibliography

[1] M. Baker, B. Carpenter, G. Fox, S. H. Ko, and S. Lim. mpiJava: An object-oriented Java interface to MPI. Presented at the International Workshop on Java for Parallel and Distributed Computing, IPPS/SPDP 1999, San Juan, Puerto Rico, Apr. 1999. LNCS, Springer Verlag, Heidelberg, Germany.

[2] R. Bhoedjang, T. Ruhl, R. Hofman, K. Langendoen, H. Bal, and F. Kaashoek. Panda: A portable platform to support parallel programming languages. Pages 213-226, 1993.

[3] J. M. Bull, L. A. Smith, L. Pottage, and R. Freeman. Benchmarking Java against C and Fortran for scientific applications. In Java Grande, pages 97-105, 2001.

[4] B. Carpenter, V. Getov, G. Judd, A. Skjellum, and G. Fox. MPJ: MPI-like message passing for Java. Concurrency: Practice and Experience, 12(11):1019-1038, 2000.

[5] Y.-P. Chen and W. Yang. Java message passing package - a design and implementation of MPI in Java. In Proceedings of the Sixth Workshop on Compiler Techniques for High-Performance Computing, Kaohsiung, Taiwan, Mar. 2000.

[6] K. Dincer. Ubiquitous Message Passing Interface implementation in Java: jmpi. In IPPS/SPDP, pages 203-211. IEEE Computer Society, 1999.

[7] A. Ferrari. JPVM: network parallel computing in Java. Concurrency: Practice and Experience, 10(11-13):985-992, 1998.

[8] B. Goetz. Threading lightly: Synchronization is not the enemy. Online at ftp://www6.software.ibm.com/software/developer/library/j-threads1.pdf, 2001.

[9] W. Gropp and E. L. Lusk. Why are PVM and MPI so different? In PVM/MPI, pages 3-10, 1997.

[10] W. Huber. Paralleles Rechnen. Oldenbourg, München, 1997.

[11] G. Judd, M. Clement, and Q. Snell. DOGMA: Distributed Object Group Metacomputing Architecture. Concurrency: Practice and Experience, 10:977-983, 1998.

[12] G. Judd, M. J. Clement, Q. Snell, and V. Getov. Design issues for efficient implementation of MPI in Java. In Java Grande, pages 58-65, 1999.

[13] G. Krüger. Go To Java 2. Addison Wesley, 1999.

[14] J. Maassen. Method Invocation Based Communication Models for Parallel Programming in Java. PhD thesis, Vrije Universiteit Amsterdam, The Netherlands, June 2003.

[15] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. Online at http://www.mpi-forum.org/docs/mpi-11.ps, 1995.

[16] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface. Online at http://www.mpi-forum.org/docs/mpi-20.ps, 1997.

[17] S. Mintchev and V. Getov. Towards portable message passing in Java: Binding MPI. In PVM/MPI, pages 135-142, 1997.

[18] J. E. Moreira, S. P. Midkiff, M. Gupta, P. V. Artigas, P. Wu, and G. Almasi. The NINJA project. Communications of the ACM, 44(10):102-109, 2001.

[19] S. Morin, I. Koren, and C. M. Krishna. JMPI: Implementing the Message Passing Standard in Java. In IPDPS, 2002.
Conurreny: Pratie and Experiene, 10:977983, 1998. [12℄ G. Judd, M. J. Clement, Q. Snell, and V. Getov. implementation of MPI in java. In Design issues for eient Java Grande, pages 5865, 1999. [13℄ G. Krüger. Go To Java 2. [14℄ J. Massen. Method Invoation Based Communiation Models for Parallel Pro- Addison Wesley, 1999. gramming in Java. PhD thesis, Vrije Universiteit Amsterdam, The Netherlands, June 2003. [15℄ Message Passing Interfae Forum. MPI: A Message-Passing Interfae Standard, online at http://www.mpi-forum.org/dos/mpi-11.ps edition, 1995. [16℄ Message Passing Interfae Forum. MPI2: Extensions to the Message-Passing Interfae, online at http://www.mpi-forum.org/dos/mpi-20.ps edition, 1997. [17℄ S. Minthev and V. Getov. Towards portable message passing in java: Binding MPI. In PVM/MPI, pages 135142, 1997. [18℄ J. E. Moreira, S. P. Midki, M. Gupta, P. V. Artigas, P. Wu, and G. Almasi. The NINJA projet. Communiations of the ACM, 44(10):102109, 2001. [19℄ S. Morin, I. Koren, and C. M. Krishna. Passing Standard in Java. In IPDPS, 2002. JMPI: Implementing the Message Bibliography 68 [20℄ A. Nelisse, J. Maassen, T. Kielmann, and H. E. Bal. CCJ: objet-based message passing and olletive ommuniation in Java. Conurreny and Computation: Pratie and Experiene, 15(3-5):341369, 2003. [21℄ M. Philippsen and B. Haumaher. More eient objet serialization. In IPPS/SPDP Workshops, pages 718732, 1999. [22℄ Sun Mirosystems. Java Remote Method Invoation Speiation, online at http://java.sun.om/produts/jdk/rmi edition, July 2005. [23℄ Sun Mirosystems. Objet Serialization Speiation, online http://java.sun.om/j2se/1.4.2/dos/guide/serialization/index.html at edition, July 2005. [24℄ A. S. Tanenbaum and J. Goodman. Computer Arhitektur. Pearson Studium, Münhen, 2001. [25℄ R. V. van Nieuwpoort. Eient Java-Centri Grid-Computing. PhD thesis, Vrije Universiteit Amsterdam, The Netherlands, Sept. 2003. [26℄ R. V. van Nieuwpoort, J. Maassen, R. Hofman, T. Kielmann, and H. E. Bal. Ibis: an Eient Java-based Grid Programming Environment. In Java Grande - ISCOPE 2002 Conferene, USA, November 2002. Joint ACM pages 1827, Seattle, Washington, Eidesstattlihe Erklärung Hiermit versihere ih, dass ih die Arbeit selbstständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel benutzt habe. 69