A Flexible and Efficient Message Passing Platform for Java
Eine flexible und effiziente Message Passing Plattform für Java

A Thesis presented for the Degree of Diplom-Wirtschaftsinformatiker

Markus Bornemann

written and implemented at
Faculty of Sciences
Department of Computer Science
Vrije Universiteit Amsterdam
The Netherlands

submitted to
Fachbereich 5 Wirtschaftswissenschaften
Universität Siegen
Germany

Author: Markus Bornemann
Student register: 542144
Address: Amselweg 7, 57392 Schmallenberg, Germany
Supervisor: Dr.-Ing. habil. Thilo Kielmann
Second reader: Prof. Dr. Roland Wismüller

Amsterdam, September 2005
Zusammenfassung

Contents

Zusammenfassung
1 Introduction
2 The Message Passing Interface
  2.1 Parallel Architectures
  2.2 MPI Concepts
  2.3 Point-to-Point Communication
    2.3.1 Blocking Communication
    2.3.2 Non-Blocking Communication
  2.4 Collective Communication
  2.5 Groups, Contexts and Communicators
  2.6 Virtual Topologies
3 The Grid Programming Environment Ibis
  3.1 Parallel Programming in Java
    3.1.1 Threads and Synchronization
    3.1.2 Remote Method Invocation
  3.2 Ibis Design
  3.3 Ibis Implementations
4 Design and Implementation of MPJ/Ibis
  4.1 Common Design Space and Decisions
  4.2 MPJ Specification
  4.3 MPJ on Top of Ibis
    4.3.1 Point-to-point Communication
    4.3.2 Groups and Communicators
  4.4 Collective Communication Algorithms
  4.5 Open Issues
5 Evaluation
  5.1 Evaluation Settings
  5.2 Micro Benchmarks
  5.3 Java Grande Benchmark Suite
    5.3.1 Section 1: Low-Level Benchmarks
    5.3.2 Section 2: Kernels
    5.3.3 Section 3: Applications
  5.4 Discussion
6 Related Work
7 Conclusion and Outlook
  7.1 Conclusion
  7.2 Outlook
Bibliography
Eidesstattliche Erklärung
List of Figures

2.1 Blocking Communication Demonstration
2.2 Non-Blocking Communication Demonstration
2.3 Broadcast Illustration
2.4 Reduce Illustration
2.5 Scatter/Gather Illustration
2.6 Allgather Illustration
2.7 Alltoall Illustration
2.8 Graph Topologies
2.9 Cartesian Topologies
3.1 Java thread synchronization example
3.2 RMI invocation example (after [14, 12])
3.3 Ibis design (redrawn from the Ibis 1.1 release)
3.4 Send and Receive Ports (after [25, 153])
4.1 Principal Classes of MPJ (after [4])
4.2 MPJ/Ibis design
4.3 MPJ/Ibis send protocol
4.4 MPJ/Ibis receive protocol
4.5 Scatter sending scheme
4.6 Flat tree view
4.7 Broadcast sending scheme
4.8 Binomial tree view
4.9 Ring sending scheme (only 1 step)
4.10 Recursive doubling illustration
5.1 Double array throughput in Ibis
5.2 Double array throughput in MPJ/Ibis and mpiJava
5.3 Object array throughput in Ibis
5.4 Object array throughput in MPJ/Ibis and mpiJava
5.5 Pingpong benchmark: arrays of doubles
5.6 Pingpong benchmark: arrays of objects
5.7 Barrier benchmark
5.8 Broadcast benchmark
5.9 Reduce benchmark
5.10 Scatter benchmark
5.11 Gather benchmark
5.12 Alltoall benchmark
5.13 Crypt speedups
5.14 LU Factorization speedups
5.15 Series speedups
5.16 Sparse matrix multiplication speedups
5.17 Successive over-relaxation speedups
5.18 Molecular dynamics speedups
5.19 Monte Carlo speedups
5.20 Raytracer speedups
List of Tables

2.1 Flynn's Taxonomy (after [24, 640])
2.2 MPI datatypes and corresponding C types
2.3 Predefined Reduce Operations
2.4 Group Construction Set Operations
4.1 MPJ send prototype (after [4])
4.2 MPJ basic datatypes (after [4])
4.3 Algorithms used in MPJ/Ibis to implement the collective operations
5.1 Latency benchmark results
Chapter 1

Introduction

Problems which are too complex to be solved by theory or too cost-intensive for practical approaches can only be handled using simulation models. Some of these problems, for example genetic sequence analysis or global climate models, are getting too large to be solved on a single-processor machine in reasonable time, or simply exceed the available physical limits.

Usually, parallel computers providing a homogeneous networking infrastructure had been used to address these problems, but in times of low budgets more and more existing local area networks integrate multiple workstations or PCs, making them suitable for parallel computing. In contrast to computer clusters, these machines typically consist of different hardware architectures and operating systems.

The Message Passing Interface (MPI) developed by the MPI Forum has been widely accepted as a standard for parallel computation on high-performance computer clusters. Since MPI is widely used and a lot of experience has been built up, there is a growing interest in making the MPI concepts applicable to heterogeneous systems. Unfortunately, existing MPI implementations, written mostly in C and Fortran, are limited to the hardware architecture they are implemented for. In the past, portability was of minor importance. Therefore, MPI applications need to be adapted and recompiled when changing the target platform.

During the last years the Java programming language has become very important for application developers when heterogeneous, networked environments need to be addressed. Java makes it possible to develop applications that can run on a variety of different computer architectures and operating systems without recompilation. Additionally, Java has been designed to be object-oriented, dynamic and multi-threaded.

Nowadays, the performance of hardware and connectivity has increased rapidly, allowing heterogeneous infrastructures, for instance those of small and medium-sized companies or the Internet, to be used for parallel computation. Therefore, parallel applications need to be portable and flexible, allowing them to be executed on different architectures simultaneously without the high costs of porting software explicitly. Naturally, the ability to be flexible should not come with significant performance drawbacks.

In this thesis, a message passing platform called MPJ/Ibis is presented that combines both efficiency and flexibility. Chapter 2 gives an overview of parallel architectures in general and introduces the basic concepts of MPI. Then, Chapter 3 points out Java's abilities and limitations for parallel computing and describes the grid programming environment Ibis, which addresses Java's drawbacks. The main design and implementation details of MPJ/Ibis are described in Chapter 4, followed by a presentation of various benchmark results in Chapter 5. Finally, in Chapter 6, previous projects concerning message passing for Java are introduced with a view to flexibility and efficiency.
Chapter 2

The Message Passing Interface

The Message Passing Interface (MPI) is a standardized message passing model and not a specific implementation or product. It was designed to provide access to advanced parallel hardware for end users, library writers and tool developers. MPI was standardized by the MPI Forum in 1995 (MPI-1.1) and further developed to MPI-2, which includes MPI-1.2, in the following years. The existence of a standardized interface makes it convenient for software developers to implement portable, parallel programs with the guaranteed functionality of a set of basic primitives. It defines the general requirements for a message passing implementation, presenting detailed bindings to the C and Fortran languages as well.

This introduction to MPI refers to MPI-1.1 [15] and does not claim to be a complete reference or tutorial; rather it explains the basic principles for the implementation of a message passing platform for Java. Therefore the author avoids, where possible, the presentation of MPI function bindings.

2.1 Parallel Architectures

A parallel machine consists of a number of processors, which collectively solve a given problem. In contrast to single-processor machines, e.g. workstations, parallel machines are able to execute multiple instructions simultaneously. This results in the main motivation of parallel computing: to solve problems faster.

To describe parallel computer architectures in general, Flynn's taxonomy [24, 640] characterizes four different classes of architectures. It uses the concept of streams, which essentially are sequences of items processed by a CPU. A stream consists either of instructions or of data, which will be manipulated by the instructions.
These are:

    SISD    Single instruction, single data streams
    MISD    Multiple instructions, single data streams
    SIMD    Single instruction, multiple data streams
    MIMD    Multiple instructions, multiple data streams

    Table 2.1: Flynn's Taxonomy (after [24, 640])
The first item, SISD, describes a sequential architecture, in which a single processor operates on a single instruction stream and stores data in a single memory, e.g. a von Neumann architecture. SISD does not address any parallelization in the mentioned streams. MISD is more or less a theoretical architecture, where multiple instructions operate on single data streams simultaneously. In fact, multiple instruction streams need multiple data streams to be effective; therefore no commercial machine exists with this design. Computers working on the SIMD model manipulate multiple data streams with the same set of instructions. Usually this is done in parallel, e.g. in an array processor system. In this model all processors are synchronized by a global clock to make sure that every processor is performing the same instruction in lockstep. MIMD concerns fully autonomous processors, which perform different instructions on different data. This case implies that the computation has to be done asynchronously.

Furthermore, parallel systems, i.e. SIMDs and MIMDs, differ in the way the processors are connected and thus how they communicate with each other. On the one hand, all processors may be assigned to one global memory, called shared memory. On the other hand, every processor may address its own local memory, called distributed memory. All the paradigms above refer to hardware. In addition to the above, there is a software equivalent to SIMD, called SPMD (Single program, multiple data). In contrast to SIMD, SPMD works asynchronously, where the same program runs on the processors of a MIMD system.

The message passing concept matches the SPMD paradigm, where communication takes place by exchanging messages. In the message passing world, programs consist of separate processes. Each process addresses its own memory space, which is managed by the user, as is the data distribution among the processes.
2.2 MPI Concepts

MPI is based on a static process model, which means that all processes within an existing MPI runtime environment enter and exit this environment simultaneously. Inside the running environment all the processes are organized in groups. The communication occurs by accessing communicators, group-defined communication channels. Furthermore, the processes inside a group are numbered from 0 to n-1, where n is the total number of processes within the group. Those numbers are called ranks. For all members of the group, the rank number of a certain member is the same. That allows a global view of the group members.

Message Data    Assuming a process is about to send data, it has to specify what data is going to be sent. The location of this data is called the send buffer. On the other side, when data has to be received, this location is called the receive buffer. Messages may consist of contiguous data, e.g. an array of integer values. In order to avoid memory-to-memory copying, e.g. if just a smaller part of an array inside the send buffer should be transferred, the number of elements has to be specified for a message, so that the needed elements may be used directly.

Since MPI could be implemented as a library, which may be used precompiled, it cannot be assumed that a communication call has information about the datatype of variables in the communication buffer. Hence, MPI defines its own datatypes, which are then attached to the message. A reduced list for the C language binding is presented in Table 2.2.

A clearer reason to attach datatypes explicitly to messages is shown by the case where non-contiguous data will be submitted, e.g. a column of a two-dimensional array. For that, the user has to construct a type map, which specifies the needed elements, combined with an MPI datatype. This results in a derived datatype based on MPI datatypes.
    MPI Datatypes        C Datatypes
    MPI_CHAR             signed char
    MPI_SHORT            signed short int
    MPI_INT              signed int
    MPI_LONG             signed long int
    MPI_UNSIGNED_CHAR    unsigned char
    ...                  ...

    Table 2.2: MPI datatypes and corresponding C types
Message Envelopes    To realize communication in general, MPI uses so-called message envelopes. Envelopes are meant to describe the necessary details of a message submission. These are:

• Sender
• Receiver
• Tag
• Communicator

The sender and receiver items just hold information about the ranks of the communication partners, whereas the group used is identified by the communicator. The tag is a freely interpretable positive integer value. It may be used to distinguish different message types inside the communicator.

2.3 Point-to-Point Communication

Communication between exactly two processes is called point-to-point communication in MPI. In this form of communication both the sender and receiver are explicitly identified by the underlying communicator and therefore by the specific ranks inside the process group. MPI defines two possibilities to achieve point-to-point communication, blocking and non-blocking, which will be explained in Sections 2.3.1 and 2.3.2. Besides that, there are three different communication modes.

First, in ready send mode the message will be sent as soon as the receiver side has called the matching receive function. It has to be called before a sender process is able to send a message; otherwise this mode results in an error. Second, in synchronous send mode, the sending process first requests a matching receive function on the receiver side, and then, if a matching receive has been posted, sends the message. Third, it is possible to buffer the message before it will be sent. This mode is called buffered send, where the message is first copied into a user-defined system buffer. The real communication then may be separated from the sending process and moved to the runtime environment. On the receiver side these communication modes appear fully transparent. The receiver process does not have any influence on the communication mode used.
2.3.1 Blocking Communication

Blocking communication function calls return when the operation assigned to the function has finished. The call blocks the caller until the involved buffer may be reused.
Figure 2.1: Blocking Communication Demonstration
Figure 2.1 demonstrates a blocking send and the associated receive process. The sending process calls send and then returns when the message has left the buffer completely. A finished send call does not imply that the message has arrived at its destination entirely, whereas the receiver blocks until the message has arrived completely.
2.3.2 Non-Blocking Communication

In contrast to blocking communication, the operations of non-blocking point-to-point communication return immediately. That allows a process to do further computation during message transfer. On the other hand, it is not possible to use the involved buffers on the sender and receiver side until the point-to-point transfer has finished. To figure out when a communication is done, MPI provides wait and test functions, allowing a process to check whether a message transfer has completed (see Figure 2.2).
Figure 2.2: Non-Blocking Communication Demonstration
2.4 Collective Communication

On top of the point-to-point communication, MPI provides functions to communicate within groups of processes, called collective communication. Collective operations are executed by all members of the group. These operations are responsible for synchronization, broadcasting, gathering, scattering and reducing of information between the groups of processes. Some operations need a special process to send data to, or collect data from, the other processes, which is called the root process. A collective operation always blocks the caller until all processes have called it. In the following, the most important collective operations will be introduced.

Broadcast    Figure 2.3 illustrates a broadcast among six processes. The left table shows the initial state of the buffers of all processes, one row per process. The first process is the root, which sends its input buffer to all processes including itself. In the end all processes hold a copy of the root's input buffer, as shown on the right side of Figure 2.3.
Figure 2.3: Broadcast Illustration
Reduce    The reduce operation combines all items of the send buffers of the processes using an associative mathematical operation. The result will then be sent to the root process. Figure 2.4 demonstrates reduce, where the result appears in R0. MPI defines a set of standard operations for that purpose, which are listed in Table 2.3.

Figure 2.4: Reduce Illustration
    Name          Meaning
    MPI_MAX       maximum value
    MPI_MIN       minimum value
    MPI_SUM       sum
    MPI_PROD      product
    MPI_LAND      logical and
    MPI_BAND      bit-wise and
    MPI_LOR       logical or
    MPI_BOR       bit-wise or
    MPI_LXOR      logical xor
    MPI_BXOR      bit-wise xor
    MPI_MAXLOC    max value and location
    MPI_MINLOC    min value and location

    Table 2.3: Predefined Reduce Operations
Of course there are restrictions on which datatypes are accepted by the different operations. For example, it does not make sense to calculate a bit-wise or on a set of floating point values. All predefined reduce operations are commutative. In addition to that, MPI allows the creation of user-defined reduce operations, which may or may not be commutative. Therefore an MPI implementation should take care about the way the reduce is performed in order to respect non-commutative behaviour. An extension to reduce is the allreduce operation, where all processes receive the result instead of just the root.
Scatter/Gather    In scatter mode, the send buffer of the root will be split into n equal parts, where n is the number of processes inside the group. Each process then receives a distinct part, which is determined by the rank number of the receiving process.

Figure 2.5: Scatter/Gather Illustration
Gather is the inverse operation of scatter. All processes send their send buffers to the root, where all the incoming messages are stored in rank order into the root's receive buffer.
Allgather    An extension to gather is allgather. As shown in Figure 2.6, instead of just the root, all processes of the group receive the gathered data.

Figure 2.6: Allgather Illustration
Alltoall    The last extension is alltoall. In alltoall each process sends distinct data to each other process. This means that the jth item sent from process i is received by process j in the ith place of the receive buffer, where i and j are rank numbers of group processes.

Figure 2.7: Alltoall Illustration
Other Functions    In addition to the collective operations mentioned above, MPI provides more functions. It is beyond the objective of this thesis to explain all of them at this point. Only two further collective operations will be described in the following.

MPI allows a group of processes to synchronize with each other. To achieve synchronization, each process has to call the collective function barrier. When a process calls the barrier, it blocks until all the other processes have called the barrier as well.

A variant of allreduce is the operation scan. It performs a prefix reduction, where process i receives the reduction of the items from the processes 0, ..., i. For example, a scan with the sum operation over the values 1, 2, 3 and 4 held by processes 0 to 3 leaves 1, 3, 6 and 10 on processes 0 to 3, respectively.
2.5 Groups, Contexts and Communicators

Initially MPI creates a group in which all involved processes of the runtime environment are listed. Beyond that, it provides the ability to generate other process groups, in order to help the programmer achieve a better structure of the source code. This is important, e.g. for library developers, to focus collective operations on certain processes and to avoid synchronizing or running unrelated code on uninvolved processes.

The rank of a process is always bound to a certain group, which means that a process that is part of several different groups may have different ranks inside those groups. Therefore it is necessary to obtain knowledge about a group before communication can take place. This is done via the communicators. Each group owns at least one communicator, with which the group members are able to deliver messages to each other. Each communicator is assigned to strictly one group, and the relationship between groups and communicators creates the context for the processes.

Groups    Each process belongs to at least one group, namely the initially created group represented by the communicator MPI_COMM_WORLD, which contains all processes. MPI itself does not provide a function for the explicit construction of a group; instead, a new group is produced by using reduction and mathematical set operations on existing groups, which result in a new group including the specified or calculated processes. The set operations are listed in Table 2.4, where it is assumed that the operations are executed on two different groups as arguments, called group1 and group2.
    Operation       Meaning
    union           all processes of group1, followed by all processes of group2 not in group1
    intersection    all processes of group1 that are also in group2, ordered as in group1
    difference      all processes of group1 that are not in group2, ordered as in group1

    Table 2.4: Group Construction Set Operations
As mentioned above, it is also possible to reduce an existing group using the so-called functions incl and excl. In both, the user has to specify certain ranks or a range of ranks, which will be included in or excluded from the existing group, respectively.
Intra-Communicators    Communicators always occur in relationship with groups. Since processes just send and receive messages over communicators, and because communicators represent the channels within groups, the set of communicators assigned to a process forms the closure of the system's capability of communication. Each communicator is assigned to exactly one group and each group belongs to at least one communicator. For example, two processes, each part of a different group, cannot exchange messages directly. In order to make this possible, a new group must be created with a new communicator attached. Therefore MPI provides functions to construct communicators from existing groups. Communicators concerning communication within groups are called intra-communicators.

Inter-Communicators    As shown above, creating new groups and communicators is quite inconvenient just for the purpose that processes from different groups want to exchange messages. For this special case MPI specifies inter-communicators. Those communicators are constructed by specifying two intra-communicators, between which the inter-communication takes place.
2.6 Virtual Topologies

The previously introduced intra-communicators address a linear name space, where the processes are numbered from 0 to n-1 (see Section 2.2). In some cases such a numbering does not reflect the logical communication structure. Depending on the underlying algorithm, structures like hypercubes, n-dimensional meshes, rings or common graphs appear. These logical structures are called the virtual topology in MPI, which allows an additional arrangement of the group members. Since the communication operations identify source and destination by ranks, a virtual topology calculates, for instance, the original rank of the neighbour of a certain process or of a process with specified coordinates. A virtual topology is an optional attribute of intra-communicators and is created by corresponding MPI functions, while inter-communicators are not allowed to use virtual topologies.

Graph Topologies    Each topology may be represented by a graph, in which the nodes represent the processes and the edges represent the connections between them. A missing edge between two nodes does not mean that the two processes are unable to communicate with each other; rather, the virtual topology simply neglects it. The edges are not weighted: either there is a connection between two nodes or there is not. In Figure 2.8 two examples, a binary tree with seven processes and a ring with eight processes, are shown.
Figure 2.8: Graph Topologies

Cartesian Topologies    Cartesian structures are simpler to specify than common graphs because of their regularity. Even though a cartesian raster may be described by a graph as well, MPI provides special functions to create those structures in order to give more convenience to the user. Figure 2.9 shows examples of a 3-dimensional mesh with eight processes and a 2-dimensional mesh with nine processes.

Figure 2.9: Cartesian Topologies
Chapter 3

The Grid Programming Environment Ibis

With the concept "Write once, run everywhere", Java provides a solution for implementing highly portable programs. For this purpose, Java source code is not compiled to native executables, but to a platform-independent representation called bytecode. At runtime the bytecode is interpreted or compiled just in time by a Java Virtual Machine (JVM). This property made Java attractive, especially for grid computing, where many heterogeneous platforms are used and where portability becomes an issue with compiled languages. It has been shown that Java's execution speed is competitive with other languages like C or Fortran [3].

In the following, Java's abilities related to parallel programming will be pointed out, and then the grid programming environment Ibis and its enhancements will be introduced.
3.1 Parallel Programming in Java

In order to achieve parallel programming in general, Java offers two different models out of the box. These are:

1. Multithreading for shared memory architectures and
2. Remote Method Invocation (RMI) for distributed memory architectures.

The RMI model enables almost transparent communication between different JVMs and was at first the main point of interest for research on high-performance computation in Java.
3.1.1 Threads and Synchronization

Concurrency has been integrated directly into Java's language specification using threads [13, 221-250]. A thread is a program fragment that can run simultaneously with other threads, similar to processes. While a process is responsible for the execution of a whole program, multiple threads can run inside that process. The main difference between threads and processes is that threads share the same memory address space, while the address spaces of processes are strictly separated.

To create a new thread, an object of a class which extends java.lang.Thread has to be instantiated. Since Java does not support multiple inheritance, there is also an interface called java.lang.Runnable, which allows a class that is already derived from another class to behave as a thread. Such an object is passed to the constructor of a separately created java.lang.Thread object. In both cases a method called run() has to be implemented, which is the entry point when invoking a thread. Invocation is done by calling the method start(), as sketched below.
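The following minimal example (not taken from the thesis; class names are illustrative) shows both ways of creating a thread, by extending java.lang.Thread and by implementing java.lang.Runnable:

    // Minimal illustration of the two ways to create threads in Java.
    class Worker extends Thread {
        public void run() {                     // entry point of the thread
            System.out.println("worker running");
        }
    }

    class Task implements Runnable {
        public void run() {
            System.out.println("task running");
        }
    }

    public class ThreadDemo {
        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Worker();           // subclassing Thread
            Thread t2 = new Thread(new Task()); // wrapping a Runnable
            t1.start();                         // start() invokes run() concurrently
            t2.start();
            t1.join();                          // wait for both threads to finish
            t2.join();
        }
    }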
Because threads run in the same address space and thus share the same variables and objects, Java provides the concept of monitors to prevent data from being overwritten by other threads. All objects of the class java.lang.Object, and therefore all objects of classes derived from java.lang.Object, in contrast to primitive types, contain a so-called lock. A lock is assigned to exactly one thread at a time, while other threads have to wait until the lock has been released. With that mutually exclusive lock (short: mutex lock) it is possible to coordinate access to objects and thus avoid conflicts. Java does not allow direct access to locks. Instead the keyword synchronized exists to label critical regions. Using synchronized it is possible to protect a whole method or a fragment within a method. When applying synchronized to a whole method, the this pointer's lock is used; otherwise an object lock is used by passing the object reference to it.
In addition to monitors, Java supports conditional synchronization, providing the methods wait() and notify() (or notifyAll()). Both methods may only be called when the calling thread is the owner of the object's lock. A call to wait() causes the thread to wait until another thread invokes notify(). A more specific method is join(), which blocks the current thread until the associated thread has finished entirely. Figure 3.1 demonstrates the wait and notify model by showing two threads that are synchronized on an object. Thread 1 sets a value on the object and waits until Thread 2 notifies it when the value has been collected. After notification, Thread 2 sets another value on the object.
Figure 3.1: Java thread synchronization example
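A minimal, self-contained sketch of this wait/notify pattern (illustrative code, not from the thesis): one thread deposits a value into a shared object and blocks, the other collects it and notifies.

    // Illustrative wait/notify example: a one-slot mailbox shared by two threads.
    class Mailbox {
        private int value;
        private boolean full = false;

        synchronized void put(int v) throws InterruptedException {
            while (full)              // wait until the previous value has been taken
                wait();
            value = v;
            full = true;
            notify();                 // wake up a waiting consumer
        }

        synchronized int get() throws InterruptedException {
            while (!full)             // wait until a value is available
                wait();
            full = false;
            notify();                 // wake up a waiting producer
            return value;
        }
    }

Calling wait() inside a loop guards against spurious wake-ups; both methods hold the object's lock, as required for wait() and notify().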
3.1.2 Remote Method Invocation

An interesting way to address distributed architectures is the Remote Method Invocation API [22] provided by Java since JDK 1.1. RMI extends normal Java programs with the ability to share objects over multiple JVMs. It does not matter where the JVMs are located, and therefore RMI programs work in heterogeneous environments. RMI in principle is based on a client/server model, where a remote object, which should be accessible from other JVMs, is located on the server side. On the client side, a remote reference to that object appears, which allows the client to invoke methods of the remote object. Internally, remote objects can be registered in the RMI registry. Figure 3.2 shows two JVMs demonstrating the method invocation.
Figure 3.2: RMI invocation example (after [14, 12])
When looking up a remote object, the client gets a stub, which implements the accessors to the remote object and acts like a usual Java object. The counterpart on the server side is called the skeleton. Both are generated by the RMI compiler. When a remote method is invoked, the stub marshalls the invocation, including the method's arguments, and sends it to the skeleton, where it is unmarshalled and forwarded to the remote object. The result from the remote object is submitted in the same manner. A minimal sketch of this model follows.
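The following sketch shows the standard java.rmi pattern of declaring a remote interface, exporting an implementation and invoking it through a stub obtained from the registry. The interface and class names are invented for this example, and newer Java versions create the stub dynamically via UnicastRemoteObject.exportObject, whereas the text above describes the older rmic-generated stub/skeleton pair.

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.registry.Registry;
    import java.rmi.server.UnicastRemoteObject;

    // Remote interface: every method must declare RemoteException.
    interface Echo extends Remote {
        String echo(String message) throws RemoteException;
    }

    // Server-side implementation of the remote object.
    class EchoImpl implements Echo {
        public String echo(String message) { return "echo: " + message; }
    }

    public class EchoDemo {
        public static void main(String[] args) throws Exception {
            // Server side: export the remote object and register it under a name.
            Echo stub = (Echo) UnicastRemoteObject.exportObject(new EchoImpl(), 0);
            Registry registry = LocateRegistry.createRegistry(1099);
            registry.rebind("echo", stub);

            // Client side (possibly another JVM): look up the stub and invoke it.
            Echo remote = (Echo) LocateRegistry.getRegistry("localhost", 1099).lookup("echo");
            System.out.println(remote.echo("hello"));  // call travels via the stub
        }
    }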
The communication between stub and skeleton, which always runs over TCP/IP, is synchronous. Therefore a stub always has to wait until the invocation on the remote object has finished. The great advantage of RMI is the fact that it abstracts completely from low-level socket programming. That makes it more convenient for software developers to distribute applications in Java. On the other hand, RMI shows dramatic performance bottlenecks, since it uses Java's object serialization [23] and reflection mechanism for data marshalling. It has been evaluated [14, 11-36] that RMI's communication overhead can result in high latencies and low throughput, which in the end does not yield significant speedups for parallel applications; in fact some applications can become slower [14, 34].
3.2 Ibis Design

As shown above, Java natively does not provide solutions to achieve portable and efficient communication for distributed memory architectures. In the following, the grid programming environment Ibis [26] (online at http://www.cs.vu.nl/ibis/) will be introduced, which addresses portability, efficiency and flexibility. Ibis has been implemented in pure Java. However, in special cases some native libraries using the Java Native Interface can be used to improve performance.
Figure 3.3: Ibis design (redrawn from the Ibis 1.1 release)
The main part of Ibis is the Ibis Portability Layer (IPL). It provides several simple interfaces, which are implemented by the lower layers (TCP, UDP, GM, ...). These implementations can be selected and loaded by the application at run time. For that purpose Ibis uses Java's dynamic class loader. With that, applications can run simultaneously on a variety of different machines, using optimized and specialized software where possible (e.g. Myrinet) or standard software (e.g. TCP) when necessary. Ibis applications can be deployed on machines ranging from clusters with local, high-performance networks like Myrinet or Infiniband, to grid platforms in which several remote machines communicate across the Internet. RMI (see 3.1.2) does not support these features. Although it is possible, Ibis applications will typically not be implemented directly on top of the IPL. Instead they use one of the existing programming models. One of these models is a reimplementation of RMI, which allows a comparison between Ibis and the original RMI [25, 169-171] and shows Ibis' advantages. The other models will not be discussed here. Figure 3.3 shows the layered structure of the current Ibis release (Version 1.1).
Send and receive ports    To enable communication, the IPL defines send and receive ports (Figure 3.4), which provide a unidirectional message channel. These ports must be connected to each other (connection-oriented scheme). The communication starts by requesting a new message object from the send port, into which data items of any type, even objects, of any size can be inserted. After insertion, the message is submitted by invoking send().
    // send port side                      // receive port side
    m = sendPort.getMessage();             m = receivePort.receive();
    m.writeInt(3);                         i = m.readInt();
    m.writeArray(a);                       m.readArray(a);
    m.writeArray(b, 0, 100);               m.readArray(b, 0, 100);
    m.writeObject(o);                      o = m.readObject();
    m.send();
    m.finish();                            m.finish();

    Figure 3.4: Send and Receive Ports (after [25, 153])
Each receive port may be configured in two different ways. First, messages can be received explicitly by calling the receive() primitive. This method is blocking and returns a message object from which the sent data can be extracted by the provided set of read methods (see Figure 3.4). Second, Ibis achieves implicit receipt with the ability to configure receive ports to generate upcalls. If an upcall takes place, a message object is passed in. These are the only communication primitives that the IPL provides. All other patterns can be built on top of them.
3.3 Ibis Implementations

Besides flexibility, Ibis provides two important enhancements: efficient serialization and efficient communication [25, 165-169]. The message passing implementation, which will be introduced in Chapter 4, takes advantage of both.

Efficient serialization    As shown in Section 3.1.2, Java's object serialization is a performance bottleneck. Ibis circumvents it by implementing its own serialization mechanism that is fully source compatible with the original. In general, Ibis serialization achieves performance advantages in three steps:

• Avoiding run time type inspection
• Optimizing object creation
• Avoiding data copying

This has been done by implementing a bytecode rewriter, which adds a specialized generator class to all objects implementing the serializable interface and takes over the standard serialization. Evaluations [25, 164] have pointed out that Ibis serialization outperforms the standard Java serialization by a large margin, particularly in those cases where objects are being serialized.
Efficient communication    The TCP/IP Ibis implementation uses one socket per unidirectional channel between a single send and receive port, which is kept open between individual messages. The TCP implementation of Ibis is written in pure Java, which allows compiling an Ibis application on a workstation and deploying it directly on a grid. To speed up wide-area communication, Ibis can transparently use multiple TCP streams in parallel for a single port. Finally, it can communicate through firewalls, even without explicitly opened ports.

There are two Myrinet implementations of the IPL, built on top of the native GM (Glenn's messages) library and the Panda [2] library. Ibis offers highly efficient object serialization that first serializes objects into a set of arrays of primitive types. For each send operation, the arrays to be sent are handed as a message fragment to GM, which sends the data out without copying. On the receiving side, the typed fields are received into pre-allocated buffers; no other copies need to be made.
Myrinet online at http://www.myri.com/
Myrinet GM driver online at http://www.myri.com/scs/
Chapter 4

Design and Implementation of MPJ/Ibis

The driving force in high performance computing for Java was the Java Grande Forum (JGF, online at http://www.javagrande.org), active from 1999 to 2003. Before Java Grande, many MPI-like environments in Java were created without declaring any standards. Thus, a working group from the JGF proposed an MPI-like standard description, with the goal of finding a consensus API for future message-passing environments in Java [4]. To avoid naming conflicts with MPI, the proposal is called Message Passing Interface for Java (MPJ).
4.1 Common Design Space and Decisions

Since MPI is the de facto standard for message passing platforms and MPJ is mainly derived from it, the decision to create a message passing implementation matching the MPJ specification is obvious. The MPJ implementation should be flexible and efficient. Portability is guaranteed by Java, but flexibility and efficiency are not. RMI and Java sockets usually use TCP for network communication. That makes them not flexible enough for high performance computing, for example on a Myrinet computer cluster. RMI, as shown in Section 3.1.2, is not efficient enough to satisfy the requirements of a message passing platform. However, Ibis provides solutions to both efficiency and flexibility, and thus offers an excellent foundation to build an MPJ implementation on top of it. In the following this implementation is called MPJ/Ibis.

In contrast to MPI, MPJ does not address the issue of thread safety explicitly. Synchronizing multiple threads in asynchronous communication models - message passing models express it through the non-blocking communication paradigm - is not trivial. Since synchronization is still an expensive operation [8], even in cases where it is implemented but not used, MPJ/Ibis avoids the overhead of being thread safe. Multithreaded applications on top of MPJ must synchronize the entry points to MPJ/Ibis, meaning only one thread is allowed to call MPJ primitives at a time. Thus, it is ensured that the requirement of getting the most efficient result is satisfied in this respect.
4.2 MPJ Specification

Figure 4.1: Principal Classes of MPJ (after [4])
MPJ is the result of adapting the C++ MPI bindings specified in MPI-2 [16] to Java. The class specifications are directly built on the MPI infrastructure provided by MPI-1.1 [15]. It has been announced that extensions of MPI-2, like dynamic process management, will be added in later work, but that has not been done yet. To stay conformant to the specification, these extensions are unsupported by MPJ/Ibis as well. Figure 4.1 shows the most important classes of MPJ, which will be briefly introduced in the following.

All MPJ classes are organized in the package mpj. The class MPJ is responsible for the initialisation of the whole environment, like global constants, peer connections and the default communicator COMM_WORLD. Thus, all members of MPJ are defined static. As explained in Section 2.5, processes of one group, represented by the class Group, communicate via the communicators. All communication elements are instances of class Comm or its subclasses. For example, COMM_WORLD is an instance of class Intracomm.

The point-to-point communication operations, such as send, recv, isend and irecv, are in the Comm class. For example, the method prototype of send looks like this:
    void Comm.send(Object buf, int offset, int count, Datatype datatype,
                   int dest, int tag) throws MPJException

    buf         send buffer array
    offset      initial offset in send buffer
    count       number of items to send
    datatype    data type of each item in send buffer
    dest        rank of destination
    tag         message tag

    Table 4.1: MPJ send prototype (after [4])
The message to be sent must consist either of an array of primitive types or of an array of objects, which has to be passed as the argument called buf. Objects need to implement the java.io.Serializable interface. The offset indicates the beginning of the message (see Table 4.1). A usage sketch is given below.
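The following is a usage sketch only; it assumes MPJ-style start and stop calls (MPJ.init and MPJ.finish) and the lowercase method names of the MPJ draft, which may differ in detail from the actual MPJ/Ibis entry points.

    import mpj.*;

    // Hypothetical two-process example: rank 0 sends an array of doubles to rank 1.
    public class SendExample {
        public static void main(String[] args) throws MPJException {
            MPJ.init(args);                     // assumed startup call
            int rank = MPJ.COMM_WORLD.rank();
            double[] buf = new double[1024];

            if (rank == 0) {
                // send the whole buffer to rank 1 with tag 42
                MPJ.COMM_WORLD.send(buf, 0, buf.length, MPJ.DOUBLE, 1, 42);
            } else if (rank == 1) {
                // blocking receive; the returned Status object is ignored here
                MPJ.COMM_WORLD.recv(buf, 0, buf.length, MPJ.DOUBLE, 0, 42);
            }
            MPJ.finish();                       // assumed shutdown call
        }
    }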
The datatype argument is necessary to support derived datatypes in analogy to MPI (see Section 2.2). These datatypes are instances of the class Datatype and must match the base type used in buf. MPJ specifies the basic datatypes; Table 4.2 shows a list of the most important predefined types.
    MPJ.BYTE    MPJ.CHAR     MPJ.BOOLEAN    MPJ.SHORT     MPJ.INT
    MPJ.LONG    MPJ.FLOAT    MPJ.DOUBLE     MPJ.OBJECT    ...

    Table 4.2: MPJ basic datatypes (after [4])
As shown in the example above, the send operation does not have a return value, since it is a blocking operation. In contrast to MPI, all communication operations use Java's exception handling to report errors. Additionally, non-blocking operations, such as isend or irecv, always return a Request object. Requests represent ongoing message transfers and provide several methods to obtain knowledge about the current state. Once a message is completed, those methods return a Status object, which contains detailed information about the message transfer. Prequest, an extension to Request, is the result of a prepared persistent message transfer. Persistent communication is organized in class Comm and will not be discussed here, because it is of minor importance, even though it has been implemented in MPJ/Ibis. The collective communication operations are part of the class Intracomm, which extends Comm, so that it is possible for them to use the point-to-point primitives.
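A sketch of the non-blocking variant under the same assumptions; waitFor stands in for the Request wait method, whose exact name is not shown in this excerpt.

    // Hypothetical non-blocking exchange between two ranks (fragment, assumed
    // to run between MPJ.init and MPJ.finish).
    double[] out = new double[1024];
    double[] in  = new double[1024];
    int peer = (MPJ.COMM_WORLD.rank() == 0) ? 1 : 0;

    Request recvReq = MPJ.COMM_WORLD.irecv(in, 0, in.length, MPJ.DOUBLE, peer, 7);
    Request sendReq = MPJ.COMM_WORLD.isend(out, 0, out.length, MPJ.DOUBLE, peer, 7);

    // ... useful computation can overlap with the transfers here ...

    Status status = recvReq.waitFor();  // hypothetical name for the blocking wait
    sendReq.waitFor();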
4.3 MPJ on Top of Ibis

MPJ/Ibis has been divided into three layers and was built directly on top of the IPL (see Figure 4.2). The Ibis Communication Layer provides the low-level communication operations. The Base Communication Layer takes care of the basic send and receive operations specified by MPJ. It includes the blocking and non-blocking operations and the various test and wait statements.

Figure 4.2: MPJ/Ibis design

The Collective Communication Layer implements the collective operations on top of the Base Communication Layer. It is also responsible for group and communicator management. An MPJ/Ibis application is able to access both the Base and the Collective Communication Layer.

4.3.1 Point-to-point Communication
Each participating process is connected explicitly to each other process using the IPL's send and receive ports. Ibis offers two mechanisms for data reception, upcalls and downcalls (see Section 3.2). Since upcalls always require an additional running thread collecting the messages, the performance of MPJ/Ibis would be affected negatively in all cases by switching between running threads. Threads cannot be avoided completely, but their number should be reduced to a minimum. Therefore, MPJ/Ibis has been designed to use downcalls. A pair of a send and a receive port to one communication partner is summarized in a Connection object, and all existing Connections are collected within a lookup table called ConnectionTable.

The IPL does not provide primitives for non-blocking communication. To support blocking and non-blocking communication, the Ibis Communication Layer has been designed to be thread safe. That allows non-blocking communication on top of the blocking communication primitives using Java threads. Multithreading cannot be avoided in this case. Furthermore, Ibis only provides communication in eager send mode, while handshaking is not supported. While MPI's and MPJ's ready and synchronous send modes require some kind of handshaking in order to achieve a better buffer organization for short and large messages, the IPL implementations take care of that automatically. Therefore MPJ/Ibis does not differentiate between ready and synchronous send modes, but buffered send is supported.
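The layering of non-blocking operations on blocking ones with Java threads, mentioned above, can be illustrated with a simplified sketch; this is not the actual MPJ/Ibis class structure.

    // Simplified sketch: a non-blocking operation wraps a blocking transfer in a
    // thread; waiting for the request then amounts to joining that thread.
    class SimpleRequest {
        private final Thread worker;

        SimpleRequest(Runnable blockingTransfer) {
            worker = new Thread(blockingTransfer);
            worker.start();              // the transfer proceeds in the background
        }

        void waitForCompletion() throws InterruptedException {
            worker.join();               // blocks until the transfer has finished
        }

        boolean test() {
            return !worker.isAlive();    // non-blocking completion check
        }
    }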
Message Send    Internally, messages are represented by the MPJObject class, which consists of a header and a data part. The header is a one-dimensional integer array containing the message envelope values. Sending a message has been implemented in a straightforward way, as shown in Figure 4.3. Once one of the send primitives of the Base Communication Layer has been called, the assigned thread checks whether another send operation is in progress. In that case, it has to wait until the previous operation has finished. Waiting, lock acquisition and lock release are left to Java's synchronization mechanisms. Instead of writing the MPJObject directly to the send port, which would cause unnecessary object serialization, the send operation has been divided into three steps. First, the header is written explicitly. Second, it is determined whether the message data has to be serialized into a system buffer before it can be sent. Third, depending on step two, the message data or the system buffer is written to the send port.
Message Receive    Since non-blocking communication allows the order of calls to receive primitives to vary, it is necessary to add a queue for every receive port, where unexpected messages are collected. Usually, receiving messages into a queue is done by using a permanent receiver thread, which is connected to the receive port and manages the queue filling automatically. In that design, one thread would be needed for each receive port, resulting in a slowdown of the whole system's performance due to switching between multiple threads, depending on the number of participating processes. Another disadvantage of such a concept is that zero-copying is not possible, since all messages have to be copied out of the queue.
Figure 4.3: MPJ/Ibis send protocol
Figure 4.4: MPJ/Ibis receive protocol
To avoid continuous thread switching and to allow zero-copying, the receive protocol has been designed in a way that allows a receiving primitive to connect explicitly to the receive port. The receive protocol is shown in Figure 4.4. After obtaining the lock, a receiver thread checks whether the queue contains the requested message. If the message has been found, it is copied or deserialized out of the queue, depending on whether the message was buffered or not. If the message was not found inside the queue, the receiver connects to the receive port and reads the incoming message header to determine whether the incoming message is targeted at this receiver. If not, the whole message including the header is inserted into an MPJObject and moved into the queue, where it waits for the matching receiver. To give other receiving threads the chance to check the queue, the lock is temporarily released. If the message header from the receive port matches the posted receive, a non-buffered message is received directly into the receive buffer, while a buffered message has to be deserialized into the receive buffer.
4.3.2 Groups and Communicators

As mentioned in Section 2.5, each group of processes has a communicator that is responsible for message transfers within the group. Since communicators can share the same send and receive ports, it is mandatory to prevent mixing up messages of different communicators. The tag value, which is used by the user to distinguish different messages, is not useful for this purpose. Therefore, on creation each communicator gets a unique contextId, which allows the system to handle communication of different communicators on the same ports at the same time. When sending a message, the message header is extended by adding the contextId of the communicator used. Messages are identified by both the tag value and the contextId.

Creating a new communicator is always a collective operation. Each process holds the value of the highest contextId that is used locally. When a new communicator is going to be created, each process creates a new temporary contextId by increasing the highest one by one. To ensure that all processes use the same contextId for the new communicator, the temporary contextId is allreduced to the maximum. After that, the new communicator is created and the local system is informed that the new contextId is the highest.
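The agreement step can be pictured roughly as follows. This is a sketch with invented helper names (oldComm, group, highestLocalContextId and the Intracomm constructor are hypothetical); the allreduce signature follows the MPJ draft.

    // Sketch of collective communicator creation: agree on a unique contextId
    // by taking the group-wide maximum of each process's proposal.
    int[] proposal = new int[] { highestLocalContextId + 1 };
    int[] agreed   = new int[1];

    oldComm.allreduce(proposal, 0, agreed, 0, 1, MPJ.INT, MPJ.MAX);

    Intracomm newComm = new Intracomm(group, agreed[0]);  // hypothetical constructor
    highestLocalContextId = agreed[0];                    // remember the new maximum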
4.4 Collective Communication Algorithms

The collective operations of class Intracomm have been implemented on top of the point-to-point primitives. Since the naive approach of sending messages directly may result in high latencies, those operations should use specialized algorithms. For example, letting the root in a broadcast operation send a message to each node explicitly is inefficient, because the last receiver has to wait until the message has been sent to all the other processes. Current MPI implementations contain a vast number of different algorithms realizing the collective operations, and research on their optimization is not finished yet.
    Collective Operation   Algorithm                             Upper Complexity Border
    allgather              double ring                           O(n)
    allgatherv             single ring                           O(n)
    allreduce              recursive doubling                    O((log n) + 2)
    alltoall               flat tree                             O(n²)
    alltoallv              flat tree                             O(n²)
    barrier                flat tree                             O(2n)
    broadcast              binomial tree                         O(log n)
    gather                 flat tree                             O(n)
    gatherv                flat tree                             O(n)
    reduce                 commutative op: binomial tree         O(log n)
                           non-commutative op: flat tree         O(n)
    reduceScatter          phase 1: reduce, phase 2: scatterv    O((log n) + n),
                                                                 non-commutative op: O(2n)
    scan                   flat tree                             O(n)
    scatter                flat tree                             O(n)
    scatterv               flat tree                             O(n)

    Table 4.3: Algorithms used in MPJ/Ibis to implement the collective operations
MPJ/Ibis provides a basic set of collective algorithms, which may be extended and further optimized in future work. At least one algorithm for each operation has been implemented. Table 4.3 shows the algorithms used for all collective operations, including their upper complexity bounds in O-notation, where n is the number of processes involved. In accordance with the MPI specification, four collective operations have been extended to achieve more flexibility. The extended operations are called allgatherv, alltoallv, gatherv and scatterv. They allow the item sizes and buffer displacements to be varied explicitly for each process. In the following, the algorithms used will be introduced by example.
Flat Tree    The flat tree model follows the naive approach mentioned above. The root process P0 sends to and/or receives from the other group members directly. Figure 4.5 demonstrates the scatter operation using the flat tree communication scheme with five participating processes. In each step the root sends a message containing the elements to be scattered to the next process. The number of steps needed depends linearly on the number of processes. Figure 4.6 shows a different view of the same scatter scheme, which results in a tree with n-1 leaves and exactly one parent node, namely the root.

Figure 4.5: Scatter sending scheme
Typically the other operations using the flat tree model have a complexity of O(n) as well, except alltoall and barrier. The alltoall operation has been implemented using the flat tree, but each process creates a flat tree to send messages to each other process. This results in n flat trees with an overall upper complexity bound of O(n²). The barrier operation uses the process with rank zero to gather zero-sized messages from the other processes. These messages are then scattered back to complete the operation. Two flat trees are needed, with a total cost of O(2n).
Figure 4.6: Flat tree view
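A flat-tree scatter can be expressed with the point-to-point primitives roughly as follows (a sketch only, under the same MPJ-style API assumptions as before; it assumes a contiguous send buffer at the root and equal slice sizes).

    // Sketch: the root sends each process its slice directly; everyone else
    // posts a single receive from the root.
    static void flatScatter(double[] sendbuf, double[] recvbuf, int count,
                            int root, Intracomm comm) throws MPJException {
        int size = comm.size();
        int rank = comm.rank();
        if (rank == root) {
            for (int p = 0; p < size; p++) {
                if (p == root)  // keep the root's own slice locally
                    System.arraycopy(sendbuf, p * count, recvbuf, 0, count);
                else
                    comm.send(sendbuf, p * count, count, MPJ.DOUBLE, p, 0);
            }
        } else {
            comm.recv(recvbuf, 0, count, MPJ.DOUBLE, root, 0);
        }
    }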
Binomial Tree    The broadcast follows the binomial tree model shown in Figure 4.8. Sending messages via the binomial tree structure has a complexity of O(log n). In Figure 4.7 the broadcast operation takes place with eight participating processes, where P0 is the root sending its send buffer to the other processes. After each step the number of sending processes is doubled. In this example three steps are needed to execute the whole operation, while a broadcast in the flat tree model would take seven steps.
Figure 4.7: Broadcast sending scheme

The reduce operation has been implemented using a binomial tree in reverse order. After receiving a message, each process combines its send and receive buffer using the assigned operation. The result is written into the send buffer and represents the argument for the next step. In the end the reduce result appears at the root process. Since user-defined operations may be non-commutative and the binomial tree model does not follow a strict ordering, this algorithm is not universally valid. In the case of non-commutative user-defined operations the flat tree model is used instead.
Figure 4.8: Binomial tree view (P0 as root of a binomial tree over P0-P7)
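The send schedule of the binomial tree can be expressed in a few lines. The following sketch only prints which process sends to which process in each step for n = 8 and thus reproduces the schedule of Figure 4.7; the class is illustrative and independent of the actual MPJ/Ibis implementation.

    /** Prints the binomial tree broadcast schedule (Figure 4.7): after each step
     *  the number of processes holding the message doubles. */
    public class BinomialBroadcastSketch {
        public static void main(String[] args) {
            int n = 8;   // number of participating processes, P0 is the root
            int step = 1;
            for (int dist = 1; dist < n; dist <<= 1, step++) {
                // Every process p < dist already holds the data and sends it to p + dist.
                for (int p = 0; p < dist && p + dist < n; p++) {
                    System.out.println("Step " + step + ": P" + p + " -> P" + (p + dist));
                }
            }
        }
    }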
Ring
In allgatherv, each process first sends the item it wants to gather to its right neighbour (the process with the next higher rank). If the rank of a process is n-1, then it sends its item to the process with rank 0. Second, in the following steps each process sends the item it has just received to its right neighbour. The execution is complete after n-1 steps, when each process has received the items of all the other processes. Figure 4.9 illustrates one step of the ring algorithm.
Figure 4.9: Ring sending scheme (only 1 step; P0-P5 arranged in a ring)
Allgather uses an extension of the ring model mentioned above, called double ring. With the double ring all processes send the items to gather to both their right and left neighbours. Therefore the execution completes after (n - 1)/2 steps, but in each step the number of send and receive operations needed is doubled compared to the ring model. Overall this model shows the same complexity of O(n) as the ring.
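The following self-contained sketch simulates the single-ring scheme for six processes: in step k, process r forwards block (r - k) mod n to its right neighbour. The block bookkeeping is only a local simulation of the sends and receives and is not taken from the MPJ/Ibis code.

    import java.util.Arrays;

    /** Simulation of the single-ring allgather described above: after n-1 steps
     *  every process holds the blocks of all other processes. */
    public class RingAllgatherSketch {
        public static void main(String[] args) {
            int n = 6;
            int[][] blocks = new int[n][n];                            // blocks[r] = buffer of process r
            for (int r = 0; r < n; r++) blocks[r][r] = (r + 1) * 100;  // own contribution

            for (int step = 0; step < n - 1; step++) {
                // In step k, process r sends block (r - k) mod n to (r + 1) mod n.
                for (int r = 0; r < n; r++) {
                    int blockIdx = ((r - step) % n + n) % n;
                    blocks[(r + 1) % n][blockIdx] = blocks[r][blockIdx];
                }
            }
            for (int r = 0; r < n; r++)
                System.out.println("P" + r + ": " + Arrays.toString(blocks[r]));
        }
    }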
Recursive Doubling
In recursive doubling, used by allreduce, in the first step all processes which have a distance of 1 exchange their messages, followed by a local reduction (see Figure 4.10). In the following steps the distances are doubled each time. In the end, after log n steps, the allreduce operation has finished, provided that the number of participating processes is a power of two. For the non-power-of-two case the number of processes performing the recursive doubling is reduced to a power of two. The remaining processes send their items to the processes of the recursive-doubling group explicitly before the doubling starts. When the recursive doubling has finished, the reduced values are sent to the remaining processes. This causes two extra steps in the non-power-of-two case and results in a complexity of O((log n) + 2).
Figure 4.10: Recursive doubling illustration (P0-P7, steps 1-3)
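A minimal sketch of recursive doubling for the power-of-two case with the sum operation is given below. The exchange with the partner at rank XOR 2^s is simulated locally; the class and the simulation are illustrative and do not reflect the actual MPJ/Ibis implementation.

    import java.util.Arrays;

    /** Simulation of recursive doubling for allreduce (sum, power-of-two case):
     *  in step s each process combines its partial result with that of the
     *  partner at distance 2^s; after log2(n) steps all hold the full sum. */
    public class RecursiveDoublingSketch {
        public static void main(String[] args) {
            int n = 8;                                     // must be a power of two here
            int[] value = new int[n];
            for (int r = 0; r < n; r++) value[r] = r + 1;  // local contributions 1..n

            for (int dist = 1; dist < n; dist <<= 1) {
                int[] next = new int[n];
                for (int r = 0; r < n; r++) {
                    int partner = r ^ dist;              // exchange partner in this step
                    next[r] = value[r] + value[partner]; // local reduction (sum)
                }
                value = next;                            // all exchanges happen "simultaneously"
            }
            System.out.println(Arrays.toString(value));  // every entry equals n*(n+1)/2 = 36
        }
    }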
4.5 Open Issues
Multidimensional Arrays & Derived Datatypes
Since the MPJ specification only specifies generic send and receive primitives, which expect message data of the type java.lang.Object, MPJ/Ibis has to cast explicitly to the real type and dimension of the arrays used. That makes it impossible for MPJ/Ibis to support all dimensions of arrays. In contrast to the programming language C and others, for which the original MPI standard was presented, Java represents multidimensional arrays as arrays of arrays. Traversing those arrays in Java with only one pointer is impossible. Therefore MPJ/Ibis supports only one-dimensional arrays. Multidimensional arrays can be sent as an object, or each row has to be sent explicitly.

Since Java provides derived datatypes natively using Java objects, there is no real need to implement derived datatypes in MPJ/Ibis. Nevertheless, contiguous derived datatypes are supported by MPJ/Ibis to achieve the functionality of the reduce operations MINLOC and MAXLOC specified by MPJ, which need at least a pair of values inside a one-dimensional array. The other types of derived datatypes may be implemented in future work, if multidimensional arrays are supported directly.
Other Issues
Due to time constraints for this thesis, MPJ/Ibis supports creating and splitting of new communicators, but intercommunication is not implemented yet (see Section 2.5). At this moment, MPJ/Ibis also does not support virtual topologies (see Section 2.6). Both may be added in future work as well.
Chapter 5
Evaluation
5.1 Evaluation Settings
MPJ/Ibis on top of Ibis version 1.1 has been evaluated on the Distributed ASCI Supercomputer 2 (DAS-2) with 72 nodes in Amsterdam. Each node consists of:

• two 1-GHz Pentium-IIIs
• 1 GB RAM
• a 20 GByte local IDE disk
• a Myrinet interface card
• a Fast Ethernet interface (on-board)

The operating system is Red Hat Enterprise Linux with kernel 2.4. Only one processor per node has been used during the evaluation.
To achieve more comparable results, the following benchmarks have been performed with mpiJava [1] as well. MpiJava is based on wrapping native methods, like the MPI implementation MPICH, with the Java Native Interface (JNI). Here, mpiJava version 1.2.5 has been bound to MPICH/GM version 1.2.6 for Myrinet. For Fast Ethernet only MPICH/P4 was available on the DAS-2, which is not compatible with mpiJava. Nevertheless, the values for MPJ/Ibis on TCP over Fast Ethernet will be presented as well. Both MPJ/Ibis and mpiJava have been evaluated using Sun's JVM version 1.4.2.
For Ibis two different modules exist to access a Myrinet network, called Net.gm and Panda. During the evaluation it was found that Net.gm in some cases causes deadlocks, due to problems in buffer reservation, when multiple objects are being transferred from multiple senders to one recipient. The Panda implementation has shown memory leaks for large message sizes, resulting in performance reduction and deadlocks. However, where possible the results for MPJ/Ibis on Myrinet will be presented for each benchmark. MPJ/Ibis on TCP performed stable.

MpiJava on MPICH/GM in some cases performed unstably, resulting in memory overflows and broken data streams. These misbehaviours occurred randomly and could not be reproduced.
5.2 Micro Benchmarks

The micro benchmarks, which have been implemented for Ibis, MPJ/Ibis and mpiJava, firstly measure the round trip latency by sending one byte back and forth. Since the size can be neglected, the round trip latency is divided by two to get the latency for a one-way data transfer. Secondly, the micro benchmarks measure the execution time of sending an array of doubles from one node to a second one, which acknowledges the reception by sending one byte back to the sender. This is repeated for array sizes from 1 byte to 1 MB. Thirdly, the throughput of object arrays is measured, where each object contains a byte value. The measurement is repeated for different array sizes in analogy to the second step.
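As an illustration, the following sketch shows the structure of such a round-trip measurement. The Transport interface and its send/recv methods are hypothetical placeholders for the point-to-point primitives of the respective platform; they are not part of the Ibis or MPJ/Ibis API.

    /** Structure of the one-byte round-trip latency measurement described above. */
    public class LatencyBenchmarkSketch {
        /** Hypothetical stand-in for a point-to-point transport. */
        interface Transport {
            void send(byte[] buf, int dest);
            void recv(byte[] buf, int src);
        }

        /** One-way latency in microseconds: half of the averaged round-trip time. */
        static double oneWayLatencyMicros(Transport t, int peer, int repetitions) {
            byte[] buf = new byte[1];
            long start = System.nanoTime();
            for (int i = 0; i < repetitions; i++) {
                t.send(buf, peer);   // "ping": one byte to the peer
                t.recv(buf, peer);   // "pong": wait for the one-byte reply
            }
            long elapsedNanos = System.nanoTime() - start;
            return elapsedNanos / (2.0 * repetitions) / 1000.0;
        }
    }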
Implementation         Latency [µs]
mpiJava (MPICH/GM)     28
Ibis (Panda)           44
Ibis (Net.gm)          52
MPJ/Ibis (Panda)       50
MPJ/Ibis (Net.gm)      53
Ibis (TCP)             113
MPJ/Ibis (TCP)         120

Table 5.1: Latency benchmark results
Latencies
Table 5.1 shows the latency benchmark results for MPJ/Ibis, Ibis and mpiJava. On Myrinet, MPJ/Ibis and Ibis have considerably higher latencies than mpiJava. The reason for the gap between Ibis and mpiJava is beyond the scope of this thesis. Furthermore, MPJ on top of Ibis does not show considerably higher latencies than Ibis itself. Thus, the message creation overhead of MPJ/Ibis does not influence the latency by a large margin.
Throughput Double Arrays
Figures 5.1 and 5.2 show the throughput measurement results for Ibis and the message passing implementations. For TCP, Ibis and MPJ/Ibis use almost the whole available bandwidth provided by Fast Ethernet for data sizes greater than 1 KB. For sizes beyond 32 KB the performance of the Panda implementation breaks down, caused by the memory leaks mentioned above. This halves the throughput for Ibis and MPJ/Ibis.
Figure 5.1: Double array throughput in Ibis (bandwidth in Mbps vs. array size in bytes, for Ibis TCP, Panda and Net.gm)
Net.gm does not reach the maximum of Panda, but for large data sizes (> 64 KB) it is much faster. MPJ/Ibis on top of Net.gm also outperforms mpiJava. The breakdown of mpiJava's performance is caused by MPICH/GM switching from ready send mode to synchronous send mode at message sizes beyond 128 KB. None of the message passing implementations takes full advantage of the available bandwidth provided by Myrinet.
Figure 5.2: Double array throughput in MPJ/Ibis and mpiJava (bandwidth in Mbps vs. array size in bytes, for MPJ/Ibis Panda, Net.gm, TCP and mpiJava MPICH/GM)
Throughput Object Arrays
The throughput of object arrays is not limited by the physical restrictions of the underlying network hardware. Objects have to be serialized at the sender's side and deserialized by the receiver. The performance gap between Ibis and MPJ/Ibis is small (compare Figures 5.3 and 5.4). Both use the Ibis serialization model introduced in Section 3.3, which still is a performance limiter. On the other hand, MPJ/Ibis outperforms mpiJava by a large margin; mpiJava depends on the serialization model provided by Sun's JVM and does not even reach 15 Mbps on Myrinet when object arrays are being transferred.

Each introduced serialization model has to perform a duplicate detection before sending an object, so that no object needs to be transferred twice. Therefore, each object reference has to be stored in a hash table to allow a lookup whether an object to be sent has been processed before. For larger arrays the hash table becomes larger as well, causing communication slowdowns. These slowdowns have been shown by all implementations when the object array sizes grow beyond 8 KB.
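The following sketch illustrates this kind of duplicate detection with an identity hash table that maps already written object references to small handles. It only mirrors the general idea sketched above and is not the actual Ibis (or Sun) serialization code.

    import java.util.IdentityHashMap;
    import java.util.Map;

    /** Duplicate detection as described above: a second occurrence of an object
     *  reference is replaced by a handle instead of being serialized again. */
    public class DuplicateDetectionSketch {
        private final Map<Object, Integer> handles = new IdentityHashMap<>();

        /** Returns the previously assigned handle, or null if the object is new. */
        public Integer handleOf(Object obj) {
            return handles.get(obj);
        }

        /** Registers a newly written object and returns its fresh handle. */
        public int register(Object obj) {
            int handle = handles.size();
            handles.put(obj, handle);
            return handle;
        }
    }

The table grows with the number of distinct objects written, which matches the slowdowns observed for large object arrays.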
Figure 5.3: Object array throughput in Ibis (bandwidth in Mbps vs. array size in bytes, for Ibis TCP, Panda and Net.gm)
Figure 5.4: Object array throughput in MPJ/Ibis and mpiJava (bandwidth in Mbps vs. array size in bytes, for MPJ/Ibis Panda, Net.gm, TCP and mpiJava MPICH/GM)
5.3 Java Grande Benchmark Suite

The Java Grande Benchmark Suite1, maintained by the Edinburgh Parallel Computing Centre (EPCC), consists of three sections. Section 1 performs low-level operations: pingpong for throughput measurements and several benchmarks for the collective operations. Section 2 provides five kernel applications performing common operations that are widely used in high performance computation. In Section 3 three large applications are benchmarked. The benchmarks of Sections 2 and 3 measure execution times. Those results will be presented as relative speedup [10, 30], which is defined as follows:

    relative speedup (p processors) = runtime of the parallel algorithm on 1 processor / runtime of the parallel algorithm on p processors

Additionally, the theoretical perfect speedup for the kernel and application benchmarks is marked in each figure.

Originally the benchmark suite has been implemented to match mpiJava's API, which is slightly different from that of MPJ. Thus, the benchmark suite has been ported to MPJ. Sections 2 and 3 contain predefined problem sizes to be solved, which were not large enough to perform efficiently on the DAS-2 when computation time becomes short. Where possible, the predefined problem sizes have been increased to improve the utilization of the existing hardware.
5.3.1 Section 1: Low-Level Benchmarks

The low-level benchmarks are designed to run for a fixed period of time. The number of operations executed in that time is recorded, and the performance is reported as operations/second for the barrier and bytes/second for the other operations. Both the size and type of the arrays transferred are varied. The type is either a double or a simple object containing a double, which allows comparing the communication overhead of sending objects and basic types.
1 Java Grande Benchmark Suite online at http://www.epcc.ed.ac.uk/javagrande/mpj/contents.html
Pingpong
The pingpong benchmark measures the bandwidth achieved by sending an array back and forth between exactly two nodes. The aggregated results of the double array and the object array benchmarks are shown in Figures 5.5 and 5.6.
Figure 5.5: Pingpong benchmark: arrays of doubles (bandwidth in Mbps vs. array size in bytes, for MPJ/Ibis TCP, Panda, Net.gm and mpiJava MPICH/GM)
Figure 5.6: Pingpong benchmark: arrays of objects (bandwidth in Mbps vs. array size in bytes, for MPJ/Ibis TCP, Panda, Net.gm and mpiJava MPICH/GM)
As can be seen in Figure 5.5, for all Myrinet implementations the achieved throughput rises in the same way up to array sizes of almost 64 KB. Beyond 64 KB MPJ/Ibis on Net.gm keeps outperforming mpiJava, while MPJ/Ibis on Panda slows down due to the mentioned memory leaks. For object arrays mpiJava uses the serialization mechanism provided by the JVM, while MPJ/Ibis takes advantage of the Ibis serialization. That results in higher throughputs; at array sizes greater than 4 KB even MPJ/Ibis on TCP performs better than mpiJava. With arrays larger than 1 MB the performance advantage of the Ibis serialization almost disappears for Ibis' Myrinet implementations. Overall these results reflect the measurements of the micro benchmarks (see Section 5.2).
Barrier
MPJ/Ibis' implementation of the barrier operation is not optimal (see Figure 5.7). MPICH/GM uses the recursive doubling algorithm, which has a lower execution time than the algorithm used in MPJ/Ibis. In both MPJ/Ibis and mpiJava a zero-sized byte array is transferred in each communication step. Additionally, due to higher zero-byte latencies the relative performance of MPJ/Ibis compared to mpiJava is further slowed down.
Figure 5.7: Barrier benchmark (barriers/sec vs. number of CPUs, for MPJ/Ibis TCP, Panda, Net.gm and mpiJava MPICH/GM)
Broadcast
The broadcast benchmark behaves in a similar way to the pingpong benchmark. Unlike pingpong, however, it is not restricted to only two nodes, and therefore the broadcast operation has been evaluated on up to 48 nodes. MPJ/Ibis and MPICH/GM implement the broadcast operation using the same algorithm.

Figure 5.8 shows the results of MPJ/Ibis and mpiJava. For double arrays mpiJava performs better than MPJ/Ibis up to eight involved nodes, caused by higher throughputs. With more participating processes the difference between the implementations working on Myrinet becomes marginally small.
Broadcasting arrays of objects comes with a considerable performance advantage for MPJ/Ibis. On two processors MPJ/Ibis outperforms mpiJava by a factor of about six. Increasing the number of nodes to 48 processes raises this gap to a factor of about 20, caused by the more efficient Ibis serialization. Overall, the results of the broadcast benchmark correspond to those of the micro and pingpong benchmarks.
Figure 5.8: Broadcast benchmark (bandwidth in Mbps vs. array size in bytes on 2 to 48 nodes; panels: MPJ/Ibis Panda, mpiJava MPICH/GM, MPJ/Ibis Net.gm and MPJ/Ibis TCP, each for double arrays and object arrays)
Reduce
The reduce benchmark only uses double arrays, since the built-in reduce operations do not support object arrays. Here, the arrays are reduced by adding the array items using the sum operation.
Figure 5.9: Reduce benchmark (bandwidth in Mbps vs. number of CPUs for double arrays of 4 and 4472 items, for MPJ/Ibis TCP, Panda, Net.gm and mpiJava MPICH/GM)
In general mpiJava shows a better performance (see Figure 5.9), although MPICH/GM implements the same tree algorithm as MPJ/Ibis. In both, before executing the reduce operation, a temporary buffer for array reception is created at each node. Since Java fills arrays with zeros at initialization, in contrast to C, the overhead for MPJ/Ibis is much higher. Furthermore, the smaller throughput results for MPJ/Ibis are also explained by the higher latencies shown by the micro benchmarks.
Scatter
Scattering messages in mpiJava and MPJ/Ibis has been implemented in the same way. As for the reduce operation, the scatter benchmark only measures the throughput for two different sizes, but for both double and object arrays. It was not possible to run this benchmark on the Net.gm implementation for MPJ/Ibis, which caused deadlocks when more than two processes were involved. As can be seen in Figure 5.10, for small double arrays mpiJava shows less performance loss on larger numbers of nodes than MPJ/Ibis on Panda, because of the lower latency shown in Section 5.2. For larger double arrays the impact of the latency gap becomes marginally small. As expected, for object arrays MPJ/Ibis on Panda outperforms mpiJava almost by a factor of 1.9. Overall the performance of scattering messages is highly influenced by the flat tree algorithm used.
Figure 5.10: Scatter benchmark (bandwidth in Mbps vs. number of CPUs for double and object arrays of 4 and 4472 items, for MPJ/Ibis TCP, Panda and mpiJava MPICH/GM)
Gather
It was not possible to run the gather benchmark with mpiJava nor with the Myrinet implementations of Ibis. With all of them this benchmark exceeded the limit of the physical memory provided by each node of the DAS-2. Only MPJ/Ibis on top of TCP worked stable. The results are presented in Figure 5.11.
Figure 5.11: Gather benchmark (bandwidth in Mbps vs. array size in bytes for double and object arrays on 2 to 32 nodes, MPJ/Ibis TCP)
This benchmark functions in the same way as the broadcast benchmark. For double arrays the gather operation shows a higher performance loss on more than two nodes than broadcast. This is because of the flat tree algorithm, in which each node sending a message to the root has to wait until the root has received the message from the previous node. On a growing number of processes this leads to substantially higher latencies.

For object arrays the results show an unstable behaviour of MPJ/Ibis on TCP using the gather operation with up to eight processes involved. While MPJ/Ibis for double arrays works as expected (more processes causing less throughput), the results for object arrays lead to the assumption that Ibis serialization for local copies may cause inefficiencies. In each call of gather the root has to copy the items of its send buffer locally into the receive buffer. Object arrays are copied using Ibis serialization. In particular, the share of this local copy in the communication overhead is much higher at a smaller number of processes than at larger numbers of processes. This impact should be elaborated more in future work.
Alltoall
As with the gather benchmark, it was not possible to perform the alltoall benchmark with mpiJava and MPJ/Ibis on top of Panda and Net.gm. With all Myrinet implementations this benchmark produced memory overflows. Additionally, mpiJava randomly reported broken data streams.
Figure 5.12: Alltoall benchmark (bandwidth in Mbps vs. array size in bytes for double and object arrays on 2 to 32 nodes, MPJ/Ibis TCP)
In MPJ/Ibis the alltoall operation uses non-blocking communication primitives, which are called simultaneously. On TCP, multiple threads competing for system resources notably interfere with the communication performance, leading to high performance losses the more processes are involved.
5.3.2 Section 2: Kernels
Crypt
Crypt performs an IDEA (International Data Encryption Algorithm) encryption and decryption on a byte array with a size of 5 × 10^7 items. Node 0 creates the array and sends it to the other nodes, where the encryption and decryption take place. After computation the involved nodes send their results back to node 0 using individual messages. The time measurement starts after sending the initialized arrays and stops when node 0 has received the last array.
Figure 5.13: Crypt speedups (speedup vs. number of CPUs, for MPJ/Ibis TCP, Panda, Net.gm, mpiJava MPICH/GM and the perfect speedup)
The Crypt kernel does not scale perfectly in all cases (see Figure 5.13). However, MPJ/Ibis on top of Panda and mpiJava show the same speedups; beyond 8 nodes both break down. As expected from the micro benchmarks, the impact of the communication overhead of Panda becomes less negligible with a growing number of nodes involved, due to reduced computation time. MPJ/Ibis on Net.gm shows the highest speedup of up to about 23 on 64 CPUs and outperforms mpiJava by a large margin. Even MPJ/Ibis on TCP shows a better performance than mpiJava on Myrinet.
LU Factorization
This kernel solves a linear system containing 6000 x 6000 items using LU factorization followed by a triangular solve. The processes exchange double and integer arrays using the broadcast operation. The time during factorization, including communication, is measured.
Figure 5.14: LU factorization speedups (speedup vs. number of CPUs, for MPJ/Ibis TCP, Panda, Net.gm, mpiJava MPICH/GM and the perfect speedup)
All message passing implementations except MPJ/Ibis on TCP show the same speedup, though they do not scale perfectly. Due to a relatively small problem size the effect of the computation part becomes small. Because of memory constraints it was not possible to enlarge the problem size beyond 6000 x 6000 items.
Series
This benchmark computes the first 10^6 Fourier coefficients of a function inside a predefined interval. Communication only takes place at the end of the computation, when each node sends its individual results (double arrays) to node 0. The performance of both computation and communication is measured.
Figure 5.15: Series speedups (speedup vs. number of CPUs, for MPJ/Ibis TCP, Panda, Net.gm, mpiJava MPICH/GM and the perfect speedup)
The speedups for all implementations grow in a linear way, as shown in Figure 5.15, but they are not perfect. At 48 nodes mpiJava is slightly slower than the Myrinet implementations of MPJ/Ibis. Even MPJ/Ibis on Fast Ethernet does not scale worse than the Myrinet implementations. This kernel executed on 64 nodes is 40 times faster than executed on only one node.
Sparse Matrix Multiplication
This kernel multiplies a sparse matrix using one array of double values and two integer arrays. First, node 0 creates the matrix data and transfers it to each process. Second, when each node has completed the computation, the results are combined using allreduce (sum operation). Only the time of step two is measured. Here, a sparse matrix with a size of 10^6 x 10^6 items has been used for 200 iterations. It was not possible to enlarge the matrix size because of memory restrictions.
Figure 5.16: Sparse matrix multiplication speedups (speedup vs. number of CPUs, for MPJ/Ibis TCP, Panda, Net.gm, mpiJava MPICH/GM and the perfect speedup)
This benchmark has shown the smallest run times of the whole JGF benchmark suite (about 131 seconds when executed on only one node). This means that the computation time is marginally small and does not affect the overall execution speed. Therefore the communication overhead becomes the main factor, which grows with the number of involved processes, leading to a performance reduction (see Figure 5.16). All MPJ/Ibis implementations are slowed down, showing the impact of the higher latencies presented in Section 5.2. For mpiJava there is a small speedup of about 2.5. It has to be noted, however, that one of MPICH/GM's allreduce operations is erroneous. Beyond eight participating processes this benchmark reports result validation errors for mpiJava. MPICH/GM uses different algorithms for allreduce depending on data size and the number of processes. Here, at more than eight processes it switches to a different algorithm, which does not work correctly. The speedup values for this benchmark on mpiJava are therefore not representative.
Successive Over-Relaxation
This benchmark performs 100 iterations of successive over-relaxation (SOR) on a 6000 x 6000 grid. The arrays are distributed over the processes in blocks using the red-black checkerboard ordering mechanism. Only neighbouring processes exchange arrays, which consist of double values. The arrays are treated as objects, since they are two-dimensional. It was not possible to run this benchmark with MPJ/Ibis on Net.gm, due to the mentioned problems in the buffer reservation of Net.gm.
Figure 5.17: Successive over-relaxation speedups (speedup vs. number of CPUs, for MPJ/Ibis TCP, Panda, mpiJava MPICH/GM and the perfect speedup)
Because of the Ibis serialization model, MPJ/Ibis on Panda outperforms mpiJava by a factor of two up to 32 participating processes (see Figure 5.17). Beyond 32 nodes the transmitted arrays become smaller, reducing the performance advantage by increasing the relative communication overhead.
5.3.3 Section 3: Applications
Molecular Dynamics
The Molecular Dynamics application models 27436 particles (problem size: 19) interacting under a Lennard-Jones potential in a cubic spatial volume with periodic boundary conditions. In each iteration the particles are updated using the allreduce operation with summation in the following way:

• three times with double arrays
• two times with a double value
• once with an integer value
Figure 5.18: Molecular dynamics speedups (speedup vs. number of CPUs, for MPJ/Ibis TCP, Panda, Net.gm, mpiJava MPICH/GM and the perfect speedup)
As can be seen in Figure 5.18, mpiJava slightly outperforms MPJ/Ibis on the Myrinet modules. In contrast to the sparse matrix multiplication, this benchmark does not report validation errors with mpiJava, since MPICH/GM does not switch between different allreduce algorithms in this case.
Monte Carlo Simulation
This financial simulation uses Monte Carlo techniques to price products derived from an underlying asset. It generates 60000 sample time series. The results at each node are stored in object arrays of class java.util.Vector, which are sent to node 0 using the send and recv primitives. As with the successive over-relaxation kernel, it was not possible to run this benchmark using Net.gm for Ibis.
Figure 5.19: Monte Carlo speedups (speedup vs. number of CPUs, for MPJ/Ibis TCP, Panda, mpiJava MPICH/GM and the perfect speedup)
Though no message passing implementation achieves perfect speedup, MPJ/Ibis on Panda outperforms mpiJava considerably (see Figure 5.19), since it takes advantage of the Ibis serialization mechanism. Although mpiJava depends on Sun serialization, it is still slightly faster than MPJ/Ibis on TCP, which is restricted by the available bandwidth of Fast Ethernet.
Raytracer
The raytracer application benchmark renders a scene containing 64 spheres at a resolution of 2000 x 2000 pixels. Each process computes a part of the scene and sends the rendered pixels to node 0 using the send and recv primitives.
Figure 5.20: Raytracer speedups (speedup vs. number of CPUs, for MPJ/Ibis TCP, Panda, Net.gm, mpiJava MPICH/GM and the perfect speedup)
In all cases the raytracer scales almost perfectly. At 64 nodes mpiJava shows a slightly better performance than MPJ/Ibis. Compared to the computation part, the communication overhead is marginally small for all message passing implementations evaluated.
5.4 Discussion

MPJ/Ibis and mpiJava have been evaluated both via micro benchmarks (Section 5.2), measuring latencies and throughputs, and via the JGF benchmark suite (Section 5.3), measuring throughputs, the collective operations, and kernel and application runtimes.

The micro benchmarks have shown that MPJ/Ibis does not come with a great performance penalty caused by MPJ itself. In comparison to mpiJava on MPICH/GM, MPJ/Ibis shows higher latencies, but the influence becomes smaller at growing data sizes. Particularly for object arrays, MPJ/Ibis has a great performance advantage over mpiJava on MPICH/GM.

On the other hand, the effects of the relatively higher latencies of MPJ/Ibis become visible in the low-level benchmarks of Section 1 of the JGF benchmark suite. In particular, the algorithm for the barrier operation should be improved. MpiJava performs better when basic type arrays are communicated through the different collective operations, while MPJ/Ibis has advantages when object arrays are transferred.

The performance deficit of MPJ/Ibis for basic type arrays almost disappears in Sections 2 and 3, where more computation-intensive applications are benchmarked. For kernels and applications based on object arrays, MPJ/Ibis' speedups are considerably higher (see SOR and Monte Carlo). The only benchmarks of Sections 2 and 3 where mpiJava shows substantially higher speedups than MPJ/Ibis are the Sparse Matrix Multiplication and the Molecular Dynamics. In both applications allreduce is the main operation used for communication, where, as mentioned above, the correctness of MPICH/GM can be doubted.

The evaluation has shown that MPJ/Ibis' performance is highly dependent on the underlying Ibis implementation. In conclusion, MPJ/Ibis' performance is competitive with that of mpiJava. Additionally, MPJ/Ibis comes with the advantage of full portability and flexibility, allowing MPJ/Ibis to run in heterogeneous networks, in contrast to mpiJava, which depends on a native MPI implementation.
Chapter 6
Related Work
As mentioned at the beginning of Chapter 4, a lot of research has been done to develop an MPI binding for Java, resulting in a variety of different implementations. MPJ/Ibis takes its place in this history, whose participants will be introduced in the following. All projects are presented with a view to efficiency and portability.
JavaMPI
JavaMPI [17] is based on various functions using JNI to wrap MPI methods to Java. For that purpose a Java-to-C Interface generator (JCI) has been implemented to create a C-stub function and a Java method declaration for each native method to be exported from the MPI library. The automatic wrapper creation resulted in an almost complete Java binding to MPI-1.1 [15] with low implementation costs. However, JavaMPI applications are not portable, since a native MPI implementation is always required for execution. The JavaMPI project is maintained by the University of Westminster1, but no longer active. The last version has been released in 2000.

1 JavaMPI online at http://perun.hscs.wmin.ac.uk/JavaMPI/
jmpi
Jmpi [6], implemented at Baskent University in 1998, works on top of JPVM [7]. Both jmpi and JPVM are implemented entirely in Java. JPVM follows the concept of parallel virtual machines (PVM). The main difference [9] between MPI and PVM is that PVM is optimized for fault tolerance in heterogeneous networks using a small set of standard communication techniques (e.g. TCP). That allows jmpi applications to be highly portable, but also limits communication performance dramatically; jmpi suffers from the poor performance of JPVM. The concepts of PVM will not be discussed further at this point. The jmpi project is no longer maintained.
MPIJ
The MPIJ [12] implementation is written in pure Java and runs as a part of the Distributed Object Group Metacomputing Architecture (DOGMA) [11], using RMI for communication. If available on the running platform, MPIJ additionally uses native marshaling of primitive types instead of Java marshaling. DOGMA2 has been developed at Brigham Young University in 1998. Only a precompiled package of the current DOGMA implementation has been released, but due to almost non-existent documentation it could not be determined whether the current DOGMA implementation still contains MPIJ.
JMPP
JMPP [5] has been developed at the National Chiao-Tung University. In general this implementation is also built on top of RMI, resulting in performance disadvantages. To achieve more flexibility, an additional layer between the classes implementing the MPI methods and RMI has been implemented, called the Abstract Device Interface (ADI). It abstracts completely from the underlying communication layer, allowing RMI to be replaced with other modules for more efficient communication. Currently ADI only supports RMI. As the JMPP project is inactive, a more efficient implementation of ADI cannot be expected.
JMPI
Using RMI for communication, JMPI [19] has been implemented entirely in Java with the advantage of full portability. Since RMI causes a high performance loss, an optimized RMI model called KaRMI [21] has been used for data transfer. KaRMI improves the performance of JMPI notably, but comes with reduced portability, since it has to be configured explicitly for each different JVM used, increasing the administration overhead for each JMPI application. JMPI has been developed at the University of Massachusetts3, but a release is not available.
2 DOGMA online at http://csl.cs.byu.edu/dogma/
3 http://www.umass.edu/
CCJ
In contrast to the other projects, CCJ4 [20], implemented at Vrije Universiteit Amsterdam in 2003, follows a strict object-oriented approach and thus cannot be claimed to be a binding of the MPI specification. CCJ also has been built directly on top of RMI, with all the disadvantages that come along with it. Nevertheless, group communication is possible, where threads within the same thread group can exchange messages using collective operations like broadcast, gather and scatter. Alltoall is not supported here. To reach higher communication speeds it is also possible for CCJ applications to be compiled with Manta5 [14, 37-68], a native Java compiler optimized for RMI, allowing remote method invocation on Myrinet based networks. Manta is source code compatible with Java version 1.1. While CCJ compiled with Manta works more efficiently, the use of a native compiler breaks Java's portability advantage. Both projects are no longer active.
mpiJava
MpiJava6 [1] is based on wrapping native methods, like the MPI implementation MPICH, with the Java Native Interface (JNI). The API is modeled very closely on the MPI-1.1 standard provided by the MPI Forum, but does not match the proposed MPJ [4] specification. MpiJava comes with a large set of documentation including a complete API reference. Since it is widely used to enable message passing for Java, it has been chosen for comparison with MPJ/Ibis (see Chapter 5). However, mpiJava comes with some notable disadvantages:

• compatibility issues with some native MPI implementations (e.g. MPICH/P4)
• reduced portability, since an existing native MPI library is needed for the target platform

This project is still active.
MPJ
In 2004 the Distributed Systems Group at the University of Portsmouth announced a message passing implementation matching the MPJ specification. This project is also called MPJ7, but a release is not publicly available. MPJ implements an MPJ Device layer, which abstracts from the underlying communication model. For TCP based networks it uses the java.nio package, and a wrapper class using JNI for Myrinet. Like MPJ/Ibis, this implementation is completely written in Java, but it currently supports only the point-to-point primitives, a small subset of the MPJ specification. However, some low-level benchmark results concerning basic type array transmission are presented on the project's website, with competitive results. The issue of efficient object serialization seems not to be resolved yet. Since the MPJ project is in an early stage, more results can be expected in the future.

4 CCJ online at http://www.cs.vu.nl/ibis/ccj_download.html
5 Manta online at http://www.cs.vu.nl/~robn/manta/
6 mpiJava online at http://www.hpjava.org/mpiJava.html
7 MPJ project online at http://dsg.port.ac.uk/projects/mpj/
Summary
All of the Java message passing projects introduced in this chapter have shown disadvantages. Either an implementation is efficient, but does not benefit from Java's portability, or it is highly portable, but suffers from the poor performance of the underlying communication model. The only existing project that seems to provide both efficiency and flexibility is MPJ, which is still under development towards its first release and not publicly available at the moment.
Chapter 7
Conclusion and Outlook
7.1 Conclusion
In this thesis a new message passing platform for Java called MPJ/Ibis has been presented. The main focus was to implement an environment whose performance can compete with existing Java bindings of MPI (e.g. mpiJava), but without flexibility drawbacks.

Chapter 2 introduced parallel architectures in general and the basic principles of message passing derived from the MPI-1.1 specification. In summary the specification defines the following concepts:

• Point-to-point communication
• Groups of processes
• Collective communication
• Communication contexts
• Virtual topologies
• Derived datatypes
In Chapter 3 Java's drawbacks for parallel computation have been pointed out. Besides the great advantage of portability, Java also shows disadvantages; in particular, RMI is not flexible and efficient enough to meet the requirements of an efficient message passing environment. The grid programming environment Ibis addresses these drawbacks (see Sections 3.2 and 3.3) and thus has been chosen as the foundation for a message passing implementation. The main advantages of Ibis in short are:

• Efficient serialization
• Efficient communication

Additionally, Ibis' flexibility allows any MPJ/Ibis application to run on clusters and grids without recompilation by loading the appropriate communication module at runtime.
Putting Ibis and MPI together in Chapter 4, the proposed MPJ specification has been taken as the basis, since it is the result of research within the Java Grande Forum and specifies a well defined API for a Java binding of MPI. Chapter 4 also focuses on the main implementation details, particularly the point-to-point primitives, the collective operation algorithms and context management.

In Chapter 5 MPJ/Ibis has been evaluated using micro benchmarks and the benchmark suite for MPJ implementations provided by the Java Grande Forum. During the evaluation on Myrinet networks MPJ/Ibis has been compared with mpiJava (see Chapter 6). The low-level results have shown that MPJ/Ibis has great advantages over mpiJava when objects have to be serialized, while mpiJava moderately outperforms MPJ/Ibis when basic type arrays have to be communicated. For most kernels and applications used in the benchmarks, where the relative communication overhead is reduced, MPJ/Ibis and mpiJava have shown almost equivalent results.

The benchmarks have shown that the flexibility provided by MPJ/Ibis does not come with considerable performance penalties. In summary, MPJ/Ibis can be considered a message passing platform for Java that combines competitive performance with portability ranging from high-performance clusters to grids. It is the first known Java binding of MPI that provides both flexibility and efficiency.
7.2 Outlook
Showing the relevance of MPJ/Ibis for the message passing community, parts of this thesis have found their way into a publication1, which has been accepted at the EURO PVM/MPI 2005 conference, Sorrento (Naples), Italy, and will be printed in Lecture Notes in Computer Science (LNCS)2. Nevertheless, some tasks remain to be done for MPJ/Ibis. Implementing the virtual topologies and improving the collective operations (i.e. barrier) can be done in the near future.
The issue of supporting the whole set of methods needed for derived datatypes is not resolved yet. As pointed out in Section 4.5, it is not necessary to support derived datatypes in MPJ/Ibis, since Java objects support them natively. On the other hand, the existence of derived datatypes would ease the work of software developers porting existing MPI applications written in C, Fortran or other languages to MPJ/Ibis. Thus, it should be figured out how the issue of multidimensional arrays in Java, which in fact is the main handicap for derived datatypes, can be addressed. In 2001 the Ninja [18] project (Numerically Intensive Java) supported by IBM proposed an extension to add truly multidimensional arrays to Java. Although the proposal has been withdrawn from Sun's Java specification request program3 and thus will not be added to future Java specifications, the ambitions of the Ninja approach should be continued.
Since MPJ/Ibis depends on the underlying Ibis implementations, Ibis, and particularly Net.gm and Panda, should be improved in future work to provide more stability to MPJ/Ibis. Furthermore, it should be investigated how the MPJ specification (and thus MPJ/Ibis) can be extended towards MPI-2. Single-sided communication and dynamic process management are additional interesting aspects, especially for global grids, where fault tolerance becomes an issue.

1 MPJ/Ibis: a Flexible and Efficient Message Passing Platform for Java, online at http://www.cs.vu.nl/ibis/papers/europvm2005.pdf
2 B. Di Martino et al. (Eds.): EuroPVM/MPI 2005, LNCS Volume 3666, pp. 217-224, Springer Verlag Berlin Heidelberg, 2005
3 Java Specification Request: Multiarray package, online at http://jcp.org/en/jsr/detail?id=083
Bibliography
[1] M. Baker, B. Carpenter, G. Fox, S. H. Ko, and S. Lim. mpiJava: An object-oriented Java interface to MPI. Presented at the International Workshop on Java for Parallel and Distributed Computing, IPPS/SPDP 1999, San Juan, Puerto Rico, Apr. 1999. LNCS, Springer Verlag, Heidelberg, Germany.

[2] R. Bhoedjang, T. Ruhl, R. Hofman, K. Langendoen, H. Bal, and F. Kaashoek. Panda: A portable platform to support parallel programming languages. pages 213-226, 1993.

[3] J. M. Bull, L. A. Smith, L. Pottage, and R. Freeman. Benchmarking Java against C and Fortran for scientific applications. In Java Grande, pages 97-105, 2001.

[4] B. Carpenter, V. Getov, G. Judd, A. Skjellum, and G. Fox. MPJ: MPI-like message passing for Java. Concurrency: Practice and Experience, 12(11):1019-1038, 2000.

[5] Y.-P. Chen and W. Yang. Java message passing package - a design and implementation of MPI in Java. In Proceedings of the Sixth Workshop on Compiler Techniques for High-Performance Computing, Kaohsiung, Taiwan, Mar. 2000.

[6] K. Dincer. Ubiquitous Message Passing Interface Implementation in Java: jmpi. In IPPS/SPDP, pages 203-211. IEEE Computer Society, 1999.

[7] A. Ferrari. JPVM: network parallel computing in Java. Concurrency: Practice and Experience, 10(11-13):985-992, 1998.

[8] B. Goetz. Threading Lightly: Synchronization is not the enemy, online at ftp://www6.software.ibm.com/software/developer/library/j-threads1.pdf edition, 2001.

[9] W. Gropp and E. L. Lusk. Why are PVM and MPI so different? In PVM/MPI, pages 3-10, 1997.

[10] W. Huber. Paralleles Rechnen. Oldenbourg, München, 1997.

[11] G. Judd, M. Clement, and Q. Snell. DOGMA: Distributed Object Group Metacomputing Architecture. Concurrency: Practice and Experience, 10:977-983, 1998.

[12] G. Judd, M. J. Clement, Q. Snell, and V. Getov. Design issues for efficient implementation of MPI in Java. In Java Grande, pages 58-65, 1999.

[13] G. Krüger. Go To Java 2. Addison Wesley, 1999.

[14] J. Maassen. Method Invocation Based Communication Models for Parallel Programming in Java. PhD thesis, Vrije Universiteit Amsterdam, The Netherlands, June 2003.

[15] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, online at http://www.mpi-forum.org/docs/mpi-11.ps edition, 1995.

[16] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface, online at http://www.mpi-forum.org/docs/mpi-20.ps edition, 1997.

[17] S. Mintchev and V. Getov. Towards portable message passing in Java: Binding MPI. In PVM/MPI, pages 135-142, 1997.

[18] J. E. Moreira, S. P. Midkiff, M. Gupta, P. V. Artigas, P. Wu, and G. Almasi. The NINJA project. Communications of the ACM, 44(10):102-109, 2001.

[19] S. Morin, I. Koren, and C. M. Krishna. JMPI: Implementing the Message Passing Standard in Java. In IPDPS, 2002.

[20] A. Nelisse, J. Maassen, T. Kielmann, and H. E. Bal. CCJ: object-based message passing and collective communication in Java. Concurrency and Computation: Practice and Experience, 15(3-5):341-369, 2003.

[21] M. Philippsen and B. Haumacher. More efficient object serialization. In IPPS/SPDP Workshops, pages 718-732, 1999.

[22] Sun Microsystems. Java Remote Method Invocation Specification, online at http://java.sun.com/products/jdk/rmi edition, July 2005.

[23] Sun Microsystems. Object Serialization Specification, online at http://java.sun.com/j2se/1.4.2/docs/guide/serialization/index.html edition, July 2005.

[24] A. S. Tanenbaum and J. Goodman. Computer Architektur. Pearson Studium, München, 2001.

[25] R. V. van Nieuwpoort. Efficient Java-Centric Grid-Computing. PhD thesis, Vrije Universiteit Amsterdam, The Netherlands, Sept. 2003.

[26] R. V. van Nieuwpoort, J. Maassen, R. Hofman, T. Kielmann, and H. E. Bal. Ibis: an Efficient Java-based Grid Programming Environment. In Joint ACM Java Grande - ISCOPE 2002 Conference, pages 18-27, Seattle, Washington, USA, November 2002.
Eidesstattliche Erklärung

I hereby declare that I have written this thesis independently and that I have used no sources or aids other than those stated.