A PERFORMANCE COMPARISON METHOD OF
PROGRAMMING LANGUAGES USING SOURCE TO SOURCE
TRANSLATION TECHNIQUE
ソースコード間トランスレータを用いた多種言語処理系性能比較法
の研究
by
48-106126 Takafumi Nose
野瀬 貴史
A Master's Thesis
修士論文
Submitted to
the Graduate School of the University of Tokyo
on February 8, 2012
in Partial Fulfillment of the Requirements
for the Degree of Master of Information Science and Technology
in Computer Science
Thesis Supervisor: Kei Hiraki 平木 敬
Professor of Computer Science
ABSTRACT
The performance of a language implementation is one of the criteria for choosing the language in which to write a program. Performance comparison between different language implementations is possible by comparing the performance of benchmark programs ported to each language from the same base benchmark. However, if the porter applies optimizations that unduly favor a specific language, the objectivity of the comparison is lost. In addition, vague translation rules obscure the correspondence between the ported benchmarks and make it difficult to compare them at the function or loop level rather than at the whole-program level. Existing cross-language benchmarks are ported by hand, so the objectivity of the translation suffers. Moreover, their scores have an objectivity problem because they are not compatible with standard benchmarks used for system evaluation such as SPEC or the NAS Parallel Benchmarks.
To address these issues, we propose a method that enhances the objectivity of both the program translation and the benchmark score. First, we use an automatic translator containing explicit translation rules to generate the benchmarks, which gives the translation a clear rationale. Second, the base benchmark is constructed to represent the workload of existing, proven benchmarks.
With our method, we translated the benchmarks into five languages with little effort and compared the performance of 30 language implementations, including those of the source languages. The results show that, except for PyPy, all measured implementations of Ruby, Python, and PHP are more than 100 times slower than those of C, Fortran, and Java in terms of both floating-point and system performance; they also reveal a weakness of method-based JIT compilation and show that the NAS Parallel Benchmarks provide workload diversity for dynamic languages.
Acknowledgements
I would like to express my deepest gratitude to Professor Kei Hiraki, whose comments and suggestions were invaluable not only throughout the course of my study but also in my life. Special thanks also go to Dr. Koichi Sasada, whose comments on Twitter encouraged and helped me very much. Hisanobu Tomari, Yasuo Ishii, and Koichi Nakamura gave me constructive comments and advice, and I am deeply grateful for their discussions with me; my study would not have started without their ideas. I would also like to express my gratitude to Naoki Tanida, Goki Honjo, Kenichi Koizumi, and all the members of the Hiraki Laboratory. Their optimism and witty encouragement gave me energy and motivation.
Contents
1 Introduction
  1.1 Background
    1.1.1 Purposes and types of benchmark
    1.1.2 Need for comparing programming language implementations
    1.1.3 Difficulties in comparing programming language implementations
  1.2 Our contributions
  1.3 Composition of this paper
2 Related work
  2.1 Synthetic benchmarks in early years
  2.2 The Computer Language Benchmarks Game
  2.3 Comparison between Java and other languages
  2.4 Comparison between DSLs for HPC
3 Methodology
  3.1 Selection of base benchmarks
  3.2 Translator
    3.2.1 I/O and memory management
    3.2.2 For loop
    3.2.3 Arithmetic overflow
    3.2.4 Mimicking class inheritance mechanism in C
  3.3 Benchmarks overview
    3.3.1 NAS Parallel Benchmarks
    3.3.2 Dhrystone
4 Evaluation
  4.1 Measurement environment
  4.2 Characteristics of measured languages and their implementations
  4.3 NAS Parallel Benchmarks
    4.3.1 Performance overview
    4.3.2 Effects by JIT
    4.3.3 Multithread performance of Ruby
  4.4 Dhrystone
5 Conclusion
References
List of Figures
3.1 Dhrystone call graph (Ruby, iteration = 200,000)
3.2 Statement distribution of SPEC CINT2006 and Dhrystone
4.1 NPB performance overview
4.2 Original NPB performance
4.3 NPB performance of native compiled languages
4.4 Standard deviation of the score of native compiled languages
4.5 NPB Java version performance
4.6 NPB Python version performance
4.7 NPB Ruby version performance
4.8 NPB PHP version performance
4.9 Sun Java performance improvement after method division
4.10 JIT log of NPB except for SP (CLASS=S) on PyPy 1.4.1
4.11 JIT log of NPB CG (CLASS=W, A) on PyPy 1.4.1
4.12 Call graph of NPB CG (CLASS=S) on Python 2.7
4.13 JIT log of NPB IS (CLASS=W, A) on PyPy 1.4.1
4.14 JIT log of NPB IS (CLASS=S, W, A, B) on PyPy 1.7
4.15 Call graph of NPB IS (CLASS=A) on PyPy 1.4.1
4.16 Call graph of NPB IS (CLASS=A) on Python 2.7
4.17 Call graph of NPB IS (CLASS=A) on PyPy 1.7
4.18 Scaling of multithread benchmarks on Ruby implementations
4.19 Scaling of multithread benchmarks on JRuby
4.20 Scaling of multithread benchmarks, implemented in FORTRAN77
4.21 Dhrystone (iteration count = 50,000,000)
4.22 JIT log of Dhrystone (iteration count = 200,000)
Chapter 1
Introduction
1.1 Background
1.1.1 Purposes and types of benchmark
It is important to measure and estimate the performance of each layer of a computer system, such as the microarchitecture, network interface, compiler, operating system, runtime environment, numerical kernels, and higher-level algorithms, when designing a complete system or application, because applications are built on top of a vertical integration of these layers. The performance of these layers can be evaluated by benchmarking. Benchmarking feeds data collected from real-world workloads, together with specific criteria, into software or hardware in order to compare their relative performance, for example program execution time, latency, code size, hardware resource consumption, and power consumption. Processors and system software are easy to evaluate because the criteria are physical quantities such as instructions per clock (IPC) and turnaround time. Higher-level layers such as programming languages and user interfaces are harder to evaluate, because human factors enter into their evaluation. Many benchmarks have been proposed and used to evaluate one or several of the layers mentioned above. For example, the image known as "Lena" [29] is one of the standard benchmarks for evaluating image-processing algorithms. The LINPACK benchmark [23] is a representative benchmark for evaluating numerical kernels; it was designed to measure the Basic Linear Algebra Subprograms (BLAS) performance of mathematical kernel libraries, which are widely used by scientific applications. SPECjvm [7] and DaCapo [12] are benchmark suites for evaluating the Java Virtual Machine (JVM). The JVM is used as an infrastructure for language implementations not only of Java but also of Ruby, Groovy, Scala, and Clojure, so its performance has an impact on these languages. PostMark [36] is a benchmark for file systems, an important part of the operating system. Netperf [32] and iperf [57] are benchmarking tools for evaluating network bandwidth. SPEC CPU2006 [28] is one of the most widely used benchmarks for evaluating microarchitectures and compilers. Furthermore, one layer can be subdivided according to the domain at which a benchmark is aimed. Specialized benchmarks are created when general benchmarks like SPEC CPU2006 do not precisely represent the workload that users want to estimate. One example is Graph 500 [42], which is intended to measure the performance of data-intensive supercomputer applications. Since 1993, High Performance Computing (HPC) researchers have used [24] the High-Performance LINPACK (HPL) benchmark [46] to evaluate supercomputers. While HPC focused on three-dimensional (3-D) physical simulations based on numerical calculation, using HPL was appropriate. However, its workload differs from graph algorithms, which are important in emerging data-intensive applications. As we have seen, there are benchmark programs of many kinds, covering everything from the highest layer to the lowest and from general purpose to specific purpose. However, they focus on areas that are easy to evaluate, and there are only a few benchmark suites developed with the aim of comparing language implementations that run different languages [21] [61] [8].
1.1.2 Need for comparing programming language implementations
Different programming languages have been used for different purposes. For example, compiled languages like FORTRAN and C are mainly used for compute-intensive applications and system software, both of which require high performance. On the other hand, interpreted languages like Ruby, Python, and Perl are mainly used for everyday text processing, computer education, and web applications, which do not require high performance but do require high productivity. Java and C# sit between the two areas, for example in enterprise applications. These existing programming languages do not achieve both high performance and high productivity, but programmers long for such a language. For example, HPC researchers have developed new Partitioned Global Address Space (PGAS) languages such as X10 [20], Unified Parallel C (UPC) [17], and Chapel [18]. X10 and Chapel were developed under the DARPA HPCS language project [38], which shows that they are not toy languages. Moreover, packages for scientific calculation such as NArray [6] for Ruby and NumPy [60] for Python were developed for scientist programmers who want to use the same languages as in their daily work. As computer systems have evolved, interpreted languages have become faster than before, and the language implementations themselves have evolved through bytecode optimization and JIT compilation. Interpreted languages have therefore begun to move into areas that require high performance. However, learning a new language costs programmers time, and interpreted languages are still slower even when partially accelerated by additional libraries. Since there is no ultimate language offering both high productivity and high performance, we must select a language according to the situation. Consequently, comparing language implementations is necessary to decide in which language to write an application, balancing performance and productivity under given criteria. Productivity is difficult to evaluate and compare quantitatively because it interacts in complex ways with the richness of standard libraries, support from Integrated Development Environments (IDEs), reliability from type checking, the user community, and programmer preference; its evaluation should therefore be left to the individual programmer. Meanwhile, execution-time performance can be measured quantitatively by running the same program. Knowledge about language performance should be available before the programmer actually writes a program, so a benchmark for language comparison is needed.
1.1.3 Difficulties in comparing programming language implementations
Each programming language specification differs from the others. The differences are not only at the syntax level but also at the semantic level, for example in primitives, standard libraries, and programming paradigms. Thus it is impossible to translate every real-world application into another language while keeping its semantics. For example, prototype-based object orientation (OO) [19] and class-based OO are difficult to map onto each other. Emulating prototype-based OO in a class-based OO language is virtually the same as reimplementing the language because of its dynamic characteristics, and emulating class-based OO in a prototype-based language has no single canonical coding style, as is typically apparent in a tutorial of Lua [5]. However, by restricting the class of programs under certain assumptions, comparison becomes possible. Programmers who use computers for practical numerical calculations make less use of non-standard external libraries and of language-specific features that are difficult to use or learn than programmers with a computer science background. Their job is to solve mathematical, physical, or chemical problems, not to exercise programming languages or enhance computer systems. They learned low-productivity languages such as FORTRAN early in their university education and cannot spend time learning new programming features. Therefore, even if they move to a highly productive language, they continue to write programs that are semantically similar to their old ones. The main primitives needed in numerical calculations are numbers and collections of them, such as arrays. These data types have the same semantics across practical programming languages, apart from exceptional situations like arithmetic overflow or implicit type conversion. Thus, comparing programming languages is possible from the viewpoint of numerical calculation.
1.2 Our contributions
In this paper, we propose a performance comparison method for programming languages using automatic source-to-source translation. We focus on benchmarking programs of the kind typically written by HPC programmers who are used to languages with high performance and low productivity, under the assumption that such programmers do not use higher-order concepts such as map, reduce, and fold operations on lists, tail-call optimization, or monads in Haskell. The translator was designed to preserve almost all of the semantics between the base benchmarks and the translated benchmarks. We also propose the characteristics that the base benchmarks should have: they must be written in a statically typed object-oriented language like Java; they must be based on actual workloads of scientific HPC applications; the amount of analysis data accumulated in the past about them matters, because it lets us relate current systems running slow but productive languages to past systems; and the workload must be smaller than SPEC CPU2006, because a large workload makes it hard to evaluate slow language implementations. The contributions of this thesis are threefold:
1. Reduced cost of translating benchmarks. Once the translator is written, it can be applied to other benchmarks without further work. This makes it possible to keep up with the future evolution of workloads as long as the base language remains widely used.
2. Line-to-line correspondence between benchmarks, made possible by the automated source-to-source translator. Such a rigorous translation makes comparison easier than hand translation. The translation rules are all written in the translator, so it is also easier to verify that the translation is proper.
3. Proper workload selection and examination of an old, small benchmark with accumulated results for various systems. We selected the NAS Parallel Benchmarks (NPB) [11] as the numerical application workload and Dhrystone [61] as the system workload.
Using NPB and Dhrystone, we could measure characteristics of both the language implementations and the benchmarks themselves. Method-based Just-in-Time (JIT) compilers suffered from misprediction of the hotness of large methods. Small kernel benchmarks with short execution times are not suitable for measuring natively compiled languages because performance did not vary among them. Leaving aside the kernel benchmarks, all the interpreted languages were about 100 times slower than natively compiled languages. PyPy achieved higher performance than the other interpreted languages on the kernel benchmarks and Dhrystone, but was still 10 times slower than natively compiled languages. From a comparison of the PyPy JIT compiler's logs between application code and kernel code, we found that not only the JIT but also the GC must be improved for application benchmarks, and that NPB has workload diversity as a JIT and GC benchmark.
1.3 Composition of this paper
In Chapter 2, we review four types of related work: synthetic benchmarks from the early era, a website for language performance comparison, comparisons between Java and older languages, and comparisons of DSLs for HPC. In Chapter 3, we discuss the requirements for the base benchmark and the source-to-source translator. In Chapter 4, we measure various language implementations and analyze those that showed specific behaviors. In Chapter 5, we summarize the results and state recommendations.
Chapter 2
Related work
2.1 Synthetic benchmarks in early years
A synthetic benchmark is composed from an instruction or statement mix ratio collected from actual applications. From the 1970s to the 1980s, two synthetic benchmarks were widely used. They were implemented in several languages, using information retrieved from applications written in different languages.

Whetstone [21] is a synthetic benchmark based on the statement mix of actual scientific programs, written in ALGOL, used at NPL and Oxford University. It was designed to be easily ported to other languages. In [21] Curnow implemented PL/I and FORTRAN versions, and there is a converted C version of Whetstone [2] based on the FORTRAN version. Whetstone is based on scientific programs, but the floating-point operations performed in the benchmark are meaningless as scientific calculations.

Dhrystone [61] is a synthetic benchmark based on the statement distribution of programs from 16 different data collections covering FORTRAN, XPL, PL/I, SAL, ALGOL 68, Pascal, Ada, and C. The original version was written in Ada, Pascal, and C. Operations on strings are dominant in Dhrystone.

Small benchmarks like Whetstone and Dhrystone became obsolete as CPUs acquired caches large enough for the benchmark code to fit into, as CPUs became fast enough to make benchmark execution times short and unreliable, and as programmers moved to C/C++ to write applications. Instead of synthetic benchmarks, SPEC came into use, which collects frequently used real applications as workloads. Moreover, after C/C++, bytecode interpreters and dynamically typed languages emerged. These languages rely on virtual machines (VMs), Just-in-Time compilation (JIT), reflection, and garbage collection (GC). Programming styles and application workloads have changed so much that the information collected when Whetstone and Dhrystone were created is insufficient now.

Joshi et al. proposed workload cloning from runtime traces using microarchitecture-independent characteristics [34]. The output was C source consisting of assembly code that performs the cloned operations, plus stub code. Creating another workload from an existing SPEC binary preserved accuracy even though the result was a synthetic benchmark, but this method cannot be applied to cross-language comparison: a sequence of basic operations that mimics a higher-level operation in a scripting language is not an equivalent workload, whereas an equivalent exchange between assembly and C statements is easy.
2.2 The Computer Language Benchmarks Game
The Computer Language Benchmarks Game [8] is a comparison website that evaluates 27 languages with 13 benchmarks translated into each language. As far as we know, [8] is the largest benchmark suite with such language diversity. It consists of two parts. One part is algorithmic benchmarks such as an N-body physical simulation, Mandelbrot set calculation, permutation, a puzzle-game solver, a pi-digit calculator, and bioinformatics algorithms. The other part measures the performance of basic operations such as vector manipulation, threading, and memory management. However, its application diversity is biased: the algorithmic benchmarks other than the first two measure only integer performance, and four of them are bioinformatics benchmarks. The benchmark scale is also small. The largest of these benchmarks is less than 200 lines in Java, smaller than Dhrystone, which is about 300 lines in Java. The translation process of [8] has problems as well. The translation and optimization of benchmarks for each language is contributed by volunteers, so the implementations are incomplete. Volunteers can submit multiple implementations for each language, and the best one is adopted as the score of that language. This means the score is disturbed by the quality of the implementation, which depends on the optimization skill of the implementer. As the programmer population of a language grows, so does the probability that its implementer is highly skilled. Thus, the score reflects the demonstrated maximum speed of the language, which depends on the implementation, not the baseline performance.
2.3 Comparison between Java and other languages
Applying Java to HPC has been attempted for over 10 years, and Java performance has been measured by implementing benchmarks. However, these measurements covered a limited range of languages: C, FORTRAN, Java, and C#. Java occupies a middle position between interpreted languages and natively compiled languages in terms of productivity and performance. As enterprise applications requiring both high productivity and performance adopted Java, JVM vendors invested effort in speeding up the JVM by enhancing the JIT and GC, for example with inline caching, on-stack replacement, tracing JIT, and generational GC. Thus Java has approached C and FORTRAN, and applying it to HPC has become realistic.

To examine whether Java is useful for high-performance numerical calculation, benchmarks were developed that are also implemented in competing technologies for reference. Wade et al. implemented a Java version of LINPACK [4], and in [16] Bull measured its performance. LINPACK is reliable because it has been in use for over 30 years, but a single workload is not enough to measure performance.

The SciMark benchmark suite has a C implementation for reference [48]. The Java Grande benchmark suite measures low-level operations in addition to SciMark [15]. SciMark is a kernel benchmark suite containing Fast Fourier Transform (FFT), Jacobi Successive Over-Relaxation (SOR), Monte Carlo integration, sparse matrix multiply, and dense LU matrix factorization, all used in numerical calculation. The workload diversity is wide, but the benchmarks were newly developed, so their scores are not compatible with existing benchmarks.

Frumkin et al. implemented a Java version of NPB [27]. NPB contains kernel benchmarks similar to SciMark as well as Computational Fluid Dynamics application benchmarks. The original source code was written in FORTRAN 77. NPB has been used for over 20 years, and its scores are correlated with SPEC CFP2006 [58]. However, they focused only on Java performance.

Singer compared JVM and CLR implementations [53] by compiling C to Java bytecode and Common Intermediate Language (CIL) bytecode using the Java and .NET backends of the GNU Compiler Collection (GCC). They measured purely the performance of the virtual machine implementations because the source code and compiler frontend were the same. However, the input source code was written in C, which is not normally used to write programs for either the JVM or the CLR, so this does not represent typical use of these language implementations.

Vogels compared Common Language Infrastructure (CLI) implementations with Java [59], using the Java Grande benchmark suite and SciMark 2.0. Automated translation tools were used as well as manual translation. One C implementation, four Java implementations, and three CLR implementations were compared. As a result, Microsoft .NET CLR 1.1 performed as well as the IBM 1.3.1 JVM.

Weiskirchner compared Ada, C, and Java from the viewpoint of embedded environments [62]. Their study covered a wider diversity of operating systems and machines than other similar studies, but the selection of language implementations was biased toward embedded environments and the workloads were small.
2.4 Comparison between DSLs for HPC
As general-purpose computation on graphics processing units (GPGPU) has developed, special domain-specific languages (DSLs) [40] [14] [41] have been proposed for writing programs that run on GPUs, which are easier to use than high-level shading languages [39] [50] [45]. These languages share similar concepts: region specification and SPMD. As a result, these DSLs converged into two languages with similar semantics: Compute Unified Device Architecture (CUDA) on the NVIDIA platform and OpenCL [54] on the AMD platform. The CUDA compiler also supports OpenCL, so performance comparison of the DSLs on the same hardware is possible. Karimi et al. implemented the same program in both CUDA and OpenCL [35]. The CUDA implementation was consistently faster than the OpenCL implementation even though the two implementations have nearly identical code. However, they benchmarked only a Monte Carlo simulation of quantum dynamics, so the diversity of benchmark programs is low. Noaje et al. used a source-to-source translation technique to generate CUDA programs automatically from existing OpenMP [22] programs [44]. OpenMP is a directive-based DSL for exploiting multi-core processors. There is more existing OpenMP code than GPGPU code because OpenMP was proposed earlier than the GPGPU languages and, being directive-based, allows incremental improvement of existing serial code. This situation between OpenMP and GPGPU resembles the one between highly productive languages and high-performance languages, so the source-to-source translation in their work is similar to ours. However, they proposed only the translation technique and did not perform an actual performance evaluation. CUDA and OpenCL are based on C++, but actual GPU kernel code is written at the level of C and does not use high-level C++ features. Higher-level languages that do use C++ features, such as Intel ArBB [43] and C++ Accelerated Massive Parallelism (C++ AMP) [1], have been proposed, but these languages are not yet mature or officially released, so a performance comparison of them has not been performed.
Chapter 3
Methodology
3.1 Selection of base benchmarks
To translate benchmarks into other languages, we had to balance three considerations: which language to choose as the base language, whether to use a synthetic benchmark or an application benchmark, and whether to create a new benchmark or adopt an existing, traditional one.

The base language must not have special concepts that are difficult to translate directly into other languages, such as lazy evaluation, which behaves differently from eager evaluation, monads in Haskell [33], or computation expressions in F# [56].

Hand-made synthetic benchmarks are small because creating a benchmark is costly, and representing the actual behavior of diverse workloads with a small program requires gathering those workloads. Both Whetstone and Dhrystone are under 1,000 lines even though their authors gathered information from many applications, and as the size of the synthesized workload grows, reconstruction becomes harder. Automatically generated synthetic benchmarks emulate realistic workloads accurately and their size is easy to adjust, but they do not retain the look and feel of actual source code. As the languages we analyze evolve, their higher-level features come into use, so automatic synthesis cannot keep up: the look and feel of the synthesized benchmark loses touch with realistic source code.

Meaningful application benchmarks are large because workload diversity must be ensured. For example, SPEC CPU2006 contains duplicated parts as a workload [47]. Such redundancy makes the benchmark bigger than fundamentally necessary. Big benchmarks are impractical for language performance comparison: highly productive languages are slow, so the execution time of benchmark implementations in them is much longer than that of the high-performance language versions. SPEC CPU2006 takes a few hours and consumes more than 1 GB of memory on modern 32-bit computers. A benchmark of appropriate scale is needed. At the same time, to convince HPC programmers who are used to C or FORTRAN to write programs in highly productive languages, the performance comparison must be based on existing, proven programs.

We chose the Java version of the NAS Parallel Benchmarks (NPB) 3.0 as the base benchmarks, for the following reasons. First, Java is the most widely used object-oriented language environment and is accepted by conservative programmers. The word "Java" has two aspects: a language specification and a VM specification. Java as a language is a statically typed, class-based, object-oriented language widely used for business and Internet applications; in the TIOBE Index [30], Java was the No. 1 language in 9 of the 10 years from 2002 to 2012. Java as a runtime environment is a bytecode VM. The bytecode specification was clearly defined with the intent of transferring compiled bytecode over the Internet to environments ranging from embedded machinery to high-end servers, so there are many VM implementations in both academia and industry. Second, the original NPB is a mixture of FORTRAN 77 and C programs. Since our purpose includes reducing the cost of translating programs, the translator should have a single implementation; multiple translator implementations would increase maintenance cost. Third, NPB contains actual application workloads. SciMark is a similar benchmark suite that was also implemented in Java, but it does not contain actual, heavy application workloads. Last, NPB is a reliable benchmark suite for measuring the floating-point and parallel performance of computer systems. NPB has been used for over 20 years, and the first NPB paper has been cited by over 1,600 papers, more than SPEC CPU2000 and CPU2006 on an annualized basis.

There was no integer benchmark with both reliability and a size comparable to NPB, so we chose the Java version of the Dhrystone benchmark [3] as a supplemental integer benchmark. Dhrystone is small, and its working set is smaller than the caches of modern architectures [58]. However, performance measurements on past architectures have been accumulated for over 20 years, and the programming style was brought closer to modern style than the original through the translation to Java.

From the past data, we can learn what kind of historical computer, with the best compilers of its day, performs equivalently to a modern computer running a highly productive but slow language; this shows how many years the optimization technology of interpreted languages lags behind that of state-of-the-art compiled language implementations.
3.2 Translator
The most important purpose of the automatic translation is to compare the performance of programming language implementations under explicit and reasonable rules. Explicitness is needed so that the benchmark programs can be reproduced and verified formally; writing all the rules as a program satisfies this. Reasonable rules are defined as the translation rules that would be followed by a programmer who has learned a new language and has to write a new program in the same domain as one he or she previously wrote. Even if the programmer has learned new language concepts, improvement is restricted to incremental evolution because the new program is shaped by knowledge of the old language. The most likely outcome is a line-to-line translation into very similar semantics. Modern programming languages commonly have data structures like arrays, lists, and classes, and control statements such as while loops, for loops, conditional branches, and function calls. As long as the translated programs use only these basic elements, a translation that preserves the semantics is easy. Thus we can simulate such a situation with an automatic process if the translated program and its language are suitably restricted. We built a translator that processes Java source code and outputs each target language. Java has the basic control statements and data structures with an object-oriented flavor and does not have irregular semantics like monads. These characteristics fit scientific programs, which do not use complicated data structures. Therefore we could design straightforward translation rules that convert Java source into the other languages with line-to-line correspondence as far as possible. However, we had to add exceptional translation rules or workarounds to avoid the following problems, because the behavior and the semantics were not exactly the same among the languages we translated into.
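To make the idea of line-to-line correspondence concrete, the following sketch (our own illustration with invented names, not actual translator output) shows how a small Java method could be emitted into Ruby so that each statement maps onto one Ruby line:

# Java input (shown as comments):
#   double dot(double[] a, double[] x, int n) {
#     double sum = 0.0;
#     for (int i = 0; i < n; i++) { sum += a[i] * x[i]; }
#     return sum;
#   }
def dot(a, x, n)
  sum = 0.0
  for i in 0...n do      # Java's for (int i = 0; i < n; i++)
    sum += a[i] * x[i]
  end
  return sum
end

puts dot([1.0, 2.0], [3.0, 4.0], 2)   # => 11.0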
3.2.1 I/O and memory management
I/O and array-allocation functions vary among the target languages. For example, in PHP and Python an array cannot be allocated by specifying its size, and in PHP the amount by which an array can be extended at once is limited. Python, Ruby, and PHP have no multi-dimensional array primitive, so programmers must assign sub-arrays into the enclosing array. However, I/O and memory-allocation functions are called mostly in the initialization and termination steps and do not affect the performance of the numerical calculations, so line-to-line correspondence is not necessary for them. We therefore implemented functions that emulate Java's memory-allocation mechanism by hand, or replaced them with code snippets using ad-hoc translation rules.
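As an example of such hand-written emulation, a helper of roughly the following shape (the name alloc_2d is ours, for illustration only) can stand in for Java's new double[rows][cols] in Ruby:

# Emulate Java's "new double[rows][cols]": nested Ruby arrays filled with 0.0.
def alloc_2d(rows, cols)
  Array.new(rows) { Array.new(cols, 0.0) }
end

u = alloc_2d(3, 4)
u[2][3] = 1.5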
3.2.2 For loop
The expressiveness of for and foreach loops varies among the target languages. The for statement in C and Java consists of four parts: an initialization expression, a loop-condition expression, an increment expression, and a compound body statement. The other languages either do not implement all of these or do not give them the same semantics. Ruby has several commonly used ways to express an iteration loop with an index.
Source 3.1: Iteration loops with index in Ruby

# Iteration loop similar to C and Java
for i in 1..10 do
  do_something
end

# Ruby-specific style
10.times { |i|
  do_something
}
We replaced for-loops that cannot be expressed in the target language with the same semantics as the source language by while loops. We adopted the style that appears early in language tutorials and whose look and feel is similar to the for-loop of C or Java.
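As a sketch of this rule (the loop and values are ours, not taken from NPB), a Java loop whose increment expression has no Ruby for-loop equivalent, such as for (int i = 2; i <= n; i *= 2), would be emitted as a while loop:

# Java: for (int i = 2; i <= n; i *= 2) { ... }
n = 64
i = 2
while i <= n
  puts i      # loop body translated line by line
  i *= 2      # increment expression moved to the end of the body
end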
3.2.3 Arithmetic overflow
In Ruby, Python, and PHP, number primitives do not behave the same way as in compiled languages. In Ruby and Python, if an arithmetic overflow occurs on a fixed-precision integer primitive, the result is automatically converted into an arbitrary-precision integer. In PHP, the result is converted into a floating-point number instead. Using arbitrary-precision integers where they are unnecessary degrades performance, and automatic conversion into floating-point changes the meaning of the program. It is difficult to predict statically from the source code whether arithmetic overflow will occur. We avoided benchmarks in which overflow happens frequently, such as Section 1 of the Java Grande benchmark suite, but NPB has a part, the random-number generator, that is sensitive to the bit width and overflow behavior of the integer primitive, so we had to fix the translated program to avoid arithmetic overflow in PHP.
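Ruby's promotion can be observed directly; the following minimal sketch (ours, not taken from the benchmarks; the Fixnum width shown assumes 64-bit MRI 1.9) shows a value silently becoming an arbitrary-precision integer, which is the behavior the translated random-number generator has to avoid:

x = 2**62 - 1          # largest Fixnum on 64-bit MRI 1.9
puts x.class           # => Fixnum
y = x + 1              # exceeds the fixed width ...
puts y.class           # => Bignum: silently promoted, no wrap-around, no exception
puts y & 0xFFFFFFFF    # masking is one way to keep results within a fixed bit width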
3.2.4 Mimicking class inheritance mechanism in C
The Java version of NPB uses class inheritance. C does not have classes, so we had to mimic class inheritance. NPB is a computationally intensive benchmark whose score is not affected by the performance of object-oriented mechanisms. No virtual-function lookup occurs in NPB, so we did not have to implement polymorphism, which would require a virtual method table. In our implementation, an object is represented as a struct containing only data fields, and an inherited object is represented as a struct whose fields begin with the same fields as the parent class, followed by its additional fields.
3.3 Benchmarks overview
3.3.1 NAS Parallel Benchmarks
The NAS Parallel Benchmarks (NPB) are computationally intensive benchmarks developed at the Numerical Aerodynamic Simulation (NAS) Systems Division, NASA Ames Research Center. NPB contains computational fluid dynamics (CFD) applications and kernel codes for numerical algorithms. MG, CG, FT, and IS are kernel benchmarks.

MG is a simplified multi-grid kernel. Its data locality is the highest except for EP [25]. CG is a conjugate gradient kernel. The matrix is stored in Compressed Row Storage (CRS) format, so its data access is unstructured. FT is a Fast Fourier Transform (FFT) kernel that solves a partial differential equation with 3-D FFTs. IS is a bucket sort kernel. In the MPI implementation, the communication cost of IS is dominated by MPI_Alltoallv [51], which means that the kernel accesses a wide range of the array.

LU, SP, and BT are CFD benchmarks. They solve a 3-D compressible fluid system. LU uses the symmetric successive over-relaxation (SSOR) method; contrary to its name, LU decomposition is not performed. BT and SP are similar CFD codes in many respects, but their communication/computation ratios differ.

EP, the "embarrassingly parallel" kernel, is not implemented in the Java version. Its workload is random-number generation, a situation commonly found in Monte Carlo simulation codes, and it performs no interprocessor communication [11] [10] [25].

See reference [11] for a detailed description of the computational algorithms. The original source code of IS is implemented in C; the others are implemented in FORTRAN 77.
3.3.2 Dhrystone
Dhrystone is a hand-made synthetic benchmark. The source code consists of integer calculations, loads and stores on arrays, loads and stores on struct members, and string comparisons. Malloc and free are not exercised inside the benchmark iteration step in the original source code, so the memory footprint is small and stable. Figure 3.1 shows the call graph of the Ruby version of Dhrystone produced by our translator, obtained with ruby-prof and KCacheGrind. It shows the dynamic distribution of calls to primitive objects and gives an overview of the workload characteristics. The maximum depth of function calls is three, as shown in the call graph. Class#new corresponds to the creation of a temporary array of size 1, used to represent a pointer variable when a program written for a language with by-reference parameters is expressed in a language that passes parameters by value. The Fixnum class in Ruby is equivalent to an integer in C. From the call graph, we can confirm that integer calculations, integer comparisons, and object reference comparisons dominate the Dhrystone benchmark. There is only one parameter for scaling the benchmark, namely the iteration count; no other input is needed. This makes the benchmark small and portable but makes it difficult to enhance workload diversity, unlike modern benchmarks such as SPEC CPU2006. Because Dhrystone is a synthetic benchmark, it must be validated whether its source-code statement distribution is still compatible with current programs. Figure 3.2 shows the statement distribution of SPEC CINT2006 and Dhrystone. gcc shows heavier use of case, label, and goto statements, but a typical compiler has to be written with a huge number of conditional branches, so this is an exceptional case.
Figure 3.1: Dhrystone call graph (Ruby, iteration = 200,000)
Apart from gcc, every benchmark including Dhrystone shows a similar statement distribution. From the viewpoint of source-code construction, Dhrystone is a representative subset of SPEC CINT2006.
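The temporary single-element arrays reported as Class#new in Figure 3.1 come from emulating C-style pointer arguments; a minimal sketch of the idea (our illustration, not the translated Dhrystone source) looks like this:

# A C out-parameter such as   void incr(int *p) { *p += 1; }
# has no direct Ruby counterpart, so the translated code wraps the value
# in a one-element array and mutates element 0 instead.
def incr(p)
  p[0] += 1
end

x = [41]       # the temporary array counted as Class#new in the call graph
incr(x)
puts x[0]      # => 42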
Figure 3.2: Statement distribution of SPEC CINT2006 and Dhrystone
Chapter 4
Evaluation
4.1 Measurement environment
The system was a Dell PowerEdge R410 with a Xeon E5530 2.4 GHz quad-core processor providing 2 logical cores per physical core. Each core had a 256 KB L2 cache, and an 8 MB L3 cache was shared among the cores. The memory bandwidth was 25.6 GB/s, and the system had 12 GB of main memory. The operating system was 64-bit CentOS 5.5 with Linux kernel 2.6.18-194.32.1.el5. We compiled all the language implementations with GCC 4.5.1 using the -O3 -msse4.2 optimization flags.

We evaluated single-threaded NPB and Dhrystone translated into each language. The maximum problem size for NPB was class A for computational performance, and we additionally used classes S and W for JIT analysis. The iteration count for Dhrystone was 50,000,000. We also measured the parallel scalability of Ruby with a multithreaded implementation of NPB.
4.2 Characteristics of measured languages and their implementations
We selected Ruby, PHP, Python, Fortran 2003, and C99 as target languages, together with the original implementations by NAS in FORTRAN 77 and Java. Ruby, PHP, and Python are dynamically typed, class-based, object-oriented languages, all commonly used for web programming and text processing. Each of them has a runtime source-code evaluation function. Redefinition of functions is possible in Ruby and Python but not in PHP. The dynamic aspects of these languages make static optimization hard. PHP has no multithreading features. These language implementations partially share concepts or implementation. Ruby 1.8 is an abstract syntax tree (AST) interpreter. Ruby 1.9 [52] and Python 2.7 are bytecode interpreters; their bytecode specifications are not intended to provide inter-language portability, because the bytecode was introduced for performance. PyPy is a self-hosted Python interpreter written in Restricted Python, and its JIT compiler is a tracing JIT. Jython and JRuby run on the JVM; JRuby has its own JIT compiler. Rubinius and Unladen Swallow both use an LLVM backend for JIT compilation.

Java bytecode is a stack-based intermediate representation. Java 7 has the invokedynamic opcode for dynamic-language support, which improves polymorphic method invocation; the Java platform has thus begun to extend its reach as a host environment for dynamic languages. JIT compilation was introduced with the HotSpot VM in Java 1.3, and other vendor implementations later gained similar techniques [55].

Fortran 2003 [49] is the newest major update since Fortran 90. One feature that affects the semantics of Fortran is the object-oriented extension enabling class inheritance, which brings Fortran closer to highly productive languages. Thanks to this functionality, the translation from Java to Fortran was straightforward.

C99 is the C specification published in 1999 [26]. Coding style in C has changed since the 1980s, when the Dhrystone benchmark was published. C99 contains new features that affect performance, such as variable-length arrays, inline function definitions, and restrict pointers.

We implemented a Lua version of Dhrystone by hand for reference. Lua is a prototype-based object-oriented language, so a class-based object-oriented program cannot be mapped to Lua in a single straightforward manner; there are several fashions for emulating classes in a prototype-based language. Most of the Lua version is based on the C implementation and partially on the Java implementation.

The measured language implementations were Sun Java 1.6.0, Sun Java 1.7.0, Oracle JRockit 1.6.0_20 R28.1.0-4.0.1, IBM Java 6.0-9.0, Apache Harmony 6.0-jdk-991881, Ruby 1.9.2-p136, Ruby 1.9.3-110206, JRuby 1.6.0.RC1, Rubinius 1.2.0, Python 2.7.1, PyPy 1.4.1, PyPy 1.7, Jython 2.5.2, Unladen Swallow, PHP 5.3.6, GCC 4.5.1, GCC 4.6.0, ICC 11.1.073, LLVM 2.8, LuaJIT 2.0.0-beta6, and IFORT/ICC 12.0.1.
4.3 NAS Parallel Benchmarks
4.3.1 Performance overview
Figure 4.1 gives an overview of all the language implementations we measured; the vertical axis is on a logarithmic scale. We can observe a performance difference of more than 100 times between the slowest and fastest language implementations.

Figure 4.1: NPB performance overview
Figure 4.2 shows the performance of the original NPB implementations. This is the baseline that all the language implementations should aim at, because FORTRAN is the highest-performing language here and has the biggest market share in scientific computing. ICC 11 achieved 2,841 MFlops with the original FORTRAN 77 implementation of MG.

Figure 4.2: Original NPB performance
Figure 4.3 shows the performance of the natively compiled languages: Fortran 2003, C99, and the original implementation (FORTRAN 77 and C). Variability of the scores was observed among the benchmarks with long and mid-range execution times: BT, FT, LU, MG, and SP. For example, the original FORTRAN 77 implementation of MG compiled with ICC 11 was 2.77 times faster than the C99 implementation compiled with LLVM 2.8. However, the two benchmarks with short execution times showed less variability: the original FORTRAN 77 implementation of CG on IFORT 12 was 1.52 times faster than the Fortran 2003 implementation on GCC 4.6.0, and the Fortran 2003 implementation of IS on IFORT 12 was only 1.04 times faster than the same implementation on GCC 4.6.0; these were the biggest differences in CG and IS. Figure 4.4 shows the standard deviation of the scores in Figure 4.3. The standard deviation of CG and IS was less than 100 Mops, while that of the other benchmarks was over 400 Mops.

Figure 4.3: NPB performance of native compiled languages

These results show that short kernel benchmarks clearly do not reflect the performance of natively compiled language implementations on modern computers, but benchmarks as large as SPEC CPU2006, whose programs exceed 10,000 lines of source code, are not needed to observe variability: a benchmark is large enough once it reaches a few thousand lines.
Figure 4.4: Standard deviation of the score of native compiled languages

Figure 4.5 shows the performance of the Java version of NPB measured with several vendors' JVMs. IBM Java was always faster than JRockit and Harmony, showing stable performance. By contrast, the Sun JVM showed unstable behavior, which can be explained by the characteristics of its JIT compiler. We discuss the effect of JIT on performance in Section 4.3.2.

Figure 4.5: NPB Java version performance
The performance of the Python implementations is shown in Figure 4.6. BT, FT, LU, and MG did not run on Python 3.2 because the semantics of integer division changed in Python 3. PyPy 1.7 was the fastest implementation on CG, FT, and MG, but its BT, IS, and LU performance was lower than that of Python 2.7.1. We discuss the efficiency of tracing JIT in Section 4.3.2.

Figure 4.6: NPB Python version performance
The performance of the Ruby implementations is shown in Figure 4.7. Ruby 1.9.2 was twice as fast as Ruby 1.8.7 on all benchmarks. Ruby 1.9.2 or 1.9.3 was the fastest except on CG, where JRuby was the fastest. JRuby contains a JIT compiler, but the JVM also has one; we discuss which one affects the performance in Section 4.3.2.

Figure 4.7: NPB Ruby version performance

There are few PHP implementations, so we chose the official PHP implementation and a Java-based implementation, Quercus. However, Quercus timed out at runtime on every benchmark except CG and IS (Figure 4.8). PHP's language specification is not well suited to numerical calculation, as we saw in Section 3.2.3, so PHP's performance should be seen as a reference measure relative to the other languages.

Figure 4.8: NPB PHP version performance
4.3.2 Effects by JIT
Sun Java performance degradation
From Figure 4.5 we can see that Sun Java, both JDK 1.6 and JDK 1.7, achieved only 4.6%-14.9% of IBM Java's Mops on BT, LU, and SP, while performing as well as or better than IBM Java on CG, FT, IS, and MG.

BT, LU, and SP are all CFD application benchmarks, which are larger than the kernel benchmarks CG, FT, IS, and MG. Each CFD benchmark has over 2,000 lines of source code excluding comments, while the kernels have fewer than 1,000. Moreover, we compared the sizes of all methods in the class files and observed that the CFD benchmarks contain methods of 6,000 to 10,000 bytes, while the largest method in the kernel benchmarks was about 2,000 bytes. We profiled the BT benchmark with hprof and observed that the percentage of execution time spent in the BT.compute_rhs method on Sun Java was about 10 times larger than on the other JVMs. We split BT.compute_rhs, originally 9,723 bytes, into two smaller methods of 2,929 and 6,798 bytes, and the program became 10 times faster on Sun Java, with no acceleration on the other JVMs (Figure 4.9).
Ruby
JRuby and Rubinius have JIT compilers, both based on method JIT. JRuby translates the intermediate representation (IR) of a "hot" Ruby method into an anonymous Java class, which is then JIT-compiled into native code by the JVM. Rubinius translates its own IR into LLVM [37] IR, and JIT compilation is performed by the LLVM infrastructure. We measured the behavior of JIT-compilation invocation with problem size S. JRuby did not attempt to JIT-compile big methods such as BT.compute_rhs, and this behavior did not change even when the JIT-compile threshold, which is based on call counts, was set to 0, meaning that every method the interpreter walks through should be compiled. Rubinius did try to JIT-compile the methods that JRuby skipped, but almost all attempts to inline code in the big methods failed. The methods that were not inlined were the multiply and addition methods of the fixed-precision number class and the load method for array elements.

Figure 4.9: Sun Java performance improvement after method division

These results show that method JIT suffers from big methods. With a method-based JIT, the size of the code fragment compiled into optimized bytecode or native machine code depends on the size of the method. If the code is divided evenly into methods of appropriate size, the compilation units are naturally of appropriate size; but if the method-size distribution is uneven, big methods arise and create coarse-grained compilation units. A typical method-JIT profiler counts method calls, whereas a hot method often contains loops. Therefore, such a profiler mispredicts the hotness of a method that contains many primitive operations but calls no other methods. Moreover, typical embedded environments have resource limitations; in such environments, compiling a big method is impossible or does not pay off, so heuristics that cut off big methods are adopted. For these reasons, the method compilation that is actually needed is not performed completely, and the penalty of misprediction or such restrictions makes actual scientific applications slow.
PyPy
We measured how JIT compilation is performed in with tracing JIT [13] using NPB
python version on PyPy 1.4.1. Figure 4.10 shows the timing of JIT compilation
and GC during execution with problem size S. SP was eliminated due to a runtime
error. Pale green area means that compiled native code is running. Dark green
area means that JIT profiling and compilation are performed. Red area means that
garbage collector is triggered.
CG performed the longest execution time of native code among NPB. FT performed the highest ratio between native code execution time and JIT compilation
time, but the execution time was dominated 36.6% by GC, that was higher than
CG. Execution time of JIT optimization was the highest in MG. GC and JIT optimization dominated the execution time of BT and LU. Compared to the other
benchmarks, the frequency of JIT and GC in BT and LU was irregular. In contrary,
FT, IS, and MG showed regular behavior of JIT and GC.
We measured the CG not only with S but also with W and A, because the performance of CG on PyPy 1.4 was about 10 times faster than any other interpreter
language implementations without PyPy 1.7, and the execution time was relatively
short. The results are shown in Figure 4.11. As problem size increased, the time
jit-running increased.
The CG.conj_grad method contains 10 for-loops and only one external function call, to Math.sqrt. The call graph obtained from CG class S on Python 2.7 (Figure 4.12) shows that conj_grad accounted for over 90% of the execution time, although CG.conj_grad was called
Figure 4.10: JIT log of NPB except for SP (CLASS=S) on PyPy 1.4.1
only 16 times during execution. In contrast, the BT.Schwarztrauber method accounted for over 80% of the execution time and was called over 1000 times. Thus, the execution locality of CG was high. The locality of code execution affects the efficiency of JIT compilation; to measure the performance of a JIT compiler, the code locality must vary. NPB provides workload variation from the viewpoint of JIT compilation efficiency.
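The notion of code locality can be pictured with a schematic Python sketch (illustrative shapes only, not the translated NPB code; all names are invented): a CG-like kernel keeps control inside a few loops of a single function with one external call, whereas a structure like BT.Schwarztrauber reaches its hot work through a small helper that is invoked thousands of times.

import math

def conj_grad_like(v):
    # CG-style shape: a handful of loops inside ONE function, with a single
    # external call (math.sqrt). Nearly all time is spent in this body, so
    # a tracing JIT finds one hot loop region and stays in compiled code.
    s = 0.0
    for _ in range(200):           # outer iteration
        acc = 0.0
        for x in v:                # inner loop over the data
            acc += x * x
        s += math.sqrt(acc)
    return s

def small_kernel(x):
    return 0.5 * x + 0.25

def schwarztrauber_like(v):
    # BT/FT-style shape: the hot work is reached through a helper that is
    # called over and over, so the "hotness" is spread across many small
    # method invocations rather than concentrated in one big loop body.
    s = 0.0
    for _ in range(200):
        for x in v:
            s += small_kernel(x)   # thousands of method calls
    return s

if __name__ == "__main__":
    data = [float(i % 13) for i in range(10_000)]
    print(conj_grad_like(data), schwarztrauber_like(data))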
On PyPy 1.4, IS completes in less than 0.02 seconds with problem size S, so this
Figure 4.11: JIT log of NPB CG (CLASS=W, A) on PyPy 1.4.1
Figure 4.12: Call graph of NPB CG (CLASS=S) on Python 2.7
size is too small to observe the behavior of the language implementation. We therefore also measured the behavior of the JIT compiler with the bigger problem sizes W and A. The results are shown in Figure 4.13.
As the problem size increased, the percentage of execution time spent in JIT-compiled code decreased, while the GC execution time increased. At size A, over 90% of the execution time was GC-related processing. The function that came to dominate the execution time as the problem size grew was Random.randlc, as shown in Figure 4.15. Its caller, IS.initKeys, is not called while the benchmark kernel is running. Random.randlc also dominated the execution time on Python 2.7, a non-JIT implementation, as shown in Figure 4.16. IS.initKeys took 131.9 seconds on PyPy 1.4.1 but only 1.66 seconds on
Figure 4.13: JIT log of NPB IS (CLASS=W, A) on PyPy 1.4.1
Figure 4.14: JIT log of NPB IS (CLASS=S, W, A, B) on PyPy 1.7
PyPy 1.7, as shown in Figure 4.17. Nevertheless, the tendency that GC execution time grows with the problem size is preserved, as shown in Figure 4.14. In the other benchmarks, the initialization functions either did not dominate the execution time or were the same functions that were also called while the benchmark was running. These results show that the total execution time of IS, in addition to the score of the benchmark kernel, can be used to evaluate the garbage collector and the JIT, and that the IS kernel is a workload that stresses the GC.
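The following sketch (schematic Python following the standard NPB randlc recurrence, not the translated benchmark source; init_keys and its parameters are illustrative) shows why an initialization phase like IS.initKeys is so expensive on an interpreter: every key requires at least one call into a pure-Python generator, and each call performs a dozen floating-point operations whose boxed intermediate results are, in a plain interpreter, exactly the allocation pattern that stresses the GC.

# 46-bit linear congruential generator in double precision, in the style of
# the NPB randlc routine (schematic re-statement, not the benchmark source).
R23, T23 = 2.0 ** -23, 2.0 ** 23
R46, T46 = 2.0 ** -46, 2.0 ** 46

def randlc(x, a):
    # Split a and x into 23-bit halves so the 46-bit product fits in a double.
    a1 = int(R23 * a); a2 = a - T23 * a1
    x1 = int(R23 * x); x2 = x - T23 * x1
    t1 = a1 * x2 + a2 * x1
    t2 = int(R23 * t1)
    z = t1 - T23 * t2
    t3 = T23 * z + a2 * x2
    t4 = int(R46 * t3)
    x = t3 - T46 * t4
    return x, R46 * x          # new seed, uniform deviate in (0, 1)

def init_keys(n, max_key, seed=314159265.0, a=1220703125.0):
    # IS-style key generation: at least one generator call (and several
    # boxed floats and temporaries) per key, done once before the
    # benchmark kernel runs.
    keys = []
    for _ in range(n):
        seed, r = randlc(seed, a)
        keys.append(int(r * max_key))
    return keys

if __name__ == "__main__":
    print(init_keys(10, 1 << 11))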
Figure 4.15: Call graph of NPB IS (CLASS=A) on PyPy 1.4.1
Figure 4.16: Call graph of NPB IS (CLASS=A) on Python 2.7
It was confirmed that the characteristics of NPB are also diverse from the viewpoint of the combination of JIT and GC, just as previous work [25][31][63][9] showed from the standpoint of parallel distributed environments and microarchitecture. IS and the actual CFD codes triggered JIT and GC continuously throughout execution. These benchmarks showed that improving only the JIT does not accelerate applications that trigger GC frequently.
Figure 4.17: Call graph of NPB IS (CLASS=A) on PyPy 1.7
4.3.3 Multithread performance of Ruby
HPC scientific programs use multiple processes and multithreading to increase parallelism. The original and Java versions of NPB have parallel implementations. We translated the Java version into Ruby and measured the parallel processing performance. Ruby implementations other than JRuby have a Global Interpreter Lock (GIL), also called the Giant VM Lock (GVL). This lock lets the interpreter interact safely with natively compiled modules. We measured the impact of the GIL and JRuby's multithread performance on NPB by increasing the number of threads (Figure 4.18).
The results show that the impact of the GVL is larger on Ruby 1.9 than on Ruby 1.8, and that Rubinius suffers the largest impact of the GVL among the four implementations.
Figure 4.19 shows JRuby’s scaling up to 8 threads.
Compared to the FORTRAN77 implementation, the CFD benchmarks, which have a high computation cost, scaled equally well, but the kernel benchmarks, which have a low computation cost, showed lower scalability on JRuby. This indicates that the cost of locking on JRuby is high.
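The effect of such a lock can be reproduced in a few lines of Python (shown here because CPython has a GIL comparable to the GVL of Ruby 1.8/1.9; the thread counts and the workload are arbitrary): a CPU-bound loop split across more threads does not finish faster, because only one thread can hold the interpreter lock at a time.

import threading
import time

def cpu_bound(n=2_000_000):
    # Purely CPU-bound work: never releases the GIL for a blocking call.
    s = 0
    for i in range(n):
        s += i * i
    return s

def run_with_threads(num_threads):
    # Split the same total work across num_threads threads and time it.
    threads = [threading.Thread(target=cpu_bound,
                                args=(2_000_000 // num_threads,))
               for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    base = run_with_threads(1)
    for n in (1, 2, 4):
        elapsed = run_with_threads(n)
        # Under a GIL the "acceleration rate" stays near (or below) 1.0.
        print(f"{n} thread(s): {elapsed:.3f}s, speedup {base / elapsed:.2f}x")

On an implementation without a global lock, such as JRuby, the same experiment scales with the thread count, as Figure 4.19 shows.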
[Chart data: acceleration rate versus number of threads (1 to 4) for the BT, LU, CG, MG, FT, SP, and IS benchmarks; series: JRuby 1.6.0, Rubinius 1.2.0, Ruby 1.8.7, and Ruby 1.9.3.]
Figure 4.18: Scaling of multithread benchmarks on Ruby implementations
[Chart data: acceleration rate versus number of threads (1 to 8) on JRuby; series: BT, CG, FT, IS, LU, MG, and SP.]
Figure 4.19: Scaling of multithread benchmarks on JRuby
[Chart data: acceleration rate versus number of threads (1 to 8) for the FORTRAN77 implementation; series: BT, CG, FT, IS, LU, MG, and SP.]
Figure 4.20: Scaling of multithread benchmarks, implemented in FORTRAN77
4.4 Dhrystone
Figure 4.21 shows the Dhrystones per second of each language implementation. The fastest
[Chart data: Dhrystones per second on a logarithmic scale (1 to 100,000,000), grouped by language: Lua, Ruby, Python, PHP, Java, C99, and the original implementation.]
Figure 4.21: Dhrystone (iteration count = 50,000,000)
implementation of a scripting language, PyPy 1.7, is still 10 or more times slower than the natively compiled language implementations. The Dhrystone score of PyPy 1.7 on the current system was close to that of the DEC AlphaStation 500/400 measured in [58], which was introduced in 1996 with an Alpha 21164A 400 MHz processor. The scores of Ruby 1.8.7 and PHP were close to that of the EPSON PRO-486, which was introduced in 1993 with an i486DX2 66 MHz processor.
Figure 4.22 shows the JIT log of Dhrystone with an iteration count of 200,000. In the log of PyPy 1.4, JIT-related operations were performed until 47.8% of the total execution time had elapsed, while PyPy 1.7 converged at 11.7%. This means the convergence speed increased in PyPy 1.7. After the JIT compiler converged, GC was still running continuously.
Figure 4.22: JIT log of Dhrystone (iteration count = 200,000)
Chapter 5
Conclusion
In this paper, we proposed a method to compare the performance of programming language implementations using benchmarks produced by a source-to-source translator from base benchmarks that are compatible with existing standard benchmarks. A computer system is built of various layers, and each layer has benchmark programs to measure its performance. However, only a few existing benchmark suites focus on comparing programming language implementations. Existing language comparison benchmarks have the problem that they were translated into the various languages by hand, which makes it difficult to compare performance function by function and to keep up with the evolution of both languages and workloads.
We proposed an automated translation process that makes it easier to generate benchmark programs with the same semantics for several languages. In comparison with hand-coded benchmarks, our benchmarks have two advantages. First, we can compare language implementations with exact line-to-line correspondence, because the translation rules are written explicitly as a translator. Second, the limitations on programming language diversity and benchmark size are reduced, because once the source-to-source translator has been created, it automatically keeps up with updates to the base benchmarks. We achieved translation of an actual application benchmark into 5 languages, which would have been impossible by hand.
We discussed the metrics the benchmark implementer should consider to make the benchmarks reflect an appropriate workload and language feature usage. Concepts that the benchmarked languages do not have in common must not be used in the base benchmarks. Hand-written synthetic benchmarks cost too much to be usable. Assembly-level automatic benchmark synthesis cannot be used because it cannot reflect features across languages. Meaningful application benchmarks are too big for comparing interpreted languages, which are slower than the languages used by production applications. Subsetted benchmarks must therefore be used; however, the base benchmarks must be proven benchmarks so that language implementations can be compared against past benchmark data. We selected NPB and Dhrystone as the benchmarks that best satisfy the criteria above.
We translated NPB and Dhrystone into 5 languages. Performance evaluation was performed on 30 language implementations for NPB and 25 implementations for Dhrystone. Our comparison is the largest among related work in terms of both language diversity and implementation diversity, with benchmark programs on the scale of over 1,000 lines.
From our comparison, several findings and recommendations about JIT compilers, GC, the GIL, workload characteristics, and total performance were obtained.
The performance of language implementations that perform method-based, i.e. coarse-grain, JIT compilation was heavily affected when JIT compilation was not performed because of misprediction or VM restrictions on big methods. As we saw in NPB, actual scientific applications contain big methods, so these applications suffer on coarse-grain JIT compilers. This point would not have been revealed by kernel-only benchmarks, in which every method is small and easy to optimize.
Workload diversity is needed not only for network or microarchitectural evaluation but also for JIT compilers and garbage collectors. The CFD applications caused continuous compilation and garbage collection. PyPy improved its JIT compiler to cover wider workloads from version 1.4.1 to 1.7, but the integer sorting application became slower. The runtime profile of the integer sorting benchmark was dominated by GC. To improve a language implementation across a wider range of workloads, the implementer must also put effort into the GC.
Natively compiled languages showed small differences on the short-execution-time benchmarks, but on the long-execution-time benchmarks they differed by up to 2.77 times. The standard deviation of the Mops score of the long-execution-time benchmarks was over 400 Mops, while that of the short-execution-time benchmarks was less than 100 Mops.
Comparison among the interpreted languages showed that PyPy was the fastest implementation for scientific calculations. However, compared with the maximum score of the compiled language implementations, even the best score of PyPy was 9 times slower than the compiled languages. The other scores were much slower, by about 20 to 30 times. Quercus was the slowest implementation among those we compared.
The impact of the GIL on performance was measured on the Ruby implementations. JRuby was the only implementation that scaled with the number of threads. The GIL had a larger impact on Ruby 1.9 than on Ruby 1.8. Except for IS, the impact of the GIL did not increase as the number of threads increased. The recommendation from this result is that multithreading on Ruby implementations other than JRuby is meaningless when writing scientific applications. Moreover, although JRuby does scale, the acceleration potential of a dynamic language demonstrated by PyPy was at least 10 times on the computational kernels, which is a higher rate than the performance increase that can be achieved by parallelization on currently available multicore processors. Writing a program that utilizes multithreading is hard enough that using Ruby is not worth it compared with switching to Python, which has similar concepts.
We conclude that, among the implementations we measured, PyPy 1.7 is recommended for both productive and high-performance scientific computation. However, PyPy is 20 to 30 times slower than the compiled languages and less than 10 times slower than Java. We do not recommend using Quercus for scientific calculation.
By analyzing the distribution of statements, expressions, operators and function calls in SPEC CINT2006 and Dhrystone, it was observed that the differences in their syntactic characteristics are small. The biggest characteristic of Dhrystone from the viewpoint of dynamic languages with GC is that there is no dynamic malloc and free during the benchmark iteration. This was confirmed by PyPy's JIT behavior, which showed that JIT-compiled code dominated the execution after convergence and GC rarely happened. Previous work showed that the scores of Dhrystone and SPEC CINT2006 correlate; therefore, the integer performance of language implementations can be measured with NPB IS and Dhrystone while taking GC effects into account.
Dhrystone is useful for observing how multi-stage JIT optimization works. Dhrystone has only one way to make the workload heavier, which is to increase the iteration count parameter. A simple, highly repeated iteration easily triggers both method-based JIT and tracing JIT. However, the body of the iteration still reflects an actual integer workload of the past. By measuring how quickly the Dhrystones-per-second rate converges, we can measure the performance of the JIT compiler while keeping compatibility with past benchmark scores.
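One way to observe this convergence, sketched below in Python with an invented dhrystone_pass standing in for the real benchmark body, is to run the iteration in fixed-size batches and watch the per-batch rate settle: on a JIT implementation the rate climbs during the first batches (interpretation, profiling, and compilation) and then flattens, and the time needed to flatten is the convergence speed.

import time

def dhrystone_pass():
    # Stand-in for one Dhrystone iteration: a fixed mix of integer, string
    # and record-style operations, with no allocation that outlives the
    # iteration (as in the real benchmark loop).
    a = [0] * 10
    s = "DHRYSTONE PROGRAM"
    for i in range(10):
        a[i] = (i * 7 + 3) % 13
    return a[5] + len(s)

def measure_convergence(batch=50_000, batches=20, tolerance=0.02):
    rates = []
    for b in range(batches):
        start = time.perf_counter()
        for _ in range(batch):
            dhrystone_pass()
        rate = batch / (time.perf_counter() - start)   # "dhrystones" per second
        rates.append(rate)
        # Converged once the rate stops changing by more than `tolerance`.
        if len(rates) >= 2 and abs(rates[-1] - rates[-2]) < tolerance * rates[-2]:
            return b + 1, rates
    return batches, rates

if __name__ == "__main__":
    batches_needed, rates = measure_convergence()
    print(f"rate converged after {batches_needed} batches; "
          f"final rate {rates[-1]:.0f}/s")

On a plain interpreter the rate is flat from the first batch; on a JIT implementation the number of batches needed before the rate flattens reflects the warm-up cost discussed above.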
The Dhrystone scores of the interpreted languages on a modern computer system were equivalent to those of desktop or workstation systems introduced in the early 1990s. We conclude that the optimization techniques of dynamic interpreted languages are about 15 years behind those of statically typed compiled languages. To beat statically typed languages, further enhancement is needed.
References
[1] Blazing-fast code using GPUs and more, with Microsoft Visual C++. http://ecn.channel9.msdn.com/content/DanielMoth CppAMP Intro.pdf.
[2] C Converted Whetstone Double Precision Benchmark. http://www.netlib.org/benchmark/whetstone.c.
[3] Dhrystone benchmark written in Java. http://www.okayan.jp/DhrystoneApplet/.
[4] Linpack benchmark, Java version. http://www.netlib.org/benchmark/linpackjava/.
[5] lua-users wiki: Object Orientation Tutorial. http://lua-users.org/wiki/ObjectOrientationTutorial.
[6] Numerical Ruby NArray. http://narray.rubyforge.org/.
[7] SPECjvm2008 benchmarks. http://www.spec.org/jvm2008/.
[8] The Computer Language Benchmarks Game. http://shootout.alioth.debian.org/.
[9] G. Abandah. Characterizing shared-memory applications: A case study of the NAS parallel benchmarks. Hewlett-Packard Labs Technical Report HPL-97-24, 1997.
[10] D. Bailey, T. Harris, W. Saphir, R. Van Der Wijngaart, A. Woo, and M. Yarrow. The NAS parallel benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, 1995.
[11] D.H. Bailey, E. Barszcz, J.T. Barton, D.S. Browning, R.L. Carter, L. Dagum, R.A. Fatoohi, P.O. Frederickson, T.A. Lasinski, R.S. Schreiber, et al. The NAS parallel benchmarks summary and preliminary results. In Supercomputing '91: Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, pages 158–165. IEEE, 1991.
[12] Stephen M. Blackburn, Robin Garner, Chris Hoffmann, Asjad M. Khang, Kathryn S. McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee, J. Eliot B. Moss, B. Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann. The DaCapo benchmarks: Java benchmarking development and analysis. In OOPSLA '06: Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications, pages 169–190, New York, NY, USA, 2006. ACM.
[13] C.F. Bolz, A. Cuni, M. Fijalkowski, and A. Rigo. Tracing the meta-level: PyPy's tracing JIT compiler. In Proceedings of the 4th workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems, pages 18–25. ACM, 2009.
[14] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: stream computing on graphics hardware. In ACM Transactions on Graphics (TOG), volume 23, pages 777–786. ACM, 2004.
[15] J.M. Bull, L.A. Smith, L. Pottage, and R. Freeman. Benchmarking Java against C and Fortran for scientific applications. In Proceedings of the 2001 joint ACM-ISCOPE conference on Java Grande, pages 97–105. ACM, 2001.
[16] J.M. Bull, L.A. Smith, M.D. Westhead, D.S. Henty, and R.A. Davey. A benchmark suite for high performance Java. Concurrency - Practice and Experience, 12(6):375–388, 2000.
[17] W.W. Carlson, J.M. Draper, D.E. Culler, K. Yelick, E. Brooks, and K. Warren. Introduction to UPC and language specification. Center for Computing Sciences, Institute for Defense Analyses, 1999.
[18] B.L. Chamberlain, D. Callahan, and H.P. Zima. Parallel programmability and the Chapel language. International Journal of High Performance Computing Applications, 21(3):291–312, 2007.
[19] C. Chambers, D. Ungar, and E. Lee. An efficient implementation of SELF, a dynamically-typed object-oriented language based on prototypes. ACM SIGPLAN Notices, 24(10):49–70, 1989.
[20] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. Von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In ACM SIGPLAN Notices, volume 40, pages 519–538. ACM, 2005.
[21] H.J. Curnow and B.A. Wichmann. A synthetic benchmark. The Computer Journal, 19(1):43, 1976.
[22] L. Dagum and R. Menon. OpenMP: an industry-standard API for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46–55, 1998.
[23] J. Dongarra. The LINPACK benchmark: An explanation. In Supercomputing, pages 456–474. Springer, 1988.
[24] J.J. Dongarra, H.W. Meuer, and E. Strohmaier. Top500 supercomputers. 1993.
[25] A.A. Faraj and X. Yuan. Communication characteristics in the NAS parallel benchmarks. PhD thesis, Florida State University, 2002.
[26] International Organization for Standardization, International Electrotechnical Commission, et al. ISO/IEC 9899:1999.
[27] M.A. Frumkin, M. Schultz, H. Jin, and J. Yan. Implementation of the NAS parallel benchmarks in Java. 2002.
[28] J.L. Henning. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 34(4):1–17, 2006.
[29] J. Hutchinson. Culture, communication, and an information age Madonna. IEEE Professional Communication Society Newsletter, 45(3):1–7, 2001.
[30] P. Jansen. TIOBE programming community index. TIOBE Software.
[31] H. Jin and R.F. Van der Wijngaart. Performance characteristics of the multi-zone NAS parallel benchmarks. Journal of Parallel and Distributed Computing, 66(5):674–685, 2006.
[32] R. Jones et al. Netperf: a network performance benchmark. Hewlett-Packard Company, 1996.
[33] S.P. Jones. Haskell 98 language and libraries: the revised report. Cambridge University Press, 2003.
[34] A. Joshi, L. Eeckhout, and L. John. The return of synthetic benchmarks. In 2008 SPEC Benchmark Workshop, page 1, 2008.
[35] K. Karimi, N.G. Dickson, and F. Hamze. A performance comparison of CUDA and OpenCL. arXiv preprint arXiv:1005.2581, 2010.
[36] J. Katcher. PostMark: A new file system benchmark. Technical Report TR3022, Network Appliance, 1997. www.netapp.com/tech library/3022.html.
[37] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Code Generation and Optimization, 2004. CGO 2004. International Symposium on, pages 75–86. IEEE, 2004.
[38] E. Lusk and K. Yelick, guest editors. Languages for high-productivity computing: the DARPA HPCS language project. Parallel Processing Letters, 17(1):89–102, 2007.
[39] W.R. Mark, R.S. Glanville, K. Akeley, and M.J. Kilgard. Cg: A system for programming graphics hardware in a C-like language. In ACM Transactions on Graphics (TOG), volume 22, pages 896–907. ACM, 2003.
[40] M.D. McCool, Z. Qin, and T.S. Popa. Shader metaprogramming. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 57–68. Eurographics Association, 2002.
[41] P. McCormick, J. Inman, J. Ahrens, J. Mohd-Yusof, G. Roth, and S. Cummins. Scout: a data-parallel programming language for graphics processors. Parallel Computing, 33(10-11):648–662, 2007.
[42] R.C. Murphy, K.B. Wheeler, B.W. Barrett, and J. Ang. Introducing the Graph 500. Cray User's Group (CUG), 2010.
[43] C.J. Newburn, B. So, Z. Liu, M. McCool, A. Ghuloum, S.D. Toit, Z.G. Wang, Z.H. Du, Y. Chen, G. Wu, et al. Intel's Array Building Blocks: A retargetable, dynamic compiler and embedded language. In Code Generation and Optimization (CGO), 2011 9th Annual IEEE/ACM International Symposium on, pages 224–235. IEEE, 2011.
[44] G. Noaje, C. Jaillet, and M. Krajecki. Source-to-source code translator: OpenMP C to CUDA. In High Performance Computing and Communications (HPCC), 2011 IEEE 13th International Conference on, pages 512–519. IEEE, 2011.
[45] C. Peeper and J.L. Mitchell. Introduction to the DirectX 9 high level shading language.
[46] A. Petitet, R.C. Whaley, J. Dongarra, and A. Cleary. HPL: a portable implementation of the High-Performance Linpack benchmark for distributed-memory computers.
[47] Aashish Phansalkar, Ajay Joshi, and Lizy K. John. Subsetting the SPEC CPU2006 benchmark suite. SIGARCH Comput. Archit. News, 35:69–76, March 2007.
[48] R. Pozo and B. Miller. SciMark 2.0. http://math.nist.gov/scimark2, 2000.
[49] J. Reid, WG5 Convener. The new features of Fortran 2003. Published at ftp://ftp.nag.co.uk/sc22wg5, 1601.
[50] R.J. Rost. OpenGL (R) Shading Language. Addison-Wesley Professional, 2005.
[51] W. Saphir, R. Van der Wijngaart, A. Woo, and M. Yarrow. New implementations and results for the NAS parallel benchmarks 2. In 8th SIAM Conference on Parallel Processing for Scientific Computing, pages 14–17, 1997.
[52] K. Sasada. YARV: yet another RubyVM: innovating the Ruby interpreter. In Companion to the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, pages 158–159. ACM, 2005.
[53] J. Singer. JVM versus CLR: a comparative study. In Proceedings of the 2nd international conference on Principles and practice of programming in Java, pages 167–169. Computer Science Press, Inc., 2003.
[54] J.E. Stone, D. Gohara, and G. Shi. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering, 12(3):66, 2010.
[55] T. Suganuma, T. Ogasawara, M. Takeuchi, T. Yasue, M. Kawahito, K. Ishizaki, H. Komatsu, and T. Nakatani. Overview of the IBM Java just-in-time compiler. IBM Systems Journal, 39(1):175–193, 2000.
[56] D. Syme, A. Granicz, A. Cisternino, and Ltd MyiLibrary. Expert F#. Apress, 2007.
[57] A. Tirumala, F. Qin, J. Dugan, J. Ferguson, and K. Gibbs. Iperf: The TCP/UDP bandwidth measurement tool, 2005.
[58] H. Tomari and K. Hiraki. Retrospective study of performance and power consumption of computer systems. IPSJ Online Transactions, 4:217–227, 2011.
[59] W. Vogels. Benchmarking the CLI for high performance computing. In Software, IEE Proceedings, volume 150, pages 266–274. IET, 2003.
[60] S. Walt, S.C. Colbert, and G. Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.
[61] R.P. Weicker. Dhrystone: a synthetic systems programming benchmark. Communications of the ACM, 27(10):1013–1030, 1984.
[62] M. Weiskirchner. Comparison of the execution times of Ada, C and Java. Paper, September 2003.
[63] F.C. Wong, R.P. Martin, R.H. Arpaci-Dusseau, and D.E. Culler. Architectural requirements and scalability of the NAS parallel benchmarks. In Supercomputing, ACM/IEEE 1999 Conference, pages 41–41. IEEE, 1999.