A PERFORMANCE COMPARISON METHOD OF PROGRAMMING LANGUAGES USING SOURCE TO SOURCE TRANSLATION TECHNIQUE
ソースコード間トランスレータを用いた多種言語処理系性能比較法の研究

by 48-106126 Takafumi Nose (野瀬 貴史)

A Master Thesis submitted to the Graduate School of the University of Tokyo on February 8, 2012, in partial fulfillment of the requirements for the degree of Master of Information Science and Technology in Computer Science.

Thesis Supervisor: Kei Hiraki (平木 敬), Professor of Computer Science

ABSTRACT

The performance of a language implementation is one of the criteria for selecting the language in which to write a program. Performance comparison between different language implementations is possible by porting the same benchmark to each language and comparing the results. However, if the porter applies optimizations that give an excessive advantage to a specific language, the objectivity of the comparison is lost. In addition, vague translation rules obscure the correspondence between the ported benchmarks and make it difficult to compare them at the function or loop level rather than at the whole-program level. Existing benchmarks are ported by hand, so the objectivity of the translation is lost. Moreover, the scores of existing benchmarks have an objectivity problem because they are not compatible with standard benchmarks used for system evaluation, such as SPEC or the NAS Parallel Benchmarks. To address these issues, we propose a method that enhances the objectivity of both the program translation and the benchmark score. First, we generate the benchmarks with an automatic translator whose translation rules are stated explicitly, which gives the translation a verifiable basis. Second, the base benchmark is constructed to represent the workload of existing, proven benchmarks. With this method, we translated the benchmarks into 5 languages with small effort and compared the performance of 30 language implementations, including those running the base benchmarks. We show that all implementations of Ruby, Python, and PHP except PyPy are more than 100 times slower than those of C, Fortran, and Java in terms of both floating-point and system performance; that method-based JIT compilation has an inherent weakness on large methods; and that the NAS Parallel Benchmarks provide workload diversity for dynamic languages.

Acknowledgements

I would like to express my deepest gratitude to Professor Kei Hiraki, whose comments and suggestions were invaluable not only throughout the course of my study but also in my life. Special thanks also go to Dr. Koichi Sasada, whose comments on Twitter encouraged and helped me very much. Hisanobu Tomari, Yasuo Ishii, and Koichi Nakamura gave me constructive comments and advice; I am deeply grateful for their discussions with me. My study would not have started without their ideas.
I would also like to express my gratitude to Naoki Tanida, Goki Honjo, Kenichi Koizumi, and all the members of the Hiraki Laboratory. Their optimism and witty encouragement gave me power and motivation.

Contents

1 Introduction
  1.1 Background
    1.1.1 Purposes and types of benchmark
    1.1.2 Need for comparing programming language implementations
    1.1.3 Difficulties in comparing programming language implementations
  1.2 Our contributions
  1.3 Composition of this paper
2 Related work
  2.1 Synthetic benchmarks in early years
  2.2 The Computer Language Benchmarks Game
  2.3 Comparison between Java and other languages
  2.4 Comparison between DSLs for HPC
3 Methodology
  3.1 Selection of base benchmarks
  3.2 Translator
    3.2.1 I/O and memory management
    3.2.2 For loop
    3.2.3 Arithmetic overflow
    3.2.4 Mimicking class inheritance mechanism in C
  3.3 Benchmarks overview
    3.3.1 NAS Parallel Benchmarks
    3.3.2 Dhrystone
4 Evaluation
  4.1 Measurement environment
  4.2 Characteristics of measured languages and their implementations
  4.3 NAS Parallel Benchmarks
    4.3.1 Performance overview
    4.3.2 Effects by JIT
    4.3.3 Multithread performance of Ruby
  4.4 Dhrystone
5 Conclusion
References

List of Figures

3.1 Dhrystone call graph (Ruby, iteration = 200,000)
3.2 Statement distribution of SPEC CINT2006 and Dhrystone
4.1 NPB performance overview
4.2 Original NPB performance
4.3 NPB performance of native compiled languages
4.4 Standard deviation of the score of native compiled languages
4.5 NPB Java version performance
4.6 NPB Python version performance
4.7 NPB Ruby version performance
4.8 NPB PHP version performance
4.9 Sun Java performance improvement after method division
4.10 JIT log of NPB except for SP (CLASS=S) on PyPy 1.4.1
4.11 JIT log of NPB CG (CLASS=W, A) on PyPy 1.4.1
4.12 Call graph of NPB CG (CLASS=S) on Python 2.7
4.13 JIT log of NPB IS (CLASS=W, A) on PyPy 1.4.1
4.14 JIT log of NPB IS (CLASS=S, W, A, B) on PyPy 1.7
4.15 Call graph of NPB IS (CLASS=A) on PyPy 1.4.1
4.16 Call graph of NPB IS (CLASS=A) on Python 2.7
4.17 Call graph of NPB IS (CLASS=A) on PyPy 1.7
4.18 Scaling of multithread benchmarks on Ruby implementations
4.19 Scaling of multithread benchmarks on JRuby
4.20 Scaling of multithread benchmarks, implemented in FORTRAN77
4.21 Dhrystone (iteration count = 50,000,000)
4.22 JIT log of Dhrystone (iteration count = 200,000)
Chapter 1
Introduction

1.1 Background

1.1.1 Purposes and types of benchmark

It is important to measure and estimate the performance of each layer of a computer system, such as the microarchitecture, network interface, compiler, operating system, runtime environment, numerical calculation kernel, and higher-level algorithms, when designing a total computer system or application, because such systems are built as a vertical integration of these layers. The performance of these layers can be evaluated by benchmarking. Benchmarking feeds data collected from real-world workloads into software or hardware and compares their relative performance under specific criteria, for example program execution time, latency, code size, hardware resource consumption, and power consumption. Processors and system software are easy to evaluate because the criteria are physical quantities such as instructions per clock (IPC) and turnaround time. Higher-level layers, such as programming languages and user interfaces, are more difficult to evaluate because human factors enter into their evaluation.

Many benchmarks have been proposed and used to evaluate one or several of the layers mentioned above. For example, the image data named "Lena" [29] is one of the standard benchmarks for evaluating image processing algorithms. The LINPACK benchmark [23] is a representative benchmark for evaluating numerical calculation kernels. It was designed to measure the Basic Linear Algebra Subprograms (BLAS) performance of mathematical kernel libraries, which are popular among scientific applications. SPECjvm [7] and DaCapo [12] are benchmark suites for evaluating the Java Virtual Machine (JVM). The JVM is used as an infrastructure for language implementations not only of Java but also of Ruby, Groovy, Scala, and Clojure, so the performance of the JVM has an impact on these languages. PostMark [36] is a benchmark for file systems, one of the important parts of an operating system. Netperf [32] and iperf [57] are benchmarking tools for evaluating network bandwidth. SPEC CPU2006 [28] is one of the most widely used benchmarks for evaluating microarchitectures and compilers.

Furthermore, one layer can be subdivided according to the domain toward which a benchmark is oriented. Specialized benchmarks are created when general benchmarks like SPEC CPU2006 do not precisely represent the workload the user wants to estimate. One example is Graph 500 [42], which is intended to measure the performance of data-intensive supercomputer applications. Since 1993, High Performance Computing (HPC) researchers have used [24] the high-performance LINPACK (HPL) benchmark [46] to evaluate supercomputers. While HPC focused on three-dimensional (3-D) physical simulation built on numerical calculation, HPL was appropriate; however, its workload differs from graph algorithms, which are important in emerging data-intensive applications.
As we have seen, there are various kinds of benchmark programs that evaluate everything from the highest layer to the lowest and from general purpose to specific purpose. However, they focus on areas that are easy to evaluate, and only a few benchmark suites have been developed from the perspective of comparing language implementations that run different languages [21] [61] [8].

1.1.2 Need for comparing programming language implementations

Different programming languages have been used for different purposes. For example, compiled languages like FORTRAN and C are mainly used for compute-intensive applications and system software, both of which require high performance. On the other hand, interpreted languages like Ruby, Python, and Perl are mainly used for daily text processing, computer education, and web applications, which do not require high performance but do require high productivity. Java and C# occupy a middle position between these two areas, for example in enterprise applications. These existing programming languages do not achieve both high performance and high productivity, yet programmers desire such a language. For example, HPC researchers developed new Partitioned Global Address Space (PGAS) languages like X10 [20], Unified Parallel C (UPC) [17], and Chapel [18]. X10 and Chapel were developed under the DARPA HPCS language project [38], which means they are not toy projects. Moreover, packages for scientific calculation, such as NArray [6] for Ruby and NumPy [60] for Python, were developed for scientist programmers who want to use the same languages they use daily.

As computer systems have evolved, interpreted languages have become faster than before. The language implementations themselves have also evolved as bytecode optimization and JIT compilation were developed. Therefore, interpreted languages have begun to move into areas that require high performance. However, learning a new language costs programmers effort, and interpreted languages are still slower than compiled ones even when partially accelerated by additional libraries. Since there is no ultimate language with both high productivity and high performance, we must select a language for each situation. This is why comparing language implementations is necessary: to decide in which language to write an application, weighing performance against productivity under given criteria. Productivity is difficult to evaluate and compare quantitatively because it interacts in complex ways with the richness of the standard libraries, the assistance of an Integrated Development Environment (IDE), the reliability provided by type checking, the user community, and the programmer's preference. Therefore, the evaluation of productivity should be left to the programmer. Meanwhile, performance in execution time can be measured quantitatively by running the same program. Knowledge about language performance should be available before the programmer actually writes a program; thus a benchmark for language comparison is needed.

1.1.3 Difficulties in comparing programming language implementations

Programming language specifications differ from each other not only at the syntax level but also at the semantic level, in primitives, standard libraries, and programming paradigms. Thus it is impossible to translate every real-world application into other languages while keeping its semantics. For example, prototype-based object orientation (OO) [19] and class-based OO are difficult to map onto each other.
Emulating prototype-based OO in class-based OO is virtually the same as reimplementing a language because of its dynamic characteristics, and emulating class-based OO in prototype-based OO has no unique coding style, as is typically apparent in a tutorial of Lua [5]. However, by restricting the class of programs under certain assumptions, comparison becomes possible. Programmers who use computers for practical numerical calculations make less use of non-standard external libraries and of language-specific features that are difficult to use or learn than programmers with computer science backgrounds. Their work is to solve mathematical, physical, or chemical problems, not to exercise programming languages or enhance computer systems. They learned low-productivity languages such as FORTRAN early in their university education and cannot spend time learning new programming features. Therefore, even after moving to a highly productive language, they continue to write programs that are semantically similar to their old programs. The main primitives needed in numerical calculations are numbers and collections of them, such as arrays. These data types have the same semantics across practical programming languages, apart from exceptional situations like arithmetic overflow or implicit type conversion. Thus, comparing programming languages is possible from the standpoint of numerical calculation.

1.2 Our contributions

In this paper, we propose a performance comparison method for programming languages using automatic source-to-source translation. We focus on benchmarking programs as typically written by HPC programmers who are used to languages with high performance and low productivity, under the assumption that such programmers do not use high-level concepts like map, reduce, and fold operations on lists, tail-call optimization, or monads in Haskell. The translator was designed to keep almost all semantics identical between the base benchmarks and the translated benchmarks. We also identified the characteristics the base benchmarks should have. The base benchmarks must be written in a statically typed object-oriented language like Java. They must be based on actual workloads of scientific HPC applications. The quantity of analysis data accumulated in the past for the base benchmarks matters, because it lets us determine what kind of past computer system corresponds to a current system running a slow but productive language. The workload size must be smaller than SPEC CPU2006, because a large workload makes it difficult to evaluate slow language implementations. The contribution of this thesis is threefold:

1. Reduced cost of translating benchmarks. Once the translator is written, it can be applied to other benchmarks without further work. This makes it possible to keep up with the future evolution of workloads as long as the base language remains widely used.

2. Line-to-line correspondence between benchmarks, made possible by the automated source-to-source translator. Such rigorous translation makes comparison easier than hand translation. Since all the translation rules are written in the translator, it is also easier to verify that the translation is proper.

3. Proper workload selection, and examination of old small benchmarks that have accumulated results on a wide range of systems. We selected the NAS Parallel Benchmarks (NPB) [11] as the numerical application workload and Dhrystone [61] as the system workload.
Using NPB and Dhrystone, we measured characteristics of both the language implementations and the benchmarks themselves. Method-based Just-in-Time (JIT) compilers suffered from mispredicting the hotness of large methods. Small kernel benchmarks with short execution times are not suitable for measuring natively compiled languages because performance barely varied among them. Excluding the kernel benchmarks, all the interpreted languages performed about 100 times slower than the natively compiled languages. PyPy performed better on the kernel benchmarks and Dhrystone than the other interpreted languages, but was still 10 times slower than the natively compiled languages. From a comparison of the PyPy JIT compiler's logs between application code and kernel code, we found that not only the JIT but also the GC must be improved for application benchmarks, and that NPB provides workload diversity as a JIT and GC benchmark.

1.3 Composition of this paper

In Chapter 2, we review four types of related work: synthetic benchmarks of the early era, a website for language performance comparison, comparisons between Java and older languages, and comparisons of DSLs for HPC. In Chapter 3, we discuss the requirements for the base benchmarks and the source-to-source translator. In Chapter 4, we measure various language implementations and analyze those that showed specific behaviors. In Chapter 5, we summarize the results and state recommendations.

Chapter 2
Related work

2.1 Synthetic benchmarks in early years

A synthetic benchmark is composed according to an instruction or statement mix ratio collected from actual applications. From the 1970s to the 1980s, two synthetic benchmarks were widely used. They were implemented in several languages, using information retrieved from applications implemented in different languages. Whetstone [21] is a synthetic benchmark based on the statement mix of actual scientific programs used at NPL and Oxford University, written in ALGOL. It was designed to be easily ported to other languages. In [21], Curnow implemented PL/I and FORTRAN versions. There is a C version of Whetstone [2] converted from the FORTRAN version. Whetstone is based on scientific programs, but the floating-point operations it performs are meaningless as scientific calculations. Dhrystone [61] is a synthetic benchmark based on the statement distribution of programs collected from 16 data collections covering FORTRAN, XPL, PL/I, SAL, ALGOL 68, Pascal, Ada, and C. The original version was written in Ada, Pascal, and C. String operations are dominant in Dhrystone.

Small benchmarks like Whetstone and Dhrystone became obsolete as CPUs acquired caches big enough to hold the entire benchmark code, as CPUs became so fast that benchmark execution times became short and unreliable, and as programmers moved to C/C++ to write applications. Instead of synthetic benchmarks, SPEC came into use, which collects frequently used actual applications as workloads. Later, after C/C++, bytecode-interpreted and dynamically typed languages emerged. These languages use virtual machines (VMs), Just-in-Time (JIT) compilation, reflection, and garbage collection (GC). Programming styles and application workloads have changed so much that the information collected when Whetstone and Dhrystone were created is insufficient now. Joshi et al. proposed workload cloning from runtime traces using microarchitecture-independent characteristics [34].
The output was C source consisting of assembly code that performs the cloned operations, apart from stub code. Creating a new workload from existing SPEC binaries preserved accuracy even though the result was a synthetic benchmark, but this method cannot be applied to cross-language comparison: a sequence of basic operations that mimics a higher-level operation in a scripting language is not equivalent as a workload, whereas the equivalent exchange of assembly and C statements is easy.

2.2 The Computer Language Benchmarks Game

The Computer Language Benchmarks Game [8] is a comparison website that evaluates 27 languages with 13 benchmarks translated into each language. As far as we know, [8] is the largest benchmark suite with such language diversity. It consists of two parts. One comprises algorithmic benchmarks such as an N-body physical simulation, Mandelbrot set calculation, permutation, a puzzle game solver, a pi digits calculator, and algorithms for bioinformatics. The other measures the performance of basic operations such as vector manipulation, threading, and memory management. However, its application diversity is biased. The algorithmic benchmarks except the first two measure only integer performance; in particular, four bioinformatics benchmarks are included. The benchmark scale is also small: the largest of these benchmarks is less than 200 lines in Java, which is smaller than Dhrystone at about 300 lines in Java.

The translation process of [8] has problems. The translation and optimization of the benchmarks for each language is contributed by volunteers, so the implementations are of uneven quality. Volunteers can submit multiple implementations for each language, and the best implementation is adopted as the score of that language. This means the score is disturbed by the quality of the implementation, which depends on the optimization skills of the implementer. The bigger the programmer population of a language, the higher the probability that a highly skilled implementer contributed. Thus the score reflects the proven maximum speed of the language as achieved by a particular implementation, not its baseline performance.

2.3 Comparison between Java and other languages

Applying Java to HPC has been attempted for over 10 years, and over that period the performance of Java has been measured with dedicated benchmarks. However, these measurements covered a limited range of languages: C, FORTRAN, Java, and C#. Java sits between interpreted languages and natively compiled languages in the metrics of productivity and performance. As enterprise applications requiring both high productivity and performance adopted Java, JVM vendors spent effort speeding up the JVM by enhancing JIT and GC, for example with inline caching, on-stack replacement, tracing JIT, and generational GC. Thus Java has come close to C and FORTRAN, and applying it to HPC is realistic. To examine whether Java is useful for high-performance numerical calculation, benchmarks were developed that are also implemented in competing technologies for reference. Wade et al. implemented a Java version of LINPACK [4], and Bull [16] measured its performance. LINPACK is reliable because it has been used for over 30 years, but a single workload is not enough to measure performance. The SciMark benchmark suite has a C implementation for reference [48]. The Java Grande benchmark suite measures low-level operations in addition to SciMark [15].
SciMark is a kernel benchmark suite containing a Fast Fourier Transform (FFT), Jacobi Successive Over-Relaxation (SOR), Monte Carlo integration, sparse matrix multiply, and dense LU matrix factorization, all used in numerical calculations. The diversity of the workload is wide, but because the benchmarks were newly developed, their scores are not compatible with existing benchmarks. Frumkin et al. implemented a Java version of NPB [27]. NPB contains kernel benchmarks similar to SciMark plus Computational Fluid Dynamics application benchmarks. The original source code was written in FORTRAN 77. NPB has been used for over 20 years, and its score is correlated with SPEC CFP2006 [58]. However, they focused only on Java performance. Singer compared JVM and CLR implementations [53] by compiling C to Java bytecode and Common Intermediate Language (CIL) bytecode using the Java and .NET backends of the GNU Compiler Collection (GCC). This purely measured the performance of the virtual machine implementations because the source code and compiler frontend were the same. However, the input source code was written in C, which is not the language normally used to program either the JVM or the CLR, so this does not represent typical usage of these language implementations. Vogels compared Common Language Infrastructure (CLI) implementations with Java [59], using the Java Grande benchmark suite and SciMark 2.0. Automated translation tools were used as well as manual translation. One C implementation, four Java implementations, and three CLR implementations were compared. As a result, Microsoft .NET CLR 1.1 performed as well as the IBM 1.3.1 JVM. Weiskirchner compared Ada, C, and Java in embedded environments [62]. That study had a wider diversity of operating systems and machines than other similar studies, but the selection of language implementations was biased toward embedded environments and the workloads were small.

2.4 Comparison between DSLs for HPC

As general-purpose computation on graphics processing units (GPGPU) developed, special domain-specific languages (DSLs) [40] [14] [41] were proposed for writing GPU programs more easily than with high-level shading languages [39] [50] [45]. These languages shared similar concepts: region specification and SPMD. Eventually these DSLs converged on two languages with similar semantics: Compute Unified Device Architecture (CUDA) on the NVIDIA platform and OpenCL [54] on the AMD platform. The CUDA compiler also supports OpenCL, so performance comparison of these DSLs on the same hardware is possible. Karimi et al. implemented the same program in both CUDA and OpenCL [35]. The CUDA implementation was consistently faster than the OpenCL implementation even though the two implementations had nearly identical code. However, they benchmarked a single program, a Monte Carlo simulation of quantum dynamics, so the diversity of benchmark programs was low. Noaje et al. used a source-to-source translation technique to generate CUDA programs automatically from existing OpenMP [22] programs [44]. OpenMP is a directive-based DSL for utilizing multi-core processors. There is more existing OpenMP code than GPGPU code because OpenMP was proposed earlier than the GPGPU languages and, being directive-based, allows incremental improvement of existing serial code. This relation between OpenMP and GPGPU is similar to the one between highly productive languages and high-performance languages; thus the source-to-source translation in that work is similar to ours.
However, they proposed only the translation technique and did not perform an actual performance evaluation. CUDA and OpenCL are based on C++, but actual GPU kernel code is written at the level of C and does not use high-level C++ features. Higher-level languages that exploit C++ features, such as Intel ArBB [43] and C++ Accelerated Massive Parallelism (C++ AMP) [1], have been proposed, but these languages are not yet mature or officially released, so a performance comparison of them has not been performed.

Chapter 3
Methodology

3.1 Selection of base benchmarks

To translate benchmarks into other languages, we had to balance three decisions: which language to choose as the base language, whether to use a synthetic benchmark or an application benchmark, and whether to create a new benchmark or adopt an existing, traditional one. The base language must not have special concepts that are difficult to translate directly into other languages, such as lazy evaluation, which behaves differently from eager evaluation, monads in Haskell [33], or computation expressions in F# [56]. A hand-made synthetic benchmark is small because creating a benchmark is costly: gathering divergent workloads is required so that a small program represents their actual behavior. Both Whetstone and Dhrystone are under 1,000 lines even though their authors gathered information from various applications, and as the size of the workload to be synthesized grows, reconstruction by hand becomes harder. An automatically generated synthetic benchmark accurately emulates realistic workloads and its size is easily adjustable, but it does not reproduce the look and feel of actual source code. As the languages we analyze evolve, their higher-level features come into use; automatic synthesis cannot follow this evolution because the synthesized benchmark loses touch with realistic source code.

A meaningful application benchmark is big because workload diversity must be ensured. For example, SPEC CPU2006 contains duplicated parts as a workload [47]; such redundancy makes a benchmark bigger than fundamentally needed. Big benchmarks are useless for language performance comparison: highly productive languages are so slow that the execution time of benchmark implementations in them is far longer than that of the high-performance versions. SPEC CPU2006 takes a few hours and consumes more than 1 GB of memory on a modern 32-bit computer. A benchmark of proper scale is needed. At the same time, to convince HPC programmers used to C or FORTRAN to write programs in highly productive languages, the performance comparison must be based on existing, proven programs.

We chose the Java version of the NAS Parallel Benchmarks (NPB) 3.0 as the base benchmarks, for the following reasons. First, Java is the most widely used object-oriented language environment and is accepted by conservative programmers. The word "Java" has two aspects: a language specification and a VM specification. Java as a language is a statically typed, object-oriented, class-based language, widely used for business and internet applications; in the TIOBE Index [30], Java was the No. 1 language for 9 of the 10 years from 2002 to 2012. Java as a runtime environment is a bytecode VM. The bytecode specification was stated clearly with the intent that compiled bytecode be transferred over the Internet to environments ranging from embedded machinery to high-end servers.
Thus there are many VM implementations in both academia and industry. Second, the original NPB is a mixture of FORTRAN 77 and C programs, while the translator should have a single implementation: our purpose includes reducing the cost of translating programs, and multiple translator implementations would increase maintenance cost. Third, NPB contains actual application workloads; SciMark is a similar benchmark suite also implemented in Java, but it contains no actual, heavy application workload. Last, NPB is a reliable benchmark for measuring the floating-point and parallel performance of computer systems. NPB has been used for over 20 years, and the first NPB paper has been cited by over 1,600 papers, more per year than SPEC CPU2000 and CPU2006.

There was no integer benchmark with both reliability and a size near NPB's, so we chose the Java version of the Dhrystone benchmark [3] as a supplemental integer benchmark. The program size of Dhrystone is small, and the scale of its workload is below the cache size of modern architectures [58]. However, performance measurements on past architectures have been accumulated for over 20 years, and the programming style was brought closer to modern style than the original during the translation to Java. From the past data, we can determine what kind of historical computer with a maximally optimizing compiler performs equivalently to a modern computer running a highly productive but slow language, which shows how many years the optimization technology of interpreted languages lags behind that of state-of-the-art compiled language implementations.

3.2 Translator

The most important purpose of the automatic translation is to compare the performance of programming language implementations under explicit and reasonable rules. Explicitness is needed to reproduce and verify the benchmark programs formally; by writing all the rules as a program, explicitness is satisfied. Reasonable rules are defined as the translation rules a programmer would follow who has learned a new language and must write a new program within the same domain as before. Even after learning a new language and its concepts, improvement is restricted to incremental evolution because the programs the programmer writes are shaped by knowledge of the old language. The most likely outcome is line-to-line translation into very similar semantics. Modern programming languages commonly have data structures like arrays, lists, and classes, and control statements such as while loops, for loops, conditional branches, and function calls. As long as the translated programs use only these basic elements, translation that keeps the semantics is easy. Thus we can simulate this situation with an automatic process if the translated program and its language are suitably restricted. We built a translator that reads Java source code and outputs each target language. Java has the basic control statements and data structures with an object-oriented flavor and has no irregular semantics like monads. These characteristics fit scientific programs, which do not use complicated data structures. Therefore we could design straightforward translation rules to convert Java source to the other languages with line-to-line correspondence as far as possible. However, we had to introduce exceptional translation rules or countermeasures for the following problems, because behavior and semantics were not exactly the same among the languages we translated into.

3.2.1 I/O and memory management

I/O and array allocation functions vary among the target languages. For example, an array cannot be allocated with a given size in PHP and Python. The size by which an array can be extended at once is restricted in PHP. A multi-dimensional array primitive is not supported in Python, Ruby, or PHP, so programmers must assign sub-dimensional arrays into the upper-dimensional array. However, I/O and memory allocation functions are called mostly in the initialization and termination steps. These functions do not affect the performance of the numerical calculations, so line-to-line correspondence is unnecessary for them; we therefore implemented functions that emulate Java's memory allocation mechanism by hand, or replaced them with code snippets via ad-hoc translation rules.

3.2.2 For loop

The expressiveness of for loops and foreach loops varies among the target languages. The for statement in C and Java consists of four parts: an initialization expression, a loop condition expression, an increment expression, and a compound body statement. The other languages do not implement all of these or do not have the same semantics. Ruby has multiple commonly used ways to express an iteration loop with an index.

Source 3.1: Iteration loops with index in Ruby

    # Iteration loop similar to C and Java
    for i in 1..10 do
      do_something
    end

    # Ruby-specific style
    10.times { |i| do_something }

We replaced for loops that cannot be written in the target language with the same semantics as the source language with while loops. We adopted the style that appears early in language tutorials and has a look and feel similar to the for loop of C or Java.
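As an illustration (the code below is a sketch, not the translator's actual output), the rule can be seen as a desugaring of the four-part Java for statement into a while loop with identical semantics; it is this while-loop structure that is emitted for targets without a C-style for:

    // Java input accepted by the translator:
    for (int i = 0; i < n; i++) {
        a[i] = 2.0 * a[i];
    }

    // The equivalent while-loop form (shown in Java for clarity;
    // the same structure is rendered in Ruby, Python, or PHP):
    int i = 0;
    while (i < n) {
        a[i] = 2.0 * a[i];
        i = i + 1;
    }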
3.2.3 Arithmetic overflow

In Ruby, Python, and PHP, the behavior of the number primitives is not the same as in compiled languages. In Ruby and Python, if an arithmetic overflow occurs on a fixed-precision integer primitive, the result is converted automatically into an arbitrary-precision integer. In PHP, the result is instead converted into a floating-point number. Using arbitrary-precision integers where they are unnecessary degrades performance, and automatic conversion to floating point changes the meaning of the program. It is difficult to predict statically from the source code whether arithmetic overflow will occur. We avoided benchmarks in which overflow happens frequently, such as Section 1 of the Java Grande benchmark suite, but NPB has a part that is sensitive to the bit width and overflow behavior of the integer primitive, in its random number generator; we therefore had to fix the translated program so that arithmetic overflow does not occur in PHP.
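A contrived sketch, not NPB's actual generator, of how a single multiplication diverges across the measured languages; the constant is arbitrary and chosen only so that the 32-bit product overflows:

    int seed = 1812433253;           // arbitrary large 32-bit value
    int wrapped = seed * seed;       // Java: wraps modulo 2^32
    long exact = (long) seed * seed; // overflow-safe form: widen first
    // Ruby/Python: seed * seed is promoted to an arbitrary-precision
    //              integer (value preserved, fixed-width speed lost).
    // PHP:         seed * seed is converted to a floating-point number,
    //              losing low-order bits and changing the result.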
3.2.4 Mimicking class inheritance mechanism in C

The Java version of NPB uses class inheritance. C has no classes, so we had to mimic class inheritance. NPB is computation-intensive, so the performance of object-oriented mechanisms does not affect its score. No virtual function lookup occurs in NPB, so we did not have to implement polymorphism, which would require a virtual method table. In our implementation, an object is represented as a struct and has only data fields. An inherited object is represented as a struct whose additional fields are placed after the same fields as the parent class.

3.3 Benchmarks overview

3.3.1 NAS Parallel Benchmarks

The NAS Parallel Benchmarks (NPB) are computation-intensive benchmarks developed at the Numerical Aerodynamic Simulation (NAS) Systems Division, NASA Ames Research Center. NPB contains computational fluid dynamics (CFD) applications and kernel codes for numerical algorithms. MG, CG, FT, and IS are kernel benchmarks. MG is a simplified multi-grid kernel; its data locality is the highest except for EP [25]. CG is a conjugate gradient method kernel; the matrix is stored in Compressed Row Storage (CRS) format, so its data access is unstructured. FT is a Fast Fourier Transform (FFT) kernel that solves a partial differential equation using 3-D FFTs. IS is a bucket sort kernel; in the MPI implementation, its communication cost is dominated by MPI_Alltoallv [51], which means the kernel accesses a wide range of the array. LU, SP, and BT are CFD benchmarks that solve a 3-D compressible fluid system. LU uses the symmetric successive over-relaxation (SSOR) method; contrary to its name, no LU decomposition is performed. BT and SP are similar CFD codes in many respects but differ in their communication/computation ratio. EP, the "embarrassingly parallel" kernel, is not implemented in the Java version. Its workload generates random numbers, a situation commonly observed in Monte Carlo simulation codes, and it performs no interprocessor communication [11] [10] [25]. See [11] for a detailed description of the computational algorithms. The original source code of IS is implemented in C; the rest is implemented in FORTRAN 77.

3.3.2 Dhrystone

Dhrystone is a hand-synthesized benchmark. Its source code consists of integer calculations, loads and stores on arrays, loads and stores on struct members, and string comparison. Malloc and free are not exercised during the benchmark iteration step in the original source code, so the memory footprint is small and stable. Figure 3.1 shows the call graph of the Dhrystone Ruby version converted by our translator, obtained with ruby-prof and KCachegrind. It shows the dynamic distribution of calls to primitive objects and gives an overview of the workload characteristics. The maximum depth of function calls was three, as shown in the call graph. Class#new denotes the creation of a temporary array of size 1, used to represent a pointer variable of the original call-by-reference language in a language with call-by-value semantics. The Fixnum class in Ruby is equivalent to an integer in C. From the call graph, we confirmed that integer calculation, integer comparison, and object reference comparison dominate the Dhrystone benchmark.

Figure 3.1: Dhrystone call graph (Ruby, iteration = 200,000)
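For illustration, the pattern behind Class#new can be sketched as follows; the names are hypothetical, not the actual Dhrystone procedure bodies, and the Java form is shown because Java is the base language:

    class ByRefExample {
        // Emulating a by-reference (output) parameter with a one-element
        // array; in the translated Ruby, each such temporary array shows
        // up as a Class#new call in the call graph.
        static void addInto(int a, int b, int[] out) {
            out[0] = a + b;            // write through the emulated pointer
        }

        static int caller() {
            int[] result = new int[1]; // the size-1 temporary array
            addInto(2, 3, result);
            return result[0];          // 5
        }
    }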
There is only one parameter to increase the scale of the benchmark: the iteration count. No other input is needed. This keeps the benchmark small and portable but makes it difficult to enhance workload diversity, which distinguishes it from modern benchmarks like SPEC CPU2006. Since Dhrystone is a synthetic benchmark, it must be validated whether its source code statement distribution is compatible with current programs. Figure 3.2 shows the statement distribution of SPEC CINT2006 and Dhrystone. gcc showed heavier use of case, label, and goto statements, but a typical compiler has to be written with a huge number of conditional branches, so this is an exceptional case. Except for gcc, every benchmark, including Dhrystone, showed a similar statement distribution. From the viewpoint of source code construction, Dhrystone is a representative subset of SPEC CINT2006.

Figure 3.2: Statement distribution of SPEC CINT2006 and Dhrystone

Chapter 4
Evaluation

4.1 Measurement environment

The system was a Dell PowerEdge R410 with a Xeon E5530 2.4 GHz quad-core processor with 2 logical cores per physical core. Each core had a 256 KB L2 cache, and an 8 MB L3 cache was shared among the cores. The memory bandwidth was 25.6 GB/s, and the system had 12 GB of main memory. The operating system was 64-bit CentOS 5.5 with Linux kernel 2.6.18-194.32.1.el5. We compiled all the language implementations with GCC 4.5.1, using the -O3 -msse4.2 optimization flags. We evaluated single-threaded NPB and Dhrystone translated into each language. The maximum problem size for NPB was class A for computational performance, and we additionally used classes S and W for JIT analysis. The iteration count for Dhrystone was 50,000,000. We also measured the parallel scalability of Ruby with the multithreaded implementation of NPB.

4.2 Characteristics of measured languages and their implementations

We selected Ruby, PHP, Python, Fortran 2003, and C99 for measurement, plus the original implementations by NAS: FORTRAN 77 and Java. Ruby, PHP, and Python are dynamically typed, object-oriented, class-based languages. All three are usually used for web programming and text processing. Each has a runtime source code evaluation function. Redefinition of functions is possible in Ruby and Python but impossible in PHP. The dynamic aspects of these languages make static optimization hard. PHP has no multithreading features. These language implementations partially share concepts or implementation. Ruby 1.8 is an abstract syntax tree (AST) based interpreter. Ruby 1.9 [52] and Python 2.7 are based on bytecode interpreters; their bytecode specifications are not intended to be portable across languages, because the bytecode was introduced for performance. PyPy is a self-hosted Python interpreter written in Restricted Python; its JIT compiler is a tracing JIT. Jython and JRuby are based on the JVM, and JRuby has a JIT compiler. Rubinius and Unladen Swallow both use an LLVM backend for JIT compilation.
Java bytecode is a stack-based intermediate representation. Java 7 has the invokedynamic opcode for dynamic language support, which enhances polymorphic method invocation; the Java platform has thus begun to extend into a host environment for dynamic languages. JIT compilation was introduced with the HotSpot VM in Java 1.3, and other vendor implementations later acquired similar boosting technologies [55]. Fortran 2003 [49] is the newest major update since Fortran 90. One of its changes that affects the semantics of Fortran is an object-oriented extension that enables class inheritance, bringing Fortran closer to highly productive languages; this functionality made the translation from Java to Fortran straightforward. C99 is the C specification published in 1999 [26]. Coding style in C has changed since the 1980s, when the Dhrystone benchmark was published, and C99 contains new features that affect performance, such as variable-length arrays, inline function definitions, and restrict pointers. We implemented a Lua version of Dhrystone by hand for reference. Lua is a prototype-based object-oriented language, so a class-based object-oriented program cannot be mapped to Lua in a straightforward manner: there are several fashions in which to emulate classes in a prototype-based language. Almost all of the Lua version is based on the C version and partially on the Java version.

The measured language implementations were Sun Java 1.6.0, Sun Java 1.7.0, Oracle JRockit 1.6.0_20 R28.1.0-4.0.1, IBM Java 6.0-9.0, Apache Harmony 6.0-jdk-991881, Ruby 1.9.2-p136, Ruby 1.9.3-110206, JRuby 1.6.0.RC1, Rubinius 1.2.0, Python 2.7.1, PyPy 1.4.1, PyPy 1.7, Jython 2.5.2, Unladen Swallow, PHP 5.3.6, GCC 4.5.1, GCC 4.6.0, ICC 11.1.073, LLVM 2.8, LuaJIT 2.0.0-beta6, and IFORT/ICC 12.0.1.

4.3 NAS Parallel Benchmarks

4.3.1 Performance overview

Figure 4.1 gives an overview of all the language implementations we measured; the vertical axis is logarithmic. We can observe a performance difference of over 100 times between the slow and fast language implementations.

Figure 4.1: NPB performance overview

Figure 4.2 shows the performance of the original NPB. This is the baseline that all the language implementations should target, because FORTRAN is the highest-performance language and has the biggest market share in scientific computing. ICC 11 achieved 2,841 MFlops with the original FORTRAN 77 implementation of MG.

Figure 4.2: Original NPB performance

Figure 4.3 shows the performance of the natively compiled languages: Fortran 2003, C99, and the original implementation (FORTRAN 77 and C). Variability of the scores was observed among the benchmarks with long and medium execution times: BT, FT, LU, MG, and SP. For example, the original FORTRAN 77 implementation of MG on ICC 11 was 2.77 times faster than the C99 implementation compiled with LLVM 2.8. However, the two benchmarks with short execution times showed less variability: the original FORTRAN 77 implementation of CG on IFORT 12 was 1.52 times faster than the Fortran 2003 implementation on GCC 4.6.0, and the Fortran 2003 implementation of IS on IFORT 12 was only 1.04 times faster than the same implementation on GCC 4.6.0; these were the biggest differences in CG and IS.

Figure 4.3: NPB performance of native compiled languages
Figure 4.4 shows the standard deviation of the scores shown in Figure 4.3. The standard deviation of CG and IS was less than 100 Mops, while that of the others was over 400 Mops. These results show that short kernel benchmarks obviously do not reflect the performance of natively compiled language implementations on modern computers, but benchmarks containing programs of over 10,000 source lines, such as SPEC CPU2006, are not needed to see variability: the scale of a benchmark is sufficient at a few thousand lines.

Figure 4.4: Standard deviation of the score of native compiled languages

Figure 4.5 shows the performance of the Java implementation of NPB measured on several vendors' JVMs. IBM Java was always faster than JRockit and Harmony, showing stable performance. In contrast, the Sun JVM showed unstable behavior. This behavior can be explained by the characteristics of the JIT compiler; we discuss the effect of JIT on performance in Section 4.3.2.

Figure 4.5: NPB Java version performance

The performance of the Python implementations is shown in Figure 4.6. BT, FT, LU, and MG did not run on Python 3.2 because the semantics of integer division changed in Python 3. PyPy 1.7 was the fastest implementation on CG, FT, and MG, but its BT, IS, and LU performance was lower than Python 2.7.1. We discuss the efficiency of tracing JIT in Section 4.3.2.

Figure 4.6: NPB Python version performance

The performance of the Ruby implementations is shown in Figure 4.7. Ruby 1.9.2 performed 2 times faster than Ruby 1.8.7 on all benchmarks. Ruby 1.9.2 or 1.9.3 was the fastest except on CG, where JRuby was the fastest. JRuby contains a JIT compiler, but the underlying JVM also has one; we discuss which affects the performance in Section 4.3.2.

Figure 4.7: NPB Ruby version performance

There are few PHP implementations, so we chose the official implementation of PHP and a Java implementation, Quercus. However, Quercus timed out at runtime on everything except CG and IS (Figure 4.8). PHP's language specification is not appropriate for numerical calculation, as we saw in Section 3.2.3, so PHP's performance should be read as a reference measure relative to the other languages.

Figure 4.8: NPB PHP version performance

4.3.2 Effects by JIT

Sun Java performance degradation

From Figure 4.5, we can see that Sun Java, both JDK 1.6 and JDK 1.7, achieved only 4.6% to 14.9% of IBM Java's Mops on BT, LU, and SP, while performing as well as or better than IBM Java on CG, FT, IS, and MG. BT, LU, and SP are all CFD application benchmarks, which are bigger than the kernel benchmarks CG, FT, IS, and MG. The source line counts of the CFD benchmarks, excluding comments, were all over 2,000, while those of the kernel benchmarks were under 1,000. Moreover, comparing the size of every method in the class files, we observed that the CFD benchmarks contain methods of 6,000 to 10,000 bytes, while the maximum method size in the kernel benchmarks was about 2,000 bytes. We profiled the BT benchmark with hprof and observed that the share of execution time spent in the BT.compute_rhs method on Sun Java was about 10 times larger than on the other JVMs. We divided BT.compute_rhs, whose original size was 9,723 bytes, into smaller methods of 2,929 and 6,798 bytes, and observed that the program was accelerated 10 times on Sun Java with no acceleration on the other JVMs (Figure 4.9).

Figure 4.9: Sun Java performance improvement after method division
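The change amounts to a mechanical split at a statement boundary; the following sketch uses hypothetical names (the real decomposition of BT.compute_rhs differs) and only shows the shape that brought each piece back under the size the JIT was willing to compile:

    // Before: a single ~9,700-byte method that Sun Java declined to
    // JIT-compile. After: a thin dispatcher over two smaller methods.
    void compute_rhs() {
        computeFluxTerms();     // first part of the original body
        addDissipationTerms();  // second part of the original body
    }

    private void computeFluxTerms() { /* ... first half of the body ... */ }
    private void addDissipationTerms() { /* ... second half of the body ... */ }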
Ruby

JRuby and Rubinius have JIT compilers, both method-based. JRuby translates the intermediate representation (IR) of a "hot" Ruby method into an anonymous Java class, which is then JIT-compiled to native code by the JVM. Rubinius translates its own IR into LLVM [37] IR, and JIT compilation is performed by the LLVM infrastructure. We measured the behavior of JIT compiler invocation with problem size S. JRuby did not attempt to JIT-compile big methods such as BT.compute_rhs. This behavior did not change even when the call-count-based JIT compilation threshold was set to 0, which should mean that every method the interpreter walks is compiled. Rubinius did attempt to JIT-compile the methods that JRuby skipped, but almost all attempts at inlining into big methods failed; the methods left uninlined were the multiply and add methods of the fixed-precision number class and the load method for array elements.

These results show that a method JIT suffers from big methods. On a method-based JIT, the size of the code fragment compiled into optimized bytecode or native machine code depends on the size of the method. If code is divided into methods of appropriate size, the compilation units naturally come out appropriately sized; but if the method size distribution is skewed, big methods produce coarse-grained compilation units. A typical profiler for a method JIT counts method calls, but a hot method often contains loops instead. Such a profiler therefore mispredicts the hotness of a method that contains many primitive operations and does not call other methods. Moreover, typical embedded environments have resource limitations; in such environments, compiling a big method is impossible or does not pay, so heuristics that cut off big methods are adopted. For these reasons, method compilation that is actually needed is not always performed, and the penalty of misprediction, or restrictions imposed by the environment, slows down actual scientific applications.
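A minimal sketch of the misprediction: the method below is entered once yet accounts for almost all execution time, so a profiler that only counts invocations never marks it hot; loop-edge counters combined with on-stack replacement exist precisely to catch this case:

    // Called once, but dominated by its loop: a call-count profiler
    // records one invocation and never triggers compilation.
    static double sumOfSquares(double[] a) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            s += a[i] * a[i];   // many primitive operations, no method calls
        }
        return s;
    }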
PyPy

We measured how JIT compilation proceeds under a tracing JIT [13], using the NPB Python version on PyPy 1.4.1. Figure 4.10 shows the timing of JIT compilation and GC during execution with problem size S; SP was omitted due to a runtime error. A pale green area means compiled native code is running, a dark green area means JIT profiling and compilation are being performed, and a red area means the garbage collector has been triggered. CG had the longest native-code execution time among the NPB. FT had the highest ratio of native-code execution time to JIT compilation time, but 36.6% of its execution time was taken by GC, a higher share than in CG. The execution time of JIT optimization was highest in MG. GC and JIT optimization dominated the execution time of BT and LU; compared to the other benchmarks, the frequency of JIT and GC in BT and LU was irregular, while FT, IS, and MG showed regular JIT and GC behavior.

Figure 4.10: JIT log of NPB except for SP (CLASS=S) on PyPy 1.4.1

We measured CG not only with class S but also with W and A, because the performance of CG on PyPy 1.4 was about 10 times faster than on any other interpreted language implementation except PyPy 1.7, and the execution time was relatively short. The results are shown in Figure 4.11. As the problem size increased, the time spent running JIT-compiled code increased. The CG.conj_grad method contains 10 for loops and only one external function call, to Math.sqrt. The call graph obtained from CG class S on Python 2.7 (Figure 4.12) shows that conj_grad accounted for over 90% of the execution time while being called only 16 times during execution. By contrast, the BT.Schwarztrauber method accounted for over 80% of execution time and was called over 1,000 times. Thus the locality of CG was high. The locality of code execution affects the efficiency of JIT compilation; to measure the performance of a JIT compiler, the code locality must vary. NPB has workload variation from the viewpoint of JIT compilation efficiency.

Figure 4.11: JIT log of NPB CG (CLASS=W, A) on PyPy 1.4.1

Figure 4.12: Call graph of NPB CG (CLASS=S) on Python 2.7

On PyPy 1.4, IS completes in less than 0.02 seconds with problem size S, which is too small to observe the behavior of the language implementation. We therefore also measured the behavior of the JIT compiler with the bigger problem sizes W and A. The results are shown in Figure 4.13. As the problem size increased, the percentage of time executing JIT-compiled code decreased and GC execution time increased instead; at size A, over 90% of the execution time was GC-related processing. The function that came to dominate the execution time as the problem size increased was Random.randlc, as shown in Figure 4.15. Its caller, IS.initKeys, is not called while the benchmark kernel is running. Random.randlc also dominated the execution time on Python 2.7, a non-JIT implementation, as shown in Figure 4.16. IS.initKeys took 131.9 seconds on PyPy 1.4.1 but only 1.66 seconds on PyPy 1.7, as shown in Figure 4.17. Nevertheless, the tendency for GC execution time to grow with problem size persists, as shown in Figure 4.14. In the other benchmarks, the initialization functions did not dominate the execution time, or were the same functions as those called while the benchmark was running. These results show that the total execution time of IS, in addition to the score of the benchmark kernel, can be used to evaluate the garbage collector and the JIT, and that the IS kernel is a workload that stresses GC.

Figure 4.13: JIT log of NPB IS (CLASS=W, A) on PyPy 1.4.1

Figure 4.14: JIT log of NPB IS (CLASS=S, W, A, B) on PyPy 1.7
Figure 4.15: Call graph of NPB IS (CLASS=A) on PyPy 1.4.1
Figure 4.16: Call graph of NPB IS (CLASS=A) on Python 2.7
Figure 4.17: Call graph of NPB IS (CLASS=A) on PyPy 1.7

It was confirmed that the characteristics of NPB are diverse from the combined standpoint of JIT and GC as well, just as previous work [25] [31] [63] [9] found them diverse from the standpoints of parallel distributed environments and microarchitecture. IS and the actual CFD codes triggered JIT and GC continuously throughout execution; these benchmarks show that improving the JIT alone does not accelerate applications that trigger GC frequently.

4.3.3 Multithread performance of Ruby

HPC scientific programs use multiple processes and multithreading to exploit parallelism, and both the original and the Java version of NPB have parallel implementations. We translated the Java version into Ruby and measured its parallel processing performance. Ruby implementations other than JRuby have a Global Interpreter Lock (GIL), also called the Giant VM Lock (GVL); this lock serializes the interpreter so that it can interact safely with natively compiled modules. We measured the impact of the GIL, and JRuby's multithread performance, on NPB by increasing the number of threads (Figure 4.18). The results show that the impact of the GVL is larger on Ruby 1.9 than on Ruby 1.8, and that Rubinius suffers the largest GVL impact of the four implementations. Figure 4.19 shows JRuby scaling up to 8 threads. Compared with the FORTRAN77 implementation (Figure 4.20), the computationally expensive CFD benchmarks scaled comparably, but the computationally cheap kernel benchmarks scaled worse on JRuby. This indicates that the cost of locking on JRuby is high.

Figure 4.18: Scaling of multithread benchmarks on Ruby implementations (acceleration rate vs. number of threads for BT, LU, CG, MG, FT, SP, and IS on JRuby 1.6.0RC1, Rubinius 1.2.0, Ruby 1.8.7, and Ruby 1.9.3)
Figure 4.19: Scaling of multithread benchmarks on JRuby
Figure 4.20: Scaling of multithread benchmarks, implemented in FORTRAN77
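As a minimal illustration of the GVL effect measured above (a generic sketch, not part of the translated benchmark suite), the CPU-bound Ruby loop below divides a fixed amount of arithmetic across threads; under a GVL only one thread executes Ruby code at a time, so the wall-clock time should stay roughly flat as threads are added, while on JRuby it should drop.

```ruby
require 'benchmark'

# Pure computation with no I/O, so a GVL implementation never releases
# the lock and the threads cannot actually run in parallel.
def busy(n)
  s = 0
  n.times { |i| s += i * i }
  s
end

N = 10_000_000

[1, 2, 4].each do |threads|
  t = Benchmark.realtime do
    (1..threads).map {
      Thread.new { busy(N / threads) }   # split the same total work
    }.each(&:join)
  end
  puts format('%d thread(s): %.2f s', threads, t)
end
```

On MRI 1.8/1.9 and Rubinius one would expect the reported time to stay roughly constant (or worsen slightly from scheduling overhead) as the thread count grows, mirroring Figure 4.18, while JRuby should approach linear speedup as in Figure 4.19.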
4.4 Dhrystone

Figure 4.21 shows the dhrystones per second of each language implementation. The fastest scripting-language implementation, PyPy 1.7, is still 10 or more times slower than the natively compiled language implementations. The Dhrystone score of PyPy 1.7 on the current system was near that of the DEC AlphaStation 500/400 measured in [58], a system introduced in 1996 with an Alpha 21164A 400 MHz processor. The scores of Ruby 1.8.7 and PHP were near that of the EPSON PRO-486, introduced in 1993 with an i486DX2 66 MHz processor.

Figure 4.21: Dhrystone (iteration count = 50,000,000); dhrystones/sec for Lua, Ruby, Python, PHP, Java, C99, and the original implementations

Figure 4.22 shows the JIT log of Dhrystone with an iteration count of 200,000. In the PyPy 1.4 log, JIT-related operations were still being performed when 47.8% of the total execution time had elapsed, whereas PyPy 1.7 converged at 11.7%; that is, convergence became faster in PyPy 1.7. After the JIT compiler converged, GC was still running continuously.

Figure 4.22: JIT log of Dhrystone (iteration count = 200,000)
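A convergence measurement of this kind can also be approximated without the VM's own JIT log. The sketch below is a hypothetical driver (the chunked loop body is a stand-in, not the translated Dhrystone kernel): it runs fixed-size chunks and prints a rolling rate, so on a JIT VM the printed rate climbs during warm-up and flattens once compilation has converged.

```python
# Hypothetical convergence driver; the loop body is a stand-in for one
# chunk of Dhrystone iterations (the real kernel exercises records,
# strings, and integer arithmetic).

import time

def dhrystone_chunk(loops):
    x = 0
    for i in range(loops):
        x = (x + i) % 65536
    return x

CHUNK = 200_000

def measure(total_loops):
    done = 0
    while done < total_loops:
        t0 = time.perf_counter()
        dhrystone_chunk(CHUNK)
        dt = time.perf_counter() - t0
        done += CHUNK
        # The per-chunk rate rises while the JIT is warming up and
        # flattens once compilation has converged.
        print(f'{done:>10d} loops: {CHUNK / dt:12.0f} loops/sec')

if __name__ == '__main__':
    measure(2_000_000)
```

A rate-based driver like this recovers roughly the same convergence information externally when a VM does not expose a JIT log; the percentages quoted above were read from PyPy's own logs.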
Chapter 5

Conclusion

In this paper, we proposed a method for comparing the performance of programming language implementations using benchmarks produced by a source-to-source translator from base benchmarks that are compatible with existing standard benchmarks.

A computer system is built of many layers, and each layer has benchmark programs for measuring its performance; however, few existing benchmark suites focus on comparing programming language implementations. Existing language-comparison benchmarks were translated into the various languages by hand, which makes it difficult to compare performance function by function and to keep up with the evolution of both languages and workloads. We proposed an automated translation process that makes it easy to generate benchmark programs with the same semantics for several languages. Compared with hand-coded benchmarks, ours have two advantages. First, we can compare language implementations with exact line-to-line correspondence, because the translation rules are written explicitly in the form of a translator. Second, the limits on language diversity and benchmark size are relaxed, because once the source-to-source translator exists, updates to the base benchmarks are picked up automatically. We translated an actual application benchmark into 5 languages, which would have been impractical by hand.

We discussed the criteria a benchmark implementer should consider so that the benchmarks reflect an appropriate workload and appropriate use of language features. Concepts that the benchmarked languages do not have in common must not be used in the base benchmarks. Writing synthetic benchmarks by hand costs too much to be practical, and assembly-level automatic benchmark synthesis cannot be used because it cannot reflect features across languages. Meaningful application benchmarks are too big for comparing interpreted languages, which are slower than the languages used for production applications, so subsetted benchmarks must be used; at the same time, the base benchmarks must be proven ones so that language implementations can be compared against past benchmark data. We selected NPB and Dhrystone as the benchmarks that best satisfy these criteria and translated them into 5 languages. Performance evaluation was performed on 30 language implementations for NPB and 25 implementations for Dhrystone. Among related work, our benchmarks are the largest in both language diversity and implementation diversity, with benchmark programs on a scale of over 1,000 lines.

Our comparison yielded several findings and recommendations concerning JIT compilers, GC, the GIL, workload characteristics, and overall performance. The performance of implementations that perform method-based, that is, coarse-grained, JIT compilation was heavily affected when JIT compilation was not performed because of misprediction or VM restrictions on big methods. As we saw in NPB, actual scientific applications contain big methods, so such applications suffer on coarse-grained JIT compilers. This point could not have been confirmed with kernel-only benchmarks, in which every method is small and easy to optimize. Workload diversity is needed not only for network or microarchitectural evaluation but also for evaluating JIT compilers and garbage collectors. The CFD applications caused continuous compilation and garbage collection. PyPy improved its JIT compiler to cover wider workloads from version 1.4.1 to 1.7, but the integer sorting application became slower, and the runtime profile of the integer sorting benchmark was dominated by GC. To improve a language implementation across a wider range of workloads, the implementer must also put effort into the GC.

The natively compiled languages showed small differences on benchmarks with short execution times, but on benchmarks with long execution times they differed by up to 2.77 times. The standard deviation of the Mops scores of the long-running benchmarks was over 400 Mops, while that of the short-running benchmarks was under 100 Mops.

Comparison among the interpreted languages showed that PyPy was the fastest implementation for scientific calculation. However, against the best scores of the compiled-language implementations, even the best score of PyPy was 9 times slower, and its other scores were much slower, roughly 20 to 30 times. Quercus was the slowest of the implementations we compared.

The performance impact of the GIL was measured on the Ruby implementations. JRuby was the only implementation that scaled with the number of threads. The GIL affected Ruby 1.9 more than Ruby 1.8. Except for IS, the impact of the GIL did not grow as the number of threads increased. The recommendation drawn from this result is that, for scientific applications, multithreading on Ruby implementations other than JRuby is pointless. JRuby does scale, but the acceleration potential of a dynamic language demonstrated by PyPy is at least 10 times on computational kernels, a larger gain than parallelization can achieve on currently available multicore processors. Writing programs that exploit multithreading is hard enough that staying with Ruby is not worth more than switching to Python, which has similar concepts. We conclude that, among the implementations we measured, PyPy 1.7 is recommended for scientific computation that is both productive and high-performance. Even so, PyPy is 20 to 30 times slower than the compiled languages, and less than 10 times slower than Java. We do not recommend using Quercus for scientific calculation.

By analyzing the distribution of statements, expressions, operators, and function calls in SPEC CINT2006 and Dhrystone, we observed that the differences in their syntactic characteristics are small. From the standpoint of dynamic languages with GC, the most distinctive property of Dhrystone is that no dynamic allocation or deallocation occurs during the benchmark iterations. This was confirmed by PyPy's JIT behavior: after convergence, JIT-compiled code dominated execution and GC rarely ran. Previous work showed that Dhrystone and SPEC CINT2006 scores are correlated, so the integer performance of language implementations can be measured with NPB IS and Dhrystone while taking GC effects into account.

Dhrystone is also useful for observing how multi-stage JIT optimization works. It offers only one way to make the workload heavier, namely increasing the iteration-count parameter. Simple, highly repetitive iteration readily triggers both method-based and tracing JITs, yet the loop body still reflects a real integer workload of the past. By measuring how quickly the dhrystones-per-second rate converges, we can measure the performance of a JIT compiler while keeping compatibility with past benchmark scores.
On a modern computer system, the Dhrystone scores of the interpreted languages were equivalent to those of desktop and workstation systems introduced in the mid-1990s. We conclude that the optimization techniques of dynamic interpreted languages are about 15 years behind those of statically typed compiled languages; to close the gap with statically typed languages, further enhancement is needed.

References

[1] Blazing-fast code using GPUs and more, with Microsoft Visual C++. http://ecn.channel9.msdn.com/content/DanielMoth_CppAMP_Intro.pdf.

[2] C Converted Whetstone Double Precision Benchmark. http://www.netlib.org/benchmark/whetstone.c.

[3] Dhrystone benchmark written in Java. http://www.okayan.jp/DhrystoneApplet/.

[4] Linpack benchmark, Java version. http://www.netlib.org/benchmark/linpackjava/.

[5] lua-users wiki: Object Orientation Tutorial. http://lua-users.org/wiki/ObjectOrientationTutorial.

[6] Numerical Ruby NArray. http://narray.rubyforge.org/.

[7] SPECjvm2008 benchmarks. http://www.spec.org/jvm2008/.

[8] The Computer Language Benchmarks Game. http://shootout.alioth.debian.org/.

[9] G. Abandah. Characterizing shared-memory applications: a case study of the NAS Parallel Benchmarks. Hewlett-Packard Labs Technical Report HPL-97-24, 1997.

[10] D. Bailey, T. Harris, W. Saphir, R. Van Der Wijngaart, A. Woo, and M. Yarrow. The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, 1995.

[11] D.H. Bailey, E. Barszcz, J.T. Barton, D.S. Browning, R.L. Carter, L. Dagum, R.A. Fatoohi, P.O. Frederickson, T.A. Lasinski, R.S. Schreiber, et al. The NAS Parallel Benchmarks: summary and preliminary results. In Supercomputing '91: Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, pages 158–165. IEEE, 1991.

[12] Stephen M. Blackburn, Robin Garner, Chris Hoffmann, Asjad M. Khang, Kathryn S. McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee, J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann. The DaCapo benchmarks: Java benchmarking development and analysis. In OOPSLA '06: Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 169–190, New York, NY, USA, 2006. ACM.

[13] C.F. Bolz, A. Cuni, M. Fijalkowski, and A. Rigo. Tracing the meta-level: PyPy's tracing JIT compiler. In Proceedings of the 4th Workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems, pages 18–25. ACM, 2009.

[14] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: stream computing on graphics hardware. In ACM Transactions on Graphics (TOG), volume 23, pages 777–786. ACM, 2004.

[15] J.M. Bull, L.A. Smith, L. Pottage, and R. Freeman. Benchmarking Java against C and Fortran for scientific applications. In Proceedings of the 2001 Joint ACM-ISCOPE Conference on Java Grande, pages 97–105. ACM, 2001.

[16] J.M. Bull, L.A. Smith, M.D. Westhead, D.S. Henty, and R.A. Davey. A benchmark suite for high performance Java. Concurrency: Practice and Experience, 12(6):375–388, 2000.

[17] W.W. Carlson, J.M. Draper, D.E. Culler, K. Yelick, E. Brooks, and K. Warren. Introduction to UPC and language specification. Center for Computing Sciences, Institute for Defense Analyses, 1999.

[18] B.L. Chamberlain, D. Callahan, and H.P. Zima. Parallel programmability and the Chapel language.
International Journal of High Performance Computing Applications, 21(3):291–312, 2007.

[19] C. Chambers, D. Ungar, and E. Lee. An efficient implementation of Self, a dynamically-typed object-oriented language based on prototypes. ACM SIGPLAN Notices, 24(10):49–70, 1989.

[20] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. Von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In ACM SIGPLAN Notices, volume 40, pages 519–538. ACM, 2005.

[21] H.J. Curnow and B.A. Wichmann. A synthetic benchmark. The Computer Journal, 19(1):43, 1976.

[22] L. Dagum and R. Menon. OpenMP: an industry-standard API for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46–55, 1998.

[23] J. Dongarra. The Linpack benchmark: an explanation. In Supercomputing, pages 456–474. Springer, 1988.

[24] J.J. Dongarra, H.W. Meuer, and E. Strohmaier. Top500 supercomputers. 1993.

[25] A.A. Faraj and X. Yuan. Communication Characteristics in the NAS Parallel Benchmarks. PhD thesis, Florida State University, 2002.

[26] International Organization for Standardization, International Electrotechnical Commission, et al. ISO/IEC 9899:1999.

[27] M.A. Frumkin, M. Schultz, H. Jin, and J. Yan. Implementation of the NAS Parallel Benchmarks in Java. 2002.

[28] J.L. Henning. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 34(4):1–17, 2006.

[29] J. Hutchinson. Culture, communication, and an information age madonna. IEEE Professional Communication Society Newsletter, 45(3):1–7, 2001.

[30] P. Jansen. TIOBE programming community index. TIOBE Software.

[31] H. Jin and R.F. Van der Wijngaart. Performance characteristics of the multi-zone NAS Parallel Benchmarks. Journal of Parallel and Distributed Computing, 66(5):674–685, 2006.

[32] R. Jones et al. Netperf: a network performance benchmark. Hewlett-Packard Company, 1996.

[33] S.P. Jones. Haskell 98 Language and Libraries: The Revised Report. Cambridge University Press, 2003.

[34] A. Joshi, L. Eeckhout, and L. John. The return of synthetic benchmarks. In 2008 SPEC Benchmark Workshop, page 1, 2008.

[35] K. Karimi, N.G. Dickson, and F. Hamze. A performance comparison of CUDA and OpenCL. arXiv preprint arXiv:1005.2581, 2010.

[36] J. Katcher. PostMark: a new file system benchmark. Technical Report TR3022, Network Appliance, 1997. www.netapp.com/tech_library/3022.html.

[37] C. Lattner and V. Adve. LLVM: a compilation framework for lifelong program analysis & transformation. In Code Generation and Optimization, 2004. CGO 2004. International Symposium on, pages 75–86. IEEE, 2004.

[38] E. Lusk and K. Yelick (guest editors). Languages for high-productivity computing: the DARPA HPCS language project. Parallel Processing Letters, 17(1):89–102, 2007.

[39] W.R. Mark, R.S. Glanville, K. Akeley, and M.J. Kilgard. Cg: a system for programming graphics hardware in a C-like language. In ACM Transactions on Graphics (TOG), volume 22, pages 896–907. ACM, 2003.

[40] M.D. McCool, Z. Qin, and T.S. Popa. Shader metaprogramming. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 57–68. Eurographics Association, 2002.

[41] P. McCormick, J. Inman, J. Ahrens, J. Mohd-Yusof, G. Roth, and S. Cummins. Scout: a data-parallel programming language for graphics processors. Parallel Computing, 33(10-11):648–662, 2007.

[42] R.C. Murphy, K.B. Wheeler, B.W. Barrett, and J. Ang. Introducing the Graph 500. Cray User's Group (CUG), 2010.
[43] C.J. Newburn, B. So, Z. Liu, M. McCool, A. Ghuloum, S.D. Toit, Z.G. Wang, Z.H. Du, Y. Chen, G. Wu, et al. Intel's Array Building Blocks: a retargetable, dynamic compiler and embedded language. In Code Generation and Optimization (CGO), 2011 9th Annual IEEE/ACM International Symposium on, pages 224–235. IEEE, 2011.

[44] G. Noaje, C. Jaillet, and M. Krajecki. Source-to-source code translator: OpenMP C to CUDA. In High Performance Computing and Communications (HPCC), 2011 IEEE 13th International Conference on, pages 512–519. IEEE, 2011.

[45] C. Peeper and J.L. Mitchell. Introduction to the DirectX 9 high level shading language.

[46] A. Petitet, R.C. Whaley, J. Dongarra, and A. Cleary. HPL: a portable implementation of the high-performance Linpack benchmark for distributed-memory computers.

[47] Aashish Phansalkar, Ajay Joshi, and Lizy K. John. Subsetting the SPEC CPU2006 benchmark suite. SIGARCH Computer Architecture News, 35:69–76, March 2007.

[48] R. Pozo and B. Miller. SciMark 2.0. http://math.nist.gov/scimark2, 2000.

[49] J. Reid (WG5 convener). The new features of Fortran 2003. ftp://ftp.nag.co.uk/sc22wg5, 1601.

[50] R.J. Rost. OpenGL Shading Language. Addison-Wesley Professional, 2005.

[51] W. Saphir, R. Van der Wijngaart, A. Woo, and M. Yarrow. New implementations and results for the NAS Parallel Benchmarks 2. In 8th SIAM Conference on Parallel Processing for Scientific Computing, pages 14–17, 1997.

[52] K. Sasada. YARV: yet another RubyVM: innovating the Ruby interpreter. In Companion to the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 158–159. ACM, 2005.

[53] J. Singer. JVM versus CLR: a comparative study. In Proceedings of the 2nd International Conference on Principles and Practice of Programming in Java, pages 167–169. Computer Science Press, Inc., 2003.

[54] J.E. Stone, D. Gohara, and G. Shi. OpenCL: a parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering, 12(3):66, 2010.

[55] T. Suganuma, T. Ogasawara, M. Takeuchi, T. Yasue, M. Kawahito, K. Ishizaki, H. Komatsu, and T. Nakatani. Overview of the IBM Java just-in-time compiler. IBM Systems Journal, 39(1):175–193, 2000.

[56] D. Syme, A. Granicz, A. Cisternino, and Ltd MyiLibrary. Expert F#. Apress, 2007.

[57] A. Tirumala, F. Qin, J. Dugan, J. Ferguson, and K. Gibbs. Iperf: the TCP/UDP bandwidth measurement tool, 2005.

[58] H. Tomari and K. Hiraki. Retrospective study of performance and power consumption of computer systems. IPSJ Online Transactions, 4:217–227, 2011.

[59] W. Vogels. Benchmarking the CLI for high performance computing. In IEE Proceedings - Software, volume 150, pages 266–274. IET, 2003.

[60] S. Walt, S.C. Colbert, and G. Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.

[61] R.P. Weicker. Dhrystone: a synthetic systems programming benchmark. Communications of the ACM, 27(10):1013–1030, 1984.

[62] M. Weiskirchner. Comparison of the execution times of Ada, C and Java. Paper, September 2003.

[63] F.C. Wong, R.P. Martin, R.H. Arpaci-Dusseau, and D.E. Culler. Architectural requirements and scalability of the NAS Parallel Benchmarks. In Supercomputing, ACM/IEEE 1999 Conference, pages 41–41. IEEE, 1999.