Master Thesis
Code Cache Optimizations for Dynamically Compiled Languages
Tobias Hartmann
Supervisors: Albert Noll, Prof. Thomas R. Gross
Laboratory for Software Technology
ETH Zurich
February 2014
Abstract
Past activities in optimizing the performance of the HotSpot™ Java Virtual Machine focused on the performance of the dynamic compilers and the supporting runtime. Since dynamically compiled code is stored in a code cache to avoid recompilations, the organization and maintenance of the code cache has a significant impact on the overall performance. The organization of the code cache became even more important with the introduction of tiered compilation in Java Platform, Standard Edition (Java SE) 7. By using two dynamic compilers with different characteristics, not only the amount of compiled code but also the number of different types of compiled code increased.
The current code cache is optimized to handle homogeneous code, i.e., only one type of compiled code. The code cache is organized as a single heap data structure on top of a contiguous chunk of memory. Therefore, profiled code, which has a predefined, limited lifetime, is mixed with non-profiled code, which potentially remains in the code cache forever. This leads to various performance and design problems. For example, the method sweeper has to scan the entire code cache while sweeping, even if some entries are never flushed or contain non-method code.
This thesis addresses these issues at a lower layer by redesigning the structure of the code cache. The code cache is segmented into multiple code heaps, each of which contains compiled code of a particular type and therefore separates code with different properties. The disadvantage of having a fixed size per code heap is minimized by lazily creating and dynamically resizing these code heaps at runtime.
The main advantages of this design are (i) more efficient sweeping, (ii) improved code locality, (iii) the possibility of fine-grained (per code heap) locking, and (iv) improved management of heterogeneous code.
A detailed evaluation shows that this approach improves overall performance. Execution time improves by up to 7%, and the more efficient code cache sweeping reduces the time taken by the sweeper by up to 46%. This, together with a roughly 98% decrease in fragmentation of the non-profiled code heap, leads to reduced instruction translation lookaside buffer (44%) and instruction cache (14%) miss rates.
Zusammenfassung
Past efforts to optimize the performance of the HotSpot™ Java Virtual Machine concentrated primarily on the performance of the dynamic compilers and the supporting runtime. However, since dynamically compiled code is stored in a code cache to avoid recompilations, the organization and management of this code has a significant impact on the overall performance. This became even more important when tiered compilation was introduced with Java SE 7. Through the simultaneous use of two dynamic compilers, which partly instrument the code, not only the amount of compiled code but also the number of different code types increased.
The current design of the code cache is based on a heap data structure on top of a contiguous memory region and is optimized to store homogeneous code, i.e., compiled code of one type. Therefore, profiled code, which has a predefined, limited lifetime, is mixed with non-profiled code, which potentially remains in the code cache forever. This inevitably leads to various performance and design problems. For example, the sweeper always has to scan the entire code cache, even if some entries are never flushed or contain non-method code.
The approach presented in this thesis tackles these problems at a lower level by restructuring the code cache. The code cache is split into multiple code heaps, each of which contains only compiled code of a particular type and therefore separates code with different properties. The disadvantage of a fixed size per code heap is minimized by creating the code heaps lazily, i.e., only when needed, and by allowing them to change their size at runtime.
The advantages of this design are (i) more efficient sweeping, (ii) improved spatial code locality, (iii) the possibility of fine-grained locking (per code heap), and (iv) improved management of heterogeneous code.
A detailed evaluation of the implementations shows that the approach is promising. Execution time is reduced by up to 7%, and the more efficient sweeping reduces the time spent in the sweeper by up to 46%. This, together with a roughly 98% decrease in fragmentation of the non-profiled code heap, leads to lower instruction TLB (44%) and instruction cache (14%) miss rates.
Acknowledgments
First of all, I want to thank my supervisor Albert Noll for giving me the opportunity to take part in this challenging project, and for his guidance and help with just the right mix of support and personal responsibility.
This work was performed in cooperation with Oracle. Advice given by Vladimir Kozlov and Christian Thalinger has been a great help in improving the design and implementation. I would like to thank Azeem Jiva for making this possible.
Further, I would like to offer my special thanks to Patrick von Reth for his constructive feedback and to Jens Schuessler for proofreading and improving my orthography.
Last but not least, I want to thank Yasmin Mülhaupt for her feedback, encouragement and endless support while writing this thesis. Life is so much better with you.
Tobias Hartmann
February 2014
Contents
Acknowledgments
1 Introduction
  1.1 Motivation
  1.2 Structure of the thesis
    1.2.1 Timeline
2 Background information
  2.1 Dynamic compilation
  2.2 Java Language
  2.3 The HotSpot™ Java Virtual Machine
    2.3.1 Overview
    2.3.2 Interpreter
    2.3.3 Dynamic compilation
    2.3.4 Tiered compilation
    2.3.5 Code cache
    2.3.6 Code cache sweeper
    2.3.7 Serviceability Agent
  2.4 Nashorn JavaScript engine
3 Related work
  3.1 Maxine Virtual Machine
  3.2 Graal Compiler
  3.3 Jikes Research Virtual Machine
  3.4 Dalvik Virtual Machine
4 Design
  4.1 Segmented code cache
  4.2 Dynamic code heap sizes
5 Implementation
  5.1 Segmented code cache
    5.1.1 Code cache
    5.1.2 Code cache sweeper
    5.1.3 Serviceability Agent
    5.1.4 Dynamic tracing framework DTrace
    5.1.5 Stack tracing tool Pstack
    5.1.6 Adding new code heaps
  5.2 Dynamic code heap sizes
    5.2.1 Virtual space
    5.2.2 Code heap
    5.2.3 Code cache
    5.2.4 Lazy creation of method code heaps
    5.2.5 Serviceability Agent
    5.2.6 Dynamic tracing framework DTrace
    5.2.7 Stack tracing tool Pstack
6 Evaluation
  6.1 Experimental setup
  6.2 Default code heap sizes
  6.3 Dynamic code heap sizes
  6.4 Overall performance
  6.5 Sweep time
  6.6 Memory fragmentation
  6.7 ITLB and cache behaviour
  6.8 Hotness of methods
7 Conclusion
  7.1 Future work
A Appendix
  A.1 Additional graphs
  A.2 Benchmarks
    A.2.1 Google Octane JavaScript benchmark
    A.2.2 SPECjbb2005
    A.2.3 SPECjbb2013
  A.3 New JVM options
  A.4 Used tools/frameworks
    A.4.1 Eclipse
    A.4.2 Python
    A.4.3 yEd Graph Editor
Bibliography
1 Introduction

1.1 Motivation
The HotSpot™ Java Virtual Machine (JVM) was subject to many changes during the last ten years of development. Past activities in optimizing the performance focused on the dynamic compilers and the supporting runtime. But since the dynamically compiled code is stored in a code cache to avoid frequent recompilations, the organization and maintenance of the code has a significant impact on the overall performance. For example, the bugs JDK-8027593 (the implementation of the method sweeper causes a performance regression for small code cache sizes) and JDK-8020151 (large performance regressions occur when the code cache fills up) report serious performance regressions due to the code cache taking the wrong actions.
The organization of the code cache became even more important with the introduction of tiered compilation in Java SE 7. Previously, only the interpreter gathered profiling information before a method was compiled with the server compiler [43], which then used this information. Tiered compilation uses the client compiler to generate compiled versions of methods that collect profiling information. Methods compiled by the client compiler are potentially later compiled using the server compiler. This leads to faster startup and more precise profiling information. As a result, not only the amount of compiled code but also the number of different types of compiled code increased.
The current design of the code cache, however, is optimized to handle only one type of compiled
code. The code cache is organized as a single heap data structure on top of a contiguous chunk of
memory. To add new code to the code cache, the code cache allocates space independently of the
type of code that needs to be stored. For example, profiled code, which has a limited lifetime, can be placed next to non-profiled code, which potentially remains in the code cache forever. Further, JVM-internal structures, such as the interpreter or the adapters used to jump from compiled to interpreted code, are mixed with compiled Java methods. Mixing different code types leads to various performance and design problems. For example, the method sweeper, which is responsible for removing methods from the code cache, must scan all code types while sweeping. This results in a serious overhead because some entries are never flushed or even contain non-method code.
The approach presented in this thesis addresses these issues at the structural level of the code cache. Instead of having a single code heap, the code cache is segmented into distinct code heaps, each of which contains compiled code of a particular type. Such a design makes it possible to separate code with different properties.
As described in the last paragraph, there are three different types of compiled code: (i) JVM-internal (non-method) code, (ii) profiled code, and (iii) non-profiled code. This thesis evaluates an approach in which each of the aforementioned code types is stored in an individual code heap. The available code heaps are (a minimal sketch of this separation follows the list):
• a non-method code heap containing non-method code, such as buffers and the bytecode interpreter; this code stays in the code cache forever,
• a profiled code heap containing lightly optimized, profiled methods with a short lifetime, and
• a non-profiled code heap containing fully optimized, non-profiled methods with a potentially long lifetime.
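As a rough illustration of this separation (a Java sketch only; the actual implementation lives in HotSpot's C++ code cache, and all names here are hypothetical), allocation routes each code blob to the heap matching its type, and the sweeper can then iterate over method heaps only:

```java
// Minimal sketch, not HotSpot's actual code: routing code blobs to a heap
// per blob type instead of one shared heap.
enum CodeBlobType { NON_METHOD, PROFILED_METHOD, NON_PROFILED_METHOD }

class SegmentedCodeCacheSketch {
    // One hypothetical code heap per blob type.
    private final java.util.EnumMap<CodeBlobType, java.util.List<byte[]>> heaps =
            new java.util.EnumMap<>(CodeBlobType.class);

    SegmentedCodeCacheSketch() {
        for (CodeBlobType t : CodeBlobType.values()) {
            heaps.put(t, new java.util.ArrayList<>());
        }
    }

    // Allocation selects the heap by type.
    void allocate(CodeBlobType type, byte[] code) {
        heaps.get(type).add(code);
    }

    // The sweeper can iterate method heaps only, skipping non-method code.
    Iterable<byte[]> methodBlobs() {
        java.util.List<byte[]> result = new java.util.ArrayList<>();
        result.addAll(heaps.get(CodeBlobType.PROFILED_METHOD));
        result.addAll(heaps.get(CodeBlobType.NON_PROFILED_METHOD));
        return result;
    }
}
```

The point of the sketch is the routing by type; the real code heaps manage raw executable memory rather than Java lists.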
The main advantages of this design are:
• Efficient sweeping: It is possible to skip non-methods through specialized iterators, to sweep methods with a shorter lifetime more frequently, and to easily parallelize the sweeping process.
• Improved code locality: Code of the same type is likely to be accessed close in time. As a result, instruction cache and instruction translation lookaside buffer (TLB) misses are reduced.
• Fine-grained locking: It is possible to use one lock per code heap, or per code type respectively, instead of locking the entire code cache for each access. Fine-grained locking enables both fast sweeping and parallel allocation in different code heaps.
• Improved management of heterogeneous code: Future versions of the HotSpot™ JVM may include GPU code (Project Sumatra, which enables Java applications to take advantage of graphics processing units; see http://openjdk.java.net/projects/sumatra/) or ahead-of-time (AOT)-compiled code that should be stored in a separate code heap. Furthermore, the segmented code cache helps to better control the memory footprint of the JVM by limiting the space that is reserved for individual code types.
On the other hand, fixing the size per code heap has disadvantages: the required size for non-method code, such as adapters, the bytecode interpreter or compiler buffers, depends not only on client code, but also on the machine architecture and the JVM settings. Additionally, the size needed for profiled and non-profiled code, respectively, depends on the amount of profiling that is done, which in turn depends on the application, runtime, and JVM settings. During the startup phase of an application, more space for the profiled code heap is needed. After profiling of hot code is completed, the hot methods are compiled by the C2 compiler and less space for the profiled methods is needed. It is therefore difficult to set default values for the corresponding code heap sizes.
To solve these issues, the presented approach makes it possible to dynamically resize the method code heaps according to runtime requirements. A long-running application would, for example, first increase the size of the profiled method code heap because a lot of profiling is done at application startup. Later, when enough profiling information has been gathered, the size of the non-profiled code heap is increased. Additionally, the method code heaps are created lazily at runtime, such that only the size of the non-method code heap is fixed after JVM initialization.
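A possible shape of this resize-on-allocation-failure policy is sketched below in Java; the real implementation operates on adjacent virtual spaces in HotSpot's C++ runtime, and the names and step size here are invented for illustration:

```java
// Illustrative sketch only: grow one method code heap at the expense of its
// adjacent neighbor when an allocation fails. All names are hypothetical.
class ResizableHeapPairSketch {
    long profiledSize, nonProfiledSize;   // current sizes in bytes
    static final long STEP = 128 * 1024;  // hypothetical resize granularity

    // Called when allocating 'bytes' in the profiled heap fails: try to take
    // unused space from the adjacent non-profiled heap.
    boolean growProfiledHeap(long bytes, long nonProfiledUsed) {
        long spare = nonProfiledSize - nonProfiledUsed;
        long needed = Math.max(bytes, STEP);
        if (spare < needed) {
            return false; // neighbor has no room to give; allocation still fails
        }
        nonProfiledSize -= needed; // shrink the neighbor...
        profiledSize += needed;    // ...and grow toward it (they are adjacent)
        return true;
    }
}
```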
The contributions of this thesis are (i) the design and implementation of a dynamically sized, segmented code cache and (ii) a complete documentation and evaluation of the system regarding performance, runtime behaviour and memory consumption. Change (i) is delivered as two incremental patches to the code in the HotSpot™ JVM mercurial repository (http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot), changeset 5765:9d39e8a8ff61 from 27 December 2013. The patches fix bug JDK-8015774: Add support for multiple code heaps [29] in the JDK Bug System.
1.2 Structure of the thesis
Chapter 2 provides technical background information needed for this thesis. After a short overview of dynamic compilation and the Java language implementation, the chapter presents the HotSpot™ Java Virtual Machine in detail. The description focuses on those modules affected by the changes introduced with this work.
Chapter 3 presents related work and compares the runtime systems to the solutions presented in this thesis.
Chapter 4 introduces the high-level design of the segmented code cache and the dynamic code heap sizes. Important design decisions, e.g., the code heap design and memory layout, are described and justified. This chapter forms the basis for the implementation described in Chapter 5.
Chapter 5 describes the implementation of the segmented code cache and the dynamic code heap sizes based on the design decisions presented in Chapter 4. The development steps are listed chronologically and illustrated with examples.
Chapter 6 provides a complete evaluation of the system using various benchmarks executed on different platforms. The segmented code cache version and the dynamic code heap sizes version are compared to the baseline version.
Chapter 7 summarizes the results of this thesis and suggests future work.
1.2.1 Timeline
In this section the development steps are presented chronologically to give an overview of the changes to the baseline version that were necessary to implement the system. The first part lists the changes for the segmented code cache and the second part lists the changes for the dynamic resizing of code heaps. Only major progress is described; minor changes like bug fixing, refactoring, and adaptations according to reviews by Oracle are omitted.
Segmented code cache
• Code heap management: Implement and adapt the necessary functionality like macros,
iterators and array structures to manage multiple code heaps in the code cache.
• Code cache interface: Changes to the code cache interface to support multiple code heaps.
Changes to client modules, such as the compile broker.
• Code blob types: Integration of code blob types to assign compiled code blobs to the
corresponding code heaps.
• Code heap layout: Definition of one code heap per code blob type (methods, buffers,
adapters, runtime stubs and singletons). Code heaps are still defined statically using macros.
• Code cache sweeper: Changes to the code cache sweeper to support multiple code heaps.
Only sweeping method code heaps.
• Code heap: Changes to the implementation of code heaps to support explicit setting of a reserved code space. One chunk of space is now split into parts that are distributed among the code heaps to ensure a contiguous memory layout.
• Final code heap layout: Code heaps are now defined dynamically at runtime. Depending on the JVM configuration, such as tiered compilation and the compilers used, a non-method code heap, a profiled method code heap and a non-profiled method code heap are used.
• Memory service: The memory service used for JVM monitoring and management support
is modified to be able to register multiple code heaps.
• JVM options: New command line arguments for setting the code heap sizes.
• Serviceability Agent: The Java classes of the Serviceability Agent (see Section 2.3.7) are
adapted to support multiple code heaps.
• External tool support: Changes to the DTrace (see Section 5.1.4) scripts and the Pstack
(see Section 5.1.5) support libraries for Solaris and BSD to support multiple code heaps.
• Optimizations and bug fixes: Remove the direct code heap references previously needed for the Serviceability Agent and tool support; the code heap array is now referenced directly and specialized iterator classes are used to select individual code heaps.
• Evaluation: Detailed evaluation of the system with respect to performance and other runtime characteristics, using a set of Python scripts and multiple benchmark suites.
Dynamic code heap sizes
• Virtual space: Changes to the virtual space class to support growing downwards. Integration of functions to expand, shrink and grow into another virtual space.
• Code heap: Changes to the internal implementation of code heaps to support virtual spaces that grow down. Add functionality to shrink and to grow into an adjacent code heap.
• Dynamic code heap sizes: Implement dynamic resizing of profiled and non-profiled method
code heaps if allocation in the code cache fails. Add corresponding JVM option.
• Serviceability Agent: Adaptation of the Java classes of the Serviceability Agent to support
code heaps that grow down.
• External tool support: Fix the DTrace scripts and Pstack support libraries for Solaris and
BSD.
• Optimizations and bug fixes: Add additional asserts and comments to make the code more readable. Disable dynamic resizing if tiered compilation is disabled.
• Lazy method code heaps: The method code heaps are lazily created at runtime when the
first method is allocated in the code cache. This avoids setting the size of the non-method
code heap statically.
• Evaluation: Detailed evaluation of the system compared to the segmented code heap version
and the baseline version.
2 Background information
This chapter provides background information needed to understand this thesis. After a short introduction to dynamic compilation, the Java language implementation is described. Language features like virtual calls and garbage collection are explained and compared to other languages. Section 2.3 describes the HotSpot™ Java Virtual Machine (JVM), which implements the Java Virtual Machine Specification [27]. The focus is on modules affected by the changes introduced with this work. Finally, Section 2.4 presents the Nashorn JavaScript engine, a pure Java runtime for JavaScript running on top of the HotSpot™ JVM.
2.1 Dynamic compilation
To explain the term dynamic compilation, its counterpart, static compilation, is presented first. Static or ahead-of-time compilation describes the process of creating a native executable from source code. Most work is therefore done before execution. Complex and time-consuming analyses can be performed because they do not affect the runtime performance. The resulting information is then used to highly optimize the code. Further, static compilation introduces no additional startup time or memory consumption because compilation is already done. Static compilers mostly use static information, for example, type information, but may also rely on profiling information that was gathered during previous runs of a compiled version of the program.
Statically compiled executables are often required to run on different platforms. It is therefore impossible to use platform-specific features like vectorization instructions (e.g., SSE instructions) because at compilation time it is not known whether the target platform supports these features. Further, static compilers must ensure that all assumptions always hold at runtime. If a highly optimistic assumption is violated at runtime, the program may crash or deliver an invalid result. Static compilers are therefore, in general, conservative, and some effective optimizations are not performed. For example, inlining of virtual functions is only reasonable if the static compiler can prove that the target class of the call does not change at runtime, which is hard or even impossible to determine statically [4].
Dynamic or just-in-time (JIT) compilation tries to solve some of these issues by performing compilation at runtime, while the program is being executed. For example, the Java runtime environment first compiles the Java source code to an intermediate, platform-independent representation (Java bytecode), which is then delivered to the target platform (Section 2.2 and Section 2.3.3). The JVM executes Java bytecode by interpreting the bytecode or compiling it to machine code on the fly. This makes it possible to compile only parts of the code, by specifically selecting heavily used methods and optimizing them aggressively. Dynamic compilers can use program information collected at runtime to specialize code. They may instrument the code to gather detailed runtime profiling information, for example, the number of invocations, runtime types and branches taken. This information can then be used to generate highly optimized machine code.
In contrast to static compilers, dynamic compilers can also perform highly optimistic optimizations and simply recompile if the assumptions on which these optimizations are based no longer hold (called deoptimization, see Section 2.3.3). Examples of optimistic optimizations are global optimizations like inlining of library functions or removing unnecessary synchronization. Further, it is possible to fine-tune the code for a particular platform by, for example, using vectorization operations specific to the available CPU.
The main disadvantage of dynamic compilers, however, is the startup delay that is caused by loading
and initializing the supporting runtime and the initial compilation of code. Since compiling all
methods at runtime causes an unacceptable delay for many applications, the JVM typically starts
with interpreting bytecode. Only hot code is compiled to machine code, which provides high
performance. In general, dynamic compilers must find a trade-off between compilation time and
quality of the generated code. Additionally, an application executed by a dynamic compiler has a
higher memory consumption (footprint) than a statically compiled program because the dynamic
compilers and the supporting runtime consume memory at runtime.
More information about the implementation of the dynamic compilers in the HotSpot™ Java Virtual Machine can be found in Section 2.3.3.
2.2 Java Language
Java is a general-purpose, object-oriented and platform-independent language originally developed by Sun Microsystems (now Oracle). As of early 2014, Java is the second most popular programming language (according to [1] and [2]) with more than 9 million developers [40], surpassed only by the low-level language C.
Java was initially designed for interactive television, but became popular because it allowed executing untrusted applications (so-called Java applets) from the World Wide Web inside a sandbox enforcing network and file access restrictions. Today, Java is mostly used for business applications because of the variety of frameworks available and the platform independence that simplifies and reduces the costs of software development.
Java is designed to be easy to learn by using a simple object model and a concise C++-like syntax. It is object-oriented, supports modular and reusable code, and is extensible by allowing classes to be explicitly loaded at runtime or whenever needed. Java supports various security features that are part of the design of the language and the runtime, allowing untrusted code to be executed in a sandbox. Because Java is statically typed, the compiler catches many errors at compile time where other languages would only fail at runtime.
A Java application is compiled to a .class file, which is the binary format for the Java Virtual Machine (JVM). The JVM interprets or dynamically compiles the .class file to native code (see Section 2.3). Java is platform independent, both at the source and the binary level, allowing the same program to be executed on different systems, for example, personal computers and mobile devices. The slogan "Write once, run anywhere" (WORA), used by Sun Microsystems to emphasize the cross-platform benefits of the Java programming language, captures this platform independence.
The performance of programs written in Java greatly exceeds the performance of purely interpreted languages like Python (a widely used high-level programming language that focuses on readability and fast development), but can be slower than programs written in C or C++, which are compiled to native machine code. This performance difference does not stem from dynamic compilation alone; other Java language features, such as array bounds checks, have a performance impact as well. However, modern JVMs, e.g., Oracle's HotSpot™ Java Virtual Machine (see Section 2.3), use sophisticated dynamic compilers that reach a performance comparable to C/C++ applications by making use of optimizations that are only possible at runtime. For example, in contrast to C++, where virtual calls are expensive at runtime and are therefore avoided whenever possible, Java embraces virtual methods. The JVM therefore has to implement them to be fast, using techniques like inlining.
In addition, Java applications can use functions or libraries written in other languages, such as C or assembly, through the Java Native Interface. Hence, computationally intensive parts can be hand-written and optimized in a platform-specific language.
All these features are described in the Java Language Specification [13] and are implemented by
the Java Platform, Standard Edition (Java SE). Because Java has no formal standardization, for
example by the International Organization for Standardization (ISO), the Oracle implementation
is the de facto standard. It consists of two different distributions: (i) The Java Runtime Environment (JRE) containing the supporting runtime needed to execute Java programs and (ii) the Java
Development Kit (JDK) containing development tools like the Java-to-bytecode compiler.
The main part of the Java SE is the platform specific JVM that executes the platform independent
bytecode and implements all features defined by the Java Language Specification. Because the
specification is abstract, different implementations of the JVM are possible.
2.3 The HotSpot™ Java Virtual Machine
This section presents the design and implementation of the HotSpot™ Java Virtual Machine (JVM), the reference implementation of the Java Virtual Machine Specification [27] developed by Oracle Corporation. The state of the art described here serves as the baseline version for this thesis and for the design and implementation presented in Chapters 4 and 5, respectively. More information about the HotSpot™ Java Virtual Machine is available in [41].
2.3.1 Overview
The HotSpot™ Java Virtual Machine is the code execution component of the Java platform and the core component of the Java Runtime Environment. The HotSpot™ JVM is responsible for executing Java applications and is available on many platforms and operating systems. Supported platforms include Sun Solaris, Microsoft Windows and Linux on IA-32 and IA-64 Intel architectures, SPARC, PPC and ARM.
The HotSpot™ JVM is mainly written in C++ and the source code contains approximately 250,000 lines of code [35]. Figure 2.1 shows the HotSpot™ JVM in the context of the main components of the Java Runtime Environment.
Java source code in the form of .java files is processed by the static Java compiler javac and compiled to platform-independent Java bytecode. In this process, the semantics of the Java language are mapped to bytecode instructions that are stored in a standardized form in .class files. Because each bytecode occupies one byte, there are 256 possible bytecodes. Currently, the Java bytecode instruction set uses only 205 of them; the remaining bytecodes are used internally by the JVM or, for example, by debuggers to set breakpoints.
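For example, a trivial method and the stack-machine bytecode javac emits for it (as displayed by javap -c):

```java
// A small method and, in the comment, the bytecode javac emits for it.
// Operand-stack instructions like iload/iadd are part of the standardized
// class file format described above.
class Adder {
    int add(int a, int b) {
        return a + b;
        // Bytecode: iload_1   // push a
        //           iload_2   // push b
        //           iadd      // pop both, push a + b
        //           ireturn   // return top of stack
    }
}
```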
Figure 2.1: Overview of the components of the Java Runtime Environment.
Bytecode always resides within a method, and a method is always contained in a class. Because a class file contains the bytecode of exactly one class, multiple class files can be packaged into a Java Archive by the Java Archiver jar. The HotSpot™ JVM then loads the class or jar files, verifies the bytecode's correctness and executes the bytecode by interpreting it or by using a dynamic compiler (see the following sections for details).
Because the JVM executes bytecode and is therefore independent of the source programming language, it is possible to run other languages on top of the JVM by implementing a compiler that compiles the source language to Java bytecode. Originally, the JVM instruction set was statically typed, making it hard to run dynamic languages. The Da Vinci Machine Project by Oracle aims to support dynamic languages by adding a new invokedynamic instruction that allows method invocation with dynamic type checking (see "The Da Vinci Machine Project - a multi-language renaissance for the Java Virtual Machine architecture" [34]). For example, Jython, a Python programming language implementation, generates Java bytecode from Python code (see Figure 2.1) that is then executed by the JVM.
Figure 2.2 shows the internal architecture of the HotSpot™ JVM. Important subsystems, such as garbage collection and heap and stack management, are omitted for simplicity.
Figure 2.2: Overview of the HotSpotTM Java Virtual Machine architecture.
The class loader is responsible for dynamically loading a class when it is first referenced at runtime.
The bytecode verifier performs checks to ensure that the bytecode is safe and does not corrupt the
JVM. For example, it checks that a branch instruction always targets an instruction in the same
method.
The JVM can then execute the program by interpreting the bytecode or dynamically compiling it to machine code, using one of two available dynamic compilers (see Section 2.3.2 and Section 2.3.3, respectively). To decide whether a method should be compiled, the interpreter gathers profiling information at runtime. Methods that are used extensively, so-called hot spots, are scheduled for compilation, whereas cold methods are interpreted (this is where the name HotSpot™ comes from). Compiling only hot methods pays off because often 90% of the execution time of a computer program is spent executing 10% of the code (known as the 90/10 law, an application of the Pareto principle to software engineering), and the JVM can therefore highly optimize these 10% of the code.
Finally, the security manager checks the behaviour of the application against a security policy and stops the program in case a prohibited instruction is executed. Typically, web applets are executed with a strict security policy to ensure safe execution of untrusted code. The JVM serves as an additional layer between the application and the underlying operating system and hardware.
The HotSpot™ JVM provides several command line options and environment variables to control the runtime behaviour and to enable or disable functionality. For example, the size of the code cache (see Section 2.3.5) can be set by passing -XX:ReservedCodeCacheSize=<size> at JVM startup. The work presented in this thesis adds additional command line options, introduced in Chapter 5 and described in Section A.3 of the Appendix. A full list of options that are available for the baseline version can be found at [32].
The following sections describe components of the JVM that are important for the work presented
in this thesis.
2.3.2 Interpreter
The bytecode interpreter of the HotSpot™ JVM is a template-based, artificial stack machine. The interpreter iterates over the bytecodes, executing a fixed assembly code snippet for each bytecode. The interpreter is generated by the JVM at startup using a template table and is stored in the code cache (see Section 2.3.5). This table contains a template, i.e., a description, of each bytecode and the corresponding assembly code. The assembly code snippets are compiled to machine code while the interpreter is loaded and are then executed when the corresponding bytecode is encountered. For complex operations that are hard to implement in assembly, like a lookup in the constant pool (the set of constants used by a type, for example, string and integer constants), the interpreter calls into the JVM runtime.
This approach is faster than a classic switch statement (see Section "Interpreter" in [42]) and enables sophisticated adaptations of the interpreter to specific processor architectures, speeding up interpretation. For example, the same application can be executed on an old processor architecture while still making use of the features of the newest processor generation where available.
The interpreter performs basic profiling. Method entries and back-branches in loops are counted, and dynamic compilation is triggered on counter overflow. This ensures that "hot" methods are identified and compiled to machine code. Such a two-tiered execution model is characterized by a fast startup and reaches peak performance when hot methods are compiled to optimized machine code.
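A minimal sketch of this counter mechanism (the threshold and names are hypothetical; the real counters live in the method's runtime metadata inside the JVM):

```java
// Conceptual sketch of the two per-method counters the interpreter keeps.
class MethodCountersSketch {
    int invocationCounter;
    int backEdgeCounter;
    static final int COMPILE_THRESHOLD = 10_000; // hypothetical value

    // Called on method entry and on every loop back-branch while interpreting.
    boolean shouldCompile(boolean isBackEdge) {
        if (isBackEdge) backEdgeCounter++; else invocationCounter++;
        // Overflowing the combined count identifies a "hot" method.
        return invocationCounter + backEdgeCounter >= COMPILE_THRESHOLD;
    }
}
```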
Additional data, such as branch and type profile information, may be gathered to support optimizations in the dynamic compilers. Further, the warm-up period caused by the initial interpretation
allows the dynamic compilers to make better optimization decisions. After initial class loading,
dynamic compilers can base their optimizations on a more complete class hierarchy.
2.3.3 Dynamic compilation
This section presents the dynamic compilation system of the HotSpot™ JVM. Dynamic compilation in general is introduced in Section 2.1.
In the HotSpot™ JVM, dynamic compilation enables fast application execution as well as efficient runtime profiling. Figure 2.3 shows the state transitions of code in the JVM. First, a method m() is interpreted. If m() is hot, the JVM decides to dynamically compile it. Because profiling in the interpreter is slow and only limited information is available, in a first step the dynamic compiler may add profiling code to the compiled version to gather detailed information while the code is executed (see Section 2.3.4). If the profile information later suggests that the method is still executed a lot, it is re-compiled without profiling.
Figure 2.3: State transitions of code in the JVM.
The profile information gathered in the first step enables optimistic optimizations by making assumptions that may be violated at runtime. For example, information about common call targets can be used for optimistic inlining: if the JVM notices that the target of a virtual call is always the same at runtime, it may optimistically inline it, based on the assumption that the call target will indeed always be the same; the inlining may become invalid if the assumption is violated at runtime (e.g., if a new class is loaded). Optimistic inlining results in a substantial performance gain for Java applications, since virtual calls are used frequently.
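For example, in the following Java code, if profiling shows that the shape argument is always a Circle, the compiler can optimistically inline Circle.area() into accumulate(); loading another Shape subclass that later reaches this call site invalidates that decision and triggers deoptimization:

```java
// A virtual call site that profits from optimistic inlining.
abstract class Shape { abstract double area(); }

class Circle extends Shape {
    final double r;
    Circle(double r) { this.r = r; }
    @Override double area() { return Math.PI * r * r; }
}

class Renderer {
    double total;
    void accumulate(Shape shape) {
        // Virtual call; if the profile shows only Circle here (a monomorphic
        // call site), the compiled code can call Circle.area() directly.
        total += shape.area();
    }
}
```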
Code generated by the dynamic compilers is stored in a code cache (see Section 2.3.5) to avoid
recompilation of already compiled methods. If compiled code calls a not-yet compiled method,
control is transferred back to the interpreter.
If an optimistic assumption is violated at runtime, the corresponding optimization becomes invalid
and the compiled code is invalidated. If the code is currently being executed, the JVM has to
stop execution, undo the compiler optimizations and transfer control back to the interpreter. This
procedure is called deoptimization. In general, deoptimization is hard because the JVM needs to
reconstruct the interpreter state from already compiled code. Hence, the compiled code contains
metadata for deoptimization, garbage collection and exception handling. Detection and deletion of
such invalidated code is done by the code cache sweeper (see Section 2.3.6).
If a method contains a long-running loop, it may never exit but still get hot. To identify such methods and make sure that they are eventually compiled, back branches are counted. If a threshold is reached, the method is replaced with a compiled version while running. To achieve this, the stack frame of the compiled method is initialized from the interpreted stack frame, and control is transferred to the compiled version at the next back branch. This process is called on-stack replacement (OSR).
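A typical OSR candidate looks like this: the method is entered only once, so its invocation counter never overflows, but its loop back-branch counter does, and the running invocation is switched to compiled code mid-loop:

```java
// A method with a long-running loop: entered once, but each loop iteration
// is a back branch, so the back-branch counter overflows and on-stack
// replacement switches this very invocation to compiled code.
class OsrExample {
    static long sumUpTo(long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) { // back branch on every iteration
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumUpTo(1_000_000_000L));
    }
}
```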
The HotSpot™ JVM has two dynamic compilers. The client compiler, also called C1, generates lightly optimized code and has a small memory footprint. C1 is fast and therefore well suited for interactive programs with graphical user interfaces. The server compiler, also called C2, is a highly optimizing compiler. C2 generates code of higher quality than C1 at the cost of a longer compilation time. C2 is well suited for long-running server applications where startup and reaction time are less important.
Traditionally, only one dynamic compiler was used and the user had to choose between a server or a client JVM. In version 7 of the JDK, tiered compilation was introduced, supporting multiple levels of compilation where both compilers are used (see Section 2.3.4). The client and the server compiler are described in more detail in the following sections. Additional information can be found in [20].
2.3.3.1 Client compiler
The client compiler (C1) is a dynamic compiler designed for a low startup time and a small memory footprint. Because of its low compilation time, C1 is used for interactive applications. C1 implements simple optimizations, which leads to a lower peak performance of the generated code compared to the server compiler.
Figure 2.4: Compilation phases of the client compiler. The platform-independent phases are colored
blue, whereas the platform-dependent parts are brown.
The compilation of a method is divided into three phases (see Figure 2.4). First, the platform-independent front end analyses the bytecode by abstract interpretation and builds a high-level intermediate representation (HIR). The HIR consists of a control flow graph and is used to perform several high-level optimizations, for example, constant folding and null check elimination. The optimizations aim at improving local code quality; only few global optimizations are performed. C1 uses method inlining as well, but not as aggressively as the server compiler (see Section 2.3.3.2).
In the second phase, a low-level intermediate representation (LIR) is generated from the HIR by the platform-specific back end. The LIR is similar to machine code, but partly platform-independent. Some low-level peephole optimizations, such as combining of operations, are performed, and virtual registers are assigned to physical registers. Finally, in phase three, machine code is generated by traversing the LIR and emitting instructions.
The client compiler supports adding profiling instructions to the LIR. This profiling code is then executed at runtime and gathers profiling information about methods, similar to the profiling performed in the interpreter. This feature is used with tiered compilation (see Section 2.3.4) to get a more accurate runtime profile for the server compiler.
More detailed information about the client compiler can be found in [20] and [45].
2.3.3.2 Server compiler
The server compiler (C2) is a dynamic compiler designed to generate highly optimized code that reaches peak performance. C2 is slower than the client compiler and therefore more suitable for long-running applications, where the compilation time is paid back by faster execution and a slower startup is negligible.
The phases of the server compiler include parsing, aggressive optimization, instruction selection, global code motion, register allocation, peephole optimization and code generation. An intermediate representation (IR) based on the static single assignment (SSA) form, a property of the intermediate representation that allows each variable to be assigned only once and that simplifies and improves many compiler optimizations [11], is used for all phases. Register allocation is implemented with a graph coloring algorithm [43] that is slower than the linear scan algorithm of the client compiler, but produces better results, especially for the large register sets found in many modern processors.
The server compiler tries to optimize aggressively by making use of the detailed profiling information gathered by the interpreted or profiled code. This includes class-hierarchy-aware inlining, in particular optimistic inlining of virtual calls, which may trigger deoptimization in case the assumptions do not hold (see Section 2.3.3). Other optimizations are global code motion, loop unrolling, common subexpression elimination and global value numbering. Java-specific optimizations are also performed, for example, elimination of null and range checks [31].
Optimistic inlining is one of the most effective compiler optimizations. It not only expands the scope of the compiler, and therefore enhances the efficiency of other optimizations, but also improves the performance of virtual calls because some of them can be converted to static calls. In contrast to other programming languages, such as C++, Java embraces the use of virtual calls. It is therefore important that virtual calls are optimized to be fast. A Class Hierarchy Analysis (CHA) [7] determines whether a virtual call can be converted to a static call and therefore be inlined. If a new class is loaded at runtime, the CHA is adjusted and a deoptimization is performed if the call is no longer static. In this case an inline cache (IC) [18] is used to speed up the virtual call. If the call target is not cached in the inline cache, a (slow) runtime lookup by the JVM is needed.
More detailed information about the server compiler can be found in [43].
2.3.4 Tiered compilation
Although there exist two dynamic compilers, originally only one of them was used at runtime. With version 7 of the JDK, tiered compilation was introduced to get the best of both compilers. The JVM uses the interpreter and the C1 compiler for fast startup and profiling, and the C2 compiler to eventually reach peak performance with highly optimized code.
Tiered compilation executes code at different "tiers", in the following called execution levels. In addition to the interpreter gathering profiling information, the client compiler is used to generate compiled versions of methods that collect profiling information. The main advantage is a faster execution during the profiling phase, since C1-compiled code is considerably faster than interpreted code. Further, the compiled versions provide more accurate profile data that can be used by the server compiler to re-compile the code with sophisticated optimizations. A consequence of tiered compilation is that more code is generated than in a non-tiered setting. More code requires more space in the code cache.
Figure 2.5 lists the execution levels. At level 0, only the interpreter is used and no code is compiled by the dynamic compilers. Level 1 uses the C1 compiler with full optimization. Levels 2 and 3 use the C1 compiler as well, but different amounts of profiling code are added, leading to decreased performance. In general, level 2 is faster than level 3 by about 30%. Finally, level 4 uses the C2 compiler. The C2 compiler does not instrument compiled code.
Figure 2.5: List of execution levels used with tiered compilation. The dotted arrows show one possible transition from interpreting to compiling the method with C1 and gathering profile information
and finally compiling a fully optimized version with C2.
Different transitions between the execution levels are possible. A policy decides the next level for
each method depending on runtime information, for example, the number of compilation tasks
that are currently waiting for processing by the C1 and C2 compilers, profiling information and
thresholds. The dotted arrows in Figure 2.5 show the most common transition: Execution starts
with the interpreter, then the policy decides to compile the method at level 3. After profiling is
completed the transition is made to level 4 where the method is fully optimized and remains until
it is removed from the code cache (see Section 2.3.6).
In general, tiered compilation performs better than a single dynamic compiler. The startup time may even be faster than with the client compiler because C2-compiled code may already be available during the initialization phase. Because profiling is a lot faster with C1-compiled code, a longer profiling phase is possible, resulting in a more detailed runtime profile. This potentially leads to better optimizations by the server compiler and a better peak performance compared to non-tiered compilation.
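For orientation, the five execution levels described above can be summarized as follows (a sketch; HotSpot encodes the levels internally, and the per-level profiling detail in the comments reflects the conventional assignment, not source-verified constants):

```java
// The execution levels used with tiered compilation, for orientation only.
enum ExecutionLevel {
    LEVEL_0, // interpreter with basic profiling
    LEVEL_1, // C1, full optimization, no profiling
    LEVEL_2, // C1 with limited profiling (counters only); faster than level 3
    LEVEL_3, // C1 with full profiling (counters plus type/branch profiles)
    LEVEL_4  // C2, fully optimized using the gathered profile
}
```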
2.3.5 Code cache
The compiled code is stored in a code cache for later reuse. The code cache also contains JVM-internal structures that are generated at startup, such as the bytecode interpreter (see Section 2.3.2) or support code for transitions between the Java application and the JVM.
Figure 2.6: Simplified architecture and interface of the code cache and related components as
implemented by the corresponding C++ classes.
The code cache is a core component of the JVM, being directly referenced from more than 150
places in the source code and indirectly through the Serviceability Agent (see Section 2.3.7) and
supporting tools. Figure 2.6 shows the simplified architecture and interface of the code cache and
related components.
The main part of the code cache is the CodeHeap, a heap-based data structure that provides functionality for allocating and managing blocks of memory. Internally, the code heap uses memory segments of fixed size that are linked together. The code heap maintains a free list to reuse deallocated blocks and implements basic operations, like iteration and searching. The underlying memory is managed by a VirtualSpace that allows committing a reserved address range in smaller chunks. Committing is the process of backing previously reserved memory with physical memory. The reserved memory is represented by a ReservedSpace that basically contains the size and alignment constraints and abstracts the allocation of memory by the operating system.
To summarize, the code cache consists of a heap data structure on top of a contiguous chunk of
memory that is reserved on JVM startup and then committed in small chunks when needed.
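The segment-and-free-list scheme can be sketched as follows (Java for illustration; HotSpot's CodeHeap is a C++ class managing raw executable memory, and the segment size and names here are hypothetical):

```java
// Sketch of segment-based allocation with a free list: memory is managed in
// fixed-size segments, and freed blocks are kept for reuse.
class CodeHeapSketch {
    static final int SEGMENT_SIZE = 64; // bytes per segment (hypothetical)

    static class Block {
        final int firstSegment, segmentCount;
        Block(int first, int count) { firstSegment = first; segmentCount = count; }
    }

    private int nextFreeSegment = 0; // high-water mark in the committed area
    private final java.util.ArrayDeque<Block> freeList = new java.util.ArrayDeque<>();

    Block allocate(int bytes) {
        int segments = (bytes + SEGMENT_SIZE - 1) / SEGMENT_SIZE; // round up
        // First try to reuse a deallocated block of sufficient size.
        java.util.Iterator<Block> it = freeList.iterator();
        while (it.hasNext()) {
            Block b = it.next();
            if (b.segmentCount >= segments) { it.remove(); return b; }
        }
        // Otherwise carve new segments off the committed area.
        Block b = new Block(nextFreeSegment, segments);
        nextFreeSegment += segments;
        return b;
    }

    void deallocate(Block b) { freeList.add(b); } // available for reuse
}
```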
The code heap contains all different types of code. The different code types are abstracted by a CodeBlob. This includes, for example, methods compiled with different optimization levels, so-called nmethods, and non-methods like buffers and adapters. Buffers are used by the compilers to temporarily store generated code, whereas adapters are code snippets that transfer control from compiled code to interpreted code or vice versa.
The interface of the code cache abstracts the underlying implementation by providing high-level
functions to allocate, deallocate and iterate over CodeBlobs and gather statistics, such as the
capacity or free space. The code cache is referenced by many components, for instance, the sweeper
(see Section 2.3.6) or the compile broker, which manages the compilation of methods.
Because the size of the code cache is fixed and cannot change at runtime, the code cache can get
full and there is not enough memory left to allocate space for new methods. If this is the case,
compilation is disabled and the code cache sweeper starts removing methods. If enough memory
is released, compilation is re-enabled.
Detailed information about the implementation of the code cache and its components is presented
in Chapter 5.
2.3.6 Code cache sweeper
The code cache sweeper cleans the code cache by removing methods. Compiled methods can become invalid for various reasons, for example, if an optimization is no longer valid due to class loading. Figure 2.7 shows the different states of a method in the code cache.
Figure 2.7: State transitions of methods after compilation.
After compilation the method is alive, meaning that the code is valid and can be executed. A method can then be made not entrant; a not entrant method cannot be called anymore. This transition is necessary if the code is no longer needed or is invalid, and can be initiated by the following components:
• the sweeper: if the code cache is full (described below),
• deoptimization: if the optimized code becomes invalid because an assumption no longer holds (see 2.3.3),
• dependency invalidation: dependencies are runtime assertions that may trigger deoptimization if violated due to dynamic linking or class evolution,
• tiered compilation: if the code is replaced by a different version (see 2.3.4).
If a method is made not entrant, it cannot be executed anymore, but may still be active on the stack. Hence, the method cannot be removed immediately. To account for this, the code cache sweeper removes methods in several steps.
First, the code cache sweeper performs stack scanning and marks all methods that are active on the stack. If a not entrant method is not marked, and therefore is not active on the stack, it is converted to the zombie state. It is "half dead" in the sense that it is no longer executed, but may still be referenced by inline caches (ICs).
In the second step, all inline caches that refer to zombie or not entrant methods are flushed and
zombie methods change their state to marked for reclamation. If a zombie method that is marked
for reclamation is encountered during the next sweeper iteration, it can simply be removed from
the code cache because no inline caches refer to it.
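The states and transitions just described can be summarized in a compact sketch (the names follow the text above, not HotSpot's internal identifiers):

```java
// The method states the sweeper moves a compiled method through.
enum NMethodState {
    ALIVE,                  // valid, can be called
    NOT_ENTRANT,            // no new calls, but may still be on a stack
    ZOMBIE,                 // off all stacks, but inline caches may refer to it
    MARKED_FOR_RECLAMATION  // ICs flushed; safe to remove in the next sweep
}
```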
Stack scanning happens at so-called safepoints. At a safepoint the JVM stops all execution, and Java threads cannot modify the Java heap or stack. Safepoints are also needed for garbage collection, deoptimization and various debugging operations. They are implemented by a global polling page that is frequently accessed by all threads and is readable during normal operation, but made unreadable when a safepoint is requested. More information can be found in [15].
The code cache sweeper starts sweeping if at least one of the following conditions is met:
1. The code cache is getting full.
2. There is sufficient state change in the code cache since the last sweep. Currently, this is the
case if more than 1% of the bytes in the code cache changed.
3. Enough time has passed since the last sweep. This time threshold depends on the size of the
code cache and is only checked if methods are compiled. The smaller the code cache, the
more often it is swept.
As mentioned above, apart from deoptimization, dependency invalidation and tiered compilation,
the sweeper can decide to make a method not entrant. Making methods not entrant is especially
important if the code cache is getting full and the sweeper must remove the least used methods to
gain enough space to re-enable compilation. This process is performed during sweeping and based
on the hotness of a method.
The hotness measures the utilization of a method. Initially, the hotness is set to a high value that
is decremented every time the method is encountered by the sweeper and reset if a method is found
on the stack of a Java thread. A hot method is likely to be frequently encountered on the stack and
therefore maintains a high hotness value. On the other hand, a cold method will not be found on
the stack and will therefore tend towards a low hotness value. Methods with a low hotness value
are first considered for removal by the sweeper.
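A minimal sketch of this bookkeeping is shown below; the initial value and the decrement are assumptions chosen for illustration, not HotSpot's actual constants.

    const int kInitialHotness = 1000;  // hypothetical starting value

    struct MethodHotness {
      int value = kInitialHotness;
    };

    void update_hotness(MethodHotness& h, bool found_on_stack) {
      if (found_on_stack) {
        h.value = kInitialHotness;  // reset: method is in active use
      } else {
        --h.value;                  // decay: drifts towards removal
      }
    }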
2.3.7 Serviceability Agent
In general, it is hard to debug native code running in the JVM or to debug the JVM itself.
For example, debuggers for the C++ programming language require debug information and are
not aware of the internal memory structures used by the JVM. Java debuggers, on the other hand,
are only able to analyse Java programs but not the underlying JVM.
The Serviceability Agent (SA) is a collection of Java APIs designed for internal usage by the JVM
developers to analyse crashed JVMs. The SA is based on low level debugging primitives and works
by reading the process memory of a JVM. The SA analyses the JVM internal data structures, for
example, stack frames, garbage collection statistics and code cache entries.
Figure 2.8: The Serviceability Agent running on top of a JVM and attaching to a target JVM
through an external tool.
The Serviceability Agent is almost completely written in Java and thus allows cross OS and cross
CPU debugging. The Java classes are basically re-implementations of the HotSpotTM C++ classes
and access the in-memory structures created by the target JVM. Hence, the SA does not rely on
support code running inside the JVM but uses a native tool, such as ptrace, to read the process
memory or directly parses core files. Figure 2.8 shows the overall architecture.
Detailed information about the Serviceability Agent can be found in [47] and [44]. Sections 5.1.3
and 5.2.5 describe the changes to the SA that were necessary to support the segmented code cache
and the dynamic code heap sizes.
2.4 Nashorn JavaScript engine
Nashorn is a lightweight runtime for the scripting language JavaScript (JS). Nashorn is purely
written in Java and runs on top of the HotSpotTM Java Virtual Machine. It allows Java developers
to embed JavaScript in their Java applications, as well as execute independent JS applications
using a command-line tool. Nashorn is developed by Oracle and released with Java 8. It is
fully compatible with ECMAScript [9], the standardized scripting language implemented by
JavaScript.
Compared to Rhino, an open source JS engine developed by the Mozilla Foundation, the current
version of Nashorn is around 2 to 10 times faster [25]. Nashorn replaces Rhino with Java 8.
The engine is based on the Da Vinci Machine [34], a project that aims to add support for dynamic languages to the JVM by providing additional features and bytecode instructions. Although
JavaScript is traditionally interpreted, Nashorn does not implement its own interpreter. It directly
generates bytecode that is executed by the JVM.
Nashorn makes heavy use of MethodHandles and the invokedynamic bytecode introduced by the
Da Vinci Machine project and described in Java Specification Request (JSR) 292 [33]. This is
necessary because in contrast to Java, which is statically typed, JavaScript is a dynamically typed
language where the actual type of an object can only be determined at runtime. The invokedynamic
bytecode supports such loosely specified calls by allowing the linkage between a call site and the
method to be customized. A dynamic call site is linked just before its first execution,
using MethodHandles to specify the actual behaviour. Detailed information and examples can be
found in [39].
Since Nashorn runs on top of the JVM it can provide additional functionality for interoperating
with the Java platform. For example, it is possible to create and manipulate Java objects from
JavaScript or extend Java classes. Because it ships with the Java Runtime Environment
and therefore runs as privileged code, it is designed with secure execution in mind. Nashorn does not implement a web
browser API such as the HTML5 canvas or audio support.
Because Nashorn generates a lot of bytecode, it is mainly used for testing and benchmarking the
implementation presented in this thesis (see Chapters 5 and 6).
Additional information about Nashorn can be found in [50] and [26].
3 Related work
This chapter presents related work that solves similar issues in a different context and compares
the approaches to the solutions presented in this thesis.
3.1 Maxine Virtual Machine
The Maxine Virtual Machine (MVM) [24] is a Java Virtual Machine completely written in Java and
developed by Oracle Labs. The MVM exploits advanced Java language features, like type safety,
garbage collection, annotations and reflection and features a modular architecture. The code base
is fully compatible with modern Java IDEs and the standard JDK.
The Maxine source code is translated to Java bytecode by the standard javac compiler. Instead
of directly using a JVM to execute this bytecode, a so called boot image is generated. It contains
a near-executable memory image of the running MVM consisting of a heap populated with class
data and objects, and a code cache populated with MVM code compiled by Maxine's optimizing
compiler C1X. The boot image generator is a Java application using large parts of the Maxine code
and is executed by an existing JVM, for instance, the HotSpotTM JVM. A small C program then
maps the boot image into memory and calls the MVM entry point.
In contrast to the HotSpotTM JVM, the Maxine VM does not interpret bytecode, but always compiles method bytecodes to machine code on first method invocation. It uses the lightly optimizing,
template-based baseline compiler T1X. Frequently executed (hot) methods are then compiled by
the highly optimizing C1X compiler. Both compilers are written in Java with a small amount of assembly code and compiled at MVM build time. The C1X compiler is a Java-port of the C1 compiler
of the HotSpotTM JVM.
Because Maxine does not use an interpreter, but dynamically compiles all bytecodes encountered,
large amounts of machine code have to be stored. The code cache consists of three regions. The
boot code region is unmanaged, i.e., not affected by garbage collection, and contains all machine
code needed for MVM startup. The run-time region is unmanaged as well and stores the code
generated by the optimizing compiler. The managed baseline region contains code generated by
the T1X compiler.
Similar to the HotSpotTM JVM, baseline code compiled with the T1X compiler can become unused
or obsolete if the corresponding method is recompiled by the optimizing C1X compiler. Code eviction
removes this code if allocation of new space fails, by using a semi-space garbage collection scheme.
With this scheme, memory is split into a from-space and a to-space. All newly compiled code is
allocated into the to-space. If it is full, garbage collection is triggered and the to-space becomes
the from-space. All reachable code is then copied to the to-space and the code remaining in the
from-space is removed. Newly compiled code is then again stored in the to-space.
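The copying step can be sketched generically. Maxine itself is written in Java; the C++ below is only a model of the semi-space scheme, not Maxine code, and the types are hypothetical.

    #include <cstring>
    #include <utility>
    #include <vector>

    struct Space { char* base; char* top; char* end; };

    // Copy all reachable code blocks into the empty to-space using
    // bump-pointer allocation; everything left in from-space is garbage.
    void evacuate(Space& from, Space& to,
                  std::vector<std::pair<char*, size_t>>& reachable) {
      for (auto& block : reachable) {
        std::memcpy(to.top, block.first, block.second);
        block.first = to.top;   // callers must be re-linked to the copy
        to.top += block.second;
      }
      from.top = from.base;     // from-space becomes the next to-space
    }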
The advantage of this code cache design is the fact that garbage collection implicitly compacts the
to-space memory. In contrast to the code cache in the HotSpotTM JVM, no free list is necessary
as new space is allocated incrementally at the top of the to-space (bump-pointer allocation). This
design avoids free list management overhead, such as updating and merging of free blocks. Additionally, the fragmentation is essentially zero as the compiled code is not interspersed with free
blocks.
The boot code region is similar to the non-method code heap introduced in Chapter 4, with the
exception that it only contains ahead-of-time compiled MVM code. The baseline and the run-time
regions are similar to the profiled and the non-profiled code heaps, but are not dynamically resized.
The run-time region is unmanaged and therefore optimized code stays in the code cache forever,
even if it is no longer used.
More information, including performance evaluations, can be found in [52] and [14].
3.2 Graal Compiler
The Graal compiler is an enhanced version of the C1X compiler from the Maxine code base. It is
completely written in Java and uses a new high level intermediate representation [8] to improve
the code quality. Graal tries to improve the performance of programs executed in a JVM to match
server compiler or even native performance. By using Java language features, the compiler is highly
extensible and implements sophisticated optimizations.
The Graal VM project aims to integrate the Graal compiler back into the HotSpotTM JVM to
improve its extensibility and performance. The source code of the JVM is adapted to support
the Graal compiler. Graal uses the existing JVM infrastructure, such as the garbage collector and
object locking and is invoked by the CompileBroker similar to the C1 and C2 compilers. Graal
does not implement its own code cache, but uses the code cache of the HotSpotTM JVM to store
generated machine code.
To ease the implementation of programming languages on top of the Graal VM, a Java framework
called Truffle [53] is provided that relies on abstract syntax tree (AST) interpretation. Graal
performs a partial evaluation of the AST and generates optimized machine code. Because Graal
and Truffle Java code is not part of the executed application, it should be treated specially. For
example, the code that is part of the Graal compiler is not swept and stays in the code cache
forever. It may be beneficial to account for this by creating a dedicated code heap for this code in
future versions of the HotSpotTM JVM (see also Section 7.1).
3.3 Jikes Research Virtual Machine
The Jikes Research Virtual Machine (JRVM) [21] is an open source Java virtual machine that
is completely implemented in the Java programming language. Similar to the Maxine VM (see
Section 3.1), Jikes is meta-circular, i.e., it relies on an external JVM for boot image generation and
uses a small C program to load the image and boot up Jikes.
Jikes does not interpret bytecode but provides three dynamic compilers. A lightly optimizing
baseline compiler is responsible for initially compiling bytecode, and is also included in the boot
image. Jikes directly translates bytecodes to machine code, reducing the initial overhead of dynamic
compilation. A JNI compiler processes native methods according to the Java Native Interface (JNI)
and generates adapters to transition from Java code to native code and vice versa. The optimizing
compiler uses an intermediate representation and performs sophisticated optimizations available in
three levels. It is slower than the baseline compiler but generates high-performance code.
The adaptive optimization system [22] decides which compiler optimization level is used to compile
a method. The compiled code is stored in a Java object, which basically is an array of machine
instructions with additional metadata. A compiled method can become obsolete if it is no longer
valid (e.g., due to classloading) or replaced by a new version. The stack scanning phase of the
garbage collector sets obsolete methods to dead if they are no longer used. They are then removed
by the next garbage collection cycle.
In contrast to the HotSpotTM JVM, the JRVM does not store the generated code in a contiguous
memory region, but allocates Java objects to store the code. A specific feature of the JRVM is
that all calls are performed indirectly through the Jikes RVM Table of Contents (JTOC) for static
methods or the Type Information Block (TIB) for virtual methods (see [23]). Hence, the memory
space the objects are allocated into can be a moving space and the garbage collector defines the
code management policies. The garbage collector ensures that compiled code that is no longer
needed is removed and the free space can be reused. This design limits the amount of code that
can be generated to the available amount of memory.
3.4 Dalvik Virtual Machine
The Dalvik Virtual Machine (DVM) is a Java virtual machine for the Android1 operating system.
It uses its own bytecode format, dex code, which is translated from Java bytecode. Because it is
used on mobile devices, the design emphasizes compactness, memory usage and security. Execution
is done by a highly optimized and fast interpreter. Important OS and library code is statically
compiled. With Android 2.2, a dynamic compiler was introduced that focuses on minimal memory
usage and fast compilation.
Many dynamic compilers used in JVMs operate at method-level granularity, i.e., always compile
entire methods. The disadvantages are long compilation delays and high memory usage. Further,
1 An operating system for mobile devices developed by Google.
cold parts within the hot methods may be compiled as well. The HotSpotTM JVM places uncommon
traps on cold paths that transfer control to the interpreter if executed. The Dalvik compiler
uses trace granularity by identifying hot execution parts, compiling the corresponding blocks and
storing them into a translation cache. In this way, only the hot parts of a method are compiled,
resulting in lower memory usage, fast compilation and low reaction time. However, it requires
more state synchronization with the interpreter because transfers between the compiled code and
the interpreter occur frequently.
Each VM instance uses its own translation cache (there are approaches to share the cache [16]). Similar to the run-time region of the Maxine VM (see Section 3.1), the translation cache is unmanaged.
A contiguous memory space is reserved at DVM startup and bump-pointer allocation is used to
allocate space for compiled code. If the code cache is full, it is flushed and populated again. For
detailed information see the Dalvik source code2, especially the files /vm/compiler/Compiler.cpp
and /vm/compiler/codegen/x86/CodegenInterface.cpp.
2 The Dalvik source code is available at https://android.googlesource.com/platform/dalvik/+/HEAD/
4 Design
This chapter explains the high-level design of the segmented code cache and the dynamic code heap
sizes. Chapter 5 presents the detailed implementation based on these decisions.
The current design of the code cache is optimized to handle only one type of compiled code, although
there are multiple types, created and accessed by different components of the JVM. The segmented
code cache divides the code cache into distinct segments, each of which contains compiled code of
a particular type with specific properties.
The dynamic code heaps allow the code cache segments to dynamically adapt their sizes to the
runtime needs, reducing the limitations of having segments of fixed size. Further, it is now possible
to lazily create the segments when they are first used, and set the size according to the runtime
needs.
4.1 Segmented code cache
As described above, the code cache contains code with different characteristics. An intuitive design
suggests separating code blobs by putting each code type into a distinct code cache segment.
There are multiple disadvantages with this approach. First, a default size for each segment has to
be defined, which is hard because it is usually not clear how much memory each segment needs at
runtime. Second, this approach increases memory fragmentation because there will be free space
between the segments and there are types of code of which only a few actual instances exist at
runtime (e.g., the deoptimization blob1 ). Further, the code locality may decrease because of the
highly separated code segments. The decreased code locality, together with the increased memory
fragmentation, leads to a higher instruction TLB miss rate which additionally affects performance.
Taking these disadvantages into account, a more coarse grained segmentation of the code cache
seems appropriate. The main distinguishing feature of a code segment is the separation of compiled
methods and non-method code, such as runtime stubs and JVM internals. Method code is highly
dynamic, has different compilation levels and lifetimes and makes up most of the code cache. Non-method code is more static, persistent and limited in size, occupying only around 2% of the total size
of the code cache.
1 An entry in the code cache written in assembly and used for deoptimization. On deoptimization, the return address of the corresponding compiled method is patched to redirect execution to the deoptimization blob.
For methods, one can further distinguish between profiled and non-profiled methods. Profiled
methods are lightly optimized and have a limited lifetime, whereas non-profiled methods are highly
optimized and possibly remain in the code cache forever. Therefore, the code cache is segmented
into three parts, corresponding to the following three types of code:
• non-method code: non-method code, such as buffers and the bytecode interpreter,
• profiled code: lightly optimized, profiled methods and
• non-profiled code: fully optimized, non-profiled methods.
To determine the memory needed for each code type at runtime, experiments with different
benchmarks were performed (see Section 6.2). The code cache memory consumption highly depends
on the architecture, JVM settings and application. For example, if tiered compilation is disabled,
no profiled code is generated and the profiled code heap is not created. Additional JVM options
are introduced to enable the user to explicitly set the reserved memory for each code type.
By default, 5 MB are reserved for non-method code and the remaining code cache size is distributed
equally among the profiled and the non-profiled methods. Figure 4.1 shows the simplified memory
layout of the segmentation from low to high addresses.
Figure 4.1: Memory layout of the code cache segments from low to high addresses.
Currently, the JVM only supports code cache sizes smaller than 2 GB2. To ensure that the maximum distance of two segments in
the code cache does not exceed 2 GB, the segments are placed adjacent to each other in memory.
The boundaries between the segments are fixed because currently the top level code heap data
structures do not support resizing of their address spaces. Detailed information about how the
memory layout is implemented can be found in Section 5.1.1.2.
The interface of the code cache is adapted to provide access to the code of a specific type. For
example, instead of iterating over the entire code cache, it is now necessary to specify the type of
code to iterate over.
As already stated in Section 2.3.5, the code cache is directly referenced from more than 150 places
in the source code and indirectly through the Serviceability Agent. The components accessing the
code cache were not designed to support different types of code or a segmented code cache. The
most important components that must be adapted are the AdvancedThresholdPolicy and the
code cache sweeper.
2 The code cache size is limited to 2 GB because currently the assembler used by the JVM generates 32-bit immediates for control flow transfers, such as jumps and calls, and therefore only supports an address space of 2 GB.
The AdvancedThresholdPolicy class implements the tiered compilation policy and manages the
transitions of a compiled method between the execution levels (see Section 2.3.4). It uses information about free space in the code cache to change the compile thresholds accordingly and to prevent
the code cache from filling up too fast. The policy is modified to use the type of the method to
determine the corresponding segment of the code cache and uses the free space in this segment
instead of the overall free space in the code cache.
The code cache sweeper is adapted to skip non-method code by only processing the method code
segments. This reduces the time needed for sweeping. Further, profiled methods, which have
a shorter lifetime, can now easily be swept more frequently than non-profiled methods.
Future versions of the HotSpotTM JVM may include GPU or ahead-of-time (AOT) compiled code,
making the code stored in the code cache even more heterogeneous. This is taken into account by
an extensible design that allows new code types to be added easily (see Section 5.1.6). Additional code
can then be stored in a separate code cache segment.
Section 5.1 describes the implementation of the segmented code cache. The segments are implemented as multiple code heaps, heap-based data structures that provide functionality for allocating
and managing blocks of memory.
4.2 Dynamic code heap sizes
One disadvantage of a segmented code cache is that in the original design, the size of the segments is
fixed. The JVM is highly dynamic and static default values for the sizes are not always applicable.
For example, a long running application will generate mostly profiled code in the beginning and use
this profile information to generate highly optimized and non-profiled code that potentially stays
in the code cache forever. This means that at the beginning there is a majority of profiled methods
and later more space for non-profiled methods is needed. The code cache fills up even faster with
small code cache sizes.
The same problem applies to non-method code. On the one hand, the memory space needed
depends on JVM settings, such as whether tiered compilation is used. The memory space depends on the
architecture as well, such as the number of cores (more cores induce more compiler threads). On
the other hand, it also depends on the application. Again, a static default value for the non-method
code cache segment is hard to define.
One solution to this problem would be to dynamically allocate more code cache segments and
deallocate segments that are no longer needed. Despite the fact that this would increase memory
fragmentation, it is not always possible because the maximum distance between the segments must
not exceed 2 GB (see Section 4.1).
Another solution uses an already existing code cache segment if one segment is full. For instance,
if the segment for profiled methods is full, the non-profiled segment is used to store the additional
profiled methods. This solution solves the problems, but eliminates the advantages of the segmented code cache: code of different types would be mixed again. It is also hard to implement
because optimizations that depend on a segmented code cache would have to be reverted at runtime
if code of different types is no longer separated.
The approach taken in this thesis is based on the idea to dynamically move the boundary between
segments to expand one and shrink the other. The segments for non-profiled and profiled methods
are adjacent to each other in memory. To be able to move the boundary between the segments,
both segments need to fill up towards the boundary. This means that the non-profiled segment
has to grow downwards and the profiled segment has to grow upwards. Figure 4.2 illustrates the
memory layout of the code cache segments. The dotted arrows show the direction of growth.
Figure 4.2: Memory layout of the code cache segments with dynamic code heap sizes. The dotted
arrows show the direction of growth. The size of the non-method segment is lazily set at runtime.
Initially, all segments are created with a fixed size as described in Section 4.1. If, for example,
the profiled segment gets full and allocation fails, the non-profiled segment shrinks by moving the
boundary towards higher addresses until there is enough space in the profiled segment.
The design is extensible and allows the layout of the code cache segments to be changed, for example,
if new segments are introduced or the order of the existing ones needs to be changed.
The code cache segment for non-methods is a special case. It is not possible to move the boundary
into the profiled segment at runtime because it grows towards the higher addresses. But because
most non-method code, such as the interpreter and runtime stubs, is generated at JVM startup
and compilation of methods starts afterwards, the profiled and non-profiled method segments are
created lazily.
This means in detail that the non-method segment first occupies the entire memory space reserved
for the code cache. When the first method is allocated, i.e., JVM startup is completed, the size of
the non-method segment is fixed and the method segments are created using the remaining space.
Because after the startup a small amount of non-method code is created as well, the non-method
segment is fixed to its current size plus a buffer space to account for this. The size of the additional
buffer is equal to the JVM option CodeCacheMinimumFreeSpace (500 kB by default).
If the C2 compiler is enabled, additional space for the scratch buffers3 allocated at runtime is
needed. The following formula computes this space as 0.1% of the memory reserved
for the code cache (but at least 500 kB) plus an additional 128 kB for each compiler thread:
max(500 kB, ReservedCodeCacheSize · 0.001) + (CICompilerCount · 128 kB)
Evaluation shows that the additional space is sufficient for non-method code generated after JVM
startup. Hence, the JVM option to control the size of the non-method segment is no longer needed
and is removed. A JVM option is introduced to enable or disable the dynamic resizing of code segments
because there are scenarios where it is not needed (for example, without tiered compilation).
Because the segments are now resized, the definition of free space that is available in a segment is refined. The possibilities for the profiled segment are illustrated in Figure 4.3. The
AdvancedThresholdPolicy is changed to consider the space that is free in the entire code cache,
instead of only one segment (Figure 4.3 (c)). The code cache sweeper is adapted to consider only
the space in the current segment (Figure 4.3 (a)) because resizing is not always possible. For example, if the space at the boundary of the adjacent segment is occupied, resizing is not possible,
even if there is a lot of free space available. Hence, the sweeper has to start sweeping. Figure 4.3
(b) shows a setting where only a part of the free space is taken into account. Evaluation shows
that versions (b) and (c) result in insufficient sweeping.
Figure 4.3: Possibilities to measure the available space in the profiled method segment. The
memory space marked in white is free, the space marked in green is considered to be available for
the profiled method segment.
3 A scratch buffer is a temporary buffer blob created by the C2 compiler to emit code into.
5 Implementation
This chapter presents the implementation of the two major contributions of this thesis in detail:
the segmented code cache and the dynamic code heap sizes. The code version described here is
based on the design decisions introduced in Chapter 4 and thoroughly evaluated in Chapter 6.
The changes are provided as two patches to the code in the HotSpotTM JVM Mercurial repository1,
changeset 5765:9d39e8a8ff61 from 27 December 2013, which is called the baseline version in the following. The patch for the segmented code cache fixes bug JDK-8015774: Add support for multiple
code heaps [29] in the JDK Bug System. The patch for the dynamic code heap sizes builds on these
changes. An overview of the changes in chronological order is provided in Section 1.2.1.
The code of the HotSpotTM JVM is stored in the src folder. Paths to files listed in the following
sections always start in this folder.
5.1 Segmented code cache
This section describes the implementation of the segmented code cache, including changes to other
components that are necessary to support multiple code cache segments. The implementation of
the code cache can be found in the file /share/vm/code/codeCache.cpp.
Section 4.1 presents the types of code and the three segments the code cache is divided into. The
following sections describe the management and layout of these code segments, now called code
heaps, in detail. The adaptations include changes to other components, for example, the code cache
sweeper and the Serviceability Agent. Further, the sections describe the changes to support-code
for external tools that access the code cache.
As described previously, extensibility is of utmost importance since there will be new code
types in the future. Section 5.1.6 therefore describes the integration of new code types and the
corresponding code heaps.
5.1.1 Code cache
As described in Section 2.3.5 the code cache of the baseline version contains a single code heap for
storing all code. There is no functionality to distinguish between code blobs of different types. For
example, the function CodeCache::allocate takes the size as an argument and returns a pointer
to the newly allocated space. The code cache does not know the code type that will be stored.
1 http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot
5.1.1.1 Code blob types
The first step is to keep track of the code types when new code is compiled and stored in the
code cache. A struct CodeBlobType2 defines the types MethodNonProfiled, MethodProfiled and
NonMethod for non-profiled methods, profiled methods and non-method code. The struct is then
used throughout the code cache interface to select the appropriate code type and serves as an
abstraction of the code heaps. For example, the iterator functions CodeCache::first_blob and
CodeCache::next_blob now take a CodeBlobType as an argument and iterate over the corresponding code blobs or code heap, respectively. The same applies to other functions, such as allocation
and deallocation of code blobs, where the destination code heap has to be specified.
To set the code type right from the beginning, the implementation of CodeBlob2 and nmethod3
is adapted to propagate the code type through the new operator to CodeCache::allocate. For
non-method code, such as runtime stubs, the code type is simply NonMethod. For methods, the code type is determined by the compilation level that corresponds to the execution level
described in Section 2.3.4. The function CodeCache::get_code_blob_type implements the translation between the compilation level and the code type by taking a compilation level and returning
the corresponding code type.
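The following sketch models this mapping. The tiered levels follow Section 2.3.4 (levels 2 and 3 are the profiled C1 levels); the enum and function are simplified from the actual definitions.

    struct CodeBlobType {
      enum Type {
        MethodNonProfiled,  // fully optimized, non-profiled methods
        MethodProfiled,     // lightly optimized, profiled methods
        NonMethod           // buffers, adapters, runtime stubs, ...
      };
    };

    // Levels 2 (C1, limited profiling) and 3 (C1, full profiling) produce
    // profiled code; levels 1 (C1, no profiling) and 4 (C2) do not.
    CodeBlobType::Type get_code_blob_type(int comp_level) {
      if (comp_level == 2 || comp_level == 3) {
        return CodeBlobType::MethodProfiled;
      }
      return CodeBlobType::MethodNonProfiled;
    }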
5.1.1.2 Code heaps
As described in Section 2.3.5, the CodeHeap4 is a heap-based data structure that provides functionality for allocating and managing memory blocks. The baseline version of the code cache contains
only one code heap for storing and retrieving all code.
To add support for multiple code segments, multiple code heaps must be created. A one-to-one
relationship between the code types and the code heaps is established because code of a specific type
should be stored in its own code heap. To support this relationship, fields for storing the name and
code type are added to the implementation of the code heap. The function CodeHeap::accepts
checks if a code heap stores code of the given type.
The code heaps are created in CodeCache::initialize_heaps during initialization of the code
cache. To make sure that the maximum distance in memory between the code heaps does not
exceed 2 GB (see Section 4.1), the underlying ReservedSpace5 is first created and then split into
three parts using existing functionality. Each part is used to initialize the VirtualSpace5 of a
code heap. Because in the baseline version each code heap creates its own ReservedSpace, the
implementation of the function CodeHeap::reserve is changed to take a ReservedSpace that was
previously created. Figure 5.1 shows the overall picture.
2 Defined in /share/vm/code/codeBlob.hpp
3 Defined in /share/vm/code/nmethod.hpp
4 Defined in /share/vm/memory/heap.hpp
5 Defined in /share/vm/runtime/virtualspace.hpp
Figure 5.1: Layered structure of the segmented code cache. ReservedSpace is created during
startup, split up and used to initialize the VirtualSpaces of the code heaps. The CodeBlobTypes
are used to access the code of a specific type residing in a specific code heap.
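The splitting scheme can be sketched as follows; ReservedSpace here is a minimal stand-in for the HotSpot class, defined just well enough to show how one contiguous reservation is divided into three adjacent parts.

    #include <cstddef>

    // Minimal stand-in for HotSpot's ReservedSpace, just enough to show
    // how one contiguous reservation is split into three parts.
    struct ReservedSpace {
      char* base; size_t size;
      ReservedSpace first_part(size_t n) const { return { base, n }; }
      ReservedSpace last_part (size_t n) const { return { base + n, size - n }; }
    };

    void initialize_heaps(ReservedSpace rs, size_t non_method_size,
                          size_t profiled_size) {
      // Splitting one reservation keeps all heaps within a 2 GB distance.
      ReservedSpace non_method   = rs.first_part(non_method_size);
      ReservedSpace rest         = rs.last_part(non_method_size);
      ReservedSpace profiled     = rest.first_part(profiled_size);
      ReservedSpace non_profiled = rest.last_part(profiled_size);
      // Each part then initializes the VirtualSpace of one code heap.
      (void)non_method; (void)profiled; (void)non_profiled;
    }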
As described in Section 4.1 the following code heaps are created by default:
• a non-method code heap containing non-method code, such as buffers and runtime stubs,
• a profiled code heap containing lightly optimized, profiled methods, and
• a non-profiled code heap containing fully optimized, non-profiled methods.
If tiered compilation is disabled, the profiled code heap is not created because profiling is only done
in the interpreter. The non-profiled code heap is expanded accordingly.
The JVM options6 NonMethodCodeHeapSize, ProfiledCodeHeapSize and NonProfiledCodeHeapSize
are introduced to control the size of each code heap. Checks7 for consistency of code heap and code
cache sizes are added to validate user provided values.
The code cache keeps track of the code heaps by storing them in a GrowableArray8 , a dynamic
extensible array data structure, very similar to a vector or list from the C++ Standard Template Library (STL). The code heaps are only used by the code cache and not directly accessible
from outside. The internal function CodeCache::get_code_heap performs the mapping between a
CodeBlobType, used in the interface, and the corresponding code heap, by iterating over the array
and returning the code heap that accepts the code type.
All functions of the code cache are adapted to access multiple code heaps. For example, the
CodeCache::allocate function now takes a code blob type as argument to allocate space for this
type of code in the code cache. The function then invokes CodeCache::get_code_heap to get the
corresponding code heap, allocates memory in this code heap and returns a pointer to the allocated
memory.
6 Defined in /share/vm/runtime/globals.hpp
7 Implemented in /share/vm/runtime/arguments.cpp
8 Defined in /share/vm/utilities/growableArray.hpp
Custom iterators for the GrowableArray provide simple access to multiple code heaps.
A GrowableArrayIterator is used to iterate over all entries of a GrowableArray, whereas a
GrowableArrayFilterIterator iterates over elements that satisfy a given predicate. For example,
a predicate may specify that the code heaps have to accept a given set of code blob types. The
custom iterators implement the STL iterator interface9 .
Currently, the GrowableArrayFilterIterator is used to iterate over all code heaps containing
methods (i.e., the profiled and the non-profiled code heaps). To select the code heaps, a predicate called IsMethodPredicate is used that returns true for all code heaps accepting the method
CodeBlobTypes. Additional predicates can easily be defined (see Section 5.1.6).
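The filtering can be modelled as shown below; the predicate name follows the text, while the container and visitor types are simplified stand-ins for GrowableArray and its iterators.

    #include <vector>

    struct CodeHeap {
      bool accepts_method_code;  // true for the (non-)profiled heaps
    };

    struct IsMethodPredicate {
      bool operator()(const CodeHeap* heap) const {
        return heap->accepts_method_code;
      }
    };

    // Equivalent of walking a GrowableArrayFilterIterator: visit only
    // the heaps for which the predicate holds.
    template <typename Pred, typename Visitor>
    void for_each_heap(const std::vector<CodeHeap*>& heaps,
                       Pred pred, Visitor visit) {
      for (CodeHeap* heap : heaps) {
        if (pred(heap)) visit(heap);
      }
    }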
If one of the code heaps is full, memory allocation fails. This is noticed at several locations in
the code and reported to the CompileBroker10, an intermediate component handling compilation
requests. The CompileBroker prints a warning message and disables the dynamic compilers until
the code cache sweeper has freed enough space. To provide detailed information about a full
code heap, the JVM reports the code blob type for which the allocation failed. The CompileBroker
forwards this information to the report_codemem_full function of the code cache, which prints the
warning message containing the code heap that is full.
The code cache also provides functions to obtain statistical information about the capacity, in
particular the maximum, unallocated and current capacity. In a segmented code cache, the statistical information can be computed for one code heap or for the entire code cache. Hence, existing
functions are kept and additional functions are added to compute the values for one code heap by
providing the corresponding code type.
The AdvancedThresholdPolicy class uses information about the free space in the code cache to
change compile thresholds for tiered compilation. Adaptive thresholds prevent the code cache from
filling up too fast. The policy uses the function CodeCache::reverse_free_ratio, which returns
the reverse of the free ratio of the code cache. For example, if 25% (1/4) of the code cache is free, the
function returns 4.
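In essence, the computation is a simple division (a sketch, not the exact HotSpot function):

    // Reverse free ratio: total capacity divided by unallocated capacity.
    // 25% free  ->  returns 4;  50% free  ->  returns 2.
    double reverse_free_ratio(double max_capacity, double unallocated) {
      if (unallocated <= 0.0) return max_capacity;  // illustrative guard
      return max_capacity / unallocated;
    }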
In a segmented code cache there can be space available in one code heap, even if all other code
heaps are full. The compile thresholds must be set according to the free space that is available
for the destination code type. If, for instance, a method moves from execution level 3 (C1, full
profiling) to execution level 4 (C2), the thresholds should be set according to the space that is
available in the non-profiled code heap. Consequently, reverse_free_ratio is modified to take the
code type provided by the AdvancedThresholdPolicy and compute the reverse free ratio of the
corresponding code heap. The function is also used by the code cache sweeper (see Section 5.1.2).
Finally, debugging functions like printing of status information (CodeCache::print_internals and
CodeCache::print_summary) and verify functions (CodeCache::verify, self-verification functions
only executed in debug builds) are adapted to work with a segmented code cache.
9 See http://www.cplusplus.com/reference/iterator/
10 Defined in /share/vm/compiler/compileBroker.hpp
5.1.1.3 Optimizations
Some of the functions provided by the code cache only address method code. For example,
first_nmethod, find_nmethod or nmethods_do iterate over compiled methods and ignore non-method code. Because the baseline version stores all code in one code heap, the code cache must
iterate over all entries and skip non-method code.
Although non-method code makes up only around 2% of the code cache, skipping non-methods
pollutes the source code with runtime checks and decreases the performance of method operations
on the code cache. The code cache sweeper is affected the most because it periodically scans all
methods but must skip non-method code (see Section 5.1.2).
With the custom iterators, the functions are changed to only iterate over code heaps that contain compiled methods, using the GrowableArrayFilterIterator in combination with the
IsMethodPredicate. All runtime is_nmethod() checks are removed from the code cache functions. To summarize, the sweeper now iterates over fewer code cache entries and performs no runtime
checks. Section 6.5 evaluates the performance gain of these optimizations.
5.1.2 Code cache sweeper
As described in Section 2.3.6, the code cache sweeper is responsible for cleaning up the code cache.
The sweeper scans the code cache, updates the states and hotness values of methods and removes
methods that are no longer needed or invalid, especially if the code cache is full. Sweeping is done by compiler threads.
To reduce the time spent sweeping, one full traversal of the code cache is split up into several smaller
iterations that eventually cover all methods. The number of invocations is controlled by the JVM
option NmethodSweepFraction.
Because in the baseline version the sweeper sweeps a single code heap, the implementation is
changed to support sweeping multiple code heaps. The sweeper scans all method code heaps,
starting with the non-profiled code heap, and skips the non-method code heap (non-method code
is not swept). A field is added to keep track of the current code type, i.e., the current code heap,
and continue with this code heap in the next iteration.
Since it is guaranteed that only methods are encountered while scanning the code heaps, all
is_nmethod() checks are removed. The fact that significantly fewer code cache entries are processed
reduces the time spent sweeping the code cache (see evaluation in Section 6.5).
The function NMethodSweeper::possibly_sweep invokes the sweeper (i) if the code cache is getting full, (ii) if there is sufficient state change in the code cache compared to the last sweep or
(iii) if enough time has passed since the last sweep. A formula based on the
CodeCache::reverse_free_ratio function invokes the sweeper more often for small code cache sizes
or if the code cache is getting full. Because there are multiple code heaps, the maximum
reverse free ratio of all code heaps is used. That means that the sweeper is invoked if one of the
code heaps reaches its maximum capacity.
It may be more efficient to only sweep the fullest code heap or in general sweep the profiled
code heap more often because profiled methods have a shorter lifetime. Such a selective sweeping
mechanism is described in Section 7.1 and may be implemented in future versions.
When processing a method, its hotness value is compared to a threshold value. If the hotness value
is below the threshold, the method is set to not entrant and removed during the next iterations.
The threshold value is computed using the reverse free ratio of the code heap the current method
is stored in. The fuller the code heap, the higher the threshold gets and the more methods are
removed from this code heap.
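A sketch of this decision is given below; the scaling constant is an assumption chosen for illustration, not the actual formula used by the sweeper.

    // The fuller the code heap, the larger its reverse free ratio and
    // therefore the threshold, so more methods qualify for removal.
    bool should_make_not_entrant(int hotness, double heap_reverse_free_ratio) {
      const double kScale = 10.0;  // hypothetical scaling factor
      double threshold = kScale * heap_reverse_free_ratio;
      return hotness < threshold;
    }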
5.1.3 Serviceability Agent
As described in Section 2.3.7, the Serviceability Agent (SA) is a collection of Java APIs and tools
to debug the HotSpotTM JVM. The SA is based on low level debugging primitives and works by
directly reading the process memory and analysing data structures, such as stack frames, garbage
collection statistics and code cache entries.
The Serviceability Agent is executed in its own JVM and does not execute code in the target
JVM. The SA relies on the VMStructs class11 that contains a table with descriptions of the classes
and corresponding fields of the HotSpotTM JVM source code. The SA also contains processor
and platform dependent declarations, for example, CPU registers and the memory size of different
types. Most classes are declared with VMStructs as a friend, so that even private fields can be
accessed. For example, there are entries that describe the fields of the nmethod class, allowing the
SA to gather information about compiled methods.
The SA is almost completely written in Java and basically re-implements the C++ classes of the
HotSpotTM JVM (the Java code can be found in /agent/src/share/classes/sun/jvm/hotspot/).
These Java classes access fields declared in the VMStructs class at runtime. They are referenced by
their name and read out of the memory space of the target JVM. For example, the main functionality of the code cache, such as searching for methods and iterating over the code, can be found in
the Java class sun.jvm.hotspot.code.CodeCache. It builds upon the Java implementation of a
code heap (sun.jvm.hotspot.memory.CodeHeap).
The SA is adapted to support a segmented code cache with multiple code heaps. First, the
GrowableArray field of the C++ implementation of the code cache is added to the VMStructs
class, to be able to access the array from Java code. During initialization, the SA reads the field
and instantiates a local copy of this GrowableArray implemented in Java12 . The Java functions
contains, findBlob and iterate are adapted to access the code heaps in the GrowableArray.
Additional helper functions are added to extract the necessary information.
To test the adapted functionality, the HotSpotTM Debugger (HSDB) [3] is used. The debugger
lists the compiled methods in the code cache and only succeeds if the Java classes can access the
code heaps.
11 Defined in /share/vm/runtime/vmStructs.hpp
12 Declared in sun.jvm.hotspot.utilities.GrowableArray
More information about the implementation of the SA can be found in [36] and [47].
5.1.4 Dynamic tracing framework DTrace
DTrace is a dynamic tracing framework developed to debug the operating system kernel and applications. DTrace is available for several operating systems, for example, Solaris, Mac OS X and
FreeBSD. Tracing scripts written in the D programming language13 contain so-called probes and
associated actions that are executed if the conditions of a probe are met. For example, a probe can
fire if certain functions are executed or a process is started in the profiled application. The probe
may then examine the call stack or supplied arguments and print information.
The HotSpotTM JVM provides probes that can be used in a D script to monitor the state of the
JVM or running Java applications (a list of probes can be found in [38]). This includes probes
for garbage collection, method compilation and class loading. Currently, only the Solaris and BSD
operating systems are supported. The DTrace support code for these platforms can be found in
/src/os/solaris/dtrace/ and /src/os/BSD/dtrace/, respectively. A description of the detailed
implementation can be found in [36].
To enable DTrace to also show Java frames in the stack traces and resolve the name of the corresponding Java functions, a helper script jhelper.d14 is provided that implements the lookup
of function addresses. Because these addresses point to compiled methods in the code cache, the
helper script has to be updated to support a segmented code cache.
The helper script accesses the code cache in memory by referring to the corresponding symbol
defined in the shared library. The offsets that are necessary to compute the addresses of the fields
are generated by the file generateJvmOffsets.cpp. The helper script of the baseline version uses
these offsets to directly access the entries of the code heap and then resolve the function address.
Because the segmented code cache has multiple code heaps, the script is changed to first access
the GrowableArray which stores pointers to all code heaps. The generateJvmOffsets.cpp file is
adapted to additionally generate the offsets of the len and data fields used by the GrowableArray
to store the number of elements and the actual data in array form. New probes are added to first
obtain the destination code heap and read its configuration, such as the address of the segment
table, and then continue by resolving the function address in this code heap.
One limitation of the D language is that it has no support for loop statements. It is
therefore impossible to iterate over the GrowableArray to search for the destination code heap.
At the moment the helper script supports up to five code heaps in the code cache, by specifying a
probe for each. If more code heaps are added, the probes have to be extended.
More information can be found in the DTrace User Guide [37].
13 A language inspired by C, consisting of so-called probes with conditions and actions that are executed if the corresponding probe fires, i.e., the condition is met.
14 Located in /os/solaris/dtrace/
5.1.5 Stack tracing tool Pstack
Pstack is a utility that prints the stack trace of all threads currently executed by a process. It is
used for debugging purposes, for example, to figure out where a process is stuck. The HotSpotTM
JVM includes Pstack support to not only show the stack frames of JVM internal threads, but also
find the names of Java methods on the stack of those threads currently executing Java code. The
support code can be found in /os/solaris/dtrace/ and /os/bsd/dtrace/, respectively.
Pstack performs the name lookup by calling into the shared library libjvm_db.so15 shipped with
the HotSpotTM JVM. This support library gathers the necessary information by directly reading
the memory space of the JVM, using the same technique as the DTrace script (see Section 5.1.4).
Similar to DTrace, changes to the code are necessary to support a segmented code cache with
multiple code heaps. Instead of referencing the symbol in the shared library, the corresponding
entry in the VMStructs class (see Section 5.1.3) is used to access the GrowableArray of code heaps.
Compared to DTrace, the advantage is that the Pstack support library is written in C and therefore
loops can be used to iterate over the code heap array. On initialization, a local array is created to
store the code heap configurations. The implementation of the contains function, which checks if the
code cache contains a method, and of find_start, which finds the start segment of a method in a code
heap, is changed to support multiple code heaps.
More information about Pstack can be found in [36].
5.1.6 Adding new code heaps
As described in Section 4.1, future versions of the HotSpotTM JVM are likely to include GPU
and/or ahead-of-time (AOT) compiled code that should be stored in a separate code heap. The
implementation of the segmented code cache is extensible, allowing a new code type and
the corresponding code heap to be defined in just a few steps. If, for example, a new code heap for GPU code
needs to be added to the code cache, the following steps are necessary:
• Definition of a new code type: Creation of a new CodeBlobType for GPU code. This type
is used to access the GPU code in the code cache, including allocation and deallocation of
memory. If the GPU code should be treated similar to method code, for example, by the
sweeper, the IsMethodPredicate must be adapted.
• Creation of the code heap: CodeCache::initialize_heaps creates and initializes the new
code heap with a part of the memory space reserved for the code cache. A new JVM option
can specify the size of the new code heap.
• Define code heap availability: If the code is not always available or used, the availability
criteria can be defined in CodeCache::heap_available so that the code heap is only created
if necessary.
15 The implementation can be found in /os/solaris/dtrace/libjvm_db.c
GPU code created with the new CodeBlobType is then stored in a separate code heap.
5.2 Dynamic code heap sizes
This section describes the implementation of the dynamic resizing of code heaps, including changes
to other components that are necessary to support dynamic code heap sizes. As described in Section
4.2, the design is based on the idea of dynamically moving the boundary between the code heaps to
expand one code heap at the cost of the other. The implementation builds upon the
changes introduced by the segmented code cache (see Section 5.1).
To be able to move the boundary between two adjacent code heaps, both code heaps need to fill
up towards the boundary, i.e., one code heap has to grow upwards and the other code heap has
to grow downwards. As shown in Figure 5.1, the memory used by a code heap is managed by a
VirtualSpace on top of a ReservedSpace.
The baseline version only supports upwards growing code heaps. Therefore, the implementation of all related components has to be adapted to support downwards growth and the moving of
boundaries. In the following sections the changes are described from bottom-up. First, Section
5.2.1 presents the implementation of the VirtualSpace class. Section 5.2.2 then describes the
changes to the code heap to support a downward growing VirtualSpace and expansion into an
adjacent code heap. Finally, Section 5.2.3 changes the implementation of the code cache such that
the code heaps are dynamically resized if one code heap is full.
5.2.1 Virtual space
The VirtualSpace class allows committing of a ReservedSpace in smaller chunks by providing
functions to expand and shrink the initially committed memory. The ReservedSpace contains the
size and alignment constraints and abstracts the allocation of memory by the operating system.
The reserved space is therefore independent of the direction of growth. The virtual space, however,
has to be adapted to support growing top-down.
To add the missing functionality to the virtual space implementation, either (i) a subclass is added
that grows downwards, (ii) a class is added that reimplements the virtual space but grows downwards, or (iii) an additional parameter is introduced that controls the direction of growth. The
first solution is ruled out because there is no subclass relationship between a virtual space growing downwards and one growing upwards; a new superclass combining the common functionality would have to be added. The second solution is ruled out because it would cause a lot
of code duplication. The third solution is implemented and described below.
To support large pages16, the virtual space is split into three parts, called the lower, middle and upper
regions. The lower and the upper regions are always page size aligned and used to correctly align
the low and high addresses of the middle region. They are usually very small or may not exist at
all if the low and high addresses of the virtual space are already aligned. The middle region is large
page size aligned if the platform supports large pages.
16 Normally the kernel provides memory that is allocated in chunks of 4 kB, named pages. However, most CPU architectures and operating systems support bigger pages that are allocated in one block and not swapped to disk.
Figure 5.2 (a) shows the regions and corresponding pointers of an upwards growing virtual space.
High memory addresses are on top, low addresses on the bottom. The lower_high_boundary,
middle_high_boundary and upper_high_boundary pointers define the three regions by marking
their high boundaries and are set on initialization of the virtual space. The white part represents
unused and the blue part represents already committed memory. The lower_level, middle_level
and upper_level pointers mark the current usage of each region. They are aligned according to
the region's alignment and updated if the committed space is expanded or shrunk. For example,
the lower region is full, therefore lower_level is equal to lower_high_boundary. The level
pointer marks the usage level of the entire virtual space without region alignment. This means
that it is equal to the requested space and may be lower than the actual committed space due to
alignment constraints. It corresponds to the high or low water mark (depending on the direction
of growth) for the last allocated byte. In Figure 5.2 (a), for example, the level pointer is smaller
than middle_level because middle_level is large page aligned.
Figure 5.2: Regions and pointers of the VirtualSpace to support large pages. The white parts
represent unused and the blue parts represent committed memory.
Figure 5.2 (b) shows a virtual space growing downwards. The boundary pointers do not need
to change, since the regions do not change. The level pointers are now initialized to the high
boundaries of the corresponding region and move from high to low addresses.
To control the direction of growth, the additional variable grows_up is introduced and added to
the initialization functions of the virtual space. If set to true, the virtual space grows towards the
higher memory addresses.
In the baseline version the level pointers are called ..._high (for instance, upper_high) because
they always determine the highest committed address in the corresponding region. They are
renamed to ..._level, as shown in Figure 5.2, because they may now point to the lowest address if the virtual space grows downwards. The level pointers are initialized to the corresponding region boundary to represent an empty region. For example, middle_level is initialized to
lower_high_boundary for an upwards growing and set to middle_high_boundary for a downwards
growing virtual space.
The main functionality of the virtual space, namely expanding and shrinking the committed memory space, is implemented by the functions expand_by and shrink_by. Expanding works by first
calculating the unaligned new level, based on the number of bytes that are needed. Then, the
unaligned new levels for each region are determined by simply comparing the overall level to the
boundaries of each region. After aligning the region levels based on the corresponding region alignment, their initial values are compared to the new values to determine the regions that are affected
by the growth. Memory in those regions is committed in pages of the corresponding size and the
level pointers are adapted.
The implementation of expand_by is changed to support a downwards growing virtual space if
grows_up is set to false. This affects the calculation of the new levels (levels are decreased),
alignment (rounding down instead of up), comparison of level pointers and boundaries (the level
pointer is compared to the upper boundary of the region) and committing of memory.
To uncommit previously committed memory, the function shrink_by is used. Shrinking works
similar to expanding by first calculating the new levels, determining which regions are affected and
finally uncommitting the memory. The function is adapted to support downwards growing.
As already described above, the actual committed size of the virtual space may be greater than the
requested size because of the alignment constraints. Hence, the implementation of the virtual space
provides not only the function committed_size, which works by simply computing the difference of
the level pointer and the boundary, but also a function actual_committed_size that sums up
the committed memory of all regions. The function is adapted to support a downwards growing
virtual space. A function actual_uncommitted_space, to calculate the actual uncommitted space,
is added and used for the dynamic resizing of code heaps (see Section 5.2.2). Further, the runtime
assertions17 and verification functions18 for debug builds are adapted to work with a downwards
growing virtual space.
17 A predicate leading to an error and stopping the program if violated at runtime. It may, for example, check if the level addresses are always valid, i.e., point to an address inside the virtual space boundaries.
18 Verification functions are executed periodically in debug builds to verify consistency of JVM internal structures. For example, VirtualSpace::check_for_contiguity verifies the correctness of the level and boundary pointers of a virtual space.
Having support for upwards and downwards growing virtual spaces, it is now possible to implement
expanding into another virtual space that is adjacent and has the inverse direction of growth. The
following new functions are added:
• set_high_boundary: Sets the high boundary of the virtual space and adapts the boundaries of the middle and upper region accordingly. The middle and upper region level may be affected as well, but the overall level is not changed.
• set_low_boundary: Sets the low boundary of the virtual space and adapts the boundaries of the lower and middle region accordingly. The lower and middle region level may be affected as well, but the overall level is not changed.
• grow_into: Moves the boundary between the virtual space and an adjacent virtual space such that the size of this space is increased by a given number of bytes and the size of the adjacent virtual space is decreased accordingly. The already committed area remains unchanged for both virtual spaces.
The grow_into function is used by the code heap to implement the dynamic resizing (see Section 5.2.2). The function first checks if there is enough uncommitted memory available in the other virtual space and then moves the boundary into the lower part of the other virtual space if growing upwards, or into its upper part if growing downwards. Because of the complexity of the code, numerous assertions are added to check that the spaces are adjacent, the directions of growth are correct and the resulting virtual spaces are valid.
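The core of this boundary move can be illustrated with the following hedged sketch; Space is a hypothetical, stripped-down stand-in for VirtualSpace, and the numerous assertions mentioned above are reduced to a single uncommitted-space check.

    #include <cstddef>

    // Hypothetical, stripped-down stand-in for VirtualSpace.
    struct Space {
      char*  low_boundary;
      char*  high_boundary;
      bool   grows_up;
      size_t uncommitted;  // bytes reserved but not committed
    };

    // Moves the boundary between two adjacent, inversely growing spaces
    // so that 'self' gains 'bytes' of reserved memory from 'other'.
    // Only uncommitted memory may change hands; the committed areas of
    // both spaces stay untouched.
    bool grow_into(Space& self, Space& other, size_t bytes) {
      if (other.uncommitted < bytes) return false;  // neighbour too full
      if (self.grows_up) {
        // 'self' lies below 'other': shift the shared boundary upwards.
        self.high_boundary += bytes;
        other.low_boundary += bytes;
      } else {
        // 'self' lies above 'other': shift the shared boundary downwards.
        self.low_boundary   -= bytes;
        other.high_boundary -= bytes;
      }
      other.uncommitted -= bytes;
      self.uncommitted  += bytes;
      return true;
    }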
5.2.2 Code heap
The implementation of the CodeHeap is based on an underlying VirtualSpace, called memory in the following. The memory is split into so-called segments of fixed size (defined by CodeCacheSegmentSize, 64 bytes by default), numbered in ascending order. If a block of space in the code heap is allocated, multiple segments are reserved and linked together, starting with a special header segment. The header segment contains information about the length (number of segments) of the block and whether it is used. If a block is deallocated, it is added to a free list (a linked list of free blocks) and is later reused. Already committed but not previously used memory is marked as unused and is only used if no suitable block is found in the free list. If allocation fails, additional space in the virtual space is committed and initialized as unused.
Figure 5.3 shows a simplified example memory layout of the virtual space of an upwards growing code heap. Segments 0 and 1 belong to a block of length one that is in the free list. Segments 2 to 5 are part of a block of size three that is currently used. Segments 6 and 7 are committed but not yet used.
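A simplified model of this block layout might look as follows; the real HotSpot block header stores additional bookkeeping, so the field set shown here is illustrative only.

    #include <cstdint>

    // Illustrative model of a code heap block header. Every block starts
    // with a header segment carrying its length and usage state.
    struct BlockHeader {
      uint32_t length;  // block length in segments, including the header
      bool     used;    // false once the block is deallocated
    };

    // Deallocated blocks are linked into a free list for reuse.
    struct FreeBlock : BlockHeader {
      FreeBlock* next;  // next free block, or nullptr
    };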
A second virtual space, the so-called segment map, is used to efficiently find the header segment of
a block, given a segment inside this block. This is needed by different components of the JVM, e.g.,
inline caches. For each segment in the memory virtual space, there is an entry in the segment map.
This entry contains the number of the segment in the corresponding block. The arrows in Figure
5.3 show an example lookup. To find the header segment of the block the segment number 5 (1.)
belongs to, the corresponding entry in the segment map is consulted (2.). The entry states that
segment 5 is the third segment in the block. Therefore, the header is located three segments lower.
A second lookup in the segment map (3.) verifies the location: the distance to the header is now zero, and the header segment is read from memory (4.). The second lookup is needed because the entries in the segment map are limited to one byte and therefore multiple lookups are necessary if the value of the block size does not fit into one byte (see CodeHeap::find_start for details about the implementation). Unused segments are marked with the special value 0xFF in the segment map.

Figure 5.3: Example layout of the virtual spaces for memory and segment map.
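The lookup can be sketched as follows; this is a simplified version of what CodeHeap::find_start does, and the function name used here is invented.

    #include <cstdint>

    const uint8_t kUnusedSegment = 0xFF;  // marker for unused segments

    // Walks the segment map towards the header segment of the block that
    // 'segment' belongs to. Entries store the distance to the block start
    // truncated to one byte, so several hops may be needed for large
    // blocks; an entry of zero marks the header itself.
    long block_start_segment(const uint8_t* segment_map, long segment) {
      if (segment_map[segment] == kUnusedSegment) return -1;  // unused memory
      while (segment_map[segment] != 0) {
        segment -= segment_map[segment];  // hop towards the header
      }
      return segment;
    }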
To implement dynamic resizing, the code heap has to support a downwards growing virtual space. A parameter grows_up is added to the constructor to control the direction of growth. If the code heap grows downwards, the segments of the virtual space must be numbered in ascending order from top to bottom because expansion takes place towards the lower addresses. The same applies to the segment map.
Allocation of new blocks is adapted such that the header segment still resides at the lowest address of the block (now the segment with the highest number) and the segment map is initialized accordingly. The implementation of the iterator functions first_block and next_block is modified to iterate from top to bottom if the code heap is growing down. The helper functions segment_for, to get the segment number for an address, and block_at, to get the block for a segment number, as well as the debug functions are adapted accordingly.
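The address-to-segment mapping for both growth directions can be illustrated with the following sketch; the parameter list is invented for illustration, since the real segment_for works on the code heap's internal pointers.

    #include <cstddef>

    // Maps an address inside the code heap to its segment number.
    // An upwards growing heap numbers segments from the low boundary;
    // a downwards growing heap numbers them from the high boundary,
    // because that is where expansion starts.
    size_t segment_for(const char* addr, const char* low, const char* high,
                       size_t segment_size, bool grows_up) {
      return grows_up
          ? (size_t)(addr - low) / segment_size
          : (size_t)(high - addr - 1) / segment_size;
    }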
To enable dynamic resizing of the code heaps by expanding one code heap into another, the function grow_into is added. The function tries to move the boundary of the code heap into the space of another, adjacent code heap such that its virtual space is increased while the space of the other code heap is decreased. Thereby the code heap is not expanded, i.e., the committed memory stays the same. If there is not enough uncommitted space available in the other code heap, the function tries to shrink the other code heap by uncommitting already committed but unused memory. An increased number of segments needs a larger segment map. To be able to increase the size of the segment maps accordingly, not only the memory virtual spaces but also the segment maps need to be adjacent to each other.
The previously unimplemented function shrink_by is now implemented to support the shrinking of code heaps. The function shrinks the committed memory by uncommitting already committed space and removing free blocks starting from the code heap boundary.
Figure 5.4 shows the dynamic resizing of the method code heaps. The virtual spaces of the profiled
and the non-profiled code heap are adjacent in memory. The profiled code heap grows upwards,
whereas the non-profiled code heap grows downwards. In Figure 5.4 (a) the non-profiled code heap
is almost full; only some small free blocks are available on the free list. In contrast, the profiled
code heap has unused and even not yet committed space at its upper boundary, marked in white
and blue. The grey part is memory that is already committed by the virtual space due to alignment
constraints, but not yet initialized by the code heap.
Figure 5.4: Dynamic resizing of method code heaps. The non-profiled code heap is full and grows
into the profiled code heap.
The code heaps are resized to increase the size of the non-profiled code heap. First, the shrink_by method shrinks the profiled code heap by uncommitting the alignment space and the unused space and removing some of the free blocks from the free list. Now there is enough uncommitted space to lower the boundary between the code heaps. The non-profiled code heap can be expanded by committing the additional space. Figure 5.4 (b) shows the result.
5.2.3 Code cache
To be able to dynamically resize the profiled and non-profiled code heap, their virtual spaces for memory and segment map have to be placed adjacent to each other in memory. The function CodeHeap::reserve is adapted to take the ReservedSpace objects for memory and segment map as arguments instead of allocating them internally at a random position in memory. The code cache creates the reserved spaces adjacent in memory and passes them to the code heaps. They are then used to initialize the virtual spaces. The function get_segmap_size is added to compute the size of the segment map according to the size of the code heap.
The resizing of the profiled and non-profiled code heap is performed as described in Section 5.2.2. It is necessary if allocate fails at runtime due to a lack of space in one of the method code heaps. The function allocate is adapted to use a new function expand_heap, which tries to resize the code heaps if one code heap is full, instead of only trying to reserve more memory in the current code heap.
The function expand_heap first checks if the given code heap can be expanded by allocating more memory in its own virtual space. If this is not possible and the code heap is a method code heap, the function makes use of the helper function get_adjacent_heap to find the adjacent code heap. If there is enough free space in this code heap, the boundary between the code heaps is moved accordingly. The newly gained space is committed and therefore available for allocations. A new JVM option MoveCodeHeapBoundaries is introduced that controls the behaviour of expand_heap. If it is set to false, no resizing is performed. For example, if tiered compilation is disabled, the profiled code heap does not exist and dynamic resizing is disabled.
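The control flow of expand_heap roughly follows this sketch; the CodeHeap interface shown here is a minimal stand-in, get_adjacent_heap is only declared as an assumed helper, and the JVM option is modelled as a plain global variable.

    #include <cstddef>

    // Minimal stand-in for the CodeHeap interface used below.
    struct CodeHeap {
      virtual bool expand_by(size_t bytes) = 0;                  // commit more memory
      virtual bool grow_into(CodeHeap* other, size_t bytes) = 0; // move the boundary
      virtual bool is_method_heap() const = 0;
      virtual ~CodeHeap() {}
    };

    CodeHeap* get_adjacent_heap(CodeHeap* heap);  // assumed helper (see text)
    extern bool MoveCodeHeapBoundaries;           // JVM option (see text)

    // Tries to make 'bytes' of space available in 'heap': first within
    // its own virtual space, then by moving the boundary towards the
    // adjacent method code heap and committing the newly gained space.
    bool expand_heap(CodeHeap* heap, size_t bytes) {
      if (heap->expand_by(bytes)) return true;  // room in the own virtual space
      if (!MoveCodeHeapBoundaries || !heap->is_method_heap()) return false;
      CodeHeap* other = get_adjacent_heap(heap);
      if (other == nullptr || !heap->grow_into(other, bytes)) return false;
      return heap->expand_by(bytes);            // commit the gained space
    }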
With the segmented code cache, the AdvancedThresholdPolicy uses the reverse free ratio of the
destination code heap to set the compile thresholds for tiered compilation (see Section 5.1.1.2). It
is adapted to use the reverse free ratio of the entire code cache if dynamic resizing of code heaps
is enabled. This is justified by the fact that free space in other code heaps can be used by moving
the boundary and therefore the compile thresholds should only increase if all method code heaps
are full.
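The intuition behind the reverse free ratio can be captured in a few lines; this is a hedged sketch of the idea, not the exact computation used in HotSpot.

    #include <cstddef>

    // The reverse free ratio grows as the cache fills up: it is 1 for a
    // completely free cache and approaches the capacity as free space
    // vanishes. The threshold policy scales the compile thresholds with
    // this value, so fewer methods are compiled when space becomes scarce.
    double reverse_free_ratio(size_t capacity, size_t free_bytes) {
      if (free_bytes == 0) return (double)capacity;  // treat a full cache as the worst case
      return (double)capacity / (double)free_bytes;
    }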
5.2.4 Lazy creation of method code heaps
As described in Section 4.2, the method code heaps are created lazily after JVM startup when the first method allocation takes place. The function create_heaps initially assigns all space reserved for the code cache to the non-method code heap. In particular, the function initializes the virtual space of the non-method code heap with the entire ReservedSpace. Only when the first method allocation is requested by allocate is the function initialize_method_heaps executed. It fixes the size of the non-method code heap and initializes the method code heaps.
Because new allocations in the non-method code heap, for example for adapters, still occur after JVM startup, the code heap is fixed to its current size plus the CodeCacheMinimumFreeSpace. If tiered compilation is enabled, additional space for the C2 scratch buffers is needed. To account for this additional space, 1% of the memory reserved for the code cache (but at least 500 kB) plus an additional 128 kB for each compiler thread are allocated. The non-method code heap is expanded or shrunk accordingly and the upper boundary of the virtual space is set (see Figure 4.2).
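Under these assumptions, the sizing rule amounts to the following computation; the function name and parameter set are invented for illustration, while the constants are taken from the text above.

    #include <cstddef>

    // Computes the size the non-method code heap is fixed to after
    // startup: current usage plus CodeCacheMinimumFreeSpace, plus, with
    // tiered compilation, room for the C2 scratch buffers.
    size_t non_method_heap_size(size_t used_bytes, size_t minimum_free,
                                size_t code_cache_size,
                                int compiler_threads, bool tiered) {
      size_t size = used_bytes + minimum_free;
      if (tiered) {
        size_t scratch = code_cache_size / 100;          // 1% of the code cache
        if (scratch < 500 * 1024) scratch = 500 * 1024;  // but at least 500 kB
        size += scratch + (size_t)compiler_threads * 128 * 1024;  // 128 kB per thread
      }
      return size;
    }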
The underlying reserved space is split according to the non-method code heap size and the remaining
space is used for the method code heaps. The non-method code heap size is thereby subtracted
from the non-profiled code heap size. The JVM option NonMethodCodeHeapSize, to set the size of
the non-method code heap, is removed because its size can now be implicitly controlled by either
setting the overall code cache size or the CodeCacheMinimumFreeSpace.
If the existing JVM option PrintCodeCacheExtension is enabled, detailed debug output about
the dynamic resizing of code heaps and especially about the lazy creation of method code heaps is
printed.
5.2.5 Serviceability Agent
To be able to determine the direction of growth of a code heap, the grows_up field is added to the VMStructs class. The ..._high fields of the VirtualSpace are renamed to ..._level as described in Section 5.2.1 and all references are adapted.
The Java implementation of the code heap is changed to support downwards growing virtual spaces.
A field growsUp is added, determining the direction of growth, and initialized by reading the
corresponding grows_up field of the code heap as defined in the VMStructs table. The function
begin is adapted to return the lowest used address instead of the lower boundary and the helper
functions segmentFor and blockAt are changed to count the segments from top down if the code
heap is growing downwards. The function blockBase, returning the header segment of a code
block, is updated to start at the high address and use a negative offset in the segment map if the
code heap is growing downwards.
5.2.6 Dynamic tracing framework DTrace
The generateJvmOffsets.cpp file is adapted to generate the offset of the grows_up field in the
CodeHeap class. This offset is then used by the DTrace helper script to read the actual value and
determine the direction of growth of a code heap. Depending on this direction the probes for
finding the header segment of a block and finding the block for a segment number either count the
segments top down or bottom up.
5.2.7 Stack tracing tool Pstack
The Pstack support library libjvm_db.c is adapted to additionally save the direction of growth for each code heap in the local array heap_grows_up. The functions segment_for and block_at then use this information and either start at the high addresses while indexing the code heap if it grows down, or start at the low addresses if it grows up. The same applies to the find_start method that is responsible for finding the header segment of a code block.
6 Evaluation
This chapter evaluates the performance of the implementation of the segmented code cache and
the dynamic code heap sizes and compares the results to the baseline version.
Section 6.1 presents the experimental setup with details about the machines and benchmarks used.
Section 6.2 determines the best default values for the code heap sizes and Section 6.3 assesses the
dynamic adaption of the code heaps. Section 6.4 evaluates the overall performance. Section 6.5 measures the time taken by the code cache sweeper to remove unused methods. Section 6.6 determines the code
cache memory fragmentation of the baseline version and compares the fragmentation to the segmented code cache. Because the Instruction Translation Lookaside Buffer (ITLB) miss rate is likely
to be affected as well, Section 6.7 measures it together with the instruction cache miss rate using
hardware performance counters. Section 6.8 illustrates the hotness of methods in the different code heaps and compares it to the hotness distribution in the baseline version.
6.1 Experimental setup
To account for different usage scenarios of the JVM, testing and evaluation is performed on two
different machines:
• 4-core system: Desktop computer with an Intel Core i7-3820 CPU at 3.60 GHz (4 physical
and 8 virtual cores, 10 MB cache) and 8 GB main memory running Ubuntu 12.04.3 (precise).
GCC version 4.6.3 is used to build the JVM.
• 32-core system: Server with four Intel Xeon E7-4830 CPUs at 2.13 GHz (8 physical and 16 virtual cores each, 24 MB cache) and 64 GB main memory running Ubuntu 11.10 (oneiric). GCC version 4.6.1 is used to build the JVM.
The implementation is also tested under Solaris and Windows to account for platform-specific properties, for example, large page support or external tools (see Sections 5.1.4 and 5.1.5). Additionally, Oracle's internal regression test facility JDK Putback Reliability Testing (JPRT), which tests the implementation on different platforms, is used to verify correctness.
To get detailed performance measurements under real-world conditions, the following benchmark
suites are used:
• Octane: A JavaScript benchmark developed by Google to measure the performance of real-world JavaScript applications [12]. The benchmark has a runtime of about 7 minutes. Its
version 2.0 is executed using the JDK 8 JavaScript engine Nashorn (see Section 2.4) version
1.8.0 build b121.
• SPECjbb2005: A Java benchmark developed by the Standard Performance Evaluation
Corporation (SPEC) to evaluate the performance of server side Java applications [48]. It
emulates a three-tier client/server system and has a runtime of about 2 hours and 20 minutes.
The latest version 1.07 is used.
• SPECjbb2013: A Java benchmark developed by SPEC to measure the performance of the
latest Java 7 application features [49]. It models a world-wide supermarket IT infrastructure
and has a runtime of about 2 hours and 20 minutes. The latest version 1.0 is used.
The Octane benchmarks are used primarily because in combination with Nashorn the dynamic
compilers generate a lot of code, which is well suited to stress-test the code cache. The SPECjbb
benchmarks have a longer runtime but generate less code. More detailed information about the
benchmarks can be found in Section A.2 of the Appendix.
The execution and evaluation of the benchmarks is automated using a collection of Python scripts
to be able to reproduce the results later. The graphs presented in the following sections always
show the arithmetic mean of multiple runs together with the 95% confidence interval displayed as error bars.
The segmented code cache and dynamic code heap sizes are provided as two patches to the code in the HotSpotTM JVM Mercurial repository (http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot), changeset 5765:9d39e8a8ff61 from 27 December 2013, which is called the baseline version in the following.
6.2 Default code heap sizes
The implementation of the segmented code cache provides JVM options to set the sizes of the
non-method, the profiled and the non-profiled code heap. All values are dependent on the memory
reserved for the code cache, as this space is shared between the code heaps. To determine how the
code cache memory should be distributed, reasonable default values are determined by measuring
performance while executing benchmarks with different code heap sizes. Although it is possible to
define platform dependent default values, currently the same conservative values are used for all
platforms.
To determine the default size for the non-method code heap, all benchmarks (see Section 6.1) are
executed and the required space for non-method code is measured. On the 4-core system a code
heap size of 4 MB is sufficient, whereas the 32-core system needs around half a megabyte more
space. This is because more compiler threads are executed and therefore more C2 code buffers
are created. The default non-method code heap size is set to 5 MB to make sure the JVM runs
efficiently on all platforms.
Next, the default sizes for the method code heaps are determined. Octane is used as a short running
and SPECjbb2005 is used as a long running benchmark. Both are executed with a small code cache
size of 64 MB, to make sure that the code cache is getting full, and different non-profiled code heap
sizes. Octane is executed 20 times (7 minutes each) and SPECjbb2005 is executed 5 times (2 hours
and 20 minutes each) for each configuration on the 32-core system. Figure 6.1 shows the benchmark
results.
[Figure: Code heap sizes with Octane and with SPECjbb2005, both for the segmented code cache. Panel (a): Nashorn with the Google Octane benchmark, Octane benchmark score vs. size of non-profiled code heap (MB). Panel (b): SPECjbb2005 benchmark, SPECjbb2005 benchmark score (in thousand) vs. size of non-profiled code heap (MB).]
Figure 6.1: Performance evaluation with different non-profiled code heap sizes on the 32-core
system. The code cache size is 64 MB, where 5 MB are used for the non-method code heap and
the rest is distributed equally between the profiled and the non-profiled code heap.
Figure 6.1(a) shows that there is a performance degradation for small and large non-profiled code
heap sizes. The system performs best with a non-profiled code heap size of 20 MB, i.e., with a profiled code heap size of around 40 MB (64 MB minus the space needed for the non-method code heap and the non-profiled code heap). This is because Octane consists of multiple short running benchmarks and therefore the
JVM profiles a lot of code.
The SPECjbb2005 benchmark, presented in Figure 6.1(b), shows no performance change with
different non-profiled code heap sizes. This is because there is not enough code generated to fill up
the code heaps and it therefore makes no difference if the code cache is segmented or not.
To account both for short running and long running applications, the available memory space is
equally distributed among the non-profiled and the profiled code heaps. For example, with a code
cache size of 65 MB, 5 MB are used for the non-method code heap and 30 MB are used for each
method code heap.
6.3 Dynamic code heap sizes
To monitor the dynamic resizing of the code heaps, the JVM option PrintCodeCacheExtension
is used. The option prints information about a code heap whenever it is expanded or resized. The
Octane benchmarks are executed on the 4-core machine with a code cache size of 64 MB. To force
the JVM to resize the code heaps, the profiled code heap is set to only 10 MB and the non-profiled
code heap is set to 54 MB.
The output shows that the internal JVM startup is completed around 0.3 seconds after the start
and the non-method code heap is fixed to 2.2 MB, which is sufficient for the 4-core system. The
method code heaps are created and the JVM allocates 10 MB to the profiled code heap and 51.8 MB (54 MB minus 2.2 MB for the non-method code heap) to the non-profiled code heap.
[Figure: Dynamic code heap sizes (Octane Benchmark, 4-core). Code heap size (MB) vs. time since start (seconds), with curves for the non-profiled code heap, the used part of the non-profiled code heap, the profiled code heap and the used part of the profiled code heap.]
Figure 6.2: Dynamic resizing of code heaps with the Google Octane benchmark. The code cache
size is 64 MB, the profiled code heap is initially set to 10 MB and the non-profiled code heap to 54
MB. The solid lines show the reserved size whereas the dotted lines represent the used part.
Figure 6.2 shows the variation of the code heap sizes over time after the start. The solid lines show
the reserved size whereas the dotted lines represent the memory that is used by the code heap. The
used size corresponds to the committed size described in Section 5.2.2 and is illustrated by the red,
green and blue parts in Figure 5.4.
At the beginning, the profiled code heap grows fast because a lot of code is profiled. The used part
is always equal to the reserved space that is continuously increased by growing into the non-profiled
code heap. The non-profiled code heap shrinks due to the memory consumption of the profiled
code heap. However, the usage of the non-profiled code heap increases, albeit more slowly than for
the profiled code heap. After 91 seconds the sizes of the code heaps have stabilized, but the usage
of the non-profiled code heap still grows. This is because at this stage the profiled methods are
now replaced by non-profiled and highly optimized versions. The total runtime of the benchmark
is 223 seconds, but after 125 seconds both code heaps use all their reserved memory. Compilation is not disabled because the method sweeper starts removing methods and space is added to the free lists of the code heaps (counted as "used" here). The final sizes are 32.6 MB for the profiled and 20.2 MB for the non-profiled code heap.
6.4 Overall performance
To measure the overall performance of the segmented code cache and the dynamic code heap
sizes, the benchmarks described in Section 6.1 are executed with different code cache sizes. Small
sizes, between 16 MB and 64 MB, are used to make sure that the code cache fills up and method
sweeping as well as dynamic resizing takes place. Large code cache sizes, from 128 MB to 256 MB,
evaluate the implementation without code cache contention. The short running Octane benchmark
is repeated 20 times and the long running SPECjbb benchmarks are repeated three times for each
configuration.
[Figure: Octane Benchmark (4-core). Octane benchmark score vs. code cache size (MB) for the dynamic code heap sizes, the segmented code cache and the baseline version.]
Figure 6.3: Octane benchmark with different code cache sizes on the 4-core machine.
Figure 6.3 shows the results of executing the Octane benchmark with each implementation on
the 4-core machine with different code cache sizes. For very small sizes, the baseline version
clearly performs best. This is because on average the fragmentation of a segmented code cache
with multiple code heaps is higher than the fragmentation of a single code heap (see Section 6.6).
Therefore, the code heaps fill up faster and compilation is disabled until the method sweeper has freed enough memory, leading to a performance regression. The dynamic code heap sizes perform slightly better (around 17%) than the segmented code cache because the code heaps can dynamically adapt to the runtime needs and therefore fill up more slowly.
With code cache sizes greater than 32 MB, the segmented code cache performs on average 1% to 4% better than the baseline version, but with a high variation it is hard to confirm this. The performance gain is probably due to a lower sweep time (see Section 6.5) and a better ITLB behaviour (see Section 6.7). For 16 MB and 48 MB the dynamic code heap sizes perform worse than the baseline version and partly even worse than the segmented code cache, although the code heaps are able to adapt to the runtime requirements. This performance degradation is due to the code cache sweeper. The sweeper removes too many methods because the code heaps seem full prior to resizing (see Section 6.5). With a larger code cache the dynamic code heap sizes perform better, with a performance gain of around 4%.
[Figure: Octane Benchmark (32-core). Octane benchmark score vs. code cache size (MB) for the dynamic code heap sizes, the segmented code cache and the baseline version.]
Figure 6.4: Octane benchmark with different code cache sizes on the 32-core machine.
Figure 6.4 shows the same configuration executed on the 32-core system. On average, the implementations perform worse than on the 4-core system. The lower performance is due to the processor being slower (2.13 GHz vs. 3.60 GHz) and the limited parallelizability and scalability of the Octane benchmark. Comparing the performance for different code cache sizes, the implementations perform comparably to the execution on the 4-core system. The segmented code cache performs
up to 7% better than the baseline version, except for a very small code cache size of 16 MB. The
dynamic code heap sizes perform worse for small code cache sizes and similar to the segmented
code cache for larger code cache sizes.
Figure 6.5 displays the performance of the SPECjbb2005 benchmark executed on the 4-core machine. On average the difference in performance of the segmented code cache and the dynamic
code heap sizes compared to the baseline version is below 0.5%. Additionally, the 95% confidence
intervals show that there is no measurable difference between the implementations. To verify this
result, the same configuration is executed on the 32-core system. The results are shown by Figure
A.1 in Section A.1 of the Appendix. The average performance is significantly better than on the
4-core system because the SPECjbb2005 benchmark scales better than the Octane benchmark. The
average performance gain compared to the baseline version is below 1%.
[Figure: SPECjbb2005 (4-core). SPECjbb2005 bOps (in thousand) vs. code cache size (MB) for the dynamic code heap sizes, the segmented code cache and the baseline version.]
Figure 6.5: SPECjbb2005 benchmark with different code cache sizes on the 4-core machine.
[Figure: SPECjbb2013 (32-core). max-jOPS (in thousand) vs. code cache size (MB) for the dynamic code heap sizes, the segmented code cache and the baseline version.]
Figure 6.6: SPECjbb2013 benchmark with different code cache sizes on the 32-core machine.
The SPECjbb2013 benchmark is executed on the 32-core system as well. Figure 6.6 shows that the confidence intervals are wide and the implementations perform equally well. The first version of the SPECjbb2013 benchmark that is used here still seems to have a very high variance, and even more than three runs do not decrease the confidence intervals. Nevertheless, it is noticeable that the performance of the segmented code cache and the dynamic code heap sizes is around 68% worse than the baseline version for a very small code cache size of 16 MB. As stated earlier, this is due to the increased fragmentation resulting in the individual code heaps filling up more rapidly. The same configuration is executed on the 4-core system (see Figure A.2 in the Appendix).
6.5 Sweep time
The code cache sweeper performs stack scanning at safepoints to update the hotness values of methods and determines the methods that are no longer needed. The sweeper removes methods in multiple
steps (see Section 2.3.6). To measure the time taken by the sweeper, a patch to the HotSpotTM
JVM is implemented that adds the JVM option PrintMethodFlushingStatistics. If enabled,
additional information about the code cache sweeper, for example, the total time taken, is printed
(see Section A.3). The patch also fixes bug JDK-8025277: Add -XX: flag to print code cache
sweeper statistics [30] in the JDK Bug System.
The Octane benchmarks are executed on the 4-core machine with different code cache sizes and
the time taken by the sweeper is measured. Figure 6.7 shows the results for 20 iterations per code
cache size.
[Figure: Sweep time (Octane Benchmark, 4-core). Total sweep time (seconds) vs. code cache size (MB) for the dynamic code heap sizes, the segmented code cache and the baseline version.]
Figure 6.7: Time taken by the method sweeper on the 4-core machine.
The sweeper of the segmented code cache performs better than the baseline version by around 19%
to 46%. The dynamic code heap sizes perform between 12% and 21% better for small code cache
sizes up to 64 MB, but up to 54% worse for larger code cache sizes. More sophisticated evaluations
show that this is because the code cache sweeper is invoked too often and sweeps too many methods. In more detail, the NMethodSweeper::possibly_sweep method uses the maximum reverse free ratio of all method code heaps (CodeCache::reverse_free_ratio) to decide if the sweeper should be invoked. The function NMethodSweeper::process_nmethod then uses the reverse free ratio of the corresponding code heap to compute the hotness threshold that decides if a method should be removed. The problem is that with the dynamic code heap sizes the reverse free ratio of a code heap may be large even if there is still enough space in the adjacent code heap allowing the code heap to grow. Additionally, there is more code generated than with the segmented code cache version because the AdvancedThresholdPolicy is adapted to use the reverse free ratio of the entire
code cache (around 37% more methods are compiled on the 4-core system). Especially the profiled
code heap quickly gets full and is then expanded dynamically by growing into the non-profiled code
heap. Because the dynamic resizing is done in small steps, the sweeper assumes that the profiled
code heap is always almost full and sweeps as often as possible (see also Section 6.6). For small
code cache sizes up to 64 MB, the behaviour of the sweeper is appropriate because the code heaps
are indeed full and sweeping is necessary. The sweep time is almost identical to the value of the
segmented code cache.
Simply changing the implementation of the sweeper to use the reverse free ratio of the entire code cache does not improve, but greatly degrades performance. This is because the sweeper then sweeps too little, resulting in the code cache getting full and compilation being disabled. Also, adapting the AdvancedThresholdPolicy does not solve the problem because then not enough code is generated to make use of the dynamic resizing of the code heaps. Multiple solution approaches are described in Section 7.1.
[Figure: Sweep time (Octane Benchmark, 32-core). Total sweep time (seconds) vs. code cache size (MB) for the dynamic code heap sizes, the segmented code cache and the baseline version.]
Figure 6.8: Time taken by the method sweeper to remove methods on the 32-core machine.
Figure 6.8 shows the results of the same benchmark on the 32-core machine. On average, the
sweep time is larger than on the 4-core machine. This is partly because the runtime is higher (426
seconds instead of 261 seconds) and more compiler threads (18 instead of 4 compiler threads) are
used, leading to an increased amount of code. The trend of the sweep time is similar: the segmented code cache performs up to 41% better, and the dynamic code heap sizes perform worse than the baseline version for larger code cache sizes.
To measure the sweep time while executing a long running program, the SPECjbb2005 benchmark
is used. Figure 6.9 shows the results of executing the benchmark on the 32-core machine with 3
repetitions for each code cache size. On average the sweep time is extremely low (0.1 to 1.6 seconds)
[Figure: Sweep time (SPECjbb2005 Benchmark, 32-core). Total sweep time (seconds) vs. code cache size (MB) for the dynamic code heap sizes, the segmented code cache and the baseline version.]
Figure 6.9: Time taken by the method sweeper to remove methods on the 32-core machine.
compared to the Octane benchmark. This is because the SPECjbb2005 benchmark generates
less code, so that even with small code cache sizes almost no sweeper activity is necessary. Due
to the high variance it is not possible to make a statement about the differences between the
implementations.
Although the sweep time is greatly improved for the segmented code cache, the overall performance is only slightly affected (see Section 6.4). This can be explained by the fact that the code cache sweeper is executed by only one (compiler) thread in parallel to normal execution and therefore affects performance to a lesser extent; only stack scanning must be executed during a safepoint. However, the performance gain may be improved by using separate locks for each code heap (see Section 7.1), instead of one lock for the entire code cache that has to be acquired each time the sweeper processes a method.
6.6 Memory fragmentation
To evaluate the fragmentation of the code heaps and compare it to the fragmentation of the single code heap of the baseline version, an additional patch extends the code cache by a function print_usage_graph that prints the length and usage information of each code heap block. The function is executed once at the end of the execution, analysed by a Python script and visualized by a graph. The graph always grows down and is independent of the direction of growth of the corresponding code heap.
To be able to easily compare the fragmentation of different versions, the external fragmentation (the fragmentation that occurs if the allocated memory is interspersed with free segments) is computed using the following formula described in [51]:

External Memory Fragmentation = 1 - (Largest Block Of Free Memory / Total Free Memory)
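As a sketch, the formula translates into the following computation over the free block sizes of a heap. For example, if the free space consists of one 9.3 MB block and several small blocks totalling 0.7 MB, the fragmentation is 1 - 9.3/10 = 7%.

    #include <cstddef>
    #include <vector>
    #include <algorithm>
    #include <numeric>

    // Computes the external fragmentation of a heap from the sizes of
    // its free blocks, following the formula above.
    double external_fragmentation(const std::vector<size_t>& free_blocks) {
      if (free_blocks.empty()) return 0.0;  // no free memory, no fragmentation
      size_t total   = std::accumulate(free_blocks.begin(), free_blocks.end(),
                                       (size_t)0);
      size_t largest = *std::max_element(free_blocks.begin(), free_blocks.end());
      return 1.0 - (double)largest / (double)total;
    }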
[Figure: Code cache fragmentation (Octane Benchmark, 4-core).]
Figure 6.10: Fragmentation of the code cache of the baseline version. Profiled methods are marked
in red, non-profiled methods and non-method code is marked in blue.
Figure 6.10 shows the fragmentation of the code cache of the baseline version after a single run of the
Octane benchmark on the 4-core machine. The code cache size is set to 256 MB because for small
code cache sizes the fragmentation varies a lot due to the frequent sweeping. Segments containing
profiled methods are marked in red, non-profiled methods and non-method code is marked in blue.
The vast majority of the code cache is occupied by profiled methods that are mixed with non-profiled methods and non-method code. Because only a few blocks between the allocated segments
are free and most of the free space is present in one block at the top of the code heap, the external
fragmentation for this execution is 7.02%.
Figure 6.11 shows the fragmentation of the code heaps for the segmented code cache version after executing the same benchmark configuration. The sizes of the graphs are fixed and not related to the size of the corresponding code heap. The graph titled "All code heaps" shows the overall memory layout of the code cache containing the three (adjacent) code heaps. The numbers in parentheses specify the external fragmentation of each code heap for this run. As expected, non-method, profiled and non-profiled code is perfectly separated. The fragmentation of the non-method code heap is similar to the fragmentation of the baseline version, whereas the fragmentation of the profiled code heap is worse and the fragmentation of the non-profiled code heap improved.
Figure 6.12 shows the same information for the dynamic code heap sizes version. Because the non-method code heap size is fixed lazily, the non-method code heap is smaller than the non-method
code heap of the segmented code cache version. Although the amount of non-method code is the
same for both versions, the non-method code heap appears to be fuller here. As already stated in
Section 6.5 more profiled code is generated and the profiled code heap is therefore subject to frequent
resizing and sweeping, resulting in high fragmentation. The non-profiled code heap has a very low
external fragmentation.

[Figure: Code heap fragmentation (Octane Benchmark, 4-core). Non-method code heap (8.13%), profiled code heap (14.41%), non-profiled code heap (0.03%), all code heaps (25.42%).]
Figure 6.11: Fragmentation of the code heaps of the segmented code cache version. Profiled methods are marked in red, non-profiled methods and non-method code is marked in blue.

[Figure: Code heap fragmentation (Octane Benchmark, 4-core). Non-method code heap (29.04%), profiled code heap (40.04%), non-profiled code heap (0.03%), all code heaps (6.37%).]
Figure 6.12: Fragmentation of the code heaps of the dynamic code heap sizes version. Profiled methods are marked in red, non-profiled methods and non-method code is marked in blue.

Table 6.1 lists the average external fragmentation values for the code heaps of all three implementations while running 20 repetitions of the Octane benchmark with a code cache size of 256 MB on the 4-core machine. The values after the ± sign correspond to the 95% confidence interval.

Version                   Non-method      Profiled        Non-Profiled    All
Baseline version          -               -               -               4.9% ± 0.73
Segmented code cache      5.16% ± 0.96    19.46% ± 1.42   0.09% ± 0.07    24.94% ± 0.37
Dynamic code heap sizes   23.52% ± 2.56   55.5% ± 12.4    0.15% ± 0.04    15.89% ± 7.98

Table 6.1: Average external fragmentation.

The fragmentation values for the non-method and the non-profiled code heap
are most important because the code stored there has the longest lifetime and is hot, i.e., is used
permanently (see also Section 6.8). In contrast, the profiled code is only stored temporarily and will
be replaced by an optimized, non-profiled version. Table 6.1 shows that the fragmentation of the
non-method code heap with the segmented code cache is equal to the fragmentation of the baseline
version code cache. For the dynamic code heap sizes, the fragmentation is higher. This is probably
due to the lazy fixing of the non-method code heap size and needs further investigation. The
fragmentation of the non-profiled code heaps is greatly improved. With the segmented code cache
it is 98% better and with the dynamic code heap sizes around 97% compared to the fragmentation
of the baseline version code heap.
In theory, the segmentation of the code cache should improve the instruction TLB and instruction cache hit rates because code of the same type, which is likely to be accessed close in time, is now located at the same place. Additionally, a lower fragmentation leads to fewer unused blocks that may pollute the instruction cache. Section 6.7 evaluates the instruction TLB and instruction cache hit rate in detail.
6.7 ITLB and cache behaviour
The instructions stored in the code cache are executed by the processor at runtime. To speed up
the fetching of executable instructions from memory, the processor uses an instruction cache that
contains frequently used memory pages. To speed up the virtual-to-physical address translation, which is necessary because user processes access virtual memory addresses, an instruction translation lookaside buffer (ITLB) is used. The ITLB caches the operating system's page table that contains the corresponding physical address for each virtual address.
In general, an instruction cache read miss is most expensive because the thread has to stop execution
until the instruction is fetched from memory. This fetching from memory may cause an ITLB miss
which is costly as well. It is therefore important to optimize the code cache with respect to ITLB
and instruction cache behaviour. The segmented code cache should improve this behaviour because code locality is increased and fragmentation is reduced (at least for the non-profiled code, see Section 6.6).
To measure the ITLB and instruction cache behaviour of the implementations, hardware performance counters are used: special registers built into modern CPUs that measure activities such as the number of cache misses or the number of executed instructions at low overhead (an overview for the Intel architecture can be found in [19]). The hardware performance counters are enabled and accessed using the perf tool (see https://perf.wiki.kernel.org), available in the Linux kernel. The events measured by the hardware performance counters are CPU specific and can be found in the Intel Software Developer's Manual [19]. The following events are used to measure instruction cache and ITLB misses:
• ITLB_MISSES.MISS_CAUSES_A_WALK (Event 85H, Umask 01H): "Misses in ITLB that causes a page walk of any page." ([19], page 19-5)
• ICACHE.MISSES (Event 80H, Umask 02H): "Number of Instruction Cache, Streaming Buffer and Victim Cache Misses. Includes UC accesses." ([19], page 19-5)
Figure 6.13 shows the instruction TLB miss rate while executing the Octane benchmark on the
32-core system. Different code cache sizes are used and each configuration is executed 20 times
(142 minutes altogether).
[Figure: Instruction TLB behaviour (Octane Benchmark, 32-core). Instruction TLB load misses (in million) vs. code cache size (MB) for the baseline version, the dynamic code heap sizes and the segmented code cache.]
Figure 6.13: Instruction TLB misses while running the Octane benchmark on the 32-core machine.
As expected, the segmented code cache performs better than the baseline version by reducing the
number of ITLB misses by up to 19%, except for very small code cache sizes below 32 MB. With
the dynamic code heap sizes, the miss rate improves by up to 13% and is similar to the baseline
version for small code cache sizes. This is due to the resizing of the method code heaps and the increased amount of sweeping activity, which pollute the instruction cache (see below). This results in an increased amount of fetching from main memory, which in turn pollutes the ITLB. Figure A.3 in the
Appendix shows the same configuration executed on the 4-core machine, leading to similar results.
Figure 6.14 shows the instruction cache misses on the 32-core system.

[Figure: Instruction cache behaviour (Octane Benchmark, 32-core). Instruction cache misses (in million) vs. code cache size (MB) for the baseline version, the dynamic code heap sizes and the segmented code cache.]
Figure 6.14: Instruction cache misses while running the Octane benchmark on the 32-core machine.

With the segmented code cache the miss rate is up to 14% lower for larger code cache sizes and higher for small code cache
sizes, compared to the baseline version. With the dynamic code heap sizes the instruction cache
miss rate is up to 30% higher than with the baseline version. As already described above and
in Section 6.6, the higher instruction cache miss rate is due to the resizing of code heaps and the
increased sweeping activity. Executing the same configuration on the 4-core system provides similar
results (see Figure A.4 in the Appendix).
[Figure: Instruction TLB behaviour (SPECjbb2005 Benchmark, 32-core). Instruction TLB load misses (in billion) vs. code cache size (MB) for the baseline version, the dynamic code heap sizes and the segmented code cache.]
Figure 6.15: ITLB misses while running the SPECjbb2005 benchmark on the 32-core machine.
The improvement can also partly be explained by the fact that each compiler thread is either
assigned to the C1 or the C2 compiler and therefore either accesses profiled or non-profiled code.
This means that with the segmented code cache a compiler thread only accesses one code heap,
improving code locality. Since threads are likely to be executed on the same CPU, which has its own instruction cache and ITLB, the cache misses are reduced. This may also explain the slightly higher miss rates on the 4-core machine (see Figure A.4 in the Appendix) because more compiler threads are executed on the same CPU, mutually thrashing the caches.
To evaluate the behaviour of a long running program, the SPECjbb2005 benchmark is executed
three times (6.60 hours) on the 32-core system. Figure 6.15 shows the instruction TLB miss rate.
The miss rate with the segmented code cache is improved by up to 44% compared to the baseline
version. This is because the benchmark is long running and therefore a lot of methods are highly
optimized and stored in the non-profiled code heap. This increases code locality and therefore
lowers the ITLB miss rate. In contrast to the Octane benchmark, also the dynamic code heap sizes
perform better than the baseline version. As shown in Figure 6.9 of Section 6.5, the sweep time
for the SPECjbb2005 benchmark is similar to the baseline version. Therefore, the code locality is
not degraded and the instruction cache miss rate is comparable to the segmented code cache (see
Figure A.5 in the Appendix). The dynamic code heap sizes version performs up to 38% better than
the baseline version.
It is also noticeable that the ITLB and instruction cache miss rates do not increase with smaller code cache sizes. This is because the SPECjbb2005 benchmark generates only a small amount of code and is therefore only slightly dependent on the code cache size.
6.8 Hotness of methods
As described in Section 2.3.6, the sweeper decides to remove a method from the code cache based
on its hotness. The hotness value measures the utilization and is initially set to a high value that
is decremented every time the method is encountered by the sweeper in the code cache and reset
by stack scanning. Hot methods are scheduled for profiling and eventually optimized and stored in
the non-profiled code heap.
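The hotness bookkeeping can be summarized by the following hedged sketch; the names are invented and the coupling of the threshold to the code heap occupancy is simplified (see the NMethodSweeper class in HotSpot for the real logic).

    // Simplified model of the per-method hotness counter.
    struct MethodHotness {
      int value;  // reset to the maximum when the method is seen on a stack
    };

    // One sweeper pass over a method: decrement the counter and compare
    // it against a threshold that rises with the occupancy of the code
    // heap, so that fuller heaps flush methods more aggressively.
    bool is_flushing_candidate(MethodHotness& m, double reverse_free_ratio,
                               double threshold_base) {
      m.value--;
      double threshold = threshold_base * reverse_free_ratio;
      return (double)m.value < threshold;
    }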
To be able to measure the hotness, the patch used in Section 6.6 is extended to also print the
hotness value for each code heap block. A Python script analyses the log file and visualizes the
hotness distribution in the code cache by using different colors for each value.
Figure 6.16 shows the hotness distribution in the code cache after running the Octane benchmark with the baseline version on the 4-core system. Because the code cache size is set to 256 MB, the maximum hotness is by definition 256 MB * 2/MB = 512. The average hotness is 427.
The hottest methods accumulate at the bottom of the graph, i.e., at the top of the code cache, because those are the methods allocated last. In general, one notices a top-down trend from colder to hotter methods.
There are some hot methods in between that are either due to freed segments that were reused
by recently compiled methods or correspond to methods that are encountered on the stack by the
code cache sweeper.
[Figure: Hotness of code (Octane Benchmark, 4-core); color scale from 360 to 500.]
Figure 6.16: Hotness distribution in the code cache after running the Octane benchmark with the baseline version and a code cache size of 256 MB on the 4-core machine.

Figure 6.17 shows the hotness distribution of the code heaps after running the same configuration
with the segmented code cache version. The size of the graph is not related to the size of the
corresponding code heap. The code in the non-method code heap is not swept and therefore always
hot. The profiled code heap contains mostly colder methods, but some hot code that was recently
scheduled for profiling. The non-profiled code heap contains a large block of hot code, corresponding
to the hot methods that were recently optimized.
Figure 6.18 shows the same measurements for the dynamic code heap sizes version. The hotness
distribution in the non-profiled code heap is similar to the segmented code cache, but the profiled
code heap contains a much greater percentage of hot methods. This can be explained by the
increased amount of sweeping that is caused by the resizing of the profiled code heap. The profiled
code heap is always filled and therefore the sweeper removes methods that are then recompiled and
hot.
Table 6.2 lists the average hotness values after 20 runs of the Octane benchmark with a code cache
size of 256 MB. The values after the ± sign correspond to the 95% confidence interval. With the
segmented code cache, the average hotness in the profiled code heap is lower than in the non-profiled
code heap because hot methods are eventually optimized and stored in the non-profiled code heap.
This is not the case with the dynamic code heap sizes due to excessive sweeping of the profiled
code heap.
Version                   Profiled        Non-Profiled     All
Baseline version          -               -                427.18 ± 2.88
Segmented code cache      430.68 ± 3.51   469.71 ± 10.06   437.01 ± 3.08
Dynamic code heap sizes   448.8 ± 3.0     410.63 ± 3.41    452.69 ± 2.48
Table 6.2: Average hotness after 20 runs of the Octane benchmark.
[Figure: Hotness of code (Octane Benchmark, 4-core); color scale from 390 to 510; panels for the non-method, profiled, non-profiled and all code heaps.]
Figure 6.17: Hotness distribution in the code heaps after running the Octane benchmark with the
segmented code cache version and a code cache size of 256 MB on the 4-core machine.
[Figure: Hotness of code (Octane Benchmark, 4-core); color scale from 340 to 500; panels for the non-method, profiled, non-profiled and all code heaps.]
Figure 6.18: Hotness distribution in the code heaps after running the Octane benchmark with the
dynamic code heap sizes version and a code cache size of 256 MB on the 4-core machine.
7 Conclusion
Past activities in optimizing the performance of the HotSpotTM Java Virtual Machine focused on
the performance of the dynamic compilers and the supporting runtime. This thesis presents an
approach that optimizes the JVM at a low layer by redesigning the structure of the code cache. The
changes form the basis for further optimizations and make the JVM more extensible and adaptive.
Because the code cache is a core component of the JVM, being directly referenced from more than
150 locations in the source code, the implementation presented in this thesis is fairly complex. The
two patches consist of around 3600 lines of code and affect 44 files of the baseline version.
A detailed evaluation shows that the approach is promising and that the organization of the code
cache has a significant impact on the overall performance of the JVM. The execution time is
improved by up to 7% and the more efficient code cache sweeping reduces the time taken by the
sweeper by up to 46%. This, together with a decreased fragmentation of the non-profiled code heap
by around 98%, leads to a reduced instruction TLB (44%) and instruction cache (14%) miss rate.
It therefore seems worthwhile to include the changes into the product version of the HotSpotTM
Java Virtual Machine. As of February 2014, the segmented code cache patch is reviewed by Oracle
as a fix for the bug JDK-8015774: Add support for multiple code heaps [29] in the JDK Bug System.
It will most probably be included into one of the future releases.
7.1 Future work
As described in Section 6.5, there is too much sweeping activity with the dynamic code heap sizes.
Although the method code heaps are resized if full, the sweeper already starts removing more
methods if a code heap is getting full.
One solution is to resize the code heaps earlier. Instead of resizing a code heap if allocation fails
and there is no more uncommitted space available, the code heap is already resized in advance
when it starts getting full. This ensures that a code heap only fills up if there is not enough space
available in the adjacent code heap. One problem with this solution is the tendency of the boundary to oscillate if both code heaps resize alternately while getting full.
Another approach is to adapt the threshold computation in the sweeper, such that free space in the
adjacent code heap is taken into account as well, when deciding if the sweeper should be invoked.
The challenge of this solution is to find a compromise between sweeping too often and sweeping
not often enough.
In general, it is also possible to generate less compiled code by adapting the tiered compilation threshold policies. Currently, the free space in adjacent code heaps is considered available for both code heaps. This is not always the case because blocks of free space that are not located at the boundary cannot be made available by resizing. Hence, it may be good to adapt the policies to only partly consider this space as available.
Some of these solutions may be combined. A detailed evaluation is necessary to assess their performance. In the following, further optimizations of the code cache are proposed that greatly differ
in their complexity and size.
• Concurrent sweeping: Currently, only one compiler thread is used for sweeping. With the segmented code cache, the sweeping process can easily be parallelized by assigning one sweeper thread to each code heap. This potentially improves the sweep time and allows for further optimizations. For example, code with a limited lifetime, such as profiled code, can be swept more often.
• Selectively disabling compilation: Currently, compilation is fully disabled if one of the
code heaps is full and resizing fails. But most of the time there is still space available in the
other code heaps, for example, on the free list. Hence, it may pay off to continue compiling
code of the corresponding code types and only selectively turn off compilation for those code
heaps that are full.
• Fine grained locking: Instead of using a single lock for the entire code cache, it is possible
to use a lock per code heap, enabling multiple threads to access the code cache in parallel
and possibly improving performance of the dynamic compilers and the code cache sweeper.
• Separation of code and metadata: Currently, the compiled method code stored in the code cache contains not only executable code, but also metadata, for example, the header or relocation and debugging information. By separating this metadata from the actual code and storing it somewhere else, for example, in a separate code heap, the instruction cache and ITLB miss rates may be further reduced.
• Code heap partitioning: Currently, the code heaps are partitioned into segments of a fixed size (CodeCacheSegmentSize). To decrease fragmentation, the code heap could be split into regions with different segment sizes to account for methods of different sizes.
• Heterogeneous code: Future versions of the HotSpotTM JVM may have to manage
additional types of code. For example, the project Sumatra adds support for GPU code that
is handed to the GPU drivers and then converted to machine code that is executable on the
GPU (OpenCL code). Additional code heaps may be created to store code of such new types
(see Section 5.1.6).
A Appendix

A.1 Additional graphs
This section contains supplementary graphs referenced from the evaluation in Section 6.
[Figure: SPECjbb2005 (32-core). SPECjbb2005 bOps (in thousand) vs. code cache size (MB) for the dynamic code heap sizes, the segmented code cache and the baseline version.]
Figure A.1: SPECjbb2005 benchmark with different code cache sizes on the 32-core machine.
[Figure A.2: SPECjbb2013 benchmark with different code cache sizes on the 4-core machine. The plot shows max-jOPS (in thousand) over the code cache size (MB) for the baseline version, the segmented code cache and the dynamic code heap sizes.]
[Figure A.3: Instruction TLB misses while running the Octane benchmark on the 4-core machine. The plot shows instruction TLB load misses (in million) over the code cache size (MB) for the baseline version, the segmented code cache and the dynamic code heap sizes.]
[Figure A.4: Instruction cache misses while running the Octane benchmark on the 4-core machine. The plot shows instruction cache misses (in million) over the code cache size (MB) for the baseline version, the segmented code cache and the dynamic code heap sizes.]
[Figure A.5: Instruction cache misses while running the SPECjbb2005 benchmark on the 32-core machine. The plot shows instruction cache misses over the code cache size (MB) for the baseline version, the segmented code cache and the dynamic code heap sizes.]
A.2 Benchmarks
This section provides additional information about the benchmarks used to evaluate the implementations.
A.2.1 Google Octane JavaScript benchmark
The Google Octane benchmark is a JavaScript benchmark to measure the performance of large, real-world JavaScript applications. Version 2.0 was released on November 6, 2013 and consists
of 17 individual tests. These tests include a variety of sophisticated applications, for example, an
OS kernel simulation, equation and constraint solvers and physical simulations, and focus on different
aspects, such as code optimization, garbage collection or floating point operations. The time it takes
to complete the individual tests is measured and a score is computed that is inversely proportional
to this runtime; the higher the score, the better the performance.
Octane is executed in the JVM by using the Nashorn framework and generates a lot of compiled code. It
is therefore well suited to stress the code cache and to evaluate different optimizations of the code cache
structure. More detailed information about the Octane benchmark and the individual tests can be
found in [12].
A.2.2 SPECjbb2005
The SPECjbb2005 benchmark was developed by the Standard Performance Evaluation Corporation (SPEC) to evaluate the performance of server-side Java applications. Its current version 1.07
emulates a client/server system consisting of three tiers, including business logic and object manipulation, to simulate real-world applications. To measure scalability, the workload is gradually increased
and a detailed report is created, rating performance in business operations per second (bops).
SPECjbb2005 was replaced by SPECjbb2013 on October 1, 2013, but is still used for evaluation
purposes in this work because of its low variance. More information can be found in [48].
A.2.3 SPECjbb2013
The SPECjbb2013 benchmark was developed by the Standard Performance Evaluation Corporation
(SPEC) to measure the performance of the latest Java 7 application features and replaces the
SPECjbb2005 benchmark. It simulates a world-wide supermarket IT infrastructure including point-of-sale requests, online purchases and data-mining operations. It iteratively increases the workload
to account for server systems with many CPUs and measures performance using two metrics:
a pure throughput metric in the form of maximum Java operations per second (max-jOPS) and a
critical throughput metric under service-level agreements constraining the response times. More
information can be found in [49].
A.3 New JVM options
This section lists and describes the additional JVM command-line options that are introduced by
the segmented code cache and the dynamic code heap sizes. A complete list of existing JVM
options can be found in [32]. A JVM option can be set by specifying -XX:[name]=[value] on the
command line. For example, -XX:ReservedCodeCacheSize=512M sets the ReservedCodeCacheSize
option to 512 MB.
The following options are added:
• NonProfiledCodeHeapSize: Sets the size in bytes of the code heap containing non-profiled
methods. By default, it is set to 50% of the ReservedCodeCacheSize.
• ProfiledCodeHeapSize: Sets the size in bytes of the code heap containing profiled methods.
It is only applicable if tiered compilation is enabled. By default, it is set to 50% of the
ReservedCodeCacheSize.
• MoveCodeHeapBoundaries: Enables dynamic resizing of the method code heaps by adjusting
the boundaries between them.
• PrintMethodFlushingStatistics: A diagnostic JVM option that first has to be unlocked
by specifying -XX:+UnlockDiagnosticVMOptions. It prints statistics about the sweeper and
resolves bug JDK-8025277: Add -XX: flag to print code cache sweeper statistics [30]
in the JDK Bug System.
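For example, a JVM with a 256 MB code cache, evenly split method code heaps, dynamic boundaries and sweeper statistics could be started as follows (the flag values and the application name MyApplication are purely illustrative):

    java -XX:ReservedCodeCacheSize=256M \
         -XX:NonProfiledCodeHeapSize=128M \
         -XX:ProfiledCodeHeapSize=128M \
         -XX:+MoveCodeHeapBoundaries \
         -XX:+UnlockDiagnosticVMOptions \
         -XX:+PrintMethodFlushingStatistics \
         MyApplication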
A.4 Used tools/frameworks
A.4.1 Eclipse
Because the HotSpotTM JVM is mostly written in C++, the Eclipse IDE for C/C++ Developers
[10] is used for development.
A.4.2 Python
The plots presented in Section 6 are generated by a set of simple Python scripts that automate
benchmark execution and analysis. For graph generation, the Python 2D plotting library
Matplotlib [17] is used.
A.4.3 yEd Graph Editor
The high-quality diagrams used in this thesis are created with the yEd graph editor. It is a free tool
available from [54] and supports importing custom data, different types of diagrams (for example,
UML, flowchart and entity-relationship diagrams) and a variety of output formats.
Bibliography
[1] Programming Language Popularity. http://www.langpop.com, 2013.
[2] TIOBE Programming Community Index. http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html, 2013.
[3] P. Bajaj. HotSpot's Hidden Treasures - The HotSpotTM Serviceability Agent's powerful tools can debug live Java processes and core files. http://www.oraclejavamagazine-digital.com/javamagazine/20120708?pg=41#pg41, 2012. Oracle Java Magazine.
[4] C++ FAQ. Are "inline virtual" member functions ever actually "inlined"? http://www.parashift.com/c++-faq-lite/inline-virtuals.html.
[5] C. Häubl. Optimized Strings for the Java HotSpotTM VM. http://www.ssw.uni-linz.ac.at/Research/Papers/Haeubl08Master/Haeubl08Master.pdf, 2008.
[6] C. Wimmer. Linear Scan Register Allocation for the Java HotSpotTM Client Compiler. http://www.ssw.uni-linz.ac.at/Research/Papers/Wimmer04Master/Wimmer04Master.pdf, 2004.
[7] J. Dean, D. Grove, and C. Chambers. Optimization of object-oriented programs using static class hierarchy analysis. Pages 77–101. Springer-Verlag, 1995.
[8] G. Duboscq, L. Stadler, T. Würthinger, D. Simon, C. Wimmer, and H. Moessenboeck. Graal IR: An extensible declarative intermediate representation. In Proceedings of the Asia-Pacific Programming Languages and Compilers Workshop, 2013.
[9] Ecma International. ECMAScript Language Specification. http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf, 2011.
[10] The Eclipse Foundation. Eclipse IDE for C/C++ Developers. https://www.eclipse.org/downloads/packages/eclipse-ide-cc-developers/keplersr1.
[11] A. Gal, C. W. Probst, and M. Franz. Structural encoding of static single assignment form. Electron. Notes Theor. Comput. Sci., 141(2):85–102, Dec. 2005.
[12] Google. Octane Benchmark. https://developers.google.com/octane/, 2013.
[13] J. Gosling, B. Joy, G. Steele, and G. Bracha. The Java(TM) Language Specification, 3rd Edition. Addison-Wesley Professional, 2005.
[14] M. Haupt. Maxine: A JVM Written in Java. http://www.jugsaxony.org/wp-content/uploads/2012/05/Maxine-A_JVM_in_Java.pdf.
[15] P. Hohensee. The HotSpotTM Java Virtual Machine. http://www.cs.princeton.edu/picasso/mats/HotspotOverview.pdf.
[16] Y.-C. Huang, Y.-S. Chen, W. Yang, and J. J.-J. Shann. File-Based Sharing For Dynamically Compiled Code On Dalvik Virtual Machine. In 2010 International Computer Symposium (ICS), 2010.
[17] J. Hunter. matplotlib for Python. http://www.matplotlib.org.
[18] U. Hölzle, C. Chambers, and D. Ungar. Optimizing Dynamically-Typed Object-Oriented Languages With Polymorphic Inline Caches. In ECOOP '91: Proceedings of the European Conference on Object-Oriented Programming. Springer-Verlag, 1991.
[19] Intel. Intel(R) 64 and IA-32 Architectures Software Developer's Manual. http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf, September 2013.
[20] V. Ivanov. JIT-compiler in JVM seen by a Java developer. http://www.stanford.edu/class/cs343/resources/java-hotspot.pdf, 2013.
[21] Jikes RVM Project Organization. Jikes RVM. http://jikesrvm.org.
[22] Jikes RVM Project Organization. Jikes RVM: Adaptive Optimization System. http://jikesrvm.org/Adaptive+Optimization+System.
[23] Jikes RVM Project Organization. Jikes RVM: Class and Code Management. http://jikesrvm.org/Class+and+Code+Management.
[24] Oracle Labs. The Maxine Virtual Machine. https://wikis.oracle.com/display/MaxineVM/Home.
[25] M. Lagergren. Nashorn War Stories. http://www.oracle.com/technetwork/java/jvmls2013lager-2014150.pdf, 2013.
[26] J. Laskey. CON4082 - Nashorn: JavaScript on the JVM. http://www.youtube.com/watch?v=4nCrbwsSzBw, 2013. YouTube channel: Oracle Learning Library.
[27] T. Lindholm, F. Yellin, G. Bracha, and A. Buckley. The Java Virtual Machine Specification: Java SE 7 Edition. Prentice Hall PTR, 2013.
[28] L. Stadler. Serializable Coroutines for the HotSpotTM Java Virtual Machine. http://ssw.jku.at/Research/Papers/Stadler11Master/Stadler11Master.pdf, 2011.
[29] A. Noll. JDK Bug System, JDK-8015774: Add support for multiple code heaps. https://bugs.openjdk.java.net/browse/JDK-8015774, 2013.
[30] A. Noll. JDK Bug System, JDK-8025277: Add -XX: flag to print code cache sweeper statistics. https://bugs.openjdk.java.net/browse/JDK-8025277, 2013.
[31] Oracle. HotSpot Glossary of Terms. http://openjdk.java.net/groups/hotspot/docs/HotSpotGlossary.html.
[32] Oracle. Java HotSpotTM VM Options. http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html.
[33] Oracle. JSR 292: Supporting Dynamically Typed Languages on the Java Platform. https://www.jcp.org/en/jsr/detail?id=292.
[34] Oracle. The Da Vinci Machine Project. http://openjdk.java.net/projects/mlvm/.
[35] Oracle. The HotSpotTM Group. http://openjdk.java.net/groups/hotspot/.
[36] Oracle. Serviceability in HotSpotTM. http://openjdk.java.net/groups/hotspot/docs/Serviceability.html, 2007.
[37] Oracle. DTrace User Guide. http://docs.oracle.com/cd/E19253-01/819-5488/, 2010.
[38] Oracle. DTrace Probes in HotSpotTM VM. http://docs.oracle.com/javase/6/docs/technotes/guides/vm/dtrace.html, 2011.
[39] Oracle. Java Virtual Machine Support for Non-Java Languages. http://docs.oracle.com/javase/7/docs/technotes/guides/vm/multiple-language-support.html, 2013.
[40] Oracle. Learn about Java Technology. http://www.java.com/en/about/, 2013.
[41] Oracle. The Java HotSpotTM Performance Engine Architecture. http://www.oracle.com/technetwork/java/whitepaper-135217.html, 2013.
[42] Oracle. HotSpotTM Runtime Overview. http://openjdk.java.net/groups/hotspot/docs/RuntimeOverview.html, 2014.
[43] M. Paleczny, C. Vick, and C. Click. The Java HotSpotTM Server Compiler. https://www.usenix.org/legacy/events/jvm01/full_papers/paleczny/paleczny.pdf, 2001. Paper from JVM '01.
[44] T. Printezis and K. Russell. Experimental Tools for Serviceability. http://www.oracle.com/technetwork/java/javase/tech/3280-d-150044.pdf, 2002. Talk from JavaOne 2002.
[45] T. Rodriguez and K. Russell. Client Compiler for the Java HotSpotTM Virtual Machine: Technology and Application. http://www.oracle.com/technetwork/java/javase/tech/3198-d1-150056.pdf, 2002. Talk from JavaOne 2002.
[46] T. Rodriguez and K. Russell. Client Compiler for the Java HotSpotTM Virtual Machine: Technology and Application. http://www.slideshare.net/iwanowww/jitcompiler-in-jvm-by, 2002. Talk from JavaOne 2002.
[47] K. Russell and L. Bak. The HotSpotTM Serviceability Agent: An out-of-process high level debugger for a Java(tm) virtual machine. https://www.usenix.org/legacy/events/jvm01/full_papers/russell/russell_html/index.html, 2001. Paper from JVM '01.
[48] Standard Performance Evaluation Corporation. SPECjbb2005. http://www.spec.org/jbb2005/, 2005.
[49] Standard Performance Evaluation Corporation. SPECjbb2013. http://www.spec.org/jbb2013/, 2013.
[50] A. Szegedi. Project Nashorn in Java 8. http://www.parleys.com/play/51afc0e7e4b01033a7e4b6e9/chapter30/about, 2013.
[51] Wikipedia. External fragmentation. http://en.wikipedia.org/wiki/Fragmentation_(computing)#External_fragmentation.
[52] C. Wimmer, M. Haupt, M. L. V. de Vanter, M. J. Jordan, L. Daynès, and D. Simon. Maxine: An approachable virtual machine for, and in, Java. TACO, 9(4):30, 2013.
[53] C. Wimmer and T. Würthinger. Truffle: A self-optimizing runtime system. In Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity, SPLASH '12, pages 13–14, New York, NY, USA, 2012. ACM.
[54] yWorks. yEd Graph Editor. http://www.yworks.com.