Performance Analysis and Benchmarking of Python, a
Modern Scripting Language
Ruben Heynssens
Supervisor: Prof. dr. ir. Lieven Eeckhout
Counsellor: Dr. Jennifer Sartor
Master's dissertation submitted in order to obtain the academic degree of
Master of Science in de ingenieurswetenschappen: computerwetenschappen
Department of Electronics and Information Systems
Chairman: Prof. dr. ir. Jan Van Campenhout
Faculty of Engineering and Architecture
Academic year 2013-2014
Acknowledgements
I would like to thank Prof. L. Eeckhout and Dr. J. Sartor for their guidance and encouragement, Dr. W. Heirman for his advice and assistance with the Bluepower machine and with measuring hardware events, and the Ghent University Electronics and Information Systems department for the use of the Bluepower machine. I would also like to thank Prof. F. Mueller from North Carolina State University for sharing the code to apply software prefetching. Finally, I would like to thank my parents for all their support, faith and encouragement.
The author gives permission to make this master dissertation available for consultation
and to copy parts of this master dissertation for personal use.
In the case of any other use, the limitations of the copyright have to be respected, in
particular with regard to the obligation to state expressly the source when quoting results
from this master dissertation.
Ghent, June 2014
Ruben Heynssens
Performance Analysis and Benchmarking of
Python, a Modern Scripting Language
Ruben Heynssens
Supervisors: Prof. Lieven Eeckhout, Dr. Jennifer Sartor
Abstract—I investigated the differences between various Python runtime environments and performed a comparison with C. The main contributions are promoting better benchmarking for Python and applying existing benchmarking techniques to Python, because the most popular Python benchmarks run too briefly for a sound analysis. Furthermore, I give suggestions to improve the performance of existing Python runtime environments.
This is achieved by finding a representative benchmark suite and developing a sound benchmarking methodology, which I then applied to the most common and popular Python runtime environments. I also compared the runtime environments with C in order to show the difference between compilation and interpretation.
The results indicate that the default interpreter is very slow. However, it is sufficient when Python is used as ‘glue code’, and moreover a large number of libraries are available for it. Just-In-Time compilation improves performance substantially for CPU-intensive applications, but problems with the level-1 data cache cause a much smaller benefit over the default interpreter for I/O- and memory-intensive applications. Compiling Python code to C improves performance drastically if extra type information is supplied; otherwise there is no significant benefit compared to the default interpreter.
Index Terms—Python, performance, benchmarking
I. Introduction
Over the past decades, scripting languages have
become increasingly popular due to the increasing importance of graphical user interfaces and the
growth of the internet. They have been created for
very specific purposes, like ‘glueing’ components together or performing text processing. They do not
require the programmer to specify the type of variables, and thus allow for easy and rapid development.
Since scripting languages are commonly interpreted
for the particular machine they are running on, they
are portable.
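The dynamic typing mentioned above can be illustrated with a minimal sketch: the same function works for any type that supports the operations used, with no declared parameter types (the function name is illustrative, not from the thesis).

```python
# Dynamic typing in a scripting language: no parameter types are declared,
# so one function works for ints, strings and lists alike.
def double(x):
    return x + x

print(double(21))       # ints:    42
print(double("ab"))     # strings: 'abab'
print(double([1, 2]))   # lists:   [1, 2, 1, 2]
```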
Many different scripting languages are available nowadays, such as awk, JavaScript, PHP, Perl and Bash, and they are used in very different domains. In this thesis, however, I focus on Python. This language is very popular and, because of its ease of use, often used by people with little programming experience. Recently IPython, an interactive Python interpreter running in a web browser, has been released. This, together with the wide variety of scientific libraries available for Python, has drawn the interest of the academic world. For academic applications, performance becomes of the utmost importance. However, there has not been much research comparing the different options for improving the performance of Python. Moreover, a thorough performance analysis of the various runtime environments across a broad range of general applications has not been performed.
II. Existing Python Runtime Environments and Related Work
There are currently three common approaches to running Python programs:
• interpretation
• compilation to a lower-level language
• Just-In-Time compilation
I have explored the most common and most popular runtime environments for Python, which cover this range of interpretation and compilation techniques. CPython, the default runtime environment, applies simple interpretation. Cython has been used to evaluate the behaviour of compiling Python code, in this case to C; the C code is then compiled to an executable. Cython also offers the possibility to add type information, which enables extra optimisation. PyPy is the best-known runtime environment providing Just-In-Time compilation for Python; most other such projects have been discontinued in its favour [1]. A Just-In-Time compiler (JIT) compiles ‘hot’ code at runtime, after which the compiled code is used, which should execute faster. PyPy’s Just-In-Time compiler follows the principles of a tracing JIT [2], [3]. A benchmarking methodology for evaluating the behaviour of JIT compilers has already been developed for Java [4].
The main criticisms of Python are related to the Global Interpreter Lock (GIL). This lock prevents two threads from executing code simultaneously, even if multiple cores are available. People have attempted to remove the GIL, but those attempts have not been successful, and for now there does not seem to be a solution to this problem. Therefore a new module has been created, called multiprocessing, which successfully circumvents the GIL by spawning subprocesses. For now, this is considered the best approach to multi-threaded applications.
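The subprocess approach can be sketched as follows: a CPU-bound task is split across worker processes, each with its own interpreter and thus its own GIL, instead of threads that would serialise on one lock. The workload below (counting primes by trial division) is an illustrative example, not one of the thesis benchmarks.

```python
# A minimal sketch of circumventing the GIL with the multiprocessing module:
# CPU-bound work runs in subprocesses, each with its own interpreter.
from multiprocessing import Pool

def count_primes(bounds):
    """Count primes in [lo, hi) by trial division (deliberately CPU-bound)."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    # Split the range [0, 20000) into four chunks, one per subprocess.
    chunks = [(i * 5000, (i + 1) * 5000) for i in range(4)]
    with Pool(processes=4) as pool:
        total = sum(pool.map(count_primes, chunks))
    print(total)
```

With threads, the four `count_primes` calls would run one at a time under the GIL; with `Pool`, they run in parallel on separate cores.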
III. Benchmarking Suite and Methodology
To analyse the behaviour of the different runtime environments, a benchmark suite was required. The Grand Unified Python Benchmark Suite is meant to compare different Python implementations with each other. However, after running this suite I noticed that most of its benchmarks are very short. Since this is not sufficient for a sound comparison and analysis, I investigated other benchmark suites. The Computer Language Benchmarks Game suite proved to be the most reliable one, and it also allows easy comparison with C.
To get a clear view of the behaviour of the different runtime environments, hardware events such as the number of cycles, branches and level-1 instruction cache misses have been measured using perf and PAPI. The values were verified using raw events to guarantee their correctness. To analyse the JIT component of PyPy, both the stable behaviour, based on the DaCapo methodology [4], and the behaviour without the JIT have been measured. The stable behaviour tries to eliminate the JIT cost while still using the optimised code, which gives a better view of the efficiency and influence of the JIT.
IV. Results and Discussion
First a specific application, the pairwise distance calculation, is used to measure the benefit of adding type information in Cython. Adding type information resulted in an execution roughly 1000 times faster than regular Python code without type information, which means that Cython does not perform well without it. Automated type inference could make adding this information easier, but this has not been researched yet.
The time measurements showed that the C equivalents of the Python programs always ran faster than the Python code on any runtime environment. PyPy performs very well on the CPU-intensive benchmarks, but its performance on the I/O- and memory-intensive benchmarks is considerably worse. CPython and Cython perform similarly. Type information was not added in this experiment, both to ensure a fair comparison and because adding correct type information is not easy, so most users will not do it.
For C, the measured hardware events show that the type of benchmark influences the events. This behaviour is also observed for PyPy, though to a lesser extent; it is caused by the JIT, because the code is compiled to assembly. Similar behaviour would be expected for Cython, but this is not the case: without type information, its behaviour is much more similar to CPython’s. Because the type is not known, Cython wraps each value in an object, which leads to behaviour very similar to interpretation. Both Cython and CPython show consistent values for the events, on which the type of benchmark has very little influence.
The multi-threaded benchmarks show that the GIL is successfully circumvented by the multiprocessing module. I also benchmarked the behaviour without threading to compare the parallel speed-up of the different runtime environments. The results show that CPython obtains a very decent speed-up: the overhead of spawning multiple subprocesses is very small. The multi-threaded behaviour of PyPy is not as good as CPython’s; execution was only twice as fast with threading than without, while eight cores were available. The multiprocessing module spawns a new interpreter on each core, which for PyPy means that a new JIT is created as well. Since it is not possible to share information such as compiled code, and the Just-In-Time compiler has less code to work with on each core, a smaller improvement is obtained.
To get a better understanding of PyPy’s JIT, its behaviour over time has been analysed using a tool included in the PyPy source code. It shows that the JIT works mainly at the beginning of the execution, which is important because this yields the largest benefit [5]. The JIT has also proven to be very efficient, since for all but one benchmark it ran for less than one percent of the time. PyPy without the JIT shows very poor performance, which means the JIT is necessary to improve the performance and leads to a very decent speed-up.
However, PyPy still has some memory problems. PyPy’s interpreter, garbage collector and JIT all need to store data, which negatively influences the execution of the user application and becomes visible for the I/O- and memory-intensive benchmarks.
The hardware event measurements have shown that the JIT improves the level-1 instruction cache behaviour, but that more misses per instruction occur in the level-1 data cache. This might be improved by prefetching, so I repeated the measurements with hardware prefetching turned off. These results confirm that hardware prefetching reduces the number of level-1 data cache misses per instruction. A larger benefit could be obtained by applying software prefetching; however, I have not been able to finish this work.
V. Conclusions
The analysis leads to the conclusion that CPython, the default interpreter, is only useful when performance is of no concern, meaning it can be used for short applications or as ‘glue code’. Multi-threaded applications do get a decent performance boost. The libraries for most computationally intensive tasks are written in C, which means that even for heavy calculations CPython is an option, although not if the calculations are written by the user in Python. While PyPy can call C libraries, this is not advised, because the Just-In-Time compiler cannot improve the performance of non-Python code. CPython is therefore more attractive for code bases that need these libraries.
When performance becomes important and the algorithms cannot be improved any further, PyPy is a viable option, but only for CPU-intensive tasks. The memory problems hinder PyPy too much for I/O- and memory-intensive applications, and concurrent applications will not gain a large benefit on PyPy either.
If performance is of the utmost importance, Cython is the best option, although this approach is not for novice users. The advantage of Cython is that it allows the entire development, including testing, to be done in Python. When the application is finished, type information can be added incrementally where the most benefit will be obtained. This results in a faster development cycle than using C and improves performance enormously compared with CPython. The overhead of combining C with Python is very small compared to the total execution time, so the overhead of Cython is minimal as well. Note that this approach does not require writing C code. Novice users are advised to use only ‘simple’ data structures and statements, which makes it much easier to add type information.
If even the Cython approach is not good enough, it is possible to combine Python with C, C++ or Fortran; many libraries are available that make this easy. However, this adds a cost to development and should therefore be avoided when possible.
In summary, I analysed the most popular runtime environments and their performance for one of the most prevalent scripting languages, Python. I developed methodology techniques for this exploration and have offered suggestions to users regarding the situations for which each runtime environment is most advantageous.
Acknowledgements
I would like to thank Prof. L. Eeckhout and Dr. J. Sartor for their guidance and encouragement. This work was carried out using the Bluepower machine at the Ghent University Electronics and Information Systems department. I would also like to thank Dr. W. Heirman for his advice and assistance with the Bluepower machine and with measuring hardware events, and Prof. F. Mueller for sharing the code to apply software prefetching.
References
[1] A. Rigo, “Representation-based just-in-time specialization and the Psyco prototype for Python,” in Proceedings of the 2004 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-based Program Manipulation (PEPM ’04). New York, NY, USA: ACM, 2004, pp. 15–26. [Online]. Available: http://doi.acm.org/10.1145/1014007.1014010
[2] C. F. Bolz, A. Cuni, M. Fijałkowski, and A. Rigo, “Tracing the meta-level: PyPy’s tracing JIT compiler,” in Proceedings of the 4th Workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems (ICOOOLPS ’09). New York, NY, USA: ACM, 2009, pp. 18–25. [Online]. Available: http://doi.acm.org/10.1145/1565824.1565827
[3] C. F. Bolz, A. Cuni, M. Fijałkowski, M. Leuschel, S. Pedroni, and A. Rigo, “Allocation removal by partial evaluation in a tracing JIT,” in Proceedings of the 20th ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation (PEPM ’11). New York, NY, USA: ACM, 2011, pp. 43–52. [Online]. Available: http://doi.acm.org/10.1145/1929501.1929508
[4] S. M. Blackburn, R. Garner, C. Hoffman, A. M. Khan, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von Dincklage, and B. Wiedermann, “The DaCapo benchmarks: Java benchmarking development and analysis,” in Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA ’06). New York, NY, USA: ACM Press, Oct. 2006, pp. 169–190.
[5] S.-W. Lee and S.-M. Moon, “Selective just-in-time compilation for client-side mobile JavaScript engine,” in Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES ’11). New York, NY, USA: ACM, 2011, pp. 5–14. [Online]. Available: http://doi.acm.org/10.1145/2038698.2038703
Prestatieanalyse en benchmarking van Python,
een moderne scripting taal
1
Ruben Heynssens
Begeleiders: Prof. Lieven Eeckhout, Dr. Jennifer Sartor
Samenvatting—Ik onderzocht het verschil tussen diverse Python runtime omgevingen en voerde
een vergelijking uit met C. De belangrijkste bijdragen zijn het aanmoedigen van betere benchmarking voor Python en het toepassen van bestaande technieken om Python te benchmarken,
omdat de populairste Python benchmarks een te
korte uitvoeringstijd hebben voor een degelijke
analyse. Verder geef ik ook suggesties om de prestatie van bestaande Python runtime omgevingen
te verbeteren.
Dit wordt bereikt door een representatieve benchmark suite te zoeken en een degelijke benchmarking methodologie te ontwikkelen. Vervolgens pas
ik deze methodologie toe op de meest gebruikte
en populaire Python runtime omgevingen. Verder
vergelijk ik de runtime omgevingen met C om het
verschil tussen compilatie en interpretatie aan te
tonen.
De resultaten geven aan dat de standaard interpreter zeer traag is. Deze voldoet echter wel wanneer Python gebruikt wordt als ‘glue code’ en bovendien zijn er zeer veel bibliotheken beschikbaar.
Het toepassen van Just-In-Time compilatie op Python verbetert de prestatie zeer goed voor CPUintensieve toepassingen. Problemen met de level-1
data cache veroorzaken een veel kleiner voordeel
voor I/O- en geheugen-intensieve toepassingen in
vergelijking met de standaard interpreter. Het
compileren van Python code naar C verbetert de
prestatie enorm indien er extra type informatie
wordt gegeven, anders wordt er geen significante
winst bekomen tegenover de standaard interpreter.
Index Terms—Python, prestatie, benchmarking
I. Introductie
Scripting talen zijn gedurende de laatste decennia
zeer populair geworden dankzij het toenemende belang van grafische gebruikersinterfaces en de groei
van het internet. Ze zijn ontwikkeld voor zeer specifieke doeleinden, zoals het aaneenlijmen van complexe componenten of tekstverwerking. Ze verlangen
niet van de programmeur dat de types van de variabelen gedeclareerd worden, waardoor ze een snel-
lere en eenvoudigere ontwikkeling mogelijk maken.
Aangezien scripting talen normaliter geïnterpreteerd
worden voor de machine waarop ze uitgevoerd worden, is het gemakkelijk om de toepassingen op andere
hardware uit te voeren.
Er zijn een groot aantal scripting talen beschikbaar, zoals awk, JavaScript, PHP, Perl, Bash, enz.
Deze worden gebruikt in verscheidene domeinen.
Voor deze thesis heb ik echter besloten om de nadruk
te leggen op Python. Deze taal is tegenwoordig zeer
populair en wordt ook gebruikt door mensen die minder ervaring hebben met programmeren dankzij het
gebruiksgemak. Recent is IPython, een interactieve
Python interpreter die werkt in een web browser,
gelanceerd. Deze tool, samen met de vele wetenschappelijke modules die reeds beschikbaar zijn in Python,
hebben gezorgd dat er interesse is gekomen voor
Python vanuit de wetenschappelijke wereld. Voor
wetenschappelijke toepassingen is de prestatie van
Python een zeer belangrijke factor. Er is nog niet veel
onderzoek gedaan die de verschillende mogelijkheden
vergelijkt om de prestatie van Python te verbeteren.
Bovendien zijn de meeste Python runtime omgevingen nog niet geanalyseerd over een uitgebreid scala
van algemene toepassingen.
II. Bestaande Python runtime omgevingen
en gerelateerd werk
Momenteel zijn er drie veelgebruikte methoden om
Python toepassingen uit te voeren:
• interpretatie
• compilatie naar een andere taal
• Just-In-Time compilatie
Ik heb de meest gebruikte en populaire runtime
omgevingen voor Python onderzocht, die deze methoden bevatten. CPython, de standaard interpreter,
past enkel interpretatie toe. Cython is gebruikt om
het gedrag te evalueren voor de tweede methode,
waarbij de Python code naar C gecompileerd wordt.
Vervolgens wordt de C code gecompileerd naar een
uitvoerbaar bestand. Cython biedt ook de mogelijkheid aan om type informatie toe te voegen. Deze
informatie zal zorgen dat er nog verder geoptimaliseerd zal worden. Het is echter niet gemakkelijk
om correcte informatie te geven, zeker niet wanneer
de types ingewikkeld worden. PyPy is het bekendste
project dat Just-In-Time compilatie toepast. Andere
projecten zijn gestopt vanwege dit project [1]. Een
Just-In-Time compiler (JIT) zal tijdens de uitvoering
vaak uitgevoerde code compileren. Daarna is het
mogelijk om de gecompileerde versie te gebruiken,
welke sneller zou moeten uitvoeren. PyPy’s Just-InTime compiler werkt volgens de principes van een
tracing JIT [2], [3]. Er is reeds een benchmarking
methodologie ontwikkeld voor Java om het gedrag
van JIT compilers te evalueren [4].
De belangrijkste ergernissen die mensen hebben
met Python zijn gerelateerd aan de Global Interpreter Lock (GIL). Dit lock voorkomt dat twee threads
gelijktijdig code kunnen uitvoeren, zelfs wanneer
meerdere cores beschikbaar zijn. Er zijn reeds pogingen ondernomen om de GIL te verwijderen, maar
men is er tot nu toe nog niet geslaagd. Voorlopig
blijkt er ook geen oplossing te zijn voor dit probleem.
Daarom is een nieuwe module gecreëerd, genaamd
multiprocessing, die de GIL kan omzeilen door
subprocessen aan te maken. Voorlopig wordt dit
beschouwd als de beste aanpak voor multi-threaded
toepassingen.
gebaseerd op DaCapo [4], en het gedrag zonder JIT
gemeten. Het stabiele gedrag probeert de JIT te elimineren, terwijl de geoptimaliseerde code nog steeds
gebruikt wordt. Dit geeft een beter overzicht van de
efficiëntie en de invloed van de JIT.
IV. Resultaten en discussie
Eerst wordt een specifieke toepassing, de pairwise
distance calculation, gebruikt om het voordeel van
de type informatie van Cython te analyseren. Het
toevoegen van type informatie leidt tot een 1000
keer snellere uitvoering in vergelijking met de gewone
Python code die geen type informatie heeft, wat
betekent dat Cython niet goed presteert zonder type
informatie. Geautomatiseerde type guessing zou dit
gemakkelijker kunnen maken. Hier is echter nog geen
onderzoek over verricht.
De tijdsmetingen tonen aan dat C nog steeds de
snelste runtime omgeving is. PyPy presteert ook zeer
goed voor CPU-intensieve toepassingen. De verbetering voor I/O- en geheugen-intensieve benchmarks
zijn een stuk minder. CPython en Cython presteren
beide gelijkaardig. Dit gedrag wordt verklaard door
de type informatie. Voor de benchmarks was geen
type informatie toegevoegd, om een eerlijke vergelijking te maken. Het is niet gemakkelijk deze informatie toe te voegen, waardoor de meeste gebruikers dit
niet zullen doen.
De gemeten hardware events tonen aan dat het
type benchmark de events beïnvloedt. Dit gedrag
wordt ook opgemerkt voor PyPy, hoewel in mindere
mate. Dit wordt natuurlijk veroorzaakt door de JIT,
die de code compileert naar assembly. Een gelijkaardig gedrag wordt verwacht voor Cython, maar
dit is niet het geval. Zonder de type informatie is
het gedrag gelijkaardig aan CPython. Iedere waarde
wordt gewrapped in een object door Cython, omdat
het type niet gekend is. Dit leidt tot een gedrag
zeer gelijkaardig aan interpretatie. Zowel Cython als
CPython hebben consistente waarden voor de events.
Het type van de benchmark heeft er bijna geen
invloed op.
De multi-threaded benchmarks tonen aan dat
de GIL met succes omzeild wordt door de
multiprocessing module. Ik heb ook het gedrag
zonder threading gemeten om de versnelling te vergelijken. De resultaten tonen dat CPython zeer goed
versnelt dankzij het gebruik van meerdere threads.
De kost van het opstarten van meerdere interpreters is zeer klein. PyPy versnelt echter niet zo veel
als CPython voor multi-threaded toepassingen. De
III. Benchmarking suite en methodologie
Om het gedrag van de verschillende runtime omgevingen te analyseren is er een benchmark suite
nodig. The Grand Unified Python Benchmark Suite is
bedoeld om verschillende Python implementaties met
elkaar te vergelijken. Na het uitvoeren van deze suite
heb ik echter opgemerkt dat de meeste benchmarks
zeer kort zijn. Aangezien dit niet voldoende is om
een degelijke vergelijking en analyse te doen, heb ik
andere suites onderzocht. The Computer Language
Benchmarks Game suite heeft bewezen de best betrouwbare te zijn en laat ook toe om Python met C
te vergelijken.
Om een duidelijk beeld te krijgen van het gedrag
van de verschillende runtime omgevingen worden
hardware events gemeten zoals het aantal cycles,
branches, instructie cache misses in de eerste niveau
cache, enz. Deze worden gemeten met behulp van perf
en PAPI. De waarden zijn gecontroleerd met ruwe
events om de correctheid te garanderen. Om PyPy’s
JIT te analyseren worden ook het stabiele gedrag,
2
uitvoering was slechts twee keer sneller met threading, terwijl er acht cores beschikbaar waren. Zoals
vermeld, maakt de multiprocessing module een
nieuwe interpreter aan op iedere core. Voor PyPy
betekent dit dat er ook nieuwe JIT aangemaakt
wordt. Aangezien het niet mogelijk is informatie uit
te wisselen, zoals gecompileerde code, en de compiler
moet werken met minder code op iedere core, wordt
een kleinere winst bekomen.
Om een beter zicht te krijgen op PyPy’s JIT
werd het tijdsgedrag geanalyseerd. Dit gedrag wordt
voorgesteld met behulp van een tool dat in de PyPy
broncode zit. Het toont aan dat de JIT vooral in het
begin werkt, wat belangrijk is omdat dit resulteert
in de grootste winst [5]. De JIT heeft ook bewezen
zeer effectief te zijn, aangezien het, behalve bij één
benchmark, voor minder dan één percent actief was.
De resultaten van het stabiele gedrag hebben dit
beaamd. Het gedrag van PyPy zonder JIT toont
een zeer zwakke prestatie. Dit betekent dat de JIT
noodzakelijk is om de prestatie te verbeteren en leidt
ook tot een zeer degelijke prestatie winst.
Er zijn echter ook nog enkele geheugenproblemen
met PyPy. PyPy’s interpreter, garbage collector en
JIT moeten data opslaan. Dit beïnvloedt de uitvoering van een gebruikerstoepassing op een negatieve
manier en wordt zichtbaar voor de I/O- en geheugenintensieve benchmarks.
De hardware event metingen tonen dat de JIT het
level-1 instructie cache gedrag verbetert. Er treden
echter meer misses per instructie op in de level-1
data cache. Dit zou verbeterd kunnen worden met
behulp van prefetching. Daarom heb ik de metingen heruitgevoerd zonder hardware prefetching. Deze
resultaten bevestigen dat hardware prefetching het
aantal level-1 misses per instructie in de data cache
vermindert. Een grotere winst zou bekomen kunnen
worden door software prefetching toe te passen. Onderzoek omtrent dit onderwerp heb ik niet kunnen
afwerken.
een optie is, maar enkel wanneer de berekeningen niet
door de gebruiker in Python geschreven worden. Hoewel PyPy ook C bibliotheken kan oproepen, wordt er
aangeraden dit niet te doen, omdat de JIT compiler
de prestatie niet kan verbeteren van andere talen.
Daarom is CPython interessanter voor toepassingen
die deze bibliotheken nodig hebben.
Indien prestatie belangrijk wordt en de algoritmes
niet meer verbeterd kunnen worden, is PyPy een
goede keuze, maar enkel wanneer het om een CPUintensieve taak gaat. De geheugen problemen hinderen PyPy te veel bij I/O- en geheugen-intensieve
toepassingen. Ook multi-threaded toepassingen zullen geen grote winst bekomen bij PyPy.
Wanneer prestatie echt een kritieke factor is voor
de uitvoering van de applicatie biedt Cython de beste
optie. Deze aanpak is echter niet voor onervaren
gebruikers. Het voordeel van Cython is dat het mogelijk is om de volledige ontwikkeling in Python te
doen, inclusief het testen. Wanneer de toepassing
klaar is, kan type informatie gradueel toegevoegd
worden waar de meeste winst bekomen zal worden.
Dit zal resulteren in een snellere ontwikkelingscyclus
dan wanneer C gebruikt zou worden en de prestatie
zal enorm verbeteren in vergelijking met CPython.
De overhead van het combineren van C met Python
is zeer klein vergeleken met de totale uitvoeringstijd.
Merk op dat het bij deze aanpak niet nodig is om C
code te schrijven, waardoor de overhead van Cython
dus ook zeer klein is. Merk op dat het bij deze aanpak
niet nodig is om C code te schrijven. Voor onervaren
gebruikers wordt het aangeraden om enkel ‘eenvoudige’ data structuren en constructies te gebruiken.
Daardoor zal het een stuk gemakkelijk zijn om type
informatie toe te voegen.
Indien zelfs de Cython aanpak niet goed genoeg
is, is het mogelijk om Python te combineren met
C, C++ of Fortran. Er zijn vele bibliotheken
beschikbaar die dit gemakkelijk maken. Dit zal
echter wel de ontwikkelingskost verhogen en daarom
is het best om dit te vermijden.
V. Conclusies
De analyse leidt tot de conclusie dat CPython,
de standaard interpreter, enkel nuttig is wanneer
de prestatie er niet toe doet. Dit betekent dat
het gebruikt kan worden voor applicaties die een
korte uitvoeringstijd hebben of als ‘glue code’. Multithreaded toepassing zullen een degelijke prestatie
winst krijgen. De meeste bibliotheken die computationeel intensieve taken uitvoeren zijn geschreven in
C, waardoor zelfs voor zware berekeningen CPython
Ik heb de bekendste runtime omgevingen en hun
prestatie geanalyseerd voor één van de belangrijkste
scripting talen, namelijk Python. Ik heb methodologische technieken ontwikkeld voor dit onderzoek en
gebruikers suggesties aangeboden met betrekking tot
welke runtime omgeving de meest geschikte is voor
iedere situatie.
3
Dankwoord
Ik zou graag Prof. L. Eeckhout en Dr. J. Sartor
bedanken voor hun begeleiding en aanmoediging. Dit
werk is uitgevoerd, gebruik makend van de Bluepower machine van de vakgroep van Elektronica en
Informatiesystemen van de universiteit van Gent. Ik
zou ook nog graag Dr. W. Heirman bedanken voor
zijn advies en begeleiding met de Bluepower machine
en het meten van hardware events en Prof. F. Mueller
voor het delen van de code om software prefetching
toe te passen.
Referenties
[1] A. Rigo, “Representation-based just-in-time specialization and the Psyco prototype for Python,” in Proceedings of the 2004 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-based Program Manipulation, ser. PEPM ’04. New York, NY, USA: ACM, 2004, pp. 15–26. [Online]. Available: http://doi.acm.org/10.1145/1014007.1014010
[2] C. F. Bolz, A. Cuni, M. Fijałkowski, and A. Rigo, “Tracing the meta-level: PyPy’s tracing JIT compiler,” in Proceedings of the 4th Workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems, ser. ICOOOLPS ’09. New York, NY, USA: ACM, 2009, pp. 18–25. [Online]. Available: http://doi.acm.org/10.1145/1565824.1565827
[3] C. F. Bolz, A. Cuni, M. Fijałkowski, M. Leuschel, S. Pedroni, and A. Rigo, “Allocation removal by partial evaluation in a tracing JIT,” in Proceedings of the 20th ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation, ser. PEPM ’11. New York, NY, USA: ACM, 2011, pp. 43–52. [Online]. Available: http://doi.acm.org/10.1145/1929501.1929508
[4] S. M. Blackburn, R. Garner, C. Hoffman, A. M. Khan, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von Dincklage, and B. Wiedermann, “The DaCapo benchmarks: Java benchmarking development and analysis,” in OOPSLA ’06: Proceedings of the 21st annual ACM SIGPLAN conference on Object-Oriented Programming, Systems, Languages, and Applications. New York, NY, USA: ACM Press, Oct. 2006, pp. 169–190.
[5] S.-W. Lee and S.-M. Moon, “Selective just-in-time compilation for client-side mobile JavaScript engine,” in Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems, ser. CASES ’11. New York, NY, USA: ACM, 2011, pp. 5–14. [Online]. Available: http://doi.acm.org/10.1145/2038698.2038703
Contents

1 Introduction
  1.1 Scripting Languages
  1.2 Python
  1.3 Why Python?

2 Related Work
  2.1 Scripting Languages
  2.2 Python

3 Runtime Environments
  3.1 CPython
    3.1.1 Architecture
    3.1.2 Optimisations
    3.1.3 Multi-threaded Applications
  3.2 PyPy
    3.2.1 Architecture
    3.2.2 Multi-threaded Applications
  3.3 Cython
  3.4 Other Runtime Environments
  3.5 Conclusion

4 Benchmarking
  4.1 Benchmarking Suites
    4.1.1 The Grand Unified Python Benchmark Suite
    4.1.2 The Computer Language Benchmarks Game
  4.2 Benchmarking Methodologies
  4.3 Setup
  4.4 Hardware Events
    4.4.1 Perf
    4.4.2 PAPI
  4.5 Conclusion

5 Analysis Runtime Environments
  5.1 Preliminary Comparison
    5.1.1 Cython Type Information
    5.1.2 Type Guessing
  5.2 PyPy Beats C
  5.3 Time Measurements
  5.4 Hardware Events
    5.4.1 Cycles Per Instruction
    5.4.2 Branch Behaviour
    5.4.3 Level-1 Instruction Cache Behaviour
    5.4.4 Level-1 Data Cache Behaviour
    5.4.5 Last Level Cache Load Behaviour
    5.4.6 Last Level Cache Store Behaviour
    5.4.7 Translation Lookaside Buffers
  5.5 Multi-threaded Applications
    5.5.1 C
    5.5.2 Cython
    5.5.3 CPython
    5.5.4 PyPy
  5.6 PYC
  5.7 Conclusion

6 Analysis PyPy
  6.1 Translatorshell
  6.2 Hooks
  6.3 JIT Viewer
  6.4 Behaviour Over Time
    6.4.1 A Lot Of Garbage Collection
    6.4.2 Behaviour Of The Just-In-Time Compilation
    6.4.3 Influence Of The Nursery Size
  6.5 Influence Of The Just-In-Time Compiler
    6.5.1 Time Measurements
    6.5.2 Hardware Events
  6.6 Adjusting The Heap Size
  6.7 Prefetching
    6.7.1 Hardware Prefetching
    6.7.2 Software Prefetching
  6.8 Conclusion

7 Final Conclusions
  7.1 CPython
  7.2 Cython
  7.3 PyPy
  7.4 General

Appendices
A PyPy Options
B Removing Threading From binary-trees and k-nucleotide
  B.1 k-nucleotide
  B.2 binary-trees

Bibliography
List of Figures
Chapter 1
Introduction
Most scripting languages are run using an interpreter, which takes in the source
code of the program, referred to as a ‘script’, and executes the code on-the-fly. The use
of an interpreter means that the source code is translated and executed at runtime, without
an optimising compilation step, which means the code does not need to be recompiled
to run on different machines. Of course the interpreter must be installed
on each machine on which the code has to be executed. The use of an interpreter also
results in an added flexibility of the scripting language itself. Since everything is executed
at runtime, type information can be deduced while running the script. Most scripting
languages are therefore dynamically typed, instead of the static typing used by most
system programming languages, such as C, C++, etc. A lot of scripting languages will also
provide more complex constructs, like list comprehensions. This increases the productivity
of the programmer, but since those constructs have a very specific goal, it also means that
a lot of scripting languages are used for a very specific purpose. Think for example
of awk to perform text processing.
At the moment, there are a lot of scripting languages available, like JavaScript, Perl,
Lua, Bash, awk, etc. They are used in very different domains. However, for my thesis, I
decided to focus on Python. The reason for this choice is explained in Section 1.3.
1.1 Scripting Languages
Scripting languages are designed for different tasks than system programming
languages, and this leads to fundamental differences in the languages. System programming languages were designed for building data structures and
algorithms from scratch, starting from the most primitive computer elements
such as words of memory. In contrast, scripting languages are designed for
gluing: They assume the existence of a set of powerful components and are
intended primarily for connecting components. — John K. Ousterhout [22]
Scripting languages are becoming increasingly popular and important due to the rise of graphical user interfaces and the growth of the internet. They
have become possible because of hardware improvements. The main benefit they offer
is ease of use and high productivity. However they are still mainly used as ‘glue code’,
which means that they ‘glue’ already existing components together, which would be more
difficult in a system programming language (think for example about piping in the Unix
shell). Since they also provide a higher productivity, they are often used to hack something together quickly to produce a prototype. Another advantage is that most scripting
languages provide a command-line interpreter, which allows interactive programming by
requesting commands and executing each command as soon as it is received. This means
you can get feedback while writing code, which is currently pushed to the limits in a new
approach called live coding.
1.2 Python
At the moment of writing this document, Python occupies the eighth place on the TIOBE
Programming Community index.¹ The only scripting language ranked higher is PHP. The
first major version of CPython, the default Python interpreter, was released in January
1994 by Guido van Rossum. It has now reached its third version. In other words, Python
is becoming a stable language, with a high number of users.
The Python syntax is easily readable by people not experienced in programming,
because words are used, like and and or, instead of the respective constructs && and ||
used in most computer languages. Python is also a dynamically typed language, which
means it is not necessary to give type information, nor are variables restricted to a single
type. The programmer does not need to concern himself with overflows: these are caught
at runtime and the value is transparently promoted to a representation that can contain it. Something unique
to this language is the fact that indentation is mandatory and needs to be correct to run.
This forces users to write code that is easy to read. Finally the syntax is meant to be
concise, which allows fast development.
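These dynamic features are easy to demonstrate in a few lines; the following snippet is an illustrative sketch, not taken from the thesis benchmarks:

```python
# Integer "overflow" in Python: values are transparently promoted to
# arbitrary-precision integers, so no wrap-around occurs.
x = 2 ** 62
x = x * 4
assert x == 2 ** 64          # far beyond a 64-bit machine word

# Dynamic typing: the same name may be rebound to a value of another type.
x = "now a string"
assert isinstance(x, str)
```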
Another reason for Python’s popularity is the huge amount of modules, or libraries,
available. This means that it is possible to reuse code, which makes it easier and quicker
to program. It is also possible to call C, C++ and Fortran libraries, from within Python.
Furthermore it is easy to combine code written in C with Python, because the task of
compiling the C code to a library can be automated by modules like cffi.
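As a rough illustration of this glue role, the standard library's ctypes module (an alternative to cffi) can call a C function directly; this sketch assumes a POSIX system where the C library's symbols are visible in the running interpreter process:

```python
import ctypes

# On POSIX, loading None gives access to the symbols already linked into
# the interpreter process, which includes the C standard library.
libc = ctypes.CDLL(None)

# Declare the C signature of abs() so ctypes converts arguments correctly.
libc.abs.restype = ctypes.c_int
libc.abs.argtypes = [ctypes.c_int]

assert libc.abs(-5) == 5     # a C function, called from Python
```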
1.3 Why Python?
I already mentioned in Section 1.2 that Python is one of the most popular scripting
languages. Recently, however, Python has also been adopted for academic purposes. This has been
caused by the release of IPython, an interactive Python interpreter running in a web
browser. It is even possible to share the IPython notebooks by putting them online [1].
There are also a lot of libraries available for scientific computing in a lot of different fields.
A few commonly used ones are listed in Table 1.1. Most of these libraries work very well with
the IPython notebook. For example, the images generated with matplotlib can be inlined
in the notebook.
Now that Python is being used for scientific purposes, the performance becomes of
utmost importance. Not much research has been published which compares the different
options to improve the performance of Python. Moreover, most Python runtime environments have not been analysed across a broad range of general applications. Therefore I
1
The TIOBE Programming Community index, located at http://www.tiobe.com/, is an indicator of
the popularity of computer languages. It is based on search results of popular search engines like Google,
Bing, Yahoo!, Wikipedia, etc. It does not rank computer languages according to the number of lines
written in them, or how ‘good’ they are.[26]
CHAPTER 1. INTRODUCTION
3
Table 1.1: Commonly used libraries for scientific computing
Library
NumPy
SciPy
matplotlib
pandas
SymPy
scikit
StatsModels
Description
fundamental package for numerical computation
collection of numerical algorithms and domain-specific toolboxes
2D plotting library which produces publication quality figures in a variety of hardcopy formats
providing high-performance, easy to use data structures
for symbolic mathematics and computer algebra
tools for data mining and data analysis
tools for statistical computing and data analysis
analyse Python’s performance over various runtime environments, compare the differences
between these runtime environments and make recommendations about how to achieve
better performance.
In order to accomplish all this, I started researching Python and scripting languages
in general. In Chapter 2, I describe the most interesting findings about previous attempts
to improve scripting languages and Python. This research led me to the most common
approaches to executing Python. These are discussed in Chapter 3. A benchmark suite
and methodology is necessary to analyse those different runtime environments. This is
described in Chapter 4. In Chapter 5 the different runtime environments are compared
to each other and some specific characteristics of the runtime environments are analysed.
One runtime environment, PyPy, has been analysed further in Chapter 6. Finally a general
conclusion is drawn, based on the analysis, in Chapter 7.
Chapter 2
Related Work
This chapter has been divided into two parts. First the research focusses on scripting
languages in general. The goal is to learn about improving performance, in the
hope that similar approaches are possible in Python. Then the research focusses
specifically on Python itself with the intention of discovering what improvements already
have been attempted before.
2.1 Scripting Languages
JavaScript is currently very popular in the academic world and a lot of optimisations
have been suggested to improve the speed. One idea is to use parallel execution, in
order to take advantage of multiple cores. However JavaScript is entirely sequential and
most programmers would not like to use Web workers, which allow parallel execution.
Two approaches to improve the performance using parallelism have been suggested [20].
The first approach exploits loop-level parallelism, by assigning each iteration of a loop
to a separate thread. However this leads to difficult data dependencies, which means a
rollback mechanism is required. The second approach uses method-level speculation, by
using a different thread for each function call. It is necessary to predict the return values
to use this approach. Speed-ups up to eight times the speed of the sequential execution
have been obtained. No attempts have been made to implement this method for Python.
Since Python offers the possibility to use multiple threads, it might be preferred to let the
programmer decide if multiple threads are necessary.
One of the main inefficiencies of JavaScript is related to the use of eval. However it
is used a lot because of its great power. Therefore it is not desirable to just remove the
statement from the language. Instead a special tool, called the Evalorizer, is created to
reduce the usage of eval [21]. This tool was able to replace 97% of the eval invocations.
The performance did not really improve a lot, the main reason to remove the construct
was related to safety. This construct is also available in Python as a built-in function.
However it seems that this feature is not often used, which means that the benefit obtained
by removing it will be even lower.
Using a Just-In-Time compiler is another commonly used approach to improve the
speed of JavaScript. This can increase the loading time, however the running time should
be reduced. It is important to efficiently detect hot spots as early as possible. This can
be done based on the function invocation count, loop iteration count and the transition
count, which includes the caller of the compiled code if it is located far away from the
called code [18]. This approach has been attempted for Python as well and is described
in detail in Section 3.2.
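The counting idea behind hot-spot detection can be sketched in pure Python; this is a toy illustration only (the threshold value is assumed), since a real engine combines invocation, loop and transition counts as described above:

```python
HOT_THRESHOLD = 3   # assumed value, purely for illustration
compiled = set()

def maybe_jit(fn):
    # Count invocations; once a function becomes "hot", a real JIT would
    # compile it to machine code. Here we merely record the decision.
    count = 0
    def wrapper(*args, **kwargs):
        nonlocal count
        count += 1
        if count == HOT_THRESHOLD:
            compiled.add(fn.__name__)
        return fn(*args, **kwargs)
    return wrapper

@maybe_jit
def add(a, b):
    return a + b

for _ in range(5):
    add(1, 2)
assert "add" in compiled   # detected as hot after 3 invocations
```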
Since scripting languages are different from system programming languages, the benchmarking is different as well, because heap sizes, important for the garbage collector, and
dynamic compilation will influence the result. However most evaluations still use methodologies developed for C, C++ and Fortran. A solution to this problem is supplied by
the DaCapo benchmarking methodology [6], which focuses on Just-In-Time compilation.
Three methodologies are necessary for a good evaluation:
Mix: measures an iteration of an application which mixes JIT compilation and recompilation work with application time. This shows the tradeoff of total time with
compilation and application execution time.
Stable: steady-state run in which there is no JIT compilation. This is accomplished by
reusing the code generated from a previous run with JIT compilation. It measures
final code quality.
Deterministic Stable & Mix: eliminates sampling and recompilation. The JIT compiler is modified to perform replay compilation which applies a fixed compilation
plan when it first compiles each method. First it is necessary to modify the compiler
to record its decisions for each method. Then the benchmark is executed several
times and the best plan is selected. This methodology allows researchers to control
the virtual machine and compiler in order to get a better view on the influence of
proposed improvements.
A good coverage of the different hardware architectures is also advised and different
heap sizes should be taken into account in order to get a representative view.
A special compiler, called HipHop, has been created by Facebook to compile PHP to
C++ [27]. Facebook’s entire code base is compiled to C++ in three and a half minutes
on a twelve core server using the HipHop compiler. It then takes another eight minutes
to compile on a cluster. The custom built compiler completes requests about two and a
half times faster than Zend, a commonly used web application framework built in PHP
5. It is estimated that the HipHop compiler should be over five times faster than Zend,
however due to added extensions to the HipHop compiler, it is not possible to confirm this
assumption. This project shows that it is possible to combine both worlds. The interpreter
is used during the development and allows testing code quickly. The compilation process
is only used to install the code on the servers, with increased performance. However it was
necessary to drop a few commonly used features, like automatic promotion from integer
to float in case of an overflow and the eval statement. A similar approach is also possible
for Python by using Cython. However the code is compiled to C instead of C++. This
method is explained in Section 3.3.
2.2 Python
Since Python is a dynamic language, performing static compilation and detecting errors
early is very difficult. However, some researchers have asked whether those dynamic
features are actually used [14]. Alex Holker and James Harland assume that most of the
Python code will be statically typed, yet the startup code will most likely be dynamic. Since
Python uses a lazy module loading system, there is dynamic code even after the startup
code. However most of the Python code is actually static and some of the dynamic code
could be replaced by static code. 70% of their tested programs had less dynamic activity
after the startup. While this means some static analysis could be performed, there is still
a huge fraction of the applications containing dynamic code. Static analysis is not the
best method for improving the performance of Python.
Another characteristic of Python is that it is used to glue existing components together.
This means that the system’s dynamic linking and loading capabilities are under severe
stress. In order to test how Python performs in these circumstances, benchmarking should
be performed. A special benchmark has been designed to test the performance of Python
in these circumstances, called Pynamic [16]. This benchmark stresses the dynamic linking
and loading capabilities, using a predefined profile. This also stresses the operating system,
because scalable tools require parallel systems.
Python is executed similarly to Java. The Python interpreter will first generate Python
bytecode from the user application. This bytecode is then executed in a virtual machine.
Optimising the Python bytecode [5] will improve the performance of Python. In a first
stage, the bytecode is expanded by inlining the functions and applying loop unrolling.
Then it is possible to apply a variety of data-flow optimisations like value propagation,
constant propagation, algebraic simplifications, dead code elimination, copy propagation,
common subexpression elimination, etc. The first stage is very important, because it
enlarges the effect of the optimisation of the second stage. An improvement of about
10–30 percent was obtained for Pystone, the Crypto-1.2.5 Rijndael test, PyPy MD5, PyPy
SHA, Pybench and several micro tests. However a much larger speed-up is preferred.
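Some of these data-flow optimisations already exist in a modest form in CPython's peephole optimiser; constant folding, for instance, can be observed by inspecting a function's constant pool (an illustrative sketch):

```python
def f():
    return 2 * 3 * 7   # a constant expression

# The bytecode compiler folds the expression at compile time, so the
# final value 42 appears directly among the code object's constants.
assert 42 in f.__code__.co_consts
```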
All previous optimisation approaches have not led to a huge performance benefit. The
reason for this is that Python was not created to be used for scientific applications with
intensive computations. A common approach is to do the computationally intensive tasks
in a lower level language, like C or Fortran, and use Python to glue the code together.
However, is it not possible to improve the performance of Python by using different structures and statements? A study [11] shows that it is possible to improve the performance
of Python substantially by using, for example, vectorisation¹ instead of iterating over an
array. The study also compared the most optimal Python solution with solutions in C++,
C and Fortran. Those solutions were still about ten times faster than the Python solution.
However combining code written in C++, C and Fortran with Python shows about the
same performance, which proves that Python is very good as glue code.
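The idea behind such rewrites is to push the per-element loop out of interpreted bytecode and into C-level primitives; with NumPy this means whole-array operations, but the effect can be sketched with built-ins alone:

```python
# Loop-based version: every iteration executes several bytecodes
# in the interpreter.
def dot_loop(a, b):
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

# "Vectorised" version: the iteration happens inside C-implemented
# built-ins (zip/sum); NumPy's a.dot(b) takes this much further with
# optimised array kernels.
def dot_builtin(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
assert dot_loop(a, b) == dot_builtin(a, b) == 32.0
```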
Cython takes this a step further, by compiling Python code, optionally extended with
type information to improve the performance, to C. The influence has already been researched previously [4, 25]. Huge speed-ups compared to normal Python execution have
been accomplished, while less memory is required for the execution. An important advantage of Cython is the ease of use compared to other approaches like F2PY. Cython has the
capability to execute code outside the Global Interpreter Lock, however it is tedious. The
main advantage is the possibility to incrementally improve the performance. It is also not
necessary to optimise the entire program, but only the code which will result in the largest
performance gain. A more in-depth explanation about Cython follows in Section 3.3.
Another common approach to improve the performance of dynamic languages has
already been mentioned for JavaScript, namely a Just-In-Time compiler. Psyco is one of
¹ Vectorisation is the process of revising loop-based, scalar-oriented code to use matrix and vector operations.
the first projects attempting this for Python, using partial evaluation and specialisation
[23]. It is possible to get massive performance improvements, however not in the generic
case. This project has been discontinued in favour of PyPy. A few articles have been
written to explain how PyPy works [10, 9]. In Section 3.2 a detailed overview of the
project is given.
It is also possible to repurpose an existing Just-In-Time compiler. However, such repurposed JIT compilers have not given the expected performance benefit. This is because
of overreliance on the Just-In-Time compiler and traditional optimisations [12]. Specialisation is the key to improving the performance. Data-flow optimisations are the most
important ones.
Even Google has attempted to improve the performance of Python in a project called
‘Unladen Swallow’. They also decided to go for the Just-In-Time compiler approach. It was
built on top of LLVM; however, it seems the project has reached its end.
Chapter 3
Runtime Environments
A runtime environment contains everything necessary to execute a program.
This includes settings, libraries, a garbage collector, etc. However it does not
include tools to change the program.
The best-performing and most popular runtime environments that Python applications run on top of are summarized in Table 3.1. They provide three different
approaches to executing Python code. The first one, called CPython, will just
interpret the Python code without anything else. The second one, PyPy, intends to improve the speed by using a Just-In-Time compiler on top of an interpreter. The last one,
Cython, also tries to improve the performance by compiling the Python code to C and
then the C code to an executable. A more detailed description about the three approaches
follows.
3.1 CPython
CPython is the default Python interpreter, used by most people. It is written in C and
to make a distinction between the language and the interpreter the developers named it
CPython instead of Python.
3.1.1 Architecture
CPython is a very simple interpreter. Its architecture is represented in Figure 3.1. First
it will read the source code and compile it to Python bytecode, which is an intermediate
format of the Python code, similar to what happens in Java. This compilation process
is executed by a bytecode compiler. Once the Python bytecodes are generated, they are
passed to a bytecode interpreter, which will execute the instructions one after another.
Table 3.1: Python runtime environments

Environment   Language   Remark
CPython       C          the default Python interpreter
PyPy          RPython    uses a Just-In-Time compiler
Cython        Python     compiles Python code to C

CPython uses a stack-based virtual machine, which means that all objects are put on a
stack and when it is necessary to perform an operation, the required number of objects
are popped from the stack. The operation will then be performed on the objects and the
result is put on the stack again.
3.1.1.1 Garbage Collector
CPython has a generational garbage collector with three generations. New objects are
allocated in the first generation. When objects survive a few collections, set by a parameter, they are moved to the next generation. Each generation is collected less often than
the previous one. The moment to perform a garbage collection depends on the number of
allocations and deallocations.
It is also possible to bypass the garbage collector and explicitly delete objects. However
this approach is not commonly used. Since it can cause memory leaks, it is even discouraged.
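The generational structure can be observed and tuned through the standard gc module; a small illustrative sketch (the exact default thresholds may vary between CPython versions):

```python
import gc

# CPython's collector has three generations. The first threshold is the
# allocation surplus (allocations minus deallocations) that triggers a
# collection of the youngest generation; the other two control how often
# the older generations are collected.
print(gc.get_threshold())   # commonly (700, 10, 10)
print(gc.get_count())       # pending counts per generation

collected = gc.collect()    # force a full collection of all generations
print("unreachable objects collected:", collected)
```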
3.1.2 Optimisations
It is already possible to improve the performance of CPython by either passing an optimisation flag to the interpreter or by generating pyc files, which contain the Python
bytecode and eliminate the bytecode compilation step. However it is important to remark
that these improvements happen in the bytecode compiler. They will only influence the
loading time, which is the time necessary to read the script and compile it to Python
bytecode. This means that while some effort has been put into improving the speed of
CPython, it will not influence the execution time a lot. Most Python scripts are not long
enough for the loading time to become large enough to have an influence on the total
execution time. It is expected that most of the execution time is spent in running the
code itself.
3.1.2.1 Optimisation Flags
Currently it is possible to pass the -O flag or the -OO flag. This extract is taken from the
manual pages:
-O     Turn on basic optimizations. This changes the filename extension
       for compiled (bytecode) files from .pyc to .pyo. Given twice,
       causes docstrings to be discarded.

-OO    Discard docstrings in addition to the -O optimizations.
The -O flag will eliminate assert statements and the __debug__ variable is set to False.
This means that statement blocks of the form if __debug__: ... will be removed as
well. The -OO flag will also remove documentation.
As mentioned before, these optimisations do not really improve the total execution
time, only the loading time, which is only a very small fraction of the total execution
time in most cases. These flags have been provided to allow optimisations in the future.
Since they are not real optimisations and it is not useful to benchmark short running
applications, where the influence should be larger, I decided to ignore them.
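The effect of -O can nevertheless be made visible without benchmarking, because the optimize argument of the built-in compile() mirrors the command-line flag; a small sketch:

```python
import dis

src = "assert x > 0\ny = x + 1\n"

# optimize=0 keeps the assert statement; optimize=1 corresponds to -O
# and strips it from the generated bytecode entirely.
plain = list(dis.Bytecode(compile(src, "<demo>", "exec", optimize=0)))
optimized = list(dis.Bytecode(compile(src, "<demo>", "exec", optimize=1)))

assert len(optimized) < len(plain)   # the assert's bytecode is gone
```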
Figure 3.1: Architecture of CPython
3.1.2.2 PYC
The second mechanism CPython has to improve the speed of Python is by using cached
files. They have the pyc extension. These files contain the Python bytecodes of the
program, which were generated by the bytecode compiler during a previous run, however
no modifications are made to the code. Only the step to compile the Python script to
bytecode is skipped with this optimisation, however it might increase the reading time, if
the Python bytecodes take a lot of space to store.
This means that again only the loading time will be improved, because the Python
source code does not need to be compiled to Python bytecode anymore. However the
improvement will only be obtained if the pyc files are not too large, which would lead to
an increased reading time.
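These cached files can also be produced explicitly with the standard py_compile module; a small sketch using a temporary, hypothetical script:

```python
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "hello.py")
    with open(src, "w") as f:
        f.write("print('hello')\n")

    # Byte-compile the script; the returned path points at the cached
    # bytecode file that CPython would otherwise write on first import.
    cached = py_compile.compile(src)
    assert cached is not None and os.path.exists(cached)
```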
It is possible to view the Python bytecodes using the dis module. Listing 3.1 illustrates how this can be accomplished for the calculation of the nth Fibonacci number. The disassembled Python code for this calculation is visible in Listing 3.2 and clearly shows
that the virtual machine is stack based. Each time, first the necessary operands are loaded,
followed by the execution of the operator.

from dis import dis

def fib(n):
    if n < 2:
        return n
    else:
        return fib(n-1) + fib(n-2)

dis(fib)

Listing 3.1: Disassemble the Python code for the Fibonacci problem to Python bytecode
  4           0 LOAD_FAST                0 (n)
              3 LOAD_CONST               1 (2)
              6 COMPARE_OP               0 (<)
              9 POP_JUMP_IF_FALSE       16

  5          12 LOAD_FAST                0 (n)
             15 RETURN_VALUE

  7     >>   16 LOAD_GLOBAL              0 (fib)
             19 LOAD_FAST                0 (n)
             22 LOAD_CONST               2 (1)
             25 BINARY_SUBTRACT
             26 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             29 LOAD_GLOBAL              0 (fib)
             32 LOAD_FAST                0 (n)
             35 LOAD_CONST               1 (2)
             38 BINARY_SUBTRACT
             39 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             42 BINARY_ADD
             43 RETURN_VALUE

             44 LOAD_CONST               0 (None)
             47 RETURN_VALUE

Listing 3.2: Python bytecode for the Fibonacci problem
3.1.3 Multi-threaded Applications
There are currently three different mechanisms to write multi-threaded applications in
Python:
• thread based
• event based
• multiprocessing module
For now the multiprocessing module is the best approach to run concurrent
applications on different cores.
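A minimal sketch of that approach: each worker in a process pool is a separate interpreter process with its own Global Interpreter Lock, so CPU-bound work can actually run on several cores at once.

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    # Work is distributed over four worker processes; results come back
    # in the original order, as with the built-in map().
    with Pool(processes=4) as pool:
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```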
3.1.3.1 Thread Based Concurrency
This approach is commonly used by most computer languages. The basic principle is
that a sequence of instructions can run inside a thread and multiple threads can run
concurrently. Every application has at least one thread, called the main thread. In order
to manage all those threads, synchronization mechanisms are supplied. CPython uses the
same common approach most computer languages follow.
However, CPython has the Global Interpreter Lock (GIL), a mutex that prevents threads from executing Python bytecodes at the same time. This means that threaded applications cannot benefit from multiprocessor systems.
The Global Interpreter Lock is not purely a drawback. It is included in CPython because CPython's memory management is not thread-safe, and blocking or long-running operations happen outside it. The benefits are that single-threaded applications run faster and that integration with C is a lot easier. This means the Global Interpreter Lock only causes problems for multi-threaded applications which do not call C libraries and do not perform a lot of I/O.
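The effect can be demonstrated with a small, admittedly contrived experiment: a pure-Python CPU-bound function run twice sequentially and then in two threads. On CPython the threaded version is typically no faster, precisely because of the GIL (absolute timings vary per machine, so none are asserted here):

```python
import threading
import time

def count(n):
    # Pure-Python, CPU-bound loop: bytecode execution is serialised by the GIL.
    while n > 0:
        n -= 1

N = 2_000_000

start = time.time()
count(N)
count(N)
sequential = time.time() - start

start = time.time()
threads = [threading.Thread(target=count, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.time() - start

# On CPython the threaded run is typically no faster than the sequential one.
print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```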
There have been many discussions and attempts to remove the Global Interpreter
Lock. However this is not an easy task and nobody has succeeded yet. Many features
now depend on the guarantees that it enforces, which makes it even harder. Since the
multiprocessing module solves the problem with the Global interpreter lock, it seems it
will not be removed in the near future.
3.1.3.2 Event Based Concurrency
Event based concurrency is based on transactional memory. The idea is to handle ‘events’
one after the other. The ordering of the events is not deterministic, because often they
are external. The handling of each event occurs deterministically, however not in parallel
with the handling of other events.
In order to handle an event, a transaction is used. This is a tentative execution of
the code used to handle the event. When a conflict is detected with other concurrently
executing transactions, the transaction is aborted and restarted. This approach assumes
that conflicts will not occur often, and that the handling is done quickly or conflicts are
detected early in the process. There is no hardware support yet, which means that only a
software implementation is available. This causes a huge performance disadvantage [24].
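The abort-and-retry idea can be sketched with a toy version-checking scheme. This is only an illustration of optimistic concurrency with invented names, not PyPy's actual software transactional memory:

```python
class Cell:
    """A shared value with a version number that changes on every commit."""
    def __init__(self, value):
        self.value = value
        self.version = 0

def transact(cell, fn):
    # Tentatively execute fn; commit only if no other transaction
    # committed in the meantime, otherwise abort and restart.
    while True:
        seen = cell.version
        new_value = fn(cell.value)      # tentative execution
        if cell.version == seen:        # no conflict detected
            cell.value = new_value
            cell.version += 1
            return new_value
        # conflict: abort the tentative result and retry

c = Cell(10)
print(transact(c, lambda v: v + 1))  # 11
```

A real implementation would detect conflicts at a finer granularity and atomically; the point here is only the optimistic execute-check-retry structure.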
3.1.3.3 Multiprocessing Module
Using the multiprocessing module is the only option to run code simultaneously on
different cores. This approach spawns a subprocess on each core, in order to avoid the
Global Interpreter Lock. This causes complicated dependencies between the subprocesses,
because the data will need to be synchronised. This module is particularly useful when
there is not a lot of data shared between the subprocesses. Still, even if there is a lot of
data shared, this is the best option to execute code in parallel.
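A minimal usage sketch of the multiprocessing module follows; the `square` task is an illustrative stand-in for real work. Each worker lives in its own process with its own interpreter and its own GIL:

```python
from multiprocessing import Pool

def square(x):
    return x * x

def parallel_squares(n, processes=4):
    # Distribute the work over a pool of worker processes;
    # results come back in input order.
    with Pool(processes=processes) as pool:
        return pool.map(square, range(n))

if __name__ == "__main__":
    print(parallel_squares(10))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The arguments and results are pickled and sent between processes, which is exactly the synchronisation cost the text mentions when a lot of data is shared.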
3.2 PyPy
The PyPy project aims to produce a flexible and fast Python implementation. In order to
have a fast implementation, a Just-In-Time compiler (JIT) is used. The developers have
decided not to focus solely on Python. Currently there are implementations for Ruby,
PHP, Prolog, SmallTalk and a preliminary version for JavaScript.
The objective of a just-in-time (JIT) compiler for a dynamic language is to
improve the speed of the language over an implementation of the language
that uses interpretation. The first goal of a JIT is therefore to remove the
interpretation overhead, i.e. the overhead of bytecode (or AST) dispatch and
the overhead of the interpreter’s data structures, such as operand stack etc.
The second important problem that any JIT for a dynamic language needs to
solve is how to deal with the overhead of boxing primitive types and of type
dispatching. Those are problems that are usually not present or at least less
severe in statically typed languages [8].
The interpretation overhead is reduced by compiling the most frequently executed code during the execution of the program. The next time that code has to be executed, it does not need to be translated anymore. Moreover, the Just-In-Time compiler improves performance by applying optimisations. At runtime, 'hot' code, meaning frequently executed code, is discovered and then optimised and compiled to assembly. It is important that this does not take much time, since it happens during the execution; compiling code that is barely executed costs more time than it gains.
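The counter-based detection of hot code can be modelled with a toy sketch. The threshold and names below are purely illustrative; PyPy's real trigger values and bookkeeping differ:

```python
HOT_THRESHOLD = 1000  # illustrative value, not PyPy's actual default

counters = {}   # how often each loop header has been reached
compiled = set()  # loops that have been "compiled"

def on_loop_header(loop_id):
    # Count this pass over the loop header; once the loop is hot,
    # mark it compiled (a real JIT would trace it and emit assembly).
    counters[loop_id] = counters.get(loop_id, 0) + 1
    if counters[loop_id] >= HOT_THRESHOLD and loop_id not in compiled:
        compiled.add(loop_id)
        return "compile"
    return "interpret"

for _ in range(1500):
    on_loop_header("fib-loop")
print("fib-loop" in compiled)  # True
```

Loops executed fewer than HOT_THRESHOLD times stay interpreted, which is exactly why rarely executed code is never worth compiling.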
PyPy is written in RPython (Restricted Python), a language suitable for creating dynamic language interpreters. The project uses extreme programming, which means that there is no strict definition of RPython; it changes when necessary or in order to improve PyPy.
PyPy does not yet entirely support the Python 3 syntax. A lot has changed between Python 2 and Python 3, and PyPy still mainly follows the Python 2 syntax. An effort is being put into updating the interpreter to the next version, but it will be a while before this is finished. Another disadvantage is that using C libraries with PyPy is discouraged, because the Just-In-Time compiler cannot optimise this code. However, the most popular C libraries, like numpy, have been ported to work with PyPy, and the cffi module is installed by default, which allows the user to combine C code with PyPy very easily.
3.2.1 Architecture
The PyPy project contains an interpreter, a JIT component and the RPython toolchain.
The latter is used to compile PyPy itself. Of course there is also a garbage collector.
The behaviour of both the Just-In-Time compiler and the garbage collector can be
modified by setting certain flags or variables. A detailed description about them is added
in Appendix A.
3.2.1.1 PyPy's Interpreter
The interpreter consists of a bytecode compiler and a bytecode evaluator. An object space
is used to abstract all actions. This makes it easier to support other computer languages.
bytecode compiler sees the Python source code of the user application and compiles it to Python code objects. The compilation chain is fairly standard and can be seen in Figure 3.2. The resulting code objects are passed to the bytecode evaluator.
bytecode evaluator or bytecode interpreter interprets the Python code objects and
delegates the correct action to the standard object space. This is basically a Python
virtual machine.
standard object space is responsible for creating and manipulating the Python objects
seen by the application.
Listing 3.3 shows the disassembled code of the fibonacci problem from Listing 3.1. It is almost completely the same as the bytecode used by CPython. The stack-based nature of the virtual machine is again clearly visible.
Figure 3.2: The phases used by the bytecode compiler

  4           0 LOAD_FAST                0 (n)
              3 LOAD_CONST               1 (2)
              6 COMPARE_OP               0 (<)
              9 POP_JUMP_IF_FALSE       16

  5          12 LOAD_FAST                0 (n)
             15 RETURN_VALUE

  7     >>   16 LOAD_GLOBAL              0 (fib)
             19 LOAD_FAST                0 (n)
             22 LOAD_CONST               2 (1)
             25 BINARY_SUBTRACT
             26 CALL_FUNCTION            1
             29 LOAD_GLOBAL              0 (fib)
             32 LOAD_FAST                0 (n)
             35 LOAD_CONST               1 (2)
             38 BINARY_SUBTRACT
             39 CALL_FUNCTION            1
             42 BINARY_ADD
             43 RETURN_VALUE
             44 LOAD_CONST               0 (None)
             47 RETURN_VALUE
Listing 3.3: PyPy bytecode for the fibonacci problem
3.2.1.2 PyPy's Just-In-Time Compiler
A Just-In-Time compiler will search at runtime for code which is often executed, based
on counters indicating how many times a loop is executed. Once such a piece of code
is found, the Just-In-Time compiler will compile that code to a lower level language, in
most cases assembly. At this point the execution becomes platform specific. This does
not cause problems since you will not change the hardware while executing the script.
The most common optimisations like constant folding, common subexpression elimination, allocation removal, etc. are implemented in PyPy’s Just-In-Time compiler. PyPy
also applies very aggressive inlining, in order to optimise as much as possible. The methodology used is based on the principles of a tracing Just-In-Time compiler [10].
A tracing JIT works by observing the running program and recording its commonly executed parts into linear execution traces. Those traces are optimized
and turned into machine code [8].
The Just-In-Time compiler starts by tracing the bytecode of the interpreter interpreting the application-level code. After enough unrolling, a loop written in the user's application will eventually be detected. This trace is compiled by the JIT backend, which generates the assembly. The assembly is returned to the frontend, which traces the bytecode, so it can be reused the next time. Guards are placed wherever it is possible to jump out of the trace; they are added to ensure the correctness of the code. Listing 3.4 contains the trace, with guards, for the initial if-condition of the Fibonacci code in Listing 3.1, which checks whether n is smaller than two.
0 LOAD_FAST n
    guard(i4 == 1)
    guard(p5 is null)
    guard_nonnull_class(p12, ConstClass(W_IntObject), descr=<...>)
    guard(i8 == 0)
3 LOAD_CONST 2
    guard(p2 == ConstPtr(ptr25))
6 COMPARE_OP <
    i26 = ((pypy.objspace.std.intobject.W_IntObject)p12).inst_intval [pure]
    i28 = i26 < 2
    guard(i28 is true)
9 POP_JUMP_IF_FALSE 16
Listing 3.4: Python bytecode with guards for the fibonacci problem
The Python bytecode in Listing 3.4 shows that a trace does not contain all the code of the function, but only the code that is actually executed. In this case, the code for the case where n is smaller than two is not included. Therefore, a guard is inserted to catch the case when n is smaller than two.
When a guard fails, a blackhole interpreter is used to get back to a safe point, because
it is incredibly complex to move the state from the registers used by the assembled code to
the bytecode interpreter, where all values should be boxed. Once a safe point is reached,
the bytecode evaluator continues to operate using the Python bytecode.
3.2.1.3 Runtime Interaction
When a Python program is run, the Python code is compiled to Python code objects
by the bytecode compiler. These objects are passed to the bytecode interpreter, which
will interpret the Python code objects and execute them using the standard object space.
It works like a normal stack-based virtual machine. When the Just-In-Time compiler is
enabled, the tracing interpreter will trace the code. Each time a loop is found, it will
decide if the loop should be compiled, based on how many times it is executed. When
the trace needs to be compiled, the trace will be given to the JIT backend. Then the
trace is compiled to assembly and returned to the tracing interpreter. Now the assembled
version can be used, each time the same code is called. When a guard fails, it is difficult to
return to the bytecode interpreter, because the state needs to be passed and because half
a Python instruction might already have been executed. Instead a blackhole interpreter
continues until a safe point is reached. Then the bytecode interpreter continues. This
process repeats until the Python program is finished. A visual representation can be
found in Figure 3.3.
3.2.1.4 RPython Toolchain
To compile PyPy itself, a special toolchain has been developed. The job of the RPython
toolchain is to translate RPython programs into an efficient version of that program for
one of the various target platforms, generally one that is considerably lower-level than
Python. The RPython toolchain never sees the RPython source code or syntax trees, but
rather starts with the code objects that define the behaviour of the function objects one
gives it as input.
First it is important to remark that RPython code does not exist in files, but instead
it exists only in memory. Writing an RPython program means writing a program which
generates the RPython code objects in memory. The RPython toolchain itself is written
in Python, which means it can be compiled by any Python interpreter, and ipso facto by
PyPy itself.
Figure 3.3: The architecture of PyPy
Since the interpreter is given as input to the RPython toolchain, the interpreter needs
to be written in RPython. The Just-In-Time compiler is generated during the compilation
of PyPy. For this to work, it is required that a few hints are given to the interpreter. It
is however not necessary to include a Just-In-Time compiler, but this will lead to worse
performance.
The toolchain can be seen in Figure 3.4 and includes following steps:
1. The code objects are converted to a control flow graph by the Flow Object Space.
2. The control flow graphs are processed by the Annotator, which performs whole-program type inference to annotate each variable of the control flow graph with the types it may take at run-time.
3. The information provided by the Annotator is used by the RTyper to convert the high
level operations of the control flow graphs into operations closer to the abstraction
level of the target platform.
• After the RTyping phase, it is possible to insert a Just-In-Time compiler. The
Just-In-Time compiler will be generated from the hints given in the interpreter,
which means that it is not necessary to write any code, except for the hints in
the interpreter, to use the Just-In-Time compiler for a different language.
4. Optionally, various transformations can be applied which, for example, perform optimisations such as inlining or add capabilities such as stackless-style concurrency. In
this phase code is also inserted for the garbage collector and exception management.
5. The graphs are converted to source code for the target platform and compiled into
an executable.
Figure 3.4: The RPython toolchain
3.2.1.5 Garbage Collector
It is possible to use PyPy with four different garbage collectors.
Semispace copying has two arenas of equal size, but only one arena is used and gets
filled with new objects. When the arena is full, the live objects are copied into the
other arena. The old arena is then cleared.
Generational (2 generations) adds a nursery to the semispace garbage collector, which
is a chunk of the current semispace. Allocations fill the nursery, and when it is full, it
is collected and the objects still alive are moved to the rest of the current semispace.
The idea is that it is very common for objects to die soon after they are created.
Generational GCs help a lot in this case and the semispaces fill up much more slowly,
making full collections less frequent.
Hybrid (3 generations) can handle both objects that are inside and objects that are
outside the semispaces (‘external’). The external objects are not moved and collected
in a mark-and-sweep fashion. Large objects are allocated as external objects to avoid
costly moves. Small objects that survive for a certain time, based on the number of
semispace collections, are also made external so that they stop moving.
This is coupled with a segregation of the objects in three generations. Each generation is collected much less often than the previous one.
Minimark is based on the hybrid garbage collector.
It uses a nursery for the young objects, and mark-and-sweep for the old objects.
This is a moving GC, but objects may only move once (from the nursery to the old
stage).
The main difference with the hybrid garbage collector is that the mark-and-sweep
objects (the ‘old stage’) are directly handled by a custom allocator, instead of being
handled by malloc() calls. This reduces the amount of memory necessary during a
major collection compared to the hybrid garbage collector.
An incremental version of this garbage collector is available which is used by default.
For the benchmarking of PyPy the default incremental version of the minimark garbage
collector has been used.
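As an aside, the generational hypothesis ('objects die young') also underlies CPython's own cyclic garbage collector, whose per-generation state can be inspected with the gc module. This is a CPython illustration, not one of PyPy's collectors:

```python
import gc

# CPython's cycle collector is generational as well: three generations,
# where older generations are collected much less often than younger ones.
print(gc.get_threshold())  # per-generation collection thresholds
print(gc.get_count())      # current allocation counts per generation
```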
3.2.2 Multi-threaded Applications
PyPy uses the exact same mechanisms as CPython for multi-threaded applications. This
means that PyPy has issues with the Global Interpreter Lock as well. It is also possible
to use the other approaches with PyPy. Again the best approach is the multiprocessing
module.
3.3 Cython
Cython is an optimising static compiler for both the Python programming
language and the extended Cython programming language. It makes writing
C extensions for Python as easy as Python itself.1
Cython allows you to combine Python with C and C++. It simplifies writing Python code that natively calls into, and is called back from, C or C++ code. Furthermore, it can convert Python code, optionally enhanced with Cython language statements that add static type information, to C or C++.
Listing 3.5 contains the Python code for the pairwise distance calculation problem, while the Cython code can be found in Listing 3.6. This common scientific problem calculates the distances between all pairs of a set of points. The example clearly shows that the Cython code is longer and more complicated. It will be difficult for novice users to add the type declarations. Most people choose Python for the ease of programming, and dynamic typing is an important part of that. The Cython language makes it possible to add more information, which allows Cython to translate the code to C better and to optimise it further, resulting in faster execution. However, doing this for real Python applications is complicated and might not attract many users.
1 cython.org

import numpy as np

def pairwise(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=np.float)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D
Listing 3.5: Python code for the pairwise distance calculation problem
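For reference, the same computation can be written in pure Python without NumPy. This cross-check is my own sketch, not part of the benchmark code:

```python
import math

def pairwise_py(X):
    # X is a list of points, each point a list of coordinates.
    M = len(X)
    N = len(X[0])
    D = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i][k] - X[j][k]
                d += tmp * tmp
            D[i][j] = math.sqrt(d)
    return D

D = pairwise_py([[0.0, 0.0], [3.0, 4.0]])
print(D[0][1])  # 5.0
```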
import numpy as np
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return np.asarray(D)
Listing 3.6: Cython code for the pairwise distance calculation problem
3.4 Other Runtime Environments
There are some other runtime environments available, like py2exe, cx_Freeze, Shed Skin, etc. Even Google has attempted to improve Python's performance in a project called Unladen Swallow. However, I believe the most important runtime environments are included: PyPy is the best-performing Just-In-Time compiler, Cython is an optimising static compiler and CPython is the default interpreter. This covers the three different, commonly used approaches.
3.5 Conclusion
There are three common approaches to running Python code:
• interpretation
• compiling to a lower level language (commonly C)
• interpretation and a Just-In-Time compiler
For each of those approaches, a commonly used runtime environment has been chosen. CPython is the default interpreter, used by most people. Both Cython and PyPy try to improve the performance of Python. Cython accomplishes this by converting the Cython code to C and then compiling it to assembly. To get an optimal result, it is however necessary to add extra type information, which is not easy and might not attract a lot of users. PyPy tries to improve the execution time by using a Just-In-Time compiler, which means that it is not necessary to add type information. However, its support for the third version of Python is currently very poor.
Chapter 4
Benchmarking
After all, facts are facts, and although we may quote one to another with
a chuckle the words of the Wise Statesman, ’Lies–damned lies–and statistics,’
still there are some easy figures the simplest must understand, and the astutest
cannot wriggle out of. — Leonard Henry Courtney, 1895
The goal of benchmarking is to obtain measurements on a computer system by executing a computing task, which will allow comparison between different hardware and software combinations. Both the computer system and the task cause a
difficulty with benchmarking a computer language. It is important that the measurements and benchmarks are representative, so that we can generalise about the programming language instead of being specific to the computer system or task. Therefore it is necessary to have benchmarks which are I/O-, memory- and CPU-intensive. Multi-threaded benchmarks should also be included. While it is not possible to ensure a similar
behaviour on different hardware, using multiple benchmarks should also result in a similar
behaviour on most commonly used machines. The same problems should surface on most
commonly used hardware components. The most important task of benchmarking starts
with choosing the correct benchmarking suite, which groups together computer tasks in a
wide variety of domains. It is important that the benchmarks give a representative view
of the language.
Benchmarking a dynamic language is even more complicated, because dynamic components, such as a garbage collector and Just-In-Time compiler, make the runs nondeterministic. Therefore it is necessary to have a good benchmarking methodology in
order to draw correct conclusions.
The next step is to decide which characteristics to measure. It is logical to measure
the execution time of each benchmark, but this will not explain the behaviour and will not
result in clear conclusions. To get a better understanding of the runtime environments, it is
interesting to measure hardware events, like the number of cycles, instructions, branches,
etc. Since this is a very demanding task, I automated this using shell and Python scripts.
The most important part is the interpretation of the results, which happens after the
benchmarking. The results are explained in Chapter 5 and Chapter 6.
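Such an automation script might look roughly like the following sketch. This is hypothetical; the actual shell and Python scripts used for this thesis are not shown here, and a real hardware-event measurement would use PAPI or perf rather than wall-clock time:

```python
import statistics
import subprocess
import sys
import time

def run_benchmark(cmd, iterations=5):
    # Run `cmd` repeatedly and report the mean and stdev of wall-clock time.
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)

mean, stdev = run_benchmark([sys.executable, "-c", "sum(range(10**6))"], iterations=3)
print(f"{mean:.3f}s +/- {stdev:.3f}s")
```

Running each benchmark several times and reporting a spread, rather than a single number, is what makes the later statistical interpretation possible.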
4.1 Benchmarking Suites
Since it is important to have benchmarks which are representative of the language, I looked at which benchmarks are used in academic research. It seems that microbenchmarks are most often used, to compare the behaviour in a very specific domain. Some of these benchmarks are even included with Python by default, like pystone, iobench, etc. However, these do not help my research, because I want to get a wide view of the behaviour. Eventually I found two benchmarking suites which provide a decent number of benchmarks in various domains. One, called The Grand Unified Python Benchmark Suite, focuses entirely on the Python language. The other, called The Computer Language Benchmarks Game, aims to compare computer languages with each other.
4.1.1 The Grand Unified Python Benchmark Suite
This project is intended to be an authoritative source of benchmarks for all
Python implementations. The focus is on real-world benchmarks, rather than
synthetic benchmarks, using whole applications when possible.1
Most of the benchmarks are based on work from the Unladen Swallow project by Google, and they are used by PyPy to show its performance. There is a website [2] comparing different PyPy versions with CPython versions, as can be seen in Figure 4.1. This suite contains 55 benchmarks; the following are the most commonly used:
• 2to3
• calls
• django
• fastpickle
• fastunpickle
• float
• html5lib
• html5lib_warmup
• mako
• nbody
• nqueens
• pickle
• pickle_dict
• pickle_list
• pybench
1 The homepage of The Grand Unified Python Benchmark Suite
• regex
• richards
• rietveld
• slowpickle
• slowspitfire
• slowunpickle
• spitfire
• spambayes
• startup
• threading
• unpack_sequence
• unpickle
After testing out the benchmarks, I noticed most of them are really short, as can be
seen in Figure 4.2. In order to get a representative view of the performance, I searched
for a different benchmarking suite.
4.1.2 The Computer Language Benchmarks Game
We are trying to show the performance of various programming language implementations – so we ask that contributed programs not only give the correct
result, but also use the same algorithm to calculate that result.2
This suite contains only 15 benchmarks, but most of them take quite a while to complete. It also offers the possibility to compare with other languages, but PyPy cannot run all of the benchmarks. This means the number of benchmarks I could use is reduced even more, because of PyPy's poor support for the latest version of Python, as mentioned in Section 3.2.
An advantage of this suite is that it compares computer languages with each other. This means that I can easily compare Python with other languages, such as C,
which is known to have very good performance. The benchmarks I was able to use are
listed in Table 4.1.
After going through the source code I was able to draw some conclusions about the
characteristics of each of those benchmarks:
n-body performs mainly mathematical operations like multiplications, additions, subtractions, and even a few divisions and exponentiations. The data used to execute
these operations are predefined in the source code as a dictionary, containing tuples
and lists. The results are written to a list. Since the most executed operations are
mathematical, it is considered a CPU-intensive benchmark, however it will also be
necessary to load and store a small amount of memory.
2 The website of The Computer Language Benchmarks Game
Figure 4.1: Comparison between PyPy and CPython by speed.python.org
Table 4.1: Benchmarks used to evaluate the performance

fannkuch-redux    Repeatedly access a tiny integer-sequence
spectral-norm     Calculate an eigenvalue using the power method
n-body            Perform an N-body simulation of the Jovian planets
k-nucleotide      Repeatedly update hashtables and k-nucleotide strings
fasta-redux       Generate and write random DNA sequences
binary-trees      Allocate and deallocate many many binary trees
[Bar chart of execution time in seconds for the benchmarks 2to3, call_method, call_method_slots, call_method_unknown, call_simple, float, iterative_count, nbody, normal_startup, nqueens, pickle_dict, pickle_list, regex_compile, regex_effbot, regex_v8, richards, slowpickle, slowunpickle, startup_nosite, threaded_count, unpack_sequence and unpickle_list.]
Figure 4.2: Time measurements for The Grand Unified Python Benchmark suite
spectral-norm performs mainly multiplications and additions. It will also execute a few
bit shifts and divisions. The data used to perform the calculations are based on the
indices of a loop, iterating over an array. The amount of loading and storing is very
small for this benchmark. It is a pure CPU-intensive benchmark.
fannkuch-redux mainly swaps data and performs some arithmetic to calculate the locations to access and to decide when to stop. The amount of memory used is very small; therefore it is considered a CPU-intensive benchmark rather than a memory-intensive one.
k-nucleotide first reads a file containing a very large DNA sequence. Then it will find certain sequences in the sequence read from the file and perform a sort operation. Both
operations happen in parallel, which means this benchmark also includes threading. The most important part of this benchmark is the reading, which makes it an
I/O-intensive benchmark and there will also be some memory operations.
fasta-redux provides the input for k-nucleotide by generating a very large DNA sequence.
A random lookup table is generated to create the DNA sequence. Generating the
table is done by doing some operations on predefined data, which means this benchmark will use the CPU and the memory, however the most important part is writing
to a file. This makes it an I/O-intensive benchmark.
binary-trees first creates a huge binary tree. Then it counts the number of trees having
a certain depth in parallel. The main characteristic of this benchmark is the amount
of memory it consumes, which makes it a memory-intensive benchmark. Of course
it also uses threading.
The types of the benchmarks are summarised in Table 4.2. Even though the number of benchmarks I am able to use is reduced, I still have CPU-, I/O- and memory-intensive benchmarks. Furthermore, two benchmarks use threading. This means I should get a representative view of the performance of Python in the most common areas.
Common arguments are supplied for each of the benchmarks in this suite. I have tested these commonly used arguments and kept the ones for which the execution is not too short. The arguments used for each benchmark are listed in Table 4.3.
4.2 Benchmarking Methodologies
In Chapter 3, I already mentioned the different runtime environments I am going to benchmark. However, PyPy is a special case, because it has the Just-In-Time compiler. It would be interesting to get a better understanding of the working and effect of the Just-In-Time compiler. To accomplish this, I created the 'PyPy stable' methodology (PyPyS).
Table 4.2: Used benchmarks grouped by type

CPU               I/O             memory
n-body            k-nucleotide*   binary-trees*
spectral-norm     fasta-redux
fannkuch-redux

(*) uses multiple cores
Table 4.3: Arguments used for each benchmark

benchmark         arguments
fannkuch-redux    11, 12
spectral-norm     3000, 5500
n-body            5000000, 50000000
k-nucleotide      2500000
fasta-redux       2500000, 25000000
binary-trees      20
The stable methodology has been used before for research about Java and JavaScript [6]. The idea of this methodology is to eliminate as much of the overhead of the Just-In-Time compiler as possible, in order to measure the behaviour of the user application. This is accomplished by first executing the application with the Just-In-Time compiler enabled, and then benchmarking it without the Just-In-Time compiler, using the already-compiled code. Comparing with a normal execution then shows how much time and how many resources are actually lost to Just-In-Time compilation.
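The general warmup-then-measure pattern behind the stable methodology can be sketched as follows. This is a generic illustration only; the actual harness (Listing 4.1) additionally disables the JIT after warmup and forces garbage collections between iterations:

```python
import statistics
import time

def measure_stable(workload, warmup=3, iterations=10):
    # Warm-up runs give a JIT the chance to compile the hot code;
    # only the later iterations are measured (the "stable" behaviour).
    for _ in range(warmup):
        workload()
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

print(measure_stable(lambda: sum(range(100_000))))
```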
A simplified version of the code used to measure the stable behaviour can be found in
Listing 4.1. First the code prepares for acquiring the results. This setup code contains
the number of iterations each hardware event is measured and the location of a log file.
Furthermore code has been added to load C libraries, used to measure the hardware
events. Finally the benchmark and the argument for the benchmark are taken from the
input arguments and the benchmark is loaded into the memory. After the setup code, first
an initial run of the benchmark is executed with the Just-In-Time compiler enabled. This
makes sure the code is compiled to assembly, to be reused later on. After this initial run,
the Just-In-Time compiler is turned off and a garbage collection of the heap is forced. This
ensures that the next iteration starts with a fresh heap so the results are not influenced.
Next the hardware events and time are measured consecutively, and after each iteration
a garbage collector is forced. The results are stored in a two dimensional space. Finally,
after acquiring all measurements, the results are written to the log file and standard out.
At the end of the harness some clean up code has been added. This frees the memory
used by the C libraries.
However, this harness does not work for the multi-threaded benchmarks, because of the techniques used by the multiprocessing module. On each execution a new interpreter is launched on each core, which means that a new Just-In-Time compiler is created as well. The k-nucleotide benchmark also reads data from a file, which makes it a lot more difficult to eliminate the Just-In-Time compiler while still using the compiled code. Therefore no results have been acquired for those two benchmarks.
from cffi import FFI
from gc import collect
from sys import argv
from time import time
from pypyjit import set_param  # PyPy-specific: controls the JIT at runtime

NR_IT = 10
LOG = 'harness_log/log'

ffi = FFI()
ffi.cdef('''
    int setup();
    void teardown();
    int start(int iter);
    long long *stop(int iter);
''')
with open('counters.c') as r:
    src = r.read()
C = ffi.verify(src, libraries=['papi'])
nr_events = C.setup()

ben = argv[1]
arg = int(argv[2])
mod = __import__('py_' + ben)

# warmup
set_param("default")
start = time()
mod.main(arg)
end = time()

# iterations
set_param("off")
collect()
for j in range(nr_events):
    for i in range(NR_IT):
        nr = C.start(j)
        start = time()
        mod.main(arg)
        end = time()
        res = C.stop(j)
        collect()

# present results
print(results)
C.teardown()
Listing 4.1: Simplified version of the harness for the stable behaviour
Another method to analyse PyPy's Just-In-Time compiler is to disable it entirely
(PyPyNJ). Since it is an optional part of PyPy, it is very easy to do this. I have also
added this approach to the environments used to perform benchmarking. The resulting
runtime environments are listed in Table 4.4.
Table 4.4: The different benchmarked runtime environments with their most important cost

environment   cost              remark
CPython       interpretation    default Python interpreter
PYC           interpretation    CPython with reduced loading time
PyPy          JIT               Just-In-Time compiler
Cython        -                 compiles Python to C
C             -                 compiled with GCC
PyPyS         (optimised)       PyPy's stable behaviour
PyPyNJ        (not optimised)   PyPy without Just-In-Time compiler
Remember that both PyPyS and PyPyNJ run without Just-In-Time compilation.
However, the stable runtime environment will use previously generated and optimised
assembly, while the PyPyNJ runtime environment will not use any optimised code.
4.3 Setup
I have used the Bluepower machine at ELIS to do the benchmarking, which has the
following hardware characteristics:
# processors    : 8
vendor id       : GenuineIntel
model name      : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
cpu MHz         : 1596.000
cache size      : 8192 KB
cpu cores       : 4
cache alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
It is running on Ubuntu 11.04 (Natty Narwhal). The following versions of the different
runtime environments are installed:
CPython : 3.2
PyPy    : 2.7.3
GCC     : 4.5.2
Cython  : 0.19.2
Neither CPython nor PyPy limits the heap size, which means that by default the
entire RAM will be used if necessary. PyPy's nursery size defaults to four megabytes
or half the cache size. For this machine, PyPy's nursery size is about four megabytes
and the maximum heap size for both CPython and PyPy is about fourteen gigabytes.
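These sizes can also be pinned down explicitly: PyPy's garbage collector can be tuned through environment variables before the interpreter starts. A minimal sketch follows; the variable names are taken from PyPy's GC documentation and the values mirror the defaults just described, but treat both as assumptions for this particular PyPy version:

```python
import os

# Sketch: tune PyPy's GC through environment variables before launching it.
# PYPY_GC_NURSERY and PYPY_GC_MAX are documented PyPy GC knobs; the values
# here mirror the defaults observed on this machine (assumptions, not
# measured requirements).
env = dict(os.environ)
env.setdefault("PYPY_GC_NURSERY", "4MB")   # young-generation (nursery) size
env.setdefault("PYPY_GC_MAX", "14GB")      # cap on the total heap size
```

The resulting `env` mapping would then be passed to the process launching PyPy.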
4.4 Hardware Events
As mentioned previously, the hardware events lead to a better understanding of the
different runtime environments. They will show the differences in behaviour between
the runtime environments and will bring out problems and bottlenecks. Measuring
hardware events can be accomplished with perf and PAPI. Perf works from the command
line, while PAPI is a library which can be used from within C code. Both are discussed
in detail below.
4.4.1 Perf
Using the perf list command, it is possible to list all available predefined events. I measured the most interesting ones. However it became clear that perf is not to be trusted
to give correct results for all events. To make sure all values are correct I compared the
results of the perf predefined events with the ones obtained by using raw events. These
raw events are found in the software developer's manual of the processor's manufacturer –
in my case the Intel manual [15]. This confirmed that perf did not return correct values
for the UNCORE events – in my case the last level cache events. Therefore I used the
raw event descriptors to get correct values. There were also certain results I could not
verify to be correct, because not all predefined perf events are listed in the Intel manual.
This leaves me with the following events, for which I definitely obtained the correct
values:
• cycles
• instructions
• branches
• branch-misses
• L1-dcache-loads
• L1-dcache-load-misses
• L1-dcache-prefetches
• L1-dcache-prefetch-misses
• L1-icache-loads
• L1-icache-load-misses
• UNC_L3_HITS.READ (raw event)
• UNC_L3_MISS.READ (raw event)
• UNC_L3_HITS.WRITE (raw event)
• UNC_L3_MISS.WRITE (raw event)
• dTLB-load-misses
• dTLB-store-misses
• iTLB-load-misses
I also made sure perf did not scale the events, by doing more iterations with fewer
events per iteration. When there are not enough hardware counters to measure the
supplied hardware events, perf measures each event for a smaller period and scales the
results accordingly. Using fewer hardware events stops perf from scaling the results;
therefore I measured fewer hardware events per iteration and ran the benchmark multiple
times to gather all counters without scaling.
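The grouping strategy can be sketched as follows. The `perf stat -e` syntax is real, but the benchmark command and the group size of four counters are illustrative assumptions about this particular machine:

```python
# Split the measured events into small groups so perf never has to
# multiplex hardware counters and scale the results; each group is then
# measured in a separate full run of the benchmark.
EVENTS = [
    "cycles", "instructions", "branches", "branch-misses",
    "L1-dcache-loads", "L1-dcache-load-misses",
    "L1-icache-loads", "L1-icache-load-misses",
]

def chunks(seq, n):
    # Yield consecutive slices of length n from seq.
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

def build_commands(benchmark_cmd, group_size=4):
    # One `perf stat` invocation per event group; because every run
    # measures the whole benchmark, no per-event scaling is needed.
    return [
        ["perf", "stat", "-e", ",".join(group)] + benchmark_cmd
        for group in chunks(EVENTS, group_size)
    ]

for cmd in build_commands(["pypy", "py_nbody.py", "500000"]):
    print(" ".join(cmd))
```

Each printed command would be run separately; the counts from the different runs are then combined into one result set.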
4.4.2 PAPI
PAPI does not allow the use of raw event descriptors, so the raw events could not be
encoded. PAPI also predefines multiple hardware events, however these were not
correct either. The perf events are also available from PAPI and I decided to use those.
I was able to verify their correctness by using PAPI and perf simultaneously. However,
there are no results for the level-3 cache. These are all the hardware events I measured
using PAPI:
• cycles
• instructions
• branches
• branch-misses
• L1-dcache-loads
• L1-dcache-load-misses
• L1-dcache-prefetches
• L1-dcache-prefetch-misses
• L1-icache-loads
• L1-icache-load-misses
• dTLB-load-misses
• dTLB-store-misses
• iTLB-load-misses
PAPI will not scale the events, so I was forced to measure multiple runs to get all
results.
4.5 Conclusion
There are seven runtime environments I have decided to benchmark:
• CPython
• PYC
• PyPy
• Cython
• C
• PyPyS (stable)
• PyPyNJ (without Just-In-Time Compiler)
The first two runtime environments are included to show the behaviour of the most
used Python interpreter. C has been included to compare with a different programming
language, one of the fastest currently in use. The last two environments have been added
to get a better understanding of PyPy's Just-In-Time compiler.
After going through the various benchmarking suites, I decided to use The Computer
Language Benchmarks Game, because the benchmarks take a considerable time and there
are benchmarks included for each of the most important domains. An added bonus is that
it easily allows comparison with other languages, most importantly C. This benchmarking
suite should give a representative overview of the different runtime environments.
All hardware events are measured using perf, except for the stable behaviour of PyPy,
which is measured with PAPI. I selected the events to measure so that the measurements
are correct and reveal the differences in behaviour between the runtime environments.
Chapter 5
Analysis Runtime Environments
After acquiring all results, it is time to dig deeper and compare the different
runtime environments. The goal of this chapter is to find the best performing
runtime environment, for which the most important results are the time measurements.
This is followed by a comparison between the various approaches to executing
Python code and C. However, first a more in-depth analysis is provided of a few specific
runtime environments.
5.1 Preliminary Comparison
A preliminary comparison will provide a better overview of the capabilities of the different
runtime environments. The pairwise distance calculation problem, as mentioned in
Section 3.3, is used for this purpose. The Python and Cython code can be found in
Listing 3.5 and Listing 3.6 respectively.
Figure 5.1 shows the time measurements for the different runtime environments for
1000 points. According to these results, CPython is actually the slowest runtime
environment, even though it is the most used one. Cython does not improve the speed when
no type information is supplied, even though it compiles the Python code to C and performs
some optimisations. It appears to optimise very little when the type information is not
available. Once the type information is added, a huge performance benefit is obtained.
PyPy is able to improve the performance drastically for this example, without any change
to the code. It appears that the Just-In-Time compiler is very effective.
The biggest difference is between Cython with and without type information. Cython
can improve the speed far more than PyPy, but it cannot do so without the extra
information.
5.1.1 Cython Type Information
There is a huge difference in execution time when adding type information to Cython.
An analysis of the generated C code gives a better understanding of the modifications
made when type information is available. The first difference is the size of the C file.
Without the type information the C file contains 2589 lines and is about 106 kilobytes,
while the C file generated by Cython with type information contains 15878 lines
and is about 580 kilobytes. These C files are compiled to libraries by Cython. The sizes
of the libraries are 95 and 486 kilobytes without and with type information respectively.
The files
Figure 5.1: Time measurements for the pairwise distance calculation problem with 1000
points (logarithmic time axis in seconds; PyPy, CPython, Cython, and Cython with type
info)
generated with type information are obviously a lot larger than without. The source code
helps explain this huge difference.
The initial part of the C code of the pairwise distance calculation problem is visible in
Listing 5.1 and the same part of the optimised version is shown in Listing 5.2. This clearly
shows the difference caused by adding the type information. Without the information,
PyObjects are used to perform the calculations, otherwise the specific types can be used.
These PyObjects contain the reference count, used by the garbage collector to free unused
memory, and a type pointer. The data of the variable is not stored in a PyObject, instead
a pointer to the value is stored. This means the value is ‘wrapped’. The array is replaced
by a __Pyx_memviewslice variable, which contains a pointer to the data and some extra
information like the size. The advantage of this type is that it is very easy to access
the data, as can be seen in Listing 5.3. The same access using a PyObject is much more
complex, because references must be followed and extra checks are required. This is shown
in Listing 5.4.
PyObject *__pyx_v_M = NULL;
PyObject *__pyx_v_N = NULL;
PyObject *__pyx_v_D = NULL;
PyObject *__pyx_v_i = NULL;
PyObject *__pyx_v_j = NULL;
PyObject *__pyx_v_d = NULL;
PyObject *__pyx_v_k = NULL;
PyObject *__pyx_v_tmp = NULL;

Listing 5.1: initial part of the C code generated by Cython for the pairwise distance
calculation problem

int __pyx_v_M;
int __pyx_v_N;
__Pyx_memviewslice __pyx_v_D = { 0, 0, { 0 }, { 0 }, { 0 } };
int __pyx_v_i;
int __pyx_v_j;
double __pyx_v_d;
int __pyx_v_k;
double __pyx_v_tmp;

Listing 5.2: initial part of the C code generated by Cython for the pairwise distance
calculation problem with optimised Python code
__pyx_v_M = (__pyx_v_X.shape[0]);

Listing 5.3: access an element of a __Pyx_memviewslice structure
__pyx_t_1 = __Pyx_PyObject_GetAttrStr(__pyx_v_X, __pyx_n_s__shape);
if (unlikely(!__pyx_t_1)) {
    __pyx_filename = __pyx_f[0];
    __pyx_lineno = 4;
    __pyx_clineno = __LINE__;
    goto __pyx_L1_error;
}
__Pyx_GOTREF(__pyx_t_1);
__pyx_t_2 = __Pyx_GetItemInt(__pyx_t_1, 0, sizeof(long), PyInt_FromLong, 0, 0, 1);
if (!__pyx_t_2) {
    __pyx_filename = __pyx_f[0];
    __pyx_lineno = 4;
    __pyx_clineno = __LINE__;
    goto __pyx_L1_error;
}
__Pyx_GOTREF(__pyx_t_2);
__Pyx_DECREF(__pyx_t_1);
__pyx_t_1 = 0;
__pyx_v_M = __pyx_t_2;
__pyx_t_2 = 0;

Listing 5.4: access an element of a PyObject structure
This same problem occurs with the for loop, however the code becomes even more
complicated without the type information. Since the type is not known, code is added
for lists and tuples as well. It takes over forty lines of C code to get the correct element,
while the same task is accomplished in three lines of code with the type information. The
reference counting increases the complexity of the code and results in a performance loss.
All these things are related to the time difference, but they also explain the difference in
size. For each type, operations are defined and the code which performs the operations is
added as well. Since the version without type information only uses one type, PyObject,
not much other code is necessary. The version with type information needs a lot of
functions to perform those operations. A few operations for the memoryview object are
shown in Listing 5.5. Often those operations are implemented for the best-case scenario,
for example by ignoring overlap for array operations; extra code, together with checks,
is added in case this assumption is wrong. This also provides better exception management
with more specific errors.
static PyObject *__pyx_memoryview_transpose(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview__get__base(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview_get_shape(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview_get_strides(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview_get_suboffsets(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview_get_ndim(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview_get_itemsize(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview_get_nbytes(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview_get_size(PyObject *__pyx_v_self);

Listing 5.5: A few operations which can be used on the memoryview object
The basic idea behind Cython is to focus on generating code which is as good as
possible. There is no concern about the size of the code. After the code generation, the C
compiler will optimise even more, and during the compilation all unnecessary code will be
removed. Inlining will improve the instruction locality and enable more optimisations.
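The reference-counting overhead discussed earlier in this section can be observed directly from CPython itself. This is a small illustration, not part of the thesis experiments; `sys.getrefcount` is CPython-specific and its absolute value includes temporary references:

```python
import sys

# Every PyObject carries a reference count; binding another name to the
# same object bumps it, and the count must be maintained on every
# assignment, argument pass, and scope exit.
x = object()
before = sys.getrefcount(x)
alias = x                      # one extra reference to the same object
after = sys.getrefcount(x)
```

After the aliasing assignment, `after` is exactly one higher than `before`, which is the bookkeeping the generated C code in Listing 5.4 performs with `__Pyx_GOTREF` and `__Pyx_DECREF`.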
5.1.2 Type Guessing
In Section 5.1.1, the influence of type information was investigated. A huge speed-up
was obtained when type information was added; without it, however, Cython is not useful.
Since it is difficult for a lot of users to add the information, a different approach is
necessary. A possibility to resolve this issue is to guess the type information. This step
could be added before the compilation of Python code to C.
As mentioned in Section 2.2, not all Python code is static. The lazy loading of modules
results in dynamic code. However, this should not pose problems for guessing the type
information. The real problems are related to the fact that a variable can have multiple
different types during its existence. This could be solved by creating new variables instead.
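The variable-splitting idea can be sketched in a few lines; the function and variable names are purely illustrative:

```python
# A variable whose type changes during its lifetime defeats static typing:
def untyped():
    x = 10          # x is an int here
    x = str(x)      # ...and a str here; no single C type fits x
    return x

# Splitting it into fresh single-type variables (an SSA-style rewrite)
# gives each name exactly one type that a guesser could infer:
def typed():
    x_int = 10
    x_str = str(x_int)
    return x_str
```

Both functions compute the same value; only the second admits a straightforward type assignment per variable.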
However, the main problem for this approach is the complexity of the Python types.
The combination of tuples, arrays and lists can cause incredibly complex type definitions,
yet they are used very often in Python because of the flexibility of the for loop. These
have to be translated to types closer to C. Since this is a very difficult task even when
done manually, automating it seems almost impossible. Therefore it is necessary that
the Python code is rewritten to be more similar to C. This means that no complex types
and type combinations should be used, which reduces the ease of programming in Python.
Of course, not all Python code needs to be optimised in order to obtain a decent
performance gain. However, these complex structures are most often used in loops,
meaning they are the ones that need type information.
Type guessing in combination with Cython has not been tried before. It could yield
huge speed-ups; however, guessing the types of Python variables might be too complex.
The problem is related to the combination of lists, tuples, dictionaries, etc.
5.2 PyPy Beats C
First I would like to mention that this is a very specific example; by no means do I imply
that PyPy is actually faster than C in general. The intention is to show the benefit and
the power of PyPy and Just-In-Time compilation.
This crafted example performs a string append with two integers in a loop. Both
integers are simply the same loop variable. The Python code is included in Listing 5.6.
A new variable is created for the string append every iteration. There are two versions
for C. In Listing 5.7 the resulting string is stored in the same variable (on the stack) each
time. In Listing 5.8 a new variable is created and freed each iteration. Since PyPy has
a garbage collector, the variable will not be freed each iteration. However the behaviour
should be closer to the latter C version.
for i in xrange(10000000):
    "%d %d" % (i, i)

Listing 5.6: Python string append

#include <stdio.h>
#include <stdlib.h>

int main() {
    int i = 0;
    char x[44];
    for (i = 0; i < 10000000; i++) {
        sprintf(x, "%d %d", i, i);
    }
}

Listing 5.7: C string append using the stack

#include <stdio.h>
#include <stdlib.h>

int main() {
    int i = 0;
    for (i = 0; i < 10000000; i++) {
        char *x = malloc(44 * sizeof(char));
        sprintf(x, "%d %d", i, i);
        free(x);
    }
}

Listing 5.8: C string append using the heap
Table 5.1 contains the time measurements. CPython has been added to show that it
is really PyPy doing the work and not the Python code. I have made certain that the
string append operation has not been optimised away by PyPy. We notice that PyPy is
even twice as fast as the quickest C implementation and almost three times as fast as the
C implementation with the variable created on the heap.
PyPy’s Just-In-Time compiler, which works on traces at runtime, can inline and unroll
the string append operation. The string append is a very generic function, but because
of the inlining, specialization can be applied on the arguments. GCC on the other hand
is not able to do this, because the sprintf call sits in libc. This means the generic
function has to be called each time the sprintf is called, which results in much slower
performance.
This example truly shows the power of Just-In-Time compilation, namely all code can
be seen at runtime and optimisations have as much information available as possible.
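A minimal way to reproduce the comparison on any interpreter is to time the loop from Listing 5.6 with the standard timeit module. The iteration count here is reduced for a quick check, the loop uses range rather than the Python 2 xrange of the listing, and the absolute numbers depend entirely on the machine:

```python
import timeit

# Time the string-append loop from Listing 5.6; running this same script
# under CPython and under PyPy reproduces the comparison in Table 5.1
# (with a smaller n, so the check finishes quickly).
def string_append(n=100000):
    for i in range(n):
        "%d %d" % (i, i)

elapsed = timeit.timeit(string_append, number=1)
print("%.3f s" % elapsed)
```

For stable numbers one would raise `n` back to the listing's ten million and repeat the measurement several times.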
5.3 Time Measurements
Figure 5.2 contains the time of the various runtime environments, divided by the time
of CPython. This figure clearly shows that C is still the quickest, as expected. Both
CPython and Cython appear to be very slow. This means that the most used Python
runtime environment – CPython – is actually one of the slowest available. The Cython
runtime environment compiles the Python code to C, which means we would expect
behaviour much like C code. This is clearly not the case. Section 3.3 mentioned that
Cython needs type information to improve the performance of the Python code. The lack
of type information results in very poor performance. While PyPy is not able to beat
C, its performance is very good for CPU-intensive benchmarks. Memory problems cause
a smaller benefit for I/O- and memory-intensive benchmarks. For k-nucleotide the
performance is even worse than CPython's. This is analysed in Section 5.4 with hardware
events.
Figure 5.2: Time comparison between the runtime environments normalised to CPython
(bars for C, Cython, CPython and PyPy; benchmarks grouped into CPU-, I/O- and
memory-intensive)
Table 5.1: Time measurements of the string append problem

runtime     time (s)   remark
CPython     15.087     the default Python interpreter (using the same code as PyPy)
PyPy         1.187
C (stack)    2.709     the result is stored in the same variable
C (heap)     3.115     the result is stored in a new variable each iteration
5.4 Hardware Events
As discussed in Section 4.4, hardware events were measured to get a better understanding
of the runtime environments and the differences between them. All events are divided by
the number of instructions, to scale the numbers appropriately and obtain a fair comparison
between runtime environments. Otherwise, a runtime which takes a lot of time will
logically result in, for example, a higher number of CPU cycles. Note that the benchmarks
will always have an influence on the results. A memory-intensive benchmark will have more
data cache loads per instruction than, for example, a CPU-intensive benchmark. Therefore
it is important to compare the results per domain.
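The normalisation is a straightforward division; a minimal sketch follows, with illustrative (not measured) counts:

```python
# Normalise raw hardware-event counts by the instruction count, so that
# runtimes with very different run lengths can be compared fairly.
raw_counts = {                 # illustrative numbers, not measured values
    "instructions": 2000000,
    "cycles": 3100000,
    "branch-misses": 40000,
}

def per_instruction(counts):
    # Divide every event by the instruction count, dropping the
    # "instructions" entry itself (it would always be 1.0).
    instr = counts["instructions"]
    return {
        event: value / instr
        for event, value in counts.items()
        if event != "instructions"
    }
```

With the counts above, `per_instruction(raw_counts)` yields 1.55 cycles and 0.02 branch misses per instruction.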
5.4.1 Cycles Per Instruction
Figure 5.3 contains the number of cycles per instruction obtained for the different runtime
environments for each benchmark. C tends to score very high, because it uses more
complex machine instructions, which take multiple cycles. Another reason for the
high values is that C will not stall a lot, because there are fewer branches, as can be
seen in Section 5.4.2. This is especially clear for the CPU-intensive benchmarks. The high
peak for k-nucleotide can be explained by the large number of memory operations going to
the last level cache. These increase the number of cycles per instruction. The values for
fasta-redux and binary-trees are lower, even though these also use a lot of memory.
However, there are fewer accesses to the last level cache. The results for the last level
cache are explained in more detail in Section 5.4.5 and Section 5.4.6.
Since Cython also compiles the code to C, a similar result is expected. This is however
not the case. Cython has very consistent values, which means that the type of benchmark
does not have a considerable influence on the results. The explanation for this low result
can be found in the fact that there is no type information available at compile time, and
thus it is not possible to optimise the code efficiently. This means it is not possible to use
the more complex instructions and there are more branches, as explained in Section 5.4.2,
disrupting the flow of execution. This leads to a lower number of cycles per instruction.
Note that the --embed option was used to generate the executable. This generates an
appropriate main method to start the execution of the Python script. This method makes
sure the code compiled with Cython can be used. It was however not used for the
pairwise distance calculation problem, shown in Section 5.1, which means that this does
not influence the result very much. Instead the answer is found in the wrapping of the
values.
CPython follows the behaviour of Cython very closely, yet it has the interpretation
overhead. The virtual machine causes exactly the same problems, namely it needs
to jump to the correct code, which leads to a larger number of branch misses. Since
everything is executed at runtime, no complex assembly instructions will be used either.
This results in almost exactly the same behaviour as for Cython.
PyPy uses a lot more cycles per instruction for the I/O- and memory-intensive
benchmarks. This is related to the memory operations, which take multiple cycles. The
peak for binary-trees is explained by the memory problems. The Just-In-Time compiler
has to store data to decide which code is 'hot', and the compiled code needs to be
stored as well. The binary-trees benchmark already uses a lot of memory. This causes a
huge number of store misses in the last level cache, as explained in Section 5.4.6.
Figure 5.3: The number of cycles per instruction
5.4.2 Branch Behaviour
The branch behaviour can be seen in Figure 5.4 and Figure 5.5. The results show that C
has fewer branch instructions than the other environments. This is particularly clear for
the spectral-norm and n-body benchmarks. This result was expected, since C does not have
the interpretation cost. Only the k-nucleotide benchmark stands out. This behaviour can
be explained by the algorithm used in the benchmark: it is necessary to use a lot of if
conditions for the code to work correctly. The branch misses per instruction are very low;
however, fasta-redux is an exception. This is because of the random initialisation of the
lookup table.
Cython has slightly lower values than CPython. As mentioned previously, Cython
really needs type information, which was not given, to improve the Python code. This
means the generated C code is not optimised, but more importantly the types are not
known. Cython will generate code to wrap the Python values; no operations will be
performed on the values directly. This means that each operation needs to be abstracted,
which causes the higher number of branches. The type of benchmark has no influence on
this behaviour.
CPython has constant results for the different benchmarks. Just like for Cython, the
type of benchmark does not have a big influence. This is the result of the Python virtual
machine: at runtime each instruction is executed individually, which means a branch is
necessary.
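Why interpretation implies at least one branch per executed instruction can be illustrated with a toy dispatch loop; the opcodes and handlers below are invented for illustration and are far simpler than CPython's actual bytecode:

```python
# Toy bytecode interpreter: every instruction is dispatched through an
# indirect jump (here, a dict lookup plus a call), so the hardware sees
# at least one hard-to-predict branch per interpreted instruction.
def run(program):
    stack = []
    handlers = {
        "PUSH": lambda arg: stack.append(arg),
        "ADD":  lambda _: stack.append(stack.pop() + stack.pop()),
        "MUL":  lambda _: stack.append(stack.pop() * stack.pop()),
    }
    for opcode, arg in program:
        handlers[opcode](arg)   # the per-instruction indirect branch
    return stack.pop()

# Computes (2 + 3) * 4
program = [("PUSH", 2), ("PUSH", 3), ("ADD", None),
           ("PUSH", 4), ("MUL", None)]
result = run(program)
```

Compiled code performs the additions and multiplications directly, without this dispatch, which is why C shows far fewer branches per instruction.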
PyPy behaves closer to the other Python runtime environments than to C. However
it shows a very strange behaviour, caused by the Just-In-Time compiler. It is explained
more elaborately in Section 6.5.2.2.
Figure 5.4: The number of branches per instruction
Figure 5.5: The number of branch misses per instruction
5.4.3 Level-1 Instruction Cache Behaviour
Figure 5.6 and Figure 5.7 contain the level-1 instruction cache loads and the level-1
instruction cache load misses, both per instruction. We notice that for C, the level-1
instruction cache load misses are influenced greatly by the benchmark. The results are
very high for fannkuch-redux while very low for spectral-norm. The fannkuch-redux
benchmark is executed almost entirely in a single method, while spectral-norm uses a lot
of different methods, resulting in worse code locality. This causes the huge difference.
The other values are more similar to each other. C's results are much lower than those
of the other runtime environments, because their code locality is worse.
Again Cython does not follow the behaviour of C. The values for Cython are almost
the same; the benchmarks have almost no influence on the instruction cache load misses
in the first level cache. Furthermore, Cython has a huge number of instruction cache load
misses. This can again be explained by the wrapping of the values. The operations on the
values have to be executed by calling the appropriate method; this means code has to be
loaded from a lot of different places, which explains the high number of instruction cache
load misses per instruction.
The same observation is made for CPython. Apart from k-nucleotide, which has fewer
level-1 instruction cache loads per instruction, the values are very similar. CPython does
not perform optimally when reading from a file. This problem does not occur when writing
to a device; there CPython has a high number of level-1 instruction cache loads.
PyPy is the only runtime, apart from C, which shows some variation in the obtained
values. The differences are not as large as for C and are inherent to the benchmarks. The
instruction cache load misses are higher than for CPython, because of the added cost of
the Just-In-Time compiler. Furthermore, guards are added so the trace can be followed;
however, these can result in more instruction cache load misses per instruction when they
fail. This is the case for binary-trees.
The level-1 instruction cache load misses are very high for each runtime environment
for the fasta-redux benchmark, because of the random generation of the lookup table.
Figure 5.6: The number of level-1 instruction cache loads per instruction
Figure 5.7: The number of level-1 instruction cache load misses per instruction
5.4.4 Level-1 Data Cache Behaviour
The level-1 data cache behaviour is captured in Figures 5.8 and 5.9. Again we notice that
the influence of each benchmark is very large for C. We also notice, not surprisingly, that
the number of misses is high for the I/O- and memory-intensive benchmarks. However, the
misses are considerably lower for the outputting benchmark, fasta-redux. Since
spectral-norm uses no memory, it makes sense that C has very low values there. The n-body
benchmark causes about as many load operations as k-nucleotide and binary-trees, because
it uses data stored in memory to do the calculations.
Cython clearly does not follow the same behaviour as C. The benchmark influence is
minimal; only the n-body, k-nucleotide and binary-trees benchmarks have a slightly higher
number of load operations per instruction. These are of course the benchmarks making
the most use of the memory.
CPython shows an even steadier behaviour. There is barely any difference in
the number of load operations per instruction measured over the different benchmarks,
because every value needs to be loaded from memory, since an interpreter is used. Due to
the extra overhead of the interpretation, CPython also has more misses.
PyPy clearly shows that spectral-norm is entirely CPU-intensive. The other benchmarks
show very little variation in the data cache load operations per instruction. However,
the misses show clearly that PyPy is having some issues with I/O. The cause for this
problem is of course the interpretation and the Just-In-Time compiler. Both need to use
memory, which can then not be used to contain application data. This leads to a higher
number of load misses per instruction.
Figure 5.8: The number of level-1 data cache loads per instruction
Figure 5.9: The number of level-1 data cache load misses per instruction
5.4.5 Last Level Cache Load Behaviour
Figure 5.10 and Figure 5.11 contain the load operations and load misses for the last level cache. The only clear results for C are for the I/O- and memory-intensive benchmarks. C has especially high values for the k-nucleotide benchmark, while this is not reflected in the misses. This peak is caused by storing the data from a file and the added load operations for performing the sort. The other I/O- and memory-intensive benchmarks also lead to a higher number of loads from the last level cache.
Almost all of Cython's load operations miss in the last level cache. The values are
considerably higher than those of CPython, which can be explained by the fact that
CPython has a memory manager. This manager improves the data locality and can also
cache data.
CPython also has low values for the I/O-intensive benchmarks. For the k-nucleotide benchmark, however, a higher number of last level cache load operations per instruction would be better. CPython has a very low value there, which is caused by the interpretation overhead: every read operation is followed by executing the next Python operation code in the virtual machine, which lowers the number of load operations per instruction. We notice that CPython does not miss a lot in the last level cache, because a lot of instructions are used to keep the interpreter going. This means that the number of instructions increases, which causes the misses per instruction to decrease.
Of the Python environments, PyPy has the highest number of last level cache load operations per instruction for the k-nucleotide benchmark. PyPy follows the behaviour of C for the I/O- and memory-intensive benchmarks, which means PyPy behaves well for the last level cache load behaviour. The misses, however, show some real issues. The biggest problems seem to lie with fannkuch-redux and n-body, even though those are CPU-intensive benchmarks. This is caused by the added overhead of the interpreter and the Just-In-Time compiler, which is very active for the CPU-intensive benchmarks. The spectral-norm benchmark uses almost no memory at all, so there are no problems for that benchmark.
Figure 5.10: The number of last level cache loads per instruction
Figure 5.11: The number of last level cache load misses per instruction
5.4.6 Last Level Cache Store Behaviour
The last level cache store behaviour is captured in Figures 5.12 and 5.13. Again we notice that the highest values for C are obtained for the I/O- and memory-intensive benchmarks. C has a peak for the k-nucleotide benchmark because of the sort operation, which causes a lot of the data to be stored again. We also notice that the highest misses occur for the k-nucleotide and binary-trees benchmarks, simply because those benchmarks store the most data.
Cython does not show any interesting behaviour. The misses are very similar over the different benchmarks. Again it follows the behaviour of CPython very closely.
CPython behaves in much the same way as Cython. The values for the store operations per instruction are very similar for each benchmark, and the misses per instruction are very stable over the different benchmarks. The type of benchmark does not influence the behaviour.
The store operations per instruction are a lot higher for PyPy for the k-nucleotide and binary-trees benchmarks. These are of course the benchmarks that use the most memory. The misses are very similar for fannkuch-redux, n-body and fasta-redux.
Figure 5.12: The number of last level cache stores per instruction
Figure 5.13: The number of last level cache store misses per instruction
5.4.7 Translation Lookaside Buffers
The misses for translation lookaside buffers are shown in Figure 5.14, Figure 5.15 and
Figure 5.16. They confirm the behaviour of the instruction and data cache misses.
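Event counts like the ones discussed in this chapter can be gathered on Linux with the perf tool. The invocation below is a sketch: the exact event names vary per CPU model, and the benchmark script name and argument are illustrative, not the thesis setup.

```shell
# Count cache and TLB events alongside retired instructions; dividing each
# event count by the instruction count gives the per-instruction rates.
perf stat -e instructions,L1-icache-load-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,iTLB-load-misses,dTLB-load-misses,dTLB-store-misses \
    python fannkuch-redux.py 11
```
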
Figure 5.14: The number of instruction translation lookaside buffer load misses per instruction
Figure 5.15: The number of data translation lookaside buffer load misses per instruction
Figure 5.16: The number of data translation lookaside buffer store misses per instruction
5.5 Multi-threaded Applications
Remember that the two benchmarks using multiple cores are k-nucleotide and binary-trees. To compare the effectiveness of the threading, both benchmarks have also been executed without threading enabled. To accomplish this, it was necessary to modify the source code. Both Python benchmarks use a threading pool and call a map function on that pool. I have only removed the threading pool; the map call remained, although it is of course no longer the pool's map function. For the C code I either put pthread_join after each create or returned 1 for the number of cores. A more in-depth description of the modifications necessary to remove the threading is given in Appendix B.
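The Python-side modification can be sketched as follows. This is a minimal illustration with a made-up placeholder kernel, not the actual benchmark code: dropping the pool turns the parallel map into the sequential builtin map, while the surrounding call structure stays the same.

```python
from multiprocessing import Pool

def work(chunk):
    # Placeholder for the per-chunk benchmark kernel.
    return sum(chunk)

chunks = [[1, 2], [3, 4], [5, 6]]

# Threaded variant: the pool's map distributes the chunks over subprocesses.
with Pool(processes=2) as pool:
    threaded = pool.map(work, chunks)

# De-threaded variant: remove the pool but keep the map call -- it is now
# the sequential builtin map, and the rest of the code is unchanged.
sequential = list(map(work, chunks))

print(threaded, sequential)
```
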
It is customary to calculate the parallel speed-up Sp and efficiency Ep with p the
number of processes, in order to evaluate the behaviour of multi-threaded applications.
They are calculated as follows:
Sp = T1 / Tp
Ep = Sp / p
with Tp the total execution time with p processes.
This means that the efficiency is one only if the code has perfect parallelism. This is, however, never achievable, because there is always a small fraction which has to be executed sequentially. Note that the algorithm influences these measurements as well: some algorithms have larger sequential fractions than others, resulting in a lower efficiency. The efficiency ranges between zero and one, with numbers closer to one being better.
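In code, the two metrics are a one-liner each. The 80-second and 25-second timings below are made-up example numbers, not measurements from this thesis.

```python
def parallel_metrics(t1, tp, p):
    """Return the speed-up Sp = T1/Tp and efficiency Ep = Sp/p."""
    sp = t1 / tp
    return sp, sp / p

# Hypothetical example: 80 s sequentially, 25 s with 8 processes.
s8, e8 = parallel_metrics(80.0, 25.0, 8)
print(s8, e8)
```
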
The results are shown in Table 5.2 with the number of processes equal to eight, since
this is the number of cores available on the test machine. Therefore the ideal speed-up
would be eight.
To make a distinction between the runtime environments with and without threading
in the graphs shown below, ‘NT’ (no threading) is added to the name of the runtime
environment to indicate the behaviour without threading.
5.5.1 C
Very poor performance results are obtained for C in Table 5.2. It would be reasonable to assume this is due to the short duration of the benchmarks, because a short duration means that the sequential part is relatively larger than the parallel part. Therefore I have increased the arguments, which shows that the poor performance is not related to the duration of the benchmarks. The time measurements for C relative to the time with threading are shown in Figure 5.17. The performance clearly depends on the algorithm: it is easier to execute the binary-trees benchmark concurrently than the k-nucleotide benchmark. However, it is not possible to get a decent efficiency for either algorithm, which means that both benchmarks have a considerable sequential section.
Figure 5.17: Time measurements for C normalised to the time with threading
Table 5.2: The parallel speed-up and efficiency with eight processes for the different runtime environments

             k-nucleotide        binary-trees
runtime      S8       E8         S8       E8
C            1.498    0.187      3.481    0.435
Cython       3.930    0.491      5.679    0.710
CPython      4.137    0.517      6.397    0.800
PyPy         1.929    0.241      1.996    0.250
5.5.2 Cython
Cython does not follow C in its multi-threaded behaviour either. A much larger speed-up is obtained for both benchmarks. Figure 5.18 gives the time measurements for Cython relative to the time with threading. It shows that, again, binary-trees makes better use of multiple cores.
Figure 5.18: Time measurements for Cython normalised to the time with threading
The inefficiencies are divided over multiple cores, and because it is easy to read the wrapped values in parallel, this results in better performance. It leads to a larger speed benefit from using multiple cores. Again the behaviour is very similar to CPython's.
5.5.3 CPython
Figure 5.19 compares the time measurements with and without threading for both benchmarks running on top of the CPython runtime environment. It shows that a huge speed-up is obtained by using the multiprocessing module.
In Section 3.1.3, the different approaches to multi-threaded applications have been discussed. The conclusion was that the normal approach to threading leads to very poor performance. However, a module called multiprocessing has been created to avoid this problem by spawning subprocesses; the added cost of spawning the subprocesses is its biggest disadvantage. Both threaded benchmarks use this module and obtain very nice speed-ups. This means that while CPython has issues because of the Global Interpreter Lock, multi-threaded applications can definitely be written in Python. The multiprocessing module provides an alternative implementation which successfully bypasses that lock. Applications written using that module can get a very good speed-up compared to a single-threaded variant.
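The mechanism can be sketched in a few lines (this is an illustration of the principle, not the benchmark code): Pool.map hands the work items to separate worker processes, each with its own interpreter and therefore its own Global Interpreter Lock.

```python
import os
from multiprocessing import Pool

def worker_pid(_):
    # Each task reports the process it actually ran in.
    return os.getpid()

with Pool(processes=4) as pool:
    pids = pool.map(worker_pid, range(8))

# Every task ran in a subprocess rather than in the parent interpreter,
# so the parent's Global Interpreter Lock never serialised the work.
print(len(set(pids)), "worker process(es) used")
```
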
Figure 5.19: Time measurements for CPython normalised to the time with threading
5.5.4 PyPy
Figure 5.20 shows the time difference between PyPy with and without threading. We notice that the speed-up is not as high as CPython's. Without threading, PyPy is a lot faster than CPython; with threading, the difference is a lot smaller. CPython is even faster than PyPy for k-nucleotide when running with multiple threads.
Figure 5.20: Time measurements for PyPy normalised to the time with threading
The Just-In-Time compiler is the most important influence for PyPy; it already improves the performance of PyPy substantially, as investigated in Chapter 6. The multi-threaded approach adds a lot of complications. In each subprocess a new interpreter is launched, which means that the Just-In-Time compiler only sees the operations passing on a single core. The most important job of a Just-In-Time compiler is to compile as soon as possible, but this is not possible because of the approach of the multiprocessing module. On each core the Just-In-Time compiler has to compile the code, which is only partially available there. This leads to a much lower performance gain than that obtained by a plain interpreter like CPython.
5.6 PYC
In Section 4.2, it was mentioned that the PYC runtime environment would also be benchmarked. The approach is the same as CPython's; however, a previously generated file containing the Python bytecode is used instead of generating the Python bytecode at runtime, thus eliminating the loading overhead. Yet it has not been analysed in the previous comparison. The results for the PYC runtime environment are very similar to the ones obtained for CPython. The time difference is shown in Figure 5.21; the hardware results are almost identical as well. This confirms the suspicion that avoiding the bytecode compilation step does not have a huge influence on the total execution time. Since there is barely any difference, I did not include a full comparison of PYC with the other runtime environments.
Figure 5.21: Time comparison between the PYC and CPython runtime environments,
normalised to CPython
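The idea behind PYC can be reproduced with the standard py_compile module; the file names below are made up for the example. The bytecode is generated once, ahead of time, and CPython can then be pointed at the .pyc file directly.

```python
import pathlib
import py_compile

# Write a tiny script, then byte-compile it ahead of time.
src = pathlib.Path("demo_script.py")
src.write_text("print('hello from precompiled bytecode')\n")

# cfile pins the output location; running `python demo_script.pyc` afterwards
# skips the source-to-bytecode compilation step at start-up.
pyc = py_compile.compile(str(src), cfile="demo_script.pyc")
print(pyc)
```
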
5.7 Conclusion
C is of course performing the best. The best performing Python runtime environment,
without making any modifications to the source code, is PyPy. This has been proven by
the time measurements.
The hardware events have shown that the Python runtime environments execute more branches than C. In general, the type of benchmark has a much bigger influence on C than on the other runtime environments.
CPython is often the slowest runtime environment, while being the most used one. For smaller scripts this is fine; however, once performance becomes important, it is clearly not a desirable choice.
Cython can improve the performance drastically, as shown in Section 5.1, but type information is necessary. Without the type information it sometimes performs worse than CPython. This is caused by the wrapping of the values, which leads to a high number of instruction cache misses and load misses in the last level cache. The store misses in the last level cache are also considerably large.
Both the instruction cache misses and the data load misses in the first level cache are elevated for PyPy, and often the last level cache misses are large as well. Since PyPy shows the most promise, it has received the main focus in my research. In Chapter 6, a more in-depth analysis is performed specifically for PyPy.
For C, the measured hardware events show that the type of benchmark greatly influences the events. This behaviour is also noticed for PyPy, albeit to a lesser extent; it is of course caused by the JIT, because the code is compiled to assembly. A similar behaviour would be expected for Cython, but this is not the case: without type information, its behaviour is much more similar to CPython's. The type of benchmark has very little influence on the hardware events for CPython and Cython.
The problems with the Global Interpreter Lock are resolved by the multiprocessing module. CPython gets a very decent speed-up using this module; the overhead of spawning the subprocesses and launching an interpreter on each core is relatively small. However, the same speed-ups are not obtained for PyPy, because the Just-In-Time compiler does not work optimally when multiple interpreters are used on different cores. It is not possible to share the compiled code between the different subprocesses.
Chapter 6
Analysis PyPy
Since PyPy is the best performing Python runtime environment when no extra information is added, I have mainly focused on analysing this runtime environment. There are already some tools available to examine the influence of the Just-In-Time compiler. I also checked the behaviour of the garbage collector. However, to evaluate this behaviour, the hardware events and time measurements do not provide enough information. First, some tools are discussed to get a better understanding of PyPy.
6.1 Translatorshell
This tool provides a better understanding of the internals of PyPy itself and of how the PyPy executable is created. The translatorshell is a Python script which is included in PyPy itself. It enables the user to access the code which is used for the interpretation and Just-In-Time compilation. The best way to explain how it works is by using an example. A lot of example code fragments are included with the PyPy source code; one of them is a method which searches for a perfect number¹. The Python code to verify if a number is perfect can be written like this:
def is_perfect_number(n=int):
    div = 1
    sum = 0
    while div < n:
        if n % div == 0:
            sum += div
        div += 1
    return n == sum
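As a quick sanity check, the snippet also runs as plain Python: the n=int default is only meaningful to the RPython toolchain and is simply overridden by the actual argument here.

```python
def is_perfect_number(n=int):
    div = 1
    sum = 0
    while div < n:
        if n % div == 0:
            sum += div
        div += 1
    return n == sum

# The only perfect numbers below 100 are 6 and 28.
print([n for n in range(1, 100) if is_perfect_number(n)])  # [6, 28]
```
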
The first insight this tool provides is to allow the user to view the flow graph. This is
accomplished by creating a Translation object and calling the view method:
>>> t = Translation(snippet.is_perfect_number, [int])
>>> t.view()
Note that the is_perfect_number function is located in a module called snippet and that it has one argument of the type 'int'. It has been mentioned previously that Python is dynamically typed, yet it is necessary to declare the type here, because we are using code internal to PyPy. It is not necessary to declare the types inside Python programs; internally, however, PyPy will decide on types for each variable. Calling the view method will show a graph similar to the one in Figure 6.1.

¹ A perfect number is a positive integer which is equal to the sum of its proper positive divisors, excluding the number itself. For example, 28 is a perfect number because 28 = 1 + 2 + 4 + 7 + 14, and 28 is only divisible by 1, 2, 4, 7 and 14.
It is also possible to view the graph after the annotation phase. The user just needs
to execute the following instructions:
>>> t.annotate()
>>> t.view()
The obtained graph is shown in Figure 6.2. It is clear that a lot of code has been added by the RPython toolchain. The graph of the is_perfect_number function is still visible at the left.
The rtyping phase is the last phase after which the graph can be viewed. It can be
generated using the following instructions:
>>> t.rtype()
>>> t.view()
>>> t.compile_c()
The last instruction will compile it to a C library. The generated graph is included
in Figure 6.3 and clearly shows that a lot of information is added to create the PyPy
executable. As mentioned in Section 3.2, the garbage collector and Just-In-Time compiler
are not yet added to this graph.
6.2 Hooks
A hook allows the programmer either to investigate the behaviour of a module or to react on a certain event. PyPy provides the user with various hooks. The ones concerning the Just-In-Time compiler are the most relevant. They are available to the user by importing the pypyjit module from a Python script. The following hooks are provided:
optimise hook: called each time a loop is optimised (before assembler compilation). It allows the programmer to view the Python bytecode which will be compiled and modify it as well. The argument it gets passed is a JitLoopInfo object.
compile hook: called each time a loop is compiled; it is not reentrant. Again the argument passed to it is a JitLoopInfo object.
abort hook: called each time tracing is aborted. It gets passed a driver, a greenkey, the abort reason and a list of operations.
The JitLoopInfo object contains the following information:
driver: name of the JIT driver
greenkey: representation of the place where the loop was compiled
operations: list of operations in the loop
Figure 6.1: The flow graph for the is_perfect_number method
Figure 6.2: The flow graph for the is_perfect_number method after the annotate phase
Figure 6.3: The flow graph for the is_perfect_number method after the rtyping phase
loop_no: loop cardinal number
bridge_no: bridge number (if it is a bridge)
type: loop type (either 'bridge' or 'loop')
asmaddr: address of the location of the machine code
asmlen: length of the machine code
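Installing the compile hook can be sketched as follows. It only has an effect under PyPy; the guarded import lets the same script run, without counting anything, under CPython, and the loop body is just a made-up hot function.

```python
compiled = []

def on_compile(info):
    # info is a JitLoopInfo object; record whether a loop or a bridge
    # was just compiled.
    compiled.append(info.type)

try:
    import pypyjit
    pypyjit.set_compile_hook(on_compile)
except ImportError:
    pass  # not running under PyPy

def hot(n):
    total = 0
    for i in range(n):
        total += i
    return total

hot(10_000_000)  # hot enough to trigger the JIT under PyPy
print(len(compiled), "compilation event(s) observed")
```
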
After experimenting with the hooks, I noticed they are not to be trusted for counting the number of Just-In-Time compilations. While they are useful, mostly for quickly testing code transformations, they did not serve me for the benchmarking and analysis of PyPy. However, I did use them to verify the correctness of my harness for the stable behaviour.
6.3 JIT Viewer
The PyPy developers have already created a tool to investigate the behaviour of the Just-In-Time compiler: a web application called the 'jitviewer', which can be installed locally. First it is necessary to run the Python script and let it write log information to a file. Next, the log file should be passed to the application. This tool allows the user to see the Python source code, the Python bytecode, an intermediate representation of the code of the Just-In-Time compiler and also the generated assembly. Figure 6.4 shows the jitviewer in action for the spectral-norm benchmark.
A problem I noticed with this viewer is that the correct output is not written to the log file when the Python script uses the multiprocessing module, which means that the jitviewer does not work correctly in that case. Nevertheless, I was able to use this tool to verify the correctness of the harness for the stable behaviour, and it provides the easiest way to view the traces compiled by the Just-In-Time compiler.
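The workflow looks roughly like this (a sketch: the PYPYLOG categories and the jitviewer.py entry point follow the PyPy project's documentation, and spectral-norm.py stands in for the benchmark script):

```shell
# Record the JIT log while running the benchmark under PyPy ...
PYPYLOG=jit-log-opt,jit-backend:spectral.log pypy spectral-norm.py 3000
# ... then serve the annotated traces in the jitviewer web application.
jitviewer.py spectral.log
```
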
6.4 Behaviour Over Time
Using the same log file, generated as mentioned in Section 6.3, it is also possible to apply
a different tool, which shows the behaviour of the Just-In-Time compiler and the garbage
collector over the course of the application’s execution. This is accomplished by generating
an image containing the various components in a time lapse of the run. Green is used
to show the activity of the Just-In-Time compiler, while red is used for the activity of
the garbage collector. Different shades are used to show specific components of either the
Just-In-Time compiler (for example tracing, optimising, etc.) or the garbage collector (for
example minor collections). The time spent interpreting is left transparent. The most
active components are added to the figure in a key. If the time lapse is not interesting, it
is also possible to generate a summary showing the activity of each specific component.
6.4.1 A Lot Of Garbage Collection
The figures obtained using this tool show a lot of garbage collection taking place, as can be seen in Figure 6.5, which shows the time lapse of the fannkuch-redux benchmark with argument 11. Only the spectral-norm benchmark with the smallest argument does not show this behaviour. This is unexpected. Therefore I decided to increase the size of the
Figure 6.4: The jitviewer in action, showing the Python code, Python bytecode and the
intermediate representation of the code of the Just-In-Time compiler for the
spectral-norm benchmark
image generated by this tool, and the result can be found in Figure 6.6. The problem with this tool is that it does not draw the time spent in interpretation. If the total execution time is relatively long, which is the case for these benchmarks, then even a one-pixel line drawn to show a short activity of a specific component, such as a minor garbage collection, is too wide and therefore misleading. My suspicion was that there are a lot of garbage collections happening one after the other, each of which takes very little time; this would produce exactly the observed pattern. I have been able to confirm this suspicion by going through the log file: the graphs are indeed not very representative of the garbage collections. It is not possible to generate very large graphs, which means it is important to be careful when a large amount of time seems to be spent visually. The summary, which can be seen in Listing 6.1, or the key of the graph will clarify this.
Figure 6.5: Time lapse of fannkuch-redux with argument 11 and generated width 1000

interpret                   98.057905084%
gc-minor                     1.013017932%
gc-minor-walkroots           0.464278975%
jit-tracing                  0.328666524%
jit-resume                   0.055008926%
jit-optimize                 0.052875836%
jit-backend                  0.015693034%
gc-set-nursery-size          0.007687626%
jit-log-virtualstate         0.002679879%
jit-backend-dump             0.001098056%
gc-hardware                  0.000594516%
jit-backend-addr             0.000138570%
jit-mem-looptoken-alloc      0.000095715%
jit-log-noopt-loop           0.000065545%
jit-log-opt-bridge           0.000037148%
jit-abort                    0.000032234%
jit-log-compiling-bridge     0.000027141%
jit-log-rewritten-bridge     0.000026188%
jit-log-short-preamble       0.000015614%
jit-log-opt-loop             0.000015473%
jit-log-rewritten-loop       0.000012621%
jit-log-compiling-loop       0.000012392%
jit-summary                  0.000007832%
jit-mem-collect              0.000005127%
jit-backend-counts           0.000002011%
Listing 6.1: Time lapse of the fannkuch-redux benchmark with argument 11
Figure 6.6: First part of the time lapse of fannkuch-redux with argument 11 and generated width 20000
6.4.2 Behaviour Of The Just-In-Time Compilation
Figure 6.7 contains the time lapse for spectral-norm with the smallest argument. This graph does not suffer from the problem that garbage collection is not represented correctly. The Just-In-Time compiler's behaviour is similar for all benchmarks and it is represented nicely in this graph.
We can conclude that the Just-In-Time compiler is mainly active at the beginning of the run, which is to be expected. Compiling takes some time, so it is best to do this as soon as possible, so that as much benefit as possible can be gained from it. Furthermore, we notice that barely any time is spent in the Just-In-Time compiler. Note that a relatively large share of that time is spent dumping to a log file (1.9%). Remember that it is only
Figure 6.7: Time lapse of the spectral-norm benchmark with argument 3000
necessary to write to a log file in order to view this graph; normally, when running the code, there will not be any logging, which means this cost disappears. This is the only graph where the amount of time spent in the Just-In-Time compiler, including dumping to a log file, is larger than one percent, as can be seen in Table 6.1.
The activity of the garbage collector is visible in Table 6.2. The values for the benchmarks which allocate a lot of memory are higher than the others. The k-nucleotide benchmark has very low values, because that benchmark only allocates memory in the beginning.
6.4.3 Influence Of The Nursery Size
Section 3.2.1 already mentioned that it is possible to modify the garbage collector by setting some variables. Since we see a lot of collections happening, it is interesting to see how the parametrisation of the garbage collector influences the behaviour over time. Increasing the nursery size should reduce the number of collections. The summary verifies that the time spent collecting memory is indeed reduced: instead of the 1.013% obtained for the default value (four megabytes), now only 0.712% of the time is spent on minor collections. However, the total execution time increases from 22.64 seconds to 28.117 seconds due to bad data locality.
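The parametrisation itself is done through an environment variable that PyPy's garbage collector reads at start-up; the 16MB value below is just an example size, and the script name is illustrative.

```shell
# The default nursery is four megabytes; quadruple it for this run.
PYPY_GC_NURSERY=16MB pypy fannkuch-redux.py 11
```
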
6.5 Influence Of The Just-In-Time Compiler
In Chapter 3, it has already been mentioned that the Just-In-Time compiler of PyPy is
very interesting and I have put extra effort into investigating its behaviour by measuring
Table 6.1: The time spent in the Just-In-Time compiler compared to the total execution time

benchmark        arg        activity Just-In-Time compiler (%)
fannkuch-redux   11         0.45
fannkuch-redux   12         0.01
spectral-norm    3000       3.66
spectral-norm    5500       0.84
n-body           5000000    0.20
n-body           50000000   0.33
k-nucleotide     2500000    0.58
fasta-redux      25000000   0.91
binary-trees     20         1.04
Table 6.2: The time spent in the garbage collector compared to the total execution time

benchmark        arg        activity garbage collector (%)
fannkuch-redux   11         1.49
fannkuch-redux   12         1.49
spectral-norm    3000       0.62
spectral-norm    5500       0.20
n-body           5000000    1.32
n-body           50000000   1.32
k-nucleotide     2500000    0.33
fasta-redux      25000000   5.33
binary-trees     20         2.53
the stable behaviour and the behaviour with the Just-In-Time compiler disabled. We would expect running PyPy without the Just-In-Time compiler to be slower. We would also expect the stable-behaviour run to perform better than running PyPy with the JIT compiler, because the stable behaviour uses the optimised code generated by the compiler while excluding the time during which the compiler ran.
6.5.1 Time Measurements
The time results, visually represented in Figure 6.8, clearly show that the expected behaviour is indeed observed. The difference between running PyPy with and without the Just-In-Time compiler is huge, which means the compiler is very effective and also necessary to improve the performance of Python.
The difference between the stable version and PyPy is less clear. The stable behaviour is indeed a bit faster than the normal behaviour, although sometimes its values are a bit higher, because the amount of memory is a bit larger due to the harness and the garbage collection does not always happen exactly when asked. The results are clear enough, however, to confirm what we noticed in Section 6.4, namely that not a lot of time is lost in the Just-In-Time compiler and that its overhead is very low.
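The stable-behaviour idea can be mimicked with a small warm-up-then-measure harness. This is a rough sketch of the principle, not the actual harness used in this thesis, and the workload below is a made-up stand-in for a benchmark.

```python
import time

def measure_stable(workload, warmup=3, reps=5):
    # Let a JIT (if any) compile the hot code during the warm-up runs ...
    for _ in range(warmup):
        workload()
    # ... then time only the post-warm-up repetitions and keep the best one.
    timings = []
    for _ in range(reps):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return min(timings)

best = measure_stable(lambda: sum(i * i for i in range(10_000)))
print(f"best stable iteration: {best:.6f} s")
```
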
Figure 6.8: Time measurements for the PyPy runtime environments, normalised to PyPy.
Note that there are no results for PyPyS for k-nucleotide and binary-trees.
6.5.2 Hardware Events
The previous sections have clearly shown that the Just-In-Time compiler is indeed very
effective. It improves the application execution time significantly, while having very little
overhead. However, it is still not clear what causes the benefit and whether the Just-In-Time
compiler has remaining disadvantages. To answer these questions, we analyse the hardware
events below.
6.5.2.1 Cycles Per Instruction
Figure 6.9 presents the number of cycles per instruction. It clearly shows the influence
of the Just-In-Time compiler. For the CPU-intensive benchmarks, there is barely any
difference between PyPy and PyPyS. PyPyNJ has a slightly increased number of cycles
per instruction compared with the other two, except for n-body.
The fasta-redux results show that the stable behaviour uses more cycles per instruction,
while the PyPyNJ results are better. This is caused by a larger number of data cache
misses, explained in Section 6.5.2.4.
The k-nucleotide and binary-trees benchmarks, which both use a lot of memory, clearly
show that the Just-In-Time compiler has a huge influence on the cycles per instruction: a
lot of time is lost loading and storing the information needed for the Just-In-Time
compilation, such as the counters and the compiled code.
Figure 6.9: Cycles per instruction for the PyPy runtime environments. Note that there
are no results for PyPyS for k-nucleotide and binary-trees.
6.5.2.2 Branch Behaviour
The branch behaviour is captured in Figures 6.10 and 6.11. Since the PyPyS behaviour
is almost identical to the PyPy behaviour, we can conclude that most branches are caused
by the application code and not by the Just-In-Time compiler itself.
The results for PyPyNJ are very consistent; only the results for the I/O-intensive
benchmarks are slightly elevated. This is, however, inherent to the benchmarks: both
k-nucleotide and fasta-redux have a higher number of branches.
The Just-In-Time compiler causes slightly more branches for the I/O- and memory-intensive
benchmarks.
The branch misses are very consistent across all PyPy runtime environments, except for
the spectral-norm benchmark. The traces for this benchmark are a lot shorter, which leads
to more branches, and a lot of them are mispredicted due to the algorithm.
Figure 6.10: Branches per instruction for the PyPy runtime environments. Note that
there are no results for PyPyS for k-nucleotide and binary-trees.
Figure 6.11: Branch misses per instruction for the PyPy runtime environments. Note
that there are no results for PyPyS for k-nucleotide and binary-trees.
6.5.2.3 Level-1 Instruction Cache Behaviour
Figure 6.12 contains the number of level-1 instruction cache loads divided by the number
of instructions. The values obtained for PyPyNJ are very consistent; this is, of course, the
effect of the virtual machine. Using a Just-In-Time compiler has a huge influence on this
behaviour, but the changes are inherent to the benchmarks; the influence of the Just-In-Time
compiler itself is minimal.
The most interesting results are presented in Figure 6.13, which shows the number of
level-1 instruction cache load misses per instruction. This graph clearly shows that the
Just-In-Time compiler reduces the misses, because jumps are replaced with guards and the
traces contain an entire loop of instructions that are executed consecutively. This improves
the instruction locality.
Figure 6.12: Level-1 instruction cache loads per instruction for the PyPy runtime
environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees.
Figure 6.13: Level-1 instruction cache load misses per instruction for the PyPy runtime
environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees.
6.5.2.4 Level-1 Data Cache Behaviour
The level-1 data cache behaviour is represented in Figures 6.14 and 6.15. For spectral-norm
there is a big difference between running the benchmark with or without the Just-In-Time
compiler. This difference between PyPyNJ and the other two runtime environments
occurs because that benchmark uses almost no memory. The compiled code of the
Just-In-Time compiler itself needs to be stored, but no other memory is necessary, because
the values of the variables can be stored in registers. The virtual machine causes a
lot more data to be stored, because the wrapping prevents the interpreter from using
registers. This effect is visible for this benchmark. The other benchmarks show very
consistent behaviour.
Again the most revealing results are captured by the misses. They clearly show that
the added cost of storing the compiled code doubles the load misses for the I/O- and
memory-intensive benchmarks, where memory is important. A possible way to mitigate this
problem is to apply prefetching to improve the data cache behaviour; this approach is
investigated in Section 6.7. The results for spectral-norm clearly show that the
Just-In-Time compiler is very efficient for purely CPU-intensive applications.
Figure 6.14: Level-1 data cache loads per instruction for the PyPy runtime environments.
Note that there are no results for PyPyS for k-nucleotide and binary-trees.
Figure 6.15: Level-1 data cache load misses per instruction for the PyPy runtime
environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees.
6.6 Adjusting The Heap Size
PyPy allows the user to modify some of its internal variables, which influence the behaviour
of the Just-In-Time compiler or the garbage collector. Since PyPy has problems with
level-1 data cache behaviour when running the Just-In-Time compiler, I decided to
experiment with the maximum heap size. This variable defines the maximum size of the
heap PyPy is allowed to use for all generations together. First, I checked the minimum
amount of memory each benchmark requires to run; I call this the minimum heap size
(MHS). Table 6.3 contains the minimum heap size for each benchmark. As expected,
k-nucleotide and binary-trees need the most memory.
Since PyPy has problems with the level-1 data cache, I decided to evaluate the behaviour
of PyPy when it has just sufficient memory to run these benchmarks. This is followed by
a comparison with twice the minimum heap size, three times the minimum heap size and
the default heap size (fourteen gigabytes). This methodology is advised to explore the
space-time tradeoff of automatic memory management [6], i.e. garbage collection.

Table 6.3: Minimum heap size required for each benchmark

    benchmark        arg         minimum heap size (MB)
    fannkuch-redux   11          4
    fannkuch-redux   12          4
    spectral-norm    3000        3
    spectral-norm    5500        3
    n-body           5000000     3
    n-body           50000000    3
    k-nucleotide     2500000     46
    fasta-redux      2500000     7
    fasta-redux      25000000    7
    binary-trees     20          297
The time measurements for the different runs, normalised to the time with the minimum
heap size, are included in Figure 6.16. While changing the heap sizes causes no huge
differences in the averages, there can be a huge difference in standard deviation. For
example, the standard deviation of the time measurements for fannkuch-redux with
argument 12 and the heap size set to twice the minimum heap size is 13.839 seconds.
PyPy's memory management explains why there are no huge differences: almost no data
is freed while running the programs, which means that the minimum heap size is very
close to the actual amount of used data.
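A heap-size sweep like the one above can be scripted through the PYPY_GC_MAX environment variable (see Appendix A). The sketch below only builds the command line and environment; the benchmark script name is a placeholder, and a `pypy` binary is assumed.

```python
import os

def heap_limited_command(script, args, max_heap):
    """Build the command line and environment for a heap-capped PyPy run."""
    env = dict(os.environ, PYPY_GC_MAX=max_heap)
    return ["pypy", script] + list(args), env

# Hypothetical sweep for binary-trees (MHS = 297 MB, see Table 6.3):
for limit in ("297MB", "594MB", "891MB"):       # 1x, 2x and 3x the MHS
    cmd, env = heap_limited_command("binarytrees.py", ["20"], limit)
    # subprocess.run(cmd, env=env) would launch one measurement
```

Copying the parent environment and overriding only PYPY_GC_MAX keeps the rest of the configuration identical between runs, so only the heap limit varies.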
Figure 6.16: Time measurements normalised to the execution time with the minimum
heap size (MHS) for PyPy with varying heap sizes
6.7 Prefetching
In Section 6.5.2, I mentioned prefetching as an idea to improve the performance of PyPy
by reducing the number of load misses in the level-1 data cache. Prefetching is a technique
that loads data before it is requested by the application's executing instructions, in the
hope that the 'prefetched data' will be needed later on and no time will be lost loading
it. The necessary data is commonly identified prior to execution, based on the assembly
instructions. This technique can be applied at the hardware and software level and has
been researched previously [19, 3, 13]. Hardware prefetching is already applied by hardware
manufacturers and is now a built-in feature on most machines. Software prefetching offers
the benefit that more information is available, which makes it possible, for example, to
determine loop bounds or indirect indexing [17]. Special algorithms can be applied to a
specific piece of software, although this increases the instruction count [17].
The results discussed in Section 6.5.2 have led to the assumption that prefetching might
improve the speed of PyPy, since the Just-In-Time compiler increases the number of data
misses. Prefetching, if effective, is therefore expected to reduce these data misses. To
confirm this assumption, I compared running PyPy with and without prefetching turned
on, and explored both hardware and software prefetching techniques.
6.7.1 Hardware Prefetching
Section 6.5.2 already showed that the Just-In-Time compiler improves the behaviour of
the level-1 instruction cache, but introduces a lot more misses in the data cache. In the
following sections, the influence of hardware prefetching is examined for both the level-1
instruction and data caches.
The results for PyPyS have been dropped, since they are almost identical to the normal
behaviour of PyPy; dropping them increases the readability of the figures.
The following discussion contains graphs comparing the PyPy and PyPyNJ runtime
environments with and without hardware prefetching enabled. 'NP' (No Prefetching)
indicates the behaviour of a runtime environment with hardware prefetching disabled.
6.7.1.1 Level-1 Instruction Cache Behaviour
Figure 6.17 shows that hardware prefetching has almost no influence on the number of load
operations in the level-1 instruction cache. There is a huge difference for the fasta-redux
benchmark with argument 2500000; however, this difference has almost completely
disappeared when the argument is larger, which leads us to believe that the smaller
argument might be a special case.
The level-1 instruction cache misses, represented in Figure 6.18, confirm that hardware
prefetching has almost no influence on the level-1 instruction cache behaviour.
Figure 6.17: Influence of prefetching on the level-1 instruction cache loads for the PyPy
runtime environments
Figure 6.18: Influence of prefetching on the level-1 instruction cache load misses for the
PyPy runtime environments
6.7.1.2 Level-1 Data Cache Behaviour
The influence of prefetching on the level-1 data cache is shown in Figure 6.19. The graph
clearly shows that the influence of hardware prefetching on the number of loads is
negligible. However, the misses, included in Figure 6.20, show a different result. The
number of misses per instruction is reduced significantly when prefetching is enabled,
particularly for the I/O- and memory-intensive benchmarks when PyPy runs with the
Just-In-Time compiler. This means that prefetching helps to reduce the negative data
cache effects of the Just-In-Time compiler.
Figure 6.19: Influence of prefetching on the level-1 data cache loads for the PyPy runtime
environments
Figure 6.20: Influence of prefetching on the level-1 data cache load misses for the PyPy
runtime environments
6.7.2 Software Prefetching
A well-known approach to software prefetching has been presented by Hans-J. Boehm [7]
for the Boehm-Demers-Weiser garbage collector. The idea behind his algorithm, called
prefetch-on-grey, is shown in Listing 6.2. It basically keeps prefetching all data, without
considering whether the data is necessary and will be used in the near future.
Push all roots on the mark stack, making them grey.
While there is a pointer to an object g on the stack
    Pop g from the mark stack.
    If g is still grey (not already black)
        Blacken g.
        For each pointer p in g's target
            Prefetch p.
            If p is white
                Grey p and push p on the mark stack.

Listing 6.2: Prefetch-on-grey algorithm
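The traversal order of Listing 6.2 can be sketched in Python. Real prefetching happens at the hardware level, so the prefetch step below is a deliberate no-op stub; the sketch only illustrates where in the marking loop Boehm's collector issues its prefetches.

```python
WHITE, GREY, BLACK = 0, 1, 2

def prefetch(obj):
    pass  # placeholder: a real collector would issue a CPU prefetch here

def mark(roots, children):
    """Tri-colour marking; children maps each object to the objects it points to."""
    colour = {o: WHITE for o in children}
    stack = []
    for r in roots:                       # push all roots, making them grey
        colour[r] = GREY
        stack.append(r)
    while stack:
        g = stack.pop()
        if colour[g] == GREY:             # not already black
            colour[g] = BLACK             # blacken g
            for p in children[g]:         # each pointer in g's target
                prefetch(p)               # prefetch before the colour test
                if colour[p] == WHITE:
                    colour[p] = GREY
                    stack.append(p)
    return {o for o, c in colour.items() if c == BLACK}
```

For the object graph `{"a": ["b"], "b": ["a"], "c": []}` with root `"a"`, only `"a"` and `"b"` end up black; `"c"` stays white and would be reclaimed.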
It is of the utmost importance that PyPy only prefetches data that will be used, preferably
right before it is needed, because otherwise the prefetches cause negative effects by
polluting the cache. The influence of early and late prefetches has been researched
previously [17]; due to time constraints, I decided not to focus on this. Software
prefetching is most often applied to for loops; however, it seems much more interesting
to apply it to traces. The best approach I found discovers misses by monitoring the
memory accesses of the program at the instruction level. It would be possible to apply
this after the Just-In-Time compilation. An overview of the approach is presented in
Figure 6.21. A few steps are used to insert prefetch statements:
1. A filter tags each memory access as a hit or a miss (only load misses are taken into
account).
2. For each memory access, candidate predictors are generated for the location and, for
pointers, also for the contents of the location. The idea is to keep track of a fixed
contiguous region of memory around the location and the location's contents address.
The authors discovered that region sizes of 10 and 20 cache lines in each direction gave
good results.
3. The candidate predictions are hashed by their memory line addresses and stored
in the Memory Line Address Map (MLA map). For each load miss, the address is
checked in the MLA map. If there are existing predictions for the load miss address,
each predictor is considered further for validity.
4. To decide how long the predictors should be kept, a sliding window is used.
5. The predictions are pruned, to keep the most accurate ones.
6. The highly accurate predictions are inserted into the assembly.
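Step 1 of the pipeline above can be sketched with a tiny direct-mapped cache model. The line size and cache size below are illustrative values, not the parameters used by Mueller and Marathe.

```python
LINE_SIZE = 64          # bytes per cache line (illustrative)
NUM_LINES = 512         # a small direct-mapped cache (illustrative)

def classify_accesses(addresses):
    """Tag each memory access address as a 'hit' or a 'miss'."""
    lines = [None] * NUM_LINES          # which line address each slot holds
    tags = []
    for addr in addresses:
        line_addr = addr // LINE_SIZE   # the cache line this byte falls in
        slot = line_addr % NUM_LINES    # direct mapping: one slot per line
        if lines[slot] == line_addr:
            tags.append("hit")
        else:
            lines[slot] = line_addr     # evict whatever was in the slot
            tags.append("miss")
    return tags
```

For example, `classify_accesses([0, 8, 64])` tags the first and third accesses as misses (two different lines) and the second as a hit (same line as the first). A real filter would model the target machine's actual cache geometry.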
After going through the PyPy source code, I have come to the conclusion that it is not
easy to apply these researchers' techniques in PyPy. The Just-In-Time compiler's backend
generates assembly without first generating C. In order to use the harness, it would be
necessary to go over the entire code after the assembly generation. Since the harness is
written in C++, there would be an added cost to load and call the libraries. A better
approach would be to translate the framework to Python code or, better still, to construct
an algorithm that does not need the entire generated assembly. Such an approach would
have to work on the Python bytecodes. One of the main issues with software prefetching
for PyPy is that everything happens at runtime, which means that the cost of applying
the prefetching should preferably be much smaller than the gained benefit. Most research
has focused on applying prefetching before the program is executed. Due to time
constraints, I have not been able to complete my research on prefetching. However, I
believe it is possible and would improve the data cache load misses. The obtainable
improvement is not yet clear, and the cost might prove to be high; therefore, it is not
advisable to use prefetching with short-running programs.

Figure 6.21: Prefetching approach created by Frank Mueller and Jaydeep Marathe [19]
6.8 Conclusion
PyPy’s Just-In-Time compiler is very effective. There is barely any time spent performing
Just-In-Time compilation, while the total execution time gets reduced substantially. The
main problems of PyPy are caused by the memory. The level-1 data cache load misses
are a lot higher with the Just-In-Time compiler than without. These memory problems
are already improved by applying hardware prefetching. However a much larger benefit
could be obtained from using software prefetching. It is not easy to apply this and it
would require a lot of work to make this work for every hardware architecture, since
PyPy generates the assembly itself. Still more research about this approach is required to
discover the influence of software prefetching.
Chapter 7
Final Conclusions
The best performing language according to the experimental results is C. No Python
runtime environment is able to outperform it. This is, however, not a big surprise. The
disadvantage of C is that it is hard to use for most people. Python is a language that is
easy to program, even by non-experts, and can be used by many more people. However,
there is less focus on efficient performance. Recently the interest of the academic world
in Python has increased, which means that performance becomes a key factor. My thesis
explores the different runtime environments that can run Python programs, their impact
on execution time and hardware events, and possibilities to optimise performance.
There are three approaches to running Python code:
• interpretation
• compilation to a lower level language
• interpretation with a Just-In-Time compiler
Each approach has been investigated through a commonly used runtime environment
that implements it. The default interpreter, called CPython, represents the interpretation
approach. Cython has been chosen to compile Python code to C, after which the C code
is compiled to an executable; it also performs some optimisations, in the hope of getting
even better performance. PyPy is the best-known Python implementation using a
Just-In-Time compiler; other projects have been discontinued in its favour. A conclusion
follows for each runtime specifically, and finally a general conclusion is provided.
7.1 CPython
CPython has proven to be very slow. This was not a big surprise, since no effort has
been put into optimising the execution. There is an intention to improve it, because
optimisation flags have been added; however, there are no concrete optimisations yet.
The improvements that have been made do not optimise the execution time, but only
the loading time. Since Python is mainly used to glue existing components together,
this is obviously important. However, the loading cost is very small compared to the cost
of running a computationally heavy task. While CPython is a very good choice for glue
code, it is not meant for heavy computations. However, many of the libraries that perform
computationally intensive tasks are written in C or Fortran, in line with the intention
of a 'glue language'.
One of the main concerns about CPython is the Global Interpreter Lock, which prevents
simultaneous execution of different threads, even if multiple cores are available.
During my research, I discovered that the Global Interpreter Lock actually improves the
performance of single-threaded code, but for multi-threaded applications it is a huge
bottleneck. There have been many attempts to remove it; however, none have been
successful. A solution is available as a library, called multiprocessing, which allows
simultaneous execution on different cores. This library introduces an overhead because
subprocesses are spawned, each containing a new interpreter. This overhead is not very
large, and the multi-threaded benchmarks have shown that very decent speed-ups are
possible. Since the problems with the Global Interpreter Lock are circumvented by the
multiprocessing module, there is no reason not to use CPython for multi-threaded
applications.
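The multiprocessing pattern described above can be sketched as follows. The prime-counting work function is a made-up CPU-bound task, and the "fork" start method is assumed to be available (it is on Linux); each pool worker is a separate interpreter process, so the Global Interpreter Lock of one process does not block the others.

```python
import multiprocessing

def count_primes(bounds):
    """CPU-bound work: count the primes in [lo, hi) by trial division."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    # "fork" sidesteps the pickling restrictions of the "spawn" method.
    ctx = multiprocessing.get_context("fork")
    chunks = [(0, 5000), (5000, 10000), (10000, 15000), (15000, 20000)]
    with ctx.Pool(4) as pool:                    # one interpreter per process,
        counts = pool.map(count_primes, chunks)  # so the GIL is no bottleneck
    print(sum(counts))
```

The overhead mentioned above is visible here: each worker process starts with its own interpreter state, and the chunk results must be sent back to the parent to be combined.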
7.2 Cython
Cython can provide a huge speed-up over CPython. However, type information is
necessary to get those speed-ups, which is counterintuitive. Most people choose Python
for the ease of programming, and dynamic typing is a very important factor in that
decision. The plethora of different types, and the fact that overflows are caught and
resolved on the fly, also make programming easier. The types need to be converted to C,
which does not support most of them; this makes the definition of the types a lot more
difficult. It is therefore advised that the Python code be written closer to C, which is
counterintuitive for Python programmers. For now this dilemma cannot be solved, which
makes Cython a less interesting alternative.
A possible solution is to guess the types before handing the code to Cython. However,
some variables might change their type during the execution; this could be solved by
creating new variables or by limiting the language. The problem is that it is difficult even
to manually define the types, because they are totally different from C, which means that
automating this will be harder still. This approach has not been tried yet. If it were
successful, a huge speed-up would be obtained.
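A first step toward such type guessing could look at literal assignments. The sketch below is a toy illustration built on the standard `ast` module; it deliberately ignores reassignment with a different type, which is exactly the hard case mentioned above.

```python
import ast

def guess_types(source):
    """Map each variable to the type of the literal first assigned to it."""
    guesses = {}
    for node in ast.walk(ast.parse(source)):
        # Only plain assignments of a literal constant are considered.
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id not in guesses:
                    guesses[target.id] = type(node.value.value).__name__
    return guesses
```

For example, `guess_types("n = 10\nx = 2.5")` yields `{"n": "int", "x": "float"}`. A usable tool would also need data-flow analysis to detect the variables whose type changes mid-execution.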
It is clear that Cython is not a viable alternative for most Python users. However, it
might be interesting for people writing computationally heavy tasks: they already have
experience with C or Fortran and should be able to write Python code enhanced with
Cython statements successfully. Cython also allows the programmer to improve the
performance incrementally. First, plain Python code can be written and tested on smaller
examples; once the code is finished, the most important functions can be improved by
adding type information. Research has shown that there is no significant difference
between a stand-alone C program and a Python version using C libraries [11]. This would
not improve the execution time of the program compared to C, but it might reduce the
development time.
7.3 PyPy
By using a Just-In-Time compiler, PyPy has proven that the performance of CPython is
not optimal. It is not necessary to modify the Python code for PyPy to work. However, the
project does not yet support the latest version of Python. The intention of the project
is to improve Python code, which means that the libraries should be written in Python
as well to take full advantage of the Just-In-Time compiler. It is also possible to use C
libraries, but the Just-In-Time compiler will not be able to optimise them. This advantage
is clearly shown in Section 5.2, where PyPy outperforms C for the string append example.
The largest performance benefit is obtained for CPU-intensive applications. The I/O- and
memory-intensive benchmarks do not show the same performance gain, which is caused
by memory problems. The interpreter, garbage collector and Just-In-Time compiler all
need some memory, which negatively influences the memory behaviour of the user
application. This problem becomes evident when a lot of memory is used, and it is
difficult to reduce the memory used by the interpreter, garbage collector and Just-In-Time
compiler. While it is not possible to reduce the amount of memory used by the user
application, prefetching reduces the number of level-1 data cache misses per instruction.
Software prefetching could lead to an even larger benefit; however, it is not a simple task
to implement and is left for future work.
PyPy’s Just-In-Time compiler is very efficient. The Just-In-Time compiler runs for
a very short amount of time to achieve very large performance improvements for the
application. The Just-In-Time compiler mainly works in the beginning of the execution.
This is important because research has proven that the ‘hot’ code should be detected and
compiled early to improve the speed [18].
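The counter-based hot-code detection behind this behaviour can be sketched in a few lines. The threshold value mirrors PyPy's default of 1039 loop iterations (see Appendix A), while the decorator itself is a toy stand-in for the real tracing machinery.

```python
THRESHOLD = 1039   # PyPy's default loop-hotness threshold (Appendix A)

def hot_loop_detector(func):
    """Count calls and flag the function as 'hot' once it crosses THRESHOLD."""
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        if not wrapper.hot and wrapper.calls >= THRESHOLD:
            wrapper.hot = True   # a real JIT would start tracing here
        return func(*args, **kwargs)
    wrapper.calls = 0
    wrapper.hot = False
    return wrapper

@hot_loop_detector
def body(x):
    return x * x

for i in range(2000):
    body(i)
print(body.hot)   # True: 2000 calls crossed the threshold
```

Because hot code is detected after roughly a thousand iterations, the compiler's work is naturally concentrated at the start of the execution, which matches the behaviour observed above.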
The multiprocessing module reduces the effect of the Just-In-Time compiler, because a
new interpreter is launched on each core. This means that each core gets its own
Just-In-Time compiler and it is not possible to share information between the different
cores. Therefore PyPy exhibited worse speed-ups between single-threaded and
multi-threaded versions of the benchmarks than CPython did.
7.4 General
When performance is not an issue, CPython is the preferred runtime environment. There
are a lot of modules available, it is easy to use, it has extensive documentation and it has
a lot of users. For short-running applications, it is also an excellent choice. Since CPython
can communicate with C, C++ and Fortran, it will do fine when Python is used as
'glue code'. It also performs well with multi-threaded applications.
Before attempting to improve the performance by changing to a different runtime
environment, it would be wise to check whether the algorithms can be improved. A better
algorithm might provide an even higher performance boost.
If performance becomes important and the algorithms cannot be improved, PyPy is a
much better choice, provided the application is CPU-intensive. If there is threading
involved or a lot of memory is used, PyPy has some performance problems.
Finally, if performance really is of the utmost importance, Cython provides the best
option. First, the code can be written in Python and tested with smaller examples. Once
the code is considered correct, the performance can be improved by adding type
information to the most important methods. This provides the smoothest workflow: it
reduces the development time while still supplying an incredibly fast execution time.
However, this approach is only for experienced programmers.
To conclude, the differences between various Python runtime environments have been
investigated and a comparison with C has been performed. This research has led to the
conclusion that better benchmarking is necessary for Python: the most popular Python
benchmarks are not sufficient for a decent analysis. Furthermore, techniques from Java
research have been used to perform an analysis of PyPy. When analysing a Just-In-Time
compiler, it is important to take into account different heap sizes and to use the stable
methodology; this gives a better overview of the influence of the garbage collector and
the Just-In-Time compiler. Finally, suggestions are given to improve the performance of
existing Python runtime environments and to indicate which runtime environment is
most advantageous for each situation.
Appendices
Appendix A
PyPy Options
It is possible to modify the behaviour of the Just-In-Time compiler by setting some
options. The possible modifications are listed on the help page:
Advanced JIT options: a comma-separated list of OPTION=VALUE:

decay=N
    amount to regularly decay counters by (0=none, 1000=max) (default 40)
enable_opts=N
    INTERNAL USE ONLY (MAY NOT WORK OR LEAD TO CRASHES): optimizations to
    enable, or all =
    intbounds:rewrite:virtualize:string:earlyforce:pure:heap:unroll (default all)
function_threshold=N
    number of times a function must run for it to become traced from start
    (default 1619)
inlining=N
    inline python functions or not (1/0) (default 1)
loop_longevity=N
    a parameter controlling how long loops will be kept before being freed,
    an estimate (default 1000)
max_retrace_guards=N
    number of extra guards a retrace can cause (default 15)
max_unroll_loops=N
    number of extra unrollings a loop can cause (default 0)
retrace_limit=N
    how many times we can try retracing before giving up (default 5)
threshold=N
    number of times a loop has to run for it to become hot (default 1039)
trace_eagerness=N
    number of times a guard has to fail before we start compiling a bridge
    (default 200)
trace_limit=N
    number of recorded operations before we abort tracing with ABORT_TOO_LONG
    (default 6000)
off
    turn off the JIT
help
    print this page
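As a concrete illustration, the options above can be combined into a single comma-separated OPTION=VALUE list. The sketch below is hypothetical: the option values are arbitrary, and the pypyjit module used to set them at runtime only exists under PyPy.

```python
import sys

# Build a comma-separated OPTION=VALUE list from the options above.
# On the command line, this list would be passed after --jit, e.g.:
#   pypy --jit threshold=500,trace_limit=10000 myscript.py
options = {"threshold": 500, "trace_limit": 10000}
jit_spec = ",".join("{}={}".format(k, v) for k, v in sorted(options.items()))
print(jit_spec)  # threshold=500,trace_limit=10000

# At runtime, the same knobs can be set via the PyPy-only pypyjit module.
if sys.implementation.name == "pypy":
    import pypyjit
    pypyjit.set_param(jit_spec)
```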
It is also possible to configure the garbage collector by setting environment variables:
PYPY_GC_NURSERY The nursery size. Defaults to 1/2 of your cache or 4M. Small
values (like 1 or 1KB) are useful for debugging.
PYPY_GC_NURSERY_CLEANUP The interval at which nursery is cleaned up.
Must be smaller than the nursery size and bigger than the biggest object we can
allocate in the nursery.
PYPY_GC_INCREMENT_STEP The size of memory marked during the marking
step. Default is size of nursery times 2. If you mark it too high your GC is not
incremental at all. The minimum is set to size that survives minor collection times
1.5 so we reclaim anything all the time.
PYPY_GC_MAJOR_COLLECT Major collection memory factor. Default is 1.82,
which means trigger a major collection when the memory consumed equals 1.82
times the memory really used at the end of the previous major collection.
PYPY_GC_GROWTH Major collection threshold’s max growth rate. Default is 1.4.
Useful to collect more often than normally on sudden memory growth, e.g. when
there is a temporary peak in memory usage.
PYPY_GC_MAX The max heap size. If coming near this limit, it will first collect
more often, then raise an RPython MemoryError, and if that is not enough, crash
the program with a fatal error. Try values like 1.6GB.
PYPY_GC_MAX_DELTA The major collection threshold will never be set to more
than PYPY_GC_MAX_DELTA the amount really used after a collection. Defaults
to 1/8th of the total RAM size (which is constrained to be at most 2/3/4GB on 32-bit
systems). Try values like 200MB.
PYPY_GC_MIN Don’t collect while the memory size is below this limit. Useful to
avoid spending all the time in the GC in very small programs. Defaults to 8 times
the nursery.
PYPY_GC_DEBUG Enable extra checks around collections that are too slow for
normal use. Values are 0 (off), 1 (on major collections) or 2 (also on minor collections).
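As a sketch of how these variables are used in practice, the snippet below builds an environment with a larger nursery and a heap cap and then launches a script under PyPy. The variable values and the script name are illustrative, and a pypy binary is assumed to be on the PATH.

```python
import os
import shutil
import subprocess

# Tune the PyPy GC through environment variables (illustrative values,
# using the variable names documented above).
env = dict(os.environ, PYPY_GC_NURSERY="8MB", PYPY_GC_MAX="1.6GB")

# Only launch if a pypy binary is actually available on the PATH.
if shutil.which("pypy") is not None:
    subprocess.run(["pypy", "myscript.py"], env=env)
```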
Appendix B
Removing Threading From
binary-trees and k-nucleotide
Only two benchmarks use threads. I removed threading from both benchmarks in order
to analyse the multi-threaded behaviour, as explained in Section 5.5.
B.1
k-nucleotide
For the C version, this has been accomplished by making get_cpu_count return one instead of the number of cores:
int
get_cpu_count(void) {
    cpu_set_t cpu_set;
    CPU_ZERO(&cpu_set);
    sched_getaffinity(0, sizeof(cpu_set), &cpu_set);
    return 1;
    // return CPU_COUNT(&cpu_set);
}
For Python, it is necessary to remove any use of a pool. This can be accomplished by
replacing the pool's map function with the built-in one:
def main():
    global sequence
    sequence = prepare()
    #p = Pool()
    #res2 = p.map_async(find_seq, reversed("GGT GGTA GGTATT GGTATTTTAATT GGTATTTTAATTTATAGT".split()))
    #res1 = p.map_async(sort_seq, (1, 2))
    res2 = map(find_seq, reversed("GGT GGTA GGTATT GGTATTTTAATT GGTATTTTAATTTATAGT".split()))
    res1 = map(sort_seq, (1, 2))
    #for s in res1.get(): print(s + '\n')
    #res2 = reversed([r for r in res2.get()])
    for s in res1: print(s + '\n')
    res2 = reversed([r for r in res2])
    print("\n".join("{1:d}\t{0}".format(*s) for s in res2))
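One point worth noting when swapping pool.map_async for the built-in map: in Python 3, map returns a lazy iterator rather than a list, so the results only materialise when iterated. The sketch below uses a toy function rather than the benchmark's:

```python
# map returns a lazy iterator in Python 3; iterating it (or wrapping it
# in list()) forces the computation, much like calling .get() on the
# AsyncResult returned by Pool.map_async.
res = map(lambda x: x * x, (1, 2, 3))
out = list(res)  # force the iterator
print(out)  # [1, 4, 9]
```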
B.2
binary-trees
The approach used for k-nucleotide was not easily applicable to this benchmark. A
different solution is to place a pthread_join immediately after each pthread_create:
/*
 * The calculations is started in reverse order compared to most other
 * solutions. The reason is that all data must be on the stack and the
 * result from shallowest tree must be printed first.
 */
void
do_trees(int depth, int min_depth, int max_depth)
{
    pthread_t thread;
    pthread_attr_t attr;
    struct item_worker_data wd;

    if (depth < min_depth)
        return;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, stack_sz(depth + 1));
    wd.iterations = 1 << (max_depth - depth + min_depth);
    wd.check = 0;
    wd.depth = depth;
    pthread_create(&thread, &attr, item_worker, &wd);
    pthread_join(thread, NULL);
    do_trees(depth - 2, min_depth, max_depth);
    // pthread_join(thread, NULL);
    pthread_attr_destroy(&attr);
    printf("%d\ttrees of depth %d\tcheck: %d\n",
           2 * wd.iterations,
           depth,
           wd.check);
}
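The effect of this pthread_join placement can be illustrated with a small Python sketch (a hypothetical worker, not the benchmark code): joining each thread immediately after starting it means at most one worker is ever alive, so the work runs serially and deterministically.

```python
import threading

results = []
for i in range(3):
    # start/join back-to-back serialises the workers, mirroring the
    # pthread_create followed by pthread_join pattern in the C code above
    t = threading.Thread(target=lambda n=i: results.append(n * 2))
    t.start()
    t.join()  # wait before creating the next worker

print(results)  # [0, 2, 4]
```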
For the Python version, it is again possible to replace the pool's map function with the
built-in one:
def main(n, min_depth=4):
    max_depth = max(min_depth + 2, n)
    stretch_depth = max_depth + 1
    if mp.cpu_count() > 1:
        #pool = mp.Pool()
        #chunkmap = pool.map
        chunkmap = map
    else:
        chunkmap = map
    print('stretch tree of depth {0}\t check: {1}'.format(
        stretch_depth, make_check((0, stretch_depth))))
    long_lived_tree = make_tree(0, max_depth)
    mmd = max_depth + min_depth
    for d in range(min_depth, stretch_depth, 2):
        i = 2 ** (mmd - d)
        cs = 0
        for argchunk in get_argchunks(i, d):
            cs += sum(chunkmap(make_check, argchunk))
        print('{0}\t trees of depth {1}\t check: {2}'.format(i * 2, d, cs))
    print('long lived tree of depth {0}\t check: {1}'.format(
        max_depth, check_tree(long_lived_tree)))
Bibliography
[1] nbviewer, a simple way to share ipython notebooks, May 2014.
[2] speed.pypy.org project, May 2014.
[3] Aneesh Aggarwal. Software caching vs. prefetching. SIGPLAN Not., 38(2
supplement):157–162, June 2002.
[4] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn,
and Kurt Smith. Cython: The best of both worlds. Computing in Science and Engg.,
13(2):31–39, March 2011.
[5] Yosi Ben Asher and Nadav Rotem. The effect of unrolling and inlining for python
bytecode optimizations. In Proceedings of SYSTOR 2009: The Israeli Experimental
Systems Conference, SYSTOR ’09, pages 14:1–14:14, New York, NY, USA, 2009.
ACM.
[6] S. M. Blackburn, R. Garner, C. Hoffman, A. M. Khan, K. S. McKinley, R. Bentzur,
A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump,
H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von Dincklage,
and B. Wiedermann. The DaCapo benchmarks: Java benchmarking development and
analysis. In OOPSLA ’06: Proceedings of the 21st annual ACM SIGPLAN conference
on Object-Oriented Programming, Systems, Languages, and Applications, pages 169–
190, New York, NY, USA, October 2006. ACM Press.
[7] Hans-J. Boehm. Reducing garbage collector cache misses. SIGPLAN Not., 36(1):59–
64, October 2000.
[8] Carl Friedrich Bolz, Antonio Cuni, Maciej Fijałkowski, Michael Leuschel, Samuele
Pedroni, and Armin Rigo. Allocation removal by partial evaluation in a tracing
jit. In Proceedings of the 20th ACM SIGPLAN Workshop on Partial Evaluation and
Program Manipulation, PEPM ’11, pages 43–52, New York, NY, USA, 2011. ACM.
[9] Carl Friedrich Bolz, Antonio Cuni, Maciej Fijalkowski, Michael Leuschel, Samuele
Pedroni, and Armin Rigo. Runtime feedback in a meta-tracing jit for efficient dynamic
languages. In Proceedings of the 6th Workshop on Implementation, Compilation,
Optimization of Object-Oriented Languages, Programs and Systems, ICOOOLPS ’11,
pages 9:1–9:8, New York, NY, USA, 2011. ACM.
[10] Carl Friedrich Bolz, Antonio Cuni, Maciej Fijalkowski, and Armin Rigo. Tracing
the meta-level: Pypy’s tracing jit compiler. In Proceedings of the 4th Workshop on
the Implementation, Compilation, Optimization of Object-Oriented Languages and
Programming Systems, ICOOOLPS ’09, pages 18–25, New York, NY, USA, 2009.
ACM.
[11] Xing Cai, Hans Petter Langtangen, and Halvard Moe. On the performance of the
python programming language for serial and parallel scientific computations. Sci.
Program., 13(1):31–56, January 2005.
[12] Jose Castanos, David Edelsohn, Kazuaki Ishizaki, Priya Nagpurkar, Toshio Nakatani,
Takeshi Ogasawara, and Peng Wu. On the benefits and pitfalls of extending a statically typed language jit compiler for dynamic scripting languages. SIGPLAN Not.,
47(10):195–212, October 2012.
[13] Chen-Yong Cher, Antony L. Hosking, and T. N. Vijaykumar. Software prefetching for
mark-sweep garbage collection: Hardware analysis and software redesign. SIGARCH
Comput. Archit. News, 32(5):199–210, October 2004.
[14] Alex Holkner and James Harland. Evaluating the dynamic behaviour of python applications. In Proceedings of the Thirty-Second Australasian Conference on Computer
Science - Volume 91, ACSC ’09, pages 19–28, Darlinghurst, Australia, Australia, 2009.
Australian Computer Society, Inc.
[15] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual –
Volume 3, February 2014.
[16] Gregory L. Lee, Dong H. Ahn, Bronis R. de Supinski, John Gyllenhaal, and Patrick
Miller. Pynamic: The python dynamic benchmark. In Proceedings of the 2007 IEEE
10th International Symposium on Workload Characterization, IISWC ’07, pages 101–
106, Washington, DC, USA, 2007. IEEE Computer Society.
[17] Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. When prefetching works, when it
doesn't, and why. ACM Trans. Archit. Code Optim., 9(1):2:1–2:29, March 2012.
[18] Seong-Won Lee and Soo-Mook Moon. Selective just-in-time compilation for client-side mobile javascript engine. In Proceedings of the 14th International Conference
on Compilers, Architectures and Synthesis for Embedded Systems, CASES ’11, pages
5–14, New York, NY, USA, 2011. ACM.
[19] Jaydeep Marathe and Frank Mueller. Pfetch: Software prefetching exploiting temporal predictability of memory access streams. In Proceedings of the 9th Workshop
on MEmory Performance: DEaling with Applications, Systems and Architecture,
MEDEA ’08, pages 1–8, New York, NY, USA, 2008. ACM.
[20] Jan Martinsen, Hakan Grahn, and Anders Isberg. Using speculation to enhance
javascript performance in web applications. IEEE Internet Computing, 17(2):10–19,
March 2013.
[21] Fadi Meawad, Gregor Richards, Floréal Morandat, and Jan Vitek. Eval begone!:
Semi-automated removal of eval from javascript programs. In Proceedings of the
ACM International Conference on Object Oriented Programming Systems Languages
and Applications, OOPSLA ’12, pages 607–620, New York, NY, USA, 2012. ACM.
[22] John K. Ousterhout. Scripting: Higher-level programming for the 21st century. Computer, 31(3):23–30, March 1998.
[23] Armin Rigo. Representation-based just-in-time specialization and the psyco prototype for python. In Proceedings of the 2004 ACM SIGPLAN Symposium on Partial
Evaluation and Semantics-based Program Manipulation, PEPM ’04, pages 15–26, New
York, NY, USA, 2004. ACM.
[24] Armin Rigo. Transactional Memory (II). http://morepypy.blogspot.be/2012/01/
transactional-memory-ii.html, 2012. [Online; accessed 5-June-2014].
[25] Dag Sverre Seljebotn. Fast numerical computations with cython. In Gaël Varoquaux,
Stéfan van der Walt, and Jarrod Millman, editors, Proceedings of the 8th Python in
Science Conference, pages 15 – 22, Pasadena, CA USA, 2009.
[26] TIOBE Software. TIOBE programming community index, 2014.
[27] Haiping Zhao, Iain Proctor, Minghui Yang, Xin Qi, Mark Williams, Qi Gao, Guilherme Ottoni, Andrew Paroski, Scott MacVicar, Jason Evans, and Stephen Tu. The
hiphop compiler for php. In Proceedings of the ACM International Conference on
Object Oriented Programming Systems Languages and Applications, OOPSLA ’12,
pages 575–586, New York, NY, USA, 2012. ACM.
List of Figures

3.1   Architecture CPython . . . 10
3.2   The phases used by the bytecode compiler . . . 14
3.3   Architecture PyPy . . . 17
3.4   The RPython toolchain . . . 18
4.1   Comparison between PyPy and CPython by speed.python.org . . . 25
4.2   Time measurements for The Grand Unified Python Benchmark suite . . . 26
5.1   Time measurements for the pairwise distance calculation problem with 1000 points . . . 35
5.2   Time comparison between the runtime environments normalised to CPython . . . 39
5.3   The number of cycles per instruction . . . 41
5.4   The number of branches per instruction . . . 42
5.5   The number of branch misses per instruction . . . 42
5.6   The number of level-1 instruction cache loads per instruction . . . 43
5.7   The number of level-1 instruction cache load misses per instruction . . . 44
5.8   The number of level-1 data cache loads per instruction . . . 45
5.9   The number of level-1 data cache load misses per instruction . . . 45
5.10  The number of last level cache loads per instruction . . . 46
5.11  The number of last level cache load misses per instruction . . . 47
5.12  The number of last level cache stores per instruction . . . 48
5.13  The number of last level cache store misses per instruction . . . 48
5.14  The number of instruction translation lookaside buffer load misses per instruction . . . 49
5.15  The number of data translation lookaside buffer load misses per instruction . . . 49
5.16  The number of data translation lookaside buffer store misses per instruction . . . 50
5.17  Time measurements for C normalised to the time with threading . . . 51
5.18  Time measurements for Cython normalised to the time with threading . . . 52
5.19  Time measurements for CPython normalised to the time with threading . . . 53
5.20  Time measurements for PyPy normalised to the time with threading . . . 53
5.21  Time comparison between the PYC and CPython runtime environments, normalised to CPython . . . 54
6.1   The flow graph for the is_perfect_number method . . . 58
6.2   The flow graph for the is_perfect_number method after the annotate phase . . . 59
6.3   The flow graph for the is_perfect_number method after the rtyping phase . . . 59
6.4   The jitviewer in action, showing the Python code, Python bytecode and the intermediate representation of the code of the Just-In-Time compiler for the spectral-norm benchmark . . . 61
6.5   Time lapse of fannkuch-redux with argument 11 and generated width 1000 . . . 62
6.6   First part time lapse of fannkuch-redux with argument 11 and generated width 20000 . . . 62
6.7   Time lapse of the spectral-norm benchmark with argument 3000 . . . 63
6.8   Time measurements PyPy runtime environments normalised to PyPy. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 65
6.9   Cycles per instruction for the PyPy runtime environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 66
6.10  Branches per instruction for the PyPy runtime environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 67
6.11  Branch misses per instruction for the PyPy runtime environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 67
6.12  Level-1 instruction cache loads per instruction for the PyPy runtime environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 68
6.13  Level-1 instruction cache load misses per instruction for the PyPy runtime environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 68
6.14  Level-1 data cache loads per instruction for the PyPy runtime environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 69
6.15  Level-1 data cache load misses per instruction for the PyPy runtime environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 70
6.16  Time measurements normalised to the execution time with the minimum heap size (MHS) for PyPy with varying heap sizes . . . 71
6.17  Influence prefetching on the level-1 instruction cache loads for the PyPy runtime environments . . . 72
6.18  Influence prefetching on the level-1 instruction cache load misses for the PyPy runtime environments . . . 73
6.19  Influence prefetching on the level-1 data cache loads for the PyPy runtime environments . . . 74
6.20  Influence prefetching on the level-1 data cache load misses for the PyPy runtime environments . . . 74
6.21  Prefetching approach created by Frank Mueller and Jaydeep Marathe [19] . . . 76