Performance Analysis and Benchmarking of Python, a
Modern Scripting Language
Ruben Heynssens
Supervisor: Prof. dr. ir. Lieven Eeckhout
Counsellor: Dr. Jennifer Sartor
Master's dissertation submitted in order to obtain the academic degree of
Master of Science in de ingenieurswetenschappen: computerwetenschappen
Department of Electronics and Information Systems
Chairman: Prof. dr. ir. Jan Van Campenhout
Faculty of Engineering and Architecture
Academic year 2013-2014
Acknowledgements
I would like to thank Prof. L. Eeckhout and Dr. J. Sartor for their guidance and encouragement, Dr. W. Heirman for his advice and assistance with the Bluepower machine and with measuring hardware events, and the Ghent University Electronics and Information Systems department for the use of the Bluepower machine. I would also like to thank Prof. F. Mueller from North Carolina State University for sharing the code to apply software prefetching. Finally, I would like to thank my parents for all their support, faith and encouragement.
The author gives permission to make this master dissertation available for consultation
and to copy parts of this master dissertation for personal use.
In the case of any other use, the limitations of the copyright have to be respected, in
particular with regard to the obligation to state expressly the source when quoting results
from this master dissertation.
Ghent, June 2014
Ruben Heynssens
Performance Analysis and Benchmarking of
Python, a Modern Scripting Language
Ruben Heynssens
Supervisors: Prof. Lieven Eeckhout, Dr. Jennifer Sartor
Abstract—I investigated the differences between various Python runtime environments and performed a comparison with C. The main contributions are promoting better benchmarking for Python and applying existing benchmarking techniques to Python, because the most popular Python benchmarks run too briefly for a sound analysis. Furthermore, I give suggestions to improve the performance of existing Python runtime environments.
This is achieved by finding a representative benchmark suite and developing a sound benchmarking methodology, which I then applied to the most common and popular Python runtime environments. I also compared the runtime environments with C in order to show the difference between compilation and interpretation.
The results indicate that the default interpreter is very slow. However, it is sufficient when Python is used as ‘glue code’, and moreover a large number of libraries are available for it. Just-In-Time compilation improves performance substantially for CPU-intensive applications, but problems with the level-1 data cache cause a much smaller benefit over the default interpreter for I/O- and memory-intensive applications. Compiling Python code to C improves performance drastically if extra type information is supplied; otherwise there is no significant benefit compared to the default interpreter.
Index Terms—Python, performance, benchmarking
I. Introduction
Over the past decades, scripting languages have
become increasingly popular due to the increasing importance of graphical user interfaces and the
growth of the internet. They have been created for
very specific purposes, like ‘glueing’ components together or performing text processing. They do not
require the programmer to specify the type of variables, and thus allow for easy and rapid development.
Since scripting languages are commonly interpreted
for the particular machine they are running on, they
are portable.
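The dynamic typing mentioned above can be illustrated with a minimal sketch: the same function works for any type that supports the operations used, with no declared parameter types (the function name is illustrative, not from the thesis).

```python
# Dynamic typing in a scripting language: no parameter types are declared,
# so one function works for ints, strings and lists alike.
def double(x):
    return x + x

print(double(21))       # ints:    42
print(double("ab"))     # strings: 'abab'
print(double([1, 2]))   # lists:   [1, 2, 1, 2]
```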
Many different scripting languages are available nowadays, such as awk, JavaScript, PHP, Perl and Bash, and they are used in very different domains. In this thesis, however, I focus on Python. This language is very popular and, because of its ease of use, often used by people with little programming experience. Recently IPython, an interactive Python interpreter running in a web browser, has been released. This, together with the wide variety of scientific libraries available for Python, has drawn the interest of the academic world. For academic applications, performance becomes of the utmost importance. However, there has not been much research comparing the different options for improving the performance of Python. Moreover, a thorough performance analysis of the various runtime environments across a broad range of general applications has not been performed.
II. Existing Python Runtime Environments and Related Work
There are currently three common approaches to running Python programs:
• interpretation
• compilation to a lower-level language
• Just-In-Time compilation
I have explored the most common and most popular runtime environments for Python, which cover this range of interpretation and compilation techniques. CPython, the default runtime environment, applies simple interpretation. Cython has been used to evaluate the behaviour of compiling Python code, in this case to C; the C code is then compiled to an executable. Cython also offers the possibility to add type information, which enables extra optimisation. PyPy is the best-known runtime environment providing Just-In-Time compilation for Python; most other such projects have been discontinued in its favour [1]. A Just-In-Time compiler (JIT) compiles ‘hot’ code at runtime, after which the compiled code is used, which should execute faster. PyPy’s Just-In-Time compiler follows the principles of a tracing JIT [2], [3]. A benchmarking methodology for evaluating the behaviour of JIT compilers has already been developed for Java [4].
The main criticisms of Python are related to the Global Interpreter Lock (GIL). This lock prevents two threads from executing code simultaneously, even if multiple cores are available. People have attempted to remove the GIL, but those attempts have not been successful, and for now there does not seem to be a solution to this problem. Therefore a new module has been created, called multiprocessing, which successfully circumvents the GIL by spawning subprocesses. For now, this is considered the best approach to multi-threaded applications.
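The subprocess approach can be sketched as follows: a CPU-bound task is split across worker processes, each with its own interpreter and thus its own GIL, instead of threads that would serialise on one lock. The workload below (counting primes by trial division) is an illustrative example, not one of the thesis benchmarks.

```python
# A minimal sketch of circumventing the GIL with the multiprocessing module:
# CPU-bound work runs in subprocesses, each with its own interpreter.
from multiprocessing import Pool

def count_primes(bounds):
    """Count primes in [lo, hi) by trial division (deliberately CPU-bound)."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    # Split the range [0, 20000) into four chunks, one per subprocess.
    chunks = [(i * 5000, (i + 1) * 5000) for i in range(4)]
    with Pool(processes=4) as pool:
        total = sum(pool.map(count_primes, chunks))
    print(total)
```

With threads, the four `count_primes` calls would run one at a time under the GIL; with `Pool`, they run in parallel on separate cores.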
III. Benchmarking Suite and Methodology
To analyse the behaviour of the different runtime environments, a benchmark suite was required. The Grand Unified Python Benchmark Suite is meant to compare different Python implementations with each other. However, after running this suite I noticed that most of its benchmarks are very short. Since this is not sufficient for a sound comparison and analysis, I investigated other benchmark suites. The Computer Language Benchmarks Game suite proved to be the most reliable one, and it also allows easy comparison with C.
To get a clear view of the behaviour of the different runtime environments, hardware events such as the number of cycles, branches and level-1 instruction cache misses have been measured using perf and PAPI. The values were verified using raw events to guarantee their correctness. To analyse the JIT component of PyPy, both the stable behaviour, based on the DaCapo methodology [4], and the behaviour without the JIT have been measured. The stable behaviour tries to eliminate the JIT cost while still using the optimised code, which gives a better view of the efficiency and influence of the JIT.
IV. Results and Discussion
First a specific application, the pairwise distance calculation, is used to measure the benefit of adding type information in Cython. Adding type information resulted in an execution roughly 1000 times faster than regular Python code without type information, which means that Cython does not perform well without it. Automated type inference could make adding this information easier, but this has not been researched yet.
The time measurements showed that the C equivalents of the Python programs always ran faster than the Python code on any runtime environment. PyPy performs very well on the CPU-intensive benchmarks, but its performance on the I/O- and memory-intensive benchmarks is considerably worse. CPython and Cython perform similarly. Type information was not added in this experiment, both to ensure a fair comparison and because adding correct type information is not easy, so most users will not do it.
For C, the measured hardware events show that the type of benchmark influences the events. This behaviour is also observed for PyPy, though to a lesser extent; it is caused by the JIT, because the code is compiled to assembly. Similar behaviour would be expected for Cython, but this is not the case: without type information, its behaviour is much more similar to CPython’s. Because the type is not known, Cython wraps each value in an object, which leads to behaviour very similar to interpretation. Both Cython and CPython show consistent values for the events, on which the type of benchmark has very little influence.
The multi-threaded benchmarks show that the GIL is successfully circumvented by the multiprocessing module. I also benchmarked the behaviour without threading to compare the parallel speed-up of the different runtime environments. The results show that CPython obtains a very decent speed-up: the overhead of spawning multiple subprocesses is very small. The multi-threaded behaviour of PyPy is not as good as CPython’s; execution was only twice as fast with threading than without, while eight cores were available. The multiprocessing module spawns a new interpreter on each core, which for PyPy means that a new JIT is created as well. Since it is not possible to share information such as compiled code, and the Just-In-Time compiler has less code to work with on each core, a smaller improvement is obtained.
To get a better understanding of PyPy’s JIT, its behaviour over time has been analysed using a tool included in the PyPy source code. It shows that the JIT works mainly at the beginning of the execution, which is important because this yields the largest benefit [5]. The JIT has also proven to be very efficient, since for all but one benchmark it ran for less than one percent of the time. PyPy without the JIT shows very poor performance, which means the JIT is necessary to improve the performance and leads to a very decent speed-up.
However, PyPy still has some memory problems. PyPy’s interpreter, garbage collector and JIT all need to store data, which negatively influences the execution of the user application and becomes visible for the I/O- and memory-intensive benchmarks.
The hardware event measurements have shown that the JIT improves the level-1 instruction cache behaviour, but that more misses per instruction occur in the level-1 data cache. This might be improved by prefetching, so I repeated the measurements with hardware prefetching turned off. These results confirm that hardware prefetching reduces the number of level-1 data cache misses per instruction. A larger benefit could be obtained by applying software prefetching; however, I have not been able to finish this work.
V. Conclusions
The analysis leads to the conclusion that CPython, the default interpreter, is only useful when performance is of no concern, meaning it can be used for short applications or as ‘glue code’. Multi-threaded applications do get a decent performance boost. The libraries for most computationally intensive tasks are written in C, which means that even for heavy calculations CPython is an option, although not if the calculations are written by the user in Python. While PyPy can call C libraries, this is not advised, because the Just-In-Time compiler cannot improve the performance of non-Python code. CPython is therefore more attractive for code bases that need these libraries.
When performance becomes important and the algorithms cannot be improved any further, PyPy is a viable option, but only for CPU-intensive tasks. The memory problems hinder PyPy too much for I/O- and memory-intensive applications, and concurrent applications will not gain a large benefit on PyPy either.
If performance is of the utmost importance, Cython is the best option, although this approach is not for novice users. The advantage of Cython is that it allows the entire development, including testing, to be done in Python. When the application is finished, type information can be added incrementally where the most benefit will be obtained. This results in a faster development cycle than using C and improves performance enormously compared with CPython. The overhead of combining C with Python is very small compared to the total execution time, so the overhead of Cython is minimal as well. Note that this approach does not require writing C code. Novice users are advised to use only ‘simple’ data structures and statements, which makes it much easier to add type information.
If even the Cython approach is not good enough, it is possible to combine Python with C, C++ or Fortran; many libraries are available that make this easy. However, this adds a cost to development and should therefore be avoided when possible.
In summary, I analysed the most popular runtime environments and their performance for one of the most prevalent scripting languages, Python. I developed methodology techniques for this exploration and have offered suggestions to users regarding the situations for which each runtime environment is most advantageous.
Acknowledgements
I would like to thank Prof. L. Eeckhout and Dr. J. Sartor for their guidance and encouragement. This work was carried out using the Bluepower machine at the Ghent University Electronics and Information Systems department. I would also like to thank Dr. W. Heirman for his advice and assistance with the Bluepower machine and with measuring hardware events, and Prof. F. Mueller for sharing the code to apply software prefetching.
References
[1] A. Rigo, “Representation-based just-in-time specialization and the Psyco prototype for Python,” in Proceedings of the 2004 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-based Program Manipulation (PEPM ’04). New York, NY, USA: ACM, 2004, pp. 15–26. [Online]. Available: http://doi.acm.org/10.1145/1014007.1014010
[2] C. F. Bolz, A. Cuni, M. Fijałkowski, and A. Rigo, “Tracing the meta-level: PyPy’s tracing JIT compiler,” in Proceedings of the 4th Workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems (ICOOOLPS ’09). New York, NY, USA: ACM, 2009, pp. 18–25. [Online]. Available: http://doi.acm.org/10.1145/1565824.1565827
[3] C. F. Bolz, A. Cuni, M. Fijałkowski, M. Leuschel, S. Pedroni, and A. Rigo, “Allocation removal by partial evaluation in a tracing JIT,” in Proceedings of the 20th ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation (PEPM ’11). New York, NY, USA: ACM, 2011, pp. 43–52. [Online]. Available: http://doi.acm.org/10.1145/1929501.1929508
[4] S. M. Blackburn, R. Garner, C. Hoffman, A. M. Khan, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von Dincklage, and B. Wiedermann, “The DaCapo benchmarks: Java benchmarking development and analysis,” in Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA ’06). New York, NY, USA: ACM Press, Oct. 2006, pp. 169–190.
[5] S.-W. Lee and S.-M. Moon, “Selective just-in-time compilation for client-side mobile JavaScript engine,” in Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES ’11). New York, NY, USA: ACM, 2011, pp. 5–14. [Online]. Available: http://doi.acm.org/10.1145/2038698.2038703
Prestatieanalyse en benchmarking van Python,
een moderne scripting taal
1
Ruben Heynssens
Begeleiders: Prof. Lieven Eeckhout, Dr. Jennifer Sartor
Samenvatting—Ik onderzocht het verschil tussen diverse Python runtime omgevingen en voerde
een vergelijking uit met C. De belangrijkste bijdragen zijn het aanmoedigen van betere benchmarking voor Python en het toepassen van bestaande technieken om Python te benchmarken,
omdat de populairste Python benchmarks een te
korte uitvoeringstijd hebben voor een degelijke
analyse. Verder geef ik ook suggesties om de prestatie van bestaande Python runtime omgevingen
te verbeteren.
Dit wordt bereikt door een representatieve benchmark suite te zoeken en een degelijke benchmarking methodologie te ontwikkelen. Vervolgens pas
ik deze methodologie toe op de meest gebruikte
en populaire Python runtime omgevingen. Verder
vergelijk ik de runtime omgevingen met C om het
verschil tussen compilatie en interpretatie aan te
tonen.
De resultaten geven aan dat de standaard interpreter zeer traag is. Deze voldoet echter wel wanneer Python gebruikt wordt als ‘glue code’ en bovendien zijn er zeer veel bibliotheken beschikbaar.
Het toepassen van Just-In-Time compilatie op Python verbetert de prestatie zeer goed voor CPUintensieve toepassingen. Problemen met de level-1
data cache veroorzaken een veel kleiner voordeel
voor I/O- en geheugen-intensieve toepassingen in
vergelijking met de standaard interpreter. Het
compileren van Python code naar C verbetert de
prestatie enorm indien er extra type informatie
wordt gegeven, anders wordt er geen significante
winst bekomen tegenover de standaard interpreter.
Index Terms—Python, prestatie, benchmarking
I. Introductie
Scripting talen zijn gedurende de laatste decennia
zeer populair geworden dankzij het toenemende belang van grafische gebruikersinterfaces en de groei
van het internet. Ze zijn ontwikkeld voor zeer specifieke doeleinden, zoals het aaneenlijmen van complexe componenten of tekstverwerking. Ze verlangen
niet van de programmeur dat de types van de variabelen gedeclareerd worden, waardoor ze een snel-
lere en eenvoudigere ontwikkeling mogelijk maken.
Aangezien scripting talen normaliter geïnterpreteerd
worden voor de machine waarop ze uitgevoerd worden, is het gemakkelijk om de toepassingen op andere
hardware uit te voeren.
Er zijn een groot aantal scripting talen beschikbaar, zoals awk, JavaScript, PHP, Perl, Bash, enz.
Deze worden gebruikt in verscheidene domeinen.
Voor deze thesis heb ik echter besloten om de nadruk
te leggen op Python. Deze taal is tegenwoordig zeer
populair en wordt ook gebruikt door mensen die minder ervaring hebben met programmeren dankzij het
gebruiksgemak. Recent is IPython, een interactieve
Python interpreter die werkt in een web browser,
gelanceerd. Deze tool, samen met de vele wetenschappelijke modules die reeds beschikbaar zijn in Python,
hebben gezorgd dat er interesse is gekomen voor
Python vanuit de wetenschappelijke wereld. Voor
wetenschappelijke toepassingen is de prestatie van
Python een zeer belangrijke factor. Er is nog niet veel
onderzoek gedaan die de verschillende mogelijkheden
vergelijkt om de prestatie van Python te verbeteren.
Bovendien zijn de meeste Python runtime omgevingen nog niet geanalyseerd over een uitgebreid scala
van algemene toepassingen.
II. Bestaande Python runtime omgevingen
en gerelateerd werk
Momenteel zijn er drie veelgebruikte methoden om
Python toepassingen uit te voeren:
• interpretatie
• compilatie naar een andere taal
• Just-In-Time compilatie
Ik heb de meest gebruikte en populaire runtime
omgevingen voor Python onderzocht, die deze methoden bevatten. CPython, de standaard interpreter,
past enkel interpretatie toe. Cython is gebruikt om
het gedrag te evalueren voor de tweede methode,
waarbij de Python code naar C gecompileerd wordt.
Vervolgens wordt de C code gecompileerd naar een
uitvoerbaar bestand. Cython biedt ook de mogelijkheid aan om type informatie toe te voegen. Deze
informatie zal zorgen dat er nog verder geoptimaliseerd zal worden. Het is echter niet gemakkelijk
om correcte informatie te geven, zeker niet wanneer
de types ingewikkeld worden. PyPy is het bekendste
project dat Just-In-Time compilatie toepast. Andere
projecten zijn gestopt vanwege dit project [1]. Een
Just-In-Time compiler (JIT) zal tijdens de uitvoering
vaak uitgevoerde code compileren. Daarna is het
mogelijk om de gecompileerde versie te gebruiken,
welke sneller zou moeten uitvoeren. PyPy’s Just-InTime compiler werkt volgens de principes van een
tracing JIT [2], [3]. Er is reeds een benchmarking
methodologie ontwikkeld voor Java om het gedrag
van JIT compilers te evalueren [4].
De belangrijkste ergernissen die mensen hebben
met Python zijn gerelateerd aan de Global Interpreter Lock (GIL). Dit lock voorkomt dat twee threads
gelijktijdig code kunnen uitvoeren, zelfs wanneer
meerdere cores beschikbaar zijn. Er zijn reeds pogingen ondernomen om de GIL te verwijderen, maar
men is er tot nu toe nog niet geslaagd. Voorlopig
blijkt er ook geen oplossing te zijn voor dit probleem.
Daarom is een nieuwe module gecreëerd, genaamd
multiprocessing, die de GIL kan omzeilen door
subprocessen aan te maken. Voorlopig wordt dit
beschouwd als de beste aanpak voor multi-threaded
toepassingen.
gebaseerd op DaCapo [4], en het gedrag zonder JIT
gemeten. Het stabiele gedrag probeert de JIT te elimineren, terwijl de geoptimaliseerde code nog steeds
gebruikt wordt. Dit geeft een beter overzicht van de
efficiëntie en de invloed van de JIT.
IV. Resultaten en discussie
Eerst wordt een specifieke toepassing, de pairwise
distance calculation, gebruikt om het voordeel van
de type informatie van Cython te analyseren. Het
toevoegen van type informatie leidt tot een 1000
keer snellere uitvoering in vergelijking met de gewone
Python code die geen type informatie heeft, wat
betekent dat Cython niet goed presteert zonder type
informatie. Geautomatiseerde type guessing zou dit
gemakkelijker kunnen maken. Hier is echter nog geen
onderzoek over verricht.
De tijdsmetingen tonen aan dat C nog steeds de
snelste runtime omgeving is. PyPy presteert ook zeer
goed voor CPU-intensieve toepassingen. De verbetering voor I/O- en geheugen-intensieve benchmarks
zijn een stuk minder. CPython en Cython presteren
beide gelijkaardig. Dit gedrag wordt verklaard door
de type informatie. Voor de benchmarks was geen
type informatie toegevoegd, om een eerlijke vergelijking te maken. Het is niet gemakkelijk deze informatie toe te voegen, waardoor de meeste gebruikers dit
niet zullen doen.
De gemeten hardware events tonen aan dat het
type benchmark de events beïnvloedt. Dit gedrag
wordt ook opgemerkt voor PyPy, hoewel in mindere
mate. Dit wordt natuurlijk veroorzaakt door de JIT,
die de code compileert naar assembly. Een gelijkaardig gedrag wordt verwacht voor Cython, maar
dit is niet het geval. Zonder de type informatie is
het gedrag gelijkaardig aan CPython. Iedere waarde
wordt gewrapped in een object door Cython, omdat
het type niet gekend is. Dit leidt tot een gedrag
zeer gelijkaardig aan interpretatie. Zowel Cython als
CPython hebben consistente waarden voor de events.
Het type van de benchmark heeft er bijna geen
invloed op.
De multi-threaded benchmarks tonen aan dat
de GIL met succes omzeild wordt door de
multiprocessing module. Ik heb ook het gedrag
zonder threading gemeten om de versnelling te vergelijken. De resultaten tonen dat CPython zeer goed
versnelt dankzij het gebruik van meerdere threads.
De kost van het opstarten van meerdere interpreters is zeer klein. PyPy versnelt echter niet zo veel
als CPython voor multi-threaded toepassingen. De
III. Benchmarking suite en methodologie
Om het gedrag van de verschillende runtime omgevingen te analyseren is er een benchmark suite
nodig. The Grand Unified Python Benchmark Suite is
bedoeld om verschillende Python implementaties met
elkaar te vergelijken. Na het uitvoeren van deze suite
heb ik echter opgemerkt dat de meeste benchmarks
zeer kort zijn. Aangezien dit niet voldoende is om
een degelijke vergelijking en analyse te doen, heb ik
andere suites onderzocht. The Computer Language
Benchmarks Game suite heeft bewezen de best betrouwbare te zijn en laat ook toe om Python met C
te vergelijken.
Om een duidelijk beeld te krijgen van het gedrag
van de verschillende runtime omgevingen worden
hardware events gemeten zoals het aantal cycles,
branches, instructie cache misses in de eerste niveau
cache, enz. Deze worden gemeten met behulp van perf
en PAPI. De waarden zijn gecontroleerd met ruwe
events om de correctheid te garanderen. Om PyPy’s
JIT te analyseren worden ook het stabiele gedrag,
2
uitvoering was slechts twee keer sneller met threading, terwijl er acht cores beschikbaar waren. Zoals
vermeld, maakt de multiprocessing module een
nieuwe interpreter aan op iedere core. Voor PyPy
betekent dit dat er ook nieuwe JIT aangemaakt
wordt. Aangezien het niet mogelijk is informatie uit
te wisselen, zoals gecompileerde code, en de compiler
moet werken met minder code op iedere core, wordt
een kleinere winst bekomen.
Om een beter zicht te krijgen op PyPy’s JIT
werd het tijdsgedrag geanalyseerd. Dit gedrag wordt
voorgesteld met behulp van een tool dat in de PyPy
broncode zit. Het toont aan dat de JIT vooral in het
begin werkt, wat belangrijk is omdat dit resulteert
in de grootste winst [5]. De JIT heeft ook bewezen
zeer effectief te zijn, aangezien het, behalve bij één
benchmark, voor minder dan één percent actief was.
De resultaten van het stabiele gedrag hebben dit
beaamd. Het gedrag van PyPy zonder JIT toont
een zeer zwakke prestatie. Dit betekent dat de JIT
noodzakelijk is om de prestatie te verbeteren en leidt
ook tot een zeer degelijke prestatie winst.
Er zijn echter ook nog enkele geheugenproblemen
met PyPy. PyPy’s interpreter, garbage collector en
JIT moeten data opslaan. Dit beïnvloedt de uitvoering van een gebruikerstoepassing op een negatieve
manier en wordt zichtbaar voor de I/O- en geheugenintensieve benchmarks.
De hardware event metingen tonen dat de JIT het
level-1 instructie cache gedrag verbetert. Er treden
echter meer misses per instructie op in de level-1
data cache. Dit zou verbeterd kunnen worden met
behulp van prefetching. Daarom heb ik de metingen heruitgevoerd zonder hardware prefetching. Deze
resultaten bevestigen dat hardware prefetching het
aantal level-1 misses per instructie in de data cache
vermindert. Een grotere winst zou bekomen kunnen
worden door software prefetching toe te passen. Onderzoek omtrent dit onderwerp heb ik niet kunnen
afwerken.
een optie is, maar enkel wanneer de berekeningen niet
door de gebruiker in Python geschreven worden. Hoewel PyPy ook C bibliotheken kan oproepen, wordt er
aangeraden dit niet te doen, omdat de JIT compiler
de prestatie niet kan verbeteren van andere talen.
Daarom is CPython interessanter voor toepassingen
die deze bibliotheken nodig hebben.
Indien prestatie belangrijk wordt en de algoritmes
niet meer verbeterd kunnen worden, is PyPy een
goede keuze, maar enkel wanneer het om een CPUintensieve taak gaat. De geheugen problemen hinderen PyPy te veel bij I/O- en geheugen-intensieve
toepassingen. Ook multi-threaded toepassingen zullen geen grote winst bekomen bij PyPy.
Wanneer prestatie echt een kritieke factor is voor
de uitvoering van de applicatie biedt Cython de beste
optie. Deze aanpak is echter niet voor onervaren
gebruikers. Het voordeel van Cython is dat het mogelijk is om de volledige ontwikkeling in Python te
doen, inclusief het testen. Wanneer de toepassing
klaar is, kan type informatie gradueel toegevoegd
worden waar de meeste winst bekomen zal worden.
Dit zal resulteren in een snellere ontwikkelingscyclus
dan wanneer C gebruikt zou worden en de prestatie
zal enorm verbeteren in vergelijking met CPython.
De overhead van het combineren van C met Python
is zeer klein vergeleken met de totale uitvoeringstijd.
Merk op dat het bij deze aanpak niet nodig is om C
code te schrijven, waardoor de overhead van Cython
dus ook zeer klein is. Merk op dat het bij deze aanpak
niet nodig is om C code te schrijven. Voor onervaren
gebruikers wordt het aangeraden om enkel ‘eenvoudige’ data structuren en constructies te gebruiken.
Daardoor zal het een stuk gemakkelijk zijn om type
informatie toe te voegen.
Indien zelfs de Cython aanpak niet goed genoeg
is, is het mogelijk om Python te combineren met
C, C++ of Fortran. Er zijn vele bibliotheken
beschikbaar die dit gemakkelijk maken. Dit zal
echter wel de ontwikkelingskost verhogen en daarom
is het best om dit te vermijden.
V. Conclusies
De analyse leidt tot de conclusie dat CPython,
de standaard interpreter, enkel nuttig is wanneer
de prestatie er niet toe doet. Dit betekent dat
het gebruikt kan worden voor applicaties die een
korte uitvoeringstijd hebben of als ‘glue code’. Multithreaded toepassing zullen een degelijke prestatie
winst krijgen. De meeste bibliotheken die computationeel intensieve taken uitvoeren zijn geschreven in
C, waardoor zelfs voor zware berekeningen CPython
Ik heb de bekendste runtime omgevingen en hun
prestatie geanalyseerd voor één van de belangrijkste
scripting talen, namelijk Python. Ik heb methodologische technieken ontwikkeld voor dit onderzoek en
gebruikers suggesties aangeboden met betrekking tot
welke runtime omgeving de meest geschikte is voor
iedere situatie.
3
Dankwoord
Ik zou graag Prof. L. Eeckhout en Dr. J. Sartor
bedanken voor hun begeleiding en aanmoediging. Dit
werk is uitgevoerd, gebruik makend van de Bluepower machine van de vakgroep van Elektronica en
Informatiesystemen van de universiteit van Gent. Ik
zou ook nog graag Dr. W. Heirman bedanken voor
zijn advies en begeleiding met de Bluepower machine
en het meten van hardware events en Prof. F. Mueller
voor het delen van de code om software prefetching
toe te passen.
Referenties
[1] A. Rigo, “Representation-based just-in-time specialization and the Psyco prototype for Python,” in Proceedings of the 2004 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-based Program Manipulation, ser. PEPM ’04. New York, NY, USA: ACM, 2004, pp. 15–26. [Online]. Available: http://doi.acm.org/10.1145/1014007.1014010
[2] C. F. Bolz, A. Cuni, M. Fijałkowski, and A. Rigo, “Tracing the meta-level: PyPy’s tracing JIT compiler,” in Proceedings of the 4th Workshop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems, ser. ICOOOLPS ’09. New York, NY, USA: ACM, 2009, pp. 18–25. [Online]. Available: http://doi.acm.org/10.1145/1565824.1565827
[3] C. F. Bolz, A. Cuni, M. Fijałkowski, M. Leuschel, S. Pedroni, and A. Rigo, “Allocation removal by partial evaluation in a tracing JIT,” in Proceedings of the 20th ACM SIGPLAN Workshop on Partial Evaluation and Program Manipulation, ser. PEPM ’11. New York, NY, USA: ACM, 2011, pp. 43–52. [Online]. Available: http://doi.acm.org/10.1145/1929501.1929508
[4] S. M. Blackburn, R. Garner, C. Hoffman, A. M. Khan, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von Dincklage, and B. Wiedermann, “The DaCapo benchmarks: Java benchmarking development and analysis,” in OOPSLA ’06: Proceedings of the 21st annual ACM SIGPLAN conference on Object-Oriented Programming, Systems, Languages, and Applications. New York, NY, USA: ACM Press, Oct. 2006, pp. 169–190.
[5] S.-W. Lee and S.-M. Moon, “Selective just-in-time compilation for client-side mobile JavaScript engine,” in Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems, ser. CASES ’11. New York, NY, USA: ACM, 2011, pp. 5–14. [Online]. Available: http://doi.acm.org/10.1145/2038698.2038703
Contents

1 Introduction
  1.1 Scripting Languages
  1.2 Python
  1.3 Why Python?

2 Related Work
  2.1 Scripting Languages
  2.2 Python

3 Runtime Environments
  3.1 CPython
    3.1.1 Architecture
    3.1.2 Optimisations
    3.1.3 Multi-threaded Applications
  3.2 PyPy
    3.2.1 Architecture
    3.2.2 Multi-threaded Applications
  3.3 Cython
  3.4 Other Runtime Environments
  3.5 Conclusion

4 Benchmarking
  4.1 Benchmarking Suites
    4.1.1 The Grand Unified Python Benchmark Suite
    4.1.2 The Computer Language Benchmarks Game
  4.2 Benchmarking Methodologies
  4.3 Setup
  4.4 Hardware Events
    4.4.1 Perf
    4.4.2 PAPI
  4.5 Conclusion

5 Analysis Runtime Environments
  5.1 Preliminary Comparison
    5.1.1 Cython Type Information
    5.1.2 Type Guessing
  5.2 PyPy Beats C
  5.3 Time Measurements
  5.4 Hardware Events
    5.4.1 Cycles Per Instruction
    5.4.2 Branch Behaviour
    5.4.3 Level-1 Instruction Cache Behaviour
    5.4.4 Level-1 Data Cache Behaviour
    5.4.5 Last Level Cache Load Behaviour
    5.4.6 Last Level Cache Store Behaviour
    5.4.7 Translation Lookaside Buffers
  5.5 Multi-threaded Applications
    5.5.1 C
    5.5.2 Cython
    5.5.3 CPython
    5.5.4 PyPy
  5.6 PYC
  5.7 Conclusion

6 Analysis PyPy
  6.1 Translatorshell
  6.2 Hooks
  6.3 JIT Viewer
  6.4 Behaviour Over Time
    6.4.1 A Lot Of Garbage Collection
    6.4.2 Behaviour Of The Just-In-Time Compilation
    6.4.3 Influence Of The Nursery Size
  6.5 Influence Of The Just-In-Time Compiler
    6.5.1 Time Measurements
    6.5.2 Hardware Events
  6.6 Adjusting The Heap Size
  6.7 Prefetching
    6.7.1 Hardware Prefetching
    6.7.2 Software Prefetching
  6.8 Conclusion

7 Final Conclusions
  7.1 CPython
  7.2 Cython
  7.3 PyPy
  7.4 General

Appendices
A PyPy Options
B Removing Threading From binary-trees and k-nucleotide
  B.1 k-nucleotide
  B.2 binary-trees

Bibliography
List of Figures
Chapter 1
Introduction
Most scripting languages are run using an interpreter, which takes in the source
code of the program, referred to as a ‘script’, and executes the code on-the-fly. The use
of an interpreter means that the source code is translated and executed at runtime, without
an optimising compilation step, which means the code does not need to be recompiled
to run on different machines. Of course the interpreter must be installed
on each machine on which the code has to be executed. The use of an interpreter also
results in an added flexibility of the scripting language itself. Since everything is executed
at runtime, type information can be deduced while running the script. Most scripting
languages are therefore dynamically typed, instead of the static typing used by most
system programming languages, such as C, C++, etc. A lot of scripting languages will also
provide more complex constructs, like list comprehensions. This increases the productivity
of the programmer, but since those constructs have a very specific goal, it also means that
a lot of scripting languages are used for a very specific purpose. Think for example
of awk to perform text processing.
At the moment, there are a lot of scripting languages available, like JavaScript, Perl,
Lua, Bash, awk, etc. They are used in very different domains. However, for my thesis, I
decided to focus on Python. The reason for this choice is explained in Section 1.3.
1.1 Scripting Languages
Scripting languages are designed for different tasks than system programming
languages, and this leads to fundamental differences in the languages. System programming languages were designed for building data structures and
algorithms from scratch, starting from the most primitive computer elements
such as words of memory. In contrast, scripting languages are designed for
gluing: They assume the existence of a set of powerful components and are
intended primarily for connecting components. — John K. Ousterhout [22]
Scripting languages are becoming increasingly popular and important due to the rise of graphical user interfaces and the growth of the internet. They
have become possible because of hardware improvements. The main benefit they offer
is ease of use and high productivity. However they are still mainly used as ‘glue code’,
which means that they ‘glue’ already existing components together, which would be more
difficult in a system programming language (think for example about piping in the Unix
shell). Since they also provide a higher productivity, they are often used to hack something together quickly to produce a prototype. Another advantage is that most scripting
languages provide a command-line interpreter, which allows interactive programming by
requesting commands and executing each command as soon as it is received. This means
you can get feedback while writing code, which is currently pushed to the limits in a new
approach called live coding.
1.2 Python
At the moment of writing this document, Python occupies the eighth place on the TIOBE
Programming Community index.¹ The only scripting language ranked higher is PHP. The
first major version of CPython, the default Python interpreter, was released in January
1994 by Guido van Rossum. It has now reached its third version. In other words, Python
is becoming a stable language, with a high number of users.
The Python syntax is easily readable by people not experienced in programming,
because words are used, like and and or, instead of the respective constructs && and ||
used in most computer languages. Python is also a dynamically typed language, which
means it is not necessary to give type information, nor are variables restricted to a single
type. The programmer does not need to concern himself with overflows: these are caught
at runtime and the value is transparently promoted to a representation that can contain it. Something unique
to this language is the fact that indentation is mandatory and needs to be correct to run.
This forces users to write code that is easy to read. Finally the syntax is meant to be
concise, which allows fast development.
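These dynamic features are easy to demonstrate in a few lines; the following snippet is an illustrative sketch, not taken from the thesis benchmarks:

```python
# Integer "overflow" in Python: values are transparently promoted to
# arbitrary-precision integers, so no wrap-around occurs.
x = 2 ** 62
x = x * 4
assert x == 2 ** 64          # far beyond a 64-bit machine word

# Dynamic typing: the same name may be rebound to a value of another type.
x = "now a string"
assert isinstance(x, str)
```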
Another reason for Python’s popularity is the huge amount of modules, or libraries,
available. This means that it is possible to reuse code, which makes it easier and quicker
to program. It is also possible to call C, C++ and Fortran libraries, from within Python.
Furthermore it is easy to combine code written in C with Python, because the task of
compiling the C code to a library can be automated by modules like cffi.
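As a rough illustration of this glue role, the standard library's ctypes module (an alternative to cffi) can call a C function directly; this sketch assumes a POSIX system where the C library's symbols are visible in the running interpreter process:

```python
import ctypes

# On POSIX, loading None gives access to the symbols already linked into
# the interpreter process, which includes the C standard library.
libc = ctypes.CDLL(None)

# Declare the C signature of abs() so ctypes converts arguments correctly.
libc.abs.restype = ctypes.c_int
libc.abs.argtypes = [ctypes.c_int]

assert libc.abs(-5) == 5     # a C function, called from Python
```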
1.3 Why Python?
I already mentioned in Section 1.2 that Python is one of the most popular scripting
languages. Recently, however, Python has also been adopted for academic purposes. This has been
caused by the release of IPython, an interactive Python interpreter running in a web
browser. It is even possible to share the IPython notebooks by putting them online [1].
There are also a lot of libraries available for scientific computing in a lot of different fields.
A few commonly used ones are listed in Table 1.1. Most of these libraries work very well with
the IPython notebook. For example, the images generated with matplotlib can be inlined
in the notebook.
Now that Python is being used for scientific purposes, the performance becomes of
utmost importance. Not much research has been published which compares the different
options to improve the performance of Python. Moreover, most Python runtime environments have not been analysed across a broad range of general applications. Therefore I
1
The TIOBE Programming Community index, located at http://www.tiobe.com/, is an indicator of
the popularity of computer languages. It is based on search results of popular search engines like Google,
Bing, Yahoo!, Wikipedia, etc. It does not rank computer languages according to the number of lines
written in them, or how ‘good’ they are.[26]
CHAPTER 1. INTRODUCTION
3
Table 1.1: Commonly used libraries for scientific computing
Library
NumPy
SciPy
matplotlib
pandas
SymPy
scikit
StatsModels
Description
fundamental package for numerical computation
collection of numerical algorithms and domain-specific toolboxes
2D plotting library which produces publication quality figures in a variety of hardcopy formats
providing high-performance, easy to use data structures
for symbolic mathematics and computer algebra
tools for data mining and data analysis
tools for statistical computing and data analysis
analyse Python’s performance over various runtime environments, compare the differences
between these runtime environments and make recommendations about how to achieve
better performance.
In order to accomplish all this, I started researching Python and scripting languages
in general. In Chapter 2, I describe the most interesting findings about previous attempts
to improve scripting languages and Python. This research led me to the most common
approaches to executing Python. These are discussed in Chapter 3. A benchmark suite
and methodology is necessary to analyse those different runtime environments. This is
described in Chapter 4. In Chapter 5 the different runtime environments are compared
to each other and some specific characteristics of the runtime environments are analysed.
One runtime environment, PyPy, has been analysed further in Chapter 6. Finally a general
conclusion is drawn, based on the analysis, in Chapter 7.
Chapter 2
Related Work
This chapter has been divided into two parts. First the research focusses on scripting
languages in general. The goal is to learn about improving performance, in the
hope that similar approaches are possible in Python. Then the research focusses
specifically on Python itself with the intention of discovering what improvements already
have been attempted before.
2.1 Scripting Languages
JavaScript is currently very popular in the academic world and a lot of optimisations
have been suggested to improve the speed. One idea is to use parallel execution, in
order to take advantage of multiple cores. However JavaScript is entirely sequential and
most programmers would not like to use Web workers, which allow parallel execution.
Two approaches to improve the performance using parallelism have been suggested [20].
The first approach exploits loop-level parallelism, by assigning each iteration of a loop
to a separate thread. However this leads to difficult data dependencies, which means a
rollback mechanism is required. The second approach uses method-level speculation, by
using a different thread for each function call. It is necessary to predict the return values
to use this approach. Speed-ups up to eight times the speed of the sequential execution
have been obtained. No attempts have been made to implement this method for Python.
Since Python offers the possibility to use multiple threads, it might be preferred to let the
programmer decide if multiple threads are necessary.
One of the main inefficiencies of JavaScript is related to the use of eval. However it
is used a lot because of its great power. Therefore it is not desirable to just remove the
statement from the language. Instead a special tool, called the Evalorizer, is created to
reduce the usage of eval [21]. This tool was able to replace 97% of the eval invocations.
The performance did not really improve a lot, the main reason to remove the construct
was related to safety. This construct is also available in Python as a built-in function.
However it seems that this feature is not often used, which means that the benefit obtained
by removing it will be even lower.
Using a Just-In-Time compiler is another commonly used approach to improve the
speed of JavaScript. This can increase the loading time, however the running time should
be reduced. It is important to efficiently detect hot spots as early as possible. This can
be done based on the function invocation count, loop iteration count and the transition
count, which includes the caller of the compiled code if it is located far away from the
called code [18]. This approach has been attempted for Python as well and is described
in detail in Section 3.2.
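The counting idea behind hot-spot detection can be sketched in pure Python; this is a toy illustration only (the threshold value is assumed), since a real engine combines invocation, loop and transition counts as described above:

```python
HOT_THRESHOLD = 3   # assumed value, purely for illustration
compiled = set()

def maybe_jit(fn):
    # Count invocations; once a function becomes "hot", a real JIT would
    # compile it to machine code. Here we merely record the decision.
    count = 0
    def wrapper(*args, **kwargs):
        nonlocal count
        count += 1
        if count == HOT_THRESHOLD:
            compiled.add(fn.__name__)
        return fn(*args, **kwargs)
    return wrapper

@maybe_jit
def add(a, b):
    return a + b

for _ in range(5):
    add(1, 2)
assert "add" in compiled   # detected as hot after 3 invocations
```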
Since scripting languages are different from system programming languages, the benchmarking is different as well, because heap sizes, important for the garbage collector, and
dynamic compilation will influence the result. However most evaluations still use methodologies developed for C, C++ and Fortran. A solution to this problem is supplied by
the DaCapo benchmarking methodology [6], which focuses on Just-In-Time compilation.
Three methodologies are necessary for a good evaluation:
Mix: measures an iteration of an application which mixes JIT compilation and recompilation work with application time. This shows the tradeoff of total time with
compilation and application execution time.
Stable: steady-state run in which there is no JIT compilation. This is accomplished by
reusing the code generated from a previous run with JIT compilation. It measures
final code quality.
Deterministic Stable & Mix: eliminates sampling and recompilation. The JIT compiler is modified to perform replay compilation which applies a fixed compilation
plan when it first compiles each method. First it is necessary to modify the compiler
to record its decisions for each method. Then the benchmark is executed several
times and the best plan is selected. This methodology allows researchers to control
the virtual machine and compiler in order to get a better view on the influence of
proposed improvements.
A good coverage of the different hardware architectures is also advised and different
heap sizes should be taken into account in order to get a representative view.
A special compiler, called HipHop, has been created by Facebook to compile PHP to
C++ [27]. Facebook’s entire code base is compiled to C++ in three and a half minutes
on a twelve core server using the HipHop compiler. It then takes another eight minutes
to compile on a cluster. The custom built compiler completes requests about two and a
half times faster than Zend, a commonly used web application framework built in PHP
5. It is estimated that the HipHop compiler should be over five times faster than Zend,
however due to added extensions to the HipHop compiler, it is not possible to confirm this
assumption. This project shows that it is possible to combine both worlds. The interpreter
is used during the development and allows testing code quickly. The compilation process
is only used to install the code on the servers, with increased performance. However it was
necessary to drop a few commonly used features, like automatic promotion from integer
to float in case of an overflow and the eval statement. A similar approach is also possible
for Python by using Cython. However the code is compiled to C instead of C++. This
method is explained in Section 3.3.
2.2 Python
Since Python is a dynamic language, performing static compilation and detecting errors
early is very difficult. However, some researchers have asked whether those dynamic
features are actually used [14]. Alex Holker and James Harland assume that most of the
Python code will be statically typed, yet the startup code will most likely be dynamic. Since
Python uses a lazy module loading system, there is dynamic code even after the startup
code. However most of the Python code is actually static and some of the dynamic code
could be replaced by static code. 70% of their tested programs had less dynamic activity
after the startup. While this means some static analysis could be performed, there is still
a huge fraction of the applications containing dynamic code. Static analysis is not the
best method for improving the performance of Python.
Another characteristic of Python is that it is used to glue existing components together.
This means that the system’s dynamic linking and loading capabilities are under severe
stress. In order to test how Python performs in these circumstances, benchmarking should
be performed. A special benchmark has been designed to test the performance of Python
in these circumstances, called Pynamic [16]. This benchmark stresses the dynamic linking
and loading capabilities, using a predefined profile. This also stresses the operating system,
because scalable tools require parallel systems.
Python is executed similarly to Java. The Python interpreter will first generate Python
bytecode from the user application. This bytecode is then executed in a virtual machine.
Optimising the Python bytecode [5] will improve the performance of Python. In a first
stage, the bytecode is expanded by inlining the functions and applying loop unrolling.
Then it is possible to apply a variety of data-flow optimisations like value propagation,
constant propagation, algebraic simplifications, dead code elimination, copy propagation,
common subexpression elimination, etc. The first stage is very important, because it
enlarges the effect of the optimisation of the second stage. An improvement of about
10–30 percent was obtained for Pystone, the Crypto-1.2.5 Rijndael test, PyPy MD5, PyPy
SHA, Pybench and several micro tests. However a much larger speed-up is preferred.
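Some of these data-flow optimisations already exist in a modest form in CPython's peephole optimiser; constant folding, for instance, can be observed by inspecting a function's constant pool (an illustrative sketch):

```python
def f():
    return 2 * 3 * 7   # a constant expression

# The bytecode compiler folds the expression at compile time, so the
# final value 42 appears directly among the code object's constants.
assert 42 in f.__code__.co_consts
```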
All previous optimisation approaches have not led to a huge performance benefit. The
reason for this is that Python was not created to be used for scientific applications with
intensive computations. A common approach is to do the computationally intensive tasks
in a lower level language, like C or Fortran, and use Python to glue the code together.
However, is it not possible to improve the performance of Python by using different structures and statements? A study [11] shows that it is possible to improve the performance
of Python substantially by using, for example, vectorisation¹ instead of iterating over an
array. The study also compared the most optimal Python solution with solutions in C++,
C and Fortran. Those solutions were still about ten times faster than the Python solution.
However combining code written in C++, C and Fortran with Python shows about the
same performance, which proves that Python is very good as glue code.
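The idea behind such rewrites is to push the per-element loop out of interpreted bytecode and into C-level primitives; with NumPy this means whole-array operations, but the effect can be sketched with built-ins alone:

```python
# Loop-based version: every iteration executes several bytecodes
# in the interpreter.
def dot_loop(a, b):
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

# "Vectorised" version: the iteration happens inside C-implemented
# built-ins (zip/sum); NumPy's a.dot(b) takes this much further with
# optimised array kernels.
def dot_builtin(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
assert dot_loop(a, b) == dot_builtin(a, b) == 32.0
```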
Cython takes this a step further, by compiling Python code, optionally extended with
type information to improve the performance, to C. The influence has already been researched previously [4, 25]. Huge speed-ups compared to normal Python execution have
been accomplished, while less memory is required for the execution. An important advantage of Cython is the ease of use compared to other approaches like F2PY. Cython has the
capability to execute code outside the Global Interpreter Lock, however it is tedious. The
main advantage is the possibility to incrementally improve the performance. It is also not
necessary to optimise the entire program, but only the code which will result in the largest
performance gain. A more in-depth explanation about Cython follows in Section 3.3.
Another common approach to improve the performance of dynamic languages has
already been mentioned for JavaScript, namely a Just-In-Time compiler. Psyco is one of
¹ Vectorisation is the process of revising loop-based, scalar-oriented code to use matrix and vector operations.
the first projects attempting this for Python, using partial evaluation and specialisation
[23]. It is possible to get massive performance improvements, however not in the generic
case. This project has been discontinued in favour of PyPy. A few articles have been
written to explain how PyPy works [10, 9]. In Section 3.2 a detailed overview of the
project is given.
It is also possible to repurpose an existing Just-In-Time compiler. However, such repurposed JIT compilers have not given the expected performance benefit. This is because
of overreliance on the Just-In-Time compiler and traditional optimisations [12]. Specialisation is the key to improving the performance. Data-flow optimisations are the most
important ones.
Even Google has attempted to improve the performance of Python in a project called
‘Unladen Swallow’. They also decided to go for the Just-In-Time compiler approach. It was
built on top of LLVM; however, it seems the project has reached its end.
Chapter 3
Runtime Environments
A runtime environment contains everything necessary to execute a program.
This includes settings, libraries, a garbage collector, etc. However it does not
include tools to change the program.
The best-performing and most popular runtime environments that Python applications run on top of are summarized in Table 3.1. They provide three different
approaches to executing Python code. The first one, called CPython, will just
interpret the Python code without anything else. The second one, PyPy, intends to improve the speed by using a Just-In-Time compiler on top of an interpreter. The last one,
Cython, also tries to improve the performance by compiling the Python code to C and
then the C code to an executable. A more detailed description about the three approaches
follows.
3.1 CPython
CPython is the default Python interpreter, used by most people. It is written in C and
to make a distinction between the language and the interpreter the developers named it
CPython instead of Python.
3.1.1 Architecture
CPython is a very simple interpreter. Its architecture is represented in Figure 3.1. First
it will read the source code and compile it to Python bytecode, which is an intermediate
format of the Python code, similar to what happens in Java. This compilation process
is executed by a bytecode compiler. Once the Python bytecodes are generated, they are
passed to a bytecode interpreter, which will execute the instructions one after another.
Table 3.1: Python runtime environments

Environment   Language   Remark
CPython       C          the default Python interpreter
PyPy          RPython    uses a Just-In-Time compiler
Cython        Python     compiles Python code to C

CPython uses a stack-based virtual machine, which means that all objects are put on a
stack and when it is necessary to perform an operation, the required number of objects
are popped from the stack. The operation will then be performed on the objects and the
result is put on the stack again.
3.1.1.1 Garbage Collector
CPython has a generational garbage collector with three generations. New objects are
allocated in the first generation. When objects survive a few collections, set by a parameter, they are moved to the next generation. Each generation is collected less often than
the previous one. The moment to perform a garbage collection depends on the number of
allocations and deallocations.
It is also possible to bypass the garbage collector and explicitly delete objects. However
this approach is not commonly used. Since it can cause memory leaks, it is even discouraged.
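The generational structure can be observed and tuned through the standard gc module; a small illustrative sketch (the exact default thresholds may vary between CPython versions):

```python
import gc

# CPython's collector has three generations. The first threshold is the
# allocation surplus (allocations minus deallocations) that triggers a
# collection of the youngest generation; the other two control how often
# the older generations are collected.
print(gc.get_threshold())   # commonly (700, 10, 10)
print(gc.get_count())       # pending counts per generation

collected = gc.collect()    # force a full collection of all generations
print("unreachable objects collected:", collected)
```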
3.1.2 Optimisations
It is already possible to improve the performance of CPython by either passing an optimisation flag to the interpreter or by generating pyc files, which contain the Python
bytecode and eliminate the bytecode compilation step. However it is important to remark
that these improvements happen in the bytecode compiler. They will only influence the
loading time, which is the time necessary to read the script and compile it to Python
bytecode. This means that while some effort has been put into improving the speed of
CPython, it will not influence the execution time a lot. Most Python scripts are not long
enough for the loading time to become large enough to have an influence on the total
execution time. It is expected that most of the execution time is spent in running the
code itself.
3.1.2.1 Optimisation Flags
Currently it is possible to pass the -O flag or the -OO flag. This extract is taken from the
manual pages:
-O     Turn on basic optimizations. This changes the filename extension
       for compiled (bytecode) files from .pyc to .pyo. Given twice,
       causes docstrings to be discarded.

-OO    Discard docstrings in addition to the -O optimizations.
The -O flag will eliminate assert statements and the __debug__ variable is set to False.
This means that statement blocks of the form if __debug__: ... will be removed as
well. The -OO flag will also remove documentation.
As mentioned before, these optimisations do not really improve the total execution
time, only the loading time, which is only a very small fraction of the total execution
time in most cases. These flags have been provided to allow optimisations in the future.
Since they are not real optimisations and it is not useful to benchmark short running
applications, where the influence should be larger, I decided to ignore them.
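The effect of -O can nevertheless be made visible without benchmarking, because the optimize argument of the built-in compile() mirrors the command-line flag; a small sketch:

```python
import dis

src = "assert x > 0\ny = x + 1\n"

# optimize=0 keeps the assert statement; optimize=1 corresponds to -O
# and strips it from the generated bytecode entirely.
plain = list(dis.Bytecode(compile(src, "<demo>", "exec", optimize=0)))
optimized = list(dis.Bytecode(compile(src, "<demo>", "exec", optimize=1)))

assert len(optimized) < len(plain)   # the assert's bytecode is gone
```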
Figure 3.1: Architecture of CPython
3.1.2.2 PYC
The second mechanism CPython has to improve the speed of Python is by using cached
files. They have the pyc extension. These files contain the Python bytecodes of the
program, which were generated by the bytecode compiler during a previous run, however
no modifications are made to the code. Only the step to compile the Python script to
bytecode is skipped with this optimisation, however it might increase the reading time, if
the Python bytecodes take a lot of space to store.
This means that again only the loading time will be improved, because the Python
source code does not need to be compiled to Python bytecode anymore. However the
improvement will only be obtained if the pyc files are not too large, which would lead to
an increased reading time.
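These cached files can also be produced explicitly with the standard py_compile module; a small sketch using a temporary, hypothetical script:

```python
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "hello.py")
    with open(src, "w") as f:
        f.write("print('hello')\n")

    # Byte-compile the script; the returned path points at the cached
    # bytecode file that CPython would otherwise write on first import.
    cached = py_compile.compile(src)
    assert cached is not None and os.path.exists(cached)
```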
It is possible to view the Python bytecodes using the dis module. Listing 3.1 illustrates how this can be accomplished for the calculation of the nth Fibonacci number. The disassembled Python code for this calculation is visible in Listing 3.2 and clearly shows
that the virtual machine is stack based. Each time, first the necessary operands are loaded,
followed by the execution of the operator.

from dis import dis

def fib(n):
    if n < 2:
        return n
    else:
        return fib(n-1) + fib(n-2)

dis(fib)

Listing 3.1: Disassemble the Python code for the Fibonacci problem to Python bytecode
  4           0 LOAD_FAST                0 (n)
              3 LOAD_CONST               1 (2)
              6 COMPARE_OP               0 (<)
              9 POP_JUMP_IF_FALSE       16

  5          12 LOAD_FAST                0 (n)
             15 RETURN_VALUE

  7     >>   16 LOAD_GLOBAL              0 (fib)
             19 LOAD_FAST                0 (n)
             22 LOAD_CONST               2 (1)
             25 BINARY_SUBTRACT
             26 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             29 LOAD_GLOBAL              0 (fib)
             32 LOAD_FAST                0 (n)
             35 LOAD_CONST               1 (2)
             38 BINARY_SUBTRACT
             39 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             42 BINARY_ADD
             43 RETURN_VALUE

             44 LOAD_CONST               0 (None)
             47 RETURN_VALUE

Listing 3.2: Python bytecode for the Fibonacci problem
3.1.3 Multi-threaded Applications
There are currently three different mechanisms to write multi-threaded applications in
Python:
• thread based
• event based
• multiprocessing module
For now the multiprocessing module is the best approach to run concurrent
applications on different cores.
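A minimal sketch of that approach: each worker in a process pool is a separate interpreter process with its own Global Interpreter Lock, so CPU-bound work can actually run on several cores at once.

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    # Work is distributed over four worker processes; results come back
    # in the original order, as with the built-in map().
    with Pool(processes=4) as pool:
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```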
3.1.3.1 Thread Based Concurrency
This approach is commonly used by most computer languages. The basic principle is
that a sequence of instructions can run inside a thread and multiple threads can run
concurrently. Every application has at least one thread, called the main thread. In order
to manage all those threads, synchronization mechanisms are supplied. CPython uses the
same common approach most computer languages follow.
However, CPython has the Global Interpreter Lock (GIL), a mutex that prevents threads from executing Python bytecodes at the same time. This means that threaded applications cannot benefit from multiprocessor systems.
The Global Interpreter Lock is not purely a drawback. It is included in CPython because CPython's memory management is not thread-safe, and blocking or long-running operations happen outside it. The benefits are that single-threaded applications run faster and that integration with C is a lot easier. This means the Global Interpreter Lock only causes problems for multi-threaded applications which do not call C libraries and do not perform a lot of I/O.
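The effect can be demonstrated with a small, admittedly contrived experiment: a pure-Python CPU-bound function run twice sequentially and then in two threads. On CPython the threaded version is typically no faster, precisely because of the GIL (absolute timings vary per machine, so none are asserted here):

```python
import threading
import time

def count(n):
    # Pure-Python, CPU-bound loop: bytecode execution is serialised by the GIL.
    while n > 0:
        n -= 1

N = 2_000_000

start = time.time()
count(N)
count(N)
sequential = time.time() - start

start = time.time()
threads = [threading.Thread(target=count, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.time() - start

# On CPython the threaded run is typically no faster than the sequential one.
print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```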
There have been many discussions and attempts to remove the Global Interpreter
Lock. However this is not an easy task and nobody has succeeded yet. Many features
now depend on the guarantees that it enforces, which makes it even harder. Since the
multiprocessing module solves the problem with the Global interpreter lock, it seems it
will not be removed in the near future.
3.1.3.2 Event Based Concurrency
Event based concurrency is based on transactional memory. The idea is to handle ‘events’
one after the other. The ordering of the events is not deterministic, because often they
are external. The handling of each event occurs deterministically, however not in parallel
with the handling of other events.
In order to handle an event, a transaction is used. This is a tentative execution of
the code used to handle the event. When a conflict is detected with other concurrently
executing transactions, the transaction is aborted and restarted. This approach assumes
that conflicts will not occur often, and that the handling is done quickly or conflicts are
detected early in the process. There is no hardware support yet, which means that only a
software implementation is available. This causes a huge performance disadvantage [24].
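The abort-and-retry idea can be sketched with a toy version-checking scheme. This is only an illustration of optimistic concurrency with invented names, not PyPy's actual software transactional memory:

```python
class Cell:
    """A shared value with a version number that changes on every commit."""
    def __init__(self, value):
        self.value = value
        self.version = 0

def transact(cell, fn):
    # Tentatively execute fn; commit only if no other transaction
    # committed in the meantime, otherwise abort and restart.
    while True:
        seen = cell.version
        new_value = fn(cell.value)      # tentative execution
        if cell.version == seen:        # no conflict detected
            cell.value = new_value
            cell.version += 1
            return new_value
        # conflict: abort the tentative result and retry

c = Cell(10)
print(transact(c, lambda v: v + 1))  # 11
```

A real implementation would detect conflicts at a finer granularity and atomically; the point here is only the optimistic execute-check-retry structure.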
3.1.3.3 Multiprocessing Module
Using the multiprocessing module is the only option to run code simultaneously on
different cores. This approach spawns a subprocess on each core, in order to avoid the
Global Interpreter Lock. This causes complicated dependencies between the subprocesses,
because the data will need to be synchronised. This module is particularly useful when
there is not a lot of data shared between the subprocesses. Still, even if there is a lot of
data shared, this is the best option to execute code in parallel.
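A minimal usage sketch of the multiprocessing module follows; the `square` task is an illustrative stand-in for real work. Each worker lives in its own process with its own interpreter and its own GIL:

```python
from multiprocessing import Pool

def square(x):
    return x * x

def parallel_squares(n, processes=4):
    # Distribute the work over a pool of worker processes;
    # results come back in input order.
    with Pool(processes=processes) as pool:
        return pool.map(square, range(n))

if __name__ == "__main__":
    print(parallel_squares(10))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The arguments and results are pickled and sent between processes, which is exactly the synchronisation cost the text mentions when a lot of data is shared.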
3.2 PyPy
The PyPy project aims to produce a flexible and fast Python implementation. In order to
have a fast implementation, a Just-In-Time compiler (JIT) is used. The developers have
decided not to focus solely on Python. Currently there are implementations for Ruby,
PHP, Prolog, SmallTalk and a preliminary version for JavaScript.
The objective of a just-in-time (JIT) compiler for a dynamic language is to
improve the speed of the language over an implementation of the language
that uses interpretation. The first goal of a JIT is therefore to remove the
interpretation overhead, i.e. the overhead of bytecode (or AST) dispatch and
the overhead of the interpreter’s data structures, such as operand stack etc.
The second important problem that any JIT for a dynamic language needs to
solve is how to deal with the overhead of boxing primitive types and of type
dispatching. Those are problems that are usually not present or at least less
severe in statically typed languages [8].
The interpretation overhead is reduced by compiling the most frequently executed code during the execution of the program. The next time that code has to be executed, it does not need to be translated anymore. Moreover, the Just-In-Time compiler improves performance by applying optimisations. At runtime, 'hot' code, meaning frequently executed code, is discovered and then optimised and compiled to assembly. It is important that this does not take much time, since it happens during the execution; compiling code that is barely executed costs more time than it gains.
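The counter-based detection of hot code can be modelled with a toy sketch. The threshold and names below are purely illustrative; PyPy's real trigger values and bookkeeping differ:

```python
HOT_THRESHOLD = 1000  # illustrative value, not PyPy's actual default

counters = {}   # how often each loop header has been reached
compiled = set()  # loops that have been "compiled"

def on_loop_header(loop_id):
    # Count this pass over the loop header; once the loop is hot,
    # mark it compiled (a real JIT would trace it and emit assembly).
    counters[loop_id] = counters.get(loop_id, 0) + 1
    if counters[loop_id] >= HOT_THRESHOLD and loop_id not in compiled:
        compiled.add(loop_id)
        return "compile"
    return "interpret"

for _ in range(1500):
    on_loop_header("fib-loop")
print("fib-loop" in compiled)  # True
```

Loops executed fewer than HOT_THRESHOLD times stay interpreted, which is exactly why rarely executed code is never worth compiling.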
PyPy is written in RPython (Restricted Python), a language suitable for creating dynamic language interpreters. The project uses extreme programming, which means that there is no strict definition of RPython; it changes when necessary or in order to improve PyPy.
PyPy does not yet entirely support the Python 3 syntax. A lot has changed between Python 2 and Python 3, and PyPy still mainly follows the Python 2 syntax. An effort is being put into updating the interpreter to the next version, but it will be a while before this is finished. Another disadvantage is that using C libraries with PyPy is discouraged, because the Just-In-Time compiler cannot optimise this code. However, the most popular C libraries, like numpy, have been ported to work with PyPy, and the cffi module is installed by default, which allows the user to combine C code with PyPy very easily.
3.2.1 Architecture
The PyPy project contains an interpreter, a JIT component and the RPython toolchain.
The latter is used to compile PyPy itself. Of course there is also a garbage collector.
The behaviour of both the Just-In-Time compiler and the garbage collector can be
modified by setting certain flags or variables. A detailed description about them is added
in Appendix A.
3.2.1.1 PyPy's Interpreter
The interpreter consists of a bytecode compiler and a bytecode evaluator. An object space
is used to abstract all actions. This makes it easier to support other computer languages.
bytecode compiler sees the Python source code of the user application and compiles it to Python code objects. The compilation chain is fairly standard and can be seen in Figure 3.2. The resulting code objects are passed to the bytecode evaluator.
bytecode evaluator or bytecode interpreter interprets the Python code objects and
delegates the correct action to the standard object space. This is basically a Python
virtual machine.
standard object space is responsible for creating and manipulating the Python objects
seen by the application.
Listing 3.3 shows the disassembled code of the fibonacci problem from Listing 3.1. It is almost completely the same as the bytecode used by CPython. The stack-based nature of the virtual machine is again clearly visible.
Figure 3.2: The phases used by the bytecode compiler

  4           0 LOAD_FAST                0 (n)
              3 LOAD_CONST               1 (2)
              6 COMPARE_OP               0 (<)
              9 POP_JUMP_IF_FALSE       16

  5          12 LOAD_FAST                0 (n)
             15 RETURN_VALUE

  7     >>   16 LOAD_GLOBAL              0 (fib)
             19 LOAD_FAST                0 (n)
             22 LOAD_CONST               2 (1)
             25 BINARY_SUBTRACT
             26 CALL_FUNCTION            1
             29 LOAD_GLOBAL              0 (fib)
             32 LOAD_FAST                0 (n)
             35 LOAD_CONST               1 (2)
             38 BINARY_SUBTRACT
             39 CALL_FUNCTION            1
             42 BINARY_ADD
             43 RETURN_VALUE
             44 LOAD_CONST               0 (None)
             47 RETURN_VALUE
Listing 3.3: PyPy bytecode for the fibonacci problem
3.2.1.2 PyPy's Just-In-Time Compiler
A Just-In-Time compiler will search at runtime for code which is often executed, based
on counters indicating how many times a loop is executed. Once such a piece of code
is found, the Just-In-Time compiler will compile that code to a lower level language, in
most cases assembly. At this point the execution becomes platform specific. This does
not cause problems since you will not change the hardware while executing the script.
The most common optimisations like constant folding, common subexpression elimination, allocation removal, etc. are implemented in PyPy’s Just-In-Time compiler. PyPy
also applies very aggressive inlining, in order to optimise as much as possible. The methodology used is based on the principles of a tracing Just-In-Time compiler [10].
A tracing JIT works by observing the running program and recording its commonly executed parts into linear execution traces. Those traces are optimized
and turned into machine code [8].
The Just-In-Time compiler starts by tracing the bytecode of the interpreter interpreting the application-level code. After enough unrolling, a loop written in the user's application will eventually be detected. This trace is compiled by the JIT backend, which generates the assembly. The assembly is returned to the frontend, which traces the bytecode, so it can be reused the next time. Guards are placed wherever it is possible to jump out of the trace; they are added to ensure the correctness of the code. Listing 3.4 contains the trace, with guards, for the initial if-condition of the Fibonacci code in Listing 3.1, which checks whether n is smaller than two.
0 LOAD_FAST n
    guard(i4 == 1)
    guard(p5 is null)
    guard_nonnull_class(p12, ConstClass(W_IntObject), descr=<...>)
    guard(i8 == 0)
3 LOAD_CONST 2
    guard(p2 == ConstPtr(ptr25))
6 COMPARE_OP <
    i26 = ((pypy.objspace.std.intobject.W_IntObject)p12).inst_intval [pure]
    i28 = i26 < 2
    guard(i28 is true)
9 POP_JUMP_IF_FALSE 16
Listing 3.4: Python bytecode with guards for the fibonacci problem
The Python bytecode in Listing 3.4 shows that a trace does not contain all the code of the function, but only the code that is actually executed. In this case, the code for the case where n is smaller than two is not included. Therefore, a guard is inserted to catch the case when n is smaller than two.
When a guard fails, a blackhole interpreter is used to get back to a safe point, because
it is incredibly complex to move the state from the registers used by the assembled code to
the bytecode interpreter, where all values should be boxed. Once a safe point is reached,
the bytecode evaluator continues to operate using the Python bytecode.
3.2.1.3 Runtime Interaction
When a Python program is run, the Python code is compiled to Python code objects
by the bytecode compiler. These objects are passed to the bytecode interpreter, which
will interpret the Python code objects and execute them using the standard object space.
It works like a normal stack-based virtual machine. When the Just-In-Time compiler is
enabled, the tracing interpreter will trace the code. Each time a loop is found, it will
decide if the loop should be compiled, based on how many times it is executed. When
the trace needs to be compiled, the trace will be given to the JIT backend. Then the
trace is compiled to assembly and returned to the tracing interpreter. Now the assembled
version can be used, each time the same code is called. When a guard fails, it is difficult to
return to the bytecode interpreter, because the state needs to be passed and because half
a Python instruction might already have been executed. Instead a blackhole interpreter
continues until a safe point is reached. Then the bytecode interpreter continues. This
process repeats until the Python program is finished. A visual representation can be
found in Figure 3.3.
3.2.1.4 RPython Toolchain
To compile PyPy itself, a special toolchain has been developed. The job of the RPython
toolchain is to translate RPython programs into an efficient version of that program for
one of the various target platforms, generally one that is considerably lower-level than
Python. The RPython toolchain never sees the RPython source code or syntax trees, but
rather starts with the code objects that define the behaviour of the function objects one
gives it as input.
First it is important to remark that RPython code does not exist in files, but instead
it exists only in memory. Writing an RPython program means writing a program which
generates the RPython code objects in memory. The RPython toolchain itself is written
in Python, which means it can be compiled by any Python interpreter, and ipso facto by
PyPy itself.
Figure 3.3: The architecture of PyPy
Since the interpreter is given as input to the RPython toolchain, the interpreter needs
to be written in RPython. The Just-In-Time compiler is generated during the compilation
of PyPy. For this to work, it is required that a few hints are given to the interpreter. It
is however not necessary to include a Just-In-Time compiler, but this will lead to worse
performance.
The toolchain can be seen in Figure 3.4 and includes following steps:
1. The code objects are converted to a control flow graph by the Flow Object Space.
2. The control flow graphs are processed by the Annotator, which performs whole-program type inference to annotate each variable of the control flow graph with the types it may take at run-time.
3. The information provided by the Annotator is used by the RTyper to convert the high
level operations of the control flow graphs into operations closer to the abstraction
level of the target platform.
• After the RTyping phase, it is possible to insert a Just-In-Time compiler. The
Just-In-Time compiler will be generated from the hints given in the interpreter,
which means that it is not necessary to write any code, except for the hints in
the interpreter, to use the Just-In-Time compiler for a different language.
4. Optionally, various transformations can be applied which, for example, perform optimisations such as inlining or add capabilities such as stackless-style concurrency. In
this phase code is also inserted for the garbage collector and exception management.
5. The graphs are converted to source code for the target platform and compiled into
an executable.
Figure 3.4: The RPython toolchain
3.2.1.5 Garbage Collector
It is possible to use PyPy with four different garbage collectors.
Semispace copying has two arenas of equal size, but only one arena is used and gets
filled with new objects. When the arena is full, the live objects are copied into the
other arena. The old arena is then cleared.
Generational (2 generations) adds a nursery to the semispace garbage collector, which
is a chunk of the current semispace. Allocations fill the nursery, and when it is full, it
is collected and the objects still alive are moved to the rest of the current semispace.
The idea is that it is very common for objects to die soon after they are created.
Generational GCs help a lot in this case and the semispaces fill up much more slowly,
making full collections less frequent.
Hybrid (3 generations) can handle both objects that are inside and objects that are
outside the semispaces (‘external’). The external objects are not moved and collected
in a mark-and-sweep fashion. Large objects are allocated as external objects to avoid
costly moves. Small objects that survive for a certain time, based on the number of
semispace collections, are also made external so that they stop moving.
This is coupled with a segregation of the objects in three generations. Each generation is collected much less often than the previous one.
Minimark is based on the hybrid garbage collector.
It uses a nursery for the young objects, and mark-and-sweep for the old objects.
This is a moving GC, but objects may only move once (from the nursery to the old
stage).
The main difference with the hybrid garbage collector is that the mark-and-sweep
objects (the ‘old stage’) are directly handled by a custom allocator, instead of being
handled by malloc() calls. This reduces the amount of memory necessary during a
major collection compared to the hybrid garbage collector.
An incremental version of this garbage collector is available which is used by default.
For the benchmarking of PyPy the default incremental version of the minimark garbage
collector has been used.
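As an aside, the generational hypothesis ('objects die young') also underlies CPython's own cyclic garbage collector, whose per-generation state can be inspected with the gc module. This is a CPython illustration, not one of PyPy's collectors:

```python
import gc

# CPython's cycle collector is generational as well: three generations,
# where older generations are collected much less often than younger ones.
print(gc.get_threshold())  # per-generation collection thresholds
print(gc.get_count())      # current allocation counts per generation
```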
3.2.2 Multi-threaded Applications
PyPy uses the exact same mechanisms as CPython for multi-threaded applications. This
means that PyPy has issues with the Global Interpreter Lock as well. It is also possible
to use the other approaches with PyPy. Again the best approach is the multiprocessing
module.
3.3 Cython
Cython is an optimising static compiler for both the Python programming
language and the extended Cython programming language. It makes writing
C extensions for Python as easy as Python itself.1
Cython allows you to combine Python with C and C++. It simplifies writing Python code that natively calls into, and is called back from, C or C++ code. Furthermore, it can convert Python code, optionally enhanced with Cython language statements that add static type information, to C or C++.
Listing 3.5 contains the Python code for the pairwise distance calculation problem, while the Cython code can be found in Listing 3.6. This common scientific problem calculates the distances between all pairs of a set of points. The example clearly shows that the Cython code is longer and more complicated. It will be difficult for novice users to add the type declarations. Most people choose Python for the ease of programming, and dynamic typing is an important part of that. The Cython language makes it possible to add more information, which allows Cython to translate the code to C better and to optimise it further, resulting in faster execution. However, doing this for real Python applications is complicated and might not attract many users.
1 cython.org

import numpy as np

def pairwise(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=np.float)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D
Listing 3.5: Python code for the pairwise distance calculation problem
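For reference, the same computation can be written in pure Python without NumPy. This cross-check is my own sketch, not part of the benchmark code:

```python
import math

def pairwise_py(X):
    # X is a list of points, each point a list of coordinates.
    M = len(X)
    N = len(X[0])
    D = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i][k] - X[j][k]
                d += tmp * tmp
            D[i][j] = math.sqrt(d)
    return D

D = pairwise_py([[0.0, 0.0], [3.0, 4.0]])
print(D[0][1])  # 5.0
```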
import numpy as np
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return np.asarray(D)
Listing 3.6: Cython code for the pairwise distance calculation problem
3.4 Other Runtime Environments
There are some other runtime environments available, like py2exe, cx_Freeze, Shed Skin, etc. Even Google has attempted to improve Python's performance in a project called Unladen Swallow. However, I believe the most important runtime environments are included: PyPy is the best-performing Just-In-Time compiler, Cython is an optimising static compiler and CPython is the default interpreter. This covers the three different, commonly used approaches.
3.5 Conclusion
There are three common approaches to running Python code:
• interpretation
• compiling to a lower level language (commonly C)
• interpretation and a Just-In-Time compiler
For each of those approaches, a commonly used runtime environment has been chosen. CPython is the default interpreter, used by most people. Both Cython and PyPy try to improve the performance of Python. Cython accomplishes this by converting the Cython code to C and then compiling it to assembly. To get an optimal result, it is however necessary to add extra type information, which is not easy and might not attract a lot of users. PyPy tries to improve the execution time by using a Just-In-Time compiler, which means that it is not necessary to add type information. However, its support for the third version of Python is currently very poor.
Chapter 4
Benchmarking
After all, facts are facts, and although we may quote one to another with
a chuckle the words of the Wise Statesman, ’Lies–damned lies–and statistics,’
still there are some easy figures the simplest must understand, and the astutest
cannot wriggle out of. — Leonard Henry Courtney, 1895
The goal of benchmarking is to obtain measurements on a computer system by executing a computing task, which will allow comparison between different hardware and software combinations. Both the computer system and the task cause a
difficulty with benchmarking a computer language. It is important that the measurements and benchmarks are representative, so that we can generalise about the programming language instead of being specific to the computer system or task. Therefore it is necessary to have benchmarks which are I/O-, memory- and CPU-intensive. Multi-threaded benchmarks should also be included. While it is not possible to ensure a similar
behaviour on different hardware, using multiple benchmarks should also result in a similar
behaviour on most commonly used machines. The same problems should surface on most
commonly used hardware components. The most important task of benchmarking starts
with choosing the correct benchmarking suite, which groups together computer tasks in a
wide variety of domains. It is important that the benchmarks give a representative view
of the language.
Benchmarking a dynamic language is even more complicated, because dynamic components, such as a garbage collector and Just-In-Time compiler, make the runs nondeterministic. Therefore it is necessary to have a good benchmarking methodology in
order to draw correct conclusions.
The next step is to decide which characteristics to measure. It is logical to measure
the execution time of each benchmark, but this will not explain the behaviour and will not
result in clear conclusions. To get a better understanding of the runtime environments, it is
interesting to measure hardware events, like the number of cycles, instructions, branches,
etc. Since this is a very demanding task, I automated this using shell and Python scripts.
The most important part is the interpretation of the results, which happens after the
benchmarking. The results are explained in Chapter 5 and Chapter 6.
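Such an automation script might look roughly like the following sketch. This is hypothetical; the actual shell and Python scripts used for this thesis are not shown here, and a real hardware-event measurement would use PAPI or perf rather than wall-clock time:

```python
import statistics
import subprocess
import sys
import time

def run_benchmark(cmd, iterations=5):
    # Run `cmd` repeatedly and report the mean and stdev of wall-clock time.
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)

mean, stdev = run_benchmark([sys.executable, "-c", "sum(range(10**6))"], iterations=3)
print(f"{mean:.3f}s +/- {stdev:.3f}s")
```

Running each benchmark several times and reporting a spread, rather than a single number, is what makes the later statistical interpretation possible.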
4.1 Benchmarking Suites
Since it is important to have benchmarks which are representative of the language, I looked at which benchmarks are used in academic research. It seems that microbenchmarks are most often used, to compare the behaviour in a very specific domain. Some of these benchmarks are even included with Python by default, like pystone, iobench, etc. However, these do not help my research, because I want to get a wide view of the behaviour. Eventually I found two benchmarking suites which provide a decent number of benchmarks in various domains. One, called The Grand Unified Python Benchmark Suite, focuses entirely on the Python language. The other, called The Computer Language Benchmarks Game, aims to compare computer languages with each other.
4.1.1 The Grand Unified Python Benchmark Suite
This project is intended to be an authoritative source of benchmarks for all
Python implementations. The focus is on real-world benchmarks, rather than
synthetic benchmarks, using whole applications when possible.1
Most of the benchmarks are based on work from the Unladen Swallow project by Google, and they are used by PyPy to show its performance. There is a website [2] comparing different PyPy versions with CPython versions, as can be seen in Figure 4.1. This suite contains 55 benchmarks; the following are the most commonly used:
• 2to3
• calls
• django
• fastpickle
• fastunpickle
• float
• html5lib
• html5lib_warmup
• mako
• nbody
• nqueens
• pickle
• pickle_dict
• pickle_list
• pybench
1 The homepage of The Grand Unified Python Benchmark Suite
• regex
• richards
• rietveld
• slowpickle
• slowspitfire
• slowunpickle
• spitfire
• spambayes
• startup
• threading
• unpack_sequence
• unpickle
After testing out the benchmarks, I noticed most of them are really short, as can be
seen in Figure 4.2. In order to get a representative view of the performance, I searched
for a different benchmarking suite.
4.1.2 The Computer Language Benchmarks Game
We are trying to show the performance of various programming language implementations – so we ask that contributed programs not only give the correct
result, but also use the same algorithm to calculate that result.2
This suite contains only 15 benchmarks, but most of them take quite a while to complete. It also offers the possibility to compare with other languages, but PyPy cannot run all of the benchmarks. This means the number of benchmarks I could use is reduced even more, because of PyPy's poor support for the latest version of Python, as mentioned in Section 3.2.
An advantage of this suite is that it compares computer languages with each other. This means that I can easily compare Python with other languages, such as C,
which is known to have very good performance. The benchmarks I was able to use are
listed in Table 4.1.
After going through the source code I was able to draw some conclusions about the
characteristics of each of those benchmarks:
n-body performs mainly mathematical operations like multiplications, additions, subtractions, and even a few divisions and exponentiations. The data used to execute
these operations are predefined in the source code as a dictionary, containing tuples
and lists. The results are written to a list. Since the most executed operations are
mathematical, it is considered a CPU-intensive benchmark, however it will also be
necessary to load and store a small amount of memory.
2 The website of The Computer Language Benchmarks Game
Figure 4.1: Comparison between PyPy and CPython by speed.python.org
Table 4.1: Benchmarks used to evaluate the performance

fannkuch-redux    Repeatedly access a tiny integer-sequence
spectral-norm     Calculate an eigenvalue using the power method
n-body            Perform an N-body simulation of the Jovian planets
k-nucleotide      Repeatedly update hashtables and k-nucleotide strings
fasta-redux       Generate and write random DNA sequences
binary-trees      Allocate and deallocate many many binary trees
[Bar chart of execution time in seconds for the benchmarks 2to3, call_method, call_method_slots, call_method_unknown, call_simple, float, iterative_count, nbody, normal_startup, nqueens, pickle_dict, pickle_list, regex_compile, regex_effbot, regex_v8, richards, slowpickle, slowunpickle, startup_nosite, threaded_count, unpack_sequence and unpickle_list.]
Figure 4.2: Time measurements for The Grand Unified Python Benchmark suite
spectral-norm performs mainly multiplications and additions. It will also execute a few
bit shifts and divisions. The data used to perform the calculations are based on the
indices of a loop, iterating over an array. The amount of loading and storing is very
small for this benchmark. It is a pure CPU-intensive benchmark.
fannkuch-redux mainly swaps data and performs some arithmetic to calculate the locations to access and to decide when to stop. The amount of memory used is very small; therefore it is considered a CPU-intensive benchmark rather than a memory-intensive one.
k-nucleotide first reads a file containing a very large DNA sequence. Then it will find certain sequences in the sequence read from the file and perform a sort operation. Both
operations happen in parallel, which means this benchmark also includes threading. The most important part of this benchmark is the reading, which makes it an
I/O-intensive benchmark and there will also be some memory operations.
fasta-redux provides the input for k-nucleotide by generating a very large DNA sequence.
A random lookup table is generated to create the DNA sequence. Generating the
table is done by doing some operations on predefined data, which means this benchmark will use the CPU and the memory, however the most important part is writing
to a file. This makes it an I/O-intensive benchmark.
binary-trees first creates a huge binary tree. Then it counts the number of trees having
a certain depth in parallel. The main characteristic of this benchmark is the amount
of memory it consumes, which makes it a memory-intensive benchmark. Of course
it also uses threading.
The types of the benchmarks are summarised in Table 4.2. Even though the number of benchmarks I am able to use is reduced, I still have CPU-, I/O- and memory-intensive benchmarks. Furthermore, two benchmarks use threading. This means I should get a representative view of the performance of Python in the most common areas.
Common arguments are supplied for each of the benchmarks in this suite. I have tested these commonly used arguments and kept the ones for which the execution is not too short. The arguments used for each benchmark are listed in Table 4.3.
4.2 Benchmarking Methodologies
In Chapter 3, I already mentioned the different runtime environments I am going to benchmark. However, PyPy is a special case, because it has the Just-In-Time compiler. It would be interesting to get a better understanding of the working and effect of the Just-In-Time compiler. To accomplish this, I created the 'PyPy stable' methodology (PyPyS).
Table 4.2: Used benchmarks grouped by type

CPU               I/O             memory
n-body            k-nucleotide*   binary-trees*
spectral-norm     fasta-redux
fannkuch-redux

(*) uses multiple cores
Table 4.3: Arguments used for each benchmark

benchmark         arguments
fannkuch-redux    11, 12
spectral-norm     3000, 5500
n-body            5000000, 50000000
k-nucleotide      2500000
fasta-redux       2500000, 25000000
binary-trees      20
The stable methodology has been used before for research about Java and JavaScript [6]. The idea of this methodology is to eliminate as much of the overhead of the Just-In-Time compiler as possible, in order to measure the behaviour of the user application. This is accomplished by first executing the application with the Just-In-Time compiler enabled, and then benchmarking it without the Just-In-Time compiler, using the already-compiled code. Comparing with a normal execution then shows how much time and how many resources are actually lost to Just-In-Time compilation.
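The general warmup-then-measure pattern behind the stable methodology can be sketched as follows. This is a generic illustration only; the actual harness (Listing 4.1) additionally disables the JIT after warmup and forces garbage collections between iterations:

```python
import statistics
import time

def measure_stable(workload, warmup=3, iterations=10):
    # Warm-up runs give a JIT the chance to compile the hot code;
    # only the later iterations are measured (the "stable" behaviour).
    for _ in range(warmup):
        workload()
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

print(measure_stable(lambda: sum(range(100_000))))
```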
A simplified version of the code used to measure the stable behaviour can be found in
Listing 4.1. First the code prepares for acquiring the results. This setup code contains
the number of iterations each hardware event is measured and the location of a log file.
Furthermore code has been added to load C libraries, used to measure the hardware
events. Finally the benchmark and the argument for the benchmark are taken from the
input arguments and the benchmark is loaded into the memory. After the setup code, first
an initial run of the benchmark is executed with the Just-In-Time compiler enabled. This
makes sure the code is compiled to assembly, to be reused later on. After this initial run,
the Just-In-Time compiler is turned off and a garbage collection of the heap is forced. This
ensures that the next iteration starts with a fresh heap so the results are not influenced.
Next the hardware events and time are measured consecutively, and after each iteration
a garbage collector is forced. The results are stored in a two dimensional space. Finally,
after acquiring all measurements, the results are written to the log file and standard out.
At the end of the harness some clean up code has been added. This frees the memory
used by the C libraries.
However, this harness does not work for the multi-threaded benchmarks, because of the techniques used by the multiprocessing module. On each execution a new interpreter is launched on each core, which means that a new Just-In-Time compiler is created as well. The k-nucleotide benchmark also reads data from a file, which makes it a lot more difficult to eliminate the Just-In-Time compiler while still using the compiled code. Therefore no results have been acquired for those two benchmarks.
from cffi import FFI
from gc import collect
from sys import argv
from time import time
from pypyjit import set_param  # PyPy-specific: controls the JIT at runtime

NR_IT = 10
LOG = 'harness_log/log'

ffi = FFI()
ffi.cdef('''
    int setup();
    void teardown();
    int start(int iter);
    long long *stop(int iter);
''')
with open('counters.c') as r:
    src = r.read()
C = ffi.verify(src, libraries=['papi'])
nr_events = C.setup()

ben = argv[1]
arg = int(argv[2])
mod = __import__('py_' + ben)

# warmup
set_param("default")
start = time()
mod.main(arg)
end = time()

# iterations
set_param("off")
collect()
for j in range(nr_events):
    for i in range(NR_IT):
        nr = C.start(j)
        start = time()
        mod.main(arg)
        end = time()
        res = C.stop(j)
        collect()

# present results
print(results)
C.teardown()
Listing 4.1: Simplified version of the harness for the stable behaviour
Another method to analyse PyPy's Just-In-Time compiler is to disable it entirely
(PyPyNJ). Since it is an optional part of PyPy, it is very easy to do this. I have also
added this approach to the environments used to perform benchmarking. The resulting
runtime environments are listed in Table 4.4.
Table 4.4: The different benchmarked runtime environments with their most important cost

environment   cost              remark
CPython       interpretation    default Python interpreter
PYC           interpretation    CPython with reduced loading time
PyPy          JIT               Just-In-Time compiler
Cython        -                 compiles Python to C
C             -                 compiled with GCC
PyPyS         (optimised)       PyPy's stable behaviour
PyPyNJ        (not optimised)   PyPy without Just-In-Time compiler
Remember that both PyPyS and PyPyNJ run without Just-In-Time compilation.
However, the stable runtime environment will use previously generated and optimised
assembly, while the PyPyNJ runtime environment will not use any optimised code.
4.3 Setup
I have used the Bluepower machine at ELIS to do the benchmarking, which has the
following hardware characteristics:
# processors    : 8
vendor id       : GenuineIntel
model name      : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
cpu MHz         : 1596.000
cache size      : 8192 KB
cpu cores       : 4
cache alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
It is running on Ubuntu 11.04 (Natty Narwhal). The following versions of the different
runtime environments are installed:
CPython : 3.2
PyPy    : 2.7.3
GCC     : 4.5.2
Cython  : 0.19.2
Neither CPython nor PyPy limits the heap size, which means that by default the
entire RAM will be used if necessary. PyPy's nursery size defaults to four megabytes
or half the cache size. For this machine, PyPy's nursery size is about four megabytes
and the maximum heap size for both CPython and PyPy is about fourteen gigabytes.
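These sizes can also be pinned down explicitly: PyPy's garbage collector can be tuned through environment variables before the interpreter starts. A minimal sketch follows; the variable names are taken from PyPy's GC documentation and the values mirror the defaults just described, but treat both as assumptions for this particular PyPy version:

```python
import os

# Sketch: tune PyPy's GC through environment variables before launching it.
# PYPY_GC_NURSERY and PYPY_GC_MAX are documented PyPy GC knobs; the values
# here mirror the defaults observed on this machine (assumptions, not
# measured requirements).
env = dict(os.environ)
env.setdefault("PYPY_GC_NURSERY", "4MB")   # young-generation (nursery) size
env.setdefault("PYPY_GC_MAX", "14GB")      # cap on the total heap size
```

The resulting `env` mapping would then be passed to the process launching PyPy.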
4.4 Hardware Events
As mentioned previously, the hardware events lead to a better understanding of the
different runtime environments. They will show the differences in behaviour between
the runtime environments and will bring out problems and bottlenecks. Measuring
hardware events can be accomplished with perf and PAPI. Perf works from the command
line, while PAPI is a library which can be used from within C code. Both are discussed
in detail below.
4.4.1 Perf
Using the perf list command, it is possible to list all available predefined events. I measured the most interesting ones. However it became clear that perf is not to be trusted
to give correct results for all events. To make sure all values are correct I compared the
results of the perf predefined events with the ones obtained by using raw events. These
raw events are found in the software developer's manual of the processor's manufacturer –
in my case the Intel manual [15]. This confirmed that perf did not return correct values
for the UNCORE events – in my case the last level cache events. Therefore I used the
raw event descriptors to get correct values. There were also certain results I could not
verify to be correct, because not all predefined perf events are listed in the Intel manual.
This leaves me with the following events, for which I definitely obtained the correct
values:
• cycles
• instructions
• branches
• branch-misses
• L1-dcache-loads
• L1-dcache-load-misses
• L1-dcache-prefetches
• L1-dcache-prefetch-misses
• L1-icache-loads
• L1-icache-load-misses
• UNC_L3_HITS.READ (raw event)
• UNC_L3_MISS.READ (raw event)
• UNC_L3_HITS.WRITE (raw event)
• UNC_L3_MISS.WRITE (raw event)
• dTLB-load-misses
• dTLB-store-misses
• iTLB-load-misses
I also made sure perf did not scale the events, by doing more iterations with fewer
events per iteration. When there are not enough hardware counters to measure the
supplied hardware events, perf measures each event for a smaller period and scales the
results accordingly. Using fewer hardware events stops perf from scaling the results;
therefore I measured fewer hardware events per iteration and ran the benchmark multiple
times to gather all counters without scaling.
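The grouping strategy can be sketched as follows. The `perf stat -e` syntax is real, but the benchmark command and the group size of four counters are illustrative assumptions about this particular machine:

```python
# Split the measured events into small groups so perf never has to
# multiplex hardware counters and scale the results; each group is then
# measured in a separate full run of the benchmark.
EVENTS = [
    "cycles", "instructions", "branches", "branch-misses",
    "L1-dcache-loads", "L1-dcache-load-misses",
    "L1-icache-loads", "L1-icache-load-misses",
]

def chunks(seq, n):
    # Yield consecutive slices of length n from seq.
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

def build_commands(benchmark_cmd, group_size=4):
    # One `perf stat` invocation per event group; because every run
    # measures the whole benchmark, no per-event scaling is needed.
    return [
        ["perf", "stat", "-e", ",".join(group)] + benchmark_cmd
        for group in chunks(EVENTS, group_size)
    ]

for cmd in build_commands(["pypy", "py_nbody.py", "500000"]):
    print(" ".join(cmd))
```

Each printed command would be run separately; the counts from the different runs are then combined into one result set.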
4.4.2 PAPI
PAPI does not allow the use of raw event descriptors, so the raw events could not be
encoded. PAPI also predefines multiple hardware events, however these were not
correct either. The perf events are also available from PAPI and I decided to use those.
I was able to verify their correctness by using PAPI and perf simultaneously. However,
there are no results for the level-3 cache. These are all the hardware events I measured
using PAPI:
• cycles
• instructions
• branches
• branch-misses
• L1-dcache-loads
• L1-dcache-load-misses
• L1-dcache-prefetches
• L1-dcache-prefetch-misses
• L1-icache-loads
• L1-icache-load-misses
• dTLB-load-misses
• dTLB-store-misses
• iTLB-load-misses
PAPI will not scale the events, so I was forced to measure multiple runs to get all
results.
4.5 Conclusion
There are seven runtime environments I have decided to benchmark:
• CPython
• PYC
• PyPy
• Cython
• C
• PyPyS (stable)
• PyPyNJ (without Just-In-Time Compiler)
The first two runtime environments are included to show the behaviour of the most
used Python interpreter. C has been included to compare with a different programming
language, one of the fastest currently in use. The last two environments have been added
to get a better understanding of PyPy's Just-In-Time compiler.
After going through the various benchmarking suites, I decided to use The Computer
Language Benchmarks Game, because the benchmarks take a considerable time and there
are benchmarks included for each of the most important domains. An added bonus is that
it easily allows comparison with other languages, most importantly C. This benchmarking
suite should give a representative overview of the different runtime environments.
All hardware events are measured using perf, except for the stable behaviour of PyPy,
which is measured with PAPI. I selected the events to measure so that the measurements
are correct and reveal the differences in behaviour between the runtime environments.
Chapter 5
Analysis Runtime Environments
After acquiring all results, it is time to dig deeper and compare the different
runtime environments. The goal of this chapter is to find the best performing
runtime environment, for which the most important results are the time measurements.
This is followed by a comparison between the various approaches to executing
Python code and C. However, first a more in-depth analysis is provided of a few specific
runtime environments.
5.1 Preliminary Comparison
A preliminary comparison will provide a better overview of the capabilities of the different
runtime environments. The pairwise distance calculation problem, as mentioned in
Section 3.3, is used for this purpose. The Python and Cython code can be found in
Listing 3.5 and Listing 3.6 respectively.
Figure 5.1 shows the time measurements for the different runtime environments for
1000 points. According to these results, CPython is actually the slowest runtime
environment, even though it is the most used one. Cython does not improve the speed when
no type information is supplied, even though it compiles the Python code to C and performs
some optimisations. It appears to optimise very little when the type information is not
available. Once the type information is added, a huge performance benefit is obtained.
PyPy is able to improve the performance drastically for this example, without any change
to the code. It appears that the Just-In-Time compiler is very effective.
The biggest difference is between Cython with and without type information. Cython
can improve the speed far more than PyPy, but it cannot do so without the extra
information.
5.1.1 Cython Type Information
There is a huge difference in execution time when adding type information to Cython.
An analysis of the generated C code gives a better understanding of the modifications
made when type information is available. The first difference is the size of the C file.
Without the type information the C file contains 2589 lines and is about 106 kilobytes,
while the C file generated by Cython with type information contains 15878 lines
and is about 580 kilobytes. These C files are compiled to libraries by Cython. The sizes
of the libraries are 95 and 486 kilobytes without and with type information respectively.
The files
Figure 5.1: Time measurements for the pairwise distance calculation problem with 1000
points (logarithmic time axis in seconds; PyPy, CPython, Cython, and Cython with type
info)
generated with type information are obviously a lot larger than without. The source code
helps explain this huge difference.
The initial part of the C code of the pairwise distance calculation problem is visible in
Listing 5.1 and the same part of the optimised version is shown in Listing 5.2. This clearly
shows the difference caused by adding the type information. Without the information,
PyObjects are used to perform the calculations, otherwise the specific types can be used.
These PyObjects contain the reference count, used by the garbage collector to free unused
memory, and a type pointer. The data of the variable is not stored in a PyObject, instead
a pointer to the value is stored. This means the value is ‘wrapped’. The array is replaced
by a __Pyx_memviewslice variable, which contains a pointer to the data and some extra
information like the size. The advantage of this type is that it is very easy to access
the data, as can be seen in Listing 5.3. The same access using a PyObject is much more
complex, because references must be followed and extra checks are required. This is shown
in Listing 5.4.
PyObject *__pyx_v_M = NULL;
PyObject *__pyx_v_N = NULL;
PyObject *__pyx_v_D = NULL;
PyObject *__pyx_v_i = NULL;
PyObject *__pyx_v_j = NULL;
PyObject *__pyx_v_d = NULL;
PyObject *__pyx_v_k = NULL;
PyObject *__pyx_v_tmp = NULL;

Listing 5.1: initial part of the C code generated by Cython for the pairwise distance
calculation problem

int __pyx_v_M;
int __pyx_v_N;
__Pyx_memviewslice __pyx_v_D = { 0, 0, { 0 }, { 0 }, { 0 } };
int __pyx_v_i;
int __pyx_v_j;
double __pyx_v_d;
int __pyx_v_k;
double __pyx_v_tmp;

Listing 5.2: initial part of the C code generated by Cython for the pairwise distance
calculation problem with optimised Python code
__pyx_v_M = (__pyx_v_X.shape[0]);

Listing 5.3: access an element of a __Pyx_memviewslice structure
__pyx_t_1 = __Pyx_PyObject_GetAttrStr(__pyx_v_X, __pyx_n_s__shape);
if (unlikely(!__pyx_t_1)) {
    __pyx_filename = __pyx_f[0];
    __pyx_lineno = 4;
    __pyx_clineno = __LINE__;
    goto __pyx_L1_error;
}
__Pyx_GOTREF(__pyx_t_1);
__pyx_t_2 = __Pyx_GetItemInt(__pyx_t_1, 0, sizeof(long), PyInt_FromLong, 0, 0, 1);
if (!__pyx_t_2) {
    __pyx_filename = __pyx_f[0];
    __pyx_lineno = 4;
    __pyx_clineno = __LINE__;
    goto __pyx_L1_error;
}
__Pyx_GOTREF(__pyx_t_2);
__Pyx_DECREF(__pyx_t_1);
__pyx_t_1 = 0;
__pyx_v_M = __pyx_t_2;
__pyx_t_2 = 0;

Listing 5.4: access an element of a PyObject structure
This same problem occurs with the for loop, however the code becomes even more
complicated without the type information. Since the type is not known, code is added
for lists and tuples as well. It takes over forty lines of C code to get the correct element,
while the same task is accomplished in three lines of code with the type information. The
reference counting increases the complexity of the code and results in a performance loss.
All these things are related to the time difference, but they also explain the difference in
size. For each type, operations are defined and the code which performs the operations is
added as well. Since the version without type information only uses one type, PyObject,
not much other code is necessary. The version with type information needs a lot of
functions to perform those operations. A few operations for the memoryview object are
shown in Listing 5.5. Often those operations are implemented for the best-case scenario,
for example by ignoring overlap for array operations; extra code, together with checks,
is added in case this assumption is wrong. This also provides better exception management
with more specific errors.
static PyObject *__pyx_memoryview_transpose(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview__get__base(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview_get_shape(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview_get_strides(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview_get_suboffsets(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview_get_ndim(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview_get_itemsize(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview_get_nbytes(PyObject *__pyx_v_self);
static PyObject *__pyx_memoryview_get_size(PyObject *__pyx_v_self);

Listing 5.5: A few operations which can be used on the memoryview object
The basic idea behind Cython is to focus on generating code which is as good as
possible. There is no concern about the size of the code. After the code generation, the C
compiler will optimise even more, and during the compilation all unnecessary code will be
removed. Inlining will improve the instruction locality and enable more optimisations.
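The reference-counting overhead discussed earlier in this section can be observed directly from CPython itself. This is a small illustration, not part of the thesis experiments; `sys.getrefcount` is CPython-specific and its absolute value includes temporary references:

```python
import sys

# Every PyObject carries a reference count; binding another name to the
# same object bumps it, and the count must be maintained on every
# assignment, argument pass, and scope exit.
x = object()
before = sys.getrefcount(x)
alias = x                      # one extra reference to the same object
after = sys.getrefcount(x)
```

After the aliasing assignment, `after` is exactly one higher than `before`, which is the bookkeeping the generated C code in Listing 5.4 performs with `__Pyx_GOTREF` and `__Pyx_DECREF`.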
5.1.2 Type Guessing
In Section 5.1.1, the influence of type information was investigated. A huge speed-up
was obtained when type information was added; without it, however, Cython is not useful.
Since it is difficult for a lot of users to add the information, a different approach is
necessary. A possibility to resolve this issue is to guess the type information. This step
could be added before the compilation of Python code to C.
As mentioned in Section 2.2, not all Python code is static. The lazy loading of modules
results in dynamic code. However, this should not pose problems for guessing the type
information. The real problems are related to the fact that a variable can have multiple
different types during its existence. This could be solved by creating new variables instead.
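The variable-splitting idea can be sketched in a few lines; the function and variable names are purely illustrative:

```python
# A variable whose type changes during its lifetime defeats static typing:
def untyped():
    x = 10          # x is an int here
    x = str(x)      # ...and a str here; no single C type fits x
    return x

# Splitting it into fresh single-type variables (an SSA-style rewrite)
# gives each name exactly one type that a guesser could infer:
def typed():
    x_int = 10
    x_str = str(x_int)
    return x_str
```

Both functions compute the same value; only the second admits a straightforward type assignment per variable.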
However, the main problem for this approach is the complexity of the Python types.
The combination of tuples, arrays and lists can cause incredibly complex type definitions,
yet they are used very often in Python because of the flexibility of the for loop. These
have to be translated to types closer to C. Since this is a very difficult task even when
done manually, automating it seems almost impossible. Therefore it is necessary that
the Python code is rewritten to be more similar to C. This means that no complex types
and type combinations should be used, which reduces the ease of programming in Python.
Of course, not all Python code needs to be optimised in order to obtain a decent
performance gain. However, these complex structures are most often used in loops,
meaning they are the ones that need type information.
Type guessing in combination with Cython has not been tried before. It could yield
huge speed-ups; however, guessing the types of Python variables might be too complex.
The problem is related to the combination of lists, tuples, dictionaries, etc.
5.2 PyPy Beats C
First I would like to mention that this is a very specific example; by no means do I imply
that PyPy is actually faster than C in general. The intention is to show the benefit and
the power of PyPy and Just-In-Time compilation.
This crafted example performs a string append with two integers in a loop. Both
integers are simply the same loop variable. The Python code is included in Listing 5.6.
A new variable is created for the string append every iteration. There are two versions
for C. In Listing 5.7 the resulting string is stored in the same variable (on the stack) each
time. In Listing 5.8 a new variable is created and freed each iteration. Since PyPy has
a garbage collector, the variable will not be freed each iteration. However the behaviour
should be closer to the latter C version.
for i in xrange(10000000):
    "%d %d" % (i, i)

Listing 5.6: Python string append

#include <stdio.h>
#include <stdlib.h>

int main() {
    int i = 0;
    char x[44];
    for (i = 0; i < 10000000; i++) {
        sprintf(x, "%d %d", i, i);
    }
}

Listing 5.7: C string append using the stack

#include <stdio.h>
#include <stdlib.h>

int main() {
    int i = 0;
    for (i = 0; i < 10000000; i++) {
        char *x = malloc(44 * sizeof(char));
        sprintf(x, "%d %d", i, i);
        free(x);
    }
}

Listing 5.8: C string append using the heap
Table 5.1 contains the time measurements. CPython has been added to show that it
is really PyPy doing the work and not the Python code. I have made certain that the
string append operation has not been optimised away by PyPy. We notice that PyPy is
even twice as fast as the quickest C implementation and almost three times as fast as the
C implementation with the variable created on the heap.
PyPy’s Just-In-Time compiler, which works on traces at runtime, can inline and unroll
the string append operation. The string append is a very generic function, but because
of the inlining, specialization can be applied on the arguments. GCC on the other hand
is not able to do this, because the sprintf call sits in libc. This means the generic
function has to be called each time the sprintf is called, which results in much slower
performance.
This example truly shows the power of Just-In-Time compilation, namely all code can
be seen at runtime and optimisations have as much information available as possible.
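A minimal way to reproduce the comparison on any interpreter is to time the loop from Listing 5.6 with the standard timeit module. The iteration count here is reduced for a quick check, the loop uses range rather than the Python 2 xrange of the listing, and the absolute numbers depend entirely on the machine:

```python
import timeit

# Time the string-append loop from Listing 5.6; running this same script
# under CPython and under PyPy reproduces the comparison in Table 5.1
# (with a smaller n, so the check finishes quickly).
def string_append(n=100000):
    for i in range(n):
        "%d %d" % (i, i)

elapsed = timeit.timeit(string_append, number=1)
print("%.3f s" % elapsed)
```

For stable numbers one would raise `n` back to the listing's ten million and repeat the measurement several times.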
5.3 Time Measurements
Figure 5.2 contains the time of the various runtime environments, divided by the time
of CPython. This figure clearly shows that C is still the quickest, as expected. Both
CPython and Cython appear to be very slow. This means that the most used Python
runtime environment – CPython – is actually one of the slowest available. The Cython
runtime environment compiles the Python code to C, which means we would expect
behaviour much like C code. This is clearly not the case. Section 3.3 mentioned that
Cython needs type information to improve the performance of the Python code. The lack
of type information results in very poor performance. While PyPy is not able to beat
C, its performance is very good for CPU-intensive benchmarks. Memory problems cause
a smaller benefit for I/O- and memory-intensive benchmarks. For k-nucleotide the
performance is even worse than CPython's. This is analysed in Section 5.4 with hardware
events.
Figure 5.2: Time comparison between the runtime environments normalised to CPython
(bars for C, Cython, CPython and PyPy; benchmarks grouped into CPU-, I/O- and
memory-intensive)
Table 5.1: Time measurements of the string append problem

runtime     time (s)   remark
CPython     15.087     the default Python interpreter (using the same code as PyPy)
PyPy         1.187
C (stack)    2.709     the result is stored in the same variable
C (heap)     3.115     the result is stored in a new variable each iteration
5.4 Hardware Events
As discussed in Section 4.4, hardware events were measured to get a better understanding
of the runtime environments and the differences between them. All events are divided by
the number of instructions, to scale the numbers appropriately and obtain a fair comparison
between runtime environments. Otherwise, a runtime which takes a lot of time will
logically result in, for example, a higher number of CPU cycles. Note that the benchmarks
will always have an influence on the results. A memory-intensive benchmark will have more
data cache loads per instruction than, for example, a CPU-intensive benchmark. Therefore
it is important to compare the results per domain.
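The normalisation is a straightforward division; a minimal sketch follows, with illustrative (not measured) counts:

```python
# Normalise raw hardware-event counts by the instruction count, so that
# runtimes with very different run lengths can be compared fairly.
raw_counts = {                 # illustrative numbers, not measured values
    "instructions": 2000000,
    "cycles": 3100000,
    "branch-misses": 40000,
}

def per_instruction(counts):
    # Divide every event by the instruction count, dropping the
    # "instructions" entry itself (it would always be 1.0).
    instr = counts["instructions"]
    return {
        event: value / instr
        for event, value in counts.items()
        if event != "instructions"
    }
```

With the counts above, `per_instruction(raw_counts)` yields 1.55 cycles and 0.02 branch misses per instruction.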
5.4.1 Cycles Per Instruction
Figure 5.3 contains the number of cycles per instruction obtained for the different runtime
environments for each benchmark. C tends to score very high, because it uses more
complex machine instructions, which take multiple cycles. Another reason for the
high values is that C will not stall a lot, because there are fewer branches, as can be
seen in Section 5.4.2. This is especially clear for the CPU-intensive benchmarks. The high
peak for k-nucleotide can be explained by the large number of memory operations going to
the last level cache. These increase the number of cycles per instruction. The values for
fasta-redux and binary-trees are lower, even though these also use a lot of memory.
However, there are fewer accesses to the last level cache. The results for the last level
cache are explained in more detail in Section 5.4.5 and Section 5.4.6.
Since Cython also compiles the code to C, a similar result is expected. This is however
not the case. Cython has very consistent values, which means that the type of benchmark
does not have a considerable influence on the results. The explanation for this low result
can be found in the fact that there is no type information available at compile time, and
thus it is not possible to optimise the code efficiently. This means it is not possible to use
the more complex instructions and there are more branches, as explained in Section 5.4.2,
disrupting the flow of execution. This leads to a lower number of cycles per instruction.
Note that the --embed option was used to generate the executable. This generates an
appropriate main method to start the execution of the Python script. This method makes
sure the code compiled with Cython can be used. It was however not used for the
pairwise distance calculation problem, shown in Section 5.1, which means that this does
not influence the result very much. Instead the answer is found in the wrapping of the
values.
CPython follows the behaviour of Cython very closely, yet it has the interpretation
overhead. The virtual machine causes exactly the same problems, namely it needs
to jump to the correct code, which leads to a larger number of branch misses. Since
everything is executed at runtime, no complex assembly instructions will be used either.
This results in almost exactly the same behaviour as for Cython.
PyPy uses a lot more cycles per instruction for the I/O- and memory-intensive
benchmarks. This is related to the memory operations, which take multiple cycles. The
peak for binary-trees is explained by the memory problems. The Just-In-Time compiler
has to store data to decide which code is 'hot', and the compiled code needs to be
stored as well. The binary-trees benchmark already uses a lot of memory. This causes a
huge number of store misses in the last level cache, as explained in Section 5.4.6.
Figure 5.3: The number of cycles per instruction
5.4.2 Branch Behaviour
The branch behaviour can be seen in Figure 5.4 and Figure 5.5. The results show that C
has fewer branch instructions than the other environments. This is particularly clear for
the spectral-norm and n-body benchmarks. This result was expected, since C does not have
the interpretation cost. Only the k-nucleotide benchmark stands out. This behaviour can
be explained by the algorithm used in the benchmark: it is necessary to use a lot of if
conditions for the code to work correctly. The branch misses per instruction are very low;
however, fasta-redux is an exception. This is because of the random initialisation of the
lookup table.
Cython has slightly lower values than CPython. As mentioned previously, Cython
really needs type information, which was not given, to improve the Python code. This
means the generated C code is not optimised, but more importantly the types are not
known. Cython will generate code to wrap the Python values; no operations will be
performed on the values directly. This means that each operation needs to be abstracted,
which causes the higher number of branches. The type of benchmark has no influence on
this behaviour.
CPython has constant results for the different benchmarks. Just like for Cython, the
type of benchmark does not have a big influence. This is the result of the Python virtual
machine: at runtime each instruction is executed individually, which means a branch is
necessary.
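Why interpretation implies at least one branch per executed instruction can be illustrated with a toy dispatch loop; the opcodes and handlers below are invented for illustration and are far simpler than CPython's actual bytecode:

```python
# Toy bytecode interpreter: every instruction is dispatched through an
# indirect jump (here, a dict lookup plus a call), so the hardware sees
# at least one hard-to-predict branch per interpreted instruction.
def run(program):
    stack = []
    handlers = {
        "PUSH": lambda arg: stack.append(arg),
        "ADD":  lambda _: stack.append(stack.pop() + stack.pop()),
        "MUL":  lambda _: stack.append(stack.pop() * stack.pop()),
    }
    for opcode, arg in program:
        handlers[opcode](arg)   # the per-instruction indirect branch
    return stack.pop()

# Computes (2 + 3) * 4
program = [("PUSH", 2), ("PUSH", 3), ("ADD", None),
           ("PUSH", 4), ("MUL", None)]
result = run(program)
```

Compiled code performs the additions and multiplications directly, without this dispatch, which is why C shows far fewer branches per instruction.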
PyPy behaves closer to the other Python runtime environments than to C. However
it shows a very strange behaviour, caused by the Just-In-Time compiler. It is explained
more elaborately in Section 6.5.2.2.
Figure 5.4: The number of branches per instruction
Figure 5.5: The number of branch misses per instruction
5.4.3 Level-1 Instruction Cache Behaviour
Figure 5.6 and Figure 5.7 contain the level-1 instruction cache loads and the level-1
instruction cache load misses, both per instruction. We notice that for C, the level-1
instruction cache load misses are influenced greatly by the benchmark. The results are
very high for fannkuch-redux while very low for spectral-norm. The fannkuch-redux
benchmark is executed almost entirely in a single method, while spectral-norm uses a lot
of different methods, resulting in worse code locality. This causes the huge difference.
The other values are more similar to each other. C's results are much lower than those
of the other runtime environments, because their code locality is worse.
Again Cython does not follow the behaviour of C. The values for Cython are almost
the same; the benchmarks have almost no influence on the instruction cache load misses
in the first level cache. Furthermore, Cython has a huge number of instruction cache load
misses. This can again be explained by the wrapping of the values. The operations on the
values have to be executed by calling the appropriate method; this means code has to be
loaded from a lot of different places, which explains the high number of instruction cache
load misses per instruction.
The same observation is made for CPython. Apart from k-nucleotide, which has fewer
level-1 instruction cache loads per instruction, the values are very similar. CPython does
not perform optimally when reading from a file. This problem does not occur when writing
to a device; there CPython has a high number of level-1 instruction cache loads.
PyPy is the only runtime, apart from C, which shows some variation in the obtained
values. The differences are not as large as for C and are inherent to the benchmarks. The
instruction cache load misses are higher than for CPython, because of the added cost of
the Just-In-Time compiler. Furthermore, guards are added so the trace can be followed;
however, these can result in more instruction cache load misses per instruction when they
fail. This is the case for binary-trees.
The level-1 instruction cache load misses are very high for each runtime environment
for the fasta-redux benchmark, because of the random generation of the lookup table.
Figure 5.6: The number of level-1 instruction cache loads per instruction
Figure 5.7: The number of level-1 instruction cache load misses per instruction
5.4.4 Level-1 Data Cache Behaviour
The level-1 data cache behaviour is captured in Figures 5.8 and 5.9. Again we notice that
the influence of each benchmark is very large for C. We also notice, not surprisingly, that
the number of misses is high for the I/O- and memory-intensive benchmarks. However, the
misses are considerably lower for the outputting benchmark, fasta-redux. Since
spectral-norm uses no memory, it makes sense that C has very low values there. The n-body
benchmark causes about as many load operations as k-nucleotide and binary-trees, because
it uses data stored in memory to do the calculations.
Cython clearly does not follow the same behaviour as C. The benchmark influence is
minimal; only the n-body, k-nucleotide and binary-trees benchmarks have a slightly higher
number of load operations per instruction. These are of course the benchmarks making
the most use of the memory.
CPython shows an even steadier behaviour. There is barely any difference in
the number of load operations per instruction measured over the different benchmarks,
because every value needs to be loaded from memory, since an interpreter is used. Due to
the extra overhead of the interpretation, CPython also has more misses.
PyPy clearly shows that spectral-norm is entirely CPU-intensive. The other benchmarks
show very little variation in the data cache load operations per instruction. However,
the misses show clearly that PyPy is having some issues with I/O. The cause for this
problem is of course the interpretation and the Just-In-Time compiler. Both need to use
memory, which can then not be used to contain application data. This leads to a higher
number of load misses per instruction.
Figure 5.8: The number of level-1 data cache loads per instruction
Figure 5.9: The number of level-1 data cache load misses per instruction
5.4.5 Last Level Cache Load Behaviour
Figure 5.10 and Figure 5.11 contain the load operations and load misses for the last level cache. The only clear results for C are for the I/O- and memory-intensive benchmarks. C has especially high values for the k-nucleotide benchmark, while this is not reflected in the misses. This peak is caused by storing the data from a file and the added load operations for performing the sort. The other I/O- and memory-intensive benchmarks also lead to a higher number of loads from the last level cache.
Almost all of Cython's load operations miss in the last level cache. The values are
considerably higher than those of CPython, which can be explained by the fact that
CPython has a memory manager. This manager improves the data locality and can also
cache data.
CPython also has low values for the I/O-intensive benchmarks. For the k-nucleotide benchmark, however, a higher number of last level cache load operations per instruction would be better. CPython has a very low value there, which is caused by the interpretation overhead: every read operation is followed by executing the next Python operation code in the virtual machine, which lowers the number of load operations per instruction. We notice that CPython does not miss a lot in the last level cache, because a lot of instructions are used to keep the interpreter going. This means that the number of instructions increases, which causes the misses per instruction to decrease.
Of the Python environments, PyPy has the highest number of last level cache load operations per instruction for the k-nucleotide benchmark. PyPy follows the behaviour of C for the I/O- and memory-intensive benchmarks, which means PyPy behaves well for the last level cache load behaviour. The misses, however, show some real issues. The biggest problems seem to lie with fannkuch-redux and n-body, even though those are CPU-intensive benchmarks. This is caused by the added overhead of the interpreter and the Just-In-Time compiler, which is very active for the CPU-intensive benchmarks. The spectral-norm benchmark uses almost no memory at all, so there are no problems for that benchmark.
Figure 5.10: The number of last level cache loads per instruction
Figure 5.11: The number of last level cache load misses per instruction
5.4.6 Last Level Cache Store Behaviour
The last level cache store behaviour is captured in Figures 5.12 and 5.13. Again we notice that the highest values for C are obtained for the I/O- and memory-intensive benchmarks. C has a peak for the k-nucleotide benchmark because of the sort operation, which causes a lot of the data to be stored again. We also notice that the highest misses occur for the k-nucleotide and binary-trees benchmarks, simply because those benchmarks store the most data.
Cython does not show any interesting behaviour. The misses are very similar over the different benchmarks. Again it follows the behaviour of CPython very closely.
CPython behaves in much the same way as Cython. The values for the store operations per instruction are very similar for each benchmark, and the misses per instruction are very stable over the different benchmarks. The type of benchmark does not influence the behaviour.
The store operations per instruction are a lot higher for PyPy for the k-nucleotide and binary-trees benchmarks. These are of course the benchmarks that use the most memory. The misses are very similar for fannkuch-redux, n-body and fasta-redux.
Figure 5.12: The number of last level cache stores per instruction
Figure 5.13: The number of last level cache store misses per instruction
5.4.7 Translation Lookaside Buffers
The misses for translation lookaside buffers are shown in Figure 5.14, Figure 5.15 and
Figure 5.16. They confirm the behaviour of the instruction and data cache misses.
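Event counts like the ones discussed in this chapter can be gathered on Linux with the perf tool. The invocation below is a sketch: the exact event names vary per CPU model, and the benchmark script name and argument are illustrative, not the thesis setup.

```shell
# Count cache and TLB events alongside retired instructions; dividing each
# event count by the instruction count gives the per-instruction rates.
perf stat -e instructions,L1-icache-load-misses,L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses,iTLB-load-misses,dTLB-load-misses,dTLB-store-misses \
    python fannkuch-redux.py 11
```
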
Figure 5.14: The number of instruction translation lookaside buffer load misses per instruction
Figure 5.15: The number of data translation lookaside buffer load misses per instruction
Figure 5.16: The number of data translation lookaside buffer store misses per instruction
5.5 Multi-threaded Applications
Remember that the two benchmarks using multiple cores are k-nucleotide and binary-trees. To compare the effectiveness of the threading, both benchmarks have also been executed without threading enabled. To accomplish this, it was necessary to modify the source code. Both Python benchmarks use a threading pool and call a map function on that pool. I have only removed the threading pool; the map call remained, although it is of course no longer the pool's map function. For the C code I either put pthread_join after each create or returned 1 for the number of cores. A more in-depth description of the modifications necessary to remove the threading is given in Appendix B.
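The Python-side modification can be sketched as follows. This is a minimal illustration with a made-up placeholder kernel, not the actual benchmark code: dropping the pool turns the parallel map into the sequential builtin map, while the surrounding call structure stays the same.

```python
from multiprocessing import Pool

def work(chunk):
    # Placeholder for the per-chunk benchmark kernel.
    return sum(chunk)

chunks = [[1, 2], [3, 4], [5, 6]]

# Threaded variant: the pool's map distributes the chunks over subprocesses.
with Pool(processes=2) as pool:
    threaded = pool.map(work, chunks)

# De-threaded variant: remove the pool but keep the map call -- it is now
# the sequential builtin map, and the rest of the code is unchanged.
sequential = list(map(work, chunks))

print(threaded, sequential)
```
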
It is customary to calculate the parallel speed-up Sp and efficiency Ep with p the
number of processes, in order to evaluate the behaviour of multi-threaded applications.
They are calculated as follows:
Sp = T1 / Tp
Ep = Sp / p
with Tp the total execution time with p processes.
This means that the efficiency is one only if the code has perfect parallelism. This is, however, never achievable, because there is always a small fraction which has to be executed sequentially. Note that the algorithm influences these measurements as well: some algorithms have larger sequential fractions than others, resulting in a lower efficiency. The efficiency ranges between zero and one, with numbers closer to one being better.
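In code, the two metrics are a one-liner each. The 80-second and 25-second timings below are made-up example numbers, not measurements from this thesis.

```python
def parallel_metrics(t1, tp, p):
    """Return the speed-up Sp = T1/Tp and efficiency Ep = Sp/p."""
    sp = t1 / tp
    return sp, sp / p

# Hypothetical example: 80 s sequentially, 25 s with 8 processes.
s8, e8 = parallel_metrics(80.0, 25.0, 8)
print(s8, e8)
```
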
The results are shown in Table 5.2 with the number of processes equal to eight, since
this is the number of cores available on the test machine. Therefore the ideal speed-up
would be eight.
To make a distinction between the runtime environments with and without threading
in the graphs shown below, ‘NT’ (no threading) is added to the name of the runtime
environment to indicate the behaviour without threading.
5.5.1 C
Very poor performance results are obtained for C in Table 5.2. It would be reasonable to assume this is due to the short duration of the benchmarks, because a short duration means that the sequential part is relatively larger than the parallel part. Therefore I have increased the arguments, which shows that the poor performance is not related to the duration of the benchmarks. The time measurements for C relative to the time with threading are shown in Figure 5.17. The performance clearly depends on the algorithm: it is easier to execute the binary-trees benchmark concurrently than the k-nucleotide benchmark. However, it is not possible to get a decent efficiency for either algorithm, which means that both benchmarks have a considerable sequential section.
Figure 5.17: Time measurements for C normalised to the time with threading
Table 5.2: The parallel speed-up and efficiency with eight processes for the different runtime environments

             k-nucleotide        binary-trees
runtime      S8       E8         S8       E8
C            1.498    0.187      3.481    0.435
Cython       3.930    0.491      5.679    0.710
CPython      4.137    0.517      6.397    0.800
PyPy         1.929    0.241      1.996    0.250
5.5.2 Cython
Cython does not follow C in its multi-threaded behaviour either. A much larger speed-up is obtained for both benchmarks. Figure 5.18 gives the time measurements for Cython relative to the time with threading. It shows that, again, binary-trees makes better use of multiple cores.
Figure 5.18: Time measurements for Cython normalised to the time with threading
The inefficiencies are divided over multiple cores, and because it is easy to read the wrapped values in parallel, this results in better performance. It leads to a larger speed benefit from using multiple cores. Again the behaviour is very similar to CPython's.
5.5.3 CPython
Figure 5.19 compares the time measurements with and without threading for both benchmarks running on top of the CPython runtime environment. It shows that a huge speed-up is obtained by using the multiprocessing module.
In Section 3.1.3, the different approaches to multi-threaded applications have been discussed. The conclusion was that the normal approach to threading leads to very poor performance. However, a module called multiprocessing has been created to avoid this problem by spawning subprocesses; the added cost of spawning the subprocesses is its biggest disadvantage. Both threaded benchmarks use this module and obtain very nice speed-ups. This means that while CPython has issues because of the Global Interpreter Lock, multi-threaded applications can definitely be written in Python. The multiprocessing module provides an alternative implementation which successfully bypasses that lock. Applications written using that module can get a very good speed-up compared to a single-threaded variant.
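The mechanism can be sketched in a few lines (this is an illustration of the principle, not the benchmark code): Pool.map hands the work items to separate worker processes, each with its own interpreter and therefore its own Global Interpreter Lock.

```python
import os
from multiprocessing import Pool

def worker_pid(_):
    # Each task reports the process it actually ran in.
    return os.getpid()

with Pool(processes=4) as pool:
    pids = pool.map(worker_pid, range(8))

# Every task ran in a subprocess rather than in the parent interpreter,
# so the parent's Global Interpreter Lock never serialised the work.
print(len(set(pids)), "worker process(es) used")
```
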
Figure 5.19: Time measurements for CPython normalised to the time with threading
5.5.4 PyPy
Figure 5.20 shows the time difference between PyPy with and without threading. We notice that the speed-up is not as high as CPython's. Without threading, PyPy is a lot faster than CPython; with threading, the difference is a lot smaller. CPython is even faster than PyPy for k-nucleotide when running with multiple threads.
Figure 5.20: Time measurements for PyPy normalised to the time with threading
The Just-In-Time compiler is the most important influence for PyPy; it already improves the performance of PyPy substantially, as investigated in Chapter 6. The multi-threaded approach adds a lot of complications. In each subprocess a new interpreter is launched, which means that the Just-In-Time compiler only sees the operations passing on a single core. The most important job of a Just-In-Time compiler is to compile as soon as possible, but this is not possible because of the approach of the multiprocessing module. On each core the Just-In-Time compiler has to compile the code, which is only partially available there. This leads to a much lower performance gain than that obtained by a plain interpreter like CPython.
5.6 PYC
In Section 4.2, it was mentioned that the PYC runtime environment would also be benchmarked. The approach is the same as CPython's; however, a previously generated file containing the Python bytecode is used instead of generating the Python bytecode at runtime, thus eliminating the loading overhead. Yet it has not been analysed in the previous comparison. The results for the PYC runtime environment are very similar to the ones obtained for CPython. The time difference is shown in Figure 5.21; the hardware results are almost identical as well. This confirms the suspicion that avoiding the bytecode compilation step does not have a huge influence on the total execution time. Since there is barely any difference, I did not include a full comparison of PYC with the other runtime environments.
Figure 5.21: Time comparison between the PYC and CPython runtime environments,
normalised to CPython
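The idea behind PYC can be reproduced with the standard py_compile module; the file names below are made up for the example. The bytecode is generated once, ahead of time, and CPython can then be pointed at the .pyc file directly.

```python
import pathlib
import py_compile

# Write a tiny script, then byte-compile it ahead of time.
src = pathlib.Path("demo_script.py")
src.write_text("print('hello from precompiled bytecode')\n")

# cfile pins the output location; running `python demo_script.pyc` afterwards
# skips the source-to-bytecode compilation step at start-up.
pyc = py_compile.compile(str(src), cfile="demo_script.pyc")
print(pyc)
```
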
5.7 Conclusion
C is of course performing the best. The best performing Python runtime environment,
without making any modifications to the source code, is PyPy. This has been proven by
the time measurements.
The hardware events have shown that the Python runtime environments execute more branches than C. In general, the type of benchmark has a much bigger influence on C than on the other runtime environments.
CPython is often the slowest runtime environment, while being the most used one. For smaller scripts this is fine; however, once performance becomes important, it is clearly not a desirable choice.
Cython can improve the performance drastically, as shown in Section 5.1, but type information is necessary. Without the type information it sometimes performs worse than CPython. This is caused by the wrapping of the values, which leads to a high number of instruction cache misses and load misses in the last level cache. The store misses in the last level cache are also considerably large.
Both the instruction cache misses and the data load misses in the first level cache are elevated for PyPy, and often the last level cache misses are large as well. Since PyPy shows the most promise, it has received the main focus in my research. In Chapter 6, a more in-depth analysis is performed specifically for PyPy.
For C, the measured hardware events show that the type of benchmark greatly influences the events. This behaviour is also noticed for PyPy, albeit to a lesser extent; it is of course caused by the JIT, because the code is compiled to assembly. A similar behaviour would be expected for Cython, but this is not the case: without type information, its behaviour is much more similar to CPython's. The type of benchmark has very little influence on the hardware events for CPython and Cython.
The problems with the Global Interpreter Lock are resolved by the multiprocessing module. CPython gets a very decent speed-up using this module; the overhead of spawning the subprocesses and launching an interpreter on each core is relatively small. However, the same speed-ups are not obtained for PyPy, because the Just-In-Time compiler does not work optimally when multiple interpreters are used on different cores. It is not possible to share the compiled code between the different subprocesses.
Chapter 6
Analysis PyPy
Since PyPy is the best performing Python runtime environment when no extra information is added, I have mainly focused on analysing this runtime environment. There are already some tools available to examine the influence of the Just-In-Time compiler. I also checked the behaviour of the garbage collector. However, to evaluate this behaviour, the hardware events and time measurements do not provide enough information. First, some tools are discussed to get a better understanding of PyPy.
6.1 Translatorshell
This tool provides a better understanding of the internals of PyPy itself and of how the PyPy executable is created. The translatorshell is a Python script which is included in PyPy itself. It enables the user to access the code which is used for the interpretation and Just-In-Time compilation. The best way to explain how it works is by using an example. A lot of example code fragments are included with the PyPy source code; one of them is a method which searches for a perfect number¹. The Python code to verify if a number is perfect can be written like this:
def is_perfect_number(n=int):
    div = 1
    sum = 0
    while div < n:
        if n % div == 0:
            sum += div
        div += 1
    return n == sum
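As a quick sanity check, the snippet also runs as plain Python: the n=int default is only meaningful to the RPython toolchain and is simply overridden by the actual argument here.

```python
def is_perfect_number(n=int):
    div = 1
    sum = 0
    while div < n:
        if n % div == 0:
            sum += div
        div += 1
    return n == sum

# The only perfect numbers below 100 are 6 and 28.
print([n for n in range(1, 100) if is_perfect_number(n)])  # [6, 28]
```
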
The first insight this tool provides is to allow the user to view the flow graph. This is
accomplished by creating a Translation object and calling the view method:
>>> t = Translation(snippet.is_perfect_number, [int])
>>> t.view()
Note that the is_perfect_number function is located in a module called snippet and that it has one argument of the type 'int'. It has been mentioned previously that Python is dynamically typed, yet it is necessary to declare the type here, because we are using code internal to PyPy. It is not necessary to declare the types inside Python programs; internally, however, PyPy will decide on types for each variable. Calling the view method will show a graph similar to the one in Figure 6.1.

¹ A perfect number is a positive integer which is equal to the sum of its proper positive divisors, excluding the number itself. For example, 28 is a perfect number because 28 = 1 + 2 + 4 + 7 + 14, and 28 is only divisible by 1, 2, 4, 7 and 14.
It is also possible to view the graph after the annotation phase. The user just needs
to execute the following instructions:
>>> t.annotate()
>>> t.view()
The obtained graph is shown in Figure 6.2. It is clear that a lot of code has been added by the RPython toolchain. The graph of the is_perfect_number function is still visible at the left.
The rtyping phase is the last phase after which the graph can be viewed. It can be
generated using the following instructions:
>>> t.rtype()
>>> t.view()
>>> t.compile_c()
The last instruction will compile it to a C library. The generated graph is included
in Figure 6.3 and clearly shows that a lot of information is added to create the PyPy
executable. As mentioned in Section 3.2, the garbage collector and Just-In-Time compiler
are not yet added to this graph.
6.2 Hooks
A hook allows the programmer either to investigate the behaviour of a module or to react on a certain event. PyPy provides the user with various hooks. The ones concerning the Just-In-Time compiler are the most relevant. They are available to the user by importing the pypyjit module from a Python script. The following hooks are provided:
optimise hook: called each time a loop is optimised (before assembler compilation). It allows the programmer to view the Python bytecode which will be compiled and modify it as well. The argument it gets passed is a JitLoopInfo object.
compile hook: called each time a loop is compiled; it is not reentrant. Again the argument passed to it is a JitLoopInfo object.
abort hook: called each time tracing is aborted. It gets passed a driver, a greenkey, the abort reason and a list of operations.
The JitLoopInfo object contains the following information:
driver: name of the JIT driver
greenkey: representation of the place where the loop was compiled
operations: list of operations in the loop
Figure 6.1: The flow graph for the is_perfect_number method
Figure 6.2: The flow graph for the is_perfect_number method after the annotate phase
Figure 6.3: The flow graph for the is_perfect_number method after the rtyping phase
loop_no: loop cardinal number
bridge_no: bridge number (if it is a bridge)
type: loop type (either 'bridge' or 'loop')
asmaddr: address of the location of the machine code
asmlen: length of the machine code
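Installing the compile hook can be sketched as follows. It only has an effect under PyPy; the guarded import lets the same script run, without counting anything, under CPython, and the loop body is just a made-up hot function.

```python
compiled = []

def on_compile(info):
    # info is a JitLoopInfo object; record whether a loop or a bridge
    # was just compiled.
    compiled.append(info.type)

try:
    import pypyjit
    pypyjit.set_compile_hook(on_compile)
except ImportError:
    pass  # not running under PyPy

def hot(n):
    total = 0
    for i in range(n):
        total += i
    return total

hot(10_000_000)  # hot enough to trigger the JIT under PyPy
print(len(compiled), "compilation event(s) observed")
```
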
After experimenting with the hooks, I noticed they are not to be trusted for counting the number of Just-In-Time compilations. While they are useful, mostly for quickly testing code transformations, they did not serve me for the benchmarking and analysis of PyPy. However, I did use them to verify the correctness of my harness for the stable behaviour.
6.3 JIT Viewer
The PyPy developers have already created a tool to investigate the behaviour of the Just-In-Time compiler: a web application called the 'jitviewer', which can be installed locally. First it is necessary to run the Python script and let it write log information to a file. Next, the log file should be passed to the application. This tool allows the user to see the Python source code, the Python bytecode, an intermediate representation of the code of the Just-In-Time compiler and also the generated assembly. Figure 6.4 shows the jitviewer in action for the spectral-norm benchmark.
A problem I noticed with this viewer is that the correct output is not written to the log file when the Python script uses the multiprocessing module, which means that the jitviewer does not work correctly in that case. Nevertheless, I was able to use this tool to verify the correctness of the harness for the stable behaviour, and it provides the easiest way to view the traces compiled by the Just-In-Time compiler.
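The workflow looks roughly like this (a sketch: the PYPYLOG categories and the jitviewer.py entry point follow the PyPy project's documentation, and spectral-norm.py stands in for the benchmark script):

```shell
# Record the JIT log while running the benchmark under PyPy ...
PYPYLOG=jit-log-opt,jit-backend:spectral.log pypy spectral-norm.py 3000
# ... then serve the annotated traces in the jitviewer web application.
jitviewer.py spectral.log
```
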
6.4 Behaviour Over Time
Using the same log file, generated as mentioned in Section 6.3, it is also possible to apply
a different tool, which shows the behaviour of the Just-In-Time compiler and the garbage
collector over the course of the application’s execution. This is accomplished by generating
an image containing the various components in a time lapse of the run. Green is used
to show the activity of the Just-In-Time compiler, while red is used for the activity of
the garbage collector. Different shades are used to show specific components of either the
Just-In-Time compiler (for example tracing, optimising, etc.) or the garbage collector (for
example minor collections). The time spent interpreting is left transparent. The most
active components are added to the figure in a key. If the time lapse is not interesting, it
is also possible to generate a summary showing the activity of each specific component.
6.4.1 A Lot Of Garbage Collection
The figures obtained using this tool show a lot of garbage collection taking place, as can be seen in Figure 6.5, which shows the time lapse of the fannkuch-redux benchmark with argument 11. Only the spectral-norm benchmark with the smallest argument does not show this behaviour. This is unexpected. Therefore I decided to increase the size of the
Figure 6.4: The jitviewer in action, showing the Python code, Python bytecode and the
intermediate representation of the code of the Just-In-Time compiler for the
spectral-norm benchmark
image generated by this tool, and the result can be found in Figure 6.6. The problem with this tool is that it does not draw the time spent in interpretation. If the total execution time is relatively long, which is the case for these benchmarks, then even a one-pixel line drawn to show a short activity of a specific component, such as a minor garbage collection, is too wide and therefore misleading. My suspicion was that there are a lot of garbage collections happening one after the other, each of which takes very little time; this would produce exactly the observed pattern. I have been able to confirm this suspicion by going through the log file: the graphs are indeed not very representative of the garbage collections. It is not possible to generate very large graphs, which means it is important to be careful when a large amount of time seems to be spent visually. The summary, which can be seen in Listing 6.1, or the key of the graph will clarify this.
Figure 6.5: Time lapse of fannkuch-redux with argument 11 and generated width 1000

interpret                   98.057905084%
gc-minor                     1.013017932%
gc-minor-walkroots           0.464278975%
jit-tracing                  0.328666524%
jit-resume                   0.055008926%
jit-optimize                 0.052875836%
jit-backend                  0.015693034%
gc-set-nursery-size          0.007687626%
jit-log-virtualstate         0.002679879%
jit-backend-dump             0.001098056%
gc-hardware                  0.000594516%
jit-backend-addr             0.000138570%
jit-mem-looptoken-alloc      0.000095715%
jit-log-noopt-loop           0.000065545%
jit-log-opt-bridge           0.000037148%
jit-abort                    0.000032234%
jit-log-compiling-bridge     0.000027141%
jit-log-rewritten-bridge     0.000026188%
jit-log-short-preamble       0.000015614%
jit-log-opt-loop             0.000015473%
jit-log-rewritten-loop       0.000012621%
jit-log-compiling-loop       0.000012392%
jit-summary                  0.000007832%
jit-mem-collect              0.000005127%
jit-backend-counts           0.000002011%
Listing 6.1: Time lapse of the fannkuch-redux benchmark with argument 11
Figure 6.6: First part of the time lapse of fannkuch-redux with argument 11 and generated width 20000
6.4.2 Behaviour Of The Just-In-Time Compilation
Figure 6.7 contains the time lapse for spectral-norm with the smallest argument. This graph does not suffer from the problem that garbage collection is not represented correctly. The Just-In-Time compiler's behaviour is similar for all benchmarks and it is represented nicely in this graph.
We can conclude that the Just-In-Time compiler is mainly active at the beginning of the run, which is to be expected. Compiling takes some time, so it is best to do this as soon as possible, so that as much benefit as possible can be gained from it. Furthermore, we notice that barely any time is spent in the Just-In-Time compiler. Note that a relatively large share of that time is spent dumping to a log file (1.9%). Remember that it is only
Figure 6.7: Time lapse of the spectral-norm benchmark with argument 3000
necessary to write to a log file in order to view this graph; normally, when running the code, there will not be any logging, which means this cost disappears. This is the only graph where the amount of time spent in the Just-In-Time compiler, including dumping to a log file, is larger than one percent, as can be seen in Table 6.1.
The activity of the garbage collector is visible in Table 6.2. The values for the benchmarks which allocate a lot of memory are higher than the others. The k-nucleotide benchmark has very low values, because that benchmark only allocates memory in the beginning.
6.4.3 Influence Of The Nursery Size
Section 3.2.1 already mentioned that it is possible to modify the garbage collector by setting some variables. Since we see a lot of collections happening, it is interesting to see how the parametrisation of the garbage collector influences the behaviour over time. Increasing the nursery size should reduce the number of collections. The summary verifies that the time spent collecting memory is indeed reduced: instead of the 1.013% obtained for the default value (four megabytes), now only 0.712% of the time is spent on minor collections. However, the total execution time increases from 22.64 seconds to 28.117 seconds due to bad data locality.
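The parametrisation itself is done through an environment variable that PyPy's garbage collector reads at start-up; the 16MB value below is just an example size, and the script name is illustrative.

```shell
# The default nursery is four megabytes; quadruple it for this run.
PYPY_GC_NURSERY=16MB pypy fannkuch-redux.py 11
```
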
6.5 Influence Of The Just-In-Time Compiler
In Chapter 3, it has already been mentioned that the Just-In-Time compiler of PyPy is
very interesting and I have put extra effort into investigating its behaviour by measuring
Table 6.1: The time spent in the Just-In-Time compiler compared to the total execution time

benchmark        arg        activity Just-In-Time compiler (%)
fannkuch-redux   11         0.45
fannkuch-redux   12         0.01
spectral-norm    3000       3.66
spectral-norm    5500       0.84
n-body           5000000    0.20
n-body           50000000   0.33
k-nucleotide     2500000    0.58
fasta-redux      25000000   0.91
binary-trees     20         1.04
Table 6.2: The time spent in the garbage collector compared to the total execution time

benchmark        arg        activity garbage collector (%)
fannkuch-redux   11         1.49
fannkuch-redux   12         1.49
spectral-norm    3000       0.62
spectral-norm    5500       0.20
n-body           5000000    1.32
n-body           50000000   1.32
k-nucleotide     2500000    0.33
fasta-redux      25000000   5.33
binary-trees     20         2.53
the stable behaviour and the behaviour with the Just-In-Time compiler disabled. We would expect running PyPy without the Just-In-Time compiler to be slower. We would also expect the stable-behaviour run to perform better than running PyPy with the JIT compiler, because the stable behaviour uses the optimised code generated by the compiler while excluding the time during which the compiler ran.
6.5.1 Time Measurements
The time results, visually represented in Figure 6.8, clearly show that the expected behaviour is indeed observed. The difference between running PyPy with and without the Just-In-Time compiler is huge, which means the compiler is very effective and also necessary to improve the performance of Python.
The difference between the stable version and PyPy is less clear. The stable behaviour is indeed a bit faster than the normal behaviour, although sometimes its values are a bit higher, because the amount of memory is a bit larger due to the harness and the garbage collection does not always happen exactly when asked. The results are clear enough, however, to confirm what we noticed in Section 6.4, namely that not a lot of time is lost in the Just-In-Time compiler and that its overhead is very low.
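The stable-behaviour idea can be mimicked with a small warm-up-then-measure harness. This is a rough sketch of the principle, not the actual harness used in this thesis, and the workload below is a made-up stand-in for a benchmark.

```python
import time

def measure_stable(workload, warmup=3, reps=5):
    # Let a JIT (if any) compile the hot code during the warm-up runs ...
    for _ in range(warmup):
        workload()
    # ... then time only the post-warm-up repetitions and keep the best one.
    timings = []
    for _ in range(reps):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return min(timings)

best = measure_stable(lambda: sum(i * i for i in range(10_000)))
print(f"best stable iteration: {best:.6f} s")
```
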
Figure 6.8: Time measurements for the PyPy runtime environments, normalised to PyPy.
Note that there are no results for PyPyS for k-nucleotide and binary-trees.
6.5.2 Hardware Events
The previous sections have clearly shown that the Just-In-Time compiler is indeed very
effective. It improves the application execution time significantly, while having very little
overhead. However, it is still not clear what causes the benefit and whether the Just-In-Time
compiler has remaining disadvantages. To answer these questions, we analyse the hardware
events below.
6.5.2.1 Cycles Per Instruction
Figure 6.9 presents the number of cycles per instruction. It clearly shows the influence
of the Just-In-Time compiler. For the CPU-intensive benchmarks, there is barely any
difference between PyPy and PyPyS. PyPyNJ has a slightly increased number of cycles
per instruction compared with the other two, except for n-body.
The fasta-redux results show that the stable behaviour uses more cycles per instruction,
while the PyPyNJ results are better. This is caused by a larger number of data cache
misses, explained in Section 6.5.2.4.
The k-nucleotide and binary-trees benchmarks, which both use a lot of memory, clearly
show that the Just-In-Time compiler has a huge influence on the cycles per instruction: a
lot of time is lost loading and storing the information needed for the Just-In-Time
compilation, such as the counters and the compiled code.
Figure 6.9: Cycles per instruction for the PyPy runtime environments. Note that there
are no results for PyPyS for k-nucleotide and binary-trees.
6.5.2.2 Branch Behaviour
The branch behaviour is captured in Figures 6.10 and 6.11. Since the PyPyS behaviour
is almost identical to the PyPy behaviour, we can conclude that most branches are caused
by the application code and not by the Just-In-Time compiler itself.
The results for PyPyNJ are very consistent; only the results for the I/O-intensive
benchmarks are slightly elevated. This is, however, inherent to the benchmarks: both
k-nucleotide and fasta-redux have a higher number of branches.
The Just-In-Time compiler causes slightly more branches for the I/O- and memory-intensive
benchmarks.
The branch misses are very consistent across all PyPy runtime environments, except for
the spectral-norm benchmark. The traces for this benchmark are a lot shorter, which leads
to more branches, and a lot of them are mispredicted due to the algorithm.
Figure 6.10: Branches per instruction for the PyPy runtime environments. Note that
there are no results for PyPyS for k-nucleotide and binary-trees.
Figure 6.11: Branch misses per instruction for the PyPy runtime environments. Note
that there are no results for PyPyS for k-nucleotide and binary-trees.
6.5.2.3 Level-1 Instruction Cache Behaviour
Figure 6.12 contains the number of level-1 instruction cache loads divided by the number
of instructions. The values obtained for PyPyNJ are very consistent; this is, of course, the
effect of the virtual machine. Using a Just-In-Time compiler has a huge influence on this
behaviour, but the changes are inherent to the benchmarks; the influence of the Just-In-Time
compiler itself is minimal.
The most interesting results are presented in Figure 6.13, which shows the number of
level-1 instruction cache load misses per instruction. This graph clearly shows that the
Just-In-Time compiler reduces the misses, because jumps are replaced with guards and the
traces contain an entire loop of instructions that are executed consecutively. This improves
the instruction locality.
Figure 6.12: Level-1 instruction cache loads per instruction for the PyPy runtime
environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees.
Figure 6.13: Level-1 instruction cache load misses per instruction for the PyPy runtime
environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees.
6.5.2.4 Level-1 Data Cache Behaviour
The level-1 data cache behaviour is represented in Figures 6.14 and 6.15. For spectral-norm
there is a big difference between running the benchmark with or without the Just-In-Time
compiler. This difference between PyPyNJ and the other two runtime environments
occurs because that benchmark uses almost no memory. The compiled code of the
Just-In-Time compiler itself needs to be stored, but no other memory is necessary, because
the values of the variables can be stored in registers. The virtual machine causes a
lot more data to be stored, because the wrapping prevents the interpreter from using
registers. This effect is visible for this benchmark. The other benchmarks show very
consistent behaviour.
Again the most revealing results are captured by the misses. They clearly show that
the added cost of storing the compiled code doubles the load misses for the I/O- and
memory-intensive benchmarks, where memory is important. A possible way to mitigate this
problem is to apply prefetching to improve the data cache behaviour; this approach is
investigated in Section 6.7. The results for spectral-norm clearly show that the
Just-In-Time compiler is very efficient for purely CPU-intensive applications.
Figure 6.14: Level-1 data cache loads per instruction for the PyPy runtime environments.
Note that there are no results for PyPyS for k-nucleotide and binary-trees.
Figure 6.15: Level-1 data cache load misses per instruction for the PyPy runtime
environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees.
6.6 Adjusting The Heap Size
PyPy allows the user to modify some of its internal variables, which influence the behaviour
of the Just-In-Time compiler or the garbage collector. Since PyPy has problems with
level-1 data cache behaviour when running the Just-In-Time compiler, I decided to
experiment with the maximum heap size. This variable defines the maximum size of the
heap PyPy is allowed to use for all generations together. First, I checked the minimum
amount of memory each benchmark requires to run; I call this the minimum heap size
(MHS). Table 6.3 contains the minimum heap size for each benchmark. As expected,
k-nucleotide and binary-trees need the most memory.
Since PyPy has problems with the level-1 data cache, I decided to evaluate the behaviour
of PyPy when it has just sufficient memory to run these benchmarks. This is followed by
a comparison with twice the minimum heap size, three times the minimum heap size and
the default heap size (fourteen gigabytes). This methodology is advised to explore the
space-time tradeoff of automatic memory management [6], i.e. garbage collection.

Table 6.3: Minimum heap size required for each benchmark

    benchmark        arg         minimum heap size (MB)
    fannkuch-redux   11          4
    fannkuch-redux   12          4
    spectral-norm    3000        3
    spectral-norm    5500        3
    n-body           5000000     3
    n-body           50000000    3
    k-nucleotide     2500000     46
    fasta-redux      2500000     7
    fasta-redux      25000000    7
    binary-trees     20          297
The time measurements for the different runs, normalised to the time with the minimum
heap size, are included in Figure 6.16. While changing the heap sizes causes no huge
differences in the averages, there can be a huge difference in standard deviation. For
example, the standard deviation of the time measurements for fannkuch-redux with
argument 12 and the heap size set to twice the minimum heap size is 13.839 seconds.
PyPy's memory management explains why there are no huge differences: almost no data
is freed while running the programs, which means that the minimum heap size is very
close to the actual amount of used data.
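A heap-size sweep like the one above can be scripted through the PYPY_GC_MAX environment variable (see Appendix A). The sketch below only builds the command line and environment; the benchmark script name is a placeholder, and a `pypy` binary is assumed.

```python
import os

def heap_limited_command(script, args, max_heap):
    """Build the command line and environment for a heap-capped PyPy run."""
    env = dict(os.environ, PYPY_GC_MAX=max_heap)
    return ["pypy", script] + list(args), env

# Hypothetical sweep for binary-trees (MHS = 297 MB, see Table 6.3):
for limit in ("297MB", "594MB", "891MB"):       # 1x, 2x and 3x the MHS
    cmd, env = heap_limited_command("binarytrees.py", ["20"], limit)
    # subprocess.run(cmd, env=env) would launch one measurement
```

Copying the parent environment and overriding only PYPY_GC_MAX keeps the rest of the configuration identical between runs, so only the heap limit varies.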
Figure 6.16: Time measurements normalised to the execution time with the minimum
heap size (MHS) for PyPy with varying heap sizes
6.7 Prefetching
In Section 6.5.2, I mentioned prefetching as an idea to improve the performance of PyPy
by reducing the number of load misses in the level-1 data cache. Prefetching is a technique
that loads data before it is requested by the application's executing instructions, in the
hope that the 'prefetched data' will be needed later on and no time will be lost loading
it. The necessary data is commonly identified prior to execution, based on the assembly
instructions. This technique can be applied at the hardware and software level and has
been researched previously [19, 3, 13]. Hardware prefetching is already applied by hardware
manufacturers and is now a built-in feature on most machines. Software prefetching offers
the benefit that more information is available, which makes it possible, for example, to
determine loop bounds or indirect indexing [17]. Special algorithms can be applied to a
specific piece of software, although this increases the instruction count [17].
The results discussed in Section 6.5.2 have led to the assumption that prefetching might
improve the speed of PyPy, since the Just-In-Time compiler increases the number of data
misses. Prefetching, if effective, is therefore expected to reduce these data misses. To
confirm this assumption, I compared running PyPy with and without prefetching turned
on, and explored both hardware and software prefetching techniques.
6.7.1 Hardware Prefetching
Section 6.5.2 already showed that the Just-In-Time compiler improves the behaviour of
the level-1 instruction cache, but introduces a lot more misses in the data cache. In the
following sections, the influence of hardware prefetching is examined for both the level-1
instruction and data caches.
The results for PyPyS have been dropped, since they are almost identical to the normal
behaviour of PyPy; dropping them increases the readability of the figures.
The following discussion contains graphs comparing the PyPy and PyPyNJ runtime
environments with and without hardware prefetching enabled. 'NP' (No Prefetching)
indicates the behaviour of a runtime environment with hardware prefetching disabled.
6.7.1.1 Level-1 Instruction Cache Behaviour
Figure 6.17 shows that hardware prefetching has almost no influence on the number of load
operations in the level-1 instruction cache. There is a huge difference for the fasta-redux
benchmark with argument 2500000; however, this difference has almost completely
disappeared when the argument is larger, which leads us to believe that the smaller
argument might be a special case.
The level-1 instruction cache misses, represented in Figure 6.18, confirm that hardware
prefetching has almost no influence on the level-1 instruction cache behaviour.
Figure 6.17: Influence of prefetching on the level-1 instruction cache loads for the PyPy
runtime environments
Figure 6.18: Influence of prefetching on the level-1 instruction cache load misses for the
PyPy runtime environments
6.7.1.2 Level-1 Data Cache Behaviour
The influence of prefetching on the level-1 data cache is shown in Figure 6.19. The graph
clearly shows that the influence of hardware prefetching on the number of loads is
negligible. However, the misses, included in Figure 6.20, show a different result. The
number of misses per instruction is reduced significantly when prefetching is enabled,
particularly for the I/O- and memory-intensive benchmarks when PyPy runs with the
Just-In-Time compiler. This means that prefetching helps to reduce the negative data
cache effects of the Just-In-Time compiler.
Figure 6.19: Influence of prefetching on the level-1 data cache loads for the PyPy runtime
environments
Figure 6.20: Influence of prefetching on the level-1 data cache load misses for the PyPy
runtime environments
6.7.2 Software Prefetching
A well-known approach to software prefetching has been presented by Hans-J. Boehm [7]
for the Boehm-Demers-Weiser garbage collector. The idea behind his algorithm, called
prefetch-on-grey, is shown in Listing 6.2. It basically keeps prefetching all data, without
considering whether the data is necessary and will be used in the near future.
Push all roots on the mark stack, making them grey.
While there is a pointer to an object g on the stack
    Pop g from the mark stack.
    If g is still grey (not already black)
        Blacken g.
        For each pointer p in g's target
            Prefetch p.
            If p is white
                Grey p and push p on the mark stack.

Listing 6.2: Prefetch-on-grey algorithm
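The traversal order of Listing 6.2 can be sketched in Python. Real prefetching happens at the hardware level, so the prefetch step below is a deliberate no-op stub; the sketch only illustrates where in the marking loop Boehm's collector issues its prefetches.

```python
WHITE, GREY, BLACK = 0, 1, 2

def prefetch(obj):
    pass  # placeholder: a real collector would issue a CPU prefetch here

def mark(roots, children):
    """Tri-colour marking; children maps each object to the objects it points to."""
    colour = {o: WHITE for o in children}
    stack = []
    for r in roots:                       # push all roots, making them grey
        colour[r] = GREY
        stack.append(r)
    while stack:
        g = stack.pop()
        if colour[g] == GREY:             # not already black
            colour[g] = BLACK             # blacken g
            for p in children[g]:         # each pointer in g's target
                prefetch(p)               # prefetch before the colour test
                if colour[p] == WHITE:
                    colour[p] = GREY
                    stack.append(p)
    return {o for o, c in colour.items() if c == BLACK}
```

For the object graph `{"a": ["b"], "b": ["a"], "c": []}` with root `"a"`, only `"a"` and `"b"` end up black; `"c"` stays white and would be reclaimed.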
It is of the utmost importance that PyPy only prefetches data that will be used, preferably
right before it is needed, because otherwise the prefetches cause negative effects by
polluting the cache. The influence of early and late prefetches has been researched
previously [17]; due to time constraints, I decided not to focus on this. Software
prefetching is most often applied to for loops; however, it seems much more interesting
to apply it to traces. The best approach I found discovers misses by monitoring the
memory accesses of the program at the instruction level. It would be possible to apply
this after the Just-In-Time compilation. An overview of the approach is presented in
Figure 6.21. A few steps are used to insert prefetch statements:
1. A filter tags each memory access as a hit or a miss (only load misses are taken into
account).
2. For each memory access, candidate predictors are generated for the location and, for
pointers, also for the contents of the location. The idea is to keep track of a fixed
contiguous region of memory around the location and the location's contents address.
The authors discovered that region sizes of 10 and 20 cache lines in each direction gave
good results.
3. The candidate predictions are hashed by their memory line addresses and stored
in the Memory Line Address Map (MLA map). For each load miss, the address is
checked in the MLA map. If there are existing predictions for the load miss address,
each predictor is considered further for validity.
4. To decide how long the predictors should be kept, a sliding window is used.
5. The predictions are pruned, to keep the most accurate ones.
6. The highly accurate predictions are inserted into the assembly.
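Step 1 of the pipeline above can be sketched with a tiny direct-mapped cache model. The line size and cache size below are illustrative values, not the parameters used by Mueller and Marathe.

```python
LINE_SIZE = 64          # bytes per cache line (illustrative)
NUM_LINES = 512         # a small direct-mapped cache (illustrative)

def classify_accesses(addresses):
    """Tag each memory access address as a 'hit' or a 'miss'."""
    lines = [None] * NUM_LINES          # which line address each slot holds
    tags = []
    for addr in addresses:
        line_addr = addr // LINE_SIZE   # the cache line this byte falls in
        slot = line_addr % NUM_LINES    # direct mapping: one slot per line
        if lines[slot] == line_addr:
            tags.append("hit")
        else:
            lines[slot] = line_addr     # evict whatever was in the slot
            tags.append("miss")
    return tags
```

For example, `classify_accesses([0, 8, 64])` tags the first and third accesses as misses (two different lines) and the second as a hit (same line as the first). A real filter would model the target machine's actual cache geometry.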
After going through the PyPy source code, I have come to the conclusion that it is not
easy to apply these researchers' techniques in PyPy. The Just-In-Time compiler's backend
generates assembly without first generating C. In order to use the harness, it would be
necessary to go over the entire code after the assembly generation. Since the harness is
written in C++, there would be an added cost to load and call the libraries. A better
approach would be to translate the framework to Python code or, better still, to construct
an algorithm that does not need the entire generated assembly. Such an approach would
have to work on the Python bytecodes. One of the main issues with software prefetching
for PyPy is that everything happens at runtime, which means that the cost of applying
the prefetching should preferably be much smaller than the gained benefit. Most research
has focused on applying prefetching before the program is executed. Due to time
constraints, I have not been able to complete my research on prefetching. However, I
believe it is possible and would improve the data cache load misses. The obtainable
improvement is not yet clear, and the cost might prove to be high; therefore, it is not
advisable to use prefetching with short-running programs.

Figure 6.21: Prefetching approach created by Frank Mueller and Jaydeep Marathe [19]
6.8 Conclusion
PyPy’s Just-In-Time compiler is very effective. There is barely any time spent performing
Just-In-Time compilation, while the total execution time gets reduced substantially. The
main problems of PyPy are caused by the memory. The level-1 data cache load misses
are a lot higher with the Just-In-Time compiler than without. These memory problems
are already improved by applying hardware prefetching. However a much larger benefit
could be obtained from using software prefetching. It is not easy to apply this and it
would require a lot of work to make this work for every hardware architecture, since
PyPy generates the assembly itself. Still more research about this approach is required to
discover the influence of software prefetching.
Chapter 7
Final Conclusions
The best performing language according to the experimental results is C. No Python
runtime environment is able to outperform it. This is, however, not a big surprise. The
disadvantage of C is that it is hard to use for most people. Python is a language that is
easy to program, even by non-experts, and can be used by many more people. However,
there is less focus on efficient performance. Recently the interest of the academic world
in Python has increased, which means that performance becomes a key factor. My thesis
explores the different runtime environments that can run Python programs, their impact
on execution time and hardware events, and possibilities to optimise performance.
There are three approaches to running Python code:
• interpretation
• compilation to a lower level language
• interpretation with a Just-In-Time compiler
Each approach has been investigated through a commonly used runtime environment
that implements it. The default interpreter, called CPython, represents the interpretation
approach. Cython has been chosen to compile Python code to C, after which the C code
is compiled to an executable; it also performs some optimisations, in the hope of getting
even better performance. PyPy is the best-known Python implementation using a
Just-In-Time compiler; other projects have been discontinued in its favour. A conclusion
follows for each runtime specifically, and finally a general conclusion is provided.
7.1 CPython
CPython has proven to be very slow. This was not a big surprise, since no effort has
been put into optimising the execution. There is an intention to improve it, because
optimisation flags have been added; however, there are no concrete optimisations yet.
The improvements that have been made do not optimise the execution time, but only
the loading time. Since Python is mainly used to glue existing components together,
this is obviously important. However, the loading cost is very small compared to the cost
of running a computationally heavy task. While CPython is a very good choice for glue
code, it is not meant for heavy computations. However, many of the libraries that perform
computationally intensive tasks are written in C or Fortran, in line with the intention
of a 'glue language'.
One of the main concerns about CPython is the Global Interpreter Lock, which prevents
simultaneous execution of different threads, even if multiple cores are available.
During my research, I discovered that the Global Interpreter Lock actually improves the
performance of single-threaded code, but for multi-threaded applications it is a huge
bottleneck. There have been many attempts to remove it; however, none have been
successful. A solution is available as a library, called multiprocessing, which allows
simultaneous execution on different cores. This library introduces an overhead because
subprocesses are spawned, each containing a new interpreter. This overhead is not very
large, and the multi-threaded benchmarks have shown that very decent speed-ups are
possible. Since the problems with the Global Interpreter Lock are circumvented by the
multiprocessing module, there is no reason not to use CPython for multi-threaded
applications.
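The multiprocessing pattern described above can be sketched as follows. The prime-counting work function is a made-up CPU-bound task, and the "fork" start method is assumed to be available (it is on Linux); each pool worker is a separate interpreter process, so the Global Interpreter Lock of one process does not block the others.

```python
import multiprocessing

def count_primes(bounds):
    """CPU-bound work: count the primes in [lo, hi) by trial division."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    # "fork" sidesteps the pickling restrictions of the "spawn" method.
    ctx = multiprocessing.get_context("fork")
    chunks = [(0, 5000), (5000, 10000), (10000, 15000), (15000, 20000)]
    with ctx.Pool(4) as pool:                    # one interpreter per process,
        counts = pool.map(count_primes, chunks)  # so the GIL is no bottleneck
    print(sum(counts))
```

The overhead mentioned above is visible here: each worker process starts with its own interpreter state, and the chunk results must be sent back to the parent to be combined.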
7.2 Cython
Cython can provide a huge speed-up over CPython. However, type information is
necessary to get those speed-ups, which is counterintuitive. Most people choose Python
for the ease of programming, and dynamic typing is a very important factor in that
decision. The plethora of different types, and the fact that overflows are caught and
resolved on the fly, also make programming easier. The types need to be converted to C,
which does not support most of them; this makes the definition of the types a lot more
difficult. It is therefore advised that the Python code be written closer to C, which is
counterintuitive for Python programmers. For now this dilemma cannot be solved, which
makes Cython a less interesting alternative.
A possible solution is to guess the types before handing the code to Cython. However,
some variables might change their type during the execution; this could be solved by
creating new variables or by limiting the language. The problem is that it is difficult even
to manually define the types, because they are totally different from C, which means that
automating this will be harder still. This approach has not been tried yet. If it were
successful, a huge speed-up would be obtained.
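A first step toward such type guessing could look at literal assignments. The sketch below is a toy illustration built on the standard `ast` module; it deliberately ignores reassignment with a different type, which is exactly the hard case mentioned above.

```python
import ast

def guess_types(source):
    """Map each variable to the type of the literal first assigned to it."""
    guesses = {}
    for node in ast.walk(ast.parse(source)):
        # Only plain assignments of a literal constant are considered.
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id not in guesses:
                    guesses[target.id] = type(node.value.value).__name__
    return guesses
```

For example, `guess_types("n = 10\nx = 2.5")` yields `{"n": "int", "x": "float"}`. A usable tool would also need data-flow analysis to detect the variables whose type changes mid-execution.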
It is clear that Cython is not a viable alternative for most Python users. However, it
might be interesting for people writing computationally heavy tasks: they already have
experience with C or Fortran and should be able to write Python code enhanced with
Cython statements successfully. Cython also allows the programmer to improve the
performance incrementally. First, plain Python code can be written and tested on smaller
examples; once the code is finished, the most important functions can be improved by
adding type information. Research has shown that there is no significant difference
between a stand-alone C program and a Python version using C libraries [11]. This would
not improve the execution time of the program compared to C, but it might reduce the
development time.
7.3 PyPy
By using a Just-In-Time compiler, PyPy has proven that the performance of CPython is
not optimal. It is not necessary to modify the Python code for PyPy to work. However, the
project does not yet support the latest version of Python. The intention of the project
is to improve Python code, which means that the libraries should be written in Python
as well to take full advantage of the Just-In-Time compiler. It is also possible to use C
libraries, but the Just-In-Time compiler will not be able to optimise them. This advantage
is clearly shown in Section 5.2, where PyPy outperforms C for the string append example.
The largest performance benefit is obtained for CPU-intensive applications. The I/O- and
memory-intensive benchmarks do not show the same performance gain, which is caused
by memory problems. The interpreter, garbage collector and Just-In-Time compiler all
need some memory, which negatively influences the memory behaviour of the user
application. This problem becomes evident when a lot of memory is used, and it is
difficult to reduce the memory used by the interpreter, garbage collector and Just-In-Time
compiler. While it is not possible to reduce the amount of memory used by the user
application, prefetching reduces the number of level-1 data cache misses per instruction.
Software prefetching could lead to an even larger benefit; however, it is not a simple task
to implement and is left for future work.
PyPy’s Just-In-Time compiler is very efficient. The Just-In-Time compiler runs for
a very short amount of time to achieve very large performance improvements for the
application. The Just-In-Time compiler mainly works in the beginning of the execution.
This is important because research has proven that the ‘hot’ code should be detected and
compiled early to improve the speed [18].
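The counter-based hot-code detection behind this behaviour can be sketched in a few lines. The threshold value mirrors PyPy's default of 1039 loop iterations (see Appendix A), while the decorator itself is a toy stand-in for the real tracing machinery.

```python
THRESHOLD = 1039   # PyPy's default loop-hotness threshold (Appendix A)

def hot_loop_detector(func):
    """Count calls and flag the function as 'hot' once it crosses THRESHOLD."""
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        if not wrapper.hot and wrapper.calls >= THRESHOLD:
            wrapper.hot = True   # a real JIT would start tracing here
        return func(*args, **kwargs)
    wrapper.calls = 0
    wrapper.hot = False
    return wrapper

@hot_loop_detector
def body(x):
    return x * x

for i in range(2000):
    body(i)
print(body.hot)   # True: 2000 calls crossed the threshold
```

Because hot code is detected after roughly a thousand iterations, the compiler's work is naturally concentrated at the start of the execution, which matches the behaviour observed above.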
The multiprocessing module reduces the effect of the Just-In-Time compiler, because a
new interpreter is launched on each core. This means that each core gets its own
Just-In-Time compiler and it is not possible to share information between the different
cores. Therefore PyPy exhibited worse speed-ups between single-threaded and
multi-threaded versions of the benchmarks than CPython did.
7.4 General
When performance is not an issue, CPython is the preferred runtime environment. There
are a lot of modules available, it is easy to use, it has extensive documentation and it has
a lot of users. For short-running applications, it is also an excellent choice. Since CPython
can communicate with C, C++ and Fortran, it will do fine when Python is used as
'glue code'. It also performs well with multi-threaded applications.
Before attempting to improve the performance by changing to a different runtime
environment, it would be wise to check whether the algorithms can be improved. A better
algorithm might provide an even higher performance boost.
If performance becomes important and the algorithms cannot be improved, PyPy is a
much better choice, provided the application is CPU-intensive. If there is threading
involved or a lot of memory is used, PyPy has some performance problems.
Finally, if performance really is of the utmost importance, Cython provides the best
option. First, the code can be written in Python and tested with smaller examples. Once
the code is considered correct, the performance can be improved by adding type
information to the most important methods. This provides the smoothest workflow: it
reduces the development time while still supplying an incredibly fast execution time.
However, this approach is only for experienced programmers.
To conclude, the differences between various Python runtime environments have been
investigated and a comparison with C has been performed. This research has led to the
conclusion that better benchmarking is necessary for Python: the most popular Python
benchmarks are not sufficient for a decent analysis. Furthermore, techniques from Java
research have been used to perform an analysis of PyPy. When analysing a Just-In-Time
compiler, it is important to take into account different heap sizes and to use the stable
methodology; this gives a better overview of the influence of the garbage collector and
the Just-In-Time compiler. Finally, suggestions are given to improve the performance of
existing Python runtime environments and to indicate which runtime environment is
most advantageous for each situation.
Appendices
Appendix A
PyPy Options
It is possible to modify the behaviour of the Just-In-Time compiler by setting some
options. The possible modifications are listed on the help page:
Advanced JIT options: a comma-separated list of OPTION=VALUE:

decay=N
    amount to regularly decay counters by (0=none, 1000=max) (default 40)
enable_opts=N
    INTERNAL USE ONLY (MAY NOT WORK OR LEAD TO CRASHES): optimizations to
    enable, or all =
    intbounds:rewrite:virtualize:string:earlyforce:pure:heap:unroll (default all)
function_threshold=N
    number of times a function must run for it to become traced from start
    (default 1619)
inlining=N
    inline python functions or not (1/0) (default 1)
loop_longevity=N
    a parameter controlling how long loops will be kept before being freed,
    an estimate (default 1000)
max_retrace_guards=N
    number of extra guards a retrace can cause (default 15)
max_unroll_loops=N
    number of extra unrollings a loop can cause (default 0)
retrace_limit=N
    how many times we can try retracing before giving up (default 5)
threshold=N
    number of times a loop has to run for it to become hot (default 1039)
trace_eagerness=N
    number of times a guard has to fail before we start compiling a bridge
    (default 200)
trace_limit=N
    number of recorded operations before we abort tracing with ABORT_TOO_LONG
    (default 6000)
off
    turn off the JIT
help
    print this page
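As a concrete illustration, the options above can be combined into a single comma-separated OPTION=VALUE list. The sketch below is hypothetical: the option values are arbitrary, and the pypyjit module used to set them at runtime only exists under PyPy.

```python
import sys

# Build a comma-separated OPTION=VALUE list from the options above.
# On the command line, this list would be passed after --jit, e.g.:
#   pypy --jit threshold=500,trace_limit=10000 myscript.py
options = {"threshold": 500, "trace_limit": 10000}
jit_spec = ",".join("{}={}".format(k, v) for k, v in sorted(options.items()))
print(jit_spec)  # threshold=500,trace_limit=10000

# At runtime, the same knobs can be set via the PyPy-only pypyjit module.
if sys.implementation.name == "pypy":
    import pypyjit
    pypyjit.set_param(jit_spec)
```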
It is also possible to configure the garbage collector by setting environment variables:
PYPY_GC_NURSERY The nursery size. Defaults to 1/2 of your cache or 4M. Small
values (like 1 or 1KB) are useful for debugging.
PYPY_GC_NURSERY_CLEANUP The interval at which nursery is cleaned up.
Must be smaller than the nursery size and bigger than the biggest object we can
allocate in the nursery.
PYPY_GC_INCREMENT_STEP The size of memory marked during the marking
step. Default is size of nursery times 2. If you mark it too high your GC is not
incremental at all. The minimum is set to size that survives minor collection times
1.5 so we reclaim anything all the time.
PYPY_GC_MAJOR_COLLECT Major collection memory factor. Default is 1.82,
which means trigger a major collection when the memory consumed equals 1.82
times the memory really used at the end of the previous major collection.
PYPY_GC_GROWTH Major collection threshold’s max growth rate. Default is 1.4.
Useful to collect more often than normally on sudden memory growth, e.g. when
there is a temporary peak in memory usage.
PYPY_GC_MAX The max heap size. If coming near this limit, it will first collect
more often, then raise an RPython MemoryError, and if that is not enough, crash
the program with a fatal error. Try values like 1.6GB.
PYPY_GC_MAX_DELTA The major collection threshold will never be set to more
than PYPY_GC_MAX_DELTA the amount really used after a collection. Defaults
to 1/8th of the total RAM size (which is constrained to be at most 2/3/4GB on 32-bit
systems). Try values like 200MB.
PYPY_GC_MIN Don’t collect while the memory size is below this limit. Useful to
avoid spending all the time in the GC in very small programs. Defaults to 8 times
the nursery.
PYPY_GC_DEBUG Enable extra checks around collections that are too slow for
normal use. Values are 0 (off), 1 (on major collections) or 2 (also on minor collections).
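As a sketch of how these variables are used in practice, the snippet below builds an environment with a larger nursery and a heap cap and then launches a script under PyPy. The variable values and the script name are illustrative, and a pypy binary is assumed to be on the PATH.

```python
import os
import shutil
import subprocess

# Tune the PyPy GC through environment variables (illustrative values,
# using the variable names documented above).
env = dict(os.environ, PYPY_GC_NURSERY="8MB", PYPY_GC_MAX="1.6GB")

# Only launch if a pypy binary is actually available on the PATH.
if shutil.which("pypy") is not None:
    subprocess.run(["pypy", "myscript.py"], env=env)
```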
Appendix B
Removing Threading From
binary-trees and k-nucleotide
Only two benchmarks use threads. I removed threading from both benchmarks in order
to analyse the multi-threaded behaviour, as explained in Section 5.5.
B.1
k-nucleotide
For the C version, this has been accomplished by making get_cpu_count return one instead of the number of cores:
int
get_cpu_count(void) {
    cpu_set_t cpu_set;
    CPU_ZERO(&cpu_set);
    sched_getaffinity(0, sizeof(cpu_set), &cpu_set);
    return 1;
    // return CPU_COUNT(&cpu_set);
}
For Python, it is necessary to remove any use of a pool. This can be accomplished by
replacing the pool's map function with the built-in one:
def main():
    global sequence
    sequence = prepare()
    #p = Pool()
    #res2 = p.map_async(find_seq, reversed("GGT GGTA GGTATT GGTATTTTAATT GGTATTTTAATTTATAGT".split()))
    #res1 = p.map_async(sort_seq, (1, 2))
    res2 = map(find_seq, reversed("GGT GGTA GGTATT GGTATTTTAATT GGTATTTTAATTTATAGT".split()))
    res1 = map(sort_seq, (1, 2))
    #for s in res1.get(): print(s + '\n')
    #res2 = reversed([r for r in res2.get()])
    for s in res1: print(s + '\n')
    res2 = reversed([r for r in res2])
    print("\n".join("{1:d}\t{0}".format(*s) for s in res2))
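One point worth noting when swapping pool.map_async for the built-in map: in Python 3, map returns a lazy iterator rather than a list, so the results only materialise when iterated. The sketch below uses a toy function rather than the benchmark's:

```python
# map returns a lazy iterator in Python 3; iterating it (or wrapping it
# in list()) forces the computation, much like calling .get() on the
# AsyncResult returned by Pool.map_async.
res = map(lambda x: x * x, (1, 2, 3))
out = list(res)  # force the iterator
print(out)  # [1, 4, 9]
```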
B.2
binary-trees
The approach used for k-nucleotide was not easily applicable to this benchmark. A
different solution is to place a pthread_join immediately after each pthread_create:
/*
 * The calculations is started in reverse order compared to most other
 * solutions. The reason is that all data must be on the stack and the
 * result from shallowest tree must be printed first.
 */
void
do_trees(int depth, int min_depth, int max_depth)
{
    pthread_t thread;
    pthread_attr_t attr;
    struct item_worker_data wd;

    if (depth < min_depth)
        return;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, stack_sz(depth + 1));
    wd.iterations = 1 << (max_depth - depth + min_depth);
    wd.check = 0;
    wd.depth = depth;
    pthread_create(&thread, &attr, item_worker, &wd);
    pthread_join(thread, NULL);
    do_trees(depth - 2, min_depth, max_depth);
    // pthread_join(thread, NULL);
    pthread_attr_destroy(&attr);
    printf("%d\ttrees of depth %d\tcheck: %d\n",
           2 * wd.iterations,
           depth,
           wd.check);
}
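The effect of this pthread_join placement can be illustrated with a small Python sketch (a hypothetical worker, not the benchmark code): joining each thread immediately after starting it means at most one worker is ever alive, so the work runs serially and deterministically.

```python
import threading

results = []
for i in range(3):
    # start/join back-to-back serialises the workers, mirroring the
    # pthread_create followed by pthread_join pattern in the C code above
    t = threading.Thread(target=lambda n=i: results.append(n * 2))
    t.start()
    t.join()  # wait before creating the next worker

print(results)  # [0, 2, 4]
```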
For the Python version, it is again possible to replace the pool's map function with the
built-in one:
def main(n, min_depth=4):
    max_depth = max(min_depth + 2, n)
    stretch_depth = max_depth + 1
    if mp.cpu_count() > 1:
        #pool = mp.Pool()
        #chunkmap = pool.map
        chunkmap = map
    else:
        chunkmap = map
    print('stretch tree of depth {0}\t check: {1}'.format(
        stretch_depth, make_check((0, stretch_depth))))
    long_lived_tree = make_tree(0, max_depth)
    mmd = max_depth + min_depth
    for d in range(min_depth, stretch_depth, 2):
        i = 2 ** (mmd - d)
        cs = 0
        for argchunk in get_argchunks(i, d):
            cs += sum(chunkmap(make_check, argchunk))
        print('{0}\t trees of depth {1}\t check: {2}'.format(i * 2, d, cs))
    print('long lived tree of depth {0}\t check: {1}'.format(
        max_depth, check_tree(long_lived_tree)))
Bibliography
[1] nbviewer, a simple way to share ipython notebooks, May 2014.
[2] speed.pypy.org project, May 2014.
[3] Aneesh Aggarwal. Software caching vs. prefetching. SIGPLAN Not., 38(2
supplement):157–162, June 2002.
[4] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn,
and Kurt Smith. Cython: The best of both worlds. Computing in Science and Engg.,
13(2):31–39, March 2011.
[5] Yosi Ben Asher and Nadav Rotem. The effect of unrolling and inlining for python
bytecode optimizations. In Proceedings of SYSTOR 2009: The Israeli Experimental
Systems Conference, SYSTOR ’09, pages 14:1–14:14, New York, NY, USA, 2009.
ACM.
[6] S. M. Blackburn, R. Garner, C. Hoffman, A. M. Khan, K. S. McKinley, R. Bentzur,
A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump,
H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von Dincklage,
and B. Wiedermann. The DaCapo benchmarks: Java benchmarking development and
analysis. In OOPSLA ’06: Proceedings of the 21st annual ACM SIGPLAN conference
on Object-Oriented Programming, Systems, Languages, and Applications, pages 169–
190, New York, NY, USA, October 2006. ACM Press.
[7] Hans-J. Boehm. Reducing garbage collector cache misses. SIGPLAN Not., 36(1):59–
64, October 2000.
[8] Carl Friedrich Bolz, Antonio Cuni, Maciej Fijałkowski, Michael Leuschel, Samuele
Pedroni, and Armin Rigo. Allocation removal by partial evaluation in a tracing
jit. In Proceedings of the 20th ACM SIGPLAN Workshop on Partial Evaluation and
Program Manipulation, PEPM ’11, pages 43–52, New York, NY, USA, 2011. ACM.
[9] Carl Friedrich Bolz, Antonio Cuni, Maciej Fijalkowski, Michael Leuschel, Samuele
Pedroni, and Armin Rigo. Runtime feedback in a meta-tracing jit for efficient dynamic
languages. In Proceedings of the 6th Workshop on Implementation, Compilation,
Optimization of Object-Oriented Languages, Programs and Systems, ICOOOLPS ’11,
pages 9:1–9:8, New York, NY, USA, 2011. ACM.
[10] Carl Friedrich Bolz, Antonio Cuni, Maciej Fijalkowski, and Armin Rigo. Tracing
the meta-level: Pypy’s tracing jit compiler. In Proceedings of the 4th Workshop on
the Implementation, Compilation, Optimization of Object-Oriented Languages and
Programming Systems, ICOOOLPS ’09, pages 18–25, New York, NY, USA, 2009.
ACM.
[11] Xing Cai, Hans Petter Langtangen, and Halvard Moe. On the performance of the
python programming language for serial and parallel scientific computations. Sci.
Program., 13(1):31–56, January 2005.
[12] Jose Castanos, David Edelsohn, Kazuaki Ishizaki, Priya Nagpurkar, Toshio Nakatani,
Takeshi Ogasawara, and Peng Wu. On the benefits and pitfalls of extending a statically typed language jit compiler for dynamic scripting languages. SIGPLAN Not.,
47(10):195–212, October 2012.
[13] Chen-Yong Cher, Antony L. Hosking, and T. N. Vijaykumar. Software prefetching for
mark-sweep garbage collection: Hardware analysis and software redesign. SIGARCH
Comput. Archit. News, 32(5):199–210, October 2004.
[14] Alex Holkner and James Harland. Evaluating the dynamic behaviour of python applications. In Proceedings of the Thirty-Second Australasian Conference on Computer
Science - Volume 91, ACSC ’09, pages 19–28, Darlinghurst, Australia, Australia, 2009.
Australian Computer Society, Inc.
[15] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual –
Volume 3, February 2014.
[16] Gregory L. Lee, Dong H. Ahn, Bronis R. de Supinski, John Gyllenhaal, and Patrick
Miller. Pynamic: The python dynamic benchmark. In Proceedings of the 2007 IEEE
10th International Symposium on Workload Characterization, IISWC ’07, pages 101–
106, Washington, DC, USA, 2007. IEEE Computer Society.
[17] Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. When prefetching works, when it
doesn't, and why. ACM Trans. Archit. Code Optim., 9(1):2:1–2:29, March 2012.
[18] Seong-Won Lee and Soo-Mook Moon. Selective just-in-time compilation for client-side mobile javascript engine. In Proceedings of the 14th International Conference
on Compilers, Architectures and Synthesis for Embedded Systems, CASES ’11, pages
5–14, New York, NY, USA, 2011. ACM.
[19] Jaydeep Marathe and Frank Mueller. Pfetch: Software prefetching exploiting temporal predictability of memory access streams. In Proceedings of the 9th Workshop
on MEmory Performance: DEaling with Applications, Systems and Architecture,
MEDEA ’08, pages 1–8, New York, NY, USA, 2008. ACM.
[20] Jan Martinsen, Hakan Grahn, and Anders Isberg. Using speculation to enhance
javascript performance in web applications. IEEE Internet Computing, 17(2):10–19,
March 2013.
[21] Fadi Meawad, Gregor Richards, Floréal Morandat, and Jan Vitek. Eval begone!:
Semi-automated removal of eval from javascript programs. In Proceedings of the
ACM International Conference on Object Oriented Programming Systems Languages
and Applications, OOPSLA ’12, pages 607–620, New York, NY, USA, 2012. ACM.
[22] John K. Ousterhout. Scripting: Higher-level programming for the 21st century. Computer, 31(3):23–30, March 1998.
[23] Armin Rigo. Representation-based just-in-time specialization and the psyco prototype for python. In Proceedings of the 2004 ACM SIGPLAN Symposium on Partial
Evaluation and Semantics-based Program Manipulation, PEPM ’04, pages 15–26, New
York, NY, USA, 2004. ACM.
[24] Armin Rigo. Transactional Memory (II). http://morepypy.blogspot.be/2012/01/
transactional-memory-ii.html, 2012. [Online; accessed 5-June-2014].
[25] Dag Sverre Seljebotn. Fast numerical computations with cython. In Gaël Varoquaux,
Stéfan van der Walt, and Jarrod Millman, editors, Proceedings of the 8th Python in
Science Conference, pages 15 – 22, Pasadena, CA USA, 2009.
[26] TIOBE Software. TIOBE programming community index, 2014.
[27] Haiping Zhao, Iain Proctor, Minghui Yang, Xin Qi, Mark Williams, Qi Gao, Guilherme Ottoni, Andrew Paroski, Scott MacVicar, Jason Evans, and Stephen Tu. The
hiphop compiler for php. In Proceedings of the ACM International Conference on
Object Oriented Programming Systems Languages and Applications, OOPSLA ’12,
pages 575–586, New York, NY, USA, 2012. ACM.
List of Figures

3.1   Architecture CPython . . . 10
3.2   The phases used by the bytecode compiler . . . 14
3.3   Architecture PyPy . . . 17
3.4   The RPython toolchain . . . 18
4.1   Comparison between PyPy and CPython by speed.python.org . . . 25
4.2   Time measurements for The Grand Unified Python Benchmark suite . . . 26
5.1   Time measurements for the pairwise distance calculation problem with 1000 points . . . 35
5.2   Time comparison between the runtime environments normalised to CPython . . . 39
5.3   The number of cycles per instruction . . . 41
5.4   The number of branches per instruction . . . 42
5.5   The number of branch misses per instruction . . . 42
5.6   The number of level-1 instruction cache loads per instruction . . . 43
5.7   The number of level-1 instruction cache load misses per instruction . . . 44
5.8   The number of level-1 data cache loads per instruction . . . 45
5.9   The number of level-1 data cache load misses per instruction . . . 45
5.10  The number of last level cache loads per instruction . . . 46
5.11  The number of last level cache load misses per instruction . . . 47
5.12  The number of last level cache stores per instruction . . . 48
5.13  The number of last level cache store misses per instruction . . . 48
5.14  The number of instruction translation lookaside buffer load misses per instruction . . . 49
5.15  The number of data translation lookaside buffer load misses per instruction . . . 49
5.16  The number of data translation lookaside buffer store misses per instruction . . . 50
5.17  Time measurements for C normalised to the time with threading . . . 51
5.18  Time measurements for Cython normalised to the time with threading . . . 52
5.19  Time measurements for CPython normalised to the time with threading . . . 53
5.20  Time measurements for PyPy normalised to the time with threading . . . 53
5.21  Time comparison between the PYC and CPython runtime environments, normalised to CPython . . . 54
6.1   The flow graph for the is_perfect_number method . . . 58
6.2   The flow graph for the is_perfect_number method after the annotate phase . . . 59
6.3   The flow graph for the is_perfect_number method after the rtyping phase . . . 59
6.4   The jitviewer in action, showing the Python code, Python bytecode and the intermediate representation of the code of the Just-In-Time compiler for the spectral-norm benchmark . . . 61
6.5   Time lapse of fannkuch-redux with argument 11 and generated width 1000 . . . 62
6.6   First part time lapse of fannkuch-redux with argument 11 and generated width 20000 . . . 62
6.7   Time lapse of the spectral-norm benchmark with argument 3000 . . . 63
6.8   Time measurements PyPy runtime environments normalised to PyPy. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 65
6.9   Cycles per instruction for the PyPy runtime environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 66
6.10  Branches per instruction for the PyPy runtime environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 67
6.11  Branch misses per instruction for the PyPy runtime environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 67
6.12  Level-1 instruction cache loads per instruction for the PyPy runtime environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 68
6.13  Level-1 instruction cache load misses per instruction for the PyPy runtime environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 68
6.14  Level-1 data cache loads per instruction for the PyPy runtime environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 69
6.15  Level-1 data cache load misses per instruction for the PyPy runtime environments. Note that there are no results for PyPyS for k-nucleotide and binary-trees. . . . 70
6.16  Time measurements normalised to the execution time with the minimum heap size (MHS) for PyPy with varying heap sizes . . . 71
6.17  Influence prefetching on the level-1 instruction cache loads for the PyPy runtime environments . . . 72
6.18  Influence prefetching on the level-1 instruction cache load misses for the PyPy runtime environments . . . 73
6.19  Influence prefetching on the level-1 data cache loads for the PyPy runtime environments . . . 74
6.20  Influence prefetching on the level-1 data cache load misses for the PyPy runtime environments . . . 74
6.21  Prefetching approach created by Frank Mueller and Jaydeep Marathe [19] . . . 76