Download PyStream: Compiling Python onto the GPU

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Falcon (programming language) wikipedia , lookup

Resource management (computing) wikipedia , lookup

Comment (computer programming) wikipedia , lookup

Structured programming wikipedia , lookup

C Sharp syntax wikipedia , lookup

Go (programming language) wikipedia , lookup

Library (computing) wikipedia , lookup

Abstraction (computer science) wikipedia , lookup

Stream processing wikipedia , lookup

Object-relational impedance mismatch wikipedia , lookup

Coding theory wikipedia , lookup

Corecursion wikipedia , lookup

Python syntax and semantics wikipedia , lookup

C++ wikipedia , lookup

Object-oriented programming wikipedia , lookup

Name mangling wikipedia , lookup

Compiler wikipedia , lookup

C Sharp (programming language) wikipedia , lookup

Python (programming language) wikipedia , lookup

Program optimization wikipedia , lookup

Cross compiler wikipedia , lookup

Interpreter (computing) wikipedia , lookup

General-purpose computing on graphics processing units wikipedia , lookup

Transcript
PROC. OF THE 10th PYTHON IN SCIENCE CONF. (SCIPY 2011)
79
PyStream: Compiling Python onto the GPU
Nick Bray
F
Index Terms—pystream, compiling python, gpu
RA
Introduction
FT
Abstract—PyStream is a static compiler that can radically transform Python
code and run it on a Graphics Processing Unit (GPU). Python compiled to
run on the GPU is ~100,000x faster than when interpreted on the CPU. The
PyStream compiler is specially designed to simplify the development of realtime rendering systems by allowing the entire rendering system to be written
in a single, highly productive language. Without PyStream, GPU-accelerated
real-time rendering systems must contain two separate code bases written in
two separate languages: one for the CPU and one for the GPU. Functions
and data structures are not shared between the code bases, and any common
functionality must be redundantly written in both languages. PyStream unifies
a rendering system into a single, Python code base, allowing functions and
data structures to be transparently shared between the CPU and the GPU. A
single, unified code base makes it easy to create, maintain, and evolve a highperformance GPU-accelerated application.
Fig. 1: Naïve background color.
D
High-performance computer hardware can be difficult to program because ease of programming is often traded for raw
performance. For example, graphics processing units (GPUs)
are traditionally programmed in languages that either restrict
memory use or explicitly expose the memory hierarchy to
the programmer. The OpenGL Shading Language (GLSL) is
an example of the first, and OpenCL is an example of the
second. Neither type of language is particularly easy to use,
rather they are designed to address a potential bottleneck for
GPU architectures: memory bandwidth. GPUs pack enough
functional units into a single chip that overall performance
can easily be limited by the memory subsystem’s ability to
feed data to the functional units.
Ease of programming is not the only issue when using
GPU-specific languages. These languages are specialized for
performance-critical numeric computations and are not suitable for writing a complete application. For instance, these
languages cannot load data from disk or provide a graphical
user interface. Instead, GPU languages typically provide APIs
to interoperate with a different, general-purpose language.
Using these APIs results in an application with two code
bases, each with distinct semantics. Additional glue code is
also required overcome the impedance mismatch between the
code bases. Glue code is used to remap and transfer data
structures from one code base to another. Glue code is also
used to invoke functions across the language boundary.
Nick Bray is with Google. E-mail: [email protected].
c
○2011
Nick Bray. This is an open-access article distributed under the terms
of the Creative Commons Attribution License, which permits unrestricted use,
distribution, and reproduction in any medium, provided the original author
and source are credited.
Fig. 2: Background color calculation using shared code.
Constructing a GPU-accelerated application with two code
bases and glue code has a number of software engineering
costs. For instance, any data transferred between the CPU and
the GPU must have its structure defined in both languages
and have glue code to remap and transfer the data. Any
modification to such a data structure will require modifying
all its definitions. Certain functions may also need to be
duplicated so they can be used in each code base. For example,
Figure 1 shows a rendering system where the background
80
PROC. OF THE 10th PYTHON IN SCIENCE CONF. (SCIPY 2011)
color - calculated on the CPU - does not take into account
the color processing performed on the GPU. Figure 2 shows
the same rendering system, but with the color-processing
code duplicated on the CPU and applied to the background
color. Ultimately, applications that incorporate GPU-specific
languages tend to resist change because data structures and
functions in two code bases must be kept in sync. This
complicates the process of maintaining and evolving such an
application.
PyStream
Language Restrictions
RA
FT
PyStream [Bra10] is a static source-to-source compiler that
translates Python code into GLSL for use in real-time graphical rendering. PyStream also generates the glue code necessary
to seamlessly invoke the generated GLSL code from a Python
application. PyStream allows applications incorporating GPUaccelerated real-time rendering to be written as a single,
unified Python code base. This allows the high productivity
of the Python language to be used while also gaining the
performance of GPU acceleration. PyStream sidesteps the
problems that arise from having two code bases, which would
otherwise diminish the productivity gains of using Python.
Programming a GPU with Python allows the use of objectoriented programming, polymorphic functions, and other programming language features that are often not available in
GPU-specific languages. Python also has the advantage of
being more concise. In practice, Python code is roughly five
to seven times more terse than the corresponding GLSL code.
GLSL code. For instance, GLSL programs are constrained to
have no global side effects. Code compiled by PyStream must
behave the same. GLSL does not allow recursive function calls
because it is designed to run on hardware without a call stack.
This restriction is adopted by PyStream. Similarly, GLSL is
designed to run in an environment where memory is statically
allocated for each processor. PyStream in turn requires that
the code it compiles have bounded memory usage, allowing
the compiler to statically allocate memory.
In practice, the most significant of these restrictions appears
to be the need for bounded memory usage. This restriction
prevents the use of recursive data structures and most mutable
list, dict, and set objects. For example, if a program
appends to a list inside of a loop, the compiler will be unable to
determine the maximum size of the list. Future improvements
to the compiler may allow it to bound the number of loop
iterations in some cases, this problem is equivalent to the
halting problem in the general case.
Most of these restrictions are applied after the compiler
optimizes a program. For example, a highly polymorphic
function may initially appear to be recursive, but this recursion
can disappear once the function has been duplicated and
specialized for the different situations in which it is called.
As will be discussed later, PyStream uses a novel approach
for representing Python programs. This approach treats the
Python interpreter as part of the program being compiled.
There are often recursive calls through the interpreter, such
as when the addition of a vector type is implemented in
terms of the addition of its scalar elements. This pattern is so
pervasive that disallowing recursive calls before optimizations
are applied would disallow most Python programs. Problematically, disallowing recursive calls after compilation requires
that a programmer must understand how the compiler behaves.
Although this conceit is undesirable, it is necessary.
PyStream currently does not support a number of Python
features, including exceptions and closures. These features will
be supported in the future.
D
PyStream can compile a restricted subset of Python onto
the GPU. Restrictions are necessary to make the compilation
process tractable. Restrictions are also necessary because of
the fundamental limitations of modern GPU hardware. PyStream’s restricted subset of Python provides, at minimum,
the functionality of GLSL but with the syntax, semantics,
and abstraction mechanisms of Python, as well as complete
integration with Python applications. PyStream requires the
following to translate code onto a GPU:
• A closed world.
• No global side effects.
• No recursive function calls.
• Bounded memory usage.
To statically compile a Python program, a closed world
must be created. If a program can call a function that the
compiler knows nothing about, then the compiler must assume
that the function can have arbitrary side effects: rewriting
globals, classes, and other data structures. In such a situation,
a static compiler cannot prove anything about the program
and therefore cannot transform it in any meaningful way.
To prevent this situation, PyStream disallows the execution
of unknown code. Dynamic code compilation and execution,
such as through the use of exec and eval, is forbidden. In
addition, modules are imported at compile time and assumed
to never change thereafter.
GLSL has several restrictions, when compared to Python,
and they are adopted by PyStream so that it can generate
PyStream in Practice
A real-time rendering system was developed with the PyStream compiler to validate the design of the compiler. The
example rendering system implements the core algorithms
used by the game Starcraft 2 [Fil08]. Rendering systems
typically use many different algorithms to produce a final
image. These algorithms are divided into shader programs
that that are executed on batches of data sent to the GPU.
The example rendering system contains 8 different shader
programs. A shader programs is further subdivided into several
individual shaders that process different kinds of data, such
as vertices in a 3D model or pixels being written into an
image. The code for one of the shader programs in the example
rendering system is included below.
class AmbientPass(ShaderProgram):
def shadeVertex(self, context, pos, texCoord):
context.position = pos
return texCoord,
def shadeFragment(self, context, texCoord):
PYSTREAM: COMPILING PYTHON ONTO THE GPU
81
Compiling Python
D
RA
FT
PyStream takes a novel approach to compiling Python that is
simpler and more flexible than previous approaches [Sal04],
[Pyp11]. The key to PyStream’s approach is that it keeps
its internal representation of the program it is compiling
as simple as possible. Compiling Python can be potentially
complicated because the language is filled with numerous
special cases. For example, adding two objects together can
result in the __add__ method being called on the first
object, the __radd__ method being called on the second
object, or an exception being thrown. More precisely, the
interpreter can do all of the above for a single operation if
both methods exist but return NotImplemented. Calling
either method can result in arbitrary code being executed
and can have arbitrary side effects, so the precise definition
of the addition operator is both complicated and ambiguous.
Any relationship between Python’s addition operator and the
mathematical concept of addition is a convention and not an
intrinsic part of the language. Virtually every Python operation
can execute arbitrary code, even operations such as reading an
attribute of an object.
Fig. 3: An image produced by the example rendering system.
Prior to PyStream, Python compilers attempted to embed extensive knowledge of Python’s semantics into their algorithms.
For example, every analysis algorithm and optimization would
need to implicitly understand how the interpreter dispatched
# Sample the underlying geometry
addition operations. Typically this knowledge was not precise,
g = self.gbuffer.sample(texCoord)
and did not cover every corner case. PyStream takes a different
# Sample the ambient occlusion
ao = self.ao.texture(texCoord).xyz
approach. Instead of trying to embed a complete knowledge of
# Calculate the lighting
Python’s semantics into its algorithms, it treats the interpreter
ambientLight = self.env.ambientColor(g.normal)*ao
as if it were a library being called by the Python program. This
# Modulate the output
color=vec4(g.diffuse*ambientLight, 1.0)
allows PyStream to easily and accurately analyze Python’s
context.colors = (color,)
complex semantics without complicating the compiler. The
consequence of this approach is that PyStream appears to
This shader program performs a specific kind of lighting process three times as much code as other Python compilers.
calculation for the example rendering system. PyStream’s This extra code would need to be evaluated one way or the
shader programs are a Pythonic version of GLSL’s shader other, PyStream evaluates it explicitly as code rather than
programs. The previous shader program is implemented as implicitly inside the compiler.
Because PyStream treats the interpreter as part of the
a class that contains two shader methods: a vertex shader
program,
standard optimizations such as dead code elimination
and a fragment shader. The first two arguments for each
shader are special. The self argument holds data that is and function inlining are extremely effective at eliminating
constant during the execution of the shader. The context Python’s run time overhead. Interpreter functions are initially
argument holds an object with shader-specific fields. For quite complicated, but they are typically optimized down to a
example, colors written to the context.colors field inside single operation and later inlined. In addition to the standard
of a fragment shader will be written into the image(s) being optimizations, several Python-specific transformations are also
rendered after the shader has been executed. All subsequent performed. For example, method calls are optimized to elimiarguments correspond to streams of data being fed into the nate the creation of bound method objects wherever possible.
shader. Return values correspond to streams of data produced
by the shader.
Mapping Python onto the GPU
Python’s abstraction mechanisms are used throughout the
example rendering system. For instance, algorithms for calculating how light reflects off surfaces are encapsulated in
polymorphic Material objects. This allows the appearance
of a surface to be controlled by composing a shader object with
different types of material objects. Rendering systems often
contain custom code generators [Mit07] because the GPUspecific language they are using does not natively support
polymorphism.
After analyzing and optimizing a program, PyStream then
maps it onto the GPU. One of the biggest challenges in mapping a Python shader program onto the GPU is the presence
of memory operations. GLSL does not support pointers in any
form: the address of an objects cannot be taken, and function
arguments are passed by value. Python, on the other hand, hold
every object by reference. PyStream bridges this semantic gap
by eliminating as many memory operations as possible and
then emulating the rest.
82
PROC. OF THE 10th PYTHON IN SCIENCE CONF. (SCIPY 2011)
Fig. 4: Performance of the example rendering system as the number
of objects drawn is increased.
real-time rendering, a task the GPU was designed for. Taken
together, this easily explains the net speedup.
Figure 4 shows the performance of the example rendering
system, in frames per second (FPS), as the number of objects drawn increases. Drawing more objects requires more
computation, and will naturally reduce the rate at which
images are produced. Rendering systems may be bottlenecked
by factors other than computation; they can also be limited
by the rate that glue code can transfer data to the GPU.
PyStream can generate glue code for both OpenGL 2 and
OpenGL 3. OpenGL 3 has features that let it transfer data more
efficiently to the GPU. As seen in the figure, these features
can offer a ~20% speed improvement when the rendering
system is bottlenecked by its glue code. This demonstrates
an interesting benefit of PyStream: future proofing. PyStream
can take advantage of new features offered by GPUs and GPU
APIs without requiring modifications to the rendering system.
FT
Before even trying to map a program onto the GPU, PyStream aggressively eliminates as many memory operations as
possible. If PyStream can eliminate every memory operation,
translating the program into GLSL is trivial. The optimizations
PyStream performs are a mixture of load/store elimination
and shader-specific transformations such as flattening the input
and output data structures for each shader into a list of local
variables. In practice, these optimization eliminate almost all
the memory operations in the example rendering system.
It is not always possible to eliminate every memory operation, however. PyStream uses two different strategies to
emulate the remaining memory operations. If an object is
never modified or is only held by a single reference at a time,
PyStream copies the object as needed rather than treating it
as a distinct memory location. If an object is held by multiple
references and also modified, PyStream places it in an array
of objects and uses an index into the array as a pointer to the
object.
Performance
D
RA
The example rendering system demonstrates that PyStream
is quite effective at compiling Python shaders. A manual
inspection of the generated GLSL code reveals that it is close
to what would be written by hand. Quantitatively, PyStream
provides a massive speedup for the compiled shaders. The
following table shows the time taken to draw one million
pixels with a Python shader program when it is interpreted
on the CPU versus when it is compiled onto the GPU.
Measurements were taken on a AMD Athlon 64 X2 3800+
CPU with a NVidia 9800 GT GPU running Windows XP and
Python 2.6.4.
Shader
CPU
GPU
Speedup
material 220.5 s 5.62 ms 39,211x
skybox
35.5 s 0.81 ms 43,568x
ssao
444.9 s 1.44 ms 308,958x
bilateral 429.1 s 1.49 ms 288,956x
ambient 64.1 s 0.84 ms 76,310x
light
127.1 s 0.95 ms 133,789x
blur
74.2 s 0.54 ms 138,692x
post
442.6 s 9.57 ms 46,272x
average 180.8 s 1.23 ms 146,712x
On average, the shaders in the example rendering system
run 146,712x faster when compiled onto the GPU than when
interpreted on the CPU. The CPU timings are synthetic and
only measure the execution time of the shader code and neglect
the time required to sample textures and other functionality
in the rendering system. The GPU timings take all costs
into account, so the speedup is understated. Five orders of
magnitude speedup is reasonable, however. Compiling an
optimized Python program into C can provide two orders
of magnitude speedup [Sal04]. For unoptimized programs
taking full advantage of Python’s abstraction mechanisms, an
additional order of magnitude of speedup can be achieved
because a static compiler can inline functions and globally
optimize the program whereas an interpreter always pays the
abstraction overhead. Switching from a CPU to a GPU can
easily provide another two orders of magnitude speedup for
Conclusion
PyStream takes a unique approach to high-performance highlevel programming. The compiler can map a significant portion
of a general-purpose language onto a GPU, and allow a
complete GPU-accelerated application to be written with a
single code base. This demonstrates that productive high-level
languages and high performance are not mutually exclusive,
even for critical computational kernels.
R EFERENCES
[Bra10] N. C. Bray. PyStream: Python Shaders Running on the GPU,
PhD thesis, Department of Electrical and Computer Engineering,
University of Illinois at Urbana-Champaign.
[Fil08] D. Filion and R. McNaughton, Effects & techniques, SIGGRAPH
’08: ACM SIGGRAPH 2008 Classes, pp. 133-164.
[Mit07] M. Mittring, Finding next gen: CryEngine 2, SIGGRAPH ’07:
ACM SIGGRAPH 2007 Courses, 2007, pp. 97-121.
[Pyp11] Online: http://codespeak.net/pypy/dist/pypy/doc/
[Sal04] M. Salib, Starkiller: A static type inferencer and compiler for
Python, M.S. thesis, Department of Electrical Engineering and
Computer Science, Massachusetts Institute of Technology.