
2016 Fourth International Workshop on Software Engineering for High Performance Computing in Computational Science
and Engineering
Computational Efficiency
vs. Maintainability and Portability.
Experiences with the Sparse Grid Code SG++
Dirk Pflüger
David Pfander
Institute for Parallel and Distributed Systems
University of Stuttgart, Germany
Email: [email protected]
Institute for Parallel and Distributed Systems
University of Stuttgart, Germany
Email: [email protected]
Abstract—Software in computational science and engineering often lacks good software engineering practices. One of several reasons is that high computational performance and efficiency do not coincide very well with software quality attributes such as flexibility, extensibility, usability, modularity or maintainability. Where low-level programming is required to exploit parallel hardware to the best extent, common programming paradigms such as object-oriented programming cannot be applied any more. To achieve as much software quality as possible, the trade-off between computational efficiency and maintainability and portability has to be chosen carefully.
We demonstrate some optimizations that have to be applied, using the example of SG++, a numerics framework for the efficient solution of higher-dimensional problems in science and engineering with sparse grids. We discuss the performance advantages and the software quality disadvantages they cause. We explain the criteria for the design decisions in SG++ and report on the lessons learned.
Software has become a critically important part of research and development. Especially in computational science and engineering (CS&E), computer experiments (simulations) have established themselves as the “third pillar of science” right next to theory and experiment [15]. The software for cutting-edge simulations and research has grown over the last decades, leading to large software packages with plenty of dependencies. They go far beyond classical research codes that can be implemented, developed, used and maintained by a single researcher. Ensuring good software quality is crucial in academia if generations of PhD students need to work on the same piece of software. Flexibility, extensibility, usability, modularity and maintainability are some of the software quality attributes that have to be ensured. However, the state of the art in academic software development frequently exhibits just the opposite: there is often no well-defined software development process. The reasons are manifold, be it a lack of software engineering training in non-CS disciplines, or that the scope and outcomes of new research are entirely unknown in the beginning and that many codes have been thought to be just for a single PhD project [16]. Based on such observations, the German Science Council (Wissenschaftsrat) has recently demanded to put more emphasis on software as a value of its own, and that funding agencies should offer support to develop and maintain public research codes [17].
The trade-off between computational efficiency and scalability on multi- and many-core systems on the one hand, which is the holy grail in high-performance computing (HPC), and maintainability and portability on the other hand, which are classical software quality attributes, has been identified as one of the core problems of scientific software development in academia [16]. To report from our own experience in molecular dynamics [3]: textbook software engineering can lead to design patterns and the use of object-oriented programming (OOP), providing excellent flexibility. However, for performance-critical code, data locality has to be exploited, dynamic data structures have to be replaced by arrays, and memory transfer can become a severe bottleneck. As a consequence, in HPC scenarios desired software abstractions and design patterns cannot be used on all levels and have to be avoided, even though this leads to a significant loss of generality and flexibility.
The advent of multi- and many-core hardware has aggravated this problem. Code that runs efficiently on one platform may run on its successor only with severe performance penalties, or not at all. Cache lines have to be exploited, vector registers have to be used, and branching has to be avoided. Accelerator cards such as GPUs prohibit or penalize programming models and approaches that are frequently used in software engineering, such as certain software patterns or OOP [8]. This makes it a hard task to write high-performance code that can be used flexibly for a whole range of applications and research tasks.
As this is a general problem, new concepts such as object-oriented numerics (OON) have been developed [1]. The main idea is to implement compute-intensive parts of the code using imperative programming and to use OOP for high-level functionality that is less crucial for high efficiency. Thus, performance-critical parts are wrapped in an API that ensures the classical software quality attributes mentioned above, at least to some extent. A critical design decision is where to place the cut-off between computational efficiency and usability, flexibility and extensibility. Similar aspects have been discussed for the development of large-scale numerical libraries such as FEniCS or DUNE, but their efforts are not simple to generalize to software packages with a less specialized application scope.
978-1-5090-5224-0/16 $31.00 © 2016 IEEE
DOI 10.1109/SE-HPCCSE.2016.7
Challenges for SG++. For the development of the software framework SG++ in an academic and interdisciplinary environment without any funding for software engineering, we have identified the following challenges within the trade-off:
1) Computational efficiency. For large-scale simulation scenarios, at least the performance-critical parts have to reach towards the best possible performance on different types of hardware, even heterogeneous ones.
2) Portability. The software should run on different types of hardware. Ideally, this should cover the whole HPC zoo, including clusters, accelerator cards and more.
3) Maintainability. The code should be easy to read and to maintain, well beyond a single generation of PhD students. For example, there should not be much code duplication.
4) Usability of code. Non-HPC users have to be able to do scientific research with it without having to understand it in detail. For researchers, there should be a short “time-to-science”. This demands a short learning curve so that students can work on and research with SG++ within the time-frame of a thesis.
In this paper, we demonstrate, using the example of SG++, how a good trade-off between a high degree of computational efficiency and software quality attributes such as usability, flexibility and extensibility can be achieved – and what restrictions and design principles we had to ensure. SG++ is an open-source software framework to solve higher-dimensional numerical problems with so-called sparse grids and can be applied in a whole range of applications, from data mining to the solution of partial differential equations (PDEs). In Section II we briefly introduce the underlying numerics and the scope to which it can be applied. This provides the requirements to sketch relevant parts of the software design in Section III. Section IV describes restrictions imposed by HPC and the efficient use of available hardware resources. As an example, we show optimizations for a typical algorithm in Section V. In Section VI, we finally summarize some of the lessons learned.
Higher-dimensional problems arise in many applications. In the following, we try to sketch the problem formulation and the numerical algorithms in a more intuitive way, without the usual formalism. For more detailed information about sparse grids, we refer to [4], [14], e.g.
The core task is to approximate or represent a functional dependency f(x1, x2, ..., xd) in d variables in a numerical way. It arises in plenty of applications. The function f might not be known. In a data mining application, we might want to predict tomorrow’s temperature based on d measurements obtained today, and all we have as information about the underlying dependency is data obtained in the past. For car crash test simulations, f could be a computationally expensive software that computes the intrusion of the front wall based on shape parameters. In computational finance, it could be the fair value of an option based on stock values. In CO2 sequestration, it can be the time until leakage based on parameters such as underground permeability, pressure, and porosity. In all those settings, we would like to obtain an approximation u ≈ f that we can use to evaluate or to analyze.
In a numerical sense, such a function is higher-dimensional if it goes beyond the classical 3+1 dimensions of space and time. If we need to represent the function in a computer, we need to discretize each variable (dimension). However, conventional approaches fully suffer the curse of dimensionality, a term going back to Bellman in the 60s [2]: the computational effort grows exponentially in the dimension. Spending only 10 discrete points to represent a single dimension, we need 10^d points in d dimensions. The treatment of higher-dimensional problems is thus computationally infeasible.
A. Sparse grids
Sparse grids [4], however, provide a way to mitigate this curse of dimensionality to a large extent. They reduce the number of degrees of freedom for N discretization points per dimension from O(N^d) to only O(N log(N)^(d-1)). At the same time, they maintain a similar accuracy as classical discretizations, at least for sufficiently smooth functions. Functions that are not smooth (that exhibit jumps, for example) can still be dealt with if adaptive refinement is employed.
To be a bit more precise, we are computing functions u(x) ≈ f(x) as a weighted sum of hierarchical basis functions,

u(x) = Σ_(l,i) α_(l,i) ϕ_(l,i)(x) .  (1)

To obtain such an approximation, we have to determine which basis functions ϕ_(l,i) to use. This is provided a priori by the sparse grid construction principle and can be adapted to the problem at hand a posteriori via adaptive refinement. Then we have to compute the vector of coefficients α. This can be very expensive. In data-driven settings, we would solve a penalized least squares problem. For the solution of (stochastic) partial differential equations, a finite element approach can be employed, and a typically large system of linear equations has to be solved.
Sparse grids are based on a hierarchical basis in 1D, for which we have a range of choices. The simplest one, leading to a piecewise linear discretization, is shown in Fig. 1, but more complex ones such as piecewise polynomial ones or even B-splines can also be used. Already for the simplest choice, there is a range of possibilities, for example with respect to the treatment of the boundary of the computational domain. The one-dimensional basis functions are then extended to the d-dimensional setting as products of their one-dimensional components. This results in a multi-level splitting into hierarchical grids. We then can identify those subspaces that contribute most to the overall solution for sufficiently smooth problems and omit all others. This mitigates the curse of dimensionality and results in a sparse grid, see Fig. 2.
Fig. 1: The 1D piecewise linear hierarchical basis up to discretization level 3. Left: no boundary treatment; middle: with explicit boundary discretization; right: modified extrapolating basis functions.
Fig. 2: Hierarchical grids in 2D, the sparse grid selection (black), and the resulting sparse grid.
B. Sparse grid algorithms
Typical algorithms working on sparse grids have to exploit the hierarchical structure to scale linearly in the number of unknowns. d-dimensional operations can be split into a loop over all dimensions, applying one-dimensional operations to all 1D substructures in the current dimension. Related considerations hold for the fast evaluation of u at an arbitrary point x: to obtain an algorithm with optimal complexity, only those basis functions have to be evaluated that are potentially non-zero. This can be achieved by a recursive descent in the hierarchical structure. In 1D, a descent in a binary tree is sufficient. In higher-dimensional settings, a multi-recursive descent in both dimensionality and hierarchical level is necessary, see Fig. 3. This leads to the evaluation of only O(log(N)^d) basis functions for a grid of size O(N log(N)^(d-1)).
Fig. 3: Sketch of the algorithmically optimal evaluation of a sparse grid function.
However, straightforward implementations of the algorithms do not match the properties of parallel hardware as used in HPC. Redesigning algorithms to exploit hardware on a low level and to gain efficiency is not sufficient. It additionally requires a co-(re)design of the corresponding data structures [5]. Together, this leads to custom-built compute kernels and violates the abstraction, modularity and flexibility that the numerical principle would offer and that software engineering demands.
The toolbox SG++ provides algorithms, data structures, refinement strategies, and much more to numerically solve higher-dimensional problems. It has been successfully applied to a range of problems from computational finance via data mining and computational steering to plasma physics and more, see [14], [13]. Development of SG++ started in 2005. In its history, SG++ has been redesigned twice to move from a single-purpose, single-PhD code to a software toolbox that has already been used by many researchers worldwide. In the following, we discuss a few general design principles and how they are reflected by SG++.
A central guideline and critical requirement has been to keep the training period as short as possible. Even undergraduate students should be able to learn SG++, to do research based on SG++ and to contribute to the software within the time-frame of a thesis or project of five to six months without previous training. Experience shows that this goal has been achieved.
SG++ follows the OOP paradigm and is written in C++. The choice of C/C++ is rather obvious from a CS&E-HPC perspective. C++ supports both OOP for abstractions and low-level programming to ensure computational efficiency. There is support for accelerator cards via C dialects such as OpenCL or NVIDIA’s CUDA. And C++ supports OpenMP and MPI for parallelization.
An important design principle has been the separation of concerns (SoC) on different levels. As the range of applications is rather diverse, functionality and algorithms have been subdivided into different modules on a high level of abstraction. A base module contains the core data structures, helper functions, standard algorithmic patterns and basic algorithms. Depending on the scope, further modules such as the pde module with algorithms for the solution of PDEs or the datadriven module with support for data mining are required. The modularization on the one hand reduces the amount of code that has to be considered for a certain use. On the other hand, new modules can easily be added, ensuring extensibility.
To further increase the usability and flexibility, wrappers
to Python and Java (and thus Matlab) are provided. This
provides a high-level API and facilitates rapid prototyping.
Some functionality as well as support for unit testing is
additionally offered in native Python. Such a multi-language
approach is increasingly adopted by the CS&E community, which is demonstrated by community codes such as
ESPResSo, waLBerla and FEniCS.
The SoC also applies to the encapsulation of data structures, basis functions, different types of grids and algorithms
working on them. This ensures a high degree of flexibility and
extensibility as the same algorithms can work on different grid
types and for different choices of basis functions. However,
and as indicated above, this is where computational efficiency
can be orthogonal to classical software quality attributes. To
employ vectorization, for example, it is not sufficient to just
replace the corresponding algorithm or to realize a specialized
subclass. Here, the data structures have to be adapted, too,
resulting in specialized combinations of data structures and
algorithms [6].
Furthermore, we have to reduce the overhead caused by inheritance and even by if-statements on a low level. Typical sparse grid algorithms result in the traversal of plenty of one-dimensional subgrids. Their traversal and evaluation depend on the choice of basis functions. In a piecewise linear setting with zero boundary conditions as sketched in Fig. 1, we should omit boundary basis functions (left basis). If we used a general grid traversal that can cope with different types of boundary treatment and that would cover all three cases in Fig. 1, the additional branching would result in a severe performance penalty. A custom implementation for each of the three basis types thus becomes performance-critical.
For current GPUs with many very wide vector units, where
branch divergence of only a single parallel statement will
stall a group of 32 threads, for example, this becomes even
more critical. Furthermore, some types of current accelerator
cards as used in HPC do not support OOP and recursive
programming at all, or only to a certain extent. Custom-tailored implementations are required, which break portability and which do not generalize to other hardware platforms.
Template programming provides a high-level programming
technique without additional overhead. Advantages of template-based programming are to avoid code duplication while
providing variants, to keep the code minimal, to provide
polymorphism at compile time and to facilitate extensibility.
However, despite recent improvements in modern compilers, templates lead to an obfuscation of the code, to error messages that are hard to interpret, and to training periods, especially for non-computer-science students, that quickly exceed what is possible for a student thesis. As a short training period is a critical requirement, we have decided to restrict the use of templates to a minimum and to rather accept the consequences of code duplication.
Fig. 4: Scheme illustrating the separation of user-visible functionality and high-performance implementations for an example operation.
The low-level programming results in code that is inherently
non-intuitive, not portable, difficult to extend, not modular and
hard to read, to name just a few disadvantages. In SG++ ,
we addressed these issues and isolated high-performance code
wherever possible with the aim to encapsulate it and to make
it transparent to the user.
An example for this strategy is the implementation of
operations. Generally, a user of SG++ creates a grid, chooses
which basis functions to use, formulates the problem and
applies corresponding operations on the grid. The operations
performed can be very complex and performance critical.
Complex low-level code is therefore required that is often
custom-tailored to the combination of choices (grid, basis
functions, type of operation, hardware). For a certain operation, a factory method selects an implementation, see Fig. 4.
It chooses specialized and optimized versions where available
and resorts to a generic, less efficient implementation if
necessary. As operations accept convenience data structures
that are then converted to high-performance data structures
within the operations, the user does not have to deal with the
low-level implementations. This is another SoC, but on a much
lower level.
To illustrate the variety of custom-tailored implementations,
Fig. 5 shows implemented variants of an operation, which
we will discuss in detail in Section V. Besides a generic
variant, optimized ones are depicted here for only three grid
types. All implementations are required to ensure performance.
This indicates the benefits of our modular implementation
approach: It permits the transparent addition of specialized
implementations for specific grid types or hardware platforms.
A factory method hides the details from the user.
In the following two sections, we first describe some core
optimizations that have to be employed in HPC, and we then
illustrate how this applies to SG++ and how they relate to
classical software quality attributes.
Fig. 5: Implemented variants of the “multiple evaluation”
operation for three grid types. Our modular implementation
approach permits the transparent addition of specialized implementations for specific grid types or hardware platforms. If the
MPI-enabled implementation is used, any node-level variant
can be selected on an MPI rank to do the actual computations.
Throughout the last decade, clock frequencies have stalled, and Moore’s Law has been sustained only by new degrees of parallelism on different levels. The most important ones are the number of cores and the width of the vector units. The number of cores has increased from a single core to up to 18 cores on a single CPU [12]. Vector units have existed in the x86 realm since the introduction of MMX in the 1990s. For scientific computing, they became relevant with the two-wide SSE instructions and even more important with the introduction of the Advanced Vector Extensions (AVX).
Despite the advances in mere compute power, and as predicted in the past, the performance of memory access was not able to keep pace [11]. To actually achieve a high performance on modern processors, this has to be reflected in the code. Slow access to memory has to be considered, and strategies have to be developed that use the available cache hierarchy efficiently. Data has to be accessed in blocks of the size of a cache line, for example, and data structures might have to be redesigned.
An old problem that is still unsolved is the cost of branching. While predictable branches are often free on current processors with advanced branch prediction logic, branches can also be inherently unpredictable. A reason can be, for example, that they depend on random input data. The treatment of branches in hot-spots of a code has become especially important with wider vector units, as branches often prohibit efficient vectorization. This is especially important for accelerator cards.
These aspects of modern hardware architectures have in common that the implementation issues that arise cannot be easily delegated to a compiler. Therefore, the developer has to deal with them manually. For parallelization, this involves frameworks like OpenMP and MPI and, if accelerator cards are used, CUDA and OpenCL. With the possible exception of OpenMP, these are low-level frameworks and require tedious manual implementations and optimizations.
In practice, vectorization can be even more problematic. As very fine-grained control over the processor instructions is necessary for a high vectorization benefit, this leads to a low-level approach involving intrinsics or even assembly code. First of all, the data structures and the algorithms have to be designed in a way that is vectorizable. The situation is similar for efficient memory access patterns and for branch optimizations. To improve the cache utilization and to reduce branches, special data structures can be required that obfuscate the code and that create unexpected dependencies, and a non-intuitive reformulation of the corresponding algorithm can become necessary.
The efficient use of all hardware resources will usually require low-level code and rather complicated formulations of the algorithms. This makes good maintainability very difficult. In the next section, we will present examples of optimizations in SG++ and discuss their performance advantages and their drawbacks with respect to software quality attributes.
In this section, we demonstrate performance optimizations using the example of a well-studied operation in SG++, the so-called multi-evaluation. It is the task of evaluating a sparse grid function u at multiple points x_i. This task is performance-critical in real-time scenarios such as computational steering [7] as well as in data mining applications, where the solution of a linear system boils down to a vast number of function evaluations, and evaluations of its modified (transposed) version at training data points [14].
With the definition of a sparse grid function u(x) in (1), the multi-evaluation operation can be formulated as a matrix-vector multiplication,

Bα = v ,  (B)_(i,j) = ϕ_j(x_i) .  (2)

Each row of the matrix multiplied by the vector of hierarchical coefficients (the surpluses) α corresponds to an evaluation of the corresponding function u at a single point x_i, thus v_i = u(x_i). As the individual evaluations are independent, the algorithm can be parallelized over the evaluation points, assigning different u(x_i) to different processors. The kernel is embarrassingly parallel with respect to the set of evaluation points.
In the following subsections, we exemplarily present optimizations that are required to exploit features of modern hardware and to obtain high computational efficiency. We show the performance advantages and discuss the drawbacks with respect to other software qualities.
double support = alpha[...];
// evaluate d-dimensional basis function
for (size_t d = 0; d < dims; d++) {
support *= max(1.0 - fabs((levels[...] *
data[...]) - indices[...]), 0.0);
result[i] += support;
A. Multi-evaluation with piecewise-linear basis functions
For a piecewise-linear basis, the basis functions are given
ϕl,i (x) :=
ϕlj ,ij (xj ) ,
Listing 2 shows an excerpt of the code for the evaluation
of d-dimensional basis functions implemented with AVX.
The blocking is only hinted; the overall code for just one
type of basis function is more than 200 lines long. For this
algorithm, the pointer-based tree-like data structure of the grid
has to be converted to a flat array data structure that enables
contiguous memory access to further improve performance.
The optimizations thus require a redesign of the algorithm,
special low-level programming constructs and customized data
structures. As this comes hand in hand with the loss of
portability, readability and flexibility, at least a second, less
optimized version is required that runs on non-Intel hardware.
with the hierarchical 1-dimensional hat functions on level lj
with index ij (see Fig. 1 left),
ϕl,i (x) := max(0, 1 − |2l x − i|) .
A grid is uniquely defined by a set of level-index pairs (l,i)
that are mapped to a column-index in the matrix.
To obtain an efficient implementation, the following optimizations are helpful. First, we need co-designed algorithms and data structures that can be mapped to the hardware. Even though this increases the operation count from
O(md · log(N )d ) to O(md · N log(N )d−1 ) in our implementation (compare Section II-B) for m data points, this already
pays off on commodity CPUs for most of the applications
we have studied so far. On vector computers and accelerator
cards, this becomes crucial. Second, the evaluations have to be
bundled in groups. This reduces the number of times the grid
has to be loaded from memory. As the evaluation has exactly
the same control and differs only in the data, a straightforward
vectorization is possible. Third, as this compute kernel was
written for Intel processors, the vectorized code is programmed
with intrinsics for AVX. Fourth, an additional optimization is
blocking within a single thread. This leads to the interleaved
evaluation of multiple data points by a single thread. It is
required to continously fill the processor’s pipelines with
independent instructions.
Taking just the first optimization into account, the multievaluation leads to a straightforward implementation of (1)
for all evaluation points, see Listing 1: For each data point we
iterate over all grid points and add up the contributions. The
evaluation of a basis function is the implementation of (3)
and (4), a loop over all dimensions. The code is a direct
implementation of the mathematical formulation, simple to
read, does not rely on any hardware properties and is easy
to maintain and compiles on all systems with a standard C++compiler. Furthermore, it is short and consists of only 10 lines
of code (including closing brackets).
Listing 2: Excerpt of the inner-most loop of the evaluation of
a sparse grid function with piecewise-linear basis functions.
The overall length of the whole multi-evaluation is more than
200 lines. This vectorized algorithm was implemented with
AVX intrinsics.
// transformation of data structures ...
// loop over data in chunk-increments ...
// evaluate, loop over all dimensions:
for (size_t d = 0; d < dims; d++) {
__m256d eval_0 = _mm256_load_pd(...);
__m256d eval_1 = _mm256_load_pd(...);
// ... same for eval_2, ..., eval_4
__m256d eval_5 = _mm256_load_pd(...);
// distribute level and index information
__m256d level = _mm256_broadcast_sd(...);
__m256d index = _mm256_broadcast_sd(...);
// evaluate a 1D basis function
eval_0 = _mm256_msub_pd(eval_0, level, index);
eval_0 = _mm256_and_pd(mask, eval_0);
eval_0 = _mm256_sub_pd(one, eval_0);
eval_0 = _mm256_max_pd(zero, eval_0);
res_0 = _mm256_mul_pd(res_0, eval_0);
// ... same block for eval_1, ..., eval_5
// back transformation of data structures
B. Multi-evaluation with piecewise-linear basis functions on
For an implementation on GPUs and other accelerator cards,
the framework OpenCL was chosen. An alternative would
have been NVIDIA’s CUDA. But as our aim is to maintain
as much portability as possible, the choice of OpenCL offers
vendor-portability of the code. This is a trade-off and a design
decision: CUDA typically performs better on NVIDIA GPUs
and has better support with respect to functionality and tools,
but it does not run on other hardware. OpenCL, in contrast,
Listing 1: Straightforward algorithm for evaluating a sparse
grid function with piecewise-linear basis functions at multiple
points in the domain. The algorithm consists of three nested
loops with a few arithmetic operations in the innermost loop.
for (size_t i = 0; i < data_points; i++) {
result[i] = 0.0;
for (size_t j = 0; j < grid_points; j++) {
evaluations are independent which mitigates pipeline stalls
due to instruction dependencies. Additionally, as all blocked
evaluations work on the same grid point at any point during
the calculation, the values for level, index and the surplus αl,i
have to be loaded only once. This is a (rare) case where code
duplication leads to a significantly improved performance.
Listing 3: Excerpt from blocked evaluation with a blocking
factor of 2 chosen for readability. Current GPUs benefit from
a blocking factor of up to 8. This results in an overall length
of 95 lines of code for the kernel.
// ...
double data_0[4], data_1[4];
for (size_t d = 0; d < dim; d++) {
// initialize with different data
data_0[d] = data[...];
data_1[d] = data[...];
for(size_t j = 0; j < grid_points; j++) {
// alpha, level and index are reused
double support_0 = alpha[j];
double support_1 = alpha[j];
for (size_t d = 0; d < dim; d += 1) {
double cur_level = level[...];
double cur_index = index[...];
// repeated 1d evaluations
support_0 *= ...;
support_1 *= ...;
result_0 += support_0;
result_1 += support_1;
// ...
Fig. 6: The execution of an OpenCL kernel on a single device.
The host has to perform additional work before the kernel can
be run.
runs on accelerators of different vendors, including Intel’s
MIC and AMD’s GPUs, and even on CPUs. Furthermore,
OpenCL code is compiled just-in-time during runtime, which
CUDA did not support until recently. This makes additional
advanced optimizations possible. We have optimized both for
OpenCL and CUDA and have observed only a negligible
difference in performance. Therefore, most of our specialized
implementations for accelerators are based on OpenCL.
Developing software for graphics processors brings many
additional challenges. For example, memory has to be managed
on both the host and the device. Consequently, the code splits
into a large amount of boilerplate and memory-management code
on the host side (about 1000 lines of code) and the compute
kernel (95 lines of code). For the following example, we will
focus on the compute kernel. The kernel execution and the
most important additional steps for the program to execute
the OpenCL kernel on a compute device are shown in Fig. 6.
Again, several optimizations are required. As graphics processors have only very limited caching capabilities, reusing
memory becomes even more important than on CPUs. Of
course, the optimizations that we discussed before for CPUs
and that are partially transparent in OpenCL are required, such
as parallelization, vectorization and pipelining. Additionally,
explicit blocking is necessary. This reduces the number of
memory transactions to the main memory by the number
of additional evaluations assigned to an OpenCL thread.
Furthermore, additional independent instructions increase the
utilization of the vector pipelines. Blocking can also be highly
beneficial for CPUs, but becomes crucial for GPUs due to
much smaller caches.
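To illustrate the blocking idea on the CPU side, the following sketch (using the same illustrative flat data layout as above, a hypothetical choice rather than the SG++ layout) evaluates two data points per sweep over the grid, so that level, index, and the surplus are loaded once per grid point:

```cpp
#include <cmath>
#include <cstddef>

// Blocked evaluation with blocking factor 2: both data points share each
// load of alpha, level, and index, halving the grid-data traffic, and the
// two independent update chains help fill the arithmetic pipelines.
void evaluate_blocked2(const double* alpha, const double* level,
                       const double* index, std::size_t grid_points,
                       const double* x0, const double* x1, std::size_t dim,
                       double* r0, double* r1) {
  *r0 = 0.0;
  *r1 = 0.0;
  for (std::size_t j = 0; j < grid_points; j++) {
    double support_0 = alpha[j];  // loaded once, used for both evaluations
    double support_1 = alpha[j];
    for (std::size_t d = 0; d < dim; d++) {
      double cur_level = level[j * dim + d];  // shared by both evaluations
      double cur_index = index[j * dim + d];
      support_0 *= std::fmax(1.0 - std::fabs(cur_level * x0[d] - cur_index), 0.0);
      support_1 *= std::fmax(1.0 - std::fabs(cur_level * x1[d] - cur_index), 0.0);
    }
    *r0 += support_0;
    *r1 += support_1;
  }
}
```

Generalizing this from 2 to a blocking factor of 8 repeats the same pattern eight times, which is what inflates the kernel to its 95 lines.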
Listing 3 shows a small part of the OpenCL kernel, configured
with a blocking factor of 2. For actual GPUs, a higher blocking
factor is required to maximize performance; for current NVIDIA
GPUs, a blocking factor of 8 turned out to be a good choice.
Specialized programming dialects for accelerator cards
impose, in contrast to commodity programming, performance
penalties on (non-vectorizable) high-level constructs such as
recursion, OOP, and other programming techniques that are
frequently employed in software engineering. Avoiding these
constructs to gain performance causes a significant loss in
software quality, in addition to the obvious obfuscation of the code.
This becomes even worse if a more elaborate type of basis
function is employed that is frequently used in sparse grid data
mining [14], the modified linear basis sketched in Fig. 1. The
core difference to the standard linear basis is that the shape of
the basis function depends on the level and the index within the
level. Thus, its evaluation requires a series of case distinctions.
This makes an efficient vectorization very difficult and leads
to severe performance penalties on GPUs.
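For concreteness, a direct, branchy formulation of this 1D modified linear basis might look as follows. This is a sketch following the definition in [14]; the function name and the convention that 'levelval' holds the precomputed value 2^l are assumptions for illustration.

```cpp
#include <cmath>

// Modified linear basis in 1D: constant on level 1, extrapolating ramps at
// the two boundary grid points, standard hat function in the interior.
// The case distinctions depend on level and index, which is what makes
// vectorization of this basis difficult.
double eval_modlinear(int l, double levelval, double i, double x) {
  if (l == 1) {
    return 1.0;                       // single basis function, constant 1
  }
  if (i == 1.0) {                     // leftmost grid point on this level
    return std::fmax(2.0 - levelval * x, 0.0);
  }
  if (i == levelval - 1.0) {          // rightmost grid point on this level
    return std::fmax(levelval * x - i + 1.0, 0.0);
  }
  // interior grid point: standard hat function
  return std::fmax(1.0 - std::fabs(levelval * x - i), 0.0);
}
```

Each evaluation takes a different branch depending on the grid point, which is exactly the pattern that stalls lock-step SIMD lanes and GPU warps.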
For SG++, we have developed an optimization, implemented
in OpenCL, that superimposes the different branches to achieve
high performance [10]. To execute a specific branch, mask
values are supplied that enable exactly the arithmetic operations
this branch requires. With this scheme, a branch-free algorithm
can be created. This is illustrated by Listing 4, which shows
the branch-free evaluation of a single basis function. The
listing clearly demonstrates that no if-statements are left and
that the resulting high-performance code is non-intuitive. In
particular, it is not trivial to map the body of the loop to
its mathematical formulation (4).
for (int d = 0; d < dim; d++) {
  eval = level[d] * data[d];
  index_calc = eval - index[d];
  abs = as_double(as_ulong(index_calc) | mask[d]);
  temp = offset[d] + abs;
  local_support = fmax(temp, 0.0);
  result *= local_support;
}
Listing 4: Evaluation of a modified linear basis function with
masking to avoid branches; blocking is not shown here to keep
it simple. The 1D basis function evaluation does not require
any if-statements. Mask values are computed on the host in
advance; the level vector contains the precomputed values 2^l_j.
C. Performance evaluation
If such optimizations are required, an object-oriented SoC is
not possible anymore. Abstraction is virtually impossible, too,
as it leads to new switch statements and thus a loss in
computational efficiency.
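The bit trick at the core of Listing 4 can be reproduced in portable C++. In the sketch below, 'as_ulong' and 'as_double' mimic OpenCL's reinterpreting casts, and the mask/offset pair shown is the one for the standard hat function; the modified basis uses other host-precomputed pairs (the function name is a hypothetical stand-in).

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a double's bits as an integer and back (OpenCL: as_ulong/as_double).
static std::uint64_t as_ulong(double x) {
  std::uint64_t u;
  std::memcpy(&u, &x, sizeof(u));
  return u;
}
static double as_double(std::uint64_t u) {
  double x;
  std::memcpy(&x, &u, sizeof(x));
  return x;
}

// Branch-free hat function max(1 - |2^l x - i|, 0): ORing the sign bit into
// t yields -|t|, so offset + (-|t|) = 1 - |t| without any if-statement.
double eval1d_masked(double levelval, double indexval, double x) {
  const std::uint64_t mask = 0x8000000000000000ull;  // sign bit only
  const double offset = 1.0;
  double index_calc = levelval * x - indexval;
  double neg_abs = as_double(as_ulong(index_calc) | mask);  // == -fabs(t)
  return std::fmax(offset + neg_abs, 0.0);
}
```

Supplying different masks per dimension lets one kernel body cover all branches of the basis function, at the price of code whose intent is no longer visible in the source.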
Fig. 7: Runtime of the scenarios. Left: AVX on an Intel
CPU. Middle: multi-evaluation with linear basis and blocking
("OCL blocked"). Right: transposed multi-evaluation with
modified linear basis and masking ("OCL mask"). OpenCL (OCL)
results have been obtained on a K20X GPU. For each scenario
described in Sec. V, the runtime is significantly reduced.
In the following, we give a brief indication that the optimizations
discussed above are not a hacker's delight, but a necessity: each
deterioration in terms of software quality is rewarded with a
significant performance boost.
Figure 7 shows the performance gains for a spatially
adaptive sparse grid for the three optimizations discussed
above for a real-world dataset from astrophysics; see [9] for
details. The timings were obtained on an Intel Xeon
E5-2650 v2 equipped with a K20X GPU. The AVX-accelerated
algorithm is more than two times faster, despite using an
algorithm with a worse complexity. The advantages of the
optimizations for accelerator cards are similarly convincing.
All three optimizations are significant enough to justify trading
software quality for low-level implementations. In practice,
the accelerator implementation would combine them and use
several additional optimizations, which leads to even larger
performance gains.
Taking this to the extreme we have been able, for example,
to reduce a large-scale data mining problem resulting in
billions of function evaluations from more than 300,000 s in a
sequential, CPU-only version, to 350 s in a multi-core system
with multiple GPUs [9].
We have shown different optimizations for a single, simple
operation. The required optimizations, however, are complicated
and demand sophisticated tuning of the code to the hardware at
hand. All optimizations have a huge benefit and are crucial to
achieve efficient high-performance code. All of them, however,
violate quality attributes such as maintainability and portability
and therefore contradict software quality attributes that are
central to software development in non-CS&E fields. While it is
difficult to quantify the maintainability and portability of a
code, the portability of an implementation is clearly lost if
the optimizations run only on certain hardware, so that a
generic, low-performance version has to be provided additionally.
For our optimizations, the loss of maintainability is obvious:
Just considering the lines of code, a straightforward
implementation with 10 lines grows to more than 200 lines for
AVX intrinsics and to an OpenCL kernel of 95 lines plus 1000
more on the host side (including the precomputations).
Furthermore, it is easy to derive the underlying mathematics by
looking at the straightforward implementation. However, this
becomes a non-trivial task for the AVX-vectorized version and
requires a significant re-engineering effort for the masked
OpenCL version, where precomputations on the host obscure the
compute kernel.
The use of low-level programming based on intrinsics or even
assembly, implementations using blocking, or bit tricks to
mitigate the performance penalty of branching all lead to code
that is difficult to maintain. Keeping this in mind, we have
learned the following lessons.
Ensure usability on a high level. To ensure the usability of
performance-optimized codes, a high-level API has to be defined
that hides all tuned parts. We encourage high-level abstractions,
the flexible use of OOP, and software patterns on the user level.
Invisible to the user, this can be violated for
performance-critical parts. However, the selection of efficient
and optimized compute kernels has to be transparent to the user.
Listing 5 and Fig. 4 show the high-level user view that hides
the low-level optimizations. Generic default implementations
should be provided for all settings without an optimized version.
Furthermore, flexible convenience data structures on a high
level are necessary, even though the compute kernels might need
and use their own custom data structures, and data may have to
be explicitly converted between both worlds. Our own experience
shows that banning the use of certain programming constructs on
the user level has been a constant source of controversial
discussions with researchers and developers coming from an HPC
background; but the decision has proven to be very helpful in
the long term. As SG++ supports its use from multiple languages
(C++, Python, Java, Matlab), the API additionally defines which
functionality has to be wrapped and exposed from C++ to other
languages.
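The "generic default plus optimized kernels behind one factory" pattern can be sketched as follows. All names here are hypothetical stand-ins, modeled loosely after Listing 5 rather than taken from the actual SG++ API:

```cpp
#include <cstring>
#include <memory>

// Common interface: the user only ever programs against this abstraction.
struct MultiEval {
  virtual ~MultiEval() = default;
  virtual const char* kernelName() const = 0;
};

// Generic fallback that works for every configuration.
struct DefaultMultiEval : MultiEval {
  const char* kernelName() const override { return "default"; }
};

// Hand-tuned variant, only valid for one specific setting.
struct BlockedOclMultiEval : MultiEval {
  const char* kernelName() const override { return "ocl_blocked"; }
};

// Factory: selects an optimized kernel if one exists for the given
// configuration, otherwise falls back to the generic implementation.
// The user never sees which concrete kernel was chosen.
std::unique_ptr<MultiEval> createMultiEval(bool linearBasis, bool hasGpu) {
  if (linearBasis && hasGpu) {
    return std::make_unique<BlockedOclMultiEval>();
  }
  return std::make_unique<DefaultMultiEval>();
}
```

The key design property is that adding a new tuned kernel only extends the factory; user code and the generic fallback remain untouched.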
"Duplicates" can be helpful. While duplicate code is generally
considered bad style and a potential source of errors, we have
experienced that a SoC on the performance level can require the
duplication of code with specialized adaptations. This occurs,
for example, when providing different hand-tuned compute kernels
for different types of hardware. Often, the differences between
implementations are only a few crucial lines of code. Their
fusion, however, for example based on defines, significantly
reduces their readability and increases the training time on the
code. In extreme settings, where the use of specialized hardware
requires a completely different set of algorithms and data
structures, a duplication of the corresponding parts of the code
can become necessary. We have experienced that a modular
structure on a high(er) level is beneficial in this regard. It
allows one to use different testing approaches and compilers,
and to switch parts of the code on and off as they are required
or become obsolete. Particularly good documentation and thorough
testing are crucial in the context of duplications.
Ensure maintainability, portability, and flexibility. For
high-performance parts with low-level implementations,
maintainability deteriorates. In ongoing work, we have already
achieved excellent results with automatic code generation and
with tuning optimization parameters to the hardware at hand. In
this regard, the use of domain-specific languages to combine a
high level of abstraction with flexible low-level optimizations
has become increasingly popular in the HPC community. Both
approaches go beyond code duplication and can help to provide
maintainability and portability to some extent.
We have reported on software challenges with respect to the
trade-off between the computational efficiency of HPC codes and
software quality attributes such as maintainability and
portability. Using the software framework SG++ as an example, we
have demonstrated that both worlds cannot simply be combined. A
clear separation of concerns is required on several levels, as
well as decisions on where to place the trade-off(s). We are
confident that the lessons learned and the design principles of
SG++ shown here can generalize to other software projects in
science.
Financial support from the Deutsche Forschungsgemeinschaft
(EXC 310, SimTech) and the Landesstiftung Baden-Württemberg
(JP-Programm Auto-Tuning) is gratefully acknowledged.
Listing 5: User example for the multi-evaluation operation.
// create grid for piecewise linear basis
Grid* grid = Grid::createLinearGrid(dim);
// ... create/refine grid ...
// create coefficient and result vector
DataVector alpha(grid->getStorage()->getSize());
DataVector result(dataset->getSize());
// ... compute alpha ...
// factory method for multi-eval selects
// optimized kernel for given grid and basis
opEval = createOperationMultipleEval(grid, dataset);
opEval->eval(alpha, result);
References
[1] E. Arge, A. M. Bruaset, and H. P. Langtangen. Object-oriented numerics. In Numerical Methods and Software Tools in Industrial Mathematics, pages 7–26. Springer, 1997.
[2] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961.
[3] M. Buchholz. Framework zur Parallelisierung von Molekulardynamiksimulationen in verfahrenstechnischen Anwendungen. Verlag Dr. Hut.
[4] H.-J. Bungartz and M. Griebel. Sparse Grids. Acta Numerica, 13:147–269, 2004.
[5] G. Buse. Exploiting Many-Core Architectures for Dimensionally Adaptive Sparse Grids. Verlag Dr. Hut, 2015.
[6] G. Buse, D. Pflüger, A. Murarasu, and R. Jacob. A non-static data layout enhancing parallelism and vectorization in sparse grid algorithms. In Proc. of Parallel and Distributed Computing (ISPDC) 2012, pages 195–202, June 2012.
[7] D. Butnaru, D. Pflüger, and H.-J. Bungartz. Towards high-dimensional computational steering of precomputed simulation data using sparse grids. Procedia Computer Science, 4:56–65, 2011.
[8] A. Chatzigeorgiou and G. Stephanides. Evaluating performance and power of object-oriented vs. procedural programming in embedded processors. In Proc. of the 7th Ada-Europe Int. Conf. on Reliable Software Technologies, Ada-Europe '02, pages 65–75, London, UK, 2002. Springer-Verlag.
[9] A. Heinecke and D. Pflüger. Multi- and many-core data mining with adaptive sparse grids. In Proc. of the 8th ACM Int. Conf. on Computing Frontiers, CF '11, pages 29:1–29:10, New York, USA, 2011. ACM.
[10] A. Heinecke, D. Pflüger, et al. Demonstrating performance portability of a custom OpenCL data mining application to the Intel(R) Xeon Phi(TM) coprocessor. In Int. Workshop on OpenCL Proceedings 2013, Georgia Tech, May 2013.
[11] J. L. Hennessy and D. A. Patterson. Computer Architecture, Fourth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006.
[12] Intel Corporation. Intel Product Brief: Intel Xeon Processor E5-2600 v3 Product Family, 2014.
[13] B. Peherstorfer, C. Kowitz, D. Pflüger, and H.-J. Bungartz. Selected recent applications of sparse grids. Numerical Mathematics: Theory, Methods and Applications, 8(01):47–77, 2015.
[14] D. Pflüger. Spatially Adaptive Sparse Grids for High-Dimensional Problems. Verlag Dr. Hut, 2010.
[15] President's Information Technology Advisory Committee. Computational science: Ensuring America's competitiveness. Report to the president, Executive Office of the President of the United States, 2005.
[16] S. Wagner, D. Pflüger, and M. Mehl. Simulation software engineering: Experiences and challenges. In Proceedings of the 3rd International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering, SE-HPCCSE '15, pages 1–4, New York, NY, USA, 2015. ACM.
[17] Wissenschaftsrat. Bedeutung und Weiterentwicklung von Simulation in der Wissenschaft. Position paper, Wissenschaftsrat, 2014.