2016 Fourth International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering

Computational Efficiency vs. Maintainability and Portability: Experiences with the Sparse Grid Code SG++

Dirk Pflüger, Institute for Parallel and Distributed Systems, University of Stuttgart, Germany. Email: Dirk.Pfl[email protected]
David Pfander, Institute for Parallel and Distributed Systems, University of Stuttgart, Germany. Email: [email protected]

Abstract—Software in computational science and engineering often lacks good software engineering practices. One of several reasons is that high computational performance and efficiency do not coincide well with software quality attributes such as flexibility, extensibility, usability, modularity or maintainability. Where low-level programming is required to exploit parallel hardware to the best extent, common programming paradigms such as object-oriented programming cannot be applied any more. To achieve as much software quality as possible, the trade-off between computational efficiency on the one hand and maintainability and portability on the other has to be chosen carefully. Using the example of SG++, a numerics framework for the efficient solution of higher-dimensional problems in science and engineering with sparse grids, we demonstrate some of the optimizations that have to be applied, and we discuss the performance advantages and software quality disadvantages they cause. We explain the criteria for the design decisions in SG++ and report on the lessons learned.

978-1-5090-5224-0/16 $31.00 © 2016 IEEE. DOI 10.1109/SE-HPCCSE.2016.7

I. INTRODUCTION

Software has become a critically important part of research and development. Especially in computational science and engineering (CS&E), computer experiments (simulations) have established themselves as the "third pillar of science" right next to theory and experiment [15]. The software for cutting-edge simulations and research has grown over the last decades, leading to large software packages with plenty of dependencies. They go far beyond classical research codes that can be implemented, developed, used and maintained by a single researcher. Ensuring good software quality is crucial in academia if generations of PhD students need to work on the same piece of software. Flexibility, extensibility, usability, modularity and maintainability are some of the software quality attributes that have to be ensured. However, the state of the art in academic software development frequently exhibits just the opposite: there is often no well-defined software development process. The reasons are manifold, be it a lack of software engineering training in non-CS disciplines, or that the scope and outcomes of new research are entirely unknown in the beginning and that many codes have been thought to be just for a single PhD project [16]. Based on such observations, the German Science Council (Wissenschaftsrat) has recently demanded to put more emphasis on software as a value of its own, and that funding agencies should offer support to develop and maintain public research codes [17].

The trade-off between computational efficiency and scalability on multi- and many-core systems on the one hand, which is the holy grail in high-performance computing (HPC), and maintainability and portability on the other hand, which are classical software attributes, has been identified as one of the core problems of scientific software development in academia [16]. To report from our own experiences in molecular dynamics [3]: textbook software engineering can lead to design patterns and the use of object-oriented programming (OOP), providing excellent flexibility. However, for performance-critical code, data locality has to be exploited, dynamic data structures have to be replaced by arrays, and memory transfer can become a severe bottleneck. As a consequence, in HPC scenarios the desired software abstractions and design patterns cannot be used on all levels and have to be avoided, even though this of course leads to a significant loss of generality and flexibility. The advent of multi- and many-core hardware has increased this problem. Code that runs efficiently on one platform may run on its successor only with severe performance penalties, or not at all. Cache-lines have to be exploited, vector registers have to be used, and branching has to be avoided. Accelerator cards such as GPUs prohibit or penalize programming models and approaches that are frequently used in software engineering, such as certain software patterns or OOP [8]. This makes it a hard task to write high-performance code that can be used flexibly for a whole range of applications and research tasks.

As this is a general problem, new concepts such as object-oriented numerics (OON) have been developed [1]. The main idea is to implement compute-intensive parts of the code using imperative programming and to use OOP for high-level functionality that is less crucial for high efficiency. Thus, performance-critical parts are wrapped in an API that ensures the classical software quality attributes mentioned above, at least to some extent. A critical design decision is where to place the cut-off between computational efficiency and usability, flexibility and extensibility. Similar aspects have been discussed for the development of large-scale numerical libraries such as FEniCS or DUNE, but their efforts are not simple to generalize to software packages with a less specialized application scope.

Challenges for SG++.
For the development of the software framework SG++ in an academic and interdisciplinary environment without any funding for software engineering, we have identified the following challenges within the trade-off:
1) Computational efficiency. For large-scale simulation scenarios, at least the performance-critical parts have to come close to the best possible performance on different types of hardware, even heterogeneous hardware.
2) Portability. The software should run on different types of hardware. Ideally, this should cover the whole HPC zoo, including clusters, accelerator cards and more.
3) Maintainability. The code should be easy to read and to maintain, well beyond a single generation of PhD students. For example, there should not be much code replication.
4) Usability of code. Non-HPC users have to be able to do scientific research with the software without having to understand it in detail. For researchers, there should be a short "time-to-science". This demands a short learning curve so that students can work on and research with SG++ within the time-frame of a thesis.
In this paper, we demonstrate, using the example of SG++, how a good trade-off between a high degree of computational efficiency and software quality attributes such as usability, flexibility and extensibility can be achieved, and which restrictions and design principles we had to ensure. SG++ is an open-source software framework to solve higher-dimensional numerical problems with so-called sparse grids; it can be applied in a whole range of applications, from data mining to the solution of partial differential equations (PDEs). In Section II, we briefly introduce the underlying numerics and the scope to which it can be applied. This provides the requirements to sketch relevant parts of the software design in Section III. Section IV describes restrictions imposed by HPC and the efficient use of available hardware resources. As an example, we show optimizations for a typical algorithm in Section V.
In Section VI, we finally summarize some of the lessons learned.

II. HIGHER-DIMENSIONAL PROBLEMS AND SPARSE GRIDS

Higher-dimensional problems arise in many applications. In the following, we sketch the problem formulation and the numerical algorithms in an intuitive way, without the usual formalism. For more detailed information about sparse grids, we refer, e.g., to [4], [14].

The core task is to approximate or represent a functional dependency f(x_1, x_2, ..., x_d) in d variables in a numerical way. It arises in plenty of applications. The function f might not be known. In a data mining application, we might want to predict tomorrow's temperature based on d measurements obtained today, and all we have as information about the underlying dependency is data obtained in the past. For car crash test simulations, f could be a computationally expensive software that computes the intrusion of the front wall based on shape parameters. In computational finance, it could be the fair value of an option based on stock values. In CO2 sequestration, it can be the time until leakage based on parameters such as underground permeability, pressure, and porosity. In all those settings, we would like to obtain an approximation u ≈ f that we can evaluate or analyze.

In a numerical sense, such a function is higher-dimensional if it goes beyond the classical 3+1 dimensions of space and time. To represent the function in a computer, we need to discretize each variable (dimension). However, conventional approaches fully suffer the curse of dimensionality, a term going back to Bellman in the 1960s [2]: the computational effort grows exponentially in the dimension. Spending only 10 discrete points to represent a single dimension, we need 10^d points in d dimensions. The treatment of higher-dimensional problems is thus computationally infeasible.

A. Sparse grids

Sparse grids [4], however, provide a way to mitigate this curse of dimensionality to a large extent. They reduce the number of degrees of freedom for N discretization points per dimension from O(N^d) to only O(N log(N)^(d-1)). At the same time, they maintain a similar accuracy as classical discretizations, at least for sufficiently smooth functions. Functions that are not smooth (that exhibit jumps, for example) can still be dealt with if adaptive refinement is employed. A bit more precisely, we compute functions u(x) ≈ f(x) as a weighted sum of hierarchical basis functions,

    u(x) = Σ_{(l,i)} α_{l,i} ϕ_{l,i}(x) .    (1)

To obtain such an approximation, we have to determine which basis functions ϕ_{l,i} to use. This is provided a priori by the sparse grid construction principle and can be adapted to the problem at hand a posteriori via adaptive refinement. Then we have to compute the vector of coefficients α. This can be very expensive. In data-driven settings, we would solve a penalized least squares problem. For the solution of (stochastic) partial differential equations, a finite element approach can be employed, and a typically large system of linear equations has to be solved.

Sparse grids are based on a hierarchical basis in 1D, for which we have a range of choices. The simplest one, leading to a piecewise linear discretization, is shown in Fig. 1, but more complex ones such as piecewise polynomials or even B-splines can also be used. Already for the simplest choice, there is a range of possibilities, for example with respect to the treatment of the boundary of the computational domain. The one-dimensional basis functions are then extended to the d-dimensional setting as products of their one-dimensional components. This results in a multi-level splitting into hierarchical grids. We can then identify those subspaces that contribute most to the overall solution for sufficiently smooth problems and omit all others. This mitigates the curse of dimensionality and results in a sparse grid, see Fig. 2.

Fig. 1: The 1D piecewise linear hierarchical basis up to discretization level 3. Left: no boundary treatment; middle: with explicit boundary discretization; right: modified extrapolating basis functions.

Fig. 2: Hierarchical grids in 2D, the sparse grid selection (black), and the resulting sparse grid.

B. Sparse grid algorithms

Typical algorithms working on sparse grids have to exploit the hierarchical structure to scale linearly in the number of unknowns. d-dimensional operations can be split into a loop over all dimensions, applying one-dimensional operations to all 1D substructures in the current dimension. Related considerations hold for the fast evaluation of u at an arbitrary point x: to obtain an algorithm with optimal complexity, only those basis functions have to be evaluated that are potentially non-zero. This can be achieved by a recursive descent in the hierarchical structure. In 1D, a descent in a binary tree is sufficient. In higher-dimensional settings, a multi-recursive descent in both dimensionality and hierarchical level is necessary, see Fig. 3. This leads to the evaluation of only O(log(N)^d) basis functions for a grid of size O(N log(N)^(d-1)).

Fig. 3: Sketch of the algorithmically optimal evaluation of a sparse grid function.

However, straightforward implementations of the algorithms do not match the properties of parallel hardware as used in HPC. Redesigning algorithms to exploit hardware on a low level and to gain efficiency is not sufficient. It additionally requires a co-(re)design of the corresponding data structures [5]. Together, this leads to custom-built compute kernels and violates the abstraction, modularity and flexibility that the numerical principle would offer and that software engineering demands.

III. DESIGN OF SG++

The toolbox SG++ provides algorithms, data structures, refinement strategies, and much more to numerically solve higher-dimensional problems. It has been successfully applied to a range of problems from computational finance via data mining and computational steering to plasma physics and more, see [14], [13]. Development of SG++ started in 2005. In its history, SG++ has been redesigned twice to move from a single-purpose single-PhD code to a software toolbox that has already been used by many researchers worldwide. In the following, we discuss a few general design principles and how they are reflected in SG++.

A central guideline and critical requirement has been to keep the training period as short as possible. Even undergraduate students should be able to learn SG++, to do research based on SG++ and to contribute to the software within the time-frame of a thesis or project of five to six months without previous training. Experience shows that this goal has in general been achieved.

SG++ follows the OOP paradigm and is written in C++. The choice of C/C++ is rather obvious from a CS&E-HPC perspective. C++ supports both OOP for abstractions and low-level programming to ensure computational efficiency. There is support for accelerator cards via C dialects such as OpenCL or NVIDIA's CUDA. And C++ supports OpenMP and MPI for parallelization.

An important design principle has been the separation of concerns (SoC) on different levels. As the range of applications is rather diverse, functionality and algorithms have been subdivided into different modules on a high level of abstraction. A base module contains the core data structures, helper functions, standard algorithmic patterns and basic algorithms.
Depending on the scope, further modules are required, such as the pde module with algorithms for the solution of PDEs or the datadriven module with support for data mining. The modularization on the one hand reduces the amount of code that has to be considered for a certain use; on the other hand, new modules can easily be added, ensuring extensibility. To further increase the usability and flexibility, wrappers to Python and Java (and thus Matlab) are provided. This provides a high-level API and facilitates rapid prototyping. Some functionality as well as support for unit testing is additionally offered in native Python. Such a multi-language approach is increasingly adopted by the CS&E community, as demonstrated by community codes such as ESPResSo, waLBerla and FEniCS.

The SoC also applies to the encapsulation of data structures, basis functions, different types of grids and the algorithms working on them. This ensures a high degree of flexibility and extensibility, as the same algorithms can work on different grid types and for different choices of basis functions. However, and as indicated above, this is where computational efficiency can be orthogonal to classical software quality attributes. To employ vectorization, for example, it is not sufficient to just replace the corresponding algorithm or to realize a specialized subclass. Here, the data structures have to be adapted, too, resulting in specialized combinations of data structures and algorithms [6]. Furthermore, we have to reduce the overhead caused by inheritance and even by if-statements on a low level. Typical sparse grid algorithms result in the traversal of plenty of one-dimensional subgrids. Their traversal and evaluation depend on the choice of basis functions. In a piecewise linear setting with zero boundary conditions as sketched in Fig. 1, we should omit boundary basis functions (left basis).
If we used a general grid traversal that can cope with different types of boundary treatment and that would cover all three cases in Fig. 1, the additional branching would result in a severe performance penalty. A custom implementation for each of the three basis types thus becomes performance-critical. For current GPUs with many very wide vector units, where branch divergence of only a single parallel statement will stall a group of 32 threads, for example, this becomes even more critical. Furthermore, some types of current accelerator cards as used in HPC do not support OOP and recursive programming at all, or only to a certain extent. Custom-tailored implementations are required which break portability and which do not generalize to other hardware platforms.

Template programming provides a high-level programming technique without additional overhead. Advantages of template-based programming are that it avoids code duplication while providing variants, keeps the code minimal, provides polymorphism at compile time and facilitates extensibility. However, despite recent improvements in modern compilers, templates lead to an obfuscation of the code, to error messages that are hard to interpret, and to training periods, especially for non-computer-science students, that quickly exceed what is possible for a student thesis. As a short training period is a critical requirement, we have decided to restrict the use of templates to a minimum and to rather accept the consequences of code duplication.

The low-level programming results in code that is inherently non-intuitive, not portable, difficult to extend, not modular and hard to read, to name just a few disadvantages. In SG++, we addressed these issues and isolated high-performance code wherever possible, with the aim to encapsulate it and to make it transparent to the user.

Fig. 4: Scheme illustrating the separation of user-visible functionality and high-performance implementations for an example operation.
An example for this strategy is the implementation of operations. Generally, a user of SG++ creates a grid, chooses which basis functions to use, formulates the problem and applies corresponding operations on the grid. The operations performed can be very complex and performance critical. Complex low-level code is therefore required that is often custom-tailored to the combination of choices (grid, basis functions, type of operation, hardware). For a certain operation, a factory method selects an implementation, see Fig. 4. It chooses specialized and optimized versions where available and resorts to a generic, less efficient implementation if necessary. As operations accept convencience data structures that are then converted to high-performance data structures within the operations, the user does not have to deal with the low-level implementations. This is another SoC, but on a much lower level. To illustrate the variety of custom-tailored implementations, Fig. 5 shows implemented variants of an operation, which we will discuss in detail in Section V. Besides a generic variant, optimized ones are depicted here for only three grid types. All implementations are required to ensure performance. This indicates the benefits of our modular implementation approach: It permits the transparent addition of specialized implementations for specific grid types or hardware platforms. A factory method hides the details from the user. In the following two sections, we first describe some core optimizations that have to be employed in HPC, and we then illustrate how this applies to SG++ and how they relate to classical software quality attributes. 20 13 Fig. 5: Implemented variants of the “multiple evaluation” operation for three grid types. Our modular implementation approach permits the transparent addition of specialized implementations for specific grid types or hardware platforms. 
If the MPI-enabled implementation is used, any node-level variant can be selected on an MPI rank to do the actual computations. example, that they depend on random input data. The treatment of branches in hot-spots of a code has become especially important with wider vector units, as branches often prohibit efficient vectorization. This is especially important for accelerator cards. These aspects of modern hardware architectures have in common that the implementation issues that arise cannot be easily delegated to a compiler. Therefore, the developer has to deal with them manually. For parallelization, this involves frameworks like OpenMP and MPI, and, if accelerator cards are used, CUDA and OpenCL. With the possible exception of OpenMP, these are low-level frameworks and require tedious manual implementations and optimizations. In practice, vectorization can be even more problematic. As very fine-grained control over the processor instructions is necessary for a high vectorization benefit, this leads to a low-level approach involving intrinsics or even assembly code. First of all, the data structures and the algorithms have to be designed in a way that is vectorizable. The situation is similar for efficient memory access patterns and for branch optimizations. To improve the cache utilization and to reduce branches, special data structures can be required that obfuscate the code and that create unexpected dependencies, and a nonintuitive reformulation of the corresponding algorithm can become necessary. The efficient use of all hardware resources will usually require low-level code and rather complicated formulations of the algorithms. This makes good maintainability very difficult. In the next section, we will present examples of optimizations in SG++ and discuss their performance advantages and their drawbacks with respect to software quality attributes. IV. O PTIMIZATIONS FOR HPC V. P ERFORMANCE O PTIMIZATION IN SG++ " ! !# "!$! ! 
In this section, we demonstrate performance optimizations at the example of a well-studied operation in SG++ , the socalled multi-evaluation. It is the task of evaluating a sparse grid function u at multiple points xi . This task is performancecritical in real-time scenarios such as computational steering [7] as well as in data mining applications, where the solution of a linear system boils down to a vast number of function evaluations and its modified (transposed) version at training data points [14]. With the definition of a sparse grid function u(x) in (1), the multi-evaluation operation can be formulated as a matrixvector multiplication, Throughout the last decade, clock frequency has stalled and Moore’s Law has been maintained only by new degrees of parallelism on different levels. The most important ones are the number of cores and the width of the vector units. The number of cores increased from a single core to up to 18 cores on a single CPU [12]. Vector units exist in the x86 realm since the introduction of MMX in the 1990s. For scientific computing, they became relevant with the two-wide SSE instructions and even more important with the introduction of the Advanced Vector Extensions (AVX). Despite the advances in mere compute power, and as predicted in the past, the performance of memory access was not able to keep pace [11]. To actually achieve a high performance on modern processors, this has to be reflected in the code. Slow access to memory has to be considered and strategies have to be developed that use the available cache hierarchy efficiently. Data has to be accessed in blocks of the size of a cache-line, for example, and data structures might have to be redesigned. An old problem that is still unsolved is the cost of branching. While predictable branches are often free on current processors with advanced branch prediction logic, branches can also be inherently unpredictable. A reason can be, for B α = v , (B)i,j = ϕj (xi ) . 
(2) Each row of the matrix multiplied by the vector of hierarchical coefficients (the surplusses) α corresponds to an evaluation of the corresponding function u at a single point xi , thus vi = u(xi ). As the individual evaluations are independent, the algorithm can be parallelized over the evaluation points, assigning different u(xi ) to different processors. The kernel is embarrassingly parallel with respect to the set of evaluation points. 21 14 In the following subsections, we exemplarily present optimizations that are required to exploit features of modern hardware and to obtain high computational efficiency. We show the performance advantages and discuss the drawbacks with respect to other software qualities. double support = alpha[...]; // evaluate d-dimensional basis function for (size_t d = 0; d < dims; d++) { support *= max(1.0 - fabs((levels[...] * data[...]) - indices[...]), 0.0); } result[i] += support; A. Multi-evaluation with piecewise-linear basis functions } } For a piecewise-linear basis, the basis functions are given by ϕl,i (x) := d ϕlj ,ij (xj ) , Listing 2 shows an excerpt of the code for the evaluation of d-dimensional basis functions implemented with AVX. The blocking is only hinted; the overall code for just one type of basis function is more than 200 lines long. For this algorithm, the pointer-based tree-like data structure of the grid has to be converted to a flat array data structure that enables contiguous memory access to further improve performance. The optimizations thus require a redesign of the algorithm, special low-level programming constructs and customized data structures. As this comes hand in hand with the loss of portability, readability and flexibility, at least a second, less optimized version is required that runs on non-Intel hardware. (3) j=1 with the hierarchical 1-dimensional hat functions on level lj with index ij (see Fig. 1 left), ϕl,i (x) := max(0, 1 − |2l x − i|) . 
(4) A grid is uniquely defined by a set of level-index pairs (l,i) that are mapped to a column-index in the matrix. To obtain an efficient implementation, the following optimizations are helpful. First, we need co-designed algorithms and data structures that can be mapped to the hardware. Even though this increases the operation count from O(md · log(N )d ) to O(md · N log(N )d−1 ) in our implementation (compare Section II-B) for m data points, this already pays off on commodity CPUs for most of the applications we have studied so far. On vector computers and accelerator cards, this becomes crucial. Second, the evaluations have to be bundled in groups. This reduces the number of times the grid has to be loaded from memory. As the evaluation has exactly the same control and differs only in the data, a straightforward vectorization is possible. Third, as this compute kernel was written for Intel processors, the vectorized code is programmed with intrinsics for AVX. Fourth, an additional optimization is blocking within a single thread. This leads to the interleaved evaluation of multiple data points by a single thread. It is required to continously fill the processor’s pipelines with independent instructions. Taking just the first optimization into account, the multievaluation leads to a straightforward implementation of (1) for all evaluation points, see Listing 1: For each data point we iterate over all grid points and add up the contributions. The evaluation of a basis function is the implementation of (3) and (4), a loop over all dimensions. The code is a direct implementation of the mathematical formulation, simple to read, does not rely on any hardware properties and is easy to maintain and compiles on all systems with a standard C++compiler. Furthermore, it is short and consists of only 10 lines of code (including closing brackets). Listing 2: Excerpt of the inner-most loop of the evaluation of a sparse grid function with piecewise-linear basis functions. 
The overall length of the whole multi-evaluation is more than 200 lines. This vectorized algorithm was implemented with AVX intrinsics:

    // transformation of data structures
    ...
    // loop over data in chunk-increments
    ...
    // evaluate, loop over all dimensions:
    for (size_t d = 0; d < dims; d++) {
        __m256d eval_0 = _mm256_load_pd(...);
        __m256d eval_1 = _mm256_load_pd(...);
        // ... same for eval_2, ..., eval_4
        __m256d eval_5 = _mm256_load_pd(...);
        // distribute level and index information
        __m256d level = _mm256_broadcast_sd(...);
        __m256d index = _mm256_broadcast_sd(...);
        // evaluate a 1D basis function
        eval_0 = _mm256_msub_pd(eval_0, level, index);
        eval_0 = _mm256_and_pd(mask, eval_0);
        eval_0 = _mm256_sub_pd(one, eval_0);
        eval_0 = _mm256_max_pd(zero, eval_0);
        res_0 = _mm256_mul_pd(res_0, eval_0);
        // ... same block for eval_1, ..., eval_5
    }
    // back transformation of data structures

Listing 1: Straightforward algorithm for evaluating a sparse grid function with piecewise-linear basis functions at multiple points in the domain. The algorithm consists of three nested loops with a few arithmetic operations in the innermost loop.

    for (size_t i = 0; i < data_points; i++) {
        result[i] = 0.0;
        for (size_t j = 0; j < grid_points; j++) {
            // ... (loop over all dimensions,
            //      evaluating the 1D basis functions)
        }
    }

B. Multi-evaluation with piecewise-linear basis functions on GPUs

For an implementation on GPUs and other accelerator cards, the framework OpenCL was chosen. An alternative would have been NVIDIA's CUDA, but as our aim is to maintain as much portability as possible, OpenCL offers vendor portability of the code. This is a trade-off and a design decision: CUDA typically performs better on NVIDIA GPUs and has better support with respect to functionality and tools, but it does not run on other hardware. OpenCL, in contrast, runs on accelerators of different vendors, including Intel's MIC and AMD's GPUs, and even on CPUs. Furthermore, OpenCL code is compiled just-in-time at runtime, which CUDA did not support until recently. This makes additional advanced optimizations possible. We have optimized for both OpenCL and CUDA and have observed only a negligible difference in performance. Therefore, most of our specialized implementations for accelerators are based on OpenCL.

Developing software for graphics processors brings many additional challenges. For example, the memory on both host and device has to be managed. Consequently, the code splits into a lot of boilerplate code and memory management on the host side (about 1000 lines of code) and the compute kernel (95 lines of code). In the following example, we focus on the compute kernel.

Fig. 6: The execution of an OpenCL kernel on a single device. The host has to perform additional work before the kernel can be run.

The kernel execution and the most important additional steps the program has to perform to execute the OpenCL kernel on a compute device are shown in Fig. 6. Again, several optimizations are required. As graphics processors have only very limited caching capabilities, reusing memory becomes even more important than on CPUs. Of course, the optimizations that we discussed before for CPUs, and that are partially transparent in OpenCL, are required as well: parallelization, vectorization and pipelining. Additionally, explicit blocking is necessary. It reduces the number of memory transactions to main memory by the number of additional evaluations assigned to an OpenCL thread. Furthermore, the additional independent instructions increase the utilization of the vector pipelines. Blocking can also be highly beneficial on CPUs, but becomes crucial on GPUs due to their much smaller caches.

Listing 3 shows a small part of the OpenCL kernel, configured with a blocking factor of 2. For actual GPUs, a higher blocking factor is required to maximize performance; for current NVIDIA GPUs, a blocking factor of 8 turned out to be a good choice. Note that statements that belong to different evaluations are independent, which mitigates pipeline stalls due to instruction dependencies. Additionally, as all blocked evaluations work on the same grid point at any point during the calculation, the values for level, index and the surplus αl,i have to be loaded only once. This is a (rare) case where code duplication leads to significantly improved performance.

Listing 3: Excerpt from the blocked evaluation with a blocking factor of 2, chosen for readability. Current GPUs benefit from a blocking factor of up to 8. This results in an overall length of 95 lines of code for the kernel.

    // ...
    double data_0[4], data_1[4];
    for (size_t d = 0; d < dim; d++) {
        // initialize with different data
        data_0[d] = data[...];
        data_1[d] = data[...];
    }
    for (size_t j = 0; j < grid_points; j++) {
        // alpha, level and index are reused
        double support_0 = alpha[j];
        double support_1 = alpha[j];
        for (size_t d = 0; d < dim; d += 1) {
            double cur_level = level[...];
            double cur_index = index[...];
            // repeated 1d evaluations
            support_0 *= ...;
            support_1 *= ...;
        }
        result_0 += support_0;
        result_1 += support_1;
    }
    // ...

Specialized programming dialects for accelerator cards, in contrast to commodity programming, impose performance penalties on (non-vectorizable) high-level constructs such as recursion, OOP and other programming techniques that are frequently employed in software engineering. Avoiding these to gain performance causes a significant loss in software quality, in addition to the obvious obfuscation of the code.

This becomes even worse if a more elaborate type of basis function is employed, one that is frequently used in sparse grid data mining [14]: the modified linear basis sketched in Fig. 1. The core difference to the standard linear basis is that the shape of the basis function depends on the level and on the index within the level. Thus, its evaluation requires a series of case distinctions.
This makes an efficient vectorization very difficult and leads to severe performance penalties on GPUs. For SG++, we have developed an optimization implemented in OpenCL that superimposes the different branches to achieve a high performance [10]. To execute a specific branch, mask values are supplied that enable exactly the arithmetic operations required for that branch. With this scheme, a branch-free algorithm can be created. This is illustrated by Listing 4, which shows the branch-free evaluation of a single basis function. The listing clearly demonstrates that no if-statements are left and that the resulting high-performance code is non-intuitive. In particular, it is not trivial to map the body of the loop to its mathematical formulation (4). If such optimizations are required, an object-oriented SoC is not possible anymore. Abstraction is virtually impossible, too, as it leads to new switch statements and a loss in computational efficiency.

Listing 4: Evaluation of a modified linear basis function with masking to avoid branches; blocking is not shown here to keep it simple. The 1D basis function evaluation does not require any if-statements. Mask values are computed on the host in advance; the level vector contains precomputed values 2^l_j.

    for (int d = 0; d < dim; d++) {
        eval = level[d] * data[d];
        index_calc = eval - index[d];
        abs = as_double(as_ulong(index_calc) | as_ulong(mask[d]));
        temp = offset[d] + abs;
        local_support = fmax(temp, 0.0);
        result *= local_support;
    }

C. Performance evaluation

Fig. 7: Runtime of the scenarios. Left: AVX on an Intel CPU. Middle: multi-evaluation with linear basis and blocking. Right: transposed multi-evaluation with modified linear basis and masking. OpenCL (OCL) results have been obtained on a K20X GPU. For each scenario described in Sec. V, the runtime is significantly reduced.

In the following, we give a brief indication that the optimizations discussed above are not a hacker's delight, but rather a necessity: all deteriorations in terms of software quality lead to significant performance boosts. Figure 7 shows the performance gains for a spatially adaptive sparse grid for the three optimizations discussed above, for a real-world dataset from astrophysics; see [9] for details. The timings were obtained on an Intel Xeon E5-2650v2 equipped with a K20X GPU. The AVX-accelerated algorithm is more than two times faster, despite using an algorithm with a worse complexity. The advantages of the optimizations for accelerator cards are similarly convincing. All three optimizations are significant enough to trade in software quality for low-level implementations. In practice, the accelerator implementation would combine them and use several additional optimizations, which leads to even larger performance gains. Taking this to the extreme, we have been able, for example, to reduce a large-scale data mining problem resulting in billions of function evaluations from more than 300,000 s in a sequential, CPU-only version to 350 s on a multi-core system with multiple GPUs [9].

VI. LESSONS LEARNED

We have shown different optimizations for a single, simple operation. The optimizations that are required, however, are complicated and demand sophisticated tuning of the code to the hardware at hand. All optimizations have a huge benefit and are crucial to achieve efficient high-performance code. All of them, however, violate quality attributes such as maintainability and portability, and therefore contradict software quality attributes that are central to software development in non-CS&E fields. While it is difficult to quantify the maintainability and portability of a code, the portability of an implementation clearly gets lost if the optimizations run only on certain hardware, so that a generic, low-performance version has to be provided additionally. For our optimizations, the loss of maintainability is obvious: Just considering the lines of code, a straightforward implementation with 10 lines grows to more than 200 for AVX intrinsics, and to an OpenCL kernel of 95 lines plus 1000 more on the host (including the precomputations). Furthermore, it is easy to derive the underlying mathematics by looking at the straightforward implementation. This becomes a non-trivial task for the AVX-vectorized version, however, and requires a significant re-engineering effort for the masked OpenCL version, where precomputations on the host obscure the compute kernel. The use of low-level programming based on intrinsics or even assembly, implementations using blocking, and bit tricks to mitigate the performance penalty of branching all lead to code that is difficult to maintain. Keeping this in mind, we have learned the following lessons.

Ensure usability on a high level. To ensure usability of performance-optimized codes, a high-level API has to be defined that hides all tuned parts. We encourage high-level abstractions and the flexible use of OOP and software patterns on the user level. Invisible to the user, this can be violated for performance-critical parts. However, the selection of efficient and optimized compute kernels has to be transparent to the user. Listing 5 and Fig. 4 show the high-level user view that hides the low-level optimizations. Generic default implementations should be provided for all settings without an optimized version. Furthermore, flexible convenience data structures on a high level are necessary, even though the compute kernels might need and use their own custom data structures, and data may have to be explicitly converted between both worlds.
Our own experience shows that banning the use of certain programming constructs on the user level has been a constant source of controversial discussions with researchers and developers coming from an HPC background; but the decision has proven to be very helpful in the long term. As SG++ supports its use from multiple languages (C++, Python, Java, Matlab), the API additionally defines which functionality has to be wrapped and exposed from C++ to the other languages.

Listing 5: User example for the multi-evaluation operation

    // create grid for piecewise linear basis
    Grid* grid = Grid::createLinearGrid(dim);
    // ... create/refine grid ...
    // create coefficient and result vector
    DataVector alpha(grid->getStorage()->getSize());
    DataVector result(dataset->getSize());
    // ... compute alpha ...
    // factory method for multi-eval selects
    // optimized kernel for given grid and basis
    opEval = createOperationMultipleEval(
        grid, dataset);
    opEval->eval(alpha, result);

"Duplicates" can be helpful. While duplicate code is generally considered bad style and a potential source of errors, we have experienced that an SoC on the performance level can require the duplication of code with specialized adaptations. This occurs, for example, when providing different hand-tuned compute kernels for different types of hardware. Often, the differences between implementations are only a few crucial lines of code. Their fusion, however, for example based on defines, significantly reduces their readability and increases the training time on the code. In extreme settings, where the use of specialized hardware requires a completely different set of algorithms and data structures, a duplication of the corresponding parts of the code can become necessary. We have experienced that a modular structure on a high(er) level is beneficial in this regard. It allows different testing approaches and compilers to be used, and parts of the code to be switched on and off as they are required or become obsolete. Particularly good documentation and thorough testing are crucial in the context of duplications.

Ensure maintainability, portability, and flexibility. For high-performance parts with low-level implementations, maintainability deteriorates. In ongoing work, we have already achieved excellent results with automatic code generation and with tuning optimization parameters to the hardware at hand. In this regard, the use of domain-specific languages to combine a high level of abstraction with flexible low-level optimizations has become increasingly popular in the HPC community. Both approaches go beyond code duplication and can help to provide maintainability and portability to some extent.

VII. CONCLUSIONS

We have reported on software challenges with respect to the trade-off between the computational efficiency of HPC codes and software quality attributes such as maintainability and portability. Using the software framework SG++ as an example, we have demonstrated that both worlds cannot simply be combined. A clear separation of concerns is required on several levels, as well as decisions on where to place the trade-off(s). We are confident that the lessons learned and the design principles of SG++ shown here can generalize to other software projects in science.

VIII. ACKNOWLEDGMENTS

Financial support from the Deutsche Forschungsgemeinschaft (EXC 310, SimTech) and the Landesstiftung Baden-Württemberg (JP-Programm Auto-Tuning) is gratefully acknowledged.

REFERENCES

[1] E. Arge, A. M. Bruaset, and H. P. Langtangen. Object-oriented numerics. In Numerical Methods and Software Tools in Industrial Mathematics, pages 7–26. Springer, 1997.
[2] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961.
[3] M. Buchholz. Framework zur Parallelisierung von Molekulardynamiksimulationen in verfahrenstechnischen Anwendungen. Verlag Dr. Hut, 2010.
[4] H.-J. Bungartz and M. Griebel. Sparse grids. Acta Numerica, 13:147–269, 2004.
[5] G. Buse. Exploiting Many-Core Architectures for Dimensionally Adaptive Sparse Grids. Verlag Dr. Hut, 2015.
[6] G. Buse, D. Pflüger, A. Murarasu, and R. Jacob. A non-static data layout enhancing parallelism and vectorization in sparse grid algorithms. In Proc. of Parallel and Distributed Computing (ISPDC) 2012, pages 195–202, June 2012.
[7] D. Butnaru, D. Pflüger, and H.-J. Bungartz. Towards high-dimensional computational steering of precomputed simulation data using sparse grids. Procedia Computer Science, 4:56–65, 2011.
[8] A. Chatzigeorgiou and G. Stephanides. Evaluating performance and power of object-oriented vs. procedural programming in embedded processors. In Proc. of the 7th Ada-Europe Int. Conf. on Reliable Software Technologies, Ada-Europe '02, pages 65–75, London, UK, 2002. Springer-Verlag.
[9] A. Heinecke and D. Pflüger. Multi- and many-core data mining with adaptive sparse grids. In Proc. of the 8th ACM Int. Conf. on Computing Frontiers, CF '11, pages 29:1–29:10, New York, USA, 2011. ACM.
[10] A. Heinecke, D. Pflüger, et al. Demonstrating performance portability of a custom OpenCL data mining application to the Intel(R) Xeon Phi(TM) coprocessor. In Int. Workshop on OpenCL Proceedings 2013, Georgia Tech, May 2013.
[11] J. L. Hennessy and D. A. Patterson. Computer Architecture, Fourth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006.
[12] Intel Corporation. Intel Product Brief: Intel Xeon Processor E5-2600 v3 Product Family, 2014.
[13] B. Peherstorfer, C. Kowitz, D. Pflüger, and H.-J. Bungartz. Selected recent applications of sparse grids. Numerical Mathematics: Theory, Methods and Applications, 8(01):47–77, 2015.
[14] D. Pflüger. Spatially Adaptive Sparse Grids for High-Dimensional Problems. Verlag Dr. Hut, 2010.
[15] President's Information Technology Advisory Committee. Computational science: Ensuring America's competitiveness. Report to the president, Executive Office of the President of the United States, 2005.
[16] S. Wagner, D. Pflüger, and M. Mehl. Simulation software engineering: Experiences and challenges. In Proceedings of the 3rd International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering, SE-HPCCSE '15, pages 1–4, New York, NY, USA, 2015. ACM.
[17] Wissenschaftsrat. Bedeutung und Weiterentwicklung von Simulation in der Wissenschaft. Position paper, Wissenschaftsrat, 2014.