Ferran Obón Santacana, Universität Kassel (UNIKA)

Overview of Presentation
1. State of the art
2. Dealing with large numerical facilities
3. The substructure algorithm
4. Review of available tools

State of the art
- Currently, only small numerical models, on the order of 10 Dynamic Degrees of Freedom (DDOF), are used in the field of substructure testing.
- Very large and complex numerical models are out of reach in the context of continuous hybrid simulation due to hardware limitations.
- Hybrid simulation has not yet reached its full potential!

Using Large Numerical Facilities
- Will cover one of the major points of task NA1.4: extensibility to non-linear physical structures and complex numerical models.
- Will allow continuous geographically distributed tests with numerical models of a couple of thousand DDOF.
- Will provide more accurate simulations.
[Figures: contribution of METU (ETABS model); contribution of ITU]

Using Large Numerical Facilities: Major Problems
- Batch mode: execution of a series of programs on a computer without manual intervention.
- Jobs are queued: the end user has no control over when the job will be processed.
- Security: all ports are blocked to prevent illegal use by third parties. Data cannot be exchanged over the internet; output data is saved to a file.

Using Large Numerical Facilities
To successfully run a geographically distributed test it is necessary to:
- send and receive data between the numerical facility and the laboratory;
- ensure that no other process will slow down the test, which requires a priority rank;
- know beforehand when the numerical simulation will start.
The most powerful numerical facilities are not able to fulfill these requirements.

Using own numerical facilities I
- The solution is to use the numerical facilities of the university. Since they are less heavily used, special arrangements can be made to successfully run geographically distributed tests.
- The Linux-Cluster at Universität Kassel has:
  - 600+ CPU cores
  - 216 GB of RAM
  - support for shared memory access

Using own numerical facilities II: Special Arrangements
To solve the previous problems, special arrangements were made:
- Queue: some processors are booked, assuring that the job will run without delay.
- Interactive program: the use of a local network minimizes the security threat. A specific port is opened to exchange data.
- Firewall: scanning on that port is disabled to speed up data transfer.

The Substructure Algorithm
1. The distributed test(s)
2. Introducing the substructure algorithm
3. Adapting the algorithm to run in parallel

The distributed test

The Substructure Algorithm I
- Algorithm developed by Dorka [2]; used successfully in earthquake engineering and aerospace applications.
- Needs to be adapted to run in a parallel fashion (MPI & OpenMP).
- All the operations during the test are matrix-vector operations or vector-constant operations.
- Special algebraic libraries are used to handle large matrices.
[Figure: numerical model of Ariane IV with 2 DOF payload and 4 DOF reference specimen at DLR Göttingen [3]]

The Substructure Algorithm II

Adapting the algorithm

Adapting the algorithm II

Review of Available Tools
1. A (maybe not so necessary) short introduction to memory management
   1. Overview
   2. Types of memory
   3. Example
2. Dealing efficiently with Linear Algebra Operations
3. Sequential to parallel conversion
4. Dealing efficiently with Linear Algebra Operations on parallel computers

A short introduction to memory management I: Overview
- Computer memory consists of a linearly addressable space.
- Single variables and one-dimensional arrays fit quite well into this concept.
- Two-dimensional arrays can be stored by decomposing the matrix into:
  - a collection of rows, i.e. row-major order (C/C++), or
  - a collection of columns, i.e. column-major order (FORTRAN).
- For symmetric matrices the ordering does not matter.
From [4]
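A minimal C sketch (an illustrative addition, not from the original slides; the array name a and size N are arbitrary) of how the same 3x3 matrix is addressed under the two storage orders:

```c
#include <stdio.h>

#define N 3

int main(void)
{
    double a[N * N];

    /* Row-major (C/C++): element (i,j) lives at a[i*N + j], so the
       elements of one row are contiguous in memory.
       Column-major (FORTRAN): element (i,j) would live at a[i + j*N],
       making the elements of one column contiguous instead.          */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i * N + j] = 10.0 * i + j;      /* row-major fill */

    /* Traversing in storage order touches consecutive addresses and
       therefore uses each loaded cache line fully; swapping the two
       loops would stride through memory and waste the cache lines.  */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%5.1f ", a[i * N + j]);
        printf("\n");
    }
    return 0;
}
```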
A short introduction to memory management II: Types of memory
- Faster memory is more expensive.
- Main memory is much slower than the Central Processing Unit (CPU).
- Smaller but faster memory is therefore inserted between them: the CPU caches (L1, L2, L3). Using this memory speeds up the program. Key concepts:
  - cache reuse
  - cache hit/miss
  - cache line
  - cache blocking

A short introduction to memory management III: Example
- A 3x3 matrix A stored in row-major order, a 3x1 vector x, and a cache line of three units.
- Load the first element of A and of x; load the second element of A; load the third element of A; write the result back to main memory.
- What would happen if the matrix were stored in column-major order?
From [4]

Review of Available Tools
1. A (maybe not so necessary) short introduction to memory management
2. Dealing efficiently with Linear Algebra Operations
   1. The BLAS library
   2. The LAPACK library
3. Sequential to parallel conversion
4. Dealing efficiently with Linear Algebra Operations on parallel computers

Dealing efficiently with Linear Algebra Operations: The BLAS library I
- BLAS: Basic Linear Algebra Subroutines [5].
- First published in 1979 and developed to achieve high efficiency and clean computer programs.
- The reference BLAS is written in FORTRAN 77, with a C interface (NETLIB).
- Several implementations exist in other programming languages: Java, C++, Python, ...
- Serves as a building block in many computer codes.
- Most computer vendors optimize BLAS for specific architectures (AMD, Apple, IBM, Intel, SUN, ...).
- Adaptive BLAS software also exists: ATLAS automatically tunes BLAS for a specific computer and is thread safe.
- BLAS is divided into three levels; one representative call per level is sketched below.
[Image: Commodore PET 2001]
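To make the three levels concrete, the following sketch (an illustrative addition, using the CBLAS interface provided by the reference BLAS/ATLAS; the 2x2 data is arbitrary) calls one representative routine per level. The levels themselves are detailed on the next slides.

```c
/* Build e.g. with: gcc blas_levels.c -lcblas  (or link against ATLAS) */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double x[2] = {1.0, 2.0};
    double y[2] = {3.0, 4.0};
    double A[4] = {1.0, 2.0,
                   3.0, 4.0};                 /* 2x2, row-major */
    double B[4] = {1.0, 0.0,
                   0.0, 1.0};                 /* identity       */
    double C[4] = {0.0};

    /* Level 1, O(n):   y := 2*x + y  (the _axpy operation) */
    cblas_daxpy(2, 2.0, x, 1, y, 1);

    /* Level 2, O(n^2): y := A*x      (matrix-vector product) */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 2, 2,
                1.0, A, 2, x, 1, 0.0, y, 1);

    /* Level 3, O(n^3): C := A*B      (matrix-matrix product) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```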
Dealing efficiently with Linear Algebra Operations: The BLAS library II
A conservative formula to estimate the time cost, assuming no overlap between computation and loading of data:

  T = N_f * t_f + N_m * t_m

where N_f is the number of floating-point operations, t_f the time per floating-point operation, N_m the number of memory references and t_m the time per memory reference. From [4]

Dealing efficiently with Linear Algebra Operations: The BLAS library III
Level 1 BLAS performs order n operations:
- scalar-vector multiplication
- vector addition
- inner (dot) product
- vector multiply
- the _axpy operation (y := alpha*x + y)
For _axpy the counts are N_f = 2n and N_m = 3n + 1, so the ratio N_m/N_f tends to 3/2. From [4]

Dealing efficiently with Linear Algebra Operations: The BLAS library IV
Level 2 BLAS performs order n^2 operations:
- It covers, in particular, matrix-vector multiplication and accepts both orderings (row-major and column-major).
Here N_f = 2n^2 and N_m = n^2 + 3n, so the ratio N_m/N_f tends to 1/2. From [4]

Dealing efficiently with Linear Algebra Operations: The BLAS library V
Level 3 BLAS performs order n^3 operations:
- It covers, in particular, matrix-matrix multiplication and accepts both orderings (row-major and column-major).
Here N_f = 2n^3 and N_m = 4n^2, so the ratio N_m/N_f = 2/n tends to 0 for large n. From [4]

Dealing efficiently with Linear Algebra Operations: The BLAS library VI
Available datatypes:
  S  REAL
  D  DOUBLE PRECISION
  C  COMPLEX
  Z  COMPLEX*16 or DOUBLE COMPLEX
Supported matrices:
  GE  GEneral               GB  General Band
  SY  SYmmetric             SB  Symmetric Band            SP  Symmetric Packed storage
  HE  (complex) HErmitian   HB  (complex) Hermitian Band  HP  (complex) Hermitian Packed storage
  TR  TRiangular            TB  Triangular Band           TP  Triangular Packed storage
From [5]

Dealing efficiently with Linear Algebra Operations: The LAPACK library I
- LAPACK: Linear Algebra PACKage [6,7].
- First published in 1992 and developed to make the widely used EISPACK and LINPACK libraries run efficiently on shared-memory vector and parallel processors.
- The reference LAPACK was originally written in FORTRAN 77 (now Fortran 90), with a C interface (NETLIB).
- Several implementations exist in other programming languages: Java, C++, Python, ...
- Provides routines for:
  - solving systems of simultaneous linear equations
  - least-squares solutions of linear systems of equations
  - eigenvalue problems
  - singular value problems
  - matrix factorizations (LU, Cholesky, QR, ...)
  - matrix inversion
- LAPACK routines are written so that as much of the computation as possible is performed by calls to BLAS.
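As an illustration of a typical LAPACK call from C (an added sketch assuming the LAPACKE C interface is available; the 2x2 system is arbitrary), the following solves A x = b by LU factorization with dgesv:

```c
/* Build e.g. with: gcc solve.c -llapacke -llapack -lblas */
#include <stdio.h>
#include <lapacke.h>

int main(void)
{
    /* Solve the 2x2 system A*x = b by LU factorization (dgesv):
       3x + y = 9, x + 2y = 8  =>  x = 2, y = 3                  */
    double A[4] = {3.0, 1.0,
                   1.0, 2.0};                 /* row-major */
    double b[2] = {9.0, 8.0};
    lapack_int ipiv[2];

    /* n = 2, one right-hand side; in row-major layout the leading
       dimension of b equals the number of right-hand sides (1).   */
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1,
                                    A, 2, ipiv, b, 1);
    if (info == 0)
        printf("x = (%g, %g)\n", b[0], b[1]);
    else
        printf("dgesv failed, info = %d\n", (int)info);
    return 0;
}
```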
Dealing efficiently with Linear Algebra Operations: The LAPACK library II
Available datatypes:
  S  REAL
  D  DOUBLE PRECISION
  C  COMPLEX
  Z  COMPLEX*16 or DOUBLE COMPLEX
Supported matrices:
  BD  bidiagonal
  DI  diagonal
  GB  general band
  GE  general (i.e., unsymmetric, in some cases rectangular)
  GG  general matrices, generalized problem (i.e., a pair of general matrices)
  GT  general tridiagonal
  HB  (complex) Hermitian band
  HE  (complex) Hermitian
  HG  upper Hessenberg matrix, generalized problem (i.e., a Hessenberg and a triangular matrix)
  HP  (complex) Hermitian, packed storage
  HS  upper Hessenberg
  OP  (real) orthogonal, packed storage
  OR  (real) orthogonal
  PB  symmetric or Hermitian positive definite band
  PO  symmetric or Hermitian positive definite
  PP  symmetric or Hermitian positive definite, packed storage
  PT  symmetric or Hermitian positive definite tridiagonal
  SB  (real) symmetric band
  SP  symmetric, packed storage
  ST  (real) symmetric tridiagonal
  SY  symmetric
  TB  triangular band
  TG  triangular matrices, generalized problem (i.e., a pair of triangular matrices)
  TP  triangular, packed storage
  TR  triangular (or in some cases quasi-triangular)
  TZ  trapezoidal
  UN  (complex) unitary
  UP  (complex) unitary, packed storage
From [6,7]

Review of Available Tools
1. A (maybe not so necessary) short introduction to memory management
2. Dealing efficiently with Linear Algebra Operations
3. Sequential to parallel conversion
   1. Introduction
   2. Message Passing Interface (MPI)
   3. Open MultiProcessing (OpenMP)
   4. Hybrid programming
4. Dealing efficiently with Linear Algebra Operations on parallel computers

A warning: Amdahl's Law
- Equation proposed by Gene Amdahl in 1967 to calculate the speedup factor [8].
- It assumes that some fraction f of the program cannot be parallelized, and that the remaining fraction 1 - f is perfectly parallel.
- Neglecting communication delays (latencies, memory, ...), the speedup on p processors is:

  S(p) = 1 / (f + (1 - f)/p)

- For example, if f = 5% of a program is serial, the speedup can never exceed 20, however many processors are used.
From [4]

The Sequential to parallel conversion: Introduction
There are different libraries for adapting the algorithm to run in parallel:
- Message Passing Interface (MPI)
- POSIX Threads (Pthreads)
- Open MultiProcessing (OpenMP)
- Parallel Virtual Machine (PVM)
- Intel Threading Building Blocks (TBB)
- ...
and different languages:
- CUDA
- OpenCL
- Ada
- High Performance Fortran (HPF)
- Unified Parallel C (UPC)
To parallelize the algorithm, MPI and OpenMP will be used (Pthreads will be used indirectly through ATLAS).

The Sequential to parallel conversion: Introduction I
[Figure: the matrix A distributed over a 2x2 process grid, with number of processes = 4 and block size = 2 (2D block-cyclic distribution)]

The Sequential to parallel conversion: Message Passing Interface (MPI) I
- MPI [9,10] is a library specification for message passing that allows computers to communicate with one another.
- MPI is a language-independent communications protocol. The routines are callable from Fortran and C/C++.
- MPI was designed for high performance on both massively parallel machines and workstation clusters.
- MPI is a specification, not a particular implementation.
- GOAL: efficiency, portability and functionality.
[Image: MariCel Supercomputer (BSC)]
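A minimal MPI sketch (an illustrative addition, not from the slides): rank 0 sends a buffer of doubles to rank 1, showing the explicit, two-sided communication style assessed on the next slide.

```c
/* Build with: mpicc send_recv.c   Run with: mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double buf[4] = {0.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < 4; i++)
            buf[i] = i + 1.0;
        /* explicit, two-sided communication: 4 doubles to rank 1 */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %g %g %g %g\n",
               buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```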
The Sequential to parallel conversion: Message Passing Interface (MPI) II
A set of processes that are able to communicate with one another. Pros and cons:
  + Portable to distributed and shared memory machines
  + Scales beyond one node
  + Each process has its own local variables
  + Highly portable, with specific optimizations for each implementation
  - Difficult to develop and debug
  - High latency, low bandwidth
  - Explicit communication
  - Difficult load balancing
  - Requires more programming changes to go from the serial to the parallel version
[Figure: message passing model. From [9]]

The Sequential to parallel conversion: Open MultiProcessing (OpenMP) I
- OpenMP is a shared-memory application programming interface (API) [11].
- It is an implementation of multithreading, where a master thread forks a specified number of slave threads.
- It can run in a hybrid way with MPI.
- It is not a new programming language: it is a notation that can be added to Fortran, C and C++ to describe how the work is to be shared among threads.
- It is easy to convert a sequential algorithm to run in parallel; it is especially suitable for loops (see the sketch after the next slide).

The Sequential to parallel conversion: Open MultiProcessing (OpenMP) II
Pros and cons:
  + Easy to implement parallelism
  + Easier to develop and debug than MPI
  + Data layout and decomposition are handled automatically
  + Implicit communication
  + Dynamic load balancing
  + Low latency, high bandwidth
  - Scales only within one node
  - Only on shared memory machines
  - Possible data placement problems
  - No specific thread order
  - Mostly used for loop parallelization
  - High chance of accidentally overwriting variables
[Figure: shared-memory model. From [9]]
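A minimal OpenMP sketch (an illustrative addition; the array sizes are arbitrary) parallelizing the rows of a matrix-vector product, the loop-oriented use case mentioned above:

```c
/* Build e.g. with: gcc -fopenmp matvec_omp.c */
#include <stdio.h>
#include <omp.h>

#define N 1000

static double A[N][N], x[N], y[N];

int main(void)
{
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? 2.0 : 0.0;   /* A = 2*I for checking */
    }

    /* The master thread forks a team of threads; the iterations of
       the outer loop (rows of the product) are shared among them.  */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double sum = 0.0;                     /* private per thread */
        for (int j = 0; j < N; j++)
            sum += A[i][j] * x[j];
        y[i] = sum;
    }

    printf("y[0] = %g, using up to %d threads\n",
           y[0], omp_get_max_threads());
    return 0;
}
```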
The Sequential to parallel conversion: Hybrid Programming
Combining the best of both worlds [12]:
  + MPI is used across nodes and OpenMP within a node
  + Avoids the extra overhead of MPI communication within a node
  + The program may be faster and scale better
  - Lack of optimized OpenMP compilers
  - All threads except one are idle during MPI communication
  - Introduces an overhead when creating threads
[Figure: hybrid programming. From [9]]

Review of Available Tools
1. A (maybe not so necessary) short introduction to memory management
2. Dealing efficiently with Linear Algebra Operations
3. Sequential to parallel conversion
4. Dealing efficiently with Linear Algebra Operations on parallel computers
   1. Introduction
   2. BLACS
   3. PBLAS
   4. Example

Dealing efficiently with Linear Algebra Operations on parallel computers: Introduction
Some software is available:
- BLACS: Basic Linear Algebra Communication Subprograms
- PBLAS: Parallel Basic Linear Algebra Subprograms
- ScaLAPACK: Scalable Linear Algebra PACKage
- PLAPACK: Parallel LAPACK
All packages listed are available for free at www.netlib.org (PLAPACK at http://www.cs.utexas.edu/~plapack/). To parallelize the algorithm, only the libraries from netlib.org will be used.

Dealing efficiently with Linear Algebra Operations on parallel computers: BLACS
- BLACS: Basic Linear Algebra Communication Subprograms [13].
- It aims at providing a portable, linear algebra specific layer for communication.
- It is written in C and provides FORTRAN and C interfaces.
- It includes synchronous send/receive routines to:
  - communicate a matrix or submatrix from one process to another
  - broadcast submatrices
  - compute global reductions (sums, maxima and minima)
- It supports two types of submatrices: general, and trapezoidal (a generalization of triangular).
- Available datatypes: single precision, double precision, complex single precision and complex double precision.

Dealing efficiently with Linear Algebra Operations on parallel computers: PBLAS
- PBLAS: Parallel Basic Linear Algebra Subprograms [14].
- A collection of routines for performing linear algebra operations on distributed-memory computers.
- The fundamental building blocks of PBLAS are the sequential BLAS and a set of communication subprograms (BLACS).
- It is available for free at www.netlib.org/scalapack.
- It is written in C and provides a FORTRAN interface.
- PBLAS operates on data distributed using the 2D block-cyclic distribution.
- Supported matrices: GE (general), SY (symmetric), HE (Hermitian) and TR (triangular).
- Available datatypes: REAL, DOUBLE PRECISION, COMPLEX and COMPLEX*16.

Dealing efficiently with Linear Algebra Operations on parallel computers: ScaLAPACK
- ScaLAPACK: Scalable Linear Algebra PACKage [15,16].
- A subset of LAPACK routines redesigned for distributed-memory MIMD parallel computers, written in FORTRAN 77.
- It is portable to any computer that supports MPI or PVM.
- The fundamental building blocks of the ScaLAPACK library are the distributed-memory versions (PBLAS) of the Level 1, 2 and 3 BLAS, and a set of Basic Linear Algebra Communication Subprograms (BLACS).
- All interprocessor communication occurs within the PBLAS and the BLACS.
- The ScaLAPACK routines resemble their LAPACK equivalents. However:
  - there is no support for band and packed matrices;
  - some more advanced algorithms are missing (non-symmetric eigenvalue problems, ...).

Dealing efficiently with Linear Algebra Operations on parallel computers: How it works
To successfully use the libraries mentioned, four steps are needed (see the sketch below):
1. Initialize the process grid, defining how many processes will be used (BLACS).
2. Distribute the matrix on the process grid.
3. Call the routine! (PBLAS, ScaLAPACK)
4. Release the process grid (BLACS).
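A skeleton of the four steps in C, under stated assumptions: the Cblacs_* prototypes and the Fortran symbols numroc_/descinit_ are declared by hand because header conventions differ between implementations; the grid is hard-coded to 2x2, so it must be run with exactly 4 processes; and step 3 is left as a comment, since a real call such as pdgemv_ would first need the local blocks filled.

```c
/* Skeleton only; build against ScaLAPACK/BLACS and run with 4 processes. */
#include <stdio.h>
#include <mpi.h>

/* Hand-declared prototypes: header conventions vary between
   implementations, so these declarations are assumptions.     */
extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int ictxt, int what, int *val);
extern void Cblacs_gridinit(int *ictxt, const char *order,
                            int nprow, int npcol);
extern void Cblacs_gridinfo(int ictxt, int *nprow, int *npcol,
                            int *myrow, int *mycol);
extern void Cblacs_gridexit(int ictxt);
extern int  numroc_(const int *n, const int *nb, const int *iproc,
                    const int *isrcproc, const int *nprocs);
extern void descinit_(int *desc, const int *m, const int *n,
                      const int *mb, const int *nb, const int *irsrc,
                      const int *icsrc, const int *ictxt,
                      const int *lld, int *info);

int main(int argc, char **argv)
{
    int iam, nprocs, ictxt, myrow, mycol, info;
    int nprow = 2, npcol = 2;              /* 2x2 process grid       */
    int n = 8, nb = 2, izero = 0;          /* 8x8 matrix, 2x2 blocks */
    int desca[9];

    MPI_Init(&argc, &argv);

    /* 1. Initialize the process grid (BLACS). */
    Cblacs_pinfo(&iam, &nprocs);
    Cblacs_get(0, 0, &ictxt);
    Cblacs_gridinit(&ictxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);

    /* 2. Distribute the matrix: local row count and descriptor for
          an n x n matrix in nb x nb blocks (2D block-cyclic).      */
    int mloc = numroc_(&n, &nb, &myrow, &izero, &nprow);
    descinit_(desca, &n, &n, &nb, &nb, &izero, &izero,
              &ictxt, &mloc, &info);
    printf("process %d of %d at (%d,%d): %d local rows\n",
           iam, nprocs, myrow, mycol, mloc);

    /* 3. Call the routine (PBLAS/ScaLAPACK), e.g. pdgemv_ or pdgesv_,
          once the local blocks of A have been filled.                */

    /* 4. Release the process grid (BLACS). */
    Cblacs_gridexit(ictxt);
    MPI_Finalize();
    return 0;
}
```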
Thank you for your attention

References
[1] UHyDe: US Patent 546047.
[2] Dorka, U. E., Hybrid experimental-numerical simulation of vibrating structures, Proceedings of the International Conference WAVE2002, Okayama, Japan, 2002.
[3] Dorka, U. E. and Füllekrug, U., Report of DFG project no. Do 360/7: Sub-PSD tests, University of Kaiserslautern, Germany.
[3] Karniadakis, G. E. and Kirby, R. M., Parallel Scientific Computing in C++ and MPI, Cambridge University Press.
[4] Petersen, W. P. and Arbenz, P., Introduction to Parallel Computing: A Practical Guide with Examples in C, Oxford Texts in Applied and Engineering Mathematics.
[5] Official BLAS documentation and routines, http://www.netlib.org/blas/
[6] Official LAPACK documentation and routines, http://www.netlib.org/lapack/
[7] Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A. and Sorensen, D., LAPACK Users' Guide, third edition, Society for Industrial and Applied Mathematics, 1999.
[8] Amdahl, G., Validity of the single processor approach to achieving large scale computing capabilities, in AFIPS Conference Proceedings, vol. 30, pp. 483-485, 1967.
[9] Gropp, W., Lusk, E. and Skjellum, A., Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press.
[10] Gropp, W., Lusk, E. and Thakur, R., Using MPI-2: Advanced Features of the Message-Passing Interface, MIT Press.
[11] Chapman, B., Jost, G. and van der Pas, R., Using OpenMP: Portable Shared Memory Parallel Programming, MIT Press.
[12] He, Y. and Ding, C., Hybrid OpenMP and MPI Programming and Tuning, Lawrence Berkeley National Laboratory.
[13] Official BLACS documentation and routines, http://www.netlib.org/blacs/
[14] Official PBLAS documentation and routines, http://www.netlib.org/scalapack/pblas_qref.html
[15] Official ScaLAPACK documentation and routines, http://www.netlib.org/scalapack/
[16] Blackford, L. S., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D. and Whaley, R. C., ScaLAPACK Users' Guide, Society for Industrial and Applied Mathematics, 1997.