Ferran Obón Santacana
Universität Kassel (UNIKA)
Overview of Presentation
1. State of the art
2. Dealing with large numerical facilities
3. The substructure algorithm
4. Review of available tools
State of the art
• Currently, only small numerical models, on the order of 10 Dynamic Degrees of Freedom (DDOF), are used in the field of substructure testing.
• Very large and complex numerical models are out of reach within the context of continuous hybrid simulation due to hardware limitations.
• Hybrid simulation has not yet reached its full potential!
Using Large Numerical Facilities
• Will cover one of the major points of task NA1.4: extensibility to non-linear physical structures and complex numerical models.
• Will allow continuous geographically distributed tests with numerical models of a couple of thousand DDOF.
• Will provide more accurate simulations.
Contribution of METU
(ETABS model)
Contribution of ITU
Using Large Numerical Facilities:
Major Problems
• Batch mode: execution of a series of programs on a computer without manual intervention.
• Jobs are queued: the end user has no control over when the job will be processed.
• Security:
For security reasons, all ports are blocked to avoid illegal use by third parties.
Data cannot be exchanged over the internet; output data is saved to a file.
Using Large Numerical Facilities
• To successfully run a geographically distributed test, it is necessary to:
Send and receive data between the numerical facility and the laboratory.
Ensure that no other process will “slow down” the test; this requires a priority rank.
Know beforehand when the numerical simulation will start.
• The most powerful numerical facilities are not able to fulfill these requirements.
Using own numerical facilities I
• The solution is to use the numerical facilities of the university.
• Since they are less heavily used, special arrangements can be made to successfully run geographically distributed tests.
• The Linux-Cluster at Universität
Kassel has:
600+ CPU cores
216 GB of RAM
Supports shared memory access
Using own numerical facilities II: Special
Arrangements
• To solve the previous problems, special arrangements were made:
Queue: some processors are “booked”, ensuring that the job will run without delay.
Interactive program: the use of a local network minimizes the security threat. A specific port is opened to share data.
Firewall: scanning on a specific port is disabled to speed up data transfer.
The Substructure Algorithm
1. The distributed test(s)
2. Introducing the Substructure Algorithm
3. Adapting the algorithm to run in parallel
The distributed test
The Substructure Algorithm I
• Algorithm developed by Dorka [2].
• Used successfully in earthquake engineering and aerospace applications.
• Needs to be adapted to run in a parallel fashion (MPI & OpenMP).
• All the operations during the test are matrix-vector operations or vector-constant (scaling) operations.
• Use of special algebraic libraries to handle large matrices.
Numerical model of Ariane IV with 2 DOF
payload and 4 DOF reference specimen at
DLR Göttingen [3]
The Substructure Algorithm II
Adapting the algorithm
Adapting the algorithm II
Review of Available Tools
1. A (maybe not so necessary) short introduction to
memory management
1. Overview
2. Types of memory
3. Example
2. Dealing efficiently with Linear Algebra Operations
3. Sequential to parallel conversion
4. Dealing efficiently with Linear Algebra Operations
on parallel computers
A short introduction to memory
management I: Overview
• Computer memory consists of a linearly addressable space.
• Single variables and one-dimensional arrays fit quite well into this concept.
• Two-dimensional arrays can be stored by decomposing the matrix into:
A collection of rows, or row-major order (C/C++)
A collection of columns, or column-major order (FORTRAN)
• For symmetric matrices the ordering does not matter.
From [4]
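As a minimal illustration of the two storage orders (plain C; the sizes and indices are chosen arbitrarily): the same logical element A(i,j) of an m x n matrix stored in a flat array lands at different linear offsets.

#include <stdio.h>

int main(void)
{
    enum { m = 3, n = 4 };
    int i = 1, j = 2;                  /* logical element A(i,j), 0-based */

    int row_major = i * n + j;         /* C/C++ convention   */
    int col_major = j * m + i;         /* FORTRAN convention */

    printf("A(%d,%d): row-major offset %d, column-major offset %d\n",
           i, j, row_major, col_major);
    return 0;
}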
A short introduction to memory
management II: Types of memory
• Faster memory is more expensive.
• Main memory is much slower than the Central Processing Unit (CPU).
• Smaller but faster memory is inserted between them: the CPU cache (L1, L2, L3).
• The use of this memory speeds up the program:
Cache reuse
Cache hit/miss
Cache line
Cache blocking
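As a sketch of what cache blocking means in practice (illustrative C; the block size NB is a hypothetical value and would be tuned to the cache sizes of the target CPU):

#define NB 64   /* tunable block size, assumed here for illustration */

/* Blocked matrix-matrix multiply, C := C + A*B, all matrices n x n,
   row-major storage. The three outer loops walk over NB x NB tiles so
   that each tile stays in cache while it is being reused. */
void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += NB)
        for (int kk = 0; kk < n; kk += NB)
            for (int jj = 0; jj < n; jj += NB)
                for (int i = ii; i < ii + NB && i < n; i++)
                    for (int k = kk; k < kk + NB && k < n; k++)
                        for (int j = jj; j < jj + NB && j < n; j++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}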
A short introduction to memory
management III: Example
• 3x3 matrix A stored in row-major order
• 3x1 vector x
• Cache line: three units
• Load the first element of A and x
• Load the second element of A
• Load the third element of A
• Write back the result to main memory
• What would happen if the matrix were stored in column-major order?
From [4]
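A minimal sketch of the point above (illustrative C): with A in row-major order the inner loop of y = A*x walks through memory with stride 1, so every element of a loaded cache line is used; if A were stored in column-major order the same loop would jump n elements per iteration and cause many more cache misses.

/* y = A*x for an n x n matrix A stored in row-major order */
void matvec_rowmajor(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];   /* consecutive addresses: cache friendly */
        y[i] = sum;
    }
}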
Review of Available Tools
1. A (maybe not so necessary) short introduction
to memory management
2. Dealing efficiently with Linear Algebra
Operations
1. The BLAS library
2. The LAPACK library
3. Sequential to parallel conversion
4. Dealing efficiently with Linear Algebra
Operations on parallel computers
Dealing efficiently with Linear Algebra
Operations: The BLAS library I
• BLAS: Basic Linear Algebra Subroutines [5]
• First published in 1979 and developed to achieve high efficiency and clean computer programs
• The reference BLAS is written in FORTRAN 77 with a C interface (NETLIB)
• Several implementations in other programming languages: Java, C++, Python...
• Serves as a building block in many computer codes
• Most computer vendors optimize BLAS for specific architectures (AMD, Apple, IBM, Intel, SUN, ...)
• Existence of adaptive BLAS software: ATLAS
It automatically tunes BLAS for a specific computer
It is thread safe
• BLAS is divided into three levels
Commodore PET 2001
Dealing efficiently with Linear Algebra
Operations: The BLAS library II
• A conservative formula to estimate the time cost (assuming no overlap between computation and loading of data) is:
T = n_f t_f + n_m t_m = n_f t_f (1 + (n_m / n_f)(t_m / t_f))
• Where n_f is the number of floating-point operations, t_f the time per floating-point operation, n_m the number of memory references and t_m the average time per memory reference.
From [4]
Dealing efficiently with Linear Algebra
Operations: The BLAS library III
• Level 1 BLAS performs order n operations:
Scalar-vector multiplication
Vector addition
Inner (dot) product
Vector multiply
And the _axpy operation (y := a*x + y)
• For _axpy, the n_m/n_f ratio is:
n_f = 2n
n_m = 3n + 1
n_m/n_f ≈ 3/2
From [4]
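As a minimal sketch of a Level 1 call (illustrative only; the cblas.h header below is the one shipped with the Netlib/ATLAS C interface and the exact header name may differ between vendor BLAS implementations):

#include <cblas.h>

/* Level 1 BLAS: y := alpha*x + y, n elements, unit strides */
void axpy_example(int n, double alpha, const double *x, double *y)
{
    cblas_daxpy(n, alpha, x, 1, y, 1);
}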
Dealing efficiently with Linear Algebra
Operations: The BLAS library IV
• Level 2 BLAS performs order n^2 operations:
• It covers, in particular, the matrix-vector multiplication and accepts both orderings:
Row-major order
Column-major order
• The n_m/n_f ratio is:
n_f = 2n^2
n_m = n^2 + 3n
n_m/n_f ≈ 1/2
From [4]
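A comparable Level 2 sketch (assuming the same CBLAS interface as above; function and argument names follow the Netlib CBLAS reference):

#include <cblas.h>

/* Level 2 BLAS: y := alpha*A*x + beta*y, A is m x n in row-major order */
void gemv_example(int m, int n, const double *A, const double *x, double *y)
{
    cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n,
                1.0, A, n,    /* lda = n for row-major storage */
                x, 1,
                0.0, y, 1);
}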
Dealing efficiently with Linear Algebra
Operations: The BLAS library V
• Level 3 BLAS performs order n^3 operations:
• It covers, in particular, the matrix-matrix multiplication and accepts both orderings:
Row-major order
Column-major order
• The n_m/n_f ratio is:
n_f = 2n^3
n_m = 4n^2
n_m/n_f = 2/n, which tends to 0 for large n
From [4]
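And a Level 3 sketch under the same assumptions; this single call is where optimized BLAS implementations gain the most, because the low n_m/n_f ratio allows heavy cache reuse:

#include <cblas.h>

/* Level 3 BLAS: C := alpha*A*B + beta*C with A (m x k), B (k x n), C (m x n),
   all stored in row-major order */
void gemm_example(int m, int n, int k,
                  const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, A, k,   /* lda = k */
                B, n,        /* ldb = n */
                0.0, C, n);  /* ldc = n */
}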
Dealing efficiently with Linear Algebra
Operations: The BLAS library VI
AVAILABLE DATATYPES
• S REAL
• D DOUBLE PRECISION
• C COMPLEX
• Z COMPLEX*16 or DOUBLE COMPLEX
SUPPORTED MATRICES
• GE GEneral
• GB General Band
• SY SYmmetric
• SB Symmetric Band
• SP Symmetric Packed storage
• HE (complex) HErmitian
• HB (complex) Hermitian Band
• HP (complex) Hermitian Packed storage
• TR TRiangular
• TB Triangular Band
• TP Triangular Packed storage
From [5]
Dealing efficiently with Linear Algebra
Operations: The LAPACK library I
• LAPACK: Linear Algebra Package [6,7]
• First published in 1992 and developed to make the widely used EISPACK and LINPACK libraries run efficiently on shared-memory vector and parallel processors.
• The reference LAPACK was originally written in FORTRAN 77 (now Fortran 90) with a C interface (NETLIB)
• Several implementations in other programming languages: Java, C++, Python...
• Provides routines for:
Solving systems of simultaneous linear equations
Least-squares solutions of linear systems of equations
Eigenvalue problems
Singular value problems
Matrix factorizations (LU, Cholesky, QR, ...)
Matrix inversion
• LAPACK routines are written so that as much computation as possible is performed by calls to BLAS
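As a hedged sketch of how such a routine is typically called from C (using the LAPACKE interface distributed with the Netlib reference; header and type names may differ for vendor libraries):

#include <stdlib.h>
#include <lapacke.h>

/* Solve A*X = B (A is n x n, B holds nrhs right-hand sides, row-major).
   dgesv computes the LU factorization with partial pivoting and the solution;
   internally most of the work is done by Level 3 BLAS calls. */
int solve_example(lapack_int n, lapack_int nrhs, double *A, double *B)
{
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);   /* pivot indices */
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, nrhs,
                                    A, n, ipiv, B, nrhs);
    free(ipiv);
    return info;   /* 0 on success */
}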
Dealing efficiently with Linear Algebra
Operations: The LAPACK library II
AVAILABLE DATATYPES
• S REAL
• D DOUBLE PRECISION
• C COMPLEX
• Z COMPLEX*16 or DOUBLE COMPLEX
SUPPORTED MATRICES
• BD bidiagonal
• DI diagonal
• GB general band
• GE general (i.e., unsymmetric, in some cases rectangular)
• GG general matrices, generalized problem (i.e., a pair of general matrices)
• GT general tridiagonal
• HB (complex) Hermitian band
• HE (complex) Hermitian
• HG upper Hessenberg matrix, generalized problem (i.e., a Hessenberg and a triangular matrix)
• HP (complex) Hermitian, packed storage
• HS upper Hessenberg
• OP (real) orthogonal, packed storage
• OR (real) orthogonal
• PB symmetric or Hermitian positive definite band
• PO symmetric or Hermitian positive definite
• PP symmetric or Hermitian positive definite, packed storage
• PT symmetric or Hermitian positive definite tridiagonal
• SB (real) symmetric band
• SP symmetric, packed storage
• ST (real) symmetric tridiagonal
• SY symmetric
• TB triangular band
• TG triangular matrices, generalized problem (i.e., a pair of triangular matrices)
• TP triangular, packed storage
• TR triangular (or in some cases quasi-triangular)
• TZ trapezoidal
• UN (complex) unitary
• UP (complex) unitary, packed storage
From [6,7]
Review of Available Tools
1. A (maybe not so necessary) short introduction to
memory management
2. Dealing efficiently with Linear Algebra Operations
3. Sequential to parallel conversion
1. Introduction
2. Message Passing Interface (MPI)
3. Open MultiProcessing (OpenMP)
4. Hybrid Programming
4. Dealing efficiently with Linear Algebra Operations
on parallel computers
A warning: Amdahl's Law
• Equation proposed by Gene Amdahl in 1967 to calculate the speedup factor [8]
• It assumes that some fraction of the program, f, cannot be parallelized, and that the remaining fraction, 1 - f, is perfectly parallel
• Neglecting communication delays (latencies, memory, ...), the speedup on p processors is:
S(p) = 1 / (f + (1 - f)/p)
From [4]
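As a quick worked example with assumed numbers: if 5% of the program is inherently serial (f = 0.05) and p = 16 processors are used, then S(16) = 1 / (0.05 + 0.95/16) ≈ 9.1, and even with infinitely many processors the speedup can never exceed 1/f = 20.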
The Sequential to parallel
conversion: Introduction
• There are different libraries to adapt the algorithm to run in parallel:
Message Passing Interface (MPI)
POSIX Threads (Pthreads)
Open MultiProcessing (OpenMP)
Parallel Virtual Machine (PVM)
Intel Threading Building Blocks (TBB)
• ... and different languages:
CUDA
OpenCL
Ada
High Performance Fortran (HPF)
Unified Parallel C (UPC)
• To parallelize the algorithm MPI and OpenMP will be used.
(Pthreads will be used indirectly through ATLAS)
The Sequential to parallel
conversion: Introduction I
• The matrix A is distributed onto a 2x2 process grid with:
Number of processes = 4
Block size = 2
[Figure: 2D block-cyclic distribution of A over the 2x2 process grid]
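As a small illustrative helper (not a library routine) showing how the 2D block-cyclic layout assigns data: for a Pr x Pc process grid and square blocks of size nb, the owner of global entry (i, j) follows directly from the definition.

/* Process coordinates owning global element (i, j), 0-based indices,
   assuming the first block starts on process (0, 0). */
void block_cyclic_owner(int i, int j, int nb, int Pr, int Pc,
                        int *prow, int *pcol)
{
    *prow = (i / nb) % Pr;   /* block row, wrapped over the process rows    */
    *pcol = (j / nb) % Pc;   /* block column, wrapped over the process cols */
}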
The Sequential to parallel conversion:
Message Passing Interface (MPI) I
• MPI [9,10] is a library specification
for message-passing, that allows
computers to communicate with one
another
• MPI is a language independent
communications protocol. The
routines are callable from:
Fortran
C/C++
• MPI was designed for high
performance on both massively
parallel machines and on workstation
clusters.
• MPI is a specification, not a particular
implementation.
• GOAL: efficiency, portability and
functionality
MariCel Supercomputer (BSC)
The Sequential to parallel conversion:
Message Passing Interface (MPI) II
• It is a set of processes that are able to communicate with each other.
• Pros and cons:
+ Portable to distributed and shared memory machines
+ Scales beyond one node
+ Each process has its own local variables
+ Highly portable, with implementation-specific optimizations
- Difficult to develop and debug
- High latency, low bandwidth
- Explicit communication
- Difficult load balancing
- Requires more programming changes to go from the serial to the parallel version
Message passing model.
From [9]
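A minimal MPI sketch in C showing the explicit communication mentioned above (run with at least two processes, e.g. mpirun -np 2 ./a.out):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 3.14;
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* explicit send */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                              /* explicit receive */
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}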
The Sequential to parallel conversion:
Open Multiprocessing (OpenMP) I
• OpenMP is a shared-memory application programming interface (API) [11]
• It is an implementation of multithreading, where a master thread “forks” a specified number of slave threads
• Can run in a hybrid way with MPI.
• It is not a new programming language; it is a notation that can be added to:
Fortran
C and C++
• It describes how the work is to be shared among threads
• It is easy to convert a sequential algorithm to run in parallel
• Especially suitable for loops
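A minimal OpenMP sketch in C of the loop case (compile with the compiler's OpenMP flag, e.g. -fopenmp for GCC); the serial loop is unchanged apart from one directive:

/* y = A*x parallelized over rows; each thread handles a chunk of i values */
void matvec_omp(int n, const double *A, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}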
The Sequential to parallel conversion:
Open Multiprocessing (OpenMP) II
• Pros and cons:
+ Easy to implement parallelism
+ Easier to develop and debug than MPI
+ Data layout and decomposition is handled automatically
+ Implicit communication
+ Dynamic load balancing
+ Low latency, high bandwidth
- Scales only within one node
- Only on shared memory machines
- Possible data placement problems
- No specific thread order
- Mostly used for loop parallelization
- High chance of accidentally overwriting variables
Shared-memory model.
From [9]
The Sequential to parallel conversion:
Hybrid Programming
• Combining the best of two worlds [12]:
+ MPI is used across nodes and OpenMP within nodes
+ Avoids the extra communication overhead of MPI within a node
+ The program may be faster and scale better
- Lack of optimized OpenMP compilers
- All threads except one are idle during MPI communication
- Introduces an overhead when creating threads
Hybrid programming
From [9]
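A minimal hybrid sketch in C (assuming an MPI library with MPI-2 thread support; one MPI process per node, OpenMP threads inside the node):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* FUNNELED: only the master thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("MPI rank %d, OpenMP thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    /* Communication happens outside the parallel region, so the
       remaining threads are idle while MPI is working. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}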
Review of Available Tools
1. A (maybe not so necessary) short introduction to
memory management
2. Dealing efficiently with Linear Algebra Operations
3. Sequential to parallel conversion
4. Dealing efficiently with Linear Algebra Operations
on parallel computers
1. Introduction
2. BLACS
3. PBLAS
4. Example
Dealing efficiently with Linear Algebra Operations
on parallel computers: Introduction
• Some software is available:
BLACS: Basic Linear Algebra
Communication Subprograms
PBLAS: Parallel Basic Linear
Algebra Subprograms
ScaLAPACK: Scalable Linear
Algebra Package
PLAPACK: Parallel LAPACK
• All packages listed are available
for free at www.netlib.org
(PLAPACK at
http://www.cs.utexas.edu/~plapack/)
• To parallelize the algorithm only
the libraries from Netlib.org will
be used
Dealing efficiently with Linear Algebra Operations
on parallel computers: BLACS
• BLACS: Basic Linear Algebra Communication Subprograms [13]
• It aims at providing a portable,
linear algebra specific layer for
communication.
• It is written in C and provides
FORTRAN and C interfaces
• It includes synchronous
send/receive routines to:
Communicate a matrix or
submatrix from one process to
another
Broadcast submatrices
Compute global reductions (sums,
maxima and minima).
• It supports two types of
submatrices
General submatrices
Trapezoidal submatrices
(generalization of triangular)
• Datatypes available are:
Single precision
Double precision
Complex single precision
Complex double precision
Dealing efficiently with Linear Algebra Operations
on parallel computers: PBLAS
• PBLAS: Parallel Basic Linear
Algebra Subprograms [14]
• Is a collection of routines for performing linear algebra operations on distributed-memory computers.
• The fundamental building
blocks of PBLAS are
The sequential BLAS and
A set of communication
subprograms (BLACS)
• It is available for free at
www.netlib.org/scalapack
• It is written in C and provides
FORTRAN interface
• PBLAS operates on data
distributed using the 2D block
cyclic distribution
• Supported Matrices
GE General
SY Symmetric
HE Hermitian
TR Triangular
• Available Datatypes
Real, Double precision, Complex and Complex*16
Dealing efficiently with Linear Algebra Operations
on parallel computers: ScaLAPACK
• ScaLAPACK: Scalable Linear Algebra Package [15,16]
• Is a subset of LAPACK routines redesigned for distributed memory
MIMD parallel computers written in FORTRAN77
• It is portable on any computer that supports MPI or PVM.
• The fundamental building blocks of the ScaLAPACK library are
Distributed memory versions (PBLAS) of the Level 1, 2 and 3 BLAS,
and a set of Basic Linear Algebra Communication Subprograms (BLACS)
• All interprocessor communication occurs within the PBLAS and the
BLACS.
• The ScaLAPACK routines resemble their LAPACK equivalents. However:
No support for band and packed matrices
Some more advanced algorithms are missing (non-symmetric eigenvalue problems, ...)
Dealing efficiently with Linear Algebra Operations
on parallel computers: How it works
• To successfully use the libraries mentioned, four
steps are needed.
Initialize the process grid: define how many
processes will be used (BLACS)
Distribute the matrix on the process grid
Call the routine! (PBLAS, ScaLAPACK)
Release the process grid (BLACS)
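A skeleton of those four steps through the C interface of the Netlib BLACS (the Cblacs_* prototypes are declared here by hand since header names vary between installations; the matrix distribution and the actual PBLAS/ScaLAPACK call are only indicated by comments, because their descriptor arguments depend on the chosen block sizes):

#include <mpi.h>

/* C interface of the Netlib BLACS (declared manually for this sketch) */
extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int icontxt, int what, int *val);
extern void Cblacs_gridinit(int *icontxt, char *order, int nprow, int npcol);
extern void Cblacs_gridinfo(int icontxt, int *nprow, int *npcol,
                            int *myrow, int *mycol);
extern void Cblacs_gridexit(int icontxt);

int main(int argc, char **argv)
{
    int iam, nprocs, ictxt, nprow = 2, npcol = 2, myrow, mycol;

    MPI_Init(&argc, &argv);

    /* 1. Initialize the process grid */
    Cblacs_pinfo(&iam, &nprocs);
    Cblacs_get(0, 0, &ictxt);                    /* default system context */
    Cblacs_gridinit(&ictxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);

    /* 2. Distribute the matrix on the process grid: fill the local blocks
          owned by (myrow, mycol) according to the 2D block-cyclic layout */

    /* 3. Call the routine: a PBLAS or ScaLAPACK call (e.g. a parallel
          matrix-vector product) operating on the local arrays and their
          descriptors */

    /* 4. Release the process grid */
    Cblacs_gridexit(ictxt);
    MPI_Finalize();
    return 0;
}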
Thank you for your attention
References
[1] UHyDe: US Patent 546047
[2] Dorka, U. E., Hybrid experimental-numerical simulation of vibrating structures, Proceedings of the International Conference WAVE2002, Okayama, Japan, 2002.
[3] Dorka, U. E. and Füllekrug, U., Report of DFG project No. Do 360/7: Sub-PSD Tests, University of Kaiserslautern, Germany.
[3] Karniadakis, G. and Kirby, R. M., Parallel Scientific Computing in C++ and MPI, Cambridge University Press.
[4] Petersen, W. P. and Arbenz, P., Introduction to Parallel Computing: A Practical Guide with Examples in C, Oxford Texts in Applied and Engineering Mathematics.
[5] Official BLAS documentation and routines, http://www.netlib.org/blas/
[6] Official LAPACK documentation and routines, http://www.netlib.org/lapack/
[7] Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A. and Sorensen, D., LAPACK Users' Guide, Third Edition, 1999, Society for Industrial and Applied Mathematics.
[8] Amdahl, G., The validity of the single processor approach to achieving large scale computing capabilities, in AFIPS Conf. Proc., vol. 30, pp. 483-485, 1967.
[9] Gropp, W., Lusk, E. and Skjellum, A., Using MPI: Portable Parallel Programming with the Message-Passing Interface, The MIT Press.
[10] Gropp, W., Lusk, E. and Thakur, R., Using MPI-2: Advanced Features of the Message-Passing Interface, The MIT Press.
[11] Chapman, B., Jost, G. and van der Pas, R., Using OpenMP: Portable Shared Memory Parallel Programming, The MIT Press.
[12] He, Y. and Ding, C., Hybrid OpenMP and MPI Programming and Tuning, Lawrence Berkeley National Laboratory.
[13] Official BLACS documentation and routines, http://www.netlib.org/blacs/
[14] Official PBLAS documentation and routines, http://www.netlib.org/scalapack/pblas_qref.html
[15] Official ScaLAPACK documentation and routines, http://www.netlib.org/scalapack/
[16] Blackford, L. S., Choi, J., Cleary, A., D'Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D. and Whaley, R. C., ScaLAPACK Users' Guide, 1997, Society for Industrial and Applied Mathematics.