High Performance Computing
with MATLAB
Kadin Tseng
Scientific Computing and Visualization Group
Boston University
CNS, March 7, 2008
Outline
• Performance Issues
– Memory Access
– Vectorization
– Compiler
– Other Considerations
• Parallel MATLAB
Memory Access
Memory access patterns often affect computational performance. Here are some effective ways to enhance performance:
• Allocate array memory before using it
• Order for-loops to match the array's memory layout
• Compute and save arrays in place wherever possible
Allocate Array
• Allocate array memory before using it.
MATLAB is designed primarily as an interactive, user-friendly environment, so no pre-allocation of memory is required. Often, however, array sizes are known a priori. Pre-allocating an array ensures that all of its elements are allocated in one single, contiguous block right from the start.
Without pre-allocation (wallclock time = 0.0153 seconds):
n=5000;
x(1) = 1;
for i=2:n
    x(i) = 2*x(i-1);
end
With pre-allocation (wallclock time = 0.0002 seconds):
n=5000; x = ones(n,1);
x(1) = 1;
for i=2:n
    x(i) = 2*x(i-1);
end
The timing data are recorded on Katana. The actual times can vary significantly
depending on the processor.
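A comparison of this kind can be reproduced with tic and toc. The following is only a minimal sketch, not the script used for the Katana timings: the variable names x_grow and x_pre are made up for illustration, zeros is used for the pre-allocation, and the recurrence is changed so that the values stay finite.

```matlab
n = 5000;

% Version 1: the array grows by one element on every iteration.
clear x_grow
tic
x_grow(1) = 1;
for i = 2:n
    x_grow(i) = x_grow(i-1) + 2;
end
t_grow = toc;

% Version 2: the full array is allocated once, before the loop.
tic
x_pre = zeros(n, 1);
x_pre(1) = 1;
for i = 2:n
    x_pre(i) = x_pre(i-1) + 2;
end
t_pre = toc;

fprintf('growing: %.4f s, pre-allocated: %.4f s\n', t_grow, t_pre);
```

Both versions compute the same values; only the allocation strategy differs. The measured gap depends on the MATLAB version as well as the processor.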
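For-loop Ordering
• For best performance, the innermost for-loop should run over the leftmost array index, the next loop out over the next index, and so on.
• MATLAB stores arrays in column-major order: for a multi-dimensional array x(i,j), the 1D representation of the same array, x(k), runs down the first column, then the second, and so on, so consecutive values of the leftmost index are contiguous in memory.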
Inner loop over columns (wallclock time = 0.88 seconds):
n=5000; x = zeros(n);
for i=1:n         % rows
    for j=1:n     % columns
        x(i,j) = i+(j-1)*n;
    end
end
Inner loop over rows (wallclock time = 0.48 seconds):
n=5000; x = zeros(n);
for j=1:n         % columns
    for i=1:n     % rows
        x(i,j) = i+(j-1)*n;
    end
end
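The relation between x(i,j) and its 1D representation x(k) can be checked directly. This is a small sketch on a 4 x 4 array (the size is chosen only to keep the output readable); it fills the array with the same expression as above, so each element ends up holding its own linear index.

```matlab
n = 4;
x = zeros(n);
for j = 1:n          % columns (outer loop)
    for i = 1:n      % rows (inner loop, leftmost index)
        x(i,j) = i + (j-1)*n;
    end
end

% Each element now holds its own linear (1D) index, so x(k) == k.
k = sub2ind(size(x), 3, 2);    % linear index of row 3, column 2
disp(k)                        % 3 + (2-1)*4 = 7
disp(isequal(x(:)', 1:n^2))    % displays 1 (true)
```

Because k = i + (j-1)*n increases by one whenever i does, the inner loop over i walks through memory with unit stride.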
Compute In-place
• Computing and saving an array in place improves performance.
Result stored in a new array (wallclock time = 1.23 seconds):
x = randn(10000);
tic
y = x.^2;
toc
Result stored in place (wallclock time = 0.49 seconds):
x = randn(10000);
tic
x = x.^2;
toc
Other Considerations
• Use a function m-file instead of a script m-file whenever reasonable.
– A script m-file is loaded into memory and evaluated one line at a time. Subsequent uses require reloading.
– A function m-file is compiled into pseudo-code and is loaded once. Subsequent uses of the function are faster because no reloading is needed.
• Avoid using virtual memory. Physical memory is much faster.
• Avoid passing large matrices to a function and modifying only a handful of elements, since modifying an argument inside the function forces a copy of the whole matrix.
• Use the MATLAB profiler (profile) to identify “hot spots” for performance enhancement.
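A typical profiling session brackets the code of interest with profile on and profile off. The sketch below uses an arbitrary workload (calls to magic and kron, chosen only because they are ordinary m-file functions) as a stand-in for real application code.

```matlab
profile on                      % start collecting timing data
for k = 1:20
    M = magic(101);             % placeholder workload
    K = kron(M(1:10,1:10), eye(5));
end
profile off                     % stop collecting
stats = profile('info');        % structure holding the collected statistics
fprintf('%d functions recorded\n', numel(stats.FunctionTable));
% profile viewer                % opens the interactive report (needs a display)
```

The interactive report lists time spent per function and per line, which is usually the quickest way to find the loop or call that is worth vectorizing or pre-allocating.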
Vectorization
• For-loops in MATLAB can, in general, be expensive, especially if the loop count is large or the loops are nested.
• Without array allocation, for-loops are very costly.
• From a performance standpoint, a compact vector representation should generally be used in place of for-loops. Here is an example.
For-loop version (wallclock time = 0.0045 seconds):
i = 0;
for t = 0:.01:10
    i = i + 1;
    y(i) = sin(t);
end
Vectorized version (wallclock time = 0.0005 seconds):
t = 0:.01:10;
y = sin(t);
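Loops that contain an if test can often be vectorized as well, using a logical mask. The following sketch is an illustration added here, with made-up data in v, and is not one of the timed examples.

```matlab
v = [3 -1 4 -1 5 -9 2 6];

% Loop version: accumulate the positive entries one at a time.
s_loop = 0;
for k = 1:numel(v)
    if v(k) > 0
        s_loop = s_loop + v(k);
    end
end

% Vectorized version: a logical mask selects the same entries at once.
s_vec = sum(v(v > 0));

disp([s_loop s_vec])   % both are 20
```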
Compiler
A MATLAB compiler, mcc, is available.
• It compiles m-files into C code, object libraries, or stand-alone executables.
• A stand-alone executable generated with mcc can run on compatible platforms without an installed MATLAB or a MATLAB license.
• Many MATLAB general and toolbox licenses are available at BU. Occasionally, MATLAB access may be denied if all licenses are checked out. Running a stand-alone executable requires no license and no waiting.
• Some compiled codes may run more efficiently than m-files because they are not run in interpretive mode.
• A stand-alone executable lets you share your program without revealing the source.
http://scv.bu.edu/documentation/tutorials/MATLAB/compiler/
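As a rough sketch of the workflow, a function m-file can be built into a stand-alone executable with the -m option of mcc. The function name sum_squares and its contents are hypothetical, chosen only for illustration. One detail worth planning for is that a stand-alone executable receives its command-line arguments as text, so numeric inputs need to be converted.

```matlab
function total = sum_squares(n)
% SUM_SQUARES  Sum of squares 1^2 + ... + n^2 (hypothetical example).
% Build as a stand-alone executable from the MATLAB prompt with:
%     mcc -m sum_squares.m
if ischar(n)               % stand-alone executables receive arguments as text
    n = str2double(n);
end
total = sum((1:n).^2);
fprintf('%d\n', total);
end
```

Inside MATLAB, sum_squares(10) and sum_squares('10') give the same result, which makes it easy to test the function before compiling it. Running the compiled executable typically also needs the MATLAB Compiler Runtime libraries on the target machine; the tutorial linked above covers the local setup.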
Is Parallel MATLAB the way to go?
• Even in the best case, it can’t compete with C/Fortran with MPI/OpenMP.
• It is an acceptable compromise if
– Converting your MATLAB code to C/Fortran requires too much effort and you don’t have the time or inclination to do it.
– A “big” job typically takes hours, rather than days, to run on a single processor.
– You strongly prefer the relative ease and efficiency of programming a research code in MATLAB.
– The appropriate multiprocessing MATLAB paradigm is at your disposal.
Multiprocessing MATLAB
1 MatlabMPI
2 pMatlab
3 SCV’s parallel MATLAB
4 Distributed Computing Toolbox
5 Star-P
1 MatlabMPI
MatlabMPI is a parallel MATLAB package developed at Lincoln Lab in Lexington, MA.
• It does not require or make use of a high-speed interconnect for communication among cluster nodes. Instead, it relies on the network file system being visible to, or shared by, all processors. Message passing is then achieved through I/O to the file system.
• It has a small, basic set of utility routines that mimic the functionality of those in the Message Passing Interface (MPI). While MPI routines send and receive messages over a high-speed interconnect, the routines in this package accomplish the same tasks via file I/O.
• It is good for “embarrassingly parallel” codes that require only infrequent communication.
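The idea of passing messages through a shared file system can be illustrated in a few lines. The sketch below is not MatlabMPI's API or implementation; the file-naming scheme, the msg_file helper, and the use of a temporary directory in place of a shared network directory are all assumptions made for illustration, and both the "send" and the "receive" run in a single MATLAB session.

```matlab
% Stand-in for a directory on the shared network file system.
comm_dir = tempname;
mkdir(comm_dir);

% Hypothetical naming scheme: one file per (source, destination, tag).
msg_file = @(src, dest, tag) fullfile(comm_dir, ...
    sprintf('msg_%d_to_%d_tag_%d.mat', src, dest, tag));

% "Send" from rank 0 to rank 1: write the data to a file.
data = magic(4);
fname = msg_file(0, 1, 7);
save(fname, 'data');

% "Receive" on rank 1: wait until the file exists, then read it.
while ~exist(fname, 'file')
    pause(0.01);                % poll the file system
end
received = load(fname);
disp(isequal(received.data, magic(4)))

rmdir(comm_dir, 's');           % clean up
```

Every message costs at least one file write, one file read, and some polling, which is why this approach suits codes that communicate rarely. A real package also has to guard against reading a file that is only partially written, which this sketch does not attempt.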
2 pMatlab
pMatlab is a parallel MATLAB package also developed at Lincoln Lab in Lexington, MA. It is built on top of MatlabMPI.
• As such, it inherits all the properties of MatlabMPI. It can be thought of as providing higher-level wrapper functions that insulate programmers from the lower-level function calls needed to perform parallel tasks.
• It is good for embarrassingly parallel algorithms with a very modest amount of communication.
3 SCV’s parallel MATLAB
SCV has a very simple parallel MATLAB package that, like MatlabMPI, is based on the shared network file system concept.
• It is subject to most of the same restrictions as MatlabMPI. However, there are two departures:
1. There is only one batch script, plus two function m-files to be inserted into your code.
2. These include a barrier function to synchronize work performed on multiprocessing nodes. This is typically required for codes that contain serial and parallel sections.
• It is good for embarrassingly parallel algorithms with a very modest amount of communication.
• Email or call Kadin if you want to use any of the above three packages. An example is given next.
SCV parallel MATLAB – Example 1
% This example demonstrates the use of multiprocessors to compute C = A + B (matrix size is N x N)
% Decomposition along columns; can also be decomposed along rows, or both.
% C(:, range(rank)) = A(:, range(rank)) + B(:, range(rank))
% In the above, range(rank) is the range of columns as a function of the processor rank
% range(rank) = rank*n+1:rank*n+n (0<=rank<=nproc-1; n=N/nproc)
% For simplicity, N is assumed to be divisible by nproc
N = 8;                                          % size of global matrix A
I = (1:N)';                                     % generate column vector
A = I(:, ones(1,N))*10 + I(:, ones(1,N))';      % generate A on current (and all) processes
[pbegin, pend, rank, nproc] = parallel_info(N); % query for parallel info
% rank (0<=rank<=nproc-1) is the current MATLAB process
n = N/nproc;                                    % distributed column size of matrix B
b = I(:, ones(1,n))*10;                         % generate N x n matrix b (local B)
c = A(:, pbegin:pend) + b                       % compute local c from A and local b
save matrix_c;                                  % each current dir has own individual copy of c
SCV parallel MATLAB Example 1 (cont’d)
% Run barrier to synchronize all processors
ierr = barrier(rank, nproc);
% Finally, perform (serial) gather on c of all ranks into C on 0
if (rank == 0)
    C = zeros(N);       % allocate C
    C(:,1:n) = c;       % starts with c from rank 0 which is already in memory
    for k=1:nproc-1
        i = n*k+1;      % beginning location to which c will be inserted
        j = n*k+n;      % end location
        fk = ['../' num2str(k) '/matrix_c']; % file name of c on process k
        load(fk, 'c');
        C(:,i:j) = c;
    end
    save('../matrixC', 'C'); % save C to parent dir
end
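The column decomposition in this example can be checked without a cluster by letting an ordinary loop play the role of the MATLAB processes. The sketch below is a serial emulation only: parallel_info and barrier belong to the SCV package and are not reproduced, nproc is set by hand, and B is the global matrix whose local slices correspond to b in the example.

```matlab
N = 8;  nproc = 4;                  % nproc is chosen here; N must be divisible by nproc
n = N / nproc;                      % number of columns per process
I = (1:N)';
A = I(:, ones(1,N))*10 + I(:, ones(1,N))';
B = I(:, ones(1,N))*10;             % global B; each process holds an N x n slice

C = zeros(N);
for rank = 0:nproc-1                % each pass plays the role of one MATLAB process
    cols = rank*n+1 : rank*n+n;     % range(rank) from the example's comments
    c = A(:, cols) + B(:, cols);    % local work
    C(:, cols) = c;                 % gather step
end
disp(isequal(C, A + B))
```

Because each pass touches only its own block of columns, the passes are independent, which is what allows the real package to run them as separate MATLAB processes.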
… parallel MATLAB Example 1 – batch script
#!/bin/csh
# Example SGE script for running parallel MATLAB jobs on Katana
# Submit job with the command: qsub batch_sge.scv
# "#$ qsub_option" is interpreted by qsub as if "qsub_option" was passed to qsub on commandline.
# Set hard runtime (wallclock) limit, default is 2 hours. Format: -l h_rt=HH:MM:SS
#$ -l h_rt=2:00:00
# Merge stderr into the stdout file to reduce clutter.
#$ -j y
# Invoke Parallel Environment for N processors. No default value, it must be specified.
# For MATLAB apps, DO NOT select omp
#$ -pe 1_per_node 4
# end of qsub options
# By default, the script is executed in the directory from which it was submitted
# with qsub. You might want to change directories before invoking mpirun ...
cd $PWD
# running the following script generates multiple concurrent copies of MATLAB
# Use addpath in startup.m to add path to all necessary matlab m-files
# batch_sge and sge_matlab should live in either $HOME/bin or $PWD
sge_matlab $PWD scv_matlab_example.m
SCV parallel MATLAB Example 2
The airplane is represented with patches of quadrilateral elements and
the integral formulation is discretized to yield
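$$\sum_{j=1}^{N_e} A_{ij}\,\phi_j = \sum_{j=1}^{N_e} B_{ij}\,\psi_j ; \quad i = 1, \ldots, N_e$$
A_{ij} is formed from I_{ij} and C_{ij} [the sign of the C_{ij} term is illegible in this transcript]
$$I_{ij} = 1 \ \text{for} \ i = j; \quad I_{ij} = 0 \ \text{for} \ i \neq j$$
$$B_{ij} = \int_{\Omega_j} \frac{1}{2\pi r}\, d\Omega ; \quad C_{ij} = \int_{\Omega_j} \frac{\partial}{\partial n}\left(\frac{1}{2\pi r}\right) d\Omega$$
$$r = \left[ (x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2 \right]^{1/2}$$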
ψ is the known Neumann boundary condition.
φ is the unknown to be solved for.
… parallel MATLAB Example 2 – Geometry
… parallel MATLAB Example 2 – timings
How slow is MATLAB compared with C?
4 Distributed Computing Toolbox
The MathWorks offers the Distributed Computing Toolbox (DCT), a parallel MATLAB package that uses the cluster’s high-speed interconnect for inter-processor communication.
• At present, DCT is not available on SCV machines.
5 Star-P
Star-P is a parallel MATLAB product of Interactive Supercomputing, Inc. It bears some resemblance to the pMatlab package in that it enables parallel MATLAB while shielding programmers from most of the lower-level parallel programming.
• Like the MathWorks’ DCT, Star-P is a parallel MATLAB package that uses a high-speed interconnect for inter-processor communication.
• At present, this package is not available on SCV machines.
Useful SCV Info
• SCV home page (http://scv.bu.edu/)
• Resource Applications (https://acct.bu.edu/SCF)
• Help
– Web-based tutorials (http://scv.bu.edu/)
(MPI, OpenMP, MATLAB, IDL, Graphics tools)
– HPC consultations by appointment
• Kadin Tseng ([email protected])
• Doug Sondak ([email protected])
– [email protected], [email protected]