Performance of a parallel split operator method for the time dependent Schrödinger equation

Thierry Matthey^a,∗ and Tor Sørevik^b,†

^a Parallab, UNIFOB, Bergen, Norway (∗ http://www.ii.uib.no/~matthey)
^b Dept. of Informatics, University of Bergen, Norway († http://www.ii.uib.no/~tors)

In this paper we report on the parallelization of a split-step algorithm for the Schrödinger equation. The problem is represented in spherical coordinates in physical space and transformed to Fourier space for application of the Laplacian operator, and to Legendre space for application of the angular momentum operator and the potential operator. Timing results are reported and analyzed for three different platforms.
1. Introduction
For large scale computations, efficient implementation of state-of-the-art algorithms on high-end hardware is an absolute necessity in order to solve the problem in reasonable time. Current high-end HPC systems are parallel systems built from RISC processors. Efficient implementation therefore requires scalable parallel code, to utilize a large number of processors, and cache-aware numerical kernels, to circumvent the memory bottleneck of today's RISC processors.
In this paper we report on our experiences with optimizing a split-step operator algorithm for solving the time dependent Schrödinger equation. A vast number of quantum mechanical problems require the solution of this equation, such as femto- and attosecond laser physics [2], quantum optics [9], atomic collisions [1] and cold matter physics [8], to name a few. In all cases this is a time consuming task, and in many cases it is out of reach on today's systems. Many different numerical discretization schemes have been introduced for the solution of this equation. In our view the most promising candidate for returning reliable solutions within reasonable time is the split-step operator technique combined with spectral approximation in space.
The resulting algorithm is briefly outlined in Section 2, where we also explain how it is parallelized. In Section 3 we describe the efficient sequential implementation of the numerical kernels, and report timings for different versions of matrix multiply where one matrix is real while the other is complex.
The core of the paper is Section 4, where we report parallel speed-up on three different platforms and discuss our problems and successes.
2. The Algorithm and its parallelization
The time dependent Schrödinger equation can be written as

i ∂Ψ(x, t)/∂t = HΨ(x, t),   (1)

where the Hamiltonian operator, H, consists of the Laplacian plus the potential operator V(x, t):

HΨ(x, t) = ∆Ψ(x, t) + V(x, t)Ψ(x, t).   (2)
The problems we are targeting are best described in spherical coordinates. Transforming to spherical coordinates and introducing the reduced wave function Φ = rΨ gives the following form of the Hamiltonian
H = − ∂²/(2∂r²) + L²/(2r²) + V(r, θ, φ, t),   (3)
where L² is the angular momentum operator. To highlight the basic idea we make the simplifying assumption that V is time-independent as well as independent of φ. The same basic algorithm and the same parallelization strategy hold without these assumptions, but the details become more involved.
For H being a time independent linear operator the formal solution to (1) becomes

Φ(r, θ, φ, t_{n+1}) = e^{∆tH} Φ(r, θ, φ, t_n),   (4)

where ∆t = t_{n+1} − t_n. Splitting H into H = A + B we get³

Φ(r, θ, φ, t_{n+1}) = e^{∆t(A+B)} Φ(r, θ, φ, t_n).   (5)

If we assumed that A and B commute we could write (5) as

Φ(r, θ, φ, t_{n+1}) = e^{∆tA} e^{∆tB} Φ(r, θ, φ, t_n),   (6)

which would allow us to apply the operators separately and greatly simplify the computation. Unfortunately they do not commute, and in that case we will have a splitting error. The straightforward splitting in (6) leads to a local error of O(∆t²). This can be reduced to O(∆t³) locally if a Strang [10] splitting is applied:

Φ(r, θ, φ, t_{n+1}) = e^{(∆t/2)A} e^{∆tB} e^{(∆t/2)A} Φ(r, θ, φ, t_n).   (7)
This introduction of "symmetry" eliminates a term in the error expansion and increases the order of the method. More elaborate splitting schemes, involving more terms, also exist.
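As an illustration of the difference in order between the splittings (6) and (7), the following small Python sketch (not part of the original work; the matrices and step sizes are arbitrary) compares both splittings with the exact matrix exponential for two non-commuting matrices:

import numpy as np
from scipy.linalg import expm

# Two small random, non-commuting matrices standing in for A and B.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((6, 6))

for dt in (0.1, 0.05, 0.025):
    exact  = expm(dt * (A + B))
    lie    = expm(dt * A) @ expm(dt * B)                         # splitting (6)
    strang = expm(dt / 2 * A) @ expm(dt * B) @ expm(dt / 2 * A)  # splitting (7)
    print(f"dt={dt:5.3f}  error (6): {np.linalg.norm(lie - exact):.2e}"
          f"  error (7): {np.linalg.norm(strang - exact):.2e}")

# Halving dt reduces the error of (6) by roughly a factor 4 (local error O(dt^2))
# and the error of (7) by roughly a factor 8 (local error O(dt^3)).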
When the Strang splitting is applied repeatedly in a time loop, we notice that, except at the start and the end, two operations with e^{(∆t/2)A} follow each other. These can of course be replaced by one operation with e^{∆tA}. Hence, provided some care is taken with start-up and clean-up, the Strang splitting can be implemented without extra cost. Combining the three ingredients, split-operator technique, spherical coordinates and spectral approximation in space, was first suggested by Hermann and Fleck [7].
The reason why the split-operator technique is so extremely attractive in our case is that the individual operators, the Laplacian and the angular momentum operator, have well known eigenfunctions: the Fourier functions, e^{ikr}, and the spherical harmonics, Ψ_{lm}(θ, φ), respectively. Thus expanding Φ in these eigenfunctions makes the computation not only efficient and simple, but exact as well.
The split-step algorithm is outlined in Algorithm 1.
Algorithm 1 (The split-step algorithm)

  /* initialization */
  for n = 0, nsteps-1
     F̂ ← FFT(F_n)
     F̂ ← scale with eigenvalues of A
     F_{n+1/2} ← IFFT(F̂)
     F̃ ← LT(F_{n+1/2})
     F̃ ← scale with eigenvalues of B
     F_{n+1} ← ILT(F̃)
  end for
F, F̂ and F̃ are all matrices of size n_r × n_z, the discrete representations of Φ in coordinate space, Fourier space and Legendre space, respectively. For data living in the right space, time propagation reduces to simple scaling by the appropriate eigenvalues and the step size, and is consequently fast as well as trivially parallelized. The computationally demanding parts are the transforms. Each transform is a global operation on the vector in question, and therefore not easily parallelized. But with multiple vectors in each direction, there is a simple outer level of parallelism. We can simply parallelize the transforms by assigning n_z/N_p transforms to each processor in the radial direction, or n_r/N_p in the angular direction, N_p being the number of processors. With this parallelization strategy the coefficient matrix needs to be distributed column-wise for the Fourier transform and row-wise for the Legendre transform. Consequently, between a Fourier transform and a Legendre transform, we need to redistribute the data from "row-splitting" to "column-splitting" or vice versa.
³ For simplicity we here split the Hamiltonian into only two operators. For more operators we can apply the formalism recursively, which is done in our implementation.
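To make Algorithm 1 concrete, here is a minimal serial sketch in Python/NumPy (illustrative only; the eigenvalue arrays, the Legendre matrix P and the use of a plain matrix inverse for ILT are stand-ins, and the actual code is written in Fortran 90):

import numpy as np

def split_step(F, eig_A, eig_B, P, dt, nsteps):
    # F     : nr x nz complex array, discrete Φ in coordinate space
    # eig_A : nr x nz eigenvalues of A (radial part), given in Fourier space
    # eig_B : nr x nz eigenvalues of B (L^2/(2r^2) + V), given in Legendre space
    # P     : nz x nz real Legendre transform matrix (LT); ILT modelled by its inverse
    Pinv = np.linalg.inv(P)
    for _ in range(nsteps):
        Fhat = np.fft.fft(F, axis=0)        # FFT of every column (radial direction)
        Fhat *= np.exp(-1j * dt * eig_A)    # "scale with eigenvalues of A"
        F = np.fft.ifft(Fhat, axis=0)       # back to coordinate space
        Ftil = F @ P.T                      # LT of every row (angular direction)
        Ftil *= np.exp(-1j * dt * eig_B)    # "scale with eigenvalues of B"
        F = Ftil @ Pinv.T                   # ILT back to coordinate space
    return F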
Figure 1. Color coding shows the distribution of data to processors in the left and right part of the figure. The
middle part indicates which blocks might be sent simultaneously (those with the same color).
Our code is parallelized for distributed memory systems, using MPI.
As seen in Figure 1, the redistribution requires that each processor gathers N_p − 1 blocks of data of size n_r n_z /N_p² from the other N_p − 1 processors. Thus N_p − 1 communication steps are needed for each of the N_p processors. These can, however, be executed as N_p − 1 parallel steps where all the processors at each step send to and receive from different processors. This is easily achieved by letting each processor p send block (i + p) mod N_p to the corresponding processor, p = 0, 1, ..., N_p − 1, at step i.
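The following few lines of Python (illustrative, with a made-up processor count) verify that this schedule pairs every processor with a distinct partner at every step, so no processor is idle:

Np = 4                                        # number of processors (illustrative)
for i in range(1, Np):                        # block i = 0 is the diagonal block and stays local
    dest = [(i + p) % Np for p in range(Np)]  # processor p sends block (i + p) mod Np to that processor
    src  = [(p - i) % Np for p in range(Np)]  # and receives the matching block from (p - i) mod Np
    assert sorted(dest) == list(range(Np))    # each step is a perfect pairing of senders and receivers
    print(f"step {i}: sends to {dest}, receives from {src}")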
We have implemented this algorithm using point-to-point sends and receives with packing/unpacking of the blocks of the column- (row-) distributed data. In this case each block is sent as one item. This is implemented in a separate subroutine where non-blocking MPI routines are used to minimize synchronization overhead. All that is needed to move from the sequential version of the code to the parallel one is to insert the redistribution routine after returning to coordinate space and before transforming to the new spectral space.
The communication described above corresponds to the collective communication routine MPI_ALLTOALL, provided each matrix block can be treated as a single element. This is accomplished using MPI derived datatypes [6]. This approach is also implemented.
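Our implementation is in Fortran 90; purely as a sketch of the same redistribution, the fragment below uses Python with mpi4py and NumPy, packing the column-distributed local data into N_p equal blocks, exchanging them with MPI_ALLTOALL, and unpacking them into the row distribution (all names and sizes here are illustrative):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
Np, p = comm.Get_size(), comm.Get_rank()

nr, nz = 4096, 64                  # assumed multiples of Np
nr_loc, nz_loc = nr // Np, nz // Np

# Column-distributed data: this processor owns nz_loc columns of the nr x nz matrix.
cols = np.full((nr, nz_loc), p, dtype=np.complex128)

# Pack: block j contains the rows destined for processor j.
sendbuf = np.ascontiguousarray(cols.reshape(Np, nr_loc, nz_loc))
recvbuf = np.empty_like(sendbuf)

comm.Alltoall(sendbuf, recvbuf)    # one block to and from every processor

# Unpack: block j, received from processor j, holds its nz_loc columns of our rows.
rows = recvbuf.transpose(1, 0, 2).reshape(nr_loc, nz)

This corresponds to the derived-datatype variant; the hand-coded point-to-point version replaces the single all-to-all call with N_p − 1 paired sends and receives following the schedule above.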
The algorithm is optimal in the sense that it sends the minimum amount of data and keeps all the processors busy all the time. What is beyond our control is the optimal use of the underlying network. A well tuned vendor implementation of MPI_ALLTOALL might outpace our hand-coded point-to-point version by taking this into account.
In the parallel version the n_z × n_z coefficient matrix for the Legendre transform is replicated across all the participating processors.
IO is handled by one processor only, and the appropriate broadcast and gather operations are used to distribute data to and gather data from all processors when needed. All remaining computational work, not covered here, is point-wise and can be carried out locally, regardless of whether the data are row- or column-split.
For n_r and n_z being multiples of N_p the load is evenly distributed, and any sublinear parallel speed-up can be attributed to communication overhead.
3. Sequential Optimization
The time consuming parts of our computation are the forward and backward Fourier transforms and the forward and backward Legendre transforms. For the Fourier transform we of course use the FFT algorithm. We prefer to use vendor-implemented FFT routines whenever they are available, but find it very inconvenient that no standard interface is defined for these kernel routines, as there is for BLAS [3]. This makes porting to new platforms more laborious than necessary. One possibility is to use the portable and self-tuning FFTW library [4,5], and we have used this for the SGI Altix system and our IBM Pentium III cluster.
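As a small illustration of the batched transforms needed here (using NumPy's FFT purely as a stand-in for a vendor FFT library or FFTW), a single call can transform all the columns of the coefficient matrix at once:

import numpy as np

nr, nz = 2048, 32                  # illustrative sizes
F = (np.random.standard_normal((nr, nz))
     + 1j * np.random.standard_normal((nr, nz)))

Fhat = np.fft.fft(F, axis=0)       # one batched call: an FFT of every column
assert np.allclose(np.fft.ifft(Fhat, axis=0), F)   # round trip back to coordinate space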
The discrete Legendre transform is formulated as a matrix-vector product. When the transform is applied to multiple vectors these can be arranged in a matrix, and we get a matrix-matrix product. For this purpose we use BLAS. A minor problem is that the transform matrix is a real matrix while the coefficient vectors are complex. Thus we are faced with multiplying a real matrix with a complex matrix. The BLAS standard [3], however, requires both matrices to be of the same type. There are two solutions to this incompatibility of datatypes. We can either split the complex matrix, B, into a real and an imaginary part and do

A = CB = C(X + iY) = CX + iCY,   (8)

which requires two calls to DGEMM, or we can cast the real transform matrix C to complex and do

A = CB = (complex)(C) · B   (9)

and make one call to ZGEMM.
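The two strategies can be sketched as follows (Python/NumPy with SciPy's BLAS wrappers standing in for direct DGEMM/ZGEMM calls; sizes are illustrative):

import numpy as np
from scipy.linalg.blas import dgemm, zgemm

nz, nr = 64, 2048
C = np.random.standard_normal((nz, nz))             # real Legendre transform matrix
B = (np.random.standard_normal((nz, nr))
     + 1j * np.random.standard_normal((nz, nr)))    # complex coefficient matrix

# Strategy (8): split B = X + iY and make two real GEMM calls.
A_split = dgemm(1.0, C, B.real) + 1j * dgemm(1.0, C, B.imag)

# Strategy (9): cast C to complex once, before the time loop, and make one ZGEMM call.
Cz = C.astype(np.complex128)
A_cast = zgemm(1.0, Cz, B)

assert np.allclose(A_split, A_cast)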
Note that C (the Legendre transformation matrix) is constant throughout the computation, while B is new for each transform. Thus the splitting of B into real and imaginary parts, as well as the merging of X and Y into a complex A, has to be done at each step, while the casting of C from real to complex is done once and for all before we start the time marching loop. Our timings on the IBM p690 Regatta system show a small advantage for the second strategy on small matrices, while working with real matrices appears to be slightly better for larger matrices. This probably reflects the fact that the arithmetic is less in the first case (8). However, this case is also more memory consuming.

Figure 2. Computational time in seconds for the 3 versions of the matrix multiply (triple do-loop, complex matrices, real matrices) as a function of n_r; n_z = 32.

Figure 3. Computational time in seconds for the 3 versions of the matrix multiply as a function of n_z; n_r = 2048.
4. Experimental results
Our problem is computationally demanding because many time steps are needed. Each time step typically takes 0.1-1.0 seconds on a single CPU. The outer loop over time steps is (as always) inherently sequential. Thus the parallelization has to take place within a time step. All arithmetic is embarrassingly parallel, provided the data is correctly distributed. To achieve this, two global matrix "transposes" are needed in each step. The amount of data to be sent is n_r n_z, while the amount of computation is O(n_r n_z (log n_r + n_z)). We consider this to be medium grained parallelism, with a communication to computation ratio which should scale well up to a moderate number of CPUs (20-30) for typical values of n_r (1000-10000) and n_z (10-100).
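A back-of-the-envelope version of this estimate (illustrative only; the constants hidden in the O(·) expressions are ignored) is:

import math

for nr, nz in [(1000, 10), (4096, 64), (10000, 100)]:
    data = nr * nz                            # words moved per redistribution
    work = nr * nz * (math.log2(nr) + nz)     # ~ arithmetic per time step
    print(f"nr={nr:6d} nz={nz:4d}: work/data ~ {work / data:.0f}")

# The work-to-data ratio grows with nz and log(nr), which is why communication
# should remain a minor fraction of the time for a moderate number of CPUs.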
The code is written in Fortran 90, and MPI is used for message passing. We have run our test cases on the
following 3 platforms:
• A 32 CPU shared memory IBM p690 Regatta turbo with 1.3 GHz Power4 processors; ESSL scientific library version 3.3 and the MPI that comes with the Parallel Operating Environment (POE) version 3.2 under AIX 5.1.
• A 32 node cluster with dual 1.27 GHz Pentium III nodes and 100 Mbit switched Ethernet between nodes; Intel(R) 32-bit Compiler, version 7.0; MPICH 1.2.5; Intel Math Kernel Library for Linux 5.1 for BLAS; FFTW version 3.0.
• An 8 CPU (virtual) shared memory SGI Altix with 900 MHz Itanium 2 processors; Intel 64-bit Fortran Compiler version 7.1, Intel Math Kernel Library 5.2, SGI MPT 1.8, FFTW version 3.0.
Doing reliable timing proved to be a big problem. The Regatta and the SGI Altix are both virtual shared memory systems, which in essence means that they not only have a complicated memory structure, but there is no guarantee that all CPUs have their chunk of data laid out equally. The systems are also true multi-user systems in the sense that all processes from all users have in principle the same priority for all resources. On a heavily loaded system this means that some processes inevitably will lose the fierce battle for resources. When this happens to one process in a carefully load balanced parallel application, a devastating performance degradation occurs at the first synchronization point, where everyone has to wait for the poor fellow who lost the battle. Our application needs to synchronize at each redistribution. In practice we found the elapsed runtime to be quite unpredictable on loaded systems. On rare occasions we might get the predicted runtime, but it was not unlikely to see a factor of 2 slowdown. Only when running on a dedicated system did timings become predictable, but even here differences of 10% between identical runs were likely to happen.
On the cluster the memory system of each node should be simpler, and competition between processes did not take place within a compute node. However, network resources were subject to competition. On the cluster we observed the same unpredictability as on the other systems. Here it didn't help to have dedicated access to the system. We believe this to be a consequence of the interconnect. For our application it easily gets saturated and packets get lost. The Ethernet protocol then requires the packets to be resent, bringing the network into a vicious circle of further saturation and higher packet losses.
These problems should be kept in mind when reading and interpreting the reported results.
In Figure 4 we report the pure communication time for the 3 platforms with the two different modes of communication. The most obvious observation is the huge difference between the cluster and the two SMP platforms. We conclude that "Fast Ethernet" is a contradiction in terms: you may either have a fast interconnect or you may have Ethernet, but the two never coexist.
Figure 4. Time spent on communication for the different platforms and communication modes for a problem of size n_r × n_z = 4096 × 64. Point-to-point (P2P) communication is represented by solid lines, while dashed lines are used for MPI_ALLTOALL (A2A).

Figure 5. Speedup numbers for the 3 different platforms for the same problem.
A second observation is that for MPICH on the cluster and for IBM's MPI on the Regatta, the differences between our hand coded point-to-point and MPI_ALLTOALL seem to be small, while on the SGI Altix the all-to-all seems to be substantially faster.
In Figure 5 we show the speedup numbers for the 3 different platforms. The Regatta as well as the SGI Altix do very well on our test case, while the scaling on the cluster is quite poor. This all comes down to communication speed. Detailed timings show that the arithmetic scales linearly on all platforms. The communication does not scale that well, but as long as it only constitutes a minor part of the total elapsed time, the overall scaling becomes quite satisfactory, and this is the case on the Regatta and the Altix.
Acknowledgements
It is with pleasure that we acknowledge Jan Petter Hansen's patient explanation of the salient features of the Schrödinger equation to us. We are also grateful for Martin Helland's assistance with the computational experiments for the two versions of the matrix multiply.
REFERENCES
[1] B. H. Bransden and M. R. C. McDowell. Charge Exchange and the Theory of Ion-Atom Collisions. Clarendon, 1992.
[2] Jean-Claude Diels and Wolfgang Rudolph. Ultrashort Laser Pulse Phenomena. Academic Press, 1996.
[3] J. J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1-17, 1990.
[4] M. Frigo. FFTW: An adaptive software architecture for the FFT. In Proceedings of the ICASSP Conference, volume 3, page 1381, 1998.
[5] M. Frigo. A fast Fourier transform compiler. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'99), 1999.
[6] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.
[7] Mark R. Hermann and J. A. Fleck Jr. Split-operator spectral method for solving the time-dependent Schrödinger equation in spherical coordinates. Physical Review A, 38(12):6000-6012, 1988.
[8] W. Ketterle, D. S. Durfee, and D. M. Stamper-Kurn. Making, probing and understanding Bose-Einstein condensates. In M. Inguscio, S. Stringari, and C. E. Wieman, editors, Bose-Einstein Condensation of Atomic Gases, Proceedings of the International School of Physics "Enrico Fermi", Course CXL. IOS Press, 1999.
[9] Marlan O. Scully and M. Suhail Zubairy. Quantum Optics. Cambridge University Press, 1997.
[10] Gilbert Strang. On the construction and comparison of difference schemes. SIAM Journal on Numerical Analysis, 5:506-517, 1968.