Performance of a parallel split operator method for the time dependent Schrödinger equation

Thierry Matthey^a,∗ and Tor Sørevik^b,†

^a Parallab, UNIFOB, Bergen, Norway (∗ http://www.ii.uib.no/~matthey)
^b Dept. of Informatics, University of Bergen, Norway († http://www.ii.uib.no/~tors)

In this paper we report on the parallelization of a split-step algorithm for the Schrödinger equation. The problem is represented in spherical coordinates in physical space and transformed to Fourier space for application of the Laplacian operator, and to Legendre space for application of the angular momentum operator and the potential operator. Timing results are reported and analyzed for three different platforms.
1. Introduction
For large scale computations, efficient implementation of state-of-the-art algorithms on high-end hardware is an absolute necessity in order to solve the problem in reasonable time. Current high-end HPC systems are parallel systems built from RISC processors. Efficient implementation therefore requires scalable parallel code, to utilize a large number of processors, and cache-aware numerical kernels, to circumvent the memory bottleneck of today's RISC processors.
In this paper we report on our experiences with optimizing a split-step operator algorithm for solving the time dependent Schrödinger equation. A vast number of quantum mechanical problems require the solution of this equation, such as femto- and attosecond laser physics [2], quantum optics [9], atomic collisions [1] and cold matter physics [8], to name a few. In all cases this is a time consuming task, and in many cases it is out of reach on today's systems. Many different numerical discretization schemes have been introduced for the solution of this equation. In our view the most promising candidate for returning reliable solutions within reasonable time is the split-step operator technique combined with spectral approximation in space.
The resulting algorithm is briefly outlined in Section 2, where we also explain how it is parallelized. In Section 3 we describe the efficient sequential implementation of the numerical kernels, and report timings for different versions of matrix multiply where one matrix is real while the other is complex.
The core of the paper is Section 4, where we report parallel speed-up on three different platforms and discuss our problems and successes.
2. The Algorithm and its parallelization
The time dependent Schrödinger equation can be written as

i ∂Ψ(x, t)/∂t = HΨ(x, t),   (1)

where the Hamiltonian operator, H, consists of the Laplacian plus the potential operator V(x, t):

HΨ(x, t) = ∆Ψ(x, t) + V(x, t)Ψ(x, t).   (2)
The problems we are targeting are best described in spherical coordinates. Transforming to spherical coordinates and introducing the reduced wave function Φ = rΨ gives the following form of the Hamiltonian
H = − ∂²/(2∂r²) + L²/(2r²) + V(r, θ, φ, t),   (3)
where L² is the angular momentum operator. To highlight the basic idea we make the simplifying assumption that V is time-independent as well as independent of φ. The same basic algorithm and the same parallelization strategy hold without these assumptions, but the details become more involved.
For H being a time independent linear operator the formal solution to (1) becomes

Φ(r, θ, φ, t_{n+1}) = e^{∆tH} Φ(r, θ, φ, t_n),   (4)

where ∆t = t_{n+1} − t_n. Splitting H into H = A + B we get³

Φ(r, θ, φ, t_{n+1}) = e^{∆t(A+B)} Φ(r, θ, φ, t_n).   (5)

If we assumed that A and B commute we could write (5) as

Φ(r, θ, φ, t_{n+1}) = e^{∆tA} e^{∆tB} Φ(r, θ, φ, t_n),   (6)

which would allow us to apply the operators separately and greatly simplify the computation. Unfortunately they do not commute, and in that case we will have a splitting error. The straightforward splitting in (6) leads to a local error of O(∆t²). This can be reduced to O(∆t³) locally if a Strang [10] splitting is applied:

Φ(r, θ, φ, t_{n+1}) = e^{(∆t/2)A} e^{∆tB} e^{(∆t/2)A} Φ(r, θ, φ, t_n).   (7)
This introduction of "symmetry" eliminates a term in the error expansion and increases the order of the method. More elaborate splitting schemes, involving more terms, also exist.
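As an illustration of the difference in order between the splittings (6) and (7), the following small Python sketch (not part of the original work; the matrices and step sizes are arbitrary) compares both splittings with the exact matrix exponential for two non-commuting matrices:

import numpy as np
from scipy.linalg import expm

# Two small random, non-commuting matrices standing in for A and B.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((6, 6))

for dt in (0.1, 0.05, 0.025):
    exact  = expm(dt * (A + B))
    lie    = expm(dt * A) @ expm(dt * B)                         # splitting (6)
    strang = expm(dt / 2 * A) @ expm(dt * B) @ expm(dt / 2 * A)  # splitting (7)
    print(f"dt={dt:5.3f}  error (6): {np.linalg.norm(lie - exact):.2e}"
          f"  error (7): {np.linalg.norm(strang - exact):.2e}")

# Halving dt reduces the error of (6) by roughly a factor 4 (local error O(dt^2))
# and the error of (7) by roughly a factor 8 (local error O(dt^3)).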
When the Strang splitting is applied repeatedly in a time loop, we notice that, except at the start and the end, two operations with e^{(∆t/2)A} follow each other. These can of course be replaced by one operation with e^{∆tA}. Hence, provided some care is taken with start-up and clean-up, the Strang splitting can be implemented without extra cost. Combining the three ingredients, split-operator technique, spherical coordinates and spectral approximation in space, was first suggested by Hermann and Fleck [7].
The reason why the split-operator technique is so extremely attractive in our case is that the individual operators, the Laplacian and the angular momentum operator, have well known eigenfunctions: the Fourier functions, e^{ikr}, and the spherical harmonics, Ψ_{lm}(θ, φ), respectively. Thus expanding Φ in these eigenfunctions makes the computation not only efficient and simple, but exact as well.
The split-step algorithm is outlined in Algorithm 1.
Algorithm 1 (The split-step algorithm)

  /* initialization */
  for n = 0, nsteps-1
     F̂ ← FFT(F_n)
     F̂ ← scale with eigenvalues of A
     F_{n+1/2} ← IFFT(F̂)
     F̃ ← LT(F_{n+1/2})
     F̃ ← scale with eigenvalues of B
     F_{n+1} ← ILT(F̃)
  end for
F, F̂ and F̃ are all matrices of size n_r × n_z, the discrete representations of Φ in coordinate space, Fourier space and Legendre space, respectively. For data living in the right space, time propagation reduces to simple scaling by the appropriate eigenvalues and the step size, and is consequently fast as well as trivially parallelized. The computationally demanding parts are the transforms. Each transform is a global operation on the vector in question, and therefore not easily parallelized. But with multiple vectors in each direction, there is a simple outer level of parallelism. We can simply parallelize the transforms by assigning n_z/N_p transforms to each processor in the radial direction, or n_r/N_p in the angular direction, N_p being the number of processors. With this parallelization strategy the coefficient matrix needs to be distributed column-wise for the Fourier transform and row-wise for the Legendre transform. Consequently, between a Fourier transform and a Legendre transform, we need to redistribute the data from "row-splitting" to "column-splitting" or vice versa.
³ For simplicity we here split the Hamiltonian into only two operators. For more operators we can apply the formalism recursively, which is done in our implementation.
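To make Algorithm 1 concrete, here is a minimal serial sketch in Python/NumPy (illustrative only; the eigenvalue arrays, the Legendre matrix P and the use of a plain matrix inverse for ILT are stand-ins, and the actual code is written in Fortran 90):

import numpy as np

def split_step(F, eig_A, eig_B, P, dt, nsteps):
    # F     : nr x nz complex array, discrete Φ in coordinate space
    # eig_A : nr x nz eigenvalues of A (radial part), given in Fourier space
    # eig_B : nr x nz eigenvalues of B (L^2/(2r^2) + V), given in Legendre space
    # P     : nz x nz real Legendre transform matrix (LT); ILT modelled by its inverse
    Pinv = np.linalg.inv(P)
    for _ in range(nsteps):
        Fhat = np.fft.fft(F, axis=0)        # FFT of every column (radial direction)
        Fhat *= np.exp(-1j * dt * eig_A)    # "scale with eigenvalues of A"
        F = np.fft.ifft(Fhat, axis=0)       # back to coordinate space
        Ftil = F @ P.T                      # LT of every row (angular direction)
        Ftil *= np.exp(-1j * dt * eig_B)    # "scale with eigenvalues of B"
        F = Ftil @ Pinv.T                   # ILT back to coordinate space
    return F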
Figure 1. Color coding shows the distribution of data to processors in the left and right part of the figure. The
middle part indicates which blocks might be sent simultaneously (those with the same color).
Our code is parallelized for distributed memory systems, using MPI.
As seen in Figure 1, the redistribution requires that each processor gathers N_p − 1 blocks of data of size n_r n_z /N_p² from the other N_p − 1 processors. Thus N_p − 1 communication steps are needed for each of the N_p processors. These can, however, be executed as N_p − 1 parallel steps where all the processors at each step send to and receive from different processors. This is easily achieved by letting each processor p send block (i + p) mod N_p to the corresponding processor, p = 0, 1, ..., N_p − 1, at step i.
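The following few lines of Python (illustrative, with a made-up processor count) verify that this schedule pairs every processor with a distinct partner at every step, so no processor is idle:

Np = 4                                        # number of processors (illustrative)
for i in range(1, Np):                        # block i = 0 is the diagonal block and stays local
    dest = [(i + p) % Np for p in range(Np)]  # processor p sends block (i + p) mod Np to that processor
    src  = [(p - i) % Np for p in range(Np)]  # and receives the matching block from (p - i) mod Np
    assert sorted(dest) == list(range(Np))    # each step is a perfect pairing of senders and receivers
    print(f"step {i}: sends to {dest}, receives from {src}")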
We have implemented this algorithm using point-to-point sends and receives with packing/unpacking of the blocks of the column- (row-) distributed data. In this case each block is sent as one item. This is implemented in a separate subroutine where non-blocking MPI routines are used to minimize synchronization overhead. All that is needed to move from the sequential version of the code to the parallel one is to insert the redistribution routine after returning to coordinate space and before transforming to the new spectral space.
The communication described above corresponds to the collective communication routine MPI_ALLTOALL, provided each matrix block can be treated as a single element. This is accomplished using MPI derived datatypes [6]. This approach is also implemented.
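Our implementation is in Fortran 90; purely as a sketch of the same redistribution, the fragment below uses Python with mpi4py and NumPy, packing the column-distributed local data into N_p equal blocks, exchanging them with MPI_ALLTOALL, and unpacking them into the row distribution (all names and sizes here are illustrative):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
Np, p = comm.Get_size(), comm.Get_rank()

nr, nz = 4096, 64                  # assumed multiples of Np
nr_loc, nz_loc = nr // Np, nz // Np

# Column-distributed data: this processor owns nz_loc columns of the nr x nz matrix.
cols = np.full((nr, nz_loc), p, dtype=np.complex128)

# Pack: block j contains the rows destined for processor j.
sendbuf = np.ascontiguousarray(cols.reshape(Np, nr_loc, nz_loc))
recvbuf = np.empty_like(sendbuf)

comm.Alltoall(sendbuf, recvbuf)    # one block to and from every processor

# Unpack: block j, received from processor j, holds its nz_loc columns of our rows.
rows = recvbuf.transpose(1, 0, 2).reshape(nr_loc, nz)

This corresponds to the derived-datatype variant; the hand-coded point-to-point version replaces the single all-to-all call with N_p − 1 paired sends and receives following the schedule above.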
The algorithm is optimal in the sense that it sends the minimum amount of data and keeps all the processors busy all the time. What is beyond our control is the optimal use of the underlying network. A well tuned vendor implementation of MPI_ALLTOALL might outpace our hand-coded point-to-point version by taking this into account.
In the parallel version the n_z × n_z coefficient matrix for the Legendre transform is replicated across all the participating processors.
IO is handled by one processor only, and the appropriate broadcast and gather operations are used to distribute data to and gather data from all processors when needed. All remaining computational work, not covered here, is point-wise and can be carried out locally, regardless of whether the data are row- or column-split.
For n_r and n_z being multiples of N_p the load is evenly distributed, and any sublinear parallel speed-up can be attributed to communication overhead.
3. Sequential Optimization
The time consuming parts of our computation are the forward and backward Fourier transforms and the forward and backward Legendre transforms. For the Fourier transform we of course use the FFT algorithm. We prefer to use vendor-implemented FFT routines whenever they are available, but find it very inconvenient that no standard interface is defined for these kernel routines, as there is for BLAS [3]. This makes porting to new platforms more laborious than necessary. One possibility is to use the portable and self-tuning FFTW library [4,5], and we have used this for the SGI Altix system and our IBM Pentium III cluster.
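As a small illustration of the batched transforms needed here (using NumPy's FFT purely as a stand-in for a vendor FFT library or FFTW), a single call can transform all the columns of the coefficient matrix at once:

import numpy as np

nr, nz = 2048, 32                  # illustrative sizes
F = (np.random.standard_normal((nr, nz))
     + 1j * np.random.standard_normal((nr, nz)))

Fhat = np.fft.fft(F, axis=0)       # one batched call: an FFT of every column
assert np.allclose(np.fft.ifft(Fhat, axis=0), F)   # round trip back to coordinate space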
The discrete Legendre transform is formulated as a matrix-vector product. When the transform is applied to multiple vectors these can be arranged in a matrix, and we get a matrix-matrix product. For this purpose we use BLAS. A minor problem is that the transform matrix is a real matrix while the coefficient vectors are complex. Thus we are faced with multiplying a real matrix with a complex matrix. The BLAS standard [3], however, requires both matrices to be of the same type. There are two solutions to this incompatibility of datatypes. We can either split the complex matrix, B, into a real and an imaginary part and do

A = CB = C(X + iY) = CX + iCY,   (8)

which requires two calls to DGEMM, or we can cast the real transform matrix C to complex and do

A = CB = (complex)(C) · B   (9)

and make one call to ZGEMM.
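The two strategies can be sketched as follows (Python/NumPy with SciPy's BLAS wrappers standing in for direct DGEMM/ZGEMM calls; sizes are illustrative):

import numpy as np
from scipy.linalg.blas import dgemm, zgemm

nz, nr = 64, 2048
C = np.random.standard_normal((nz, nz))             # real Legendre transform matrix
B = (np.random.standard_normal((nz, nr))
     + 1j * np.random.standard_normal((nz, nr)))    # complex coefficient matrix

# Strategy (8): split B = X + iY and make two real GEMM calls.
A_split = dgemm(1.0, C, B.real) + 1j * dgemm(1.0, C, B.imag)

# Strategy (9): cast C to complex once, before the time loop, and make one ZGEMM call.
Cz = C.astype(np.complex128)
A_cast = zgemm(1.0, Cz, B)

assert np.allclose(A_split, A_cast)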
Note that C (the Legendre transformation matrix) is constant throughout the computation, while B is new for each transform. Thus the splitting of B into real and imaginary parts, as well as the merging of X and Y into a complex A, has to be done at each step, while the casting of C from real to complex is done once and for all before we start the time marching loop. Our timings on the IBM p690 Regatta system show a small advantage for the second strategy on small matrices, while working with real matrices appears to be slightly better for larger matrices. This probably reflects the fact that the arithmetic is less in the first case (8). However, this case is also more memory consuming.

Figure 2. Computational time in seconds for the 3 versions of the matrix multiply (triple do-loop, complex matrices, real matrices) as a function of n_r; n_z = 32.

Figure 3. Computational time in seconds for the 3 versions of the matrix multiply as a function of n_z; n_r = 2048.
4. Experimental results
Our problem is computationally demanding because many time steps are needed. Each time step typically takes 0.1-1.0 seconds on a single CPU. The outer loop over time steps is (as always) inherently sequential. Thus the parallelization has to take place within a time step. All arithmetic is embarrassingly parallel, provided the data is correctly distributed. To achieve this, two global matrix "transposes" are needed in each step. The amount of data to be sent is n_r n_z, while the amount of computation is O(n_r n_z (log n_r + n_z)). We consider this to be medium grained parallelism, with a communication to computation ratio which should scale well up to a moderate number of CPUs (20-30) for typical values of n_r (1000-10000) and n_z (10-100).
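A back-of-the-envelope version of this estimate (illustrative only; the constants hidden in the O(·) expressions are ignored) is:

import math

for nr, nz in [(1000, 10), (4096, 64), (10000, 100)]:
    data = nr * nz                            # words moved per redistribution
    work = nr * nz * (math.log2(nr) + nz)     # ~ arithmetic per time step
    print(f"nr={nr:6d} nz={nz:4d}: work/data ~ {work / data:.0f}")

# The work-to-data ratio grows with nz and log(nr), which is why communication
# should remain a minor fraction of the time for a moderate number of CPUs.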
The code is written in Fortran 90, and MPI is used for message passing. We have run our test cases on the
following 3 platforms:
• A 32 CPU shared memory IBM p690 Regatta turbo with 1.3 GHz Power4 processors; ESSL scientific library version 3.3 and the MPI that comes with the Parallel Operating Environment (POE) version 3.2 under AIX 5.1.
• A 32 node cluster with dual 1.27 GHz Pentium III nodes and 100 Mbit switched Ethernet between nodes; Intel(R) 32-bit Compiler, version 7.0; MPICH 1.2.5; Intel Math Kernel Library for Linux 5.1 for BLAS; FFTW version 3.0.
• An 8 CPU (virtual) shared memory SGI Altix with 900 MHz Itanium 2 processors; Intel 64-bit Fortran Compiler version 7.1, Intel Math Kernel Library 5.2, SGI MPT 1.8, FFTW version 3.0.
Doing reliable timing proved to be a big problem. The Regatta and the SGI Altix are both virtual shared memory systems, which in essence means that they not only have a complicated memory structure, but there is no guarantee that all CPUs have their chunk of data laid out equally. The systems are also true multi-user systems in the sense that all processes from all users have in principle the same priority for all resources. On a heavily loaded system this means that some processes inevitably will lose the fierce battle for resources. When this happens to one process in a carefully load balanced parallel application, a devastating performance degradation occurs at the first synchronization point, where everyone has to wait for the poor fellow who lost the battle. Our application needs to synchronize at each redistribution. In practice we found the elapsed runtime to be quite unpredictable on loaded systems. On rare occasions we might get the predicted runtime, but it was not unlikely to see a factor of 2 slowdown. Only when running on a dedicated system did timings become predictable, but even here differences of 10% between identical runs were likely to happen.
On the cluster the memory system of each node should be simpler, and competition between processes did not take place within a compute node. However, network resources were subject to competition. On the cluster we observed the same unpredictability as on the other systems. Here it didn't help to have dedicated access to the system. We believe this to be a consequence of the interconnect. For our application it easily gets saturated and packets get lost. The Ethernet protocol then requires the packets to be resent, bringing the network into a vicious circle of further saturation and higher packet losses.
These problems should be kept in mind when reading and interpreting the reported results.
In Figure 4 we report the pure communication time for the 3 platforms with the two different modes of communication. The most obvious observation is the huge difference between the cluster and the two SMP platforms. We conclude that "Fast Ethernet" is a contradiction in terms: you may either have a fast interconnect or you may have Ethernet, but the two never coexist.
Figure 4. Time spent on communication for the different platforms and communication modes for a problem of size n_r × n_z = 4096 × 64. Point-to-point (P2P) communication is represented by solid lines, while dashed lines are used for MPI_ALLTOALL (A2A).

Figure 5. Speedup numbers for the 3 different platforms for the same problem.
A second observation is that for MPICH on the cluster and for IBM's MPI on the Regatta, the differences between our hand coded point-to-point and MPI_ALLTOALL seem to be small, while on the SGI Altix the all-to-all seems to be substantially faster.
In Figure 5 we show the speedup numbers for the 3 different platforms. The Regatta as well as the SGI Altix do very well on our test case, while the scaling on the cluster is quite poor. This all comes down to communication speed. Detailed timings show that the arithmetic scales linearly on all platforms. The communication does not scale that well, but as long as it only constitutes a minor part of the total elapsed time, the overall scaling becomes quite satisfactory, and this is the case on the Regatta and the Altix.
Acknowledgements
It is with pleasure that we acknowledge Jan Petter Hansen's patient explanation of the salient features of the Schrödinger equation to us. We are also grateful for Martin Helland's assistance with the computational experiments for the two versions of the matrix multiply.
REFERENCES
[1] B. H. Bransden and M. R. C. McDowell. Charge Exchange and the Theory of Ion-Atom Collisions. Clarendon, 1992.
[2] Jean-Claude Diels and Wolfgang Rudolph. Ultrashort Laser Pulse Phenomena. Academic Press, 1996.
[3] J. J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1-17, 1990.
[4] M. Frigo. FFTW: An adaptive software architecture for the FFT. In Proceedings of the ICASSP Conference, volume 3, page 1381, 1998.
[5] M. Frigo. A fast Fourier transform compiler. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'99), 1999.
[6] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.
[7] Mark R. Hermann and J. A. Fleck Jr. Split-operator spectral method for solving the time-dependent Schrödinger equation in spherical coordinates. Physical Review A, 38(12):6000-6012, 1988.
[8] W. Ketterle, D. S. Durfee, and D. M. Stamper-Kurn. Making, probing and understanding Bose-Einstein condensates. In M. Inguscio, S. Stringari, and C. E. Wieman, editors, Bose-Einstein Condensation of Atomic Gases, Proceedings of the International School of Physics "Enrico Fermi", Course CXL. IOS Press, 1999.
[9] Marlan O. Scully and M. Suhail Zubairy. Quantum Optics. Cambridge University Press, 1997.
[10] Gilbert Strang. On the construction and comparison of difference schemes. SIAM Journal on Numerical Analysis, 5:506-517, 1968.