HPC Using Python Workshop
Week Two: Parallel Python

Bin Chen ([email protected])
Research Computing Center, Florida State University
April 16, 2015

In [1]: %run talktools.py

Week 1: HPC via Cython

1. Add C data types to your Python code:

       cdef int add(int a, int b):
           cdef int c = a + b
           return c

2. Dynamic typing --> static typing
3. Interpreted language --> compiled code
4. A ~10x speedup is not unusual.

Cython vs. Pure C Code

Question (from last week):

    Cython can be ~10 times faster than Python.
    How close can Cython approach C in performance?

I am going to use the great-circle distance as an example, so new
audience members can get a feel for what we did last week.

Distance Between Two Cities

Assuming the earth is a sphere, compute the distance between two
cities given their longitudes and latitudes.

    It uses the trigonometric functions sin(), cos(), acos().
    It shows how to call external C functions from Cython.

In [1]: %load_ext cythonmagic

In [2]: %%cython
        # use trig functions from the C math library
        cdef extern from "math.h":
            double cos(double theta)
            double sin(double theta)
            double acos(double theta)

        cpdef double cy_great_circle_2(double lon1, double lat1,
                                       double lon2, double lat2):
            """Input angles in degrees."""
            cdef double radius = 3956  # radius of the earth in miles
            cdef double x
            cdef double a, b, theta, c
            cdef double pi = acos(-1.0)

            x = pi/180.0
            a = (90.0 - lat1)*x
            b = (90.0 - lat2)*x
            theta = (lon2 - lon1)*x
            c = acos(cos(a)*cos(b) + sin(a)*sin(b)*cos(theta))
            return radius*c

In [3]: import timeit
        # coordinates of Guangzhou and Tallahassee
        lon1, lat1 = 113.2667, 23.1333
        lon2, lat2 = -84.2553, 34.4550
        print "In miles, Distance(Guangzhou, Tallahassee) = "
        print cy_great_circle_2(lon1, lat1, lon2, lat2)

        num = 1000000
        t = timeit.Timer("cy_great_circle_2(%f,%f,%f,%f)" % (lon1, lat1, lon2, lat2),
                         'from __main__ import cy_great_circle_2')
        print "one million runs take: ", t.timeit(num), "sec"

In miles, Distance(Guangzhou, Tallahassee) =
8289.15806274
one million runs take:  0.251652956009 sec

Distance Between Two Cities (2)

Performance of the Cython function: about 0.25 microseconds per call,
from the timing above (remember, we are already calling the C trig
functions).

How fast can the C version be? We will create a C version, compile it,
run it, and time it:

    cd Desktop/Python/workshop/examples/python_week2/greatcircle

    Compiled with the -O3 option, a single call takes 0.107142
    microseconds, so Cython reaches about 40% of the pure C
    performance.

How about the Fibonacci sequence example?

    C:      a single call takes 6.997 nanoseconds
    Cython: 10000000 loops, best of 3: 63.6 ns per loop
    C is about 9 times faster than Cython.
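Before moving on to parallel Python, here is a pure-Python version of
the great-circle function (not in the slides) that you can time the
same way, to see the full Python -> Cython -> C progression. The
function name py_great_circle is my own:

    import math
    import timeit

    def py_great_circle(lon1, lat1, lon2, lat2):
        """Great-circle distance in miles; input angles in degrees."""
        radius = 3956.0           # radius of the earth in miles
        x = math.pi / 180.0       # degrees -> radians
        a = (90.0 - lat1) * x
        b = (90.0 - lat2) * x
        theta = (lon2 - lon1) * x
        c = math.acos(math.cos(a)*math.cos(b) +
                      math.sin(a)*math.sin(b)*math.cos(theta))
        return radius * c

    # time one million calls, mirroring the Cython benchmark above
    t = timeit.Timer("py_great_circle(113.2667, 23.1333, -84.2553, 34.4550)",
                     "from __main__ import py_great_circle")
    print "pure Python, one million runs take:", t.timeit(1000000), "sec"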
HPC Python: Parallel Computing

1. Submit jobs to the RCC HPC cluster
2. Distribute scripts to multiple nodes on the HPC cluster
3. Python message passing using the mpi4py package
4. An example combining Cython and mpi4py
5. Python multiprocessing (if time allows)

Login to the RCC HPC Cluster

Assuming you have an account, log in to the HPC cluster:

    ssh -X [email protected]

Start an interactive Python session:

    $ module load python27
    $ ipython

Note: only run small interactive jobs on the HPC login node.
Note: submit your large jobs in batch to the HPC cluster.

HPC Job Scheduler --- MOAB

MOAB is the interface between your program and the cluster.

What information does MOAB need?

    a. How many CPUs do you need?
    b. How long will your job run?
    c. The name of and path to your executable.
    d. Which queue should your job be submitted to?
    ...

How do you provide such information? Write a job submit script!

HPC Job Cycle

1. Prepare your executable:
       C/Fortran: a.out
       Python:    b.py
2. Create a job submit script for MOAB:
       a.sub   # for a.out
       b.sub   # for b.py
3. Submit your job using msub:
       $ msub [a|b].sub

An Example Job Submit Script

Example submit script --- sub_4_core.msub:

    #!/bin/bash
    #MOAB -N "my_python_job"
    #MOAB -j oe
    #MOAB -m abe
    #MOAB -l nodes=4
    #MOAB -l walltime=01:00:00
    #MOAB -q backfill

    module purge
    module load python27
    module load gnu-openmpi

    cd $PBS_O_WORKDIR
    mpirun -np 4 python main_mpi.py

Submit the parallel Python job to the HPC cluster:

    $ msub sub_4_core.msub

Distribute Scripts Over Multiple Nodes

I have a serial bash/Python script: can I run multiple copies of it on
the HPC cluster? Yes. One option is the pbsdsh command-line utility.

Why is pbsdsh useful? It provides an environment variable,
PBS_VNODENUM, similar to the rank of a process in an MPI job. This
allows you to assign different work to different processors.

Syntax:

    $ pbsdsh executable [args]

where [args] is an optional list of arguments.

Distribute Python Scripts

Example using pbsdsh and PBS_VNODENUM:

    #!/usr/bin/python
    # compute the fibonacci sequence
    import os

    def fib(n):
        if n <= 2:
            return 1
        a, b = 1, 1
        for i in range(n - 2):
            a, b = a + b, a
        return a

    # read the PBS environment variable $PBS_VNODENUM
    n = int(os.getenv('PBS_VNODENUM')) + 3
    print("fibonacci(%d) = %d" % (n, fib(n)))

I will submit this job to the HPC cluster as our warm-up example.

    cd HPC/Python/workshop/week2/examples/pbsdsh

What pbsdsh Can/Cannot Do

    It sends copies of the same serial script to different nodes/cores.
    It gives each copy a different input via $PBS_VNODENUM.
    Each process runs independently; there is no communication between them.
    We need a communication mechanism for this to become a powerful tool!

What is MPI?

MPI = Message Passing Interface

1. It is not a library, but a standard for libraries of message
   passing routines.
2. Two major projects implementing MPI are openmpi and mvapich.
3. Different compilers [gnu, intel, pgi] have their own flavors:
   gnu-[openmpi | mvapich2], intel-[openmpi | mvapich2],
   pgi-[openmpi | mvapich2].
4. On the RCC HPC cluster, load the module before you compile/run your
   MPI code, for example:

       $ module load gnu-openmpi

MPI For Python

mpi4py provides bindings of MPI for Python.

Features:

1. Object oriented; follows the MPI-2 C++ bindings closely.
2. Supports point-to-point and collective communication of picklable
   Python objects and NumPy arrays.
3. Gives your Python script the standard MPI "look and feel".

Why Do We Need Message Passing?

MPI was designed for the distributed-memory programming model.
[Figure: distributed-memory architecture]
Each core/process has its own local memory, so message passing is
needed to share data.

Message Passing

Messages can be passed in different ways:

1. point to point: Send(), Recv()
2. broadcast: the same message goes from one process (the root) to all
   processes: Bcast()
3. scatter: one process (the root) sends an equal-sized piece of data
   to every process: Scatter()
4. gather: one process (the root) collects an equal-sized piece of
   data from every process: Gather()
5. and other ways (you can imagine)

Simple MPI Data Movement

[Figure: data movement for broadcast, scatter, and gather]
Note: each process has its own memory (each row).
Note: scatter is the inverse operation of gather.

MPI Data Movement Plus Math Operations

We can reduce data while passing them; a reduce sketch follows below.
Note: the reduce operation can be MPI.SUM, MPI.PROD, MPI.MAX,
MPI.MIN, etc.
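As a concrete illustration of a reduction (my own sketch, not from the
slides), here is a minimal mpi4py program that sums one value from
each rank into a NumPy buffer on the root; the file name reduce_demo.py
is mine. Run it with mpirun -np 4 python reduce_demo.py:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # each rank contributes its own rank number
    sendbuf = np.array([rank], dtype='i')
    recvbuf = np.zeros(1, dtype='i')

    # element-wise sum of all sendbuf arrays lands in recvbuf on the root
    comm.Reduce(sendbuf, recvbuf, op=MPI.SUM, root=0)

    if rank == 0:
        print "sum of all ranks =", recvbuf[0]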
Message Passing (2)

What data can we pass? Or: what are the A, B, C, D on the previous page?

I.  Fundamental data types: arrays of MPI.INT, MPI.DOUBLE, MPI.FLOAT, etc.
II. More complicated data structures, e.g., Python objects such as:

        data = {'key1' : [7, 2.72, 2+3j],
                'key2' : ('abc', 'xyz'),
                'key3' : 3.1415926}

Organize Communicating Processes

Question: how do I know

    1. who I am?
    2. who I am talking to?
    3. who I can talk to?

Answer: communicating processes need to be organized, grouped, and
labeled. Each process needs a unique ID, and each process needs to
know its world/universe (we call it a communicator).

Organize Communicating Processes (2)

It is surprisingly easy!

    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

In the above:

    (1) comm is the communicator, the world;
    (2) rank is my unique ID;
    (3) size is the number of processes in my world.

My First MPI Python Code: Hello World!

We are ready for our first mpi4py program:

    $ cat ex01.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    print "My rank is %d out of %d. Hello World!" % (rank, size)

The code will be copied to all cores assigned to your HPC job. We also
need a job submit script similar to the one shown before. Let's run it
on the HPC cluster:

    cd HPC/Python/workshop/week2/examples/A

Summary of the First Example

The job scheduler interacts nicely with the MPI library: we asked for
4 cores in the job submit script, and the communicator MPI.COMM_WORLD
has 4 workers.

The IDs/ranks of the workers start from 0, not 1.

To run a parallel Python code on the HPC cluster:

    module load gnu-openmpi
    mpirun -np num_processors python my.py

Also note: there is no message passing between the workers yet!

Example 2: Point to Point Communication

Here is the source code, ex02.py:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    root = 0

    if rank == root:
        data = {'a': 7, 'b': 3.14}
        comm.send(data, dest=1, tag=11)
    elif rank == 1:
        data = comm.recv(source=root, tag=11)

    print rank, 'I got', data

About the syntax of the send()/recv() routines:

    [dest | source] indicates where the data is [going to | coming from];
    tag is an integer label.

Note: the data sent is a Python object (a dictionary).

Example 2: Point to Point Communication (2)

The source code ex02.py above contains a bug. Can you tell?
Error messages are helpful for beginners, so let's run it!

    $ cd HPC/Python/workshop/week2/examples/B
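If you want to check your answer after running it: assuming the
intended bug is that, on the 4 cores our submit script requests, ranks
2 and 3 take neither branch, `data` is never defined on those ranks
and the final print raises a NameError. A corrected sketch (my fix,
not necessarily the one shown in the workshop) simply gives every rank
a default value:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    root = 0

    data = None              # default, so `data` exists on every rank
    if rank == root:
        data = {'a': 7, 'b': 3.14}
        comm.send(data, dest=1, tag=11)
    elif rank == 1:
        data = comm.recv(source=root, tag=11)

    print rank, 'I got', data   # now safe on ranks 2 and 3 as well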
Example 3: Point to Point Communication (3)

So far I have used the send/recv routines for Python objects:

    comm.send(data, dest=1, tag=11)
    comm.recv(source=root, tag=11)

Note that send and recv are lower case. For sending fundamental data
types, use the upper-case versions:

    comm.Send()
    comm.Recv()

The arguments are also different; see the next page.

Example 3: Point to Point Communication (4)

Source code ex03.py:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        data = np.arange(10, dtype='i')
        comm.Send([data, MPI.INT], dest=1, tag=77)
    elif rank == 1:
        data = np.empty(10, dtype='i')
        comm.Recv([data, MPI.INT], source=0, tag=77)
    else:
        data = None

    print rank, data

A few things to note before we run it:

    $ cd HPC/Python/workshop/week2/examples/C

Example 3: Point to Point Communication (5)

The receiving buffer `data` has to be pre-declared and allocated:

    data = np.empty(10, dtype='i')

Data types have to match between sender and receiver. This is more
restrictive than passing Python objects, but faster!

Example 4: Broadcasting

Source code for the broadcast example:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()

    if rank == 0:
        data = {'key1' : [7, 2.72, 2+3j],
                'key2' : ('abc', 'xyz')}
    else:
        data = None

    data = comm.bcast(data, root=0)
    print "rank %d has data " % rank, data

Now let's run it:

    cd HPC/Python/workshop/week2/examples/D

Example 5: Scattering

Source code for the scatter example:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()

    if rank == 0:
        data = [(i+1)**2 for i in range(size)]
    else:
        data = None

    data = comm.scatter(data, root=0)
    print "rank = %d, data = " % rank, data

Example 6: Reducing

Source code for the reduce example:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    total = 0
    comm.Reduce(rank, total, op=MPI.MAX, root=0)

    if rank == 0:
        print "Max rank =", total

The above code is incorrect. Can you tell?
Hint: Comm.Reduce(sendbuf, recvbuf, Op op=MPI.SUM, root=0)

Under the Hood

mpi4py can pass Python objects such as dictionaries and lists. How
does that work? Under the hood, mpi4py pickles the objects!

The idea of pickling:

    1. convert the Python object to a string --- serialization, deflating;
    2. write the string to a file/buffer;
    3. load the file and parse the string --- deserialization, inflating.

Let me show you an example of pickling.

In [4]: import pickle
        # delete pickled_data.txt if it exists
        K = ['a', 3.4]
        L = {'e': 2.7128, 'pi': 3.1415926}
        pfile = open('pickled_data.txt', 'w')
        pickle.dump(K, pfile)
        pickle.dump(L, pfile)
        pfile.close()
        del(K)
        del(L)

In [5]: f = open('pickled_data.txt', 'r')
        K = pickle.load(f)
        L = pickle.load(f)
        print K
        print L

['a', 3.4]
{'pi': 3.1415926, 'e': 2.7128}

Time Your MPI Python Code

It is easy to time your Python code:

    import time
    t_start = time.time()
    # ... serial python code ...
    t_finish = time.time()
    run_time = t_finish - t_start   # in seconds

In IPython, there is also the %timeit magic command.

It is just as easy to time your mpi4py code: compute the wall-clock
time using MPI.Wtime():

    from mpi4py import MPI
    t_start = MPI.Wtime()
    # ... your mpi code ...
    t_finish = MPI.Wtime()
    run_time = t_finish - t_start   # in seconds

Synchronize MPI Processes

Synchronize MPI processes using Barrier():

    comm = MPI.COMM_WORLD
    ...
    comm.Barrier()

Summary of mpi4py

MPI is a big subject: the MPI libraries contain about 200 routines.
As with Cython, we have only covered the very basics.

Cython Plus mpi4py?

Cython speeds up our serial Python code; good Cython code can approach
C in performance. mpi4py allows us to run Python code in parallel. Can
we parallelize a Cython code using mpi4py? Yes. I will show you one
example on the HPC cluster.
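The basic pattern is simple, even though the black-hole code itself is
not shown here. A minimal sketch of the idea (my own, assuming the
great-circle example above has been compiled into an importable module
named greatcircle; the module name, file name, and input data are
mine): the root scatters one chunk of inputs per rank, every rank does
its heavy numerical work in the compiled Cython function, and the root
gathers the results.

    # main_mpi.py -- split a batch of inputs across ranks, compute with
    # a compiled Cython function on each rank, gather the results.
    from mpi4py import MPI
    # hypothetical compiled module; any Cython-compiled function works
    from greatcircle import cy_great_circle_2

    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()

    if rank == 0:
        # one chunk of (lon1, lat1, lon2, lat2) tuples per rank
        points = [(113.2667, 23.1333, -84.2553 + i, 34.4550)
                  for i in range(size)]
        chunks = [[p] for p in points]
    else:
        chunks = None

    # each rank receives its own chunk and does the (Cython) work ...
    chunk = comm.scatter(chunks, root=0)
    results = [cy_great_circle_2(*args) for args in chunk]

    # ... and the root gathers all partial results
    all_results = comm.gather(results, root=0)
    if rank == 0:
        print "distances:", all_results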
Imaging Accretion Disks around Black Holes

Spacetime around a black hole is strongly curved.
[Image: black hole accretion disk from the movie "Interstellar"]
I have developed a Cython code doing similar things.

Black Hole Ray-Tracing

Reference: B. Chen, X. Dai, & E. Baron, 2013, Astrophysical J., 762, 122.

Some Outputs From My Code

[Image: ray-traced accretion disk]
Compare this image with the one from "Interstellar".

Scalability of mpi4py

Performance of Cython vs. MATLAB:

    My first version of the Cython code was only slightly faster than
    MATLAB, and the MATLAB code scaled better than the Cython code.

Scalability of mpi4py (2)

    My current version of the Cython code is 100 times faster than
    MATLAB.

Reference: B. Chen, R. Kantowski, X. Dai, E. Baron, & P. Maddumage,
2015, Astrophys. J. Suppl. Ser., in press.