HPC Using Python Workshop
Week Two: Parallel Python
Bin Chen
[email protected]
Research Computing Center
Florida State University
April 16, 2015
In [1]:
%run talktools.py
#%load_ext cythonmagic
Cython
Week 1: HPC via Cython
1. Add C data types to your Python code
   cdef int add(int a, int b):
       cdef int c = a + b
       return c
2. Dynamic Type --> Static Type
3. Interpreted Language --> Compiled Code
4. A ~10 times speedup is not unusual.
Cython VS Pure C Code
Question (from last week)
Cython can be ~10 times faster than Python.
How close can Cython approach C in performance?
I am going to use the great-circle distance calculation as an example,
so new audience members can get a feeling for what we did last week.
Distance Between Two Cities
Assuming earth is a sphere, given longitudes and latitudes
It uses trigonometric functions: sin(), cos(), acos()
It shows how to call external C functions from Cython
In [1]:
%load_ext cythonmagic
In [2]:
%%cython
# use trig functions from the C math library
cdef extern from "math.h":
    double cos(double theta)
    double sin(double theta)
    double acos(double theta)
# import math
#name: cy_great_circle_2
cpdef double cy_great_circle_2(double lon1, double lat1, double lon2, double lat2):
    """input angles in degrees"""
    cdef double radius = 3956   # radius of the earth in miles
    cdef double x
    cdef double a, b, theta, c
    cdef double pi = acos(-1.0)
    x = pi/180.0
    a = (90.0-lat1)*x
    b = (90.0-lat2)*x
    theta = (lon2-lon1)*x
    c = acos( cos(a)*cos(b) + sin(a)*sin(b)*cos(theta) )
    return radius*c
In [3]:
import timeit
# coordinates of Guangzhou and Tallahassee
lon1, lat1 = 113.2667, 23.1333
lon2, lat2 = -84.2553, 34.4550
print "In miles, Distance(Guangzhou, Tallahassee) ="
print cy_great_circle_2(lon1,lat1,lon2,lat2)
num = 1000000
t = timeit.Timer("cy_great_circle_2(%f,%f,%f,%f)" % (lon1,lat1,lon2,lat2),
                 'from __main__ import cy_great_circle_2')
print "one million runs take:", t.timeit(num), "sec"
In miles, Distance(Guangzhou, Tallahassee) =
8289.15806274
one million runs take: 0.251652956009 sec
Distance Between Two Cities
Performance of the Cython function:
about 0.24 micro-seconds for a single run
(remember we are already calling C trig functions)
How fast can the C version be?
We will create a C version, compile it, run it, and time it.
cd Desktop/Python/workshop/examples/python_week2/greatcircle
Distance Between Two Cities
Performance of the Cython function:
about 0.24 micro-seconds for a single run
How fast can the C version be?
compile with the -O3 option
a single run takes 0.107142 micro-seconds
so Cython reaches about ~40% of the pure C performance
How about the Fibonacci sequence example?
C: a single run takes 6.997000 nano-seconds
Cython: 10000000 loops, best of 3: 63.6 ns per loop
C is about ~9 times faster than Cython
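For reference, here is a minimal Cython version of the Fibonacci function, a sketch of the kind of typed code being compared above (the exact code from week one may have differed slightly):
%%cython
# typed Fibonacci: the loop runs entirely on C integers
cpdef long cy_fib(int n):
    cdef long a = 1, b = 1
    cdef int i
    if n <= 2:
        return 1
    for i in range(n - 2):
        a, b = a + b, a
    return a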
HPC Python: Parallel Computing
1. Submit Jobs to the RCC HPC cluster
2. Distribute Scripts to Multiple Nodes on the HPC cluster
3. Python Message Passing Using mpi4py package
4. An example combining Cython and mpi4py
5. Python multiprocessing (if time allows)
Log in to the RCC HPC Cluster
Assume you have an account.
Log in to the HPC cluster:
ssh -X [email protected]
Start an interactive python session:
$ module load python27
$ ipython
Note. Only run small interactive job on the HPC login node.
Note. Submit your large job in batch to the HPC cluster.
HPC Job Scheduler ---MOAB
MOAB: the interface between your program and the cluster.
What information does MOAB need?
a. How many CPUs do you need?
b. How long will your job run?
c. The name and path of your executable.
d. Which queue to submit your job to?
...
How to provide such information?
Write a job submit script!
HPC Job Cycle
1. Prepare your executable
   C/Fortran: a.out
   Python:    b.py
2. Create a job submit script for "MOAB"
   a.sub  # for a.out
   b.sub  # for b.py
3. Submit your job using "msub"
$ msub [a|b].sub
An Example Job Submit Script
Example submit script---sub_4_core.msub
#!/bin/bash
#MOAB -N "my_python_job"
#MOAB -j oe
#MOAB -m abe
#MOAB -l nodes=4
#MOAB -l walltime=01:00:00
#MOAB -q backfill
module purge
module load python27
module load gnu-openmpi
cd $PBS_O_WORKDIR
mpirun -np 4 python main_mpi.py
Submit the parallel python job to HPC cluster
$ msub sub_4_core.msub
Distribute Scripts Over Multiple Nodes
I have a serial bash/python script:
Can I run multiple copies of it on the HPC?
Yes. One option is the pbsdsh command line utility.
Why is pbsdsh useful?
It provides an environment variable, PBS_VNODENUM,
similar to the rank of a process in an MPI job.
This allows you to assign a different job to each processor.
Syntax:
$ pbsdsh executable [args]
[args] is an optional list of arguments
Distribute Python Scripts
Example using pbsdsh and PBS_VNODENUM
#!/usr/bin/python
# compute the fibonacci sequence
import os
def fib(n):
    if n <= 2:
        return 1
    a, b = 1, 1
    for i in range(n-2):
        a, b = a+b, a
    return a
# read the PBS environment variable $PBS_VNODENUM
n = int( os.getenv('PBS_VNODENUM') ) + 3
print("fibonacci(%d) = %d" % (n, fib(n)) )
I will submit this job to the HPC cluster as our warm-up example.
cd HPC/Python/workshop/week2/examples/pbsdsh
What pbsdsh Can/Cannot Do?
It sends copies of the same serial script to different nodes/cores.
It gives each of them a different input via $PBS_VNODENUM.
Each process runs independently of the others.
There is no communication between them.
A communication mechanism is needed for this to become a powerful tool!
What is MPI?
MPI = Message Passing Interface
1. It is not a library, but a standard for libraries of message-passing routines.
2. Two major projects implementing MPI are openmpi and mvapich.
3. Different compilers [gnu, intel, pgi] have their own flavors:
   gnu-[openmpi | mvapich2],
   intel-[openmpi | mvapich2],
   pgi-[openmpi | mvapich2]
4. On the RCC HPC cluster, load the module before compiling/running your MPI code.
   For example,
   $ module load gnu-openmpi
MPI For Python
MPI4PY provides bindings of MPI for Python
Features:
1. Object oriented, follows the MPI-2 C++ bindings closely.
2. Supports point-to-point and collective communications of
   picklable Python objects and NumPy arrays.
3. Gives your python script a standard MPI "look and feel".
Why We Need Message Passing?
MPI was designed for the distributed-memory programming model.
Here is a picture of a distributed-memory architecture.
Each core/process has its own local memory; message passing is needed to share data.
Message Passing
Messages can be passed in different ways:
1. point to point
   Send(), Recv()
2. broadcast: the same message goes from one process (the root) to all processes
   Bcast()
3. scatter: one process (the root) sends an equal share of its data to each process
   Scatter()
4. gather: one process (the root) collects an equal amount of data from each process
   Gather()
5. and other ways (you can imagine)
Simple MPI Data Movement
Data movement for broadcast, scatter, and gather:
Note: Each process has its own memory (each row)
Note: Scatter is the inverse operation of gather
MPI Data Movement Plus Math Operation
We can reduce data when passing them:
Note: The reduce operation can be MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN, etc.
Message Passing (2)
What data can we pass?
Or, what are the A, B, C, D on the previous page?
I. Fundamental data types: arrays of data
   MPI.INT, MPI.DOUBLE, MPI.FLOAT, etc.
II. More complicated data structures,
   e.g., Python objects, such as:
   data = {'key1' : [7, 2.72, 2+3j],
           'key2' : ('abc', 'xyz'),
           'key3' : 3.1415926 }
Organize Communicating Processes
Question: how do I know
1. who I am?
2. who I am talking to?
3. who I can talk to?
Answer:
Communicating processes need to be organized, grouped, and labeled.
Each process needs a unique ID.
A process needs to know its World/Universe (we call it a communicator).
Organize Communicating Processes (2)
It is surprisingly easy!
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
In the above:
(1) comm is the Communicator, the world
(2) rank is my unique ID
(3) size is the number of processes in my world
My First MPI Python Code: Hello World!
We are ready for our first mpi4py program:
$ cat ex01.py
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
print "My rank is %d out of %d Hello World! " %
(rank, size)
The code will be copied to all cores assigned to your HPC job.
We also need a job submit script similar to those I mentioned before
Let's run it on the HPC cluster.
cd HPC/Python/workshop/week2/examples/A
Summary of the First Example
The job scheduler interacts nicely with the MPI library:
We asked for 4 cores in the job submit script.
The communicator MPI.COMM_WORLD has 4 workers.
The id/rank of a worker starts from 0, not 1.
To run a parallel python code on the HPC cluster:
module load gnu-openmpi
mpirun -np num_processors python my.py
Also note: no message passing between workers yet!
Example 2: Point to Point Communication
Here is the source code ex02.py
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
root = 0
if rank == root:
    data = {'a':7, 'b':3.14}
    comm.send(data, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=root, tag=11)
print rank, 'I got', data
About the syntax of the send()/recv() routines:
[dest | source] indicates where the data is [going to | coming from]
tag is an integer label
Note: the data sent is a Python object (a dictionary).
Example 2: Point to Point Communication (2)
The source code ex02.py contains a bug.
Can you tell?
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
root = 0
if rank == root:
    data = {'a':7, 'b':3.14}
    comm.send(data, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=root, tag=11)
print rank, 'I got', data
Error messages are helpful for beginners, so let's run it!
$ cd HPC/Python/workshop/week2/examples/B
Example 3: Point to Point Communication (3)
I have used the send/recv routines for Python objects:
comm.send(data, dest=1, tag=11)
comm.recv(source=root, tag=11)
Note that send and recv are lower case.
For sending fundamental data types, use the upper-case versions:
comm.Send()
comm.Recv()
The arguments are also different; see the next page.
Example 3: Point to Point Communication (4)
Source code ex03.py
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = np.arange(10, dtype='i')
    comm.Send([data, MPI.INT], dest=1, tag=77)
elif rank == 1:
    data = np.empty(10, dtype='i')
    comm.Recv([data, MPI.INT], source=0, tag=77)
else:
    data = None
print rank, data
A few things to note before we run it
$ cd HPC/Python/workshop/week2/examples/C
Example 3: Point to Point Communication (5)
The receiving buffer has to be pre-declared and allocated:
data = np.empty(10, dtype='i')
Data types have to match between sender and receiver.
This is more restrictive than passing Python objects, but faster!
Example 4: Broadcasting
Source code for broadcast example:
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
if rank == 0:
    data = {'key1' : [7, 2.72, 2+3j],
            'key2' : ('abc', 'xyz')}
else:
    data = None
data = comm.bcast(data, root=0)
print "rank %d has data " % rank, data
Now let's run it.
cd HPC/Python/workshop/week2/examples/D
Example 5: Scattering
Source code for Scatter example:
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
if rank == 0:
    data = [(i+1)**2 for i in range(size)]
else:
    data = None
data = comm.scatter(data, root=0)
print "rank = %d, data = " % rank, data
Example 6: Reducing
Source code for reduce example:
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
total = 0
comm.Reduce(rank, total, op=MPI.MAX, root=0)
if rank == 0:
    print "Max rank =", total
The above code is incorrect, can you tell?
Hint:
Comm.Reduce(sendbuf, recvbuf, Op op=MPI.SUM, root=0)
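One possible fix, sketched below: the upper-case Reduce() expects buffer-like send/receive arguments (for example NumPy arrays), so either switch to the lower-case reduce() for Python objects or pass arrays, as here.
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
# buffer-like arguments for the upper-case Reduce()
sendbuf = np.array([rank], dtype='i')   # each rank contributes its own rank
recvbuf = np.zeros(1, dtype='i')        # the result lands here on the root
comm.Reduce(sendbuf, recvbuf, op=MPI.MAX, root=0)
if rank == 0:
    print "Max rank =", recvbuf[0]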
Under the Hood
MPI4PY can pass Python objects such as dictionaries and lists.
How does that work?
Under the hood, mpi4py pickles objects!
Idea of pickling:
1. convert the Python object to a string ---- serialization, deflating
2. write the string to a file/buffer
3. load the file and parse the string ---- deserialization, inflating
Let me show you an example of pickling
In [4]:
import pickle
## delete pickled_data.txt if it exists
K = ['a', 3.4]
L = {'e':2.7128, 'pi':3.1415926}
pfile = open('pickled_data.txt','w')
pickle.dump(K,pfile)
pickle.dump(L,pfile)
pfile.close()
del(K)
del(L)
In [5]:
#%ls
f = open('pickled_data.txt','r')
K = pickle.load(f)
L = pickle.load(f)
print K
print L
['a', 3.4]
{'pi': 3.1415926, 'e': 2.7128}
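The file is only used here for illustration; what actually gets shipped between processes is closer to the in-memory form below, where dumps() turns an object into a byte string and loads() restores it (a sketch of the idea, not mpi4py's internal code):
import pickle
obj = {'e': 2.7128, 'pi': 3.1415926}
msg = pickle.dumps(obj)       # serialize the object to a byte string -- this is "the message"
print len(msg), type(msg)     # just bytes; they can be sent to another process
print pickle.loads(msg)       # the receiver rebuilds an equal object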
Time Your MPI Python Code
It is easy to time your Python code:
import time
t_start = time.time()
# ... serial python code ...
t_finish = time.time()
run_time = t_finish - t_start   # in seconds
In IPython, use the %timeit magic command.
It is also easy to time your mpi4py code:
Compute wall-clock time using MPI.Wtime()
from mpi4py import MPI
t_start = MPI.Wtime()
# ... your mpi code ...
t_finish = MPI.Wtime()
run_time = t_finish - t_start   # in seconds
Synchronize mpi-processes
Synchronize MPI processes using Barrier():
comm = MPI.COMM_WORLD
...
comm.Barrier()
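A common pattern (a sketch) combines the two previous slides: call Barrier() so all ranks start and stop the clock together, which gives a fairer wall-clock measurement.
from mpi4py import MPI
comm = MPI.COMM_WORLD
comm.Barrier()                 # make sure every rank starts timing together
t_start = MPI.Wtime()
# ... your mpi code ...
comm.Barrier()                 # wait for the slowest rank before stopping the clock
t_finish = MPI.Wtime()
if comm.Get_rank() == 0:
    print "run time = %f sec" % (t_finish - t_start)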
Summary of MPI4PY
MPI is a big subject.
MPI libraries contain about ~200 routines.
As with Cython, we have only covered the very basics.
Cython Plus Mpi4py?
Cython speeds up our serial python code.
Good Cython code can approach C in performance.
mpi4py allows us to run python code in parallel.
Can we parallelize a Cython code using mpi4py?
Yes. I will show you one example on the HPC cluster.
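Before the black-hole example, here is a minimal sketch of the general pattern (my own illustration, not the actual ray-tracing code): each MPI rank calls a compiled Cython function on its own share of the work, and the root gathers the results. The module name greatcircle and the list of city pairs are assumptions for illustration only.
# main_mpi.py (illustrative): a parallel driver around a compiled Cython function
from mpi4py import MPI
from greatcircle import cy_great_circle_2   # assumed: the Cython module built earlier

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# a made-up list of (lon1, lat1, lon2, lat2) pairs
pairs = [(113.2667, 23.1333, -84.2553, 34.4550)] * 8

# simple round-robin split: each rank handles every size-th pair
my_pairs = pairs[rank::size]
my_dist = [cy_great_circle_2(*p) for p in my_pairs]

all_dist = comm.gather(my_dist, root=0)
if rank == 0:
    print "gathered distances:", all_dist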
Imaging the Accretion Disk around Black Holes
Spacetime around a black hole is strongly curved.
Here is an image from the movie "Interstellar".
I have developed a Cython code that does similar things.
Black Hole Ray-tracing
Reference: B. Chen, X. Dai, & E. Baron, 2013, Astrophysical J. 762, 122
Some Outputs From My Code
Compare this image with the one from "Interstellar".
Scalability of MPI4PY
Performance of Cython vs MATLAB:
My first version of the Cython code was only slightly faster than MATLAB.
The MATLAB code scaled better than the Cython code.
Scalability of MPI4PY (2)
Performance of Cython vs MATLAB:
My current version of the Cython code is ~100 times faster than MATLAB.
Reference: B. Chen, R. Kantowski, X. Dai, E. Baron & P. Maddumage, 2015, Astrophys. J. Suppl. Ser., in press