Wrocław University of Technology
Bioinformatics
Borys Szefczyk
Applied Informatics
Wrocław (2010)
Project co-financed from the EU European Social Fund
Copyright © by Wrocław University of Technology
Wrocław (2010)
Project Office
ul. M. Smoluchowskiego 25, room no. 407
50-372 Wrocław, Poland
Phone: +48 71 320 43 77
Email: [email protected]
Website: www.studia.pwr.wroc.pl
Python programming for bioinformatics students
Borys Szefczyk
Contents
1 Basics
1.1 What is Python and how to use it
1.2 Hello, World!
1.3 Variables in Python
1.4 Operators
1.5 Interaction with the user
1.6 Using modules: math
1.7 Getting help
1.8 Handling types
1.9 Simple control statements
1.10 Condition-controlled loops
1.11 More complex types: lists
1.12 Count-controlled loops
1.13 Pretty output
1.14 Tuples
1.15 Strings and methods
1.16 Dictionaries
1.17 Passing arguments to the script
1.18 Advanced command line options
1.19 Working with files
1.20 Launching external programs
1.21 Functions
1.22 Writing modules
1.23 Regular expressions (re)
2 Numerical applications
2.1 Basic operation on arrays
2.2 Using Gnuplot with numpy
2.3 Linear algebra in Python
2.4 Python for scientists
3 Databases
3.1 Administration
3.2 Simple Query Language
3.3 Data types
3.4 Creating tables
3.5 Inserting data
3.6 Searching the database
3.7 Python interface to MySQL
Preface
This textbook on Applied Informatics is by no means comprehensive. No book could cover the whole field of applied informatics. Instead, the course and the book should give the student a knowledge of the Python programming language sufficient to solve everyday problems in molecular modelling, computational chemistry or bioinformatics; hence the name Python programming for bioinformatics students. Whereas courses on programming languages such as Pascal, C or C++ focus on the language itself and never go into the application layer, this course is focused on applications in computational chemistry. In the first part of the tutorial, you will get a basic knowledge of the Python scripting language; in the second part you will learn how to use Python to solve selected numerical problems (root-finding, integration etc.), manipulate coordinates of molecules and build structures, how to use Python to control computational programs such as GAMESS or Gaussian, and even how to do Quantitative Structure-Activity Relationship analysis in Python! You will learn how to solve problems of linear algebra and how to access and manage professional databases.
Borys Szefczyk¹
¹ Author’s e-mail address: [email protected]
How to read this textbook
To make reading easier, the following convention is applied: commands that you have to type on your computer are written as in the example below:
./runme
ps x
Any dialogue with the Python interpreter or other programs is typeset like in the following example, with
the user input on a grey background:
Enter a number: 123.0
You have entered 123.0
Sample source code is typeset using coloured syntax, like in the following example:
Code 1
#!/usr/bin/python
print "Hello, World!"
Chapter 1
Basics
1.1 What is Python and how to use it
Python is a scripting language. Python is also the name of the program that is used to interpret scripts written in the Python language. If you are going to learn Python, you will be writing scripts and not programs. Does it matter what we call it? Yes, because there is a huge difference: programs are binary (i.e. readable for the machine but not for us humans) and have to be compiled before execution. Scripts are written as text files and they stay text files for the rest of their lifetime. They are not executed but interpreted, therefore they always require that you have the interpreter program (i.e. Python) on your computer. They are also slower than programs, because they have to be translated on the fly.
A Python script can be executed interactively (i.e. while you type it) or from a text file. The first way is useful if you just want to test one or a few commands, and also to access the internal help system. However, when you are writing a longer script, which you will use many times, it is obviously better to type it in a text editor, save it as a text (ASCII¹) file and execute it afterwards. Python can be used both in Linux and in Windows, but the way of running the script differs. This tutorial covers only the usage of Python in Linux and assumes that you are familiar with this operating system. If not, you should keep a Linux manual on hand; a good and extensive one is the book by Æleen Frisch [1].
¹ ASCII (American Standard Code for Information Interchange): one of the character encodings, a table translating characters into one-byte numbers; it contains all English characters, numbers and punctuation, but does not contain, for example, the characters specific to Slavic languages.
In order to start an interactive session of Python, open a text console and type
python
You should see something like
Python 2.6.4 (r264:75706, Dec 7 2009, 23:19:43)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
Now you are within the Python program (don’t be confused: this is not the shell² any more and shell commands do not work here!). If you want to terminate the session, simply press Control-D and you will be taken back to the shell program that you were using.
Once you start writing scripts, you will need a text editor. Warning for Windows addicts: a text editor does not mean the “Word” program. A text editor is a program that lets you save plain-text files, in other words ASCII text. These cannot be .doc or .rtf files (or whatever Word produces), because they contain a lot of garbage, which you do not see in Word, but which will confuse the Python interpreter. Rather, I suggest you download and install the SciTE editor.³ It has the wonderful feature of highlighting Python syntax, which makes writing scripts much easier. If you use the Vim editor, it will also do.
A Python script is a set of commands that you type into the text editor and save for later execution. The file typically has a .py extension, although this is more a custom than an obligation. There are two ways of executing such a script (again, we are talking about Linux). The first way is to supply the name of the script to the python command:
python my_script.py
The second way uses a mechanism included in the shell: in the very first line of the Python script you type
the characters #!, followed by the path to the Python program (usually /usr/bin/python). For example:
Code 2
#!/usr/bin/python
Additionally, you have to change the permissions⁴ of the file, so that it can be executed:
chmod u+x my_script.py
Now you can run the script like any other program, by specifying its name (and path, if necessary). Usually, if the script resides in your current directory, you will type
./my_script.py
² shell: the command-line program used to interact with the operating system; the most popular shells are bash and tcsh
³ http://www.scintilla.org/SciTEDownload.html
⁴ In UNIX systems, files have read, write and execute permissions; the last one is commonly designated with the letter x and indicates that the file can be executed
A short explanation of how it works: the hash character starts a comment in Python, so the line will be ignored by the interpreter. But the two characters #!, placed at the very beginning of the first line, have a special meaning when the file is launched from the shell (no matter if it is bash, csh, or other). They name the program that will execute the content of the file. That program is then started with the path of the script as its argument and reads the script from there.
Python as a language uses the object concept; however, this tutorial is not aimed at teaching you object-oriented programming. You will learn structured programming, and the object-oriented programming will be limited to a minimum.
At the time of writing, Python versions 3.x are stable and have started being installed in Linux distributions alongside the older version 2.x. Python 3.x is intentionally not compatible with previous versions. This book refers to the syntax used by Python 2.x, since most of the external modules are still not compatible with the new language. On the other hand, the changes are not that big and you can easily “translate” scripts written in Python 2.x to the 3.x version; there are even tools for automatic conversion.⁵
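As an illustration of how small such a translation can be, the sketch below writes the print instruction in the function-call form that Python 3.x requires. With a single argument this form is also accepted by Python 2.x, where the parentheses merely group the expression. The variable name message is only an illustration and does not come from the rest of this book.

```python
# Python 3.x spelling of the print instruction: print is a function there.
# With exactly one argument, the same line also runs unchanged under Python 2.x.
message = "Hello, World!"
print(message)
```

All other examples in this book keep the 2.x form, print "Hello, World!", which Python 3.x rejects as a syntax error.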
1.2 Hello, World!
As usual, we will start our tutorial with the “Hello, World!” example. Here it is:
Code 3
#!/usr/bin/python
print "Hello, World!"
The first line in the example is for the shell and it says that Python should be used as an interpreter;
Python itself will ignore it. The print statement is used to display the value of an expression. Here
it is just a string (delimited with quotation marks). You can type it in the editor, save under the name
hello_world.py and execute:
python hello_world.py
⁵ http://docs.python.org/py3k/library/2to3.html
Note that every line of your script must begin at the very first character of the line, i.e. there should be no spaces or tabs before print. Leading whitespace is used in Python to mark blocks of instructions (we will discuss this later). If you do put a space in front of the print instruction (which is a common mistake), you will get an error like this:
File "hello_world.py", line 3
print "Hello, World!"
^
IndentationError: unexpected indent
Remember that Python is case sensitive, i.e. lower and upper case letters are interpreted differently. For instance, the instruction print cannot be spelled Print.
Exercise 1: Modify the Hello, World! program, so that it displays your name.
1.3 Variables in Python
What your scripts usually do is convert one kind of information into another. To do so, you will need to store intermediate data. For this purpose you will use variables. Think of a variable as a selected place in the memory of the computer, where you can store a specific kind of data. As you may know, computers use the binary system, i.e. all data are represented as sequences of logical values, zeros and ones. It is therefore important to specify what kind of data you are storing in the memory, otherwise the conversion to binary and back would not be possible. In many programming languages, you have to declare a variable, indicating its type (e.g. character or integer number). Python makes your life a bit easier, because you need neither to declare the variables nor to define their type; a variable is created as soon as you assign a value to it. Also, Python will guess what the type of the variable is. The variable will exist (it will be kept in the memory) until the program/function finishes or until you explicitly delete the variable. Each variable has a name (identifier) and a value. The name is just a label that represents the variable in the program. An example:
Code 4
Val = 123
number_pi = 3.14
ch1 = 'x'
In this example, three variables are created, called Val, number_pi and ch1. The names of variables may contain lower and upper case letters, digits and the underscore character, but they cannot start with a digit. For example: 1x, var.a and my-var are incorrect names. Also, you cannot use reserved names, which are Python instructions (like print). The equals sign (=) is used in Python to assign a value to a variable. Here, for instance, the variable Val will store the number 123. In the example, we do not specify the type explicitly, but each of the variables will have a type defined by Python. Val will be an integer number, number_pi will be a floating point number and ch1 will be a character (a string, more precisely).
Table 1.1: Data types in Python.
Name    Examples             Description
bool    False, 0, True, 1    Logical values
int     -10, 4005            Integer numbers
long    123456789L           Integer numbers of unlimited size
float   0.123, 1.4e-15       Real numbers
str     'a', "python"        Strings (text)
These types are guessed by Python in the following way: 3.14 is a real number, as it has a fractional part. To store it, the integer type is not sufficient, so the float type will be used. On the other hand, 123 can be stored as an integer number, because it does not have a fractional part. However, if you would like to create the Val variable as a float and store 123 there, you may force Python to do so:
Code 5
Val = 123.0
By specifying the decimal point (123.0) you indicate that this is a floating point number, not an integer.
By assigning to the same variable several times, you make Python ‘forget’ the old value and ‘learn’ the new one:
Code 6
x = 12
x = 34
After executing this code, the variable x will contain the value 34. This is like erasing the variable and creating it again, with a new value. You may also change the type of the variable with a subsequent assignment:
Code 7
x = 5.67
Most variable types have limits, e.g. there are certain minimum and maximum numbers that you can store in an integer variable. In Table 1.1 you will find some of the types used in Python. Note how real numbers with an exponent are typed in Python, e.g. $1.4 \cdot 10^{-15}$ must be written as 1.4e-15.
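The following sketch shows how the type of a variable follows the value most recently assigned to it, and how the exponent notation is read. It relies on the built-in type() function, which is only discussed in Section 1.8; the variable names first_type, second_type and small are illustrative. The print line uses the Python 3.x function form, so under Python 2.x the output is formatted differently.

```python
# The type of a variable follows the value most recently assigned to it.
x = 12
first_type = type(x).__name__      # name of the type after the first assignment
x = 5.67
second_type = type(x).__name__     # name of the type after the second assignment

# Exponent notation: 1.4e-15 stands for 1.4 times ten to the power -15.
small = 1.4e-15
print(first_type, second_type, small < 1e-14)
```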
1.4 Operators
Table 1.2: Some of the arithmetic operators in Python, arranged according to priority (from the highest priority in the top row to the lowest priority in the bottom row).
Operators   Description
**          power operator
* / %       multiplication, division and modulo (remainder)
+ -         sum and difference
Have a look at the following example:
Code 8
#!/usr/bin/python
x = 1
y = 2
z = x + y
print "The result is", z
Here, we add the values of two variables (x and y) and assign the result to the variable z. We use the sum operator (plus sign). See Table 1.2 for the list of standard math operators and their priority. The operator with the highest priority will be executed first; if two operators have equal priorities, they will be executed from left to right (except for **, which is evaluated from right to left). In the following example:
Code 9
x = 3 + 4 / 2 - 1
The first operation executed will be 4 / 2; the remaining expression, 3 + 2 - 1, is then evaluated from left to right. If you want a lower-priority operator to be executed first, you have to use parentheses. If you are in doubt, always use parentheses to define how the expression will be evaluated; it is not an error to use redundant parentheses. For example, in order to compute correctly
$$z = \frac{x}{y + 2}$$
you have to write in your program:
Code 10
z = x / (y + 2)
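The effect of operator priority and of parentheses can be checked with a few lines. This is only a sketch with made-up numbers; the names without_parentheses, with_parentheses and wrong are illustrative, and the print line uses the Python 3.x function form.

```python
# Division has a higher priority than addition and subtraction,
# so 4 / 2 is evaluated first, then 3 + 2 - 1 from left to right.
without_parentheses = 3 + 4 / 2 - 1
# Parentheses force the addition and the subtraction to happen first.
with_parentheses = (3 + 4) / (2 - 1)

x = 6.0
y = 1.0
z = x / (y + 2)       # the fraction x over (y + 2)
wrong = x / y + 2     # not the same thing: (x over y) plus 2
print(without_parentheses, with_parentheses, z, wrong)
```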
Be aware of the special behaviour of the division operator, which depends on its arguments. If both are integers, it will return only the integer part of the result (rounded down); if at least one of the arguments is a floating point number, the result will also be a float. Try out this program, and compare the values of a and b:
Code 11
#!/usr/bin/python
a = 2 / 3
b = 2 / 3.0
print "a =", a
print "b =", b
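The behaviour shown above is specific to Python 2.x. In Python 3.x the / operator always returns a float, and the rounded-down result is obtained with the // operator, which exists in both versions. The sketch below therefore uses // for the integer case, so that it gives the same values under either version; the variable c is an illustrative addition.

```python
# Floor division (//) rounds the quotient down to a whole number;
# for integer operands it behaves the same in Python 2.x and 3.x.
a = 2 // 3
# If at least one operand is a float, / gives a floating point result.
b = 2 / 3.0
# Converting one operand explicitly with float() has the same effect.
c = float(2) / 3
print("a =", a)
print("b =", b)
```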
Knowing the math operators, you can use the Python program as a calculator. Just start an interactive session as described in Section 1.1 and type the expression you want to calculate:
Python 2.6.4 (r264:75706, Dec 7 2009, 23:19:43)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 12.3 + 67. / 5
25.700000000000003
>>> _ * 7
179.90000000000003
>>> a = 33
>>> b = 11
>>> a / b
3
>>>
The result of the last operation can always be retrieved using the special variable designated with the
underscore character. You can also define variables and use them; in the interactive Python session you
do not need to use the print instruction to display the result.
1.5 Interaction with the user
The whole point of making scripts is to save time and work, by writing a script once and then feeding it with different kinds of data. You can insert your data into the script using variables, but this is not what a programmer would call “an elegant way” of handling things. Instead, you should use the input() function to interactively ask the user for data. Let us write a script to convert energy from hartree units into kJ/mol and eV. The conversion factors are 2625.5 and 27.211:
Code 12
#!/usr/bin/python
Eh = input("Enter energy in hartree: ")
EkJmol = Eh * 2625.5
EeV = Eh * 27.211
print Eh, "hartree =", EkJmol, "kJ/mol"
print Eh, "hartree =", EeV, "eV"
The standard function input("string") asks the user to enter a value and returns it. The "string" is displayed as a prompt, much as if you had used the print instruction. In this example, the value returned by input() is assigned to the Eh variable.
Exercise 2: Write a script that calculates the height of a regular triangle for an edge length entered by the user (hint: $\sqrt{2} = 2^{1/2}$).
1.6 Using modules: math
One of the advantages of using Python is the enormous number of modules that can help to solve various
kinds of programming tasks. Frankly, Python itself is quite limited, and very soon you will realize that
the function you need is available in one of the modules. For example, to use the logarithm function, you
must first load the math module:
Code 13
#!/usr/bin/python
import math
x = math.log(2.0)
print "log(2.0) =", x
As you can see above, the module called math is loaded with the import instruction and after that the
function can be invoked by specifying the module name, a dot and the function name (plus arguments if
any are required). To display a list of all objects inside the module, you may use the dir(module_name)
function. If you just need a single function and not all of them, you can use another syntax:
Code 14
#!/usr/bin/python
from math import log
x = log(2.0)
print "log(2.0) =", x
Note that by using the latter syntax, you are adding the logarithm function to the global namespace, so the module name is no longer needed when the function is invoked. It is also possible to use a wildcard and load all the functions from the module at once:
Code 15
from math import *
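The import variants can be compared side by side. The sketch below also uses a third form, import ... as, which is not discussed in the text above: it gives the module a shorter alias of your choice (here m, an arbitrary name). The print line uses the Python 3.x function form.

```python
import math                 # the function is reached as math.log
from math import log        # the function is reached simply as log
import math as m            # the module is reached under the alias m

# All three spellings refer to the very same function object.
same_function = (math.log is log) and (m.log is log)
value = log(2.0)
print(same_function, value)
```

The wildcard form is convenient in short scripts, but it may silently replace names you have already defined, so the explicit forms are usually the safer choice in longer ones.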
1.7 Getting help
Besides the fairly extensive documentation available on-line [2], Python has an internal help system based on its object-based character. For the purpose of this section, it is best if you start an interactive session and type along with the snippets presented here.
Python 2.6.4 (r264:75706, Dec 7 2009, 23:19:43)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a=1
>>> import math
>>> dir()
['__builtins__', '__doc__', '__name__', '__package__', 'a', 'math']
>>>
Function dir(), used here without any arguments, displays the list of names defined in the main namespace. Besides the standard objects, you will notice above that the list contains the variable a, which
has been defined and the math module, which has been imported. Continue with the next example, still
inside the interactive session:
>>> dir(math)
['__doc__', '__file__', '__name__', '__package__', 'acos', 'acosh',
'asin', 'asinh', 'atan', 'atan2', 'atanh', 'ceil', 'copysign', 'cos',
'cosh', 'degrees', 'e', 'exp', 'fabs', 'factorial', 'floor', 'fmod',
'frexp', 'fsum', 'hypot', 'isinf', 'isnan', 'ldexp', 'log', 'log10',
'log1p', 'modf', 'pi', 'pow', 'radians', 'sin', 'sinh', 'sqrt', 'tan',
'tanh', 'trunc']
>>>
This time, the dir() function has been used to display the content of the math module. Note, that
every object contains an element called __doc__. This is just text, which you can display with the print
instruction:
>>> print math.__doc__
This module is always available.  It provides access to the
mathematical functions defined by the C standard.
>>> print math.ceil.__doc__
ceil(x)
Return the ceiling of x as a float.
This is the smallest integral value >= x.
>>>
The __doc__ object contains (usually) information about the object/function and instructions on how to
use it.
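Your own code can supply the same kind of help text. The sketch below defines a small function (functions are only covered in Section 1.21) whose first statement is a string; Python stores that string as the __doc__ of the function. The function name hartree_to_ev is hypothetical, and the print line uses the Python 3.x function form.

```python
def hartree_to_ev(energy):
    """Convert an energy from hartree to electronvolts."""
    return energy * 27.211

# The string placed first in the function body becomes its __doc__.
print(hartree_to_ev.__doc__)
# dir() lists the names available on an object, here the function itself.
has_doc = "__doc__" in dir(hartree_to_ev)
```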
1.8 Handling types
In this section we continue the discussion of data types, which began in Section 1.3. As you know already,
when the variable is created and initialized, Python decides what type of data it contains. It is possible to
check the type with the type(variable_name) function:
Code 16
#!/usr/bin/python
a = 1
print "Type of a is", type(a)
a = 1.0
print "but now a is", type(a)
a = 1+0j
print "and finally a becomes", type(a)
If you execute this script, you will see that:
Type of a is <type 'int'>
but now a is <type 'float'>
and finally a becomes <type 'complex'>
A similar kind of guessing is performed when the input() function is used. Try to execute the script
below a few times, entering different values (e.g. 1, 1.0, 1+0j).
Code 17
#!/usr/bin/python
a = input("Enter a value: ")
print "Your value is", type(a)
borys@swift $ ./types.py
Enter a value: 23
Your value is <type 'int'>
borys@swift $ ./types.py
Enter a value: 0.5
Your value is <type 'float'>
borys@swift $ ./types.py
Enter a value: "abc"
Your value is <type 'str'>
In the last example, the user entered the string “abc” (type 'str'). Note that in such a case, the string has to be enclosed in quotation marks. If you are sure that the data you are asking for are strings, it is more convenient to use another function, called raw_input(). It works exactly like input(), except that it does not try to guess the type and always returns a string. This way, the users do not need to enclose their input in quotation marks to indicate that it is a string.
You can always ensure that the correct type will be used, by using one of the type-conversion functions
(int(), str() etc.). Consider the following example, which reads in a numerator and denominator and
displays a decimal number:
Code 18
#!/usr/bin/python
numerator = input("Enter numerator: ")
denominator = input("Enter denominator: ")
decimal = float(numerator) / denominator
print numerator, "/", denominator, "=", decimal
Since the user may type the numerator and denominator as integers, it is necessary to use the float() function to ensure that the result of the division will be real.
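The conversion functions also work on text, which is what raw_input() returns. In the sketch below, two fixed strings stand in for what the user would type, so that the example runs without a keyboard; the variable label is illustrative, and the print line uses the Python 3.x function form.

```python
# Two strings standing in for the text typed by the user.
numerator = int("1")          # text to integer
denominator = int("4")
# float() on one operand guarantees a real-valued result of the division.
decimal = float(numerator) / denominator
# str() goes the other way, from a number back to text.
label = str(numerator) + "/" + str(denominator)
print(label, "=", decimal)
```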
Exercise 3: Use the raw_input() function to write a script that will ask the user for his name
and then display a welcome message, like in the example:
borys@swift $ ./rawelcome.py
What is your name? Borys Szefczyk
Welcome, Borys Szefczyk !
[Figure 1.1: An algorithm for going out. Flowchart: “I want to go out.” → “Is it raining?”; if yes, “Take umbrella.”; if no, “Is it sunny?”; if yes, “Take sunglasses.”]
1.9 Simple control statements
One important aspect of algorithms used by programs is making decisions about what part of the code should be executed, depending on certain data. This is like checking the weather forecast and deciding whether to take an umbrella or sunglasses. Algorithms are often presented in graphical form (Figure 1.1).
In a program or script, it is the conditional instruction that is responsible for making the decision. The following script calculates the roots of a quadratic equation. The number of distinct real roots depends on the value of the discriminant ∆:
$$\Delta = b^2 - 4ac$$
The script has to decide which formula should be used, depending on whether ∆ is positive, negative or zero.
Code 19
#!/usr/bin/python
from math import sqrt
print "Finding roots of a*x^2 + b*x + c = 0"
a = input("Enter a: ")
b = input("Enter b: ")
c = input("Enter c: ")
a = float(a)
delta = b*b - 4*a*c
if delta > 0:
    x1 = (-b - sqrt(delta))/2/a
    x2 = (-b + sqrt(delta))/2/a
    print "Roots are:", x1, "and", x2
elif delta == 0:
    x = -b/2/a
    print "The single real root is:", x
else:
    print "There are no real roots."
How does it work? Look at the example above: when the script reaches the if statement, it analyses the relational expression delta > 0; if the expression is true, it will execute the code that follows and skip the rest of the conditional instruction (elif and else). If the expression delta > 0 is false, it will jump to the next part of the conditional instruction, which is elif delta == 0:. If this expression is true, the following code will be executed. If not, the program will jump to the else: part. Note that there is no relational expression after else; this statement indicates the part of the code that should be executed if all other conditions fail. Also note that the lines between if and elif, as well as the lines between elif and else, are indented. The indentation, which may consist of spaces or tabs, indicates a block of instructions. A block of instructions is like a small program inside your program. Remember that all lines within a block must have the same indentation, i.e. if the first line starts with four spaces, the following lines must start with four spaces too. This is how Python recognizes the beginning and end of the block.
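The same three-way decision can be wrapped in a function, so that it can be tried on several equations without typing the coefficients each time. This is only a sketch: functions and lists are introduced later in this chapter, the name real_roots is hypothetical, and the print lines use the Python 3.x function form. Keep in mind that the test delta == 0 is exact, so with floating point coefficients a root that is double on paper may show up as two very close roots or as none.

```python
from math import sqrt

def real_roots(a, b, c):
    """Return the list of distinct real roots of a*x^2 + b*x + c = 0 (a != 0)."""
    a = float(a)
    delta = b * b - 4 * a * c
    if delta > 0:
        # Indented block: executed only when the discriminant is positive.
        return [(-b - sqrt(delta)) / (2 * a), (-b + sqrt(delta)) / (2 * a)]
    elif delta == 0:
        return [-b / (2 * a)]
    else:
        return []

print(real_roots(1, -3, 2))   # positive discriminant
print(real_roots(1, 2, 1))    # zero discriminant
print(real_roots(1, 0, 1))    # negative discriminant
```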
The general form of the conditional instruction in Python is: the if keyword, followed by relational
expression and colon. Other elements, the elif and else statements, are optional. Below, different
variants of conditional instruction are shown — all of them are valid.
Code 20
# Example 1
if expression1:
    code1
elif expression2:
    code2
elif expression3:
    code3
else:
    code4
# Example 2
if expression: code
Code 21
# Example 3
if expression:
    code1
else:
    code2
# Example 4
if expression:
    pass
else:
    code
The first example shows the full version of the conditional instruction, but the elif statement may be omitted, as in examples 3 and 4. The else statement is also optional; the simplest form of the conditional instruction is just a single line, as in example 2. A block of code within the conditional instruction cannot be empty. If you want nothing to be done for one of the conditions, you may use the pass instruction. This instruction does nothing; it just satisfies the syntactic requirements of the language.
The relational expression used by the if instruction is just like an arithmetic expression, except that it can have only one of two values: True or False. Relational expressions are composed of relational operators (listed in Table 1.3) and parentheses. Table 1.4 shows how the or and and operators work. Many relational expressions can be combined by using parentheses. Look at the examples below and try to predict whether the expressions are true or false. Then use Python to check.
Code 22
a = -1
b = 0
c = 1
(c > a) and (a < b)
(not b) and (b == c)
c or b or a > 0.0
Table 1.3: Relational operators.
Symbol   Example    Function
not      not a      Negation
or       a or b     Logical sum
and      a and b    Logical product
==       a == b     Equal to
!=       a != b     Not equal to
>        a > b      Greater than
>=       a >= b     Greater than or equal to
<        a < b      Less than
<=       a <= b     Less than or equal to
Table 1.4: Evaluation of logical sum and product.
Left operand   Right operand   Sum     Product
True           True            True    True
True           False           True    False
False          True            True    False
False          False           False   False
Values of many types in Python have their logical meaning as well:
• Integer 0 and float 0.0 are False, all other numbers are True;
• Empty string "" is False, any other string is True;
• Empty list [] or tuple () or dictionary {} are False (you will learn what they are in the following sections).
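These rules can be checked with the built-in bool() function, which converts any value to its logical meaning. The sketch below is an illustration only; the variables name and greeting are made up, and the print lines use the Python 3.x function form.

```python
# bool() converts a value to its logical meaning, True or False.
print(bool(0), bool(0.0), bool(-1.5))     # numbers
print(bool(""), bool("abc"))              # strings
print(bool([]), bool(()), bool({}))       # empty containers
print(bool([0]))                          # a list that holds one element

# Thanks to these rules a value can stand directly in a condition.
name = ""
if name:
    greeting = "Welcome, " + name + "!"
else:
    greeting = "No name given."
print(greeting)
```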
Exercise 4: Write a script that tells the user if a given year is a leap year. The rule to determine
a leap year is as follows: if the year is divisible by 4 and it is not divisible by 100, it is a leap
year. If the year is divisible by 400 it is also a leap year. For example: 2008, 2004 and 2000
were leap years, but 1900 was not.
1.10 Condition-controlled loops
An important element of programming is executing certain parts of the code multiple times, like reading subsequent lines of a file until we find the one that interests us. Asking users to input data and repeating the question until the entered data are correct is another example. Our script is going to calculate the square root of the number entered by the user. But the argument to the function sqrt() must be non-negative:
Code 23
from math import sqrt
a = input("Enter a positive number: ")
while a < 0:
print "This number is negative!"
a = input("Enter a positive number: ")
print "sqrt(", a, ") =", sqrt(a)
Loops are often used in numerical procedures, if they are based on iterative techniques. An important
example for computational chemists is the Self-Consistent Field method: we start with an approximate
set of coefficients (guess) and use an iterative procedure to improve them, until we reach the desired
accuracy. In the following two examples we will use iterative procedures to compute the value ln(2)
using series and to solve an equation.
It is known that the following convergent series sums to ln 2:
$$\frac{1}{1} - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \frac{1}{5} - \cdots = \ln 2$$
Therefore we can pretend that we do not know about the existence of the log() function in the math module
and use the series to compute the value of ln 2. A convergent series has the following properties: (i) it
has a finite limit l (l < ∞) and (ii) for a given accuracy ε there is a large integer number N such that for
all n ≥ N:
$$|S_n - l| \le \epsilon$$
where S_n is a partial sum and ε is the accuracy. Therefore we can compute ln 2 with a desired accuracy ε
by simply adding subsequent elements of the series. We should continue the summation until the sum in
two subsequent steps changes by less than ε. This algorithm is illustrated in Table 1.5 and the code that
performs the task is shown below.

Table 1.5: Calculating the sum of a convergent series with desired accuracy (10^-3).
Step   Element             Sum      Converged?
1      1/1 = 1             1        No
2      1/2 = 0.5           0.5      No
3      1/3 = 0.333         0.833    No
4      1/4 = 0.25          0.583    No
···    ···                 ···      No
1001   1/1001 = 0.000999   0.6936   Yes
Code 24
total = 0
element = 1
epsilon = 1e-3
while 1.0/element > epsilon:
    if element % 2:
        total += 1.0 / element
    else:
        total -= 1.0 / element
    element += 1
# element now holds the index of the first term that was not added
print "ln(2) =", total, "after", element - 1, "steps."
This example introduces new operators, += and -=. For example element += 1 means “increment the
variable element by one”. It is equivalent to: element = element + 1. Further operators of this kind
are listed in Table 1.6.
In the following example we will solve the equation:
$$x - 2 = \ln x$$
First, to get an idea of what the solution might be, we will plot two functions (Figure 1.2):
$$L(x) = x - 2$$
$$R(x) = \ln x$$
Table 1.6: Operators with assignment and equivalent expressions.
Operator   Meaning                                          Equivalent expression
x += a     Increment x by a                                 x = x + a
x -= a     Subtract a from x                                x = x - a
x *= a     Multiply x by a                                  x = x * a
x /= a     Divide x by a                                    x = x / a
x %= a     Replace x with the remainder of x divided by a   x = x % a

Figure 1.2: Graphical solution of the equation x − 2 = ln x. (Plot of L(x) = x - 2 and R(x) = ln(x) for x from 0 to 6.)
The solution to our problem is an x such that L(x) = R(x). From Figure 1.2 we see that the equation
has two solutions, x1 ≈ 0.2 and x2 ≈ 3.2 (these are the points where the functions cross). We will use an
iterative technique to compute a more accurate value: we start with a guess x0 = 3.2 and compute the
right-hand side expression, R = ln(x0). Then we use the left-hand side expression to find the next value, x = R + 2.
This is our new, hopefully better, approximation to x. Then we substitute it into the right-hand side again
and repeat the cycle until both sides are equal (within a certain error ε):
Code 25
from math import log
x = 3.2
epsilon = 1e-5
step = 0
left = x - 2
right = log(x)
while abs(left - right) > epsilon:
    x = right + 2
    left = x - 2
    right = log(x)
    step += 1
    print "Step", step, ":   x =", x
This program uses the abs() function, which returns the absolute value of an expression.
Exercise 5: Try to play with different initial values of x. Are you able to find both solutions?
If not, try to rewrite the equation by taking the exponent of both sides, i.e.
$$e^{x-2} = x$$
Exercise 6: Compute the number e (the base of the natural logarithm) as a sum of the convergent
series:
$$e = \frac{1}{0!} + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \cdots$$
Exercise 7: Use the bisection method to find a root of the equation:
$$x^3 - 3x - 1 = 0$$
The range of the search and the precision should be given by the user.
Bisection works by repeatedly dividing the range into halves until the precision is
achieved. Consider the function from the equation above (Figure 1.3): the function has three
roots, however we will be searching for the one that is between a = −1 and b = 1. We start
by calculating f(a) and f(b). If f(a) and f(b) have different signs, there has to be a root
between them. Now, we divide the range [a, b] into halves [a, x1] and [x1, b] and calculate the
value of f(x1). Comparing the signs of f(a), f(x1) and f(b), we realize that the root is now
in the [a, x1] range. Therefore, we take it as a new range and divide it into halves. Now, we
calculate f(x2) and see that it has the same sign as f(a), but a different sign than f(x1), so the
root should be between x2 and x1 (here x1 = 0 and x2 = −0.5, with f(a) = 1, f(x1) = −1 and
f(x2) = 0.375). We continue this procedure until the length of the
range becomes smaller than the precision requested by the user. At that point, we can say
that we have found the root with the requested precision. Note that you do not need to keep
all the arguments and values in memory; at a single step of this procedure you need only six
variables: the end-points of the range and function values at the end-points, the middle-point
and the function value at the middle-point.
1.11 More complex types — lists
Having variables that can store just a single value is not handy enough. Soon you will want to store a
larger number (possibly unspecified) of values. In languages like C, for example, you would use arrays.
An array is a space in memory that can store a certain number of values — all of them must have the
same type. Arrays can be static, i.e. present as long as the program or function is executed and having a
well defined size, or dynamic, i.e. allocated when they are needed and freed afterwards. The size of the
dynamic array can be changed.
In Python, we will only start using arrays in the NumPy module. Standard Python language does not
have arrays, but has a concept which is similar: lists. There are differences, though. Lists are objects and
besides the values, they also have methods associated with them. Lists can contain elements of different
types. Lists are dynamic: elements can be added or removed and the size of the list changes accordingly.
Here are a few examples of how to create a list:
Code 26
a1 = [ 3, 6, 9 ]
a2 = [ "python", 3.14, 0 ]
a3 = [ 4, [ 5, 6, 7], 8 ]
a4 = [ a1, a2 ]
a5 = [ ]
Figure 1.3: The bisection method. Description in the text (Exercise 7). (Plot of f(x) for x from -3 to 3, with the points a, b, x1, x2, x3 and the corresponding function values marked.)
Lists are delimited with brackets and, as you can see in the case of a2, they can contain elements of different types:
strings, floats, integers etc. Lists can be nested, i.e. lists can also contain lists (a3). Obviously, you do not
need to specify the values explicitly; you can use variables, like in a4. Lists can be empty (a5).
When you want to retrieve or change an element of a list, you have to use the index of the element. Elements
of the list are indexed starting from zero. So, list a1 above has three elements with indices 0, 1, 2. The
code below shows how indexing works:
Code 27
x = [ 3, 6, 9, 12, 15 ]
print "x[1] =", x[1]
print "x[-1] =", x[-1]
print "x[1:3] =", x[1:3]
print "x[:3] =", x[:3]
print "x[2:] =", x[2:]
Here is the output:
x[1] = 6
x[-1] = 15
x[1:3] = [6, 9]
x[:3] = [3, 6, 9]
x[2:] = [9, 12, 15]
Since the indexing starts from 0, x[1] refers to the number 6. Indices can also be negative; x[-1] means
the last element, x[-2] is the one before last and so on. Indices can also refer to ranges. If you specify
the first and last index separated by a colon, i.e. x[1:3], you will retrieve a “sub-list”, i.e. a list containing
part of the original list. However, the indexing in this case is a little bit tricky. Note that x[1:3] returns
only [6, 9], i.e. elements with the indices 1 and 2. The element at the end index (here 3) is always excluded. Note
that the indices of the range can be omitted: x[:3] means “from the beginning up to, but not including, element 3” and x[2:]
means “from element 2 until the end”. It may seem that x[:] means exactly the same as x, but it does
not. Look at the example below:
Code 28
x = [ 3, 6, 9, 12, 15 ]
y = x
z = x[:]
x[1] = -1
print "x =", x
print "y =", y
print "z =", z
Output:
x = [3, -1, 9, 12, 15]
y = [3, -1, 9, 12, 15]
z = [3, 6, 9, 12, 15]
In this example we create list x, then we make two apparent copies, y and z. After that, we change one element of
x. As you can see, list y has also changed, but z has not. This is because y = x is like giving another name
(alias) to an existing object; it does not create a new list. z = x[:], on the other hand, copies all elements
from x to z. Although it may be confusing, such behaviour is useful when you have to deal with a large
amount of data. It allows you to save time and memory; you just have to remember that x and y refer to
the same object.
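The difference between an alias and a copy can also be tested explicitly. The sketch below assumes the is operator, which is not discussed in this section; it checks whether two names refer to the same object, whereas == only compares the contents. It uses print() with a single string so that it should run under both Python 2.6 and Python 3.

```python
x = [3, 6, 9, 12, 15]
y = x        # alias: a second name for the same list
z = x[:]     # copy: a new list with the same elements

same_xy = y is x      # one object, two names
same_xz = z is x      # two separate objects
equal_xz = z == x     # the contents are identical at this point

x[1] = -1             # modify the list through the name x
print("y is x: %s, z is x: %s" % (same_xy, same_xz))
print("y == x after the change: %s" % (y == x))
print("z == x after the change: %s" % (z == x))
```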
Nested lists can be used to store objects like multi-dimensional arrays or matrices. For example, the
matrix A:
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$$
can be handled in the following way:
Code 29
A = [ [ 1, 2, 3 ], [ 4, 5, 6 ], [ 7, 8, 9 ] ]
print "A[1][2] =", A[1][2]
Since we have a “list in a list” (two-dimensional array), we need two indices. In A[1][2], the first one (1)
refers to a row and the second (2) refers to a column; because indices start from zero, A[1][2] is 6.
Python has a special function, range(a, b, c), that creates lists of integer numbers, starting from a, up
to b (excluding b itself) and with a step of c. Arguments a and c are optional; if not supplied, the default
will be used (0 and 1, respectively). Here is an example:
Code 30
x = range(5)
y = range(5, 10)
z = range(3, 10, 2)
print "x = ", x
print "y = ", y
print "z = ", z
x = [0, 1, 2, 3, 4]
y = [5, 6, 7, 8, 9]
z = [3, 5, 7, 9]
As mentioned in the beginning of this section, lists are objects and have certain methods associated with
them. These methods are used to modify the lists, e.g. to add new elements. Methods are similar to
functions, but they are specific to the object and are invoked in a special way. For example: the method
append() adds a new element at the end of the list:
Code 31
x = [ ]
x.append(3)
x.append(6)
print "x = ", x
x = [3, 6]
As you can see, there is the name of the object (x), a dot and the name of the method with arguments in
parentheses. Some of the methods can also return a value. For example, the method count() returns
the number of occurrences of the specified element in the list. In such a case, you would usually like to
do something with the returned value, e.g. assign it to a variable and print it:
Code 32
x = [ 1, 2, 3, 4, 1, 2, 3, 1, 2, 1 ]
ones = x.count(1)
twos = x.count(2)
print "There are", ones, "ones and", twos, "twos."
There are 4 ones and 3 twos.
Lists have more methods, and each one has a short description embedded in its __doc__ attribute. You can list the methods with dir() and access
the description through any instance of a list:
Python 2.6.4 (r264:75706, Dec 7 2009, 23:19:43)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> dir([])
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__',
'__delslice__', '__doc__', '__eq__', '__format__', '__ge__',
'__getattribute__', '__getitem__', '__getslice__', '__gt__', '__hash__',
'__iadd__', '__imul__', '__init__', '__iter__', '__le__', '__len__',
'__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__',
'__setslice__', '__sizeof__', '__str__', '__subclasshook__', 'append',
'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse',
'sort']
>>> print [].sort.__doc__
L.sort(cmp=None, key=None, reverse=False) -- stable sort *IN PLACE*;
cmp(x, y) -> -1, 0, 1
>>>
There are also two useful functions that take lists as arguments: sum() calculates the sum of the elements and len() returns the number of elements. Here is an example, a one-line snippet from a script
that computes the average of the elements in the list x:
Code 33
average = sum(x)/len(x)
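One caveat about the snippet above: in Python 2, dividing one integer by another discards the fractional part, so for a list of integers the result would be truncated. A minimal sketch of a safer variant follows; the list x is a made-up example, and float() is used to force floating-point division. It should run under both Python 2.6 and Python 3.

```python
x = [1, 2, 4]    # hypothetical data: integers only

# Converting the sum to a float forces floating-point division,
# regardless of the Python version
average = float(sum(x)) / len(x)
print("average = %.4f" % average)
```

With these values, plain sum(x)/len(x) gives 2 under Python 2, while the version with float() gives 2.3333.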
Exercise 8: Use interactive Python to learn what the following methods do:
extend, index, insert, pop, remove, reverse and sort.
Exercise 9: Write a script that selectively lists files: the script should display a list of Python
scripts in the current directory, i.e. only those files that have their names ending with “.py”
extension. The list should be sorted alphabetically.
You will need the function listdir() from module os. This function returns a list of file
names in the directory given as the argument, e.g.
files = listdir('/dev')
will produce a list called files, containing all file names from the directory /dev. To list the
current directory, you may simply use the dot character, like in the shell: listdir(".").
1.12 Count-controlled loops
Most languages have two kinds of loops. One of them is condition-controlled, i.e. executed as long as the
condition is satisfied. Another kind of loop is count-controlled. This type of loop is executed a certain
number of times or, in the case of Python, for each element of a certain list. The general “rule of thumb”
for choosing the right type of loop is that you should use the count-controlled loop whenever it is easy to
predict how many times it has to be executed. Here is an example of a count-controlled loop:
Code 34
shop = [ "apples", "eggs", "ham", "milk", "potatoes" ]
for item in shop:
    print "The shop has", item
This loop picks subsequent elements from the list shop, assigns them to item and, for each value of
item, executes the block of code that follows.
The shop has apples
The shop has eggs
The shop has ham
The shop has milk
The shop has potatoes
In the next example we will use numerical integration to compute:
$$I = \int_0^{\pi} \sin x \, dx$$
The definite integral is equal to the area under the function's plot, within the range of
integration (here, [0, π]). Figure 1.4 shows that we can approximate the area with a set of n rectangles,
each δx wide and h high:
Figure 1.4: Numerical integration. (Plot of sin x on [0, π] covered by rectangles i = 0, 1, 2, 3, ..., each of width δx and height h.)
$$I \approx S = \sum_{i=0}^{n-1} \delta x \cdot h_i$$
When δx → 0 (that is, when n → ∞), the sum S → I. The height of the i-th rectangle, h_i, is simply equal to f(x_i) = sin x_i, where
x_i = δx (i + 1/2). We add 1/2 because the height is measured in the middle of the interval. In the following
code, we split the range [0, π] into 100 intervals, compute the integral and compare with the exact value,
which equals cos 0 − cos π = 2.
Code 35
from math import sin, cos, pi
intervals = 100
dx = pi/intervals
integral = 0.0
for i in range(intervals):
    xn = dx * (i + 0.5)
    h = sin(xn)
    rect = dx * h
    integral += rect
print "Numerical value:", integral
print "Exact value:", cos(0) - cos(pi)
In certain cases, you may want to terminate the loop before the conditional expression becomes False.
This can be done with the break instruction. Imagine you are writing a program that calculates the
average of the numbers entered by the user and you do not know how many numbers the user will enter.
The program could be as follows:
Code 36
s = 0.0    # The sum
n = 0      # Number of elements
while True:
    x = input("Enter a number (99 to finish): ")
    if x == 99: break
    s += x
    n += 1
print "The average is", s/n
We use an expression that is always true, therefore the loop is endless. However, if the user enters the
number 99, the break instruction will cause the program to exit immediately from the loop. Remember
that if the loops are nested (one loop inside another), the break statement will work only for one loop,
e.g. in the example below, the break instruction will make the script jump out of the inner loop, but the
outer loop will continue:
Code 37
while x > 0:              # Outer loop
    while y > 0:          # Inner loop
        if z == 0: break
If you do not want to exit the loop, but just skip the rest of the block and proceed to the next iteration,
you may use the continue instruction. Both break and continue work in both types of loops, count-controlled and
condition-controlled.
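The continue instruction can be illustrated with a short sketch. The list values is made-up data; negative numbers are skipped because sqrt() does not accept them, and the loop simply moves on to the next element. The snippet should run under both Python 2.6 and Python 3.

```python
from math import sqrt

values = [4, -1, 9, -7, 16]   # hypothetical input data
total = 0.0
skipped = 0
for v in values:
    if v < 0:
        skipped += 1
        continue              # skip the rest of the block for this element
    total += sqrt(v)          # reached only for non-negative numbers

print("Sum of square roots: %.1f (%d values skipped)" % (total, skipped))
```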
Exercise 10: Use Monte Carlo integration to calculate the same integral as in the example
above.
Hint: Monte Carlo integration works like this:
1. Define the boundaries [xa , xb ] as equal to the integration range and the boundaries
[ya , yb ] as equal to the minimum and maximum of the function in that range. Define a
counter p.
2. Generate two random numbers, x and y, within respective boundaries.
3. Check if the point (x, y) is below the function, i.e. if f(x) >= y; increment the counter p if
true.
4. Repeat steps 2 and 3 several times (n).
5. Compute the ratio r of points that have fallen under the graph (p) to the total number
of steps (n).
6. The integral is equal to the ratio r times the surface of the rectangle defined by the
boundaries, i.e. (xb − xa ) · (yb − ya ).
You may also need the function random() from the module random.
1.13 Pretty output
Imagine you want to print a table of values of the sine function, in the range [−90, 90] every 30 degrees.
Here is the script that does it:
Code 38
#!/usr/bin/python
from math import sin, pi
for x in range(-90, 91, 30):
    xx = x / 180.0 * pi
    print x, sin(xx)
However the output does not look very pretty:
-90 -1.0
-60 -0.866025403784
-30 -0.5
0 0.0
30 0.5
60 0.866025403784
90 1.0
What we need is the formatted output. Compare the previous example with the next one:
Code 39
#!/usr/bin/python
from math import sin, pi
for x in range(-90, 91, 30):
    xx = x / 180.0 * pi
    print "%+3d   % 6.3f" % (x, sin(xx))
In this example the columns are aligned, the angle is always printed with the sign and the sine value has
three decimal digits:
-90   -1.000
-60   -0.866
-30   -0.500
 +0    0.000
+30    0.500
+60    0.866
+90    1.000
Each of the lines above is formatted according to a general specification that is common to several
programming languages. This specification always starts with the percent character %, followed by other
symbols that define the formatting and a letter that specifies what type of value is expected, e.g.: s for a string,
d for an integer number, f for a floating point number, e for a floating point number in exponent (scientific) notation (e.g.
1e-10). Consult Table 1.7 for details. The symbols can be combined, but they always occupy specified
positions, i.e. the percent comes first, then + or space, then zero, number, dot, number and finally a letter.
For example: %+010.4f will print a floating point number in a 10-character field, aligned to the right-hand
side, left-padded with zeros, with four decimal digits and the sign.

Table 1.7: String formatting symbols. The symbols are always used between % and the letter defining
value type (s, d, f, e, etc.).
Symbol    Meaning                                                  Example
%         Percent character                                        %%
number    Width of the field in characters                         %6d
.number   Number of decimal digits                                 %.3f
+         Always print the sign of the number                      %+6.3f
space     Print minus for negative numbers and space for others    % 6.3f
0         Fill the field with leading zeros                        %06d
-         Left-align the value                                     %-10s
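The formatting symbols can be tried out on a few values. The sketch below uses made-up numbers and collects the formatted strings in a list (the name examples is arbitrary); a vertical bar is added after the left-aligned string to make the padding visible. It should run under both Python 2.6 and Python 3.

```python
value = 3.14159

examples = [
    "%+010.4f" % value,     # sign, zero padding, width 10, 4 decimal digits
    "%-10s|" % "left",      # left-aligned in a 10-character field
    "%06d" % 42,            # integer padded with leading zeros
    "%.3e" % 12345.678,     # exponent notation with 3 decimal digits
    "% 6.3f" % 0.5,         # space in place of the plus sign
    "%d%%" % 50,            # a literal percent character
]

for text in examples:
    print(text)
```

The first entry comes out as +0003.1416: exactly ten characters, including the sign and the decimal point.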
Exercise 11: Modify the program from Exercise 10, so that it will display intermediate results
in a table. The table should contain the number of steps done, the number of points found
under the graph and a current estimate of the integral. The table should not be too long,
it should have 10–20 entries. To do so, you may, for example, print an entry every f steps,
where f = N/20 and N is the total number of steps.
 Steps    Hits    Integral
----------------------------
  5000    3191   2.005e+00
 10000    6347   1.994e+00
 15000    9579   2.006e+00
 20000   12802   2.011e+00
 25000   16007   2.011e+00
 30000   19137   2.004e+00
 35000   22361   2.007e+00
 40000   25589   2.010e+00
 45000   28750   2.007e+00
 50000   31978   2.009e+00
Final result:    2.00923699753
Expected result: 2.0
1.14 Tuples
In the following sections, we compare four complex types of Python: strings, lists, tuples and dictionaries.
All of them are similar in the sense that all can be indexed (“hashed” in the case of dictionaries) and all
are slightly similar to the idea of a table. The tuple is the simplest type from this group. It behaves much
like a list, except that it is static (immutable). That means you cannot modify the elements of a tuple, delete items
from it or add new items. But thanks to that, tuples are faster, so remember: use them instead of
lists whenever you can. Tuples are written in parentheses. They also do not have as many methods as
lists. Besides that, they behave like lists:
Python 2.6.4 (r264:75706, Mar 17 2010, 10:33:29)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> t = ('a', 'b', 'c', 'd', 'a', 'b', 'c', 'a', 'b', 'a')
>>> print t.count(’b’)
3
>>>
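What "static" means in practice can be shown with a short sketch. It assumes the try/except construct, which is not covered in this section, to catch the error that Python raises when an element of a tuple is assigned to. The variable names are arbitrary. To obtain a modified version, the tuple is converted to a list and back.

```python
t = ('a', 'b', 'c')

try:
    t[0] = 'z'                 # not allowed: tuples cannot be modified
    changed = True
except TypeError:
    changed = False
    print("Tuples cannot be modified in place")

# Workaround: build a new tuple from a modified list
as_list = list(t)
as_list[0] = 'z'
t2 = tuple(as_list)
print("original: %r, new: %r" % (t, t2))
```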
1.15 Strings and methods
Strings are composed exclusively of characters, although the limitation to ASCII characters has been lifted
and you can use, for example, Unicode, to encode your national characters. If you do so, you have to
declare the encoding in the head of your script, like in the example below:
Code 40
#!/usr/bin/python
# -*- coding: UTF-8 -*-
print "Żóltko"
Strings can be placed in apostrophes or quotation marks. There is also a special syntax, where strings are
encapsulated in triple apostrophes or quotation marks. In that case, the string can be broken into several
lines:
Code 41
# argv has to be imported from the sys module; version is assumed to be
# an integer defined earlier in the script
help_text = """
Usage:
%s -f FILENAME -v X Y Z
-f FILENAME - name of the file to read
-v X Y Z    - components of the vector
Version: %d
"""
print help_text % (argv[0], version)
Some of the characters have to be specified in a special way, either because they have a certain function in
Python (e.g. apostrophe) or because they are not printable. For instance, if you want to print a tabulator
or new-line character, use \t and \n. These are so-called “escape sequences”. They start with the \
character and they are interpreted as special characters. Since quotation marks and apostrophes are
used to open and close strings, the character that delimits a string cannot be used directly inside that string and has to be “escaped”.
Examples:
Code 42
print "Quotation mark (\") must be escaped here,"
print 'but not here ("), because this string is in apostrophes'
print "Two empty lines after this one:\n\n"
print "\tTabulation before the text"
print "The backslash (\\) must be escaped too."
# %% is needed only when the % formatting operator is applied to the string
print "In a format string the percent character is written as %%, e.g. %d%%" % 100
Strings have their own set of methods that facilitate the processing of text. Those that need special
attention are split() and the is*() family. A string can be converted into a list by splitting it at
all occurrences of a selected character. This could be useful to read in a CSV file [6], for example. Imagine
you have a line that contains time, temperature and volume separated by commas:
Code 43
text = "24.0,298.15,1000.0"
record = text.split(',')
print "Time:", float(record[0])
print "Temperature:", float(record[1])
print "Volume:", float(record[2])
[6] CSV: comma-separated values, a text file where each row represents a record of data and the fields (or values) are separated
by commas; in countries where the comma is used for separating fractional and integer parts of numbers (like in Slavic or Latin
countries), the semi-colon is used instead.
In this script, the text is split at each comma and converted into a list called record. Since there are two
commas, the list will have three elements. After conversion to a list, the numbers are still strings; they are
not converted to numbers automatically (i.e. record is a list of strings). If you need them to be numbers,
you have to use type casting, like in the example above (the float() function). To spare yourself typing,
you can use the map() function to convert elements of the list:
Code 44
text = "24.0,298.15,1000.0"
record = map(float, text.split(','))
print "Time:", record[0]
print "Temperature:", record[1]
print "Volume:", record[2]
The map() function takes two arguments. The first one must be a name of a function and the second one
must be a sequence. Sequence is a type that can be indexed, i.e. string, tuple or list. The function will
be applied to every element of the sequence and a new list will be built. Another example:
>>> l = map(float, "12345")
>>> print l
[1.0, 2.0, 3.0, 4.0, 5.0]
>>>
The join() method of strings plays (almost) the opposite role to the split() method. It generates a string
(let's say S) by concatenating the elements of a sequence L, with the string sep inserted between them:
>>> T = "1,2,3,4,5"
>>> sep = ","
>>> L = T.split(sep)
>>> print L
['1', '2', '3', '4', '5']
>>> S = sep.join(L)
>>> print S
1,2,3,4,5
>>>
The initial string T is divided at each occurrence of the comma character and the chunks are collected in
the list L. To recover the original string, the join() method is used: it has to be applied to the separator
string (sep) that is inserted between elements of the sequence given as the argument of join method.
Strings in Python have a family of methods whose names start with is. These methods all return a boolean
value (True or False), depending on whether the string fulfils certain conditions. For instance, the method islower
checks if all the letters are lower case; the method isdigit checks if all the characters are digits.
What is important is that all the characters have to match the condition, not just any of them.
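A few calls show how the is*() methods behave in practice. The sample strings below are made up; the results are collected in two lists so that they can be printed side by side. The snippet should run under both Python 2.6 and Python 3.

```python
samples = ["atgc", "ATGC", "Atgc", "1234", "12a4", ""]

lower_flags = []
digit_flags = []
for s in samples:
    lower_flags.append(s.islower())
    digit_flags.append(s.isdigit())
    print("%-6r islower: %-5s isdigit: %s" % (s, s.islower(), s.isdigit()))
```

Two details are worth noting: an empty string gives False for both methods, and islower() looks only at letters, so "12a4" counts as lower case while "1234", which has no letters at all, does not.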
Besides several methods specific to strings, you can also apply operators to them. If you look at the
content of the string object:
>>> dir('')
['__add__', '__class__', '__contains__', '__delattr__', '__doc__',
'__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__',
'__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__',
'__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__',
'__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__',
'__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'_formatter_field_name_split', '_formatter_parser', 'capitalize',
'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs',
'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower',
'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip',
'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition',
'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip',
'swapcase', 'title', 'translate', 'upper', 'zfill']
>>>
you will note methods such as __add__, for example. Python uses this double-underscore notation for methods
which in fact implement standard operators. In this case, the presence of __add__ means that you can apply the
addition operator (+) to strings; __mul__ means that you can multiply strings (although only by integer
numbers); __eq__ means that the comparison operator (==) has also been implemented and so on:
>>> a = "AAA"
>>> b = "BBB"
>>> c = a + b
>>> print c
AAABBB
>>> d = a * 4
>>> print d
AAAAAAAAAAAA
>>> print a == b
False
>>>
Exercise 12: Review the strings’ methods, in particular: capitalize, center, count, find, index,
lstrip, rstrip, strip, replace, startswith, endswith, title, and upper.
Exercise 13: Write a script that asks the user for his/her name and displays it (1) in capital
letters, (2) starting from capital letter followed by small letters, (3) reversing the order of the
names, (4) spelled backwards, and (5) spread. For example:
What's your name? BoRYs KrzySZtOF SzefCZYK
1. BORYS KRZYSZTOF SZEFCZYK
2. Borys Krzysztof Szefczyk
3. SzefCZYK KrzySZtOF BoRYs
4. KYZCfezS FOtZSyzrK sYRoB
5. B o R Y s   K r z y S Z t O F   S z e f C Z Y K
Exercise 14: Write a script that converts integer numbers into text in the following way:
123 → "one two three"
Use the join() method.
1.16 Dictionaries
A dictionary is a table similar to lists and tuples, but instead of numerical indices, keys are used. A key
can be almost any immutable Python object, e.g. a string, a number or a tuple. Using dictionaries usually makes scripts
easier to understand, because we do not need to remember what the indices mean. For instance, imagine
a script that deals with a protein; at some point we have to count how many residues of each type are in
the protein. The result can be conveniently stored in a dictionary:
Code 45
residues = { "ALA" : 21, "GLY" : 14, "PRO" : 3, "CYS" : 2 }
Here, we use the residue names as keys that correspond to occurrences (values). The syntax of a dictionary is as follows: key-value pairs are separated by commas, and within each pair the key and the value are separated by a colon,
key : value. It is possible to create an empty dictionary, add new pairs and modify existing values:
Code 46
# Create an empty dictionary
residues = { }
# Add new pair (residue counter)
residues["ALA"] = 0
# Change the value
residues["ALA"] = 5
# Increment the existing counter
residues["ALA"] += 1
Two important caveats to bear in mind. First, the order of the pairs in a dictionary is not preserved, so
the first added pair will not necessarily remain first. Therefore, the only way to retrieve a pair from the
dictionary is by using the key. Or, using simpler words: Python is “allowed” to rearrange the pairs in a
dictionary. The second caveat is that referring to a non-existent key is an error (Python raises KeyError). Therefore,
dictionaries have the method has_key() that can be used to check if the key exists. You should use it
before referring to a key, unless you can be sure that it exists. This is how it is typically done:
Code 47
residues = { "ALA" : 20, "GLY" : 15 }
if residues.has_key("ALA"):
    print "ALA:", residues["ALA"]
else:
    print "ALA: no such residue"
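The same check can be written in other ways. The sketch below is an alternative, not the method used in this text: it relies on the in operator and on the get() method of dictionaries, both of which exist in Python 2.6 and in Python 3 (where has_key() is no longer available). The key "TRP" and the default value 0 are chosen only for illustration.

```python
residues = {"ALA": 20, "GLY": 15}

# The "in" operator tests whether a key is present
has_ala = "ALA" in residues
has_trp = "TRP" in residues

# get() returns the stored value, or the given default for a missing key
n_ala = residues.get("ALA", 0)
n_trp = residues.get("TRP", 0)

print("ALA present: %s, count: %d" % (has_ala, n_ala))
print("TRP present: %s, count: %d" % (has_trp, n_trp))
```

Using get() with a default of 0 is convenient for counters, because a missing key can then be treated as "not seen yet".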
Other useful methods include keys() and values(), which return lists of keys and values, respectively:
>>> residues = { "ALA" : 20, "GLY" : 15, "CYS" : 5 }
>>> print residues.keys()
['CYS', 'GLY', 'ALA']
>>> print residues.values()
[5, 15, 20]
>>>
The keys() method is useful when we want to iterate over the dictionary in a loop:
Code 48
residues = { 'ALA' : 20, 'GLY' : 15, 'TYR' : 23, 'PRO' : 2 }
for k in residues.keys():
    print "%3s = %d" % (k, residues[k])
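Because the order of the pairs is not guaranteed, the loop above may print the residues in any order. If a predictable order is needed, the keys can be sorted first. The sketch below assumes the built-in sorted() function, which is not introduced in this text, and collects the formatted lines in a list (the name lines is arbitrary) before printing them. It should run under both Python 2.6 and Python 3.

```python
residues = {'ALA': 20, 'GLY': 15, 'TYR': 23, 'PRO': 2}

lines = []
for k in sorted(residues.keys()):      # keys in alphabetical order
    lines.append("%3s = %d" % (k, residues[k]))

print("\n".join(lines))
```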
Exercise 15: Write a script that reads text entered by the user and counts occurrences of
each character (case-insensitive). Use a dictionary to store the results. The output should look
as follows:
~$ python foo.py
Text: Programming in Python is cool!
Character  Count
           4
!          1
a          1
c          1
g          2
h          1
(and so on)
1.17 Passing arguments to the script
Programs can receive data not only interactively, but also from the environment and the command line.
Passing arguments through the command line is very common in UNIX and especially useful when the
programs are used in batch mode [7]. In Python, all the arguments passed to the script are placed in the list
called argv. This list has to be imported from the sys module:
Code 49
from sys import argv
print "Program name is", argv[0]
print "%d additional arguments have been passed." % (len(argv) - 1)
The first argument (index 0) is always the name of the script, so the length of the list is at least 1. The
arguments are always passed as strings, therefore if you want to pass numbers, you have to convert them
afterwards, using functions such as int() or float().
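As an illustration of the conversion step, the sketch below turns a list of argument strings into floats and stops with a readable message for anything that is not a number. The helper name to_floats is hypothetical, and a hand-written list stands in for argv[1:] so that the example can be run without a command line:

```python
def to_floats(args):
    """Convert a list of strings to floats (hypothetical helper)."""
    numbers = []
    for text in args:
        try:
            numbers.append(float(text))
        except ValueError:
            # float() raises ValueError for text such as "abc"
            raise SystemExit("Not a number: %r" % text)
    return numbers

# In a real script this would be: to_floats(argv[1:])
values = to_floats(["3", "2.5", "-1e3"])
print(values)
```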
1.18 Advanced command line options
Linux programs, and those from the GNU family (see footnote 8) in particular, have a common way of handling command-line options and arguments. By argument we mean a string (like a file name) that is passed to the program
through the command line. Options are like switches: they change the default behaviour of the programs
and often have some parameters. An example:
~$ python -BEi -m numpy -v tutorial.py
Program python has been run with one argument (tutorial.py) and five options: B E i m v. The
options are always prefixed with a hyphen, but can be specified either all together (-BEi) or one-by-one
(-B -E -i). Here, the option -m has a parameter numpy.
Some programs recognize short and long options, for example, the two lines below should have exactly
the same effect:
~$ mysql --html --user=pybib --password
~$ mysql -H -u pybib -p
7 batch mode: in contrast to the interactive mode, the program is not started directly by the user, but run from a script or by the system; in this mode, the program usually reads its input from files, “behind the back” of the user
8 http://www.gnu.org
This way of handling arguments and options gives a lot of freedom to the user and a lot of trouble to
the programmers, who have to parse correctly what the user has typed in. Fortunately, there is the getopt
library and a Python interface to it. The getopt library parses the command line, divides it into arguments and
options and returns them in separate Python lists. Parameters are also handled. Let us try an example:
we are writing a file converter, so any time the script is run, we need exactly two file names. Additionally,
we have the -e option (with parameter) to choose the encoding, the -v option for verbose mode and the
-h option to display help:
Code 50
import getopt
from sys import argv
shortop = "vhe:"
longop = ["verbose", "help", "encoding="]
opts, args = getopt.getopt(argv[1:], shortop, longop)
if len(args) != 2:
    print "Exactly two file names must be specified!"
print "opts = ", opts
print "args = ", args
Strings shortop and longop define permitted options. If the option has a parameter, it is indicated with
a colon ‘:’ in the short-option list and equality sign ‘=’ in the long-option list. Running this script with
arguments and options:
~$ ./options.py -v -e utf8 file1 file2
produces the output:
opts = [('-v', ''), ('-e', 'utf8')]
args = ['file1', 'file2']
Note that options are returned as tuples: the first element is the option as it was typed (for example
-v or --verbose) and the second element is the parameter, or an empty string if the option has no
parameter. Now, try and see what happens when you forget to specify a parameter after the option -e
or when you use by mistake an option which has not been specified in the script!
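The list of tuples is usually processed in a loop that sets ordinary variables. The sketch below shows one possible way to do this for the converter example; the function name parse and the default encoding "ascii" are assumptions made for the illustration. Unknown options and missing parameters make getopt raise getopt.GetoptError, which is caught here and turned into a short message:

```python
import getopt

def parse(argv_tail):
    """Return (verbose, encoding, file names) for the converter example."""
    try:
        opts, args = getopt.getopt(argv_tail, "vhe:",
                                   ["verbose", "help", "encoding="])
    except getopt.GetoptError as err:
        # Raised for unknown options and for missing parameters
        raise SystemExit("Bad command line: %s" % err)
    verbose = False
    encoding = "ascii"      # assumed default, not taken from the text
    for name, value in opts:
        if name in ("-v", "--verbose"):
            verbose = True
        elif name in ("-e", "--encoding"):
            encoding = value
    return verbose, encoding, args

# A hand-written list stands in for argv[1:]
result = parse(["-v", "--encoding=utf8", "file1", "file2"])
print(result)
```

Testing both the short and the long spelling in the loop is needed because getopt reports each option in the form in which the user typed it.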
1.19 Working with files
The simplest way of reading and saving files has nothing to do with Python itself. You can use the UNIX
mechanism of redirecting input and output, to read and save data. Each program in UNIX has three
streams associated: standard input, standard output and standard error. They are treated in a similar
way to files, but usually they are attached to the keyboard (input) and screen (output and error). If
you want to change the default behaviour — redirect the streams — you can use the characters > and <.
Instead of manually typing all input, we can read it from the file:
~$ ./script.py < input.txt
For this to work, the script has to use the input() or raw_input() functions, like when you read the
data from the keyboard. To save the text printed by the script on the screen, you just have to redirect the
output to the file:
~$ ./script.py > output.txt
In this case, whatever would appear on the screen will go to the file instead (there will be no output on
the screen). The sign > redirects only standard output and not standard error, so any error message
will still appear on the screen.
Let us try this approach to make an XYZ file with coordinates. We will write a script that creates a lattice
of metallic gold. The lattice is cubic, with atoms 2.88 Å apart. We will ask the user to tell us the number of
atoms in x, y and z directions:
Code 51
lattice_constant = 2.88
nx = input("Number of atoms (x): ")
ny = input("Number of atoms (y): ")
nz = input("Number of atoms (z): ")
The header of the file must contain the number of atoms and a comment:
Code 52
# Total number of atoms
n = nx * ny * nz
print n
# Comment
print "Lattice of gold %d x %d x %d" % (nx, ny, nz)
The body of the file contains a single line for each atom; each line contains the atom name and coordinates. Since the lattice is cubic, we use the same lattice constant for all dimensions. There are three
nested loops: the first one makes flat slices of atoms along x dimension; second loop makes rows of atoms
in each slice (y dimension); third, internal loop makes atoms in a single row (z):
Code 53
# Loop over x, y and z
for i in range(nx):
    x = i * lattice_constant
    for j in range(ny):
        y = j * lattice_constant
        for k in range(nz):
            z = k * lattice_constant
            print "Au %12.6f %12.6f %12.6f" % (x, y, z)
The complete script is as follows:
Code 54
lattice_constant = 2.88
nx = input("Number of atoms (x): ")
ny = input("Number of atoms (y): ")
nz = input("Number of atoms (z): ")
# Total number of atoms
n = nx * ny * nz
print n
# Comment
print "Lattice of gold %d x %d x %d" % (nx, ny, nz)
# Loop over x, y and z
for i in range(nx):
    x = i * lattice_constant
    for j in range(ny):
        y = j * lattice_constant
        for k in range(nz):
            z = k * lattice_constant
            print "Au %12.6f %12.6f %12.6f" % (x, y, z)
However, if we redirect the output from our script to a file:
~$ ./gold.py > gold.xyz
the prompts for the number of atoms will also be printed to the file. This is wrong, since the user will not
see the prompts and they will “contaminate” the file. We can circumvent the problem by reading the number
of atoms from the command line. We will change the initial part of the script to read variables nx, ny and
nz from the command line:
Code 55
from sys import argv
lattice_constant = 2.88
nx = int(argv[1])
ny = int(argv[2])
nz = int(argv[3])
Now we can run the script with arguments and safely redirect output to a file:
~$ ./gold.py 3 3 4 > gold.xyz
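One way to keep the lattice logic separate from input and output is to collect the lines in a list and print them at the end. The function name lattice_lines, the parameter name a and the choice to return strings are assumptions made for this sketch, not part of the original script:

```python
def lattice_lines(nx, ny, nz, a=2.88):
    """Return the lines of an XYZ file for a simple cubic lattice."""
    lines = [str(nx * ny * nz),
             "Lattice of gold %d x %d x %d" % (nx, ny, nz)]
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                lines.append("Au %12.6f %12.6f %12.6f"
                             % (i * a, j * a, k * a))
    return lines

# Small demonstration: two atoms stacked along z
for line in lattice_lines(1, 1, 2):
    print(line)
```

In the real script the three numbers would come from int(argv[1]), int(argv[2]) and int(argv[3]), exactly as in Code 55.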
Exercise 16: Write a script to build the lattice of caesium iodide (CsI). Lattice parameters
can be found e.g. on Wikipedia: http://en.wikipedia.org/wiki/Caesium_iodide
However, if you would like to work with many files or use files and work with the program interactively
at the same time, you should use the Python mechanism of handling files. The first important thing to
realize is that the file is just another object in the script and is represented by a variable. The variable is
not a file name, although we need the file name to find it. To read or write a file, it has to be opened and
it has to be opened in the right mode — ’r’ for reading or ’w’ for writing. Reading is the default mode,
so the mode symbol can be skipped:
Code 56
# Both lines are correct and both do the same:
input = open('data.txt', 'r')
input = open('data.txt')
Writing mode has to be specified explicitly:
Code 57
output = open('data.txt', 'w')
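An opened file should eventually be closed. Since Python 2.6 the with statement can take care of this automatically, even if an error occurs inside the block. A short sketch in Python 3 syntax; the file name example_data.txt is made up for the illustration and the file is created in a temporary directory:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example_data.txt")

# The file is closed automatically at the end of each 'with' block
with open(path, "w") as output:
    output.write("123\n456\n789\n")

with open(path) as source:
    content = source.read()

print(content)
print(source.closed)
```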
When the file is open for reading and the file object has been created, you can use three methods of
reading data. Before we progress further, imagine that we have a file called data.txt and it contains
three lines (instead of using your imagination, you can actually create the file and try out what follows):
123
456
789
Imagine also that the file has a pointer – an “arrow” indicating position in the file. If the file has been
opened for reading the pointer is indicating the first byte (character) of the file.
The first method to read data is read() and it treats the file like a single string; the method accepts a
single argument: the number of bytes (or characters) to read. So, the following code:
Code 58
input = open('data.txt')
data = input.read(5)
print data
will read five characters from the file (new-line characters also count) and the output would be:
123
4
Five characters have been read: 1, 2, 3, new-line and 4. The pointer has also been advanced past the fifth
character, so the subsequent reading operation will start from there:
Code 59
data = input.read(15)
print data
56
789
This time we have requested more bytes than are left in the file and the read() method will return all the
remaining characters. We can also read the whole file at once, by skipping the argument of the read()
method:
Code 60
data = input.read()
The second method, readline(), is a little bit more “intelligent”; it reads a single line from the file or,
more precisely, it reads characters from the current position in the file to the nearest new-line character.
The new-line character is also read:
Code 61
input = open('data.txt')
data = input.readline()
print data
123
The third method, readlines(), is the most sophisticated: it reads the whole file, but the text is already
split into lines and returned as a list:
Code 62
input = open('data.txt')
data = input.readlines()
print data
['123\n', '456\n', '789\n']
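A file object can also be used directly in a for loop, which yields one line at a time without building the whole list first. The trailing new-line character can be removed with rstrip(). The sketch below (Python 3 syntax) recreates the three-line file in a temporary location so that it is self-contained:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("123\n456\n789\n")

numbers = []
with open(path) as f:
    for line in f:                       # one line per iteration
        numbers.append(int(line.rstrip("\n")))

print(numbers)
```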
If you want to rewind the file, to read the data multiple times, you can use the seek(offset) method,
where offset is in bytes, counting from the beginning of the file:
Code 63
# Rewind to the beginning
file.seek(0)
# Move to offset 10, i.e. the 11th byte
file.seek(10)
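The effect of seek() is easiest to see on the three-line example file. In the sketch below the file is opened in binary mode, an assumption made so that the offsets are exact byte counts on every platform; the data are therefore returned as bytes:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "wb") as f:
    f.write(b"123\n456\n789\n")

with open(path, "rb") as f:
    first = f.read(5)       # five bytes from the start
    f.seek(0)               # back to the beginning
    again = f.read(3)       # the first three bytes once more
    f.seek(8)               # offset 8 is the start of the third line
    last = f.read()         # everything up to the end of the file

print("%r %r %r" % (first, again, last))
```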
Writing to the file can be performed in three ways. Method write() simply writes a string given as the
argument. New-line character is not added automatically, so you have to take care of it:
Code 64
text="Line\n"
output.write(text)
The method writelines() is compatible with readlines(); it saves lines stored in a list:
Code 65
lines = [ "First\n", "Second\n", "Third\n" ]
output.writelines(lines)
The third way of writing files is by using the print statement. In this case the file object has to be given
after >> signs:
Code 66
print >>file, "Script output:"
print >>file, "%d %d %d" % (x, y, z)
print >>file, text
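The print >>file form is specific to Python 2. In Python 3, print is a function and the target stream is given with the file keyword; in Python 2.6 and later the same syntax becomes available after from __future__ import print_function. A sketch in Python 3 syntax, writing to a temporary file whose name out.txt is made up for the illustration:

```python
import os
import tempfile

x, y, z = 1, 2, 3
path = os.path.join(tempfile.mkdtemp(), "out.txt")

with open(path, "w") as out:
    # The file keyword plays the role of '>>file'
    print("Script output:", file=out)
    print("%d %d %d" % (x, y, z), file=out)

with open(path) as f:
    written = f.read()
print(written)
```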
To see these commands in “action”, we will do a step-by-step analysis of a script that converts a molecule
in PDB format, to the XYZ format:
Code 67
#!/usr/bin/python
from sys import argv
pdb_name = argv[1]
xyz_name = argv[2]
pdb = open(pdb_name)
pdb_data = []
for line in pdb.readlines():
    if line.startswith('ATOM '):
        atom = {}
        atom['symbol'] = line[13]
        atom['x'] = float(line[30:38])
        atom['y'] = float(line[38:46])
        atom['z'] = float(line[46:54])
        pdb_data.append(atom)
pdb.close()
n_atoms = len(pdb_data)
xyz = open(xyz_name, 'w')
print >>xyz, n_atoms
print >>xyz
for a in pdb_data:
    print >>xyz, "%1s % 8.3f % 8.3f % 8.3f" % \
        (a['symbol'], a['x'], a['y'], a['z'])
xyz.close()
First, let us look at an example PDB file of a methanol molecule:
TITLE
HEADER
ATOM      1  C   MOH A   1       0.000   0.000   0.000  1.00  0.00           C
ATOM      2  H   MOH A   1       0.000   0.000   1.089  1.00  0.00           H
ATOM      3  H   MOH A   1       1.027   0.000  -0.363  1.00  0.00           H
ATOM      4  H   MOH A   1      -0.513  -0.889  -0.363  1.00  0.00           H
ATOM      5  O   MOH A   1      -0.660   1.143  -0.467  1.00  0.00           O
ATOM      6  H   MOH A   1      -0.660   1.143  -1.414  1.00  0.00           H
END
PDB files contain much more information than the XYZ files. We only need to extract the atom symbols
and the coordinates. Our script is expecting to find the name of an existing PDB file and a non-existing
XYZ file in the command line, therefore the file names are read from argv:
Code 68
#!/usr/bin/python
from sys import argv
pdb_name = argv[1]
xyz_name = argv[2]
Next we open the PDB file for reading and create the file-object pdb:
Code 69
pdb = open(pdb_name)
The coordinates and symbols will be stored in the list pdb_data. We have to create the list (empty) in
order to add the atoms one-by-one. We don’t know yet how many atoms are in the file, but we don’t need
this information at this point:
Code 70
pdb_data = []
Now, we read the PDB file using the readlines method. This method creates a list of lines from the file.
We don’t initialize this list explicitly (we don’t name it); instead it is inserted directly into the loop:
Code 71
for line in pdb.readlines():
Each line from the list will be assigned in turn to the variable line. Since not all the lines in the file contain
atoms (a typical PDB file also contains other information) we must filter out the information. Here we use
the startswith() method to find out if the line contains a description of an atom. Be aware that in PDB
files there are also hetero atoms and the identifier of their records is “HETATM”.
Code 72
    if line.startswith('ATOM '):
The first line of the XYZ file contains the number of atoms. We will collect the atoms in the pdb_data
and the length of this list will tell us the number of atoms. Each element of this list will correspond to a
single atom. The “single atom” in this context means: the symbol, x-, y- and z-coordinate. We could keep
these four values in a tuple, list or dictionary — the choice is more a matter of taste. Here we decide to
use dictionaries. For each atom we must initialize an empty dictionary (remember, we are still inside of
a loop):
Code 73
atom = {}
Next, we add the four components to the dictionary. Perhaps the most obvious way of treating the data
would be to use the split() method, to separate the fields in the line, but this is a wrong approach
to handle PDB files. The fields in a line of a PDB file have a fixed length and position. Besides, if
the number/value is big, it might occupy the whole field and there would be no space between fields.
Therefore, we should refer to the PDB file format description [3], where we find Table 1.8. From this
table we learn, for example, that the x-coordinate should be found in columns 31 − 38. In Python, we are
indexing from 0, so that means range 30 − 37.
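The difference between splitting and slicing can be demonstrated on a record in which two wide coordinates touch each other. In the sketch below the record is made up for the illustration (residue ALA, atom CA, large negative coordinates) and is built with a format string so that every field lands in its proper columns; the helper name parse_atom is hypothetical:

```python
def parse_atom(line):
    """Extract name and coordinates from a PDB ATOM record by column."""
    return {
        'name': line[12:16].strip(),     # columns 13-16
        'x': float(line[30:38]),         # columns 31-38
        'y': float(line[38:46]),         # columns 39-46
        'z': float(line[46:54]),         # columns 47-54
    }

# Made-up record: x and y fill their fields completely,
# so there is no space between them
line = "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f" % (
    1, " CA", "ALA", "A", 1, -123.456, -234.567, 1.0)

atom = parse_atom(line)
print("%s %.3f %.3f %.3f" % (atom['name'], atom['x'], atom['y'], atom['z']))

# split() cannot separate the two coordinates
merged = line.split()[6]
print(merged)
```

Slicing recovers both numbers, whereas split() returns them as a single field that float() would reject.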
Table 1.8: PDB file format (excerpt).
Columns   Type        Definition
1-6                   The string “ATOM ”
7-11      Int         Atom serial number
13-16     String      Atom name
17        Character   Alternate location indicator
18-20     String      Residue name
22        Character   Chain identifier
23-26     Int         Residue sequence number
27        Character   Code for insertion of residues
31-38     Float(8.3)  Coordinates for X in Angstroms
39-46     Float(8.3)  Coordinates for Y in Angstroms
47-54     Float(8.3)  Coordinates for Z in Angstroms
55-60     Float(6.2)  Occupancy
61-66     Float(6.2)  Temperature factor
77-78     String      Element symbol, right-justified
79-80     String      Charge on the atom
Code 74
        atom['symbol'] = line[13]
        atom['x'] = float(line[30:38])
        atom['y'] = float(line[38:46])
        atom['z'] = float(line[46:54])
        pdb_data.append(atom)
Above, we made one important simplification. We assume that all atom symbols are one-letter and we
just read the single character at index 13 (column 14 of the record). In fact, atom names are strings of
1 to 4 characters. It is difficult to program this part of the script in a universal way, because, for
example, we would never know if the name CA refers to a calcium atom or to the alpha carbon. The
original file format was designed for proteins, but nowadays it is used for inorganic molecules too. Now
we can close the file and count how many atoms are in the molecule:
Code 75
pdb.close()
n_atoms = len(pdb_data)
All the data are ready and it is time to save them to the file. First the number of atoms, then a comment
(empty line here) and finally coordinates — that is the format of the XYZ file. For writing the coordinates,
we use a loop again:
Code 76
xyz = open(xyz_name, 'w')
print >>xyz, n_atoms
print >>xyz
for a in pdb_data:
    print >>xyz, "%1s % 8.3f % 8.3f % 8.3f" % \
        (a['symbol'], a['x'], a['y'], a['z'])
xyz.close()
Exercise 17: Certain QM programs output the structure in atomic units (Bohrs), but most
visualisation programs expect them to be in Angstroms. Write a conversion script. It should
read the XYZ file in Bohr and save an XYZ file in Angstrom. 1 a.u. = 0.529177 Å.
Exercise 18: Write a script to translate the coordinates of a molecule by a given vector.
The script should read the name of an XYZ file and the coordinates of the vector from the
command line.
Exercise 19: Write a script that calculates certain properties of the structure from an XYZ
file: (i) the geometrical center, (ii) the maximum dimension of the molecule and (iii) the size
along the x, y and z-axis.
The geometrical center $(x_c, y_c, z_c)$ is defined as the average of each coordinate $q = x, y, z$
($N$ is the number of atoms):
$$q_c = \frac{1}{N} \sum_{i=1}^{N} q_i$$
Maximum dimension is the maximum distance between two atoms in the molecule (those
which are most apart).
Size along an axis is the difference between the maximum and minimum coordinate along this
particular axis. The three sizes (x, y, z) are like boundaries of the molecule.
Exercise 20: Molecular dynamics of liquids usually start with a pre-generated box of molecules.
Write a script to generate such a box. Given an XYZ file and the number of molecules in x,
y and z directions, the script should make a box of molecules by copying and translating the
atom in all directions.
Hint 1: have a look at the script in the Exercise 18 and the script building the lattice of gold
(Code 54) — you will need a triple-nested loop to build the box.
Hint 2: the molecules should not overlap, but the lattice should not be too sparse. The best
you can do is to calculate the size of the molecule in x-, y- and z-direction (see Exercise 19),
increment it by 1 or 2 Angstroms and use as the translation vector.
Hint 3: a ‘nice and clean’ script should first center the original molecule at 0, 0, 0 — again,
look at the previous exercises for hints.
Now, when we know how to read and write files, we have a look at other streams. Files are only a special
case of more general objects, streams. Some of them have been discussed already before: standard input
(stdin), standard output (stdout) and standard error (stderr) are streams. It is a common custom to
send all error messages to the ‘standard error’, so that when the output is redirected to a file, the error
messages still go to the screen. For example:
Code 77
from sys import stderr
a = 1
b = 0
# Normal output:
print "a = %d, b = %d" % (a, b)
if b == 0:
    # Error -- message goes to stderr:
    print >>stderr, "Error, b == 0! Exiting."
    exit(1)
The story of writing and reading streams will continue in the next chapter, because the same mechanism
will be used by popen*() functions.
Exercise 21: Relaxed PES scan is performed by stepping one or more variables (coordinates
of the molecule) and performing geometry minimization for all the others. In the Gaussian
program, this task is done by using the modredundant keyword. Write a script that extracts a
multi-frame XYZ file with optimized geometries of the molecule. You have to filter out only
those structures from the output file, which are final, optimized geometries and skip those,
which are geometry optimization steps. Use the output file supplied by the teacher. Hint:
search for the phrase “Stationary point found.”
1.20 Launching external programs
One of the most common purposes of writing scripts is to automate the process of running programs. This
goal can be achieved in several ways, the simplest being the system() function. This function, found in
the os module, runs a command given as the argument. We can, for example, display the name of the
machine, where the script is running, using the hostname command (it is a shell command):
>>> import os
>>> code = os.system("hostname")
swift
The problem is however, that the system() function does not allow us to ‘feed’ the external program
with data or receive/intercept the output from it. The name of the host has appeared on the screen,
because the command hostname has displayed it, but we have no means to store it in a variable, for
example. Therefore, much more useful is the function popen2(), also from module os. This function
runs the command given as the argument and returns two open streams: standard input and standard
output of the command. They work like UNIX pipes and can be used to interact with the program while
it is running. Here is an example: Gaussian can read the commands from the standard input and write
the output to the standard output. We can write scripts to automate calculations in the following way:
Code 78
from os import popen2
input_text = """%mem=50mb
#HF/STO-3G sp

single point, HF

0 1
H 0.0 0.0 0.0
F 1.0 0.0 0.0
"""
inp, out = popen2("g03")
print >>inp, input_text
inp.close()
output_text = out.read()
out.close()
In this example, we define the input for Gaussian (variable input_text) [4], we launch the program
using the function popen2 and feed it with the data, by writing to the stream inp. Finally, we read the
output from the stream out. Note that we have to close the streams: some programs may not start
working until they reach the end of their input, which is signalled when the stream is closed. Obviously,
the example above does not automate anything. However, imagine we want to study the influence of
the implicit solvent model on the bond vibration in the HF molecule. We are going to perform 16 PCM
calculations, changing the dielectric constant from 5 to 80 and monitoring how the frequency of the
bond vibration changes [5]. Gaussian does not permit doing a ‘scan’ of the dielectric constant: we have
to prepare 16 separate jobs or write a script that will do that for us. The corresponding input line would
be:
#HF/STO-3G sp scrf=(pcm,read)
We also have to specify the dielectric constant (e.g. 80) at the end of the input file:
EPS=80
The script that does the job is as follows:
Code 79
#!/usr/bin/python
from os import popen2
input_text = """%%mem=50mb
#HF/STO-3G opt freq scrf=(pcm,read)

single point, HF

0 1
H 0.0 0.0 0.0
F 0.9 0.0 0.0

EPS=%d
"""
print "D.const Frequency"
print "------------------"
for diel in range(5, 81, 5):
    inp, out = popen2("g03")
    print >>inp, input_text % diel
    inp.close()
    output_text = out.readlines()
    out.close()
    for line in output_text:
        if line.count("Frequencies --"):
            freq = float(line[15:26])
            print "%7d % 9.2f" % (diel, freq)
This time, the input_text has %d instead of the dielectric constant value. In the loop, we assign the
dielectric constant to diel and insert the value into the text that is sent to the input stream:
Code 80
print >>inp, input_text % diel
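Note the doubled per-cent sign in %%mem: because the whole string is processed by the % operator, a literal % has to be written as %%. A reduced sketch of the template, keeping only two of its lines (the variable name template is chosen for the illustration):

```python
template = """%%mem=50mb
EPS=%d
"""

# '%%' produces a literal per-cent sign, '%d' is replaced by the integer
text = template % 80
print(text)
```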
After that, we collect the output from Gaussian (output_text) and search it for interesting properties — here, the frequency of the bond vibration, which can be found in the line containing the string
"Frequencies --":
Code 81
        if line.count("Frequencies --"):
            freq = float(line[15:26])
            print "%7d % 9.2f" % (diel, freq)
The output of the script:
D.const Frequency
------------------
      5   4429.97
     10   4418.40
     15   4413.97
     20   4411.63
     25   4410.18
     30   4409.19
     35   4408.48
     40   4407.94
     45   4407.52
     50   4407.18
     55   4406.90
     60   4406.67
     65   4406.47
     70   4406.30
     75   4406.15
     80   4406.02
You should be aware of the changes that are taking place in Python: in version 2.4 a new module has
been introduced, subprocess and it is going to replace the functions discussed in this chapter, but until
then (version 2.6), you can still use them.
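For comparison, the same feed-and-read pattern can be sketched with the subprocess module. Since Gaussian may not be installed, a Python one-liner that echoes its input in upper case stands in for the external program; this stand-in and the variable names are assumptions made for the illustration:

```python
import subprocess
import sys

# Stand-in for the external program: a Python one-liner that
# reads its standard input and writes it back in upper case
command = [sys.executable, "-c",
           "import sys; sys.stdout.write(sys.stdin.read().upper())"]

child = subprocess.Popen(command,
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         universal_newlines=True)   # text instead of bytes

# communicate() sends the input, closes the pipe and collects the output
output_text, _ = child.communicate("single point, HF\n")
print(output_text)
```

Here communicate() replaces the explicit sequence of writing to inp, closing it and reading from out.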
Exercise 22: Write a script in order to calculate the energies (use the Hartree-Fock method
and STO-3G basis set) for a series of geometries. The geometries shall be passed to the script
in a single, multi-frame XYZ file. Sample file will be supplied by the teacher. The script
should extract from the output only the SCF energies.
1.21 Functions
Writing your own functions has two main advantages: you can program the procedures that you use
most often and then reuse them just by typing the name of the function and the arguments; in more
complex programs, you can organize the data flow, by making blocks (functions) to perform separate
tasks. For example, if you write a script that reads in an XYZ file, then converts the coordinates (e.g.
translates them) and then saves them to a new file, you could do it in three steps, writing functions
read_xyz, convert_coord and write_xyz. Figure 1.5 shows a similar idea of a data flow in a script.
Functions, like in mathematics, have arguments and return values. This is because they are supposed to
convert one kind of data into another. We will start this chapter with a function that adds two vectors.
First we have to agree how the data will be represented. We assume that the vectors are three-dimensional
and are represented by tuples; for example:
Code 82
vector_a = ( 0.5, 1.3, -0.5 )
vector_b = ( -1.0, 1.0, 1.0 )
The function should accept two arguments and return a single value; all three are tuples of three float
numbers:
Code 83
def add_vectors(a, b):
    c = (a[0]+b[0], a[1]+b[1], a[2]+b[2])
    return c
The definition of the function starts with the keyword def, followed by the function’s name and arguments in parentheses. The keyword return has two roles: it indicates the value that should be returned and
at the same time causes the script to leave the function. The return statement is not obligatory; if the
function has no return statement, it will always return the None value. The definition of the function
must always precede the invocation:
Code 84
def add_vectors(a, b):
    c = (a[0]+b[0], a[1]+b[1], a[2]+b[2])
    return c
vector_a = ( 0.5, 1.3, -0.5 )
vector_b = ( -1.0, 1.0, 1.0 )
print "The sum is", add_vectors(vector_a, vector_b)
Figure 1.5: Using functions to organize the data flow in the script (input data, function read_data, data processing functions, function write_data, output data).
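The remark about the None value can be checked directly. In the sketch below, show_vector is a hypothetical helper that only prints and has no return statement, so the value it hands back to the caller is None:

```python
def add_vectors(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def show_vector(v):
    # No return statement: the function only prints
    print("(%.1f, %.1f, %.1f)" % v)

total = add_vectors((0.5, 1.3, -0.5), (-1.0, 1.0, 1.0))
result = show_vector(total)

# A function without 'return' gives back None
print(result is None)
```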
It is also possible to write a function that has no arguments, for example, a function that returns a random
vector of a unit length:
Code 85
import math
import random

def rand_normal():
    length = 0
    while length < 1e-10:
        x = random.random()
        y = random.random()
        z = random.random()
        length = math.sqrt(x**2 + y**2 + z**2)
    normal = (x/length, y/length, z/length)
    return normal
The loop and the condition length < 1e-10 were introduced to prevent vectors of null length, which could
not be normalized. The functions’ arguments may have default values; in such a case, the argument can
be omitted and the default value will be used. For example, we will write a function that calculates a
vector perpendicular to a surface defined by three points (A, B, C). This is very easy: we define two vectors:
$$\vec{p} = \overrightarrow{AB}, \qquad \vec{q} = \overrightarrow{AC}$$
The cross product of these two vectors is perpendicular to the vectors and to the surface. Here is the
function:
Code 86
def normal(A, B, C):
    p = (B[0] - A[0], B[1] - A[1], B[2] - A[2])
    q = (C[0] - A[0], C[1] - A[1], C[2] - A[2])
    crossp = (p[1]*q[2] - p[2]*q[1], \
              p[2]*q[0] - p[0]*q[2], \
              p[0]*q[1] - p[1]*q[0])
    return crossp
However, usually the normal vector of a surface is normalized, so that its length is one. We may leave the
choice (to normalize or not) to the user and add a fourth argument to the function. This fourth argument
will be a boolean variable norm, with the default value False. If the value is False, the vector will be left
unchanged; if it is True, the vector will be normalized:
Code 87
import math

def normal(A, B, C, norm=False):
    p = (B[0] - A[0], B[1] - A[1], B[2] - A[2])
    q = (C[0] - A[0], C[1] - A[1], C[2] - A[2])
    crossp = (p[1]*q[2] - p[2]*q[1], \
              p[2]*q[0] - p[0]*q[2], \
              p[0]*q[1] - p[1]*q[0])
    if not norm:
        return crossp
    else:
        length = math.sqrt(crossp[0]**2 + crossp[1]**2 + crossp[2]**2)
        crossp = (crossp[0] / length, crossp[1] / length, crossp[2] / length)
        return crossp
Note that the return statement has been used twice in this function. If the variable norm is False (no
normalization), the function exits and returns the unchanged vector. In the opposite case (else statement)
the normalization is performed before returning the value.
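A quick check with three points in the xy-plane shows both ways of calling such a function. The sketch repeats a compact version of the function so that it can be run on its own; the points A, B and C are chosen for the illustration:

```python
import math

def normal(A, B, C, norm=False):
    p = (B[0] - A[0], B[1] - A[1], B[2] - A[2])
    q = (C[0] - A[0], C[1] - A[1], C[2] - A[2])
    crossp = (p[1]*q[2] - p[2]*q[1],
              p[2]*q[0] - p[0]*q[2],
              p[0]*q[1] - p[1]*q[0])
    if not norm:
        return crossp
    length = math.sqrt(crossp[0]**2 + crossp[1]**2 + crossp[2]**2)
    return (crossp[0] / length, crossp[1] / length, crossp[2] / length)

# Three points in the xy-plane; the normal must point along z
A, B, C = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)

raw = normal(A, B, C)               # norm omitted: default False is used
unit = normal(A, B, C, norm=True)   # keyword form of the fourth argument
print(raw)
print(unit)
```

The raw vector has length 2 (the area of the parallelogram spanned by the two edge vectors), while the normalized one has length 1.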
Another important aspect of using functions is how the arguments are passed to the function and how
they are used inside. In principle, functions have their private copies of variables, which are destroyed
when the function is left. Consider this example:
Code 88
a = 2
b = 3
c = 4
d = 5
def my_func(x, y):
    c = x * d
    b = 0
    y = 0
    return c
result = my_func(a, b)
print "a = %d, b = %d, c = %d, d = %d" % (a, b, c, d)
print "result = %d" % result
If you execute this code, the result will be:
a = 2, b = 3, c = 4, d = 5
result = 10
In the first place, note that although the function assigns to variables b and c, the original values in
the script are left unchanged. This is because b in the function and b in the script are not the same
variable: the function uses its own name space. Secondly, notice that we have passed to the function
variables a and b as the arguments x and y. Although y has been assigned a new value in the function,
the original variable b has not been changed. This is because the assignment only rebinds the local name
y; for numbers and other immutable values the effect is the same as ‘passing by value’ (in contrast to
‘passing by reference’), although a mutable object such as a list could still be modified in place through
the argument. Finally, note that the function uses variable d, which has not been initialized in
the function. This variable is taken from the main program, and passed “under the table”. This kind
of behaviour, although possible, should be avoided since the scripts become messy. Rather, you should
explicitly declare that you are going to use global variables, using the keyword global:
Code 89
def my_func(x, y):
    global b, d
    c = x * d
    b = 0
    y = 0
    return c
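One point deserves care when the argument is a list rather than a number: assigning a new value to the argument name leaves the caller's list alone, but changing an element modifies the caller's list. The function names rebind and modify in this sketch are hypothetical:

```python
def rebind(values):
    # Assignment creates a new local object; the caller's list is untouched
    values = [0, 0, 0]

def modify(values):
    # Item assignment changes the caller's list in place
    values[0] = 0

coords = [1, 2, 3]
rebind(coords)
after_rebind = list(coords)     # copy taken for comparison
modify(coords)
print("%s %s" % (after_rebind, coords))
```

If a function should not alter the list it receives, it can work on a copy made with list(values).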
Exercise 23: Write functions to read and write XYZ files. Before you start writing, decide
how the data will be represented.
Exercise 24: Re-write the script from the Exercise 18 using the functions written in the
Exercise 23.
1.22 Writing modules
It is very easy to write your own module: just put whatever instructions you want in a file with extension
.py, place it in the current directory and import it! For example, you can make your own set of molecular
modelling tools, put them in a file called mm.py and then import by using the instruction import mm.
Usually, the modules contain functions, classes or constants (variables). If you put a “normal” code (i.e.
not a definition) the code will be executed while importing the module. A sample module, with two
functions is shown below:
Code 90
"""
Sample module for vector operations.
Project co-financed from the EU European Social Fund
Borys Szefczyk
64
Two functions are defined:
norm - normalizes vector
cross - computes cross product
The general form of the vector is: tuple(x, y, z)
"""
# We need sqrt, so we have to import math.
# We can not expect that the user will do it for us :-)
import math
def norm(vec):
    """This function normalizes vector to unity."""
    l = math.sqrt(vec[0]**2 + vec[1]**2 + vec[2]**2)
    return (vec[0]/l, vec[1]/l, vec[2]/l)
def cross(va, vb):
    """This function calculates the cross product."""
    # Components of va x vb (right-hand rule)
    x = va[1]*vb[2] - va[2]*vb[1]
    y = va[2]*vb[0] - va[0]*vb[2]
    z = va[0]*vb[1] - va[1]*vb[0]
    return (x, y, z)
# This is a trick, to execute the code only when the file is being run
# as a script, not a module
if __name__ == '__main__':
    # Let's do some tests
    p = (0.5, -0.5, 1.1)
    q = (0.3, 0.4, -2.0)
    print norm(p)
    print cross(p, q)
In the example above, notice the text in the triple quotation marks: this is how the __doc__ entries
are being made. Also notice the conditional statement in the end that checks if __name__ is __main__.
Thanks to this statement, the script can have a “double-life” — it can be used as a script or as a module.
The trick is, if the code is run as a script, it becomes “the main code” and the variable called __name__
contains the value __main__. If the script is imported as a module, __name__ contains the name of the
module. Now, observe how our module behaves:
Python 2.6.4 (r264:75706, Mar 17 2010, 10:33:29)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import vector
>>> dir(vector)
['__builtins__', '__doc__', '__file__', '__name__', '__package__',
'cross', 'math', 'norm']
>>> print vector.__doc__
Sample module for vector operations.
Two functions are defined:
norm - normalizes vector
cross - computes cross product
The general form of the vector is: tuple(x, y, z)
>>> print vector.cross.__doc__
This function calculates the cross product.
>>> print __name__
__main__
>>> print vector.__name__
vector
>>>
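The “double-life” of a file can also be checked from a single script. The sketch below is written for Python 3 and uses only the standard library; the file name name_demo.py and the variable seen_name are invented for the illustration. It writes a one-line module to a temporary directory, then imports it and runs it as a script.

```python
import importlib
import os
import runpy
import sys
import tempfile

# One-line module that records the value of __name__ it sees.
SOURCE = "seen_name = __name__\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "name_demo.py")
    with open(path, "w") as handle:
        handle.write(SOURCE)

    # Case 1: import the file as a module.
    sys.path.insert(0, tmp)
    try:
        imported_name = importlib.import_module("name_demo").seen_name
    finally:
        sys.path.remove(tmp)

    # Case 2: execute the same file as if it were run as a script.
    script_name = runpy.run_path(path, run_name="__main__")["seen_name"]

print(imported_name)
print(script_name)
```

In the first case the module sees its own name, in the second case it sees __main__, which is exactly what the conditional statement at the end of the sample module tests.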
Exercise 25: Write a module with two functions: read_xyz and write_xyz, to perform the
tasks of reading and writing XYZ files.
1.23 Regular expressions (re)
Regular expressions are used to search and match text. They are composed of normal text and special
characters, which have more general meaning. For example, the dot (.) means any character, therefore
the regular expression "a.t" will match, for example, the words “act”, “ant”, “art”, but not “aunt”,
because the dot matches exactly one character. Regular expressions can be much more complex, for
example [A-Za-z]*[_-]{1,3}\d+\..{3}$ reads: any letter from the range A-Z and a-z, repeated 0 or
more times, then _ or - repeated 1 to 3 times, a digit, repeated at least once, a dot, any three characters
and the end of the string. This will match for example “Abc_34.xyz”. Most common symbols are listed in
Table 1.9; more information can be found in the literature [6].
Regular expressions in Python are handled by the re module. In order to use regular expression in your
script, you must first import the module and compile a regular expression object, eg.:
Code 91
import re
regexp = re.compile("(\d+) basis functions")
Table 1.9: Selected regular expression atoms.
Symbol   Meaning
.        Single character (any)
^        Beginning of a line or string
$        End of a line or string
*        Zero or more repetitions of the previous character
+        One or more repetitions of the previous character
?        Zero or one repetition of the previous character
{m}      m repetitions of the previous character
{m,n}    Between m and n repetitions of the previous character
[ ]      Matches one of the characters specified in brackets; ranges are
         permitted, eg. [a-z0-9]; to match a hyphen, it should be the last
         character specified, eg. [A-Z_-]
\        Allows special characters to be matched literally, eg. \. \* \^
|        Separator of alternative match strings, eg. aa|bb
\s       Whitespace character
\d       Digit
Then, you can perform a search or match — the difference is that a match requires the regular expression
to match from the beginning of a line or string, whereas a search will match at any point of the string/line.
For instance:
Code 92
import re
regexp = re.compile("(\d+) basis functions")
test = "we have 305 basis functions"
result1 = regexp.match(test)
result2 = regexp.search(test)
print "Result of 'match' is", result1
print "Result of 'search' is", result2
Result of 'match' is None
Result of 'search' is <_sre.SRE_Match object at 0x7f4d2642a8a0>
The result of a search or match can be safely used in a conditional statement, although it is not a simple
boolean value, but an object. Remember that most objects in Python have a boolean value; in this
case None works like False and a successful match/search object works like True:
Code 93
result = regexp.search(test)
if result:
    print "Successful match"
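The patterns mentioned so far can be tried out in a few lines. This is a small Python 3 sketch; the word list and the file name Abc_34.xyz are example inputs chosen for the illustration, and raw strings (r"...") are used so that the backslashes reach the re module unchanged.

```python
import re

# The dot stands for exactly one character.
dot_pattern = re.compile(r"a.t$")
dot_hits = [w for w in ["act", "ant", "art", "aunt"] if dot_pattern.match(w)]

# The longer pattern from the text, applied to an example file name.
name_pattern = re.compile(r"[A-Za-z]*[_-]{1,3}\d+\..{3}$")
matches_name = bool(name_pattern.match("Abc_34.xyz"))

# match() anchors at the start of the string, search() does not.
basis = re.compile(r"(\d+) basis functions")
text = "we have 305 basis functions"
match_result = basis.match(text)
search_result = basis.search(text)

print(dot_hits)
print(matches_name)
print(match_result is None, search_result.group(1))
```

Converting the result with bool(), or using it directly in an if statement, relies on the boolean value of the match object described above.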
The whole point of using regular expressions is to match strings that are not always the same, but have
certain common characteristics. Moreover, regular expressions can be used to extract fragments of those strings intelligently. If you enclose fragments of the regular expression in parentheses, corresponding
fragments of the matched string will form groups that can be retrieved from the match object. This is
done using method .groups(). Let us assume that the variable output contains a text produced by our
favourite programs. Some of the lines contain numbers that we would like to extract; these lines start
with the word ’Atom’:
Code 94
import re
output = """Calculations finished.
In file 1 found:
Atom 3 at x = -1.2
Atom 12 at x = .004
Atom 21 at x = 10.1
In file 2 found:
Atom 5 at x = 4.5e+3
Atom 14 at x = -4.2e-1
Atom 101 at x = 0
"""
regexp = re.compile(r"Atom \d+ at x = (-?\d*\.?\d*(e[+-]?\d+)?)")
coord = []
for line in output.split('\n'):
    result = regexp.match(line)
    if result:
        print result.groups() # Just to see what the groups look like
        c = result.groups()[0]
        coord.append(float(c))
print coord
First note how the text looks: the line starts with the word Atom, then goes an integer number, then the
text "at x =" and finally, the value we want to extract. So the first part of the regular expression — to
match the right line — would be "Atom \d+ at x =". The expression \d+ matches one or more digits.
Then we have to match the value, but as you can see above, the value can be of any kind: it might have a
minus sign, it may or may not have the integral part, it may or may not have the decimal part and it may
have the exponent! To match the minus sign or the lack of it, we will use "-?"; to match the integral part
we will use "\d*"; then we have to look for the point that might not be there, "\.?"; then the decimal
digits — again "\d*"; finally, the exponent, which may have plus or minus sign, "(e[+-]?\d+)?". The
complete expression is the following: (-?\d*\.?\d*(e[+-]?\d+)?). The outer parentheses are necessary
to extract the whole number; the inner parentheses are just to make a group out of the exponent. The
groups() method returns a tuple with groups found while matching/searching for the regular expression.
The re module also has other methods, which you may find useful in your scripts. See the full documentation on-line [7].
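To see what the two groups contain, the expression can be applied to single lines. The sketch below is written for Python 3 and reuses two lines of the sample output; the variable names are chosen for the illustration.

```python
import re

number = re.compile(r"Atom \d+ at x = (-?\d*\.?\d*(e[+-]?\d+)?)")

with_exponent = number.match("Atom 5 at x = 4.5e+3")
plain_integer = number.match("Atom 101 at x = 0")

# Group 1 is the whole number, group 2 the exponent (None if absent).
groups_exp = with_exponent.groups()
groups_int = plain_integer.groups()
values = [float(groups_exp[0]), float(groups_int[0])]

print(groups_exp)
print(groups_int)
print(values)
```

Because every part of the number is optional, the expression also matches a line in which nothing follows "x = "; the first group is then an empty string and float() raises ValueError, so a stricter pattern may be preferable for untrusted input.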
Exercise 26: Use the output file from Gaussian supplied by the teacher and write a script to
extract the energies from the file. It contains several single point calculations of the energy,
using the Hartree-Fock and DFT methods, with different functionals. The energies are in lines
like:
SCF Done:  E(RHF) =  -39.7034912248     A.U. after    8 cycles
SCF Done:  E(RPBE-PBE) =  -39.9471405984     A.U. after    7 cycles
and so on. You should use a single regular expression to retrieve the energy and the name
of the method. After that, the program should produce a CSV file (to be read into a spread
sheet) and table on the screen:
Model          Energy
-------------------------------
HF             -39.7034912248
B3LYP          -40.0181905764
X3LYP          -39.9896656215
PBE-PBE        -39.9471405984
PW91-PW91      -39.9866252189
M06            -39.9651652848
M06L           -39.9962589720
Chapter 2
Numerical applications
2.1 Basic operation on arrays
Python is commonly used in scientific applications and since they often involve matrices and matrix
operations, steps have been taken to facilitate these tasks. The numpy module [8] introduces a new type,
array, and several routines to handle them. Arrays may contain different types of elements, not just
numbers, however, all elements of the array must be of the same type. Consider the following example:
Code 95
import numpy
A = numpy.array([1, 1.0, 1.0 + 0.0j])
print A
[ 1.+0.j 1.+0.j 1.+0.j]
In this example, you can see that the array was created from a list using the array() function; the
elements of the list were of different types and were all converted to the type that is most ‘roomy’, so that it can
store them all; in this case it was the complex type (we can recognize that because each number has the
imaginary part 0.j). Arrays can be flat, rectangular, cubic etc. They can also be “reshaped”. In fact, all
arrays are stored flat and the information about the shape is stored separately, in a tuple; therefore
it is easy to change it:
Code 96
A = numpy.array(range(27))
print A
A.shape = (3,9)
print A
A.shape = (3,3,3)
print A
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26]
[[ 0 1 2 3 4 5 6 7 8]
[ 9 10 11 12 13 14 15 16 17]
[18 19 20 21 22 23 24 25 26]]
[[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]]
[[ 9 10 11]
[12 13 14]
[15 16 17]]
[[18 19 20]
[21 22 23]
[24 25 26]]]
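How a shape tuple turns a flat sequence into a matrix can be sketched without numpy. The helper flat_index below is a name invented for this illustration; it assumes row-major order (the last index changes fastest), which is the default layout of numpy arrays. The code is written for Python 3.

```python
def flat_index(index, shape):
    """Position of a multi-dimensional index in row-major flat storage."""
    position = 0
    for i, size in zip(index, shape):
        position = position * size + i
    return position

flat = list(range(27))

# Element [1, 2] of the 3 x 9 view and element [1, 0, 2] of the 3 x 3 x 3 view
as_3x9 = flat[flat_index((1, 2), (3, 9))]
as_3x3x3 = flat[flat_index((1, 0, 2), (3, 3, 3))]
print(as_3x9, as_3x3x3)
```

Both indices point at the same stored element, which is why changing the shape does not require moving any data.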
Besides the array() function, the module offers routines to create special types of matrices, like a matrix
of zeros, ones or the identity matrix:
Code 97
print "Matrix of zeros"
Z = numpy.zeros((3,3))
print Z
print "Matrix of ones"
O = numpy.ones((3,3))
print O
print "Identity matrix"
I = numpy.identity(3)
print I
Matrix of zeros
[[ 0. 0. 0.]
 [ 0. 0. 0.]
 [ 0. 0. 0.]]
Matrix of ones
[[ 1. 1. 1.]
 [ 1. 1. 1.]
 [ 1. 1. 1.]]
Identity matrix
[[ 1. 0. 0.]
 [ 0. 1. 0.]
 [ 0. 0. 1.]]
A great advantage of the numpy module is that it permits math operations on whole matrices, as if
they were just numbers:
Code 98
m = numpy.ones((3,3))
print m
n = m + 1
print n
n = m * 0.5
print n
n = numpy.sin(m * 0.5)
print n
[[ 1. 1. 1.]
 [ 1. 1. 1.]
 [ 1. 1. 1.]]
[[ 2. 2. 2.]
 [ 2. 2. 2.]
 [ 2. 2. 2.]]
[[ 0.5 0.5 0.5]
 [ 0.5 0.5 0.5]
 [ 0.5 0.5 0.5]]
[[ 0.47942554 0.47942554 0.47942554]
 [ 0.47942554 0.47942554 0.47942554]
 [ 0.47942554 0.47942554 0.47942554]]
You can also perform matrix operations in a mathematical sense, e.g. add them:
Code 99
m = numpy.array(range(9))
m.shape = (3,3)
n = numpy.array(range(8,-1,-1))
n.shape = (3,3)
print "m =", m
print "n =", n
print "m + n =", m+n
m = [[0 1 2]
 [3 4 5]
 [6 7 8]]
n = [[8 7 6]
 [5 4 3]
 [2 1 0]]
m + n = [[8 8 8]
 [8 8 8]
 [8 8 8]]
However, notice that the * operator multiplies matrices element by element:
C(m, n) = A(m, n) · B(m, n)
To calculate the ‘real’ matrix product, you should use the dot() function:
Code 100
m = numpy.array(range(1,10))
m.shape = (3,3)
n = 1./m
print "m =", m
print "n =", n
print "m * n =", m*n
print "dot(m,n) =", numpy.dot(m,n)
m = [[1 2 3]
 [4 5 6]
 [7 8 9]]
n = [[ 1.          0.5         0.33333333]
 [ 0.25        0.2         0.16666667]
 [ 0.14285714  0.125       0.11111111]]
m * n = [[ 1. 1. 1.]
 [ 1. 1. 1.]
 [ 1. 1. 1.]]
dot(m,n) = [[  1.92857143   1.275        1.        ]
 [  6.10714286   3.75         2.83333333]
 [ 10.28571429   6.225        4.66666667]]
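The difference between the two products can be spelled out with plain nested lists. This Python 3 sketch only illustrates the definitions and is not how numpy computes the results internally.

```python
m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
n = [[1.0 / value for value in row] for row in m]
size = len(m)

# Element-wise product: C[i][j] = A[i][j] * B[i][j]
elementwise = [[m[i][j] * n[i][j] for j in range(size)] for i in range(size)]

# Matrix product: C[i][j] = sum over k of A[i][k] * B[k][j]
matrix_product = [[sum(m[i][k] * n[k][j] for k in range(size))
                   for j in range(size)] for i in range(size)]

print(elementwise[0])
print([round(value, 3) for value in matrix_product[0]])
```

The element-wise product of a number with its reciprocal is 1 everywhere, whereas the matrix product mixes rows of m with columns of n.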
The numpy module is also useful for simple statistics applications. For example, we have a file with many
data points. We are going to read in the file and compute: the number of data points, the sum, the mean
and the standard deviation:
Code 101
data = numpy.array(map(float, open("numpy.dat").readlines()))
print "Number of data points:", len(data)
print "Sum:", numpy.sum(data)
print "Mean value:", numpy.mean(data)
print "Standard deviation:", numpy.std(data)
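It is worth knowing which standard deviation is computed. By default numpy.std divides by the number of points n (the population formula); the sample formula divides by n - 1. The Python 3 sketch below computes both by hand for a small made-up data set.

```python
import math

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample

count = len(data)
total = sum(data)
mean = total / count
squared_deviations = sum((x - mean) ** 2 for x in data)

# Population standard deviation (divide by n), the default of numpy.std
population_std = math.sqrt(squared_deviations / count)
# Sample standard deviation (divide by n - 1)
sample_std = math.sqrt(squared_deviations / (count - 1))

print(count, total, mean, population_std, sample_std)
```

With numpy, the sample formula is selected with the ddof argument, numpy.std(data, ddof=1).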
The next important issue is the indexing of elements in a matrix, but before we proceed to this problem,
we will construct a model matrix, a “times table”. It will contain products of the numbers that are in the
beginning of each row and column. The left-most column and the upper-most row will contain numbers
from 1 to 9:
Code 102
vec = numpy.arange(1, 10)
print "vec =", vec
mat = numpy.multiply.outer(vec, vec)
print "mat =", mat
vec = [1 2 3 4 5 6 7 8 9]
mat = [[ 1  2  3  4  5  6  7  8  9]
 [ 2  4  6  8 10 12 14 16 18]
 [ 3  6  9 12 15 18 21 24 27]
 [ 4  8 12 16 20 24 28 32 36]
 [ 5 10 15 20 25 30 35 40 45]
 [ 6 12 18 24 30 36 42 48 54]
 [ 7 14 21 28 35 42 49 56 63]
 [ 8 16 24 32 40 48 56 64 72]
 [ 9 18 27 36 45 54 63 72 81]]
From every matrix, we can pick a number or a “sub-matrix”, by specifying the indices or ranges. Try out
the following instructions and observe the result:
Code 103
print mat[3:5, 3:7]
print mat[1:4:2, 1:9:3]
print mat[::2,::2]
print mat[::-1,::-1]
Exercise 27: Use the least-square method to fit experimental data (temperature vs. time) with
linear function, y = ax + b. The teacher will provide you with the data file. Also, calculate
the correlation coefficient. In the following formulas, n is the number of data points, x, y are
the data points.
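\Delta = n \sum x^2 - \left( \sum x \right)^2
a = \frac{n \sum xy - \sum x \sum y}{\Delta}
b = \frac{\sum x^2 \sum y - \sum x \sum xy}{\Delta}
r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[ n \sum x^2 - \left( \sum x \right)^2 \right] \cdot \left[ n \sum y^2 - \left( \sum y \right)^2 \right]}}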
2.2 Using Gnuplot with numpy
We will use the program from Exercise 27 to see how the data can be visualized directly from Python.
We will use the Gnuplot [9] interface (module) to plot a graph. Alternatively, the matplotlib package
can be used (see footnote 1). The program in the exercise fits a linear function to experimental data and it may look like
this:
Code 104
import numpy
data = numpy.array([map(float, x.split()) for x in open('lsq.dat').readlines()])
Sx = numpy.sum(data[:,0])
Sy = numpy.sum(data[:,1])
Sxx = numpy.sum(data[:,0]**2)
Syy = numpy.sum(data[:,1]**2)
Sxy = numpy.sum(data[:,0]*data[:,1])
n = len(data[:,0])
delta = n*Sxx - Sx**2
a = (n*Sxy - Sx*Sy)/delta
b = (Sxx*Sy - Sx*Sxy)/delta
r = (n*Sxy - Sx*Sy)/numpy.sqrt((n*Sxx - Sx**2)*(n*Syy - Sy**2))
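The same formulas can be checked by hand on a few points. The Python 3 sketch below uses made-up data lying close to the line y = 2x + 1 and plain lists instead of numpy arrays; the variable names follow the listing above.

```python
import math

# Made-up points lying close to y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

n = len(xs)
Sx, Sy = sum(xs), sum(ys)
Sxx = sum(x * x for x in xs)
Syy = sum(y * y for y in ys)
Sxy = sum(x * y for x, y in zip(xs, ys))

delta = n * Sxx - Sx ** 2
a = (n * Sxy - Sx * Sy) / delta
b = (Sxx * Sy - Sx * Sxy) / delta
r = (n * Sxy - Sx * Sy) / math.sqrt((n * Sxx - Sx ** 2) * (n * Syy - Sy ** 2))

print("a = %.3f, b = %.3f, r = %.4f" % (a, b, r))
```

The fitted slope and intercept come out close to 2 and 1, and r is close to 1, as expected for nearly linear data.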
Still in the same script, we will add instruction to plot the points and the function, so that we can see
how well the data is fitted. Note that we have two kinds of data to plot: discrete data in form of points
(x,y) and a continuous function. Both types of data can be plotted with Gnuplot:
Footnote 1: http://matplotlib.sourceforge.net
Code 105
import Gnuplot
gp = Gnuplot.Gnuplot(persist=1)
gp.title("Least square fit")
gp.xlabel("time [s]")
gp.ylabel("temp [K]")
gp('set pointsize 3')
gp('set key right bottom')
gp_data = Gnuplot.Data(data, title="r = %g" % r)
gp_func = Gnuplot.Func("%f * x + %f" % (a, b), title="%g x %+g" % (a, b))
gp.plot(gp_data, gp_func)
gp('set terminal postscript enhanced color 20')
gp.hardcopy("lsq.eps")
We will analyse the script line-by-line. First, the Gnuplot module was imported and a Gnuplot object was
initialized. The persist option prevents the graph from being closed when our script terminates:
Code 106
import Gnuplot
gp = Gnuplot.Gnuplot(persist=1)
At this point, we also configure different aspects of the graph, like the title and the position of the key. Two
kinds of syntax are used: some of the more common gnuplot commands are implemented as methods
(eg. title(), xlabel() etc.), whereas any other command can be passed using the gp('command')
syntax:
Code 107
gp.title("Least square fit")
gp.xlabel("time [s]")
gp.ylabel("temp [K]")
gp('set pointsize 3')
gp('set key right bottom')
Next, we define the data and the function objects and plot them. The titles of the data sets
are also defined here.
Code 108
gp_data = Gnuplot.Data(data, title="r = %g" % r)
gp_func = Gnuplot.Func("%f * x + %f" % (a, b), title="%g x %+g" % (a, b))
gp.plot(gp_data, gp_func)
We can also export the graph to a PostScript file (see footnote 2) [10]. If you want to change some of the parameters,
you can use the gp('command') syntax again. Here, we cause the image to be printed in colour, using a
20pt font, in EPS format:
Code 109
gp('set terminal postscript enhanced color 20')
gp.hardcopy("lsq.eps")
2.3 Linear algebra in Python
The sub-module linalg of the numpy module contains basic tools for linear algebra: functions to calculate determinants and inverse matrices, and to solve sets of linear equations and eigen-problems. We will use one
of these functions to solve the following set of equations:
\begin{cases}
9x_1 - 8x_2 + 7x_3 - 6x_4 = -11 \\
2x_1 + 3x_2 + 4x_3 + 5x_4 = 44 \\
2x_1 + 4x_2 + 6x_3 + 8x_4 = 66 \\
8x_1 - 6x_2 - 8x_3 + 2x_4 = -22
\end{cases}
This problem can be written in the matrix form:
A·X=Y
where A is a 4 × 4 coefficient matrix, Y is a vertical vector (4 × 1 matrix) of the free elements and X are
the solutions to our problem. Here is the script:
Code 110
from numpy import array, dot, ravel
from numpy.linalg import solve
A = array([[ 9, -8, 7, -6],
[ 2, 3, 4, 5],
[ 2, 4, 6, 8],
[ 8, -6, -8, 2]])
Y = array([[-11],
[ 44],
[ 66],
[-22]])
X = solve(A, Y)
# the ravel function is used with the sole purpose of "flattening"
# the vector for a nice display
print "Solution:", ravel(X)
# If the solution is correct, this should be zero!
print dot(A,X) - Y
Footnote 2: PostScript is a language developed by Adobe to represent vector graphics (i.e. graphics composed of primitives
such as lines, circles etc.); PostScript files have the extensions .ps and .eps.
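To see what solving such a system involves, the same equations can be handled with a textbook method. The Python 3 sketch below uses Gaussian elimination with partial pivoting on plain lists; the function name solve_linear is invented for this illustration, and the sketch is not meant to reproduce what numpy.linalg.solve does internally.

```python
def solve_linear(A, Y):
    """Solve A.X = Y by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | Y] as float rows (the inputs are not modified)
    M = [[float(v) for v in row] + [float(y)] for row, y in zip(A, Y)]
    for col in range(n):
        # Bring the row with the largest pivot to the diagonal
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate the entries below the pivot
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for k in range(col, n + 1):
                M[row][k] -= factor * M[col][k]
    # Back substitution, starting from the last equation
    X = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(M[row][k] * X[k] for k in range(row + 1, n))
        X[row] = (M[row][n] - known) / M[row][row]
    return X

A = [[9, -8, 7, -6], [2, 3, 4, 5], [2, 4, 6, 8], [8, -6, -8, 2]]
Y = [-11, 44, 66, -22]
X = solve_linear(A, Y)
residual = [sum(a * x for a, x in zip(row, X)) - y for row, y in zip(A, Y)]
print(X)
print(residual)
```

Substituting back shows that the solution is x1 = 1.1, x2 = 2.2, x3 = 3.3, x4 = 4.4, so the printed values should agree with these up to rounding and the residuals should be close to zero.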
In the second example, we will use the multivariate least squares method to solve a linear regression
problem. A review article by Hansch presents several QSAR studies of anti-HIV drugs [11]. Table 42 in
this review shows the dependence of EC50 (log 1/C) on four properties called L_x, B1_x, I_Y and σ in a series
of compounds. We will fit these data with a linear function:
\log(1/C) = a_0 + a_1 L_x + a_2 B1_x + a_3 I_Y + a_4 \sigma
We are looking for the coefficient vector:
A = \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix}
First, we must write the problem in matrix form. Let X be the matrix of the four properties of our
n = 16 compounds:
X = \begin{pmatrix}
x_{00} & x_{01} & x_{02} & x_{03} & x_{04} \\
x_{10} & x_{11} & x_{12} & x_{13} & x_{14} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
x_{n0} & x_{n1} & x_{n2} & x_{n3} & x_{n4}
\end{pmatrix}
x_{00}, ..., x_{n0} = 1, because this column corresponds to the free element a_0. Let Y be the vector of the observed EC50
values:
Y = \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \end{pmatrix}
The problem can be written as:
(X′ X) A = X′ Y
where X′ denotes a transposed matrix. This time, the A matrix is unknown and the equation has to be
solved to find it.
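The equation above is the condition for the minimum of the sum of squared residuals. A short derivation in the notation of this section, where S is a symbol introduced here for that sum:

```latex
S(A) = (Y - XA)'(Y - XA) = Y'Y - 2A'X'Y + A'X'XA
\frac{\partial S}{\partial A} = -2X'Y + 2X'XA = 0
\quad\Longrightarrow\quad (X'X)\,A = X'Y
```

The matrix X'X is square (5 × 5 in our case), so the problem has the same form A·X=Y as the previous example and can be solved with the same function.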
Our data is stored in a file in the following format:
# log(1/C)  const  Lx    B1x   Iy    sigma
6.50        1.00   2.06  1.00  1.00   0.00
8.26        1.00   2.87  1.52  1.00  -0.04
6.28        1.00   4.11  1.52  1.00  -0.01
5.98        1.00   3.82  1.95  1.00   0.44
5.94        1.00   4.23  2.15  1.00   0.39
5.32        1.00   2.65  1.35  1.00   0.52
5.00        1.00   2.74  1.35  1.00   0.29
4.15        1.00   3.98  1.35  1.00   0.27
4.27        1.00   4.80  1.35  1.00   0.28
4.22        1.00   2.06  1.00  0.00   0.00
4.26        1.00   4.11  1.52  0.00  -0.01
4.92        1.00   2.06  1.00  0.00   0.00
4.07        1.00   4.11  1.52  0.00  -0.01
4.01        1.00   2.06  1.00  0.00   0.00
6.77        1.00   2.87  1.52  0.00  -0.04
5.31        1.00   4.11  1.52  0.00  -0.01
The program is very simple, provided that we use the numpy module. First, we read the whole file into a
matrix (data), then we extract the first column into the Y vector and the rest of the columns into the X
matrix. Finally, we calculate the transposed matrix Xt and solve the equation:
Code 111
#!/usr/bin/python
import numpy
data = numpy.array([map(float, x.split()) for x in open('LR').readlines()[1:]])
Y = data[:,0]
X = data[:,1:]
Xt = numpy.transpose(X)
A = numpy.linalg.solve(numpy.dot(Xt, X), numpy.dot(Xt, Y))
print numpy.ravel(A)
The solution printed by the script:
[ 3.06274963 -0.94451833 3.51948739 1.88218388 -5.10869993]
is consistent with Equation 58 of the review by Hansch.
2.4 Python for scientists
Although the scipy module [12] is a set of routines for scientific applications, we will start with an
example from computer graphics. Namely, we will use the scipy module to manipulate a picture. A
bitmap graphics is in fact a two-dimensional matrix: each element of this matrix corresponds to a single
pixel of the image. The value of each element describes the colour of the pixel. Therefore, to get the
negative of the image, we just have to convert the picture to a matrix and then multiply each value by −1:
Code 112
import scipy
raw = scipy.misc.imread('IMG_2028.png')
raw *= -1
scipy.misc.imsave('outfile.png', raw)
We have created our first graphical filter. You can see the effect in Figure 2.1-B.
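The idea of the filter can be illustrated on a tiny greyscale “image” stored as a nested list. This Python 3 sketch assumes 8-bit pixel values in the range 0 to 255, for which the negative is commonly written as 255 minus the value so that the result stays in the valid range; it illustrates the principle and does not reproduce what the scipy routines do.

```python
# A tiny made-up greyscale image: 0 is black, 255 is white
image = [[0, 64, 128],
         [192, 255, 32]]

# Negative of an 8-bit image: bright pixels become dark and vice versa
negative = [[255 - pixel for pixel in row] for row in image]

print(negative)
```

Applying the same operation twice restores the original picture.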
In fact, a colour image is not a two-dimensional but a three-dimensional matrix. This is because each
pixel contains more than one value or in other words, the third dimension of the matrix describes the
components of the colour of the pixel. There are different colour spaces, but in photography, the most
common is RGB, ie. the colour is a mixture of red, green and blue. Imagine that our picture has the size
of 600 × 400 pixels. Then the corresponding matrix will be 600 × 400 × 3 — the third number means
that we have three “slices” in the matrix, one for each of the colours: red, green or blue. Now, let’s make
another graphical filter: we will swap the colours so that whatever is green becomes red, blue becomes
green and red becomes blue:
Code 113
import numpy
import scipy
raw = scipy.misc.imread('IMG_2028.png')
raw2 = numpy.zeros(raw.shape)
raw2[:,:,0] = raw[:,:,1]
raw2[:,:,1] = raw[:,:,2]
raw2[:,:,2] = raw[:,:,0]
scipy.misc.imsave('outfile.png', raw2)
Figure 2.1: Image transformed using matrix operations: (A) original, (B) negative, (C) colors swapped,
(D) modified FFT ‘spectrum’.
Figure 2.1-C shows the result of the operation. In the third example, we will use a more advanced trick:
we will convert the image with a 2-dimensional discrete inverse fast Fourier transform (2D iFFT) into the
reciprocal space. Then we will modify the ‘spectrum’ by erasing (setting to zero) the upper-left quarter of
the reciprocal image. Finally, we will use the 2-dimensional discrete FFT (2D FFT) to convert the image
back to the real space. Note that we do this operation separately on each channel (colour):
Code 114
import numpy
import scipy
raw = scipy.misc.imread('IMG_2028.png')
print "Doing inverse discrete 2D FFT..."
iR = numpy.fft.irfft2(raw[:,:,0])
iG = numpy.fft.irfft2(raw[:,:,1])
iB = numpy.fft.irfft2(raw[:,:,2])
# Integer division, so that w and h can be used as slice limits
w = iR.shape[1]//2
h = iR.shape[0]//2
iR[:h,:w] = 0
iG[:h,:w] = 0
iB[:h,:w] = 0
print "Doing real discrete 2D FFT..."
raw2 = numpy.zeros(raw.shape)
raw2[:,:,0] = numpy.real(numpy.fft.rfft2(iR))
raw2[:,:,1] = numpy.real(numpy.fft.rfft2(iG))
raw2[:,:,2] = numpy.real(numpy.fft.rfft2(iB))
print "Saving to outfile.png ..."
scipy.misc.imsave('outfile.png', raw2)
The effect (Figure 2.1-D) is that the picture looks old and warped. For more information on bitmap graphics, refer to the literature [13].
In the last example, we will use the numpy and scipy modules for error analysis. The situation is as follows: we have performed an MD simulation of a certain liquid and we have estimated the density.
However, to do the job properly, we should also calculate the error of this estimation. Ideally, we should
do several independent simulations, then calculate the average density from the set of simulations, the
standard deviation and the error. However, if the run was long enough, we can also get a good estimate
of the error by splitting the run into blocks and doing ‘block averaging’.
Our input data is a file containing two columns: time step and density. We will read the data into a
two-column matrix, then split the matrix into blocks of 200 ps. For each block we will calculate the
average density. Then, we will calculate the standard deviation of these averages and estimate the error.
We will use the Student’s t-distribution and we will estimate the error at 95% confidence. We will use
the following functions: the functions mean and std from the numpy module to calculate the mean value
and the standard deviation; the ppf function from scipy.stats.t to get the Student’s t-factor for a 95%
confidence level and n_blocks samples.
Code 115
from scipy import stats
import numpy
from sys import argv
# Some constants
confidence = 0.95
block = 200  # [ps]
# Get the data into a two-column matrix (timestep -> density)
lines = open(argv[1]).readlines()
data = numpy.array([ map(float, l.split()) for l in lines[1:] ])
# Calculate the number of blocks
min_time = data[0,0]
max_time = data[-1,0]
n_blocks = int((max_time - min_time) / block)
# Calculate the number of steps in a single block
# We can do that based on time step, because it is constant
time_step = data[1,0] - data[0,0]
block_size = int(round(block / time_step))
# Do the block averaging
averages = []
for b_ind in range(n_blocks):
    begin = b_ind * block_size
    end = (b_ind + 1) * block_size
    block_mean = numpy.mean(data[begin:end,1])
    averages.append(block_mean)
# Convert to a numpy array
averages = numpy.array(averages)
# Get the Student's t-factor for a two-sided interval;
# n_blocks samples give n_blocks - 1 degrees of freedom
one_sided = 0.5 + confidence/2.0
t_crit = stats.t.ppf(one_sided, n_blocks - 1)
# Do the statistics (ddof=1 selects the sample standard deviation)
mean = numpy.mean(averages)
std_dev = numpy.std(averages, ddof=1)
error = t_crit * std_dev/numpy.sqrt(n_blocks)
print "Mean value and error: %.2f +/- %.2f" % (mean, error)
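The block-averaging step itself can be followed on a very small example. The Python 3 sketch below uses a made-up density series of 12 samples and blocks of 4 samples; it stops at the standard error of the mean, because the Student's t-factor requires scipy, and it uses the sample standard deviation (division by n - 1), which is a choice made for this sketch.

```python
import math

# Made-up density series: 12 samples, split into blocks of 4 samples
density = [0.99, 1.01, 1.00, 1.00,
           1.02, 1.04, 1.03, 1.03,
           0.97, 0.99, 0.98, 0.98]
block_size = 4
n_blocks = len(density) // block_size

# Average within each block
averages = []
for b_ind in range(n_blocks):
    chunk = density[b_ind * block_size:(b_ind + 1) * block_size]
    averages.append(sum(chunk) / len(chunk))

# Mean of the block averages and the standard error of that mean
mean = sum(averages) / n_blocks
variance = sum((value - mean) ** 2 for value in averages) / (n_blocks - 1)
std_error = math.sqrt(variance / n_blocks)

print(averages, mean, std_error)
```

Multiplying std_error by the t-factor for the chosen confidence level gives the error estimate printed by the full script.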
Chapter 3
Databases
In your Python scripts, you can manage and access databases. There are several wrappers that permit
accessing databases in a transparent way; in the next section we will use the MySQLdb module [14] to
write a bibliographic database, but first we have to learn how to work with MySQL and how to use
Structured Query Language (SQL).
3.1 Administration
The MySQL database works in a client-server fashion: one or more databases exist on the server, they are
managed by the server program and can be accessed locally or remotely (from other computers).
Typically, the database and user accounts will be created by the system administrator and, unless that is you, you do not have to worry about it. If you would like to try it on your own
computer and step into the administrator’s shoes, here is the recipe (you will find more information on
the Internet [15]).
Assuming that the MySQL server is already running, you must create a database and an account for the
purpose of our exercise. Start the MySQL interface as administrator (usually root):
~$ mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.0.90-log Gentoo Linux mysql-5.0.90-r2
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
Then, create a new database. Here, we also use the show databases command to list all existing
databases:
mysql> create database bibliography;
Query OK, 1 row affected (0.00 sec)
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| bibliography       |
| mysql              |
| test               |
+--------------------+
4 rows in set (0.00 sec)
mysql>
Note that all SQL commands must end with a semicolon. After creating, switch to the mysql database.
This is the place where all credentials are stored. Solely for your interest, you may list the tables with the
show tables command:
mysql> use mysql
Reading table information for completion of table and column names
You can turn off this feature to get a quicker start-up with -A
Database changed
mysql> show tables;
+---------------------------+
| Tables_in_mysql           |
+---------------------------+
| columns_priv              |
| db                        |
| func                      |
| host                      |
| proc                      |
| procs_priv                |
| tables_priv               |
| user                      |
+---------------------------+
8 rows in set (0.00 sec)
mysql>
Now, add a new user called pybib with the password bookWORM, or another password of your choice. Here, we grant
all privileges (like creating and deleting tables) to this user and limit the access to the local computer.
Nowadays, most applications use a web interface (in many cases written in Python) and the database
can be accessed from the Internet. Even so, you should still restrict the access to the local machine.
This is because the database server will not be accessed remotely, but from the web server running on the
same computer, therefore only local access is needed.
mysql> create user 'pybib'@'localhost' identified by 'bookWORM';
Query OK, 0 rows affected (0.00 sec)
mysql> grant all privileges on bibliography.* to 'pybib'@'localhost';
Query OK, 0 rows affected (0.00 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
mysql>
The flush privileges command is necessary for the changes to take effect immediately. Now, let us see
the information stored about our new user. This information is stored in the table user. Each database
table has records (rows) and columns. Each column has a label and definition of the type of information
that is stored in the column. You can list the columns using the describe command (in this example
some of the columns have been omitted):
mysql> describe user;
+-----------------+------------------+------+-----+---------+-------+
| Field           | Type             | Null | Key | Default | Extra |
+-----------------+------------------+------+-----+---------+-------+
| Host            | char(60)         | NO   | PRI |         |       |
| User            | char(16)         | NO   | PRI |         |       |
| Password        | char(41)         | NO   |     |         |       |
| Select_priv     | enum('N','Y')    | NO   |     | N       |       |
| Insert_priv     | enum('N','Y')    | NO   |     | N       |       |
| Update_priv     | enum('N','Y')    | NO   |     | N       |       |
| Delete_priv     | enum('N','Y')    | NO   |     | N       |       |
| Create_priv     | enum('N','Y')    | NO   |     | N       |       |
| Drop_priv       | enum('N','Y')    | NO   |     | N       |       |
| ssl_cipher      | blob             | NO   |     | NULL    |       |
| max_connections | int(11) unsigned | NO   |     | 0       |       |
+-----------------+------------------+------+-----+---------+-------+
11 rows in set (0.00 sec)
mysql>
Now we will form our first query in order to retrieve information from the table. We will list the content
of the columns Host, User and Password for all users on the server, and quit the program.
You can see our newly created user in this table:
mysql> select Host,User,Password from user;
+-----------+-------+-------------------------------------------+
| Host      | User  | Password                                  |
+-----------+-------+-------------------------------------------+
| localhost | root  | *2D691E2378921A44C977D6D896515AC6234A2B09 |
| swift     | root  | *2D691E2378921A44C977D6D896515AC6234A2B09 |
| 127.0.0.1 | root  | *2D691E2378921A44C977D6D896515AC6234A2B09 |
| localhost | pybib | *202DCBC5DA0CF0272398688C93DA5DE9F3E38F23 |
+-----------+-------+-------------------------------------------+
7 rows in set (0.00 sec)
mysql> quit
Bye
$
3.2 Structured Query Language
It is time to create some tables and learn the Structured Query Language (SQL), which is a common language
for different database systems (eg. MySQL, PostgreSQL etc.). In our example, we will create a bibliographic database to store the information about publications of a certain group of people (say, a research
team). The central table of this database will be called ‘papers’ and will store information like the name
of the journal, volume, pages, year and authors of an article. The first step should always be the design
of the database: we have to decide how the data will be stored, so that it can be effectively used later.
First of all, each entry (paper) can have one or more authors and each author may have several names;
using a single column to store this information may not be sufficient and can make searching the table
more difficult and less effective. Secondly, this is a database of publications of authors from one
institution, so we can expect that certain names will reappear many times. Therefore, it is convenient to
use indices instead of names: we will give each author a unique index and make a separate table to
bind the names to these IDs. In our sample database, we will store information about these four articles:
1. William L. Jorgensen. J. Phys. Chem. 90:1276-1284 (1986)
2. Wolfgang Damm, Antonio Frontera, Julian Tirado-Rives, William L. Jorgensen. J. Comp. Chem. 18:1955-1970 (1997)
3. David Kony, Wolfgang Damm, Serge Stoll, Wilfred F. van Gunsteren. J. Comp. Chem. 23:1416-1429 (2002)
4. William L. Jorgensen, David S. Maxwell, Julian Tirado-Rives. J. Am. Chem. Soc. 118:11225-11236 (1996)
Authors
Paper  Author
1      1
2      2
2      6
2      7
2      1
3      3
3      2
3      5
3      4
4      1
4      8
4      7

Papers
Idx  Journal            Vol.  Pages        Year
1    J. Phys. Chem.     90    1276-1284    1986
2    J. Comp. Chem.     18    1955-1970    1997
3    J. Comp. Chem.     23    1416-1429    2002
4    J. Am. Chem. Soc.  118   11225-11236  1996

Names
Names       Surnames       Idx
William L.  Jorgensen      1
Wolfgang    Damm           2
David       Kony           3
Wilfred F.  van Gunsteren  4
Serge       Stoll          5
Antonio     Frontera       6
Julian      Tirado-Rives   7
David S.    Maxwell        8
Figure 3.1: Structure of the database: binding authors and papers.
An example is shown in Figure 3.1. Note that the authors table is used to correlate the data in names
and papers. In the table papers, each entry also has a unique index and we do not store the information
about the authors. This information is stored in the table authors using the indices from names and
papers.
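The linking role of the authors table can be tried out in plain Python, before any database is involved. The sketch below simply copies the content of Figure 3.1 into a dictionary and a list of pairs; the variable names and the function authors_of() are made up for this illustration and are not part of the database itself.

```python
# Data copied from Figure 3.1; all names used here are illustrative only.
names = {1: ("William L.", "Jorgensen"), 2: ("Wolfgang", "Damm"),
         3: ("David", "Kony"), 4: ("Wilfred F.", "van Gunsteren"),
         5: ("Serge", "Stoll"), 6: ("Antonio", "Frontera"),
         7: ("Julian", "Tirado-Rives"), 8: ("David S.", "Maxwell")}

# (paper, author) pairs, exactly as in the authors table
authors = [(1, 1), (2, 2), (2, 6), (2, 7), (2, 1), (3, 3), (3, 2),
           (3, 5), (3, 4), (4, 1), (4, 8), (4, 7)]

def authors_of(paper_idx):
    """Return the full names of all authors of the given paper."""
    return [" ".join(names[a]) for (p, a) in authors if p == paper_idx]

print(authors_of(4))
```

Each author name is stored only once, and a paper with several authors simply appears in several pairs.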
3.3 Data types
Once we know what tables we are going to create and what their content will be, we can proceed to the
next step. We have to choose the data type for each column. The data can be numerical, textual, it
can be a time, a date etc. In the case of numerical and textual data we have to decide on the size
(e.g. the maximum integer number that can be stored in a particular column or the maximum length of a
string). This is like choosing the right type for variables, but there are also new elements: in the table
we can permit (or not) empty values, we can require the indices to be unique numbers and we can let them
be incremented automatically. Table 3.1 shows which types have been chosen for the columns in our
example.
Table 3.1: Types used in the example.
Table    Column    Type
names    idx       SERIAL
names    names     VARCHAR(200)
names    surnames  VARCHAR(200)
papers   idx       SERIAL
papers   journal   VARCHAR(1000)
papers   year      YEAR
papers   pages     CHAR(20)
papers   volume    INT UNSIGNED
authors  paper     BIGINT UNSIGNED NOT NULL
authors  author    BIGINT UNSIGNED NOT NULL

All indices used in the database, as well as the ‘volume’ column, use integer values. The ‘idx’ columns in the tables
names and papers have been declared as SERIAL; this is a shorthand for BIGINT UNSIGNED NOT NULL
AUTO_INCREMENT UNIQUE. These keywords mean that the column will contain possibly large (BIGINT)
non-negative (UNSIGNED) integers, that its value cannot be empty (NOT NULL), that no two records in
the table can share the same value (UNIQUE) and that the values will be generated automatically if not
provided (AUTO_INCREMENT). The two columns in the table authors contain the same values as the
‘idx’ columns in names and papers; however, we do not use the SERIAL type there, because the values
will not be unique: an author index is repeated for every paper by that author, and a paper index for
each of its authors (see Figure 3.1). Instead, we specify the type as a
non-negative (UNSIGNED), large integer (BIGINT) and we do not permit empty values (NOT NULL).
The basic type used to store textual data in MySQL is the CHAR type. A field of this type has a fixed
length, which has to be specified; e.g. to create a column with a width of 16 characters, the type
should be specified as CHAR(16). However, if you plan to store text of variable length,
it might be advisable to use the VARCHAR type. Fields of this type have a variable length, up to the
specified limit, so the text occupies only as much space in the database as it needs. For example, in
the papers table, we store the name of the journal in a text field of variable length, with a maximum
length of 1000 characters, declared as VARCHAR(1000).
Time and date have special types in MySQL. In this database, we use only one of them, namely the YEAR
type. Obviously, we could use the integer type (INT), however using the most appropriate types has its
benefits: minimum space is used (one byte in case of the YEAR type), MySQL verifies if the data are
correct for the specified type and it will automatically do the conversions, e.g. 00 to 2000.
3.4 Creating tables
Now that the structure of the database is established and we have decided on the types, we can
create the tables. Start the MySQL interface, select your database and use the CREATE command to do it:
~$ mysql -u pybib -p
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.0.90-log Gentoo Linux mysql-5.0.90-r2
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> use bibliography
Database changed
mysql> CREATE TABLE names (names VARCHAR(200), surnames VARCHAR(200),
    idx SERIAL);
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE TABLE authors ( paper BIGINT UNSIGNED NOT NULL,
    author BIGINT UNSIGNED NOT NULL );
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE TABLE papers ( volume INT, journal VARCHAR(1000),
    pages CHAR(20), idx SERIAL, year YEAR );
Query OK, 0 rows affected (0.00 sec)
mysql>
At any point, you can use the DESCRIBE command to see the definition of your tables:
mysql> DESCRIBE authors;
+--------+---------------------+------+-----+---------+-------+
| Field  | Type                | Null | Key | Default | Extra |
+--------+---------------------+------+-----+---------+-------+
| paper  | bigint(20) unsigned | NO   |     | NULL    |       |
| author | bigint(20) unsigned | NO   |     | NULL    |       |
+--------+---------------------+------+-----+---------+-------+
2 rows in set (0.00 sec)
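If you want to experiment without a MySQL server, the same three tables can be sketched with the sqlite3 module from the Python standard library. Note that this is a stand-in and not the MySQL setup used in this chapter: SQLite has no SERIAL and no YEAR type, so we assume INTEGER PRIMARY KEY AUTOINCREMENT and INT as the closest replacements. The in-memory database and the function name create_schema() are also assumptions of this sketch.

```python
import sqlite3

def create_schema(conn):
    """Create the three tables of the bibliographic database (SQLite flavour)."""
    cur = conn.cursor()
    # SQLite replacement for SERIAL: an automatically incremented integer key
    cur.execute("CREATE TABLE names (names VARCHAR(200), "
                "surnames VARCHAR(200), "
                "idx INTEGER PRIMARY KEY AUTOINCREMENT)")
    cur.execute("CREATE TABLE authors (paper INTEGER NOT NULL, "
                "author INTEGER NOT NULL)")
    # SQLite has no YEAR type, a plain integer is used instead
    cur.execute("CREATE TABLE papers (volume INT, journal VARCHAR(1000), "
                "pages CHAR(20), idx INTEGER PRIMARY KEY AUTOINCREMENT, "
                "year INT)")
    conn.commit()

conn = sqlite3.connect(":memory:")   # temporary database, lost on close
create_schema(conn)

# List the tables that now exist (SQLite may add bookkeeping tables of its own)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```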
3.5 Inserting data
We will populate our database with some data. First, we will add the authors' names. We use the INSERT
command; remember that the index is auto-incremented, so we do not need to specify it:
mysql> INSERT INTO names (names,surnames) VALUE ('William L.',
    'Jorgensen');
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> INSERT INTO names (names,surnames) VALUE ('Wolfgang','Damm');
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> INSERT INTO names (names,surnames) VALUES ('David', 'Kony'),
    ('Wilfred F.','van Gunsteren'), ('Serge','Stoll'), ('Antonio',
    'Frontera'), ('Julian','Tirado-Rives'), ('David S.','Maxwell');
Query OK, 6 rows affected (0.04 sec)
Records: 6 Duplicates: 0 Warnings: 0
mysql>
As you can see, you may insert values one-by-one or all at once, by giving several comma-separated lists
of values. The examples use both VALUE and VALUES; MySQL accepts either keyword in both cases. We can
verify the content of the table using the SELECT statement:
mysql> SELECT * FROM names;
+---------------+------------+-----+
| surnames      | names      | idx |
+---------------+------------+-----+
| Jorgensen     | William L. |   1 |
| Damm          | Wolfgang   |   2 |
| Kony          | David      |   3 |
| van Gunsteren | Wilfred F. |   4 |
| Stoll         | Serge      |   5 |
| Frontera      | Antonio    |   6 |
| Tirado-Rives  | Julian     |   7 |
| Maxwell       | David S.   |   8 |
+---------------+------------+-----+
8 rows in set (0.00 sec)
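The two ways of inserting, one row at a time or several rows in one go, have direct counterparts when a database is filled from a Python script. The sketch below uses the standard sqlite3 module as a stand-in for MySQL (an assumption made so that it runs without a server), with a simplified names table and ? placeholders for the values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (names TEXT, surnames TEXT, "
             "idx INTEGER PRIMARY KEY AUTOINCREMENT)")

# One row at a time
conn.execute("INSERT INTO names (names, surnames) VALUES (?, ?)",
             ("William L.", "Jorgensen"))

# Several rows in one call: the statement is repeated for every tuple
more = [("Wolfgang", "Damm"), ("David", "Kony"),
        ("Wilfred F.", "van Gunsteren")]
conn.executemany("INSERT INTO names (names, surnames) VALUES (?, ?)", more)
conn.commit()

# The idx column was filled in automatically, in insertion order
rows = conn.execute("SELECT idx, surnames FROM names ORDER BY idx").fetchall()
for row in rows:
    print(row)
```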
Next we will add information about the papers:
mysql> INSERT INTO papers (journal, volume, pages, year) VALUES
    -> ('J. Phys. Chem.', 90, '1276-1284', 1986),
    -> ('J. Comp. Chem.', 18, '1955-1970', 1997),
    -> ('J. Comp. Chem.', 32, '1416-1429', 2002),
    -> ('J. Am. Chem. Soc.', 118, '11225-11236', 1996);
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM papers;
+--------+-------------------+-------------+-----+------+
| volume | journal           | pages       | idx | year |
+--------+-------------------+-------------+-----+------+
|     90 | J. Phys. Chem.    | 1276-1284   |   1 | 1986 |
|     18 | J. Comp. Chem.    | 1955-1970   |   2 | 1997 |
|     32 | J. Comp. Chem.    | 1416-1429   |   3 | 2002 |
|    118 | J. Am. Chem. Soc. | 11225-11236 |   4 | 1996 |
+--------+-------------------+-------------+-----+------+
4 rows in set (0.00 sec)
Oops! We made a mistake: in the third row, the volume number should be 23 instead of 32. We can fix it
with the UPDATE ... SET command, which substitutes values in the table. First, we have to pick the
row in a unique way; for that purpose we have the idx column. We will change the value in the volume
column, but only where idx = 3:
mysql> UPDATE papers SET volume=23 WHERE idx=3;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
Finally, we will assign authors to papers in the authors table:
mysql> INSERT INTO authors (paper,author) VALUES (1,1), (2,2), (2,6),
    (2,7), (2,1), (3,3), (3,2), (3,5), (3,4), (4,1), (4,8), (4,7);
Query OK, 12 rows affected (0.00 sec)
Records: 12 Duplicates: 0 Warnings: 0
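The correction of the volume number can also be reproduced from a script. The following sketch assumes the standard sqlite3 module as a stand-in for MySQL and a simplified papers table; it repeats the wrong entry on purpose and then repairs it. The rowcount attribute of the cursor reports how many rows were changed, which is a cheap check that the WHERE condition selected exactly one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (volume INT, journal TEXT, pages TEXT, "
             "idx INTEGER PRIMARY KEY AUTOINCREMENT, year INT)")
conn.executemany(
    "INSERT INTO papers (journal, volume, pages, year) VALUES (?, ?, ?, ?)",
    [("J. Phys. Chem.", 90, "1276-1284", 1986),
     ("J. Comp. Chem.", 18, "1955-1970", 1997),
     ("J. Comp. Chem.", 32, "1416-1429", 2002),   # wrong volume on purpose
     ("J. Am. Chem. Soc.", 118, "11225-11236", 1996)])

# Change the volume only in the row with idx = 3
cur = conn.execute("UPDATE papers SET volume=? WHERE idx=?", (23, 3))
changed = cur.rowcount
conn.commit()

volumes = [v for (v,) in conn.execute("SELECT volume FROM papers ORDER BY idx")]
print(changed)
print(volumes)
```

Without the WHERE condition, the same statement would overwrite the volume in every row of the table.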
3.6 Searching the database
Now, we will learn how to use SQL to form simple and more advanced queries. First, let us try to find
the papers in the database which were published in 2002:
mysql> SELECT * FROM papers WHERE year=2002;
+--------+----------------+-----------+-----+------+
| volume | journal        | pages     | idx | year |
+--------+----------------+-----------+-----+------+
|     23 | J. Comp. Chem. | 1416-1429 |   3 | 2002 |
+--------+----------------+-----------+-----+------+
1 row in set (0.04 sec)
We can also search for papers that were published in 2002 and in a specified journal, e.g. J. Phys. Chem.
The usual rules of boolean operators apply: we want both conditions (year = 2002 and
journal = J. Phys. Chem.) to be fulfilled, so we ‘intersect’ them with the AND operator. No paper in our
table satisfies both conditions, so the result is empty:
mysql> SELECT * FROM papers WHERE year=2002
    AND journal='J. Phys. Chem.';
Empty set (0.00 sec)
That was easy because we operate on the data from a single table. But let us try to find papers published
by William L. Jorgensen. First, we must look up his ID in the names table, then we have to find the IDs of
the papers in the authors table and finally, we have to retrieve the corresponding data from the papers
table:
mysql> SELECT idx FROM names WHERE names='William L.' AND
    surnames='Jorgensen';
+-----+
| idx |
+-----+
|   1 |
+-----+
1 row in set (0.00 sec)
mysql> SELECT paper FROM authors WHERE author=1;
+-------+
| paper |
+-------+
|     1 |
|     2 |
|     4 |
+-------+
3 rows in set (0.00 sec)
mysql> SELECT * FROM papers WHERE idx=1 OR idx=2 OR idx=4;
+--------+-------------------+-------------+-----+------+
| volume | journal           | pages       | idx | year |
+--------+-------------------+-------------+-----+------+
|     90 | J. Phys. Chem.    | 1276-1284   |   1 | 1986 |
|     18 | J. Comp. Chem.    | 1955-1970   |   2 | 1997 |
|    118 | J. Am. Chem. Soc. | 11225-11236 |   4 | 1996 |
+--------+-------------------+-------------+-----+------+
3 rows in set (0.00 sec)
OK, that did the job, but the solution is not very elegant and requires a lot of typing. This can be a
problem if we need to retrieve, for example, 1000 records. A better approach is to let MySQL combine
the tables for us with the JOIN clause.
First, let us join the data in the tables names and authors; we will look up the indices of the papers
published by William L. Jorgensen:
mysql> SELECT paper FROM names JOIN authors ON
    names.idx=authors.author WHERE names='William L.'
    AND surnames='Jorgensen';
+-------+
| paper |
+-------+
|     1 |
|     2 |
|     4 |
+-------+
3 rows in set (0.00 sec)
In order to join two tables, we must specify the relation between the records; here, the rows are joined
based on the column idx in the table names and the column author in the table authors. In addition,
we display only those records whose surnames column equals ‘Jorgensen’ and whose names column
equals ‘William L.’. In general, using two or more tables may lead to ambiguity in the column names,
e.g. two of our tables have the idx column. This problem is resolved by prefixing the column name with
the table name. To avoid any doubts, the command above could be written as:
mysql> SELECT authors.paper FROM names JOIN authors ON
    names.idx=authors.author WHERE names.names='William L.'
    AND names.surnames='Jorgensen';
In the next example, we will join all three tables together, using the relation between the idx column in
names and author in authors, as well as the relation between paper in authors and idx in the table
papers:
mysql> SELECT journal,volume,pages,year FROM papers JOIN (authors,
    names) ON papers.idx=authors.paper AND authors.author=names.idx
    WHERE surnames='Jorgensen' AND names='William L.';
+-------------------+--------+-------------+------+
| journal           | volume | pages       | year |
+-------------------+--------+-------------+------+
| J. Phys. Chem.    |     90 | 1276-1284   | 1986 |
| J. Comp. Chem.    |     18 | 1955-1970   | 1997 |
| J. Am. Chem. Soc. |    118 | 11225-11236 | 1996 |
+-------------------+--------+-------------+------+
3 rows in set (0.00 sec)
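The three-table join can be tried out from Python as well. The sketch below is a stand-in that assumes the standard sqlite3 module instead of MySQL, so the table definitions are simplified and the join is written as two chained JOIN ... ON clauses rather than the JOIN (authors, names) form shown above. The helper build_db() is a made-up name; it only recreates the sample data of this chapter in memory.

```python
import sqlite3

def build_db():
    """Create an in-memory copy of the sample database from this chapter."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE names (names TEXT, surnames TEXT, "
                 "idx INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE papers (volume INT, journal TEXT, pages TEXT, "
                 "idx INTEGER PRIMARY KEY, year INT)")
    conn.execute("CREATE TABLE authors (paper INT NOT NULL, "
                 "author INT NOT NULL)")
    conn.executemany("INSERT INTO names (names, surnames) VALUES (?, ?)", [
        ("William L.", "Jorgensen"), ("Wolfgang", "Damm"), ("David", "Kony"),
        ("Wilfred F.", "van Gunsteren"), ("Serge", "Stoll"),
        ("Antonio", "Frontera"), ("Julian", "Tirado-Rives"),
        ("David S.", "Maxwell")])
    conn.executemany("INSERT INTO papers (journal, volume, pages, year) "
                     "VALUES (?, ?, ?, ?)", [
        ("J. Phys. Chem.", 90, "1276-1284", 1986),
        ("J. Comp. Chem.", 18, "1955-1970", 1997),
        ("J. Comp. Chem.", 23, "1416-1429", 2002),
        ("J. Am. Chem. Soc.", 118, "11225-11236", 1996)])
    conn.executemany("INSERT INTO authors (paper, author) VALUES (?, ?)", [
        (1, 1), (2, 2), (2, 6), (2, 7), (2, 1), (3, 3), (3, 2), (3, 5),
        (3, 4), (4, 1), (4, 8), (4, 7)])
    return conn

# papers -> authors -> names, linked through the two index relations
query = ("SELECT journal, volume, pages, year FROM papers "
         "JOIN authors ON papers.idx = authors.paper "
         "JOIN names ON authors.author = names.idx "
         "WHERE names.surnames = ? AND names.names = ? "
         "ORDER BY papers.idx")

conn = build_db()
rows = conn.execute(query, ("Jorgensen", "William L.")).fetchall()
for row in rows:
    print(row)
```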
Finally, we will search the database for articles published together by the authors ‘Jorgensen’ and
‘Tirado-Rives’. This requires three steps: (i) looking up the IDs of the two authors in the table names, (ii)
finding the IDs of the papers in the table authors to which both author IDs are assigned, (iii) retrieving the
information from the table papers:
mysql> SELECT idx FROM names WHERE surnames='Jorgensen' OR
    surnames='Tirado-Rives';
+-----+
| idx |
+-----+
|   1 |
|   7 |
+-----+
2 rows in set (0.00 sec)
mysql> SELECT t1.paper FROM authors AS t1 JOIN authors AS t2
    ON t1.paper=t2.paper WHERE t1.author=1 AND t2.author=7;
+-------+
| paper |
+-------+
|     2 |
|     4 |
+-------+
2 rows in set (0.00 sec)
mysql> SELECT journal,volume,pages,year FROM papers WHERE
    idx=2 OR idx=4;
+-------------------+--------+-------------+------+
| journal           | volume | pages       | year |
+-------------------+--------+-------------+------+
| J. Comp. Chem.    |     18 | 1955-1970   | 1997 |
| J. Am. Chem. Soc. |    118 | 11225-11236 | 1996 |
+-------------------+--------+-------------+------+
2 rows in set (0.00 sec)
The only part that needs explanation is the second statement; by using JOIN with the same table on both
sides, we join the table with itself. The copies are given the aliases t1 and t2. We assemble the new table by
using the condition that the column paper in t1 is equal to paper in t2; this way we create a row for each
pair of authors of the same paper, i.e. for each existing co-authorship. Finally, we add the WHERE clause
to keep only those rows which refer to our authors (indices 1 and 7).
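To see what the self-join produces before the WHERE clause is applied, it helps to look at the joined rows themselves. The sketch below assumes the standard sqlite3 module as a stand-in for MySQL and contains only the authors table from our example; the function joint_papers() is a made-up name for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (paper INT NOT NULL, author INT NOT NULL)")
conn.executemany("INSERT INTO authors (paper, author) VALUES (?, ?)", [
    (1, 1), (2, 2), (2, 6), (2, 7), (2, 1), (3, 3), (3, 2), (3, 5),
    (3, 4), (4, 1), (4, 8), (4, 7)])

# Unfiltered self-join: one row per ordered pair of authors of the same paper
# (this includes every author paired with himself)
pairs = conn.execute("SELECT t1.paper, t1.author, t2.author "
                     "FROM authors AS t1 JOIN authors AS t2 "
                     "ON t1.paper = t2.paper").fetchall()

def joint_papers(conn, author_a, author_b):
    """Return the indices of the papers that list both authors."""
    cur = conn.execute("SELECT t1.paper FROM authors AS t1 "
                       "JOIN authors AS t2 ON t1.paper = t2.paper "
                       "WHERE t1.author = ? AND t2.author = ? "
                       "ORDER BY t1.paper", (author_a, author_b))
    return [row[0] for row in cur]

print(len(pairs))
print(joint_papers(conn, 1, 7))
```

A paper with n authors contributes n × n rows to the joined table, which is why the filter on both author columns is essential.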
SQL offers many more commands, but in fact, once you learn how to use the Python interface to
MySQL (the next section), you will rarely need them. Knowing Python, you can always retrieve some
data from the database and do the filtering and processing in your script. However, if you work on huge
tables, it is usually better to leave as much work as possible to MySQL, because its search algorithms are
optimized and only the matching rows have to be passed to your script.
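The two options, filtering in the database or filtering in the script, can be compared side by side. The following sketch assumes the standard sqlite3 module as a stand-in for MySQL and a reduced papers table with only two columns; both approaches give the same answer, they differ only in where the work is done.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (journal TEXT, year INT)")
conn.executemany("INSERT INTO papers (journal, year) VALUES (?, ?)",
                 [("J. Phys. Chem.", 1986), ("J. Comp. Chem.", 1997),
                  ("J. Comp. Chem.", 2002), ("J. Am. Chem. Soc.", 1996)])

# (a) Let the database do the filtering: only matching rows are returned
in_sql = conn.execute("SELECT journal, year FROM papers "
                      "WHERE year > 1996 ORDER BY year").fetchall()

# (b) Fetch everything and filter in the script: simple, but every row
#     has to be retrieved and inspected in Python
everything = conn.execute("SELECT journal, year FROM papers "
                          "ORDER BY year").fetchall()
in_python = [row for row in everything if row[1] > 1996]

print(in_sql == in_python)
```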
Exercise 28: Modify the existing database to store journal names in a separate table. This
table should contain both the full and abbreviated names and it should assign indices to each
journal. These indices should be used instead of names in the papers table.
3.7 Python interface to MySQL
The Python interface to the MySQL database is implemented in the MySQLdb module [14]. In order to execute
queries on the MySQL database from your Python scripts, you have to import the module and connect to
the database:
Code 116
import MySQLdb
conn = MySQLdb.connect(host = 'localhost', user = 'pybib', \
                       passwd = 'bookWORM', db = 'bibliography')
Then, you have to create a cursor (an object that performs queries and returns results), define the query
using SQL and execute it:
Code 117
cur = conn.cursor()
query = "SELECT * FROM names WHERE surnames='Jorgensen';"
cur.execute(query)
Next, you can fetch the results using either the fetchone() or fetchmany() method:
Code 118
result = cur.fetchone()
result = cur.fetchmany(10)
Normally, this is done in a loop, since we have many rows to retrieve:
Code 119
# Using count-controlled loop
cur.execute(query)
for row in range(cur.rowcount):
    result = cur.fetchone()
# Using condition-controlled loop
cur.execute(query)
result = cur.fetchone()
while result:
    result = cur.fetchone()
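For large result sets, fetchmany() can be combined with a condition-controlled loop to process the rows in batches. The sketch below assumes the standard sqlite3 module as a stand-in for MySQLdb; the table numbers is hypothetical and serves only to have 25 rows to fetch. The condition-controlled form is used because sqlite3 does not report the number of selected rows in rowcount, so the count-controlled variant does not carry over to every database module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with 25 rows, just to have something to fetch
conn.execute("CREATE TABLE numbers (n INT)")
conn.executemany("INSERT INTO numbers (n) VALUES (?)",
                 [(i,) for i in range(25)])

cur = conn.cursor()
cur.execute("SELECT n FROM numbers ORDER BY n")
batch_sizes = []
while True:
    batch = cur.fetchmany(10)   # at most 10 rows per call
    if not batch:               # an empty list means no rows are left
        break
    batch_sizes.append(len(batch))

print(batch_sizes)
conn.close()
```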
Do not forget to close the connection when you are done:
Code 120
conn.close()
The next example performs a search for publications of a specified author and from a specified year.
Code 121
#!/usr/bin/python
import MySQLdb
surname = raw_input("Surname [Press Enter for none]: ")
year = raw_input("Year [Press Enter for none]: ")
surname = surname.strip().lower()
if not year: year = 0
else: year = int(year)
conn = MySQLdb.connect(host = "localhost", user = "pybib", \
                       passwd = "bookWORM", db = "bibliography")
cur = conn.cursor()
query = "SELECT DISTINCT journal,volume,pages,year FROM papers"
query += " JOIN (authors, names) ON papers.idx=authors.paper"
query += " AND authors.author=names.idx"
if surname or year:
    query += " WHERE"
if surname:
    query += " surnames='%s'" % surname
    if year: query += " AND"
if year: query += " year=%s" % year
cur.execute(query)
result = cur.fetchone()
print "%-20s %5s %12s %4s" % ("Journal", "Vol", "Pages", "Year")
print "-"*44
while result:
    print "%-20s %5d %12s %4d" % result
    result = cur.fetchone()
conn.close()
Let us analyse the script line-by-line. First we ask for the surname and year. An empty value means that
the user does not want to use that search criterion:
Code 122
surname = raw_input("Surname [Press Enter for none]: ")
year = raw_input("Year [Press Enter for none]: ")
Next, we strip unnecessary white characters and convert to lower case, since the search will be
case-insensitive anyway. The year, if not given by the user, will be set to zero, which is boolean ‘false’ (this will
come in useful later):
Code 123
surname = surname.strip().lower()
if not year: year = 0
else: year = int(year)
Now, we establish the connection with the database server and assemble the query:
Code 124
conn = MySQLdb.connect(host = "localhost", user = "pybib", \
passwd = "bookWORM", db = "bibliography")
cur = conn.cursor()
query = "SELECT DISTINCT journal,volume,pages,year FROM papers"
query += " JOIN (authors, names) ON papers.idx=authors.paper"
query += " AND authors.author=names.idx"
The rest of the query is added depending on the search criteria that are used.
Code 125
if surname or year:
    query += " WHERE"
if surname:
    query += " surnames='%s'" % surname
    if year: query += " AND"
if year: query += " year=%s" % year
cur.execute(query)
Finally, we can retrieve the results and print them in a table:
Code 126
result = cur.fetchone()
print "%-20s %5s %12s %4s" % ("Journal", "Vol", "Pages", "Year")
print "-"*44
while result:
    print "%-20s %5d %12s %4d" % result
    result = cur.fetchone()
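One weakness of assembling the query with the % operator is that a surname containing a quote character breaks the SQL statement, or even changes its meaning. A common remedy is to pass the values separately from the SQL text. The sketch below shows this variant of the search under several assumptions: it uses the standard sqlite3 module as a stand-in for MySQLdb (with ? as the placeholder and chained JOIN clauses), the function build_query() is a made-up name, and the database holds a single paper only. MySQLdb offers the same mechanism through the second argument of execute(), with %s as the placeholder.

```python
import sqlite3

def build_query(surname, year):
    """Return (sql, params) for the optional surname/year search."""
    sql = ("SELECT DISTINCT journal, volume, pages, year FROM papers "
           "JOIN authors ON papers.idx = authors.paper "
           "JOIN names ON authors.author = names.idx")
    conditions, params = [], []
    if surname:
        # SQLite compares text case-sensitively, hence lower() on both sides
        conditions.append("lower(surnames) = ?")
        params.append(surname.strip().lower())
    if year:
        conditions.append("year = ?")
        params.append(int(year))
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params

# Tiny in-memory database with a single paper, for demonstration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (names TEXT, surnames TEXT, "
             "idx INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE papers (volume INT, journal TEXT, pages TEXT, "
             "idx INTEGER PRIMARY KEY, year INT)")
conn.execute("CREATE TABLE authors (paper INT NOT NULL, author INT NOT NULL)")
conn.execute("INSERT INTO names (names, surnames) "
             "VALUES ('William L.', 'Jorgensen')")
conn.execute("INSERT INTO papers (journal, volume, pages, year) "
             "VALUES ('J. Phys. Chem.', 90, '1276-1284', 1986)")
conn.execute("INSERT INTO authors (paper, author) VALUES (1, 1)")

sql, params = build_query("Jorgensen", "1986")
found = conn.execute(sql, params).fetchall()

# A surname containing quotes is treated as plain data, not as SQL
sql, params = build_query("x' OR '1'='1", "")
not_found = conn.execute(sql, params).fetchall()

print(found)
print(not_found)
```

Collecting the conditions in a list and joining them with AND also removes the need to decide by hand where the AND keyword belongs.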
In the next example, the script adds a new author to the database and returns the auto-generated ID of
this author. Remember that in our database, the idx column of the names table has the AUTO_INCREMENT
property; when you add a new author, MySQL will automatically insert a unique number into this column.
In Python, this number can be retrieved by the insert_id() method of the connection object:
Code 127
#!/usr/bin/python
import MySQLdb
conn = MySQLdb.connect(host = "localhost", user = "pybib", \
                       passwd = "bookWORM", db = "bibliography")
names = raw_input("Enter names: ")
surnames = raw_input("Enter surnames: ")
cur = conn.cursor()
query = "INSERT INTO names (names, surnames)"
query += " VALUES ('%s','%s')" % (names, surnames)
cur.execute(query)
print "Author's ID =", conn.insert_id()
# Make the change permanent; needed for transactional tables such as InnoDB
conn.commit()
conn.close()
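The same idea can be tested without a MySQL server. The sketch below assumes the standard sqlite3 module as a stand-in: there, the automatically assigned index is available as the lastrowid attribute of the cursor rather than through insert_id(). The function add_author() is a made-up name for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (names TEXT, surnames TEXT, "
             "idx INTEGER PRIMARY KEY AUTOINCREMENT)")

def add_author(conn, names, surnames):
    """Insert one author and return the automatically assigned idx."""
    cur = conn.cursor()
    cur.execute("INSERT INTO names (names, surnames) VALUES (?, ?)",
                (names, surnames))
    conn.commit()
    return cur.lastrowid

first = add_author(conn, "William L.", "Jorgensen")
second = add_author(conn, "Wolfgang", "Damm")
print(first)
print(second)
```

The returned index is what a script would then write into the authors table to link the new author to a paper.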
Afterword
Python was developed to be a scripting language with clear, readable and easy-to-learn syntax; as such,
it quickly became popular and new projects based on this language started to emerge. Nowadays, it
has replaced other scripting languages in many applications. It is the main scripting language of the
Gentoo Linux distribution; it was used to build many web site engines, like Zope, for example;¹ several
applications were written in Python including some molecular modelling tools, like PyMOL² or BkChem;³
finally it can be used for command line steering of some programs, also including those of interest to
computational chemists, like Modeller⁴ or VMD.⁵ Therefore one can say that Python is the scripting
language of bioinformatics and computational chemistry. If you intend to be just a regular end-user of
bioinformatics tools, this book and course should be enough. On the other hand, if you would like to
implement new methods or modify existing software, it might be desirable to go beyond the material
covered in this textbook. If programming in Python is fun for you, there is also a lot more to explore.
There are plenty of books written on Python; there are introductory tutorials, library references and books
on more complex subjects, like GUI programming or scientific programming in Python. You will also find
a lot of very advanced resources on the Internet. You can start, for example, with these introductory
books and sites: the official Python tutorial [16], the book by Allen B. Downey, entitled Python for
Software Design: How to Think Like a Computer Scientist [17] (there is an on-line version available),
the book by Mark Pilgrim, entitled Dive Into Python [18] (also available on-line), the book by Mark
Lutz, entitled Learning Python: Powerful Object-Oriented Programming [19] and the book by David M.
Beazley, entitled Python Essential Reference [20]. For scientific applications, you may want to check out
the book by Allen B. Downey [21] available on-line and the book by Hans Petter Langtangen [22]; for
the SciPy and NumPy modules, there are great reference guides available on the Internet [8, 12]; for
applications in bioinformatics, there is a free e-book by Katja Schuerer [23] and the books by Ruediger-Marcus
Flaig [24] and Mitchell Model [25]; finally, for the standard Python modules, you can use the
official documentation [26] and the book by Fredrik Lundh [27], also available on-line. If you intend
to make graphical interfaces to your scripts, you will find the tutorial for the PyGTK package on the
Internet [28]; for the PyQT library, I would recommend the official documentation [29] and the book
by Boudewijn Rempt [30]. For reference to MySQL database administration you will find resources on
the Internet, starting from the official on-line documentation [15] and independent tutorials [31]. There
are also printed resources, for example, the books by Larry Ullman [32] or Robert Sheldon and Geoff
Moes [33]. For the specific issues of the interface between Python and MySQL, you should consult the
on-line tutorial [14] or the book by Albert Lukaszewski [34].
¹ http://www.zope.org/WhatIsZope
² http://www.pymol.org
³ http://bkchem.zirael.org
⁴ http://www.salilab.org/modeller
⁵ http://www.ks.uiuc.edu/Research/vmd
Bibliography
[1] Æ. Frisch. Essential System Administration, Third Edition (O’Reilly, 2002), 3rd edn. ISBN: 978-0596003432.
[2] The official Python documentation.
URL: http://docs.python.org
[3] PDB file format documentation.
URL: http://www.wwpdb.org/docs.html
[4] Gaussian program documentation.
URL: http://www.gaussian.com/g_tech/g_ur/g09help.htm
[5] J. B. Foresman and Æ. Frisch. Exploring Chemistry with Electronic Structure Methods (Gaussian,
1996), 2nd edn. ISBN: 978-0963676931.
[6] J. E. F. Friedl. Mastering Regular Expressions (O’Reilly, 2006), 3rd edn. ISBN: 978-0596528126.
[7] Documentation of the re module.
URL: http://docs.python.org/library/re.html
[8] NumPy reference guide.
URL: http://docs.scipy.org/doc/numpy/reference
[9] The official Gnuplot documentation.
URL: http://www.gnuplot.info/documentation.html
[10] Adobe Systems Inc. PostScript Language Reference (Addison-Wesley, 1999), 3rd edn. ISBN: 978-0201379228.
URL: http://www.adobe.com/products/postscript/pdfs/PLRM.pdf
[11] R. Garg, S. P. Gupta, H. Gao, M. S. Babu, A. K. Debnath and C. Hansch. Comparative quantitative
structure-activity relationship studies on anti-HIV drugs. Chem. Rev., 99:3525–3602 (1999).
[12] SciPy reference guide.
URL: http://docs.scipy.org/doc/scipy/reference
[13] J. D. Murray and W. vanRyper. Encyclopedia of Graphics File Formats (O’Reilly Media, 1996), 2nd edn.
ISBN: 978-1565921610.
URL: http://www.fileformat.info/mirror/egff/index.htm
[14] Tutorial on the MySQLdb Python interface.
URL: http://mysql-python.sourceforge.net/MySQLdb.html
[15] MySQL reference.
URL: http://dev.mysql.com/doc/refman/5.0/en/index.html
[16] The official Python tutorial.
URL: http://docs.python.org/tutorial
[17] A. B. Downey. Python for Software Design: How to Think Like a Computer Scientist (Cambridge
University Press, 2009), 1st edn. ISBN: 978-0521725965.
URL: http://www.greenteapress.com/thinkpython/index.html
[18] M. Pilgrim. Dive Into Python (Apress, 2004), 1st edn. ISBN: 978-1590593561.
URL: http://diveintopython.org
[19] M. Lutz. Learning Python: Powerful Object-Oriented Programming (O’Reilly, 2009), 4th edn. ISBN:
978-0596158064.
[20] D. M. Beazley. Python Essential Reference (Addison-Wesley, 2009), 4th edn. ISBN: 978-0672329784.
[21] A. B. Downey. Computational Modeling and Complexity Science (Green Tea Press, 2008), 1st edn.
URL: http://www.greenteapress.com/compmod
[22] H. P. Langtangen. Python Scripting for Computational Science (Springer, 2007), 3rd edn. ISBN:
978-3540739159.
[23] K. Schuerer. Python course in bioinformatics.
URL: http://www.pasteur.fr/recherche/unites/sis/formation/python
[24] R.-M. Flaig. Bioinformatics Programming in Python: A Practical Course for Beginners (Wiley-VCH,
2008), 1st edn. ISBN: 978-3527320943.
[25] M. L. Model. Bioinformatics Programming Using Python: Practical Programming for Biological Data
(O’Reilly, 2009), 1st edn. ISBN: 978-0596154509.
[26] Index of Python modules.
URL: http://docs.python.org/modindex.html
[27] F. Lundh. Python Standard Library (O’Reilly, 2001), 1st edn. ISBN: 978-0596000967.
URL: http://effbot.org/zone/librarybook-index.htm
[28] PyGTK reference.
URL: http://www.pygtk.org
[29] PyQT reference.
URL: http://www.riverbankcomputing.co.uk/software/pyqt/intro
[30] B. Rempt. GUI Programming with Python: QT Edition (Commandprompt, 2001), 1st edn. ISBN:
978-0970033048.
URL: http://www.commandprompt.com/community/pyqt
[31] Tutorial on MySQL databases.
URL: http://www.techotopia.com/index.php/MySQL_Essentials
[32] L. Ullman. MySQL, Second Edition (Peachpit Press, 2006), 2nd edn. ISBN: 978-0321375735.
[33] R. Sheldon and G. Moes. Beginning MySQL (Programmer to Programmer) (Wrox, 2005), 1st edn.
ISBN: 978-0764579509.
[34] A. Lukaszewski. MySQL for Python (Packt Publishing, 2010), 1st edn. ISBN: 978-1849510189.