
Wrocław University of Technology
Bioinformatics

Borys Szefczyk

Applied Informatics

Wrocław (2010)

Project co-financed from the EU European Social Fund

Copyright © by Wrocław University of Technology, Wrocław (2010)

Project Office
ul. M. Smoluchowskiego 25, room no. 407
50-372 Wrocław, Poland
Phone: +48 71 320 43 77
Email: [email protected]
Website: www.studia.pwr.wroc.pl

Python programming for bioinformatics students
Borys Szefczyk

Contents

1 Basics
    1.1 What is Python and how to use it
    1.2 Hello, World!
    1.3 Variables in Python
    1.4 Operators
    1.5 Interaction with the user
    1.6 Using modules: math
    1.7 Getting help
    1.8 Handling types
    1.9 Simple control statements
    1.10 Condition-controlled loops
    1.11 More complex types: lists
    1.12 Count-controlled loops
    1.13 Pretty output
    1.14 Tuples
    1.15 Strings and methods
    1.16 Dictionaries
    1.17 Passing arguments to the script
    1.18 Advanced command line options
    1.19 Working with files
    1.20 Launching external programs
    1.21 Functions
    1.22 Writing modules
    1.23 Regular expressions (re)
2 Numerical applications
    2.1 Basic operation on arrays
    2.2 Using Gnuplot with numpy
    2.3 Linear algebra in Python
    2.4 Python for scientists
3 Databases
    3.1 Administration
    3.2 Simple Query Language
    3.3 Data types
    3.4 Creating tables
    3.5 Inserting data
    3.6 Searching the database
    3.7 Python interface to MySQL

Preface

This textbook on Applied Informatics is by no means comprehensive. There is no book that could cover the whole field of applied informatics. Instead, the course and the book should give the student knowledge of the programming language, Python, sufficient to solve different tasks in everyday problems of molecular modelling, computational chemistry or bioinformatics; hence the name, Python programming for bioinformatics students. Whereas courses on programming languages such as Pascal, C or C++ focus on the language itself and never go into the application layer, this course is focused on applications in computational chemistry.

In the first part of the tutorial, you will get the basic knowledge of the Python scripting language; in the second part you will learn how to use Python to solve selected numerical problems (root finding, integration etc.), manipulate coordinates of molecules and build structures, how to use Python to control computational programs such as GAMESS or Gaussian, and even how to do the Quantitative Structure-Activity Relationship analysis in Python! You will learn how to solve problems of linear algebra and how to access and manage professional databases.
Borys Szefczyk¹

¹ Author's e-mail address: [email protected]

How to read this textbook

In order to make reading easier, the following conventions are applied. Commands that you have to type on your computer are written as in the example below:

    ./runme
    ps x

Any dialogue with the Python interpreter or other programs is typeset as in the following example, with the user input on a grey background:

    Enter a number: 123.0
    You have entered 123.0

Sample source code is typeset using coloured syntax, as in the following example:

Code 1

    #!/usr/bin/python
    print "Hello, World!"

Chapter 1
Basics

1.1 What is Python and how to use it

Python is a scripting language. Python is also the name of a program that is used to interpret the scripts written in the Python language. If you are going to learn Python, you will be writing scripts and not programs. Does it matter what we call it? Yes, because there is a huge difference: programs are binary (i.e. readable for the machine but not for us humans) and have to be compiled before execution. Scripts are written as text files and they stay text files for the rest of their lifetime. They are not executed but interpreted, therefore they always require that you have the interpreter program (i.e. Python) on your computer. They are also slower than programs, because they have to be translated on-the-fly.

A Python script can be executed interactively (i.e. while you type it) or from a text file. The first way is useful if you just want to test one or a few commands, and it is also useful for accessing the internal help system. However, when you are writing a longer script, which you will use many times, it is obviously better to type it in a text editor, save it as a text (ASCII¹) file and execute it afterwards.

Python can be used both in Linux and in Windows, but the way of running the script differs.
This tutorial covers only the usage of Python in Linux and assumes that you are familiar with this operating system. If not, you should keep a Linux manual on hand; a good and extensive one is the book by Æleen Frisch [1].

In order to start an interactive session of Python, open a text console and type

    python

You should see something like

    Python 2.6.4 (r264:75706, Dec 7 2009, 23:19:43)
    [GCC 4.3.4] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

¹ ASCII (American Standard Code for Information Interchange): one of the character encodings, a table translating characters into one-byte numbers; it contains all English characters, numbers and punctuation, but does not contain, for example, the characters specific to Slavic languages.

Now you are within the Python program (don't be confused: this is not the shell² any more and the shell commands do not work here!). If you want to terminate the session, simply press Control-D and you will be taken back to the shell program that you were using.

Once you start writing scripts, you will need a text editor. Warning for Windows addicts: text editor does not mean the "Word" program. A text editor is a program that lets you save plain-text files, or ASCII text in other words. These cannot be .doc or .rtf files (or whatever Word produces), because they contain a lot of garbage, which you do not see in Word, but which will confuse the Python interpreter. Rather, I suggest you download and install the SciTE editor.³ It has the wonderful feature of highlighting Python syntax, which makes writing scripts way easier. If you use the Vim editor, it will also do.

A Python script is a set of commands that you type into the text editor and save for later execution. The file typically has a .py extension, although this is more a custom than an obligation. There are two ways of executing such a script (again, we are talking about Linux).
The first way is to supply the name of the script to the python command:

    python my_script.py

The second way uses a mechanism provided by the operating system: in the very first line of the Python script you type the characters #!, followed by the path to the Python program (usually /usr/bin/python). For example:

Code 2

    #!/usr/bin/python

Additionally, you have to change the permissions⁴ of the file, so that it can be executed:

    chmod u+x my_script.py

Now you can run the script like any other program, by specifying its name (and path, if necessary). Usually, if the script resides in your current directory, you will type

    ./my_script.py

² shell: the command-line program used to interact with the operating system; the most popular shells are bash and tcsh.
³ http://www.scintilla.org/SciTEDownload.html
⁴ In UNIX systems, files have reading, writing and execution permissions; the last one is commonly designated with the letter x and indicates that the file can be executed.

A short explanation of how it works: the hash character starts a comment in Python, so the line will be ignored by the interpreter. But the two characters, #!, placed at the very beginning of the first line, have a special meaning to the operating system, whichever shell you use (bash, csh or other). They name the program that will interpret the content of the file; the system starts that program and passes it the path of the script.

Python as a language uses the object concept, however this tutorial is not aimed at teaching you object-oriented programming. You will learn structured programming, and the object-oriented programming will be limited to a minimum.

At the time of writing, Python versions 3.x are stable and have started being installed in Linux distributions along with the older version 2.x. Python 3.x is intentionally not compatible with previous versions.
This book refers to the syntax used by Python 2.x, since most of the external modules are still not compatible with the new language. On the other hand, the changes are not that big and you can easily "translate" scripts written in Python 2.x to the 3.x version; there are even tools for automatic conversion.⁵

⁵ http://docs.python.org/py3k/library/2to3.html

1.2 Hello, World!

As usual, we will start our tutorial with the "Hello, World!" example. Here it is:

Code 3

    #!/usr/bin/python
    print "Hello, World!"

The first line in the example is for the system and it says that Python should be used as the interpreter; Python itself will ignore it. The print statement is used to display the value of an expression. Here it is just a string (delimited with quotation marks). You can type it in the editor, save it under the name hello_world.py and execute:

    python hello_world.py

Note that every line of your script must begin at the very first character of the line, i.e. there should be no spaces or tabs before print. Leading whitespace characters are used in Python to make blocks of instructions (we will discuss this later). If you do put a space in front of the print instruction (which is a common mistake), you will get an error like this:

      File "hello_world.py", line 3
        print "Hello, World!"
        ^
    IndentationError: unexpected indent

Remember that Python is case sensitive, i.e. lower and upper case letters are interpreted differently. For instance, the instruction print cannot be spelled Print.

Exercise 1: Modify the Hello, World! program, so that it displays your name.

1.3 Variables in Python

What your scripts usually do is convert one kind of information into another. To do so, you will need to store intermediate data. For this purpose you will use variables. Think of a variable as a selected place in the memory of the computer, where you can store a specific kind of data.
As you may know, computers use the binary system, i.e. all data are represented as rows of logical values, zeros and ones. It is therefore important to specify what kind of data you are storing in the memory, otherwise the conversion to binary and back would not be possible. In many programming languages, you have to declare a variable, indicating its type (e.g. character or integer number). Python makes your life a bit easier, because you need neither to declare the variables nor to define their type; the variable will be created once you assign a value to it. Also, Python will guess what the type of the variable is. The variable will exist (it will be kept in the memory) until the program/function finishes or until you explicitly delete the variable.

Each variable has a name (identifier) and a value. The name is just a label that represents the variable in the program. An example:

Code 4

    Val = 123
    number_pi = 3.14
    ch1 = 'x'

In this example, three variables are created, called Val, number_pi and ch1. The names of variables may contain lower and upper case characters, digits and the underscore character, but they cannot start with a digit. For example: 1x, var.a and my-var are incorrect names. Also, you cannot use reserved names, which are Python instructions (like print). The equality character (=) is used in Python to assign a value to a variable. Here, for instance, the variable Val will store the number 123. In the example, we do not specify the type explicitly, but each of the variables will have the type defined by Python. Val will be an integer number, number_pi will be a floating point number and ch1 will be a character (a string, more precisely).

Table 1.1: Data types in Python.

    Name    Examples              Description
    bool    False, 0, True, 1     Logical values
    int     -10, 4005             Integer numbers
    long    123456789L            Integer numbers of unlimited size
    float   0.123, 1.4e-15        Real numbers
    str     'a', "python"         Strings (text)

These types are guessed by Python in the following way: 3.14 is a real number, because it has a fractional part. To store it, the integer type is not sufficient, so the float type will be used. On the other hand, 123 can be stored as an integer number, because it does not have a fractional part. However, if you would like to create the Val variable as float and store 123 there, you may force Python to do so:

Code 5

    Val = 123.0

By specifying the decimal point (123.0) you indicate that this is a floating point number, not an integer.

By assigning to the same variable several times, you make Python 'forget' the old value and 'learn' the new one:

Code 6

    x = 12
    x = 34

After executing this code, the variable x will contain the value 34. This is like erasing the variable and creating it again, with a new value. You may also change the type of the variable with a subsequent assignment:

Code 7

    x = 5.67

Most variable types have limits, e.g. there are certain minimum and maximum numbers that you can store in an integer variable. In Table 1.1 you will find some of the types used in Python. Note how real numbers with an exponent are typed in Python, e.g. $1.4 \cdot 10^{-15}$ must be written as 1.4e-15.

1.4 Operators

Have a look at the following example:

Code 8

    #!/usr/bin/python
    x = 1
    y = 2
    z = x + y
    print "The result is", z

Here, we add the values of two variables (x and y) and assign the result to the variable z. We use the sum operator (plus sign). See Table 1.2 for the list of standard math operators and their priority.

Table 1.2: Some of the arithmetic operators in Python, arranged according to the priority (from the highest priority in the top row to the lowest priority in the bottom row).

    Operators    Description
    **           power operator
    * / %        multiplication, division and modulo (remainder)
    + -          sum and difference

The operator with the highest priority will be executed first; if two operators have equal priorities, they will be executed from the left- to the right-hand side. In the following example:

Code 9

    x = 3 + 4 / 2 - 1

the first operation executed will be 4 / 2, then 3 + 2 - 1 is evaluated from left to right, giving 4. If you want a lower-priority operator to be executed first, you have to use parentheses. If you are in doubt, always use parentheses to define how the expression will be evaluated; it is not an error to use redundant parentheses. For example, in order to compute correctly

    $z = \frac{x}{y + 2}$

you have to write in your program:

Code 10

    z = x / (y + 2)

Be aware of the special behaviour of the division operator, which depends on the arguments. If both are integers, the result is rounded down to an integer; if at least one of the arguments is a floating point number, the result will also be a float. Try out this program, and compare the values of a and b:

Code 11

    #!/usr/bin/python
    a = 2 / 3
    b = 2 / 3.0
    print "a =", a
    print "b =", b

Knowing the math operators, you can use the Python program as a calculator. Just start an interactive session as described in Section 1.1 and type the expression you want to calculate:

    Python 2.6.4 (r264:75706, Dec 7 2009, 23:19:43)
    [GCC 4.3.4] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 12.3 + 67. / 5
    25.700000000000003
    >>> _ * 7
    179.90000000000003
    >>> a = 33
    >>> b = 11
    >>> a / b
    3
    >>>

The result of the last operation can always be retrieved using the special variable designated with the underscore character (_). You can also define variables and use them; in the interactive Python session you do not need to use the print instruction to display the result.

1.5 Interaction with the user

The whole point of making scripts is to save time and work, by writing a script once and then feeding it with different kinds of data. You can insert your data into the script using variables, but this is not what a programmer would call "an elegant way" of handling things. Instead, you should use the input() function to interactively ask the user for data. Let us write a script to convert energy from hartree units into kJ/mol and eV. The conversion factors are 2625.5 and 27.211:

Code 12

    #!/usr/bin/python
    Eh = input("Enter energy in hartree: ")
    EkJmol = Eh * 2625.5
    EeV = Eh * 27.211
    print Eh, "hartree =", EkJmol, "kJ/mol"
    print Eh, "hartree =", EeV, "eV"

The standard function input("string") asks the user to enter a value and returns it. The "string" is displayed as a prompt, much as if you had used the print instruction. In this example, the value returned by input() is assigned to the Eh variable.

Exercise 2: Write a script that calculates the height of a regular triangle for an edge length entered by the user (hint: $\sqrt{2} = 2^{1/2}$).

1.6 Using modules: math

One of the advantages of using Python is the enormous number of modules that can help to solve various kinds of programming tasks. Frankly, Python itself is quite limited, and very soon you will realize that the function you need is available in one of the modules.
For example, to use the logarithm function, you must first load the math module:

Code 13

    #!/usr/bin/python
    import math
    x = math.log(2.0)
    print "log(2.0) =", x

As you can see above, the module called math is loaded with the import instruction and after that the function can be invoked by specifying the module name, a dot and the function name (plus arguments if any are required). To display a list of all objects inside the module, you may use the dir(module_name) function. If you just need a single function and not all of them, you can use another syntax:

Code 14

    #!/usr/bin/python
    from math import log
    x = log(2.0)
    print "log(2.0) =", x

Note that by using the latter syntax, you are adding the logarithm function to the global namespace and, when it is invoked, the module name is no longer needed. It is also possible to use a wildcard with this second form of the statement and load all the functions from the module at once:

Code 15

    from math import *

1.7 Getting help

Besides the pretty large documentation available on-line [2], Python has an internal help system based on its object-based character. For the purpose of this section, it is best if you start the interactive session and type along, following the snippets presented here.

    Python 2.6.4 (r264:75706, Dec 7 2009, 23:19:43)
    [GCC 4.3.4] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> a=1
    >>> import math
    >>> dir()
    ['__builtins__', '__doc__', '__name__', '__package__', 'a', 'math']
    >>>

The function dir(), used here without any arguments, displays the list of names defined in the main namespace. Besides the standard objects, you will notice above that the list contains the variable a, which has been defined, and the math module, which has been imported.
Continue with the next example, still inside the interactive session:

    >>> dir(math)
    ['__doc__', '__file__', '__name__', '__package__', 'acos', 'acosh',
    'asin', 'asinh', 'atan', 'atan2', 'atanh', 'ceil', 'copysign', 'cos',
    'cosh', 'degrees', 'e', 'exp', 'fabs', 'factorial', 'floor', 'fmod',
    'frexp', 'fsum', 'hypot', 'isinf', 'isnan', 'ldexp', 'log', 'log10',
    'log1p', 'modf', 'pi', 'pow', 'radians', 'sin', 'sinh', 'sqrt', 'tan',
    'tanh', 'trunc']
    >>>

This time, the dir() function has been used to display the content of the math module. Note that every object contains an element called __doc__. This is just text, which you can display with the print instruction:

    >>> print math.__doc__
    This module is always available. It provides access to the
    mathematical functions defined by the C standard.
    >>> print math.ceil.__doc__
    ceil(x)

    Return the ceiling of x as a float.
    This is the smallest integral value >= x.
    >>>

The __doc__ object contains (usually) information about the object/function and instructions on how to use it.

1.8 Handling types

In this section we continue the discussion of data types, which began in Section 1.3. As you know already, when a variable is created and initialized, Python decides what type of data it contains. It is possible to check the type with the type(variable_name) function:

Code 16

    #!/usr/bin/python
    a = 1
    print "Type of a is", type(a)
    a = 1.0
    print "but now a is", type(a)
    a = 1+0j
    print "and finally a becomes", type(a)

If you execute this script, you will see that:

    Type of a is <type 'int'>
    but now a is <type 'float'>
    and finally a becomes <type 'complex'>

A similar kind of guessing is performed when the input() function is used. Try to execute the script below a few times, entering different values (e.g. 1, 1.0, 1+0j).
Code 17

    #!/usr/bin/python
    a = input("Enter a value: ")
    print "Your value is", type(a)

    borys@swift $ ./types.py
    Enter a value: 23
    Your value is <type 'int'>
    borys@swift $ ./types.py
    Enter a value: 0.5
    Your value is <type 'float'>
    borys@swift $ ./types.py
    Enter a value: "abc"
    Your value is <type 'str'>

In the last example, the user entered the string "abc" (type 'str'). Note that in such a case, the string has to be enclosed in quotation marks. If you are sure that the data you are asking for are strings, it is handier to use another function, called raw_input(). It works exactly like input(), except that it does not try to guess the type and always returns a string. This way, the users do not need to enclose their input in quotation marks to indicate that it is a string.

You can always ensure that the correct type will be used by calling one of the type-conversion functions (int(), str() etc.). Consider the following example, which reads in a numerator and a denominator and displays a decimal number:

Code 18

    #!/usr/bin/python
    numerator = input("Enter numerator: ")
    denominator = input("Enter denominator: ")
    decimal = float(numerator) / denominator
    print numerator, "/", denominator, "=", decimal

Since the user may type the numerator and denominator as integer numbers, it is necessary to use the float() function to ensure that the result of the division will be real.

Exercise 3: Use the raw_input() function to write a script that will ask the user for his name and then display a welcome message, like in the example:

    borys@swift $ ./rawelcome.py
    What is your name? Borys Szefczyk
    Welcome, Borys Szefczyk !

[Figure 1.1: An algorithm for going out. Flowchart: "I want to go out" leads to "Is it raining?"; if yes, "Take umbrella"; if no, "Is it sunny?"; if yes, "Take sunglasses".]
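The type-conversion functions mentioned in Section 1.8 can be tried on text input such as raw_input() would return. The following is only a minimal sketch under stated assumptions: it is written in Python 3 syntax (print as a function), unlike the Python 2 listings in this book, and the helper name to_fraction is hypothetical, introduced just for this illustration.

```python
# Sketch: explicit type conversion of text input (Python 3 syntax).
# `to_fraction` is a hypothetical helper, not part of the textbook.

def to_fraction(numerator_text, denominator_text):
    """Convert two strings to floats and return their quotient."""
    numerator = float(numerator_text)      # e.g. "3" -> 3.0
    denominator = float(denominator_text)  # e.g. "4" -> 4.0
    return numerator / denominator


print(type(int("23")))       # str converted to int
print(type(float("0.5")))    # str converted to float
print(type(str(1.0)))        # float converted to str
print(to_fraction("3", "4"))
```

Converting both strings with float() before dividing plays the same role as float(numerator) in Code 18: the division is carried out on real numbers regardless of how the user typed the values.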
1.9 Simple control statements

One important aspect of algorithms used by programs is making decisions about what part of the code should be executed, depending on certain data. This is like when you check the weather forecast and decide whether to take an umbrella or sunglasses. Algorithms are often presented in graphical form (Figure 1.1). In a program or script, it is the conditional instruction that is responsible for making the decision. The following script calculates the roots of a quadratic equation. The number of distinct real roots depends on the value of the discriminant $\Delta$:

    $\Delta = b^2 - 4ac$

The script has to decide which formula should be used, depending on whether $\Delta$ is positive, negative or zero.

Code 19

    #!/usr/bin/python
    from math import sqrt
    print "Finding roots of a*x^2 + b*x + c = 0"
    a = input("Enter a: ")
    b = input("Enter b: ")
    c = input("Enter c: ")
    a = float(a)
    delta = b*b - 4*a*c
    if delta > 0:
        x1 = (-b - sqrt(delta))/2/a
        x2 = (-b + sqrt(delta))/2/a
        print "Roots are:", x1, "and", x2
    elif delta == 0:
        x = -b/2/a
        print "The single real root is:", x
    else:
        print "There are no real roots."

How does it work? Look at the example above: when the script reaches the if statement, it evaluates the relational expression delta > 0; if the expression is true, it will execute the code that follows and skip the rest of the conditional instruction (elif and else). If the expression delta > 0 is false, it will jump to the next part of the conditional instruction, which is elif delta == 0:. If this expression is true, the following code will be executed. If not, the program will jump to the else: part. Note that there is no relational expression after else; this statement indicates the part of the code that should be executed if all other conditions fail. Also note that the lines between if and elif, as well as the lines between elif and else, are indented.
The indentation, which may consist of spaces or tabs, indicates a block of instructions. A block of instructions is like a small program inside your program. Remember that all lines within a block must have the same indentation, i.e. if the first line starts with four spaces, the following lines must start with four spaces too. This is how Python recognizes the beginning and end of the block.

The general form of the conditional instruction in Python is: the if keyword, followed by a relational expression and a colon. Other elements, the elif and else statements, are optional. Below, different variants of the conditional instruction are shown; all of them are valid.

Code 20

    # Example 1
    if expression1:
        code1
    elif expression2:
        code2
    elif expression3:
        code3
    else:
        code4

    # Example 2
    if expression:
        code

Code 21

    # Example 3
    if expression:
        code1
    else:
        code2

    # Example 4
    if expression:
        pass
    else:
        code

The first example shows the full version of the conditional instruction, but the elif statement might be omitted, like in examples 3 and 4. The else statement is also optional; the simplest form of the conditional instruction consists of the if part alone, like in example 2. A block of code within the conditional instruction cannot be empty. If you want nothing to be done for one of the conditions, you may use the pass instruction. This instruction does nothing, it just satisfies the syntactic requirements of the language.

The relational expression used by the if instruction is just like an arithmetic expression, except that it can have one of two values: True or False. Relational expressions are composed of relational operators, which are listed in Table 1.3, and parentheses. Table 1.4 shows how the or and and operators work. Many relational expressions can be combined by using parentheses. Look at the examples below and try to predict if the expressions are true or false. Then use Python to check.
Code 22

    a = -1
    b = 0
    c = 1
    (c > a) and (a < b)
    (not b) and (b == c)
    c or b or a > 0.0

Values of many types in Python have their logical meaning as well:

• Integer 0 and float 0.0 are False, all other numbers are True;
• Empty string "" is False, any other string is True;
• Empty list [] or tuple () or dictionary {} are False (you will learn what they are in the following sections).

Table 1.3: Relational operators.

    Symbol   Example    Function
    not      not a      Negation
    or       a or b     Logical sum
    and      a and b    Logical product
    ==       a == b     Equal to
    !=       a != b     Not equal to
    >        a > b      Greater than
    >=       a >= b     Greater than or equal to
    <        a < b      Less than
    <=       a <= b     Less than or equal to

Table 1.4: Evaluation of logical sum and product.

    Left operand   Right operand   Sum     Product
    True           True            True    True
    True           False           True    False
    False          True            True    False
    False          False           False   False

Exercise 4: Write a script that tells the user if a given year is a leap year. The rule to determine a leap year is as follows: if the year is divisible by 4 and it is not divisible by 100, it is a leap year. If the year is divisible by 400 it is also a leap year. For example: 2008, 2004 and 2000 were leap years, but 1900 was not.

1.10 Condition-controlled loops

An important element of programming is executing certain parts of the code multiple times, like reading subsequent lines of a file until we find the one that interests us. Asking users to input data and repeating the question until the entered data are correct is another example. Our script is going to calculate the square root of the number entered by the user. But the argument to the function sqrt() must be nonnegative:

Code 23

    from math import sqrt
    a = input("Enter a positive number: ")
    while a < 0:
        print "This number is negative!"
        a = input("Enter a positive number: ")
    print "sqrt(", a, ") =", sqrt(a)

Loops are often used in numerical procedures, if they are based on iterative techniques.
An important example for computational chemists is the Self-Consistent Field method: we start with an approximate set of coefficients (a guess) and use an iterative procedure to improve them until we reach the desired accuracy. In the following two examples we will use iterative procedures to compute the value of ln 2 from a series and to solve an equation. It is known that the following series converges to ln 2:

$$\frac{1}{1} - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \frac{1}{5} - \cdots = \ln 2$$

Therefore we can pretend that we do not know about the log() function in the math module and use the series to compute the value of ln 2. A convergent series has the following properties: (i) it has a finite limit $l$ ($l < \infty$) and (ii) for a given accuracy $\epsilon$ there is a large integer $N$ such that for all $n \geq N$:

$$|S_n - l| \leq \epsilon$$

where $S_n$ is a partial sum and $\epsilon$ is the accuracy. Therefore we can compute ln 2 with a desired accuracy $\epsilon$ by simply adding subsequent elements of the series. We should continue the summation until the sum in two subsequent steps changes by less than $\epsilon$. This algorithm is illustrated in Table 1.5 and the code that performs the task is shown below.

Table 1.5: Calculating the sum of a convergent series with desired accuracy ($10^{-3}$).

    Step   Element              Sum      Converged?
    1      1/1 = 1              1        No
    2      1/2 = 0.5            0.5      No
    3      1/3 = 0.333          0.833    No
    4      1/4 = 0.25           0.583    No
    ···    ···                  ···      No
    1001   1/1001 = 0.000999    0.6936   Yes

Code 24

    total = 0
    element = 1
    epsilon = 1e-3
    while 1.0/element > epsilon:
        if element % 2:
            total += 1.0 / element
        else:
            total -= 1.0 / element
        element += 1
    # element now points at the first term that was NOT added
    print "ln(2) =", total, "after", element - 1, "steps."

This example introduces new operators, += and -=. For example, element += 1 means "increment the variable element by one". It is equivalent to element = element + 1. Further operators of this kind are listed in Table 1.6.
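The stopping rule described above (continue until two subsequent partial sums differ by less than the accuracy) can be sketched a little more explicitly. The snippet below is only an illustration under stated assumptions: the variable names previous, sign and terms are hypothetical and do not appear in Code 24, and print is called with a single parenthesized string so that the snippet runs under both Python 2 and Python 3.

```python
epsilon = 1e-3       # requested accuracy
total = 0.0          # current partial sum
previous = 1.0       # partial sum from the previous step (any value far from total)
sign = 1.0           # +1 for odd terms, -1 for even terms
terms = 0            # number of terms added so far

# Keep adding terms while two subsequent partial sums differ by epsilon or more
while abs(total - previous) >= epsilon:
    previous = total
    terms += 1
    total += sign / terms
    sign = -sign

print("ln(2) is approximately %.4f (%d terms added)" % (total, terms))
```

Since the difference between two subsequent partial sums is exactly the last term added, this test is equivalent in spirit to the condition on 1.0/element used in Code 24.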
Table 1.6: Operators with assignment and equivalent expressions.

    Operator   Meaning                                          Equivalent expression
    x += a     Increment x by a                                 x = x + a
    x -= a     Subtract a from x                                x = x - a
    x *= a     Multiply x by a                                  x = x * a
    x /= a     Divide x by a                                    x = x / a
    x %= a     Replace x with the remainder of x divided by a   x = x % a

In the following example we will solve the equation:

$$x - 2 = \ln x$$

First, to get an idea of what the solution might be, we plot two functions (Figure 1.2):

$$L(x) = x - 2$$
$$R(x) = \ln x$$

Figure 1.2: Graphical solution of the equation $x - 2 = \ln x$.

The solution to our problem is such an $x$ that $L(x) = R(x)$. From Figure 1.2 we see that the equation has two solutions, $x_1 \approx 0.2$ and $x_2 \approx 3.2$ (these are the points where the two curves cross). We will use an iterative technique to compute a more accurate value: we start with a guess $x = 3.2$ and compute the right-hand side expression, $R = \ln x$. Then we use the left-hand side expression to find a new value, $x = R + 2$. This is our new, hopefully better, approximation. Then we substitute it into the right-hand side again and repeat the cycle until both sides are equal (within a certain error $\epsilon$):

Code 25

    from math import log

    x = 3.2
    epsilon = 1e-5
    step = 0
    left = x - 2
    right = log(x)
    while abs(left - right) > epsilon:
        x = right + 2
        left = x - 2
        right = log(x)
        step += 1
        print "Step", step, ": x =", x

This program uses the abs() function, which returns the absolute value of an expression.

Exercise 5: Try to play with different initial values of x. Are you able to find both solutions? If not, try to rewrite the equation by taking the exponent of both sides, i.e.

$$e^{x-2} = x$$

Exercise 6: Compute the number e (the base of the natural logarithm) as the sum of the convergent series:

$$e = \frac{1}{0!} + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \cdots$$
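An iterative loop such as the one in Code 25 runs forever if the procedure does not converge. A common precaution is to limit the number of steps. The sketch below repeats the iteration for $x - 2 = \ln x$ with such a limit; the names max_steps and converged are hypothetical additions, and the value 100 is an arbitrary assumption. The print calls are written so that they work under both Python 2 and Python 3.

```python
from math import log

x = 3.2              # initial guess, as in Code 25
epsilon = 1e-5       # requested accuracy
max_steps = 100      # safety limit (an assumption, not part of Code 25)
step = 0

# Stop when both sides agree within epsilon OR when the step limit is reached
while abs((x - 2) - log(x)) > epsilon and step < max_steps:
    x = log(x) + 2   # right-hand side plus 2 gives the next approximation
    step += 1

converged = abs((x - 2) - log(x)) <= epsilon
if converged:
    print("x = %.5f after %d steps" % (x, step))
else:
    print("No convergence after %d steps" % max_steps)
```

The step limit only protects against an endless loop; it does not protect against arguments for which log() is undefined.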
Exercise 7: Use the bisection method to find a root of the equation:

$$x^3 - 3x - 1 = 0$$

The range of the search and the precision should be given by the user. Bisection works by iteratively dividing the range into halves until the requested precision is achieved. Consider the function from the equation above (Figure 1.3): it has three roots, but we will be searching for the one that lies between $a = -1$ and $b = 1$. We start by calculating $f(a)$ and $f(b)$. If $f(a)$ and $f(b)$ have different signs, there has to be a root between $a$ and $b$. Now we divide the range $[a, b]$ into halves, $[a, x_1]$ and $[x_1, b]$, and calculate the value of $f(x_1)$. Comparing the signs of $f(a)$, $f(x_1)$ and $f(b)$, we find that the root is in the $[a, x_1]$ range, because $f(a)$ and $f(x_1)$ have different signs. Therefore, we take it as the new range and divide it into halves again. Now we calculate $f(x_2)$ and compare its sign with the signs at the end-points of the current range: the root lies in the half whose end-points give function values of different signs. We continue this procedure until the length of the range becomes smaller than the precision requested by the user. At that point, we can say that we have found the root with the requested precision. Note that you do not need to keep all the arguments and values in memory; at a single step of this procedure you need only six variables: the end-points of the range and the function values at the end-points, the middle-point and the function value at the middle-point.

1.11 More complex types – lists

Variables that can store just a single value are not handy enough. Soon you will want to store a larger (possibly unspecified) number of values. In languages like C, for example, you would use arrays. An array is a space in memory that can store a certain number of values, all of which must have the same type. Arrays can be static, i.e.
present as long as the program or function is executed and having a well-defined size, or dynamic, i.e. allocated when they are needed and freed afterwards. The size of a dynamic array can be changed. In Python, we will start using arrays only when we reach the NumPy module. The standard Python language does not have arrays, but it has a similar concept: lists. There are differences, though. Lists are objects and, besides the values, they also have methods associated with them. Lists can contain elements of different types. Lists are dynamic: elements can be added or removed and the size of the list changes accordingly. Here are a few examples of how to create a list:

Code 26

    a1 = [ 3, 6, 9 ]
    a2 = [ "python", 3.14, 0 ]
    a3 = [ 4, [ 5, 6, 7 ], 8 ]
    a4 = [ a1, a2 ]
    a5 = [ ]

Lists are delimited with brackets and, as you can see in the case of a2, they can contain elements of different types: strings, floats, integers etc. Lists can be nested, i.e. lists can also contain lists (a3). Obviously, you do not need to specify the values explicitly; you can use variables, as in a4. Lists can be empty (a5).

Figure 1.3: The bisection method. Description in the text (Exercise 7).

When you want to retrieve or change an element of a list, you have to use the index of the element. Elements of a list are indexed starting from zero. So, list a1 above has three elements with indices 0, 1 and 2. The code below shows how indexing works:

Code 27

    x = [ 3, 6, 9, 12, 15 ]
    print "x[1] =", x[1]
    print "x[-1] =", x[-1]
    print "x[1:3] =", x[1:3]
    print "x[:3] =", x[:3]
    print "x[2:] =", x[2:]

Here is the output:

    x[1] = 6
    x[-1] = 15
    x[1:3] = [6, 9]
    x[:3] = [3, 6, 9]
    x[2:] = [9, 12, 15]

Since the indexing starts from 0, x[1] refers to the number 6.
Indices can also be negative; x[-1] means the last element, x[-2] is the one before last and so on. Indices can also refer to ranges. If you specify the first and last index separated by a colon, e.g. x[1:3], you will retrieve a "sub-list", i.e. a list containing a part of the original list. However, the indexing in this case is a little bit tricky. Note that x[1:3] returns only [6, 9], i.e. the elements with indices 1 and 2. The element at the end index (here 3) is never included. Note also that the indices of the range can be omitted: x[:3] means "from the beginning up to, but not including, element 3" and x[2:] means "from element 2 until the end". It may seem that x[:] means exactly the same as x, but it does not. Look at the example below:

Code 28

    x = [ 3, 6, 9, 12, 15 ]
    y = x
    z = x[:]
    x[1] = -1
    print "x =", x
    print "y =", y
    print "z =", z

Output:

    x = [3, -1, 9, 12, 15]
    y = [3, -1, 9, 12, 15]
    z = [3, 6, 9, 12, 15]

In this example we create list x and then derive y and z from it. After that, we change one element of x. As you can see, list y has also changed, but z has not. This is because y = x is like giving another name (an alias) to an existing object; it does not create a new list. z = x[:], on the other hand, copies all elements from x to a new list z. Although it may be confusing, such behaviour is useful when you have to deal with a large amount of data. It allows you to save time and memory; you just have to remember that x and y refer to the same object.

Nested lists can be used to store objects like multi-dimensional arrays or matrices. For example, the matrix A:

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$$

can be handled in the following way:

Code 29

    A = [ [ 1, 2, 3 ],
          [ 4, 5, 6 ],
          [ 7, 8, 9 ] ]
    print "A[1][2] =", A[1][2]

Since we have a "list in a list" (a two-dimensional array), we need two indices, A[1][2]: the first one (1) refers to a row and the second (2) refers to a column.
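Nested lists and copying with [:] interact in a way that is easy to miss: the slice copies only the outer list, so the rows are still shared with the original. The sketch below illustrates this and, as one possible remedy, uses deepcopy() from the standard copy module. The names row, column, shallow and deep are chosen only for this illustration, and the print calls are written so that they run under both Python 2 and Python 3.

```python
from copy import deepcopy

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

row = A[1]                              # the second row is simply the inner list
column = [A[0][2], A[1][2], A[2][2]]    # the third column has to be collected

shallow = A[:]        # new outer list, but the rows are shared with A
deep = deepcopy(A)    # new outer list and new rows

A[0][0] = -1          # change one element of the original matrix

print("A[0][0]       = %d" % A[0][0])
print("shallow[0][0] = %d" % shallow[0][0])
print("deep[0][0]    = %d" % deep[0][0])
```

Only the deep copy keeps the original value 1; the shallow copy follows the change, because its first row is the same object as the first row of A.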
Python has a special function, range(a, b, c), that creates lists of integer numbers, starting from a, up to b (excluding b itself), with a step of c. Arguments a and c are optional; if they are not supplied, the defaults are used (0 and 1, respectively). Here is an example:

Code 30

    x = range(5)
    y = range(5, 10)
    z = range(3, 10, 2)
    print "x = ", x
    print "y = ", y
    print "z = ", z

Output:

    x =  [0, 1, 2, 3, 4]
    y =  [5, 6, 7, 8, 9]
    z =  [3, 5, 7, 9]

As mentioned at the beginning of this section, lists are objects and have certain methods associated with them. These methods are used to modify the lists, e.g. to add new elements. Methods are similar to functions, but they are specific to the object and are invoked in a special way. For example, the method append() adds a new element at the end of the list:

Code 31

    x = [ ]
    x.append(3)
    x.append(6)
    print "x = ", x

Output:

    x =  [3, 6]

As you can see, there is the name of the object (x), a dot and the name of the method with its arguments in parentheses. Some methods also return a value. For example, the method count() returns the number of occurrences of the specified element in the list. In such a case, you would usually like to do something with the returned value, e.g. assign it to a variable and print it:

Code 32

    x = [ 1, 2, 3, 4, 1, 2, 3, 1, 2, 1 ]
    ones = x.count(1)
    twos = x.count(2)
    print "There are", ones, "ones and", twos, "twos."

Output:

    There are 4 ones and 3 twos.

Lists have more methods and each one has a short description embedded in its __doc__ attribute. You can access it through any instance of a list:

    Python 2.6.4 (r264:75706, Dec 7 2009, 23:19:43)
    [GCC 4.3.4] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> dir([])
    ['__add__', '__class__', '__contains__', '__delattr__', '__delitem__',
    '__delslice__', '__doc__', '__eq__', '__format__', '__ge__',
    '__getattribute__', '__getitem__', '__getslice__', '__gt__', '__hash__',
    '__iadd__', '__imul__', '__init__', '__iter__', '__le__', '__len__',
    '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
    '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__',
    '__setslice__', '__sizeof__', '__str__', '__subclasshook__', 'append',
    'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
    >>> print [].sort.__doc__
    L.sort(cmp=None, key=None, reverse=False) -- stable sort *IN PLACE*;
    cmp(x, y) -> -1, 0, 1
    >>>

There are also two useful functions that take lists as arguments: sum() calculates the sum of the elements and len() returns the number of elements. Here is an example, a one-line snippet from a script that computes the average of the elements in the list x:

Code 33

    average = float(sum(x)) / len(x)

The conversion with float() avoids integer division when x contains only integers.

Exercise 8: Use interactive Python to learn what the following methods do: extend, index, insert, pop, remove, reverse and sort.

Exercise 9: Write a script that selectively lists files: the script should display a list of Python scripts in the current directory, i.e. only those files whose names end with the ".py" extension. The list should be sorted alphabetically. You will need the function listdir() from the module os. This function returns a list of file names in the directory given as the argument, e.g. files = listdir('/dev') will produce a list called files, containing all file names from the directory /dev. To list the current directory, you may simply use the dot character, as in the shell: listdir(".")

1.12 Count-controlled loops

Most languages have two kinds of loops. One of them is condition-controlled, i.e. executed as long as a condition is satisfied. Another kind of loop is count-controlled.
This type of loop is executed a certain number of times or, in the case of Python, once for each element of a certain list. The general rule of thumb for choosing the right type of loop is that you should use the count-controlled loop whenever it is easy to predict how many times it has to be executed. Here is an example of a count-controlled loop:

Code 34

    shop = [ "apples", "eggs", "ham", "milk", "potatoes" ]
    for item in shop:
        print "The shop has", item

This loop picks subsequent elements from the list shop, assigns them to item and, for each value of item, executes the block of code that follows.

    The shop has apples
    The shop has eggs
    The shop has ham
    The shop has milk
    The shop has potatoes

In the next example we will use numerical integration to compute:

$$I = \int_0^{\pi} \sin x \, dx$$

The definite integral is equal to the area under the function's plot within the range of integration (here, $[0, \pi]$). Figure 1.4 shows that we can approximate the area with a set of $n$ rectangles, each $\delta x$ wide and $h_i$ high:

Figure 1.4: Numerical integration.

$$I \approx S = \sum_{i=0}^{n-1} \delta x \cdot h_i$$

When $\delta x \to 0$, the sum $S \to I$. The height of the rectangle, $h_i$, is simply equal to $f(x_i) = \sin x_i$, where $x_i = \delta x \, (i + \frac{1}{2})$. We add $\frac{1}{2}$ because the height is measured in the middle of the interval. In the following code, we split the range $[0, \pi]$ into 100 intervals, compute the integral and compare it with the exact value, which equals $\cos 0 - \cos \pi = 2$.
Code 35

    from math import sin, cos, pi

    intervals = 100
    dx = pi/intervals
    integral = 0.0
    for i in range(intervals):
        xn = dx * (i + 0.5)
        h = sin(xn)
        rect = dx * h
        integral += rect
    print "Numerical value:", integral
    print "Exact value:", cos(0) - cos(pi)

In certain cases, you may want to terminate the loop before the conditional expression becomes False. This can be done with the break instruction. Imagine you are writing a program that calculates the average of the numbers entered by the user and you do not know how many numbers the user will enter. The program could be as follows:

Code 36

    s = 0.0   # The sum
    n = 0     # Number of elements
    while True:
        x = input("Enter a number (99 to finish): ")
        if x == 99:
            break
        s += x
        n += 1
    if n > 0:
        print "The average is", s/n
    else:
        print "No numbers have been entered."

We use an expression that is always true, therefore the loop is endless. However, if the user enters the number 99, the break instruction will cause the program to exit the loop immediately. Remember that if loops are nested (one loop inside another), the break statement works only for one loop. In the fragment below, the break instruction makes the script jump out of the inner loop, but the outer loop continues:

Code 37

    while x > 0:        # Outer loop
        while y > 0:    # Inner loop
            if z == 0:
                break

If you do not want to exit the loop, but just skip the rest of the block and proceed to the next iteration, you may use the continue instruction. break and continue work for both types of loops, count-controlled and condition-controlled.

Exercise 10: Use Monte Carlo integration to calculate the same integral as in the example above. Hint: Monte Carlo integration works like this:

1. Define the boundaries $[x_a, x_b]$ as equal to the integration range and the boundaries $[y_a, y_b]$ as equal to the minimum and maximum of the function in that range. Define a counter p.
2. Generate two random numbers, x and y, within the respective boundaries.
3. Check if the point (x, y) is below the function, i.e. if f(x) >= y; increment the counter p if true.
4. Repeat steps 2 and 3 many times (n).
5. Compute the ratio r of the number of points that have fallen under the graph (p) to the total number of steps (n).
6. The integral is equal to the ratio r times the area of the rectangle defined by the boundaries, i.e. $(x_b - x_a) \cdot (y_b - y_a)$.

You may also need the function random() from the module random.

1.13 Pretty output

Imagine you want to print a table of values of the sine function in the range [−90, 90], every 30 degrees. Here is a script that does it:

Code 38

    #!/usr/bin/python
    from math import sin, pi

    for x in range(-90, 91, 30):
        xx = x / 180.0 * pi
        print x, sin(xx)

However, the output does not look very pretty:

    -90 -1.0
    -60 -0.866025403784
    -30 -0.5
    0 0.0
    30 0.5
    60 0.866025403784
    90 1.0

What we need is formatted output. Compare the previous example with the next one:

Code 39

    #!/usr/bin/python
    from math import sin, pi

    for x in range(-90, 91, 30):
        xx = x / 180.0 * pi
        print "%+3d % 6.3f" % (x, sin(xx))

In this example the columns are aligned, the angle is always printed with its sign and the sine value has three decimal digits:

    -90 -1.000
    -60 -0.866
    -30 -0.500
     +0  0.000
    +30  0.500
    +60  0.866
    +90  1.000

Each of the lines above is formatted according to a general specification that is common to several programming languages. This specification always starts with the percent character %, followed by other symbols that define the formatting and a letter that specifies what type of value is expected, e.g.: s – string, d – integer number, f – floating point number, e – floating point number in exponential (scientific) notation (e.g. 1e-10). Consult Table 1.7 for details.

Table 1.7: String formatting symbols. The symbols are always used between % and the letter defining value type (s, d, f, e, etc.).

    Symbol     Meaning                                                  Example
    %          Percent character                                        %%
    number     Width of the field in characters                         %6d
    .number    Number of decimal digits                                 %.3f
    +          Always print the sign of the number                      %+6.3f
    space      Print minus for negative numbers and space for others    % 6.3f
    0          Fill the field with leading zeros                        %06d
    -          Left-align the value                                     %-10s

The symbols can be combined, but they always occupy specified positions, i.e. the percent comes first, then + or space, then zero, the width, a dot, the number of decimal digits and finally a letter. For example, %+010.4f will print a floating point number in a 10-character field, aligned to the right-hand side, left-padded with zeros, with four decimal digits and the sign.

Exercise 11: Modify the program from Exercise 10 so that it displays intermediate results in a table. The table should contain the number of steps done, the number of points found under the graph and the current estimate of the integral. The table should not be too long; it should have 10–20 entries. To do so, you may, for example, print an entry every f steps, where f = N/20 and N is the total number of steps.

     Steps   Hits    Integral
    ----------------------------
      5000   3191   2.005e+00
     10000   6347   1.994e+00
     15000   9579   2.006e+00
     20000  12802   2.011e+00
     25000  16007   2.011e+00
     30000  19137   2.004e+00
     35000  22361   2.007e+00
     40000  25589   2.010e+00
     45000  28750   2.007e+00
     50000  31978   2.009e+00
    Final result: 2.00923699753
    Expected result: 2.0

1.14 Tuples

In the following sections, we compare four complex types of Python: strings, lists, tuples and dictionaries.
All of them are similar in the sense that all can be indexed (by a key in the case of dictionaries) and all loosely resemble a table. The tuple is the simplest type in this group. It behaves like a list, except that it is static (immutable). That means you cannot modify the elements of a tuple, delete items from it or add new items. Thanks to that, tuples are somewhat more efficient than lists, so prefer them for collections that never change. Tuples are written in parentheses. They also do not have as many methods as lists. Besides that, they behave like lists:

    Python 2.6.4 (r264:75706, Mar 17 2010, 10:33:29)
    [GCC 4.3.4] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> t = ('a', 'b', 'c', 'd', 'a', 'b', 'c', 'a', 'b', 'a')
    >>> print t.count('b')
    3
    >>>

1.15 Strings and methods

Strings are composed exclusively of characters, but they are not limited to ASCII characters: you can use, for example, Unicode to encode your national characters. If you do so, you have to declare the encoding at the top of your script, as in the example below:

Code 40

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    print "Żóltko"

Strings can be placed in apostrophes or quotation marks. There is also a special syntax, where strings are enclosed in triple apostrophes or triple quotation marks. In that case, the string can span several lines:

Code 41

    from sys import argv

    version = 1   # example value

    help_text = """
    Usage: %s -f FILENAME -v X Y Z

    -f FILENAME - name of the file to read
    -v X Y Z    - components of the vector

    Version: %d
    """
    print help_text % (argv[0], version)

Some characters have to be specified in a special way, either because they have a certain function in Python (e.g. the apostrophe) or because they are not printable. For instance, if you want to print a tabulator or a new-line character, use \t and \n. These are so-called "escape sequences".
They start with the \ character and they are interpreted as special characters. Since quotation marks and apostrophes are used to open and close strings, the delimiter that opens a string cannot be used directly inside it and has to be "escaped". Examples:

Code 42

    print "Quotation mark (\") must be escaped here,"
    print 'but not here ("), because this string is in apostrophes'
    print "Two empty lines after this one:\n\n"
    print "\tTabulation before the text"
    print "The backslash (\\) must be escaped too."
    print "In a format string the percent character is doubled: %d%%" % 100

The percent character has to be doubled (%%) only in strings that are used with the formatting operator %.

Strings have their own set of methods that facilitate the processing of text. Those that need special attention are split() and the is*() family. A string can be converted into a list by splitting it at all occurrences of a selected character. This could be useful for reading a CSV file [6], for example. Imagine you have a line that contains time, temperature and volume separated by commas:

Code 43

    text = "24.0,298.15,1000.0"
    record = text.split(',')
    print "Time:", float(record[0])
    print "Temperature:", float(record[1])
    print "Volume:", float(record[2])

In this script, the text is split at each comma and converted into a list called record. Since there are two commas, the list will have three elements. After the conversion to a list, the elements are still strings; they are not converted to numbers automatically (i.e. record is a list of strings). If you need them to be numbers, you have to use type casting, as in the example above (the float() function).

[6] CSV: comma-separated values, a text file where each row represents a record of data and the fields (or values) are separated by commas; in countries where the comma is used for separating fractional and integer parts of numbers (like in Slavic or Latin countries), the semicolon is used instead.
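The same splitting and casting can be applied to several records in a row. The following sketch is an illustration only: the two text lines are made-up records (time, temperature, volume) and the names lines, table, fields and values are hypothetical. The print calls are written so that the snippet runs under both Python 2 and Python 3.

```python
# Hypothetical records: time, temperature and volume separated by commas
lines = ["24.0,298.15,1000.0",
         "48.0,299.05,1002.5"]

table = []                          # one list of numbers per record
for line in lines:
    fields = line.split(',')        # still a list of strings
    values = []
    for field in fields:
        values.append(float(field)) # cast each field to a number
    table.append(values)

print("Read %d records with %d fields each" % (len(table), len(table[0])))
print("Temperature in the second record: %.2f" % table[1][1])
```

The result is a nested list, so table[1][1] refers to the second field of the second record.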
To spare yourself typing, you can use the map() function to convert the elements of the list:

Code 44

    text = "24.0,298.15,1000.0"
    record = map(float, text.split(','))
    print "Time:", record[0]
    print "Temperature:", record[1]
    print "Volume:", record[2]

The map() function takes two arguments. The first one must be the name of a function and the second one must be a sequence. A sequence is a type that can be indexed, i.e. a string, tuple or list. The function is applied to every element of the sequence and a new list is built from the results. Another example:

    >>> l = map(float, "12345")
    >>> print l
    [1.0, 2.0, 3.0, 4.0, 5.0]
    >>>

The join() method of strings plays an (almost) opposite role to the split() method. It generates a string (let us say S) by joining the elements of a sequence L, with the string sep inserted between them:

    >>> T = "1,2,3,4,5"
    >>> sep = ","
    >>> L = T.split(sep)
    >>> print L
    ['1', '2', '3', '4', '5']
    >>> S = sep.join(L)
    >>> print S
    1,2,3,4,5
    >>>

The initial string T is divided at each occurrence of the comma character and the chunks are collected in the list L. To recover the original string, the join() method is used: it has to be applied to the separator string (sep), which is inserted between the elements of the sequence given as the argument of the join() method.

Strings in Python have a family of methods whose names start with is. These methods all return a boolean value (True or False), depending on whether the string fulfils a certain condition. For instance, the method islower() checks if all the characters are lower case; the method isdigit() checks if all the characters are digits. What is important is that all the characters have to match the condition, not just some of them.

Besides several methods specific to strings, you can also apply operators to them. If you look at the
If you look at the Project co-financed from the EU European Social Fund Python programming for bioinformatics students 41 content of the string object: >>> dir(’’) [’ add ’, ’ class ’, ’ contains ’, ’ delattr ’, ’ doc ’, ’ eq ’, ’ format ’, ’ ge ’, ’ getattribute ’, ’ getitem ’, ’ getnewargs ’, ’ getslice ’, ’ gt ’, ’ hash ’, ’ init ’, ’ le ’, ’ len ’, ’ lt ’, ’ mod ’, ’ mul ’, ’ ne ’, ’ new ’, ’ reduce ’, ’ reduce ex ’, ’ repr ’, ’ rmod ’, ’ rmul ’, ’ setattr ’, ’ sizeof ’, ’ str ’, ’ subclasshook ’, ’ formatter field name split’, ’ formatter parser’, ’capitalize’, ’center’, ’count’, ’decode’, ’encode’, ’endswith’, ’expandtabs’, ’find’, ’format’, ’index’, ’isalnum’, ’isalpha’, ’isdigit’, ’islower’, ’isspace’, ’istitle’, ’isupper’, ’join’, ’ljust’, ’lower’, ’lstrip’, ’partition’, ’replace’, ’rfind’, ’rindex’, ’rjust’, ’rpartition’, ’rsplit’, ’rstrip’, ’split’, ’splitlines’, ’startswith’, ’strip’, ’swapcase’, ’title’, ’translate’, ’upper’, ’zfill’] >>> you will note methods such as __add__, for example. Python uses this underscore notation, for methods which are in fact standard operators. In this case, the presence of __add__ means that you can apply the addition operator (+) to strings; __mul__ means that you can multiply strings (although only by integer numbers); __eq__ means that the comparison operator (=) has also been implemented and so on: >>> a = "AAA" >>> b = "BBB" >>> c = a + b >>> print c AAABBB >>> d = a * 4 >>> print d AAAAAAAAAAAA >>> print a == b False >>> Exercise 12: Review the strings’ methods, in particular: capitalize, center, count, find, index, lstrip, rstrip, strip, replace, startswith, endswith, title, and upper. Exercise 13: Write a script that asks the user for his/her name and displays it (1) in capital letters, (2) starting from capital letter followed by small letters, (3) reversing the order of the names, (4) spelled backwards, and (5) spread. For example: What’s your name? BoRYs KrzySZtOF SzefCZYK 1. 
BORYS KRZYSZTOF SZEFCZYK Project co-financed from the EU European Social Fund Borys Szefczyk 42 2. 3. 4. 5. Borys Krzysztof Szefczyk SzefCZYK KrzySZtOF BoRYs KYZCfezS FOtZSyzrK sYRoB B o R Y s K r z y S Z t O F S z e f C Z Y K Exercise 14: Write a script that converts integer numbers into text in the following way: 123 → "one two three" Use the join() method 1.16 Dictionaries Dictionary is a table similar to lists and tuples, but instead of numerical indices, keys are used. A key can be almost any Python object, e.g. string, number, tuple etc. Using dictionaries usually makes scripts easier to understand, because we do not need to remember what the indices mean. For instance, imagine a script that deals with protein; at some point we have to count how many residues of each type are in the protein. The result can be conveniently stored in a dictionary: Code 45 residues = { "ALA" : 21, "GLY" : 14, "PRO" : 3, "CYS" : 2 } Here, we use the residue names as keys that correspond to occurrences (values). The syntax of a dictionary is following: the pairs key-value are separated with commas and each pair is separated by a colon, key : value. It is possible to create an empty dictionary, add new pairs and modify existing values: Code 46 # Create an empty dictionary residues = { } # Add new pair (residue counter) residues["ALA"] = 0 # Change the value residues["ALA"] = 5 # Increment the existing counter residues["ALA"] += 1 Two important caveats to bear in mind: the order of the pairs in a dictionary is not preserved, so that the first added pair will not necessarily remain first. Therefore, the only way to retrieve a pair from the dictionary is by using the key. Or using simpler words: Python is “allowed” to rearrange the pairs in a dictionary. The second caveat is that referencing to a non-existing key constitutes an error. Therefore, dictionaries have the method has_key() that can be used to check if the key exists. 
You should use it before referring to a key, unless you can be sure that the key exists. This is how it is typically done:

Code 47

    residues = { "ALA" : 20, "GLY" : 15 }
    if residues.has_key("ALA"):
        print "ALA:", residues["ALA"]
    else:
        print "ALA: no such residue"

Other useful methods include keys() and values(), which return lists of keys and values, respectively:

    >>> residues = { "ALA" : 20, "GLY" : 15, "CYS" : 5 }
    >>> print residues.keys()
    ['CYS', 'GLY', 'ALA']
    >>> print residues.values()
    [5, 15, 20]
    >>>

The keys() method is useful when we want to iterate over the dictionary in a loop:

Code 48

    residues = { 'ALA' : 20, 'GLY' : 15, 'TYR' : 23, 'PRO' : 2 }
    for k in residues.keys():
        print "%3s = %d" % (k, residues[k])

Exercise 15: Write a script that reads text entered by the user and counts the occurrences of each character (case-insensitive). Use a dictionary to store the results. The output should look as follows:

    ~$ python foo.py
    Text: Programming in Python is cool!
    Character Count
              4
    !         1
    a         1
    c         1
    g         2
    h         1
    (and so on)

1.17 Passing arguments to the script

Programs can receive data not only interactively, but also from the environment and the command line. Passing arguments through the command line is very common in UNIX and especially useful when programs are used in batch mode [7]. In Python, all the arguments passed to the script are placed in the list called argv. This list has to be imported from the sys module:

Code 49

    from sys import argv

    print "Program name is", argv[0]
    print "%d additional arguments have been passed." % (len(argv) - 1)

The first element (index 0) is always the name of the script, so the length of the list is at least 1. The arguments are always passed as strings; therefore, if you want to pass numbers, you have to convert them afterwards, using functions such as int() or float().
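The conversion of arguments from strings to numbers can be sketched as follows. To keep the example independent of how the script is started, the list arguments is a made-up stand-in for argv (the script name sum.py and the three values are hypothetical). The print call is written so that the snippet runs under both Python 2 and Python 3.

```python
# Hypothetical stand-in for sys.argv: the script name followed by three arguments
arguments = ["sum.py", "1.5", "2", "3.25"]

total = 0.0
for text in arguments[1:]:      # skip index 0, which holds the script name
    total += float(text)        # arguments arrive as strings, so cast them

print("%d numbers, sum = %.2f" % (len(arguments) - 1, total))
```

In a real script, the first line would be replaced by from sys import argv and the loop would run over argv[1:].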
1.18 Advanced command line options

Linux programs, and those from the GNU family (see footnote 8) in particular, have a common way of handling command-line options and arguments. By argument we mean a string (like a file name) that is passed to the program through the command line. Options are like switches: they change the default behaviour of the program and often have parameters of their own. An example:

    ~$ python -BEi -m numpy -v tutorial.py

The program python has been run with one argument (tutorial.py) and five options: B, E, i, m and v. Options are always prefixed with a hyphen, but can be specified either all together (-BEi) or one by one (-B -E -i). Here, the option -m has a parameter, numpy. Some programs recognize both short and long options; for example, the two lines below should have exactly the same effect:

    ~$ mysql --html --user=pybib --password
    ~$ mysql -H -u pybib -p

7 batch mode: in contrast to the interactive mode, the program is not started directly by the user, but run from a script or by the system; in this mode, the program usually reads its input from files, "behind the back" of the user
8 http://www.gnu.org

This way of handling arguments and options gives a lot of freedom to the user and a lot of trouble to the programmers, who have to parse correctly what the user has typed in. Fortunately, there is the getopt library and a Python interface to it. The getopt library parses the command line, divides it into arguments and options and returns them in separate Python lists. Parameters are also handled. Let us try an example: we are writing a file converter, so any time the script is run, we need exactly two file names.
Additionally, we have the -e option (with a parameter) to choose the encoding, the -v option for verbose mode and the -h option to display help:

Code 50
    import getopt
    from sys import argv

    shortop = "vhe:"
    longop = ["verbose", "help", "encoding="]

    opts, args = getopt.getopt(argv[1:], shortop, longop)

    if len(args) != 2:
        print "Exactly two file names must be specified!"

    print "opts = ", opts
    print "args = ", args

The strings shortop and longop define the permitted options. If an option has a parameter, this is indicated with a colon ':' in the short-option string and with an equality sign '=' in the long-option list. Running this script with arguments and options:

    ~$ ./options.py -v -e utf8 file1 file2

produces the output:

    opts = [('-v', ''), ('-e', 'utf8')]
    args = ['file1', 'file2']

Note that options are returned as tuples: the first element is the option as it was typed on the command line (for example '-v', or '--verbose' if the long form was used) and the second element is the parameter, or an empty string if the option has no parameter. Now, try and see what happens when you forget to specify a parameter after the option -e, or when you use by mistake an option which has not been specified in the script!

1.19 Working with files

The simplest way of reading and saving files has nothing to do with Python itself. You can use the UNIX mechanism of redirecting input and output to read and save data. Each program in UNIX has three streams associated with it: standard input, standard output and standard error. They are treated in a similar way to files, but usually they are attached to the keyboard (input) and the screen (output and error). If you want to change the default behaviour, that is, to redirect the streams, you can use the characters > and <. Instead of manually typing all the input, we can read it from a file:

    ~$ ./script.py < input.txt

For this to work, the script has to use the input() or raw_input() functions, just as when reading data from the keyboard.
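Redirected input can also be read through a stream object instead of repeated raw_input() calls. The following is a minimal sketch and not part of the original scripts: the function name sum_numbers is hypothetical, and an in-memory StringIO object stands in for the redirected file so that the example can be run without any input file. As an assumption beyond the Python 2 used in this text, print is used as a function so that the sketch also runs under Python 3.

```python
from __future__ import print_function
import io

def sum_numbers(stream):
    """Add up the numbers found in a text stream, one number per line."""
    total = 0.0
    for line in stream:
        line = line.strip()
        if line:                 # skip empty lines
            total += float(line)
    return total

# In a script run as  ./script.py < input.txt  one would call
# sum_numbers(sys.stdin); here a stand-in stream is used instead.
fake_input = io.StringIO(u"1.5\n2.5\n\n4\n")
print(sum_numbers(fake_input))
```

Because the function only iterates over the stream it is given, the same code works for the keyboard, a redirected file or an open file object.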
To save the text printed by the script on the screen, you just have to redirect the output to a file:

    ~$ ./script.py > output.txt

In this case, whatever would appear on the screen goes to the file instead (there will be no output on the screen). The sign > redirects only the standard output and not the standard error, so if there is an error message, it will still appear on the screen. Let us try this approach to make an XYZ file with coordinates. We will write a script that creates a lattice of metallic gold. The lattice is cubic, with atoms 2.88 Å apart. We will ask the user to tell us the number of atoms in the x, y and z directions:

Code 51
    lattice_constant = 2.88
    nx = input("Number of atoms (x): ")
    ny = input("Number of atoms (y): ")
    nz = input("Number of atoms (z): ")

The header of the file must contain the number of atoms and a comment:

Code 52
    # Total number of atoms
    n = nx * ny * nz
    print n
    # Comment
    print "Lattice of gold %d x %d x %d" % (nx, ny, nz)

The body of the file contains a single line for each atom; each line contains the atom name and the coordinates. Since the lattice is cubic, we use the same lattice constant for all dimensions.
There are three nested loops: the first one makes flat slices of atoms along the x dimension; the second loop makes rows of atoms in each slice (y dimension); the third, innermost loop makes the atoms in a single row (z):

Code 53
    # Loop over x, y and z
    for i in range(nx):
        x = i * lattice_constant
        for j in range(ny):
            y = j * lattice_constant
            for k in range(nz):
                z = k * lattice_constant
                print "Au %12.6f %12.6f %12.6f" % (x, y, z)

The complete script is as follows:

Code 54
    lattice_constant = 2.88
    nx = input("Number of atoms (x): ")
    ny = input("Number of atoms (y): ")
    nz = input("Number of atoms (z): ")

    # Total number of atoms
    n = nx * ny * nz
    print n
    # Comment
    print "Lattice of gold %d x %d x %d" % (nx, ny, nz)

    # Loop over x, y and z
    for i in range(nx):
        x = i * lattice_constant
        for j in range(ny):
            y = j * lattice_constant
            for k in range(nz):
                z = k * lattice_constant
                print "Au %12.6f %12.6f %12.6f" % (x, y, z)

However, if we redirect the output from our script to a file:

    ~$ ./gold.py > gold.xyz

the prompt for the number of atoms will also be printed to the file. This is undesirable, since the user will not see the prompt and it will "contaminate" the file. We can circumvent the problem by reading the number of atoms from the command line. We will change the initial part of the script to read the variables nx, ny and nz from the command line:

Code 55
    from sys import argv

    lattice_constant = 2.88
    nx = int(argv[1])
    ny = int(argv[2])
    nz = int(argv[3])

Now we can run the script with arguments and safely redirect the output to a file:

    ~$ ./gold.py 3 3 4 > gold.xyz

Exercise 16: Write a script to build the lattice of caesium iodide (CsI). Lattice parameters can be found e.g. on Wikipedia (http://en.wikipedia.org/wiki/Caesium_iodide).

However, if you would like to work with many files, or use files and work with the program interactively at the same time, you should use the Python mechanism of handling files.
The first important thing to realize is that the file is just another object in the script and is represented by a variable. The variable is not a file name, although we need the file name to find the file. To read or write a file, it has to be opened, and it has to be opened in the right mode: 'r' for reading or 'w' for writing. Reading is the default mode, so the mode symbol can be skipped:

Code 56
    # Both lines are correct and both do the same:
    input = open('data.txt', 'r')
    input = open('data.txt')

Writing mode has to be specified explicitly:

Code 57
    output = open('data.txt', 'w')

When the file is open for reading and the file object has been created, you can use three methods of reading data. Before we progress further, imagine that we have a file called data.txt and that it contains three lines (instead of using your imagination, you can actually create the file and try out what follows):

    123
    456
    789

Imagine also that the file has a pointer, an "arrow" indicating the current position in the file. If the file has been opened for reading, the pointer indicates the first byte (character) of the file. The first method to read data is read(); it treats the file like a single string and accepts a single argument: the number of bytes (or characters) to read. So, the following code:

Code 58
    input = open('data.txt')
    data = input.read(5)
    print data

will read five characters from the file (new-line characters also count) and the output will be:

    123
    4

Five characters have been read: 1, 2, 3, new-line and 4. The pointer has also been advanced past the fifth character and the subsequent reading operation will start from there:

Code 59
    data = input.read(15)
    print data

    56
    789

This time we have requested more bytes than are left in the file, so the read() method returns all the remaining characters.
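The two read() calls above can be tried out without preparing data.txt by hand. The sketch below is only an illustration: it creates the three-line file in a temporary directory (an assumption made so that no existing file is overwritten), and the function name read_in_two_steps is hypothetical. The print function is used so that the sketch also runs under Python 3.

```python
from __future__ import print_function
import os
import tempfile

def read_in_two_steps(path):
    """Return the results of read(5) followed by read(15) on the same file."""
    handle = open(path)
    try:
        first = handle.read(5)    # the pointer moves past 5 characters
        rest = handle.read(15)    # continues where the first call stopped
    finally:
        handle.close()
    return first, rest

# Create the example file in a temporary directory
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.txt")
out = open(path, "w")
out.write("123\n456\n789\n")
out.close()

first, rest = read_in_two_steps(path)
# repr() makes the new-line characters visible
print(repr(first))
print(repr(rest))
```

Printing with repr() shows that the first call returned '123\n4' and the second call returned the remaining '56\n789\n', even though 15 characters were requested.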
We can also read the whole file at once, by omitting the argument of the read() method:

Code 60
    data = input.read()

The second method, readline(), is a little bit more "intelligent"; it reads a single line from the file or, more precisely, it reads characters from the current position in the file up to the nearest new-line character. The new-line character is also read:

Code 61
    input = open('data.txt')
    data = input.readline()
    print data

    123

The third method, readlines(), is the most sophisticated: it reads the whole file, but the text is already split into lines and returned as a list:

Code 62
    input = open('data.txt')
    data = input.readlines()
    print data

    ['123\n', '456\n', '789\n']

If you want to rewind the file, to read the data multiple times, you can use the seek(offset) method, where offset is given in bytes, counting from the beginning of the file:

Code 63
    # Rewind to the beginning
    file.seek(0)
    # Move to offset 10 (i.e. skip the first 10 bytes)
    file.seek(10)

Writing to a file can be performed in three ways. The write() method simply writes the string given as its argument. A new-line character is not added automatically, so you have to take care of it:

Code 64
    text = "Line\n"
    output.write(text)

The method writelines() is the counterpart of readlines(); it saves lines stored in a list:

Code 65
    lines = [ "First\n", "Second\n", "Third\n" ]
    output.writelines(lines)

The third way of writing files is by using the print statement.
In this case the file object has to be given after the >> signs:

Code 66
    print >>file, "Script output:"
    print >>file, "%d %d %d" % (x, y, z)
    print >>file, text

To see these commands in "action", we will do a step-by-step analysis of a script that converts a molecule from the PDB format to the XYZ format:

Code 67
    #!/usr/bin/python
    from sys import argv

    pdb_name = argv[1]
    xyz_name = argv[2]

    pdb = open(pdb_name)
    pdb_data = []
    for line in pdb.readlines():
        if line.startswith('ATOM '):
            atom = {}
            atom['symbol'] = line[13]
            atom['x'] = float(line[30:38])
            atom['y'] = float(line[38:46])
            atom['z'] = float(line[46:54])
            pdb_data.append(atom)
    pdb.close()
    n_atoms = len(pdb_data)

    xyz = open(xyz_name, 'w')
    print >>xyz, n_atoms
    print >>xyz
    for a in pdb_data:
        print >>xyz, "%1s % 8.3f % 8.3f % 8.3f" % \
            (a['symbol'], a['x'], a['y'], a['z'])
    xyz.close()

First, let us look at an example PDB file of a methanol molecule:

TITLE
HEADER
ATOM      1  C   MOH A   1       0.000   0.000   0.000  1.00  0.00           C
ATOM      2  H   MOH A   1       0.000   0.000   1.089  1.00  0.00           H
ATOM      3  H   MOH A   1       1.027   0.000  -0.363  1.00  0.00           H
ATOM      4  H   MOH A   1      -0.513  -0.889  -0.363  1.00  0.00           H
ATOM      5  O   MOH A   1      -0.660   1.143  -0.467  1.00  0.00           O
ATOM      6  H   MOH A   1      -0.660   1.143  -1.414  1.00  0.00           H
END

PDB files contain much more information than XYZ files. We only need to extract the atom symbols and the coordinates. Our script expects to find the name of an existing PDB file and the name of a not yet existing XYZ file on the command line, therefore the file names are read from argv:

Code 68
    #!/usr/bin/python
    from sys import argv

    pdb_name = argv[1]
    xyz_name = argv[2]

Next we open the PDB file for reading and create the file object pdb:

Code 69
    pdb = open(pdb_name)

The coordinates and symbols will be stored in the list pdb_data. We have to create the list (empty) in order to add the atoms one by one.
We don't know yet how many atoms are in the file, but we don't need this information at this point:

Code 70
    pdb_data = []

Now, we read the PDB file using the readlines() method. This method creates a list of lines from the file. We don't store this list explicitly (we don't name it); instead it is used directly in the loop:

Code 71
    for line in pdb.readlines():

Each line from the list will be assigned in turn to the variable line. Since not all the lines in the file contain atoms (a typical PDB file also contains other information), we must filter the lines. Here we use the startswith() method to find out whether the line contains a description of an atom. Be aware that in PDB files there are also hetero atoms and the identifier of their records is "HETATM".

Code 72
        if line.startswith('ATOM '):

The first line of the XYZ file contains the number of atoms. We will collect the atoms in pdb_data and the length of this list will tell us the number of atoms. Each element of this list will correspond to a single atom. A "single atom" in this context means: the symbol and the x-, y- and z-coordinates. We could keep these four values in a tuple, list or dictionary; the choice is more a matter of taste. Here we decide to use dictionaries. For each atom we must initialize an empty dictionary (remember, we are still inside the loop):

Code 73
            atom = {}

Next, we add the four components to the dictionary. Perhaps the most obvious way of treating the data would be to use the split() method to separate the fields in the line, but this is the wrong approach for PDB files. The fields in a line of a PDB file have a fixed length and position. Besides, if a number is large, it might occupy the whole field and there would be no space between the fields. Therefore, we should refer to the PDB file format description [3], where we find Table 1.8.
From this table we learn, for example, that the x-coordinate should be found in columns 31-38. In Python we index from 0 and the end of a slice is exclusive, so these columns correspond to characters 30 to 37, i.e. the slice line[30:38].

Code 74
            atom['symbol'] = line[13]
            atom['x'] = float(line[30:38])
            atom['y'] = float(line[38:46])
            atom['z'] = float(line[46:54])
            pdb_data.append(atom)

Table 1.8: PDB file format (excerpt).

    Columns   Type         Definition
    1-6                    The string "ATOM "
    7-11      Int          Atom serial number
    13-16     String       Atom name
    17        Character    Alternate location indicator
    18-20     String       Residue name
    22        Character    Chain identifier
    23-26     Int          Residue sequence number
    27        Character    Code for insertion of residues
    31-38     Float(8.3)   Coordinates for X in Angstroms
    39-46     Float(8.3)   Coordinates for Y in Angstroms
    47-54     Float(8.3)   Coordinates for Z in Angstroms
    55-60     Float(6.2)   Occupancy
    61-66     Float(6.2)   Temperature factor
    77-78     String       Element symbol, right-justified
    79-80     String       Charge on the atom

Above, we made one important simplification. We assume that all atom symbols are one-letter and we just read the single character at index 13 (column 14). In fact, most atom names are strings of 1-4 characters. It is difficult to program this part of the script in a universal way because, for example, we would never know whether the name CA refers to a calcium atom or to the alpha carbon. The original file format was designed for proteins, but nowadays it is used for inorganic molecules too. Now, we can close the file and we can also count how many atoms are in the molecule:

Code 75
    pdb.close()
    n_atoms = len(pdb_data)

All the data are ready and it is time to save them to the file. First the number of atoms, then a comment (an empty line here) and finally the coordinates: that is the format of the XYZ file.
For writing the coordinates, we use a loop again:

Code 76
    xyz = open(xyz_name, 'w')
    print >>xyz, n_atoms
    print >>xyz
    for a in pdb_data:
        print >>xyz, "%1s % 8.3f % 8.3f % 8.3f" % \
            (a['symbol'], a['x'], a['y'], a['z'])
    xyz.close()

Exercise 17: Certain QM programs output the structure in atomic units (Bohr), but most visualisation programs expect the coordinates to be in Angstroms. Write a conversion script. It should read an XYZ file in Bohr and save an XYZ file in Angstroms. 1 a.u. = 0.529177 Å.

Exercise 18: Write a script to translate the coordinates of a molecule by a given vector. The script should read the name of an XYZ file and the coordinates of the vector from the command line.

Exercise 19: Write a script that calculates certain properties of the structure from an XYZ file: (i) the geometrical center, (ii) the maximum dimension of the molecule and (iii) the size along the x-, y- and z-axis. The geometrical center (x_c, y_c, z_c) is defined as the average of each coordinate q = x, y, z (N is the number of atoms):

\[ q_c = \frac{1}{N} \sum_{i=1}^{N} q_i \]

The maximum dimension is the maximum distance between two atoms in the molecule (those which are farthest apart). The size along an axis is the difference between the maximum and minimum coordinate along this particular axis. The three sizes (x, y, z) are like the boundaries of the molecule.

Exercise 20: Molecular dynamics simulations of liquids usually start with a pre-generated box of molecules. Write a script to generate such a box. Given an XYZ file and the number of molecules in the x, y and z directions, the script should make a box of molecules by copying and translating the atoms in all directions. Hint 1: have a look at the script from Exercise 18 and the script building the lattice of gold (Code 54); you will need a triple-nested loop to build the box. Hint 2: the molecules should not overlap, but the lattice should not be too sparse.
The best you can do is to calculate the size of the molecule in the x-, y- and z-direction (see Exercise 19), increase it by 1 or 2 Angstroms and use the result as the translation vector. Hint 3: a 'nice and clean' script should first center the original molecule at (0, 0, 0); again, look at the previous exercises for hints.

Now that we know how to read and write files, we will have a look at other streams. Files are only a special case of more general objects, streams. Some of them have already been discussed: standard input (stdin), standard output (stdout) and standard error (stderr) are streams. It is customary to send all error messages to the standard error, so that when the output is redirected to a file, the error messages still go to the screen. For example:

Code 77
    from sys import stderr, exit

    a = 1
    b = 0

    # Normal output:
    print "a = %d, b = %d" % (a, b)

    if b == 0:
        # Error -- message goes to stderr:
        print >>stderr, "Error, b == 0! Exiting."
        exit(1)

The story of writing and reading streams will continue in the next section, because the same mechanism is used by the popen*() functions.

Exercise 21: A relaxed PES scan is performed by stepping one or more variables (coordinates of the molecule) and performing a geometry minimization for all the others. In the Gaussian program, this task is done by using the modredundant keyword. Write a script that extracts a multi-frame XYZ file with the optimized geometries of the molecule. You have to pick from the output file only those structures which are final, optimized geometries and skip those which are intermediate geometry optimization steps. Use the output file supplied by the teacher. Hint: search for the phrase "Stationary point found."

1.20 Launching external programs

One of the most common purposes of writing scripts is to automate the process of running programs. This goal can be achieved in several ways, the simplest being the system() function.
This function, found in the os module, runs the command given as its argument. We can, for example, display the name of the machine on which the script is running, using the hostname command (it is a shell command):

    >>> import os
    >>> code = os.system("hostname")
    swift

The problem, however, is that the system() function does not allow us to 'feed' the external program with data or to receive/intercept its output. The name of the host has appeared on the screen, because the hostname command has displayed it, but we have no means to store it in a variable, for example. Therefore, much more useful is the function popen2(), also from the os module. This function runs the command given as its argument and returns two open streams: the standard input and the standard output of the command. They work like UNIX pipes and can be used to interact with the program while it is running. Here is an example: Gaussian can read its commands from the standard input and write its output to the standard output. We can write scripts to automate calculations in the following way:

Code 78
    from os import popen2

    input_text = """%mem=50mb
    #HF/STO-3G sp

    single point, HF

    0 1
    H 0.0 0.0 0.0
    F 1.0 0.0 0.0

    """

    inp, out = popen2("g03")
    print >>inp, input_text
    inp.close()
    output_text = out.read()
    out.close()

In this example, we define the input for Gaussian (variable input_text) [4], we launch the program using the function popen2() and feed it with the data by writing to the stream inp. Finally, we read the output from the stream out. Note that we have to close the streams: some programs may not start working until they reach the end of their input, and the end of input is signalled when the stream is closed. Obviously, the example above does not automate anything. However, imagine we want to study the influence of the implicit solvent model on the bond vibration in the HF molecule.
We are going to perform 16 PCM calculations, changing the dielectric constant from 5 to 80 and monitoring how the frequency of the bond vibration changes [5]. Gaussian does not permit doing a 'scan' of the dielectric constant: we have to prepare 16 separate jobs or write a script that will do that for us. The corresponding input line would be:

    #HF/STO-3G sp scrf=(pcm,read)

We also have to specify the dielectric constant (e.g. 80) at the end of the input file:

    EPS=80

The script that does the job is the following:

Code 79
    #!/usr/bin/python
    from os import popen2

    input_text = """%%mem=50mb
    #HF/STO-3G opt freq scrf=(pcm,read)

    single point, HF

    0 1
    H 0.0 0.0 0.0
    F 0.9 0.0 0.0

    EPS=%d

    """

    print "D.const Frequency"
    print "------------------"
    for diel in range(5, 81, 5):
        inp, out = popen2("g03")
        print >>inp, input_text % diel
        inp.close()
        output_text = out.readlines()
        out.close()
        for line in output_text:
            if line.count("Frequencies --"):
                freq = float(line[15:26])
                print "%7d % 9.2f" % (diel, freq)

This time, the input_text has %d instead of the dielectric constant value.
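Two details of this script are easy to miss: the percent sign in %mem is doubled because the text is now used as a format string, and range(5, 81, 5) stops before 81, so it yields exactly 16 values. A minimal sketch follows; the shortened template is only an illustration and not a complete Gaussian input, and print is used as a function so that the sketch also runs under Python 3.

```python
from __future__ import print_function

# Shortened stand-in for input_text; "%%" produces a literal "%",
# while "%d" is replaced by the number given after the % operator.
template = """%%mem=50mb
EPS=%d
"""

filled = template % 80
print(filled)

# Dielectric constants used in the loop: 5, 10, ..., 80
constants = list(range(5, 81, 5))
print(len(constants), constants[0], constants[-1])
```

With a single percent sign in %mem, the % operator would try to interpret %m as a conversion and fail, which is why the doubled form is needed here but not in Code 78.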
In the loop, we assign successive values of the dielectric constant to diel and insert the value into the text that is sent to the input stream:

Code 80
        print >>inp, input_text % diel

After that, we collect the output from Gaussian (output_text) and search it for interesting properties: here, the frequency of the bond vibration, which can be found in the line containing the string "Frequencies --":

Code 81
            if line.count("Frequencies --"):
                freq = float(line[15:26])
                print "%7d % 9.2f" % (diel, freq)

The output of the script:

    D.const Frequency
    ------------------
          5   4429.97
         10   4418.40
         15   4413.97
         20   4411.63
         25   4410.18
         30   4409.19
         35   4408.48
         40   4407.94
         45   4407.52
         50   4407.18
         55   4406.90
         60   4406.67
         65   4406.47
         70   4406.30
         75   4406.15
         80   4406.02

You should be aware of the changes that are taking place in Python: version 2.4 introduced a new module, subprocess, which is meant to replace the functions discussed in this section. The popen2() function is deprecated as of version 2.6 and is no longer available in Python 3, but in Python 2 you can still use it.

Exercise 22: Write a script to calculate the energies (use the Hartree-Fock method and the STO-3G basis set) for a series of geometries. The geometries shall be passed to the script in a single, multi-frame XYZ file. A sample file will be supplied by the teacher. The script should extract from the output only the SCF energies.

1.21 Functions

Writing your own functions has two main advantages: you can program the procedures that you use most often and then reuse them just by typing the name of the function and the arguments; in more complex programs, you can organize the data flow by making blocks (functions) that perform separate tasks. For example, if you write a script that reads in an XYZ file, then converts the coordinates (e.g.
translates them) and then saves them to a new file, you could do it in three steps, writing the functions read_xyz, convert_coord and write_xyz. Figure 1.5 shows a similar idea of a data flow in a script. Functions, like in mathematics, have arguments and return values. This is because they are supposed to convert one kind of data into another. We will start this section with a function that adds two vectors. First we have to agree how the data will be represented. We assume that the vectors are three-dimensional and are represented by tuples; for example:

Code 82
    vector_a = ( 0.5, 1.3, -0.5 )
    vector_b = ( -1.0, 1.0, 1.0 )

The function should accept two arguments and return a single value, all of which are tuples of three float numbers:

Code 83
    def add_vectors(a, b):
        c = (a[0]+b[0], a[1]+b[1], a[2]+b[2])
        return c

The definition of the function starts with the keyword def, followed by the function's name and the arguments in parentheses. The keyword return has two roles: it indicates the value that should be returned and at the same time causes the script to leave the function. The return statement is not obligatory; if the function has no return statement, it will always return the None value. The definition of the function must always precede its invocation:

Code 84
    def add_vectors(a, b):
        c = (a[0]+b[0], a[1]+b[1], a[2]+b[2])
        return c

    vector_a = ( 0.5, 1.3, -0.5 )
    vector_b = ( -1.0, 1.0, 1.0 )
    print "The sum is", add_vectors(vector_a, vector_b)

Figure 1.5: Using functions to organize the data flow in the script (input data, function read_data, data processing functions, function write_data, output data).

It is also possible to write a function that has no arguments, for example, a function that returns a random vector of unit length:

Code 85
    import math
    import random

    def rand_normal():
        length = 0
        while length < 1e-10:
            x = random.random()
            y = random.random()
            z = random.random()
            length = math.sqrt(x**2 + y**2 + z**2)
        normal = (x/length, y/length, z/length)
        return normal

The loop with the condition length < 1e-10 was introduced to avoid vectors of zero length, which cannot be normalized. The arguments of a function may have default values; in such a case, the argument can be omitted and the default value will be used. For example, we will write a function that calculates a vector perpendicular to a surface defined by three points (A, B, C). This is very easy: we define two vectors:

\[ \vec{p} = \overrightarrow{AB}, \qquad \vec{q} = \overrightarrow{AC} \]

The cross product of these two vectors is perpendicular to both vectors and to the surface. Here is the function:

Code 86
    def normal(A, B, C):
        p = (B[0] - A[0], B[1] - A[1], B[2] - A[2])
        q = (C[0] - A[0], C[1] - A[1], C[2] - A[2])
        crossp = (p[1]*q[2] - p[2]*q[1], \
                  p[2]*q[0] - p[0]*q[2], \
                  p[0]*q[1] - p[1]*q[0])
        return crossp

However, usually the normal vector of a surface is normalized, so that its length is one. We may leave the choice (to normalize or not) to the user and add a fourth argument to the function. This fourth argument will be a boolean variable norm, with the default value False.
If the value is False, the vector will be left unchanged; if it is True, the vector will be normalized:

Code 87
    import math

    def normal(A, B, C, norm=False):
        p = (B[0] - A[0], B[1] - A[1], B[2] - A[2])
        q = (C[0] - A[0], C[1] - A[1], C[2] - A[2])
        crossp = (p[1]*q[2] - p[2]*q[1], \
                  p[2]*q[0] - p[0]*q[2], \
                  p[0]*q[1] - p[1]*q[0])
        if not norm:
            return crossp
        else:
            length = math.sqrt(crossp[0]**2 + crossp[1]**2 + crossp[2]**2)
            crossp = (crossp[0] / length, crossp[1] / length, crossp[2] / length)
            return crossp

Note that the return statement has been used twice in this function. If the variable norm is False (no normalization), the function exits and returns the unchanged vector. In the opposite case (the else branch) the normalization is performed before returning the value. Another important aspect of using functions is how the arguments are passed to the function and how they are used inside. In principle, functions have their own private variables, which are destroyed when the function is left. Consider this example:

Code 88
    a = 2
    b = 3
    c = 4
    d = 5

    def my_func(x, y):
        c = x * d
        b = 0
        y = 0
        return c

    result = my_func(a, b)
    print "a = %d, b = %d, c = %d, d = %d" % (a, b, c, d)
    print "result = %d" % result

If you execute this code, the result will be:

    a = 2, b = 3, c = 4, d = 5
    result = 10

In the first place, note that although the function uses the variables b and c (and assigns new values to them), the original values in the script are left unchanged. This is because b in the function and b in the script are not the same variable. The function uses its own name space. Secondly, notice that we have passed the variables a and b to the function as the arguments x and y. Although y has been assigned a new value in the function, the original variable b has not been changed.
This is because assigning a new value to y inside the function only rebinds the local name y; it does not touch the variable b in the script. For simple values such as numbers this behaves like 'passing by value' (in contrast to 'passing by reference'); be aware, however, that a mutable object such as a list can be modified in place by a function that receives it. Finally, note that the function uses the variable d, which has not been initialized in the function. This variable is taken from the main program and passed "under the table". This kind of behaviour, although possible, should be avoided, since the scripts become messy. Rather, you should explicitly declare that you are going to use global variables, using the keyword global:

Code 89
    def my_func(x, y):
        global b, d
        c = x * d
        b = 0
        y = 0
        return c

Note that with this declaration the assignment b = 0 now changes the variable b of the main script.

Exercise 23: Write functions to read and write XYZ files. Before you start writing, decide how the data will be represented.

Exercise 24: Rewrite the script from Exercise 18 using the functions written in Exercise 23.

1.22 Writing modules

It is very easy to write your own module: just put whatever instructions you want in a file with the extension .py, place it in the current directory and import it! For example, you can make your own set of molecular modelling tools, put them in a file called mm.py and then import it by using the instruction import mm. Usually, modules contain functions, classes or constants (variables). If you put "normal" code (i.e. not a definition) in the module, the code will be executed while the module is being imported. A sample module with two functions is shown below:

Code 90
    """
    Sample module for vector operations.
    Two functions are defined:
    norm - normalizes vector
    cross - computes cross product
    The general form of the vector is: tuple(x, y, z)
    """

    # We need sqrt, so we have to import math.
    # We can not expect that the user will do it for us :-)
    import math

    def norm(vec):
        """This function normalizes vector to unity."""
        l = math.sqrt(vec[0]**2 + vec[1]**2 + vec[2]**2)
        return (vec[0]/l, vec[1]/l, vec[2]/l)

    def cross(va, vb):
        """This function calculates the cross product."""
        x = va[1]*vb[2] - va[2]*vb[1]
        y = va[2]*vb[0] - va[0]*vb[2]
        z = va[0]*vb[1] - va[1]*vb[0]
        return (x, y, z)

    # This is a trick, to execute the code only when the file is being run
    # as a script, not a module
    if __name__ == '__main__':
        # Let's do some tests
        p = (0.5, -0.5, 1.1)
        q = (0.3, 0.4, -2.0)
        print norm(p)
        print cross(p, q)

In the example above, notice the text in the triple quotation marks: this is how the __doc__ entries are made. Also notice the conditional statement at the end that checks if __name__ is '__main__'. Thanks to this statement, the file can have a "double life": it can be used as a script or as a module. The trick is that if the code is run as a script, it becomes "the main code" and the variable called __name__ contains the string '__main__'. If the file is imported as a module, __name__ contains the name of the module. Now, observe how our module behaves:

    Python 2.6.4 (r264:75706, Mar 17 2010, 10:33:29)
    [GCC 4.3.4] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import vector
    >>> dir(vector)
    ['__builtins__', '__doc__', '__file__', '__name__', '__package__', 'cross', 'math', 'norm']
    >>> print vector.__doc__

    Sample module for vector operations.
    Two functions are defined:
    norm - normalizes vector
    cross - computes cross product
    The general form of the vector is: tuple(x, y, z)

    >>> print vector.cross.__doc__
    This function calculates the cross product.
    >>> print __name__
    __main__
    >>> print vector.__name__
    vector
    >>>

Exercise 25: Write a module with two functions, read_xyz and write_xyz, to perform the tasks of reading and writing XYZ files.
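The behaviour of __name__ can also be checked without an interpreter session. The sketch below is only an illustration under stated assumptions: it writes a tiny throw-away module (the name demo_vector and the function whoami are hypothetical) to a temporary directory and imports it from there. The print function is used so that the sketch also runs under Python 3.

```python
from __future__ import print_function
import os
import sys
import tempfile

# Source of a tiny module whose only function reports the module's __name__
source = "def whoami():\n    return __name__\n"

tmp_dir = tempfile.mkdtemp()
module_file = open(os.path.join(tmp_dir, "demo_vector.py"), "w")
module_file.write(source)
module_file.close()

# Make the temporary directory visible to the import mechanism
sys.path.insert(0, tmp_dir)
import demo_vector

# Inside an imported module, __name__ is the name of the module
print(demo_vector.whoami())
```

The function reports the module name rather than '__main__', which is exactly why the test code guarded by the __name__ check is skipped on import.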
1.23 Regular expressions (re)

Regular expressions are used to search and match text. They are composed of normal text and special characters, which have a more general meaning. For example, the dot (.) means any character, therefore the regular expression "a.t" will match, for example, the words "act", "ant", "art", but not "aunt", because the dot matches exactly one character. Regular expressions can be much more complex, for example [A-Za-z]*[_-]{1,3}\d+\..{3}$ reads: any letter from the ranges A-Z and a-z, repeated 0 or more times, then _ or - repeated 1 to 3 times, a digit repeated at least once, a dot, any three characters and the end of the string. This will match, for example, "Abc_34.xyz". The most common symbols are listed in Table 1.9; more information can be found in the literature [6].

Regular expressions in Python are handled by the re module. In order to use a regular expression in your script, you must first import the module and compile a regular expression object, e.g.:

Code 91

    import re
    regexp = re.compile("(\d+) basis functions")

Table 1.9: Selected regular expression atoms.

    Symbol   Meaning
    .        Single character (any)
    ^        Beginning of a line or string
    $        End of a line or string
    *        Zero or more repetitions of the previous character
    +        One or more repetitions of the previous character
    ?        Zero or one repetition of the previous character
    {m}      m repetitions of the previous character
    {m,n}    Between m and n repetitions of the previous character
    [ ]      Matches one of the characters specified in brackets; ranges are
             permitted, e.g. [a-z0-9]; to match a hyphen, it should be the
             last character specified, e.g. [A-Z_-]
    \        Allows special characters to be matched literally, e.g. \. \* \^
    |        Separator of alternative match strings, e.g. aa|bb
    \s       Whitespace character
    \d       Digit

Then, you can perform a search or a match. The difference is that a match requires the regular expression to match from the beginning of the string, whereas a search will match at any point of the string. For instance:

Code 92

    import re
    regexp = re.compile("(\d+) basis functions")
    test = "we have 305 basis functions"
    result1 = regexp.match(test)
    result2 = regexp.search(test)
    print "Result of 'match' is", result1
    print "Result of 'search' is", result2

    Result of 'match' is None
    Result of 'search' is <_sre.SRE_Match object at 0x7f4d2642a8a0>

The result of a search or match can be safely used in a conditional statement, although it is not a simple boolean value but an object. Remember that most objects in Python have a boolean value; in this case None works like False and a successful match/search object works like True:

Code 93

    result = regexp.search(test)
    if result:
        print "Successful match"

The whole point of using regular expressions is to match strings which are not always the same, but have certain common characteristics. Moreover, regular expressions can be used to extract fragments of those strings. If you enclose fragments of the regular expression in parentheses, the corresponding fragments of the matched string will form groups that can be retrieved from the match object. This is done using the method .groups(). Let us assume that the variable output contains text produced by our favourite program. Some of the lines contain numbers that we would like to extract; these lines start with the word 'Atom':

Code 94

    import re
    output = """Calculations finished.
    In file 1 found:
    Atom 3 at x = -1.2
    Atom 12 at x = .004
    Atom 21 at x = 10.1
    In file 2 found:
    Atom 5 at x = 4.5e+3
    Atom 14 at x = -4.2e-1
    Atom 101 at x = 0
    """
    regexp = re.compile("Atom \d+ at x = (-?\d*\.?\d*(e[+-]?\d+)?)")
    coord = []
    for line in output.split('\n'):
        result = regexp.match(line)
        if result:
            print result.groups()  # Just to see what the groups look like
            c = result.groups()[0]
            coord.append(float(c))
    print coord

First note how the text looks: the line starts with the word Atom, followed by an integer, then the text "at x =" and finally the value we want to extract. So the first part of the regular expression, which matches the right line, is "Atom \d+ at x =". The expression \d+ matches one or more digits. Then we have to match the value, but as you can see above, the value can take many forms: it might have a minus sign, it may or may not have an integral part, it may or may not have a decimal part and it may have an exponent! To match the minus sign or the lack of it, we use "-?"; to match the integral part we use "\d*"; then we look for the decimal point, which might not be there, "\.?"; then the decimal digits, again "\d*"; finally, the exponent, which may have a plus or minus sign, "(e[+-]?\d+)?". The complete expression is: (-?\d*\.?\d*(e[+-]?\d+)?). The outer parentheses are necessary to extract the whole number; the inner parentheses just make a group out of the exponent. The groups() method returns a tuple with the groups found while matching or searching for the regular expression.

The re module also has other methods which you may find useful in your scripts. See the full documentation on-line [7].

Exercise 26: Use the output file from Gaussian supplied by the teacher and write a script to extract the energies from the file.
It contains several single point calculations of the energy, using the Hartree-Fock and DFT methods with different functionals. The energies are in lines like:

    SCF Done: E(RHF) = -39.7034912248 A.U. after 8 cycles
    SCF Done: E(RPBE-PBE) = -39.9471405984 A.U. after 7 cycles

and so on. You should use a single regular expression to retrieve the energy and the name of the method. After that, the program should produce a CSV file (to be read into a spreadsheet) and a table on the screen:

    Model        Energy
    --------------------------------
    HF           -39.7034912248
    B3LYP        -40.0181905764
    X3LYP        -39.9896656215
    PBE-PBE      -39.9471405984
    PW91-PW91    -39.9866252189
    M06          -39.9651652848
    M06L         -39.9962589720

Chapter 2 Numerical applications

2.1 Basic operations on arrays

Python is commonly used in scientific applications and, since they often involve matrices and matrix operations, steps have been taken to facilitate these tasks. The numpy module [8] introduces a new type, array, and several routines to handle it. An array may hold elements of various types, not just numbers; however, all elements of a given array must be of the same type. Consider the following example:

Code 95

    import numpy
    A = numpy.array([1, 1.0, 1.0 + 0.0j])
    print A

    [ 1.+0.j  1.+0.j  1.+0.j]

In this example, you can see that the array was created from a list using the array() function; the elements of the list were of different types and were all converted to the type that is 'roomy' enough to store all of them. In this case it was the complex type (we can recognize that because each number has the imaginary part 0.j). Arrays can be flat, rectangular, cubic etc. They can also be "reshaped".
In fact, all matrices are stored flat and the information about the shape is stored separately, in a tuple, so it is easy to change:

Code 96

    A = numpy.array(range(27))
    print A
    A.shape = (3,9)
    print A
    A.shape = (3,3,3)
    print A

    [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26]
    [[ 0  1  2  3  4  5  6  7  8]
     [ 9 10 11 12 13 14 15 16 17]
     [18 19 20 21 22 23 24 25 26]]
    [[[ 0  1  2]
      [ 3  4  5]
      [ 6  7  8]]

     [[ 9 10 11]
      [12 13 14]
      [15 16 17]]

     [[18 19 20]
      [21 22 23]
      [24 25 26]]]

Besides the array() function, the module offers routines to create special types of matrices, like a matrix of ones, a matrix of zeros or the identity matrix:

Code 97

    print "Matrix of ones"
    O = numpy.ones((3,3))
    print O
    print "Matrix of zeros"
    Z = numpy.zeros((3,3))
    print Z
    print "Identity matrix"
    I = numpy.identity(3)
    print I

    Matrix of ones
    [[ 1.  1.  1.]
     [ 1.  1.  1.]
     [ 1.  1.  1.]]
    Matrix of zeros
    [[ 0.  0.  0.]
     [ 0.  0.  0.]
     [ 0.  0.  0.]]
    Identity matrix
    [[ 1.  0.  0.]
     [ 0.  1.  0.]
     [ 0.  0.  1.]]

A great advantage of the numpy module is that it permits mathematical operations on whole matrices, as if they were just numbers:

Code 98

    m = numpy.ones((3,3))
    print m
    n = m + 1
    print n
    n = m * 0.5
    print n
    n = numpy.sin(m * 0.5)
    print n

    [[ 1.  1.  1.]
     [ 1.  1.  1.]
     [ 1.  1.  1.]]
    [[ 2.  2.  2.]
     [ 2.  2.  2.]
     [ 2.  2.  2.]]
    [[ 0.5  0.5  0.5]
     [ 0.5  0.5  0.5]
     [ 0.5  0.5  0.5]]
    [[ 0.47942554  0.47942554  0.47942554]
     [ 0.47942554  0.47942554  0.47942554]
     [ 0.47942554  0.47942554  0.47942554]]

You can also perform matrix operations in a mathematical sense, e.g.
add them:

Code 99

    m = numpy.array(range(9))
    m.shape = (3,3)
    n = numpy.array(range(8,-1,-1))
    n.shape = (3,3)
    print "m =", m
    print "n =", n
    print "m + n =", m+n

    m = [[0 1 2]
     [3 4 5]
     [6 7 8]]
    n = [[8 7 6]
     [5 4 3]
     [2 1 0]]
    m + n = [[8 8 8]
     [8 8 8]
     [8 8 8]]

However, notice that multiplication with the * operator is performed element by element:

    C(m, n) = A(m, n) * B(m, n)

To calculate the 'real' matrix product, you should use the dot() function:

Code 100

    m = numpy.array(range(1,10))
    m.shape = (3,3)
    n = 1./m
    print "m =", m
    print "n =", n
    print "m * n =", m*n
    print "dot(m,n) =", numpy.dot(m,n)

    m = [[1 2 3]
     [4 5 6]
     [7 8 9]]
    n = [[ 1.          0.5         0.33333333]
     [ 0.25        0.2         0.16666667]
     [ 0.14285714  0.125       0.11111111]]
    m * n = [[ 1.  1.  1.]
     [ 1.  1.  1.]
     [ 1.  1.  1.]]
    dot(m,n) = [[  1.92857143   1.275        1.        ]
     [  6.10714286   3.75         2.83333333]
     [ 10.28571429   6.225        4.66666667]]

The numpy module is also useful for simple statistics. For example, suppose we have a file with many data points, one per line. We are going to read in the file and compute the number of data points, the sum, the mean and the standard deviation:

Code 101

    data = numpy.array(map(float, open("numpy.dat").readlines()))
    print "Number of data points:", len(data)
    print "Sum:", numpy.sum(data)
    print "Mean value:", numpy.mean(data)
    print "Standard deviation:", numpy.std(data)

The next important issue is the indexing of elements in a matrix, but before we proceed to this problem, we will construct a model matrix, a "times table". It will contain the products of the numbers at the beginning of each row and column.
The left-most column and the upper-most row will contain the numbers from 1 to 9:

Code 102

    vec = numpy.arange(1, 10)
    print "vec =", vec
    mat = numpy.multiply.outer(vec, vec)
    print "mat =", mat

    vec = [1 2 3 4 5 6 7 8 9]
    mat = [[ 1  2  3  4  5  6  7  8  9]
     [ 2  4  6  8 10 12 14 16 18]
     [ 3  6  9 12 15 18 21 24 27]
     [ 4  8 12 16 20 24 28 32 36]
     [ 5 10 15 20 25 30 35 40 45]
     [ 6 12 18 24 30 36 42 48 54]
     [ 7 14 21 28 35 42 49 56 63]
     [ 8 16 24 32 40 48 56 64 72]
     [ 9 18 27 36 45 54 63 72 81]]

From every matrix, we can pick a number or a "sub-matrix" by specifying indices or ranges. Try out the following instructions and observe the result:

Code 103

    print mat[3:5, 3:7]
    print mat[1:4:2, 1:9:3]
    print mat[::2,::2]
    print mat[::-1,::-1]

Exercise 27: Use the least-squares method to fit experimental data (temperature vs. time) with a linear function, y = ax + b. The teacher will provide you with the data file. Also, calculate the correlation coefficient. In the following formulas, n is the number of data points and x, y are the data points.

$$\Delta = n \sum x^2 - \left(\sum x\right)^2$$

$$a = \frac{n \sum xy - \sum x \sum y}{\Delta}$$

$$b = \frac{\sum x^2 \sum y - \sum x \sum xy}{\Delta}$$

$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right] \cdot \left[n \sum y^2 - \left(\sum y\right)^2\right]}}$$

2.2 Using Gnuplot with numpy

We will use the program from Exercise 27 to see how data can be visualized directly from Python. We will use the Gnuplot [9] interface (module) to plot a graph. Alternatively, the matplotlib package can be used (see footnote 1).
The program in the exercise fits a linear function to experimental data and may look like this:

Code 104

    import numpy
    data = numpy.array([map(float, x.split()) for x in open('lsq.dat').readlines()])
    Sx = numpy.sum(data[:,0])
    Sy = numpy.sum(data[:,1])
    Sxx = numpy.sum(data[:,0]**2)
    Syy = numpy.sum(data[:,1]**2)
    Sxy = numpy.sum(data[:,0]*data[:,1])
    n = len(data[:,0])
    delta = n*Sxx - Sx**2
    a = (n*Sxy - Sx*Sy)/delta
    b = (Sxx*Sy - Sx*Sxy)/delta
    r = (n*Sxy - Sx*Sy)/numpy.sqrt((n*Sxx - Sx**2)*(n*Syy - Sy**2))

Still in the same script, we will add instructions to plot the points and the function, so that we can see how well the data is fitted. Note that we have two kinds of data to plot: discrete data in the form of points (x,y) and a continuous function. Both types of data can be plotted with Gnuplot:

Footnote 1: http://matplotlib.sourceforge.net

Code 105

    import Gnuplot
    gp = Gnuplot.Gnuplot(persist=1)
    gp.title("Least square fit")
    gp.xlabel("time [s]")
    gp.ylabel("temp [K]")
    gp('set pointsize 3')
    gp('set key right bottom')
    gp_data = Gnuplot.Data(data, title="r = %g" % r)
    gp_func = Gnuplot.Func("%f * x + %f" % (a, b), title="%g x %+g" % (a, b))
    gp.plot(gp_data, gp_func)
    gp('set terminal postscript enhanced color 20')
    gp.hardcopy("lsq.eps")

We will analyse the script line by line. First, the Gnuplot module is imported and a Gnuplot object is initialized. The persist option prevents the graph from being closed when our script terminates:

Code 106

    import Gnuplot
    gp = Gnuplot.Gnuplot(persist=1)

At this point, we also configure different aspects of the graph, like the title and the position of the key. Two kinds of syntax are used: some of the more common gnuplot commands are implemented as methods (e.g.
title(), xlabel() etc.), whereas any other command can be passed using the gp('command') syntax:

Code 107

    gp.title("Least square fit")
    gp.xlabel("time [s]")
    gp.ylabel("temp [K]")
    gp('set pointsize 3')
    gp('set key right bottom')

Next, we define the data and function objects and plot them. The titles of the data sets are also defined here.

Code 108

    gp_data = Gnuplot.Data(data, title="r = %g" % r)
    gp_func = Gnuplot.Func("%f * x + %f" % (a, b), title="%g x %+g" % (a, b))
    gp.plot(gp_data, gp_func)

We can also export the graph to a PostScript file (see footnote 2) [10]. If you want to change some of the parameters, you can use the gp('command') syntax again. Here, we cause the image to be printed in colour, using a 20pt font, in EPS format:

Code 109

    gp('set terminal postscript enhanced color 20')
    gp.hardcopy("lsq.eps")

Footnote 2: PostScript is a language developed by Adobe to represent vector graphics (i.e. graphics composed of primitives such as lines, circles etc.); PostScript files have the extensions .ps and .eps.

2.3 Linear algebra in Python

The sub-module linalg of the numpy module contains basic tools for linear algebra: functions to calculate determinants and inverse matrices, and to solve sets of linear equations and eigen-problems. We will use one of these functions to find the solution of the following set of linear equations:

$$\begin{cases}
9x_1 - 8x_2 + 7x_3 - 6x_4 = -11 \\
2x_1 + 3x_2 + 4x_3 + 5x_4 = 44 \\
2x_1 + 4x_2 + 6x_3 + 8x_4 = 66 \\
8x_1 - 6x_2 - 8x_3 + 2x_4 = -22
\end{cases}$$

This problem can be written in matrix form:

$$A \cdot X = Y$$

where A is a 4 × 4 coefficient matrix, Y is a column vector (4 × 1 matrix) of the free elements and X holds the solutions to our problem. Here is the script:

Code 110

    from numpy import array, dot, ravel
    from numpy.linalg import solve

    A = array([[ 9, -8,  7, -6],
               [ 2,  3,  4,  5],
               [ 2,  4,  6,  8],
               [ 8, -6, -8,  2]])
    Y = array([[-11],
               [ 44],
               [ 66],
               [-22]])
    X = solve(A, Y)
    # the ravel function is used with the sole purpose of "flattening"
    # the vector for a nice display
    print "Solution:", ravel(X)
    # If the solution is correct, this should be zero!
    print dot(A,X) - Y

In the second example, we will use the multivariate least-squares method to solve a linear regression problem. A review article by Hansch presents several QSAR studies of anti-HIV drugs [11]. Table 42 in this review shows the dependence of EC50 (log 1/C) on four properties called L_x, B1_x, I_Y and σ in a series of compounds. We will fit these data with a linear function:

$$\log 1/C = a_0 + a_1 L_x + a_2 B1_x + a_3 I_Y + a_4 \sigma$$

We are looking for the coefficient vector:

$$A = \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix}$$

First, we must write the problem in matrix form. Let X be the matrix of the four properties of our n = 16 compounds:

$$X = \begin{pmatrix}
x_{00} & x_{01} & x_{02} & x_{03} & x_{04} \\
x_{10} & x_{11} & x_{12} & x_{13} & x_{14} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
x_{n0} & x_{n1} & x_{n2} & x_{n3} & x_{n4}
\end{pmatrix}$$

where x_{00}, ..., x_{n0} = 1, because this column corresponds to the free element a_0. Let Y be the vector of the observed EC50 values:

$$Y = \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \end{pmatrix}$$

The problem can be written as:

$$(X' X) A = X' Y$$

where X′ denotes the transposed matrix. This time, the A matrix is unknown and the equation has to be solved to find it.
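To see what solving (X′X)A = X′Y involves, the sketch below builds the two products and solves the system in plain Python, without numpy. The data are made up (generated from y = 1 + 2 x1 - 3 x2, so the fit should recover coefficients close to 1, 2 and -3), and the helper names normal_equations and gauss_solve are hypothetical. It is an illustration under those assumptions, not the method used for the QSAR data.

```python
def normal_equations(X, Y):
    """Return the products X'X (matrix) and X'Y (vector) for rows X and observations Y."""
    n_par = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(n_par)]
           for i in range(n_par)]
    XtY = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(n_par)]
    return XtX, XtY


def gauss_solve(M, v):
    """Solve M*x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    # Augmented matrix [M | v], stored as floats
    aug = [[float(c) for c in M[i]] + [float(v[i])] for i in range(n)]
    for col in range(n):
        # Bring the row with the largest pivot to the top of the remaining block
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


# Made-up data: first column is the constant 1 (free element), then x1 and x2
X = [[1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0],
     [1.0, 2.0, 1.0]]
Y = [1.0 + 2.0 * row[1] - 3.0 * row[2] for row in X]

XtX, XtY = normal_equations(X, Y)
A = gauss_solve(XtX, XtY)
print([round(a, 6) for a in A])
```

With numpy, the same two steps collapse into numpy.dot calls followed by numpy.linalg.solve.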
Our data is stored in a file in the following format:

    # log(1/C)  const   Lx     B1x    Iy     sigma
      6.50      1.00    2.06   1.00   1.00    0.00
      8.26      1.00    2.87   1.52   1.00   -0.04
      6.28      1.00    4.11   1.52   1.00   -0.01
      5.98      1.00    3.82   1.95   1.00    0.44
      5.94      1.00    4.23   2.15   1.00    0.39
      5.32      1.00    2.65   1.35   1.00    0.52
      5.00      1.00    2.74   1.35   1.00    0.29
      4.15      1.00    3.98   1.35   1.00    0.27
      4.27      1.00    4.80   1.35   1.00    0.28
      4.22      1.00    2.06   1.00   0.00    0.00
      4.26      1.00    4.11   1.52   0.00   -0.01
      4.92      1.00    2.06   1.00   0.00    0.00
      4.07      1.00    4.11   1.52   0.00   -0.01
      4.01      1.00    2.06   1.00   0.00    0.00
      6.77      1.00    2.87   1.52   0.00   -0.04
      5.31      1.00    4.11   1.52   0.00   -0.01

The program is very simple, provided that we use the numpy module. First, we read the whole file into a matrix (data), then we extract the first column into the Y vector and the rest of the columns into the X matrix. Finally, we calculate the transposed matrix Xt and solve the equation:

Code 111

    #!/usr/bin/python
    import numpy
    data = numpy.array([map(float, x.split()) for x in open('LR').readlines()[1:]])
    Y = data[:,0]
    X = data[:,1:]
    Xt = numpy.transpose(X)
    A = numpy.linalg.solve(numpy.dot(Xt, X), numpy.dot(Xt, Y))
    print numpy.ravel(A)

The solution printed by the script:

    [ 3.06274963 -0.94451833  3.51948739  1.88218388 -5.10869993]

is consistent with Equation 58 of the review by Hansch.

2.4 Python for scientists

Although the scipy module [12] is a set of routines for scientific applications, we will start with an example from computer graphics. Namely, we will use the scipy module to manipulate a picture. A bitmap image is in fact a two-dimensional matrix: each element of this matrix corresponds to a single pixel of the image. The value of each element describes the colour of the pixel.
Therefore, to get the negative of the image, we just have to convert the picture to a matrix and then multiply each value by -1:

Code 112

    import scipy.misc
    raw = scipy.misc.imread('IMG_2028.png')
    raw *= -1
    scipy.misc.imsave('outfile.png', raw)

We have created our first graphical filter. You can see the effect in Figure 2.1-B. In fact, a colour image is not a two-dimensional but a three-dimensional matrix. This is because each pixel contains more than one value; in other words, the third dimension of the matrix describes the components of the colour of the pixel. There are different colour spaces, but in photography the most common is RGB, i.e. the colour is a mixture of red, green and blue. Imagine that our picture has a size of 600 × 400 pixels. Then the corresponding matrix will be 600 × 400 × 3; the third number means that we have three "slices" in the matrix, one for each of the colours: red, green and blue. Now, let's make another graphical filter: we will swap the colours so that whatever is red becomes green, green becomes blue and blue becomes red:

Code 113

    import numpy
    import scipy.misc
    raw = scipy.misc.imread('IMG_2028.png')
    raw2 = numpy.zeros(raw.shape)
    raw2[:,:,0] = raw[:,:,1]
    raw2[:,:,1] = raw[:,:,2]
    raw2[:,:,2] = raw[:,:,0]
    scipy.misc.imsave('outfile.png', raw2)

Figure 2.1: Image transformed using matrix operations: (A) original, (B) negative, (C) colors swapped, (D) modified FFT 'spectrum'.

Figure 2.1-C shows the result of the operation. In the third example, we will use a more advanced trick: we will convert the image into the reciprocal space with a 2-dimensional discrete inverse fast Fourier transform (2D iFFT). Then we will modify the 'spectrum' by erasing (setting to zero) the upper-left quarter of the reciprocal image.
Finally, we will use the 2-dimensional discrete FFT (2D FFT) to convert the image back to the real space. Note that we do this operation separately on each channel (colour):

Code 114

    import numpy
    import scipy.misc
    raw = scipy.misc.imread('IMG_2028.png')
    print "Doing inverse discrete 2D FFT..."
    iR = numpy.fft.irfft2(raw[:,:,0])
    iG = numpy.fft.irfft2(raw[:,:,1])
    iB = numpy.fft.irfft2(raw[:,:,2])
    w = iR.shape[1]/2
    h = iR.shape[0]/2
    iR[:h,:w] = 0
    iG[:h,:w] = 0
    iB[:h,:w] = 0
    print "Doing real discrete 2D FFT..."
    raw2 = numpy.zeros(raw.shape)
    raw2[:,:,0] = numpy.real(numpy.fft.rfft2(iR))
    raw2[:,:,1] = numpy.real(numpy.fft.rfft2(iG))
    raw2[:,:,2] = numpy.real(numpy.fft.rfft2(iB))
    print "Saving to outfile.png ..."
    scipy.misc.imsave('outfile.png', raw2)

The effect (Figure 2.1-D) makes the picture look old and warped. For more information on bitmap graphics, refer to the literature [13].

In the last example, we will use the numpy and scipy modules for error analysis. The situation is as follows: we have performed an MD simulation of a certain liquid and we have estimated the density. However, to do the job properly, we should also calculate the error of this estimate. Ideally, we should run several independent simulations, then calculate the average density from the set of simulations, the standard deviation and the error. However, if the run was long enough, we can also get a good estimate of the error by splitting the run into blocks and doing 'block averaging'.

Our input data is a file containing two columns: time step and density. We will read the data into a two-column matrix, then split the matrix into blocks of 200 ps. For each block we will calculate the average density. Then, we will calculate the standard deviation of these averages and estimate the error. We will use the Student's t-distribution and we will estimate the error at 95% confidence.
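The block-averaging steps described above can be sketched in plain Python on a made-up series. The numbers, the 50 ps time step and the helper functions mean and std are assumptions for illustration only; the Student's t-factor is left out, so the last value is the plain standard error of the block means, not the 95% confidence interval.

```python
import math


def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / float(len(values))


def std(values):
    """Population standard deviation (the convention numpy.std uses by default)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / float(len(values)))


# Made-up trajectory: time in ps (step of 50 ps) and density
times = [50.0 * i for i in range(13)]          # 0, 50, ..., 600 ps
densities = [0.98, 1.00, 1.02, 1.00,           # first 200 ps block
             1.01, 1.03, 1.05, 1.03,           # second block
             0.99, 1.01, 1.03, 1.01,           # third block
             1.00]                             # leftover point, not used

block = 200.0                                  # block length in ps

# Number of complete blocks and number of points per block
n_blocks = int((times[-1] - times[0]) / block)
time_step = times[1] - times[0]
block_size = int(round(block / time_step))

# Average the density within each block
block_means = []
for b_ind in range(n_blocks):
    begin = b_ind * block_size
    end = (b_ind + 1) * block_size
    block_means.append(mean(densities[begin:end]))

# Statistics over the block averages
overall = mean(block_means)
std_error = std(block_means) / math.sqrt(n_blocks)
print("blocks: %d, points per block: %d" % (n_blocks, block_size))
print("mean: %.4f, standard error: %.4f" % (overall, std_error))
```

Multiplying the standard error by the appropriate Student's t-factor turns it into the error at the chosen confidence level.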
We will use the following functions: the functions mean and std from the numpy module to calculate the mean value and the standard deviation; the ppf function from scipy.stats.t to get the Student's t-factor for a 95% confidence level and n_blocks samples.

Code 115

    from scipy import stats
    import numpy
    from sys import argv

    # Some constants
    confidence = 0.95
    block = 200  # [ps]

    # Get the data into a two-column matrix (timestep -> density)
    file = open(argv[1]).readlines()
    data = numpy.array([ map(float, l.split()) for l in file[1:] ])

    # Calculate the number of blocks
    min_time = data[0,0]
    max_time = data[-1,0]
    n_blocks = int((max_time - min_time) / block)

    # Calculate the number of steps in a single block
    # We can do that based on time step, because it is constant
    time_step = data[1,0] - data[0,0]
    block_size = int(round(block / time_step))

    # Do the block averaging
    averages = []
    for b_ind in range(n_blocks):
        begin = b_ind * block_size
        end = (b_ind + 1) * block_size
        block_mean = numpy.mean(data[begin:end,1])
        averages.append(block_mean)

    # Convert to a numpy array
    averages = numpy.array(averages)

    # Get the Student's t-factor
    one_sided = 0.5 + confidence/2.0
    t_crit = stats.t.ppf(one_sided, n_blocks)

    # Do the statistics
    mean = numpy.mean(averages)
    std_dev = numpy.std(averages)
    error = t_crit * std_dev/numpy.sqrt(n_blocks)
    print "Mean value and error: %.2f +/- %.2f" % (mean, error)

Chapter 3 Databases

In your Python scripts, you can manage and access databases. There are several wrappers that permit accessing databases in a transparent way; in the next section we will use the MySQLdb module [14] to write a bibliographic database, but first we have to learn how to work with MySQL and how to use the Structured Query Language (SQL).
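As a preview of what database access from a script looks like, here is a minimal sketch using the sqlite3 module from the standard library as a stand-in, because it needs no running server. It is not MySQLdb: the connection arguments and the parameter placeholder differ (sqlite3 uses ?, MySQLdb uses %s), but both follow the same Python DB-API pattern of connect, cursor, execute and fetch. The table demo and its columns are hypothetical.

```python
import sqlite3

# An in-memory database stands in for a connection to a database server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a small, hypothetical table and add two records.
cur.execute("CREATE TABLE demo (idx INTEGER PRIMARY KEY, label TEXT)")
cur.executemany("INSERT INTO demo (label) VALUES (?)",
                [("first",), ("second",)])
conn.commit()

# Query the table; each record comes back as a tuple.
cur.execute("SELECT idx, label FROM demo ORDER BY idx")
rows = cur.fetchall()
for row in rows:
    print(row)

conn.close()
```

The SQL statements themselves are what the following sections introduce, using the MySQL command-line client.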
3.1 Administration

The MySQL database works in a client-server fashion, meaning that there is one or more databases on the server, managed by the server program, which can be accessed locally and remotely (from other computers). Typically, the database and user accounts will be created by the system administrator and, unless that is you, you do not have to worry about it. If you would like to try it on your own computer and step into the administrator's shoes, here is the recipe (you will find more information on the Internet [15]). Assuming that the MySQL server is already running, you must create a database and an account for the purpose of our exercise. Start the MySQL interface as administrator (usually root):

    ~$ mysql -u root -p
    Enter password:
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 1
    Server version: 5.0.90-log Gentoo Linux mysql-5.0.90-r2

    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

    mysql>

Then, create a new database. Here, we also use the show databases command to list all existing databases:

    mysql> create database bibliography;
    Query OK, 1 row affected (0.00 sec)

    mysql> show databases;
    +--------------------+
    | Database           |
    +--------------------+
    | information_schema |
    | bibliography       |
    | mysql              |
    | test               |
    +--------------------+
    4 rows in set (0.00 sec)

    mysql>

Note that all SQL commands must end with a semicolon. After creating the database, switch to the mysql database. This is the place where all credentials are stored.
Solely for your interest, you may list the tables with the show tables command:

    mysql> use mysql
    Reading table information for completion of table and column names
    You can turn off this feature to get a quicker start-up with -A

    Database changed
    mysql> show tables;
    +---------------------------+
    | Tables_in_mysql           |
    +---------------------------+
    | columns_priv              |
    | db                        |
    | func                      |
    | host                      |
    | proc                      |
    | procs_priv                |
    | tables_priv               |
    | user                      |
    +---------------------------+
    8 rows in set (0.00 sec)

    mysql>

Now, add a new user called pybib with the password bookWORM (or another of your choice). Here, we grant all privileges (like creating and deleting tables) to this user and limit the access to the local computer. Nowadays, most applications use a web interface (in many cases written in Python), so the database can be reached from the Internet. Even so, you should still restrict the access to the local machine. This is because the database server is not accessed remotely, but from the web server running on the same computer, therefore only local access is needed.

    mysql> create user 'pybib'@'localhost' identified by 'bookWORM';
    Query OK, 0 rows affected (0.00 sec)

    mysql> grant all privileges on bibliography.* to 'pybib'@'localhost';
    Query OK, 0 rows affected (0.00 sec)

    mysql> flush privileges;
    Query OK, 0 rows affected (0.00 sec)

    mysql>

The flush privileges command is necessary for the changes to take effect immediately. Now, let us see the information stored about our new user. This information is stored in the table user. Each database table has records (rows) and columns. Each column has a label and a definition of the type of information that is stored in the column.
You can list the columns using the describe command (in this example some of the columns have been omitted):

    mysql> describe user;
    +-----------------+------------------+------+-----+---------+-------+
    | Field           | Type             | Null | Key | Default | Extra |
    +-----------------+------------------+------+-----+---------+-------+
    | Host            | char(60)         | NO   | PRI |         |       |
    | User            | char(16)         | NO   | PRI |         |       |
    | Password        | char(41)         | NO   |     |         |       |
    | Select_priv     | enum('N','Y')    | NO   |     | N       |       |
    | Insert_priv     | enum('N','Y')    | NO   |     | N       |       |
    | Update_priv     | enum('N','Y')    | NO   |     | N       |       |
    | Delete_priv     | enum('N','Y')    | NO   |     | N       |       |
    | Create_priv     | enum('N','Y')    | NO   |     | N       |       |
    | Drop_priv       | enum('N','Y')    | NO   |     | N       |       |
    | ssl_cipher      | blob             | NO   |     | NULL    |       |
    | max_connections | int(11) unsigned | NO   |     | 0       |       |
    +-----------------+------------------+------+-----+---------+-------+
    11 rows in set (0.00 sec)

    mysql>

Now we will form our first query in order to retrieve information from the table. We will list the content of the columns Host, User and Password for all users on the server, and quit the program. You can see our newly created user in this table:

    mysql> select Host,User,Password from user;
    +-----------+-------+-------------------------------------------+
    | Host      | User  | Password                                  |
    +-----------+-------+-------------------------------------------+
    | localhost | root  | *2D691E2378921A44C977D6D896515AC6234A2B09 |
    | swift     | root  | *2D691E2378921A44C977D6D896515AC6234A2B09 |
    | 127.0.0.1 | root  | *2D691E2378921A44C977D6D896515AC6234A2B09 |
    | localhost | pybib | *202DCBC5DA0CF0272398688C93DA5DE9F3E38F23 |
    +-----------+-------+-------------------------------------------+
    7 rows in set (0.00 sec)

    mysql> quit
    Bye
    $

3.2 Structured Query Language

It is time to create some tables and learn the Structured Query Language (SQL), which is a common language for different database systems (e.g. MySQL, PostgreSQL etc.).
In our example, we will create a bibliographic database to store information about the publications of a certain group of people (say, a research team). The central table of this database will be called 'papers' and will store information like the name of the journal, volume, pages and year of an article. The first step should always be the design of the database: we have to decide how the data will be stored, so that it can be used effectively later. First of all, each entry (paper) can have one or more authors and each author may have several names; using a single column to store this information may not be sufficient and can make searching in the table more difficult and less effective. Secondly, this is a database of publications of authors from one institution, so we can expect that certain names will reappear many times. Therefore, it is convenient to use indices instead of names: we will give each author a unique index and make a separate table to bind the names to these IDs. In our sample database, we will store information about these four articles:

1. William L. Jorgensen, J. Phys. Chem. 90:1276-1284 (1986)
2. Wolfgang Damm, Antonio Frontera, Julian Tirado-Rives, William L. Jorgensen, J. Comp. Chem. 18:1955-1970 (1997)
3. David Kony, Wolfgang Damm, Serge Stoll, Wilfred F. van Gunsteren, J. Comp. Chem. 23:1416-1429 (2002)
4. William L. Jorgensen, David S. Maxwell, Julian Tirado-Rives, J. Am. Chem. Soc. 118:11225-11236 (1996)

    Papers
    Idx  Journal            Vol.  Pages        Year
    1    J. Phys. Chem.     90    1276-1284    1986
    2    J. Comp. Chem.     18    1955-1970    1997
    3    J. Comp. Chem.     23    1416-1429    2002
    4    J. Am. Chem. Soc.  118   11225-11236  1996

    Authors
    Paper  Author
    1      1
    2      2
    2      6
    2      7
    2      1
    3      3
    3      2
    3      5
    3      4
    4      1
    4      8
    4      7

    Names
    Names       Surnames       Idx
    William L.  Jorgensen      1
    Wolfgang    Damm           2
    David       Kony           3
    Wilfred F.  van Gunsteren  4
    Serge       Stoll          5
    Antonio     Frontera       6
    Julian      Tirado-Rives   7
    David S.    Maxwell        8

Figure 3.1: Structure of the database: binding authors and papers.

An example is shown in Figure 3.1. Note that the authors table is used to correlate the data in names and papers. In the table papers, each entry also has a unique index and we do not store the information about the authors there. This information is stored in the table authors, using the indices from names and papers.

3.3 Data types

Once we know what tables we are going to create and what their content will be, we can proceed to the next step. We have to choose the data type for each column. The data can be numerical or textual, it can be a time, a date etc. In the case of numerical and textual data we have to decide on the size (e.g. the maximum integer that can be stored in a particular column or the maximum length of a string). This is like choosing the right type for variables, but there are also new elements: in a table we can permit (or not) empty values, we can require the indices to be unique numbers and we can let them be incremented automatically. Table 3.1 shows which types have been chosen for the columns in our example.

All indices used in the database, as well as the 'volume' column, use integer values. The columns 'idx' in the tables names and papers have been declared as SERIAL; this is a shorthand for BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE. These keywords mean that the column will contain possibly large (BIGINT) non-negative (UNSIGNED) integers, that its value can not be empty (NOT NULL), that no two records in the table can share the same value (UNIQUE) and that the values will be assigned automatically if not provided (AUTO_INCREMENT). The two columns in the table authors contain the same values as the columns 'idx' in names and papers, however we do not use the SERIAL type there. This is because the values will not be UNIQUE (see the example in Figure 3.1 to understand why).
Instead, we specify the type as a non-negative (UNSIGNED), large integer (BIGINT) and we do not permit empty values (NOT NULL).

The basic type used to store textual data in MySQL is the CHAR type. A field of this type has a fixed length, which has to be specified; e.g. to create a column with a width of 16 characters, the type should be specified as CHAR(16). However, if you plan to store a large amount of text of variable length, it might be advisable to use the VARCHAR type. Fields of this type have a variable length, up to the specified limit, so the text always occupies the minimum space in the database. For example, in the papers table, we store the name of the journal in a text field of variable length, with a maximum length of 1000 characters, declared as VARCHAR(1000).

Table 3.1: Types used in the example.

    Table    Column    Type
    names    idx       SERIAL
             names     VARCHAR(200)
             surnames  VARCHAR(200)
    papers   idx       SERIAL
             journal   VARCHAR(1000)
             year      YEAR
             pages     CHAR(20)
             volume    INT UNSIGNED
    authors  paper     BIGINT UNSIGNED NOT NULL
             author    BIGINT UNSIGNED NOT NULL

Time and date have special types in MySQL. In this database, we use only one of them, namely the YEAR type. Obviously, we could use the integer type (INT); however, using the most appropriate type has its benefits: minimum space is used (one byte in the case of the YEAR type), MySQL verifies that the data are correct for the specified type and it automatically does conversions, e.g. 00 to 2000.

3.4 Creating tables

Now that the structure of the database is established and we have decided on the types, we can create the tables. Start the MySQL interface, select your database and use the CREATE command:

    ~$ mysql -u pybib -p
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 2
    Server version: 5.0.90-log Gentoo Linux mysql-5.0.90-r2

    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

    mysql> use bibliography
    Database changed
    mysql> CREATE TABLE names (names VARCHAR(200), surnames VARCHAR(200),
           idx SERIAL);
    Query OK, 0 rows affected (0.00 sec)

    mysql> CREATE TABLE authors ( paper BIGINT UNSIGNED NOT NULL,
           author BIGINT UNSIGNED NOT NULL );
    Query OK, 0 rows affected (0.00 sec)

    mysql> CREATE TABLE papers ( volume INT, journal VARCHAR(1000),
           pages CHAR(20), idx SERIAL, year YEAR );
    Query OK, 0 rows affected (0.00 sec)

    mysql>

At any point, you can use the DESCRIBE command to see the definition of your tables:

    mysql> DESCRIBE authors;
    +--------+---------------------+------+-----+---------+-------+
    | Field  | Type                | Null | Key | Default | Extra |
    +--------+---------------------+------+-----+---------+-------+
    | paper  | bigint(20) unsigned | NO   |     | NULL    |       |
    | author | bigint(20) unsigned | NO   |     | NULL    |       |
    +--------+---------------------+------+-----+---------+-------+
    2 rows in set (0.00 sec)

3.5 Inserting data

We will populate our database with some data. First, we will add the authors’ names.
We use the INSERT command; remember that the index is auto-incremented, so we do not need to specify it:

    mysql> INSERT INTO names (names,surnames) VALUE ('William L.',
           'Jorgensen');
    Query OK, 1 rows affected (0.00 sec)
    Records: 1 Duplicates: 0 Warnings: 0

    mysql> INSERT INTO names (names,surnames) VALUE ('Wolfgang','Damm');
    Query OK, 1 rows affected (0.00 sec)
    Records: 1 Duplicates: 0 Warnings: 0

    mysql> INSERT INTO names (names,surnames) VALUES ('David', 'Kony'),
           ('Wilfred F.','van Gunsteren'), ('Serge','Stoll'),
           ('Antonio', 'Frontera'), ('Julian','Tirado-Rives'),
           ('David S.','Maxwell');
    Query OK, 6 rows affected (0.04 sec)
    Records: 6 Duplicates: 0 Warnings: 0

    mysql>

As you can see, you may insert values one by one or all at once. The first two statements are written with VALUE and the third with VALUES; MySQL accepts VALUE as a synonym for VALUES, so either keyword works in both cases, although VALUES is the form used in standard SQL. We can verify the content of the table using the SELECT statement:

    mysql> SELECT * FROM names;
    +---------------+------------+-----+
    | surnames      | names      | idx |
    +---------------+------------+-----+
    | Jorgensen     | William L. |   1 |
    | Damm          | Wolfgang   |   2 |
    | Kony          | David      |   3 |
    | van Gunsteren | Wilfred F. |   4 |
    | Stoll         | Serge      |   5 |
    | Frontera      | Antonio    |   6 |
    | Tirado-Rives  | Julian     |   7 |
    | Maxwell       | David S.   |   8 |
    +---------------+------------+-----+
    8 rows in set (0.00 sec)

Next we will add information about the papers:

    mysql> INSERT INTO papers (journal, volume, pages, year) VALUES
        -> ('J. Phys. Chem.', 90, '1276-1284', 1986),
        -> ('J. Comp. Chem.', 18, '1955-1970', 1997),
        -> ('J. Comp. Chem.', 32, '1416-1429', 2002),
        -> ('J. Am. Chem. Soc.', 118, '11225-11236', 1996);
    Query OK, 4 rows affected (0.00 sec)
    Records: 4 Duplicates: 0 Warnings: 0

    mysql> SELECT * FROM papers;
    +--------+-------------------+-------------+-----+------+
    | volume | journal           | pages       | idx | year |
    +--------+-------------------+-------------+-----+------+
    |     90 | J. Phys. Chem.    | 1276-1284   |   1 | 1986 |
    |     18 | J. Comp. Chem.    | 1955-1970   |   2 | 1997 |
    |     32 | J. Comp. Chem.    | 1416-1429   |   3 | 2002 |
    |    118 | J. Am. Chem. Soc. | 11225-11236 |   4 | 1996 |
    +--------+-------------------+-------------+-----+------+
    4 rows in set (0.00 sec)

Oops! We made a mistake: in the third row, the volume number should be 23 instead of 32. We can fix it with the UPDATE ... SET command, which substitutes values in the table. First, we have to pick the row in a unique way; for that purpose we have the idx column. We will change the value in the volume column, but only where idx = 3:

    mysql> UPDATE papers SET volume=23 WHERE idx=3;
    Query OK, 1 row affected (0.00 sec)
    Rows matched: 1 Changed: 1 Warnings: 0

Finally, we will assign authors to papers in the authors table:

    mysql> INSERT INTO authors (paper,author) VALUES (1,1), (2,2), (2,6),
           (2,7), (2,1), (3,3), (3,2), (3,5), (3,4), (4,1), (4,8), (4,7);
    Query OK, 12 rows affected (0.00 sec)
    Records: 12 Duplicates: 0 Warnings: 0

3.6 Searching the database

Now, we will learn how to use SQL to form simple and more advanced queries. First, let us try to find the papers in the database which were published in 2002:

    mysql> SELECT * FROM papers WHERE year=2002;
    +--------+----------------+-----------+-----+------+
    | volume | journal        | pages     | idx | year |
    +--------+----------------+-----------+-----+------+
    |     23 | J. Comp. Chem. | 1416-1429 |   3 | 2002 |
    +--------+----------------+-----------+-----+------+
    1 row in set (0.04 sec)

We can also search for papers that were published in 2002 and in a specified journal, e.g. J. Phys. Chem. Note that the usual rules of boolean operators apply; we have to ‘intersect’ the conditions year = 2002 and journal = J. Phys. Chem. We want both conditions to be fulfilled, so we have to use the AND operator:

    mysql> SELECT * FROM papers WHERE year=2002 AND journal='J. Phys. Chem.';
    Empty set (0.00 sec)

The result is empty because no paper in our database satisfies both conditions. That was easy because we operated on the data from a single table. But let us try to find the papers published by William L. Jorgensen. First, we must look up his ID in the names table, then we have to find the IDs of the papers in the authors table and finally, we have to retrieve the corresponding data from the papers table:

    mysql> SELECT idx FROM names WHERE names='William L.' AND
           surnames='Jorgensen';
    +-----+
    | idx |
    +-----+
    |   1 |
    +-----+
    1 row in set (0.00 sec)

    mysql> SELECT paper FROM authors WHERE author=1;
    +-------+
    | paper |
    +-------+
    |     1 |
    |     2 |
    |     4 |
    +-------+
    3 rows in set (0.00 sec)

    mysql> SELECT * FROM papers WHERE idx=1 OR idx=2 OR idx=4;
    +--------+-------------------+-------------+-----+------+
    | volume | journal           | pages       | idx | year |
    +--------+-------------------+-------------+-----+------+
    |     90 | J. Phys. Chem.    | 1276-1284   |   1 | 1986 |
    |     18 | J. Comp. Chem.    | 1955-1970   |   2 | 1997 |
    |    118 | J. Am. Chem. Soc. | 11225-11236 |   4 | 1996 |
    +--------+-------------------+-------------+-----+------+
    3 rows in set (0.00 sec)

OK, that did the job, but the solution is not very elegant and requires a lot of typing. This can be a problem if we need to retrieve, for example, 1000 records. A better approach is to let MySQL combine the tables for us. First, let us join the data in the tables names and authors; we will look up the indices of the papers published by William L. Jorgensen:

    mysql> SELECT paper FROM names JOIN authors ON
           names.idx=authors.author WHERE names='William L.' AND
           surnames='Jorgensen';
    +-------+
    | paper |
    +-------+
    |     1 |
    |     2 |
    |     4 |
    +-------+
    3 rows in set (0.00 sec)

In order to join two tables, we must specify the relation between the records; here, the rows are joined based on the column idx in the table names and the column author in the table authors.
In addition, we display only those records in which the surnames column contains ‘Jorgensen’ and the names column contains ‘William L.’. In general, using two or more tables may lead to ambiguity in the column names, e.g. two of our tables have the idx column. This problem is resolved by prefixing the column name with the table name. To avoid any doubt, the command above could be written as:

    mysql> SELECT authors.paper FROM names JOIN authors ON
           names.idx=authors.author WHERE names.names='William L.' AND
           names.surnames='Jorgensen';

In the next example, we will join all three tables together, using the relation between the idx column in names and author in authors, as well as the relation between paper in authors and idx in the table papers:

    mysql> SELECT journal,volume,pages,year FROM papers JOIN
           (authors, names) ON papers.idx=authors.paper AND
           authors.author=names.idx WHERE surnames='Jorgensen' AND
           names='William L.';
    +-------------------+--------+-------------+------+
    | journal           | volume | pages       | year |
    +-------------------+--------+-------------+------+
    | J. Phys. Chem.    |     90 | 1276-1284   | 1986 |
    | J. Comp. Chem.    |     18 | 1955-1970   | 1997 |
    | J. Am. Chem. Soc. |    118 | 11225-11236 | 1996 |
    +-------------------+--------+-------------+------+
    3 rows in set (0.00 sec)

Finally, we will search the database for articles published together by the authors ‘Jorgensen’ and ‘Tirado-Rives’.
This requires three steps: (i) looking up the IDs of the two authors in the table names, (ii) finding the IDs of the papers in the table authors to which both author IDs are assigned, (iii) retrieving the information from the table papers:

    mysql> SELECT idx FROM names WHERE surnames='Jorgensen' OR
           surnames='Tirado-Rives';
    +-----+
    | idx |
    +-----+
    |   1 |
    |   7 |
    +-----+
    2 rows in set (0.00 sec)

    mysql> SELECT t1.paper FROM authors AS t1 JOIN authors AS t2 ON
           t1.paper=t2.paper WHERE t1.author=1 AND t2.author=7;
    +-------+
    | paper |
    +-------+
    |     2 |
    |     4 |
    +-------+
    2 rows in set (0.00 sec)

    mysql> SELECT journal,volume,pages,year FROM papers WHERE idx=2 OR
           idx=4;
    +-------------------+--------+-------------+------+
    | journal           | volume | pages       | year |
    +-------------------+--------+-------------+------+
    | J. Comp. Chem.    |     18 | 1955-1970   | 1997 |
    | J. Am. Chem. Soc. |    118 | 11225-11236 | 1996 |
    +-------------------+--------+-------------+------+
    2 rows in set (0.00 sec)

The only part that needs explanation is the second statement; by using JOIN with the same table on both sides, we join the table with itself. The two copies are given the aliases t1 and t2. We assemble the new table using the condition that the column paper in t1 is equal to paper in t2; this way we create a row for each pair of authors who share a paper. Finally, we use the WHERE clause to keep only those rows which refer to our authors (indices 1 and 7).

SQL offers many more commands, but in fact, once you learn how to use the Python interface to MySQL (the next section), you will rarely need them. Knowing Python, you can always retrieve some data from the database and do the filtering and processing in your script. However, if you work on huge tables, it is better to leave as much work as possible to MySQL, because its search algorithms are optimized and therefore usually faster.
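The two approaches mentioned above can be compared in a short script. The following is only a minimal sketch and not part of the original course material: it assumes Python 3 and uses the standard sqlite3 module as a stand-in for MySQL, so that it runs without a database server. The column types are therefore simplified (INTEGER PRIMARY KEY instead of SERIAL, INTEGER instead of YEAR), the ? placeholders are the sqlite3 convention, and the variable names are chosen for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified versions of the three tables from Table 3.1.
cur.executescript("""
CREATE TABLE names   (idx INTEGER PRIMARY KEY, names TEXT, surnames TEXT);
CREATE TABLE papers  (idx INTEGER PRIMARY KEY, journal TEXT, volume INTEGER,
                      pages TEXT, year INTEGER);
CREATE TABLE authors (paper INTEGER NOT NULL, author INTEGER NOT NULL);
""")

# The idx values are generated automatically, in insertion order (1, 2, ...).
cur.executemany("INSERT INTO names (names, surnames) VALUES (?, ?)", [
    ("William L.", "Jorgensen"), ("Wolfgang", "Damm"), ("David", "Kony"),
    ("Wilfred F.", "van Gunsteren"), ("Serge", "Stoll"),
    ("Antonio", "Frontera"), ("Julian", "Tirado-Rives"),
    ("David S.", "Maxwell")])
cur.executemany(
    "INSERT INTO papers (journal, volume, pages, year) VALUES (?, ?, ?, ?)", [
    ("J. Phys. Chem.", 90, "1276-1284", 1986),
    ("J. Comp. Chem.", 18, "1955-1970", 1997),
    ("J. Comp. Chem.", 23, "1416-1429", 2002),
    ("J. Am. Chem. Soc.", 118, "11225-11236", 1996)])
cur.executemany("INSERT INTO authors (paper, author) VALUES (?, ?)", [
    (1, 1), (2, 2), (2, 6), (2, 7), (2, 1), (3, 3),
    (3, 2), (3, 5), (3, 4), (4, 1), (4, 8), (4, 7)])

# Approach 1: one query; the self-join of authors (t1, t2) pairs up
# co-authors of the same paper, and the other joins translate IDs.
sql = """
SELECT p.journal, p.volume, p.pages, p.year
FROM authors AS t1
JOIN authors AS t2 ON t1.paper = t2.paper
JOIN names AS n1 ON t1.author = n1.idx
JOIN names AS n2 ON t2.author = n2.idx
JOIN papers AS p ON p.idx = t1.paper
WHERE n1.surnames = ? AND n2.surnames = ?
ORDER BY p.idx
"""
joint_sql = cur.execute(sql, ("Jorgensen", "Tirado-Rives")).fetchall()

# Approach 2: fetch all (paper, surname) pairs and filter in Python.
by_paper = {}
for paper, surname in cur.execute(
        "SELECT paper, surnames FROM authors JOIN names ON author = names.idx"):
    by_paper.setdefault(paper, set()).add(surname)
wanted = {"Jorgensen", "Tirado-Rives"}
joint_ids = sorted(p for p, found in by_paper.items() if wanted <= found)

print(joint_sql)
print(joint_ids)
conn.close()
```

Both approaches identify papers 2 and 4, in agreement with the three-step search above. The second one transfers every row of the authors table to the script before filtering, which is harmless here but is exactly the cost to avoid on huge tables. With MySQLdb the single query should work in the same way, except that placeholders are written as %s and the values are passed as the second argument of execute().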
Exercise 28: Modify the existing database to store journal names in a separate table. This table should contain both the full and the abbreviated names and it should assign an index to each journal. These indices should be used instead of names in the papers table.

3.7 Python interface to MySQL

The Python interface to the MySQL database is implemented in the MySQLdb module [14]. In order to execute queries on the MySQL database from your Python scripts, you have to import the module and connect to the database:

Code 116

    import MySQLdb
    conn = MySQLdb.connect(host = 'localhost', user = 'pybib', \
                           passwd = 'bookWORM', db = 'bibliography')

Then, you have to create a cursor (an object that performs queries and returns results), define the query using SQL and execute it:

Code 117

    cur = conn.cursor()
    query = "SELECT * FROM names WHERE surnames='Jorgensen';"
    cur.execute(query)

Next, you can fetch the results using either the fetchone() method, which returns the next row as a tuple (or None when no rows are left), or the fetchmany() method, which returns up to the given number of rows at once:

Code 118

    result = cur.fetchone()
    result = cur.fetchmany(10)

Normally, this is done in a loop, since we have many rows to retrieve:

Code 119

    # Using count-controlled loop
    cur.execute(query)
    for row in range(cur.rowcount):
        result = cur.fetchone()

    # Using condition-controlled loop
    cur.execute(query)
    result = cur.fetchone()
    while result:
        result = cur.fetchone()

Do not forget to close the connection when you are done:

Code 120

    conn.close()

The next example performs a search for publications of a specified author and from a specified year.
Code 121

    #!/usr/bin/python
    import MySQLdb

    # Ask for the search criteria; an empty answer disables a criterion
    surname = raw_input("Surname [Press Enter for none]: ")
    year = raw_input("Year [Press Enter for none]: ")

    surname = surname.strip().lower()
    if not year:
        year = 0
    else:
        year = int(year)

    conn = MySQLdb.connect(host = "localhost", user = "pybib", \
                           passwd = "bookWORM", db = "bibliography")
    cur = conn.cursor()

    # Join the three tables, then add only the conditions that are in use
    query = "SELECT DISTINCT journal,volume,pages,year FROM papers"
    query += " JOIN (authors, names) ON papers.idx=authors.paper"
    query += " AND authors.author=names.idx"
    if surname or year:
        query += " WHERE"
        if surname:
            query += " surnames='%s'" % surname
            if year:
                query += " AND"
        if year:
            query += " year=%s" % year
    cur.execute(query)

    # Print the results as a table
    result = cur.fetchone()
    print "%-20s %5s %12s %4s" % ("Journal", "Vol", "Pages", "Year")
    print "-"*44
    while result:
        print "%-20s %5d %12s %4d" % result
        result = cur.fetchone()
    conn.close()

Let us analyse the script line by line. First we ask for the surname and the year. An empty value means that the user does not want to use that search criterion:

Code 122

    surname = raw_input("Surname [Press Enter for none]: ")
    year = raw_input("Year [Press Enter for none]: ")

Next, we strip unnecessary whitespace characters and convert to lower case, since the search will be case-insensitive anyway.
The year, if not given by the user, will be set to zero, which is boolean ‘false’ (this will come in useful later):

Code 123

    surname = surname.strip().lower()
    if not year:
        year = 0
    else:
        year = int(year)

Now, we establish the connection with the database server and assemble the query:

Code 124

    conn = MySQLdb.connect(host = "localhost", user = "pybib", \
                           passwd = "bookWORM", db = "bibliography")
    cur = conn.cursor()

    query = "SELECT DISTINCT journal,volume,pages,year FROM papers"
    query += " JOIN (authors, names) ON papers.idx=authors.paper"
    query += " AND authors.author=names.idx"

The rest of the query is added depending on the search criteria that are used. The AND keyword is needed only when both the surname and the year are given:

Code 125

    if surname or year:
        query += " WHERE"
        if surname:
            query += " surnames='%s'" % surname
            if year:
                query += " AND"
        if year:
            query += " year=%s" % year
    cur.execute(query)

Finally, we can retrieve the results and print them in a table:

Code 126

    result = cur.fetchone()
    print "%-20s %5s %12s %4s" % ("Journal", "Vol", "Pages", "Year")
    print "-"*44
    while result:
        print "%-20s %5d %12s %4d" % result
        result = cur.fetchone()

In the next example, the script adds a new author to the database and returns the auto-generated ID of this author. Remember that in our database, the idx column of the names table has the AUTO_INCREMENT property; when you add a new author, MySQL will automatically insert a unique number into this column.
In Python, this number can be retrieved with the insert_id() method of the connection object:

Code 127

    #!/usr/bin/python
    import MySQLdb

    conn = MySQLdb.connect(host = "localhost", user = "pybib", \
                           passwd = "bookWORM", db = "bibliography")

    names = raw_input("Enter names: ")
    surnames = raw_input("Enter surnames: ")

    cur = conn.cursor()
    # The idx column is left out; MySQL fills it in automatically
    query = "INSERT INTO names (names, surnames)"
    query += " VALUES ('%s','%s')" % (names, surnames)
    cur.execute(query)
    print "Author's ID =", conn.insert_id()
    conn.close()

Afterword

Python was developed to be a scripting language with a clear, readable and easy-to-learn syntax; as such, it quickly became popular and new projects based on this language started to emerge. Nowadays, it has replaced other scripting languages in many applications. It is the main scripting language of the Gentoo Linux distribution; it was used to build many web site engines, for example Zope;¹ several applications were written in Python, including some molecular modelling tools, like PyMOL² or BkChem;³ finally, it can be used for command-line steering of some programs, including those of interest to computational chemists, like Modeller⁴ or VMD.⁵ Therefore, one can say that Python has become the scripting language of bioinformatics and computational chemistry.

If you intend to be just a regular end user of bioinformatics tools, this book and course should be enough. On the other hand, if you would like to implement new methods or modify existing software, it might be desirable to go beyond the material covered in this textbook. If programming in Python is fun for you, there is also a lot more to explore. There are plenty of books written on Python; there are introductory tutorials, library references and books on more complex subjects, like GUI programming or scientific programming in Python.
You will also find a lot of very advanced resources on the Internet. You can start, for example, with these introductory books and sites: the official Python tutorial [16], the book by Allen B. Downey, entitled Python for Software Design: How to Think Like a Computer Scientist [17] (there is an on-line version available), the book by Mark Pilgrim, entitled Dive Into Python [18] (also available on-line), the book by Mark Lutz, entitled Learning Python: Powerful Object-Oriented Programming [19] and the book by David M. Beazley, entitled Python Essential Reference [20]. For scientific applications, you may want to check out the book by Allen B. Downey [21], available on-line, and the book by Hans Petter Langtangen [22]; for the SciPy and NumPy modules, there are great reference guides available on the Internet [8, 12]; for applications in bioinformatics, there is a free e-book by Katja Schuerer [23] and the books by Ruediger-Marcus Flaig [24] and Mitchell Model [25]; finally, for the standard Python modules, you can use the official documentation [26] and the book by Fredrik Lundh [27], also available on-line. If you intend to make graphical interfaces to your scripts, you will find the tutorial for the PyGTK package on the Internet [28]; for the PyQt library, I would recommend the official documentation [29] and the book by Boudewijn Rempt [30].

For reference on MySQL database administration you will find resources on the Internet, starting from the official on-line documentation [15] and independent tutorials [31]. There are also printed resources, for example, the books by Larry Ullman [32] or Robert Sheldon and Geoff Moes [33]. For the specific issues of the interface between Python and MySQL, you should consult the on-line tutorial [14] or the book by Albert Lukaszewski [34].
¹ http://www.zope.org/WhatIsZope
² http://www.pymol.org
³ http://bkchem.zirael.org
⁴ http://www.salilab.org/modeller
⁵ http://www.ks.uiuc.edu/Research/vmd

Bibliography

[1] Æ. Frisch. Essential System Administration (O’Reilly, 2002), 3rd edn. ISBN: 978-0596003432.
[2] The official Python documentation. URL: http://docs.python.org
[3] PDB file format documentation. URL: http://www.wwpdb.org/docs.html
[4] Gaussian program documentation. URL: http://www.gaussian.com/g_tech/g_ur/g09help.htm
[5] J. B. Foresman and Æ. Frisch. Exploring Chemistry with Electronic Structure Methods (Gaussian, 1996), 2nd edn. ISBN: 978-0963676931.
[6] J. E. F. Friedl. Mastering Regular Expressions (O’Reilly, 2006), 3rd edn. ISBN: 978-0596528126.
[7] Documentation of the re module. URL: http://docs.python.org/library/re.html
[8] NumPy reference guide. URL: http://docs.scipy.org/doc/numpy/reference
[9] The official Gnuplot documentation. URL: http://www.gnuplot.info/documentation.html
[10] Adobe Systems Inc. PostScript Language Reference (Addison-Wesley, 1999), 3rd edn. ISBN: 978-0201379228. URL: http://www.adobe.com/products/postscript/pdfs/PLRM.pdf
[11] R. Garg, S. P. Gupta, H. Gao, M. S. Babu, A. K. Debnath and C. Hansch. Comparative quantitative structure-activity relationship studies on anti-HIV drugs. Chem. Rev., 99:3525–3602 (1999).
[12] SciPy reference guide. URL: http://docs.scipy.org/doc/scipy/reference
[13] J. D. Murray and W. vanRyper. Encyclopedia of Graphics File Formats (O’Reilly Media, 1996), 2nd edn. ISBN: 978-1565921610. URL: http://www.fileformat.info/mirror/egff/index.htm
[14] Tutorial on the MySQLdb Python interface. URL: http://mysql-python.sourceforge.net/MySQLdb.html
[15] MySQL reference. URL: http://dev.mysql.com/doc/refman/5.0/en/index.html
[16] The official Python tutorial. URL: http://docs.python.org/tutorial
[17] A. B. Downey. Python for Software Design: How to Think Like a Computer Scientist (Cambridge University Press, 2009), 1st edn. ISBN: 978-0521725965. URL: http://www.greenteapress.com/thinkpython/index.html
[18] M. Pilgrim. Dive Into Python (Apress, 2004), 1st edn. ISBN: 978-1590593561. URL: http://diveintopython.org
[19] M. Lutz. Learning Python: Powerful Object-Oriented Programming (O’Reilly, 2009), 4th edn. ISBN: 978-0596158064.
[20] D. M. Beazley. Python Essential Reference (Addison-Wesley, 2009), 4th edn. ISBN: 978-0672329784.
[21] A. B. Downey. Computational Modeling and Complexity Science (Green Tea Press, 2008), 1st edn. URL: http://www.greenteapress.com/compmod
[22] H. P. Langtangen. Python Scripting for Computational Science (Springer, 2007), 3rd edn. ISBN: 978-3540739159.
[23] K. Schuerer. Python course in bioinformatics. URL: http://www.pasteur.fr/recherche/unites/sis/formation/python
[24] R.-M. Flaig. Bioinformatics Programming in Python: A Practical Course for Beginners (Wiley-VCH, 2008), 1st edn. ISBN: 978-3527320943.
[25] M. L. Model. Bioinformatics Programming Using Python: Practical Programming for Biological Data (O’Reilly, 2009), 1st edn. ISBN: 978-0596154509.
[26] Index of Python modules. URL: http://docs.python.org/modindex.html
[27] F. Lundh. Python Standard Library (O’Reilly, 2001), 1st edn. ISBN: 978-0596000967. URL: http://effbot.org/zone/librarybook-index.htm
[28] PyGTK reference. URL: http://www.pygtk.org
[29] PyQt reference. URL: http://www.riverbankcomputing.co.uk/software/pyqt/intro
[30] B. Rempt. GUI Programming with Python: QT Edition (Commandprompt, 2001), 1st edn. ISBN: 978-0970033048. URL: http://www.commandprompt.com/community/pyqt
[31] Tutorial on MySQL databases. URL: http://www.techotopia.com/index.php/MySQL_Essentials
[32] L. Ullman. MySQL, Second Edition (Peachpit Press, 2006), 2nd edn. ISBN: 978-0321375735.
[33] R. Sheldon and G. Moes. Beginning MySQL (Programmer to Programmer) (Wrox, 2005), 1st edn. ISBN: 978-0764579509.
[34] A. Lukaszewski. MySQL for Python (Packt Publishing, 2010), 1st edn. ISBN: 978-1849510189.
