Lecture 9: Back to the Basics:
Python and Application in Bioinformatics
Y.Z. Chen
Department of Pharmacy
National University of Singapore
Tel: 65-6616-6877; Email: [email protected] ; Web: http://bidd.nus.edu.sg
Content
• What is python?
• Python basics
• Application in bioinformatics
Why Programming?
Programming skills needed for tasks such as:
• Write a program to do the same PUBMED search every
week and list the new hits for molecular interactions and
network regulation.
• Do a BLAST search against sequences which are on your
list of proteins with known kinetic data
• Merge results from different searches
• Import data into Excel for plotting
What Programming Tools?
• Popularly used programming tools:
• Programming languages - Perl, Python, C, C++, Java,
Visual Basic, PHP, Fortran
• Software libraries - BioPerl, Biopython, and BioJava
• Databases - MySQL, Postgres, Oracle
Statistics of Software Usage
Nature Biotech 25, 390 (2007)
Why Python?
• Suitable for relatively small automated tasks such as search-and-replace over a large number of text files, renaming and rearranging files,
writing a small database, building a specialized GUI application, and developing
simple games
• A faster and easier alternative to C/C++/Java for such tasks
• Simpler to use, available on Windows, MacOS X, and Unix operating
systems
• A real programming language, more structure and support than shell
scripts or batch files can offer, more error checking than C, high-level
data types built in, applicable to a much larger problem domain than
Awk or even Perl yet in many cases equally easy to use
• An interpreted language, which can save you considerable time during
program development because no compilation and linking is necessary.
Why Python?
• Allows you to split a program into modules that can be reused in other Python
programs, and comes with a large collection of standard modules for
file I/O, system calls, sockets, and interfaces to graphical user interface
toolkits.
• Enables programs to be written compactly and readably at typically
much shorter length than equivalent C, C++, Java programs, for
several reasons:
• The high-level data types allow you to express complex
operations in a single statement;
• statement grouping is done by indentation instead of beginning
and ending brackets;
• no variable or argument declarations are necessary.
• Extensible: if you know how to program in C, it is easy to add a new
built-in function or module to the interpreter. You can also link the Python
interpreter into an application written in C and use it as an extension
or command language for that application.
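The compactness points above can be seen in a few lines. The following is a minimal sketch written in Python 3 syntax (the slides themselves use Python 2); the sequence and the names sequence, counts and gc_fraction are made up for illustration.

```python
# Illustrative only: the sequence and the variable names are assumptions,
# not taken from the lecture.
sequence = "GAATTCGGATCC"   # no variable declaration is needed

# A high-level data type (a dictionary) built in a single statement
counts = {base: sequence.count(base) for base in "ACGT"}

# Statements are grouped by indentation, not by brackets
if len(sequence) > 0:
    gc_fraction = (counts["G"] + counts["C"]) / len(sequence)
else:
    gc_fraction = 0.0

print(counts)
print("GC fraction = %.2f" % gc_fraction)
```

An equivalent C program would need declarations, an explicit counting loop and its own table structure.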
What is Python?
Python is a Programming Language
• Started by Guido van Rossum in 1990 as a way to write
software for the Amoeba operating system. Influenced by
ABC, which was designed to be easy to learn. It is also
very useful for large programs written by expert
programmers.
• The word "Python" comes from the comedy troupe "Monty
Python." Words and jokes from the skits and movies
appear often in Python software, including "spam," "idle,"
and "grail"
What is Python?
Python Properties
• Interpreted Language
• Interactive mode
• Imperative and "Object-Oriented"
• Cross-platform
• Doesn't try to guess what you mean
• Great for team projects
• Popular for web applications, testing, and XML
• Extremely popular for chemical informatics (but not so
much in bioinformatics)
What is Python?
Interactive Mode
• Python has an interactive mode. You can type Python code
and see the results immediately. To start Python, open a
unix shell and type "python".
> python
Python 2.3.3 (#1, Jan 29 2004, 22:55:13)
[GCC 3.3.3 [FreeBSD] 20031106] on freebsd5
Type "help", "copyright", "credits" or "license" for more information.
>>>
• At the >>> prompt you can enter Python code.
Python Resources http://python.org/
Python Resources
http://www.pasteur.fr/recherche/unites/sis/formation/python/index.html
Example: Using Python as a calculator
>>> 2+3
5
>>> 4+6*8
52
>>> abs(-4)
4
>>> help(abs)
Help on built-in function abs:
abs(...)
abs(number) -> number
Return the absolute value of the argument.
>>> 89**34
1902217732808760980190430983601716818363305103120555045416541165041L
>>> print 89**34
1902217732808760980190430983601716818363305103120555045416541165041
>>> "What... is the air-speed velocity of an unladen swallow?"
'What... is the air-speed velocity of an unladen swallow?'
>>> print "What do you mean? An African or European swallow?"
What do you mean? An African or European swallow?
What is Python?
Example: Importing a module
>>> import math
>>> help(math)
Help on module math:
NAME
math
FILE
/usr/local/lib/python2.3/lib-dynload/math.so
DESCRIPTION
This module is always available. It provides access to the mathematical
functions defined by the C standard.
>>> math.pi
3.1415926535897931
>>> math.sin(math.pi/2.0)
1.0
>>>
What is Python?
Example: Print the Time of Day
>>> import datetime
>>> now = datetime.datetime.now()
>>> now
datetime.datetime(2008, 2, 2, 19, 23, 28, 809434)
>>> print now
2008-02-02 19:23:28.809434
>>> print "Now is", now.strftime("%d-%m-%Y"), "at", now.strftime("%H:%M")
Now is 02-02-2008 at 19:23
>>>
• The notation name1.name2 is called an attribute lookup. In this case,
name2 is an attribute of name1 and has some value.
>>> now.day
2
>>> now.year
2008
>>> now.hour
19
Simple Python script
Code:
# file: simple_code.py
import math
import datetime
print "log(1e23) =", math.log(1e23)
print "2*sin(3.1414) =", 2*math.sin(3.1414)
now = datetime.datetime.now()
print "Now is", now.strftime("%d-%m-%Y"), "at", now.strftime("%H:%M")
print "or, more precisely, %s" % now
Output:
> python simple_code.py
log(1e23) = 52.9594571389
2*sin(3.1414) = 0.000385307177203
Now is 02-02-2008 at 19:55
or, more precisely, 2008-02-02 19:55:43.046953
>
Python Script
Creating Python Script
• A Python program is just a text file. You can use any text (programmer's)
editor. There are several on the Linux machines, including vi, XEmacs,
Kate, xvim, and nedit. You can also use one of the free IDEs like idle,
PyShell, or (under Microsoft Windows) Pythonwin.
Running Python Script
• Option 1: Run the python program from the command line, giving it the
name of the script file to run.
> python now.py
Now is 02-02-2004 at 19:55
or, more precisely, 2004-02-02 19:55:43.046953
>
Python Script
Running Python Script
• Option 2: Put the magic comment #!/usr/bin/env python as the very first
line in the program.
Code:
#!/usr/bin/env python
# now.py
import datetime
now = datetime.datetime.now()
print "Now is", now.strftime("%d-%m-%Y"), "at", now.strftime("%H:%M")
print "or, more precisely, %s" % now
Make the script executable with chmod +x now.py
> chmod +x now.py
Then run the program as if it were any other Unix program (type ./now.py
instead if the current directory is not in your PATH)
> now.py
Now is 02-02-2004 at 19:55
or, more precisely, 2004-02-02 19:55:43.046953
Python Statements
Statement examples:
sum = 2 + 2 # this is a statement
name = raw_input("What is your name?") # these are two statements
print "Hello,", name
print "Did you know that your name has", \
len(name), "letters?" # This is one statement spread across 2 lines
# Another way to extend a statement across several lines
print "Here is your name repeated 7 times:", (
name * 7
)
Python Statements
Blocks, If and for statements
EcoRI = "GAATTC"
sequence = raw_input("Enter a DNA sequence:")
if EcoRI in sequence:
    print "Sequence contains an EcoRI site" # This is a one-line block
import sys
sequence2 = raw_input("Enter another sequence:")
if len(sequence2) < 100:
    print "Sequence is too small. Throw it back." # a two-line block
    sys.exit(0)
sequences = (sequence, sequence2)
for seq in sequences:
    print "sequence length =", len(seq) # a block ...
    for c in "ATCG":
        print "#%s = %d" % (c, seq.count(c)) # ... with a block inside it
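The same if/for block structure can be wrapped in a function. This is a minimal sketch in Python 3 syntax (print is a function there, unlike in the Python 2 slides); the helper name find_sites and the two test sequences are hypothetical.

```python
def find_sites(sequence, site="GAATTC"):
    """Return the 0-based start positions of every occurrence of site."""
    positions = []
    start = sequence.find(site)
    while start != -1:                 # str.find returns -1 when nothing is found
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions

# Made-up example sequences, not from the lecture
for seq in ("AAGAATTCTTGAATTC", "ATGC"):
    hits = find_sites(seq)
    if hits:
        print("EcoRI site(s) at", hits)    # a one-line block
    else:
        print("no EcoRI site in", seq)
```

Unlike the `in` test, which only reports whether a site is present, this version also reports where each site starts.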
Python Objects and Literals
String Literals
# single quotes
'Who said "to be or not to be"?'
# double quotes
"DNA goes from 5' to 3'."
# escaped quotes
"\"That's not fair!\" yelled my sister."
# creates: "That's not fair!" yelled my sister
# triple quoted strings, with single quotes
'''This one string can go
over several lines'''
# "raw" strings, mostly used for regular expressions
r"\"That's not fair!\" yelled my sister."
# creates: \"That's not fair!\" yelled my sister
# You can even have raw triple double quoted strings!
r"""So there!"""
Python Objects and Literals
Numeric Literals
123          # an integer
1.23         # a floating point number
-1.23        # a negative floating point number
1.23E45      # scientific notation
0x7b         # hexadecimal notation (decimal 123)
0173         # octal notation (decimal 123)
12+3j        # complex number 12 + 3i (Note that Python uses "j"!)
2147483648L  # a long integer
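Two of the literals above are spelled differently in Python 3. As a hedged sketch for readers on a newer interpreter (the variable names are illustrative only):

```python
# Python 3 spellings of the numeric literals; names are illustrative.
decimal_value = 123
hex_value = 0x7b          # hexadecimal, unchanged from Python 2
octal_value = 0o173       # Python 3 needs the 0o prefix; a bare 0173 is a syntax error
big_value = 2147483648    # no L suffix: Python 3 has one arbitrary-size int type
z = 12 + 3j               # the j follows the number directly, with no * in between

print(decimal_value == hex_value == octal_value)
print(z.real, z.imag)
```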
Python Objects and Literals
List literal
>>> data = [1, 4, 9, 16]
>>> data[0]
1
>>> data[1]
4
>>> data[2] = 7
>>> data
[1, 4, 7, 16]
>>> data[1:3]
[4, 7]
>>>
Python Objects and Literals
Tuple literal
>>> data = (1, 4, 9, 16)
>>> data[1]
4
>>> data[2] = 7
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
>>>
Dictionary literal
>>> d = {"A": "ALA", "C": "CYS", "D": "ASP"}
>>> print d["A"]
ALA
>>>
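A dictionary like this one is typically used as a lookup table. The sketch below is in Python 3 syntax and converts a one-letter peptide string to three-letter codes; the function name to_three_letter and the "XXX" placeholder for unknown residues are assumptions made for this example.

```python
three_letter = {"A": "ALA", "C": "CYS", "D": "ASP"}

def to_three_letter(peptide, table=three_letter):
    """Join the three-letter code of each residue with dashes."""
    # dict.get returns the fallback ("XXX", an arbitrary choice here)
    # instead of raising KeyError when a residue is not in the table
    return "-".join(table.get(residue, "XXX") for residue in peptide)

print(to_three_letter("ACDA"))
print(to_three_letter("AZ"))
```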
Python Operators
Some operations using numbers
>>> (1+2)**2
9
>>> (2+3*4)/2
7
>>> 7%3 # % is the modulo operator
1
>>> 7 == 7
True
>>>
Python Operators
Some operations using strings
>>> "Andrew" + " " + "Dalke"
'Andrew Dalke'
>>> "*" * 10
'**********'
>>> "My name is %s. What's your name?" % "Andrew"
"My name is Andrew. What's your name?"
>>> "My first name is %s and family name is %s" % ("Andrew",
... "Dalke")
'My first name is Andrew and family name is Dalke'
>>> "My first name is %(first)s. Is yours also %(first)s?" % \
... {"first": "Andrew", "family": "Dalke"}
'My first name is Andrew. Is yours also Andrew?'
>>> "Andrew" == "Dalke"
False
>>>
Python Functions
http://python.org/doc/current/lib/built-in-funcs.html
Python Functions
String Methods
>>> seq = "AATGCCG"
>>> seq.lower()
'aatgccg'
>>> seq.count("A")
2
>>> seq.find("GC")
3
>>> seq.find("gc")
-1
>>> seq.replace("C", "U")
'AATGUUG'
>>> import string
>>> seq.translate(string.maketrans("ATCG", "TAGC"))
'TTACGGC'
>>> # Make the reverse complement
>>> seq.translate(string.maketrans("ATCG", "TAGC"))[::-1]
'CGGCATT'
>>>
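The string.maketrans function used above exists only in Python 2. In Python 3 the same table is built with str.maketrans, as in this minimal sketch (the names COMPLEMENT and reverse_complement are illustrative):

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ATCG", "TAGC")

def reverse_complement(seq):
    """Complement every base, then reverse the string with a [::-1] slice."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("AATGCCG"))
```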
Python Functions
Special Methods
Some methods are used so often that they have special syntax.
>>> s = "AATGCCGTTTAT"
>>> s[0] # index
'A'
>>> s[1:4] # slice from position 1 up to, but not including, position 4
'ATG'
>>> s[:4] # default beginning is position 0
'AATG'
>>> s[-1] # index from the end
'T'
>>> s[-3:] # default end includes the last character
'TAT'
>>> s[3:-3]
'GCCGTT'
>>> s[::2] # the optional third parameter is the stride
'ATCGTA'
>>> s[::-1] # returns the string, reversed
'TATTTGCCGTAA'
>>>
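Slices make it easy to cut a sequence into fixed-size pieces. As an illustration (Python 3 syntax; the function name codons is hypothetical), the sketch below splits a sequence into consecutive three-letter slices:

```python
def codons(seq):
    """Split seq into consecutive 3-letter slices; a trailing partial codon is dropped."""
    usable = len(seq) - len(seq) % 3      # largest multiple of 3 that fits
    return [seq[i:i + 3] for i in range(0, usable, 3)]

print(codons("AATGCCGTTTAT"))
print(codons("AATGC"))
```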
Python Processing Command Line Arguments
• When a Python script is run, its command-line arguments (if any) are stored in
the list sys.argv.
Code:
#!/usr/bin/env python
# file: echo.py
import sys
print sys.argv
Output:
> chmod +x echo.py
> echo.py tuna
['echo.py', 'tuna']
> echo.py tuna fish
['echo.py', 'tuna', 'fish']
> echo.py "tuna fish"
['echo.py', 'tuna fish']
> echo.py
['echo.py']
>
Python Processing Command Line Arguments
Computing the Hypotenuse of a Right Triangle
Code:
#!/usr/bin/env python
# file: hypotenuse.py
import sys, math
if len(sys.argv) != 3: # the program name and the two arguments
    # stop the program and print an error message
    sys.exit("Must provide two positive numbers")
# Convert the two arguments from strings into numbers
x = float(sys.argv[1])
y = float(sys.argv[2])
print "Hypotenuse =", math.sqrt(x**2+y**2)
Output:
> hypotenuse.py 5 12
Hypotenuse = 13.0
>
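The script above checks the number of arguments but not whether they are positive. One possible variant, sketched in Python 3 syntax, moves the logic into a function that takes an argv-style list so it can be tried without the command line; the function name hypotenuse_from_args is an assumption, and math.hypot replaces the explicit square root.

```python
import math

def hypotenuse_from_args(argv):
    """argv mimics sys.argv: the program name followed by two numbers."""
    if len(argv) != 3:
        raise ValueError("Must provide two positive numbers")
    # float() itself raises ValueError for text such as "abc"
    x, y = float(argv[1]), float(argv[2])
    if x <= 0 or y <= 0:
        raise ValueError("Must provide two positive numbers")
    return math.hypot(x, y)

# Simulated command line: hypotenuse.py 5 12
print("Hypotenuse =", hypotenuse_from_args(["hypotenuse.py", "5", "12"]))
```

In a real script the function would be called with sys.argv, and a ValueError would be turned into sys.exit(message).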
Python I/O (Input / Output)
Input
• Text input comes from sys.stdin. It has a method called readline which
reads a line of input.
>>> import sys
>>> s = sys.stdin.readline()
This is a line of text. The line ends when I press 'Enter'.
>>> s
"This is a line of text. The line ends when I press 'Enter'.\n"
>>>
• You can also use the raw_input function to get a string from sys.stdin.
This function takes an optional argument which is used as the prompt.
>>> name = raw_input("What is your name? ")
What is your name? Andrew
>>> print name, "is a nice name"
Andrew is a nice name
>>>
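In Python 3, raw_input is named input. The line-by-line reading shown above also works on any file-like object, which the following sketch uses to stay testable: io.StringIO stands in for sys.stdin, and the function name read_sequences and the sample text are made up for illustration.

```python
import io

def read_sequences(handle):
    """Return the non-empty lines of a file-like object, upper-cased, without newlines."""
    sequences = []
    for line in handle:          # iterating a file object yields one line at a time
        line = line.strip()      # drops the trailing "\n" that readline keeps
        if line:
            sequences.append(line.upper())
    return sequences

# io.StringIO behaves like an open text file; sys.stdin could be passed instead
fake_stdin = io.StringIO("gaattc\n\nATGC\n")
print(read_sequences(fake_stdin))
```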
Python I/O (Input / Output)
Output
• Most Python text output goes to the sys.stdout file object. You've been
using the print statement, which uses sys.stdout under the covers.
Output file handles have a write function which writes a string to the file
with no extra interpretation.
>>> a, b, c = 1, 4, 9
>>> print "The first three squares are", a, b, "and", c
The first three squares are 1 4 and 9
>>> print "The first three squares are", a, ",", b, "and", c, "."
The first three squares are 1 , 4 and 9 .
>>> print "The first three squares are %s, %s and %s." % (a, b, c)
The first three squares are 1, 4 and 9.
>>> import sys
>>> sys.stdout.write("The first three squares are %s, %s and %s.\n" %
... (a, b, c))
The first three squares are 1, 4 and 9.
>>>
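For comparison, here is a sketch of the same output calls in Python 3, where print is a function rather than a statement. The names line and buffer are illustrative; io.StringIO is used only to show that print can write to any file object.

```python
import io
import sys

a, b, c = 1, 4, 9

# print() separates its arguments with single spaces, as the print statement did
print("The first three squares are", a, b, "and", c)

# % formatting gives full control over spacing and punctuation
line = "The first three squares are %s, %s and %s." % (a, b, c)
print(line)

# write() adds nothing, so the newline has to be supplied explicitly
sys.stdout.write(line + "\n")

# print() can be redirected to another file object with the file argument
buffer = io.StringIO()
print(line, file=buffer)
```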
Python Applications in Bioinformatics
BLAST output parsing
• BLAST is the most widely used bioinformatics tool for searching large
sequence databases. The original BLAST authors expected the output
to be read only by people, but many users run BLAST as one step of a larger
algorithm and want to automate that step by parsing the
output of the various BLAST flavors (BLASTN, BLASTP, TBLASTX, WU-BLAST, and
so on). Libraries such as Bioperl, Biopython, and BioJava
all include BLAST output parsers.
First few lines of the BLASTP output
Python Applications in Bioinformatics
BLAST output parsing
• Getting program version information
• Program reporting the version information of a BLAST file
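The program shown on the original slide is not preserved in this transcript. As a stand-in, the following is a minimal sketch in Python 3, assuming the classic NCBI plain-text layout in which the first non-blank line holds the program name and version. The sample header is made up, and the names SAMPLE_HEADER and blast_version are hypothetical.

```python
import re

# Hypothetical sample: a made-up header in the classic plain-text BLAST layout
SAMPLE_HEADER = """BLASTP 2.2.10 [Oct-19-2004]

Query= example_protein
         (120 letters)
"""

def blast_version(text):
    """Return (program, version) from the first non-blank line, or None."""
    for line in text.splitlines():
        if not line.strip():
            continue                       # skip leading blank lines
        # Matches BLASTN, BLASTP, BLASTX, TBLASTN and TBLASTX
        match = re.match(r"(T?BLAST[NPX])\s+(\S+)", line)
        return (match.group(1), match.group(2)) if match else None
    return None

print(blast_version(SAMPLE_HEADER))
```

Other BLAST flavors may format this line differently, which is one reason the library parsers mentioned earlier are usually preferable for production work.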
Python Applications in Bioinformatics
BLAST output parsing
• Getting the number of sequences and the number of letters in the database
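The slide's code is not preserved here. One way this step could look is sketched below in Python 3, assuming the database summary appears as a line of the form "N sequences; M total letters" with commas as thousands separators. The sample text and the names SAMPLE_DATABASE_BLOCK and database_size are hypothetical.

```python
import re

# Hypothetical sample: made-up numbers in the assumed plain-text layout
SAMPLE_DATABASE_BLOCK = """Database: example_db
           1,234 sequences; 567,890 total letters
"""

def database_size(text):
    """Return (number_of_sequences, number_of_letters), or None if not found."""
    match = re.search(r"([\d,]+)\s+sequences;\s+([\d,]+)\s+total letters", text)
    if match is None:
        return None
    # Remove the thousands separators before converting to int
    sequences, letters = (int(group.replace(",", "")) for group in match.groups())
    return sequences, letters

print(database_size(SAMPLE_DATABASE_BLOCK))
```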
Python Applications in Bioinformatics
BLAST output parsing
• Reading description lines
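The slide's code is not preserved here. The sketch below (Python 3) shows one possible approach, assuming the classic plain-text layout: a heading line starting with "Sequences producing significant alignments", a blank line, then one line per hit ending in a bit score and an E-value. The sample report and the names SAMPLE_REPORT and description_lines are hypothetical.

```python
# Hypothetical sample: made-up hits in the assumed plain-text layout
SAMPLE_REPORT = """                                                   Score    E
Sequences producing significant alignments:        (bits) Value

sp|P00001|EXAMPLE_1 hypothetical protein one          210   2e-54
sp|P00002|EXAMPLE_2 hypothetical protein two           98   3e-20

>sp|P00001|EXAMPLE_1 hypothetical protein one
"""

def description_lines(text):
    """Return a list of (title, bit_score, e_value) tuples."""
    hits = []
    in_section = False
    for line in text.splitlines():
        if line.startswith("Sequences producing significant alignments"):
            in_section = True
            continue
        if not in_section:
            continue
        if not line.strip():
            if hits:
                break        # blank line after the last hit ends the section
            continue         # blank line between the heading and the first hit
        # The last two whitespace-separated fields are the score and the E-value
        title, score, evalue = line.rsplit(None, 2)
        if evalue.startswith("e"):
            evalue = "1" + evalue   # assumption: some reports print e.g. "e-100"
        hits.append((title.strip(), float(score), float(evalue)))
    return hits

for hit in description_lines(SAMPLE_REPORT):
    print(hit)
```

Splitting from the right keeps titles that contain spaces intact, which a plain split() would not.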