Python and Biopython
Scripting for Busy
Bioinformaticians
Jeffrey Chang
Stanford University
11 Aug 2003
CSB2003
Introduction
[Figure: timeline running from 1995 to 2003]
Jeffrey Chang <[email protected]>
2
3
Outline
• Act I: So what is this Python I keep hearing about?
• Act II: Python, it is nice to meet you!
Intermission
• Act III: Let’s write some code!
• Act IV: Biopython: Batteries Included.
• Act V: Where do we go from here?
4
Act I
So what is Python?
5
Genealogy
[Figure: timeline from 1950 to 2000 placing Assembly, FORTRAN, LISP, BASIC, C, C++, perl, Python, and Java]
• Increasing layers of abstraction
  • structured/object-oriented programming
  • memory handling
• Sophistication in data types
6
Happy Birthday!
ABC Programming Language (usability tested!)
Guido van Rossum
Aug 13, 1991
7
Raison D’être
• FORTRAN: numerical analysis
• LISP: symbolic computation
• C: system programming
• C++: objects, speed, compatibility with C
• Java: objects, internet
• perl: system administration
• python: general programming and more!
8
Language for Research
• minimize development time
• interactive
  • examine your data
  • tweak algorithms
• suitable for library development
• sociable
  • other research tools
  • internet
• multiplatform
9
Python for Research
• minimize development time
• interactive
  • examine your data
  • tweak algorithms
• suitable for library development
• sociable
  • other research tools
  • internet

• high level data types
• garbage collection
• interpreted
• interactive environment
• rich module support
• extensible with C
• multiplatform
10
Python vs Perl
Python Strengths
• Object-oriented
• Handles Numbers
• Clean Syntax
• Clean Extensions to C
Perl Strengths
• Popular
• Available
• Mature Libraries
• Familiar Syntax
• String handling
11
Python vs. Java
Python Strengths
• Libraries
• Sociable
• High Level Data Types
• Easy to Prototype
Java Strengths
• Popular
• Industry Support
• GUI Tools
• Fast
• Good Development Tools
12
Python in Biology
13
Where can I find Python?
http://www.python.org
Officially available for:
• Windows
• Macintosh
• Linux
• Source
14
Also available on:
15
Act II
Python, it is nice to meet you!
16
Interacting with Python
17
Interacting with Python
18
Python Interpreter
Python 1.5.2 (#1, Aug 2 1999, 18:47:55)
[GCC egcs-2.91.66 19990314 (egcs-1.1.2 on sunos5
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> _
Triple prompt
means go!
19
Our First Script
Python 1.5.2 (#1, Aug 2 1999, 18:47:55)
[GCC egcs-2.91.66 19990314 (egcs-1.1.2 on sunos5
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> print "hello world"_
•
Interactive environment. Try out new
ideas here!
20
Python Interpreter
Python 1.5.2 (#1, Aug 2 1999, 18:47:55) [GCC egcs-2.91.66
19990314 (egcs-1.1.2 on sunos5
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> print "hello world"
hello world
>>> _
No
Semicolon!
•
Commands are evaluated as you type
21
Python Interpreter
Python 1.5.2 (#1, Aug 2 1999, 18:47:55)
[GCC egcs-2.91.66 19990314 (egcs-1.1.2 on sunos5
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> print "hello world"
hello world
>>> print gene_name
Traceback (innermost last):
  File "<stdin>", line 1, in ?
NameError: gene_name
>>> _
Errors caught immediately.
• Examine your data.
• Quickly develop and test algorithms.
22
Creating Variables
>>> print "hello world"
hello world
>>> print gene_name
Traceback (innermost last):
File "<stdin>", line 1, in ?
NameError: gene_name
>>> gene_name = "caspase"
>>> print gene_name
caspase
>>> _
• Variables are created when you assign them.
• Names are case-sensitive: gene_name is not Gene_Name.
23
Printing Variables
>>> a = 100
>>> b = 5
>>> print a * b
500
>>> print "%10d" % a
       100
>>> print "%d + %d = %d" % (a, b, a+b)
100 + 5 = 105
>>> _
• Formatting like printf in C.
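The same printf-style formatting still works in current Python. A minimal sketch, assuming a Python 3 interpreter (3.6 or later for the f-string) rather than the 1.5.2 shown on the slide:

```python
# printf-style formatting with the % operator, written for Python 3.
a = 100
b = 5

# "%10d" right-aligns the integer in a field 10 characters wide.
padded = "%10d" % a

# Several values are supplied as a tuple.
summed = "%d + %d = %d" % (a, b, a + b)

# Python 3.6+ also offers f-strings for the same job.
summed_fstring = f"{a} + {b} = {a + b}"

print(padded)
print(summed)
print(summed_fstring)
```

In Python 3, print is a function, so the parentheses are required.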
24
Special Value: ‘None’
>>> a = 100
>>> b = 5
>>> print a * b
500
>>> print "%10d" % a
       100
>>> print "%d + %d = %d" % (a, b, a+b)
100 + 5 = 105
>>> c = None
>>> print c
None
>>> _
25
Integers
>>> a = 100
>>> print a
100
>>> print a + 10
110
>>> _
Supports arithmetic:
• +, -, *, /, **
26
Integers
>>> a = 100
>>> print a
100
>>> print a + 10
110
>>> a = 2*(a+10)
>>> print a
220
>>> _
Understands parentheses.
27
Integers
>>> a = 100
>>> print a
100
>>> print a + 10
110
>>> a = 2*(a+10)
>>> print a
220
>>> print a / 100
2
>>> _
Gotcha! Be careful of division: dividing two integers truncates the result.
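The truncation above is the behavior of the Python version on the slide. In Python 3 the two kinds of division have separate operators. A short sketch, assuming a Python 3 interpreter:

```python
# Integer division in Python 3: "/" is true division, "//" floors.
a = 220

truncated = a // 100                   # floor division, like "/" on the slide
exact = a / 100                        # true division always returns a float
quotient, remainder = divmod(a, 100)   # both parts at once

print(truncated, exact, quotient, remainder)
```

Using `//` whenever a whole-number result is intended keeps code correct under either division rule.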
28
Integers
>>> a = 100
>>> print a
100
>>> print a + 10
110
>>> a = 2*(a+10)
>>> print a
220
>>> print a / 100
2
>>> print a ** 2
48400
>>> print a ** 100
Traceback (innermost last):
File "<stdin>", line 1, in ?
OverflowError: integer pow()
>>> _
Biggest integer is (about)
2 billion.
29
Long Integers
>>> a = 100L
>>> print a
100L
>>> a ** 100
10000000000000000000000000000000000000
00000000000000000000000000000000000000
00000000000000000000000000000000000000
00000000000000000000000000000000000000
00000000000000000000000000000000000000
00000000000L
>>> _
Have no limit!
30
Float
>>> a = 100.0
>>> print a
100.0
>>> a ** 100
1e+200
>>> a / 3
33.3333333333
>>>
31
Number Coercion
>>> 3 / 2
1
>>> 3. / 2
1.5
>>> float(3)
3.0
>>> float(3)/2
1.5
Integers convert to floating point.
32
Strings
>>> protein = "TSQGRTRTLLNLTPIRLIVALFLVAAAVGL"
>>> print protein
TSQGRTRTLLNLTPIRLIVALFLVAAAVGL
>>> _
33
Strings
>>> protein = "TSQGRTRTLLNLTPIRLIVALFLVAAAVGL"
>>> print protein
TSQGRTRTLLNLTPIRLIVALFLVAAAVGL
>>> _
Characters numbered from 0.
T S Q G R T R T L L N ...
0 1 2 3 4 5 6 7 8 9 10 ...
34
Strings
>>> protein = "TSQGRTRTLLNLTPIRLIVALFLVAAAVGL"
>>> print protein
TSQGRTRTLLNLTPIRLIVALFLVAAAVGL
>>> print protein[0:5]
TSQGR
>>> _
“Slices” do not include the end.
T S Q G R T R T L L N ...
0 1 2 3 4 5 6 7 8 9 10 ...
35
Strings
>>> protein = "TSQGRTRTLLNLTPIRLIVALFLVAAAVGL"
>>> print protein
TSQGRTRTLLNLTPIRLIVALFLVAAAVGL
>>> print protein[0:5]
TSQGR
>>> fragment = protein[5:10]
>>> print fragment
TRTLL
>>> len(fragment)
5
>>> _
“len” gives the length of the string.
36
Strings
>>> print fragment
TRTLL
>>> len(fragment)
5
>>> fragment[3:]
'LL'
>>> _
Slice endpoints are optional.
T R T L L
0 1 2 3 4
37
Strings
>>> print fragment
TRTLL
>>> len(fragment)
5
>>> fragment[3:]
'LL'
>>> fragment[10:]
''
>>> _
Slices can be out of range.
T R T L L
0 1 2 3 4
38
Strings
>>> print fragment
TRTLL
>>> len(fragment)
5
>>> fragment[3:]
'LL'
>>> fragment[10:]
''
>>> fragment[-1]
'L'
>>> _
Slices can also be counted from the end.
T R T L L
-5 -4 -3 -2 -1
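The slicing rules from the last few slides can be collected in one place. A small sketch in Python 3 syntax (an assumption; the slides use 1.5.2, but the slicing rules are the same):

```python
# String indexing and slicing, using the protein from the slides.
protein = "TSQGRTRTLLNLTPIRLIVALFLVAAAVGL"

first_five = protein[0:5]    # characters 0..4; the end index is excluded
fragment = protein[5:10]     # characters 5..9

tail = fragment[3:]          # omitted end: run to the end of the string
head = fragment[:2]          # omitted start: begin at position 0
past_end = fragment[10:]     # an out-of-range slice is empty, not an error
last = fragment[-1]          # negative positions count from the end
last_two = fragment[-2:]     # the final two characters

print(first_five, fragment, tail, head, repr(past_end), last, last_two)
```

Unlike an out-of-range slice, an out-of-range single index such as `fragment[10]` raises IndexError.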
39
Lists
>>> fragment = ['T', 'R', 'T', 'L', 'L']
>>> print fragment
['T', 'R', 'T', 'L', 'L']
>>> print fragment[1:3]
['R', 'T']
>>> _
Slices like strings.
40
Lists
>>> fragment = ['T', 'R', 'T', 'L', 'L']
>>> print fragment
['T', 'R', 'T', 'L', 'L']
>>> print fragment[1:3]
['R', 'T']
>>> 'R' in fragment
1
>>> 'A' in fragment
0
>>> _
“in” checks whether something is in a list.
41
List Assignments
>>> print fragment
['T', 'R', 'T', 'L', 'L']
>>> reference = fragment
>>> fragment[0] = 'A'
>>> print fragment
['A', 'R', 'T', 'L', 'L']
>>> print reference
???
42
List Assignments
>>> print fragment
['T', 'R', 'T', 'L', 'L']
>>> reference = fragment
>>> fragment[0] = 'A'
>>> print fragment
['A', 'R', 'T', 'L', 'L']
>>> print reference
['A', 'R', 'T', 'L', 'L']
>>> _
Assigning a list creates a second reference to the same list, not a copy.
43
List Assignments
>>> print fragment
['T', 'R', 'T', 'L', 'L']
>>> reference = fragment[:]
>>> fragment[0] = 'A'
>>> print fragment
['A', 'R', 'T', 'L', 'L']
>>> print reference
['T', 'R', 'T', 'L', 'L']
>>> _
Python Idiom: To copy a list, slice the whole thing!
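The difference between a reference and a copy can be checked directly with the `is` operator. A minimal sketch in Python 3 syntax; the names `alias` and `snapshot` are illustrative and not from the slides:

```python
# Aliasing versus copying a list.
fragment = ['T', 'R', 'T', 'L', 'L']

alias = fragment         # a second name for the same list object
snapshot = fragment[:]   # a whole-list slice builds a new list

fragment[0] = 'A'        # mutate through the original name

print(alias)             # the change is visible through the alias
print(snapshot)          # the copy keeps the old contents
print(alias is fragment, snapshot is fragment)
```

The slice makes a shallow copy: the new list is independent, but any mutable items inside it would still be shared.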
44
Lists are Objects
>>> dir(fragment)
['append', 'count', 'extend', 'index', 'insert', 'pop', 'remove',
'reverse', 'sort']
>>> _
“dir” tells you what an object can do.
45
Lists are Objects
>>> dir(fragment)
['append', 'count', 'extend', 'index', 'insert', 'pop', 'remove',
'reverse', 'sort']
>>> fragment.append
<built-in method append of list object at 2094b8>
>>> _
46
Lists are Objects
>>> dir(fragment)
['append', 'count', 'extend', 'index', 'insert', 'pop', 'remove',
'reverse', 'sort']
>>> fragment.append
<built-in method append of list object at 2094b8>
>>> print fragment.append.__doc__
L.append(object) -- append object to end
>>> _
__doc__ shows you documentation.
47
Tuples
>>> fragment = ('T', 'R', 'T', 'L', 'L')
>>> print fragment
('T', 'R', 'T', 'L', 'L')
>>> print fragment[1:3]
('R', 'T')
>>> fragment[0] = 'A'
Traceback (innermost last):
File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
>>> _
Like lists, but cannot be changed.
48
Dictionaries
>>> genetic_code = {
'UUU' : 'F',
'UUC' : 'F',
'UUA' : 'L',
'UUG' : 'L',
[...]
}
>>>
Creates a mapping from a key to a value.
(Duplicate keys are not allowed.)
49
Dictionaries
>>> genetic_code = {
'UUU' : 'F',
'UUC' : 'F',
'UUA' : 'L',
'UUG' : 'L',
[...]
}
>>> print genetic_code['GGU']
G
>>> print genetic_code['ABC']
Traceback (innermost last):
File "<stdin>", line 1, in ?
KeyError:'ABC'
>>>
50
Dictionaries
>>> genetic_code = {
'UUU' : 'F',
'UUC' : 'F',
'UUA' : 'L',
'UUG' : 'L',
[...]
}
>>> print genetic_code['GGU']
G
>>> print genetic_code['ABC']
Traceback (innermost last):
File "<stdin>", line 1, in ?
KeyError:'ABC'
>>> dir(genetic_code)
['clear', 'copy', 'get', 'has_key', 'items', 'keys',
'update', 'values']
>>>
Dictionaries are objects.
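A KeyError can be avoided by testing for the key first or by asking for a fallback value. A sketch in Python 3 syntax, where `has_key` has been replaced by the `in` operator; the five-codon table is only an excerpt and the 'X' fallback is a hypothetical placeholder:

```python
# Safe dictionary lookups on an excerpt of the genetic code.
genetic_code = {
    'UUU': 'F',
    'UUC': 'F',
    'UUA': 'L',
    'UUG': 'L',
    'GGU': 'G',
}

known = genetic_code['GGU']               # direct lookup of a present key
has_abc = 'ABC' in genetic_code           # membership test, no exception
fallback = genetic_code.get('ABC', 'X')   # 'X' is returned for a missing key

print(known, has_abc, fallback)
```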
51
What we’ve covered so far...
• Python is a high-level scripting language.
• Data types:
  • Numbers: Integer, Float
  • Strings
  • Lists
  • Dictionary
52
Intermission
53
Act III
Let’s write some code!
54
On with the programming!
>>> rna = ("AUG", "GGU", "GCC")
>>> prot = ""
>>> for codon in rna:
...     print codon
...     prot = prot + gencode[codon]
...
AUG
GGU
GCC
>>> print prot
MGA
>>> _
“for” loop delineated by whitespace.
55
Whitespace: Good or Bad???
• Usability testing
• Enforce common style
We will perhaps eventually be writing only small
modules which are identified by name as they
are used to build larger ones, so that devices
like indentation, rather than delimiters, might
become feasible for expressing local structure
in the source language.
Donald E. Knuth, 1974
56
while...
>>> rna = ("AUG", "GGU", "GCC")
>>> prot = ""
>>> i = 0
>>> while i < len(rna):
...     print rna[i]
...     prot = prot + gencode[rna[i]]
...     i = i + 1
...
AUG
GGU
GCC
>>> print prot
MGA
>>> _
57
if, else
>>> rna = ("AUG", "XXX", "GGU", "GCC")
>>> prot = ""
>>> for codon in rna:
...     if gencode.has_key(codon):
...         prot = prot + gencode[codon]
...     else:
...         print "unknown '%s'" % codon
...
unknown 'XXX'
>>> print prot
MGA
>>> _
58
Loop control
>>> rna = ("AUG", "XXX", "GGU", "UAA", "GCC")
>>> prot = ""
>>> for codon in rna:
...     if codon in ['UAA', 'UAG', 'UGA']:
...         break
...     elif gencode.has_key(codon):
...         prot = prot + gencode[codon]
...     else:
...         pass # handle unknown key
...
>>> print prot
MG
>>> _
“break” exits the loop
“pass” does nothing
“continue” (not shown) skips to the next iteration
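Since “continue” is not shown on the slide, here is one way the same loop might use it. A sketch in Python 3 syntax; the three-codon `gencode` is an excerpt, and the `unknown` list is an illustrative addition:

```python
# Loop control with break and continue while translating codons.
gencode = {"AUG": "M", "GGU": "G", "GCC": "A"}   # excerpt only
stop_codons = ("UAA", "UAG", "UGA")

rna = ("AUG", "XXX", "GGU", "UAA", "GCC")
prot = ""
unknown = []

for codon in rna:
    if codon in stop_codons:
        break                    # leave the loop at the first stop codon
    if codon not in gencode:
        unknown.append(codon)    # remember the codon we could not translate
        continue                 # skip to the next iteration
    prot = prot + gencode[codon]

print(prot)
print(unknown)
```

The loop stops at "UAA", so "GCC" is never translated.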
59
Functions
>>> def to_aa(codon):
...     gencode = {[...]}
...     return gencode[codon]
...
>>> _
“return” exits the function
60
Saving Code as Scripts
• Save your code for next time!
• “.py” files
61
Modules
#!/usr/local/bin/python
def translate(rna):
    gencode = {
        [...]
    }
    prot = ""
    for codon in rna:
        prot = prot + gencode[codon]
    return prot
"bio" module
• A module is a library of code.
• "bio.py" is the "bio" module.
• import / reload
>>> import bio
>>> dir(bio)
['translate']
>>> bio.translate(("AUG", "GGU", "GCC"))
'MGA'
>>> _
62
... as a standalone script
#!/usr/local/bin/python
def translate(rna):
    gencode = {
        [...]
    }
    prot = ""
    for codon in rna:
        prot = prot + gencode[codon]
    return prot
print "Hi!"
if __name__ == '__main__':
    print translate(("AUG", "GGU", "GCC"))
• Interprets and executes the script
• __name__ is set to the module name, or to '__main__' when the file is run as a script
Hi!
MGA
63
Global Variables
#!/usr/local/bin/python
gencode = {
    "UUU" : "F",
    "UUC" : "F",
    "UUA" : "L",
    "UUG" : "L",
    [...]
}
def translate(rna):
    prot = ""
    for codon in rna:
        prot = prot + gencode[codon]
    return prot
• Recreating the genetic code mapping each time is expensive!
• Create a global variable to store it.
64
Default Parameters
#!/usr/local/bin/python
gencode = {
    "UUU" : "F",
    "UUC" : "F",
    "UUA" : "L",
    "UUG" : "L",
    [...]
}
def translate(rna, code=gencode):
    prot = ""
    for codon in rna:
        prot = prot + code[codon]
    return prot
• But what if you want to use a different genetic code?
• Pass a translation table as a parameter.
• Set default parameter
  • standard one used most often
  • does not break existing programs
65
Default Parameters
#!/usr/local/bin/python
gencode = {
    "UUU" : "F",
    "UUC" : "F",
    "UUA" : "L",
    "UUG" : "L",
    [...]
}
def translate(rna, code=gencode):
    prot = ""
    for codon in rna:
        prot = prot + code[codon]
    return prot
>>> import bio
>>> bio.translate(("AUG", "GGU", "GCC"))
'MGA'
>>> mycode = {[...]}
>>> bio.translate(("AUG", "GGU", "GCC"), mycode)
'MGV'
>>> bio.translate(("AUG", "GGU", "GCC"), code=mycode)
'MGV'
>>>
66
Using stopcodons
#!/usr/local/bin/python
gencode = {
    "UUU" : "F",
    "UUC" : "F",
    "UUA" : "L",
    "UUG" : "L",
    [...]
}
def translate(rna, code=gencode):
    prot = ""
    for codon in rna:
        if codon in ['UAA', 'UAG', 'UGA']:
            break
        prot = prot + code[codon]
    return prot
Do a translation only up to any recognized stop codon.
Bug!!!
67
What's the bug?
#!/usr/local/bin/python
gencode = {
    "UUU" : "F",
    "UUC" : "F",
    "UUA" : "L",
    "UUG" : "L",
    [...]
}
def translate(rna, code=gencode):
    prot = ""
    for codon in rna:
        if codon in ['UAA', 'UAG', 'UGA']:
            break
        prot = prot + code[codon]
    return prot
The stop codon may be different for different genetic codes!
68
adding another parameter
#!/usr/local/bin/python
gencode = {
    "UUU" : "F",
    "UUC" : "F",
    "UUA" : "L",
    "UUG" : "L",
    [...]
}
def translate(rna, code=gencode,
              stopcodon=['UAA', 'UAG', 'UGA']):
    prot = ""
    for codon in rna:
        if codon in stopcodon:
            break
        prot = prot + code[codon]
    return prot
One solution: make the stopcodon a parameter
>>> bio.translate(("AUG", "GGU", "GCC"), mycode)
'MGV'
>>> bio.translate(("AUG", "GGU", "GCC"), code=mycode,
...               stopcodon=['GCC'])
'MG'
>>> bio.translate(("AUG", "GGU", "GCC"), stopcodon=['GGU'])
'M'
>>> _
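For readers on a current interpreter, the function with both default parameters might look like the sketch below. It assumes Python 3; the three-codon table is an excerpt, `mycode` is a hypothetical alternative table, and the defaults are tuples and a module-level dictionary that the function never modifies:

```python
# translate() with default parameters, sketched for Python 3.
STANDARD_CODE = {"AUG": "M", "GGU": "G", "GCC": "A"}   # excerpt only
STANDARD_STOPS = ("UAA", "UAG", "UGA")


def translate(rna, code=STANDARD_CODE, stopcodon=STANDARD_STOPS):
    """Translate a sequence of codons, stopping at the first stop codon."""
    residues = []
    for codon in rna:
        if codon in stopcodon:
            break
        residues.append(code[codon])
    return "".join(residues)


# A hypothetical alternative table in which GCC maps to V.
mycode = dict(STANDARD_CODE, GCC="V")

rna = ("AUG", "GGU", "GCC")
print(translate(rna))
print(translate(rna, mycode))
print(translate(rna, code=mycode, stopcodon=("GCC",)))
print(translate(rna, stopcodon=("GGU",)))
```

Callers that pass only `rna` keep working when new keyword parameters are added, which is the point the slides make about not breaking existing programs.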
69
finishing touches
#!/usr/local/bin/python
[...]
def translate(rna, code=gencode,
              stopcodon=['UAA', 'UAG', 'UGA']):
    """translate(rna[, code][,
    stopcodon]) -> string
    Translate an RNA sequence into
    a protein sequence.
    """
    prot = ""
    for codon in rna:
        if codon in stopcodon:
            break
        prot = prot + code[codon]
    return prot
• Documentation
• Triple-quoted strings
  • Allows newlines
>>> import bio
>>> dir(bio.translate)
['__doc__', '__name__',
'func_code',
'func_defaults', 'func_doc',
'func_globals', 'func_name']
>>> print bio.translate.__doc__
translate(rna[, code][,
stopcodon]) -> string
Translate an RNA sequence into
a protein sequence.
>>> _
70
RNA as a tuple of codons?
• ("AUG", "GGU", "GCC")
• Representation problems
• hard to get sequences into that form
• not sliceable, e.g. cannot easily get residues 2 to 4
• Semantic problems
• what about non-coding regions?
• insertion/deletion errors?
71
Building a Sequence object
#!/usr/local/bin/python
class Sequence:
    seq = ''
    name = ''
• "class" keyword
• member variables
  • defined in scope of class
  • class "owns" the variables
>>> import Sequence
>>> seq = Sequence.Sequence()
>>> seq.name
''
>>> seq.name = "Actin Binding Protein"
>>> print seq.name
Actin Binding Protein
>>> _
72
Private Variables
#!/usr/local/bin/python
class Sequence1:
    _seq = ''
    _name = ''
class Sequence2:
    __seq = ''
    __name = ''
• data hiding by convention
  • leading underscore
  • name mangling
>>> dir(Sequence.Sequence1)
['_seq', '_name', ...]
>>> dir(Sequence.Sequence2)
['_Sequence2__seq',
'_Sequence2__name', ...]
>>> _
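Name mangling is easy to confirm from outside the class. A small sketch, assuming Python 3 (the mangling rule is the same as on the slide); the `mangled` list is only an illustration:

```python
# Single versus double leading underscores on class attributes.
class Sequence1:
    _seq = ''     # single underscore: private by convention only


class Sequence2:
    __seq = ''    # double underscore: stored under a mangled name


# Collect the attribute names of Sequence2 that end in "__seq".
mangled = [name for name in dir(Sequence2) if name.endswith('__seq')]

print(hasattr(Sequence1, '_seq'))    # reachable under its own name
print(hasattr(Sequence2, '__seq'))   # not reachable under the written name
print(mangled)
```

Mangling only renames the attribute; it is still reachable as `Sequence2._Sequence2__seq`, so this is protection against accidents rather than real access control.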
73
Adding a constructor
#!/usr/local/bin/python
class Sequence:
    def __init__(self, seq='', name=''):
        self._seq = seq
        self._name = name
• __init__
  • optional constructor
  • automatically called when an object is created
• self
  • reference to object
  • like "this" in C++, Java
  • defined explicitly
74
Adding methods
#!/usr/local/bin/python
class Sequence:
    def __init__(self, seq='', name=''):
        self._seq = seq
        self._name = name
    def get_seq(self):
        return self._seq
    def get_name(self):
        return self._name
• methods defined inside class.
>>> seq = Sequence.Sequence(
...     "HSRDIDQEYQ", "Actin Binding")
>>> print seq.get_name()
Actin Binding
>>> print seq._name
Actin Binding
>>> _
75
Create an RNA class
#!/usr/local/bin/python
[...] # Sequence declaration
class RNASequence(Sequence):
    def __init__(self, seq='', name=''):
        Sequence.__init__(self, seq, name)
    # get_seq defined in Sequence
    # get_name defined in Sequence
• subclass from "Sequence"
• inherit its methods and members
• new constructor hides Sequence one
  • need to call it explicitly
76
Make my codons!
#!/usr/local/bin/python
[...] # Sequence declaration
class RNASequence(Sequence):
    def __init__(self, seq='', name=''):
        Sequence.__init__(self, seq, name)
    def as_codons(self):
        codons = []
        i = 0
        while i < len(self._seq):
            codon = self._seq[i:i+3]
            codons.append(codon)
            i = i + 3
        return codons
• Create a new method to split the sequence into triple codons.
  • as_codons not appropriate for general sequences
  • only available to RNASequence
77
How to handle errors?
#!/usr/local/bin/python
[...] # Sequence declaration
class RNASequence(Sequence):
    [...] # constructor
    def as_codons(self):
        # check for condition
        if len(self._seq) % 3 != 0:
            raise ValueError, "broken"
        codons = []
        i = 0
        while i < len(self._seq):
            codon = self._seq[i:i+3]
            codons.append(codon)
            i = i + 3
        return codons
• What happens when the sequence cannot be split evenly into triplets?
78
Exception Handling
• Use for "unignorable" conditions
• Fail loudly!
class RNASequence(Sequence):
    [...] # constructor
    def as_codons(self):
        if len(self._seq) % 3 != 0:
            raise ValueError, "broken"
        codons = []
        i = 0
        while i < len(self._seq):
            codon = self._seq[i:i+3]
            codons.append(codon)
            i = i + 3
        return codons
>>> goodseq = Sequence.RNASequence(
...     "AUGGGU")
>>> print goodseq.as_codons()
['AUG', 'GGU']
>>> brokenseq = Sequence.RNASequence(
...     "AUGGGUG")
>>> print brokenseq.as_codons()
Traceback (innermost last):
  File "<stdin>", line 1, in ?
  File "Sequence.py", line 23, in as_codons
    raise ValueError, "broken"
ValueError: broken
>>> _
79
Exception Handling
>>> try:
...     codons = brokenseq.as_codons()
...     print "Codons: %s" % codons
... except ValueError:
...     print "Sequence is broken"
...
Sequence is broken
>>> _
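Putting the class, the subclass, and the exception handling together, a Python 3 version might look like the sketch below. It makes some assumptions: `raise ValueError(...)` replaces the older `raise ValueError, "broken"` spelling, RNASequence simply inherits the constructor, and `describe` is a hypothetical helper that is not in the slides:

```python
# Sequence classes with a loud failure and a caller that handles it.
class Sequence:
    def __init__(self, seq='', name=''):
        self._seq = seq
        self._name = name

    def get_seq(self):
        return self._seq

    def get_name(self):
        return self._name


class RNASequence(Sequence):
    # No __init__ here, so the Sequence constructor is inherited as-is.
    def as_codons(self):
        """Split the sequence into triplets, failing if it does not divide evenly."""
        if len(self._seq) % 3 != 0:
            raise ValueError(
                "sequence length %d is not a multiple of 3" % len(self._seq))
        return [self._seq[i:i + 3] for i in range(0, len(self._seq), 3)]


def describe(seq):
    """Return a printable summary, catching the error instead of crashing."""
    try:
        return "Codons: %s" % seq.as_codons()
    except ValueError as err:
        return "Sequence is broken: %s" % err


print(describe(RNASequence("AUGGGU")))
print(describe(RNASequence("AUGGGUG")))
```

Catching only ValueError means any other, unexpected error still fails loudly.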
80
Reading sequence from a file
>ABP1_SACEX Actin-Binding Protein
MALEPIDATTHSRDIEQEYQKVVRGTDNDT
TWLIISPNTQKEYLPSSTGSSFSDFLQSFD
ETKVEYGIARVSPPGSDVGKIILVGWCPDS
APMKTRASFAANFGTIANSVLPGYHIQVTA
RDEDDLDEEELLTKISNAAGARYSIQAAGN
SVPTSSASGSAPVKKVFTPSLAKKESEPKK
SFVPPPVREEPVPVNVVKDN
FASTA-formatted file
81
Opening a File
>>> print open.__doc__
open(filename[, mode[, buffering]]) -> file object
Open a file. [...]
>>> _
“open” returns a file object.
82
Opening a File
>>> print open.__doc__
open(filename[, mode[, buffering]]) -> file object
Open a file. [...]
>>> file = open("does_not_exist", "r")
Traceback (innermost last):
File "<stdin>", line 1, in ?
IOError: [Errno 2] No such file or directory: 'does_not_exist'
>>> _
83
Opening a File
>>> print open.__doc__
open(filename[, mode[, buffering]]) -> file object
Open a file. [...]
>>> file = open("does_not_exist", "r")
Traceback (innermost last):
File "<stdin>", line 1, in ?
IOError: [Errno 2] No such file or directory: 'does_not_exist'
>>> file = open("fasta_file", "r")
>>> dir(file)
['close', 'closed', 'fileno', 'flush', 'isatty', 'mode', 'name',
'read', 'readinto', 'readline', 'readlines', 'seek', 'softspace',
'tell', 'truncate', 'write', 'writelines']
>>> _
84
Reading a FASTA file
#!/usr/local/bin/python
[...] # Sequence stuff
def read_fasta(filename):
    file = open(filename, 'r')
    title_line = file.readline()
    sequence = ''
    while 1:
        line = file.readline()
        if not line:
            break
        sequence = sequence + line
    name = title_line[1:]
    return Sequence(sequence, name)
Added to our Sequence module...
Bug!!!
85
Reading a FASTA file
#!/usr/local/bin/python
[...] # Sequence stuff
def read_fasta(filename):
    file = open(filename, 'r')
    title_line = file.readline()
    sequence = ''
    while 1:
        line = file.readline()
        if not line:
            break
        sequence = sequence + line
    name = title_line[1:]
    return Sequence(sequence, name)
“line” contains newlines and/or carriage returns!
BUG: extra characters
86
The string Module
>>> import string
>>> dir(string)
['atof', 'atoi', 'atol', 'capitalize', 'capwords', 'center',
'count', 'digits', 'expandtabs', 'find', 'hexdigits', 'index',
'index_error', 'join', 'joinfields', 'letters', 'ljust', 'lower',
'lowercase', 'lstrip', 'maketrans', 'octdigits', 'replace',
'rfind', 'rindex', 'rjust', 'rstrip', 'split', 'splitfields',
'strip', 'swapcase', 'translate', 'upper', 'uppercase',
'whitespace', 'zfill']
>>> print string.rstrip.__doc__
rstrip(s) -> string
Return a copy of the string s with trailing whitespace removed.
>>> _
See the Library Reference for more modules!
87
Using the string library
#!/usr/local/bin/python
import string    # import here
[...] # Sequence stuff
def read_fasta(filename):
    file = open(filename, 'r')
    title_line = file.readline()
    sequence = ''
    while 1:
        line = file.readline()
        if not line:
            break
        line = string.rstrip(line)    # remove whitespace
        sequence = sequence + line
    name = title_line[1:]
    return Sequence(sequence, name)
“import” the library to access the functions.
88
read_fasta (finished)
#!/usr/local/bin/python
import string
[...] # Sequence stuff
def read_fasta(filename):
    """read_fasta(filename) ->
    Sequence
    Read a FASTA-formatted file
    and return a Sequence object
    """
    file = open(filename, 'r')
    title_line = file.readline()
    sequence = ''
    while 1:
        line = file.readline()
        if not line:
            break
        line = string.rstrip(line)
        sequence = sequence + line
    name = title_line[1:]
    return Sequence(sequence, name)
Added docstring.
Should add error checking.
Check format.
Check sequence.
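The error checking that this slide leaves as a to-do could be sketched as follows. This is one possible approach in Python 3, not the author's implementation: it returns a (name, sequence) tuple instead of a Sequence object so that the example is self-contained, and the "letters only" rule for the sequence check is an assumption that real FASTA files (which may contain symbols such as `*` or `-`) would need relaxed:

```python
# Reading a single-record FASTA file with basic format and sequence checks.
def read_fasta(filename):
    """Read one FASTA record and return (name, sequence)."""
    with open(filename, "r") as handle:
        title_line = handle.readline().rstrip()
        # Check format: a FASTA title line starts with ">".
        if not title_line.startswith(">"):
            raise ValueError("not a FASTA file: title line must start with '>'")
        # Strip newlines and carriage returns from every sequence line.
        lines = [line.strip() for line in handle]

    sequence = "".join(line for line in lines if line)
    # Check sequence: assumed here to consist of letters only.
    if not sequence.isalpha():
        raise ValueError("sequence is empty or contains non-letter characters")
    return title_line[1:], sequence
```

The `with` statement closes the file even when a check fails, and stripping the title line also removes the trailing newline from the name.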
89
Summary: Act III
Python Covered:
• Functions
• Objects
• Modules
• Read/Write Files
Code written:
• Translate RNA to protein.
• Sequence, RNASequence classes.
• Read FASTA files.
90
Act IV
Biopython: Batteries Included
91
Python in Biology, 1999
• Growing body of code being developed in Python.
• Much code attacking the same problem.
• Little intellectual property in the code -- we just need code to get something done!
92
Solution: Biopython!
• Provides freely available software tools for biology research.
• High-tech penny jar.
• Modelled on Bioperl.
www.biopython.org
93
Who Should Use Biopython?
• People who manipulate and analyze biological data using Python.
• People who need a module to perform a function.
• Very few end-user tools (scripts to run).
• Very few GUI tools.
94
Why should I use Biopython?
• Software is hard.
• Complete solutions are hard.
• Maintenance is hard.
95
What does Biopython do?
• Database access / File formats
• Sequence analysis
• Structure analysis
• Access to algorithms
• Microarray data analysis
96
Sequence Library
• Sequence class.
• Understands types of biological sequences.
• Transcribe and translate sequences.
• Analyses
• reverse complement, molecular weight, GC content, Smith-Waterman alignment
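To give a feel for two of these analyses, here is a plain-Python sketch of reverse complement and GC content. It deliberately does not use Biopython and does not reflect its API; the function names are illustrative, and the code assumes Python 3 and upper-case DNA containing only A, C, G, and T:

```python
# Reverse complement and GC content for a DNA string, in plain Python 3.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse the whole string."""
    return dna.translate(_COMPLEMENT)[::-1]


def gc_content(dna):
    """Return the fraction of bases that are G or C (0.0 for an empty string)."""
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


print(reverse_complement("ATGC"))
print(gc_content("ATGC"))
```

A library version would also have to cope with lower-case input, ambiguity codes, and RNA, which is part of why a shared, maintained library is attractive.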
97
Structure Analysis
• Thomas Hamelryck’s PDB Library
• Read and write PDB files. Hard!
• Fast search for neighbors in 3D space.
• Superimpose structures.
98
Microarray Analysis
• Michiel de Hoon’s PyCluster
• Read/Write Cluster/TreeView Files
• Data Analysis includes:
• Hierarchical Clustering, Self-Organizing Maps,
Principal Component Analysis
99
Databases
• Typical Functions:
• Search for data.
• Download data.
• Databases Supported:
• GenBank, PubMed, SWISS-PROT, PDB,
SCOP, Prosite, LocusLink, etc...
• BLAST search (Local and WWW)
100
Using Biopython
101
Download a SWISS-PROT Seq
>>> from Bio.SwissProt import SProt
>>> SWISSPROT = SProt.ExPASyDictionary()
>>> entry = SWISSPROT['POL_HV2RO']
>>> print entry
ID   POL_HV2RO      STANDARD;      PRT;  1036 AA.
AC   P04584; Q76629;
[...]
SQ   SEQUENCE   1036 AA;  117080 MW;  5224E354B1DCC83B CRC64;
TGRFFRTGPL GKEAPQLPRG PSSAGADTNS TPSGSSSGST GEIYAAREKT ERAERETIQG
SDRGLTAPRA GGDTIQGATN RGLAAPQFSL WKRPVVTAYI EGQPVEVLLD TGADDSIVAG
[...]
>>> _
102
Saving FASTA-format
>>> from Bio import Fasta
>>> fasta_seq = Fasta.Record()
>>> fasta_seq.title = seq_obj.entry_name
>>> fasta_seq.sequence = seq_obj.sequence
>>> print fasta_seq
>POL_HV2RO
TGRFFRTGPLGKEAPQLPRGPSSAGADTNSTPSGSSSGSTGEIYAAREKTERAERETIQG
SDRGLTAPRAGGDTIQGATNRGLAAPQFSLWKRPVVTAYIEGQPVEVLLDTGADDSIVAG
IELGNNYSPKIVGGIGGFINTKEYKNVEIEVLNKKVRATIMTGDTPINIFGRNILTALGM
[...]
>>> open('myseq', 'w').write(str(fasta_seq))
>>> _
103
Run a BLAST search
>>> from Bio.Blast import NCBIWWW
>>> handle = NCBIWWW.blast('blastp', 'pdb', open('myseq'))
>>> results = NCBIWWW.BlastParser().parse(handle)
>>> print len(results.descriptions)
15
>>> for desc in results.descriptions:
...     print desc
...
pdb|1QGH|A Chain A, The X-Ray Structure Of The Unusual Dod...    24   2.6
pdb|1QGH|A Chain A, The X-Ray Structure Of The Unusual Dod...    24   2.6
[…]
pdb|1P35|B Chain B, Crystal Structure Of Baculovirus P35 >g...    23   8.5
>>> _
104
Getting Biopython
• Download from: http://www.biopython.org
• Open Source license!
  • Free to modify, redistribute.
  • See LICENSE for details.
105
Learning Biopython
• Read the Tutorial at:
http://www.biopython.org/docs/tutorial/
106
To Help Out
• Visit the web site.
• Join the biopython mailing lists.
• Find a project!
• Best: something you do often that is not in the library.
• Support more programs, databases, types of data.
• Documentation, site management, news reporter.
107
Act V
Where do we go from here?
108
Jython = Java + Python
• Java implementation of Python.
• Compatible with Java
• Jython can execute Java code.
• Java can execute Jython code.
• Can take advantage of libraries written in
Java.
109
Optimizing Python with C
• Python is closely tied with C.
  • Can extend Python code with C code
• Optimization strategy:
  • Find slow points
  • Rewrite in C
• Gordon Bell Prize for supercomputing
  • 1998 Finalist for Price/Performance
• SPaSM (Scalable Parallel Short-range Molecular-dynamics)
  • C/Python
110
Take-Home Messages
• Python is a simple high level language.
• Python is full-featured, and scales well for scientific computation.
• Python is well-supported in biology.
• Biopython performs common biology-related tasks.
111
From here...
• Visit the web page: http://www.python.org
• Download Python
• Read the online documentation
• Tutorial
• Library Reference
• Find a project and start coding!
112
Recommended Books
Gentle introduction
Fast-paced introduction
Reference
Practical reference
113
Enjoy the Conference!
Thank You!
Jeffrey Chang
[email protected]
114