Download Python # Biopython # wxPython - IBERO-NBIC @ Inicio

Document related concepts
no text concepts found
Transcript
 Python # Biopython # wxPython
Programming Course for Bioinformatics
14-19 May 2011
Organised by
RNASA-IMEDIR Group
University of A Coruña, Spain
Alejandro Pazos Sierra (UDC), [email protected]
Julián Dorado de la Calle (UDC), [email protected]
Victoria López Alonso (ISCIII), [email protected]
Diana de la Iglesia Jimenez (GIB, UPM), [email protected]
Scientific Networks
Ibero-American Network of the Nano-Bio-Info-Cogno Convergent Technologies (Ibero-NBIC)
Galician Network for Colorectal Cancer Research (REGICC)
Cooperative Research Thematic Network on Computational Medicine (COMBIOMED)
Teachers
Cristian Robert Munteanu (UDC), [email protected]
José Antonio Seoane Fernández (UDC), [email protected]
UDC = University of A Coruña
ISCIII = Instituto de Salud Carlos III
UPM = Universidad Politécnica de Madrid
Python
Python is a powerful programming language, easy to learn, with an efficient high-level data structures and a
simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing,
together with its interpreted nature, make it an ideal language for scripting and rapid application
development in many areas on most platforms.
The Python interpreter and the extensive standard library are freely available in source or binary form for all
major platforms from the Python Web site, http://www.python.org/, and may be freely distributed. The same
site also contains distributions of and pointers to many free third party Python modules, programs and tools,
and additional documentation.
The Python interpreter is easily extended with new functions and data types implemented in C or C++ (or
other languages callable from C). Python is also suitable as an extension language for customizable
applications.
This tutorial introduces the reader informally to the basic concepts and features of the Python language and
system. It helps to have a Python interpreter handy for hands-on experience, but all examples are selfcontained, so the tutorial can be read off-line as well.
For a description of standard objects and modules, see The Python Standard Library. The Python Language
Reference gives a more formal definition of the language. To write extensions in C or C++, read Extending
and Embedding the Python Interpreter and Python/C API Reference Manual. There are also several books
covering Python in depth.
This tutorial does not attempt to be comprehensive and cover every single feature, or even every commonly
used feature. Instead, it introduces many of Python’s most noteworthy features, and will give you a good
idea of the language’s flavor and style. After reading it, you will be able to read and write Python modules
and programs, and you will be ready to learn more about the various Python library modules described in
The Python Standard Library.
Numbers
The interpreter acts as a simple calculator: you can type an expression at it and it will write the value.
Expression syntax is straightforward: the operators +, -, * and / work just like in most other languages (for
example, Pascal or C); parentheses can be used for grouping. For example:
>>> 2+2
4
>>> # This is a comment
... 2+2
4
>>> 2+2 # and a comment on the same line as code
4
>>> (50-5*6)/4
5
>>> # Integer division returns the floor:
... 7/3
2
>>> 7/-3
-3
The equal sign ('=') is used to assign a value to a variable. Afterwards, no result is displayed before the next
interactive prompt:
>>> width = 20
>>> height = 5*9
>>> width * height
900
A value can be assigned to several variables simultaneously:
>>> x = y = z = 0 # Zero x, y and z
>>> x
0
>>> y
0
>>> z
0
Variables must be “defined” (assigned a value) before they can be used, or an error will occur:
>>> # try to access an undefined variable
... n
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'n' is not defined
There is full support for floating point; operators with mixed type operands convert the integer operand to
floating point:
>>> 3 * 3.75 / 1.5
7.5
>>> 7.0 / 2
3.5
Complex numbers are also supported; imaginary numbers are written with a suffix of j or J. Complex
numbers with a nonzero real component are written as (real+imagj), or can be created with the complex(real,
imag) function.
>>> 1j * 1J
(-1+0j)
>>> 1j * complex(0,1)
(-1+0j)
>>> 3+1j*3
(3+3j)
>>> (3+1j)*3
(9+3j)
>>> (1+2j)/(1+1j)
(1.5+0.5j)
Complex numbers are always represented as two floating point numbers, the real and imaginary part. To
extract these parts from a complex number z, use z.real and z.imag.
>>> a=1.5+0.5j
>>> a.real
1.5
>>> a.imag
0.5
The conversion functions to floating point and integer (float(), int() and long()) don’t work for complex
numbers — there is no one correct way to convert a complex number to a real number. Use abs(z) to get its
magnitude (as a float) or z.real to get its real part.
>>> a=3.0+4.0j
>>> float(a)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: can't convert complex to float; use abs(z)
>>> a.real
3.0
>>> a.imag
4.0
>>> abs(a) # sqrt(a.real**2 + a.imag**2)
5.0
In interactive mode, the last printed expression is assigned to the variable _. This means that when you are
using Python as a desk calculator, it is somewhat easier to continue calculations, for example:
>>> tax = 12.5 / 100
>>> price = 100.50
>>> price * tax
12.5625
>>> price + _
113.0625
>>> round(_, 2)
113.06
This variable should be treated as read-only by the user. Don’t explicitly assign a value to it — you would
create an independent local variable with the same name masking the built-in variable with its magic
behavior.
Strings
Besides numbers, Python can also manipulate strings, which can be expressed in several ways. They can be
enclosed in single quotes or double quotes:
>>> 'spam eggs'
'spam eggs'
>>> 'doesn\'t'
"doesn't"
>>> "doesn't"
"doesn't"
>>> '"Yes," he said.'
'"Yes," he said.'
>>> "\"Yes,\" he said."
'"Yes," he said.'
>>> '"Isn\'t," she said.'
'"Isn\'t," she said.'
The interpreter prints the result of string operations in the same way as they are typed for input: inside
quotes, and with quotes and other funny characters escaped by backslashes, to show the precise value. The
string is enclosed in double quotes if the string contains a single quote and no double quotes, else it’s
enclosed in single quotes. The print statement produces a more readable output for such input strings.
String literals can span multiple lines in several ways. Continuation lines can be used, with a backslash as
the last character on the line indicating that the next line is a logical continuation of the line:
hello = "This is a rather long string containing\n\
several lines of text just as you would do in C.\n\
Note that whitespace at the beginning of the line is\
significant."
print hello
Note that newlines still need to be embedded in the string using \n – the newline following the trailing
backslash is discarded. This example would print the following:
This is a rather long string containing
several lines of text just as you would do in C.
Note that whitespace at the beginning of the line is significant.
Or, strings can be surrounded in a pair of matching triple-quotes: """ or '''. End of lines do not need to be
escaped when using triple-quotes, but they will be included in the string.
print """
Usage: thingy [OPTIONS]
-h
Display this usage message
-H hostname
Hostname to connect to
"""
produces the following output:
Usage: thingy [OPTIONS]
-h
Display this usage message
-H hostname
Hostname to connect to
If we make the string literal a “raw” string, \n sequences are not converted to newlines, but the backslash at
the end of the line, and the newline character in the source, are both included in the string as data. Thus, the
example:
hello = r"This is a rather long string containing\n\
several lines of text much as you would do in C."
print hello
would print:
This is a rather long string containing\n\
several lines of text much as you would do in C.
The interpreter prints the result of string operations in the same way as they are typed for input: inside
quotes, and with quotes and other funny characters escaped by backslashes, to show the precise value. The
string is enclosed in double quotes if the string contains a single quote and no double quotes, else it’s
enclosed in single quotes. (The print statement, described later, can be used to write strings without quotes
or escapes.)
Strings can be concatenated (glued together) with the + operator, and repeated with *:
>>> word = 'Help' + 'A'
>>> word
'HelpA'
>>> '<' + word*5 + '>'
'<HelpAHelpAHelpAHelpAHelpA>'
Two string literals next to each other are automatically concatenated; the first line above could also have
been written word = 'Help' 'A'; this only works with two literals, not with arbitrary string expressions:
>>> 'str' 'ing'
# <- This is ok
'string'
>>> 'str'.strip() + 'ing' # <- This is ok
'string'
>>> 'str'.strip() 'ing' # <- This is invalid
File "<stdin>", line 1, in ?
'str'.strip() 'ing'
^
SyntaxError: invalid syntax
Strings can be subscripted (indexed); like in C, the first character of a string has subscript (index) 0. There is
no separate character type; a character is simply a string of size one. Like in Icon, substrings can be
specified with the slice notation: two indices separated by a colon.
>>> word[4]
'A'
>>> word[0:2]
'He'
>>> word[2:4]
'lp'
Slice indices have useful defaults; an omitted first index defaults to zero, an omitted second index defaults to
the size of the string being sliced.
>>> word[:2]
'He'
>>> word[2:]
'lpA'
# The first two characters
# Everything except the first two characters
Unlike a C string, Python strings cannot be changed. Assigning to an indexed position in the string results in
an error:
>>> word[0] = 'x'
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: object does not support item assignment
>>> word[:1] = 'Splat'
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: object does not support slice assignment
However, creating a new string with the combined content is easy and efficient:
>>> 'x' + word[1:]
'xelpA'
>>> 'Splat' + word[4]
'SplatA'
Here’s a useful invariant of slice operations: s[:i] + s[i:] equals s.
>>> word[:2] + word[2:]
'HelpA'
>>> word[:3] + word[3:]
'HelpA'
Degenerate slice indices are handled gracefully: an index that is too large is replaced by the string size, an
upper bound smaller than the lower bound returns an empty string.
>>> word[1:100]
'elpA'
>>> word[10:]
''
>>> word[2:1]
''
Indices may be negative numbers, to start counting from the right. For example:
>>> word[-1] # The last character
'A'
>>> word[-2] # The last-but-one character
'p'
>>> word[-2:] # The last two characters
'pA'
>>> word[:-2] # Everything except the last two characters
'Hel'
But note that -0 is really the same as 0, so it does not count from the right!
>>> word[-0]
'H'
# (since -0 equals 0)
Out-of-range negative slice indices are truncated, but don’t try this for single-element (non-slice) indices:
>>> word[-100:]
'HelpA'
>>> word[-10] # error
Traceback (most recent call last):
File "<stdin>", line 1, in ?
IndexError: string index out of range
One way to remember how slices work is to think of the indices as pointing between characters, with the left
edge of the first character numbered 0. Then the right edge of the last character of a string of n characters
has index n, for example:
+---+---+---+---+---+
|H|e|l|p|A|
+---+---+---+---+---+
0 1 2 3 4 5
-5 -4 -3 -2 -1
The first row of numbers gives the position of the indices 0...5 in the string; the second row gives the
corresponding negative indices. The slice from i to j consists of all characters between the edges labeled i
and j, respectively.
For non-negative indices, the length of a slice is the difference of the indices, if both are within bounds. For
example, the length of word[1:3] is 2.
The built-in function len() returns the length of a string:
>>> s = 'supercalifragilisticexpialidocious'
>>> len(s)
34
Unicode Strings
Starting with Python 2.0 a new data type for storing text data is available to the programmer: the Unicode
object. It can be used to store and manipulate Unicode data (see http://www.unicode.org/) and integrates
well with the existing string objects, providing auto-conversions where necessary.
Unicode has the advantage of providing one ordinal for every character in every script used in modern and
ancient texts. Previously, there were only 256 possible ordinals for script characters. Texts were typically
bound to a code page which mapped the ordinals to script characters. This lead to very much confusion
especially with respect to internationalization (usually written as i18n — 'i' + 18 characters + 'n') of
software. Unicode solves these problems by defining one code page for all scripts.
Creating Unicode strings in Python is just as simple as creating normal strings:
>>> u'Hello World !'
u'Hello World !'
The small 'u' in front of the quote indicates that a Unicode string is supposed to be created. If you want to
include special characters in the string, you can do so by using the Python Unicode-Escape encoding. The
following example shows how:
>>> u'Hello\u0020World !'
u'Hello World !'
The escape sequence \u0020 indicates to insert the Unicode character with the ordinal value 0x0020 (the
space character) at the given position.
Other characters are interpreted by using their respective ordinal values directly as Unicode ordinals. If you
have literal strings in the standard Latin-1 encoding that is used in many Western countries, you will find it
convenient that the lower 256 characters of Unicode are the same as the 256 characters of Latin-1.
For experts, there is also a raw mode just like the one for normal strings. You have to prefix the opening
quote with ‘ur’ to have Python use the Raw-Unicode-Escape encoding. It will only apply the above
\uXXXX conversion if there is an uneven number of backslashes in front of the small ‘u’.
>>> ur'Hello\u0020World !'
u'Hello World !'
>>> ur'Hello\\u0020World !'
u'Hello\\\\u0020World !'
The raw mode is most useful when you have to enter lots of backslashes, as can be necessary in regular
expressions.
Apart from these standard encodings, Python provides a whole set of other ways of creating Unicode strings
on the basis of a known encoding.
The built-in function unicode() provides access to all registered Unicode codecs (COders and DECoders).
Some of the more well known encodings which these codecs can convert are Latin-1, ASCII, UTF-8, and
UTF-16. The latter two are variable-length encodings that store each Unicode character in one or more
bytes. The default encoding is normally set to ASCII, which passes through characters in the range 0 to 127
and rejects any other characters with an error. When a Unicode string is printed, written to a file, or
converted with str(), conversion takes place using this default encoding.
>>> u"abc"
u'abc'
>>> str(u"abc")
'abc'
>>> u"äöü"
u'\xe4\xf6\xfc'
>>> str(u"äöü")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
To convert a Unicode string into an 8-bit string using a specific encoding, Unicode objects provide an
encode() method that takes one argument, the name of the encoding. Lowercase names for encodings are
preferred.
>>> u"äöü".encode('utf-8')
'\xc3\xa4\xc3\xb6\xc3\xbc'
If you have data in a specific encoding and want to produce a corresponding Unicode string from it, you can
use the unicode() function with the encoding name as the second argument.
>>> unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf-8')
u'\xe4\xf6\xfc'
Lists
Python knows a number of compound data types, used to group together other values. The most versatile is
the list, which can be written as a list of comma-separated values (items) between square brackets. List items
need not all have the same type.
>>> a = ['spam', 'eggs', 100, 1234]
>>> a
['spam', 'eggs', 100, 1234]
Like string indices, list indices start at 0, and lists can be sliced, concatenated and so on:
>>> a[0]
'spam'
>>> a[3]
1234
>>> a[-2]
100
>>> a[1:-1]
['eggs', 100]
>>> a[:2] + ['bacon', 2*2]
['spam', 'eggs', 'bacon', 4]
>>> 3*a[:3] + ['Boo!']
['spam', 'eggs', 100, 'spam', 'eggs', 100, 'spam', 'eggs', 100, 'Boo!']
All slice operations return a new list containing the requested elements. This means that the following slice
returns a shallow copy of the list a:
>>> a[:]
['spam', 'eggs', 100, 1234]
Unlike strings, which are immutable, it is possible to change individual elements of a list:
>>> a
['spam', 'eggs', 100, 1234]
>>> a[2] = a[2] + 23
>>> a
['spam', 'eggs', 123, 1234]
Assignment to slices is also possible, and this can even change the size of the list or clear it entirely:
>>> # Replace some items:
... a[0:2] = [1, 12]
>>> a
[1, 12, 123, 1234]
>>> # Remove some:
... a[0:2] = []
>>> a
[123, 1234]
>>> # Insert some:
... a[1:1] = ['bletch', 'xyzzy']
>>> a
[123, 'bletch', 'xyzzy', 1234]
>>> # Insert (a copy of) itself at the beginning
>>> a[:0] = a
>>> a
[123, 'bletch', 'xyzzy', 1234, 123, 'bletch', 'xyzzy', 1234]
>>> # Clear the list: replace all items with an empty list
>>> a[:] = []
>>> a
[]
The built-in function len() also applies to lists:
>>> a = ['a', 'b', 'c', 'd']
>>> len(a)
4
It is possible to nest lists (create lists containing other lists), for example:
>>> q = [2, 3]
>>> p = [1, q, 4]
>>> len(p)
3
>>> p[1]
[2, 3]
>>> p[1][0]
2
>>> p[1].append('xtra')
>>> p
[1, [2, 3, 'xtra'], 4]
>>> q
[2, 3, 'xtra']
# See section 5.1
Note that in the last example, p[1] and q really refer to the same object! We’ll come back to object semantics
later.
del statement
There is a way to remove an item from a list given its index instead of its value: the del statement. This
differs from the pop() method which returns a value. The del statement can also be used to remove slices
from a list or clear the entire list (which we did earlier by assignment of an empty list to the slice). For
example:
>>> a = [-1, 1, 66.25, 333, 333, 1234.5]
>>> del a[0]
>>> a
[1, 66.25, 333, 333, 1234.5]
>>> del a[2:4]
>>> a
[1, 66.25, 1234.5]
>>> del a[:]
>>> a
[]
del can also be used to delete entire variables:
>>> del a
Referencing the name a hereafter is an error (at least until another value is assigned to it). We’ll find other
uses for del later.
Tuples and Sequences
We saw that lists and strings have many common properties, such as indexing and slicing operations. They
are two examples of sequence data types (see Sequence Types — str, unicode, list, tuple, bytearray, buffer,
xrange). Since Python is an evolving language, other sequence data types may be added. There is also
another standard sequence data type: the tuple.
A tuple consists of a number of values separated by commas, for instance:
>>> t = 12345, 54321, 'hello!'
>>> t[0]
12345
>>> t
(12345, 54321, 'hello!')
>>> # Tuples may be nested:
... u = t, (1, 2, 3, 4, 5)
>>> u
((12345, 54321, 'hello!'), (1, 2, 3, 4, 5))
As you see, on output tuples are always enclosed in parentheses, so that nested tuples are interpreted
correctly; they may be input with or without surrounding parentheses, although often parentheses are
necessary anyway (if the tuple is part of a larger expression).
Tuples have many uses. For example: (x, y) coordinate pairs, employee records from a database, etc. Tuples,
like strings, are immutable: it is not possible to assign to the individual items of a tuple (you can simulate
much of the same effect with slicing and concatenation, though). It is also possible to create tuples which
contain mutable objects, such as lists.
A special problem is the construction of tuples containing 0 or 1 items: the syntax has some extra quirks to
accommodate these. Empty tuples are constructed by an empty pair of parentheses; a tuple with one item is
constructed by following a value with a comma (it is not sufficient to enclose a single value in parentheses).
Ugly, but effective. For example:
>>> empty = ()
>>> singleton = 'hello',
>>> len(empty)
0
>>> len(singleton)
1
>>> singleton
('hello',)
# <-- note trailing comma
The statement t = 12345, 54321, 'hello!' is an example of tuple packing: the values 12345, 54321 and 'hello!'
are packed together in a tuple. The reverse operation is also possible:
>>> x, y, z = t
This is called, appropriately enough, sequence unpacking and works for any sequence on the right-hand side.
Sequence unpacking requires the list of variables on the left to have the same number of elements as the
length of the sequence. Note that multiple assignment is really just a combination of tuple packing and
sequence unpacking.
Sets
Python also includes a data type for sets. A set is an unordered collection with no duplicate elements. Basic
uses include membership testing and eliminating duplicate entries. Set objects also support mathematical
operations like union, intersection, difference, and symmetric difference.
Here is a brief demonstration:
>>> basket = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana']
>>> fruit = set(basket)
# create a set without duplicates
>>> fruit
set(['orange', 'pear', 'apple', 'banana'])
>>> 'orange' in fruit
# fast membership testing
True
>>> 'crabgrass' in fruit
False
>>> # Demonstrate set operations on unique letters from two words
...
>>> a = set('abracadabra')
>>> b = set('alacazam')
>>> a
# unique letters in a
set(['a', 'r', 'b', 'c', 'd'])
>>> a - b
# letters in a but not in b
set(['r', 'd', 'b'])
>>> a | b
# letters in either a or b
set(['a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'])
>>> a & b
# letters in both a and b
set(['a', 'c'])
>>> a ^ b
# letters in a or b but not both
set(['r', 'd', 'b', 'm', 'z', 'l'])
Dictionaries
Another useful data type built into Python is the dictionary. Dictionaries are sometimes found in other
languages as “associative memories” or “associative arrays”. Unlike sequences, which are indexed by a
range of numbers, dictionaries are indexed by keys, which can be any immutable type; strings and numbers
can always be keys. Tuples can be used as keys if they contain only strings, numbers, or tuples; if a tuple
contains any mutable object either directly or indirectly, it cannot be used as a key. You can’t use lists as
keys, since lists can be modified in place using index assignments, slice assignments, or methods like
append() and extend().
It is best to think of a dictionary as an unordered set of key: value pairs, with the requirement that the keys
are unique (within one dictionary). A pair of braces creates an empty dictionary: {}. Placing a commaseparated list of key:value pairs within the braces adds initial key:value pairs to the dictionary; this is also
the way dictionaries are written on output.
The main operations on a dictionary are storing a value with some key and extracting the value given the
key. It is also possible to delete a key:value pair with del. If you store using a key that is already in use, the
old value associated with that key is forgotten. It is an error to extract a value using a non-existent key.
The keys() method of a dictionary object returns a list of all the keys used in the dictionary, in arbitrary
order (if you want it sorted, just apply the sorted() function to it). To check whether a single key is in the
dictionary, use the in keyword.
Here is a small example using a dictionary:
>>> tel = {'jack': 4098, 'sape': 4139}
>>> tel['guido'] = 4127
>>> tel
{'sape': 4139, 'guido': 4127, 'jack': 4098}
>>> tel['jack']
4098
>>> del tel['sape']
>>> tel['irv'] = 4127
>>> tel
{'guido': 4127, 'irv': 4127, 'jack': 4098}
>>> tel.keys()
['guido', 'irv', 'jack']
>>> 'guido' in tel
True
The dict() constructor builds dictionaries directly from lists of key-value pairs stored as tuples. When the
pairs form a pattern, list comprehensions can compactly specify the key-value list.
>>> dict([('sape', 4139), ('guido', 4127), ('jack', 4098)])
{'sape': 4139, 'jack': 4098, 'guido': 4127}
>>> dict([(x, x**2) for x in (2, 4, 6)]) # use a list comprehension
{2: 4, 4: 16, 6: 36}
Later in the tutorial, we will learn about Generator Expressions which are even better suited for the task of
supplying key-values pairs to the dict() constructor.
When the keys are simple strings, it is sometimes easier to specify pairs using keyword arguments:
>>> dict(sape=4139, guido=4127, jack=4098)
{'sape': 4139, 'jack': 4098, 'guido': 4127}
Looping Techniques
When looping through dictionaries, the key and corresponding value can be retrieved at the same time using
the iteritems() method.
>>> knights = {'gallahad': 'the pure', 'robin': 'the brave'}
>>> for k, v in knights.iteritems():
... print k, v
...
gallahad the pure
robin the brave
When looping through a sequence, the position index and corresponding value can be retrieved at the same
time using the enumerate() function.
>>> for i, v in enumerate(['tic', 'tac', 'toe']):
... print i, v
...
0 tic
1 tac
2 toe
To loop over two or more sequences at the same time, the entries can be paired with the zip() function.
>>> questions = ['name', 'quest', 'favorite color']
>>> answers = ['lancelot', 'the holy grail', 'blue']
>>> for q, a in zip(questions, answers):
... print 'What is your {0}? It is {1}.'.format(q, a)
...
What is your name? It is lancelot.
What is your quest? It is the holy grail.
What is your favorite color? It is blue.
To loop over a sequence in reverse, first specify the sequence in a forward direction and then call the
reversed() function.
>>> for i in reversed(xrange(1,10,2)):
... print i
...
9
7
5
3
1
To loop over a sequence in sorted order, use the sorted() function which returns a new sorted list while
leaving the source unaltered.
>>> basket = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana']
>>> for f in sorted(set(basket)):
... print f
...
apple
banana
orange
pear
CONTROL FLOW TOOLS
if Statements
Perhaps the most well-known statement type is the if statement. For example:
>>> x = int(raw_input("Please enter an integer: "))
Please enter an integer: 42
>>> if x < 0:
...
x=0
...
print 'Negative changed to zero'
... elif x == 0:
...
print 'Zero'
... elif x == 1:
...
print 'Single'
... else:
...
print 'More'
...
More
There can be zero or more elif parts, and the else part is optional. The keyword ‘elif‘ is short for ‘else if’,
and is useful to avoid excessive indentation. An if ... elif ... elif ... sequence is a substitute for the switch or
case statements found in other languages.
for Statements
The for statement in Python differs a bit from what you may be used to in C or Pascal. Rather than always
iterating over an arithmetic progression of numbers (like in Pascal), or giving the user the ability to define
both the iteration step and halting condition (as C), Python’s for statement iterates over the items of any
sequence (a list or a string), in the order that they appear in the sequence. For example:
>>> # Measure some strings:
... a = ['cat', 'window', 'defenestrate']
>>> for x in a:
... print x, len(x)
...
cat 3
window 6
defenestrate 12
It is not safe to modify the sequence being iterated over in the loop (this can only happen for mutable
sequence types, such as lists). If you need to modify the list you are iterating over (for example, to duplicate
selected items) you must iterate over a copy. The slice notation makes this particularly convenient:
>>> for x in a[:]: # make a slice copy of the entire list
... if len(x) > 6: a.insert(0, x)
...
>>> a
['defenestrate', 'cat', 'window', 'defenestrate']
range() Function
If you do need to iterate over a sequence of numbers, the built-in function range() comes in handy. It
generates lists containing arithmetic progressions:
>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
The given end point is never part of the generated list; range(10) generates a list of 10 values, the legal
indices for items of a sequence of length 10. It is possible to let the range start at another number, or to
specify a different increment (even negative; sometimes this is called the ‘step’):
>>> range(5, 10)
[5, 6, 7, 8, 9]
>>> range(0, 10, 3)
[0, 3, 6, 9]
>>> range(-10, -100, -30)
[-10, -40, -70]
To iterate over the indices of a sequence, you can combine range() and len() as follows:
>>> a = ['Mary', 'had', 'a', 'little', 'lamb']
>>> for i in range(len(a)):
... print i, a[i]
...
0 Mary
1 had
2a
3 little
4 lamb
In most such cases, however, it is convenient to use the enumerate() function.
break and continue Statements, and else Clauses on Loops
The break statement, like in C, breaks out of the smallest enclosing for or while loop.
The continue statement, also borrowed from C, continues with the next iteration of the loop.
Loop statements may have an else clause; it is executed when the loop terminates through exhaustion of the
list (with for) or when the condition becomes false (with while), but not when the loop is terminated by a
break statement. This is exemplified by the following loop, which searches for prime numbers:
>>> for n in range(2, 10):
... for x in range(2, n):
...
if n % x == 0:
...
print n, 'equals', x, '*', n/x
...
break
... else:
...
# loop fell through without finding a factor
...
print n, 'is a prime number'
...
2 is a prime number
3 is a prime number
4 equals 2 * 2
5 is a prime number
6 equals 2 * 3
7 is a prime number
8 equals 2 * 4
9 equals 3 * 3
pass Statements
The pass statement does nothing. It can be used when a statement is required syntactically but the program
requires no action. For example:
>>> while True:
... pass # Busy-wait for keyboard interrupt (Ctrl+C)
...
This is commonly used for creating minimal classes:
>>> class MyEmptyClass:
... pass
...
Another place pass can be used is as a place-holder for a function or conditional body when you are working
on new code, allowing you to keep thinking at a more abstract level. The pass is silently ignored:
>>> def initlog(*args):
... pass # Remember to implement this!
...
Defining Functions
We can create a function that writes the Fibonacci series to an arbitrary boundary:
>>> def fib(n): # write Fibonacci series up to n
... """Print a Fibonacci series up to n."""
... a, b = 0, 1
... while a < n:
...
print a,
...
a, b = b, a+b
...
>>> # Now call the function we just defined:
... fib(2000)
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597
The keyword def introduces a function definition. It must be followed by the function name and the
parenthesized list of formal parameters. The statements that form the body of the function start at the next
line, and must be indented.
The first statement of the function body can optionally be a string literal; this string literal is the function’s
documentation string, or docstring. There are tools which use docstrings to automatically produce online or
printed documentation, or to let the user interactively browse through code; it’s good practice to include
docstrings in code that you write, so make a habit of it.
The execution of a function introduces a new symbol table used for the local variables of the function. More
precisely, all variable assignments in a function store the value in the local symbol table; whereas variable
references first look in the local symbol table, then in the local symbol tables of enclosing functions, then in
the global symbol table, and finally in the table of built-in names. Thus, global variables cannot be directly
assigned a value within a function (unless named in a global statement), although they may be referenced.
The actual parameters (arguments) to a function call are introduced in the local symbol table of the called
function when it is called; thus, arguments are passed using call by value (where the value is always an
object reference, not the value of the object). When a function calls another function, a new local symbol
table is created for that call.
A function definition introduces the function name in the current symbol table. The value of the function
name has a type that is recognized by the interpreter as a user-defined function. This value can be assigned
to another name which can then also be used as a function. This serves as a general renaming mechanism:
>>> fib
<function fib at 10042ed0>
>>> f = fib
>>> f(100)
0 1 1 2 3 5 8 13 21 34 55 89
Coming from other languages, you might object that fib is not a function but a procedure since it doesn’t
return a value. In fact, even functions without a return statement do return a value, albeit a rather boring one.
This value is called None (it’s a built-in name). Writing the value None is normally suppressed by the
interpreter if it would be the only value written. You can see it if you really want to using print:
>>> fib(0)
>>> print fib(0)
None
It is simple to write a function that returns a list of the numbers of the Fibonacci series, instead of printing it:
>>> def fib2(n): # return Fibonacci series up to n
... """Return a list containing the Fibonacci series up to n."""
... result = []
... a, b = 0, 1
... while a < n:
...
result.append(a) # see below
...
a, b = b, a+b
... return result
...
>>> f100 = fib2(100) # call it
>>> f100
# write the result
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
This example, as usual, demonstrates some new Python features:


The return statement returns with a value from a function. return without an expression argument
returns None. Falling off the end of a function also returns None.
The statement result.append(a) calls a method of the list object result. A method is a function that
‘belongs’ to an object and is named obj.methodname, where obj is some object (this may be an
expression), and methodname is the name of a method that is defined by the object’s type. Different
types define different methods. Methods of different types may have the same name without causing
ambiguity. (It is possible to define your own object types and methods, using classes) The method
append() shown in the example is defined for list objects; it adds a new element at the end of the list.
In this example it is equivalent to result = result + [a], but more efficient.
Standard Modules
Python comes with a library of standard modules, described in a separate document, the Python Library
Reference (“Library Reference” hereafter). Some modules are built into the interpreter; these provide access
to operations that are not part of the core of the language but are nevertheless built in, either for efficiency or
to provide access to operating system primitives such as system calls. The set of such modules is a
configuration option which also depends on the underlying platform For example, the winreg module is only
provided on Windows systems. One particular module deserves some attention: sys, which is built into
every Python interpreter. The variables sys.ps1 and sys.ps2 define the strings used as primary and secondary
prompts:
>>> import sys
>>> sys.ps1
'>>> '
>>> sys.ps2
'... '
>>> sys.ps1 = 'C> '
C> print 'Yuck!'
Yuck!
C>
These two variables are only defined if the interpreter is in interactive mode.
The variable sys.path is a list of strings that determines the interpreter’s search path for modules. It is
initialized to a default path taken from the environment variable PYTHONPATH, or from a built-in default
if PYTHONPATH is not set. You can modify it using standard list operations:
>>> import sys
>>> sys.path.append('/ufs/guido/lib/python')
Importing * From a Package
Now what happens when the user writes from sound.effects import *? Ideally, one would hope that this
somehow goes out to the filesystem, finds which submodules are present in the package, and imports them
all. This could take a long time and importing sub-modules might have unwanted side-effects that should
only happen when the sub-module is explicitly imported.
The only solution is for the package author to provide an explicit index of the package. The import statement
uses the following convention: if a package’s __init__.py code defines a list named __all__, it is taken to be
the list of module names that should be imported when from package import * is encountered. It is up to the
package author to keep this list up-to-date when a new version of the package is released. Package authors
may also decide not to support it, if they don’t see a use for importing * from their package. For example,
the file sounds/effects/__init__.py could contain the following code:
__all__ = ["echo", "surround", "reverse"]
This would mean that from sound.effects import * would import the three named submodules of the sound
package.
If __all__ is not defined, the statement from sound.effects import * does not import all submodules from the
package sound.effects into the current namespace; it only ensures that the package sound.effects has been
imported (possibly running any initialization code in __init__.py) and then imports whatever names are
defined in the package. This includes any names defined (and submodules explicitly loaded) by __init__.py.
It also includes any submodules of the package that were explicitly loaded by previous import statements.
Consider this code:
import sound.effects.echo
import sound.effects.surround
from sound.effects import *
In this example, the echo and surround modules are imported in the current namespace because they are
defined in the sound.effects package when the from...import statement is executed. (This also works when
__all__ is defined.)
Although certain modules are designed to export only names that follow certain patterns when you use
import *, it is still considered bad practise in production code.
Remember, there is nothing wrong with using from Package import specific_submodule! In fact, this is the
recommended notation unless the importing module needs to use submodules with the same name from
different packages.
Input and Output
There are several ways to present the output of a program; data can be printed in a human-readable form, or
written to a file for future use. This chapter will discuss some of the possibilities.
Fancier Output Formatting
So far we’ve encountered two ways of writing values: expression statements and the print statement. A third
way is using the write() method of file objects; the standard output file can be referenced as sys.stdout. See
the Library Reference for more information on this.
Often you’ll want more control over the formatting of your output than simply printing space-separated
values. There are two ways to format your output; the first way is to do all the string handling yourself;
using string slicing and concatenation operations you can create any layout you can imagine. The string
types have some methods that perform useful operations for padding strings to a given column width; these
will be discussed shortly. The second way is to use the str.format() method.
The string module contains a Template class which offers yet another way to substitute values into strings.
One question remains, of course: how do you convert values to strings? Luckily, Python has ways to convert
any value to a string: pass it to the repr() or str() functions.
The str() function is meant to return representations of values which are fairly human-readable, while repr()
is meant to generate representations which can be read by the interpreter (or will force a SyntaxError if there
is not equivalent syntax). For objects which don’t have a particular representation for human consumption,
str() will return the same value as repr(). Many values, such as numbers or structures like lists and
dictionaries, have the same representation using either function. Strings and floating point numbers, in
particular, have two distinct representations.
Some examples:
>>> s = 'Hello, world.'
>>> str(s)
'Hello, world.'
>>> repr(s)
"'Hello, world.'"
>>> str(1.0/7.0)
'0.142857142857'
>>> repr(1.0/7.0)
'0.14285714285714285'
>>> x = 10 * 3.25
>>> y = 200 * 200
>>> s = 'The value of x is ' + repr(x) + ', and y is ' + repr(y) + '...'
>>> print s
The value of x is 32.5, and y is 40000...
>>> # The repr() of a string adds string quotes and backslashes:
... hello = 'hello, world\n'
>>> hellos = repr(hello)
>>> print hellos
'hello, world\n'
>>> # The argument to repr() may be any Python object:
... repr((x, y, ('spam', 'eggs')))
"(32.5, 40000, ('spam', 'eggs'))"
Here are two ways to write a table of squares and cubes:
>>> for x in range(1, 11):
... print repr(x).rjust(2), repr(x*x).rjust(3),
... # Note trailing comma on previous line
... print repr(x*x*x).rjust(4)
...
1 1 1
2 4 8
3 9 27
4 16 64
5 25 125
6 36 216
7 49 343
8 64 512
9 81 729
10 100 1000
>>> for x in range(1,11):
... print '{0:2d} {1:3d} {2:4d}'.format(x, x*x, x*x*x)
...
1 1 1
2 4 8
3 9 27
4 16 64
5 25 125
6 36 216
7 49 343
8 64 512
9 81 729
10 100 1000
(Note that in the first example, one space between each column was added by the way print works: it always
adds spaces between its arguments.)
This example demonstrates the str.rjust() method of string objects, which right-justifies a string in a field of
a given width by padding it with spaces on the left. There are similar methods str.ljust() and str.center().
These methods do not write anything, they just return a new string. If the input string is too long, they don’t
truncate it, but return it unchanged; this will mess up your column lay-out but that’s usually better than the
alternative, which would be lying about a value. (If you really want truncation you can always add a slice
operation, as in x.ljust(n)[:n].)
There is another method, str.zfill(), which pads a numeric string on the left with zeros. It understands about
plus and minus signs:
>>> '12'.zfill(5)
'00012'
>>> '-3.14'.zfill(7)
'-003.14'
>>> '3.14159265359'.zfill(5)
'3.14159265359'
Basic usage of the str.format() method looks like this:
>>> print 'We are the {} who say "{}!"'.format('knights', 'Ni')
We are the knights who say "Ni!"
The brackets and characters within them (called format fields) are replaced with the objects passed into the
str.format() method. A number in the brackets refers to the position of the object passed into the str.format()
method.
>>> print '{0} and {1}'.format('spam', 'eggs')
spam and eggs
>>> print '{1} and {0}'.format('spam', 'eggs')
eggs and spam
If keyword arguments are used in the str.format() method, their values are referred to by using the name of
the argument.
>>> print 'This {food} is {adjective}.'.format(
...
food='spam', adjective='absolutely horrible')
This spam is absolutely horrible.
Positional and keyword arguments can be arbitrarily combined:
>>> print 'The story of {0}, {1}, and {other}.'.format('Bill', 'Manfred',
...
other='Georg')
The story of Bill, Manfred, and Georg.
'!s' (apply str()) and '!r' (apply repr()) can be used to convert the value before it is formatted.
>>> import math
>>> print 'The value of PI is approximately {}.'.format(math.pi)
The value of PI is approximately 3.14159265359.
>>> print 'The value of PI is approximately {!r}.'.format(math.pi)
The value of PI is approximately 3.141592653589793.
An optional ':' and format specifier can follow the field name. This allows greater control over how the value
is formatted. The following example rounds Pi to three places after the decimal.
>>> import math
>>> print 'The value of PI is approximately {0:.3f}.'.format(math.pi)
The value of PI is approximately 3.142.
Passing an integer after the ':' will cause that field to be a minimum number of characters wide. This is
useful for making tables pretty.
>>> table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 7678}
>>> for name, phone in table.items():
... print '{0:10} ==> {1:10d}'.format(name, phone)
...
Jack
==>
4098
Dcab
==>
7678
Sjoerd ==>
4127
If you have a really long format string that you don’t want to split up, it would be nice if you could reference
the variables to be formatted by name instead of by position. This can be done by simply passing the dict
and using square brackets '[]' to access the keys
>>> table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 8637678}
>>> print ('Jack: {0[Jack]:d}; Sjoerd: {0[Sjoerd]:d}; '
...
'Dcab: {0[Dcab]:d}'.format(table))
Jack: 4098; Sjoerd: 4127; Dcab: 8637678
This could also be done by passing the table as keyword arguments with the ‘**’ notation.
>>> table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 8637678}
>>> print 'Jack: {Jack:d}; Sjoerd: {Sjoerd:d}; Dcab: {Dcab:d}'.format(**table)
Jack: 4098; Sjoerd: 4127; Dcab: 8637678
This is particularly useful in combination with the built-in function vars(), which returns a dictionary
containing all local variables.
Old string formatting
The % operator can also be used for string formatting. It interprets the left argument much like a sprintf()style format string to be applied to the right argument, and returns the string resulting from this formatting
operation. For example:
>>> import math
>>> print 'The value of PI is approximately %5.3f.' % math.pi
The value of PI is approximately 3.142.
Since str.format() is quite new, a lot of Python code still uses the % operator. However, because this old
style of formatting will eventually be removed from the language, str.format() should generally be used.
Reading and Writing Files
open() returns a file object, and is most commonly used with two arguments: open(filename, mode).
>>> f = open('/tmp/workfile', 'w')
>>> print f
<open file '/tmp/workfile', mode 'w' at 80a0960>
The first argument is a string containing the filename. The second argument is another string containing a
few characters describing the way in which the file will be used. mode can be 'r' when the file will only be
read, 'w' for only writing (an existing file with the same name will be erased), and 'a' opens the file for
appending; any data written to the file is automatically added to the end. 'r+' opens the file for both reading
and writing. The mode argument is optional; 'r' will be assumed if it’s omitted.
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb',
and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters
in text files are automatically altered slightly when data is read or written. This behind-the-scenes
modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE
files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to
append a 'b' to the mode, so you can use it platform-independently for all binary files.
Methods of File Objects
The rest of the examples in this section will assume that a file object called f has already been created.
To read a file’s contents, call f.read(size), which reads some quantity of data and returns it as a string. size is
an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read
and returned; it’s your problem if the file is twice as large as your machine’s memory. Otherwise, at most
size bytes are read and returned. If the end of the file has been reached, f.read() will return an empty string
("").
>>> f.read()
'This is the entire file.\n'
>>> f.read()
''
f.readline() reads a single line from the file; a newline character (\n) is left at the end of the string, and is
only omitted on the last line of the file if the file doesn’t end in a newline. This makes the return value
unambiguous; if f.readline() returns an empty string, the end of the file has been reached, while a blank line
is represented by '\n', a string containing only a single newline.
>>> f.readline()
'This is the first line of the file.\n'
>>> f.readline()
'Second line of the file\n'
>>> f.readline()
''
f.readlines() returns a list containing all the lines of data in the file. If given an optional parameter sizehint, it
reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This
is often used to allow efficient reading of a large file by lines, but without having to load the entire file in
memory. Only complete lines will be returned.
>>> f.readlines()
['This is the first line of the file.\n', 'Second line of the file\n']
An alternative approach to reading lines is to loop over the file object. This is memory efficient, fast, and
leads to simpler code:
>>> for line in f:
print line,
This is the first line of the file.
Second line of the file
The alternative approach is simpler but does not provide as fine-grained control. Since the two approaches
manage line buffering differently, they should not be mixed.
f.write(string) writes the contents of string to the file, returning None.
>>> f.write('This is a test\n')
To write something other than a string, it needs to be converted to a string first:
>>> value = ('the answer', 42)
>>> s = str(value)
>>> f.write(s)
f.tell() returns an integer giving the file object’s current position in the file, measured in bytes from the
beginning of the file. To change the file object’s position, use f.seek(offset, from_what). The position is
computed from adding offset to a reference point; the reference point is selected by the from_what argument.
A from_what value of 0 measures from the beginning of the file, 1 uses the current file position, and 2 uses
the end of the file as the reference point. from_what can be omitted and defaults to 0, using the beginning of
the file as the reference point.
>>> f = open('/tmp/workfile', 'r+')
>>> f.write('0123456789abcdef')
>>> f.seek(5) # Go to the 6th byte in the file
>>> f.read(1)
'5'
>>> f.seek(-3, 2) # Go to the 3rd byte before the end
>>> f.read(1)
'd'
When you’re done with a file, call f.close() to close it and free up any system resources taken up by the open
file. After calling f.close(), attempts to use the file object will automatically fail.
>>> f.close()
>>> f.read()
Traceback (most recent call last):
File "<stdin>", line 1, in ?
ValueError: I/O operation on closed file
It is good practice to use the with keyword when dealing with file objects. This has the advantage that the
file is properly closed after its suite finishes, even if an exception is raised on the way. It is also much
shorter than writing equivalent try-finally blocks:
>>> with open('/tmp/workfile', 'r') as f:
... read_data = f.read()
>>> f.closed
True
File objects have some additional methods, such as isatty() and truncate() which are less frequently used.
Errors and Exceptions
Until now error messages haven’t been more than mentioned, but if you have tried out the examples you
have probably seen some. There are (at least) two distinguishable kinds of errors: syntax errors and
exceptions.
Syntax Errors
Syntax errors, also known as parsing errors, are perhaps the most common kind of complaint you get while
you are still learning Python:
>>> while True print 'Hello world'
File "<stdin>", line 1, in ?
while True print 'Hello world'
^
SyntaxError: invalid syntax
The parser repeats the offending line and displays a little ‘arrow’ pointing at the earliest point in the line
where the error was detected. The error is caused by (or at least detected at) the token preceding the arrow:
in the example, the error is detected at the keyword print, since a colon (':') is missing before it. File name
and line number are printed so you know where to look in case the input came from a script.
Exceptions
Even if a statement or expression is syntactically correct, it may cause an error when an attempt is made to
execute it. Errors detected during execution are called exceptions and are not unconditionally fatal: you will
soon learn how to handle them in Python programs. Most exceptions are not handled by programs, however,
and result in error messages as shown here:
>>> 10 * (1/0)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
ZeroDivisionError: integer division or modulo by zero
>>> 4 + spam*3
Traceback (most recent call last):
File "<stdin>", line 1, in ?
NameError: name 'spam' is not defined
>>> '2' + 2
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: cannot concatenate 'str' and 'int' objects
The last line of the error message indicates what happened. Exceptions come in different types, and the type
is printed as part of the message: the types in the example are ZeroDivisionError, NameError and TypeError.
The string printed as the exception type is the name of the built-in exception that occurred. This is true for
all built-in exceptions, but need not be true for user-defined exceptions (although it is a useful convention).
Standard exception names are built-in identifiers (not reserved keywords).
The rest of the line provides detail based on the type of exception and what caused it.
The preceding part of the error message shows the context where the exception happened, in the form of a
stack traceback. In general it contains a stack traceback listing source lines; however, it will not display lines
read from standard input.
Built-in Exceptions lists the built-in exceptions and their meanings.
Handling Exceptions
It is possible to write programs that handle selected exceptions. Look at the following example, which asks
the user for input until a valid integer has been entered, but allows the user to interrupt the program (using
Control-C or whatever the operating system supports); note that a user-generated interruption is signalled by
raising the KeyboardInterrupt exception.
>>> while True:
... try:
...
x = int(raw_input("Please enter a number: "))
...
break
... except ValueError:
...
print "Oops! That was no valid number. Try again..."
...
The try statement works as follows.




First, the try clause (the statement(s) between the try and except keywords) is executed.
If no exception occurs, the except clause is skipped and execution of the try statement is finished.
If an exception occurs during execution of the try clause, the rest of the clause is skipped. Then if its
type matches the exception named after the except keyword, the except clause is executed, and then
execution continues after the try statement.
If an exception occurs which does not match the exception named in the except clause, it is passed
on to outer try statements; if no handler is found, it is an unhandled exception and execution stops
with a message as shown above.
A try statement may have more than one except clause, to specify handlers for different exceptions. At most
one handler will be executed. Handlers only handle exceptions that occur in the corresponding try clause,
not in other handlers of the same try statement. An except clause may name multiple exceptions as a
parenthesized tuple, for example:
... except (RuntimeError, TypeError, NameError):
... pass
The last except clause may omit the exception name(s), to serve as a wildcard. Use this with extreme
caution, since it is easy to mask a real programming error in this way! It can also be used to print an error
message and then re-raise the exception (allowing a caller to handle the exception as well):
import sys
try:
f = open('myfile.txt')
s = f.readline()
i = int(s.strip())
except IOError as (errno, strerror):
print "I/O error({0}): {1}".format(errno, strerror)
except ValueError:
print "Could not convert data to an integer."
except:
print "Unexpected error:", sys.exc_info()[0]
raise
The try ... except statement has an optional else clause, which, when present, must follow all except clauses.
It is useful for code that must be executed if the try clause does not raise an exception. For example:
for arg in sys.argv[1:]:
try:
f = open(arg, 'r')
except IOError:
print 'cannot open', arg
else:
print arg, 'has', len(f.readlines()), 'lines'
f.close()
The use of the else clause is better than adding additional code to the try clause because it avoids
accidentally catching an exception that wasn’t raised by the code being protected by the try ... except
statement.
When an exception occurs, it may have an associated value, also known as the exception’s argument. The
presence and type of the argument depend on the exception type.
The except clause may specify a variable after the exception name (or tuple). The variable is bound to an
exception instance with the arguments stored in instance.args. For convenience, the exception instance
defines __str__() so the arguments can be printed directly without having to reference .args.
One may also instantiate an exception first before raising it and add any attributes to it as desired.
>>> try:
... raise Exception('spam', 'eggs')
... except Exception as inst:
... print type(inst) # the exception instance
... print inst.args # arguments stored in .args
... print inst
# __str__ allows args to printed directly
... x, y = inst
# __getitem__ allows args to be unpacked directly
... print 'x =', x
... print 'y =', y
...
<type 'exceptions.Exception'>
('spam', 'eggs')
('spam', 'eggs')
x = spam
y = eggs
If an exception has an argument, it is printed as the last part (‘detail’) of the message for unhandled
exceptions.
Exception handlers don’t just handle exceptions if they occur immediately in the try clause, but also if they
occur inside functions that are called (even indirectly) in the try clause. For example:
>>> def this_fails():
... x = 1/0
...
>>> try:
... this_fails()
... except ZeroDivisionError as detail:
... print 'Handling run-time error:', detail
...
Handling run-time error: integer division or modulo by zero
Raising Exceptions
The raise statement allows the programmer to force a specified exception to occur. For example:
>>> raise NameError('HiThere')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
NameError: HiThere
The sole argument to raise indicates the exception to be raised. This must be either an exception instance or
an exception class (a class that derives from Exception).
If you need to determine whether an exception was raised but don’t intend to handle it, a simpler form of the
raise statement allows you to re-raise the exception:
>>> try:
... raise NameError('HiThere')
... except NameError:
... print 'An exception flew by!'
... raise
...
An exception flew by!
Traceback (most recent call last):
File "<stdin>", line 2, in ?
NameError: HiThere
User-defined Exceptions
Programs may name their own exceptions by creating a new exception class. Exceptions should typically be
derived from the Exception class, either directly or indirectly. For example:
>>> class MyError(Exception):
... def __init__(self, value):
...
self.value = value
... def __str__(self):
...
return repr(self.value)
...
>>> try:
... raise MyError(2*2)
... except MyError as e:
... print 'My exception occurred, value:', e.value
...
My exception occurred, value: 4
>>> raise MyError('oops!')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
__main__.MyError: 'oops!'
In this example, the default __init__() of Exception has been overridden. The new behavior simply creates
the value attribute. This replaces the default behavior of creating the args attribute.
Exception classes can be defined which do anything any other class can do, but are usually kept simple,
often only offering a number of attributes that allow information about the error to be extracted by handlers
for the exception. When creating a module that can raise several distinct errors, a common practice is to
create a base class for exceptions defined by that module, and subclass that to create specific exception
classes for different error conditions:
class Error(Exception):
"""Base class for exceptions in this module."""
pass
class InputError(Error):
"""Exception raised for errors in the input.
Attributes:
expr -- input expression in which the error occurred
msg -- explanation of the error
"""
def __init__(self, expr, msg):
self.expr = expr
self.msg = msg
class TransitionError(Error):
"""Raised when an operation attempts a state transition that's not
allowed.
Attributes:
prev -- state at beginning of transition
next -- attempted new state
msg -- explanation of why the specific transition is not allowed
"""
def __init__(self, prev, next, msg):
self.prev = prev
self.next = next
self.msg = msg
Most exceptions are defined with names that end in “Error,” similar to the naming of the standard
exceptions.
Many standard modules define their own exceptions to report errors that may occur in functions they define.
Defining Clean-up Actions
The try statement has another optional clause which is intended to define clean-up actions that must be
executed under all circumstances. For example:
>>> try:
... raise KeyboardInterrupt
... finally:
... print 'Goodbye, world!'
...
Goodbye, world!
KeyboardInterrupt
A finally clause is always executed before leaving the try statement, whether an exception has occurred or
not. When an exception has occurred in the try clause and has not been handled by an except clause (or it
has occurred in a except or else clause), it is re-raised after the finally clause has been executed. The finally
clause is also executed “on the way out” when any other clause of the try statement is left via a break,
continue or return statement. A more complicated example (having except and finally clauses in the same try
statement works as of Python 2.5):
>>> def divide(x, y):
... try:
...
result = x / y
... except ZeroDivisionError:
...
print "division by zero!"
... else:
...
print "result is", result
... finally:
...
print "executing finally clause"
...
>>> divide(2, 1)
result is 2
executing finally clause
>>> divide(2, 0)
division by zero!
executing finally clause
>>> divide("2", "1")
executing finally clause
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "<stdin>", line 3, in divide
TypeError: unsupported operand type(s) for /: 'str' and 'str'
As you can see, the finally clause is executed in any event. The TypeError raised by dividing two strings is
not handled by the except clause and therefore re-raised after the finally clause has been executed.
In real world applications, the finally clause is useful for releasing external resources (such as files or
network connections), regardless of whether the use of the resource was successful.
Predefined Clean-up Actions
Some objects define standard clean-up actions to be undertaken when the object is no longer needed,
regardless of whether or not the operation using the object succeeded or failed. Look at the following
example, which tries to open a file and print its contents to the screen.
for line in open("myfile.txt"):
print line
The problem with this code is that it leaves the file open for an indeterminate amount of time after the code
has finished executing. This is not an issue in simple scripts, but can be a problem for larger applications.
The with statement allows objects like files to be used in a way that ensures they are always cleaned up
promptly and correctly.
with open("myfile.txt") as f:
for line in f:
print line
After the statement is executed, the file f is always closed, even if a problem was encountered while
processing the lines. Other objects which provide predefined clean-up actions will indicate this in their
documentation.
Operating System Interface
The os module provides dozens of functions for interacting with the operating system:
>>> import os
>>> os.getcwd() # Return the current working directory
'C:\\Python26'
>>> os.chdir('/server/accesslogs') # Change current working directory
>>> os.system('mkdir today') # Run the command mkdir in the system shell
0
Be sure to use the import os style instead of from os import *. This will keep os.open() from shadowing the
built-in open() function which operates much differently.
The built-in dir() and help() functions are useful as interactive aids for working with large modules like os:
>>> import os
>>> dir(os)
<returns a list of all module functions>
>>> help(os)
<returns an extensive manual page created from the module's docstrings>
For daily file and directory management tasks, the shutil module provides a higher level interface that is
easier to use:
>>> import shutil
>>> shutil.copyfile('data.db', 'archive.db')
>>> shutil.move('/build/executables', 'installdir')
File Wildcards
The glob module provides a function for making file lists from directory wildcard searches:
>>> import glob
>>> glob.glob('*.py')
['primes.py', 'random.py', 'quote.py']
Command Line Arguments
Common utility scripts often need to process command line arguments. These arguments are stored in the
sys module’s argv attribute as a list. For instance the following output results from running python demo.py
one two three at the command line:
>>> import sys
>>> print sys.argv
['demo.py', 'one', 'two', 'three']
The getopt module processes sys.argv using the conventions of the Unix getopt() function. More powerful
and flexible command line processing is provided by the argparse module.
Error Output Redirection and Program Termination
The sys module also has attributes for stdin, stdout, and stderr. The latter is useful for emitting warnings and
error messages to make them visible even when stdout has been redirected:
>>> sys.stderr.write('Warning, log file not found starting a new one\n')
Warning, log file not found starting a new one
The most direct way to terminate a script is to use sys.exit().
String Pattern Matching
The re module provides regular expression tools for advanced string processing. For complex matching and
manipulation, regular expressions offer succinct, optimized solutions:
>>> import re
>>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
['foot', 'fell', 'fastest']
>>> re.sub(r'(\b[a-z]+) \1', r'\1', 'cat in the the hat')
'cat in the hat'
When only simple capabilities are needed, string methods are preferred because they are easier to read and
debug:
>>> 'tea for too'.replace('too', 'two')
'tea for two'
Mathematics
The math module gives access to the underlying C library functions for floating point math:
>>> import math
>>> math.cos(math.pi / 4.0)
0.70710678118654757
>>> math.log(1024, 2)
10.0
The random module provides tools for making random selections:
>>> import random
>>> random.choice(['apple', 'pear', 'banana'])
'apple'
>>> random.sample(xrange(100), 10) # sampling without replacement
[30, 83, 16, 4, 8, 81, 41, 50, 18, 33]
>>> random.random() # random float
0.17970987693706186
>>> random.randrange(6) # random integer chosen from range(6)
4
Internet Access
There are a number of modules for accessing the internet and processing internet protocols. Two of the
simplest are urllib2 for retrieving data from urls and smtplib for sending mail:
>>> import urllib2
>>> for line in urllib2.urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl'):
... if 'EST' in line or 'EDT' in line: # look for Eastern Time
...
print line
<BR>Nov. 25, 09:43:32 PM EST
>>> import smtplib
>>> server = smtplib.SMTP('localhost')
>>> server.sendmail('[email protected]', '[email protected]',
... """To: [email protected]
... From: [email protected]
...
... Beware the Ides of March.
... """)
>>> server.quit()
(Note that the second example needs a mailserver running on localhost.)
Dates and Times
The datetime module supplies classes for manipulating dates and times in both simple and complex ways.
While date and time arithmetic is supported, the focus of the implementation is on efficient member
extraction for output formatting and manipulation. The module also supports objects that are timezone
aware.
>>> # dates are easily constructed and formatted
>>> from datetime import date
>>> now = date.today()
>>> now
datetime.date(2003, 12, 2)
>>> now.strftime("%m-%d-%y. %d %b %Y is a %A on the %d day of %B.")
'12-02-03. 02 Dec 2003 is a Tuesday on the 02 day of December.'
>>> # dates support calendar arithmetic
>>> birthday = date(1964, 7, 31)
>>> age = now - birthday
>>> age.days
14368
Data Compression
Common data archiving and compression formats are directly supported by modules including: zlib, gzip,
bz2, zipfile and tarfile.
>>> import zlib
>>> s = 'witch which has which witches wrist watch'
>>> len(s)
41
>>> t = zlib.compress(s)
>>> len(t)
37
>>> zlib.decompress(t)
'witch which has which witches wrist watch'
>>> zlib.crc32(s)
226805979
Performance Measurement
Some Python users develop a deep interest in knowing the relative performance of different approaches to
the same problem. Python provides a measurement tool that answers those questions immediately.
For example, it may be tempting to use the tuple packing and unpacking feature instead of the traditional
approach to swapping arguments. The timeit module quickly demonstrates a modest performance advantage:
>>> from timeit import Timer
>>> Timer('t=a; a=b; b=t', 'a=1; b=2').timeit()
0.57535828626024577
>>> Timer('a,b = b,a', 'a=1; b=2').timeit()
0.54962537085770791
In contrast to timeit‘s fine level of granularity, the profile and pstats modules provide tools for identifying
time critical sections in larger blocks of code.
BioPython
What is Biopython?
The Biopython Project is an international association of developers of freely available Python
(http://www.python.org) tools for computational molecular biology. The web site http://www.biopython.org
provides an online resource for modules, scripts, and web links for developers of Python-based software for
life science research.
Basically, we just like to program in Python and want to make it as easy as possible to use Python for
bioinformatics by creating high-quality, reusable modules and scripts.
What can I find in the Biopython package
The main Biopython releases have lots of functionality, including:












The ability to parse bioinformatics files into Python utilizable data structures, including support for
the following formats:
o Blast output – both from standalone and WWW Blast
o Clustalw
o FASTA
o GenBank
o PubMed and Medline
o ExPASy files, like Enzyme and Prosite
o SCOP, including ‘dom’ and ‘lin’ files
o UniGene
o SwissProt
Files in the supported formats can be iterated over record by record or indexed and accessed via a
Dictionary interface.
Code to deal with popular on-line bioinformatics destinations such as:
o NCBI – Blast, Entrez and PubMed services
o ExPASy – Swiss-Prot and Prosite entries, as well as Prosite searches
Interfaces to common bioinformatics programs such as:
o Standalone Blast from NCBI
o Clustalw alignment program
o EMBOSS command line tools
A standard sequence class that deals with sequences, ids on sequences, and sequence features.
Tools for performing common operations on sequences, such as translation, transcription and weight
calculations.
Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support Vector
Machines.
Code for dealing with alignments, including a standard way to create and deal with substitution
matrices.
Code making it easy to split up parallelizable tasks into separate processes.
GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc.
Extensive documentation and help with using the modules, including this file, on-line wiki
documentation, the web site, and the mailing list.
Integration with BioSQL, a sequence database schema also supported by the BioPerl and BioJava
projects.
Working with sequences
The central object in bioinformatics is the sequence. Thus, we’ll start with a quick introduction to the
Biopython mechanisms for dealing with sequences, the Seq object.
Most of the time when we think about sequences we have in my mind a string of letters like
‘AGTACACTGGT’. You can create such Seq object with this sequence as follows - the “>>>” represents
the Python prompt followed by what you would type in:
>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> print my_seq
AGTACACTGGT
>>> my_seq.alphabet
Alphabet()
What we have here is a sequence object with a generic alphabet - reflecting the fact we have not specified if
this is a DNA or protein sequence (okay, a protein with a lot of Alanines, Glycines, Cysteines and
Threonines!).
In addition to having an alphabet, the Seq object differs from the Python string in the methods it supports.
You can’t do this with a plain string:
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.complement()
Seq('TCATGTGACCA', Alphabet())
>>> my_seq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())
The next most important class is the SeqRecord or Sequence Record. This holds a sequence (as a Seq object)
with additional annotation including an identifier, name and description. The Bio.SeqIO module for reading
and writing sequence file formats works with SeqRecord objects.
This covers the basic features and uses of the Biopython sequence class. Now that you’ve got some idea of
what it is like to interact with the Biopython libraries, it’s time to delve into the fun, fun world of dealing
with biological file formats!
Parsing sequence file formats
A large part of much bioinformatics work involves dealing with the many types of file formats designed to
hold biological data. These files are loaded with interesting biological data, and a special challenge is
parsing these files into a format so that you can manipulate them with some kind of programming language.
However the task of parsing these files can be frustrated by the fact that the formats can change quite
regularly, and that formats may contain small subtleties which can break even the most well designed
parsers.
We are now going to briefly introduce the Bio.SeqIO module. We’ll start with an online search for our
friends, the lady slipper orchids. To keep this introduction simple, we’re just using the NCBI website by
hand. Let’s just take a look through the nucleotide databases at NCBI, using an Entrez online search
(http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide) for everything mentioning the text
Cypripedioideae (this is the subfamily of lady slipper orchids).
When this tutorial was originally written, this search gave us only 94 hits, which we saved as a FASTA
formatted text file and as a GenBank formatted text file (files ls_orchid.fasta and ls_orchid.gbk, also
included with the Biopython source code under docs/tutorial/examples/).
If you run the search today, you’ll get hundreds of results! When following the tutorial, if you want to see
the same list of genes, just download the two files above or copy them from docs/examples/ in the
Biopython source code.
Simple FASTA parsing example
If you open the lady slipper orchids FASTA file ls_orchid.fasta in your favourite text editor, you’ll see that
the file starts like this:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGA
TCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTT
GTTGGG
...
It contains 94 records, each has a line starting with “>” (greater-than symbol) followed by the sequence on
one or more lines. Now try this in Python:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
print seq_record.id
print repr(seq_record.seq)
print len(seq_record)
You should get something like this on your screen:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC',
SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC',
SingleLetterAlphabet())
592
Notice that the FASTA format does not specify the alphabet, so Bio.SeqIO has defaulted to the rather
generic SingleLetterAlphabet() rather than something DNA specific.
Simple GenBank parsing example
Now let’s load the GenBank file ls_orchid.gbk instead - notice that the code to do this is almost identical to
the snippet used above for the FASTA file - the only difference is we change the filename and the format
string:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
print seq_record.id
print repr(seq_record.seq)
print len(seq_record)
This should give:
Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC',
IUPACAmbiguousDNA())
740
...
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC',
IUPACAmbiguousDNA())
592
This time Bio.SeqIO has been able to choose a sensible alphabet, IUPAC Ambiguous DNA. You’ll also
notice that a shorter string has been used as the seq_record.id in this case.
Connecting with biological databases
One of the very common things that you need to do in bioinformatics is extract information from biological
databases. It can be quite tedious to access these databases manually, especially if you have a lot of
repetitive work to do. Biopython attempts to save you time and energy by making some on-line databases
available from Python scripts. Currently, Biopython has code to extract information from the following
databases:



Entrez (and PubMed) from the NCBI.
ExPASy – See Chapter 9.
SCOP – See the Bio.SCOP.search() function.
The code in these modules basically makes it easy to write Python code that interact with the CGI scripts on
these pages, so that you can get results in an easy to deal with format. In some cases, the results can be
tightly integrated with the Biopython parsers to make it even easier to extract information.
Sequence objects
Biological sequences are arguably the central object in Bioinformatics, and in this chapter we’ll introduce
the Biopython mechanism for dealing with sequences, the Seq object.
Sequences are essentially strings of letters like AGTACACTGGT, which seems very natural since this is the
most common way that sequences are seen in biological file formats.
There are two important differences between Seq objects and standard Python strings. First of all, they have
different methods. Although the Seq object supports many of the same methods as a plain string, its
translate() method differs by doing biological translation, and there are also additional biologically relevant
methods like reverse_complement(). Secondly, the Seq object has an important attribute, alphabet, which is
an object describing what the individual characters making up the sequence string “mean”, and how they
should be interpreted. For example, is AGTACACTGGT a DNA sequence, or just a protein sequence that
happens to be rich in Alanines, Glycines, Cysteines and Threonines?
Sequences and Alphabets
The alphabet object is perhaps the important thing that makes the Seq object more than just a string. The
currently available alphabets for Biopython are defined in the Bio.Alphabet module. We’ll use the IUPAC
alphabets (http://www.chem.qmw.ac.uk/iupac/) here to deal with some of our favorite objects: DNA, RNA
and Proteins.
Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, but additionally provides the
ability to extend and customize the basic definitions. For instance, for proteins, there is a basic
IUPACProtein class, but there is an additional ExtendedIUPACProtein class providing for the additional
elements “U” (or “Sec” for selenocysteine) and “O” (or “Pyl” for pyrrolysine), plus the ambiguous symbols
“B” (or “Asx” for asparagine or aspartic acid), “Z” (or “Glx” for glutamine or glutamic acid), “J” (or “Xle”
for leucine isoleucine) and “X” (or “Xxx” for an unknown amino acid). For DNA you’ve got choices of
IUPACUnambiguousDNA, which provides for just the basic letters, IUPACAmbiguousDNA (which
provides for ambiguity letters for every possible situation) and ExtendedIUPACDNA, which allows letters
for modified bases. Similarly, RNA can be represented by IUPACAmbiguousRNA or
IUPACUnambiguousRNA.
The advantages of having an alphabet class are two fold. First, this gives an idea of the type of information
the Seq object contains. Secondly, this provides a means of constraining the information, as a means of type
checking.
Now that we know what we are dealing with, let’s look at how to utilize this class to do interesting work.
You can create an ambiguous sequence with the default generic alphabet like this:
>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.alphabet
Alphabet()
However, where possible you should specify the alphabet explicitly when creating your sequence objects in this case an unambiguous DNA alphabet object:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
>>> my_seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> my_seq.alphabet
IUPACUnambiguousDNA()
Unless of course, this really is an amino acid sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_prot = Seq("AGTACACTGGT", IUPAC.protein)
>>> my_prot
Seq('AGTACACTGGT', IUPACProtein())
>>> my_prot.alphabet
IUPACProtein()
Sequences act like strings
In many ways, we can deal with Seq objects as if they were normal Python strings, for example getting the
length, or iterating over the elements:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
>>> for index, letter in enumerate(my_seq):
... print index, letter
0G
1A
2T
3C
4G
>>> print len(my_seq)
5
You can access elements of the sequence in the same way as for strings (but remember, Python counts from
zero!):
>>> print my_seq[0] #first letter
G
>>> print my_seq[2] #third letter
T
>>> print my_seq[-1] #last letter
G
The Seq object has a .count() method, just like a string. Note that this means that like a Python string, this
gives a non-overlapping count:
>>> from Bio.Seq import Seq
>>> "AAAA".count("AA")
2
>>> Seq("AAAA").count("AA")
2
For some biological uses, you may actually want an overlapping count (i.e. 3 in this trivial example). When
searching for single letters, this makes no difference:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> len(my_seq)
32
>>> my_seq.count("G")
9
>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
46.875
While you could use the above snippet of code to calculate a GC%, note that the Bio.SeqUtils module has
several GC functions already built. For example:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> GC(my_seq)
46.875
Note that using the Bio.SeqUtils.GC() function should automatically cope with mixed case sequences and
the ambiguous nucleotide S which means G or C.
Also note that just like a normal Python string, the Seq object is in some ways “read-only”. If you need to
edit your sequence, for example simulating a point mutation.
Slicing a sequence
A more complicated example, let’s get a slice of the sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq[4:12]
Seq('GATGGGCC', IUPACUnambiguousDNA())
Two things are interesting to note. First, this follows the normal conventions for Python strings. So the first
element of the sequence is 0 (which is normal for computer science, but not so normal for biology). When
you do a slice the first item is included (i.e. 4 in this case) and the last is excluded (12 in this case), which is
the way things work in Python, but of course not necessarily the way everyone in the world would expect.
The main goal is to stay consistent with what Python does.
The second thing to notice is that the slice is performed on the sequence data string, but the new object
produced is another Seq object which retains the alphabet information from the original Seq object.
Also like a Python string, you can do slices with a start, stop and stride (the step size, which defaults to one).
For example, we can get the first, second and third codon positions of this DNA sequence:
>>> my_seq[0::3]
Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())
>>> my_seq[1::3]
Seq('AGGCATGCATC', IUPACUnambiguousDNA())
>>> my_seq[2::3]
Seq('TAGCTAAGAC', IUPACUnambiguousDNA())
Another stride trick you might have seen with a Python string is the use of a -1 stride to reverse the string.
You can do this with a Seq object too:
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
Turning Seq objects into strings
If you really do just need a plain string, for example to write to a file, or insert into a database, then this is
very easy to get:
>>> str(my_seq)
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
Since calling str() on a Seq object returns the full sequence as a string, you often don’t actually have to do
this conversion explicitly. Python does this automatically with a print statement:
>>> print my_seq
GATCGATGGGCCTATATAGGATCGAAAATCGC
You can also use the Seq object directly with a %s placeholder when using the Python string formatting or
interpolation operator (%):
>>> fasta_format_string = ">Name\n%s\n" % my_seq
>>> print fasta_format_string
>Name
GATCGATGGGCCTATATAGGATCGAAAATCGC
<BLANKLINE>
This line of code constructs a simple FASTA format record (without worrying about line wrapping).
NOTE: If you are using Biopython 1.44 or older, using str(my_seq) will give just a truncated representation.
Instead use my_seq.tostring() (which is still available in the current Biopython releases for backwards
compatibility):
>>> my_seq.tostring()
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
Concatenating or adding sequences
Naturally, you can in principle add any two Seq objects together - just like you can with Python strings to
concatenate them. However, you can’t add sequences with incompatible alphabets, such as a protein
sequence and a DNA sequence:
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> protein_seq + dna_seq
Traceback (most recent call last):
...
TypeError: Incompatible alphabets IUPACProtein() and IUPACUnambiguousDNA()
If you really wanted to do this, you’d have to first give both sequences generic alphabets:
>>> from Bio.Alphabet import generic_alphabet
>>> protein_seq.alphabet = generic_alphabet
>>> dna_seq.alphabet = generic_alphabet
>>> protein_seq + dna_seq
Seq('EVRNAKACGT', Alphabet())
Here is an example of adding a generic nucleotide sequence to an unambiguous IUPAC DNA sequence,
resulting in an ambiguous nucleotide sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_nucleotide
>>> from Bio.Alphabet import IUPAC
>>> nuc_seq = Seq("GATCGATGC", generic_nucleotide)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> nuc_seq
Seq('GATCGATGC', NucleotideAlphabet())
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> nuc_seq + dna_seq
Seq('GATCGATGCACGT', NucleotideAlphabet())
Changing case
Python strings have very useful upper and lower methods for changing the case. As of Biopython 1.53, the
Seq object gained similar methods which are alphabet aware. For example,
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> dna_seq = Seq("acgtACGT", generic_dna)
>>> dna_seq
Seq('acgtACGT', DNAAlphabet())
>>> dna_seq.upper()
Seq('ACGTACGT', DNAAlphabet())
>>> dna_seq.lower()
Seq('acgtacgt', DNAAlphabet())
These are useful for doing case insensitive matching:
>>> "GTAC" in dna_seq
False
>>> "GTAC" in dna_seq.upper()
True
Note that strictly speaking the IUPAC alphabets are for upper case sequences only, thus:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> dna_seq.lower()
Seq('acgt', DNAAlphabet())
Nucleotide sequences and (reverse) complements
For nucleotide sequences, you can easily obtain the complement or reverse complement of a Seq object
using its built-in methods:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
As mentioned earlier, an easy way to just reverse a Seq object (or a Python string) is slice it with -1 step:
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
In all of these operations, the alphabet property is maintained. This is very useful in case you accidentally
end up trying to do something weird like take the (reverse)complement of a protein sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> protein_seq.complement()
Traceback (most recent call last):
...
ValueError: Proteins do not have complements!
Transcription
Before talking about transcription, I want to try and clarify the strand issue. Consider the following (made
up) stretch of double stranded DNA which encodes a short peptide:
DNA coding strand (aka Crick strand, strand +1)
5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’
|||||||||||||||||||||||||||||||||||||||
3’
TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC
DNA template strand (aka Watson strand, strand −1)
|
Transcription
↓
5’
5’ AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3’
Single stranded messenger RNA
The actual biological transcription process works from the template strand, doing a reverse complement
(TCAG → CUGA) to give the mRNA. However, in Biopython and bioinformatics in general, we typically
work directly with the coding strand because this means we can get the mRNA sequence just by switching T
→ U.
Now let’s actually get down to doing a transcription in Biopython. First, let’s create Seq objects for the
coding and template DNA strands:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> template_dna = coding_dna.reverse_complement()
>>> template_dna
Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT', IUPACUnambiguousDNA())
These should match the figure above - remember by convention nucleotide sequences are normally read
from the 5’ to 3’ direction, while in the figure the template strand is shown reversed.
Now let’s transcribe the coding strand into the corresponding mRNA, using the Seq object’s built in
transcribe method:
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> messenger_rna = coding_dna.transcribe()
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
As you can see, all this does is switch T → U, and adjust the alphabet.
If you do want to do a true biological transcription starting with the template strand, then this becomes a
two-step process:
>>> template_dna.reverse_complement().transcribe()
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
The Seq object also includes a back-transcription method for going from the mRNA to the coding strand of
the DNA. Again, this is a simple U → T substitution and associated change of alphabet:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG",
IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.back_transcribe()
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
Note: The Seq object’s transcribe and back_transcribe methods were added in Biopython 1.49. For older
releases you would have to use the Bio.Seq module’s functions instead.
Translation
Sticking with the same example discussed in the transcription section above, now let’s translate this mRNA
into the corresponding protein sequence - again taking advantage of one of the Seq object’s biological
methods:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG",
IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
You can also translate directly from the coding strand DNA sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
You should notice in the above protein sequences that in addition to the end stop character, there is an
internal stop as well. This was a deliberate choice of example, as it gives an excuse to talk about some
optional arguments, including different translation tables (Genetic Codes).
The translation tables available in Biopython are based on those from the NCBI (see the next section of this
tutorial). By default, translation will use the standard genetic code (NCBI table id 1). Suppose we are
dealing with a mitochondrial sequence. We need to tell the translation function to use the relevant genetic
code instead:
>>> coding_dna.translate(table="Vertebrate Mitochondrial")
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
You can also specify the table using the NCBI table number which is shorter, and often included in the
feature annotation of GenBank files:
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
Now, you may want to translate the nucleotides up to the first in frame stop codon, and then stop (as
happens in nature):
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(to_stop=True)
Seq('MAIVMGR', IUPACProtein())
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table=2, to_stop=True)
Seq('MAIVMGRWKGAR', IUPACProtein())
Notice that when you use the to_stop argument, the stop codon itself is not translated - and the stop symbol
is not included at the end of your protein sequence.
You can even specify the stop symbol if you don’t like the default asterisk:
>>> coding_dna.translate(table=2, stop_symbol="@")
Seq('MAIVMGRWKGAR@', HasStopCodon(IUPACProtein(), '@'))
Now, suppose you have a complete coding sequence CDS, which is to say a nucleotide sequence (e.g.
mRNA – after any splicing) which is a whole number of codons (i.e. the length is a multiple of three),
commences with a start codon, ends with a stop codon, and has no internal in-frame stop codons. In general,
given a complete CDS, the default translate method will do what you want (perhaps with the to_stop
option). However, what if your sequence uses a non-standard start codon? This happens a lot in bacteria –
for example the gene yaaX in E. coli K12:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> gene =
Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA"
+\
...
"GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
...
"AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
...
"TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
...
"AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",
...
generic_dna)
>>> gene.translate(table="Bacterial")
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*',
HasStopCodon(ExtendedIUPACProtein(), '*')
>>> gene.translate(table="Bacterial", to_stop=True)
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR',
ExtendedIUPACProtein())
In the bacterial genetic code GTG is a valid start codon, and while it does normally encode valine, if used as
a start codon it should be translated as methionine. This happens if you tell Biopython your sequence is a
complete CDS:
>>> gene.translate(table="Bacterial", cds=True)
Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR',
ExtendedIUPACProtein())
In addition to telling Biopython to translate an alternative start codon as methionine, using this option also
makes sure your sequence really is a valid CDS (you’ll get an exception if not).
Note: The Seq object’s translate method is new in Biopython 1.49. For older releases you would have to use
the Bio.Seq module’s translate function instead. The cds option was added in Biopython 1.51, and there is
no simple way to do this with older versions of Biopython.
Translation Tables
In the previous sections we talked about the Seq object translation method (and mentioned the equivalent
function in the Bio.Seq module). Internally these use codon table objects derived from the NCBI
information at ftp://ftp.ncbi.nlm.nih.gov/entrez/misc/data/gc.prt, also shown on
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi in a much more readable layout.
As before, let’s just focus on two choices: the Standard translation table, and the translation table for
Vertebrate Mitochondrial DNA.
>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
>>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
Alternatively, these tables are labeled with ID numbers 1 and 2, respectively:
>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_id[1]
>>> mito_table = CodonTable.unambiguous_dna_by_id[2]
You can compare the actual tables visually by printing them:
>>> print standard_table
Table 1 Standard, SGC0
| T | C | A | G |
--+---------+---------+---------+---------+-T | TTT F | TCT S | TAT Y | TGT C | T
T | TTC F | TCC S | TAC Y | TGC C | C
T | TTA L | TCA S | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S | TAG Stop| TGG W | G
--+---------+---------+---------+---------+-C | CTT L | CCT P | CAT H | CGT R | T
C | CTC L | CCC P | CAC H | CGC R | C
C | CTA L | CCA P | CAA Q | CGA R | A
C | CTG L(s)| CCG P | CAG Q | CGG R | G
--+---------+---------+---------+---------+-A | ATT I | ACT T | AAT N | AGT S | T
A | ATC I | ACC T | AAC N | AGC S | C
A | ATA I | ACA T | AAA K | AGA R | A
A | ATG M(s)| ACG T | AAG K | AGG R | G
--+---------+---------+---------+---------+-G | GTT V | GCT A | GAT D | GGT G | T
G | GTC V | GCC A | GAC D | GGC G | C
G | GTA V | GCA A | GAA E | GGA G | A
G | GTG V | GCG A | GAG E | GGG G | G
--+---------+---------+---------+---------+-and:
>>> print mito_table
Table 2 Vertebrate Mitochondrial, SGC1
| T | C | A | G |
--+---------+---------+---------+---------+-T | TTT F | TCT S | TAT Y | TGT C | T
T | TTC F | TCC S | TAC Y | TGC C | C
T | TTA L | TCA S | TAA Stop| TGA W | A
T | TTG L | TCG S | TAG Stop| TGG W | G
--+---------+---------+---------+---------+-C | CTT L | CCT P | CAT H | CGT R | T
C | CTC L | CCC P | CAC H | CGC R | C
C | CTA L | CCA P | CAA Q | CGA R | A
C | CTG L | CCG P | CAG Q | CGG R | G
--+---------+---------+---------+---------+-A | ATT I(s)| ACT T | AAT N | AGT S | T
A | ATC I(s)| ACC T | AAC N | AGC S | C
A | ATA M(s)| ACA T | AAA K | AGA Stop| A
A | ATG M(s)| ACG T | AAG K | AGG Stop| G
--+---------+---------+---------+---------+-G | GTT V | GCT A | GAT D | GGT G | T
G | GTC V | GCC A | GAC D | GGC G | C
G | GTA V | GCA A | GAA E | GGA G | A
G | GTG V(s)| GCG A | GAG E | GGG G | G
--+---------+---------+---------+---------+-You may find these following properties useful – for example if you are trying to do your own gene finding:
>>> mito_table.stop_codons
['TAA', 'TAG', 'AGA', 'AGG']
>>> mito_table.start_codons
['ATT', 'ATC', 'ATA', 'ATG', 'GTG']
>>> mito_table.forward_table["ACG"]
'T'
Comparing Seq objects
Sequence comparison is actually a very complicated topic, and there is no easy way to decide if two
sequences are equal. The basic problem is the meaning of the letters in a sequence are context dependent the letter “A” could be part of a DNA, RNA or protein sequence. Biopython uses alphabet objects as part of
each Seq object to try and capture this information - so comparing two Seq objects means considering both
the sequence strings and the alphabets.
For example, you might argue that the two DNA Seq objects Seq("ACGT", IUPAC.unambiguous_dna) and
Seq("ACGT", IUPAC.ambiguous_dna) should be equal, even though they do have different alphabets.
Depending on the context this could be important.
This gets worse – suppose you think Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT") (i.e. the
default generic alphabet) should be equal. Then, logically, Seq("ACGT", IUPAC.protein) and Seq("ACGT")
should also be equal. Now, in logic if A=B and B=C, by transitivity we expect A=C. So for logical
consistency we’d require Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT", IUPAC.protein) to be
equal – which most people would agree is just not right. This transitivity problem would also have
implications for using Seq objects as Python dictionary keys.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
>>> seq2 = Seq("ACGT", IUPAC.unambiguous_dna)
So, what does Biopython do? Well, the equality test is the default for Python objects – it tests to see if they
are the same object in memory. This is a very strict test:
>>> seq1 == seq2
False
>>> seq1 == seq1
True
If you actually want to do this, you can be more explicit by using the Python id function,
>>> id(seq1) == id(seq2)
False
>>> id(seq1) == id(seq1)
True
Now, in every day use, your sequences will probably all have the same alphabet, or at least all be the same
type of sequence (all DNA, all RNA, or all protein). What you probably want is to just compare the
sequences as strings – so do this explicitly:
>>> str(seq1) == str(seq2)
True
>>> str(seq1) == str(seq1)
True
As an extension to this, while you can use a Python dictionary with Seq objects as keys, it is generally more
useful to use the sequence a string for the key.
MutableSeq objects
Just like the normal Python string, the Seq object is “read only”, or in Python terminology, immutable. Apart
from wanting the Seq object to act like a string, this is also a useful default since in many biological
applications you want to ensure you are not changing your sequence data:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
Observe what happens if you try to edit the sequence:
>>> my_seq[5] = "G"
Traceback (most recent call last):
...
TypeError: 'Seq' object does not support item assignment
However, you can convert it into a mutable sequence (a MutableSeq object) and do pretty much anything
you want with it:
>>> mutable_seq = my_seq.tomutable()
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
Alternatively, you can create a MutableSeq object directly from a string:
>>> from Bio.Seq import MutableSeq
>>> from Bio.Alphabet import IUPAC
>>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA",
IUPAC.unambiguous_dna)
Either way will give you a sequence object which can be changed:
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq[5] = "C"
>>> mutable_seq
MutableSeq('GCCATCGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.remove("T")
>>> mutable_seq
MutableSeq('GCCACGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.reverse()
>>> mutable_seq
MutableSeq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG', IUPACUnambiguousDNA())
Do note that unlike the Seq object, the MutableSeq object’s methods like reverse_complement() and
reverse() act in-situ!
An important technical difference between mutable and immutable objects in Python means that you can’t
use a MutableSeq object as a dictionary key, but you can use a Python string or a Seq object in this way.
Once you have finished editing your a MutableSeq object, it’s easy to get back to a read-only Seq object
should you need to:
>>> new_seq = mutable_seq.toseq()
>>> new_seq
Seq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG', IUPACUnambiguousDNA())
You can also get a string from a MutableSeq object just like from a Seq object.
UnknownSeq objects
Biopython 1.50 introduced another basic sequence object, the UnknownSeq object. This is a subclass of the
basic Seq object and its purpose is to represent a sequence where we know the length, but not the actual
letters making it up. You could of course use a normal Seq object in this situation, but it wastes rather a lot
of memory to hold a string of a million “N” characters when you could just store a single letter “N” and the
desired length as an integer.
>>> from Bio.Seq import UnknownSeq
>>> unk = UnknownSeq(20)
>>> unk
UnknownSeq(20, alphabet = Alphabet(), character = '?')
>>> print unk
????????????????????
>>> len(unk)
20
You can of course specify an alphabet, meaning for nucleotide sequences the letter defaults to “N” and for
proteins “X”, rather than just “?”.
>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import IUPAC
>>> unk_dna = UnknownSeq(20, alphabet=IUPAC.ambiguous_dna)
>>> unk_dna
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> print unk_dna
NNNNNNNNNNNNNNNNNNNN
You can use all the usual Seq object methods too, note these give back memory saving UnknownSeq objects
where appropriate as you might expect:
>>> unk_dna
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.complement()
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.reverse_complement()
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.transcribe()
UnknownSeq(20, alphabet = IUPACAmbiguousRNA(), character = 'N')
>>> unk_protein = unk_dna.translate()
>>> unk_protein
UnknownSeq(6, alphabet = ProteinAlphabet(), character = 'X')
>>> print unk_protein
XXXXXX
>>> len(unk_protein)
6
You may be able to find a use for the UnknownSeq object in your own code, but it is more likely that you
will first come across them in a SeqRecord object created by Bio.SeqIO (see Chapter 5). Some sequence file
formats don’t always include the actual sequence, for example GenBank and EMBL files may include a list
of features but for the sequence just present the contig information. Alternatively, the QUAL files used in
sequencing work hold quality scores but they never contain a sequence – instead there is a partner FASTA
file which does have the sequence.
Working with directly strings
To close this chapter, for those you who really don’t want to use the sequence objects (or who prefer a
functional programming style to an object orientated one), there are module level functions in Bio.Seq will
accept plain Python strings, Seq objects (including UnknownSeq objects) or MutableSeq objects:
>>> from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate
>>> my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG"
>>> reverse_complement(my_string)
'CTAACCAGCAGCACGACCACCCTTCCAACGACCCATAACAGC'
>>> transcribe(my_string)
'GCUGUUAUGGGUCGUUGGAAGGGUGGUCGUGCUGCUGGUUAG'
>>> back_transcribe(my_string)
'GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG'
>>> translate(my_string)
'AVMGRWKGGRAAG*'
You are, however, encouraged to work with Seq objects by default.
Creating a SeqRecord
Using a SeqRecord object is not very complicated, since all of the information is presented as attributes of
the class. Usually you won’t create a SeqRecord “by hand”, but instead use Bio.SeqIO to read in a sequence
file for you. However, creating SeqRecord can be quite simple.
SeqRecord objects from scratch
To create a SeqRecord at a minimum you just need a Seq object:
>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq)
Additionally, you can also pass the id, name and description to the initialization function, but if not they will
be set as strings indicating they are unknown, and can be modified subsequently:
>>> simple_seq_r.id
'<unknown id>'
>>> simple_seq_r.id = "AC12345"
>>> simple_seq_r.description = "Made up sequence I wish I could write a paper about"
>>> print simple_seq_r.description
Made up sequence I wish I could write a paper about
>>> simple_seq_r.seq
Seq('GATC', Alphabet())
Including an identifier is very important if you want to output your SeqRecord to a file. You would normally
include this when creating the object:
>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq, id="AC12345")
As mentioned above, the SeqRecord has an dictionary attribute annotations. This is used for any
miscellaneous annotations that doesn’t fit under one of the other more specific attributes. Adding
annotations is easy, and just involves dealing directly with the annotation dictionary:
>>> simple_seq_r.annotations["evidence"] = "None. I just made it up."
>>> print simple_seq_r.annotations
{'evidence': 'None. I just made it up.'}
>>> print simple_seq_r.annotations["evidence"]
None. I just made it up.
Working with per-letter-annotations is similar, letter_annotations is a dictionary like attribute which will let
you assign any Python sequence (i.e. a string, list or tuple) which has the same length as the sequence:
>>> simple_seq_r.letter_annotations["phred_quality"] = [40,40,38,30]
>>> print simple_seq_r.letter_annotations
{'phred_quality': [40, 40, 38, 30]}
>>> print simple_seq_r.letter_annotations["phred_quality"]
[40, 40, 38, 30]
The dbxrefs and features attributes are just Python lists, and should be used to store strings and SeqFeature
objects (discussed later in this chapter) respectively.
SeqRecord objects from FASTA files
This example uses a fairly large FASTA file containing the whole sequence for Yersinia pestis biovar
Microtus str. 91001 plasmid pPCP1, originally downloaded from the NCBI. This file is included with the
Biopython unit tests under the GenBank folder, or online NC_005816.fna from our website.
The file starts like this - and you can check there is only one record present (i.e. only one line starting with a
greater than symbol):
>gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCT
GCTCTCC
...
The function Bio.SeqIO.parse(...) is used to loop over all the records in a file as SeqRecord objects. The
Bio.SeqIO module has a sister function for use on files which contain just one record which we’ll use here:
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.fna", "fasta")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATC
CAGG...CTG',
SingleLetterAlphabet()), id='gi|45478711|ref|NC_005816.1|', name='gi|45478711|ref|NC_005816.1|',
description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... sequence',
dbxrefs=[])
Now, let’s have a look at the key attributes of this SeqRecord individually – starting with the seq attribute
which gives you a Seq object:
>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
SingleLetterAlphabet())
Here Bio.SeqIO has defaulted to a generic alphabet, rather than guessing that this is DNA. If you know in
advance what kind of sequence your FASTA file contains, you can tell Bio.SeqIO which alphabet to use.
Next, the identifiers and description:
>>> record.id
'gi|45478711|ref|NC_005816.1|'
>>> record.name
'gi|45478711|ref|NC_005816.1|'
>>> record.description
'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence'
As you can see above, the first word of the FASTA record’s title line (after removing the greater than
symbol) is used for both the id and name attributes. The whole title line (after removing the greater than
symbol) is used for the record description. This is deliberate, partly for backwards compatibility reasons, but
it also makes sense if you have a FASTA file like this:
>Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCT
GCTCTCC
...
Note that none of the other annotation attributes get populated when reading a FASTA file:
>>> record.dbxrefs
[]
>>> record.annotations
{}
>>> record.letter_annotations
{}
>>> record.features
[]
In this case our example FASTA file was from the NCBI, and they have a fairly well defined set of
conventions for formatting their FASTA lines. This means it would be possible to parse this information and
extract the GI number and accession for example. However, FASTA files from other sources vary, so this
isn’t possible in general.
SeqRecord objects from GenBank files
As in the previous example, we’re going to look at the whole sequence for Yersinia pestis biovar Microtus
str. 91001 plasmid pPCP1, originally downloaded from the NCBI, but this time as a GenBank file. Again,
this file is included with the Biopython unit tests under the GenBank folder, or online NC_005816.gb from
our website.
This file contains a single record (i.e. only one LOCUS line) and starts:
LOCUS
NC_005816
9609 bp DNA circular BCT 21-JUL-2008
DEFINITION Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete
sequence.
ACCESSION NC_005816
VERSION NC_005816.1 GI:45478711
PROJECT GenomeProject:10638
...
Again, we’ll use Bio.SeqIO to read this file in, and the code is almost identical to that for used above for the
FASTA file:
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATC
CAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=['Project:10638'])
You should be able to spot some differences already! But taking the attributes individually, the sequence
string is the same as before, but this time Bio.SeqIO has been able to automatically assign a more specific
alphabet :
>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA())
The name comes from the LOCUS line, while the id includes the version suffix. The description comes from
the DEFINITION line:
>>> record.id
'NC_005816.1'
>>> record.name
'NC_005816'
>>> record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.'
GenBank files don’t have any per-letter annotations:
>>> record.letter_annotations
{}
Most of the annotations information gets recorded in the annotations dictionary, for example:
>>> len(record.annotations)
11
>>> record.annotations["source"]
'Yersinia pestis biovar Microtus str. 91001'
The dbxrefs list gets populated from any PROJECT or DBLINK lines:
>>> record.dbxrefs
['Project:10638']
Finally, and perhaps most interestingly, all the entries in the features table (e.g. the genes or CDS features)
get recorded as SeqFeature objects in the features list.
>>> len(record.features)
29
Sequence
A SeqFeature object doesn’t directly contain a sequence, instead its location describes how to get this from
the parent sequence. For example consider a (short) gene sequence with location 5:18 on the reverse strand,
which in GenBank/EMBL notation using 1-based counting would be complement(4..18), like this:
>>> from Bio.Seq import Seq
>>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>>> example_parent =
Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACT
GGGAGCCTAC")
>>> example_feature = SeqFeature(FeatureLocation(5, 18), type="gene", strand=-1)
You could take the parent sequence, slice it to extract 5:18, and then take the reverse complement:
>>> feature_seq = example_parent[5:18].reverse_complement()
>>> print feature_seq
AGCCTTTGCCGTC
This is a simple example so this isn’t too bad – however once you have to deal with compound features
(joins) this is rather messy. Instead, the SeqFeature object has an extract method to take care of all this:
>>> feature_seq = example_feature.extract(example_parent)
>>> print feature_seq
AGCCTTTGCCGTC
The extract method was added in Biopython 1.53, and in Biopython 1.56 the SeqFeature was further
extended to give its length as that of the region of sequence it describes.
>>> print example_feature.extract(example_parent)
AGCCTTTGCCGTC
>>> print len(example_feature.extract(example_parent))
13
>>> print len(example_feature)
The PDB module
Biopython also allows you to explore the extensive realm of macromolecular structure. Biopython comes
with a PDBParser class that produces a Structure object. The Structure object can be used to access the
atomic data in the file in a convenient manner.
Structure representation
A macromolecular structure is represented using a structure, model chain, residue, atom (or SMCRA)
hierarchy. The figure below shows a UML class diagram of the SMCRA data structure. Such a data
structure is not necessarily best suited for the representation of the macromolecular content of a structure,
but it is absolutely necessary for a good interpretation of the data present in a file that describes the structure
(typically a PDB or MMCIF file). If this hierarchy cannot represent the contents of a structure file, it is fairly
certain that the file contains an error or at least does not describe the structure unambiguously. If a SMCRA
data structure cannot be generated, there is reason to suspect a problem. Parsing a PDB file can thus be used
to detect likely problems.
Structure, Model, Chain and Residue are all subclasses of the Entity base class. The Atom class only (partly)
implements the Entity interface (because an Atom does not have children).
For each Entity subclass, you can extract a child by using a unique id for that child as a key (e.g. you can
extract an Atom object from a Residue object by using an atom name string as a key, you can extract a
Chain object from a Model object by using its chain identifier as a key).
Disordered atoms and residues are represented by DisorderedAtom and DisorderedResidue classes, which
are both subclasses of the DisorderedEntityWrapper base class. They hide the complexity associated with
disorder and behave exactly as Atom and Residue objects.
In general, a child Entity object (i.e. Atom, Residue, Chain, Model) can be extracted from its parent (i.e.
Residue, Chain, Model, Structure, respectively) by using an id as a key.
child_entity=parent_entity[child_id]
You can also get a list of all child Entities of a parent Entity object. Note that this list is sorted in a specific
way (e.g. according to chain identifier for Chain objects in a Model object).
child_list=parent_entity.get_list()
You can also get the parent from a child.
parent_entity=child_entity.get_parent()
At all levels of the SMCRA hierarchy, you can also extract a full id. The full id is a tuple containing all id’s
starting from the top object (Structure) down to the current object. A full id for a Residue object e.g. is
something like:
full_id=residue.get_full_id()
print full_id
("1abc", 0, "A", ("", 10, "A"))
This corresponds to:




The Structure with id "1abc"
The Model with id 0
The Chain with id "A"
The Residue with id (" ", 10, "A").
The Residue id indicates that the residue is not a hetero-residue (nor a water) because it has a blank hetero
field, that its sequence identifier is 10 and that its insertion code is "A".
Some other useful methods:
# get the entity's id
entity.get_id()
# check if there is a child with a given id
entity.has_id(entity_id)
# get number of children
nr_children=len(entity)
It is possible to delete, rename, add, etc. child entities from a parent entity, but this does not include any
sanity checks (e.g. it is possible to add two residues with the same id to one chain). This really should be
done via a nice Decorator class that includes integrity checking, but you can take a look at the code
(Entity.py) if you want to use the raw interface.
Structure
The Structure object is at the top of the hierarchy. Its id is a user given string. The Structure contains a
number of Model children. Most crystal structures (but not all) contain a single model, while NMR
structures typically consist of several models. Disorder in crystal structures of large parts of molecules can
also result in several models.
Constructing a Structure object
A Structure object is produced by a PDBParser object:
from Bio.PDB.PDBParser import PDBParser
p=PDBParser(PERMISSIVE=1)
structure_id="1fat"
filename="pdb1fat.ent"
s=p.get_structure(structure_id, filename)
The PERMISSIVE flag indicates that a number of common problems associated with PDB files will be
ignored (but note that some atoms and/or residues will be missing). If the flag is not present a
PDBConstructionException will be generated during the parse operation.
Header and trailer
You can extract the header and trailer (simple lists of strings) of the PDB file from the PDBParser object
with the get_header and get_trailer methods.
Model
The id of the Model object is an integer, which is derived from the position of the model in the parsed file
(they are automatically numbered starting from 0). The Model object stores a list of Chain children.
Example
Get the first model from a Structure object.
first_model=structure[0]
Chain
The id of a Chain object is derived from the chain identifier in the structure file, and can be any string. Each
Chain in a Model object has a unique id. The Chain object stores a list of Residue children.
Example
Get the Chain object with identifier “A” from a Model object.
chain_A=model["A"]
Residue
Unsurprisingly, a Residue object stores a set of Atom children. In addition, it also contains a string that
specifies the residue name (e.g. “ASN”) and the segment identifier of the residue (well known to X-PLOR
users, but not used in the construction of the SMCRA data structure).
The id of a Residue object is composed of three parts: the hetero field (hetfield), the sequence identifier
(resseq) and the insertion code (icode).
The hetero field is a string : it is “W” for waters, “H_” followed by the residue name (e.g. “H_FUC”) for
other hetero residues and blank for standard amino and nucleic acids. The second field in the Residue id is
the sequence identifier, an integer describing the position of the residue in the chain.
The third field is a string, consisting of the insertion code. The insertion code is sometimes used to preserve
a certain desirable residue numbering scheme. A Ser 80 insertion mutant (inserted e.g. between a Thr 80 and
an Asn 81 residue) could e.g. have sequence identifiers and insertion codes as followed: Thr 80 A, Ser 80 B,
Asn 81. In this way the residue numbering scheme stays in tune with that of the wild type structure.
Let’s give some examples. Asn 10 with a blank insertion code would have residue id (’’ ’’, 10, ’’ ’’). Water
10 would have residue id (‘‘W‘‘, 10, ‘‘ ‘‘). A glucose molecule (a hetero residue with residue name GLC)
with sequence identifier 10 would have residue id (’’H_GLC’’, 10, ’’ ’’). In this way, the three residues
(with the same insertion code and sequence identifier) can be part of the same chain because their residue
id’s are distinct.
In most cases, the hetflag and insertion code fields will be blank, e.g. (’’ ’’, 10, ’’ ’’). In these cases, the
sequence identifier can be used as a shortcut for the full id:
# use full id
res10=chain[("", 10, "")]
# use shortcut
res10=chain[10]
Each Residue object in a Chain object should have a unique id. However, disordered residues are dealt with
in a special way.
A Residue object has a number of additional methods:
r.get_resname() # return residue name, e.g. "ASN"
r.get_segid() # return the SEGID, e.g. "CHN1"
10.1.5 Atom
The Atom object stores the data associated with an atom, and has no children. The id of an atom is its atom
name (e.g. “OG” for the side chain oxygen of a Ser residue). An Atom id needs to be unique in a Residue.
Again, an exception is made for disordered atoms.
In a PDB file, an atom name consists of 4 chars, typically with leading and trailing spaces. Often these
spaces can be removed for ease of use (e.g. an amino acid C α atom is labeled “.CA.” in a PDB file, where
the dots represent spaces). To generate an atom name (and thus an atom id) the spaces are removed, unless
this would result in a name collision in a Residue (i.e. two Atom objects with the same atom name and id).
In the latter case, the atom name including spaces is tried. This situation can e.g. happen when one residue
contains atoms with names “.CA.” and “CA..”, although this is not very likely.
The atomic data stored includes the atom name, the atomic coordinates (including standard deviation if
present), the B factor (including anisotropic B factors and standard deviation if present), the altloc specifier
and the full atom name including spaces. Less used items like the atom element number or the atomic charge
sometimes specified in a PDB file are not stored.
An Atom object has the following additional methods:
a.get_name()
# atom name (spaces stripped, e.g. "CA")
a.get_id()
# id (equals atom name)
a.get_coord() # atomic coordinates
a.get_bfactor() # B factor
a.get_occupancy() # occupancy
a.get_altloc() # alternative location specifie
a.get_sigatm() # std. dev. of atomic parameters
a.get_siguij() # std. dev. of anisotropic B factor
a.get_anisou() # anisotropic B factor
a.get_fullname() # atom name (with spaces, e.g. ".CA.")
To represent the atom coordinates, siguij, anisotropic B factor and sigatm Numpy arrays are used.
Disorder
General approach
Disorder should be dealt with from two points of view: the atom and the residue points of view. In general,
we have tried to encapsulate all the complexity that arises from disorder. If you just want to loop over all C
α atoms, you do not care that some residues have a disordered side chain. On the other hand it should also be
possible to represent disorder completely in the data structure. Therefore, disordered atoms or residues are
stored in special objects that behave as if there is no disorder. This is done by only representing a subset of
the disordered atoms or residues. Which subset is picked (e.g. which of the two disordered OG side chain
atom positions of a Ser residue is used) can be specified by the user.
Disordered atoms
Disordered atoms are represented by ordinary Atom objects, but all Atom objects that represent the same
physical atom are stored in a DisorderedAtom object. Each Atom object in a DisorderedAtom object can be
uniquely indexed using its altloc specifier. The DisorderedAtom object forwards all uncaught method calls
to the selected Atom object, by default the one that represents the atom with with the highest occupancy.
The user can of course change the selected Atom object, making use of its altloc specifier. In this way atom
disorder is represented correctly without much additional complexity. In other words, if you are not
interested in atom disorder, you will not be bothered by it.
Each disordered atom has a characteristic altloc identifier. You can specify that a DisorderedAtom object
should behave like the Atom object associated with a specific altloc identifier:
atom.disordered\_select("A") # select altloc A atom
print atom.get_altloc()
"A"
atom.disordered_select("B")
print atom.get_altloc()
"B"
# select altloc B atom
Disordered residues
Common case
The most common case is a residue that contains one or more disordered atoms. This is evidently solved by
using DisorderedAtom objects to represent the disordered atoms, and storing the DisorderedAtom object in a
Residue object just like ordinary Atom objects. The DisorderedAtom will behave exactly like an ordinary
atom (in fact the atom with the highest occupancy) by forwarding all uncaught method calls to one of the
Atom objects (the selected Atom object) it contains.
Point mutations
A special case arises when disorder is due to a point mutation, i.e. when two or more point mutants of a
polypeptide are present in the crystal. An example of this can be found in PDB structure 1EN2.
Since these residues belong to a different residue type (e.g. let’s say Ser 60 and Cys 60) they should not be
stored in a single Residue object as in the common case. In this case, each residue is represented by one
Residue object, and both Residue objects are stored in a DisorderedResidue object.
The DisorderedResidue object forwards all uncaught methods to the selected Residue object (by default the
last Residue object added), and thus behaves like an ordinary residue. Each Residue object in a
DisorderedResidue object can be uniquely identified by its residue name. In the above example, residue Ser
60 would have id “SER” in the DisorderedResidue object, while residue Cys 60 would have id “CYS”. The
user can select the active Residue object in a DisorderedResidue object via this id.
Hetero residues
10.3.1 Associated problems
A common problem with hetero residues is that several hetero and non-hetero residues present in the same
chain share the same sequence identifier (and insertion code). Therefore, to generate a unique id for each
hetero residue, waters and other hetero residues are treated in a different way.
Remember that Residue object have the tuple (hetfield, resseq, icode) as id. The hetfield is blank (“ “) for
amino and nucleic acids, and a string for waters and other hetero residues. The content of the hetfield is
explained below.
Water residues
The hetfield string of a water residue consists of the letter “W”. So a typical residue id for a water is (“W”,
1, “ “).
Other hetero residues
The hetfield string for other hetero residues starts with “H_” followed by the residue name. A glucose
molecule e.g. with residue name “GLC” would have hetfield “H_GLC”. It’s residue id could e.g. be
(“H_GLC”, 1, “ “).
Some random usage examples
Parse a PDB file, and extract some Model, Chain, Residue and Atom objects.
from Bio.PDB.PDBParser import PDBParser
parser=PDBParser()
structure=parser.get_structure("test", "1fat.pdb")
model=structure[0]
chain=model["A"]
residue=chain[1]
atom=residue["CA"]
Extract a hetero residue from a chain (e.g. a glucose (GLC) moiety with resseq 10).
residue_id=("H_GLC", 10, " ")
residue=chain[residue_id]
Print all hetero residues in chain.
for residue in chain.get_list():
residue_id=residue.get_id()
hetfield=residue_id[0]
if hetfield[0]=="H":
print residue_id
Print out the coordinates of all CA atoms in a structure with B factor greater than 50.
for model in structure.get_list():
for chain in model.get_list():
for residue in chain.get_list():
if residue.has_id("CA"):
ca=residue["CA"]
if ca.get_bfactor()>50.0:
print ca.get_coord()
Print out all the residues that contain disordered atoms.
for model in structure.get_list():
for chain in model.get_list():
for residue in chain.get_list():
if residue.is_disordered():
resseq=residue.get_id()[1]
resname=residue.get_resname()
model_id=model.get_id()
chain_id=chain.get_id()
print model_id, chain_id, resname, resseq
Loop over all disordered atoms, and select all atoms with altloc A (if present). This will make sure that the
SMCRA data structure will behave as if only the atoms with altloc A are present.
for model in structure.get_list():
for chain in model.get_list():
for residue in chain.get_list():
if residue.is_disordered():
for atom in residue.get_list():
if atom.is_disordered():
if atom.disordered_has_id("A"):
atom.disordered_select("A")
Suppose that a chain has a point mutation at position 10, consisting of a Ser and a Cys residue. Make sure
that residue 10 of this chain behaves as the Cys residue.
residue=chain[10]
residue.disordered_select("CYS")
wxPython (GUI)
wxPython is a GUI toolkit for the Python programming language. It allows Python programmers to create
programs with a robust, highly functional graphical user interface, simply and easily. It is implemented as a
Python extension module (native code) that wraps the popular wxWidgets cross platform GUI library, which
is written in C++.
Like Python and wxWidgets, wxPython is Open Source which means that it is free for anyone to use and
the source code is available for anyone to look at and modify. Or anyone can contribute fixes or
enhancements to the project.
wxPython is a cross-platform toolkit. This means that the same program will run on multiple platforms
without modification. Currently supported platforms are 32-bit Microsoft Windows, most Unix or unix-like
systems, and Macintosh OS X.
A First Application: "Hello, World"
As is traditional, we are first going to write a Small "Hello, world" application. Here is the code:
1 #!/usr/bin/env python
2 import wx
3
4 app = wx.App(False) # Create a new app, don't redirect stdout/stderr to a window.
5 frame = wx.Frame(None, wx.ID_ANY, "Hello World") # A Frame is a top-level window.
6 frame.Show(True) # Show the frame.
7 app.MainLoop()
Explanations:
Every wxPython app is an instance of wx.App. For most simple
applications you can use wx.App as is. When you get to more
app = wx.App(False)
complex applications you may need to extend the wx.App class. The
"False" parameter means "don't redirect stdout and stderr to a
window".
A wx.Frame is a top-level window. The syntax is x.Frame(Parent,
Id, Title). Most of the constructors have this shape (a parent object,
wx.Frame(None,wx.ID_ANY,"Hello")
followed by an Id). In this example, we use None for "no parent" and
wx.ID_ANY to have wxWidgets pick an id for us.
frame.Show(True)
We make a frame visible by "showing" it.
Finally, we start the application's MainLoop whose role is to handle
app.MainLoop()
the events.
Note: You almost always want to use wx.ID_ANY or another standard ID provided by wxWidgets. You can
make your own IDs, but there is no reason to do that.
Run the program and you should see a window like this one:
Windows or Frames?
When people talk about GUIs, they usually speak of windows, menus and icons. Naturally then, you would
expect that wx.Window should represent a window on the screen. Unfortunately, this is not the case. A
wx.Window is the base class from which all visual elements are derived (buttons, menus, etc) and what we
normally think of as a program window is a wx.Frame. This is an unfortunate inconsistency that has led to
much confusion for new users.
Building a simple text editor
In this tutorial we are going to build a simple text editor. In the process, we will explore several widgets, and
learn about features such as events and callbacks.
First steps
The first step is to make a simple frame with an editable text box inside. A text box is made with the
wx.TextCtrl widget. By default, a text box is a single-line field, but the wx.TE_MULTILINE parameter
allows you to enter multiple lines of text.
1 #!/usr/bin/env python
2 import wx
3 class MyFrame(wx.Frame):
4 """ We simply derive a new class of Frame. """
5 def __init__(self, parent, title):
6
wx.Frame.__init__(self, parent, title=title, size=(200,100))
7
self.control = wx.TextCtrl(self, style=wx.TE_MULTILINE)
8
self.Show(True)
9
10 app = wx.App(False)
11 frame = MyFrame(None, 'Small editor')
12 app.MainLoop()
In this example, we derive from wx.Frame and overwrite its __init__ method. Here we declare a new
wx.TextCtrl which is a simple text edit control. Note that since the MyFrame runs self.Show() inside its
__init__ method, we no longer have to call frame.Show() explicitly.
Adding a menu bar
Every application should have a menu bar and a status bar. Let's add them to ours:
1 import wx
2
3 class MainWindow(wx.Frame):
4 def __init__(self, parent, title):
5
wx.Frame.__init__(self, parent, title=title, size=(200,100))
6
self.control = wx.TextCtrl(self, style=wx.TE_MULTILINE)
7
self.CreateStatusBar() # A Statusbar in the bottom of the window
8
9
# Setting up the menu.
10
filemenu= wx.Menu()
11
12
# wx.ID_ABOUT and wx.ID_EXIT are standard IDs provided by wxWidgets.
13
filemenu.Append(wx.ID_ABOUT, "&About"," Information about this program")
14
filemenu.AppendSeparator()
15
filemenu.Append(wx.ID_EXIT,"E&xit"," Terminate the program")
16
17
# Creating the menubar.
18
menuBar = wx.MenuBar()
19
menuBar.Append(filemenu,"&File") # Adding the "filemenu" to the MenuBar
20
self.SetMenuBar(menuBar) # Adding the MenuBar to the Frame content.
21
self.Show(True)
22
23 app = wx.App(False)
24 frame = MainWindow(None, "Sample editor")
25 app.MainLoop()
TIP: Notice the wx.ID_ABOUT and wx.ID_EXIT ids. These are standard ids provided by wxWidgets (see
a full list here). It is a good habit to use the standard ID if there is one available. This helps wxWidgets know
how to display the widget in each platform to make it look more native.
Event handling
Reacting to events in wxPython is called event handling. An event is when "something" happens on your
application (a button click, text input, mouse movement, etc). Much of GUI programming consists of
responding to events. You bind an object to an event using the Bind() method:
1 class MainWindow(wx.Frame):
2 def __init__(self, parent, title):
3
wx.Frame.__init__(self,parent, title=title, size=(200,100))
4
...
5
menuItem = filemenu.Append(wx.ID_ABOUT, "&About"," Information about this program")
6
self.Bind(wx.EVT_MENU, self.OnAbout, menuItem)
This means that, from now on, when the user selects the "About" menu item, the method self.OnAbout will
be executed. wx.EVT_MENU is the "select menu item" event. wxWidgets understands many other events
(see the full list). The self.OnAbout method has the general declaration:
1 def OnAbout(self, event):
2
...
Here event is an instance of a subclass of wx.Event. For example, a button-click event wx.EVT_BUTTON - is a subclass of wx.Event.
The method is executed when the event occurs. By default, this method will handle the event and the event
will stop after the callback finishes. However, you can "skip" an event with event.Skip(). This causes the
event to go through the hierarchy of event handlers. For example:
1 def OnButtonClick(self, event):
2 if (some_condition):
3
do_something()
4 else:
5
event.Skip()
6
7 def OnEvent(self, event):
8 ...
When a button-click event occurs, the method OnButtonClick gets called. If some_condition is true, we
do_something() otherwise we let the event be handled by the more general event handler. Now let's have a
look at our application:
1 import os
2 import wx
3
4
5 class MainWindow(wx.Frame):
6 def __init__(self, parent, title):
7
wx.Frame.__init__(self, parent, title=title, size=(200,100))
8
self.control = wx.TextCtrl(self, style=wx.TE_MULTILINE)
9
self.CreateStatusBar() # A StatusBar in the bottom of the window
10
11
# Setting up the menu.
12
filemenu= wx.Menu()
13
14
# wx.ID_ABOUT and wx.ID_EXIT are standard ids provided by wxWidgets.
15
menuAbout = filemenu.Append(wx.ID_ABOUT, "&About"," Information about this program")
16
menuExit = filemenu.Append(wx.ID_EXIT,"E&xit"," Terminate the program")
17
18
# Creating the menubar.
19
menuBar = wx.MenuBar()
20
menuBar.Append(filemenu,"&File") # Adding the "filemenu" to the MenuBar
21
self.SetMenuBar(menuBar) # Adding the MenuBar to the Frame content.
22
23
# Set events.
24
self.Bind(wx.EVT_MENU, self.OnAbout, menuAbout)
25
self.Bind(wx.EVT_MENU, self.OnExit, menuExit)
26
27
self.Show(True)
28
29 def OnAbout(self,e):
30
# A message dialog box with an OK button. wx.OK is a standard ID in wxWidgets.
31
dlg = wx.MessageDialog( self, "A small text editor", "About Sample Editor", wx.OK)
32
dlg.ShowModal() # Show it
33
dlg.Destroy() # finally destroy it when finished.
34
35 def OnExit(self,e):
36
self.Close(True) # Close the frame.
37
38 app = wx.App(False)
39 frame = MainWindow(None, "Sample editor")
40 app.MainLoop()
Note: Instead of
1 wx.MessageDialog( self, "A small editor in wxPython", "About Sample Editor", wx.OK)
We could have omitted the id. In that case wxWidget automatically generates an ID (just as if you specified
wx.ID_ANY):
1 wx.MessageDialog( self, "A small editor in wxPython", "About Sample Editor")
Dialogs
Of course an editor is useless if it is not able to save or open documents. That's where Common dialogs
come in. Common dialogs are those offered by the underlying platform so that your application will look
exactly like a native application. Here is the implementation of the OnOpen method in MainWindow:
1 def OnOpen(self,e):
2
""" Open a file"""
3
self.dirname = ''
4
dlg = wx.FileDialog(self, "Choose a file", self.dirname, "", "*.*", wx.OPEN)
5
if dlg.ShowModal() == wx.ID_OK:
6
self.filename = dlg.GetFilename()
7
self.dirname = dlg.GetDirectory()
8
f = open(os.path.join(self.dirname, self.filename), 'r')
9
self.control.SetValue(f.read())
10
f.close()
11
dlg.Destroy()
Explanation:



First, we create the dialog by calling the appropriate Constructor.
Then, we call ShowModal. That opens the dialog - "Modal" means that the user cannot do anything
on the application until he clicks OK or Cancel.
The return value of ShowModal is the Id of the button pressed. If the user pressed OK we read the
file.
You should now be able to add the corresponding entry into the menu and connect it to the OnOpen
method. If you have trouble, scroll down to the appendix for the full source code.
Possible extensions
Of course, this program is far from being a decent editor. But adding other features should not be any more
difficult than what has already been done. You might take inspiration from the demos that are bundled with
wxPython:




Drag and Drop.
MDI
Tab view/multiple files
Find/Replace dialog



Print dialog (Printing)
Macro-commands in python ( using the eval function)
etc ...
Working with Windows

Topics:
o Frames
o Windows
o Controls/Widgets
o Sizers
o Validators
In this section, we are going to present the way wxPython deals with windows and their contents, including
building input forms and using various widgets/controls. We are going to build a small application that
calculates the price of a quote.
Overview
Laying out Visual Elements

Within a frame, you'll use a number of wxWindow sub-classes to flesh out the frame's contents. Here
are some of the more common elements you might want to put in your frame:
o A wx.MenuBar, which puts a menu bar along the top of your frame.
o A wx.StatusBar, which sets up an area along the bottom of your frame for displaying status
messages, etc.
o A wx.ToolBar, which puts a toolbar in your frame.
o Sub-classes of wx.Control. These are objects which represent user interface widgets (ie,
visual elements which display data and/or process user input). Common examples of
wx.Control objects include wx.Button, wxStaticText, wx.TextCtrl and wx.ComboBox.
o A wx.Panel, which is a container to hold your various wx.Control objects. Putting your
wx.Control objects inside a wx.Panel means that the user can tab from one UI widget to the
next.
All visual elements (wxWindow objects and their subclasses) can hold sub-elements. Thus, for
example, a wx.Frame might hold a number of wx.Panel objects, which in turn hold a number of
wx.Button, wx.StaticText and wx.TextCtrl objects, giving you an entire hierarchy of elements:
Note that this merely describes the way that certain visual elements
are interrelated -- not how they are visually laid out within the frame. To handle the layout of
elements within a frame, you have several options:
6.
You can manually position each element by specifying it's exact pixel coordinate within the parent
window. Because of differences in font sizes, etc, between platforms, this option is not generally
recommended.
7.
You can use wx.LayoutConstraints, though these are fairly complex to use.
8.
You can use the Delphi-like LayoutAnchors, which make it easier to use wx.LayoutConstraints.
9.
You can use one of the wxSizer subclasses.
This document will concentrate on the use of wxSizers, because that's the scheme I'm most familiar
with.
Sizers

A sizer (that is, one of the wx.Sizer sub-classes) can be used to handle the visual arrangement of
elements within a window or frame. Sizers can:
o Calculate an appropriate size for each visual element.
o Position the elements according to certain rules.
o Dynamically resize and/or reposition elements when a frame is resized.
Some of the more common types of sizers include:
o
o
o
wx.BoxSizer, which arranges visual elements in a line going either horizontally or vertically.
wx.GridSizer, which lays visual elements out into a grid-like structure.
wx.FlexGridSizer, which is similar to a wx.GridSizer except that it allow for more flexibility
in laying out visual elements.
A sizer is given a list of wx.Window objects to size, either by calling sizer.Add(window, options...),
or by calling sizer.AddMany(...). A sizer will only work on those elements which it has been given.
Sizers can be nested. That is, you can add one sizer to another sizer, for example to have two rows of
buttons (each laid out by a horizontal wx.BoxSizer) contained within another wx.BoxSizer which
places the rows of buttons one above the other, like this:
o
Note: Notice that the above example does not lay out the six buttons into two rows of three
columns each -- to do that, you should use a wxGridSizer.
In the following example we use two nested sizers, the main one with vertical layout and the
embedded one with horizontal layout:
1 import wx
2 import os
3
4 class MainWindow(wx.Frame):
5 def __init__(self, parent, title):
6
self.dirname=''
7
8
# A "-1" in the size parameter instructs wxWidgets to use the default size.
9
# In this case, we select 200px width and the default height.
10
wx.Frame.__init__(self, parent, title=title, size=(200,-1))
11
self.control = wx.TextCtrl(self, style=wx.TE_MULTILINE)
12
self.CreateStatusBar() # A Statusbar in the bottom of the window
13
14
# Setting up the menu.
15
filemenu= wx.Menu()
16
menuOpen = filemenu.Append(wx.ID_OPEN, "&Open"," Open a file to edit")
17
menuAbout= filemenu.Append(wx.ID_ABOUT, "&About"," Information about this program")
18
menuExit = filemenu.Append(wx.ID_EXIT,"E&xit"," Terminate the program")
19
20
# Creating the menubar.
21
menuBar = wx.MenuBar()
22
menuBar.Append(filemenu,"&File") # Adding the "filemenu" to the MenuBar
23
self.SetMenuBar(menuBar) # Adding the MenuBar to the Frame content.
24
25
# Events.
26
self.Bind(wx.EVT_MENU, self.OnOpen, menuOpen)
27
self.Bind(wx.EVT_MENU, self.OnExit, menuExit)
28
self.Bind(wx.EVT_MENU, self.OnAbout, menuAbout)
29
30
self.sizer2 = wx.BoxSizer(wx.HORIZONTAL)
31
self.buttons = []
32
for i in range(0, 6):
33
self.buttons.append(wx.Button(self, -1, "Button &"+str(i)))
34
self.sizer2.Add(self.buttons[i], 1, wx.EXPAND)
35
36
# Use some sizers to see layout options
37
self.sizer = wx.BoxSizer(wx.VERTICAL)
38
self.sizer.Add(self.control, 1, wx.EXPAND)
39
self.sizer.Add(self.sizer2, 0, wx.EXPAND)
40
41
#Layout sizers
42
self.SetSizer(self.sizer)
43
self.SetAutoLayout(1)
44
self.sizer.Fit(self)
45
self.Show()
46
47 def OnAbout(self,e):
48
# Create a message dialog box
49
dlg = wx.MessageDialog(self, " A sample editor \n in wxPython", "About Sample Editor", wx.OK)
50
dlg.ShowModal() # Shows it
51
dlg.Destroy() # finally destroy it when finished.
52
53 def OnExit(self,e):
54
self.Close(True) # Close the frame.
55
56 def OnOpen(self,e):
57
""" Open a file"""
58
dlg = wx.FileDialog(self, "Choose a file", self.dirname, "", "*.*", wx.OPEN)
59
if dlg.ShowModal() == wx.ID_OK:
60
self.filename = dlg.GetFilename()
61
self.dirname = dlg.GetDirectory()
62
f = open(os.path.join(self.dirname, self.filename), 'r')
63
self.control.SetValue(f.read())
64
f.close()
65
dlg.Destroy()
66
67 app = wx.App(False)
68 frame = MainWindow(None, "Sample editor")
69 app.MainLoop()





The sizer.Add method has three arguments. The first one specifies the control to include in the sizer.
The second one is a weight factor which means that this control will be sized in proportion to other
ones. For example, if you had three edit controls and you wanted them to have the proportions 3:2:1
then you would specify these factors as arguments when adding the controls. 0 means that this
control or sizer will not grow. The third argument is normally wx.GROW (same as wx.EXPAND)
which means the control will be resized when necessary. If you use wx.SHAPED instead, the
controls aspect ratio will remain the same.
If the second parameter is 0, i.e. the control will not be resized, the third parameter may indicate if
the control should be centered horizontally and/or vertically by using
wx.ALIGN_CENTER_HORIZONTAL, wx.ALIGN_CENTER_VERTICAL, or
wx.ALIGN_CENTER (for both) instead of wx.GROW or wx.SHAPED as that third parameter.
You can alternatively specify combinations of wx.ALIGN_LEFT, wx.ALIGN_TOP,
wx.ALIGN_RIGHT, and wx.ALIGN_BOTTOM. The default behavior is equivalent to
wx.ALIGN_LEFT | wx.ALIGN_TOP.
One potentially confusing aspect of the wx.Sizer and its sub-classes is the distinction between a sizer
and a parent window. When you create objects to go inside a sizer, you do not make the sizer the
object's parent window. A sizer is a way of laying out windows, it is not a window in itself. In the
above example, all six buttons would be created with the parent window being the frame or window
which encloses the buttons -- not the sizer. If you try to create a visual element and pass the sizer as
the parent window, your program will crash.
Once you have set up your visual elements and added them to a sizer (or to a nested set of sizers), the
next step is to tell your frame or window to use the sizer. You do this in three steps:
1 window.SetSizer(sizer)
2 window.SetAutoLayout(true)
3 sizer.Fit(window)

The SetSizer() call tells your window (or frame) which sizer to use. The call to SetAutoLayout() tells
your window to use the sizer to position and size your components. And finally, the call to sizer.Fit()
tells the sizer to calculate the initial size and position for all its elements. If you are using sizers, this
is the normal process you would go through to set up your window or frame's contents before it is
displayed for the first time.
Validators

When you create a dialog box or other input form, you can use a wx.Validator to simplify the process
of loading data into your form, validating the entered data, and extracting the data out of the form
again. wx.Validator can also be used to intercept keystrokes and other events within an input field.
To use a validator, you have to create your own sub-class of wx.Validator (neither wx.TextValidator
nor wx.GenericValidator are implemented in wxPython). This sub-class is then associated with your
input field by calling myInputField.SetValidator(myValidator).
o Note: Your wx.Validator sub-class must implement the wxValidator.Clone() method.
A Working Example
Our first label within a panel

Let's start with an example. Our program is going to have a single Frame with a panel [7] containing
a label [8]:
1 import wx
2 class ExampleFrame(wx.Frame):
3 def __init__(self, parent):
4
wx.Frame.__init__(self, parent)
5
self.quote = wx.StaticText(self, label="Your quote :", pos=(20, 30), size=(200, -1))
6
self.Show()
7
8 app = wx.App(False)
9 ExampleFrame(None)
10 app.MainLoop()

1
This design should be clear and you should not have any problem with it if you read the Small Editor
section of this howto. Note: a sizer should be used here instead of specifying the position of each
widget. Notice in the line:
self.quote = wx.StaticText(self, label="Your quote :", pos=(20, 30))

the use of self as parent parameter of our wxStaticText. Our Static text is going to be on the panel we
have created just before. The second parameter refers to the Id number of the new control. As this
static control is not really going to be sending events, we don't care about this parameter. The
wxPoint is used as positioning parameter. There is also an optional wxSize parameter but its use here
is not justified.
[7] According to wxPython documentation:

"A panel is a window on which controls are placed. It is usually placed within a frame. It contains
minimal extra functionality over and above its parent class wxWindow; its main purpose is to be
similar in appearance and functionality to a dialog, but with the flexibility of having any window as a
parent.", in fact, it is a simple window used as a (grayed) background for other objects which are
meant to deal with data entry. These are generally known as Controls or Widgets.
[8] A label is used to display text that is not supposed to interact with the user.
Adding a few more controls

You will find a complete list of the numerous Controls that exist in wxPython in the demo and help,
but here we are going to present those most frequently used:
o wxButton The most basic Control: A button showing a text that you can click. For example,
here is a "Clear" button (e.g. to clear a text):
1 clearButton = wx.Button(self, wx.ID_CLEAR, "Clear")
2 self.Bind(wx.EVT_BUTTON, self.OnClear, clearButton)

wxTextCtrl This control let the user input text. It generates two main events. EVT_TEXT is called
whenever the text changes. EVT_CHAR is called whenever a key has been pressed.
1 textField = wx.TextCtrl(self)
2 self.Bind(wx.EVT_TEXT, self.OnChange, textField)
3 self.Bind(wx.EVT_CHAR, self.OnKeyPress, textField)





For example: If the user presses the "Clear" button and that clears the text field, that will generate an
EVT_TEXT event, but not an EVT_CHAR event.
wxComboBox A combobox is very similar to wxTextCtrl but in addition to the events generated by
wxTextCtrl, wxComboBox has the EVT_COMBOBOX event.
wxCheckBox The checkbox is a control that gives the user true/false choice.
wxRadioBox The radiobox lets the user choose from a list of options.
Let's have a closer look at what Form1 looks like now:
1 import wx
2 class ExamplePanel(wx.Panel):
3 def __init__(self, parent):
4
wx.Panel.__init__(self, parent)
5
self.quote = wx.StaticText(self, label="Your quote :", pos=(20, 30))
6
7
# A multiline TextCtrl - This is here to show how the events work in this program, don't pay too
much attention to it
8
self.logger = wx.TextCtrl(self, pos=(300,20), size=(200,300), style=wx.TE_MULTILINE |
wx.TE_READONLY)
9
10
# A button
11
self.button =wx.Button(self, label="Save", pos=(200, 325))
12
self.Bind(wx.EVT_BUTTON, self.OnClick,self.button)
13
14
# the edit control - one line version.
15
self.lblname = wx.StaticText(self, label="Your name :", pos=(20,60))
16
self.editname = wx.TextCtrl(self, value="Enter here your name", pos=(150, 60), size=(140,-1))
17
self.Bind(wx.EVT_TEXT, self.EvtText, self.editname)
18
self.Bind(wx.EVT_CHAR, self.EvtChar, self.editname)
19
20
# the combobox Control
21
self.sampleList = ['friends', 'advertising', 'web search', 'Yellow Pages']
22
self.lblhear = wx.StaticText(self, label="How did you hear from us ?", pos=(20, 90))
23
self.edithear = wx.ComboBox(self, pos=(150, 90), size=(95, -1), choices=self.sampleList,
style=wx.CB_DROPDOWN)
24
self.Bind(wx.EVT_COMBOBOX, self.EvtComboBox, self.edithear)
25
self.Bind(wx.EVT_TEXT, self.EvtText,self.edithear)
26
27
# Checkbox
28
self.insure = wx.CheckBox(self, label="Do you want Insured Shipment ?", pos=(20,180))
29
self.Bind(wx.EVT_CHECKBOX, self.EvtCheckBox, self.insure)
30
31
# Radio Boxes
32
radioList = ['blue', 'red', 'yellow', 'orange', 'green', 'purple', 'navy blue', 'black', 'gray']
33
rb = wx.RadioBox(self, label="What color would you like ?", pos=(20, 210), choices=radioList,
majorDimension=3,
34
style=wx.RA_SPECIFY_COLS)
35
self.Bind(wx.EVT_RADIOBOX, self.EvtRadioBox, rb)
36
37 def EvtRadioBox(self, event):
38
self.logger.AppendText('EvtRadioBox: %d\n' % event.GetInt())
39 def EvtComboBox(self, event):
40
self.logger.AppendText('EvtComboBox: %s\n' % event.GetString())
41 def OnClick(self,event):
42
self.logger.AppendText(" Click on object with Id %d\n" %event.GetId())
43 def EvtText(self, event):
44
self.logger.AppendText('EvtText: %s\n' % event.GetString())
45 def EvtChar(self, event):
46
self.logger.AppendText('EvtChar: %d\n' % event.GetKeyCode())
47
event.Skip()
48 def EvtCheckBox(self, event):
49
self.logger.AppendText('EvtCheckBox: %d\n' % event.Checked())
50
51
52 app = wx.App(False)
53 frame = wx.Frame(None)
54 panel = ExamplePanel(frame)
55 frame.Show()
56 app.MainLoop()

Our class has grown a lot bigger; it has now a lot of controls and
those controls react. We have added a special wxTextCtrl control to show the various events that are
sent by the controls.
The notebook

Sometimes, a form grows too big to fit on a single page. The
wxNoteBook is used in that kind of case : It allows the user to navigate quickly between a small
amount of pages by clicking on associated tabs. We implement this by putting the wxNotebook
instead of our form into the main Frame and then add our Form1 into the notebook by using method
AddPage.
Note how ExamplePanel's parent is the Notebook. This is important.
1 app = wx.App(False)
2 frame = wx.Frame(None, title="Demo with Notebook")
3 nb = wx.Notebook(frame)
4
5
6 nb.AddPage(ExamplePanel(nb), "Absolute Positioning")
7 nb.AddPage(ExamplePanel(nb), "Page Two")
8 nb.AddPage(ExamplePanel(nb), "Page Three")
9 frame.Show()
10 app.MainLoop()
Improving the layout - Using Sizers

Using absolute positioning is often not very satisfying: The result is ugly if the windows are not (for
one reason or another) the
right size. WxPython has very rich vocabulary of objects to lay out controls.
wx.BoxSizer is the most common and simple layout object but it permits a vast range of
possibilities. Its role is roughly to arrange a set of controls in a line or in a row and rearrange
them when needed (i.e. when the global size is changed).
o wx.GridSizer and wx.FlexGridSizer are two very important layout tools. They arrange the
controls in a tabular layout.
o
Here is the sample above re-written to use sizers:
1 class ExamplePanel(wx.Panel):
2 def __init__(self, parent):
3
wx.Panel.__init__(self, parent)
4
5
# create some sizers
6
mainSizer = wx.BoxSizer(wx.VERTICAL)
7
grid = wx.GridBagSizer(hgap=5, vgap=5)
8
hSizer = wx.BoxSizer(wx.HORIZONTAL)
9
10
self.quote = wx.StaticText(self, label="Your quote: ")
11
grid.Add(self.quote, pos=(0,0))
12
13
# A multiline TextCtrl - This is here to show how the events work in this program, don't pay too
much attention to it
14
self.logger = wx.TextCtrl(self, size=(200,300), style=wx.TE_MULTILINE | wx.TE_READONLY)
15
16
# A button
17
self.button =wx.Button(self, label="Save")
18
self.Bind(wx.EVT_BUTTON, self.OnClick,self.button)
19
20
# the edit control - one line version.
21
self.lblname = wx.StaticText(self, label="Your name :")
22
grid.Add(self.lblname, pos=(1,0))
23
self.editname = wx.TextCtrl(self, value="Enter here your name", size=(140,-1))
24
grid.Add(self.editname, pos=(1,1))
25
self.Bind(wx.EVT_TEXT, self.EvtText, self.editname)
26
self.Bind(wx.EVT_CHAR, self.EvtChar, self.editname)
27
28
# the combobox Control
29
self.sampleList = ['friends', 'advertising', 'web search', 'Yellow Pages']
30
self.lblhear = wx.StaticText(self, label="How did you hear from us ?")
31
grid.Add(self.lblhear, pos=(3,0))
32
self.edithear = wx.ComboBox(self, size=(95, -1), choices=self.sampleList,
style=wx.CB_DROPDOWN)
33
grid.Add(self.edithear, pos=(3,1))
34
self.Bind(wx.EVT_COMBOBOX, self.EvtComboBox, self.edithear)
35
self.Bind(wx.EVT_TEXT, self.EvtText,self.edithear)
36
37
# add a spacer to the sizer
38
grid.Add((10, 40), pos=(2,0))
39
40
# Checkbox
41
self.insure = wx.CheckBox(self, label="Do you want Insured Shipment ?")
42
grid.Add(self.insure, pos=(4,0), span=(1,2), flag=wx.BOTTOM, border=5)
43
self.Bind(wx.EVT_CHECKBOX, self.EvtCheckBox, self.insure)
44
45
# Radio Boxes
46
radioList = ['blue', 'red', 'yellow', 'orange', 'green', 'purple', 'navy blue', 'black', 'gray']
47
rb = wx.RadioBox(self, label="What color would you like ?", pos=(20, 210), choices=radioList,
majorDimension=3,
48
style=wx.RA_SPECIFY_COLS)
49
grid.Add(rb, pos=(5,0), span=(1,2))
50
self.Bind(wx.EVT_RADIOBOX, self.EvtRadioBox, rb)
51
52
hSizer.Add(grid, 0, wx.ALL, 5)
53
hSizer.Add(self.logger)
54
mainSizer.Add(hSizer, 0, wx.ALL, 5)
55
mainSizer.Add(self.button, 0, wx.CENTER)
56
self.SetSizerAndFit(mainSizer)
This example uses a GridBagSizer to place its items. The "pos" argument controls where in the grid the
widget is placed, in (x, y) positions. For instance, (0, 0) is the top-left position, whereas (3, 5) would be the
third row down, and the fifth column across. The span argument allows a widget to span over multiple rows
or columns.
Source of this course: http://docs.python.org/tutorial/.