Download Python # Biopython # wxPython - IBERO-NBIC @ Inicio

Python # Biopython # wxPython Programming Course for Bioinformatics 14-19 May 2011 Organised by RNASA-IMEDIR Group University of A Coruña, Spain Alejandro Pazos Sierra (UDC), [email protected] Julián Dorado de la Calle (UDC), [email protected] Victoria López Alonso (ISCIII), [email protected] Diana de la Iglesia Jimenez (GIB, UPM), [email protected] Scientific Networks Ibero-American Network of the Nano-Bio-Info-Cogno Convergent Technologies (Ibero-NBIC) Galician Network for Colorectal Cancer Research (REGICC) Cooperative Research Thematic Network on Computational Medicine (COMBIOMED) Teachers Cristian Robert Munteanu (UDC), [email protected] José Antonio Seoane Fernández (UDC), [email protected] UDC = University of A Coruña ISCIII = Instituto de Salud Carlos III UPM = Universidad Politécnica de Madrid Python Python is a powerful programming language, easy to learn, with an efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms. The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms from the Python Web site, http://www.python.org/, and may be freely distributed. The same site also contains distributions of and pointers to many free third party Python modules, programs and tools, and additional documentation. The Python interpreter is easily extended with new functions and data types implemented in C or C++ (or other languages callable from C). Python is also suitable as an extension language for customizable applications. This tutorial introduces the reader informally to the basic concepts and features of the Python language and system. It helps to have a Python interpreter handy for hands-on experience, but all examples are selfcontained, so the tutorial can be read off-line as well. For a description of standard objects and modules, see The Python Standard Library. The Python Language Reference gives a more formal definition of the language. To write extensions in C or C++, read Extending and Embedding the Python Interpreter and Python/C API Reference Manual. There are also several books covering Python in depth. This tutorial does not attempt to be comprehensive and cover every single feature, or even every commonly used feature. Instead, it introduces many of Python’s most noteworthy features, and will give you a good idea of the language’s flavor and style. After reading it, you will be able to read and write Python modules and programs, and you will be ready to learn more about the various Python library modules described in The Python Standard Library. Numbers The interpreter acts as a simple calculator: you can type an expression at it and it will write the value. Expression syntax is straightforward: the operators +, -, * and / work just like in most other languages (for example, Pascal or C); parentheses can be used for grouping. For example: >>> 2+2 4 >>> # This is a comment ... 2+2 4 >>> 2+2 # and a comment on the same line as code 4 >>> (50-5*6)/4 5 >>> # Integer division returns the floor: ... 7/3 2 >>> 7/-3 -3 The equal sign ('=') is used to assign a value to a variable. Afterwards, no result is displayed before the next interactive prompt: >>> width = 20 >>> height = 5*9 >>> width * height 900 A value can be assigned to several variables simultaneously: >>> x = y = z = 0 # Zero x, y and z >>> x 0 >>> y 0 >>> z 0 Variables must be “defined” (assigned a value) before they can be used, or an error will occur: >>> # try to access an undefined variable ... n Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'n' is not defined There is full support for floating point; operators with mixed type operands convert the integer operand to floating point: >>> 3 * 3.75 / 1.5 7.5 >>> 7.0 / 2 3.5 Complex numbers are also supported; imaginary numbers are written with a suffix of j or J. Complex numbers with a nonzero real component are written as (real+imagj), or can be created with the complex(real, imag) function. >>> 1j * 1J (-1+0j) >>> 1j * complex(0,1) (-1+0j) >>> 3+1j*3 (3+3j) >>> (3+1j)*3 (9+3j) >>> (1+2j)/(1+1j) (1.5+0.5j) Complex numbers are always represented as two floating point numbers, the real and imaginary part. To extract these parts from a complex number z, use z.real and z.imag. >>> a=1.5+0.5j >>> a.real 1.5 >>> a.imag 0.5 The conversion functions to floating point and integer (float(), int() and long()) don’t work for complex numbers — there is no one correct way to convert a complex number to a real number. Use abs(z) to get its magnitude (as a float) or z.real to get its real part. >>> a=3.0+4.0j >>> float(a) Traceback (most recent call last): File "<stdin>", line 1, in ? TypeError: can't convert complex to float; use abs(z) >>> a.real 3.0 >>> a.imag 4.0 >>> abs(a) # sqrt(a.real**2 + a.imag**2) 5.0 In interactive mode, the last printed expression is assigned to the variable _. This means that when you are using Python as a desk calculator, it is somewhat easier to continue calculations, for example: >>> tax = 12.5 / 100 >>> price = 100.50 >>> price * tax 12.5625 >>> price + _ 113.0625 >>> round(_, 2) 113.06 This variable should be treated as read-only by the user. Don’t explicitly assign a value to it — you would create an independent local variable with the same name masking the built-in variable with its magic behavior. Strings Besides numbers, Python can also manipulate strings, which can be expressed in several ways. They can be enclosed in single quotes or double quotes: >>> 'spam eggs' 'spam eggs' >>> 'doesn\'t' "doesn't" >>> "doesn't" "doesn't" >>> '"Yes," he said.' '"Yes," he said.' >>> "\"Yes,\" he said." '"Yes," he said.' >>> '"Isn\'t," she said.' '"Isn\'t," she said.' The interpreter prints the result of string operations in the same way as they are typed for input: inside quotes, and with quotes and other funny characters escaped by backslashes, to show the precise value. The string is enclosed in double quotes if the string contains a single quote and no double quotes, else it’s enclosed in single quotes. The print statement produces a more readable output for such input strings. String literals can span multiple lines in several ways. Continuation lines can be used, with a backslash as the last character on the line indicating that the next line is a logical continuation of the line: hello = "This is a rather long string containing\n\ several lines of text just as you would do in C.\n\ Note that whitespace at the beginning of the line is\ significant." print hello Note that newlines still need to be embedded in the string using \n – the newline following the trailing backslash is discarded. This example would print the following: This is a rather long string containing several lines of text just as you would do in C. Note that whitespace at the beginning of the line is significant. Or, strings can be surrounded in a pair of matching triple-quotes: """ or '''. End of lines do not need to be escaped when using triple-quotes, but they will be included in the string. print """ Usage: thingy [OPTIONS] -h Display this usage message -H hostname Hostname to connect to """ produces the following output: Usage: thingy [OPTIONS] -h Display this usage message -H hostname Hostname to connect to If we make the string literal a “raw” string, \n sequences are not converted to newlines, but the backslash at the end of the line, and the newline character in the source, are both included in the string as data. Thus, the example: hello = r"This is a rather long string containing\n\ several lines of text much as you would do in C." print hello would print: This is a rather long string containing\n\ several lines of text much as you would do in C. The interpreter prints the result of string operations in the same way as they are typed for input: inside quotes, and with quotes and other funny characters escaped by backslashes, to show the precise value. The string is enclosed in double quotes if the string contains a single quote and no double quotes, else it’s enclosed in single quotes. (The print statement, described later, can be used to write strings without quotes or escapes.) Strings can be concatenated (glued together) with the + operator, and repeated with *: >>> word = 'Help' + 'A' >>> word 'HelpA' >>> '<' + word*5 + '>' '<HelpAHelpAHelpAHelpAHelpA>' Two string literals next to each other are automatically concatenated; the first line above could also have been written word = 'Help' 'A'; this only works with two literals, not with arbitrary string expressions: >>> 'str' 'ing' # <- This is ok 'string' >>> 'str'.strip() + 'ing' # <- This is ok 'string' >>> 'str'.strip() 'ing' # <- This is invalid File "<stdin>", line 1, in ? 'str'.strip() 'ing' ^ SyntaxError: invalid syntax Strings can be subscripted (indexed); like in C, the first character of a string has subscript (index) 0. There is no separate character type; a character is simply a string of size one. Like in Icon, substrings can be specified with the slice notation: two indices separated by a colon. >>> word[4] 'A' >>> word[0:2] 'He' >>> word[2:4] 'lp' Slice indices have useful defaults; an omitted first index defaults to zero, an omitted second index defaults to the size of the string being sliced. >>> word[:2] 'He' >>> word[2:] 'lpA' # The first two characters # Everything except the first two characters Unlike a C string, Python strings cannot be changed. Assigning to an indexed position in the string results in an error: >>> word[0] = 'x' Traceback (most recent call last): File "<stdin>", line 1, in ? TypeError: object does not support item assignment >>> word[:1] = 'Splat' Traceback (most recent call last): File "<stdin>", line 1, in ? TypeError: object does not support slice assignment However, creating a new string with the combined content is easy and efficient: >>> 'x' + word[1:] 'xelpA' >>> 'Splat' + word[4] 'SplatA' Here’s a useful invariant of slice operations: s[:i] + s[i:] equals s. >>> word[:2] + word[2:] 'HelpA' >>> word[:3] + word[3:] 'HelpA' Degenerate slice indices are handled gracefully: an index that is too large is replaced by the string size, an upper bound smaller than the lower bound returns an empty string. >>> word[1:100] 'elpA' >>> word[10:] '' >>> word[2:1] '' Indices may be negative numbers, to start counting from the right. For example: >>> word[-1] # The last character 'A' >>> word[-2] # The last-but-one character 'p' >>> word[-2:] # The last two characters 'pA' >>> word[:-2] # Everything except the last two characters 'Hel' But note that -0 is really the same as 0, so it does not count from the right! >>> word[-0] 'H' # (since -0 equals 0) Out-of-range negative slice indices are truncated, but don’t try this for single-element (non-slice) indices: >>> word[-100:] 'HelpA' >>> word[-10] # error Traceback (most recent call last): File "<stdin>", line 1, in ? IndexError: string index out of range One way to remember how slices work is to think of the indices as pointing between characters, with the left edge of the first character numbered 0. Then the right edge of the last character of a string of n characters has index n, for example: +---+---+---+---+---+ |H|e|l|p|A| +---+---+---+---+---+ 0 1 2 3 4 5 -5 -4 -3 -2 -1 The first row of numbers gives the position of the indices 0...5 in the string; the second row gives the corresponding negative indices. The slice from i to j consists of all characters between the edges labeled i and j, respectively. For non-negative indices, the length of a slice is the difference of the indices, if both are within bounds. For example, the length of word[1:3] is 2. The built-in function len() returns the length of a string: >>> s = 'supercalifragilisticexpialidocious' >>> len(s) 34 Unicode Strings Starting with Python 2.0 a new data type for storing text data is available to the programmer: the Unicode object. It can be used to store and manipulate Unicode data (see http://www.unicode.org/) and integrates well with the existing string objects, providing auto-conversions where necessary. Unicode has the advantage of providing one ordinal for every character in every script used in modern and ancient texts. Previously, there were only 256 possible ordinals for script characters. Texts were typically bound to a code page which mapped the ordinals to script characters. This lead to very much confusion especially with respect to internationalization (usually written as i18n — 'i' + 18 characters + 'n') of software. Unicode solves these problems by defining one code page for all scripts. Creating Unicode strings in Python is just as simple as creating normal strings: >>> u'Hello World !' u'Hello World !' The small 'u' in front of the quote indicates that a Unicode string is supposed to be created. If you want to include special characters in the string, you can do so by using the Python Unicode-Escape encoding. The following example shows how: >>> u'Hello\u0020World !' u'Hello World !' The escape sequence \u0020 indicates to insert the Unicode character with the ordinal value 0x0020 (the space character) at the given position. Other characters are interpreted by using their respective ordinal values directly as Unicode ordinals. If you have literal strings in the standard Latin-1 encoding that is used in many Western countries, you will find it convenient that the lower 256 characters of Unicode are the same as the 256 characters of Latin-1. For experts, there is also a raw mode just like the one for normal strings. You have to prefix the opening quote with ‘ur’ to have Python use the Raw-Unicode-Escape encoding. It will only apply the above \uXXXX conversion if there is an uneven number of backslashes in front of the small ‘u’. >>> ur'Hello\u0020World !' u'Hello World !' >>> ur'Hello\\u0020World !' u'Hello\\\\u0020World !' The raw mode is most useful when you have to enter lots of backslashes, as can be necessary in regular expressions. Apart from these standard encodings, Python provides a whole set of other ways of creating Unicode strings on the basis of a known encoding. The built-in function unicode() provides access to all registered Unicode codecs (COders and DECoders). Some of the more well known encodings which these codecs can convert are Latin-1, ASCII, UTF-8, and UTF-16. The latter two are variable-length encodings that store each Unicode character in one or more bytes. The default encoding is normally set to ASCII, which passes through characters in the range 0 to 127 and rejects any other characters with an error. When a Unicode string is printed, written to a file, or converted with str(), conversion takes place using this default encoding. >>> u"abc" u'abc' >>> str(u"abc") 'abc' >>> u"äöü" u'\xe4\xf6\xfc' >>> str(u"äöü") Traceback (most recent call last): File "<stdin>", line 1, in ? UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128) To convert a Unicode string into an 8-bit string using a specific encoding, Unicode objects provide an encode() method that takes one argument, the name of the encoding. Lowercase names for encodings are preferred. >>> u"äöü".encode('utf-8') '\xc3\xa4\xc3\xb6\xc3\xbc' If you have data in a specific encoding and want to produce a corresponding Unicode string from it, you can use the unicode() function with the encoding name as the second argument. >>> unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf-8') u'\xe4\xf6\xfc' Lists Python knows a number of compound data types, used to group together other values. The most versatile is the list, which can be written as a list of comma-separated values (items) between square brackets. List items need not all have the same type. >>> a = ['spam', 'eggs', 100, 1234] >>> a ['spam', 'eggs', 100, 1234] Like string indices, list indices start at 0, and lists can be sliced, concatenated and so on: >>> a[0] 'spam' >>> a[3] 1234 >>> a[-2] 100 >>> a[1:-1] ['eggs', 100] >>> a[:2] + ['bacon', 2*2] ['spam', 'eggs', 'bacon', 4] >>> 3*a[:3] + ['Boo!'] ['spam', 'eggs', 100, 'spam', 'eggs', 100, 'spam', 'eggs', 100, 'Boo!'] All slice operations return a new list containing the requested elements. This means that the following slice returns a shallow copy of the list a: >>> a[:] ['spam', 'eggs', 100, 1234] Unlike strings, which are immutable, it is possible to change individual elements of a list: >>> a ['spam', 'eggs', 100, 1234] >>> a[2] = a[2] + 23 >>> a ['spam', 'eggs', 123, 1234] Assignment to slices is also possible, and this can even change the size of the list or clear it entirely: >>> # Replace some items: ... a[0:2] = [1, 12] >>> a [1, 12, 123, 1234] >>> # Remove some: ... a[0:2] = [] >>> a [123, 1234] >>> # Insert some: ... a[1:1] = ['bletch', 'xyzzy'] >>> a [123, 'bletch', 'xyzzy', 1234] >>> # Insert (a copy of) itself at the beginning >>> a[:0] = a >>> a [123, 'bletch', 'xyzzy', 1234, 123, 'bletch', 'xyzzy', 1234] >>> # Clear the list: replace all items with an empty list >>> a[:] = [] >>> a [] The built-in function len() also applies to lists: >>> a = ['a', 'b', 'c', 'd'] >>> len(a) 4 It is possible to nest lists (create lists containing other lists), for example: >>> q = [2, 3] >>> p = [1, q, 4] >>> len(p) 3 >>> p[1] [2, 3] >>> p[1][0] 2 >>> p[1].append('xtra') >>> p [1, [2, 3, 'xtra'], 4] >>> q [2, 3, 'xtra'] # See section 5.1 Note that in the last example, p[1] and q really refer to the same object! We’ll come back to object semantics later. del statement There is a way to remove an item from a list given its index instead of its value: the del statement. This differs from the pop() method which returns a value. The del statement can also be used to remove slices from a list or clear the entire list (which we did earlier by assignment of an empty list to the slice). For example: >>> a = [-1, 1, 66.25, 333, 333, 1234.5] >>> del a[0] >>> a [1, 66.25, 333, 333, 1234.5] >>> del a[2:4] >>> a [1, 66.25, 1234.5] >>> del a[:] >>> a [] del can also be used to delete entire variables: >>> del a Referencing the name a hereafter is an error (at least until another value is assigned to it). We’ll find other uses for del later. Tuples and Sequences We saw that lists and strings have many common properties, such as indexing and slicing operations. They are two examples of sequence data types (see Sequence Types — str, unicode, list, tuple, bytearray, buffer, xrange). Since Python is an evolving language, other sequence data types may be added. There is also another standard sequence data type: the tuple. A tuple consists of a number of values separated by commas, for instance: >>> t = 12345, 54321, 'hello!' >>> t[0] 12345 >>> t (12345, 54321, 'hello!') >>> # Tuples may be nested: ... u = t, (1, 2, 3, 4, 5) >>> u ((12345, 54321, 'hello!'), (1, 2, 3, 4, 5)) As you see, on output tuples are always enclosed in parentheses, so that nested tuples are interpreted correctly; they may be input with or without surrounding parentheses, although often parentheses are necessary anyway (if the tuple is part of a larger expression). Tuples have many uses. For example: (x, y) coordinate pairs, employee records from a database, etc. Tuples, like strings, are immutable: it is not possible to assign to the individual items of a tuple (you can simulate much of the same effect with slicing and concatenation, though). It is also possible to create tuples which contain mutable objects, such as lists. A special problem is the construction of tuples containing 0 or 1 items: the syntax has some extra quirks to accommodate these. Empty tuples are constructed by an empty pair of parentheses; a tuple with one item is constructed by following a value with a comma (it is not sufficient to enclose a single value in parentheses). Ugly, but effective. For example: >>> empty = () >>> singleton = 'hello', >>> len(empty) 0 >>> len(singleton) 1 >>> singleton ('hello',) # <-- note trailing comma The statement t = 12345, 54321, 'hello!' is an example of tuple packing: the values 12345, 54321 and 'hello!' are packed together in a tuple. The reverse operation is also possible: >>> x, y, z = t This is called, appropriately enough, sequence unpacking and works for any sequence on the right-hand side. Sequence unpacking requires the list of variables on the left to have the same number of elements as the length of the sequence. Note that multiple assignment is really just a combination of tuple packing and sequence unpacking. Sets Python also includes a data type for sets. A set is an unordered collection with no duplicate elements. Basic uses include membership testing and eliminating duplicate entries. Set objects also support mathematical operations like union, intersection, difference, and symmetric difference. Here is a brief demonstration: >>> basket = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana'] >>> fruit = set(basket) # create a set without duplicates >>> fruit set(['orange', 'pear', 'apple', 'banana']) >>> 'orange' in fruit # fast membership testing True >>> 'crabgrass' in fruit False >>> # Demonstrate set operations on unique letters from two words ... >>> a = set('abracadabra') >>> b = set('alacazam') >>> a # unique letters in a set(['a', 'r', 'b', 'c', 'd']) >>> a - b # letters in a but not in b set(['r', 'd', 'b']) >>> a | b # letters in either a or b set(['a', 'c', 'r', 'd', 'b', 'm', 'z', 'l']) >>> a & b # letters in both a and b set(['a', 'c']) >>> a ^ b # letters in a or b but not both set(['r', 'd', 'b', 'm', 'z', 'l']) Dictionaries Another useful data type built into Python is the dictionary. Dictionaries are sometimes found in other languages as “associative memories” or “associative arrays”. Unlike sequences, which are indexed by a range of numbers, dictionaries are indexed by keys, which can be any immutable type; strings and numbers can always be keys. Tuples can be used as keys if they contain only strings, numbers, or tuples; if a tuple contains any mutable object either directly or indirectly, it cannot be used as a key. You can’t use lists as keys, since lists can be modified in place using index assignments, slice assignments, or methods like append() and extend(). It is best to think of a dictionary as an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary). A pair of braces creates an empty dictionary: {}. Placing a commaseparated list of key:value pairs within the braces adds initial key:value pairs to the dictionary; this is also the way dictionaries are written on output. The main operations on a dictionary are storing a value with some key and extracting the value given the key. It is also possible to delete a key:value pair with del. If you store using a key that is already in use, the old value associated with that key is forgotten. It is an error to extract a value using a non-existent key. The keys() method of a dictionary object returns a list of all the keys used in the dictionary, in arbitrary order (if you want it sorted, just apply the sorted() function to it). To check whether a single key is in the dictionary, use the in keyword. Here is a small example using a dictionary: >>> tel = {'jack': 4098, 'sape': 4139} >>> tel['guido'] = 4127 >>> tel {'sape': 4139, 'guido': 4127, 'jack': 4098} >>> tel['jack'] 4098 >>> del tel['sape'] >>> tel['irv'] = 4127 >>> tel {'guido': 4127, 'irv': 4127, 'jack': 4098} >>> tel.keys() ['guido', 'irv', 'jack'] >>> 'guido' in tel True The dict() constructor builds dictionaries directly from lists of key-value pairs stored as tuples. When the pairs form a pattern, list comprehensions can compactly specify the key-value list. >>> dict([('sape', 4139), ('guido', 4127), ('jack', 4098)]) {'sape': 4139, 'jack': 4098, 'guido': 4127} >>> dict([(x, x**2) for x in (2, 4, 6)]) # use a list comprehension {2: 4, 4: 16, 6: 36} Later in the tutorial, we will learn about Generator Expressions which are even better suited for the task of supplying key-values pairs to the dict() constructor. When the keys are simple strings, it is sometimes easier to specify pairs using keyword arguments: >>> dict(sape=4139, guido=4127, jack=4098) {'sape': 4139, 'jack': 4098, 'guido': 4127} Looping Techniques When looping through dictionaries, the key and corresponding value can be retrieved at the same time using the iteritems() method. >>> knights = {'gallahad': 'the pure', 'robin': 'the brave'} >>> for k, v in knights.iteritems(): ... print k, v ... gallahad the pure robin the brave When looping through a sequence, the position index and corresponding value can be retrieved at the same time using the enumerate() function. >>> for i, v in enumerate(['tic', 'tac', 'toe']): ... print i, v ... 0 tic 1 tac 2 toe To loop over two or more sequences at the same time, the entries can be paired with the zip() function. >>> questions = ['name', 'quest', 'favorite color'] >>> answers = ['lancelot', 'the holy grail', 'blue'] >>> for q, a in zip(questions, answers): ... print 'What is your {0}? It is {1}.'.format(q, a) ... What is your name? It is lancelot. What is your quest? It is the holy grail. What is your favorite color? It is blue. To loop over a sequence in reverse, first specify the sequence in a forward direction and then call the reversed() function. >>> for i in reversed(xrange(1,10,2)): ... print i ... 9 7 5 3 1 To loop over a sequence in sorted order, use the sorted() function which returns a new sorted list while leaving the source unaltered. >>> basket = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana'] >>> for f in sorted(set(basket)): ... print f ... apple banana orange pear CONTROL FLOW TOOLS if Statements Perhaps the most well-known statement type is the if statement. For example: >>> x = int(raw_input("Please enter an integer: ")) Please enter an integer: 42 >>> if x < 0: ... x=0 ... print 'Negative changed to zero' ... elif x == 0: ... print 'Zero' ... elif x == 1: ... print 'Single' ... else: ... print 'More' ... More There can be zero or more elif parts, and the else part is optional. The keyword ‘elif‘ is short for ‘else if’, and is useful to avoid excessive indentation. An if ... elif ... elif ... sequence is a substitute for the switch or case statements found in other languages. for Statements The for statement in Python differs a bit from what you may be used to in C or Pascal. Rather than always iterating over an arithmetic progression of numbers (like in Pascal), or giving the user the ability to define both the iteration step and halting condition (as C), Python’s for statement iterates over the items of any sequence (a list or a string), in the order that they appear in the sequence. For example: >>> # Measure some strings: ... a = ['cat', 'window', 'defenestrate'] >>> for x in a: ... print x, len(x) ... cat 3 window 6 defenestrate 12 It is not safe to modify the sequence being iterated over in the loop (this can only happen for mutable sequence types, such as lists). If you need to modify the list you are iterating over (for example, to duplicate selected items) you must iterate over a copy. The slice notation makes this particularly convenient: >>> for x in a[:]: # make a slice copy of the entire list ... if len(x) > 6: a.insert(0, x) ... >>> a ['defenestrate', 'cat', 'window', 'defenestrate'] range() Function If you do need to iterate over a sequence of numbers, the built-in function range() comes in handy. It generates lists containing arithmetic progressions: >>> range(10) [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] The given end point is never part of the generated list; range(10) generates a list of 10 values, the legal indices for items of a sequence of length 10. It is possible to let the range start at another number, or to specify a different increment (even negative; sometimes this is called the ‘step’): >>> range(5, 10) [5, 6, 7, 8, 9] >>> range(0, 10, 3) [0, 3, 6, 9] >>> range(-10, -100, -30) [-10, -40, -70] To iterate over the indices of a sequence, you can combine range() and len() as follows: >>> a = ['Mary', 'had', 'a', 'little', 'lamb'] >>> for i in range(len(a)): ... print i, a[i] ... 0 Mary 1 had 2a 3 little 4 lamb In most such cases, however, it is convenient to use the enumerate() function. break and continue Statements, and else Clauses on Loops The break statement, like in C, breaks out of the smallest enclosing for or while loop. The continue statement, also borrowed from C, continues with the next iteration of the loop. Loop statements may have an else clause; it is executed when the loop terminates through exhaustion of the list (with for) or when the condition becomes false (with while), but not when the loop is terminated by a break statement. This is exemplified by the following loop, which searches for prime numbers: >>> for n in range(2, 10): ... for x in range(2, n): ... if n % x == 0: ... print n, 'equals', x, '*', n/x ... break ... else: ... # loop fell through without finding a factor ... print n, 'is a prime number' ... 2 is a prime number 3 is a prime number 4 equals 2 * 2 5 is a prime number 6 equals 2 * 3 7 is a prime number 8 equals 2 * 4 9 equals 3 * 3 pass Statements The pass statement does nothing. It can be used when a statement is required syntactically but the program requires no action. For example: >>> while True: ... pass # Busy-wait for keyboard interrupt (Ctrl+C) ... This is commonly used for creating minimal classes: >>> class MyEmptyClass: ... pass ... Another place pass can be used is as a place-holder for a function or conditional body when you are working on new code, allowing you to keep thinking at a more abstract level. The pass is silently ignored: >>> def initlog(*args): ... pass # Remember to implement this! ... Defining Functions We can create a function that writes the Fibonacci series to an arbitrary boundary: >>> def fib(n): # write Fibonacci series up to n ... """Print a Fibonacci series up to n.""" ... a, b = 0, 1 ... while a < n: ... print a, ... a, b = b, a+b ... >>> # Now call the function we just defined: ... fib(2000) 0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 The keyword def introduces a function definition. It must be followed by the function name and the parenthesized list of formal parameters. The statements that form the body of the function start at the next line, and must be indented. The first statement of the function body can optionally be a string literal; this string literal is the function’s documentation string, or docstring. There are tools which use docstrings to automatically produce online or printed documentation, or to let the user interactively browse through code; it’s good practice to include docstrings in code that you write, so make a habit of it. The execution of a function introduces a new symbol table used for the local variables of the function. More precisely, all variable assignments in a function store the value in the local symbol table; whereas variable references first look in the local symbol table, then in the local symbol tables of enclosing functions, then in the global symbol table, and finally in the table of built-in names. Thus, global variables cannot be directly assigned a value within a function (unless named in a global statement), although they may be referenced. The actual parameters (arguments) to a function call are introduced in the local symbol table of the called function when it is called; thus, arguments are passed using call by value (where the value is always an object reference, not the value of the object). When a function calls another function, a new local symbol table is created for that call. A function definition introduces the function name in the current symbol table. The value of the function name has a type that is recognized by the interpreter as a user-defined function. This value can be assigned to another name which can then also be used as a function. This serves as a general renaming mechanism: >>> fib <function fib at 10042ed0> >>> f = fib >>> f(100) 0 1 1 2 3 5 8 13 21 34 55 89 Coming from other languages, you might object that fib is not a function but a procedure since it doesn’t return a value. In fact, even functions without a return statement do return a value, albeit a rather boring one. This value is called None (it’s a built-in name). Writing the value None is normally suppressed by the interpreter if it would be the only value written. You can see it if you really want to using print: >>> fib(0) >>> print fib(0) None It is simple to write a function that returns a list of the numbers of the Fibonacci series, instead of printing it: >>> def fib2(n): # return Fibonacci series up to n ... """Return a list containing the Fibonacci series up to n.""" ... result = [] ... a, b = 0, 1 ... while a < n: ... result.append(a) # see below ... a, b = b, a+b ... return result ... >>> f100 = fib2(100) # call it >>> f100 # write the result [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89] This example, as usual, demonstrates some new Python features:   The return statement returns with a value from a function. return without an expression argument returns None. Falling off the end of a function also returns None. The statement result.append(a) calls a method of the list object result. A method is a function that ‘belongs’ to an object and is named obj.methodname, where obj is some object (this may be an expression), and methodname is the name of a method that is defined by the object’s type. Different types define different methods. Methods of different types may have the same name without causing ambiguity. (It is possible to define your own object types and methods, using classes) The method append() shown in the example is defined for list objects; it adds a new element at the end of the list. In this example it is equivalent to result = result + [a], but more efficient. Standard Modules Python comes with a library of standard modules, described in a separate document, the Python Library Reference (“Library Reference” hereafter). Some modules are built into the interpreter; these provide access to operations that are not part of the core of the language but are nevertheless built in, either for efficiency or to provide access to operating system primitives such as system calls. The set of such modules is a configuration option which also depends on the underlying platform For example, the winreg module is only provided on Windows systems. One particular module deserves some attention: sys, which is built into every Python interpreter. The variables sys.ps1 and sys.ps2 define the strings used as primary and secondary prompts: >>> import sys >>> sys.ps1 '>>> ' >>> sys.ps2 '... ' >>> sys.ps1 = 'C> ' C> print 'Yuck!' Yuck! C> These two variables are only defined if the interpreter is in interactive mode. The variable sys.path is a list of strings that determines the interpreter’s search path for modules. It is initialized to a default path taken from the environment variable PYTHONPATH, or from a built-in default if PYTHONPATH is not set. You can modify it using standard list operations: >>> import sys >>> sys.path.append('/ufs/guido/lib/python') Importing * From a Package Now what happens when the user writes from sound.effects import *? Ideally, one would hope that this somehow goes out to the filesystem, finds which submodules are present in the package, and imports them all. This could take a long time and importing sub-modules might have unwanted side-effects that should only happen when the sub-module is explicitly imported. The only solution is for the package author to provide an explicit index of the package. The import statement uses the following convention: if a package’s __init__.py code defines a list named __all__, it is taken to be the list of module names that should be imported when from package import * is encountered. It is up to the package author to keep this list up-to-date when a new version of the package is released. Package authors may also decide not to support it, if they don’t see a use for importing * from their package. For example, the file sounds/effects/__init__.py could contain the following code: __all__ = ["echo", "surround", "reverse"] This would mean that from sound.effects import * would import the three named submodules of the sound package. If __all__ is not defined, the statement from sound.effects import * does not import all submodules from the package sound.effects into the current namespace; it only ensures that the package sound.effects has been imported (possibly running any initialization code in __init__.py) and then imports whatever names are defined in the package. This includes any names defined (and submodules explicitly loaded) by __init__.py. It also includes any submodules of the package that were explicitly loaded by previous import statements. Consider this code: import sound.effects.echo import sound.effects.surround from sound.effects import * In this example, the echo and surround modules are imported in the current namespace because they are defined in the sound.effects package when the from...import statement is executed. (This also works when __all__ is defined.) Although certain modules are designed to export only names that follow certain patterns when you use import *, it is still considered bad practise in production code. Remember, there is nothing wrong with using from Package import specific_submodule! In fact, this is the recommended notation unless the importing module needs to use submodules with the same name from different packages. Input and Output There are several ways to present the output of a program; data can be printed in a human-readable form, or written to a file for future use. This chapter will discuss some of the possibilities. Fancier Output Formatting So far we’ve encountered two ways of writing values: expression statements and the print statement. A third way is using the write() method of file objects; the standard output file can be referenced as sys.stdout. See the Library Reference for more information on this. Often you’ll want more control over the formatting of your output than simply printing space-separated values. There are two ways to format your output; the first way is to do all the string handling yourself; using string slicing and concatenation operations you can create any layout you can imagine. The string types have some methods that perform useful operations for padding strings to a given column width; these will be discussed shortly. The second way is to use the str.format() method. The string module contains a Template class which offers yet another way to substitute values into strings. One question remains, of course: how do you convert values to strings? Luckily, Python has ways to convert any value to a string: pass it to the repr() or str() functions. The str() function is meant to return representations of values which are fairly human-readable, while repr() is meant to generate representations which can be read by the interpreter (or will force a SyntaxError if there is not equivalent syntax). For objects which don’t have a particular representation for human consumption, str() will return the same value as repr(). Many values, such as numbers or structures like lists and dictionaries, have the same representation using either function. Strings and floating point numbers, in particular, have two distinct representations. Some examples: >>> s = 'Hello, world.' >>> str(s) 'Hello, world.' >>> repr(s) "'Hello, world.'" >>> str(1.0/7.0) '0.142857142857' >>> repr(1.0/7.0) '0.14285714285714285' >>> x = 10 * 3.25 >>> y = 200 * 200 >>> s = 'The value of x is ' + repr(x) + ', and y is ' + repr(y) + '...' >>> print s The value of x is 32.5, and y is 40000... >>> # The repr() of a string adds string quotes and backslashes: ... hello = 'hello, world\n' >>> hellos = repr(hello) >>> print hellos 'hello, world\n' >>> # The argument to repr() may be any Python object: ... repr((x, y, ('spam', 'eggs'))) "(32.5, 40000, ('spam', 'eggs'))" Here are two ways to write a table of squares and cubes: >>> for x in range(1, 11): ... print repr(x).rjust(2), repr(x*x).rjust(3), ... # Note trailing comma on previous line ... print repr(x*x*x).rjust(4) ... 1 1 1 2 4 8 3 9 27 4 16 64 5 25 125 6 36 216 7 49 343 8 64 512 9 81 729 10 100 1000 >>> for x in range(1,11): ... print '{0:2d} {1:3d} {2:4d}'.format(x, x*x, x*x*x) ... 1 1 1 2 4 8 3 9 27 4 16 64 5 25 125 6 36 216 7 49 343 8 64 512 9 81 729 10 100 1000 (Note that in the first example, one space between each column was added by the way print works: it always adds spaces between its arguments.) This example demonstrates the str.rjust() method of string objects, which right-justifies a string in a field of a given width by padding it with spaces on the left. There are similar methods str.ljust() and str.center(). These methods do not write anything, they just return a new string. If the input string is too long, they don’t truncate it, but return it unchanged; this will mess up your column lay-out but that’s usually better than the alternative, which would be lying about a value. (If you really want truncation you can always add a slice operation, as in x.ljust(n)[:n].) There is another method, str.zfill(), which pads a numeric string on the left with zeros. It understands about plus and minus signs: >>> '12'.zfill(5) '00012' >>> '-3.14'.zfill(7) '-003.14' >>> '3.14159265359'.zfill(5) '3.14159265359' Basic usage of the str.format() method looks like this: >>> print 'We are the {} who say "{}!"'.format('knights', 'Ni') We are the knights who say "Ni!" The brackets and characters within them (called format fields) are replaced with the objects passed into the str.format() method. A number in the brackets refers to the position of the object passed into the str.format() method. >>> print '{0} and {1}'.format('spam', 'eggs') spam and eggs >>> print '{1} and {0}'.format('spam', 'eggs') eggs and spam If keyword arguments are used in the str.format() method, their values are referred to by using the name of the argument. >>> print 'This {food} is {adjective}.'.format( ... food='spam', adjective='absolutely horrible') This spam is absolutely horrible. Positional and keyword arguments can be arbitrarily combined: >>> print 'The story of {0}, {1}, and {other}.'.format('Bill', 'Manfred', ... other='Georg') The story of Bill, Manfred, and Georg. '!s' (apply str()) and '!r' (apply repr()) can be used to convert the value before it is formatted. >>> import math >>> print 'The value of PI is approximately {}.'.format(math.pi) The value of PI is approximately 3.14159265359. >>> print 'The value of PI is approximately {!r}.'.format(math.pi) The value of PI is approximately 3.141592653589793. An optional ':' and format specifier can follow the field name. This allows greater control over how the value is formatted. The following example rounds Pi to three places after the decimal. >>> import math >>> print 'The value of PI is approximately {0:.3f}.'.format(math.pi) The value of PI is approximately 3.142. Passing an integer after the ':' will cause that field to be a minimum number of characters wide. This is useful for making tables pretty. >>> table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 7678} >>> for name, phone in table.items(): ... print '{0:10} ==> {1:10d}'.format(name, phone) ... Jack ==> 4098 Dcab ==> 7678 Sjoerd ==> 4127 If you have a really long format string that you don’t want to split up, it would be nice if you could reference the variables to be formatted by name instead of by position. This can be done by simply passing the dict and using square brackets '[]' to access the keys >>> table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 8637678} >>> print ('Jack: {0[Jack]:d}; Sjoerd: {0[Sjoerd]:d}; ' ... 'Dcab: {0[Dcab]:d}'.format(table)) Jack: 4098; Sjoerd: 4127; Dcab: 8637678 This could also be done by passing the table as keyword arguments with the ‘**’ notation. >>> table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 8637678} >>> print 'Jack: {Jack:d}; Sjoerd: {Sjoerd:d}; Dcab: {Dcab:d}'.format(**table) Jack: 4098; Sjoerd: 4127; Dcab: 8637678 This is particularly useful in combination with the built-in function vars(), which returns a dictionary containing all local variables. Old string formatting The % operator can also be used for string formatting. It interprets the left argument much like a sprintf()style format string to be applied to the right argument, and returns the string resulting from this formatting operation. For example: >>> import math >>> print 'The value of PI is approximately %5.3f.' % math.pi The value of PI is approximately 3.142. Since str.format() is quite new, a lot of Python code still uses the % operator. However, because this old style of formatting will eventually be removed from the language, str.format() should generally be used. Reading and Writing Files open() returns a file object, and is most commonly used with two arguments: open(filename, mode). >>> f = open('/tmp/workfile', 'w') >>> print f <open file '/tmp/workfile', mode 'w' at 80a0960> The first argument is a string containing the filename. The second argument is another string containing a few characters describing the way in which the file will be used. mode can be 'r' when the file will only be read, 'w' for only writing (an existing file with the same name will be erased), and 'a' opens the file for appending; any data written to the file is automatically added to the end. 'r+' opens the file for both reading and writing. The mode argument is optional; 'r' will be assumed if it’s omitted. On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files. Methods of File Objects The rest of the examples in this section will assume that a file object called f has already been created. To read a file’s contents, call f.read(size), which reads some quantity of data and returns it as a string. size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory. Otherwise, at most size bytes are read and returned. If the end of the file has been reached, f.read() will return an empty string (""). >>> f.read() 'This is the entire file.\n' >>> f.read() '' f.readline() reads a single line from the file; a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline. This makes the return value unambiguous; if f.readline() returns an empty string, the end of the file has been reached, while a blank line is represented by '\n', a string containing only a single newline. >>> f.readline() 'This is the first line of the file.\n' >>> f.readline() 'Second line of the file\n' >>> f.readline() '' f.readlines() returns a list containing all the lines of data in the file. If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned. >>> f.readlines() ['This is the first line of the file.\n', 'Second line of the file\n'] An alternative approach to reading lines is to loop over the file object. This is memory efficient, fast, and leads to simpler code: >>> for line in f: print line, This is the first line of the file. Second line of the file The alternative approach is simpler but does not provide as fine-grained control. Since the two approaches manage line buffering differently, they should not be mixed. f.write(string) writes the contents of string to the file, returning None. >>> f.write('This is a test\n') To write something other than a string, it needs to be converted to a string first: >>> value = ('the answer', 42) >>> s = str(value) >>> f.write(s) f.tell() returns an integer giving the file object’s current position in the file, measured in bytes from the beginning of the file. To change the file object’s position, use f.seek(offset, from_what). The position is computed from adding offset to a reference point; the reference point is selected by the from_what argument. A from_what value of 0 measures from the beginning of the file, 1 uses the current file position, and 2 uses the end of the file as the reference point. from_what can be omitted and defaults to 0, using the beginning of the file as the reference point. >>> f = open('/tmp/workfile', 'r+') >>> f.write('0123456789abcdef') >>> f.seek(5) # Go to the 6th byte in the file >>> f.read(1) '5' >>> f.seek(-3, 2) # Go to the 3rd byte before the end >>> f.read(1) 'd' When you’re done with a file, call f.close() to close it and free up any system resources taken up by the open file. After calling f.close(), attempts to use the file object will automatically fail. >>> f.close() >>> f.read() Traceback (most recent call last): File "<stdin>", line 1, in ? ValueError: I/O operation on closed file It is good practice to use the with keyword when dealing with file objects. This has the advantage that the file is properly closed after its suite finishes, even if an exception is raised on the way. It is also much shorter than writing equivalent try-finally blocks: >>> with open('/tmp/workfile', 'r') as f: ... read_data = f.read() >>> f.closed True File objects have some additional methods, such as isatty() and truncate() which are less frequently used. Errors and Exceptions Until now error messages haven’t been more than mentioned, but if you have tried out the examples you have probably seen some. There are (at least) two distinguishable kinds of errors: syntax errors and exceptions. Syntax Errors Syntax errors, also known as parsing errors, are perhaps the most common kind of complaint you get while you are still learning Python: >>> while True print 'Hello world' File "<stdin>", line 1, in ? while True print 'Hello world' ^ SyntaxError: invalid syntax The parser repeats the offending line and displays a little ‘arrow’ pointing at the earliest point in the line where the error was detected. The error is caused by (or at least detected at) the token preceding the arrow: in the example, the error is detected at the keyword print, since a colon (':') is missing before it. File name and line number are printed so you know where to look in case the input came from a script. Exceptions Even if a statement or expression is syntactically correct, it may cause an error when an attempt is made to execute it. Errors detected during execution are called exceptions and are not unconditionally fatal: you will soon learn how to handle them in Python programs. Most exceptions are not handled by programs, however, and result in error messages as shown here: >>> 10 * (1/0) Traceback (most recent call last): File "<stdin>", line 1, in ? ZeroDivisionError: integer division or modulo by zero >>> 4 + spam*3 Traceback (most recent call last): File "<stdin>", line 1, in ? NameError: name 'spam' is not defined >>> '2' + 2 Traceback (most recent call last): File "<stdin>", line 1, in ? TypeError: cannot concatenate 'str' and 'int' objects The last line of the error message indicates what happened. Exceptions come in different types, and the type is printed as part of the message: the types in the example are ZeroDivisionError, NameError and TypeError. The string printed as the exception type is the name of the built-in exception that occurred. This is true for all built-in exceptions, but need not be true for user-defined exceptions (although it is a useful convention). Standard exception names are built-in identifiers (not reserved keywords). The rest of the line provides detail based on the type of exception and what caused it. The preceding part of the error message shows the context where the exception happened, in the form of a stack traceback. In general it contains a stack traceback listing source lines; however, it will not display lines read from standard input. Built-in Exceptions lists the built-in exceptions and their meanings. Handling Exceptions It is possible to write programs that handle selected exceptions. Look at the following example, which asks the user for input until a valid integer has been entered, but allows the user to interrupt the program (using Control-C or whatever the operating system supports); note that a user-generated interruption is signalled by raising the KeyboardInterrupt exception. >>> while True: ... try: ... x = int(raw_input("Please enter a number: ")) ... break ... except ValueError: ... print "Oops! That was no valid number. Try again..." ... The try statement works as follows.     First, the try clause (the statement(s) between the try and except keywords) is executed. If no exception occurs, the except clause is skipped and execution of the try statement is finished. If an exception occurs during execution of the try clause, the rest of the clause is skipped. Then if its type matches the exception named after the except keyword, the except clause is executed, and then execution continues after the try statement. If an exception occurs which does not match the exception named in the except clause, it is passed on to outer try statements; if no handler is found, it is an unhandled exception and execution stops with a message as shown above. A try statement may have more than one except clause, to specify handlers for different exceptions. At most one handler will be executed. Handlers only handle exceptions that occur in the corresponding try clause, not in other handlers of the same try statement. An except clause may name multiple exceptions as a parenthesized tuple, for example: ... except (RuntimeError, TypeError, NameError): ... pass The last except clause may omit the exception name(s), to serve as a wildcard. Use this with extreme caution, since it is easy to mask a real programming error in this way! It can also be used to print an error message and then re-raise the exception (allowing a caller to handle the exception as well): import sys try: f = open('myfile.txt') s = f.readline() i = int(s.strip()) except IOError as (errno, strerror): print "I/O error({0}): {1}".format(errno, strerror) except ValueError: print "Could not convert data to an integer." except: print "Unexpected error:", sys.exc_info()[0] raise The try ... except statement has an optional else clause, which, when present, must follow all except clauses. It is useful for code that must be executed if the try clause does not raise an exception. For example: for arg in sys.argv[1:]: try: f = open(arg, 'r') except IOError: print 'cannot open', arg else: print arg, 'has', len(f.readlines()), 'lines' f.close() The use of the else clause is better than adding additional code to the try clause because it avoids accidentally catching an exception that wasn’t raised by the code being protected by the try ... except statement. When an exception occurs, it may have an associated value, also known as the exception’s argument. The presence and type of the argument depend on the exception type. The except clause may specify a variable after the exception name (or tuple). The variable is bound to an exception instance with the arguments stored in instance.args. For convenience, the exception instance defines __str__() so the arguments can be printed directly without having to reference .args. One may also instantiate an exception first before raising it and add any attributes to it as desired. >>> try: ... raise Exception('spam', 'eggs') ... except Exception as inst: ... print type(inst) # the exception instance ... print inst.args # arguments stored in .args ... print inst # __str__ allows args to printed directly ... x, y = inst # __getitem__ allows args to be unpacked directly ... print 'x =', x ... print 'y =', y ... <type 'exceptions.Exception'> ('spam', 'eggs') ('spam', 'eggs') x = spam y = eggs If an exception has an argument, it is printed as the last part (‘detail’) of the message for unhandled exceptions. Exception handlers don’t just handle exceptions if they occur immediately in the try clause, but also if they occur inside functions that are called (even indirectly) in the try clause. For example: >>> def this_fails(): ... x = 1/0 ... >>> try: ... this_fails() ... except ZeroDivisionError as detail: ... print 'Handling run-time error:', detail ... Handling run-time error: integer division or modulo by zero Raising Exceptions The raise statement allows the programmer to force a specified exception to occur. For example: >>> raise NameError('HiThere') Traceback (most recent call last): File "<stdin>", line 1, in ? NameError: HiThere The sole argument to raise indicates the exception to be raised. This must be either an exception instance or an exception class (a class that derives from Exception). If you need to determine whether an exception was raised but don’t intend to handle it, a simpler form of the raise statement allows you to re-raise the exception: >>> try: ... raise NameError('HiThere') ... except NameError: ... print 'An exception flew by!' ... raise ... An exception flew by! Traceback (most recent call last): File "<stdin>", line 2, in ? NameError: HiThere User-defined Exceptions Programs may name their own exceptions by creating a new exception class. Exceptions should typically be derived from the Exception class, either directly or indirectly. For example: >>> class MyError(Exception): ... def __init__(self, value): ... self.value = value ... def __str__(self): ... return repr(self.value) ... >>> try: ... raise MyError(2*2) ... except MyError as e: ... print 'My exception occurred, value:', e.value ... My exception occurred, value: 4 >>> raise MyError('oops!') Traceback (most recent call last): File "<stdin>", line 1, in ? __main__.MyError: 'oops!' In this example, the default __init__() of Exception has been overridden. The new behavior simply creates the value attribute. This replaces the default behavior of creating the args attribute. Exception classes can be defined which do anything any other class can do, but are usually kept simple, often only offering a number of attributes that allow information about the error to be extracted by handlers for the exception. When creating a module that can raise several distinct errors, a common practice is to create a base class for exceptions defined by that module, and subclass that to create specific exception classes for different error conditions: class Error(Exception): """Base class for exceptions in this module.""" pass class InputError(Error): """Exception raised for errors in the input. Attributes: expr -- input expression in which the error occurred msg -- explanation of the error """ def __init__(self, expr, msg): self.expr = expr self.msg = msg class TransitionError(Error): """Raised when an operation attempts a state transition that's not allowed. Attributes: prev -- state at beginning of transition next -- attempted new state msg -- explanation of why the specific transition is not allowed """ def __init__(self, prev, next, msg): self.prev = prev self.next = next self.msg = msg Most exceptions are defined with names that end in “Error,” similar to the naming of the standard exceptions. Many standard modules define their own exceptions to report errors that may occur in functions they define. Defining Clean-up Actions The try statement has another optional clause which is intended to define clean-up actions that must be executed under all circumstances. For example: >>> try: ... raise KeyboardInterrupt ... finally: ... print 'Goodbye, world!' ... Goodbye, world! KeyboardInterrupt A finally clause is always executed before leaving the try statement, whether an exception has occurred or not. When an exception has occurred in the try clause and has not been handled by an except clause (or it has occurred in a except or else clause), it is re-raised after the finally clause has been executed. The finally clause is also executed “on the way out” when any other clause of the try statement is left via a break, continue or return statement. A more complicated example (having except and finally clauses in the same try statement works as of Python 2.5): >>> def divide(x, y): ... try: ... result = x / y ... except ZeroDivisionError: ... print "division by zero!" ... else: ... print "result is", result ... finally: ... print "executing finally clause" ... >>> divide(2, 1) result is 2 executing finally clause >>> divide(2, 0) division by zero! executing finally clause >>> divide("2", "1") executing finally clause Traceback (most recent call last): File "<stdin>", line 1, in ? File "<stdin>", line 3, in divide TypeError: unsupported operand type(s) for /: 'str' and 'str' As you can see, the finally clause is executed in any event. The TypeError raised by dividing two strings is not handled by the except clause and therefore re-raised after the finally clause has been executed. In real world applications, the finally clause is useful for releasing external resources (such as files or network connections), regardless of whether the use of the resource was successful. Predefined Clean-up Actions Some objects define standard clean-up actions to be undertaken when the object is no longer needed, regardless of whether or not the operation using the object succeeded or failed. Look at the following example, which tries to open a file and print its contents to the screen. for line in open("myfile.txt"): print line The problem with this code is that it leaves the file open for an indeterminate amount of time after the code has finished executing. This is not an issue in simple scripts, but can be a problem for larger applications. The with statement allows objects like files to be used in a way that ensures they are always cleaned up promptly and correctly. with open("myfile.txt") as f: for line in f: print line After the statement is executed, the file f is always closed, even if a problem was encountered while processing the lines. Other objects which provide predefined clean-up actions will indicate this in their documentation. Operating System Interface The os module provides dozens of functions for interacting with the operating system: >>> import os >>> os.getcwd() # Return the current working directory 'C:\\Python26' >>> os.chdir('/server/accesslogs') # Change current working directory >>> os.system('mkdir today') # Run the command mkdir in the system shell 0 Be sure to use the import os style instead of from os import *. This will keep os.open() from shadowing the built-in open() function which operates much differently. The built-in dir() and help() functions are useful as interactive aids for working with large modules like os: >>> import os >>> dir(os) <returns a list of all module functions> >>> help(os) <returns an extensive manual page created from the module's docstrings> For daily file and directory management tasks, the shutil module provides a higher level interface that is easier to use: >>> import shutil >>> shutil.copyfile('data.db', 'archive.db') >>> shutil.move('/build/executables', 'installdir') File Wildcards The glob module provides a function for making file lists from directory wildcard searches: >>> import glob >>> glob.glob('*.py') ['primes.py', 'random.py', 'quote.py'] Command Line Arguments Common utility scripts often need to process command line arguments. These arguments are stored in the sys module’s argv attribute as a list. For instance the following output results from running python demo.py one two three at the command line: >>> import sys >>> print sys.argv ['demo.py', 'one', 'two', 'three'] The getopt module processes sys.argv using the conventions of the Unix getopt() function. More powerful and flexible command line processing is provided by the argparse module. Error Output Redirection and Program Termination The sys module also has attributes for stdin, stdout, and stderr. The latter is useful for emitting warnings and error messages to make them visible even when stdout has been redirected: >>> sys.stderr.write('Warning, log file not found starting a new one\n') Warning, log file not found starting a new one The most direct way to terminate a script is to use sys.exit(). String Pattern Matching The re module provides regular expression tools for advanced string processing. For complex matching and manipulation, regular expressions offer succinct, optimized solutions: >>> import re >>> re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest') ['foot', 'fell', 'fastest'] >>> re.sub(r'(\b[a-z]+) \1', r'\1', 'cat in the the hat') 'cat in the hat' When only simple capabilities are needed, string methods are preferred because they are easier to read and debug: >>> 'tea for too'.replace('too', 'two') 'tea for two' Mathematics The math module gives access to the underlying C library functions for floating point math: >>> import math >>> math.cos(math.pi / 4.0) 0.70710678118654757 >>> math.log(1024, 2) 10.0 The random module provides tools for making random selections: >>> import random >>> random.choice(['apple', 'pear', 'banana']) 'apple' >>> random.sample(xrange(100), 10) # sampling without replacement [30, 83, 16, 4, 8, 81, 41, 50, 18, 33] >>> random.random() # random float 0.17970987693706186 >>> random.randrange(6) # random integer chosen from range(6) 4 Internet Access There are a number of modules for accessing the internet and processing internet protocols. Two of the simplest are urllib2 for retrieving data from urls and smtplib for sending mail: >>> import urllib2 >>> for line in urllib2.urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl'): ... if 'EST' in line or 'EDT' in line: # look for Eastern Time ... print line <BR>Nov. 25, 09:43:32 PM EST >>> import smtplib >>> server = smtplib.SMTP('localhost') >>> server.sendmail('[email protected]', '[email protected]', ... """To: [email protected] ... From: [email protected] ... ... Beware the Ides of March. ... """) >>> server.quit() (Note that the second example needs a mailserver running on localhost.) Dates and Times The datetime module supplies classes for manipulating dates and times in both simple and complex ways. While date and time arithmetic is supported, the focus of the implementation is on efficient member extraction for output formatting and manipulation. The module also supports objects that are timezone aware. >>> # dates are easily constructed and formatted >>> from datetime import date >>> now = date.today() >>> now datetime.date(2003, 12, 2) >>> now.strftime("%m-%d-%y. %d %b %Y is a %A on the %d day of %B.") '12-02-03. 02 Dec 2003 is a Tuesday on the 02 day of December.' >>> # dates support calendar arithmetic >>> birthday = date(1964, 7, 31) >>> age = now - birthday >>> age.days 14368 Data Compression Common data archiving and compression formats are directly supported by modules including: zlib, gzip, bz2, zipfile and tarfile. >>> import zlib >>> s = 'witch which has which witches wrist watch' >>> len(s) 41 >>> t = zlib.compress(s) >>> len(t) 37 >>> zlib.decompress(t) 'witch which has which witches wrist watch' >>> zlib.crc32(s) 226805979 Performance Measurement Some Python users develop a deep interest in knowing the relative performance of different approaches to the same problem. Python provides a measurement tool that answers those questions immediately. For example, it may be tempting to use the tuple packing and unpacking feature instead of the traditional approach to swapping arguments. The timeit module quickly demonstrates a modest performance advantage: >>> from timeit import Timer >>> Timer('t=a; a=b; b=t', 'a=1; b=2').timeit() 0.57535828626024577 >>> Timer('a,b = b,a', 'a=1; b=2').timeit() 0.54962537085770791 In contrast to timeit‘s fine level of granularity, the profile and pstats modules provide tools for identifying time critical sections in larger blocks of code. BioPython What is Biopython? The Biopython Project is an international association of developers of freely available Python (http://www.python.org) tools for computational molecular biology. The web site http://www.biopython.org provides an online resource for modules, scripts, and web links for developers of Python-based software for life science research. Basically, we just like to program in Python and want to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and scripts. What can I find in the Biopython package The main Biopython releases have lots of functionality, including:             The ability to parse bioinformatics files into Python utilizable data structures, including support for the following formats: o Blast output – both from standalone and WWW Blast o Clustalw o FASTA o GenBank o PubMed and Medline o ExPASy files, like Enzyme and Prosite o SCOP, including ‘dom’ and ‘lin’ files o UniGene o SwissProt Files in the supported formats can be iterated over record by record or indexed and accessed via a Dictionary interface. Code to deal with popular on-line bioinformatics destinations such as: o NCBI – Blast, Entrez and PubMed services o ExPASy – Swiss-Prot and Prosite entries, as well as Prosite searches Interfaces to common bioinformatics programs such as: o Standalone Blast from NCBI o Clustalw alignment program o EMBOSS command line tools A standard sequence class that deals with sequences, ids on sequences, and sequence features. Tools for performing common operations on sequences, such as translation, transcription and weight calculations. Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support Vector Machines. Code for dealing with alignments, including a standard way to create and deal with substitution matrices. Code making it easy to split up parallelizable tasks into separate processes. GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc. Extensive documentation and help with using the modules, including this file, on-line wiki documentation, the web site, and the mailing list. Integration with BioSQL, a sequence database schema also supported by the BioPerl and BioJava projects. Working with sequences The central object in bioinformatics is the sequence. Thus, we’ll start with a quick introduction to the Biopython mechanisms for dealing with sequences, the Seq object. Most of the time when we think about sequences we have in my mind a string of letters like ‘AGTACACTGGT’. You can create such Seq object with this sequence as follows - the “>>>” represents the Python prompt followed by what you would type in: >>> from Bio.Seq import Seq >>> my_seq = Seq("AGTACACTGGT") >>> my_seq Seq('AGTACACTGGT', Alphabet()) >>> print my_seq AGTACACTGGT >>> my_seq.alphabet Alphabet() What we have here is a sequence object with a generic alphabet - reflecting the fact we have not specified if this is a DNA or protein sequence (okay, a protein with a lot of Alanines, Glycines, Cysteines and Threonines!). In addition to having an alphabet, the Seq object differs from the Python string in the methods it supports. You can’t do this with a plain string: >>> my_seq Seq('AGTACACTGGT', Alphabet()) >>> my_seq.complement() Seq('TCATGTGACCA', Alphabet()) >>> my_seq.reverse_complement() Seq('ACCAGTGTACT', Alphabet()) The next most important class is the SeqRecord or Sequence Record. This holds a sequence (as a Seq object) with additional annotation including an identifier, name and description. The Bio.SeqIO module for reading and writing sequence file formats works with SeqRecord objects. This covers the basic features and uses of the Biopython sequence class. Now that you’ve got some idea of what it is like to interact with the Biopython libraries, it’s time to delve into the fun, fun world of dealing with biological file formats! Parsing sequence file formats A large part of much bioinformatics work involves dealing with the many types of file formats designed to hold biological data. These files are loaded with interesting biological data, and a special challenge is parsing these files into a format so that you can manipulate them with some kind of programming language. However the task of parsing these files can be frustrated by the fact that the formats can change quite regularly, and that formats may contain small subtleties which can break even the most well designed parsers. We are now going to briefly introduce the Bio.SeqIO module. We’ll start with an online search for our friends, the lady slipper orchids. To keep this introduction simple, we’re just using the NCBI website by hand. Let’s just take a look through the nucleotide databases at NCBI, using an Entrez online search (http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide) for everything mentioning the text Cypripedioideae (this is the subfamily of lady slipper orchids). When this tutorial was originally written, this search gave us only 94 hits, which we saved as a FASTA formatted text file and as a GenBank formatted text file (files ls_orchid.fasta and ls_orchid.gbk, also included with the Biopython source code under docs/tutorial/examples/). If you run the search today, you’ll get hundreds of results! When following the tutorial, if you want to see the same list of genes, just download the two files above or copy them from docs/examples/ in the Biopython source code. Simple FASTA parsing example If you open the lady slipper orchids FASTA file ls_orchid.fasta in your favourite text editor, you’ll see that the file starts like this: >gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGA TCGAGTG AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTT GTTGGG ... It contains 94 records, each has a line starting with “>” (greater-than symbol) followed by the sequence on one or more lines. Now try this in Python: from Bio import SeqIO for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"): print seq_record.id print repr(seq_record.seq) print len(seq_record) You should get something like this on your screen: gi|2765658|emb|Z78533.1|CIZ78533 Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet()) 740 ... gi|2765564|emb|Z78439.1|PBZ78439 Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet()) 592 Notice that the FASTA format does not specify the alphabet, so Bio.SeqIO has defaulted to the rather generic SingleLetterAlphabet() rather than something DNA specific. Simple GenBank parsing example Now let’s load the GenBank file ls_orchid.gbk instead - notice that the code to do this is almost identical to the snippet used above for the FASTA file - the only difference is we change the filename and the format string: from Bio import SeqIO for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"): print seq_record.id print repr(seq_record.seq) print len(seq_record) This should give: Z78533.1 Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA()) 740 ... Z78439.1 Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', IUPACAmbiguousDNA()) 592 This time Bio.SeqIO has been able to choose a sensible alphabet, IUPAC Ambiguous DNA. You’ll also notice that a shorter string has been used as the seq_record.id in this case. Connecting with biological databases One of the very common things that you need to do in bioinformatics is extract information from biological databases. It can be quite tedious to access these databases manually, especially if you have a lot of repetitive work to do. Biopython attempts to save you time and energy by making some on-line databases available from Python scripts. Currently, Biopython has code to extract information from the following databases:    Entrez (and PubMed) from the NCBI. ExPASy – See Chapter 9. SCOP – See the Bio.SCOP.search() function. The code in these modules basically makes it easy to write Python code that interact with the CGI scripts on these pages, so that you can get results in an easy to deal with format. In some cases, the results can be tightly integrated with the Biopython parsers to make it even easier to extract information. Sequence objects Biological sequences are arguably the central object in Bioinformatics, and in this chapter we’ll introduce the Biopython mechanism for dealing with sequences, the Seq object. Sequences are essentially strings of letters like AGTACACTGGT, which seems very natural since this is the most common way that sequences are seen in biological file formats. There are two important differences between Seq objects and standard Python strings. First of all, they have different methods. Although the Seq object supports many of the same methods as a plain string, its translate() method differs by doing biological translation, and there are also additional biologically relevant methods like reverse_complement(). Secondly, the Seq object has an important attribute, alphabet, which is an object describing what the individual characters making up the sequence string “mean”, and how they should be interpreted. For example, is AGTACACTGGT a DNA sequence, or just a protein sequence that happens to be rich in Alanines, Glycines, Cysteines and Threonines? Sequences and Alphabets The alphabet object is perhaps the important thing that makes the Seq object more than just a string. The currently available alphabets for Biopython are defined in the Bio.Alphabet module. We’ll use the IUPAC alphabets (http://www.chem.qmw.ac.uk/iupac/) here to deal with some of our favorite objects: DNA, RNA and Proteins. Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, but additionally provides the ability to extend and customize the basic definitions. For instance, for proteins, there is a basic IUPACProtein class, but there is an additional ExtendedIUPACProtein class providing for the additional elements “U” (or “Sec” for selenocysteine) and “O” (or “Pyl” for pyrrolysine), plus the ambiguous symbols “B” (or “Asx” for asparagine or aspartic acid), “Z” (or “Glx” for glutamine or glutamic acid), “J” (or “Xle” for leucine isoleucine) and “X” (or “Xxx” for an unknown amino acid). For DNA you’ve got choices of IUPACUnambiguousDNA, which provides for just the basic letters, IUPACAmbiguousDNA (which provides for ambiguity letters for every possible situation) and ExtendedIUPACDNA, which allows letters for modified bases. Similarly, RNA can be represented by IUPACAmbiguousRNA or IUPACUnambiguousRNA. The advantages of having an alphabet class are two fold. First, this gives an idea of the type of information the Seq object contains. Secondly, this provides a means of constraining the information, as a means of type checking. Now that we know what we are dealing with, let’s look at how to utilize this class to do interesting work. You can create an ambiguous sequence with the default generic alphabet like this: >>> from Bio.Seq import Seq >>> my_seq = Seq("AGTACACTGGT") >>> my_seq Seq('AGTACACTGGT', Alphabet()) >>> my_seq.alphabet Alphabet() However, where possible you should specify the alphabet explicitly when creating your sequence objects in this case an unambiguous DNA alphabet object: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna) >>> my_seq Seq('AGTACACTGGT', IUPACUnambiguousDNA()) >>> my_seq.alphabet IUPACUnambiguousDNA() Unless of course, this really is an amino acid sequence: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_prot = Seq("AGTACACTGGT", IUPAC.protein) >>> my_prot Seq('AGTACACTGGT', IUPACProtein()) >>> my_prot.alphabet IUPACProtein() Sequences act like strings In many ways, we can deal with Seq objects as if they were normal Python strings, for example getting the length, or iterating over the elements: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq("GATCG", IUPAC.unambiguous_dna) >>> for index, letter in enumerate(my_seq): ... print index, letter 0G 1A 2T 3C 4G >>> print len(my_seq) 5 You can access elements of the sequence in the same way as for strings (but remember, Python counts from zero!): >>> print my_seq[0] #first letter G >>> print my_seq[2] #third letter T >>> print my_seq[-1] #last letter G The Seq object has a .count() method, just like a string. Note that this means that like a Python string, this gives a non-overlapping count: >>> from Bio.Seq import Seq >>> "AAAA".count("AA") 2 >>> Seq("AAAA").count("AA") 2 For some biological uses, you may actually want an overlapping count (i.e. 3 in this trivial example). When searching for single letters, this makes no difference: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna) >>> len(my_seq) 32 >>> my_seq.count("G") 9 >>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq) 46.875 While you could use the above snippet of code to calculate a GC%, note that the Bio.SeqUtils module has several GC functions already built. For example: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> from Bio.SeqUtils import GC >>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna) >>> GC(my_seq) 46.875 Note that using the Bio.SeqUtils.GC() function should automatically cope with mixed case sequences and the ambiguous nucleotide S which means G or C. Also note that just like a normal Python string, the Seq object is in some ways “read-only”. If you need to edit your sequence, for example simulating a point mutation. Slicing a sequence A more complicated example, let’s get a slice of the sequence: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna) >>> my_seq[4:12] Seq('GATGGGCC', IUPACUnambiguousDNA()) Two things are interesting to note. First, this follows the normal conventions for Python strings. So the first element of the sequence is 0 (which is normal for computer science, but not so normal for biology). When you do a slice the first item is included (i.e. 4 in this case) and the last is excluded (12 in this case), which is the way things work in Python, but of course not necessarily the way everyone in the world would expect. The main goal is to stay consistent with what Python does. The second thing to notice is that the slice is performed on the sequence data string, but the new object produced is another Seq object which retains the alphabet information from the original Seq object. Also like a Python string, you can do slices with a start, stop and stride (the step size, which defaults to one). For example, we can get the first, second and third codon positions of this DNA sequence: >>> my_seq[0::3] Seq('GCTGTAGTAAG', IUPACUnambiguousDNA()) >>> my_seq[1::3] Seq('AGGCATGCATC', IUPACUnambiguousDNA()) >>> my_seq[2::3] Seq('TAGCTAAGAC', IUPACUnambiguousDNA()) Another stride trick you might have seen with a Python string is the use of a -1 stride to reverse the string. You can do this with a Seq object too: >>> my_seq[::-1] Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA()) Turning Seq objects into strings If you really do just need a plain string, for example to write to a file, or insert into a database, then this is very easy to get: >>> str(my_seq) 'GATCGATGGGCCTATATAGGATCGAAAATCGC' Since calling str() on a Seq object returns the full sequence as a string, you often don’t actually have to do this conversion explicitly. Python does this automatically with a print statement: >>> print my_seq GATCGATGGGCCTATATAGGATCGAAAATCGC You can also use the Seq object directly with a %s placeholder when using the Python string formatting or interpolation operator (%): >>> fasta_format_string = ">Name\n%s\n" % my_seq >>> print fasta_format_string >Name GATCGATGGGCCTATATAGGATCGAAAATCGC <BLANKLINE> This line of code constructs a simple FASTA format record (without worrying about line wrapping). NOTE: If you are using Biopython 1.44 or older, using str(my_seq) will give just a truncated representation. Instead use my_seq.tostring() (which is still available in the current Biopython releases for backwards compatibility): >>> my_seq.tostring() 'GATCGATGGGCCTATATAGGATCGAAAATCGC' Concatenating or adding sequences Naturally, you can in principle add any two Seq objects together - just like you can with Python strings to concatenate them. However, you can’t add sequences with incompatible alphabets, such as a protein sequence and a DNA sequence: >>> from Bio.Alphabet import IUPAC >>> from Bio.Seq import Seq >>> protein_seq = Seq("EVRNAK", IUPAC.protein) >>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna) >>> protein_seq + dna_seq Traceback (most recent call last): ... TypeError: Incompatible alphabets IUPACProtein() and IUPACUnambiguousDNA() If you really wanted to do this, you’d have to first give both sequences generic alphabets: >>> from Bio.Alphabet import generic_alphabet >>> protein_seq.alphabet = generic_alphabet >>> dna_seq.alphabet = generic_alphabet >>> protein_seq + dna_seq Seq('EVRNAKACGT', Alphabet()) Here is an example of adding a generic nucleotide sequence to an unambiguous IUPAC DNA sequence, resulting in an ambiguous nucleotide sequence: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import generic_nucleotide >>> from Bio.Alphabet import IUPAC >>> nuc_seq = Seq("GATCGATGC", generic_nucleotide) >>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna) >>> nuc_seq Seq('GATCGATGC', NucleotideAlphabet()) >>> dna_seq Seq('ACGT', IUPACUnambiguousDNA()) >>> nuc_seq + dna_seq Seq('GATCGATGCACGT', NucleotideAlphabet()) Changing case Python strings have very useful upper and lower methods for changing the case. As of Biopython 1.53, the Seq object gained similar methods which are alphabet aware. For example, >>> from Bio.Seq import Seq >>> from Bio.Alphabet import generic_dna >>> dna_seq = Seq("acgtACGT", generic_dna) >>> dna_seq Seq('acgtACGT', DNAAlphabet()) >>> dna_seq.upper() Seq('ACGTACGT', DNAAlphabet()) >>> dna_seq.lower() Seq('acgtacgt', DNAAlphabet()) These are useful for doing case insensitive matching: >>> "GTAC" in dna_seq False >>> "GTAC" in dna_seq.upper() True Note that strictly speaking the IUPAC alphabets are for upper case sequences only, thus: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna) >>> dna_seq Seq('ACGT', IUPACUnambiguousDNA()) >>> dna_seq.lower() Seq('acgt', DNAAlphabet()) Nucleotide sequences and (reverse) complements For nucleotide sequences, you can easily obtain the complement or reverse complement of a Seq object using its built-in methods: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna) >>> my_seq Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA()) >>> my_seq.complement() Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA()) >>> my_seq.reverse_complement() Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA()) As mentioned earlier, an easy way to just reverse a Seq object (or a Python string) is slice it with -1 step: >>> my_seq[::-1] Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA()) In all of these operations, the alphabet property is maintained. This is very useful in case you accidentally end up trying to do something weird like take the (reverse)complement of a protein sequence: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> protein_seq = Seq("EVRNAK", IUPAC.protein) >>> protein_seq.complement() Traceback (most recent call last): ... ValueError: Proteins do not have complements! Transcription Before talking about transcription, I want to try and clarify the strand issue. Consider the following (made up) stretch of double stranded DNA which encodes a short peptide: DNA coding strand (aka Crick strand, strand +1) 5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’ ||||||||||||||||||||||||||||||||||||||| 3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC DNA template strand (aka Watson strand, strand −1) | Transcription ↓ 5’ 5’ AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3’ Single stranded messenger RNA The actual biological transcription process works from the template strand, doing a reverse complement (TCAG → CUGA) to give the mRNA. However, in Biopython and bioinformatics in general, we typically work directly with the coding strand because this means we can get the mRNA sequence just by switching T → U. Now let’s actually get down to doing a transcription in Biopython. First, let’s create Seq objects for the coding and template DNA strands: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna) >>> coding_dna Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA()) >>> template_dna = coding_dna.reverse_complement() >>> template_dna Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT', IUPACUnambiguousDNA()) These should match the figure above - remember by convention nucleotide sequences are normally read from the 5’ to 3’ direction, while in the figure the template strand is shown reversed. Now let’s transcribe the coding strand into the corresponding mRNA, using the Seq object’s built in transcribe method: >>> coding_dna Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA()) >>> messenger_rna = coding_dna.transcribe() >>> messenger_rna Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA()) As you can see, all this does is switch T → U, and adjust the alphabet. If you do want to do a true biological transcription starting with the template strand, then this becomes a two-step process: >>> template_dna.reverse_complement().transcribe() Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA()) The Seq object also includes a back-transcription method for going from the mRNA to the coding strand of the DNA. Again, this is a simple U → T substitution and associated change of alphabet: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna) >>> messenger_rna Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA()) >>> messenger_rna.back_transcribe() Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA()) Note: The Seq object’s transcribe and back_transcribe methods were added in Biopython 1.49. For older releases you would have to use the Bio.Seq module’s functions instead. Translation Sticking with the same example discussed in the transcription section above, now let’s translate this mRNA into the corresponding protein sequence - again taking advantage of one of the Seq object’s biological methods: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna) >>> messenger_rna Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA()) >>> messenger_rna.translate() Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*')) You can also translate directly from the coding strand DNA sequence: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna) >>> coding_dna Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA()) >>> coding_dna.translate() Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*')) You should notice in the above protein sequences that in addition to the end stop character, there is an internal stop as well. This was a deliberate choice of example, as it gives an excuse to talk about some optional arguments, including different translation tables (Genetic Codes). The translation tables available in Biopython are based on those from the NCBI (see the next section of this tutorial). By default, translation will use the standard genetic code (NCBI table id 1). Suppose we are dealing with a mitochondrial sequence. We need to tell the translation function to use the relevant genetic code instead: >>> coding_dna.translate(table="Vertebrate Mitochondrial") Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*')) You can also specify the table using the NCBI table number which is shorter, and often included in the feature annotation of GenBank files: >>> coding_dna.translate(table=2) Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*')) Now, you may want to translate the nucleotides up to the first in frame stop codon, and then stop (as happens in nature): >>> coding_dna.translate() Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*')) >>> coding_dna.translate(to_stop=True) Seq('MAIVMGR', IUPACProtein()) >>> coding_dna.translate(table=2) Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*')) >>> coding_dna.translate(table=2, to_stop=True) Seq('MAIVMGRWKGAR', IUPACProtein()) Notice that when you use the to_stop argument, the stop codon itself is not translated - and the stop symbol is not included at the end of your protein sequence. You can even specify the stop symbol if you don’t like the default asterisk: >>> coding_dna.translate(table=2, stop_symbol="@") Seq('MAIVMGRWKGAR@', HasStopCodon(IUPACProtein(), '@')) Now, suppose you have a complete coding sequence CDS, which is to say a nucleotide sequence (e.g. mRNA – after any splicing) which is a whole number of codons (i.e. the length is a multiple of three), commences with a start codon, ends with a stop codon, and has no internal in-frame stop codons. In general, given a complete CDS, the default translate method will do what you want (perhaps with the to_stop option). However, what if your sequence uses a non-standard start codon? This happens a lot in bacteria – for example the gene yaaX in E. coli K12: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import generic_dna >>> gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" +\ ... "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \ ... "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \ ... "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \ ... "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA", ... generic_dna) >>> gene.translate(table="Bacterial") Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*', HasStopCodon(ExtendedIUPACProtein(), '*') >>> gene.translate(table="Bacterial", to_stop=True) Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR', ExtendedIUPACProtein()) In the bacterial genetic code GTG is a valid start codon, and while it does normally encode valine, if used as a start codon it should be translated as methionine. This happens if you tell Biopython your sequence is a complete CDS: >>> gene.translate(table="Bacterial", cds=True) Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR', ExtendedIUPACProtein()) In addition to telling Biopython to translate an alternative start codon as methionine, using this option also makes sure your sequence really is a valid CDS (you’ll get an exception if not). Note: The Seq object’s translate method is new in Biopython 1.49. For older releases you would have to use the Bio.Seq module’s translate function instead. The cds option was added in Biopython 1.51, and there is no simple way to do this with older versions of Biopython. Translation Tables In the previous sections we talked about the Seq object translation method (and mentioned the equivalent function in the Bio.Seq module). Internally these use codon table objects derived from the NCBI information at ftp://ftp.ncbi.nlm.nih.gov/entrez/misc/data/gc.prt, also shown on http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi in a much more readable layout. As before, let’s just focus on two choices: the Standard translation table, and the translation table for Vertebrate Mitochondrial DNA. >>> from Bio.Data import CodonTable >>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"] >>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"] Alternatively, these tables are labeled with ID numbers 1 and 2, respectively: >>> from Bio.Data import CodonTable >>> standard_table = CodonTable.unambiguous_dna_by_id[1] >>> mito_table = CodonTable.unambiguous_dna_by_id[2] You can compare the actual tables visually by printing them: >>> print standard_table Table 1 Standard, SGC0 | T | C | A | G | --+---------+---------+---------+---------+-T | TTT F | TCT S | TAT Y | TGT C | T T | TTC F | TCC S | TAC Y | TGC C | C T | TTA L | TCA S | TAA Stop| TGA Stop| A T | TTG L(s)| TCG S | TAG Stop| TGG W | G --+---------+---------+---------+---------+-C | CTT L | CCT P | CAT H | CGT R | T C | CTC L | CCC P | CAC H | CGC R | C C | CTA L | CCA P | CAA Q | CGA R | A C | CTG L(s)| CCG P | CAG Q | CGG R | G --+---------+---------+---------+---------+-A | ATT I | ACT T | AAT N | AGT S | T A | ATC I | ACC T | AAC N | AGC S | C A | ATA I | ACA T | AAA K | AGA R | A A | ATG M(s)| ACG T | AAG K | AGG R | G --+---------+---------+---------+---------+-G | GTT V | GCT A | GAT D | GGT G | T G | GTC V | GCC A | GAC D | GGC G | C G | GTA V | GCA A | GAA E | GGA G | A G | GTG V | GCG A | GAG E | GGG G | G --+---------+---------+---------+---------+-and: >>> print mito_table Table 2 Vertebrate Mitochondrial, SGC1 | T | C | A | G | --+---------+---------+---------+---------+-T | TTT F | TCT S | TAT Y | TGT C | T T | TTC F | TCC S | TAC Y | TGC C | C T | TTA L | TCA S | TAA Stop| TGA W | A T | TTG L | TCG S | TAG Stop| TGG W | G --+---------+---------+---------+---------+-C | CTT L | CCT P | CAT H | CGT R | T C | CTC L | CCC P | CAC H | CGC R | C C | CTA L | CCA P | CAA Q | CGA R | A C | CTG L | CCG P | CAG Q | CGG R | G --+---------+---------+---------+---------+-A | ATT I(s)| ACT T | AAT N | AGT S | T A | ATC I(s)| ACC T | AAC N | AGC S | C A | ATA M(s)| ACA T | AAA K | AGA Stop| A A | ATG M(s)| ACG T | AAG K | AGG Stop| G --+---------+---------+---------+---------+-G | GTT V | GCT A | GAT D | GGT G | T G | GTC V | GCC A | GAC D | GGC G | C G | GTA V | GCA A | GAA E | GGA G | A G | GTG V(s)| GCG A | GAG E | GGG G | G --+---------+---------+---------+---------+-You may find these following properties useful – for example if you are trying to do your own gene finding: >>> mito_table.stop_codons ['TAA', 'TAG', 'AGA', 'AGG'] >>> mito_table.start_codons ['ATT', 'ATC', 'ATA', 'ATG', 'GTG'] >>> mito_table.forward_table["ACG"] 'T' Comparing Seq objects Sequence comparison is actually a very complicated topic, and there is no easy way to decide if two sequences are equal. The basic problem is the meaning of the letters in a sequence are context dependent the letter “A” could be part of a DNA, RNA or protein sequence. Biopython uses alphabet objects as part of each Seq object to try and capture this information - so comparing two Seq objects means considering both the sequence strings and the alphabets. For example, you might argue that the two DNA Seq objects Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT", IUPAC.ambiguous_dna) should be equal, even though they do have different alphabets. Depending on the context this could be important. This gets worse – suppose you think Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT") (i.e. the default generic alphabet) should be equal. Then, logically, Seq("ACGT", IUPAC.protein) and Seq("ACGT") should also be equal. Now, in logic if A=B and B=C, by transitivity we expect A=C. So for logical consistency we’d require Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT", IUPAC.protein) to be equal – which most people would agree is just not right. This transitivity problem would also have implications for using Seq objects as Python dictionary keys. >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna) >>> seq2 = Seq("ACGT", IUPAC.unambiguous_dna) So, what does Biopython do? Well, the equality test is the default for Python objects – it tests to see if they are the same object in memory. This is a very strict test: >>> seq1 == seq2 False >>> seq1 == seq1 True If you actually want to do this, you can be more explicit by using the Python id function, >>> id(seq1) == id(seq2) False >>> id(seq1) == id(seq1) True Now, in every day use, your sequences will probably all have the same alphabet, or at least all be the same type of sequence (all DNA, all RNA, or all protein). What you probably want is to just compare the sequences as strings – so do this explicitly: >>> str(seq1) == str(seq2) True >>> str(seq1) == str(seq1) True As an extension to this, while you can use a Python dictionary with Seq objects as keys, it is generally more useful to use the sequence a string for the key. MutableSeq objects Just like the normal Python string, the Seq object is “read only”, or in Python terminology, immutable. Apart from wanting the Seq object to act like a string, this is also a useful default since in many biological applications you want to ensure you are not changing your sequence data: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna) Observe what happens if you try to edit the sequence: >>> my_seq[5] = "G" Traceback (most recent call last): ... TypeError: 'Seq' object does not support item assignment However, you can convert it into a mutable sequence (a MutableSeq object) and do pretty much anything you want with it: >>> mutable_seq = my_seq.tomutable() >>> mutable_seq MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA()) Alternatively, you can create a MutableSeq object directly from a string: >>> from Bio.Seq import MutableSeq >>> from Bio.Alphabet import IUPAC >>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna) Either way will give you a sequence object which can be changed: >>> mutable_seq MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA()) >>> mutable_seq[5] = "C" >>> mutable_seq MutableSeq('GCCATCGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA()) >>> mutable_seq.remove("T") >>> mutable_seq MutableSeq('GCCACGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA()) >>> mutable_seq.reverse() >>> mutable_seq MutableSeq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG', IUPACUnambiguousDNA()) Do note that unlike the Seq object, the MutableSeq object’s methods like reverse_complement() and reverse() act in-situ! An important technical difference between mutable and immutable objects in Python means that you can’t use a MutableSeq object as a dictionary key, but you can use a Python string or a Seq object in this way. Once you have finished editing your a MutableSeq object, it’s easy to get back to a read-only Seq object should you need to: >>> new_seq = mutable_seq.toseq() >>> new_seq Seq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG', IUPACUnambiguousDNA()) You can also get a string from a MutableSeq object just like from a Seq object. UnknownSeq objects Biopython 1.50 introduced another basic sequence object, the UnknownSeq object. This is a subclass of the basic Seq object and its purpose is to represent a sequence where we know the length, but not the actual letters making it up. You could of course use a normal Seq object in this situation, but it wastes rather a lot of memory to hold a string of a million “N” characters when you could just store a single letter “N” and the desired length as an integer. >>> from Bio.Seq import UnknownSeq >>> unk = UnknownSeq(20) >>> unk UnknownSeq(20, alphabet = Alphabet(), character = '?') >>> print unk ???????????????????? >>> len(unk) 20 You can of course specify an alphabet, meaning for nucleotide sequences the letter defaults to “N” and for proteins “X”, rather than just “?”. >>> from Bio.Seq import UnknownSeq >>> from Bio.Alphabet import IUPAC >>> unk_dna = UnknownSeq(20, alphabet=IUPAC.ambiguous_dna) >>> unk_dna UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N') >>> print unk_dna NNNNNNNNNNNNNNNNNNNN You can use all the usual Seq object methods too, note these give back memory saving UnknownSeq objects where appropriate as you might expect: >>> unk_dna UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N') >>> unk_dna.complement() UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N') >>> unk_dna.reverse_complement() UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N') >>> unk_dna.transcribe() UnknownSeq(20, alphabet = IUPACAmbiguousRNA(), character = 'N') >>> unk_protein = unk_dna.translate() >>> unk_protein UnknownSeq(6, alphabet = ProteinAlphabet(), character = 'X') >>> print unk_protein XXXXXX >>> len(unk_protein) 6 You may be able to find a use for the UnknownSeq object in your own code, but it is more likely that you will first come across them in a SeqRecord object created by Bio.SeqIO (see Chapter 5). Some sequence file formats don’t always include the actual sequence, for example GenBank and EMBL files may include a list of features but for the sequence just present the contig information. Alternatively, the QUAL files used in sequencing work hold quality scores but they never contain a sequence – instead there is a partner FASTA file which does have the sequence. Working with directly strings To close this chapter, for those you who really don’t want to use the sequence objects (or who prefer a functional programming style to an object orientated one), there are module level functions in Bio.Seq will accept plain Python strings, Seq objects (including UnknownSeq objects) or MutableSeq objects: >>> from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate >>> my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG" >>> reverse_complement(my_string) 'CTAACCAGCAGCACGACCACCCTTCCAACGACCCATAACAGC' >>> transcribe(my_string) 'GCUGUUAUGGGUCGUUGGAAGGGUGGUCGUGCUGCUGGUUAG' >>> back_transcribe(my_string) 'GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG' >>> translate(my_string) 'AVMGRWKGGRAAG*' You are, however, encouraged to work with Seq objects by default. Creating a SeqRecord Using a SeqRecord object is not very complicated, since all of the information is presented as attributes of the class. Usually you won’t create a SeqRecord “by hand”, but instead use Bio.SeqIO to read in a sequence file for you. However, creating SeqRecord can be quite simple. SeqRecord objects from scratch To create a SeqRecord at a minimum you just need a Seq object: >>> from Bio.Seq import Seq >>> simple_seq = Seq("GATC") >>> from Bio.SeqRecord import SeqRecord >>> simple_seq_r = SeqRecord(simple_seq) Additionally, you can also pass the id, name and description to the initialization function, but if not they will be set as strings indicating they are unknown, and can be modified subsequently: >>> simple_seq_r.id '<unknown id>' >>> simple_seq_r.id = "AC12345" >>> simple_seq_r.description = "Made up sequence I wish I could write a paper about" >>> print simple_seq_r.description Made up sequence I wish I could write a paper about >>> simple_seq_r.seq Seq('GATC', Alphabet()) Including an identifier is very important if you want to output your SeqRecord to a file. You would normally include this when creating the object: >>> from Bio.Seq import Seq >>> simple_seq = Seq("GATC") >>> from Bio.SeqRecord import SeqRecord >>> simple_seq_r = SeqRecord(simple_seq, id="AC12345") As mentioned above, the SeqRecord has an dictionary attribute annotations. This is used for any miscellaneous annotations that doesn’t fit under one of the other more specific attributes. Adding annotations is easy, and just involves dealing directly with the annotation dictionary: >>> simple_seq_r.annotations["evidence"] = "None. I just made it up." >>> print simple_seq_r.annotations {'evidence': 'None. I just made it up.'} >>> print simple_seq_r.annotations["evidence"] None. I just made it up. Working with per-letter-annotations is similar, letter_annotations is a dictionary like attribute which will let you assign any Python sequence (i.e. a string, list or tuple) which has the same length as the sequence: >>> simple_seq_r.letter_annotations["phred_quality"] = [40,40,38,30] >>> print simple_seq_r.letter_annotations {'phred_quality': [40, 40, 38, 30]} >>> print simple_seq_r.letter_annotations["phred_quality"] [40, 40, 38, 30] The dbxrefs and features attributes are just Python lists, and should be used to store strings and SeqFeature objects (discussed later in this chapter) respectively. SeqRecord objects from FASTA files This example uses a fairly large FASTA file containing the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally downloaded from the NCBI. This file is included with the Biopython unit tests under the GenBank folder, or online NC_005816.fna from our website. The file starts like this - and you can check there is only one record present (i.e. only one line starting with a greater than symbol): >gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCT GCTCTCC ... The function Bio.SeqIO.parse(...) is used to loop over all the records in a file as SeqRecord objects. The Bio.SeqIO module has a sister function for use on files which contain just one record which we’ll use here: >>> from Bio import SeqIO >>> record = SeqIO.read("NC_005816.fna", "fasta") >>> record SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATC CAGG...CTG', SingleLetterAlphabet()), id='gi|45478711|ref|NC_005816.1|', name='gi|45478711|ref|NC_005816.1|', description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... sequence', dbxrefs=[]) Now, let’s have a look at the key attributes of this SeqRecord individually – starting with the seq attribute which gives you a Seq object: >>> record.seq Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet()) Here Bio.SeqIO has defaulted to a generic alphabet, rather than guessing that this is DNA. If you know in advance what kind of sequence your FASTA file contains, you can tell Bio.SeqIO which alphabet to use. Next, the identifiers and description: >>> record.id 'gi|45478711|ref|NC_005816.1|' >>> record.name 'gi|45478711|ref|NC_005816.1|' >>> record.description 'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence' As you can see above, the first word of the FASTA record’s title line (after removing the greater than symbol) is used for both the id and name attributes. The whole title line (after removing the greater than symbol) is used for the record description. This is deliberate, partly for backwards compatibility reasons, but it also makes sense if you have a FASTA file like this: >Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1 TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCT GCTCTCC ... Note that none of the other annotation attributes get populated when reading a FASTA file: >>> record.dbxrefs [] >>> record.annotations {} >>> record.letter_annotations {} >>> record.features [] In this case our example FASTA file was from the NCBI, and they have a fairly well defined set of conventions for formatting their FASTA lines. This means it would be possible to parse this information and extract the GI number and accession for example. However, FASTA files from other sources vary, so this isn’t possible in general. SeqRecord objects from GenBank files As in the previous example, we’re going to look at the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally downloaded from the NCBI, but this time as a GenBank file. Again, this file is included with the Biopython unit tests under the GenBank folder, or online NC_005816.gb from our website. This file contains a single record (i.e. only one LOCUS line) and starts: LOCUS NC_005816 9609 bp DNA circular BCT 21-JUL-2008 DEFINITION Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence. ACCESSION NC_005816 VERSION NC_005816.1 GI:45478711 PROJECT GenomeProject:10638 ... Again, we’ll use Bio.SeqIO to read this file in, and the code is almost identical to that for used above for the FASTA file: >>> from Bio import SeqIO >>> record = SeqIO.read("NC_005816.gb", "genbank") >>> record SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATC CAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=['Project:10638']) You should be able to spot some differences already! But taking the attributes individually, the sequence string is the same as before, but this time Bio.SeqIO has been able to automatically assign a more specific alphabet : >>> record.seq Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()) The name comes from the LOCUS line, while the id includes the version suffix. The description comes from the DEFINITION line: >>> record.id 'NC_005816.1' >>> record.name 'NC_005816' >>> record.description 'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.' GenBank files don’t have any per-letter annotations: >>> record.letter_annotations {} Most of the annotations information gets recorded in the annotations dictionary, for example: >>> len(record.annotations) 11 >>> record.annotations["source"] 'Yersinia pestis biovar Microtus str. 91001' The dbxrefs list gets populated from any PROJECT or DBLINK lines: >>> record.dbxrefs ['Project:10638'] Finally, and perhaps most interestingly, all the entries in the features table (e.g. the genes or CDS features) get recorded as SeqFeature objects in the features list. >>> len(record.features) 29 Sequence A SeqFeature object doesn’t directly contain a sequence, instead its location describes how to get this from the parent sequence. For example consider a (short) gene sequence with location 5:18 on the reverse strand, which in GenBank/EMBL notation using 1-based counting would be complement(4..18), like this: >>> from Bio.Seq import Seq >>> from Bio.SeqFeature import SeqFeature, FeatureLocation >>> example_parent = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACT GGGAGCCTAC") >>> example_feature = SeqFeature(FeatureLocation(5, 18), type="gene", strand=-1) You could take the parent sequence, slice it to extract 5:18, and then take the reverse complement: >>> feature_seq = example_parent[5:18].reverse_complement() >>> print feature_seq AGCCTTTGCCGTC This is a simple example so this isn’t too bad – however once you have to deal with compound features (joins) this is rather messy. Instead, the SeqFeature object has an extract method to take care of all this: >>> feature_seq = example_feature.extract(example_parent) >>> print feature_seq AGCCTTTGCCGTC The extract method was added in Biopython 1.53, and in Biopython 1.56 the SeqFeature was further extended to give its length as that of the region of sequence it describes. >>> print example_feature.extract(example_parent) AGCCTTTGCCGTC >>> print len(example_feature.extract(example_parent)) 13 >>> print len(example_feature) The PDB module Biopython also allows you to explore the extensive realm of macromolecular structure. Biopython comes with a PDBParser class that produces a Structure object. The Structure object can be used to access the atomic data in the file in a convenient manner. Structure representation A macromolecular structure is represented using a structure, model chain, residue, atom (or SMCRA) hierarchy. The figure below shows a UML class diagram of the SMCRA data structure. Such a data structure is not necessarily best suited for the representation of the macromolecular content of a structure, but it is absolutely necessary for a good interpretation of the data present in a file that describes the structure (typically a PDB or MMCIF file). If this hierarchy cannot represent the contents of a structure file, it is fairly certain that the file contains an error or at least does not describe the structure unambiguously. If a SMCRA data structure cannot be generated, there is reason to suspect a problem. Parsing a PDB file can thus be used to detect likely problems. Structure, Model, Chain and Residue are all subclasses of the Entity base class. The Atom class only (partly) implements the Entity interface (because an Atom does not have children). For each Entity subclass, you can extract a child by using a unique id for that child as a key (e.g. you can extract an Atom object from a Residue object by using an atom name string as a key, you can extract a Chain object from a Model object by using its chain identifier as a key). Disordered atoms and residues are represented by DisorderedAtom and DisorderedResidue classes, which are both subclasses of the DisorderedEntityWrapper base class. They hide the complexity associated with disorder and behave exactly as Atom and Residue objects. In general, a child Entity object (i.e. Atom, Residue, Chain, Model) can be extracted from its parent (i.e. Residue, Chain, Model, Structure, respectively) by using an id as a key. child_entity=parent_entity[child_id] You can also get a list of all child Entities of a parent Entity object. Note that this list is sorted in a specific way (e.g. according to chain identifier for Chain objects in a Model object). child_list=parent_entity.get_list() You can also get the parent from a child. parent_entity=child_entity.get_parent() At all levels of the SMCRA hierarchy, you can also extract a full id. The full id is a tuple containing all id’s starting from the top object (Structure) down to the current object. A full id for a Residue object e.g. is something like: full_id=residue.get_full_id() print full_id ("1abc", 0, "A", ("", 10, "A")) This corresponds to:     The Structure with id "1abc" The Model with id 0 The Chain with id "A" The Residue with id (" ", 10, "A"). The Residue id indicates that the residue is not a hetero-residue (nor a water) because it has a blank hetero field, that its sequence identifier is 10 and that its insertion code is "A". Some other useful methods: # get the entity's id entity.get_id() # check if there is a child with a given id entity.has_id(entity_id) # get number of children nr_children=len(entity) It is possible to delete, rename, add, etc. child entities from a parent entity, but this does not include any sanity checks (e.g. it is possible to add two residues with the same id to one chain). This really should be done via a nice Decorator class that includes integrity checking, but you can take a look at the code (Entity.py) if you want to use the raw interface. Structure The Structure object is at the top of the hierarchy. Its id is a user given string. The Structure contains a number of Model children. Most crystal structures (but not all) contain a single model, while NMR structures typically consist of several models. Disorder in crystal structures of large parts of molecules can also result in several models. Constructing a Structure object A Structure object is produced by a PDBParser object: from Bio.PDB.PDBParser import PDBParser p=PDBParser(PERMISSIVE=1) structure_id="1fat" filename="pdb1fat.ent" s=p.get_structure(structure_id, filename) The PERMISSIVE flag indicates that a number of common problems associated with PDB files will be ignored (but note that some atoms and/or residues will be missing). If the flag is not present a PDBConstructionException will be generated during the parse operation. Header and trailer You can extract the header and trailer (simple lists of strings) of the PDB file from the PDBParser object with the get_header and get_trailer methods. Model The id of the Model object is an integer, which is derived from the position of the model in the parsed file (they are automatically numbered starting from 0). The Model object stores a list of Chain children. Example Get the first model from a Structure object. first_model=structure[0] Chain The id of a Chain object is derived from the chain identifier in the structure file, and can be any string. Each Chain in a Model object has a unique id. The Chain object stores a list of Residue children. Example Get the Chain object with identifier “A” from a Model object. chain_A=model["A"] Residue Unsurprisingly, a Residue object stores a set of Atom children. In addition, it also contains a string that specifies the residue name (e.g. “ASN”) and the segment identifier of the residue (well known to X-PLOR users, but not used in the construction of the SMCRA data structure). The id of a Residue object is composed of three parts: the hetero field (hetfield), the sequence identifier (resseq) and the insertion code (icode). The hetero field is a string : it is “W” for waters, “H_” followed by the residue name (e.g. “H_FUC”) for other hetero residues and blank for standard amino and nucleic acids. The second field in the Residue id is the sequence identifier, an integer describing the position of the residue in the chain. The third field is a string, consisting of the insertion code. The insertion code is sometimes used to preserve a certain desirable residue numbering scheme. A Ser 80 insertion mutant (inserted e.g. between a Thr 80 and an Asn 81 residue) could e.g. have sequence identifiers and insertion codes as followed: Thr 80 A, Ser 80 B, Asn 81. In this way the residue numbering scheme stays in tune with that of the wild type structure. Let’s give some examples. Asn 10 with a blank insertion code would have residue id (’’ ’’, 10, ’’ ’’). Water 10 would have residue id (‘‘W‘‘, 10, ‘‘ ‘‘). A glucose molecule (a hetero residue with residue name GLC) with sequence identifier 10 would have residue id (’’H_GLC’’, 10, ’’ ’’). In this way, the three residues (with the same insertion code and sequence identifier) can be part of the same chain because their residue id’s are distinct. In most cases, the hetflag and insertion code fields will be blank, e.g. (’’ ’’, 10, ’’ ’’). In these cases, the sequence identifier can be used as a shortcut for the full id: # use full id res10=chain[("", 10, "")] # use shortcut res10=chain[10] Each Residue object in a Chain object should have a unique id. However, disordered residues are dealt with in a special way. A Residue object has a number of additional methods: r.get_resname() # return residue name, e.g. "ASN" r.get_segid() # return the SEGID, e.g. "CHN1" 10.1.5 Atom The Atom object stores the data associated with an atom, and has no children. The id of an atom is its atom name (e.g. “OG” for the side chain oxygen of a Ser residue). An Atom id needs to be unique in a Residue. Again, an exception is made for disordered atoms. In a PDB file, an atom name consists of 4 chars, typically with leading and trailing spaces. Often these spaces can be removed for ease of use (e.g. an amino acid C α atom is labeled “.CA.” in a PDB file, where the dots represent spaces). To generate an atom name (and thus an atom id) the spaces are removed, unless this would result in a name collision in a Residue (i.e. two Atom objects with the same atom name and id). In the latter case, the atom name including spaces is tried. This situation can e.g. happen when one residue contains atoms with names “.CA.” and “CA..”, although this is not very likely. The atomic data stored includes the atom name, the atomic coordinates (including standard deviation if present), the B factor (including anisotropic B factors and standard deviation if present), the altloc specifier and the full atom name including spaces. Less used items like the atom element number or the atomic charge sometimes specified in a PDB file are not stored. An Atom object has the following additional methods: a.get_name() # atom name (spaces stripped, e.g. "CA") a.get_id() # id (equals atom name) a.get_coord() # atomic coordinates a.get_bfactor() # B factor a.get_occupancy() # occupancy a.get_altloc() # alternative location specifie a.get_sigatm() # std. dev. of atomic parameters a.get_siguij() # std. dev. of anisotropic B factor a.get_anisou() # anisotropic B factor a.get_fullname() # atom name (with spaces, e.g. ".CA.") To represent the atom coordinates, siguij, anisotropic B factor and sigatm Numpy arrays are used. Disorder General approach Disorder should be dealt with from two points of view: the atom and the residue points of view. In general, we have tried to encapsulate all the complexity that arises from disorder. If you just want to loop over all C α atoms, you do not care that some residues have a disordered side chain. On the other hand it should also be possible to represent disorder completely in the data structure. Therefore, disordered atoms or residues are stored in special objects that behave as if there is no disorder. This is done by only representing a subset of the disordered atoms or residues. Which subset is picked (e.g. which of the two disordered OG side chain atom positions of a Ser residue is used) can be specified by the user. Disordered atoms Disordered atoms are represented by ordinary Atom objects, but all Atom objects that represent the same physical atom are stored in a DisorderedAtom object. Each Atom object in a DisorderedAtom object can be uniquely indexed using its altloc specifier. The DisorderedAtom object forwards all uncaught method calls to the selected Atom object, by default the one that represents the atom with with the highest occupancy. The user can of course change the selected Atom object, making use of its altloc specifier. In this way atom disorder is represented correctly without much additional complexity. In other words, if you are not interested in atom disorder, you will not be bothered by it. Each disordered atom has a characteristic altloc identifier. You can specify that a DisorderedAtom object should behave like the Atom object associated with a specific altloc identifier: atom.disordered\_select("A") # select altloc A atom print atom.get_altloc() "A" atom.disordered_select("B") print atom.get_altloc() "B" # select altloc B atom Disordered residues Common case The most common case is a residue that contains one or more disordered atoms. This is evidently solved by using DisorderedAtom objects to represent the disordered atoms, and storing the DisorderedAtom object in a Residue object just like ordinary Atom objects. The DisorderedAtom will behave exactly like an ordinary atom (in fact the atom with the highest occupancy) by forwarding all uncaught method calls to one of the Atom objects (the selected Atom object) it contains. Point mutations A special case arises when disorder is due to a point mutation, i.e. when two or more point mutants of a polypeptide are present in the crystal. An example of this can be found in PDB structure 1EN2. Since these residues belong to a different residue type (e.g. let’s say Ser 60 and Cys 60) they should not be stored in a single Residue object as in the common case. In this case, each residue is represented by one Residue object, and both Residue objects are stored in a DisorderedResidue object. The DisorderedResidue object forwards all uncaught methods to the selected Residue object (by default the last Residue object added), and thus behaves like an ordinary residue. Each Residue object in a DisorderedResidue object can be uniquely identified by its residue name. In the above example, residue Ser 60 would have id “SER” in the DisorderedResidue object, while residue Cys 60 would have id “CYS”. The user can select the active Residue object in a DisorderedResidue object via this id. Hetero residues 10.3.1 Associated problems A common problem with hetero residues is that several hetero and non-hetero residues present in the same chain share the same sequence identifier (and insertion code). Therefore, to generate a unique id for each hetero residue, waters and other hetero residues are treated in a different way. Remember that Residue object have the tuple (hetfield, resseq, icode) as id. The hetfield is blank (“ “) for amino and nucleic acids, and a string for waters and other hetero residues. The content of the hetfield is explained below. Water residues The hetfield string of a water residue consists of the letter “W”. So a typical residue id for a water is (“W”, 1, “ “). Other hetero residues The hetfield string for other hetero residues starts with “H_” followed by the residue name. A glucose molecule e.g. with residue name “GLC” would have hetfield “H_GLC”. It’s residue id could e.g. be (“H_GLC”, 1, “ “). Some random usage examples Parse a PDB file, and extract some Model, Chain, Residue and Atom objects. from Bio.PDB.PDBParser import PDBParser parser=PDBParser() structure=parser.get_structure("test", "1fat.pdb") model=structure[0] chain=model["A"] residue=chain[1] atom=residue["CA"] Extract a hetero residue from a chain (e.g. a glucose (GLC) moiety with resseq 10). residue_id=("H_GLC", 10, " ") residue=chain[residue_id] Print all hetero residues in chain. for residue in chain.get_list(): residue_id=residue.get_id() hetfield=residue_id[0] if hetfield[0]=="H": print residue_id Print out the coordinates of all CA atoms in a structure with B factor greater than 50. for model in structure.get_list(): for chain in model.get_list(): for residue in chain.get_list(): if residue.has_id("CA"): ca=residue["CA"] if ca.get_bfactor()>50.0: print ca.get_coord() Print out all the residues that contain disordered atoms. for model in structure.get_list(): for chain in model.get_list(): for residue in chain.get_list(): if residue.is_disordered(): resseq=residue.get_id()[1] resname=residue.get_resname() model_id=model.get_id() chain_id=chain.get_id() print model_id, chain_id, resname, resseq Loop over all disordered atoms, and select all atoms with altloc A (if present). This will make sure that the SMCRA data structure will behave as if only the atoms with altloc A are present. for model in structure.get_list(): for chain in model.get_list(): for residue in chain.get_list(): if residue.is_disordered(): for atom in residue.get_list(): if atom.is_disordered(): if atom.disordered_has_id("A"): atom.disordered_select("A") Suppose that a chain has a point mutation at position 10, consisting of a Ser and a Cys residue. Make sure that residue 10 of this chain behaves as the Cys residue. residue=chain[10] residue.disordered_select("CYS") wxPython (GUI) wxPython is a GUI toolkit for the Python programming language. It allows Python programmers to create programs with a robust, highly functional graphical user interface, simply and easily. It is implemented as a Python extension module (native code) that wraps the popular wxWidgets cross platform GUI library, which is written in C++. Like Python and wxWidgets, wxPython is Open Source which means that it is free for anyone to use and the source code is available for anyone to look at and modify. Or anyone can contribute fixes or enhancements to the project. wxPython is a cross-platform toolkit. This means that the same program will run on multiple platforms without modification. Currently supported platforms are 32-bit Microsoft Windows, most Unix or unix-like systems, and Macintosh OS X. A First Application: "Hello, World" As is traditional, we are first going to write a Small "Hello, world" application. Here is the code: 1 #!/usr/bin/env python 2 import wx 3 4 app = wx.App(False) # Create a new app, don't redirect stdout/stderr to a window. 5 frame = wx.Frame(None, wx.ID_ANY, "Hello World") # A Frame is a top-level window. 6 frame.Show(True) # Show the frame. 7 app.MainLoop() Explanations: Every wxPython app is an instance of wx.App. For most simple applications you can use wx.App as is. When you get to more app = wx.App(False) complex applications you may need to extend the wx.App class. The "False" parameter means "don't redirect stdout and stderr to a window". A wx.Frame is a top-level window. The syntax is x.Frame(Parent, Id, Title). Most of the constructors have this shape (a parent object, wx.Frame(None,wx.ID_ANY,"Hello") followed by an Id). In this example, we use None for "no parent" and wx.ID_ANY to have wxWidgets pick an id for us. frame.Show(True) We make a frame visible by "showing" it. Finally, we start the application's MainLoop whose role is to handle app.MainLoop() the events. Note: You almost always want to use wx.ID_ANY or another standard ID provided by wxWidgets. You can make your own IDs, but there is no reason to do that. Run the program and you should see a window like this one: Windows or Frames? When people talk about GUIs, they usually speak of windows, menus and icons. Naturally then, you would expect that wx.Window should represent a window on the screen. Unfortunately, this is not the case. A wx.Window is the base class from which all visual elements are derived (buttons, menus, etc) and what we normally think of as a program window is a wx.Frame. This is an unfortunate inconsistency that has led to much confusion for new users. Building a simple text editor In this tutorial we are going to build a simple text editor. In the process, we will explore several widgets, and learn about features such as events and callbacks. First steps The first step is to make a simple frame with an editable text box inside. A text box is made with the wx.TextCtrl widget. By default, a text box is a single-line field, but the wx.TE_MULTILINE parameter allows you to enter multiple lines of text. 1 #!/usr/bin/env python 2 import wx 3 class MyFrame(wx.Frame): 4 """ We simply derive a new class of Frame. """ 5 def __init__(self, parent, title): 6 wx.Frame.__init__(self, parent, title=title, size=(200,100)) 7 self.control = wx.TextCtrl(self, style=wx.TE_MULTILINE) 8 self.Show(True) 9 10 app = wx.App(False) 11 frame = MyFrame(None, 'Small editor') 12 app.MainLoop() In this example, we derive from wx.Frame and overwrite its __init__ method. Here we declare a new wx.TextCtrl which is a simple text edit control. Note that since the MyFrame runs self.Show() inside its __init__ method, we no longer have to call frame.Show() explicitly. Adding a menu bar Every application should have a menu bar and a status bar. Let's add them to ours: 1 import wx 2 3 class MainWindow(wx.Frame): 4 def __init__(self, parent, title): 5 wx.Frame.__init__(self, parent, title=title, size=(200,100)) 6 self.control = wx.TextCtrl(self, style=wx.TE_MULTILINE) 7 self.CreateStatusBar() # A Statusbar in the bottom of the window 8 9 # Setting up the menu. 10 filemenu= wx.Menu() 11 12 # wx.ID_ABOUT and wx.ID_EXIT are standard IDs provided by wxWidgets. 13 filemenu.Append(wx.ID_ABOUT, "&About"," Information about this program") 14 filemenu.AppendSeparator() 15 filemenu.Append(wx.ID_EXIT,"E&xit"," Terminate the program") 16 17 # Creating the menubar. 18 menuBar = wx.MenuBar() 19 menuBar.Append(filemenu,"&File") # Adding the "filemenu" to the MenuBar 20 self.SetMenuBar(menuBar) # Adding the MenuBar to the Frame content. 21 self.Show(True) 22 23 app = wx.App(False) 24 frame = MainWindow(None, "Sample editor") 25 app.MainLoop() TIP: Notice the wx.ID_ABOUT and wx.ID_EXIT ids. These are standard ids provided by wxWidgets (see a full list here). It is a good habit to use the standard ID if there is one available. This helps wxWidgets know how to display the widget in each platform to make it look more native. Event handling Reacting to events in wxPython is called event handling. An event is when "something" happens on your application (a button click, text input, mouse movement, etc). Much of GUI programming consists of responding to events. You bind an object to an event using the Bind() method: 1 class MainWindow(wx.Frame): 2 def __init__(self, parent, title): 3 wx.Frame.__init__(self,parent, title=title, size=(200,100)) 4 ... 5 menuItem = filemenu.Append(wx.ID_ABOUT, "&About"," Information about this program") 6 self.Bind(wx.EVT_MENU, self.OnAbout, menuItem) This means that, from now on, when the user selects the "About" menu item, the method self.OnAbout will be executed. wx.EVT_MENU is the "select menu item" event. wxWidgets understands many other events (see the full list). The self.OnAbout method has the general declaration: 1 def OnAbout(self, event): 2 ... Here event is an instance of a subclass of wx.Event. For example, a button-click event wx.EVT_BUTTON - is a subclass of wx.Event. The method is executed when the event occurs. By default, this method will handle the event and the event will stop after the callback finishes. However, you can "skip" an event with event.Skip(). This causes the event to go through the hierarchy of event handlers. For example: 1 def OnButtonClick(self, event): 2 if (some_condition): 3 do_something() 4 else: 5 event.Skip() 6 7 def OnEvent(self, event): 8 ... When a button-click event occurs, the method OnButtonClick gets called. If some_condition is true, we do_something() otherwise we let the event be handled by the more general event handler. Now let's have a look at our application: 1 import os 2 import wx 3 4 5 class MainWindow(wx.Frame): 6 def __init__(self, parent, title): 7 wx.Frame.__init__(self, parent, title=title, size=(200,100)) 8 self.control = wx.TextCtrl(self, style=wx.TE_MULTILINE) 9 self.CreateStatusBar() # A StatusBar in the bottom of the window 10 11 # Setting up the menu. 12 filemenu= wx.Menu() 13 14 # wx.ID_ABOUT and wx.ID_EXIT are standard ids provided by wxWidgets. 15 menuAbout = filemenu.Append(wx.ID_ABOUT, "&About"," Information about this program") 16 menuExit = filemenu.Append(wx.ID_EXIT,"E&xit"," Terminate the program") 17 18 # Creating the menubar. 19 menuBar = wx.MenuBar() 20 menuBar.Append(filemenu,"&File") # Adding the "filemenu" to the MenuBar 21 self.SetMenuBar(menuBar) # Adding the MenuBar to the Frame content. 22 23 # Set events. 24 self.Bind(wx.EVT_MENU, self.OnAbout, menuAbout) 25 self.Bind(wx.EVT_MENU, self.OnExit, menuExit) 26 27 self.Show(True) 28 29 def OnAbout(self,e): 30 # A message dialog box with an OK button. wx.OK is a standard ID in wxWidgets. 31 dlg = wx.MessageDialog( self, "A small text editor", "About Sample Editor", wx.OK) 32 dlg.ShowModal() # Show it 33 dlg.Destroy() # finally destroy it when finished. 34 35 def OnExit(self,e): 36 self.Close(True) # Close the frame. 37 38 app = wx.App(False) 39 frame = MainWindow(None, "Sample editor") 40 app.MainLoop() Note: Instead of 1 wx.MessageDialog( self, "A small editor in wxPython", "About Sample Editor", wx.OK) We could have omitted the id. In that case wxWidget automatically generates an ID (just as if you specified wx.ID_ANY): 1 wx.MessageDialog( self, "A small editor in wxPython", "About Sample Editor") Dialogs Of course an editor is useless if it is not able to save or open documents. That's where Common dialogs come in. Common dialogs are those offered by the underlying platform so that your application will look exactly like a native application. Here is the implementation of the OnOpen method in MainWindow: 1 def OnOpen(self,e): 2 """ Open a file""" 3 self.dirname = '' 4 dlg = wx.FileDialog(self, "Choose a file", self.dirname, "", "*.*", wx.OPEN) 5 if dlg.ShowModal() == wx.ID_OK: 6 self.filename = dlg.GetFilename() 7 self.dirname = dlg.GetDirectory() 8 f = open(os.path.join(self.dirname, self.filename), 'r') 9 self.control.SetValue(f.read()) 10 f.close() 11 dlg.Destroy() Explanation:    First, we create the dialog by calling the appropriate Constructor. Then, we call ShowModal. That opens the dialog - "Modal" means that the user cannot do anything on the application until he clicks OK or Cancel. The return value of ShowModal is the Id of the button pressed. If the user pressed OK we read the file. You should now be able to add the corresponding entry into the menu and connect it to the OnOpen method. If you have trouble, scroll down to the appendix for the full source code. Possible extensions Of course, this program is far from being a decent editor. But adding other features should not be any more difficult than what has already been done. You might take inspiration from the demos that are bundled with wxPython:     Drag and Drop. MDI Tab view/multiple files Find/Replace dialog    Print dialog (Printing) Macro-commands in python ( using the eval function) etc ... Working with Windows  Topics: o Frames o Windows o Controls/Widgets o Sizers o Validators In this section, we are going to present the way wxPython deals with windows and their contents, including building input forms and using various widgets/controls. We are going to build a small application that calculates the price of a quote. Overview Laying out Visual Elements  Within a frame, you'll use a number of wxWindow sub-classes to flesh out the frame's contents. Here are some of the more common elements you might want to put in your frame: o A wx.MenuBar, which puts a menu bar along the top of your frame. o A wx.StatusBar, which sets up an area along the bottom of your frame for displaying status messages, etc. o A wx.ToolBar, which puts a toolbar in your frame. o Sub-classes of wx.Control. These are objects which represent user interface widgets (ie, visual elements which display data and/or process user input). Common examples of wx.Control objects include wx.Button, wxStaticText, wx.TextCtrl and wx.ComboBox. o A wx.Panel, which is a container to hold your various wx.Control objects. Putting your wx.Control objects inside a wx.Panel means that the user can tab from one UI widget to the next. All visual elements (wxWindow objects and their subclasses) can hold sub-elements. Thus, for example, a wx.Frame might hold a number of wx.Panel objects, which in turn hold a number of wx.Button, wx.StaticText and wx.TextCtrl objects, giving you an entire hierarchy of elements: Note that this merely describes the way that certain visual elements are interrelated -- not how they are visually laid out within the frame. To handle the layout of elements within a frame, you have several options: 6. You can manually position each element by specifying it's exact pixel coordinate within the parent window. Because of differences in font sizes, etc, between platforms, this option is not generally recommended. 7. You can use wx.LayoutConstraints, though these are fairly complex to use. 8. You can use the Delphi-like LayoutAnchors, which make it easier to use wx.LayoutConstraints. 9. You can use one of the wxSizer subclasses. This document will concentrate on the use of wxSizers, because that's the scheme I'm most familiar with. Sizers  A sizer (that is, one of the wx.Sizer sub-classes) can be used to handle the visual arrangement of elements within a window or frame. Sizers can: o Calculate an appropriate size for each visual element. o Position the elements according to certain rules. o Dynamically resize and/or reposition elements when a frame is resized. Some of the more common types of sizers include: o o o wx.BoxSizer, which arranges visual elements in a line going either horizontally or vertically. wx.GridSizer, which lays visual elements out into a grid-like structure. wx.FlexGridSizer, which is similar to a wx.GridSizer except that it allow for more flexibility in laying out visual elements. A sizer is given a list of wx.Window objects to size, either by calling sizer.Add(window, options...), or by calling sizer.AddMany(...). A sizer will only work on those elements which it has been given. Sizers can be nested. That is, you can add one sizer to another sizer, for example to have two rows of buttons (each laid out by a horizontal wx.BoxSizer) contained within another wx.BoxSizer which places the rows of buttons one above the other, like this: o Note: Notice that the above example does not lay out the six buttons into two rows of three columns each -- to do that, you should use a wxGridSizer. In the following example we use two nested sizers, the main one with vertical layout and the embedded one with horizontal layout: 1 import wx 2 import os 3 4 class MainWindow(wx.Frame): 5 def __init__(self, parent, title): 6 self.dirname='' 7 8 # A "-1" in the size parameter instructs wxWidgets to use the default size. 9 # In this case, we select 200px width and the default height. 10 wx.Frame.__init__(self, parent, title=title, size=(200,-1)) 11 self.control = wx.TextCtrl(self, style=wx.TE_MULTILINE) 12 self.CreateStatusBar() # A Statusbar in the bottom of the window 13 14 # Setting up the menu. 15 filemenu= wx.Menu() 16 menuOpen = filemenu.Append(wx.ID_OPEN, "&Open"," Open a file to edit") 17 menuAbout= filemenu.Append(wx.ID_ABOUT, "&About"," Information about this program") 18 menuExit = filemenu.Append(wx.ID_EXIT,"E&xit"," Terminate the program") 19 20 # Creating the menubar. 21 menuBar = wx.MenuBar() 22 menuBar.Append(filemenu,"&File") # Adding the "filemenu" to the MenuBar 23 self.SetMenuBar(menuBar) # Adding the MenuBar to the Frame content. 24 25 # Events. 26 self.Bind(wx.EVT_MENU, self.OnOpen, menuOpen) 27 self.Bind(wx.EVT_MENU, self.OnExit, menuExit) 28 self.Bind(wx.EVT_MENU, self.OnAbout, menuAbout) 29 30 self.sizer2 = wx.BoxSizer(wx.HORIZONTAL) 31 self.buttons = [] 32 for i in range(0, 6): 33 self.buttons.append(wx.Button(self, -1, "Button &"+str(i))) 34 self.sizer2.Add(self.buttons[i], 1, wx.EXPAND) 35 36 # Use some sizers to see layout options 37 self.sizer = wx.BoxSizer(wx.VERTICAL) 38 self.sizer.Add(self.control, 1, wx.EXPAND) 39 self.sizer.Add(self.sizer2, 0, wx.EXPAND) 40 41 #Layout sizers 42 self.SetSizer(self.sizer) 43 self.SetAutoLayout(1) 44 self.sizer.Fit(self) 45 self.Show() 46 47 def OnAbout(self,e): 48 # Create a message dialog box 49 dlg = wx.MessageDialog(self, " A sample editor \n in wxPython", "About Sample Editor", wx.OK) 50 dlg.ShowModal() # Shows it 51 dlg.Destroy() # finally destroy it when finished. 52 53 def OnExit(self,e): 54 self.Close(True) # Close the frame. 55 56 def OnOpen(self,e): 57 """ Open a file""" 58 dlg = wx.FileDialog(self, "Choose a file", self.dirname, "", "*.*", wx.OPEN) 59 if dlg.ShowModal() == wx.ID_OK: 60 self.filename = dlg.GetFilename() 61 self.dirname = dlg.GetDirectory() 62 f = open(os.path.join(self.dirname, self.filename), 'r') 63 self.control.SetValue(f.read()) 64 f.close() 65 dlg.Destroy() 66 67 app = wx.App(False) 68 frame = MainWindow(None, "Sample editor") 69 app.MainLoop()      The sizer.Add method has three arguments. The first one specifies the control to include in the sizer. The second one is a weight factor which means that this control will be sized in proportion to other ones. For example, if you had three edit controls and you wanted them to have the proportions 3:2:1 then you would specify these factors as arguments when adding the controls. 0 means that this control or sizer will not grow. The third argument is normally wx.GROW (same as wx.EXPAND) which means the control will be resized when necessary. If you use wx.SHAPED instead, the controls aspect ratio will remain the same. If the second parameter is 0, i.e. the control will not be resized, the third parameter may indicate if the control should be centered horizontally and/or vertically by using wx.ALIGN_CENTER_HORIZONTAL, wx.ALIGN_CENTER_VERTICAL, or wx.ALIGN_CENTER (for both) instead of wx.GROW or wx.SHAPED as that third parameter. You can alternatively specify combinations of wx.ALIGN_LEFT, wx.ALIGN_TOP, wx.ALIGN_RIGHT, and wx.ALIGN_BOTTOM. The default behavior is equivalent to wx.ALIGN_LEFT | wx.ALIGN_TOP. One potentially confusing aspect of the wx.Sizer and its sub-classes is the distinction between a sizer and a parent window. When you create objects to go inside a sizer, you do not make the sizer the object's parent window. A sizer is a way of laying out windows, it is not a window in itself. In the above example, all six buttons would be created with the parent window being the frame or window which encloses the buttons -- not the sizer. If you try to create a visual element and pass the sizer as the parent window, your program will crash. Once you have set up your visual elements and added them to a sizer (or to a nested set of sizers), the next step is to tell your frame or window to use the sizer. You do this in three steps: 1 window.SetSizer(sizer) 2 window.SetAutoLayout(true) 3 sizer.Fit(window)  The SetSizer() call tells your window (or frame) which sizer to use. The call to SetAutoLayout() tells your window to use the sizer to position and size your components. And finally, the call to sizer.Fit() tells the sizer to calculate the initial size and position for all its elements. If you are using sizers, this is the normal process you would go through to set up your window or frame's contents before it is displayed for the first time. Validators  When you create a dialog box or other input form, you can use a wx.Validator to simplify the process of loading data into your form, validating the entered data, and extracting the data out of the form again. wx.Validator can also be used to intercept keystrokes and other events within an input field. To use a validator, you have to create your own sub-class of wx.Validator (neither wx.TextValidator nor wx.GenericValidator are implemented in wxPython). This sub-class is then associated with your input field by calling myInputField.SetValidator(myValidator). o Note: Your wx.Validator sub-class must implement the wxValidator.Clone() method. A Working Example Our first label within a panel  Let's start with an example. Our program is going to have a single Frame with a panel [7] containing a label [8]: 1 import wx 2 class ExampleFrame(wx.Frame): 3 def __init__(self, parent): 4 wx.Frame.__init__(self, parent) 5 self.quote = wx.StaticText(self, label="Your quote :", pos=(20, 30), size=(200, -1)) 6 self.Show() 7 8 app = wx.App(False) 9 ExampleFrame(None) 10 app.MainLoop()  1 This design should be clear and you should not have any problem with it if you read the Small Editor section of this howto. Note: a sizer should be used here instead of specifying the position of each widget. Notice in the line: self.quote = wx.StaticText(self, label="Your quote :", pos=(20, 30))  the use of self as parent parameter of our wxStaticText. Our Static text is going to be on the panel we have created just before. The second parameter refers to the Id number of the new control. As this static control is not really going to be sending events, we don't care about this parameter. The wxPoint is used as positioning parameter. There is also an optional wxSize parameter but its use here is not justified. [7] According to wxPython documentation:  "A panel is a window on which controls are placed. It is usually placed within a frame. It contains minimal extra functionality over and above its parent class wxWindow; its main purpose is to be similar in appearance and functionality to a dialog, but with the flexibility of having any window as a parent.", in fact, it is a simple window used as a (grayed) background for other objects which are meant to deal with data entry. These are generally known as Controls or Widgets. [8] A label is used to display text that is not supposed to interact with the user. Adding a few more controls  You will find a complete list of the numerous Controls that exist in wxPython in the demo and help, but here we are going to present those most frequently used: o wxButton The most basic Control: A button showing a text that you can click. For example, here is a "Clear" button (e.g. to clear a text): 1 clearButton = wx.Button(self, wx.ID_CLEAR, "Clear") 2 self.Bind(wx.EVT_BUTTON, self.OnClear, clearButton)  wxTextCtrl This control let the user input text. It generates two main events. EVT_TEXT is called whenever the text changes. EVT_CHAR is called whenever a key has been pressed. 1 textField = wx.TextCtrl(self) 2 self.Bind(wx.EVT_TEXT, self.OnChange, textField) 3 self.Bind(wx.EVT_CHAR, self.OnKeyPress, textField)      For example: If the user presses the "Clear" button and that clears the text field, that will generate an EVT_TEXT event, but not an EVT_CHAR event. wxComboBox A combobox is very similar to wxTextCtrl but in addition to the events generated by wxTextCtrl, wxComboBox has the EVT_COMBOBOX event. wxCheckBox The checkbox is a control that gives the user true/false choice. wxRadioBox The radiobox lets the user choose from a list of options. Let's have a closer look at what Form1 looks like now: 1 import wx 2 class ExamplePanel(wx.Panel): 3 def __init__(self, parent): 4 wx.Panel.__init__(self, parent) 5 self.quote = wx.StaticText(self, label="Your quote :", pos=(20, 30)) 6 7 # A multiline TextCtrl - This is here to show how the events work in this program, don't pay too much attention to it 8 self.logger = wx.TextCtrl(self, pos=(300,20), size=(200,300), style=wx.TE_MULTILINE | wx.TE_READONLY) 9 10 # A button 11 self.button =wx.Button(self, label="Save", pos=(200, 325)) 12 self.Bind(wx.EVT_BUTTON, self.OnClick,self.button) 13 14 # the edit control - one line version. 15 self.lblname = wx.StaticText(self, label="Your name :", pos=(20,60)) 16 self.editname = wx.TextCtrl(self, value="Enter here your name", pos=(150, 60), size=(140,-1)) 17 self.Bind(wx.EVT_TEXT, self.EvtText, self.editname) 18 self.Bind(wx.EVT_CHAR, self.EvtChar, self.editname) 19 20 # the combobox Control 21 self.sampleList = ['friends', 'advertising', 'web search', 'Yellow Pages'] 22 self.lblhear = wx.StaticText(self, label="How did you hear from us ?", pos=(20, 90)) 23 self.edithear = wx.ComboBox(self, pos=(150, 90), size=(95, -1), choices=self.sampleList, style=wx.CB_DROPDOWN) 24 self.Bind(wx.EVT_COMBOBOX, self.EvtComboBox, self.edithear) 25 self.Bind(wx.EVT_TEXT, self.EvtText,self.edithear) 26 27 # Checkbox 28 self.insure = wx.CheckBox(self, label="Do you want Insured Shipment ?", pos=(20,180)) 29 self.Bind(wx.EVT_CHECKBOX, self.EvtCheckBox, self.insure) 30 31 # Radio Boxes 32 radioList = ['blue', 'red', 'yellow', 'orange', 'green', 'purple', 'navy blue', 'black', 'gray'] 33 rb = wx.RadioBox(self, label="What color would you like ?", pos=(20, 210), choices=radioList, majorDimension=3, 34 style=wx.RA_SPECIFY_COLS) 35 self.Bind(wx.EVT_RADIOBOX, self.EvtRadioBox, rb) 36 37 def EvtRadioBox(self, event): 38 self.logger.AppendText('EvtRadioBox: %d\n' % event.GetInt()) 39 def EvtComboBox(self, event): 40 self.logger.AppendText('EvtComboBox: %s\n' % event.GetString()) 41 def OnClick(self,event): 42 self.logger.AppendText(" Click on object with Id %d\n" %event.GetId()) 43 def EvtText(self, event): 44 self.logger.AppendText('EvtText: %s\n' % event.GetString()) 45 def EvtChar(self, event): 46 self.logger.AppendText('EvtChar: %d\n' % event.GetKeyCode()) 47 event.Skip() 48 def EvtCheckBox(self, event): 49 self.logger.AppendText('EvtCheckBox: %d\n' % event.Checked()) 50 51 52 app = wx.App(False) 53 frame = wx.Frame(None) 54 panel = ExamplePanel(frame) 55 frame.Show() 56 app.MainLoop()  Our class has grown a lot bigger; it has now a lot of controls and those controls react. We have added a special wxTextCtrl control to show the various events that are sent by the controls. The notebook  Sometimes, a form grows too big to fit on a single page. The wxNoteBook is used in that kind of case : It allows the user to navigate quickly between a small amount of pages by clicking on associated tabs. We implement this by putting the wxNotebook instead of our form into the main Frame and then add our Form1 into the notebook by using method AddPage. Note how ExamplePanel's parent is the Notebook. This is important. 1 app = wx.App(False) 2 frame = wx.Frame(None, title="Demo with Notebook") 3 nb = wx.Notebook(frame) 4 5 6 nb.AddPage(ExamplePanel(nb), "Absolute Positioning") 7 nb.AddPage(ExamplePanel(nb), "Page Two") 8 nb.AddPage(ExamplePanel(nb), "Page Three") 9 frame.Show() 10 app.MainLoop() Improving the layout - Using Sizers  Using absolute positioning is often not very satisfying: The result is ugly if the windows are not (for one reason or another) the right size. WxPython has very rich vocabulary of objects to lay out controls. wx.BoxSizer is the most common and simple layout object but it permits a vast range of possibilities. Its role is roughly to arrange a set of controls in a line or in a row and rearrange them when needed (i.e. when the global size is changed). o wx.GridSizer and wx.FlexGridSizer are two very important layout tools. They arrange the controls in a tabular layout. o Here is the sample above re-written to use sizers: 1 class ExamplePanel(wx.Panel): 2 def __init__(self, parent): 3 wx.Panel.__init__(self, parent) 4 5 # create some sizers 6 mainSizer = wx.BoxSizer(wx.VERTICAL) 7 grid = wx.GridBagSizer(hgap=5, vgap=5) 8 hSizer = wx.BoxSizer(wx.HORIZONTAL) 9 10 self.quote = wx.StaticText(self, label="Your quote: ") 11 grid.Add(self.quote, pos=(0,0)) 12 13 # A multiline TextCtrl - This is here to show how the events work in this program, don't pay too much attention to it 14 self.logger = wx.TextCtrl(self, size=(200,300), style=wx.TE_MULTILINE | wx.TE_READONLY) 15 16 # A button 17 self.button =wx.Button(self, label="Save") 18 self.Bind(wx.EVT_BUTTON, self.OnClick,self.button) 19 20 # the edit control - one line version. 21 self.lblname = wx.StaticText(self, label="Your name :") 22 grid.Add(self.lblname, pos=(1,0)) 23 self.editname = wx.TextCtrl(self, value="Enter here your name", size=(140,-1)) 24 grid.Add(self.editname, pos=(1,1)) 25 self.Bind(wx.EVT_TEXT, self.EvtText, self.editname) 26 self.Bind(wx.EVT_CHAR, self.EvtChar, self.editname) 27 28 # the combobox Control 29 self.sampleList = ['friends', 'advertising', 'web search', 'Yellow Pages'] 30 self.lblhear = wx.StaticText(self, label="How did you hear from us ?") 31 grid.Add(self.lblhear, pos=(3,0)) 32 self.edithear = wx.ComboBox(self, size=(95, -1), choices=self.sampleList, style=wx.CB_DROPDOWN) 33 grid.Add(self.edithear, pos=(3,1)) 34 self.Bind(wx.EVT_COMBOBOX, self.EvtComboBox, self.edithear) 35 self.Bind(wx.EVT_TEXT, self.EvtText,self.edithear) 36 37 # add a spacer to the sizer 38 grid.Add((10, 40), pos=(2,0)) 39 40 # Checkbox 41 self.insure = wx.CheckBox(self, label="Do you want Insured Shipment ?") 42 grid.Add(self.insure, pos=(4,0), span=(1,2), flag=wx.BOTTOM, border=5) 43 self.Bind(wx.EVT_CHECKBOX, self.EvtCheckBox, self.insure) 44 45 # Radio Boxes 46 radioList = ['blue', 'red', 'yellow', 'orange', 'green', 'purple', 'navy blue', 'black', 'gray'] 47 rb = wx.RadioBox(self, label="What color would you like ?", pos=(20, 210), choices=radioList, majorDimension=3, 48 style=wx.RA_SPECIFY_COLS) 49 grid.Add(rb, pos=(5,0), span=(1,2)) 50 self.Bind(wx.EVT_RADIOBOX, self.EvtRadioBox, rb) 51 52 hSizer.Add(grid, 0, wx.ALL, 5) 53 hSizer.Add(self.logger) 54 mainSizer.Add(hSizer, 0, wx.ALL, 5) 55 mainSizer.Add(self.button, 0, wx.CENTER) 56 self.SetSizerAndFit(mainSizer) This example uses a GridBagSizer to place its items. The "pos" argument controls where in the grid the widget is placed, in (x, y) positions. For instance, (0, 0) is the top-left position, whereas (3, 5) would be the third row down, and the fifth column across. The span argument allows a widget to span over multiple rows or columns. Source of this course: http://docs.python.org/tutorial/.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Python # Biopython # wxPython - IBERO-NBIC @ Inicio