Programming for Bioinformatics
Nicolas Salamin
[email protected]; tel: 4154; office: 3212
Iakov Davydov
[email protected]
MLS bioinformatics; Fall semester 2016
Course organisation
Goal of the course
Learn the fundamental aspects of programming to do biological research
Lecture / exercise
- mix of lectures and exercises
- need your computer all the time (no programming on paper...)
- practice will come with
  - this course
  - Elements of bioinformatics
  - First step project
Exam
I
oral exam (15 minutes) done in winter session
I
all kinds of documentation allowed (course slides, books, etc.)
I
focus on programming logic and structure
Why programming for biologists?
Computers and biology
Computers are increasingly essential to study all aspects of biology.
I
access and manage data
I
do statistical analysis (R is a programming language)
I
simulation and numerical modeling
Skills to learn
I
write simple computer programs in Python
I
automate data analysis
I
apply these tools to address biological questions
I
learn and understand programming concepts that will help with using
other languages
Why python?
Advantages
Easy Syntax easy to learn, you can start programming right away
Readability very clear syntax (executable “pseudo-code”), easy to
understand program code
High-Level looks more like a readable, human language than a low-level
language; program at a faster rate than with a low-level
language
Free free and open-source and cross-platform
Safe no pointers; error handling
Modules large set of available modules (e.g. Numpy, Biopython)
Disadvantages
Speed executed by an interpreter instead of compiled, but “Speed
isn’t a problem until it’s a problem.”
“Too Easy” can be difficult to become comfortable in other programming
languages
How does a computer work?
Hardware components
Basic hardware components
CPU central processing unit, where the computation is executed
Memory Random-Access Memory, where instructions, results and data
are stored during computation. Non-permanent
Disk permanent memory, which stores programs, data files, ...
keyboard standard input to communicate with the computer
display standard output to communicate with the user
Simple CPU-Memory system
[Figure: a CPU connected to memory. The CPU contains an instruction counter (currently 1000), a cache, the ALU and registers (R1 = 23, R2 = 8, R3 = 31, ..., Ri). Memory holds the instructions from the program (1000: LOAD a, R1; 1001: LOAD b, R2; 1002: ADD R1, R2, R3; 1003: STORE R3, C; ...) and the data related to the program (a = 23, b = 8, c = 31, ...).]
Simple CPU-Memory system
Famous von Neumann architecture
- Instructions and data are both stored in memory
- CPU: registers and instruction counter
- The CPU can do simple actions: load from or store to a given memory location (slow)
- It also does arithmetic or logical operations on registers (ALU stands for arithmetic logic unit) and compares values in registers (fast)
- It can modify the instruction counter (loops, branches)
- Input and output operations to disk or user are very slow
- Computing performance depends on appropriate data structures and assignment instructions
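The fetch-and-execute cycle described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the four-instruction set, the dictionary used as memory and the names memory, program, registers and counter are illustrative and do not correspond to any real processor.

```python
# Toy model of the CPU-memory system: a hypothetical, simplified instruction set.
memory = {"a": 23, "b": 8, "c": None}   # data related to the program
program = [                              # instructions from the program
    ("LOAD", "a", "R1"),
    ("LOAD", "b", "R2"),
    ("ADD", "R1", "R2", "R3"),
    ("STORE", "R3", "c"),
]
registers = {}
counter = 0                              # instruction counter

while counter < len(program):
    op, *args = program[counter]
    if op == "LOAD":                     # memory -> register
        registers[args[1]] = memory[args[0]]
    elif op == "ADD":                    # the ALU works on registers only
        registers[args[2]] = registers[args[0]] + registers[args[1]]
    elif op == "STORE":                  # register -> memory
        memory[args[1]] = registers[args[0]]
    counter += 1                         # move to the next instruction

print(memory["c"])
```

A loop or a branch would simply set counter to another value instead of adding 1.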
Software component
What is a program?
I
Recipe, list of actions that describes a task to be executed by the
computer
I
A text file containing instructions or statements that the computer
understands and can execute
I
Statements (or instructions) are executed one after the other, starting
from the top of the program (like a cooking recipe)
I
Instructions modify the values stored in memory in order to achieve the
goal of the program
I
A program implements an algorithm (e.g. sorting numbers)
Course setup
System and tools
I
your laptop
I
Python 3 with Biopython (if we have time)
I
text editor of your choice (or an IDE if you want)
I
terminal
I
internet for tutorials, examples, etc. . .
Textbook used
Tim J. Stevens and Wayne Boucher. 2015. Python
Programming for Biology, Bioinformatics and Beyond.
Cambridge University Press.
Getting started with Python
The Hello World! program
Our first program
I
start a text editor
I
type in the following lines (Note: no leading space before the print):
print("Hello World!")
I
save the file with your preferred name and suffix .py
I
for example: myfile.py
Run it from the command line
> python myfile.py
You get on the screen:
Hello World!
A more advanced program
a=12 # variable a is assigned the value 12
b= 7 # variable b is assigned the value 7
c = a +b
print("The sum of", a, "and",
b, "is", c)
#or
print("The sum of %d and %d is %d" % (a, b, c))
Syntax
Important details
I
the hash mark indicates comments and is ignored by Python
I
the end of a statement is indicated by a newline
I
you can have blank lines between each statement
I
you can add spaces
- you can continue a statement over several lines using
  - a backslash
  - brackets or parentheses
- but be careful with indentation. This has a meaning in Python, except for line continuation
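A minimal sketch of the two continuation styles, reusing the values 12 and 7 from the earlier program. The variable names total and message are chosen only for this illustration.

```python
a = 12
b = 7

# Continuation with a backslash at the end of the line
total = a + \
        b

# Continuation inside parentheses: no backslash needed,
# and the indentation of the second line is free
message = ("The sum of %d and %d "
           "is %d" % (a, b, total))
print(message)
```

The parenthesised form is usually easier to maintain, because a stray space after a backslash is a syntax error.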
Variables
Assignment instruction ’=’
I
the object on the right-hand side (here a number) is associated to a
variable, whose name is specified on the left-hand side
I
the variable is created when first assigned. Then it can be used.
I
when used in an expression, the variable is replaced by the value it refers
to.
I
the name of the variable should not be a reserved keyword (e.g. if, for), and
it is best to avoid built-in names such as print. It can contain lowercase and
uppercase letters, digits and underscores, but it cannot start with a digit.
Modules, scripts and programs
Different levels
I
modules and scripts both denote Python programs.
I
but scripts refer to top level programs, the ones we run explicitly
(e.g. the “main” program above).
I
top level scripts do not need the .py suffix
I
modules are Python programs imported by a script or other modules.
I
they need the .py suffix.
Importing modules
> python
>>> import myfile
Hello World!
The sum of 12 and 7 is 19
>>>
Importing modules
If we import again in the same session, nothing happens
>>> import myfile
>>>
also, the variable a is not known but myfile.a is known
>>> a
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name ’a’ is not defined
>>> myfile.a
12
>>>
Importing modules
Editing myfile.py
Let's say we now set a = 4 in the file; we can force a reload
>>> import importlib
>>> importlib.reload(myfile)
Hello World!
The sum of 4 and 7 is 11
<module ’myfile’ from ’/home/salamin/.../myfile.py’>
>>>
Importing parts of a module
>>> from myfile import a
>>> a
4
>>>
Now the variable a is known!
A more useful example
Scope of a variable
Try the following code:
import math
b = math.sin(math.pi / 2)
print(b)
or the next one:
from math import pi, sin
b = sin(pi / 2)
print(b)
Note: dir(math) lists the functions available in the module
Importance of modules
Advantages of modules
- modules are a central concept in Python programs
- they allow us to reuse code easily
- they provide a simple way to avoid name clashes
- each module creates a namespace, which is unique
Structure of Python programs
I
a Python program is composed of modules
I
modules contain statements: print(a)
I
statements contain expressions: a-b+2
I
expressions create and/or process objects: a
Data types and objects
Built-in data type (or object)
I
numbers (integer, floating point)
I
strings
I
booleans
I
lists
I
dictionaries
I
tuples
I
sets
The next parts will focus on their syntax and how to work with them.
In Python, variables are names that refer to the location of objects in
memory. Objects (data) have a type, not the variables.
Numbers and strings
Representing numbers
Numbers
- are declared in the usual way, e.g.
  - 2
  - -12.5
  - 1.34e-5 or -4.9E12
- can be integer, floating-point or Boolean
- can be complex numbers: 3+4j, -2+1j (always put a coefficient before j)
- can have binary, octal and hexadecimal representations. For example, 15 can be represented as
  - 0b1111
  - 0o17
  - 0xf
Manipulating numbers
Basic operators
math notation   operator*   example
x + y           add         3 + 6 → 9
x - y           sub         3 - 6 → -3
x * y           mul         3 * 6 → 18
x / y           truediv     3 / 6 → 0.5 (! different in Python 2 or 3)
x // y          floordiv    6. // 4. → 1.0; 6 // 4 → 1
x ** y          pow         3 ** 4 → 81; 16 ** 0.5 → 4.0
x % y           mod         6 % 4 → 2
x << y          lshift      1 << 2 → 4
x >> y          rshift      100 >> 2 → 25
x & y           and_        10 & 6 → 2
x ^ y           xor         10 ^ 6 → 12
~a              invert      ~10 → -11
a | b           or_         10 | 6 → 14
* use import operator
Mixing types
Dealing with numbers
- integers have arbitrary precision (they can represent arbitrarily big numbers, but will take more space in memory). On a 64-bit system, sys.maxsize is 2^63 − 1, but this is the largest container size, not a limit on integers.
- floating point numbers (i.e. ±p × 2^e) have limited precision (64 bits: 1 bit for the sign, 52 bits stored for p, which gives 53 bits of precision, and 11 bits for e). Try 1e-322 * 0.1 or 1e-323 * 0.1. Why?
- Python determines the type (int or float) by the way the number is typed (literal)
- it is possible to convert one to the other: int(3.8) or float(3)
- but Python also does it automatically: in an operation, the numbers are converted to the “highest” type (int is simpler than float)
- 3.4 + 7 gives the float 10.4
- the function type(a) returns the current type of the object that the variable a refers to. It can change during the execution of the program
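The points above can be checked directly in a short script. This is a sketch assuming the usual CPython build with IEEE 754 double-precision floats; the names big, tiny and zero are illustrative.

```python
import sys

big = 2 ** 100                 # far beyond 2**63 - 1, still an exact int
print(big > sys.maxsize)

print(int(3.8))                # conversion truncates toward zero
print(type(3.4 + 7))           # the int 7 is promoted to float

tiny = 1e-322 * 0.1            # still representable as a subnormal float
zero = 1e-323 * 0.1            # too small for 64 bits: underflows to 0.0
print(tiny > 0, zero == 0.0)
```

The last two lines answer the "Why?" question: below roughly 5e-324 there is no 64-bit float left other than 0.0.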
Expressions
Numbers, variables and operators
- an expression is a combination of numbers, variables and operators
- a numerical expression can be evaluated to a number
- precedence rules apply when several operators appear in the same expression (from highest to lowest: **; ~; *, /, %, //; +, -; <<, >>; &; ^; |; comparisons <, <=, >, >=, ==, !=, is, is not; not; and; or)
- use parentheses to force a different order
x=7+3*2 #x=13, not 20...
Incrementations
I
a=a+1
I
a+=1
I
a*=2 same as a=a*2
Strings
High level data type
I
ordered collection of characters for text-based information
I
a character is usually represented by one byte (i.e. 8 bits or 1 octet:
2^8 = 256 possible characters). Python 3 strings are Unicode, so a character may use more than one byte
I
can contain any text, of any length
I
large set of processing tools
Rules to express strings
I
text surrounded by single quotes
species = ’Amphiprion clarkii "Indonesia"’
I
or double quotes
I
both forms can be mixed
position = "5’ UTR"
title = "Amphiprion clarkii " ’"Indonesia"’
print(title)
I
Python automatically concatenates adjacent string literals
Escape sequence
Characters with special meanings
I
use a backslash to escape the character
species = "Amphiprion clarkii \"Indonesia\""
position = ’5\’ UTR’
Some examples
\\      backslash (\)
\t      horizontal tab
\v      vertical tab
\a      bell
\b      backspace
\r      carriage return
\n      new line
\f      formfeed
\0      null byte
\xhh    byte with hexadecimal value hh (e.g. \x66 → f)
String operations
Adding and multiplying strings
I
use len(s) to know how many characters are in string s
I
+ and * can be used with strings: these operators are overloaded
a="abc" + ’def’
print(a)
a=’abc’*4
print(a)
len(a)
I
you cannot mix strings and numbers, but you can change the type of an
object
a='123' + 4 #error! a str and an int cannot be added
a='123' + str(4)
b=int('123') + 4
print(a, b)
Indexing
Position of a character
I
if dna=’ACGT’, then A is at position 0, C at position 1, G at position 2
and T at position 3
I
individual characters are accessed with string_name[position], where
the first position has index 0 (! different from R !)
dna='ACGT'
print(dna[0], dna[3], dna[-1], dna[-4]) #dna[4] would raise an IndexError
I
extract a substring by giving a start and end position
print(dna[2:4], dna[2:], dna[:2])
I
slice with stride
seq=’ACACGTACTTCCTAG’
print(seq[0::2], seq[2::3], seq[2:7:2])
print(seq[::-1])
Strings features
Strings are immutable
I
a string (object in memory) cannot be modified, but we can use it to
build a different string . . . and associate the old name to it
seq='ACAGACCTAGGACCT'
seq[0]='T' #error! strings are immutable
seq=seq+seq[0] #builds a new string and rebinds the name seq
Some string methods (see Python doc for full list)
’ACC’ in seq
’TTT’ not in seq
print(seq.lower(), seq.replace(’A’, ’T’))
print(seq.find(’CT’), seq.index(’CT’))
print(seq.rfind(’CT’), seq.rindex(’CT’))
print(seq.strip(’A’), seq.lstrip(’A’), seq.rstrip(’G’))
print(seq.split(’A’), seq.count(’A’))
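The methods find() and index() appear side by side above because they differ only in how they report a missing motif. A short sketch, reusing the sequence from the previous example:

```python
seq = 'ACAGACCTAGGACCT'

print(seq.find('CT'))      # position of the first 'CT'
print(seq.rfind('CT'))     # position of the last 'CT'
print(seq.find('TTT'))     # find() returns -1 when the motif is absent

try:
    seq.index('TTT')       # index() raises an exception instead
except ValueError:
    print("'TTT' not found")
```

Use find() when a missing motif is an expected outcome and index() when it should be treated as an error.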
Exercises
You have a DNA sequence in fasta format
fasta=""">gi|506953611|gb|KC684925.1| Amphiprion clarkii rhodopsin (RH) mRNA, partial cds
AGTCCTTATGAGTACCCTCAGTACTACCTTGTCAACCCAGCCGCTTATGCTGCTCTGGGTGCCTACATGT
TCTTCCTCATCCTTGCTGGCTTCCCAGTCAACTTCCTCACCCTCTACGTCACCCTCGAACACAAGAAGCT
GCGAACCCCTCTAAACTACATCCTGCTGAACCTCGCGGTGGCTAACCTCTTCATGGTGCTTGGAGGATTC
ACCACAACGATGTACACCTCTATGCACGGCTACTTCGTCCTTGGACGCCTCGGCTGCAATCTGGAAGGAT
TCTTTGCTACCCTCGGTGGTGAGATTGCCCTCTGGTCACTGGTTGTTCTGGCTATTGAAAGGTGGGTCGT
TGTCTGCAAGCCCATCAGCAACTTCCGCTTCGGGGAGAATCACGCTATTATGGGTTTGGCCTTCACCTGG
ACAATGGCCAGTGCCTGCGCTGTTCCTCCTCTTGTCGGCTGGTCTCGTTACATCCCTGAGGGCATGCAGT
GCTCATGTGGAGTTGACTACTACACACGTGCAGAGGGTTTCAACAATGAGAGCTTTGTCGTCTCCTCTTG
TCGGCTGGTCTCGTTACATCCCTGAGG"""
Q1 What is the length of the DNA sequence?
Q2 What is the frequency of the four nucleotides?
Q3 Extract the 1st, 2nd and 3rd codon positions of the DNA
sequence.
Q4 Extract the genbank number for this sequence.
Q5 Print the sequence as an mRNA.
Q6 How many amino acids are represented by this sequence?
Collection data types
Collections of objects
Tuples, lists, sets and dictionaries
I
Python objects that contain a collection of items or elements
I
lists and dictionaries are very powerful data-structures that can grow and
shrink at will during the execution of a program. You will use them a lot
I
their elements can be themselves lists or dictionaries or tuples, offering
an arbitrarily rich information structure to represent complex data
Tuples
Fixed, ordered and immutable object
I
a tuple is defined using parentheses and each element is separated by a
comma
x=()
x=("ATG",)
x=(32, 223, 423, 2321)
I
comma at the end of a tuple with one item is needed to avoid ambiguity
with expression (e.g. (2+3) vs (2+3,))
I
you can mix numbers, strings or whatever, including variables
seq="ACGGAT"
nucleotides=(’A’,’C’,’G’,’T’)
x=(2, 2, seq, False, nucleotides)
I
multi-dimensional objects
matrix=((0,1,2),(1,0,3),(2,3,0))
matrix[0][1]
Lists
Ordered but modifiable
I
a list is defined using square brackets. Elements are separated by a
comma
x=[]
x=["ATG"]
x=[32, 223, 423, 2321]
I
you can again mix numbers, strings or whatever, including variables
seq="ACGGAT"
nucleotides=(’A’,’C’,’G’,’T’)
x=[2, 2, seq, False, nucleotides]
I
you can convert a tuple to a list and vice versa
nucleotides=(’A’,’C’,’G’,’T’)
x=list(nucleotides)
w=tuple(x)
I
the + and * operators are defined for lists
x+x #new list with 8 elements
x*3 #repeat the original list 3 times
Manipulating lists and tuples
Fetching values
I
Use the square brackets to access elements stored in a variable. In this
case, [] have a different meaning than during list creation
nucTuple=(’A’,’C’,’G’,’T’)
nucList=[’A’,’C’,’G’,’T’]
nucList[0] #works the same on nucTuple
nucList[-1] #last element, works the same on nucTuple
nucList[3]='U' #change nucList
nucTuple[3]='U' #error! tuples are immutable
I
if a list/tuple has n elements, the minimum value of the index is −n and
the maximum value is n − 1
I
you can also take slices
nucList[1:3] #pos 1 and 2
nucList[1:] #everything except pos 0
nucList[:3] #everything up to pos 2 included
nucList[1:-1]
Manipulating lists and tuples
Membership, counting, finding elements
I
membership. Not very efficient computationally. Use sets (see later) to
do that
’A’ in nucList #True, same with nucTuple
’U’ in nucList #False, same with nucTuple
’U’ not in nucList #True
I
methods available for lists and tuples
seq=[’A’,’T’,’G’,’A’,’A’,’C’,’T’,’T’,’G’]
seq.count(’A’) #how many A’s do we have
seq.index(’T’) #first occurrence of T, not efficient
len(seq)
I
this is it for tuples as they are not modifiable. Lists have however more
to offer
Modifying lists
Extending a list, removing elements
I
you can extend a list. It will modify the variable itself
seq.append(’G’)
seq.extend([’G’,’A’,’T’])
seq.append([’G’,’A’,’T’]) #different?
I
you can insert and remove an element
seq.insert(1, ’A’) #insert A in position 1
seq.remove('A') #remove the 1st occurrence of A
del seq[4]
seq2=seq.pop(2) #remove element in pos 2 and return it
I
change a range of items at once. If the number of elements is larger
than the range, it increases the list
seq[1:3]=[’A’,’A’,’A’]
Modifying lists
Sorting and reversing
I
you can reverse the order of the items in a list. It will modify directly the
list
seq.reverse()
print(seq)
print(seq.reverse()) #prints None: reverse() modifies the list in place
I
you can sort the items of the list. It will modify directly the list, but it
throws an error if the elements are not comparable
seq=[’A’,’A’,’T’,’G’,’A’,’A’,’C’]
seq.sort() #like .reverse(), modify the list itself
print(seq)
seq.append(1)
seq.sort() #error!
I
if you want to sort the list without modifying the list itself:
seq=[’A’,’A’,’T’,’G’,’A’,’A’,’C’]
y=sorted(seq)
z=seq[::-1] #reversed(seq) returns an iterator, not a list
print(seq)
Reference versus copy
We have to be careful with lists
L1=[1,2,3]
L2=L1
L3=L1[:]
print(L2)
print(L3)
L1[0]=7
print(L2)
print(L3)
L1 and L2 are two variables pointing to the same data in memory. L3 is a
new copy of L1 and its data is in a different part of the memory.
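The difference between a second name and a copy can be made visible with the is operator, which tests whether two names refer to the same object. A short sketch with the same lists as above:

```python
L1 = [1, 2, 3]
L2 = L1          # second name for the same list object
L3 = L1[:]       # slice: new list with the same elements

print(L2 is L1)  # same object in memory
print(L3 is L1)  # a different object...
print(L3 == L1)  # ...holding equal values

L1[0] = 7
print(L2, L3)    # only L2 sees the change
```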
Sets
Unordered and modifiable
I
a set is defined using curly braces. Elements are separated by a comma
x={32, 223, 423, 2321}
x=set([32, 223, 423, 2321])
x={32, 223, 423, 223}
I
you can only mix numbers, strings and tuples in a set (elements need to
be hashable)
seq="ACGGAT"
x={len(seq), seq} #see how it reordered the set
nucleotides=[’A’,’C’,’G’,’T’]
x={4, nucleotides} #error
x={4, tuple(nucleotides)}
I
you can convert sets into tuples or lists and vice versa. However, the
content might change: duplicates are removed and the order is not kept
codons=[’ATG’,’CGT’,’GGT’,’TTC’, ’CGT’]
x=tuple(codons)
w=set(codons)
z=list(w)
Manipulating sets
Unordered, so no indexing possible
I
it doesn’t mean anything to access sets by index. Tests of the presence
of an item are however efficient because sets are like dictionaries
aa={'Ala','Thr','Leu','Pro','Ser'}
'Pro' in aa
'Ile' in aa
'Met' not in aa
I
you can add, remove and get the length of a set
len(aa)
aa.add('Ile') #modify aa directly
aa.add('Ala') #no effect as Ala already in
aa.remove('Ile') #decrease aa by one
aa.pop() #remove and return an arbitrary item and decrease aa
I
several functions can work with multiple sets
s={1,2,3,4,5}
t={4,5,6,7}
a=s&t #intersection
b=s|t #union
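Besides intersection and union, sets support a few more operators and methods. A brief sketch with the same two sets:

```python
s = {1, 2, 3, 4, 5}
t = {4, 5, 6, 7}

print(s - t)            # difference: in s but not in t
print(s ^ t)            # symmetric difference: in exactly one of the two
print({4, 5} <= s)      # subset test
print(s.isdisjoint(t))  # False, because 4 and 5 are shared
```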
Dictionaries
Modifiable key:value pairs (insertion order is kept since Python 3.7)
I
a dictionary is defined using curly braces and the key is separated from its
value by a colon. Elements are separated by a comma
x={}
x={"CCC":"Pro"}
x={"CCC":"Pro", "GCA":"Ala"}
I
the key must be hashable, whereas the value can be of any type
I
keys and values can be of different types
x={(’CCA’,’CCC’,’CCG’,’CCT’):"Pro"}
x={[’CCA’,’CCC’,’CCG’,’CCT’]:"Pro"} #error
x={"Amphiprion clarkii":fasta,
(’CCA’,’CCC’,’CCG’,’CCT’):"Pro"}
Manipulating dictionaries
Access by key, so no positional indexing
I
use again square brackets to access an element of a dictionary
aa={"CCC":"Pro", "GCA":"Ala", "AGA":"Arg"}
aa[’CCC’]
aa[’ACA’] #error because not in aa
aa.get(’ACA’) #alternative without error
aa.get('ACA', 'unknown') #return 'unknown' if not in aa
aa.get('CCC', 'unknown') #return the value for 'CCC'
I
note that the dictionary is not changed if a key is not present. Use
setdefault for this
I
membership, modifying dictionaries and listing items
’CCC’ in aa
len(aa) #nb of key:value pairs
aa[’CCC’]=’Proline’ #change the value of the key ’CCC’
aa[’ACA’]=’Thr’ #add a new pair
del aa[’CCC’]
print(aa.keys(), aa.values())
print(aa.items())
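The remark about setdefault can be illustrated by comparing it with get(). A minimal sketch using the same dictionary:

```python
aa = {"CCC": "Pro", "GCA": "Ala", "AGA": "Arg"}

print(aa.get('ACA', 'unknown'))         # 'unknown', and aa is unchanged
print('ACA' in aa)

print(aa.setdefault('ACA', 'unknown'))  # 'unknown', but the key is now stored
print('ACA' in aa)
print(aa.setdefault('CCC', 'unknown'))  # existing keys keep their value
```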
References versus copy 2
Some caution needed
I
we have to be careful when a list or dictionary is defined through
variables referencing mutable objects
L=[1,2] #L points to the list [1,2]
D={’a’:L} #reference to list [1,2] also pointed to by L
print(D)
L[0]=3 #we change L *in-place*, ie the list [1,2]
print(D)
D[’a’][1]=17 #we modify the list pointed to by L
print(L)
I
apparently the same but very different behaviour!!!
D={’a’:L} #reference to list [1,2] also pointed to by L
print(D)
L=[3,4] #we assign L to a *new list*
#list [1,2] is still in memory
#and pointed to by D
print(D)
I
a better way
D={’a’:L[:]} #L[:] returns a copy of L
print(D)
L[1]=23 #we change L *in-place*
print(D) #D didn’t change, it points to the copy of [1,2]
Moral of the story
Be careful!
I
the copy is done for top level object and if L contains a reference to
another list, the reference is copied, not the list that is referenced
I
to copy a list or a dictionary, use the method copy() because = makes
the new variable point to the same location in memory
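When a list contains other mutable objects, a top-level copy is not enough. The standard library module copy provides deepcopy() for that case. A short sketch; the names inner, shallow and deep are illustrative.

```python
import copy

inner = [1, 2]
L = [inner, 'ACGT']

shallow = L.copy()          # same as L[:]: copies the top level only
deep = copy.deepcopy(L)     # also copies the nested list

inner[0] = 99
print(shallow[0])           # still shares the nested list
print(deep[0])              # fully independent
```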
Exercises
You have a multiple sequence alignment in fasta
seq1=">Amphiprion clarkii\nAGTTGACCTAGTCATAGA"
seq2=">Amphiprion frenatus\nAGCTGACCTAGTTTTAGA"
seq3=">Amphiprion ocellaris\nAGTTGACCTGGGCATCGA"
seq4=">Pomacentrus mollucensis\nAGTCTACCTGATCCGGA"
Q1 Create a dictionary to store the sequences.
Q2 Create a new dictionary with the same set of species but
replace the sequence of each species by the following:
seq1b="ATAATATTCGATTGATCAGT"
seq2b="ATAATACTCGATTTATCAGT"
seq3b="ATAATACTCGATCGATCCGT"
seq4b="ATAATAGGCGATCGACTAGT"
Q3 Merge the second sequence to the initial sequence of each
species
Q4 Calculate the GC content in each sequence
Q5 Order the species according to their value of GC content
Program control and logic
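Program execution
[Figure: flow of execution in three cases. Normal: numbered statements run one after the other. Loop: a block of statements (a, b) is repeated, with the labels Next and End. Conditional: a test (?) with True and False branches selects which block is executed.]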
If statements
Conditional execution
Some block of code should only be executed if a test is true. In Python
if <condition1>:
    <statements>
elif <condition2>:
    <statements>
elif <condition3>:
    <statements>
...
else:
    <statements>
Remarks
I
<condition> = any expression evaluated to True or False
I
<statements> = any number of instructions
I
use indentation of <statements> rather than {}
I
indentation is part of the syntax
I
note the : as a delimiter of the tests
I
elif and else are optional
Example
Testing user input
x=input("Enter a nucleotide:")
if x == 'A':
    print("The nucleotide is an adenine (purine)")
elif x == 'C':
    print("The nucleotide is a cytosine (pyrimidine)")
elif x == 'G':
    print("The nucleotide is a guanine (purine)")
elif x == 'T':
    print("The nucleotide is a thymine (pyrimidine)")
elif x == 'U':
    print("The nucleotide is a uracil (pyrimidine)")
else:
    print("%s is not a nucleotide" % x)
Remarks
- tests are considered one after the other until one is true
- if all if/elif tests are false, the else statements are executed
- what is the difference with having 5 independent if statements?
Remarks on If statements
Comparison operators
- >, <, <= or >= to compare numbers
- != for ≠ and == for equality
- is and is not to test whether two names refer to the same object
x=[123,54,92,87,33]
y=x[:] # y is a copy of x
y==x # True, they have the same values
y is x # False, they are different objects
Comparisons and truth
Expressions without an obvious query can have conditional truthfulness
I
value of an item can be forced to be considered as true or false
I
assumed False: None, False, 0, 0.0, "", (), {}, [], set(())
I
everything else is assumed True
Logic operations
Standard logical connectives (and, or, not)
resulting expressions are not necessarily Boolean. Try
z=0 and 3 #try: z=1 and 2, then z=5 and 0
if z:
    print("Z is equivalent to True")
else:
    print("Z is equivalent to False")
I
x and y gives back the value of x or of y, not a Boolean; that value is only
interpreted as True or False in a conditional statement
I
we can mix data types in the logical comparisons
I
also available: or and not
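The values returned by and, or and not can be inspected directly. A short sketch; the variables name and label are hypothetical and only show a common use of or.

```python
print(0 and 3)        # 0: the first falsy operand stops the evaluation
print(1 and 2)        # 2: both are truthy, the last operand is returned
print(5 and 0)        # 0
print(0 or 'ACGT')    # 'ACGT': or returns the first truthy operand
print(not 'ACGT')     # False: not always returns a Boolean

# A common use: fall back to a default when a value is empty
name = ''
label = name or 'unnamed sequence'
print(label)
```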
Advanced syntax
Shortcuts and parentheses
I
if statement in one go:
print(x and "Yes" or "No") #similar to
print("Yes" if x else "No")
I
in x or y, y is evaluated only if x is false. Similarly, in x and y, y is
evaluated only if x is true
x=[256,128]
if x and x[0]>10:
    pass #do something...
I
a!=b is the same as not a==b
I
precedence and parentheses:
x=y=0
z=1
x and (y or z)
(x and y) or z
try statement
Statements that might cause error
e.g. if we want to compute the square of a given number
x=input("Enter a number:")
try:
    num=int(x)
except ValueError: #the try clause produced an error
    print("It's not a number. I stop")
    quit()
else: #the try clause did not produce an error
    print(num**2)
I
try is a way to handle error: it catches an exception raised by an error
and recovers gracefully
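The same pattern is often wrapped in a small function so that the caller decides what to do with bad input. This is a sketch: parse_length is a hypothetical helper written for this illustration, not part of the course material.

```python
def parse_length(text):
    """Return text as an int, or None if it is not a whole number."""
    try:
        value = int(text)
    except ValueError:          # raised by int() for strings like 'abc'
        return None
    else:                       # runs only if no exception was raised
        return value

print(parse_length('150'))
print(parse_length('15O'))      # letter O instead of zero
```

Catching ValueError only, rather than every exception, keeps unrelated errors visible.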
Loops
The for loop
natural to consider (i.e. iterate over) every element in a collection
nucleotides=['A','C','G','T']
for nuc in nucleotides:
    print(nuc)
I
variable nuc is set in turn to each element of the list and the
block of code is executed every time. It doesn’t have to be defined
before-hand
I
works the same for tuples and sets
I
for dictionaries, nuc would be assigned the keys of the dictionary
I
looping over sets will happen in arbitrary order; dictionaries are
traversed in insertion order since Python 3.7
I
note the : at the end of the for statement
Loops, cont.
Positional indices
in most loops, there is no need to know the position of the associated value in the
list. However, you sometimes need the indices themselves
I
use the range() function
list(range(7)) # [0,1,2,3,4,5,6]
list(range(2,7)) # [2,3,4,5,6]
nucleotides=['A','C','G','T']
for i in range(len(nucleotides)):
    print(nucleotides[i])
I
add a third argument to define the step size
list(range(2,13,3)) # [2,5,8,11]
list(range(7,2,-2)) # [7,5,3]
I
access the value and the index at once: enumerate()
vec1=(4,7,9)
vec2=(1,3,5) #get the inner (or dot) product
s=0
for i, x in enumerate(vec1):
    s+=x*vec2[i]
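When two collections are traversed in parallel, the built-in zip() is an alternative to enumerate() that avoids the index altogether. A sketch of the same dot product:

```python
vec1 = (4, 7, 9)
vec2 = (1, 3, 5)

# zip() pairs the elements of both tuples, so no index is needed
s = 0
for x, y in zip(vec1, vec2):
    s += x * y
print(s)   # 4*1 + 7*3 + 9*5 = 70
```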
Loops, cont.
The while loop
same idea, but loop keeps going on while a condition is true
nucleotides=['A','C','G','T']
i=0
while i < len(nucleotides):
    print(nucleotides[i])
    i+=1
I
careful as you can have an infinite loop if the statement is never false
I
good if you don’t have an initial collection of elements to loop over
I
you need to define the variable i before-hand
I
you need to explicitly increment i within the loop
I
you can add else: after a while loop; its block runs when the condition
becomes false, but not when the loop is left with break
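A small sketch of a while loop with an else clause. The sequence and the search for the character N are hypothetical and serve only to show when the else block runs.

```python
seq = 'ACGTTAG'     # hypothetical sequence for the illustration
i = 0
while i < len(seq):
    if seq[i] == 'N':          # unknown base: stop scanning
        print('found N at position', i)
        break
    i += 1
else:
    # reached only when the condition became false, not after a break
    print('no N in the sequence')
```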
Skipping and breaking loops
continue statement
means that the rest of the code in the block is skipped for this particular
iteration
nucleotides=['A','C','G','T']
for nuc in nucleotides:
    if nuc == 'C':
        continue
    print(nuc)
break statement
immediately causes all looping to finish
nucleotides=['A','C','G','T']
for nuc in nucleotides:
    if nuc == 'C':
        break
    print(nuc)
List comprehension
Create a new collection
go through a collection to generate another, but different, collection.
E.g. squaring numbers
squares=[]
for x in range(1,8):
    squares.append(x*x)
more efficient alternative: use list comprehension
squares=[x*x for x in range(1,8)]
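A list comprehension can also filter elements with a trailing condition. Two short sketches; the names even_squares and g_positions are illustrative.

```python
# Keep only some elements by adding a condition at the end
even_squares = [x * x for x in range(1, 8) if x % 2 == 0]
print(even_squares)

# Combine with enumerate() to collect positions instead of values
seq = 'ACGGAT'
g_positions = [i for i, nuc in enumerate(seq) if nuc == 'G']
print(g_positions)
```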
Looping tips
Altering collection elements
I
bad idea to alter the number of elements in a loop
num=list(range(9))
for n in num:
    if n<5:
        num.remove(n)
print(num) #some n<5 still there...
I
duplicate the list to get the expected result
for n in list(num): #create a new list similar to num
    if n<5:
        num.remove(n)
print(num) #everything is ok
#or use list comprehension
num=[n for n in num if n >= 5]
#or use a second list explicitly
num2=[]
for n in num:
    if n>=5:
        num2.append(n)
num=num2
Exercises
Loops and conditional statements
Q1 write a code that will check that a specific amino acid is
present in a sequence (using index() is not allowed)
seq=’NFYIPMFNKTGVVRSPFEYPQYYLAGVVRSPFEY’
Q2 simulate a DNA sequence of length 150 by randomly drawing
nucleotides (tip: use the module random, with its function
random.choice(’ACGT’))
Q3 calculate how many amino-acids of seq2 are in seq
seq2=’ALILALSMGY’
Functions
Why use functions
We have seen many examples
I
e.g. len(), split(), . . .
I
we will learn how to define our own functions
I
they are useful to maximize code reuse: the same operation is coded
only once and can be used many times
I
they provide procedural decomposition: the problem is split into
simple pieces or subtasks
Defining a function
Use the def instruction
def function_name(arg1, arg2, ..., argN):
    statements # Body of the function
    # instructions involving arg1 to argN
I
the parts in parentheses are provided by you
I
Note the colon at the end of the header line and the indentation
I
the function is then called as function_name(. . .) with actual variables
or literals
I
a function usually returns an object but may also have a side effect
I
upon calling a function, execution flow is transferred to the function
I
when finished, control is returned to instruction following the call
Examples
Function with a side effect
def hello():
    print("Hello world!")
I
no return value, only a side effect
I
does not require an argument on calling
Function returning a value
def mult(x,y):
    return x*y
a=mult(2,3) # 6
mult(a,a) # 36
mult("bla ", 2) # ’bla bla ’
I
the return statement used to return the value
I
return exits the function
I
it can be placed anywhere in the function
def min(a, b): #careful: this name hides the built-in min()
    if a<b: return a
    else: return b
Polymorphism
Functions can work on any types of arguments
I
in the example above, the arguments were either numbers or strings
I
argument types are not defined in the function definition
I
arguments and return value have no type constraints
I
function will work as long as the operations done in the body of the
function are defined on the objects passed as arguments
I
this is called polymorphism
Remarks
Position and assignment
I
a function can be defined anywhere in the code, even in an if or a loop.
This is different from other languages where the functions are defined
separately and not part of the main program
I
a function can be nested in another one
I
a function should be defined before it is used
I
def creates a function object and assigns it to the given name
I
we can always reassign a function to a new name
def mult(x,y): return x*y
times = mult
print(times(2,3))
Modifying the argument value
Mutable objects
if passed as arguments, they can be modified in the body of a function
def func(X):
    X[0]=5 #works if object accepts the operation
L=[1,2]
func(L)
print(L)
Immutable objects
cannot be modified if passed as arguments
def swap(x):
    x = (x[1],x[0]) #only rebinds the local name x: not very useful
x = (2,3)
swap(x)
print(x) #still (2, 3)
I
immutable objects behave as if passed by value, since they cannot be changed
in place: use the return value to modify them (try with the swap function above as an exercise)
I
mutable objects behave as if passed by reference (in both cases, Python passes a reference to the object)
Scope of a variable
Where can we access a variable
I
variables defined within a function are only accessible inside it
I
their scope is limited to the body of the function
I
local variables disappear from memory when a function is exited
def func(x):
    a,b = 2,4 # a and b are local, private
    return a*x+b
print(func(1))
print(a) #error! a is not defined outside func
Local variables
I
we could have defined a in the main program
I
there is no way to know the names of all variables in all functions
a=89
def func(x):
    a,b = 2,4 # a and b are local, private
    print(a)
    return a*x+b
print(func(1))
print(a)
Local variables
What makes a variable local
- any assignment within a function makes the assigned variable local
- otherwise, a variable not assigned in a function is considered as global
- the global scope is limited to the file (module) it is defined in. This is why we need the import statement
x=3
def func(y):
    print(x)
    return x+y
print(func(2))
Priority between nested scopes
I
search sequence: local scope, then global one
I
third level: built-in scope (i.e. built-in names like str, int, open, ...)
I
possible to reuse them in local scope, but careful
def func():
    open=3
    open('file.txt') #won't work
Global statement
Declaring a variable as global
def func():
    global x
    x=2
x=99
func()
print(x)
I
this is not recommended (leads to bugs, no portability, . . .).
I
could be useful if function needs to remember its state after the previous
call. Object oriented programming is better
Function’s arguments
Number of arguments
I
function should be called with the right number of arguments given in
its definition
I
arguments are matched by position in the list between function
definition and function call
I
but Python is more flexible
def func(name, age, job):
    print("%s is %d years old and is a %s" % (name, age, job))
func('Joe', 32, 'teacher') #usual way
func(age=32, job='teacher', name='Joe') #using keyword arguments
def func(name, age, job='doctor'): #with default values
    print("%s is %d years old and is a %s" % (name, age, job))
func(age=45, name='Allan')
func('Allan', 45)
func('Allan', 45, 'lawyer')
Arbitrary numbers
Using a list of arguments: * construct
I
packing a collection of arguments into a tuple
def func(*args):
    print(args)
func()
func(1)
func(1,2)
I
args becomes a tuple with all the arguments passed to the function.
I
packing a collection into a dictionary
def func(item, *args, **kw):
    print('Mandatory argument:', item)
    print('Unnamed argument:', args)
    print('Keyword dictionary:', kw)
func('Hello', 1, 99, valueA="abc", valueB=7.0)
func('Hello', valueA="abc", valueB=7.0)
func('Hello')
I
kw becomes a dictionary with all the keyword arguments passed to the function
I
useful if arguments have to be passed to a nested function inside the
main one
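A minimal sketch of forwarding arguments to another function with * and **. The functions describe and shout are hypothetical names used only for this illustration.

```python
def describe(name, age, job='doctor'):
    return "%s is %d years old and is a %s" % (name, age, job)

def shout(*args, **kw):
    # Pass on whatever was received, unchanged, then post-process the result.
    return describe(*args, **kw).upper()

msg = shout('Joe', 32, job='teacher')
print(msg)
```

The wrapper does not need to know the parameters of the inner function, so it keeps working if describe later gains new optional arguments.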
Unpacking a collection
Using the * construct
* can be used when calling a function: it unpacks a sequence into a
collection of arguments
def func(*val):
    x=0
    for y in val: x+=y
    return x
func(1,2,3) # 6
a=(1,2,3)
func(*a) # 6
b=[2,3,4]
func(*b) # 9
Recursion
a function can be defined through itself. This is called recursion
def factorial(n):
    if n<=1: return 1 # base case: stops the recursion
    else: return n*factorial(n-1)
factorial(3) # 6
factorial(6) # 720
Function as argument
A function can also be passed as argument
def add(x, y): return x+y
def mult(x, y): return x*y
def combine(f, *args):
    a=args[0]
    for b in args[1:]:
        a=f(a,b)
    return a
t=(1,2,3,4)
v=combine(add, *t) # v is the sum: 10
v=combine(mult, *t) # v is the product: 24
The map function
very efficient way to apply a function on all elements of a list
def func(x,y): return x+y
L1=[1,2,3]
L2=[3,4,5]
L3=list(map(func, L1, L2)) # [4,6,8]; list() is needed in Python 3
Exercises
Functions
Q1 write a function to count the percentage of identity > 0.8 in a
sequence alignment given as a long string
""">seq1\nACTAATGCGTAGTACTGACTTACT\n
>seq2\nAGTAAGTCGTAGTACTGCCGTACT\n
>seq3\nACTAATGCTTAGTACTGACGGTTA\n
>seq4\nATCAATGCGCAGTACTGACTTACA\n
>seq5\nAGCAATGCGTAGTATTGCCAACCT"""
Q2 compare the running time of a for loop and of map to add the
values of two lists with 1 million items. Use the following to
calculate the running time
from time import time
...
start=time()
...
end=time()
print(end - start)
Q3 find the minimum value in a sequence of numbers using
I argument packing
I a tuple or a list
Input/Output with files
Files
Permanent storage
I
files are used to permanently store data or information on the disk of the
computer
I
files can be accessed from a Python program:
I
read an existing file
I
write to a file
I
create, delete, copy files
Opening a file
I
built-in function open(), which takes a filename as argument
I
it returns an object of type file whose methods can be used to act on
the file
I
not a usual object such as lists, strings or numbers. It is a link to a file
in the filesystem
outfile=open("tmp/rh1.fas", 'w')
infile=open("data.txt", 'r')
Opening a file, cont.
Remarks
I
infile and outfile are ordinary variable names: any valid name can be
used
I
filename is passed to open as a string
I
it may be an absolute name with full path or a relative name with
respect to the directory where the Python script is run
I
the access-mode argument ’w’ means write, ’r’ means read (default if
not present) and ’a’ means append
I
third argument possible to control the output buffering: passing a zero
means the output is transferred immediately to the file (which may be
slow); in Python 3 this is only allowed for files opened in binary mode
Closing a file
Terminating the connection to the file
I
done through the method close() available for the file object
infile.close()
I
this flushes the output buffer (if any) to the actual file
I
it releases the lock on the open file. The file can be used by other
applications
I
at the end of a Python program, open files are automatically closed
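Closing can also be left to a with block, which calls close() when the block is left, even after an error. The sketch below assumes Python 3 and writes to a temporary directory; the file name example.txt is hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'example.txt')  # hypothetical file

with open(path, 'w') as outfile:
    outfile.write('ACGT\n')
# Leaving the with block has closed the file and flushed the buffer.
was_closed = outfile.closed

with open(path) as infile:       # the mode defaults to 'r'
    content = infile.read()
print(was_closed, repr(content))
```

This form avoids forgetting the call to close(), which matters most when writing, since unflushed output may otherwise not reach the disk.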
Reading from a file
Example of a file
> cat myseq.txt
Number of sequences and length
4
256
>
reading these lines as strings
myfile=open('myseq.txt', 'r')
a=myfile.readline() # read 1st line; a is a string
print(a)
b=myfile.readline() # read 2nd line; b is a string
print(b)
c=myfile.readline() # read 3rd line; c is a string
print(c)
print(int(b) * int(c)) # total sequence length; conversion needed
myfile.close()
I
the file object keeps track of the current position in the file, which is
incremented after each read operation
I
prints double newline between items: one from the file, one added by
print
I
use rstrip() to remove the first one before printing
Parsing and other reading methods
Using the split() method
Reading a sequence alignment
2 15
sp1 ACATCATTGACCTAG
sp2 ACACGATCGATCTAG
seq={} # dictionary: species name -> sequence
myfile=open('myseq.phy', 'r')
line=myfile.readline() # '2 15\n'
nseq, nchar = line.split() # strings '2' and '15'; split() also drops '\n'
line=myfile.readline() # 'sp1 ACATCATTGACCTAG\n'
k, v = line.split()
seq[k]=v
...
Reading more than a line
line = myfile.read(n) # read the next n characters into a string
# the next call to readline will return the rest
# of the current line
lines = myfile.read() # read entire file into a single string
lines = myfile.readlines() # read entire file in a list of lines
myfile.seek(n) #change current file position to offset n for next read
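A short sketch of how these methods move the current position, assuming Python 3. The alignment file is recreated in a temporary directory so that the example is self-contained.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'myseq.phy')
with open(path, 'w') as f:
    f.write('2 15\nsp1 ACATCATTGACCTAG\nsp2 ACACGATCGATCTAG\n')

with open(path) as myfile:
    first = myfile.read(1)       # one character: '2'
    rest = myfile.readline()     # remainder of the first line: ' 15\n'
    myfile.seek(0)               # go back to the start of the file
    lines = myfile.readlines()   # all three lines, each ending in '\n'

print(repr(first), repr(rest), len(lines))
```

Every read starts where the previous one stopped, which is why seek(0) is needed before reading the whole file again.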
End of line . . .
UNIX/LINUX, MacOS X, Windows
The end of a line is not represented in the same way on all operating
systems. But readline() depends on this character to recognize the end of
line . . .
I
’\n’ = end of line for UNIX/LINUX and MacOS X
I
’\r\n’ = end of line for Windows
I
’\r’ = end of line for MacOS up to version 9
Download the files testUNIX.txt, testWin.txt and testMacOS.txt.
f=open('testUNIX.txt', 'r')
line=f.readline()
print(line) # repeat this on testWin.txt and testMacOS.txt
How do you make sure you can read files correctly?
f=open('testUNIX.txt', 'rU') # add 'U' for universal end of line (Python 2;
# this is the default in Python 3, where 'U' was removed in version 3.11)
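The effect of the three conventions can be tested without downloading anything. This sketch assumes Python 3, where text mode translates all three line endings to '\n' by default; the file names are made up for the example.

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()
results = {}
for label, eol in [('unix', b'\n'), ('win', b'\r\n'), ('oldmac', b'\r')]:
    path = os.path.join(tmpdir, label + '.txt')
    with open(path, 'wb') as f:      # binary mode: bytes are written as-is
        f.write(b'ACGT' + eol + b'TTGA' + eol)
    with open(path, 'r') as f:       # text mode with default newline handling
        results[label] = f.readline()

print(results)
```

Writing in binary mode is what guarantees that each test file really contains the intended end-of-line bytes, whatever the operating system running the script.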
Reading a file
Often need to read an entire file
I
how to know its length?
I
best to read line by line to save memory
I
big files cannot fit in memory
f=open('input.dat', 'r')
while True:
    line = f.readline()
    if line: #an empty string means the EOF: why?
        line=line.rstrip()
        ...
    else:
        break
Iterable file objects
I
we can loop through file objects
I
best and fastest way to read a file
f=open('input.dat', 'r')
for line in f:
    line=line.rstrip()
    ...
f.close()
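A sketch of the loop in use, counting non-empty lines and characters one line at a time. The file content is invented for the example and written to a temporary directory.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'input.dat')
with open(path, 'w') as f:
    f.write('ACGT\nTTGACA\n\nGG\n')   # the third line is blank

n_lines = 0
n_chars = 0
with open(path) as f:
    for line in f:               # only one line is held in memory at a time
        line = line.rstrip()
        if not line:             # a blank line was '\n' before rstrip()
            continue
        n_lines += 1
        n_chars += len(line)

print(n_lines, n_chars)
```

A blank line inside the file is read as '\n', not as an empty string, so it never ends the loop early.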
Writing to a file
Opening a file for writing
f=open('filename', 'w')
I
when access mode is 'w', the file is created if it does not exist already. It
is overwritten otherwise
I
other access modes are: 'a' for append: the written lines are added at the
end of an existing file (or the beginning of a new one)
I
if access mode is 'r+', the file is open both for reading and writing
Basic methods
f.write('some strings')
f.writelines(aList) # aList is a list of strings; no newlines are added
f.flush() # flushes the output buffer to actual file without closing
Writing/reading complex objects
Only strings are written/read from files
Need some methods to deal with complex objects
x, y, z = 10, 50, 100
s='ACCATGAT'
D={'CCC':'Pro', 'ACC':'Thr'}
L=['A','C','G','T']
f=open('datafile.txt', 'w')
f.write(s+'\n') # to have a newline
f.write('%s, %s, %s\n' % (x,y,z))
f.write(str(L) + '$' + str(D) + '\n') #need explicit string conversion
f.write(' '.join(L) + '\n') #or ''.join(L), or '_'.join(L), ...
for k, v in D.items():
    f.write('%s: %s\n' % (k, v))
f.close()
The % operator
Used to format a printout, syntax: print(string % tuple)
a='ACAATAT'
b=len(a) # 7
print('The length of the sequence %s is %d nucleotides' % (a,b))
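The % operator also accepts width and precision settings, which helps to align columns in an output file. A small sketch with invented values:

```python
name = 'opsin'
gc = 0.5

# %-8s  : string left-aligned in 8 characters
# %5.1f : float in 5 characters with 1 decimal
# %%    : a literal percent sign
# %04d  : integer padded with zeros to 4 digits
line = '%-8s GC=%5.1f%% length=%04d' % (name, gc * 100, 256)
print(line)
```

The number of values in the tuple must match the number of format codes in the string, otherwise Python raises a TypeError.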
More about files
Serialize/deserialize an object
Transform it to a string of bytes that can be written to a file and read back
import pickle
f=open('filename', 'wb') #should open with 'wb' to create a binary file
D={'CCC':'Pro', 'ACC':'Thr'}
pickle.dump(D,f) #write D to file object f
f.close()
f=open('filename', 'rb') #read it back in binary mode too
E=pickle.load(f)
f.close()
E
The file looks weird if you print it: it is saved in binary format.
Failing to open a file
Because of lack of permission, file system is full, nonexistent file opened in
read mode, . . .
I try-except construct can help recovering from a file error:
filename=input("Enter a file name: ") # raw_input() in Python 2
try:
    f=open(filename, 'r')
except IOError:
    print("Cannot open the file.")
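A slightly fuller sketch, assuming Python 3, where failures to open a file raise OSError (IOError is an alias) or one of its subclasses. The function first_line and the file name are hypothetical.

```python
import os
import tempfile

def first_line(filename):
    """Return the first line of filename, or None if it cannot be opened."""
    try:
        f = open(filename, 'r')
    except OSError as err:       # FileNotFoundError, PermissionError, ...
        print("Cannot open %s: %s" % (filename, err))
        return None
    try:
        return f.readline()
    finally:
        f.close()                # always release the file

missing = os.path.join(tempfile.mkdtemp(), 'missing.txt')
result = first_line(missing)
print(result)
```

Catching a specific exception class, rather than using a bare except, keeps unrelated errors such as a typo in a variable name from being silently hidden.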
Modules sys and os
Variable file names
File names are usually given by the user and will change. It is not ideal to
hardcode them in your script. Use the sys module and call your script as
python myscript.py data/inputFile.txt
import sys
pyScriptName = sys.argv[0] #name of the script
filename = sys.argv[1] #name of the file, ie data/inputFile.txt
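If the script is called without any argument, sys.argv[1] raises an IndexError. One possible guard is sketched below; the function get_filename is a hypothetical helper.

```python
import sys

def get_filename(argv):
    """Return the first command-line argument, or None when it is missing."""
    if len(argv) < 2:
        return None
    return argv[1]

filename = get_filename(sys.argv)
if filename is None:
    print("usage: python myscript.py <input file>")
```

Passing the list as a parameter, instead of reading sys.argv inside the function, makes the helper easy to try out with any list of strings.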
File operations
Use the module os and the sub-module os.path to deal with files
os module
chdir(path)        changes current dir
getcwd()           returns current dir
listdir(path)      lists dir content
rmdir(path)        removes directory
remove(path)       removes file
rename(src, dst)   moves from src to dst
os.path sub-module
exists(path)       does path exist?
isfile(path)       is path a file?
isdir(path)        is path a dir?
islink(path)       is path a symbolic link?
join(*paths)       joins paths together
dirname(path)      dir containing path
basename(path)     path minus dirname
split(path)        returns dirname and basename
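A sketch combining several of these functions on a temporary directory. The file and directory names are invented; os.mkdir, which creates a directory, is used in addition to the functions listed above.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'data', 'seq.fasta')   # portable path building
os.mkdir(os.path.dirname(path))                     # create the data directory
with open(path, 'w') as f:
    f.write('>sp1\nACGT\n')

exists = os.path.isfile(path)
dirname, basename = os.path.split(path)
listing = os.listdir(dirname)

renamed = os.path.join(dirname, 'renamed.fasta')
os.rename(path, renamed)     # move the file
os.remove(renamed)           # delete the file
os.rmdir(dirname)            # the directory must be empty to be removed
print(exists, basename, listing)
```

Using os.path.join instead of writing '/' or '\\' by hand keeps the script working on UNIX/LINUX, MacOS X and Windows.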
Exercises
Reading and writing files
Q1 Download the fasta file clownfish.fasta and create a
function that will read the file and store the sequences in a
dictionary (species names as keys) containing
I the sequence itself
I its total length
I the percentage of GC
Q2 write a function that will read as input a genbank file and
write as output the sequences in fasta format. A possible file
is available here. You can also download one directly from
GenBank
Object Oriented Programming
What is OOP?
Objects in programming
Object-oriented programming: a programming paradigm based on the
concept of “objects”, which are data structures containing data (or
attributes) and methods (or procedures).
Objects in python
We have already used objects.
n = 12 # n is an object of type integer
s = 'ACAGATC' # s is an object of type string
l = [12, 'A', 21121, 'ACCAT'] # l is an object of type list
These objects
I
contain data (the number 12, the string 'ACAGATC', . . .)
I
can be modified/manipulated
s.count('A')
s.lower()
l.append('121')
Extending data types
Python standard data types
For most simple programs, we can usually survive well with standard Python
data types. This includes
I
numbers, strings
I
tuples, lists, sets, dictionaries
Defining your own data types
It might be useful though to create your own data types: objects built to
your own specifications and organised in the way convenient to you.
This is done through object definitions known as classes.
Class vs object
Implementation vs instantiation
The class is the definition of a particular kind of object in terms of its
component features and how it is constructed or implemented in the code.
The object is a specific instance of the thing which has been made according
to the class definition. Everything that exists in Python is an object.
OOP in Python
Common principle of OOP is that a class definition
I
makes available certain functionality
I
hides internal information about how a specific class is implemented
This is called encapsulation and information hiding.
Python is however quite permissive and you can access any element of an
object if you know how to do that.
An example
A sequence object
We need to store data:
I
species name
I
sequence in DNA and amino-acid
I
protein name
I
length of the sequences
I
percentage of GC
I
...
We need to be able to manipulate the data using methods:
I
add/remove nucleotide or amino-acid (and update the other data
dependent on it)
I
translate DNA to amino-acid and inversely
I
print the sequence in various ways
I
calculate some characteristics
I
...
DNA sequence class
Class definition
class Sequence:
    pass # some statements
I
common practice to save each class in its own file (here Sequence.py), then
import it with from Sequence import Sequence
Inheritance
class DNAsequence(Sequence):
    pass # some statements
I
inheriting methods from a superclass
I
classes can have more than one superclass
Class functions
Providing object capabilities
I
functions are defined within the construction of a class
I
defined in the same way as ordinary functions (indented within the class
code block)
I
accessed from the variable representing the object via ’dot’ syntax
name = mySequence.getName()
I
getName() knows which Sequence object to use when fetching the
name
I
first argument is special: it is the object called from (self)
class Sequence:
    def getName(self):
        return self.name
    def getCapitalisedName(self):
        name = self.getName()
        if name:
            return name.capitalize()
        else:
            return name
Remarks on functions
Order of functions
I
order of functions does not matter
I
if a function definition appears more than once, the last one replaces
the previous one
class MultipleSeqAl(Sequence):
    def getMSA(self):
        pass # function implementation
    def getSequenceIdentity(self):
        pass # function implementation
Using subclasses
I
call specific functions as usual: msa.getMSA()
I
can also call msa.getName() from Sequence directly because of
inheritance
I
however, .getMSA() cannot be accessed from an object Sequence
Object attributes
Variables tied to the object
I
attributes hold information useful for the object and its functions
I
e.g. associate a variable storing sequence name in Sequence objects
Object attributes
I
specific to a particular object
I
defined inside class functions
I
use self to access them
Class attributes
I
available to all instances of a class
I
defined outside all function blocks
I
usually used for variables that do not change
I
accessed directly using the variable name
I
bare function names are also class attributes
Examples of class attributes
class Sequence:
    type = 'DNA' # class attribute
    def setSequenceLength(self, l):
        self.length = l
    def getSequenceLength(self):
        return self.length
myseq = Sequence()
print(myseq.type) # variable type can be accessed from the object
# note that we don't use the () as we access a variable
print(Sequence.type) # accessed through the class itself
length = myseq.length # AttributeError: length not yet set
myseq.setSequenceLength(541)
length = myseq.length # returns 541 this time
length = myseq.getSequenceLength()
getSeqLenFunc = Sequence.getSequenceLength
getSeqLenFunc(myseq) # same as myseq.getSequenceLength()
myseq.l = 541 # create new attributes on the fly...
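One consequence worth testing: assigning through an object creates an attribute on that object only, whereas assigning through the class changes the value shared by all the other objects. A minimal sketch, reusing the class name Sequence with invented values:

```python
class Sequence:
    type = 'DNA'             # class attribute, shared by all objects

a = Sequence()
b = Sequence()

a.type = 'AA'                # creates an attribute on the object a only
Sequence.type = 'RNA'        # changes the class attribute itself

print(a.type, b.type, Sequence.type)
```

The object a keeps its own value because the lookup finds the object attribute first and only then falls back to the class.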
Object life cycle
Birth, life and death
I
creation of object handled in a special function called constructor
I
removal is handled by a function called destructor
I
Python has automatic garbage collection, usually no need to define a
destructor
Class constructor
I
called whenever the corresponding object is created
I
use a special name: __init__
I
first argument is the object itself (i.e. self)
I
any other arguments you need to create the object
I
good idea to introduce a key that uniquely identifies objects of a given
class
Example
class Sequence:
    def __init__(self, name, type='DNA'):
        self.type = type
        if not name: # try/except cannot test a value: use if and raise
            raise ValueError('Name must be set to something')
        self.name = name
myseq = Sequence('opsin')
myseq = Sequence('opsin', 'AA')
When to create attributes
I
attributes can be created in any class function (or directly on the object)
I
convention to create most of them in the constructor either directly or
through the call to a function
I
set it to None if it cannot be set at object creation
I
constructors are inherited by subclasses
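These conventions can be sketched as follows. The attribute residues and the function setResidues are hypothetical additions used only to illustrate the points above.

```python
class Sequence:
    def __init__(self, name, type='DNA'):
        if not name:
            raise ValueError('Name must be set to something')
        self.name = name
        self.type = type
        self.residues = None     # not known yet when the object is created

class DNAsequence(Sequence):
    # No __init__ defined here: the constructor of Sequence is inherited.
    def setResidues(self, residues):
        self.residues = residues.upper()

seq = DNAsequence('opsin')
before = seq.residues            # still None at this point
seq.setResidues('acgt')
print(seq.name, seq.type, before, seq.residues)
```

Creating every attribute in the constructor, even with the value None, means that the other functions can rely on the attribute existing.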
Exercises
Q1 Create a class to store all the elements of a GenBank record and store
the accessions found in the genbank file used last time (download it
here) in a list. Try to invent functions that could be useful to deal with
these accessions
Q2 Create a function to get the length of the sequence of each GenBank
record and calculate the mean, maximum and minimum length
Q3 Create a new class that will hold the set of GenBank accessions and
store useful information on them