Programming for Bioinformatics

Nicolas Salamin
[email protected]; tel: 4154; office: 3212
Iakov Davydov
[email protected]
MLS bioinformatics; Fall semester 2016

Course organisation

Goal of the course
Learn the fundamental aspects of programming to do biological research.

Lecture / exercise
- mix of lectures and exercises
- you need your computer all the time (no programming on paper...)
- practice will come with:
    - this course
    - Elements of bioinformatics
    - First step project

Exam
- oral exam (15 minutes) done in the winter session
- all kinds of documentation allowed (course slides, books, etc.)
- focus on programming logic and structure

Why programming for biologists?

Computers and biology
Computers are increasingly essential to study all aspects of biology:
- access and manage data
- do statistical analysis (R is a programming language)
- simulation and numerical modeling

Skills to learn
- write simple computer programs in Python
- automate data analysis
- apply these tools to address biological questions
- learn and understand programming concepts that will help with using other languages

Why Python?

Advantages
- Easy: syntax easy to learn, you can start programming right away
- Readability: very clear syntax (executable "pseudo-code"), easy to understand program code
- High-level: looks more like a readable, human language than a low-level language; you program at a faster rate than with a low-level language
- Free: free, open-source and cross-platform
- Safe: no pointers; error handling
- Modules: large set of available modules (e.g. NumPy, Biopython)

Disadvantages
- Speed: executed by an interpreter instead of compiled, but "Speed isn't a problem until it's a problem."
- "Too easy": it can be difficult to become comfortable in other programming languages afterwards

How does a computer work?

Hardware components

Basic hardware components
- CPU: central processing unit, where the computation is executed
- Memory: Random-Access Memory, where instructions, results and data are stored during computation.
  Non-permanent.
- Disk: permanent memory, which stores programs, data files, ...
- Keyboard: standard input to communicate with the computer
- Display: standard output to communicate with the user

Simple CPU-Memory system
[Figure: a CPU (instruction counter at 1000, cache, ALU and registers R1 = 23, R2 = 8, R3 = 31) connected to memory. Memory holds the instructions from the program at addresses 1000 to 1003 (LOAD a, R1; LOAD b, R2; ADD R1, R2, R3; STORE R3, c) and the data related to the program (a = 23, b = 8, c = 31).]

Famous von Neumann architecture
- instructions and data are in memory
- CPU: registers and instruction counter
- the CPU can do simple actions: load from and store to a given memory location (slow)
- it also does arithmetic or logical operations (ALU stands for arithmetic logic unit) on registers and compares values in registers (fast)
- it modifies the instruction counter (loops, branches)
- input and output operations to disk or user (very slow)
- computing performance depends on appropriate data structures and assignment instructions

Software component

What is a program?
- a recipe: a list of actions that describes a task to be executed by the computer
- a text file containing instructions or statements that the computer understands and can execute
- statements (or instructions) are executed one after the other, starting from the top of the program (like a cooking recipe)
- instructions modify the values stored in memory in order to achieve the goal of the program
- a program implements an algorithm (e.g.
  sorting numbers)

Course setup

System and tools
- your laptop
- Python 3 with Biopython (if we have time)
- text editor of your choice (or an IDE if you want)
- terminal
- internet for tutorials, examples, etc.

Textbook used
Tim J. Stevens and Wayne Boucher. 2015. Python Programming for Biology, Bioinformatics and Beyond. Cambridge University Press.

Getting started with Python

The Hello World! program

Our first program
- start a text editor
- type in the following line (note: no leading space before the print):

    print("Hello World!")

- save the file with your preferred name and the suffix .py, for example myfile.py

Run it from the command line

    > python myfile.py

You get on the screen:

    Hello World!

A more advanced program

    a = 12      # variable a is assigned the value 12
    b = 7       # variable b is assigned the value 7
    c = a + b
    print("The sum of", a, "and", b, "is", c)
    # or
    print("The sum of %d and %d is %d" % (a, b, c))

Syntax

Important details
- the hash mark starts a comment, which is ignored by Python
- the end of a statement is indicated by a newline
- you can have blank lines between statements
- you can add spaces
- you can continue a statement over several lines using
    - a backslash
    - brackets or parentheses
- but be careful with indentation: it has a meaning in Python, except for line continuation

Variables

Assignment instruction '='
- the object on the right-hand side (here a number) is associated with a variable, whose name is specified on the left-hand side
- the variable is created when first assigned. Then it can be used.
- when used in an expression, the variable is replaced by the value it refers to
- the name of the variable should not be a reserved word (e.g. for) or hide a built-in name (e.g. print). It can contain lowercase and uppercase letters, digits and underscores, but it cannot start with a digit.

Modules, scripts and programs

Different levels
- modules and scripts both denote Python programs
- but scripts refer to top-level programs, the ones we run explicitly (e.g. the "main" program above).
- top-level scripts do not need the .py suffix
- modules are Python programs imported by a script or by other modules
- they need the .py suffix

Importing modules

    > python
    >>> import myfile
    Hello World!
    The sum of 12 and 7 is 19
    >>>

If we import again in the same session, nothing happens

    >>> import myfile
    >>>

Also, the variable a is not known, but myfile.a is known

    >>> a
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'a' is not defined
    >>> myfile.a
    12
    >>>

Editing myfile.py
Let us say we set a = 4 in myfile.py; we can force a reload (importlib.reload replaces the older imp.reload, which is no longer available in recent Python versions)

    >>> import importlib
    >>> importlib.reload(myfile)
    Hello World!
    The sum of 4 and 7 is 11
    <module 'myfile' from '/home/salamin/.../myfile.py'>
    >>>

Importing parts of a module

    >>> from myfile import a
    >>> a
    4
    >>>

Now the variable a is known!

A more useful example

Scope of a variable
Try the following code:

    import math
    b = math.sin(math.pi / 2)
    print(b)

or the next one:

    from math import pi, sin
    b = sin(pi / 2)
    print(b)

Note: dir(math) lists the functions available in the module

Importance of modules

Advantages of modules
- modules are a central concept in Python programs
- they allow us to reuse code easily
- they provide a simple way to avoid name clashes: each module creates its own namespace, which is unique

Structure of Python programs
- a Python program is composed of modules
- modules contain statements: print(a)
- statements contain expressions: a-b+2
- expressions create and/or process objects: a

Data types and objects

Built-in data types (or objects)
- numbers (integer, floating point)
- strings
- booleans
- lists
- dictionaries
- tuples
- sets

The next parts will focus on their syntax and how to work with them.
In Python, a variable is a memory location that contains the position of an object. Objects (data) have a type, variables do not.

Numbers and strings

Representing numbers

Numbers
- are declared in the usual way, e.g.
    - 2
    - -12.5
    - 1.34e-5 or -4.9E12
- can be integer, floating-point or Boolean
- can be complex numbers: 3+4j, -2+1j (always with a coefficient before j)
- can have binary, octal and hexadecimal representations. For example, 15 can be represented as
    - 0b1111
    - 0o17
    - 0xf

Manipulating numbers

Basic operators

    notation   operator*   example
    x + y      add         3 + 6 → 9
    x - y      sub         3 - 6 → -3
    x * y      mul         3 * 6 → 18
    x / y      truediv     3 / 6 → 0.5 (! different in Python 2 and 3)
    x // y     floordiv    6. // 4. → 1.0; 6 // 4 → 1
    x ** y     pow         3 ** 4 → 81; 16 ** 0.5 → 4.0
    x % y      mod         6 % 4 → 2
    x << y     lshift      1 << 2 → 4
    x >> y     rshift      100 >> 2 → 25
    x & y      and_        10 & 6 → 2
    x ^ y      xor         10 ^ 6 → 12
    ~x         invert      ~10 → -11
    x | y      or_         10 | 6 → 14

    * use import operator

Mixing types

Dealing with numbers
- integers have arbitrary precision (they can represent arbitrarily big numbers, but will take more space in memory). For reference, the largest 64-bit signed integer is 2^63 − 1 (the value of sys.maxsize on a 64-bit system); Python 3 integers can go beyond it.
- floating point numbers (i.e. ±p × 2^e) have limited precision (64 bits: 1 sign bit, 52 bits stored for p, giving 53 bits of precision, and 11 bits for e). Try 1e-322 * 0.1 or 1e-323 * 0.1. Why?
- Python determines the type (int or float) by the way the number is typed (literal)
- it is possible to convert from one to the other: int(3.8) or float(3)
- but Python does it automatically: in an operation, the numbers are converted to the "highest" type (int is simpler than float)
- 3.4 + 7 gives the float 10.4
- the function type(a) returns the current type of the variable a. It can change during the execution of the program

Expressions

Numbers, variables and operators
- an expression is a combination of numbers, variables and operators
- a numerical expression can be evaluated to a number
- a precedence rule applies when several operators appear in the same expression (order, from highest to lowest: **; ~; *, /, %, //; +, -; <<, >>; &; ^; |; comparisons <=, <, >, >=, ==, !=, is, is not; not; and; or)
- use parentheses to force a different order

    x=7+3*2 #x=13, not 20...
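
The operator table and the precedence rule above can be checked directly in the interpreter. The following is a minimal sketch; the variable names are chosen for this illustration only and do not come from the slides.

```python
import operator

# Each infix operator has a function equivalent in the operator module.
true_div = operator.truediv(3, 6)     # same as 3 / 6
floor_div = operator.floordiv(6, 4)   # same as 6 // 4
power = operator.pow(3, 4)            # same as 3 ** 4
xor = operator.xor(10, 6)             # same as 10 ^ 6

# Precedence: * binds tighter than +, and ** binds tighter than unary minus.
x = 7 + 3 * 2      # multiplication is done first
y = (7 + 3) * 2    # parentheses force the addition first
z = -2 ** 2        # parsed as -(2 ** 2)

# Mixing int and float converts the result to float.
mixed = 3.4 + 7

print(true_div, floor_div, power, xor)
print(x, y, z, type(mixed).__name__)
```

Writing the parentheses explicitly, as in y, is usually the safer choice when an expression mixes several operators.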
Incrementations
- a=a+1
- a+=1
- a*=2 is the same as a=a*2

Strings

High-level data type
- ordered collection of characters for text-based information
- a character is usually represented by one byte (i.e. 8 bits or 1 octet: 2^8 = 256 possible characters). Python 3 uses Unicode, so a character may need more than one byte
- can contain any text, of any length
- large set of processing tools

Rules to express strings
- text surrounded by single quotes

    species = 'Amphiprion clarkii "Indonesia"'

- or double quotes

    position = "5' UTR"

- both forms can be mixed

    title = "Amphiprion clarkii " '"Indonesia"'
    print(title)

- Python automatically concatenates adjacent string literals

Escape sequence

Characters with special meanings
- use a backslash to escape the character

    species = "Amphiprion clarkii \"Indonesia\""
    position = '5\' UTR'

Some examples

    \\      backslash (\)
    \t      horizontal tab
    \v      vertical tab
    \a      bell
    \b      backspace
    \r      carriage return
    \n      new line
    \f      formfeed
    \0      null byte
    \xhh    byte with hexadecimal value hh (e.g. \x66 → f)

String operations

Adding and multiplying strings
- use len(s) to know how many characters are in string s
- + and * can be used with strings: these operators are overloaded

    a="abc" + 'def'
    print(a)
    a='abc'*4
    print(a)
    len(a)

- you cannot mix strings and numbers, but you can change the type of an object

    a='123' + 4         #error!
    a='123' + str(4)
    b=int('123') + 4
    print(a, b)

Indexing

Position of a character
- if dna='ACGT', then A is at position 0, C at position 1, G at position 2 and T at position 3
- individual characters are accessed with string_name[position], where the first position has index 0 (! different from R !)

    dna='ACGT'
    print(dna[0], dna[3], dna[-1], dna[-4])   #dna[4] would raise an IndexError

- extract a substring by giving a start and end position

    print(dna[2:4], dna[2:], dna[:2])

- slice with stride

    seq='ACACGTACTTCCTAG'
    print(seq[0::2], seq[2::3], seq[2:7:2])
    print(seq[::-1])

Strings features

Strings are immutable
- a string (object in memory) cannot be modified, but we can use it to build a different string . . .
  and associate the old name with it

    seq='ACAGACCTAGGACCT'
    seq[0]='T'          #error! strings are immutable
    seq=seq+seq[0]      #builds a new string and reuses the name seq

Some string methods (see the Python documentation for the full list)

    'ACC' in seq
    'TTT' not in seq
    print(seq.lower(), seq.replace('A', 'T'))
    print(seq.find('CT'), seq.index('CT'))
    print(seq.rfind('CT'), seq.rindex('CT'))
    print(seq.strip('A'), seq.lstrip('A'), seq.rstrip('G'))
    print(seq.split('A'), seq.count('A'))

Exercises

You have a DNA sequence in fasta format

    fasta=""">gi|506953611|gb|KC684925.1| Amphiprion clarkii rhodopsin (RH) mRNA, partial cds
    AGTCCTTATGAGTACCCTCAGTACTACCTTGTCAACCCAGCCGCTTATGCTGCTCTGGGTGCCTACATGT
    TCTTCCTCATCCTTGCTGGCTTCCCAGTCAACTTCCTCACCCTCTACGTCACCCTCGAACACAAGAAGCT
    GCGAACCCCTCTAAACTACATCCTGCTGAACCTCGCGGTGGCTAACCTCTTCATGGTGCTTGGAGGATTC
    ACCACAACGATGTACACCTCTATGCACGGCTACTTCGTCCTTGGACGCCTCGGCTGCAATCTGGAAGGAT
    TCTTTGCTACCCTCGGTGGTGAGATTGCCCTCTGGTCACTGGTTGTTCTGGCTATTGAAAGGTGGGTCGT
    TGTCTGCAAGCCCATCAGCAACTTCCGCTTCGGGGAGAATCACGCTATTATGGGTTTGGCCTTCACCTGG
    ACAATGGCCAGTGCCTGCGCTGTTCCTCCTCTTGTCGGCTGGTCTCGTTACATCCCTGAGGGCATGCAGT
    GCTCATGTGGAGTTGACTACTACACACGTGCAGAGGGTTTCAACAATGAGAGCTTTGTCGTCTCCTCTTG
    TCGGCTGGTCTCGTTACATCCCTGAGG"""

(The code is indented here for display only; in a script, the sequence lines inside the triple quotes should start at the left margin.)

Q1 What is the length of the DNA sequence?
Q2 What is the frequency of the four nucleotides?
Q3 Extract the 1st, 2nd and 3rd codon positions of the DNA sequence.
Q4 Extract the GenBank number for this sequence.
Q5 Print the sequence as an mRNA.
Q6 How many amino acids are represented by this sequence?

Collection data types

Collections of objects

Tuples, lists, sets and dictionaries
- Python objects that contain a collection of items or elements
- lists and dictionaries are very powerful data structures that can grow and shrink at will during the execution of a program.
  You will use them a lot
- their elements can themselves be lists, dictionaries or tuples, offering an arbitrarily rich information structure to represent complex data

Tuples

Fixed, ordered and immutable object
- a tuple is defined using parentheses and each element is separated by a comma

    x=()
    x=("ATG",)
    x=(32, 223, 423, 2321)

- the comma at the end of a tuple with one item is needed to avoid ambiguity with an expression (e.g. (2+3) vs (2+3,))
- you can mix numbers, strings or whatever, including variables

    seq="ACGGAT"
    nucleotides=('A','C','G','T')
    x=(2, 2, seq, False, nucleotides)

- multi-dimensional objects

    matrix=((0,1,2),(1,0,3),(2,3,0))
    matrix[0][1]

Lists

Ordered but modifiable
- a list is defined using square brackets. Elements are separated by a comma

    x=[]
    x=["ATG"]
    x=[32, 223, 423, 2321]

- you can again mix numbers, strings or whatever, including variables

    seq="ACGGAT"
    nucleotides=('A','C','G','T')
    x=[2, 2, seq, False, nucleotides]

- you can convert a tuple to a list and vice versa

    nucleotides=('A','C','G','T')
    x=list(nucleotides)
    w=tuple(x)

- the + and * operators are defined for lists

    x+x   #new list with 8 elements
    x*3   #repeat the original list 3 times

Manipulating lists and tuples

Fetching values
- use the square brackets to access elements stored in a variable. In this case, [] have a different meaning than during list creation

    nucTuple=('A','C','G','T')
    nucList=['A','C','G','T']
    nucList[0]       #works the same on nucTuple
    nucList[-1]      #same as before
    nucList[3]='T'   #item assignment works on a list
    nucTuple[3]='T'  #error! tuples are immutable

- if a list/tuple has n elements, the minimum value of the index is −n and the maximum value is n − 1
- you can also take slices

    nucList[1:3]   #pos 1 and 2
    nucList[1:]    #everything except pos 0
    nucList[:3]    #everything up to pos 2 included
    nucList[1:-1]  #everything except the first and last positions

Manipulating lists and tuples

Membership, counting, finding elements
- membership. Not very efficient computationally.
  Use sets (see later) to do that

    'A' in nucList      #True, same with nucTuple
    'U' in nucList      #False, same with nucTuple
    'U' not in nucList  #True

- methods available for lists and tuples

    seq=['A','T','G','A','A','C','T','T','G']
    seq.count('A')  #how many A's do we have
    seq.index('T')  #first occurrence of T, not efficient
    len(seq)

- this is it for tuples, as they are not modifiable. Lists, however, have more to offer

Modifying lists

Extending a list, removing elements
- you can extend a list. It will modify the variable itself

    seq.append('G')
    seq.extend(['G','A','T'])
    seq.append(['G','A','T'])  #different?

- you can insert and remove an element

    seq.insert(1, 'A')  #insert A in position 1
    seq.remove('A')     #remove the 1st occurrence of A
    del seq[4]
    seq2=seq.pop(2)     #remove element in pos 2 and return it

- change a range of items at once. If the number of new elements is larger than the range, the list grows

    seq[1:3]=['A','A','A']

Modifying lists

Sorting and reversing
- you can reverse the order of the items in a list. It will modify the list directly

    seq.reverse()
    print(seq)
    print(seq.reverse())  #prints None: reverse() modifies the list in place

- you can sort the items of the list. It will modify the list directly, but it throws an error if the elements are not comparable

    seq=['A','A','T','G','A','A','C']
    seq.sort()     #like .reverse(), modifies the list itself
    print(seq)
    seq.append(1)
    seq.sort()     #error!

- if you want to sort the list without modifying the list itself:

    seq=['A','A','T','G','A','A','C']
    y=sorted(seq)
    z=seq[::-1]    #reversed() returns an iterator, not a list
    print(seq)

Reference versus copy

We have to be careful with lists

    L1=[1,2,3]
    L2=L1
    L3=L1[:]
    print(L2)
    print(L3)
    L1[0]=7
    print(L2)
    print(L3)

L1 and L2 are two variables pointing to the same data in memory. L3 is a new copy of L1 and its data is in a different part of the memory.

Sets

Unordered and modifiable
- a set is defined using curly brackets.
  Elements are separated by a comma

    x={32, 223, 423, 2321}
    x=set([32, 223, 423, 2321])
    x={32, 223, 423, 223}   #the duplicate is stored only once

- you can only mix numbers, strings and tuples in a set (elements need to be hashable)

    seq="ACGGAT"
    x={len(seq), seq}   #see how the set may be reordered
    nucleotides=['A','C','G','T']
    x={4, nucleotides}          #error
    x={4, tuple(nucleotides)}

- you can convert sets into tuples or lists and vice versa. However, the content might change

    codons=['ATG','CGT','GGT','TTC', 'CGT']
    x=tuple(codons)
    w=set(codons)
    z=list(w)

Manipulating sets

Unordered, so no indexing possible
- it doesn't mean anything to access sets by index. Tests of the presence of an item are, however, efficient because sets are like dictionaries

    aa={'Ala','Thr','Leu','Pro','Ser'}
    'Pro' in aa
    'Iso' in aa
    'Met' not in aa

- you can add, remove and get the length of a set

    len(aa)
    aa.add('Iso')     #modify aa directly
    aa.add('Ala')     #no effect as Ala is already in
    aa.remove('Iso')  #decrease aa by one
    aa.pop()          #remove and return an arbitrary item, decreasing aa by one

- several operators can work with multiple sets

    s={1,2,3,4,5}
    t={4,5,6,7}
    a=s&t  #intersection
    b=s|t  #union

Dictionaries

Unordered and modifiable key:value pairs
- a dictionary is defined using curly brackets and the key is separated from its value by a colon.
  Elements are separated by a comma

    x={}
    x={"CCC":"Pro"}
    x={"CCC":"Pro", "GCA":"Ala"}

- the key must be hashable, whereas the value can be of any type
- keys and values can be of different types

    x={('CCA','CCC','CCG','CCT'):"Pro"}
    x={['CCA','CCC','CCG','CCT']:"Pro"}  #error
    x={"Amphiprion clarkii":fasta, ('CCA','CCC','CCG','CCT'):"Pro"}

Manipulating dictionaries

Unordered, so no indexing (since Python 3.7 dictionaries keep their insertion order, but they still cannot be indexed by position)
- use again square brackets to access an element of a dictionary

    aa={"CCC":"Pro", "GCA":"Ala", "AGA":"Arg"}
    aa['CCC']
    aa['ACA']                 #error because not in aa
    aa.get('ACA')             #alternative without error
    aa.get('ACA', 'unknown')  #return 'unknown' if not in aa
    aa.get('CCC', 'unknown')  #return the value for 'CCC'

- note that get() does not change the dictionary if a key is not present. Use setdefault() to insert the default value
- membership, modifying dictionaries and listing items

    'CCC' in aa
    len(aa)               #nb of key:value pairs
    aa['CCC']='Proline'   #change the value of the key 'CCC'
    aa['ACA']='Thr'       #add a new pair
    del aa['CCC']
    print(aa.keys(), aa.values())
    print(aa.items())

References versus copy 2

Some caution needed
- we have to be careful when a list or dictionary is defined through variables referencing mutable objects

    L=[1,2]        #L points to the list [1,2]
    D={'a':L}      #reference to list [1,2] also pointed to by L
    print(D)
    L[0]=3         #we change L *in-place*, i.e. the list [1,2]
    print(D)
    D['a'][1]=17   #we modify the list pointed to by L
    print(L)

- apparently the same, but very different behaviour!

    D={'a':L}      #reference to list [1,2] also pointed to by L
    print(D)
    L=[3,4]        #we assign L to a *new list*
                   #list [1,2] is still in memory
                   #and pointed to by D
    print(D)

- a better way

    D={'a':L[:]}   #L[:] returns a copy of L
    print(D)
    L[1]=23        #we change L *in-place*
    print(D)       #D didn't change, it points to the copy of [1,2]

Moral of the story

Be careful!
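
The reference-versus-copy behaviour shown in the examples above can be summarised in one short sketch. It also uses copy.deepcopy from the standard library, which the slides do not mention, to show what happens when a list contains another list; the variable names are chosen for this illustration only.

```python
import copy

inner = [1, 2]
outer = [inner, 3]

alias = outer                 # same object: no copy at all
shallow = outer[:]            # new outer list, but the inner list is shared
deep = copy.deepcopy(outer)   # new outer list and new inner list

inner[0] = 99                 # in-place change of the nested list
outer[1] = 42                 # in-place change of the top-level list

print(alias)    # follows both changes
print(shallow)  # follows only the change to the nested list
print(deep)     # follows neither change
```

A slice such as outer[:] is enough for flat lists; a deep copy is only needed when nested mutable objects must also be independent.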
- the copy is done for the top-level object only: if L contains a reference to another list, the reference is copied, not the list that is referenced
- to copy a list or a dictionary, use the method copy(), because = makes the new variable point to the same location in memory

Exercises

You have a multiple sequence alignment in fasta

    seq1=">Amphiprion clarkii\nAGTTGACCTAGTCATAGA"
    seq2=">Amphiprion frenatus\nAGCTGACCTAGTTTTAGA"
    seq3=">Amphiprion ocellaris\nAGTTGACCTGGGCATCGA"
    seq4=">Pomacentrus mollucensis\nAGTCTACCTGATCCGGA"

Q1 Create a dictionary to store the sequences.
Q2 Create a new dictionary with the same set of species but replace the sequence of each species by the following:

    seq1b="ATAATATTCGATTGATCAGT"
    seq2b="ATAATACTCGATTTATCAGT"
    seq3b="ATAATACTCGATCGATCCGT"
    seq4b="ATAATAGGCGATCGACTAGT"

Q3 Merge the second sequence with the initial sequence of each species
Q4 Calculate the GC content in each sequence
Q5 Order the species according to their GC content

Program control and logic

Program execution
[Figure: three flowcharts of program execution. Normal: statements 1, 2, 3, ... are executed in sequence. Conditional: a test (?) selects the True or the False branch. Loop: a block of statements is repeated (Next) until the loop ends (End).]

If statements

Conditional execution
Some block of code should only be executed if a test is true. In Python

    if <condition1>:
        <statements>
    elif <condition2>:
        <statements>
    elif <condition3>:
        <statements>
    ...
    else:
        <statements>

Remarks
- <condition> = any expression evaluated to True or False
- <statements> = any number of instructions
- use indentation of <statements> rather than {}
- indentation is part of the syntax
- note the : as a delimiter of the tests
- elif and else are optional

Example

Testing user input

    x=input("Enter a nucleotide:")
    if x == 'A':
        print("The nucleotide is an adenine (purine)")
    elif x == 'C':
        print("The nucleotide is a cytosine (pyrimidine)")
    elif x == 'G':
        print("The nucleotide is a guanine (purine)")
    elif x == 'T':
        print("The nucleotide is a thymine (pyrimidine)")
    elif x == 'U':
        print("The nucleotide is a uracil (pyrimidine)")
    else:
        print("%s is not a nucleotide" % x)

Remarks
- tests are considered one after the other until one is true
- if all if/elif tests are false, the else statements are executed
- what is the difference with having 5 independent if statements?

Remarks on If statements

Comparison operators
- >, <, <= or >= to compare numbers
- != for ≠ and == for =
- is and is not to test whether two names refer to the same object

    x=[123,54,92,87,33]
    y=x[:]   # y is a copy of x
    y==x     # True, they have the same values
    y is x   # False, they are different objects

Comparisons and truth
Expressions without an obvious query can have conditional truthfulness
- the value of an item can be forced to be considered as true or false
- assumed False: None, False, 0, 0.0, "", (), {}, [], set()
- everything else is assumed True

Logic operations

Standard logical connectives (and, or, not)
The resulting expressions are not necessarily Boolean. Try

    z=0 and 3   #try: z=1 and 2, then z=5 and 0
    if z:
        print("Z is equivalent to True")
    else:
        print("Z is equivalent to False")

- x and y gives back the x or the y value, which only evaluates as equivalent to True or False in a conditional statement
- we can mix data types in the logical comparisons
- also available: or and not

Advanced syntax

Shortcuts and parentheses
- if statement in one go:

    print(x and "Yes" or "No")   #similar to print("Yes" if x else "No")

- in x or y, y is evaluated only if x is false.
  Similarly, in x and y, y is evaluated only if x is true

    x=[256,128]
    if x and x[0]>10:
        pass   #do something...

- a!=b is the same as not a==b
- precedence and parentheses:

    x=y=0
    z=1
    x and (y or z)   #gives 0
    (x and y) or z   #gives 1

try statement

Statements that might cause an error
For example, if we want to compute the square of a given number

    x=input("Enter a number:")
    try:
        num=int(x)
    except ValueError:   #the try clause produced an error
        print("It's not a number. I stop")
        quit()
    else:                #the try clause did not produce an error
        print(num**2)

- try is a way to handle errors: it catches an exception raised by an error and recovers gracefully

Loops

The for loop
It is natural to consider (i.e. iterate over) every element in a collection

    nucleotides=['A','C','G','T']
    for nuc in nucleotides:
        print(nuc)

- the variable nuc is set in turn to each element from the list and the block of code is executed every time. It doesn't have to be defined beforehand
- works the same for tuples and sets
- for dictionaries, nuc would be assigned the keys of the dictionary
- looping over sets (unordered collections) will happen in arbitrary order
- note the : at the end of the for statement

Loops, cont.

Positional indices
In most loops, there is no need to know the position of the associated value in the list. However, you sometimes need the indices themselves
- use the range() function

    range(7)    # 0,1,2,3,4,5,6
    range(2,7)  # 2,3,4,5,6
    nucleotides=['A','C','G','T']
    for i in range(len(nucleotides)):
        print(nucleotides[i])

- add a third argument to define the step size

    range(2,13,3)  # 2,5,8,11
    range(7,2,-2)  # 7,5,3

- access the value and the index at once: enumerate()

    vec1=(4,7,9)
    vec2=(1,3,5)
    #get the inner (or dot) product
    s=0
    for i, x in enumerate(vec1):
        s+=x*vec2[i]

Loops, cont.
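
As a hedged illustration of the range() and enumerate() examples above: in Python 3, range() returns a lazy range object rather than a list, so list() is used here to display its values. The zip() version of the dot product is an alternative that is not part of the slides, and the variable names are chosen for this sketch.

```python
vec1 = (4, 7, 9)
vec2 = (1, 3, 5)

# range() objects are lazy; list() materialises their values.
forward = list(range(2, 13, 3))
backward = list(range(7, 2, -2))

# Dot product with enumerate(): i is the index, x the value from vec1.
dot_enum = 0
for i, x in enumerate(vec1):
    dot_enum += x * vec2[i]

# Same result with zip(), which pairs the elements of both tuples.
dot_zip = sum(x * y for x, y in zip(vec1, vec2))

print(forward, backward, dot_enum, dot_zip)
```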
The while loop
Same idea, but the loop keeps going while a condition is true

    nucleotides=['A','C','G','T']
    i=0
    while i < len(nucleotides):
        print(nucleotides[i])
        i+=1

- careful, as you can have an infinite loop if the condition is never false
- good if you don't have an initial collection of elements to loop over
- you need to define the variable i beforehand
- you need to explicitly increment i within the loop
- you can add else: after a while loop; its block is executed once the condition becomes false (it is skipped if the loop is left with break)

Skipping and breaking loops

The continue statement means that the rest of the code in the block is skipped for this particular iteration

    nucleotides=['A','C','G','T']
    for nuc in nucleotides:
        if nuc == 'C':
            continue
        print(nuc)

The break statement immediately causes all looping to finish

    nucleotides=['A','C','G','T']
    for nuc in nucleotides:
        if nuc == 'C':
            break
        print(nuc)

List comprehension

Create a new collection
Go through a collection to generate another, but different, collection. E.g. squaring numbers

    squares=[]
    for x in range(1,8):
        squares.append(x*x)

More efficient alternative: use a list comprehension

    squares=[x*x for x in range(1,8)]

Looping tips

Altering collection elements
- bad idea to alter the number of elements in a loop

    num=list(range(9))
    for n in num:
        if n<5:
            num.remove(n)
    print(num) #some n<5 still there...
- duplicate the list to get the expected result

    for n in list(num):   #create a new list similar to num
        if n<5:
            num.remove(n)
    print(num)            #everything is ok

    #or use list comprehension
    num=[n for n in num if n >= 5]

    #or use a second list explicitly
    num2=[]
    for n in num:
        if n>=5:
            num2.append(n)
    num=num2

Exercises

Loops and conditional statements
Q1 write code that checks whether a specific amino acid is present in a sequence (using index() is not allowed)

    seq='NFYIPMFNKTGVVRSPFEYPQYYLAGVVRSPFEY'

Q2 simulate a DNA sequence of length 150 by randomly drawing nucleotides (tip: use the module random, with its function random.choice('ACGT'))
Q3 calculate how many amino acids of seq2 are in seq

    seq2='ALILALSMGY'

Functions

Why use functions?

We have seen many examples
- e.g. len(), split(), ...
- we will learn how to define our own functions
- they are useful to maximize code reuse: the same operation is coded only once and can be used many times
- they provide procedural decomposition: the problem is split into simple pieces or subtasks

Defining a function

Use the def instruction

    def function_name(arg1, arg2, ..., argN):
        statements   # Body of the function
                     # instructions involving arg1 to argN

- the part in parentheses is provided by you
- note the colon at the end of the header line and the indentation
- the function is then called as function_name(. . .)
  with actual variables or literals
- a function usually returns an object but may also have a side effect
- upon calling a function, execution flow is transferred to the function
- when finished, control is returned to the instruction following the call

Examples

Function with a side effect

    def hello():
        print("Hello world!")

- no return value, only a side effect
- does not require an argument on calling

Function returning a value

    def mult(x,y):
        return x*y
    a=mult(2,3)      # 6
    mult(a,a)        # 36
    mult("bla ", 2)  # 'bla bla '

- the return statement is used to return the value
- return exits the function
- it can be placed anywhere in the function

    def min(a, b):   # note: this name hides the built-in min()
        if a<b:
            return a
        else:
            return b

Polymorphism

Functions can work on any types of arguments
- in the example above, the arguments were either numbers or strings
- argument types are not defined in the function definition
- arguments and return value have no type constraints
- the function will work as long as the operations done in the body of the function are defined on the objects passed as arguments
- this is called polymorphism

Remarks

Position and assignment
- a function can be defined anywhere in the code, even in an if or a loop.
  This is different from other languages where the functions are defined separately and are not part of the main program
- a function can be nested in another one
- a function should be defined before it is used
- def creates a function object and assigns it to the given name
- we can always assign a function to a new name

    def mult(x,y):
        return x*y
    times = mult
    print(times(2,3))

Modifying the argument value

Mutable objects
If passed as arguments, they can be modified in the body of a function

    def func(X):
        X[0]=5   #works if the object accepts the operation
    L=[1,2]
    func(L)
    print(L)

Immutable objects
They cannot be modified if passed as arguments

    def swap(x):
        x = (x[1],x[0])   #not very useful
    x = (2,3)
    swap(x)
    print(x)

- immutable objects behave as if passed by value: use the return value to modify them (try with the swap function above as an exercise)
- mutable objects behave as if passed by reference: in-place changes are visible to the caller

Scope of a variable

Where can we access a variable?
- variables defined within a function are only accessible inside it
- their scope is limited to the body of the function
- local variables disappear from memory when a function is exited

    def func(x):
        a,b = 2,4   # a and b are local, private
        return a*x+b
    print(func(1))
    print(a)        # error: a is not defined outside func

Local variables
- we could have defined a in the main program
- there is no way to know the names of all variables in all functions

    a=89
    def func(x):
        a,b = 2,4   # a and b are local, private
        print(a)
        return a*x+b
    print(func(1))
    print(a)

Local variables

What makes a variable local
- any assignment within a function makes the assigned variable local
- otherwise, a variable not assigned in a function is considered as global
- the global scope is limited to the program (file) it is defined in. This is why we need the import statement

    x=3
    def func(y):
        print(x)
        return x+y
    print(func(2))

Priority between nested scopes
- search sequence: local scope, then global one
- third level: built-in scope (i.e. built-in names like str, int, open, ...)
- possible to reuse them in local scope, but careful

    def func():
        open=3
        open('file.txt')   #won't work

Global statement

Declaring a variable as global

    def func():
        global x
        x=2
    x=99
    func()
    print(x)

- this is not recommended (leads to bugs, no portability, ...)
- could be useful if a function needs to remember its state after the previous call. Object-oriented programming is better

Function's arguments

Number of arguments
- a function should be called with the number of arguments given in its definition
- arguments are matched by position between the function definition and the function call
- but Python is more flexible

    def func(name, age, job):
        print("%s is %d years old and is a %s" % (name, age, job))
    func('Joe', 32, 'teacher')                #usual way
    func(age=32, job='teacher', name='Joe')   #using keyword arguments

    def func(name, age, job='doctor'):        #with default values
        print("%s is %d years old and is a %s" % (name, age, job))
    func(age=45, name='Allan')
    func('Allan', 45)
    func('Allan', 45, 'lawyer')

Arbitrary numbers

Using a list of arguments: * construct
- packing a collection of arguments into a tuple

    def func(*args):
        print(args)
    func()
    func(1)
    func(1,2)

- args becomes a tuple with all the arguments passed to the function
- packing a collection into a dictionary

    def func(item, *args, **kw):
        print('Mandatory argument:', item)
        print('Unnamed argument:', args)
        print('Keyword dictionary:', kw)
    func('Hello', 1, 99, valueA="abc", valueB=7.0)
    func('Hello', valueA="abc", valueB=7.0)
    func('Hello')

- kw becomes a dictionary with all the keyword arguments passed to the function
- useful if arguments have to be passed to a nested function inside the main one

Unpacking a collection

Using the * construct
* can be used when calling a function: it unpacks a sequence into a collection of arguments

    def func(*val):
        x=0
        for y in val:
            x+=y
        return x
    func(1,2,3)   # 6
    a=(1,2,3)
    func(*a)      # 6
    b=[2,3,4]
    func(*b)      # 9

Recursion

A function can be defined through itself.
This is called recursion.

    def factorial(n):
        if n <= 1:   # stopping condition, also covers n == 0
            return 1
        else:
            return n * factorial(n - 1)

    factorial(3)   # 6
    factorial(6)   # 720

Function as argument

A function can also be passed as an argument.

    def add(x, y):
        return x + y

    def mult(x, y):
        return x * y

    def combine(f, *args):
        a = args[0]
        for b in args[1:]:
            a = f(a, b)
        return a

    t = (1, 2, 3, 4)
    v = combine(add, *t)    # v is the sum: 10
    v = combine(mult, *t)   # v is the product: 24

The map function

An efficient way to apply a function to all elements of a list.

    def func(x, y):
        return x + y

    L1 = [1, 2, 3]
    L2 = [3, 4, 5]
    L3 = list(map(func, L1, L2))   # [4, 6, 8]; in Python 3, map returns an iterator

Exercises

Functions

Q1 write a function to count the percentage of identity > 0.8 in a sequence alignment given as a long string

    """>seq1\nACTAATGCGTAGTACTGACTTACT\n >seq2\nAGTAAGTCGTAGTACTGCCGTACT\n >seq3\nACTAATGCTTAGTACTGACGGTTA\n >seq4\nATCAATGCGCAGTACTGACTTACA\n >seq5\nAGCAATGCGTAGTATTGCCAACCT"""

Q2 compare the running time of a for loop and of map to add the values of two lists with 1 million items. Use the following to calculate the running time

    from time import time
    ...
    start = time()
    ...
    end = time()
    print(end - start)

Q3 find the minimum value in a sequence of numbers using
- argument packing
- a tuple or a list

Input/Output with files

Files

Permanent storage

- files are used to store data or information permanently on the disk of the computer
- files can be accessed from a Python program:
  - read an existing file
  - write to a file
  - create, delete, copy files

Opening a file

- the built-in function open() takes a filename as argument
- it returns a file object whose methods can be used to act on the file
- it is not a usual object such as lists, strings or numbers. It is a link to a file in the filesystem

    outfile = open("tmp/rh1.fas", 'w')
    infile = open("data.txt", 'r')

Opening a file, cont.
Remarks

- the names infile and outfile are variables and any valid variable name can be used
- the filename is passed to open as a string
- it may be an absolute name with the full path or a relative name with respect to the directory where the Python script is run
- the access-mode argument 'w' means write, 'r' means read (default if not present) and 'a' means append
- a third argument can control the output buffering: passing a zero means the output is transferred immediately to the file (which may be slow; in Python 3, zero is only allowed for files opened in binary mode)

Closing a file

Terminating the connection to the file

- done through the method close() available for the file object

    infile.close()

- this flushes the output buffer (if any) to the actual file
- it releases the lock on the open file. The file can be used by other applications
- at the end of a Python program, open files are automatically closed

Reading from a file

Example of a file, reading these lines as strings

    > cat myseq.txt
    Number of sequences and length
    4
    256
    >

    myfile = open('myseq.txt', 'r')
    a = myfile.readline()      # read 1st line; a is a string
    print(a)
    b = myfile.readline()      # read 2nd line; b is a string
    print(b)
    c = myfile.readline()      # read 3rd line; c is a string
    print(c)
    print(int(b) * int(c))     # total sequence length; conversion needed
    myfile.close()

- the file object keeps track of the current position in the file, which is advanced after each read operation
- print shows a double newline between items: one from the file, one added by print
- use rstrip() to remove the first one before printing

Parsing and other reading methods

Using the split() method

Reading a sequence alignment

    2 15
    sp1 ACATCATTGACCTAG
    sp2 ACACGATCGATCTAG

    seq = {}
    myfile = open('myseq.phy', 'r')
    line = myfile.readline()        # '2 15\n'
    nseq, nchar = line.split()      # split on whitespace, drops the '\n'
    line = myfile.readline()        # 'sp1 ACATCATTGACCTAG\n'
    k, v = line.split()
    seq[k] = v
    ...
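The split() parsing of the alignment can be completed into a small runnable sketch. To keep it self-contained, io.StringIO stands in for the file myseq.phy on disk (an assumption made only for this illustration), and species names are assumed to contain no spaces.

```python
import io

# Stand-in for open('myseq.phy', 'r'), with the same content as the example
myfile = io.StringIO("2 15\nsp1 ACATCATTGACCTAG\nsp2 ACACGATCGATCTAG\n")

# First line: number of sequences and number of characters
nseq, nchar = myfile.readline().split()
nseq, nchar = int(nseq), int(nchar)   # split() returns strings

# Remaining lines: one "name sequence" pair per line
seq = {}
for i in range(nseq):
    name, residues = myfile.readline().split()
    seq[name] = residues

myfile.close()

# Consistency check against the header line
for name, residues in seq.items():
    if len(residues) != nchar:
        print("Unexpected length for", name)

print(seq)
```

Calling split() without an argument splits on any whitespace and discards the trailing newline, so no separate rstrip() is needed here.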
Reading more than a line

    line = myfile.read(n)        # read the next n characters into a string;
                                 # the next call to readline() will return
                                 # the rest of the current line
    lines = myfile.read()        # read the entire file into a single string
    lines = myfile.readlines()   # read the entire file into a list of lines
    myfile.seek(n)               # change current file position to offset n for the next read

End of line: UNIX/Linux, Mac OS X, Windows

The end of a line is not represented in the same way on all computers, but readline() depends on this character to recognise the end of a line.

- '\n' = end of line for UNIX/Linux and Mac OS X
- '\r\n' = end of line for Windows
- '\r' = end of line for Mac OS up to version 9

Download the files testUNIX.txt, testWin.txt and testMacOS.txt.

    f = open('testUNIX.txt', 'r')
    line = f.readline()
    print(line)
    # repeat this on testWin.txt and testMacOS.txt

How can you make sure that files are read correctly?
    f = open('testUNIX.txt', 'rU')   # add 'U' for universal end of line (Python 2)

In Python 3, files opened in text mode already recognise all three line endings, and the 'U' flag was removed in Python 3.11.

Reading a file

Often need to read an entire file

- how to know its length?
- best to read line by line to save memory
- big files cannot fit in memory

    f = open('input.dat', 'r')
    while True:
        line = f.readline()
        if line:   # an empty string means the EOF: why?
            line = line.rstrip()
            ...
        else:
            break

Iterable file objects

- we can loop through file objects
- best and fastest way to read a file

    f = open('input.dat', 'r')
    for line in f:
        line = line.rstrip()
        ...
    f.close()

Writing to a file

Opening a file for writing

    f = open('filename', 'w')

- when the access mode is 'w', the file is created if it does not exist already. It is overwritten otherwise
- other access modes are:
  - 'a' for append: the written lines are added at the end of an existing file (or the beginning of a new one)
  - 'r+': the file is open both for reading and writing

Basic methods

    f.write('some strings')
    f.writelines(aList)
    f.flush()   # flushes the output buffer to the actual file without closing

Writing/reading complex objects

Only strings can be written to or read from text files. We need some methods to deal with complex objects.

    x, y, z = 10, 50, 100
    s = 'ACCATGAT'
    D = {'CCC': 'Pro', 'ACC': 'Thr'}
    L = ['A', 'C', 'G', 'T']
    f = open('datafile.txt', 'w')
    f.write(s + '\n')                       # to have a newline
    f.write('%s, %s, %s\n' % (x, y, z))
    f.write(str(L) + '$' + str(D) + '\n')   # need explicit string conversion
    f.write(' '.join(L))                    # or ''.join(L), or '_'.join(L), ...
    for k, v in D.items():
        f.write('%s: %s\n' % (k, v))

The % operator

Used to format a printout. Syntax: print(string % tuple)

    a = 'ACAATAT'
    b = 12
    print('The length of the sequence %s is %d nucleotides' % (a, b))

More about files

Serialize/deserialize an object

Transform an object into a byte string that can be written to a file and read back.

    import pickle
    f = open('filename', 'wb')   # should open with 'wb' to create a binary file
    D = {'CCC': 'Pro', 'ACC': 'Thr'}
    pickle.dump(D, f)            # write D to file object f
    f.close()
    f = open('filename', 'rb')   # read it back in binary mode as well
    E = pickle.load(f)
    f.close()
    print(E)

The file looks weird if you print it: it is saved in binary format.

Failing to open a file

This can happen because of a lack of permission, a full file system, a non-existent file opened in read mode, ...

- the try-except construct can help recovering from a file error:

    filename = input("Enter a file name: ")   # raw_input() in Python 2
    try:
        f = open(filename, 'r')
    except IOError:
        print("File cannot be opened.")

Modules sys and os

Variable file names

File names are usually given by a user and they will change. It is not ideal to hardcode them in your script. Use the sys module and call myscript.py data/inputFile.txt

    import sys
    pyScriptName = sys.argv[0]   # name of the script
    filename = sys.argv[1]       # name of the file, i.e. data/inputFile.txt

File operations

Use the module os and the sub-module os.path to deal with files.

os module
chdir(path)         changes current dir
getcwd()            returns current dir
listdir(path)       lists dir content
rmdir(path)         removes directory
remove(path)        removes file
rename(src, dst)    moves from src to dst

os.path sub-module
exists(path)        does path exist?
isfile(path)        is path a file?
isdir(path)         is path a dir?
islink(path)        is path a symbolic link?
join(*paths)        joins paths together
dirname(path)       dir containing path
basename(path)      path minus dirname
split(path)         returns dirname and basename

Exercises

Reading and writing files

Q1 Download the fasta file clownfish.fasta and create a function that will read the file and store the sequences in a dictionary (species names as keys) containing
- the sequence itself
- its total length
- the percentage of GC

Q2 write a function that will read a GenBank file as input and write the sequences in fasta format as output. A possible file is available here. You can also download one directly from GenBank

Object Oriented Programming

What is OOP?

Objects in programming

Object-oriented programming: a programming paradigm based on the concept of "objects", which are data structures containing data (or attributes) and methods (or procedures).

Objects in Python

We have already used objects.

    n = 12                             # n is an object of type integer
    s = 'ACAGATC'                      # s is an object of type string
    l = [12, 'A', 21121, 'ACCAT']      # l is an object of type list

These objects
- contain data (the number 12, the string 'ACAGATC', ...)
- can be modified/manipulated

    s.count('A')
    s.lower()
    l.append('121')

Extending data types

Python standard data types

For most simple programs, we can usually survive well with standard Python data types. This includes
- numbers, strings
- tuples, lists, sets, dictionaries

Defining your own data types

It might be useful though to create your own data types: objects built to your own specifications and organised in the way convenient to you. This is done through object definitions known as classes.

Class vs object

Implementation vs instantiation

The class is the definition of a particular kind of object in terms of its component features and how it is constructed or implemented in the code. The object is a specific instance of the thing which has been made according to the class definition. Everything that exists in Python is an object.
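The distinction between a class and its objects can be illustrated with a minimal sketch. The empty class Sequence below is a hypothetical placeholder used only for this illustration; it is not the full class developed in the course.

```python
class Sequence:
    # hypothetical, empty class: this is the definition (implementation)
    pass

# Two objects (instances) made according to the same class definition
a = Sequence()
b = Sequence()

print(type(a))                      # the class the object was built from
print(isinstance(b, Sequence))      # b is an instance of Sequence
print(a is b)                       # two distinct objects
print(type(12))                     # built-in values are objects of a class too
print(isinstance(Sequence, object)) # even the class itself is an object
```

The class is written once, while any number of independent objects can be created from it.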
OOP in Python

A common principle of OOP is that a class definition
- makes available certain functionality
- hides internal information about how a specific class is implemented

This is called encapsulation and information hiding. Python is however quite permissive and you can access any element of an object if you know how to do that.

An example

A sequence object

We need to store data:
- species name
- sequence in DNA and amino-acid
- protein name
- length of the sequences
- percentage of GC
- ...

We need to be able to manipulate the data using methods:
- add/remove nucleotide or amino-acid (and update the other data dependent on it)
- translate DNA to amino-acid and inversely
- print the sequence in various ways
- calculate some characteristics
- ...

DNA sequence class

Class definition

    class Sequence:
        pass   # some statements

- it is common practice to save each class in its own file. Then use: from Sequence import Sequence

Inheritance

    class DNAsequence(Sequence):
        pass   # some statements

- inheriting methods from a superclass
- classes can have more than one superclass

Class functions

Providing object capabilities

- functions are defined within the construction of a class
- defined in the same way as ordinary functions (indented within the class code block)
- accessed from the variable representing the object via 'dot' syntax

    name = mySequence.getName()

- getName() knows which Sequence object to use when fetching the name
- the first argument is special: it is the object the function is called from (self)

    class Sequence:
        def getName(self):
            return self.name

        def getCapitalisedName(self):
            name = self.getName()
            if name:
                return name.capitalize()
            else:
                return name

Remarks on functions

Order of functions

- the order of functions does not matter
- if a function definition appears more than once, the last one replaces the previous one

    class MultipleSeqAl(Sequence):
        def getMSA(self):
            pass   # function implementation

        def getSequenceIdentity(self):
            pass   # function implementation

Using subclasses

- call specific functions as usual: msa.getMSA()
- can also call msa.getName() from Sequence directly because of inheritance
- however, .getMSA() cannot be accessed from a Sequence object

Object attributes

Variables tied to the object

- attributes hold information useful for the object and its functions
- e.g. associate a variable storing the sequence name with Sequence objects

Object attributes
- specific to a particular object
- defined inside class functions
- use self to access them

Class attributes
- available to all instances of a class
- defined outside all function blocks
- usually used for variables that do not change
- accessed directly using the variable name
- bare function names are also class attributes

Examples of class attributes

    class Sequence:
        type = 'DNA'   # class attribute

        def setSequenceLength(self, l):
            self.length = l

        def getSequenceLength(self):
            return self.length

    myseq = Sequence()
    print(myseq.type)      # variable type can be accessed from the object
                           # note that we don't use the () as we access a variable
    print(Sequence.type)   # accessed through the class itself
    getSeqLenFunc = Sequence.getSequenceLength
    getSeqLenFunc(myseq)   # same as myseq.getSequenceLength()
    length = myseq.length  # AttributeError: length not yet set
    myseq.setSequenceLength(541)
    length = myseq.length  # returns 541 this time
    length = myseq.getSequenceLength()
    myseq.l = 541          # create new attributes on the fly...

Object life cycle

Birth, life and death

- the creation of an object is handled in a special function called the constructor
- removal is handled by a function called the destructor
- Python has automatic garbage collection, so there is usually no need to define a destructor

Class constructor

- called whenever the corresponding object is created
- uses a special name: __init__
- the first argument is the object itself (i.e. self)
- followed by any other arguments you need to create the object
- it is a good idea to introduce a key that uniquely identifies objects of a given class

Example

    class Sequence:
        def __init__(self, name, type='DNA'):
            self.type = type
            if name:
                self.name = name
            else:
                print('Name must be set to something')

    myseq = Sequence('opsin')
    myseq = Sequence('opsin', 'AA')

When to create attributes

- attributes can be created in any class function (or directly on the object)
- the convention is to create most of them in the constructor, either directly or through a call to a function
- set an attribute to None if it cannot be set at object creation
- constructors are inherited by subclasses

Exercises

Q1 Create a class to store all the elements of a GenBank record and store the accessions found in the GenBank file used last time (download it here) in a list. Try to invent functions that could be useful to deal with these accessions

Q2 Create a function to get the length of the sequence of each GenBank record and calculate the mean, maximum and minimum length

Q3 Create a new class that will hold the set of GenBank accessions and store useful information on them
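The conventions on creating attributes listed before the exercises can be sketched as follows. This is only one possible layout: the methods setSequence() and getGC() and the attribute sequence are hypothetical names introduced for this illustration, and the subclass deliberately defines no constructor of its own.

```python
class Sequence:
    def __init__(self, name, type='DNA'):
        self.name = name
        self.type = type
        self.sequence = None   # cannot be set at object creation
        self.length = None

    def setSequence(self, sequence):   # hypothetical helper
        self.sequence = sequence
        self.length = len(sequence)    # keep dependent data up to date


class DNAsequence(Sequence):
    # no __init__ here: the constructor of Sequence is inherited

    def getGC(self):                   # hypothetical helper
        gc = self.sequence.count('G') + self.sequence.count('C')
        return 100.0 * gc / self.length


myseq = DNAsequence('opsin')       # calls Sequence.__init__
print(myseq.name, myseq.type, myseq.length)
myseq.setSequence('ACGGCTAT')
print(myseq.length, myseq.getGC())
```

Creating every attribute in the constructor, even as None, means that reading myseq.length before a sequence is set gives None instead of raising an AttributeError.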