Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Programming in Python Michael Schroeder Andreas Henschel {ms, ah}@biotec.tu-dresden.de 1 Motivation All these sequences are winged helix DNA binding domains. How can we group them into families? mkdntvplkliallangefhsgeqlgetlgmsraainkhiqtlrdwgvdvftvpgkgyslpep mktvrqerlksivrilerskepvsgaqlaeelsvsrqvivqdiaylrslgynivatprgyvlagg kaltarqqevfdlirdhisqtgmpptraeiaqrlgfrspnaaeehlkalarkgvieivsgasrgirllqee mrssakqeelvkafkallkeekfssqgeivaalqeqgfdninqskvsrmltkfgavrtrnakmemvyclpaelgvptt gqrhikireiimsndietqdelvdrlreagfnvtqatvsrdikemqlvkvpmangrykyslpsdqrfnplqklkr kgqrhikireiitsneietqdelvdmlkqdgykvtqatvsrdikelhlvkvptnngsykyslpadqrfnplsklkr dvtgriaqtllnlakqpdamthpdgmqikitrqeigqivgcsretvgrilkmledqnlisahgktivvygt dikqriagffidhanttgrqtqggvivsvdftveeianligssrqttstalnslikegyisrqgrghytipnlvrlkaaa iderdkiileilekdartpfteiakklgisetavrkrvkaleekgiiegytikinpkklg elqaiapevaqslaeffavladpnrlrllsllarselcvgdlaqaigvsesavshqlrslrnlrlvsyrkqgrhvyyqlqdhhivalyqnaldhlqec mntlkkafeildfivknpgdvsvseiaekfnmsvsnaykymvvleekgfvlrkkdkryvpgyklieygsfvlrrf lfneiiplgrlihmvnqkkdrllneylsplditaaqfkvlcsircaacitpvelkkvlsvdlgaltrmldrlvckgwverlpnpndkrgvlvklttggaaiceqchqlvgqdlhqeltknltadevatleyllkkvlp nypvnpdlmpalmavfqhvrtriqseldcqrldltppdvhvlklideqrglnlqdlgrqmcrdkalitrkirelegrnlvrrernpsdqrsfqlfltdeglaihqhaeaimsrvhdelfapltpveqatlvhlldqclaaq tdilreigmiaraldsisniefkelsltrgqylylvrvcenpgiiqekiaelikvdrttaaraikrleeqgfiyrqedasnkkikriyatekgknvypiivrenqhsnqvalqglseveisqladylvrmrknvsedwefvkkg mskindindlvnatfqvkkffrdtkkkfnlnyeeiyilnhilrsesneisskeiakcsefkpyyltkalqklkdlkllskkrslqdertvivyvtdtqkaniqkliseleeyikn aitkindcfellsmvtyadklkslikkefsisfeefavltyisenkekeyylkdiinhlnykqpqvvkavkilsqedyfdkkrnehdertvlilvnaqqrkkiesllsrvnkrit miimeeakkliielfselakihglnksvgavyailylsdkpltisdimeelkiskgnvsmslkkleelgfvrkvwikgerknyyeavdgfssikdiakrkhdliaktyedlkkleekcneeekefikqkikgiermkkisekilealndl aqspagfaeeyiiesiwnnrfppgtilpaerelseligvtrttlrevlqrlardgwltiqhgkptkvnnfwets eekrsstgflvkqraflklymitmteqerlyglkllevlrsefkeigfkpnhtevyrslhellddgilkqikvkkegaklqevvlyqfkdyeaaklykkqlkveldrckkliekalsdnf hmqaeilltlklqqklfadprrisllkhialsgsisqgakdagisyksawdainemnqlsehilveratggkggggavltrygqrliqlydllaqiqqkafdvlsdddalplnsllaaisrfslqts skvtyiikasndvlnektatilitiakkdfitaaevrevhpdlgnavvnsnigvlikkglveksgdgliitgeaqdiisnaatlyaqenapellk sprivqsndlteaayslsrdqkrmlylfvdqirksdgtlqehdgiceihvakyaeifgltsaeaskdirqalksfagkevvfyrpeedagdekgyesfpwfikpahspsrglysvhinpylipffiglq nrftqfrlsetkeitnpyamrlyeslcqyrkpdgsgivslkidwiieryqlpqsyqrmpdfrrrflqvcvneinsrtpmrlsyiekkkgrqtthivfsfrdit lglekrdreilevlilrfgggpvglatlatalsedpgtleevhepylirqgllkrtprgrvatelarrhl lglekrdreilevlilrfgggpvglatlatalsedpgtleevhepylirqgllkrtprgrvatelayrhlgypppv egldefdrkilktiieiyrggpvglnalaaslgveadtlsevyepyllqagflartprgrivtekaykhlkyevp iseevliglplheklfllaivrslkishtpyitfgdaeesykivceeygerprvhsqlwsylndlrekgivetrqnkrgegvrgrttlisigtepldtleavitklikeelr kyeltlqrslpfiegmltnlgamklhkihsflkitvpkdwgynritlqqlegylntladegrlkyiangsyeiv pmkteqkqeqetthknieedrklliqaaivrimkmrkvlkhqqllgevltqlssrfkprvpvikkcidiliekeylervdgekdtysyla gspekilaqiiqehregldwqeaatraslsleetrkllqsmaaagqvtllrvendlyaist eryqawwqavtraleefhsryplrpglareelrsryfsrlparvyqalleewsregrlqlaantvalagftps fsetqkkllkdledkyrvsrwqppsfkevagsfnldpseleellhylvregvlvkindefywhr qalgeareviknlastgpfglaeardalgssrkyvlplleyldqvkftrrvgdkrvvvgn vpkrvywemlatnltdkeyvrtrralileilikagslkieqiqdnlkklgfdevietiendikglintgifieikgrfyqlkdhilqfvipnrgvtkqlv irtfgwvqnpgkfenlkrvvqvfdrnskvhnevknikiptlvkeskiqkelvaimnqhdliytykelvgtgtsirseapcdaiiqatiadqgnkkgyidnwssdgflrwahalgfieyinksdsfvitdvglaysksad gsaiekeilieaissyppairiltlledgqhltkfdlgknlgfsgesgftslpegilldtlanampkdkgeirnnwegssdkyarmiggwldklglvkqgkkefiiptlgkpdnkefishafkitgeglkvlrrakgstkftr 2 Motivation: Let's rebuild SCOP families • Given a SCOP superfamily and its sequences, how can we divide it into families? • First, we need dynamic programming to determine the sequence similarity • Then we do the following: – For all pairs of sequences, call the sequence similarity algorithm and record the similarity into a distance matrix – Next, run hierarchical clustering to cluster the sequences. 3 Python for Bioinformatics Lecture 1: Datatypes and Loops Slides derived from Ian Holmes Department of Statistics University of Oxford 4 Goals of this course • Concepts of computer programming • Rudimentary Python (widely-used language) • Introduction to Bioinformatics file formats • Practical data-handling algorithms • Exposure to Bioinformatics software 5 Literature/Material • Textbook: Python in a Nutshell, Alex Martelli • Textbook: Python Cookbook, Alex Martelli, David Ascher (both published by O'Reilly) • Python Course in Bioinformatics, K. Schuerer/C. Letondal, Pasteur University (pdf) • a lot of online material (see course homepage http://www.biotec.tu-dresden.de/schroeder/group/ teaching/bioinfo2/python.html) 6 Style of this lecture • The color scheme for programs, output and text files: The main program The program output Files are shown in yellow The filename goes here • Interaction with the Python shell: very handy for quick tests. Helps beginners to overcome physiological barrier: Go ahead, try things out! Prompt, (python expects input here) >>> (Python Expression) (immediate Python result) Press Enter 7 General principles of programming • Make incremental changes • Test everything you do – use the Python shell for testing expressions/functions interactively – the edit-run-revise cycle • Write so that others can read it – (when possible, write with others) • Think before you write • Use a good text editor (emacs) 8 Python/Emacs IDE 9 Python: Motivation • Well suited for scripting (better syntax than Perl) • However, capable of Object Orientation • Hence complex data types and large projects feasible, reuse of code (BioPython) • Universal language, Applications in and beyond bioinformatics: Amber, ProHit, PyRat, PyMOL, Gene2EST/Google, CGI, Zope • Compatible with most software technologies: GUI, MPI, OpenGL, Corba, RDB • Test complicated expressions in python shell 10 Python basics • Basic syntax of a Python program: Lines # Elementary Python program beginning print "Hello World" with "#" are comments, and are ignored by Python Single or double quotes enclose a "string literal" print statement tells Python to print the following stuff to the screen Hello World 11 Variables • We can tell Python to "remember" a particular value, using the assignment operator "=": x = 3 print x x = "ACGCGT" print x Binding site for yeast transcription factor MCB 3 ACGCGT • The x is referred to as a "scalar variable". Variable names can contain alphabetic characters, numbers (but not at the start of the name), and underscore symbols "_" 12 Variables and Objects • Everything in Python is an object • An object models a real-world entity • objects possess methods (also called functions) that are typically applied to the object, possibly parameterized • objects can also possess variables, that describe their state Object . Method or variable • e.g. x.upper()is a parameter-less method, that works on the string object x 13 Arithmetic operations… • Basic operators are + - / * % x = 14 y = 3 print "Sum: ", x + y print "Product: ", x * y print "Remainder: ", x % y Could write x *= 2 Could write x += 1 x = 5 print x = x print x = x print "x started as", x * 2 "Then x was", x + 1 "Finally x was" ,x Sum: 17 Product: 42 Remainder: 2 x started as 5 Then x was 10 Finally x was 11 14 … Or interactively >>> x = 14 >>> y = 3 >>> x + y 17 >>> x * y 42 >>> x % y 2 >>> x = 5 >>> print "x started as", x x started as 5 >>> x *= 2 >>> print "Then x was", x Then x was 10 >>> x += 1 >>> print "Finally x was", x Finally x was 11 >>> • This way, you can use Python as a calculator • Can also use += -= /= *= 15 String operations • Concatenation + += a = "pan" b = "cake" a = a + b print a a = "soap" b = "dish" a += b print a pancake soapdish • Can find the length of a string using the function len(x) mcb = "ACGCGT" print "Length of %s is "%mcb, len(mcb) Length of ACGCGT is 6 16 String formatting • Strings can be formatted with place holders for inserted strings (%s) and numbers (%d for digits and %f for floats) • Use Operator % on strings: Formatted String % Insertion Tuple >>> "aaaa%saaaa%saaa"%("gcgcgc","tttt") 'aaaagcgcgcaaaattttaaa' >>> "A range written like this: (%d - %d)" % (2,5) 'A range written like this: (2 - 5)' >>> "Or with preceeding 0's: (%03d - %04d)" % (2,5) "Or with preceeding 0's: (002 - 0005)" >>> "Rounding floats %.3f" % math.pi 'Rounding floats 3.142' >>> "Space holders: _%-7s_ and _%7s_" %("left", "right") 'Space holders: _left _ and _ right_' 17 More string operations Convert to upper case Convert to lower case Convert the string to a list Reverse the list Join all list members x = "A simple sentence" print x print x.upper() print x.lower() xl=list(x) xl.reverse() print "".join(xl) x = x.replace("i", "a") print x print len(x) Translate "i"'s into "a"'s Calculate the length of the string A simple A SIMPLE a simple ecnetnes A sample 17 sentence SENTENCE sentence elpmis A sentence 18 Concatenating DNA fragments dna1 = "accacgt" dna2 = "taggtct" print dna1 + dna2 accacgttaggtct "Transcribing" DNA to RNA dna = "accACgttAGGTct" rna = dna.lower().replace("t", "u") print rna Make it all lower case DNA string is a mixture of upper & lower case Replace "t" with "u" accacguuaggucu 19 Conditional blocks • The ability to execute an action contingent on some condition is what distinguishes a computer from a calculator. In Python, this looks like this: Consistent, level-wise indenting important These indentations tell Python which piece of code is contingent on the condition. if condition: action else: alternative x = 149 y = 100 if x > y: print x,"is greater than",y else: print x,"is less than", y 149 is greater than 100 20 Conditional operators "does not equal" • Numeric: > >= < <= != == x = 5 * 4 y = 17 + 3 if x == y: print x, "equals", y Note that the test for "x equals y" is x==y, not x=y 20 equals 20 • The same operators work on strings as alphabetic comparisons Shorthand syntax for assigning more than one variable at a time (x, y) = ("Apple", "Banana") if y > x: print y, "after", x Banana after Apple 21 Logical operators • Logical operators: and and or x = 222 if x % 2 == 0 and x % 3 == 0: print x, "is an even multiple of 3" 222 is an even multiple of 3 • The keyword not is used to negate what follows. Thus not x < y means the same as x >= y • The keyword False (or the value zero) is used to represent falsehood, while True (or any non-zero value, e.g. 1) represents truth. Thus: if True: if False: if -99: print "True is true" print "False is true" print "-99 is true" True is true -99 is true 22 Loops • Here's how to print out the numbers 0 to 9: The indented code x = 0 while x < 10: is repeatedly print x, executed as long x+=1 as the condition x<10 remains true 0 1 2 3 4 5 6 7 8 9 Equivalent to x = x + 1 • This is a while loop. The code is executed while the condition is true. 23 A common kind of loop • Let's dissect the code of the while loop again: Initialisation Test for completion Continuation x = 0 while x < 10: print x, x+=1 • Alternatively, the for loop construct iterates through a list Iteration variable Generates a list [0,1, …,9] for x in range(10): print x, 24 For loop features • Loops can be used with all iteratable types, ie.: lists, strings, tuples, iterators, sets, file handlers >>> for nucleotide in "actgc": ... print nucleotide, a c t g c • Stepsizes can be specified with the 3. argument of the slice constructor (negative values for iterating backwards) >>> ... 0 7 >>> ... c g for number in range(50)[::7]: print number, 14 21 28 35 42 49 for nucleotide in "actgc"[::-1]: print nucleotide, t c a 25 Reading Data from Files • To read from a file, we can conveniently iterate through it linewise with a for-loop and the open function. Internally a filehandle is maintained during the loop. for line in open("sequence.txt"): print line, The comma prevents print's automatic newline This code snippet opens a file called "sequence.txt" in the in the current directory, and iterates through it line by line sequence.txt >CG11604 TAGTTATAGCGTGAGTTAGT TGTAAAGGAACGTGAAAGAT AAATACATTTTCAATACC >CG11604 TAGTTATAGCGTGAGTTAGT TGTAAAGGAACGTGAAAGAT AAATACATTTTCAATACC 26 Python for Bioinformatics Lecture 2: Sequences and Lists 27 Summary: scalars and loops • • • • • • • Assignment operator x = 5 Arithmetic operations y = x * 3 String operations s = "Concatenating " + "strings" Conditional tests if y > 10: print s Logical operators if y > 10 and not s == "": print s for x in range(10): print x Loops Reading a file for line in open("sequence.txt"): print line, 28 Pattern-matching • A very sophisticated kind of logical test is to ask whether a string contains a pattern • e.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT? name = "YBR007C" dna="TAATAAAAAACGCGTTGTCG" if "ACGCGT" in dna: print name, "has MCB!" The pattern for the MCB binding site 20 bases upstream of the yeast gene YBR007C The membership operator in YBR007C has MCB! 29 FASTA format • A format for storing multiple named sequences in a single file >CG11604 Name of sequence is preceded by > symbol NB sequences can span multiple lines • This file contains 3' UTRs for Drosophila genes CG11604, CG11455 and CG11488 TAGTTATAGCGTGAGTTAGT TGTAAAGGAACGTGAAAGAT AAATACATTTTCAATACC >CG11455 TAGACGGAGACCCGTTTTTC TTGGTTAGTTTCACATTGTA AAACTGCAAATTGTGTAAAA ATAAAATGAGAAACAATTCT GGT >CG11488 TAGAAGTCAAAAAAGTCAAG TTTGTTATATAACAAGAAAT CAAAAATTATATAATTGTTT TTCACTCT 30 Call this file fly3utr.txt Printing all sequence names in a FASTA database for line in open("fly3utr.txt"): if line.startswith(">"): print line, >CG11604 >CG11455 >CG11488 31 Finding all sequence lengths The rstrip statement length=0 name="" trims the white space for line in open("/home/bioinf/ah/tmp/sequence.txt"): characters line=line.rstrip() off the right end. if line.startswith(">"): Try it without this and if name and length: see what happens – print name, length and if you can work name=line[1:] out why length=0 else: length+=len(line) >CG11604 print name, length TAGTTATAGCGTGAGTTAGT TGTAAAGGAACGTGAAAGAT AAATACATTTTCAATACC >CG11455 TAGACGGAGACCCGTTTTTC TTGGTTAGTTTCACATTGTA AAACTGCAAATTGTGTAAAA ATAAAATGAGAAACAATTCT GGT >CG11488 TAGAAGTCAAAAAAGTCAAG TTTGTTATATAACAAGAAAT CAAAAATTATATAATTGTTT TTCACTCT CG11604 58 CG11455 83 CG11488 69 32 Reverse complementing DNA • A common operation due to double-helix symmetry of DNA Start by making string lower case again. This is generally good practice Replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a' Reverse the list def revcomp(dna): replaced=list(dna.lower(). replace("a","x").replace("t","a"). replace("x", "t").replace("g","x"). replace("c","g").replace("x", "c")) replaced.reverse() return "".join(replaced) print revcomp("accACgttAGgtct ") agacctaacgtggt 33 Lists • A list is a list of variables nucleotides = ['a', 'c', 'g', 't'] print "Nucleotides: ", nucleotides Nucleotides: ['a', 'c', 'g', 't'] • We can think of this as a list with 4 entries a c g t element 0 element 1 element 2 element 3 the list is the set of all four elements Note that the element indices start at zero. 34 List literals • There are several, equally valid ways to assign an entire array at once. This is the most common: a commaseparated list, delimited by squared brackets a = [1,2,3,4,5] print "a = ",a b = ['a','c','g','t'] print "b = ",b c = range(1,6) print "c = ",c d = "a c g t".split() print "d = ", d a b c d = = = = [1,2,3,4,5] ['a','c','g','t'] [1,2,3,4,5] ['a','c','g','t'] 35 Accessing lists • To access list elements, use square brackets e.g. x[0] means "element zero of list x" x = ['a', 'c', 'g', 't'] i=2 print x[0], x[i], x[-1] a g t • Remember, element indices start at zero! • Negative indices refer to elements counting from the end e.g. x[-1] means "last element of list x" 36 List operations • You can sort and reverse lists... x = ['a', 't', 'g', 'c'] print "x =",x x.sort() print "x =",x x.reverse() print "x =",x x = a t g c x = a c g t x = t g c a • You can read the entire contents of a file into an array (each line of the file becomes an element of the array) seqfile = open, "C:/sequence.txt" x = <FILE> 37 Applying Methods to Objects • Instances of lists, strings, etc. are objects with built-in methods • Explore available methods using dir: >>> dir("hello") ['__add__', … ,'__str__', 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'replace', 'rfind', 'rindex', 'rjust', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'] >>> help("hello".count) (…)Return the number of occurrences of substring sub in string S[start:end]. Optional arguments start and end are interpreted as in slice notation. >>> "hello".count("l") 2 String object List of applicable methods Method . (dot) applies method to object 38 List operations Multiplying lists with * pop removes the last element of a list append adds an element to the end of a list concatenating lists with + or += Removing the first occurrence of an element Position of an element >>> >>> [1, >>> x=[1,0]*5 x 0, 1, 0, 1, 0, 1, 0, 1, 0] while 0 in x: print x.pop(), 0 1 >>> [1] >>> >>> [1, >>> >>> [1, >>> >>> [1, 0 1 0 1 0 1 0 x x.append(2) x 2] x+=x x 2, 1, 2] x.remove(2) x 1, 2] >>> x.index(2) 2 39 for loop revisited • Finding the total of a list of numbers: for statement loops through each entry in a list val = [4, 19, 1, 100, 125, 10] total = 0 for x in val: total += x print total 259 • Equivalent to: val = [4, 19, 1, 100, 125, 10] total = 0 for i in range(len(val)): total += val[i] print total 259 40 Modules • Additional functionality, that is not part of the core language, is in modules like – sys (system) – re (regular expressions) – math (mathematics) • Load modules with import >>> import math >>> help(math) Help on built-in module math: … • You can write your own modules and import them 41 The sys.argv list • A special list is sys.argv • This contains the command-line arguments if the program is invoked at the command line • It's a way for the user to pass information into the program, if you don't have a graphical interface with which to do this import sys print sys.argv File args.py ah@studipool1> python args.py abc 123 ['args.py', 'abc', '123'] Output at command line 42 Converting a sequence into a list Data types can be converted. Here the list function converts a string into a list. Triple quotes allow for strings that stretch over several lines >>> dna="acgtcgaga" >>> list(dna) ['a', 'c', 'g', 't', 'c', 'g', 'a', 'g', 'a'] >>> """You can also make use of long strings and the split function""".split() ['You', 'can', 'also', 'make', 'use', 'of', 'long', 'strings', 'and', 'the', 'split', 'function'] • The underlying programming language C treats all strings as lists 43 Taking a slice of a list • The syntax x[i:j] returns a list containing elements i,i+1,…,j-1 of list x nucleotides = ['a', 'g', 'c', 't'] purines = nucleotides[0:2] # nucleotides[:2] also works pyrimidines = nucleotides[2:4]# nucleotides[2:] also works print "Nucleotides:", nucleotides print "Purines:", purines print "Pyrimidines:", pyrimidines Nucleotides: ['a', 'g', 'c', 't'] Purines: ['a', 'g'] Pyrimidines: ['c', 't'] 44 Applying a function to a list • The map command applies a function to every element in an array • Similar syntax to list: map(EXPR,LIST) applies EXPR to every element in LIST • EXPR can be arbitrary function, defined elsewhere or lambda calculus expression • Lambda calculus: provides "anonymous" function, constructed with keyword lambda, a set of parameters, and an expression with these • Example: multiply every number by 3 >>> map(lambda x: x*3, [1,2,3]) [3, 6, 9] 45 Python for Bioinformatics Lecture 3: Patterns and Functions 46 Review: pattern-matching • The following code: if "ACGCGT" in dna: print "Found MCB binding site!" prints the string "Found MCB binding site!" if the pattern "ACGCGT" is present in the string variable "sequence" • We can replace the first occurrence of ACGCGT with the string _MCB_ using the following syntax: dna.replace("ACGCGT","_MCB_", 1) pattern replacement count • We can replace all occurrences by omitting the optional count argument dna.replace("ACGCGT","_MCB_") 47 Regular expressions • Python provides a pattern-matching engine • Patterns are called regular expressions • They are extremely powerful • Often called "regexps" for short • import module re 48 Motivation: N-glycosylation motif • Common post-translational modification • Attachment of a sugar group • Occurs at asparagine residues with the consensus sequence NX1X2, where – X1 can be anything (but proline inhibits) – X2 is serine or threonine • Can we detect potential N-glycosylation sites in a protein sequence? 49 Building regexps • In general square brackets denote a set of alternative possibilities • Use - to match a range of characters: [AZ] • . matches anything • \s matches spaces or tabs • \S is anything that's not a space or tab • [^X] matches anything but X 50 Using Regular Expressions • Compile a regular expression object (pattern) using re.compile • pattern has a number of methods (try dir(pattern)), eg. – match (in case of success returns a Match object, otherwise None) – search (scans through a string looking for a match) – findall (returns a list of all matches) >>> import re >>> pattern = re.compile('[ACGT]') >>> if pattern.match("A"): print "Matched" Matched >>> if pattern.match("a"): print "Matched" >>> successful match unsuccessful, returns None 51 by default case sensitive Matching alternative strings • /(this|that)/ matches "this" or "that" • ...and is equivalent to /th(is|at)/ case unsensitive search pattern >>> pattern=re.compile("(this|that|other)", re.IGNORECASE) >>> pattern.search("Will match THIS") ## success <_sre.SRE_Match object at 0x00B52860> >>> pattern.search("Will also match THat") ## success <_sre.SRE_Match object at 0x00B528A0> >>> pattern.search("Will not match ot-her") ## will return None >>> Python returns a description of the match object 52 Matching multiple characters • x* matches zero or more x's • x+ matches one or more x's • x{n} matches n x's • x{m,n} matches from m to n x's Word and string boundaries • ^ matches the start of a string • $ matches the end of a string • \b matches word boundaries 53 "Escaping" special characters • \ is used to "escape" characters that otherwise have meaning in a regexp • so \[ matches the character "[" – if not escaped, "[" signifies the start of a list of alternative characters, as in [ACGT] • All special characters: . ^ $ * + ? { [ ] \ | ( ) 54 Substitutions/Match Retrieval • regexp methods can be used without compiling (less efficient but easier to use) • Example re.sub (substitution): >>> re.sub("(red|blue|green)", "color", "blue socks and red shoes") 'color socks and color shoes' • Example re.findall: >>> e,raw,frm,to = re.findall("\d+", \ "E-value: 4, \ Raw Bit Score: 165, \ Match position: 362-419") >>> print e, raw, frm, to 4 165 362 419 matches one or more digits The result, a list of 4 strings, is assigned to 4 variables \ allows multiple line commands alternatively, construct multi-line strings using triple quotes """ …""" 55 N-glycosylation site detector >>> protein=""" MGMFFNLRSNIKKKAMDNGLSLPISRNGSSNNIKDKRSEHNSNSLKGKYRYQPRSTPSKFQLTVSITSLI IIAVLSLYLFISFLSGMGIGVSTQNGRSLLGSSKSSENYKTIDLEDEEYYDYDFEDIDPEVISKFDDGVQ HYLISQFGSEVLTPKDDEKYQRELNMLFDSTVEEYDLSNFEGAPNGLETRDHILLCIPLRNAADVLPLMF KHLMNLTYPHELIDLAFLVSDCSEGDTTLDALIAYSRHLQNGTLSQIFQEIDAVIDSQTKGTDKLYLKYM DEGYINRVHQAFSPPFHENYDKPFRSVQIFQKDFGQVIGQGFSDRHAVKVQGIRRKLMGRARNWLTANAL KPYHSWVYWRDADVELCPGSVIQDLMSKNYDVI""".upper().replace("\n","") >>> for match in re.finditer("N[^P][ST]", protein): print match.group(), match.span() NGS (26, 29) NLT (214, 217) NGT (250, 253) re.finditer provides an iterator over match-objects N[^P][ST]- the main regular expression multi-line string, upper case, line breaks removed match.group and match.span print the actual matched string and the position-tuple. Altenatively, you can print gene[match.start():match.end()] 56 PROSITE and Pfam PROSITE – a database of regular expressions for protein families, domains and motifs Pfam – a database of Hidden Markov Models (HMMs) – equivalent to probabilistic regular expressions 57 Another Example: • Ferredoxins are a group of iron-sulfur proteins which mediate electron transfer • The share the motif C, then two residues, C, then two residues, C, then three residues, C, then either P,E, or G • The 4 C's are 4Fe-4S ligands • What is the corresponding Python 58 Another Example: • Ferredoxins are a group of iron-sulfur proteins which mediate electron transfer • The share the motif C, then two residues, C, then two residues, C, then three residues, C, then either P,E, or G • The 4 C's are 4Fe-4S ligands • What is the corresponding Python code? • C.{2}C.{2}C.{3}C[PEG] 59 Another Example: 60 Courtesy of Chris Bystroff 61 Courtesy of Chris Bystroff Another Example: 62 Courtesy of Chris Bystroff Another Example • Regular expressions are useful to parse text • Example: extract information from Blast output, such as – species name – E value – Score – ID 63 Another Example BLASTP 2.2.6 [Apr-09-2003] RID: 1062117117-16602-2157828.BLASTQ3 Query= gi|6174889|sp|P26367|PAX6_HUMAN Paired box protein Pax-6 (Oculorhombin) (Aniridia, type II protein). (422 letters) Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF 1,509,571 sequences 486,132,453 total letters Results of PSI-Blast iteration 1 Sequences with E-value BETTER than threshold Sequences producing significant alignments: Score E (bits) Value gi|4505615|ref|NP_000271.1| paired box gene 6 isoform a Paired box h... 781 0.0 gi|189353|gb|AAA59962.1| oculorhombin >gi|189354|gb|AAA59963.1| oculo... 780 0.0 gi|6981334|ref|NP_037133.1| paired box homeotic gene 6 [Rattus norveg... 778 0.0 gi|26389393|dbj|BAC25729.1| unnamed protein product [Mus musculus] 776 0.0 gi|7305369|ref|NP_038655.1| paired box gene 6 small eye Dickie's sm... 776 0.0 gi|383296|prf||1902328A PAX6 gene 775 0.0 gi|4580424|ref|NP_001595.2| paired box gene 6 isoform b Paired box h... 775 0.0 gi|18138028|emb|CAC80516.1| paired box protein [Mus musculus] 773 0.0 gi|2576237|dbj|BAA23004.1| PAX6 protein [Gallus gallus] 770 0.0 gi|27469846|gb|AAH41712.1| Similar to paired box gene 6 [Xenopus laevis] 768 0.0 … 64 Functions • Often, we can identify self-contained tasks that occur in so many different places we may want to separate their description from the rest of our program. • Code for such a task is called a function NB: Python • Examples of such tasks: provides the – finding the length of a sequence – reverse complementing a sequence – finding the mean of a list of numbers function len(x) to do this already 65 Maximum element of a list • Function to find the largest entry in a list def find_max(data): max = data.pop() for x in data: if x > max: max = x return max data = [1, 5, 1, 12, 3, 4, 6] print "Data:", data print "Maximum:", find_max(data) Function declaration Function body Function result Function call Data: [1, 5, 1, 12, 3, 4, 6] Maximum: 12 66 Reverse complement The arguments follow the from string import maketrans function name in parantheses def revcomp(seq): (in this case seq, the translation = maketrans("agct", "tcga") sequence to be revcomp'd) comp = seq.translate(translation) rcomp = comp[::-1] # reversing comp return rcomp By default, translation and comp are local variables, ie., they "live" only inside dna = "cggcgt" the surrounding function rev = revcomp(dna) print "Revcomp of %s is %s"%(dna, rev) return announces that the return value of this function is whatever's in rcomp string formatted with place holders Revcomp of cggcgt is acgccg 67 revcomp goes OO self refers to the current object, gives access to all its variables from string import * Class Constructor saves input sequence as object variable in lower case method calls class DNA: def __init__(self, sequence): self.seq=sequence.lower() def revcomp(self): translation = maketrans("agct", "tcga") comp = self.seq.translate(translation) self.revcomp = comp[::-1] def report(self): print "Revcomp of %s is %s"%\ (self.seq,self.revcomp) dna = DNA("accggcatg")# Creating a DNA object dna.revcomp() dna.report() Useful to structure code : add additional DNA sequence functionality to this class, eg. a function that calculates GC-contents, translation to protein etc. 68 Mean & standard deviation Importing square root function from module math Function takes a list of n numeric arguments Function returns a two-element tuple: (mean,sd) from math import sqrt def mean_sd(data): n = len(data) sum = 0 sqSum = 0 for x in data: sum += x sqSum += x * x mean = sum / n variance = sqSum / n - mean * mean sd = sqrt (variance) return (mean, sd) data = [1, 5, 1, 12, 3, 4, 6] (mean, sd) = mean_sd (data) print "Data:", data print "Mean:", mean print "Standard deviation:", sd 69 Including variables in patterns • Function to find number of instances of a given binding site in a sequence finds rightmost position, where pattern matches in text no match call recursively with text to the left of rightmost match, count up one def count_matches(pattern, text): pos=text.rfind(pattern) if pos==-1: return 0 else: return count_matches(pattern, text[:pos])+1 print count_matches("ACGCGT", "ACGCGTAAGTCGGCACGCGTACGCGT") 3 NB: Built-in string method count also does the job text="ACGCGTAAGTCGGCACGCGTACGCGT" print text.count("ACGCGT") 70 Python for Bioinformatics Lecture 4: Dictionaries 71 Data structures • Suppose we have a file containing a table of Drosophila gene names and cellular compartments, one pair on each line: Cyp12a5 MRG15 Cop bor Bx42 Mitochondrion Nucleus Golgi Cytoplasm Nucleus Suppose this file is in "c:/genecomp.txt" 72 Reading a table of data • We can split each line into a 2-element list using the split command. • This breaks the line at each space: genes, comps= [], [] for line in open("C:/genecomp.txt"): gene, comp = line.split() genes.append(gene) comps.append(comp) print "Genes:", " - ".join(genes) print "Compartments:", " ".join(comps) Genes: Cyp12a5 - MRG15 – Cop - bor - Bx42 Compartments: Mitochondrion Nucleus Golgi Cytoplasm Nucleus • The opposite of split is join, which makes a string from a list of strings 73 Finding an entry in a table • The following code assumes that we've already read in the table from the file: import sys geneToFind = sys.argv[1] for i in range(len(genes)): if genes[i]==geneToFind: print "Gene:", genes[i] print "Compartment:", comps[i] sys.exit() print "Couldn't find gene" • Example: sys.argv[1] = "Cop" Searching for gene Cop Gene: Cop Compartment: Golgi 74 Binary search • The previous algorithm is inefficient. If there are N entries in the list, then on average we have to search through ½N entries to find the one we want. • For the full Drosophila genome, N=12,000. This is painfully slow. • An alternative is the Binary Search algorithm: Start with a sorted list. Compare the middle element with the one we want. Pick the half of the list that contains our element. Iterate this procedure to "home in" on the right element. This takes log2(N) steps. 75 Dictionaries (hashes) • Implementing algorithms like binary search is a common task in languages like C. • Conveniently, Python provides a type of array called a dictionary (also called a hash) that does something similar for you. • A dictionary is a set of key:value pairs (like our gene:compartment table) comp["Cop"] = "Golgi" Squared brackets [] are used to index a dictionary 76 keys and values • keys returns the list of keys in the hash – e.g. names, in the name2seq hash • values returns the list of values – e.g. sequences, in the name2seq hash name2seq = read_FASTA ("C:/fly3utr.txt") print "Sequence names: ", " ".join(name2seq.keys()) print "Total length: ", len("".join(name2seq.values())) Sequence names: CG11488 CG11604 CG11455 Total length: 210 77 Getting familiar with hashes >>> tlf={"Michael" : 40062, \ "Bingding" : 40064, "Andreas": 40063 } >>> tlf.keys() ['Bingding', 'Andreas', 'Michael'] >>> tlf.values() [40064, 40063, 40062] >>> tlf["Michael"] 40062 >>> tlf.has_key("Lars") False >>> tlf["Lars"] = 40070 >>> tlf.has_key("Lars") # now its there True >>> for name in tlf.keys(): ... print name, tlf[name] ... Lars 40070 Bingding 40064 Andreas 40063 Michael 40062 Creating an initial phone book Asking for all keys Asking for all values Asking for a value, given a key Checking whether a key is in the list Inserting a single key:value pair Looping through the dictionary 78 Reading a table using hashes import sys comps={} for line in open("C:/genecomp.txt"): gene, comp = line.split() comps[gene] = comp geneToFind=sys.argv[1] print "Gene:", geneToFind print "Compartment:", comp[geneToFind] ...with sys.argv[1] = "Cop" as before: Gene: Cop Compartment: Golgi 79 Reading a FASTA file into a hash def read_fasta(filename): name = None name2seq = {} for line in open(filename): if line.startswith(">"): if name: name2seq[name]=seq name=line[1:].rstrip() seq="" else: seq+=line.rstrip() name2seq[name]=seq return name2seq if name only evaluates to false, if it is still None (when going over first line) new name is derived from line from second letter on, with new-line character removed Final entry, after loop 80 Formatted output of sequences def print_seq(name, seq, width=50): print ">"+name i=0 while i<len(seq): print seq[i : i+width] i+=width Default values, assigned in parameter line, placed rightmost Here, width default is 50-column output print_seq("Tata-box1", "TA"*55) print_seq("Tata-box2", "TA"*55, 30) >Tata-box1 TATATATATATATATATATATATATATATATATATATATATATATATATA TATATATATATATATATATATATATATATATATATATATATATATATATA TATATATATA >Tata-box2 TATATATATATATATATATATATATATATA TATATATATATATATATATATATATATATA TATATATATATATATATATATATATATATA TATATATATATATATATATA 81 Files of sequence names • Easy way to specify a subset of a given FASTA database • Each line is the name of a sequence in a given database • e.g. CG1167 CG685 CG1041 CG1043 82 Get named sequences • Given a FASTA database and a "file of sequence names", print every named sequence: (fasta, fosn) = sys.argv[1:3] name2seq = read_FASTA (fasta) for name in open(fosn): name = name.rstrip() if name2seq.has_key[name]: seq = name2seq[name] print_seq (name, seq) else: print "Can't find sequence: %s."%name, "Known sequences: ", " ".join(name2seq.keys()) 83 Common Set operations • Two files of sequence names: • What is the overlap/difference/union? • Find eg. intersection using sets from sets import Set geneSet1 = Set([]) geneSet2 = Set([]) for line in open("C:/fosn1.txt"): geneSet1.add(line.rstrip()) for line in open("C:/fosn2.txt"): geneSet2.add(line.rstrip()) CG1167 CG685 CG1041 CG1043 CG215 CG1041 CG483 CG1167 CG1163 C:/fosn1.txt C:/fosn2.txt difference AA >>> geneSet1 Set(['CG1043', 'CG1041', 'CG1167', 'CG685']) >>> geneSet1.intersection(geneSet2) Set(['CG1041', 'CG1167']) >>> geneSet2.difference(geneSet1) Set(['CG483', 'CG215', 'CG1163']) >>> geneSet1.union(geneSet2) Set(['CG483', 'CG1043', 'CG1041', 'CG1167', 'CG685', 'CG1163', 'CG215']) intersection AA B B union AA B 84 More Set operations • Since every element in a Set occurs only once, sets can be used to reduce redundancy >>> from sets import Set >>> Set([1,2,3,1,3,3]) Set([1, 2, 3]) • A is a superset of B when A fully contains B Test: A.issuperset(B) • A is a subset of B when A is fully contained in B Test: A.issubset(B) >>> pqs=Set("1kim 1dan 1bob".split()) >>> pdb=Set("1bob 3mad 1dan 2bad 1kim".split()) >>> pqs.issubset(pdb) True 85 The genetic code as a hash aa = {'ttt':'F', 'ttc':'F', 'tta':'L', 'ttg':'L', 'tct':'S', 'tcc':'S', 'tca':'S', 'tcg':'S', 'tat':'Y', 'tac':'Y', 'taa':'!', 'tag':'!', 'tgt':'C', 'tgc':'C', 'tga':'!', 'tgg':'W', 'ctt':'L', 'ctc':'L', 'cta':'L', 'ctg':'L', 'cct':'P', 'ccc':'P', 'cca':'P', 'ccg':'P', 'cat':'H', 'cac':'H', 'caa':'Q', 'cag':'Q', 'cgt':'R', 'cgc':'R', 'cga':'R', 'cgg':'R', 'att':'I', 'atc':'I', 'ata':'I', 'atg':'M', 'act':'T', 'acc':'T', 'aca':'T', 'acg':'T', 'aat':'N', 'aac':'N', 'aaa':'K', 'aag':'K', 'agt':'S', 'agc':'S', 'aga':'R', 'agg':'R', 'gtt':'V', 'gtc':'V', 'gta':'V', 'gtg':'V', 'gct':'A', 'gcc':'A', 'gca':'A', 'gcg':'A', 'gat':'D', 'gac':'D', 'gaa':'E', 'gag':'E', 'ggt':'G', 'ggc':'G', 'gga':'G', 'ggg':'G' } 86 Translating: DNA to protein def translate(dna): length = len(dna) if len(dna) % 3 != 0: print "Warning: Length is not a multiple of 3!" sys.exit() protein = "" i = 0 while i < length: codon = dna[i:i+3] if not aa.has_key(codon): print "Codon %s is illegal"%codon sys.exit() protein += aa[codon] i+=3 return protein >>> translate("gatgacgaaagttgt") 'DDESC' >>> translate("gatgacgaaagttgta") Warning: Length is not a multiple of 3! … (SystemExit) >>> translate("gatgacgiaagttgt") Codon gia is illegal … (SystemExit) 87 Counting residue frequencies def count_residues(seq): freq={} seq = seq.lower() for i in range(len(seq)): if freq.has_key(seq[i]): freq[seq[i]]+=1 else: freq[seq[i]]=1 return freq freq = count_residues("gatgacgaaagttgt") for residue in freq.keys(): print residue,":", freq[residue] g a c t : : : : 5 5 1 4 88 Counting N-mer frequencies def count_nmers(seq, n): freq={} seq = seq.lower() for i in range(len(seq)-n+1): nmer=seq[i : i+n] if freq.has_key(nmer): freq[nmer]+=1 # incr. according counter else: freq[nmer]=1 # first occurence return freq freq = count_nmers("gatgacgaaagttgt", 2) for residue in freq.keys(): print residue,":", freq[residue] cg: tt: ga: tg: gt: aa: ac: at: ag: 1 1 3 2 1 2 1 1 1 89 N-mer frequencies for a whole file from read_fasta import read_fasta def count_nmers(seq, n, freq): seq = seq.lower() for i in range(len(seq)-n+1): nmer=seq[i : i+n] if freq.has_key(nmer): freq[nmer]+=1 else: freq[nmer]=1 return freq name2seq = read_fasta("z:/tmp/fly3utr.txt") freq = {} ## count for each sequence for seq in name2seq.values(): freq = count_nmers(seq, 2, freq) ## display statistics for residue in freq.keys(): print residue,":", freq[residue] We reuse a function we wrote earlier by import ing it.The first is the filename (without .py), the second the function name ct tc tt cg ga tg gc gt aa ac gg at ca ag ta cc : : : : : : : : : : : : : : : : 5 9 26 4 11 12 2 17 39 10 4 17 11 15 20 2 Note how we keep passing freq back into the count_nmers function, to get cumulative counts Files and filehandles This fh is the filehandle • • • • • • • • Opening a file: Closing a file: Reading a line: Reading an array: Printing a line: Read-only: Write-only: Test if file exists: fh = open(filename) fh.close() data = fh.readline() data = fh.read() fh.write(line) fh = open(filename, "r") fh = open(filename, "w") import os if os.path.exists(filename): print "filename exists!" 91 Database access from Python # use the database package with all the DB relevant sub-routines import MySQLdb # import class that enables data acquirement as dictionaries from MySQLdb.cursors import DictCursor # Connection to database with access specification conn = MySQLdb.connect(db="scop", # name of database host="myserver", # name of server user="guest", # username passwd="guest") # password # create access pointer that retrieves dictionaries cursor = conn.cursor(DictCursor) # send a query cursor.execute("SELECT * FROM cla LIMIT 10") # retrieve all rows as a list of dictionaries data = cursor.fetchall() # close connection conn.close() 92 Local vs. global variables def foo(): a=3 print a Parameters are local def foo(a): print a a=6 print a foo() print a a=6 print a foo(3) print a 6 3 6 6 3 6 does not affect global a def foo(): global a a=3 print a a=6 print a foo() print a • Function variables and parameters are by default local • Unless you declare them to be global 6 3 3 93 does affect global a References in Python • Lists, Dictionaries and other Datatypes are usually referenced, ie. when assigning a variable, no data is copied: [1,2,3,4] assigning b=a • a • • b b points to the same list as a >>> >>> >>> >>> [1, "Real copies" with copy module Don't worry about any referencing, Python is doing the job! But be aware when you want to copy objects >>> >>> >>> >>> [1, a = [1,2,3,4] b=a b[2]=7 a 2, 7, 4] from copy import copy b = copy(a) b[2]=3 a 2, 7, 4] 94 Matrices >>> m = [[1, 2], [3, 4]] • Easy solution: Lists of lists >>> m[1] in core Python [3, 4] >>> m[1][1] • Access an element at 4 position (i,j) in a list of lists: selecting from the i'th row the j'th element • Disadvantages: Operations like Addition/Multiplications on lists (of lists) would be slow, need to be implemented • Luckily: big library already available, fast (since implemented in C), rich functionality 95 Matrices with numarray • Faster, more calculations (reshaping, built-in matrix operations) with external package numarray • various matrix creation methods with numarray: – – – – – from list of lists zeros/2 ones/2 identity/1 from a function – etc. • Convenient access of multidimensional array elements >>> from numarray import * >>> m1 = array(m);m1 array([[1, 2], [3, 4]]) >>> m1.getshape() (2, 2) >>> zeros((3,5)) array([[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]) >>> m2=array(arange(8)) >>> m2.setshape((2,4)) >>> m2 array([[0, 1, 2, 3], [4, 5, 6, 7]]) >>> m2[1,1] 5 96 Matrices with numarray • You can select rows and columns, or even submatrices (same "slicing" as with lists) • You can apply a scalar operation like – addition + – multiplication * – sine or cosine to an array >>> m1[:,1] # second column array([2, 4]) >>> #arange produces one-dim. array >>> m = arange(9, shape=(3,3));m array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) >>> m[1:,1:] array([[4, 5], [7, 8]]) >>> m[1] + 3 array([6, 7, 8]) >>> m[1] * 3 array([ 9, 12, 15]) >>> m1[1] * 3 array([ 9, 12]) >>> sin(m1) array([[ 0.84147098, 0.90929743], [ 0.14112001, -0.7568025 ]]) 97 More Math • Remember the mean and standard deviation from Lecture 3? • Reuse of existing packages makes live easier: >>> data = array([1, 5, 1, 12, 3, 4, 6]) >>> data.mean() 4.5714285714285712 >>> data.stddev() 3.7796447300922722 • Or finding the maximum in a list becomes now: >>> data[argmax(data)] 12 • numarray also provides functions for dot product, vector calculations etc. >>> dot(array([1,2,3]), array([1,2,3])) 14 >>> array([1,2,3]) + array([4,5,6]) array([5, 7, 9]) 98 Longest Common Subsequence from numarray import * seq1="ATCTGATC" seq2="TGCATA" len1 = len(seq1) len2 = len(seq2) def max3(a,b,c): return max( max(a,b) ,c) #Create an array val of length len1+1 times len2+1 val=zeros((len1+1,len2+1)) for i in range(1,len1+1): for j in range(1,len2+1): if seq1[i-1]==seq2[j-1]: val[i,j] = max3(val[i-1,j], val[i,j-1], val[i-1,j-1]+1) else: val[i,j] = max3(val[i-1,j], val[i,j-1], val[i-1,j-1]) print val lcs = val[len1,len2] print "The longest common subsequence of %s and %s is %d (%f)"% \ 99 (seq1, seq2, lcs, float(lcs) / max(len1,len2)) Longest Common Subsequence Output [[0 0 0 0 0] [0 1 1 1 1] [0 1 1 1 1] [0 1 1 1 1] [0 1 2 2 2] [0 1 2 3 3] [0 1 2 3 4] [0 1 2 3 4] [0 1 2 3 4]] The longest common subsequence of ATCTGATC and TGCATA is 4 (0.500000) Result of print val Final Result 100 Classes • Define a class to store PDB residues. A residue has: a name, a position in the sequence, and a list of atoms. An atom has a name and coordinates. Define 2 methods: add_residue and add_atom class PDBStructure: def add_residue(self, name, posseq): residue = {'name': resname, 'posseq': posseq, 'atoms': []} self._residues.append(residue) return residue def add_atom(self, residue, name, coord): atom = {'residue': residue, 'name': name, 'coord': coord } residue['atoms'].append(atom) return atom 101 Classes: Usage struct = PDBStructure() residue = struct.add_residue(name = "ILE", posseq = 1 ) struct.add_atom(residue, name = "N", coord = (23.46800041, -8.01799965, -15.26200008)) struct.add_atom(residue, name = "CZ", coord = (125.50499725, 4.50500011, -19.14800072)) residue = struct.add_residue(name = "LYS", posseq = 2 ) struct.add_atom(residue, name = "OE1", coord = (126.12000275, -1.78199995, -15.04199982)) print struct.residues [{'name': 'ILE', 'posseq': 1, 'atoms': [ \ {'name': 'N', 'coord': (23.468000409999998, \ -8.0179996500000001, -15.26200008)}, \ {'name': 'CZ', 'coord': (125.50499725, \ 4.5050001100000001, -19.148000719999999)}]}, \ {'name': 'LYS', 'posseq': 2, 'atoms': [ \ {'name': 'OE1', 'coord': (126.12000275, \ -1.7819999500000001, -15.041999819999999)}]}] 102