Python and Biopython
Scripting for Busy Bioinformaticians

Jeffrey Chang
Stanford University
11 Aug 2003, CSB2003

Introduction
[Figure: timeline with axis ticks at 1995, 1997, 1999, 2001, 2003]
Jeffrey Chang <[email protected]>

Outline
• Act I: So what is this Python I keep hearing about?
• Act II: Python, it is nice to meet you!
• Intermission
• Act III: Let's write some code!
• Act IV: Biopython: Batteries Included.
• Act V: Where do we go from here?

Act I
So what is Python?

Genealogy
[Figure: timeline from 1950 to 2000 showing Assembly, BASIC, LISP, C, C++, perl, FORTRAN, Java, and Python]
• Increasing layers of abstraction
  • structured/object-oriented programming
  • memory handling
  • sophistication in data types

Happy Birthday!
• ABC Programming Language: usability tested!
• Guido van Rossum
• Aug 13, 1991

Raison D'être
    FORTRAN   numerical analysis
    LISP      symbolic computation
    C         system programming
    C++       objects, speed, compatibility with C
    Java      objects, internet
    perl      system administration
    python    general programming, and more!

Language for Research
• minimize development time
• interactive
  • examine your data
  • tweak algorithms
• suitable for library development
• sociable
  • other research tools
  • internet
• multiplatform

Python for Research
Python features that address these needs:
• high level data types
• garbage collection
• interpreted
• interactive environment
• rich module support
• extensible with C
• multiplatform

Python vs. Perl
Python strengths:
• Object-oriented
• Handles numbers
• Clean syntax
• Clean extensions to C
Perl strengths:
• Popular
• Available
• Mature libraries
• Familiar syntax
• String handling

Python vs. Java
Python strengths:
• Libraries
• Sociable
• High level data types
• Easy to prototype
Java strengths:
• Popular
• Industry support
• GUI tools
• Fast
• Good development tools

Python in Biology

Where can I find Python?
Officially available from http://www.python.org for:
• Windows
• Macintosh
• Linux
• Source

Also available on:
[Figure not recoverable from the text]

Act II
Python, it is nice to meet you!

Interacting with Python

Python Interpreter
    Python 1.5.2 (#1, Aug 2 1999, 18:47:55) [GCC egcs-2.91.66 19990314 (egcs-1.1.2 on sunos5
    Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
    >>> _
• Triple prompt means go!

Our First Script
    >>> print "hello world"
    hello world
    >>> _
• Interactive environment. Try out new ideas here!
• No semicolon!
• Commands are evaluated as you type.

    >>> print gene_name
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    NameError: gene_name
    >>> _
• Errors are caught immediately.
• Examine your data.
• Quickly develop and test algorithms.

Creating Variables
    >>> gene_name = "caspase"
    >>> print gene_name
    caspase
    >>> _
• Variables are created when you assign them.
• Names are case-sensitive: gene_name is not Gene_Name.

Printing Variables
    >>> a = 100
    >>> b = 5
    >>> print a * b
    500
    >>> print "%10d" % a
           100
    >>> print "%d + %d = %d" % (a, b, a+b)
    100 + 5 = 105
    >>> _
• Formatting like printf in C.
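The printf-style formatting shown above still works in current Python. The following is a minimal sketch, assuming Python 3 (where print is a function rather than a statement, unlike the Python 1.5.2 sessions on the slides). The names padded and summary are introduced here for illustration only.

```python
# Assumes Python 3: print is a function, not a statement as on the slides.
a = 100
b = 5

# "%10d" right-aligns the integer in a field 10 characters wide.
padded = "%10d" % a

# A tuple on the right-hand side fills several placeholders in order.
summary = "%d + %d = %d" % (a, b, a + b)

print(a * b)
print(padded)
print(summary)
```

The % operator needs a tuple whenever the format string has more than one placeholder. A single value can be passed bare, as in the first example.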
Special Value: 'None'
    >>> c = None
    >>> print c
    None
    >>> _

Integers
    >>> a = 100
    >>> print a
    100
    >>> print a + 10
    110
    >>> a = 2*(a+10)
    >>> print a
    220
    >>> print a / 100
    2
    >>> print a ** 2
    48400
    >>> print a ** 100
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    OverflowError: integer pow()
    >>> _
• Supports arithmetic: +, -, *, /, **
• Understands parentheses.
• Gotcha! Be careful of division: dividing two integers truncates the result.
• Biggest integer is (about) 2 billion.

Long Integers
    >>> a = 100L
    >>> print a
    100L
    >>> a ** 100
    10000000000000000000000000000000000000
    00000000000000000000000000000000000000
    00000000000000000000000000000000000000
    00000000000000000000000000000000000000
    00000000000000000000000000000000000000
    00000000000L
    >>> _
• Have no limit!

Float
    >>> a = 100.0
    >>> print a
    100.0
    >>> a ** 100
    1e+200
    >>> a / 3
    33.3333333333
    >>>

Number Coercion
    >>> 3 / 2
    1
    >>> 3. / 2
    1.5
    >>> float(3)
    3.0
    >>> float(3)/2
    1.5
• Integers convert to floating point.

Strings
    >>> protein = "TSQGRTRTLLNLTPIRLIVALFLVAAAVGL"
    >>> print protein
    TSQGRTRTLLNLTPIRLIVALFLVAAAVGL
    >>> print protein[0:5]
    TSQGR
    >>> _
• Characters are numbered from 0.
• "Slices" do not include the end.

    T S Q G R T R T L L N  ...
    0 1 2 3 4 5 6 7 8 9 10 ...
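The slicing and division behaviour above can be checked with a short script. This is a sketch under the assumption of Python 3, which differs from the Python 1.5.2 on the slides in two relevant ways: / is true division (// is the truncating form), and ordinary integers have no fixed upper limit. The variable names other than protein are illustrative.

```python
# Assumes Python 3; the slides use Python 1.5.2.
protein = "TSQGRTRTLLNLTPIRLIVALFLVAAAVGL"

# A slice [start:end] includes start and excludes end,
# so its length is end - start.
head = protein[0:5]
fragment = protein[5:10]

# Adjacent slices share an endpoint and rejoin without overlap or gap.
rejoined = protein[:5] + protein[5:]

# In Python 3, / is true division and // truncates like / on the slides.
true_quotient = 220 / 100
floor_quotient = 220 // 100

# Python 3 integers grow as needed, so this does not overflow.
big = 100 ** 100

print(head, fragment, len(fragment))
print(true_quotient, floor_quotient)
print(len(str(big)))
```

Excluding the end position is what makes protein[:n] + protein[n:] reproduce the original string for any n.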
Strings
    >>> fragment = protein[5:10]
    >>> print fragment
    TRTLL
    >>> len(fragment)
    5
    >>> fragment[3:]
    'LL'
    >>> fragment[10:]
    ''
    >>> fragment[-1]
    'L'
    >>> _
• "len" gives the length of the string.
• Slice endpoints are optional.
• Slices can be out of range.
• Indices can also be counted from the end.

     T  R  T  L  L
     0  1  2  3  4
    -5 -4 -3 -2 -1

Lists
    >>> fragment = ['T', 'R', 'T', 'L', 'L']
    >>> print fragment
    ['T', 'R', 'T', 'L', 'L']
    >>> print fragment[1:3]
    ['R', 'T']
    >>> 'R' in fragment
    1
    >>> 'A' in fragment
    0
    >>> _
• Slices like strings.
• "in" checks whether something is in a list.

List Assignments
    >>> print fragment
    ['T', 'R', 'T', 'L', 'L']
    >>> reference = fragment
    >>> fragment[0] = 'A'
    >>> print fragment
    ['A', 'R', 'T', 'L', 'L']
    >>> print reference
    ['A', 'R', 'T', 'L', 'L']
    >>> _
• List assignment is a reference: both names refer to the same list.

    >>> print fragment
    ['T', 'R', 'T', 'L', 'L']
    >>> reference = fragment[:]
    >>> fragment[0] = 'A'
    >>> print fragment
    ['A', 'R', 'T', 'L', 'L']
    >>> print reference
    ['T', 'R', 'T', 'L', 'L']
    >>> _
• Python idiom: to copy a list, slice the whole thing!

Lists are Objects
    >>> dir(fragment)
    ['append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
    >>> _
• "dir" tells you what an object can do.
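The difference between assigning a list and copying it can be made explicit with identity checks. The sketch below assumes Python 3; the names alias, snapshot, and public_methods are illustrative and do not come from the slides.

```python
# Assumes Python 3.
fragment = ['T', 'R', 'T', 'L', 'L']

alias = fragment        # a second name for the same list object
snapshot = fragment[:]  # a whole-list slice builds a new (shallow) copy

fragment[0] = 'A'

print(alias)     # follows the change made through "fragment"
print(snapshot)  # keeps the original contents
print(alias is fragment, snapshot is fragment)

# dir() lists an object's attributes; filtering out underscore names
# leaves the public list methods, including those shown on the slide.
public_methods = [name for name in dir(fragment) if not name.startswith('_')]
print(public_methods)
```

The slice copy is shallow: it is enough for a list of strings, but nested lists inside it would still be shared.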
Lists are Objects
    >>> fragment.append
    <built-in method append of list object at 2094b8>
    >>> print fragment.append.__doc__
    L.append(object) -- append object to end
    >>> _
• __doc__ shows you documentation.

Tuples
    >>> fragment = ('T', 'R', 'T', 'L', 'L')
    >>> print fragment
    ('T', 'R', 'T', 'L', 'L')
    >>> print fragment[1:3]
    ('R', 'T')
    >>> fragment[0] = 'A'
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    TypeError: object doesn't support item assignment
    >>> _
• Like lists, but cannot be changed.

Dictionaries
    >>> genetic_code = {
    ...     'UUU' : 'F',
    ...     'UUC' : 'F',
    ...     'UUA' : 'L',
    ...     'UUG' : 'L',
    ...     [...]
    ...     }
    >>> print genetic_code['GGU']
    G
    >>> print genetic_code['ABC']
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    KeyError: 'ABC'
    >>> dir(genetic_code)
    ['clear', 'copy', 'get', 'has_key', 'items', 'keys', 'update', 'values']
    >>>
• Creates a mapping from a key to a value. (Duplicate keys are not allowed.)
• Dictionaries are objects.

What we've covered so far...
• Python is a high-level scripting language.
• Data types:
  • Numbers: Integer, Float
  • Strings
  • Lists
  • Dictionary

Intermission

Act III
Let's write some code!

On with the programming!
    >>> rna = ("AUG", "GGU", "GCC")
    >>> prot = ""
    >>> for codon in rna:
    ...     print codon
    ...     prot = prot + gencode[codon]
    ...
    AUG
    GGU
    GCC
    >>> print prot
    MGA
    >>> _
• The body of a "for" loop is delineated by whitespace (indentation).

Whitespace: Good or Bad???
• Usability testing
• Enforce common style

    "We will perhaps eventually be writing only small modules which are identified by name as they are used to build larger ones, so that devices like indentation, rather than delimiters, might become feasible for expressing local structure in the source language."
    Donald E. Knuth, 1974

while...
    >>> rna = ("AUG", "GGU", "GCC")
    >>> prot = ""
    >>> i = 0
    >>> while i < len(rna):
    ...     print rna[i]
    ...     prot = prot + gencode[rna[i]]
    ...     i = i + 1
    ...
    AUG
    GGU
    GCC
    >>> print prot
    MGA
    >>> _

if, else
    >>> rna = ("AUG", "XXX", "GGU", "GCC")
    >>> prot = ""
    >>> for codon in rna:
    ...     if gencode.has_key(codon):
    ...         prot = prot + gencode[codon]
    ...     else:
    ...         print "unknown '%s'" % codon
    ...
    unknown 'XXX'
    >>> print prot
    MGA
    >>> _

Loop control
    >>> rna = ("AUG", "XXX", "GGU", "UAA", "GCC")
    >>> prot = ""
    >>> for codon in rna:
    ...     if codon in ['UAA', 'UAG', 'UGA']:
    ...         break
    ...     elif gencode.has_key(codon):
    ...         prot = prot + gencode[codon]
    ...     else:
    ...         pass # handle unknown key
    ...
    >>> print prot
    MG
    >>> _
• "break" exits the loop.
• "pass" does nothing.
• "continue" (not shown) skips to the next iteration.

Functions
    >>> def to_aa(codon):
    ...     gencode = {[...]}
    ...     return gencode[codon]
    ...
    >>> _
• "return" exits the function and hands back a value.

Saving Code as Scripts
• Save your code for next time!
• ".py" files

Modules
    #!/usr/local/bin/python
    # "bio" module
    def translate(rna):
        gencode = {
            [...]
            }
        prot = ""
        for codon in rna:
            prot = prot + gencode[codon]
        return prot
• A module is a library of code.
• "bio.py" is the "bio" module.
• import / reload

    >>> import bio
    >>> dir(bio)
    ['translate']
    >>> bio.translate(("AUG", "GGU", "GCC"))
    'MGA'
    >>> _

... as a standalone script
    #!/usr/local/bin/python
    def translate(rna):
        gencode = {
            [...]
            }
        prot = ""
        for codon in rna:
            prot = prot + gencode[codon]
        return prot

    print "Hi!"
    if __name__ == '__main__':
        print translate(("AUG", "GGU", "GCC"))
• Python interprets and executes the script.
• __name__ is set to the module name, or to '__main__' when the file is run as a script.

Output when run as a script:
    Hi!
    MGA

Global Variables
    #!/usr/local/bin/python
    gencode = {
        "UUU" : "F",
        "UUC" : "F",
        "UUA" : "L",
        "UUG" : "L",
        [...]
        }

    def translate(rna):
        prot = ""
        for codon in rna:
            prot = prot + gencode[codon]
        return prot
• Recreating the genetic code mapping each time is expensive!
• Create a global variable to store it.

Default Parameters
    #!/usr/local/bin/python
    gencode = {
        "UUU" : "F",
        "UUC" : "F",
        "UUA" : "L",
        "UUG" : "L",
        [...]
        }

    def translate(rna, code=gencode):
        prot = ""
        for codon in rna:
            prot = prot + code[codon]
        return prot
• But what if you want to use a different genetic code?
• Pass a translation table as a parameter.
• Set a default parameter
  • standard one used most often
  • does not break existing programs

    >>> import bio
    >>> bio.translate(("AUG", "GGU", "GCC"))
    'MGA'
    >>> mycode = {[...]}
    >>> bio.translate(("AUG", "GGU", "GCC"), mycode)
    'MGV'
    >>> bio.translate(("AUG", "GGU", "GCC"), code=mycode)
    'MGV'
    >>>

Using stopcodons
    #!/usr/local/bin/python
    gencode = { "UUU" : "F", "UUC" : "F", "UUA" : "L", "UUG" : "L", [...] }

    def translate(rna, code=gencode):
        prot = ""
        for codon in rna:
            if codon in ['UAA', 'UAG', 'UGA']:
                break
            prot = prot + code[codon]
        return prot
• Do a translation only up to any recognized stop codon.
• Bug!!!

What's the bug?
• The stop codon may be different for different genetic codes!
adding another parameter
    #!/usr/local/bin/python
    gencode = { "UUU" : "F", "UUC" : "F", "UUA" : "L", "UUG" : "L", [...] }

    def translate(rna, code=gencode,
                  stopcodon=['UAA', 'UAG', 'UGA']):
        prot = ""
        for codon in rna:
            if codon in stopcodon:
                break
            prot = prot + code[codon]
        return prot
• One solution: make the stopcodon a parameter.

    >>> bio.translate(("AUG", "GGU", "GCC"), mycode)
    'MGV'
    >>> bio.translate(("AUG", "GGU", "GCC"), code=mycode, stopcodon=['GCC'])
    'MG'
    >>> bio.translate(("AUG", "GGU", "GCC"), stopcodon=['GGU'])
    'M'
    >>> _

finishing touches
    #!/usr/local/bin/python
    [...]
    def translate(rna, code=gencode,
                  stopcodon=['UAA', 'UAG', 'UGA']):
        """translate(rna[, code][, stopcodon]) -> string

        Translate an RNA sequence into a protein sequence.

        """
        prot = ""
        for codon in rna:
            if codon in stopcodon:
                break
            prot = prot + code[codon]
        return prot
• Documentation
• Triple-quoted strings
  • Allow newlines

    >>> import bio
    >>> dir(bio.translate)
    ['__doc__', '__name__', 'func_code', 'func_defaults', 'func_doc', 'func_globals', 'func_name']
    >>> print bio.translate.__doc__
    translate(rna[, code][, stopcodon]) -> string

    Translate an RNA sequence into a protein sequence.

    >>> _

RNA as a tuple of codons?
• ("AUG", "GGU", "GCC")
• Representation problems
  • hard to get sequences into that form
  • not sliceable, e.g. cannot easily get residues 2 to 4
• Semantic problems
  • what about non-coding regions?
  • insertion/deletion errors?
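The translate function built up over the preceding slides can be sketched in runnable form as follows. This assumes Python 3 rather than the slides' Python 1.5.2. GENCODE is a deliberately tiny, illustrative subset of the standard genetic code (the slides elide the full table), and the constant names GENCODE and STOP_CODONS are introduced here, not taken from the slides.

```python
# Assumes Python 3. GENCODE is a small illustrative subset of the
# standard genetic code, not a complete table.
GENCODE = {
    "AUG": "M", "GGU": "G", "GCC": "A",
    "UUU": "F", "UUC": "F", "UUA": "L", "UUG": "L",
}
STOP_CODONS = ("UAA", "UAG", "UGA")


def translate(rna, code=GENCODE, stopcodon=STOP_CODONS):
    """Translate a sequence of codons, stopping at the first stop codon."""
    residues = []
    for codon in rna:
        if codon in stopcodon:
            break
        residues.append(code[codon])  # raises KeyError for an unknown codon
    return "".join(residues)


standard = translate(("AUG", "GGU", "GCC"))
truncated = translate(("AUG", "GGU", "UAA", "GCC"))
custom_stop = translate(("AUG", "GGU", "GCC"), stopcodon=("GGU",))
print(standard, truncated, custom_stop)
```

One design difference from the slides: the default stop codons are a tuple rather than a list. Default values are created once, when the function is defined, so an immutable default cannot be changed by accident between calls.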
Building a Sequence object
    #!/usr/local/bin/python
    class Sequence:
        seq = ''
        name = ''
• "class" keyword
• member variables
  • defined in scope of class
  • class "owns" the variables

    >>> import Sequence
    >>> seq = Sequence.Sequence()
    >>> seq.name
    ''
    >>> seq.name = "Actin Binding Protein"
    >>> print seq.name
    Actin Binding Protein
    >>> _

Private Variables
    #!/usr/local/bin/python
    class Sequence1:
        _seq = ''
        _name = ''

    class Sequence2:
        __seq = ''
        __name = ''
• data hiding by convention
  • leading underscore
  • name mangling

    >>> dir(Sequence.Sequence1)
    ['_seq', '_name', ...]
    >>> dir(Sequence.Sequence2)
    ['_Sequence2__seq', '_Sequence2__name', ...]
    >>> _

Adding a constructor
    #!/usr/local/bin/python
    class Sequence:
        def __init__(self, seq='', name=''):
            self._seq = seq
            self._name = name
• __init__
  • optional constructor
  • automatically called when an object is created
• self
  • reference to object
  • like "this" in C++, Java
  • defined explicitly

Adding methods
    #!/usr/local/bin/python
    class Sequence:
        def __init__(self, seq='', name=''):
            self._seq = seq
            self._name = name
        def get_seq(self):
            return self._seq
        def get_name(self):
            return self._name
• methods are defined inside the class.

    >>> seq = Sequence.Sequence(
    ...     "HSRDIDQEYQ", "Actin Binding")
    ...
    >>> print seq.get_name()
    Actin Binding
    >>> print seq._name
    Actin Binding
    >>> _

Create an RNA class
    #!/usr/local/bin/python
    [...] # Sequence declaration
    class RNASequence(Sequence):
        def __init__(self, seq='', name=''):
            Sequence.__init__(self, seq, name)
        # get_seq defined in Sequence
        # get_name defined in Sequence
• subclass from "Sequence"
• inherit its methods and members
• new constructor hides the Sequence one
  • need to call it explicitly

Make my codons!
    #!/usr/local/bin/python
    [...]
    # Sequence declaration
    class RNASequence(Sequence):
        def __init__(self, seq='', name=''):
            Sequence.__init__(self, seq, name)
        def as_codons(self):
            codons = []
            i = 0
            while i < len(self._seq):
                codon = self._seq[i:i+3]
                codons.append(codon)
                i = i + 3
            return codons
• Create a new method to split the sequence into triplet codons.
• as_codons is not appropriate for general sequences
  • only available to RNASequence

How to handle errors?
• What happens when the sequence cannot be split evenly into triplets?

    #!/usr/local/bin/python
    [...] # Sequence declaration
    class RNASequence(Sequence):
        [...] # constructor
        def as_codons(self):
            # check for condition
            if len(self._seq) % 3 != 0:
                raise ValueError, "broken"
            codons = []
            i = 0
            while i < len(self._seq):
                codon = self._seq[i:i+3]
                codons.append(codon)
                i = i + 3
            return codons

Exception Handling
• Use for "unignorable" conditions
• Fail loudly!

    >>> goodseq = Sequence.RNASequence(
    ...     "AUGGGU")
    ...
    >>> print goodseq.as_codons()
    ['AUG', 'GGU']
    >>> brokenseq = Sequence.RNASequence(
    ...     "AUGGGUG")
    ...
    >>> print brokenseq.as_codons()
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
      File "Sequence.py", line 23, in as_codons
        raise ValueError, "broken"
    ValueError: broken
    >>> _

    >>> try:
    ...     codons = brokenseq.as_codons()
    ...     print "Codons: %s" % codons
    ... except ValueError:
    ...     print "Sequence is broken"
    ...
    Sequence is broken
    >>> _

Reading sequence from a file
FASTA-formatted file:
    >ABP1_SACEX Actin-Binding Protein
    MALEPIDATTHSRDIEQEYQKVVRGTDNDT
    TWLIISPNTQKEYLPSSTGSSFSDFLQSFD
    ETKVEYGIARVSPPGSDVGKIILVGWCPDS
    APMKTRASFAANFGTIANSVLPGYHIQVTA
    RDEDDLDEEELLTKISNAAGARYSIQAAGN
    SVPTSSASGSAPVKKVFTPSLAKKESEPKK
    SFVPPPVREEPVPVNVVKDN

Opening a File
    >>> print open.__doc__
    open(filename[, mode[, buffering]]) -> file object

    Open a file. [...]
    >>> file = open("does_not_exist", "r")
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    IOError: [Errno 2] No such file or directory: 'does_not_exist'
    >>> file = open("fasta_file", "r")
    >>> dir(file)
    ['close', 'closed', 'fileno', 'flush', 'isatty', 'mode', 'name', 'read', 'readinto', 'readline', 'readlines', 'seek', 'softspace', 'tell', 'truncate', 'write', 'writelines']
    >>> _
• "open" returns a file object.

Reading a FASTA file
• Added to our Sequence module...
• Bug!!!

    #!/usr/local/bin/python
    [...] # Sequence stuff
    def read_fasta(filename):
        file = open(filename, 'r')
        title_line = file.readline()
        sequence = ''
        while 1:
            line = file.readline()
            if not line:
                break
            sequence = sequence + line   # "line" contains newlines and/or carriage returns!
        # BUG: extra characters end up in the sequence
        name = title_line[1:]
        return Sequence(sequence, name)

The string Module
    >>> import string
    >>> dir(string)
    ['atof', 'atoi', 'atol', 'capitalize', 'capwords', 'center', 'count', 'digits', 'expandtabs', 'find', 'hexdigits', 'index', 'index_error', 'join', 'joinfields', 'letters', 'ljust', 'lower', 'lowercase', 'lstrip', 'maketrans', 'octdigits', 'replace', 'rfind', 'rindex', 'rjust', 'rstrip', 'split', 'splitfields', 'strip', 'swapcase', 'translate', 'upper', 'uppercase', 'whitespace', 'zfill']
    >>> print string.rstrip.__doc__
    rstrip(s) -> string

    Return a copy of the string s with trailing whitespace removed.
    >>> _
• See the Library Reference for more modules!

Using the string library
• "import" the library to access the functions.

    #!/usr/local/bin/python
    import string   # import here
    [...] # Sequence stuff
    def read_fasta(filename):
        file = open(filename, 'r')
        title_line = file.readline()
        sequence = ''
        while 1:
            line = file.readline()
            if not line:
                break
            line = string.rstrip(line)   # remove whitespace
            sequence = sequence + line
        name = string.rstrip(title_line[1:])
        return Sequence(sequence, name)

read_fasta (finished)
    #!/usr/local/bin/python
    import string
    [...] # Sequence stuff
    def read_fasta(filename):
        """read_fasta(filename) -> Sequence

        Read a FASTA-formatted file and return a Sequence object

        """
        file = open(filename, 'r')
        title_line = file.readline()
        sequence = ''
        while 1:
            line = file.readline()
            if not line:
                break
            [...] # rest of the loop and the return, as in the previous version
• Added docstring.
• Should add error checking.
  • Check format.
  • Check sequence.

Summary: Act III
• Python covered:
  • Functions
  • Objects
  • Modules
  • Read/Write Files
• Code written:
  • Translate RNA to protein.
  • Sequence, RNASequence classes.
  • Read FASTA files.

Act IV
Biopython: Batteries Included

Python in Biology, 1999
• Growing body of code being developed in Python.
• Much code attacking the same problem.
• Little intellectual property in the code -- we just need code to get something done!

Solution: Biopython!
• Provides freely available software tools for biology research.
• High-tech penny jar.
• Modelled on Bioperl.
• www.biopython.org

Who Should Use Biopython?
• People who manipulate and analyze biological data using Python.
• People who need a module to perform a function.
• Very few end-user tools (scripts to run).
• Very few GUI tools.

Why should I use Biopython?
• Software is hard.
• Complete solutions are hard.
• Maintenance is hard.

What does Biopython do?
• Database access / File formats
• Sequence analysis
• Structure analysis
• Access to algorithms
• Microarray data analysis

Sequence Library
• Sequence class.
• Understands types of biological sequences.
• Transcribe and translate sequences.
• Analyses
  • reverse complement, molecular weight, GC content, Smith-Waterman alignment

Structure Analysis
• Thomas Hamelryck's PDB Library
• Read and write PDB files. Hard!
• Fast search for neighbors in 3D space.
• Superimpose structures.

Microarray Analysis
• Michiel de Hoon's PyCluster
• Read/Write Cluster/TreeView files
• Data analysis includes:
  • Hierarchical Clustering, Self-Organizing Maps, Principal Component Analysis

Databases
• Typical functions:
  • Search for data.
  • Download data.
• Databases supported:
  • GenBank, PubMed, SWISS-PROT, PDB, SCOP, Prosite, LocusLink, etc.
• BLAST search (Local and WWW)

Using Biopython

Download a SWISS-PROT Seq
    >>> from Bio.SwissProt import SProt
    >>> SWISSPROT = SProt.ExPASyDictionary()
    >>> entry = SWISSPROT['POL_HV2RO']
    >>> print entry
    ID   POL_HV2RO   STANDARD;   PRT;   1036 AA.
    AC   P04584; Q76629;
    [...]
    SQ   SEQUENCE 1036 AA; 117080 MW; 5224E354B1DCC83B CRC64;
         TGRFFRTGPL GKEAPQLPRG PSSAGADTNS TPSGSSSGST GEIYAAREKT ERAERETIQG
         SDRGLTAPRA GGDTIQGATN RGLAAPQFSL WKRPVVTAYI EGQPVEVLLD TGADDSIVAG
    [...]
    >>> _

Saving FASTA-format
    >>> from Bio import Fasta
    >>> fasta_seq = Fasta.Record()
    >>> fasta_seq.title = seq_obj.entry_name
    >>> fasta_seq.sequence = seq_obj.sequence
    >>> print fasta_seq
    >POL_HV2RO
    TGRFFRTGPLGKEAPQLPRGPSSAGADTNSTPSGSSSGSTGEIYAAREKTERAERETIQG
    SDRGLTAPRAGGDTIQGATNRGLAAPQFSLWKRPVVTAYIEGQPVEVLLDTGADDSIVAG
    IELGNNYSPKIVGGIGGFINTKEYKNVEIEVLNKKVRATIMTGDTPINIFGRNILTALGM
    [...]
    >>> open('myseq', 'w').write(str(fasta_seq))
    >>> _

Run a BLAST search
    >>> from Bio.Blast import NCBIWWW
    >>> handle = NCBIWWW.blast('blastp', 'pdb', open('myseq'))
    >>> results = NCBIWWW.BlastParser().parse(handle)
    >>> print len(results.descriptions)
    15
    >>> for desc in results.descriptions:
    ...     print desc
    ...
    pdb|1QGH|A Chain A, The X-Ray Structure Of The Unusual Dod...   24   2.6
    pdb|1QGH|A Chain A, The X-Ray Structure Of The Unusual Dod...   24   2.6
    [...]
    pdb|1P35|B Chain B, Crystal Structure Of Baculovirus P35 >g...  23   8.5
    >>> _

Getting Biopython
• Download from: http://www.biopython.org
• Open Source license!
  • Free to modify, redistribute.
  • See LICENSE for details.

Learning Biopython
• Read the Tutorial at: http://www.biopython.org/docs/tutorial/

To Help Out
• Visit the web site.
• Join the biopython mailing lists.
• Find a project!
  • Best: something you do often that is not in the library.
  • Support more programs, databases, types of data.
  • Documentation, site management, news reporter.

Act V
Where do we go from here?

Jython = Java + Python
• Java implementation of Python.
• Compatible with Java
  • Jython can execute Java code.
  • Java can execute Jython code.
• Can take advantage of libraries written in Java.

Optimizing Python with C
• Python is closely tied with C.
• Can extend Python code with C code
• Optimization strategy:
  • Find slow points
  • Rewrite in C
• Gordon Bell Prize for supercomputing
  • 1998 Finalist for Price/Performance
  • SPaSM (Scalable Parallel Short-range Molecular-dynamics)
  • C/Python

Take-Home Messages
• Python is a simple high level language.
• Python is full-featured, and scales well for scientific computation.
• Python is well-supported in biology.
• Biopython performs common biology-related tasks.

From here...
• Visit the web page: http://www.python.org
• Download Python
• Read the online documentation
  • Tutorial
  • Library Reference
• Find a project and start coding!

Recommended Books
• Gentle introduction
• Fast-paced introduction
• Reference
• Practical reference

Enjoy the Conference!

Thank You!
Jeffrey Chang
[email protected]