Programming Python for Bioinformatics, Part I

Roadmap
The topics:
• Homework due
• Basic concepts of molecular biology
• Elements of Python
    • Python's data types
    • Python functions
    • Python control of flow
    • Python regex
• Overview of the field
• Biological databases and database searching
• Sequence alignments
• Phylogenetics
• Structure prediction
• Microarray & next gen

Where to get Python?
If you want to run your Python programs on your own machine, download a Python interpreter from one of these places:
    http://www.activestate.com/activepython/
    http://www.python.org/download/

Running Python
Python is one of the best scripting languages. It is often used from a text-based command shell. You can find the links at the class website.

Download IEP as your IDE
There are many good Python IDEs. I have links to some of them at the class website.
I'll use IEP in class. Download IEP at:
    http://www.iep-project.org/downloads.html
I downloaded: "iep-3.2.win32.exe - Windows installer"

Use IEP
You'll see the IEP icon on your desktop. Start your IEP.
• Ctrl S: save
• Ctrl E: execution
Drawback: you won't be able to pass arguments to the script.

A Taste of Python: at prompt
Type statements or expressions at the prompt:
    >>> print "Hello, world"
    Hello, world
    >>> x = 12**2
    >>> x/2
    72
    >>> # this is a comment

A Taste of Python: print a message
demo1.py: Greet the entire world.
    #!/usr/bin/python             # command interpretation header (where to find python)
    # greet the entire world      # a comment
    x = 7e9;                      # variable assignment statement
    print "Hello world!";         # function calls (output statements)
    print "All", x, "of you!";

A Taste of Python: scripting
demo2.py: parsing email addresses

Overview
• Assignment & Names
• Data types
• Sequence types: Lists, Tuples, and Strings
• Mutability
• Understanding Reference Semantics in Python

Assignment
• Assignment uses = and comparison uses ==
• The first assignment to a variable creates it
• Dynamic typing: no declarations; names don't have types, objects do
• For numbers, + - * / % are as expected
• Use of + for string concatenation
• Use of % for string formatting (like printf in C)
• Indentation matters to the meaning of the code: block structure is indicated by indentation
• Logical operators are words (and, or, not), not symbols
• The basic printing command is print

Naming Rules
Names are case sensitive and cannot start with a number. They can contain letters, numbers, and underscores.
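The naming rules above can be checked mechanically. The following is a minimal sketch, assuming Python 3 (the slides themselves use Python 2, whose reserved-word list differs slightly). The helper name `is_valid_name` and the sample names are hypothetical, chosen only for illustration.

```python
import keyword

def is_valid_name(name):
    """Return True if `name` obeys the naming rules and is not a reserved word."""
    # str.isidentifier() checks the character rules (letters, digits,
    # underscores, no leading digit); keyword.iskeyword() rejects reserved words.
    return name.isidentifier() and not keyword.iskeyword(name)

# Sample names (illustrative only).
candidates = ["bob", "Bob", "_bob", "_2_bob_", "bob_2", "2bob", "class"]
results = {name: is_valid_name(name) for name in candidates}

for name, ok in results.items():
    print(name, ok)

# Names are case sensitive: these are two different variables.
bob = 1
Bob = 2
print(bob != Bob)
```

Running the sketch shows that a leading digit or a reserved word makes a name unusable, while case and underscores only distinguish names from one another.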
    bob   Bob   _bob   _2_bob_   bob_2   BoB

There are some reserved words:
    and, assert, break, class, continue, def, del, elif, else, except, exec, finally, for, from, global, if, import, in, is, lambda, not, or, pass, print, raise, return, try, while

Naming conventions
The Python community has these recommended naming conventions:
• joined_lower for functions, methods, and attributes
• joined_lower or ALL_CAPS for constants
• StudlyCaps for classes
• camelCase only to conform to pre-existing conventions
• Attributes: interface, _internal, __private

Whitespace
Whitespace is meaningful in Python, especially indentation and placement of newlines.
• Use a newline to end a line of code. Use \ when you must go to the next line prematurely.
• No braces { } to mark blocks of code; use consistent indentation instead.
    • The first line with less indentation is outside of the block.
    • The first line with more indentation starts a nested block.
• Colons start a new block in many constructs, e.g. function definitions and then clauses.

Comments
• Start comments with #; the rest of the line is ignored.
• You can include a "documentation string" as the first line of a new function or class you define.
• Development environments, debuggers, and other tools use it: it's good style to include one.

    def fact(n):
        """fact(n) assumes n is a positive integer and returns factorial of n."""
        assert(n>=0)
        return 1 if n==0 else n*fact(n-1)

Python's data types

[Figure: Python's built-in type hierarchy]

Basic Datatypes
• Integers (default for numbers)
    z = 5 / 2    # Answer 2, integer division
• Floats
    x = 3.456
• Strings
    • Can use "..." or '...' to specify; "foo" == 'foo'
    • Unmatched quotes can occur within the string: "John's" or 'John said "foo!".'
    • Use triple double-quotes for multi-line strings or strings that contain both ' and " inside of them: """a'b"c"""

Numbers
• Normal integers represent whole numbers. Ex: 3, -7, 123, 76
• Long integers have unlimited size. Ex: 9999999999999999999999L
• Floating-point numbers represent numbers with decimal places. Ex: 1.2, 3.14159, 3.14e-10
• Octal and hexadecimal numbers. Ex: 0177, 0x9ff, 0Xff
• Complex numbers. Ex: 3+4j, 3.0+4.0j, 3J

Python Basics: arithmetic operations
    Operator   Meaning                Example (y=5; z=3)   Result
    +          add                    x = y + z            x = 8
    -          subtract               x = y - z            x = 2
    *          multiply               x = y * z            x = 15
    /          divide                 x = y / z            x = 1
    %          modulus/remainder      x = y % z            x = 2

Python Basics: arithmetic operations (continued)
    Operator   Meaning                Example (y=5; z=3)   Result
    <<         shift left             x = y << 1           x = 10, y = 5
    >>         shift right            x = y >> 2           x = 1
    |          bitwise or             x = y | z            x = 7
    ^          bitwise exclusive or   x = y ^ z            x = 6
    &          bitwise and            x = y & z            x = 1
    **         raise to power         x = y ** z           x = 125

Python Basics: Relational Operators
    Operator   Meaning
    ==         equal
    !=, <>     not equal
    >          greater than
    >=         greater than or equal
    <          less than
    <=         less than or equal

Python Basics: Relational and Logical Operators
Logical operators: and, or, not
Assume x = 1, y = 4, z = 14
    Expression          Value   Interpretation
    x < y + z           1       True
    y == 2 * x + 3      0       False
    z <= x + y          0       False
    z > x               1       True
    x != y              1       True

Python Basics: Logical Operators
Assume x = 1, y = 4, z = 14
    Expression               Value   Interpretation
    x <= 1 and y == 3        0       False
    x <= 1 or y == 3         1       True
    not (x > 1)              1       True
    (not x) > 1              0       False
    not (x <= 1 or y == 3)   0       False

Sequences
Sequence types are containers that hold objects: finite, ordered, and indexed by integers.

Three sequence types: Tuples, Lists, and Strings
• Tuple: (1, "a", [100], "foo")
    • An immutable ordered sequence of items
    • Items can be of mixed types, including collection types
• String: "foo bar"
    • An immutable ordered sequence of chars
    • Conceptually very much like a tuple
• List: ["one", "two", 3]
    • A mutable ordered sequence of items of mixed types

Similar Syntax
• All three sequence types (tuples, strings, and lists) share much of the same syntax and functionality.
• Key difference: tuples and strings are immutable; lists are mutable.
• The operations shown in this section can be applied to all sequence types; most examples will just show the operation performed on one.

Sequence Types - 1
• Define tuples using parentheses and commas
    >>> tu = (23, 'abc', 4.56, (2,3), 'def')
• Define lists using square brackets and commas
    >>> li = ["abc", 34, 4.34, 23]
• Define strings using quotes (", ', or """)
    >>> st = "Hello World"
    >>> st = 'Hello World'
    >>> st = """This is a multi-line
    string that uses triple quotes."""

Sequence Types - 2
• Access individual members of a tuple, list, or string using square bracket "array" notation
• Note that all are 0 based
    >>> tu = (23, 'abc', 4.56, (2,3), 'def')
    >>> tu[1]     # Second item in the tuple.
    'abc'
    >>> li = ["abc", 34, 4.34, 23]
    >>> li[1]     # Second item in the list.
    34
    >>> st = "Hello World"
    >>> st[1]     # Second character in string.
    'e'

Positive and negative indices
    >>> t = (23, 'abc', 4.56, (2,3), 'def')
Positive index: count from the left, starting with 0
    >>> t[1]
    'abc'
Negative index: count from the right, starting with -1
    >>> t[-3]
    4.56

Slicing: Return Copy of a Subset
    >>> t = (23, 'abc', 4.56, (2,3), 'def')
Returns a copy of the container with a subset of the original members. Start copying at the first index, and stop copying before the second index.
    >>> t[1:4]
    ('abc', 4.56, (2, 3))
You can also use negative indices
    >>> t[1:-1]
    ('abc', 4.56, (2, 3))

Slicing: Return Copy of a Subset (continued)
    >>> t = (23, 'abc', 4.56, (2,3), 'def')
Omit the first index to make a copy starting from the beginning of the container
    >>> t[:2]
    (23, 'abc')
Omit the second index to make a copy starting at the first index and going to the end of the container
    >>> t[2:]
    (4.56, (2, 3), 'def')

Copying the Whole Sequence
[ : ] makes a copy of an entire sequence
    >>> t[:]
    (23, 'abc', 4.56, (2, 3), 'def')
Note the difference between these two lines for mutable sequences
    >>> l2 = l1       # Both names refer to the same object;
                      # changing one affects both
    >>> l2 = l1[:]    # Independent copies, 2 refs

The 'in' Operator
Boolean test whether a value is inside a container:
    >>> t = [1, 2, 4, 5]
    >>> 3 in t
    False
    >>> 4 in t
    True
    >>> 4 not in t
    False
For strings, tests for substrings
    >>> 'TATA' in 'TATATATATATATATATATATATA'
    True
    >>> 'ATG' in 'TATATATATATATATATATATATA'
    False
    >>> 'AA' not in 'TATATATATATATATATATATATA'
    True
Careful: the in keyword is also used in the syntax of for loops and list comprehensions.

+ Operator is Concatenation
The + operator produces a new tuple, list, or string whose value is the concatenation of its arguments.
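To illustrate that + builds a new object rather than changing its operands, here is a minimal sketch, assuming Python 3. The variable names are hypothetical. The sketch also contrasts + with the augmented assignment += on a list, which extends the existing list in place and is therefore visible through every name that refers to it.

```python
# Strings: + returns a new string, the operand is unchanged.
seq = "ACCTGAGAGCT"
tailed = seq + 8 * "A"     # append a poly-A tail
print(seq)                 # the original sequence
print(tailed)              # the new, longer sequence

# Lists: + also returns a new list.
first = [1, 2, 3]
alias = first              # a second name for the same list object
combined = first + [4]     # new list; first and alias are untouched
print(first, combined)

# By contrast, += on a list modifies the existing object in place,
# so the change shows up under both names.
alias += [5]
print(first, alias, alias is first)
```

The distinction matters only for mutable sequences: strings and tuples cannot be changed in place, so for them += always rebinds the name to a new object.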
    >>> (1, 2, 3) + (4, 5, 6)
    (1, 2, 3, 4, 5, 6)
    >>> [1, 2, 3] + [4, 5, 6]
    [1, 2, 3, 4, 5, 6]
    >>> 'ACCTGAGAGCT' + 8*'A'
    'ACCTGAGAGCTAAAAAAAA'

Other String operations
    Expression             Value            Purpose
    len(mystring)          11               number of characters in mystring
    "hello" + "world"      "helloworld"     Concatenate strings
    "%s world" % "hello"   "hello world"    Format strings (like sprintf)
    "world" == "hello"     0 or False       Test for equality
    "world" == 'world'     1 or True
    "a" < "b"              1 or True        Alphabetical ordering
    "b" < "a"              0 or False

Many useful built-in functions
    >>> mystring = 'ACCTGAGAGCT'
    >>> mystring.upper()
    'ACCTGAGAGCT'
    >>> mystring.replace('GC', 'CG')
    'ACCTGAGACGT'
    >>> set(mystring)
    set(['A', 'C', 'T', 'G'])

count.py: Exercise
    dna = "ATGaCGgaTCAGCCGcAAtACataCACTgttca"
GC content?

    dna = "ATGaCGgaTCAGCCGcAAtACataCACTgttca"
    dna1 = dna.upper()
    # float() keeps the division from truncating to an integer
    (dna1.count('G') + dna1.count('C')) / float(len(dna1))

transcribe.py: Exercise
    dna = "ATGaCGgaTCAGCCGcAAtACataCACTgttca"
    rna = ???

    dna = "ATGaCGgaTCAGCCGcAAtACataCACTgttca"
    rna = dna.upper()
    # lowercase letters are temporary placeholders so that
    # already-replaced bases are not replaced a second time
    rna1 = rna.replace('A', 'a')
    rna = rna1.replace('T', 'A')
    rna1 = rna.replace('C', 'c')
    rna = rna1.replace('G', 'C')
    rna1 = rna.replace('a', 'U')
    rna = rna1.replace('c', 'G')
    rna[::-1]    # reverse rna

Mutability: Tuples vs. Lists

Lists are mutable
    >>> li = ['abc', 23, 4.34, 23]
    >>> li[1] = 45
    >>> li
    ['abc', 45, 4.34, 23]
We can change lists in place. The name li still points to the same memory reference when we're done.

Tuples are immutable
    >>> t = (23, 'abc', 4.56, (2,3), 'def')
    >>> t[2] = 3.14
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'tuple' object does not support item assignment
You can't change a tuple. You can make a fresh tuple and assign its reference to a previously used name.
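The difference between changing an object and rebinding a name can be sketched as follows, assuming Python 3. The names `original` and `message` are hypothetical and used only for illustration. The fresh tuple is built from slices of the old one, so nothing is modified in place.

```python
t = (23, "abc", 4.56, (2, 3), "def")
original = t                     # a second reference to the same tuple

# Item assignment on a tuple raises TypeError.
try:
    t[2] = 3.14
except TypeError as err:
    message = str(err)
print(message)

# Build a fresh tuple from slices and rebind the name t to it.
t = t[:2] + (3.14,) + t[3:]
print(t)

# The old tuple still exists, unchanged, under its other name.
print(original)
print(t is original)
```

Because `original` still refers to the old tuple, the sketch shows that rebinding `t` created a new object instead of altering the existing one.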
    >>> t = (23, 'abc', 3.14, (2,3), 'def')
The immutability of tuples means they are faster than lists.

Tuple details
• The comma is the tuple creation operator, not parens
    >>> 1,
    (1,)
• Python shows parens for clarity (best practice)
    >>> (1,)
    (1,)
• Don't forget the comma for singletons!
    >>> (1)
    1
• Empty tuples have a special syntactic form
    >>> ()
    ()
    >>> tuple()
    ()

Tuples vs. Lists
• Lists are slower but more powerful than tuples
    • Lists can be modified and they have many handy operations and methods
    • Tuples are immutable & have fewer features
• Sometimes an immutable collection is required (e.g., as a hash key)
• Tuples are used for multiple return values and parallel assignments
    x,y,z = 100,200,300
    old,new = new,old
• Convert tuples and lists using list() and tuple():
    mylst = list(mytup); mytup = tuple(mylst)

Built-in functions vs. methods
• Operations can be functions or methods
    • Remember that (almost) everything is an object
    • You just have to learn (and remember or look up) which operations are functions and which are methods
• len() is a function on collections that returns the number of things they contain
    >>> len(['a', 'b', 'c'])
    3
    >>> len(('a','b','c'))
    3
    >>> len("abc")
    3
• index() is a method on collections that returns the index of the 1st occurrence of its arg
    >>> ['a','b','c'].index('a')
    0
    >>> ('a','b','c').index('b')
    1
    >>> "abc".index('c')
    2

Lists methods
• Lists have many methods, including index, count, append, remove, reverse, sort, etc.
• Many of these modify the list
    >>> l = [1,3,4]
    >>> l.append(0)        # adds a new element to the end of the list
    >>> l
    [1, 3, 4, 0]
    >>> l.insert(1,200)    # insert 200 just before index position 1
    >>> l
    [1, 200, 3, 4, 0]
    >>> l.reverse()        # reverse the list in place
    >>> l
    [0, 4, 3, 200, 1]
    >>> l.sort()           # sort the elements. Optional arguments can give
    >>> l                  # the sorting function and direction
    [0, 1, 3, 4, 200]
    >>> l.remove(3)        # remove first occurrence of element from list
    >>> l
    [0, 1, 4, 200]

Exercise
A valid DNA sequence?
    dna = "ATGaCGgaTDCUAGCCPGcAAtACataCACTngttca"

Python dicts and sets

Overview
• Python doesn't have traditional vectors and arrays!
• Instead, Python makes heavy use of the dict datatype (a hashtable), which can serve as a sparse array
    • Efficient traditional arrays are available as modules that interface to C
• A Python set is derived from a dict

Dictionaries: A Mapping type
• Dictionaries store a mapping between a set of keys and a set of values
    • Keys can be any immutable type.
    • Values can be any type
    • A single dictionary can store values of different types
• You can define, modify, view, look up, or delete the key-value pairs in the dictionary
• Python's dictionaries are also known as hash tables and associative arrays

Creating & accessing dictionaries
    >>> d = {'user':'bozo', 'pswd':1234}
    >>> d['user']
    'bozo'
    >>> d['pswd']
    1234
    >>> d['bozo']
    Traceback (innermost last):
      File '<interactive input>' line 1, in ?
    KeyError: bozo

Updating Dictionaries
    >>> d = {'user':'bozo', 'pswd':1234}
    >>> d['user'] = 'clown'
    >>> d
    {'user':'clown', 'pswd':1234}
• Keys must be unique
• Assigning to an existing key replaces its value
    >>> d['id'] = 45
    >>> d
    {'user':'clown', 'id':45, 'pswd':1234}
• Dictionaries are unordered
    • New entries can appear anywhere in output
• Dictionaries work by hashing

Removing dictionary entries
    >>> d = {'user':'bozo', 'p':1234, 'i':34}
    >>> del d['user']      # Remove one.
    >>> d
    {'p':1234, 'i':34}
    >>> d.clear()          # Remove all.
    >>> d
    {}
    >>> a=[1,2]
    >>> del a[1]           # del works on lists, too
    >>> a
    [1]

Useful Accessor Methods
    >>> d = {'user':'bozo', 'p':1234, 'i':34}
    >>> d.keys()           # List of keys, VERY useful
    ['user', 'p', 'i']
    >>> d.values()         # List of values
    ['bozo', 1234, 34]
    >>> d.items()          # List of item tuples
    [('user','bozo'), ('p',1234), ('i',34)]

A Dictionary Example
• Problem: count the frequency of each word in text read from the standard input, print results
• Six versions of increasing complexity
    • wf1.py is a simple start
    • wf2.py uses a common idiom for default values
    • wf3.py sorts the output alphabetically
    • wf4.py downcases words, strips punctuation from them, and ignores stop words
    • wf5.py sorts output by frequency
    • wf6.py adds command line options: -n, -t, -h

Dictionary example: wf1.py
    #!/usr/bin/python
    import sys
    freq = {}    # frequency of words in text
    for line in sys.stdin:
        for word in line.split():
            if word in freq:
                freq[word] = 1 + freq[word]
            else:
                freq[word] = 1
    print freq

Dictionary example wf2.py
The if/else test in wf1.py is a common pattern. wf2.py replaces it with get(key, default value if not found):
    #!/usr/bin/python
    import sys
    freq = {}    # frequency of words in text
    for line in sys.stdin:
        for word in line.split():
            freq[word] = 1 + freq.get(word, 0)
    print freq

Dictionary example wf3.py
    #!/usr/bin/python
    import sys
    freq = {}    # frequency of words in text
    for line in sys.stdin:
        for word in line.split():
            freq[word] = 1 + freq.get(word, 0)
    for w in sorted(freq.keys()):
        print w, freq[w]

Dictionary example wf4.py
    #!/usr/bin/python
    import sys
    punct = """'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'"""
    freq = {}    # frequency of words in text
    stop_words = set()
    for line in open("stop_words.txt"):
        stop_words.add(line.strip())
    for line in sys.stdin:
        for word in line.split():
            word = word.strip(punct).lower()
            if word not in stop_words:
                freq[word] = freq.get(word,0)+1
    # print sorted words and their frequencies
    for w in sorted(freq.keys()):
        print w, freq[w]

Dictionary example wf5.py
    #!/usr/bin/python
    import sys
    from operator import itemgetter
    …
    words = sorted(freq.items(), key=itemgetter(1), reverse=True)
    for (w,f) in words:
        print w, f

Dictionary example wf6.py
    from optparse import OptionParser
    # read command line arguments and process
    parser = OptionParser()
    parser.add_option('-n', '--number', type="int", default=-1,
                      help='number of words to report')
    parser.add_option("-t", "--threshold", type="int", default=0,
                      help="print if frequency > threshold")
    (options, args) = parser.parse_args()
    ...
    # print the top options.number words but only those
    # with freq > options.threshold
    for (word, freq) in words[:options.number]:
        if freq > options.threshold:
            print freq, word

Why must keys be immutable?
The keys used in a dictionary must be immutable objects.
    >>> name1, name2 = 'john', ['bob', 'marley']
    >>> fav = name2
    >>> d = {name1: 'alive', name2: 'dead'}
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: list objects are unhashable
Why is this? Suppose we could index a value for name2 and then did fav[0] = "Bobby". Could we find d[name2] or d[fav] or …?

Project 1

defaultdict
    >>> from collections import defaultdict
    >>> kids = defaultdict(list, {'alice': ['mary', 'nick'], 'bob': ['oscar', 'peggy']})
    >>> kids['bob']
    ['oscar', 'peggy']
    >>> kids['carol']
    []
    >>> age = defaultdict(int)
    >>> age['alice'] = 30
    >>> age['bob']
    0
    >>> age['bob'] += 1
    >>> age
    defaultdict(<type 'int'>, {'bob': 1, 'alice': 30})
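As one possible way to combine defaultdict with the word-frequency problem, here is a minimal sketch, assuming Python 3 (so print is a function, unlike in the slides). The function name `word_freq` and the sample text are hypothetical. The sketch reads from a list of strings rather than standard input so that it can be run as is.

```python
from collections import defaultdict

def word_freq(lines):
    """Count how often each whitespace-separated word occurs in `lines`."""
    freq = defaultdict(int)          # missing keys start at int() == 0
    for line in lines:
        for word in line.split():
            freq[word] += 1          # no 'if word in freq' test needed
    return freq

# Sample input (illustrative only).
text = ["the cat saw the dog", "the dog ran"]
freq = word_freq(text)

# Report words from most to least frequent.
for w, f in sorted(freq.items(), key=lambda item: item[1], reverse=True):
    print(w, f)
```

One design point to keep in mind: merely looking up a missing key with `freq[key]` inserts it with the default value, so use `key in freq` when you only want to test for membership.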