Python [chapter 4]: mostly Python, and some NLTK

NLTK chapter 2, 4 (approximately)
NLTK programming course
Peter Ljunglöf


Basic Python types (repetition)

str / unicode
    create:   s = "abcd"    u = u"abcd"
              u = s.decode("utf-8")    s = u.encode("utf-8")
    search:   "bc" in s    s.index("c") == 2    s.index("bc") == 1
              s.startswith("ab")
    inspect:  s[2] == "c"    s[-1] == "d"    s[:2] == "ab"    s[1:-1] == "bc"
    modify:   s = s + "efg"    s = s.replace("34", "#")    s = s.strip()
              s.join(list-or-tuple)

tuple
    create:   t = ("a","b","c","d")    t = tuple("abcd")    t = tuple(s)
    search:   "c" in t    "bc" not in t    t.index("c") == 2
    inspect:  t[2] == "c"    t[-1] == "d"    t[:2] == ("a", "b")
              t[1:-1] == ("b", "c")
    modify:   t = t + ("e",)    t = t + ("e","f","g")

list
    create:   w = ["a","b","c","d"]    w = list("abcd")    w = list(t)
    search:   "c" in w    "bc" not in w    w.index("c") == 2
    inspect:  w[2] == "c"    w[-1] == "d"    w[:2] == ["a", "b"]
              w[1:-1] == ["b", "c"]
    modify:   w.append("e")    w.extend(("e","f","g"))    w.insert(1, "x")
              w[2] = "q"    w.pop()

set
    create:   e = set("abcd")    e = set(w)
    search:   "c" in e    "bc" not in e
    modify:   e.add("e")    e.update(("e","f","g"))    e.pop()    e.remove("c")

dict
    create:   d = {"a":9, "b":8, "c":7, "d":6}
              d2 = dict((k, 33) for k in t)    d2 = dict.fromkeys(t, 33)
    search:   "c" in d    "bc" not in d
    inspect:  d["a"] == 9    d["c"] == 7    sorted(d.keys()) == w
    modify:   d["e"] = 999    d.pop("c")


Division in Python (repetition)

Dividing two integers returns an int:
    >>> 3 / 2
    1
Coerce to float first:
    >>> float(3) / 2
    1.5
or use Python 3 division:
    >>> from __future__ import division
    >>> 3 / 2
    1.5


Mutable / immutable (rep.)

list, set, dict, nltk.FreqDist, … are mutable
(they have methods that modify themselves):
    >>> m = w = ["a", "b", "c", "d"]
    >>> w[2] = "#"
    >>> m
    ["a", "b", "#", "d"]
tuple, str, unicode, int, float, … are immutable
(you have to create a copy):
    >>> m = w = ("a", "b", "c", "d")
    >>> w[2] = "#"
    TypeError: 'tuple' object does not support item assignment
    >>> w = w[:2] + ("#",) + w[3:]
    >>> m
    ("a", "b", "c", "d")


Reading and writing (rep.)
Use Unicode strings:
    decode("utf-8") when reading from file
    encode("utf-8") when writing to file
    or use the codecs module
Use the with statement…
…for reading:
    import codecs
    with codecs.open("inputfile", "r", encoding="utf-8") as F:
        content = F.read()
…or writing:
    with codecs.open("outputfile", "w", encoding="utf-8") as F:
        F.write(content)


Python files and modules

    Standard file structure
    Importing modules
    Standard modules


Python file structure

    # -*- coding: utf-8 -*-
    # (the line above is needed if you use non-ASCII strings/comments)
    """
    module docstring: try to use docstrings instead of comments
    """
    # import before everything
    import sys
    import nltk

    # constant declarations: don't change them later
    module_constant = 42
    another_constant = u"non-ascii letters: åäö ÅÄÖ"

    # the main function(s) before the helper(s)
    def main_function(input_file, another_arg):
        """description of main function, and its arguments"""
        with open(input_file, "r") as F:
            …another_function(x, y)…

    def another_function(arg_1, arg_2):
        """another function, its arguments and the return value"""
        return some_result

    # if the file is run as a script, call the main function
    if __name__ == "__main__":
        main_function(*sys.argv[1:])


Importing modules

Never use this:
    from os.path import *
    from nltk.tag.tnt import *
But it's okay to assign short names:
    import os.path as P
    import nltk.tag.tnt as tnt
And sometimes this is okay:
    from glob import glob
    from os.path import basename, dirname
    from nltk.tag.tnt import TnT


Useful Python modules

strings, unicode:
    re, codecs, unicodedata
objects, data types:
    copy, pprint
    collections, heapq, bisect
numbers:
    math, random
iterators, higher-order functions:
    itertools, operator


More Python modules

file system, operating system:
    os.path, glob
    time, os, sys, subprocess
reading/writing special files:
    pickle, cPickle
    zlib, gzip, bz2, zipfile, tarfile
html, cgi:
    urllib, HTMLParser, htmlentitydefs
    cgi, cgitb
testing efficiency and correctness:
    timeit, doctest


Modules for strings

re: covered in the last lecture
codecs:
    codecs.open(filename, mode, encoding)
unicodedata:
    >>> unicodedata.name(u'\u00e4')
    'LATIN SMALL LETTER A WITH DIAERESIS'
    >>> unicodedata.lookup('LATIN SMALL LETTER A WITH DIAER…
    u'\xe4'
    >>> unicodedata.category(u'\xe4')
    'Ll'   # Letter, lowercase
    >>> unicodedata.category(u'2')
    'Nd'   # Number, decimal digit


Objects, data types

deep copying of nested objects:
    copy.deepcopy(obj)
pretty-printing of nested objects:
    pprint.pprint(object, [stream], [indent], [width], [depth])
default dictionaries:
    ctr = collections.defaultdict(int)
    ctr['a']         # returns 0
    ctr['b'] += 3    # ctr['b'] is now 3
priority queues; fast searching in sorted lists:
    heapq.heappush, heapq.heappop, heapq.heapify
    bisect.bisect_left, bisect.bisect_right


Numbers

math functions:
    math.exp(x)    == e^x       math.log(x)  == ln x
    math.pow(x, y) == x^y       math.sqrt(x) == √x
    math.sin(x)    == sin x     math.cos(x)  == cos x
    math.pi        == π         math.e       == e
random numbers and sequences (example results):
    random.random()             ==> 0.70117748128509816
    random.randrange(10)        ==> 7
    random.randrange(100, 110)  ==> 103
    random.choice("abcdef")     ==> "c"
    random.sample("abcdef", 3)  ==> ["d", "a", "c"]
    xs = [1,2,3,4,5]
    random.shuffle(xs)          # result: xs == [5, 1, 4, 2, 3]


File system, OS utilities

pathname manipulation & expansion
    os.path.basename("/test/a/path.xml") ==> "path.xml"
    os.path.dirname("/test/a/path.xml")  ==> "/test/a"
    glob.glob("test/*/*.xml") ==> ["test/a/path.xml", "test/b/zip.xml"]
time
    time.time() ==> number of seconds since the epoch (1970 on unix)
    time.strftime(format, [time]) ==> pretty-formatted time string
os, sys, subprocess
    os.getcwd(), os.chdir(path), os.listdir(path), os.mkdir(path)
    os.environ, sys.argv, sys.platform
    sys.stdin, sys.stdout, sys.stderr
    subprocess.Popen(…), subprocess.call(…)


And the rest…

pickle, cPickle: for reading/writing Python objects from/to files
zlib, gzip, bz2, zipfile, tarfile: for reading/writing compressed data
urllib, HTMLParser, htmlentitydefs: for reading/parsing html
cgi, cgitb: for writing cgi scripts
timeit, doctest: for testing efficiency and correctness
of your code


NLTK

    Frequency distributions
    Conditional frequency distributions


NLTK frequency distribution

FreqDist is a dictionary with counters
initialize by giving a sequence of elements
    e.g., a string, a list of words, a list of bigrams
a lot of useful methods
    compare (<, <=, ==, >=, >), add (+)
    statistics (B, N, Nr, freq, d[x])
    get elements (hapaxes, max, samples, items)
    change (inc, update, d[x]=n)
    display (tabulate, plot)


ConditionalFreqDist

a dictionary of FreqDists
initialize by a sequence of (cond, sample) pairs
useful methods:
    ==, <, >, N, conditions, plot, tabulate
however:
    keys, values, in, for…in are missing
examples:
    figure 2.1: words "america" vs "citizen"
    figure 2.2: word length in different languages
    figure 2.10: last letter of male/female names
    example 2.5: random text generation


Example cond. freq. dist

modals in different text types (2.1 Brown)

    import nltk
    from nltk.corpus import brown

    cfd = nltk.ConditionalFreqDist(
        (genre, word)
        for genre in brown.categories()
        for word in brown.words(categories=genre))
    genres = ['hobbies', 'lore', 'news', 'romance']
    modals = ['can', 'could', 'may', 'might', 'must', 'will']
    cfd.tabulate(conditions=genres, samples=modals)

              can  could   may  might  must  will
    hobbies   268     58   131     22    83   264
    lore      170    141   165     49    96   175
    news       93     86    66     38    50   389
    romance    74    193    11     51    45    43


Python coding tips

    Coding style
    Procedural vs declarative
    Looping
    Named function arguments
    Defensive programming


Python coding style (sect. 4.3)

Indent with 4 spaces, not tabs
Don't write long lines; break the line instead
either use parentheses:
    if ( len(syllables) > 4 and len(syllables[2]) == 3 and
         syllables[2][2] in "aeiou" and
         syllables[2][3] == syllables[1][3] ):
or add a backslash at the end of the line:
    if len(syllables) > 4 and len(syllables[2]) == 3 and \
       syllables[2][2] in "aeiou" and \
       syllables[2][3] == syllables[1][3]:
At the risk of being repetitive: write docstrings!
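
Looking back at the ConditionalFreqDist slides, the idea of "a dictionary of FreqDists, initialized by (cond, sample) pairs" can be sketched without NLTK. The following is only an illustration under stated assumptions: it uses collections.Counter as a stand-in for FreqDist, the function name conditional_counts and the toy pairs are hypothetical, and it is written for Python 3 (print as a function), unlike the Python 2 snippets on the slides.

```python
from collections import Counter, defaultdict


def conditional_counts(pairs):
    """Count samples per condition from (condition, sample) pairs.

    A rough stand-in for nltk.ConditionalFreqDist: a dict that maps
    each condition to a Counter over its samples.
    """
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table


# Hypothetical toy data standing in for (genre, word) pairs from a corpus.
pairs = [
    ("news", "will"), ("news", "will"), ("news", "can"),
    ("romance", "could"), ("romance", "could"), ("romance", "will"),
]
cfd = conditional_counts(pairs)

# Print one row of counts per condition, similar in spirit to tabulate().
modals = ["can", "could", "will"]
for genre in sorted(cfd):
    print(genre, [cfd[genre][modal] for modal in modals])
```

A Counter returns 0 for a sample it has never seen, which is why looking up a missing modal needs no special case here.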


Procedural vs declarative (4.3)

Procedural:
    count = 0; total = 0
    for token in tokens:
        count += 1
        total += len(token)
    print float(total) / count
Declarative:
    count = len(tokens)
    total = sum(len(t) for t in tokens)
    print float(total) / count
The declarative style needs more infrastructure
    higher-order functions, generic classes


Sorted word list (1st attempt)

A very procedural version:
    word_list = []
    len_word_list = 0
    i = 0
    while i < len(tokens):
        j = 0
        while j < len_word_list and word_list[j] < tokens[i]:
            j += 1
        # insert unless the token is already at position j
        if j == len_word_list or tokens[i] != word_list[j]:
            word_list.insert(j, tokens[i])
            len_word_list += 1
        i += 1
Note:
    we only use len, insert and lookup


Sorted word list (2nd attempt)

Using a for loop instead:
    word_list = []
    for token in tokens:
        j = 0
        while j < len(word_list) and word_list[j] < token:
            j += 1
        # insert unless the token is already at position j
        if j == len(word_list) or token != word_list[j]:
            word_list.insert(j, token)
Note:
    we didn't need the i, just the token
    the for loop takes care of i and of increasing it
    but we do need j
    we rely on the fact that len() is efficient for lists


Sorted word list (3rd attempt)

A very declarative version:
    word_list = sorted(set(tokens))
Note:
    we're not only using sorted and set
    since they rely on a lot of underlying methods


Looping in Python

Other languages often use a counter:
C:
    for (i = 0; i < mylist_length; i++) {
        token = mylist[i];
        // do something with token…
    }
Pascal:
    for i := 0 to mylist_length - 1 do
    begin
        token := mylist[i];
        {do something with token…}
    end
In Python we just loop over the elements:
    for token in tokens:
        # do something with token…


Getting the index

Sometimes we really want the list index too
then we use the function enumerate:
    for i, token in enumerate(tokens):
        # now this holds: token == tokens[i]
And sometimes we don't have a list to loop over
then we use the range function:
    for i in range(10):
        # i will be 0, 1, 2, …, 9
range can take a start value:
    for i in range(1, 11):
        # i will be 1, 2, 3, …, 10


Looping over two lists

Sometimes we want to loop in parallel
e.g., part-of-speech tagging
    def postag(word):
        if word in ("a", "the", "all"):
            return "det"
        else:
            return "noun"
getting a list of postags:
    postaglist = [postag(w) for w in corpus_words]
loop over each word and postag:
    for word, postag in zip(corpus_words, postaglist):
        # do something with the word and its postag…


Loops vs comprehensions

Getting all words that start with a string:
    def prefix_search(prefix, words):
        result = set()
        for word in words:
            if word.startswith(prefix):
                result.add(word)
        return result
The same thing using a comprehension (a generator expression passed to set):
    def prefix_search(prefix, words):
        return set(word for word in words
                   if word.startswith(prefix))
More declarative = shorter = more readable


Named arguments

You can always call a function with named args:
    prefix_search("engl", brown.words())
    …vs…
    prefix_search(prefix="engl", words=brown.words())
Often it is more readable:
    codecs.open("inputfile", "wb", "gbk")
    …vs…
    codecs.open("inputfile", mode="wb", encoding="gbk")
But sometimes it doesn't give us anything:
    nltk.bigrams(brown.words())
    …vs…
    nltk.bigrams(sequence=brown.words())


Named arguments

Since you can always use named args, good argument names are especially important:
    def prefix_search(prefix, words):
    …vs…
    def prefix_search(x, y):


Defensive programming

use assert to check input arguments:
    def prefix_search(prefix, words):
        assert isinstance(prefix, (str, unicode)), "prefix must be a string"
        assert isinstance(words, (list, tuple, set)), "words must be a sequ…
        # the rest…
…to check return values:
    def prefix_search(prefix, words):
        # the rest…
        assert isinstance(result, set), "result should be a set"
        return result
…to check the result where the function is called:
    search_result = prefix_search("engl", brown.words())
    assert isinstance(search_result, set), "search result should be a set"
    assert search_result, "search result should be non-empty"
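
The three kinds of assert on the defensive programming slide can be put together into one runnable function. This is a minimal sketch, not the course's reference solution: it assumes Python 3, so it checks against str where the slide checks (str, unicode), the assertion messages are written out in full here, and the short word list is a hypothetical stand-in for brown.words().

```python
def prefix_search(prefix, words):
    """Return the set of words that start with prefix."""
    # Check the input arguments (Python 3 has a single text type, str).
    assert isinstance(prefix, str), "prefix must be a string"
    assert isinstance(words, (list, tuple, set)), (
        "words must be a list, tuple or set")

    result = {word for word in words if word.startswith(prefix)}

    # Check the return value before handing it back.
    assert isinstance(result, set), "result should be a set"
    return result


# Hypothetical word list standing in for a corpus.
words = ["english", "england", "french", "englishman"]

# Check the result where the function is called.
search_result = prefix_search("engl", words)
assert search_result, "search result should be non-empty"
print(sorted(search_result))  # ['england', 'english', 'englishman']
```

Note that the long assertion message is wrapped in parentheses on its own; writing assert (condition, message) instead would always pass, because a non-empty tuple is true.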
