
Introduction to Python Part 4: NLTK and other cool Python stuff

Alastair Burt, Andreas Eisele, Christian Federmann, Torsten Marek, Ulrich Schäfer
DFKI & Universität des Saarlandes
October 8th, 2009

Outline

Today’s Topics:

1. NLTK, the Natural Language Toolkit
   - Overview
   - Tokenization and PoS Tagging
   - Morphology and Feature Structures
   - Accessing Corpora
   - Chunking and CFG Parsing
   - Classification and Clustering
2. Other cool Python stuff
   - Building GUIs with TK (Tkinter)
   - UNO: (Python-)programming OpenOffice
   - Zope & Plone
   - Google App Engine
3. Summary and some thoughts

Part I: NLTK Overview

What is NLTK?

- NLTK: Natural Language Toolkit
- Developed by Steven Bird, Ewan Klein and Edward Loper
- Mainly addresses education and research
- Very good documentation, book online: http://www.nltk.org/book
- NLTK examples on slides taken from there
- We only present very simple approaches to NLP here, often based on regular expressions. More sophisticated approaches are discussed in the book.
Installation

Required packages:

1. Python version 2.4, 2.5 or 2.6
2. NLTK source distribution: http://nltk.googlecode.com/files/nltk-2.0b6.zip
3. or NLTK Debian package: http://nltk.googlecode.com/files/nltk_2.0b5-1_all.deb
4. NLTK data (corpora etc.): see http://www.nltk.org/data
5. Further packages and installation instructions: http://www.nltk.org/download

The NLP Analysis Pipeline

From Text Strings to Text Understanding:

1. Tokenization
2. Morphological analysis
3. Part-of-Speech tagging
4. Named entity recognition
5. Chunking (chunk parsing)
6. Sentence boundary detection
7. Syntactic parsing
8. Anaphora resolution
9. Semantic analysis
10. Pragmatics, Reasoning, ...

Part III: Tokenization and PoS Tagging

Regular Expression Tokenizer

Split input string into list of words (remember re.findall()).

regexptokenizer1.py: Simple tokenizer

    import nltk

    text = "Hello. Isn't this fun?"
    pattern = r'\w+|[^\w\s]+'
    print nltk.tokenize.regexp_tokenize(text, pattern)

Result: ['Hello', '.', 'Isn', "'", 't', 'this', 'fun', '?']

Debugging Regular Expressions

NLTK tool for debugging regular expressions:

    import nltk
    nltk.re_show(pattern, inputstring)

Example:

    import nltk
    nltk.re_show('o+', 'Computational linguistics is cool.')

Result: C{o}mputati{o}nal linguistics is c{oo}l.
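The two slides above can be approximated with the standard library alone. The following is a minimal sketch, written for Python 3 rather than the Python 2 used on the slides: it tokenizes with re.findall() and marks matches in braces the way the re_show output above does. The helper name show_matches is made up for this illustration and is not part of NLTK.

```python
import re

# Same pattern as on the slide: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
pattern = r"\w+|[^\w\s]+"
text = "Hello. Isn't this fun?"

# re.findall returns every non-overlapping match, left to right.
tokens = re.findall(pattern, text)
print(tokens)


def show_matches(regexp, string):
    """Return `string` with every match of `regexp` wrapped in braces.

    Hypothetical helper imitating the output format of nltk.re_show;
    it returns the marked-up string instead of printing it.
    """
    return re.sub(regexp, lambda m: "{" + m.group(0) + "}", string)


print(show_matches("o+", "Computational linguistics is cool."))
```

Because the pattern contains no groups, re.findall() returns plain strings. As soon as a pattern contains capturing groups, it returns the group contents instead, which matters for the extended tokenizer.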
Regular Expression Tokenizer II

Tokenize currency amounts and abbreviations correctly.

regexptokenizer2.py: Extended simple tokenizer

    import nltk

    text = 'That poster costs $22.40.'
    pattern = r'''(?x)
          \w+                # sequences of word characters
        | \$?\d+(\.\d+)?     # currency amounts, e.g. $12.50
        | ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
        | [^\w\s]+           # sequences of punctuation
    '''
    nltk.tokenize.regexp_tokenize(text, pattern)

Result: ['That', 'poster', 'costs', '$22.40', '.']

Note that the alternatives are tried from left to right. With \w+ listed first, an input such as U.S.A. or 12.50 (without the $) is still split at the periods. Listing the more specific patterns before \w+ avoids this.

Regular Expression Tagger

Part-of-Speech tagging assigns a word class to each input token.

regexptagger.py: Simple Tagger for English

    import nltk

    patterns = [
        (r'.*ing$', 'VBG'),                # gerunds
        (r'.*ed$', 'VBD'),                 # simple past
        (r'.*es$', 'VBZ'),                 # 3rd singular present
        (r'.*ould$', 'MD'),                # modals
        (r'.*\'s$', 'NN$'),                # possessive nouns
        (r'.*s$', 'NNS'),                  # plural nouns
        (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
        (r'.*', 'NN')                      # nouns (default)
    ]
    regexp_tagger = nltk.RegexpTagger(patterns)
    # tag() expects one tokenized sentence, so tag the first sentence
    regexp_tagger.tag(nltk.corpus.brown.sents(categories='adventure')[0])

Part IV: Morphology/Stemmer

porter.py: Simple Stemmer (Porter)

    import nltk

    stemmer = nltk.PorterStemmer()
    verbs = ['appears', 'appear', 'appeared', 'calling', 'called']
    stems = []
    for verb in verbs:
        stemmed_verb = stemmer.stem(verb)
        stems.append(stemmed_verb)
    sorted(set(stems))

Result: ['appear', 'call']
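Returning to the tokenizer and tagger slides of Part III, the sketch below chains the two steps using only the standard library, in Python 3 rather than the Python 2 of the slides. It is an illustration of the idea (ordered patterns, first match wins), not NLTK's implementation. The function names tokenize and tag are made up for this example. The token pattern is a variant of the slide's pattern with non-capturing groups and the specific alternatives listed first.

```python
import re

# Variant of the slide's tokenizer pattern (an assumption of this sketch):
# groups are non-capturing so that re.findall returns whole matches,
# and the specific alternatives come before the generic \w+.
TOKEN_PATTERN = re.compile(r"""(?x)
      (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?    # currency amounts, e.g. $12.50
    | \w+                 # sequences of word characters
    | [^\w\s]+            # sequences of punctuation
""")

# Same (pattern, tag) pairs as on the tagger slide; order matters.
TAG_PATTERNS = [(re.compile(p), t) for p, t in [
    (r".*ing$", "VBG"),                # gerunds
    (r".*ed$", "VBD"),                 # simple past
    (r".*es$", "VBZ"),                 # 3rd singular present
    (r".*ould$", "MD"),                # modals
    (r".*'s$", "NN$"),                 # possessive nouns
    (r".*s$", "NNS"),                  # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
    (r".*", "NN"),                     # nouns (default)
]]


def tokenize(text):
    """Split text into tokens with the pattern above."""
    return TOKEN_PATTERN.findall(text)


def tag(tokens):
    """Give each token the tag of the first pattern that matches it."""
    tagged = []
    for token in tokens:
        for regexp, label in TAG_PATTERNS:
            if regexp.match(token):
                tagged.append((token, label))
                break  # first match wins; the final '.*' always matches
    return tagged


print(tag(tokenize("That U.S.A. poster costs $22.40.")))
print(tag("She would be walking 3 dogs".split()))
```

The catch-all default also explains the weakness of this approach: anything the earlier patterns miss, including punctuation and "$22.40", ends up tagged as a noun.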
Part V: Accessing Corpora

Corpora included in NLTK

- Gutenberg Texts (subset)
- Brown Corpus
- UDHR
- CMU Pronunciation Dictionary
- TIMIT
- Senseval 2
- ... and many more

Example:

    import nltk
    nltk.corpus.brown.words()
    nltk.corpus.gutenberg.fileids()

Gutenberg Corpus Statistics

Code displays corpus filename, average word length, average sentence length, and the number of times each vocabulary item appears in the text on average.

gutenberg.py: Compute simple corpus statistics

    from nltk.corpus import gutenberg

    for filename in gutenberg.fileids():
        r = gutenberg.raw(filename)     # raw text as one string
        w = gutenberg.words(filename)   # list of word tokens
        s = gutenberg.sents(filename)   # list of sentences
        v = set(w)                      # vocabulary
        # Python 2 integer division: the averages are truncated to whole numbers
        print filename, len(r)/len(w), len(w)/len(s), len(w)/len(v)

UDHR corpus

Contains the Universal Declaration of Human Rights in over 300 languages. Example requires matplotlib (see NLTK download page on how to get it).

udhr.py: Compute and display word length distribution

    import nltk, pylab

    def cld(lang):
        text = nltk.corpus.udhr.words(lang)
        fd = nltk.FreqDist(len(token) for token in text)
        ld = [100*fd.freq(i) for i in range(36)]
        return [sum(ld[0:i+1]) for i in range(len(ld))]

    langs = ['Chickasaw-Latin1', 'English-Latin1',
             'German_Deutsch-Latin1', 'Greenlandic_Inuktikut-Latin1',
             'Hungarian_Magyar-Latin1', 'Ibibio_Efik-Latin1']

    dists = [pylab.plot(cld(l), label=l[:-7], linewidth=2) for l in langs]
    pylab.title('Cumulative Word Length Distrib. for Several Languages')
    pylab.legend(loc='lower right')
    pylab.show()

Concordance

concordance.py: Concordance on Brown Corpus

    import nltk

    def concordance(word, context):
        "Generate a concordance for the word with the specified context window"
        for sent in nltk.corpus.brown.sents(categories='a'):
            try:
                pos = sent.index(word)
                left = ' '.join(sent[:pos])
                right = ' '.join(sent[pos+1:])
                print '%*s %s %-*s' %\
                    (context, left[-context:], word, context, right[:context])
            except ValueError:
                pass

    concordance('line', 32)

Part VI: Chunking and CFG Parsing

Chunker

chunkparser.py: Chunk parser

    import nltk

    grammar = r"""
      NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and nouns
          {<NNP>+}                # chunk sequences of proper nouns
    """
    cp = nltk.RegexpParser(grammar)

    tagged_tokens = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
                     ("her", "PP$"), ("golden", "JJ"), ("hair", "NN")]

    print cp.parse(tagged_tokens)
    cp.parse(tagged_tokens).draw()

CFG Parser

Various parsing algorithms are implemented within NLTK and explained in detail in the NLTK book.

cfgparser.py: CFG parser

    import nltk

    grammar = nltk.parse_cfg("""
    S -> NP VP
    VP -> V NP | V NP PP
    V -> "saw" | "ate"
    NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
    Det -> "a" | "an" | "the" | "my"
    N -> "dog" | "cat" | "cookie" | "park"
    PP -> P NP
    P -> "in" | "on" | "by" | "with"
    """)
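To make the grammar above concrete without NLTK, here is a minimal sketch of a backtracking top-down (recursive descent) parser over the same rules, using only the Python 3 standard library. It is not NLTK's RecursiveDescentParser. The dictionary encoding of the grammar and the names GRAMMAR, expand, expand_sequence and parse are assumptions made for this illustration. Trees are plain (label, children) tuples. The approach only terminates because this grammar has no left-recursive rules.

```python
# Grammar from the slide, encoded as: nonterminal -> list of right-hand sides.
# Any symbol that is not a key of GRAMMAR is treated as a terminal word.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "VP":  [["V", "NP"], ["V", "NP", "PP"]],
    "V":   [["saw"], ["ate"]],
    "NP":  [["John"], ["Mary"], ["Bob"], ["Det", "N"], ["Det", "N", "PP"]],
    "Det": [["a"], ["an"], ["the"], ["my"]],
    "N":   [["dog"], ["cat"], ["cookie"], ["park"]],
    "PP":  [["P", "NP"]],
    "P":   [["in"], ["on"], ["by"], ["with"]],
}


def expand(symbol, tokens, pos):
    """Yield (tree, next_pos) for each way `symbol` can cover tokens[pos:next_pos]."""
    if symbol not in GRAMMAR:
        # Terminal: it must equal the next input token.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    # Nonterminal: try every production in turn (backtracking via generators).
    for production in GRAMMAR[symbol]:
        for children, end in expand_sequence(production, tokens, pos):
            yield (symbol, children), end


def expand_sequence(symbols, tokens, pos):
    """Yield (list_of_trees, next_pos) for matching `symbols` one after another."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in expand(symbols[0], tokens, pos):
        for rest, end in expand_sequence(symbols[1:], tokens, mid):
            yield [tree] + rest, end


def parse(tokens, start="S"):
    """Return all trees for `start` that consume the whole token list."""
    return [tree for tree, end in expand(start, tokens, 0) if end == len(tokens)]


for sentence in ["Mary saw Bob", "John ate my cookie in the park"]:
    trees = parse(sentence.split())
    print(sentence, "->", len(trees), "parse(s)")
    for tree in trees:
        print("  ", tree)
```

The second sentence is ambiguous under this grammar: "in the park" can attach to the verb phrase (VP -> V NP PP) or to the noun phrase (NP -> Det N PP), so two trees are found.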
CFG Parser: Recursive Descent Parser

cfgparser.py: CFG parser example continued

    sent = "Mary saw Bob".split()
    rd_parser = nltk.RecursiveDescentParser(grammar)
    for p in rd_parser.nbest_parse(sent):
        print p
        p.draw()

    sent = "John ate my cookie in the park".split()
    rd_parser = nltk.RecursiveDescentParser(grammar)
    for p in rd_parser.nbest_parse(sent):
        print p
        p.draw()

Earley Parsing with Feature Structures

- Augment CFG with feature structures (e.g. to express agreement on morphological properties)
- NLTK comes with a very simple implementation of feature structures
- Example: Earley Algorithm

earley.py: Feature-Based Earley Parsing

    import nltk
    from nltk.parse import load_parser

    tokens = 'Kim likes children'.split()
    cp = load_parser('grammars/feat0.fcfg', trace=2)
    trees = cp.nbest_parse(tokens)

Classification and Clustering

- See the NLTK book...
- Additional support via numpy (optional numeric Python library)

Part VII: ... and other cool stuff: TK / Tkinter

TK / Tkinter

- Easily and quickly create GUI dialogs with Python
- Help users enter data with minimal typing
- Platform-independent
- Additional package may be required (on Debian/Ubuntu: python-tk)
- Doc: Fredrik Lundh, An Introduction to Tkinter
- There are more ways to do GUI programming in Python, e.g. those used in configuration dialogs in Linux (KDE, Gnome: GTK)

My first TK GUI

tk1.py: TK GUI with Label and Button

    import sys
    from Tkinter import *

    def end():
        sys.exit(0)

    mywindow = Tk()
    w = Label(mywindow, text="Hello, world!")
    b = Button(mywindow, text="End", command=end)
    w.pack()
    b.pack()
    mywindow.mainloop()

Text Entry and Listbox Example

wordnet.py: WordNet Polysemy Synset Inspector

    from Tkinter import *
    from nltk import wordnet

    def showWordNetPolysemy():
        liBox.delete(0, END)              # clear previous results
        input = eText.get()               # word typed by the user
        poly = wordnet.synsets(input, wordnet.NOUN)
        for synset in poly:
            liBox.insert(END, synset)
        liBox.pack()

    mywindow = Tk()
    eText = Entry(mywindow, width=80)     # Text box
    eText.pack()
    bGo = Button(mywindow, text="Show WordNet Polysemy",
                 command=showWordNetPolysemy)
    bGo.pack()
    liBox = Listbox(mywindow, height=0, width=80)
    liBox.pack()
    mywindow.mainloop()

Further simple GUI elements (Widgets)

- Seen: labels, buttons, text box, list box
- Further elements: slider, checkbox, radio button, bitmap
- Primitive drawings: points, lines, circles, rectangles

Predefined Message Boxes

tk2.py: Predefined Message Boxes

    from Tkinter import *
    import tkMessageBox

    mywindow = Tk()
    answer = tkMessageBox.askyesno("Yes or No?", "Please choose: Yes or No?")
    print answer

FileOpen Dialog
tk3.py: FileOpen Dialog

    from Tkinter import *
    import tkFileDialog

    fo_window = tkFileDialog.Open()
    filename = fo_window.show()
    if filename:
        print filename
    else:
        print "no file chosen"

Python Power for OpenOffice

Python-UNO bridge: (Remote-)controlling OpenOffice through Python

- UNO stands for Universal Network Objects
- Component object model of OpenOffice
- Idea: Interoperability between programming languages, object models and hardware architectures, either in process or over process boundaries, as well as in the intranet or the Internet.
- UNO components may be implemented in and accessed from any programming language for which a UNO implementation (AKA language binding) and an appropriate bridge or adapter exists
- OpenOffice ships with built-in Python 2.3.4 interpreter

Python UNO Example

Python Code (Macro) to create new OpenOffice text (Writer) document, insert text and table.
- Sample code from /usr/lib/openoffice/share/Scripts/python/pythonSamples/TableSample.py
- Required packages (in Debian/Ubuntu): openoffice.org, python-uno
- Example runs from within OO macro menu
- Alternatively, OO can be started as socket server, and external Python script controls OO (even remotely)
- Further documentation: http://udk.openoffice.org/python/python-bridge.html

Zope and Plone

- Zope: Application Server implemented in Python
  Dynamic HTML, Object Model, Database, Python Scripting
  http://www.zope.org
- Plone: CMS based on Zope
  http://www.plone.org
- Example: Language Technology World Portal (http://www.lt-world.org)

By the way...

... need a job?

- DFKI project TAKE: Science Information Systems
- Innovative applications using NLP, information & relation extraction on scientific papers in our own field
- Annotation jobs: Coreference/Anaphora
- Programming jobs: Python and/or Java: GUI, NLP
- Interested? → send me CV

... need credit points?

- Winter semester 2008/09
- Project seminar / M.Sc. LS&T: Specialization course, area LT
- NLP+ML for Science Information Systems
- Various tasks: evaluation of existing system, implementation work, e.g. automatic extension of ontology (NLTK!)
Google App Engine

- Google is a Python company
- You may use Google servers to run your Python (only) applications through the Google App Engine: http://code.google.com/appengine/
- Would Google be as successful as it is if it were based on Java or Perl as much as on Python?
- Learned from Google: Do business faster with Python (not only computationally)
- Extremely important for a company growing as quickly as Google

Some wise thoughts

Coming back to the Zen from first lecture...

- Explicit or implicit? (remember 'Hello' * 100 or list comprehensions...)
- Explicitness, maintainability: Java > Python > Perl
- Dynamic typing in Python is both advantage and disadvantage

- Use the language that best solves your problem!
- Don’t be dogmatic
- Python is a universal programming language that can help to solve many problems very quickly
- Python is very good for learning how to program
- Python scales up to world-size problems
- But it may not be the perfect programming language in every case

Part VIII: Exercises

Exercises

Tomorrow:

1. Look at today’s example code, experiment with it
2. Get the exercise sheet: http://www.dfki.de/~uschaefer/python09/
3. Try to solve the exercises

Thank you for your attention!
