Introduction to Python
Part 4: NLTK and other cool Python stuff
Alastair Burt Andreas Eisele Christian Federmann
Torsten Marek Ulrich Schäfer
DFKI & Universität des Saarlandes
October 8th, 2009
Outline
Today's Topics:
1 NLTK, the Natural Language Toolkit
    Overview
    Tokenization and PoS Tagging
    Morphology and Feature Structures
    Accessing Corpora
    Chunking and CFG Parsing
    Classification and Clustering
2 Other cool Python stuff
    Building GUIs with TK (Tkinter)
    UNO: (Python-)programming OpenOffice
    Zope & Plone
    Google App Engine
3 Summary and some thoughts
Part I
NLTK Overview
What is NLTK?
NLTK: Natural Language Toolkit
Developed by Steven Bird, Ewan Klein and Edward Loper
Mainly addresses education and research
Very good documentation, book online:
http://www.nltk.org/book
NLTK examples on slides taken from there
We only present very simple approaches to NLP here,
often based on regular expressions. More sophisticated
approaches are discussed in the book.
Installation
Required packages
1 Python version 2.4, 2.5 or 2.6
2 NLTK source distribution:
http://nltk.googlecode.com/files/nltk-2.0b6.zip
3 or NLTK Debian package:
http://nltk.googlecode.com/files/nltk_2.0b5-1_all.deb
4 NLTK data (corpora etc.): see http://www.nltk.org/data
5 further packages and installation instructions:
http://www.nltk.org/download
The NLP Analysis Pipeline
From Text Strings to Text Understanding
1 Tokenization
2 Morphological analysis
3 Part-of-Speech tagging
4 Named entity recognition
5 Chunking (chunk parsing)
6 Sentence boundary detection
7 Syntactic parsing
8 Anaphora resolution
9 Semantic analysis
10 Pragmatics, Reasoning, ...
Part III
Tokenization and PoS Tagging
Regular Expression Tokenizer
Split input string into list of words – remember re.findall()
regexptokenizer1.py: Simple tokenizer
import nltk
text = "Hello. Isn't this fun?"
pattern = r'\w+|[^\w\s]+'
print nltk.tokenize.regexp_tokenize(text, pattern)
Result:
['Hello', '.', 'Isn', "'", 't', 'this', 'fun', '?']
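The slide's reminder about re.findall() can be tried without NLTK. The following is a minimal sketch using only the standard library; the helper name simple_tokenize is made up for illustration and is not part of NLTK.

```python
import re

def simple_tokenize(text, pattern=r"\w+|[^\w\s]+"):
    """Return all non-overlapping matches: runs of word characters
    or runs of punctuation (hypothetical helper, not an NLTK function)."""
    return re.findall(pattern, text)

tokens = simple_tokenize("Hello. Isn't this fun?")
print(tokens)
# prints ['Hello', '.', 'Isn', "'", 't', 'this', 'fun', '?']
```

Whitespace matches neither alternative, so it is skipped and acts as a separator.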
Debugging Regular Expressions
NLTK tool for debugging regular expressions
import nltk
nltk.re_show(pattern, inputstring)
Example
import nltk
nltk.re_show('o+', 'Computational linguistics is cool.')
Result:
C{o}mputati{o}nal linguistics is c{oo}l.
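A similar display can be approximated with re.sub() alone. This is only a sketch of the idea, not NLTK's implementation; the function name show_matches and its left/right parameters are assumptions chosen for this example.

```python
import re

def show_matches(pattern, string, left="{", right="}"):
    """Wrap every match of `pattern` in braces (stand-in for nltk.re_show)."""
    return re.sub(pattern, lambda m: left + m.group(0) + right, string)

marked = show_matches("o+", "Computational linguistics is cool.")
print(marked)
# prints C{o}mputati{o}nal linguistics is c{oo}l.
```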
Regular Expression Tokenizer II
Tokenize currency amounts and abbreviations correctly
regexptokenizer2.py: Extended simple tokenizer
import nltk
text = 'That poster costs $22.40.'
pattern = r'''(?x)
    (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A. (tried first, otherwise \w+ takes the U)
  | \w+                 # sequences of word characters
  | \$?\d+(?:\.\d+)?    # currency amounts, e.g. $12.50
  | [^\w\s]+            # sequences of punctuation
'''
print nltk.tokenize.regexp_tokenize(text, pattern)
Result:
['That', 'poster', 'costs', '$22.40', '.']
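Because alternatives in a regular expression are tried from left to right, their order decides which token wins. The sketch below illustrates this with plain re.findall(); it assumes non-capturing groups (so findall returns whole matches) and puts the abbreviation alternative first. The sample sentence with U.S.A. is made up for this example.

```python
import re

PATTERN = r"""(?x)
    (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
  | \w+                 # sequences of word characters
  | \$?\d+(?:\.\d+)?    # currency amounts, e.g. $12.50
  | [^\w\s]+            # sequences of punctuation
"""

tokens = re.findall(PATTERN, "That U.S.A. poster costs $22.40.")
print(tokens)
# prints ['That', 'U.S.A.', 'poster', 'costs', '$22.40', '.']

# With \w+ listed first, the abbreviation is split into single letters and dots.
naive = re.findall(r"\w+|(?:[A-Z]\.)+|[^\w\s]+", "U.S.A.")
print(naive)
# prints ['U', '.', 'S', '.', 'A', '.']
```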
Regular Expression Tagger
Part-of-Speech tagging assigns a word class to each input token
regexptagger.py: Simple Tagger for English
import nltk
patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd singular present
    (r'.*ould$', 'MD'),                # modals
    (r'.*\'s$', 'NN$'),                # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN')                      # nouns (default)
]
regexp_tagger = nltk.RegexpTagger(patterns)
# tag() expects one tokenized sentence, so tag the first sentence of the category
sents = nltk.corpus.brown.sents(categories='adventure')
print regexp_tagger.tag(sents[0])
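The idea behind such a tagger can be sketched in a few lines of plain Python: each token receives the tag of the first pattern that matches it. This is an illustration under that first-match-wins assumption, not NLTK's code; regexp_tag is a hypothetical helper and the sample sentence is invented.

```python
import re

PATTERNS = [
    (r".*ing$", "VBG"),                # gerunds
    (r".*ed$", "VBD"),                 # simple past
    (r".*es$", "VBZ"),                 # 3rd singular present
    (r".*ould$", "MD"),                # modals
    (r".*'s$", "NN$"),                 # possessive nouns
    (r".*s$", "NNS"),                  # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
    (r".*", "NN"),                     # nouns (default)
]

def regexp_tag(tokens, patterns=PATTERNS):
    """Assign each token the tag of the first pattern that matches it."""
    tagged = []
    for token in tokens:
        for regex, tag in patterns:
            if re.match(regex, token):
                tagged.append((token, tag))
                break
    return tagged

tagged = regexp_tag("She was walking 3 dogs".split())
print(tagged)
# prints [('She', 'NN'), ('was', 'NNS'), ('walking', 'VBG'), ('3', 'CD'), ('dogs', 'NNS')]
```

Note how "was" is tagged as a plural noun: suffix rules alone are a very rough approximation of PoS tagging.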
Part IV
Morphology/Stemmer
Morphology/Stemmer
porter.py: Simple Stemmer (Porter)
import nltk
stemmer = nltk.PorterStemmer()
verbs = ['appears', 'appear', 'appeared', 'calling', 'called']
stems = []
for verb in verbs:
    stemmed_verb = stemmer.stem(verb)
    stems.append(stemmed_verb)
print sorted(set(stems))
Result:
['appear', 'call']
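To see why the five verb forms collapse into two stems, here is a deliberately naive suffix stripper. It is NOT the Porter algorithm (which applies several ordered rule phases); naive_stem and its suffix list are assumptions made up for this sketch.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

verbs = ["appears", "appear", "appeared", "calling", "called"]
stems = sorted(set(naive_stem(verb) for verb in verbs))
print(stems)
# prints ['appear', 'call']
```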
Part V
Accessing Corpora
Corpora included in NLTK
Gutenberg Texts (subset)
Brown Corpus
UDHR
CMU Pronunciation Dictionary
TIMIT
Senseval 2
... and many more
Example
import nltk
nltk.corpus.brown.words()
nltk.corpus.gutenberg.fileids()
Gutenberg Corpus Statistics
Code displays corpus filename, average word length, average
sentence length, and the number of times each vocabulary item
appears in the text on average.
gutenberg.py: Compute simple corpus statistics
from nltk.corpus import gutenberg
for filename in gutenberg.fileids():
    r = gutenberg.raw(filename)    # raw text (characters)
    w = gutenberg.words(filename)  # word tokens
    s = gutenberg.sents(filename)  # sentences
    v = set(w)                     # vocabulary
    print filename, len(r)/len(w), len(w)/len(s), len(w)/len(v)
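The three ratios can be checked by hand on a tiny text. The sketch below uses only the standard library; the regex tokenizer and the sentence splitter are crude stand-ins for the corpus reader, and corpus_stats is a hypothetical helper.

```python
import re

def corpus_stats(raw):
    """Return (chars per word, words per sentence, tokens per vocabulary item)."""
    words = re.findall(r"\w+|[^\w\s]+", raw)
    sents = [s for s in re.split(r"(?<=[.!?])\s+", raw.strip()) if s]
    vocab = set(words)
    return (len(raw) / float(len(words)),
            len(words) / float(len(sents)),
            len(words) / float(len(vocab)))

avg_word_len, avg_sent_len, avg_item_count = corpus_stats(
    "The cat sat. The dog sat. The cat ran.")
print(avg_sent_len)    # prints 4.0 (12 tokens in 3 sentences)
print(avg_item_count)  # prints 2.0 (12 tokens, 6 distinct items)
```

As in the slide's code, the first ratio divides the raw character count (spaces included) by the number of tokens, so it slightly overestimates word length.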
UDHR corpus
Contains the Universal Declaration of Human Rights in over
300 languages. Example requires matplotlib (see NLTK
download page on how to get it).
udhr.py: Compute and display word length distribution
import nltk, pylab
def cld(lang):
    text = nltk.corpus.udhr.words(lang)
    fd = nltk.FreqDist(len(token) for token in text)
    ld = [100*fd.freq(i) for i in range(36)]
    return [sum(ld[0:i+1]) for i in range(len(ld))]
langs = ['Chickasaw-Latin1', 'English-Latin1',
         'German_Deutsch-Latin1', 'Greenlandic_Inuktikut-Latin1',
         'Hungarian_Magyar-Latin1', 'Ibibio_Efik-Latin1']
dists = [pylab.plot(cld(l), label=l[:-7], linewidth=2)
         for l in langs]
pylab.title('Cumulative Word Length Distrib. for Several Languages')
pylab.legend(loc='lower right')
pylab.show()
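What cld() computes can be reproduced without NLTK or matplotlib. The following sketch uses collections.Counter in place of nltk.FreqDist; the function name, the max_len parameter and the sample tokens are assumptions for illustration.

```python
from collections import Counter

def cumulative_length_distribution(tokens, max_len=10):
    """Percentage of tokens with length <= i, for i = 0 .. max_len."""
    counts = Counter(len(token) for token in tokens)
    total = float(sum(counts.values()))
    cumulative, running = [], 0.0
    for length in range(max_len + 1):
        running += 100.0 * counts[length] / total
        cumulative.append(running)
    return cumulative

dist = cumulative_length_distribution(
    "all human beings are born free and equal".split())
print(dist[3])  # prints 37.5 (3 of 8 tokens have at most 3 letters)
print(dist[4])  # prints 62.5
```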
Concordance
concordance.py: Concordance on Brown Corpus
import nltk
def concordance(word, context):
    """Generate a concordance for the word with the
    specified context window"""
    # Brown section A (press reportage) is the 'news' category
    for sent in nltk.corpus.brown.sents(categories='news'):
        try:
            pos = sent.index(word)
            left = ' '.join(sent[:pos])
            right = ' '.join(sent[pos+1:])
            print '%*s %s %-*s' % \
                (context, left[-context:], word,
                 context, right[:context])
        except ValueError:
            pass
concordance('line', 32)
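The formatting trick in this example is the '%*s' and '%-*s' pair, which right-aligns the left context and left-aligns the right context at a width given as an argument. The sketch below shows it on a small hand-made list of tokenized sentences, so it runs without the Brown Corpus; concordance_lines and the sample sentences are made up for this illustration.

```python
def concordance_lines(word, sentences, context):
    """Return one aligned line per sentence containing `word`."""
    lines = []
    for sent in sentences:
        if word not in sent:
            continue
        pos = sent.index(word)
        left = " ".join(sent[:pos])
        right = " ".join(sent[pos + 1:])
        lines.append("%*s %s %-*s" % (context, left[-context:], word,
                                      context, right[:context]))
    return lines

sentences = [["draw", "a", "line", "here"],
             ["no", "match"],
             ["the", "line", "is", "long"]]
lines = concordance_lines("line", sentences, 8)
for line in lines:
    print(line)
```

Every output line has the same length, so the keyword lines up in one column.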
Part VI
Chunking and CFG Parsing
Chunker
chunkparser.py: Chunk parser
import nltk
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive,
                              # adjectives and nouns
      {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
tagged_tokens = [("Rapunzel", "NNP"), ("let", "VBD"),
                 ("down", "RP"), ("her", "PP$"), ("golden", "JJ"),
                 ("hair", "NN")]
print cp.parse(tagged_tokens)
cp.parse(tagged_tokens).draw()
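Chunk rules of this kind are regular expressions over the sequence of tags rather than over characters. One way to sketch that with the standard library is to write the tags into a single string and search it; the encoding, the combined pattern and the helper chunk_noun_phrases below are assumptions for illustration and do not claim to mirror NLTK's internals. The sketch also assumes that no tag contains the character "<".

```python
import re

def chunk_noun_phrases(tagged_tokens):
    """Return the word lists of all NP chunks found in the tag sequence."""
    tag_string = "".join("<%s>" % tag for _, tag in tagged_tokens)
    np_pattern = r"(?:<DT>|<PP\$>)?(?:<JJ>)*<NN>|(?:<NNP>)+"
    chunks = []
    for match in re.finditer(np_pattern, tag_string):
        # The number of "<" before a position equals the token index there.
        first = tag_string.count("<", 0, match.start())
        last = tag_string.count("<", 0, match.end())
        chunks.append([word for word, _ in tagged_tokens[first:last]])
    return chunks

tagged_tokens = [("Rapunzel", "NNP"), ("let", "VBD"),
                 ("down", "RP"), ("her", "PP$"), ("golden", "JJ"),
                 ("hair", "NN")]
chunks = chunk_noun_phrases(tagged_tokens)
print(chunks)
# prints [['Rapunzel'], ['her', 'golden', 'hair']]
```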
CFG Parser
Various Parsing Algorithms are implemented within NLTK and
explained in detail in the NLTK book
cfgparser.py: CFG parser
import nltk
grammar = nltk.parse_cfg("""
S -> NP VP
VP -> V NP | V NP PP
V -> "saw" | "ate"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "dog" | "cat" | "cookie" | "park"
PP -> P NP
P -> "in" | "on" | "by" | "with"
""")
CFG Parser
Recursive Descent Parser
cfgparser.py: CFG parser example continued
sent = "Mary saw Bob".split()
rd_parser = nltk.RecursiveDescentParser(grammar)
for p in rd_parser.nbest_parse(sent):
    print p
    p.draw()
sent = "John ate my cookie in the park".split()
rd_parser = nltk.RecursiveDescentParser(grammar)
for p in rd_parser.nbest_parse(sent):
    print p
    p.draw()
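Recursive descent parsing expands the start symbol top-down and backtracks over alternative productions. The sketch below does this in plain Python for the same toy grammar, written as a dictionary; it is not NLTK's RecursiveDescentParser, and the names GRAMMAR, expand, expand_sequence and parse are chosen for this example. It relies on the grammar having no left recursion.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "V": [["saw"], ["ate"]],
    "NP": [["John"], ["Mary"], ["Bob"], ["Det", "N"], ["Det", "N", "PP"]],
    "Det": [["a"], ["an"], ["the"], ["my"]],
    "N": [["dog"], ["cat"], ["cookie"], ["park"]],
    "PP": [["P", "NP"]],
    "P": [["in"], ["on"], ["by"], ["with"]],
}

def expand(symbol, tokens, start):
    """Yield (tree, end) for every way `symbol` can cover tokens[start:end]."""
    if symbol not in GRAMMAR:  # terminal: must equal the next token
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for production in GRAMMAR[symbol]:
        for children, end in expand_sequence(production, tokens, start):
            yield (symbol, children), end

def expand_sequence(symbols, tokens, start):
    """Yield (list of trees, end) for a sequence of symbols, left to right."""
    if not symbols:
        yield [], start
        return
    for tree, mid in expand(symbols[0], tokens, start):
        for rest, end in expand_sequence(symbols[1:], tokens, mid):
            yield [tree] + rest, end

def parse(tokens):
    """Return all trees for S that cover the complete token list."""
    return [tree for tree, end in expand("S", tokens, 0) if end == len(tokens)]

print(len(parse("Mary saw Bob".split())))                    # prints 1
print(len(parse("John ate my cookie in the park".split())))  # prints 2
```

The second sentence gets two trees because "in the park" can attach either to the noun phrase "my cookie" or to the verb phrase.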
Earley Parsing with Feature Structures
Augment CFG with feature structures (e.g. to express
agreement on morphological properties)
NLTK comes with a very simple implementation of feature
structures
Example: Earley Algorithm
earley.py: Feature-Based Earley Parsing
import nltk
from nltk.parse import load_parser
tokens = 'Kim likes children'.split()
cp = load_parser('grammars/feat0.fcfg', trace=2)
trees = cp.nbest_parse(tokens)
for tree in trees:
    print tree
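Agreement checking with feature structures boils down to unification: two structures are merged unless they disagree on a feature. The sketch below handles only flat structures with atomic values, represented as dictionaries; it is a simplification and not NLTK's feature structure implementation, and the feature names NUM and PER are illustrative.

```python
def unify(fs1, fs2):
    """Merge two flat feature structures; return None if they clash."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature in result and result[feature] != value:
            return None  # conflicting values: unification fails
        result[feature] = value
    return result

subject = {"NUM": "sg", "PER": 3}
print(unify(subject, {"NUM": "sg"}) == {"NUM": "sg", "PER": 3})  # prints True
print(unify(subject, {"NUM": "pl"}))                             # prints None
```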
Classification and Clustering
See the NLTK book...
Additional support via numpy (optional numeric Python
library)
Part VII
... and other cool stuff: TK / Tkinter
TK / TKinter
Easily and quickly create GUI dialogs with Python
Help users enter data with a minimal amount of typing
Platform-independent
Additional package may be required (on Debian/Ubuntu:
python-tk)
Doc: Fredrik Lundh, An Introduction to Tkinter
There are more ways to do GUI programming in Python ...
e.g. used in configuration dialogs in Linux (KDE, Gnome:
GTK)
My first TK GUI
tk1.py: TK GUI with Label and Button
import sys  # explicit import instead of relying on the star import
from Tkinter import *
def end():
    sys.exit(0)
mywindow = Tk()
w = Label(mywindow, text="Hello, world!")
b = Button(mywindow, text="End", command=end)
w.pack()
b.pack()
mywindow.mainloop()
Text Entry and Listbox Example
wordnet.py: WordNet Polysemy Synset Inspector
from Tkinter import *
from nltk.corpus import wordnet
def showWordNetPolysemy():
    liBox.delete(0, END)
    word = eText.get()
    poly = wordnet.synsets(word, wordnet.NOUN)
    for synset in poly:
        liBox.insert(END, synset)
    liBox.pack()
mywindow = Tk()
eText = Entry(mywindow, width=80)  # Text box
eText.pack()
bGo = Button(mywindow, text="Show WordNet Polysemy",
             command=showWordNetPolysemy)
bGo.pack()
liBox = Listbox(mywindow, height=0, width=80)
liBox.pack()
mywindow.mainloop()
Further simple GUI elements (Widgets)
Seen: labels, buttons, text box, list box
Further elements: slider, checkbox, radio button, bitmap
Primitive drawings: points, lines, circles, rectangles
Predefined Message Boxes
tk2.py: Predefined Message Boxes
from Tkinter import *
import tkMessageBox
mywindow = Tk()
answer = tkMessageBox.askyesno("Yes or No?",
                               "Please choose: Yes or No?")
print answer
FileOpen Dialog
tk3.py: FileOpen Dialog
from Tkinter import *
import tkFileDialog
fo_window = tkFileDialog.Open()
filename = fo_window.show()
if filename:
    print filename
else:
    print "no file chosen"
Python Power for OpenOffice
Python-UNO bridge
(Remote-)controlling OpenOffice through Python
UNO stands for Universal Network Objects
Component object model of OpenOffice
Idea: Interoperability between programming languages,
object models and hardware architectures,
either in process or over process boundaries,
as well as in the intranet or the Internet.
UNO components may be implemented in and accessed
from any programming language for which a UNO
implementation (AKA language binding) and an
appropriate bridge or adapter exists
OpenOffice ships with built-in Python 2.3.4 interpreter
Python UNO Example
Python Code (Macro) to create new OpenOffice text
(Writer) document, insert text and table.
Sample code from
/usr/lib/openoffice/share/Scripts/python/pythonSamples/TableSample.py
Required packages (in Debian/Ubuntu):
openoffice.org, python-uno
Example runs from within OO macro menu
Alternatively, OO can be started as socket server, and
external Python script controls OO (even remotely)
Further documentation:
http://udk.openoffice.org/python/python-bridge.html
Zope and Plone
Zope: Application Server implemented in Python
Dynamic HTML, Object Model, Database, Python
Scripting
http://www.zope.org
Plone: CMS based on Zope
http://www.plone.org
Example
Language Technology World Portal
(http://www.lt-world.org)
By the way...
... need a job?
DFKI project TAKE: Science Information Systems
Innovative applications using NLP, information & relation
extraction on scientific papers in our own field
Annotation jobs: Coreference/Anaphora
Programming jobs: Python and/or Java: GUI, NLP
Interested? → send me CV
... need credit points?
Winter semester 2008/09 project seminar
/ M.Sc. LS&T: Specialization course, area LT
NLP+ML for Science Information Systems
Various tasks: evaluation of existing system, implementation
work, e.g. automatic extension of ontology (NLTK!)
Google App Engine
Google is a Python company
You may use Google servers to run your Python (only)
applications through the Google App Engine
http://code.google.com/appengine/
Would Google be as successful as it is if it were based
on Java or Perl as much as it is on Python?
Learned from Google: Do business faster with Python (not
only computationally)
Extremely important for a company growing as quickly as
Google
Some wise thoughts
Coming back to the Zen from first lecture...
Explicit or implicit? (remember 'Hello' * 100 or list
comprehensions...)
Explicitness, maintainability: Java > Python > Perl
Dynamic typing in Python is both advantage and
disadvantage
Some wise thoughts
Use the language that best solves your problem!
Don't be dogmatic
Python is a universal programming language that can help
to solve many problems very quickly
Python is very good for learning how to program
Python scales up to world-size problems
But it may not be the perfect programming language in
every case
Part VIII
Exercises
Exercises
Tomorrow
1 Look at today's example code, experiment with it
2 Get the exercise sheet
http://www.dfki.de/~uschaefer/python09/
3 Try to solve the exercises
Thank you for your attention!