Getting started with Python/NLTK

CIS 530 Fall 2010
Annie Louis
20 Sept. 2010

Basics to help you get started
◦ Data structures, functions, loops, running a script…
◦ NLTK corpus reader, probability distributions


You will find more resources on the course
webpage
Some examples in this tutorial are based on
◦ Mark Lutz & David Ascher, ‘Learning Python’
◦ Brad Dayley, ‘Python Phrasebook’

Portable

Can wrap code written in other languages, e.g. C

Various built-in types: lists, dictionaries…

Easy to learn quickly; the syntax is very concise

NLTK comes with a lot of NLP utilities

Read
http://www.cis.upenn.edu/~cis530/hw_2010/generalinfo.pdf

Python and NLTK are both already installed on eniac
> python
Python 2.6.2 (r262:71600, Jun 17 2010, 13:37:45)
[GCC 4.4.1 [gcc-4_4-branch revision 150839]] on linux2
Type "help", "copyright", "credits" or "license" for more
information.
>>> import nltk
>>> nltk.corpus.brown.words()[0]
'The'
> python
>>> 2+3
5
>>> word = "python"
>>> word[2]
't'
>>> print word
python
>>> x = 3
>>> y = 9
>>> z = (x + y + 0.0)/3
>>> z
4.0
No type declaration
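
The "no type declaration" point can be illustrated with a small sketch. It assumes nothing beyond built-in types; the variable names are illustrative only. A name simply refers to whatever value was last assigned to it, and `type()` reports the type of that value.

```python
# A name has no declared type; it refers to whatever value was last assigned.
x = 3
type_before = type(x)      # int
x = "three"
type_after = type(x)       # str

# Adding 0.0 (as in the slide) forces floating-point arithmetic,
# so the average is not truncated to an integer under Python 2.
total = 3 + 9 + 0.0
mean = total / 3           # 4.0
```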

Unix command line
python myscript.py

Within the interpreter
>>> execfile("myscript.py")
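
The slides do not show what `myscript.py` contains, so the following is only a hypothetical sketch of a minimal script. The function name `main` and the sentence are assumptions. The `__name__` guard lets the same file be run from the command line or imported without side effects. Note that `execfile` exists only in Python 2; the sketch itself uses `print()` so that it also runs under Python 3.

```python
# myscript.py (hypothetical contents; the slides do not show the file)

def main():
    # Count the whitespace-separated tokens in a sample sentence.
    sentence = "Stock prices fell."
    return len(sentence.split())


if __name__ == "__main__":
    # Runs when the file is executed as `python myscript.py`,
    # but not when it is imported as a module.
    print(main())
```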
>>> sentence = "Stock prices fell. They did so last week too."
>>> sentence.split()
['Stock', 'prices', 'fell.', 'They', 'did', 'so', 'last', 'week', 'too.']


split() with no separator => split at
whitespace characters (space, tab, newline)
With a different separator
>>> sentence.split(".")
['Stock prices fell', ' They did so last week too', '']
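
Splitting on "." leaves a leading space on the second piece and an empty string after the final period. One way to tidy that up, shown here as a sketch rather than the course's prescribed method, is to strip each piece and drop the empty ones.

```python
sentence = "Stock prices fell. They did so last week too."

# split(".") yields ['Stock prices fell', ' They did so last week too', ''].
pieces = sentence.split(".")

# Strip surrounding whitespace and keep only non-empty pieces.
sentences = [p.strip() for p in pieces if p.strip()]
# sentences is now ['Stock prices fell', 'They did so last week too']
```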

Method 1– using ‘+’ operator
>>> word1 = "This"
>>> word2 = "is"
>>> word3 = "a"
>>> word4 = "sentence"
>>> combined1 = word1+" "+word2+" "+word3+" "+word4
>>> combined1
'This is a sentence'

Method 2– using the ‘join’ string method
>>> wordlist = ["This", "is", "a", "sentence"]
>>> ' '.join(wordlist)
'This is a sentence'
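
One detail worth a sketch: `join` is called on the separator string and accepts only strings, so other types have to be converted first. The `counts` list below is made up for illustration.

```python
words = ["This", "is", "a", "sentence"]
combined = " ".join(words)                 # 'This is a sentence'

# join() raises TypeError on non-string items, so convert them first.
counts = [3, 1, 4]
line = ",".join(str(c) for c in counts)    # '3,1,4'
```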

Case sensitive
>>> string1 = "apple"
>>> string2 = "orange"
>>> string3 = "aPPle"
>>> string1 == string2
False
>>> string1 == string3
False

Case insensitive
>>> string1.lower() == string3.lower()
True


Search and replace
Trimming
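
The slides name these two operations without showing code. A minimal sketch follows, assuming the standard string methods `find`, `replace`, `strip` and `rstrip` are what was meant; the sample text is made up.

```python
text = "  Stock prices fell.  "

# Trimming: strip() removes leading and trailing whitespace.
trimmed = text.strip()                        # 'Stock prices fell.'
no_period = trimmed.rstrip(".")               # 'Stock prices fell'

# Searching: find() returns the start index, or -1 if absent.
position = trimmed.find("prices")             # 6
missing = trimmed.find("bonds")               # -1
has_word = "prices" in trimmed                # True

# Replacing: replace() returns a new string; the original is unchanged.
replaced = trimmed.replace("fell", "rose")    # 'Stock prices rose.'
```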

Ordered collection of items

Can contain items of any type
>>> digits = [0,1,2,3,4,5,6,7,8,9]
>>> strings = ["the", "dog", "ran"]

Indices start from 0

Items in a range

Negative indices work backwards
>>> strings[0]
'the'
>>> strings[2]
'ran'
>>> digits[2:4]
[2, 3]
>>> digits[-1]
9
>>> digits[-2]
8
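
A few more slice forms, as a sketch on the same `digits` list. The end index of a slice is excluded, and omitted bounds default to the start or end of the list.

```python
digits = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

middle = digits[2:4]        # [2, 3]; index 4 is excluded
first_three = digits[:3]    # [0, 1, 2]
last_three = digits[-3:]    # [7, 8, 9]
evens = digits[::2]         # [0, 2, 4, 6, 8]; a step of 2
```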

Add/remove
>>> strings.append("fast")
>>> strings.insert(1, "brown")
>>> strings
['the', 'brown', 'dog', 'ran', 'fast']
>>> digits.remove(8)
>>> digits
[0, 1, 2, 3, 4, 5, 6, 7, 9]

Sort
>>> digits.sort(reverse=True)
>>> digits
[9, 7, 6, 5, 4, 3, 2, 1, 0]
>>> digits.sort()
>>> digits
[0, 1, 2, 3, 4, 5, 6, 7, 9]
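
`sort()` changes the list in place. When the original order must be kept, the built-in `sorted()` returns a new list instead; the sketch below also shows the `key` argument. The sample lists are illustrative.

```python
digits = [3, 0, 9, 5]

# sorted() returns a new list and leaves the original untouched.
descending = sorted(digits, reverse=True)    # [9, 5, 3, 0]

# key= sorts by a computed value, here the length of each word.
# The sort is stable, so equal-length words keep their original order.
words = ["the", "brown", "dog", "ran", "fast"]
by_length = sorted(words, key=len)           # ['the', 'dog', 'ran', 'fast', 'brown']
```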

Similar to lists

But cannot be modified
>>> first_five = (1,2,3,4,5)
>>> first_five[2]
3
>>> first_five.append(6)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'

Tuple to list

List to tuple
>>> newlist = list(first_five)
>>> newlist.append(6)
>>> newlist
[1, 2, 3, 4, 5, 6]
>>> first_six = tuple(newlist)
>>> first_six.append(7)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'
>>> first_six
(1, 2, 3, 4, 5, 6)
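
Because tuples cannot be modified, they can serve as dictionary keys, which is handy for things like word pairs. The sketch below is illustrative only; the bigram counts are made up.

```python
first_five = (1, 2, 3, 4, 5)

# Item assignment fails on a tuple with TypeError.
try:
    first_five[0] = 10
    modified = True
except TypeError:
    modified = False

# Tuples are hashable, so they can be dictionary keys (lists cannot).
bigram_counts = {("stock", "prices"): 2, ("prices", "fell"): 1}

# Tuple unpacking assigns each element to its own name.
first, second = ("stock", "prices")
```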

<key, value> pairs
>>> numbers = {1:"one", 2:"two", 3:"three"}
>>> letters = {"vowel":['a','e','i','o','u'],"consonant":['b','c','d','f','g']}

Get value given key
>>> numbers[2]
'two'
>>> letters["consonant"]
['b', 'c', 'd', 'f', 'g']

Changing the value associated with a key
>>> letters["consonant"].append('h')
>>> letters["consonant"]
['b', 'c', 'd', 'f', 'g', 'h']
>>> numbers[2]="twosome"
>>> numbers[2]
'twosome'
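
A common use of dictionaries in text processing is counting words. The following is a minimal sketch with a made-up sentence; `get(key, default)` avoids a KeyError for keys that are not yet present.

```python
words = "the dog ran and the cat ran".split()

# Build a word -> count dictionary.
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

has_dog = "dog" in counts          # membership test on keys
fish_count = counts.get("fish", 0) # 0, and "fish" is not added
```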
>>> file1 = open("topics.txt", "rU")    # "U": universal newline mode (handles all newline variations: \n, \r, \r\n)
>>> file1_lines = file1.readlines()
>>> file1_lines[1:3]
['Finance\n', 'Computers and the internet\n']
>>> file2 = open("topics_copy.txt", "w")
>>> file2.write("This is a copy of topics.txt\n")    # write a string
>>> file2.writelines(file1_lines)                    # write a list of strings
>>> file2.close()
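
An alternative to calling `close()` by hand is the `with` statement, which closes the file when the block ends. The sketch below writes and re-reads a small file in a temporary directory; the file contents and the first line ("Topics") are placeholders, not the course's actual `topics.txt`. In Python 3, universal newline handling is the default for text mode, so plain `open(path)` is used.

```python
import os
import tempfile

# Placeholder contents standing in for topics.txt.
lines = ["Topics\n", "Finance\n", "Computers and the internet\n"]
path = os.path.join(tempfile.mkdtemp(), "topics.txt")

# The file is closed automatically at the end of each with-block.
with open(path, "w") as out_file:
    out_file.writelines(lines)       # write a list of strings

with open(path) as in_file:
    read_back = in_file.readlines()  # one string per line, newline included
```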
>>> file1 = open("topics.txt", "rU")
>>> file1_lines = file1.readlines()
>>> if len(file1_lines) < 2:
...     print "fewer than 2 lines"
... elif len(file1_lines) > 10:
...     print "more than 10 lines"
... else:
...     print "between 2 and 10 lines"
...
between 2 and 10 lines
>>>
Indentation is important
Blocks begin with a line ending in :
Enter a blank line when the block is done
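
The same branching can be wrapped in a function so it is easy to reuse and check. This is a sketch; the name `describe_length` is hypothetical and not part of the slides.

```python
def describe_length(lines):
    """Return a short description of how many lines were read."""
    n = len(lines)
    if n < 2:
        return "fewer than 2 lines"
    elif n > 10:
        return "more than 10 lines"
    else:
        # Reached when 2 <= n <= 10.
        return "between 2 and 10 lines"
```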
>>> word = "dog"
>>> for letter in word:
...     print letter
...
d
o
g
>>> pets = ["dog", "cat", "fish"]
>>> for i in range(len(pets)):
...     print pets[i]
...
dog
cat
fish
range(n): 0 up to, but not including, n
‘break’ and ‘continue’ statements are available as usual
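
A short sketch of `continue` and `break` on the same `pets` list, plus `enumerate`, which yields index and item together and is often used instead of `range(len(...))`. The filtering conditions are made up for illustration.

```python
pets = ["dog", "cat", "fish"]

# enumerate() pairs each item with its index.
indexed = []
for i, pet in enumerate(pets):
    indexed.append((i, pet))

# continue skips the rest of the current iteration.
short_names = []
for pet in pets:
    if len(pet) > 3:
        continue
    short_names.append(pet)

# break leaves the loop as soon as a match is found.
first_c = None
for pet in pets:
    if pet.startswith("c"):
        first_c = pet
        break
```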
>>> def get_length(listx):
...     list_len = len(listx)
...     return list_len
...
>>> pets = ["dogs", "cats", "fish"]
>>> print "I have "+ str(get_length(pets)) + " pets"
I have 3 pets
str(): integer to string
Data (specific to each object)
Functions that can be performed on this data

Apple tree
◦ Data
  ▪ Fruit
  ▪ Leaf
◦ Functions
  ▪ Pick_fruit()
  ▪ Pick_leaf()

Can abstract into a Tree class
◦ Data
  ▪ Fruit
  ▪ Leaf
◦ Functions
  ▪ Pick_fruit()
  ▪ Pick_leaf()
◦ Instances: apple tree, maple tree, palm tree…
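
The Tree abstraction could be sketched as a class along the following lines. The slides give only the names of the data and functions, so representing fruit and leaves as integer counts, and the `kind` attribute, are assumptions made for illustration.

```python
class Tree:
    def __init__(self, kind, fruit, leaves):
        # Data specific to each instance (counts are an assumption).
        self.kind = kind
        self.fruit = fruit
        self.leaves = leaves

    def pick_fruit(self):
        # Remove one fruit if any remain; report whether one was picked.
        if self.fruit > 0:
            self.fruit -= 1
            return True
        return False

    def pick_leaf(self):
        # Remove one leaf if any remain; report whether one was picked.
        if self.leaves > 0:
            self.leaves -= 1
            return True
        return False


# Instances share the same functions but hold their own data.
apple_tree = Tree("apple", fruit=2, leaves=100)
maple_tree = Tree("maple", fruit=0, leaves=500)

picked_apple = apple_tree.pick_fruit()   # True; one fruit left
picked_maple = maple_tree.pick_fruit()   # False; nothing to pick
```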


Lists, tuples, dictionaries and files were all
objects of their respective classes
The functions we used on them were the
member functions of those classes
◦ list1.append('a')
class roster:
    course = "cis530"
    def __init__(self, name, dept):      # called when an object is 'instantiated'
        self.student_name = name
        self.student_dept = dept
    def print_details(self):             # another member function
        print "Name: " + self.student_name
        print "Dept: " + self.student_dept
        print "Course: " + self.course

student1 = roster("annie", "cis")        # creating an instance
student1.print_details()                 # calling methods of an object
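
One point the roster example relies on is that `course` is defined on the class and shared by every instance, while the name and department belong to each object. The sketch below is a variant written to run under both Python 2 and 3: it returns the detail lines instead of printing them, and the second student is hypothetical.

```python
class Roster:
    course = "cis530"                    # class attribute, shared by all instances

    def __init__(self, name, dept):
        self.student_name = name         # instance attributes, one set per object
        self.student_dept = dept

    def details(self):
        # self.course falls back to the class attribute.
        return ["Name: " + self.student_name,
                "Dept: " + self.student_dept,
                "Course: " + self.course]


student1 = Roster("annie", "cis")
student2 = Roster("bob", "ese")          # hypothetical second student

lines1 = student1.details()
same_course = student1.course == student2.course   # True
```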

Suite of classes for several NLP tasks

Parsing, POS tagging, classifiers…

Several text processing utilities, corpora
◦ Brown, Penn Treebank corpus…
◦ Your data was divided into sentences using ‘punkt’

Basics – skim chapters 1-4

For this homework, be familiar with
◦ Corpus utilities
  ▪ Simplified in NLTK
◦ Probability distributions: FreqDist, ConditionalFreqDist
  ▪ Read definitions of all member functions
  ▪ Look at the code to see how it is implemented
◦ Smoothing techniques
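
The slides do not say which smoothing techniques the homework covers. As one common example, add-one (Laplace) smoothing estimates P(w) = (count(w) + 1) / (N + V), where N is the total token count and V the vocabulary size. The sketch below uses the standard library's `collections.Counter` as a stand-in for a frequency distribution rather than NLTK's own classes; the function name and sample data are hypothetical.

```python
from collections import Counter


def laplace_probability(counts, word, vocabulary_size):
    """Add-one smoothed estimate: (count + 1) / (total + vocabulary size)."""
    total = sum(counts.values())
    # Counter returns 0 for unseen words, so they get a small non-zero share.
    return (counts[word] + 1.0) / (total + vocabulary_size)


counts = Counter("the dog ran and the cat ran".split())

# Assume a vocabulary of the five seen word types plus one unseen word, "fish".
vocab = len(counts) + 1

p_the = laplace_probability(counts, "the", vocab)     # (2 + 1) / (7 + 6)
p_fish = laplace_probability(counts, "fish", vocab)   # (0 + 1) / (7 + 6)
```

With this vocabulary the smoothed probabilities of all six words sum to 1.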

You will need to import the necessary
modules to create objects and call member
functions
◦ import ~ include objects from pre-built packages


FreqDist, ConditionalFreqDist are in
nltk.probability
PlaintextCorpusReader is in nltk.corpus
import nltk
from nltk.corpus import PlaintextCorpusReader

def get_files_from_category(category):
    # "Health#Cancer" -> ['Health', 'Cancer']; "Finance" -> ['Finance']
    subcat = category.split('#')
    if len(subcat) == 1:
        corpus_root = '/home1/c/cis530/data/' + subcat[0]
    else:
        corpus_root = '/home1/c/cis530/data/' + subcat[0] + '/' + subcat[1]
    # '.*' matches every file under corpus_root
    files = PlaintextCorpusReader(corpus_root, '.*')
    return files

finance_files = get_files_from_category("Finance")
cancer_files = get_files_from_category("Health#Cancer")
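
The path-building step can be isolated and checked without NLTK or the course data. The sketch below is a hypothetical helper, not part of the course code: it uses `os.path.join` in place of string concatenation and, unlike the original, also accepts more than two `#`-separated levels.

```python
import os


def category_to_path(category, data_root="/home1/c/cis530/data"):
    """Map a category such as 'Health#Cancer' to its corpus directory."""
    parts = category.split("#")
    # join() inserts the separator between the root and each sub-category.
    return os.path.join(data_root, *parts)


finance_root = category_to_path("Finance")
cancer_root = category_to_path("Health#Cancer")
```

The resulting string could then be passed as the first argument to `PlaintextCorpusReader`, as in the function above.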
def get_num_tokens(topic):
    categ_files = get_files_from_category(topic)
    all_words = categ_files.words()
    return len(all_words)

print get_num_tokens("Health#Diet_and_Nutrition")
print get_num_tokens("Computers_and_the_Internet")
from nltk import FreqDist

def get_top_word(topic):
    categ_files = get_files_from_category(topic)
    all_words = categ_files.words()
    fdist1 = FreqDist(all_words)
    # keys() returns samples in decreasing order of frequency
    return fdist1.keys()[0]

print get_top_word("Finance")
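
The idea of "most frequent sample first" can be tried without NLTK or the course corpus. The sketch below uses `collections.Counter` from the standard library as a stand-in for `FreqDist`, with a made-up word list. As far as I know, `FreqDist` in NLTK 3 is built on `Counter` and its `keys()` is no longer frequency-ordered, so `most_common()` is the safer call there.

```python
from collections import Counter

words = "the dog ran and the cat saw the dog".split()

# Count each word, then ask for the single most frequent one.
fdist = Counter(words)
top_word, top_count = fdist.most_common(1)[0]   # ('the', 3)

# most_common(n) returns (word, count) pairs in decreasing order of count.
top_two = fdist.most_common(2)                  # [('the', 3), ('dog', 2)]
```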