CIS 530, Fall 2010
Annie Louis
20 Sept. 2010

Basics to help you get started
◦ Data structures, functions, loops, running a script...
◦ NLTK corpus reader, probability distributions
You will find more resources on the course webpage.
Some examples in this tutorial are based on
◦ Mark Lutz & David Ascher, 'Learning Python'
◦ Brad Dayley, 'Python Phrasebook'

Why Python?
◦ Portable
◦ Can contain wrappers to other code, e.g. C
◦ Various built-in types: lists, dictionaries...
◦ Easy: can be learnt quickly, and it is very concise
◦ The NLTK toolkit comes with a lot of NLP utilities

Getting started
Read http://www.cis.upenn.edu/~cis530/hw_2010/generalinfo.pdf
Python and NLTK are both already installed on eniac.

    > python
    Python 2.6.2 (r262:71600, Jun 17 2010, 13:37:45)
    [GCC 4.4.1 [gcc-4_4-branch revision 150839]] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import nltk
    >>> nltk.corpus.brown.words()[0]
    'The'

Basic expressions and variables

    > python
    >>> 2+3
    5
    >>> word = "python"
    >>> word[2]
    't'
    >>> print word
    python

No type declaration is needed:

    >>> x = 3
    >>> y = 9
    >>> z = (x + y + 0.0)/3
    >>> z
    4.0

Running a script
◦ From the Unix command line: python myscript.py
◦ Within the interpreter: >>> execfile("myscript.py")

Strings

    >>> sentence = "Stock prices fell. They did so last week too."
Splitting strings

    >>> sentence.split()
    ['Stock', 'prices', 'fell.', 'They', 'did', 'so', 'last', 'week', 'too.']

split() with no separator splits at whitespace characters (space, tab, newline).
With a different separator:

    >>> sentence.split(".")
    ['Stock prices fell', ' They did so last week too', '']

Combining strings
Method 1: using the '+' operator

    >>> word1 = "This"
    >>> word2 = "is"
    >>> word3 = "a"
    >>> word4 = "sentence"
    >>> combined1 = word1 + " " + word2 + " " + word3 + " " + word4
    >>> combined1
    'This is a sentence'

Method 2: using the join() method

    >>> wordlist = ["This", "is", "a", "sentence"]
    >>> ' '.join(wordlist)
    'This is a sentence'

Comparing strings
Comparison is case sensitive:

    >>> string1 = "apple"
    >>> string2 = "orange"
    >>> string3 = "aPPle"
    >>> string1 == string2
    False
    >>> string1 == string3
    False

For a case-insensitive comparison, lowercase both sides first:

    >>> string1.lower() == string3.lower()
    True

Other useful string operations: search and replace, trimming.

Lists
An ordered collection of items; can contain items of any type.

    >>> digits = [0,1,2,3,4,5,6,7,8,9]
    >>> strings = ["the", "dog", "ran"]

Indices start from 0; slices select items in a range; negative indices work backwards from the end:

    >>> strings[0]
    'the'
    >>> strings[2]
    'ran'
    >>> digits[2:4]
    [2, 3]
    >>> digits[-1]
    9
    >>> digits[-2]
    8

Add/remove:

    >>> strings.append("fast")
    >>> strings.insert(1, "brown")
    >>> strings
    ['the', 'brown', 'dog', 'ran', 'fast']
    >>> digits.remove(8)
    >>> digits
    [0, 1, 2, 3, 4, 5, 6, 7, 9]

Sort:

    >>> digits.sort(reverse=True)
    >>> digits
    [9, 7, 6, 5, 4, 3, 2, 1, 0]
    >>> digits.sort()
    >>> digits
    [0, 1, 2, 3, 4, 5, 6, 7, 9]

Tuples
Similar to lists, but cannot be modified:

    >>> first_five = (1,2,3,4,5)
    >>> first_five[2]
    3
    >>> first_five.append(6)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'tuple' object has no attribute 'append'

Tuple to list, and list to tuple:

    >>> newlist = list(first_five)
    >>> newlist.append(6)
    >>> newlist
    [1, 2, 3, 4, 5, 6]
    >>> first_six = tuple(newlist)
    >>> first_six.append(7)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'tuple' object has no attribute 'append'
    >>> first_six
    (1, 2, 3, 4, 5, 6)

Dictionaries
Collections of <key, value> pairs:

    >>> numbers = {1:"one", 2:"two", 3:"three"}
    >>> letters = {"vowel":['a','e','i','o','u'], "consonant":['b','c','d','f','g']}

Get a value given its key:

    >>> numbers[2]
    'two'
    >>> letters["consonant"]
    ['b', 'c', 'd', 'f', 'g']

Changing the value associated with a key:

    >>> letters["consonant"].append('h')
    >>> letters["consonant"]
    ['b', 'c', 'd', 'f', 'g', 'h']
    >>> numbers[2] = "twosome"
    >>> numbers[2]
    'twosome'

Files
Universal newline mode ("rU") handles all newline variations (\r, \r\n):

    >>> file1 = open("topics.txt", "rU")
    >>> file1_lines = file1.readlines()
    >>> file1_lines[1:3]
    ['Finance\n', 'Computers and the internet\n']

Writing a string, and writing a list of strings:

    >>> file2 = open("topics_copy.txt", "w")
    >>> file2.write("This is a copy of topics.txt\n")
    >>> file2.writelines(file1_lines)
    >>> file2.close()

Conditionals
Indentation is important; blocks begin after a ':'; a blank line ends the block in the interpreter.

    >>> file1 = open("topics.txt", "rU")
    >>> file1_lines = file1.readlines()
    >>> if len(file1_lines) < 2:
    ...     print "fewer than 2 lines"
    ... elif len(file1_lines) > 10:
    ...     print "more than 10 lines"
    ... else:
    ...     print "between 2 and 10 lines"
    ...
    between 2 and 10 lines

Loops

    >>> word = "dog"
    >>> for letter in word:
    ...     print letter
    ...
    d
    o
    g
    >>> pets = ["dog", "cat", "fish"]
    >>> for i in range(len(pets)):
    ...     print pets[i]
    ...
    dog
    cat
    fish

range(n) runs from 0 up to, but not including, n.
'break' and 'continue' statements are available as usual.

Functions

    >>> def get_length(listx):
    ...     list_len = len(listx)
    ...     return list_len
    ...
    >>> pets = ["dogs", "cats", "fish"]
    >>> print "I have " + str(get_length(pets)) + " pets"
    I have 3 pets

str() converts an integer to a string.

Objects and classes
An object bundles data (specific to each object) with functions that can be performed on that data.
An apple tree, for example:
◦ Data: fruit, leaf
◦ Functions: pick_fruit(), pick_leaf()
We can abstract this into a Tree class:
◦ Data: fruit, leaf
◦ Functions: pick_fruit(), pick_leaf()
Instances: apple tree, maple tree, palm tree...
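The Tree abstraction above can be sketched as a small class. This is an illustrative sketch only (written in Python 3 syntax, unlike the Python 2 sessions elsewhere in this tutorial); the attribute values and constructor parameters are invented for the example and do not come from the slides.

```python
class Tree:
    """Abstract 'Tree' from the slides: data plus member functions."""

    def __init__(self, kind, fruit, leaf):
        # data specific to each object (each instance)
        self.kind = kind
        self.fruit = fruit
        self.leaf = leaf

    # functions that operate on the object's own data
    def pick_fruit(self):
        return self.fruit

    def pick_leaf(self):
        return self.leaf


# Instances: each tree carries its own data
apple_tree = Tree("apple", "apple", "apple leaf")
maple_tree = Tree("maple", None, "maple leaf")

print(apple_tree.pick_fruit())  # apple
print(maple_tree.pick_leaf())   # maple leaf
```

Each instance gets its own copies of the data, while all instances share the same member functions, which is exactly the separation the apple-tree example motivates.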
Lists, tuples, dictionaries, and files were all objects of their respective classes.
The functions we used on them were member functions of those classes, e.g. list1.append('a').

Defining a class:

    class roster:
        course = "cis530"

        # __init__ is called when an object is instantiated
        def __init__(self, name, dept):
            self.student_name = name
            self.student_dept = dept

        # another member function
        def print_details(self):
            print "Name: " + self.student_name
            print "Dept: " + self.student_dept
            print "Course: " + self.course

    # creating an instance
    student1 = roster("annie", "cis")
    # calling a method of the object
    student1.print_details()

NLTK
A suite of classes for several NLP tasks: parsing, POS tagging, classifiers...
Several text processing utilities and corpora
◦ Brown, Penn Treebank corpus...
◦ Your data was divided into sentences using 'punkt'

The NLTK book
Basics: skim chapters 1-4.
For this homework, be familiar with
◦ Corpus utilities (simplified in NLTK)
◦ Probability distributions: FreqDist, ConditionalFreqDist
  Read the definitions of all their member functions, and look at the code to see how each is implemented.
◦ Smoothing techniques

Imports
You will need to import the necessary modules to create objects and call member functions.
◦ import brings in objects from pre-built packages
◦ FreqDist and ConditionalFreqDist are in nltk.probability
◦ PlaintextCorpusReader is in nltk.corpus

Reading a corpus:

    import nltk
    from nltk.corpus import PlaintextCorpusReader

    def get_files_from_category(category):
        subcat = category.split('#')
        if (len(subcat) == 1):
            corpus_root = '/home1/c/cis530/data/' + subcat[0]
        else:
            corpus_root = '/home1/c/cis530/data/' + subcat[0] + '/' + subcat[1]
        files = PlaintextCorpusReader(corpus_root, '.*')
        return files

    finance_files = get_files_from_category("Finance")
    cancer_files = get_files_from_category("Health#Cancer")

Counting tokens:

    def get_num_tokens(topic):
        categ_files = get_files_from_category(topic)
        all_words = categ_files.words()
        return len(all_words)

    print get_num_tokens("Health#Diet_and_Nutrition")
    print get_num_tokens("Computers_and_the_Internet")

Frequency distributions:

    from nltk import FreqDist

    def get_top_word(topic):
        categ_files = get_files_from_category(topic)
        all_words = categ_files.words()
        fdist1 = nltk.FreqDist(all_words)
        return fdist1.keys()[0]

    print get_top_word("Finance")

keys() returns samples in decreasing order of frequency.
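The same counting idea can be sketched without a corpus, using the standard library's collections.Counter, which behaves much like FreqDist. (Note that the fdist1.keys()[0] trick above relies on the old NLTK behavior of frequency-sorted keys; in NLTK 3, keys() is no longer sorted that way, and fdist.max() or fdist.most_common(1) is used instead.) The token list below is invented for illustration, and the code uses Python 3 syntax.

```python
from collections import Counter

# A small stand-in for categ_files.words(); these tokens are
# illustrative only, not from any course corpus.
all_words = ["stocks", "fell", "the", "stocks", "rose", "the", "the"]

# Counter maps each sample to its count, like FreqDist
fdist = Counter(all_words)

# most_common(1) returns [(sample, count)] for the top sample
top_word, top_count = fdist.most_common(1)[0]

print(top_word)    # the
print(top_count)   # 3
```

This mirrors what get_top_word() computes: build a frequency distribution over all tokens in a category, then take the single most frequent sample.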