Programming for Linguists
An Introduction to Python, 22/12/2011

Feedback

Ex. 1) Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of "men", "women", and "people" in each document. What has happened to the usage of these words over time?

import nltk
from nltk.corpus import state_union

cfd = nltk.ConditionalFreqDist(
    (fileid, word)
    for fileid in state_union.fileids()
    for word in state_union.words(fileids=fileid))
fileids = state_union.fileids()
search_words = ["men", "women", "people"]
cfd.tabulate(conditions=fileids, samples=search_words)

Ex. 2) According to Strunk and White's Elements of Style, the word "however", used at the start of a sentence, means "in whatever way" or "to whatever extent", and not "nevertheless". They give this example of correct usage: "However you advise him, he will probably do as he thinks best." Use the concordance tool to study actual usage of this word in 5 NLTK texts.

import nltk
from nltk.book import *

texts = [text1, text2, text3, text4, text5]
for text in texts:
    text.concordance("however")   # concordance() prints its matches itself

Ex. 3) Create a corpus of your own of at least 10 files containing text fragments. You can take texts of your own, from the internet, … Write a program that investigates the usage of modal verbs in this corpus using the frequency distribution tool and plot the 10 most frequent words.
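The ConditionalFreqDist in Ex. 1 pairs each file ID with each of its words and then tabulates selected samples per file. The same tabulation can be sketched with the standard library alone (a Python 3 sketch; `tabulate` and the sample documents below are hypothetical, not part of NLTK):

```python
from collections import Counter

def tabulate(docs, samples):
    """Count occurrences of each sample word per document.

    docs: mapping of document name -> list of tokens.
    Returns a dict of document name -> {sample word: count}.
    """
    table = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)            # word -> frequency in this document
        table[name] = {w: counts[w] for w in samples}
    return table

docs = {
    "1945-Truman.txt": ["the", "men", "and", "women", "of", "the", "people"],
    "1961-Kennedy.txt": ["people", "people", "men"],
}
print(tabulate(docs, ["men", "women", "people"]))
```

`Counter` returns 0 for missing keys, so a sample word absent from a document is reported as 0 rather than raising an error.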
import nltk
import re
from nltk.corpus import PlaintextCorpusReader

corpus_root = "/Users/claudia/my_corpus"
# corpus_root = "C:\Users\..."   # on Windows
my_corpus = PlaintextCorpusReader(corpus_root, '.*')
words = my_corpus.words()

cfd = nltk.ConditionalFreqDist(
    (fileid, word)
    for fileid in my_corpus.fileids()
    for word in my_corpus.words(fileid))
fileids = my_corpus.fileids()
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=fileids, samples=modals)

# Keep only tokens that do not start with a non-alphanumeric character,
# then plot the 10 most frequent words.
clean_words = [w for w in words if not re.match(r'[^a-zA-Z0-9]+', w)]
fd = nltk.FreqDist(clean_words)
fd.plot(10)

Ex. 1) Choose a website. Read it in in Python using the urlopen function, remove all HTML mark-up, and tokenize it. Make a frequency dictionary of all words ending in 'ing' and sort it on its values (decreasingly).

Ex. 2) Write the raw text of the text in the previous exercise to an output file.

import nltk
import re
from urllib import urlopen

url = "website"
htmltext = urlopen(url).read()
rawtext = nltk.clean_html(htmltext)
rawtext2 = rawtext.lower()
tokens = nltk.wordpunct_tokenize(rawtext2)
my_text = nltk.Text(tokens)

wordlist_ing = [w for w in tokens if re.search(r'^.*ing$', w)]
freq_dict = {}
for word in wordlist_ing:
    if word not in freq_dict:
        freq_dict[word] = 1
    else:
        freq_dict[word] = freq_dict[word] + 1

from operator import itemgetter
sorted_wordlist_ing = sorted(freq_dict.iteritems(), key=itemgetter(1), reverse=True)

Ex. 2)

output_file = open("dir/output.txt", "w")
output_file.write(str(rawtext2) + "\n")
output_file.close()

Ex. 3) Write a script that performs the same classification task as we saw today using word bigrams as features instead of single words.
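Ex. 3 asks for the same classification task with word bigrams as features instead of single words. The classifier itself is not reproduced here, but the feature extractor can be sketched in a few lines (a Python 3 sketch; `bigram_features` is a hypothetical helper name, written in the style of the `{word: True}` feature dictionaries used for single-word features):

```python
def bigram_features(tokens):
    """Map each adjacent word pair to True, so the classifier sees
    word bigrams instead of single words as features."""
    return {(a, b): True for a, b in zip(tokens, tokens[1:])}

feats = bigram_features("the cat sat on the mat".split())
print(("the", "cat") in feats)   # True: adjacent pairs are features
print(("cat", "mat") in feats)   # False: non-adjacent words are not
```

`zip(tokens, tokens[1:])` pairs each token with its successor, which is all a word-bigram extractor needs; everything else about the classification pipeline stays the same.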
Some Mentioned Issues

Loading your own corpus in NLTK with no subcategories:

import nltk
from nltk.corpus import PlaintextCorpusReader

loc = "/Users/claudia/my_corpus"        # Mac
loc = "C:\Users\claudia\my_corpus"      # Windows 7
my_corpus = PlaintextCorpusReader(loc, ".*")

Loading your own corpus in NLTK with subcategories:

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader

loc = "/Users/claudia/my_corpus"        # Mac
loc = "C:\Users\claudia\my_corpus"      # Windows 7
my_corpus = CategorizedPlaintextCorpusReader(loc, '(?!\.svn).*\.txt', cat_pattern=r'(cat1|cat2)/.*')

Dispersion plot: determines the location of a word in the text, i.e. how many words from the beginning it appears.

Exercises

Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the text, and converts the words to lowercase. You can get a list of all punctuation marks with:

import string
print string.punctuation

import nltk, string

def strip(filepath):
    f = open(filepath, 'r')
    text = f.read()
    f.close()
    tokens = nltk.wordpunct_tokenize(text)
    # Lowercase every token and drop the punctuation tokens.
    return [t.lower() for t in tokens if t not in string.punctuation]

If you want to analyse a text, but filter out a stop list first (e.g. containing "the", "and", …), you need to make 2 dictionaries: 1 with all words from your text and 1 with all words from the stop list. Then you subtract the 2nd from the 1st. Write a function subtract(d1, d2) which takes dictionaries d1 and d2 and returns a new dictionary that contains all the keys from d1 that are not in d2. You can set the values to None.

def subtract(d1, d2):
    d3 = {}
    for key in d1.keys():
        if key not in d2:
            d3[key] = None
    return d3

Let's try it out:

import nltk
from nltk.book import *
from nltk.corpus import stopwords

d1 = {}
for word in text7:
    d1[word] = None
wordlist = stopwords.words("english")
d2 = {}
for word in wordlist:
    d2[word] = None
rest_dict = subtract(d1, d2)
wordlist_min_stopwords = rest_dict.keys()

Questions?
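The dispersion plot mentioned above is built from word offsets: for each occurrence of a word, how many words from the beginning it appears. The underlying computation can be sketched without NLTK (a Python 3 sketch; `word_offsets` is a hypothetical helper, not an NLTK function):

```python
def word_offsets(tokens, target):
    """Return the 0-based word offsets at which `target` occurs,
    i.e. how many words from the beginning each occurrence appears."""
    return [i for i, tok in enumerate(tokens) if tok == target]

tokens = "to be or not to be".split()
print(word_offsets(tokens, "be"))   # [1, 5]
```

A dispersion plot then simply draws one tick mark per offset, one row per target word, which is what `nltk.Text.dispersion_plot` does for you.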
Evaluation

Assignment deadline = 23/01/2012
Conversation in the week of 23/01/12
If you need any explanation about the content of the assignment, feel free to email me.

Further Reading

Since this was only a short introduction to programming in Python, if you want to expand your programming skills further:
* see chapters 15-18 about object-oriented programming in Think Python: How to Think Like a Computer Scientist
* the NLTK book
* the official Python documentation: http://www.python.org/doc/
There is a newer version of Python available, but it is not (yet) compatible with NLTK.

Our research group: CLiPS: Computational Linguistics and Psycholinguistics Research Center
http://www.clips.ua.ac.be/
Our projects: http://www.clips.ua.ac.be/projects

Happy holidays and success with your exams!