CS 595-052 Machine Learning and Statistical Natural Language Processing
Prof. Shlomo Argamon, [email protected]
Room: 237C; Office Hours: Mon 3-4 PM
Book: Foundations of Statistical Natural Language Processing, C. D. Manning and H. Schütze
Requirements:
– Several programming projects
– A research proposal

Machine Learning
[Diagram: Training Examples → Learning Algorithm → Learned Model; Test Examples + Learned Model → Classification/Labeling Results]

Modeling
• Decide how to represent learned models:
– Decision rules
– Linear functions
– Markov models
– …
• The type chosen affects generalization accuracy (accuracy on new data)

Example Representation
• Set of features:
– Continuous
– Discrete (ordered and unordered)
– Binary
– Sets vs. sequences
• Classes:
– Continuous vs. discrete
– Binary vs. multivalued
– Disjoint vs. overlapping

Learning Algorithms
• Find a "good" hypothesis that is "consistent" with the training data
– Many hypotheses may be consistent, so we may need a "preference bias"
– No hypothesis may be consistent, so we may need one that is "nearly" consistent
• We may rule out some hypotheses to start with:
– Feature reduction

Estimating Generalization Accuracy
• Accuracy on the training data says nothing about new examples!
• Must train and test on different example sets
• Estimate generalization accuracy over multiple train/test divisions
• Sources of estimation error:
– Bias: systematic error in the estimate
– Variance: how much the estimate changes between different runs

Cross-validation
1. Divide the training data into k sets
2. Repeat for each set i:
   a. Train on the remaining k−1 sets
   b. Test on set i
3. Average the k accuracies (and compute statistics)

Bootstrapping
For a corpus of n examples:
1. Choose n examples randomly (with replacement)
   Note: we expect ~0.632n distinct examples
2. Train the model, and evaluate:
   • acc0 = accuracy of the model on the non-chosen examples
   • accS = accuracy of the model on the n training examples
3. Estimate accuracy as 0.632·acc0 + 0.368·accS
4. Average the accuracies over b different runs
Also note: there are other, similar bootstrapping techniques

Bootstrapping vs. Cross-validation
• Cross-validation:
– All examples participate equally
– The class distribution in the test sets depends on the distribution in the training sets
– Stratified cross-validation: equalize the class distributions
• Bootstrap:
– Often has higher bias (fewer distinct examples)
– Best for small datasets

Natural Language Processing
• Extract useful information from natural-language texts (articles, books, web pages, queries, etc.)
• Traditional method: handcrafted lexicons, grammars, parsers
• Statistical approach: learn how to process language from a corpus of real usage

Some Statistical NLP Tasks
1. Part-of-speech tagging: how to distinguish between book the noun and book the verb.
2. Shallow parsing: pick out phrases of different types from a text, such as the purple people eater or would have been going.
3. Word sense disambiguation: how to distinguish between river bank and bank as a financial institution.
4. Alignment: find the correspondence between the words, sentences, and paragraphs of a source text and its translation.

A Paradigmatic Task
• Language modeling: predict the next word of a text (probabilistically):
  P(wn | w1w2…wn-1) = m(wn | w1w2…wn-1)
• To do this perfectly, we must capture true notions of grammaticality
• So: a better approximation of the probability of "the next word" gives a better language model

Measuring "Surprise"
• The lower the probability of the actual word, the more the model is "surprised":
  H(wn | w1…wn-1) = −log2 m(wn | w1…wn-1)
  (the conditional entropy of wn given w1,n-1)
• Cross-entropy: suppose the actual distribution of the language is p(wn | w1…wn-1); then our model is on average surprised by:
  Ep[H(wn | w1,n-1)] = Σw p(wn = w | w1,n-1) H(wn = w | w1,n-1) = Ep[−log2 m(wn | w1,n-1)]

Estimating the Cross-Entropy
How can we estimate Ep[H(wn | w1,n-1)] when we don't (by definition) know p?
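Concretely, the average surprisal above is estimated by averaging −log2 m(·) over a sample of real text. A minimal sketch (the toy unigram model m and the four-word vocabulary are illustrative assumptions, not from the slides; the slides' m conditions on the full history, which a unigram model simply ignores):

```python
import math

def cross_entropy(model, corpus):
    """Average surprisal, in bits per word, of `model` on `corpus`.

    `model` maps each word w to its probability m(w); this toy version
    drops the conditioning history w1,n-1 used in the slides.
    """
    return -sum(math.log2(model[w]) for w in corpus) / len(corpus)

# Toy uniform model over a 4-word vocabulary: every word costs 2 bits.
model = {w: 0.25 for w in ["the", "cat", "sat", "mat"]}
corpus = ["the", "cat", "sat", "the", "mat"]
print(cross_entropy(model, corpus))  # 2.0 bits per word
```

Note the tie to the perplexity slide: 2 ** cross_entropy(model, corpus) = 4.0, i.e., the model is exactly as uncertain as a uniform choice among the four vocabulary words.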
Assume:
• Stationarity: the language doesn't change over time
• Ergodicity: the language never gets "stuck"
Then:
  Ep[H(wn | w1,n-1)] = limn→∞ (1/n) Σn H(wn | w1,n-1) = limn→∞ −(1/n) log2 m(w1,n)

Perplexity
A commonly used measure of "model fit":
  perplexity(w1,n, m) = 2^H(w1,n, m) = m(w1,n)^(−1/n)
• How many "choices" for the next word, on average?
• Lower perplexity = better model

N-gram Models
• Assume a "limited horizon":
  P(wk | w1w2…wk-1) = P(wk | wk-n+1…wk-1)
– Each word depends only on the last n−1 words
• Specific cases:
– Unigram model: P(wk) (words independent)
– Bigram model: P(wk | wk-1)
• Learning task: estimate these probabilities from a given corpus

Using Bigrams
• Compute the probability of a sentence:
  W = The cat sat on the mat
  P(W) = P(The|START) P(cat|The) P(sat|cat) P(on|sat) P(the|on) P(mat|the) P(END|mat)
• Generate a random text and examine it for "reasonableness"

Maximum Likelihood Estimation
• PMLE(w1…wn) = C(w1…wn) / N
• PMLE(wn | w1…wn-1) = C(w1…wn) / C(w1…wn-1)

Problem: Data Sparseness!!
• For the vast majority of possible n-grams, we get zero probability, even in a very large corpus
• The larger the context, the greater the problem
• But there are always new cases not seen before!

Smoothing
• Idea: take some probability away from seen events and assign it to unseen events
• Simple method (Laplace): give every event an a priori count of 1:
  PLap(X) = (C(X) + 1) / (N + B)
  where X is any entity and B is the number of entity types
• Problem: assigns too much probability to new events; the more event types there are, the worse this becomes

Interpolation
• Lidstone: PLid(X) = (C(X) + d) / (N + dB)   [d < 1]
• Johnson: PLid(X) = m·PMLE(X) + (1 − m)(1/B), where m = N/(N + dB)
• How to choose d?
• Doesn't match low-frequency events well

Held-out Estimation
• Idea: estimate the frequency of unseen events from "held-out" data
• Divide the data into "training" and "held-out" subsets:
  C1(X) = frequency of X in the training data
  C2(X) = frequency of X in the held-out data
  Nr = number of entity types X with C1(X) = r
  Tr = Σ{X : C1(X) = r} C2(X)
  Pho(X) = Tr / (Nr·N)   where C1(X) = r

Deleted Estimation
Generalize to use all the data:
• Divide the data into two subsets, a and b:
  Nra = number of entity types X such that Ca(X) = r
  Trab = Σ{X : Ca(X) = r} Cb(X)
  Pdel(X) = (Tr01 + Tr10) / (N·(Nr0 + Nr1))   [where C(X) = r]
• Needs a large data set
• Overestimates unseen data, underestimates infrequent data

Good-Turing
• For observed items, discount the item count:
  r* = (r + 1)·E[Nr+1] / E[Nr]
• The idea is that the chance of seeing the item one more time is about E[Nr+1] / E[Nr]
• For unobserved items, the total probability is E[N1] / N
– So, if we assume a uniform distribution over unknown items, we have:
  P(X) = E[N1] / (N0·N)

Good-Turing Issues
• Has problems with high-frequency items (consider rmax* = (rmax + 1)·E[Nrmax+1] / E[Nrmax] = 0, since Nrmax+1 = 0)
  Usual answers:
– Use it only for low-frequency items (r < k)
– Smooth E[Nr] by a function S(r)
• How to divide probability among unseen items?
– Uniform distribution
– Estimate which seem more likely than others…

Back-off Models
• If the high-order n-gram has insufficient data, use a lower-order n-gram:
  Pbo(wi | wi-n+1,i-1) =
    (1 − d(wi-n+1,i-1))·P(wi | wi-n+1,i-1)   if there is enough data
    α(wi-n+1,i-1)·Pbo(wi | wi-n+2,i-1)       otherwise
• Note the recursive formulation

Linear Interpolation
More generally, we can interpolate:
  Pint(wi | h) = Σk λk(h)·Pk(wi | h)
• Interpolation between different orders
• Usually set the weights by iterative training (gradient descent / the EM algorithm)
• Partition histories h into equivalence classes
• Need to be responsive to the amount of data!
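As a small, simplified illustration of the interpolation idea, here is a bigram model whose probabilities mix bigram MLE, unigram MLE, and a uniform floor. The fixed weights and the function name are assumptions for the sketch; a real system would set the λk(h) by EM training as described above:

```python
from collections import Counter

def train_interpolated_bigram(corpus, l2=0.7, l1=0.2, l0=0.1):
    """Interpolate bigram MLE, unigram MLE, and a uniform floor.

    The weights l2, l1, l0 (summing to 1) are fixed here; the slides
    would instead learn history-dependent weights by EM.
    """
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    n, vocab = len(corpus), len(unigrams)

    def prob(w, prev):
        # Bigram MLE: C(prev, w) / C(prev), zero for unseen histories.
        p_bi = bigrams[(prev, w)] / unigrams[prev] if prev in unigrams else 0.0
        # Unigram MLE: C(w) / N.
        p_uni = unigrams[w] / n
        # Uniform floor keeps every in-vocabulary word above zero.
        return l2 * p_bi + l1 * p_uni + l0 / vocab

    return prob

p = train_interpolated_bigram("the cat sat on the mat".split())
print(p("cat", "the"))  # seen bigram: large probability
print(p("sat", "the"))  # unseen bigram: small but nonzero, thanks to smoothing
```

The unseen bigram ("the", "sat") still gets positive probability from the unigram and uniform terms, which is exactly the data-sparseness fix the smoothing slides are after.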