Text Models
Why?
• To “understand” text
• To assist in text search & ranking
• For autocompletion
• For Part-of-Speech tagging
Simple application: spelling suggestions
• Say that we have a dictionary of words
– A real dictionary, or the result of crawling
– Could equally be sentences instead of words
• Now we are given a word w not in the dictionary
• How can we correct it to something in the dictionary?
String editing
• Given two strings (sequences) the “distance”
between the two strings is defined by the
minimum number of “character edit
operations” needed to turn one sequence into
the other.
• Edit operations: delete, insert, modify (a
character)
– A cost is assigned to each operation (e.g., uniform cost = 1)
Edit distance
• Already a simple model for languages
• Modeling the creation of strings (and errors in
them) through simple edit operations
Distance between strings
• Edit distance between strings = minimum
number of edit operations that can be used to
get from one string to the other
– Symmetric because of the particular choice of edit
operations and uniform cost
• distance(“Willliam Cohon”, “William Cohen”) = 2
Finding the edit distance
• An “alignment” problem
• Deciding how to align the two strings
• Can we try all alignments?
• How many (reasonable options) are there?
Dynamic Programming
• An umbrella name for a collection of
algorithms
• Main idea: solve subproblems once and reuse their results, combining them in different ways
Example: Fibonacci
def fib(n):
    if n == 0 or n == 1:
        return n
    else:
        return fib(n - 1) + fib(n - 2)
Exponential time!
Fib with Dynamic Programming
table = {}

def fib(n):
    global table
    if n in table:
        return table[n]
    if n == 0 or n == 1:
        table[n] = n
        return n
    else:
        value = fib(n - 1) + fib(n - 2)
        table[n] = value
        return value
Using a partial solution
• Partial solution:
– Alignment of s up to location i, with t up to location j
• How to reuse?
• Try all options for the “last” operation
• Base case: D(i,0) = i, D(0,i) = i, for i deletions / insertions
• Easy to generalize to arbitrary cost functions!
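A minimal Python sketch of this dynamic program with uniform cost 1 (the function name and table layout are illustrative, not from the slides):

def edit_distance(s, t):
    # D[i][j] = edit distance between s[:i] and t[:j]
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i              # i deletions
    for j in range(n + 1):
        D[0][j] = j              # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # match or modify
            D[i][j] = min(D[i - 1][j] + 1,            # delete s[i-1]
                          D[i][j - 1] + 1,            # insert t[j-1]
                          D[i - 1][j - 1] + cost)     # keep or modify
    return D[m][n]

print(edit_distance("Willliam Cohon", "William Cohen"))   # 2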
Models
• Bag-of-words
• N-grams
• Hidden Markov Models
• Probabilistic Context Free Grammar
Bag-of-words
• Every document is represented as a bag of the
words it contains
• Bag means that we keep the multiplicity
(=number of occurrences) of each word
• Very simple, but we lose all track of structure
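For illustration, the bag can be kept as a word-count table; the whitespace tokenization here is only a placeholder for a real tokenizer:

from collections import Counter

def bag_of_words(document):
    # Keep the multiplicity of each word; all order/structure is discarded.
    return Counter(document.lower().split())

print(bag_of_words("the can is in the fridge"))
# Counter({'the': 2, 'can': 1, 'is': 1, 'in': 1, 'fridge': 1})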
n-grams
• Limited structure
• Sliding window of n words
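A sketch of the sliding window over an already tokenized word list (the helper name is illustrative):

def ngrams(words, n):
    # Slide a window of n words over the token list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("the sailor dogs me every day".split(), 2))
# [('the', 'sailor'), ('sailor', 'dogs'), ('dogs', 'me'), ('me', 'every'), ('every', 'day')]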
n-gram model
How would we infer the probabilities?
• Issues:
– Overfitting
– Probability 0 for unseen n-grams
How would we infer the probabilities?
• Maximum Likelihood:
"add-one" (Laplace) smoothing
• V = Vocabulary size
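The formulas themselves are not reproduced in the transcript; for a bigram model the standard maximum-likelihood and add-one estimates are:

P_MLE(wi | wi-1) = count(wi-1, wi) / count(wi-1)
P_Laplace(wi | wi-1) = (count(wi-1, wi) + 1) / (count(wi-1) + V)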
Good-Turing Estimate
Good-Turing
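The estimate is not reproduced in the transcript; the standard Good-Turing adjusted count for an n-gram observed r times, where Nr is the number of distinct n-grams observed exactly r times, is:

r* = (r + 1) · N(r+1) / Nr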
More than a fixed n..
Linear Interpolation
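The interpolation formula is likewise missing from the transcript; the usual trigram form mixes the trigram, bigram, and unigram estimates with weights that sum to 1:

P̂(wi | wi-2, wi-1) = λ1 · P(wi | wi-2, wi-1) + λ2 · P(wi | wi-1) + λ3 · P(wi),   with λ1 + λ2 + λ3 = 1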
Precision vs. Recall
Richer Models
• HMM
• PCFG
Motivation: Part-of-Speech Tagging
– Useful for ranking
– For machine translation
– Word-Sense Disambiguation
–…
Part-of-Speech Tagging
• Tag this word. This word is a tag.
• He dogs like a flea
• The can is in the fridge
• The sailor dogs me every day
A Learning Problem
• Training set: tagged corpus
– Most famous is the Brown Corpus with about 1M
words
– The goal is to learn a model from the training set, and
then perform tagging of untagged text
– Performance tested on a test-set
Simple Algorithm
• Assign to each word its most popular tag in the training set
• Problem: ignores context
• “Dogs” and “tag” will always be tagged as nouns…
• “Can” will always be tagged as a verb
• Still, achieves around 80% correctness on real-life test sets
– Goes up to as high as 90% when combined with some simple rules
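A sketch of this baseline, assuming the training set is given as a list of (word, tag) pairs (the names and the default tag are illustrative):

from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    # tagged_corpus: list of (word, tag) pairs
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # For each word, keep its most popular tag in the training set.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, most_common_tag, default="NN"):
    # Context is ignored: every occurrence of a word gets the same tag.
    return [(w, most_common_tag.get(w, default)) for w in words]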
Hidden Markov Model (HMM)
• Model: sentences are generated by a probabilistic
process
• In particular, a Markov Chain whose states correspond
to Parts-of-Speech
• Transitions are probabilistic
• In each state a word is emitted
– The output word is again chosen probabilistically based on
the state
HMM
• An HMM is:
– A set of N states
– A set of M symbols (words)
– An N×N matrix of transition probabilities Ptrans
– A vector of size N of initial state probabilities Pstart
– An N×M matrix of emission probabilities Pout
• “Hidden” because we see only the outputs,
not the sequence of states traversed
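One possible container for these parameters (field names mirror the slide; the structure itself is only an illustration):

from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    states: list          # N states (part-of-speech tags)
    symbols: list         # M symbols (words)
    Pstart: np.ndarray    # shape (N,): initial state probabilities
    Ptrans: np.ndarray    # shape (N, N): transition probabilities
    Pout: np.ndarray      # shape (N, M): emission probabilities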
Example
3 Fundamental Problems
1) Compute the probability of a given observation sequence (= sentence)
2) Given an observation sequence, find the most likely hidden state sequence
– This is tagging
3) Given a training set, find the model that would make the observations most likely
Tagging
• Find the most likely sequence of states that
led to an observed output sequence
• Problem: exponentially many possible
sequences!
Viterbi Algorithm
• Dynamic Programming
• V(t, k) is the probability of the most probable state sequence
– Generating the first t + 1 observations (X0, …, Xt)
– And terminating at state k
• V(0, k) = Pstart(k) · Pout(k, X0)
• V(t, k) = Pout(k, Xt) · max_k' { V(t−1, k') · Ptrans(k', k) }
Finding the path
• Note that we are interested in the most likely
path, not only in its probability
• So we need to keep track at each point of the
argmax
– Combine them to form a sequence
• What about top-k?
Complexity
• O(T*|S|^2)
• Where T is the sequence (=sentence) length,
|S| is the number of states (= number of
possible tags)
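A minimal sketch of the recurrence with back-pointers for recovering the path (probabilities are multiplied directly for brevity; a real implementation would use log probabilities to avoid underflow):

import numpy as np

def viterbi(obs, Pstart, Ptrans, Pout):
    # obs: sequence of symbol indices X0..XT-1; returns the most likely state sequence.
    N, T = len(Pstart), len(obs)
    V = np.zeros((T, N))                 # V[t, k]: prob. of the best path ending in state k at time t
    back = np.zeros((T, N), dtype=int)   # argmax over the previous state
    V[0] = Pstart * Pout[:, obs[0]]
    for t in range(1, T):
        for k in range(N):
            scores = V[t - 1] * Ptrans[:, k]
            back[t, k] = np.argmax(scores)
            V[t, k] = Pout[k, obs[t]] * scores[back[t, k]]
    # Follow the back-pointers from the best final state.
    path = [int(np.argmax(V[T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))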
Computing the probability of a
sequence
• Forward probabilities:
αt(k) is the probability of seeing the sequence X0…Xt and terminating at state k
• Backward probabilities:
βt(k) is the probability of seeing the sequence
Xt+1…Xn given that the Markov process is at
state k at time t.
Computing the probabilities
Forward algorithm
α0(k) = Pstart(k) · Pout(k, X0)
αt(k) = Pout(k, Xt) · Σk' { αt−1(k') · Ptrans(k', k) }
P(X0, …, Xn) = Σk αn(k)
Backward algorithm
βt(k) = P(Xt+1 … Xn | state at time t is k)
βt(k) = Σk' { Ptrans(k, k') · Pout(k', Xt+1) · βt+1(k') }
βn(k) = 1 for all k
P(X0, …, Xn) = Σk Pstart(k) · Pout(k, X0) · β0(k)
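A sketch of the forward pass under the same conventions (array names are illustrative):

import numpy as np

def forward(obs, Pstart, Ptrans, Pout):
    # Returns P(obs) and the table of forward probabilities alpha[t, k].
    N, T = len(Pstart), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = Pstart * Pout[:, obs[0]]
    for t in range(1, T):
        # alpha[t, k] = Pout(k, Xt) * sum over k' of alpha[t-1, k'] * Ptrans(k', k)
        alpha[t] = Pout[:, obs[t]] * (alpha[t - 1] @ Ptrans)
    return alpha[-1].sum(), alpha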
Learning the HMM probabilities
• Expectation-Maximization Algorithm
1. Start with initial probabilities
2. Compute Eij the expected number of transitions
from i to j while generating a sequence,
for each i,j (see next)
3. Set the probability of transition from i to j to be
Eij/ (Σk Eik)
4. Similarly for the emission probabilities
5. Repeat 2-4 using the new model, until convergence
Estimating the expectations
• By sampling
– Re-run a random execution of the model 100 times
– Count transitions
• By analysis
– Use Bayes rule on the formula for sequence
probability
– Called the Forward-backward algorithm
Accuracy
• Tested experimentally
• Exceeds 96% for the Brown corpus
– Trained on half and tested on the other half
• Compare with the 80-90% by the trivial algorithm
• The hard cases are few but are very hard..
NLTK
• http://www.nltk.org/
• Natural Language ToolKit
• Open source python modules for NLP tasks
– Including stemming, POS tagging and much more
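A typical usage sketch (the tokenizer and tagger models may need to be downloaded once; exact resource names can vary between NLTK versions):

import nltk

# One-time downloads, if not already installed:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize("The sailor dogs me every day")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('sailor', 'NN'), ('dogs', 'VBZ'), ('me', 'PRP'), ('every', 'DT'), ('day', 'NN')]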
Context Free Grammars
• Context Free Grammars are a more natural model for
Natural Language
• Syntax rules are very easy to formulate using CFGs
• Provably more expressive than Finite State Machines
– E.g. Can check for balanced parentheses
Context Free Grammars
• Non-terminals
• Terminals
• Production rules
– V → w where V is a non-terminal and w is a
sequence of terminals and non-terminals
Context Free Grammars
• Can be used as acceptors
• Can be used as a generative model
• Similarly to the case of Finite State Machines
• How long can a string generated by a CFG be?
Stochastic Context Free Grammar
• Non-terminals
• Terminals
• Production rules associated with probability
– V → w where V is a non-terminal and w is a
sequence of terminals and non-terminals
Chomsky Normal Form
• Every rule is of the form
• V → V1V2 where V,V1,V2 are non-terminals
• V → t where V is a non-terminal and t is a terminal
Every (S)CFG can be written in this form
• Makes designing many algorithms easier
Questions
• What is the probability of a string?
– Defined as the sum of probabilities of all possible
derivations of the string
• Given a string, what is its most likely derivation?
– Also called the Viterbi derivation or parse
– Easy adaptation of the Viterbi Algorithm for HMMs
• Given a training corpus and a CFG (without probabilities), learn the probabilities of the derivation rules
Inside-outside probabilities
• Inside probability: probability of generating wp…wq from the non-terminal N^j_pq:
β_j(p, q) = P(w_pq | N^j_pq)
• Outside probability: total probability of beginning with the start symbol N^1 and generating N^j_pq and everything outside wp…wq:
α_j(p, q) = P(w_1(p−1), N^j_pq, w_(q+1)m)
CYK algorithm
(Parse-tree diagram: N^j spans wp…wq and splits into N^r over wp…wd and N^s over wd+1…wq)
β_j(k, k) = P(N^j → w_k)
β_j(p, q) = Σ_{r,s} Σ_{d=p}^{q−1} P(N^j → N^r N^s) · β_r(p, d) · β_s(d+1, q)
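A sketch of the inside pass, assuming a Chomsky-Normal-Form grammar given as dictionaries binary_rules[(j, r, s)] and unary_rules[(j, word)] of rule probabilities (these names and the 1-based indexing mirror the slide's β_j(p, q) notation but are otherwise illustrative):

from collections import defaultdict

def inside_probabilities(words, nonterminals, binary_rules, unary_rules):
    # beta[(j, p, q)] = P(words p..q are generated by non-terminal j), 1-based indices.
    m = len(words)
    beta = defaultdict(float)
    # Base case: beta_j(k, k) = P(N^j -> w_k)
    for k in range(1, m + 1):
        for j in nonterminals:
            beta[(j, k, k)] = unary_rules.get((j, words[k - 1]), 0.0)
    # Recursion over spans of increasing length
    for span in range(2, m + 1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (j, r, s), prob in binary_rules.items():
                for d in range(p, q):   # split point
                    beta[(j, p, q)] += prob * beta[(r, p, d)] * beta[(s, d + 1, q)]
    return beta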
CYK algorithm
(Parse-tree diagram: N^f spans wp…we and splits into N^j over wp…wq and N^g over wq+1…we)
α_j(p, q) = Σ_{f,g} Σ_{e=q+1}^{m} α_f(p, e) · P(N^f → N^j N^g) · β_g(q+1, e)
CYK algorithm
(Parse-tree diagram: N^f spans we…wq and splits into N^g over we…wp−1 and N^j over wp…wq)
α_j(p, q) = Σ_{f,g} Σ_{e=1}^{p−1} α_f(e, q) · P(N^f → N^g N^j) · β_g(e, p−1)
Outside probability
α_j(1, m) = 1 if j = 1, 0 otherwise
α_j(p, q) = Σ_{f,g} Σ_{e=q+1}^{m} α_f(p, e) · P(N^f → N^j N^g) · β_g(q+1, e)
          + Σ_{f,g} Σ_{e=1}^{p−1} α_f(e, q) · P(N^f → N^g N^j) · β_g(e, p−1)
Probability of a sentence
P(w_1m) = β_1(1, m)
P(w_1m) = Σ_j α_j(k, k) · P(N^j → w_k)   for any k
P(w_1m, N^j_pq) = α_j(p, q) · β_j(p, q)
The probability that a binary rule is used

P(N^j_pq, N^j → N^r N^s | w_1m) =
  [ Σ_{d=p}^{q−1} α_j(p, q) · P(N^j → N^r N^s) · β_r(p, d) · β_s(d+1, q) ] / P(w_1m)

P(N^j, N^j → N^r N^s | w_1m) = Σ_{p=1}^{m} Σ_{q=p}^{m} P(N^j_pq, N^j → N^r N^s | w_1m)
  = Σ_{p=1}^{m} Σ_{q=p}^{m} [ Σ_{d=p}^{q−1} α_j(p, q) · P(N^j → N^r N^s) · β_r(p, d) · β_s(d+1, q) ] / P(w_1m)      (1)
The probability that Nj is used

P(w_1m, N^j_pq) = α_j(p, q) · β_j(p, q)

P(N^j_pq | w_1m) = P(N^j_pq, w_1m) / P(w_1m) = α_j(p, q) · β_j(p, q) / P(w_1m)

P(N^j | w_1m) = Σ_{p=1}^{m} Σ_{q=p}^{m} P(N^j_pq | w_1m)
  = Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p, q) · β_j(p, q) / P(w_1m)      (2)
P(N^j → N^r N^s | N^j, w_1m) = P(N^j → N^r N^s, N^j | w_1m) / P(N^j | w_1m) = (1) / (2)
  = [ Σ_{p=1}^{m} Σ_{q=p}^{m} Σ_{d=p}^{q−1} α_j(p, q) · P(N^j → N^r N^s) · β_r(p, d) · β_s(d+1, q) ]
    / [ Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p, q) · β_j(p, q) ]
The probability that a unary rule is used

P(N^j → w^k, N^j is used | w_1m) =
  [ Σ_{h=1}^{m} α_j(h, h) · β_j(h, h) · δ(w_h, w^k) ] / P(w_1m)      (3)

P(N^j → w^k | N^j, w_1m) = P(N^j → w^k, N^j | w_1m) / P(N^j | w_1m) = (3) / (2)
  = [ Σ_{h=1}^{m} α_j(h, h) · β_j(h, h) · δ(w_h, w^k) ]
    / [ Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p, q) · β_j(p, q) ]
Multiple training sentences

For sentence Wi:

f_i(j, r, s) = P(N^j, N^j → N^r N^s | w_1m)
  = [ Σ_{p=1}^{m−1} Σ_{q=p+1}^{m} Σ_{d=p}^{q−1} α_j(p, q) · P(N^j → N^r N^s) · β_r(p, d) · β_s(d+1, q) ] / P(w_1m)      (1)

h_i(j) = P(N^j | w_1m)
  = [ Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p, q) · β_j(p, q) ] / P(w_1m)      (2)

P(N^j → N^r N^s) = Σ_i f_i(j, r, s) / Σ_i h_i(j)