Text Models
Why?
• To “understand” text
• To assist in text search & ranking
• For autocompletion
• For Part-of-Speech tagging
Simple application: spelling suggestions
• Say that we have a dictionary of words
– A real dictionary, or the result of crawling
– Could equally be sentences instead of words
• Now we are given a word w not in the dictionary
• How can we correct it to something in the dictionary?
String editing
• Given two strings (sequences) the “distance”
between the two strings is defined by the
minimum number of “character edit
operations” needed to turn one sequence into
the other.
• Edit operations: delete, insert, modify (a
character)
– A cost is assigned to each operation (e.g., uniform cost = 1)
Edit distance
• Already a simple model for languages
• Modeling the creation of strings (and errors in
them) through simple edit operations
Distance between strings
• Edit distance between strings = minimum
number of edit operations that can be used to
get from one string to the other
– Symmetric because of the particular choice of edit
operations and uniform cost
• distance(“Willliam Cohon”, “William Cohen”) = 2
Finding the edit distance
• An “alignment” problem
• Deciding how to align the two strings
• Can we try all alignments?
• How many (reasonable options) are there?
Dynamic Programming
• An umbrella name for a collection of
algorithms
• Main idea: solve subproblems once and reuse their results, combining them in different ways
Example: Fibonacci
def fib(n):
    if n == 0 or n == 1:
        return n
    else:
        return fib(n - 1) + fib(n - 2)
Exponential time!
Fib with Dynamic Programming
table = {}

def fib(n):
    global table
    if n in table:
        return table[n]
    if n == 0 or n == 1:
        table[n] = n
        return n
    else:
        value = fib(n - 1) + fib(n - 2)
        table[n] = value
        return value
Using a partial solution
• Partial solution:
– Alignment of s up to location i, with t up to location j
• How to reuse?
• Try all options for the “last” operation
• Base case: D(i,0) = i, D(0,i) = i, for i deletions / insertions
• Easy to generalize to arbitrary cost functions!
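A minimal Python sketch of this dynamic program with uniform cost 1 (the function name and table layout are illustrative, not from the slides):

def edit_distance(s, t):
    # D[i][j] = edit distance between s[:i] and t[:j]
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i              # i deletions
    for j in range(n + 1):
        D[0][j] = j              # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # match or modify
            D[i][j] = min(D[i - 1][j] + 1,            # delete s[i-1]
                          D[i][j - 1] + 1,            # insert t[j-1]
                          D[i - 1][j - 1] + cost)     # keep or modify
    return D[m][n]

print(edit_distance("Willliam Cohon", "William Cohen"))   # 2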
Models
• Bag-of-words
• N-grams
• Hidden Markov Models
• Probabilistic Context Free Grammar
Bag-of-words
• Every document is represented as a bag of the
words it contains
• Bag means that we keep the multiplicity
(=number of occurrences) of each word
• Very simple, but we lose all track of structure
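For illustration, the bag can be kept as a word-count table; the whitespace tokenization here is only a placeholder for a real tokenizer:

from collections import Counter

def bag_of_words(document):
    # Keep the multiplicity of each word; all order/structure is discarded.
    return Counter(document.lower().split())

print(bag_of_words("the can is in the fridge"))
# Counter({'the': 2, 'can': 1, 'is': 1, 'in': 1, 'fridge': 1})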
n-grams
• Limited structure
• Sliding window of n words
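A sketch of the sliding window over an already tokenized word list (the helper name is illustrative):

def ngrams(words, n):
    # Slide a window of n words over the token list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("the sailor dogs me every day".split(), 2))
# [('the', 'sailor'), ('sailor', 'dogs'), ('dogs', 'me'), ('me', 'every'), ('every', 'day')]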
n-gram model
How would we infer the probabilities?
• Issues:
– Overfitting
– Probability 0 for unseen n-grams
How would we infer the probabilities?
• Maximum Likelihood:
"add-one" (Laplace) smoothing
• V = Vocabulary size
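The formulas themselves are not reproduced in the transcript; for a bigram model the standard maximum-likelihood and add-one estimates are:

P_MLE(wi | wi-1) = count(wi-1, wi) / count(wi-1)
P_Laplace(wi | wi-1) = (count(wi-1, wi) + 1) / (count(wi-1) + V)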
Good-Turing Estimate
Good-Turing
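The estimate is not reproduced in the transcript; the standard Good-Turing adjusted count for an n-gram observed r times, where Nr is the number of distinct n-grams observed exactly r times, is:

r* = (r + 1) · N(r+1) / Nr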
More than a fixed n..
Linear Interpolation
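The interpolation formula is likewise missing from the transcript; the usual trigram form mixes the trigram, bigram, and unigram estimates with weights that sum to 1:

P̂(wi | wi-2, wi-1) = λ1 · P(wi | wi-2, wi-1) + λ2 · P(wi | wi-1) + λ3 · P(wi),   with λ1 + λ2 + λ3 = 1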
Precision vs. Recall
Richer Models
• HMM
• PCFG
Motivation: Part-of-Speech Tagging
– Useful for ranking
– For machine translation
– Word-Sense Disambiguation
–…
Part-of-Speech Tagging
• Tag this word. This word is a tag.
• He dogs like a flea
• The can is in the fridge
• The sailor dogs me every day
A Learning Problem
• Training set: tagged corpus
– Most famous is the Brown Corpus with about 1M
words
– The goal is to learn a model from the training set, and
then perform tagging of untagged text
– Performance tested on a test-set
Simple Algorithm
• Assign to each word its most popular tag in the training set
• Problem: ignores context
• “Dogs” and “tag” will always be tagged as nouns…
• “Can” will always be tagged as a verb
• Still, achieves around 80% correctness on real-life test sets
– Goes up to as high as 90% when combined with some simple rules
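A sketch of this baseline, assuming the training set is given as a list of (word, tag) pairs (the names and the default tag are illustrative):

from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    # tagged_corpus: list of (word, tag) pairs
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # For each word, keep its most popular tag in the training set.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, most_common_tag, default="NN"):
    # Context is ignored: every occurrence of a word gets the same tag.
    return [(w, most_common_tag.get(w, default)) for w in words]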
Hidden Markov Model (HMM)
• Model: sentences are generated by a probabilistic
process
• In particular, a Markov Chain whose states correspond
to Parts-of-Speech
• Transitions are probabilistic
• In each state a word is emitted
– The output word is again chosen probabilistically based on
the state
HMM
• An HMM is:
– A set of N states
– A set of M symbols (words)
– An N×N matrix of transition probabilities Ptrans
– A vector of size N of initial state probabilities Pstart
– An N×M matrix of emission probabilities Pout
• “Hidden” because we see only the outputs,
not the sequence of states traversed
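One possible container for these parameters (field names mirror the slide; the structure itself is only an illustration):

from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    states: list          # N states (part-of-speech tags)
    symbols: list         # M symbols (words)
    Pstart: np.ndarray    # shape (N,): initial state probabilities
    Ptrans: np.ndarray    # shape (N, N): transition probabilities
    Pout: np.ndarray      # shape (N, M): emission probabilities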
Example
3 Fundamental Problems
1) Compute the probability of a given observation sequence (= sentence)
2) Given an observation sequence, find the most likely hidden state sequence
– This is tagging
3) Given a training set, find the model that would make the observations most likely
Tagging
• Find the most likely sequence of states that
led to an observed output sequence
• Problem: exponentially many possible
sequences!
Viterbi Algorithm
• Dynamic Programming
• V(t, k) is the probability of the most probable state sequence
– Generating the first t + 1 observations (X0, …, Xt)
– And terminating at state k
• V(0, k) = Pstart(k) · Pout(k, X0)
• V(t, k) = Pout(k, Xt) · max_k' { V(t−1, k') · Ptrans(k', k) }
Finding the path
• Note that we are interested in the most likely
path, not only in its probability
• So we need to keep track at each point of the
argmax
– Combine them to form a sequence
• What about top-k?
Complexity
• O(T*|S|^2)
• Where T is the sequence (=sentence) length,
|S| is the number of states (= number of
possible tags)
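A minimal sketch of the recurrence with back-pointers for recovering the path (probabilities are multiplied directly for brevity; a real implementation would use log probabilities to avoid underflow):

import numpy as np

def viterbi(obs, Pstart, Ptrans, Pout):
    # obs: sequence of symbol indices X0..XT-1; returns the most likely state sequence.
    N, T = len(Pstart), len(obs)
    V = np.zeros((T, N))                 # V[t, k]: prob. of the best path ending in state k at time t
    back = np.zeros((T, N), dtype=int)   # argmax over the previous state
    V[0] = Pstart * Pout[:, obs[0]]
    for t in range(1, T):
        for k in range(N):
            scores = V[t - 1] * Ptrans[:, k]
            back[t, k] = np.argmax(scores)
            V[t, k] = Pout[k, obs[t]] * scores[back[t, k]]
    # Follow the back-pointers from the best final state.
    path = [int(np.argmax(V[T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))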
Computing the probability of a
sequence
• Forward probabilities:
αt(k) is the probability of seeing the sequence X0…Xt and terminating at state k
• Backward probabilities:
βt(k) is the probability of seeing the sequence
Xt+1…Xn given that the Markov process is at
state k at time t.
Computing the probabilities
Forward algorithm
α0(k) = Pstart(k) · Pout(k, X0)
αt(k) = Pout(k, Xt) · Σk' { αt−1(k') · Ptrans(k', k) }
P(X0, …, Xn) = Σk αn(k)
Backward algorithm
βt(k) = P(Xt+1 … Xn | state at time t is k)
βt(k) = Σk' { Ptrans(k, k') · Pout(k', Xt+1) · βt+1(k') }
βn(k) = 1 for all k
P(X0, …, Xn) = Σk Pstart(k) · Pout(k, X0) · β0(k)
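A sketch of the forward pass under the same conventions (array names are illustrative):

import numpy as np

def forward(obs, Pstart, Ptrans, Pout):
    # Returns P(obs) and the table of forward probabilities alpha[t, k].
    N, T = len(Pstart), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = Pstart * Pout[:, obs[0]]
    for t in range(1, T):
        # alpha[t, k] = Pout(k, Xt) * sum over k' of alpha[t-1, k'] * Ptrans(k', k)
        alpha[t] = Pout[:, obs[t]] * (alpha[t - 1] @ Ptrans)
    return alpha[-1].sum(), alpha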
Learning the HMM probabilities
• Expectation-Maximization Algorithm
1. Start with initial probabilities
2. Compute Eij the expected number of transitions
from i to j while generating a sequence,
for each i,j (see next)
3. Set the probability of transition from i to j to be
Eij/ (Σk Eik)
4. Similarly for the emission probabilities
5. Repeat 2-4 using the new model, until convergence
Estimating the expectations
• By sampling
– Re-run a random execution of the model 100 times
– Count transitions
• By analysis
– Use Bayes rule on the formula for sequence
probability
– Called the Forward-backward algorithm
Accuracy
• Tested experimentally
• Exceeds 96% for the Brown corpus
– Trained on half and tested on the other half
• Compare with the 80-90% by the trivial algorithm
• The hard cases are few but are very hard..
NLTK
• http://www.nltk.org/
• Natural Language ToolKit
• Open source python modules for NLP tasks
– Including stemming, POS tagging and much more
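A typical usage sketch (the tokenizer and tagger models may need to be downloaded once; exact resource names can vary between NLTK versions):

import nltk

# One-time downloads, if not already installed:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize("The sailor dogs me every day")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('sailor', 'NN'), ('dogs', 'VBZ'), ('me', 'PRP'), ('every', 'DT'), ('day', 'NN')]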
Context Free Grammars
• Context Free Grammars are a more natural model for
Natural Language
• Syntax rules are very easy to formulate using CFGs
• Provably more expressive than Finite State Machines
– E.g. Can check for balanced parentheses
Context Free Grammars
• Non-terminals
• Terminals
• Production rules
– V → w where V is a non-terminal and w is a
sequence of terminals and non-terminals
Context Free Grammars
• Can be used as acceptors
• Can be used as a generative model
• Similarly to the case of Finite State Machines
• How long can a string generated by a CFG be?
Stochastic Context Free Grammar
• Non-terminals
• Terminals
• Production rules associated with probability
– V → w where V is a non-terminal and w is a
sequence of terminals and non-terminals
Chomsky Normal Form
• Every rule is of the form
• V → V1V2 where V,V1,V2 are non-terminals
• V → t where V is a non-terminal and t is a terminal
Every (S)CFG can be written in this form
• Makes designing many algorithms easier
Questions
• What is the probability of a string?
– Defined as the sum of probabilities of all possible
derivations of the string
• Given a string, what is its most likely derivation?
– Also called the Viterbi derivation or parse
– Easy adaptation of the Viterbi Algorithm for HMMs
• Given a training corpus and a CFG (without probabilities), learn the probabilities of the derivation rules
Inside-outside probabilities
• Inside probability: probability of generating wp…wq from the non-terminal N^j_pq:
β_j(p, q) = P(w_pq | N^j_pq)
• Outside probability: total probability of beginning with the start symbol N^1 and generating N^j_pq and everything outside wp…wq:
α_j(p, q) = P(w_1(p−1), N^j_pq, w_(q+1)m)
CYK algorithm
(Parse-tree diagram: N^j spans wp…wq and splits into N^r over wp…wd and N^s over wd+1…wq)
β_j(k, k) = P(N^j → w_k)
β_j(p, q) = Σ_{r,s} Σ_{d=p}^{q−1} P(N^j → N^r N^s) · β_r(p, d) · β_s(d+1, q)
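A sketch of the inside pass, assuming a Chomsky-Normal-Form grammar given as dictionaries binary_rules[(j, r, s)] and unary_rules[(j, word)] of rule probabilities (these names and the 1-based indexing mirror the slide's β_j(p, q) notation but are otherwise illustrative):

from collections import defaultdict

def inside_probabilities(words, nonterminals, binary_rules, unary_rules):
    # beta[(j, p, q)] = P(words p..q are generated by non-terminal j), 1-based indices.
    m = len(words)
    beta = defaultdict(float)
    # Base case: beta_j(k, k) = P(N^j -> w_k)
    for k in range(1, m + 1):
        for j in nonterminals:
            beta[(j, k, k)] = unary_rules.get((j, words[k - 1]), 0.0)
    # Recursion over spans of increasing length
    for span in range(2, m + 1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (j, r, s), prob in binary_rules.items():
                for d in range(p, q):   # split point
                    beta[(j, p, q)] += prob * beta[(r, p, d)] * beta[(s, d + 1, q)]
    return beta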
CYK algorithm
(Parse-tree diagram: N^f spans wp…we and splits into N^j over wp…wq and N^g over wq+1…we)
α_j(p, q) = Σ_{f,g} Σ_{e=q+1}^{m} α_f(p, e) · P(N^f → N^j N^g) · β_g(q+1, e)
CYK algorithm
(Parse-tree diagram: N^f spans we…wq and splits into N^g over we…wp−1 and N^j over wp…wq)
α_j(p, q) = Σ_{f,g} Σ_{e=1}^{p−1} α_f(e, q) · P(N^f → N^g N^j) · β_g(e, p−1)
Outside probability
α_j(1, m) = 1 if j = 1, 0 otherwise
α_j(p, q) = Σ_{f,g} Σ_{e=q+1}^{m} α_f(p, e) · P(N^f → N^j N^g) · β_g(q+1, e)
          + Σ_{f,g} Σ_{e=1}^{p−1} α_f(e, q) · P(N^f → N^g N^j) · β_g(e, p−1)
Probability of a sentence
P(w_1m) = β_1(1, m)
P(w_1m) = Σ_j α_j(k, k) · P(N^j → w_k)   for any k
P(w_1m, N^j_pq) = α_j(p, q) · β_j(p, q)
The probability that a binary rule is used

P(N^j_pq, N^j → N^r N^s | w_1m) =
  [ Σ_{d=p}^{q−1} α_j(p, q) · P(N^j → N^r N^s) · β_r(p, d) · β_s(d+1, q) ] / P(w_1m)

P(N^j, N^j → N^r N^s | w_1m) = Σ_{p=1}^{m} Σ_{q=p}^{m} P(N^j_pq, N^j → N^r N^s | w_1m)
  = Σ_{p=1}^{m} Σ_{q=p}^{m} [ Σ_{d=p}^{q−1} α_j(p, q) · P(N^j → N^r N^s) · β_r(p, d) · β_s(d+1, q) ] / P(w_1m)      (1)
The probability that Nj is used

P(w_1m, N^j_pq) = α_j(p, q) · β_j(p, q)

P(N^j_pq | w_1m) = P(N^j_pq, w_1m) / P(w_1m) = α_j(p, q) · β_j(p, q) / P(w_1m)

P(N^j | w_1m) = Σ_{p=1}^{m} Σ_{q=p}^{m} P(N^j_pq | w_1m)
  = Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p, q) · β_j(p, q) / P(w_1m)      (2)
P(N^j → N^r N^s | N^j, w_1m) = P(N^j → N^r N^s, N^j | w_1m) / P(N^j | w_1m) = (1) / (2)
  = [ Σ_{p=1}^{m} Σ_{q=p}^{m} Σ_{d=p}^{q−1} α_j(p, q) · P(N^j → N^r N^s) · β_r(p, d) · β_s(d+1, q) ]
    / [ Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p, q) · β_j(p, q) ]
The probability that a unary rule is used

P(N^j → w^k, N^j is used | w_1m) =
  [ Σ_{h=1}^{m} α_j(h, h) · β_j(h, h) · δ(w_h, w^k) ] / P(w_1m)      (3)

P(N^j → w^k | N^j, w_1m) = P(N^j → w^k, N^j | w_1m) / P(N^j | w_1m) = (3) / (2)
  = [ Σ_{h=1}^{m} α_j(h, h) · β_j(h, h) · δ(w_h, w^k) ]
    / [ Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p, q) · β_j(p, q) ]
Multiple training sentences

For sentence Wi:

f_i(j, r, s) = P(N^j, N^j → N^r N^s | w_1m)
  = [ Σ_{p=1}^{m−1} Σ_{q=p+1}^{m} Σ_{d=p}^{q−1} α_j(p, q) · P(N^j → N^r N^s) · β_r(p, d) · β_s(d+1, q) ] / P(w_1m)      (1)

h_i(j) = P(N^j | w_1m)
  = [ Σ_{p=1}^{m} Σ_{q=p}^{m} α_j(p, q) · β_j(p, q) ] / P(w_1m)      (2)

P(N^j → N^r N^s) = Σ_i f_i(j, r, s) / Σ_i h_i(j)