Maximum Entropy (ME)
Maximum Entropy Markov Model (MEMM)
Conditional Random Field (CRF)
Boltzmann-Gibbs Distribution

Given:
  States s_1, s_2, …, s_n
  Density p(s) = p_s

Maximum entropy principle:
  Without any information, one chooses the density p_s to maximize the entropy
    -\sum_s p_s \log p_s
  subject to the constraints
    \sum_s p_s f_i(s) = D_i, \quad \forall i
Boltzmann-Gibbs (Cnt’d)

Consider the Lagrangian
  L = -\sum_s p_s \log p_s + \sum_i \lambda_i \Big( \sum_s p_s f_i(s) - D_i \Big) + \mu \Big( \sum_s p_s - 1 \Big)

Taking partial derivatives of L with respect to p_s and setting them to zero, we obtain the Boltzmann-Gibbs density functions
  p_s = \frac{\exp\big( \sum_i \lambda_i f_i(s) \big)}{Z}
where Z is the normalizing factor
Exercise

From the Lagrangian
  L = -\sum_s p_s \log p_s + \sum_i \lambda_i \Big( \sum_s p_s f_i(s) - D_i \Big) + \mu \Big( \sum_s p_s - 1 \Big)
derive
  p_s = \frac{\exp\big( \sum_i \lambda_i f_i(s) \big)}{Z}
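One possible route (a sketch only, assuming the Lagrangian is written with the + signs used above):

% Stationarity condition for each p_s:
\frac{\partial L}{\partial p_s} = -\log p_s - 1 + \sum_i \lambda_i f_i(s) + \mu = 0
\quad\Longrightarrow\quad
p_s = e^{\mu - 1} \exp\Big( \sum_i \lambda_i f_i(s) \Big)
% Normalization \sum_s p_s = 1 then fixes the constant:
Z = e^{1 - \mu} = \sum_s \exp\Big( \sum_i \lambda_i f_i(s) \Big),
\qquad
p_s = \frac{\exp\big( \sum_i \lambda_i f_i(s) \big)}{Z}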
Boltzmann-Gibbs (Cnt’d)

Classification Rule
  Use the Boltzmann-Gibbs distribution as the prior distribution
  Compute the posterior given the observed data and the features f_i
  Use the optimal posterior to classify
Boltzmann-Gibbs (Cnt’d)

Maximum Entropy (ME)
  The posterior is the state probability density p(s | X), where X = (x_1, x_2, …, x_n)

Maximum entropy Markov model (MEMM)
  The posterior consists of transition probability densities p(s | s′, X)
Boltzmann-Gibbs (Cnt’d)

Conditional random field (CRF)
  The posterior consists of both transition probability densities p(s | s′, X) and state probability densities p(s | X)
References
  R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley-Interscience, 2001.
  T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.
  P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach, The MIT Press, 2001.
Maximum Entropy Approach
An Example

Five possible French translations of the English word in:
  dans, en, à, au cours de, pendant

Certain constraints are obeyed:
  When April follows in, the proper translation is en

How do we choose the proper French translation y for a given English context x?
Formalism

Probability assignment p(y | x):
  y: French word, x: English context

Indicator function of a context feature f:
  f(x, y) = \begin{cases} 1 & \text{if } y = \textit{en} \text{ and \textit{April} follows \textit{in}} \\ 0 & \text{otherwise} \end{cases}
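A minimal sketch of such an indicator feature in Python; the context representation (the English word paired with the word that follows it) is an assumption made here for illustration.

# Sketch: an indicator feature for the translation example.
# A "context" x is assumed to be (English word, following word);
# y is the candidate French translation.

def f_april(x, y):
    """1 if y is 'en' and the word after 'in' is 'April', else 0."""
    word, next_word = x                       # assumed context layout
    return 1 if y == "en" and word == "in" and next_word == "April" else 0

print(f_april(("in", "April"), "en"))         # -> 1
print(f_april(("in", "Paris"), "en"))         # -> 0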
Expected Values of f

The expected value of f with respect to the empirical distribution \tilde{p}(x, y):
  \tilde{p}(f) = \sum_{x, y} \tilde{p}(x, y) \, f(x, y)

The expected value of f with respect to the conditional probability p(y | x):
  p(f) = \sum_{x, y} \tilde{p}(x) \, p(y | x) \, f(x, y)
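A sketch of both expectations on a toy data set; the dictionary representations of \tilde{p}(x, y), \tilde{p}(x), and p(y | x) are assumptions made here for illustration.

from collections import Counter

# Toy training pairs (x, y); ~p(x, y) and ~p(x) are relative frequencies.
data = [(("in", "April"), "en"), (("in", "April"), "en"), (("in", "Paris"), "à")]
n = len(data)
p_emp_xy = {pair: c / n for pair, c in Counter(data).items()}          # ~p(x, y)
p_emp_x = {x: c / n for x, c in Counter(x for x, _ in data).items()}   # ~p(x)

def empirical_expectation(f):
    # ~p(f) = sum_{x, y} ~p(x, y) f(x, y)
    return sum(p * f(x, y) for (x, y), p in p_emp_xy.items())

def model_expectation(f, p_y_given_x, labels):
    # p(f) = sum_{x, y} ~p(x) p(y | x) f(x, y)
    return sum(p_emp_x[x] * p_y_given_x(y, x) * f(x, y)
               for x in p_emp_x for y in labels)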
Constraint Equation

Set the two expected values equal:
  \tilde{p}(f) = p(f)
or equivalently,
  \sum_{x, y} \tilde{p}(x, y) \, f(x, y) = \sum_{x, y} \tilde{p}(x) \, p(y | x) \, f(x, y)
Maximum Entropy Principle

Given n feature functions f_i, we want p(y | x) to maximize the entropy measure
  H(p) = -\sum_{x, y} \tilde{p}(x) \, p(y | x) \log p(y | x)
where p is chosen from
  C = \{ p \mid p(f_i) = \tilde{p}(f_i), \; i = 1, 2, \ldots, n \}
Constrained Optimization Problem

The Lagrangian
  \Lambda(p, \lambda) = H(p) + \sum_i \lambda_i \big( p(f_i) - \tilde{p}(f_i) \big)

Solutions
  p_\lambda(y | x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big)
  Z_\lambda(x) = \sum_y \exp\Big( \sum_i \lambda_i f_i(x, y) \Big)
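A minimal sketch of this solution form; holding the features and weights in parallel lists is an arbitrary representation chosen for illustration.

import math

def conditional_prob(y, x, features, lambdas, labels):
    # p_lambda(y | x) = exp(sum_i lambda_i f_i(x, y)) / Z_lambda(x)
    def weight(label):
        return math.exp(sum(l * f(x, label) for l, f in zip(lambdas, features)))
    z = sum(weight(label) for label in labels)   # Z_lambda(x) sums over all y
    return weight(y) / z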
Iterative Solution

Compute the expectation of f_i under the current estimate of the probability function
  p^{(n)}(f_i) = \sum_{x, y} \tilde{p}(x) \, p^{(n)}(y | x) \, f_i(x, y)

Update the Lagrange multipliers
  \exp\big( \lambda_i^{(n+1)} - \lambda_i^{(n)} \big) = \frac{\tilde{p}(f_i)}{p^{(n)}(f_i)}

Update the probability functions
  p^{(n+1)}(y | x) = \frac{1}{Z^{(n+1)}(x)} \exp\Big( \sum_i \lambda_i^{(n+1)} f_i(x, y) \Big)
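A sketch of one such update in the spirit of Generalized Iterative Scaling (GIS); the 1/C rescaling of the log-ratio, with C bounding the total feature count of any (x, y) pair, is the standard GIS ingredient made explicit here, and the dictionary/list representations are assumptions of the sketch.

import math

def gis_step(lambdas, features, labels, p_emp_x, p_emp_f, C):
    """One GIS-style update of the Lagrange multipliers.

    lambdas : current weights lambda_i^(n)
    p_emp_x : dict x -> ~p(x)
    p_emp_f : list of empirical expectations ~p(f_i)
    C       : max over (x, y) of sum_i f_i(x, y)
    """
    # p^(n)(f_i) = sum_{x, y} ~p(x) p^(n)(y | x) f_i(x, y)
    model_f = [0.0] * len(features)
    for x, px in p_emp_x.items():
        scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features)))
                  for y in labels}
        z = sum(scores.values())
        for y, s in scores.items():
            for i, f in enumerate(features):
                model_f[i] += px * (s / z) * f(x, y)
    # lambda_i^(n+1) = lambda_i^(n) + (1/C) log( ~p(f_i) / p^(n)(f_i) )
    return [l + (1.0 / C) * math.log(pe / pm)
            for l, pe, pm in zip(lambdas, p_emp_f, model_f)]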
Feature Selection

Motivation:
  For a large collection of candidate features, we want to select a small subset
  Incremental growth
Incremental Learning
Adding feature \hat{f} to S to obtain S \cup \{\hat{f}\}

Consider C(S \cup \hat{f}) = \{ p : p(f) = \tilde{p}(f) \text{ for all } f \in S \cup \{\hat{f}\} \}

The optimal model: P_{S \cup \hat{f}} = \arg\max_{p \in C(S \cup \hat{f})} H(p)

Gain: \Delta L(S, \hat{f}) = L(P_{S \cup \hat{f}}) - L(P_S)
where L is the log-likelihood of the training data
Algorithm: Feature Selection
1. Start with S as an empty set; P_S is uniform
2. For each candidate feature f, compute P_{S \cup f} and \Delta L(S, f)
3. Check the termination condition (specified by the user)
4. Select \hat{f} = \arg\max_f \Delta L(S, f)
5. Add \hat{f} to S
6. Update P_S
7. Go to step 2
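A compact sketch of this greedy loop; train_model and log_likelihood stand in for the maximum entropy fitting and evaluation steps, and the max_features cap is one possible user-specified termination condition (all three are assumptions of this sketch).

def select_features(candidates, train_model, log_likelihood, max_features):
    """Greedy incremental feature selection.

    train_model(S)    -> model P_S fitted with feature set S  (assumed helper)
    log_likelihood(P) -> training log-likelihood L(P)          (assumed helper)
    """
    S, P_S = [], train_model([])             # start empty; P_S is uniform
    while len(S) < max_features:             # user-specified termination condition
        gains = {}
        for f in candidates:
            if f in S:
                continue
            P_Sf = train_model(S + [f])
            gains[f] = log_likelihood(P_Sf) - log_likelihood(P_S)  # deltaL(S, f)
        if not gains:
            break
        best = max(gains, key=gains.get)     # f_hat = arg max_f deltaL(S, f)
        S.append(best)
        P_S = train_model(S)                 # update P_S
    return S, P_S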
Approximation

Computing the maximum entropy model is costly for each candidate f

Simplification assumption:
  The multipliers \lambda associated with S do not change when f is added to S
Approximation (cnt’d)
The approximate solution for S \cup \{f\} then has the form
  P^{\alpha}_{S, f}(y | x) = \frac{1}{Z_\alpha(x)} P_S(y | x) \, e^{\alpha f(x, y)}
  Z_\alpha(x) = \sum_y P_S(y | x) \, e^{\alpha f(x, y)}
Approximate Solution
The approximate gain is
  G_{S, f}(\alpha) = L(P^{\alpha}_{S, f}) - L(P_S) = -\sum_x \tilde{p}(x) \log Z_\alpha(x) + \alpha \, \tilde{p}(f)

The approximate solution is then
  \tilde{P}_{S \cup f} = \arg\max_{P^{\alpha}_{S, f}} G_{S, f}(\alpha)
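A sketch of the one-dimensional gain maximization over alpha; the coarse grid search stands in for the Newton-style line search usually used, and the dictionary representations of \tilde{p}(x), \tilde{p}(f), and P_S(y | x) are assumptions of this sketch.

import math

def approximate_gain(alpha, f, P_S, p_emp_x, p_emp_f, labels):
    # G_{S,f}(alpha) = -sum_x ~p(x) log Z_alpha(x) + alpha ~p(f)
    total = alpha * p_emp_f
    for x, px in p_emp_x.items():
        z = sum(P_S[(y, x)] * math.exp(alpha * f(x, y)) for y in labels)
        total -= px * math.log(z)
    return total

def best_alpha(f, P_S, p_emp_x, p_emp_f, labels, grid=None):
    # Coarse search; the gain is concave in alpha, so a 1-D Newton or
    # golden-section search would be the usual, faster choice.
    grid = grid if grid is not None else [i / 10.0 for i in range(-50, 51)]
    return max(grid, key=lambda a: approximate_gain(a, f, P_S, p_emp_x, p_emp_f, labels))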
Conditional Random Field (CRF)
CRF
The probability of a label sequence y given an observation sequence x is the normalized product of potential functions, each of the form
  \exp\Big( \sum_j \lambda_j \, t_j(y_{i-1}, y_i, x, i) + \sum_k \mu_k \, s_k(y_i, x, i) \Big),
where y_{i-1} and y_i are the labels at positions i-1 and i,
t_j(y_{i-1}, y_i, x, i) is a transition feature function, and
s_k(y_i, x, i) is a state feature function
Feature Functions
Example:
A feature given by
  b(x, i) = \begin{cases} 1 & \text{if the observation at position } i \text{ is the word ``September''} \\ 0 & \text{otherwise} \end{cases}

Transition function:
  t_j(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } y_{i-1} = \text{IN and } y_i = \text{NNP} \\ 0 & \text{otherwise} \end{cases}
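A sketch of how such feature functions enter the (unnormalized) per-position potentials of a CRF; the weights, tag names, and toy sentence are illustrative, and the normalization over all label sequences is omitted.

import math

def t_in_nnp(y_prev, y_cur, x, i):
    # transition feature: previous tag IN followed by NNP
    return 1 if y_prev == "IN" and y_cur == "NNP" else 0

def s_september(y_cur, x, i):
    # state feature built from the observation predicate b(x, i) above
    # (the tag condition is left out in this sketch)
    return 1 if x[i] == "September" else 0

def unnormalized_score(y, x, transitions, states):
    # Product of the per-position potentials exp(sum_j lambda_j t_j + sum_k mu_k s_k),
    # taken here over positions i >= 1 only (a simplification of this sketch).
    total = 0.0
    for i in range(1, len(y)):
        total += sum(lam * t(y[i - 1], y[i], x, i) for lam, t in transitions)
        total += sum(mu * s(y[i], x, i) for mu, s in states)
    return math.exp(total)

x = ["in", "September"]
print(unnormalized_score(["IN", "NNP"], x, [(0.8, t_in_nnp)], [(0.5, s_september)]))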
Difference from MEMM

If the state features are dropped, we obtain an MEMM model

The drawback of MEMM:
  The state probabilities are not learned directly, but inferred
  Bias can be generated, since the transition features dominate during training
Difference from HMM

HMM is a generative model
  In order to define a joint distribution, this model must enumerate all possible observation sequences and their corresponding label sequences
  This task is intractable unless the observation elements are represented as isolated units
CRF Training Methods


CRF training requires intensive numerical optimization

Preconditioned conjugate gradient
  Instead of searching along the gradient, conjugate gradient searches along a carefully chosen linear combination of the gradient and the previous search direction

Limited-memory quasi-Newton
  Limited-memory BFGS (L-BFGS) is a second-order method that estimates the curvature numerically from previous gradients and updates, avoiding the need for an exact Hessian inverse computation

Voted perceptron
Voted Perceptron

Like the perceptron algorithm, this algorithm scans through the training instances, updating the weight vector \lambda_t when a prediction error is detected

Instead of taking just the final weight vector, the voted perceptron algorithm takes the average of the \lambda_t
Voted Perceptron (cnt’d)
Let
  F(y, x) = \sum_i f_j(y_{i-1}, y_i, x, i) \quad \text{(componentwise in } j\text{)}
where f_j is either a state function or a transition function.

For each training instance (x_k, y_k), the method computes a weight update
  \lambda_{t+1} = \lambda_t + F(y_k, x_k) - F(\hat{y}_k, x_k)
in which \hat{y}_k is obtained from the Viterbi path
  \hat{y}_k = \arg\max_y \; \lambda_t \cdot F(y, x_k)
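A sketch of this update; viterbi_decode is an assumed helper that returns the highest-scoring label sequence under the current weights, and feature counting is done by a global feature map F.

def global_features(y, x, feature_fns):
    """F_j(y, x) = sum_i f_j(y_{i-1}, y_i, x, i), collected into a vector."""
    return [sum(f(y[i - 1], y[i], x, i) for i in range(1, len(y)))
            for f in feature_fns]

def perceptron_pass(data, feature_fns, labels, lambdas, viterbi_decode):
    """One pass over the training data; returns the final weights and the
    running list of weight vectors, whose average the voted/averaged
    perceptron uses at the end."""
    history = []
    for x_k, y_k in data:
        y_hat = viterbi_decode(x_k, lambdas, feature_fns, labels)   # assumed helper
        if y_hat != y_k:                                            # prediction error
            gold = global_features(y_k, x_k, feature_fns)
            pred = global_features(y_hat, x_k, feature_fns)
            lambdas = [l + g - p for l, g, p in zip(lambdas, gold, pred)]
        history.append(list(lambdas))
    return lambdas, history

# Averaging (the "voting" step): final weights = mean of the lambda_t
# avg = [sum(col) / len(history) for col in zip(*history)]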
References

A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra, A maximum entropy approach to natural language processing, Computational Linguistics, 1996.

A. McCallum, D. Freitag, and F. Pereira, Maximum entropy Markov models for information extraction and segmentation, ICML 2000.

H. M. Wallach, Conditional random fields: an introduction, 2004.

J. Lafferty, A. McCallum, and F. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, ICML 2001.

F. Sha and F. Pereira, Shallow parsing with conditional random fields, HLT-NAACL 2003.