Maximum Entropy (ME)
Maximum Entropy Markov Model (MEMM)
Conditional Random Field (CRF)
Boltzmann-Gibbs Distribution

Given:
  States s_1, s_2, …, s_n
  Density p(s) = p_s

Maximum entropy principle:
  Without any information, one chooses the density p_s to maximize the entropy
    -\sum_s p_s \log p_s
  subject to the constraints
    \sum_s p_s f_i(s) = D_i, \quad \forall i
Boltzmann-Gibbs (Cnt’d)

Consider the Lagrangian
  L = -\sum_s p_s \log p_s + \sum_i \lambda_i \Big( \sum_s p_s f_i(s) - D_i \Big) + \mu \Big( \sum_s p_s - 1 \Big)

Taking partial derivatives of L with respect to p_s and setting them to zero, we obtain the Boltzmann-Gibbs density functions
  p_s = \frac{\exp\big( \sum_i \lambda_i f_i(s) \big)}{Z}
where Z is the normalizing factor
Exercise

From the Lagrangian
  L = -\sum_s p_s \log p_s + \sum_i \lambda_i \Big( \sum_s p_s f_i(s) - D_i \Big) + \mu \Big( \sum_s p_s - 1 \Big)
derive
  p_s = \frac{\exp\big( \sum_i \lambda_i f_i(s) \big)}{Z}
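One possible route (a sketch only, assuming the Lagrangian is written with the + signs used above):

% Stationarity condition for each p_s:
\frac{\partial L}{\partial p_s} = -\log p_s - 1 + \sum_i \lambda_i f_i(s) + \mu = 0
\quad\Longrightarrow\quad
p_s = e^{\mu - 1} \exp\Big( \sum_i \lambda_i f_i(s) \Big)
% Normalization \sum_s p_s = 1 then fixes the constant:
Z = e^{1 - \mu} = \sum_s \exp\Big( \sum_i \lambda_i f_i(s) \Big),
\qquad
p_s = \frac{\exp\big( \sum_i \lambda_i f_i(s) \big)}{Z}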
Boltzmann-Gibbs (Cnt’d)

Classification Rule
  Use the Boltzmann-Gibbs distribution as the prior distribution
  Compute the posterior given the observed data and the features f_i
  Use the optimal posterior to classify
Boltzmann-Gibbs (Cnt’d)

Maximum Entropy (ME)
  The posterior is the state probability density p(s | X), where X = (x_1, x_2, …, x_n)

Maximum entropy Markov model (MEMM)
  The posterior consists of transition probability densities p(s | s′, X)
Boltzmann-Gibbs (Cnt’d)

Conditional random field (CRF)
  The posterior consists of both transition probability densities p(s | s′, X) and state probability densities p(s | X)
References
  R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Ed., Wiley-Interscience, 2001.
  T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.
  P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach, The MIT Press, 2001.
Maximum Entropy Approach
An Example

Five possible French translations of the English word in:
  dans, en, à, au cours de, pendant

Certain constraints are obeyed:
  When April follows in, the proper translation is en

How do we choose the proper French translation y for a given English context x?
Formalism

Probability assignment p(y | x):
  y: French word, x: English context

Indicator function of a context feature f:
  f(x, y) = \begin{cases} 1 & \text{if } y = \textit{en} \text{ and \textit{April} follows \textit{in}} \\ 0 & \text{otherwise} \end{cases}
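A minimal sketch of such an indicator feature in Python; the context representation (the English word paired with the word that follows it) is an assumption made here for illustration.

# Sketch: an indicator feature for the translation example.
# A "context" x is assumed to be (English word, following word);
# y is the candidate French translation.

def f_april(x, y):
    """1 if y is 'en' and the word after 'in' is 'April', else 0."""
    word, next_word = x                       # assumed context layout
    return 1 if y == "en" and word == "in" and next_word == "April" else 0

print(f_april(("in", "April"), "en"))         # -> 1
print(f_april(("in", "Paris"), "en"))         # -> 0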
Expected Values of f

The expected value of f with respect to the empirical distribution \tilde{p}(x, y):
  \tilde{p}(f) = \sum_{x, y} \tilde{p}(x, y) \, f(x, y)

The expected value of f with respect to the conditional probability p(y | x):
  p(f) = \sum_{x, y} \tilde{p}(x) \, p(y | x) \, f(x, y)
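A sketch of both expectations on a toy data set; the dictionary representations of \tilde{p}(x, y), \tilde{p}(x), and p(y | x) are assumptions made here for illustration.

from collections import Counter

# Toy training pairs (x, y); ~p(x, y) and ~p(x) are relative frequencies.
data = [(("in", "April"), "en"), (("in", "April"), "en"), (("in", "Paris"), "à")]
n = len(data)
p_emp_xy = {pair: c / n for pair, c in Counter(data).items()}          # ~p(x, y)
p_emp_x = {x: c / n for x, c in Counter(x for x, _ in data).items()}   # ~p(x)

def empirical_expectation(f):
    # ~p(f) = sum_{x, y} ~p(x, y) f(x, y)
    return sum(p * f(x, y) for (x, y), p in p_emp_xy.items())

def model_expectation(f, p_y_given_x, labels):
    # p(f) = sum_{x, y} ~p(x) p(y | x) f(x, y)
    return sum(p_emp_x[x] * p_y_given_x(y, x) * f(x, y)
               for x in p_emp_x for y in labels)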
Constraint Equation

Set the two expected values equal:
  \tilde{p}(f) = p(f)
or equivalently,
  \sum_{x, y} \tilde{p}(x, y) \, f(x, y) = \sum_{x, y} \tilde{p}(x) \, p(y | x) \, f(x, y)
Maximum Entropy Principle

Given n feature functions f_i, we want p(y | x) to maximize the entropy measure
  H(p) = -\sum_{x, y} \tilde{p}(x) \, p(y | x) \log p(y | x)
where p is chosen from
  C = \{ p \mid p(f_i) = \tilde{p}(f_i), \; i = 1, 2, \ldots, n \}
Constrained Optimization Problem

The Lagrangian
  \Lambda(p, \lambda) = H(p) + \sum_i \lambda_i \big( p(f_i) - \tilde{p}(f_i) \big)

Solutions
  p_\lambda(y | x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big)
  Z_\lambda(x) = \sum_y \exp\Big( \sum_i \lambda_i f_i(x, y) \Big)
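A minimal sketch of this solution form; holding the features and weights in parallel lists is an arbitrary representation chosen for illustration.

import math

def conditional_prob(y, x, features, lambdas, labels):
    # p_lambda(y | x) = exp(sum_i lambda_i f_i(x, y)) / Z_lambda(x)
    def weight(label):
        return math.exp(sum(l * f(x, label) for l, f in zip(lambdas, features)))
    z = sum(weight(label) for label in labels)   # Z_lambda(x) sums over all y
    return weight(y) / z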
Iterative Solution

Compute the expectation of f_i under the current estimate of the probability function
  p^{(n)}(f_i) = \sum_{x, y} \tilde{p}(x) \, p^{(n)}(y | x) \, f_i(x, y)

Update the Lagrange multipliers
  \exp\big( \lambda_i^{(n+1)} - \lambda_i^{(n)} \big) = \frac{\tilde{p}(f_i)}{p^{(n)}(f_i)}

Update the probability functions
  p^{(n+1)}(y | x) = \frac{1}{Z^{(n+1)}(x)} \exp\Big( \sum_i \lambda_i^{(n+1)} f_i(x, y) \Big)
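A sketch of one such update in the spirit of Generalized Iterative Scaling (GIS); the 1/C rescaling of the log-ratio, with C bounding the total feature count of any (x, y) pair, is the standard GIS ingredient made explicit here, and the dictionary/list representations are assumptions of the sketch.

import math

def gis_step(lambdas, features, labels, p_emp_x, p_emp_f, C):
    """One GIS-style update of the Lagrange multipliers.

    lambdas : current weights lambda_i^(n)
    p_emp_x : dict x -> ~p(x)
    p_emp_f : list of empirical expectations ~p(f_i)
    C       : max over (x, y) of sum_i f_i(x, y)
    """
    # p^(n)(f_i) = sum_{x, y} ~p(x) p^(n)(y | x) f_i(x, y)
    model_f = [0.0] * len(features)
    for x, px in p_emp_x.items():
        scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features)))
                  for y in labels}
        z = sum(scores.values())
        for y, s in scores.items():
            for i, f in enumerate(features):
                model_f[i] += px * (s / z) * f(x, y)
    # lambda_i^(n+1) = lambda_i^(n) + (1/C) log( ~p(f_i) / p^(n)(f_i) )
    return [l + (1.0 / C) * math.log(pe / pm)
            for l, pe, pm in zip(lambdas, p_emp_f, model_f)]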
Feature Selection

Motivation:
  For a large collection of candidate features, we want to select a small subset
  Incremental growth
Incremental Learning
Adding feature \hat{f} to S to obtain S \cup \{\hat{f}\}

Consider C(S \cup \hat{f}) = \{ p : p(f) = \tilde{p}(f) \text{ for all } f \in S \cup \{\hat{f}\} \}

The optimal model: P_{S \cup \hat{f}} = \arg\max_{p \in C(S \cup \hat{f})} H(p)

Gain: \Delta L(S, \hat{f}) = L(P_{S \cup \hat{f}}) - L(P_S)
where L is the log-likelihood of the training data
Algorithm: Feature Selection
1. Start with S as an empty set; P_S is uniform
2. For each candidate feature f, compute P_{S \cup f} and \Delta L(S, f)
3. Check the termination condition (specified by the user)
4. Select \hat{f} = \arg\max_f \Delta L(S, f)
5. Add \hat{f} to S
6. Update P_S
7. Go to step 2
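A compact sketch of this greedy loop; train_model and log_likelihood stand in for the maximum entropy fitting and evaluation steps, and the max_features cap is one possible user-specified termination condition (all three are assumptions of this sketch).

def select_features(candidates, train_model, log_likelihood, max_features):
    """Greedy incremental feature selection.

    train_model(S)    -> model P_S fitted with feature set S  (assumed helper)
    log_likelihood(P) -> training log-likelihood L(P)          (assumed helper)
    """
    S, P_S = [], train_model([])             # start empty; P_S is uniform
    while len(S) < max_features:             # user-specified termination condition
        gains = {}
        for f in candidates:
            if f in S:
                continue
            P_Sf = train_model(S + [f])
            gains[f] = log_likelihood(P_Sf) - log_likelihood(P_S)  # deltaL(S, f)
        if not gains:
            break
        best = max(gains, key=gains.get)     # f_hat = arg max_f deltaL(S, f)
        S.append(best)
        P_S = train_model(S)                 # update P_S
    return S, P_S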
Approximation

Computing the maximum entropy model is costly for each candidate f

Simplification assumption:
  The multipliers \lambda associated with S do not change when f is added to S
Approximation (cnt’d)
The approximate solution for S \cup \{f\} then has the form
  P^{\alpha}_{S, f}(y | x) = \frac{1}{Z_\alpha(x)} P_S(y | x) \, e^{\alpha f(x, y)}
  Z_\alpha(x) = \sum_y P_S(y | x) \, e^{\alpha f(x, y)}
Approximate Solution
The approximate gain is
  G_{S, f}(\alpha) = L(P^{\alpha}_{S, f}) - L(P_S) = -\sum_x \tilde{p}(x) \log Z_\alpha(x) + \alpha \, \tilde{p}(f)

The approximate solution is then
  \tilde{P}_{S \cup f} = \arg\max_{P^{\alpha}_{S, f}} G_{S, f}(\alpha)
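A sketch of the one-dimensional gain maximization over alpha; the coarse grid search stands in for the Newton-style line search usually used, and the dictionary representations of \tilde{p}(x), \tilde{p}(f), and P_S(y | x) are assumptions of this sketch.

import math

def approximate_gain(alpha, f, P_S, p_emp_x, p_emp_f, labels):
    # G_{S,f}(alpha) = -sum_x ~p(x) log Z_alpha(x) + alpha ~p(f)
    total = alpha * p_emp_f
    for x, px in p_emp_x.items():
        z = sum(P_S[(y, x)] * math.exp(alpha * f(x, y)) for y in labels)
        total -= px * math.log(z)
    return total

def best_alpha(f, P_S, p_emp_x, p_emp_f, labels, grid=None):
    # Coarse search; the gain is concave in alpha, so a 1-D Newton or
    # golden-section search would be the usual, faster choice.
    grid = grid if grid is not None else [i / 10.0 for i in range(-50, 51)]
    return max(grid, key=lambda a: approximate_gain(a, f, P_S, p_emp_x, p_emp_f, labels))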
Conditional Random Field (CRF)
CRF
The probability of a label sequence y given an observation sequence x is the normalized product of potential functions, each of the form
  \exp\Big( \sum_j \lambda_j \, t_j(y_{i-1}, y_i, x, i) + \sum_k \mu_k \, s_k(y_i, x, i) \Big),
where y_{i-1} and y_i are the labels at positions i-1 and i,
t_j(y_{i-1}, y_i, x, i) is a transition feature function, and
s_k(y_i, x, i) is a state feature function
Feature Functions
Example:
A feature given by
  b(x, i) = \begin{cases} 1 & \text{if the observation at position } i \text{ is the word ``September''} \\ 0 & \text{otherwise} \end{cases}

Transition function:
  t_j(y_{i-1}, y_i, x, i) = \begin{cases} 1 & \text{if } y_{i-1} = \text{IN and } y_i = \text{NNP} \\ 0 & \text{otherwise} \end{cases}
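A sketch of how such feature functions enter the (unnormalized) per-position potentials of a CRF; the weights, tag names, and toy sentence are illustrative, and the normalization over all label sequences is omitted.

import math

def t_in_nnp(y_prev, y_cur, x, i):
    # transition feature: previous tag IN followed by NNP
    return 1 if y_prev == "IN" and y_cur == "NNP" else 0

def s_september(y_cur, x, i):
    # state feature built from the observation predicate b(x, i) above
    # (the tag condition is left out in this sketch)
    return 1 if x[i] == "September" else 0

def unnormalized_score(y, x, transitions, states):
    # Product of the per-position potentials exp(sum_j lambda_j t_j + sum_k mu_k s_k),
    # taken here over positions i >= 1 only (a simplification of this sketch).
    total = 0.0
    for i in range(1, len(y)):
        total += sum(lam * t(y[i - 1], y[i], x, i) for lam, t in transitions)
        total += sum(mu * s(y[i], x, i) for mu, s in states)
    return math.exp(total)

x = ["in", "September"]
print(unnormalized_score(["IN", "NNP"], x, [(0.8, t_in_nnp)], [(0.5, s_september)]))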
Difference from MEMM

If the state features are dropped, we obtain an MEMM model

The drawback of MEMM:
  The state probabilities are not learned directly, but inferred
  Bias can be generated, since the transition features dominate during training
Difference from HMM

HMM is a generative model
  In order to define a joint distribution, this model must enumerate all possible observation sequences and their corresponding label sequences
  This task is intractable unless the observation elements are represented as isolated units
CRF Training Methods


CRF training requires intensive numerical optimization

Preconditioned conjugate gradient
  Instead of searching along the gradient, conjugate gradient searches along a carefully chosen linear combination of the gradient and the previous search direction

Limited-memory quasi-Newton
  Limited-memory BFGS (L-BFGS) is a second-order method that estimates the curvature numerically from previous gradients and updates, avoiding the need for an exact Hessian inverse computation

Voted perceptron
Voted Perceptron

Like the perceptron algorithm, this algorithm scans through the training instances, updating the weight vector \lambda_t when a prediction error is detected

Instead of taking just the final weight vector, the voted perceptron algorithm takes the average of the \lambda_t
Voted Perceptron (cnt’d)
Let
  F(y, x) = \sum_i f_j(y_{i-1}, y_i, x, i) \quad \text{(componentwise in } j\text{)}
where f_j is either a state function or a transition function.

For each training instance (x_k, y_k), the method computes a weight update
  \lambda_{t+1} = \lambda_t + F(y_k, x_k) - F(\hat{y}_k, x_k)
in which \hat{y}_k is obtained from the Viterbi path
  \hat{y}_k = \arg\max_y \; \lambda_t \cdot F(y, x_k)
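A sketch of this update; viterbi_decode is an assumed helper that returns the highest-scoring label sequence under the current weights, and feature counting is done by a global feature map F.

def global_features(y, x, feature_fns):
    """F_j(y, x) = sum_i f_j(y_{i-1}, y_i, x, i), collected into a vector."""
    return [sum(f(y[i - 1], y[i], x, i) for i in range(1, len(y)))
            for f in feature_fns]

def perceptron_pass(data, feature_fns, labels, lambdas, viterbi_decode):
    """One pass over the training data; returns the final weights and the
    running list of weight vectors, whose average the voted/averaged
    perceptron uses at the end."""
    history = []
    for x_k, y_k in data:
        y_hat = viterbi_decode(x_k, lambdas, feature_fns, labels)   # assumed helper
        if y_hat != y_k:                                            # prediction error
            gold = global_features(y_k, x_k, feature_fns)
            pred = global_features(y_hat, x_k, feature_fns)
            lambdas = [l + g - p for l, g, p in zip(lambdas, gold, pred)]
        history.append(list(lambdas))
    return lambdas, history

# Averaging (the "voting" step): final weights = mean of the lambda_t
# avg = [sum(col) / len(history) for col in zip(*history)]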
References

A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra, A maximum entropy approach to natural language processing, Computational Linguistics, 1996.

A. McCallum, D. Freitag, and F. Pereira, Maximum entropy Markov models for information extraction and segmentation, ICML 2000.

H. M. Wallach, Conditional random fields: an introduction, 2004.

J. Lafferty, A. McCallum, and F. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, ICML 2001.

F. Sha and F. Pereira, Shallow parsing with conditional random fields, HLT-NAACL 2003.