Maximum Entropy (ME), Maximum Entropy Markov Models (MEMM), and Conditional Random Fields (CRF)

Boltzmann-Gibbs Distribution
Given states s_1, s_2, ..., s_n with density p(s) = p_s.
Maximum entropy principle: without any further information, choose the density p_s to maximize the entropy
  -\sum_s p_s \log p_s
subject to the constraints
  \sum_s p_s f_i(s) = D_i, \quad i = 1, ..., n.

Boltzmann-Gibbs (cont'd)
Consider the Lagrangian
  L = -\sum_s p_s \log p_s + \sum_i \lambda_i \Big( \sum_s p_s f_i(s) - D_i \Big) + \mu \Big( \sum_s p_s - 1 \Big).
Taking the partial derivative of L with respect to each p_s and setting it to zero yields the Boltzmann-Gibbs density
  p_s = \frac{1}{Z} \exp\Big( \sum_i \lambda_i f_i(s) \Big),
where Z is the normalizing factor.

Exercise
From the Lagrangian above, derive
  p_s = \frac{1}{Z} \exp\Big( \sum_i \lambda_i f_i(s) \Big).

Classification Rule
- Use the Boltzmann-Gibbs density as the prior distribution.
- Compute the posterior for the observed data and the features f_i.
- Use the optimal posterior to classify.

Boltzmann-Gibbs (cont'd)
- Maximum entropy (ME): the posterior is the state probability density p(s | X), where X = (x_1, x_2, ..., x_n).
- Maximum entropy Markov model (MEMM): the posterior consists of transition probability densities p(s | s', X).
- Conditional random field (CRF): the posterior consists of both transition probability densities p(s | s', X) and state probability densities p(s | X).

References
- R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2001.
- T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.
- P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach, The MIT Press, 2001.

Maximum Entropy Approach: An Example
- Five possible French translations of the English word "in": dans, en, à, au cours de, pendant.
- Certain constraints are obeyed: when "April" follows "in", the proper translation is "en".
- Question: how do we choose the proper French translation y given an English context x?
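The translation example can be made concrete with a minimal sketch of the conditional Boltzmann-Gibbs form p(y | x) = exp(Σ_i λ_i f_i(x, y)) / Z(x). The candidate translations and the "April follows in" feature come from the example above; the multiplier value and the simplified membership test are illustrative assumptions.

```python
import math

# Candidate French translations of the English word "in" (from the example above).
CANDIDATES = ["dans", "en", "à", "au cours de", "pendant"]

def f_april(x, y):
    # Indicator feature from the example: fires when y = "en" and "April"
    # appears in the context (simplified to a membership test on tokens).
    return 1.0 if y == "en" and "April" in x else 0.0

def p_y_given_x(x, features, lambdas):
    # Boltzmann-Gibbs conditional: p(y|x) = exp(sum_i lam_i f_i(x, y)) / Z(x).
    weights = {y: math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features)))
               for y in CANDIDATES}
    Z = sum(weights.values())  # normalizing factor Z(x)
    return {y: w / Z for y, w in weights.items()}

# Illustrative multiplier value (an assumption, not from the slides):
p = p_y_given_x(["in", "April"], [f_april], [2.0])
```

With no active feature (a context without "April"), the density falls back to uniform over the five candidates, which is exactly the maximum entropy behavior the principle prescribes.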
Formalism
- Probability assignment p(y | x): y is a French word, x is an English context.
- Indicator function of a context feature:
  f(x, y) = 1 if y = "en" and "April" follows "in"; f(x, y) = 0 otherwise.

Expected Values of f
- The expected value of f with respect to the empirical distribution \tilde p(x, y):
  \tilde p(f) = \sum_{x,y} \tilde p(x, y) f(x, y).
- The expected value of f with respect to the conditional probability p(y | x):
  p(f) = \sum_{x,y} \tilde p(x) p(y | x) f(x, y).

Constraint Equation
Set the two expected values equal: \tilde p(f) = p(f), or equivalently
  \sum_{x,y} \tilde p(x, y) f(x, y) = \sum_{x,y} \tilde p(x) p(y | x) f(x, y).

Maximum Entropy Principle
Given n feature functions f_i, we want p(y | x) to maximize the entropy
  H(p) = -\sum_{x,y} \tilde p(x) p(y | x) \log p(y | x),
where p is chosen from
  C = \{ p \mid p(f_i) = \tilde p(f_i), \; i = 1, 2, ..., n \}.

Constrained Optimization Problem
The Lagrangian:
  \Xi(p, \Lambda) = H(p) + \sum_i \lambda_i \big( p(f_i) - \tilde p(f_i) \big).
Solutions:
  p(y | x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big),
  Z(x) = \sum_y \exp\Big( \sum_i \lambda_i f_i(x, y) \Big).

Iterative Solution
- Compute the expectation of f_i under the current estimate of the probability function:
  p^{(n)}(f_i) = \sum_{x,y} \tilde p(x) p^{(n)}(y | x) f_i(x, y).
- Update the Lagrange multipliers:
  \exp\big( \lambda_i^{(n+1)} - \lambda_i^{(n)} \big) = \frac{\tilde p(f_i)}{p^{(n)}(f_i)}.
- Update the probability functions:
  p^{(n+1)}(y | x) = \frac{1}{Z^{(n+1)}(x)} \exp\Big( \sum_i \lambda_i^{(n+1)} f_i(x, y) \Big).

Feature Selection
- Motivation: from a large collection of candidate features, we want to select a small subset.
- Strategy: incremental growth.

Incremental Learning
- Adding a feature \hat f to S gives S \cup \{\hat f\}.
- Consider C(S \cup \hat f) = \{ p : p(f) = \tilde p(f), \; f \in S \cup \{\hat f\} \}.
- The optimal model: P_{S \cup \hat f} = \arg\max_{p \in C(S \cup \hat f)} H(p).
- Gain: \Delta L(S, \hat f) = L(P_{S \cup \hat f}) - L(P_S), where L is the log-likelihood of the training data.

Algorithm: Feature Selection
1. Start with S as an empty set; P_S is uniform.
2. For each candidate feature f, compute P_{S \cup f} and \Delta L(S, f).
3. Check the termination condition (specified by the user).
4. Select \hat f = \arg\max_f \Delta L(S, f).
5. Add \hat f to S.
6. Update P_S.
7.
Go to step 2.

Approximation
- Computing the maximum entropy model for each candidate feature f is costly.
- Simplifying assumption: the multipliers \lambda associated with S do not change when f is added to S.

Approximation (cont'd)
The approximate solution for S \cup \{f\} then has the form
  P_{S,f}^{\alpha}(y | x) = \frac{1}{Z_\alpha(x)} P_S(y | x) \, e^{\alpha f(x, y)},
  Z_\alpha(x) = \sum_y P_S(y | x) \, e^{\alpha f(x, y)}.

Approximate Solution
The approximate gain is
  G_{S,f}(\alpha) = L(P_{S,f}^{\alpha}) - L(P_S) = -\sum_x \tilde p(x) \log Z_\alpha(x) + \alpha \tilde p(f).
The approximate solution is then
  \tilde P_{S \cup f} = P_{S,f}^{\alpha^*}, \quad \alpha^* = \arg\max_\alpha G_{S,f}(\alpha).

Conditional Random Field (CRF)
The probability of a label sequence y given an observation sequence x is the normalized product of potential functions, each of the form
  \exp\Big( \sum_j \lambda_j t_j(y_{i-1}, y_i, x, i) + \sum_k \mu_k s_k(y_i, x, i) \Big),
where y_{i-1} and y_i are the labels at positions i-1 and i, t_j(y_{i-1}, y_i, x, i) is a transition feature function, and s_k(y_i, x, i) is a state feature function.

Feature Functions
Example: an observation predicate
  b(x, i) = 1 if the observation sequence at position i is the word "September"; b(x, i) = 0 otherwise.
Transition function:
  t_j(y_{i-1}, y_i, x, i) = 1 if y_{i-1} = IN and y_i = NNP; t_j(y_{i-1}, y_i, x, i) = 0 otherwise.
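The CRF potential above can be sketched directly, using the two example features just given (the IN→NNP transition and the "September" observation predicate). The weight values are illustrative assumptions; a real CRF would learn them by training.

```python
import math

def b(x, i):
    # Observation predicate: 1 if the word at position i is "September".
    return 1.0 if x[i] == "September" else 0.0

def t_in_nnp(y_prev, y_cur, x, i):
    # Transition feature: 1 if y_{i-1} = IN and y_i = NNP.
    return 1.0 if y_prev == "IN" and y_cur == "NNP" else 0.0

def s_sept_nnp(y_cur, x, i):
    # State feature: fires when y_i = NNP on the word "September".
    return b(x, i) if y_cur == "NNP" else 0.0

def potential(y, x, lam, mu):
    # Unnormalized score exp( sum_i [ lam * t(y_{i-1}, y_i, x, i)
    #                               + mu  * s(y_i, x, i) ] ).
    score = 0.0
    for i in range(len(y)):
        if i > 0:
            score += lam * t_in_nnp(y[i - 1], y[i], x, i)
        score += mu * s_sept_nnp(y[i], x, i)
    return math.exp(score)

x = ["in", "September"]
good = potential(["IN", "NNP"], x, lam=1.0, mu=1.0)  # both features fire
bad = potential(["IN", "VB"], x, lam=1.0, mu=1.0)    # no feature fires
```

Normalizing these potentials over all label sequences would give the CRF probability; here the unnormalized scores already show that the labeling consistent with both features dominates.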
Difference from MEMM
- If the state feature is dropped, we obtain an MEMM.
- Drawback of the MEMM: the state probabilities are not learned but inferred.
- This can introduce bias, since the transition features dominate during training.

Difference from HMM
- The HMM is a generative model.
- To define a joint distribution, such a model must enumerate all possible observation sequences and their corresponding label sequences.
- This task is intractable unless the observation elements are represented as isolated units.

CRF Training Methods
CRF training requires intensive numerical computation. Common methods:
- Preconditioned conjugate gradient: instead of searching along the gradient, conjugate gradient searches along a carefully chosen linear combination of the gradient and the previous search direction.
- Limited-memory quasi-Newton: limited-memory BFGS (L-BFGS) is a second-order method that estimates the curvature numerically from previous gradients and updates, avoiding an exact computation of the inverse Hessian.
- Voted perceptron.

Voted Perceptron
- Like the perceptron algorithm, this algorithm scans through the training instances, updating the weight vector \lambda_t when a prediction error is detected.
- Instead of taking just the final weight vector, the voted perceptron takes the average of the \lambda_t.

Voted Perceptron (cont'd)
Let
  F(y, x) = \sum_i f(y_{i-1}, y_i, x, i),
where each component f_j is either a state function or a transition function. For each training instance, the method computes the weight update
  \lambda_{t+1} = \lambda_t + F(y_k, x_k) - F(\hat y_k, x_k),
in which \hat y_k is obtained from the Viterbi path
  \hat y_k = \arg\max_y \lambda_t \cdot F(y, x_k).

References
- A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra, "A maximum entropy approach to natural language processing."
- A. McCallum, D. Freitag, and F. Pereira, "Maximum entropy Markov models for information extraction and segmentation."
- H. M. Wallach, "Conditional random fields: an introduction."
- J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: probabilistic models for segmenting and labeling sequence data."
- F. Sha and F. Pereira, "Shallow parsing with conditional random fields."
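The voted perceptron update described above can be sketched as follows. The global feature map and the brute-force decoder are toy stand-ins (assumptions for illustration) for F(y, x) and the Viterbi argmax; the mistake-driven update and the averaging of the weight vectors follow the update rule given in the slides.

```python
from itertools import product

LABELS = [0, 1]

def feat(y, x):
    # Toy global feature vector F(y, x): counts of positions where the
    # label matches the token, and where it does not.
    match = sum(1 for yi, xi in zip(y, x) if yi == xi)
    return [float(match), float(len(x) - match)]

def decode(w, x):
    # Brute-force stand-in for the Viterbi argmax over label sequences.
    return max((list(y) for y in product(LABELS, repeat=len(x))),
               key=lambda y: sum(wi * fi for wi, fi in zip(w, feat(y, x))))

def averaged_perceptron(train, n_feats=2, epochs=3):
    # On each mistake: w += F(y, x) - F(y_hat, x); return averaged weights.
    w = [0.0] * n_feats
    total, count = [0.0] * n_feats, 0
    for _ in range(epochs):
        for x, y in train:
            y_hat = decode(w, x)
            if y_hat != y:
                fy, fh = feat(y, x), feat(y_hat, x)
                w = [wi + a - c for wi, a, c in zip(w, fy, fh)]
            total = [t + wi for t, wi in zip(total, w)]
            count += 1
    return [t / count for t in total]

train = [([0, 1], [0, 1]), ([1, 0], [1, 0])]
w_avg = averaged_perceptron(train)
```

Averaging the weight vectors, rather than keeping only the final one, is what distinguishes this variant from the plain perceptron and tends to reduce the variance of the learned model.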