A Brief Maximum Entropy Tutorial
Presenter: Davidson
Date: 2009/02/04
Original Author: Adam Berger, 1996/07/05
http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html
Outline

- Overview
- Maxent modeling
  - Motivating example
  - Training data
  - Features and constraints
  - The maxent principle
  - Exponential form
  - Maximum likelihood
- Skipped sections and further reading
Overview

- Statistical modeling
  - Models the behavior of a random process
  - Utilizes samples of output data to construct a representation of the process
  - Predicts the future behavior of the process
- Maximum Entropy Models
  - A family of distributions within the class of exponential models for statistical modeling
Motivating example (1/4)

- An English-to-French translator translates the English word "in" into one of 5 French phrases: dans, en, à, au cours de, pendant
- Goal:
  1. Extract a set of facts about the decision-making process
  2. Construct a model of this process

[Figure: the English-to-French translator maps "in" to a French phrase f, chosen with probability P(f).]
Motivating example (2/4)

- The translator always chooses among those 5 French phrases:

  $p(dans) + p(en) + p(à) + p(au\ cours\ de) + p(pendant) = 1$

- The most intuitively appealing model (the most uniform model subject to our knowledge) is:

  $p(dans) = p(en) = p(à) = p(au\ cours\ de) = p(pendant) = 1/5$
Motivating example (3/4)

- If a second clue is discovered (the translator chose either dans or en 30% of the time), then P must satisfy 2 constraints:

  $p(dans) + p(en) = 3/10$
  $p(dans) + p(en) + p(à) + p(au\ cours\ de) + p(pendant) = 1$

- A reasonable choice for P would be (the most uniform one):

  $p(dans) = p(en) = 3/20$
  $p(à) = p(au\ cours\ de) = p(pendant) = 7/30$
Motivating example (4/4)

- What if a third constraint is discovered?

  $p(dans) + p(en) = 3/10$
  $p(dans) + p(en) + p(à) + p(au\ cours\ de) + p(pendant) = 1$
  $p(dans) + p(à) = 1/2$

- The choice of model is no longer as obvious
- Two problems arise when complexity is added:
  - What does "uniform" mean, and how can the uniformity of a model be measured?
  - How can the most uniform model subject to a set of constraints be found?
- One solution: the Maximum Entropy (maxent) model; a numerical sketch of this example follows below
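As a preview of where the tutorial is headed, the snippet below solves this motivating example numerically: it maximizes entropy over the five phrase probabilities subject to the three constraints. This is a minimal sketch, assuming NumPy and SciPy are available; the starting point and variable names are illustrative choices, not part of the original tutorial.

```python
# A minimal sketch (assuming NumPy and SciPy) of finding the most
# uniform, i.e. maximum entropy, distribution over the five French
# phrases subject to the three constraints above.
import numpy as np
from scipy.optimize import minimize

phrases = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    # -H(p) = sum_i p_i log p_i; minimizing this maximizes entropy
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},           # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 3.0 / 10.0},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 1.0 / 2.0},   # p(dans) + p(à) = 1/2
]

result = minimize(neg_entropy, x0=np.full(5, 0.2),
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints)
for phrase, prob in zip(phrases, result.x):
    print(f"p({phrase}) = {prob:.4f}")
```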
Maxent Modeling

- Consider a random process which produces an output value y, a member of a finite set Y
- In generating y, the process may be influenced by some contextual information x, a member of a finite set X
- The task is to construct a stochastic model that accurately represents the behavior of the random process
  - This model estimates the conditional probability that, given a context x, the process will output y
- We denote by P the set of all conditional probability distributions
  - A model p(y|x) is an element of P
Training data

- Training sample: $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$
- The training sample's empirical probability distribution (illustrated in the sketch below):

  $\tilde{p}(x, y) \equiv \frac{1}{N} \times \text{number of times that } (x, y) \text{ occurs in the sample}$
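A minimal sketch of this definition on a hypothetical toy sample of (context, translation) pairs; the data is invented purely for illustration and is reused by the later sketches.

```python
# Empirical distribution p~(x, y) = count(x, y) / N on a toy sample.
from collections import Counter

sample = [("in April", "en"), ("in April", "en"), ("in the box", "dans"),
          ("in time", "à"), ("in April", "en"), ("in the box", "dans")]

N = len(sample)
p_tilde = {pair: count / N for pair, count in Counter(sample).items()}

print(p_tilde[("in April", "en")])  # 3/6 = 0.5
```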
Features and constraints (1/4)

- Use a set of statistics of the training sample to construct a statistical model of the process
- Statistics that are independent of the context:

  $p(dans) + p(en) = 3/10$
  $p(dans) + p(en) + p(à) + p(au\ cours\ de) + p(pendant) = 1$
  $p(dans) + p(à) = 1/2$

- Statistics that depend on the conditioning information x: e.g. in the training sample, if April is the word following in, then the translation of in is en with frequency 9/10 (see the sketch below)
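Continuing the hypothetical toy sample from the previous sketch, such a context-dependent statistic is just a relative frequency:

```python
# Relative frequency of translation "en" among toy-sample pairs
# whose context mentions April (continues the `sample` list above).
april_pairs = [(x, y) for (x, y) in sample if "April" in x]
freq = sum(1 for _, y in april_pairs if y == "en") / len(april_pairs)
print(freq)  # 1.0 on the toy sample; 9/10 in the slide's example
```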
Features and constraints (2/4)

- To express the event that in translates as en when April is the following word, we can introduce the indicator function:

  $f(x, y) = \begin{cases} 1 & \text{if } y = en \text{ and April follows in} \\ 0 & \text{otherwise} \end{cases}$

- The expected value of f with respect to the empirical distribution $\tilde{p}(x, y)$ is exactly the statistic we are interested in. This expected value is given by:

  $\tilde{p}(f) \equiv \sum_{x,y} \tilde{p}(x, y)\, f(x, y)$

- We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f. We call such a function a feature function, or feature for short. (A sketch follows below.)
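A minimal sketch of this feature and its empirical expectation, continuing the toy `p_tilde` from the earlier sketch:

```python
# The indicator feature above and its empirical expectation
# p~(f) = sum over (x, y) of p~(x, y) f(x, y).
def f(x, y):
    # 1 if the translation is "en" and "April" follows "in" in context x
    return 1 if y == "en" and "April" in x else 0

p_tilde_f = sum(p * f(x, y) for (x, y), p in p_tilde.items())
print(p_tilde_f)  # 0.5 on the toy sample
```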
Features and constraints (3/4)

- The expected value of f with respect to the model p(y|x) is:

  $p(f) \equiv \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f(x, y)$

  where $\tilde{p}(x)$ is the empirical distribution of x in the training sample

- We constrain this expected value to be the same as the expected value of f in the training sample (a sketch follows below):

  $p(f) = \tilde{p}(f)$

  We call this a constraint equation, or simply a constraint.
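A minimal sketch of the model expectation, continuing the toy objects above; the uniform conditional model used here is a hypothetical stand-in for p(y|x):

```python
# Model expectation p(f) = sum over (x, y) of p~(x) p(y|x) f(x, y).
from collections import defaultdict

p_tilde_x = defaultdict(float)          # marginal empirical distribution p~(x)
for (x, y), p in p_tilde.items():
    p_tilde_x[x] += p

def model_expectation(p_y_given_x, feature):
    return sum(p_tilde_x[x] * p_yx * feature(x, y)
               for x in p_tilde_x
               for y, p_yx in p_y_given_x[x].items())

# e.g. a uniform conditional model over the five phrases
uniform = {y: 0.2 for y in ["dans", "en", "à", "au cours de", "pendant"]}
p_model = {x: uniform for x in p_tilde_x}
print(model_expectation(p_model, f))    # 0.1 on the toy sample
```

The constraint p(f) = p~(f) would require this value to match the 0.5 computed in the previous sketch, which the uniform model clearly fails to do.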
Features and constraints (4/4)

- Combining the above 3 equations yields:

  $\sum_{x,y} \tilde{p}(x)\, p(y|x)\, f(x, y) = \sum_{x,y} \tilde{p}(x, y)\, f(x, y)$

- By restricting attention to those models for which this constraint holds, we are eliminating from consideration those models which do not agree with the training sample on how often the output of the process should exhibit the feature f.
- What we have so far:
  - A means of representing statistical phenomena inherent in a sample of data, namely $\tilde{p}(f)$
  - A means of requiring that our model of the process exhibit these phenomena, namely $p(f) = \tilde{p}(f)$
The maxent principle (1/2)

- Suppose n feature functions $f_i$ are given
- We would like our model to accord with these statistics, i.e. we would like p to lie in the subset C of P defined by

  $C \equiv \{\, p \in P \mid p(f_i) = \tilde{p}(f_i) \text{ for } i = 1, 2, \ldots, n \,\}$

- Among the models $p \in C$, we would like to select the distribution which is most uniform
  - But what does "uniform" mean?
- A mathematical measure of the uniformity of a conditional distribution p(y|x) is provided by the conditional entropy (sketched below):

  $H(p) \equiv -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x)$
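A minimal sketch of this quantity, continuing the toy `p_tilde_x` and `p_model` objects from the previous sketches:

```python
# Conditional entropy H(p) = -sum over (x, y) of p~(x) p(y|x) log p(y|x).
import math

def conditional_entropy(p_y_given_x):
    h = 0.0
    for x, px in p_tilde_x.items():
        for y, p_yx in p_y_given_x[x].items():
            if p_yx > 0:                # 0 log 0 is taken to be 0
                h -= px * p_yx * math.log(p_yx)
    return h

print(conditional_entropy(p_model))     # log 5 ≈ 1.609 for the uniform model
```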
The maxent principle (2/2)

  $H(p) \equiv -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x)$

- The entropy is bounded from below by zero
  - The entropy of a model with no uncertainty at all
- The entropy is bounded from above by $\log |Y|$
  - The entropy of the uniform distribution over all $|Y|$ possible values of y
- The principle of maximum entropy:
  To select a model from a set C of allowed probability distributions, choose the model $p^* \in C$ with maximum entropy H(p):

  $p^* = \arg\max_{p \in C} H(p)$

- $p^*$ is always well-defined; that is, there is always a unique model $p^*$ with maximum entropy in any constrained set C
Exponential form (1/3)

- The method of Lagrange multipliers is applied to impose the constraints on the optimization
- The constrained optimization problem is to find

  $p^* = \arg\max_{p \in C} H(p) = \arg\max_{p \in C} \left( -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x) \right)$

- Maximize H(p) subject to the following constraints:
  1. $p(y|x) \geq 0$ for all x, y
  2. $\sum_y p(y|x) = 1$ for all x
     (constraints 1 and 2 guarantee that p is a conditional probability distribution)
  3. $\sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_i(x, y) = \sum_{x,y} \tilde{p}(x, y)\, f_i(x, y)$ for $i \in \{1, 2, \ldots, n\}$
     (in other words, $p \in C$)
Exponential form (2/3)

- When the Lagrange multipliers are introduced, the objective function becomes:

  $\Lambda(p, \lambda, \gamma) \equiv -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x) + \sum_i \lambda_i \left( \sum_{x,y} \tilde{p}(x, y)\, f_i(x, y) - \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_i(x, y) \right) + \gamma \left( \sum_y p(y|x) - 1 \right)$

- The real-valued parameters $\gamma$ and $\lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ correspond to the 1 + n constraints imposed on the solution
- Solve for $p$, $\lambda$, and $\gamma$ by optimizing the resulting unconstrained dual problem
  - See http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node7.html
Exponential form (3/3)

- The final result:
  The maximum entropy model subject to the constraints C has the parametric form $p_{\lambda}^*$ of the equation below, where $\lambda^*$ can be determined by maximizing the dual function $\Psi(\lambda)$:

  $p_{\lambda}^*(y|x) = \frac{1}{Z_{\lambda}(x)} \exp\left( \sum_i \lambda_i f_i(x, y) \right)$

  $Z_{\lambda}(x) = \sum_y \exp\left( \sum_i \lambda_i f_i(x, y) \right)$

  $\Psi(\lambda) \equiv \Lambda(p^*, \lambda, \gamma^*)$
Maximum likelihood

- The log-likelihood $L_{\tilde{p}}(p)$ of the empirical distribution $\tilde{p}$ as predicted by a model p is defined by:

  $L_{\tilde{p}}(p) \equiv \log \prod_{x,y} p(y|x)^{\tilde{p}(x, y)} = \sum_{x,y} \tilde{p}(x, y) \log p(y|x)$

- The dual function $\Psi(\lambda)$ of the previous section is just the log-likelihood for the exponential model $p_{\lambda}$; that is:

  $\Psi(\lambda) = L_{\tilde{p}}(p_{\lambda})$

- The result from the previous section can be rephrased as:
  The model $p^* \in C$ with maximum entropy is the model in the parametric family $p_{\lambda}(y|x)$ that maximizes the likelihood of the training sample $\tilde{p}$ (sketched below)
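A minimal sketch of the log-likelihood, continuing the toy `p_tilde` and `p_model` objects from the earlier sketches; fitting the lambdas of the exponential family by maximizing this quantity is equivalent to maximizing the dual function:

```python
# Log-likelihood L_p~(p) = sum over (x, y) of p~(x, y) log p(y|x).
import math

def log_likelihood(p_y_given_x):
    return sum(p * math.log(p_y_given_x[x][y])
               for (x, y), p in p_tilde.items())

print(log_likelihood(p_model))          # log(1/5) ≈ -1.609 for the uniform model
```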
Skipped sections

- Computing the parameters
  http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node10.html#SECTION00030000000000000000
- Algorithms for inductive learning
  http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node11.html#SECTION00040000000000000000

Further reading

- http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node14.html#SECTION00050000000000000000