A brief maximum entropy tutorial
Overview
• Statistical modeling addresses the problem of
modeling the behavior of a random process
• In constructing this model, we typically have at
our disposal a sample of output from the process.
From the sample, which constitutes an incomplete
state of knowledge about the process, the
modeling problem is to parlay this knowledge into
a succinct, accurate representation of the process
• We can then use this representation to make
predictions of the future behavior of the process
Motivating example
• Suppose we wish to model an expert translator’s
decisions concerning the proper French rendering
of the English word in.
• A model p of the expert’s decisions assigns to each
French word or phrase f an estimate, p(f), of the
probability that the expert would choose f as a
translation of in.
• Develop p – collect a large sample of instances of
the expert’s decisions
Motivating example
• Our goal is to
– Extract a set of facts about the decision-making
process from the sample (the first task of
modeling)
– Construct a model of this process (the second
task)
Motivating example
• One obvious clue we might glean from the sample
is the list of allowed translations
– in → {dans, en, à, au cours de, pendant}
• With this information in hand, we can impose our
first constraint on our model p:
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
This equation represents our first statistic of the
process; we can now proceed to search for a
suitable model which obeys this equation
– There are an infinite number of models p for which this
identity holds
Motivating example
• One model which satisfies the above equation is
p(dans)=1; in other words, the model always
predicts dans.
• Another model which obeys this constraint
predicts pendant with a probability of ½, and à
with a probability of ½.
• But both of these models offend our sensibilities:
knowing only that the expert always chose from
among these five French phrases, how can we
justify either of these probability distributions?
Motivating example
• Knowing only that the expert chose exclusively from
among these five French phrases, the most intuitively
appealing model is
p(dans) = 1/5
p(en) = 1/5
p(à) = 1/5
p(au cours de) = 1/5
p(pendant) = 1/5
This model, which allocates the total probability evenly among
the five possible phrases, is the most uniform model subject to
our knowledge
It is not, however, the most uniform overall; that model would
grant an equal probability to every possible French phrase.
Motivating example
• We might hope to glean more clues about the expert’s
decisions from our sample.
• Suppose we notice that the expert chose either dans or en
30% of the time
p(dans) + p(en) = 3/10
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
Once again there are many probability distributions
consistent with these two constraints.
• In the absence of any other knowledge, a reasonable choice for p
is again the most uniform – that is, the distribution which
allocates its probability as evenly as possible, subject to the constraints:
p(dans) = 3/20
p(en) = 3/20
p(à) = 7/30
p(au cours de) = 7/30
p(pendant) = 7/30
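(A quick check of the arithmetic behind these values: the first constraint fixes the combined mass of dans and en at 3/10, which uniformity splits evenly into 3/20 each; the remaining 1 − 3/10 = 7/10 is then split evenly over the three remaining phrases, giving 7/30 each.)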
Motivating example
• Say we inspect the data once more, and this time notice
another interesting fact: in half the cases, the expert chose
either dans or à. We can incorporate this information into
our model as a third constraint:
p(dans) + p(en) = 3/10
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
p(dans) + p(à) = 1/2
• We can once again look for the most uniform p satisfying
these constraints, but now the choice is not as obvious.
Motivating example
• As we have added complexity, we have
encountered two problems:
– First, what exactly is meant by “uniform,” and how can
one measure the uniformity of a model?
– Second, having determined a suitable answer to these
questions, how does one find the most uniform model
subject to a set of constraints like those we have
described?
Motivating example
• The maximum entropy method answers both these
questions.
• Intuitively, the principle is simple:
– model all that is known and assume nothing about that
which is unknown
– In other words, given a collection of facts, choose a
model which is consistent with all the facts, but
otherwise as uniform as possible.
• This is precisely the approach we took in selecting
our model p at each step in the above example
Maxent Modeling
• Consider a random process which produces an
output value y, a member of a finite set Y.
– y may be any word in the set {dans, en, à, au cours de,
pendant}
• In generating y, the process may be influenced by
some contextual information x, a member of a
finite set X.
– x could include the words in the English sentence
surrounding in
• The task is to construct a stochastic model that accurately
represents the behavior of the random process
– Given a context x, estimate the probability that the process will output y.
Training data
• Collect a large number of samples (x1, y1), (x2, y2),…,
(xN, yN)
– Each sample would consist of a phrase x containing the
words surrounding in, together with the translation y of
in which the process produced
p̃(x, y) ≡ (1/N) × (number of times that (x, y) occurs in the sample)
• Typically, a particular pair (x, y) will either not occur
at all in the sample, or will occur at most a few times.
– smoothing
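As an illustration of how p̃(x, y) is computed, here is a minimal Python sketch; the sample pairs are invented stand-ins for (context, translation) data, not the tutorial's actual corpus.

```python
from collections import Counter

# Hypothetical (x, y) samples: x stands in for the surrounding context,
# y for the translation the expert chose.
sample = [("April", "en"), ("April", "en"), ("the", "dans"),
          ("the", "dans"), ("weeks", "pendant")]
N = len(sample)

# Empirical joint distribution: p~(x, y) = count(x, y) / N
counts = Counter(sample)
p_tilde = {pair: c / N for pair, c in counts.items()}

print(p_tilde[("April", "en")])  # 2 occurrences out of 5 samples -> 0.4
```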
Features and constraints
• The goal is to construct a statistical model of the
process which generated the training sample p̃(x, y)
• The building blocks of this model will be a set of
statistics of the training sample
– The frequency that in translated to either dans or en
was 3/10
– The frequency that in translated to either dans or au
cours de was ½
– And so on
Statistics of the training sample: p̃(x, y)
Features and constraints
• Conditioning information x
– E.g., in the training sample, if April is the word
following in, then the translation of in is en with
frequency 9/10
• Indicator function
f(x, y) = 1 if y = en and April follows in; 0 otherwise
• Expected value of f
p̃(f) = ∑_{x,y} p̃(x, y) f(x, y)        (1)
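A minimal Python sketch of this feature function and its empirical expectation (1); the toy distribution p_tilde below is invented for illustration.

```python
# Indicator feature: 1 if the translation is "en" and the surrounding
# context x is "April", else 0 (a hypothetical simplification of the context).
def f(x, y):
    return 1 if x == "April" and y == "en" else 0

# A toy empirical distribution p~(x, y) (invented values that sum to 1).
p_tilde = {("April", "en"): 0.4, ("the", "dans"): 0.4, ("weeks", "pendant"): 0.2}

# Equation (1): p~(f) = sum over (x, y) of p~(x, y) * f(x, y)
p_tilde_f = sum(p_xy * f(x, y) for (x, y), p_xy in p_tilde.items())
print(p_tilde_f)  # 0.4
```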
Features and constraints
• We can express any statistic of the sample as the
expected value of an appropriate binary-valued
indicator function f
– We call such a function a feature function or feature for
short
Features and constraints
• When we discover a statistic that we feel is useful,
we can acknowledge its importance by requiring
that our model accord with it
• We do this by constraining the expected value that
the model assigns to the corresponding feature
function f
• The expected value of f with respect to the model
p(y | x) is
p f    ~
p  x  p  y | x  f  x, y 
(2)
x, y
where ~
px is the empirical distributi on of x in the training sample
Features and constraints
• We constrain this expected value to be the same as
the expected value of f in the training sample. That
is, we require
p f   ~
p f 
(3)
– We call the requirement (3) a constraint equation or
simply a constraint
• Combining (1), (2) and (3) yields
∑_{x,y} p̃(x) p(y|x) f(x, y) = ∑_{x,y} p̃(x, y) f(x, y)
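The following Python sketch puts (1), (2) and the constraint check (3) side by side for a deliberately simple candidate model (uniform over the five translations); all numbers are invented, and the feature is the hypothetical April/en indicator from above.

```python
# Toy empirical joint p~(x, y), its marginal p~(x), and a candidate model p(y | x).
Y = ["dans", "en", "a", "au cours de", "pendant"]
p_tilde_xy = {("April", "en"): 0.4, ("the", "dans"): 0.4, ("weeks", "pendant"): 0.2}
p_tilde_x = {"April": 0.4, "the": 0.4, "weeks": 0.2}

def p_model(y, x):
    return 1.0 / len(Y)  # a deliberately naive candidate: uniform over the five translations

def f(x, y):
    return 1 if x == "April" and y == "en" else 0

# Equation (1): empirical expectation of f.
emp = sum(p * f(x, y) for (x, y), p in p_tilde_xy.items())

# Equation (2): expectation of f under the model.
mod = sum(p_tilde_x[x] * p_model(y, x) * f(x, y) for x in p_tilde_x for y in Y)

# Constraint (3) demands that these two numbers be equal; here they are not,
# so the uniform candidate would be ruled out by this constraint.
print(emp, mod)  # 0.4 vs 0.08
```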
Features and constraints
• To sum up so far, we now have
– A means of representing statistical phenomena inherent
in a sample of data (namely, p̃(f))
– A means of requiring that our model of the process
exhibit these phenomena (namely, p(f) = p̃(f))
• Feature:
– Is a binary-valued function of (x, y)
• Constraint
– Is an equation between the expected value of the feature
function in the model and its expected value in the
training data
The maxent principle
• Suppose that we are given n feature functions fi,
which determine statistics we feel are important in
modeling the process. We would like our model to
accord with these statistics
• That is, we would like p to lie in the subset C of P
defined by
C ≡ { p ∈ P | p(f_i) = p̃(f_i) for i = 1, 2, …, n }        (4)
Figure 1: (a) the space P of all probability models; (b) one linear constraint C1; (c) two consistent (intersecting) constraints C1 and C2; (d) two inconsistent constraints C1 and C2.
• If we impose no constraints (a), then all probability models are
allowable
• Imposing one linear constraint C1 (b) restricts us to those p ∈ P which
lie in the region defined by C1
• A second linear constraint could determine p exactly (c), if the two
constraints are satisfiable; where the intersection of C1 and C2 is
non-empty, p ∈ C1 ∩ C2
• Alternatively, a second linear constraint could be inconsistent with
the first (d) (i.e., C1 ∩ C2 = ∅); no p ∈ P can satisfy them both
The maxent principle
• In the present setting, however, the linear
constraints are extracted from the training sample
and cannot, by construction, be inconsistent
• Furthermore, the linear constraints in our
applications will not even come close to
determining pP uniquely as they do in (c);
instead, the set C = C1  C2  …  Cn of allowable
models will be infinite
The maxent principle
• Among the models pC, the maximum entropy
philosophy dictates that we select the distribution
which is most uniform
• A mathematical measure of the uniformity of a
conditional distribution p(y|x) is provided by the
conditional entropy
H  p    ~
p  x  p y | x  log p y | x 
x, y
(5)
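A small Python sketch of (5), using an invented empirical marginal p̃(x) and the uniform candidate model; for a uniform model over five outputs the conditional entropy equals log 5, the largest value attainable.

```python
import math

Y = ["dans", "en", "a", "au cours de", "pendant"]
p_tilde_x = {"April": 0.5, "the": 0.5}  # invented empirical marginal p~(x)

def p_model(y, x):
    return 1.0 / len(Y)  # the maximally uniform candidate

# Equation (5): H(p) = - sum_{x,y} p~(x) p(y|x) log p(y|x)
H = -sum(p_tilde_x[x] * p_model(y, x) * math.log(p_model(y, x))
         for x in p_tilde_x for y in Y)
print(H)  # log 5 ~= 1.609, the maximum for a conditional model over five outputs
```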
The maxent principle
• The principle of maximum entropy
– To select a model from a set C of allowed probability
distributions, choose the model p★ ∈ C with maximum
entropy H(p):
p★ = argmax_{p ∈ C} H(p)        (6)
Exponential form
• The maximum entropy principle presents us with a
problem in constrained optimization: find the
p★C which maximizes H(p)
• Find
p *  arg max H  p 
pC


~
 arg max    p x  p y | x  log p y | x 
pC
 x, y

(7)
Exponential form
• We refer to this as the primal problem; it is a
succinct way of saying that we seek to maximize
H(p) subject to the following constraints:
– 1. p y | x  0 for all x, y.
– 2.
 p y | x   1
y
for all x.
• This and the previous condition guarantee that p is a
conditional probability distribution
– 3.
~
~






p
x
p
y
|
x
f
x
,
y

x, y
 x , y p  x, y  f  x, y 
for i  1,2,..., n.
• In other words, p C, and so satisfies the active constraints C
Exponential form
• To solve this optimization problem, introduce the
Lagrangian
Λ(p, Λ, γ) ≡ − ∑_{x,y} p̃(x) p(y|x) log p(y|x)
            + ∑_i λ_i ( ∑_{x,y} p̃(x) p(y|x) f_i(x, y) − ∑_{x,y} p̃(x, y) f_i(x, y) )
            + γ ( ∑_y p(y|x) − 1 )        (8)
Exponential form
• Taking the derivative of the Lagrangian with respect to p(y|x) and setting it to zero:
∂Λ / ∂p(y|x) = − p̃(x) ( 1 + log p(y|x) ) + ∑_i λ_i p̃(x) f_i(x, y) + γ        (9)
− p̃(x) ( 1 + log p(y|x) ) + ∑_i λ_i p̃(x) f_i(x, y) + γ = 0
p̃(x) ( 1 + log p(y|x) ) = ∑_i λ_i p̃(x) f_i(x, y) + γ
log p(y|x) = ∑_i λ_i f_i(x, y) + γ / p̃(x) − 1
p(y|x) = exp( ∑_i λ_i f_i(x, y) ) · exp( γ / p̃(x) − 1 )        (10)
Exponential form
• We have thus found the parametric form of p★, and so we
now take up the task of solving for the optimal values Λ★, γ★.
• Recognizing that the second factor in this equation is the
factor corresponding to the second of the constraints listed
above, we can rewrite (10) as
p(y|x) = (1 / Z(x)) exp( ∑_i λ_i f_i(x, y) )        (11)
where Z(x), the normalizing factor, is given by
Z(x) = ∑_y exp( ∑_i λ_i f_i(x, y) )        (12)
Proof of (12):
The second constraint requires that ∑_y p(y|x) = 1 for every x. Substituting (10):
∑_y p(y|x) = ∑_y exp( ∑_i λ_i f_i(x, y) ) · exp( γ / p̃(x) − 1 ) = 1
⇒ exp( γ / p̃(x) − 1 ) = 1 / ∑_y exp( ∑_i λ_i f_i(x, y) ) = 1 / Z(x)
⇒ Z(x) = ∑_y exp( ∑_i λ_i f_i(x, y) )
Exponential form
• We have found ★ but not yet ★. Towards this end we
introduce some further notation. Define the dual function
() as

    p , ,  

(13)
and the dual optimization problem as
Find   arg max   

(14)
• Since p★ and ★ are fixed, the righthand side of (14) has
only the free variables ={1, 2,…, n}.
Exponential form
• Final result
– The maximum entropy model subject to the constraints
C has the parametric form p★ of (11), where Λ★ can be
determined by maximizing the dual function Ψ(Λ)
Maximum likelihood
The log-likelihood L_p̃(p) of the empirical distribution p̃
as predicted by a model p is defined by
L_p̃(p) ≡ log ∏_{x,y} p(y|x)^p̃(x,y) = ∑_{x,y} p̃(x, y) log p(y|x)        (15)
It is easy to check that the dual function Ψ(Λ) of the previous
section is, in fact, just the log-likelihood for the exponential
model p_Λ; that is
Ψ(Λ) = L_p̃(p_Λ)        (16)
where p_Λ has the parametric form of (11). With this interpretation,
the result of the previous section can be rephrased as:
The model p★ ∈ C with maximum entropy is the model in the
parametric family p_Λ(y|x) that maximizes the likelihood of the
training sample p̃.
Maximum likelihood
Expanding the log-likelihood (15) for the exponential model (11), and writing the empirical distribution as p̃(x, y) = p̃(x) p̃(y|x):
L_p̃(p_Λ) = ∑_{x,y} p̃(x) p̃(y|x) log p_Λ(y|x)
= ∑_{x,y} p̃(x) p̃(y|x) ( ∑_i λ_i f_i(x, y) − log Z(x) )
= ∑_{x,y} p̃(x) p̃(y|x) ∑_i λ_i f_i(x, y) − ∑_x p̃(x) log Z(x)
= ∑_i λ_i ∑_{x,y} p̃(x, y) f_i(x, y) − ∑_x p̃(x) log Z(x)
= ∑_i λ_i p̃(f_i) − ∑_x p̃(x) log Z(x)
Evaluating the dual function Ψ(Λ) = Λ(p★, Λ, γ★) of (8) and (13) yields the same expression, which establishes (16): Ψ(Λ) = L_p̃(p_Λ).
Outline (Maxent Modeling summary)
• We began by seeking the conditional distribution p(y|x)
which had maximal entropy H(p) subject to a set of linear
constraints (7)
• Following the traditional procedure in constrained
optimization, we introduced the Lagrangian Λ(p, Λ, γ),
where Λ, γ are a set of Lagrange multipliers for the
constraints we imposed on p(y|x)
• To find the solution to the optimization problem, we
appealed to the Kuhn-Tucker theorem, which states that we
can (1) first solve Λ(p, Λ, γ) for p to get a parametric form
for p★ in terms of Λ, γ; (2) then plug p★ back into
Λ(p, Λ, γ), this time solving for Λ★, γ★.
Outline (Maxent Modeling summary)
• The parametric form for p★ turns out to have the
exponential form (11)
• The ★ gives rise to the normalizing factor Z(x), given in
(12)
• The ★ will be solved for numerically using the dual
function (14). Furthermore, it so happens that this function,
(), is the log-likelihood for the exponential model p
(11). So what started as the maximization of entropy
subject to a set of linear constraints turns out to be
equivalent to the unconstrained maximization of likelihood
of a certain parametric family of distributions.
Outline (Maxent Modeling summary)
• Table 1 summarizes the primal-dual framework
                 Primal                       Dual
problem          argmax_{p ∈ C} H(p)          argmax_Λ Ψ(Λ)
description      maximum entropy              maximum likelihood
type of search   constrained optimization     unconstrained optimization
search domain    p ∈ C                        real-valued vectors {λ_1, λ_2, …}
solution         p★                           Λ★

Kuhn-Tucker theorem: p★ = p_Λ★
Computing the parameters
Algorithm 1  Improved Iterative Scaling
Input: feature functions f_1, f_2, …, f_n; empirical distribution p̃(x, y)
Output: optimal parameter values λ_i★; optimal model p★
1. Start with λ_i = 0 for all i ∈ {1, 2, …, n}
2. Do for each i ∈ {1, 2, …, n}:
   a. Let Δλ_i be the solution to
      ∑_{x,y} p̃(x) p(y|x) f_i(x, y) exp( Δλ_i f#(x, y) ) = p̃(f_i)        (18)
      where f#(x, y) ≡ ∑_{i=1}^{n} f_i(x, y)        (19)
   b. Update the value of λ_i according to: λ_i ← λ_i + Δλ_i
3. Go to step 2 if not all the λ_i have converged
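A compact Python sketch of Algorithm 1; the toy data and the two binary features are invented, and the one-dimensional equation (18) is solved for each Δλ_i by simple bisection, which is one reasonable choice (a Newton step is also common).

```python
import math

# A minimal sketch of Improved Iterative Scaling (Algorithm 1), on invented toy data.
Y = ["dans", "en", "a", "au cours de", "pendant"]
p_tilde_xy = {("April", "en"): 0.3, ("April", "dans"): 0.1,
              ("the", "dans"): 0.4, ("weeks", "pendant"): 0.2}
p_tilde_x = {"April": 0.4, "the": 0.4, "weeks": 0.2}
features = [lambda x, y: 1 if x == "April" and y == "en" else 0,
            lambda x, y: 1 if y == "dans" else 0]

def p_model(y, x, lambdas):
    # Exponential form (11)-(12).
    score = {yy: math.exp(sum(l * f(x, yy) for l, f in zip(lambdas, features))) for yy in Y}
    return score[y] / sum(score.values())

def f_hash(x, y):
    # Equation (19): f#(x, y), the total number of features firing on (x, y).
    return sum(f(x, y) for f in features)

# Empirical expectations p~(f_i), as in equation (1).
emp = [sum(p * f(x, y) for (x, y), p in p_tilde_xy.items()) for f in features]

lambdas = [0.0] * len(features)                    # step 1
for _ in range(50):                                # step 3: repeat until converged
    for i, f in enumerate(features):               # step 2
        def g(delta):                              # left side of (18) minus p~(f_i)
            return sum(p_tilde_x[x] * p_model(y, x, lambdas) * f(x, y)
                       * math.exp(delta * f_hash(x, y))
                       for x in p_tilde_x for y in Y) - emp[i]
        lo, hi = -10.0, 10.0                       # bisection: g is increasing in delta
        for _ in range(60):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
        lambdas[i] += (lo + hi) / 2                # step 2b: lambda_i <- lambda_i + delta

print(lambdas)  # approaches the weights of the maxent model satisfying both constraints
```

When f#(x, y) is the same constant M for every (x, y), equation (18) has the closed-form solution Δλ_i = (1/M) log( p̃(f_i) / p(f_i) ), which coincides with the Generalized Iterative Scaling update.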