# A brief maximum entropy tutorial

## Overview

• Statistical modeling addresses the problem of modeling the behavior of a random process.
• In constructing this model, we typically have at our disposal a sample of output from the process. The sample constitutes an incomplete state of knowledge about the process; the modeling problem is to parlay this knowledge into a succinct, accurate representation of the process.
• We can then use this representation to make predictions of the future behavior of the process.

## Motivating example

• Suppose we wish to model an expert translator's decisions concerning the proper French rendering of the English word *in*.
• A model p of the expert's decisions assigns to each French word or phrase f an estimate, p(f), of the probability that the expert would choose f as a translation of *in*.
• To develop p, we collect a large sample of instances of the expert's decisions.

• Our goal is to
  – extract a set of facts about the decision-making process from the sample (the first task of modeling);
  – construct a model of this process (the second task).

• One obvious clue we might glean from the sample is the list of allowed translations:
  – *in* → {dans, en, à, au cours de, pendant}
• With this information in hand, we can impose our first constraint on our model p:

$$p(\text{dans}) + p(\text{en}) + p(\text{à}) + p(\text{au cours de}) + p(\text{pendant}) = 1$$

• This equation represents our first statistic of the process; we can now proceed to search for a suitable model which obeys this equation.
  – There are an infinite number of models p for which this identity holds.

• One model which satisfies the above equation is p(dans) = 1; in other words, the model always predicts dans.
• Another model which obeys this constraint predicts pendant with a probability of 1/2, and à with a probability of 1/2.
• But both of these models offend our sensibilities: knowing only that the expert always chose from among these five French phrases, how can we justify either of these probability distributions?

• Knowing only that the expert chose exclusively from among these five French phrases, the most intuitively appealing model is

$$\begin{aligned}
p(\text{dans}) &= 1/5 \\
p(\text{en}) &= 1/5 \\
p(\text{à}) &= 1/5 \\
p(\text{au cours de}) &= 1/5 \\
p(\text{pendant}) &= 1/5
\end{aligned}$$

• This model, which allocates the total probability evenly among the five possible phrases, is the most uniform model subject to our knowledge.
• It is not, however, the most uniform overall; that model would grant an equal probability to every possible French phrase.

• We might hope to glean more clues about the expert's decisions from our sample.
• Suppose we notice that the expert chose either dans or en 30% of the time:

$$\begin{aligned}
p(\text{dans}) + p(\text{en}) &= 3/10 \\
p(\text{dans}) + p(\text{en}) + p(\text{à}) + p(\text{au cours de}) + p(\text{pendant}) &= 1
\end{aligned}$$

• Once again there are many probability distributions consistent with these two constraints.
• In the absence of any other knowledge, a reasonable choice for p is again the most uniform, that is, the distribution which allocates its probability as evenly as possible, subject to the constraints:

$$\begin{aligned}
p(\text{dans}) &= 3/20 \\
p(\text{en}) &= 3/20 \\
p(\text{à}) &= 7/30 \\
p(\text{au cours de}) &= 7/30 \\
p(\text{pendant}) &= 7/30
\end{aligned}$$

• Say we inspect the data once more, and this time notice another interesting fact: in half the cases, the expert chose either dans or à. We can incorporate this information into our model as a third constraint:

$$\begin{aligned}
p(\text{dans}) + p(\text{en}) &= 3/10 \\
p(\text{dans}) + p(\text{en}) + p(\text{à}) + p(\text{au cours de}) + p(\text{pendant}) &= 1 \\
p(\text{dans}) + p(\text{à}) &= 1/2
\end{aligned}$$

• We can once again look for the most uniform p satisfying these constraints, but now the choice is not as obvious.

• As we have added complexity, we have encountered two problems:
  – First, what exactly is meant by "uniform," and how can one measure the uniformity of a model?
  – Second, having determined a suitable answer to these questions, how does one find the most uniform model subject to a set of constraints like those we have described?

• The maximum entropy method answers both these questions.
• Intuitively, the principle is simple:
  – Model all that is known and assume nothing about that which is unknown.
  – In other words, given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible.
• This is precisely the approach we took in selecting our model p at each step in the above example.

## Maxent Modeling

• Consider a random process which produces an output value y, a member of a finite set $\mathcal{Y}$.
  – y may be any word in the set {dans, en, à, au cours de, pendant}.
• In generating y, the process may be influenced by some contextual information x, a member of a finite set $\mathcal{X}$.
  – x could include the words in the English sentence surrounding *in*.
• Our task is to construct a stochastic model that accurately represents the behavior of the random process.
  – Such a model estimates the conditional probability $p(y|x)$ that, given a context x, the process will output y.

## Training data

• Collect a large number of samples $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$.
  – Each sample would consist of a phrase x containing the words surrounding *in*, together with the translation y of *in* which the process produced.
• The training sample is summarized by its empirical probability distribution:

$$\tilde{p}(x, y) \equiv \frac{1}{N} \times \text{number of times that } (x, y) \text{ occurs in the sample}$$

• Typically, a particular pair (x, y) will either not occur at all in the sample, or will occur at most a few times.
  – This motivates smoothing.

## Features and constraints

• The goal is to construct a statistical model of the process which generated the training sample $\tilde{p}(x, y)$.
• The building blocks of this model will be a set of statistics of the training sample $\tilde{p}(x, y)$:
  – The frequency with which *in* translated to either dans or en was 3/10.
  – The frequency with which *in* translated to either dans or au cours de was 1/2.
  – And so on.

• Statistics can also depend on the conditioning information x.
  – E.g., in the training sample, if April is the word following *in*, then the translation of *in* is en with frequency 9/10.
• Indicator function:

$$f(x, y) = \begin{cases} 1 & \text{if } y = \text{en and April follows in} \\ 0 & \text{otherwise} \end{cases}$$

• Expected value of f with respect to the empirical distribution $\tilde{p}(x, y)$:

$$\tilde{p}(f) \equiv \sum_{x,y} \tilde{p}(x, y)\, f(x, y) \tag{1}$$

• We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f.
  – We call such a function a feature function, or feature for short.

• When we discover a statistic that we feel is useful, we can acknowledge its importance by requiring that our model accord with it.
• We do this by constraining the expected value that the model assigns to the corresponding feature function f.
• The expected value of f with respect to the model $p(y|x)$ is

$$p(f) \equiv \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f(x, y) \tag{2}$$

  where $\tilde{p}(x)$ is the empirical distribution of x in the training sample.

• We constrain this expected value to be the same as the expected value of f in the training sample.
  That is, we require

$$p(f) = \tilde{p}(f) \tag{3}$$

  – We call the requirement (3) a constraint equation, or simply a constraint.
• Combining (1), (2) and (3) yields

$$\sum_{x,y} \tilde{p}(x)\, p(y|x)\, f(x, y) = \sum_{x,y} \tilde{p}(x, y)\, f(x, y)$$

• To sum up so far, we now have
  – a means of representing statistical phenomena inherent in a sample of data (namely, $\tilde{p}(f)$);
  – a means of requiring that our model of the process exhibit these phenomena (namely, $p(f) = \tilde{p}(f)$).
• Feature:
  – a binary-valued function of (x, y).
• Constraint:
  – an equation between the expected value of the feature function in the model and its expected value in the training data.

## The maxent principle

• Suppose that we are given n feature functions $f_i$, which determine statistics we feel are important in modeling the process. We would like our model to accord with these statistics.
• That is, we would like p to lie in the subset $\mathcal{C}$ of $\mathcal{P}$, the set of all probability models, defined by

$$\mathcal{C} \equiv \{\, p \in \mathcal{P} \mid p(f_i) = \tilde{p}(f_i) \text{ for } i \in \{1, 2, \dots, n\} \,\} \tag{4}$$

[Figure 1: four panels showing the space $\mathcal{P}$ of all probability models: (a) no constraints, (b) one constraint $C_1$, (c) two consistent constraints $C_1$ and $C_2$, (d) two inconsistent constraints $C_1$ and $C_2$.]

• If we impose no constraints, then all probability models are allowable.
• Imposing one linear constraint $C_1$ restricts us to those $p \in \mathcal{P}$ which lie on the region defined by $C_1$.
• A second linear constraint could determine p exactly, if the two constraints are satisfiable, where the intersection of $C_1$ and $C_2$ is non-empty.
  – In this case, $p \in C_1 \cap C_2$.
• Alternatively, a second linear constraint could be inconsistent with the first (i.e., $C_1 \cap C_2 = \emptyset$); no $p \in \mathcal{P}$ can satisfy them both.

• In the present setting, however, the linear constraints are extracted from the training sample and cannot, by construction, be inconsistent.
• Furthermore, the linear constraints in our applications will not even come close to determining $p \in \mathcal{P}$ uniquely as they do in (c); instead, the set $\mathcal{C} = C_1 \cap C_2 \cap \dots \cap C_n$ of allowable models will be infinite.

• Among the models $p \in \mathcal{C}$, the maximum entropy philosophy dictates that we select the distribution which is most uniform.
• A mathematical measure of the uniformity of a conditional distribution $p(y|x)$ is provided by the conditional entropy

$$H(p) \equiv -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x) \tag{5}$$

• The principle of maximum entropy:
  – To select a model from a set $\mathcal{C}$ of allowed probability distributions, choose the model $p_\star \in \mathcal{C}$ with maximum entropy $H(p)$:

$$p_\star \equiv \arg\max_{p \in \mathcal{C}} H(p) \tag{6}$$

## Exponential form

• The maximum entropy principle presents us with a problem in constrained optimization: find the $p_\star \in \mathcal{C}$ which maximizes $H(p)$.
• Find

$$p_\star = \arg\max_{p \in \mathcal{C}} H(p) = \arg\max_{p \in \mathcal{C}} \Big( -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x) \Big) \tag{7}$$

• We refer to this as the primal problem; it is a succinct way of saying that we seek to maximize $H(p)$ subject to the following constraints:
  – 1. $p(y|x) \ge 0$ for all x, y.
  – 2. $\sum_y p(y|x) = 1$ for all x.
    • This and the previous condition guarantee that p is a conditional probability distribution.
  – 3. $\sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_i(x, y) = \sum_{x,y} \tilde{p}(x, y)\, f_i(x, y)$ for $i \in \{1, 2, \dots, n\}$.
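    • In other words, $p \in \mathcal{C}$, and so satisfies the active constraints $\mathcal{C}$.

• To solve this optimization problem, introduce the Lagrangian (written for the equivalent problem of minimizing $-H(p)$):

$$\xi(p, \Lambda, \gamma) \equiv \sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x) + \sum_i \lambda_i \Big( \sum_{x,y} \tilde{p}(x, y)\, f_i(x, y) - \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_i(x, y) \Big) + \gamma \Big( \sum_y p(y|x) - 1 \Big) \tag{8}$$

• Holding $\Lambda$ and $\gamma$ fixed, differentiate with respect to $p(y|x)$:

$$\frac{\partial \xi}{\partial p(y|x)} = \tilde{p}(x) \big( 1 + \log p(y|x) \big) - \sum_i \lambda_i\, \tilde{p}(x)\, f_i(x, y) + \gamma \tag{9}$$

• Setting (9) to zero and solving for $p(y|x)$:

$$\begin{aligned}
\tilde{p}(x) \big( 1 + \log p(y|x) \big) - \sum_i \lambda_i\, \tilde{p}(x)\, f_i(x, y) + \gamma &= 0 \\
\tilde{p}(x) \big( 1 + \log p(y|x) \big) &= \sum_i \lambda_i\, \tilde{p}(x)\, f_i(x, y) - \gamma \\
\log p(y|x) &= \sum_i \lambda_i f_i(x, y) - \frac{\gamma}{\tilde{p}(x)} - 1
\end{aligned}$$

$$p_\star(y|x) = \exp\Big( \sum_i \lambda_i f_i(x, y) \Big) \exp\Big( -\frac{\gamma}{\tilde{p}(x)} - 1 \Big) \tag{10}$$

• We have thus found the parametric form of $p_\star$, and so we now take up the task of solving for the optimal values $\Lambda_\star$, $\gamma_\star$.
• Recognizing that the second factor in (10) is the factor corresponding to the second of the constraints listed above, we can rewrite (10) as

$$p_\star(y|x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big) \tag{11}$$

  where $Z(x)$, the normalizing factor, is given by

$$Z(x) = \sum_y \exp\Big( \sum_i \lambda_i f_i(x, y) \Big) \tag{12}$$

• Proof of (12): by the second constraint, $\sum_y p_\star(y|x) = 1$ for all x, so

$$\begin{aligned}
\exp\Big( -\frac{\gamma_\star}{\tilde{p}(x)} - 1 \Big) \sum_y \exp\Big( \sum_i \lambda_i f_i(x, y) \Big) &= 1 \\
\exp\Big( -\frac{\gamma_\star}{\tilde{p}(x)} - 1 \Big) &= \frac{1}{\sum_y \exp\big( \sum_i \lambda_i f_i(x, y) \big)} = \frac{1}{Z(x)}
\end{aligned}$$

• We have found $\gamma_\star$ but not yet $\Lambda_\star$. Towards this end we introduce some further notation. Define the dual function $\Psi(\Lambda)$ as

$$\Psi(\Lambda) \equiv \xi(p_\star, \Lambda, \gamma_\star) \tag{13}$$

  and the dual optimization problem as

$$\text{Find } \Lambda_\star = \arg\max_{\Lambda} \Psi(\Lambda) \tag{14}$$

• Since $p_\star$ and $\gamma_\star$ are fixed, the right-hand side of (14) has only the free variables $\Lambda = \{\lambda_1, \lambda_2, \dots, \lambda_n\}$.

• Final result
  – The maximum entropy model subject to the constraints $\mathcal{C}$ has the parametric form $p_\star$ of (11), where $\Lambda_\star$ can be determined by maximizing the dual function $\Psi(\Lambda)$.

## Maximum likelihood

• The log-likelihood $L_{\tilde{p}}(p)$ of the empirical distribution $\tilde{p}$ as predicted by a model p is defined by

$$L_{\tilde{p}}(p) \equiv \log \prod_{x,y} p(y|x)^{\tilde{p}(x, y)} = \sum_{x,y} \tilde{p}(x, y) \log p(y|x) \tag{15}$$

• It is easy to check that the dual function $\Psi(\Lambda)$ of the previous section is, in fact, just the log-likelihood for the exponential model p; that is,

$$\Psi(\Lambda) = L_{\tilde{p}}(p) \tag{16}$$

  where p has the parametric form of (11).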
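• With this interpretation, the result of the previous section can be rephrased as: the model $p_\star \in \mathcal{C}$ with maximum entropy is the model in the parametric family $p(y|x)$ that maximizes the likelihood of the training sample $\tilde{p}$.

• To check (16), start from the Lagrangian (8):

$$\xi(p, \Lambda, \gamma) = \sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x) + \sum_i \lambda_i \Big( \sum_{x,y} \tilde{p}(x, y)\, f_i(x, y) - \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_i(x, y) \Big) + \gamma \Big( \sum_y p(y|x) - 1 \Big)$$

• At $p = p_\star$, $\gamma = \gamma_\star$ the normalization term vanishes. Substituting $\log p_\star(y|x) = \sum_i \lambda_i f_i(x, y) - \log Z(x)$ from (11) into the first term, the model expectations $\sum_i \lambda_i\, p_\star(f_i)$ cancel, which leaves an average over the empirical distribution $\tilde{p}(x, y) = \tilde{p}(x)\, \tilde{p}(y|x)$:

$$\begin{aligned}
\xi(p_\star, \Lambda, \gamma_\star) &= \sum_{x,y} \tilde{p}(x)\, \tilde{p}(y|x) \log p_\star(y|x) \\
&= \sum_{x,y} \tilde{p}(x)\, \tilde{p}(y|x) \log \Big[ \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big) \Big] \\
&= \sum_{x,y} \tilde{p}(x)\, \tilde{p}(y|x) \Big[ -\log Z(x) + \sum_i \lambda_i f_i(x, y) \Big] \\
&= -\sum_{x,y} \tilde{p}(x)\, \tilde{p}(y|x) \log Z(x) + \sum_{x,y} \tilde{p}(x)\, \tilde{p}(y|x) \sum_i \lambda_i f_i(x, y) \\
&= -\sum_x \tilde{p}(x) \log Z(x) + \sum_{x,y} \tilde{p}(x, y) \sum_i \lambda_i f_i(x, y) \\
&= -\sum_x \tilde{p}(x) \log Z(x) + \sum_i \lambda_i \sum_{x,y} \tilde{p}(x, y)\, f_i(x, y) \\
&= -\sum_x \tilde{p}(x) \log Z(x) + \sum_i \lambda_i\, \tilde{p}(f_i)
\end{aligned}$$

## Outline (Maxent Modeling summary)

• We began by seeking the conditional distribution $p(y|x)$ which had maximal entropy $H(p)$ subject to a set of linear constraints (7).
• Following the traditional procedure in constrained optimization, we introduced the Lagrangian $\xi(p, \Lambda, \gamma)$, where $\Lambda$, $\gamma$ are a set of Lagrange multipliers for the constraints we imposed on $p(y|x)$.
• To find the solution to the optimization problem, we appealed to the Kuhn-Tucker theorem, which states that we can (1) first solve $\xi(p, \Lambda, \gamma)$ for p to get a parametric form for $p_\star$ in terms of $\Lambda$, $\gamma$; (2) then plug $p_\star$ back in to $\xi(p, \Lambda, \gamma)$, this time solving for $\Lambda_\star$, $\gamma_\star$.

• The parametric form for $p_\star$ turns out to have the exponential form (11).
• The $\gamma_\star$ gives rise to the normalizing factor $Z(x)$, given in (12).
• The $\Lambda_\star$ will be solved for numerically using the dual function (14). Furthermore, it so happens that this function, $\Psi(\Lambda)$, is the log-likelihood for the exponential model p (11). So what started as the maximization of entropy subject to a set of linear constraints turns out to be equivalent to the unconstrained maximization of likelihood of a certain parametric family of distributions.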
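As a minimal sketch of the dual route, the three-constraint translation example from the motivating slides can be solved numerically by maximizing $\Psi(\Lambda)$. The sketch assumes a single context, so that $Z$ does not depend on x and $\Psi(\Lambda) = -\log Z + \sum_i \lambda_i\, \tilde{p}(f_i)$. The names `FEATURES`, `TARGETS`, `model`, `dual` and `fit`, the step size and the iteration count are illustrative assumptions, and plain gradient ascent is used here only for brevity.

```python
import math

PHRASES = ["dans", "en", "à", "au cours de", "pendant"]

# Binary features encoding the two sample statistics of the example:
# f1 fires on {dans, en}, f2 fires on {dans, à}.
FEATURES = [
    lambda y: 1.0 if y in ("dans", "en") else 0.0,
    lambda y: 1.0 if y in ("dans", "à") else 0.0,
]
TARGETS = [3 / 10, 1 / 2]  # empirical expectations p~(f_i) from the example


def score(lambdas, y):
    """Weighted feature sum: sum_i lambda_i * f_i(y)."""
    return sum(l * f(y) for l, f in zip(lambdas, FEATURES))


def model(lambdas):
    """Exponential-form model (single context, so Z is a constant)."""
    weights = [math.exp(score(lambdas, y)) for y in PHRASES]
    z = sum(weights)
    return {y: w / z for y, w in zip(PHRASES, weights)}


def dual(lambdas):
    """Dual function: -log Z + sum_i lambda_i * p~(f_i)."""
    z = sum(math.exp(score(lambdas, y)) for y in PHRASES)
    return -math.log(z) + sum(l * t for l, t in zip(lambdas, TARGETS))


def fit(steps=2000, rate=1.0):
    """Gradient ascent on the dual; d(dual)/d(lambda_i) = p~(f_i) - p(f_i)."""
    lambdas = [0.0] * len(FEATURES)
    for _ in range(steps):
        p = model(lambdas)
        expected = [sum(p[y] * f(y) for y in PHRASES) for f in FEATURES]
        lambdas = [l + rate * (t - e) for l, t, e in zip(lambdas, TARGETS, expected)]
    return lambdas


if __name__ == "__main__":
    fitted = model(fit())
    for phrase, prob in fitted.items():
        print(f"{phrase:12s} {prob:.4f}")
```

Under these assumptions the fitted model puts roughly 0.186 on dans, 0.114 on en, 0.314 on à, and about 0.193 on each of au cours de and pendant, which satisfies all three constraints while treating the two unconstrained phrases identically.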
• Table 1 summarizes the primal-dual framework.

| | Primal | Dual |
|---|---|---|
| problem | $\arg\max_{p \in \mathcal{C}} H(p)$ | $\arg\max_{\Lambda} \Psi(\Lambda)$ |
| description | maximum entropy | maximum likelihood |
| type of search | constrained optimization | unconstrained optimization |
| search domain | $p \in \mathcal{C}$ | real-valued vectors $\{\lambda_1, \lambda_2, \dots\}$ |
| solution | $p_\star$ | $\Lambda_\star$ |

Kuhn-Tucker theorem: $p_\star = p_{\Lambda_\star}$

## Computing the parameters

Algorithm 1: Improved Iterative Scaling

Input: feature functions $f_1, f_2, \dots, f_n$; empirical distribution $\tilde{p}(x, y)$
Output: optimal parameter values $\lambda_{\star i}$; optimal model $p_\star$

1. Start with $\lambda_i = 0$ for all $i \in \{1, 2, \dots, n\}$.
2. Do for each $i \in \{1, 2, \dots, n\}$:
   a. Let $\Delta\lambda_i$ be the solution to

$$\sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_i(x, y) \exp\big( \Delta\lambda_i\, f^{\#}(x, y) \big) = \tilde{p}(f_i) \tag{18}$$

      where

$$f^{\#}(x, y) \equiv \sum_{i=1}^{n} f_i(x, y) \tag{19}$$

   b. Update the value of $\lambda_i$ according to: $\lambda_i \leftarrow \lambda_i + \Delta\lambda_i$.
3. Go to step 2 if not all the $\lambda_i$ have converged.
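The steps of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the toy sample `SAMPLE`, the three features, all function names, the tolerances, and the use of Newton's method to solve (18) are hypothetical choices. It also assumes that all $\Delta\lambda_i$ in one pass are computed from the same current model before any $\lambda_i$ is updated.

```python
import math
from collections import Counter

# Hypothetical toy sample of (context, translation) pairs; not from the slides.
SAMPLE = (
    [("April", "en")] * 9 + [("April", "dans")] * 1
    + [("other", "dans")] * 4 + [("other", "en")] * 2 + [("other", "à")] * 6
    + [("other", "au cours de")] * 4 + [("other", "pendant")] * 4
)
OUTPUTS = ["dans", "en", "à", "au cours de", "pendant"]
FEATURES = [
    lambda x, y: 1.0 if (x == "April" and y == "en") else 0.0,
    lambda x, y: 1.0 if y in ("dans", "en") else 0.0,
    lambda x, y: 1.0 if y in ("dans", "à") else 0.0,
]

N = len(SAMPLE)
P_XY = {pair: c / N for pair, c in Counter(SAMPLE).items()}          # p~(x, y)
P_X = {x: c / N for x, c in Counter(x for x, _ in SAMPLE).items()}   # p~(x)
EMPIRICAL = [sum(p * f(x, y) for (x, y), p in P_XY.items()) for f in FEATURES]  # p~(f_i)


def conditional(lambdas, x):
    """p(y | x) in the exponential form (11)."""
    w = [math.exp(sum(l * f(x, y) for l, f in zip(lambdas, FEATURES))) for y in OUTPUTS]
    z = sum(w)
    return {y: wi / z for y, wi in zip(OUTPUTS, w)}


def f_sharp(x, y):
    """Total feature count f#(x, y) of equation (19)."""
    return sum(f(x, y) for f in FEATURES)


def solve_delta(i, model, newton_steps=50):
    """Solve equation (18) for the change in lambda_i by Newton's method."""
    terms = [
        (P_X[x] * model[x][y] * FEATURES[i](x, y), f_sharp(x, y))
        for x in P_X for y in OUTPUTS if FEATURES[i](x, y) > 0
    ]
    delta = 0.0
    for _ in range(newton_steps):
        g = sum(a * math.exp(delta * c) for a, c in terms) - EMPIRICAL[i]
        dg = sum(a * c * math.exp(delta * c) for a, c in terms)
        step = g / dg
        delta -= step
        if abs(step) < 1e-14:
            break
    return delta


def improved_iterative_scaling(max_iters=10000, tol=1e-10):
    lambdas = [0.0] * len(FEATURES)                                      # step 1
    for _ in range(max_iters):
        model = {x: conditional(lambdas, x) for x in P_X}
        deltas = [solve_delta(i, model) for i in range(len(FEATURES))]   # step 2a
        lambdas = [l + d for l, d in zip(lambdas, deltas)]               # step 2b
        if max(abs(d) for d in deltas) < tol:                            # step 3
            break
    return lambdas


def model_expectation(lambdas, i):
    """p(f_i) of equation (2) under the model with parameters lambdas."""
    return sum(
        P_X[x] * p * FEATURES[i](x, y)
        for x in P_X for y, p in conditional(lambdas, x).items()
    )
```

Calling `improved_iterative_scaling()` and comparing `model_expectation(lambdas, i)` with `EMPIRICAL[i]` for each feature checks that the fitted model meets the constraint equations (3). Newton's method is a convenient choice here because the left-hand side of (18) is a smooth, increasing function of $\Delta\lambda_i$; when $f^{\#}$ is constant, (18) can instead be solved in closed form.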