A brief maximum entropy tutorial
Overview
• Statistical modeling addresses the problem of
modeling the behavior of a random process
• In constructing this model, we typically have at
our disposal a sample of output from the process.
From the sample, which constitutes an incomplete
state of knowledge about the process, the
modeling problem is to parlay this knowledge into
a succinct, accurate representation of the process
• We can then use this representation to make
predictions of the future behavior of the process
Motivating example
• Suppose we wish to model an expert translator’s
decisions concerning the proper French rendering
of the English word in.
• A model p of the expert’s decisions assigns to each
French word or phrase f an estimate, p(f), of the
probability that the expert would choose f as a
translation of in.
• Develop p – collect a large sample of instances of
the expert’s decisions
Motivating example
• Our goal is to
– Extract a set of facts about the decision-making
process from the sample (the first task of
modeling)
– Construct a model of this process (the second
task)
Motivating example
• One obvious clue we might glean from the sample
is the list of allowed translations
– in → {dans, en, à, au cours de, pendant}
• With this information in hand, we can impose our
first constraint on our model p:
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
This equation represents our first statistic of the
process; we can now proceed to search for a
suitable model which obeys this equation
– There are an infinite number of models p for which this
identity holds
Motivating example
• One model which satisfies the above equation is
p(dans)=1; in other words, the model always
predicts dans.
• Another model which obeys this constraint
predicts pendant with a probability of ½, and à
with a probability of ½.
• But both of these models offend our sensibilities:
knowing only that the expert always chose from
among these five French phrases, how can we
justify either of these probability distributions?
Motivating example
• Knowing only that the expert chose exclusively from
among these five French phrases, the most intuitively
appealing model is
p(dans) = 1/5
p(en) = 1/5
p(à) = 1/5
p(au cours de) = 1/5
p(pendant) = 1/5
This model, which allocates the total probability evenly among
the five possible phrases, is the most uniform model subject to
our knowledge
It is not, however, the most uniform overall; that model would
grant an equal probability to every possible French phrase.
Motivating example
• We might hope to glean more clues about the expert’s
decisions from our sample.
• Suppose we notice that the expert chose either dans or en
30% of the time
p(dans) + p(en) = 3/10
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
Once again there are many probability distributions
consistent with these two constraints.
• In the absence of any other knowledge, a reasonable choice for p
is again the most uniform – that is, the distribution which
allocates its probability as evenly as possible, subject to the constraints:
p(dans) = 3/20
p(en) = 3/20
p(à) = 7/30
p(au cours de) = 7/30
p(pendant) = 7/30
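(A quick check of the arithmetic behind these values: the first constraint fixes the combined mass of dans and en at 3/10, which uniformity splits evenly into 3/20 each; the remaining 1 − 3/10 = 7/10 is then split evenly over the three remaining phrases, giving 7/30 each.)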
Motivating example
• Say we inspect the data once more, and this time notice
another interesting fact: in half the cases, the expert chose
either dans or à. We can incorporate this information into
our model as a third constraint:
p(dans) + p(en) = 3/10
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
p(dans) + p(à) = 1/2
• We can once again look for the most uniform p satisfying
these constraints, but now the choice is not as obvious.
Motivating example
• As we have added complexity, we have
encountered two problems:
– First, what exactly is meant by “uniform,” and how can
one measure the uniformity of a model?
– Second, having determined a suitable answer to these
questions, how does one find the most uniform model
subject to a set of constraints like those we have
described?
Motivating example
• The maximum entropy method answers both these
questions.
• Intuitively, the principle is simple:
– model all that is known and assume nothing about that
which is unknown
– In other words, given a collection of facts, choose a
model which is consistent with all the facts, but
otherwise as uniform as possible.
• This is precisely the approach we took in selecting
our model p at each step in the above example
Maxent Modeling
• Consider a random process which produces an
output value y, a member of a finite set Y.
– y may be any word in the set {dans, en, à, au cours de,
pendant}
• In generating y, the process may be influenced by
some contextual information x, a member of a
finite set X.
– x could include the words in the English sentence
surrounding in
• The task is to construct a stochastic model that accurately
represents the behavior of the random process
– Given a context x, estimate the probability that the process will output y.
Training data
• Collect a large number of samples (x1, y1), (x2, y2),…,
(xN, yN)
– Each sample would consist of a phrase x containing the
words surrounding in, together with the translation y of
in which the process produced
p̃(x, y) ≡ (1/N) × (number of times that (x, y) occurs in the sample)
• Typically, a particular pair (x, y) will either not occur
at all in the sample, or will occur at most a few times.
– smoothing
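As an illustration of how p̃(x, y) is computed, here is a minimal Python sketch; the sample pairs are invented stand-ins for (context, translation) data, not the tutorial's actual corpus.

```python
from collections import Counter

# Hypothetical (x, y) samples: x stands in for the surrounding context,
# y for the translation the expert chose.
sample = [("April", "en"), ("April", "en"), ("the", "dans"),
          ("the", "dans"), ("weeks", "pendant")]
N = len(sample)

# Empirical joint distribution: p~(x, y) = count(x, y) / N
counts = Counter(sample)
p_tilde = {pair: c / N for pair, c in counts.items()}

print(p_tilde[("April", "en")])  # 2 occurrences out of 5 samples -> 0.4
```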
Features and constraints
• The goal is to construct a statistical model of the
process which generated the training sample p̃(x, y)
• The building blocks of this model will be a set of
statistics of the training sample
– The frequency that in translated to either dans or en
was 3/10
– The frequency that in translated to either dans or au
cours de was ½
– And so on
Statistics of the training sample: p̃(x, y)
Features and constraints
• Conditioning information x
– E.g., in the training sample, if April is the word
following in, then the translation of in is en with
frequency 9/10
• Indicator function
f(x, y) = 1 if y = en and April follows in; 0 otherwise
• Expected value of f
p̃(f) = ∑_{x,y} p̃(x, y) f(x, y)        (1)
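A minimal Python sketch of this feature function and its empirical expectation (1); the toy distribution p_tilde below is invented for illustration.

```python
# Indicator feature: 1 if the translation is "en" and the surrounding
# context x is "April", else 0 (a hypothetical simplification of the context).
def f(x, y):
    return 1 if x == "April" and y == "en" else 0

# A toy empirical distribution p~(x, y) (invented values that sum to 1).
p_tilde = {("April", "en"): 0.4, ("the", "dans"): 0.4, ("weeks", "pendant"): 0.2}

# Equation (1): p~(f) = sum over (x, y) of p~(x, y) * f(x, y)
p_tilde_f = sum(p_xy * f(x, y) for (x, y), p_xy in p_tilde.items())
print(p_tilde_f)  # 0.4
```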
Features and constraints
• We can express any statistic of the sample as the
expected value of an appropriate binary-valued
indicator function f
– We call such a function a feature function or feature for
short
Features and constraints
• When we discover a statistic that we feel is useful,
we can acknowledge its importance by requiring
that our model accord with it
• We do this by constraining the expected value that
the model assigns to the corresponding feature
function f
• The expected value of f with respect to the model
p(y | x) is
p f    ~
p  x  p  y | x  f  x, y 
(2)
x, y
where ~
px is the empirical distributi on of x in the training sample
Features and constraints
• We constrain this expected value to be the same as
the expected value of f in the training sample. That
is, we require
p f   ~
p f 
(3)
– We call the requirement (3) a constraint equation or
simply a constraint
• Combining (1), (2) and (3) yields
∑_{x,y} p̃(x) p(y|x) f(x, y) = ∑_{x,y} p̃(x, y) f(x, y)
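The following Python sketch puts (1), (2) and the constraint check (3) side by side for a deliberately simple candidate model (uniform over the five translations); all numbers are invented, and the feature is the hypothetical April/en indicator from above.

```python
# Toy empirical joint p~(x, y), its marginal p~(x), and a candidate model p(y | x).
Y = ["dans", "en", "a", "au cours de", "pendant"]
p_tilde_xy = {("April", "en"): 0.4, ("the", "dans"): 0.4, ("weeks", "pendant"): 0.2}
p_tilde_x = {"April": 0.4, "the": 0.4, "weeks": 0.2}

def p_model(y, x):
    return 1.0 / len(Y)  # a deliberately naive candidate: uniform over the five translations

def f(x, y):
    return 1 if x == "April" and y == "en" else 0

# Equation (1): empirical expectation of f.
emp = sum(p * f(x, y) for (x, y), p in p_tilde_xy.items())

# Equation (2): expectation of f under the model.
mod = sum(p_tilde_x[x] * p_model(y, x) * f(x, y) for x in p_tilde_x for y in Y)

# Constraint (3) demands that these two numbers be equal; here they are not,
# so the uniform candidate would be ruled out by this constraint.
print(emp, mod)  # 0.4 vs 0.08
```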
Features and constraints
• To sum up so far, we now have
– A means of representing statistical phenomena inherent
in a sample of data (namely, p̃(f))
– A means of requiring that our model of the process
exhibit these phenomena (namely, p(f) = p̃(f))
• Feature:
– Is a binary-valued function of (x, y)
• Constraint
– Is an equation between the expected value of the feature
function in the model and its expected value in the
training data
The maxent principle
• Suppose that we are given n feature functions fi,
which determine statistics we feel are important in
modeling the process. We would like our model to
accord with these statistics
• That is, we would like p to lie in the subset C of P
defined by
C ≡ { p ∈ P | p(f_i) = p̃(f_i) for i = 1, 2, …, n }        (4)
Figure 1: (a) the space P of all probability models; (b) one linear constraint C1; (c) two consistent (intersecting) constraints C1 and C2; (d) two inconsistent constraints C1 and C2.
• If we impose no constraints (a), then all probability models are
allowable
• Imposing one linear constraint C1 (b) restricts us to those p ∈ P which
lie in the region defined by C1
• A second linear constraint could determine p exactly (c), if the two
constraints are satisfiable; where the intersection of C1 and C2 is
non-empty, p ∈ C1 ∩ C2
• Alternatively, a second linear constraint could be inconsistent with
the first (d) (i.e., C1 ∩ C2 = ∅); no p ∈ P can satisfy them both
The maxent principle
• In the present setting, however, the linear
constraints are extracted from the training sample
and cannot, by construction, be inconsistent
• Furthermore, the linear constraints in our
applications will not even come close to
determining pP uniquely as they do in (c);
instead, the set C = C1  C2  …  Cn of allowable
models will be infinite
The maxent principle
• Among the models pC, the maximum entropy
philosophy dictates that we select the distribution
which is most uniform
• A mathematical measure of the uniformity of a
conditional distribution p(y|x) is provided by the
conditional entropy
H  p    ~
p  x  p y | x  log p y | x 
x, y
(5)
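A small Python sketch of (5), using an invented empirical marginal p̃(x) and the uniform candidate model; for a uniform model over five outputs the conditional entropy equals log 5, the largest value attainable.

```python
import math

Y = ["dans", "en", "a", "au cours de", "pendant"]
p_tilde_x = {"April": 0.5, "the": 0.5}  # invented empirical marginal p~(x)

def p_model(y, x):
    return 1.0 / len(Y)  # the maximally uniform candidate

# Equation (5): H(p) = - sum_{x,y} p~(x) p(y|x) log p(y|x)
H = -sum(p_tilde_x[x] * p_model(y, x) * math.log(p_model(y, x))
         for x in p_tilde_x for y in Y)
print(H)  # log 5 ~= 1.609, the maximum for a conditional model over five outputs
```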
The maxent principle
• The principle of maximum entropy
– To select a model from a set C of allowed probability
distributions, choose the model p★ ∈ C with maximum
entropy H(p):
p★ = argmax_{p ∈ C} H(p)        (6)
Exponential form
• The maximum entropy principle presents us with a
problem in constrained optimization: find the
p★C which maximizes H(p)
• Find
p *  arg max H  p 
pC


~
 arg max    p x  p y | x  log p y | x 
pC
 x, y

(7)
Exponential form
• We refer to this as the primal problem; it is a
succinct way of saying that we seek to maximize
H(p) subject to the following constraints:
– 1. p y | x  0 for all x, y.
– 2.
 p y | x   1
y
for all x.
• This and the previous condition guarantee that p is a
conditional probability distribution
– 3.
~
~






p
x
p
y
|
x
f
x
,
y

x, y
 x , y p  x, y  f  x, y 
for i  1,2,..., n.
• In other words, p C, and so satisfies the active constraints C
Exponential form
• To solve this optimization problem, introduce the
Lagrangian
Λ(p, Λ, γ) ≡ − ∑_{x,y} p̃(x) p(y|x) log p(y|x)
            + ∑_i λ_i ( ∑_{x,y} p̃(x) p(y|x) f_i(x, y) − ∑_{x,y} p̃(x, y) f_i(x, y) )
            + γ ( ∑_y p(y|x) − 1 )        (8)
Exponential form
• Taking the derivative of the Lagrangian with respect to p(y|x) and setting it to zero:
∂Λ / ∂p(y|x) = − p̃(x) ( 1 + log p(y|x) ) + ∑_i λ_i p̃(x) f_i(x, y) + γ        (9)
− p̃(x) ( 1 + log p(y|x) ) + ∑_i λ_i p̃(x) f_i(x, y) + γ = 0
p̃(x) ( 1 + log p(y|x) ) = ∑_i λ_i p̃(x) f_i(x, y) + γ
log p(y|x) = ∑_i λ_i f_i(x, y) + γ / p̃(x) − 1
p(y|x) = exp( ∑_i λ_i f_i(x, y) ) · exp( γ / p̃(x) − 1 )        (10)
Exponential form
• We have thus found the parametric form of p★, and so we
now take up the task of solving for the optimal values Λ★, γ★.
• Recognizing that the second factor in this equation is the
factor corresponding to the second of the constraints listed
above, we can rewrite (10) as
p(y|x) = (1 / Z(x)) exp( ∑_i λ_i f_i(x, y) )        (11)
where Z(x), the normalizing factor, is given by
Z(x) = ∑_y exp( ∑_i λ_i f_i(x, y) )        (12)
Proof of (12):
The second constraint requires that ∑_y p(y|x) = 1 for every x. Substituting (10):
∑_y p(y|x) = ∑_y exp( ∑_i λ_i f_i(x, y) ) · exp( γ / p̃(x) − 1 ) = 1
⇒ exp( γ / p̃(x) − 1 ) = 1 / ∑_y exp( ∑_i λ_i f_i(x, y) ) = 1 / Z(x)
⇒ Z(x) = ∑_y exp( ∑_i λ_i f_i(x, y) )
Exponential form
• We have found ★ but not yet ★. Towards this end we
introduce some further notation. Define the dual function
() as

    p , ,  

(13)
and the dual optimization problem as
Find   arg max   

(14)
• Since p★ and ★ are fixed, the righthand side of (14) has
only the free variables ={1, 2,…, n}.
Exponential form
• Final result
– The maximum entropy model subject to the constraints
C has the parametric form p★ of (11), where Λ★ can be
determined by maximizing the dual function Ψ(Λ)
Maximum likelihood
The log-likelihood L_p̃(p) of the empirical distribution p̃
as predicted by a model p is defined by
L_p̃(p) ≡ log ∏_{x,y} p(y|x)^p̃(x,y) = ∑_{x,y} p̃(x, y) log p(y|x)        (15)
It is easy to check that the dual function Ψ(Λ) of the previous
section is, in fact, just the log-likelihood for the exponential
model p_Λ; that is
Ψ(Λ) = L_p̃(p_Λ)        (16)
where p_Λ has the parametric form of (11). With this interpretation,
the result of the previous section can be rephrased as:
The model p★ ∈ C with maximum entropy is the model in the
parametric family p_Λ(y|x) that maximizes the likelihood of the
training sample p̃.
Maximum likelihood
Expanding the log-likelihood (15) for the exponential model (11), and writing the empirical distribution as p̃(x, y) = p̃(x) p̃(y|x):
L_p̃(p_Λ) = ∑_{x,y} p̃(x) p̃(y|x) log p_Λ(y|x)
= ∑_{x,y} p̃(x) p̃(y|x) ( ∑_i λ_i f_i(x, y) − log Z(x) )
= ∑_{x,y} p̃(x) p̃(y|x) ∑_i λ_i f_i(x, y) − ∑_x p̃(x) log Z(x)
= ∑_i λ_i ∑_{x,y} p̃(x, y) f_i(x, y) − ∑_x p̃(x) log Z(x)
= ∑_i λ_i p̃(f_i) − ∑_x p̃(x) log Z(x)
Evaluating the dual function Ψ(Λ) = Λ(p★, Λ, γ★) of (8) and (13) yields the same expression, which establishes (16): Ψ(Λ) = L_p̃(p_Λ).
Outline (Maxent Modeling summary)
• We began by seeking the conditional distribution p(y|x)
which had maximal entropy H(p) subject to a set of linear
constraints (7)
• Following the traditional procedure in constrained
optimization, we introduced the Lagrangian Λ(p, Λ, γ),
where Λ, γ are a set of Lagrange multipliers for the
constraints we imposed on p(y|x)
• To find the solution to the optimization problem, we
appealed to the Kuhn-Tucker theorem, which states that we
can (1) first solve Λ(p, Λ, γ) for p to get a parametric form
for p★ in terms of Λ, γ; (2) then plug p★ back into
Λ(p, Λ, γ), this time solving for Λ★, γ★.
Outline (Maxent Modeling summary)
• The parametric form for p★ turns out to have the
exponential form (11)
• The ★ gives rise to the normalizing factor Z(x), given in
(12)
• The ★ will be solved for numerically using the dual
function (14). Furthermore, it so happens that this function,
(), is the log-likelihood for the exponential model p
(11). So what started as the maximization of entropy
subject to a set of linear constraints turns out to be
equivalent to the unconstrained maximization of likelihood
of a certain parametric family of distributions.
Outline (Maxent Modeling summary)
• Table 1 summarizes the primal-dual framework
                 Primal                       Dual
problem          argmax_{p ∈ C} H(p)          argmax_Λ Ψ(Λ)
description      maximum entropy              maximum likelihood
type of search   constrained optimization     unconstrained optimization
search domain    p ∈ C                        real-valued vectors {λ_1, λ_2, …}
solution         p★                           Λ★

Kuhn-Tucker theorem: p★ = p_Λ★
Computing the parameters
Algorithm 1  Improved Iterative Scaling
Input: feature functions f_1, f_2, …, f_n; empirical distribution p̃(x, y)
Output: optimal parameter values λ_i★; optimal model p★
1. Start with λ_i = 0 for all i ∈ {1, 2, …, n}
2. Do for each i ∈ {1, 2, …, n}:
   a. Let Δλ_i be the solution to
      ∑_{x,y} p̃(x) p(y|x) f_i(x, y) exp( Δλ_i f#(x, y) ) = p̃(f_i)        (18)
      where f#(x, y) ≡ ∑_{i=1}^{n} f_i(x, y)        (19)
   b. Update the value of λ_i according to: λ_i ← λ_i + Δλ_i
3. Go to step 2 if not all the λ_i have converged
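A compact Python sketch of Algorithm 1; the toy data and the two binary features are invented, and the one-dimensional equation (18) is solved for each Δλ_i by simple bisection, which is one reasonable choice (a Newton step is also common).

```python
import math

# A minimal sketch of Improved Iterative Scaling (Algorithm 1), on invented toy data.
Y = ["dans", "en", "a", "au cours de", "pendant"]
p_tilde_xy = {("April", "en"): 0.3, ("April", "dans"): 0.1,
              ("the", "dans"): 0.4, ("weeks", "pendant"): 0.2}
p_tilde_x = {"April": 0.4, "the": 0.4, "weeks": 0.2}
features = [lambda x, y: 1 if x == "April" and y == "en" else 0,
            lambda x, y: 1 if y == "dans" else 0]

def p_model(y, x, lambdas):
    # Exponential form (11)-(12).
    score = {yy: math.exp(sum(l * f(x, yy) for l, f in zip(lambdas, features))) for yy in Y}
    return score[y] / sum(score.values())

def f_hash(x, y):
    # Equation (19): f#(x, y), the total number of features firing on (x, y).
    return sum(f(x, y) for f in features)

# Empirical expectations p~(f_i), as in equation (1).
emp = [sum(p * f(x, y) for (x, y), p in p_tilde_xy.items()) for f in features]

lambdas = [0.0] * len(features)                    # step 1
for _ in range(50):                                # step 3: repeat until converged
    for i, f in enumerate(features):               # step 2
        def g(delta):                              # left side of (18) minus p~(f_i)
            return sum(p_tilde_x[x] * p_model(y, x, lambdas) * f(x, y)
                       * math.exp(delta * f_hash(x, y))
                       for x in p_tilde_x for y in Y) - emp[i]
        lo, hi = -10.0, 10.0                       # bisection: g is increasing in delta
        for _ in range(60):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
        lambdas[i] += (lo + hi) / 2                # step 2b: lambda_i <- lambda_i + delta

print(lambdas)  # approaches the weights of the maxent model satisfying both constraints
```

When f#(x, y) is the same constant M for every (x, y), equation (18) has the closed-form solution Δλ_i = (1/M) log( p̃(f_i) / p(f_i) ), which coincides with the Generalized Iterative Scaling update.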