A Brief Maximum Entropy Tutorial
Presenter: Davidson
Date: 2009/02/04
Original Author: Adam Berger, 1996/07/05
http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html
Outline

- Overview
- Maxent modeling
  - Motivating example
  - Training data
  - Features and constraints
  - The maxent principle
  - Exponential form
  - Maximum likelihood
- Skipped sections and further reading
Overview

- Statistical modeling
  - Models the behavior of a random process
  - Utilizes samples of output data to construct a representation of the process
  - Predicts the future behavior of the process
- Maximum Entropy Models
  - A family of distributions within the class of exponential models for statistical modeling
Motivating example (1/4)

- An English-to-French translator translates the English word "in" into one of 5 French phrases: dans, en, à, au cours de, pendant
- Goal:
  1. Extract a set of facts about the decision-making process
  2. Construct a model of this process

[Figure: the English-to-French translator maps "in" to a French phrase f, chosen with probability P(f).]
Motivating example (2/4)

- The translator always chooses among those 5 French phrases:

  $p(dans) + p(en) + p(à) + p(au\ cours\ de) + p(pendant) = 1$

- The most intuitively appealing model (the most uniform model subject to our knowledge) is:

  $p(dans) = p(en) = p(à) = p(au\ cours\ de) = p(pendant) = 1/5$
Motivating example (3/4)

- If a second clue is discovered (the translator chose either dans or en 30% of the time), then P must satisfy 2 constraints:

  $p(dans) + p(en) = 3/10$
  $p(dans) + p(en) + p(à) + p(au\ cours\ de) + p(pendant) = 1$

- A reasonable choice for P would be (the most uniform one):

  $p(dans) = p(en) = 3/20$
  $p(à) = p(au\ cours\ de) = p(pendant) = 7/30$
Motivating example (4/4)

- What if a third constraint is discovered?

  $p(dans) + p(en) = 3/10$
  $p(dans) + p(en) + p(à) + p(au\ cours\ de) + p(pendant) = 1$
  $p(dans) + p(à) = 1/2$

- The choice of model is no longer as obvious
- Two problems arise when complexity is added:
  - What does "uniform" mean, and how can the uniformity of a model be measured?
  - How can the most uniform model subject to a set of constraints be found?
- One solution: the Maximum Entropy (maxent) model; a numerical sketch of this example follows below
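As a preview of where the tutorial is headed, the snippet below solves this motivating example numerically: it maximizes entropy over the five phrase probabilities subject to the three constraints. This is a minimal sketch, assuming NumPy and SciPy are available; the starting point and variable names are illustrative choices, not part of the original tutorial.

```python
# A minimal sketch (assuming NumPy and SciPy) of finding the most
# uniform, i.e. maximum entropy, distribution over the five French
# phrases subject to the three constraints above.
import numpy as np
from scipy.optimize import minimize

phrases = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    # -H(p) = sum_i p_i log p_i; minimizing this maximizes entropy
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},           # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 3.0 / 10.0},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 1.0 / 2.0},   # p(dans) + p(à) = 1/2
]

result = minimize(neg_entropy, x0=np.full(5, 0.2),
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints)
for phrase, prob in zip(phrases, result.x):
    print(f"p({phrase}) = {prob:.4f}")
```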
Maxent Modeling

- Consider a random process which produces an output value y, a member of a finite set Y
- In generating y, the process may be influenced by some contextual information x, a member of a finite set X
- The task is to construct a stochastic model that accurately represents the behavior of the random process
  - This model estimates the conditional probability that, given a context x, the process will output y
- We denote by P the set of all conditional probability distributions
  - A model p(y|x) is an element of P
Training data

- Training sample: $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$
- The training sample's empirical probability distribution (illustrated in the sketch below):

  $\tilde{p}(x, y) \equiv \frac{1}{N} \times \text{number of times that } (x, y) \text{ occurs in the sample}$
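A minimal sketch of this definition on a hypothetical toy sample of (context, translation) pairs; the data is invented purely for illustration and is reused by the later sketches.

```python
# Empirical distribution p~(x, y) = count(x, y) / N on a toy sample.
from collections import Counter

sample = [("in April", "en"), ("in April", "en"), ("in the box", "dans"),
          ("in time", "à"), ("in April", "en"), ("in the box", "dans")]

N = len(sample)
p_tilde = {pair: count / N for pair, count in Counter(sample).items()}

print(p_tilde[("in April", "en")])  # 3/6 = 0.5
```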
Features and constraints (1/4)

- Use a set of statistics of the training sample to construct a statistical model of the process
- Statistics that are independent of the context:

  $p(dans) + p(en) = 3/10$
  $p(dans) + p(en) + p(à) + p(au\ cours\ de) + p(pendant) = 1$
  $p(dans) + p(à) = 1/2$

- Statistics that depend on the conditioning information x: e.g. in the training sample, if April is the word following in, then the translation of in is en with frequency 9/10 (see the sketch below)
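Continuing the hypothetical toy sample from the previous sketch, such a context-dependent statistic is just a relative frequency:

```python
# Relative frequency of translation "en" among toy-sample pairs
# whose context mentions April (continues the `sample` list above).
april_pairs = [(x, y) for (x, y) in sample if "April" in x]
freq = sum(1 for _, y in april_pairs if y == "en") / len(april_pairs)
print(freq)  # 1.0 on the toy sample; 9/10 in the slide's example
```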
Features and constraints (2/4)

- To express the event that in translates as en when April is the following word, we can introduce the indicator function:

  $f(x, y) = \begin{cases} 1 & \text{if } y = en \text{ and April follows in} \\ 0 & \text{otherwise} \end{cases}$

- The expected value of f with respect to the empirical distribution $\tilde{p}(x, y)$ is exactly the statistic we are interested in. This expected value is given by:

  $\tilde{p}(f) \equiv \sum_{x,y} \tilde{p}(x, y)\, f(x, y)$

- We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f. We call such a function a feature function, or feature for short. (A sketch follows below.)
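A minimal sketch of this feature and its empirical expectation, continuing the toy `p_tilde` from the earlier sketch:

```python
# The indicator feature above and its empirical expectation
# p~(f) = sum over (x, y) of p~(x, y) f(x, y).
def f(x, y):
    # 1 if the translation is "en" and "April" follows "in" in context x
    return 1 if y == "en" and "April" in x else 0

p_tilde_f = sum(p * f(x, y) for (x, y), p in p_tilde.items())
print(p_tilde_f)  # 0.5 on the toy sample
```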
Features and constraints (3/4)

- The expected value of f with respect to the model p(y|x) is:

  $p(f) \equiv \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f(x, y)$

  where $\tilde{p}(x)$ is the empirical distribution of x in the training sample

- We constrain this expected value to be the same as the expected value of f in the training sample (a sketch follows below):

  $p(f) = \tilde{p}(f)$

  We call this a constraint equation, or simply a constraint.
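A minimal sketch of the model expectation, continuing the toy objects above; the uniform conditional model used here is a hypothetical stand-in for p(y|x):

```python
# Model expectation p(f) = sum over (x, y) of p~(x) p(y|x) f(x, y).
from collections import defaultdict

p_tilde_x = defaultdict(float)          # marginal empirical distribution p~(x)
for (x, y), p in p_tilde.items():
    p_tilde_x[x] += p

def model_expectation(p_y_given_x, feature):
    return sum(p_tilde_x[x] * p_yx * feature(x, y)
               for x in p_tilde_x
               for y, p_yx in p_y_given_x[x].items())

# e.g. a uniform conditional model over the five phrases
uniform = {y: 0.2 for y in ["dans", "en", "à", "au cours de", "pendant"]}
p_model = {x: uniform for x in p_tilde_x}
print(model_expectation(p_model, f))    # 0.1 on the toy sample
```

The constraint p(f) = p~(f) would require this value to match the 0.5 computed in the previous sketch, which the uniform model clearly fails to do.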
Features and constraints (4/4)

- Combining the above 3 equations yields:

  $\sum_{x,y} \tilde{p}(x)\, p(y|x)\, f(x, y) = \sum_{x,y} \tilde{p}(x, y)\, f(x, y)$

- By restricting attention to those models for which this constraint holds, we are eliminating from consideration those models which do not agree with the training sample on how often the output of the process should exhibit the feature f.
- What we have so far:
  - A means of representing statistical phenomena inherent in a sample of data, namely $\tilde{p}(f)$
  - A means of requiring that our model of the process exhibit these phenomena, namely $p(f) = \tilde{p}(f)$
The maxent principle (1/2)

- Suppose n feature functions $f_i$ are given
- We would like our model to accord with these statistics, i.e. we would like p to lie in the subset C of P defined by

  $C \equiv \{\, p \in P \mid p(f_i) = \tilde{p}(f_i) \text{ for } i = 1, 2, \ldots, n \,\}$

- Among the models $p \in C$, we would like to select the distribution which is most uniform
  - But what does "uniform" mean?
- A mathematical measure of the uniformity of a conditional distribution p(y|x) is provided by the conditional entropy (sketched below):

  $H(p) \equiv -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x)$
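A minimal sketch of this quantity, continuing the toy `p_tilde_x` and `p_model` objects from the previous sketches:

```python
# Conditional entropy H(p) = -sum over (x, y) of p~(x) p(y|x) log p(y|x).
import math

def conditional_entropy(p_y_given_x):
    h = 0.0
    for x, px in p_tilde_x.items():
        for y, p_yx in p_y_given_x[x].items():
            if p_yx > 0:                # 0 log 0 is taken to be 0
                h -= px * p_yx * math.log(p_yx)
    return h

print(conditional_entropy(p_model))     # log 5 ≈ 1.609 for the uniform model
```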
The maxent principle (2/2)

  $H(p) \equiv -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x)$

- The entropy is bounded from below by zero
  - The entropy of a model with no uncertainty at all
- The entropy is bounded from above by $\log |Y|$
  - The entropy of the uniform distribution over all $|Y|$ possible values of y
- The principle of maximum entropy:
  To select a model from a set C of allowed probability distributions, choose the model $p^* \in C$ with maximum entropy H(p):

  $p^* = \arg\max_{p \in C} H(p)$

- $p^*$ is always well-defined; that is, there is always a unique model $p^*$ with maximum entropy in any constrained set C
Exponential form (1/3)

- The method of Lagrange multipliers is applied to impose the constraints on the optimization
- The constrained optimization problem is to find

  $p^* = \arg\max_{p \in C} H(p) = \arg\max_{p \in C} \left( -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x) \right)$

- Maximize H(p) subject to the following constraints:
  1. $p(y|x) \geq 0$ for all x, y
  2. $\sum_y p(y|x) = 1$ for all x
     (constraints 1 and 2 guarantee that p is a conditional probability distribution)
  3. $\sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_i(x, y) = \sum_{x,y} \tilde{p}(x, y)\, f_i(x, y)$ for $i \in \{1, 2, \ldots, n\}$
     (in other words, $p \in C$)
Exponential form (2/3)

- When the Lagrange multipliers are introduced, the objective function becomes:

  $\Lambda(p, \lambda, \gamma) \equiv -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x) + \sum_i \lambda_i \left( \sum_{x,y} \tilde{p}(x, y)\, f_i(x, y) - \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_i(x, y) \right) + \gamma \left( \sum_y p(y|x) - 1 \right)$

- The real-valued parameters $\gamma$ and $\lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ correspond to the 1 + n constraints imposed on the solution
- Solve for $p$, $\lambda$, and $\gamma$ by optimizing the resulting unconstrained dual problem
  - See http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node7.html
Exponential form (3/3)

- The final result:
  The maximum entropy model subject to the constraints C has the parametric form $p_{\lambda}^*$ of the equation below, where $\lambda^*$ can be determined by maximizing the dual function $\Psi(\lambda)$:

  $p_{\lambda}^*(y|x) = \frac{1}{Z_{\lambda}(x)} \exp\left( \sum_i \lambda_i f_i(x, y) \right)$

  $Z_{\lambda}(x) = \sum_y \exp\left( \sum_i \lambda_i f_i(x, y) \right)$

  $\Psi(\lambda) \equiv \Lambda(p^*, \lambda, \gamma^*)$
Maximum likelihood

- The log-likelihood $L_{\tilde{p}}(p)$ of the empirical distribution $\tilde{p}$ as predicted by a model p is defined by:

  $L_{\tilde{p}}(p) \equiv \log \prod_{x,y} p(y|x)^{\tilde{p}(x, y)} = \sum_{x,y} \tilde{p}(x, y) \log p(y|x)$

- The dual function $\Psi(\lambda)$ of the previous section is just the log-likelihood for the exponential model $p_{\lambda}$; that is:

  $\Psi(\lambda) = L_{\tilde{p}}(p_{\lambda})$

- The result from the previous section can be rephrased as:
  The model $p^* \in C$ with maximum entropy is the model in the parametric family $p_{\lambda}(y|x)$ that maximizes the likelihood of the training sample $\tilde{p}$ (sketched below)
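A minimal sketch of the log-likelihood, continuing the toy `p_tilde` and `p_model` objects from the earlier sketches; fitting the lambdas of the exponential family by maximizing this quantity is equivalent to maximizing the dual function:

```python
# Log-likelihood L_p~(p) = sum over (x, y) of p~(x, y) log p(y|x).
import math

def log_likelihood(p_y_given_x):
    return sum(p * math.log(p_y_given_x[x][y])
               for (x, y), p in p_tilde.items())

print(log_likelihood(p_model))          # log(1/5) ≈ -1.609 for the uniform model
```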
Skipped sections

- Computing the parameters
  http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node10.html#SECTION00030000000000000000
- Algorithms for inductive learning
  http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node11.html#SECTION00040000000000000000

Further reading

- http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node14.html#SECTION00050000000000000000