580.691 Learning Theory
Reza Shadmehr
Bayesian learning 1:
Bayes rule, priors and maximum a posteriori
Frequentist vs. Bayesian Statistics

Frequentist Thinking: there is a true parameter $w^*$, of which we form an estimate $\hat{w}$. There are many different ways in which we can come up with estimates (e.g., the Maximum Likelihood estimate), and we can evaluate them, for example by the bias $E[\hat{w}] - w^*$ and the variance $\mathrm{var}(\hat{w})$ of the estimator.

Bayesian Thinking: does not have the concept of a true parameter. Rather, at every given time we have knowledge about $w$ (the prior), gain new data, and then update our knowledge using Bayes rule (the posterior):

$$p(w \mid D) = \frac{p(w)\, p(D \mid w)}{p(D)} = \frac{p(w, D)}{\int p(w, D)\, dw}$$

Here $p(w)$ is the prior distribution, $p(D \mid w)$ is the conditional distribution, and $p(w \mid D)$ is the posterior distribution. Given Bayes rule, there is only ONE correct way of learning.
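As a quick illustration of this update rule, here is a minimal sketch (not part of the original slides) that applies Bayes rule on a discrete grid of parameter values; the particular prior and likelihood are assumptions chosen only for the example:

```python
import numpy as np

# Bayes rule on a discrete grid of candidate parameter values w.
w = np.linspace(0.01, 0.99, 99)        # candidate values of w
prior = np.ones_like(w) / len(w)       # flat prior p(w)
likelihood = w**3 * (1 - w)            # assumed p(D | w), e.g. 3 successes, 1 failure
posterior = prior * likelihood         # numerator of Bayes rule
posterior /= posterior.sum()           # divide by p(D) so the posterior sums to one
print(w[np.argmax(posterior)])         # mode of the posterior (~0.75)
```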
Binomial distribution and discrete random variables

Suppose a random variable can only take one of two values (e.g., 0 and 1, success and failure, etc.). Such trials are termed Bernoulli trials.

$$x \in \{0,1\}, \qquad P(x=1) = \theta, \qquad P(x=0) = 1-\theta$$

For a sequence of trials $\mathbf{x} = x^{(1)}, x^{(2)}, \ldots, x^{(N)}$, the probability of a specific sequence of successes and failures is

$$p\!\left(x^{(1)}\right) = \theta^{x^{(1)}} (1-\theta)^{1-x^{(1)}}$$

$$p(\mathbf{x}) = \theta^{x^{(1)}} (1-\theta)^{1-x^{(1)}}\, \theta^{x^{(2)}} (1-\theta)^{1-x^{(2)}} \cdots \theta^{x^{(N)}} (1-\theta)^{1-x^{(N)}}$$

Let $n$ be the number of times the trial succeeded:

$$n = \sum_{i=1}^{N} x^{(i)}$$

The probability distribution of $n$ is the binomial distribution:

$$p(n) = \frac{N!}{n!\,(N-n)!}\, \theta^{n} (1-\theta)^{N-n}$$

$$E[n] = N\theta, \qquad \mathrm{var}(n) = N\theta(1-\theta)$$
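The mean and variance above are easy to confirm by simulation; the following minimal sketch (not from the slides) draws many sequences of $N$ Bernoulli trials and compares the empirical moments of $n$ with $N\theta$ and $N\theta(1-\theta)$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, N, reps = 0.3, 20, 100_000

# n = number of successes in N Bernoulli(theta) trials, repeated many times
n = rng.binomial(N, theta, size=reps)

print(n.mean(), N * theta)               # empirical vs. theoretical E[n]
print(n.var(), N * theta * (1 - theta))  # empirical vs. theoretical var(n)
```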
Poor performance of ML estimators with small data samples

• Suppose we have a coin and wish to estimate the outcome (head or tail) from observing a series of coin tosses. $\theta$ = probability of tossing a head.
• After observing $n$ coin tosses, we note that $D = \left\{x^{(1)}, \ldots, x^{(n)}\right\}$, out of which $h$ trials are heads.
• To estimate whether the next toss will be head or tail, we form an ML estimator. The likelihood is the probability of observing a particular sequence of heads and tails in $D$:

$$L(\theta) = p\!\left(x^{(1)}, \ldots, x^{(n)}\right) = p\!\left(x^{(1)}\right) p\!\left(x^{(2)}\right) \cdots p\!\left(x^{(n)}\right) = \theta^{h} (1-\theta)^{n-h}$$

$$\log L(\theta) = h \log\theta + (n-h)\log(1-\theta)$$

$$\frac{d}{d\theta}\log L(\theta) = \frac{h}{\theta} - \frac{n-h}{1-\theta} = 0$$

$$\hat{\theta}_{ML} = \frac{h}{n}$$

• After one toss, if it comes up tails, our ML estimate predicts zero probability of seeing heads. If the first $n$ tosses are all tails, the ML estimate continues to predict zero probability of seeing heads.
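This failure mode is easy to reproduce; a minimal sketch (not from the slides), with an assumed fair coin:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 0.5                       # assumed true probability of heads

tosses = rng.random(3) < theta_true    # a small sample of tosses (True = head)
h, n = int(tosses.sum()), tosses.size
print(h, n, h / n)                     # the ML estimate h/n; with few tosses it
                                       # can be exactly 0 or 1
```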
Including prior knowledge into the estimation process

• Even though the ML estimator might say $\hat{\theta}_{ML} = 0$, we "know" that the coin can come up both heads and tails, i.e., $\theta > 0$.
• The starting point for our considerations is that $\theta$ is not just a number; we will give $\theta$ a full probability distribution function.
• Suppose we know that the coin is either fair ($\theta = 0.5$) with probability $p$, or in favor of tails ($\theta = 0.4$) with probability $1-p$.
• We want to combine this prior knowledge with new data $D$ (i.e., the number of heads in $n$ throws) to arrive at a posterior distribution for $\theta$. We will apply Bayes rule:

$$p(\theta \mid D) = \frac{p(\theta, D)}{p(D)} = \frac{p(\theta)\, p(D \mid \theta)}{\int p(\theta)\, p(D \mid \theta)\, d\theta} \propto p(\theta)\, p(D \mid \theta)$$

Here $p(\theta)$ is the prior distribution, $p(D \mid \theta)$ is the conditional distribution, and $p(\theta \mid D)$ is the posterior distribution. The numerator is just the joint distribution of $\theta$ and $D$, evaluated at a particular $D$. The denominator is the marginal distribution of the data $D$; that is, it is just a number that makes the numerator integrate to one.
Bayesian estimation for a potentially biased coin

• Suppose that we believe that the coin is either fair, or that it is biased toward tails. $\theta$ = probability of tossing a head. After observing $n$ coin tosses, we note that $D = \left\{x^{(1)}, \ldots, x^{(n)}\right\}$, out of which $h$ trials are heads.

$$p(\theta) = \begin{cases} p & \text{for } \theta = 0.5 \\ 1-p & \text{for } \theta = 0.4 \\ 0 & \text{otherwise} \end{cases}$$

$$p(D \mid \theta) = \theta^{h} (1-\theta)^{n-h}$$

$$P(\theta = 0.5 \mid D) = \frac{p\, 0.5^{h}\, 0.5^{n-h}}{p\, 0.5^{h}\, 0.5^{n-h} + (1-p)\, 0.4^{h}\, 0.6^{n-h}} = \frac{p\, 0.5^{n}}{p\, 0.5^{n} + (1-p)\, 0.4^{h}\, 0.6^{n-h}}$$

$$P(\theta = 0.4 \mid D) = \frac{(1-p)\, 0.4^{h}\, 0.6^{n-h}}{p\, 0.5^{n} + (1-p)\, 0.4^{h}\, 0.6^{n-h}}$$

Now we can accurately calculate the probability that we have a fair coin, given some data $D$. In contrast to the ML estimate, which only gave us a single number $\hat{\theta}_{ML}$, we here have a full probability distribution; that is, we also know how certain we are that we have a fair or an unfair coin.

In some situations we would like a single number that represents our best guess of $\theta$. One possibility for this best guess is the maximum a posteriori (MAP) estimate.
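As a quick check of these formulas, here is a minimal sketch (not from the slides) that evaluates the posterior probability of the fair coin for a chosen prior $p$ and observed counts $h$, $n$:

```python
# Posterior probability that the coin is fair (theta = 0.5), given
# prior P(theta = 0.5) = p and h heads out of n tosses.
def posterior_fair(p, h, n):
    like_fair   = 0.5**n                     # 0.5**h * 0.5**(n - h)
    like_biased = 0.4**h * 0.6**(n - h)
    evidence = p * like_fair + (1 - p) * like_biased
    return p * like_fair / evidence

print(posterior_fair(p=0.5, h=0, n=1))    # one tail: ~0.45, fair coin slightly less likely
print(posterior_fair(p=0.5, h=10, n=20))  # half heads: ~0.60, favors the fair coin
```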
Maximum a-posteriori estimate

We define the MAP estimate as the maximum (i.e., mode) of the posterior distribution:

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta) = \arg\max_{\theta}\, \left[\log p(D \mid \theta) + \log p(\theta)\right]$$

The latter version makes the comparison to the maximum likelihood estimate easy:

$$\hat{\theta}_{ML} = \arg\max_{\theta}\, p(D \mid \theta) = \arg\max_{\theta}\, \log p(D \mid \theta)$$

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta}\, \left[\log p(D \mid \theta) + \log p(\theta)\right]$$

We see that ML and MAP are identical if $p(\theta)$ is a constant that does not depend on $\theta$. In that case our prior is a uniform distribution over the domain of $\theta$. We call such a prior, for obvious reasons, a flat or uninformed prior.
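The equivalence under a flat prior can also be seen numerically; a minimal grid-based sketch (not from the slides), using the coin likelihood with an assumed single tail:

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)          # grid of candidate theta values
h, n = 0, 1                                     # assumed data: one toss, one tail

log_like = h * np.log(theta) + (n - h) * np.log(1 - theta)
log_flat_prior = np.zeros_like(theta)           # log of a constant prior
log_peaked_prior = np.log(theta * (1 - theta))  # a prior favoring theta near 0.5

print(theta[np.argmax(log_like)])                     # ML estimate -> ~0
print(theta[np.argmax(log_like + log_flat_prior)])    # MAP, flat prior -> same as ML
print(theta[np.argmax(log_like + log_peaked_prior)])  # MAP, peaked prior -> ~1/3
```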
Formulating a continuous prior for the coin toss problem

• In the last example the probability of tossing a head, represented by $\theta$, could only be either 0.5 or 0.4. How should we choose a prior distribution if $\theta$ can lie anywhere between 0 and 1?
• Suppose we observed $n$ tosses. The probability that exactly $h$ of those tosses were heads is given by the binomial distribution:

$$p(h) = \binom{n}{h} \theta^{h} (1-\theta)^{n-h} = \frac{n!}{h!\,(n-h)!}\, \theta^{h} (1-\theta)^{n-h}$$

[Figure: the binomial distribution $p(h)$ for $\theta = 0.5$ ($\theta$ = probability of tossing a head), plotted for $n = 10$ and $n = 20$.]
Formulating a continuous prior for the coin toss problem

• $\theta$ represents the probability of a head. We want a continuous distribution that is defined between 0 and 1, and is 0 at $\theta = 0$ and $\theta = 1$. The likelihood $p(D \mid \theta) = \theta^{\alpha}(1-\theta)^{n-\alpha}$ of observing $\alpha$ heads in $n$ tosses has exactly this shape, so we take a prior of the same form:

$$p(\theta) = \frac{1}{c}\, \theta^{\alpha} (1-\theta)^{n-\alpha} \qquad \text{(Beta distribution)}$$

$$c = \int_{0}^{1} \theta^{\alpha} (1-\theta)^{n-\alpha}\, d\theta \qquad \text{(normalizing constant)}$$

[Figure: examples of Beta priors $p(\theta)$, with $\theta$ = probability of tossing a head. One panel shows symmetric priors with $(\alpha, n) = (4, 8), (3, 6), (2, 4), (1, 2)$; the other shows priors favoring tails, with $\alpha = 1$ and $n = 8, 6, 4, 2$.]
Formulating a continuous prior for the coin toss problem

• In general, let's assume our knowledge comes in the form of a beta distribution:

$$p(\theta) = \frac{1}{c}\, \theta^{\alpha} (1-\theta)^{\beta}, \qquad c = \int_{0}^{1} \theta^{\alpha} (1-\theta)^{\beta}\, d\theta$$

The new data enter through the binomial likelihood:

$$p(D \mid \theta) = \theta^{h} (1-\theta)^{n-h}$$

Applying Bayes rule:

$$p(\theta \mid D) = \frac{\frac{1}{c}\, \theta^{\alpha} (1-\theta)^{\beta}\; \theta^{h} (1-\theta)^{n-h}}{\int_{0}^{1} \frac{1}{c}\, \theta^{\alpha} (1-\theta)^{\beta}\; \theta^{h} (1-\theta)^{n-h}\, d\theta} = \frac{1}{d}\, \theta^{\alpha+h} (1-\theta)^{\beta+n-h}, \qquad d = \int_{0}^{1} \theta^{\alpha+h} (1-\theta)^{\beta+n-h}\, d\theta$$

When we apply Bayes rule to integrate some old knowledge (the prior), in the form of a beta distribution with parameters $\alpha$ and $\beta$, with some new knowledge $h$ and $n$ (coming from a binomial distribution), we find that the posterior distribution also has the form of a beta distribution, with parameters $\alpha + h$ and $\beta + n - h$. The beta and binomial distributions are therefore called conjugate distributions.
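This conjugacy is easy to verify numerically; a minimal sketch (not from the slides) with assumed prior parameters and data:

```python
import numpy as np

theta = np.linspace(1e-4, 1 - 1e-4, 10_000)   # grid over (0, 1)
alpha, beta, h, n = 2.0, 2.0, 7, 10           # assumed prior parameters and data

prior      = theta**alpha * (1 - theta)**beta
likelihood = theta**h * (1 - theta)**(n - h)

posterior = prior * likelihood                # Bayes rule, up to normalization
posterior /= posterior.sum()

conjugate = theta**(alpha + h) * (1 - theta)**(beta + n - h)
conjugate /= conjugate.sum()                  # beta form with parameters alpha+h, beta+n-h

print(np.max(np.abs(posterior - conjugate)))  # ~0 (floating-point error): the curves coincide
```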
MAP estimator for the coin toss problem

Let us look at the MAP estimator if we start with a prior with $\alpha = 1$, $n = 2$, i.e., we have a slight belief that the coin is fair. Our posterior is then:

$$p(\theta \mid D) = \frac{1}{d}\, \theta^{h+1} (1-\theta)^{n-h+1}$$

[Figure: $p(\theta)$ for $n = 2$, $h = 1$, peaked at $\theta = 0.5$.]

Let's calculate the MAP estimate so that we can compare it to the ML estimate:

$$\log p(\theta \mid D) = (h+1)\log\theta + (n-h+1)\log(1-\theta) - \log d$$

$$\frac{d \log p(\theta \mid D)}{d\theta} = \frac{h+1}{\theta} - \frac{n-h+1}{1-\theta} = 0$$

$$\frac{n-h+1}{1-\theta} = \frac{h+1}{\theta} \quad\Rightarrow\quad \theta\,(n-h+1) = (1-\theta)(h+1)$$

$$\hat{\theta}_{MAP} = \frac{h+1}{n+2}$$

Note that after one toss, if we get a tail ($h = 0$, $n = 1$), our estimated probability of tossing a head is $1/3 \approx 0.33$, not zero as in the ML case.
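A minimal sketch (not from the slides) comparing the two estimators for a few small samples:

```python
# ML estimate h/n versus MAP estimate (h + 1)/(n + 2) under the alpha = 1, n = 2 prior.
def theta_ml(h, n):
    return h / n

def theta_map(h, n):
    return (h + 1) / (n + 2)

for h, n in [(0, 1), (0, 5), (3, 5)]:
    print(h, n, theta_ml(h, n), theta_map(h, n))
# After a single tail (h = 0, n = 1): ML says 0, MAP says 1/3.
```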
Classification with a continuous conditional distribution

Assume you only know the height of a person, but not their gender. Can height tell you something about gender? Let $y$ = height and $x$ = gender (0 = male, 1 = female).

What we have: the densities $p(y \mid x=0)$ and $p(y \mid x=1)$.
What we want: the probability $P(x=1 \mid y)$:

$$P(x=1 \mid y) = \frac{P(x=1)\, p(y \mid x=1)}{\sum_{i=0}^{1} P(x=i)\, p(y \mid x=i)}$$

Height is normally distributed in the population of men and in the population of women, with different means and similar variances. Let $x$ be an indicator variable for being female. Then the conditional distribution of $y$ (the height) becomes:

$$p(y \mid x=1) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^{2}}(y-\mu_{f})^{2}\right)$$

$$p(y \mid x=0) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^{2}}(y-\mu_{m})^{2}\right)$$
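A minimal sketch (not from the slides) of this classification rule, applying Bayes rule directly to the two Gaussian densities; the means, standard deviation, and prior are assumed values in the spirit of the numerical example further below:

```python
import numpy as np

mu_m, mu_f, sigma = 176.0, 166.0, 12.0   # assumed class means and shared std (cm)
prior_female = 0.5                       # assumed prior P(x = 1)

def gauss(y, mu):
    """Normal density of height y with mean mu and std sigma."""
    return np.exp(-(y - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def p_female_given_height(y):
    num = prior_female * gauss(y, mu_f)
    den = num + (1 - prior_female) * gauss(y, mu_m)
    return num / den

for y in [150.0, 171.0, 190.0]:
    print(y, p_female_given_height(y))   # shorter -> more likely female
```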
Classification with a continuous conditional distribution

Let us further assume that we start with a prior distribution such that $x$ is 1 with probability $p$:

$$P(x=1 \mid y) = \frac{P(x=1)\, p(y \mid x=1)}{P(x=1)\, p(y \mid x=1) + P(x=0)\, p(y \mid x=0)}$$

$$= \frac{p \exp\!\left(-\frac{1}{2\sigma^{2}}(y-\mu_{f})^{2}\right)}{p \exp\!\left(-\frac{1}{2\sigma^{2}}(y-\mu_{f})^{2}\right) + (1-p) \exp\!\left(-\frac{1}{2\sigma^{2}}(y-\mu_{m})^{2}\right)}$$

$$= \frac{1}{1 + \frac{1-p}{p} \exp\!\left(-\frac{1}{2\sigma^{2}}\left[(y-\mu_{m})^{2} - (y-\mu_{f})^{2}\right]\right)}$$

$$= \frac{1}{1 + \exp\!\left(\log\frac{1-p}{p} - \frac{1}{2\sigma^{2}}\left[\mu_{m}^{2} - \mu_{f}^{2} - 2y(\mu_{m}-\mu_{f})\right]\right)}$$

$$= \frac{1}{1 + \exp\!\left(\boldsymbol{\theta}^{T}\mathbf{y}\right)}, \qquad \boldsymbol{\theta} = \left[\log\frac{1-p}{p} - \frac{\mu_{m}^{2}-\mu_{f}^{2}}{2\sigma^{2}},\;\; \frac{\mu_{m}-\mu_{f}}{\sigma^{2}}\right]^{T}, \qquad \mathbf{y} = [1,\, y]^{T}$$

The posterior is a logistic function of a linear function of the data and parameters (remember this result in the section on classification!). The maximum-likelihood argument would just have decided under which model the data would have been more likely. The posterior distribution, in contrast, gives us the full probability that we have a male or a female, and we can also include prior knowledge in our scheme.
Classification with a continuous conditional distribution

Computing the probability that the subject is female, given that we observed height $y$:

$$P(x=1 \mid y) = \frac{1}{1 + \exp\!\left(\log\frac{1-p}{p} - \frac{\mu_{m}^{2}-\mu_{f}^{2}}{2\sigma^{2}} + \frac{\mu_{m}-\mu_{f}}{\sigma^{2}}\, y\right)}$$

$$\mu_{m} = 176\ \text{cm}, \qquad \mu_{f} = 166\ \text{cm}, \qquad \sigma = 12\ \text{cm}$$

[Figure: the posterior probability $P(x=1 \mid y)$ as a function of height $y$ (120 to 220 cm), for the prior probabilities $P(x=1) = 0.5$ and $P(x=1) = 0.3$.]
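Evaluating this closed form reproduces the behavior of the curves in the figure; a minimal sketch (not from the slides):

```python
import numpy as np

mu_m, mu_f, sigma = 176.0, 166.0, 12.0     # the slide's values (cm)

def p_female(y, prior):
    """Posterior P(x=1 | y) from the logistic closed form above."""
    z = (np.log((1 - prior) / prior)
         - (mu_m**2 - mu_f**2) / (2 * sigma**2)
         + (mu_m - mu_f) / sigma**2 * y)
    return 1.0 / (1.0 + np.exp(z))

for y in [150, 171, 190]:
    print(y, p_female(y, 0.5), p_female(y, 0.3))
# Lowering the prior P(x=1) from 0.5 to 0.3 shifts the whole curve downward.
```

The values agree with the direct Bayes-rule computation sketched earlier, as expected from the derivation.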
Summary

• Bayesian estimation involves the application of Bayes rule to combine a prior density and a conditional density to arrive at a posterior density:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

• Maximum a posteriori (MAP) estimation: if we need a "best guess" from our posterior distribution, the maximum of the posterior distribution is often used:

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta)$$

• The MAP and ML estimates are identical when our prior is uniformly distributed on $\theta$, i.e., is flat or uninformed.
• With a two-way classification problem and data that are Gaussian given the category membership, the posterior is a logistic function, linear in the data:

$$P(x=1 \mid y) = \frac{1}{1 + \exp\!\left(\boldsymbol{\theta}^{T}\mathbf{y}\right)}$$