580.691 Learning Theory
Reza Shadmehr
Bayesian learning 1:
Bayes rule, priors and maximum a posteriori
Frequentist vs. Bayesian Statistics

Frequentist Thinking: there is a true parameter $w^*$, of which we form an estimate $\hat{w}$. There are many different ways in which we can come up with estimates (e.g., the Maximum Likelihood estimate), and we can evaluate them, for example by the bias $E[\hat{w}] - w^*$ and the variance $\mathrm{var}(\hat{w})$ of the estimator.

Bayesian Thinking: does not have the concept of a true parameter. Rather, at every given time we have knowledge about $w$ (the prior), gain new data, and then update our knowledge using Bayes rule (the posterior):

$$p(w \mid D) = \frac{p(w)\, p(D \mid w)}{p(D)} = \frac{p(w, D)}{\int p(w, D)\, dw}$$

Here $p(w)$ is the prior distribution, $p(D \mid w)$ is the conditional distribution, and $p(w \mid D)$ is the posterior distribution. Given Bayes rule, there is only ONE correct way of learning.
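As a quick illustration of this update rule, here is a minimal sketch (not part of the original slides) that applies Bayes rule on a discrete grid of parameter values; the particular prior and likelihood are assumptions chosen only for the example:

```python
import numpy as np

# Bayes rule on a discrete grid of candidate parameter values w.
w = np.linspace(0.01, 0.99, 99)        # candidate values of w
prior = np.ones_like(w) / len(w)       # flat prior p(w)
likelihood = w**3 * (1 - w)            # assumed p(D | w), e.g. 3 successes, 1 failure
posterior = prior * likelihood         # numerator of Bayes rule
posterior /= posterior.sum()           # divide by p(D) so the posterior sums to one
print(w[np.argmax(posterior)])         # mode of the posterior (~0.75)
```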
Binomial distribution and discrete random variables

Suppose a random variable can only take one of two values (e.g., 0 and 1, success and failure, etc.). Such trials are termed Bernoulli trials.

$$x \in \{0,1\}, \qquad P(x=1) = \theta, \qquad P(x=0) = 1-\theta$$

For a sequence of trials $\mathbf{x} = x^{(1)}, x^{(2)}, \ldots, x^{(N)}$, the probability of a specific sequence of successes and failures is

$$p\!\left(x^{(1)}\right) = \theta^{x^{(1)}} (1-\theta)^{1-x^{(1)}}$$

$$p(\mathbf{x}) = \theta^{x^{(1)}} (1-\theta)^{1-x^{(1)}}\, \theta^{x^{(2)}} (1-\theta)^{1-x^{(2)}} \cdots \theta^{x^{(N)}} (1-\theta)^{1-x^{(N)}}$$

Let $n$ be the number of times the trial succeeded:

$$n = \sum_{i=1}^{N} x^{(i)}$$

The probability distribution of $n$ is the binomial distribution:

$$p(n) = \frac{N!}{n!\,(N-n)!}\, \theta^{n} (1-\theta)^{N-n}$$

$$E[n] = N\theta, \qquad \mathrm{var}(n) = N\theta(1-\theta)$$
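The mean and variance above are easy to confirm by simulation; the following minimal sketch (not from the slides) draws many sequences of $N$ Bernoulli trials and compares the empirical moments of $n$ with $N\theta$ and $N\theta(1-\theta)$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, N, reps = 0.3, 20, 100_000

# n = number of successes in N Bernoulli(theta) trials, repeated many times
n = rng.binomial(N, theta, size=reps)

print(n.mean(), N * theta)               # empirical vs. theoretical E[n]
print(n.var(), N * theta * (1 - theta))  # empirical vs. theoretical var(n)
```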
Poor performance of ML estimators with small data samples

• Suppose we have a coin and wish to estimate the outcome (head or tail) from observing a series of coin tosses. $\theta$ = probability of tossing a head.
• After observing $n$ coin tosses, we note that $D = \left\{x^{(1)}, \ldots, x^{(n)}\right\}$, out of which $h$ trials are heads.
• To estimate whether the next toss will be head or tail, we form an ML estimator. The likelihood is the probability of observing a particular sequence of heads and tails in $D$:

$$L(\theta) = p\!\left(x^{(1)}, \ldots, x^{(n)}\right) = p\!\left(x^{(1)}\right) p\!\left(x^{(2)}\right) \cdots p\!\left(x^{(n)}\right) = \theta^{h} (1-\theta)^{n-h}$$

$$\log L(\theta) = h \log\theta + (n-h)\log(1-\theta)$$

$$\frac{d}{d\theta}\log L(\theta) = \frac{h}{\theta} - \frac{n-h}{1-\theta} = 0$$

$$\hat{\theta}_{ML} = \frac{h}{n}$$

• After one toss, if it comes up tails, our ML estimate predicts zero probability of seeing heads. If the first $n$ tosses are all tails, the ML estimate continues to predict zero probability of seeing heads.
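This failure mode is easy to reproduce; a minimal sketch (not from the slides), with an assumed fair coin:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 0.5                       # assumed true probability of heads

tosses = rng.random(3) < theta_true    # a small sample of tosses (True = head)
h, n = int(tosses.sum()), tosses.size
print(h, n, h / n)                     # the ML estimate h/n; with few tosses it
                                       # can be exactly 0 or 1
```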
Including prior knowledge into the estimation process

• Even though the ML estimator might say $\hat{\theta}_{ML} = 0$, we "know" that the coin can come up both heads and tails, i.e., $\theta > 0$.
• The starting point for our considerations is that $\theta$ is not just a number; we will give $\theta$ a full probability distribution function.
• Suppose we know that the coin is either fair ($\theta = 0.5$) with probability $p$, or in favor of tails ($\theta = 0.4$) with probability $1-p$.
• We want to combine this prior knowledge with new data $D$ (i.e., the number of heads in $n$ throws) to arrive at a posterior distribution for $\theta$. We will apply Bayes rule:

$$p(\theta \mid D) = \frac{p(\theta, D)}{p(D)} = \frac{p(\theta)\, p(D \mid \theta)}{\int p(\theta)\, p(D \mid \theta)\, d\theta} \propto p(\theta)\, p(D \mid \theta)$$

Here $p(\theta)$ is the prior distribution, $p(D \mid \theta)$ is the conditional distribution, and $p(\theta \mid D)$ is the posterior distribution. The numerator is just the joint distribution of $\theta$ and $D$, evaluated at a particular $D$. The denominator is the marginal distribution of the data $D$; that is, it is just a number that makes the numerator integrate to one.
Bayesian estimation for a potentially biased coin

• Suppose that we believe that the coin is either fair, or that it is biased toward tails. $\theta$ = probability of tossing a head. After observing $n$ coin tosses, we note that $D = \left\{x^{(1)}, \ldots, x^{(n)}\right\}$, out of which $h$ trials are heads.

$$p(\theta) = \begin{cases} p & \text{for } \theta = 0.5 \\ 1-p & \text{for } \theta = 0.4 \\ 0 & \text{otherwise} \end{cases}$$

$$p(D \mid \theta) = \theta^{h} (1-\theta)^{n-h}$$

$$P(\theta = 0.5 \mid D) = \frac{p\, 0.5^{h}\, 0.5^{n-h}}{p\, 0.5^{h}\, 0.5^{n-h} + (1-p)\, 0.4^{h}\, 0.6^{n-h}} = \frac{p\, 0.5^{n}}{p\, 0.5^{n} + (1-p)\, 0.4^{h}\, 0.6^{n-h}}$$

$$P(\theta = 0.4 \mid D) = \frac{(1-p)\, 0.4^{h}\, 0.6^{n-h}}{p\, 0.5^{n} + (1-p)\, 0.4^{h}\, 0.6^{n-h}}$$

Now we can accurately calculate the probability that we have a fair coin, given some data $D$. In contrast to the ML estimate, which only gave us a single number $\hat{\theta}_{ML}$, we here have a full probability distribution; that is, we also know how certain we are that we have a fair or an unfair coin.

In some situations we would like a single number that represents our best guess of $\theta$. One possibility for this best guess is the maximum a posteriori (MAP) estimate.
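As a quick check of these formulas, here is a minimal sketch (not from the slides) that evaluates the posterior probability of the fair coin for a chosen prior $p$ and observed counts $h$, $n$:

```python
# Posterior probability that the coin is fair (theta = 0.5), given
# prior P(theta = 0.5) = p and h heads out of n tosses.
def posterior_fair(p, h, n):
    like_fair   = 0.5**n                     # 0.5**h * 0.5**(n - h)
    like_biased = 0.4**h * 0.6**(n - h)
    evidence = p * like_fair + (1 - p) * like_biased
    return p * like_fair / evidence

print(posterior_fair(p=0.5, h=0, n=1))    # one tail: ~0.45, fair coin slightly less likely
print(posterior_fair(p=0.5, h=10, n=20))  # half heads: ~0.60, favors the fair coin
```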
Maximum a-posteriori estimate

We define the MAP estimate as the maximum (i.e., mode) of the posterior distribution:

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta) = \arg\max_{\theta}\, \left[\log p(D \mid \theta) + \log p(\theta)\right]$$

The latter version makes the comparison to the maximum likelihood estimate easy:

$$\hat{\theta}_{ML} = \arg\max_{\theta}\, p(D \mid \theta) = \arg\max_{\theta}\, \log p(D \mid \theta)$$

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta}\, \left[\log p(D \mid \theta) + \log p(\theta)\right]$$

We see that ML and MAP are identical if $p(\theta)$ is a constant that does not depend on $\theta$. In that case our prior is a uniform distribution over the domain of $\theta$. We call such a prior, for obvious reasons, a flat or uninformed prior.
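The equivalence under a flat prior can also be seen numerically; a minimal grid-based sketch (not from the slides), using the coin likelihood with an assumed single tail:

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)          # grid of candidate theta values
h, n = 0, 1                                     # assumed data: one toss, one tail

log_like = h * np.log(theta) + (n - h) * np.log(1 - theta)
log_flat_prior = np.zeros_like(theta)           # log of a constant prior
log_peaked_prior = np.log(theta * (1 - theta))  # a prior favoring theta near 0.5

print(theta[np.argmax(log_like)])                     # ML estimate -> ~0
print(theta[np.argmax(log_like + log_flat_prior)])    # MAP, flat prior -> same as ML
print(theta[np.argmax(log_like + log_peaked_prior)])  # MAP, peaked prior -> ~1/3
```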
Formulating a continuous prior for the coin toss problem

• In the last example the probability of tossing a head, represented by $\theta$, could only be either 0.5 or 0.4. How should we choose a prior distribution if $\theta$ can lie anywhere between 0 and 1?
• Suppose we observed $n$ tosses. The probability that exactly $h$ of those tosses were heads is given by the binomial distribution:

$$p(h) = \binom{n}{h} \theta^{h} (1-\theta)^{n-h} = \frac{n!}{h!\,(n-h)!}\, \theta^{h} (1-\theta)^{n-h}$$

[Figure: the binomial distribution $p(h)$ for $\theta = 0.5$ ($\theta$ = probability of tossing a head), plotted for $n = 10$ and $n = 20$.]
Formulating a continuous prior for the coin toss problem

• $\theta$ represents the probability of a head. We want a continuous distribution that is defined between 0 and 1, and is 0 at $\theta = 0$ and $\theta = 1$. The likelihood $p(D \mid \theta) = \theta^{\alpha}(1-\theta)^{n-\alpha}$ of observing $\alpha$ heads in $n$ tosses has exactly this shape, so we take a prior of the same form:

$$p(\theta) = \frac{1}{c}\, \theta^{\alpha} (1-\theta)^{n-\alpha} \qquad \text{(Beta distribution)}$$

$$c = \int_{0}^{1} \theta^{\alpha} (1-\theta)^{n-\alpha}\, d\theta \qquad \text{(normalizing constant)}$$

[Figure: examples of Beta priors $p(\theta)$, with $\theta$ = probability of tossing a head. One panel shows symmetric priors with $(\alpha, n) = (4, 8), (3, 6), (2, 4), (1, 2)$; the other shows priors favoring tails, with $\alpha = 1$ and $n = 8, 6, 4, 2$.]
Formulating a continuous prior for the coin toss problem

• In general, let's assume our knowledge comes in the form of a beta distribution:

$$p(\theta) = \frac{1}{c}\, \theta^{\alpha} (1-\theta)^{\beta}, \qquad c = \int_{0}^{1} \theta^{\alpha} (1-\theta)^{\beta}\, d\theta$$

The new data enter through the binomial likelihood:

$$p(D \mid \theta) = \theta^{h} (1-\theta)^{n-h}$$

Applying Bayes rule:

$$p(\theta \mid D) = \frac{\frac{1}{c}\, \theta^{\alpha} (1-\theta)^{\beta}\; \theta^{h} (1-\theta)^{n-h}}{\int_{0}^{1} \frac{1}{c}\, \theta^{\alpha} (1-\theta)^{\beta}\; \theta^{h} (1-\theta)^{n-h}\, d\theta} = \frac{1}{d}\, \theta^{\alpha+h} (1-\theta)^{\beta+n-h}, \qquad d = \int_{0}^{1} \theta^{\alpha+h} (1-\theta)^{\beta+n-h}\, d\theta$$

When we apply Bayes rule to integrate some old knowledge (the prior), in the form of a beta distribution with parameters $\alpha$ and $\beta$, with some new knowledge $h$ and $n$ (coming from a binomial distribution), we find that the posterior distribution also has the form of a beta distribution, with parameters $\alpha + h$ and $\beta + n - h$. The beta and binomial distributions are therefore called conjugate distributions.
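This conjugacy is easy to verify numerically; a minimal sketch (not from the slides) with assumed prior parameters and data:

```python
import numpy as np

theta = np.linspace(1e-4, 1 - 1e-4, 10_000)   # grid over (0, 1)
alpha, beta, h, n = 2.0, 2.0, 7, 10           # assumed prior parameters and data

prior      = theta**alpha * (1 - theta)**beta
likelihood = theta**h * (1 - theta)**(n - h)

posterior = prior * likelihood                # Bayes rule, up to normalization
posterior /= posterior.sum()

conjugate = theta**(alpha + h) * (1 - theta)**(beta + n - h)
conjugate /= conjugate.sum()                  # beta form with parameters alpha+h, beta+n-h

print(np.max(np.abs(posterior - conjugate)))  # ~0 (floating-point error): the curves coincide
```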
MAP estimator for the coin toss problem

Let us look at the MAP estimator if we start with a prior with $\alpha = 1$, $n = 2$, i.e., we have a slight belief that the coin is fair. Our posterior is then:

$$p(\theta \mid D) = \frac{1}{d}\, \theta^{h+1} (1-\theta)^{n-h+1}$$

[Figure: $p(\theta)$ for $n = 2$, $h = 1$, peaked at $\theta = 0.5$.]

Let's calculate the MAP estimate so that we can compare it to the ML estimate:

$$\log p(\theta \mid D) = (h+1)\log\theta + (n-h+1)\log(1-\theta) - \log d$$

$$\frac{d \log p(\theta \mid D)}{d\theta} = \frac{h+1}{\theta} - \frac{n-h+1}{1-\theta} = 0$$

$$\frac{n-h+1}{1-\theta} = \frac{h+1}{\theta} \quad\Rightarrow\quad \theta\,(n-h+1) = (1-\theta)(h+1)$$

$$\hat{\theta}_{MAP} = \frac{h+1}{n+2}$$

Note that after one toss, if we get a tail ($h = 0$, $n = 1$), our estimated probability of tossing a head is $1/3 \approx 0.33$, not zero as in the ML case.
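A minimal sketch (not from the slides) comparing the two estimators for a few small samples:

```python
# ML estimate h/n versus MAP estimate (h + 1)/(n + 2) under the alpha = 1, n = 2 prior.
def theta_ml(h, n):
    return h / n

def theta_map(h, n):
    return (h + 1) / (n + 2)

for h, n in [(0, 1), (0, 5), (3, 5)]:
    print(h, n, theta_ml(h, n), theta_map(h, n))
# After a single tail (h = 0, n = 1): ML says 0, MAP says 1/3.
```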
Classification with a continuous conditional distribution

Assume you only know the height of a person, but not their gender. Can height tell you something about gender? Let $y$ = height and $x$ = gender (0 = male, 1 = female).

What we have: the densities $p(y \mid x=0)$ and $p(y \mid x=1)$.
What we want: the probability $P(x=1 \mid y)$:

$$P(x=1 \mid y) = \frac{P(x=1)\, p(y \mid x=1)}{\sum_{i=0}^{1} P(x=i)\, p(y \mid x=i)}$$

Height is normally distributed in the population of men and in the population of women, with different means and similar variances. Let $x$ be an indicator variable for being female. Then the conditional distribution of $y$ (the height) becomes:

$$p(y \mid x=1) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^{2}}(y-\mu_{f})^{2}\right)$$

$$p(y \mid x=0) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^{2}}(y-\mu_{m})^{2}\right)$$
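A minimal sketch (not from the slides) of this classification rule, applying Bayes rule directly to the two Gaussian densities; the means, standard deviation, and prior are assumed values in the spirit of the numerical example further below:

```python
import numpy as np

mu_m, mu_f, sigma = 176.0, 166.0, 12.0   # assumed class means and shared std (cm)
prior_female = 0.5                       # assumed prior P(x = 1)

def gauss(y, mu):
    """Normal density of height y with mean mu and std sigma."""
    return np.exp(-(y - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def p_female_given_height(y):
    num = prior_female * gauss(y, mu_f)
    den = num + (1 - prior_female) * gauss(y, mu_m)
    return num / den

for y in [150.0, 171.0, 190.0]:
    print(y, p_female_given_height(y))   # shorter -> more likely female
```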
Classification with a continuous conditional distribution

Let us further assume that we start with a prior distribution such that $x$ is 1 with probability $p$:

$$P(x=1 \mid y) = \frac{P(x=1)\, p(y \mid x=1)}{P(x=1)\, p(y \mid x=1) + P(x=0)\, p(y \mid x=0)}$$

$$= \frac{p \exp\!\left(-\frac{1}{2\sigma^{2}}(y-\mu_{f})^{2}\right)}{p \exp\!\left(-\frac{1}{2\sigma^{2}}(y-\mu_{f})^{2}\right) + (1-p) \exp\!\left(-\frac{1}{2\sigma^{2}}(y-\mu_{m})^{2}\right)}$$

$$= \frac{1}{1 + \frac{1-p}{p} \exp\!\left(-\frac{1}{2\sigma^{2}}\left[(y-\mu_{m})^{2} - (y-\mu_{f})^{2}\right]\right)}$$

$$= \frac{1}{1 + \exp\!\left(\log\frac{1-p}{p} - \frac{1}{2\sigma^{2}}\left[\mu_{m}^{2} - \mu_{f}^{2} - 2y(\mu_{m}-\mu_{f})\right]\right)}$$

$$= \frac{1}{1 + \exp\!\left(\boldsymbol{\theta}^{T}\mathbf{y}\right)}, \qquad \boldsymbol{\theta} = \left[\log\frac{1-p}{p} - \frac{\mu_{m}^{2}-\mu_{f}^{2}}{2\sigma^{2}},\;\; \frac{\mu_{m}-\mu_{f}}{\sigma^{2}}\right]^{T}, \qquad \mathbf{y} = [1,\, y]^{T}$$

The posterior is a logistic function of a linear function of the data and parameters (remember this result in the section on classification!). The maximum-likelihood argument would just have decided under which model the data would have been more likely. The posterior distribution, in contrast, gives us the full probability that we have a male or a female, and we can also include prior knowledge in our scheme.
Classification with a continuous conditional distribution

Computing the probability that the subject is female, given that we observed height $y$:

$$P(x=1 \mid y) = \frac{1}{1 + \exp\!\left(\log\frac{1-p}{p} - \frac{\mu_{m}^{2}-\mu_{f}^{2}}{2\sigma^{2}} + \frac{\mu_{m}-\mu_{f}}{\sigma^{2}}\, y\right)}$$

$$\mu_{m} = 176\ \text{cm}, \qquad \mu_{f} = 166\ \text{cm}, \qquad \sigma = 12\ \text{cm}$$

[Figure: the posterior probability $P(x=1 \mid y)$ as a function of height $y$ (120 to 220 cm), for the prior probabilities $P(x=1) = 0.5$ and $P(x=1) = 0.3$.]
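Evaluating this closed form reproduces the behavior of the curves in the figure; a minimal sketch (not from the slides):

```python
import numpy as np

mu_m, mu_f, sigma = 176.0, 166.0, 12.0     # the slide's values (cm)

def p_female(y, prior):
    """Posterior P(x=1 | y) from the logistic closed form above."""
    z = (np.log((1 - prior) / prior)
         - (mu_m**2 - mu_f**2) / (2 * sigma**2)
         + (mu_m - mu_f) / sigma**2 * y)
    return 1.0 / (1.0 + np.exp(z))

for y in [150, 171, 190]:
    print(y, p_female(y, 0.5), p_female(y, 0.3))
# Lowering the prior P(x=1) from 0.5 to 0.3 shifts the whole curve downward.
```

The values agree with the direct Bayes-rule computation sketched earlier, as expected from the derivation.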
Summary

• Bayesian estimation involves the application of Bayes rule to combine a prior density and a conditional density to arrive at a posterior density:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

• Maximum a posteriori (MAP) estimation: if we need a "best guess" from our posterior distribution, the maximum of the posterior distribution is often used:

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta)$$

• The MAP and ML estimates are identical when our prior is uniformly distributed on $\theta$, i.e., is flat or uninformed.
• With a two-way classification problem and data that are Gaussian given the category membership, the posterior is a logistic function, linear in the data:

$$P(x=1 \mid y) = \frac{1}{1 + \exp\!\left(\boldsymbol{\theta}^{T}\mathbf{y}\right)}$$