Ch 2. Probability Distributions (1/2)
Pattern Recognition and Machine Learning,
C. M. Bishop, 2006.
Summarized by
Joo-kyung Kim
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
Contents
- 2.1. Binary Variables
  - 2.1.1. The beta distribution
- 2.2. Multinomial Variables
  - 2.2.1. The Dirichlet distribution
- 2.3. The Gaussian Distribution
  - 2.3.1. Conditional Gaussian distributions
  - 2.3.2. Marginal Gaussian distributions
  - 2.3.3. Bayes' theorem for Gaussian variables
  - 2.3.4. Maximum likelihood for the Gaussian
  - 2.3.5. Sequential estimation
Density Estimation
- Modeling the probability distribution p(x) of a random variable x, given a finite set x_1, …, x_N of observations.
  - We will assume that the data points are i.i.d.
- Density estimation is fundamentally ill-posed.
  - There are infinitely many probability distributions that could have given rise to the observed finite data set.
  - The issue of choosing an appropriate distribution relates to the problem of model selection.
- We begin by considering parametric distributions.
  - Binomial, multinomial, and Gaussian.
  - Governed by a small number of adaptive parameters, such as the mean and variance in the case of a Gaussian.
Frequentist and Bayesian Treatments of Density Estimation
- Frequentist
  - Choose specific values for the parameters by optimizing some criterion, such as the likelihood function.
- Bayesian
  - Introduce prior distributions over the parameters and then use Bayes' theorem to compute the corresponding posterior distribution given the observed data.
Bernoulli Distribution
- Considering a single binary r.v. x ∈ {0,1}:

    Bern(x | μ) = μ^x (1 - μ)^{1-x},  E[x] = μ,  var[x] = μ(1 - μ)

- Frequentist treatment
  - Suppose we have a data set D = {x_1, …, x_N} of observed values of x.
  - Likelihood function:

    p(D | μ) = ∏_{n=1}^N p(x_n | μ) = ∏_{n=1}^N μ^{x_n} (1 - μ)^{1-x_n}

    ln p(D | μ) = ∑_{n=1}^N ln p(x_n | μ) = ∑_{n=1}^N { x_n ln μ + (1 - x_n) ln(1 - μ) }

  - Maximum likelihood estimator:

    μ_ML = (1/N) ∑_{n=1}^N x_n

  - If we flip a coin 3 times and happen to observe 3 heads, the ML estimator is 1.
    - An extreme example of the overfitting associated with ML.
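As a quick check of this pathology, here is a minimal sketch in Python/NumPy (the function name is hypothetical, not from the slides):

```python
import numpy as np

def bernoulli_mle(x):
    """ML estimate of mu for Bernoulli data: the sample mean."""
    x = np.asarray(x)
    return x.mean()

# Three coin flips that all come up heads: mu_ML = 1.0,
# i.e. the model assigns probability zero to ever seeing tails.
print(bernoulli_mle([1, 1, 1]))  # 1.0
```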
Binomial & Beta Distribution
- Binomial Distribution
  - The distribution of the number m of observations of x = 1 given that the data set has size N.

    Bin(m | N, μ) = (N choose m) μ^m (1 - μ)^{N-m}

    E[m] = Nμ,  var[m] = Nμ(1 - μ)

  - [Figure: histogram plot of the binomial distribution (N = 10, μ = 0.25).]
- Beta Distribution

    Beta(μ | a, b) = Γ(a + b) / (Γ(a) Γ(b)) · μ^{a-1} (1 - μ)^{b-1}

    where Γ(x) = ∫_0^∞ u^{x-1} e^{-u} du,  and  ∫_0^1 Beta(μ | a, b) dμ = 1

    E[μ] = a / (a + b),  var[μ] = ab / ((a + b)^2 (a + b + 1))

  - a and b are often called hyperparameters.
Bernoulli & Binomial Distribution - Bayesian Treatment (1/3)
- We need to introduce a prior distribution p(μ).
- Conjugacy
  - The posterior distribution has the same functional form as the prior.
- We will use the beta distribution as the prior.
  - The posterior distribution of μ is obtained by multiplying the beta prior by the binomial likelihood function and normalizing:

    p(μ | m, l, a, b) ∝ μ^{m+a-1} (1 - μ)^{l+b-1},  where l = N - m

  - It has the same functional dependence on μ as the prior distribution, reflecting the conjugacy property.
Bernoulli & Binomial Distribution - Bayesian Treatment (2/3)
- Because of the normalization property of the beta distribution, the posterior is simple to normalize.
  - It is simply another beta distribution:

    p(μ | m, l, a, b) = Γ(m + a + l + b) / (Γ(m + a) Γ(l + b)) · μ^{m+a-1} (1 - μ)^{l+b-1}

  - The effect of observing a data set with m observations of x = 1 has been to increase the value of a by m (and likewise b by l).
  - This allows a simple interpretation of the hyperparameters a and b in the prior as effective numbers of observations of x = 1 and x = 0, respectively.
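A minimal sketch of this conjugate update, assuming SciPy's stats.beta for the posterior (function and variable names are hypothetical):

```python
import numpy as np
from scipy import stats

def beta_posterior(a, b, data):
    """Conjugate update: Beta(a, b) prior + Bernoulli data -> Beta(m + a, l + b)."""
    data = np.asarray(data)
    m = int(data.sum())   # number of x = 1 observations
    l = len(data) - m     # number of x = 0 observations
    return stats.beta(m + a, l + b)

post = beta_posterior(a=2, b=2, data=[1, 1, 1, 0])
print(post.mean())  # E[mu | D] = (m + a) / (m + a + l + b) = 5/8
```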
Bernoulli & Binomial Distribution - Bayesian Treatment (3/3)
- The posterior distribution can act as the prior if we subsequently observe additional data.
- Prediction of the outcome of the next trial:

    p(x = 1 | D) = ∫_0^1 p(x = 1 | μ) p(μ | D) dμ = ∫_0^1 μ p(μ | D) dμ = E[μ | D] = (m + a) / (m + a + l + b)

  - If m, l → ∞, the result reduces to the maximum likelihood result.
  - The Bayesian and maximum likelihood results (frequentist view) will agree in the limit of an infinitely large data set.
  - For a finite data set, the posterior mean for μ always lies between the prior mean and the maximum likelihood estimate for μ, which corresponds to the relative frequencies of events given by μ_ML.
- As the number of observations increases, the posterior distribution becomes more sharply peaked (its variance is reduced).
- [Figure: illustration of one step of sequential Bayesian inference.]
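A sketch of that sequential view: updating one observation at a time, with each posterior acting as the next prior, matches a single batch update (all names and data are hypothetical):

```python
# Sequential Bayesian inference for Bernoulli data with a Beta(a, b) prior.
def update(a, b, x):
    """One Bernoulli observation: a counts x = 1, b counts x = 0."""
    return (a + x, b + (1 - x))

a, b = 2, 2
for x in [1, 0, 1, 1]:
    a, b = update(a, b, x)

m, l = 3, 1                    # batch counts from the same data
print((a, b) == (2 + m, 2 + l))  # True: sequential == batch
print(a / (a + b))               # p(x=1 | D) = (m + a) / (m + a + l + b)
```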
Multinomial Variables
- We will use the 1-of-K scheme.
  - The variable is represented by a K-dimensional vector x in which one of the elements x_k equals 1, and all remaining elements equal 0.
  - Ex) x = (0,0,1,0,0,0)^T

    p(x | μ) = ∏_{k=1}^K μ_k^{x_k},  where ∑_{k=1}^K μ_k = 1

    E[x | μ] = μ

- Considering a data set D of N independent observations:

    p(D | μ) = ∏_{n=1}^N ∏_{k=1}^K μ_k^{x_nk} = ∏_{k=1}^K μ_k^{m_k},  where m_k = ∑_n x_nk

- Maximizing the log-likelihood using a Lagrange multiplier:

    ∑_{k=1}^K m_k ln μ_k + λ (∑_{k=1}^K μ_k - 1)  ⇒  μ_k^ML = m_k / N

- Multinomial distribution
  - The joint distribution of the quantities m_1, …, m_K, conditioned on the parameters μ and on the total number N of observations:

    Mult(m_1, m_2, …, m_K | μ, N) = N! / (m_1! m_2! ⋯ m_K!) · ∏_{k=1}^K μ_k^{m_k},  where ∑_{k=1}^K m_k = N
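A minimal sketch of the ML solution for 1-of-K data (toy data made up for illustration, not code from the slides):

```python
import numpy as np

# ML for multinomial variables under 1-of-K coding: mu_k^ML = m_k / N,
# where m_k counts the observations with x_k = 1.
X = np.array([[0, 0, 1],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])   # N = 4 observations, K = 3

m = X.sum(axis=0)           # m_k = sum_n x_nk  -> [1, 1, 2]
mu_ml = m / X.shape[0]      # [0.25, 0.25, 0.5]
print(mu_ml, mu_ml.sum())   # the Lagrange constraint sum_k mu_k = 1 holds
```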
Dirichlet distribution
- The relation of the multinomial and Dirichlet distributions is the same as that of the binomial and beta distributions.

    Dir(μ | a) = Γ(∑_{k=1}^K a_k) / (Γ(a_1) ⋯ Γ(a_K)) · ∏_{k=1}^K μ_k^{a_k - 1}

  - Confined to a simplex because of the constraints 0 ≤ μ_k ≤ 1 and ∑_k μ_k = 1.
- The posterior is again a Dirichlet distribution:

    p(μ | D, a) = Dir(μ | a + m) = Γ(∑_k a_k + N) / (Γ(a_1 + m_1) ⋯ Γ(a_K + m_K)) · ∏_{k=1}^K μ_k^{a_k + m_k - 1}

- [Figure: the Dirichlet distribution over three variables. The two horizontal axes span the simplex and the vertical axis corresponds to the density (a_k = 0.1, 1, 10, respectively).]
The Gaussian Distribution
- In the case of a single variable:

    N(x | μ, σ²) = 1 / (2πσ²)^{1/2} · exp{ -(1 / 2σ²) (x - μ)² }

    where μ is the mean and σ² is the variance.

- For a D-dimensional vector x:

    N(x | μ, Σ) = 1 / (2π)^{D/2} · 1 / |Σ|^{1/2} · exp{ -(1/2) (x - μ)^T Σ^{-1} (x - μ) }

    where μ is a D-dimensional mean vector and Σ is a D × D covariance matrix.

- The Gaussian maximizes the entropy among distributions with a given mean and variance.
- The central limit theorem
  - The sum of a set of random variables has a distribution that becomes increasingly Gaussian as the number of terms in the sum increases.
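As a sanity check of the multivariate formula, a sketch that evaluates the density directly and compares it against SciPy's multivariate_normal (all values are made up):

```python
import numpy as np
from scipy import stats

def gauss_pdf(x, mu, Sigma):
    """Evaluate N(x | mu, Sigma) straight from the formula above."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)     # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, 0.5])
print(np.isclose(gauss_pdf(x, mu, Sigma),
                 stats.multivariate_normal(mu, Sigma).pdf(x)))  # True
```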
The Geometrical Form of the Gaussian Distribution (1/3)
- Functional dependence of the Gaussian on x:

    Δ² = (x - μ)^T Σ^{-1} (x - μ)

  - Δ is called the Mahalanobis distance.
  - It reduces to the Euclidean distance when Σ is I.
- Σ can be taken to be symmetric, because any antisymmetric component would disappear from the exponent.
- The eigenvector equation:

    Σ u_i = λ_i u_i

  - We choose the eigenvectors to form an orthonormal set: u_i^T u_j = I_ij.
The Geometrical Form of the Gaussian Distribution (2/3)
- The covariance matrix can be expressed as an expansion in terms of its eigenvectors:

    Σ = ∑_{i=1}^D λ_i u_i u_i^T,  Σ^{-1} = ∑_{i=1}^D (1/λ_i) u_i u_i^T

- The functional dependence becomes:

    Δ² = ∑_{i=1}^D y_i² / λ_i,  where y_i = u_i^T (x - μ)

- We can interpret {y_i} as a new coordinate system, defined by the orthonormal vectors u_i, that is shifted and rotated.
The Geometrical Form of the Gaussian Distribution (3/3)
- In vector form, y = U(x - μ).
  - U is a matrix whose rows are given by u_i^T.
  - U is an orthogonal matrix.
- For the Gaussian to be well defined, Σ must be positive definite (all eigenvalues λ_i strictly positive).
- The determinant |Σ| of the covariance matrix can be written as the product of its eigenvalues:

    |Σ|^{1/2} = ∏_{j=1}^D λ_j^{1/2}

- In the y coordinates, the Gaussian takes the form of a product of D independent univariate Gaussian distributions:

    p(y) = p(x) |J| = ∏_{j=1}^D 1 / (2πλ_j)^{1/2} · exp{ -y_j² / (2λ_j) }
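A numerical sketch of this change of coordinates (the covariance values are made up):

```python
import numpy as np

# Check: Delta^2 = (x-mu)^T Sigma^{-1} (x-mu) equals sum_i y_i^2 / lambda_i
# with y = U (x - mu), where the rows of U are the eigenvectors u_i^T.
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
mu = np.array([0.0, 1.0])
x = np.array([1.5, -0.5])

lam, V = np.linalg.eigh(Sigma)   # eigh: columns of V are orthonormal eigenvectors
U = V.T                          # rows are u_i^T, as on the slide
y = U @ (x - mu)

direct = (x - mu) @ np.linalg.solve(Sigma, x - mu)
via_eigen = np.sum(y**2 / lam)
print(np.isclose(direct, via_eigen))  # True
```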
Covariance Matrix Form for the Gaussian Distribution
- For large D, the total number of parameters grows quadratically with D.
- One way to reduce the computational cost is to restrict the form of the covariance matrix:
  - (a) general form
  - (b) diagonal
  - (c) isotropic (proportional to the identity matrix)
Conditional & Marginal Gaussian Distributions (1/2)
- If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian:

    μ_{a|b} = μ_a + Σ_{ab} Σ_{bb}^{-1} (x_b - μ_b)

    Σ_{a|b} = Σ_{aa} - Σ_{ab} Σ_{bb}^{-1} Σ_{ba}

  - The mean of the conditional distribution p(x_a | x_b) is a linear function of x_b, and the covariance is independent of x_a.
    - An example of a linear-Gaussian model.
- If a joint distribution p(x_a, x_b) is Gaussian, then the marginal distribution p(x_a) = ∫ p(x_a, x_b) dx_b is also Gaussian.
  - Proved by completing the square in the exponent:

    -(1/2) (x - μ)^T Σ^{-1} (x - μ) = -(1/2) x^T Σ^{-1} x + x^T Σ^{-1} μ + const
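A minimal numerical sketch of the conditional formulas (block values are made up; S stands in for Σ):

```python
import numpy as np

# mu_{a|b} = mu_a + S_ab S_bb^{-1} (x_b - mu_b)
# S_{a|b}  = S_aa - S_ab S_bb^{-1} S_ba
mu_a, mu_b = np.array([0.0]), np.array([1.0])
S_aa = np.array([[2.0]])
S_ab = np.array([[0.8]])
S_bb = np.array([[1.0]])
x_b = np.array([2.0])

K = S_ab @ np.linalg.inv(S_bb)       # S_ab S_bb^{-1}
mu_cond = mu_a + K @ (x_b - mu_b)    # linear in x_b
S_cond = S_aa - K @ S_ab.T           # independent of x_a (S_ba = S_ab^T)
print(mu_cond, S_cond)               # [0.8] [[1.36]]
```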
Conditional & Marginal Gaussian Distributions (2/2)
- [Figure: the contours of a Gaussian distribution p(x_a, x_b) over two variables.]
- [Figure: the marginal distribution p(x_a) and the conditional distribution p(x_a | x_b).]
Maximum likelihood for the Gaussian
- ML w.r.t. μ:

    ln p(X | μ, Σ) = -(ND/2) ln(2π) - (N/2) ln|Σ| - (1/2) ∑_{n=1}^N (x_n - μ)^T Σ^{-1} (x_n - μ)

    ∂/∂μ ln p(X | μ, Σ) = ∑_{n=1}^N Σ^{-1} (x_n - μ)  ⇒  μ_ML = (1/N) ∑_{n=1}^N x_n

- ML w.r.t. Σ (imposing the symmetry and positive definiteness constraints):

    Σ_ML = (1/N) ∑_{n=1}^N (x_n - μ_ML)(x_n - μ_ML)^T

- Evaluating the expectations of the ML solutions under the true distribution:

    E[μ_ML] = μ,  E[Σ_ML] = ((N-1)/N) Σ

  - The ML estimate for the covariance has an expectation that is less than the true value.
  - The bias can be corrected by using the following estimator:

    Σ̃ = (1/(N-1)) ∑_{n=1}^N (x_n - μ_ML)(x_n - μ_ML)^T
Sequential Estimation (1/2)
- Sequential methods allow data points to be processed one at a time and then discarded:

    μ_ML^(N) = (1/N) ∑_{n=1}^N x_n
             = (1/N) x_N + (1/N) ∑_{n=1}^{N-1} x_n
             = (1/N) x_N + ((N-1)/N) μ_ML^(N-1)
             = μ_ML^(N-1) + (1/N) (x_N - μ_ML^(N-1))
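A minimal sketch of this one-pass update (the streaming loop and data are assumptions for illustration):

```python
import numpy as np

# Sequential update of the ML mean: each point is used once and discarded.
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)

mu = 0.0
for n, x in enumerate(data, start=1):
    mu += (x - mu) / n              # mu^(N) = mu^(N-1) + (x_N - mu^(N-1)) / N

print(np.isclose(mu, data.mean()))  # True: identical to the batch estimate
```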
- Robbins-Monro algorithm
  - A more general formulation of sequential learning.
  - Consider a pair of random variables θ and z governed by a joint distribution p(z, θ).
  - Regression function:

    f(θ) = E[z | θ] = ∫ z p(z | θ) dz

  - Our goal is to find the root θ* at which f(θ*) = 0.
  - We observe z one value at a time and wish to find a corresponding sequential estimation scheme for θ*.
Sequential Estimation (2/2)
- The Robbins-Monro procedure defines a sequence of successive estimates of the root θ*:

    θ^(N) = θ^(N-1) + a_{N-1} z(θ^(N-1))

    where lim_{N→∞} a_N = 0,  ∑_{N=1}^∞ a_N = ∞,  and  ∑_{N=1}^∞ a_N² < ∞

- A general maximum likelihood problem
  - The ML solution is a root of

    (1/N) ∑_{n=1}^N ∂/∂θ ln p(x_n | θ) |_{θ_ML} = 0

  - Exchanging the sum for an expectation in the limit:

    lim_{N→∞} (1/N) ∑_{n=1}^N ∂/∂θ ln p(x_n | θ) = E_x[ ∂/∂θ ln p(x | θ) ]

  - Thus finding the maximum likelihood solution corresponds to finding the root of a regression function, and the Robbins-Monro procedure takes the form

    θ^(N) = θ^(N-1) + a_{N-1} ∂/∂θ^(N-1) ln p(x_N | θ^(N-1))

- In the case of a Gaussian distribution, θ corresponds to μ.
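A sketch applying Robbins-Monro to the Gaussian mean; PRML notes that choosing a_N = σ²/N recovers the sequential ML update (data, seed, and loop structure are made up for illustration):

```python
import numpy as np

# With z = d/d(mu) ln p(x | mu, sigma^2) = (x - mu) / sigma^2 and
# a_N = sigma^2 / N (which satisfies the three Robbins-Monro conditions),
# the update reproduces the sequential ML estimate of the mean.
rng = np.random.default_rng(1)
sigma2 = 4.0
data = rng.normal(loc=-2.0, scale=np.sqrt(sigma2), size=5000)

theta = 0.0
for n, x in enumerate(data, start=1):
    a_n = sigma2 / n
    theta += a_n * (x - theta) / sigma2   # a * z(theta), i.e. (x - theta)/n

print(theta, data.mean())                 # both close to the true mean -2.0
```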