Outline
•  Review
•  Maximum A-Posteriori (MAP) Estimation
•  Bayesian Parameter Estimation
•  Example: The Gaussian Case
•  Recursive Bayesian Incremental Learning
•  Problems of Dimensionality
•  Linear Algebra review
•  Principal Component Analysis
•  Fisher Discriminant
Bayesian Decision Theory
•  Bayesian decision theory is a fundamental statistical
approach to the problem of pattern classification.
Ø Decision making when all the probabilistic information is known.
Ø For the given probabilities, the decision is optimal.
Ø When new information is added, it is assimilated in an optimal
fashion to improve the decisions.
Bayes' formula
P(ωj | x) = p(x | ωj) P(ωj) / P(x),
where
P(x) = ∑_{j=1}^{2} p(x | ωj) P(ωj)
Posterior = (Likelihood × Prior) / Evidence
Bayes' formula cont.
•  p(x|ωj) is called the likelihood of ωj with respect to x
(the category ωj for which p(x|ωj) is large is more "likely"
to be the true category).
•  p(x) is the evidence: how frequently we will measure a pattern
with feature value x. It is a scale factor that guarantees that
the posterior probabilities sum to 1.
Bayes' Decision Rule
(Minimizes the probability of error)
ω1 : if P(ω1|x) > P(ω2|x)
ω2 : otherwise
or
ω1 : if p(x|ω1) P(ω1) > p(x|ω2) P(ω2)
ω2 : otherwise
and
P(error|x) = min [P(ω1|x), P(ω2|x)]
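To make the rule concrete, here is a minimal Python sketch (not part of the original slides) assuming two univariate Gaussian class-conditional densities and made-up priors; it computes the posteriors via Bayes' formula and decides for the class with the larger one.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class problem: class-conditional densities p(x|w1), p(x|w2)
# are assumed univariate Gaussians; the priors are assumed known.
priors = np.array([0.6, 0.4])                      # P(w1), P(w2)
likelihoods = [norm(loc=0.0, scale=1.0),           # p(x|w1)
               norm(loc=2.0, scale=1.0)]           # p(x|w2)

def classify(x):
    """Return the index of the class with the larger posterior P(wj|x)."""
    joint = np.array([lik.pdf(x) * p for lik, p in zip(likelihoods, priors)])
    posteriors = joint / joint.sum()               # divide by the evidence p(x)
    return np.argmax(posteriors), posteriors

label, post = classify(0.8)
print(label, post)        # chosen class index and [P(w1|x), P(w2|x)]
print(min(post))          # P(error|x) for this x
```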
Normal Density - Univariate Case
•  Gaussian density with mean µ ∈ ℝ and standard deviation σ ∈ ℝ+
(σ² is named the variance):
p(x) = 1/(√(2π) σ) exp[ −(1/2) ((x − µ)/σ)² ],   p(x) ~ N(µ, σ²)
•  It can be shown that:
µ = E[x] = ∫_{−∞}^{∞} x p(x) dx,
σ² = E[(x − µ)²] = ∫_{−∞}^{∞} (x − µ)² p(x) dx.
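As a quick numerical check of these definitions (a sketch assuming NumPy; the values of µ and σ are arbitrary), one can evaluate the density and verify that the sample mean and variance of random draws approach µ and σ².

```python
import numpy as np

mu, sigma = 1.5, 2.0                      # arbitrary example parameters

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=100_000)

print(normal_pdf(mu, mu, sigma))          # peak value 1/(sqrt(2*pi)*sigma)
print(x.mean(), x.var())                  # ~ mu and ~ sigma^2, i.e. E[x] and E[(x-mu)^2]
```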
Normal Density - Multivariate Case
•  The general multivariate normal density (MND) in d dimensions is
written as
p(x) = 1/((2π)^(d/2) |Σ|^(1/2)) exp[ −(1/2) (x − µ)ᵗ Σ⁻¹ (x − µ) ]
•  It can be shown that:
µ = E[x] = ∫_{ℝᵈ} x p(x) dx,   Σ = E[(x − µ)(x − µ)ᵗ],
which means for the components
σij = E[(xi − µi)(xj − µj)].
•  The covariance matrix Σ is always symmetric and positive
semidefinite.
Maximum Likelihood and Bayesian Parameter Estimation
•  To design an optimal classifier we need P(ωi) and p(x|ωi),
but usually we do not know them.
•  Solution: use training data to estimate the unknown
probabilities. Estimation of the class-conditional densities
is a difficult task.
Maximum Likelihood and Bayesian Parameter Estimation
•  Supervised learning: we get to see samples from
each of the classes “separately” (called tagged or
labeled samples).
•  Tagged samples are “expensive”. We need to learn
the distributions as efficiently as possible.
•  Two methods: parametric (easier) and nonparametric (harder)
Maximum Likelihood and Bayesian Parameter Estimation
•  Program for parametric methods:
Ø Assume specific parametric distributions with parameters
θ ∈ Θ ⊂ Rp.
Ø Estimate the parameters θ̂(D) from the training data D.
Ø Replace the true class-conditional density with the
approximation and apply the Bayesian framework for
decision making.
Maximum Likelihood and Bayesian Parameter Estimation
•  Suppose we can assume that the relevant (class-conditional)
densities are of some parametric form. That is,
p(x|ω) = p(x|θ), where θ ∈ Θ ⊂ Rp.
•  Examples of parameterized densities:
–  Binomial: x(n) has m 1's and n − m 0's,
p(x(n) | θ) = C(n, m) θ^m (1 − θ)^(n−m),   Θ = [0, 1]
–  Exponential: each data point x is distributed according to
p(x | θ) = θ e^(−θx),   Θ = (0, ∞)
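A small sketch (assuming SciPy; the data and θ values are made up) evaluating these two parameterized likelihoods:

```python
from scipy.stats import binom, expon

# Binomial: n trials, m ones, parameter theta in [0, 1]
n, m, theta = 10, 7, 0.6
print(binom.pmf(m, n, theta))            # C(n, m) * theta^m * (1 - theta)^(n - m)

# Exponential: p(x|theta) = theta * exp(-theta * x), theta in (0, inf)
theta, x = 2.0, 0.3
print(expon.pdf(x, scale=1.0 / theta))   # equals theta * exp(-theta * x)
```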
Maximum Likelihood and Bayesian Parameter Estimation
cont.
•  Two procedures for parameter estimation will be considered:
Ø Maximum likelihood estimation: choose the parameter value θ̂
that makes the data most probable (i.e., maximizes the
probability of obtaining the sample that has actually been
observed),
θ̂(D) = arg maxθ p(D | θ),   p(x | D) = p(x | θ̂(D))
Ø Bayesian learning: define a prior probability on the model
space, p(θ), and compute the posterior p(θ | D).
Additional samples sharpen the posterior density, which peaks
near the true values of the parameters.
Sampling Model
•  It is assumed that a sample set S = {(xl, ωl) : l = 1, ..., N} with
independently generated samples is available.
•  The sample set is partitioned into separate sample sets for
each class, Dj = {xl : (xl, ωl) ∈ S, ωl = ωj}.
•  A generic sample set will simply be denoted by D.
•  Each class-conditional density p(x | ωj) is assumed to have a
known parametric form and is uniquely specified by a
parameter (vector) θj.
•  The samples in each set Dj are assumed to be independent and
identically distributed (i.i.d.) according to the true
probability law p(x | ωj).
Log-Likelihood function and Score Function
•  The sample sets are assumed to be functionally
independent, i.e., the training set Dj contains no
information about θi for i ≠ j.
•  The i.i.d. assumption implies that
p(Dj | θj) = ∏_{x ∈ Dj} p(x | θj)
•  Let D be a generic sample of size n ≡ |D|.
•  Log-likelihood function:
l(θ; D) ≡ ln p(D | θ) = ∑_{k=1}^{n} ln p(xk | θ)
•  The log-likelihood is numerically the logarithm of the joint
density of the sample, but it is interpreted as a function of
the parameter θ for the given sample.
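For instance, a minimal sketch (assuming a univariate Gaussian with known variance, as in the illustration described on the next slide) that evaluates l(θ; D) on a grid of candidate means:

```python
import numpy as np

def log_likelihood(mu_candidates, data, sigma=1.0):
    """l(mu; D) = sum_k ln p(x_k | mu) for a Gaussian with known sigma."""
    mu = np.asarray(mu_candidates)[:, None]          # shape (M, 1)
    x = np.asarray(data)[None, :]                    # shape (1, n)
    log_p = -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2
    return log_p.sum(axis=1)                         # one value per candidate mu

rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, size=50)                 # true mean 3.0 (assumed example)
grid = np.linspace(0, 6, 601)
ll = log_likelihood(grid, data)
print(grid[np.argmax(ll)])                           # peaks near the sample mean
```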
Log-Likelihood Illustration
•  Assume that all the points in D are drawn from some (one-dimensional)
normal distribution with some (known) variance and unknown mean.
Log-Likelihood function and Score Function cont.
•  Maximum likelihood estimator (MLE):
θ̂(D) = arg max_{θ ∈ Θ} l(θ; D)
(tacitly assuming that such a maximum exists!)
•  Score function:
Uk(θ; D) ≡ ∂l(θ; D) / ∂θk,   1 ≤ k ≤ p,
and hence
U(θ; D) ≡ ∇θ l(θ; D)
•  Necessary condition for the MLE (if it is not on the border of
the domain Θ):
U(θ; D) = 0
Maximum A Posteriori
•  Maximum a posteriori (MAP):
Find the value of θ that maximizes l(θ) + ln p(θ),
where p(θ) is a prior probability of the different
parameter values. A MAP estimator finds the peak, or
mode, of the posterior.
Drawback of MAP: after an arbitrary nonlinear
transformation of the parameter space, the density
will change, and the MAP solution will no longer be
correct.
Maximum A-Posteriori (MAP) Estimation
" The “most likely value” is given by θ
(n)
p
(
θ
)
p
(
X
|θ )
(n)
0
$
θ = arg max p(θ | X ) = arg max
θ
θ
p( X ( n ) )
n
= arg max
θ
18 March 2016
p0 (θ )∏ p( xi | θ )
i =1
(n)
p
(
X
| θ ') p0 (θ ')dθ '
∫
Maximum A-Posteriori (MAP) Estimation
p(X(n) | θ) = ∏_{i=1}^{n} p(xi | θ)
since the data is i.i.d.
•  We can disregard the normalizing factor p(X(n)) when
looking for the maximum.
MAP - continued
So, the θ̂ we are looking for is
θ̂ = arg maxθ [ p0(θ) ∏_{i=1}^{n} p(xi | θ) ]
  = arg maxθ log[ p0(θ) ∏_{i=1}^{n} p(xi | θ) ]   (log is monotonically increasing)
  = arg maxθ [ log p0(θ) + log ∏_{i=1}^{n} p(xi | θ) ]
  = arg maxθ [ log p0(θ) + ∑_{i=1}^{n} log p(xi | θ) ]
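A minimal numerical sketch of this last expression (assumed model: unit-variance Gaussian likelihood and a Gaussian prior over θ; a simple grid search stands in for a proper optimizer):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(1.0, 1.0, size=20)          # x_i | theta ~ N(theta, 1), assumed example

def log_prior(theta, mu0=0.0, s0=0.5):
    """log p0(theta): Gaussian prior N(mu0, s0^2), assumed for illustration."""
    return -0.5 * np.log(2 * np.pi * s0**2) - 0.5 * ((theta - mu0) / s0) ** 2

def log_likelihood(theta):
    """sum_i log p(x_i | theta) for a unit-variance Gaussian."""
    log_p = -0.5 * np.log(2 * np.pi) - 0.5 * (data[None, :] - theta[:, None]) ** 2
    return log_p.sum(axis=1)

grid = np.linspace(-2, 3, 2001)
objective = log_prior(grid) + log_likelihood(grid)   # log p0(theta) + sum_i log p(x_i|theta)
theta_map = grid[np.argmax(objective)]
print(theta_map)     # lies between the prior mean 0 and the sample mean
```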
The Gaussian Case: Unknown Mean
•  Suppose that the samples are drawn from a multivariate
normal population with mean µ and covariance matrix Σ.
•  Consider first the case where only the mean is unknown,
θ = µ.
•  For a sample point xk, we have
ln p(xk | µ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(xk − µ)ᵗ Σ⁻¹ (xk − µ)
and
∇µ ln p(xk | µ) = Σ⁻¹ (xk − µ)
•  The maximum likelihood estimate for µ must satisfy
The Gaussian Case: Unknown Mean
∑_{k=1}^{n} Σ⁻¹ (xk − µ̂) = 0
•  Multiplying by Σ and rearranging, we obtain
µ̂ = (1/n) ∑_{k=1}^{n} xk
•  The ML estimate for the unknown population mean is just the
arithmetic average of the training samples (the sample mean).
•  Geometrically, if we think of the n samples as a cloud of
points, the sample mean is the centroid of the cloud.
The Gaussian Case: Unknown Mean and Covariance
•  In the general multivariate normal case, neither the mean
nor the covariance matrix is known: θ = [µ, Σ].
•  Consider first the univariate case with θ1 = µ and θ2 = σ².
The log-likelihood of a single point is
ln p(xk | θ) = −(1/2) ln 2πθ2 − (1/(2θ2)) (xk − θ1)²
and its derivative is
∇θ l = ∇θ ln p(xk | θ) = [ (1/θ2)(xk − θ1) ,  −1/(2θ2) + (xk − θ1)²/(2θ2²) ]ᵗ
The Gaussian Case: Unknown Mean and Covariance
•  Setting the gradient to zero, and using all the sample
points, we get the following necessary conditions:
∑_{k=1}^{n} (1/θ̂2)(xk − θ̂1) = 0   and   −∑_{k=1}^{n} 1/θ̂2 + ∑_{k=1}^{n} (xk − θ̂1)²/θ̂2² = 0
•  where θ̂1 = µ̂ and θ̂2 = σ̂² are the ML estimates of θ1 and θ2,
respectively.
•  Solving for µ̂ and σ̂², we obtain
µ̂ = (1/n) ∑_{k=1}^{n} xk   and   σ̂² = (1/n) ∑_{k=1}^{n} (xk − µ̂)²
The Gaussian multivariate case
•  For the multivariate case, it is easy to show that the ML
estimates are given by
µ̂ = (1/n) ∑_{k=1}^{n} xk   and   Σ̂ = (1/n) ∑_{k=1}^{n} (xk − µ̂)(xk − µ̂)ᵗ
•  The MLE for the mean vector is the sample mean, and the MLE
for the covariance matrix is the arithmetic average of the n
matrices (xk − µ̂)(xk − µ̂)ᵗ.
•  The MLE for σ² is biased, i.e., the expected value over all
data sets of size n of the sample variance is not equal to
the true variance:
E[ (1/n) ∑_{i=1}^{n} (xi − µ̂)² ] = ((n − 1)/n) σ² ≠ σ²
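A sketch (NumPy assumed; the true parameters are arbitrary) that computes the multivariate ML estimates and illustrates the bias of the ML variance estimator by averaging it over many small samples:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])

X = rng.multivariate_normal(mu_true, Sigma_true, size=500)   # one training set
mu_hat = X.mean(axis=0)                                       # sample mean (MLE)
D = X - mu_hat
Sigma_mle = D.T @ D / len(X)                                  # divides by n   (biased)
C = D.T @ D / (len(X) - 1)                                    # divides by n-1 (unbiased)
print(mu_hat, Sigma_mle, C, sep="\n")

# Bias check in 1-D: E[(1/n) sum (x_i - mu_hat)^2] = (n-1)/n * sigma^2
n, sigma2, trials = 5, 4.0, 20000
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
mle_vars = samples.var(axis=1)                                # ddof=0, i.e. the MLE
print(mle_vars.mean(), (n - 1) / n * sigma2)                  # the two values agree
```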
The Gaussian multivariate case
•  Unbiased estimators for µ and Σ are given by
µ̂ = (1/n) ∑_{k=1}^{n} xk
and
C = (1/(n − 1)) ∑_{k=1}^{n} (xk − µ̂)(xk − µ̂)ᵗ
C is called the sample covariance matrix. C is absolutely
unbiased; σ̂² is asymptotically unbiased.
Bayesian Estimation: Class-Conditional Densities
•  The aim is to find the posteriors P(ωi|x) from p(x|ωi) and P(ωi),
but these are unknown. How do we find them?
•  Given the sample D, we say that the aim is to find P(ωi|x, D).
•  Bayes' formula gives:
P(ωi | x, D) = p(x | ωi, D) P(ωi | D) / ∑_{j=1}^{c} p(x | ωj, D) P(ωj | D)
•  We use the information provided by the training samples to
determine the class-conditional densities and the prior
probabilities.
•  Generally used assumptions:
–  The priors are known or obtainable from a trivial calculation, so
P(ωi) = P(ωi|D).
–  The training set can be separated into c subsets: D1, ..., Dc.
Bayesian Estimation: Class-Conditional Densities
–  The samples in Dj have no influence on p(x|ωi, Di) if i ≠ j.
•  Thus we can write:
P(ωi | x, D) = p(x | ωi, Di) P(ωi) / ∑_{j=1}^{c} p(x | ωj, Dj) P(ωj)
•  We have c separate problems of the form:
Use a set D of samples drawn independently according to a
fixed but unknown probability distribution p(x) to determine
p(x|D).
Bayesian Estimation: General Theory
•  Bayesian learning considers θ (the parameter vector to be
estimated) to be a random variable.
Before we observe the data, the parameters are described by
a prior p(θ), which is typically very broad. Once we have
observed the data, we can use Bayes' formula to find the
posterior p(θ|D). Since some values of the parameters are
more consistent with the data than others, the posterior is
narrower than the prior. This is Bayesian learning.
General Theory cont.
•  Density function for x, given the training data set D:
p(x|D) = ∫ p(x, θ|D) dθ
•  From the definition of conditional probability densities,
p(x, θ|D) = p(x|θ, D) p(θ|D).
•  The first factor is independent of D, since it is just our
assumed parameterized form: p(x|θ, D) = p(x|θ).
•  Therefore
p(x|D) = ∫ p(x|θ) p(θ|D) dθ
•  Instead of choosing a specific value for θ, the Bayesian
approach performs a weighted average over all values of θ.
The weighting factor p(θ|D), which is the posterior of θ, is
determined by starting from some assumed prior p(θ).
General Theory cont.
•  Then update it using Bayes' formula to take account of the
data set D. Since D = {x1, ..., xN} are drawn independently,
p(D|θ) = ∏_{n=1}^{N} p(xn | θ),     (*)
which is the likelihood function.
•  The posterior for θ is
p(θ|D) = p(D|θ) p(θ) / p(D) = (p(θ)/p(D)) ∏_{n=1}^{N} p(xn | θ),     (**)
where the normalization factor is
p(D) = ∫ p(θ') ∏_{n=1}^{N} p(xn | θ') dθ'.
Bayesian Learning – Univariate Normal Distribution
•  Let us use the Bayesian estimation technique to calculate the
posterior density p(θ|D) and the desired probability density
p(x|D) for the case p(x|µ) ~ N(µ, Σ).
Ø  Univariate case: p(µ|D)
Let µ be the only unknown parameter, with
p(x|µ) ~ N(µ, σ²)
Bayesian Learning – Univariate Normal Distribution
•  Prior probability: a normal distribution over µ,
p(µ) ~ N(µ0, σ0²)
µ0 encodes some prior knowledge about the true mean µ,
while σ0² measures our prior uncertainty.
•  If µ is drawn from p(µ), then the density for x is completely
determined. Letting D = {x1, ..., xn}, we use
p(µ|D) = p(D|µ) p(µ) / ∫ p(D|µ) p(µ) dµ = α ∏_{k=1}^{n} p(xk|µ) p(µ)
Bayesian Learning – Univariate Normal Distribution
•  Computing the posterior distribution:
p(µ|D) ∝ p(D|µ) p(µ)
= α' exp[ −(1/2) ( ∑_{k=1}^{n} ((xk − µ)/σ)² + ((µ − µ0)/σ0)² ) ]
= α'' exp[ −(1/2) ( (n/σ² + 1/σ0²) µ² − 2 ( (1/σ²) ∑_{k=1}^{n} xk + µ0/σ0² ) µ ) ]
Bayesian Learning – Univariate Normal Distribution
•  where factors that do not depend on µ have been absorbed into
the constants α' and α''.
•  p(µ|D) is an exponential of a quadratic function of µ,
i.e., it is a normal density.
•  p(µ|D) remains normal for any number of training samples.
•  If we write
p(µ|D) = 1/(√(2π) σn) exp[ −(1/2) ((µ − µn)/σn)² ]
then, identifying the coefficients, we get
Bayesian Learning – Univariate Normal Distribution
1/σn² = n/σ² + 1/σ0²   and   µn/σn² = (n/σ²) µ̂n + µ0/σ0²,
where µ̂n = (1/n) ∑_{k=1}^{n} xk is the sample mean.
•  Solving explicitly for µn and σn², we obtain
µn = ( nσ0² / (nσ0² + σ²) ) µ̂n + ( σ² / (nσ0² + σ²) ) µ0
and
σn² = σ0² σ² / (nσ0² + σ²)
•  µn represents our best guess for µ after observing n samples.
•  σn² measures our uncertainty about this guess.
•  σn² decreases monotonically with n (approaching σ²/n as n
approaches infinity).
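A sketch of these closed-form updates (NumPy assumed; the prior and the data-generating mean are made up); it also prints the predictive variance σ² + σn² that appears a few slides below:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 1.0                               # known data standard deviation
mu0, sigma0 = 0.0, 2.0                    # prior N(mu0, sigma0^2) over mu (assumed)
data = rng.normal(1.5, sigma, size=25)    # true mean 1.5, used only to generate the demo

n = len(data)
mu_hat_n = data.mean()                                        # sample mean
sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
mu_n = (n * sigma0**2 / (n * sigma0**2 + sigma**2)) * mu_hat_n \
     + (sigma**2 / (n * sigma0**2 + sigma**2)) * mu0

print(mu_n, sigma_n2)                     # posterior p(mu|D) = N(mu_n, sigma_n^2)
print(sigma**2 + sigma_n2)                # variance of the predictive density p(x|D)
```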
Bayesian Learning – Univariate Normal Distribution
•  Each additional observation decreases our uncertainty
about the true value of µ .
•  As n increases, p ( µ | D) becomes more and more
sharply peaked, approaching a Dirac delta function as n
approaches infinity. This behavior is known as Bayesian
Learning.
Bayesian Learning – Univariate Normal Distribution
•  In general, µn is a linear combination of µ̂n and µ0,
with coefficients that are non-negative and sum to 1.
•  Thus µn lies somewhere between µ̂n and µ0.
•  If σ0 ≠ 0, µn → µ̂n as n → ∞.
•  If σ0 = 0, our a priori certainty that µ = µ0 is so
strong that no number of observations can change our
opinion.
•  If σ0 ≫ σ, the a priori guess is very uncertain, and we
take µn = µ̂n.
•  The ratio σ²/σ0² is called the dogmatism.
Bayesian Learning – Univariate Normal Distribution
•  The univariate case: p(x|D)
p(x|D) = ∫ p(x|µ) p(µ|D) dµ
= ∫ [ 1/(√(2π) σ) exp( −(1/2)((x − µ)/σ)² ) ] [ 1/(√(2π) σn) exp( −(1/2)((µ − µn)/σn)² ) ] dµ
= ( 1/(2π σ σn) ) exp[ −(1/2) (x − µn)² / (σ² + σn²) ] f(σ, σn),
where
f(σ, σn) = ∫ exp[ −(1/2) ((σ² + σn²)/(σ² σn²)) ( µ − (σn² x + σ² µn)/(σ² + σn²) )² ] dµ
Bayesian Learning – Univariate Normal Distribution
•  Since p(x|D) ∝ exp[ −(1/2) (x − µn)²/(σ² + σn²) ], we can write
p(x|D) ~ N(µn, σ² + σn²)
•  To obtain the class-conditional density p(x|D), whose
parametric form is known to be p(x|µ) ~ N(µ, σ²),
we replace µ by µn and σ² by σ² + σn².
•  The conditional mean µn is treated as if it were the true
mean, and the known variance is increased to account for
the additional uncertainty in x resulting from our lack of
exact knowledge of the mean µ.
Example (demo-MAP)
•  We have N points generated by a one-dimensional Gaussian,
p(x|µ) = Gx[µ, 1]. Since we think that the mean should not be very
large, we use as a prior p(µ) = Gµ[0, α²], where α is a
hyperparameter. The total objective function is
E ∝ −∑_{n=1}^{N} (xn − µ)² − µ²/α²,
which is maximized to give
µ = ( 1/(N + 1/α²) ) ∑_{n=1}^{N} xn
For N ≫ 1/α², the influence of the prior is negligible and the result
is the ML estimate. But for a very strong belief in the prior, 1/α² ≫ N,
the estimate tends to zero. Thus, if few data are available, the prior
will bias the estimate towards the prior expected value.
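A sketch of this closed-form estimate (NumPy assumed; N and α are arbitrary), showing how a strong prior shrinks the estimate toward zero while a weak prior reproduces the ML estimate:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(2.0, 1.0, size=5)        # few points from N(mu, 1); true mu = 2 (example)

def map_mean(x, alpha):
    """MAP estimate mu = (1 / (N + 1/alpha^2)) * sum_n x_n with prior N(0, alpha^2)."""
    return x.sum() / (len(x) + 1.0 / alpha**2)

print(x.mean())                 # ML estimate (sample mean)
print(map_mean(x, alpha=10.0))  # weak prior: close to the ML estimate
print(map_mean(x, alpha=0.1))   # strong prior: shrunk toward the prior mean 0
```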
Recursive Bayesian Incremental Learning
•  We have seen that p(D|θ) = ∏_{k=1}^{n} p(xk|θ). Let us define
D^n = {x1, ..., xn}. Then
p(D^n|θ) = p(xn|θ) p(D^(n−1)|θ).
•  Substituting into p(θ|D^n) and using Bayes' formula, we have:
p(θ|D^n) = p(D^n|θ) p(θ) / ∫ p(D^n|θ) p(θ) dθ
         = p(xn|θ) p(D^(n−1)|θ) p(θ) / ∫ p(xn|θ) p(D^(n−1)|θ) p(θ) dθ
Dividing the numerator and the denominator by p(D^(n−1)) and using
p(θ|D^(n−1)) = p(D^(n−1)|θ) p(θ) / p(D^(n−1)), we finally obtain
p(θ|D^n) = p(xn|θ) p(θ|D^(n−1)) / ∫ p(xn|θ) p(θ|D^(n−1)) dθ
Recursive Bayesian Incremental Learning
•  With p(θ|D^0) = p(θ), repeated use of this equation produces the
sequence
p(θ), p(θ|x1), p(θ|x1, x2), ...
•  This is called the recursive Bayes approach to parameter
estimation (also incremental or on-line learning).
•  When this sequence of densities converges to a Dirac delta
function centered about the true parameter value, we have
Bayesian learning.
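A grid-based sketch of the recursion (assumed example: unit-variance Gaussian likelihood with unknown mean and a broad Gaussian prior), updating the posterior one sample at a time:

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(0.7, 1.0, size=30)             # true mean 0.7 (for illustration)

theta = np.linspace(-4, 4, 4001)                 # grid over the unknown mean
posterior = np.exp(-0.5 * (theta / 3.0) ** 2)    # broad prior p(theta|D^0), unnormalized
posterior /= np.trapz(posterior, theta)

for x_n in data:
    likelihood = np.exp(-0.5 * (x_n - theta) ** 2) / np.sqrt(2 * np.pi)
    posterior *= likelihood                      # p(x_n|theta) p(theta|D^(n-1))
    posterior /= np.trapz(posterior, theta)      # divide by the integral (evidence)

print(theta[np.argmax(posterior)])               # peaks near the true mean as n grows
```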
Maximal Likelihood vs. Bayesian
•  ML and Bayesian estimations are asymptotically equivalent
and “consistent”. They yield the same class-conditional
densities when the size of the training data grows to infinity.
•  ML is typically computationally easier: in ML we need to do
(multidimensional) differentiation and in Bayesian
(multidimensional) integration.
•  ML is often easier to interpret: it returns the single best model
(parameter) whereas Bayesian gives a weighted average of
models.
•  But for finite training data (and given a reliable prior),
Bayesian estimation is more accurate (it uses more of the information).
•  Bayesian with “flat” prior is essentially ML; with asymmetric
and broad priors the methods lead to different solutions.
Problems of Dimensionality: Accuracy, Dimension, and
Training Sample Size
•  Consider two-class multivariate normal distributions p(x|ωi) ~ N(µi, Σ)
with the same covariance. If the priors are equal, then the Bayesian
error rate is given by
P(e) = 1/√(2π) ∫_{r/2}^{∞} e^(−u²/2) du,
where r² is the squared Mahalanobis distance:
r² = (µ1 − µ2)ᵗ Σ⁻¹ (µ1 − µ2).
•  Thus the probability of error decreases as r increases. In the
conditionally independent case, Σ = diag(σ1², ..., σd²) and
r² = ∑_{i=1}^{d} ( (µi1 − µi2)/σi )²
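Since the integral is the standard normal tail, P(e) = 1 − Φ(r/2), it can be evaluated directly; a sketch (SciPy assumed; the means and shared covariance are made up):

```python
import numpy as np
from scipy.stats import norm

mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])        # shared covariance (assumed example)

d = mu1 - mu2
r2 = d @ np.linalg.solve(Sigma, d)                # squared Mahalanobis distance
r = np.sqrt(r2)

P_error = norm.sf(r / 2)                          # 1/sqrt(2*pi) * integral_{r/2}^inf e^(-u^2/2) du
print(r, P_error)                                 # the error decreases as r grows
```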
Problems of Dimensionality
•  While classification accuracy can improve as the dimensionality
(and the amount of training data) grows,
–  beyond a certain point, the inclusion of additional features
leads to worse rather than better performance,
–  computational complexity grows,
–  the problem of overfitting arises.
Occam's Razor
•  "Pluralitas non est ponenda sine neccesitate" or "plurality
should not be posited without necessity." The words are
those of the medieval English philosopher and Franciscan
monk William of Occam (ca. 1285-1349).
Decisions based on overly complex models often lead to
lower accuracy of the classifier.
What is feature reduction?
•  Feature reduction refers to the mapping of the original
high-dimensional data onto a lower-dimensional space.
–  The criterion for feature reduction can differ depending on the
problem setting:
•  Unsupervised setting: minimize the information loss
•  Supervised setting: maximize the class discrimination
•  Given a set of data points of p variables {x1, x2, ..., xn},
compute the linear transformation (projection)
G ∈ ℝ^(p×d): x ∈ ℝ^p → y = Gᵗx ∈ ℝ^d   (d ≪ p), as sketched below.
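In the unsupervised setting, principal component analysis (covered later in the outline) is the classic choice of G; a minimal sketch (NumPy assumed; the data is synthetic) builds G from the top-d eigenvectors of the sample covariance and projects y = Gᵗx:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # synthetic data, n=200, p=5

def pca_projection(X, d):
    """Return G (p x d) whose columns are the top-d eigenvectors of the sample covariance."""
    Xc = X - X.mean(axis=0)                      # center the data
    C = np.cov(Xc, rowvar=False)                 # p x p sample covariance
    eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues in ascending order
    G = eigvecs[:, ::-1][:, :d]                  # top-d principal directions
    return G, Xc

G, Xc = pca_projection(X, d=2)
Y = Xc @ G                                       # y = G^T x for each (centered) sample
print(Y.shape)                                   # (200, 2): the reduced-dimensional data
```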
What is feature reduction?
[Diagram: the linear transformation Gᵗ : X ∈ ℝ^p → Y = GᵗX ∈ ℝ^d maps the
original data X to the reduced data Y, where G ∈ ℝ^(p×d).]