Model Inference and Averaging
Prof. Liqing Zhang
Dept. Computer Science & Engineering,
Shanghai Jiaotong University
Contents
• The Bootstrap and Maximum Likelihood Methods
• Bayesian Methods
• Relationship Between the Bootstrap and Bayesian Inference
• The EM Algorithm
• MCMC for Sampling from the Posterior
• Bagging
• Model Averaging and Stacking
• Stochastic Search: Bumping
Bootstrap by Basis Expansions
• Consider a linear expansion
  \mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)
• The least-squares solution
  \hat\beta = (H^T H)^{-1} H^T y
• The covariance of \hat\beta
  \mathrm{Cov}(\hat\beta) = (H^T H)^{-1} \hat\sigma^2, \qquad \hat\sigma^2 = \sum_i \left( y_i - \hat\mu(x_i) \right)^2 / N
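As a concrete illustration (a minimal NumPy sketch with made-up data and a hypothetical polynomial basis h_j(x) = x^{j-1}, not taken from the slides), the fit and its estimated covariance can be computed as:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)   # toy observations

# Design matrix H with basis functions h_j(x) = x^(j-1)
H = np.vander(x, N=4, increasing=True)

# Least-squares solution beta_hat = (H^T H)^{-1} H^T y
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)

# Noise variance estimate and covariance of beta_hat
mu_hat = H @ beta_hat
sigma2_hat = np.sum((y - mu_hat) ** 2) / len(y)
cov_beta = np.linalg.inv(H.T @ H) * sigma2_hat

print(beta_hat)
print(np.sqrt(np.diag(cov_beta)))   # standard errors of the coefficients
```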
Parametric Model
• Assume a parameterized probability density (parametric model) for the observations:
  z_i \sim g_\theta(z), \qquad i = 1, \ldots, N
Maximum Likelihood Inference
• Suppose we are trying to measure the true value of some quantity (x_T).
  – We make repeated measurements of this quantity {x_1, x_2, ..., x_n}.
  – The standard way to estimate x_T from our measurements is to calculate the mean value:
    \mu_x = \frac{1}{N} \sum_{i=1}^{N} x_i
    and set x_T = \mu_x.
  – DOES THIS PROCEDURE MAKE SENSE?
• The maximum likelihood method (MLM) answers this question and provides a general method for estimating parameters of interest from data.
The Maximum Likelihood Method
• Statement of the Maximum Likelihood
Method
– Assume we have made N measurements of x: {x_1, x_2, ..., x_N}.
– Assume we know the probability distribution function that describes x: f(x, \alpha).
– Assume we want to determine the parameter \alpha.
• MLM: pick \alpha to maximize the probability of getting the measurements (the x_i's) we did!
The MLM Implementation
• The probability of measuring x_1 is f(x_1, \alpha)\,dx
• The probability of measuring x_2 is f(x_2, \alpha)\,dx
• The probability of measuring x_n is f(x_n, \alpha)\,dx
• If the measurements are independent, the probability of getting the measurements we did is:
  L = f(x_1, \alpha)dx \cdot f(x_2, \alpha)dx \cdots f(x_n, \alpha)dx = f(x_1, \alpha)\, f(x_2, \alpha) \cdots f(x_n, \alpha)\,[dx^n]
• We can drop the dx^n term as it is only a proportionality constant.
• L is called the likelihood function:
  L = \prod_{i=1}^{n} f(x_i, \alpha)
Log Maximum Likelihood Method
• We want to pick the \alpha that maximizes L:
  \left. \frac{\partial L}{\partial \alpha} \right|_{\alpha = \alpha^*} = 0
– Often easier to maximize lnL.
– L and lnL are both maximum at the same
location.
• We maximize ln L rather than L itself because ln L converts the product into a summation.
  \ln L = \sum_{i=1}^{n} \ln f(x_i, \alpha)
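A tiny Python sketch (toy numbers, assumed for illustration) of the practical reason for the switch: the raw product of many densities underflows to zero, while the sum of log densities stays finite.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=2000)          # toy measurements

def gauss_pdf(x, alpha, sigma=1.0):
    return np.exp(-(x - alpha) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

L = np.prod(gauss_pdf(x, alpha=2.0))                    # underflows to 0.0
lnL = np.sum(np.log(gauss_pdf(x, alpha=2.0)))           # finite and usable
print(L, lnL)
```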
Log Maximum Likelihood Method
• The new maximization condition is:
  \left. \frac{\partial \ln L}{\partial \alpha} \right|_{\alpha = \alpha^*} = \sum_{i=1}^{n} \left. \frac{\partial \ln f(x_i, \alpha)}{\partial \alpha} \right|_{\alpha = \alpha^*} = 0
• \alpha could be an array of parameters (e.g. slope and intercept) or just a single variable.
• The equations to determine \alpha range from simple linear equations to coupled non-linear equations.
An Example: Gaussian
• Let f(x, \alpha) be given by a Gaussian distribution function.
• Let \alpha = \mu be the mean of the Gaussian. We want to use our data + MLM to find the mean.
• We want the best estimate of \alpha from our set of n measurements {x_1, x_2, ..., x_n}.
• Let's assume that \sigma is the same for each measurement.
An Example: Gaussian
• Gaussian PDF
  f(x_i, \alpha) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x_i - \alpha)^2}{2\sigma^2}}
• The likelihood function for this problem is:
  L = \prod_{i=1}^{n} f(x_i, \alpha)
    = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x_i - \alpha)^2}{2\sigma^2}}
    = \left[ \frac{1}{\sigma\sqrt{2\pi}} \right]^n e^{-\frac{(x_1 - \alpha)^2}{2\sigma^2}} \, e^{-\frac{(x_2 - \alpha)^2}{2\sigma^2}} \cdots e^{-\frac{(x_n - \alpha)^2}{2\sigma^2}}
    = \left[ \frac{1}{\sigma\sqrt{2\pi}} \right]^n e^{-\sum_{i=1}^{n} \frac{(x_i - \alpha)^2}{2\sigma^2}}
An Example: Gaussian
  \ln L = \sum_{i=1}^{n} \ln f(x_i, \alpha)
        = \ln\!\left( \left[ \frac{1}{\sigma\sqrt{2\pi}} \right]^n e^{-\sum_{i=1}^{n} \frac{(x_i - \alpha)^2}{2\sigma^2}} \right)
        = n \ln\!\left( \frac{1}{\sigma\sqrt{2\pi}} \right) - \sum_{i=1}^{n} \frac{(x_i - \alpha)^2}{2\sigma^2}
• We want to find the \alpha that maximizes the log likelihood function:
  \frac{\partial \ln L}{\partial \alpha} = \frac{\partial}{\partial \alpha} \left[ n \ln\!\left( \frac{1}{\sigma\sqrt{2\pi}} \right) - \sum_{i=1}^{n} \frac{(x_i - \alpha)^2}{2\sigma^2} \right] = 0
  \frac{1}{2\sigma^2} \frac{\partial}{\partial \alpha} \sum_{i=1}^{n} (x_i - \alpha)^2 = 0; \qquad \sum_{i=1}^{n} 2 (x_i - \alpha)(-1) = 0 \;\Rightarrow\; \alpha = \frac{1}{n} \sum_{i=1}^{n} x_i
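As a quick numerical check (a sketch with simulated data, not from the lecture), minimizing the negative log-likelihood over \alpha recovers the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)
sigma = 2.0                                      # assumed known, same for all points
x = rng.normal(loc=5.0, scale=sigma, size=500)   # toy measurements

def neg_log_likelihood(alpha):
    # -ln L for a Gaussian with known sigma (constant terms kept for clarity)
    return np.sum((x - alpha) ** 2 / (2 * sigma**2) + np.log(sigma * np.sqrt(2 * np.pi)))

result = minimize_scalar(neg_log_likelihood, bounds=(0.0, 10.0), method="bounded")
print(result.x, x.mean())                        # the two estimates coincide
```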
An Example: Gaussian
• If the \sigma_i are different for each data point, then \alpha is just the weighted average:
  \alpha = \frac{\sum_{i=1}^{n} x_i / \sigma_i^2}{\sum_{i=1}^{n} 1 / \sigma_i^2} \qquad \text{(weighted average)}
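A one-line sketch of this inverse-variance weighting (the numbers are made up):

```python
import numpy as np

x = np.array([4.8, 5.2, 5.0, 5.6])        # measurements
sigma = np.array([0.5, 0.2, 0.3, 1.0])    # per-measurement uncertainties

w = 1.0 / sigma**2
alpha_hat = np.sum(w * x) / np.sum(w)     # weighted average
print(alpha_hat)
```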
An Example: Poisson
• Let f(x, \alpha) be given by a Poisson distribution.
• Let \alpha = \mu be the mean of the Poisson.
• We want the best estimate of \mu from our set of n measurements {x_1, x_2, ..., x_n}.
• Poisson PDF:
  f(x, \mu) = \frac{e^{-\mu} \mu^{x}}{x!}
An Example: Poisson
• The likelihood function for this problem is:
  L = \prod_{i=1}^{n} f(x_i, \mu) = \prod_{i=1}^{n} \frac{e^{-\mu} \mu^{x_i}}{x_i!}
    = \frac{e^{-\mu} \mu^{x_1}}{x_1!} \cdot \frac{e^{-\mu} \mu^{x_2}}{x_2!} \cdots \frac{e^{-\mu} \mu^{x_n}}{x_n!}
    = \frac{e^{-n\mu} \, \mu^{\sum_{i=1}^{n} x_i}}{x_1!\, x_2! \cdots x_n!}
An Example: Poisson
• Find the \mu that maximizes the log likelihood function:
  \frac{d \ln L}{d \mu} = \frac{d}{d \mu} \left[ -n\mu + \ln\mu \sum_{i=1}^{n} x_i - \ln(x_1!\, x_2! \cdots x_n!) \right]
    = -n + \frac{1}{\mu} \sum_{i=1}^{n} x_i = 0
  \Rightarrow\; \mu = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \text{(the sample average)}
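A short sketch (synthetic counts, assumed for illustration) that evaluates the Poisson log-likelihood on a grid of \mu values and confirms the maximum sits at the sample average:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(7)
x = rng.poisson(lam=3.2, size=200)                     # toy count data

mu_grid = np.linspace(0.5, 8.0, 2000)
# ln L(mu) = -n*mu + sum(x_i)*ln(mu) - sum(ln(x_i!)), with ln(x!) = gammaln(x + 1)
logL = -len(x) * mu_grid + x.sum() * np.log(mu_grid) - gammaln(x + 1).sum()

print(mu_grid[np.argmax(logL)], x.mean())              # both close to the true mean
```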
General properties of MLM
• For large data samples (large n) the
likelihood function, L, approaches a
Gaussian distribution.
• Maximum likelihood estimates are usually
consistent.
– For large n the estimates converge to the true
value of the parameters we wish to determine.
• Maximum likelihood estimates are usually
unbiased.
– For all sample sizes, the estimate equals the true parameter value on average (over repeated samples).
General properties of MLM
• Maximum likelihood estimates are efficient: asymptotically, the estimate attains the smallest possible variance (the Cramér–Rao bound).
• Maximum likelihood estimate is sufficient: it
uses all the information in the observations
(the xi’s).
• The solution from MLM is unique.
• Bad news: we must know the correct
probability distribution for the problem at
hand!
Maximum Likelihood
• We maximize the likelihood function
  L(\theta; Z) = \prod_{i=1}^{N} g_\theta(z_i)
• Log-likelihood function
  \ell(\theta; Z) = \sum_{i=1}^{N} \ln g_\theta(z_i)
Score Function
• Assess the precision of \hat\theta using the likelihood function
• The score function is
  \dot\ell(\theta; Z) = \sum_{i=1}^{N} \frac{\partial \ell(\theta; z_i)}{\partial \theta}
• Assume that L takes its maximum in the interior of the parameter space. Then
  \dot\ell(\hat\theta; Z) = 0
Likelihood Function
• We maximize the likelihood function L(\theta; Z)
• We omit the normalization constant, since it only adds a constant factor
• Think of L as a function of \theta with Z fixed
• Log-likelihood function: \ell(\theta; Z) = \ln L(\theta; Z)
Fisher Information
• The negative sum of second derivatives is the information matrix
  I(\theta) = -\sum_{i=1}^{N} \frac{\partial^2 \ell(\theta; z_i)}{\partial \theta \, \partial \theta^T}
• I(\hat\theta) is called the observed information; it should be greater than 0 (positive definite).
• The Fisher information (expected information) is
  i(\theta) = E_\theta\!\left[ I(\theta) \right]
Sampling Theory
• Basic result of sampling theory
• The sampling distribution of the maximum-likelihood estimator approaches the following normal distribution as N \to \infty:
  \hat\theta \;\to\; N\!\left( \theta_0,\; i(\theta_0)^{-1} \right)
• When we sample independently from the true density g_{\theta_0}(z)
• This suggests approximating the sampling distribution of \hat\theta with
  N\!\left( \hat\theta,\; i(\hat\theta)^{-1} \right) \quad \text{or} \quad N\!\left( \hat\theta,\; I(\hat\theta)^{-1} \right)
Error Bound
• The corresponding error estimates (standard errors) are obtained from
  \sqrt{\left[ i(\hat\theta)^{-1} \right]_{jj}} \quad \text{and} \quad \sqrt{\left[ I(\hat\theta)^{-1} \right]_{jj}}
• The confidence points have the form
  \hat\theta_j \pm z^{(1-\alpha)} \sqrt{\left[ I(\hat\theta)^{-1} \right]_{jj}},
  where z^{(1-\alpha)} is the 1-\alpha percentile of the standard normal distribution.
Simplified form of the Fisher information
Suppose, in addition, that the operations of integration and
differentiation can be swapped for the second derivative of f(x;θ) as
well, i.e.,
In this case, it can be shown that the Fisher information equals
  I(\theta) = -E\!\left[ \frac{\partial^2}{\partial \theta^2} \ln f(X;\theta) \right].
The Cramér–Rao bound can then be written as
  \mathrm{var}(\hat\theta) \;\ge\; \frac{1}{I(\theta)} = \frac{1}{-E\!\left[ \frac{\partial^2}{\partial \theta^2} \ln f(X;\theta) \right]}.
Single-parameter proof
Let X be a random variable with probability density function f(x;θ). Here T = t(X) is a statistic, which is used as an estimator for ψ(θ); that is, if the expectation of T is denoted by ψ(θ), then, for all θ,
  E[t(X)] = \psi(\theta).
If V is the score, i.e.
  V = \frac{\partial}{\partial\theta} \ln f(X;\theta),
then the expectation of V, written E(V), is zero. If we consider the covariance cov(V,T) of V and T, we have cov(V,T) = E(VT), because E(V) = 0. Expanding this expression we have
  \mathrm{cov}(V,T) = E\!\left[ t(X)\, \frac{\partial}{\partial\theta} \ln f(X;\theta) \right] = \int t(x)\, f(x;\theta)\, \frac{\partial}{\partial\theta} \ln f(x;\theta)\, dx.
Using the chain rule
  \frac{\partial}{\partial\theta} \ln f(x;\theta) = \frac{1}{f(x;\theta)} \frac{\partial f(x;\theta)}{\partial\theta}
and the definition of expectation gives, after cancelling f(x;θ),
  \mathrm{cov}(V,T) = \int t(x)\, \frac{\partial f(x;\theta)}{\partial\theta}\, dx = \frac{\partial}{\partial\theta} \int t(x)\, f(x;\theta)\, dx = \psi'(\theta),
because the integration and differentiation operations commute (second condition).
The Cauchy–Schwarz inequality shows that
  \sqrt{\mathrm{var}(T)\,\mathrm{var}(V)} \;\ge\; \left| \mathrm{cov}(V,T) \right| = \left| \psi'(\theta) \right|.
Therefore
  \mathrm{var}(T) \;\ge\; \frac{[\psi'(\theta)]^2}{\mathrm{var}(V)} = \frac{[\psi'(\theta)]^2}{I(\theta)},
which is the Cramér–Rao bound.
An Example
• Consider a linear expansion
  \mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)
• The least-squares solution
  \hat\beta = (H^T H)^{-1} H^T y
• The covariance of \hat\beta
  \mathrm{Cov}(\hat\beta) = (H^T H)^{-1} \hat\sigma^2, \qquad \hat\sigma^2 = \sum_i \left( y_i - \hat\mu(x_i) \right)^2 / N
An Example
• Consider the prediction model
  \hat\mu(x) = \sum_{j=1}^{N} \hat\beta_j h_j(x)
• The standard deviation (standard error) of the fit is
  se[\hat\mu(x)] = \left[ h(x)^T (H^T H)^{-1} h(x) \right]^{1/2} \hat\sigma
• The confidence region
  \hat\mu(x) \pm 1.96 \, se[\hat\mu(x)]
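Continuing the NumPy sketch from the basis-expansion example above (same hypothetical H, beta_hat and sigma2_hat), the pointwise standard error and an approximate 95% band could be computed like this:

```python
import numpy as np

def mu_band(H, H_new, beta_hat, sigma2_hat):
    """Fit, lower and upper ~95% confidence limits for mu_hat(x).

    H_new: rows are the basis vectors h(x)^T at the evaluation points.
    """
    HtH_inv = np.linalg.inv(H.T @ H)
    mu_hat = H_new @ beta_hat
    # se[mu_hat(x)] = sqrt(h(x)^T (H^T H)^{-1} h(x)) * sigma_hat
    se = np.sqrt(np.einsum("ij,jk,ik->i", H_new, HtH_inv, H_new) * sigma2_hat)
    return mu_hat, mu_hat - 1.96 * se, mu_hat + 1.96 * se
```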
Bayesian Methods
• Given a sampling model \Pr(Z \mid \theta) and a prior \Pr(\theta) for the parameters, estimate the posterior probability
  \Pr(\theta \mid Z) = \frac{\Pr(Z \mid \theta)\, \Pr(\theta)}{\int \Pr(Z \mid \theta')\, \Pr(\theta')\, d\theta'}
• By drawing samples or estimating its mean or
mode
• Differences from mere counting (the frequentist approach):
– Prior: allow for uncertainties present before seeing the
data
– Posterior: allow for uncertainties present after seeing the
data
Bayesian Methods
• The posterior distribution also affords a predictive distribution for seeing future values z^{new}:
  \Pr(z^{new} \mid Z) = \int \Pr(z^{new} \mid \theta)\, \Pr(\theta \mid Z)\, d\theta
• In contrast, the maximum-likelihood approach would predict future data on the basis of \Pr(z^{new} \mid \hat\theta), not accounting for the uncertainty in the parameters
An Example
• Consider a linear expansion
  \mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)
• The least-squares solution
  \hat\beta = (H^T H)^{-1} H^T y
• Multivariate Gaussian density:
  p(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2} x^T \Sigma^{-1} x}
• The posterior distribution for \beta is also Gaussian; for a Gaussian prior \beta \sim N(0, \tau\Sigma), its mean and covariance are
  E(\beta \mid Z) = \left( H^T H + \frac{\sigma^2}{\tau} \Sigma^{-1} \right)^{-1} H^T y, \qquad
  \mathrm{cov}(\beta \mid Z) = \left( H^T H + \frac{\sigma^2}{\tau} \Sigma^{-1} \right)^{-1} \sigma^2
• The corresponding posterior values for \mu(x),
  E(\mu(x) \mid Z) = h(x)^T E(\beta \mid Z), \qquad
  \mathrm{var}(\mu(x) \mid Z) = h(x)^T \mathrm{cov}(\beta \mid Z)\, h(x)
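A minimal NumPy sketch of these posterior formulas, assuming the setup y | \beta ~ N(H\beta, \sigma^2 I) with Gaussian prior \beta ~ N(0, \tau\Sigma); the argument names are illustrative, not from the slides:

```python
import numpy as np

def gaussian_posterior(H, y, sigma2, tau, Sigma):
    """Posterior mean and covariance of beta under the prior N(0, tau * Sigma)."""
    A_inv = np.linalg.inv(H.T @ H + (sigma2 / tau) * np.linalg.inv(Sigma))
    beta_mean = A_inv @ H.T @ y      # E[beta | Z]
    beta_cov = A_inv * sigma2        # cov[beta | Z]
    return beta_mean, beta_cov

# By linearity, E[mu(x) | Z] = h(x)^T beta_mean and
# var[mu(x) | Z] = h(x)^T beta_cov h(x).
```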
Bootstrap vs Bayesian
• The bootstrap mean is an approximate
posterior average
• Simple example:
  – A single observation z is drawn from a normal distribution, z \sim N(\theta, 1)
  – Assume a normal prior for \theta: \theta \sim N(0, \tau)
  – Resulting posterior distribution:
    \theta \mid z \;\sim\; N\!\left( \frac{z}{1 + 1/\tau},\; \frac{1}{1 + 1/\tau} \right)
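A short simulation sketch (all values assumed for illustration) comparing the parametric-bootstrap distribution with this posterior as the prior becomes noninformative (\tau \to \infty): both approach N(z, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
z = 1.7                                    # the single observed value (made up)

# Parametric bootstrap: resample z* ~ N(theta_hat, 1) with theta_hat = z
boot = rng.normal(loc=z, scale=1.0, size=100_000)

# Posterior draws for a nearly noninformative prior (very large tau)
tau = 1e6
post = rng.normal(loc=z / (1 + 1 / tau), scale=np.sqrt(1 / (1 + 1 / tau)), size=100_000)

print(boot.mean(), post.mean())            # both approximately z
print(boot.std(), post.std())              # both approximately 1
```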
Bootstrap vs Bayesian
• Three ingredients make this work
  – The choice of a noninformative prior for \theta
  – The dependence of the log-likelihood \ell(\theta; Z) on the data Z only through the maximum-likelihood estimate \hat\theta; thus we can write the log-likelihood as \ell(\theta; \hat\theta)
  – The symmetry of the log-likelihood in \theta and \hat\theta, i.e. \ell(\theta; \hat\theta) = \ell(\hat\theta; \theta) + \text{const}
Bootstrap vs Bayesian
• The bootstrap distribution represents an (approximate)
nonparametric, noninformative posterior distribution for our
parameter.
• But this bootstrap distribution is obtained painlessly without
having to formally specify a prior and without having to
sample from the posterior distribution.
• Hence we might think of the bootstrap distribution as a
"poor man's" Bayes posterior. By perturbing the data, the
bootstrap approximates the Bayesian effect of perturbing
the parameters, and is typically much simpler to carry out.
The EM Algorithm
• The EM algorithm for two-component
Gaussian mixtures
– Take initial guesses \hat\pi, \hat\mu_1, \hat\sigma_1^2, \hat\mu_2, \hat\sigma_2^2 for the parameters
– Expectation Step: Compute the responsibilities
The EM Algorithm
– Maximization Step: Compute the weighted
means and variances
– Iterate 2 and 3 until convergence
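A compact NumPy sketch of this two-component mixture EM (toy data; gamma below holds the responsibilities of component 2):

```python
import numpy as np

def normal_pdf(y, mu, var):
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_two_gaussians(y, n_iter=100):
    # 1. Initial guesses for the parameters
    mu1, mu2 = np.percentile(y, [25, 75])
    var1 = var2 = y.var()
    pi = 0.5
    for _ in range(n_iter):
        # 2. Expectation step: responsibilities of component 2
        p1 = (1 - pi) * normal_pdf(y, mu1, var1)
        p2 = pi * normal_pdf(y, mu2, var2)
        gamma = p2 / (p1 + p2)
        # 3. Maximization step: weighted means, variances, mixing proportion
        mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
        var1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
        mu2 = np.sum(gamma * y) / np.sum(gamma)
        var2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
        pi = gamma.mean()
    return mu1, var1, mu2, var2, pi

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 0.7, 200)])   # toy mixture
print(em_two_gaussians(y))
```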
The EM Algorithm in General
• Also known as the Baum–Welch algorithm (its name in the hidden-Markov-model literature)
• Applicable to problems for which maximizing the log-likelihood is difficult, but which are simplified by enlarging the sample set with unobserved (latent) data (data augmentation)
The EM Algorithm in General
• Start with initial parameters \hat\theta^{(0)}
• Expectation Step: at the j-th step, compute
  Q(\theta', \hat\theta^{(j)}) = E\!\left[ \ell_0(\theta'; T) \mid Z, \hat\theta^{(j)} \right]
  as a function of the dummy argument \theta' (here T denotes the complete data, i.e. the observed data Z augmented with the latent data, and \ell_0 is its log-likelihood)
• Maximization Step: determine the new parameters \hat\theta^{(j+1)} by maximizing Q(\theta', \hat\theta^{(j)}) over \theta'
• Iterate steps 2 and 3 until convergence
Model Averaging and Stacking
• Given predictions \hat f_1(x), \hat f_2(x), \ldots, \hat f_M(x)
• Under squared-error loss, seek weights w = (w_1, \ldots, w_M)
• Such that
  \hat w = \operatorname{argmin}_{w} \; E_{\mathcal{P}}\!\left[ Y - \sum_{m=1}^{M} w_m \hat f_m(x) \right]^2
• Here the input x is fixed and the N observations in Z are distributed according to \mathcal{P}
Model Averaging and Stacking
• The solution is the population linear
regression of Y on
namely
• Now the full regression has smaller error,
namely
• Population linear regression is not available,
thus replace it by the linear regression over
the training set
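A minimal NumPy sketch of this idea with two hypothetical base models; real stacking would use cross-validated (leave-one-out) predictions, which this training-set version omits:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)   # toy data

# Two base models: a straight line and a degree-5 polynomial
f1 = np.polyval(np.polyfit(x, y, 1), x)
f2 = np.polyval(np.polyfit(x, y, 5), x)

# Combining weights: linear regression of y on the model predictions
F = np.column_stack([f1, f2])
w, *_ = np.linalg.lstsq(F, y, rcond=None)

y_stack = F @ w
print(w, np.mean((y - y_stack) ** 2))
```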