Model Inference and Averaging
Prof. Liqing Zhang
Dept. Computer Science & Engineering,
Shanghai Jiaotong University
Contents
• The Bootstrap and Maximum Likelihood Methods
• Bayesian Methods
• Relationship Between the Bootstrap and Bayesian
Inference
• The EM Algorithm
• MCMC for Sampling from the Posterior
• Bagging
• Model Averaging and Stacking
• Stochastic Search: Bumping
Bootstrap by Basis Expansions
• Consider a linear expansion
  $\mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)$
• The least-squares solution
  $\hat\beta = (H^T H)^{-1} H^T y$
• The covariance of $\hat\beta$
  $\widehat{\mathrm{Cov}}(\hat\beta) = (H^T H)^{-1}\,\hat\sigma^2$, where $\hat\sigma^2 = \sum_i (y_i - \hat\mu(x_i))^2 / N$
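Below is a minimal numeric sketch of this slide (written for these notes, not taken from them): it fits a basis expansion by least squares, computes the analytic covariance above, and compares it with a nonparametric bootstrap. The cubic-polynomial basis and the synthetic data are purely illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

def basis(x):
    # h_j(x): a simple cubic polynomial basis 1, x, x^2, x^3 (illustrative choice)
    return np.vstack([np.ones_like(x), x, x**2, x**3]).T

H = basis(x)
beta_hat = np.linalg.solve(H.T @ H, H.T @ y)      # (H^T H)^{-1} H^T y
resid = y - H @ beta_hat
sigma2_hat = np.sum(resid**2) / len(y)            # sigma_hat^2
cov_beta = np.linalg.inv(H.T @ H) * sigma2_hat    # (H^T H)^{-1} sigma_hat^2

# Nonparametric bootstrap: resample (x_i, y_i) pairs and refit.
B = 500
boot = np.empty((B, H.shape[1]))
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))
    Hb, yb = H[idx], y[idx]
    boot[b] = np.linalg.solve(Hb.T @ Hb, Hb.T @ yb)

print(np.sqrt(np.diag(cov_beta)))  # analytic standard errors of beta_hat
print(boot.std(axis=0))            # bootstrap standard errors (comparable)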
Parametric Model
• Assume a parameterized probability density
(parametric model) for observations
Maximum Likelihood Inference
• Suppose we are trying to measure the true value of some quantity x_T.
  – We make repeated measurements of this quantity {x_1, x_2, …, x_n}.
  – The standard way to estimate x_T from our measurements is to calculate the mean value
    $\mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i$
    and set x_T = μ_x.
• DOES THIS PROCEDURE MAKE SENSE?
• The maximum likelihood method (MLM) answers this question and provides a general method for estimating parameters of interest from data.
The Maximum Likelihood Method
• Statement of the Maximum Likelihood
Method
– Assume we have made N measurements of x: {x_1, x_2, …, x_N}.
– Assume we know the probability distribution
function that describes x: f(x, α).
– Assume we want to determine the parameter α.
• MLM: pick the α that maximizes the probability of
getting the measurements (the x_i's) we did!
The MLM Implementation
• The probability of measuring x_1 is $f(x_1, \alpha)\,dx$
• The probability of measuring x_2 is $f(x_2, \alpha)\,dx$
• The probability of measuring x_n is $f(x_n, \alpha)\,dx$
• If the measurements are independent, the probability of getting the measurements we did is:
  $L = f(x_1,\alpha)dx \cdot f(x_2,\alpha)dx \cdots f(x_n,\alpha)dx = f(x_1,\alpha)\cdot f(x_2,\alpha)\cdots f(x_n,\alpha)\,[dx^n]$
• We can drop the $dx^n$ term as it is only a proportionality constant.
• L is called the likelihood function:
  $L = \prod_{i=1}^{N} f(x_i, \alpha)$
Log Maximum Likelihood Method
• We want to pick the α that maximizes L:
  $\left.\frac{\partial L}{\partial \alpha}\right|_{\alpha=\alpha^*} = 0$
  – Often it is easier to maximize ln L.
  – L and ln L are both maximal at the same location.
• We maximize ln L rather than L itself because ln L converts the product into a summation:
  $\ln L = \sum_{i=1}^{N} \ln f(x_i, \alpha)$
Log Maximum Likelihood Method
• The new maximization condition is:
  $\left.\frac{\partial \ln L}{\partial \alpha}\right|_{\alpha=\alpha^*} = \sum_{i=1}^{N}\left.\frac{\partial \ln f(x_i, \alpha)}{\partial \alpha}\right|_{\alpha=\alpha^*} = 0$
• α could be an array of parameters (e.g.
slope and intercept) or just a single variable.
• The equations that determine α range from simple
linear equations to coupled non-linear
equations.
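As a small illustration (added for these notes, not part of the slides), the same condition can be imposed numerically by minimizing the negative log-likelihood; the exponential model and the synthetic data below are assumptions chosen only for illustration.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=200)   # data drawn from an assumed exponential model

def neg_log_likelihood(lam):
    # -ln L(lambda) for f(x; lambda) = lambda * exp(-lambda * x)
    return -(x.size * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1.0 / x.mean())   # numerical MLE vs. the closed-form answer 1/mean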
An Example: Gaussian
• Let f(x, α) be given by a Gaussian distribution
function.
• Let α = μ be the mean of the Gaussian. We
want to use our data + MLM to find the mean.
• We want the best estimate of α from our set
of n measurements {x_1, x_2, …, x_n}.
• Let's assume that σ is the same for each
measurement.
An Example: Gaussian
• Gaussian PDF:
  $f(x_i, \mu) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$
• The likelihood function for this problem is:
  $L = \prod_{i=1}^{n} f(x_i, \mu) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} e^{-\frac{(x_1-\mu)^2}{2\sigma^2}}\, e^{-\frac{(x_2-\mu)^2}{2\sigma^2}} \cdots e^{-\frac{(x_n-\mu)^2}{2\sigma^2}} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} e^{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}}$
An Example: Gaussian
$\ln L = \ln \prod_{i=1}^{n} f(x_i, \mu) = \ln\!\left(\left[\frac{1}{\sigma\sqrt{2\pi}}\right]^{n} e^{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}}\right) = n\ln\!\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}$
• We want to find the α = μ that maximizes the log-likelihood function:
  $\frac{\partial \ln L}{\partial \mu} = \frac{\partial}{\partial \mu}\left[\, n\ln\!\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right] = 0$
  $-\frac{1}{2\sigma^2}\sum_{i=1}^{n} 2(x_i-\mu)(-1) = 0 \;\Rightarrow\; \sum_{i=1}^{n}(x_i-\mu) = 0 \;\Rightarrow\; \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
An Example: Gaussian
• If the σ_i are different for each data point, then μ is the weighted average:
  $\mu = \frac{\sum_{i=1}^{n} x_i/\sigma_i^2}{\sum_{i=1}^{n} 1/\sigma_i^2}$   (weighted average)
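A one-line numeric check of this formula (a sketch added for these notes; the measurements and their uncertainties are made-up values):

import numpy as np

x = np.array([9.8, 10.4, 10.1])        # hypothetical measurements
sigma = np.array([0.5, 1.0, 0.2])      # assumed per-measurement uncertainties
w = 1.0 / sigma**2
mu = np.sum(w * x) / np.sum(w)         # inverse-variance weighted average
print(mu)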
An Example: Poisson
• Let f(x, α) be given by a Poisson distribution.
• Let α = μ be the mean of the Poisson.
• We want the best estimate of α from our set
of n measurements {x_1, x_2, …, x_n}.
• Poisson PDF:
  $f(x, \mu) = \frac{e^{-\mu}\,\mu^{x}}{x!}$
An Example: Poisson
• The likelihood function for this problem is:
  $L = \prod_{i=1}^{n} f(x_i, \mu) = \prod_{i=1}^{n} \frac{e^{-\mu}\,\mu^{x_i}}{x_i!} = \frac{e^{-\mu}\mu^{x_1}}{x_1!}\cdot\frac{e^{-\mu}\mu^{x_2}}{x_2!}\cdots\frac{e^{-\mu}\mu^{x_n}}{x_n!} = \frac{e^{-n\mu}\,\mu^{\sum_{i=1}^{n} x_i}}{x_1!\, x_2!\cdots x_n!}$
An Example: Poisson
• Find the α = μ that maximizes the log-likelihood function:
  $\frac{d \ln L}{d\mu} = \frac{d}{d\mu}\left[-n\mu + \ln\mu \sum_{i=1}^{n} x_i - \ln(x_1!\, x_2! \cdots x_n!)\right] = -n + \frac{1}{\mu}\sum_{i=1}^{n} x_i = 0 \;\Rightarrow\; \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
  i.e. the sample average.
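A quick numeric check of this result (a sketch added for these notes; the count data are simulated under an assumed rate):

import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
x = rng.poisson(lam=3.7, size=100)                     # hypothetical count data
mus = np.linspace(0.5, 8.0, 751)                       # grid of candidate means
loglik = np.array([poisson.logpmf(x, mu).sum() for mu in mus])
print(mus[np.argmax(loglik)], x.mean())                # grid maximizer ~ sample mean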
General properties of MLM
• For large data samples (large n) the
likelihood function, L, approaches a
Gaussian distribution.
• Maximum likelihood estimates are usually
consistent.
– For large n the estimates converge to the true
value of the parameters we wish to determine.
• Maximum likelihood estimates are usually
unbiased.
– In many problems the estimate is centered on the true
value of the parameter; in general any bias vanishes as n grows.
General properties of MLM
• The maximum likelihood estimate is efficient: asymptotically it
attains the smallest possible variance (the Cramér–Rao bound).
• The maximum likelihood estimate is sufficient: it
uses all the information in the observations
(the x_i's).
• The solution from MLM is usually unique.
• Bad news: we must know the correct
probability distribution for the problem at
hand!
Maximum Likelihood
• We maximize the likelihood function
• Log-likelihood function
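The formulas on this slide did not survive extraction. A standard reconstruction, writing g_θ for the assumed density of an observation z_i (notation as in Hastie et al., The Elements of Statistical Learning, which these slides appear to follow):
  $L(\theta; Z) = \prod_{i=1}^{N} g_\theta(z_i), \qquad \ell(\theta; Z) = \sum_{i=1}^{N} \ell(\theta; z_i) = \sum_{i=1}^{N} \ln g_\theta(z_i)$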
Score Function
• Assess the precision of θ̂ using the likelihood
function.
• Assume that L takes its maximum in the
interior of the parameter space. Then the score condition below holds.
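A reconstruction of the missing formulas (standard definitions, in the same notation as above): the score for one observation and the first-order condition at the interior maximum are
  $\dot\ell(\theta; z_i) = \frac{\partial \ell(\theta; z_i)}{\partial \theta}, \qquad \dot\ell(\hat\theta; Z) = \sum_{i=1}^{N} \dot\ell(\hat\theta; z_i) = 0$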
Likelihood Function
• We maximize the likelihood function.
• We omit the normalization, since it only adds a
constant factor.
• Think of L as a function of θ with Z fixed.
• Log-likelihood function
Fisher Information
• The negative sum of second derivatives is the
information matrix I(θ).
• I(θ̂) is called the observed information; it should be
greater than 0.
• The Fisher information (expected information) is the expectation of I(θ) (see below).
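The slide's formulas were lost; the standard definitions (a reconstruction) are
  $I(\theta) = -\sum_{i=1}^{N} \frac{\partial^2 \ell(\theta; z_i)}{\partial \theta\,\partial \theta^T}, \qquad I(\hat\theta) \ \text{(observed information)}, \qquad i(\theta) = \mathrm{E}_\theta\!\left[I(\theta)\right] \ \text{(Fisher information)}$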
Sampling Theory
• Basic result of sampling theory:
• The sampling distribution of the maximum-likelihood
estimator approaches the following
normal distribution as N → ∞.
• This holds when we sample independently from the model density.
• This suggests approximating the distribution of θ̂
with a normal distribution centered at the estimate (see below).
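The statements behind this slide, written out (a reconstruction of the lost formulas; θ₀ denotes the true parameter value):
  $\hat\theta \;\to\; N\!\left(\theta_0,\; i(\theta_0)^{-1}\right) \quad (N \to \infty)$
which suggests the approximations
  $\hat\theta \sim N\!\left(\hat\theta,\; i(\hat\theta)^{-1}\right) \quad \text{or} \quad \hat\theta \sim N\!\left(\hat\theta,\; I(\hat\theta)^{-1}\right)$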
Error Bound
• The corresponding error estimates are
obtained from the diagonal of the inverse information matrix (see below).
• The confidence points have the form given below.
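In the same notation, the standard error and confidence-point expressions (a reconstruction) are
  $\widehat{\mathrm{se}}(\hat\theta_j) = \sqrt{\left[i(\hat\theta)^{-1}\right]_{jj}} \ \ \text{or}\ \ \sqrt{\left[I(\hat\theta)^{-1}\right]_{jj}}, \qquad \hat\theta_j \pm z^{(1-\alpha)}\,\widehat{\mathrm{se}}(\hat\theta_j)$
where $z^{(1-\alpha)}$ is the $1-\alpha$ percentile of the standard normal distribution.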
Simplified form of the Fisher information
Suppose, in addition, that the operations of integration and
differentiation can be swapped for the second derivative of f(x;θ) as
well, i.e.,
In this case, it can be shown that the Fisher information equals
The Cramér–Rao bound can then be written as
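The formulas referred to in this passage did not survive extraction; a reconstruction of the standard results is
  $I(\theta) = -\,\mathrm{E}\!\left[\frac{\partial^2}{\partial \theta^2} \ln f(X;\theta)\right], \qquad \operatorname{var}(\hat\theta) \;\ge\; \frac{1}{I(\theta)} \quad \text{for any unbiased estimator } \hat\theta \text{ of } \theta$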
Single-parameter proof
Let X be a random variable with probability density function f(x;θ).
Here T = t(X) is a statistic, which is used as an estimator for ψ(θ) = E(T);
the claim is that, for all θ, var(T) ≥ [ψ′(θ)]² / I(θ). If
V is the score, i.e. V = ∂/∂θ ln f(X;θ),
then the expectation of V, written E(V), is zero. If we consider the
covariance cov(V,T) of V and T, we have cov(V,T) = E(VT), because
E(V) = 0. Expanding this expression we have
Using the chain rule
and the definition of expectation gives, after cancelling f(x;θ),
because the integration and differentiation operations
commute (second condition).
The Cauchy-Schwarz inequality shows that
Therefore
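The chain of steps, written out (a reconstruction of the standard proof; the slide's own formulas were lost): with V the score and T = t(X),
  $\operatorname{cov}(V,T) = \mathrm{E}(VT) = \int t(x)\,\frac{\partial}{\partial\theta}\ln f(x;\theta)\; f(x;\theta)\,dx = \int t(x)\,\frac{\partial f(x;\theta)}{\partial\theta}\,dx = \frac{\partial}{\partial\theta}\int t(x)\, f(x;\theta)\,dx = \psi'(\theta)$
and, by Cauchy–Schwarz,
  $\left[\operatorname{cov}(V,T)\right]^2 \le \operatorname{var}(V)\,\operatorname{var}(T)$
so that
  $\operatorname{var}(T) \;\ge\; \frac{[\psi'(\theta)]^2}{\operatorname{var}(V)} = \frac{[\psi'(\theta)]^2}{I(\theta)}$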
An Example
• Consider a linear expansion
  $\mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)$
• The least-squares solution
  $\hat\beta = (H^T H)^{-1} H^T y$
• The covariance of $\hat\beta$
  $\widehat{\mathrm{Cov}}(\hat\beta) = (H^T H)^{-1}\,\hat\sigma^2$, where $\hat\sigma^2 = \sum_i (y_i - \hat\mu(x_i))^2 / N$
An Example
• Consider the prediction model
  $\hat\mu(x) = \sum_{j=1}^{N} \hat\beta_j h_j(x)$
• The standard error of a prediction is
  $\widehat{\mathrm{se}}[\hat\mu(x)] = \left[h(x)^T (H^T H)^{-1} h(x)\right]^{1/2} \hat\sigma$
• The approximate 95% confidence region is
  $\hat\mu(x) \pm 1.96\,\widehat{\mathrm{se}}[\hat\mu(x)]$
Bayesian Methods
• Given a sampling model Pr(Z | θ) and a prior Pr(θ)
for the parameters, estimate the posterior
probability Pr(θ | Z)
• By drawing samples from it or estimating its mean or
mode
• Differences from the frequentist approach:
– Prior: allows for uncertainties present before seeing the
data
– Posterior: allows for uncertainties present after seeing the
data
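The posterior referred to here, written out in standard form (a reconstruction of the lost formula):
  $\Pr(\theta \mid Z) = \dfrac{\Pr(Z \mid \theta)\,\Pr(\theta)}{\int \Pr(Z \mid \theta)\,\Pr(\theta)\,d\theta}$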
Bayesian Methods
• The posterior distribution also affords a
predictive distribution for future values z^new
• In contrast, the max-likelihood approach
would predict future data on the basis of the
plug-in density Pr(z^new | θ̂), not accounting for the uncertainty
in the parameters
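Written out (a reconstruction of the two formulas that did not survive extraction):
  $\Pr(z^{\mathrm{new}} \mid Z) = \int \Pr(z^{\mathrm{new}} \mid \theta)\,\Pr(\theta \mid Z)\,d\theta \qquad \text{versus the plug-in prediction} \qquad \Pr(z^{\mathrm{new}} \mid \hat\theta)$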
An Example
• Consider a linear expansion
  $\mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)$
• Place a Gaussian prior on the coefficients β; recall the p-dimensional
Gaussian density with mean μ and covariance Σ,
  $p(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)}$
• The posterior distribution for β is also Gaussian, with mean and
covariance given below.
• The corresponding posterior values for μ(x) follow by linearity (see below).
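A sketch of these quantities, assuming the setting of The Elements of Statistical Learning (prior β ~ N(0, τΣ) and Gaussian noise with variance σ²; both are assumptions about what the slide intended):
  $\mathrm{E}(\beta \mid Z) = \left(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\right)^{-1} H^T y, \qquad \operatorname{cov}(\beta \mid Z) = \left(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\right)^{-1}\sigma^2$
and correspondingly
  $\mathrm{E}(\mu(x) \mid Z) = h(x)^T\, \mathrm{E}(\beta \mid Z), \qquad \operatorname{var}(\mu(x) \mid Z) = h(x)^T \operatorname{cov}(\beta \mid Z)\, h(x)$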
Bootstrap vs Bayesian
• The bootstrap mean is an approximate
posterior average
• Simple example:
– Single observation z drawn from a normal
distribution
– Assume a normal prior for θ
– Resulting posterior distribution
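Filling in the formulas for this example under the assumptions used in The Elements of Statistical Learning (z | θ ~ N(θ, 1) and prior θ ~ N(0, τ); an assumed setup, since the slide's own formulas were lost):
  $z \mid \theta \sim N(\theta, 1), \qquad \theta \sim N(0, \tau) \quad\Longrightarrow\quad \theta \mid z \;\sim\; N\!\left(\frac{z}{1 + 1/\tau},\; \frac{1}{1 + 1/\tau}\right)$
which tends to N(z, 1) as the prior becomes noninformative (τ → ∞).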
Bootstrap vs Bayesian
• Three ingredients make this work
– The choice of a noninformative prior for θ
– The dependence of the log-likelihood ℓ(θ; Z) on Z only through
the max-likelihood estimate θ̂; thus we can write it as ℓ(θ; θ̂)
– The symmetry of the log-likelihood in θ and θ̂
Bootstrap vs Bayesian
• The bootstrap distribution represents an (approximate)
nonparametric, noninformative posterior distribution for our
parameter.
• But this bootstrap distribution is obtained painlessly without
having to formally specify a prior and without having to
sample from the posterior distribution.
• Hence we might think of the bootstrap distribution as a
"poor man's" Bayes posterior. By perturbing the data, the
bootstrap approximates the Bayesian effect of perturbing
the parameters, and is typically much simpler to carry out.
The EM Algorithm
• The EM algorithm for two-component
Gaussian mixtures
– Take initial guesses π̂, μ̂_1, σ̂_1², μ̂_2, σ̂_2² for the
parameters
– Expectation Step: Compute the responsibilities
The EM Algorithm
– Maximization Step: Compute the weighted
means and variances
– Iterate 2 and 3 until convergence
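A compact sketch of these two steps for a two-component Gaussian mixture (an illustration written for these notes, not code from the slides; the synthetic data and the simple initial guesses are assumptions):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0.0, 1.0, 120), rng.normal(4.0, 0.7, 80)])

# initial guesses for pi, mu1, sigma1^2, mu2, sigma2^2
pi, mu1, var1, mu2, var2 = 0.5, y.min(), y.var(), y.max(), y.var()

for _ in range(200):
    # E-step: responsibility gamma_i of component 2 for each observation
    p1 = (1 - pi) * norm.pdf(y, mu1, np.sqrt(var1))
    p2 = pi * norm.pdf(y, mu2, np.sqrt(var2))
    gamma = p2 / (p1 + p2)
    # M-step: weighted means, variances, and mixing proportion
    mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
    var1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
    mu2 = np.sum(gamma * y) / np.sum(gamma)
    var2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
    pi = gamma.mean()

print(round(mu1, 2), round(mu2, 2), round(pi, 2))   # recovered means and mixing weight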
The EM Algorithm in General
• Also known as the Baum–Welch algorithm (in the hidden Markov model literature).
• Applicable to problems for which maximizing
the log-likelihood is difficult, but is simplified by
enlarging the sample set with unobserved
(latent) data (data augmentation).
The EM Algorithm in General
The EM Algorithm in General
• Start with initial parameters θ̂^(0)
• Expectation Step: at the j-th step, compute Q(θ′, θ̂^(j))
as a function of the dummy argument θ′
• Maximization Step: determine the new
parameters θ̂^(j+1) by maximizing Q(θ′, θ̂^(j)) over θ′
• Iterate steps 2 and 3 until convergence
Model Averaging and Stacking
• Given predictions f̂_1(x), f̂_2(x), …, f̂_M(x)
• Under squared-error loss, seek weights w = (w_1, …, w_M)
• Such that
  $\hat{w} = \arg\min_{w} \mathrm{E}_P\Big[\,Y - \sum_{m=1}^{M} w_m \hat{f}_m(x)\Big]^2$
• Here the input x is fixed and the N
observations in Z are distributed according to
P
Model Averaging and Stacking
• The solution is the population linear
regression of Y on F̂(x) ≡ [f̂_1(x), …, f̂_M(x)]^T, namely
  $\hat{w} = \mathrm{E}_P\big[\hat{F}(x)\hat{F}(x)^T\big]^{-1}\,\mathrm{E}_P\big[\hat{F}(x)\, Y\big]$
• The full regression has smaller population error than any single model,
namely
  $\mathrm{E}_P\Big[\,Y - \sum_{m} \hat{w}_m \hat{f}_m(x)\Big]^2 \;\le\; \mathrm{E}_P\big[\,Y - \hat{f}_m(x)\big]^2 \quad \text{for each } m$
• The population linear regression is not available,
thus we replace it by the linear regression over
the training set
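A small illustrative sketch of this training-set regression (the candidate models and the synthetic data are assumptions; in practice stacking typically uses cross-validated predictions, which this sketch omits):

import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Two candidate models: a linear fit and a cubic fit
f1 = np.polyval(np.polyfit(x, y, 1), x)
f2 = np.polyval(np.polyfit(x, y, 3), x)

F = np.column_stack([f1, f2])                 # stacked predictions f_m(x_i)
w = np.linalg.lstsq(F, y, rcond=None)[0]      # least-squares combination weights
blend = F @ w
print(w, np.mean((y - blend) ** 2))           # weights and training error of the blend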