Model Inference and Averaging
Prof. Liqing Zhang
Dept. Computer Science & Engineering,
Shanghai Jiaotong University
Contents
• The Bootstrap and Maximum Likelihood Methods
• Bayesian Methods
• Relationship Between the Bootstrap and Bayesian Inference
• The EM Algorithm
• MCMC for Sampling from the Posterior
• Bagging
• Model Averaging and Stacking
• Stochastic Search: Bumping
Bootstrap by Basis Expansions
• Consider a linear expansion

  $\mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)$

• The least-squares solution

  $\hat\beta = (H^T H)^{-1} H^T y$

• The covariance of β̂

  $\widehat{\mathrm{cov}}(\hat\beta) = (H^T H)^{-1}\,\hat\sigma^2, \qquad \hat\sigma^2 = \sum_i \big(y_i - \hat\mu(x_i)\big)^2 / N$
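The two displays above can be checked numerically. Below is a minimal Python sketch (the polynomial basis h_j and the simulated data are illustrative choices, not taken from the slides) that computes the least-squares coefficients, the plug-in covariance estimate, and a nonparametric bootstrap approximation to that covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative): y = sin(x) + noise
N = 50
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(x) + rng.normal(scale=0.3, size=N)

def basis(x):
    # h_j(x): a simple polynomial basis, one illustrative choice
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

H = basis(x)

# Least-squares solution: beta_hat = (H^T H)^{-1} H^T y
beta_hat = np.linalg.solve(H.T @ H, H.T @ y)

# Plug-in covariance: cov(beta_hat) = (H^T H)^{-1} * sigma_hat^2
sigma2_hat = np.sum((y - H @ beta_hat) ** 2) / N
cov_beta = np.linalg.inv(H.T @ H) * sigma2_hat

# Nonparametric bootstrap: refit on resampled (x_i, y_i) pairs
B = 1000
boot_betas = np.empty((B, H.shape[1]))
for b in range(B):
    idx = rng.integers(0, N, N)
    Hb, yb = H[idx], y[idx]
    boot_betas[b] = np.linalg.solve(Hb.T @ Hb, Hb.T @ yb)

print("plug-in cov:\n", cov_beta)
print("bootstrap cov:\n", np.cov(boot_betas, rowvar=False))
```

With enough bootstrap replications the two covariance estimates agree closely, which is the point of the comparison.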
Parametric Model
• Assume a parameterized probability density (parametric model) for the observations.
Maximum Likelihood Inference
• Suppose we are trying to measure the true value of some quantity ($x_T$).
  – We make repeated measurements of this quantity {x1, x2, ..., xn}.
  – The standard way to estimate $x_T$ from our measurements is to calculate the mean value

    $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$

    and set $x_T = \bar{x}$.
• DOES THIS PROCEDURE MAKE SENSE? The maximum likelihood method (MLM) answers this question and provides a general method for estimating parameters of interest from data.
The Maximum Likelihood Method
• Statement of the Maximum Likelihood Method
  – Assume we have made N measurements of x: {x1, x2, ..., xn}.
  – Assume we know the probability distribution function that describes x: f(x, α).
  – Assume we want to determine the parameter α.
• MLM: pick α to maximize the probability of getting the measurements (the xi's) we did!
The MLM Implementation
• The probability of measuring x1 is $f(x_1, \alpha)\,dx$
• The probability of measuring x2 is $f(x_2, \alpha)\,dx$
• The probability of measuring xn is $f(x_n, \alpha)\,dx$
• If the measurements are independent, the probability of getting the measurements we did is:

  $L = f(x_1,\alpha)\,dx \cdot f(x_2,\alpha)\,dx \cdots f(x_n,\alpha)\,dx = f(x_1,\alpha)\, f(x_2,\alpha)\cdots f(x_n,\alpha)\,[dx^n]$

• We can drop the $dx^n$ term as it is only a proportionality constant.
• L is called the likelihood function:

  $L = \prod_{i=1}^{N} f(x_i, \alpha)$
Log Maximum Likelihood Method
• We want to pick the α that maximizes L:

  $\left.\frac{\partial L}{\partial\alpha}\right|_{\alpha=\alpha^*} = 0$

  – Often easier to maximize ln L.
  – L and ln L are both maximized at the same location.
• We maximize ln L rather than L itself because ln L converts the product into a summation:

  $\ln L = \sum_{i=1}^{N} \ln f(x_i, \alpha)$
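Beyond turning the product into a sum, the log is also what makes the likelihood usable in floating-point arithmetic. The sketch below (assuming a Gaussian f(x, α) and simulated data, both illustrative) shows the raw product of densities underflowing to zero while the sum of log-densities stays finite.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=2000)     # simulated measurements

alpha = 2.0                                        # candidate parameter value
densities = norm.pdf(x, loc=alpha, scale=1.0)

L = np.prod(densities)             # raw likelihood: underflows to 0.0 here
logL = np.sum(np.log(densities))   # log-likelihood: a finite, usable number

print(L, logL)
```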
Log Maximum Likelihood Method
• The new maximization condition is:

  $\left.\frac{\partial \ln L}{\partial\alpha}\right|_{\alpha=\alpha^*} = \sum_{i=1}^{N}\left.\frac{\partial \ln f(x_i,\alpha)}{\partial\alpha}\right|_{\alpha=\alpha^*} = 0$

• α could be an array of parameters (e.g. slope and intercept) or just a single variable.
• The equations that determine α range from simple linear equations to coupled non-linear equations.
An Example: Gaussian
• Let f(x, μ) be given by a Gaussian distribution function.
• Let α = μ be the mean of the Gaussian. We want to use our data + MLM to find the mean.
• We want the best estimate of μ from our set of n measurements {x1, x2, ..., xn}.
• Let's assume that σ is the same for each measurement.
An Example: Gaussian
• Gaussian PDF:

  $f(x_i, \mu) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$

• The likelihood function for this problem is:

  $L = \prod_{i=1}^{n} f(x_i,\mu) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} e^{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}}$
An Example: Gaussian
$\ln L = \ln \prod_{i=1}^{n} f(x_i,\mu) = \ln\!\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} e^{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}}\right] = n \ln\!\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}$

• We want to find the μ that maximizes the log-likelihood function:

  $\frac{\partial \ln L}{\partial\mu} = \frac{\partial}{\partial\mu}\!\left[n \ln\!\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right] = 0$

  $-\frac{1}{2\sigma^2}\sum_{i=1}^{n} 2(x_i-\mu)(-1) = 0 \;\Rightarrow\; \sum_{i=1}^{n}(x_i-\mu) = 0 \;\Rightarrow\; \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
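A quick numerical check of this derivation: the sketch below (simulated Gaussian data with a known σ, all values illustrative) maximizes the log-likelihood over μ with a scalar optimizer and confirms that the result matches the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=200)   # n measurements, sigma assumed known
sigma = 2.0

def neg_log_likelihood(mu):
    # -ln L(mu) for fixed, known sigma
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

res = minimize_scalar(neg_log_likelihood)
print("MLE of mu:   ", res.x)
print("sample mean: ", x.mean())   # identical up to optimizer tolerance
```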
An Example: Gaussian
• If σi is different for each data point, then μ is the weighted average:

  $\mu = \frac{\sum_{i=1}^{n} x_i/\sigma_i^2}{\sum_{i=1}^{n} 1/\sigma_i^2}$   (Weighted Average)
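A minimal sketch of this inverse-variance weighted average (the measurement values and per-point σi below are illustrative):

```python
import numpy as np

# measurements x_i with different known uncertainties sigma_i (illustrative values)
x = np.array([10.2, 9.8, 10.5, 10.1])
sigma = np.array([0.5, 0.3, 1.0, 0.4])

# mu_hat = sum(x_i / sigma_i^2) / sum(1 / sigma_i^2)
w = 1.0 / sigma**2
mu_hat = np.sum(w * x) / np.sum(w)
print("weighted average:", mu_hat)
```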
An Example: Poisson
• Let f(x, μ) be given by a Poisson distribution.
• Let α = μ be the mean of the Poisson.
• We want the best estimate of μ from our set of n measurements {x1, x2, ..., xn}.
• Poisson PDF:

  $f(x, \mu) = \frac{e^{-\mu}\mu^{x}}{x!}$
An Example: Poisson
• The likelihood function for this problem is:

  $L = \prod_{i=1}^{n} f(x_i,\mu) = \prod_{i=1}^{n} \frac{e^{-\mu}\mu^{x_i}}{x_i!} = \frac{e^{-\mu}\mu^{x_1}}{x_1!}\cdot\frac{e^{-\mu}\mu^{x_2}}{x_2!}\cdots\frac{e^{-\mu}\mu^{x_n}}{x_n!} = \frac{e^{-n\mu}\,\mu^{\sum_{i=1}^{n} x_i}}{x_1!\,x_2!\cdots x_n!}$
An Example: Poisson
• Find the μ that maximizes the log-likelihood function:

  $\frac{d \ln L}{d\mu} = \frac{d}{d\mu}\!\left[-n\mu + \ln\mu \sum_{i=1}^{n} x_i - \ln(x_1!\,x_2!\cdots x_n!)\right] = -n + \frac{1}{\mu}\sum_{i=1}^{n} x_i = 0$

  $\Rightarrow\; \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$   (the average)
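The same numerical check works for the Poisson case: maximizing the log-likelihood over μ recovers the sample average. The counts below are simulated with an illustrative mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(3)
x = rng.poisson(lam=4.0, size=500)   # n count measurements

def neg_log_likelihood(mu):
    return -np.sum(poisson.logpmf(x, mu))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print("MLE of mu:   ", res.x)
print("sample mean: ", x.mean())
```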
General properties of MLM
• For large data samples (large n) the likelihood
function, L, approaches a Gaussian distribution.
• Maximum likelihood estimates are usually
consistent.
– For large n the estimates converge to the true value of
the parameters we wish to determine.
• Maximum likelihood estimates are usually
unbiased.
– For all sample sizes the parameter of interest is
calculated correctly.
General properties of MLM
• Maximum likelihood estimate is efficient: the
estimate has the smallest variance.
• Maximum likelihood estimate is sufficient: it
uses all the information in the observations
(the xi’s).
• The solution from MLM is unique.
• Bad news: we must know the correct
probability distribution for the problem at
hand!
Maximum Likelihood
• We maximize the likelihood function
• Log-likelihood function
Score Function
• Assess the precision of θ̂ using the likelihood function.
• Assume that L takes its maximum in the interior of the parameter space. Then
Likelihood Function
• We maximize the likelihood function
• We omit the normalization constant since it only adds a constant factor
• Think of L as a function of θ with Z fixed
• Log-likelihood function
Fisher Information
• Negative sum of second derivatives is the
information matrix
• This is called the observed information; it should be greater than 0.
• The Fisher information (expected information) is
Sampling Theory
• Basic result of sampling theory:
• The sampling distribution of the maximum-likelihood estimator approaches the following normal distribution as N → ∞, when we sample independently from the model
• This suggests approximating the sampling distribution with
Error Bound
• The corresponding error estimates are
obtained from
• The confidence points have the form
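Putting the last three slides together for the earlier Poisson example: the sketch below evaluates the observed information (the negative second derivative of ln L) at the MLE, takes the inverse square root as the standard error, and forms approximate 95% confidence points μ̂ ± z·se. All numerical choices are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x = rng.poisson(lam=4.0, size=500)

mu_hat = x.mean()                     # Poisson MLE

# Observed information: -d^2(ln L)/dmu^2 = sum(x_i) / mu^2, evaluated at mu_hat
info = np.sum(x) / mu_hat**2          # equals n / mu_hat at the MLE
se = 1.0 / np.sqrt(info)

z = norm.ppf(0.975)                   # normal quantile for a 95% interval
print(f"mu_hat = {mu_hat:.3f},  se = {se:.3f}")
print(f"95% confidence interval: ({mu_hat - z*se:.3f}, {mu_hat + z*se:.3f})")
```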
Simplified form of the Fisher information
Suppose, in addition, that the operations of integration and
differentiation can be swapped for the second derivative of f(x;θ) as
well, i.e.,
In this case, it can be shown that the Fisher information equals
The Cramér–Rao bound can then be written as
Single-parameter proof
Let X be a random variable with probability density function f(x;θ). Here T = t(X) is a statistic, which is used as an estimator for ψ(θ). If the expectation of T is denoted by ψ(θ), then, for all θ,
If V is the score, i.e.
then the expectation of V, written E(V), is zero. If we consider the covariance cov(V,T) of V and T, we have cov(V,T) = E(VT), because E(V) = 0. Expanding this expression we have
Using the chain rule
and the definition of expectation gives, after cancelling f(x;θ),
because the integration and differentiation operations
commute (second condition).
The Cauchy-Schwarz inequality shows that
Therefore
An Example
• Consider a linear expansion

  $\mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)$

• The least-squares solution

  $\hat\beta = (H^T H)^{-1} H^T y$

• The covariance of β̂

  $\widehat{\mathrm{cov}}(\hat\beta) = (H^T H)^{-1}\,\hat\sigma^2, \qquad \hat\sigma^2 = \sum_i \big(y_i - \hat\mu(x_i)\big)^2 / N$
An Example
Consider the prediction model

  $\hat\mu(x) = \sum_{j=1}^{N} \hat\beta_j h_j(x)$

The standard deviation

  $se[\hat\mu(x)] = \big[h(x)^T (H^T H)^{-1} h(x)\big]^{1/2}\,\hat\sigma$

• The confidence region

  $\hat\mu(x) \pm 1.96\, se[\hat\mu(x)]$
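A sketch of the pointwise standard error and the ±1.96 band, repeating the illustrative basis-expansion setup from the earlier code sketch (polynomial basis and simulated data are again assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(x) + rng.normal(scale=0.3, size=N)

def basis(x):
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

H = basis(x)
beta_hat = np.linalg.solve(H.T @ H, H.T @ y)
sigma_hat = np.sqrt(np.sum((y - H @ beta_hat) ** 2) / N)

# se[mu_hat(x)] = [h(x)^T (H^T H)^{-1} h(x)]^{1/2} * sigma_hat, on a grid of x values
x_grid = np.linspace(0, 3, 100)
Hg = basis(x_grid)
HtH_inv = np.linalg.inv(H.T @ H)
mu_hat = Hg @ beta_hat
se = np.sqrt(np.einsum("ij,jk,ik->i", Hg, HtH_inv, Hg)) * sigma_hat

# Approximate 95% pointwise confidence band
lower, upper = mu_hat - 1.96 * se, mu_hat + 1.96 * se
```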
Bayesian Methods
• Given a sampling model Pr(Z | θ) and a prior Pr(θ) for the parameters, estimate the posterior probability Pr(θ | Z)
• By drawing samples from it or estimating its mean or mode
• Differences from mere counting (the frequentist approach)
– Prior: allow for uncertainties present before seeing the
data
– Posterior: allow for uncertainties present after seeing the
data
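Since the posterior is proportional to the sampling model times the prior, Pr(θ | Z) ∝ Pr(Z | θ) Pr(θ), it can be evaluated directly on a grid when θ is one-dimensional. A minimal sketch (Gaussian sampling model with known σ and a Gaussian prior, both illustrative choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
z = rng.normal(loc=1.5, scale=1.0, size=20)       # observed data Z
sigma = 1.0                                        # sampling std, assumed known

theta = np.linspace(-3, 5, 2001)                   # grid over the parameter
d_theta = theta[1] - theta[0]
log_prior = norm.logpdf(theta, loc=0.0, scale=2.0)                 # Pr(theta)
log_lik = np.array([np.sum(norm.logpdf(z, loc=t, scale=sigma))     # Pr(Z | theta)
                    for t in theta])

log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())           # unnormalized posterior
post /= post.sum() * d_theta                       # normalize over the grid

post_mean = np.sum(theta * post) * d_theta
post_mode = theta[np.argmax(post)]
print("posterior mean:", post_mean, " posterior mode:", post_mode)
```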
Bayesian Methods
• The posterior distribution also affords a predictive distribution for future values Z^new
• In contrast, the max-likelihood approach would predict future data on the basis of the maximum-likelihood estimate alone, not accounting for the uncertainty in the parameters
An Example
• Consider a linear expansion

  $\mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)$

• Assume a Gaussian prior density for the parameters:

  $p(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)}$
• The posterior distribution for β is also Gaussian, with mean and covariance
• The corresponding posterior values for μ(x):
Bootstrap vs Bayesian
• The bootstrap mean is an approximate
posterior average
• Simple example:
– Single observation z drawn from a normal distribution
– Assume a normal prior for θ
– Resulting posterior distribution
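The slide's posterior formulas did not survive the transcript, so the sketch below uses the standard conjugate-normal result for this setup, z ~ N(θ, 1) with prior θ ~ N(0, τ): the posterior is N(z / (1 + 1/τ), 1 / (1 + 1/τ)). The parametric bootstrap draws z* ~ N(z, 1), which matches the noninformative limit τ → ∞. Numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

z = 1.3          # the single observation, z ~ N(theta, 1)
tau = 10.0       # prior variance: theta ~ N(0, tau)

# Conjugate-normal posterior for theta given z
post_mean = z / (1.0 + 1.0 / tau)
post_var = 1.0 / (1.0 + 1.0 / tau)
print(f"posterior: N({post_mean:.3f}, {post_var:.3f})")

# Parametric bootstrap: z* ~ N(z, 1), the tau -> infinity (noninformative) limit
boot = rng.normal(loc=z, scale=1.0, size=100_000)
print("bootstrap mean/var:", boot.mean(), boot.var())
```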
Bootstrap vs Bayesian
• Three ingredients make this work
  – The choice of a noninformative prior for θ
  – The dependence of the log-likelihood on Z only through the maximum-likelihood estimate θ̂
  – The symmetry of the log-likelihood in θ and θ̂
Bootstrap vs Bayesian
• The bootstrap distribution represents an (approximate)
nonparametric, noninformative posterior distribution for our
parameter.
• But this bootstrap distribution is obtained painlessly without
having to formally specify a prior and without having to
sample from the posterior distribution.
• Hence we might think of the bootstrap distribution as a
"poor man's" Bayes posterior. By perturbing the data, the
bootstrap approximates the Bayesian effect of perturbing
the parameters, and is typically much simpler to carry out.
The EM Algorithm
• The EM algorithm for two-component Gaussian mixtures
  – Take initial guesses $\hat\pi, \hat\mu_1, \hat\sigma_1^2, \hat\mu_2, \hat\sigma_2^2$ for the parameters
  – Expectation Step: compute the responsibilities
The EM Algorithm
– Maximization Step: Compute the weighted
means and variances
– Iterate 2 and 3 until convergence
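A compact sketch of these two steps on simulated one-dimensional data. The responsibilities γi and the weighted updates follow the usual two-component Gaussian-mixture formulas; the data, starting values, and fixed iteration count are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
# Simulated two-component data (illustrative)
y = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(4.0, 0.7, 100)])

# Initial guesses for pi, mu1, sigma1^2, mu2, sigma2^2
pi_, mu1, var1, mu2, var2 = 0.5, y.min(), y.var(), y.max(), y.var()

for _ in range(200):
    # E-step: responsibilities gamma_i = P(component 2 | y_i)
    p1 = (1 - pi_) * norm.pdf(y, mu1, np.sqrt(var1))
    p2 = pi_ * norm.pdf(y, mu2, np.sqrt(var2))
    gamma = p2 / (p1 + p2)

    # M-step: weighted means, variances, and mixing proportion
    mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
    var1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
    mu2 = np.sum(gamma * y) / np.sum(gamma)
    var2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
    pi_ = gamma.mean()

print(f"pi={pi_:.3f}  mu1={mu1:.3f} var1={var1:.3f}  mu2={mu2:.3f} var2={var2:.3f}")
```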
The EM Algorithm in General
• Baum-Welch algorithm
• Applicable to problems for which maximizing the log-likelihood is difficult, but which are simplified by enlarging the sample set with unobserved (latent) data (data augmentation).
The EM Algorithm in General
• Start with initial parameters $\hat\theta^{(0)}$
• Expectation Step: at the j-th step, compute $Q(\theta', \hat\theta^{(j)})$ as a function of the dummy argument θ'
• Maximization Step: determine the new parameters $\hat\theta^{(j+1)}$ by maximizing $Q(\theta', \hat\theta^{(j)})$ over θ'
• Iterate steps 2 and 3 until convergence
Model Averaging and Stacking
• Given predictions
• Under square-error loss, seek weights
• Such that
• Here the input x is fixed and the N observations in Z are distributed according to P
Model Averaging and Stacking
• The solution is the population linear regression of Y on the model predictions, namely
• Now the full regression has smaller error, namely
• The population linear regression is not available in practice, so we replace it with the linear regression over the training set
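A sketch of that replacement: regress y on the training-set predictions of the candidate models to get combining weights. The two models (a linear and a cubic fit) and the simulated data are illustrative; proper stacking would use cross-validated rather than in-sample predictions, which is simplified away here.

```python
import numpy as np

rng = np.random.default_rng(8)

# Training data and two candidate models' predictions on it (illustrative)
N = 100
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(x) + rng.normal(scale=0.3, size=N)

f1 = np.poly1d(np.polyfit(x, y, 1))(x)   # predictions of model 1 (linear fit)
f2 = np.poly1d(np.polyfit(x, y, 3))(x)   # predictions of model 2 (cubic fit)

# Replace the population regression of Y on (f1, ..., fM) by a
# least-squares regression of y on the stacked training predictions
F = np.column_stack([f1, f2])
w, *_ = np.linalg.lstsq(F, y, rcond=None)
print("combining weights:", w)

# Combined prediction at the training points
y_combined = F @ w
```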