Model Averaging
Bayesian versus Classical Approaches to Inference
CES LMU München
Gernot Doppelhofer
Norwegian School of Economics (NHH)
15-17 March 2011
Outline
Introduction
Course objective: introduction to model averaging
Motivation: increasing attention to model uncertainty in economic problems
Feasibility: improvements in numerical techniques, combined with falling computing costs, make model averaging and simulation-based methods accessible
Lecture Outline
Lect 1 Bayesian and Classical Approaches to Inference: introduces a number of key issues in Bayesian analysis, including random parameters, prior uncertainty and model uncertainty, and contrasts them with the Classical approach
Lect 2 Model Averaging: introduces the basics of Frequentist and Bayesian model averaging, focussing on the loss function of the decision maker, the calculation of posterior distributions, and Benchmark Model Averaging in the normal, linear regression model
Lect 3 Applications of Model Averaging: inference, forecasting and policy evaluation
Outline of Lecture 1
1 Probability Statements about Unknown Parameters
  Probability Statements and Interval Estimation
2 Motivation: Multiple Models
3 Bayesian and Classical Objects
4 Binary Uncertainty
  Bayesian Hypothesis Testing
5 Bayes Theorem
6 Aspects of Bayesian Inference
7 Simultaneous Bayesian and Classical Inference
8 Prior Uncertainty
  Noninformative and Improper Priors
  Natural Conjugate Priors
  Hierarchical Priors
Probability Statements about Unknown Parameters
A probability statement about a parameter β is simply
not possible in classical statistics because parameters are
not allowed to be random variables.
However, statements about parameters are what the
investigator actually wants in practice.
It is hard to think that statistical inference about β could
mean anything else.
Kendall and Stuart
Bayesian versus Classical Approaches
Classical: relies (mostly) on distributions of estimators and test statistics over hypothetical repeated samples. Does NOT condition on the data. Inferences are based on data not observed! Such sampling distributions are strictly irrelevant to Bayesian inference.
Bayesian: there are no estimators and no test statistics; there is no role for unbiasedness, minimum variance, efficiency, etc.
What are we interested in?
Estimation of parameters (for example regression coefficients)
Testing of hypotheses and comparison of different models
Prediction of quantities of interest
A Bayesian Model for the data p(y)
Requirements:
The Likelihood p(y|θ) - a conditional probability statement about the data
The Prior p(θ) - a prior distribution over the parameters
The Model p(y) = ∫ p(y|θ) p(θ) dθ
The Estimator p(θ|y, M)
Remark
It is worth highlighting the Bayesian requirements in developing a model for the data, relative to the classical approach.
Classical estimators, based for example on GMM, can proceed with moment conditions of the form E[ψ(zm, θ)] = 0 for all m = 1, ..., M, and a statement that the data are i.i.d.
This is much less demanding than a likelihood statement.
Remark
BUT combining a likelihood statement with a prior gives us a posterior
p(θ|y, M) = p(y|θ, M) p(θ) / p(y)    (1)
and then we conduct inference that is exact and conditional on the observed data,
AND, at the outset, conditional on a given model M.
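As an illustration of (1), here is a minimal numerical sketch (not from the slides; data and prior invented) that forms a posterior by multiplying a likelihood by a prior over a grid of parameter values and normalising by the marginal likelihood p(y):

    import numpy as np

    # Hypothetical data: i.i.d. draws assumed to come from N(theta, 1), theta unknown
    y = np.array([0.8, 1.4, 0.3, 1.1, 0.9])

    theta = np.linspace(-3.0, 5.0, 2001)            # grid over the parameter
    prior = np.exp(-0.5 * theta**2 / 4.0)           # N(0, 4) prior, unnormalised
    prior /= np.trapz(prior, theta)

    # Likelihood p(y | theta) at each grid point (rescaled for numerical stability;
    # the constant cancels in the normalisation below)
    loglik = np.array([-0.5 * np.sum((y - t) ** 2) for t in theta])
    lik = np.exp(loglik - loglik.max())

    # p(y) = integral of p(y|theta) p(theta) dtheta, then the posterior as in eq. (1)
    marginal = np.trapz(lik * prior, theta)
    posterior = lik * prior / marginal

    print("posterior mean:", np.trapz(theta * posterior, theta))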
Probability Statements and Interval Estimation
Bayesian Perspective
Interpret the following probability statement for i.i.d. data from N(µ, σ²) with σ² known:
Pr(x̄ − 1.96 σ/√n < µ < x̄ + 1.96 σ/√n) = 0.95    (2)
Bayesian: this is a set whose probability content is 0.95; no point outside it has higher posterior density than any point inside it.
It is more appropriate to write Pr(a < µ < b | Y), where Y denotes the available data.
A Bayesian posterior probability interval provides the probability that µ lies in that fixed interval, conditional on the observed data.
This statement means what it says! It does not refer to hypothetical repeated samples.
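A hedged sketch of the Bayesian reading of (2) (numbers invented): with a flat prior and known σ², the posterior for µ is N(x̄, σ²/n), so the central 95% posterior interval is a direct probability statement about µ given the data:

    import numpy as np
    from scipy import stats

    x = np.array([2.1, 1.7, 2.4, 2.0, 1.9, 2.3])     # hypothetical sample
    sigma = 0.5                                       # known standard deviation

    # Under a flat prior, mu | data ~ N(xbar, sigma^2 / n)
    n, xbar = len(x), x.mean()
    post = stats.norm(loc=xbar, scale=sigma / np.sqrt(n))

    lo, hi = post.ppf(0.025), post.ppf(0.975)
    print(f"Pr({lo:.3f} < mu < {hi:.3f} | data) = 0.95")  # a statement about mu itself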
Classical Perspective
A (1 − α) confidence interval:
I. A statement as to the percentage of yet-unobserved samples that will generate intervals containing µ.
II. A statement that one should have 95% confidence in the parameter µ lying in a given interval is a probability statement about the random interval, and not about µ.
If we repeated the experiment and obtained 100 such intervals, roughly 95 of the 100 would contain the true unknown parameter.
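The repeated-sampling reading can be checked by simulation; the sketch below (purely illustrative, with an invented true µ) counts how many of 100 intervals built from fresh samples cover the true value:

    import numpy as np

    rng = np.random.default_rng(0)
    mu_true, sigma, n = 2.0, 0.5, 25

    covered = 0
    for _ in range(100):
        x = rng.normal(mu_true, sigma, size=n)            # a new hypothetical sample each time
        half = 1.96 * sigma / np.sqrt(n)
        if x.mean() - half < mu_true < x.mean() + half:   # does this interval contain mu?
            covered += 1

    print(covered, "of 100 intervals contain the true mu")  # typically close to 95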
Motivation: Multiple Models
The inference problem: suppose we want to investigate potential determinants of a dependent variable y given the (potentially large) set of regressors X.
Model Space: suppose that there are K potential regressors and the model space M is the set of all 2^K linear models. Model Mj is described by a (K × 1) binary vector γ = (γ1, ..., γK)', where a one (zero) indicates the inclusion (exclusion) of an explanatory variable xi.
Conditional and Unconditional Inference: the unknown parameter vector β represents the effects of the variables included in a regression model. We can estimate its distribution p(β|y, Mj) conditional on model Mj.
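To make the model space concrete, a small sketch (with a hypothetical K) enumerates all 2^K inclusion vectors γ:

    from itertools import product

    K = 4                                          # number of candidate regressors (illustrative)
    models = list(product([0, 1], repeat=K))       # all 2^K binary inclusion vectors gamma

    print(len(models), "models")                   # 16 when K = 4
    for gamma in models[:5]:
        included = [f"x{i+1}" for i, g in enumerate(gamma) if g == 1]
        print(gamma, "->", included or ["intercept only"])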
... And Unconditionally
Remark
Given the (potentially large) space of models M, the unconditional effects of model parameters can be derived by integrating out all aspects of model uncertainty, including the space of models M.
The posterior density p(β|y) can then be expressed as a function of the sample observations y.
Bayesian and Classical Objects
Posterior Model Uncertainty
p(Mj|y) = l(y|Mj) · p(Mj) / ∑_{i=1}^{2^K} l(y|Mi) · p(Mi)    (3)

Prior Model Uncertainty p(Mj)
Marginal Likelihood of Model Mj
l(y|Mj) = ∫ p(y|α, σ, βj, Mj) · p(α, σ, βj|Mj) dα dσ dβj    (4)
Likelihood function p(y|α, σ, βj, Mj); p(y; θ, Mj)
Posterior Parameter Uncertainty p(α, σ, βj|y, Mj); p(θ|y, Mj)
Prior Parameter Uncertainty p(α, σ, βj|Mj); p(θ|Mj)
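As a rough illustration of (3) and (4) (not the benchmark setup used later in the course): the exact marginal likelihood requires integrating over a specific prior, so the sketch below substitutes the common BIC approximation to log l(y|Mj) and uses equal prior model probabilities; the data-generating process is invented:

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(1)

    # Hypothetical data: y depends on x1 only; x2 is irrelevant
    n = 100
    X = rng.normal(size=(n, 2))
    y = 1.0 + 0.8 * X[:, 0] + rng.normal(scale=1.0, size=n)

    def log_ml_bic(y, Xj):
        """BIC approximation to log l(y|Mj) for a linear model (a stand-in, not exact)."""
        n = len(y)
        Z = np.column_stack([np.ones(n), Xj]) if Xj.shape[1] else np.ones((n, 1))
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        rss = np.sum((y - Z @ beta) ** 2)
        k = Z.shape[1] + 1                        # coefficients plus the error variance
        return -0.5 * n * np.log(rss / n) - 0.5 * k * np.log(n)

    gammas = list(product([0, 1], repeat=2))      # 2^K models, here K = 2
    logml = np.array([log_ml_bic(y, X[:, [i for i, g in enumerate(g_vec) if g]])
                      for g_vec in gammas])

    # Equal priors p(Mj) = 1 / 2^K, then posterior model probabilities as in eq. (3)
    post = np.exp(logml - logml.max())
    post /= post.sum()
    for g_vec, p in zip(gammas, post):
        print(g_vec, round(float(p), 3))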
Bayesian and Classical Objects - continued
Posterior Odds
p(Mj|y) / p(Ml|y) = [ l(y|Mj) / l(y|Ml) ] · [ p(Mj) / p(Ml) ]    (5)
Bayes Factors BFjl = l(y|Mj) / l(y|Ml)
This is similar to an LR test, but instead of maximizing the respective likelihood, Bayesians average it over the parameters.
Prior Odds p(Mj) / p(Ml)
Posterior Model Uncertainty
For each model l ∈ M, p(Mj|y) is determined by the prior odds p(Mj)/p(Ml) and the Bayes Factor BFjl = l(y|Mj)/l(y|Ml).
Bayes Factors are sensitive to the choice of prior distributions for the model parameters θ = (α, σ, β), and specifically to the distinction between:
generic vs model-specific parameters
improper vs proper priors
Remark
We now consider the fundamental theorem of probability that will
allow us to consider these objects in a unifying framework.
Binary Uncertainty
Bayesian Hypothesis Testing
Example: Prior and Posterior Odds of Death by Homicide
versus Natural Causes
The use of posterior odds to reflect the relative likelihood of cot death (Dc) versus death by homicide (DH).
Data: two deaths, D2.
Pr(Dc|D2) / Pr(DH|D2) = [ Pr(Dc) / Pr(DH) ] · [ Pr(D2|Dc) / Pr(D2|DH) ]    (6)
Second term on the RHS of (6) is the Bayes Factor.
This may be used to test hypotheses within a Bayesian framework.
Example
Suppose the prior odds are 1 and the posterior odds are 1 in 77.
The sample evidence, namely Pr(D2|Dc)/Pr(D2|DH), updates the prior odds of 1/1 to 1 in 77 in favor of the probability that the two observed deaths are due to cot death.
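For completeness, the update in (6) is just one multiplication. In the sketch below the prior odds are 1 as in the example, but the Bayes factor is an arbitrary placeholder, not a figure from the example or from any actual case:

    def posterior_odds(prior_odds, bayes_factor):
        """Posterior odds of D_c versus D_H, as in eq. (6)."""
        return bayes_factor * prior_odds

    prior_odds = 1.0       # Pr(Dc) / Pr(DH): prior indifference
    bayes_factor = 9.0     # Pr(D2|Dc) / Pr(D2|DH): hypothetical placeholder value

    odds = posterior_odds(prior_odds, bayes_factor)
    prob_cot_death = odds / (1.0 + odds)      # convert posterior odds into a probability
    print(odds, round(prob_cot_death, 3))     # 9.0 and 0.9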
The logarithm of the Bayes Factor, say LBF, is sometimes called the weight of evidence given by one explanation (model, hypothesis) over another, i.e. H0 over H1.
Classical hypothesis testing gives one hypothesis (or model)
preferred status (the ’null hypothesis’), and only considers evidence
against it.
A value of LBF > 1 means that the data indicate that H0 is more likely than H1. Jeffreys gave a scale for the interpretation of LBF:

LBF              Strength of evidence
< 1:1            Negative (supports H1)
1:1 to 3:1       Barely worth mentioning
3:1 to 12:1      Positive
12:1 to 150:1    Strong
> 150:1          Very strong
Posterior Odds: A Derivation
It is convenient to start by defining
Pr[Y ≤ y | θ] = ∫_{−∞}^{y} f(Y|θ) dY = F(y|θ)
Use the fundamental rule of probability:
Pr[Y ≤ y ∩ H = H0] = Pr[Y ≤ y | H = H0] Pr[H = H0] = Pr[H = H0 | Y ≤ y] Pr[Y ≤ y]
or
Pr[H = H0 | Y ≤ y] / Pr[H = H1 | Y ≤ y] = ( Pr[Y ≤ y | H = H0] Pr[H = H0] ) / ( Pr[Y ≤ y | H = H1] Pr[H = H1] )    (7)
If F(y|θ) is a continuous distribution we can rewrite (7) as
Pr[H0|y] / Pr[H1|y] = [ f(y|H0) / f(y|H1) ] · [ Pr[H0] / Pr[H1] ]    (8)
Posterior Odds = Bayes Factor × Prior Odds
Remark
The Bayes factor (BF) is similar to but not the same as the LR in
classical statistics. BF does not replace parameters by point
estimates!
One can show that the BF is a ratio of weighted likelihoods, with the prior density as weights.
Bayes Theorem
Bayes Theorem for Events
For events E and F, the degree of belief in event F given evidence, or event, E is equal to the joint probability of E and F divided by the probability of E.
This last division is just a rescaling operation. Given that E has occurred, total probability must now be defined wrt a reduced universe (a different population).
Theorem
Pr(F|E) = [Pr(E|F) · Pr(F)] / Pr(E)
The probability of F given E is the unconditional probability of that part of F in E, Pr(E ∩ F), multiplied by the rescaling factor 1/Pr(E).
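A worked numerical instance of the theorem (all probabilities invented for illustration):

    # Pr(F|E) = Pr(E|F) Pr(F) / Pr(E), with Pr(E) obtained by total probability
    p_F = 0.01                     # prior probability of event F
    p_E_given_F = 0.95             # probability of the evidence E when F holds
    p_E_given_notF = 0.10          # probability of E when F does not hold

    p_E = p_E_given_F * p_F + p_E_given_notF * (1 - p_F)   # the rescaling factor Pr(E)
    p_F_given_E = p_E_given_F * p_F / p_E

    print(round(p_F_given_E, 4))   # about 0.0876: the updated belief in F after seeing E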
Bayes Theorem For Random Variables
Theorem
gY(y|x) = [gX(x|y) fY(y)] / ∑_y gX(x|y) fY(y)    for Y discrete
gY(y|x) = [gX(x|y) fY(y)] / ∫ gX(x|y) fY(y) dy    for Y continuous
Derived in an exactly analogous manner to BT for events.
The interpretation of the above also follows from what we know: the posterior beliefs in Y, in the light of the information provided by X, are given by the likelihood of getting the information x given y, multiplied by the prior beliefs about Y, divided by the marginal distribution for X.
Bayes Theorem For Parameters
Let p(θ) represent a density on an unknown random quantity θ and let p(θ|y), y = (y1, ..., yN), denote the posterior distribution.
The probability of observing y conditional on θ is
f(y|θ) = ∏_{i=1}^{N} f(yi|θ)    (9)
Fundamental rule of probability: p(θ|y) f(y) = f(y|θ) p(θ), so
p(θ|y) = f(y|θ) p(θ) / f(y)    (10)
p(θ|y) ∝ f(y|θ) p(θ)    (11)
Remark
Bayes rule is simply a mapping from one conditional probability to
another. It does not imply a Bayesian perspective unless the
unconditional probability is the prior distribution.
Dissecting Bayes' Theorem
p(θ|y) = f(y|θ) p(θ) / f(y)    (12)
The Posterior Distribution p(θ|y)
The Likelihood Function l(y; θ, Mj): think of y as data, as given. Now maximize the probability of observing what we observe over θ ∈ Θ.
The Probability Density Function fY(y|θ, Mj): the pdf of Y evaluated at y, conditional on θ
The Marginal Distribution (predictive density) f(y)
The Prior Distribution p(θ)
Aspects of Bayesian Inference
Exact finite sample inference: Bayesian results are exact in finite samples, since the distributions are derived conditional on the data.
Components of uncertainty: model, parameter, prior ... handled in different ways in classical vs Bayesian approaches.
Large sample approximations: much of sampling theory depends upon asymptotics, and its results are therefore only approximations for the observed sample of data.
Hypothesis tests, sampling theory: conditional on the null hypothesis being true and the sample data, here is the probability of a discrepancy between the null and an observed realisation in a random sample of data.
Hypothesis tests, Bayesian: conditional upon the prior and the data, here is the probability support for a particular hypothesis.
Bayesians measure the data's support for the hypothesis, while sampling theory measures the hypothesis's support for the data.
Pragmatic Bayesians: Constructing the Posterior
Remark
Bayesian procedures circumvent two of the most significant
difficulties associated with classical estimation.
I. Bayesian procedures do not require maximization of any function
II. Desirable estimation properties such as consistency can be
obtained under more relaxed conditions with Bayesian procedures.
Simultaneous Bayesian and Classical Inference
Remark
The use of Bayesian procedures does not necessitate that an
analyst adopt a Bayesian perspective on inference.
As shown below, Bayesian procedures provide an estimator whose properties can be examined in purely classical ways.
Remark
Under certain conditions, the estimator that results from Bayesian
procedures is asymptotically equivalent to the ML estimator.
In this sense, an advantage of a Bayesian perspective is that results
can be interpreted from both perspectives simultaneously
Bayesian Properties of the Posterior Mean
θ̄ = ∫ θ p(θ|y) dθ    (13)
The mean of the posterior distribution for θ, θ̄, is important from both Bayesian and Classical perspectives.
Bayesian: A decision has to be made that depends upon the value
of θ.
What value of θ minimises the expected cost of being wrong, given
beliefs represented in the posterior distribution?
Proposition
θ̄ is the value of θ that minimises the expected cost of being wrong about θ (if the cost is quadratic in the distance between θ and the true value of θ).
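A quick numerical check of the proposition for an arbitrary discrete posterior (values invented): the expected quadratic loss is minimised at the posterior mean.

    import numpy as np

    theta = np.array([0.0, 1.0, 2.0, 4.0])     # support of a toy posterior
    p = np.array([0.1, 0.4, 0.3, 0.2])         # posterior probabilities (sum to 1)

    post_mean = np.sum(theta * p)

    def expected_loss(a):
        """Expected quadratic cost of acting as if theta equalled a."""
        return np.sum(p * (theta - a) ** 2)

    grid = np.linspace(-1.0, 5.0, 6001)
    best = grid[np.argmin([expected_loss(a) for a in grid])]
    print(round(post_mean, 3), round(float(best), 3))   # both 1.8: the minimiser is the mean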
Remark
θ̄ is a statistic - what does it’s sampling distribution look like?
Proposition
θ̄ is an estimator that has the same asymptotic sampling
distribution as the MLE.
The Sampling Distribution of the Posterior Mean?
Sounds like a contradiction in terms but is a natural question for a
classical analyst.
The Bernstein-von Mises Theorem
Theorem
√N (θ̄ − θ*) →_d N(0, (−H)^{-1})    (14)
which implies θ̄ ∼_a N(θ*, (−H)^{-1}/N).
The mean of the posterior, considered as a classical estimator, is asymptotically equivalent to the MLE.
θ* and −H denote the true value and the expected derivative of the score, respectively.
Remark
Instead of maximizing the likelihood function one can calculate the
mean of the posterior distribution and know that the resulting
estimator is as good in classical terms as ML
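An illustrative check of the theorem in a conjugate Bernoulli example (all values invented): for large N the posterior mean under a Beta(1, 1) prior is numerically close to the MLE.

    import numpy as np

    rng = np.random.default_rng(2)
    theta_true, N = 0.3, 5000

    y = rng.binomial(1, theta_true, size=N)       # hypothetical Bernoulli data
    s = y.sum()

    mle = s / N                                    # maximiser of the likelihood
    post_mean = (s + 1) / (N + 2)                  # mean of the Beta(1 + s, 1 + N - s) posterior

    print(mle, post_mean, abs(mle - post_mean))    # the difference is O(1/N)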
Pointwise Convergence versus Convergence to a Distribution
Remark
Bayesian inference is conducted using a posterior distribution;
Classical inference is conducted using a point estimator in
conjunction with an interval estimator.
Classical inference considers convergence to a maximum of a
(likelihood) function
Bayesian techniques use an iterative process that converges to
draws from the posterior distribution.
Sometimes it can be difficult to verify if convergence has taken
place.
Prior Uncertainty
Remark
Bayes’ theorem shows only how beliefs change. It does not dictate
what beliefs (priors) should be.
Remark
The stage of the process where most frequentists can be found
circling wagons is when the prior is chosen
Koop, Poirier, Tobias (2007)
Classification of Prior Structure
Informative vs Noninformative Priors
Improper Priors
Natural Conjugate Priors
Hierarchical Priors
Noninformative and Improper Priors
Noninformative priors apply in cases where there is potentially wide
disagreement over the form of the prior density, and it is desired
that the data dominates.
In most cases noninformative (NI) priors are improper.
Example
Let θ ∈ [a, b]. An NI prior allocates equal prior weight to each equally sized sub-interval of the support [a, b].
This implies a uniform prior would be appropriate.
If a, b are unknown, we set them to −∞ and +∞ respectively.
Any uniform density will integrate to infinity over (−∞, ∞) and in
this sense is improper.
And Improper Priors: Does this matter?
From Bayes Theorem
p(θ|y) = f(y|θ) p(θ) / ∫ f(y|θ) p(θ) dθ    (15)
This still holds if the prior probabilities over the support of p(θ) are multiplied by a given constant;
the posterior probabilities will still sum (integrate) to 1 even if the prior values do not, and so the priors need only be specified in the correct proportion.
In many cases the sum or integral of the prior values need not even be finite to get sensible answers for the posterior probabilities.
Such a prior is called an improper prior.
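A small numerical check (reusing the grid idea from earlier; data invented) that multiplying the prior values by an arbitrary constant leaves the posterior in (15) unchanged:

    import numpy as np

    theta = np.linspace(-5.0, 5.0, 2001)
    y = np.array([0.2, 0.7, 0.4])                        # hypothetical N(theta, 1) data

    lik = np.exp(-0.5 * np.sum((y[:, None] - theta[None, :]) ** 2, axis=0))
    prior = np.ones_like(theta)                          # flat prior values over the grid

    def posterior(prior_values):
        num = lik * prior_values
        return num / np.trapz(num, theta)                # divide by the integral in (15)

    p1 = posterior(prior)
    p2 = posterior(1000.0 * prior)                       # the same prior rescaled by a constant
    print(np.max(np.abs(p1 - p2)))                       # effectively 0: the posterior is unchanged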
Proper and Improper Priors - continued
A prior distribution for the mean and variance of a random variable
Example
Assume p(m, v) ∝ 1/v (for v > 0). This suggests that:
i. any value for the mean is equally likely;
ii. a value for the (positive) variance becomes less likely in inverse proportion to its value.
Since
∫_{−∞}^{+∞} dm = ∫_{0}^{+∞} (1/v) dv = ∞    (16)
this would be an improper prior both for the mean and for the
variance.
Natural Conjugate Priors
A class of priors is conjugate for a family of likelihoods if both prior and posterior are in the same class for all data y.
A natural conjugate prior has the same functional form as the likelihood.
When confronted by data, the parameters change as evidence accumulates from a prior base.
Definition
If τ is a class of sampling distributions p(y|θ) and ω is a class of prior distributions for θ, then the class ω is conjugate for τ if p(θ|y) ∈ ω for all p(y|θ) ∈ τ and p(θ) ∈ ω.
Example: the Beta prior is the natural conjugate for the Binomial distribution.
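The Beta-Binomial case as a minimal sketch (hyperparameters and data invented): the posterior stays in the Beta class, with its parameters updated by the observed counts.

    # Beta(a, b) prior for a success probability theta; Binomial likelihood for the data
    a, b = 2.0, 2.0              # illustrative prior hyperparameters
    successes, failures = 7, 3   # hypothetical data

    # Conjugacy: Beta prior x Binomial likelihood -> Beta posterior
    a_post, b_post = a + successes, b + failures

    print(f"prior Beta({a}, {b}) -> posterior Beta({a_post}, {b_post})")
    print("posterior mean:", a_post / (a_post + b_post))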
Hierarchical Priors
Remark
One of the real advantages of Bayesian methods lies in dealing with problems characterized by a large number of parameters.
Hierarchical priors are one way of operationalizing this.
It is useful to think about the prior for a vector of parameters in stages.
Suppose that θ = (θ1, θ2, ..., θn)' and λ is a parameter of lower dimension than θ.
λ may be a fixed hyperparameter or a random variable.
The prior distribution p(θ) can be derived in stages:
p(θ|λ) p(λ) = p(θ, λ)
So p(θ) = ∫ p(θ|λ) p(λ) dλ, where λ is a hyperparameter. And
p(θ, λ, y) = p(y|θ) p(θ|λ) p(λ)
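A sketch of the two-stage construction (the Gamma and Normal stages are illustrative choices, not from the slides): drawing λ and then θ|λ amounts to simulating from the marginal prior p(θ) = ∫ p(θ|λ) p(λ) dλ.

    import numpy as np

    rng = np.random.default_rng(3)
    S, n = 10000, 5

    # Stage 1: hyperparameter lambda ~ Gamma(2, 1) (illustrative choice)
    lam = rng.gamma(shape=2.0, scale=1.0, size=S)

    # Stage 2: theta_1, ..., theta_n | lambda are i.i.d. N(0, 1/lambda)
    theta = rng.normal(0.0, 1.0 / np.sqrt(lam)[:, None], size=(S, n))

    # Each row of theta is a draw from p(theta) with lambda integrated out
    print(theta.shape, theta.std(axis=0)[:3])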
Remark
The move from a fixed hyperparameter to a hierarchical prior structure represents an interesting example of the distinction between Classical and Bayesian robustness/sensitivity analysis.