Model Averaging
Bayesian versus Classical Approaches to Inference
Gernot Doppelhofer, Norwegian School of Economics (NHH)
CES, LMU München, 15-17 March 2011

Introduction
Course objective: an introduction to model averaging.
Motivation: increasing attention to model uncertainty in economic problems.
Feasibility: improvements in numerical techniques, combined with the falling cost of computing resources, make model averaging and simulation-based methods accessible.

Lecture Outline
Lecture 1. Bayesian and Classical Approaches to Inference: introduces a number of key issues in Bayesian analysis, including random parameters, prior uncertainty and model uncertainty, and contrasts them with the Classical approach.
Lecture 2. Model Averaging: introduces the basics of Frequentist and Bayesian model averaging, focusing on the loss function of the decision maker and the calculation of posterior distributions; benchmark model averaging in the normal, linear regression model.
Lecture 3. Applications of Model Averaging: inference, forecasting and policy evaluation.

Outline of Lecture 1
1. Probability Statements about Unknown Parameters; Probability Statements and Interval Estimation
2. Motivation: Multiple Models
3. Bayesian and Classical Objects
4. Binary Uncertainty; Bayesian Hypothesis Testing
5. Bayes Theorem
6. Aspects of Bayesian Inference
7. Simultaneous Bayesian and Classical Inference
8. Prior Uncertainty; Noninformative and Improper Priors; Natural Conjugate Priors; Hierarchical Priors

Probability Statements about Unknown Parameters
"A probability statement about a parameter β is simply not possible in classical statistics because parameters are not allowed to be random variables. However, statements about parameters are what the investigator actually wants in practice. It is hard to think that statistical inference about β could mean anything else." (Kendall and Stuart)

Bayesian versus Classical Approaches
Classical: relies (mostly) on distributions of estimators and test statistics over hypothetical repeated samples. It does NOT condition on the data; inferences are based on data not observed! Such sampling distributions are strictly irrelevant to Bayesian inference.
Bayesian: there are no estimators and no test statistics; there is no role for unbiasedness, minimum variance, efficiency, etc.

What are we interested in?
Estimation of parameters (for example, regression coefficients).
Testing of hypotheses and comparison of different models.
Prediction of quantities of interest.

A Bayesian Model for the Data p(y)
Requirements:
The Likelihood p(y|θ): a conditional probability statement about the data.
The Prior p(θ): a prior distribution over the parameters.
The Model p(y) = ∫ p(y|θ) p(θ) dθ.
The Estimator p(θ|y, M).
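As an illustration (not part of the original slides), the following sketch evaluates the four objects above, likelihood, prior, model and estimator, for i.i.d. normal data with known variance and a normal prior on the mean; all names and numerical settings are assumptions chosen for the example.

    import numpy as np

    # Assumed toy setup: y_i ~ N(theta, sigma^2) with sigma known; prior theta ~ N(m0, s0^2).
    rng = np.random.default_rng(0)
    sigma, m0, s0 = 1.0, 0.0, 2.0
    y = rng.normal(loc=0.5, scale=sigma, size=50)
    n, ybar = y.size, y.mean()

    # The likelihood p(y|theta) and the prior p(theta), evaluated on a grid of theta values
    # (multiplicative constants not depending on theta are dropped).
    theta = np.linspace(-2.0, 2.0, 2001)
    log_lik = -0.5 * ((y[:, None] - theta) ** 2 / sigma**2).sum(axis=0)
    log_prior = -0.5 * (theta - m0) ** 2 / s0**2

    # The model p(y) = ∫ p(y|theta) p(theta) dtheta, approximated on the grid
    # (up to the same dropped constants), and the estimator p(theta|y).
    joint = np.exp(log_lik + log_prior)
    p_y = np.trapz(joint, theta)
    post = joint / p_y

    # Conjugacy gives the exact posterior N(m1, s1^2) for comparison.
    s1sq = 1.0 / (1.0 / s0**2 + n / sigma**2)
    m1 = s1sq * (m0 / s0**2 + n * ybar / sigma**2)
    print("grid posterior mean:", np.trapz(theta * post, theta), "  exact:", m1)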
Remark: it is worth highlighting the Bayesian requirements in developing a model for the data, relative to the classical approach. Classical estimators, based for example on GMM, can proceed with moment conditions of the form E[ψ(z_m, θ)] = 0 for all m = 1, ..., M, plus a statement that the data are i.i.d. This is much less demanding than a likelihood statement.

Remark: BUT combining a likelihood statement with a prior gives us a posterior

p(θ|y, M) = p(y|θ, M) p(θ) / p(y)    (1)

and we then conduct inference that is exact and conditional on the observed data AND, at the outset, conditional on a given model M.

Probability Statements and Interval Estimation: Bayesian Perspective
Interpret the following probability statement for i.i.d. data x_1, ..., x_n ~ N(µ, σ²) with σ² known:

Pr(x̄ − 1.96 σ/√n < µ < x̄ + 1.96 σ/√n) = 0.95    (2)

Bayesian: this is a set whose probability content is 0.95; no point outside it has higher posterior density than any point inside it. It is more appropriate to write Pr(a < µ < b | Y), where Y denotes the available data. A Bayesian posterior probability interval provides the probability that µ lies in that fixed interval, conditional on the observed data. This statement means what it says! It does not refer to hypothetical repeated samples.

Probability Statements and Interval Estimation: Classical Perspective
A (1 − α) confidence interval is
I. a statement as to the percentage of yet-unobserved samples that will generate intervals that contain µ;
II. a statement that one should have 95% confidence in the parameter µ lying in a given interval; this is a probability statement about the random interval, and not about µ.
If we repeated the experiment and obtained 100 such intervals, about 95 of the 100 would contain the true unknown parameter.

Motivation: Multiple Models
The inference problem: suppose we want to investigate potential determinants of a dependent variable y given a (potentially large) set of regressors X.
Model Space: suppose that there are K potential regressors and the model space M is the set of all 2^K linear models. Model Mj is described by a (K × 1) binary vector γ = (γ_1, ..., γ_K)′, where a one (zero) indicates the inclusion (exclusion) of the corresponding explanatory variable.
Conditional and Unconditional Inference: the unknown parameter vector β represents the effects of the variables included in a regression model. We can estimate its distribution p(β|y, Mj) conditional on model Mj.

... And Unconditionally
Remark: given the (potentially large) space of models M, the unconditional effects of model parameters can be derived by integrating out all aspects of model uncertainty, including the space of models M. The posterior density p(β|y) can then be expressed as a function of the sample observations y.
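As a small illustrative sketch (not from the slides), the model space M can be enumerated through the binary inclusion vectors γ; the value of K and the regressor names are assumptions for the example.

    from itertools import product

    # Assumed example: K = 4 candidate regressors; each model M_j corresponds to a binary
    # inclusion vector gamma = (gamma_1, ..., gamma_K), with 1 = included, 0 = excluded.
    K = 4
    regressors = [f"x{i+1}" for i in range(K)]

    model_space = list(product([0, 1], repeat=K))    # all 2^K inclusion vectors
    print("number of models:", len(model_space))     # 2**K = 16

    # Map each gamma to the regressors it includes; unconditional (model-averaged)
    # inference would weight results across all of these models.
    for gamma in model_space[:5]:
        included = [x for x, g in zip(regressors, gamma) if g == 1]
        print(gamma, "->", included if included else "intercept only")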
Bayesian and Classical Objects
Posterior Model Uncertainty:

p(Mj|y) = l(y|Mj) · p(Mj) / Σ_{i=1}^{2^K} l(y|Mi) · p(Mi)    (3)

Prior Model Uncertainty: p(Mj).
Marginal Likelihood of Model Mj:

l(y|Mj) = ∫ p(y|α, σ, βj, Mj) · p(α, σ, βj|Mj) dα dσ dβj    (4)

Likelihood function: p(y|α, σ, βj, Mj); p(y; θ, Mj).
Posterior Parameter Uncertainty: p(α, σ, βj|y, Mj); p(θ|y, Mj).
Prior Parameter Uncertainty: p(α, σ, βj|Mj); p(θ|Mj).

Bayesian and Classical Objects, continued
Posterior Odds:

p(Mj|y) / p(Ml|y) = [l(y|Mj) / l(y|Ml)] · [p(Mj) / p(Ml)]    (5)

Bayes Factors: BFjl = l(y|Mj)/l(y|Ml). This is similar to a LR test, but instead of maximizing the respective likelihoods, Bayesians average them over the parameters.
Prior Odds: p(Mj)/p(Ml).

Posterior Model Uncertainty
For each model Ml in M, p(Mj|y) is determined by the prior odds p(Mj)/p(Ml) and the Bayes Factor BFjl = l(y|Mj)/l(y|Ml). Bayes Factors are sensitive to the choice of prior distributions for the model parameters θ = (α, σ, β), and specifically to the distinction between:
generic vs model-specific parameters;
improper vs proper priors.
Remark: we now consider the fundamental theorem of probability that will allow us to consider these objects in a unifying framework.

Binary Uncertainty: Bayesian Hypothesis Testing
Example: prior and posterior odds of death by homicide versus natural causes. The use of posterior odds to reflect the relative likelihood of cot death (D^c) versus death by homicide (D^H). Data: two deaths, D_2.

Pr(D^c|D_2) / Pr(D^H|D_2) = [Pr(D^c) / Pr(D^H)] · [Pr(D_2|D^c) / Pr(D_2|D^H)]    (6)

The second term on the RHS of (6) is the Bayes Factor. This may be used to test hypotheses within a Bayesian framework.
Example: suppose the prior odds are 1 and the posterior odds are 1 in 77. The sample evidence, namely Pr(D_2|D^c)/Pr(D_2|D^H), updates the prior odds of 1/1 to 1 in 77 in favor of the probability that the two observed deaths are due to cot death.
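Purely as an illustration (not from the slides), the sketch below turns equations (3) and (5) into a computation: posterior model probabilities, a Bayes Factor and posterior odds are obtained from assumed (made-up) log marginal likelihoods and prior model probabilities.

    import numpy as np

    # Assumed toy inputs: log marginal likelihoods l(y|M_j) and priors p(M_j) for three models;
    # the numbers are invented for the example.
    log_ml = np.array([-52.3, -50.1, -55.8])     # log l(y|M_j)
    prior = np.array([1/3, 1/3, 1/3])            # p(M_j), a uniform model prior

    # Equation (3): p(M_j|y) proportional to l(y|M_j) * p(M_j); work in logs for stability.
    log_post = log_ml + np.log(prior)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    print("posterior model probabilities:", post.round(4))

    # Equation (5): posterior odds of M_1 versus M_2 = Bayes Factor * prior odds.
    bf_12 = np.exp(log_ml[0] - log_ml[1])        # BF_12 = l(y|M_1) / l(y|M_2)
    posterior_odds_12 = bf_12 * (prior[0] / prior[1])
    print("BF_12:", round(bf_12, 4), " posterior odds M1:M2:", round(posterior_odds_12, 4))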
The logarithm of the Bayes Factor, say LBF, is sometimes called the weight of evidence given by one explanation (model, hypothesis) over another, i.e. H0 over H1. Classical hypothesis testing gives one hypothesis (or model) preferred status (the 'null hypothesis') and only considers evidence against it. A Bayes Factor greater than 1 (i.e. LBF > 0) means that the data indicate that H0 is more likely than H1.

Jeffreys gave a scale for interpretation of the Bayes Factor (expressed as odds for H0 against H1):

Bayes Factor       Strength of evidence
< 1:1              Negative (supports H1)
1:1 to 3:1         Barely worth mentioning
3:1 to 12:1        Positive
12:1 to 150:1      Strong
> 150:1            Very strong

Posterior Odds: A Derivation
It is convenient to start by defining

Pr[Y ≤ y | θ] = ∫_{−∞}^{y} f(Y|θ) dY = F(y|θ)

Use the fundamental rule of probability,

Pr[Y ≤ y ∩ H = H0] = Pr[Y ≤ y | H = H0] Pr[H = H0]

so that

Pr[H = H0 | Y ≤ y] = Pr[Y ≤ y | H = H0] Pr[H = H0] / Pr[Y ≤ y]

or, forming the ratio for H0 against H1,

Pr[H = H0 | Y ≤ y] / Pr[H = H1 | Y ≤ y] = (Pr[Y ≤ y | H = H0] / Pr[Y ≤ y | H = H1]) · (Pr[H = H0] / Pr[H = H1])    (7)

If F(y|θ) is a continuous distribution we can rewrite (7) as

Pr[H0|y] / Pr[H1|y] = (f(y|H0) / f(y|H1)) · (Pr[H0] / Pr[H1])    (8)

Posterior Odds = Bayes Factor × Prior Odds.
Remark: the Bayes Factor (BF) is similar to, but not the same as, the LR in classical statistics. The BF does not replace parameters by point estimates! One can show that the BF is a ratio of weighted likelihoods with the prior densities as weights.

Bayes Theorem for Events
For events E and F, the degree of belief in event F given the evidence, or event, E is equal to the joint probability of E and F divided by the probability of E. This last division is just a rescaling operation: given that E has occurred, total probability must now be defined with respect to a reduced universe (a different population).
Theorem: Pr(F|E) = [Pr(E|F) · Pr(F)] / Pr(E).
The probability of F given E is the unconditional probability of that part of F in E, Pr(E ∩ F), multiplied by the rescaling factor 1/Pr(E).

Bayes Theorem for Random Variables
Theorem:
g_Y(y|x) = g_X(x|y) f_Y(y) / Σ_y g_X(x|y) f_Y(y)    for Y discrete
g_Y(y|x) = g_X(x|y) f_Y(y) / ∫ g_X(x|y) f_Y(y) dy    for Y continuous
This is derived in an exactly analogous manner to Bayes Theorem for events. The interpretation also follows from what we know: the posterior beliefs about Y, in the light of the information provided by X, are given by the likelihood of getting the information x given y, multiplied by the prior beliefs about Y, divided by the marginal distribution of X.

Bayes Theorem for Parameters
Let p(θ) represent a density on an unknown random quantity θ and let p(θ|y), y = (y_1, ..., y_N), denote the posterior distribution. The probability of observing y conditional on θ is

f(y|θ) = ∏_{i=1}^{N} f(y_i|θ)    (9)

By the fundamental rule of probability, p(θ|y) f(y) = f(y|θ) p(θ), so

p(θ|y) = f(y|θ) p(θ) / f(y)    (10)
p(θ|y) ∝ f(y|θ) p(θ)    (11)

Remark: Bayes rule is simply a mapping from one conditional probability to another. It does not imply a Bayesian perspective unless the unconditional probability is the prior distribution.
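As a sketch only (not part of the original slides), equations (9)-(11) can be applied directly when the unknown parameter is discrete; here a coin's bias θ is assumed to take one of three values, and the prior, the candidate values and the data are all invented for the illustration.

    from math import prod

    # Assumed discrete example: theta in {0.3, 0.5, 0.7} with a uniform prior.
    thetas = [0.3, 0.5, 0.7]
    prior = {t: 1 / 3 for t in thetas}

    # Invented i.i.d. Bernoulli observations y (1 = success).
    y = [1, 0, 1, 1, 1, 0, 1, 1]

    # Equation (9): f(y|theta) is the product of the individual densities f(y_i|theta).
    def likelihood(theta, data):
        return prod(theta if yi == 1 else 1 - theta for yi in data)

    # Equations (10)-(11): p(theta|y) = f(y|theta) p(theta) / f(y), where f(y) is the
    # sum over theta of f(y|theta) p(theta) (the discrete-case denominator).
    joint = {t: likelihood(t, y) * prior[t] for t in thetas}
    f_y = sum(joint.values())
    posterior = {t: joint[t] / f_y for t in thetas}
    print(posterior)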
Dissecting Bayes Theorem

p(θ|y) = f(y|θ) p(θ) / f(y)    (12)

The Posterior Distribution p(θ|y).
The Likelihood Function l(y; θ, Mj): think of the data y as given, and maximize the probability of observing what we observe over θ ∈ Θ.
The Probability Density Function f_Y(y|θ, Mj): the pdf of Y evaluated at y, conditional on θ.
The Marginal Distribution (predictive density) f(y).
The Prior Distribution p(θ).

Aspects of Bayesian Inference
Exact finite sample inference: Bayesian results are exact in finite samples, since the distributions are derived conditional on the data.
Components of uncertainty: model, parameter, prior, ... are handled in different ways in the classical and Bayesian approaches.
Large sample approximations: much of sampling theory depends upon asymptotics, and its results are therefore only approximations for the observed sample of data.
Hypothesis tests, sampling theory: conditional on the null hypothesis being true and the sample data, here is the probability of a discrepancy between the null and an observed realisation in a random sample of data.
Hypothesis tests, Bayesian: conditional upon the prior and the data, here is the probability support for a particular hypothesis.
Bayesians measure the data's support for the hypothesis, while sampling theory measures the hypothesis's support for the data.

Pragmatic Bayesians: Constructing the Posterior
Remark: Bayesian procedures circumvent two of the most significant difficulties associated with classical estimation.
I. Bayesian procedures do not require maximization of any function.
II. Desirable estimation properties, such as consistency, can be obtained under more relaxed conditions with Bayesian procedures.

Simultaneous Bayesian and Classical Inference
Remark: the use of Bayesian procedures does not necessitate that an analyst adopt a Bayesian perspective on inference. As shown below, Bayesian procedures provide an estimator whose properties can be examined in purely classical ways.
Remark: under certain conditions, the estimator that results from Bayesian procedures is asymptotically equivalent to the ML estimator. In this sense, an advantage of a Bayesian perspective is that results can be interpreted from both perspectives simultaneously.

Bayesian Properties of the Posterior Mean

θ̄ = ∫ θ p(θ|y) dθ    (13)

The mean of the posterior distribution for θ, θ̄, is important from both Bayesian and Classical perspectives.
Bayesian: a decision has to be made that depends upon the value of θ. What value of θ minimises the expected cost of being wrong, given the beliefs represented in the posterior distribution?
Proposition: θ̄ is the value of θ that minimises the expected cost of being wrong about θ (if the cost is quadratic in the distance between θ and the true value of θ).

Remark: θ̄ is a statistic. What does its sampling distribution look like?
Proposition: θ̄ is an estimator that has the same asymptotic sampling distribution as the MLE.
The sampling distribution of the posterior mean? This sounds like a contradiction in terms, but it is a natural question for a classical analyst.
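As an illustrative sketch (not from the slides), the question just posed can be explored by simulation: for a normal model with a normal prior, the sampling distribution of the posterior mean over repeated samples is compared with that of the MLE (the sample mean); all settings are assumptions for the example.

    import numpy as np

    # Assumed setup: y_i ~ N(theta_true, sigma^2) with sigma known; prior theta ~ N(m0, s0^2).
    rng = np.random.default_rng(1)
    theta_true, sigma = 1.0, 1.0
    m0, s0 = 0.0, 1.0
    N, reps = 200, 5000

    post_means, mles = np.empty(reps), np.empty(reps)
    for r in range(reps):
        y = rng.normal(theta_true, sigma, size=N)
        mles[r] = y.mean()                               # the MLE of theta
        # Conjugate posterior mean: a precision-weighted average of prior mean and sample mean.
        s1sq = 1.0 / (1.0 / s0**2 + N / sigma**2)
        post_means[r] = s1sq * (m0 / s0**2 + N * y.mean() / sigma**2)

    # Over repeated samples the two estimators have nearly identical means and spreads,
    # in line with the asymptotic equivalence stated next.
    print("MLE:            mean %.4f  sd %.4f" % (mles.mean(), mles.std()))
    print("posterior mean: mean %.4f  sd %.4f" % (post_means.mean(), post_means.std()))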
The Bernstein-von Mises Theorem
Theorem:

√N (θ̄ − θ*) →_d N(0, (−H)^{−1})    (14)

which implies θ̄ ~_a N(θ*, (−H)^{−1}/N). Here θ* and −H denote the true value and the expected derivative of the score, respectively. The mean of the posterior, considered as a classical estimator, is asymptotically equivalent to the MLE.
Remark: instead of maximizing the likelihood function, one can calculate the mean of the posterior distribution and know that the resulting estimator is as good, in classical terms, as ML.

Pointwise Convergence versus Convergence to a Distribution
Remark: Bayesian inference is conducted using a posterior distribution; classical inference is conducted using a point estimator in conjunction with an interval estimator. Classical inference considers convergence to the maximum of a (likelihood) function. Bayesian techniques use an iterative process that converges to draws from the posterior distribution. Sometimes it can be difficult to verify whether convergence has taken place.

Prior Uncertainty
Remark: Bayes' theorem shows only how beliefs change. It does not dictate what beliefs (priors) should be.
Remark: "The stage of the process where most frequentists can be found circling wagons is when the prior is chosen." Koop, Poirier, Tobias (2007)

Classification of Prior Structure
Informative vs Noninformative Priors
Improper Priors
Natural Conjugate Priors
Hierarchical Priors

Noninformative and Improper Priors
Noninformative priors apply in cases where there is potentially wide disagreement over the form of the prior density, and it is desired that the data dominate. In most cases noninformative (NI) priors are improper.
Example: let θ ∈ [a, b]. A NI prior allocates equal prior weight to each equally sized sub-interval of the support [a, b]. This implies that a uniform prior would be appropriate. If a and b are unknown, we set them to −∞ and +∞ respectively. Any uniform density will integrate to infinity over (−∞, ∞) and in this sense is improper.

Improper Priors: Does this matter?
From Bayes Theorem,

p(θ|y) = f(y|θ) p(θ) / ∫ f(y|θ) p(θ) dθ    (15)

This still holds if the prior values over the support of p(θ) are multiplied by a given constant: the posterior probabilities will still sum (integrate) to 1 even if the prior values do not, and so the prior need only be specified in the correct proportion. In many cases the sum or integral of the prior values need not even be finite to get sensible answers for the posterior probabilities. Such a prior is called an improper prior.

Proper and Improper Priors, continued
A prior distribution for the mean and variance of a random variable.
Example: assume p(m, v) ∝ 1/v (for v > 0). This suggests that:
i. any value for the mean is equally likely;
ii. a value for the positive variance becomes less likely in inverse proportion to its value.
Since

∫_{−∞}^{+∞} dm = ∫_{0}^{+∞} (1/v) dv = ∞    (16)

this would be an improper prior both for the mean and for the variance.
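As a small numerical sketch (not from the slides), the point below equation (15) can be checked directly: multiplying an unnormalized prior by an arbitrary constant leaves the posterior unchanged, because the constant cancels in the normalization. The grid, the data and the constant are assumptions for the example.

    import numpy as np

    # Assumed setup: y_i ~ N(theta, 1); the prior is specified only up to proportion on a grid.
    rng = np.random.default_rng(2)
    y = rng.normal(0.3, 1.0, size=30)
    theta = np.linspace(-3.0, 3.0, 1201)
    log_lik = -0.5 * ((y[:, None] - theta) ** 2).sum(axis=0)

    def posterior(prior_values):
        """Normalize likelihood * prior on the grid, as in equation (15)."""
        unnorm = np.exp(log_lik - log_lik.max()) * prior_values
        return unnorm / np.trapz(unnorm, theta)

    flat = np.ones_like(theta)        # 'uniform' prior values
    scaled = 1000.0 * flat            # the same prior multiplied by a constant

    # The two posteriors coincide: only the proportions of the prior matter.
    print(np.allclose(posterior(flat), posterior(scaled)))    # True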
Natural Conjugate Priors
A class of priors is conjugate for a family of likelihoods if both prior and posterior are in the same class for all data y. A natural conjugate prior has the same functional form as the likelihood. When confronted by data, the parameters change as evidence accumulates from the prior base.
Definition: if τ is a class of sampling distributions p(y|θ) and ω is a class of prior distributions for θ, then the class ω is conjugate for τ if p(θ|y) ∈ ω for all p(y|θ) ∈ τ and p(θ) ∈ ω.
Example: the Beta prior is the natural conjugate for the Binomial distribution.

Hierarchical Priors
Remark: one of the advantages of Bayesian methods is that they offer real advantages in dealing with problems characterized by a large number of parameters. Hierarchical priors are one way of operationalizing this.
It is useful to think about the prior for a vector of parameters in stages. Suppose that θ = (θ_1, θ_2, ..., θ_n)′ and λ is a parameter of lower dimension than θ; λ may be a fixed hyperparameter or a random variable. The prior distribution p(θ) can be derived in stages:
p(θ|λ) p(λ) = p(θ, λ)
so p(θ) = ∫ p(θ|λ) p(λ) dλ, where λ is a hyperparameter, and
p(θ, λ, y) = p(y|θ) p(θ|λ) p(λ).

Remark: the move from a fixed hyperparameter to a hierarchical prior structure represents an interesting example of the distinction between Classical and Bayesian robustness/sensitivity analysis.
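Purely as an illustration (not part of the slides), the two-stage prior p(θ) = ∫ p(θ|λ) p(λ) dλ can be sampled by composition: draw the hyperparameter λ from p(λ) and then θ given λ from p(θ|λ). The particular distributions and settings are assumptions chosen for the example.

    import numpy as np

    # Assumed two-stage prior: lambda ~ Gamma(shape=2, scale=1) acts as a precision-type
    # hyperparameter, and theta_i | lambda ~ N(0, 1/lambda) for i = 1, ..., n.
    rng = np.random.default_rng(3)
    n_params, n_draws = 5, 10_000

    lam = rng.gamma(shape=2.0, scale=1.0, size=n_draws)        # draws from p(lambda)
    theta = rng.normal(0.0, 1.0 / np.sqrt(lam)[:, None],       # draws from p(theta | lambda)
                       size=(n_draws, n_params))

    # Each column of theta holds draws from the implied marginal prior
    # p(theta) = ∫ p(theta|lambda) p(lambda) d lambda, a scale mixture of normals
    # with heavier tails than any single normal component.
    sd = theta[:, 0].std()
    tail_share = np.mean(np.abs(theta[:, 0]) > 3 * sd)
    print("marginal prior sd:", round(float(sd), 3))
    print("share of draws beyond 3 sd:", tail_share)    # well above the ~0.27% of a normal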