Lecture 11. Bayesian Regression with
conjugate and non-conjugate priors
1) Set-up of the Bayesian Regression Model
2) The standard “improper” non-informative prior
3) Conjugate Prior Analysis
A really, really fast review
of multiple regression
Many studies concern the relationship between two or more observable
quantities.
How does a quantity y (the dependent variable) vary as a
function of another quantity x (the independent variable)?
Regression models allow us to examine the conditional distribution of y
given x, parameterized as p(y|B,x) when the n observations (yi, xi)
are exchangeable.
The normal linear model occurs when the distribution of y given X is
normal with mean equal to a linear function of X: E(yi|B,X) = B1X1i
+ … + BkXki for i = 1,…,n, where X1 is a vector of ones.
The ordinary linear regression model occurs when the variance of y
given X and B is assumed to be constant over all observations.
In other words, we have an ordinary linear regression model when:
yi ~ N(B1X1i + … + BkXki, σ2)
for i = 1,…,n
More review of regression
If yi ~ N(B1X1i + … + BkXki, σ2), then it is well known that the ordinary least
squares estimates and the maximum likelihood estimates of the
parameters B1, …, Bk are equivalent.
If B = [B1, …, Bk]T, then the frequentist estimate B’ is:
B’ = (XTX)-1XTY
We know by the Gauss-Markov theorem that this estimate is BLUE (the best linear unbiased estimator).
The frequentist estimate of 2 is s2 = (Y-XB’)T(Y-XB’)/(n-k)
The uncertainty about B’ is summarized by the regression coefficients’
standard errors, the square roots of the diagonal elements of the matrix:
Var(B’) = s2(XTX)-1
Finally, we know that if Vi is the ith diagonal element of (XTX)-1, then:
(Bi’ − 0)/(s Vi1/2) ~ tn-k
This statistic forms the basis for our hypothesis tests.
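To make the formulas above concrete, the following is a minimal numpy sketch (not part of the lecture): it simulates a toy data set and computes B’, s2, the standard errors, and the t statistics. The simulated X and y and all numerical settings are placeholders, not data from the lecture.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # X1 is a column of ones
beta_true = np.array([1.0, 2.0, -0.5])                           # illustrative coefficients
y = X @ beta_true + rng.normal(scale=1.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b_hat = XtX_inv @ X.T @ y                        # B' = (X'X)^-1 X'Y  (OLS = MLE)
resid = y - X @ b_hat
s2 = resid @ resid / (n - k)                     # s^2 = (Y - XB')'(Y - XB') / (n - k)
se = np.sqrt(s2 * np.diag(XtX_inv))              # standard errors: sqrt of diag of s^2 (X'X)^-1
t_stats = b_hat / se                             # (B_i' - 0) / (s V_i^(1/2))  ~  t_{n-k} under B_i = 0
p_vals = 2 * stats.t.sf(np.abs(t_stats), df=n - k)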
The Bayesian Setup
For the normal linear model, we have:
yi ~ N(μi, σ2)
for i = 1,…,n
where μi is shorthand for the expression:
μi = B0 + B1X1i + … + BkXki
The object of statistical inference is the posterior
distribution of the parameters B0,…,Bk and σ2.
By Bayes Rule, we know that this is simply:
p(B0,…,Bk, σ2 | Y, X) ∝ p(B0,…,Bk, σ2) Πi p(yi | μi, σ2)
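As an illustration only (not part of the lecture), the unnormalized log posterior can be evaluated directly from this factorization. The sketch below plugs in the standard noninformative prior p(B, σ2) ∝ 1/σ2 introduced on the next slide; any other prior would just change the log_prior line.

import numpy as np
from scipy import stats

def log_posterior(beta, sigma2, y, X):
    # log p(B, sigma2 | Y, X) up to an additive constant:
    # log prior + sum_i log N(y_i | mu_i, sigma2), with mu_i = x_i' B
    if sigma2 <= 0:
        return -np.inf
    log_prior = -np.log(sigma2)                  # p(B, sigma2) proportional to 1/sigma2
    mu = X @ beta
    log_lik = stats.norm.logpdf(y, loc=mu, scale=np.sqrt(sigma2)).sum()
    return log_prior + log_lik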
Bayesian regression with standard noninformative priors
By Bayes Rule, the posterior distribution is:
p(B0,…,Bk, σ2 | Y, X) ∝ p(B0,…,Bk, σ2) Πi p(yi | μi, σ2)
To make inferences about the regression coefficients, we obviously
need to choose a prior distribution for B, 2.
The standard non-informative prior distribution is uniform on (B, log σ2),
which is equivalent to:
p(B, σ2) ∝ σ-2
This prior is a good choice for statistical models when you have a lot of
data points and only a few parameters.
Why? Because if you have a lot of data and few parameters, then the
the likelihood function is very sharply peaked, which means that the
likelihood (the data) will dominate posterior inferences. With small
sample sizes or a lot of parameters, prior distributions or hierarchical
models become more important for analysis.
Bayesian regression with non-informative priors
If Y|B,σ2,X ~ N(XB, σ2I) and p(B, σ2) ∝ σ-2, then it follows that the
conditional posterior distribution p(B|σ2, data) can be written:
p(B|σ2, data) ~ Nmultivariate(B’, σ2(XTX)-1), where B’ = (XTX)-1XTY
This follows from completing the square, as we have seen over and
over again when dealing with the normal distribution.
The posterior distribution of σ2 can be written as:
p(σ2|data) ~ Scaled Inv-χ2(n−k, s2), where s2 = (Y−XB’)T(Y−XB’)/(n−k)
To make inferences about the marginal posterior of B, we can either:
1) take repeated samples from the posterior of σ2 and then B|σ2, like we
did when we studied the normal model with unknown mean and
variance; or,
2) integrate out σ2 from the conditional posterior for B and find that the
marginal distribution of B is a multivariate t distribution:
p(B | data) ~ multivariate tn-k(B’, s2(XTX)-1)
Notice the close correspondence with the classical results. The key
difference is the interpretation of the standard errors.
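Continuing the earlier numpy sketch (b_hat, s2, XtX_inv, n, and k are assumed from that block; this is illustrative code, not the lecture's), both routes to the marginal posterior of B can be simulated:

import numpy as np

rng = np.random.default_rng(1)
n_draws = 5000
L = np.linalg.cholesky(XtX_inv)                  # L L' = (X'X)^-1

# Route 1: sigma2 | data ~ Scaled Inv-chi2(n-k, s2), then B | sigma2 ~ N(B', sigma2 (X'X)^-1)
sigma2_draws = (n - k) * s2 / rng.chisquare(df=n - k, size=n_draws)
z1 = rng.standard_normal(size=(n_draws, k))
beta_draws = b_hat + np.sqrt(sigma2_draws)[:, None] * (z1 @ L.T)

# Route 2: integrate out sigma2; B | data ~ multivariate t_{n-k}(B', s2 (X'X)^-1)
z2 = rng.standard_normal(size=(n_draws, k))
w = rng.chisquare(df=n - k, size=n_draws)
beta_t_draws = b_hat + np.sqrt(s2) * np.sqrt((n - k) / w)[:, None] * (z2 @ L.T)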
Congdon Example 4.1
In Example 4.1, Congdon considers a peculiar example
concerning a method for weighing small objects, where the
measurement error is assumed to be normal with
mean zero.
The dependent variable is the recorded weight and the
independent variables are dummy variables for the
presence or absence of object A and object B. X1=1 if A
is present and 0 otherwise. X2=1 if B is present and 0
otherwise. Both objects may be weighed together.
Thus, yi ~ N(B1X1i + B2X2i, σ2)
[note, no intercept]
B1 = weight of object A and B2 = weight of object B
Congdon adopts the reference prior: p(B1, B2, σ2) ∝ σ-2
WinBugs Implementation of Example 4.1
model {
  # Residuals and the sample variance s2 = SSE/(n - k) = SSE/16
  for (i in 1:18) {
    yhat[i] <- b[1]*X[i,1] + b[2]*X[i,2]
    e[i] <- y[i] - yhat[i]
    e2[i] <- pow(e[i], 2)
  }
  s2 <- sum(e2[]) / 16

  # Posterior mean of the regression coefficients: b = (X'X)^-1 X'y
  for (i in 1:2) {
    b[i] <- inprod(XTX.INV[i,], Xy[])
    Xy[i] <- inprod(X[,i], y[])
    for (j in 1:2) {
      Prec[i,j] <- XTX[i,j]/s2        # posterior precision X'X / s2
      Disp[i,j] <- s2*XTX.INV[i,j]    # posterior dispersion s2 (X'X)^-1
      XTX[i,j] <- inprod(X[,i], X[,j])
    }
  }
  XTX.INV[1:2,1:2] <- inverse(XTX[,])
  # note: WinBugs now has a different command for inversion than that used by Congdon

  # Direct sampling of the regression coefficients from a multivariate t with
  # mean b, precision Prec, and 16 (= n - k) degrees of freedom
  beta[1:2] ~ dmt(b[], Prec[,], 16)
}
Bayesian Model Evaluation
Model evaluation proceeds just like it would in the traditional OLS
framework. Models are largely evaluated based on their fitted
(predicted) values. For example,
R2 = Σ(ŷi − ȳ)2 / Σ(yi − ȳ)2, where ŷi is the fitted value for yi.
In this respect, Bayesian model evaluation is easy.
- The fitted value for an observation i is a random draw from a t
distribution with mean XiB’ and variance s2(1 + Xi(XTX)-1XiT) on n−k
degrees of freedom. In most cases, you can simply use XiB’ as the
fitted value and ignore this additional uncertainty (see the sketch below).
- Other diagnostics that you should use are plots of errors versus
covariates, histograms of the error term, etc., to make sure that the
assumptions of the normal linear model hold.
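Continuing the same numpy sketch (X, y, b_hat, s2, and XtX_inv assumed from above; illustrative only), the fitted values, R2, and the per-observation predictive variance can be computed as:

import numpy as np

y_fit = X @ b_hat                                    # point fitted values X_i B'
r2 = np.sum((y_fit - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

# s^2 (1 + x_i' (X'X)^-1 x_i): the extra predictive uncertainty mentioned above
pred_var = s2 * (1 + np.einsum('ij,jk,ik->i', X, XtX_inv, X))

errors = y - y_fit                                   # plot these against covariates / histogram them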
Comment on the regression setup
There are two subtle points regarding the Bayesian regression setup.
First, a full Bayesian model includes a distribution for the independent
variables X, p(X | ψ). Therefore, we have a joint likelihood p(X, Y |
B, σ2, ψ) and a joint prior p(B, σ2, ψ). The fundamental assumption of the
normal linear model is that the parameters of p(y | X, B, σ2) and of p(X | ψ)
are independent in their prior distributions, such that the posterior
distribution factors into: p(B, σ2, ψ | X, Y) = p(B, σ2 | X, Y) p(ψ | X, Y)
As a result, p(B, σ2 | X, Y) ∝ p(B, σ2) p(Y | B, σ2, X)
Second, when we set up our probability model, we are implicitly
conditioning on a model, call it H, which represents our beliefs about
the data-generating process. Thus,
p(B, σ2 | X, Y, H) ∝ p(B, σ2 | H) p(Y | B, σ2, X, H)
It is important to keep in mind that our inferences are dependent on H,
and this is equally true for the frequentist perspective, where results
can be dependent on the choice of likelihood function, covariates,
etc.
Conjugate priors and the normal linear model
Suppose that instead of an improper prior, we decide to
use the conjugate prior.
For the normal regression model, the conjugate prior
distribution for p(B0,…,Bk, σ2) is the normal-inverse-gamma distribution.
We’ve seen this distribution before when we studied the
normal model with unknown mean and variance. We
know that this distribution can be factored such that:
p(B0,…,Bk, σ2) = p(B0,…,Bk | σ2) p(σ2)
p(B0,…,Bk | σ2) ~ NMV(Bprior, σ2Σprior),
where Σprior determines the conditional prior covariance matrix for the B’s,
and p(σ2) ~ Inverse-Gamma(aprior, bprior).
Conjugate priors and the normal linear model
If we use a conjugate prior, then the posterior distribution will have the same
form as the prior. Thus, the posterior distribution is also normal-inverse-gamma.
If we integrate out σ2, the marginal posterior for B will be a multivariate t-distribution with mean:
E(B | Data) = (Σ-1 + XTX)-1 (Σ-1Bprior + XTXB’)
- Notice that the coefficients are essentially a weighted average of the
prior coefficients described by Bprior and the standard OLS estimates B’.
- The weights are provided by the conditional prior precision Σ-1 and the
data XTX. This should make clear that as we increase our prior precision
(decrease our prior variance) for B, we place greater posterior weight on
our prior beliefs relative to the data.
Note: Zellner (1971) treats Bprior and the conditional prior variance Σ in
the following way: suppose you have two data sets, (Y1, X1) and (Y2, X2). He
sets Bprior equal to the posterior mean from a regression analysis of Y1, X1
with the improper prior 1/σ2 and sets Σ-1 equal to X1TX1.
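As a sketch only (continuing the numpy example; b_prior and Sigma_prior below are illustrative prior settings, not Zellner's or Congdon's choices), the weighted-average form of the posterior mean can be computed directly:

import numpy as np

b_prior = np.zeros(k)                       # prior mean for the coefficients
Sigma_prior = 10.0 * np.eye(k)              # conditional prior covariance (scaled by sigma2)
Sigma_inv = np.linalg.inv(Sigma_prior)      # conditional prior precision

XtX = X.T @ X
post_mean = np.linalg.solve(Sigma_inv + XtX, Sigma_inv @ b_prior + XtX @ b_hat)
# Shrinking Sigma_prior (raising Sigma_inv) pulls post_mean toward b_prior;
# more data information (larger X'X) pulls it toward the OLS estimate b_hat.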
Conjugate priors and the normal linear model
To summarize our uncertainty about the coefficients, the
variance-covariance matrix for B is given by:
 0



Cov 1 | Data  


 2


Y
~
where s  2b 
1
~
s  1  X T X 
nak 3
The posterior standard deviation can be
taken from the square root of the
diagonal of this matrix.
 XB' Y  XB' 
T
T
 B prior  E B   1B prior  B' E B  X T XB'
nak 3
The first term comes from the inverse-gamma prior. The second term is the residual
sum of squares, which drives the usual maximum likelihood estimate of the variance. The
third term states that our variance estimates will be greater if our prior values for the
regression coefficients differ from their posterior values, especially if we indicate a
great deal of confidence in our prior beliefs by assigning small variances in the
matrix Σ. The fourth term states that our variance estimates for the regression
coefficients will be greater if the standard OLS estimates differ from the posterior
values, especially if XTX is large.
A WinBugs implementation would proceed in a manner akin to the earlier example.
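For completeness, a numpy sketch of the posterior covariance under the conjugate prior, continuing the previous block (Sigma_inv, XtX, post_mean, b_prior, b_hat, X, y, and n are assumed from above; a_prior and b_ig are illustrative inverse-gamma prior values):

import numpy as np

a_prior, b_ig = 2.0, 2.0                            # illustrative Inverse-Gamma(a, b) prior on sigma2

V = np.linalg.inv(Sigma_inv + XtX)                  # (Sigma^-1 + X'X)^-1
sse = (y - X @ b_hat) @ (y - X @ b_hat)             # (Y - XB')'(Y - XB')
d_prior = b_prior - post_mean                       # Bprior - E(B | Data)
d_ols = b_hat - post_mean                           # B' - E(B | Data)

s2_tilde = (2 * b_ig + sse
            + d_prior @ Sigma_inv @ d_prior
            + d_ols @ XtX @ d_ols) / (n + 2 * a_prior - 2)

cov_B = s2_tilde * V
post_sd = np.sqrt(np.diag(cov_B))                   # posterior standard deviations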