Lecture 11. Bayesian Regression with
conjugate and non-conjugate priors
1) Set-up of the Bayesian Regression Model
2) The standard “improper” non-informative prior
3) Conjugate Prior Analysis
A really, really fast review
of multiple regression
Many studies concern the relationship between two or more observable
quantities.
How does a quantity y (the dependent variable) vary as a
function of another quantity x (the independent variable)?
Regression models allow us to examine the conditional distribution of y
given x, parameterized as p(y|B,x) when the n observations (yi, xi)
are exchangeable.
The normal linear model occurs when the distribution of y given X is
normal with mean equal to a linear function of X: E(yi|B,X) = B1X1i
+ … + BkXki for i = 1,…,n, where X1 is a vector of ones.
The ordinary linear regression model occurs when the variance of y
given X and B is assumed to be constant over all observations.
In other words, we have an ordinary linear regression model when:
yi ~ N(B1X1i + … + BkXki, σ2)
for i = 1,…,n
More review of regression
If yi ~ N(B1X1i + … + BkXki, σ2), then it is well known that the ordinary least
squares estimates and the maximum likelihood estimates of the
parameters B1, …, Bk are equivalent.
If B = [B1, …, Bk]T, then the frequentist estimate B’ is:
B’ = (XTX)-1XTY
We know by the Gauss-Markov theorem that this estimate is BLUE (the best linear unbiased estimator).
The frequentist estimate of 2 is s2 = (Y-XB’)T(Y-XB’)/(n-k)
The uncertainty about B’ is summarized by the regression coefficients’
standard errors, the square roots of the diagonal elements of the matrix:
Var(B’) = s2(XTX)-1
Finally, we know that if Vi is the ith diagonal element of (XTX)-1, then:
(Bi’ − 0)/(s Vi1/2) ~ tn-k
This statistic forms the basis for our hypothesis tests.
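To make the formulas above concrete, the following is a minimal numpy sketch (not part of the lecture): it simulates a toy data set and computes B’, s2, the standard errors, and the t statistics. The simulated X and y and all numerical settings are placeholders, not data from the lecture.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # X1 is a column of ones
beta_true = np.array([1.0, 2.0, -0.5])                           # illustrative coefficients
y = X @ beta_true + rng.normal(scale=1.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b_hat = XtX_inv @ X.T @ y                        # B' = (X'X)^-1 X'Y  (OLS = MLE)
resid = y - X @ b_hat
s2 = resid @ resid / (n - k)                     # s^2 = (Y - XB')'(Y - XB') / (n - k)
se = np.sqrt(s2 * np.diag(XtX_inv))              # standard errors: sqrt of diag of s^2 (X'X)^-1
t_stats = b_hat / se                             # (B_i' - 0) / (s V_i^(1/2))  ~  t_{n-k} under B_i = 0
p_vals = 2 * stats.t.sf(np.abs(t_stats), df=n - k)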
The Bayesian Setup
For the normal linear model, we have:
yi ~ N(μi, σ2)
for i = 1,…,n
where μi is shorthand for the expression:
μi = B0 + B1X1i + … + BkXki
The object of statistical inference is the posterior
distribution of the parameters B0,…,Bk and σ2.
By Bayes Rule, we know that this is simply:
p(B0,…,Bk, σ2 | Y, X) ∝ p(B0,…,Bk, σ2) Πi p(yi | μi, σ2)
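As an illustration only (not part of the lecture), the unnormalized log posterior can be evaluated directly from this factorization. The sketch below plugs in the standard noninformative prior p(B, σ2) ∝ 1/σ2 introduced on the next slide; any other prior would just change the log_prior line.

import numpy as np
from scipy import stats

def log_posterior(beta, sigma2, y, X):
    # log p(B, sigma2 | Y, X) up to an additive constant:
    # log prior + sum_i log N(y_i | mu_i, sigma2), with mu_i = x_i' B
    if sigma2 <= 0:
        return -np.inf
    log_prior = -np.log(sigma2)                  # p(B, sigma2) proportional to 1/sigma2
    mu = X @ beta
    log_lik = stats.norm.logpdf(y, loc=mu, scale=np.sqrt(sigma2)).sum()
    return log_prior + log_lik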
Bayesian regression with standard noninformative priors
By Bayes Rule, the posterior distribution is:
p(B0,…,Bk, σ2 | Y, X) ∝ p(B0,…,Bk, σ2) Πi p(yi | μi, σ2)
To make inferences about the regression coefficients, we obviously
need to choose a prior distribution for B, 2.
The standard non-informative prior distribution is uniform on (B, log σ2),
which is equivalent to:
p(B, σ2) ∝ σ-2
This prior is a good choice for statistical models when you have a lot of
data points and only a few parameters.
Why? Because if you have a lot of data and few parameters, then the
the likelihood function is very sharply peaked, which means that the
likelihood (the data) will dominate posterior inferences. With small
sample sizes or a lot of parameters, prior distributions or hierarchical
models become more important for analysis.
Bayesian regression with non-informative priors
If Y|B,σ2,X ~ N(XB, σ2I) and p(B, σ2) ∝ σ-2, then it follows that the
conditional posterior distribution p(B|σ2, data) can be written:
p(B|σ2, data) ~ Nmultivariate(B’, σ2(XTX)-1), where B’ = (XTX)-1XTY
This follows from completing the square, as we have seen over and
over again when dealing with the normal distribution.
The posterior distribution of σ2 can be written as:
p(σ2|data) ~ Scaled Inv-χ2(n−k, s2), where s2 = (Y−XB’)T(Y−XB’)/(n−k)
To make inferences about the marginal posterior of B, we can either:
1) take repeated samples from the posterior of σ2 and then B|σ2, like we
did when we studied the normal model with unknown mean and
variance; or,
2) integrate out σ2 from the conditional posterior for B and find that the
marginal distribution of B is a multivariate t distribution:
p(B | data) ~ multivariate tn-k(B’, s2(XTX)-1)
Notice the close correspondence with the classical results. The key
difference is the interpretation of the standard errors.
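Continuing the earlier numpy sketch (b_hat, s2, XtX_inv, n, and k are assumed from that block; this is illustrative code, not the lecture's), both routes to the marginal posterior of B can be simulated:

import numpy as np

rng = np.random.default_rng(1)
n_draws = 5000
L = np.linalg.cholesky(XtX_inv)                  # L L' = (X'X)^-1

# Route 1: sigma2 | data ~ Scaled Inv-chi2(n-k, s2), then B | sigma2 ~ N(B', sigma2 (X'X)^-1)
sigma2_draws = (n - k) * s2 / rng.chisquare(df=n - k, size=n_draws)
z1 = rng.standard_normal(size=(n_draws, k))
beta_draws = b_hat + np.sqrt(sigma2_draws)[:, None] * (z1 @ L.T)

# Route 2: integrate out sigma2; B | data ~ multivariate t_{n-k}(B', s2 (X'X)^-1)
z2 = rng.standard_normal(size=(n_draws, k))
w = rng.chisquare(df=n - k, size=n_draws)
beta_t_draws = b_hat + np.sqrt(s2) * np.sqrt((n - k) / w)[:, None] * (z2 @ L.T)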
Congdon Example 4.1
In Example 4.1, Congdon considers a peculiar example
concerning a method for weighing small objects, where the
measurement error is assumed to be normal with
mean zero.
The dependent variable is the recorded weight and the
independent variables are dummy variables for the
presence or absence of object A and object B. X1=1 if A
is present and 0 otherwise. X2=1 if B is present and 0
otherwise. Both objects may be weighed together.
Thus, yi ~ N(B1X1i + B2X2i, σ2)
[note, no intercept]
B1 = weight of object A and B2 = weight of object B
Congdon adopts the reference prior: p(B1, B2, σ2) ∝ σ-2
WinBugs Implementation of Example 4.1
model {
  # Residuals and the sample variance s2 = SSE/(n - k) = SSE/16
  for (i in 1:18) {
    yhat[i] <- b[1]*X[i,1] + b[2]*X[i,2]
    e[i] <- y[i] - yhat[i]
    e2[i] <- pow(e[i], 2)
  }
  s2 <- sum(e2[]) / 16

  # Posterior mean of the regression coefficients: b = (X'X)^-1 X'y
  for (i in 1:2) {
    b[i] <- inprod(XTX.INV[i,], Xy[])
    Xy[i] <- inprod(X[,i], y[])
    for (j in 1:2) {
      Prec[i,j] <- XTX[i,j]/s2        # posterior precision X'X / s2
      Disp[i,j] <- s2*XTX.INV[i,j]    # posterior dispersion s2 (X'X)^-1
      XTX[i,j] <- inprod(X[,i], X[,j])
    }
  }
  XTX.INV[1:2,1:2] <- inverse(XTX[,])
  # note: WinBugs now has a different command for inversion than that used by Congdon

  # Direct sampling of the regression coefficients from a multivariate t with
  # mean b, precision Prec, and 16 (= n - k) degrees of freedom
  beta[1:2] ~ dmt(b[], Prec[,], 16)
}
Bayesian Model Evaluation
Model evaluation proceeds just like it would in the traditional OLS
framework. Models are largely evaluated based on their fitted
(predicted) values. For example,
R2 = Σ(ŷi − ȳ)2 / Σ(yi − ȳ)2, where ŷi is the fitted value for yi.
In this respect, Bayesian model evaluation is easy.
- The fitted value for an observation i is a random draw from a t
distribution with mean XiB’ and variance s2(1 + Xi(XTX)-1XiT) on n−k
degrees of freedom. In most cases, you can simply use XiB’ as the
fitted value and ignore this additional uncertainty (see the sketch below).
- Other diagnostics that you should use are plots of errors versus
covariates, histograms of the error term, etc., to make sure that the
assumptions of the normal linear model hold.
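Continuing the same numpy sketch (X, y, b_hat, s2, and XtX_inv assumed from above; illustrative only), the fitted values, R2, and the per-observation predictive variance can be computed as:

import numpy as np

y_fit = X @ b_hat                                    # point fitted values X_i B'
r2 = np.sum((y_fit - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

# s^2 (1 + x_i' (X'X)^-1 x_i): the extra predictive uncertainty mentioned above
pred_var = s2 * (1 + np.einsum('ij,jk,ik->i', X, XtX_inv, X))

errors = y - y_fit                                   # plot these against covariates / histogram them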
Comment on the regression setup
There are two subtle points regarding the Bayesian regression setup.
First, a full Bayesian model includes a distribution for the independent
variables X, p(X | ψ). Therefore, we have a joint likelihood p(X, Y |
B, σ2, ψ) and a joint prior p(B, σ2, ψ). The fundamental assumption of the
normal linear model is that the parameters of p(y | X, B, σ2) and of p(X | ψ)
are independent in their prior distributions, such that the posterior
distribution factors into: p(B, σ2, ψ | X, Y) = p(B, σ2 | X, Y) p(ψ | X, Y)
As a result, p(B, σ2 | X, Y) ∝ p(B, σ2) p(Y | B, σ2, X)
Second, when we set up our probability model, we are implicitly
conditioning on a model, call it H, which represents our beliefs about
the data-generating process. Thus,
p(B, σ2 | X, Y, H) ∝ p(B, σ2 | H) p(Y | B, σ2, X, H)
It is important to keep in mind that our inferences are dependent on H,
and this is equally true for the frequentist perspective, where results
can be dependent on the choice of likelihood function, covariates,
etc.
Conjugate priors and the normal linear model
Suppose that instead of an improper prior, we decide to
use the conjugate prior.
For the normal regression model, the conjugate prior
distribution for p(B0,…,Bk, σ2) is the normal-inverse-gamma distribution.
We’ve seen this distribution before when we studied the
normal model with unknown mean and variance. We
know that this distribution can be factored such that:
p(B0,…,Bk, σ2) = p(B0,…,Bk | σ2) p(σ2)
p(B0,…,Bk | σ2) ~ NMV(Bprior, σ2Σprior),
where Σprior determines the conditional prior covariance matrix for the B’s,
and p(σ2) ~ Inverse-Gamma(aprior, bprior).
Conjugate priors and the normal linear model
If we use a conjugate prior, then the posterior distribution will have the same
form as the prior. Thus, the posterior distribution is also normal-inverse-gamma.
If we integrate out σ2, the marginal posterior for B will be a multivariate t-distribution with mean:
E(B | Data) = (Σ-1 + XTX)-1 (Σ-1Bprior + XTXB’)
- Notice that the coefficients are essentially a weighted average of the
prior coefficients described by Bprior and the standard OLS estimates B’.
- The weights are provided by the conditional prior precision Σ-1 and the
data XTX. This should make clear that as we increase our prior precision
(decrease our prior variance) for B, we place greater posterior weight on
our prior beliefs relative to the data.
Note: Zellner (1971) treats Bprior and the conditional prior variance Σ in
the following way: suppose you have two data sets, (Y1, X1) and (Y2, X2). He
sets Bprior equal to the posterior mean from a regression analysis of Y1, X1
with the improper prior 1/σ2 and sets Σ-1 equal to X1TX1.
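As a sketch only (continuing the numpy example; b_prior and Sigma_prior below are illustrative prior settings, not Zellner's or Congdon's choices), the weighted-average form of the posterior mean can be computed directly:

import numpy as np

b_prior = np.zeros(k)                       # prior mean for the coefficients
Sigma_prior = 10.0 * np.eye(k)              # conditional prior covariance (scaled by sigma2)
Sigma_inv = np.linalg.inv(Sigma_prior)      # conditional prior precision

XtX = X.T @ X
post_mean = np.linalg.solve(Sigma_inv + XtX, Sigma_inv @ b_prior + XtX @ b_hat)
# Shrinking Sigma_prior (raising Sigma_inv) pulls post_mean toward b_prior;
# more data information (larger X'X) pulls it toward the OLS estimate b_hat.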
Conjugate priors and the normal linear model
To summarize our uncertainty about the coefficients, the
variance-covariance matrix for B is given by:
 0



Cov 1 | Data  


 2


Y
~
where s  2b 
1
~
s  1  X T X 
nak 3
The posterior standard deviation can be
taken from the square root of the
diagonal of this matrix.
 XB' Y  XB' 
T
T
 B prior  E B   1B prior  B' E B  X T XB'
nak 3
The first term comes from the inverse-gamma prior. The second term is the residual
sum of squares, which drives the usual maximum likelihood estimate of the variance. The
third term states that our variance estimates will be greater if our prior values for the
regression coefficients differ from their posterior values, especially if we indicate a
great deal of confidence in our prior beliefs by assigning small variances in the
matrix Σ. The fourth term states that our variance estimates for the regression
coefficients will be greater if the standard OLS estimates differ from the posterior
values, especially if XTX is large.
A WinBugs implementation would proceed in a manner akin to the earlier example.
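For completeness, a numpy sketch of the posterior covariance under the conjugate prior, continuing the previous block (Sigma_inv, XtX, post_mean, b_prior, b_hat, X, y, and n are assumed from above; a_prior and b_ig are illustrative inverse-gamma prior values):

import numpy as np

a_prior, b_ig = 2.0, 2.0                            # illustrative Inverse-Gamma(a, b) prior on sigma2

V = np.linalg.inv(Sigma_inv + XtX)                  # (Sigma^-1 + X'X)^-1
sse = (y - X @ b_hat) @ (y - X @ b_hat)             # (Y - XB')'(Y - XB')
d_prior = b_prior - post_mean                       # Bprior - E(B | Data)
d_ols = b_hat - post_mean                           # B' - E(B | Data)

s2_tilde = (2 * b_ig + sse
            + d_prior @ Sigma_inv @ d_prior
            + d_ols @ XtX @ d_ols) / (n + 2 * a_prior - 2)

cov_B = s2_tilde * V
post_sd = np.sqrt(np.diag(cov_B))                   # posterior standard deviations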