MAST30027: Modern Applied Statistics
Week 12 Lab Sheet

1. Suppose that $X = (X_1, X_2)^T \sim N(\mu, \Sigma)$, with $\mu = (\mu_1, \mu_2)^T$ and
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}.$$

(a) Show that the conditional distribution of $X_1 \mid X_2 = x_2$ is normal with mean $\mu_1 + (x_2 - \mu_2)\sigma_{12}/\sigma_2^2$ and variance $\sigma_1^2 - \sigma_{12}^2/\sigma_2^2$.

Solution: Let $\Sigma^{-1} = \begin{pmatrix} a & b \\ b & c \end{pmatrix}$. Then the conditional density of $X_1$ given $X_2 = x_2$ is
$$f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f(x_2)} \propto f(x_1, x_2) \propto \exp\{-\tfrac{1}{2}[(x_1 - \mu_1)^2 a + 2(x_1 - \mu_1)(x_2 - \mu_2) b + (x_2 - \mu_2)^2 c]\}$$
$$\propto \exp\{-\tfrac{1}{2}[x_1^2 a - 2 x_1 (\mu_1 a - (x_2 - \mu_2) b)]\} \propto \exp\{-\tfrac{1}{2} a [x_1 - (\mu_1 - (x_2 - \mu_2) b/a)]^2\}.$$
Thus $X_1 \mid X_2 = x_2 \sim N(\mu_1 - (x_2 - \mu_2) b/a,\ 1/a)$, where $a = \sigma_2^2/(\sigma_1^2 \sigma_2^2 - \sigma_{12}^2)$ and $b = -\sigma_{12}/(\sigma_1^2 \sigma_2^2 - \sigma_{12}^2)$, and thus $b/a = -\sigma_{12}/\sigma_2^2$ and $1/a = \sigma_1^2 - \sigma_{12}^2/\sigma_2^2$.

(b) Write an R function that uses the Gibbs sampler to generate a sample of size $n = 1000$ from the $N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 4 & 1 \\ 1 & 4 \end{pmatrix} \right)$ distribution. Plot traces of $X_1$ and $X_2$.

Solution: To test the simulator we do a normal probability plot for each marginal, and both look good. The traces show pretty good mixing.

    > set.seed(200)
    > # params
    > mu1 <- 0
    > mu2 <- 0
    > s11 <- 4
    > s12 <- 1
    > s22 <- 4
    > # initial values
    > x1 <- 6
    > x2 <- -6
    > # sample size
    > nreps <- 1000
    > Gsamples <- matrix(nrow=nreps, ncol=2)
    > Gsamples[1,] <- c(x1, x2)
    > # main loop: alternately sample from each full conditional
    > for (i in 2:nreps) {
    +     x1 <- rnorm(1, mu1 + (x2 - mu2)*s12/s22, sqrt(s11 - s12^2/s22))
    +     x2 <- rnorm(1, mu2 + (x1 - mu1)*s12/s11, sqrt(s22 - s12^2/s11))
    +     Gsamples[i,] <- c(x1, x2)
    + }
    > # output
    > par(mfrow=c(2,2), mar=c(2,4,1,1))
    > qqnorm(Gsamples[,1], main="x1")
    > qqnorm(Gsamples[,2], main="x2")
    > plot(Gsamples[,1], type="l", xlab="iteration", ylab="x1")
    > plot(Gsamples[,2], type="l", xlab="iteration", ylab="x2")

[Figure: normal Q-Q plots of the x1 and x2 marginals (top row) and trace plots of x1 and x2 (bottom row).]

(c) Use your simulator to estimate $P(X_1 \ge 0, X_2 \ge 0)$. To get a feel for the convergence rate, calculate the estimate using samples $\{1, \ldots, k\}$, for $k = 1, \ldots, n$, and then plot the estimates against $k$.

Solution: The plot appears after part (d).

    > par(mfrow=c(1,1))
    > success <- apply(Gsamples, 1, function(x) (x[1] > 0) & (x[2] > 0))
    > mean(success)
    [1] 0.296
    > plot(1:nreps, cumsum(success)/(1:nreps), type="l", xlab="k", ylab="prob", ylim=c(0,1))

(d) Now change $\Sigma$ to $\begin{pmatrix} 4 & 2.8 \\ 2.8 & 4 \end{pmatrix}$ and generate another sample of size 1000. What do the traces/estimates look like now?

Solution: We put s12 <- 2.8 and then re-run the code above, getting a different Gsamples. We plot the cumulative estimates on top of the previous graph using lines. The cumulative estimates are more volatile in the second case, reflecting the stronger autocorrelation in the Markov chain, caused by the stronger correlation between $X_1$ and $X_2$.

    > success <- apply(Gsamples, 1, function(x) (x[1] > 0) & (x[2] > 0))
    > mean(success)
    [1] 0.38
    > lines(1:nreps, cumsum(success)/(1:nreps), col="red")

[Figure: cumulative estimates of P(X1 >= 0, X2 >= 0) against k, for sigma12 = 1 (black) and sigma12 = 2.8 (red).]
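The estimates from parts (c) and (d) can be checked against the exact orthant probabilities, computed by numerical integration; a minimal sketch, assuming the mvtnorm package is installed:

    library(mvtnorm)

    # Exact P(X1 >= 0, X2 >= 0) for each covariance matrix
    Sigma1 <- matrix(c(4, 1, 1, 4), 2, 2)
    Sigma2 <- matrix(c(4, 2.8, 2.8, 4), 2, 2)
    pmvnorm(lower = c(0, 0), upper = c(Inf, Inf),
            mean = c(0, 0), sigma = Sigma1)   # approx 0.290
    pmvnorm(lower = c(0, 0), upper = c(Inf, Inf),
            mean = c(0, 0), sigma = Sigma2)   # approx 0.373

Both Gibbs estimates (0.296 and 0.38) are within Monte Carlo error of these exact values.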
2. Read the "Dyes" example from the WinBUGS Examples Vol. 1 (in the Help menu). The posterior predictive distribution is the posterior distribution of a fitted value. Let $T$ be some function, applied to either the observed values or the fitted values; then the posterior predictive p-value is the probability that $T$ applied to the fitted values is larger than $T$ applied to the observations. The posterior predictive p-value is averaged over the posterior distribution of the parameters, so it is a single number rather than a distribution. Values between 0.05 and 0.95 are generally considered reasonable, while a value smaller than 0.01 or larger than 0.99 indicates a major failure of the model (Gelman et al. §6.3).

Here is a modification of the Dyes code from the examples; six lines have been added to the model. Note that the ranked function picks out a particular value from an ordered sample (the largest in this case; see the User Manual in the Help menu).

    model {
        for (i in 1 : batches) {
            mu[i] ~ dnorm(theta, tau.btw)
            for (j in 1 : samples) {
                y[i, j] ~ dnorm(mu[i], tau.with)
                yfit[i, j] ~ dnorm(mu[i], tau.with)
                resid[i, j] <- abs(y[i, j] - mu[i])
                fresid[i, j] <- abs(yfit[i, j] - mu[i])
            }
            largest[i] <- ranked(resid[i, ], samples)
            flargest[i] <- ranked(fresid[i, ], samples)
            pppv[i] <- step(flargest[i] - largest[i])
        }
        theta ~ dnorm(0.0, 1.0E-10)
        # prior for within-variation
        sigma2.with <- 1 / tau.with
        tau.with ~ dgamma(0.001, 0.001)
        # prior for between-variation
        # ICC = sigma2.btw / (sigma2.btw + sigma2.with)
        ICC ~ dunif(0, 1)
        sigma2.btw <- sigma2.with * ICC / (1 - ICC)
        tau.btw <- 1 / sigma2.btw
    }

    # data
    list(batches = 6, samples = 5,
         y = structure(
             .Data = c(1545, 1440, 1440, 1520, 1580,
                       1540, 1555, 1490, 1560, 1495,
                       1595, 1550, 1605, 1510, 1560,
                       1445, 1440, 1595, 1465, 1545,
                       1595, 1630, 1515, 1635, 1625,
                       1520, 1455, 1450, 1480, 1445),
             .Dim = c(6, 5)))

    # initial values
    list(theta = 1500, tau.with = 1, ICC = 0.5, mu = c(1, 1, 1, 1, 1, 1),
         yfit = structure(
             .Data = c(1545, 1440, 1440, 1520, 1580,
                       1540, 1555, 1490, 1560, 1495,
                       1595, 1550, 1605, 1510, 1560,
                       1445, 1440, 1595, 1465, 1545,
                       1595, 1630, 1515, 1635, 1625,
                       1520, 1455, 1450, 1480, 1445),
             .Dim = c(6, 5)))

What are the nodes yfit, resid, fresid and pppv for? Fit the model and monitor pppv. Does the output suggest any problems with model fit?

Solution: Here yfit are the fitted values, resid are the distances between the observations and their means, and fresid are the corresponding distances for the fitted values. In this example we are calculating posterior predictive p-values for six test statistics, one for each batch. For batch $i$ the test statistic is $\max_j |y_{ij} - \mu_i|$. pppv[i] is a Bernoulli r.v., equal to 1 when the test statistic is larger for the fitted values than for the observed values, so the posterior mean of pppv[i] gives us an estimate of the posterior predictive p-value. Running the model we get posterior means for the pppv[i] all comfortably between 0.05 and 0.95, so (from the point of view afforded by these test statistics) there is no cause to worry that the model does not fit.

3. The "Rats" example in WinBUGS Examples Vol. 1 and the "Birats" example in Vol. 2 both model data on the growth of baby rats. Estimate the DIC for these two models. In both cases run at least 1000 burn-in iterations before you set the DIC, then use output for the DIC based on 10,000 additional iterations. You can use the initial values given in the examples. What is the estimated effective number of (free) parameters in each model? What could explain the difference? Is one model strongly preferred to the other after the penalty for model complexity is taken into account?

Solution: To estimate the DIC we use the DIC... button in the Inference menu. For the first model we have an estimated DIC of 1020.5 and effective number of parameters $p_D = 54.5$. For the second model the estimated DIC is 1019.2 and $p_D = 49.9$.
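For reference, these quantities are defined via the deviance $D(\theta) = -2 \log f(y \mid \theta)$:
$$p_D = \bar{D} - D(\bar{\theta}), \qquad \mathrm{DIC} = \bar{D} + p_D = D(\bar{\theta}) + 2 p_D,$$
where $\bar{D}$ is the posterior mean of the deviance and $D(\bar{\theta})$ is the deviance evaluated at the posterior means of the parameters. The DIC is thus a measure of fit plus a penalty for model complexity.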
For both models we have $y_{ij} \sim N(\mu_{ij}, 1/\tau)$, where $\mu_{ij} = \alpha_i + \beta_i x_j$ for some $\alpha_i$ and $\beta_i$. Without any prior information about the parameters, both models have 61 free parameters, since $i = 1, \ldots, 30$ (30 intercepts $\alpha_i$, 30 slopes $\beta_i$, plus the precision $\tau$).

In the first model the $\alpha_i$ and $\beta_i$ are independent, while the second model allows for dependence between $\alpha_i$ and $\beta_i$, for each $i$. The priors for the first model are clearly not completely uninformative, as the effective number of parameters has come down from 61 to 54.5. The second model is a generalisation of the first, which we naturally think of as meaning it is more flexible, so it is perhaps surprising that the effective number of parameters has gone down. The reason $p_D$ is lower for the second model is that the priors are not completely uninformative: the extra structure that allows for dependence between $\alpha_i$ and $\beta_i$ ends up introducing a (small) prior dependence between them, which constrains the fit slightly more. The difference in DIC is small, so there is no strong preference for one model over the other.
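The same comparison can be run from R via JAGS instead of the WinBUGS GUI. The sketch below is a rough equivalent, assuming the rjags package is installed, that the model definitions have been saved as rats.bug and birats.bug (hypothetical file names), and that rats.data, rats.inits, birats.data and birats.inits are R lists holding the data and initial values from the examples:

    library(rjags)

    # Compile each model and discard at least 1000 burn-in iterations
    # before monitoring the deviance (dic.samples needs >= 2 chains)
    m1 <- jags.model("rats.bug", data = rats.data, inits = rats.inits,
                     n.chains = 2)
    update(m1, 1000)   # burn-in
    dic1 <- dic.samples(m1, n.iter = 10000, type = "pD")

    m2 <- jags.model("birats.bug", data = birats.data, inits = birats.inits,
                     n.chains = 2)
    update(m2, 1000)
    dic2 <- dic.samples(m2, n.iter = 10000, type = "pD")

    dic1   # mean deviance, penalty (pD), and penalized deviance (DIC)
    dic2
    diffdic(dic1, dic2)   # direct comparison of the two models

Note that JAGS estimates $p_D$ somewhat differently from WinBUGS, so expect values close to, but not identical to, those reported above.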