Model Evaluation & Model Selection
Modeling process
• Problem identification / data collection: identify scientific objectives; collect & understand data; draw upon existing theory/knowledge.
• Model specification: visualize the model (DAG); write the model mathematically using probability notation & appropriate distributions, or write down the (unnormalized) posterior; derive the full-conditional distributions.
• Model implementation: program the model components in software; construct the MCMC algorithm.
• Model evaluation & inference: fit the model (using MCMC & data); evaluate the model (posterior predictive checks); use the output to make inferences; model selection.
Motivating issues
• How well does my model(s) fit my data? [evaluation]
– Just because the MCMC procedure “went smoothly” doesn’t mean you have a “good model”
– Just because you got posterior stats for your parameters of interest doesn’t mean you have a “good model”
– Check the ability of the model to “replicate” the observed data
– Potentially check the ability of the model to “predict” observed data (via cross-validation)
• Alternative model formulations = alternative hypotheses about the system. Which model? [selection]
– Which model agrees best with my data?
– Which model is simpler to interpret?
– Which model satisfies both criteria? Which model should I choose?
– Combine alternative models? Model averaging
Lecture content
• Posterior predictive checks:
– Observed vs “predicted”
– Replicated data
– Bayesian p-values
• Model selection/comparison:
– Deviance
– Akaike Information Criterion (AIC) in the likelihood
framework
– Deviance information criterion (DIC)
– Posterior predictive loss (D∞)
The first question we should ask after fitting a model: Are the
predictions of the model consistent with the data?
1. Is our process model a reasonable representation?
2. Have we made the right choices of distributions to represent the
uncertainties?
Evaluate model fits
• Model evaluation and diagnostics are relatively under-developed in Bayesian
analysis
• We often rely on relatively qualitative and informal methods that are based on
ideas developed in “classical” analyses
• But, one particularly useful method for evaluating model fit is to compare
posterior predictions of “replicated” data with observed data.
• Examine the ability of the model to produce “replicated” data that are consistent
with the observed data
• Assume that the “replicated” data (yrep) arise from the same sampling distribution used to define the likelihood of the observed data (y):
yi ~ P(y | θ)
yrep,i ~ P(y | θ)
• Again, y is observed and yrep is not observed; thus, we obtain the posterior predictive distribution for each yrep.
• Compare the posterior predictive distribution for each yrep to the corresponding observed y value.
Posterior predictive checks
P(ynew | y) = ∫ P(ynew | θ) P(θ | y) dθ
Posterior predictive distribution
• It is called posterior because it is conditional on the
observed y and predictive because it is a prediction for an
observable ynew.
• It gives the probability of a new prediction of y
conditional on θ , which in turn is conditioned on the data
at hand, y.
The mechanics
• We have a scientific model g(θ, x) that predicts a response y. We estimate the posterior distribution P(θ | y). For any given value of x, we can simulate from the posterior predictive distribution of ynew by making a draw θ = θ′ from P(θ | y) and then drawing
ynew ~ P(g(θ′, x), σ).
• In MCMC, this simply means making draws from the data
model because each draw is conditional on the current value of
the parameters. These draws define the posterior predictive
distribution in exactly the same way that draws allow us to
define the posterior probability of the parameters.
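A minimal R sketch of this idea, assuming we already have stored posterior draws for a simple linear data model; all names and the stand-in draws below are hypothetical, not part of the lecture code.

set.seed(1)
draws <- data.frame(b0    = rnorm(5000, 1, 0.1),      # stand-in posterior draws;
                    b1    = rnorm(5000, 0.5, 0.05),   # in practice these come from your MCMC output
                    sigma = rgamma(5000, 50, 100))
x.new <- 10
# one draw of theta' from P(theta | y) -> one draw of y.new from the data model g(theta', x)
y.new <- rnorm(nrow(draws), mean = draws$b0 + draws$b1 * x.new, sd = draws$sigma)
quantile(y.new, c(0.025, 0.5, 0.975))   # summary of the posterior predictive distribution at x.new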
DAG: Back to hemlock trees
[DAG: data model: yi is the observed growth; process model: μi is the true value of growth (it is “latent”, i.e., not observable), with process precision τ; parameter model: priors on b0, b1, and τ.]
g(b0, b1, diami) = b0 + b1·diami
P(b0, b1, τ | y, diam) ∝ ∏_{i=1}^{n} normal(yi | g(b0, b1, diami), τ) × normal(b0 | 0, .0001) normal(b1 | 0, .0001) gamma(τ | .001, .001)
model{
  for(i in 1:length(y)){
    mu[i] <- b0 + b1*diam[i]
    y[i] ~ dnorm(mu[i], tau)
    # posterior predictive distribution of y.sim[i], i.e., replicated (unobserved) trees
    y.sim[i] ~ dnorm(mu[i], tau)
  }
  # Priors
  b0 ~ dnorm(0, .0001)
  b1 ~ dnorm(0, .0001)
  tau ~ dgamma(.0001, .0001)
  sigma <- 1/sqrt(tau)
}
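The transcript does not show how this model is run; a minimal rjags sketch is given below. The file name "hemlock_model.txt" and the data frame hemlock (with columns growth and diam) are assumptions, not the lecture's code.

library(rjags)
data <- list(y = hemlock$growth, diam = hemlock$diam)   # assumed data frame / column names
jm <- jags.model("hemlock_model.txt", data = data, n.chains = 3, n.adapt = 1000)
update(jm, n.iter = 5000)                               # burn-in
# monitor the parameters and the replicated data y.sim for posterior predictive checks
zm <- coda.samples(jm, variable.names = c("b0", "b1", "sigma", "y.sim"), n.iter = 5000)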
Marginal posterior distributions
Recall that, by Monte Carlo integration,
E(θ | y) ≈ (1/K) Σ_{k=1}^{K} θ^(k)
and
var(θ | y) ≈ (1/K) Σ_{k=1}^{K} (θ^(k) − E(θ | y))²
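A small R check of the Monte Carlo estimates above: with K stored MCMC draws of θ (stand-in draws below), the posterior mean and variance are approximated by the corresponding sample summaries.

theta.draws <- rnorm(10000, mean = 2, sd = 0.5)   # stand-in for K = 10000 posterior draws of theta
E.theta   <- mean(theta.draws)                    # approximates E(theta | y)
var.theta <- mean((theta.draws - E.theta)^2)      # approximates var(theta | y)
c(E.theta, var.theta)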
Derived quantities
• Equivariance: Any quantity calculated from a random variable
becomes a random variable with its own probability distribution.
• These quantities may be of scientific interest in themselves (e.g.,
biomass using allometric equations, Shannon Diversity Index,
effect sizes…and so on).
• The derived quantity may involve model parameters, latent
processes, or data.
• Equivariance is also incredibly useful for calculating the goodness of fit of the model against the observed data and for making forecasts about yet-unobserved quantities.
Bayesian p-values
• Let T(y, θ) be a test statistic (e.g., mean, standard deviation, CV, quantile, sums of squared discrepancy, etc.) associated with the observed data
• Likewise, let T(yrep, θ) be the corresponding test statistic associated with the replicated data
• We can calculate the “tail” probability p:
p = P[T(yrep, θ) ≥ T(y, θ) | y]
or
p = P[T(yrep, θ) ≤ T(y, θ) | y]
• If p is very large (e.g., p >>0.5 or close to 1) or very small (i.e., p <<0.5
or close to 0), then the difference between the observed and simulated
data cannot be attributed to chance, indicating potential lack of fit.
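A Bayesian p-value can also be computed in R after the run, rather than with step() inside the model. Below is a minimal sketch using the CV as the test statistic; y is the observed data and y.sim is a (draws × n) matrix of monitored replicated data sets, with stand-in values used here. For statistics that depend on θ (e.g., the sums of squared discrepancy), T(y, θ) and T(yrep, θ) are both evaluated at each draw, as in the step() examples that follow.

set.seed(1)
y     <- rpois(60, 3)                              # stand-in observed counts
y.sim <- matrix(rpois(5000 * 60, 3), nrow = 5000)  # stand-in replicated data sets
cv    <- function(z) sd(z) / mean(z)
T.obs <- cv(y)                                     # T(y)
T.rep <- apply(y.sim, 1, cv)                       # T(yrep), one value per MCMC draw
p.value.cv <- mean(T.rep >= T.obs)                 # tail probability; values near 0 or 1 suggest lack of fit
p.value.cv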
R. A. Fisher’s ticks
A simple example: We want to know the average
number of ticks on sheep. We round up 60 sheep and
count ticks on each one. Does a Poisson distribution fit
the data?
P(λ | y) ∝ ∏_{i=1}^{60} P(yi | λ) P(λ)
For each value of λ in the MCMC chain, we generate a new data set, ysim, by sampling from:
P(ysim | λ) P(λ | y)
[DAG: a single mean λ (the parameter) governs the pattern of the data yi.]
model{
  # prior
  lambda ~ dgamma(0.001, 0.001)
  # Key part: the data model and a simulated (replicated) data set
  for(i in 1:60){
    y[i] ~ dpois(lambda)
    y.sim[i] ~ dpois(lambda)   # simulate a new data set of 60 points
  }
  # test statistics for the observed and simulated data
  cv.y <- sd(y[])/mean(y[])
  cv.y.sim <- sd(y.sim[])/mean(y.sim[])
  mean.y <- mean(y[])
  mean.y.sim <- mean(y.sim[])
  # Bayesian p-values: the mean of many 0's and 1's returned by the step function,
  # one for each step in the chain (step(z) = 1 if z >= 0)
  pvalue.cv <- step(cv.y.sim - cv.y)
  pvalue.mean <- step(mean.y.sim - mean.y)
  # Sums of squares
  for(j in 1:60){
    sq[j] <- (y[j] - lambda)^2
    sq.new[j] <- (y.sim[j] - lambda)^2
  }
  fit <- sum(sq[])
  fit.new <- sum(sq.new[])
  pvalue.fit <- step(fit.new - fit)
} # end of model
[Figure: histograms (density vs. number of ticks, 0–10) of the real data and of a simulated data set from the simple Poisson model.]
Posterior predictive check: simple model
fit <- sum(sq[])
fit.new <- sum(sq.new[])
pvalue.fit <- step(fit.new - fit)
p-value for CV = 0.0015
p-value for mean = 0.51
Remember, this is a two-tailed probability, so values close to 0 and 1 indicate lack of fit.
[Figure: histograms of the real data and of the simulated data (number of ticks, 0–10).]
Hierarchical model
[DAG: each sheep i has its own mean λi (a.k.a. a random effect) governing its count yi; the λi are drawn from a distribution governed by the hyperparameters a and b.]
P(a, b, λ | y) ∝ ∏_{i=1}^{60} P(yi | λi) P(λi | a, b) P(a) P(b)

model{
  # Priors on the hyperparameters
  a ~ dgamma(.001, .001)
  b ~ dgamma(.001, .001)
  for(i in 1:60){
    lambda[i] ~ dgamma(a, b)
    y[i] ~ dpois(lambda[i])
    y.sim[i] ~ dpois(lambda[i])
  }
  # Bayesian p-values: the mean of many 0's and 1's returned by the step function,
  # one for each step in the chain
  cv.y <- sd(y[])/mean(y[])
  cv.y.sim <- sd(y.sim[])/mean(y.sim[])
  pvalue.cv <- step(cv.y.sim - cv.y)
  mean.y <- mean(y[])
  mean.y.sim <- mean(y.sim[])
  pvalue.mean <- step(mean.y.sim - mean.y)
  # Sums of squares
  for(j in 1:60){
    sq[j] <- (y[j] - lambda[j])^2
    sq.new[j] <- (y.sim[j] - lambda[j])^2
  }
  fit <- sum(sq[])
  fit.new <- sum(sq.new[])
  pvalue.fit <- step(fit.new - fit)
} # end of model
Posterior predictive check: hierarchical model
fit <- sum(sq[])
fit.new <- sum(sq.new[])
pvalue.fit <- step(fit.new - fit)
p-value for CV = 0.46
p-value for mean = 0.51
Remember, this is a two-tailed probability, so values close to 0 and 1 indicate lack of fit.
[Figure: histograms of the real data and of the simulated data (number of ticks, 0–10).]
Posterior predictive checks: further reading
• Gelman, A., and J. Hill. 2009. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Cambridge, UK.
• Link, W. A., and R. J. Barker. 2010. Bayesian Inference with Ecological Applications. Academic Press.
• Kéry, M. 2010. Introduction to WinBUGS for Ecologists: A Bayesian Approach to Regression, ANOVA, Mixed Models and Related Analyses. Academic Press.
“Model selection and model averaging are deep
waters, mathematically, and no consensus has
emerged in the substantial literature on a single
approach. Indeed, our only criticism of the wide use
of AIC weights in wildlife and ecological statistics is
with their uncritical acceptance and the view that
this challenging problem has been simply resolved”
Link, W. A., and R. J. Barker. 2006. Model weights and the
foundations of multi-model inference. Ecology 87:2626
The problem of model selection
• Up until now, we have been concerned with the
uncertainties associated with a given model.
• What about the uncertainty that arises from our
choice of models?
• How do we decide which model is best?
• How do we make inferences based on multiple
models?
Parsimony==Ockham’s razor
William of Ockham (1285-1349)
“Pluralitas non est ponenda sine neccesitate”
“entities should not be multiplied unnecessarily”
“Parsimony: ... 2 : economy in the use of means to an end;
especially : economy of explanation in conformity with Occam's
razor”
(Merriam-Webster Online Dictionary)
Information theory and the principle of
parsimony
True model: y = exp((x − 0.3)²) − 1
Generated ten datasets by sampling errors from a normal distribution with mean = 0 and var = 0.01.
Fit five models to these ten datasets:
y = β0 + β1x
y = β0 + β1x + β2x²
y = β0 + β1x + β2x² + β3x³
y = β0 + β1x + β2x² + β3x³ + β4x⁴
y = β0 + β1x + β2x² + β3x³ + β4x⁴ + β5x⁵
What creates noise in models?
Illustration of tradeoff
The Kullback-Leibler distance
Interpretation of Kullback-Leibler Information (a.k.a. the distance between 2 models)
• Given truth represented by f and a model g approximating that truth, the K-L distance measures the information lost by using model g to approximate f.
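The slides describe K-L information only verbally; for reference, the standard definition (not shown on the slides) is
I(f, g) = ∫ f(x) log[ f(x) / g(x | θ) ] dx,
which is asymmetric, so I(f, g) ≠ I(g, f), and equals zero only when g matches f.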
Interpretation of Kullback-Leibler Information (a.k.a. the distance between 2 models)
[Figure: histogram of the “truth” f(x) alongside approximating models g1(x) and g2(x), fit as gamma, Weibull, and lognormal distributions (count vs. x, 0–20).]
Measures the (asymmetric) distance between two models. Minimizing the information lost when using g(x) to approximate f(x) is the same as maximizing the likelihood.
Heuristic interpretation of K-L
Model comparison
• Within the classical modeling framework, we trade off a measure of fit (typically deviance) against a measure of complexity (typically the number of parameters).
How do we know the truth?
Akaike’s Information Criterion
Akaike defined “an information criterion” that related
K-L distance and the maximized log-likelihood as follows:
AIC = −2 ln(L(θ̂ | y)) + 2K
This is an estimate of the expected, relative distance between
the fitted model and the unknown true mechanism that
generated the observed data.
K=number of estimable parameters
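A minimal R illustration (not from the slides): computing AIC by hand from the maximized log-likelihood of a simple linear regression and checking it against R's built-in AIC().

set.seed(1)
x <- runif(50); y <- 1 + 2 * x + rnorm(50, sd = 0.3)
fit <- lm(y ~ x)
ll  <- logLik(fit)                        # maximized log-likelihood
K   <- attr(ll, "df")                     # number of estimated parameters (including sigma)
aic.by.hand <- -2 * as.numeric(ll) + 2 * K
c(aic.by.hand, AIC(fit))                  # the two values agree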
Deviance
D(y, θ) = −2 log(P(y | θ))
• Deviance (deviance) is a built-in node in JAGS; thus, you can:
– monitor the deviance,
– look at its history plots (helpful for evaluating overall model convergence and potential “problem” chains), and
– compute posterior statistics for it.
• Use the difference in AIC to compare competing models:
Δr = AICr − min(AIC),  where  min(AIC) = min(AIC1, …, AICn)
• As a rule of thumb, models having Δr ≤ 2 have sufficient support; they should receive consideration in making inferences. Models having Δr within about 3–7 have considerably less support, while models with Δr ≥ 10 have essentially no support.
But there is a better way….
L(modelr | y) ∝ e^(−Δr/2)
• The likelihood of a model given the data decreases exponentially with increasing Δr. Note that the likelihood of the best model = 1 and all other likelihoods are relative to the likelihood of the best model.
Likelihood ratio from AIC
e^(−Δ1/2) / e^(−Δ2/2) = relative strength of evidence in the data for model 1 over model 2
Akaike Weights
wr = e^(−Δr/2) / Σ_{i=1}^{R} e^(−Δi/2) = (likelihood of model r | data) / (total likelihood of all models | data)
wr are Akaike weights, the likelihood of one of the candidate
models divided by the sum of the likelihoods of all of the
candidates. The wr for the best model does not equal 1. The wr sum
to 1.
The wr can be thought of as “probabilities.” This is a frequentist
interpretation derived from simulation. They are not “true”
probabilities. (Link, W. A., and R. J. Barker. 2006)
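A small R sketch (hypothetical AIC values) of how Δr, the relative likelihoods, and the Akaike weights are computed:

aic <- c(m1 = 102.3, m2 = 104.1, m3 = 110.8)   # hypothetical AIC values for 3 candidate models
delta   <- aic - min(aic)                      # Delta_r = AIC_r - min(AIC)
rel.lik <- exp(-0.5 * delta)                   # relative likelihood; best model = 1
w       <- rel.lik / sum(rel.lik)              # Akaike weights; these sum to 1
round(cbind(delta, rel.lik, w), 3)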
Interpretation of Akaike Weights
• wi is the weight of evidence in favor of model i being
the actual best K-L model given that one of the R
models must be the K-L best model of the candidate
set.
• “probability” that model i is the actual best K-L
model
• Last statement is quite controversial.
The raptors…moving towards model selection
Summary of problem and data: In most northern temperate regions, diurnal birds of prey (raptors) migrate seasonally between their breeding and wintering grounds. Most raptors are obligate or facultative soaring migrants that congregate along major thermal and orographic updraft corridors.
[Modeling-process step: identify the scientific problem/objectives; understand the data; draw upon existing theory/knowledge.]
We might wish to analyze the raptor survey data to understand how temperature and wind speed affect the chance of observing birds of each species.
Data: Autumn migration counts of multiple species of raptors in NE U.S., conducted
during 2010.
yd,s = number of birds observed on day d for species s
95 days, 15 species
xd = total time of observation period (minutes)
Td = average air temperature (C) during observation period on day d
WSd = average wind speed (km/hr) during observation period on day d
Therrien et al. 2012. Ecology.
Visualize model via DAG
[DAG: covariates T (temperature), WS (wind speed), and x (observation time) feed a latent day- and species-level process; the observed counts y are the stochastic data node; species-level parameters are drawn from population-level hyperparameters; a month-level random effect and several σ (variance) terms complete the graph. Legend: data (stochastic), latent process, data parameters, process parameters.]
Specify model (Model 1, with over-dispersion)
Likelihood:
y_{d,s} ~ Poisson(λ_{d,s} · x_d)
Log link function (for log-linear Poisson regression):
μ_{d,s} = log(λ_{d,s}),  i.e.,  λ_{d,s} = exp(μ_{d,s})
Stochastic model for the linear predictor (accounts for over-dispersion):
μ_{d,s} ~ Normal(θ_{1,s} + θ_{2,s}·T_d + θ_{3,s}·WS_d + θ_{4,s}·T_d·WS_d + ε_{m(d)}, σ²)
Hierarchical priors for the species-level effects parameters:
θ_{k,s} ~ N(θ̂_k, σ²_{θk}),  k = 1, 2, 3, 4
Zero-centered hierarchical prior for the month random effect:
ε_m ~ N(0, σ²_ε)
Conjugate, relatively non-informative priors for the root nodes:
θ̂_k ~ N(0, 10000)
τ, τ_{θk}, τ_ε ~ gamma(0.01, 0.01), where σ² = 1/τ for each σ² term
Is over-dispersion needed? (Model 2, without over-dispersion)
Log link function (for log-linear Poisson regression):
μ_{d,s} = log(λ_{d,s}),  i.e.,  λ_{d,s} = exp(μ_{d,s})
Model for the linear predictor (without over-dispersion):
μ_{d,s} = θ_{1,s} + θ_{2,s}·T_d + θ_{3,s}·WS_d + θ_{4,s}·T_d·WS_d + ε_{m(d)}
Hierarchical priors for the species-level effects parameters:
θ_{k,s} ~ N(θ̂_k, σ²_{θk}),  k = 1, 2, 3, 4
Zero-centered hierarchical prior for the month random effect:
ε_m ~ N(0, σ²_ε)
Conjugate, relatively non-informative priors for the root nodes:
θ̂_k ~ N(0, 10000)
τ_{θk}, τ_ε ~ gamma(0.01, 0.01), where σ² = 1/τ for each σ² term
Implement (code) the models
Model 1: BUGS code shown here.
Model 2 (no over-dispersion): BUGS code shown here.
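The BUGS code itself is not reproduced in the transcript. As a hedged sketch only, JAGS-style code for Model 1 might look like the following; the variable names (y, x, T, WS, month, Ndays, Nspp, Nmonths) and the bookkeeping are assumptions, not the instructors' code.

model{
  for(d in 1:Ndays){
    for(s in 1:Nspp){
      y[d,s] ~ dpois(lambda[d,s] * x[d])          # counts scaled by observation time
      lambda[d,s] <- exp(mu[d,s])                 # inverse log link
      # stochastic linear predictor: the Normal error is the over-dispersion term
      mu[d,s] ~ dnorm(theta[1,s] + theta[2,s]*T[d] + theta[3,s]*WS[d]
                      + theta[4,s]*T[d]*WS[d] + eps[month[d]], tau)
      y.rep[d,s] ~ dpois(lambda[d,s] * x[d])      # replicated data for checks
    }
  }
  for(k in 1:4){
    for(s in 1:Nspp){
      theta[k,s] ~ dnorm(theta.hat[k], tau.theta[k])   # species-level effects
    }
    theta.hat[k] ~ dnorm(0, 0.0001)   # population level: variance 10000 = precision 1E-4
    tau.theta[k] ~ dgamma(0.01, 0.01)
  }
  for(m in 1:Nmonths){
    eps[m] ~ dnorm(0, tau.eps)        # zero-centered month random effect
  }
  tau ~ dgamma(0.01, 0.01)            # over-dispersion precision
  tau.eps ~ dgamma(0.01, 0.01)
}

For Model 2, the mu[d,s] line would become a deterministic (<-) assignment and tau would be dropped.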
Evaluate results/make inferences
Model 1: includes over-dispersion, example (temperature effects at sp level):
μ_{d,s} ~ Normal(θ_{1,s} + θ_{2,s}·T_d + θ_{3,s}·WS_d + θ_{4,s}·T_d·WS_d + ε_{m(d)}, σ²)
[Figure: posterior estimates of the species-level temperature effects on observation rate, θ_{2,s} (theta.star[2, ]), plotted by species ID (roughly −0.5 to 0.5), with the population-level parameter θ̂_2 shown for comparison.]
Do the posterior stats for the over-dispersion standard deviation term indicate the presence of “significant” over-dispersion?
sig (σ):  mean = 1.68,  sd = 0.07601,  2.5% = 1.538,  median = 1.678,  97.5% = 1.84
Replicated data (Model 1)
[Figure: observed # of birds (y) vs. predicted # of birds (posterior mean & 95% CI for yrep), with a 1:1 line. Species 8 (northern goshawk): R² = 0.965, coverage = 100%. Species 10 (broad-winged hawk): R² = 1.0, coverage = 100%.]
Things to look for:
○ Bias/accuracy: Do the points fall around the 1:1 line, or is there some prediction bias?
○ Coverage: Do most of the observed values fall inside the 95% CIs for the replicated values (the yrep’s)?
○ Is the variability in the observed data consistent with the variability in the replicated values?
○ What percentage do you expect to fall outside of the yrep CIs?
○ Can also overlay plots of the observed y values and the predicted y values (i.e., posterior means for yrep and 95% CI) as functions of a covariate.
○ Why do we get a “perfect” fit and 100% coverage when including over-dispersion here?
Replicated data (Model 2)
[Figure: observed # of birds (y) vs. predicted # of birds (posterior mean & 95% CI for yrep), with a 1:1 line. Species 8 (northern goshawk): R² = 0.082, coverage = 97.9%. Species 10 (broad-winged hawk): R² = 0.259, coverage = 52.6%.]
Things to look for:
○ Bias/accuracy: Do the points fall around the 1:1 line, or is there some prediction bias?
○ Coverage: Do most of the observed values fall inside the 95% CIs for the replicated values (the yrep’s)?
○ Is the variability in the observed data consistent with the variability in the replicated values?
○ What percentage do you expect to fall outside of the yrep CIs?
○ Can also overlay plots of the observed y values and the predicted y values (i.e., posterior means for yrep and 95% CI) as functions of a covariate.
○ Why is the fit so much worse when we don’t include an over-dispersion term?
Point estimates of deviance
• Can get a point estimate of the deviance by plugging in a point estimate of the parameters (e.g., the θ’s):
D_θ̂(y) = D(y, θ̂),  where θ̂ = point estimate (usually the posterior mean)
• But this doesn’t account for uncertainty in the parameters.
• Instead, compute an “expected” deviance, which may be used as an overall measure of model fit.
• Compute the “expected” deviance by “averaging” over the posterior distribution of the parameters:
Dave(y) = E(D(y, θ) | y) = ∫ D(y, θ) P(θ | y) dθ
• If we have L draws from the posterior, then an estimate of Dave(y) is:
D̂ave(y) = (1/L) Σ_{l=1}^{L} D(y, θ^(l))   (the posterior mean of the deviance)
Model complexity
• To account for model complexity, compute the effective number of parameters. To do this, we compute the deviance in two ways:
– the posterior mean of the deviance, and
– the deviance evaluated at the posterior mean values of the model parameters.
pD = D̂ave(y) − D_θ̂(y)
• Why should the first component be larger than the second component?
• In some situations, the above solution for pD can lead to pD < 0 (i.e., a negative # of effective parameters), which renders DIC and pD useless.
Computing DIC
• Thus, DIC is given by:
DIC = D̂ave(y) + pD
where D̂ave(y) is the posterior mean of the deviance and pD is the effective number of parameters:
model fit (lower is better) + model penalty (lower is better)
• Lower DIC -> “better” model, but what is an “important difference” in DIC?
– Interpretation of DIC differences is similar to AIC
– DIC difference of 1–2: the competing model also deserves consideration
– DIC differences of 3–7: “considerably” more support for the best model
– Differences can be affected by Monte Carlo error
– I look for differences > 10
# JAGS model
model{
  for (i in 1:n){
    mu[i] <- (alpha*x[i])/((alpha/gama) + x[i])
    y[i] ~ dnorm(mu[i], tau)
  }
  tau ~ dgamma(0.001, .001)
  alpha ~ dgamma(0.001, .001)
  gama ~ dgamma(.001, .001)
  sigma <- 1/sqrt(tau)   # defined so that sigma can be monitored below
} # end of model

# In R
library(rjags)
load.module("dic")   # needed to monitor "deviance" and to use dic.samples()
jm = jags.model("Bugs_light_example.R", data = data, mod.inits, n.chains = 3, n.adapt = n.adapt)
update(jm, n.iter = 5000)
# generate a coda object for the parameters and the deviance
zm = coda.samples(jm, variable.names = c("alpha", "gama", "sigma", "deviance"), n.iter = 5000)
dic.samples(jm, n.iter, type = "pD")
# Mean deviance: 529.8
# penalty 2.97 = pD
# Penalized deviance: 532.8 = DIC

# Another way:
summary(zm[, "deviance"])$stat
# pD* = (1/2)*Var(deviance)
pd = 0.5 * summary(zm[, "deviance"])$stat[2]^2
Why DIC?
• DIC, AIC, and BIC all have the same general form:
xIC = model fit + model penalty
• So, why focus on DIC? Why not use AIC or BIC?
• BIC and AIC require us to count the number of parameters in a model, but an informative prior or hierarchical priors make it impossible to count the number of “effective” parameters
• Thus, Spiegelhalter et al. (2002) developed DIC, which
computes an “effective” number of parameters that (should)
capture the effect of “shrinkage” or “borrowing of strength”
due to informative priors or hierarchical priors
Some intuition for DIC
• The problem is parameters that are “free” to be influenced by noise in the data. How free are they?
• If a prior on a parameter is very informative, the parameter is not free to respond to the data, and it does not contribute to the effective number of parameters.
• If a prior is uninformative, the opposite is the case: the parameter is free to respond and contributes to the effective number of parameters in the same way as in a likelihood analysis.
• If a parameter is part of a hierarchy, should it count the same way as a parameter that is part of a simpler model?
Posterior predictive loss
• Posterior predictive loss (i.e., Dk) is fairly widely used within statistics, but
is not frequently used in ecology
• Dk provides an index of a model’s predictive ability by comparing observed
data to replicated data
• The “best” model(s) is the one that performs the best under a “balanced loss
function.”
• Similar to the DIC, the loss function penalizes for both departure from the
observed data (measure of model “fit”) and departure from what we expect
the replicated data to be (measure of “smoothness” – somewhat analogous
to DIC’s effective number of parameters).
• The loss function puts weights, which depend on k > 0, on the model fit
component (G) and a weight of 1 (one) on the smoothness or penalty
component (P); the value of k is determined by the user:
Dk = (k/(k+1))·G + P
Posterior predictive loss (D∞)
• We often assume “k = ∞,” in which case Dk (call this D∞) is given by:
D∞ = G + P
• Under squared-error loss, G and P are given by:
G = Σ_{i=1}^{N} (μ_{rep,i} − y_i)²
P = Σ_{i=1}^{N} σ²_{rep,i}
where μ_{rep,i} = E(y_{rep,i} | y) and σ²_{rep,i} = Var(y_{rep,i} | y).
Computing D in practice
• In your code, simply monitor the squared deviation, (yrepi - yi)2, for each
observation i, and outside of the i loop, compute the sum of the squared deviations:
Dsum.species[s]<-sum(sqdiff[,s]) # in the species loop
Dsum is NOT D; D is computed after you’ve run the model (after convergence)
and it is approximated as the posterior mean (expected value) of Dsum.
69
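For a single-response model, the monitored pieces might look like this (a sketch; the node names are assumptions), with D∞ then approximated in R from the posterior mean of Dsum:

# inside the JAGS model (assumed names)
for(i in 1:n){
  y.rep[i] ~ dnorm(mu[i], tau)
  sqdiff[i] <- pow(y.rep[i] - y[i], 2)
}
Dsum <- sum(sqdiff[])

# in R, after convergence: the posterior mean of Dsum approximates Dinf = G + P
Dinf.approx <- summary(zm[, "Dsum"])$stat[1]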
Back to the tree and the light
[Figure: scatterplot of tree.data$Observed.growth.rate (roughly 10–40) vs. tree.data$Light (roughly 20–100).]
Model 1:mu[i]<-(alpha*x[i])/((alpha/gama)+x[i])
Model 2: mu[i]<-alpha*x[i]
# JAGS model
model{
  for (i in 1:n){
    mu[i] <- (alpha*x[i])/((alpha/gama) + x[i])
    y[i] ~ dnorm(mu[i], tau)
    yrep[i] ~ dnorm(mu[i], tau)   # replicated data
  }
  tau ~ dgamma(0.001, .001)
  alpha ~ dgamma(0.001, .001)
  gama ~ dgamma(.001, .001)
  sigma <- 1/sqrt(tau)
} # end of model

# In R
zm = coda.samples(jm, variable.names = c("alpha", "gama", "sigma", "deviance", "yrep"), n.iter = 5000)
y <- tree.data$Observed.growth.rate                       # assign the response variable
ntree = nrow(tree.data)
yrep.stats = summary(zm[, paste("yrep[", 1:ntree, "]", sep = "")])$stat
G <- sum((yrep.stats[,1] - y)^2)                          # squared differences (yrep mean - y)
P <- sum((yrep.stats[,2])^2)                              # variance of each yrep (SD squared)
Dinf <- G + P
Interpreting D∞ values
• For a “poor” model, we expect large predictive variance (large P) and poor fit (large G).
• Better models have a lower Dk (or lower D∞) associated with a smaller P and/or smaller G.
• But, as we start to “overfit” (e.g., a model with lots of parameters), G will continue to decrease (better fit), but P will start to increase (i.e., the variance will be inflated due to multi-collinearity between parameters).
• The model with the smallest Dk (or D∞) is preferred.
• But, how small is “small”?
Linear vs. non-linear
Model       pD     Mean(dev)   P         G         D∞        DIC     R²
Nonlinear   2.94   529.8       4551.13   4236.13   8787.27   532.7   0.91
Linear      2.05   574.2       8019.67   7612.01   15631.6   576.3   0.82
Consider multiple comparison indices (computed across all species)
Index                           Model 1   Model 2
Average deviance                2069      11210
DIC                             2576      11270
pD                              507       61
Countable # of unknowns         1500      75
D∞                              9895      682802
G                               201       677141
P                               9693      5661
R² (obs vs. pred)               1.00      0.30
Coverage                        100%      85%
CI width of replicated data     4.36      4.44
Some conclusions:
• Posterior predictive checks suggest we should “pick” Model 1 (with over-dispersion) over Model 2 (Model 1 has a much lower DIC, D∞, and G).
• But the results suggest that Model 1 is over-fitting the data: it has a very high P, the coverage of the replicated data is too high (should be ~95%), and the uncertainty in the replicated data (e.g., the width of the 95% CIs) is similar to that of Model 2.
• Perhaps we should evaluate a third model that lies somewhere between the complexity of Model 1 and Model 2? E.g., explore incorporating day (nested in month) random effects without observation-level over-dispersion.
• In the index of Gelman et al. 1995: “Model
selection, why we do not do it”
• In the index of Gelman et al. 2004: “Model
selection, why we avoid it.”
– Gelman, A., J. B. Carlin, H. S. Stern, and D. B.
Rubin. 2004. Bayesian data analysis. Chapman and
Hall / CRC, London.
When should we avoid model selection?
• When we have a firm, theoretical / mechanistic basis
for a particular model formulation.
• When our objectives for insight determine the form of
the model.
• When we want to make forecasts and must include
known influences on future behavior of the system.
Concluding remarks
• Use multiple approaches for comparing / selecting between models
• Selection criteria may depend on scientific objectives:
• Use model to learn (heuristic tool)
• Use model to predict under novel conditions
• Other topics not covered:
• Model averaging & Bayes factors
• E.g., can use posterior model weights to average derived or predicted quantities obtained from each model
• Can use BIC to approximate the Bayes factor (BF)
• See Link & Barker (2006) or Kass (1993)
• Evaluation of model assumptions
• Appropriate choice of distributions?
• Appropriate model structure?
• E.g.: linear vs. non-linear; choice of covariates; random vs. fixed effects; hierarchical vs. non-hierarchical priors, etc.
References
• Posterior predictive loss: Gelfand & Ghosh (1998) Model choice: A
minimum posterior predictive loss approach. Biometrika 85:1-11.
• Bayes factors & BIC: Link & Barker (2006) Model weights and the
foundations of multimodel inference. Ecology 87:2626-2635.
• Model checking & improvement: Gelman, Carlin, Stern, Rubin (2004)
Bayesian Data Analysis. Chapman & Hall/CRC. (Chapter 6)
• Bayes factors: Kass (1993) Bayes factors in practice. The Statistician
42:551-560.
• Elements of hierarchical Bayes, including Bayes factors, DIC, D:
Carlin et al. (2006) Elements of hierarchical Bayesian inference, in
Hierarchical Modelling for the Environmental Sciences: Statistical
Methods and Applications, J.S. Clark & A.E. Gelfand (eds.) Oxford
Univ Press.
Quantiles vs. HPD intervals
[Figure: a posterior density with the HPDI and the equal-tailed quantile interval marked.]
• HPDI: the longest horizontal line that can be placed within the distribution such that the area beneath the distribution curve and between the line’s endpoints (the vertical dashed lines) equals 1 − alpha.
• Equal-tailed intervals: quantiles of the distribution (alpha/2 and 1 − alpha/2).
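A small R sketch (stand-in draws from a skewed posterior) contrasting the two kinds of interval, using HPDinterval() from the coda package:

library(coda)
draws <- rgamma(20000, shape = 2, rate = 1)        # stand-in for posterior draws of a skewed parameter
quantile(draws, c(0.025, 0.975))                   # equal-tailed 95% interval
HPDinterval(as.mcmc(draws), prob = 0.95)           # 95% highest posterior density interval (shorter here)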
Likelihood Ratio Test
• Twice the difference in log-likelihoods, R, follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between models A and B:
R = 2[L(Y | MA) − L(Y | MB)]
where L denotes the maximized log-likelihood.
[Figure: likelihood profile over θ, illustrating the difference in log-likelihood against the chi-square critical value (chi-sq = 3.84 for 1 df at the 0.05 level).]
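A minimal R sketch (simulated data, not from the slides) of the likelihood ratio test between two nested linear models:

set.seed(1)
x <- runif(100); y <- 1 + 2 * x + rnorm(100)
mB <- lm(y ~ 1)                                   # simpler model
mA <- lm(y ~ x)                                   # more complex model
R  <- 2 * (as.numeric(logLik(mA)) - as.numeric(logLik(mB)))   # twice the log-likelihood difference
df <- attr(logLik(mA), "df") - attr(logLik(mB), "df")         # difference in # of parameters (1 here)
p  <- 1 - pchisq(R, df)                           # compare R to the chi-square critical value (3.84 for df = 1)
c(R = R, df = df, p = p)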