Lecture 17: Poisson GLMs with a Rate Parameter (Text Section 9.2)

So far, we have considered Poisson GLMs where exposure is constant, so that it is sensible to model the mean of each count, $Y_i$. In this lecture, we extend the Poisson regression model by allowing for the case where a (nonnegative) exposure $t_i$ is associated with observation $i$. Then $Y_i \sim \text{Poisson}(\mu_i)$, where $\mu_i = \theta_i t_i$ and

\[ g(\theta_i) = \sum_{j=1}^{p} x_{ij}\beta_j. \]

Here $g$ is used to transform $\theta_i$ rather than $\mu_i$. However, for fitting these models in S-PLUS, we need to specify $g(\mu_i)$, not $g(\theta_i)$.

NOTE: We wouldn't want to simply compute $Y_i' = Y_i/t_i$ and use the $Y_i'$'s as the responses because $Y_i'$ is not necessarily a count! Therefore, it would likely be hard to find an appropriate distribution to describe the transformed responses.

Specifying the Form of the Rate Parameter in S-PLUS

If $g$ is the log link,

\[ \log\mu_i = \log t_i + \log\theta_i = \log t_i + \sum_{j=1}^{p} x_{ij}\beta_j. \]

Therefore, we can specify this model in S-PLUS using the usual Poisson GLM with log link by including the term $\log t_i$ as an offset.

If $g$ is the square-root link,

\[ \sqrt{\mu_i} = \sqrt{t_i\theta_i} = \sqrt{t_i}\sum_{j=1}^{p} x_{ij}\beta_j = \sum_{j=1}^{p} (\sqrt{t_i}\,x_{ij})\beta_j \equiv \sum_{j=1}^{p} x^*_{ij}\beta_j. \]

In our usual models, $x_{i1} \equiv 1$ so that $\beta_1$ is the intercept (included in the model by default in S-PLUS). However, in this case, $x^*_{i1} = \sqrt{t_i}$, i.e. there is no intercept. Therefore, we can specify this model in S-PLUS using the usual Poisson GLM with square-root link by including the covariates $x^*_i$ rather than $x_i$ and by excluding the intercept term.

We can also work out how to specify the model properly in S-PLUS when $g$ is the identity link (see Assignment 3).

Example: Geiger counter experiment

Let $Y_i$ be the $i$th count observed at distance $x_i$ metres from the source of radioactivity over a period of $t_i$ seconds.

Goal: Estimate the effect of distance on the number of counts.

A reasonable preliminary model is $Y_i \sim \text{Poisson}(\mu_i)$, where $\mu_i = t_i\theta_i$ and $\log\theta_i = \beta_0 + \beta_1 x_i$, so that

\[ \log\mu_i = \log t_i + \beta_0 + \beta_1 x_i. \]
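To make the role of exposure concrete, the following Python sketch evaluates the mean $\mu_i$ implied by the log-link and square-root-link rate models with a single predictor. It is only an illustration of the algebra above, not S-PLUS code: the function names and the numerical values of `t`, `x`, `beta0` and `beta1` are hypothetical and are not taken from the Geiger counter data.

```python
import math

def mean_log_link(t, x, beta0, beta1):
    """Mean mu = t * theta when log(theta) = beta0 + beta1 * x."""
    # log(mu) = log(t) + beta0 + beta1 * x, so log(t) acts as an offset
    # (a term in the linear predictor whose coefficient is fixed at 1).
    return math.exp(math.log(t) + beta0 + beta1 * x)

def mean_sqrt_link(t, x, beta0, beta1):
    """Mean mu = t * theta when sqrt(theta) = beta0 + beta1 * x."""
    # sqrt(mu) = beta0 * sqrt(t) + beta1 * (sqrt(t) * x): there is no
    # intercept, and every covariate (including the constant 1) is
    # multiplied by sqrt(t).
    root_t = math.sqrt(t)
    return (beta0 * root_t + beta1 * (root_t * x)) ** 2

# Hypothetical exposure (seconds), distance (metres) and coefficients.
t, x = 30.0, 2.0
mu_log = mean_log_link(t, x, beta0=1.0, beta1=-0.5)
mu_sqrt = mean_sqrt_link(t, x, beta0=3.0, beta1=-0.5)

# Dividing the mean by the exposure recovers the rate theta.
print("log link:  mu =", mu_log, " theta =", mu_log / t)
print("sqrt link: mu =", mu_sqrt, " theta =", mu_sqrt / t)
```

In both cases the coefficients describe the rate $\theta_i$, while the quantity handed to the fitting routine is the mean $\mu_i = t_i\theta_i$ of the observed count.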
Let $\theta^*$ be the rate when $x_i = c$, and let $\theta_1^*$ be the rate when $x_i = c + 1$, i.e.

\[ \log\theta^* = \beta_0 + \beta_1 c, \qquad \log\theta_1^* = \beta_0 + \beta_1 (c + 1). \]

Then $\beta_1 = \log\theta_1^* - \log\theta^*$ is the difference in log Geiger counter rate when distance is increased by 1 metre. Equivalently, using the fact that

\[ e^{\beta_1} = \frac{\theta_1^*}{\theta^*}, \]

the Geiger counter rate changes by a factor of $e^{\beta_1}$ when distance is increased by 1 metre. S-PLUS estimates $\hat\beta_1 = -0.684$ with a standard error of 0.019.

Alternatively, we can fit the model $Y_i \sim \text{Poisson}(\mu_i)$, where

\[ \sqrt{\mu_i} = \beta_0\sqrt{t_i} + \beta_1 x_i^*, \qquad x_i^* = \sqrt{t_i}\,x_i. \]

Then $\beta_1$ is the difference in the square root of the Geiger counter rate when distance is increased by 1 metre. S-PLUS estimates $\hat\beta_1 = -1.14$ with a standard error of 0.031.

Choosing the Link Function

Counts are easier to model than binary data in the sense that we usually have a variety of observed values (i.e., 0, 1, 2, ...) instead of just 0's and 1's. For this reason, plots of the raw data are often more informative than in the binary case.

If exposure is constant so that we are modelling a mean rather than a rate, we can plot the observed counts vs. a continuous predictor variable. The relationship between these quantities can give insight into the link function relating the mean ($\mu_i$) and the predictor.

If exposure varies by a variable $t_i$, then we can plot the "adjusted" observed counts $Y_i' = Y_i/t_i$ vs. a continuous predictor variable. Since $E[Y_i'] = \theta_i$, this plot can give insight into the link function relating the rate ($\theta_i$) and the predictor.

IMPORTANT: We are modelling $\theta_i$, not $Y_i'$. And, we use $Y_i$ as the response, not $Y_i'$. We are simply using the $Y_i'$'s in this plot to learn about the behaviour of $\theta_i$ in relation to the predictor.

Goodness-of-Fit Assessment in Poisson GLMs

In Lecture 11, we computed the deviance as

\[ D = 2\left[\sum_{i=1}^{n} y_i(\log y_i - \log\hat\mu_i) - \sum_{i=1}^{n} (y_i - \hat\mu_i)\right], \]

which does not contain unknown parameters, so it can be calculated from the data.
This is labelled as the "residual deviance" in the S-PLUS summary output.

NOTE: If we've modelled the rate rather than the mean, then $\hat\mu_i = t_i\hat\theta_i$.

Likewise, if there are no replicates, the Pearson chi-squared statistic is defined as

\[ X^2 = \sum_{i=1}^{n} \frac{(y_i - \hat\mu_i)^2}{\hat\mu_i}, \]

which has the alternative interpretation of

\[ X^2 = \sum_{i=1}^{n} \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i}. \]

This statistic is not defined in the case where there are replicates.

If $D$ and/or $X^2$ are large relative to the $\chi^2_{n-p}$ distribution, then we have evidence against the null hypothesis that our model fits the data well (relative to the saturated model). The $\chi^2_{n-p}$ approximation is more accurate when the $\mu_i$'s are relatively large.

The deviance residuals are defined as

\[ d_i = \text{sign}(y_i - \hat\mu_i)\sqrt{2\left[y_i(\log y_i - \log\hat\mu_i) - (y_i - \hat\mu_i)\right]}, \qquad i = 1, \ldots, n. \]

The Pearson residuals are defined as

\[ X_i = \frac{y_i - \hat\mu_i}{\sqrt{\hat\mu_i}} \equiv \frac{\text{observed}_i - \text{expected}_i}{\sqrt{\text{expected}_i}}, \qquad i = 1, \ldots, n. \]

We can plot these against the fitted values or against the predictor variables. Patterns in these plots might suggest that a different form of a predictor variable is required (e.g., $x^2$ in addition to $x$). In addition, they might indicate outliers, i.e. counts which depart significantly from the fitted model.

As in the binary case, these residuals are not normally distributed (though the normal distribution provides a reasonable approximation when the $\mu_i$'s are large). Therefore, we would not expect residual plots to look like those in the linear setting.

To assess the predictive abilities of the model, we can plot the observed counts ($Y_i$) vs. the fitted values ($\hat\mu_i$). Note that, if we're modelling the rate rather than the mean, the fitted values are already adjusted by $t_i$, so it doesn't make sense to plot them vs. the adjusted counts ($Y_i/t_i$).

Example: Geiger counter experiment (cont.)

How do we check whether our chosen model is appropriate for these data? For example, how do we choose between the log and square-root link functions?

1.
Deviance/Pearson chi-squared tests. The tests when applied to both the log and square-root link models suggest that neither model fits well. The deviances of these models are nearly identical. If we had to choose between the two, we might choose the model with log link for simplicity.
2. Plot the observed vs. fitted values. Even though the above tests suggest that the model fits poorly, this plot does not show any major deviation of the observed values from the fitted values!
3. Plot the observed and fitted values (simultaneously) vs. distance in order to detect discrepancies for specific values of distance. This plot also suggests that the model fits reasonably well.
4. Plot the deviance/Pearson residuals. These plots do not show any obvious problems.

Summary: Although the formal GOF tests provide evidence that the model with log link doesn't fit well, our graphical tools suggest otherwise.

Likely explanation: Both the predictor and response variables are observed over a large range in this example. We therefore have a lot of information with which to estimate the relationship between these variables. The formal tests are likely picking up small deviations from the proposed model which (hopefully) aren't important in practice.

Conclusion: We would likely accept the model with log link as providing a reasonable description of the data. Remember that "All models are wrong, but some are useful" (George Box).
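The goodness-of-fit quantities used in this lecture (the deviance $D$, the Pearson statistic $X^2$, and the two kinds of residuals) are simple to compute once fitted means are available. The sketch below is a minimal Python illustration, not S-PLUS output. The function name `poisson_gof` and the vectors `y` and `mu` are hypothetical and do not come from the Geiger counter data. The code also assumes the usual convention that $y_i \log y_i$ is taken to be 0 when $y_i = 0$, which the formulas above do not address explicitly.

```python
import math

def poisson_gof(y, mu):
    """Return (deviance, pearson_x2, deviance_residuals, pearson_residuals)
    for observed counts y and fitted means mu (all mu must be positive)."""
    dev_res, pear_res = [], []
    for yi, mi in zip(y, mu):
        # Assumed convention: y * log(y / mu) is 0 when y == 0.
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        unit_dev = 2.0 * (log_term - (yi - mi))
        unit_dev = max(unit_dev, 0.0)  # guard against tiny negative rounding error
        if yi > mi:
            sign = 1.0
        elif yi < mi:
            sign = -1.0
        else:
            sign = 0.0
        dev_res.append(sign * math.sqrt(unit_dev))
        pear_res.append((yi - mi) / math.sqrt(mi))
    # D and X^2 are the sums of the squared residuals of each type.
    deviance = sum(d * d for d in dev_res)
    pearson_x2 = sum(r * r for r in pear_res)
    return deviance, pearson_x2, dev_res, pear_res

# Hypothetical counts and fitted means. If a rate was modelled,
# mu would be t_i * theta_hat_i, as noted in the lecture.
y = [12, 7, 3, 0]
mu = [10.0, 8.0, 3.0, 1.0]
deviance, pearson_x2, dev_res, pear_res = poisson_gof(y, mu)
print("residual deviance:", deviance)
print("Pearson X^2:", pearson_x2)
```

Both statistics would then be compared with the $\chi^2_{n-p}$ distribution, and the residual lists can be plotted against the fitted values or the predictors as described above.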