3. Generalized linear models

Models are used to investigate the relationships (associations) among categorical and continuous variables. Reasons for using models (p. 65):
- They help describe the pattern of association and interaction.
- Inferences for model parameters help determine which explanatory variables affect the response while controlling for other variables.
- Estimated model parameters measure the strength and importance of effects.
- Models can more easily handle complicated problems.

A general class of models is the "generalized linear models" (GLMs). You have already studied a special case of GLMs, linear models, in regression and ANOVA courses. In addition to Agresti (2002), other references on GLMs include:

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. 2nd edition. London: Chapman and Hall.
McCulloch, C. and Searle, S. R. (2000). Generalized, Linear, and Mixed Models. New York: Wiley.

3.1 Components of a generalized linear model

Review of regression models:

Yi = β0 + β1xi1 + β2xi2 + … + βkxik + εi, where εi ~ independent N(0, σ²)

Note that E(Yi) = β0 + β1xi1 + β2xi2 + … + βkxik; E(Yi) is what one would expect Yi to be on average for a given set of xi1, xi2, …, xik values. One of the important things to realize here is that Y is assumed to have a normal distribution. What if this is not true? Suppose Y is a nominal categorical variable, or suppose Y has a Poisson distribution; there are many other possibilities. GLMs allow us to generalize the model structure!

Three components of a GLM:

Random
For a sample of size n, denote the observations of the response variable Y as Y1, Y2, …, Yn, assumed to be obtained independently here. We will specifically be interested in E(Y) = μ. The distribution chosen for Y defines the "random" component of a GLM. For example:
- Y1, …, Yn may be responses from a Bernoulli random variable Y; thus each is 0 or 1.
- Y1, …, Yn may be responses from a binomial random variable Y; thus each is a nonnegative integer denoting the number of successes out of a certain number of trials.
- Y1, …, Yn may be responses from a Poisson random variable Y; thus each is a nonnegative integer and could denote a cell count in a contingency table.
- In regression and ANOVA, Y1, …, Yn are responses from a normal random variable Y.

Systematic
This component specifies the explanatory variables: α + β1x1 + β2x2 + … + βkxk. Notice that this is a "linear" combination of the explanatory variables; it is often called the "linear predictor". Note that the x's above could be transformations of original explanatory variables, such as quadratics or interactions.

Link
This component "links" the random and systematic components. In other words, it shows how the mean of the distribution of Y is related to the explanatory variables. Let g(μ) be a function of E(Y) = μ; this is the link function. Specifically, the GLM is

g(μ) = α + β1x1 + β2x2 + … + βkxk

Link functions:

Identity: g(μ) = μ, so E(Y) = μ = α + β1x1 + … + βkxk. This is used for regression and ANOVA models!

Log: g(μ) = log(μ), so log(μ) = α + β1x1 + … + βkxk and μ = exp(α + β1x1 + … + βkxk). The log link is used for "loglinear" models in Chapter 7, where Y is most often assumed to have a Poisson distribution. Notice that all values of μ will be positive; this is why the log link is used when modeling counts in a contingency table!
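As a small added illustration (not in the original notes), base R's make.link() packages each link function g() with its inverse g⁻¹(); the sketch below uses it to show the log link just described.

#Each make.link() object carries g() as $linkfun and g^{-1}() as $linkinv
log.link <- make.link("log")
eta <- -2:2                   #example linear predictor values
log.link$linkfun(exp(eta))    #g(mu) = log(mu) recovers eta
log.link$linkinv(eta)         #g^{-1}(eta) = exp(eta): always positive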
Logit: g(μ) = log[μ/(1 − μ)] = logit(μ), so

logit(μ) = log[μ/(1 − μ)] = α + β1x1 + … + βkxk
⇒ μ = exp(α + β1x1 + … + βkxk) / [1 + exp(α + β1x1 + … + βkxk)]

The logit link is used for "logit" and logistic regression models in Chapters 4-5. Notice that all values of μ will be between 0 and 1 (try a few sample cases to see this; a quick numerical check appears after the plotting code below). This is why the logit link is used when modeling probabilities! Remember that the mean of a Bernoulli random variable is π.

Other links are possible, such as the probit and complementary log-log. These will be discussed later. Read Section 3.1.4 about the normal GLM on p. 67-8.

3.2 Generalized linear models for binary data

Binary data means observations obtained from a random variable with only two possible values. Typically, these two possible values are called a "success" and a "failure".

From Chapter 1, the Bernoulli distribution is P(Y = y) = π^y (1 − π)^(1−y) for y = 0 or 1. This is a special case of the binomial with n = 1. The expected value of Y is E(Y) = π and the variance of Y is Var(Y) = π(1 − π).

The goal in this section is to find a GLM to model π at specific values of explanatory variables (x's). For example, suppose you want to estimate the probability of success, π, of a field goal. The value of π will probably be different for a 20-yard field goal than for a 50-yard field goal; thus, it would be of interest to incorporate the length of the field goal in a model for π.

Notation: Agresti (2007) uses π(x) to denote π here, because explanatory variables (x's) will be used to try to predict the value of π. Thus, π "depends" on the level of the explanatory variables. To simplify the upcoming discussion, only one explanatory variable, x, will be used to model the probability of success, π(x).

Linear probability model

Suppose an ordinary regression model was used to model the probability of success: E(Y) = π(x) = α + βx with ε ~ N(0, σ²). This is called a linear probability model because the probability of success changes in a linear manner. Problems with this model:
o It violates the distributional assumptions for Y: Y is Bernoulli, not normal.
o Probabilities can be estimated to be less than 0 or greater than 1!
o Non-constant variance: Var(Y) = π(x)[1 − π(x)] changes as a function of x.
Therefore, do not use this model!

Logistic regression model

A great introductory reference on logistic regression is Hosmer, D. W. and Lemeshow, S. (2000). Applied Logistic Regression, 2nd edition. New York: Wiley. Many STAT 870 books also include a chapter on logistic regression; for example, see Chapter 14 of Kutner, Nachtsheim, and Neter (2004).

The model is

logit[π(x)] = log[π(x) / (1 − π(x))] = α + βx

The random component is Bernoulli, and the logit transformation is the link function. The model can equivalently be written as

π(x) = e^(α+βx) / (1 + e^(α+βx))

What does a plot of π(x) vs. x look like?

Example: Plot of π(x) vs. x (pi_plot.R)

[Figure: π(x) = e^(α+βx)/(1 + e^(α+βx)) plotted against x for −15 ≤ x ≤ 15, first with α = 1 and β = 0.5, then with α = 1 and β = −0.5.]

R code:

alpha<-1
beta1<-0.5
par(pty="s")
curve(expr = exp(alpha+beta1*x)/(1+exp(alpha+beta1*x)),
  from = -15, to = 15, col = "red",
  main = expression(pi(x) == frac(e^{alpha+beta*x}, 1+e^{alpha+beta*x})),
  xlab = "x", ylab = expression(pi(x)),
  panel.first = grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted"))
#See help(plotmath) for more on the expression function and see demo(plotmath)
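As promised above, here is a quick numerical check (added, not part of the original code) that the inverse logit always lands strictly between 0 and 1, even for extreme linear predictor values:

#Inverse logit at a few sample linear predictor values
eta <- c(-100, -5, 0, 5, 100)
exp(eta) / (1 + exp(eta))   #equivalently plogis(eta); all values in (0,1)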
Notes:
- When β > 0, there is a positive relationship between x and π(x).
- When β < 0, there is a negative relationship between x and π(x).
- The shape of the function is similar to an "s".
- Notice the symmetric shape about π(x) = 0.5.
- 0 < π(x) < 1.

Questions:
- What happens to the β = 0.5 plot when β is increased?
- What happens to the β = 0.5 plot when β is decreased to be close to 0?
- Suppose a plot of logit[π(x)] vs. x was made. What would the plot look like?

Parameter estimation

Suppose there is a random sample of size n providing (y1, x1), (y2, x2), …, (yn, xn), where the yi's are 0's or 1's. The probability of observing a 1 for yi is denoted by πi(x). The logistic regression model is

log[πi(x) / (1 − πi(x))] = α + βxi for i = 1, …, n

This is the assumed relationship between xi and πi(x). The model can be rewritten as πi(x) = exp(α + βxi) / [1 + exp(α + βxi)].

Parameter estimates can be found from maximum likelihood estimation — see Chapter 1's discussion. The likelihood function is

L(π1(x), …, πn(x) | y1, …, yn) = ∏_{i=1}^n f(yi) = ∏_{i=1}^n πi(x)^yi [1 − πi(x)]^(1−yi)

which, as written, involves n different parameters. The log-likelihood function is then

log L(π1(x), …, πn(x) | y1, …, yn) = Σ_{i=1}^n { yi log[πi(x)] + (1 − yi) log[1 − πi(x)] }

Since πi(x) = exp(α + βxi) / [1 + exp(α + βxi)], this implies

log L(α, β | y1, …, yn)
  = Σ_{i=1}^n { yi log[e^(α+βxi) / (1 + e^(α+βxi))] + (1 − yi) log[1 / (1 + e^(α+βxi))] }
  = Σ_{i=1}^n { yi(α + βxi) − yi log(1 + e^(α+βxi)) − (1 − yi) log(1 + e^(α+βxi)) }
  = Σ_{i=1}^n { yi(α + βxi) − log(1 + e^(α+βxi)) }

Now there are only two parameters! The maximum likelihood estimates of α and β are the values which maximize the above quantity. Since these estimates can only be found using numerical methods (except in special cases), software packages typically compute them using iteratively reweighted least squares, which yields the maximum likelihood estimates. See p. 88 of Agresti (2007) and p. 143-149 of Agresti (2002) for more information. The R function glm() finds the parameter estimates this way (via glm.fit()).

By using the model, the complexity of estimation has been reduced from n different parameters (one for each i = 1, …, n) to only two: α and β!
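To make the maximization concrete, here is a minimal sketch (not from the original notes) that maximizes the simplified log-likelihood above directly with optim() on simulated data; glm(), used in the example below, arrives at the same estimates via iteratively reweighted least squares. The data vectors here are hypothetical.

#Bernoulli log-likelihood in the simplified form sum{y*eta - log(1 + e^eta)}
loglik <- function(theta, y, x) {
  eta <- theta[1] + theta[2] * x
  sum(y * eta - log(1 + exp(eta)))
}
#Hypothetical data with alpha = 1 and beta = 0.5
set.seed(1)
x <- runif(100, -2, 2)
y <- rbinom(100, size = 1, prob = plogis(1 + 0.5 * x))
fit <- optim(par = c(0, 0), fn = loglik, y = y, x = x,
  control = list(fnscale = -1))   #fnscale = -1 requests maximization
fit$par   #compare with coef(glm(y ~ x, family = binomial))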
Example: Placekicking (placekick_ch3.R, place.s.csv)

See Bilder and Loughin (Chance, 1998) and the video! The purpose of this example is to estimate the probability of success for a placekick in football. The place.s.csv data file contains a sample of 1,425 placekicks attempted during the 1995 National Football League season. Below is a brief description of the variables in the data set:

week = week of the season
dist = distance of the placekick in yards
change = binary variable denoting lead-change placekicks (1) vs. non-lead-change placekicks (0)
elap30 = continuous variable denoting the number of minutes left in a half, with overtime placekicks assigned a value of 0
pat1 = binary variable for whether the placekick is a point after touchdown (1) or a field goal (0)
type1 = binary variable for placekicks in a dome (0) or outdoors (1)
field1 = binary variable for placekicks on grass (1) or artificial turf (0)
good1 = binary variable for placekicks which are successes (1) or failures (0)
wind = binary variable for placekicks attempted in "windy" conditions (1) at kickoff versus non-windy conditions (0), using a 15 mph cutoff

The data were actually first stored in an Excel file. While one can use the xlsReadWrite or RODBC packages to read in an Excel file (see the R introduction lecture), I used a different method: I re-saved the Excel file in .csv format. To do this, select FILE > SAVE AS in Excel, select the .csv format in the SAVE AS TYPE box, choose a file name, and select SAVE. This creates an ASCII text file with commas separating the variables.

In order to get the data into R, I used the read.table() function as shown below:

> place.s<-read.table(file = "C:\\chris\\UNL\\STAT875\\chapter3_new\\place.s.csv",
    header = TRUE, sep = ",")
> head(place.s)
  week dist change  elap30 pat1 type1 field1 good1 wind
1    1   21      1 24.7167    0     1      1     1    0
2    1   21      0 15.8500    0     1      1     1    0
3    1   20      0  0.4500    1     1      1     1    0
4    1   28      0 13.5500    0     1      1     1    0
5    1   20      0 21.8667    1     0      0     1    0
6    1   25      0 17.6833    0     0      0     1    0

For now, only distance (dist) will be used to predict the probability of a successful placekick. The good1 variable contains the Bernoulli observations denoting the success or failure of a placekick. The logistic regression model of interest is

logit[π(x)] = log[π(x) / (1 − π(x))] = α + βx = α + β(distance)

where x = distance of the placekick, π(x) = E(Y), and Y = 1 for a success or 0 for a failure. This particular GLM is used since the response variable (good1) is binary. To find the estimated model in R, the glm() function is used. Below is the code.

> mod.fit <- glm(formula = good1 ~ dist, data = place.s, family = binomial(link = logit),
    na.action = na.exclude, control = list(epsilon = 0.0001, maxit = 50, trace = T))
Deviance = 836.7715 Iterations - 1
Deviance = 781.1072 Iterations - 2
Deviance = 775.8357 Iterations - 3
Deviance = 775.7451 Iterations - 4
Deviance = 775.745 Iterations - 5
> names(mod.fit)
"coefficients" "residuals" "fitted.values" "effects" "R" "rank" "qr" "family"
"linear.predictors" "deviance" "aic" "null.deviance" "iter" "weights" "prior.weights"
"df.residual" "df.null" "y" "converged" "boundary" "model" "call" "formula" "terms"
"data" "offset" "control" "method" "contrasts" "xlevels"
> mod.fit$coefficients
(Intercept)        dist
  5.8120450  -0.1150259
> mod.fit

Call:  glm(formula = good1 ~ dist, family = binomial(link = logit), data = place.s,
    na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))

Coefficients:
(Intercept)         dist
     5.8121      -0.1150

Degrees of Freedom: 1424 Total (i.e. Null);  1423 Residual
Null Deviance:     1013
Residual Deviance: 775.7    AIC: 779.7

> summary(mod.fit)

Call: glm(formula = good1 ~ dist, family = binomial(link = logit), data = place.s,
    na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.7441  0.2425  0.2425  0.3801  1.6091

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.812079   0.326158   17.82   <2e-16 ***
dist        -0.115027   0.008337  -13.80   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1013.43  on 1424  degrees of freedom
Residual deviance:  775.75  on 1423  degrees of freedom
AIC: 779.75

Number of Fisher Scoring iterations: 5
There are many different things that could be discussed about the code and output. Only a few of them will be discussed here; more will come later in this chapter and in Chapter 5.

- Notice the syntax used with the glm() function.
- names(mod.fit) shows the different components of the mod.fit object.
- The estimated logistic regression model is

logit[π̂(x)] = α̂ + β̂x = 5.8121 − 0.1150x

What happens to the probability of success as the distance increases? The estimated probability of success for a particular distance can be found from

π̂(x) = exp(α̂ + β̂x) / [1 + exp(α̂ + β̂x)] = exp(5.8121 − 0.1150x) / [1 + exp(5.8121 − 0.1150x)]

For example, the estimated probability of success for a 20-yard placekick is

π̂(x = 20) = exp[5.8121 − 0.1150(20)] / {1 + exp[5.8121 − 0.1150(20)]} = 0.9710

The estimated probability of success for a 50-yard placekick is

π̂(x = 50) = exp[5.8121 − 0.1150(50)] / {1 + exp[5.8121 − 0.1150(50)]} = 0.5152

> #Estimated probability of success for a 20 yard field goal
> lin.pred<-mod.fit$coefficients[1]+mod.fit$coefficients[2]*20
> exp(lin.pred)/(1 + exp(lin.pred))
0.971014
> #Estimated probability of success for a 50 yard field goal
> lin.pred<-mod.fit$coefficients[1]+mod.fit$coefficients[2]*50
> exp(lin.pred)/(1 + exp(lin.pred))
0.5151829

- The z value in the output is a test statistic for whether the corresponding parameter is 0 or not; it can be compared to a standard normal distribution. Is distance important for predicting the probability of success of a placekick?

A simple plot of the "fitted values" versus distance is:

[Figure: estimated probability of success of a placekick vs. distance (yards); the fitted probabilities decrease from near 1 at 20 yards toward 0.2 at 60 yards.]

#Simple plot
plot(x = place.s$dist, y = mod.fit$fitted.values, xlab = "Distance (yards)",
  ylab = "Estimated probability",
  main = "Estimated probability of success of a placekick")

Note that this plot would not be appropriate to hand in for a project; much better plots will be shown soon.

There are often many observations at the same distance. For example, there are 20 placekicks from 21 yards and 19 of them are successful. This information can be found using the table() or xtabs() functions.

> #Summary of the placekicks by distance
> dist.good <- table(place.s$dist, place.s$good1)
> dist.good
integer matrix: 43 rows, 2 columns.
     0   1
18   1   2
19   0   7
20  13 776
21   1  19
22   2  12
23   1  26
24   0   7
25   1  12
EDITED
55   1   2
56   0   1
59   0   1
62   1   0
63   1   0
66   1   0

Another way to put the data into this format is to use the gsummary() function.

> library(nlme)
> place.small<-data.frame(good = place.s$good1, dist = place.s$dist)
> place.sum<-gsummary(object = place.small, FUN = sum, groups = place.small$dist)
> place.length<-gsummary(object = place.small, FUN = length, groups = place.small$dist)
> prop<-place.sum$good/place.length$good
> place.pattern<-data.frame(sum.y = place.sum$good, n = place.length$good,
    prop = prop, distance = place.sum$dist)
> head(place.pattern)
  sum.y   n      prop distance
1     2   3 0.6666667       18
2     7   7 1.0000000       19
3   776 789 0.9835234       20
4    19  20 0.9500000       21
5    12  14 0.8571429       22
6    26  27 0.9629630       23
Below is a plot of the estimated probability of success using the estimated logistic regression model, with the observed proportions of successes as the plotting points. For example, there is a point at 19/20 = 0.95 for 21 yards. This type of plot can be used as a measure of how well the model fits the data. What do you think about the fit of the model?

[Figure: estimated probability of success of a placekick with observed proportions; the fitted logistic curve overlaid on the observed proportions by distance (yards).]

> #Find plot of the observed proportions
> plot(x = place.pattern$distance, y = place.pattern$prop, xlab = "Distance (yards)",
    ylab = "Estimated probability",
    main = "Estimated probability of success of a placekick \n with observed proportions",
    panel.first = grid(col = "gray", lty = "dotted"))
> curve(expr = exp(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x) /
    (1+exp(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x)),
    col = "red", add = TRUE)
#Quicker way to do curve() here - will learn about later
#curve(plogis(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x), col = "red", add = TRUE)

You may think the model fits poorly at the larger distances. This is not necessarily true! The binary nature of the data can distort the perceived fit. At most of the larger distances, there are very few placekicks. For example, there was only one 59-yard placekick attempted and it was a success; thus, the proportion of successful placekicks at this distance is 1/1 = 1.

To help make a judgment about the fit of the model, I created the bubble plot below. A bubble plot is a scatter plot with the plotting point proportional to another variable — here, the number of placekicks at each distance. Notice how the extreme proportions occur at distances without many observations. The circles = __ option provides the third variable displayed in the plot as the size of the plotting point.

[Figure: bubble plot of observed proportions vs. distance (yards) with the fitted logistic curve; bubble size is proportional to the square root of the number of placekicks at each distance.]

#Plot the points
symbols(x = place.pattern$distance, y = place.pattern$prop,
  circles = sqrt(place.pattern$n), inches = 1, xlab = "Distance (yards)",
  ylab = "Estimated probability", xlim = c(10,65), ylim = c(0, 1.5),
  main = "Estimated probability of success of a placekick \n with observed proportions",
  panel.first = grid(col = "gray", lty = "dotted"))
#Put the estimated logistic regression model on the plot
curve(expr = exp(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x) /
  (1+exp(mod.fit$coefficients[1]+mod.fit$coefficients[2]*x)),
  col = "red", add = TRUE)

Questions:
- Which placekicks does the largest bubble represent?
- Suppose the plot looked like this (this plot was edited in PowerPoint; note the different scale):

[Figure: edited bubble plot with the observed proportions at longer distances pulled away from the fitted curve.]

What do you think about the fit of the model now?

Note: the inches = __ option in the symbols() function controls the size of the largest circle; the default is 1" in height. You may need to change this to make the plot more informative for a particular problem. Also, I used the sqrt() function with the circles = __ option since the disparity between the largest place.pattern$n value and the others is so large. Other functions could have been used as well. Examine what the plot looks like on your own without the sqrt() function to see how much it helped.
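As an aside (not in the original notes), gsummary() requires the nlme package; the same summary by distance can be built with base R's aggregate(). A minimal sketch, assuming place.s is already loaded:

#Successes, attempts, and observed proportion at each distance
agg <- aggregate(good1 ~ dist, data = place.s,
  FUN = function(y) c(sum = sum(y), n = length(y)))
place.pattern2 <- data.frame(distance = agg$dist,
  sum.y = agg$good1[, "sum"], n = agg$good1[, "n"])
place.pattern2$prop <- place.pattern2$sum.y / place.pattern2$n
head(place.pattern2)   #should match place.pattern above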
Alternative binary links

Many other link functions could be used to model binary data. These link functions use the "cumulative distribution function" or CDF. Below is a formal definition.

Let X be a continuous random variable with probability density function f(x); an observed value of X is denoted by x. The cumulative distribution function of X is

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u) du

Note that u is substituted into the density to avoid confusion with the upper limit of integration. If X is a discrete random variable, the cumulative distribution function of X is F(x) = P(X ≤ x) = Σ_{u ≤ x} f(u), where the sum is over all values u ≤ x. An informal definition is that the cumulative distribution function "cumulates" probabilities as a function of x. See the Chapter 3 additional notes for examples of a CDF involving the binomial distribution and the uniform distribution.

The reason CDFs are used as link functions for binary data is that a CDF is always between 0 and 1.

Example: Logistic distribution (logistic_distribution.R)

Let X have a logistic probability distribution. The probability density function of X can be represented by

f(x) = (1/σ) e^(−(x−μ)/σ) / [1 + e^(−(x−μ)/σ)]²

for −∞ < x < ∞ and parameters −∞ < μ < ∞ and σ > 0. Note that E(X) = μ and Var(X) = σ²π²/3 > σ². Below is a plot of the distribution for μ = −2 and σ = 2.

[Figure: logistic PDF with μ = −2 and σ = 2.]

mu<--2
sigma<-2
curve(expr = 1/sigma * exp(-(x-mu)/sigma) / (1+exp(-(x-mu)/sigma))^2,
  ylab = "f(x)", xlab = "x", from = -15, to = 15,
  main = expression(paste("Logistic PDF with ", mu==-2, " and ", sigma==2)),
  col = "red")
#Note that expr = dlogis(x, location=mu, scale=sigma) could also be used
abline(h = 0)

The cumulative distribution function can be found from P(X ≤ x):

F(x) = ∫_{−∞}^{x} f(u) du = ∫_{−∞}^{x} (1/σ) e^(−(u−μ)/σ) / [1 + e^(−(u−μ)/σ)]² du = 1 / [1 + e^(−(x−μ)/σ)]

Below is a plot of the CDF for μ = −2 and σ = 2.

[Figure: logistic CDF with μ = −2 and σ = 2.]

curve(expr = 1/(1+exp(-(x-mu)/sigma)), ylab = "F(x)", xlab = "x",
  from = -15, to = 15, lwd = 2,
  main = expression(paste("Logistic CDF with ", mu==-2, " and ", sigma==2)),
  col = "red", panel.first = grid(col = "gray", lty = "dotted"))
#Note that expr = plogis(x, location=mu, scale=sigma) could also be used

Does this plot look familiar? See p. 3.8 — this is the same function being plotted! Note that with μ = −2 and σ = 2,

F(x) = 1 / [1 + e^(−(x−μ)/σ)] = 1 / [1 + e^(−[x−(−2)]/2)] = 1 / [1 + e^(−(1 + 0.5x))] = 1 / [1 + e^(−(α+βx))]

where α = 1 and β = 1/2. Then

F(x) = 1 / [1 + e^(−(α+βx))] = e^(α+βx) / (1 + e^(α+βx))

Also notice that log[F(x)/(1 − F(x))] = α + βx. Therefore, the logistic cumulative distribution function is what is used for "logistic" regression! (Note: one could say the link is the inverse CDF, since F⁻¹ applied to F(x) returns log[F(x)/(1 − F(x))] = α + βx here.)
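A quick numerical confirmation (a small added check, not in the original code) that the logit really is the inverse of the standard logistic CDF:

#qlogis() is the inverse logistic CDF; it equals the logit exactly
p <- c(0.1, 0.5, 0.9)
qlogis(p)          #inverse CDF of the standard logistic
log(p / (1 - p))   #logit: identical values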
Example: Normal probability distribution

Let X have a normal probability distribution. The probability density function of X can be represented by

f(x) = [1/√(2πσ²)] e^(−(x−μ)²/(2σ²)) for −∞ < x < ∞, −∞ < μ < ∞, and σ > 0

The cumulative distribution function can be found from P(X ≤ x):

F(x) = ∫_{−∞}^{x} [1/√(2πσ²)] e^(−(u−μ)²/(2σ²)) du

Suppose μ = 0 and σ² = 1. Then F(1.645) = 0.95, F(1.96) = 0.975, and F(2.576) = 0.995. Many textbooks use Φ(∙) to denote the CDF of a standard normal distribution; thus Φ(1.645) = 0.95. In more familiar notation, Z_(1−α) = Z_(0.95) = 1.645 where α = 0.05; 1 − α represents the area to the left of 1.645 (for this example) under the probability density function. Note that other books may use Z_α, where α is the area in the "right" tail of the probability density function.

CDFs are nice to use as link functions with binary data since a CDF is always between 0 and 1. Two other commonly used link functions based on CDFs are:

Probit — based on the CDF of the standard normal distribution; the name comes from probit being a shortened version of "probability unit" (Hubert, 1992).

Random component: Y ~ Bernoulli(π)
Systematic component: α + βx
Link function: probit transformation

π(x) = Φ(α + βx), where Φ(∙) is the CDF of a standard normal distribution. Then

Φ⁻¹[π(x)] = α + βx

Φ⁻¹[∙] is often called the "probit" transformation and denoted probit(∙); in general, it is referred to as the inverse of the standard normal CDF. Thus

probit[π(x)] = α + βx (note the similarity to "logit")

What does Φ⁻¹[∙] or probit[∙] represent? Here are a few examples: Φ⁻¹[0.95] = probit(0.95) = 1.645, Φ⁻¹[0.975] = 1.96, and Φ⁻¹[0.995] = 2.576. Compare the probit transformation to the logit transformation; remember, the main purpose is to have a function whose value is between 0 and 1 in order to model the probability of success.

Complementary log-log — based on 1 − CDF of the Gumbel (extreme value) distribution.

Random component: Y ~ Bernoulli(π)
Systematic component: α + βx
Link function: complementary log-log transformation

The CDF of a Gumbel distribution is F(x) = exp{−exp[−(x − μ)/σ]} for parameters −∞ < μ < ∞ and σ > 0. Notice that 1 − F(x) is still between 0 and 1. Also, note that E(X) = μ + σγ, where γ ≈ 0.577216 (Euler's constant), and Var(X) = σ²π²/6. Let β = −1/σ and α = μ/σ. Through some algebra, one minus the CDF becomes 1 − F(x) = 1 − exp[−exp(α + βx)]. Thus,

π(x) = 1 − exp[−exp(α + βx)]

Solving for the systematic component produces

log{−log[1 − π(x)]} = α + βx

The "complementary" part of the name comes from 1 − F(x), rather than F(x), being used.

Example: Compare the logistic, probit, and complementary log-log GLMs (pi_plot.R)

[Figure: π(x) vs. x for the logit, probit, and complementary log-log models, with α = 1 and β = 0.5 in the first panel and α = 1 and β = −0.5 in the second.]

The R code used to create the data for the plots is below.

alpha<-1
beta<-0.5
par(pty="s")
curve(expr = plogis(alpha+beta*x), from = -15, to = 15, col = "red", lwd = 2, lty = 1,
  main = expression(paste(pi(x), " vs. x for ", alpha, " = 1 and ", beta, " = 0.5")),
  xlab = "x", ylab = expression(pi(x)),
  panel.first = grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted"))
curve(expr = pnorm(alpha+beta*x, mean=0, sd=1), from = -15, to = 15, col = "blue",
  add = TRUE, lty = 2, lwd = 2)
curve(expr = 1-exp(-exp(alpha+beta*x)), from = -15, to = 15, col = "green",
  add = TRUE, lty = 4, lwd = 2)
legend(locator(1), legend = c("Logit", "Probit", "Cloglog"), lty = c(1,2,4),
  lwd = c(2,2,2), col = c("red", "blue", "green"), bty = "n")
#There is a pgumbel(q, loc=0, scale=1, lower.tail = TRUE) function
#  in the evd and VGAM packages

Notes:
- The logistic model corresponds to the model plotted earlier on p. 3.9.
- The logistic and probit curves intersect at π(x) = 0.5.
- Notice the logistic and probit curves are both symmetric: the curve for π(x) < 0.5 is the mirror image of the curve for π(x) > 0.5. The complementary log-log curve does not have this property, as the quick check below illustrates.
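A small numeric check (added here, not in the original code) of the symmetry claim, evaluating each inverse link at η and −η:

#For a symmetric inverse link, pi(eta) + pi(-eta) = 1
eta <- 1.3
plogis(eta) + plogis(-eta)                   #1: logit is symmetric
pnorm(eta) + pnorm(-eta)                     #1: probit is symmetric
(1-exp(-exp(eta))) + (1-exp(-exp(-eta)))     #about 1.21: cloglog is not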
When you fit these models to a data set, you should not expect all of the α̂'s and β̂'s to be the same across links; the plots above are a little misleading in that respect.

Which model should you use? This is not an easy question to answer.
- The logit link provides a convenient way to interpret the model through odds and odds ratios — notice that the logit transformation is the log of an odds! Because of this, the logit link will often be used over the other two. Chapter 5 focuses on the logit link.
- The logit and probit links often produce models that are not too different; see the upcoming examples.
- I have not seen the complementary log-log link used often; however, this does not mean it is not used in practice.
- One way to decide among the three link functions is to fit all three and see which one gives the best "fit": which fits the data best graphically (as on p. 3.42), which has the smallest residuals in absolute value, and which satisfies goodness-of-fit statistics the best.
- Goodness-of-link tests can also help determine which link function to use. These tests usually embed the link functions in one family of functions. For example, Aranda-Ordaz (1981) incorporated the probit and logit transformations into one family of transformations and gives a hypothesis test to help choose between them. A small discussion of these tests is available on p. 301 and p. 257-8 in Agresti (2002).

Example: Placekicking (placekick_ch3.R, place.s.csv)

Probit model:

> mod.fit.probit<-glm(formula = good1 ~ dist, data = place.s,
    family = binomial(link = probit), na.action = na.exclude,
    control = list(epsilon = 0.0001, maxit = 50, trace = T))
Deviance = 825.0748 Iterations - 1
Deviance = 776.0735 Iterations - 2
Deviance = 772.0135 Iterations - 3
Deviance = 771.9512 Iterations - 4
> summary(mod.fit.probit)

Call: glm(formula = good1 ~ dist, family = binomial(link = probit), data = place.s,
    na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.8166  0.2275  0.2275  0.3914  1.5316

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.205985   0.155195   20.66   <2e-16 ***
dist        -0.062768   0.004284  -14.65   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1013.43  on 1424  degrees of freedom
Residual deviance:  771.95  on 1423  degrees of freedom
AIC: 775.95

> #Estimated probability of success for a 20 yard field goal
> lin.pred<-mod.fit.probit$coefficients[1] + mod.fit.probit$coefficients[2]*20
> pnorm(q = lin.pred, mean = 0, sd = 1)
0.9744488
> #Estimated probability of success for a 50 yard field goal
> lin.pred<-mod.fit.probit$coefficients[1] + mod.fit.probit$coefficients[2]*50
> pnorm(q = lin.pred, mean = 0, sd = 1)
0.526936

Complementary log-log model:

> mod.fit.cloglog<-glm(formula = good1 ~ dist, data = place.s,
    family = binomial(link = cloglog), na.action = na.exclude,
    control = list(epsilon = 0.0001, maxit = 50, trace = T))
Deviance = 836.9174 Iterations - 1
Deviance = 771.2283 Iterations - 2
Deviance = 769.4893 Iterations - 3
Deviance = 769.4776 Iterations - 4
> summary(mod.fit.cloglog)

Call: glm(formula = good1 ~ dist, family = binomial(link = cloglog), data = place.s,
    na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.9052  0.2126  0.2126  0.4132  1.3705

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.379921   0.117955   20.18   <2e-16 ***
dist        -0.052226   0.003702  -14.11   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1013.43  on 1424  degrees of freedom
Residual deviance:  769.48  on 1423  degrees of freedom
AIC: 773.48

Number of Fisher Scoring iterations: 4

> #Estimated probability of success for a 20 yard field goal
> lin.pred<-mod.fit.cloglog$coefficients[1] + mod.fit.cloglog$coefficients[2]*20
> 1-exp(-exp(lin.pred))
0.977664
> #Estimated probability of success for a 50 yard field goal
> lin.pred<-mod.fit.cloglog$coefficients[1] + mod.fit.cloglog$coefficients[2]*50
> 1-exp(-exp(lin.pred))
0.5477212

Notes:
- The estimated models are:

Link            Estimated model
Logistic        logit[π̂(x)] = 5.8121 − 0.1150x
Probit          probit[π̂(x)] = 3.2060 − 0.0628x
Comp. log-log   log{−log[1 − π̂(x)]} = 2.3799 − 0.0522x

Equivalently, in terms of π̂(x):

Link            Estimated model
Logistic        π̂(x) = exp(5.8121 − 0.1150x) / [1 + exp(5.8121 − 0.1150x)]
Probit          π̂(x) = Φ(3.2060 − 0.0628x)
Comp. log-log   π̂(x) = 1 − exp[−exp(2.3799 − 0.0522x)]

- Estimated probabilities: suppose you want to estimate the probability of success for a distance of 20 yards. For the probit model,

π̂(x = 20) = Φ(3.2060 − 0.0628·20) = Φ(1.95) = 0.9744

For the complementary log-log model,

π̂(x = 20) = 1 − exp[−exp(2.3799 − 0.05222·20)] = 0.9777

To summarize:

Link            Distance   π̂(x)
Logistic        20         0.9710
Probit          20         0.9744
Comp. log-log   20         0.9777
Logistic        50         0.5152
Probit          50         0.5269
Comp. log-log   50         0.5477

An easier way to find the estimated probabilities is to use the predict() function. Suppose the complementary log-log model is fit and the model fit information is stored in the mod.fit.cloglog object. Then predict() can be used as follows to estimate the probability of success at x = 20:

> predict.data<-data.frame(dist=20)
> predict(object = mod.fit.cloglog, newdata = predict.data, type = "response")
[1] 0.977664

The type = "response" option tells R that you want to predict π. If you want to predict the linear predictor instead, use the type = "link" option.

> #predict the linear predictor
> predict(object = mod.fit.cloglog, newdata = predict.data, type = "link")
[1] 1.335410

To predict for more than one distance, create a data set with extra rows:

> #Predict for 20 and 50 yards
> predict.data<-data.frame(dist = c(20, 50))
> save.pi.hat<-predict(object = mod.fit.cloglog, newdata = predict.data, type = "response")
> data.frame(predict.data, pi.hat = round(save.pi.hat,4))
  dist pi.hat
1   20 0.9777
2   50 0.5477

Finally, one can also use the predict() function to find the standard error of π̂. This information can be used to find approximate (1 − α)100% Wald confidence intervals for π; the actual formulas will be discussed in Chapter 4.
> #Prediction with C.I.s
> predict.data<-data.frame(dist = c(20, 50))
> alpha<-0.05
> save.pi.hat<-predict(object = mod.fit.cloglog, newdata = predict.data,
    type = "response", se.fit = TRUE)
> lower<-save.pi.hat$fit-qnorm(1-alpha/2) * save.pi.hat$se.fit
> upper<-save.pi.hat$fit+qnorm(1-alpha/2) * save.pi.hat$se.fit
> data.frame(predict.data, pi.hat = round(save.pi.hat$fit,4),
    se = round(save.pi.hat$se.fit,4), lower = round(lower,4), upper = round(upper,4))
  dist pi.hat     se  lower  upper
1   20 0.9777 0.0046 0.9686 0.9867
2   50 0.5477 0.0303 0.4884 0.6070

Below is a plot of the estimated probabilities from all three models.

[Figure: estimated probability of success of a placekick with observed proportions; the fitted logit, probit, and complementary log-log curves overlaid on the observed proportions by distance (yards).]

R code:

par(pty = "m") #plots over all of graph - not square
plot(x = place.pattern$distance, y = place.pattern$prop, xlab = "Distance (yards)",
  ylab = "Estimated probability",
  main = "Estimated probability of success of a placekick \n with observed proportions",
  panel.first = grid(col = "gray", lty = "dotted"))
curve(expr = plogis(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x),
  col = "red", add = TRUE, lwd = 2, lty = 1)
curve(expr = pnorm(mod.fit.probit$coefficients[1] + mod.fit.probit$coefficients[2]*x),
  col = "blue", add = TRUE, lty = 2, lwd = 2)
curve(expr = 1-exp(-exp(mod.fit.cloglog$coefficients[1] +
  mod.fit.cloglog$coefficients[2]*x)), col = "green", add = TRUE, lty = 4, lwd = 2)
legend(locator(1), legend = c("Complementary log-log", "Logit", "Probit"),
  lty = c(4, 1, 2), bty = "n", col = c("green", "red", "blue"), cex = 0.75)

[Figure: bubble plot version of the same comparison, with bubble size proportional to the number of placekicks at each distance.]

R code:

#Bubble plot version with bubble size proportional to sample size
symbols(x = place.pattern$distance, y = place.pattern$prop,
  circles = sqrt(place.pattern$n), xlab = "Distance (yards)",
  ylab = "Estimated probability", xlim = c(10,65), ylim = c(0, 1.2),
  main = "Estimated probability of success of a placekick \n with observed proportions",
  panel.first = grid(lty = "dotted"))
curve(expr = plogis(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x),
  col = "red", add = TRUE, lwd = 2, lty = 1)
curve(expr = pnorm(mod.fit.probit$coefficients[1] + mod.fit.probit$coefficients[2]*x),
  col = "blue", add = TRUE, lty = 2, lwd = 2)
curve(expr = 1-exp(-exp(mod.fit.cloglog$coefficients[1] +
  mod.fit.cloglog$coefficients[2]*x)), col = "green", add = TRUE, lty = 4, lwd = 2)
legend(locator(1), legend = c("Complementary log-log", "Logit", "Probit"),
  lty = c(4, 1, 2), lwd = c(2,2,2), bty = "n", col = c("green", "red", "blue"),
  cex = 0.75)

3.3 Generalized linear models for count data

Counts (for example, counts in a contingency table) of possible outcomes are non-negative integers. These are often modeled as Poisson random variables. Chapter 7 focuses on counts from a contingency table for multiple categorical variables; this section focuses on counts for a single categorical variable that do not necessarily appear in a contingency table.

Review — the Poisson distribution:

P(Y = y) = e^(−μ) μ^y / y! for y = 0, 1, 2, …

where Y is a random variable, y denotes the possible outcomes of Y, and μ is a parameter. Here, E(Y) = μ and Var(Y) = μ — this can be too restrictive.
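A quick simulation check (added, not in the original notes) of the E(Y) = Var(Y) = μ property:

#Sample mean and variance of simulated Poisson counts both estimate mu
set.seed(1)
y <- rpois(100000, lambda = 4)
c(mean = mean(y), var = var(y))   #both approximately 4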
Poisson regression

To make the introduction easier, assume there is only one explanatory variable.

Random component: Y ~ Poisson(μ)
Systematic component: α + βx
Link function: log transformation

log[E(Y)] = log(μ) = α + βx ⇒ μ = e^(α+βx) = e^α e^(βx) = e^α (e^β)^x

Notice the multiplicative effect a change in x has on μ. One could also call μ here "μ(x)", similar to what was done with π in the previous section.

Question: why do you think the log link is preferred over the identity link for count data?

Examples of possible Y and x variables:
Y = # of credit cards you have
Y = # of arrests in a city per year
Y = # of airplane crashes per year
Y = # of cars stopped at the 33rd and Holdrege streets intersection
What variables could have an effect on Y? If Y is the number of credit cards: x = income level, gender, where you live, …

Example: Horseshoe crabs and satellites (horseshoe.R, horseshoe.txt)

See the video! Also, please see the description on p. 75 of Agresti (2007); pages 76-77 show the entire data set. More on the crabs:
www.npr.org/templates/story/story.php?storyId=106489695
http://www.ceoe.udel.edu/horseshoecrab

For the ith female crab, assume the number of satellites, Yi, has a Poisson distribution with mean μi dependent on the female's shell width. We will model the expected number of satellites with

log(μi) = α + βxi

where xi is the width of the ith female crab.

> #Read in data - notice how the data was read in
> crab<-read.table(file = "c:\\Chris\\UNL\\STAT875\\chapter4\\horseshoe.txt",
    header=FALSE, col.names = c("satellite", "width"))
> mod.fit<-glm(formula = satellite ~ width, data = crab, family = poisson(link = log),
    na.action = na.exclude, control = list(epsilon = 0.0001, maxit = 50, trace = T))
Deviance = 759.6346 Iterations - 1
Deviance = 580.078 Iterations - 2
Deviance = 567.9793 Iterations - 3
Deviance = 567.8786 Iterations - 4
Deviance = 567.8786 Iterations - 5
> summary(mod.fit)

Call: glm(formula = satellite ~ width, family = poisson(link = log), data = crab,
    na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))

Deviance Residuals:
    Min       1Q   Median      3Q     Max
-2.8526  -1.9884  -0.4933  1.0970  4.9221

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54222  -6.095 1.10e-09 ***
width        0.16405    0.01996   8.217  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 632.79  on 172  degrees of freedom
Residual deviance: 567.88  on 171  degrees of freedom
AIC: 927.18

Number of Fisher Scoring iterations: 5
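An added note on interpretation (not in the original notes): with the log link, a one-unit increase in width multiplies the estimated mean by exp(β̂).

#Multiplicative effect of a 1 cm width increase on the mean satellite count
exp(mod.fit$coefficients[2])   #exp(0.16405) ~ 1.178: about an 18% increase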
> #Predict for widths of 23 and 30 cm
> predict.data<-data.frame(width = c(23, 30))
> alpha<-0.05
> save.mu.hat<-predict(object = mod.fit, newdata = predict.data, type = "response",
    se = TRUE)
> lower<-save.mu.hat$fit-qnorm(1-alpha/2)*save.mu.hat$se
> upper<-save.mu.hat$fit+qnorm(1-alpha/2)*save.mu.hat$se
> data.frame(predict.data, mu.hat = round(save.mu.hat$fit,4), lower = round(lower,4),
    upper = round(upper,4))
  width mu.hat  lower  upper
1    23 1.5972 1.3074 1.8871
2    30 5.0359 4.3101 5.7618

> #Plot of data and estimated model
> plot(x = crab$width, y = crab$satellite, xlab = "Width (cm)",
    ylab = "Number of satellites",
    main = "Horseshoe crab data set \n with poisson regression model fit",
    panel.first = grid(col = "gray", lty = "dotted"))
> curve(expr = exp(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x), col = "red",
    add = TRUE, lty = 1)
> #This is part of Table 3.3 on p. 80 of Agresti (2007). The last two "columns" are
> #  the number of cases and the number of satellites. The first "column" is the group
> #  width mean corresponding to the width categories given in Table 3.3. These means
> #  are stated on p. 90 of Agresti (1996); in the 2007 edition he did not state them,
> #  but they can be found easily, as shown in my Table3.3.R program.
> crab.tab3.3<-data.frame(width = c(22.69, 23.84, 24.77, 25.84, 26.79, 27.74,
    28.67, 30.41), cases = c(14, 14, 28, 39, 22, 24, 18, 14),
    satell = c(14, 20, 67, 105, 63, 93, 71, 72))
> #Average number of satellites per group
> mu.obs<-crab.tab3.3$satell/crab.tab3.3$cases
> points(x = crab.tab3.3$width, y = mu.obs, pch = 18, col = "darkgreen", cex = 2)
> legend(locator(1), legend="Diamonds are group means", cex = 0.75)

[Figure: horseshoe crab data with the Poisson regression model fit; number of satellites vs. width (cm), with the fitted curve in red and group means plotted as diamonds.]

Notes:
- First examine the plot of the data above, ignoring the estimated model plotted in red. The data show an upward trend: as the width increases, the number of satellites increases. This is easier to see with the group means (the grouping of the data comes from Table 4.3 of Agresti (1996, p. 90)). Remember that the Poisson regression model is modeling the MEAN response!
- The glm() function fits the Poisson regression model to the data. Notice the use of the family = poisson(link = log) option.
- The estimated Poisson regression model is

μ̂ = exp(−3.3048 + 0.1640x)

where x = width and μ is the mean number of satellites. The model could also be written as log(μ̂) = −3.3048 + 0.1640x. What happens to the estimated mean number of satellites as the width increases?

- The estimated mean number of satellites for a particular width can be found from the model. For example, for a width of 23 cm, μ̂ = exp(−3.3048 + 0.1640·23) = 1.5972; for a width of 30 cm, it is 5.0359. See how the predict() function was used here.
- The z value in the output gives a test for whether the corresponding parameter is 0 or not; it can be compared to a standard normal distribution. Is width important for predicting the mean number of satellites?
- See the R code used to create the plot; I had difficulty creating a legend with the diamond plotting character. The plot is very important to do in order to determine if the model works for the data!
- Table3.3.R provides a general way to find tables like Table 3.3 on p. 80 of Agresti (2007). The program also provides a general way to form width categories (not necessarily the same as those in Table 3.3). This code can be incorporated into your own programs for future projects! A minimal sketch of the idea appears below.
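Table3.3.R itself is not reproduced in these notes; the following is a minimal base-R sketch of the same computation, where the width cut points are my assumption about the Table 3.3 categories:

#Group crabs into width categories, then compare the group means and
#  variances of the satellite counts (cut points are assumed, not verified)
width.group <- cut(crab$width, breaks = c(-Inf, 23.25, 24.25, 25.25, 26.25,
  27.25, 28.25, 29.25, Inf))
data.frame(mean.per.group = tapply(crab$satellite, width.group, mean),
  var.per.group = tapply(crab$satellite, width.group, var))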
Negative binomial regression

A limiting assumption for a Poisson distribution is that E(Y) = Var(Y) = μ. Sometimes, the variance of Y appears to be greater than μ for a data set. Evidence of this occurs in the horseshoe crab example; see Table 3.3 on p. 80, or the part of it produced below from my Table3.3.R program.

> table3.3[,1:5]
  width.group number.cases number.sat mean.per.group var.per.group
1    22.69286           14         14       1.000000      2.769231
2    23.84286           14         20       1.428571      8.879121
3    24.77500           28         67       2.392857      6.543651
4    25.83846           39        105       2.692308     11.376518
5    26.79091           22         63       2.863636      6.885281
6    27.73750           24         93       3.875000      8.809783
7    28.66667           18         71       3.944444     16.879085
8    30.40714           14         72       5.142857      8.285714

If the Poisson assumption were satisfied, we would expect the mean.per.group column to be approximately the same as the var.per.group column. Obviously, that is not happening here. Note that this is an "ad hoc" way to show the variance is larger than the mean (due to the artificial grouping of the data), but it still shows evidence of a possible problem.

When the variance is larger than the mean, this is called overdispersion, and it is a violation of our model. Thus, inferences made using the model may be incorrect. What can you do when this occurs?

1. Find more explanatory variables that help explain the variability in the response variable! The additional variability could be due to not accounting for other explanatory variables. For example, perhaps crab weight plays an important role in estimating the mean number of satellites; without accounting for weight and using width only, there can be more satellite variability than expected at individual widths. See Agresti (2007) p. 80-1 for a further good explanation.
2. Page 151 of Agresti (2002) discusses quasi-Poisson regression models. These models do not assume a full parametric form and can be estimated with the glm() function by using a family = quasipoisson(link = log) option. See the additional Chapter 3 notes for more information. Agresti (2007) does not discuss these models in Chapter 3 (only a little on p. 280), so they will not be discussed here.
3. Poisson generalized linear mixed models, which are explained in Section 13.5 of Agresti (2002).
4. Agresti (2007) discusses negative binomial models, so these are presented next.

One way to write the negative binomial distribution is

P(Y = y) = C(y+k−1, y) p^k (1−p)^y for y = 0, 1, …

where C(y+k−1, y) is the binomial coefficient. This distribution arises when one is interested in the probability of y failures before the kth success (see Casella and Berger (2002, p. 95) if you are interested in more detail). For us, there are two important aspects of this distribution. First, the values of Y are non-negative integers, just like a Poisson random variable. Second, the distribution can be rewritten as

P(Y = y) = C(y+k−1, y) [k/(μ+k)]^k [μ/(μ+k)]^y for y = 0, 1, … and k > 0

where E(Y) = μ and Var(Y) = μ + μ²/k. Notice that this is very similar to what we had for a Poisson random variable, but now we have a larger variance for Y! The parameter k is a measure of the "over" dispersion. Note that Agresti (2007) officially defines D = 1/k as the "dispersion parameter"; as 1/k goes to 0, we approach the Poisson distribution. More in-depth information on how this distribution comes about is available on p. 559-561 of Agresti (2002) if you are interested.
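A quick simulation check of the variance formula (added here; it relies on R's (size = k, mu = μ) parameterization of rnbinom()):

#With k = 0.9 and mu = 3, Var(Y) = mu + mu^2/k = 3 + 9/0.9 = 13
set.seed(1)
y <- rnbinom(100000, size = 0.9, mu = 3)
c(mean = mean(y), var = var(y))   #approximately 3 and 13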
Example: Horseshoe crabs and satellites (horseshoe.R, horseshoe.txt)

The glm() function cannot fit this specific model, so we need to use the glm.nb() function in the MASS package. This package comes with an initial installation of R, but you still need to tell R that you want to use it.

> library(MASS)
> mod.fit.nb<-glm.nb(formula = satellite ~ width, data = crab, link = log)
> summary(mod.fit.nb)

Call: glm.nb(formula = satellite ~ width, data = crab, link = log,
    init.theta = 0.904568080033865)

Deviance Residuals:
    Min       1Q   Median      3Q     Max
-1.7798  -1.4110  -0.2502  0.4770  2.0177

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.05251    1.17143  -3.459 0.000541 ***
width        0.19207    0.04406   4.360 1.30e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.9046) family taken to be 1)

    Null deviance: 213.05  on 172  degrees of freedom
Residual deviance: 195.81  on 171  degrees of freedom
AIC: 757.29

Number of Fisher Scoring iterations: 1

Correlation of Coefficients:
      (Intercept)
width -1.00

              Theta:  0.905
          Std. Err.:  0.161
 2 x log-likelihood:  -751.291

The estimated negative binomial regression model is

log(μ̂) = −4.0525 + 0.1921x

R's theta is k, so k̂ = 0.905 and D̂ = 1/k̂ = 1/0.905 ≈ 1.1. Thus, the estimated variance of Y is now μ̂ + 1.1μ̂² (remember that μ̂ depends on the value of x). Where does this larger variance show up in the analysis? One place is in Var̂(β̂) (without going into the formula details). Looking in the usual spot in the output, we obtain an estimated standard error of 0.04406; the corresponding value in the Poisson regression model was 0.01996. Why does this larger variance matter? How could one test whether there is evidence of overdispersion?

Poisson regression for rate data

Rate data consist of the rate at which events occur over some time period or other baseline measure. Examples include the number of times a computer crashes during a time period, the number of melanoma cases per city size, and the number of arrivals at an airport in a particular time period. The time period or baseline measure needs to be incorporated into the analysis. One way to do this is to model Y/t instead of just Y, where Y is the number of events and t is the time period or baseline measure. Thus, the Poisson regression model becomes

log(μ/t) = α + βx, where μ = E(Y)

This expression can be rewritten as

log(μ) − log(t) = α + βx ⇔ log(μ) = α + βx + log(t)

log(t) is called an "offset". Notice the effect the offset has on μ:

μ = e^(α + βx + log(t)) = t e^(α+βx)

Thus, t helps adjust the "usual" mean e^(α+βx) by the time period or baseline measure.

Example: Horseshoe crabs and satellites (horseshoe.R, horseshoe.txt)

This is not necessarily the best example of where one would want to use Poisson regression for rate data, but it gives a nice illustration of the relationship between a Poisson model for rate data and "regular" data. Please see p. 83 of Agresti (2007) for an example where using rate data is more appropriate.

Suppose the data were given in the form of the number of satellites per distinct width. Let Y be the number of satellites for a distinct width, and let t be the number of female crabs observed at that width. For example, there are t = 3 crabs with a width of 22.9 cm and they have a total of Y = 4+0+0 = 4 satellites.
Before, the data set looked like this:

Crab ID   Satellites   Width
1         8            28.3
2         0            22.5
3         9            26.0
4         0            24.8

Now the data set looks like this:

# of crabs (t)   Total satellites (Y)   Width
1                0                      21.0
1                0                      22.0
3                5                      22.5
3                4                      22.9

> library(nlme)  #gsummary function is located here
Loading required package: lattice
> sum.rate.data<-gsummary(object = crab, FUN = sum, groups = crab$width)
> length.rate.data<-gsummary(object = crab, FUN = length, groups = crab$width)
> rate.data<-data.frame(y=sum.rate.data$satellite, t=length.rate.data$satellite,
    width=length.rate.data$width)
> mod.fit.rate<-glm(formula = y ~ width+offset(log(t)), data = rate.data,
    family = poisson(link = log), na.action = na.exclude,
    control = list(epsilon = 0.0001, maxit = 50, trace = T))
Deviance = 211.7379 Iterations - 1
Deviance = 190.2969 Iterations - 2
Deviance = 190.0273 Iterations - 3
Deviance = 190.0272 Iterations - 4
Deviance = 286.3955 Iterations - 1
Deviance = 255.2993 Iterations - 2
Deviance = 254.9404 Iterations - 3
Deviance = 254.9403 Iterations - 4
> summary(mod.fit.rate)

Call: glm(formula = y ~ width + offset(log(t)), family = poisson(link = log),
    data = rate.data, na.action = na.exclude,
    control = list(epsilon = 1e-04, maxit = 50, trace = T))

Deviance Residuals:
    Min       1Q   Median      3Q     Max
-3.8003  -1.4515  -0.3788  0.6619  4.7586

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54218  -6.095 1.09e-09 ***
width        0.16405    0.01996   8.217  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 254.94  on 65  degrees of freedom
Residual deviance: 190.03  on 64  degrees of freedom
AIC: 402.52

Number of Fisher Scoring iterations: 4

> #Plot of data with estimated mu's; notice the use of the panel.first option
> #  to put grid lines behind the plotting points
> plot(x = crab$width, y = crab$satellite, xlab = "Width (cm)",
    ylab = "Number of satellites", panel.first = grid(col = "gray", lty = "dotted"),
    main = "Horseshoe crab data set \n with poisson regression model fit (rate data)")
> points(x = rate.data$width, y = mod.fit.rate$fitted.values, pch = 18,
    col = "darkgreen", cex = 1)
> legend(locator(1), legend="Diamonds are predicted values", cex = 0.75)

[Figure: horseshoe crab data with the Poisson regression (rate data) model fit; number of satellites vs. width (cm), with predicted values plotted as diamonds.]

Notes:
- The gsummary() function allows one to summarize a data set by a grouping variable, similar to using a SAS procedure with a BY statement. Here I sum the satellites over the different crab widths and also find the number of crabs per distinct width; these are combined into the data frame called rate.data. Note that gsummary() is in the nlme package, so this package needs to be loaded first.
- The glm() function is used to fit the Poisson regression model with an offset. Notice the parameter estimates are the same as before! Below is the estimated model with the offset:

log(μ̂) = −3.3048 + 0.1640(width) + log(t)

where t = number of crabs at a distinct width. Generally, the parameter estimates will be exactly the same; through some statistical research I have worked on, I have found situations where there are some differences.
- Why are there two sets of iterations for glm() here?
In the glm() function code, a model with an intercept ONLY is fit first, and then the whole model is fit. This occurs only when an offset is used; I think the reason has to do with what an intercept-only model represents with rate data.

Notice that a smooth curve cannot be plotted here because of the different number of crabs per width. A better version of the plot would use different colors for the plotting characters (corresponding to each t value) for each observed Y and predicted μ. More plots of the model are discussed in the Chapter 3 additional notes.

3.4 Statistical inference and model checking

One of the best things about GLMs is that they provide a unified approach to testing model parameters, checking goodness of fit, examining residuals, estimating parameters, … . Thus, one can use the same basic methods for logistic, probit, complementary log-log, and Poisson regression.

The Wald and likelihood ratio tests

A hypothesis test commonly of interest is

Ho: β = 0 vs. Ha: β ≠ 0

Below are two different ways this test can be conducted.

Wald — The test statistic is

Z = (β̂ − 0)/SE = β̂/SE

where SE stands for "standard error". Actually, this standard error is an estimate of the "asymptotic" standard error; you will often see it written as the square root of AsVar(β̂) or Var̂(β̂). For large n, remember that an MLE (β̂ here) has an approximate normal distribution. Thus, Z has an approximate standard normal distribution, and this distribution can be used to perform the test.

On p. 1.28 of the notes, the "large-sample variance" was introduced for p = y/n, where each y is 0 or 1 and n is the number of trials. The formula given was

Var(p) = { −E[ ∂² log L(π | y1, …, yn) / ∂π² ] }⁻¹

This formula can be used here also, by using the likelihood function for α and β instead. Since there are two parameters, a matrix of the second partial derivatives is found:

{ −E [ ∂²logL(α,β|y1,…,yn)/∂α²    ∂²logL(α,β|y1,…,yn)/∂α∂β
       ∂²logL(α,β|y1,…,yn)/∂β∂α   ∂²logL(α,β|y1,…,yn)/∂β²  ] }⁻¹ evaluated at α̂, β̂

  = [ Var̂(α̂)       Côv(α̂, β̂)
      Côv(α̂, β̂)   Var̂(β̂)   ]

The "large-sample variance" for β̂ is the (2,2) element of the above matrix. The square root of this quantity is the SE used in the denominator of Z.

Notes:
- Try to write out the likelihood function for a logistic regression or Poisson model on your own, and then try to write out the matrix of second partial derivatives. You will never need to evaluate the large-sample variance formula by hand — R will do it routinely for us!
- The same problems we have had before with Wald confidence intervals happen here too, so we need to make sure the sample size is large. The next method is a little better to use when the sample size is not large.
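As an added illustration, the Wald statistic can be computed by hand from the estimated covariance matrix that R stores with the fit (a sketch, assuming mod.fit holds the logistic placekick fit from placekick_ch3.R):

#Wald statistic for Ho: beta = 0 using the (2,2) element of the
#  estimated large-sample covariance matrix
z <- coef(mod.fit)["dist"] / sqrt(vcov(mod.fit)["dist", "dist"])
z                    #about -13.8, matching the summary() output
2 * pnorm(-abs(z))   #two-sided p-value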
Likelihood ratio test (LRT) - We have discussed the LRT before in Chapters 1-2. This procedure can also be used here. Review from p. 1.29 and 2.58: the LRT statistic is

λ = (max. likelihood when parameters satisfy Ho) / (max. likelihood when parameters satisfy Ho or Ha)

Remember that the ratio is between 0 and 1 since the numerator cannot exceed the denominator. For the test of β = 0 vs. β ≠ 0, the numerator is calculated assuming β = 0. Thus, the model fit to the data is only g(μ) = α (where g(⋅) denotes the link function). The denominator is calculated without the assumption that β = 0. Thus, the model fit to the data is g(μ) = α + βx. The likelihood functions are found using the fit of both models, and the ratio is found. For example, for logistic regression the ratio becomes

λ = [ ∏ᵢ₌₁ⁿ π̂ₒ(xᵢ)^yᵢ (1 - π̂ₒ(xᵢ))^(1-yᵢ) ] / [ ∏ᵢ₌₁ⁿ π̂(xᵢ)^yᵢ (1 - π̂(xᵢ))^(1-yᵢ) ]

where π̂ₒ(xᵢ) = e^α̂ₒ / (1 + e^α̂ₒ) and π̂(xᵢ) = e^(α̂+β̂xᵢ) / (1 + e^(α̂+β̂xᵢ)).

The actual test statistic used for a LRT is -2log(λ). The reason is that this statistic has an approximate χ² distribution for large n. The degrees of freedom are found the same way as before. In this case, notice the difference between Ho and Ha is whether or not β = 0. Thus, the χ² distribution has 1 degree of freedom. Note that -2log(λ) is often denoted in categorical data analysis as G².

Often in computer output, -2log(λ) is not given directly. Instead, what is given is the "null deviance" and the "residual deviance". These are -2log(λ) statistics themselves, but for testing a different set of hypotheses. Simply put, the -2log(λ) for a test of Ho: β = 0 vs. Ha: β ≠ 0 is:

null deviance - residual deviance

Below is a further explanation of the two deviances.

The null deviance tests
  Ho: Model with only α
  Ha: Model using the observed values

The test statistic for Poisson regression is

G₁² = 2 Σᵢ₌₁ⁿ yᵢ log(yᵢ / μ̂ₒ,ᵢ)   where μ̂ₒ,ᵢ = e^α̂ₒ.

Compare the above form to what we saw on p. 2.59. The test statistic for logistic regression is

G₁² = 2 Σᵢ₌₁ⁿ [ yᵢ log(yᵢ / π̂ₒ,ᵢ) + (1 - yᵢ) log((1 - yᵢ) / (1 - π̂ₒ,ᵢ)) ]   where π̂ₒ,ᵢ = e^α̂ₒ / (1 + e^α̂ₒ).

Questions:
- What is μ̂ₒ for the Poisson regression model?
- What is π̂ₒ for the logistic regression model?

The residual deviance tests
  Ho: Model with α and β
  Ha: Model using the observed values

The test statistic for Poisson regression is

G₂² = 2 Σᵢ₌₁ⁿ yᵢ log(yᵢ / μ̂ᵢ)   where μ̂ᵢ = e^(α̂+β̂xᵢ).

The test statistic for logistic regression is

G₂² = 2 Σᵢ₌₁ⁿ [ yᵢ log(yᵢ / π̂ᵢ) + (1 - yᵢ) log((1 - yᵢ) / (1 - π̂ᵢ)) ]   where π̂ᵢ = e^(α̂+β̂xᵢ) / (1 + e^(α̂+β̂xᵢ)).

Notice that G₁² and G₂² have a few things in common. When the residual deviance is subtracted from the null deviance, the resulting statistic for Poisson regression is:

G₁² - G₂² = 2 Σᵢ₌₁ⁿ yᵢ log(yᵢ / μ̂ₒ,ᵢ) - 2 Σᵢ₌₁ⁿ yᵢ log(yᵢ / μ̂ᵢ)
          = 2 [ Σᵢ₌₁ⁿ yᵢ log(yᵢ) - Σᵢ₌₁ⁿ yᵢ log(μ̂ₒ,ᵢ) ] - 2 [ Σᵢ₌₁ⁿ yᵢ log(yᵢ) - Σᵢ₌₁ⁿ yᵢ log(μ̂ᵢ) ]
          = 2 [ Σᵢ₌₁ⁿ yᵢ log(μ̂ᵢ) - Σᵢ₌₁ⁿ yᵢ log(μ̂ₒ,ᵢ) ]
          = 2 Σᵢ₌₁ⁿ yᵢ log(μ̂ᵢ / μ̂ₒ,ᵢ) = 2 Σᵢ₌₁ⁿ yᵢ log( e^(α̂+β̂xᵢ) / e^α̂ₒ )

For logistic regression, the statistic becomes

G₁² - G₂² = 2 Σᵢ₌₁ⁿ [ yᵢ log(π̂ᵢ / π̂ₒ,ᵢ) + (1 - yᵢ) log((1 - π̂ᵢ) / (1 - π̂ₒ,ᵢ)) ].

Without going into the details, these are the correct -2log(λ) statistics for the test of Ho: β = 0 vs. Ha: β ≠ 0! The word "deviance" is used because the statistics measure how much the observed data "deviate" from the model's fit.
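In R, this deviance difference is also available directly from an analysis-of-deviance table; a minimal sketch, assuming mod.fit holds a model fit from glm() as in the examples below:

#LRT of Ho: beta = 0 through the analysis-of-deviance table;
# anova() reports the change in deviance as each term is added
anova(mod.fit, test = "Chisq")

This matches the "null deviance minus residual deviance" calculations done by hand in the examples that follow.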
Example: Placekicking (placekick_ch3.R, place.s.csv)

Perform the test of Ho: β = 0 vs. Ha: β ≠ 0. The output from glm() is reproduced below.

> summary(mod.fit)

Call:
glm(formula = good1 ~ dist, family = binomial(link = logit), data = place.s, na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7441   0.2425   0.2425   0.3801   1.6091

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.812079   0.326158   17.82   <2e-16 ***
dist        -0.115027   0.008337  -13.80   <2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1013.43 on 1424 degrees of freedom
Residual deviance:  775.75 on 1423 degrees of freedom
AIC: 779.75

Number of Fisher Scoring iterations: 5

The Wald test statistic is Z = -13.80. Since Z0.975 = 1.96, β ≠ 0 with 95% confidence. Also, the p-value is very small. Therefore, distance is important for predicting the probability of success.

To find the -2log(λ), use the null and residual deviance:

G₁² - G₂² = 1013.43 - 775.75 = 237.68

The degrees of freedom given in the output for the null and residual deviance can be subtracted in the same way to find the degrees of freedom for the test: 1424 - 1423 = 1.

Below is the R code and output to perform the LRT:

> #LRT: -2log(lambda)
> mod.fit$null.deviance - mod.fit$deviance
[1] 237.6811
> #DF
> mod.fit$df.null - mod.fit$df.residual
[1] 1
> #p-value
> 1 - pchisq(q = mod.fit$null.deviance - mod.fit$deviance, df = mod.fit$df.null - mod.fit$df.residual)
[1] 0

Since the p-value is very small, β ≠ 0. Therefore, distance is important for predicting the probability of success. In Chapter 5, we will see that it is not appropriate to perform the test as done here. More will be discussed about it at that time.

Here are some additional details showing how R calculates G₂²:

> y<-place.s$good
> pi.hat<-mod.fit$fitted.values
> pi.tilde<-y
> 2*(sum(log(y^y)) - sum(y*log(pi.hat)) + sum(log((1-y)^(1-y))) - sum((1-y)*log(1-pi.hat)))  #Need the y^y form due to 0 values
[1] 775.745
> #Discussed in next chapter
> dev.resid<-resid(mod.fit, type="deviance")^2
> sum(dev.resid)
[1] 775.745

Question: Suppose you wanted to test Ho: β = 0 vs. Ha: β ≠ 0 for models with a probit or complementary log-log link. How would you do it? (A sketch follows the next example.)

Example: Horseshoe crabs and satellites (horseshoe.R, horseshoe.txt)

Perform the test of Ho: β = 0 vs. Ha: β ≠ 0. The output from glm() is reproduced below. Note that the model with the offset is used here!

> summary(mod.fit)

Call:
glm(formula = y ~ width + offset(log(t)), family = poisson(link = log), data = rate.data, na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.8003  -1.4515  -0.3788   0.6619   4.7586

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54218  -6.095 1.09e-09 ***
width        0.16405    0.01996   8.217  < 2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 254.94 on 65 degrees of freedom
Residual deviance: 190.03 on 64 degrees of freedom
AIC: 402.52

Number of Fisher Scoring iterations: 4

The Wald test statistic is Z = 8.22. Since Z0.975 = 1.96 and the p-value < 2×10⁻¹⁶, β ≠ 0. Therefore, width is important for predicting the mean number of satellites.

-2log(λ) = G₁² - G₂² = 254.9403 - 190.0272 = 64.91

> #LRT: -2log(lambda)
> mod.fit$null.deviance - mod.fit$deviance
[1] 64.91309
> #p-value
> 1 - pchisq(q = mod.fit$null.deviance - mod.fit$deviance, df = mod.fit$df.null - mod.fit$df.residual)
[1] 7.771561e-16

Since the p-value is very small, β ≠ 0. Therefore, width is important for predicting the mean number of satellites.

NOTE! In Section 3.4.5, Agresti (2007) discusses "goodness-of-fit" statistics and model residuals mostly in the context of Poisson regression models only. In Chapter 5, these items are discussed for logistic regression models in much more detail. The reason for the separation is that there are a few things one needs to watch out for in logistic regression that do not come up as much in Poisson regression. Thus, the rest of the discussion in this section will only be for Poisson regression.
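Returning to the question posed in the placekicking example about probit and complementary log-log links: the same "null deviance minus residual deviance" calculation applies, because the deviances come from the same likelihood machinery; only the link in family = binomial() changes. A minimal sketch, assuming the place.s data frame from that example (the object names mod.fit.probit and mod.fit.cloglog are mine):

mod.fit.probit<-glm(formula = good1 ~ dist, data = place.s, family = binomial(link = probit))
mod.fit.cloglog<-glm(formula = good1 ~ dist, data = place.s, family = binomial(link = cloglog))
#Same LRT as before, here for the probit fit
mod.fit.probit$null.deviance - mod.fit.probit$deviance
1 - pchisq(q = mod.fit.probit$null.deviance - mod.fit.probit$deviance, df = mod.fit.probit$df.null - mod.fit.probit$df.residual)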
Model residuals

Pearson residuals can be calculated in a similar manner as described in Chapter 2. The Pearson residual in Chapter 2 was

(nᵢⱼ - μ̂ᵢⱼ) / √μ̂ᵢⱼ

where nᵢⱼ was the cell count for row i and column j, μ̂ᵢⱼ was its estimated value under the hypothesis of independence, and √μ̂ᵢⱼ was the square root of the estimated variance (remember that for a Poisson random variable, mean = variance). A Pearson residual has an approximate standard normal distribution provided the μ̂ᵢⱼ is not small (> 2 or 5).

The same set-up can be used here for the Pearson residual from a Poisson regression model:

(yᵢ - μ̂ᵢ) / √μ̂ᵢ

where yᵢ is the ith observed value of the dependent variable and μ̂ᵢ is its predicted value.

In Chapter 2, we also learned about a standardized residual. The standardized residual has a distribution that is closer to a standard normal distribution than the Pearson residual. The standardized residual is

(yᵢ - μ̂ᵢ) / √Var̂(yᵢ - μ̂ᵢ) = (yᵢ - μ̂ᵢ) / √(μ̂ᵢ(1 - hᵢ))

where hᵢ is the ith diagonal value of the hat matrix.

What is the hat matrix? With respect to regular regression analysis, you can see my Chapters 5 and 10 STAT 870 notes at www.chrisbilder.com/stat870/schedule.htm. With respect to Poisson regression, let X be a n×2 matrix with 1's in the first column and the explanatory variable values in the second column. Create a diagonal matrix, V̂, with diagonal elements μ̂ᵢ in the same order as the corresponding explanatory variable values listed in X. The hat matrix is

H = V̂^(1/2) X (X′V̂X)⁻¹ X′V̂^(1/2)

Note that this is similar to the hat matrix used when fitting a regression model by weighted least squares. Also note that the "standardized" residual may be called elsewhere an "adjusted Pearson residual", "adjusted residual" (Agresti, 1996, uses this term), or "studentized residual".

The standardized residual can be calculated in R using

h<-lm.influence(model = mod.fit)$h
Pearson<-residuals(object = mod.fit, type="pearson")
standard.pearson<-Pearson/sqrt(1-h)

assuming mod.fit contains the model fit from glm(). We can use a standard normal approximation for both the Pearson and standardized residuals. Of course, the approximation works better with the standardized residuals.

Question: Suppose some standardized residuals are greater than 2.576 or less than -2.576. What does this mean about the model?

Example: Horseshoe crabs and satellites (horseshoe.R, horseshoe.txt)

> pearson1<-residuals(object = mod.fit, type="pearson")
> #Standardized Pearson residuals
> h<-lm.influence(model = mod.fit)$h
> head(h)
          1           2           3           4           5           6
0.009852678 0.015152453 0.006360592 0.008647581 0.006360592 0.011358140
> standard.pearson<-pearson1/sqrt(1-h)
> head(standard.pearson)
        1          2          3          4          5          6
2.1569835 -1.2223348  3.9641123 -1.4712657  0.8609526 -1.3572621
> X<-model.matrix(mod.fit)
> mu.hat<-predict(object = mod.fit, type = "response")  #Also could use mu.hat<-mod.fit$fitted.values here
> H<-diag(sqrt(mu.hat))%*%X%*%solve(t(X)%*%diag(mu.hat)%*%X)%*%t(X)%*%diag(sqrt(mu.hat))
> diag(H)[1:5]
[1] 0.009852370 0.015150506 0.006360719 0.008647445 0.006360719

Notes:
- The residuals() function finds the residuals.
- There are a few functions that help you find the hat matrix diagonal values. One is the lm.influence() function.
- There are no direct functions for the standardized residuals here, so see how the matrix calculations are done in R. You are not responsible for this content.
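As an aside, current versions of R can produce these quantities more directly; a hedged sketch (the type = "pearson" argument for rstandard() may not be available in older versions of R):

#Standardized Pearson residuals without the by-hand division,
# assuming a version of R where rstandard() accepts type = "pearson"
standard.pearson2<-rstandard(model = mod.fit, type = "pearson")
head(standard.pearson2)
#Hat matrix diagonal values directly
head(hatvalues(mod.fit))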
> par(mfrow = c(2,1))  #2x1 grid of plots
> #Pearson residual vs. observation number plot
> plot(x = 1:length(pearson1), y = pearson1, xlab="Observation number", ylab="Pearson residuals", main = "Pearson residuals vs. observation number")
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty=3, col="red")
> #Standardized residual vs. observation number plot
> plot(x = 1:length(standard.pearson), y = standard.pearson, xlab="Observation number", ylab="Standardized residuals", main = "Standardized residuals vs. observation number")
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty=3, col="red")

[Figure: "Pearson residuals vs. observation number" and "Standardized residuals vs. observation number" - both residual types plotted against observation number with dotted reference lines at the ±Z0.975 and ±Z0.995 quantiles.]

Notes:
- The abline() function was used to draw lines on the plot at ±Z0.975 and ±Z0.995. Notice it takes only one call to the function to draw all of the lines.
- Both plots are quite similar.
- Since we have only one explanatory variable, it is often helpful to plot these residuals vs. the explanatory variable.

> par(mfrow = c(1,1))
> #Residual vs. width plot
> plot(x = crab$width, y = standard.pearson, xlab="Width", ylab="Standardized Pearson residuals", main = "Standardized Pearson residuals vs. width")
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty=3, col="red")

[Figure: "Standardized Pearson residuals vs. width" - standardized residuals plotted against width with the same reference lines.]

Notice the patterns among the plotting points. It is not unusual to see these types of patterns when one is modeling a discrete response variable. The plot below shows why these patterns occur.

> plot(x = crab$width, y = standard.pearson, xlab="Width", ylab="Standardized Pearson residuals", main = "Standardized Pearson residuals vs. width", type = "n")
> text(x = crab$width, y = standard.pearson, labels = crab$satellite, cex=0.75)
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty=3, col="red")

[Figure: "Standardized Pearson residuals vs. width" - the plotting symbols are the observed satellite counts; points with the same count fall along smooth bands, which produces the patterns.]

The model looks to have fit problems when there are a larger number of satellites than expected at lower widths (relative to the observations with a particular number of satellites). This may be a result of the overdispersion that we saw earlier.

The negative binomial model could also be fit to the data. The same types of residuals can be found with the corresponding adjustments to reflect the new model.

> pearson.nb<-residuals(object = mod.fit.nb, type="pearson")
> h.nb<-lm.influence(model = mod.fit.nb)$h
> standard.pearson.nb<-pearson.nb/sqrt(1-h.nb)
> par(mfrow = c(1,2))
> plot(x = 1:length(standard.pearson.nb), y = standard.pearson.nb, xlab="Obs. number", ylab="Standardized residuals", main = "Stand. residuals (NB model) vs. obs. number")
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty=3, col="red")
> plot(x = crab$width, y = standard.pearson.nb, xlab="Width", ylab="Standardized residuals", main = "Stand. residuals (NB model) vs. width", type = "n")
width", type = "n") text(x = crab$width, y = standard.pearson.nb, labels = crab$satellite, cex=0.75) abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty=3, col="red") 2010 Christopher R. Bilder 3.83 Stand. residuals (NB model) vs. obs. number Stand. residuals (NB model) vs. width 14 3 3 10 10 15 2 8 6 4 7 66 4 4 11 6 5 5 7 4 0 1 2 1 1 0 12 8 55 1 Standardized residuals 2 1 0 9 6 2 9 10 88 5 55 6 7 66 4 55 6 9 6 44 55 6 8 4 5 3 5 33 33 4 4 33 5 22 33 5 44 3 44 2 3 4 3 2 33 4 1 11 11 11 1 1 0 0 00000 00000000000000 00 000000 33 2 7 4 2 0 00000 -1 -1 Standardized residuals 8 0 50 100 150 22 Obs. number 24 26 28 30 32 34 Width As we can see there are not as many standardized residuals outside of the 2.576 borderlines. How many standardized residuals would you expect outside of these borderlines with n = 173? Comments: 2010 Christopher R. Bilder 3.84 I am a little concern about how large these two standardized residuals are. One could examine these observations more closely like what you would do in a STAT 870 class. Due to time considerations, I am not going to do this here. I am also a little concern with there being no standardized residuals less than -1. Remember that a normal distribution is being used here. Do you think a normal approximation will work for these observations toward the bottom of these plots? One possible solution to the normal approximation problem is to work with the rate data formulation of the model. Why? See the additional Chapter 3 notes for details. Goodness-of-fit The Pearson statistic and LRT can be both used to assess how well (good) the model fits the data versus using just the “observed” values at the explanatory variable levels. This model is often called the “saturated” model since it has the most possible parameters. The saturated model estimates a parameter for every observation. For example, the saturated model for Poisson regression is log(i) = + i for i = 1, …, n results in ˆ i yi (Note: A restriction on the i’s is needed such as ni1i 0 like you would see in 2010 Christopher R. Bilder 3.85 STAT 802 or 870). Also, see the previous LRT work with saturated models. Pearson statistic: For Poisson regression, the statistic is: 2 (y ) ˆ i X2 i ˆ i i1 n The statistic can be approximated by a 2 distribution with n - # of model parameters = n – 2 degrees of freedom for large n. In order for the 2 approximation to work well, ̂i should not be small. LRT statistic: For Poisson regression, the statistic simplifies from -2log() to yi ˆ ˆ xi 2 yi log where ˆ i e i 1 ˆ i n This statistic is often denoted by G2 and was already introduced on p. 3.68. It can be approximated by the same distribution as used with the Pearson statistic, and it has the same potential problems. 2010 Christopher R. Bilder 3.86 Example: Horseshoe crabs and satellites (horseshoe.R, Table3.3.R, horseshoe.txt) > summary(mod.fit) Call: glm(formula = satellite ~ width, family = poisson(link = log), data = crab, na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T)) Deviance Residuals: Min 1Q Median -2.8526 -1.9884 -0.4933 3Q 1.0970 Max 4.9221 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.30476 0.54222 -6.095 1.10e-09 *** width 0.16405 0.01996 8.217 < 2e-16 *** --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 632.79 on 172 degrees of freedom
Residual deviance: 567.88 on 171 degrees of freedom
AIC: 927.18

Number of Fisher Scoring iterations: 5

> #LRT: -2log(lambda)
> mod.fit$deviance
[1] 567.88
> #p-value
> 1-pchisq(q = mod.fit$deviance, df = mod.fit$df.residual)
[1] 0
> #Pearson statistic
> sum(pearson1^2)
[1] 544.157
> 1-pchisq(q = sum(pearson1^2), df = mod.fit$df.residual)
[1] 0

The p-values for the LRT and the Pearson statistic test are quite small, indicating evidence of lack of fit. However, one should be concerned about whether the chi-square approximation works here. What can be done then? There are no choices that always work. Here are two possibilities.

1) Convert the data to a rate data format and perform the same tests. Note that there are still a number of widths where μ̂ < 5. Therefore, the χ² distribution approximation may be poor here as well. Below is part of the output given previously from the glm() function.

> summary(mod.fit.rate)

Call:
glm(formula = y ~ width + offset(log(t)), family = poisson(link = log), data = rate.data, na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.8003  -1.4515  -0.3788   0.6619   4.7586

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54218  -6.095 1.09e-09 ***
width        0.16405    0.01996   8.217  < 2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 254.94 on 65 degrees of freedom
Residual deviance: 190.03 on 64 degrees of freedom
AIC: 402.52

Number of Fisher Scoring iterations: 4

The "residual deviance" given in the output is G². To find the Pearson statistic, sum the squared Pearson residuals. Below is the R code for both goodness-of-fit tests.

> #LRT p-value
> 1-pchisq(q = mod.fit.rate$deviance, mod.fit.rate$df.residual)
[1] 1.998401e-14
> #Pearson statistic and p-value
> pearson.rate<-resid(object = mod.fit.rate, type="pearson")
> sum(pearson.rate^2)
[1] 174.2737
> 1-pchisq(q = sum(pearson.rate^2), mod.fit.rate$df.residual)
[1] 3.759215e-12

Both statistics indicate the model does not fit the data well - if we believe the χ² distribution approximation.

2) Form artificial groups (like in Table 3.3) and compute ad-hoc versions of these tests.

The purpose of forming these groups is to have each group have a mean value larger than 5 or so, so that we avoid the previous problems. In order to form a Pearson statistic, one can fit the model as usual and compute Pearson residuals for groups of "alike" observations. For a group k containing nₖ observations, the group's residual is

( Σᵢ₌₁^nₖ yᵢ - Σᵢ₌₁^nₖ μ̂ᵢ ) / √( Σᵢ₌₁^nₖ μ̂ᵢ )

The sum of these squared Pearson residuals then forms a Pearson statistic. A LRT statistic can be found in the corresponding manner. There are two problems with this approach:
a) There are many different ways to form the groups, and one could choose a variety of different numbers of groups. Your answers could change due to these choices.
b) The usual type of distributional approximation is chi-square with g - 2 degrees of freedom, where g is the number of groups and two parameters are being estimated (α and β). The statistics, though, do not actually have this type of chi-square distribution, so formal hypothesis tests should not be done.
While this approach does have its problems, I like to use it as an informal way to assess the model overall, along with graphical approaches if possible. Table3.3.R shows a few different ways to evaluate the model in this manner. Using the Table 3.3 categories, below are the Pearson and LRT statistics along with a graphical assessment. The code for these results is available in the program.

[Figure: "Horseshoe crab data set with poisson regression model fit" - number of satellites vs. width (cm), showing the Table 3.3 observed group means and predicted group means (using my interpretation).]

Ad-hoc statistic   Value   Degrees of freedom   p-value
X²                  6.48           6              0.37
G²                  6.89           6              0.33

Below is a more general way (also in the program) to assess the fit of the model using different groups.

> ###############################################################
> #More general way to put observations into classes
> #Find 8 groups (9 quantiles); why 8? Since Agresti had chosen 8 - other choices could have been made
> cutoff<-quantile(crab$width, probs = 0:8/8, na.rm = F)
> cutoff
   0%  12.5%    25%  37.5%    50%  62.5%    75%  87.5%   100%
21.00  23.85  24.90  25.65  26.10  26.90  27.70  28.70  33.50
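Before the nested ifelse() used in the program below, note that the same midpoint assignment can be written more compactly with cut(); a hedged sketch, assuming the cutoff vector above (the names width.cut, midpoints, and groups2 are mine, and cut()'s right-closed intervals may place a few boundary widths differently than the strict "<" comparisons below):

#Assign each crab the midpoint of its width group (a compact
# alternative; boundary handling can differ from the ifelse() below)
width.cut<-cut(x = crab$width, breaks = cutoff, include.lowest = TRUE)
midpoints<-(cutoff[-1] + cutoff[-length(cutoff)])/2
groups2<-midpoints[as.numeric(width.cut)]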
> #Use the midpoint for the width group designation; note that I could have used the mean width among all crabs within the group as well - there is not one correct way to do this.
> groups<-ifelse(crab$width<cutoff[2], (cutoff[2]+cutoff[1])/2,
    ifelse(crab$width<cutoff[3], (cutoff[3]+cutoff[2])/2,
    ifelse(crab$width<cutoff[4], (cutoff[4]+cutoff[3])/2,
    ifelse(crab$width<cutoff[5], (cutoff[5]+cutoff[4])/2,
    ifelse(crab$width<cutoff[6], (cutoff[6]+cutoff[5])/2,
    ifelse(crab$width<cutoff[7], (cutoff[7]+cutoff[6])/2,
    ifelse(crab$width<cutoff[8], (cutoff[8]+cutoff[7])/2,
    (cutoff[9]+cutoff[8])/2)))))))
> library(nlme)  #Need package for the gsummary() function - no need to rerun if already done before
> crab.group<-data.frame(crab2, groups)
> sat.count<-gsummary(object = crab.group, FUN = length, groups = groups)
> sat.sum<-gsummary(object = crab.group, FUN = sum, groups = groups)
> new.table3.3<-data.frame(width.group = sat.count$groups, number.cases = sat.count$satellite, number.sat = sat.sum$satellite, mean.per.group = sat.sum$satellite/sat.count$satellite, fitted.count = round(sat.sum$predicted,1), Pearson.residual = round((sat.sum$satellite - sat.sum$predicted)/sqrt(sat.sum$predicted),2))
> new.table3.3
  width.group number.cases number.sat mean.per.group fitted.count Pearson.residual
1      22.425           22         20      0.9090909         35.6             -2.62
2      24.375           21         40      1.9047619         42.4             -0.36
3      25.275           22         60      2.7272727         50.5              1.34
4      25.875           20         68      3.4000000         50.9              2.40
5      26.500           23         47      2.0434783         64.4             -2.17
6      27.300           20         69      3.4500000         64.6              0.55
7      28.200           22        102      4.6363636         81.9              2.23
8      31.100           23         99      4.3043478        114.8             -1.48

> #Pearson statistic
> cat("Ad-hoc Pearson statistic:", round(sum(new.table3.3$Pearson.residual^2),2), "with 6 DF results in a p-value of", round(1-pchisq(sum(new.table3.3$Pearson.residual^2), 6),2), "using a chi-square distribution approximation \n")
Ad-hoc Pearson statistic: 26.72 with 6 DF results in a p-value of 0 using a chi-square distribution approximation
> #G^2
> G.sq2<-2*sum(new.table3.3$number.sat*log(new.table3.3$number.sat/new.table3.3$fitted.count))
> cat("Ad-hoc G^2 statistic:", round(G.sq2,2), "with 6 DF results in a p-value of", round(1-pchisq(G.sq2, 6),2), "using a chi-square distribution approximation \n")
Ad-hoc G^2 statistic: 27.29 with 6 DF results in a p-value of 0 using a chi-square distribution approximation

> #It is interesting that these two measures suggest the model does not fit well! I would hope
> # that goodness-of-fit conclusions would be invariant to the way one chooses to group the observations.
> # Possibly, this is an example of why ad-hoc procedures cannot always be trusted.

> #Visual assessment
> win.graph(width = 6, height = 6, pointsize = 10)
> plot(x = crab$width, y = crab$satellite, xlab = "Width (cm)", ylab = "Number of satellites", main = "Horseshoe crab data set \n with poisson regression model fit", panel.first = grid(col = "gray", lty = "dotted"))
> curve(expr = exp(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x), lty = 1, col = "red", add = TRUE)
> points(x = new.table3.3$width.group, y = new.table3.3$mean.per.group, pch = 18, col = "darkgreen", cex = 2)
> #Notice these points are not on the estimated model line; probably due to using the group average value for the x-axis instead of
> # a weighted mean like was done for the previous plot
> points(x = new.table3.3$width.group, y = new.table3.3$fitted.count/new.table3.3$number.cases, pch = 17, col = "darkblue", cex = 2)
> #Put group breaks on the plot
> for (i in (2:8)) {
    abline(v = cutoff[i], lty = 1, col = "lightgreen")
  }
> legend(locator(1), legend = c("Obs. group means", "Predicted group means (using my interpret)"), pch = c(18,17), col = c("darkgreen","darkblue"), cex = 0.75, bg = "white")

[Figure: "Horseshoe crab data set with poisson regression model fit" - number of satellites vs. width (cm) with the estimated model curve, observed group means (diamonds), predicted group means (triangles), and vertical lines at the group breaks.]

With the negative binomial regression model, note that G² = 195.81. Using a χ² approximation with 171 degrees of freedom, we obtain a p-value of 0.0939.
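For reference, this negative binomial p-value can be reproduced from the earlier fit; a minimal sketch, assuming mod.fit.nb was fit with glm.nb() from the MASS package (the fit itself is assumed from earlier in the notes):

library(MASS)  #glm.nb() is located here
#mod.fit.nb<-glm.nb(formula = satellite ~ width, data = crab)  #fit assumed from earlier
#Residual deviance goodness-of-fit test with the chi-square approximation
mod.fit.nb$deviance                                            #about 195.81
1 - pchisq(q = mod.fit.nb$deviance, df = mod.fit.nb$df.residual)  #about 0.0939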
n yi log[i (x)] i (x) i 1 where means proportional (the last term does not depend on the parameters). Since i(x) = exi , this implies 2010 Christopher R. Bilder Now only two parameters! 3.96 n log[ (, | y1,..., yn )] yi ( xi ) exi i 1 The Chapter 3 additional lecture notes gives additional general information about one common procedure, the Newton-Raphson method, and how it can be used to find the maximum likelihood estimates in an iterative manner. Pay special attention to how “convergence” is obtained. Note that glm() function actually uses the optim() function to do the maximization (equivalently, minimization of the negative log likelihood function). The optim() function has a few different iterative procedures (some do not need derivatives) that can be used to perform the maximization. 2010 Christopher R. Bilder