3. Generalized linear models
Use models to investigate the relationships (associations)
among categorical and continuous variables.
Reasons for using models (p. 65):
 Helps describe the pattern of association and
interaction
 Inferences for model parameters help determine which
explanatory variables affect the response while
controlling for other variables
 Estimate model parameters to determine the strength
and importance of effects
 Models can more easily handle complicated problems
A general class of models is "generalized linear models"
(GLMs). You have already studied a special case of GLMs,
linear models, in previous regression and ANOVA courses.
In addition to Agresti (2002), other references on GLMs
include:
McCullagh, P. and Nelder, J. A. (1989). Generalized
Linear Models. 2nd edition. London: Chapman and Hall.
McCulloch, C. and Searle, S. R. (2000). Generalized,
Linear, and Mixed Models. New York: Wiley.
 2010 Christopher R. Bilder
3.2
3.1 Components of a generalized linear model
Review of regression models
Yi = 0 + 1xi1 + 2xi2 + … + kxik + i
where i~independent N(0, 2).
Note that
E(Yi) = 0 + 1xi1 + 2xi2 + … + kxik
E(Yi) is what one would expect Yi to be on average for a
set of xi1, xi2, …, xik values.
One of the important things to realize here is that Y has a
normal distribution. What if this is not true? Suppose Y
is a nominal categorical variable. Suppose Y has a
Poisson distribution. There are many other possibilities.
GLMs allow us to generalize the model structure!
Three components of a GLM
 Random
For a sample of size n, denote the observations of the
response variable Y as Y1, Y2, …, Yn. Assume Y1, Y2, …,
Yn are obtained independently here. We will specifically
be interested in E(Y) = μ.
 2010 Christopher R. Bilder
3.3
The distribution chosen for Y defines the “random”
component of a GLM.
For example, suppose the Y1, Y2, …, Yn are responses
from a Bernoulli random variable Y. Thus, the Y1, Y2, …,
Yn are all 0 or 1.
Suppose Y1, Y2, …, Yn are responses from a Binomial
random variable Y. Thus, the Y1, Y2, …, Yn are all
nonnegative integers and denote the number of successes
out of a certain number of trials.
Suppose Y1, Y2, …, Yn are responses from a Poisson
random variable Y. The Y1, Y2, …, Yn are all non-negative
integers and could denote cell counts in a contingency
table.
In regression and ANOVA, Y1, Y2, …, Yn are responses
from a normal random variable Y.
 Systematic
This component specifies the explanatory variables:
α + β1x1 + β2x2 + … + βkxk
Notice that this is a “linear” combination of the explanatory
variables. This is often called the “linear predictor”. Note
 2010 Christopher R. Bilder
3.4
that the x’s above could be a transformation of an original
explanatory variable(s) – such as a quadratic or
interactions.
 Link
This component “links” the random and systematic
component. In other words, this shows how the mean of
the distribution for Y is related to the explanatory variables.
Let g() be a function of the E(Y)=. This is the link
function. Specifically, the GLM is
g() =  + 1x1 + 2x2 + … + kxk
Link functions:
 Identity – g() = 
 = E(Y) =  + 1x1 + 2x2 + … + kxk
This is used for regression and ANOVA models!
 Log – g() = log()
log() =  + 1x1 + 2x2 + … + kxk
  = exp( + 1x1 + 2x2 + … + kxk)
 2010 Christopher R. Bilder
3.5
The log link is used for “loglinear” models in Chapter 7.
Most often, Y is assumed to have a Poisson distribution.
Notice that all values of μ will be positive, which is why
this link is used when modeling counts in a contingency
table!
 Logit – g(μ) = log[μ/(1 – μ)] = logit(μ)
log[μ/(1 – μ)] = α + β1x1 + β2x2 + … + βkxk
⇒ μ = exp(α + β1x1 + β2x2 + … + βkxk) / [1 + exp(α + β1x1 + β2x2 + … + βkxk)]
The logit link is used for "logit" and logistic regression
models in Chapters 4-5. Notice that all values of μ will
be between 0 and 1 (try a few sample cases to see this;
a small sketch is given after this list). This is why the
link is used when modeling probabilities! Remember that
the mean of a Bernoulli random variable is π.
 Other links are possible such as the probit and
complementary log-log. These will be discussed later.
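As a quick check of these claims, here is a minimal R sketch (the values of the linear predictor below are arbitrary sample cases, not from the text) showing that the log link keeps μ positive and the logit link keeps μ between 0 and 1:

#Sketch: a few sample cases of the linear predictor eta = alpha + beta1*x1 + ...
eta<-c(-10, -2, 0, 2, 10)
exp(eta)                 #mu for the log link: always positive
exp(eta)/(1 + exp(eta))  #mu for the logit link: always between 0 and 1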
Read Section 3.1.4 about the normal GLM on p.67-8.
 2010 Christopher R. Bilder
3.6
3.2 Generalized linear models for binary data
Binary data means observations obtained from a random
variable with only two possible values. Typically, these
two possible values are called a “success” and a
“failure”.
From Chapter 1:
Bernoulli distribution: P(Y = y) = π^y (1 – π)^(1–y) for y = 0 or 1
This is a special case of the binomial with n = 1. The
expected value of Y is E(Y) = π and the variance of
Y is Var(Y) = π(1 – π).
The goal in this section is to find a GLM to model π at specific
values of explanatory variables (x's).
For example, suppose you want to estimate the
probability of success, π, of a field goal. The value of π
will probably be different for a 20 yard field goal than for
a 50 yard field goal. Thus, it would be of interest to
incorporate the length of a field goal in a model for π.
Notation: Agresti (2007) uses π(x) to denote π here. The
reason is that explanatory variables (x's) will be
used to try to predict the value of π. Thus, π "depends"
on the level of the explanatory variables.
 2010 Christopher R. Bilder
3.7
To simplify the upcoming discussion, only one explanatory
variable, x, will be used to model the probability of success,
π(x).
Linear probability model
Suppose an ordinary regression model was used to
model the probability of success. Thus,
E(Y) = (x) =  + x
with Y~N(0,2). This is called a linear probability model
because the probability of success changes in a linear
manner.
Problems with model:
o Violates the distributional assumptions for Y. Y is
Bernoulli, not normal.
o Probabilities can be less than 0 or greater than 1!
o Non constant variance – Var(Y) = (x)(1-(x));
variance changes as a function of x
Therefore, do not use this model!
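A minimal sketch of the second problem, using simulated binary data (the data and parameter values below are made up purely for illustration): fitting the linear probability model by least squares can produce fitted "probabilities" outside of [0, 1].

set.seed(4131)
x<-runif(n = 100, min = 0, max = 10)                 #made-up explanatory variable
y<-rbinom(n = 100, size = 1, prob = plogis(-5 + x))  #made-up Bernoulli responses
mod.lpm<-lm(formula = y ~ x)       #linear probability model fit by least squares
range(mod.lpm$fitted.values)       #typically extends below 0 and above 1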
Logistic regression model
A great introductory reference on logistic regression is
 2010 Christopher R. Bilder
3.8
Hosmer, D. W. and Lemeshow, S. (2000). Applied
Logistic Regression, 2nd edition. New York: Wiley.
Many STAT 870 books also will include a chapter on
logistic regression. For example, see Chapter 14 of
Kutner, Nachtsheim, and Neter (2004).
The model is
log[π(x)/(1 – π(x))] = logit[π(x)] = α + βx
The random component is Bernoulli. The logit
transformation is the link function. The model can be
equivalently written as:
π(x) = e^(α+βx) / [1 + e^(α+βx)]
What does a plot of π(x) vs. x look like?
Example: Plot of π(x) vs. x (pi_plot.R)
When there is only one explanatory variable, α = 1, and
β = 0.5, a plot of π(x) vs. x looks like the following:
 2010 Christopher R. Bilder
3.9
e
x
1 e
x
0
5
0.0
0.2
0.4
x
0.6
0.8
1.0
x
-15
-10
-5
10
15
x
When =1 and =-0.5, the plot of (x) vs. x looks like the
following:
[Figure: π(x) = e^(α+βx) / (1 + e^(α+βx)) vs. x, for x from -15 to 15, now decreasing in x]
R code:
alpha<-1
beta1<-0.5
par(pty="s")
curve(expr = exp(alpha+beta1*x)/(1+exp(alpha+beta1*x)), from = -15, to = 15,
    col = "red", main = expression(pi(x) == frac(e^{alpha+beta*x},
    1+e^{alpha+beta*x})), xlab = "x", ylab = expression(pi(x)),
    panel.first = grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted"))
#See help(plotmath) for more on the expression function and see demo(plotmath)
Notes:
 When >0, there is a positive relationship between x
and (x). When <0, there is a negative relationship
between x and (x).
 The shape of the function is similar to an “s”.
 Notice the symmetric shape about (x) = 0.5
 0<(x)<1
 Questions:
 What happens to the =0.5 plot when  is
increased?
 What happens to the =0.5 plot when  is
decreased to be close to 0?
 Suppose a plot of logit[(x)] vs. x was made. What
would the plot look like?
Parameter estimation
Suppose there is a random sample of size n providing
(y1, x1), (y2, x2), …, (yn, xn) where the yi’s are 0’s or 1’s.
The probability of observing a 1 for yi is denoted by πi(x).
 2010 Christopher R. Bilder
3.11
The logistic regression model is
log[πi(x)/(1 – πi(x))] = α + βxi for i = 1, …, n
This is the assumed relationship between the xi and
πi(x). The model can be rewritten as
πi(x) = exp(α + βxi) / [1 + exp(α + βxi)].
Parameter estimates can be found from maximum
likelihood estimation – see Chapter 1’s discussion.
The likelihood function is
L[π1(x), …, πn(x) | y1, …, yn] = ∏(i=1 to n) f(yi)
   = ∏(i=1 to n) πi(x)^yi [1 – πi(x)]^(1–yi)
which involves n different parameters. Then the log likelihood function is
log L[π1(x), …, πn(x) | y1, …, yn] = Σ(i=1 to n) { yi log[πi(x)] + (1 – yi) log[1 – πi(x)] }
Since πi(x) = exp(α + βxi) / [1 + exp(α + βxi)], this implies
log L(α, β | y1, …, yn)
   = Σ(i=1 to n) { yi log[ e^(α+βxi)/(1 + e^(α+βxi)) ] + (1 – yi) log[ 1 – e^(α+βxi)/(1 + e^(α+βxi)) ] }
     (now only two parameters!)
   = Σ(i=1 to n) { yi(α + βxi) – yi log(1 + e^(α+βxi)) – (1 – yi) log(1 + e^(α+βxi)) }
   = Σ(i=1 to n) { yi(α + βxi) – log(1 + e^(α+βxi)) }
The maximum likelihood estimates of  and  are the
values which maximize the above quantity. Since these
estimates can only be found using numerical methods
(except in special cases), parameter estimates are found
by many software packages by using iteratively
reweighted least squares to yield the maximum
likelihood estimates. See p. 88 of Agresti (2007) and p.
143-149 of Agresti (2002) for more information. The R
function glm() finds the parameter estimates (through its
glm.fit() routine, which carries out iteratively reweighted
least squares rather than a general optimizer).
By using the model, the complexity of estimating π has
been reduced from estimating n different parameters
(one for each i = 1, …, n) to only 2 – α and β!
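To see the connection between the log likelihood above and the estimates that glm() reports, here is a minimal sketch that maximizes the last expression directly with optim(). The data are simulated here just for illustration (they are not the placekick data used in the next example).

#Simulate some (x, y) pairs from a logistic regression model
set.seed(8871)
x<-runif(n = 200, min = 20, max = 60)
y<-rbinom(n = 200, size = 1, prob = plogis(5 - 0.1*x))

#Log likelihood: sum of y*(alpha + beta*x) - log(1 + exp(alpha + beta*x))
logL<-function(par, x, y) {
  lin.pred<-par[1] + par[2]*x
  sum(y*lin.pred - log(1 + exp(lin.pred)))
}

#optim() minimizes by default, so fnscale = -1 makes it maximize
save.opt<-optim(par = c(0, 0), fn = logL, x = x, y = y, control = list(fnscale = -1))
save.opt$par                                       #MLEs of alpha and beta
glm(formula = y ~ x, family = binomial(link = logit))$coefficients #essentially the same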
Example: Placekicking (placekick_ch3.R, place.s.csv)
See Bilder and Loughin (Chance, 1998) and the video!
 2010 Christopher R. Bilder
3.13
The purpose of this example is to estimate the
probability of success for a placekick in football. The
place.s.csv data file contains a sample of 1,425
placekicks attempted during the 1995 National Football
League season. Below is a brief description of the
variables in the data set:
 week = Week of the season
 dist = distance of the placekick in yards
 change = binary variable denoting lead-change
placekicks (1) vs. non lead-change (0) placekicks
 elap30 = continuous variable denoting the number of
minutes left in a half with overtime placekicks
assigned a value of 0
 pat1 = binary variable for whether the placekick is a
point after touchdown (1) or a field goal (0)
 type1 = binary variable for placekicks in a dome (0)
or outdoors (1)
 field1 = binary variable for placekicks on grass (1)
or artificial turf (0)
 good1 = binary variable for placekicks which are
successes (1) or failures (0)
 wind = binary variable for placekicks attempted in
“windy” conditions (1) at kickoff versus non-windy
conditions (0) using a 15 mph cutoff for non-windy
The data was actually first stored in an Excel file. While
one can use the xlsReadWrite or RODBC packages to
read in an Excel file (see R introduction lecture), I used a
 2010 Christopher R. Bilder
3.14
different method to read in the file. First, I re-saved the
Excel file as a .csv format. To do this, select FILE >
SAVE AS in Excel. Then select the .csv format in the
SAVE AS TYPE box. Choose a file name and then
select SAVE.
This creates an ASCII text file which has commas
separating each variable.
 2010 Christopher R. Bilder
3.15
In order to get the data into R, I used the read.table()
function as shown below:
> place.s<-read.table(file = "C:\\chris\\UNL\\STAT875\\
chapter3_new\\place.s.csv", header = TRUE, sep = ",")
> head(place.s)
  week dist change  elap30 pat1 type1 field1 good1 wind
1    1   21      1 24.7167    0     1      1     1    0
2    1   21      0 15.8500    0     1      1     1    0
3    1   20      0  0.4500    1     1      1     1    0
4    1   28      0 13.5500    0     1      1     1    0
5    1   20      0 21.8667    1     0      0     1    0
6    1   25      0 17.6833    0     0      0     1    0
For now, only distance (dist) is going to be used to
predict the probability of a successful placekick. The
good1 variable contains the Bernoulli observations
denoting the success or failure of a placekick.
The logistic regression model of interest is
 (x) 
= logit[(x)] =  + x =  + (Distance)
log 

 1  (x) 
where x=distance of the placekick, (x) = E(Y), Y=1
for success or 0 for failure.
This particular GLM is used since the response variable
(good1) is binary. To find the estimated model in R, the
glm() function is used. Below is the code.
 2010 Christopher R. Bilder
3.16
> mod.fit <- glm(formula = good1 ~ dist, data = place.s,
family = binomial(link = logit), na.action = na.exclude,
control = list(epsilon = 0.0001, maxit = 50, trace = T))
Deviance = 836.7715 Iterations - 1
Deviance = 781.1072 Iterations - 2
Deviance = 775.8357 Iterations - 3
Deviance = 775.7451 Iterations - 4
Deviance = 775.745 Iterations - 5
> names(mod.fit)
"coefficients"      "effects"           "qr"                "deviance"
"iter"              "df.residual"       "converged"         "call"
"data"              "method"            "residuals"         "R"
"family"            "aic"               "weights"           "df.null"
"boundary"          "formula"           "offset"            "contrasts"
"fitted.values"     "rank"              "linear.predictors" "null.deviance"
"prior.weights"     "y"                 "model"             "terms"
"control"           "xlevels"
> mod.fit$coefficients
(Intercept)        dist
   5.812045  -0.1150259
> mod.fit

Call:  glm(formula = good1 ~ dist, family = binomial(link = logit),
    data = place.s, na.action = na.exclude, control = list(epsilon = 1e-04,
    maxit = 50, trace = T))

Coefficients:
(Intercept)         dist
     5.8121      -0.1150

Degrees of Freedom: 1424 Total (i.e. Null);  1423 Residual
Null Deviance:     1013
Residual Deviance: 775.7        AIC: 779.7
> summary(mod.fit)

Call:
glm(formula = good1 ~ dist, family = binomial(link = logit),
    data = place.s, na.action = na.exclude, control = list(epsilon = 1e-04,
    maxit = 50, trace = T))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7441   0.2425   0.2425   0.3801   1.6091

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.812079   0.326158   17.82   <2e-16 ***
dist        -0.115027   0.008337  -13.80   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1013.43  on 1424  degrees of freedom
Residual deviance:  775.75  on 1423  degrees of freedom
AIC: 779.75

Number of Fisher Scoring iterations: 5
There are many different things that need to be discussed
about the code and output. Only a few of them will be
discussed here. More will be discussed later in this
chapter and in Chapter 5.
 Notice the syntax used with the glm() function.
 The names(mod.fit) shows the different components of
the mod.fit object.
 The estimated logistic regression model is
logit[π̂(x)] = α̂ + β̂x = 5.8121 – 0.1150x
 What happens to the probability of success as the
distance increases?
 The estimated probability of success for a particular
distance can be found from
 2010 Christopher R. Bilder
3.18
exp(ˆ  ˆ x)
exp(5.8121  0.1150x)
ˆ (x) 

1  exp(ˆ  ˆ x) 1  exp(5.8121  0.1150x)
For example, the estimated probability of success for
a 20 yard placekick is
ˆ (x  20) 
exp[5.8121  0.1150(20)]
 0.9710
1  exp[5.8121  0.1150(20)]
The estimated probability of success for a 50 yard
placekick is
ˆ (x  50) 
exp[5.8121  0.1150(50)]
 0.5152
1  exp[5.8121  0.1150(50)]
> #Estimated probability of success for a 20 yard field goal
> lin.pred<-mod.fit$coefficients[1]+mod.fit$coefficients[2]*20
> exp(lin.pred)/(1 + exp(lin.pred))
0.971014
> #Estimated probability of success for a 50 yard field goal
> lin.pred<-mod.fit$coefficients[1]+mod.fit$coefficients[2]*50
> exp(lin.pred)/(1 + exp(lin.pred))
0.5151829
 The z value in the output is a z test statistic which
gives a test for whether the corresponding parameter
is 0 or not. This test statistic can be compared to a
standard normal distribution. Is distance important to
predicting the probability of success for a placekick?
 A simple plot of the “fitted values” versus distance is:
 2010 Christopher R. Bilder
3.19
0.6
0.4
0.2
Estimated probability
0.8
1.0
Estimated probability of success of a placekick
20
30
40
50
60
Distance (yards)
#Simple plot
plot(x = place.s$dist, y = mod.fit$fitted.values, xlab = "Distance (yards)",
    ylab = "Estimated probability",
    main = "Estimated probability of success of a placekick")
Note that this plot would not be appropriate to hand in
for a project. Much better plots will be shown soon.
 There are often many observations for the same
distance. For example, there are 20 placekicks from
21 yards and 19 of them are successful. This
information can be found from using the table() or
xtabs() functions.
> #Summary of the placekicks by distance
> dist.good <- table(place.s$dist, place.s$good1)
> dist.good
 2010 Christopher R. Bilder
3.20
integer matrix: 43 rows, 2 columns.
0
1
18 1
2
19 0
7
20 13 776
21 1 19
22 2 12
23 1 26
24 0
7
25 1 12
EDITED
55
56
59
62
63
66
1
0
0
1
1
1
2
1
1
0
0
0
Another way to put the data into this format is to use
the gsummary() function.
> library(nlme)
> place.small<-data.frame(good = place.s$good1, dist =
place.s$dist)
> place.sum<-gsummary(object = place.small, FUN = sum,
groups = place.small$dist)
> place.length<-gsummary(object = place.small, FUN =
length, groups = place.small$dist)
> prop<-place.sum$good/place.length$good
> place.pattern<-data.frame(sum.y = place.sum$good, n =
place.length$good, prop = prop,
distance=place.sum$dist)
> head(place.pattern)
  sum.y   n      prop distance
1     2   3 0.6666667       18
2     7   7 1.0000000       19
3   776 789 0.9835234       20
4    19  20 0.9500000       21
5    12  14 0.8571429       22
6    26  27 0.9629630       23
Below is a plot of the estimated probability of success
using the estimated logistic regression model. The
observed proportions of successes are the plotting
points. For example, the plotting point at 21 yards is 19/20 = 0.95.
This type of plot can be used as a measure of how well
the model fits the data.
What do you think about the fit of the model?
[Figure: "Estimated probability of success of a placekick with observed proportions" – observed proportions and the fitted logistic regression curve vs. distance (yards)]
> #Find plot of the observed proportions
> plot(x = place.pattern$distance, y = place.pattern$prop, xlab = "Distance (yards)",
    ylab = "Estimated probability", main = "Estimated probability of success
    of a placekick \n with observed proportions",
    panel.first = grid(col = "gray", lty = "dotted"))
> curve(expr = exp(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x) /
    (1 + exp(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x)),
    col = "red", add = TRUE)
#Quicker way to do curve() here - will learn about later
#curve(plogis(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x), col = "red", add = TRUE)
 You may think the model fits poorly at the larger
distances. This is not necessarily true! The binary
nature of the data can distort the perceived fit. At most
of the larger distances, there are very few placekicks.
For example, there was only one 59 yard placekick
attempted and it was a success. Thus, the proportion
of successful placekicks at this distance is 1/1 =1.
To help make a judgment about the fit of the model, I
created the bubble plot below. A bubble plot is a
scatter plot with the plotting point proportional to
another variable. The other variable in this case is the
number of placekicks at each distance. Notice how
the extreme proportions are the placekicks at
distances without many observations.
 2010 Christopher R. Bilder
3.23
The circles = __ option provides the third variable
displayed in the plot as the size of the plotting point.
[Figure: bubble plot – "Estimated probability of success of a placekick with observed proportions", with plotting points proportional to the number of placekicks at each distance (yards)]
#Plots the plotting points
symbols(x = place.pattern$distance, y = place.pattern$prop,
    circles = sqrt(place.pattern$n), inches = 1, xlab = "Distance (yards)",
    ylab = "Estimated probability", xlim = c(10,65), ylim = c(0, 1.5),
    main = "Estimated probability of success of a placekick \n with observed
    proportions", panel.first = grid(col = "gray", lty = "dotted"))
#Puts the estimated logistic regression model on the plot
curve(expr = exp(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x) /
    (1 + exp(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x)),
    col = "red", add = TRUE)
Questions:
 Which placekicks does the largest bubble represent?
 Suppose the plot looked like this (this plot was edited in
PowerPoint; note the different scale):
[Figure: edited version of the bubble plot – "Estimated probability of success of a placekick with observed proportions", estimated probability vs. distance (yards)]
What do you think about the fit of the model?
Note:
The inches = __ option in the symbols() function controls
the size of the largest circle. The default is 1" in height.
You may need to change this to help make the plot more
informative for a particular problem. Also, I used the
sqrt() function here with the circles = ___ option since
the disparity between the largest place.pattern$n value
and the others is so large. Other functions could have
been used as well. Examine what the plot looks like on
your own without the sqrt() function to see how much it
helped.
Alternative binary links
Many other link functions could be used to model binary
data. These link functions use the "cumulative
distribution function” or CDF. Below is a formal definition.
Let X be a continuous random variable with probability
density function f(x). An observed value of X is
denoted by x. The cumulative distribution function of
X is F(x) = P(X ≤ x) = ∫(from -∞ to x) f(u)du. Note that u is
substituted into the probability density function to
avoid confusion with the upper limit of integration. If X
is a discrete random variable, the cumulative
distribution function of X is F(x) = P(X ≤ x) = Σ f(x) =
Σ P(X = x), where the sum is over all values of X ≤ x.
An informal definition is the cumulative distribution function
“cumulates” probabilities as a function of x. See the
Chapter 3 additional notes for examples of a CDF
involving the binomial distribution and the uniform
distribution.
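As a small illustration (the binomial and uniform settings below are arbitrary choices, not necessarily the ones in the additional notes), the p-functions in R return CDF values, which accumulate toward 1:

pbinom(q = 0:5, size = 5, prob = 0.4)   #P(X <= x) for X ~ Binomial(n = 5, pi = 0.4)
punif(q = c(-1, 0, 0.25, 0.75, 1, 2))   #P(X <= x) for X ~ Uniform(0, 1)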
 2010 Christopher R. Bilder
3.26
The reason why CDFs are used as link functions for binary
data is because the CDF is always between 0 and 1.
Example: Logistic distribution (logistic_distribution.R)
Let X have a logistic probability distribution. The
probability density function for X can be represented
by
f(x) = (1/σ) e^(-(x-μ)/σ) / [1 + e^(-(x-μ)/σ)]²
for -∞ < x < ∞ and parameters -∞ < μ < ∞ and σ > 0. Note that
E(X) = μ and Var(X) = σ²π²/3 > σ².
Below is a plot of the distribution for μ = -2 and σ = 2.
[Figure: "Logistic PDF with μ = -2 and σ = 2" – f(x) vs. x, for x from -15 to 15]
mu<--2
sigma<-2
curve(expr = 1/sigma * exp(-(x-mu)/sigma) / (1+exp(-(x-mu)/sigma))^2, ylab = "f(x)",
    xlab = "x", from = -15, to = 15,
    main = expression(paste("Logistic PDF with ", mu==-2, " and ", sigma==2)),
    col = "red")
#Note that expr = dlogis(x, location=mu, scale=sigma) could also be used
abline(h = 0)
The cumulative distribution function can be found by
finding P(X ≤ x):
F(x) = ∫(from -∞ to x) f(u)du
     = ∫(from -∞ to x) (1/σ) e^(-(u-μ)/σ) / [1 + e^(-(u-μ)/σ)]² du
     = 1 / [1 + e^(-(x-μ)/σ)]
Below is a plot of the CDF for μ = -2 and σ = 2.
[Figure: "Logistic CDF with μ = -2 and σ = 2" – F(x) vs. x, for x from -15 to 15]
curve(expr = 1/(1+exp(-(x-mu)/sigma)), ylab = "F(x)", xlab = "x", from = -15,
    to = 15, lwd = 2, main = expression(paste("Logistic CDF with ", mu==-2,
    " and ", sigma==2)), col = "red",
    panel.first = grid(col = "gray", lty = "dotted"))
#Note that expr = plogis(x, location=mu, scale=sigma) could also be used
Does this plot look familiar? See p. 3.8. This is the
same function being plotted! Note that
F(x) = 1 / [1 + e^(-(x-μ)/σ)] = 1 / [1 + e^(-[x-(-2)]/2)] = 1 / [1 + e^(-(1 + x/2))] = 1 / [1 + e^(-(α+βx))]
where α = 1 and β = 1/2. Then
F(x) = 1 / [1 + e^(-(α+βx))] = e^(α+βx) / [1 + e^(α+βx)]
Also notice that log[F(x)/(1 - F(x))] = α + βx. Therefore,
the logistic cumulative distribution function is used for
"logistic" regression! (Note: one could say the link function
is the inverse CDF, F⁻¹(π) = log[π/(1 - π)].)
Example: Normal probability distribution
Let X have a normal probability distribution. The
probability density function for X can be represented by
f(x) = [1/(σ√(2π))] e^(-(x-μ)²/(2σ²)) for -∞ < x < ∞, -∞ < μ < ∞, and σ > 0
The cumulative distribution function can be found by
finding P(X ≤ x):
F(x) = ∫(from -∞ to x) [1/(σ√(2π))] e^(-(u-μ)²/(2σ²)) du
Suppose μ = 0 and σ² = 1. Then F(1.645) = 0.95, F(1.96) =
0.975, and F(2.576) = 0.995. Many textbooks will use
Φ() to denote the CDF of a standard normal distribution.
Thus, Φ(1.645) = 0.95.
In more familiar notation, Z1-α = Z(1-α) = Z(0.95) = 1.645
where α = 0.05. 1-α represents the area to the left of
1.645 (for this example) under the probability distribution
function. Note that other books may use Zα where α is
the area in the "right" tail of the probability distribution
function.
CDFs are nice to use for link functions with binary data since
the CDF is always between 0 and 1. Two other commonly
used link functions based on CDFs are:
 2010 Christopher R. Bilder
3.30
 Probit – based on the CDF of the standard normal
distribution; the name comes from probit being a
shortened version of “probability unit” (Hubert, 1992).
Random component: Y ~ Bernoulli(π)
Systematic component: α + βx
Link function: probit transformation
π(x) = Φ(α + βx)
where Φ( ) is the CDF of a standard normal
distribution.
Then Φ⁻¹[π(x)] = α + βx
Φ⁻¹[ ] is often called the "probit" transformation and
denoted by probit( ). In general, this is often referred
to as the inverse of the standard normal CDF. Thus,
probit[π(x)] = α + βx
(Note: similar to "logit")
What does Φ⁻¹[ ] or probit[ ] represent? Here are a
few examples: Φ⁻¹[0.95] = probit(0.95) = 1.645,
Φ⁻¹[0.975] = 1.96, and Φ⁻¹[0.995] = 2.576.
Compare the probit transformation to the logit
transformation. Remember the main purpose is to get a
 2010 Christopher R. Bilder
3.31
value of the function between 0 and 1 in order to model the
probability of success.
 Complementary log-log – based on 1 – CDF of the Gumbel
(extreme value) distribution
Random component: Y ~ Bernoulli(π)
Systematic component: α + βx
Link function: complementary log-log transformation
The CDF of a Gumbel distribution is F(x) =
exp{-exp[-(x - μ)/σ]} for parameters -∞ < μ < ∞ and
σ > 0. Notice that 1 - F(x) is still between 0 and 1. Also,
note that E(X) = μ + γσ where γ ≈ 0.577216 (Euler's
constant) and Var(X) = σ²π²/6.
Let β = -1/σ and α = μ/σ. Through the use of some
algebra, the 1 - CDF becomes 1 - F(x) =
1 - exp[-exp(α + βx)]. Thus,
π(x) = 1 - exp[-exp(α + βx)]
Solving for the systematic component produces:
log{-log[1 - π(x)]} = α + βx
 2010 Christopher R. Bilder
3.32
The “complementary” part of the name comes from 1F(X) instead of F(X) being used.
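A minimal R sketch of these two transformations (the probability values are just the examples used above):

qnorm(p = c(0.95, 0.975, 0.995))   #probit(0.95), probit(0.975), probit(0.995)
pnorm(q = 1.645)                   #back the other way: Phi(1.645) = 0.95 (approx.)
log(-log(1 - 0.95))                #complementary log-log transformation of pi = 0.95
make.link("cloglog")$linkfun(0.95) #the same value from R's stored link function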
Example: Compare the logistic, probit, and complementary
log-log GLMs (pi_plot.R)
[Figure: π(x) vs. x for α = 1 and β = 0.5 – logit, probit, and complementary log-log links]
[Figure: π(x) vs. x for α = 1 and β = -0.5 – logit, probit, and complementary log-log links]
The R code used to create the data for the plots is
below.
alpha<-1
beta<-0.5
par(pty="s")
curve(expr = plogis(alpha+beta*x), from = -15, to = 15, col = "red", lwd = 2,
    lty = 1, main = expression(paste(pi(x), " vs. x for ", alpha, " = 1 and ",
    beta, " = 0.5")), xlab = "x", ylab = expression(pi(x)),
    panel.first = grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted"))
curve(expr = pnorm(alpha+beta*x, mean = 0, sd = 1), from = -15, to = 15,
    col = "blue", add = TRUE, lty = 2, lwd = 2)
curve(expr = 1-exp(-exp(alpha+beta*x)), from = -15, to = 15, col = "green",
    add = TRUE, lty = 4, lwd = 2)
legend(locator(1), legend = c("Logit", "Probit", "Cloglog"), lty = c(1,2,4),
    lwd = c(2,2,2), col = c("red", "blue", "green"), bty = "n")
#There is a pgumbel(q, loc=0, scale=1, lower.tail = TRUE) function in the evd
#and VGAM packages
Notes:
 The logistic model corresponds to the model plotted on
p. 4.9.
 The logistic and probit intersect at π(x) = 0.5.
 Notice the logistic and probit curves are both
symmetric. This means that the curve when π(x) < 0.5
is the mirror image of the curve for π(x) > 0.5. The
complementary log-log curve does not have this
property.
 When you fit these models to a data set, you should
not expect all of the α̂ and β̂'s to be the same. Thus,
these plots are a little misleading in some respect.
Which model should you use???
This is not an easy question to answer.
 The logit link provides a convenient way to interpret the
model through the use of odds and odds ratios. Notice
the logit transformation is a log of an odds! Because of
this aspect, the logit link will often be used over the other
two. Chapter 5 focuses on the logit link.
 The logit and probit links provide models that are often
not too different. See the upcoming examples.
 I have not seen the complementary log-log link used
often; however, this does not mean it is not used in
practice.
 2010 Christopher R. Bilder
3.35
 One way to decide between the three link functions is to
use all three and see which one gives the best “fit”. This
means which graphically fits the data the best (like on p.
3.42), which has the smallest residuals in absolute
value, and which satisfies goodness-of-fit statistics the
best.
 Goodness-of-link function tests can be used to help
determine which link function to use. These tests
usually incorporate the link functions under one family of
functions. For example, Aranda-Ordaz (1981) has
incorporated the probit and logit transformations under
one family of transformations. He gives a hypothesis
test to help choose between them. A small discussion of
these tests is available on p. 301 and p. 257-8 in Agresti
(2002).
Example: Placekicking (placekick_ch3.R, place.s.csv)
Probit model:
> mod.fit.probit<-glm(formula = good1 ~ dist, data =
place.s, family = binomial(link = probit), na.action =
na.exclude, control = list(epsilon = 0.0001, maxit = 50,
trace = T))
Deviance = 825.0748 Iterations - 1
Deviance = 776.0735 Iterations - 2
Deviance = 772.0135 Iterations - 3
Deviance = 771.9512 Iterations - 4
> summary(mod.fit.probit)

Call:
glm(formula = good1 ~ dist, family = binomial(link = probit),
    data = place.s, na.action = na.exclude, control = list(epsilon = 1e-04,
    maxit = 50, trace = T))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8166   0.2275   0.2275   0.3914   1.5316

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.205985   0.155195   20.66   <2e-16 ***
dist        -0.062768   0.004284  -14.65   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1013.43  on 1424  degrees of freedom
Residual deviance:  771.95  on 1423  degrees of freedom
AIC: 775.95
> #Estimated probability of success for a 20 yard field goal
> lin.pred<-mod.fit.probit$coefficients[1] + mod.fit.probit$coefficients[2]*20
> pnorm(q = lin.pred, mean = 0, sd = 1)
0.9744488
> #Estimated probability of success for a 50 yard field goal
> lin.pred<-mod.fit.probit$coefficients[1] + mod.fit.probit$coefficients[2]*50
> pnorm(q = lin.pred, mean = 0, sd = 1)
0.526936
Complementary log-log model:
> mod.fit.cloglog <-glm(formula = good1 ~ dist, data =
place.s, family = binomial(link = cloglog),
na.action = na.exclude, control = list(epsilon =
0.0001, maxit = 50, trace = T))
 2010 Christopher R. Bilder
3.37
Deviance
Deviance
Deviance
Deviance
=
=
=
=
836.9174
771.2283
769.4893
769.4776
Iterations
Iterations
Iterations
Iterations
-
1
2
3
4
> summary(mod.fit.cloglog)

Call:
glm(formula = good1 ~ dist, family = binomial(link = cloglog),
    data = place.s, na.action = na.exclude, control = list(epsilon = 1e-04,
    maxit = 50, trace = T))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.9052   0.2126   0.2126   0.4132   1.3705

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.379921   0.117955   20.18   <2e-16 ***
dist        -0.052226   0.003702  -14.11   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1013.43  on 1424  degrees of freedom
Residual deviance:  769.48  on 1423  degrees of freedom
AIC: 773.48

Number of Fisher Scoring iterations: 4
> #Estimated probability of success for a 20 yard field goal
> lin.pred<-mod.fit.cloglog$coefficients[1] + mod.fit.cloglog$coefficients[2]*20
> 1-exp(-exp(lin.pred))
0.977664
> #Estimated probability of success for a 50 yard field goal
> lin.pred<-mod.fit.cloglog$coefficients[1] + mod.fit.cloglog$coefficients[2]*50
> 1-exp(-exp(lin.pred))
0.5477212
Notes:
 Models:
                 Estimated model
Logistic         logit[π̂(x)] = 5.8121 – 0.1150x
Probit           probit[π̂(x)] = 3.2060 – 0.0628x
Comp. log-log    log{-log[1 - π̂(x)]} = 2.3799 – 0.0522x

                 Estimated model
Logistic         π̂(x) = exp(5.8121 – 0.1150x) / [1 + exp(5.8121 – 0.1150x)]
Probit           π̂(x) = Φ(3.2060 – 0.0628x)
Comp. log-log    π̂(x) = 1 – exp[-exp(2.3799 – 0.0522x)]

 Estimated probabilities:
Suppose you want to predict the estimated probability
of success for a distance of 20 yards. For the probit
model,
π̂(x = 20) = Φ(3.2060 – 0.0628·20) = Φ(1.95) = 0.9744
For the complementary log-log model:
π̂(x = 20) = 1 – exp[-exp(2.3799 – 0.05222·20)] = 0.9777
To summarize,
                 Distance   π̂(x)
Logistic              20    0.9710
Probit                20    0.9744
Comp. log-log         20    0.9777

                 Distance   π̂(x)
Logistic              50    0.5152
Probit                50    0.5269
Comp. log-log         50    0.5477
 An easier way to find the estimated probabilities is to
use the predict() function. Suppose the
complementary log-log model is fit and the model fit
information is stored in the mod.fit.cloglog
object. Then the predict() function can be used in the
following way to predict the probability of success at
x = 20:
> predict.data<-data.frame(dist=20)
> predict(object = mod.fit.cloglog, newdata =
predict.data, type = "response")
[1] 0.977664
The type = "response" option is used to tell R that you
want to predict π. If you want to predict the linear
predictor, use the type = "link" option.
> #Predict the linear predictor
> predict(object = mod.fit.cloglog, newdata = predict.data, type = "link")
[1] 1.335410
To predict for more than one distance, create a data
set with extra rows:
> #Predict for 20 and 50 yards
> predict.data<-data.frame(dist = c(20, 50))
> save.pi.hat<-predict(object = mod.fit.cloglog, newdata = predict.data,
    type = "response")
> data.frame(predict.data, pi.hat = round(save.pi.hat,4))
  dist pi.hat
1   20 0.9777
2   50 0.5477
Finally, one could also use the predict() function to
find the standard error of π̂. This information can be
used to find approximate (1-α)100% Wald confidence
intervals for π. The actual formulas will be discussed
in Chapter 4.
> #Prediction with C.I.s
> predict.data<-data.frame(dist = c(20, 50))
> alpha<-0.05
> save.pi.hat<-predict(object = mod.fit.cloglog, newdata = predict.data,
    type = "response", se.fit = TRUE)
> lower<-save.pi.hat$fit-qnorm(1-alpha/2) * save.pi.hat$se.fit
> upper<-save.pi.hat$fit+qnorm(1-alpha/2) * save.pi.hat$se.fit
> data.frame(predict.data, pi.hat = round(save.pi.hat$fit, 4),
    se = round(save.pi.hat$se.fit,4), lower = round(lower,4),
    upper = round(upper,4))
  dist pi.hat     se  lower  upper
1   20 0.9777 0.0046 0.9686 0.9867
2   50 0.5477 0.0303 0.4884 0.6070
 Below is a plot of the estimated probabilities from all
three of the models.
[Figure: "Estimated probability of success of a placekick with observed proportions" – observed proportions with the complementary log-log, logit, and probit fitted curves vs. distance (yards)]
R code:
par(pty = "m") #Plots over all of graph - not square
plot(x = place.pattern$distance, y = place.pattern$prop, xlab = "Distance (yards)",
    ylab = "Estimated probability", main = "Estimated probability of success of
    a placekick \n with observed proportions",
    panel.first = grid(col = "gray", lty = "dotted"))
curve(expr = plogis(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x),
    col = "red", add = TRUE, lwd = 2, lty = 1)
curve(expr = pnorm(mod.fit.probit$coefficients[1] + mod.fit.probit$coefficients[2]*x),
    col = "blue", add = TRUE, lty = 2, lwd = 2)
curve(expr = 1-exp(-exp(mod.fit.cloglog$coefficients[1] +
    mod.fit.cloglog$coefficients[2]*x)), col = "green", add = TRUE, lty = 4, lwd = 2)
legend(locator(1), legend = c("Complementary log-log", "Logit", "Probit"),
    lty = c(4, 1, 2), bty = "n", col = c("green", "red", "blue"), cex = 0.75)
[Figure: bubble-plot version of the previous graph – "Estimated probability of success of a placekick with observed proportions", with plotting points proportional to the number of placekicks at each distance (yards)]
R code:
# Bubble plot version with bubble proportional to sample size
symbols(x = place.pattern$distance, y = place.pattern$prop,
    circles = sqrt(place.pattern$n), xlab = "Distance (yards)",
    ylab = "Estimated probability", xlim = c(10,65), ylim = c(0, 1.2),
    main = "Estimated probability of success of a placekick \n with observed
    proportions", panel.first = grid(lty = "dotted"))
curve(expr = plogis(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x),
    col = "red", add = TRUE, lwd = 2, lty = 1)
curve(expr = pnorm(mod.fit.probit$coefficients[1] + mod.fit.probit$coefficients[2]*x),
    col = "blue", add = TRUE, lty = 2, lwd = 2)
curve(expr = 1-exp(-exp(mod.fit.cloglog$coefficients[1] +
    mod.fit.cloglog$coefficients[2]*x)), col = "green", add = TRUE, lty = 4, lwd = 2)
legend(locator(1), legend = c("Complementary log-log", "Logit", "Probit"),
    lty = c(4, 1, 2), lwd = c(2,2,2), bty = "n", col = c("green", "red", "blue"),
    cex = 0.75)
 2010 Christopher R. Bilder
3.44
3.3 Generalized linear models for count data
Counts (for example, counts in a contingency table) of
possible outcomes are non-negative integers. These
are often modeled as Poisson random variables.
Chapter 7 focuses on counts from a contingency table
for multiple categorical variables. This section focuses
on counts for a single categorical variable that do not
necessarily appear in a contingency table.
Review:
Poisson distribution: P(Y = y) = e^(-μ) μ^y / y! for y = 0, 1, 2, …
where
Y is a random variable
y denotes the possible outcomes of Y
μ is a parameter
E(Y) = μ and Var(Y) = μ - this can be too restrictive
Poisson regression
To make the introduction easier, assume there is only
one explanatory variable.
Random component: Y ~ Poisson(μ)
Systematic component: α + βx
Link function: log transformation
log[E(Y)] = log(μ) = α + βx
⇒ μ = e^(α + βx) = e^α e^(βx) = e^α (e^β)^x
Notice the effect a change in x has on μ: a one-unit
increase in x multiplies the mean by e^β. One could call μ
here "μ(x)", similar to what was done with π in the
previous section.
Question: Why do you think the log link is preferred over the
identity link for count data?
Examples: Possible Y and X variables
Y = # of credit cards you have
Y = # of arrests for a city per year
Y = # of airplane crashes per year
Y = # of cars stopped at the 33rd and Holdrege streets
intersection
What variables could have an effect on Y? Suppose Y is
# of credit cards:
X = income level, gender, where you live,…
 2010 Christopher R. Bilder
3.46
Example: Horseshoe crabs and satellites (horseshoe.R,
horseshoe.txt)
See the video! Also, please see the description on p. 75
of Agresti (2007). Page 76-77 shows the entire data set.
 2010 Christopher R. Bilder
3.47
More on the crabs:
 www.npr.org/templates/story/story.php?storyId=106489695
 http://www.ceoe.udel.edu/horseshoecrab
For each ith female, assume the number of satellites, Yi,
has a Poisson distribution with mean μi dependent on the
female's shell width. We will model the expected number of
satellites with the following model:
log(μi) = α + βxi
where xi is the width of the ith female crab.
> #Read in data
> crab<-read.table(file = "c:\\Chris\\UNL\\STAT875\\chapter4\\horseshoe.txt",
    header = FALSE, col.names = c("satellite", "width"))
Notice how the data was read in.
> mod.fit<-glm(formula = satellite ~ width, data = crab,
family = poisson(link = log), na.action =
na.exclude, control = list(epsilon = 0.0001, maxit
= 50, trace = T))
Deviance = 759.6346 Iterations - 1
Deviance = 580.078 Iterations - 2
Deviance = 567.9793 Iterations - 3
Deviance = 567.8786 Iterations - 4
Deviance = 567.8786 Iterations - 5
> summary(mod.fit)

Call:
glm(formula = satellite ~ width, family = poisson(link = log),
    data = crab, na.action = na.exclude, control = list(epsilon = 1e-04,
    maxit = 50, trace = T))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8526  -1.9884  -0.4933   1.0970   4.9221

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54222  -6.095 1.10e-09 ***
width        0.16405    0.01996   8.217  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 632.79  on 172  degrees of freedom
Residual deviance: 567.88  on 171  degrees of freedom
AIC: 927.18

Number of Fisher Scoring iterations: 5
> #Predict for 23 and 30 widths
> predict.data<-data.frame(width = c(23, 30))
> alpha<-0.05
> save.mu.hat<-predict(object = mod.fit, newdata = predict.data, type = "response",
    se = TRUE)
> lower<-save.mu.hat$fit-qnorm(1-alpha/2)*save.mu.hat$se
> upper<-save.mu.hat$fit+qnorm(1-alpha/2)*save.mu.hat$se
> data.frame(predict.data, mu.hat = round(save.mu.hat$fit,4),
    lower = round(lower,4), upper = round(upper,4))
  width mu.hat  lower  upper
1    23 1.5972 1.3074 1.8871
2    30 5.0359 4.3101 5.7618
#Plot of data and estimated model
> plot(x = crab$width, y = crab$satellite, xlab = "Width
(cm)", ylab = "Number of satellites", main = "Horseshoe
crab data set \n with poisson regression model fit",
panel.first = grid(col = "gray", lty = "dotted"))
> curve(expr = exp(mod.fit$coefficients[1] +
mod.fit$coefficients[2]*x), col = "red", add = TRUE, lty
= 1)
> #This is part of Table 3.3 on p. 80 of Agresti (2007). The last two "columns"
    are the number of cases and the number of satellites. The first "column" is
    the group width mean corresponding to the width categories given in Table
    3.3. These means are stated on p. 90 of Agresti (1996). In the 2007 edition,
    he did not state them. However, they can easily be found as shown in my
    table3.3.R program.
> crab.tab3.3<-data.frame(width = c(22.69, 23.84, 24.77,
25.84, 26.79, 27.74, 28.67, 30.41),
cases = c(14, 14, 28, 39, 22, 24, 18, 14),
satell = c(14, 20, 67, 105, 63, 93, 71, 72))
> temp3<-matrix(data=temp2, nrow=8, ncol=3, byrow=T)
> crab.tab4.3<-data.frame(width=temp3[,1], cases=temp3[,2],
satell=temp3[,3])
> #Average number of satellites per group
> mu.obs<-crab.tab4.3$satell/crab.tab4.3$cases
> points(x = crab.tab4.3$width, y = mu.obs, pch = 18, col =
"darkgreen", cex = 2)
> legend(locator(1), legend="Diamonds are group mean", cex =
0.75)
 2010 Christopher R. Bilder
3.50
15
Horseshoe crab data set
with poisson regression model fit
10
5
0
Number of satellites
Diamonds are group mean
22
24
26
28
30
32
34
Width (cm)
Notes:
 First examine the plot of the data above – ignoring the
estimated model plotted in red. The data show an
upward trend. As the width increases, the number of
satellites increases. This is easier to see with the
group means (the grouping of the data comes from
Table 4.3 of Agresti (1996, p. 90)). Remember that
the Poisson regression model is modeling the MEAN
response!
 2010 Christopher R. Bilder
3.51
 The glm() function fits the Poisson regression model to
the data. Notice the use of the family = poisson(link =
log) option.
 The estimated Poisson regression model is
μ̂ = exp(-3.3048 + 0.1640x)
where x = width and μ is the mean number of satellites.
The model could also be written as:
log(μ̂) = -3.3048 + 0.1640x
 What happens to the estimated mean number of
satellites as the width increases?
 The estimated number of satellites for a particular
width can be found from the model. For example, the
estimated mean number of satellites for a width of 23
is
ˆ  exp( 3.3048  0.1640  23)  1.5972
The estimated number of satellites for a width of 30 is
5.0359. See how the predict() function was used here.
 The z value in the output gives a test for whether the
corresponding parameter is 0 or not. This test statistic
can be compared to a standard normal distribution. Is
width important to predicting the mean number of
satellites?
 2010 Christopher R. Bilder
3.52
 See the R code used to create the plot. I had difficulty
creating a legend with the diamond plotting character.
 The plot is very important to do in order to determine if
the model works for the data!
 Table3.3.R provides a general way to find tables like
Table 3.3 on p. 80 in Agresti (2007). The program
also provides a general way to find categories (not the
same as those in Table 3.3). This program code can
be incorporated into your own program for future
projects!!!
Negative binomial regression
A limiting assumption for a Poisson distribution is that E(Y)
= Var(Y) = μ. Sometimes, the variance of Y appears to
be greater than μ for a data set. Evidence of this occurs
in the horseshoe example. See Table 3.3 on p. 80 or
part of it produced below from my Table3.3.R program.
> table3.3[,1:5]
  width.group number.cases number.sat mean.per.group var.per.group
1    22.69286           14         14       1.000000      2.769231
2    23.84286           14         20       1.428571      8.879121
3    24.77500           28         67       2.392857      6.543651
4    25.83846           39        105       2.692308     11.376518
5    26.79091           22         63       2.863636      6.885281
6    27.73750           24         93       3.875000      8.809783
7    28.66667           18         71       3.944444     16.879085
8    30.40714           14         72       5.142857      8.285714
If the Poisson assumptions were satisfied, we would
expect the mean.per.group column to be approximately
the same as the var.per.group column. Obviously, this is
 2010 Christopher R. Bilder
3.53
not happening here. Note that this is an “ad-hoc” way to
show the variance is larger than the mean (due to the
artificial grouping of the data), but it still shows evidence
toward possible problems.
When the variance is larger than the mean, this is called
overdispersion, and it is a violation of our model. Thus,
inferences made using the model may be incorrect.
What can you do when this occurs?
1. Find more explanatory variables that help explain the
variability in the response variable! The additional
variability could be due to not accounting for other
explanatory variables. For example, perhaps crab
weight plays an important role in estimating the mean
number of satellites. Without accounting for weight and
using width only, there can be additional satellite
variability than expected at individual widths. See
Agresti (2007) on p. 80-1 for a further good
explanation.
2. Page 151 of Agresti (2002) discusses quasi-Poisson
regression models. These models do not assume a full
parametric form for the model and can be estimated
with the glm() function by using a family =
quasipoisson(link = log) option (a brief sketch is given
after this list). See the additional Chapter 3 notes for
more information. Agresti (2007) does not discuss these
models in Chapter 3 (a little on p. 280), so they will not
be discussed here.
 2010 Christopher R. Bilder
3.54
3. Poisson generalized linear mixed models, which are
explained in Section 13.5 of Agresti (2002).
4. Agresti (2007) discusses negative binomial models, so
these will be presented next.
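Before moving on, here is the brief quasi-Poisson sketch referenced in item #2, assuming the crab data frame read in earlier is still available. The mean model is the same; only the variance assumption (and therefore the standard errors) changes.

mod.fit.quasi<-glm(formula = satellite ~ width, data = crab,
    family = quasipoisson(link = log))
summary(mod.fit.quasi)$dispersion   #estimated dispersion parameter
summary(mod.fit.quasi)$coefficients #same estimates as Poisson regression, larger SEs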
One way to write the negative binomial distribution is
P(Y = y) = [(y + k - 1) choose y] p^k (1 - p)^y for y = 0, 1, …
This distribution occurs when one is interested in the
probability of y failures before the kth success (see
Casella and Berger (2002, p. 95) if you are interested in
more detail). For us, there are two important aspects to
this distribution. First, the values of Y are non-negative
integers just like a Poisson random variable. Second, the
distribution can be rewritten as
P(Y = y) = [(y + k - 1) choose y] [k/(μ + k)]^k [μ/(μ + k)]^y for y = 0, 1, …, and k > 0
where E(Y) = μ and Var(Y) = μ + μ²/k. Notice that this is
very similar to what we had for a Poisson random
variable, but now we have a larger variance for Y! The
parameter k is a measure of the "over" dispersion. Note
that Agresti (2007) officially defines D = 1/k as the
"dispersion parameter". As 1/k goes to 0, we approach
what the Poisson distribution would obtain. More in-depth
information for how this distribution comes about is
available on p. 559-561 of Agresti (2002) if you are
interested.
Example: Horseshoe crabs and satellites (horseshoe.R,
horseshoe.txt)
The glm() function can not fit this specific model so we
will need to use the glm.nb() function in the MASS
package. This package comes with an initial installation
of R, but you will still need to tell R that you want to use it.
> library(MASS)
> mod.fit.nb<-glm.nb(formula = satellite ~ width, data = crab, link = log)
> summary(mod.fit.nb)
Call:
glm.nb(formula = satellite ~ width, data = crab, link = log,
    init.theta = 0.904568080033865)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7798  -1.4110  -0.2502   0.4770   2.0177

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.05251    1.17143  -3.459 0.000541 ***
width        0.19207    0.04406   4.360 1.30e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.9046) family
taken to be 1)

    Null deviance: 213.05  on 172  degrees of freedom
Residual deviance: 195.81  on 171  degrees of freedom
AIC: 757.29

Number of Fisher Scoring iterations: 1

Correlation of Coefficients:
      (Intercept)
width -1.00

              Theta:  0.905
          Std. Err.:  0.161
 2 x log-likelihood:  -751.291
The estimated negative binomial regression model is
log(μ̂) = -4.0525 + 0.1921x
with k̂ = 0.905 (the estimated theta in the output) and
D̂ = 1/k̂ = 1/0.905 ≈ 1.1. Thus, the estimated variance of
Y is now μ̂ + μ̂²/0.905 ≈ μ̂ + 1.1μ̂² (remember that μ̂
depends on the value of x).
Where does this larger variance show up in the analysis?
One place is in the estimated variance of β̂ (without going
into the formula details). Looking in the usual spot in the
output, we obtain a standard error of 0.04406. The
corresponding value in the Poisson regression model was
0.01996. Why does this larger variance matter?
How could one test if there was evidence of
overdispersion?
 2010 Christopher R. Bilder
3.57
Poisson regression for rate data
Rate data consists of the rate that a number of events
occur for some time period or other baseline measure.
Examples include: the number of times a computer
crashes during a time period, number of melanoma cases
per city size, number of arrivals at airports for a particular
time period,… .
The time period or baseline measure needs to be
incorporated into the analysis. One way to do this is to
model Y/t instead of just Y where Y is the number of
events and t is the time period or baseline measure.
Thus, the Poisson regression model becomes:
log(μ/t) = α + βx
where μ = E(Y). This expression can be simplified to
log(μ) - log(t) = α + βx ⇒ log(μ) = α + βx + log(t).
log(t) is called an "offset". Notice how the offset has an
effect on μ:
μ = e^(α + βx + log(t)) ⇒ μ = t·e^(α + βx)
Thus, t helps to adjust the "usual" mean (e^(α + βx)) by the
time period or baseline measure; for example, doubling t
doubles the mean.
 2010 Christopher R. Bilder
3.58
Example: Horseshoe crabs and satellites (horseshoe.R,
horseshoe.txt)
This is not necessarily the best example where one
would want to use Poisson regression for rate data, but it
gives a nice illustration of the relationship between a
Poisson model for rate data and “regular” data. Please
see p. 83 of Agresti (2007) for another example where
using rate data is more appropriate.
Suppose the data was given in the form of the number of
satellites per distinct width. Let Y be the number of
satellites for a distinct width. Let t be the number of
female crabs observed for a distinct width. For example,
there are t=3 crabs with a width of 22.9 cm and they
have a total of Y=4+0+0 = 4 satellites.
Before, the data set looked like this:
Crab ID  Satellites  Width
   1          8       28.3
   2          0       22.5
   3          9       26.0
   4          0       24.8
Now the data set looks like this:
# of crabs (t)  Total satellites (Y)  Width
      1                  0             21.0
      1                  0             22.0
      3                  5             22.5
      3                  4             22.9
> library(nlme) #gsummary function is located here
Loading required package: lattice
> sum.rate.data<-gsummary(object = crab, FUN = sum,
groups = crab$width)
> length.rate.data<-gsummary(object = crab, FUN = length,
groups = crab$width)
> rate.data<-data.frame(y=sum.rate.data$satellite,
t=length.rate.data$satellite,
width=length.rate.data$width)
> mod.fit.rate<-glm(formula = y ~ width+offset(log(t)),
data = rate.data, family = poisson(link = log),
na.action = na.exclude, control = list(epsilon =
0.0001, maxit = 50, trace = T))
Deviance = 211.7379 Iterations - 1
Deviance = 190.2969 Iterations - 2
Deviance = 190.0273 Iterations - 3
Deviance = 190.0272 Iterations - 4
Deviance = 286.3955 Iterations - 1
Deviance = 255.2993 Iterations - 2
Deviance = 254.9404 Iterations - 3
Deviance = 254.9403 Iterations - 4
> summary(mod.fit.rate)

Call:
glm(formula = y ~ width + offset(log(t)), family = poisson(link = log),
    data = rate.data, na.action = na.exclude, control = list(epsilon = 1e-04,
    maxit = 50, trace = T))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.8003  -1.4515  -0.3788   0.6619   4.7586

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54218  -6.095 1.09e-09 ***
width        0.16405    0.01996   8.217  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 254.94  on 65  degrees of freedom
Residual deviance: 190.03  on 64  degrees of freedom
AIC: 402.52

Number of Fisher Scoring iterations: 4
> #Plot of data with estimated mu's; notice the use of the
panel.first option to put grid lines behind plotting
points
> plot(x = crab$width, y=crab$satellite, xlab="Width (cm)",
ylab="Number of satellites", panel.first=grid(col =
"gray", lty = "dotted"), main = "Horseshoe crab data
set \n with poisson regression model fit (rate
data)")
> points(x = rate.data$width, y =
mod.fit.rate$fitted.values, pch = 18, col =
"darkgreen", cex = 1)
> legend(locator(1), legend="Diamonds are predicted value",
cex = 0.75)
 2010 Christopher R. Bilder
3.61
15
Horseshoe crab data set
with poisson regression model fit (rate data)
10
5
0
Number of satellites
Diamonds are predicted values
22
24
26
28
30
32
34
Width (cm)
Notes:
 The gsummary() function allows one to summarize a
data set by a grouping variable. This is similar to
using a SAS procedure with a BY statement. In this
case, I sum the satellites over the different crab
widths. Also, I find the number of satellites per crab
width. These are combined into the data.frame called
rate.data. Note that the gsummary() function is in the
nlme package so this package needs to be loaded
first.
 2010 Christopher R. Bilder
3.62
 The glm() function is used to fit the Poisson regression
model with an offset. Notice the parameter estimates
are the same as before! Below is the estimated model
with the offset:
log(ˆ )  -3.3048 + 0.1640width + log(t)





where t = number of crabs per distinct width.
Generally, the parameter estimates will be displayed
as being exactly the same. Through some statistical
research that I have worked on, I have found some
situations where there are some differences.
Why are there two sets of iterations here for glm()? In
the glm() function code, the model with an intercept
ONLY is fit once and then the whole model is fit. This
occurs only when an offset is used. I think the reason
is due to what an intercept only model represents with
rate data.
Notice that a smooth curve can not be plotted because
of the different number of crabs per width.
A better version of the plot would include different
colors for the plotting characters (corresponding to
each t value) for each observed Y and predicted .
More plots of the model are discussed in the Chapter 3
additional notes.
 2010 Christopher R. Bilder
3.63
3.4 Statistical inference and model checking
One of the best things about GLMs is that they provide a
unified approach to test model parameters, check
goodness-of-fit, examine residuals, estimate parameters,
… . Thus, one can use the same basic methods for
logistic, probit, complementary log-log, and Poisson
regression.
The Wald and likelihood ratio tests
A hypothesis test commonly of interest is
Ho:=0
Ha:0
Below are two different ways this test can be conducted:
 Wald - The test statistic is

Z = β̂ / SE(β̂)

where SE stands for "standard error". Actually, this standard error is an estimate of the "asymptotic" standard error. Often, you will see the standard error here denoted as √AsVar(β̂) or √V̂ar(β̂). For large n, remember that an MLE (β̂ here) has an approximate normal distribution. Thus, Z has an approximate
standard normal distribution and this distribution can
be used to perform the test.
On p. 1.28 of the notes, the "large sample variance" was introduced for p̂ = y/n, where y is 0 or 1 and n is the number of trials. The formula given was:

[ -E( ∂²log ℓ(π | y₁,…,yₙ) / ∂π² ) ]⁻¹  evaluated at π = p̂

This formula can be used here also by using the likelihood function for α and β instead. Since there are two parameters, a matrix of the second partial derivatives is found:

{ -E [ ∂²log ℓ(α,β | y₁,…,yₙ)/∂α²     ∂²log ℓ(α,β | y₁,…,yₙ)/∂α∂β
       ∂²log ℓ(α,β | y₁,…,yₙ)/∂β∂α    ∂²log ℓ(α,β | y₁,…,yₙ)/∂β²   ] }⁻¹  evaluated at α̂, β̂

  =  [ V̂ar(α̂)        Ĉov(α̂, β̂)
       Ĉov(α̂, β̂)     V̂ar(β̂)      ]

The "large sample variance" for β̂ is the (2,2) element of the above matrix. The square root of this quantity is the SE that we are using in the denominator of Z.
Notes:
 Try to write out the likelihood function for a logistic
regression or Poisson model on your own. Then try
to write out the matrix of second partial derivatives.
 You will never need to actually do the evaluation of the formula of the large sample variance by hand; R will do it routinely for us (see the sketch after these notes)!
 The same problems that we have had before using
Wald confidence intervals happen here. Therefore,
we need to make sure the sample size is large. The
next method is a little better to use when the sample
size is not large.
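For example, a minimal sketch (assuming mod.fit holds a model fit from glm(), as in the examples that follow):

#Estimated large-sample variance-covariance matrix of (alpha-hat, beta-hat)
vcov(mod.fit)

#The SEs reported by summary() are the square roots of the diagonal elements
sqrt(diag(vcov(mod.fit)))

#Wald statistic for Ho: beta = 0 (second model parameter)
coef(mod.fit)[2]/sqrt(diag(vcov(mod.fit)))[2]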
 Likelihood ratio test (LRT) - We have discussed the
LRT before in Chapters 1-2. This procedure can also
be used here.
Review from p. 1.29 and 2.58: The LRT statistic is
Λ = (max. likelihood when parameters satisfy Ho) / (max. likelihood when parameters satisfy Ho or Ha)
Remember that the ratio is between 0 and 1 since the
numerator can not exceed the denominator.
For the test of =0 vs. 0, the numerator is
calculated assuming =0. Thus, the model fit to the
data is only g() =  (where g() denotes the link
 2010 Christopher R. Bilder
3.66
function). The denominator is calculated without the
assumption that =0. Thus, the model fit to the data is
g() =  + x. The likelihood functions are found
using the fit of both models and the ratio is found. For
example, the ratio becomes for logistic regression:
Λ = (max. likelihood when parameters satisfy Ho) / (max. likelihood when parameters satisfy Ho or Ha)

  = [ ∏ᵢ₌₁ⁿ π̂o(xᵢ)^yᵢ (1 - π̂o(xᵢ))^(1-yᵢ) ] / [ ∏ᵢ₌₁ⁿ π̂(xᵢ)^yᵢ (1 - π̂(xᵢ))^(1-yᵢ) ]

where π̂o(xᵢ) = e^α̂o / (1 + e^α̂o) and π̂(xᵢ) = e^(α̂ + β̂xᵢ) / (1 + e^(α̂ + β̂xᵢ)).
The actual test statistic used for a LRT is -2log(Λ). The reason is because this statistic has an approximate χ² distribution for large n. The degrees of freedom are found the same way as before. In this case, notice the difference between Ho and Ha is whether or not β = 0. Thus, the χ² distribution has 1 degree of freedom. Note that -2log(Λ) is often denoted in categorical data analysis as G².

Often in computer output, -2log(Λ) is not given directly. Instead, what is often given is the "null deviance" and the "residual deviance". These are -2log(Λ) statistics themselves, but for testing a different set of hypotheses.
Simply put, the –2log() for a test of Ho:=0 vs. Ha:0
is:
null deviance – residual deviance
Below is a further explanation of the two deviances. The null deviance tests:

Ho: Model with only α
Ha: Model using the observed values

The test statistic for Poisson regression is

G₁² = 2 Σᵢ₌₁ⁿ yᵢ log( yᵢ / μ̂o,ᵢ )   where μ̂o,ᵢ is e^α̂o.
Compare the above form to what we saw on p. 2.59.
The test statistic for logistic regression is

G₁² = 2 Σᵢ₌₁ⁿ [ yᵢ log( yᵢ / π̂o,ᵢ ) + (1 - yᵢ) log( (1 - yᵢ) / (1 - π̂o,ᵢ) ) ]   where π̂o,ᵢ = e^α̂o / (1 + e^α̂o).
Questions:
o What is μ̂o for the Poisson regression model?
o What is π̂o for the logistic regression model?
The residual deviance tests:

Ho: Model with only α and β
Ha: Model using the observed values

The test statistic for Poisson regression is

G₂² = 2 Σᵢ₌₁ⁿ yᵢ log( yᵢ / μ̂ᵢ )   where μ̂ᵢ is e^(α̂ + β̂xᵢ).

The test statistic for logistic regression is

G₂² = 2 Σᵢ₌₁ⁿ [ yᵢ log( yᵢ / π̂ᵢ ) + (1 - yᵢ) log( (1 - yᵢ) / (1 - π̂ᵢ) ) ]   where π̂ᵢ = e^(α̂ + β̂xᵢ) / (1 + e^(α̂ + β̂xᵢ)).
Notice that G₁² and G₂² both have a few things in common. When the residual deviance is subtracted from the null deviance, the resulting statistic for Poisson regression is:

G₁² - G₂² = 2 Σᵢ₌₁ⁿ yᵢ log( yᵢ / μ̂o,ᵢ ) - 2 Σᵢ₌₁ⁿ yᵢ log( yᵢ / μ̂ᵢ )

          = 2 [ Σᵢ₌₁ⁿ yᵢ log(yᵢ) - Σᵢ₌₁ⁿ yᵢ log(μ̂o,ᵢ) ] - 2 [ Σᵢ₌₁ⁿ yᵢ log(yᵢ) - Σᵢ₌₁ⁿ yᵢ log(μ̂ᵢ) ]

          = 2 [ Σᵢ₌₁ⁿ yᵢ log(μ̂ᵢ) - Σᵢ₌₁ⁿ yᵢ log(μ̂o,ᵢ) ]

          = 2 Σᵢ₌₁ⁿ yᵢ log( μ̂ᵢ / μ̂o,ᵢ ) = 2 Σᵢ₌₁ⁿ yᵢ log( e^(α̂ + β̂xᵢ) / e^α̂o )

For logistic regression, the statistic becomes

G₁² - G₂² = 2 Σᵢ₌₁ⁿ [ yᵢ log( π̂ᵢ / π̂o,ᵢ ) + (1 - yᵢ) log( (1 - π̂ᵢ) / (1 - π̂o,ᵢ) ) ].

Without going into the details, these are the correct -2log(Λ) statistics for the test of Ho: β = 0 vs. Ha: β ≠ 0! The word "deviance" is used because the statistics give a measurement of how much the observed data "deviates" from the model's fit.
Example: Placekicking (placekick_ch3.R, place.s.csv)
Perform the test of Ho:=0 vs. Ha:0. The output from
glm() is reproduced below.
> summary(mod.fit)
Call:
glm(formula = good1 ~ dist, family = binomial(link = logit),
data = place.s, na.action = na.exclude, control =
list(epsilon = 1e-04, maxit = 50, trace = T))
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7441   0.2425   0.2425   0.3801   1.6091

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.812079   0.326158   17.82   <2e-16 ***
dist        -0.115027   0.008337  -13.80   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1013.43 on 1424 degrees of freedom
Residual deviance: 775.75 on 1423 degrees of freedom
AIC: 779.75
Number of Fisher Scoring iterations: 5
The Wald test statistic is Z = -13.80. Since |Z| > Z0.975 = 1.96, β ≠ 0 with 95% confidence. Also, the p-value is very small. Therefore, distance is important for predicting the probability of success.
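A Wald confidence interval for β follows from the same estimate and standard error; a minimal sketch using the fitted object mod.fit from above:

#95% Wald interval for the dist coefficient: estimate +/- Z_0.975*SE
beta.hat <- coef(mod.fit)["dist"]
se.beta <- sqrt(diag(vcov(mod.fit)))["dist"]
beta.hat + qnorm(c(0.025, 0.975))*se.beta

#confint.default() produces the same Wald interval
confint.default(object = mod.fit, parm = "dist", level = 0.95)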
To find the -2log(), use the null and residual deviance:
G12  G22 = 1013.43 – 775.75 = 237.68
The degrees of freedom given from the output for the
null and residual deviance can also be subtracted in the
same way to find the degrees of freedom for the test:
1424 – 1423 = 1
Below is the R code and output to perform the LRT:
> #LRT: -2log(lambda)
> mod.fit$null.deviance - mod.fit$deviance
[1] 237.6811
> #DF
> mod.fit$df.null-mod.fit$df.residual
[1] 1
> #p-value
> 1 - pchisq(q = mod.fit$null.deviance - mod.fit$deviance,
    df = mod.fit$df.null - mod.fit$df.residual)
[1] 0
Since the p-value is very small, β ≠ 0. Therefore, distance is important for predicting the probability of success. In Chapter 5, we will see that it is not appropriate to perform the test as done here. More will be discussed about it at that time.
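The same LRT can also be obtained without subtracting the deviances by hand; a brief sketch using the anova() method for glm objects:

#Sequential LRT comparing the intercept-only model to the model with dist
anova(mod.fit, test = "Chisq")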
Here are some additional details showing how R calculates G₂²:

> y<-place.s$good
> pi.hat<-mod.fit$fitted.values
> pi.tilde<-y
> 2*(sum(log(y^y)) - sum(y*log(pi.hat)) + sum(log((1-y)^(1-y))) -
    sum((1-y)*log(1-pi.hat)))  #Need to do second part with pi^y due to 0 pi values
[1] 775.745

> #Discussed in next chapter
> dev.resid<-resid(mod.fit, type="deviance")^2
> sum(dev.resid)
[1] 775.745
Question: Suppose you wanted to test Ho:=0 vs. Ha:0 for
models with a probit or complementary log-log link. How
would you do it?
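One possible answer (a sketch, not the only way): refit the model with the desired link and use the same deviance-based LRT. The code below assumes the place.s data frame with variables good1 and dist from the earlier example.

#Probit model for the placekicking data
mod.fit.probit <- glm(formula = good1 ~ dist, family = binomial(link = probit),
  data = place.s)

#LRT of Ho: beta = 0 vs. Ha: beta != 0 using the null and residual deviances
G.sq <- mod.fit.probit$null.deviance - mod.fit.probit$deviance
1 - pchisq(q = G.sq, df = mod.fit.probit$df.null - mod.fit.probit$df.residual)

#A complementary log-log model works the same way with link = cloglog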
Example: Horseshoe crabs and satellites (horseshoe.R,
horseshoe.txt)
Perform the test of Ho:=0 vs. Ha:0. The output from
glm() is reproduced below. Note that the model with the
offset is used here!
> summary(mod.fit)
Call:
glm(formula = y ~ width + offset(log(t)), family =
poisson(link = log), data = rate.data, na.action =
na.exclude, control = list(epsilon = 1e-04, maxit = 50,
trace = T))
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.8003  -1.4515  -0.3788   0.6619   4.7586

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54218  -6.095 1.09e-09 ***
width        0.16405    0.01996   8.217  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 254.94  on 65  degrees of freedom
Residual deviance: 190.03  on 64  degrees of freedom
AIC: 402.52

Number of Fisher Scoring iterations: 4
The Wald test statistic is Z = 8.22. Since |Z| > Z0.975 = 1.96 and the p-value < 2×10⁻¹⁶, β ≠ 0. Therefore, width is important for predicting the mean number of satellites.

-2log(Λ) = G₁² - G₂² = 254.9403 - 190.0272 = 64.91
> #LRT: -2log(lambda)
> mod.fit$null.deviance - mod.fit$deviance
[1] 64.91309
> #p-value
> 1-pchisq(q = mod.fit$null.deviance-mod.fit$deviance,
df = mod.fit$df.null-mod.fit$df.residual)
[1] 7.771561e-16
Since the p-value is very small, β ≠ 0. Therefore, width is important for predicting the mean number of satellites.
NOTE!
In Section 3.4.5, Agresti (2007) talks about "goodness-of-fit" statistics and model residuals mostly in the context of Poisson regression models. In Chapter 5, these items are discussed for logistic regression models in much more detail. The reason for the separation is that there are a few things one needs to watch out for in logistic regression that do not arise as much in Poisson regression. Thus, the rest of the discussion in this section will only be for Poisson regression.
Model residuals
Pearson residuals can be calculated in a similar manner as described in Chapter 2. The Pearson residual in Chapter 2 was

(nᵢⱼ - μ̂ᵢⱼ) / √μ̂ᵢⱼ

where nᵢⱼ was the cell count for row i and column j, μ̂ᵢⱼ was its estimated value under the hypothesis of independence, and √μ̂ᵢⱼ is the square root of the estimated variance (remember for a Poisson random variable, mean = variance). A Pearson residual has an approximate standard normal distribution provided that μ̂ᵢⱼ is not small (> 2 or 5).
The same set-up can be used here for the Pearson residual from a Poisson regression model. For the Poisson regression model:

(yᵢ - μ̂ᵢ) / √μ̂ᵢ

where yᵢ is the ith observed value for the dependent variable and μ̂ᵢ is its predicted value.
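As a quick check (a sketch using the satellite ~ width fit, called mod.fit as in the example below), the formula above matches what residuals() returns:

#Pearson residuals computed directly from the formula
mu.hat <- mod.fit$fitted.values
pearson.by.hand <- (crab$satellite - mu.hat)/sqrt(mu.hat)

#Same values from the built-in extractor function
head(cbind(pearson.by.hand, residuals(object = mod.fit, type = "pearson")))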
In Chapter 2, we also learned about a standardized
residual. The standardized residual has a distribution
that is closer to a standard normal distribution than the
Pearson residual. The standardized residual is
(yᵢ - μ̂ᵢ) / √V̂ar(yᵢ - μ̂ᵢ)  =  (yᵢ - μ̂ᵢ) / √( μ̂ᵢ (1 - hᵢ) )

where hᵢ is the ith diagonal value of the hat matrix. What is the hat matrix?
With respect to regular regression analysis, you can see my Chapters 5 and 10 STAT 870 notes at www.chrisbilder.com/stat870/schedule.htm. With respect to Poisson regression, let X be a n×2 matrix with 1's in the first column and the explanatory variable values in the second column. Create a diagonal matrix, V̂, with diagonal elements of μ̂ᵢ in the same order as the corresponding explanatory variable values listed in X. The hat matrix is H = V̂^(1/2) X (X′V̂X)⁻¹ X′ V̂^(1/2). Note that this is similar to the hat matrix used when fitting a regression model by weighted least squares.
Note that the “standardized” residual may also be called
elsewhere an “adjusted Pearson residual”, “adjusted
residual” (Agresti, 1996, uses this term), or “studentized
residual”.
The standardized residual can be calculated in R using
h<-lm.influence(model = mod.fit)$h
Pearson<-residuals(object = mod.fit, type="pearson")
standard.pearson<-Pearson/sqrt(1-h)
assuming mod.fit contains the model fit from glm().
We can use a standard normal approximation for both the
Pearson and standardized residuals. Of course, the
approximation works better with the standardized.
Question: Suppose the standardized residuals are greater
than 2.576 or less than -2.576. What does this mean
about the model?
Example: Horseshoe crabs and satellites (horseshoe.R,
horseshoe.txt)
> pearson1<-residuals(object = mod.fit, type="pearson")

> #Standardized Pearson residuals
> h<-lm.influence(model = mod.fit)$h
> head(h)
          1           2           3           4           5           6
0.009852678 0.015152453 0.006360592 0.008647581 0.006360592 0.011358140

> standard.pearson<-pearson1/sqrt(1-h)
> head(standard.pearson)
         1          2          3          4          5          6
 2.1569835 -1.2223348  3.9641123 -1.4712657  0.8609526 -1.3572621

> X<-model.matrix(mod.fit)
> #Also could use mu.hat<-mod.fit$fitted.values here
> mu.hat<-predict(object = mod.fit, type = "response")
> H<-diag(sqrt(mu.hat))%*%X%*%solve(t(X)%*%diag(mu.hat)%*%X)%*%
    t(X)%*%diag(sqrt(mu.hat))
> diag(H)[1:5]
[1] 0.009852370 0.015150506 0.006360719 0.008647445 0.006360719
Notes:
 The residuals() function finds the residuals.
 There are a few functions that help you find the hat
matrix diagonal values. One is the lm.influence()
function. There are no direct functions for the
standardized residuals.
 See how the matrix calculations are done in R. You
are not responsible for this content.
> par(mfrow = c(2,1)) #2x1 grid of plots

> #Pearson residual vs observation number plot
> plot(x = 1:length(pearson1), y = pearson1, xlab="Observation number",
    ylab="Pearson residuals", main = "Pearson residuals vs. observation number")
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty=3, col="red")

> #Standardized residual vs observation number plot
> plot(x = 1:length(standard.pearson), y = standard.pearson,
    xlab="Observation number", ylab="Standardized residuals",
    main = "Standardized residuals vs. observation number")
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty=3, col="red")
[Figure: Pearson residuals vs. observation number (top panel) and standardized residuals vs. observation number (bottom panel), with horizontal reference lines at the 0.005, 0.025, 0.975, and 0.995 standard normal quantiles.]
Notes:
 The abline() function was used to draw lines on the plot
at Z0.975 and Z0.995. Notice it takes one call to the
function for the lines.
 Both plots are quite similar. Since we have only one
explanatory variable, it is often helpful to plot these
residuals vs. the explanatory variable.
> par(mfrow = c(1,1))
> # Residual vs width plot
> plot(x = crab$width, y = standard.pearson, xlab="Width",
    ylab="Standardized Pearson residuals",
    main = "Standardized Pearson residuals vs. width")
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty=3, col="red")

[Figure: Standardized Pearson residuals vs. width, with horizontal reference lines at the standard normal quantiles.]
Notice the patterns among the plotting points. It is not
unusual to see these types of patterns when one is
modeling a discrete response variable. The plot
below shows you why these patterns are occurring.
> plot(x = crab$width, y = standard.pearson, xlab="Width",
    ylab="Standardized Pearson residuals",
    main = "Standardized Pearson residuals vs. width", type = "n")
> text(x = crab$width, y = standard.pearson, labels = crab$satellite, cex=0.75)
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty=3, col="red")
[Figure: Standardized residuals vs. width, with each point plotted as its observed number of satellites; reference lines at the standard normal quantiles.]
The model appears to have fit problems when there are a larger number of satellites than expected at lower widths (relative to the observations with a particular number of satellites). This may be a result of the overdispersion that we saw earlier.
The negative binomial model could also be fit to the
data. The same types of residuals can be found with the
corresponding adjustments to reflect the new model.
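The negative binomial fit mod.fit.nb used below was obtained earlier in the notes; a minimal sketch of one common way such a fit could be produced (not necessarily the exact call in the original program):

library(MASS)  #glm.nb() fits negative binomial regression models

#Negative binomial model with a log link for the horseshoe crab data
mod.fit.nb <- glm.nb(formula = satellite ~ width, data = crab, link = log)
summary(mod.fit.nb)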
> pearson.nb<-residuals(object = mod.fit.nb, type="pearson")
> h.nb<-lm.influence(model = mod.fit.nb)$h
> standard.pearson.nb<-pearson.nb/sqrt(1-h.nb)

> par(mfrow = c(1,2))
> plot(x = 1:length(standard.pearson.nb), y = standard.pearson.nb,
    xlab="Obs. number", ylab="Standardized residuals",
    main = "Stand. residuals (NB model) vs. obs. number")
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty=3, col="red")

> plot(x = crab$width, y = standard.pearson.nb, xlab="Width",
    ylab="Standardized residuals",
    main = "Stand. residuals (NB model) vs. width", type = "n")
> text(x = crab$width, y = standard.pearson.nb, labels = crab$satellite,
    cex=0.75)
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty=3, col="red")
[Figure: Standardized residuals from the negative binomial model vs. observation number (left panel) and vs. width (right panel), with points labeled by the observed number of satellites and reference lines at the standard normal quantiles.]
As we can see, there are not as many standardized residuals outside of the ±2.576 borderlines. How many standardized residuals would you expect outside of these borderlines with n = 173?
Comments:
 I am a little concerned about how large these two standardized residuals are. One could examine these observations more closely like what you would do in a STAT 870 class. Due to time considerations, I am not going to do this here.
 I am also a little concerned with there being no standardized residuals less than -1. Remember that a normal distribution is being used here. Do you think a normal approximation will work for these observations toward the bottom of these plots?
 One possible solution to the normal approximation
problem is to work with the rate data formulation of the
model. Why? See the additional Chapter 3 notes for
details.
Goodness-of-fit

The Pearson statistic and LRT can both be used to assess how well (good) the model fits the data versus using just the "observed" values at the explanatory variable levels. This model is often called the "saturated" model since it has the most possible parameters. The saturated model estimates a parameter for every observation. For example, the saturated model for Poisson regression is log(μᵢ) = α + βᵢ for i = 1, …, n, which results in μ̂ᵢ = yᵢ (Note: A restriction on the βᵢ's is needed, such as Σᵢ₌₁ⁿ βᵢ = 0, like you would see in STAT 802 or 870). Also, see the previous LRT work with saturated models.
Pearson statistic:

For Poisson regression, the statistic is:

X² = Σᵢ₌₁ⁿ (yᵢ - μ̂ᵢ)² / μ̂ᵢ

The statistic can be approximated by a χ² distribution with n - (# of model parameters) = n - 2 degrees of freedom for large n. In order for the χ² approximation to work well, μ̂ᵢ should not be small.
LRT statistic:

For Poisson regression, the statistic simplifies from -2log(Λ) to

G² = 2 Σᵢ₌₁ⁿ yᵢ log( yᵢ / μ̂ᵢ )   where μ̂ᵢ = e^(α̂ + β̂xᵢ)

This statistic is often denoted by G² and was already introduced on p. 3.68. It can be approximated by the same distribution as used with the Pearson statistic, and it has the same potential problems.
Example: Horseshoe crabs and satellites (horseshoe.R,
Table3.3.R, horseshoe.txt)
> summary(mod.fit)
Call:
glm(formula = satellite ~ width, family = poisson(link = log), data = crab,
    na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50,
    trace = T))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8526  -1.9884  -0.4933   1.0970   4.9221

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54222  -6.095 1.10e-09 ***
width        0.16405    0.01996   8.217  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 632.79  on 172  degrees of freedom
Residual deviance: 567.88  on 171  degrees of freedom
AIC: 927.18

Number of Fisher Scoring iterations: 5
> #LRT: -2log(lambda)
> mod.fit$deviance
[1] 567.88
> #p-value
> 1-pchisq(q = mod.fit$deviance, df = mod.fit$df.residual)
[1] 0

> #Pearson statistic
> sum(pearson1^2)
[1] 544.157
> 1-pchisq(q = sum(pearson1^2), df = mod.fit$df.residual)
[1] 0
The p-values for the LRT and the Pearson statistic test
are quite small indicating evidence of lack of fit.
However, one should be concerned with the chi-square
approximation working here. What can be done then?
There are no choices that always work. Here are two
possibilities.
1) Convert the data to a rate data format and perform the
same tests.
Note that there are still a number of times where μ̂ < 5. Therefore, the χ² distribution approximation may be poor here as well. Below is part of the output given previously from the glm() function.
> summary(mod.fit.rate)
Call:
glm(formula = y ~ width + offset(log(t)), family =
poisson(link = log), data = rate.data, na.action =
na.exclude, control = list(epsilon = 1e-04, maxit = 50,
trace = T))
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.8003  -1.4515  -0.3788   0.6619   4.7586

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54218  -6.095 1.09e-09 ***
width        0.16405    0.01996   8.217  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 254.94  on 65  degrees of freedom
Residual deviance: 190.03  on 64  degrees of freedom
AIC: 402.52

Number of Fisher Scoring iterations: 4
The "residual deviance" given in the output is G². To find the Pearson statistic, sum the squared Pearson residuals. Below is the R code for both of the goodness-of-fit tests.
> #LRT p-value
> 1-pchisq(q = mod.fit.rate$deviance,
mod.fit.rate$df.residual)
[1] 1.998401e-14
> #Pearson statistic and p-value
> pearson.rate<-resid(object = mod.fit.rate,
type="pearson")
> sum(pearson.rate^2)
[1] 174.2737
> 1-pchisq(q = sum(pearson.rate^2),
mod.fit.rate$df.residual)
[1] 3.759215e-12
Both statistics indicate the model does not fit the data well - if we believe the χ² distribution approximation.
2) Form artificial groups (like in Table 3.3) and compute
ad-hoc versions of these tests.
The purpose of forming these groups is to have each group's mean value larger than 5 or so, which avoids the previous problems. In order to form a Pearson statistic, one can fit the model as usual and compute Pearson residuals for groups of size nₖ containing "alike" observations, resulting in

( Σₖ₌₁^nₖ yₖ - Σₖ₌₁^nₖ μ̂ₖ ) / √( Σₖ₌₁^nₖ μ̂ₖ )

The sum of these squared Pearson residuals then forms a Pearson statistic. A LRT statistic can be found in the corresponding manner.
There are two problems with this approach:
a) There are many different ways to form the groups, and one could choose a variety of different numbers of groups. Your answers could change due to these choices.
b) The usual type of distributional approximation is chi-square with g - 2 degrees of freedom, where g is the number of groups and two parameters are being estimated (α and β). The statistics, though, do not have this same type of chi-square distribution, so formal hypothesis tests should not be done.
While this approach does have its problems, I like to
use it as an informal way to assess the model overall
along with graphical approaches if possible.
Table3.3.R shows a few different ways to evaluate the model in this manner. Using the Table 3.3 categories, below are the Pearson and LRT statistics along with a graphical assessment. The code for these results is available in the program.
[Figure: "Horseshoe crab data set with poisson regression model fit" - number of satellites vs. width (cm), with the Table 3.3 observed group means and predicted group means (using my interpretation) marked.]
Ad-hoc statistic   Value   Degrees of freedom   p-value
X²                  6.48                    6      0.37
G²                  6.89                    6      0.33
Below is a more general way (also in the program) to
assess the fit of the model using different groups.
> ###############################################################
> # More general way to put observations into classes

> #Find 8 (9 quantiles) groups (why 8? Since Agresti had chosen 8 - other
>   #choices could have been made)
> cutoff<-quantile(crab$width, probs = 0:8/8, na.rm = F)
> cutoff
   0% 12.5%   25% 37.5%   50% 62.5%   75% 87.5%  100%
21.00 23.85 24.90 25.65 26.10 26.90 27.70 28.70 33.50

> #Use midpoint for the width group designation; note that I could have used
>   #the mean width among all crabs within the group as well - there is not
>   #one correct way to do this.
> groups<-ifelse(crab$width<cutoff[2], (cutoff[2]+cutoff[1])/2,
    ifelse(crab$width<cutoff[3], (cutoff[3]+cutoff[2])/2,
    ifelse(crab$width<cutoff[4], (cutoff[4]+cutoff[3])/2,
    ifelse(crab$width<cutoff[5], (cutoff[5]+cutoff[4])/2,
    ifelse(crab$width<cutoff[6], (cutoff[6]+cutoff[5])/2,
    ifelse(crab$width<cutoff[7], (cutoff[7]+cutoff[6])/2,
    ifelse(crab$width<cutoff[8], (cutoff[8]+cutoff[7])/2,
    (cutoff[9]+cutoff[8])/2)))))))

> library(nlme) #Need package for the gsummary() function - don't need to
>   #rerun if already did before
> crab.group<-data.frame(crab2, groups)
> sat.count<-gsummary(object = crab.group, FUN = length, groups = groups)
> sat.sum<-gsummary(object = crab.group, FUN = sum, groups = groups)
> new.table3.3<-data.frame(width.group = sat.count$groups,
    number.cases = sat.count$satellite, number.sat = sat.sum$satellite,
    mean.per.group = sat.sum$satellite/sat.count$satellite,
    fitted.count = round(sat.sum$predicted,1),
    Pearson.residual = round((sat.sum$satellite -
      sat.sum$predicted)/sqrt(sat.sum$predicted),2))
> new.table3.3
  width.group number.cases number.sat mean.per.group fitted.count Pearson.residual
1      22.425           22         20      0.9090909         35.6             -2.62
2      24.375           21         40      1.9047619         42.4             -0.36
3      25.275           22         60      2.7272727         50.5              1.34
4      25.875           20         68      3.4000000         50.9              2.40
5      26.500           23         47      2.0434783         64.4             -2.17
6      27.300           20         69      3.4500000         64.6              0.55
7      28.200           22        102      4.6363636         81.9              2.23
8      31.100           23         99      4.3043478        114.8             -1.48
> #Pearson statistic
> cat("Ad-hoc Pearson statistic:", round(sum(new.table3.3$Pearson.residual^2),2),
    "with 6 DF results in a p-value of",
    round(1-pchisq(sum(new.table3.3$Pearson.residual^2), 6),2),
    "using a chi-square distribution approximation \n")
Ad-hoc Pearson statistic: 26.72 with 6 DF results in a p-value of 0 using a
  chi-square distribution approximation

> #G^2
> G.sq2<-2*sum(new.table3.3$number.sat *
    log(new.table3.3$number.sat/new.table3.3$fitted.count))
> cat("Ad-hoc G^2 statistic:", round(G.sq2,2), "with 6 DF results in a p-value of",
    round(1-pchisq(G.sq2, 6),2), "using a chi-square distribution approximation \n")
Ad-hoc G^2 statistic: 27.29 with 6 DF results in a p-value of 0 using a
  chi-square distribution approximation
> #This is interesting that these two measures suggest the model does not fit
>   #well! I would hope that goodness-of-fit conclusions would be invariant to
>   #the way one chooses to group the observations. Possibly, this is an
>   #example of why ad-hoc procedures can not always be trusted.

> #Visual assessment
> win.graph(width = 6, height = 6, pointsize = 10)
> plot(x = crab$width, y = crab$satellite, xlab = "Width (cm)",
    ylab = "Number of satellites",
    main = "Horseshoe crab data set \n with poisson regression model fit",
    panel.first = grid(col = "gray", lty = "dotted"))
> curve(expr = exp(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x),
    lty = 1, col = "red", add = TRUE)
> points(x = new.table3.3$width.group, y = new.table3.3$mean.per.group,
    pch = 18, col = "darkgreen", cex = 2)
> #Notice these points are not on the estimated model line; probably due to
>   #using the group average value for the x-axis instead of a weighted mean
>   #like was done for the previous plot
> points(x = new.table3.3$width.group,
    y = new.table3.3$fitted.count/new.table3.3$number.cases,
    pch = 17, col = "darkblue", cex = 2)
> #Put group breaks on plot
> for (i in (2:8)) {
    abline(v = cutoff[i], lty = 1, col = "lightgreen")
  }
> legend(locator(1), legend = c("Obs. group means",
    "Predicted group means (using my interpret)"), pch = c(18,17),
    col = c("darkgreen","darkblue"), cex = 0.75, bg = "white")
[Figure: "Horseshoe crab data set with poisson regression model fit" - number of satellites vs. width (cm), with observed group means (diamonds), predicted group means (triangles), the fitted Poisson curve, and vertical lines at the group breaks.]
With the negative binomial regression model, note that G² = 195.81. Using a χ²₁₇₁ approximation, we obtain a p-value of 0.0939.
3.5 Fitting generalized linear models
GLMs are fit (i.e., parameter estimates found) using maximum likelihood estimation. Except in simple cases, there is not a closed-form formula for the Chapter 3 models that gives the parameter estimates.
For Poisson regression, the likelihood function is

ℓ(μ₁, …, μₙ | y₁, …, yₙ) = ∏ᵢ₌₁ⁿ f(yᵢ) = ∏ᵢ₌₁ⁿ μᵢ^yᵢ e^(-μᵢ) / yᵢ!

(note that there are n different parameters here). Then the log likelihood function is

log ℓ(μ₁(x), …, μₙ(x) | y₁, …, yₙ) = Σᵢ₌₁ⁿ { yᵢ log[μᵢ(x)] - μᵢ(x) } - Σᵢ₌₁ⁿ log(yᵢ!)

  ∝ Σᵢ₌₁ⁿ { yᵢ log[μᵢ(x)] - μᵢ(x) }

where ∝ means proportional (the last term does not depend on the parameters).

Since μᵢ(x) = e^(α + βxᵢ), this implies

log[ℓ(α, β | y₁, …, yₙ)] = Σᵢ₌₁ⁿ { yᵢ(α + βxᵢ) - e^(α + βxᵢ) }

Now there are only two parameters!
The Chapter 3 additional lecture notes give additional general information about one common procedure, the Newton-Raphson method, and how it can be used to find the maximum likelihood estimates in an iterative manner. Pay special attention to how "convergence" is obtained.
Note that the glm() function itself maximizes the likelihood with Fisher scoring (iteratively reweighted least squares), which is why its output reports the number of Fisher scoring iterations. More generally, the optim() function has a few different iterative procedures (some do not need derivatives) that can be used to perform the maximization (equivalently, minimization of the negative log likelihood function).
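For illustration only, a sketch of maximizing the Poisson regression log likelihood directly with optim() for the horseshoe crab data (this is not what glm() does internally; it simply shows that a general-purpose optimizer reaches the same estimates):

#Negative log likelihood for the model log(mu) = alpha + beta*x
neg.loglik <- function(theta, y, x) {
  mu <- exp(theta[1] + theta[2]*x)
  -sum(y*log(mu) - mu)  #The log(y!) term is dropped; it does not involve theta
}

#Minimize the negative log likelihood (equivalently, maximize the log likelihood)
fit.optim <- optim(par = c(0, 0), fn = neg.loglik, y = crab$satellite,
  x = crab$width, method = "BFGS")
fit.optim$par  #Should be close to the glm() estimates of -3.305 and 0.164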