STA 4107/5107
Chapter 5: Multiple Discriminant Analysis and Logistic Regression
March 19, 2007
1  Key Terms
Please review and learn these terms.
2  What are Discriminant Analysis and Logistic Regression?
Discriminant analysis and logistic regression are dependence techniques whose goal is to classify cases into the categories of a categorical variable on the basis of metric variables (some refer to them as classification techniques). For both techniques we must have a "training set" that contains values for the class variable and the predictor variables; that is, we must have a data set in which the group (or population) membership of every case is known. Once the discriminant function is found from the training set, using either technique, we can classify an unknown case based on the values of its predictor variables. If group membership is unknown even in the training set, then cluster analysis is the appropriate technique.
2.1  Examples
1. An archaeologist wishes to determine which of three possible tribes created a particular statue
found in a dig. The archaeologist takes measurements from statues produced by the three
tribes, as well as the unknown statue. The known statues are used to train a discriminant
function and then the values from the unknown statue are plugged into the discriminant
function which then classifies the statue into one of the three tribes.
2. Lubischew (1962) considers the problem of discriminating between three species of flea beetles, Chaetocnema concinna, C. heikertingeri, and C. heptapotamica, based on various physical measurements.
3. The US forest service would like to identify the personal characteristics of residents near a
reservoir that predict whether that person will fish as an adult, with the goal of increasing
recreational fishing in the area.
4. Investigators are interested in the relationship between island size and bird extinctions. On each island they count the number of species that went extinct out of all the species on the island. The investigators would like to characterize the relationship between the area of an island and the probability of extinction of birds present on the island.
5. Investigators are interested in moth coloration and natural selection. At a number of distances from Liverpool they count the number of moths of each morph that were taken by predators. They would like to quantify the relationship between the distance from Liverpool, where trees are dark from industrial soot, and the probability of predation on the light and dark morphs of the moth Carbonaria.
6. Researchers are interested in survival in the Donner Party. What is the relationship between
age and sex of individuals in the Donner Party and whether or not they survived?
7. In a study of winter habitat selection by pronghorn in the Red Rim area in south-central
Wyoming (Source: Manly, McDonald, and Thomas 1993, Resource Selection by Animals,
Chapman and Hall, pp. 16-24; data from Ryder 1983), presence/absence of pronghorn during
winters of 1980-81 and 1981-82 were recorded for a systematic sample of 256 plots of 4 ha.
each. Other variables recorded for each plot were: density (in thousands/ha) of sagebrush,
black greasewood, Nuttal’s saltbrush, and Douglas rabbitbrush, and the slope, distance to
water, and aspect. The investigators were interested in which variables are most strongly
associated with presence/absence of pronghorn and whether they could formulate a model to
predict the probability that pronghorn will be present on a plot.
3  Discriminant Analysis
Discriminant analysis is used to classify cases into one of two or more groups or populations on the basis of a set of variables measured on each case. The populations are known a priori to be distinct, and each case in the training data is known to belong to one of them. The discriminant variate, also called the discriminant function, is the linear combination of the independent variables that best discriminates between the a priori identified groups. Discrimination is achieved by finding the linear combination that maximizes the differences between the groups. With n observations and p independent variables (1 ≤ i ≤ p and 1 ≤ k ≤ n), the jth discriminant function is given by:

Zjk = a + W1 X1k + W2 X2k + · · · + Wp Xpk        (1)

where Zjk is the discriminant Z score calculated from the jth discriminant function for the kth observation, a is the intercept, Wi is the discriminant weight for independent variable i, and Xik is the value of the ith independent variable for the kth observation.
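To make equation (1) concrete, a discriminant score is just a weighted sum of a case's predictor values. A minimal DATA step sketch, using the iris variables introduced later in these notes and purely hypothetical weights (in practice the intercept and weights come from PROC DISCRIM):

data zscores;
  set iris;                                   /* iris data set used later in these notes */
  a = 0; w1 = 0.1; w2 = 0.2; w3 = 0.8; w4 = 0.6;   /* illustrative values only */
  z = a + w1*SepalLength + w2*SepalWidth + w3*PetalLength + w4*PetalWidth;  /* discriminant score */
run;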
For purposes of this discussion we will consider the two-group case, mainly for conceptual and graphical ease. In practice, however, this technique is most useful when there are more than two populations, because logistic regression is the more common choice when there are only two groups.
Suppose that we have a representative sample from both populations and have just one measurement on each observation (again for graphical simplicity). We could choose a discriminant
function to separate the two populations and then look at the distribution of discriminant scores.
These distributions might look something like those shown in the figure below.
The discriminant function shown in the upper portion of the figure is doing a better job of separating the two populations, though the separation is not perfect. The dividing point is the point on the X axis that is directly below the point of overlap between the two groups. Discriminant scores that are "low" will be classified as Group A and those that are "high" will be classified as Group B. The lighter shaded region on the left represents the percent of group B that will be falsely classified as group A. The darker shaded region on the right represents the percent of group A that will be falsely classified as group B. Our goal in discriminant analysis is to find a discriminant function that minimizes both of these errors. For example, if the two populations have the same variance, then the dividing point C will be

C = (Z̄A + Z̄B) / 2,

the average of the two sample average discriminant scores.
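For instance, if the average discriminant scores in the two samples were Z̄A = −1.0 and Z̄B = 3.0 (hypothetical values chosen only for illustration), the cutting score would be C = (−1.0 + 3.0)/2 = 1.0, and a new case would be assigned to Group A if its discriminant score fell below 1.0 and to Group B otherwise.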
3.1  Fisher's Iris Data: an example
The Iris data set we will use was originally introduced by R. A. Fisher as an example for discriminant analysis. The data report four characteristics (sepal width, sepal length, petal width and petal length) of three species of Iris flower: Iris setosa, Iris versicolor, and Iris virginica. The first figure shows the separation of the three species using petal length and width, that is, in two dimensions. The second figure shows the sample distribution by petal width. You can see the overlapping observations. We expect better separation between Setosa and Versicolor than between Versicolor and Virginica.
3.1.1  Stages of Analysis
Identify the Objectives
As with all statistical analyses we need to be sure of our objectives before we begin. For these data
we are interested in testing whether the 4 morphological characteristics are good classifiers for the
three species.
• profile analysis: We may wish to test whether the differences between the species are significant. We can do this via a statistical hypothesis test on the mean score profiles.
• classification: Or we could possibly have a newly discovered flower whose species is disputed by the experts, and we would like to see to which of the three species it belongs.
Research Design
• The Dependent Variable
In most cases the researcher will know which variable is the dependent variable and how many categories it has. However, this is not always straightforward. It could be the case that the investigator is using a variable that could be viewed as continuous, but feels it is more appropriate to categorize it, either because of limited precision or for simplification. It is usually better, in this case, to create a small number of categories. The categories should be exhaustive and mutually exclusive (every possible observation belongs to exactly one category), and there should be no more categories than are necessary.
• The Independent Variables
The researcher should, again, choose or measure the variables of interest and no more. It
is my opinion that looking at the data to decide which variables to include as independent
variables in the analysis is a form of data snooping. The researcher should have a scientific
theory that dictates which variables to include. If there is no theory, then the analysis might
still be appropriate, but should be reported as exploratory and not as a scientific analysis.
• Sample Size A rule of thumb is that there should be at least 5 observations for each predictor
variable and that the number of observations in each category should be at least one more
than the number of predictors. For the iris example, we have 3 categories and 4 independent
variables, so we need at least 5 observations in each of the three species categories and at
least 20 observations altogether. It will also work better if there is approximately the same
number of observations in each category, so our minimal experimental design is to have at
least 7 observations for each species, for a total of 21 observations. Of course, we would do
well to have quite a few more than this.
If the sample size becomes too large, even small differences in the discriminant scores between the groups will be significant and the researcher will always need to be cognizant of
what effect size is scientifically meaningful. This is not straightforward when we are looking at differences in discriminant scores, so some exploratory analysis beforehand is advisable.
• Cross Validation
Another consideration when planning an experiment is a sample size that would allow cross-validation. Cross validation consists of removing a subset of observations from the sample, performing the analysis on the remaining observations, and then using the removed subset as a test set to see how well the procedure performs on observations not included in the original analysis. Your text suggests that using error estimates obtained from the entire data set is better than none at all, though it does admit that there will be bias in the estimates. However, leave-one-out cross validation is a much better approach.
In leave-one-out cross validation, only one observation is removed at a time. The analysis is performed on the remaining n − 1 observations and then the left-out observation is used as a test case. This procedure is repeated for all data points, yielding a prediction-error estimate for every observation. This approach will work well for any sample larger than, say, 5, and is the most commonly used cross-validation procedure. We will see how to do this in SAS later; a brief preview is sketched below.
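As a preview, leave-one-out cross validation is requested in PROC DISCRIM with the crossvalidate and crosslisterr options. A minimal sketch, assuming the iris data set analyzed later in these notes:

proc discrim data=iris method=normal pool=test crossvalidate crosslisterr;
  class Species;                                        /* known group membership */
  var SepalLength SepalWidth PetalLength PetalWidth;    /* predictors */
run;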
Assumptions of Discriminant Analysis
The main assumption of discriminant analysis is that, within each population, the independent variables are normally distributed. There are different options for data sets with equal covariance matrices versus unequal covariance matrices. There is some disagreement as to how robust the method is to these assumptions. If the investigator is only interested in characterization, then distributional assumptions are not necessary. That is, unless we would like to conduct formal hypothesis testing, we do not need to know the distribution.
When we wish to conduct hypothesis testing with data that do not meet the assumption of normality, we can try the standard transformations available. There are also non-parametric methods available, where no distributional assumptions are made. If the equal covariance matrices assumption is violated, we can use the individual within-group covariance matrices rather than the pooled matrix. If the assumption is valid, we will have more power if we use the pooled covariance matrix.
We can test both of these assumptions in SAS. The normality assumption can be tested using
proc univariate. The equal covariance assumption can be tested within the proc discrim
analysis.
3.1.2  SAS Code and Output
proc univariate data=iris normal;
  var SepalLength;
  by species;
run;
Results for I. virginica for the variable SepalLength are shown below. The results for the other species and variables are similar. The tests for normality are not significant, so we can proceed.
------------------------------------------ species=3 ------------------------------------------

The UNIVARIATE Procedure
Variable: SepalLength

                      Tests for Normality
Test                  --Statistic---      -----p Value------
Shapiro-Wilk          W       0.971179    Pr < W       0.2583
Kolmogorov-Smirnov    D       0.115034    Pr > D       0.0953
Cramer-von Mises      W-Sq    0.089467    Pr > W-Sq    0.1538
Anderson-Darling      A-Sq    0.551641    Pr > A-Sq    0.1506
Estimation of the Discriminant Model and Assessing Overall Fit
With SAS we have a large number of options. Because the data are normal, we can use methods that require the normal assumption. However, we'd like to be careful about the equal covariance assumption. The option pool=test will test whether it is appropriate to use the pooled covariance matrix or whether the individual within-group covariance matrices should be used instead.
SAS Code and Output
proc discrim data=iris outstat=irisstat
             wcov pcov method=normal pool=test
             distance anova manova listerr crosslisterr;
  class Species;
  var SepalLength SepalWidth PetalLength PetalWidth;
  title2 'Discriminant Analysis of Iris Data';
run;
The DISCRIM Procedure
Within-Class Covariance Matrices

species = 1,  DF = 49
Variable       SepalLength     SepalWidth      PetalLength     PetalWidth
SepalLength    0.1242489796    0.1002979592    0.0161387755    0.0105469388
SepalWidth     0.1002979592    0.1451795918    0.0116816327    0.0114367347
PetalLength    0.0161387755    0.0116816327    0.0301061224    0.0056979592
PetalWidth     0.0105469388    0.0114367347    0.0056979592    0.0114938776
------------------------------------------------------------------------------

species = 2,  DF = 49
Variable       SepalLength     SepalWidth      PetalLength     PetalWidth
SepalLength    0.2664326531    0.0851836735    0.1828979592    0.0557795918
SepalWidth     0.0851836735    0.0984693878    0.0826530612    0.0412040816
PetalLength    0.1828979592    0.0826530612    0.2208163265    0.0731020408
PetalWidth     0.0557795918    0.0412040816    0.0731020408    0.0391061224
------------------------------------------------------------------------------

species = 3,  DF = 49
Variable       SepalLength     SepalWidth      PetalLength     PetalWidth
SepalLength    0.4043428571    0.0937632653    0.3032897959    0.0490938776
SepalWidth     0.0937632653    0.1040040816    0.0713795918    0.0476285714
PetalLength    0.3032897959    0.0713795918    0.3045877551    0.0488244898
PetalWidth     0.0490938776    0.0476285714    0.0488244898    0.0754326531
------------------------------------------------------------------------------

The DISCRIM Procedure
Pooled Within-Class Covariance Matrix,  DF = 147
Variable       SepalLength     SepalWidth      PetalLength     PetalWidth
SepalLength    0.2650081633    0.0930816327    0.1674421769    0.0384734694
SepalWidth     0.0930816327    0.1158843537    0.0552380952    0.0334231293
PetalLength    0.1674421769    0.0552380952    0.1851700680    0.0425414966
PetalWidth     0.0384734694    0.0334231293    0.0425414966    0.0420108844
The DISCRIM Procedure
Test of Homogeneity of Within Covariance Matrices

Notation:  K    = number of groups
           P    = number of variables
           N    = total number of observations - number of groups
           N(i) = number of observations in the i'th group - 1

           V    = [ product over i of |Within SS Matrix(i)|^(N(i)/2) ] / |Pooled SS Matrix|^(N/2)

           RHO  = 1.0 - [ SUM(1/N(i)) - 1/N ] * (2P^2 + 3P - 1) / ( 6(P+1)(K-1) )

           DF   = .5(K-1)P(P+1)

Under the null hypothesis, the statistic

           -2 RHO ln [ N^(PN/2) V / (product over i of N(i)^(P N(i)/2)) ]

is distributed approximately as Chi-Square(DF).

           Chi-Square        DF    Pr > ChiSq
           139.236945        20        <.0001
Since the Chi-Square value is significant at the 0.1 level, the within
covariance matrices will be used in the discriminant function.
Univariate Test Statistics
F Statistics, Num DF=2, Den DF=147

              Total     Pooled    Between                R-Square
Variable      Std Dev   Std Dev   Std Dev    R-Square    / (1-RSq)   F Value    Pr > F
SepalLength   0.8281    0.5148    0.7951     0.6187       1.6226      119.26    <.0001
SepalWidth    0.4336    0.3404    0.3313     0.3919       0.6444       47.36    <.0001
PetalLength   1.7644    0.4303    2.0896     0.9413      16.0413     1179.03    <.0001
PetalWidth    0.7632    0.2050    0.8978     0.9288      13.0520      959.32    <.0001

Average R-Square
Unweighted              0.7201854
Weighted by Variance    0.868708
Multivariate Statistics and F Approximations
S=2   M=0.5   N=71

Statistic                  Value          F Value    Num DF    Den DF    Pr > F
Wilks' Lambda              0.02352545      198.71       8       288      <.0001
Pillai's Trace             1.18720676       52.95       8       290      <.0001
Hotelling-Lawley Trace    32.54952466      583.49       8       203.4    <.0001
Roy's Greatest Root       32.27195780     1169.86       4       145      <.0001
NOTE: F Statistic for Roy’s Greatest Root is an upper bound.
NOTE: F Statistic for Wilks’ Lambda is exact.
Posterior Probability of Membership in species

       From       Classified
Obs    species    into species       1         2         3
 71       2          3 *          0.0000    0.3359    0.6641
 84       2          3 *          0.0000    0.1543    0.8457
134       3          2 *          0.0000    0.6050    0.3950

* Misclassified observation
Number of Observations and Percent Classified into species

From species         1           2           3        Total
1                   50           0           0          50
                100.00        0.00        0.00      100.00
2                    0          48           2          50
                  0.00       96.00        4.00      100.00
3                    0           1          49          50
                  0.00        2.00       98.00      100.00
Total               50          49          51         150
                 33.33       32.67       34.00      100.00
Priors         0.33333     0.33333     0.33333
Error Count Estimates for species

               1          2          3       Total
Rate        0.0000     0.0400     0.0200     0.0200
Priors      0.3333     0.3333     0.3333
The assumption of equal covariances is violated for these data. We can still use discriminant
analysis to separate the data for us, but we cannot use the pooled estimate. Simulations have shown
that discriminant analysis works well most of the time, even when the assumptions are violated.
We just need to be careful about how we interpret the results.
The discrimination results are pretty good: only three observations are misclassified. Now let's look at our cross-validation results. To do this we need to go into SAS and look at the data set created by the cross-validation procedure. The same three points that were misclassified during "training" were also misclassified during cross-validation.
4  Logistic Regression
The simple logistic regression model is useful in the situation where there is a categorical response
variable with two levels, and a continuous predictor. The two levels of the response can be labeled
with the numbers 1 and 0. Of course, the values “one” and “zero” are arbitrary assignments to
levels of y, such as “in remission” and “not in remission,” or “toxic effect” and “no toxic effect,”
or “success” and “failure.” It is traditional to refer to the y = 1 level as “success” even though it
might be a label for an event like “gets cancer” or “dies.”
The multiple logistic regression model is a way to model the probability of a success, as a
function of continuous predictors X1 , · · · , Xp . It is appropriate when this probability is either
increasing over the range of x-values, or decreasing over the range.
When the response y can be either zero or one, the interest is in estimating the probability that
y = 1 at values of the predictors, instead of estimating the response y itself. The simple logistic
regression function is a probability curve. The curve is described by the parameters β0 and
β1 , · · · , βp which are estimated by b0 , · · · , bp . The estimates and their standard errors are computed
by SAS, given data consisting of tuples (y, x1 , · · · , xp ). For a particular set of values of the
explanatory variables:
µ(Y | X1 , · · · , Xp ) = π = the proportion of 1s in the population.

Rather than model µ(Y | X1 , · · · , Xp ) as a linear function of the explanatory variables, we model

logit(π) = ln( π / (1 − π) ) = β0 + β1 X1 + · · · + βp Xp ,

that is, we model the logit of π as a linear function of the explanatory variables. Consider that the probability that Y = 1 is equal to its population proportion, π. Consider also that if η = logit(π), then π = e^η / (1 + e^η), i.e., the inverse logit gives us back the population proportion.
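For example (purely as an illustration), if the linear predictor is η = 0 then π = e^0 / (1 + e^0) = 0.5, and if η = 2 then π = e^2 / (1 + e^2) ≈ 0.88. The inverse logit always returns a value between 0 and 1, which is what allows us to model a probability with an unbounded linear function of the predictors.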
Example: Carcinogens in Rats
A classic example in logistic regression is the dose-response problem, in which increasing sizes
of dose are thought to be associated with increasing probability of a response. Suppose lab rats are
exposed to a sequence of doses of a substance thought to cause cancer. The independent variable
is the size of the dose (say, in milligrams), and the response is whether or not they develop cancer.
Let’s say y=1 if the rat develops cancer, y=0 otherwise. Suppose we have these data:
x    1   2   3   4   5   6   7   8   9   10
y    0   0   0   0   0   1   0   0   1    0

x   11  12  13  14  15  16  17  18  19   20
y    0   0   1   0   1   1   1   1   1    1
If you plot the data, the "scatterplot" has only two response values, as shown below. It is clear that fitting a line to the data would not represent the relationship between y and x.

[Figure: plot of y (0 or 1) against x = size of dose (mg).]

It looks like higher doses correspond to higher probabilities of cancer. In other words, if you have a large population of rats all receiving the same dose, a certain proportion of them will get cancer, and this proportion is larger for larger doses. To model the probability of getting cancer at dose x, we need a functional form with certain properties. First, the range of the function used to model probability must be between zero and one. Second, the function must either be increasing with x over the whole range of values, or decreasing with x over the whole range of values.
The logistic regression model does not use the actual value of the response (zero or one) as the underlying value of the function we are fitting to the data. Instead, the probability of y = 1, or P(y = 1), is used as the underlying function. We need a function to fit whose range is between zero and one (because it models a probability), and is increasing as the independent variable increases. For the ith response, we have

P(yi = 1) = f(xi) = e^(β0 + β1 xi) / (1 + e^(β0 + β1 xi)).

This is called the logistic function. We can see that the function is always between zero and one, and that it is increasing if β1 > 0 and decreasing if β1 < 0. We use the logistic function as our "probability curve." Note that if β1 = 0, then the probability that y = 1 is

P(yi = 1) = f(xi) = e^(β0) / (1 + e^(β0)),

for all values of x. In other words, if β1 = 0, then the probability of success does not depend on the value of x.
[Figure: Probability of Cancer. Estimated probability curve plotted against x = size of dose (mg), with the observed data marked as circles.]

The object in simple logistic regression is to estimate the probability curve by estimating the parameters β0 and β1, and doing inference about the curve. Usually, interest is in determining if β1 = 0, that is, if the probability that y = 1 depends on x. SAS provides estimates b0 and b1 of β0 and β1, respectively. There are no formulas for these estimates; a computer must be used to calculate them.

Suppose the estimated parameters for the rat data are b0 = −4.097 and b1 = 0.3544. The estimated probability curve is shown in the figure, with the data marked as circles.
Now we can calculate the probability that a rat will develop cancer at a dose x = 10. We plug 10 into the above formula:

P(y = 1) = e^(−4.097 + (0.3544)(10)) / (1 + e^(−4.097 + (0.3544)(10))) = e^(−0.553) / (1 + e^(−0.553)) ≈ 0.365

and see that the estimated probability of cancer at x = 10 is about 0.365.
Now you try it:
1. What is the estimated probability of cancer for dose x = 15?
2. What is the probability of y = 1 at x = 4?
3. What is the probability of y = 1 at x = 14?
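To check answers like these, a short DATA step (a sketch; the data set name ratprobs is arbitrary, and the coefficients are the estimates given above) evaluates the fitted logistic function at several doses:

data ratprobs;
  b0 = -4.097;  b1 = 0.3544;                   /* estimated parameters for the rat data */
  do x = 4, 10, 14, 15;                        /* doses of interest */
    p = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)); /* estimated P(y = 1) at dose x */
    output;
  end;
run;
proc print data=ratprobs; run;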
4.1  Odds Ratios
Let f(x) represent the logistic function, or the probability of a success at x. Then the probability of failure, or y = 0, is

1 − f(x) = 1 − e^(β0 + β1 x) / (1 + e^(β0 + β1 x)) = 1 / (1 + e^(β0 + β1 x)).

Now we define the odds as

odds = P(Yi = 1) / P(Yi = 0) = f(x) / (1 − f(x)) = e^(β0 + β1 x).
The odds of a success are the probability of success divided by the probability of failure. If “odds=2”
then success is twice as likely as failure. In this case, the probability of success is 2/3.
Now we define the log odds of success, given x, as

log odds = log[ f(x) / (1 − f(x)) ] = β0 + β1 x,
so the log odds is linear in the predictor variable. This gives us some language to use when we
interpret the parameters:
The parameter β1 is the increase in the log odds associated with an increase of one unit in
the x variable.
The parameter β0 is the log odds associated with x = 0.
Pronghorn Example: In the pronghorn example, suppose, hypothetically, the true relationship between probability of use and distance to water (in meters) followed the logit model:

logit(π) = 3 − 0.0015 Water

Then,

π = e^(3 − 0.0015 Water) / (1 + e^(3 − 0.0015 Water))

Calculate π for the following distances (note that it might be easier to compute 1 − π and then subtract from 1). A short SAS sketch for these calculations follows the interpretation bullets below.

Water = 100 m.        1000 m.        3000 m.
• What is the interpretation of the coefficient -.0015 for the variable distance to water?
For every 1 m. increase in the distance to water, the log-odds of use decrease by .0015; for
every 1 km. increase in distance to water, log-odds of use decrease by 1.5.
• More meaningful: for every 1 m. increase in the distance to water, the odds of use change
by a multiplicative factor of e−0.0015 = .999. For every 1 km. increase in distance to water,
the odds of use change by a multiplicative factor of e−1.5 = .223 (we could also reverse these
statements; for example, the odds of use increase by a factor of e1.5 for every km. closer to
water.)
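A quick way to evaluate these probabilities (a sketch, using the hypothetical coefficients from the model above; the data set name is arbitrary):

data pronghorn_probs;
  b0 = 3;  b1 = -0.0015;                            /* hypothetical logit model */
  do water = 100, 1000, 3000;                       /* distance to water in meters */
    pi = exp(b0 + b1*water) / (1 + exp(b0 + b1*water));  /* probability of use */
    output;
  end;
run;
proc print data=pronghorn_probs; run;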
4.2  Variance in the Logistic Regression Model
The 0/1 response variable Y is a Bernoulli random variable:

µ(Y | X1 , · · · , Xp ) = π
SD(Y | X1 , · · · , Xp ) = sqrt( π(1 − π) )

The variance of Y is not constant.
The logistic regression model is an example of a generalized linear model (in contrast to a
general linear model which is the usual regression model with normal errors and constant variance).
A generalized linear model is specified by:
1. a link function which specifies what function of µ(Y ) is a linear function of the X1 , · · · , Xp .
In logistic regression, the link function is the logit function.
2. the distribution of Y for a fixed set of values of X1 , · · · , Xp . In the logit model, this is the
Bernoulli distribution.
The usual linear regression model is also a generalized linear model. The link function is the
identity: that is, f (µ) = µ, and the distribution is normal with constant variance.
There are other generalized linear models which are useful (e.g., Poisson response distribution
with log link function). A general methodology has been developed to fit and analyze these models.
4.3  Estimation of Logistic Regression Coefficients
Estimation of the parameters in the linear regression model was by least squares. If we assume normal errors with constant variance, the least squares estimators are the same as the maximum likelihood estimators (MLE's).
Maximum likelihood estimation is based on a simple principle: the estimates of the parameters
in a model are the values which maximize the probability (likelihood) of observing the sample data
we have.
Example: Suppose we select a random sample of 10 UFL students in order to estimate what proportion of students own a car. We find that 6 out of the 10 own a car. What is the MLE of the proportion π of all students who own a car? What do we think it should turn out to be?
We model the responses from the students as 10 independent Bernoulli trials with probability
of success π on each trial. Then the total number of successes, say Y, in 10 trials follows a binomial
model:
Pr(Y = y) = (10 choose y) π^y (1 − π)^(10−y),   y = 0, 1, · · · , 10
The maximum likelihood principle says to find the value of π which maximizes the probability of observing the number of successes we actually observed in the sample. That is, maximize

Pr(Y = 6) = (10 choose 6) π^6 (1 − π)^4.
Can you guess what value of π maximizes this expression? We can also find the exact solution using calculus. Note that finding the value of π that maximizes Pr(Y = 6) is equivalent to finding the value of π which maximizes

ln[Pr(Y = 6)] = ln(10 choose 6) + ln[ π^6 (1 − π)^4 ] = ln(10 choose 6) + 6 ln π + 4 ln(1 − π).
Let's solve this for the general case of n and y. Taking the derivative with respect to π and setting it equal to 0, we have:

Now, solving for π:

So, for our example, our maximum likelihood estimate for π is π̂ =
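For reference, here is one way to fill in those steps (the binomial-coefficient term does not involve π, so it drops out when we differentiate):

d/dπ [ y ln π + (n − y) ln(1 − π) ] = y/π − (n − y)/(1 − π) = 0

Solving for π gives y(1 − π) = (n − y)π, so nπ = y and π̂ = y/n. For our example, π̂ = 6/10 = 0.6, which matches the intuitive answer.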
Back to logistic regression: we use the maximum likelihood principle to find estimators of the
βs in the logistic regression model. The likelihood function is the probability of observing the
particular set of failures and successes that we observed in the sample. But there’s a difference
from the binomial model above: the model says that the probability of success is possibly different
for each subject because it depends on the explanatory variables X1 , · · · , Xp .
Recall the link function: π = e^η / (1 + e^η). In logistic regression the parameter estimates β̂0 , · · · , β̂p are chosen to maximize the expression (called the likelihood)

L(β0 , · · · , βp) = exp( Σ_{i=1}^{n} yi (β0 + β1 x1i + · · · + βp xpi) ) / Π_{i=1}^{n} ( 1 + exp(β0 + β1 x1i + · · · + βp xpi) ).
The natural logarithm of this expression is called the log-likelihood:

l(β0 , · · · , βp) = Σ_{i=1}^{n} yi (β0 + β1 x1i + · · · + βp xpi) − Σ_{i=1}^{n} log( 1 + exp(β0 + β1 x1i + · · · + βp xpi) ),
and the best fit parameters are said to minimize the negative log-likelihood.
This is not as intuitive as minimizing the sum of squared errors. We can try to find a formula for the best fit β0 , β1 , · · · , βp by taking derivatives of this last expression and setting them equal to zero, but in fact this leads to a set of equations that cannot be solved in closed form. There is no formula for the estimates β̂0 , β̂1 , · · · , β̂p like there is in simple linear regression; rather, there are iterative computer algorithms to find the solution, built into the statistical packages.
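The iterative idea is easy to sketch, even though the details are handled by the software. The following PROC IML program (an illustration only, not how PROC LOGISTIC is actually coded; it uses the rat data listed earlier) applies Newton-Raphson updates to the log-likelihood:

proc iml;
  dose = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}; /* doses from the rat example */
  y    = {0,0,0,0,0,1,0,0,1,0,0,0,1,0,1,1,1,1,1,1};            /* cancer indicator */
  xmat = j(nrow(dose),1,1) || dose;        /* design matrix: intercept and dose */
  b = {0, 0};                              /* starting values for (b0, b1) */
  do iter = 1 to 25;
    eta = xmat*b;
    p   = exp(eta)/(1+exp(eta));           /* current fitted probabilities */
    grad = xmat`*(y - p);                  /* gradient of the log-likelihood */
    hess = -xmat`*diag(p#(1-p))*xmat;      /* Hessian of the log-likelihood */
    b = b - solve(hess, grad);             /* Newton-Raphson update */
  end;
  print b;                                 /* compare with PROC LOGISTIC estimates */
quit;

Each pass moves b toward the maximum likelihood estimates; PROC LOGISTIC uses essentially this kind of iterative scheme (Fisher scoring) internally.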
SAS reports the negative log likelihood for the model, evaluated at the best fit parameters β̂0 , β̂1 , · · · , β̂p. SAS automatically fits another model as well as the specified model:

P(Yi = 1) = e^(β0) / (1 + e^(β0)).

This is called the "reduced model" as compared to the "full model" described above. Notice that this reduced model does not use the predictor variables. It assumes that all of the responses have the same probability of success. This is used for a test of

H0 : βk = 0   vs.   Ha : βk ≠ 0

(for a single predictor; with several predictors the corresponding global test is that all of the slope coefficients are zero).
Clearly, if βk = 0, the probability of Y = 1 does not depend on xk . The graph of the logistic
function is a horizontal line if β1 = 0, where the vertical placement of the line is determined by the
value of β0 .
The negative log likelihood will always be larger for the reduced model. Mathematicians have
proved the following result: For large sample sizes n,
Under the null hypothesis βk = 0, twice the difference between the negative log likelihoods
of the reduced and full models is a random variable with (approximately) a chi-squared
distribution with one degree of freedom.
The Greek letter λ is used to represent the test statistic “twice the difference between the negative
log likelihoods.” Large values of this statistic support the alternative hypothesis.
A simpler way to test H0 : βk = 0 is to use the p-value reported in the parameter estimate
table, under “Analysis of Maximum Likelihood Estimates.”
Example: Teenage Drivers and the Risk of Getting a Ticket
You can find the data on the course website teendriver.txt. Here’s how to analyze the
probability of getting a ticket predicted by age and GPA:
proc logistic data=teendriver;
  model ticket(event='1') = age GPA;
  output out=teenLogRegout predprobs=I p=predprob;
run;
The event='1' option in the model statement is how we tell SAS to model the probability of getting a ticket. If we leave that option out, SAS will by default model the probability that Y = 0, i.e., the probability of not getting a ticket, because SAS chooses the response level with the "lowest" value (0 is less than 1). We could also have coded ticket as yes or no. In that case, since "no" comes first alphabetically, SAS would model the probability of not getting a ticket.
The SAS output for logistic regression is rather extensive:
The LOGISTIC Procedure

                Model Information
Data Set                        WORK.TEENDRIVER
Response Variable               ticket
Number of Response Levels       2
Model                           binary logit
Optimization Technique          Fisher's scoring

Number of Observations Read     52
Number of Observations Used     52

                Response Profile
Ordered Value       ticket      Total Frequency
      1             0                 42
      2             1                 10

Probability modeled is ticket='1'.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

                Model Fit Statistics
                Intercept       Intercept and
Criterion       Only            Covariates
AIC             52.913          47.538
SC              54.865          53.392
-2 Log L        50.913          41.538
        Testing Global Null Hypothesis: BETA=0
Test                  Chi-Square    DF    Pr > ChiSq
Likelihood Ratio          9.3750     2        0.0092
Score                     8.4071     2        0.0149
Wald                      6.7558     2        0.0341
The LOGISTIC Procedure

        Analysis of Maximum Likelihood Estimates
                            Standard      Wald
Parameter    DF  Estimate   Error         Chi-Square    Pr > ChiSq
Intercept     1   12.3331    5.4389           5.1418        0.0234
age           1   -0.5241    0.2488           4.4382        0.0351
GPA           1   -1.6469    0.7465           4.8673        0.0274
        Odds Ratio Estimates
           Point         95% Wald
Effect     Estimate      Confidence Limits
age        0.592         0.364    0.964
GPA        0.193         0.045    0.832
Association of Predicted Probabilities and Observed Responses
Percent Concordant    80.7      Somers' D    0.617
Percent Discordant    19.0      Gamma        0.618
Percent Tied           0.2      Tau-a        0.195
Pairs                  420      c            0.808
The negative value for β1 (age) means that the probability of getting a ticket goes down as age
goes up. The interpretation of β̂1 is this: for every unit increase of age, the log-odds of getting a
ticket decreases by 0.5241.
The odds ratio estimate for age is 0.592. Note that this is e^(−0.5241). This means that the odds of getting a ticket at age n + 1 are 59.2% of the odds of getting a ticket at age n. In other words, when age goes up by one unit, the odds of getting a ticket are only 59.2% of what they were the year before. The interpretation for β2 is similar. As in linear regression, interpretation of the parameters is in the context of holding the other variables constant.
There are three different test statistics for H0 : β1 = β2 = 0, the null hypothesis that the probability of getting a ticket does not depend on the age or GPA of the driver.
The likelihood ratio test compares the likelihoods under the full model and the reduced model with β1 and β2 set to zero. Here the full model has a smaller negative log likelihood than the reduced model. The test statistic λ is twice the difference in the negative log likelihoods, 50.913 − 41.538 = 9.375. Larger values of λ support the alternative hypothesis more, so the p-value is the area to the right of 9.375 under a chi-squared density with two degrees of freedom. We conclude that there is strong evidence that at least one of age or GPA is related to the probability of getting a ticket.
The Wald statistic uses the approximate distribution of the estimate of the βs and can be found
in the “analysis of maximum likelihood estimates” part of the output as well as the “testing global
hypotheses” part. Notice that here the p-value is larger. With larger datasets there is usually not
such a big difference in these two p-values. The Score statistic is beyond the scope of this discussion.
The last bit of output shows how the observed data would be classified using the model in a logistic discrimination. In logistic discrimination, the predicted value of Y is obtained by classifying the observation into the category that has the highest probability under the model. So, if a given age and GPA combination has a predicted probability of .52, then that case would be classified as Y = 1, i.e., this person got a ticket. The percent discordant indicates how often the model orders a pair of observations incorrectly, and so gives a rough sense of how well the model separates those who got tickets from those who did not.
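To see the classification rule explicitly, here is a short sketch (assuming the output data set teenLogRegout created earlier, with the predicted probability of a ticket stored in predprob):

data classify;
  set teenLogRegout;
  predicted = (predprob >= 0.5);   /* classify as a ticket if estimated P(ticket=1) is at least 0.5 */
run;
proc freq data=classify;
  tables ticket*predicted;         /* cross-classify observed versus predicted outcomes */
run;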
4.4  Model Selection in Multiple Logistic Regression
Just as there were various criteria for model selection in linear regression, we have similar criteria
in logistic regression.
4.4.1  Likelihood Ratio Tests
In linear regression, we used the extra sum-of-squares F test to compare two nested models (two models are nested if one is a special case of the other). The analogous test in logistic regression compares the values of -2 ln(maximized likelihood function).
• The quantity -2 ln(Maximized likelihood) is also called the deviance of a model since larger
values indicate greater deviation from the assumed model. Comparing two nested models by
the difference in deviances is a drop-in-deviance test.
• The difference between the values of -2ln(Maximized likelihood function) for a full and null
model has approximately a chi-square distribution if the null hypothesis that the extra parameters are all 0 is true. The d.f. is the difference in the number of parameters for the two
models. SAS performs this test automatically.
• The drop-in-deviance test is a likelihood ratio test (LR test) because it is based on the natural
log of the ratio of the maximized likelihoods (the difference of logs is the log of the ratio).
The extra sum-of squares F-test in linear regression also turns out to be a likelihood ratio
test.
• If the full and reduced models differ by only one parameter, then the likelihood ratio test is testing the same thing as the Wald test for that coefficient, although the two test statistics and p-values will generally be slightly different. The likelihood ratio test is preferred. The two tests will generally give similar results, but not always.
• The relationship between the Wald test and the likelihood ratio test is analogous to the relationship between the t-test for a single coefficient and the F-test in linear regression. However, in linear regression the two tests are exactly equivalent; not so in logistic regression.
• To use the likelihood ratio test to compare the full model against a reduced model larger than the null model, we obtain the two -2 log likelihoods and take their difference, then find the p-value from a chi-square distribution with degrees of freedom equal to the difference in the number of parameters, as in the sketch below.
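For the teen driver example, the drop-in-deviance calculation can be done by hand from the Model Fit Statistics above; a minimal sketch (the data set name dropdev is arbitrary):

data dropdev;
  reduced = 50.913;                    /* -2 Log L, intercept-only model (from above) */
  full    = 41.538;                    /* -2 Log L, model with age and GPA */
  df      = 2;                         /* number of parameters set to zero */
  chisq   = reduced - full;            /* drop in deviance = 9.375 */
  p       = 1 - probchi(chisq, df);    /* upper-tail chi-square probability */
run;
proc print data=dropdev; run;

Here chisq = 9.375 with 2 degrees of freedom, which reproduces the likelihood ratio test reported by PROC LOGISTIC (p = 0.0092).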
4.4.2  AIC and BIC
Both AIC and BIC can be used as model selection criteria. As with linear regression models, they
are only relative measures of fit, not absolute measures of fit.
AIC = Deviance + 2p
BIC = Deviance + p ln(n)
where p is the number of parameters in the model. You probably noticed that AIC was in the basic output; the SC (Schwarz Criterion) reported there is the same quantity as BIC, and since SAS also outputs the -2 log likelihood we can calculate BIC ourselves as a check (see the sketch after the comparison below).
Let's compare the two reduced models with just one predictor variable to the full model we saw above. First, consider the model with just age as the predictor:
                Model Fit Statistics
                Intercept       Intercept and
Criterion       Only            Covariates
AIC             52.913          51.355
SC              54.865          55.258
-2 Log L        50.913          47.355
        Testing Global Null Hypothesis: BETA=0
Test                  Chi-Square    DF    Pr > ChiSq
Likelihood Ratio          3.5583     1        0.0592
Score                     3.4813     1        0.0621
Wald                      3.2208     1        0.0727
The LOGISTIC Procedure

        Analysis of Maximum Likelihood Estimates
                            Standard      Wald
Parameter    DF  Estimate   Error         Chi-Square    Pr > ChiSq
Intercept     1    6.0373    4.1040           2.1640        0.1413
age           1   -0.4095    0.2282           3.2208        0.0727
Association of Predicted Probabilities and Observed Responses
Percent Concordant    61.7      Somers' D    0.367
Percent Discordant    25.0      Gamma        0.423
Percent Tied          13.3      Tau-a        0.116
Pairs                  420      c            0.683
Now the model with GPA as the predictor:
                Model Fit Statistics
                Intercept       Intercept and
Criterion       Only            Covariates
AIC             52.913          50.721
SC              54.865          54.623
-2 Log L        50.913          46.721
        Testing Global Null Hypothesis: BETA=0
Test                  Chi-Square    DF    Pr > ChiSq
Likelihood Ratio          4.1925     1        0.0406
Score                     4.1323     1        0.0421
Wald                      3.7846     1        0.0517
The LOGISTIC Procedure

        Analysis of Maximum Likelihood Estimates
                            Standard      Wald
Parameter    DF  Estimate   Error         Chi-Square    Pr > ChiSq
Intercept     1    1.6965    1.5831           1.1484        0.2839
GPA           1   -1.2155    0.6248           3.7846        0.0517
Association of Predicted Probabilities and Observed Responses
Percent Concordant    65.5      Somers' D    0.314
Percent Discordant    34.0      Gamma        0.316
Percent Tied           0.5      Tau-a        0.100
Pairs                  420      c            0.657
The model with the smallest AIC is the full model. Notice also that the percent concordant is much better with the full model than with either of the two smaller models. This is directly analogous to the prediction error in a linear regression model and so is also a good measure of model fit and predictive ability. It is still not cross-validation, though, which we will consider shortly.
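As a check on the claim that SC and BIC agree, a short DATA step (a sketch; the -2 Log L values and parameter counts are taken from the three Model Fit Statistics tables above) computes BIC = Deviance + p ln(n) for each model:

data bic;
  input model $ neg2logL p;     /* p = number of parameters, including the intercept */
  n = 52;                       /* number of teen drivers */
  bic = neg2logL + p*log(n);    /* compare with the SC values SAS reports */
  datalines;
ageGPA 41.538 3
age    47.355 2
GPA    46.721 2
;
run;
proc print data=bic; run;

The resulting values (about 53.39, 55.26, and 54.62) match the SC column in the output above.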
4.4.3  Interactions in Logistic Regression
We can include interaction terms in a logistic regression model. Interpretation is analogous to that of linear regression. Proc Logistic handles interaction terms easily:
SAS Code and Output
proc logistic data=teendriver outest=teendriverAgeGPA;
  class ticket;
  model ticket(event='1') = age GPA age*GPA;
run;
                Model Fit Statistics
                Intercept       Intercept and
Criterion       Only            Covariates
AIC             52.913          45.021
SC              54.865          52.826
-2 Log L        50.913          37.021
        Testing Global Null Hypothesis: BETA=0
Test                  Chi-Square    DF    Pr > ChiSq
Likelihood Ratio         13.8923     3        0.0031
Score                     9.7159     3        0.0211
Wald                      6.9647     3        0.0730
The LOGISTIC Procedure

        Analysis of Maximum Likelihood Estimates
                             Standard      Wald
Parameter    DF  Estimate    Error         Chi-Square    Pr > ChiSq
Intercept     1  -28.4576    20.9376           1.8473        0.1741
age           1    1.6680     1.1267           2.1915        0.1388
GPA           1   16.2074     9.0674           3.1949        0.0739
age*GPA       1   -0.9726     0.5011           3.7672        0.0523
Association of Predicted Probabilities and Observed Responses
Percent Concordant    85.5      Somers' D    0.710
Percent Discordant    14.5      Gamma        0.710
Percent Tied           0.0      Tau-a        0.225
Pairs                  420      c            0.855
Note that the AIC is smaller for the full(er) model with both predictors and the interaction term, and the percent concordant has improved as well. Note also that age falls out of significance and GPA is now marginal. This is not of any real concern: the model selection criteria have improved, so we have a better model.
4.5  Residual Analysis in Logistic Regression
Residual analysis in logistic regression focuses on the identification of outliers and influential points. There are two standard ways to define a residual in logistic regression.
The residuals in a logistic regression are not as useful as in a linear regression. The residuals are not assumed to have a normal distribution, so the usual ±2 or ±3 cutoffs don't necessarily apply, and we cannot interpret any patterns we see in a straightforward manner. Instead, we can simply look for outliers among the distribution of Pearson or deviance residuals. We can plot the values against the predicted probabilities, the predictor variables, or observation number and look for points that seem to lie far from the others.
We can plot the residuals using Proc GPLOT as before. You can see from the plots that things are not as clear with the residuals from the logistic regression. We don't necessarily expect normal residuals or residuals scattered around zero.
Pearson's Residual:

RP = (yi − π̂i) / sqrt( π̂i (1 − π̂i) )

Deviance Residual:

RD =  sqrt( −2 ln π̂i )          if yi = 1
RD = −sqrt( −2 ln(1 − π̂i) )     if yi = 0
Pearson’s residual is sometimes called the “Standardized Residual”. Why?
The deviance residuals have the property that the sum of the deviance residuals squared is the
deviance, D (-2 ln(maximized likelihood)) for the model.
The Pearson residual is more easily understood, but the deviance residual directly gives the
contribution of each point to the lack of fit of the model.
data TEENDRIVER;
infile "C:\STA5701\Datasets\teendriver.txt" firstobs=2;
input age ticket$ GPA;
obsno=_N_;
run;
proc logistic data=teendriver;
  class ticket;
  model ticket(event='1') = age GPA age*GPA;
  output out=teenLogRegOut predprobs=I p=predprob resdev=resdev reschi=pearres;
run;
proc gplot data=teenLogRegOut;
  plot resdev*obsno;
  plot pearres*obsno;
run;
quit;
4.5.1  Measures of influence
Measures of influence attempt to measure how much individual observations influence the model.
Observations with high influence merit special examination. Many measures of influence have been
suggested and are used for linear regression. Some of these have analogies for logistic regression.
However, the guidelines for deciding what is a big enough value of an influence measure to merit special attention are less developed for logistic regression than for linear regression and the guidelines
developed there are sometimes not appropriate for logistic regression.
Cook’s Distance: This is a measure of how much the residuals change when the case is deleted.
Large values indicate a large change when the observation is left out. Plot Di against case number
or against the predicted probabilities and look for outliers.
The leverage is a measure of the potential influence of an observation. In linear regression, the
leverage is a function of how far its covariate vector xi is from the average. It is a function only of
the covariate vector. In logistic regression, the leverage is a function of xi and of πi (which must
be estimated) - it isn’t necessarily the observations which have the most extreme xi s which have
the most leverage.
You can also get these and a large number of other diagnostic plots within the Logistic
procedure:
proc logistic data=teendriver;
  class ticket;
  model ticket(event='1') = age GPA age*GPA / influence iplots;
run;
These plots will appear within the output and are difficult to export and incorporate into other
programs, but try this on your own.
You can also try using the ods graphics option in SAS:
ods html;
ods graphics on;
title 'Teenage Drivers: Age and GPA in predicting Tickets';
proc logistic data=teendriver;
  class ticket;
  model ticket(event='1') = age GPA age*GPA / influence iplots;
run;
ods graphics off;
ods html close;
4.6  Logistic Regression using Counts or Proportions
As we saw in some of the examples at the beginning of the chapter notes, we can also use logistic
regression to model the counts or the proportion of successes rather than using the binary (0/1)
outcome.
Not all proportions are appropriate to model with logistic regression. We model proportions
like fat calories/total calories, etc., using normal theory, usually. The only proportions that are
appropriate in this context are those that result from an integer count of a certain outcome over
the total number of trials or outcomes.
In the case that the response variable is a binomial count, i.e., a sum of binary variables, we have: if Xi ∼ Bernoulli(π), then Y = Σ Xi ∼ Binomial(n, π).
Example 1 Island size and bird extinctions: On each island we count the number of species that went extinct out of all the species on the island. What is the relationship between the area of an island and the probability of extinction of birds present on the island?
Example 2 Moth coloration and natural selection: At each distance from Liverpool we
count the number of moths from each morph that were taken by predators. What is the relationship
between the distance from Liverpool, where trees are dark from industrial soot, and the probability
of predation on the light and dark morphs of the moth Carbonaria?
4.6.1  The Logistic Regression Model for Binomial Counts
• Y = the number of successes in m binomial trials. For example, how many species went
extinct in the 10 year period of the study on each island?
• Yi ∼ Binomial(mi , πi ) where mi is the number of species on the ith island and πi is the
probability of extinction on the ith island.
• X1 , · · · , Xp are the explanatory variables, in the extinction example, X is the area of the
island.
• For a particular set of values of the explanatory variables, µ (Y |X1 , · · · , Xp ) = π = probability
of success (extinction in our example).
• π̄ = Y /m = the observed binomial proportion.
• Note that the sample size in the bird extinction study is the number of islands, not the number
of species.
• We model µ (Y |X1 , · · · , Xp ) = π just as we did for the binary response model.
• logit(π) = η = β0 + β1 X1 + · · · + βp Xp
• As before: π = e^η / (1 + e^η)

4.6.2  Variance
Since the response variable, Y , is a Binomial random variable, we have:

SD(Yi | X1i , · · · , Xpi ) = sqrt( mi πi (1 − πi) )

The variance of Y is not constant across all Yi .
4.6.3  Estimation of Logistic Regression Coefficients
As for the binary response model we use the maximum likelihood estimators (MLE’s).
4.6.4  SAS Code and Output
Below is the SAS code for the bird extinction and island size problem mentioned at the beginning
of these notes.
proc logistic data=IslandBirds;
model extinct/nspecies=area;
run;
The LOGISTIC Procedure

                Model Information
Data Set                        WORK.ISLANDBIRDS
Response Variable (Events)      extinct
Response Variable (Trials)      nspecies
Model                           binary logit
Optimization Technique          Fisher's scoring

                Response Profile
Ordered Value       Binary Outcome      Total Frequency
      1             Event                    108
      2             Nonevent                 524
                Model Fit Statistics
                Intercept       Intercept and
Criterion       Only            Covariates
AIC             580.013         561.335
SC              584.461         570.233
-2 Log L        578.013         557.335
        Testing Global Null Hypothesis: BETA=0
Test                  Chi-Square    DF    Pr > ChiSq
Likelihood Ratio         20.6774     1        <.0001
Score                    16.5183     1        <.0001
Wald                     14.2202     1        0.0002
        Analysis of Maximum Likelihood Estimates
                            Standard      Wald
Parameter    DF  Estimate   Error         Chi-Square    Pr > ChiSq
Intercept     1   -1.3060    0.1173         123.8728        <.0001
area          1   -0.0101    0.00268         14.2202        0.0002
        Odds Ratio Estimates
           Point         95% Wald
Effect     Estimate      Confidence Limits
area       0.990         0.985    0.995
Association of Predicted Probabilities and Observed Responses
Percent Concordant    60.2      Somers' D    0.339
Percent Discordant    26.4      Gamma        0.391
Percent Tied          13.4      Tau-a        0.096
Pairs                56592      c            0.669

4.6.5  Model Assessment
Model assessment for logistic regression using counts is similar to that using the binary outcome variable.
Estimated versus observed: One way to assess the appropriateness of the model and the efficacy of the estimation routine is to plot the estimated probability, π̂i, against the observed response proportion, π̄i. Additionally, plots of the π̂i versus one or more of the explanatory variables are useful for visual examination, as we do with ordinary scatterplots in linear regression.
Residual Analysis: As in the binary response case, there are two widely used residuals for binomial counts models.
Pearson's Residual:

RP = (yi − mi π̂i) / sqrt( mi π̂i (1 − π̂i) )

Deviance Residual:

Dr = sign(Yi − mi π̂i) * sqrt( 2 [ Yi ln( Yi / (mi π̂i) ) + (mi − Yi) ln( (mi − Yi) / (mi − mi π̂i) ) ] )
The Pearson residual is more easily understood, but the deviance residual directly gives the
contribution of each point to the lack of fit of the model.
Since the data are grouped, the residuals in a binomial counts logistic regression (either Pearson
or deviance) are more useful than in the binary response regression.
The residuals should be plotted against the predicted values for the πi s and examined for outliers
or remaining patterns.
4.6.6  Model Selection
Likelihood Ratio Tests
As with the binary response case, we use the value -2ln(Maximized likelihood function) to compare
models. Recall that the MLE’s of the βs are the values that maximize the likelihood function of
the data. So, we find the values of the βs that maximize the likelihood function, take the natural log of the maximized likelihood, and multiply by -2.
• The quantity -2 ln(Maximized likelihood) is also called the deviance of a model since larger
values indicate greater deviation from the assumed model. Comparing two nested models by
the difference in deviances is a drop-in-deviance test.
• The difference between the values of -2 ln(Maximized likelihood function) for a full and
reduced model has approximately a chi-square distribution if the null hypothesis that the
extra parameters are all 0 is true. The d.f. is the difference in the number of parameters for
the two models.
AIC and BIC
Both AIC and BIC can be used as model selection criteria. As with linear regression models, they
are only relative measures of fit, not absolute measures of fit.
• AIC = Deviance + 2p
• BIC = Deviance + p ln(n)
where p is the number of parameters in the model.
5  Appendix
• Afifi, A., Clark, V. A., and May, S. 2004. Computer-Aided Multivariate Analysis. Chapman & Hall/CRC.
• Quinn, G. P. and Keough, M. J. 2002. Experimental Design and Data Analysis for Biologists. Cambridge University Press.
• McCullagh, P. and Nelder, J. A. 1998. Generalized Linear Models. Chapman & Hall/CRC.