An Introduction to Logistic Regression
GV917: For Categorical Dependent Variables
What do we do when the dependent variable in a regression is a dummy variable?

Suppose we have the dummy variable turnout:
1 – if a survey respondent turns out to vote
0 – if they don’t vote

One thing we could do is simply run an ordinary least squares regression.
Turnout and Interest in the Election (N=30)

Turnout   Interest
 1.00       4.00
  .00       1.00
 1.00       4.00
 1.00       3.00
 1.00       3.00
  .00       1.00
 1.00       3.00
 1.00       2.00
 1.00       3.00
  .00       1.00
  .00       2.00
 1.00       2.00
 1.00       2.00
  .00       1.00
 1.00       2.00
 1.00       4.00
  .00       2.00
 1.00       2.00
  .00       1.00
 1.00       1.00
  .00       2.00
 1.00       4.00
 1.00       1.00
 1.00       2.00
 1.00       1.00
  .00       2.00
 1.00       3.00
 1.00       3.00
 1.00       4.00
 1.00       3.00

Turnout: 1 = yes, 0 = no
Interest in the Election: 1 = not at all interested, 2 = not very interested, 3 = fairly interested, 4 = very interested
OLS Regression of Turnout on Interest – The Linear Probability Model

Model Summary
Model 1: R = .540, R Square = .291, Adjusted R Square = .266, Std. Error of the Estimate = .39930
Predictors: (Constant), interest

ANOVA (Dependent Variable: turnout)
              Sum of Squares   df   Mean Square      F      Sig.
Regression        1.836         1      1.836       11.513   .002
Residual          4.464        28       .159
Total             6.300        29

Coefficients (Dependent Variable: turnout)
              B      Std. Error   Beta      t      Sig.
(Constant)   .152      .177                .856    .399
interest     .238      .070       .540    3.393    .002
The Residuals of the OLS Turnout Regression

Casewise Diagnostics (Dependent Variable: turnout)

Case   Std. Residual   turnout   Predicted Value   Residual
  1        -.264         1.00        1.1053        -.10526
  2        -.977          .00         .3901        -.39009
  3        -.264         1.00        1.1053        -.10526
  4         .333         1.00         .8669         .13313
  5         .333         1.00         .8669         .13313
  6        -.977          .00         .3901        -.39009
  7         .333         1.00         .8669         .13313
  8         .930         1.00         .6285         .37152
  9         .333         1.00         .8669         .13313
 10        -.977          .00         .3901        -.39009
 11       -1.574          .00         .6285        -.62848
 12         .930         1.00         .6285         .37152
 13         .930         1.00         .6285         .37152
 14        -.977          .00         .3901        -.39009
 15         .930         1.00         .6285         .37152
 16        -.264         1.00        1.1053        -.10526
 17       -1.574          .00         .6285        -.62848
 18         .930         1.00         .6285         .37152
 19        -.977          .00         .3901        -.39009
 20        1.527         1.00         .3901         .60991
 21       -1.574          .00         .6285        -.62848
 22        -.264         1.00        1.1053        -.10526
 23        1.527         1.00         .3901         .60991
 24         .930         1.00         .6285         .37152
 25        1.527         1.00         .3901         .60991
 26       -1.574          .00         .6285        -.62848
 27         .333         1.00         .8669         .13313
 28         .333         1.00         .8669         .13313
 29        -.264         1.00        1.1053        -.10526
 30         .333         1.00         .8669         .13313
What’s Wrong?

Predicted probabilities can exceed 1.0 (e.g. 1.1053 for the most interested respondents), which makes no sense.

The t and F test statistics are not valid because the residuals do not meet the required assumptions: with a binary dependent variable the errors are heteroscedastic.

We can correct for the heteroscedasticity, but a better option is to use a logistic regression model.
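To see the problem concretely, the linear probability model can be refitted by hand from the 30 cases above. This is a minimal sketch in plain Python; the data are transcribed from the turnout/interest table, and the closed-form simple-regression formulas are used instead of a statistics package.

```python
# The 30 cases from the turnout/interest table above.
turnout = [1,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,1,1,0,1,1,1,1]
interest = [4,1,4,3,3,1,3,2,3,1,2,2,2,1,2,4,2,2,1,1,2,4,1,2,1,2,3,3,4,3]
n = len(turnout)

# Closed-form simple OLS: b = Sxy / Sxx, a = ybar - b * xbar
xbar = sum(interest) / n
ybar = sum(turnout) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(interest, turnout))
sxx = sum((x - xbar) ** 2 for x in interest)
b = sxy / sxx
a = ybar - b * xbar

print(round(a, 3), round(b, 3))        # matches the SPSS output: 0.152 0.238
pred_very_interested = a + b * 4
print(round(pred_very_interested, 4))  # 1.1053 -- an impossible "probability"
```

The fitted value for the most interested respondents is the 1.1053 that appears in the casewise diagnostics, which is exactly the kind of out-of-range prediction the linear probability model cannot avoid.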
Some Preliminaries Needed for Logistic Regression

Odds Ratios

These are defined as the probability of an event occurring divided by the probability of it not occurring. Thus if p is the probability of an event:

Odds = p / (1 - p)

For example: in the 2005 British Election Study face-to-face survey 48.2 per cent of the sample were men, and 51.8 per cent women. Thus the odds of being a man were 0.482/0.518 = 0.93, and the odds of being a woman were 0.518/0.482 = 1.07.

Note that if the odds ratio was 1.00 it would mean that women were equally likely to appear in the survey as men.
Log Odds

The natural logarithm of a number is the power to which we must raise e (2.718) to give the number in question.

So the natural logarithm of 100 is 4.605 because 100 = e^4.605. This can be written 100 = exp(4.605). Similarly, the anti-log of 4.605 is 100 because e^4.605 = 100.

In the 2005 BES study 70.5 per cent of men and 72.9 per cent of women voted.

The odds of men voting were 0.705/0.295 = 2.39, and the log odds were ln(2.39) = 0.8712.

The odds of women voting were 0.729/0.271 = 2.69, and the log odds were ln(2.69) = 0.9896.

Note that ln(1.0) = 0, so when the odds ratio is 1.0 the log odds ratio is zero.
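The odds and log-odds arithmetic above can be checked directly; this short sketch just replays the 2005 BES percentages quoted in the text.

```python
import math

# Odds = p / (1 - p); log odds = ln(odds).
odds_man = 0.482 / 0.518
odds_woman = 0.518 / 0.482
print(round(odds_man, 2), round(odds_woman, 2))   # 0.93 1.07

odds_men_voting = 0.705 / 0.295      # 70.5% of men voted
odds_women_voting = 0.729 / 0.271    # 72.9% of women voted
print(round(math.log(odds_men_voting), 4))    # 0.8712
print(round(math.log(odds_women_voting), 4))  # 0.9896
```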
Why Use Logarithms?

They have three advantages:

1. Odds vary from 0 to ∞, whereas log odds vary from -∞ to +∞ and are centred on 0. Odds less than 1 have negative log odds, and odds greater than 1 have positive log odds. This accords better with the real number line, which runs from -∞ to +∞.

2. Multiplying two numbers together is equivalent to adding their logs. Thus logs make it possible to convert multiplicative models into additive models, a useful property in the case of logistic regression, which is a non-linear multiplicative model when not expressed in logs.

3. A useful statistic for evaluating the fit of models is -2*loglikelihood (also known as the deviance). The model has to be expressed in logarithms for this to work.
Logistic Regression

ln[ p̂(y) / (1 - p̂(y)) ] = a + bXi

where p̂(y) is the predicted probability of being a voter and 1 - p̂(y) is the predicted probability of not being a voter.

If we express this in terms of anti-logs, or odds ratios, then:

p̂(y) / (1 - p̂(y)) = exp(a + bXi)

and

p̂(y) = exp(a + bXi) / (1 + exp(a + bXi))
The Logistic Function

The logistic function always lies strictly between 0 and 1, so there are no impossible probabilities.

It also corrects the problems with the test statistics.
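The boundedness claim is easy to verify numerically: however extreme the input, the logistic function stays inside (0, 1). A minimal sketch:

```python
import math

def logistic(z):
    """The logistic function exp(z) / (1 + exp(z)): strictly between 0 and 1."""
    return math.exp(z) / (1 + math.exp(z))

# Even for extreme inputs the output never reaches 0 or 1,
# so the model cannot produce an impossible probability.
for z in [-10, -1, 0, 1, 10]:
    p = logistic(z)
    assert 0 < p < 1
    print(z, round(p, 4))
```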
Estimating a Logistic Regression

In OLS regression the least squares solution can be derived analytically – there are equations, called the normal equations, which we use to find the values of a and b. In logistic regression there are no such equations. The solutions are derived iteratively – by a process of trial and error.

Doing this involves specifying a likelihood function. A likelihood is a measure of how typical a sample is of a given population. For example, we can calculate how typical the ages of the students in this class are in comparison with students in the university as a whole. Applied to our regression problem, we are working out how likely individuals are to be voters given their level of interest in the election and given values for the a and b coefficients.

We ‘try out’ different values of a and b, and maximum likelihood estimation identifies the values which are most likely to reproduce the distribution of voters and non-voters we see in the sample, given their levels of interest in the election.
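The iterative search can be sketched with Newton-Raphson updates on the log likelihood. This is one standard estimation scheme, not necessarily the exact routine SPSS uses; the data are the 30 turnout/interest cases from the earlier table.

```python
import math

# The 30 cases from the turnout/interest table above.
turnout = [1,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,1,1,0,1,1,1,1]
interest = [4,1,4,3,3,1,3,2,3,1,2,2,2,1,2,4,2,2,1,1,2,4,1,2,1,2,3,3,4,3]

a, b = 0.0, 0.0                       # starting guesses
for _ in range(25):                   # iterate until the estimates settle
    # Gradient and Hessian of the log likelihood for ln(p/(1-p)) = a + b*x
    ga = gb = haa = hab = hbb = 0.0
    for x, y in zip(interest, turnout):
        p = 1 / (1 + math.exp(-(a + b * x)))
        ga += y - p
        gb += (y - p) * x
        w = p * (1 - p)
        haa += w
        hab += w * x
        hbb += w * x * x
    # Solve the 2x2 Newton step by hand: (sum w xx') * delta = gradient
    det = haa * hbb - hab * hab
    da = (hbb * ga - hab * gb) / det
    db = (haa * gb - hab * ga) / det
    a, b = a + da, b + db

print(round(a, 3), round(b, 3))  # close to the SPSS estimates: -2.582 and 1.742
```

The estimates settle on essentially the coefficients reported in the SPSS output below, which is the point: different routines 'try out' values differently, but they converge on the same maximum of the likelihood.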
Maximum Likelihood

Define the probability of getting a head when tossing a fair coin as p(H) = 0.5, so that the probability of a tail is p(T) = 1 - p(H) = 0.5. The probability of two heads followed by a tail is:

P(H)P(H)P(T) = (0.5)(0.5)(0.5) = 0.125

We can get this outcome in 3 different ways (the tail can be first, second or third), so the probability of getting two heads and a tail without worrying about the sequence is 0.125(3) = 0.375.

But suppose we did not know the value of p(H). We could ‘try out’ different values and see how well they fitted an experiment consisting of repeated sets of three coin tosses. For example, if we thought p(H) = 0.4, then two heads and a tail would give (0.4)(0.4)(0.6)(3) = 0.288. If we thought it was 0.3 we would get (0.3)(0.3)(0.7)(3) = 0.189.
Maximum Likelihood in General

More generally we can write a likelihood function for this exercise:

L(p) = 3[p² (1 - p)]

where p is the probability of getting a head and the 3 is the number of orderings in which the sequence can occur.

The maximum value of this function occurs when p = 2/3 – the observed proportion of heads in the sample – making this the maximum likelihood estimate of p given the outcome two heads and a tail.
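A crude grid search makes the 'trying out' literal: evaluate L(p) = 3p²(1 - p) over a grid of candidate values and keep the best. The spot checks from the slide reappear, and the maximum sits at p = 2/3.

```python
# Likelihood of the observed outcome "two heads and a tail" as a function
# of the unknown head probability p.
def likelihood(p):
    return 3 * p**2 * (1 - p)

print(round(likelihood(0.5), 3))  # 0.375
print(round(likelihood(0.4), 3))  # 0.288
print(round(likelihood(0.3), 3))  # 0.189

# "Try out" every value of p on a fine grid and keep the best one.
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)
print(p_hat)  # 0.667 -- the maximum likelihood estimate
```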
Explaining Variance

In OLS regression we defined the following expression:

Σ(Yi – Ȳ)² = Σ(Ŷi – Ȳ)² + Σ(Yi – Ŷi)²

or

Total Variation = Explained Variation + Residual Variation

In logistic regression, measures of the deviance replace the sums of squares as the building blocks of measures of fit and statistical tests.
Deviance

Deviance measures are built from the maximum likelihoods of different models. For example, suppose we fit a model with no slope coefficient (b), only an intercept (a). We can call this model zero because it has no predictors. We then fit a second model, model one, which has both a slope and an intercept. We can form the ratio of the maximum likelihoods of these models:

Likelihood ratio = (maximum likelihood of model zero) / (maximum likelihood of model one)

Expressed in logs this becomes:

Log likelihood ratio = ln(maximum likelihood of model zero) – ln(maximum likelihood of model one)

Note that ln[(likelihood ratio)²] is the same as 2(log likelihood ratio).

The deviance is defined as -2(log likelihood ratio).
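The deviance arithmetic for the turnout example can be replayed directly. Model zero (intercept only) predicts the overall voting rate 21/30 = 0.7 for everyone; model one uses the coefficients a = -2.582 and b = 1.742 from the fitted model reported later in these slides.

```python
import math

# The 30 cases from the turnout/interest table above.
turnout = [1,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,1,1,0,1,1,1,1]
interest = [4,1,4,3,3,1,3,2,3,1,2,2,2,1,2,4,2,2,1,1,2,4,1,2,1,2,3,3,4,3]

def loglik(prob_fn):
    """Log likelihood of the sample given a rule mapping interest -> P(vote)."""
    return sum(math.log(prob_fn(x)) if y == 1 else math.log(1 - prob_fn(x))
               for x, y in zip(interest, turnout))

ll0 = loglik(lambda x: 21 / 30)                                    # model zero
ll1 = loglik(lambda x: 1 / (1 + math.exp(-(-2.582 + 1.742 * x))))  # model one

print(round(-2 * ll1, 3))          # 25.894, the -2LL in the SPSS Model Summary
print(round(-2 * (ll0 - ll1), 3))  # 10.757, the omnibus chi-square
```

The difference in deviances between the two models is exactly the chi-square statistic reported in the Omnibus Tests table below.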
What Does This Mean?

The maximum likelihood of model zero is analogous to the total variation in OLS, and the maximum likelihood of model one is analogous to the explained variation. If the maximum likelihoods of models zero and one were the same, then the likelihood ratio would be 1 and the log likelihood ratio 0.

This would mean that model one was no better than model zero at accounting for turnout, so the deviance captures how much we improve things by taking interest in the election into account. The bigger the deviance, the greater the improvement.
Logistic Regression of Turnout

Omnibus Tests of Model Coefficients
Step 1        Chi-square   df   Sig.
  Step          10.757      1   .001
  Block         10.757      1   .001
  Model         10.757      1   .001

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1          25.894a                .301                   .427
a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

Classification Table (cut value = .500)
                       Predicted turnout
Observed turnout       .00    1.00    Percentage Correct
  .00                   5       4          55.6
  1.00                  3      18          85.7
Overall Percentage                         76.7

Variables in the Equation
             B       S.E.    Wald    df   Sig.   Exp(B)
interest    1.742    .697    6.251    1   .012    5.708
Constant   -2.582   1.302    3.934    1   .047     .076
a. Variable(s) entered on step 1: interest.
The Meaning of the Omnibus Test

Omnibus Tests of Model Coefficients
Step 1        Chi-square   df   Sig.
  Step          10.757      1   .001
  Block         10.757      1   .001
  Model         10.757      1   .001
SPSS starts by fitting what it calls Block 0, the model containing only the constant term and no predictor variables. It then proceeds to Block 1, which fits the full model and gives us another estimate of the likelihood function. The two can then be compared, and the table shows a chi-square test of the improvement in the model achieved by adding interest in the election to the equation. This chi-square statistic is significant at the 0.001 level. In a multiple logistic regression this table tells us how much all of the predictor variables together improve on model zero.

We have significantly improved on the baseline model by adding the variable interest to the equation.
The Model Summary Table

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1          25.894a                .301                   .427
a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.
The -2 loglikelihood statistic for our model appears in the table, but it is only really meaningful for comparing different models. The Cox and Snell and Nagelkerke R squares are different ways of approximating the percentage of variance explained (R square) in multiple regression. The Cox and Snell statistic is problematic because it has a maximum value of about 0.75; the Nagelkerke R square corrects this and has a maximum value of 1.0, so it is often the preferred measure.
The Classification Table

Classification Table (cut value = .500)
                       Predicted turnout
Observed turnout       .00    1.00    Percentage Correct
  .00                   5       4          55.6
  1.00                  3      18          85.7
Overall Percentage                         76.7
The classification table tells us the extent to which the model correctly predicts actual turnout, so it is another goodness of fit measure. The main diagonal from top left to bottom right contains the cases predicted correctly (23), whereas the off-diagonal cells contain the cases predicted incorrectly (7). Overall, 76.7 per cent of the cases are predicted correctly.
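The table can be rebuilt from the data: predict 'votes' whenever the fitted probability exceeds the 0.5 cut value, then cross-tabulate against observed turnout. A sketch using the fitted coefficients a = -2.582, b = 1.742:

```python
import math

# The 30 cases from the turnout/interest table above.
turnout = [1,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,1,1,0,1,1,1,1]
interest = [4,1,4,3,3,1,3,2,3,1,2,2,2,1,2,4,2,2,1,1,2,4,1,2,1,2,3,3,4,3]

# counts[(observed, predicted)] cross-tabulates the cut-value predictions.
counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
for x, y in zip(interest, turnout):
    p = 1 / (1 + math.exp(-(-2.582 + 1.742 * x)))
    predicted = 1 if p > 0.5 else 0
    counts[(y, predicted)] += 1

print(counts)  # {(0, 0): 5, (0, 1): 4, (1, 0): 3, (1, 1): 18}
correct = counts[(0, 0)] + counts[(1, 1)]
print(round(100 * correct / 30, 1))  # 76.7
```

Only respondents at interest level 1 get a fitted probability below 0.5, so the model predicts 'does not vote' for them and 'votes' for everyone else, reproducing the table's cell counts.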
Interpreting the Coefficients

Variables in the Equation
             B       S.E.    Wald    df   Sig.   Exp(B)
interest    1.742    .697    6.251    1   .012    5.708
Constant   -2.582   1.302    3.934    1   .047     .076
a. Variable(s) entered on step 1: interest.
The B column gives the coefficients of the logistic regression model. It means that a unit change in the level of interest in the election increases the log odds of voting by 1.742. The standard error appears in the next column (0.697) and the Wald statistic in the third. The latter is the square of the ratio of the coefficient to its standard error – in effect the t statistic squared (6.251) – and as we can see it is significant at the 0.012 level. Finally, Exp(B) is the anti-log of the B column, so that e^1.742 = 5.708. This is the effect on the odds of voting of a one-unit increase in the level of interest in the election. Since odds ratios are a bit easier to understand than log odds ratios, effects are often reported using these coefficients.
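The derived columns of the table can be checked by hand from B and S.E. alone: Wald = (B / S.E.)² and Exp(B) = e^B. Small discrepancies against the SPSS values reflect the rounding of the printed coefficients.

```python
import math

# Printed coefficients and standard errors from the table above.
b, se_b = 1.742, 0.697
const, se_const = -2.582, 1.302

print(round((b / se_b) ** 2, 2))          # ~6.25, Wald for interest (SPSS: 6.251)
print(round((const / se_const) ** 2, 2))  # ~3.93, Wald for the constant (SPSS: 3.934)
print(round(math.exp(b), 2))              # ~5.71, Exp(B) for interest (SPSS: 5.708)
print(round(math.exp(const), 3))          # 0.076, Exp(B) for the constant
```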
Making Sense of the Coefficients

ln[ p̂(y) / (1 - p̂(y)) ] = -2.582 + 1.742Xi

so that

p̂(y) = exp(-2.582 + 1.742Xi) / (1 + exp(-2.582 + 1.742Xi))
Translating into Probabilities

Suppose a person scores 4 on the interest in the election variable (they are very interested). Then according to the model the probability that they will vote is:

p̂(y) = exp(-2.582 + 1.742(4)) / (1 + exp(-2.582 + 1.742(4)))
     = exp(4.386) / (1 + exp(4.386)) = 0.99

If they are not at all interested and score 1, then:

p̂(y) = exp(-2.582 + 1.742(1)) / (1 + exp(-2.582 + 1.742(1)))
     = exp(-0.84) / (1 + exp(-0.84)) = 0.30

Consequently, a change from being not at all interested to being very interested increases the probability of voting by 0.99 - 0.30 = 0.69.
Probabilities

Level of Interest   Probability of Voting
       1                   0.30
       2                   0.71
       3                   0.93
       4                   0.99
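The whole probability column follows from the fitted equation; this sketch evaluates p̂ = exp(-2.582 + 1.742x) / (1 + exp(-2.582 + 1.742x)) at each level of interest.

```python
import math

# Fitted probability of voting at each level of interest in the election.
probs = []
for x in [1, 2, 3, 4]:
    z = -2.582 + 1.742 * x
    p = math.exp(z) / (1 + math.exp(z))
    probs.append(round(p, 2))
    print(x, probs[-1])

print(probs)  # [0.3, 0.71, 0.93, 0.99]
```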
Conclusions

Logistic regression allows us to model relationships when the dependent variable is a dummy variable.

It can be extended to multinomial logistic regression, in which the dependent variable has several categories – this produces several sets of coefficients.

The results are more reliable than if we had just used ordinary least squares regression.