Logistic regression

By Hui Bian
Office for Faculty Excellence
Contact information
Email: [email protected]
Phone: 328-5428
Location: 2307 Old Cafeteria Complex (east campus)
Website: http://core.ecu.edu/ofe/StatisticsResearch/
What is logistic regression?
According to the IBM SPSS Manual, logistic regression is used to predict the presence or absence of a characteristic or outcome based on the values of a set of predictor variables. It is similar to a linear regression model but is suited to models where the dependent variable is dichotomous.
The ultimate goal of logistic regression is to determine the probability that a case belongs to category 1 of the dependent variable, i.e., the probability of the event occurring (the event occurring is always coded as 1), for a given set of predictors.
Variables in logistic regression
Dependent variable: one dichotomous/binary variable
- Yes/No: drug users vs. non-drug users
- Membership: intervention vs. control
- Characteristics: males vs. females
Independent variables: interval, dummy, or categorical variables (indicator coded).
- Indicator coded: SPSS automatically recodes categorical variables for us.
Assumptions
Homogeneity of variance and normality of errors are NOT assumed, but logistic regression requires:
- Absence of multicollinearity
- No specification errors: all relevant predictors are included and irrelevant predictors are excluded
- A larger sample size than linear regression
Logistic regression equation
The logit function:
ln(p/(1 - p)) = a + b1x1 + b2x2 + … + bnxn
Logit(p) = a + b1x1 + b2x2 + … + bnxn
- p: probability of a case belonging to category 1
- p/(1 - p): the odds
- a: constant
- n: number of predictors
- b1–bn: regression coefficients
Logistic regression equation
- There is a non-linear relationship between the predictors and the binary outcome.
- The regression coefficients are estimated using maximum likelihood.
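The logit link and its inverse can be sketched in a few lines of Python (an illustrative sketch, not part of the SPSS workflow shown in these slides):

```python
import math

def logit(p):
    """Log-odds: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse logit (the logistic function): maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))

# A probability of .5 corresponds to log-odds of 0; the two functions are inverses
print(logit(0.5), round(inv_logit(logit(0.8)), 3))
```

The S-shape of `inv_logit` is why the relationship between predictors and the probability of the outcome is non-linear even though the log-odds are linear in the predictors.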
Logistic regression curve
[Figure: the S-shaped logistic curve. The Y-axis is P (probability), which indicates the proportion of 1s at any given value of X.]
Coding of variables
Recommendation for the dependent variable:
- Use 1 for the event occurring (the focus of the study)
- Use 0 for the absence of the event (the reference category)
- SPSS automatically recodes the lower-numbered category to 0 and the higher-numbered category to 1.
Coding of variables
Recommendation for independent variables:
- Code 1 for the category that is the focus of the study
- Code 0 for the reference category
Coding of variables
- Cases coded as 1 are referred to as the response group, comparison group, or target group.
- Cases coded as 0 are referred to as the reference group, base group, or control group.
Terms
Probability: the likelihood of an event occurring.
- Drug use status is the DV and gender is the IV.
- The probability of a student using drugs is 205/413 = .496.
- We want to know whether this proportion is the same for both males and females.

          Drug users   Non-users   Total
Male          120          102      222
Female         85          106      191
Total         205          208      413
Terms
Odds: the probability of belonging to one group (or of the event occurring) divided by the probability of not belonging to that group (or of the event not occurring).
- The odds of a male using drugs are 120/102 = 1.18.
- The odds of a female using drugs are 85/106 = .80.
- For males, this means a male is 1.18 times as likely to use drugs as not to use them.
Terms
Odds ratio: an important estimate in logistic regression, used to answer our research question.
- For the table below, the research question is whether there is a gender difference in drug use, i.e., whether the probability of drug use is the same for males and females.

            Drug users (1)   Non-users (0)   Total
Male (1)         120              102         222
Female (0)        85              106         191
Total            205              208         413
Terms
Odds ratio
- A ratio of the odds for the two groups.
- Always the odds for the response group (males) divided by the odds for the reference group (females).
- The odds ratio is 1.18/.80 = 1.48.
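The probability, odds, and odds-ratio arithmetic from the table can be checked with a few lines of Python. (Illustrative only; note that the unrounded ratio is 1.47, and the slides' 1.48 comes from dividing the already-rounded odds 1.18/.80.)

```python
# 2x2 table from the slides: gender (rows) by drug use (columns)
male_users, male_nonusers = 120, 102
female_users, female_nonusers = 85, 106

odds_male = male_users / male_nonusers          # ~1.18
odds_female = female_users / female_nonusers    # ~0.80
odds_ratio = odds_male / odds_female            # ~1.47 unrounded

print(round(odds_male, 2), round(odds_female, 2), round(odds_ratio, 2))
```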
Terms
Odds ratio
- Males in this example had 1.48 times the odds of drug use compared with females.
- An odds ratio > 1 indicates that the event is more likely for the response category than for the reference category of an independent variable.
- An odds ratio < 1 indicates that the event is less likely for the response category than for the reference category.
Terms
Adjusted odds ratio
- Reported when there are multiple independent variables in the model.
- It indicates the contribution of a particular predictor when the other predictors are controlled.
Terms
Ordinary least squares (OLS)
- The estimation method used for a linear regression model.
- It minimizes the sum of squared vertical distances between the observed responses in the dataset and the responses predicted by the linear model.
Maximum likelihood estimation (MLE)
- More appropriate for a logistic regression model.
- It maximizes the log likelihood.
- The log likelihood indicates how likely the observed grouping is, given the observed values of the predictors.
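A minimal sketch of the quantity MLE maximizes — the Bernoulli log likelihood — using hypothetical toy data (the outcomes and probabilities below are made up for illustration):

```python
import math

def log_likelihood(y, p):
    """Bernoulli log likelihood: sum of y*ln(p) + (1 - y)*ln(1 - p)."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

# Toy data: observed outcomes and two candidate sets of predicted probabilities
y = [1, 1, 0, 0]
close_fit = [0.9, 0.8, 0.2, 0.1]   # predictions close to the observed grouping
chance = [0.5, 0.5, 0.5, 0.5]      # no better than chance

# MLE chooses coefficients whose predicted probabilities maximize this quantity
print(log_likelihood(y, close_fit), log_likelihood(y, chance))
```

The -2 log likelihood (-2LL) statistics in the SPSS output later in these slides are just this quantity multiplied by -2.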
Logistic regression using SPSS
Example of logistic regression analysis
- Research question: do gender, self-control, and self-efficacy predict drug use status?
- Three predictors:
○ Gender (a01: 1 = males, 0 = females)
○ Self-control (continuous variable)
○ Self-efficacy (a80r: 1 = somewhat-not sure, 0 = very sure)
- One dependent variable:
○ Drug use status (1 = drug users, 0 = non-users)
Logistic regression
Null hypothesis
- There is an equal chance of drug use or non-use for a given set of predictors; equivalently,
- The model coefficients are 0 (0 means there is no change due to the predictor variable).
Logistic regression using SPSS
- Analyze > Regression > Binary Logistic
- Enter Drug_use into Dependent
- Enter a01, self-control, and a80r into Covariates
Logistic regression using SPSS
Click Categorical
- Enter the two categorical variables (a01 and a80r) into the right-hand box.
Logistic regression using SPSS
Categorical
- The default contrast is Indicator and the default reference category is Last.
Logistic regression using SPSS
Categorical: contrast methods
- Indicator. Contrasts indicate the presence or absence of category membership. The reference category is represented in the contrast matrix as a row of zeros.
- Simple. Each category of the predictor variable (except the reference category) is compared to the reference category.
- Difference. Each category of the predictor variable except the first category is compared to the average effect of previous categories. Also known as reverse Helmert contrasts.
- Helmert. Each category of the predictor variable except the last category is compared to the average effect of subsequent categories.
Logistic regression using SPSS
Categorical: contrast methods (continued)
- Repeated. Each category of the predictor variable except the first category is compared to the category that precedes it.
- Polynomial. Orthogonal polynomial contrasts. Categories are assumed to be equally spaced. Polynomial contrasts are available for numeric variables only.
- Deviation. Each category of the predictor variable except the reference category is compared to the overall effect.
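The Indicator contrast is ordinary dummy coding. A pure-Python sketch of what SPSS does, using a hypothetical three-category predictor (the level names are made up for illustration):

```python
# Hypothetical responses for a three-category predictor
values = ["agree", "neutral", "disagree", "agree", "neutral"]

# Indicator (dummy) coding with the first level as the reference,
# mirroring SPSS's Indicator contrast with reference = First
levels = sorted(set(values))            # ['agree', 'disagree', 'neutral']
reference, others = levels[0], levels[1:]

# One 0/1 column per non-reference level; the reference level is a row of zeros
dummies = [[1 if v == level else 0 for level in others] for v in values]
print(others, dummies)
```

A k-category predictor thus becomes k - 1 dummy variables, which is exactly what the "Parameter coding" table in the SPSS output reports.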
Logistic regression using SPSS
Categorical
- For a01 and a80r, we want category 0 as the reference category.
- Check First, then click Change.
Logistic regression using SPSS
Save
- For each case, if the predicted probability is greater than 0.5, the respondent is predicted to use drugs (coded as 1).
- If the predicted probability is less than 0.5, the respondent is predicted not to use drugs (coded as 0).
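The 0.5 cut-off rule can be expressed in one line of Python (the probabilities below are hypothetical):

```python
# Hypothetical predicted probabilities saved by SPSS for five cases
predicted_prob = [0.73, 0.41, 0.55, 0.12, 0.50]

# Predicted group membership with the 0.5 cut-off: > 0.5 -> drug user (1), else 0
predicted_group = [1 if p > 0.5 else 0 for p in predicted_prob]
print(predicted_group)
```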
Logistic regression using SPSS
Options
SPSS output
Coding
- Our coding is the same as the Parameter coding.
SPSS output
Constant-only model:
Block 0 (the beginning block, step 0) means only the constant is in the model; our predictors are not yet in the equation.
SPSS output
Block 1 (step 1): the full model. Our predictors entered the model simultaneously; the method used is Enter.
- Model fit statistics. Validity of the model: the null hypothesis is rejected (p < .05).
- -2 log likelihood ratio test: tests whether the set of IVs improves prediction of the DV better than chance.
- Pseudo R square: Nagelkerke R2 is preferred. The model accounts for almost 10% of the variance of the DV.
- Hosmer and Lemeshow test: assesses whether the predicted probabilities match the observed probabilities. p > .05 means the set of IVs accurately predicts the actual probabilities.
SPSS output
Model fit: goodness-of-fit statistics help you determine whether the model adequately describes the data.
- -2 log likelihood test (-2LL)
- Omnibus test of model coefficients: a chi-square test based on the null hypothesis that all the coefficients are zero.
- Model summary: pseudo R2
- Hosmer and Lemeshow test
SPSS output
Block 1
In Block 0, the probability of a correct prediction is 50.4%; in Block 1, the overall predictive accuracy rises to 62.7%.
SPSS output
Classification table
- 64.9% is also known as the sensitivity of the prediction.
- 60.6% is also known as the specificity of the prediction.
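A sketch of these quantities in Python. The cell counts below are reconstructed to be consistent with the reported 64.9%, 60.6%, and 62.7% (205 users, 208 non-users, N = 413), so treat them as approximate rather than as the actual SPSS table:

```python
# Reconstructed classification-table cells (approximate, see note above)
true_pos, false_neg = 133, 72    # actual drug users: predicted correctly / missed
true_neg, false_pos = 126, 82    # actual non-users: predicted correctly / missed

sensitivity = true_pos / (true_pos + false_neg)   # share of users correctly predicted
specificity = true_neg / (true_neg + false_pos)   # share of non-users correctly predicted
accuracy = (true_pos + true_neg) / 413            # overall predictive accuracy

print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 3))
```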
SPSS output
Variables in the equation
1. Wald test: tests the effect of an individual predictor while controlling for the other predictors.
2. Exp(B): the odds ratio. For gender, males are 1.60 times more likely to use drugs than females. For self-control, the probability of drug use depends on the self-control level: the higher the self-control score, the less likely drug use. For a80r, the low self-efficacy group is 1.53 times more likely to use drugs than the high self-efficacy group.
SPSS output
Equation
The equation is:
ln(odds) = .256 + .471 Gender - .093 Self-control + .427 Self-efficacy
Predicted probability = e^ln(odds) / (1 + e^ln(odds))
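Plugging the fitted coefficients into the inverse-logit formula gives a predicted probability for any case. (The self-control score of 10 below is an arbitrary illustrative value, not from the data.)

```python
import math

# Coefficients from the fitted equation on this slide
b0, b_gender, b_control, b_efficacy = 0.256, 0.471, -0.093, 0.427

def predicted_prob(gender, self_control, self_efficacy):
    """Predicted probability of drug use: e^ln(odds) / (1 + e^ln(odds))."""
    log_odds = (b0 + b_gender * gender
                + b_control * self_control
                + b_efficacy * self_efficacy)
    return math.exp(log_odds) / (1 + math.exp(log_odds))

# e.g., a male (1) with low self-efficacy (1) and an illustrative self-control score of 10
print(round(predicted_prob(1, 10, 1), 3))
```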
SPSS output
Saved new variables
1. For each case, if the predicted probability is greater than 0.5, the respondent is predicted to use drugs (coded as 1).
2. If the predicted probability is less than 0.5, the respondent is predicted not to use drugs (coded as 0).
Results
The logistic regression was performed to test the effects of self-control, self-efficacy, and gender on drug use. Results indicated that the three-predictor model provided a statistically significant improvement over the constant-only model, χ2(3, N = 413) = 31.36, p < .001. The Nagelkerke R2 indicated that the model accounted for 9.8% of the total variance. The correct prediction rate was about 62.7%. The Wald tests showed that all three predictors significantly predicted drug use status.
Logistic regression using SPSS
Independent variables that are categorical with more than 2 categories
- Example: add a93a ("My friends think that it's okay for me to drink too much alcohol") to the model as an independent variable.
- Rerun the previous logistic regression.
- Use the Indicator method with the first level as the reference.
Logistic regression using SPSS
SPSS output: coding
SPSS recoded a93a into three dummy variables with the first level as the reference (represented in the contrast matrix as a row of zeros).
Logistic regression using SPSS
SPSS output: Variables in the equation
Comparing logistic models
Purpose of comparing logistic models
- To determine whether adding more variables to the model provides an improvement in predictive power.
Example: we want to know whether there is a significant interaction of self-efficacy and self-control on the probability of drug use.
- Add the interaction term (self-efficacy * self-control) to the model.
- We will have three models: the constant-only model; the model with three predictors and the constant; and the model with the interaction term, three predictors, and the constant.
Comparing logistic models
- Enter a01, a80r, and self-control into Block 1, then click Next.
Comparing logistic models
Block 2
- Highlight a80r, hold Control, and select self-control. Then click a*b> to enter a80r*self-control into Block 2.
SPSS output
The results for Block 0 and Block 1 are the same as those from the previous analysis.
1. Block 2 is not significant, which means the interaction term is not significant. "Model" means that, with everything in the equation, the whole model is significant.
2. The difference in -2 log likelihood between Block 2 and Block 1 is 541.153 - 537.486 = 3.67. This is a chi-square statistic with df = 1, p > .05 (from the chi-square table, chi-square = 3.84 at p = .05 with df = 1).
SPSS output
The chi-square change from Block 1 to Block 2 is 35.032 - 31.364 = 3.668, which is the chi-square for the interaction term. The R2 change indicates that 1% of the variance is explained by the interaction term. The improvement in prediction is not significant (p = .055).
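The chi-square difference test can be reproduced in Python. For df = 1 the chi-square p-value reduces to erfc(sqrt(x/2)), so only the standard library is needed:

```python
import math

# -2 log likelihood of each nested model, from the SPSS output
neg2ll_block1 = 541.153   # three predictors
neg2ll_block2 = 537.486   # three predictors + interaction term

chi_sq = neg2ll_block1 - neg2ll_block2   # change in -2LL, df = 1
# For df = 1, the chi-square upper-tail probability equals erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(round(chi_sq, 3), round(p_value, 3))
```

Since p is just above .05, the interaction term does not significantly improve the model, matching the slide's conclusion.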
SPSS output
The Wald test also shows that there is no significant interaction effect of self-efficacy and self-control on the DV.
Equation of the model: ln(odds) = .021 + .476 Gender - .063 Self-control + 1.03 Self-efficacy - .079 Self-efficacy * Self-control
Graphs for interaction effect
Graphs for interaction effect
Legend: Self-efficacy: 1 = somewhat sure-not sure; 0 = very sure.
Multinomial logistic regression
Used to classify subjects based on the values of a set of predictor variables. This type of regression is similar to logistic regression, but it is more general because the dependent variable is not restricted to two categories.
Multinomial logistic regression
Variables
- The dependent variable should be categorical.
- Independent variables can be factors or covariates. In general, factors should be categorical variables and covariates should be continuous variables.
Multinomial logistic regression using SPSS
Example (from the SPSS samples)
- Regression to determine marketing profiles for each breakfast option.
- Dependent variable: choice of breakfast (1 = breakfast bar; 2 = oatmeal; 3 = cereal)
- Independent variables: age, gender, lifestyle (all categorical variables)
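Multinomial logistic regression gives each non-reference category its own linear predictor and converts them to probabilities with the softmax function. A minimal sketch (the scores below are hypothetical, not fitted values):

```python
import math

def softmax(scores):
    """Convert one linear score per category into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear predictors for one respondent:
# breakfast bar, oatmeal, cereal (the reference category's score is fixed at 0)
scores = [0.8, -0.2, 0.0]
probs = softmax(scores)
print([round(p, 3) for p in probs])
```

With two categories this reduces to the binary logistic model, which is why multinomial regression is described as its generalization.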
Multinomial logistic regression using SPSS
Go to Analyze > Regression > Multinomial Logistic
Multinomial logistic regression using SPSS
Click Reference Category
- We can use any category of the dependent variable as the reference.
Multinomial logistic regression using SPSS
Click Model
- The default model is Main effects. We can customize our model and request both main effects and interaction effects.
Multinomial logistic regression using SPSS
Click Statistics
Multinomial logistic regression using SPSS
SPSS output
Multinomial logistic regression using SPSS
SPSS output
- Cells with zero frequencies can be a useful indicator of potential problems. Since there are few of these empty cells (only 4.2%), you can probably safely use the results of the goodness-of-fit tests.
Multinomial logistic regression using SPSS
SPSS output
1. The likelihood ratio tests check the difference between the null model and the final model.
2. The chi-square in the first table is the change in -2 log likelihood from the intercept-only model to the final model.
3. The results show that the final model outperforms the null model.
4. The goodness-of-fit results show that the model adequately fits the data.
Multinomial logistic regression using SPSS
SPSS output
- The likelihood ratio tests check the contribution of each effect to the model.
- Age and Active make significant contributions to the model.
Multinomial logistic regression using SPSS
SPSS output
Multinomial logistic regression using SPSS
SPSS output: the Parameter Estimates table summarizes the effect of each predictor.
- Parameters with significant negative coefficients decrease the likelihood of that response category with respect to the reference category.
- Parameters with positive coefficients increase the likelihood of that response category.
Multinomial logistic regression using SPSS
SPSS output
- Cells on the diagonal are correct predictions; cells off the diagonal are incorrect predictions.
- Overall, 56.4% of the cases are classified correctly.