Logistic regression
Maths and Statistics Help Centre
Many statistical tests require the dependent (response) variable to be continuous, so a different set of tests is needed when the dependent variable is categorical. One of the most commonly used tests for categorical variables is the Chi-squared test, which assesses whether there is a relationship between two categorical variables, but it makes no allowance for the potential influence of other explanatory variables on that relationship. For continuous outcome variables, multiple regression can be used for:
a) controlling for other explanatory variables when assessing relationships between a dependent variable and
several independent variables
b) predicting outcomes of a dependent variable using a linear combination of explanatory (independent)
variables
The maths:
For multiple regression a model of the following form can be used to predict the value of a response variable y
using the values of a number of explanatory variables:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q$$

where $\beta_0$ is the constant/intercept and $\beta_1, \dots, \beta_q$ are the coefficients for the $q$ explanatory variables $x_1, \dots, x_q$.
The regression process finds the coefficients which minimise the squared differences between the observed
and expected values of y (the residuals). As the outcome of logistic regression is binary, y needs to be
transformed so that the regression process can be used. The logit transformation gives the following:
$$\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q$$

$p$ = probability of the event occurring, e.g. a person dies following a heart attack

$\dfrac{p}{1-p}$ = the odds of the event
If probabilities of the event of interest happening for individuals are needed, the logistic regression equation
can be written as:

$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q)}, \qquad 0 < p < 1$$
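The logit and its inverse can be sketched in a few lines of Python (illustrative helper names, not part of the original handout):

```python
import math

def logit(p):
    """Log-odds of a probability p (requires 0 < p < 1)."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse logit: maps a linear predictor z back to a probability."""
    return math.exp(z) / (1 + math.exp(z))

# Whatever value the linear predictor z = b0 + b1*x1 + ... takes,
# inv_logit(z) always lies strictly between 0 and 1.
```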
Logistic regression does the same but the outcome variable is binary and leads to a model which can predict the
probability of an event happening for an individual.
Titanic example: On April 15th 1912, only 705 passengers and crew out of the 2228 on board the Titanic survived
when the ship sank. Information on 1309 of those on board will be used to demonstrate logistic regression. The
data can be downloaded from
biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
The key variables of interest are:
Dependent variable: Whether a passenger survived or not (survival is indicated by survived = 1).
Possible explanatory variables: Age, gender (recode so that sex = 1 for females and 0 for males), class (pclass = 1, 2 or
3), number of accompanying parents/ children (parch) and number of accompanying siblings/ spouses (sibsp)
Most of the variables can be investigated using crosstabulations with the dependent variable ‘survived’. Another
reason for the cross tabulation is to identify categories with small frequencies as this can cause problems with the
logistic regression procedure. The number of accompanying parents/ children (parch) and number of accompanying
siblings/ spouses (sibsp) were used to create a new binary variable indicating whether or not the person was
travelling alone or with family (1 = travelling with family, 0 = travelling alone).
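The recoding described above can be sketched in Python (toy records using the dataset's field names; the real data would come from the .xls file linked earlier):

```python
# Toy passenger records using the titanic3 field names (sex, parch, sibsp)
passengers = [
    {"sex": "female", "parch": 1, "sibsp": 0},
    {"sex": "male",   "parch": 0, "sibsp": 0},
    {"sex": "male",   "parch": 0, "sibsp": 2},
]

for person in passengers:
    # Gender recode: 1 = female, 0 = male
    person["sex01"] = 1 if person["sex"] == "female" else 0
    # Family recode: 1 = travelling with family, 0 = travelling alone
    person["family"] = 1 if (person["parch"] + person["sibsp"]) > 0 else 0
```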
% surviving:

Male: 19.1%
Female: 72.7%
1st class: 61.9%
2nd class: 43%
3rd class: 25.5%
Travelling alone: 30.3%
Travelling with family: 50.3%
When tested separately, Chi-squared tests concluded that there was evidence of a relationship between survival and
gender, class and whether an individual was travelling alone. Looking at the %’s of survival, it’s clear that women,
those in first class and those not travelling alone were much more likely to survive. Logistic regression will be carried
out using these three variables first of all. Stage 1 of the following analysis will relate to using logistic regression to
control for other variables when assessing relationships and stage 2 will look at producing a good model to predict
from.
Use ANALYZE → Regression → Binary Logistic to open the logistic regression dialog. Where there are more than two categories, the last category is automatically the reference category. This means that all the other categories will be compared to the reference in the output, e.g. 1st and 2nd class will be compared to 3rd class.
Treatment of categorical explanatory variables
When interpreting SPSS output for logistic regression, it is important that binary variables are coded as 0 and 1.
Also, categorical variables with three or more categories need to be recoded as dummy variables with 0/ 1 outcomes
e.g. class needs to appear as two variables 1st/ not 1st with 1 = yes and 2nd/ not 2nd with 1 = yes. Luckily SPSS does
this for you! When adding a categorical variable to the list of covariates, click on the Categorical button and move all
categorical variables to the right hand box. The following table in the output shows the coding of these variables.
Categorical Variables Codings

                                             Parameter coding
                                 Frequency   (1)      (2)
class              1st class     323         1.000    .000
                   2nd class     277         .000     1.000
                   3rd class     709         .000     .000
Travelling alone   Alone         790         1.000
                   with family   519         .000
Gender             male          843         1.000
                   female        466         .000
For class, 3rd class is the reference class so
if 1st class = 0 and 2nd class = 0, the person
must have been in 3rd class.
Females and those not travelling alone are
the references for the other groups.
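The dummy coding shown in the table above can be expressed as a small function (a sketch of what SPSS does internally, with hypothetical names):

```python
def dummy_code(pclass):
    """Indicator pair (class_1, class_2) with 3rd class as the reference:
    a row where both indicators are zero is 3rd class."""
    return (1 if pclass == 1 else 0, 1 if pclass == 2 else 0)
```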
Interpretation of the output
The output is split into two sections, block 0 and block 1. Block 0 assesses the usefulness of having a null model,
which is a model with no explanatory variables. The ‘variables in the equation’ table only includes a constant so
each person has the same chance of survival.
The null model is:

$$\ln\!\left(\frac{p}{1-p}\right) = \beta_0 = -0.481$$

$$p = \text{probability of survival} = \frac{\exp(-0.481)}{1 + \exp(-0.481)} = 0.382$$
SPSS calculates the probability of survival for each individual using the block 0 model. If the probability of survival is
0.5 or more it will predict survival (as survival = 1), and death if the probability is less than 0.5. As more people died
than survived, the probability of survival is 0.382 and therefore everyone is predicted as dying (coded as 0). Since
61.8% of those on board actually died, the null model classifies 61.8% of people correctly. The addition of
explanatory variables should increase the percentage of correct classification significantly if the model is good.
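The null-model probability quoted above can be checked directly (a minimal sketch using the constant from the block 0 output):

```python
import math

b0 = -0.481  # constant from the block 0 (null) model
p = math.exp(b0) / (1 + math.exp(b0))
# p works out at about 0.382, below the 0.5 cut-off, so the null model
# predicts 'did not survive' for every passenger.
```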
Block 0: Beginning Block
Block 1: Method = Enter
Block 1 shows the results after the addition of the explanatory variables selected.
The Omnibus Tests of Model Coefficients table gives the result of the Likelihood Ratio (LR) test, which indicates
whether the inclusion of this block of variables contributes significantly to model fit. A p-value (Sig) of less than 0.05
for the block means that the block 1 model is a significant improvement over the block 0 model.
In standard regression, the coefficient of determination (R²) gives an indication of how much variation in y is
explained by the model. This cannot be calculated for logistic regression, but the 'Model Summary' table gives
values for two pseudo-R² measures which try to capture something similar. From the table above, we can conclude
that between 31% and 42.1% of the variation in survival can be explained by the model in block 1. The correct
classification rate has increased by 16.2% to 78%.
Finally, the ‘Variables in the Equation’ table summarises the importance of the explanatory variables individually
whilst controlling for the other explanatory variables.
The Wald test is similar to the LR test but here it is used to test the hypothesis that each coefficient $\beta = 0$. In the Sig column,
the p-values are all below 0.05 apart from the test for the variable Alone (p = 0.286). This means that although the
Chi-squared test for Survival vs Alone was significant, once the other variables were controlled for, there is not a
strong enough relationship between that variable and survival. Class is tested as a whole (pclass) and then 1st and
2nd class are compared to the reference category, 3rd class. When interpreting the differences, look at the Exp(B)
column, which represents the odds ratio for the individual variable. For example, those in 1st class were 5.493 times
more likely to survive than those in 3rd class. With gender, the odds ratio compares the likelihood of a male
surviving in comparison to females. The odds are a lot lower for men (0.084 times those of women). For ease of
interpretation, calculate the odds of a female surviving relative to a male using 1/0.084 = 11.9. Females were 11.9 times
more likely to survive.
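The odds-ratio arithmetic above can be reproduced directly (values taken from the Exp(B) column):

```python
or_first_vs_third = 5.493   # 1st class relative to 3rd class
or_male_vs_female = 0.084   # males relative to females

# Inverting a below-1 odds ratio reads it the other way round:
or_female_vs_male = 1 / or_male_vs_female
# roughly 11.9: the odds of survival for females were about 11.9 times
# those for males.
```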
The coefficients for the model are contained in the column headed B. A negative value means that the odds of
survival decrease, e.g. for males and those travelling alone.
The full model being tested is:

$$\ln\!\left(\frac{p}{1-p}\right) = 0.460 + 1.703\,x_{\text{1st class}} + 0.832\,x_{\text{2nd class}} - 2.474\,x_{\text{male}} - 0.156\,x_{\text{alone}}$$

where $x_{\text{1st class}} = 1$ for 1st class, $x_{\text{2nd class}} = 1$ for 2nd class, $x_{\text{male}} = 1$ for men and $x_{\text{alone}} = 1$ for a person travelling alone.