Transcript
```
Logistic Regression
Part I - Introduction
Logistic Regression
• Regression where the response variable is
dichotomous (not continuous)
• Examples
– effect of concentration of drug on whether
symptoms go away
– effect of age on whether or not a patient
survived treatment
– effect of negative cognitions about SELF,
WORLD, or Self-BLAME on whether a
participant has PTSD
Simple Linear Regression
• Relationship between continuous
response variable and continuous
explanatory variable
• Example
– Effect of concentration of drug on reaction
time
– Effect of age of patient on number of years of
post-operation survival
Simple Linear Regression
• RT (ms) = β0 + β1 x concentration (mg)
• β0 is value of RT when concentration is 0
• β1 is change in RT caused by a change in
concentration of 1mg.
• E.g. RT = 400 + 50 x concentration
Logistic Regression
• What do we do when we have a response
variable which is not continuous, but is
dichotomous
[Figures: Probability of Disease, Odds of Disease, and Log(Odds) of Disease, each plotted against Concentration]
Odds
• Odds are simply the ratio of the
proportions for the two possible
outcomes.
• If p is the proportion for one outcome,
then 1- p is the proportion for the
second outcome.
Odds (Example)
• At concentration level 16 we observe 75
participants out of 100 showing no disease
(healthy)
• If p is the probability of being healthy, then p = 0.75.
• Then 1 – p is the probability of not healthy, and
equals 0.25
• Odds of showing healthy over not healthy
given concentration level 16
• p / (1 – p) = 0.75/0.25 = 3
• This means that a person is 3 times as likely to be
healthy as not healthy at concentration level 16
Logarithms
• Logarithms are a way of expressing
numbers as powers of a base
• Example
– 10^2 = 100
– 10 is called the “base”
– The power, 2 in this case, is called the
“exponent”
• Therefore 10^2 = 100 means that log10(100) = 2
Log Odds
• Odds of being healthy after 16mg of drug
is 3
• Log odds (natural log) is ln(3) = 1.10
• Let's say that odds of being healthy after
2mg of drug is 0.25
• This means that a person is four times as likely to
not be healthy as to be healthy after 2mg of drug
• Log odds is ln(0.25) = -1.39
Logistic Regression
• With log-odds we can now look at the linear
relationship between a dichotomous response variable
and a continuous explanatory variable
log( p̂ / (1 − p̂) ) = β0 + β1 X
Where, for example, p is the probability of being
healthy at different levels of drug concentration,
X
Example: Simple Logistic
Regression
• Look at the effect of drug concentration on
probability of NOT having disease (i.e.
being healthy)
• Use SPSS to do the regression (we’ll all do
this soon)
• Get
log( p̂_nodisease / (1 − p̂_nodisease) ) = b0 + 0.106 × Concentration, |b0| = 2.92
Looks Like
log( p̂_nodisease / (1 − p̂_nodisease) ) = b0 + 0.106 × Concentration, |b0| = 2.92
• Interpreting parameters (b0 and b1) in logistic
regression is a little tricky
• An increase of 1mg of concentration increases
the log(odds) of being healthy by 0.106
• An increase of 1mg of concentration multiplies
the odds of being healthy by
e^b1 = e^0.106 ≈ 1.11
• Increasing concentration by 1mg increases
odds of being healthy by a factor of 1.11
Slope Parameter
• Parameter β1 in general:
– if positive then increasing X increases the
odds of the outcome with probability p
– if negative then increasing X decreases the
odds of the outcome with probability p
– the larger β1 is in magnitude, the larger the effect
of X on p
• Like simple linear regression, can test
whether or not β1 is significantly different
from 0.
Let’s break to do simple Logistic
Regression
• Open XYZ.sav in SPSS
• Fit logistic regression with
– PTSD (Y/N) as response variable
– Self-BLAME as explanatory variable
• Is the effect of Self-BLAME significant?
• Get parameter estimates
• Write equation of model
• What are the odds of having PTSD given a Self-BLAME score of 3?
• Use the interpretation of the regression
coefficient to work out odds given Self-BLAME
of 4.
Logistic Regression
Part II – Multiple Logistic
Regression
Multiple Linear Regression
• Simple Linear Regression extended out to
more than one explanatory variable
• Example
– Effect of both concentration and age on
reaction time
– Effect of age, number of previous operations,
time in anaesthesia, cholesterol level, etc. on
number of years of post-operation survival
Multiple Linear Regression
RT (ms) = β0 + β1 x concentration (mg) + β2 x
age + β3 x gender (0=male,1=female)
β0 is value of RT when concentration is 0, age is 0 and gender is 0 (male).
β1 is change in RT caused by a change in
concentration of 1mg.
β2 is change in RT caused by a change in age of 1
year.
β3 is change in RT caused by going from male to
female in gender.
Multiple Logistic Regression
• Look at the effect of drug concentration, age and
gender on probability of NOT having disease
log( p̂ / (1 − p̂) ) = β0 + β1 X1 + β2 X2 + β3 X3
Where p is the probability of not having the
disease, X1 is the concentration of drug (mg),
X2 is age (years), and X3 is gender (0 for males,
1 for females)
log( p̂_nodisease / (1 − p̂_nodisease) ) = b0 + 0.106 × Concentration − 0.0532 × Age + 0.001 × Gender, |b0| = 2.92
• Again, use SPSS to fit logistic model
• Increasing concentration increases odds of not
having the disease (again, being healthy)
• Increasing age decreases odds of being healthy
• “Increasing” gender (from male to female)
increases odds of being healthy
• In particular, each extra year of age multiplies the odds
of being healthy by a factor of e^−0.0532 ≈ 0.95
• M to F multiplies odds by a factor of e^0.001 ≈ 1.001
Was it worth adding the factors?
• When we add parameters we make our
model more complicated.
• We really want this addition to be “worth it”
• In other words, adding age and gender
should improve our explanation of disease
• But what constitutes an improvement
Was it worth adding the factors?
• Quality (badness) of model fit is given by
-2logL
• If we want to see if it was worth adding
parameters we can compare the quality of the
fit of the simple and the more complex model
• Quality of model fit follows a chi-square (χ2)
distribution with degrees-of-freedom (df) equal to
the number of parameters in the model
• The difference between quality of fit also
follows a χ2 distribution with df equal to the
difference in the number of parameters between
the two models
Was it worth adding these factors?
• Simple logistic regression model has
overall χ2 of 45.7
• This multiple logistic regression model with
2 extra parameters has χ2 of 40.02
• Test whether χ2 = 45.7 - 40.02 = 5.68 is a
significant improvement
• Critical χ2 for 2 df (at α = .05) is 5.99
• Our χ2 is smaller and so NO, not worth it
BUT…
• It doesn’t look like gender is having much of an
effect
• Check SPSS output and see that Wald χ2 for
Gender is 0.527, which has p = .47
• Perhaps it wasn’t worth adding both parameters,
but it will be worth just adding Age
• Age has Wald-χ2 = 4.33, p = .03
• When we only add Age, change in χ2 = 5.5 and
we test against χ2 with df of 1, which has p = .02
Logistic Regression Model Building
• What if we have a whole host of possible
explanatory variables
• We want to build a model which predicts whether
a person will have a disease given a set of
explanatory variables
• SAME as multiple linear regression
– Forward selection
– Backward elimination
– Stepwise
– All subsets
– Hierarchical
How to know if a model is good
• All about having a model which does a good job of
appropriately classifying participants as having disease
or not
• In particular, model predicts how many people have
disease and how many people don’t have the disease
• The model can be
– Correct in two ways
• Correctly categorise a person who has a disease as having a
disease
• Correctly say no disease when no disease
– Incorrect in two ways
• Incorrectly categorise a person who has a disease as not having a
disease
• Incorrectly say disease when no disease
Accuracy of model
• Proportion of correct classifications
– Number of correct disease participants plus
number of correct no disease participants
divided by number of participants in total
nCD  nC ND
nCD  nC ND  nICD  nIC ND
Sensitivity of model
• Proportion of ‘successes’ correctly
identified
– Number of correct no disease participants
divided by total number of no disease
participants
n_CND / (n_CND + n_ICND)
Specificity of model
• Proportion of ‘failures’ correctly identified
– Number of correct disease participants
divided by total number of disease
participants
n_CD / (n_CD + n_ICD)
Now…a real example
• Startup, Makgekgenene and Webster
(2007) looked at whether or not the
subscales of the Posttraumatic Cognitions
Inventory (PTCI) are good predictors of
Posttraumatic Stress Disorder (PTSD)
• Subscales are
– Negative Cognitions about SELF
– Negative Cognitions about the WORLD
– Self-BLAME
Descriptive Results
• PTSD participants showed higher scores
than non-PTSD participants on all three
subscales
Multiple Logistic Regression
• Response variable:
– whether or not the participant has PTSD
• Explanatory variables:
– Negative Cognitions about SELF
– Negative Cognitions about the WORLD
– Self-BLAME
Let’s do the Logistic Regression
• Open XYZ.sav in SPSS
• Run the appropriate regression
• What are the parameter estimates for our
three explanatory variables?
• Which of these are significant (at α = .05)?
• What are the odds ratios for those that are
significant?
• Anything unusual?
Self-BLAME
• Self-BLAME has a negative coefficient (odds ratio less than 1).
• This means that increasing self-blame
decreases the chance of having PTSD
• This is surprising, especially since
participants with PTSD showed higher
Self-BLAME scores
• What’s going on?
Self-BLAME and SELF scales
• Startup et al. (2007) explain this by stating
that Self-BLAME is made up of both
behavioural and characterological
questions
• SELF, however, may also tap into
characterological aspects of self-blame
• Behavioural self-blame can be considered
adaptive. It may help avoid PTSD
• Characterological self-blame, however,
may be detrimental, and lead to PTSD
Suppressor Effect
• The relationship between SELF and PTSD is
strong, and accounts for the negative
relationship. This includes the effect of
characterological self-blame.
• The variation in PTSD that is left for Self-BLAME
to account for is the positive aspect of the
relationship between the Self-BLAME scores
and PTSD.
• The negative aspect of Self-BLAME scores has
been suppressed (already accounted for by
SELF). The positive aspect of Self-BLAME can
now come out.
Homework (haha)
• Evaluate the model by looking at
– Accuracy of model’s predictions
– Sensitivity of model’s predictions
– Specificity of model’s predictions
```
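The arithmetic in the slides can be checked with a few lines of Python. The following is a minimal sketch using only the standard library, not the SPSS procedure the slides rely on. The function names are hypothetical, the chi-square tail probability is implemented only for the 1 and 2 df cases used in the slides, and the classification counts in the last call are made-up illustration values, not data from the study.

```python
import math


def odds(p):
    """Odds of an outcome that has probability p."""
    return p / (1.0 - p)


def log_odds(p):
    """Natural log of the odds, the quantity modelled by logistic regression."""
    return math.log(odds(p))


def odds_multiplier(coef, delta=1.0):
    """Factor by which the odds change when a predictor rises by delta."""
    return math.exp(coef * delta)


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable.

    Closed forms exist for df = 1 and df = 2, which are the only
    cases needed for the model comparisons in the slides.
    """
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only supports df = 1 or df = 2")


def classification_summary(n_cd, n_cnd, n_icd, n_icnd):
    """Accuracy, sensitivity and specificity as defined in the slides.

    n_cd / n_cnd: correctly classified disease / no-disease participants.
    n_icd / n_icnd: incorrectly classified disease / no-disease participants.
    'Success' is no disease, so sensitivity is computed on the no-disease group.
    """
    total = n_cd + n_cnd + n_icd + n_icnd
    accuracy = (n_cd + n_cnd) / total
    sensitivity = n_cnd / (n_cnd + n_icnd)
    specificity = n_cd / (n_cd + n_icd)
    return accuracy, sensitivity, specificity


# Odds example: 75 of 100 participants healthy at concentration level 16.
print(odds(0.75), round(log_odds(0.75), 2))

# Slope interpretation: b1 = 0.106 for concentration, -0.0532 for age.
print(round(odds_multiplier(0.106), 3), round(odds_multiplier(-0.0532), 3))

# Model comparison: change in chi-square of 5.68 on 2 df, and 5.5 on 1 df.
print(round(chi2_sf(45.7 - 40.02, 2), 3), round(chi2_sf(5.5, 1), 3))

# Hypothetical classification counts, for illustration only.
print(classification_summary(n_cd=30, n_cnd=50, n_icd=10, n_icnd=10))
```

With real data, the same quantities would normally come from a statistics package rather than hand-rolled helpers like these.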