Logistic Regression Part I - Introduction

Logistic Regression
• Regression where the response variable is dichotomous (not continuous)
• Examples
  – effect of concentration of drug on whether symptoms go away
  – effect of age on whether or not a patient survived treatment
  – effect of negative cognitions about SELF, WORLD, or Self-BLAME on whether a participant has PTSD

Simple Linear Regression
• Relationship between continuous response variable and continuous explanatory variable
• Example
  – Effect of concentration of drug on reaction time
  – Effect of age of patient on number of years of post-operation survival

Simple Linear Regression
• RT (ms) = β0 + β1 × concentration (mg)
• β0 is the value of RT when concentration is 0
• β1 is the change in RT caused by a change in concentration of 1mg.
• E.g. RT = 400 + 50 × concentration

Logistic Regression
• What do we do when we have a response variable which is not continuous, but is dichotomous?

[Figures: probability of disease, odds of disease, and log(odds) of disease, each plotted against concentration]

Odds
• Odds are simply the ratio of the proportions for the two possible outcomes.
• If p is the proportion for one outcome, then 1 – p is the proportion for the second outcome.

Odds (Example)
• At concentration level 16 we observe 75 participants out of 100 showing no disease (healthy)
• If p is the probability of being healthy, then p = 0.75.
• Then 1 – p is the probability of not being healthy, and equals 0.25
• Odds of being healthy rather than not healthy at concentration level 16:
• p / (1 – p) = 0.75/0.25 = 3
• Means that a person is 3 times as likely to be healthy as not healthy at concentration level 16

Logarithms
• Logarithms are a way of expressing numbers as powers of a base
• Example
  – 10^2 = 100
  – 10 is called the “base”
  – The power, 2 in this case, is called the “exponent”
• Therefore 10^2 = 100 means that log10(100) = 2

Log Odds
• Odds of being healthy after 16mg of drug is 3
• Log odds is log(3) = 1.1 (natural logarithm, base e)
• Let’s say that odds of being healthy after 2mg of drug is 0.25
• Means that it is four times more likely to not be healthy after 2mg of drug
• Log odds is log(0.25) = -1.39

Logistic Regression
• With log-odds we can now look at the linear relationship between a dichotomous response and a continuous explanatory variable

  log(p̂ / (1 – p̂)) = β0 + β1 X

  where, for example, p is the probability of being healthy at different levels of drug concentration, X

Example: Simple Logistic Regression
• Look at the effect of drug concentration on probability of NOT having disease (i.e. being healthy)
• Use SPSS to do the regression (we’ll all do this soon)
• Get

  log(p̂_nodisease / (1 – p̂_nodisease)) = ±2.92 + 0.106 × Concentration
  (the sign of the intercept was lost in this transcript)

Looks Like

  log(p̂_nodisease / (1 – p̂_nodisease)) = ±2.92 + 0.106 × Concentration

• Interpreting parameters (b0 and b1) in logistic regression is a little tricky
• An increase of 1mg of concentration increases the log(odds) of being healthy by 0.106
• An increase of 1mg of concentration multiplies the odds of being healthy by e^b1 = e^0.106 ≈ 1.11
• That is, increasing concentration by 1mg increases the odds of being healthy by a factor of 1.11

Slope Parameter
• Parameter β1 in general:
  – if positive, then increasing X increases the odds of the outcome
  – if negative, then increasing X decreases the odds of the outcome
  – the larger β1 is in magnitude, the larger the effect of X on p
• Like simple linear regression, can test whether or not β1 is significantly different from 0.
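
The odds and log-odds arithmetic from the slides above can be checked with a few lines of Python. This is only a minimal sketch using the standard library: the function names `odds` and `log_odds` are illustrative and do not come from the slides, and the natural logarithm is assumed because it reproduces the slide values 1.1 and -1.39.

```python
import math


def odds(p):
    """Odds of an outcome that has probability p."""
    return p / (1 - p)


def log_odds(p):
    """Natural log of the odds (the logit) of probability p."""
    return math.log(odds(p))


# Slide example: 75 of 100 participants are healthy at concentration level 16.
p_healthy = 75 / 100
print(round(odds(p_healthy), 2))      # 3.0
print(round(log_odds(p_healthy), 2))  # 1.1

# Odds of 0.25 correspond to p = 0.2, since 0.2 / 0.8 = 0.25.
print(round(log_odds(0.2), 2))        # -1.39

# A slope of 0.106 on the log-odds scale multiplies the odds by e^0.106.
odds_ratio = math.exp(0.106)
print(round(odds_ratio, 2))           # 1.11
```

Exponentiating a slope is what turns a change in log-odds back into a multiplicative change in odds, which is why the slides report the factor 1.11 next to the coefficient 0.106.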
Let’s break to do simple Logistic Regression
• Open XYZ.sav in SPSS
• Fit logistic regression with
  – PTSD (Y/N) as response variable
  – Self-BLAME as explanatory variable
• Is the effect of Self-BLAME significant?
• Get parameter estimates
• Write equation of model
• What are the odds of having PTSD given a Self-BLAME score of 3?
• Use the interpretation of the regression coefficient to work out odds given Self-BLAME of 4.

Logistic Regression Part II – Multiple Logistic Regression

Multiple Linear Regression
• Simple Linear Regression extended out to more than one explanatory variable
• Example
  – Effect of both concentration and age on reaction time
  – Effect of age, number of previous operations, time in anaesthesia, cholesterol level, etc. on number of years of post-operation survival

Multiple Linear Regression

  RT (ms) = β0 + β1 × concentration (mg) + β2 × age + β3 × gender (0=male, 1=female)

• β0 is value of RT when concentration, age and gender are all 0.
• β1 is change in RT caused by a change in concentration of 1mg.
• β2 is change in RT caused by a change in age of 1 year.
• β3 is change in RT caused by going from male to female in gender.

Multiple Logistic Regression
• Look at the effect of drug concentration, age and gender on probability of NOT having disease

  log(p̂ / (1 – p̂)) = β0 + β1 X1 + β2 X2 + β3 X3

  where p is the probability of not having the disease, X1 is the concentration of drug (mg), X2 is age (years), and X3 is gender (0 for males, 1 for females)

  log(p̂_nodisease / (1 – p̂_nodisease)) = ±2.92 + 0.106 × Concentration – 0.0532 × Age + 0.001 × Gender
  (the sign of the intercept was lost in this transcript; the signs of the other terms follow from the odds factors below)

• Again, use SPSS to fit logistic model
• Increasing concentration increases odds of not having the disease (again, being healthy)
• Increasing age decreases odds of being healthy
• “Increasing” gender (from male to female) increases odds of being healthy
• In particular, each additional year of age multiplies the odds of being healthy by a factor of 0.95
• M to F increases odds by factor of 1.001

Was it worth adding the factors?
• When we add parameters we make our model more complicated.
• We really want this addition to be “worth it”
• In other words, adding age and gender should improve our explanation of disease
• But what constitutes an improvement?

Was it worth adding the factors?
• Quality (badness) of model fit is given by -2 log L
• If we want to see whether it was worth adding parameters, we can compare the quality of the fit of the simple and the more complex model
• Quality of model fit follows a chi-square (χ2) distribution with degrees-of-freedom (df) equal to the number of parameters in the model
• The difference between quality of fit also follows a χ2 distribution with df equal to the difference in the number of parameters between the two models

Was it worth adding these factors?
• Simple logistic regression model has overall χ2 of 45.7
• This multiple logistic regression model with 2 extra parameters has χ2 of 40.02
• Test whether χ2 = 45.7 - 40.02 = 5.68 is a significant improvement
• Critical χ2 for 2 df (at α = .05) is 5.99
• Our χ2 is smaller and so NO, not worth it

BUT…
• It doesn’t look like gender is having much of an effect
• Check SPSS output and see that Wald χ2 for Gender is 0.527, which has p = .47
• Perhaps it wasn’t worth adding both parameters, but it will be worth just adding Age
• Age has Wald χ2 = 4.33, p = .03
• When we only add Age, change in χ2 = 5.5 and we test against χ2 with df of 1, which has p = .02

Logistic Regression Model Building
• What if we have a whole host of possible explanatory variables?
• We want to build a model which predicts whether a person will have a disease given a set of explanatory variables
• SAME as multiple linear regression
  – Forward selection
  – Backward elimination
  – Stepwise
  – All subsets
  – Hierarchical

How to know if a model is good
• All about having a model which does a good job of appropriately classifying participants as having disease or not
• In particular, model predicts how many people have disease and how many people don’t have the disease
• The model can be
  – Correct in two ways
    • Correctly categorise a person who has a disease as having a disease
    • Correctly say no disease when no disease
  – Incorrect in two ways
    • Incorrectly categorise a person who has a disease as not having a disease
    • Incorrectly say disease when no disease

Accuracy of model
• Proportion of correct classifications
  – Number of correct disease participants plus number of correct no disease participants divided by number of participants in total

  (n_CD + n_CND) / (n_CD + n_CND + n_ICD + n_ICND)

  where n_CD and n_CND are the numbers of disease and no disease participants classified correctly, and n_ICD and n_ICND are the numbers of disease and no disease participants classified incorrectly

Sensitivity of model
• Proportion of ‘successes’ correctly identified
  – Number of correct no disease participants divided by total number of no disease participants

  n_CND / (n_CND + n_ICND)

Specificity of model
• Proportion of ‘failures’ correctly identified
  – Number of correct disease participants divided by total number of disease participants

  n_CD / (n_CD + n_ICD)

Now…a real example
• Startup, Makgekgenene and Webster (2007) looked at whether or not the subscales of the Posttraumatic Cognitions Inventory (PTCI) are good predictors of Posttraumatic Stress Disorder (PTSD)
• Subscales are
  – Negative Cognitions About SELF
  – Negative Cognitions about the WORLD
  – Self-BLAME

Descriptive Results
• PTSD participants showed higher scores than non-PTSD participants on all three subscales

Multiple Logistic Regression
• Response variable:
  – whether or not the participant has PTSD
• Explanatory variables:
  – Negative Cognitions About SELF
  – Negative Cognitions about the WORLD
  – Self-BLAME

Let’s do the Logistic Regression
• Open XYZ.sav in SPSS
• Run the appropriate regression
• What are the parameter estimates for our three explanatory variables?
• Which of these are significant (at α = .05)?
• What are the odds ratios for those that are significant?
• Anything unusual?

Self-BLAME
• Self-BLAME has a negative coefficient (an odds ratio below 1).
• This means that increasing self-blame decreases the chance of having PTSD
• This is surprising, especially since participants with PTSD showed higher Self-BLAME scores
• What’s going on?

Self-BLAME and SELF scales
• Startup et al. (2007) explain this by stating that Self-BLAME is made up of both behavioural and characterological questions
• SELF, however, may also tap into characterological aspects of self-blame
• Behavioural self-blame can be considered adaptive. It may help avoid PTSD
• Characterological self-blame, however, may be detrimental, and lead to PTSD

Suppressor Effect
• The relationship between SELF and PTSD is strong, and accounts for the negative (detrimental) aspect of self-blame. This includes the effect of characterological self-blame.
• The variation in PTSD that is left for Self-BLAME to account for is the positive (adaptive) aspect of the relationship between the Self-BLAME scores and PTSD.
• The negative aspect of Self-BLAME scores has been suppressed (already accounted for by SELF). The positive aspect of Self-BLAME can now come out.

Homework (haha)
• Evaluate the model by looking at
  – Accuracy of model’s predictions
  – Sensitivity of model’s predictions
  – Specificity of model’s predictions
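
For the homework, the three measures can be computed directly from the four classification counts. The sketch below follows the definitions given on the accuracy, sensitivity and specificity slides, where "success" is the no disease outcome; the function name `classification_summary` and the counts passed to it are hypothetical and made up for illustration, not taken from the PTSD data.

```python
def classification_summary(n_cd, n_cnd, n_icd, n_icnd):
    """Accuracy, sensitivity and specificity from the four classification counts.

    n_cd   : disease participants classified correctly
    n_cnd  : no disease participants classified correctly
    n_icd  : disease participants classified incorrectly
    n_icnd : no disease participants classified incorrectly
    """
    total = n_cd + n_cnd + n_icd + n_icnd
    accuracy = (n_cd + n_cnd) / total
    # Following the slides, "success" is the no disease (healthy) outcome.
    sensitivity = n_cnd / (n_cnd + n_icnd)
    specificity = n_cd / (n_cd + n_icd)
    return accuracy, sensitivity, specificity


# Hypothetical counts, made up purely for illustration.
accuracy, sensitivity, specificity = classification_summary(
    n_cd=30, n_cnd=50, n_icd=10, n_icnd=10
)
print(accuracy, sensitivity, specificity)
```

In SPSS the four counts can be read off the classification table in the logistic regression output. If PTSD is treated as the "success" category instead, the roles of sensitivity and specificity swap.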