Quantitative Research Methods for Social Sciences / Fall 2011
Module 2, Lecture 7: Introduction to Generalized Linear Models, Logistic Regression and Poisson Regression
Priyantha Wijayatunga, Department of Statistics, Umeå University

Modeling dependences with regression models: Generalized Linear Models (GLM)

Different regression models
- Linear regression (simple and multiple): the dependent variable (response) is assumed to be normally distributed; in ordinary regression the conditional mean of the response is modeled as a linear combination of the explanatory variables.
- Non-linear regression: quadratic regression, polynomial regression (we omit them here).
- GLM: logistic regression, Poisson regression, Cox regression, etc. (we discuss them here). In GLM, for example in logistic regression, a conditional probability is modeled.

Linear regression (revisited)
Simple linear regression model with one explanatory variable x; the estimated model is

  ŷ = b0 + b1x

Usually more complex linear models are needed in practice: in many problems, knowledge of more than one explanatory variable is necessary to obtain a better understanding and better prediction of the response. The estimated model

  ŷ = b0 + b1x1 + b2x2 + … + bpxp

is called a "first-order model" with p explanatory variables. Some explanatory variables can be functions of others, e.g. x2 = x1², x5 = x3·x4. In general we need n cases of data for p explanatory variables (n >> p).

The simple linear regression model allows for one independent variable x:

  y = b0 + b1x + e

The multiple linear regression model allows for more than one independent variable:

  y = b0 + b1x1 + b2x2 + e

[Figure: the fitted straight line in the one-variable case becomes a plane when a second explanatory variable is added.]
Required conditions for the error variable
- The random error e of the model is normally distributed.
- Its mean is zero and its standard deviation is a constant (σ) for all values of x.
- The errors are independent across data cases.

We need more: GLM
Different situations where more general regression models are needed:
- The dependent variable has only two outcomes, success or failure: a political candidate may want to know the behaviour of voters, i.e. the characteristics of those who may or may not vote for him/her. Each voter's age, profession, etc. influence his/her choice. – Binary logistic regression
- The dependent variable may have more than two outcomes: after high school, a student's selection to "go to university", "go to vocational training", "do a job", or "other". – Multinomial logistic regression
- The dependent variable is a count: a couple's characteristics affecting their family size. The number of children (0, 1, 2, 3, …) may depend on their income, ages, etc. – Poisson regression
- The dependent variable is the time until a particular event happens: e.g. the time until an unemployed person finds employment. – Cox regression

GLM: why do we need more general regression models?
Eg (avoid the details, just get the idea): suppose a person's preference to vote or not for "candidate A" depends on his/her age: Y = 1 if vote, Y = 0 if not. Let X be the age of the voter. Using the linear model Y = b0 + b1X + e, we have E[Y|x] = P(Y=1|x) and E[Y|x] = b0 + b1x, so P(Y=1|x) = b0 + b1x. Here we model conditional probabilities, NOT conditional means. For some values of X this probability calculation may not work (we need 0 ≤ prob ≤ 1)! Therefore we use a trick (resulting in a GLM):

  log[ P(Y=1|X) / (1 − P(Y=1|X)) ] = β0 + β1X,  i.e.  P(Y=1|X) = e^(β0+β1X) / (1 + e^(β0+β1X))

This is the binary logistic regression model (details later!).

GLM cont.: why do we need more general regression models?
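The need for this trick can be illustrated numerically. A minimal sketch in pure Python (the coefficient values b0 = -1, b1 = 0.05 are invented for illustration, not taken from any data in the lecture): the linear predictor can leave the interval [0, 1], while the logistic transform cannot.

```python
import math

def linear_p(b0, b1, x):
    # "Probability" from the linear model b0 + b1*x: not guaranteed to stay in [0, 1]
    return b0 + b1 * x

def logistic_p(b0, b1, x):
    # P(Y=1|x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)): always strictly between 0 and 1
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

b0, b1 = -1.0, 0.05  # illustrative values only
for age in (0, 40, 80, 120):
    print(age, round(linear_p(b0, b1, age), 3), round(logistic_p(b0, b1, age), 3))
# At age 120 the linear model gives 5.0, which cannot be a probability;
# the logistic model gives a value just below 1.
```

Whatever the value of x, the logistic form keeps the modeled probability inside (0, 1), which is exactly why the transformation is used.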
- We may be interested in who is more likely to do something, i.e. P(Y=1|x): for which values of X (age, etc.) is a person more likely to vote for party A?
- The dependent variable is not normally distributed.
- Even when the dependent variable is a real number, the normal distribution may not be the best choice for modeling it: unemployment duration or insurance claim amounts (all positive) may have a skewed distribution (gamma distribution, lognormal distribution, etc.).
- The dependent variable is a count: a couple's characteristics affecting their family size. The number of children (0, 1, 2, 3, …) may depend on their income, ages, etc. – Poisson regression
- The dependent variable is the time until a particular event happens: the time until an unemployed person finds employment (unemployment duration). – Cox regression

Logistic Regression Model
Topics: the binomial setting and odds; the model for logistic regression; fitting and interpreting the model; inference for logistic regression; multiple logistic regression.

Binary logistic regression
We will study methods to model relationships when the response variable has only two possible values: how likely is a subject to be in either category? For example: a customer buys or does not buy (which group is more likely to buy), a patient lives or dies, etc. We call the two values of the response variable "success" and "failure". If the data are n independent observations with the same p = P(success), this is the binomial setting. Binary dependent variable Y: let Y = 1 for a success and Y = 0 for a failure. Logistic regression is used when Y depends on certain explanatory variables; otherwise one can simply use the binomial distribution for Y.

Example: binge drinkers
A survey of 17,096 students in U.S. four-year colleges collected information on drinking behavior and alcohol-related problems. The researchers define "frequent binge drinking" as having five or more drinks in a row three or more times in the past two weeks. X represents the number of frequent binge drinkers in the sample.
One possible explanatory variable is the gender of the student:

  Population    n        X       p̂
  1 (men)       7,180    1,630   0.227
  2 (women)     9,916    1,684   0.170
  Total         17,096   3,314   0.194

Odds
Logistic regression works with odds rather than proportions. The probability of drawing a "spade" from a well-shuffled pack of cards is 1/4, so we usually say the odds of getting a "spade" are 3 to 1 against. Let event A denote success. Then we define the odds of A as

  odds(A) = P(A) / (1 − P(A))   (probability of success / probability of failure)

When P(A) = 0.5, odds(A) = 1. As the probability of the event goes up, the odds also go up (non-linearly).
[Figure: odds(A) plotted against P(A); the curve rises slowly at first and steeply as P(A) approaches 1.]

Odds cont.

  odds = p / (1 − p)

We can omit mentioning A when it is clear from the context. A similar formula for the sample odds is obtained by substituting the sample proportion for p.
Example: the estimated odds of a male student being a frequent binge drinker are p̂/(1 − p̂) = 0.227/(1 − 0.227) = 0.2937. The estimated odds of a female student being a frequent binge drinker are 0.1698/(1 − 0.1698) = 0.2045.

Model for logistic regression
The logistic regression model works with the natural log of the odds, p/(1 − p). We use the term log odds for this transformation (called the logit). As p moves from 0 to 1, the log odds moves through all negative and positive values. We model the log odds as a linear function of the explanatory variable:

  log[ p / (1 − p) ] = β0 + β1x

[Figure: log(odds(A)) plotted against P(A), ranging over all real values; and plots of p versus x for selected parameter values (β0 = 1, β1 = 0; β0 = −1, β1 = 0; β0 = 4, β1 = −1; β0 = 4, β1 = −2), showing how the probability of success varies with x.]

Fitting the model
Binge drinkers example: the log odds for men are log(0.2937) = −1.23, and the log odds for women are log(0.2045) = −1.59.
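The odds and log odds in the binge-drinking example can be reproduced with a few lines of Python (standard library only; the proportions 0.227 and 0.1698 are the sample proportions quoted above):

```python
import math

def odds(p):
    # odds = probability of success / probability of failure
    return p / (1.0 - p)

odds_men = odds(0.227)     # sample odds for men, ~0.2937
odds_women = odds(0.1698)  # sample odds for women, ~0.2045
print(round(odds_men, 4), round(odds_women, 4))
print(round(math.log(odds_men), 2), round(math.log(odds_women), 2))
# log odds: -1.23 for men and -1.59 for women, matching the slide
```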
The explanatory variable gender can be expressed numerically using an indicator variable: x = 1 if the student is a man, x = 0 if the student is a woman. The model says that for men

  log[ p1 / (1 − p1) ] = β0 + β1

and for women

  log[ p0 / (1 − p0) ] = β0

Since the log odds for men are −1.23 and the log odds for women are −1.59, we get the parameter estimates b0 = −1.59 and b1 = −1.23 − (−1.59) = 0.36. The fitted logistic model is

  log(odds) = −1.59 + 0.36x

In general, the calculations needed to find the parameter estimates are complex and require software.

Interpreting the model parameters
Most people are not comfortable thinking on the log(odds) scale, so we apply a transformation. The exponential function (e^x) reverses the natural log transformation:

  odds = e^(−1.59 + 0.36x) = (e^−1.59)(e^0.36x)

From this, the ratio of the odds for men (x = 1) and women (x = 0) is

  odds_men / odds_women = e^0.36 = 1.43 = odds ratio

This transformation turns the logistic regression slope into an odds ratio: the odds that a man is a frequent binge drinker are 1.43 times the odds for a woman.
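With a single 0/1 indicator variable the parameter estimates follow directly from the two group log odds, so the fit above can be checked by hand. A small sketch in pure Python, starting from the sample odds computed earlier:

```python
import math

odds_women = 0.2045  # sample odds at x = 0 (from the example)
odds_men = 0.2937    # sample odds at x = 1

b0 = math.log(odds_women)     # intercept: log odds at x = 0, ~ -1.59
b1 = math.log(odds_men) - b0  # slope: difference in log odds, ~ 0.36
odds_ratio = math.exp(b1)     # e^b1 = odds_men / odds_women
print(round(b0, 2), round(b1, 2), round(odds_ratio, 3))
# odds_ratio comes out ~1.436 here; the slide reports 1.43 because it
# rounds b1 to 0.36 before exponentiating (e^0.36 = 1.433).
```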
The odds ratio is a measure of the dependence between X and Y.

Interpreting cont.
Generally, we can understand the model parameters as follows. Suppose a person's age (X) affects his/her choice of voting for a certain party (Y = 1) or not (Y = 0):

  log(odds_x)     = β0 + β1·x        for a person aged x
  log(odds_(x+1)) = β0 + β1·(x + 1)  for a person aged x + 1

β0 is not connected with X, so it is not needed to understand the X–Y relation; β1 is connected with X. We get

  log(odds_(x+1)) − log(odds_x) = β1,  i.e.  log[ odds_(x+1) / odds_x ] = β1,  so  odds_(x+1) / odds_x = e^β1

that is, β1 is the log of the odds ratio. For every unit increase in X, the odds are multiplied by the factor e^β1.

Interpreting cont.
The odds ratio is odds_(x+1) / odds_x = e^β1. From

  log[ p / (1 − p) ] = β0 + β1x  we get  p = e^(β0+β1x) / (1 + e^(β0+β1x))

- When β1 > 0: as X increases, p increases – positive dependence of X–Y.
- When β1 = 0: as X increases, p remains the same – independence of X–Y.
- When β1 < 0: as X increases, p decreases – negative dependence of X–Y.

The odds ratio is a measure of dependence between X and Y. The odds ratio does not change with X, and it is always non-negative:
- odds ratio > 1: positive dependence of X–Y
- odds ratio = 1: independence of X–Y
- odds ratio < 1: negative dependence of X–Y

Inference for logistic regression (we omit the details)
About IBM SPSS visit: http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp
Example: a researcher is interested in how the variable "gre" (Graduate Record Exam score) affects admission into graduate school. The response variable "admit" (admit / don't admit, 1/0) is binary, and "gre" is an interval variable. SPSS file: binary.sav (thanks to UCLA Academic Technology Services for the data).
Data (here we use only admit and gre):

  admit  gre  gpa   rank
  0      380  3.61  3
  1      660  3.67  3
  .      .    .     .

Analyze > Regression > Binary Logistic, then move 'admit' to the Dependent box and 'gre' to the Covariates box. Make sure Enter appears in the Method section. Click Options and tick CI for exp(B) to get the confidence intervals for the odds ratio. Press Continue followed by OK.
Example cont.:

Classification Table (the cut value is .500)
  Observed              Predicted admit = 0   Predicted admit = 1   Percentage correct
  admit = 0             273                   0                     100.0
  admit = 1             127                   0                     0.0
  Overall percentage                                                68.3

Variables in the Equation (with 95% C.I. for Exp(B))
  Variable   B       S.E.   Wald    df  Sig.   Exp(B)  Lower   Upper
  gre        .004    .001   13.199  1   .000   1.004   1.002   1.006
  Constant   -2.901  .606   22.919  1   .000   .055
  Variable(s) entered on step 1: gre.

For each unit increase in 'gre', the odds of 'admit' increase by the factor 1.004 = e^0.004, which is the odds ratio. Note the 95% confidence interval for the odds ratio.

Multiple logistic regression
In multiple logistic regression the response variable has two possible values, as in simple logistic regression, but there can be several explanatory variables. As in multiple regression, there is an overall test for all of the explanatory variables: the null hypothesis that the coefficients of all the explanatory variables are zero is tested by a statistic that is approximately χ² with degrees of freedom equal to the number of explanatory variables. Hypotheses about individual coefficients are tested by a statistic that is approximately χ² with 1 degree of freedom.

Example cont.: in addition to 'gre', assume that 'gpa' (grade point average) and the prestige of the undergraduate institution ('rank') where the student studied affect admission into graduate school. To run the logistic regression, start as before but include all the explanatory variables in the Covariates box. Since 'rank' is a categorical variable, press the Categorical tab and move rank to the Categorical Covariates box; we let the last category (4) be the reference category by keeping the defaults.

Classification Table (the cut value is .500)
  Observed              Predicted admit = 0   Predicted admit = 1   Percentage correct
  admit = 0             254                   19                    93.0
  admit = 1             97                    30                    23.6
  Overall percentage                                                71.0

Variables in the Equation (with 95% C.I. for Exp(B)):
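The fitted simple model can be used for prediction by hand. A hedged sketch in Python using the rounded coefficients from the SPSS output above (B = 0.004 for gre, constant = -2.901; because the coefficients are rounded, the probabilities are approximate, and the gre values 500 and 700 are just illustrative inputs):

```python
import math

b0, b1 = -2.901, 0.004  # rounded coefficients from the SPSS output

def p_admit(gre):
    # P(admit = 1 | gre) = e^(b0 + b1*gre) / (1 + e^(b0 + b1*gre))
    z = b0 + b1 * gre
    return math.exp(z) / (1.0 + math.exp(z))

print(round(math.exp(b1), 3))  # odds ratio per unit of gre: 1.004
print(round(p_admit(500), 3))  # predicted admission probability at gre = 500
print(round(p_admit(700), 3))  # predicted admission probability at gre = 700
```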
  Variable   B       S.E.    Wald    df  Sig.   Exp(B)  Lower   Upper
  gre        .002    .001    4.284   1   .038   1.002   1.000   1.004
  gpa        .804    .332    5.872   1   .015   2.235   1.166   4.282
  rank                       20.895  3   .000
  rank(1)    1.551   .418    13.787  1   .000   4.718   2.080   10.702
  rank(2)    .876    .367    5.706   1   .017   2.401   1.170   4.927
  rank(3)    .211    .393    .289    1   .591   1.235   .572    2.668
  Constant   -5.541  1.138   23.709  1   .000   .004
  Variable(s) entered on step 1: gre, gpa, rank.

When the other variables ('gre' and 'rank') are held constant, for every unit increase of 'gpa' the odds of 'admit' inflate by the factor 2.235 = e^0.804. The overall effect of 'rank' is significant. When the other variables ('gre' and 'gpa') are held constant, 'rank 1' has odds 4.718 = e^1.551 times those of 'rank 4'.
Predict the probability of admittance for a person with
- gre = 600, gpa = 4.0 and rank = 3
- gre = 650, gpa = 4.5 and rank = 1
Use the "Save" tab in logistic regression.

Poisson Regression Model
Topics: the Poisson distribution; Poisson regression with the same/equal observation interval; Poisson regression with different/unequal observation intervals (see exercise Batch 3 for dealing with observations on intervals of different lengths: adding an OFFSET variable to the model); choosing between Poisson and negative binomial regression.

Poisson distribution
Individual events happen at an equal rate per interval: the number of road accidents at a certain junction per month, the number of printing errors per page in a certain book, the number of cancer cases diagnosed in a month at a certain clinic. (We omit the other assumptions here!) If X is the number of events, then X = 0, 1, 2, … (the sample space of X), and

  P(X = x) = e^(−λ) λ^x / x!
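The two predictions requested above can be sketched directly from the rounded coefficients in the table (rank 4 is the reference category; because the coefficients are rounded, the probabilities are approximate, whereas SPSS's Save tab uses full-precision estimates):

```python
import math

# Rounded coefficients from the SPSS output; rank 4 is the reference category
B = {"const": -5.541, "gre": 0.002, "gpa": 0.804,
     "rank1": 1.551, "rank2": 0.876, "rank3": 0.211}

def p_admit(gre, gpa, rank):
    rank_coef = {1: B["rank1"], 2: B["rank2"], 3: B["rank3"], 4: 0.0}[rank]
    z = B["const"] + B["gre"] * gre + B["gpa"] * gpa + rank_coef
    return math.exp(z) / (1.0 + math.exp(z))

print(round(p_admit(600, 4.0, 3), 3))  # roughly 0.29
print(round(p_admit(650, 4.5, 1), 3))  # roughly 0.72
```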
The mean number of events per interval is λ, and so is the variance. For a variance bigger than the mean we can use the negative binomial distribution.
[Figure: Poisson distributions for various mean values (mean = 1, 2, 3, 4), showing how the distribution shifts right and spreads out as the mean grows.]

Poisson distribution cont.
Let the number of occurrences of the event of interest be Y in a specific interval (of space or time); then Y = 0, 1, 2, … . Let Y be affected by certain variables, say X1, X2 and X3: Y is the response and X1, X2 and X3 are the explanatory variables. We use a GLM to model the mean of Y for given x1, x2 and x3:

  μ(x1, x2, x3) = e^(β0 + β1x1 + β2x2 + β3x3) > 0 always,

that is,

  log( μ(x1, x2, x3) ) = β0 + β1x1 + β2x2 + β3x3 : a familiar linear model.

The variance of Y for given x1, x2 and x3 is the same as its mean. If that variance is bigger, do negative binomial regression (we omit it here); one can use a likelihood ratio test to test which model is better.

Example: num_awards is the outcome variable indicating the number of awards earned by each student at a high school in a year; math is each student's score on his/her math final exam (an interval predictor variable); and prog is a categorical predictor variable with three levels (1, 2 and 3) indicating the type of program in which the student was enrolled. Data file PoisRegression_sim.sav (thanks to UCLA Academic Technology Services for the data). Data:

  id   num_awards  prog  math
  45   0           3     41
  108  0           1     41
  15   0           3     44
  .    .           .     .

The variable 'prog' is a good explanatory variable for 'num_awards': for each 'prog', 'num_awards' has its mean and variance almost equal. To run the Poisson regression: Analyze => Generalized Linear Models => Generalized Linear Models; in the Type of Model tab choose Poisson loglinear, and in the Response tab set the dependent variable to num_awards.
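The Poisson probability formula and the mean–variance property can be checked numerically. A small sketch in pure Python (the mean λ = 2 is just an illustrative choice):

```python
import math

def poisson_pmf(x, lam):
    # P(X = x) = e^(-lam) * lam^x / x!
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 2.0  # illustrative mean
probs = [poisson_pmf(x, lam) for x in range(50)]
mean = sum(x * p for x, p in enumerate(probs))
var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
print(round(poisson_pmf(0, lam), 4))  # e^-2, about 0.1353
print(round(mean, 6), round(var, 6))  # both equal lambda = 2: mean = variance
```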
In the Predictors tab put 'prog' under Factors (because we want it treated as a categorical variable) and 'math' under Covariates. In the Model tab include all the predictors as main effects. Leave the options in the Estimation tab as they are. In the Statistics tab select Analysis Type: Type III, Chi-square Statistic: Likelihood ratio, and under Print tick Include exponential parameter estimates. In the EM Means tab select 'prog' so that SPSS will calculate the mean for each of its categories; select Pairwise under Contrasts to get a comparison. Under Scale select Compute means for response, and under Adjustment for multiple comparison select Bonferroni. Press OK.

Tests of Model Effects
Testing main effects: 'prog' is a categorical variable with 3 categories, so two dummies are used. We test the significance of 'prog' through these two dummies together (a chi-squared test with 2 degrees of freedom); you can see 'prog' is significant. For the interval variable 'math', we test its main effect's significance with a chi-squared test with 1 degree of freedom; you can see 'math' is significant.

  Type III Likelihood Ratio
  Source       Chi-Square  df  Sig.
  (Intercept)  69.523      1   .000
  prog         14.572      2   .001
  math         45.010      1   .000
  Dependent variable: num_awards. Model: (Intercept), prog, math.

Parameter Estimates (95% Wald confidence intervals for B and for Exp(B))
  Parameter    B       Std. Error  Lower   Upper   Wald Chi-Square  df  Sig.   Exp(B)  Lower  Upper
  (Intercept)  -4.877  .6282       -6.109  -3.646  60.283           1   .000   .008    .002   .026
  [prog=1]     -.370   .4411       -1.234  .495    .703             1   .402   .691    .291   1.640
  [prog=2]     .714    .3200       .087    1.341   4.979            1   .026   2.042   1.091  3.824
  [prog=3]     0 (a)
  math         .070    .0106       .049    .091    43.806           1   .000   1.073   1.051  1.095
  (Scale)      1 (b)
  Dependent variable: num_awards. Model: (Intercept), prog, math.
  a. Set to zero because this parameter is redundant.  b. Fixed at the displayed value.
Parameter estimates:
- 'math' has the parameter value 0.07 (exponentiated value 1.073 = e^0.07): the expected 'num_awards' increases by the factor 1.073 for a unit increase in 'math'.
- 'prog' is modeled through two dummies ('prog1' and 'prog2', with 'prog3' as the reference category). 'prog1' has the parameter value -0.37 (exponentiated value 0.691 = e^-0.37): the expected 'num_awards' for 'prog1' is smaller than that for 'prog3' by the factor 0.691 [the difference between the log expected 'num_awards' for 'prog1' and for 'prog3' is -0.37]. 'prog2' has the parameter value 0.714 (exponentiated value 2.042 = e^0.714): the expected 'num_awards' for 'prog2' is larger than that for 'prog3' by the factor 2.042 [the difference between the log expected 'num_awards' for 'prog2' and for 'prog3' is 0.714].

Acknowledgement: a few slides are based on teaching materials for the book The Practice of Business Statistics: Using Data for Decisions, Second Edition, by Moore, McCabe, Duckworth and Alwan.
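The interpretation above can be checked by computing expected counts from the rounded parameter estimates (so the numbers are approximate; 'prog3' is the reference category, and the math score of 60 is just an illustrative value):

```python
import math

# Rounded Poisson regression estimates from the SPSS output
B = {"const": -4.877, "math": 0.070, "prog1": -0.370, "prog2": 0.714, "prog3": 0.0}

def expected_awards(math_score, prog):
    # log(mu) = b0 + b_math * math + b_prog  =>  mu = e^(b0 + b_math * math + b_prog)
    z = B["const"] + B["math"] * math_score + B["prog%d" % prog]
    return math.exp(z)

mu3 = expected_awards(60, 3)  # expected count for prog 3 (reference)
mu2 = expected_awards(60, 2)  # expected count for prog 2
print(round(mu3, 3), round(mu2, 3))
print(round(mu2 / mu3, 3))  # = e^0.714, about 2.042: the factor quoted above
```

Because the model is log-linear, the ratio mu2/mu3 equals e^0.714 at every math score, which is exactly the "constant factor" interpretation of the exponentiated coefficient.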