Quantitative Research Methods for
Social Sciences/Fall 2011
Module 2: Lecture 7
Introduction to Generalized Linear Models,
Logistic Regression
and
Poisson Regression
Priyantha Wijayatunga
Department of Statistics Umeå University
Modeling dependences with regression models:
Generalized Linear Models (GLM)
Different Regression Models

Linear Regression (simple and multiple): dependent variable
(response) is assumed to be normally distributed,

In ordinary regression the conditional mean of the response is modeled
as a linear combination of the explanatory variables, etc.

Non-linear Regression: Quadratic regression, polynomial
regression (we omit them here)

GLM: logistic regression, Poisson regression, Cox regression, etc.
(we discuss them here)

In GLM, for example in logistic regression, the conditional probability is
modeled
Linear regression (revisited)

Simple linear regression model in one explanatory variable x.
ŷ = b0 + b1x Estimated model


Usually more complex linear models are needed in practical situations.
There are many problems in which knowledge of more than one explanatory
variable is necessary in order to obtain a better understanding and a
better prediction of a particular response.
ŷ = b0 + b1x1+ b2x2 + … + bpxp Estimated model



This model is called a ‘first-order model’ with p explanatory variables.
Some explanatory variables can be functions of some of the others,
e.g., x2 = x1², x5 = x3·x4, ...
In general, we need n cases of data for p explanatory variables (n>>p).
The simple linear regression model
allows for one independent variable, x:
y = b0 + b1x + e
[Figure: y plotted against x with the fitted straight line]
The multiple linear regression model
allows for more than one independent variable:
y = b0 + b1x1 + b2x2 + e
[Figure: y plotted against x1 and x2. Note how the straight line
becomes a plane.]
Required conditions for the error variable

• The random error e of the model is normally distributed
• Its mean is equal to zero and its standard deviation is a constant (σ)
  for all values of X
• The errors are independent across data cases
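These three conditions can be illustrated with a small simulation (a Python sketch; the coefficients, sample size, and σ here are made up for illustration, not taken from the lecture):

```python
import random
import statistics

# Hypothetical model y = b0 + b1*x + e with e ~ N(0, sigma)
b0, b1, sigma = 2.0, 0.5, 1.0
random.seed(1)

x = [i / 10 for i in range(100)]
# errors: mean zero, constant standard deviation, independent draws
e = [random.gauss(0.0, sigma) for _ in x]
y = [b0 + b1 * xi + ei for xi, ei in zip(x, e)]

# The simulated errors roughly satisfy the stated conditions
print(round(statistics.mean(e), 2), round(statistics.stdev(e), 2))
```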
We need more GLM
Different situations where we need more general regression models

Dependent variable has only two outcomes: success or failure – a
political candidate may want to know the behaviour of voters, i.e., the
characteristics of those who may vote for him/her or not. Each voter’s age,
profession, etc. influence his/her choice to vote for the candidate
or not – Binary logistic regression

Dependent variable may have more than two outcomes: after
high school, a student’s selection to “go to university”, “go to vocational
training”, “do a job”, or “other” – Multinomial logistic regression

Dependent variable is a count: couple’s characteristics affecting their
family size. Number of children (0, 1, 2, 3, …): it may have some
dependence on their income, ages, etc. – Poisson regression

Dependent variable is the period until a particular event happens: time until an
unemployed person finds employment – Cox regression
GLM
Why do we need more general regression models?
Eg: (Avoid details but JUST get the idea.) Suppose a person’s preference to
vote or not for “candidate A” depends on his/her age: Y=1 if vote, Y=0 if not.
Let X be the age of the voter. Using the linear model: Y = b0 + b1X + e
E[Y|x] = P(Y=1|x)
and E[Y|x] = b0 + b1x
so P(Y=1|x) = b0 + b1x
We model the conditional probabilities (who is more likely to vote), NOT the
conditional means!!
For some values of X the above probability calculation may not work!! (need 0 ≤ prob ≤ 1)
Therefore, do a trick (resulting in a GLM):

log[ P(Y=1|X) / (1 − P(Y=1|X)) ] = b0 + b1X;

P(Y=1|X) = e^(b0 + b1X) / (1 + e^(b0 + b1X))

This is the binary logistic regression model (details later!!)
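The point of the trick can be sketched numerically (illustrative Python; the coefficient values are made up): however extreme x is, the transformed probability stays in (0, 1), unlike the raw linear form b0 + b1x:

```python
import math

def logistic_p(x, b0=-1.0, b1=0.5):
    """P(Y=1|x) = e^(b0+b1*x) / (1 + e^(b0+b1*x)); illustrative coefficients."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# The linear form b0 + b1*x leaves (0, 1); the logistic form does not
for x in (-10, 0, 10):
    p = logistic_p(x)
    assert 0.0 < p < 1.0
    print(x, round(p, 4))
```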
GLM cont.,
Why do we need more general regression models?

We may be interested in who is more likely to do something, i.e., P(Y=1|x)
– for which values of X (age, etc.) is one more likely to vote for party A

Dependent variable is not normally distributed

Even in cases where the dependent variable is a real number, the normal
distribution may not be the optimal choice for modeling it: unemployment
duration or insurance claim amount (both positive) may have a skewed
distribution (gamma distribution, lognormal distribution, etc.)

Dependent variable is a count: a couple’s characteristics affecting their
family size. The number of children (0, 1, 2, 3, …) they have may depend
on their income, ages, etc. – Poisson regression

Dependent variable is the period until a particular event happens: time until
an unemployed person finds employment (unemployment duration) – Cox
regression
Logistic Regression Model
Topics

Binomial setting and odds

Model for logistic regression

Fitting and interpreting the model

Inference for logistic regression

Multiple logistic regression
Binary Logistic Regression

We will study methods to model relationships when the response
variable has only two possible values. How likely is a subject to be in
either of the categories?

For example: customer buys or does not buy (which group is more
likely to buy), patient lives or dies, etc.

We call the two values of the response variable ‘success’ and ‘failure’.

If the data are n independent observations with the same p =
P(success), this is the binomial setting.

Binary dependent variable Y: let Y=1 if a success and Y=0 if a failure

Logistic regression is used when Y depends on certain explanatory
variables. Otherwise one can simply use binomial distribution for Y.
Example: Binge drinkers

A survey of 17,096 students in U.S. four-year colleges collected
information on drinking behavior and alcohol-related problems. The
researchers define “frequent binge drinking” as having five or more
drinks in a row three or more times in the past two weeks. X
represents the number of binge drinkers in the sample.

One possible explanatory variable is gender of the student:
Population    n        X      p̂
1 (men)       7,180    1,630  0.227
2 (women)     9,916    1,684  0.170
Total         17,096   3,314  0.194
Odds



Logistic regression works with odds rather than proportions.
The probability of obtaining a “spade” from a well shuffled pack of cards
is ¼. So we usually say the odds of getting a “spade” are 3 to 1 against.
Let event A mean success. Then we define the odds of A as
Odds(A) = probability of success / probability of failure

odds(A) = p(A) / (1 − p(A))

When P(A) = 0.5, Odds(A) = 1.
When the probability of the event goes up, the odds also go up
(non-linearly).
[Figure: Odds(A) plotted against P(A) from 0.2 to 1.0 – the curve rises
non-linearly.]
Odds …,
odds = p / (1 − p)

We can omit mentioning A when it is clear from the context.
A similar formula for sample odds is obtained by substituting the
sample proportion for p.

Example: The estimated odds of a male student being a frequent
binge drinker are: p/(1 − p) = 0.227/(1 − 0.227) = 0.2937. The estimated
odds of a female student being a frequent binge drinker are:
0.1698/(1 − 0.1698) = 0.2045.
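As a quick check of the example’s arithmetic (a Python sketch of the odds formula above):

```python
def odds(p):
    """Odds of an event with success probability p: p / (1 - p)."""
    return p / (1 - p)

odds_men = odds(0.227)     # sample proportion for men, from the slide
odds_women = odds(0.1698)  # sample proportion for women
print(round(odds_men, 4), round(odds_women, 4))
```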

Model for logistic regression

The logistic regression model works with the natural log of the
odds, p/(1 − p).

We use the term log odds for this transformation (called logit).

As p moves from 0 to 1, the log odds moves through all negative
and positive numerical values.

We model the log odds as a linear
function of the explanatory variable:

log[ p / (1 − p) ] = b0 + b1x

[Figure: log{Odds(A)} plotted against P(A) – the log odds runs from −4
to 4 as P(A) moves from 0 to 1.]
[Figure: Plot of p versus x for selected values of β0 and β1. The four
panels (β0 = 1, β1 = 0; β0 = −1, β1 = 0; β0 = 4, β1 = −1; β0 = 4, β1 = −2)
show how the probability of success varies with x.]
Fitting the model

Binge drinkers example: the log odds for men are log(0.2937) = −1.23, and the
log odds for women are log(0.2045) = −1.59.

The explanatory variable gender can be expressed numerically using an
indicator variable: x = 1 if the student is a man, 0 if the student is a woman.
The model says that for men:

log[ p1 / (1 − p1) ] = b0 + b1

And for women:

log[ p0 / (1 − p0) ] = b0

Since the log odds for men = −1.23 and the log odds for women = −1.59, we get
the parameter estimates: b0 = −1.59, and b1 = −1.23 − (−1.59) = 0.36.

The fitted logistic model is: log(odds) = −1.59 + 0.36x
In general, the calculations needed to find the parameter estimates are
complex and require software.

Interpreting the model parameters

Most people are not comfortable thinking in the log(odds) scale so
we apply a transformation.

The exponential function (e^x) reverses the natural log
transformation. Applying the transformation we get:
odds = e^(−1.59 + 0.36x) = (e^(−1.59))(e^(0.36x))

From this, the ratio of the odds for men (x =1) and women (x = 0) is
oddsmen/oddswomen = e 0.36 = 1.43 = odds ratio

This transformation transforms the logistic regression slope into an
odds ratio, i.e. the odds that a man is a frequent binge drinker are
1.43 times the odds for a woman.

Odds ratio is a measure of dependence between X and Y
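The fitting and interpretation steps above can be reproduced numerically (a Python sketch using the slide’s sample proportions; printed values may differ from the slides in the last decimal because the slides round intermediate results):

```python
import math

# Log odds from the sample proportions on the slides
log_odds_men = math.log(0.227 / (1 - 0.227))      # about -1.23
log_odds_women = math.log(0.1698 / (1 - 0.1698))  # about -1.59

# Indicator coding: x = 1 for men, 0 for women, so
# b0 is the women's log odds and b1 is the difference
b0 = log_odds_women
b1 = log_odds_men - log_odds_women

# The slope exponentiates to the men-vs-women odds ratio
odds_ratio = math.exp(b1)
print(round(b0, 2), round(b1, 2), round(odds_ratio, 2))
```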
Interpreting…,

Generally, we can understand the model parameters as follows.
Suppose a person’s age (X) affects his/her choice of voting for a certain
party (Y=1) or not (Y=0):

log(odds_x) = b0 + b1·x           for a person aged x
log(odds_{x+1}) = b0 + b1·(x + 1) for a person aged x+1

b0 is not connected with X – not needed to understand the relation X–Y
b1 is connected with X

We get

log(odds_{x+1}) − log(odds_x) = b1
log[ odds_{x+1} / odds_x ] = b1
odds_{x+1} = e^b1 · odds_x

that is, b1 is the log of the odds ratio.
For every unit increase in X, the odds are multiplied by the factor e^b1.
Interpreting…,

The odds ratio is

odds_{x+1} / odds_x = e^b1

From

log[ p / (1 − p) ] = b0 + b1x

we get

p = e^(b0 + b1x) / (1 + e^(b0 + b1x))

When b1 > 0: X increases → p increases – positive dependence of X–Y
     b1 = 0: X increases → p remains the same – independence of X–Y
     b1 < 0: X increases → p decreases – negative dependence of X–Y

The odds ratio is a measure of dependence between X and Y

The odds ratio does not change with X

The odds ratio is always non-negative
  odds ratio > 1: positive dependence of X–Y
  odds ratio = 1: independence of X–Y
  odds ratio < 1: negative dependence of X–Y
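These sign relations, and the fact that the odds ratio e^b1 does not depend on x, can be verified with a short sketch (illustrative Python; the coefficient values are made up):

```python
import math

def p_of_x(x, b0, b1):
    """P(Y=1|x) under the logistic model; illustrative coefficients."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

# Sign of b1 determines the direction of the X-Y dependence
assert p_of_x(2, 0.0, 1.0) > p_of_x(1, 0.0, 1.0)    # b1 > 0: p increases
assert p_of_x(2, 0.0, 0.0) == p_of_x(1, 0.0, 0.0)   # b1 = 0: independence
assert p_of_x(2, 0.0, -1.0) < p_of_x(1, 0.0, -1.0)  # b1 < 0: p decreases

def odds(x, b0, b1):
    p = p_of_x(x, b0, b1)
    return p / (1 - p)

# The odds ratio e^b1 is the same whatever the value of x
b0, b1 = 0.3, 0.7
for x in range(5):
    assert abs(odds(x + 1, b0, b1) / odds(x, b0, b1) - math.exp(b1)) < 1e-9
print("checks passed")
```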
Inference for logistic regression (omit details)
About IBM SPSS visit:
http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp
Example: A researcher is interested in how the variable “gre” (Graduate
Record Exam score) affects admission into graduate school. The
response variable “admit” (admit/don’t admit, 1/0) is a binary variable
and “gre” is an interval variable. SPSS file: binary.sav
(Thanks to UCLA Academic Technology Services for data)
Data: (here we use only admit and gre)

admit  gre  gpa   rank
0      380  3.61  3
1      660  3.67  3
.      .    .     .
Analyze > Regression > Binary Logistic then move ‘admit’ to Dependent
box and ‘gre’ to Covariates box. Make sure Enter appears in Method
section. Click on Options and tick CI for exp(B) to get the confidence
intervals for the odds ratio. Press Continue followed by OK.
Example: …
Classification Table(a)

                            Predicted admit      Percentage
Observed                    0         1          Correct
Step 1   admit      0       273       0          100.0
                    1       127       0          0.0
         Overall Percentage                      68.3
a. The cut value is .500
Variables in the Equation

                                                               95% C.I. for EXP(B)
              B       S.E.   Wald     df   Sig.   Exp(B)   Lower    Upper
Step 1a
  gre         .004    .001   13.199   1    .000   1.004    1.002    1.006
  Constant    -2.901  .606   22.919   1    .000   .055
a. Variable(s) entered on step 1: gre.
For each unit increase in ’gre’ the odds of ’admit’ increase by the factor
1.004 ≈ e^0.004, which is the odds ratio.
Note the 95% confidence interval for odds ratio.
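The Exp(B) value can be checked numerically (a Python sketch; the 100-point extrapolation is an added illustration, not shown on the slides):

```python
import math

b_gre = 0.004                       # slope (B) for 'gre' from the SPSS output
or_per_point = math.exp(b_gre)      # Exp(B): odds ratio per GRE point
or_per_100 = math.exp(100 * b_gre)  # odds multiply by e^(100*b) over 100 points
print(round(or_per_point, 3), round(or_per_100, 3))
```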
Multiple logistic regression

In multiple logistic regression, the response variable has two
possible values, as in simple logistic regression, but there can be
several explanatory variables.

As in multiple regression, there is an overall test for all of the
explanatory variables.

The null hypothesis that the coefficients of all the explanatory
variables are zero is tested by a statistic that is approximately χ2 with
degrees of freedom = number of explanatory variables.

Hypotheses about individual coefficients are tested by a statistic that
is approximately χ2 with 1 degree of freedom.
Example cont.: In addition to ‘gre’, assume that ‘gpa’ (grade point
average) and the prestige of the undergraduate institution (‘rank’) where the
student studied affect admission into graduate school.
To run the logistic regression, start as earlier but include all the
explanatory variables in the Covariates box.
Since there is a categorical variable, ‘rank’, press the Categorical tab and
move rank to the Categorical Covariates box. We use the last category
(4) as the reference category by keeping the defaults.
Classification Table(a)

                            Predicted admit      Percentage
Observed                    0         1          Correct
Step 1   admit      0       254       19         93.0
                    1       97        30         23.6
         Overall Percentage                      71.0
a. The cut value is .500
Variables in the Equation

                                                                95% C.I. for EXP(B)
              B       S.E.    Wald     df   Sig.   Exp(B)   Lower    Upper
Step 1a
  gre         .002    .001    4.284    1    .038   1.002    1.000    1.004
  gpa         .804    .332    5.872    1    .015   2.235    1.166    4.282
  rank                        20.895   3    .000
  rank(1)     1.551   .418    13.787   1    .000   4.718    2.080    10.702
  rank(2)     .876    .367    5.706    1    .017   2.401    1.170    4.927
  rank(3)     .211    .393    .289     1    .591   1.235    .572     2.668
  Constant    -5.541  1.138   23.709   1    .000   .004
a. Variable(s) entered on step 1: gre, gpa, rank.
When the other variables (“gre” and “rank”) are held constant, for every unit
increase of “gpa” the odds of “admit” increase by the factor 2.235 ≈ e^0.804.
The overall effect of ‘rank’ is significant.
When the other variables (“gre” and “gpa”) are held constant, “rank 1” has
odds 4.718 ≈ e^1.551 times those of “rank 4”.
Predict the probability of admission for a person with
gre=600, gpa=4.0 and rank=3
gre=650, gpa=4.5 and rank=1
Use the ”Save” tab in logistic regression.
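These predictions can also be computed by hand from the rounded B column above (a Python sketch; the results will differ slightly from SPSS’s Save output because the coefficients here are rounded to three decimals):

```python
import math

# Coefficients read from the B column of the SPSS output (rank 4 is reference)
b = {"const": -5.541, "gre": 0.002, "gpa": 0.804,
     "rank1": 1.551, "rank2": 0.876, "rank3": 0.211}

def p_admit(gre, gpa, rank):
    """P(admit=1) = e^z / (1 + e^z) with z the fitted linear predictor."""
    z = b["const"] + b["gre"] * gre + b["gpa"] * gpa
    z += {1: b["rank1"], 2: b["rank2"], 3: b["rank3"], 4: 0.0}[rank]
    return math.exp(z) / (1 + math.exp(z))

print(round(p_admit(600, 4.0, 3), 3))
print(round(p_admit(650, 4.5, 1), 3))
```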
Poisson Regression Model
Topics

Poisson Distribution

Poisson Regression with Same/Equal Observation Interval

Poisson Regression with Different/Unequal Observation Intervals
(See Exercise Batch 3 for dealing with observations on different
lengths of intervals: adding an OFFSET variable to the model.)

Choosing between Poisson and Negative Binomial Regressions
Poisson Distribution

Individual events happen at an equal rate per interval, e.g. the
number of road accidents at a certain junction per month, the
number of printing errors per page in a certain book, the
number of cancer cases diagnosed in a month at a certain clinic.
(We omit the other assumptions here!)

If X is the number of events, then X = 0, 1, 2, ... (the sample space of X)

Then the probability distribution is:

P(X = x) = e^(−λ) λ^x / x!

The mean number of events per interval is λ, and so is the variance.

For a variance bigger than the mean we can use the negative
binomial distribution.
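The equality of mean and variance can be checked numerically from the pmf (a Python sketch with an illustrative λ = 3; summing to x = 59 captures essentially all the probability mass):

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam**x / x!"""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 3.0
xs = range(60)  # the tail beyond 59 is numerically negligible for lam = 3
mean = sum(x * poisson_pmf(x, lam) for x in xs)
var = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in xs)
print(round(mean, 6), round(var, 6))  # both approximately lam
```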
[Figure: Poisson distributions for various mean values – four panels of
probability versus X for mean = 1, 2, 3 and 4.]
Poisson Distribution

Let the number of occurrences of the event of interest in a specific
interval (space or time) be Y. Then Y = 0, 1, 2, ...
Let Y be affected by certain variables, say, X1, X2 and X3. Y is the
response and X1, X2 and X3 are the explanatory variables.
We use a GLM to model the mean of Y for given x1, x2 and x3:

μ_{x1 x2 x3} = e^(b0 + b1x1 + b2x2 + b3x3)   > 0 always

That is,

log(μ_{x1 x2 x3}) = b0 + b1x1 + b2x2 + b3x3 : the familiar linear model

The variance of Y for given x1, x2 and x3 is the same as its mean.
If that variance is bigger, do negative binomial regression (we omit
it here). One can use a likelihood ratio test to test which one is better.
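The log link can be sketched as follows (illustrative Python; the coefficients are made up, not fitted to any data): the exponential keeps the modeled mean positive, while its log is the familiar linear model.

```python
import math

# Hypothetical coefficients for illustration only
b0, b1, b2, b3 = -1.0, 0.02, 0.5, -0.3

def poisson_mean(x1, x2, x3):
    """mu = exp(b0 + b1*x1 + b2*x2 + b3*x3): the log link keeps mu > 0."""
    return math.exp(b0 + b1 * x1 + b2 * x2 + b3 * x3)

for x1, x2, x3 in [(0, 0, 0), (50, 1, 1), (-50, -2, 3)]:
    mu = poisson_mean(x1, x2, x3)
    assert mu > 0  # whatever the covariates, the mean stays positive
    # and log(mu) recovers the linear predictor
    assert abs(math.log(mu) - (b0 + b1 * x1 + b2 * x2 + b3 * x3)) < 1e-9
print("checks passed")
```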
Example: num_awards is the outcome variable indicating the
number of awards earned by each student at a high school in a
year; math, each student’s score on his/her math final exam, is
an interval predictor variable; and prog is a categorical predictor
variable with three levels (1, 2 and 3) indicating the type of program
in which each student was enrolled.
Data file PoisRegression_sim.sav
(Thanks to UCLA Academic Technology Services for data)
Data:

id    num_awards  prog  math
45    0           3     41
108   0           1     41
15    0           3     44
.     .           .     .
The variable ’prog’ is a good explanatory variable for ’num_awards’:
for each ’prog’ the ’num_awards’ has its mean and variance
almost equal.
To run the Poisson regression:
Analyze => Generalized Linear Models => Generalized Linear Models =>
Type of Model tab: Poisson loglinear, and Response tab:
Dependent variable num_awards. In the Predictors tab put ‘prog’ as Factors
(because we want it to be treated as a categorical variable) and ‘math’
as Covariates. In the Model tab include all the predictors as Main
effects. Leave the options in the Estimation tab as they are. In the Statistics
tab select Analysis Type: Type III and Chi-square Statistic:
Likelihood ratio, and under Print tick Include exponential parameter
estimates. In the EM Means tab select ‘prog’ so that SPSS will
calculate a mean for each of its categories. Select Pairwise for Contrasts to
get a comparison. In Scale select Compute means for response and in
Adjustment for multiple comparison select Bonferroni. Press OK.
Tests of Model Effects

Testing main effects: ’prog’ is a categorical
variable with 3 categories, so two dummies are
used. We test the significance of ’prog’ through
these two dummies together (a chi-squared test
with 2 degrees of freedom). You can see ’prog’ is
significant.

             Type III Likelihood Ratio
Source       Chi-Square   df   Sig.
(Intercept)  69.523       1    .000
prog         14.572       2    .001
math         45.010       1    .000
Dependent Variable: num_awards
Model: (Intercept), prog, math
Parameter Estimates

              B       Std.    95% Wald C.I.      Wald         df   Sig.   Exp(B)   95% Wald C.I. for Exp(B)
                      Error   Lower    Upper     Chi-Square                        Lower    Upper
(Intercept)   -4.877  .6282   -6.109   -3.646    60.283      1    .000   .008     .002     .026
[prog=1]      -.370   .4411   -1.234   .495      .703        1    .402   .691     .291     1.640
[prog=2]      .714    .3200   .087     1.341     4.979       1    .026   2.042    1.091    3.824
[prog=3]      0(a)
math          .070    .0106   .049     .091      43.806      1    .000   1.073    1.051    1.095
(Scale)       1(b)
Dependent Variable: num_awards
Model: (Intercept), prog, math
a. Set to zero because this parameter is redundant.
b. Fixed at the displayed value.

For the interval variable ’math’, we test its main effect’s significance with a
chi-squared test with 1 degree of freedom. You can see ’math’ is significant.
Parameter estimates:
• ’math’ has the parameter value 0.07 (exponentiated value 1.073 =
e^0.07): the expected ’num_awards’ increases by the factor 1.073 for a unit
increase in ’math’.
• ’prog’ is modeled through two dummies (’prog1’ and ’prog2’, where
’prog3’ is the reference category).
’prog1’ has the parameter value -0.37 (exponentiated value 0.691 =
e^-0.37): the expected ’num_awards’ for ’prog1’ decreases by the factor
0.691 from that of ’prog3’ [the difference between the log expected
’num_awards’ for ’prog1’ and the log expected ’num_awards’ for ’prog3’
is -0.37].
’prog2’ has the parameter value 0.714 (exponentiated value 2.042 =
e^0.714): the expected ’num_awards’ for ’prog2’ increases by the factor
2.042 from that of ’prog3’ [the difference between the log expected
’num_awards’ for ’prog2’ and the log expected ’num_awards’ for ’prog3’
is 0.714].
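The exponentiated values quoted above follow directly from the B column (a quick Python check):

```python
import math

# B values from the Parameter Estimates table
b_math, b_prog1, b_prog2 = 0.07, -0.37, 0.714

print(round(math.exp(b_math), 3))   # factor per unit increase in 'math'
print(round(math.exp(b_prog1), 3))  # prog 1 relative to prog 3
print(round(math.exp(b_prog2), 3))  # prog 2 relative to prog 3
```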
Acknowledgement:
A few slides are based on teaching materials for the book
The Practice of Business Statistics: Using Data for
Decisions, Second Edition, by Moore, McCabe, Duckworth
and Alwan.