Lecture 33 - Logistic Regression
Chapter 14
Binary
In many regression applications the response variable has only two outcomes: an event
either did or did not occur. Such a variable is often referred to as a binary or binomial
variable because its behavior is related to the binomial distribution. A regression model with
this type of response can be interpreted as a model that estimates the effect of the
independent variable(s) on the probability of the event occurring.
Binary response data typically appear in one of two ways:
- When observations represent individual subjects, the response is represented by a
dummy or indicator variable having any two values. The most commonly used
values are zero if the event does not occur and unity if it does.
- When observations summarize the occurrence of events for each set of unique
combinations of the independent variables, the response variable is x/n where x is
the number of occurrences and n the number of observations in the set.
Regression with a binary response is illustrated with data from a study of carriers of
muscular dystrophy. Two groups of women, one consisting of known carriers of the
disease and the other a control group, were examined for four types of protein in their
blood. It is known that proteins may be used as a screening tool to identify carriers. The
variables in the resulting data set {ANGEL > Data > Chapter 14 > Binary} are:
Carrier: 0 for control, 1 for carrier
P1: measurement of protein type 1
P2: measurement of protein type 2
P3: measurement of protein type 3
P4: measurement of protein type 4
Objective: Determine the effectiveness of these proteins to identify carriers of the
disease, with special reference to how screening is improved by using measurements of
the other proteins.
Analysis
Because P1 has been the standard, it will be used to illustrate binomial regression with a
single independent variable. Because this looks like a regression problem, you may first
try a linear regression model by regressing Carrier on P1. The Minitab output is:
The regression equation is
Carrier = 0.275 + 0.00188 P1

Predictor   Coef       SE Coef    T     P
Constant    0.27462    0.06766    4.06  0.000
P1          0.0018778  0.0004283  4.38  0.000

S = 0.446282   R-Sq = 22.0%   R-Sq(adj) = 20.9%
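For readers working outside Minitab, the same least-squares fit can be reproduced in a few lines. The sketch below uses Python's statsmodels; the file name dystrophy.csv and the column names are assumptions about how you saved the data, not part of the original data set.

```python
# Sketch: linear probability model, Carrier regressed on P1.
# "dystrophy.csv" and the column names are assumed.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("dystrophy.csv")
X = sm.add_constant(data["P1"])             # intercept plus P1
ols_fit = sm.OLS(data["Carrier"], X).fit()
print(ols_fit.summary())                    # compare with the Minitab table above
# The problem previewed below: fitted "probabilities" escape [0, 1].
print(ols_fit.fittedvalues.min(), ols_fit.fittedvalues.max())
```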
The regression is certainly significant, and the estimated coefficients suggest that the
probability of detecting a carrier increases with measurements of P1 (from the positive
slope). But a scatterplot of the data is interesting:
[Figure: Scatterplot of Carrier vs P1. Carrier (0 or 1) on the vertical axis; P1 from 0 to 700 on the horizontal axis.]
The plot immediately reveals a problem: the model is estimating probabilities that, by
definition, are constrained to lie between zero and one, yet many of the fitted values
fall outside this range.
Another difficulty with this model is that the variance of the binomial response variable is
known to be a function of π(1 − π), where π is the probability of the event. This obviously
violates the equal variance assumption required by the least squares estimation process.
Thus this particular approach to the regression with a binary response appears to have
limited usefulness.
The use of weighted regression may alleviate the unequal variance violation, and the use
of some transformation (possibly arcsine) may provide somewhat better estimates.
However, a more useful approach is afforded by the logistic regression model.
Logistic Regression – Binary Response
Recall that for a binary response y, the expected value of y is E(y) = π, where π denotes
P(y = 1). The logistic model is

$$\pi = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}$$

and, through algebraic manipulation,

$$\ln\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
Notice that although the regression model is linear on the right side, the left side is a
nonlinear function of the response variable π. This function is known as the logit link
function, and because it is not linear, the usual least squares methods cannot be used to
estimate the parameters. Instead, a method known as maximum likelihood is used to
obtain these estimates.

P( y  1)
is known as the
1 
P( y  0)
odds of the event y=1 occurring. For example, if π = 0.8 then the odd of y=1 occurring
0 .8
are
=4, or 4 to 1. Therefore, this is often referred to as the log-odds model.
0 .2
Also, since π = P(y=1), then 1 – π = P(y=0). The ratio
=
To perform the binary logistic regression in Minitab:
Stat > Regression > Binary Logistic and enter Carrier for Response and P1 in
Model. Note: the window for Factors refers to any variable(s) which are categorical.
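If you want to reproduce this fit outside Minitab, a minimal sketch using Python's statsmodels follows (the file and column names are assumptions, as before):

```python
# Sketch: binary logistic regression of Carrier on P1.
import numpy as np
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("dystrophy.csv")            # assumed file name
X = sm.add_constant(data["P1"])
logit_fit = sm.Logit(data["Carrier"], X).fit()
print(logit_fit.summary())                     # coefficients near -2.189 and 0.0303
print(np.exp(logit_fit.params["P1"]))          # odds ratio, near 1.03
```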
Binary Logistic Regression: Carrier versus P1
Link Function: Logit
Response Information [1]

Variable  Value  Count
Carrier   1      32  (Event)
          0      38
          Total  70

Logistic Regression Table [2]

                                               Odds   95% CI
Predictor  Coef       SE Coef    Z      P      Ratio  Lower  Upper
Constant   -2.18913   0.604449   -3.62  0.000
P1         0.0303324  0.0097241  3.12   0.002  1.03   1.01   1.05
Log-Likelihood = -34.556 [3a]
Test that all slopes are zero: G = 27.414, DF = 1, P-Value = 0.000 [3b]
Goodness-of-Fit Tests [4]

Method           Chi-Square  DF  P
Pearson          43.1919     51  0.773
Deviance         48.6574     51  0.567
Hosmer-Lemeshow  2.6442      8   0.955
Table of Observed and Expected Frequencies: [5]
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

                              Group
Value        1    2    3    4    5    6    7    8    9   10  Total
1     Obs    1    2    1    2    3    3    3    6    6    5     32
      Exp  1.2  1.5  1.6  2.1  2.4  2.7  3.5  5.3  6.7  5.0
0     Obs    6    5    6    6    5    4    4    1    1    0     38
      Exp  5.8  5.5  5.4  5.9  5.6  4.3  3.5  1.7  0.3  0.0
Total        7    7    7    8    8    7    7    7    7    5     70
Measures of Association: [6]
(Between the Response Variable and Predicted Probabilities)

Pairs       Number  Percent
Concordant  971     79.9
Discordant  235     19.3
Ties        10      0.8
Total       1216    100.0

Summary Measures
Somers' D              0.61
Goodman-Kruskal Gamma  0.61
Kendall's Tau-a        0.30
Interpreting Output
1 - Response Information displays the number of missing observations and the number of
observations that fall into each of the two response categories. The response value that has been
designated as the reference event is the first entry under Value and labeled as the event. In this
case, the reference event is Carrier = 1 (meaning the person is a carrier).
2 - Logistic Regression Table shows the estimated coefficients, standard error of the
coefficients, z-values, and p-values. When you use the logit link function, you also see the odds
ratio and a 95% confidence interval for the odds ratio. If there are several independent variables
then the individual tests are useful for determining the importance of the individual variables.
The odds ratio is computed as exp(b1) and is the multiplicative change in the event odds,
defined as P(event)/P(non-event), for a unit increase in the independent variable. Typically
analysts compute exp(b1) - 1, which is an estimate of the percentage increase (or decrease)
in the odds P(y=1)/P(y=0) for every 1-unit increase in X (while holding any other X's in the
model fixed). In this example the estimated odds ratio is exp(0.0303324) = 1.03, and
exp(b1) - 1 = 0.03. For each additional unit increase in protein P1, the odds of being
classified as a carrier increase by 3%. Notice that this odds ratio is very close to 1 even
though the test for P1 (p = 0.002) gives evidence that the estimated coefficient for P1 is
not equal to zero. A more meaningful difference would be found if the odds ratio were higher.
The interpretation of the slope itself is in terms of change in log odds. For example, with a
slope of 0.0303 we would state that for a unit increase in P1 the log odds of being a carrier
increase by 0.0303.
NOTE: Some stat packages such as SAS present a chi-square statistic instead of the z-statistic.
For large sample sizes either works; for smaller sample sizes the chi-square should be used.
The relationship between the two is simply z² = chi-square with 1 degree of freedom. For this
example, the chi-square statistic would equal (3.12)² = 9.7344, which with 1 degree of freedom
produces the same p-value.
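The z/chi-square equivalence is easy to verify numerically; for example, with scipy:

```python
# Check that a Wald z of 3.12 and a 1-df chi-square of z^2 give the same p-value.
from scipy.stats import norm, chi2

z = 3.12
print(z**2)                  # 9.7344, the chi-square statistic
print(2 * norm.sf(abs(z)))   # two-sided p-value from z, about 0.0018
print(chi2.sf(z**2, df=1))   # identical p-value from the chi-square
```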
From the output, our fitted logistic regression model is:

$$\hat{\pi} = \frac{\exp(-2.189 + 0.0303\,x_1)}{1 + \exp(-2.189 + 0.0303\,x_1)}$$
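Plugging a P1 value into the fitted model gives an estimated carrier probability. For example:

```python
# Estimated probability of being a carrier at a given P1, from the fitted model above.
import numpy as np

def p_hat(p1):
    eta = -2.189 + 0.0303 * p1               # fitted linear predictor
    return np.exp(eta) / (1 + np.exp(eta))   # inverse logit

print(p_hat(100))   # about 0.699, a value used again in the odds-ratio discussion later
```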
3a - This is a measure of model fit and can be used to compare models by evaluating the
difference between minus twice the log-likelihood values of the models. The test takes -2 times
the log-likelihood of the smaller model minus -2 times the log-likelihood of the larger model
and finds the p-value using a χ² test with degrees of freedom equal to the difference in the
number of parameters estimated in the two models. NOTE: some software packages (e.g. SAS)
report a -2 log likelihood, which is simply -2 times the log-likelihood value provided by
Minitab.
3b - Next is the statistic G, the log-likelihood ratio test, which is a chi-square test. This
statistic tests the null hypothesis that all the coefficients associated with predictors (i.e.
the slopes) equal zero versus the alternative that these coefficients are not all equal to
zero. In this example, G = 27.414, with a p-value of 0.000, indicating there is sufficient
evidence that the coefficient for P1 is different from zero.
4 - The goodness-of-fit tests, with p-values ranging from 0.567 to 0.955, indicate that there
is insufficient evidence to claim that the model does not fit the data adequately. If a p-value
were less than some stated α-level, the test would reject the null hypothesis of an adequate fit.
5 - This table lets you see how well the model fits the data by comparing the observed and expected
frequencies. There is insufficient evidence that the model does not fit the data well, as the
observed and expected frequencies are similar. This supports the conclusions made by the
goodness-of-fit tests in 4.
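Minitab computes these tests automatically. For intuition about what the Hosmer-Lemeshow statistic is doing, here is a rough sketch (the helper and variable names are ours; y is the 0/1 response and p the predicted probabilities from a fitted model):

```python
# Rough sketch of the Hosmer-Lemeshow test: bin observations by predicted
# probability, then compare observed and expected event counts per bin.
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y, p, groups=10):
    df = pd.DataFrame({"y": y, "p": p})
    df["bin"] = pd.qcut(df["p"], groups, labels=False, duplicates="drop")
    stat = 0.0
    for _, g in df.groupby("bin"):
        obs, exp, n = g["y"].sum(), g["p"].sum(), len(g)
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n))  # (O-E)^2 / (n*pi*(1-pi))
    dof = df["bin"].nunique() - 2
    return stat, chi2.sf(stat, dof)
```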
6 – This portion provides measures of association to assess the quality of the model. These
measures are based on an analysis of individual pairs of observations with different responses. In
this example there are 38 zeroes and 32 ones; hence there are 38 x 32 = 1216 such pairs. A pair is
deemed concordant if the observation with the higher response also has the higher estimated
probability (i.e. the individual carrying the disease has a higher probability of having the disease),
discordant if the reverse is true, and tied if the estimated probabilities are identical. The
output gives the count and percentage of pairs in each of these classes; the higher the
percentage of concordant pairs, the better the fit of the model. The right-hand portion of this
output gives three different measures of rank correlation computed from these quantities. These
measures range from -1 to 1; a larger value implies a stronger relationship, i.e. stronger
predictive validity for that particular model. These statistics can then be used to compare
models. In this example the values range from 0.30 to 0.61, which implies reasonable predictive
ability.
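These pair counts can be reproduced directly from the responses and predicted probabilities; a sketch for the binary case (the function and variable names are ours):

```python
# Count concordant, discordant, and tied pairs for a binary response,
# and compute Somers' D = (concordant - discordant) / total pairs.
def pair_counts(y, p):
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    conc = disc = ties = 0
    for p1 in events:              # every (event, non-event) pair
        for p0 in nonevents:
            if p1 > p0:
                conc += 1          # event case got the higher predicted probability
            elif p1 < p0:
                disc += 1
            else:
                ties += 1
    total = conc + disc + ties     # here 32 * 38 = 1216
    return conc, disc, ties, (conc - disc) / total
```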
Logistic regression is also applicable to multi-level responses. The response may be
ordinal (no pain, slight pain, substantial pain) or nominal (Democrat, Republican,
Independent). For ordinal response outcomes, you can model functions called cumulative
logits by performing ordered logistic regression using the proportional odds model.
Open Data > Logistic Regression > Ordinal
About the data: Male (0) and female (1) subjects received an active (1) or placebo (0)
treatment for their arthritis pain, and the subsequent extent of improvement was recorded as
marked (1), some (2), or none (3).
One possible strategy would be to create dichotomous response variables by combining
two of the response categories. However, since there is a natural ordering to these
response levels, it makes sense to consider a strategy that takes advantage of this
ordering.
In Minitab select Stat > Regression > Ordinal Logistic. For response enter Improve and
for model enter Gender, Treatment. Also, since both of these predictors are categorical
variables you need to enter them as factors, too. Finally, since you are modeling factors
click Results and select “In addition, list of factor level values, and tests for terms with
more than 1 degree of freedom.”
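Outside Minitab, a proportional-odds model can be fit with statsmodels' OrderedModel; the sketch below assumes the data were saved with the column names used above. Note that packages parameterize the cumulative logits differently, so the slopes may come out with the opposite sign from Minitab's.

```python
# Sketch: proportional-odds (cumulative logit) model for the arthritis data.
# "arthritis.csv" and the column names are assumptions about how the data were saved.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

data = pd.read_csv("arthritis.csv")
improve = data["Improve"].astype(
    pd.CategoricalDtype(categories=[1, 2, 3], ordered=True))  # marked < some < none
model = OrderedModel(improve, data[["Gender", "Treatment"]], distr="logit")
fit = model.fit(method="bfgs")
print(fit.summary())   # compare magnitudes with Minitab's Gender and Treatment rows
```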
Interpreting Output:
The response and factor information are interpreted in the same way as for binary logistic
regression.
From the Logistic Regression Table, the p-values are used to test for statistical evidence
that the respective predictors have an effect on the response. Here, both p-values are small,
indicating that Treatment and Gender have a statistically significant effect on Improvement.
The value under each Factor (i.e. 1 under Treatment and Gender) indicates which factor level
is being compared to the other factor levels. The values labeled Const(1) and Const(2) are
estimated intercepts for the logits of the cumulative probabilities of marked improvement and
of marked or some improvement, respectively. Because the cumulative probability for the last
response value is 1, there is no need to estimate an intercept for no improvement.
The coefficients for the predictors represent the increments in log odds for Gender = female
and Treatment = active, respectively. That is, exp(b1) = exp(1.319) = 3.74 means that females
have 3.74 times the odds of showing improvement that males have, both for marked improvement
versus some or no improvement and for marked or some improvement versus no improvement.
Subjects receiving the active drug have exp(1.797) = 6.03 times the odds of showing
improvement that those on placebo have, again both for marked improvement versus some or no
improvement and for marked or some improvement versus no improvement.
The log-likelihood, the all-slopes test, the goodness-of-fit tests, and the measures of
association are interpreted similarly to their binary logistic regression counterparts.
Comparing models using log-likelihood statistics
To test whether the addition of a covariate or covariates is statistically warranted we can
compare the log-likelihood from the smaller model to that from the larger model. Twice
this difference follows a chi-square distribution with degrees of freedom equal to the
difference in parameters estimated.
Example: Create an interaction term of Gender x Treatment. Re-compute the logistic regression
by entering this interaction term into the model and factor statements. The log-likelihood
from the model containing only the main effects was -75.015, and from the model including the
interaction we get -74.860. The difference is 75.015 - 74.860 = 0.155. This value taken twice
is 0.310, which follows a chi-square distribution with 1 degree of freedom (4 estimates in the
smaller model versus 5 in the larger). From Minitab we can calculate a p-value by Calc >
Probability Distributions > Chi-square: enter 1 as the degrees of freedom and 0.310 as the
Input Constant. Taking 1 minus the resulting probability gives a p-value of 0.578, which is a
large p-value, indicating that adding the interaction term to the main-effects model is not
significant, i.e. we do not need the interaction term.
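The same p-value can be computed with any chi-square routine; for example:

```python
# Likelihood-ratio test for the interaction term.
from scipy.stats import chi2

G = 2 * (-74.860 - (-75.015))   # twice the log-likelihood difference = 0.310
print(chi2.sf(G, df=1))         # about 0.578, matching the Minitab calculation
```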
Ordinal Logistic Regression: Improve versus Gender, Treatment
Link Function: Logit
Response Information

Variable  Value  Count
Improve   1      28
          2      14
          3      42
          Total  84
Factor Information

Factor     Levels  Values
Gender     2       0, 1
Treatment  2       0, 1
Logistic Regression Table

                                             Odds   95% CI
Predictor  Coef      SE Coef   Z      P      Ratio  Lower  Upper
Const(1)   -2.66719  0.599697  -4.45  0.000
Const(2)   -1.81280  0.556609  -3.26  0.001
Gender
  1        1.31875   0.529188  2.49   0.013  3.74   1.33   10.55
Treatment
  1        1.79730   0.472822  3.80   0.000  6.03   2.39   15.24
Log-Likelihood = -75.015
Test that all slopes are zero: G = 19.887, DF = 2, P-Value = 0.000
Goodness-of-Fit Tests

Method    Chi-Square  DF  P
Pearson   1.91000     4   0.752
Deviance  2.71210     4   0.607
Measures of Association:
(Between the Response Variable and Predicted Probabilities)

Pairs       Number  Percent
Concordant  1268    58.8
Discordant  324     15.0
Ties        564     26.2
Total       2156    100.0

Summary Measures
Somers' D              0.44
Goodman-Kruskal Gamma  0.59
Kendall's Tau-a        0.27
The table of concordant, discordant, and tied pairs is calculated by forming all possible pairs of
observations with different response values. Suppose the response values are 1, 2, and 3.
Minitab pairs every observation with response value 1 with every observation with response
values of 2 and 3 and then pairs every observation with the response value 2 with every
observation with response value 3. The total number of pairs equals (the number of
observations with response 1) x (the number with response 2), plus (the number with response
1) x (the number with response 3), plus (the number with response 2) x (the number with
response 3). In this example this is 28*14 + 28*42 + 14*42 = 2156.
To determine whether the pairs are concordant or discordant, Minitab calculates the cumulative
predicted probabilities of each observation and compares these values for each pair of
observations.
For pairs involving the lowest response value (in the example above, that is 1), a pair is
concordant if the cumulative probability up to the lowest response value is greater for the
observation with the lower response value than for the observation with the higher response
value. For pairs with the two highest response values (in the example above, pairs with 2 and
3), a pair is concordant if the cumulative probability up to 2 is greater for the observation
with response value 2 than for the observation with response value 3.
Recall our discussion of slope interpretation and the odds ratio for binary outcomes. Using
Carrier (1 = Yes) and P1 as the predictor, the slope value was 0.0303. The interpretation for
an increase of one unit in P1 was a 3% increase in the odds of being a carrier. We arrived at
this 3% by finding exp(b1) = exp(0.0303) = 1.03, which translates to the 3% increase in the
odds. Furthermore, we agreed that this does not convert to an increase of 0.03 in the outcome
event probability, i.e. the probability of being a carrier does not increase by 0.03 for each
unit increase in P1. In fact we found the probability of being a carrier when P1 = 100 to be
0.699, and for P1 = 101 it is 0.706.
However, we can use these probabilities to calculate the odds ratio by comparing the odds of
being a carrier at P1 = 101 to the odds of being a carrier at P1 = 100. Recall that the odds
are found by π/(1 − π). This gives the following:

At P1 = 101 the odds are 0.706/0.294 = 2.401, and at P1 = 100 the odds are 0.699/0.301 = 2.322.

The odds ratio for this unit increase is then 2.401/2.322 = 1.03, matching the odds ratio from
exp(b1).
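This arithmetic is quick to verify:

```python
# Odds ratio for a one-unit increase in P1, from the fitted probabilities.
odds_101 = 0.706 / (1 - 0.706)   # odds of being a carrier at P1 = 101, about 2.401
odds_100 = 0.699 / (1 - 0.699)   # odds at P1 = 100, about 2.322
print(odds_101 / odds_100)       # about 1.03 = exp(b1)
```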
This method extends to our ordinal logistic discussion. Using Gender, with female (1) being
the event, and the response values for Improvement of marked (1), some (2), and none (3), we
had a slope for Gender of 1.319. This resulted in an odds ratio of exp(1.319) = 3.74.
Interpreting the slope based on a unit increase would simply mean comparing males to females,
keeping in mind that we would hold the Treatment factor constant. That is, we would compare
gender for those receiving treatment or compare gender for those receiving placebo. We found
the following response probabilities for females receiving treatment:
Marked: 0.610   Some: 0.176   None: 0.214
Calculating these probabilities for males receiving treatment, we have:
Marked: 0.295   Some: 0.201   None: 0.504
Now, for the odds ratio comparing females to males for, say, the Marked outcome versus Some
or None, we have:

For females the odds are 0.610/0.390 = 1.564 and for males the odds are 0.295/0.705 = 0.418.

The odds ratio is then 1.564/0.418 = 3.74.
But we also said this 3.74 odds ratio would be the same when comparing Marked or Some to None.
For this we have the following odds:

For females the odds are 0.786/0.214 = 3.673 and for males the odds are 0.496/0.504 = 0.984.

The odds ratio is then 3.673/0.984 = 3.74 (to rounding in the probabilities).
As you can see, the result is the same. If you calculated the odds and odds ratios for the
placebo group you would again arrive at 3.74 as the odds ratio. The same logic applies to the
odds ratio for Treatment.
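Both versions of the check can be scripted from the probabilities above:

```python
# Gender odds ratio computed two ways from the cumulative response probabilities;
# both should come out near exp(1.319) = 3.74.
or_marked = (0.610 / 0.390) / (0.295 / 0.705)   # Marked vs. Some-or-None
or_cum = (0.786 / 0.214) / (0.496 / 0.504)      # Marked-or-Some vs. None
print(or_marked, or_cum)
```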