10.2 Logistic and Probit Regression Models
The logistic regression model is useful when you want to fit a linear regression
model to a binary response variable. You have several levels of an independent, or
predictor variable, X. Denote these levels X1, X2,...,Xm. At the ith level of X, you have Ni
(i=1,2,...,m) observations, each of which is an independent Bernoulli trial. Of the Ni
observations, yi are classified as “the outcome of interest” - or “success” - and the
remaining Ni-yi have “the other” classification, e.g. “failure”.
At the ith level of X, yi has a binomial distribution, or, more formally,
yi~Binomial(Ni, πi), where Ni is the number of trials and πi is the probability of a success
on a given trial. The object of logistic regression is to estimate or test for changes in πi
associated with changes in Xi, specifically by modeling these changes via regression.
10.2.1 Logistic Regression: Challenger Shuttle O-Ring Data Example
Here is an example. Following the 1986 Challenger space shuttle disaster,
investigators focused on a suspected association between O-ring failure and low
temperature at launch. Data documenting the presence or absence of primary O-ring
thermal distress in the 23 shuttle launches preceding the Challenger mission appeared in
Dalal et al. (1989) and were reproduced in Agresti (1996). Output 10.1 shows the raw
data. Temperature at launch (TEMP) is the X variable. At each TEMP, TD denotes the
number of launches in which thermal distress occurred and TOTAL gives the number of
launches. TOTAL is the N variable, TD is the y variable, and the variable NO_TD is
equal to N-y.
Output 10.1 Challenger O-Ring Thermal Distress Data

Obs    temp    td    no_td    total
  1      53     1        0        1
  2      57     1        0        1
  3      58     1        0        1
  4      63     1        0        1
  5      66     0        1        1
  6      67     0        3        3
  7      68     0        1        1
  8      69     0        1        1
  9      70     2        2        4
 10      72     0        1        1
 11      73     0        1        1
 12      75     1        1        2
 13      76     0        2        2
 14      78     0        1        1
 15      79     0        1        1
 16      81     0        1        1

Inspection of the data in Output 10.1 reveals that the incidence of thermal distress,
indicated by frequency of TD versus NO_TD, appears to be greater at low temperatures.
Therefore, it is of interest to fit a model for which π, the probability of thermal distress,
decreases as temperature increases. However, fitting a model directly to π, such as
π̂i = β0 + β1Xi, where Xi denotes the temperature at the ith launch, is not necessarily a
reasonable approach. This is partly because, for theoretical reasons explained in Section
10.6, “Background Theory,” the binomial random variable is not linear with respect to π.
It is also partly because fitted values of π from this model are not constrained to lie
between 0 and 1, allowing the possibility of nonsense estimates of π.
A better approach is to fit the linear regression model to a function of π that
keeps the fitted probability between 0 and 1 and with which the binomial random variable
at least theoretically has a linear relationship. Two such functions are the logit, defined as
$\mathrm{logit}(\pi_i) = \log\left(\frac{\pi_i}{1-\pi_i}\right)$, and the probit, defined as
$\mathrm{probit}(\pi_i) = \Phi^{-1}(\pi_i)$, where $\Phi^{-1}(\cdot)$ is the
inverse of the cumulative distribution function of the standard normal distribution, that is, the
value on a standard normal table corresponding to a probability of πi.
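As a numerical illustration of the two link functions just defined, here is a minimal Python sketch, offered as a cross-check outside SAS rather than as part of the GENMOD workflow. The function names `logit`, `inv_logit`, `probit`, and `inv_probit` are illustrative choices, not names from the original text; the standard normal quantile comes from Python's standard-library `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def logit(p):
    """Logit link: log of the odds p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Inverse logit link: maps any real eta to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def probit(p):
    """Probit link: standard normal quantile of p, for 0 < p < 1."""
    return STD_NORMAL.inv_cdf(p)


def inv_probit(eta):
    """Inverse probit link: standard normal CDF evaluated at eta."""
    return STD_NORMAL.cdf(eta)


# Both links send probabilities in (0, 1) to the whole real line,
# and both inverse links bring any real value back into (0, 1).
for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(f"p={p:.2f}  logit={logit(p):8.4f}  probit={probit(p):8.4f}")
```

Because a linear predictor β0 + β1X can take any real value, modeling the link of π rather than π itself is what keeps the fitted probabilities inside the unit interval.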
The logit and probit are both examples of link functions. The link function is a
fundamental component of generalized linear models, because it specifies the relationship
between the mean of the response variable and the linear model. Note that the mean of
the sample proportion, yi/Ni, is πi. For reasons explained in more detail in Section 10.6,
the logit is the most natural link function for binomial data. Models using the logit are
called “logistic” models; in this case we are interested in a logistic regression model
because we want to regress a binomial random variable on temperature.
The simplest logistic regression model for these data is logit(πi) = β0 + β1Xi. You
can fit the logistic regression model using PROC GENMOD, using the following SAS
program statements:
proc genmod;
   model td/total=temp / link=logit dist=binomial type1;
run;
From the SAS statements, you can see that GENMOD has a number of features in
common with PROC GLM and MIXED, but a number of unique features as well. As
with GLM and MIXED, the MODEL statement has the general form of 〈response
variable〉=〈independent variable(s)〉. The independent variables can be direct regression
variables, or they can be CLASS variables, which you use in GENMOD to create the
generalized linear model analog of analysis of variance. As with GLM and MIXED,
GENMOD treats independent variables as direct regression variables by default and as
“ANOVA” variables only if they appear first in a CLASS statement. Examples that use
the CLASS statement appear later in this chapter.
For binomial response variables the syntax differs from other SAS linear model
procedures. You specify the response variable as the ratio of the number of outcomes of
interest (the y variable, in this case TD) divided by the number of observations per level
of X (the N variable, in this case TOTAL). The binomial is unique in this respect. For
other distributions, shown in examples later in this chapter, the form of the response
variable is the same as other linear model procedures in SAS.
To complete the model statement, you also specify the distribution of the
response variable, the link function, and other options. The DIST option specifies the
distribution. If you do not specify a distribution, GENMOD uses either the binomial
distribution (if the response variable is a ratio, as above) or the normal distribution (for all
other response variables) as the default. Several distributions are available in GENMOD.
Consult SAS Online Documentation for Version 8 (1999) for a complete list.
Alternatively, you can provide your own distribution or quasi-likelihood, if none of the
distributions provided with GENMOD are suitable. Section 10.4.5 presents an example of
a user-specified distribution. The LINK option specifies the link function. If you do not
specify a link function, GENMOD will use the canonical link, that is, the link that
follows naturally from the probability distribution (see Section 10.6) that you select. In
this example, the logit link is the default because the ratio response variable implies the
binomial distribution and the logit is its canonical link. Thus, neither the
DIST=BINOMIAL nor the LINK=LOGIT option is actually needed for this example.
However, it is good practice to include the DIST and LINK options even when they are
not strictly necessary, if only for the sake of clarity.
The TYPE1 option yields likelihood ratio test statistics for hypotheses based on
Type I estimable functions, as described in Chapter X. You can also compute tests based
on Type III estimable functions by using the option TYPE3. For Type 3 tests, you can use
likelihood ratio statistics, the default, or you can use the WALD option to compute Wald
statistics. Section 10.6 gives explanations of likelihood ratio and Wald test statistics.
Several other options are also available. This chapter illustrates several of these options
where appropriate.
Output 10.2 shows the output generated by PROC GENMOD.
Output 10.2 Basic GENMOD Output for Challenger O-Ring Logistic Regression

The GENMOD Procedure

Model Information
Data Set                      WORK.O_RING
Distribution                  Binomial
Link Function                 Logit
Response Variable (Events)    td
Response Variable (Trials)    total
Observations Used             16
Number Of Events              7
Number Of Trials              23

Criteria For Assessing Goodness Of Fit
Criterion             DF        Value    Value/DF
Deviance              14      11.9974      0.8570
Scaled Deviance       14      11.9974      0.8570
Pearson Chi-Square    14      11.1303      0.7950
Scaled Pearson X2     14      11.1303      0.7950
Log Likelihood               -10.1576

Algorithm converged.

Analysis Of Parameter Estimates
                            Standard    Wald 95%
Parameter   DF   Estimate   Error       Confidence Limits     ChiSquare   Pr > ChiSq
Intercept    1    15.0429   7.3786       0.5810    29.5048         4.16       0.0415
temp         1    -0.2322   0.1082      -0.4443    -0.0200         4.60       0.0320
Scale        0     1.0000   0.0000       1.0000     1.0000

NOTE: The scale parameter was held fixed.

LR Statistics For Type 1 Analysis
Source       Deviance   DF   ChiSquare   Pr > ChiSq
Intercept     28.2672
temp          20.3152    1        7.95       0.0048

The beginning of the output contains some basic information about the data set.
You can use this output to make sure that the data were read as intended, that the correct
response variable was analyzed, that the right distribution and link were used, and so
forth. The first substantive output is the “Criteria for Assessing Goodness of Fit.” You
can use the deviance, defined in Section 10.6, to check the fit of the model by comparing
the computed deviance to a χ² distribution with 14 d.f. In this case, the deviance is
11.9974, whereas the table value of χ²(14) at α=0.25 is 17.12, indicating no evidence of
lack of fit.
The Pearson Chi-Square provides an alternative way to check goodness of fit.
Like the deviance, the Pearson χ² also has an approximate χ²(14) distribution. Its
computed value is 11.1303, similar to the deviance and also suggesting no evidence of
lack of fit. The “Scaled Deviance” and “Scaled Pearson Chi-Square” are not of interest in
this example. They are relevant when there is evidence of lack of fit resulting from
overdispersion. Section 10.4.3 presents an example.
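The comparison of the deviance and Pearson statistics against a χ² distribution with 14 d.f. can be checked numerically. The following Python sketch is only an illustration outside SAS; it relies on the closed-form upper-tail probability of a χ² variable with an even number of degrees of freedom, and the helper name `chi2_upper_tail_even_df` is a hypothetical name introduced here.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """P(chi-square with df d.f. exceeds x); closed form valid for even df only.

    For df = 2k the tail is exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0   # j = 0 term of the sum
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


df = 14
deviance = 11.9974   # from Output 10.2
pearson = 11.1303    # from Output 10.2

p_deviance = chi2_upper_tail_even_df(deviance, df)
p_pearson = chi2_upper_tail_even_df(pearson, df)
p_table = chi2_upper_tail_even_df(17.12, df)  # table value quoted in the text

print(f"Deviance/DF = {deviance / df:.4f}, upper-tail p = {p_deviance:.3f}")
print(f"Pearson/DF  = {pearson / df:.4f}, upper-tail p = {p_pearson:.3f}")
print(f"Upper-tail area beyond 17.12 = {p_table:.3f}")
```

Both upper-tail probabilities are well above 0.25, which is consistent with the conclusion that there is no evidence of lack of fit.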
The “Analysis of Parameter Estimates” gives the estimates of the regression
parameters as well as their standard errors and confidence limits. Here, the estimated
intercept is β̂0 = 15.0429 with a standard error of 7.3786. The estimated slope is β̂1 = -0.2322 with a standard error of 0.1082.
The “Chi-Square” statistics and associated p-values (“Pr > ChiSq”) given in the
“Analysis of Parameter Estimates” table are Wald statistics for testing null hypotheses of
zero intercept and slope. For example, the Wald χ² statistic to test H0: β1=0 is 4.60 and
the p-value is 0.0320. You can also test the hypothesis of zero slope using the likelihood
ratio statistic generated by the TYPE1 option and printed under “LR Statistics For Type 1
Analysis.” The likelihood ratio χ² is 7.95 and its p-value is 0.0048. The fact that the
likelihood ratio statistic is larger than the corresponding Wald statistic in this case is
coincidental. In general, no pattern exists, and there is no compelling evidence in the
literature to indicate that either statistic is preferable.
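The Wald and likelihood ratio results quoted above can be reproduced from the printed estimates. This Python sketch is a hedged cross-check, not GENMOD's own computation: it assumes the usual definitions (Wald χ² = (estimate / standard error)² on 1 d.f., Wald limits = estimate ± 1.96 × standard error, and LR χ² = difference between the two deviances in the Type 1 table). Small discrepancies arise because the inputs are rounded to four decimals.

```python
import math


def chi2_1df_upper_tail(x):
    """Upper-tail probability of a chi-square statistic with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))


Z975 = 1.959964  # 97.5th percentile of the standard normal distribution

# (estimate, standard error) as printed in Output 10.2
estimates = {"Intercept": (15.0429, 7.3786), "temp": (-0.2322, 0.1082)}

wald = {}
for name, (est, se) in estimates.items():
    chisq = (est / se) ** 2
    wald[name] = {
        "chisq": chisq,
        "p": chi2_1df_upper_tail(chisq),
        "lower": est - Z975 * se,
        "upper": est + Z975 * se,
    }
    print(name, {k: round(v, 4) for k, v in wald[name].items()})

# Likelihood ratio test of H0: beta1 = 0 from the Type 1 deviances
lr_chisq = 28.2672 - 20.3152
lr_p = chi2_1df_upper_tail(lr_chisq)
print(f"LR chi-square = {lr_chisq:.2f}, p = {lr_p:.4f}")
```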
10.2.2 Using the Inverse Link to Get the Predicted Probability
From the output, you can see that 15.0429-0.2322*TEMP is the estimated
regression equation. The regression equation allows you to compute the predicted logit
for a desired temperature. For example, at 50°, the predicted logit is 15.0429-0.2322*50
= 3.4329.
Typically, the logit is not of direct interest. On the other hand, the predicted
probability is of interest, in this case, the probability of O-ring thermal distress occurring
at a given temperature. You use the inverse link function to convert the logit to a
probability. In this example, the logit link function is $\eta = \log\left(\frac{\pi}{1-\pi}\right)$; hence, the inverse link is
$\pi = \frac{e^{\eta}}{1+e^{\eta}}$. For 50°, using η̂ = 3.4329 as calculated above, the predicted probability is
therefore $\hat{\pi} = \frac{e^{3.4329}}{1+e^{3.4329}} = 0.9687$. That is, according to the logistic regression estimated
from the data, the probability of observing primary O-ring thermal distress at 50° is
0.9687.
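The hand calculation above can be written as a short Python sketch. It is only an illustration using the rounded estimates from Output 10.2; the function names `predicted_logit` and `predicted_probability` are hypothetical names chosen for this example.

```python
import math

B0, B1 = 15.0429, -0.2322  # rounded estimates from Output 10.2


def predicted_logit(temp):
    """Estimated linear predictor (logit) at a given launch temperature."""
    return B0 + B1 * temp


def predicted_probability(temp):
    """Inverse logit link applied to the predicted logit."""
    eta = predicted_logit(temp)
    return math.exp(eta) / (1.0 + math.exp(eta))


for temp in (50, 60, 70, 80):
    print(f"temp={temp}  logit={predicted_logit(temp):8.4f}  "
          f"probability={predicted_probability(temp):.4f}")
```

Because the coefficients are rounded to four decimals, the values differ slightly from those SAS computes at full precision.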
You can convert the standard error from the link function scale to the inverse
link scale using the Delta Rule. The general form of the Delta Rule for generalized linear
models, where h(·) denotes the inverse link, is: $\mathrm{Var}[h(\hat{\eta})]$ is approximately equal to
$\left[\frac{\partial h(\eta)}{\partial \eta}\right]^{2} \mathrm{Var}(\hat{\eta})$. For the logit link,
some algebra yields $\frac{\partial h(\eta)}{\partial \eta} = \pi(1-\pi)$ and hence the standard error of
π̂ is π̂(1 − π̂) × s.e.(η̂). You can use GENMOD to compute s.e.(η̂), as well as η̂ and
related statistics, using the ESTIMATE statement. The syntax and placement of the
ESTIMATE statement are similar to GLM and MIXED. Here are the statements to
compute η̂ for several temperatures of interest. Output 10.3 shows the results.
estimate 'logit at 50 deg' intercept 1 temp 50;
estimate 'logit at 60 deg' intercept 1 temp 60;
estimate 'logit at 64.7 deg' intercept 1 temp 64.7;
estimate 'logit at 64.8 deg' intercept 1 temp 64.8;
estimate 'logit at 70 deg' intercept 1 temp 70;
estimate 'logit at 80 deg' intercept 1 temp 80;

Output 10.3 Estimated Logits for Various Temperatures of Interest

Contrast Estimate Results
                                Standard
Label                Estimate   Error      Alpha   Confidence Limits     ChiSquare
logit at 50 deg        3.4348   2.0232     0.05    -0.5307    7.4002          2.88
logit at 60 deg        1.1131   1.0259     0.05    -0.8975    3.1238          1.18
logit at 64.7 deg      0.0220   0.6576     0.05    -1.2669    1.3109          0.00
logit at 64.8 deg     -0.0012   0.6518     0.05    -1.2788    1.2764          0.00
logit at 70 deg       -1.2085   0.5953     0.05    -2.3752   -0.0418          4.12
logit at 80 deg       -3.5301   1.4140     0.05    -6.3014   -0.7588          6.23

The column “Estimate” gives you the estimated logit. For “logit at 50 deg”, η̂ at 50°, the
computed value is 3.4348, rather than the “hand-calculated” η̂ = 3.4329 given above.
This reflects rounding error: SAS computations involve much greater precision. From the
output, you can see that the standard error of η̂ at 50° is 2.0232. Using the Delta Rule,
the standard error for π̂ is π̂(1 − π̂) × s.e.(η̂) = 0.9687 × (1 − 0.9687) × 2.0232 = 0.0613.
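The factor π̂(1 − π̂) in this standard error comes from differentiating the inverse logit link. The steps are sketched here for reference, writing h(η) for the inverse link as in the Delta Rule:

```latex
\begin{aligned}
h(\eta) &= \frac{e^{\eta}}{1+e^{\eta}} = \pi, \\
\frac{\partial h(\eta)}{\partial \eta}
  &= \frac{e^{\eta}\,(1+e^{\eta}) - e^{\eta}\,e^{\eta}}{(1+e^{\eta})^{2}}
   = \frac{e^{\eta}}{(1+e^{\eta})^{2}} \\
  &= \frac{e^{\eta}}{1+e^{\eta}} \cdot \frac{1}{1+e^{\eta}}
   = \pi\,(1-\pi), \\
\mathrm{s.e.}(\hat{\pi})
  &\approx \hat{\pi}\,(1-\hat{\pi}) \times \mathrm{s.e.}(\hat{\eta}).
\end{aligned}
```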
In addition to η̂ and s.e.(η̂), Output 10.3 also gives upper and lower 95%
confidence limits for the predicted logit. You can use the inverse link to convert these to
confidence limits for the predicted probability. For example, at 50°, the lower confidence
limit for η is -0.5307. Applying the inverse link, the lower confidence limit for π is
$\frac{e^{-0.5307}}{1+e^{-0.5307}} = 0.3704$. A similar computation using the upper confidence limit for η,
7.4002, yields the upper confidence limit for π, 0.9994. It is better to use the upper and
lower limits for η and convert them using the inverse link rather than using the standard
error of π̂ computed from the Delta Rule. The standard error results in a symmetric
interval, i.e. π̂ ± t × s.e.(π̂), which is not, in general, a sensible confidence interval. The
confidence interval should be asymmetric, reflecting the non-linear nature of the link
function.
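To make the contrast between the two kinds of interval concrete, the following Python sketch uses the 50° row of Output 10.3. It is an illustration only; the symmetric interval uses the normal multiplier 1.96 as an assumption, since the text leaves the multiplier t unspecified.

```python
import math


def inv_logit(eta):
    """Inverse logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


# 50-degree row of Output 10.3
eta_hat, se_eta = 3.4348, 2.0232
lower_eta, upper_eta = -0.5307, 7.4002

p_hat = inv_logit(eta_hat)
se_p = p_hat * (1.0 - p_hat) * se_eta          # Delta Rule standard error

# Preferred: transform the limits for eta with the inverse link
ci_inverse_link = (inv_logit(lower_eta), inv_logit(upper_eta))

# Not recommended: symmetric interval built from the Delta Rule s.e.
Z975 = 1.959964  # assumed multiplier (standard normal 97.5th percentile)
ci_symmetric = (p_hat - Z975 * se_p, p_hat + Z975 * se_p)

print(f"p_hat = {p_hat:.5f}, se = {se_p:.5f}")
print("inverse-link limits:", tuple(round(v, 5) for v in ci_inverse_link))
print("symmetric limits:   ", tuple(round(v, 5) for v in ci_symmetric))
```

In this case the upper limit of the symmetric interval exceeds 1, which is not a valid probability, while the limits obtained through the inverse link stay inside the unit interval.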
You can compute π̂, its standard error and confidence interval using the ODS
output statement in GENMOD followed by program statements to implement the inverse
link and Delta Rule. First, you insert the following ODS statement after the ESTIMATE
statements in the GENMOD procedure:
ods output estimates=logit;
Then use the following statements:
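data prob_hat;
   set logit;
   /* inverse logit link: predicted probability */
   phat=exp(estimate)/(1+exp(estimate));
   /* Delta Rule standard error of phat */
   se_phat=phat*(1-phat)*stderr;
   /* confidence limits: inverse link applied to the limits for the logit */
   prb_LcL=exp(LowerCL)/(1+exp(LowerCL));
   prb_UcL=exp(UpperCL)/(1+exp(UpperCL));
run;
proc print data=prob_hat;
run;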
The statements produce Output 10.4.
Output 10.4 PROC PRINT of data set containing π̂, s.e.(π̂), and upper and lower
confidence limits

Obs   Label                Estimate   StdErr   Alpha   LowerCL   UpperCL
  1   logit at 50 deg        3.4348   2.0232    0.05   -0.5307    7.4002
  2   logit at 60 deg        1.1131   1.0259    0.05   -0.8975    3.1238
  3   logit at 64.7 deg      0.0220   0.6576    0.05   -1.2669    1.3109
  4   logit at 64.8 deg     -0.0012   0.6518    0.05   -1.2788    1.2764
  5   logit at 70 deg       -1.2085   0.5953    0.05   -2.3752   -0.0418
  6   logit at 80 deg       -3.5301   1.4140    0.05   -6.3014   -0.7588

Obs   ChiSq   ProbChiSq      phat   se_phat   prb_LcL   prb_UcL
  1    2.88      0.0896   0.96877   0.06121   0.37036   0.99939
  2    1.18      0.2779   0.75271   0.19095   0.28956   0.95786
  3    0.00      0.9733   0.50549   0.16439   0.21978   0.78766
  4    0.00      0.9985   0.49969   0.16296   0.21775   0.78183
  5    4.12      0.0423   0.22997   0.10541   0.08509   0.48955
  6    6.23      0.0125   0.02847   0.03911   0.00183   0.31891

The variables “phat,” “se_phat,” “prb_LcL,” and “prb_UcL” give π̂, its standard error,
and the confidence limits.
Output 10.3 and 10.4 also give chi-square statistics. You can use these to test
H0: η=0 for a given temperature. In categorical data analysis, π/(1 − π) is defined as the
odds and hence η̂ estimates the log of the odds for a given temperature. An odds of 1,
and hence a log odds of 0, means that an event is equally likely to occur or not occur. In
the above output, those temperatures whose χ² and associated p-values (“ProbChiSq”)
result in a failure to reject H0 are temperatures for which there is insufficient evidence to
contradict the hypothesis that there is a 50-50 chance of thermal distress occurring at that
temperature. Whether this hypothesis is useful depends on the context. In many cases, the
confidence limits of π̂ may be important. What is striking in the O-ring data is that the
upper confidence limit for the probability of O-ring thermal distress is fairly high
(considering the consequences of O-ring failure), even at 80°. When the Challenger was
launched, it was 31°.
One final note regarding the odds. The estimated slope β̂1 = -0.2322 is
interpreted as the log odds ratio per one-unit change in X. Thus
$e^{\hat{\beta}_1} = e^{-0.2322} = 0.793$
is the ratio defined as
$\frac{\text{odds at a given temperature}}{\text{odds at that temperature} - 1}$.
An odds ratio less than 1 indicates the odds
of thermal distress decrease as temperature increases.
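The odds ratio interpretation can be verified directly from the fitted equation. In this Python sketch (an illustration using the rounded estimates from Output 10.2; the function name `odds` and the choice of 60° and 61° are arbitrary), the ratio of the odds at two temperatures one degree apart equals e raised to the slope, whatever the starting temperature.

```python
import math

B0, B1 = 15.0429, -0.2322  # rounded estimates from Output 10.2


def odds(temp):
    """Estimated odds of thermal distress: exp(predicted logit)."""
    return math.exp(B0 + B1 * temp)


odds_ratio_1deg = math.exp(B1)               # odds ratio per 1-degree increase
ratio_check = odds(61) / odds(60)            # same quantity from the odds themselves
odds_ratio_10deg = math.exp(10 * B1)         # odds ratio per 10-degree increase

print(f"odds ratio per degree     = {odds_ratio_1deg:.3f}")
print(f"odds(61) / odds(60)       = {ratio_check:.3f}")
print(f"odds ratio per 10 degrees = {odds_ratio_10deg:.4f}")
```

Because the model is linear on the log odds scale, a k-degree increase multiplies the odds by e raised to k times the slope.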
10.2.3 Alternative Logistic Regression Analysis Using 0-1 Data
In the previous section, there was one row in the data set for each temperature
level with a variable for N, the number of observations per level, and one for y, the
number of outcomes with the characteristic of interest. You can also enter binomial data
with one row per observation, with each observation classified by which of the two
possible outcomes was observed. Output 10.5 shows the O-ring data entered in this way.
Output 10.5 O-ring data entered by observation rather than by temperature level

Obs    launch    temp    td
  1         1      66     0
  2         2      70     1
  3         3      69     0
  4         4      68     0
  5         5      67     0
  6         6      72     0
  7         7      73     0
  8         8      70     0
  9         9      57     1
 10        10      63     1
 11        11      70     1
 12        12      78     0
 13        13      67     0
 14        14      53     1
 15        15      67     0
 16        16      75     0
 17        17      70     0
 18        18      81     0
 19        19      76     0
 20        20      79     0
 21        21      75     1
 22        22      76     0
 23        23      58     1

There are three variables for each observation: an identification for the shuttle launch
(LAUNCH), the temperature at the time of launch (TEMP), and an indicator for whether
or not there was thermal distress (TD=0 means no distress, TD=1 means there was
distress).
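The two data layouts carry the same information. As an illustration outside SAS, this Python sketch collapses the 23 launch-level records of Output 10.5 into one record per temperature, which recovers the TD and TOTAL columns of Output 10.1. The variable names are illustrative only.

```python
from collections import defaultdict

# (launch, temp, td) rows from Output 10.5
launches = [
    (1, 66, 0), (2, 70, 1), (3, 69, 0), (4, 68, 0), (5, 67, 0), (6, 72, 0),
    (7, 73, 0), (8, 70, 0), (9, 57, 1), (10, 63, 1), (11, 70, 1), (12, 78, 0),
    (13, 67, 0), (14, 53, 1), (15, 67, 0), (16, 75, 0), (17, 70, 0), (18, 81, 0),
    (19, 76, 0), (20, 79, 0), (21, 75, 1), (22, 76, 0), (23, 58, 1),
]

# temp -> [number with thermal distress, number of launches]
grouped = defaultdict(lambda: [0, 0])
for _launch, temp, td in launches:
    grouped[temp][0] += td
    grouped[temp][1] += 1

print("temp  td  no_td  total")
for temp in sorted(grouped):
    td, total = grouped[temp]
    print(f"{temp:4d}  {td:2d}  {total - td:5d}  {total:5d}")
```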
You can estimate the logistic regression model using the 0-1 data with the
following GENMOD statements:
proc genmod;
   model td=temp / dist=binomial link=logit type1;
run;
These statements differ from the GENMOD program used in the previous section to
obtain Output 10.2. First, the sample proportion y/N, used as the response variable to
compute Output 10.2, is replaced here by TD, the 0-1 variable. Also, because TD is not a
ratio response variable, you must specify DIST=BINOMIAL, or GENMOD will use the
normal distribution. As before, the LINK=LOGIT option is not necessary because the
logit link is the default for the binomial distribution, but it is good form. The results
appear in Output 10.6.
Output 10.6 Results of PROC GENMOD Analysis of 0-1 Form of O-Ring Data

The GENMOD Procedure

Model Information
Data Set               WORK.TBL_5_10
Distribution           Binomial
Link Function          Logit
Dependent Variable     td
Observations Used      23
Probability Modeled    Pr( td = 1 )

Response Profile
Ordered    Ordered
Level      Value      Count
1          0             16
2          1              7

Criteria For Assessing Goodness Of Fit
Criterion             DF        Value    Value/DF
Deviance              21      20.3152      0.9674
Scaled Deviance       21      20.3152      0.9674
Pearson Chi-Square    21      23.1691      1.1033
Scaled Pearson X2     21      23.1691      1.1033
Log Likelihood               -10.1576

Algorithm converged.

Analysis Of Parameter Estimates
                            Standard    Wald 95%
Parameter   DF   Estimate   Error       Confidence Limits     ChiSquare   Pr > ChiSq
Intercept    1    15.0429   7.3786       0.5810    29.5048         4.16       0.0415
temp         1    -0.2322   0.1082      -0.4443    -0.0200         4.60       0.0320
Scale        0     1.0000   0.0000       1.0000     1.0000

NOTE: The scale parameter was held fixed.

LR Statistics For Type 1 Analysis
Source       Deviance   DF   ChiSquare   Pr > ChiSq
Intercept     28.2672
temp          20.3152    1        7.95       0.0048

Compared to Output 10.2, the “Model Information” is in somewhat different form,
reflecting the difference between using individual outcomes of each Bernoulli response
rather than the sample proportion for each temperature level. The goodness-of-fit
statistics, deviance and Pearson χ², are also different because the response variable and
hence the log-likelihood are not the same. Using the data in Output 10.1, there were
N=16 observations, i.e. 16 sample proportions, one per temperature level, and hence the
deviance had N-p=16-2=14 d.f., where p corresponds to the 2 model degrees of freedom
for β0 and β1. Using the data in Output 10.5, there are N=23 distinct observations, and
hence N-p = 23-2 = 21 degrees of freedom for the lack of fit statistics. The deviance and
Pearson χ² are the only statistics affected by whether you use sample proportion data or
0-1 data.
The “Analysis of Parameter Estimates” and likelihood ratio test statistics for the
Type I test of H0: β1=0 are identical to those computed using the sample proportion data.
You can also compute estimated logits for various temperatures using the same
ESTIMATE statements shown previously in Section 10.2.2. The output is identical to
that shown in Output 10.3. Therefore, when you apply the inverse link and Delta Rule,
you use the same program statements and get the same results as those presented in
Output 10.4.
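The two deviances can be reproduced from the same fitted probabilities, which shows that only the saturated reference model changes between the two data layouts. This Python sketch is a hedged cross-check, not GENMOD's algorithm: it uses the rounded estimates, and it assumes the reported log likelihood omits the binomial coefficients (an assumption made here because it reproduces the printed value). The helper name `xlogy` is introduced for this example.

```python
import math

B0, B1 = 15.0429, -0.2322  # rounded estimates (same in Outputs 10.2 and 10.6)

# (temp, td, total) rows from Output 10.1
grouped = [
    (53, 1, 1), (57, 1, 1), (58, 1, 1), (63, 1, 1), (66, 0, 1), (67, 0, 3),
    (68, 0, 1), (69, 0, 1), (70, 2, 4), (72, 0, 1), (73, 0, 1), (75, 1, 2),
    (76, 0, 2), (78, 0, 1), (79, 0, 1), (81, 0, 1),
]


def p_hat(temp):
    """Fitted probability of thermal distress from the logistic model."""
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * temp)))


def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


# Log-likelihood of the fitted model (binomial coefficients omitted)
loglik = sum(xlogy(y, p_hat(t)) + xlogy(n - y, 1.0 - p_hat(t))
             for t, y, n in grouped)

# 0-1 data: the saturated model fits each launch exactly, log-likelihood 0
dev_01 = -2.0 * loglik

# Grouped data: the saturated model fits each sample proportion y/n
loglik_sat = sum(xlogy(y, y / n) + xlogy(n - y, (n - y) / n)
                 for t, y, n in grouped)
dev_grouped = 2.0 * (loglik_sat - loglik)

print(f"log likelihood          = {loglik:.4f}")
print(f"deviance, 0-1 data      = {dev_01:.4f} on {23 - 2} d.f.")
print(f"deviance, grouped data  = {dev_grouped:.4f} on {len(grouped) - 2} d.f.")
```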
10.2.4 An Alternative Link: Probit Regression
As mentioned above in Section 10.2.1, the probit link is another function suitable for
fitting regression and ANOVA models to binomial data. The probit model assumes that
the observed Bernoulli “success” or “failure” results from an underlying, but not directly
observable, normally distributed random variable. Figure 10.1 illustrates the hypothesized
model.
Figure 10.1 Illustration of Model Underlying Probit Link
Denote the underlying, unobservable random variable by Z and suppose that Z is
associated with a predictor variable X according to the linear regression equation,
Z = β0 + β1X. Remember, you cannot observe Z; all you can observe is the
consequences of Z. If Z is below a certain level, you observe a success. Otherwise, you
observe a failure. The regression of Z on X models how the failure-success boundary
changes with X. Figure 10.1 depicts a case for which the boundary, denoted ZX, for a
given X is equal to -1.2. Thus, the area under the normal curve below ZX=-1.2 is the
probability of a success for the corresponding X. As X changes, the boundary value ZX
changes thereby altering the probability of a success.
Formally, the standard normal cumulative distribution function, i.e. the area
under the curve less than z, is denoted
$\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2}\, dx$. Thus, the probit linear
regression model can be written $\pi = \Phi(\beta_0 + \beta_1 X)$. Note that this gives the model in the
form of the inverse link. You can write the probit model in terms of the link function as
$\mathrm{probit}(\pi) = \Phi^{-1}(\pi) = \beta_0 + \beta_1 X$, where $\Phi^{-1}(\pi)$ means the Z value such that the area under
the curve less than Z is π.
You can fit the probit regression model to the O-ring data using the following
SAS statements:
proc genmod data=agr_135;
   model td/total=temp / link=probit type1;
run;
Note the use of the LINK=PROBIT option but no DIST option. Because of the ratio
response variable, the binomial distribution is assumed by default, but a LINK option
is required because the probit link is not the default. The results appear in Output 10.7.
Output 10.7 GENMOD results fitting PROBIT link to O-ring Data

Criteria For Assessing Goodness Of Fit
Criterion             DF        Value    Value/DF
Deviance              14      12.0600      0.8614
Scaled Deviance       14      12.0600      0.8614
Pearson Chi-Square    14      10.9763      0.7840
Scaled Pearson X2     14      10.9763      0.7840

Analysis Of Parameter Estimates
                            Standard    Wald 95%
Parameter   DF   Estimate   Error       Confidence Limits     ChiSquare   Pr > ChiSq
Intercept    1     8.7750   4.0286       0.8790    16.6709         4.74       0.0294
temp         1    -0.1351   0.0584      -0.2495    -0.0206         5.35       0.0207

LR Statistics For Type 1 Analysis
Source       Deviance   DF   ChiSquare   Pr > ChiSq
Intercept     19.9494
temp          12.0600    1        7.89       0.0050

The results are not strikingly different from the results of the logistic regression. The
deviance is 12.060 (vs. 11.997 for the logit link) and the p-value for the likelihood ratio
test of H0: β1=0 is 0.0050 (vs. 0.0048 using the logit link). The estimate of β1 is different,
reflecting a different scale for the probit vs. the logit. However, the sign and conclusion
regarding the effect of temperature on thermal distress are the same.
You can use the ESTIMATE statements, as shown for Output 10.3, to obtain
predicted probits for various temperatures. You use the inverse link, Φ(estimate), to
convert predicted probits to predicted probabilities. The SAS function to evaluate
Φ(estimate) is PROBNORM; you use the following SAS statements to obtain the probit
model analog to Output 10.4:
estimate 'probit at 50 deg' intercept 1 temp 50;
estimate 'probit at 60 deg' intercept 1 temp 60;
estimate 'probit at 64.7 deg' intercept 1 temp 64.7;
estimate 'probit at 64.8 deg' intercept 1 temp 64.8;
estimate 'probit at 70 deg' intercept 1 temp 70;
estimate 'probit at 80 deg' intercept 1 temp 80;
ods output estimates=probit;
run;
data prob_hat;
   set probit;
   phat=probnorm(estimate);
   pi=3.14159;
   invsqrt=1/(sqrt(2*pi));
   se_phat=invsqrt*exp(-0.5*(estimate**2))*stderr;
   prb_LcL=probnorm(LowerCL);
   prb_UcL=probnorm(UpperCL);
run;
proc print data=prob_hat;
run;
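The results appear in Output 10.8. Note the form of the Delta Rule for the probit model
to obtain the approximate standard error of π̂. This follows from the fact that the
approximate standard error of π̂, using the Delta Rule, is
$\frac{\partial \Phi(\eta)}{\partial \eta} \times \mathrm{s.e.}(\hat{\eta})$. The derivative is
$$\frac{\partial \Phi(\eta)}{\partial \eta} = \frac{\partial}{\partial \eta} \int_{-\infty}^{\eta} \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2}\, dx = \frac{1}{\sqrt{2\pi}}\, e^{-\eta^{2}/2}.$$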
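The probit form of the Delta Rule can be checked for the 50° estimate. This Python sketch is an illustration outside SAS: it takes the predicted probit 2.0201, its standard error 1.1413, and the limits -0.2167 and 4.2570 from the PROC PRINT results, and uses the standard-library `statistics.NormalDist` in place of PROBNORM.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution

# 50-degree row of the probit results
eta_hat, se_eta = 2.0201, 1.1413
lower_eta, upper_eta = -0.2167, 4.2570

p_hat = std_normal.cdf(eta_hat)               # inverse probit link
se_p = std_normal.pdf(eta_hat) * se_eta       # Delta Rule: density times s.e.
ci = (std_normal.cdf(lower_eta), std_normal.cdf(upper_eta))

print(f"p_hat = {p_hat:.5f}")
print(f"se    = {se_p:.5f}")
print("limits:", tuple(round(v, 5) for v in ci))
```

The standard normal density at η̂ plays the same role here that π̂(1 − π̂) plays for the logit link.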
Output 10.8 Predicted Probits and Probabilities obtained from PROBNORM Inverse
Link and Probit form of Delta Rule

Obs   Label                 Estimate   StdErr   Alpha   LowerCL   UpperCL   ChiSq   ProbChiSq
  1   probit at 50 deg        2.0201   1.1413    0.05   -0.2167    4.2570    3.13      0.0767
  2   probit at 60 deg        0.6692   0.6024    0.05   -0.5115    1.8498    1.23      0.2666
  3   probit at 64.7 deg      0.0342   0.3960    0.05   -0.7420    0.8104    0.01      0.9312
  4   probit at 64.8 deg      0.0207   0.3925    0.05   -0.7487    0.7901    0.00      0.9579
  5   probit at 70 deg       -0.6818   0.3244    0.05   -1.3175   -0.0461    4.42      0.0356
  6   probit at 80 deg       -2.0328   0.7277    0.05   -3.4590   -0.6066    7.80      0.0052

Obs      phat        pi   invsqrt   se_phat   prb_LcL   prb_UcL
  1   0.97832   3.14159   0.39894   0.05917   0.41421   0.99999
  2   0.74831   3.14159   0.39894   0.19211   0.30450   0.96783
  3   0.51365   3.14159   0.39894   0.15790   0.22905   0.79115
  4   0.50826   3.14159   0.39894   0.15657   0.22703   0.78526
  5   0.24768   3.14159   0.39894   0.10257   0.09383   0.48163
  6   0.02104   3.14159   0.39894   0.03678   0.00027   0.27207

Comparing Output 10.8 to the analogous output for the logistic model in Output 10.4, the
estimated probabilities, approximate standard errors, and lower and upper confidence
limits are similar, though not equal, for the two models. For example, for the logit model,
at 50 degrees the predicted probability of thermal distress was 0.969 with an approximate
standard error of 0.061, whereas for the probit model the predicted probability (PHAT) is
0.978 with an approximate standard error of 0.059. Other “discrepancies” are similarly
small; you reach essentially the same conclusions about the O-ring data with either link
function.
In general, logit and probit models produce similar results. In fact, the logit and
probit are very similar functions of π, so the fact that they produce similar results is not
surprising. There are no compelling statistical reasons to choose one over the other. In
some studies, you use the logistic model because its interpretation in terms of odds ratios
fits the subject matter. In other disciplines, the probit model of the mean has a theoretical
basis, so the probit is preferred.
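To see how close the two fitted curves are over the observed temperature range, the following Python sketch evaluates both estimated equations at each whole degree from 50 to 80. It is an illustration only, based on the rounded estimates from Outputs 10.2 and 10.7; the function names are hypothetical.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()

LOGIT_B0, LOGIT_B1 = 15.0429, -0.2322    # Output 10.2
PROBIT_B0, PROBIT_B1 = 8.7750, -0.1351   # Output 10.7


def p_logit(temp):
    """Predicted probability from the fitted logistic model."""
    return 1.0 / (1.0 + math.exp(-(LOGIT_B0 + LOGIT_B1 * temp)))


def p_probit(temp):
    """Predicted probability from the fitted probit model."""
    return std_normal.cdf(PROBIT_B0 + PROBIT_B1 * temp)


for temp in range(50, 81, 5):
    print(f"temp={temp}  logit model={p_logit(temp):.4f}  "
          f"probit model={p_probit(temp):.4f}")

# Largest disagreement between the two models over 50 to 80 degrees
max_gap = max(abs(p_logit(t) - p_probit(t)) for t in range(50, 81))
# The slopes differ because the two links are measured on different scales
slope_ratio = LOGIT_B1 / PROBIT_B1
print(f"largest gap = {max_gap:.4f}, slope ratio = {slope_ratio:.3f}")
```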