Statistics
PROC LOGISTIC: A FORM OF REGRESSION ANALYSIS
Edith Flaster
Winthrop-University Hospital
1. When do we use Proc Logistic?
Proc Logistic is used when your response variable is
binary; that is, yes or no, present or absent, high or low.
2. SAS's tricky definition of Yes and No.
SAS has succumbed to the epidemiologists on this proc
and used their notation for the response variable as
opposed to the standard statistical notation. SAS keeps
yes as 1, but no must be 2, not 0, the way statisticians do
it. If you use 0 for no, all your signs come out in the
opposite direction, and you will drive yourself nuts trying
to straighten it out after the fact. Now, the 1 and 2 apply
only to the response variable. All your independent
variables should be coded the way we usually do: 1 for
high, 0 for low, 1 for present, 0 for absent.
3. Comparison of logistic and ordinary regression
a. Response variables and assumptions
In logistic regression, the response variable is binary, as
stated above, while in ordinary least squares regression
the response variable is continuous and assumed to have
a normal distribution.
b. Independent or explanatory variables
The independent variables can be either continuous or
binary for both methods. If you have a categorical
variable with more than two levels, you work with a
series of dummy variables in both methods.
c. Methods of calculation
Ordinary regression is known as least squares regression,
because the equations are solved by the method of least
squares. Logistic regression is solved by maximum
likelihood methods. The two methods do not necessarily
give the same solution. (I tried it with a small data set
and got different solutions, but a NYASUG friend
reported that he usually gets similar solutions with his
very large data sets.)
d. Meaning of coefficients (slopes vs relative risk)
For least squares regressions the coefficients are
interpreted as slopes or change in response per unit
change in the explanatory variable; that is, if your
equation is:
Weight (lbs) = 306 + 3 (Height, in)
then weight should increase 3 lbs for each inch in height.
In this equation, 306 is the intercept, and 3 is the slope or
regression coefficient.
For logistic regressions, the coefficients are interpreted as
the natural logarithm (ln) of the odds ratio when the
independent variable has only two levels. When the
independent variable is continuous, the coefficient is the
ln of the odds ratio per unit change in the independent
variable. In order to get the odds ratio, you must
exponentiate the coefficient.
For example, with an equation of:
Disease = .5 + 1.1 (decade of age),
the odds ratio is exp(1.1) or 3.0, meaning that your odds
of getting the disease triple for each 10 years that
you age.
4. Logistic regression
a. Calculation of outcome probabilities
Using logistic regression, we can calculate for each
person or row in the table the probability of a positive
outcome. The response variable that is fit in logistic
regression is called the logit, which is expressed as
ln(p/(1-p)), where p is the fractional probability of a
positive response. Using some arithmetic, which I'll go
into at the session if someone wants to know, we can
solve for p. If we let Z stand for the right hand side of
our equation:
ln(p/(1-p)) = b0 + b1X1 + b2X2 + ...,
then
p = 1/(1 + exp(-Z)).
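The arithmetic is short: exponentiating both sides of ln(p/(1-p)) = Z gives p/(1-p) = exp(Z), so p = exp(Z)/(1 + exp(Z)), which is the same as 1/(1 + exp(-Z)). As a minimal sketch, the data step below checks numerically that the logit and this formula undo each other. The data set name LOGIT_CHECK and the variable names are made up for illustration.

```sas
/* Illustrative check only: data set and variable names are hypothetical. */
data logit_check;
   do p = 0.1, 0.25, 0.5, 0.75, 0.9;
      z      = log(p / (1 - p));     /* logit: natural log of the odds */
      p_back = 1 / (1 + exp(-z));    /* inverse logit should recover p */
      output;
   end;
run;

proc print data=logit_check noobs;
run;
```

In the printed table, P_BACK should agree with P in every row, and Z should be 0 when P is 0.5.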
b. Using Proc Logistic
Proc Logistic is similar in form to both Proc Reg and
Proc GLM. That is, you use a model statement, and can
use by, output, and weight statements. There are many
other options and various diagnostic procedures, which
you can examine in the manual. Several are shown in the
5 examples given.
c. Example from Applied Logistic Regression by
Hosmer and Lemeshow. (1989, Wiley)
We will go over a simple example from an excellent text
on applied logistic regression. The first example in the
text is a table showing age, and the presence or absence
of coronary heart disease (CHD). Using the simplest SAS
code, we first input the data:
Data Raw;
Input Age CHD;
If CHD = 0 then CHD = 2;
Cards;
20 0
23 0
You can see that the coding was changed so that presence
of CHD remains at 1, but absence is coded as 2.
We then request the model:
Proc Logistic;
Model CHD = Age;
Run;
The output is shown at the end of this paper. You will be
happy to know that SAS gets the same results as the text.
The response profile near the top of the output serves as a
check that your response variable has been coded as 1 and
2, and shows how many of each code there were.
The simple statistics show the usual mean, std deviation,
and minimum and maximum. It's always a good idea to
look at these, to protect yourself against outliers, whether
real or errors.
The criteria for assessing model fit follow next. The
most commonly used are the -2 LOG L, where L is the
likelihood, and the Score statistic. Both of these are
distributed as chi-square, with the degrees of freedom
corresponding to the number of explanatory variables in
the model, and the usual p-value is printed. The AIC
stands for the Akaike Information Criterion and SC stands
for the Schwarz Criterion. These are primarily used for
comparing different models for the same data. Lower
values are better.
We finally arrive at the results, listed under the analysis
of maximum likelihood estimates. It is a very similar
table to those that appear in Proc Reg, or Proc GLM.
The first column lists either the intercept or the coefficient
of the listed variable. The column labelled parameter
estimate gives the corresponding numerical values.
The coefficient for age is 0.111, with a standard error of
0.024. This value is statistically significant as shown in
the column which is second from the right. As we said
before, the odds ratio is exp(0.111) or 1.117. This may
be interpreted as the odds that a person will have
coronary heart disease increasing about 12% for every year
of age. If you want to calculate the probability of having
heart disease at 30, for instance, you substitute in the
equation we talked about before:
p = 1/(1+exp(-(-5.31+.111*Age)))
If we do the calculations, we find that the chance of
having CHD at 30 is 0.12 or 12%. At 90, it is 0.99 or
99%. At 100, it is 0.997 or 99.7%. With a logistic
model, the probability will never exceed 100%. These
chances seem very high.
According to the brief
description in the book, these were people selected to
participate in a study. They may be a sample of people
who came to the hospital because of chest pain. I doubt
that they're a sample of a normal population.
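The probability calculations quoted above can be sketched in a short data step. This is only an illustration: it takes the intercept (-5.3095) and age coefficient (0.1109) from the output listing, and the data set name CHD_PROB and variable names are hypothetical.

```sas
/* Sketch: plug the fitted coefficients into p = 1/(1 + exp(-Z)).   */
/* CHD_PROB and its variable names are hypothetical.                */
data chd_prob;
   b0 = -5.3095;                     /* intercept from the listing      */
   b1 =  0.1109;                     /* coefficient for Age             */
   odds_ratio = exp(b1);             /* odds ratio per year of age      */
   do age = 30, 90, 100;
      z = b0 + b1 * age;             /* right hand side of the equation */
      p = 1 / (1 + exp(-z));         /* predicted probability of CHD    */
      output;
   end;
   format odds_ratio p 6.3;
run;

proc print data=chd_prob noobs;
   var age z p odds_ratio;
run;
```

The probabilities come out at about 0.12, 0.99 and 0.997, and the odds ratio at about 1.117. Note that the ages in the data run only from 20 to 69, so the values at 90 and 100 are extrapolations beyond the sample.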
The last table in the output gives four rank correlation
indices and the number of pairs with different responses.
These statistics assess the predictive ability of the model.
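As a rough sketch, the four indices can be recomputed from the concordant and discordant percentages in the listing. The formulas below are the usual definitions of these indices as I understand them; the percentages are rounded in the printout, so the results agree with the listing only to about three decimals. The data set name ASSOC and the variable names are hypothetical.

```sas
/* Sketch only: ASSOC and its variable names are hypothetical.        */
data assoc;
   n     = 100;                      /* observations                    */
   pairs = 2451;                     /* pairs with different responses  */
                                     /* (43 with CHD * 57 without)      */
   nc = 0.790 * pairs;               /* concordant, from printed 79.0%  */
   nd = 0.190 * pairs;               /* discordant, from printed 19.0%  */
   nt = pairs - nc - nd;             /* tied pairs                      */
   somers_d   = (nc - nd) / pairs;
   gamma_stat = (nc - nd) / (nc + nd);
   tau_a      = (nc - nd) / (0.5 * n * (n - 1));
   c_stat     = (nc + 0.5 * nt) / pairs;
run;

proc print data=assoc noobs;
   var somers_d gamma_stat tau_a c_stat;
   format somers_d gamma_stat tau_a c_stat 6.3;
run;
```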
Proc Logistic is a very rich procedure, with regression
diagnostics, many other options, and the ability to tackle
a wide variety of problems. I hope that this talk has
enabled you to get a feel for the basics of the procedure.
SAS
The LOGISTIC Procedure

Data Set: WORK.RAW
Response Variable: CHD
Response Levels: 2
Number of Observations: 100
Link Function: Logit

Response Profile

Ordered
  Value    CHD    Count
      1      1       43
      2      2       57

Simple Statistics for Explanatory Variables

                          Standard
Variable       Mean      Deviation    Minimum    Maximum
AGE       44.380000      11.721327    20.0000    69.0000

Criteria for Assessing Model Fit

                          Intercept
             Intercept       and
Criterion      Only       Covariates    Chi-Square for Covariates
AIC           138.663      111.353
SC            141.268      116.563
-2 LOG L      136.663      107.353      29.310 with 1 DF (p=0.0001)
Score                                   26.399 with 1 DF (p=0.0001)

Analysis of Maximum Likelihood Estimates

            Parameter   Standard      Wald         Pr >      Standardized
Variable    Estimate     Error     Chi-Square   Chi-Square     Estimate
INTERCPT    -5.3095      1.1337     21.9350       0.0001
AGE          0.1109      0.0241     21.2541       0.0001       0.716806

Association of Predicted Probabilities and Observed Responses

Concordant = 79.0%       Somers' D = 0.600
Discordant = 19.0%       Gamma     = 0.612
Tied       =  2.0%       Tau-a     = 0.297
(2451 pairs)             c         = 0.800
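The fit criteria in the listing are related to one another by simple arithmetic, which the sketch below reproduces. It assumes the standard definitions AIC = -2 LOG L + 2k and SC = -2 LOG L + k*ln(n), where k counts the estimated parameters (including the intercept) and n is the number of observations. The data set name FIT_CHECK and the variable names are hypothetical.

```sas
/* Sketch only: FIT_CHECK and its variable names are hypothetical.    */
data fit_check;
   n      = 100;                     /* number of observations          */
   m2ll_0 = 136.663;                 /* -2 LOG L, intercept only        */
   m2ll_1 = 107.353;                 /* -2 LOG L, intercept and Age     */
   aic_0  = m2ll_0 + 2 * 1;          /* k = 1: intercept                */
   aic_1  = m2ll_1 + 2 * 2;          /* k = 2: intercept and Age        */
   sc_0   = m2ll_0 + 1 * log(n);
   sc_1   = m2ll_1 + 2 * log(n);
   lr_chisq = m2ll_0 - m2ll_1;       /* chi-square for covariates, 1 DF */
   p_value  = 1 - probchi(lr_chisq, 1);
   wald_int = (-5.3095 / 1.1337) ** 2;   /* (estimate / std error)**2   */
   wald_age = ( 0.1109 / 0.0241) ** 2;
run;

proc print data=fit_check noobs;
run;
```

The AIC, SC and chi-square values match the listing (138.663, 111.353, 141.268, 116.563 and 29.310). The Wald value for age comes out slightly below the printed 21.2541 because the estimate and standard error are rounded in the listing, and the p-value is far smaller than the 0.0001 shown there.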