Categorical Data Analysis: Because there really is no fate worse than death
What do birth, death, high school dropout, failing a course, losing a sale, engineering majors and
being fired all have in common? If you answered, “Things my mother warned me about”, you need
therapy. The correct answer is all of these are categorical variables, where data pretty much DO fit
in neat little boxes. You were either admitted to college or you weren’t. The customer either bought
insurance from you or opened the door and let his pit bull chase you down the street. The voter
checked the box for the Democratic, Republican or Independent candidate. You get the idea.
Today, we’ll start with a brief discussion of simple statistics that are part of the FREQ procedure and
SAS/GRAPH and then dive into logistic regression using PROC LOGISTIC.
While the plain vanilla statistics you learn in introductory courses, like correlation, t-tests and
regression, work well when you have a continuous dependent variable, say age at first pregnancy,
IQ scores or birthweight, they are not the best choice for categorical data.
For the example today, we’re going to use the data set on the oldest old from Kaiser-Permanente
and one of the most definite of categories - death. The data set in question consisted of two
cohorts. Data on one cohort was collected from 1971-1979; data on the second was collected from
1980-1988. The overall question we’re interested in is what predicts who died over the nine-year
period of study.
Let’s start with the simplest procedure, PROC FREQ.
PROC FREQ DATA=in.old ;
TABLES dthflag ;
RUN ;
DTHFLAG    Frequency    Percent    Cumulative Frequency    Cumulative Percent
0              2670       44.60            2670                  44.60
1              3316       55.40            5986                 100.00
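If you want to check the arithmetic behind that Percent column, it is just each frequency divided by the total. A couple of lines outside SAS (Python here, purely for illustration) confirm it:

```python
# The Percent column is just frequency over the grand total, times 100
died, total = 3316, 5986
pct_died = 100 * died / total
print(round(pct_died, 2))  # 55.4
```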
Not too interesting, except that it tells us that 55.4% of our sample died. Let’s take a look at these
data graphically for a minute. One of the wonderful capabilities of higher speed computers
(compared to my ancient graduate school days) is the ability to quickly and easily visualize
relationships. Some things that I suspect are going to be related to death are gender, whether or
not you are in a nursing home, age at the beginning of the study and how sick you are in general.
So, my next task is to do a stacked bar chart of death by gender with SAS Enterprise Guide. To do
this, I open my dataset in SAS Enterprise Guide and then from the TASKS menu select GRAPH, then
BAR CHART then STACKED BAR CHART.
Next, click on the DATA tab. Drag GENDER under Column Chart and DTHFLAG under Stack.
Leave the rest of the tabs as defaults and skip on down to ADVANCED. Select percentage as the
Statistic used to calculate the bar. Under Specify one statistical value to show for bars select
Percentage. Underneath that, click the box next to Display statistical value inside the bar.
Looking at our bar chart we can easily see a couple of things. There are approximately the same
number of women as men, and men are much more likely to have died than women.
If you were dying for the syntax for this instead, here it is:
PROC GCHART DATA=mydata.oldpeople ;
VBAR gender / SUBGROUP=dthflag FRAME TYPE=PCT INSIDE=PCT ;
RUN ;
Figure 1
Let’s take a look at this for a minute ... just eyeballing it, the odds of a woman dying versus not dying
are about 1.0, that is, 24.2 vs. 26.0. On the other hand, the odds of a man dying versus not dying
are 31.2 vs. 18.6 - considerably greater than 1.
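Those eyeballed odds are just each group’s “died” percentage divided by its “not died” percentage. A quick sketch in Python (not SAS, just checking the arithmetic read off the chart):

```python
# Percentages read off the stacked bar chart (each is a percent of the whole sample)
women_died, women_lived = 24.2, 26.0
men_died, men_lived = 31.2, 18.6

women_odds = women_died / women_lived   # close to even odds
men_odds = men_died / men_lived         # well above 1
print(round(women_odds, 2), round(men_odds, 2))
```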
I’d like to look at the exact same picture but comparing by nursing home status.
In this case, all I need to do is click on MODIFY TASK at the top of the results and drag off the
Gender variable and then drag NURSEHOME under Column to Chart. Click RUN.
(Using the syntax, you simply change the VBAR statement to use “nursehome” instead of “gender”
as the variable to chart.)
Figure 2
It’s pretty clear from this chart that people who are in a nursing home are a much smaller
percentage of the population - 21% than those who are not, and also that they are much more
likely to die.
Again, if we look at the odds of dying in a nursing home, it is 6 to 1 (that is, 18% divided by 3%).
Your odds of dying if you are NOT in a nursing home are less than even - 37.4% vs. 41.5%.
You’re more likely to live than to die.
The reason I start with these figures is that they are going to be much more meaningful to your
typical audience than what we’ll get into later.
Let’s take a closer look at our nursing home analysis. Whenever you are looking at percentages you
ought to be asking yourself “Percentage of WHAT?” and “How many died?” So now that we have piqued
our audience’s interest with some pictures, we’re going to dig into tables a bit deeper.
ASSOCIATION MEASURES
First, let’s take a look at a bi-variate analysis with a two by two table, looking at the probability of
death for people who are and are not in nursing homes. You can easily do this with either SAS
Enterprise Guide, or by writing the syntax. The EG method is as follows:
Select TASKS > DESCRIBE > TABLE ANALYSIS
When the window first pops up, drag the two variables you want under table variables. (If you don’t
see a window like this, click on the DATA option).
Click on the TABLES tab and drag the two variables from the left pane, Variables permitted in table,
to where you want them to appear in the table. In this case, I want “nursehome” to be my column
variable and “dthflag” to be my row variable.
Next, under the TABLE STATISTICS tab, click on Association and click the boxes next to Chi-square
tests and Measures as shown.
Now, the syntax for this is pretty simple to write, as shown below, but I wanted to use these windows
just to illustrate how many options there are for statistical tests just with PROC FREQ. In addition to
all of the tests of association listed in this window, under the other tabs you’ll find tests of
agreement, like McNemar’s and Kappa, tests for ordered differences, trends and more. Take my
advice and take a few minutes to explore this when you have time. If you don’t use Enterprise
Guide, you can see the same options in the PROC FREQ section in the SAS documentation. Read all
about it, as the newsboy says. (Do they still have newsboys?)
And here is the syntax:
PROC FREQ DATA = mydata.oldpeople ;
TABLES dthflag*nursehome /
NOROW NOPERCENT NOCUM
CHISQ MEASURES ;
RUN ;
Our first statement invokes the procedure and names the data set to be used. Nothing really to see
there. The next statement requests the table, with row variable first and then column variable,
followed by a ‘/’ that denotes options are to follow.
Our first set of options simply tells SAS what NOT to print by default. If you look at tables of
numbers all day you might find the standard output pretty obvious. For many people, that’s not the
case, so this
NOROW NOPERCENT NOCUM
tells SAS not to print the row percentages, the percentage of the total in each cell, or the cumulative
frequencies and percentages. So, we get a little different look than the graphs. This gives us the
conditional (column) probabilities.
Table 2: Mortality Rate by Nursing Home Placement
Given the condition that a person is NOT in a nursing home, what is the probability he or she will
have died (DTHFLAG = 1) and the answer is 47%. What about for people who are in a nursing
home? The probability of them dying is 85%. Also, unlike our bar chart, the table also gives us the
number of people who died - 1,077 - so this isn’t based on an insignificant number. In fact, only
184 people are still alive at the end of the study, out of those who have been in a nursing home.
Statistics for Table of Nursing Home by Mortality
Statistic                       DF    Value       Prob
Chi-Square                       1    582.3733    <.0001
Likelihood Ratio Chi-Square      1    643.1440    <.0001
Continuity Adj. Chi-Square       1    580.8355    <.0001
Mantel-Haenszel Chi-Square       1    582.2760    <.0001
Phi Coefficient                        0.3119
Contingency Coefficient                0.2978
Cramer's V                             0.3119
You can see that we get four different types of chi-square values and that they are all very similar,
which is what we expect. They are all also very large and very statistically significant, again,
something we could have guessed looking at the table. Also, in case you don’t know, the lowest
probability that SAS prints is <.0001; the probability of getting a chi-square this high is MUCH lower
than one in ten thousand.
Being able to find SPSS in the start menu does not qualify you to perform a multinomial logistic
regression.
I don’t know the real name of the person who has that as their signature on the Chronicle of Higher
Education forum, but truer words were never spoken - um, typed.
Seriously, if you are like most normal people, you’ve noted that all the chi-square values are pretty
much the same, and you report the first one because, hey, it came first. Maybe you secretly
suspect that there is some difference you should know about, and you worry that someone
may ask you some day and you really don’t know.
Don’t pretend you know - I was a statistical consultant for 20 years before I looked it up for a
graduate course I taught. So now I will tell you, too, and you can worry no more.
The options and what they tell you.
CHISQ
Random fact for Trivial Pursuit - according to Agresti and Finlay, anyway, the Pearson chi-square is
the oldest statistical test in use today. That is believable because it is pretty easy to compute by
hand: for each cell, you take the number you would expect if the data truly were independent,
subtract it from the number you actually observed, square the difference and divide by the number
expected, then sum across all the cells.
χ² = Σ (fo - fe)² / fe
The likelihood ratio chi-square is a lot less easy to compute as it involves taking the log of the ratio
between the observed and expected, not that it matters any more when you’re having computers
do the computation.
All of the first three chi-square values test the same null hypothesis, that of no relationship between
the row and column variables, that is, it is a test for independence.
It does NOT tell you how strong a relationship is. It merely tells you how unlikely the null
hypothesis of NO relationship is.
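If you want to see where those first two chi-square values come from, here is a sketch in Python (not SAS, just checking the arithmetic) that applies the Pearson formula above and the likelihood-ratio formula, 2 Σ O·ln(O/E), to the cell counts from the nursing home table in the output:

```python
from math import log

# Observed 2 x 2 table (death by nursing home) from the output above
obs = [[2486, 184],    # survived: not in a home, in a home
       [2239, 1077]]   # died:     not in a home, in a home

row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]
n = sum(row)

# Expected count in each cell if row and column were truly independent
expected = [[row[i] * col[j] / n for j in range(2)] for i in range(2)]

# Pearson: sum of (observed - expected)^2 / expected over all cells
pearson = sum((obs[i][j] - expected[i][j]) ** 2 / expected[i][j]
              for i in range(2) for j in range(2))
# Likelihood ratio: 2 * sum of observed * ln(observed / expected)
likelihood_ratio = 2 * sum(obs[i][j] * log(obs[i][j] / expected[i][j])
                           for i in range(2) for j in range(2))
print(round(pearson, 2), round(likelihood_ratio, 2))
```

The two values land on the 582.37 and 643.14 that SAS printed, which is reassuring.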
The Mantel-Haenszel chi-square we’ll get to later, so just hold that thought, which for most people is
“What the hell is a Mantel-Haenszel chi-square?”
What is a Fisher’s Exact Test? How do you get one? Why don’t I have one?
In some settings, it’s common to be asked to do a Fisher’s Exact Test. The usual reason is that the
expected cell counts are very small. You may have seen this warning on your SAS output:
“One or more cells have an expected value of less than 5. Chi-square may not be a valid test.”
Well, great, what do you do then? The common response is that you should collect more data.
However, where I’ve personally seen this most often is in health outcome studies, where a group of
patients at a single clinic gets one treatment or another and we’re looking at whether they died or
not. Telling the physician,
“Well, you see, what you really need to do to make this a valid statistical test is to kill off a few
more patients”
isn’t the sort of thing that is going to win you a lot of consulting clients.
So, when your poor physician asks what he or she can do in this situation, the answer is that
you can do a Fisher’s Exact Test, where you (actually, SAS) compute the probability of a table as
unusual as the one you obtained under the null hypothesis of no relationship.
If you have a 2 x 2 table and use the CHISQ option, SAS automatically gives you a Fisher Exact Test
as well as the other chi-square values. You don’t have to do anything.
With Nursing Home Residence by Death, I get the following table:
As before, I reject the null hypothesis of no relationship, but in this case it tells me that the
probability of this table, if there is truly no relationship in the population, is 5.07 × 10^-142. In
other words, my exact probability has about 140 zeroes after that decimal place and before the 5.
This is quite helpful when testifying as an expert witness, to be able to say that the probability is not
merely less than the 1 in 10,000 the standard chi-square value gives you, but in fact less than one in
a googol (a googol being the number 1 followed by 100 zeroes).
So, Fisher’s exact test is helpful in two instances. One is when you have a small sample size and
chi-square tests are of questionable validity. The second is when, regardless of sample size, you
want an exact probability computed by SAS.
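Under the hood, the exact test sums hypergeometric table probabilities. A rough sketch in Python (this is the textbook definition, not necessarily how SAS implements it, and it uses logs of factorials to keep the huge numbers manageable) reproduces the idea on our table:

```python
from math import lgamma, exp

# 2 x 2 table: rows = survived / died, cols = not in a home / in a home
a, b, c, d = 2486, 184, 2239, 1077
row1, row2 = a + b, c + d
col2 = b + d                 # nursing-home column total
n = row1 + row2

def log_comb(m, k):
    # log of "m choose k" via log-gamma, to avoid astronomically large integers
    return lgamma(m + 1) - lgamma(k + 1) - lgamma(m - k + 1)

def table_prob(k):
    # Hypergeometric probability of k survivors among nursing-home residents
    return exp(log_comb(row1, k) + log_comb(row2, col2 - k) - log_comb(n, col2))

p_obs = table_prob(b)
# Two-sided exact p: total probability of every table at least as unlikely as ours
p = sum(table_prob(k) for k in range(col2 + 1) if table_prob(k) <= p_obs)
print(p)  # far below one in a googol (1e-100)
```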
So .... the one little sort-of word, CHISQ gets you at least six statistics and, in the case of 2 x 2
tables, even more than that.
Remember that second option we included, MEASURES?
To compute the odds ratio by hand, you could divide the frequency in row 1, column 1 by the
frequency in row 1, column 2:
2,486/184 = 13.51 -- the odds of a person who lived not being in a nursing home versus being in
a home.
Then you can divide the frequency in row 2, column 1 by the frequency in row 2, column 2:
2,239/1,077 = 2.08
Then you can divide the odds obtained for the first row by those for the second:
13.51/2.08 = 6.49
OR, you could just look at the table produced when you use the measures option - it produces all
types of other statistics, as shown in the next two tables. Here are our estimates of relative risk and
Odds Ratio
So, first of all, the odds someone who survived nine more years lived at home, versus in a nursing
home, are 6.5 times the odds of someone who died during that period.
People who died are a lot more likely to have lived in a nursing home. It would be nice to have a
test of statistical significance, wouldn’t it? Conveniently, this same option gives you 95% confidence
intervals. Since 1.0 does NOT fall within these confidence limits, you can safely say that the odds
are significantly higher that a person who died had lived in a nursing home rather than at home.
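For the curious, those confidence limits come from the log odds ratio plus or minus 1.96 standard errors, then exponentiated. A sketch in Python using the cell counts from our table (this is the standard Wald interval from textbooks, which may differ slightly in the last decimal from SAS’s output):

```python
from math import exp, log, sqrt

a, b = 2486, 184     # survived: not in a home, in a home
c, d = 2239, 1077    # died:     not in a home, in a home

odds_ratio = (a / b) / (c / d)                 # same 6.49 we computed by hand
se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)       # SE of the log odds ratio
lower = exp(log(odds_ratio) - 1.96 * se)
upper = exp(log(odds_ratio) + 1.96 * se)
print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

Since the lower limit is well above 1.0, the odds ratio is significantly different from even odds.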
Okay, now it’s later ...
The Mantel-Haenszel chi-square
The Mantel-Haenszel chi-square is a test of an ORDINAL relationship. This chi-square value, then,
tests a particular type of relationship. If you only have two categories for each variable, as in this
example, a test for an ordered relationship is exactly the same as a test of independence: you’re
testing whether 0 is different from 1. So, in our comparison of nursing homes by death, it is
practically identical to the Pearson chi-square. That’s not always going to be the case, though.
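You can verify that near-identity yourself: for a 2 x 2 table, the Mantel-Haenszel statistic works out to (N-1)/N times the Pearson chi-square. A quick check in Python against the values in the output above:

```python
# Pearson chi-square for the 2 x 2 death-by-nursing-home table,
# then the Mantel-Haenszel value, which for a 2 x 2 table equals
# (N - 1)/N times the Pearson value
obs = [[2486, 184], [2239, 1077]]
row = [sum(r) for r in obs]
col = [sum(c) for c in zip(*obs)]
n = sum(row)
pearson = sum((obs[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
              for i in range(2) for j in range(2))
mh = (n - 1) / n * pearson
print(round(pearson, 4), round(mh, 4))  # 582.37 vs. 582.28, as in the output
```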
Take a look at this example, using our same dataset. I broke visits to the emergency room down
into five categories - none in the nine-year study period, 1-5, 6-10, 11-15 and over 15.
You can see the cross-tabulation here...
As long as we’re looking at this table, let’s take a look at the phi coefficient. You see, the chi-square
value is a test of whether or not there is a relationship, but not the size of the relationship. For that,
we can use our friend the Phi coefficient, which is interpreted just like any other correlation
coefficient. We can compare the .18 in the table above to the previous phi coefficient for nursing
home by death of .31 and see that there is a stronger relationship between being in a nursing home
and death than between number of emergency room visits and death.
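For a 2 x 2 table, phi is simply the square root of chi-square divided by N (with the sign set by the direction of the association). A one-line check in Python against the nursing home output above:

```python
from math import sqrt

# Phi for a 2 x 2 table: sqrt(chi-square / N), using the values SAS printed
chi_square, n = 582.3733, 5986
phi = sqrt(chi_square / n)
print(round(phi, 4))  # 0.3119, matching the SAS output
```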
And in the table for emergency room visits by death, it is clear that there is a substantial difference
between the Mantel-Haenszel and the other chi-square values.
Do NOT just compare the chi-square values and say, “the Mantel-Haenszel is smaller, therefore
there is less support for the hypothesis of an ordinal relationship than of independence”. That’s not
necessarily true. Notice that the Mantel-Haenszel has fewer degrees of freedom, so the chi-square
distribution the obtained value is being compared to is different.
I even brought a stuffed chi-square distribution here for you to compare it to with varying degrees
of freedom. (Who actually owns a stuffed chi-square distribution?)
One last point about PROC FREQ before moving on ... by now you are getting the idea that this
simple frequency procedure can produce MUCH more than just one and two-way frequency tables.
Just as it can give you a chi-square test that tests the HYPOTHESIS of a linear ordinal relationship,
PROC FREQ also produces different measures of the strength of the relationship. As with the
Mantel-Haenszel chi-square, these are all going to be the same if you have two dichotomous
variables, so let’s look at our table for death by category of the number of emergency room
visits.
Gamma, the tau statistics and the Spearman coefficient are all common measures of rank-order
correlation.
We don’t have time to discuss every type of correlation and chi-square, but before we leave this
subject, I do want to point out three things.
1. Different types of chi-square values, different types of correlations and other measures like odds
ratios do exist.
2. These statistics are very easy to obtain using SAS.
3. While most times, all of these measures will point you in the direction of the same general
conclusion, there are times when one is preferable to the others.
The phi coefficient is based on one hypothesis - that the frequency across columns is not conditional
on what row you are in. The rank order correlations are based on a different hypothesis, and the
Pearson on yet another.
Maybe you are interested in not only knowing whether emergency room visits are related to death,
but also whether the people who had more, say over 15 within nine years (the last category), are more
likely to die than people who had 11-15, who are in turn more likely to die than those with 6-10 and
so on. In this example, you do see a linear relationship as well as a significant relationship when
you do a simple test of independence.
That’s not always the case, though. Many years ago, I was part of a group doing a study on
kindergarten children and their friendships. Contrary to our expectations, we found that having one
friend versus no friends was a factor in many outcome variables, but there was no linear nor ordinal
relationship. That is, the children who had 2, 3 or 4 friends weren’t any different from the children
who had only one.
Even in this case, where you do see an ordinal relationship, it’s not nearly as strong as I had
expected.
So, take away point - you may be interested in non-standard types of statistics, and if so, there they
are, hidden away in PROC FREQ options.
Before we go on to logistic regression, let’s take a four-way comparison here.
We’ve already seen that people are more likely to die in nursing homes but maybe it’s because they
are older. So, my next step is to do table analysis of nursing home by gender by death by mean
age. Since women are less likely to die, I’m going to look at each gender separately just in case
there are more men in nursing homes and that explains the difference.
I can do this in SAS Enterprise Guide by using the Summary Table Analysis task, or I could write the
following code:
PROC TABULATE DATA=mydata.oldpeople ;
VAR age_comp ;
CLASS dthflag gender nursehome ;
TABLE nursehome*gender*age_comp*MEAN, dthflag ;
RUN ;
Here is the same code again, annotated statement by statement:
PROC TABULATE DATA=mydata.oldpeople ;
*** Starts the tabulate procedure, names the data set to be used ;
VAR AGE_COMP;
**** This is the list of any continuous, numeric variables;
**** I only have one, age_comp ;
CLASS DTHFLAG gender nursehome;
**** The CLASS statement lists classification variables;
**** Variables you want to use as categories go in the CLASS statement ;
TABLE nursehome*gender*age_comp*MEAN, dthflag ;
***** This statement specifies the table. The variables before the comma are your row variables ;
Using an “*” between variables will cross those categories. So, this analysis will break down
those in nursing homes by gender. Using the “*” followed by a statistic name will request the computed
statistic for that variable. In this case, you will get the mean of age, by nursing home status and
gender. You’ll have two columns, one that shows the mean for those who lived (dthflag = 0) and a
second column showing the mean for those who died (dthflag = 1).
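If it helps to see the crossing logic outside SAS, here is a minimal sketch in Python that does the same breakdown - group means of age by crossed categories - on a few made-up records (the data here are hypothetical, purely to show the mechanics):

```python
from collections import defaultdict

# Hypothetical records: (nursehome, gender, dthflag, age_comp) - illustrative only
rows = [
    ("NO",  "F", 0, 74.0), ("NO",  "F", 1, 77.5),
    ("NO",  "M", 0, 73.0), ("NO",  "M", 1, 76.0),
    ("YES", "F", 1, 80.5), ("YES", "F", 1, 81.5),
]

sums = defaultdict(lambda: [0.0, 0])
for home, gender, died, age in rows:
    key = (home, gender, died)     # crossed categories, like nursehome*gender by dthflag
    sums[key][0] += age
    sums[key][1] += 1

# Mean age within each crossed cell
means = {k: total / count for k, (total, count) in sums.items()}
print(means[("YES", "F", 1)])  # 81.0
```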
Here are our results. It seems that people who are in a nursing home die at an older age than those
who are not. So, contrary to nursing homes killing you, maybe they make you live longer. Just
kidding. The AGE_COMP variable gives you age at the beginning of the nine-year period. So, clearly,
people in nursing homes were about three years older than those not in nursing homes. When you
are talking about the difference between 77 years old and 80 years old, that could have a significant
effect on the likelihood of mortality over the next nine years.
LOGISTIC REGRESSION
So, enough of that! We’ve looked at variables in two, dichotomous categories, in two ordered
categories, and now we’ve taken a look at four variables simultaneously from a descriptive point of
view. In some ways it gave us answers but in others it only led to more questions. Let’s move on to
an inferential test with multiple predictor variables.
TASKS > REGRESSION > LOGISTIC
Drag variables under Dependent, Quantitative and Classification as desired. In this case,
DTHFLAG is our dependent variable. The two quantitative variables are AGE_COMP, which is simply
the age at the beginning of the study, and ERVisits, which, as you might guess, is the number of
emergency room visits.
We also have two categorical predictors, gender and nursing home status. Your window should look
like this when you have finished selecting the variables.
Under Model tab, click Response and for Fit model to level, pull down and select 1.
It just makes more sense to model who died than who lived. It does to me, anyway.
Still under the Model tab, click on Effects. Shift-click on the first and last variable in the left window
pane to select all of them. Click on the MAIN button to select all of these as main effects. I don’t
have any hypothesized crossed or nested effects in this model.
Now, just click RUN.
INTERPRETING THE RESULTS
The first table simply tells you the data set used, the dependent variable, number of levels and type
of model. Since these are all exactly what I expect - dthflag as dependent, two levels - alive or
dead, a binary logit model - I just move right along.
Really look at these tables! Do not be that person who skips to the end and looks for statistical
significance. Sometimes with my classes I’ll include some variable where 90% or more of the people
did not answer, and see how many of the students pick up on the fact that their conclusions are
based only on the 100 people out of 5,000 who had complete data.
In this case, we’re fine, so we move along.
Probability modeled .... If you are new to using the SAS logistic procedure and you don’t pay
attention to this little detail here, you can really go off the rails.
By default, SAS models the lower value, so you would be predicting who lived. That’s kind of
counter-intuitive, so recall that above I changed from the default to Fit Model to Level = 1. So, we
are predicting who died.
Let’s skip over the Class Level table until later and look at this ...
Convergence criterion is satisfied. If your output says anything else - STOP! Stop now and figure
out what is wrong with your model. Don’t go any further. Don’t pass go. Don’t interpret it. Stop.
Since our model is fine, we surge on ahead.
Now we get to Model Fit Statistics
In this case, smaller is better. Forget everything you learned before about how you want a large F,
T whatever. The null hypothesis is that there is no difference between the data as observed and as
predicted by your model.
The intercept-only value is the fit statistic with only the intercept in the model. If the next column -
the intercept plus your covariates - doesn’t have a lower value than the first one, you should feel
really bad. What it says is that your model isn’t better than having no predictors at all. Then, you
feel shame.
The next table is just a different set of statistics for the same global hypothesis. If you are familiar
with the omnibus F-test in ANOVA, think of that. If you aren’t familiar with that, you have no idea
what I’m talking about. In that case, look at just what the title says
Testing Global Null Hypothesis : Beta = 0
Again, you’re testing the null hypothesis that none of the predictor variables is different from zero.
This chi-square value is very high and the probability very low, so we get multiple measures all
pointing in the same direction - at least some of your predictor variables are better than nothing.
Which ones? That brings us to our next table, the Type 3 effects. We can see that all of our
variables are significant.
For those that have the same degrees of freedom, you can compare the chi-square values and say
that Age and Gender are better predictors than ER visits.
How do we interpret the direction of effect of these maximum likelihood estimates? Keep in mind
what probability is being modeled, first of all. Since 1= death, and age has a positive coefficient, it
means the older you are, the higher probability of dying within the next nine years. That makes
sense. Emergency room visits also have a positive coefficient. Again, that is reasonable. The more
visits you made to an emergency room over a nine-year period, the more likely you were to die.
What about the next two? Gender is negatively related, but how do you know which gender is
coded one? (Look in the next column.) So, being female is negatively related to dying. NOT being in
a nursing home (nursehome = NO) is also negatively related to dying.
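To make the sign interpretation concrete, here is a sketch of the logistic model in Python. The coefficients are hypothetical, chosen only to match the sign pattern just described (positive for age and ER visits, negative for female and for not being in a nursing home), not the actual SAS estimates:

```python
from math import exp

def predicted_prob(intercept, coefs, values):
    # Logistic model: P(death) = 1 / (1 + exp(-(b0 + sum of b*x)))
    z = intercept + sum(b * x for b, x in zip(coefs, values))
    return 1 / (1 + exp(-z))

# Hypothetical coefficients: age, ER visits, female (1/0), not in a home (1/0)
b0, betas = -6.0, [0.08, 0.05, -0.4, -1.0]

younger = predicted_prob(b0, betas, [70, 2, 1, 1])  # 70-year-old woman at home
older   = predicted_prob(b0, betas, [80, 2, 1, 1])  # same profile, ten years older
print(round(younger, 3), round(older, 3))  # the older profile gets the higher probability
```

A positive coefficient pushes z up, and therefore the predicted probability of death up, which is exactly the direction-of-effect reading described above.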
You can look at the table of odds ratios and see that all of these are significant. An odds ratio of 1.0
means that the odds are the same regardless, for example, whether you are male or female.
Yet another way to test for statistical significance would be to see whether an odds ratio of 1.0 falls
within the 95% confidence interval. You can see in the table above that it does not for any of our
variables.
However, for many people it’s much simpler to look at this chart and say, “Hey, there is the 1.0
that represents equivalent odds.” Here are the odds ratios and (with the lines) the 95% confidence
intervals for each of the variables. Here it is very easy to see at a glance that NOT being in a
nursing home is very far from having no effect. Gender - being female - is also pretty far from
non-significant. Emergency room visits and age both have an effect, but surprisingly (especially in
the case of age) not as much of an effect as the other two variables.
Since we’re in chart mode, let’s take a look at two other charts.
Another one of the default graphs from the logistic procedure is the one that gives predicted
probabilities. Here you can see the probabilities of death at each age holding emergency room visits
constant (at the sample mean). Separate plots are shown for combinations of gender and nursing
home placement. You can see that by far the lowest probability is for females who are not in
nursing homes, followed by males not in nursing homes. For those who were in their early sixties at
the start of the study both genders have a predicted probability of death within the next nine years
of less than .25. The predicted probability for males in nursing homes is the highest - .50. Females
in nursing homes have a lower probability of dying than males in nursing homes, but noticeably
higher than both males and females not in nursing homes.
In the end, we are all going to die. Cheerful thought, isn’t it? As we go across the X axis in age at
the beginning of the study, the curves get closer together. If you are in your nineties, the
probability you will die within the next nine years is high - but if you are female, your odds of
surviving are still somewhat better if you can stay out of a nursing home.
One last plot is our ROC curve, short for receiver operating characteristic curve. This is a plot of
SENSITIVITY - the percentage of true positives, the people we predicted would die who did - and
SPECIFICITY - the percentage of true negatives, the people we said would NOT die who did not.
We plot (1 - specificity) on the X axis by sensitivity on the Y axis. If we predicted no one would die,
our rate of true negatives would be 100%. Since we predicted nobody would die, we would be
exactly right for all of the people who didn’t die. 1 - 1.0 = 0, so we’d be at 0 on the X axis.
On the other hand, we’d have zero sensitivity. Since we predicted no one would die, we would have
zero true positives.
At the other extreme, if we predicted everyone would die, we would have 100% true positives and
0 true negatives. Since 1-0 = 1 , that would be at the upper right corner here.
The straight line is what we would get without any predictor variables, if we just randomly guessed
whether a person would live or die. The top left corner, where we have correctly predicted all of our
positives and all of our negatives is what we would get in a perfect model.
The more that curve is bowed toward the top left and away from the straight line, the better our
model.
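The two endpoints described above are easy to verify with a toy calculation. This Python sketch uses made-up predicted probabilities and outcomes, purely to show how each point on the curve is computed:

```python
def roc_point(probs, outcomes, threshold):
    # Sensitivity and (1 - specificity) when we predict death for p >= threshold
    pred = [p >= threshold for p in probs]
    tp = sum(pr and y for pr, y in zip(pred, outcomes))       # true positives
    fp = sum(pr and not y for pr, y in zip(pred, outcomes))   # false positives
    pos = sum(outcomes)
    neg = len(outcomes) - pos
    return tp / pos, fp / neg

# Hypothetical predicted probabilities and observed deaths (1 = died)
probs    = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
outcomes = [1,   1,   0,   1,   0,   0]

print(roc_point(probs, outcomes, 1.01))  # predict nobody dies   -> (0.0, 0.0)
print(roc_point(probs, outcomes, 0.0))   # predict everyone dies -> (1.0, 1.0)
```

Sweeping the threshold from high to low traces the curve from the bottom left corner to the top right, exactly as described.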
Of course, being statisticians, we always want to look at numbers, so we can also take a look at
our table of predicted probabilities. This tells us that 78.6% of pairs were concordant - that is,
comparing each person who died with each person who lived, in 78.6% of those pairs the person
who died had the higher predicted probability - while 21.2% of pairs had probabilities pointing in
the incorrect direction.
Both this table and the curve would probably lead us to the same conclusion.
But, wait, what IS our conclusion?
That based on a fairly large sample of American adults aged 65-95, males and people in nursing
homes are much more likely to die within nine years. Of particular interest is the fact that nursing
home status still has an effect even when the age of the patient, gender and the number of times
they were seen in the emergency room are held constant (that was our fourth chart, if you were
counting).
Also somewhat interesting is the fact that the number of emergency room visits was not a
particularly strong predictor of mortality. I selected this one, rather than total number of doctor
visits or hospitalizations because I thought it would be less sensitive to other factors, such as the
ability to pay medical bills, transportation to the doctor’s office and so on. I will be the first to admit
that this is an imperfect proxy for health status, but, as with any data set, I was limited to working
with the data available.
Related to this, we can conclude from all of the results in aggregate that while the model we used
is substantially better than no model at all and improved our ability to predict the probability of
death, there is still considerable room for improvement. Perhaps future research could improve
prediction by including behavioral risk indicators, such as amount of alcohol and tobacco use, as
well as socioeconomic status and diagnoses of chronic illness.
The End