Categorical Data Analysis: Because there really is no fate worse than death

What do birth, death, high school dropout, failing a course, losing a sale, engineering majors and being fired all have in common? If you answered, "Things my mother warned me about", you need therapy. The correct answer is that all of these are categorical variables, where data pretty much DO fit in neat little boxes. You were either admitted to college or you weren't. The customer either bought insurance from you or opened the door and let his pit bull chase you down the street. The voter checked the box for the Democratic, Republican or Independent candidate. You get the idea.

Today, we'll start with a brief discussion of simple statistics that are part of the FREQ procedure and SAS/GRAPH and then dive into logistic regression using PROC LOGISTIC. While the plain vanilla statistics you learn in introductory courses, like correlation, t-tests and regression, work well when you have continuous dependent variables - say age at first pregnancy, IQ scores or birthweight - with categorical data those tests are not the best choice.

For the example today, we're going to use the data set on the oldest old from Kaiser-Permanente and one of the most definite of categories - death. The data set in question consisted of two cohorts. Data on one cohort were collected from 1971-1979; data on the second were collected from 1980-88. The overall question we're interested in is what predicts who died over the nine-year period of study.

Let's start with the simplest procedure, PROC FREQ.

PROC FREQ DATA=in.old ;
   TABLES dthflag ;

                               Cumulative   Cumulative
DTHFLAG   Frequency   Percent   Frequency      Percent
0             2670      44.60        2670        44.60
1             3316      55.40        5986       100.00

Not too interesting, except that it tells us that 55.4% of our sample died. Let's take a look at these data graphically for a minute.
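Before the graphics, a quick sanity check: the percentages in that PROC FREQ output are easy to reproduce by hand. Here is a minimal sketch in Python, using the counts from the table above:

```python
# Reproduce the PROC FREQ one-way table for DTHFLAG by hand.
# Counts come from the output above: 0 = alive, 1 = died.
counts = {0: 2670, 1: 3316}
total = sum(counts.values())          # 5986 subjects overall

running = 0
for level in sorted(counts):
    pct = 100 * counts[level] / total
    running += counts[level]
    cum_pct = 100 * running / total
    print(level, counts[level], round(pct, 2), running, round(cum_pct, 2))

pct_died = round(100 * counts[1] / total, 2)  # 55.4% of the sample died
```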
One of the wonderful capabilities of higher-speed computers (compared to my ancient graduate school days) is the ability to quickly and easily visualize relationships. Some things that I suspect are going to be related to death are gender, whether or not you are in a nursing home, age at the beginning of the study and how sick you are in general. So, my next task is to do a stacked bar chart of death by gender with SAS Enterprise Guide.

To do this, I open my dataset in SAS Enterprise Guide and then from the TASKS menu select GRAPH, then BAR CHART, then STACKED BAR CHART. Next, click on the DATA tab. Drag GENDER under Column to Chart and DTHFLAG under Stack. Leave the rest of the tabs as defaults and skip on down to ADVANCED. Select Percentage as the statistic used to calculate the bar. Under Specify one statistical value to show for bars, select Percentage. Underneath that, click the box next to Display statistical value inside the bar.

Looking at our bar chart we can easily see a couple of things. There are approximately the same number of women as men, and men are much more likely to have died than women. If you were dying for the syntax for this instead, here it is:

PROC GCHART DATA=mydata.oldpeople ;
   VBAR gender / SUBGROUP=dthflag FRAME TYPE=PCT INSIDE=PCT ;

Figure 1

Let's take a look at this for a minute ... just eyeballing it, the odds of a woman dying versus not dying are about 1.0, that is, 24.2 vs 26.0. On the other hand, the odds of a man dying versus not dying are 31.2 vs 18.6 - considerably greater than 1.

I'd like to look at the exact same picture but comparing by nursing home status. In this case, all I need to do is click on MODIFY TASK at the top of the results, drag off the Gender variable and then drag NURSEHOME under Column to Chart. Click RUN. (Using the syntax, you simply change the VBAR statement to use "nursehome" instead of "gender" as the variable to chart.)
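Those eyeballed odds take only a couple of lines to check. The percentages below are the ones read off the stacked bar chart (each a percent of the whole sample), so treat them as approximate:

```python
# Odds of dying vs. not dying for each gender, from the bar chart percentages.
pct_of_sample = {
    "female": {"died": 24.2, "lived": 26.0},
    "male":   {"died": 31.2, "lived": 18.6},
}
odds = {g: round(p["died"] / p["lived"], 2) for g, p in pct_of_sample.items()}
print(odds)  # women's odds are near 1.0; men's odds are well above 1
```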
Figure 2

It's pretty clear from this chart that people who are in a nursing home are a much smaller percentage of the population - 21% - than those who are not, and also that they are much more likely to die. Again, if we look at the odds of dying in a nursing home, it is 6 to 1 (that is, 18% divided by 3%). Your odds of dying if you are not in a nursing home are less than even - 37.4% vs 41.5%. You're more likely to live than to die. The reason I start with these figures is that they are going to be much more meaningful to your typical audience than what we'll get into later.

Let's take a closer look at our nursing home analysis. Whenever you are looking at percentages you ought to be asking yourself "Percentage of WHAT?" How many died? So now that we have piqued our audience's interest with some pictures, we're going to dig into tables a bit deeper.

ASSOCIATION MEASURES

First, let's take a look at a bivariate analysis with a two-by-two table, looking at the probability of death for people who are and are not in nursing homes. You can easily do this either with SAS Enterprise Guide or by writing the syntax. The EG method is as follows:

Select TASKS > DESCRIBE > TABLE ANALYSIS

When the window first pops up, drag the two variables you want under Table variables. (If you don't see a window like this, click on the DATA option.) Click on the TABLES tab and drag the two variables from the left pane, Variables permitted in table, to where you want them to appear in the table. In this case, I want "nursehome" to be my column variable and "dthflag" to be my row variable. Next, under the TABLE STATISTICS tab, click on Association and check the boxes next to Chi-square tests and Measures as shown. Now, it is pretty simple to write the syntax, as shown below, but I wanted to use these windows just to illustrate how many options there are for statistical tests just within PROC FREQ.
In addition to all of the tests of association listed in this window, under the other tabs you'll find tests of agreement, like McNemar's and Kappa, tests for ordered differences, trends and more. Take my advice and take a few minutes to explore this when you have time. If you don't use Enterprise Guide, you can see the same options in the PROC FREQ section of the SAS documentation. Read all about it, as the newsboy says. (Do they still have newsboys?) And here is the syntax:

PROC FREQ DATA = mydata.oldpeople ;
   TABLES dthflag*nursehome / NOROW NOPERCENT NOCUM CHISQ MEASURES ;

Our first statement invokes the procedure and names the data set to be used. Nothing really to see there. The next statement requests the table, with the row variable first and then the column variable, followed by a '/' that denotes options are to follow. Our first set of options simply tells SAS what NOT to print by default. If you look at tables of numbers all day you might find the standard output pretty obvious. For many people, that's not the case, so

NOROW NOPERCENT NOCUM

tells SAS not to print the row percentage, the percentage of the total in each cell, or the cumulative statistics. So, we get a little different look than the graphs. This gives us the conditional probabilities.

Table 2: Mortality Rate by Nursing Home Placement

Given the condition that a person is NOT in a nursing home, what is the probability he or she will have died (DTHFLAG = 1)? The answer is 47%. What about for people who are in a nursing home? The probability of them dying is 85%. Also, unlike our bar chart, the table gives us the number of people who died - 1,077 - so this isn't based on an insignificant number. In fact, only 184 people who had been in a nursing home were still alive at the end of the study.

Statistics for Table of Nursing Home by Mortality

Statistic                        DF    Value       Prob
Chi-Square                        1    582.3733    <.0001
Likelihood Ratio Chi-Square       1    643.1440    <.0001
Continuity Adj. Chi-Square        1    580.8355    <.0001
Mantel-Haenszel Chi-Square        1    582.2760    <.0001
Phi Coefficient                        0.3119
Contingency Coefficient                0.2978
Cramer's V                             0.3119

You can see that we get four different types of chi-square values and that they are all very similar, which is what we expect. They are all also very large and very statistically significant - again, something we could have guessed looking at the table. Also, in case you don't know, the lowest probability that SAS prints is <.0001; the probability of getting a chi-square this high is MUCH lower than one in ten thousand.

"Being able to find SPSS in the start menu does not qualify you to perform a multinomial logistic regression."

I don't know the real name of the person who has that as their signature on the Chronicle of Higher Education forum, but truer words were never spoken - um, typed. Seriously, if you are like most normal people, you've noted that all the chi-square values are pretty much the same and you report the first one because, hey, it came first. Maybe you secretly suspect that there is some difference you should know about, and you worry that someone may ask you some day and you really don't know. Don't pretend you know - I was a statistical consultant for 20 years before I looked it up for a graduate course I taught. So now I will tell you, too, and you can worry no more. The options and what they tell you:

CHISQ

Random fact for Trivial Pursuit - according to Agresti and Finlay, anyway, the Pearson chi-square is the oldest statistical test in use today. That is believable because it is pretty easy to compute by hand: you just take the number you would expect in each cell if the data truly were independent, subtract that from the number you actually observed, square it and divide it by the number expected.
χ² = Σ (fo - fe)² / fe

The likelihood ratio chi-square is a lot less easy to compute, as it involves taking the log of the ratio between the observed and expected - not that it matters any more when you're having computers do the computation. All of the first three chi-square values test the same null hypothesis, that of no relationship between the row and column variables; that is, each is a test for independence. It does NOT tell you how strong a relationship is. It merely tells you how unlikely the null hypothesis of NO relationship is. The Mantel-Haenszel chi-square we'll get to later, so just hold that thought, which for most people is "What the hell is a Mantel-Haenszel chi-square?"

What is a Fisher's Exact Test? How do you get one? Why don't I have one?

In some settings, it's common to be asked to do a Fisher Exact Test. The reason is that the expected value is very small. You may have read that warning on your SAS output: "One or more cells have an expected value of less than 5. Chi-square may not be a valid test." Well, great, what do you do then? The common response is that you should collect more data. However, where I've personally seen this most often is in health outcome studies, when a group of patients at a single clinic get one treatment or another and we're looking at whether they died or not. Telling the physician, "Well, you see, what you really need to do to make this a valid statistical test is to kill off a few more patients" isn't the sort of thing that is going to win you a lot of consulting clients. So, when your poor physician asks what he or she can do in this situation, the answer is that you can do a Fisher's Exact Test, where you (actually, SAS) compute the probability of a table as unusual as the one that you have obtained under the null hypothesis of no relationship. If you have a 2 x 2 table and use the CHISQ option, SAS automatically gives you a Fisher Exact Test as well as the other chi-square values. You don't have to do anything.
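As an aside, the Pearson formula above really is simple enough to verify by hand. The sketch below does it in Python for the nursing home by death table; the individual cell counts are reconstructed from the totals reported in this paper, so treat them as an assumption:

```python
# Pearson chi-square computed cell by cell: the sum of (fo - fe)^2 / fe.
# Rows = dthflag (0 lived, 1 died); columns = not in / in a nursing home.
observed = [[2486, 184],
            [2239, 1077]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i in range(2):
    for j in range(2):
        fe = row_totals[i] * col_totals[j] / n   # expected count under independence
        fo = observed[i][j]                      # observed count
        chi_square += (fo - fe) ** 2 / fe

print(round(chi_square, 2))  # close to the 582.37 PROC FREQ reported
```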
With Nursing Home Residence by Death, I get the following table: As before, I reject the null hypothesis of no relationship, but in this case it tells me that the probability of this table, if there is truly no relationship in the population, is 5.07 x 10^-142. In other words, my exact probability has over 140 zeroes after the decimal place and before the 5. This is quite helpful when giving expert testimony, to be able to say that the probability is not merely less than the 1 in 10,000 that the standard chi-square value gives you but in fact less than one in a googol (a googol being the number 1 followed by 100 zeroes).

So, Fisher's exact test is helpful in two instances. One is when you have a small sample size and chi-square tests are of questionable validity. The second is when, regardless of sample size, you want an exact probability computed by SAS. So ... the one little sort-of word, CHISQ, gets you at least six statistics and, in the case of 2 x 2 tables, even more than that.

Remember that second option we included?

MEASURES

To compute the odds ratio, you could divide the frequency in row 1, column 1 by the frequency in row 1, column 2:

2,486 / 184 = 13.51

- the odds that a person who lived was not in a nursing home versus being in a home. Then you can divide the frequency in row 2, column 1 by the frequency in row 2, column 2:

2,239 / 1,077 = 2.08

Then you can divide the odds obtained in the first row by those in the second:

13.51 / 2.08 = 6.49

OR, you could just look at the table produced when you use the MEASURES option - it produces all types of other statistics, as shown in the next two tables. Here are our estimates of relative risk and odds ratio. So, first of all, the odds that someone who survived nine more years lived at home, versus in a nursing home, are 6.5 times the odds of someone who died during that period. People who died are a lot more likely to have lived in a nursing home. It would be nice to have a test of statistical significance, wouldn't it?
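That hand arithmetic is easy to replicate; again, the cell counts below are the ones reconstructed from the paper's totals, so treat them as an assumption:

```python
# Odds ratio from the 2x2 table: (f11 / f12) / (f21 / f22).
lived_at_home, lived_in_nh = 2486, 184    # row 1: survivors
died_at_home, died_in_nh = 2239, 1077     # row 2: decedents

odds_survivor = lived_at_home / lived_in_nh   # odds a survivor lived at home
odds_decedent = died_at_home / died_in_nh     # odds a decedent lived at home
odds_ratio = odds_survivor / odds_decedent

print(round(odds_survivor, 2), round(odds_decedent, 2), round(odds_ratio, 2))
```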
Nicely, this same option gives you 95% confidence intervals. Since 1.0 does NOT fall within these confidence limits, you can safely say that the odds are significantly higher that a person who died would have lived in a nursing home rather than at home.

Okay, now it's later ... The Mantel-Haenszel chi-square

The Mantel-Haenszel chi-square is a test of an ORDINAL relationship. This chi-square value, then, tests a particular type of relationship. If you only have two categories for each variable, as in this example, a test for an ordered relationship is exactly the same as a test of independence - you're testing whether 0 is different from 1. So, in our comparison of nursing homes by death, this is practically identical to the Pearson chi-square. That's not always going to be the case, though. Take a look at this example, using our same dataset. I broke visits to the emergency room down into five categories - none in the nine-year study period, 1-5, 6-10, 11-15 and over 15. You can see the cross-tabulation here ...

As long as we're looking at this table, let's take a look at the phi coefficient. You see, the chi-square value is a test of whether or not there is a relationship, but not of the size of that relationship. For that, we can use our friend the phi coefficient, which is interpreted just like any other correlation coefficient. We can compare the .18 in the table above to the previous phi coefficient for nursing home by death of .31 and see that there is a stronger relationship between being in a nursing home and death than between number of emergency room visits and death.

Also, in the table it is clear that there is a substantial difference between the Mantel-Haenszel and the other chi-square values. Do NOT just compare the chi-square values and say, "the Mantel-Haenszel is smaller, therefore there is less support for the hypothesis of an ordinal relationship than of independence". That's not necessarily true.
Notice that the Mantel-Haenszel has fewer degrees of freedom, so the chi-square distribution the obtained value is being compared to is different. I even brought a stuffed chi-square distribution here for you to compare it to, with varying degrees of freedom. (Who actually owns a stuffed chi-square distribution?)

One last point about PROC FREQ before moving on ... by now you are getting the idea that this simple frequency procedure can produce MUCH more than just one- and two-way frequency tables. Just as it can give you a chi-square test that tests the HYPOTHESIS of a linear ordinal relationship, PROC FREQ also produces different measures of the STRENGTH of the relationship. As with the Mantel-Haenszel chi-square, these are all going to be the same if you have two dichotomous variables, so let's look at our table for death by category of number of emergency room visits. Gamma, the tau tests and the Spearman coefficient are all common measures of rank order correlation. We don't have time to discuss every type of correlation and chi-square, but before we leave this subject, I do want to point out three things.

1. Different types of chi-square values, different types of correlations and other tests like odds ratios do exist.
2. These statistics are very easy to obtain using SAS.
3. While most times all of these measures will point you in the direction of the same general conclusion, there are times when one is preferable to the others.

The phi coefficient is based on one hypothesis - that the frequency across columns is not conditional on what row you are in. The rank order correlations are based on a different hypothesis, and the Pearson on yet another.
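For a 2 x 2 table, the phi coefficient mentioned above has a particularly simple relationship to the Pearson chi-square - phi = sqrt(chi-square / N) - which we can verify against the earlier PROC FREQ output:

```python
import math

# Phi coefficient for a 2x2 table: phi = sqrt(chi-square / N).
chi_square = 582.3733   # Pearson chi-square from the PROC FREQ output above
n = 5986                # total sample size

phi = math.sqrt(chi_square / n)
print(round(phi, 4))    # 0.3119, matching the Phi Coefficient line
```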
Maybe you are interested in knowing not only whether emergency room visits are related to death, but also whether the people who have more - say over 15 within nine years (the last category) - are more likely to die than people who had 11-15, who are in turn more likely to die than those with 6-10, and so on. In this example, you do see a linear relationship as well as a significant relationship when you do a simple test of independence. That's not always the case, though. Many years ago, I was part of a group doing a study on kindergarten children and their friendships. Contrary to our expectations, we found that having one friend versus no friends was a factor in many outcome variables, but there was no linear or ordinal relationship. That is, the children who had 2, 3 or 4 friends weren't any different from the children who had only one. Even in this case, where you do see an ordinal relationship, it's not nearly as strong as I had expected. So, take-away point - you may be interested in non-standard types of statistics, and if so, there they are, hidden away in PROC FREQ options.

Before we go on to logistic regression, let's take a four-way comparison here. We've already seen that people are more likely to die in nursing homes, but maybe it's because they are older. So, my next step is to do a table analysis of nursing home by gender by death by mean age. Since women are less likely to die, I'm going to look at each gender separately, just in case there are more men in nursing homes and that explains the difference.
I can do this in SAS Enterprise Guide by using the Summary Table Analysis task, or I could write the following code:

PROC TABULATE DATA=mydata.oldpeople ;
*** Starts the TABULATE procedure, names the data set to be used ;
   VAR age_comp ;
**** This is the list of any continuous, numeric variables ;
**** I only have one, age_comp ;
   CLASS dthflag gender nursehome ;
**** The CLASS statement lists classification variables ;
**** Variables you want to use as categories go in the CLASS statement ;
   TABLE nursehome*gender*age_comp*MEAN, dthflag ;
**** This statement specifies the table. The variables before the comma are your row variables ;

Using an "*" between variables will cross those categories, so this analysis will break down those in a nursing home by gender. Using the "*" followed by a statistic will request the computed statistic for that variable. In this case, you will get the mean of age, by nursing home status and gender. You'll have two columns, one that shows the mean for those who lived (dthflag = 0) and a second column showing the mean for those who died (dthflag = 1). Here are our results.

It seems that people who are in a nursing home die at an older age than those who are not. So, contrary to nursing homes killing you, maybe they make you live longer. Just kidding. The AGE_COMP variable gives you age at the beginning of the nine-year period. So, clearly, people in nursing homes were about three years older than those not in nursing homes. When you are talking about the difference between 77 years old and 80 years old, that could have a significant effect on the likelihood of mortality over the next nine years.

LOGISTIC REGRESSION

So, enough of that! We've looked at variables in two dichotomous categories, in two ordered categories, and now we've taken a look at four variables simultaneously from a descriptive point of view.
In some ways it gave us answers, but in others it only led to more questions. Let's move on to an inferential test with multiple predictor variables.

TASKS > REGRESSION > LOGISTIC

Drag variables under dependent, quantitative and classification variables as desired. In this case DTHFLAG is our dependent. The two quantitative variables are AGE_COMP, which is simply the age at the beginning of the study, and ERvisits, which, as you might guess, is the number of emergency room visits. We also have two categorical predictors, gender and nursing home status. Your window should look like this when you have finished selecting the variables.

Under the Model tab, click Response and, for Fit model to level, pull down and select 1. It just makes more sense to model who died than who lived. It does to me, anyway. Still under the Model tab, click on Effects. Shift-click on the first and last variable in the left window pane to select all of them. Click on the MAIN button to select all of these as main effects. I don't have any hypothesized crossed or nested effects in this model. Now, just click RUN.

INTERPRETING THE RESULTS

The first table simply tells you the data set used, the dependent variable, the number of levels and the type of model. Since these are all exactly what I expect - dthflag as dependent, two levels (alive or dead), a binary logit model - I just move right along. Really look at these tables! Do not be that person who skips to the end and looks for statistical significance. Sometimes with my classes I'll include some variable where 90% or more of the people did not answer and see how many of the students pick up on the fact that their conclusions are based only on the 100 people out of 5,000 who had complete data. In this case, we're fine, so we move along.

Probability modeled ...
If you are new to using the SAS logistic procedure and you don't pay attention to this little detail here, you can really go off the rails. By default, SAS models the lower number, so you would be predicting who lived. That's kind of counter-intuitive, which is why, as noted above, I changed from the default to Fit model to level = 1. So, we are predicting who died.

Let's skip over the Class Level table until later and look at this ...

Convergence criterion is satisfied.

If your output says anything else - STOP! Stop now and figure out what is wrong with your model. Don't go any further. Don't pass go. Don't interpret it. Stop. Since our model is fine, we surge on ahead. Now we get to Model Fit Statistics.

In this case, smaller is better. Forget everything you learned before about how you want a large F, t, whatever. The null hypothesis is that there is no difference between the data as observed and as predicted by your model. The intercept-only value is the fit statistic with only the intercept in the model. If the next column - the intercept plus your covariates - doesn't have a lower value than the first one, you should feel really bad. What it says is that your model isn't better than having no predictors at all. Then, you feel shame.

The next table is just a different set of statistics for the same global hypothesis. If you are familiar with the omnibus F-test in ANOVA, think of that. If you aren't familiar with that, just look at what the title says:

Testing Global Null Hypothesis: BETA = 0

Again, you're testing the null hypothesis that none of the predictor variables is any different from zero. This chi-square value is very high, the probability very low, so we get multiple measures all pointing in the same direction - at least some of your predictor variables are better than nothing. Which ones? That brings us to our next table, the Type 3 effects. We can see that all of our variables are significant.
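Before digging into the individual estimates, it may help to keep in mind how a logistic coefficient maps onto an odds ratio: OR = exp(beta). The beta values in this sketch are made up for illustration - they are not the estimates from this model:

```python
import math

# A logistic regression coefficient beta corresponds to the odds ratio
# exp(beta): the multiplicative change in the odds of the modeled outcome
# (here, death) for a one-unit increase in the predictor.
def odds_ratio(beta):
    return math.exp(beta)

print(odds_ratio(0.0))               # 1.0 -> the predictor has no effect
print(round(odds_ratio(0.5), 3))     # positive beta -> odds ratio above 1
print(round(odds_ratio(-0.5), 3))    # negative beta -> odds ratio below 1
```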
For those that have the same degrees of freedom, you can compare the chi-square values and say that age and gender are better predictors than ER visits. How do we interpret the direction of effect of these maximum likelihood estimates? Keep in mind, first of all, what probability is being modeled. Since 1 = death, and age has a positive coefficient, it means the older you are, the higher the probability of dying within the next nine years. That makes sense. Emergency room visits also have a positive coefficient. Again, that is reasonable. The more visits you made to an emergency room over a nine-year period, the more likely you were to die. What about these next two? Gender is negatively related, but how do you know which gender is coded one? (Look in the next column.) So, being female is negatively related to dying. NOT being in a nursing home (nursehome = NO) is also negatively related to dying.

You can look at the table of odds ratios and see that all of these are significant. An odds ratio of 1.0 means that the odds are the same regardless of, for example, whether you are male or female. Yet another way to test for statistical significance would be to see whether an odds ratio of 1.0 falls within the 95% confidence intervals, which you can see in the table above. However, for many people it's much simpler to look at this chart and say, "Hey, there is the 1.0 that represents equivalent odds." Here are the odds ratios and (with the lines) the 95% confidence intervals for each of the variables. Here it is very easy to see at a glance that NOT being in a nursing home is very far from having no effect. Gender - being female - is also pretty far from nonsignificant. Emergency room visits and age both have an effect, but surprisingly (especially in the case of age) not as much of an effect as the other two variables.

Since we're in chart mode, let's take a look at two other charts. Another one of the default graphs from the logistic procedure is the one that gives predicted probabilities.
Here you can see the probabilities of death at each age, holding emergency room visits constant (at the sample mean). Separate plots are shown for combinations of gender and nursing home placement. You can see that by far the lowest probability is for females who are not in nursing homes, followed by males not in nursing homes. For those who were in their early sixties at the start of the study, both genders have a predicted probability of death within the next nine years of less than .25. The predicted probability for males in nursing homes is the highest - .50. Females in nursing homes have a lower probability of dying than males in nursing homes, but a noticeably higher one than both males and females not in nursing homes. In the end, we are all going to die. Cheerful thought, isn't it? As we go across the X axis in age at the beginning of the study, the curves get closer together. If you are in your nineties, the probability you will die within the next nine years is high - but if you are female, your odds of surviving are still somewhat better if you can stay out of a nursing home.

One last plot is our ROC curve, short for receiver operating characteristic curve. This is a plot of SENSITIVITY - the percentage of true positives, the people we predicted would die who did - against SPECIFICITY - the percentage of true negatives, the people we said would NOT die who did not. We plot (1 - specificity) on the X axis by sensitivity on the Y axis. If we predicted no one would die, our rate of true negatives would be 100%: since we predicted nobody would die, we would be exactly right for all of the people who didn't die. Then 1 - 1.0 = 0, so we'd be at 0 on the X axis. On the other hand, we'd have zero sensitivity: since we predicted no one would die, we would have zero true positives. At the other extreme, if we predicted everyone would die, we would have 100% true positives and 0% true negatives. Since 1 - 0 = 1, that would be at the upper right corner here.
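The two degenerate cutoffs just described - predict that nobody dies, or that everybody dies - can be sketched in a few lines to show where they land on the ROC plot. The outcome values here are toy data, not from the study:

```python
# Sensitivity and specificity at the two extreme prediction rules.
def sens_spec(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp / (tp + fn), tn / (tn + fp)

actual = [1, 1, 0, 0, 1]             # toy outcomes: 1 = died, 0 = lived
nobody_dies = [0] * len(actual)      # predict no one dies
everybody_dies = [1] * len(actual)   # predict everyone dies

print(sens_spec(actual, nobody_dies))     # (0.0, 1.0): ROC point (0, 0)
print(sens_spec(actual, everybody_dies))  # (1.0, 0.0): ROC point (1, 1)
```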
The straight line is what we would get without any predictor variables, if we just randomly guessed whether a person would live or die. The top left corner, where we have correctly predicted all of our positives and all of our negatives, is what we would get in a perfect model. The more that curve is bowed toward the top left and away from the straight line, the better our model. Of course, being statisticians, we always want to look at numbers, so we can also take a look at our table of predicted probabilities. This tells us that 78.6% were concordant - that is, either people predicted to die, died, or people predicted to live, lived. On the other hand, 21.2% had a probability that predicted in the incorrect direction. Both this table and the curve would probably lead us to the same conclusion.

But wait, what IS our conclusion? That, based on a fairly large sample of American adults aged 65-95, males and people in nursing homes are much more likely to die within nine years. Of particular interest is the fact that nursing home status still has an effect even when the age of the patient, gender and the number of times they were seen in the emergency room are held constant (that was our fourth chart, if you were counting). Also somewhat interesting is the fact that the number of emergency room visits was not a particularly strong predictor of mortality. I selected it, rather than total number of doctor visits or hospitalizations, because I thought it would be less sensitive to other factors, such as the ability to pay medical bills, transportation to the doctor's office and so on. I will be the first to admit that this is an imperfect proxy for health status but, as with any data set, I was limited to working with the data available.
Related to this, we can conclude from all of the results in aggregate that, while the model we used was substantially better than no predictors at all and improved our ability to predict the probability of death, there is still considerable room for improvement. Perhaps future research could improve prediction by including behavioral risk indicators, such as amount of alcohol and tobacco use, as well as socioeconomic status and diagnosis of chronic illness.

The End