SAS Global Forum 2007
Data Mining and Predictive Modeling
Paper 085-2007

Text Mining and PROC KDE to Rank Nominal Data
Patricia B. Cerrito, University of Louisville, Louisville, KY

ABSTRACT

By definition, nominal data cannot be ranked. However, there are circumstances where it is essential to rank nominal data. Examples of such ranking include ranking hospitals and colleges, defining the "most livable cities", and scoring conference paper submissions. In this project, we consider ranking patient severity. The purpose is to determine how patient severity can be used to rank the quality of hospital performance. There are thousands of patient diagnoses and co-morbidities that make such a ranking very difficult. Generally, nominal variables have been ranked by using quantitative outcome variables. Currently, hospital quality measures use stepwise logistic regression to reduce the number of patient diagnoses considered to define a measure of patient severity. More recently, a weight-of-evidence method has been developed for predictive modeling such that nominal data are compressed and ranked using a target variable. However, there are now methods available for ranking nominal data that do not require outcome variables; instead, outcome variables can be used to validate the ranking. Ranking can be done using SAS Text Miner to compress nominal data fields containing information on patient diagnoses, combined with PROC KDE to define and validate the patient severity ranking. It will be demonstrated that SAS Text Miner can define an implied ranking of nominal fields that is identified through the application of PROC KDE. Once the patient severity rank has been defined, it will be used to examine patient outcomes, and physician variability in patient outcomes.

INTRODUCTION

More and more, healthcare providers will be judged and reimbursed based upon their performance on quality measures.
Since patient outcomes depend very much on patient conditions, any performance measure will need to use a patient severity index. Because of the subjective nature of defining a patient's condition, attempts have been made to define an objective index so that healthcare providers can be compared. Even though some measures are generally accepted, they are still problematic.

While the example discussed in this paper is focused on the development of a patient severity index, the methodology can be used to compress and to rank the levels of any categorical variable. Such applications include inventory codes and customer purchases. For our example, we use data available from AHRQ (Agency for Healthcare Research and Quality), the National Inpatient Sample. This sample contains all inpatient events for 37 participating states, approximately 8 million events per year. We use data from the year 2003. Up to 15 columns in the dataset are used to define the patient condition and another 15 columns are used to define patient treatments. These columns are defined using a coding system developed by the World Health Organization; the codes are available online at http://icd9cm.chrisendres.com. In addition to the patient condition, several patient severity indices are included in the dataset; all such indices are defined using some form of logistic regression.

In this paper, we propose a new methodology for defining patient severity indices, one that does not depend upon patient outcomes for its definition, so that outcomes can instead be used for validation. In sections 2 and 3, we discuss the more standard methodology, presenting the new methodology in section 4. We will compare results using the different methods of data compression.

STANDARD LOGISTIC REGRESSION METHODOLOGY TO DEFINE A PATIENT SEVERITY INDEX

The standard statistical method used to define a patient severity index for ranking the quality of care of healthcare providers is regression.
Logistic regression is used when the outcome variable considered is mortality (or a severe adverse event); linear regression is used when the outcome variable is cost or length of stay. Given the number of ICD9 codes that are available, the set of codes must be reduced before it can be used in a regression equation. At some point, a stepwise procedure is used to find the codes that are the most related to the outcome variable.

The Agency for Healthcare Research and Quality (AHRQ) has developed a draft report on a number of quality indicators.[1] All of the indicators are risk-adjusted using a measure of patient severity. One such measure is the all patient refined diagnosis related group (APRDRG). This measure can be used to adjust patient severity, or to adjust the risk of patient mortality. The formula used to define quality is equal to:

   logit(Pr(Y_ijk = 1)) = β_k0 + Σ_{p=1..P_k} α_kp (age/gender_p)_ij + Σ_{q=1..Q_k} (APRDRG_q)_ijk

where Y_ijk is the response for the j-th patient in the i-th hospital for the k-th quality indicator. The value (age/gender_p)_ij is equal to the p-th age-by-gender zero-one indicator associated with the j-th patient in the i-th hospital, and (APRDRG_q)_ijk is equal to the q-th APRDRG zero-one indicator variable associated with the j-th patient in the i-th hospital for the k-th quality indicator. Then the risk-adjusted rate is equal to:

   Risk-adjusted rate = (Observed rate / Expected rate) × National average rate

AHRQ suggests using a generalized model to account for correlations across patient responses in the same hospital.[2] The risk-adjusted rate is also based upon a logistic regression. However, many organizations are using standardized risk adjustment measures even though these measures have questionable validity when extrapolated to fresh data.
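The risk-adjusted rate formula above is simple arithmetic. As a quick illustration (the rates below are invented for the example, not taken from the NIS data; the paper's own computations are done in SAS):

```python
def risk_adjusted_rate(observed, expected, national_average):
    """Risk-adjusted rate = (observed rate / expected rate) * national average rate."""
    return observed / expected * national_average

# Hypothetical hospital: observed mortality 2.0%, model-expected 2.5%,
# national average 1.8%. Performing better than expected (2.0 < 2.5)
# pulls the adjusted rate below the national average.
rate = risk_adjusted_rate(0.020, 0.025, 0.018)
print(round(rate, 4))  # 0.0144
```

A hospital whose observed rate exactly equals its expected rate is simply assigned the national average.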
For example, the Tennessee Hospital Discharge Data System uses the expected mortality defined by the equation:[3]

   Expected mortality = 1 / (1 + e^(-z))

where

   z = -9.566 + 1.542(risk of death mortality weight) + 3.819(severity of illness mortality weight) - 18.07(gender mortality weight) + 0.04(age) - 0.045(length of stay in days) + 6.937(APRDRG mortality weight) + 0.332(severity of illness class) + 0.994(risk of death class)

and

• Risk of death mortality weight = average mortality rate for the risk of death classification the patient was assigned to
• Severity of illness mortality weight = average mortality rate for the severity of illness classification the patient was assigned to
• Gender mortality weight = average mortality rate for the patient's gender
• APRDRG mortality weight = average mortality rate for the APRDRG the patient was grouped to
• Severity of illness class = severity of illness class assigned to the patient by the APRDRG system
• Risk of death class = risk of death class assigned to the patient by the APRDRG system

For example, suppose there are five codes (A, B, C, D, E) used in a logistic regression with the outcome variable of mortality. Then the regression equation can be written

   P = α0 + α1(if A is present) + α2(if B is present) + α3(if C is present) + α4(if D is present) + α5(if E is present)

where P is the predicted probability of mortality and α0 + α1 + α2 + α3 + α4 + α5 = 1. The predicted probability increases as the number of codes present increases.

One assumption of regression is that the five codes are independent (and uncorrelated); that is, if A = diabetes and B = congestive heart failure (CHF), then the likelihood of someone with diabetes having CHF is no greater than the likelihood of someone without diabetes having CHF. However, since diabetes can lead to heart disease, this assumption of independence is clearly false. Another problem with logistic regression is that it produces very questionable results when group sizes are very disparate.
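The Tennessee expected-mortality formula is a standard logistic (sigmoid) transform of the linear score z. A minimal sketch of that transform (the score values below are invented for illustration, not real patient scores):

```python
import math

def expected_mortality(z):
    """Logistic transform: maps any real score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(round(expected_mortality(0.0), 2))   # 0.5: a zero score maps to 50%
print(expected_mortality(-9.566) < 0.001)  # True: the large negative intercept
                                           # alone implies near-zero mortality
```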
There is a value p chosen such that if P < p then the model predicts no mortality, but if P > p then the model predicts mortality. Suppose we are looking at a condition where the mortality is 1%. Then if p = 1, the accuracy level of the model is 99%, although the model has no value predictively: the false negative rate (predicting no mortality when mortality occurs) will be 100% while the false positive rate (predicting mortality when no mortality occurs) will be 0%. As p decreases, the accuracy of the model will decrease slightly, as will the false negative rate; however, the false positive rate will increase while the false negative rate remains over 90%. Unfortunately, disparate group sizes are almost never accounted for in the development of the logistic regression model.

Consider, for example, that αi = 0.20 for i = 1, 2, 3, 4, 5 and α0 = 0. If p = 1 then all 5 codes must be present in order to predict mortality. If p = 0.8, then an individual must have 4 out of the 5 codes to predict mortality, and so on. The possible threshold values are 0, 0.2, 0.4, 0.6, 0.8, and 1. The equal coefficients indicate that there is an equally likely chance of 0.2 that any one code is related to one patient. Therefore, the probability of having all five codes is (0.2)^5 = 0.00032. The probability of having 4 out of 5 is equal to 5(0.2)^4(0.8) = 0.0064. The remaining probabilities are 0.0512 (for 3), 0.2048 (for 2), 0.4096 (for 1) and 0.32768 (for 0). These give the possible threshold values for predicting mortality; regardless of what the codes represent, those are the only possible threshold values. If the threshold value requires all 5 codes, then 0.99968 of all patients will not have a predicted value of mortality. If only 4 of the 5 codes are required, then 0.99328 will not have a predicted value of mortality. Once the number of codes required drops to 3, the predicted mortality rate climbs to 0.05792.
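The probabilities above follow from the binomial distribution with n = 5 codes and per-code probability 0.2. A quick check of the arithmetic (illustrative Python; the paper's own computations are done in SAS):

```python
from math import comb

def code_count_prob(k, n=5, p=0.2):
    """Probability that exactly k of the n codes are present, assuming each
    code occurs independently with probability p (binomial pmf)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Matches the text: 0.00032, 0.0064, 0.0512, 0.2048, 0.4096, 0.32768
for k in range(5, -1, -1):
    print(k, round(code_count_prob(k), 5))

# Predicted mortality rate once only 3 or more codes are required
print(round(sum(code_count_prob(k) for k in (3, 4, 5)), 5))  # 0.05792
```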
The differential between the predicted mortality rate and the actual mortality rate is used to rank healthcare providers.[4] If the predicted mortality is much higher than the actual mortality rate, then the provider will have a higher ranking. One way to improve the ranking is to make the predicted mortality as high as possible; that is, by reporting all patients (or many patients) above the threshold value. If the 5 codes are known to the provider, extra diligence can be placed on the documentation of those 5 regardless of how well other codes are documented; this becomes more difficult if the 5 codes are not known to the provider.

Another problem with the use of regression is the requirement that the codes are uniformly entered by all providers. Entry of codes depends upon the accuracy of documentation. Consider, for example, all of the 4-digit codes associated with diabetes:

• 250.0 Diabetes mellitus without mention of complication
• 250.1 Diabetes with ketoacidosis
• 250.2 Diabetes with hyperosmolarity
• 250.3 Diabetes with other coma
• 250.4 Diabetes with renal manifestations
• 250.5 Diabetes with ophthalmic manifestations
• 250.6 Diabetes with neurological manifestations
• 250.7 Diabetes with peripheral circulatory disorders
• 250.8 Diabetes with other specified manifestations
• 250.9 Diabetes with unspecified complication

The fifth digit for diabetes represents the following:

• 0 type II or unspecified type, not stated as uncontrolled
• 1 type I [juvenile type], not stated as uncontrolled
• 2 type II or unspecified type, uncontrolled

For a complete listing of the ICD9 codes, the interested reader is referred to http://www.disabilitydurations.com/icd9top.htm. The term "uncontrolled" is left undefined. It is questionable whether every physician who admits patients to a hospital will document "uncontrolled" on the patient's chart.
If providers document differently, those who document the codes used in the regression equation will rank higher compared to those who do not. Suppose one provider documents the 5 codes at a rate of 25% rather than 20% while another documents them at 15%. Then the probability of having 3 or more codes is equal to 0.1035 for the first provider but 0.0266 for the second. If they have the same actual mortality rate, provider 1 will rank considerably higher than provider 2. In practice, the number of ICD9 codes used is more than 5 and the probabilities of the codes occurring will not be equal; however, the result is basically the same. If the provider knows what codes are used in the regression equation, the codes that must be carefully documented are the ones in the regression; the rest have no importance. For example, the Healthcare Financing Administration (HCFA) first uses cluster analysis to group the ICD9 codes, with each code variable defined as 1 if present and 0 if absent, followed by stepwise logistic regression to identify the most important cluster.[5] The result will be extremely biased if some providers have access to the regression equation while others do not; those who do not are penalized.

Also, different regression equations can result in different rankings. While many patients will have similar severity ranks across different formulas, in as much as 20% of the patient base the different measures can vary considerably.[6] Iezzoni, Ash, et al. state, "Detailed evaluation of severity measures appears to be a narrow methodologic pursuit, far removed from daily medical practice. Nevertheless, severity-adjusted death rates are widely used as putative quality indicators in health care provider 'report cards'." Monte Carlo simulation also indicates problems in the model; these logistic models tend to have low positive predictive value, indicating issues with disparate group sizes for rare occurrences.[7]
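The 0.1035 versus 0.0266 comparison is again binomial arithmetic, now with documentation rates of 0.25 and 0.15. A quick check (illustrative Python, not the paper's SAS code):

```python
from math import comb

def prob_three_or_more(p, n=5):
    """Probability that at least 3 of the n codes are documented, assuming
    each code is documented independently with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3, n + 1))

# Provider 1 documents at 25%, provider 2 at 15%
print(round(prob_three_or_more(0.25), 4))  # 0.1035
print(round(prob_three_or_more(0.15), 4))  # 0.0266
```

The roughly fourfold gap in predicted mortality comes entirely from documentation behavior, not from any difference in actual patient outcomes.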
Other studies show even worse agreement, indicating that the measures can only identify outlier providers that perform very poorly and that the measures are not valid for ranking all providers.[8-10]

To define a patient severity model using the National Inpatient Sample (NIS) database, we use the patient condition codes, labeled DX1-DX15 in the dataset. We create a series of indicator functions for each ICD9 code that the user wants to investigate for a severity index. While time consuming, the user can use all of the codes and then reduce them by other means. The DATA step code for creating these indicator functions is as follows:

data sasuser.indicatorcodes;
   set sasuser.NISdata2003;
   array dx{15} DX1-DX15;
   n25000 = 0;     /* initialize once; set to 1 if the code appears in any DX column */
   do i = 1 to 15;
      if dx{i} = '25000' then n25000 = 1;
   end;
   drop i;
run;

The above code defines just one indicator function; others can be added. Once defined, the indicator functions are used in a logistic regression with the Regression Node in Enterprise Miner.

The NIS database provides a number of severity measures that are defined using regression. Therefore, we can compare the results from defining a severity measure using text analysis to the results using the more traditional regression. We use the data from the year 2003. The measures that we compare are

• APRDRG mortality risk, using the all patient refined DRG as developed by AHRQ
• APRDRG severity, using the all patient refined DRG as developed by AHRQ
• Disease staging: mortality level, as developed by Medstat
• Disease staging: resource demand level, as developed by Medstat

All four measures were developed using a logistic regression or linear regression process (depending upon whether the outcome variable was discrete, as in mortality level, or continuous, as in resource demand level). We will also compare these different measures to each other.

WEIGHT OF EVIDENCE METHOD

There are some simple methods to reduce a complex categorical variable.
Probably the easiest is to define a level called 'other'. All but the most populated levels can be rolled into the 'other' level. This method has the advantage of allowing the investigator to define the number of levels, and to immediately reduce the number to a manageable set. For example, in a study of medications for diabetes, there were a total of 358 different medications prescribed. It is possible to use the ten most popular medications and then combine the remaining medications into 'other'. However, the 'other' category should consist of fairly homogeneous values. This can cause a problem when examining patient condition codes. Some patient conditions that occur rarely can require extraordinary costs and should not be rolled into a general 'other' category. An example is a patient who requires a heart transplant and who has a ventricular assist device inserted. The condition is rare, but the cost is extraordinary.

Another method is called target-based enumeration. In this method, the levels are quantified by using the average of the outcome variable within each level. The level with the smallest outcome average is recoded with a 1, the next smallest with a 2, and so on. A modification of this technique is to use the actual outcome average for each level; levels with identical expected outcomes are merged. This modification is called weight-of-evidence recoding.[11]

The weight-of-evidence (WOE) technique works well when the number of observations per level is sufficiently large to get a stable outcome variable average.[11] It does not generalize well to fresh data if the number of levels is large and the number of observations per level is small. In addition, there must be one clearly defined target variable, since the levels are defined in terms of that target. In a situation where there are multiple targets, the recoding of categories is not stable. In addition, the target variable is assumed to be interval so that the average value has meaning.
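As a toy illustration of target-based enumeration (the levels and outcome values below are made up, not NIS data, and this sketch is in Python rather than the paper's SAS):

```python
from collections import defaultdict

# (level, outcome) pairs: a hypothetical categorical level and a cost outcome
records = [("A", 100), ("A", 120), ("B", 300), ("B", 280), ("C", 110), ("C", 130)]

# Average the outcome variable within each level
totals = defaultdict(lambda: [0.0, 0])
for level, outcome in records:
    totals[level][0] += outcome
    totals[level][1] += 1
means = {level: s / n for level, (s, n) in totals.items()}

# Recode: smallest mean outcome -> 1, next smallest -> 2, and so on
recode = {level: rank for rank, level in enumerate(sorted(means, key=means.get), start=1)}
print(recode)  # {'A': 1, 'C': 2, 'B': 3}
```

Weight-of-evidence recoding replaces the integer rank with a (smoothed) log-odds computed from the target, as in the macro below.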
If there are only two outcome levels (for example, mortality), the weight-of-evidence reduces to defining a level by the number of observations in that level with an outcome value of 1 (death). We use a macro to define the weights for this process.[12] The WOE macro yields a total of 10 different levels; these levels are compared to results defined using logistic regression as provided in the database (Tables 1 and 2).

libname woe '.';
options mstored sasmstore=woe;

%macro smooth_weight_of_evidence(data=, out=, input=, target=,
                                 n_prior=, n0n1=, fname=) / store source;
   %if &fname= %then %let fname=$w%substr(&input,1,6);
   %else %let fname=$w%substr(&fname,1,6);

   proc sql;
      create table f as
      select "&fname" as FMTNAME,
             &input as START,
             sum(&target=1) as N1,
             sum(&target=0) as N0
      from &data
      group by &input;
   quit;

   data f;
      set f end=last;
      LABEL = log((N1 + &n_prior/(1+&n0n1)) / (N0 + &n_prior*&n0n1/(1+&n0n1)));
      output;
      if last then do;
         START = 'OTHER';
         LABEL = -log(&n0n1);
         output;
      end;
   run;

   proc format cntlin=f;
   run;

   data &out;
      set &data;
      w_&input = put(&input, &fname..) + 0;
   run;
%mend;

libname ed 'c:\ed';
%smooth_weight_of_evidence(data=sasuser.niswoe, out=nis.niswoe, input=dx1,
                           target=totchg, n_prior=50, n0n1=2);
run;

Table 1.
WOE Levels Compared to APRDG Mortality Risk From Logistic Regression Total Total Woe Levels APRDG Mortality Risk 0 1 2 3 4 5 6 7 8 9 0 9 5.42 0.06 1 0.60 0.00 56 33.73 0.08 39 23.49 0.21 13 7.83 0.01 4 2.41 0.00 7 4.22 0.01 29 17.47 0.03 1 0.60 0.00 7 4.22 0.03 166 1 6457 1.55 40.63 8548 2.06 36.90 48716 11.72 68.24 13774 3.31 73.23 78701 18.94 44.99 52241 12.57 53.09 111932 26.93 95.40 63531 15.29 75.42 17255 4.15 63.93 14410 3.47 55.54 415565 2 4189 2.56 26.36 9147 5.60 39.48 16765 10.26 23.49 3630 2.22 19.30 66001 40.38 37.73 28615 17.51 29.08 4466 2.73 3.81 16555 10.13 19.65 6839 4.18 25.34 7242 4.43 27.91 163449 3 3254 5.52 20.48 4492 7.61 19.39 4864 8.24 6.81 937 1.59 4.98 22873 38.77 13.08 13361 22.65 13.58 764 1.29 0.65 3399 5.76 4.04 1998 3.39 7.40 3055 5.18 11.77 58997 4 1982 10.48 12.47 978 5.17 4.22 983 5.20 1.38 428 2.26 2.28 7348 38.86 4.20 4181 22.11 4.25 159 0.84 0.14 718 3.80 0.85 898 4.75 3.33 1233 6.52 4.75 18908 15891 23166 71384 18808 174936 98402 117328 84232 26991 25947 657085 5 SAS Global Forum 2007 Data Mining and Predictive Modeling Table 2. 
WOE Levels Compared to APRDRG Severity APRDRG Severity Total WOE Levels 0 1 2 3 4 5 6 7 8 9 0 9 5.42 0.06 1 0.60 0.00 56 33.73 0.08 39 23.49 0.21 13 7.83 0.01 4 2.41 0.00 7 4.22 0.01 29 17.47 0.03 1 0.60 0.00 7 4.22 0.03 166 1 3685 1.43 23.19 6215 2.42 26.83 27032 10.51 37.87 7945 3.09 42.24 49857 19.38 28.50 29745 11.57 30.23 75564 29.38 64.40 40402 15.71 47.97 9360 3.64 34.68 7390 2.87 28.48 257195 2 5143 1.97 32.36 9305 3.56 40.17 32028 12.24 44.87 7877 3.01 41.88 79251 30.28 45.30 38446 14.69 39.07 34710 13.26 29.58 32676 12.49 38.79 11481 4.39 42.54 10801 4.13 41.63 261718 3 4721 4.21 29.71 6354 5.67 27.43 10410 9.29 14.58 2504 2.23 13.31 37360 33.35 21.36 23356 20.85 23.74 6662 5.95 5.68 9830 8.77 11.67 4877 4.35 18.07 5966 5.32 22.99 112040 4 2333 8.98 14.68 1291 4.97 5.57 1858 7.16 2.60 443 1.71 2.36 8455 32.56 4.83 6851 26.38 6.96 385 1.48 0.33 1295 4.99 1.54 1272 4.90 4.71 1783 6.87 6.87 25966 15891 23166 71384 18808 174936 98402 117328 84232 26991 25947 657085 Total The different patient severity indices have no obvious relationship. The WOE levels are scattered across both the mortality and the severity indices. Since the different methods yield such different results, we need to find alternative methods to determine a “good” severity index. Tables 3 and 4 compare WOE results to the two other disease staging measures. Table 3. 
WOE Levels Compared to Disease Staging: Mortality Disease Staging: Mortality Total Total WOE Levels 0 1 2 3 4 5 6 7 8 9 0 672 0.55 4.26 2218 1.81 9.58 6028 4.92 8.46 2965 2.42 15.78 1109 0.90 0.63 11791 9.62 12.01 84135 68.66 71.71 10518 8.58 12.50 2262 1.85 8.43 844 0.69 3.26 122542 1 475 1.20 3.01 245 0.62 1.06 2333 5.91 3.27 1211 3.07 6.45 3707 9.40 2.12 5157 13.07 5.25 18233 46.22 15.54 3895 9.87 4.63 3151 7.99 11.75 1042 2.64 4.03 39449 2 3091 1.86 19.58 3109 1.87 13.43 29531 17.74 41.45 7452 4.48 39.66 42989 25.83 24.60 20109 12.08 20.48 8247 4.96 7.03 41240 24.78 49.00 5579 3.35 20.80 5072 3.05 19.60 166419 3 4359 1.90 27.61 6752 2.94 29.17 26447 11.52 37.12 5649 2.46 30.07 87958 38.33 50.32 38966 16.98 39.69 6326 2.76 5.39 25707 11.20 30.55 11937 5.20 44.49 15404 6.71 59.54 229505 4 4628 5.87 29.31 8456 10.72 36.53 6078 7.71 8.53 1251 1.59 6.66 32880 41.70 18.81 16424 20.83 16.73 352 0.45 0.30 2505 3.18 2.98 3161 4.01 11.78 3121 3.96 12.06 78856 5 2563 13.26 16.23 2365 12.24 10.22 829 4.29 1.16 260 1.35 1.38 6143 31.78 3.51 5723 29.61 5.83 26 0.13 0.02 292 1.51 0.35 738 3.82 2.75 390 2.02 1.51 19329 15788 23145 71246 18788 17478 6 98170 11731 9 84157 26828 25873 656100 There seems to be no relationship between the two measures as shown in Table 3; the same non-relationship is shown in Table 4. In Table 4, there are only 5 values in code 1 because it represents the unknown codes, and is not part of disease staging. 6 SAS Global Forum 2007 Data Mining and Predictive Modeling Table 4. 
WOE Levels Compared to Disease Staging: Resource Demand Disease Staging: Resource Demand Total WOE Levels 0 1 2 3 4 5 6 7 8 9 1 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 0.00 1 20.00 0.00 0 0.00 0.00 2 40.00 0.00 0 0.00 0.00 2 40.00 0.01 5 2 2143 1.99 13.53 210 0.20 0.91 7611 7.07 10.67 5724 5.32 30.45 11428 10.61 6.53 5488 5.10 5.58 62208 57.78 53.02 9330 8.67 11.08 1290 1.20 4.78 2237 2.08 8.62 107669 3 6002 1.72 37.88 8460 2.42 36.52 48785 13.98 68.39 9492 2.72 50.50 93090 26.67 53.22 58085 16.64 59.03 52752 15.11 44.96 48034 13.76 57.04 12773 3.66 47.33 11571 3.32 44.61 349044 4 6209 3.88 39.19 11179 6.99 48.26 12912 8.07 18.10 2755 1.72 14.66 53299 33.31 30.47 27629 17.26 28.08 2177 1.36 1.86 23698 14.81 28.14 10670 6.67 39.53 9504 5.94 36.64 160032 5 1489 3.71 9.40 3317 8.26 14.32 2027 5.05 2.84 824 2.05 4.38 17101 42.57 9.78 7191 17.90 7.31 186 0.46 0.16 3152 7.85 3.74 2256 5.62 8.36 2626 6.54 10.12 40169 In addition, we use kernel density estimation to compare the WOE levels in terms of outcomes. Figure 1 gives the relationship of WOE level to total charges, which were used to define the WOE levels. Figure 2 gives the relationship to Length of Stay. A natural ordering is established with total charges, with the exception of levels 1 and 3 which have several cross-over points with the other levels. Level 6 has the highest probability of a lower cost compared to all other levels. The highest probability of high cost is shared by levels 1, 2, and 3. Figure 1. WOE Level by Total Charges When we change the target value to length of stay, the ordering changes as well (Figure 2). Level 6 has the highest probability of a short stay compared to all other levels with level 7, the next lowest. However, crossover does occur between levels 6 and 7 early on for the shortest stay. Level 1 now has the highest probability of a lengthy stay. WOE finds costs up to about $40,000 and lengths of stay up to day 12. 
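PROC KDE estimates each group's outcome density with a kernel smoother. A minimal Gaussian-kernel sketch of the same idea, in pure Python with toy data (PROC KDE also selects the bandwidth automatically, which is not reproduced here):

```python
import math

def kde(x, sample, bandwidth):
    """Gaussian kernel density estimate at point x."""
    n = len(sample)
    return sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
    ) / (n * bandwidth * math.sqrt(2 * math.pi))

# Toy 'total charges' for one WOE level; evaluate the density on a grid
charges = [1200, 1500, 1800, 2500, 3100, 4000]
grid = range(0, 10001, 500)           # mirrors gridl=0 gridu=10000
density = [kde(x, charges, bandwidth=500) for x in grid]

# The estimate behaves like a probability density: non-negative, and its
# grid sum times the step width is close to 1
area = sum(density) * 500
print(round(area, 2))  # ~1.0
```

Plotting such curves for each WOE level, as in Figures 1 and 2, makes crossover points between levels visible.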
Figure 2. WOE by Length of Stay

TEXT MINING METHODOLOGY

In order to perform text analysis on the NIS data file, the data must first be pre-processed. We need to merge the 15 columns containing patient conditions into one text string, separating each code with a space. We do this using a data step with the following statement:

String = catx(' ', DX1, DX2, DX3, DX4, DX5, DX6, DX7, DX8,
              DX9, DX10, DX11, DX12, DX13, DX14, DX15);

Because of machine limitations, we used a 10% sample. Once the text string is defined, we use the Text Miner node in Enterprise Miner (Figure 3).

Figure 3. Use of the Text Miner Node to Create Clusters of Patient Severity

Figure 4 gives the available options in the Text Miner node. Since the ICD9 codes are numeric, the default for Numbers must be changed from 'no' to 'yes'.

Figure 4. Options in Text Miner (change the default to 'yes' for Numbers)

After running Text Miner, the interactive results are shown in Figure 5. Note the Interactive button in Figure 4; clicking on the button gives Figure 5.

Figure 5. Interactive Results for the Text Miner Node (lists all terms in 2 or more observations)

To define text clusters, go to the Tools menu (Figure 6). It gives an option to cluster the text field.

Figure 6. Tools Menu to Cluster Documents

Figure 7 provides options for clustering. Because we are defining a patient severity index, we specified that exactly 5 clusters should be defined. The number can be changed by the user. The number of terms (ICD9 codes) used to describe the cluster can also be specified by the user; in this example, we limit the number to ten. We use the standard defaults of Expectation Maximization and Singular Value Decomposition. It is recommended that these not be changed. The results as shown in the interactive window are given in Figure 8.

Figure 7.
Clustering Options in Text Miner

Note that a third window in Figure 8 is shown giving the clusters. This window is enlarged in Figure 9.

Figure 8. Interactive Window After Clustering

Figure 9. Cluster Window

In order to compare outcomes by text clusters, we first need to know what datasets are defined. This is done by highlighting the connection between nodes (Figure 10). The resulting datasets are given in Figure 11.

Figure 10. Finding the Datasets

Figure 11. Datasets Defined in Text Miner (one dataset stores the original data plus cluster values; a second stores the cluster descriptions)

Once we have the datasets, we use the following code in the SAS Code node. Its purpose is to put both the cluster descriptions and the cluster numbers in the original dataset.

data sasuser.clusternis (keep=_cluster_ _freq_ _rmsstd_ clus_desc);
   set emws.text7_cluster;
run;
data sasuser.desccopynis (drop=_svd_1-_svd_500 prob1-prob500);
   set emws.text7_documents;
run;
proc sort data=sasuser.clusternis;
   by _cluster_;
run;
proc sort data=sasuser.desccopynis;
   by _cluster_;
run;
data sasuser.nistextranks;
   merge sasuser.clusternis sasuser.desccopynis;
   by _cluster_;
run;

To make a comparison of outpatient cost by cluster, we use kernel density estimation with the following code:

proc kde data=sasuser.totalcostandclusters;
   univar opxp03x_sum / gridl=0 gridu=10000 out=meps.kdecostbycluster;
   by _cluster_;
run;

Figure 12 gives the text clusters, with the corresponding text translation given in Table 5.

Figure 12. Text Clusters for the NIS Data Sample

Table 5.
Translation of Clusters in Figure 12

Cluster 1: Feeding problems in newborn; unspecified fetal and neonatal jaundice; other "heavy-for-dates" infants; neonatal hypoglycemia; 33-34 weeks of gestation; cardiorespiratory distress syndrome of newborn; viral hepatitis; transitory tachypnea of newborn; other respiratory problems after birth; single liveborn, born in the hospital by Cesarean delivery

Cluster 2: Bipolar I disorder, most recent episode (or current) unspecified; anemia; alcohol abuse; extrinsic asthma, unspecified; other convulsions; other and unspecified alcohol dependence; tobacco use disorder; single liveborn; unspecified hypothyroidism; other specified personal history presenting hazards to health

Cluster 3: Unspecified essential hypertension; aortocoronary bypass status; coronary atherosclerosis of native coronary artery; esophageal reflux; urinary tract infection, site not specified; atrial fibrillation; chronic airway obstruction, not elsewhere classified; volume depletion; pure hypercholesterolemia; other and unspecified hyperlipidemia

Cluster 4: Asthma, unspecified; hyposmolality and/or hyponatremia; other specified cardiac dysrhythmias; obesity, unspecified; osteoarthrosis, unspecified whether generalized or localized; morbid obesity; hypopotassemia; anxiety state, unspecified; chest pain; osteoporosis, unspecified

Cluster 5: Abnormal glucose tolerance in pregnancy; abnormality in fetal heart rate or rhythm; elderly multigravida; other injury to pelvic organs; post term pregnancy; cord around neck, with compression; normal delivery; transient hypertension of pregnancy; breech presentation, buttocks version; hypertension secondary to renal disease, complicating pregnancy, childbirth, and the puerperium

Note that cluster 1 contains codes primarily related to newborns while cluster 5 contains codes primarily related to pregnancy and delivery.
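Conceptually, the preprocessing above amounts to: concatenate the DX columns into one "document" per patient, then keep terms that appear in at least two documents (as in Figure 5) before clustering on code co-occurrence. A minimal sketch of those first two steps (toy ICD9 strings, not NIS data; Text Miner's EM/SVD clustering itself is not reproduced here):

```python
from collections import Counter

# Toy patient records: up to 15 DX columns, many empty (mimics DX1-DX15)
patients = [
    ["25000", "4019", "", ""],       # diabetes + hypertension
    ["4019", "41401", "", ""],       # hypertension + coronary atherosclerosis
    ["25000", "4019", "41401", ""],  # all three
]

# Step 1: the catx(' ', DX1, ..., DX15) equivalent -- join the non-empty codes
documents = [" ".join(code for code in row if code) for row in patients]
print(documents[0])  # "25000 4019"

# Step 2: keep only terms appearing in 2 or more documents
doc_freq = Counter(term for row in patients for term in set(c for c in row if c))
kept = sorted(t for t, n in doc_freq.items() if n >= 2)
print(kept)  # ['25000', '4019', '41401']
```

The surviving terms form the term-by-document matrix that the clustering step (EM on the SVD dimensions) then operates on.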
Because these two clusters are so dominated by specific patient conditions (because there are so many admissions for childbirth), it is perhaps better to define text clusters of patient conditions that use more specific DRG codes to define severity measures by major MDC category. We will demonstrate this approach in a later section. The two outcome measures considered are total charges and length of stay. We first show the relationship between the different measures with a series of table analyses (Tables 6-12). Table 6 compares the APRDRG mortality risk by defined text clusters. The APRDRG value of 0 indicates that no measure was calculated for these patients. Table 6. Comparison of APRDRG Risk of Mortality and Defined Text Clusters Text CLUSTER (#) APRDRG Total Risk of 1 2 3 4 5 Mortality 61 2 6 17 13 0 23 3.28 9.84 27.87 21.31 37.70 0.03 0.03 0.06 0.15 0.28 50605 7113 15553 12428 7497 1 8014 14.06 30.73 24.56 14.81 15.84 99.68 68.85 42.86 84.87 98.44 12 There is clearly no real measure of agreement between these two risk measures. Also, for APRDRG, approximately two thirds of all patients were defined in group 1. For this reason, providers who can shift their patients from 1 to 2 by improved coding will receive more favorable quality rankings. SAS Global Forum 2007 APRDRG Risk of Mortality 2 3 4 Total Data Mining and Predictive Modeling Text CLUSTER (#) 1 2 73 0.43 0.90 10 0.16 0.12 21 1.10 0.26 8141 1039 6.09 11.76 216 3.55 2.45 68 3.57 0.77 8833 Total 3 4 5 10972 64.35 37.83 4155 68.37 14.33 1428 74.92 4.92 29000 4947 29.01 21.90 1695 27.89 7.50 389 20.41 1.72 22590 20 0.12 0.28 1 0.02 0.01 0 0.00 0.00 7136 17051 6077 1906 75700 Table 7. 
Comparison of APRDRG Severity Rank by Defined Text Cluster
[Cross-tabulation of APRDRG severity rank (rows 0-4) by text cluster (columns 1-5): cell counts with row and column percentages; N = 75,700.]

Table 8. Comparison of APRDRG Risk of Mortality to APRDRG Severity
[Cross-tabulation of APRDRG severity (rows 0-4) by APRDRG risk of mortality (columns 0-4): cell counts with row and column percentages; N = 75,700.]

Table 9. Comparison of Disease Staging: Mortality Level to Text Clusters
[Cross-tabulation of disease staging mortality level (rows 0-5) by text cluster (columns 1-5): cell counts with row and column percentages.]

Table 7 compares the APRDRG severity rank to the text clusters; again, there is no pattern of agreement. The APRDRG severity rank also concentrates most patients in the lower ranks; only about 20% of all patients are in groups 3 and 4. Table 8 compares the two APRDRG measures to each other. While there is substantial agreement for groups 1 and 4, there are few similarities for groups 2 and 3. Table 9 compares the disease staging mortality level to the text clusters. As with the previous tables, there is little agreement between the two.
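Tables 6-12 are count / row-percent / column-percent crosstabs of the kind PROC FREQ produces. A minimal Python sketch of the same computation; the (cluster, risk) pairs below are hypothetical stand-ins for the paper's 75,700 patients:

```python
from collections import Counter

# Hypothetical (text_cluster, aprdrg_risk) pairs, one per patient.
pairs = [(1, 1), (1, 1), (2, 1), (2, 2), (3, 2), (3, 3), (3, 1), (4, 2)]

counts = Counter(pairs)
row_totals = Counter(r for r, _ in pairs)
col_totals = Counter(c for _, c in pairs)

def cell(r, c):
    """One crosstab cell as (count, row percent, column percent), PROC FREQ style."""
    k = counts[(r, c)]
    row_pct = 100.0 * k / row_totals[r] if row_totals[r] else 0.0
    col_pct = 100.0 * k / col_totals[c] if col_totals[c] else 0.0
    return k, row_pct, col_pct
```

In SAS itself this is a one-liner along the lines of `proc freq data=patients; tables cluster*risk; run;`, which emits the count and both percentages in each cell.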
Table 10. Comparison of Disease Staging: Resource Demand Level to Text Clusters
[Cross-tabulation of disease staging resource demand level (rows 1-5) by text cluster (columns 1-5): cell counts with row and column percentages; N = 75,656.]

Table 11. Comparison of Disease Staging: Mortality Level to Disease Staging: Resource Demand Level
[Cross-tabulation of resource demand level (rows 1-5) by mortality level (columns 0-5): cell counts with row and column percentages.]

Table 10 compares the disease staging resource demand level to the text clusters. In this case, there is some agreement between cluster 1 and resource level 1.
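The repeated verdict of "little agreement" between severity measures can be quantified: Cramér's V, computed from any of these contingency tables, is near 0 under independence and 1 under perfect association. A self-contained sketch, with the chi-square statistic computed from scratch (the 2x2 tables in the test are illustrative, not the paper's):

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                chi2 += (obs - expected) ** 2 / expected
    # Normalize by the smaller table dimension minus one.
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))
```

A table with all mass on the diagonal gives V = 1 (perfect agreement); a table whose rows are proportional gives V = 0, the pattern the text describes for most of these measure pairs.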
Table 11 compares the two disease staging measures to each other. While there is some agreement between demand level 1 and mortality level 2, there is little agreement elsewhere between the two measures.

Table 12. Comparison of Disease Staging: Mortality Level to APRDRG Risk of Mortality
[Cross-tabulation of disease staging mortality level (rows 0-5) by APRDRG risk of mortality (columns 0-4): cell counts with row and column percentages; N = 75,564.]

Table 12 compares the two measures of mortality, APRDRG and disease staging. There is little agreement here as well. Table 13 compares the levels defined using weight of evidence (WOE) to the five clusters defined by Text Miner. While there is variability, there is also some consistency. For example, 55% of WOE level 1 falls in text cluster 4; similarly, almost 100% of cluster 5 is in WOE level 6. This indicates that both methods provide better results than logistic regression.

Table 13.
WOE Levels Compared to Text Clusters
[Cross-tabulation of WOE level (rows 0-9) by text cluster (columns 1-5): cell counts with row and column percentages.]

Figure 13. Kernel Density Estimators of Inpatient Costs by Text Cluster

We next examine the kernel density estimators of the outcomes. Figure 13 gives the relationship of text cluster to inpatient costs. From the shape of the graphs, the ordering is cluster 5 < 2 < 1 < 4 < 3. There are two cutpoints in the figure. The first occurs at $11,000, where clusters 1 and 5 transition to lower probability compared to clusters 3 and 4; the second occurs at $16,000, where cluster 2 transitions to a lower cost. Figure 14 examines the APRDRG severity measure by inpatient cost. As should happen, group 1 < 2 < 3 < 4, with group 4 having considerable variability. The major cutpoint occurs at $12,000.

Figure 14. Kernel Density Estimate of APRDRG Severity by Inpatient Cost

Figure 15 examines the relationship of inpatient costs to the APRDRG mortality risk measure. Group 1 < 2 < 3 < 4 in terms of costs, with groups 3 and 4 having considerable variability compared to groups 1 and 2. The cutpoint between groups 1 and 2 occurs at $10,000; the cutpoint between groups 3 and 4 occurs at $55,000.
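The curves in Figures 13-17 come from PROC KDE; the estimator underneath is simply an average of Gaussian kernels centered at the data points. A minimal univariate version, with toy charge data and a hand-picked bandwidth (PROC KDE instead selects a bandwidth automatically):

```python
import math

def gaussian_kde(sample, x, bandwidth):
    """Gaussian kernel density estimate of the sample, evaluated at point x."""
    h = bandwidth
    n = len(sample)
    return sum(
        math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample
    ) / (n * h * math.sqrt(2 * math.pi))

# Toy inpatient charges for two hypothetical clusters (not the paper's data).
low_cost = [4000, 5000, 6000, 7000]
high_cost = [14000, 16000, 18000, 20000]

# Near $5,000 the low-cost cluster has far more density than the high-cost one,
# which is exactly the kind of separation the figures display.
d_low = gaussian_kde(low_cost, 5000, bandwidth=1500)
d_high = gaussian_kde(high_cost, 5000, bandwidth=1500)
```

Evaluating the estimate on a grid of charge values and plotting one curve per severity group reproduces the style of comparison used throughout this section.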
Figure 15. Kernel Density Estimate of Inpatient Costs by APRDRG Mortality Risk

Figure 16 gives the estimates using the disease staging resource demand level measure. Again, group 1 < 2 < 3 < 4 < 5. However, the graph for group 1 is extremely narrow, with very small variability and almost zero probability of costs higher than $5,000. The probability of costs beyond $25,000 is also almost zero for every group. Thus, while the groups separate well, they seem to be poor predictors of actual costs.

Figure 16. Kernel Density Estimates of Inpatient Costs by Resource Demand Level

Figure 17 examines the disease staging mortality level. Again, it can only predict inpatient costs up to about $35,000 with positive probability, indicating that it is a poor predictor of actual patient costs. It has multiple cutpoints: $12,000 for groups 1 and 2, $10,000 for groups 2 and 3, and $14,000 for groups 1 and 3; over that interval, 1 < 3 < 2. The cutpoint for groups 3 and 4 occurs at $25,000. Because of this shift, with 1 < 3 < 2, there is a strong probability that some of the hospitals in the sample are shifting patients into a higher group through over-coding of secondary patient conditions.

Figure 17. Kernel Density Estimate of Disease Staging: Mortality Level

Figure 18. Kernel Density Estimate of LOS by Text Cluster

We also examine the length of stay (LOS) as an outcome variable by the different patient severity measures. Figure 18 gives the kernel density estimate of LOS by text cluster. The pattern is very regular, with cluster 5 < 1 < 2 < 4 < 3. Clusters 2 and 4 are almost identical in terms of distribution. The cutpoint occurs just after day 4, with cluster 5 having a very low probability of exceeding a 4-day inpatient stay. The text clusters have positive probability out to day 12.
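A "cutpoint" in these figures is the value at which one group's density curve crosses below another's. Given two curves, the crossing can be located by scanning a grid for a sign change in their difference. The triangular curves below are hypothetical stand-ins for KDE output, with charges in $1,000s; they are not the paper's densities:

```python
def crossover(f, g, grid):
    """Return the first grid point where f - g changes sign, else None."""
    prev = f(grid[0]) - g(grid[0])
    for x in grid[1:]:
        cur = f(x) - g(x)
        if prev == 0 or (prev > 0) != (cur > 0):
            return x
        prev = cur
    return None

# Two hypothetical density-like curves: one peaks at 8 ($8,000), one at 14 ($14,000).
f = lambda x: max(0.0, 1.0 - abs(x - 8) / 6)
g = lambda x: max(0.0, 1.0 - abs(x - 14) / 6)

# Scan charges from $6,000 to $24,000 in $1,000 steps; the curves cross at 11.
cut = crossover(f, g, grid=[float(x) for x in range(6, 25)])
```

With real KDE output the same scan over the evaluation grid locates cutpoints such as the $11,000 and $16,000 values reported for Figure 13.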
Figure 19 gives the estimate of LOS by the APRDRG severity measure. It has positive probability out to day 20. However, between days 4 and 5, the ordering becomes 1 < 3 < 2, indicating a shift of patients from class 2 to class 3 caused by over-coding at some hospitals.

Figure 19. Kernel Density Estimate of LOS by APRDRG Severity Measure

Figure 20 shows the comparison of LOS to the APRDRG mortality index. The pattern looks almost the same as in Figure 19, with 1 < 3 < 2 for values between 4 and 6.5 days.

Figure 20. Kernel Density Estimate of LOS for APRDRG Mortality Index

Figure 21 shows the resource demand level with respect to LOS. For the most part, the LOS for patients in group 1 equals 2 days, with much smaller probabilities of 1 and 3 days. Group 2 also peaks at a 2-day LOS, with just a slightly higher probability of 4 or more days compared to group 1. These two curves strongly suggest a shifting of patients from group 1 to group 2. Groups 3, 4, and 5 behave as expected for this estimate. Also, the positive probability ends at about day 10, compared to day 12 or day 20 for the previous measures.

Figure 21. Kernel Density Estimate of LOS for Disease Staging: Resource Demand Level

In contrast, as shown in Figure 22, the disease staging mortality level has a much more regular pattern, with 1 < 2 < 3 < 4 < 5.

Figure 22. Kernel Density Estimate of LOS for Disease Staging: Mortality Level

CONCLUSION
It is possible to develop a model to rank the quality of care that does not assume uniformity of data entry. The model can also be validated by examining additional outcome values in the data.
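The model relies on the stemming property of ICD9 codes: the first three digits identify the primary diagnostic category, and the remaining digits refine it. Collapsing codes to their stems before clustering can be sketched as follows; the handling of E-codes (which carry a four-character category) is my assumption, not something the paper specifies:

```python
def icd9_stem(code: str) -> str:
    """Collapse an ICD9 code to its primary category.

    '774.6' -> '774'.  V-codes keep 'V' plus two digits; E-codes
    (assumed here) keep 'E' plus three digits before the decimal.
    """
    head = code.split(".")[0]
    return head[:4] if head.startswith("E") else head[:3]

# Illustrative codes only; collapse detailed codes to their categories.
codes = ["774.6", "774.2", "401.9", "401.1", "V30.00"]
stems = sorted({icd9_stem(c) for c in codes})
```

Clustering on the stems rather than the full codes greatly reduces the number of distinct nominal values while preserving the primary-category structure the conclusion describes.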
The model is developed by using the stemming properties of the ICD9 codes, where the first three digits of a code represent the primary category and the remaining digits refine the diagnosis. The model compares well to those developed through the standard logistic regression technique, and the text clusters that are defined can then be validated against the outcome variables.

ACKNOWLEDGMENTS
I appreciate the valuable input from my co-investigator and domain expert, John C. Cerrito, PharmD. The work in this paper was supported by NIH grant #1R15RR017285-01A1, Data Mining to Enhance Medical Research of Clinical Data.

CONTACT INFORMATION
Patricia B. Cerrito
University of Louisville
Department of Mathematics
Louisville, KY 40292
Work Phone: 502-852-6010
Fax: 502-852-7132
Email: [email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.