SAS Global Forum 2007
Data Mining and Predictive Modeling
Paper 085-2007
Text Mining and PROC KDE to Rank Nominal Data
Patricia B. Cerrito, University of Louisville, Louisville, KY
ABSTRACT
By definition, nominal data cannot be ranked. However, there are circumstances where it is essential to rank nominal
data. Examples of such ranking include ranking hospitals and colleges, defining the “most livable cities”, and
conference paper submissions. In this project, we consider ranking patient severity. The purpose is to determine how
patient severity can be used to rank the quality of hospital performance. There are thousands of patient diagnoses
and co-morbidities that make such a ranking very difficult. Generally, nominal variables have been ranked by using
quantitative outcome variables. Currently, hospital quality measures used stepwise logistic regression to reduce the
number of patient diagnoses considered to define a measure of patient severity. More recently, a weight-of-evidence
method has been developed for predictive modeling such that nominal data are compressed and ranked using a
target variable. However, there are now methods available that allow for ranking nominal data that do not require
outcome variables; instead, outcome variables can be used to validate the ranking. Ranking can be done using SAS
Text Miner to compress nominal data fields containing information on patient diagnoses, combined with PROC KDE
to define and validate the patient severity ranking. It will be demonstrated that SAS Text Miner can define an implied
ranking of nominal fields that is identified through the application of PROC KDE. Once the patient severity rank has
been defined, it will be used to examine patient outcomes, and physician variability in patient outcomes.
INTRODUCTION
More and more, healthcare providers will be judged and reimbursed based upon their performance on quality
measures. Since patient outcomes depend very much on patient conditions, any performance measure will need to
use a patient severity index. Because of the subjective nature of defining a patient’s condition, attempts have been
made to define an objective index so that healthcare providers can be compared. Even though some measures are
generally accepted, they are still problematic.
While the example discussed in this paper is focused on the development of a patient severity index, the
methodology can be used to compress and to rank levels in any categorical variable. These applications include
inventory codes and customer purchases. For our example, we use data available from AHRQ (Agency for
Healthcare Research and Quality), the National Inpatient Sample (NIS). This dataset contains all inpatient events for 37
different participating states, approximately 8 million events per year. We use data from the year 2003. Up to 15
columns in the dataset are used to define the patient condition and another 15 columns are used to define patient
treatments. These columns are defined using a coding system developed by the World Health Organization. These
codes are available online at http://icd9cm.chrisendres.com. In addition to the patient condition, several patient
severity indices are included in the dataset; all of these indices use some form of logistic regression in their definition.
In this paper, we propose a new methodology for defining patient severity indices, one that does not depend upon
patient outcomes for its definition, so that outcomes can instead be used for validation. In sections 2 and 3, we
discuss the more standard methodology; the new methodology is presented in section 4. We will compare results
using the different methods of data compression.
STANDARD LOGISTIC REGRESSION METHODOLOGY TO DEFINE A PATIENT SEVERITY INDEX
The standard statistical method used to define a patient severity index for ranking the quality of care of healthcare
providers is that of regression. Logistic regression is used when the outcome variable considered is mortality (or a
severe adverse event); linear regression is used when the outcome variable is cost or length of stay. Given the
number of ICD9 codes that are available, the set of codes must be reduced to be used in a regression equation. At
some point, a stepwise procedure is used to find the codes that are the most related to the outcome variable.
The Agency for Healthcare Research and Quality (AHRQ) has developed a draft report on a number of quality
indicators.[1] All of the indicators are risk-adjusted using a measure of patient severity. One such measure is the all
patient refined diagnosis related group (APRDRG). This measure can be used to adjust patient severity, or to adjust
the risk of patient mortality. The formula used to define quality is:
logit(Pr(Y_ijk = 1)) = β_k0 + Σ_{p=1..P_k} α_kp (age/gender_p)_ij + Σ_{q=1..Q_k} (APRDRG_q)_ijk

where Y_ijk is the response for the j-th patient in the i-th hospital for the k-th quality indicator. The value
(age/gender_p)_ij is equal to the p-th age-by-gender zero-one indicator associated with the j-th patient in the i-th
hospital, and (APRDRG_q)_ijk is equal to the q-th APRDRG zero-one indicator variable associated with the j-th
patient in the i-th hospital for the k-th quality indicator. Then the risk-adjusted rate is equal to

Risk-adjusted rate = (Observed rate / Expected rate) × National average rate.
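As a worked example of this ratio adjustment (the three rates below are hypothetical illustration values, not NIS figures):

```python
# Risk-adjusted rate = (observed / expected) * national average.
# All three inputs are hypothetical illustration values.
def risk_adjusted_rate(observed, expected, national_average):
    return observed / expected * national_average

# A hospital whose observed mortality (2.0%) is below its model-expected
# mortality (2.5%) is adjusted below the national average (1.8%).
rate = risk_adjusted_rate(observed=0.020, expected=0.025, national_average=0.018)
print(round(rate, 4))  # 0.8 * 0.018 = 0.0144
```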
AHRQ suggests using a generalized model to account for correlations across patient responses in the same
hospital.[2] The risk-adjusted rate is also based upon a logistic regression. However, many organizations are using
standardized risk adjustment measures even though they have questionable validity when extrapolated to fresh
data. For example, the Tennessee Hospital Discharge Data System uses the expected mortality defined by the
equation:[3]

Expected mortality = 1/(1 + e^(-z)), where

z = -9.566 + 1.542(risk of death mortality weight) + 3.819(severity of illness mortality weight) - 18.07(gender mortality
weight) + 0.04(age) - 0.045(length of stay in days) + 6.937(APRDRG mortality weight) + 0.332(severity of illness
class) + 0.994(risk of death class)
and
• Risk of death mortality weight = average mortality rate for the risk of death classification the patient was assigned to
• Severity of illness mortality weight = average mortality rate for the severity of illness classification the patient was assigned to
• Gender mortality weight = average mortality rate for the patient’s gender
• APRDRG mortality weight = average mortality rate for the APRDRG the patient was grouped to
• Severity of illness class = severity of illness class assigned to the patient by the APRDRG system
• Risk of death class = risk of death class assigned to the patient by the APRDRG system
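The Tennessee equation can be evaluated directly once the weights are known. A minimal sketch in Python, where every patient input below is a hypothetical illustration value rather than an actual mortality weight:

```python
import math

# Tennessee expected-mortality equation, coefficients as published above.
# All eight patient inputs are hypothetical illustration values.
def expected_mortality(risk_wt, sev_wt, gender_wt, age, los, aprdrg_wt,
                       sev_class, risk_class):
    z = (-9.566 + 1.542 * risk_wt + 3.819 * sev_wt - 18.07 * gender_wt
         + 0.04 * age - 0.045 * los + 6.937 * aprdrg_wt
         + 0.332 * sev_class + 0.994 * risk_class)
    return 1.0 / (1.0 + math.exp(-z))  # logistic transform of z

p = expected_mortality(risk_wt=0.05, sev_wt=0.04, gender_wt=0.02,
                       age=70, los=5, aprdrg_wt=0.03,
                       sev_class=2, risk_class=2)
print(round(p, 4))
```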
For example, suppose there are five codes (A, B, C, D, E) used in a logistic regression with the outcome variable of
mortality. Then the regression equation can be written

P = α0 + α1(if A is present) + α2(if B is present) + α3(if C is present) + α4(if D is present) + α5(if E is present)

where P is the predicted probability of mortality and α0 + α1 + α2 + α3 + α4 + α5 = 1. The predicted probability increases as the
number of codes increases. One assumption of regression is that the 5 codes are independent (and uncorrelated).
For example, if A=diabetes and B=congestive heart failure (CHF) then the likelihood of someone with diabetes
having CHF is no greater than the likelihood of someone without diabetes having CHF. However, since diabetes can
lead to heart disease, this assumption of independence is clearly false.
Another problem with logistic regression is that it gives very questionable results with very disparate group sizes.
There is a value p chosen such that if P<p then the model predicts no mortality, but if P>p then the model
predicts mortality. Suppose we are looking at a condition where the mortality is 1%. Then if p=1, the accuracy level
of the model is 99%, although the model has no predictive value: the false negative rate (predicting no
mortality when mortality occurs) will be 100% while the false positive rate (predicting mortality when no mortality
occurs) will be 0%. As p decreases, the accuracy of the model will decrease slightly, as will the false negative rate;
however, the false positive rate will increase, and the false negative rate will typically remain over 90%.
Unfortunately, disparate group sizes are almost never accounted for in the development of the logistic regression
model.
Consider, for example, that αi=0.20 for i=1,2,3,4,5 and α0=0. If p=1 then all 5 codes must be present in order to
predict mortality. If p=.8, then an individual must have 4 out of the 5 codes to predict mortality, and so on. The
possible threshold values are 0, .2, .4, .6, .8, and 1. The equal values indicate that there is an equally likely chance of
0.2 that any one code is related to one patient. Therefore, the probability of having all five codes is (.2)^5 = 0.00032.
The probability of having 4 out of 5 is equal to 5(.2)^4(.8) = 0.0064. The remaining probabilities are 0.0512 (for 3),
0.2048 (for 2), 0.4096 (for 1) and 0.32768 (for 0). That gives the possible threshold values for predicting mortality.
Regardless of what the codes then represent, those are the only possible threshold values. If the threshold value
requires all 5 codes, then 0.99968 of all patients will not have a predicted value of mortality. If only 4 of the 5 codes
are required, then 0.99328 will not have a predicted value of mortality. Once the number of codes required is equal
to 3 or more, the predicted mortality rate climbs to 0.05792.
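These binomial probabilities can be checked directly; a quick sketch in Python:

```python
from math import comb

p = 0.2  # probability that any one of the five codes is recorded for a patient

# Binomial probability mass function: P(exactly k of the 5 codes present).
pmf = [comb(5, k) * p**k * (1 - p)**(5 - k) for k in range(6)]
print([round(x, 5) for x in pmf])
# k=0: 0.32768, k=1: 0.4096, k=2: 0.2048, k=3: 0.0512, k=4: 0.0064, k=5: 0.00032

print(round(sum(pmf[3:]), 5))  # P(3 or more codes) = 0.05792
```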
The differential between the predicted mortality rate and the actual mortality rate is used to rank healthcare
providers.[4] If the predicted mortality is much higher than the actual mortality rate, then the provider will have a higher
ranking. One way to improve the ranking is to make the predicted mortality as high as possible; that is, by reporting
all patients (or many patients) above the threshold value. If the 5 codes are known to the provider, extra diligence
can be placed on the documentation of those 5 regardless of how well other codes are documented; this becomes more
difficult if the 5 codes are not known to the provider.
Another problem with the use of regression is the requirement that the codes are uniformly entered by all providers.
Entry of codes depends upon the accuracy of documentation. Consider, for example, all of the 4-digit codes
associated with diabetes:
• 250.0 Diabetes mellitus without mention of complication
• 250.1 Diabetes with ketoacidosis
• 250.2 Diabetes with hyperosmolarity
• 250.3 Diabetes with other coma
• 250.4 Diabetes with renal manifestations
• 250.5 Diabetes with ophthalmic manifestations
• 250.6 Diabetes with neurological manifestations
• 250.7 Diabetes with peripheral circulatory disorders
• 250.8 Diabetes with other specified manifestations
• 250.9 Diabetes with unspecified complication
The fifth digit for diabetes represents the following:
• 0 type II or unspecified type, not stated as uncontrolled
• 1 type I [juvenile type], not stated as uncontrolled
• 2 type II or unspecified type, uncontrolled
For a complete listing of the ICD9 codes, the interested reader is referred to
http://www.disabilitydurations.com/icd9top.htm. The term “uncontrolled” is left undefined. It is questionable whether
every physician who admits patients to a hospital will document “uncontrolled” on the patient’s chart. If providers
document differently, those who document the codes for the regression equation will rank higher compared to those
who do not. Suppose one provider documents the 5 codes at 25% rather than 20% while another documents at
15%. Then the probability of having 3 or more codes is equal to 0.1035 for the first provider but 0.0266 for the
second. If they have the same actual mortality rate, provider 1 will rank considerably higher compared to provider 2.
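The effect of differential documentation rates can be verified with the same binomial calculation; a short Python sketch:

```python
from math import comb

def p_three_or_more(p, n=5):
    """Probability of carrying 3 or more of the n codes when each code
    is documented with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3, n + 1))

print(round(p_three_or_more(0.25), 4))  # provider documenting at 25%: 0.1035
print(round(p_three_or_more(0.15), 4))  # provider documenting at 15%: 0.0266
```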
In practice, the number of ICD9 codes used is more than 5, and the probabilities of each code occurring will not be
equally likely. However, the result is basically the same. If the provider knows what codes are used in the regression
equation, the only codes that must be carefully documented are the ones in the regression; the rest have no
importance. For example, the Healthcare Financing Administration (HCFA) first uses cluster analysis on the codes to
group the ICD9 codes, with each code variable defined as 1 if present and 0 if absent, followed by stepwise logistic
regression to identify the most important cluster.[5] The result will be extremely biased if some providers have access
to the regression equation while others do not; those who do not are penalized.
Also, different regression equations can result in different rankings. While many patients will have similar severity
ranks across different formulas, for as much as 20% of the patient base, different measures can vary considerably.[6]
Iezzoni, Ash, et al. state, “Detailed evaluation of severity measures appears to be a narrow methodologic pursuit, far
removed from daily medical practice. Nevertheless, severity-adjusted death rates are widely used as putative quality
indicators in health care provider ‘report cards’.” Monte Carlo simulation also indicates problems in the model; these
logistic models tend to have low positive predictive value, indicating issues with disparate group sizes for rare
occurrences.[7] Other studies show even worse agreement, indicating that the measures can only identify outlier
providers that perform very poorly and that the measures are not valid for ranking all providers.[8-10]
To define a patient severity model using the National Inpatient Sample (NIS) database, we use the patient condition
codes, labeled DX1-DX15 in the dataset. We create a series of indicator functions for each ICD9 code that the user
wants to investigate for a severity index. While time consuming, the user can use all of the codes and then reduce
them by other means. The data step code for creating these indicator functions is as follows:
Data sasuser.indicatorcodes;
   Set sasuser.NISdata2003;
   Array dx{15} DX1-DX15;  /* the 15 diagnosis columns */
   n25000=0;               /* indicator for ICD9 code 250.00 */
   Do i=1 to 15;
      If dx{i}='25000' then n25000=1;
   End;
Run;
The above code defines just one indicator function; others can be added. Once defined, the indicator functions are
used in a logistic regression with the Regression Node in Enterprise Miner.
The NIS database provides a number of severity measures that are defined using regression. Therefore, we can
compare the results from defining a severity measure using text analysis to the results using the more traditional
regression. We use the data from the year 2003. The measures that we compare are
• APRDRG mortality risk, using the all patient refined DRG as developed by AHRQ
• APRDRG severity, using the all patient refined DRG as developed by AHRQ
• Disease staging: mortality level as developed by Medstat
• Disease staging: resource demand level as developed by Medstat
All four measures were developed using a logistic regression or linear regression process (depending upon whether
the outcome variable was discrete as in mortality level or continuous as in resource demand level). We will also
compare these different measures to each other.
WEIGHT OF EVIDENCE METHOD
There are some simple methods to reduce a complex categorical variable. Probably the easiest is to define a level
called ‘other’. All but the most populated levels can be rolled into the ‘other’ level. This method has the advantage
of allowing the investigator to define the number of levels, and to immediately reduce the number to a manageable
set. For example, in a study of medications for diabetes, there were a total of 358 different medications prescribed. It
is possible to use the ten most popular medications and then combine the remaining medications into ‘other’.
However, the ‘other’ category should consist of fairly homogeneous values. This can cause a problem when
examining patient condition codes. Some patient conditions that are rarely used can require extraordinary costs and
should not be rolled into a general ‘other’ category. An example of this is a patient who requires a heart transplant
and who has a ventricular assist device inserted. The condition is rare, but the cost is extraordinary.
Another method is called target-based enumeration. In this method, the levels are quantified by using the average of
the outcome variable within each level. The level with the smallest outcome is recoded with a 1, the next smallest
with a 2, and so on. A modification of this technique is to use the actual outcome average for each level. Levels with
identical expected outcomes are merged. This modification is called weight-of-evidence recoding.[11]
The weight-of-evidence (WOE) technique works well when the number of observations per level is sufficiently large
to get a stable outcome variable average.[11] It does not generalize well to fresh data if the number of levels is large
and the number of observations per level is small. In addition, there must be one clearly defined target variable,
since the levels are defined in terms of that target. In a situation where there are multiple targets, the recoding of
categories is not stable. In addition, the target variable is assumed to be interval so that the average value has
meaning. If there are only two outcome levels (for example, mortality), the weight-of-evidence would be reduced to
defining a level by the number of observations in that level with an outcome value of 1 (death). We use a macro to
define the weights for this process.[12] The WOE macro yields a total of 10 different levels. These levels are compared
to results defined using logistic regression as provided in the database (Tables 1, 2).
libname woe '.';
options mstored sasmstore=woe;
%macro smooth_weight_of_evidence(data=,
                                 out=,
                                 input=,
                                 target=,
                                 n_prior=,
                                 n0n1=,
                                 fname=)/store source;
%if &fname= %then %let fname=$w%substr(&input,1,6);
%else %let fname=$w%substr(&fname,1,6);
proc sql;
   create table f as
   select "&fname" as FMTNAME,
          &input as START,
          sum(&target=1) as N1,
          sum(&target=0) as N0
   from &data
   group by &input;
quit;
data f;
   set f end=last;
   LABEL=log((N1+&n_prior/(1+&n0n1))/(N0+&n_prior*&n0n1/(1+&n0n1)));
   output;
   if last then do;
      START='OTHER';
      LABEL=-log(&n0n1);
      output;
   end;
run;
proc format cntlin=f;
run;
data &out;
   set &data;
   w_&input=put(&input,&fname..)+0;
run;
%mend;

libname ed 'c:\ed';
%smooth_weight_of_evidence(data=sasuser.niswoe, out=nis.niswoe,
   input=dx1, target=totchg, n_prior=50, n0n1=2);
run;
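The smoothing in the macro's LABEL computation can be sketched outside SAS. A minimal Python analogue, assuming a binary 0/1 target (the macro call above passes a continuous target, but the N1/N0 counts assume two outcome levels); the diagnosis-code levels and counts below are hypothetical:

```python
from collections import defaultdict
from math import log

def smooth_woe(records, n_prior=50, n0n1=2):
    """records: iterable of (level, target) pairs with target 0 or 1.
    Returns {level: weight} using the macro's smoothed formula:
    log((N1 + n_prior/(1+n0n1)) / (N0 + n_prior*n0n1/(1+n0n1)))."""
    n1 = defaultdict(int)  # target=1 count per level
    n0 = defaultdict(int)  # target=0 count per level
    for level, target in records:
        if target == 1:
            n1[level] += 1
        else:
            n0[level] += 1
    prior1 = n_prior / (1 + n0n1)         # prior pseudo-count for target=1
    prior0 = n_prior * n0n1 / (1 + n0n1)  # prior pseudo-count for target=0
    return {lvl: log((n1[lvl] + prior1) / (n0[lvl] + prior0))
            for lvl in set(n1) | set(n0)}

# Hypothetical records: (diagnosis code, outcome).
data = [('25000', 1)] * 30 + [('25000', 0)] * 70 + [('4280', 1)] * 5
weights = smooth_woe(data)
```

The prior pseudo-counts pull levels with few observations toward the overall odds, which is what lets the recoding tolerate thinly populated levels.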
Table 1. WOE Levels Compared to APRDRG Mortality Risk From Logistic Regression
(each cell: frequency / row percent / column percent)

APRDRG                                      WOE Level
Mortality   0        1        2        3        4        5        6        7        8        9        Total
Risk
0           9        1        56       39       13       4        7        29       1        7        166
            5.42     0.60     33.73    23.49    7.83     2.41     4.22     17.47    0.60     4.22
            0.06     0.00     0.08     0.21     0.01     0.00     0.01     0.03     0.00     0.03
1           6457     8548     48716    13774    78701    52241    111932   63531    17255    14410    415565
            1.55     2.06     11.72    3.31     18.94    12.57    26.93    15.29    4.15     3.47
            40.63    36.90    68.24    73.23    44.99    53.09    95.40    75.42    63.93    55.54
2           4189     9147     16765    3630     66001    28615    4466     16555    6839     7242     163449
            2.56     5.60     10.26    2.22     40.38    17.51    2.73     10.13    4.18     4.43
            26.36    39.48    23.49    19.30    37.73    29.08    3.81     19.65    25.34    27.91
3           3254     4492     4864     937      22873    13361    764      3399     1998     3055     58997
            5.52     7.61     8.24     1.59     38.77    22.65    1.29     5.76     3.39     5.18
            20.48    19.39    6.81     4.98     13.08    13.58    0.65     4.04     7.40     11.77
4           1982     978      983      428      7348     4181     159      718      898      1233     18908
            10.48    5.17     5.20     2.26     38.86    22.11    0.84     3.80     4.75     6.52
            12.47    4.22     1.38     2.28     4.20     4.25     0.14     0.85     3.33     4.75
Total       15891    23166    71384    18808    174936   98402    117328   84232    26991    25947    657085
Table 2. WOE Levels Compared to APRDRG Severity
(each cell: frequency / row percent / column percent)

APRDRG                                      WOE Level
Severity    0        1        2        3        4        5        6        7        8        9        Total
0           9        1        56       39       13       4        7        29       1        7        166
            5.42     0.60     33.73    23.49    7.83     2.41     4.22     17.47    0.60     4.22
            0.06     0.00     0.08     0.21     0.01     0.00     0.01     0.03     0.00     0.03
1           3685     6215     27032    7945     49857    29745    75564    40402    9360     7390     257195
            1.43     2.42     10.51    3.09     19.38    11.57    29.38    15.71    3.64     2.87
            23.19    26.83    37.87    42.24    28.50    30.23    64.40    47.97    34.68    28.48
2           5143     9305     32028    7877     79251    38446    34710    32676    11481    10801    261718
            1.97     3.56     12.24    3.01     30.28    14.69    13.26    12.49    4.39     4.13
            32.36    40.17    44.87    41.88    45.30    39.07    29.58    38.79    42.54    41.63
3           4721     6354     10410    2504     37360    23356    6662     9830     4877     5966     112040
            4.21     5.67     9.29     2.23     33.35    20.85    5.95     8.77     4.35     5.32
            29.71    27.43    14.58    13.31    21.36    23.74    5.68     11.67    18.07    22.99
4           2333     1291     1858     443      8455     6851     385      1295     1272     1783     25966
            8.98     4.97     7.16     1.71     32.56    26.38    1.48     4.99     4.90     6.87
            14.68    5.57     2.60     2.36     4.83     6.96     0.33     1.54     4.71     6.87
Total       15891    23166    71384    18808    174936   98402    117328   84232    26991    25947    657085
The different patient severity indices have no obvious relationship. The WOE levels are scattered across both the
mortality and the severity indices. Since the different methods yield such different results, we need to find alternative
methods to determine a “good” severity index. Tables 3 and 4 compare WOE results to the two other disease
staging measures.
Table 3. WOE Levels Compared to Disease Staging: Mortality
(each cell: frequency / row percent / column percent)

Disease                                     WOE Level
Staging:    0        1        2        3        4        5        6        7        8        9        Total
Mortality
0           672      2218     6028     2965     1109     11791    84135    10518    2262     844      122542
            0.55     1.81     4.92     2.42     0.90     9.62     68.66    8.58     1.85     0.69
            4.26     9.58     8.46     15.78    0.63     12.01    71.71    12.50    8.43     3.26
1           475      245      2333     1211     3707     5157     18233    3895     3151     1042     39449
            1.20     0.62     5.91     3.07     9.40     13.07    46.22    9.87     7.99     2.64
            3.01     1.06     3.27     6.45     2.12     5.25     15.54    4.63     11.75    4.03
2           3091     3109     29531    7452     42989    20109    8247     41240    5579     5072     166419
            1.86     1.87     17.74    4.48     25.83    12.08    4.96     24.78    3.35     3.05
            19.58    13.43    41.45    39.66    24.60    20.48    7.03     49.00    20.80    19.60
3           4359     6752     26447    5649     87958    38966    6326     25707    11937    15404    229505
            1.90     2.94     11.52    2.46     38.33    16.98    2.76     11.20    5.20     6.71
            27.61    29.17    37.12    30.07    50.32    39.69    5.39     30.55    44.49    59.54
4           4628     8456     6078     1251     32880    16424    352      2505     3161     3121     78856
            5.87     10.72    7.71     1.59     41.70    20.83    0.45     3.18     4.01     3.96
            29.31    36.53    8.53     6.66     18.81    16.73    0.30     2.98     11.78    12.06
5           2563     2365     829      260      6143     5723     26       292      738      390      19329
            13.26    12.24    4.29     1.35     31.78    29.61    0.13     1.51     3.82     2.02
            16.23    10.22    1.16     1.38     3.51     5.83     0.02     0.35     2.75     1.51
Total       15788    23145    71246    18788    174786   98170    117319   84157    26828    25873    656100
There seems to be no relationship between the two measures as shown in Table 3; the same non-relationship is
shown in Table 4. In Table 4, there are only 5 values in code 1 because it represents the unknown codes, and is not
part of disease staging.
Table 4. WOE Levels Compared to Disease Staging: Resource Demand
(each cell: frequency / row percent / column percent)

Disease                                     WOE Level
Staging:    0        1        2        3        4        5        6        7        8        9        Total
Resource
Demand
1           0        0        0        0        0        1        0        2        0        2        5
            0.00     0.00     0.00     0.00     0.00     20.00    0.00     40.00    0.00     40.00
            0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.01
2           2143     210      7611     5724     11428    5488     62208    9330     1290     2237     107669
            1.99     0.20     7.07     5.32     10.61    5.10     57.78    8.67     1.20     2.08
            13.53    0.91     10.67    30.45    6.53     5.58     53.02    11.08    4.78     8.62
3           6002     8460     48785    9492     93090    58085    52752    48034    12773    11571    349044
            1.72     2.42     13.98    2.72     26.67    16.64    15.11    13.76    3.66     3.32
            37.88    36.52    68.39    50.50    53.22    59.03    44.96    57.04    47.33    44.61
4           6209     11179    12912    2755     53299    27629    2177     23698    10670    9504     160032
            3.88     6.99     8.07     1.72     33.31    17.26    1.36     14.81    6.67     5.94
            39.19    48.26    18.10    14.66    30.47    28.08    1.86     28.14    39.53    36.64
5           1489     3317     2027     824      17101    7191     186      3152     2256     2626     40169
            3.71     8.26     5.05     2.05     42.57    17.90    0.46     7.85     5.62     6.54
            9.40     14.32    2.84     4.38     9.78     7.31     0.16     3.74     8.36     10.12
In addition, we use kernel density estimation to compare the WOE levels in terms of outcomes. Figure 1 gives the
relationship of WOE level to total charges, which were used to define the WOE levels. Figure 2 gives the relationship
to Length of Stay. A natural ordering is established with total charges, with the exception of levels 1 and 3 which
have several cross-over points with the other levels. Level 6 has the highest probability of a lower cost compared to
all other levels. The highest probability of high cost is shared by levels 1, 2, and 3.
Figure 1. WOE Level by Total Charges
When we change the target value to length of stay, the ordering changes as well (Figure 2). Level 6 has the highest
probability of a short stay compared to all other levels, with level 7 the next lowest. However, crossover does occur
between levels 6 and 7 early on for the shortest stay. Level 1 now has the highest probability of a lengthy stay.
WOE finds costs up to about $40,000 and lengths of stay up to day 12.
Figure 2. WOE by Length of Stay
TEXT MINING METHODOLOGY
In order to perform text analysis on the NIS data file, the data must first be pre-processed. We need to merge the 15
columns containing patient conditions into one text string. We do this using a data step with the following statement:
String=catx(' ',DX1,DX2,DX3,DX4,DX5,DX6,DX7,DX8,DX9,DX10,DX11,DX12,DX13,DX14,DX15);
separating each code with a space. Because of machine limitations, we used a 10% sample. Once the text string is
defined, we use the Text Miner code node in Enterprise Miner (Figure 3).
Figure 3. Use of the Text Miner Node to Create Clusters of Patient Severity
Figure 4 gives the available options in the Text Miner node. Since the ICD9 codes are numeric, the default
for Numbers must be changed from ‘no’ to ‘yes’.
Figure 4. Options in Text Miner
After running Text Miner, the interactive results are shown in Figure 5. Note the interactive button in Figure 4.
Clicking on the button gives Figure 5.
Figure 5. Interactive Results for the Text Miner Node (the window lists all terms appearing in 2 or more observations)
To define text clusters, go to the Tools menu (Figure 6). It gives an option to cluster the text field.
Figure 6. Tools Menu to Cluster Documents
Figure 7 provides options for clustering. Because we are defining a patient severity index, we specified that exactly 5
clusters should be defined. The number can be changed by the user. The number of terms (ICD9 codes) used to
describe the cluster can also be specified by the user. In this example, we limit the number to ten. We use the
standard defaults of Expectation Maximization and Singular Value Decomposition. It is recommended that these not
be changed. The results as shown in the interactive window are given in Figure 8.
Figure 7. Clustering Options in Text Miner
Note that a third window in Figure 8 is shown giving the clusters. This window is enlarged in Figure 9.
Figure 8. Interactive Window After Clustering
Figure 9. Cluster Window
Figure 10. Finding the Datasets
In order to compare outcomes by text clusters, we first
need to know what datasets are defined. This is done by
highlighting the connection between nodes (Figure 10).
The resulting datasets are given in Figure 11.
Figure 11. Datasets Defined in Text Miner (one dataset stores the original data plus the cluster values; the other stores the cluster descriptions)
Once we have the datasets, we use the following code in the SAS Code Node. Its purpose is to put both the cluster
descriptions and the cluster numbers in the original dataset.
data sasuser.clusternis (keep=_cluster_ _freq_ _rmsstd_ clus_desc);
   set emws.text7_cluster;
run;
data sasuser.desccopynis (drop=_svd_1-_svd_500 prob1-prob500);
   set emws.text7_documents;
run;
proc sort data=sasuser.clusternis;
   by _cluster_;
run;
proc sort data=sasuser.desccopynis;
   by _cluster_;
run;
data sasuser.nistextranks;
   merge sasuser.clusternis sasuser.desccopynis;
   by _CLUSTER_;
run;
To make a comparison of outpatient cost by cluster, we use kernel density estimation with the following code:
proc kde data=sasuser.totalcostandclusters;
univar opxp03x_sum/gridl=0 gridu=10000 out=meps.kdecostbycluster;
by _cluster_;
run;
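PROC KDE estimates each BY group's density with a Gaussian kernel evaluated over a grid. A minimal stdlib-only sketch of the same idea in Python, where the per-cluster charge values and the fixed bandwidth are hypothetical illustration values:

```python
import math

# Univariate Gaussian kernel density estimate evaluated on a grid,
# analogous to PROC KDE's UNIVAR statement with GRIDL=/GRIDU= bounds.
def kde(values, grid, bandwidth):
    n = len(values)
    const = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [const * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2)
                        for v in values)
            for x in grid]

# Hypothetical charge values for two clusters.
charges = {1: [1200, 1500, 1800, 2500], 2: [4000, 5200, 6100, 9000]}
grid = [i * 500 for i in range(21)]  # 0 to 10000, like gridl=0 gridu=10000
density = {c: kde(v, grid, bandwidth=800) for c, v in charges.items()}
```

Plotting the per-cluster curves on one axis is what reveals the ordering and the cross-over points discussed for Figures 1 and 2.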
Figure 12 gives the text clusters with the corresponding text translation given in Table 5.
Figure 12. Text Clusters for the NIS Data Sample
Table 5. Translation of Clusters in Figure 12
Cluster Number    Diagnoses
1
Feeding problems in newborn, unspecified fetal and neonatal jaundice, other “heavy-for-dates” infants,
neonatal hypoglycemia, 33-34 weeks of gestation, cardiorespiratory distress syndrome of newborn,
viral hepatitis, transitory tachypnea of newborn, other respiratory problems after birth, single liveborn
born in the hospital by Cesarean delivery
2
Bipolar I disorder, most recent episode (or current) unspecified, anemia, alcohol abuse, extrinsic
asthma unspecified, other convulsions, other and unspecified alcohol dependence, tobacco use
disorder, single liveborn, unspecified hypothyroidism, other specified personal history presenting
hazards to health
3
Unspecified essential hypertension, aortocoronary bypass status, coronary atherosclerosis of native
coronary artery, esophageal reflux, urinary tract infection, site not specified, atrial fibrillation, chronic
airway obstruction, not elsewhere classified, volume depletion, pure hypercholesterolemia, other and
unspecified hyperlipidemia
4
Asthma, unspecified, hyposmolality and/or hyponatremia, other specified cardiac dysrhythmias, obesity,
unspecified, osteoarthrosis, unspecified whether generalized or localized, morbid obesity,
hypopotassemia, anxiety state, unspecified, chest pain, osteoporosis, unspecified
5
Abnormal glucose tolerance in pregnancy, abnormality in fetal heart rate or rhythm, elderly multigravida,
other injury to pelvic organs, post term pregnancy, cord around neck, with compression, normal
delivery, transient hypertension of pregnancy, breech presentation buttocks version, hypertension
secondary to renal disease, complicating pregnancy, childbirth, and the puerperium
Note that cluster 1 contains codes primarily related to a newborn while cluster 5 contains codes primarily related to
pregnancy and delivery. Because these two clusters are so dominated by specific patient conditions (because there
are so many admissions for childbirth), it is perhaps better to define text clusters of patient conditions that use more
specific DRG codes to define severity measures by major diagnostic category (MDC). We will demonstrate this
approach in a later section.
The two outcome measures considered are total charges and length of stay. We first show the relationship between
the different measures with a series of table analyses (Tables 6-12). Table 6 compares the APRDRG mortality risk
by defined text clusters. The APRDRG value of 0 indicates that no measure was calculated for these patients.
Table 6. Comparison of APRDRG Risk of Mortality and Defined Text Clusters (cell entries are counts; row and column percentages are omitted)

APRDRG Risk                 Text Cluster
of Mortality       1       2       3       4       5     Total
     0            23      13      17       6       2        61
     1          8014    7497   12428   15553    7113     50605
     2            73    1039   10972    4947      20     17051
     3            10     216    4155    1695       1      6077
     4            21      68    1428     389       0      1906
Total           8141    8833   29000   22590    7136     75700

There is clearly no real measure of agreement between these two risk measures. Also, for APRDRG, approximately
two thirds of all patients were defined in group 1. For this reason, providers who can shift their patients from group 1
to group 2 through improved coding will receive more favorable quality rankings.
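The claim that about two thirds of patients fall in APRDRG risk group 1 can be checked directly from the Table 6 cell counts. The paper produces these tables with PROC FREQ in SAS; the following stand-alone Python sketch (not part of the original analysis) simply verifies the marginal distribution from the raw counts.

```python
# Counts from Table 6 (APRDRG risk of mortality by text cluster). The paper
# builds these tables with PROC FREQ in SAS; this stand-alone sketch just
# verifies the marginal claim from the raw cell counts.
table6 = {
    0: [23, 13, 17, 6, 2],                 # risk level: counts in clusters 1-5
    1: [8014, 7497, 12428, 15553, 7113],
    2: [73, 1039, 10972, 4947, 20],
    3: [10, 216, 4155, 1695, 1],
    4: [21, 68, 1428, 389, 0],
}
n = sum(sum(counts) for counts in table6.values())
share_risk1 = sum(table6[1]) / n
print(n, round(share_risk1, 3))   # 75700 patients, 0.668 in risk group 1
```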
Table 7. Comparison of APRDRG Severity Rank by Defined Text Cluster (cell entries are counts)

APRDRG                      Text Cluster
Severity           1       2       3       4       5     Total
     0            23      13      17       6       2        61
     1          6691    4105    7444    9481    5134     32855
     2          1078    3811   12603    8946    1739     28177
     3           315     813    7030    3546     260     11964
     4            34      91    1906     611       1      2643
Total           8141    8833   29000   22590    7136     75700
Table 8. Comparison of APRDRG Risk of Mortality to APRDRG Severity (cell entries are counts)

APRDRG           APRDRG Risk of Mortality
Severity           0       1       2       3       4     Total
     0            61       0       0       0       0        61
     1             0   31233    1538      67      17     32855
     2             0   17423    9751     987      16     28177
     3             0    1911    5600    4078     375     11964
     4             0      38     162     945    1498      2643
Total             61   50605   17051    6077    1906     75700
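Because the two APRDRG measures in Table 8 share the same 0-4 scale, their agreement can be quantified with a chance-corrected statistic such as Cohen's kappa. The paper does not report kappa; the sketch below computes it from the Table 8 cell counts purely as an illustration of how "agreement" between two nominal ratings can be measured.

```python
# Cohen's kappa (chance-corrected agreement) between the two APRDRG measures,
# computed from the Table 8 cell counts. Kappa is not reported in the paper;
# it is shown here only as one way to quantify agreement between two nominal
# ratings that share the same 0-4 scale.
table8 = [              # rows: APRDRG severity 0-4; columns: risk of mortality 0-4
    [61,     0,    0,    0,    0],
    [ 0, 31233, 1538,   67,   17],
    [ 0, 17423, 9751,  987,   16],
    [ 0,  1911, 5600, 4078,  375],
    [ 0,    38,  162,  945, 1498],
]
n = sum(map(sum, table8))
p_obs = sum(table8[k][k] for k in range(5)) / n                 # observed agreement
row = [sum(r) for r in table8]
col = [sum(r[j] for r in table8) for j in range(5)]
p_exp = sum(row[k] * col[k] for k in range(5)) / n ** 2         # chance agreement
kappa = (p_obs - p_exp) / (1 - p_exp)
print(round(kappa, 2))   # 0.37 -- only fair-to-moderate agreement
```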
Table 7 compares the APRDRG severity rank to the text clusters; again, there is not a pattern of agreement. The
APRDRG severity rank also concentrates most patients in the lower ranks; about 20% of all patients are in groups 3
and 4. Table 8 compares the two APRDRG measures to each other. While there is major agreement for groups 1
and 4, there are few similarities for groups 2 and 3.

Table 9. Comparison of Disease Staging: Mortality Level to Text Clusters (cell entries are counts)

Disease Staging:            Text Cluster
Mortality Level    1       2       3       4       5     Total
     0           396    1649     927    3078    6571     12621
     1           176    1237     849    1420     543      4225
     2          6680    2980    6858    7431       3     23952
     3           802    2493   13568    7818      19     24700
     4            45     392    5353    2316       0      8106
     5            29      58    1397     476       0      1960
Total           8128    8809   28952   22539    7136     75564

Table 9 examines the disease staging: mortality level relative to the text clusters. As with the previous tables, there
is little agreement between the two.
Table 10. Comparison of Disease Staging: Resource Demand Level to Text Clusters (cell entries are counts)

Disease Staging:                Text Cluster
Resource Demand Level  1       2       3       4       5     Total
        1            3717       0       9       1       0      3727
        2            3487    2006    1532    2503    4758     14286
        3             583    5403   14829   13046    2377     36238
        4             229    1246    9852    5828       1     17156
        5             110     168    2765    1206       0      4249
Total                8126    8823   28987   22584    7136     75656
Table 10 compares the disease staging: resource demand level to the text clusters. In this case, there is some
agreement between cluster 1 and resource level 1.

Table 11. Comparison of Disease Staging: Mortality Level to Disease Staging: Resource Demand Level (cell entries are counts)

Disease Staging:           Disease Staging: Mortality Level
Resource Demand Level  0       1       2       3       4       5     Total
        1             178       4    3542       3       0       0      3727
        2            6363    1405    5045    1432      18      17     14280
        3            5461    2639   11339   13830    2670     246     36185
        4             607     170    3844    7638    3899     967     17125
        5              12       7     179    1797    1519     730      4244
Total               12621    4225   23949   24700    8106    1960     75561

Table 11 compares the two disease staging measures to each other. While there is some agreement between
demand level 1 and mortality level 2, there is little agreement elsewhere between the two measures.
Table 12. Comparison of Disease Staging: Mortality Level to APRDRG Risk of Mortality (cell entries are counts)

Disease Staging:        APRDRG Risk of Mortality
Mortality Level    0       1       2       3       4     Total
     0             7   12482     119      12       1     12621
     1             0    4084     125      13       3      4225
     2             8   21168    2489     260      27     23952
     3             6   11910   10287    2217     280     24700
     4             2     868    3695    2831     710      8106
     5             1      43     304     735     877      1960
Total             24   50555   17019    6068    1898     75564

Table 12 compares the two measures of mortality, APRDRG and Disease Staging. There is little agreement here as
well.
Table 13 compares the levels defined using weight of evidence to the five clusters defined by Text Miner. While
there is variability, there is also some consistency. For example, 55% of WOE level 1 is in text cluster 4. Similarly,
almost 100% of cluster 5 is in WOE level 6. This indicates that both methods provide better results than those of
logistic regression.
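Weight of evidence for a level of a nominal predictor is conventionally defined as the log of the ratio of the level's share of events to its share of non-events, which yields a natural ranking of the levels. The sketch below uses the standard textbook formula with hypothetical counts; the SAS Enterprise Miner implementation may differ in details such as smoothing.

```python
import math

# Weight of evidence (WOE) for each level of a nominal predictor against a
# binary target (e.g., in-hospital death). This is the standard textbook
# formula, sketched here for illustration; the counts are hypothetical and
# the SAS Enterprise Miner implementation may differ in details.
def woe(events, nonevents):
    total_e, total_n = sum(events.values()), sum(nonevents.values())
    return {level: math.log((events[level] / total_e) / (nonevents[level] / total_n))
            for level in events}

events    = {"A": 20, "B": 50, "C": 30}     # event counts per nominal level
nonevents = {"A": 400, "B": 300, "C": 300}  # non-event counts per level
scores = woe(events, nonevents)

# Levels can now be ranked on the WOE scale: higher WOE means the level
# concentrates more of the events relative to the non-events.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)   # ['B', 'C', 'A']
```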
Table 13. WOE Levels Compared to Text Clusters (cell entries are counts)

WOE              Text Cluster
Level      1       2       3       4       5     Total
  0        2     192     963     524       0      1681
  1        2     185     916    1355       0      2458
  2        2    1917    2761    2907       0      7587
  3        3     941     551     559       0      2054
  4        1     900   13718    3890       0     18509
  5        1     962    4849    4802       2     10616
  6        7    1961     661    2040    7314     11983
  7      200    1144    3459    4236       0      9039
  8       12     588     940    1403       0      2943
  9        3     458    1192    1140       0      2793
Total    233    9248   30010   22856    7316     69663
Figure 13. Kernel Density Estimators of Inpatient Costs by Text Cluster

We next examine the kernel density estimators of the outcomes. Figure 13 gives the relationship of text cluster to
inpatient costs.
From the shape of the graphs, the clusters order as 5<2<1<4<3 in cost. There are two cutpoints in the figure. The
first occurs at $11,000, where clusters 1 and 5 transition to lower probability compared to clusters 3 and 4; the
second occurs at $16,000, where cluster 2 transitions to a lower cost. Figure 14 examines the APRDRG severity
measure by inpatient cost. As should happen, group 1<2<3<4, with group 4 having considerable variability. The
major cutpoint occurs at $12,000.
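PROC KDE fits a normal-kernel density estimate of cost for each group, and a cutpoint is simply a cost at which one group's density curve crosses another's. The sketch below illustrates the idea outside SAS with a hand-rolled Gaussian kernel estimator and hypothetical toy costs, not the paper's inpatient data.

```python
import math

# Gaussian kernel density estimate, as PROC KDE computes for each severity
# group. A "cutpoint" between two groups is a cost at which one density curve
# crosses the other. Toy data below, not the paper's inpatient records.
def kde(sample, x, h):
    """Normal-kernel density estimate at point x with bandwidth h."""
    n = len(sample)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample) \
        / (n * h * math.sqrt(2 * math.pi))

low_cost  = [800, 900, 1000, 1100, 1200]     # hypothetical group 1 charges
high_cost = [1500, 1700, 2000, 2300, 2500]   # hypothetical group 2 charges

# Scan a grid of costs for the first point where group 2's density overtakes
# group 1's -- the estimated cutpoint between the two groups.
cutpoint = next(x for x in range(500, 3000, 10)
                if kde(high_cost, x, 200) > kde(low_cost, x, 200))
print(cutpoint)
```

The cutpoint falls between the two group means, closer to whichever group has the tighter spread, which is the same visual reading applied to Figures 13-17.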
Figure 14. Kernel Density Estimate of APRDRG Severity by Inpatient Cost
Figure 15 examines the relationship of inpatient costs to the APRDRG mortality risk measure. Group 1<2<3<4 in
terms of costs, with groups 3 and 4 having considerable variability compared to groups 1 and 2. The cutpoint
between groups 1 and 2 occurs at $10,000; the cutpoint between groups 3 and 4 occurs at $55,000.
Figure 15. Kernel Density Estimate of Inpatient Costs by APRDRG Mortality Risk
Figure 16 gives the estimates using the disease staging: resource demand level measure. Again, group 1<2<3<4<5.
However, the graph for group 1 is extremely narrow, with very small variability and almost zero probability of costs
higher than $5000. The probability of cost for any group beyond $25,000 is also almost zero. Thus while the groups
separate well, they seem to be poor predictors of actual costs.
Figure 16. Kernel Density Estimates of Inpatient Costs by Resource Demand Level
Figure 17 examines disease staging: mortality level. Again, it can only predict inpatient costs up to about $35,000
with positive probability, indicating that it is a poor predictor of actual patient costs. It has multiple cutpoints:
$12,000 for groups 1 and 2, $10,000 for groups 2 and 3, and $14,000 for groups 1 and 3. For that interval, 1<3<2.
The cutpoint for groups 3 and 4 occurs at $25,000. Because of this shift, with 1<3<2, there is a strong probability
that some of the hospitals in the sample are shifting patients into a higher group through over-coding of secondary
patient conditions.
Figure 17. Kernel Density Estimate of Disease Staging: Mortality Level
Figure 18. Kernel Density Estimate of LOS by Text Cluster
We also look at the length of stay (LOS) as an outcome variable by the different patient severity measures. Figure
18 gives the kernel density estimate for the LOS by the text clusters. The pattern is very regular with cluster
5<1<2<4<3. Clusters 2 and 4 are almost identical in terms of distribution. The cutpoint occurs at day 4, with cluster
5 having a very low probability of exceeding a 4-day inpatient stay. The text clusters have positive probability out to
day 12.
Figure 19 gives the estimate of LOS by the APRDRG severity measure. It has positive probability out to day 20.
However, between days 4 and 5, the ordering becomes 1<3<2, indicating a shift of patients from class 2 to class 3
caused by over-coding at some hospitals.
Figure 19. Kernel Density Estimate of LOS by APRDRG Severity Measure
Figure 20 shows the comparison of LOS to the APRDRG mortality index. The pattern looks almost the same as in
Figure 19, with 1<3<2 for values between 4 and 6.5 days.
Figure 20. Kernel Density Estimate of LOS for APRDRG Mortality Index
Figure 21 shows the resource demand level with respect to LOS. For the most part, the LOS for patients in group 1
is equal to 2 days, with much smaller probabilities of 1 and 3 days. Group 2 also peaks at a 2-day LOS, with just a
slightly higher probability of 4 or more days compared to group 1. These two curves strongly suggest shifting of
patients from group 1 to group 2. Groups 3, 4, and 5 behave as expected for this estimate. Also, the positive
probability ends at about day 10, compared to day 12 or day 20 for the previous groups.
Figure 21. Kernel Density Estimate of LOS for Disease Staging: Resource Demand Level
In contrast, as shown in Figure 22, the Disease Staging: Mortality Level has a much more regular pattern with
1<2<3<4<5.
Figure 22. Kernel Density Estimate of LOS for Disease Staging: Mortality Level
CONCLUSION
It is possible to develop a model to rank the quality of care that does not assume uniformity of data entry. The
model can also be validated by examining additional outcome values in the data. The model is developed by using
the stemming properties of the ICD-9 codes, where the first three digits of the code represent the primary category
while the remaining two digits represent a refinement of the diagnosis. The model compares well to those developed
through the standard logistic regression technique, and the text clusters that are defined can be used to validate the
cluster levels.
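The stemming property described above can be sketched outside SAS in a few lines: group diagnosis codes on the three digits before the decimal point so that related refinements collapse to one term. The four codes below are real ICD-9 codes (diabetes and hypertension), used only for illustration.

```python
# ICD-9 codes "stem" naturally: the three digits before the decimal point give
# the primary diagnosis category, and the digits after it refine the diagnosis.
# Grouping on the stem is what lets the text miner treat related diagnoses as
# one term. The four codes below are real ICD-9 codes used for illustration.
def icd9_stem(code):
    return code.split(".")[0]

codes = ["250.00", "250.52", "401.9", "401.1"]   # diabetes and hypertension codes
groups = {}
for code in codes:
    groups.setdefault(icd9_stem(code), []).append(code)
print(groups)   # {'250': ['250.00', '250.52'], '401': ['401.9', '401.1']}
```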
REFERENCES
1. Anonymous. AHRQ Quality Indicators: Patient Safety Indicators (PSI) Composite Measure. AHRQ; 2006:34 pages.
2. Anonymous. STS evidence based guidelines: cardiac surgery risk models; 2004.
3. Coulter SL, Cecil WT. Risk adjustment for in-patient mortality-coronary artery bypass surgery. Nashville, TN: Tennessee Department of Health; February 2004.
4. Gilligan R, Gilanelli D, Hughes R, et al. Coronary artery bypass graft surgery in New Jersey, 1994-1995. New Jersey: Cardiovascular health advisory panel; November 1997.
5. Krakauer H, Bailey RC, Skellan KJ, et al. Evaluation of the HCFA model for the analysis of mortality following hospitalization. Health Services Research. 1992;27(3):317-335.
6. Iezzoni LI, Ash AS, Shwartz M, Daley J, Hughes JS, Mackleman YD. Predicting who dies depends on how severity is measured: implications for evaluating patient outcomes. Annals of Internal Medicine. 1995;123(10):763-770.
7. Austin PC, Alter DA, Tu JV. The use of fixed- and random-effects models for classifying hospitals as mortality outliers: a Monte Carlo assessment. Medical Decision Making. 2003;23:526-539.
8. Poses RM, McClish DK, Smith WR, et al. Results of report cards for patients with congestive heart failure depend on the method used to adjust for severity. Annals of Internal Medicine. 2000;133:10-20.
9. Thomas JW. Research evidence on the validity of adjusted mortality rate as a measure of hospital quality of care. Medical Care Research and Review. 1998;55(4):371-404.
10. Flanders DW, Tucker G, Krishnadasan A, Honig E, McClellan WM. Validation of the pneumonia severity index: importance of study-specific recalibration. Journal of General Internal Medicine. 1999;14(6):333-340.
11. Smith EP, Lipkovich I, Ye K. Weight of Evidence (WOE): quantitative estimation of probability of impact. Blacksburg, VA: Virginia Tech, Department of Statistics; 2002.
12. Anonymous. Advanced Predictive Modeling Using SAS Enterprise Miner 5.1 Course Notes. Cary, NC: SAS Education; 200.
ACKNOWLEDGMENTS
I appreciate the valuable input from my co-investigator and domain expert, John C. Cerrito, PharmD. Also, the work
in this paper was supported by NIH grant #1R15RR017285-01A1, Data Mining to Enhance Medical Research of
Clinical Data.
CONTACT INFORMATION
Patricia B. Cerrito
University of Louisville
Department of Mathematics
Louisville, KY 40292
Work Phone: 502-852-6010
Fax: 502-852-7132
Email: [email protected]
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.