Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Paper PR01
Data Mining Methods to Link Multiple Drug Purchases and To Investigate Drug
Costs to Consumers
Patricia B. Cerrito, University of Louisville, Louisville, KY 40292
John C. Cerrito, Kroger Pharmacy, Louisville, KY
ABSTRACT
When one field is used to define the observational unit in a dataset, it becomes difficult to alter the dataset to define
a different observational unit. For example, customer drug purchases are usually defined by the purchase rather
than by the customer. The Association Node in Enterprise Miner uses a customer identifier to link purchases when
the purchase is the observational unit. However, the Association Node is inadequate when there are too many
different choices for purchase. An alternative method is to change the observational unit to the customer using
PROC TRANSPOSE and PROC CONCAT. In this way, all customer purchases are linked in a text string that can be
examined using SAS Text Miner. A second method is to reduce the number of choices by defining larger classes
that combine choices that are relatively similar. The methods demonstrated in this paper will be applied to a public
dataset similar to those collected in any retail pharmacy.
INTRODUCTION
Retail pharmacy is different from other retail businesses in that the consumer does not make the medication choice;
that choice is made by the physician. The insurer sets the price paid by the consumer. There are no drug specials
with special rates at a retail pharmacy. Nevertheless, consumers with chronic medication needs can be identified
through a market basket analysis.
Such a market basket analysis is usually performed using association rules available in Enterprise Miner. However, if
there are too many possible choices, association rules tend to fail. Given the number of medications available,
association rules don’t provide meaningful results. It is the purpose of this paper to examine alternative techniques
for market basket analysis in the presence of thousands of possible choices
The data used for this particular problem are in the public domain, downloaded from
http://www.meps.ahrq.gov/Puf/PufDetail.asp?ID=149. The particular dataset used was MEPS HC-059A: 2001
Prescribed Medicines File. The dataset provides information on household-reported prescribed medicines for a
nationally representative sample of the non-institutionalized population of the United States. The purpose of the
dataset is to make estimates of prescribed medicine utilization and expenditures. It is a dataset that could be
collected in any retail pharmacy.
Each record in the dataset represents one prescribed medicine that was obtained during the year 2001. There are
codes for individual households, and for individuals within each household. The dataset contains approximately
277,000 records with a total of 3837 medications in the list of which 2639 medications were prescribed for 2 or more
households. A partial list of the medications is given in Figure 1.
1
Figure 1. Partial List of Medications in the Dataset Figure
A quick review of the different medication names demonstrates very clearly that there are many medications that are
virtually identical, with only slight changes. Although names are somewhat standardized, there is enough discretion
in data input to have some names occur only occasionally as they are just slightly different from the norm.
Misspellings or abbreviations are also apparent, for example, “acetamin”. Numeric codes are used for unknown or
missing medications (-7,-9). It would be very helpful to have an easy way to simplify the list. For the most part, the
difference in drug names occurs because of the level of detail provided at data entry.
SAS TEXT MINER TO CREATE LINKAGE
There are a number of analyses available in Text Miner. In order to define linkage using Text Miner, there is some
data pre-processing required. All medications for one household (or one patient, depending upon the interest) must
be linked, and the observational unit must be changed from medication to household. The following code will make
the required changes:
proc sort data = sasuser.icuchargesonly out= work.sort_out;
by duid rxname;
run;
data work.sort_out1;
set work.sort_out;
rxname = translate(left(trim(rxname)),'_',' ');
run;
proc Transpose data=work.sort_out1
out=work.tran (drop=_name_ _label_)
prefix=med_;
var rxname ;
by duid
run;
Sort the data by
household
identifier, duid.
Shifts all medications for one
household into a series of
fields; one observation per
household remains.
2
data work.concat( keep= duid rxname ) ;
length rxname $32767 ;
set work.tran ;
Concatenates the fields created in
PROC TRANSPOSE, and creates
one long text string to contain the
multiple medications per household.
array chconcat {*} med_: ;
rxname = left( trim( med_1 )) ;
do i = 2 to dim( chconcat ) ;
rxname = left(trim(charges)) || ' ' || left(trim( chconcat[i] )) ;
end ;
run ;
proc sql ;
select max( length( rxname )) into :rxname_LEN from work.concat ;
quit ;
%put rxname_LEN=&rxname_LEN ;
Finds longest text string to be used
data sasuser.medstextstrings ;
to define the final dataset.
length rxname $ &rxname_LEN ;
set work.concat ;
run ;
proc contents data=sasuser.medtextstrings ; run ;
Once this code is run, an initial household record provided in the dataset (as shown in Table 1) is modified by the
above code as shown in Table 2.
Table 1. Household Record
DUID
RXNAME
40001
40001
40001
40001
40001
40001
40001
40001
40001
40001
40001
40001
40001
40001
40001
40001
40001
40001
40001
40001
SOFTCLIX
ANTIVERT
SOFTCLIX
SOFTCLIX
SOFTCLIX
SOFTCLIX
SOFTCLIX
SOFTCLIX
ESTROGEN
ESTROGEN
CEFADROXIL
ESTROGEN
ESTROGEN
ESTROGEN
ESTROGEN
ESTROGEN
ESTROGEN
ESTROGEN
ESTROGEN
ESTROGEN
Table 2. Modified Patient Record
DUID
RXNAME
40001 ANTIVERT CEFADROXIL ESTROGEN ESTROGEN ESTROGEN ESTROGEN ESTROGEN
ESTROGEN ESTROGEN ESTROGEN ESTRO BETAMETH KEFLEX
40007 ACCUTANE_(RX_PAK,_10X10) ACCUTANE_(RX_PAK,_10X10) ACCUTANE_(RX_PAK,_10X10)
ALBUTEROL BENZAMYCIN B
40010 ALBUTEROL APAP/PROPOXYPHENE_NAPSYLATE_(PINK) BENZTROPINE_MESYLATE
CARDEC_DM_SYRUP ERYTHROMYCIN/SU
Note that medications that use multiple words are linked by underscores to preserve the entire name in Text Miner.
Text Miner is then used on the re-defined dataset to find the relationships on those text strings. In Enterprise Miner
Version 5, the process flow is given in Figure 2 with the basic defaults for Text Miner given in Figure 3. The SAS
Code node is linked to Text Miner so that all of the datasets created by Text Miner can be identified, and examined
using SAS/Stat code.
3
Figure 2. Icons for Text Mining to Reduce Clusters
Figure 3. Text Miner Defaults
In order to complete an analysis of the text strings using Text
Miner, the defaults are used (version 5). The only needed
change is to automatically cluster the terms. SAS Text Miner
will find an optimal number of clusters.
Once the clusters are identified, they must be labeled. Such
labeling requires domain knowledge; in this example,
knowledge of relationships between diagnoses and
medications is needed to provide cluster labels.
Labels do not have to be unique; they have to have
relevance and meaning. Different investigators may suggest
different labels based upon the clusters. The suggested
labels are given in Table 3.
Table 3. Clusters of Medications
Cluster
Cluster Description
Number
glucophage, furosemide, zestril, synthroid, softclix
1
allegra, claritin, flonase, nasonex, hydrochlorothiazide
2
lipitor, vioxx, allegra, claritin
3
augmentin
4
vicodin, apap/hydrocodone_bitartrate, celexa, naproxen, vioxx
5
zoloft, triple_antibiotic, paxil, zyrtec, naproxen
6
synthroid, premarin, augmentin, zithromax, amoxicillin
7
8
levoxyl, synthroid, premarin, lipitor, zoloft
9
10
hydrochlorothiazide, lipitor, norvasc, furosemide, zestril
estradiol, prednisone, prempro, zocor, zyrtec
11
glyburide , aspirin, glucophage, lisinopril, simvastatin
4
Cluster Label
Diabetes, heart
Allergies
Lipitor, arthritis and allergies
Children’s antibiotic
Pain
Depression, allergies
Post-menopause and
antibiotics
Post-menopause, cholesterol,
and Depression
Cholesterol, Hypertension
Post-menopause, allergies,
cholesterol
Diabetes, cholesterol
Cluster
Number
12
13
14
15
16
17
18
19
20
21
Cluster Description
Cluster Label
lanoxin, coumadin, pravachol, furosemide, norvasc
atenolol, hydrochlorothiazide, lipitor, zestril, premarin
fosamax_(unit_of_use, blister_pck, fosamax_(unit_of_use), fosamax,
celebrex
fluoxetine_hcl, prozac, zithromax, augmentin, cephalexin
singulair_(unit_of_use), albuterol, serevent, flovent, singulair
celebrex, ultram, hydromet, vioxx, ibuprofen
premarin, medroxyprogesterone_acetate_(unit_of_use), allegra,
triple_antibiotic, cephalexin
prevacid, cephalexin, amoxicillin, ibuprofen, celebrex
triple_antibiotic, augmentin, ibuprofen, motrin, amoxicillin
paxil, paxil_(unit_of_use), alprazolam, lorazepam, amoxicillin
Heart, cholesterol
Heart, cholesterol
Ulcer, arthritis
Antibiotic, depression
Asthma
Arthritis
Post-menopause, antibiotics
Ulcer, antibiotic, arthritis
Antibiotic, pain
Depression, antibiotic
Note that the clusters represent combinations of diagnoses. Recall that the examination is by household where
different household members may have different diagnoses. Also, patients with severe, chronic illnesses often suffer
from related co-morbidities. Patients with diabetes often have heart problems. Patients with arthritis can have gastric
problems related to their arthritic medications.
EXAMINATIONS OF PAYMENTS BY CLUSTERS
Because the clusters are defined by analyses, they must be considered random effects and the linear mixed model
(PROC MIXED) must be used. Figures 4-8 show the total paid by each patient (or third party) in the 21 defined
clusters, as well as showing kernel density estimators of several selected clusters to show the entire distribution of
payment. Note that the highest self-pay as shown in Figure 1 is for patients with ulcers and arthritis. There have
been a number of recently introduced medications that are not yet on third party formularies to account for this
amount. The cluster with the next highest payment is for patients with diabetes and heart problems. The highest
average self-pay for a year is $1300 per year. The highest average payment by private insurance is $1400 per year.
Figures 9 and 10 give an overall framework for payments.
Figure 4A. Payment by Customers for Medications by Cluster
Payment by Customers for Medications
1400
1200
1000
800
600
400
200
Li
pi
to
r,
D
ia
be
ar
te
th
s
Al
rit
le
is
rg
an
ie
C
d
s
hi
al
ld
le
re
r
Po
gi
n’
e
s
st
s
an
-m
ti b
en Pos
D
io
op
t-m
ep
t ic
au
re
en
se
op s sio
Pa
,c
n,
in
ho a us
a
e
le
an ller
st
Po
g
er
d
ie
st
ol
s
C
-m
, a ant
ib
en hol
nd
i
o
e
op
tic
st
D
er
au
ep
s
ol
re
se
ss
,a ,H
y
i
on
lle
pe
rg
rte
ie
ns
s,
D
i
ia
be cho o n
le
te
st
s,
e
ch
ro
H
ol
l
ea
es
rt,
te
ch
r
ol
H
ol
ea
es
rt,
t
c h ero
l
ol
es
U
An
te
lc
r
er
t ib
, a ol
io
r
tic
, d thri
tis
ep
re
ss
io
Po
n
st
As
-m
t
h
en
m
a
op
Ar
au
U
th
lc
se
r
er
i
ti s
,a
,a
nt
nt
ib
ib
io
io
tic
ti c
s
,
An a rt
h
D
t ib
ri t
ep
is
io
re
tic
ss
,p
io
ai
n,
n
an
tib
io
tic
0
5
Figure 4B. Kernel Density Estimation of Selected Clusters for Self-Pay
Kernel density estimation
was used to examine
differences within clusters.
In particular, cluster 2
(allergies) has a high
probability of low self
payments. In contrast,
cluster 14 (arthritis and
ulcers) has a fairly high
probability of payment in
the $500-1000 range in
comparison to cluster 2
patients.
Figure 5A. Average Yearly Payment by Cluster by Private Insurance (Primary Insurer)
Private Insurance
1600
1400
1200
1000
800
600
400
200
Li
pit
or
,a
D
ia
be
te
r th
s
Al
rit
le
is
rg
an
i
e
C
d
s
hi
al
ldr
le
rg
en
Po
i
e
’
s
st
s
an
-m
ti b
en Pos
De
i
o
op
t-m
t ic
pr
au
es
en
se
sio
Pa
, c opa
n,
in
us
ho
a
e
le
an ll er g
st
Po
e
ies
d
ro
st
a
C
l,
-m
an nti b
en hol
i
d
o
e
op
st
ti c
De
e
au
s
pr
se rol,
,a
Hy ess
i
o
lle
pe
rg
rte n
ie
ns
s,
D
io
ia
ch
n
be
ol
te
es
s,
te
c
r
ol
ho
He
le
ar
ste
t,
c
r
ol
ho
H
ea
l
r t, e s te
ch
ro
l
ol
es
Ul
An
te
c
ro
t ib
er
l
,a
io
tic
r
, d thr i
tis
ep
re
ss
io
Po
n
st
As
-m
t
h
en
m
o
Ar a
Ul pa u
th
ce
se
r,
, a riti s
an
nt
ti b
ibi
ot
io
ti c
ics
,a
An
r
th
De
t ib
ri t
pr
i
es i otic s
sio
,p
ai
n,
n
an
ti b
iot
ic
0
6
The two highest
payments by
private insurance
are for ulcers and
arthritis, or for
arthritis alone. The
next highest
cluster is for
allergies.
ar
th
rit
D
ia
be
te
s
Al
le
is
rg
an
ie
C
s
d
hi
al
ld
le
re
rg
n’
Po
ie
s
s
st
an
-m
Po
ti b
en
D
i
ot
ep
op s t-m
ic
re
au
en
se
op s sio
P
,c
a
au
n,
in
ho
se
a
le
an ll er
st
Po
g
er
d
st
ol
an ies
C
-m
,
ti b
en hol e and
io
ti c
op
st
D
er
s
ep
au
ol
r
se
es
,
Hy
,a
si
on
pe
lle
rg
rte
ie
ns
s,
D
io
ia
n
be cho
le
te
ste
s,
ch
ro
H
l
ol
ea
es
rt,
te
ch
ro
H
l
ol
ea
es
rt,
t
c h ero
l
ol
es
U
te
An
lc
r
er
t ib
, a ol
io
r th
tic
,d
r
ep itis
re
ss
io
Po
n
As
st
-m
th
m
en
a
op
Ar
au
U
th
lc
s
r
er
e,
i
,a
a n ti s
nt
tib
ib
i
ot
io
ti c
ic
s
,
An a rth
D
r
t
i ti
ep
ib
s
io
re
tic
ss
,p
io
ai
n,
n
an
ti b
io
ti c
Li
pi
to
r,
Figure 5B. Kernel Density Estimation of Selected Clusters for Private Insurance
Insurance payments
for postmenopause
and antibiotics are
relatively small, but
consistent across
insurers. Similarly, for
diabetes and
cholesterol (cluster 11)
and antibiotics (cluster
19). However, for
arthritis and allergies,
the payments by
difference insurers
vary widely.
Figure 6A. Average Yearly Payment by Cluster by Medicare
Medicare
450
400
350
300
250
200
150
100
50
0
7
ar
th
rit
D
ia
be
t
Al es
is
le
r
a
g
nd
C
ie
hi
s
al
ld
le
Po
re
r
n’
gi
st
s
es
-m
an
en Po
ti b
D
op s te
i
o
m
pr
au
t ic
se eno es
sio
,c
p
n, Pai
ho a u
n
s
a
le
Po
st e an ll er
er
gi
st
d
o
es
-m
C
a
l
en hol , an ntib
es
d
op
io
D
t
t
au
e
ep ics
se rol,
re
,a
H
s
y
lle
pe si on
rg
r
t
D i es, en s
ia
io
be ch
n
o
te
s, lest
e
ch
H
r
o
o
ea
l
rt, lest
c h e ro
H
o
ea
l
rt, les t
c h ero
ol
es l
U
An
t ib lc er tero
l
,a
io
tic
r
, d thr
ep itis
re
ss
Po
io
st
n
A
-m
st
en
hm
op
a
U
au
Ar
lc
th
s
er
, a e, a riti s
nt
nt
ib
i
io bi o
ti c
ti
, a cs
An
rth
D
ti
ep
rit
is
re bio
ti
ss
io c , p
n,
ai
n
an
ti b
io
ti c
Li
pi
to
r,
Figure 6B. Kernel Density Estimation of Selected Clusters by Medicare
As noted in both Figures
6A and B, Medicare pays
for very few medications;
its highest average
payment is for
postmenopause. It will be
of interest to compare
results before and after
the passage of legislation
to include medications in
Medicare coverage.
Figure 7A. Average Yearly Payment by Cluster by Medicaid
Medicaid
600
500
400
300
200
100
0
8
ito
r
,a
rth
rit
D
ia
be
te
s
Al
le
is
rg
an
i
Ch
e
d
s
al
i ld
le
re
rg
Po
n’
ie
s
st
s
an
-m
ti b
en Pos
D
i
ot
op
t-m
ep
ic
au
re
en
ss
se
o
P
i
p
,c
on
a
a
in
us
ho
,a
e
le
an ll er
st
Po
gi
er
d
st
es
ol
C
-m
, a ant
en hol
ib
nd
io
es
op
t
D
te
au
ep i cs
se rol ,
re
,a
H
ss
yp
io
lle
er
n
rg
te
ie
n
s,
D
sio
ia
ch
be
n
ol
te
es
s,
t
er
c
H
ea hol e ol
rt,
ste
ro
He c ho
l
le
ar
st
t,
c h ero
l
ol
es
U
An
te
l
c
r
t ib
er
, a ol
io
tic
r
, d thri
tis
ep
re
ss
Po
io
n
st
As
-m
t
hm
en
o
a
A
Ul pa u
ce
se rthr
iti
r,
,a
s
an
nt
ti b
ib
i
o
io
tic
ti c
s
,a
rth
De Ant
r
ibi
i ti
pr
s
ot
es
sio ic , p
ai
n,
n
an
ti b
io
ti c
Li
p
Figure 7B. Density Estimation of Selected Clusters by Medicaid
In Medicaid payments,
the highest payments
are also for arthritis
and ulcers. The
second highest is for
pain.
Figure 8A. Average Yearly Payment by Cluster From All Sources
Total Payment
3500
3000
2500
2000
1500
1000
500
0
9
Figure 8B. Density Estimation of Selected Clusters by All Sources
For total payments,
the highest
payments are again
for ulcers and
arthritis. The peak
payment for a year is
under $1000.
Figure 9. Kernel Density Estimation for Most Clusters for Private Insurance
For most of the
remaining clusters, a
hierarchy of costs
exists. For example,
clusters 8, 5, and 6
have the highest
probability of low
cost; clusters 14 and
9 have the lowest
probability of low
cost. Clusters not
pictured have little
variability beyond
the zero point.
10
Figure 10. Kernel Density Estimation for Most Clusters for Self Pay
The hierarchical pattern
present in Figure x is not
present here. In fact, for
many of the clusters, the
probability of higher
payment is increasing
rather than decreasing;
only clusters 3,10,6,20,
1, and 5 have a
reasonable probability of
lower cost, with
probability decreasing for
higher payments.
ASSOCIATION USING TEXT MINER
In addition to clusters, concept links can be used to find relationships. Text Miner uses association rules to define the
concept links. In order to be visualized, the relationship between terms is assumed by default to be highly significant.
That is, the chi-square statistic is greater than 12. Both terms occur in at least n documents. The default value of n is
MAX(4, A/100, B), where A is the largest value of the number of documents for the subset of terms that are used in
concept linking, B is the 1000th largest value of the number of documents for the subset of terms that are used in
concept linking. For example, for 13,000 documents, every term in the term set must occur in at least 130
documents. Figure 11 shows the concept links for the medication, Celexa (a medication for depression).
Figure 11. Concept Links for Celexa
Note that the investigator
controls the left-hand side
of the rule while the righthand side is allowed to
vary. Since the concept link
is created using Active-X,
placing the cursor over one
of the terms provides the
number of documents
containing both the term
and the initial term Celexa
divided by the number of
documents containing that
term.
11
In other words, the concept links provide the confidence value for the association. Medications, such as Prozac and
Zoloft are listed as associated with Celexa. They are also used to treat depression. These links indicate that patients
do often switch from one drug to another either because the drug is not effective in treating the condition, or there is
a change in insurance formularies. Also, note that the medication, Vicodin (a pain killer) is listed as associated with
treatment for depression.
Figure 12. Concept Links for Zestril
There are many links to
Zestril. They include many
heart and diabetes
medications, indicating a
strong link. Zestril is often
used to treat heart problems
in combination with other
medications.
Figure 13. Concept Links for Albuterol
Albuterol is a
medication used
to treat asthma.
It is prescribed
with emergency
inhalers.
12
Note that Albuterol is related to various other asthma medications such as Singulair, Zyrtec, and Serevent. It is also
related to antibiotics such as Zithromax, indicating that asthma is related to infection. There is a relationship between
Albuterol and Flovent. Where Albuterol is used for emergency attacks, Flovent is an inhalant used as a maintenance
therapy.
Validation of Text Clusters
ICD9 codes that give the patient diagnosis associated with a medication were also provided in the dataset. These
diagnoses codes were used to validate the Text Clusters. The Association Node in SAS Enterprise Miner was used
to examine all ICD9 codes for patients in any one Text Cluster. Figure 14 gives the linkages for codes for patients in
cluster 16, asthma. In a link analysis, the most common codes have the largest shape; the most prominent linkages
have the widest lines. Table 4 gives the most prominent codes. With the exception of depression, the most
prominent codes are associated with asthma and upper respiratory tract diseases.
Figure 14. Diagnosis Codes Associated with Cluster 16.
Table 4. ICD9 Codes Associated with Asthma Clulster
Code
Representation
780
Sleep disorder, dizziness
401
Essential hypertension
477
Allergic rhinitis
311
Depression
473
Chronic sinusitis
530
Diseases of the esophagus
476
Upper respiratory tract diseases
493
Asthma
382
Otitis media
Similarly, the diagnosis codes associated with cluster 1, diabetes and heart were examined (Figure 15) with the
codes defined in Table 5. While codes 715 and 716 don’t seem to fit immediately, joint stiffness is commonly
associated with diabetes.
13
Figure 15. Diagnosis Codes Associated with Diabetes and Heart
Table 5. Diagnosis Code Definitions
Code
250
401
272
715
716
V58
311
780
429
Representation
Diabetes
Essential hypertension
High Cholesterol
Osteoarthrosis
Other arthropathies
Convalescence and palliative Care
Depression
Sleep disorder, dizziness
Complications of Heart Disease
CONCLUSION
Association rules work best when there are only a few possible consumer choices. When there are thousands of
choices, methods need to be found to reduce the choices to classes of choices. SAS Text Miner can automate the
reduction by defining clusters of choices based upon the rules of grammar, stemming, and context. Association is
preserved by defining the observational unit as one consumer, and defining a text string of possible choices.
Text Miner applies association rules through concept links. Linkage between items in a market basket is preserved
through the construction of a text string where all purchases are contained within the string. The method is
successful when the items in the market basket are “stemmed” so that similar items have similar base words. The
clusters found using SAS Text Miner need to be validated using additional variables in the dataset. Market Basket
analysis on an outcome variable can be used to validate the clusters determined through Text Miner.
ACKNOWLEDGMENTS
The author wants to acknowledge the help of the SAS Consultants, Darius S. Baer and Ross Bettinger for their help
in coding the data compression. The author would also like to acknowledge that this work was supported by NIH
grant # 1R15RR017285-01A1, Data Mining to Enhance Medical Research of Clincal Data.
14
CONTACT INFORMATION
Patricia B. Cerrito
Department of Mathematics
University of Louisville
Louisville, KY 40292
502-852-6826
502-852-7132 (fax)
[email protected]
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
15