Paper PR02
Data Mining Medication Prescriptions for a Representative National Sample
Patricia B. Cerrito, Antonio Badia, John C. Cerrito, University of Louisville, Louisville, KY
ABSTRACT
It is the purpose of this paper to examine how medications are prescribed to individuals in combination for one or
multiple medical conditions, and to explore the use of such medications. The Agency for Healthcare Research and
Quality yearly conducts the Medical Expenditure Panel Survey and makes the results available for research
purposes. The latest survey released in February, 2004 was for the year 2001. Each medication prescribed to a
particular individual is listed; each observation represents a different medication. The first step in the analysis is to
combine all medications into one text string by patient. The text string preserves the linkages between drugs for any
one individual. Next, SAS Text Miner is used to cluster the results into meaningful categories. Concept links that are
available in Text Miner are also used to examine relationships between medications. Results demonstrate that
specific co-morbidities are highly related, and that medication combinations identified in this analysis can be used
either to identify drug combination studies, or for targeted marketing.
INTRODUCTION
To examine the issue of prescribed medications, the Agency for Healthcare Research and Quality yearly conducts
the Medical Expenditure Panel Survey and makes the results available for research purposes
(http://www.meps.ahrq.gov/Puf/PUFLookup.asp). The most recent year for which data are available is 2001. The
data are listed so that each observation represents one medication order. However, patient identifiers are provided
with the data so that it is possible to link all medications prescribed for just one patient into one observation in the
dataset. The medications are combined as a text string.
Once the medications are combined into a text string, the data can be analyzed using SAS Text Miner since the text
string preserves the linkages between drugs for any one individual. Relationships between medications can be
examined. It was found that it is common for individuals to switch from one medication to another to treat the same
problem, and for patient co-morbidities to be related. For example, Vioxx, used for treating arthritis, was recently
removed from the market because of adverse events. Drugs that are commonly prescribed with Vioxx can be
examined to determine if the interaction is more responsible for the adverse reactions. The information can be used
for direct marketing, or to examine how physicians prescribe medications in combinations.
METHODS
SAS TEXT MINER
The initial icons for Text Miner are given in Figure 1. These nodes can be integrated into Enterprise Miner provided
that Text Miner is available.
Figure 1. Text Miner Node
Specify the dataset as usual. The Text Miner Node has three settings screens to examine. The first screen is given
in Figure 2. There is an option to choose if the text is stored in a SAS dataset, or if there is a variable in the dataset
that points to the location of the document. This second option is available to reduce the required storage size for a
SAS dataset. In the second option, there is no limit on the size of each document; for the first option the size is
restricted to 10 pages.
The first default is to exclude consideration of words that only occur in one document since those words cannot be
used to group documents together. Unchecking the box will also create a much longer wordlist. A second default is
to consider a word to be different if it is used as a different part of speech.
Figure 2. First Settings Screen
It is often advantageous to uncheck this box since Text Miner sometimes has difficulty with grammar. In addition,
it is possible to ignore specific parts of speech (such as conjunctions). Words with the same stem should be
considered the same word. Therefore, the third box should always be checked.
Numbers and punctuation are not ordinarily used to cluster text documents so the default is unchecked.
However, inventory codes can be examined using Text Miner. If that is the case, then the numbers box should
be checked. It is suggested that the user experiment with the defaults to observe the impact on the final outcomes.
A standard “stoplist” dataset will remove common words such as “and” and “the” from consideration. Users can
add words to the stoplist as needed, or create their own lists. Similarly, a “startlist” can be defined. In this case,
only the specific words contained within the startlist will be used. This option is useful to flag documents
containing particular words or phrases. It differs from a more typical keyword search in that several words and
phrases can be searched simultaneously.
It is possible to restrict attention to some specific terms by listing them in a dataset. Text Miner will only list terms
from the specified dataset. The purpose of this step is to ‘parse’ the documents. Text parsing is a very technical
process that is used to reduce the size of the documents to a manageable number. It also means that the software
attempts to use grammar context to identify a specific part of speech for each term used. Modifiers are often
connected to nouns to define ‘noun groups’ (Figure 3).
Figure 3. Results of Parsing
The + signs indicate that there is more than one word connected to the phrase. Clicking on the
‘Term’ box will put the words in alphabetical order.
Notice that some of the terms have a ‘Y’ or an ‘N’. Any value with an ‘N’ is contained within the ‘stoplist’ file and is
not used in the analysis. Common words such as “and” that are not specifically listed in the “stoplist” dataset should
not be given a large weight since almost all documents contain many “and” words and they contribute very little to
grouping documents. By unchecking the box, ‘Display dropped terms’, all values with an ‘N’ are removed from the
window; unchecking ‘Display kept terms’ removes all words with a ‘Y’.
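The effect of the stoplist, the startlist, and the single-document default can be illustrated outside of Text Miner. The following Python sketch is only an illustration of the filtering idea, not the Text Miner implementation; the function name keep_terms, its parameters, and the sample strings are hypothetical.

```python
from collections import Counter

def keep_terms(documents, stop_list=(), start_list=None, min_docs=2):
    """Return the set of terms that would be kept for analysis.

    documents  : list of strings, one text string per patient (hypothetical input)
    stop_list  : terms that are always dropped
    start_list : if given, only these terms can be kept
    min_docs   : drop terms that occur in fewer than this many documents
    """
    stop = {t.lower() for t in stop_list}
    start = None if start_list is None else {t.lower() for t in start_list}

    # Count each term once per document, so the counter holds document counts.
    doc_freq = Counter()
    for text in documents:
        terms = {tok.strip(",").lower() for tok in text.split()} - {""}
        doc_freq.update(terms)

    kept = set()
    for term, n_docs in doc_freq.items():
        if n_docs < min_docs:
            continue  # a term in one document cannot group documents together
        if term in stop:
            continue  # would be shown with an 'N' in the term list
        if start is not None and term not in start:
            continue  # only startlist terms are used when a startlist is given
        kept.add(term)
    return kept

docs = [
    "albuterol and flovent",
    "albuterol and serevent",
    "flovent and lipitor",
]
print(sorted(keep_terms(docs, stop_list=["and"])))
```

In this toy example, "serevent" and "lipitor" are dropped because each occurs in only one document, and "and" is dropped because it is on the stoplist.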
Singular value decomposition defines a matrix of words by documents. The maximum dimensions box (by default 100)
limits the size of this matrix. However, the larger the matrix, the more time-consuming this process. The roll-up terms
setting limits the wordlist to the top (100) highest weighted terms. It is suggested that the user modify the dimensions
somewhat to determine the impact on the final outcome of Text Miner.
Once the singular value decomposition is started, a status screen briefly pops up, indicating that the singular value
decomposition is being performed. The user can close this screen since the process will continue to run.
Figure 4. Second Settings Screen
The second screen allows the user to determine the method of reducing the wordlist matrix to a manageable
size. The default is to use singular value decomposition. There are also several possible
methods to weight the value of each term in the documents.
To investigate how these weights and methods impact outcomes, it is best to use one dataset and change the
settings to see how the results differ.
The number of dimensions defaults to 100. However, that number can be decreased for a smaller number of
documents, and increased for a large number (although the time factor will increase considerably).
A drop-down menu will allow the user to change the weights (Figure 5).
Figure 5. Drop-Down Menu for Term Weights
Entropy is the default weighting. Terms that appear more frequently will be weighted lower compared to
terms that appear less frequently. This weighting is somewhat different from Inverse Doc Freq, where
terms that appear in as few as 2 documents are given the highest weights.
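To make the difference between the two weightings concrete, the sketch below computes both for a single term. It uses the common textbook definitions of the entropy and inverse document frequency weights; it is an assumption that Text Miner's options follow these same formulas, and the function names are hypothetical.

```python
import math

def entropy_weight(counts_per_doc):
    """Textbook entropy weight: 1 + sum_j p_j * log2(p_j) / log2(n).

    counts_per_doc holds the frequency of one term in each of the n documents;
    p_j is the share of the term's total frequency that falls in document j.
    """
    n = len(counts_per_doc)
    total = sum(counts_per_doc)
    if n < 2 or total == 0:
        return 0.0
    s = 0.0
    for f in counts_per_doc:
        if f > 0:
            p = f / total
            s += p * math.log2(p)
    return 1.0 + s / math.log2(n)

def idf_weight(counts_per_doc):
    """Textbook inverse document frequency weight: log2(n / d) + 1,
    where d is the number of documents that contain the term."""
    n = len(counts_per_doc)
    d = sum(1 for f in counts_per_doc if f > 0)
    return math.log2(n / d) + 1.0 if d else 0.0

# One term spread evenly over all four documents, one concentrated in two.
print(entropy_weight([1, 1, 1, 1]), idf_weight([1, 1, 1, 1]))
print(entropy_weight([1, 1, 0, 0]), idf_weight([1, 1, 0, 0]))
```

Under both definitions, a term spread evenly across every document receives the lowest weight, while a term concentrated in a few documents receives a higher one.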
Figure 6. Third Settings Screen
Unless the box is checked, clustering is not automatically performed. However, once Text Miner completes the
parsing and transformation steps, the user can request that the clustering be performed by using the settings value in
the Tools pull-down menu in the results display.
The user can also set the number of clusters, and the method on which to base the clusters. The default number
of terms used to describe the clusters is set at 5. That number may be too small to be able to label the clusters
effectively, and it is recommended that this default be increased to 20 or more terms.
Again, the user is encouraged to work with the defaults to determine their impact upon the results. There is no one
correct outcome to clustering text documents. Therefore, the user is free to change the settings in this and all
previous boxes to get a desirable result.
CONCEPT LINKS
A concept link shows terms in the documents that are highly associated with each other, and visualizes the
associations with a hyperbolic tree display (Figure 7).
Figure 7. View Concept Links
The view menu allows the investigator to show the defaults for the concept links (Figure 8).
Figure 8. Concept Links Settings
The default is to examine terms with the following characteristics:
1. The terms occur in at least n documents where n=MAX(4,A/100,B), where A is the largest value of the number of
documents containing a term for the subset of terms used in the document, and B is the 1000th largest value of the
number of documents containing a term for the subset of terms used in the concept linking.
2. Term 2 occurs when term 1 occurs at least 5% of the time.
3. The relationship between terms is highly significant (the chi-square statistic>12).
The relationship between terms is measured by a chi-square statistic. The cutoff values for extremely, highly, and
somewhat significant are 24, 12, and 6 respectively. The Title variable is used as the title of the concept links Web
page that contains the concept links. By default, the title is The SAS System. By default, the publish location is a
folder in the WORK library.
Once the links are displayed in the default browser with ActiveX, holding the mouse pointer over any one term
displays A/B, where A is the number of documents in which the term occurs together with the center term and B is the
total number of documents in which the term occurs.
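The default criteria for a concept link can be checked by hand from document counts. The Python sketch below is an illustration under stated assumptions, not Text Miner's implementation: it uses the ordinary Pearson chi-square statistic for a 2x2 table, treats B as 0 when fewer than 1000 terms exist, and all function names are hypothetical. In the example, the total of 300 patients with Allegra is an invented figure used only to make the example runnable.

```python
def min_doc_threshold(doc_counts):
    """n = MAX(4, A/100, B) from the default settings.

    doc_counts: number of documents containing each candidate term.
    Assumption: B is taken as 0 when there are fewer than 1000 terms.
    """
    ordered = sorted(doc_counts, reverse=True)
    a = ordered[0] if ordered else 0
    b = ordered[999] if len(ordered) >= 1000 else 0
    return max(4, a / 100, b)

def chi_square_2x2(n_total, n_term1, n_term2, n_both):
    """Pearson chi-square for the 2x2 table of term 1 by term 2 occurrence."""
    a = n_both                                   # both terms present
    b = n_term1 - n_both                         # term 1 only
    c = n_term2 - n_both                         # term 2 only
    d = n_total - n_term1 - n_term2 + n_both     # neither term
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n_total * (a * d - b * c) ** 2 / denom

def is_concept_link(n_total, n_term1, n_term2, n_both,
                    min_docs=4, min_share=0.05, chi_cutoff=12.0):
    """Apply the three default criteria to one pair of terms."""
    if n_term1 < min_docs or n_term2 < min_docs:
        return False
    if n_both / n_term1 < min_share:             # term 2 given term 1
        return False
    return chi_square_2x2(n_total, n_term1, n_term2, n_both) > chi_cutoff

# 5397 documents, 134 with Flonase, 21 of those also with Allegra;
# the 300 documents with Allegra is a hypothetical value.
print(is_concept_link(5397, 134, 300, 21))
```

The cutoff of 12 corresponds to the "highly significant" setting; passing chi_cutoff=24 or chi_cutoff=6 would mimic the "extremely" and "somewhat" settings.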
PRE-PROCESSING OF THE DATA
The original AHRQ dataset contains 277,866 observations. Each observation represents one medication order
given to one patient. These represent 20,679 patients. By combining all records for one patient into a single record,
it was possible to provide SAS with records where all medications were put together in one string. However, there
were several problems with the original dataset that had to be dealt with before SAS could be used. In particular,
fields defined as string that supposedly contained medication name and code (like RXNAME and RXNDC) did not
actually have real values. In particular the RXNAME field contained a code ("-13", indicating unknown) on 215,742
entries, over 90% of the total. The number of numerical codes in RXNDC was even higher. As a result, fields
containing such values had to be filtered out. Fortunately, other fields (for instance, the ones containing condition
codes) were much cleaner. A program was written that brought together all the observations for each patient into a
group, created a single string to concatenate all code strings into one, another string to concatenate all medication
names into one, and another string to concatenate all medication codes into one. The program got rid of numerical
codes (like the above "-13", or "-1", used to indicate no condition code available) since these could throw off further
analysis by SAS if considered as strings. The output was then formatted in a way that was easy to load into SAS.
Due to several delays and problems in this preprocessing, only about half (approximately 10,000) records could be
created and used for text mining.
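The preprocessing program itself is not reproduced in this paper. The following Python sketch shows one way the described step could look; it is not the program that was used, and the function names and the (patient id, medication name) input layout are assumptions made for illustration.

```python
from collections import defaultdict

def is_numeric_code(value):
    """True for empty values and for placeholder codes such as "-13" or "-1"."""
    return not value or value.lstrip("-").isdigit()

def build_patient_strings(records):
    """Combine one-medication-per-row records into one text string per patient.

    records: iterable of (patient_id, medication_name) pairs, one per
    medication order (a hypothetical layout of the input data).
    """
    grouped = defaultdict(list)
    for patient_id, name in records:
        value = str(name).strip()
        if is_numeric_code(value):
            continue  # drop placeholder codes so they are not mined as text
        grouped[patient_id].append(value.lower())
    # Patients whose records hold only placeholder codes do not appear at all.
    return {pid: " ".join(names) for pid, names in grouped.items()}

orders = [
    (1, "ALBUTEROL"),
    (1, "-13"),
    (1, "FLOVENT"),
    (2, "-1"),
    (3, "LIPITOR"),
]
print(build_patient_strings(orders))
```

Keeping the medications of one patient in a single string is what preserves the linkage between drugs for that individual.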
RESULTS
RESULTS USING TEXT MINER
A total of 5397 observations, each a text string of medications for one individual, were used in the analysis. These
observations represent approximately a 25% subsample of the total number of patients in the AHRQ database. Due
to several delays and problems in the preprocessing, the full dataset of approximately 10,000 records was not used.
A total of 18 clusters were returned by Text Miner using the expectation maximization clustering technique (Table 1).
An additional 6000 records were used for validation. The remaining 10,000 patient records will be used for additional
analysis.
Table 1. Clustering of Medication Text Strings
Cluster #   Descriptive Terms   Freq
1
serevent, flovent, pulmicort, prednisone, inhaler, singulair, sulfate, albuterol, diskus, atrovent, intal,
prednisolone, advair, ventolin, p.f., xopenex, rhinocort, zn/polymyxin, panfil, bacitracin
2
flonase, penicillin, zyrtec, dm, gum, bubble, amoxicillin, amoxil, apri, loestrin, augmentin, trimox,
allegra, claritin, vk, z-pak, pediatric, zithromax, pediatric fruit, fruit
3
premarin, synthroid, fosamax, estradiol, + unit, medroxyprogesterone, acetate, singulair, provera,
alprazolam, levothroid, flovent, ranitidine, zocor, levoxyl, serevent, hctz/triamterene, vioxx, hcl,
325
4
avandia, novolin, insulin, glucophage, softclix, vial, glyburide, system, humalog, surestep, diabetes, +
lancet, glucometer, humulin, elite, contraceptive, metformin, actos, metoprolol, glipizide
264
5
cipro, vioxx, percocet, w/codeine #3, w/diluent, w/codeine, vicodin, tylenol, hydromet,
hctz/propranolol, propoxy-n, aciphex, hydrocodone/apap, acetaminophen, morphine, mobic, veetids,
desowen, ibu, m
6
phenobarbital, primidone, kapseal, ex, dilantin, kapseals, infatabs, phenytoin, tegretol, + extend,
valium, + estrogen, diovan, fosamax, sodium, atenolol, hcl, + unit, celebrex, claritin
299
1034
313
34
7
hcl, hydroxyzine, cyclobenzaprine, cleocin, ranitidine, minocycline, triaz, clindamycin,
methylphenidate, verapamil, terazosin, propoxyphene, oxycodone, propranolol, clonidine, er, differin,
hydrocort
180
8
antibiotic, triple, orphengesic, aquaphor, mytussin, cough, metrogel, sulfatrim, ketoconazole, nix, a.f.,
nystatin, hc, hydrocortisone, polysporin, ac, s.f., acetaminophen/codeine, allegra-d, cefzil
160
9
celebrex, premarin, ocuflox, hydromet, ortho, tri-cyclen, ultram, provera, lotrel, ambien, dyrenium,
cyclobenzaprine, biaxin, prilosec, levaquin, vicodin, zoloft, prevacid, caplet, + unit
188
10
paxil, cortisporin, + unit, aldara, cream, hydrocortisone/neomycin/polymyxin, trazodone, temazepam,
ds, depakote, imitrex, hydroxyzine, lorazepam, amitriptyline, allegra-d, apap/hydrocodone, bitartrat
100
11
hydrochlorothiazide, ibuprofen, tussafed-ex, cherry-vanilla, dac, dm, pentoxifylline, tiazac, plendil, +
estrogen, roxicet, viagra, robitussin, acetaminophen, captopril, flexeril, vicodin, keflex, s.f
99
12
toprol, atenolol, accupril, prinivil, cimetidine, naphcon-a, hydrochlorothiazide, adalat, aspirin, lotensin,
glucotrol, darvocet-n, diovan, hctz/triamterene, dyrenium, levoxyl, biaxin, softclix, premp
537
13
lipitor, aspirin, plavix, atenolol, prevacid, altace, norvasc, tricor, relafen, isosorbide, cyclobenzaprine,
hydrochlorothiazide, nitroglycerin, coumadin, zestoretic, monopril, ambien, zestril, prempr
123
14
depakote, caplet, wellbutrin, blister, bitartrate, adderall, apap/hydrocodone, + pack, sr, celexa,
prevacid, prenatal, prozac, plan, napsylate, apap/propoxyphene, prempro, naproxen, risperdal,
diovan
884
15
chloride, lanoxin, coumadin, potassium, pravachol, enalapril, furosemide, maleate, lasix, ocumeter,
vasotec, klor-con, glipizide, k-dur, xt, hyclate, doxycycline, xalatan, diltiazem, theophylline
209
16
atenolol, zocor, folic, warfarin, tartrate, hydrochlorothiazide, acid, sodium, prilosec, norvasc,
metoprolol, zestril, furosemide, toprol, + unit, accupril, glucotrol, k-dur, levaquin, glucophage
453
17
concerta, zoloft, cfe, natalcare, mebendazole, prolex, desogen, jr, trazodone, darvocet-n, serzone,
diphenhydramine, alprazolam, buspar, er, triple, antibiotic, synthroid, diflucan, trimox
70
18
amoxil, dialpak, ortho, ortho-novum, cyclen, tri, tri-cyclen, gum, bubble, tricyclen, micronor, orthocyclen, necon, af, diflucan, minocycline, guaifenex, strawberry, z-pak, naproxen
125
Consider the first cluster. Almost all of the medications in the cluster are used to treat asthma or severe allergies.
Similarly, the medications in cluster 4 are used for the treatment of diabetes. Cluster 2 combines medications for
asthma with antibiotics.
Table 2. Cluster Labels.
Cluster   Label
1    Asthma medications
2    Upper respiratory infection
3    Post-menopause
4    Diabetes
5    Acute or chronic pain
6    Hypertension
7    Acne and ADD
8    Head lice, skin conditions
9    Hypertension and urinary tract
10   Migraine and depression
11   Hypertension
12   Type I diabetes
13   Post congestive heart failure
14   ADD
15   Heart condition
16   Type II diabetes
17   ADD and aggressive behavior
18   Oral contraceptives and antibiotics
Generally, the 18 clusters above can be labeled as given in Table 2.
The drug Flonase appears at the beginning of the list in Cluster 2. It is, therefore, of interest to look at
all associations with the drug (Figure 9). There are a total of 134 patients in the list with a prescription
for the drug. Of that number, 6 also have a prescription for Serevent and 21 for Allegra.
Figure 9. Concept Links to Flonase
Figure 10. Concept Links to Albuterol
There are a total of 337 patients with Albuterol prescriptions. Only 26 also have an order for an inhaler, which is
needed to administer the Albuterol. Of the 337 patients, 40 have prescriptions for Flovent, 27 for Serevent, and 43 for
Singulair. A direct association between antibiotics and the medication, Albuterol, is not so clear. However, in the
second generation, the association exists (Figure 11).
Figure 11. Second Generation Concept Links to Albuterol
There are 8 patients with prescriptions for Zithromax in addition to Atrovent and Albuterol. Similarly, 18 have
prescriptions for Amoxicillin (suggesting pediatric patients). A total of 23 patients have a prescription for Bacitracin
(which only exists for 28 patients total) in combination with sulphur and Albuterol. Bacitracin is for a skin infection.
Another combination of interest is demonstrated in the associations with Lipitor. Most are related either to other
heart medications, or to diabetes medications (Figure 12). Of the 240 patients with a Lipitor prescription, 31 are
using Softclix, 20 also have a prescription for Glucophage, 13 for Novolin, and 7 for Humulin. Moreover, 14 also have
prescriptions for Zestril, indicating a need to switch to a different cholesterol-lowering medication.
To examine diabetes medications more closely, Figure 13 is centered on Glucophage.
Figure 12. Concept Links for Lipitor
Figure 13. Concept Links for Glucophage
Of the 128 patients taking Glucophage, 10 are taking Zestril, 11 Zocor, and 20 Lipitor; 17 also have a prescription for
Glucotrol. Given that Vioxx was recently pulled from the market because of the risk of adverse effects, it is
worthwhile to examine drugs taken in combination with Vioxx to examine the possibility that it is the interactions that
result in the adverse outcomes (Figure 14). A total of 205 patients have a prescription for Vioxx (2.5% of the total
patient base). In addition, several are taking some serious pain medication: 4 Oxycodone, 7 Percocet, 11 Vicodin.
Of the patients on Vioxx, 21 (10%) also have a prescription for Celebrex, again indicating switching medications.
Figure 14. Concept Links for Vioxx
Another type of drug of interest is that of anti-depressants (Figure 15). There were a total of 116 prescriptions for
Zoloft and 357 for Prozac. Note that none of the Zoloft users switched to or were switched from Prozac. That makes
for close to 5% of the total patient base on just these two anti-depressants. In fact, very few of the patients on Zoloft
are using additional medications. However, there are 7 using a strong pain medication (hydrocodone).
Figure 15. Concept Links for Zoloft
TESTING THE RESULTS
To test the results, an additional 6000 documents were used as a separate training set using the code given in
Figure 16.
Figure 16. Predictive Modeling in Enterprise Miner
The profit and lift charts are given in Figures 17 and 18. The profit chart indicates that the neural network and regression
models have the best outcome; regression and memory-based reasoning only work as well as baseline. However,
the lift chart supports the neural network as the optimal model.
Figure 17. Profit Chart
Figure 18. Lift Chart
COMPARISON TO ASSOCIATION NODE
It is possible to investigate the data without using the Text Miner Node by using the Association Node directly. The
Association Node will treat each medication as a separate category, without the ability to link based upon the natural
language and stemming properties that are available in Text Miner. The Association Node also works with all
277,000 observations in the dataset, with one medication prescription per observation. Because of the size of the
dataset and the limitations on the capacity of the computer, the first 1000 records in the dataset were used in the
Association Node, with the dataset sorted by patient id. Therefore, all medications prescribed to any one individual
are included in the sample. Because concept links are organized by initial drug, the rules were organized in a similar
fashion. Table 3 gives the initial rules for the drug, Vioxx.
Table 3. Association Rules for Vioxx
Table 3 shows that the highest confidence for associations with Vioxx is for Softclix and Premarin; neither is listed
in Figure 14. Medications also used for inflammation and pain include Celebrex; this association has some
confidence and support, and is listed in Figure 14 as well. A diagram for the rules is given in Figure 19.
Figure 19. Diagram of Association Rules for Vioxx
Because of the density of the graph, results are difficult to interpret. However, besides SoftClix and
Premarin, the associations are relatively similar. Results are also provided for the medication, Albuterol
(Table 4).
Table 4. Association Rules for Albuterol
The associations with Albuterol are much more numerous. Associations with medications of a similar nature include
Flovent, Combivent, and Claritin. The corresponding graph is given in Figure 20.
Figure 20. Graph of Association Rules for Albuterol
Again, the graph is too crowded to have a lot of value. The Association Node works best when there are a
limited number of choices. Given the diversity of medications prescribed, a reduction to just a few categories will
be very difficult without the natural language properties of Text Miner.
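For readers less familiar with the support and confidence values reported for association rules, the Python sketch below computes them, together with lift, for a single rule. It uses the standard definitions of these measures and is not the Association Node's own code; the function name and the four toy patients are hypothetical.

```python
def rule_stats(baskets, lhs, rhs):
    """Support, confidence and lift for the rule lhs -> rhs.

    baskets: list of sets, each holding the medications of one patient.
    support    = share of patients with both medications
    confidence = share of patients with lhs who also have rhs
    lift       = confidence divided by the overall share of patients with rhs
    """
    n = len(baskets)
    n_lhs = sum(1 for b in baskets if lhs in b)
    n_rhs = sum(1 for b in baskets if rhs in b)
    n_both = sum(1 for b in baskets if lhs in b and rhs in b)
    support = n_both / n if n else 0.0
    confidence = n_both / n_lhs if n_lhs else 0.0
    lift = confidence / (n_rhs / n) if n and n_rhs else 0.0
    return support, confidence, lift

patients = [
    {"vioxx", "celebrex"},
    {"vioxx"},
    {"albuterol", "flovent"},
    {"vioxx", "celebrex", "premarin"},
]
support, confidence, lift = rule_stats(patients, "vioxx", "celebrex")
print(support, confidence, lift)
```

Because every distinct medication name is its own item, the number of candidate rules grows quickly with the diversity of prescriptions, which is consistent with the crowded graphs in Figures 19 and 20.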
CONCLUSION
Text Miner can be used to investigate linkage, which is defined here as the multiple prescriptions used by any one
individual over the course of a year. Text Miner can group patients for targeted marketing. It can also investigate the
likelihood of patients switching from one medication to another for the same problem. As such, Text Miner can
contribute greatly to postmarketing surveillance of medications. Because of the stemming and natural language
properties of Text Miner, the results are more meaningful and superior to those found more directly using the
Association Node.
CONTACT INFORMATION
Patricia B. Cerrito
Department of Mathematics
University of Louisville
Louisville, KY 40292
502-852-6826
Fax: 502-852-7132
[email protected]

Antonio Badia
Department of Computer Engineering
502-852-4078
[email protected]

John C. Cerrito
Statistical Consulting of Louisville, Inc.
302 Chippendale Ct.
Louisville, KY 40214
502-417-2742
[email protected]
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.