Paper PR02

Data Mining Medication Prescriptions for a Representative National Sample

Patricia B. Cerrito, Antonio Badia, John C. Cerrito, University of Louisville, Louisville, KY

ABSTRACT

The purpose of this paper is to examine how medications are prescribed to individuals in combination for one or multiple medical conditions, and to explore the use of such medications. The Agency for Healthcare Research and Quality conducts the Medical Expenditure Panel Survey yearly and makes the results available for research purposes. The latest survey, released in February 2004, was for the year 2001. Each medication prescribed to a particular individual is listed; each observation represents a different medication. The first step in the analysis is to combine all medications into one text string by patient. The text string preserves the linkages between drugs for any one individual. Next, SAS Text Miner is used to cluster the results into meaningful categories. Concept links that are available in Text Miner are also used to examine relationships between medications. Results demonstrate that specific co-morbidities are highly related, and that medication combinations identified in this analysis can be used either to identify drug combination studies or for targeted marketing.

INTRODUCTION

To examine the issue of prescribed medications, the Agency for Healthcare Research and Quality conducts the Medical Expenditure Panel Survey yearly and makes the results available for research purposes (http://www.meps.ahrq.gov/Puf/PUFLookup.asp). The most recent year for which data are available is 2001. The data are listed so that each observation represents one medication order. However, patient identifiers are provided with the data, so it is possible to link all medications prescribed for one patient into a single observation in the dataset. The medications are combined as a text string.
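The combining step can be sketched in a few lines. The sketch below is only an illustration in Python, not the program used for this paper; the function name, the patient identifiers, and the sample rows are hypothetical, and lower-casing the names is an assumption made for convenience.

```python
from collections import defaultdict

def combine_by_patient(orders):
    """Collapse (patient_id, medication) rows into one text string per patient."""
    grouped = defaultdict(list)
    for patient_id, medication in orders:
        name = medication.strip().lower()
        if name:  # skip blank medication names
            grouped[patient_id].append(name)
    # Join each patient's medications in the order they were listed.
    return {pid: " ".join(meds) for pid, meds in grouped.items()}

# Hypothetical sample rows, not taken from the MEPS file.
orders = [
    (101, "Albuterol"),
    (102, "Lipitor"),
    (101, "Flovent"),
    (102, "Glucophage"),
    (101, "Singulair"),
]
documents = combine_by_patient(orders)
for pid, text in documents.items():
    print(pid, text)
```

Each resulting string plays the role of one "document" per patient, which is the form a text mining tool expects.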
Once the medications are combined into a text string, the data can be analyzed using SAS Text Miner, since the text string preserves the linkages between drugs for any one individual. Relationships between medications can then be examined. It was found that it is common for individuals to switch from one medication to another to treat the same problem, and for patient co-morbidities to be related. For example, Vioxx, used for treating arthritis, was recently removed from the market because of adverse events. Drugs that are commonly prescribed with Vioxx can be examined to determine whether interactions are more responsible for the adverse reactions than Vioxx alone. The information can be used for direct marketing, or to examine how physicians prescribe medications in combination.

METHODS

SAS TEXT MINER

The initial icons for Text Miner are given in Figure 1. These nodes can be integrated into Enterprise Miner provided that Text Miner is available.

Figure 1. Text Miner Node

Specify the dataset as usual. The Text Miner Node has three settings screens to examine. The first screen is given in Figure 2.

Figure 2. First Settings Screen

There is an option to choose whether the text is stored in a SAS dataset, or whether there is a variable in the dataset that points to the location of the document. The second option is available to reduce the required storage size for a SAS dataset. With the second option, there is no limit on the size of each document; with the first option, the size is restricted to 10 pages. The first default is to exclude words that occur in only one document, since those words cannot be used to group documents together. Unchecking the box will also create a much longer wordlist. A second default is to consider a word to be different if it is used as a different part of speech. It is often advantageous to uncheck this box since Text Miner sometimes has difficulty with grammar.
In addition, it is possible to ignore specific parts of speech (such as conjunctions). Words with the same stem should be considered the same word. Therefore, the third box should always be checked. Numbers and punctuation are not ordinarily used to cluster text documents, so the default is unchecked. However, inventory codes can be examined using Text Miner. If that is the case, then the numbers box should be checked. It is suggested that the user experiment with the defaults to observe the impact on the final outcomes.

A standard “stoplist” dataset will remove common words such as “and” and “the” from consideration. Users can add words to the stoplist as needed, or create their own lists. Similarly, a “startlist” can be defined. In this case, only the specific words contained within the startlist will be used. This option is useful to flag documents containing particular words or phrases. It differs from a more typical keyword search in that several words and phrases can be searched simultaneously. It is possible to restrict attention to some specific terms by listing them in a dataset. Text Miner will only list terms from the specified dataset.

The purpose of this step is to ‘parse’ the documents. Text parsing is a very technical process that is used to reduce the documents to a manageable number of terms. It also means that the software attempts to use grammar context to identify a specific part of speech for each term used. Modifiers are often connected to nouns to define ‘noun groups’ (Figure 3).

Figure 3. Results of Parsing

The + signs indicate that there is more than one word connected to the phrase. Clicking on the ‘Term’ box will put the words in alphabetical order. Notice that some of the terms have a ‘Y’ or an ‘N’. Any value with an ‘N’ is contained within the ‘stoplist’ file and is not used in the analysis.
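The keep/drop logic described above can be illustrated with a short sketch. It is written in Python rather than SAS and is not how Text Miner is implemented; the function name, its parameters, and the sample documents are hypothetical. As an assumption made for the sketch, terms that occur in fewer than `min_docs` documents are also flagged ‘N’, mirroring the default that excludes words found in only one document.

```python
from collections import Counter

def parse_documents(docs, stoplist=(), startlist=None, min_docs=2):
    """Tokenize documents and flag each term 'Y' (kept) or 'N' (dropped)."""
    stop = {w.lower() for w in stoplist}
    start = {w.lower() for w in startlist} if startlist is not None else None
    # Count the number of documents containing each term (not total occurrences).
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(text.lower().split()))
    flags = {}
    for term, n_docs in doc_freq.items():
        dropped = (
            term in stop                                   # on the stoplist
            or (start is not None and term not in start)   # not on the startlist
            or n_docs < min_docs                           # too rare to group documents
        )
        flags[term] = "N" if dropped else "Y"
    return flags

# Hypothetical documents, one medication string per patient.
docs = [
    "albuterol and flovent",
    "albuterol and singulair",
    "lipitor and flovent",
]
flags = parse_documents(docs, stoplist=["and"])
for term in sorted(flags):
    print(term, flags[term])
```

Passing a `startlist` restricts the kept terms to that list, which corresponds to the flagging use described above.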
Common words such as “and” that are not specifically listed in the “stoplist” dataset should not be given a large weight, since almost all documents contain many “and” words and they contribute very little to grouping documents. By unchecking the box ‘Display dropped terms’, all values with an ‘N’ are removed from the window; unchecking ‘Display kept terms’ removes all words with a ‘Y’.

Singular value decomposition defines a matrix of words by documents. The maximum dimensions box (100 by default) limits the size of this matrix. However, the larger the matrix, the more time-consuming this process. The roll-up terms setting limits the wordlist to the highest weighted terms (the top 100 by default). It is suggested that the user modify the dimensions somewhat to determine the impact on the final outcome of Text Miner. Once the singular value decomposition is started, a status screen briefly pops up, indicating that the decomposition is being performed. The user can close this screen since the process will continue to run.

Figure 4. Second Settings Screen

The second screen allows the user to determine the method of reducing the wordlist matrix to a manageable size. The default is to use singular value decomposition. There are also several possible methods to weight the value of each term in the documents. To investigate how these weights and methods impact outcomes, it is best to use one dataset and change the settings to see how the results differ. The number of dimensions defaults to 100. However, that number can be decreased for a smaller number of documents, and increased for a large number (although the time factor will increase considerably). A drop-down menu allows the user to change the weights (Figure 5).

Figure 5. Drop-Down Menu for Term Weights

Entropy is the default weighting. Terms that appear more frequently will be weighted lower compared to terms that appear less frequently.
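The behavior of entropy weighting can be illustrated with a small sketch. It assumes the common log-entropy form, weight = 1 + sum over documents of p*log2(p) / log2(n), where p is the share of a term's occurrences that fall in a given document and n is the number of documents. Whether Text Miner computes its entropy weight in exactly this way is an assumption; the function name and sample counts are hypothetical.

```python
import math

def entropy_weights(term_counts):
    """Return an entropy-style weight per term.

    term_counts maps each term to a list of its frequencies, one per document.
    """
    weights = {}
    for term, freqs in term_counts.items():
        n_docs = len(freqs)
        total = sum(freqs)
        if total == 0 or n_docs < 2:
            weights[term] = 0.0  # weight is undefined here; treat as uninformative
            continue
        entropy_sum = 0.0
        for f in freqs:
            if f > 0:
                p = f / total  # share of this term's occurrences in this document
                entropy_sum += p * math.log2(p)
        weights[term] = 1.0 + entropy_sum / math.log2(n_docs)
    return weights

# Hypothetical per-document frequencies for four documents.
counts = {
    "and": [1, 1, 1, 1],        # spread evenly across every document
    "albuterol": [1, 1, 0, 0],  # appears in half of the documents
    "intal": [3, 0, 0, 0],      # concentrated in one document
}
weights = entropy_weights(counts)
for term, w in weights.items():
    print(term, round(w, 3))
```

Under this form, a term spread evenly over all documents receives a weight near 0, while a term concentrated in few documents receives a weight near 1.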
This weighting is somewhat different from Inverse Doc Freq, where terms that appear in as few as 2 documents are given the highest weights.

Figure 6. Third Settings Screen

Unless the box is checked, clustering is not automatically performed. However, once Text Miner completes the parsing and transformation steps, the user can request that the clustering be performed by using the settings value in the Tools pull-down menu in the results display. The user can also set the number of clusters, and the method on which to base the clusters. The default number of terms used to describe the clusters is set at 5. That number may be too small to label the clusters effectively, and it is recommended that this default be increased to 20 or more terms. Again, the user is encouraged to work with the defaults to determine their impact upon the results. There is no one correct outcome to clustering text documents. Therefore, the user is free to change the settings in this and all previous boxes to get a desirable result.

CONCEPT LINKS

A concept link shows terms in the documents that are highly associated with each other, and visualizes the associations with a hyperbolic tree display (Figure 7).

Figure 7. View Concept Links

The view menu allows the investigator to show the defaults for the concept links (Figure 8).

Figure 8. Concept Links Settings

The default is to examine terms with the following characteristics:

1. The terms occur in at least n documents, where n = MAX(4, A/100, B), A is the largest value of the number of documents containing a term for the subset of terms used in the document, and B is the 1000th largest value of the number of documents containing a term for the subset of terms used in the concept linking.
2. Term 2 occurs when term 1 occurs at least 5% of the time.
3. The relationship between terms is highly significant (the chi-square statistic > 12).

The relationship between terms is measured by a chi-square statistic.
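The second and third criteria can be sketched from four document counts. The sketch below uses the ordinary Pearson chi-square for a 2x2 table without a continuity correction; whether Text Miner computes its statistic in exactly this way is an assumption. The function names and the counts are hypothetical, and the first criterion (the minimum number of documents n) is not checked.

```python
def chi_square_2x2(n_both, n_term1, n_term2, n_docs):
    """Pearson chi-square for the co-occurrence of two terms across documents."""
    a = n_both                                # documents with both terms
    b = n_term1 - n_both                      # term 1 only
    c = n_term2 - n_both                      # term 2 only
    d = n_docs - n_term1 - n_term2 + n_both   # neither term
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a term occurs in no documents or in all of them
    return n_docs * (a * d - b * c) ** 2 / denom

def passes_default_link_rules(n_both, n_term1, n_term2, n_docs):
    """Check criteria 2 and 3: at least 5% co-occurrence and chi-square > 12."""
    if n_term1 == 0:
        return False
    share = n_both / n_term1  # how often term 2 occurs when term 1 occurs
    return share >= 0.05 and chi_square_2x2(n_both, n_term1, n_term2, n_docs) > 12

# Hypothetical counts: 100 documents, each term in 20 of them, 10 containing both.
chi2 = chi_square_2x2(n_both=10, n_term1=20, n_term2=20, n_docs=100)
print(chi2, passes_default_link_rules(10, 20, 20, 100))
```

When the two terms co-occur exactly as often as chance would predict, the statistic is 0 and no link is drawn.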
The cutoff values for extremely, highly, and somewhat significant are 24, 12, and 6, respectively. The Title variable is used as the title of the Web page that contains the concept links. By default, the title is The SAS System. By default, the publish location is a folder in the WORK library. Once the links are displayed in the default browser with ActiveX, placing the mouse pointer over any one term displays A/B, where A is the number of documents in which the term occurs together with the center term and B is the total number of documents in which the term occurs.

PRE-PROCESSING OF THE DATA

The original AHRQ dataset contains 277,866 observations. Each observation represents one medication order given to one patient. These represent 20,679 patients. By combining all records for one patient into a single record, it was possible to provide SAS with records where all medications were put together in one string. However, there were several problems with the original dataset that had to be dealt with before SAS could be used. In particular, fields defined as strings that supposedly contained the medication name and code (such as RXNAME and RXNDC) often did not have real values. The RXNAME field contained a code ("-13", indicating unknown) on 215,742 entries, over 90% of the total. The number of numerical codes in RXNDC was even higher. As a result, fields containing such values had to be filtered out. Fortunately, other fields (for instance, the ones containing condition codes) were much cleaner.

A program was written that brought together all the observations for each patient into a group and created three strings: one concatenating all condition codes, one concatenating all medication names, and one concatenating all medication codes. The program removed numerical codes (such as the "-13" above, or "-1", used to indicate that no condition code is available), since these could throw off further analysis by SAS if treated as strings. The output was then formatted in a way that was easy to load into SAS. Due to several delays and problems in this preprocessing, only about half of the records (approximately 10,000) could be created for text mining.

RESULTS

RESULTS USING TEXT MINER

A total of 5397 observations, each a text string of medications for one individual, were used in the analysis. These observations represent approximately a 25% subsample of the total number of patients in the AHRQ database. Because of the preprocessing problems, the full set of approximately 10,000 records was not used. An additional 6000 records were used for validation. The remaining 10,000 patient records will be used for additional analysis. A total of 18 clusters were returned by Text Miner using the expectation maximization clustering technique (Table 1).

Table 1.
Clustering of Medication Text Strings

Cluster # | Descriptive Terms | Freq
1 | serevent, flovent, pulmicort, prednisone, inhaler, singulair, sulfate, albuterol, diskus, atrovent, intal, prednisolone, advair, ventolin, p.f., xopenex, rhinocort, zn/polymyxin, panfil, bacitracin |
2 | flonase, penicillin, zyrtec, dm, gum, bubble, amoxicillin, amoxil, apri, loestrin, augmentin, trimox, allegra, claritin, vk, z-pak, pediatric, zithromax, pediatric fruit, fruit |
3 | premarin, synthroid, fosamax, estradiol, + unit, medroxyprogesterone, acetate, singulair, provera, alprazolam, levothroid, flovent, ranitidine, zocor, levoxyl, serevent, hctz/triamterene, vioxx, hcl | 325
4 | avandia, novolin, insulin, glucophage, softclix, vial, glyburide, system, humalog, surestep, diabetes, + lancet, glucometer, humulin, elite, contraceptive, metformin, actos, metoprolol, glipizide | 264
5 | cipro, vioxx, percocet, w/codeine #3, w/diluent, w/codeine, vicodin, tylenol, hydromet, hctz/propranolol, propoxy-n, aciphex, hydrocodone/apap, acetaminophen, morphine, mobic, veetids, desowen, ibu, m |
6 | phenobarbital, primidone, kapseal, ex, dilantin, kapseals, infatabs, phenytoin, tegretol, + extend, valium, + estrogen, diovan, fosamax, sodium, atenolol, hcl, + unit, celebrex, claritin |
7 | hcl, hydroxyzine, cyclobenzaprine, cleocin, ranitidine, minocycline, triaz, clindamycin, methylphenidate, verapamil, terazosin, propoxyphene, oxycodone, propranolol, clonidine, er, differin, hydrocort | 180
8 | antibiotic, triple, orphengesic, aquaphor, mytussin, cough, metrogel, sulfatrim, ketoconazole, nix, a.f., nystatin, hc, hydrocortisone, polysporin, ac, s.f., acetaminophen/codeine, allegra-d, cefzil | 160
9 | celebrex, premarin, ocuflox, hydromet, ortho, tri-cyclen, ultram, provera, lotrel, ambien, dyrenium, cyclobenzaprine, biaxin, prilosec, levaquin, vicodin, zoloft, prevacid, caplet, + unit | 188
10 | paxil, cortisporin, + unit, aldara, cream, hydrocortisone/neomycin/polymyxin, trazodone, temazepam, ds, depakote, imitrex, hydroxyzine, lorazepam, amitriptyline, allegra-d, apap/hydrocodone, bitartrat | 100
11 | hydrochlorothiazide, ibuprofen, tussafed-ex, cherry-vanilla, dac, dm, pentoxifylline, tiazac, plendil, + estrogen, roxicet, viagra, robitussin, acetaminophen, captopril, flexeril, vicodin, keflex, s.f | 99
12 | toprol, atenolol, accupril, prinivil, cimetidine, naphcon-a, hydrochlorothiazide, adalat, aspirin, lotensin, glucotrol, darvocet-n, diovan, hctz/triamterene, dyrenium, levoxyl, biaxin, softclix, premp | 537
13 | lipitor, aspirin, plavix, atenolol, prevacid, altace, norvasc, tricor, relafen, isosorbide, cyclobenzaprine, hydrochlorothiazide, nitroglycerin, coumadin, zestoretic, monopril, ambien, zestril, prempr | 123
14 | depakote, caplet, wellbutrin, blister, bitartrate, adderall, apap/hydrocodone, + pack, sr, celexa, prevacid, prenatal, prozac, plan, napsylate, apap/propoxyphene, prempro, naproxen, risperdal, diovan | 884
15 | chloride, lanoxin, coumadin, potassium, pravachol, enalapril, furosemide, maleate, lasix, ocumeter, vasotec, klor-con, glipizide, k-dur, xt, hyclate, doxycycline, xalatan, diltiazem, theophylline | 209
16 | atenolol, zocor, folic, warfarin, tartrate, hydrochlorothiazide, acid, sodium, prilosec, norvasc, metoprolol, zestril, furosemide, toprol, + unit, accupril, glucotrol, k-dur, levaquin, glucophage | 453
17 | concerta, zoloft, cfe, natalcare, mebendazole, prolex, desogen, jr, trazodone, darvocet-n, serzone, diphenhydramine, alprazolam, buspar, er, triple, antibiotic, synthroid, diflucan, trimox | 70
18 | amoxil, dialpak, ortho, ortho-novum, cyclen, tri, tri-cyclen, gum, bubble, tricyclen, micronor, orthocyclen, necon, af, diflucan, minocycline, guaifenex, strawberry, z-pak, naproxen | 125

The frequencies for clusters 1, 2, 5, and 6 appear in the source only as the unassigned values 299, 1034, 313, and 34.

Consider the first cluster. Almost all of the medications in the cluster are used to treat asthma or severe allergies. Similarly, the medications in cluster 4 are used for the treatment of diabetes. Cluster 2 combines medications for asthma with antibiotics.

Table 2. Cluster Labels.
Cluster | Label
1 | Asthma medications
2 | Upper respiratory infection
3 | Post-menopause
4 | Diabetes
5 | Acute or chronic pain
6 | Hypertension
7 | Acne and ADD
8 | Head lice, skin conditions
9 | Hypertension and urinary tract
10 | Migraine and depression
11 | Hypertension
12 | Type I diabetes
13 | Post congestive heart failure
14 | ADD
15 | Heart condition
16 | Type II diabetes
17 | ADD and aggressive behavior
18 | Oral contraceptives and antibiotics

Generally, the 18 clusters above can be labeled as given in Table 2.

The drug Flonase appears at the beginning of the list in Cluster 2. It is, therefore, of interest to look at all associations with the drug (Figure 9). There are a total of 134 patients in the list with a prescription for the drug. Of that number, 6 also have a prescription for Serevent and 21 for Allegra.

Figure 9. Concept Links to Flonase

Figure 10. Concept Links to Albuterol

There are a total of 337 patients with Albuterol prescriptions. Only 26 also have an order for an inhaler, which is needed to administer the Albuterol. Of the 337, 40 have prescriptions for Flovent, 27 for Serevent, and 43 for Singulair. A direct association between antibiotics and Albuterol is not so clear. However, in the second generation, the association exists (Figure 11).

Figure 11. Second Generation Concept Links to Albuterol

There are 8 patients with prescriptions for Zithromax in addition to Atrovent and Albuterol. Similarly, 18 have prescriptions for Amoxicillin (suggesting pediatric patients). A total of 23 patients have a prescription for Bacitracin (which exists for only 28 patients in total) in combination with sulphur and Albuterol. Bacitracin is for a skin infection.

Another combination of interest is demonstrated in the associations with Lipitor. Most are related either to other heart medications or to diabetes medications (Figure 12). Of the 240 prescriptions, 31 are using Softclix, 20 also have a prescription for Glucophage, 13 for Novolin, and 7 for Humulin.
Moreover, 14 also have prescriptions for Zestril, indicating a need to switch to a different cholesterol-lowering medication. To examine diabetes medications more closely, Figure 13 is centered on Glucophage.

Figure 12. Concept Links for Lipitor

Figure 13. Concept Links for Glucophage

Of the 128 patients taking Glucophage, 10 are taking Zestril, 11 Zocor, and 20 Lipitor. In addition, 17 also have a prescription for Glucotrol.

Given that Vioxx was recently pulled from the market because of the risk of adverse effects, it is worthwhile to examine drugs taken in combination with Vioxx to explore the possibility that it is the interactions that result in the adverse outcomes (Figure 14). A total of 205 patients have a prescription for Vioxx (2.5% of the total patient base). In addition, several are taking some serious pain medication: 4 Oxycodone, 7 Percocet, 11 Vicodin. Twenty-one, or 10%, of the patients on Vioxx also have a prescription for Celebrex, again indicating switching medications.

Figure 14. Concept Links for Vioxx

Another type of drug of interest is that of anti-depressants (Figure 15). There were a total of 116 prescriptions for Zoloft and 357 for Prozac. That makes for close to 5% of the total patient base on just these two anti-depressants. Note that none of the Zoloft users switched to or were switched from Prozac. In fact, very few of the patients on Zoloft are using additional medications. However, there are 7 using a strong pain medication (hydrocodone).

Figure 15. Concept Links for Zoloft

TESTING THE RESULTS

To test the results, an additional 6000 documents were used as a separate training set using the code given in Figure 16.

Figure 16. Predictive Modeling in Enterprise Miner

The profit and lift charts are given in Figures 17 and 18. The profit chart indicates that the neural network and regression models have the best outcome; regression and memory-based reasoning only work as well as baseline. However, the lift chart supports the neural network as the optimal model.
Figure 17. Profit Chart

Figure 18. Lift Chart

COMPARISON TO ASSOCIATION NODE

It is possible to investigate the data without using the Text Miner Node by using the Association Node directly. The Association Node will treat each medication as a separate category, without the ability to link based upon the natural language and stemming properties that are available in Text Miner. The Association Node also works with all 277,000 observations in the dataset, with one medication prescription per observation. Because of the size of the dataset and the limitations on the capacity of the computer, the first 1000 records in the dataset were used in the Association Node, with the dataset sorted by patient id. Therefore, all medications prescribed to any one individual are included in the sample. Because concept links are organized by initial drug, the rules were organized in a similar fashion. Table 3 gives the initial rules for the drug Vioxx.

Table 3. Association Rules for Vioxx

Table 3 shows that the highest confidence for associations with Vioxx are for Softclix and Premarin; neither is listed in Figure 14. Medications also used for inflammation and pain include Celebrex; this association has some confidence and support, and is listed in Figure 14 as well. A diagram for the rules is given in Figure 19.

Figure 19. Diagram of Association Rules for Vioxx

Because of the density of the graph, results are difficult to interpret. However, besides Softclix and Premarin, the associations are relatively similar. Results are also provided for the medication Albuterol (Table 4).

Table 4. Association Rules for Albuterol

The associations with Albuterol are much more numerous. Associations with medications of a similar nature include Flovent, Combivent, and Claritin. The corresponding graph is given in Figure 20.

Figure 20. Graph of Association Rules for Albuterol

Again, the graph is too crowded to have a lot of value.
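For readers unfamiliar with the support and confidence values attached to association rules, the following sketch computes them for rules of the form "drug -> other drug" using the standard definitions: support is the share of all patients with both drugs, and confidence is the share of patients on the first drug who also have the second. It is an illustration only, not the Association Node's implementation; the function name and the sample medication sets are hypothetical.

```python
def rules_for(drug, baskets):
    """Return {other_drug: (support, confidence)} for rules drug -> other_drug."""
    n = len(baskets)
    with_drug = [b for b in baskets if drug in b]
    rules = {}
    if not with_drug:
        return rules
    # Every other medication seen alongside the drug of interest.
    others = set().union(*with_drug) - {drug}
    for other in sorted(others):
        both = sum(1 for b in with_drug if other in b)
        support = both / n                  # share of all patients with both drugs
        confidence = both / len(with_drug)  # share of the drug's patients with the other drug
        rules[other] = (support, confidence)
    return rules

# Hypothetical medication sets, one per patient.
baskets = [
    {"vioxx", "celebrex"},
    {"vioxx", "premarin"},
    {"vioxx", "celebrex", "vicodin"},
    {"albuterol", "flovent"},
]
rules = rules_for("vioxx", baskets)
for other, (support, confidence) in rules.items():
    print(other, round(support, 2), round(confidence, 2))
```

Because every distinct medication name is its own item, the number of candidate rules grows quickly with the variety of drugs, which is consistent with the crowded graphs noted above.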
The Association Node works best when there are a limited number of choices. Given the diversity of medications prescribed, a reduction to just a few categories will be very difficult without the natural language properties of Text Miner.

CONCLUSION

Text Miner can be used to investigate linkage, which is defined here as the multiple prescriptions used by any one individual over the course of a year. Text Miner can group patients for targeted marketing. It can also investigate the likelihood of patients switching from one medication to another for the same problem. As such, Text Miner can contribute greatly to postmarketing surveillance of medications. Because of the stemming and natural language properties of Text Miner, the results are more meaningful and superior to those found more directly using the Association Node.

CONTACT INFORMATION

Patricia B. Cerrito
Department of Mathematics
University of Louisville
Louisville, KY 40292
502-852-6826
Fax: 502-852-7132
[email protected]

Antonio Badia
Department of Computer Engineering
502-852-4078
[email protected]

John C. Cerrito
Statistical Consulting of Louisville, Inc.
302 Chippendale Ct.
Louisville, KY 40214
502-417-2742
[email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.