Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Preliminary Data Mining for Subgroups in a Hospital Inpatient Population Anthony M. Dymond engage in an iterative process of task discovery to focus on and clarify the objectives of the data mining project, and data discovery to identify what studies the data would support, and what studies the data could suggest. ABSTRACT A business plan to manage a group or population depends on knowing if the group is homogeneous or composed of several subgroups. This paper discusses data mining and describes some practical considerations for preliminary data exploration in a hospital inpatient population of 5760 unique patients. A variety of visual and analytic examinations of selected variables suggest that three subpopulations may exist in this data. This process interacts with processes surrounding the data, which involve selecting variables and cases to examine and preparing (cleaning) the data for use (Cortes et al., 1995). Not all the variables from a large database are used at any one time because they are not all relevant to a particular study or because there may be concerns about their completeness or validity. Not all the cases are used because of corrections for outliers and because of the potential for very different characteristics in known or postulated subpopulations. INTRODUCTION Data mining encompasses a variety of techniques for examining databases and extracting insights that can be used to base business decisions. It allows management to discover and utilize information already existing in internal or external databases, and can be applied to decisions and processes anywhere in the organization. Both of these processes then interact with the central process of hypothesis construction and testing. This phase of data mining should always be centered around hypotheses created by the data analyst and domain expert, and should involve construction of models to test these hypotheses. A very wide variety of analytic and visualization tools can be used at this point to explore both the hypotheses and the models (Elder and Pregibon, 1996). This part of the data mining process is also interactive, with hypotheses evaluation leading to new hypotheses and testing. Data mining is business driven, and the present example is typical in that it was motivated by the business need to improve hospital inpatient management. A good understanding of the subject population was essential for the plan. In this particular case, one important hypothesis to explore was the existence of subpopulations among the inpatients. This was motivated in part by the impression of knowledgeable staff that such subpopulations probably exist. All of these activities then result in a deliverable product for presentation to management and for implementation. Current reviews (Fayyad et al., 1996; Brachman and Anand, 1996) consider both the theoretical and practical aspects of data mining, and suggest that the process can be characterized as a non-trivial discovery of novel, useful, and understandable information. The process can be knowledgeintensive, extended over time, iterative, interactive, and intensive of both database and human resources. HOSPITAL INPATIENT MINING Some of the considerations from Figure 1 can be illustrated by the current study of hospital inpatients. The data came from 5760 unique inpatients discharged in 1993 from a large West Coast Veterans Administration (VA) hospital. (Some patients were hospitalized more than once during the year.) The hospital is located in a metropolitan area and has active university medical school affitiations. The VA hospital has many of the aspects of a university teaching hospital, but with predominantly male veteran demographics. Figure 1 shows some of the major elements and interactions in data mining. The process is humancentered, and involves the data analyst (data miner) and a domain expert who is familiar with both the subject area and the organization. These individuals 117 Extensive inpatient and outpatient data is routinely collected by local computer systems at VA hospitals and then subsetted and transferred to a remote IBM® mainframe. This mainframe has SAS® software that was used for these studies. discovered by a visualization method and supported by domain expert expectations. At this point, both data analysts and domain experts must ask themselves if these groups represent a complete and final solution. Based on a data subset containing only the subjects' inpatient activities, these two groups are probably valid. Their existence would represent a major finding of this data mining study, and the next phase of work would attempt to further explore and characterize these groups. The hypothesis was that subpopulations existed within the inpatient group, and that identifying and characterizing these subpopulations would guide patient management. The author has sufficient experience with hospital databases to serve initially as both the data analyst and domain expert. SAS software has a variety of tools to examine data for outliers and for univariate and bivariate distributions, and these were applied to a subset of the variables believed to be most relevant to the issue under consideration. Some variables were temporarily excluded for reasons such as distribution, collinearity, and validity (some variables are recorded only under special circumstances). However, these patients' outpatient activity was also available, and these variables were included in further analyses. Several approaches were tried in an effort to elucidate additional structure in the data. One approach was to try factor analysis because it can sometimes group cases after a transpose between the cases and variables. When this was attempted on a subset of 500 patients, three potential patient subgroups appeared. Additional studies were then conducted using these three subgroups to establish their validity and interpretation. Discriminant analysis suggested that these groups were valid and could be used to classify patients in the database into the three categories. All patients were assigned to these groups by discriminant analysis and a variety of exploratory graphs were examined. A few of these graphs are shown in Figures 3 and 4. Some cases were removed as outliers when values such as income or length of stay were excessive. However, although there are good statistical reasons to ignore outlier data, these cases should always be examined separately as a part of data mining since they can often point to unusual situations in the business process. Evaluating outliers can potentially lead to some of the most significant information resulting from the data mining operation. Examining contour plots for each pair of variables is a useful technique during the early stages of variable and case review. Figure 2 shows this plot for the variables Major Diagnostic Category (MDC) and age. {MDC measures the patient's major reason for having been in the hospital. MDCs are grouped along the lines of organ systems and related disease.) Even at this preliminary stage of data mining, Figure 2 revealed two separate peaks in the inpatient distribution. One large group matched the age range of World War II (WWII) and Korea veterans with MDCs corresponding to diseases in traditional medical and surgical categories such as cardiology and respiratory illness. The second group has an age range of Vietnam veterans with MDCs corresponding to diseases in the drug and mental categories. Figure 3 shows the length of stay and number of visits for the three groups. For Group 3, the length of stay is much longer than for the other groups. For Group 2, the number of visits (to the hospital as outpatients) is much higher than for the other groups. Figure 4 shows the MDCs versus age. (For patients with multiple discharges, the MDC associated with the longest length of stay was chosen.) This graph shows that most of Group 1 patients are in the traditional medical and surgical categories such as respiratory, circulatory, and digestive. Group 3 is heavily concentrated in the drug and mental categories. Group 2 is distributed across MDCs. A preliminary interpretation of these groups is: These two groups--WWII and Korean veterans with traditional medical and surgical illnesses, and Vietnam veterans with more of a propensity toward drug and mental illnesses-are the two inpatient subpopulations that VA domain experts would predict. This is a significant result supporting the hypotheses of subpopulations. It has been Group 1 Tradjtjonal (73% of inpatients) Traditional medical and surgical patients. Perhaps slightly older patients. Patients who do not acculllJiated excessive numbers of vistts or length of stays during the year. 118 could be used to find spontaneous groupings in the data, and discriminant analysis and categorical analyses could provide models describing the variables underlying group membership, and the interactions between variables. Group 2 Outpatjent lptensive (15% of inpatients) Inpatients who are also making heavy use of outpatient services. Distributed over both medical and surgical MDCs and psychiatric MDCs. Group 3 Inpatient lntensjve (12% of inpatients) Inpatients who have very long lengths of stay. There is a high concentration in the mental and drug MDCs, but they are also well represented in the other MDCs. CONCLUSION It is important to maintain a focus on the business objectives during data mining. Data mining attempts to identify novel, useful, and understandable information within existing databases. This examination of inpatient discharges has succeeded in uncovering patient subgroups that certainly meet these criteria. However, other groups probably exist in the data and could be extracted. For example, male and female patients are certainly valid and distinct groups. They were not elucidated in this study because the variable for gender was not used. Female veteran patients, although a small part of the VA population, are well studied and have been the object of extensive recent efforts to optimize their medical care. There is probably minimal additional return from studying gender groupings at this time. DISCUSSION The Group 1 patients probably correspond mostly to the WWII and Korea veterans, while Group 3 would include many Vietnam era patients. Figure 4 suggests that both Groups 1 and 3 have a complex structure with MDCs extending between psychiatric and the medical and surgical categories. Group 2 may identify a subgroup who make heavy use of outpatient services and are sometimes admitted to the hospital. They appear to be evenly distributed across the MDCs. Additional analyses plus involvement by clinical domain experts with outpatient expertise is necessary to further validate and characterize these groups. Clinical experts would examine medical records for patients from the different groups and add their insights and recommendations for further studies. Although Groups 1 and 3 are probably known to VA clinicians, Group 2 is more anecdotal and may concern the "regulars" who show up at the hospital. However, it is not clear if this heavy outpatient utilization is related to scheduled clinic visits or to unsupervised access to outpatient services. Nor is it clear at this point if this activity represents a beneficial shift away from more expensive inpatient services, or if it represents uncontrolled outpatient (and sometimes inpatient) resource consumption that refined management plans would bring under control. It may in fact represent all these things, and any of these groups could be found to contain valid subgroups with their own unique characteristics and management needs. From a management perspective, the present stage of data mining may be close to all that is needed to guide business planning. Groups 1 and 3 are already known to VA planners. Group 1 appears to be a traditional group seen at all medical and surgical hospitals. Group 3 represents a greater psychiatric emphasis than is seen in most nonpsychiatric facilities, but the group is typical of the VA mission and demographics and is well understood. The Group 2 patients may provide an opportunity to significantly improve clinical effectiveness and effiCiency. Managing care for Group 2 patients should be greatly enhanced by the current transition to a managed care format, where a patient is seen continuously by a team rather than by a new provider each time the patient visits the hospital. This will help supervise a patient's access to clinical resources and improve outcomes through continuity of care. Hospital management thus has assurance that current plans will address the unique needs of existing patient subpopulations. In addition to expanded clinical domain expert involvement, the next phases of this data mining project would utilize additional data analyses and visualizations. The three groups now need to be examined separately. MDCs for patients with multiple discharges need to be explored and correlated with clinic activity to see if meaningful patterns exist. Cluster analysis and factor analysis 119 REFERENCES Brachman, R.J. and Anand, T., ''The Process of Knowledge Discovery in Databases, • In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth; P., and Uthurusamy, R., eds. (1996), Advances in Knowledge Discovery and Data Mining, Cambridge: MIT Press. Cortes, C., Jackel, L.D., and Chiang, W.P. (1995), "Umits on Learning Machine Accuracy Imposed by Data Quality," Proceedings of the Rrst International Conference on Knowledge Discovery and Data Mining, 57-62. Elder, J.F. and Pregibon, D., "A Statistical Perspective on Knowledge Discovery in Databases,· In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., eds. (1996), Advances in Knowledge Discovery and Data Mining, Cambridge: MIT Press. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., "From Data Mining to Knowledge Discovery: An Overview," In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., eds. (1996), Advances in Knowledge Discovery and Data Mining, Cambridge: MIT Press. ACKNOWLEDGMENTS SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. IBM is a registered trademark or trademark of International Business Machine Corporation. ® indicates USA registration. CONTACT INFORMATION Anthony M. Dymond, Ph.D. Independent Consultant Concord, California (510) 798-0129 [email protected] 120 data ~:·~· ---a-t domain :) discovery..__ discovery 4 -i model present implement visualize .,,---- analyze data _ warehouse ( __., select ' variables ~ ) clean 4 data .,. seled cases Figure 1: Some of the major processes in data mining. 20 30 40 50 60 70 80 90 AGE Figure 2: Contour plot of Major Diagnostic Categories CMDCt versus age 121 100 I• 10 0 J:llOUU D GllOUP2 GB.ODf3 I 60 '!! 8II ~0 • "' 20 0 10 30 20 40 50 60 80 70 90 100 110 120 Length of stay 80 60 '!! ...~ • "' 40 20 0 10 30 20 40 50 70 60 90 80 100 Humber of Visits Figure 3: Distribution of length of stay and number of visits for three subpopulations I• GROU~l 25 0 GROU~2 Ill GROU~31 ~ I I ;i I 20 5 0 • Ill ~ D i !I! ~ ~ Ii I; I > ;:; II! ~ u ~ II • ! I sI I= ~ ~ ~ 0 = ~ i i ~ ~ i ~ ~ ~ § ;I ~ & i! Ii " j ~ " Major Diagnostic Category Figure 4: Distribution of Major Diagnostic Categories (MDC) for three subpopulations 122