Preliminary Data Mining for Subgroups in a
Hospital Inpatient Population
Anthony M. Dymond

ABSTRACT

A business plan to manage a group or population depends on knowing if the group is homogeneous or composed of several subgroups. This paper discusses data mining and describes some practical considerations for preliminary data exploration in a hospital inpatient population of 5760 unique patients. A variety of visual and analytic examinations of selected variables suggest that three subpopulations may exist in this data.

INTRODUCTION

Data mining encompasses a variety of techniques for examining databases and extracting insights that can be used as a basis for business decisions. It allows management to discover and utilize information already existing in internal or external databases, and can be applied to decisions and processes anywhere in the organization.

Data mining is business driven, and the present example is typical in that it was motivated by the business need to improve hospital inpatient management. A good understanding of the subject population was essential for the plan. In this particular case, one important hypothesis to explore was the existence of subpopulations among the inpatients. This was motivated in part by the impression of knowledgeable staff that such subpopulations probably exist.

Current reviews (Fayyad et al., 1996; Brachman and Anand, 1996) consider both the theoretical and practical aspects of data mining, and suggest that the process can be characterized as a non-trivial discovery of novel, useful, and understandable information. The process can be knowledge-intensive, extended over time, iterative, interactive, and intensive of both database and human resources.

Figure 1 shows some of the major elements and interactions in data mining. The process is human-centered, and involves the data analyst (data miner) and a domain expert who is familiar with both the subject area and the organization. These individuals engage in an iterative process of task discovery to focus on and clarify the objectives of the data mining project, and data discovery to identify what studies the data would support, and what studies the data could suggest.

This process interacts with processes surrounding the data, which involve selecting variables and cases to examine and preparing (cleaning) the data for use (Cortes et al., 1995). Not all the variables from a large database are used at any one time because they are not all relevant to a particular study or because there may be concerns about their completeness or validity. Not all the cases are used because of corrections for outliers and because of the potential for very different characteristics in known or postulated subpopulations.

Both of these processes then interact with the central process of hypothesis construction and testing. This phase of data mining should always be centered around hypotheses created by the data analyst and domain expert, and should involve construction of models to test these hypotheses. A very wide variety of analytic and visualization tools can be used at this point to explore both the hypotheses and the models (Elder and Pregibon, 1996). This part of the data mining process is also interactive, with hypothesis evaluation leading to new hypotheses and testing.

All of these activities then result in a deliverable product for presentation to management and for implementation.

HOSPITAL INPATIENT MINING

Some of the considerations from Figure 1 can be illustrated by the current study of hospital inpatients. The data came from 5760 unique inpatients discharged in 1993 from a large West Coast Veterans Administration (VA) hospital. (Some patients were hospitalized more than once during the year.) The hospital is located in a metropolitan area and has active university medical school affiliations. The VA hospital has many of the aspects of a university teaching hospital, but with predominantly male veteran demographics.

Extensive inpatient and outpatient data is routinely collected by local computer systems at VA hospitals and then subsetted and transferred to a remote IBM® mainframe. This mainframe has SAS® software that was used for these studies.

The hypothesis was that subpopulations existed within the inpatient group, and that identifying and characterizing these subpopulations would guide patient management. The author has sufficient experience with hospital databases to serve initially as both the data analyst and domain expert. SAS software has a variety of tools to examine data for outliers and for univariate and bivariate distributions, and these were applied to a subset of the variables believed to be most relevant to the issue under consideration. Some variables were temporarily excluded for reasons such as distribution, collinearity, and validity (some variables are recorded only under special circumstances).

Some cases were removed as outliers when values such as income or length of stay were excessive. However, although there are good statistical reasons to ignore outlier data, these cases should always be examined separately as a part of data mining since they can often point to unusual situations in the business process. Evaluating outliers can potentially lead to some of the most significant information resulting from the data mining operation.

Examining contour plots for each pair of variables is a useful technique during the early stages of variable and case review. Figure 2 shows this plot for the variables Major Diagnostic Category (MDC) and age. (MDC measures the patient's major reason for having been in the hospital. MDCs are grouped along the lines of organ systems and related disease.) Even at this preliminary stage of data mining, Figure 2 revealed two separate peaks in the inpatient distribution. One large group matched the age range of World War II (WWII) and Korea veterans with MDCs corresponding to diseases in traditional medical and surgical categories such as cardiology and respiratory illness. The second group has an age range of Vietnam veterans with MDCs corresponding to diseases in the drug and mental categories.

These two groups--WWII and Korean veterans with traditional medical and surgical illnesses, and Vietnam veterans with more of a propensity toward drug and mental illnesses--are the two inpatient subpopulations that VA domain experts would predict. This is a significant result supporting the hypothesis of subpopulations. It has been discovered by a visualization method and supported by domain expert expectations.

At this point, both data analysts and domain experts must ask themselves if these groups represent a complete and final solution. Based on a data subset containing only the subjects' inpatient activities, these two groups are probably valid. Their existence would represent a major finding of this data mining study, and the next phase of work would attempt to further explore and characterize these groups.

However, these patients' outpatient activity was also available, and these variables were included in further analyses. Several approaches were tried in an effort to elucidate additional structure in the data. One approach was to try factor analysis because it can sometimes group cases after a transpose between the cases and variables. When this was attempted on a subset of 500 patients, three potential patient subgroups appeared. Additional studies were then conducted using these three subgroups to establish their validity and interpretation. Discriminant analysis suggested that these groups were valid and could be used to classify patients in the database into the three categories. All patients were assigned to these groups by discriminant analysis and a variety of exploratory graphs were examined. A few of these graphs are shown in Figures 3 and 4.

Figure 3 shows the length of stay and number of visits for the three groups. For Group 3, the length of stay is much longer than for the other groups. For Group 2, the number of visits (to the hospital as outpatients) is much higher than for the other groups.

Figure 4 shows the MDCs versus age. (For patients with multiple discharges, the MDC associated with the longest length of stay was chosen.) This graph shows that most of Group 1 patients are in the traditional medical and surgical categories such as respiratory, circulatory, and digestive. Group 3 is heavily concentrated in the drug and mental categories. Group 2 is distributed across MDCs.

A preliminary interpretation of these groups is:

Group 1 Traditional (73% of inpatients)
Traditional medical and surgical patients. Perhaps slightly older patients. Patients who do not accumulate excessive numbers of visits or lengths of stay during the year.

Group 2 Outpatient Intensive (15% of inpatients)
Inpatients who are also making heavy use of outpatient services. Distributed over both medical and surgical MDCs and psychiatric MDCs.

Group 3 Inpatient Intensive (12% of inpatients)
Inpatients who have very long lengths of stay. There is a high concentration in the mental and drug MDCs, but they are also well represented in the other MDCs.

DISCUSSION

The Group 1 patients probably correspond mostly to the WWII and Korea veterans, while Group 3 would include many Vietnam era patients. Figure 4 suggests that both Groups 1 and 3 have a complex structure with MDCs extending between psychiatric and the medical and surgical categories. Group 2 may identify a subgroup who make heavy use of outpatient services and are sometimes admitted to the hospital. They appear to be evenly distributed across the MDCs.

Additional analyses plus involvement by clinical domain experts with outpatient expertise are necessary to further validate and characterize these groups. Clinical experts would examine medical records for patients from the different groups and add their insights and recommendations for further studies. Although Groups 1 and 3 are probably known to VA clinicians, Group 2 is more anecdotal and may concern the "regulars" who show up at the hospital. However, it is not clear if this heavy outpatient utilization is related to scheduled clinic visits or to unsupervised access to outpatient services. Nor is it clear at this point if this activity represents a beneficial shift away from more expensive inpatient services, or if it represents uncontrolled outpatient (and sometimes inpatient) resource consumption that refined management plans would bring under control. It may in fact represent all these things, and any of these groups could be found to contain valid subgroups with their own unique characteristics and management needs.

In addition to expanded clinical domain expert involvement, the next phases of this data mining project would utilize additional data analyses and visualizations. The three groups now need to be examined separately. MDCs for patients with multiple discharges need to be explored and correlated with clinic activity to see if meaningful patterns exist. Cluster analysis and factor analysis could be used to find spontaneous groupings in the data, and discriminant analysis and categorical analyses could provide models describing the variables underlying group membership, and the interactions between variables.

CONCLUSION

It is important to maintain a focus on the business objectives during data mining. Data mining attempts to identify novel, useful, and understandable information within existing databases. This examination of inpatient discharges has succeeded in uncovering patient subgroups that certainly meet these criteria. However, other groups probably exist in the data and could be extracted. For example, male and female patients are certainly valid and distinct groups. They were not elucidated in this study because the variable for gender was not used. Female veteran patients, although a small part of the VA population, are well studied and have been the object of extensive recent efforts to optimize their medical care. There is probably minimal additional return from studying gender groupings at this time.

From a management perspective, the present stage of data mining may be close to all that is needed to guide business planning. Groups 1 and 3 are already known to VA planners. Group 1 appears to be a traditional group seen at all medical and surgical hospitals. Group 3 represents a greater psychiatric emphasis than is seen in most non-psychiatric facilities, but the group is typical of the VA mission and demographics and is well understood.

The Group 2 patients may provide an opportunity to significantly improve clinical effectiveness and efficiency. Managing care for Group 2 patients should be greatly enhanced by the current transition to a managed care format, where a patient is seen continuously by a team rather than by a new provider each time the patient visits the hospital. This will help supervise a patient's access to clinical resources and improve outcomes through continuity of care. Hospital management thus has assurance that current plans will address the unique needs of existing patient subpopulations.

REFERENCES
Brachman, R.J. and Anand, T., "The Process of
Knowledge Discovery in Databases," In: Fayyad,
U.M., Piatetsky-Shapiro, G., Smyth, P., and
Uthurusamy, R., eds. (1996), Advances in
Knowledge Discovery and Data Mining, Cambridge:
MIT Press.
Cortes, C., Jackel, L.D., and Chiang, W.P. (1995),
"Limits on Learning Machine Accuracy Imposed by
Data Quality," Proceedings of the First International
Conference on Knowledge Discovery and Data
Mining, 57-62.
Elder, J.F. and Pregibon, D., "A Statistical
Perspective on Knowledge Discovery in Databases,"
In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.,
and Uthurusamy, R., eds. (1996), Advances in
Knowledge Discovery and Data Mining, Cambridge:
MIT Press.
Fayyad, U.M., Piatetsky-Shapiro, G., and Smyth, P.,
"From Data Mining to Knowledge Discovery: An
Overview," In: Fayyad, U.M., Piatetsky-Shapiro, G.,
Smyth, P., and Uthurusamy, R., eds. (1996),
Advances in Knowledge Discovery and Data Mining,
Cambridge: MIT Press.
ACKNOWLEDGMENTS
SAS is a registered trademark or trademark of SAS
Institute Inc. in the USA and other countries. IBM is
a registered trademark or trademark of International
Business Machines Corporation. ® indicates USA
registration.
CONTACT INFORMATION
Anthony M. Dymond, Ph.D.
Independent Consultant
Concord, California
(510) 798-0129
[email protected]
[Diagram. Legible labels: domain, discovery, model, visualize, analyze, present, implement, data warehouse, select variables, select cases, clean data.]
Figure 1: Some of the major processes in data mining.
[Contour plot. Horizontal axis: AGE, ticks 20 to 90.]
Figure 2: Contour plot of Major Diagnostic Categories (MDC) versus age
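
The two peaks read from Figure 2 can also be looked for numerically. The following is a minimal sketch, not the SAS procedure used in the study: it bins (age, MDC) pairs into a grid of counts, which is the quantity a contour plot draws, and reports the cells that exceed all of their neighbours. The function names, the ten-year bin width, the integer MDC codes, and the sample records are all assumptions made for illustration.

```python
from collections import Counter

def bin_counts(records, age_width=10):
    """Count patients per (age bin, MDC code) cell.

    records: iterable of (age, mdc) pairs with integer values (hypothetical layout).
    """
    counts = Counter()
    for age, mdc in records:
        age_bin = age // age_width * age_width  # lower edge of the age bin
        counts[(age_bin, mdc)] += 1
    return counts

def local_peaks(counts, age_width=10, min_count=2):
    """Return cells whose count is strictly greater than all eight neighbours."""
    peaks = []
    for (age_bin, mdc), n in counts.items():
        if n < min_count:
            continue  # ignore sparsely populated cells
        neighbours = [
            counts.get((age_bin + da * age_width, mdc + dm), 0)
            for da in (-1, 0, 1)
            for dm in (-1, 0, 1)
            if (da, dm) != (0, 0)
        ]
        if all(n > m for m in neighbours):
            peaks.append((age_bin, mdc))
    return sorted(peaks)

# Made-up records: an older cluster at code 5 and a younger cluster at code 20.
sample = [(65, 5), (66, 5), (67, 5), (68, 5), (71, 5), (74, 5),
          (44, 20), (45, 20), (46, 20), (52, 20), (63, 4)]
peaks = local_peaks(bin_counts(sample))
print(peaks)  # [(40, 20), (60, 5)]
```

Treating adjacent MDC codes as neighbours is a simplification, since MDC is a categorical variable; in practice the categories would be ordered the same way as on the plot axis before looking for peaks.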
[Two panels with legend GROUP1, GROUP2, GROUP3. Horizontal axes: Length of stay (ticks 10 to 120) and Number of Visits (ticks 10 to 100).]
Figure 3: Distribution of length of stay and number of visits for three subpopulations
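
The groups plotted in Figure 3 were assigned by discriminant analysis. As a rough illustration of that step, the sketch below fits a linear discriminant model with a pooled within-group covariance matrix and assigns new cases to the group with the highest discriminant score. It is not the SAS analysis from the study. The paper does not list the variables that entered the model, so the sketch assumes only the two shown in Figure 3 (length of stay and number of visits), and the training rows, labels, and function names are invented for the example.

```python
import math

def fit_lda(rows, labels):
    """Fit a two-feature linear discriminant model (pooled covariance)."""
    groups = sorted(set(labels))
    n = len(rows)
    means, priors = {}, {}
    for g in groups:
        members = [r for r, lab in zip(rows, labels) if lab == g]
        means[g] = (sum(r[0] for r in members) / len(members),
                    sum(r[1] for r in members) / len(members))
        priors[g] = len(members) / n
    # Pooled within-group covariance: sum squared deviations from each
    # case's own group mean, then divide by (n - number of groups).
    sxx = sxy = syy = 0.0
    for r, lab in zip(rows, labels):
        dx, dy = r[0] - means[lab][0], r[1] - means[lab][1]
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
    dof = n - len(groups)
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy * sxy
    inverse = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return means, priors, inverse

def classify(model, row):
    """Return the group with the largest linear discriminant score."""
    means, priors, inv = model
    best, best_score = None, -math.inf
    for g, (mx, my) in means.items():
        wx = inv[0][0] * mx + inv[0][1] * my  # weights = inverse covariance * mean
        wy = inv[1][0] * mx + inv[1][1] * my
        score = (wx * row[0] + wy * row[1]
                 - 0.5 * (wx * mx + wy * my) + math.log(priors[g]))
        if score > best_score:
            best, best_score = g, score
    return best

# Hypothetical training cases: (length of stay, number of visits) with group labels.
train = [(5, 8), (7, 12), (4, 6), (8, 10),
         (6, 60), (9, 70), (5, 55), (8, 65),
         (80, 10), (95, 15), (70, 8), (90, 12)]
groups = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
model = fit_lda(train, groups)
assigned = [classify(model, p) for p in [(6, 9), (7, 62), (85, 11)]]
print(assigned)  # [1, 2, 3]
```

In the study the labelled cases came from the 500-patient subset grouped by factor analysis, and the fitted model was then applied to every patient in the database; the toy data above only mimics the pattern described for Figure 3.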
[Chart with legend GROUP1, GROUP2, GROUP3. Horizontal axis: Major Diagnostic Category.]
Figure 4: Distribution of Major Diagnostic Categories (MDC) for three subpopulations
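
For Figure 4, patients with multiple discharges were represented by the MDC of their longest stay. One way to carry out that selection and then tabulate MDCs within each group is sketched below. The record layout, the function names, the use of percentages, and the sample data are assumptions for illustration, and the sketch is not the code used in the study.

```python
from collections import Counter, defaultdict

def mdc_of_longest_stay(discharges):
    """Map each patient to the MDC of that patient's longest stay.

    discharges: iterable of (patient_id, mdc, length_of_stay) tuples.
    When two stays tie, the one seen first is kept.
    """
    best = {}
    for patient_id, mdc, los in discharges:
        if patient_id not in best or los > best[patient_id][1]:
            best[patient_id] = (mdc, los)
    return {pid: mdc for pid, (mdc, _) in best.items()}

def mdc_percent_by_group(patient_mdc, patient_group):
    """Percentage of each group's patients that fall in each MDC."""
    counts = defaultdict(Counter)
    for pid, mdc in patient_mdc.items():
        counts[patient_group[pid]][mdc] += 1
    return {
        group: {mdc: 100.0 * n / sum(c.values()) for mdc, n in c.items()}
        for group, c in counts.items()
    }

# Made-up discharges: patients A, C, and D each have two stays.
discharges = [
    ("A", "Circulatory", 4), ("A", "Respiratory", 9),
    ("B", "Mental", 30),
    ("C", "Circulatory", 6), ("C", "Digestive", 2),
    ("D", "Mental", 45), ("D", "Circulatory", 3),
]
group_of = {"A": 1, "B": 3, "C": 1, "D": 3}  # hypothetical group assignments

patient_mdc = mdc_of_longest_stay(discharges)
table = mdc_percent_by_group(patient_mdc, group_of)
print(patient_mdc)  # {'A': 'Respiratory', 'B': 'Mental', 'C': 'Circulatory', 'D': 'Mental'}
print(table[3])     # {'Mental': 100.0}
```

Keeping one MDC per patient means each patient is counted once in the distribution, which matches the paper's count of unique inpatients rather than discharges.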