Download Design of a Multi Dimensional Database for the Archimed - magic-5

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Big data wikipedia , lookup

Data Protection Act, 2012 wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Database wikipedia , lookup

Data center wikipedia , lookup

Data model wikipedia , lookup

Data analysis wikipedia , lookup

Forecasting wikipedia , lookup

Clusterpoint wikipedia , lookup

Information privacy law wikipedia , lookup

3D optical data storage wikipedia , lookup

Business intelligence wikipedia , lookup

Data vault modeling wikipedia , lookup

Database model wikipedia , lookup

Transcript
Connecting Medical Informatics and Bio-Informatics
R. Engelbrecht et al. (Eds.)
ENMI, 2005
169
Design of a Multi Dimensional Database
for the Archimed DataWarehouse
Claudine Bréant, Gérald Thurler, François Borst, Antoine Geissbuhler
Service of Medical Informatics
University Hospital of Geneva, Geneva, Switzerland
Abstract
The Archimed data warehouse project started in 1993 at the Geneva University
Hospital. It has progressively integrated seven data marts (or domains of activity)
archiving medical data such as Admission/Discharge/Transfer (ADT) data,
laboratory results, radiology exams, diagnoses, and procedure codes. The objective
of the Archimed data warehouse is to facilitate the access to an integrated and
coherent view of patient medical in order to support analytical activities such as
medical statistics, clinical studies, retrieval of similar cases and data mining
processes. This paper discusses three principal design aspects relative to the
conception of the database of the data warehouse: 1) the granularity of the database,
which refers to the level of detail or summarization of data, 2) the database model
and architecture, describing how data will be presented to end users and how new
data is integrated, 3) the life cycle of the database, in order to ensure long term
scalability of the environment. Both, the organization of patient medical data using a
standardized elementary fact representation and the use of the multi dimensional
model have proved to be powerful design tools to integrate data coming from the
multiple heterogeneous database systems part of the transactional Hospital
Information System (HIS). Concurrently, the building of the data warehouse in an
incremental way has helped to control the evolution of the data content. These three
design aspects bring clarity and performance regarding data access. They also
provide long term scalability to the system and resilience to further changes that may
occur in source systems feeding the data warehouse.
Keywords:
Data warehouse; Medical Informatics; Database
1. Introduction
A data warehouse can be very simply defined as a copy of enterprise transaction data
specifically structured, integrated, and organized for query, data analysis, and decision
support applications [1,2]. In that regard, the architecture, the life cycle, and the end users
of a data warehouse are profoundly different from those of transactional systems [1,2].
Data models and server technology that speeds up transactional processing may not be
appropriate for query and reporting processing. It is therefore recommended to develop a
data warehouse separately from the transactional environment. Its main requirement is to
provide an intuitive, easy and performant access to enterprise wide data so that queries and
reports can be quickly produced on a regular basis.
Section 3: Decision Support and Clinical Guidelines
Connecting Medical Informatics and Bio-Informatics
R. Engelbrecht et al. (Eds.)
ENMI, 2005
170
A hospital data warehouse organized around patient medical data involves integrating a
wide variety of health data, including patient records, medical images, and genetic
information, for the purpose of researching and improving the diagnosis and treatment of
medical
service
(activity)
patient
ARCHIMED data warehouse
(analytical and decision
processes)
ADT
Diagnoses
Laboratory
Radiology
Hospital Information System
(HIS) components
(on line transactional
systems)
episode of care
(analysis)
…
Procedures
…
data marts
ADT
Procedures
Diagnoses
Laboratory
Radiology
…
Figure 1 – The Hospital Information System components and the Archimed
data warehouse
diseases. The objective is to use information technology to help achieve personalized
health care by leveraging all the information and knowledge that exist about treated patients
and, for instance, by studying how new patients compare with other patients with similar
characteristics [3,4].
The development of the Archimed medical data warehouse started at the Geneva
University Hospital in 1993. It has progressively integrated seven data marts (or domains
of activity) archiving Admission/Discharge/Transfer (ADT) data, laboratory results,
radiology exams, physiotherapy exams, clinical data relative to childbirth, diagnoses, and
procedures codes. In the architecture of Archimed a data mart is simply a subset of the
database for a specific domain of activity within the patient treatment workflow. As shown
in figure 1, HIS operational systems are organized around applications managing for
example ADT data or laboratory results. Archimed proposes a different organization of the
data, which is dedicated to analytical activities such as medical statistics, clinical studies,
retrieval of similar cases and data mining processes. As shown in figure 1, the Archimed
data warehouse is organized around subject areas such as patient medical data, episodes of
care, and medical specialties. End users can also access to the database through web-based
applications [5].
This paper discusses three principal design aspects relative to the conception of the
database of the Archimed data warehouse, towards the archiving and organization of patient
medical data. The three following topics are described and discussed in the following
sections:
1) the granularity of the database, which refers to the level of detail or summarization
of data in the data warehouse,
2) the database model and architecture, describing how data will be presented to end
users and how new data is integrated
3) the life cycle of the database, in order to ensure long term scalability of the
environment.
The many other aspects of the Archimed system such as data source acquisition, end user
applications, and meta data, can be found in [6,7].
2. The Archimed database records patient atomic data (or elementary facts)
The granularity of the data warehouse database refers to the level of detail or
summarization of data in the data warehouse. The storage of atomic data (low level of
Section 3: Decision Support and Clinical Guidelines
Connecting Medical Informatics and Bio-Informatics
R. Engelbrecht et al. (Eds.)
ENMI, 2005
171
granularity) versus storage of aggregated data (high level of granularity) is a major design
issue since that it will affects the entire architecture of the data warehouse environment and
more specifically the volume of the database and the type of query that can be answered. It
is generally recognized that expressing the data at a low level of granularity will help to
achieve a robust data warehouse, as it becomes resilient to changes and can easily
accommodate new user needs. Moreover, when the fact tables are granular, they serve as
the natural destination for operational data that may be extracted frequently from the
operational systems.
The transactional hospital information system (HIS) creates and updates records
describing most events occurring during patient in-patient or out-patient care. Events of
interest for the Archimed data warehouse include encounters, observations, treatments,
diagnosis and procedures. Relevant and validated facts such as encounter dates, laboratory
results, radiology exams, diagnoses and procedure codes are transmitted to Archimed. The
database has been designed to record patient medical data with that same level of detail
under a format called standardized elementary fact. This offers many advantages over the
storage of aggregated data. Indeed, whereas aggregated data already embeds the user
queries, atomic data can be reshaped and presented in any format needed. Future unknown
requirements or unusual queries can be accommodated, achieving flexibility of the system
and reusability of the data.
Each elementary fact in the Archimed database is represented by its value (generally a
numeric or a code), along with several additional properties describing the context of this
value such as :
• patient/episode identification
- patient identification number,
- episode identification number,
• medical structures
- service (medical responsibility),
- medical unit (patient location),
• basic patient description to facilitate statistical queries
- patient age,
- patient sex,
• link to dictionaries and nomenclatures which are specific to each domain of
activity
- property name (laboratory result name, radiology exam, diagnosis or procedure
description)
Corresponding dictionaries are necessary for a correct interpretation of the fact at
the time the value was produced. In that regard, historic versions of these
dictionaries are managed if necessary.
• link to the time axis
- date of the event that produced the value
- date of archiving of the fact in the database.
Furthermore, elementary facts in Archimed are standardized the following way. Facts
coming from different domains of activity are expressed using a common template,
including a value and the set of corresponding properties. Basic properties are mandatory,
additional ones are optional and their number may vary according to the domain of activity
as shown in figure 2. The Archimed standardized elementary facts can therefore easily
accommodate a new domain of activity.
The storage of aggregated data can be useful to speed up queries and optimize the database
usage. They can be calculated from a group of selected facts from which can be applied
aggregated functions (sum, average, count, etc.). The Archimed database doesn’t currently
Section 3: Decision Support and Clinical Guidelines
Connecting Medical Informatics and Bio-Informatics
R. Engelbrecht et al. (Eds.)
ENMI, 2005
172
store aggregated data; this area is however under consideration for future developments.
Patient lengths of stay, indicators describing counts of diagnoses, or surgical procedures,
and re admissions rates are some of the aggregates of interest.
The next section describes the database model used to structure these elementary facts.
Instanciated ADT fact
Instanciated Laboratory fact
Instanciated Procedure fact
patient:
episode:
medService:
medUnit:
patientAge:
patientSex:
property name:
value:
dateOfFact:
dateOfArchive:
chief complaint:
patient:
episode:
medService:
medUnit:
patientAge:
patientSex:
property name:
value:
dateOfFact:
dateOfArchive:
unit:
material:
range:
patient:
episode:
medService:
medUnit:
patientAge:
patientSex:
property name:
value:
dateOfFact:
dateOfArchive:
957688
1000474
pediatric
1-AL
7
M
out-patient
regular entry
2004-01-05
2004-01-06
accident
345353
1000011
INTERNAL MED
7-AL
46
F
glucose
6.8
2004-02-23
2004-02-24
mml/L
plasma
4.2-6.0
3345453
1190002
OBGYN
2-AL
34
F
C-section
elected
2004-05-21
2004-05-22
additional
optional
properties
Figure 2 - Three instanciated Archimed facts according to the standardized elementary fact template
3. Patient medical data is organized using a multi dimensional approach
medical service
The Archimed database relies on a multi dimensional modeling approach. The dimensional
model is a logical database design method particularly well suited to data warehouse
databases [2]. In particular, it has the great advantage to allow data coming from various
heterogeneous and independent sources, such as Archimed elementary patient facts, to be
presented in an intuitive and standardized fashion. Contrary to the usual Entity/Relational
data model used by transactional systems, it limits the number of tables to fact and
dimension tables.
For instance, the multi dimensional model in Archimed for the Laboratory data mart
includes:
• one fact table, recording the elementary facts produced by the operational system,
namely the patient laboratory result values.
• a set of usually smaller dimension tables, each linked to the fact table through a
primary key, and describing the context of interpretation of elementary facts. The
dimension tables in the Laboratory data mart include tables describing patients,
medical services, medical units, laboratory exams, laboratory ranges values, and
units.
patient
da
te
/ ti
m
e
diagnostic
code
Figure 3 – Three dimensional data cube describing diagnostic codes
along three dimensions : patient, medical service, and time
Figure 4 shows the general star schema data model used for the representation of an
Archimed data mart. The fact tables all contain zero or more facts that represent values
(measurements) taken at each combination of the dimension key components. The fact
table is also often represented by a data cube where each axis corresponds to one
dimension, as shown in figure 3 [1,2].
As the Archimed data warehouse is composed of several data marts, the whole data
model results in a set of inter connected star schemas including a collection of fact tables
Section 3: Decision Support and Clinical Guidelines
Connecting Medical Informatics and Bio-Informatics
R. Engelbrecht et al. (Eds.)
ENMI, 2005
173
describing patient medical data (encounters, laboratory results, diagnoses and procedures,
dimensions
fact table
conformed
dimensions
(shared by at
least 2 fact
tables)
fact table
dimensions
patient
episode
fact table
fact table
med.
services
med. unit
Figure 4 - Four Archimed data marts inter connected through conformed dimension tables
radiology exams and so on) and a set of dimension tables where some of them are shared by
many facts tables. Indeed, the patient table or the medical services table record data
meaning exactly the same thing in each data mart.
Dimension tables shared by several fact tables are called conformed dimensions [2].
Conformed dimensions constitute a very important and powerful aspect of the multi
dimensional model. Indeed, the definition of conformed dimensions establishes the links
between the data marts, providing an integrated view of the elementary facts. It builds the
foundation for a simple and performant querying of the patient medical data as if it were
initially part of the same database.
For example, the queries ‘retrieve all medical data of patient of age 30-40 during its last
hospitalization in the internal medicine department’ or ‘retrieve all patients of the pediatric
department having low hemoglobin and diagnosis of diabetes’ require many steps of
laborious treatment when issued at the various hospital transactional database systems. It is
however very simple to solve with the Archimed database model.
Establishing, enforcing, and maintaining, the dimension tables of the data warehouse
Archimed are therefore important tasks respectively during the initial planning of the
database architecture, during the integration process of a new data mart, and for the
maintenance of the database overall coherence.
4. Implementation and life cycle of the Archimed database
The Archimed database was first developed using the Ingres database system before being
recently migrated to the Oracle relational database system. In brief, the physical
implementation of the database has required the definition of a physical data model, proper
indexation of the tables, and definition of rules to consistently name tables and attributes.
Unlike classical transactional systems, a data warehouse is in perpetual evolution and
cannot remain static. Indeed, its evolution must follow the changes occurring in the
organization which it serves. User needs appear or change. New data sources become
accessible and must be integrated in the warehouse.
The process of adding a new data mart to the architecture complies with the following
steps. First, a detailed analysis of the source data provided by the transactional systems is
carried out. Elementary facts are identified, setting the granularity of the fact table. Then,
conformed dimensions are carefully highlighted; they define the links with the other data
marts. Other dimensions are also identified which complement the definition of the
Section 3: Decision Support and Clinical Guidelines
Connecting Medical Informatics and Bio-Informatics
R. Engelbrecht et al. (Eds.)
ENMI, 2005
174
elementary facts. A mapping between source data items and the data warehouse table
attributes is established. Then, the physical table and index definitions can be derived and
implemented.
The Archimed data warehouse has been built incrementally, in order to break the
implementation task to manageable proportions. Each data mart has been successively
implemented and connected to the overall architecture.
Moreover, the Archimed database was designed to anticipate changes and the evolution of
existing data sources. As shown above, the data model accommodates the need for new
descriptors and new dimensions without to have to change the database schema.
5. Conclusion
The planning, realization, maintenance and evolution of a large data warehouse database
system can be overwhelming. Source data is complex, volumes are large, and portions of
data will be dirty, erroneous, or hard to understand.
In the case of a data warehouse for medical data, we have discussed the three main design
aspects of the Archimed database.
The organization of elementary facts (atomic) using the dimensional model has proved to
be a powerful tool in order to integrate data coming from multiple heterogeneous database
systems developed independently throughout the institution. It brings clarity and
performance regarding data access. It also provides scalability to the system and resilience
to further changes that may occur in source systems feeding the data warehouse.
Moreover, the building of the data warehouse in an incremental way has helped to control
the evolution of the data content
These design options constitute the main rules and act as a road map to follow when a
new medical set of data (or data mart) is integrated to the system. Sticking to these
decisions has enabled to ensure the coherence of the integrated data, long term scalability,
and to greatly simplify the access to data produced by tens of independent hospital wide
operational databases.
These rules have been established progressively and result from more that 15 years of
prior experience in providing information to hospital administrators and medical staff
including departmental statistics, patient similar case retrievals, and clinical studies.
6. References
[1] Inmon WH, Building the Data Warehouse, Wiley; 3rd edition, 2002.
[2] Kimball R, Ross M. The Data Warehouse Lifecycle Toolkit, Wiley; 2nd edition, 2002.
[3] Kerkri R, Quantin C, Yetongnon K, Dusserre L. Les entrepôts de données:application au suivi
épidémiologique. Informatique et Santé, Springer-Verlag, France, Paris 1998(10):21-29.
[4] Ledbetter CS, Morgan MW, Toward best practice: leveraging the electronic patient record as a clinical
data warehouse J Healthc Inf Manag. 2001 Summer;15(2):119-31.
[5] Lehner B, Thurler G, Bréant C, Tahintzi P, Borst F. Retrieval of Similar Cases using the ARCHIMED
Navigator. MIE, 2003.
[6] Thurler G, Bréant C, Lehner B, Bunge M, Samii K, Hochstrasser D, Nendaz M, Gaspoz JM, Tahintzi P,
Borst F. Toward a Systemic Approach to Disease. Complexus, 2003;1:117-122.
[7] Thurler G, Borst F, Bréant C, Campi D, Jenc j, Lehner B, Maricot P, and Scherrer JR. ARCHIMED: A
network of Integrated Information Systems. Method Inform Med 2000; 39: 36-43.
Address for correspondence
Claudine Bréant HUG – Service d’Informatique Médicale 24, rue Micheli-du-Crêts 1211 Genève 14,
Switzerland, [email protected]
Section 3: Decision Support and Clinical Guidelines