Data Modelling and Pre-processing for Efficient Data Mining in Cardiology
Kamil Matoušek and Petr Aubrecht
Abstract—A consolidated database, Cardio DB, for medical examination data and an efficient data mining process in cardiology have been designed, with the aim of creating a shared data resource supporting different scientific experiments. The requirements analysis included studying existing data formats and suggesting suitable ways to transform the individual data models into a common platform.
Efficient data mining in cardiology should offer processing of time series and accompanying structured information and return typical patterns indicating manifestations of potential diseases or diagnoses. Existing waveform resources are utilized in several ways.
Cardio DB is currently populated with experimental data from the Institute of Physiology, Charles University in Prague, and its structure and analytical data mining options are being evaluated. Human experts will provide valuable feedback on the mined results as well as on the overall process and suggest possible refinements. Having acquired enough results, they will construct a reusable repository of their collected knowledge.
I. INTRODUCTION
Existing proprietary data generated within different medical examinations are typically of varied types. We are identifying the common properties of the different software outputs and creating a consolidated storage database in which patients' data of different nature can be stored together. The database has to be easily extensible so that new data features can be added as soon as new data types become required. Once the data are loaded into the database, they have to be further prepared for data mining (DM): extraction of significant parameters and acquisition of background knowledge, including disease symptomatic information, have to take place. Both the initial data transformations and the statistical analysis are performed using SumatraTT.
Based on the requirements, an efficient data mining process was designed (see Fig. 1). It takes into account multiple data sources, particularly the main data in Cardio DB, described in Section IV, and the waveforms of the Massachusetts General Hospital/Marquette Foundation Waveform Database (WFDB), available in PhysioNet [1] format, used for comparison and as background knowledge. It is expected that additional data sources will be required during the data mining.
Manuscript received June 30, 2006. This work was supported in part by grant 201/05/0325 "New Methods and Tools for Knowledge Discovery in Databases" and by project No. T201210527 "Knowledge-Based Support for Diagnostics and Prediction in Cardiology" of the "Information Society" Program, both from the Academy of Sciences of the Czech Republic.
K. Matoušek and P. Aubrecht are with the Department of Cybernetics, Czech Technical University in Prague, Technická 2, CZ-166 27 Prague 6 ({matousek,aubrech}@fel.cvut.cz).
The source data are transformed into a form more suitable for data mining by the SumatraTT transformation tool, described in Section V. Expert knowledge and feedback are also supported via a professional knowledge repository, described in Section VI.
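To make the flow of Fig. 1 concrete, the following Python sketch chains the stages of the designed process end to end. All function and record names here are hypothetical placeholders invented for this example; they do not correspond to actual SumatraTT or Cardio DB interfaces.

```python
# Hypothetical end-to-end sketch of the process in Fig. 1. None of these
# functions correspond to real SumatraTT or Cardio DB interfaces; they
# only name the stages and pass trivially structured records through.
def preprocess(sources):
    """Pre-processing stage: unify all sources into one feature set."""
    return [record for source in sources for record in source]

def analyse(features):
    """Analysis stage: here reduced to a trivial even/odd split."""
    return features[::2], features[1::2]

def mine_patterns(train_set, test_set):
    """Stand-in for the data mining algorithm proper."""
    return {"n_train": len(train_set), "n_test": len(test_set)}

def run_dm_process(cardio_db, wfdb, additional=()):
    """Sources -> pre-processing -> Feature DB -> analysis -> mining."""
    feature_db = preprocess([cardio_db, wfdb, *additional])
    train_set, test_set = analyse(feature_db)
    return mine_patterns(train_set, test_set)

print(run_dm_process([{"hr": 72}], [{"hr": 81}]))
```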
II. REQUIREMENTS ANALYSIS
The existing data formats have been analyzed and user experience collected in order to identify the major requirements for the consolidated database design.
The main focus was data analysis in the field of cardiology at the Institute of Physiology, Charles University in Prague. The central data acquisition tool there is a proprietary application, Cardiag (a diagnostic tool in cardiology). During examinations, this application stores analyzable data generated by a connected electrocardiograph on-line. The binary files used are of several types: measured time series, processed electrocardiography data, and selected characteristic beats, stored in three special binary file types (INT, ECG, and MAP). The information distributed across the three files usually belongs to a single medical examination. The users requested that the database store both the binary data structure of the measured time series and some interesting structured information, e.g. concerning the characteristic beat features or the physician's observed and marked symptoms.
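As an illustration of how the three file types could be collected per examination, consider the following Python sketch. It assumes, purely for the example, that the files of one examination share a base name and differ only in their extensions (.int/.ecg/.map); the real Cardiag naming convention may differ.

```python
# Hypothetical sketch: collect the three Cardiag output files belonging
# to one examination. It assumes, for the example only, that the files
# share a base name and differ in extension (.int/.ecg/.map); the real
# Cardiag naming convention may differ.
from collections import defaultdict
from pathlib import Path

def group_examinations(directory):
    """Map an examination identifier to its INT, ECG and MAP files."""
    exams = defaultdict(dict)
    for f in Path(directory).glob("*"):
        ext = f.suffix.lower().lstrip(".")
        if ext in ("int", "ecg", "map"):
            exams[f.stem][ext] = f
    # Keep only examinations for which all three files are present.
    return {exam: files for exam, files in exams.items() if len(files) == 3}

for exam, files in group_examinations("cardiag_exports").items():
    print(exam, sorted(files))
```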
Another required property of the database schema is sufficient extensibility to support other kinds of measurements, e.g. laboratory examinations, coming from different systems.
In order to support efficient data exploration and mining, significant real parameters, or factors, of the patients and, where available, their diagnoses identified within the medical records had to be recorded in structured form. The possibility of storing the values of several measured parameters at a time was also requested.
III. EFFICIENT DATA MINING PROCESS DESIGN
In our approach, efficient data mining in cardiology, particularly in electrocardiography, should offer processing of time series and accompanying structured information and return typical patterns indicating manifestations of potential diseases or diagnoses.
Existing waveform database resources can provide useful background knowledge for the family of DM algorithms based on Inductive Logic Programming.
Finally, the background waveform datasets can be put to good use in explaining the results in an understandable, practitioner-friendly way.
IV. CONSOLIDATED DATABASE SCHEMA
As the first step in the implementation of the efficient data mining process, we had to draft the logical design of the consolidated database for the measured data. This was done independently of any particular database vendor, so the requirements of the different target environments do not lead to any conflicts. The resulting Cardio DB database schema supports storage of patients' personal data, their diagnosis classification, applied medicines, and, above all, the individual measurements carried out.
The E-R (entity-relationship) model of the Cardio DB is depicted in Fig. 2. The tables with a white background form the database core. In the centre, the Measurement table stores the basic information about all the measured parameters kept in the database. Any measurement can either be input as a set of results from external data files or come from a laboratory examination. Due to the initial requirements of the practitioners, both structured and binary data representations of the results are supported.
Patient information completes the frame for measurement evaluation and analysis. Patient diagnoses are encoded using ICD (International Classification of Diseases) codes, and a table of taken medicines is also available.
The additional greyed tables represent relational extensions of the Results table containing the special properties of each distinct data type, both for the proprietary input types and for laboratory results. The remaining necessary lookup tables are the hatched ones.
It is not required to input all the data values; thus even partial descriptions can be captured in case of incomplete knowledge. The data model is extensible for future needs of storing data of a different nature.
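A minimal sketch of the core of such a schema is given below, with SQLite standing in for the target database. The table and column names are inferred from the description above and are illustrative only; the authoritative schema is the one in Fig. 2.

```python
# A condensed, hypothetical cut of the Cardio DB core, inferred from the
# E-R description above; the authoritative schema is the one in Fig. 2.
# SQLite stands in for the target database; most columns are nullable,
# so partial descriptions can be captured.
import sqlite3

ddl = """
CREATE TABLE Patient (
    patient_id  INTEGER PRIMARY KEY,
    name        TEXT,
    birth_date  TEXT
);
CREATE TABLE Diagnosis (
    patient_id  INTEGER REFERENCES Patient,
    icd_code    TEXT                      -- ICD classification code
);
CREATE TABLE Measurement (
    measurement_id INTEGER PRIMARY KEY,
    patient_id     INTEGER REFERENCES Patient,
    performed_at   TEXT
);
CREATE TABLE Result (
    result_id      INTEGER PRIMARY KEY,
    measurement_id INTEGER REFERENCES Measurement,
    parameter      TEXT,
    value_num      REAL,                  -- structured representation
    value_blob     BLOB                   -- raw binary representation
);
"""

con = sqlite3.connect(":memory:")
con.executescript(ddl)
print(con.execute("SELECT name FROM sqlite_master").fetchall())
```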
V. SUMATRATT
The next part of the overall schema is dedicated to data pre-processing, which is in our case done by the SumatraTT system [2]. Within the designed efficient data mining process it is employed in two roles: it transforms the data from the source form into a form more suitable for data mining, and it performs the calculations of the advanced analysis.
SumatraTT is a modular system with plenty of available modules. They cover most of the requirements for data pre-processing in a user-convenient way, so the user can concentrate on the problem. For specific purposes beyond the scope of the standard modules there is a scripting module, in which any required piece of additional functionality can be implemented.
One of the major reasons for choosing the SumatraTT system was its ability to document the data pre-processing task (see Fig. 3) by means of an automatically generated set of HTML pages. This allows attaching the documentation to the data in order to describe what transformations were applied within the task. It also eases cooperation between teams, as the documentation allows collaborating groups to understand and reproduce the processes created by each other, possibly even without using SumatraTT.
SumatraTT provides a platform allowing fast development of the pre-processing phase of data mining [3], which is described below.
A. Pre-processing
A pre-processing task in SumatraTT starts with accessing the source data (in plain text files, DBF, XML, SQL databases, WEKA files, etc.).
An important part of data pre-processing is data understanding. SumatraTT provides four groups of modules dedicated specifically to this area. Modules in the First touch preview group generate quick one-click reports with data source overviews displaying statistics, graphs, and histograms of all the attributes. In addition, user-defined graphs and histograms can be used to adjust the data presentation to specific user needs. Interactive visualization modules, in which the user can interactively change the displayed form of the graphical views and combine and compare different data, allow the data to be explored more thoroughly. The last group of modules is called advanced and contains non-trivial visualization methods, such as the scatter plot, the radix plot, etc., requiring a certain level of professional insight.
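The fragment below mimics what a first-touch preview report contains, printing per-attribute statistics and a simple text histogram; it is a rough analogue written for this example, not the SumatraTT implementation, and the sample data are invented.

```python
# Mimics a "first touch preview": per-attribute statistics plus a text
# histogram. Invented for this example; not the SumatraTT implementation.
from statistics import mean, stdev

def first_touch_preview(rows, attribute, bins=5):
    values = [r[attribute] for r in rows if r.get(attribute) is not None]
    lo, hi = min(values), max(values)
    print(f"{attribute}: n={len(values)} min={lo} max={hi} "
          f"mean={mean(values):.2f} std={stdev(values):.2f}")
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    for i, c in enumerate(counts):
        print(f"  [{lo + i * width:6.1f}, {lo + (i + 1) * width:6.1f}) {'#' * c}")

first_touch_preview([{"hr": v} for v in (61, 72, 75, 80, 95, 110)], "hr")
```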
Once the user understands the data, the steps commonly called data cleaning typically have to be performed: the data are converted into a unified format, problems with missing values are solved, and numeric values are normalized and possibly discretized. For better results it is necessary to detect outliers and errors.
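The cleaning steps just named can be illustrated with a small sketch: mean imputation for missing values, min-max normalization, and optional equal-width discretization. A real project would use the corresponding SumatraTT modules; the function below only demonstrates the operations.

```python
# Sketch of the cleaning steps named above: mean imputation for missing
# values, min-max normalization into [0, 1], and optional equal-width
# discretization. Illustrative only; not a SumatraTT module.
def clean_column(values, discretize_bins=None):
    known = [v for v in values if v is not None]
    fill = sum(known) / len(known)                  # mean imputation
    filled = [fill if v is None else v for v in values]
    lo, hi = min(filled), max(filled)
    span = (hi - lo) or 1.0
    normalized = [(v - lo) / span for v in filled]  # min-max normalization
    if discretize_bins:                             # equal-width bins
        return [min(int(v * discretize_bins), discretize_bins - 1)
                for v in normalized]
    return normalized

print(clean_column([0.5, None, 1.5, 2.0], discretize_bins=3))
```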
The output of the pre-processing phase is placed into
a Feature database, which serves as a source for the data
mining phase.
B. Analysis
The effectiveness of data mining algorithms often requires some modification of the original data in the feature DB, e.g. a reduction of the descriptive attributes (either by selection or by introducing derived attributes) or a change of data granularity. When the training and testing sets are created, the distribution of positive and negative examples is balanced. This corresponds to the analysis process depicted in Fig. 1. These methods are supported by means of specific SumatraTT modules.
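One simple way to balance the class distribution, assumed here purely for illustration, is to downsample the majority class before splitting:

```python
# Illustrative balanced train/test split: downsample the majority class
# so positive and negative examples are equally represented, then split.
import random

def balanced_split(examples, test_ratio=0.3, seed=7):
    rng = random.Random(seed)
    pos = [e for e in examples if e["label"] == 1]
    neg = [e for e in examples if e["label"] == 0]
    n = min(len(pos), len(neg))                 # size of the smaller class
    sample = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(sample)
    cut = int(len(sample) * (1 - test_ratio))
    return sample[:cut], sample[cut:]

examples = [{"id": i, "label": int(i % 3 == 0)} for i in range(12)]
train, test = balanced_split(examples)
print(len(train), len(test))
```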
For specific types of data, such as time series, additional steps are necessary. The core module for the analysis is Trends (see Fig. 4), which is able to calculate average values, numbers of changes, minimum and maximum values, etc. for each individual time series.
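A plain re-implementation of the quantities just listed might look as follows. The "number of changes" is read here as the number of trend reversals, which is one possible interpretation; the real Trends module may define it differently.

```python
# Quantities in the spirit of the Trends module: average, extremes and a
# count of trend reversals for one time series. The real module is a
# SumatraTT component; this re-implementation is illustrative only and
# "number of changes" is read here as the number of direction reversals.
def trend_features(series):
    diffs = [b - a for a, b in zip(series, series[1:])]
    reversals = sum(1 for a, b in zip(diffs, diffs[1:]) if a * b < 0)
    return {
        "mean": sum(series) / len(series),
        "min": min(series),
        "max": max(series),
        "n_changes": reversals,
    }

print(trend_features([70, 72, 71, 75, 74, 74, 78]))
```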
SumatraTT also offers modules for other tasks required in data mining. An enumeration of all these modules is beyond the scope of this article; we mention here only the calculation of contingency tables, the splitting of data into testing and training sets, the introduction of new attributes, and unsupervised normalization.
VI. ROLE OF EXPERTS IN THE DM PROCESS
The results of data mining must be interpreted by experts. They decide whether the results are sufficient and bring some useful information that could be practically exploited. If not, some parts of the DM process have to be repeated with modified parameters.
In our approach, the experts build a reusable knowledge base stored in the form of an ontology in a shared knowledge repository (see Fig. 1), allowing cooperation between (often geographically distant) experts.
The knowledge base storing information about the results and the parameter changes can be used to learn, on the meta-level, how to automatically change the parameters of the individual efficient data mining phases (both data pre-processing and data mining). Thus the stored results of previous mining serve as background knowledge for subsequent data mining tasks.
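The following sketch illustrates the idea of such meta-level reuse with a flat log of runs. The actual repository stores an ontology, and all structures and names below are invented for the example.

```python
# Flat-log sketch of the meta-level reuse described above: record the
# parameters and result quality of each run, then seed a new task with
# the best parameters seen so far. The real repository is an ontology;
# all structures here are invented for the example.
runs = []  # stand-in for the shared knowledge repository

def record_run(phase, params, quality):
    runs.append({"phase": phase, "params": params, "quality": quality})

def best_params(phase):
    candidates = [r for r in runs if r["phase"] == phase]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["quality"])["params"]

record_run("pre-processing", {"bins": 5}, quality=0.71)
record_run("pre-processing", {"bins": 10}, quality=0.78)
print(best_params("pre-processing"))  # {'bins': 10}
```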
Cardio DB is currently implemented in an Oracle 9i database, and its tables are populated with the originally proprietary experimental data from the Institute of Physiology, Charles University in Prague, adapted to the newly designed format. The necessary data transformations are being prepared using SumatraTT. The initial structure as well as the results are being tested and evaluated for their practical utility.
Future tasks include a possible conceptual mapping of the Cardio DB and the constructed knowledge repository to the existing standards in medicine: general HL7 concepts [4], the XML-based ecgML [5], DICOM waveforms [6], and others. Automated mining of the knowledge base results is a longer-term goal.
VII. CONCLUSIONS
A consolidated database, Cardio DB, for medical examination data (initially in the area of cardiology) and an efficient data mining process have been designed with the aim of creating a shared data storage framework enabling subsequent scientific experiments of different kinds. The requirements analysis included studying existing data formats and suggesting suitable ways to transform the individual data models into a common platform.
REFERENCES
[1] I. C. Henry, A. L. Goldberger, G. B. Moody, and R. G. Mark, "PhysioNet: An NIH Research Resource for Physiologic Datasets and Open Source Software," in Proc. 14th IEEE Symposium on Computer-Based Medical Systems (CBMS'01), March 2001, p. 245.
[2] O. Štěpánková, P. Aubrecht, Z. Kouba, and P. Mikšovský, "Preprocessing for Data Mining and Decision Support," in Data Mining and Decision Support: Integration and Collaboration, Dordrecht: Kluwer Academic Publishers, 2003, pp. 107-117.
[3] D. Pyle, Data Preparation for Data Mining. San Francisco, CA: Morgan Kaufmann Publishers, 1999.
[4] A. Hinchley, A Primer on the HL7 Version 3 Communication Standard, 3rd ed., 2005, ISBN 3-933819-19-9.
[5] H. Wang, F. Azuaje, G. Clifford, B. Jung, and N. Black, "Methods and tools for generating and managing ecgML-based information," in Proc. Computers in Cardiology 2004, Chicago, IEEE Press, September 2004.
[6] DICOM Standards Committee, Digital Imaging and Communications in Medicine, 2004.

Fig. 1. Overall DM Process Schema
Fig. 2. E-R model of the Cardio DB for Examination Data
Fig. 3. Data pre-processing task and FirstTouchPreview in SumatraTT
Fig. 4. SumatraTT analysis project