Data Modelling and Pre-processing for Efficient Data Mining in Cardiology

Kamil Matoušek and Petr Aubrecht

Abstract—A consolidated database, Cardio DB, for medical examination data and an efficient data mining process in cardiology have been designed with the aim of creating a shared data resource supporting different scientific experiments. The requirements analysis included a study of existing data formats and suggested suitable ways to transform individual data models into a common platform. Efficient data mining in cardiology should offer processing of time series together with accompanying structured information, and should return typical patterns indicating manifestations of potential diseases or diagnoses. Existing waveform resources are utilized in several ways. Cardio DB is currently populated with experimental data from the Institute of Physiology, Charles University in Prague, and its structure and analytical data mining options are being evaluated. Human experts will provide valuable feedback on the mined results as well as on the overall process, and will suggest possible refinements. Having acquired enough results, they will construct a reusable repository of their collected knowledge.

I. INTRODUCTION

Existing proprietary data generated within different medical examinations are typically of varied types. We are identifying the common properties of different software outputs and creating a consolidated storage database in which patients' data of different natures can be stored together. The database has to be easily extensible, so that new data features can be added as soon as new data types become required. Once the data are populated into the database, they have to be further prepared for data mining (DM). Extraction of significant parameters and acquisition of background knowledge, including disease symptomatic information, have to take place. Both the initial data transformations and the statistical analysis are performed using SumatraTT.
Based on the requirements, an efficient data mining process was designed (see Fig. 1). It takes into account multiple data sources, particularly the main data in Cardio DB, described in Section IV, and the waveforms in the Massachusetts General Hospital/Marquette Foundation Waveform Database (WFDB), available in the PhysioNet [1] format, used for comparison and as background knowledge. It is expected that additional data sources will be required during the data mining. The source data are transformed into a form more suitable for data mining by the SumatraTT transformation tool, described in Section V. Expert and knowledge feedback is also supported via a professional knowledge repository, described in Section VI.

Manuscript received June 30, 2006. This work was supported in part by the grant 201/05/0325 "New Methods and Tools for Knowledge Discovery in Databases" and the project No. T201210527 "Knowledge-Based Support for Diagnostics and Prediction in Cardiology" from the "Information Society" Program, both from the Academy of Sciences of the Czech Republic. K. Matoušek and P. Aubrecht are with the Department of Cybernetics, Czech Technical University in Prague, Technická 2, CZ-166 27 Prague 6 ({matousek,aubrech}@fel.cvut.cz).

II. REQUIREMENTS ANALYSIS

The existing data formats have been analyzed and user experience collected in order to identify the major requirements for the consolidated database design. The main focus was the data analysis in the field of cardiology at the Institute of Physiology, Charles University in Prague. The central data acquisition tool there is a proprietary application, Cardiag (Diagnostic tool in Cardiology). During examinations, this application stores analyzable data generated by a connected electrocardiograph on-line. The binary files used are of several types: measured time series, processed electrocardiography data, and selected characteristic strikes, stored in three special binary file types (INT, ECG and MAP).
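A reader for such measurement files might look as follows. This is only a minimal sketch: the actual INT, ECG and MAP layouts are proprietary to Cardiag and are not documented here, so the assumption of a flat sequence of little-endian 16-bit samples is purely illustrative.

```python
import struct

def read_samples(raw: bytes):
    """Decode a raw byte string as consecutive little-endian int16 samples.

    Hypothetical layout; the real Cardiag file formats may include
    headers, channel interleaving, or different sample widths.
    """
    count = len(raw) // 2
    return list(struct.unpack("<%dh" % count, raw[:count * 2]))

# Synthetic example bytes standing in for a measured time series.
data = struct.pack("<4h", 100, -50, 0, 32767)
samples = read_samples(data)
```

In practice such a reader would be wrapped in a SumatraTT import step, so that the decoded series can be stored in the consolidated database alongside its structured metadata.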
The information distributed across the three files usually belongs to a single medical examination. The users requested that the database store both the binary data structure of the measured time series and selected structured information, e.g. concerning the characteristic strike features or the symptoms observed and marked by the physician. Another required property of the database schema is sufficient extensibility to support other kinds of measurements, e.g. laboratory examinations, coming from different systems. In order to support efficient data exploration and mining, significant real parameters, or factors, of patients and, if available, their diagnoses identified within medical records had to be recorded in structured form. The possibility of storing values of several measured parameters at a time was also requested.

III. EFFICIENT DATA MINING PROCESS DESIGN

In our approach, efficient data mining in cardiology, particularly in electrocardiography, should offer processing of time series together with accompanying structured information, and should return typical patterns indicating manifestations of potential diseases or diagnoses. Existing waveform database resources can provide useful background knowledge for the family of DM algorithms using Inductive Logic Programming. Finally, the background waveform datasets can be beneficially used to explain results in an understandable, practitioner-friendly way.

IV. CONSOLIDATED DATABASE SCHEMA

As the first step in the implementation of the efficient data mining process, we had to draft the logical design of the consolidated database for the measured data. This was performed independently of any particular database vendor, so the requirements of different target environments do not imply any conflicts. The resulting Cardio DB database schema supports storage of patients' personal data, their diagnosis classification, applied medicines, and mainly the individual measurements carried out.
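The kind of schema this implies can be sketched as a handful of relational tables. The table and column names below are our own illustrative choices (the paper's actual Oracle schema is shown only as an E-R diagram), using SQLite purely as a convenient stand-in:

```python
import sqlite3

# Hypothetical sketch of the Cardio DB core: patients, their ICD-coded
# diagnoses and medicines, and measurements with structured or binary results.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Patient (
    patient_id  INTEGER PRIMARY KEY,
    name        TEXT,
    birth_date  TEXT
);
CREATE TABLE Diagnosis (            -- diagnoses encoded with ICD codes
    patient_id  INTEGER REFERENCES Patient(patient_id),
    icd_code    TEXT
);
CREATE TABLE Medicine (             -- medicines taken by the patient
    patient_id  INTEGER REFERENCES Patient(patient_id),
    medicine    TEXT
);
CREATE TABLE Measurement (          -- central table: one row per measured parameter
    measurement_id INTEGER PRIMARY KEY,
    patient_id     INTEGER REFERENCES Patient(patient_id),
    measured_at    TEXT
);
CREATE TABLE Result (               -- structured and/or binary representation
    result_id      INTEGER PRIMARY KEY,
    measurement_id INTEGER REFERENCES Measurement(measurement_id),
    value_num      REAL,             -- structured value; NULL if only binary data exist
    value_blob     BLOB              -- raw time-series content
);
""")
conn.commit()
```

Nullable columns reflect the requirement above that partial descriptions remain storable when knowledge is incomplete.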
The E-R (entity-relationship) model of the Cardio DB is depicted in Fig. 2. The tables with a white background form the database core. In the centre, the Measurement table stores the basic information about all the measured parameters stored in the database. Any measurement can either be input as a set of results from external data files or come from a laboratory examination. In accordance with the initial requirements of practitioners, both structured and binary data representations of the results are supported. Patient information completes the frame for measurement evaluation and analysis. Patient diagnoses are encoded using ICD (International Classification of Diseases) codes, and a table for taken Medicines is also available. The additional greyed tables represent relational extensions of the Results table, containing special properties of each different data type (for the proprietary input types) as well as for laboratory results. The remaining necessary lookup tables are the hatched ones. It is not required to input all the data values, so even partial descriptions can be captured in case of incomplete knowledge. The data model is extensible for future needs of storing data of a different nature.

V. SUMATRATT

The next part of the overall schema is dedicated to data pre-processing, which is in our case done by the system SumatraTT [2]. Within the designed efficient data mining process it is employed in two roles: it transforms data from the source form into a form more suitable for data mining, and it performs the calculations of the advanced analysis. SumatraTT is a modular system with plenty of available modules. They cover most of the requirements on data pre-processing in a user-convenient way, so the user can concentrate on the problem. For specific purposes outside the scope of the standard modules there is a scripting module, in which any required piece of additional functionality can be implemented.
One of the major reasons for choosing the SumatraTT system was its ability to document the data pre-processing task (see Fig. 3) by means of an automatically generated set of HTML pages. This allows attaching the documentation to the data in order to describe what transformations were applied within the task. It also eases cooperation between teams, as the documentation allows collaborating groups to understand and reproduce the processes created by each other, possibly even without using SumatraTT. SumatraTT provides a platform allowing fast development of the pre-processing phase of data mining [3], which is described below.

A. Pre-processing

A pre-processing task in SumatraTT starts with accessing the source data (in plain text files, DBF, XML, SQL databases, WEKA files, etc.). An important part of data pre-processing is data understanding. SumatraTT provides four groups of modules dedicated specifically to this area. Modules in the First touch preview group generate quick one-click reports with data source overviews displaying statistics, graphs and histograms of all their attributes. In addition, user-defined graphs and histograms can be used to adjust the data presentation to specific user needs. Interactive visualization modules, where the user can interactively change the displayed form of graphical views and combine and compare different data, allow the user to explore the data more thoroughly. The last group of modules is called advanced; it contains non-trivial visualization methods, such as scatter plots and radix plots, which require a certain level of professional insight. After the data are understood, steps commonly called data cleaning typically have to be performed: the data are converted into a unified format, problems with missing values are solved, and numeric values are normalized and possibly discretized. For better results it is necessary to detect outliers and errors.
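The cleaning steps above can be sketched in a few lines. The function names are our own; in SumatraTT these operations are configured as graphical modules rather than called as a Python API, so this is only an illustration of the transformations involved:

```python
# Minimal sketch of data cleaning: mean imputation of missing values,
# min-max normalization into [0, 1], and equal-width discretization.

def fill_missing(values, sentinel=None):
    """Replace missing entries with the mean of the observed ones."""
    present = [v for v in values if v is not sentinel]
    mean = sum(present) / len(present)
    return [mean if v is sentinel else v for v in values]

def normalize(values):
    """Min-max normalization into the unit interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, bins=4):
    """Equal-width discretization of [0, 1] values into bin indices 0..bins-1."""
    return [min(int(v * bins), bins - 1) for v in values]

# A toy attribute column with two missing readings.
raw = [120.0, None, 140.0, 130.0, None, 180.0]
cleaned = discretize(normalize(fill_missing(raw)))
```

Outlier and error detection would precede these steps in a real task, so that extreme values do not distort the imputed means and the normalization range.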
The output of the pre-processing phase is placed into a Feature database, which serves as a source for the data mining phase.

B. Analysis

The effectiveness of data mining algorithms often requires some modification of the original data in the feature DB, e.g. reduction of the descriptive attributes (either by selection or by introducing derived attributes) or a change of data granularity. When the training and testing sets are created, the distribution of positive and negative examples is balanced. This corresponds to the analysis process depicted in Fig. 1. These methods are supported by specific SumatraTT modules. For specific types of data, such as time series, additional steps are necessary. The core module for the analysis is Trends (see Fig. 4), which is able to calculate average values, numbers of changes, minimum and maximum values, etc. for each individual time series. SumatraTT also offers modules for other tasks required in data mining. An enumeration of all these modules is beyond the scope of this article; we mention here only the calculation of contingency tables, splitting data into testing and training sets, introduction of new attributes, and unsupervised normalization.

VI. ROLE OF EXPERTS IN THE DM PROCESS

The results of data mining must be interpreted by experts. They decide whether the results are sufficient and bring useful information which could be practically exploited. If not, some parts of the DM process have to be repeated with modified parameters. In our approach, experts build a reusable knowledge base stored in the form of an ontology in a shared knowledge repository (see Fig. 1), allowing cooperation between (often geographically distant) experts. The knowledge base, storing information about results and parameter changes, can be used to learn on the meta-level how to automatically change the parameters of the individual efficient data mining phases (both data pre-processing and data mining).
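The per-series statistics of the Trends module and the balanced train/test split described above can be sketched as follows. This is an illustrative re-implementation, not SumatraTT's own code, and the function names are ours:

```python
import random

def trend_features(series):
    """Trends-style summary of one time series: mean, min, max,
    and the number of value changes along the series."""
    changes = sum(1 for a, b in zip(series, series[1:]) if a != b)
    return {
        "mean": sum(series) / len(series),
        "min": min(series),
        "max": max(series),
        "changes": changes,
    }

def balanced_split(examples, test_fraction=0.3, seed=0):
    """Split (value, label) examples so that positives and negatives
    keep their ratio in both sets (a stratified split)."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (0, 1):
        group = [e for e in examples if e[1] == label]
        rng.shuffle(group)
        k = int(len(group) * test_fraction)
        test.extend(group[:k])
        train.extend(group[k:])
    return train, test

feats = trend_features([1, 1, 2, 3, 3, 2])
```

Each time series thus collapses into a small, fixed-size feature vector, which is what makes the Feature database usable by attribute-based mining algorithms.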
Thus, the stored results of previous mining serve as background knowledge for subsequent data mining tasks.

The Cardio DB is currently implemented in an Oracle 9i database, and its tables are populated with the originally proprietary experimental data from the Institute of Physiology, Charles University in Prague, converted to the newly designed format. The necessary data transformations are being prepared using SumatraTT. The initial structure as well as the results are being tested and evaluated for their practical utility. Future tasks include a possible conceptual mapping of the Cardio DB and the constructed knowledge repository to existing standards in medicine: general HL7 concepts [4], the XML-based ecgML [5], DICOM waveforms [6] and others. Automated mining of the knowledge base results is a longer-term goal.

VII. CONCLUSIONS

A consolidated database, Cardio DB, for medical examination data (initially in the area of cardiology) and an efficient data mining process have been designed with the aim of creating a shared data storage framework enabling the subsequent performance of different scientific experiments. The requirements analysis included a study of existing data formats and suggested suitable ways to transform individual data models into a common platform.

REFERENCES

[1] I. C. Henry, A. L. Goldberger, G. B. Moody, R. G. Mark: PhysioNet: An NIH Research Resource for Physiologic Datasets and Open Source Software. In: 14th IEEE Symposium on Computer-Based Medical Systems (CBMS'01), March 2001, p. 245.
[2] O. Štěpánková, P. Aubrecht, Z. Kouba, P. Mikšovský: Preprocessing for Data Mining and Decision Support. In: Data Mining and Decision Support: Integration and Collaboration, pp. 107-117, Dordrecht, 2003. Kluwer Academic Publishers.
[3] D. Pyle: Data Preparation for Data Mining. Morgan Kaufmann Publishers, Inc., San Francisco, CA, USA, 1999.
[4] A. Hinchley: A Primer on the HL7 Version 3 Communication Standard, 3rd edition, 2005, ISBN 3-933819-19-9.
[5] H. Wang, F. Azuaje, G. Clifford, B. Jung, N. Black: Methods and Tools for Generating and Managing ecgML-based Information. In: Proc. of Computers in Cardiology 2004, IEEE Press, Chicago, September 2004.
[6] DICOM Standards Committee: Digital Imaging and Communications in Medicine, 2004.

Fig. 1. Overall DM Process Schema
Fig. 2. E-R model of the Cardio DB for Examination Data
Fig. 3. Data pre-processing task and FirstTouchPreview in SumatraTT
Fig. 4. SumatraTT analysis project