A Universal Data Pre-processing System
Petr AUBRECHT, Zdeněk KOUBA
Department of Cybernetics, Faculty of Electrical Engineering,
Czech Technical University in Prague
Technická 2, 160 00 Prague 6
{aubrech,kouba}@labe.felk.cvut.cz
Abstract. The main research in data warehousing and data mining is focused
on the efficiency of the core algorithms (search and induction). Unfortunately,
these algorithms depend on the quality of the input data; noisy and erroneous
data can make them useless. Before processing, data needs to be investigated
and pre-processed.
This paper specifies a set of requirements on a data pre-processing system
which should be satisfied in order to provide a complete and usable system.
These requirements were taken into account when designing and
implementing the SumatraTT pre-processing system at the Czech Technical
University in Prague.
Its development was motivated by the aim to apply ideas of pre-processing
[1, 2] in real applications. Currently, there is no relevant system widely
accepted and used by the data mining community. SumatraTT aspires to provide
a unified environment that will be welcomed by this community.
SumatraTT serves as a reference implementation of our ideas and as a
universal platform for developing add-on modules in various target areas of
interest – data mining, data warehousing, medical information processing, etc.
Keywords: data preprocessing, GUI, Sumatra
1 Practice (Critique of the Current State)
Nowadays, there exists a huge amount of data stored typically in company archives
or databases of production systems. Whereas researchers have paid enormous
attention to the development of efficient (relational) database management systems
and data mining algorithms, the step in the middle, data pre-processing, suffers
from a lack of theory and widely available tools or systems.
Our experience shows that in practice there are the following three ways of
carrying out data pre-processing:
Manually: Data is edited manually in a spreadsheet. This is the easiest case
from the user's perspective and can be the fastest approach for non-repeatable
processing. However, if an additional batch of data becomes available, there is no
way to repeat the process in order to add it to the processed dataset. In addition,
there is no way to verify the correctness of the data pre-processing process.
Program: A special dedicated single-purpose program is developed, usually
in a high-level language like awk, sed, or Perl [3] (originally administration tools in
Unix systems), or R [4] (more focused on statistical computing and graphics),
or alternatively in C++ or Pascal. This way also lacks comprehensible documentation,
and it is difficult to port the solution to a co-worker's environment even if it is
based on the same computational platform (availability of awk or sed scripts on
Windows, endless installation of interpreters, etc.).
System: Use of a specialised system is the most convenient way, even though
it is demanding: such a system is usually very expensive and, to be used
effectively, requires a well-trained user. The systems vary in the capabilities they
provide – from MS DTS (Data Transformation Services, which restricts the
definition of a data transformation to simple SQL queries or Basic scripts) to
Clementine (which is complex, very expensive, and focused exclusively on data
mining).
There are reasons for avoiding the first two ways. For example, collaborative
data mining [5] excludes editing raw data manually and discourages writing
single-purpose data transformation programs. Making use of a specialised system
can be an advantage, especially if the system provides self-documenting features
to some extent, even if limited to writing a simple textual description of the
transformation. This enables partners to re-use transformations developed by
others by repeating the steps described in the documentation.
The third way, using specialised systems, has many advantages. The only obstacle
to an almost seamless exchange of transformation definitions is that the respective
tools/systems are not commonly available. Such systems can support data
understanding, both visual and configuration-driven data processing, reporting,
and other features described further.
The conclusion of this state-of-the-art overview is that there is a need for an open,
widely available, and feature-rich data pre-processing system. Its preferred features
include easy manipulation to compete with the manual way, enough power to win
over the program way, and a rich palette of capabilities to outstrip the system way.
Such a system should also provide a platform for collaboration.
2 Steps of Data Pre-processing
There already exists a data mining methodology, CRISP [6] (see figure 1), which
explicitly involves data pre-processing as a distinguished part of the data mining
process. In our opinion, the methodology splits the process into relatively separate
steps, while in practice each of these steps is influenced by the others; in particular,
there is a very tight interaction between adjacent steps.
This paper focuses on the second and third steps: data understanding and data
preparation. We consider these steps tightly coupled, because data understanding
is an incremental process which progresses during data preparation. It is common
that errors are reported during the data preparation process. These reports enrich
the user's knowledge of the data and influence the further course of data
preparation. This part should be carried out iteratively.
Of course, the whole data pre-processing process can be divided into steps.
However, it is important to emphasise that all these steps are run in cycles rather
than as a sequence of isolated steps.
[Figure 1: Original and modified CRISP schema. Both diagrams show the phases
Business Understanding, Data Understanding, Data Preparation, Modelling,
Evaluation, and Deployment; the modified schema marks Data Understanding and
Data Preparation as the area of interest of a data pre-processing system.]
[Figure 2: Sequence of Available Visualisations: first-touch review, static,
interactive, advanced.]
Data pre-processing steps:
Understanding: The main contribution of data pre-processing is data
understanding. No sensible results can be achieved without a detailed exploration
of the received data.
Visualisation – the most valuable way of data understanding is visual exploration.
People are often said to receive about 95 % of their information visually, and
graphics can provide a huge amount of information at once, including comparisons
of various values. The following kinds of visual exploration offer increasing detail
of the investigated data – from simple, automatically generated reports, over
interactive tools providing dynamic views of the data, to advanced techniques
which require certain knowledge of the data; see figure 2.
- first-touch review – a graphical report with statistical values, graphs, and
histograms on a single HTML page, with thumbnails of a graph and a histogram
for each field. A summary graph reflecting all the fields together can be
included, too. This kind of visualisation provides the fastest look at the data
with no demands on the user, because it is generated automatically (a minimal
sketch follows this list).
- static – common types of visual representation include a wide range of graphs
(2D, 3D, combined, pie, histograms, error bars, etc.). There is a sufficient offer
of third-party tools (like R, gnuplot, etc.), including interactive ones
(spreadsheets, Mathematica, etc.).
- interactive – dynamic data visualisation. It allows investigating both mutual
dependencies between data attributes and the dependency of attributes on time.
The interactivity enables the user to change the point of view, the form of
presentation, etc. A good example is a classic 3D graph with the capability to
change the subset of records and/or attributes, to add sub-graphs on the screen,
or to animate a time series. It should be noted that interactive visualisation can
feed back to the user – e.g. to set up filtering conditions for a part of the
pre-processing schema.
- advanced – the last group comprises techniques expressing data in an unusual
way, like graphs with parallel axes, graphs in radial coordinates, matrix
diagrams, or multiline graphs.
Reports, statistics – a textual form is necessary for completeness. It can catch
exceptional values invisible in graphs (e.g. the y-axis scale of a graph can mask
an exceptional value of the first record). In addition, the reports can be enriched
by including graphics from the previous part.
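
As a toy illustration of the first-touch review just described, the following Java
sketch computes per-field summary statistics and renders them as a single HTML
page. The field names and sample values are invented; the real SumatraTT module
also adds graph and histogram thumbnails, which are omitted here.

import java.util.List;

// Toy sketch of a first-touch review: per-field summary statistics
// rendered as one HTML page. Field names and values are invented.
public class FirstTouchSketch {
    public static void main(String[] args) {
        String[] fields = {"level", "flow"};
        List<double[]> columns = List.of(
                new double[]{4.1, 4.8, 3.9},   // sample values for "level"
                new double[]{0.2, 0.7, 0.4});  // sample values for "flow"
        StringBuilder html = new StringBuilder("<html><body><table border='1'>\n")
                .append("<tr><th>field</th><th>min</th><th>max</th><th>mean</th></tr>\n");
        for (int i = 0; i < fields.length; i++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY, sum = 0;
            for (double v : columns.get(i)) {
                min = Math.min(min, v);
                max = Math.max(max, v);
                sum += v;
            }
            html.append(String.format("<tr><td>%s</td><td>%.2f</td><td>%.2f</td><td>%.2f</td></tr>%n",
                    fields[i], min, max, sum / columns.get(i).length));
        }
        System.out.println(html.append("</table></body></html>"));
    }
}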
Modification: It is necessary to provide a set of data processing modules
which are powerful enough to allow arbitrary modifications. Limiting the
capabilities significantly restricts the usability of the tool. The best way is to offer
simple modules for typical actions (access to data sources, subsets of attributes,
filters, calculations, rules for handling missing/wrong data, etc.) together with a
language for implementing algorithms that fulfil the very specific remaining
requirements.
Reverse-modification: A requirement may arise to carry out a reverse
transformation after completing data preparation and data modelling, bringing
the resulting data sets (e.g. predicted data) back to the original form (format).
Support for reverse-modification can range from automatic design (in simple
cases) to verification/proof of a hand-made transformation based on checking the
equality of the original data with that obtained by reverse modification of the
transformed data.
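
A minimal sketch of the verification idea: apply the forward transformation and
the hand-made reverse to every record and check that the original is reproduced.
The Row type and the scaling transform are hypothetical stand-ins, not SumatraTT
classes.

import java.util.List;
import java.util.function.UnaryOperator;

public class ReverseCheck {
    // Hypothetical record type standing in for a data row.
    record Row(String id, double value) {}

    // True if reverse(forward(r)) reproduces every original row.
    static boolean verifyReverse(List<Row> original,
                                 UnaryOperator<Row> forward,
                                 UnaryOperator<Row> reverse) {
        return original.stream()
                       .allMatch(r -> r.equals(reverse.apply(forward.apply(r))));
    }

    public static void main(String[] args) {
        UnaryOperator<Row> scale   = r -> new Row(r.id(), r.value() * 2);  // forward
        UnaryOperator<Row> unscale = r -> new Row(r.id(), r.value() / 2);  // hand-made reverse
        System.out.println(verifyReverse(List.of(new Row("t1", 4.2)), scale, unscale)); // true
    }
}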
The pre-processing system shall support data understanding and modification
by providing mutual interoperability of all the tools. Using advanced visualisation
modules, the user will navigate graphs more easily. Interoperability of the
visualisation and modification tools will enable the user to set up the parameters
of a filter, to be used later in the modification phase, directly within an interactive
histogram provided by the visualisation tool. Such interoperability significantly
speeds up the pre-processing procedure and makes it much more interesting and
valuable.
A universal data pre-processing system called SumatraTT, which is based on
the ideas declared in this section, has been designed by the authors. Most of its
features have already been implemented and are available.
3 SumatraTT 1.0
The first version of SumatraTT was developed in C++. It includes an interpreter of
a proprietary scripting language (SumatraScript). The language supports the use of
templates (see fig. 3). The templates contain SumatraScript code enriched by
additional macros. The next paragraphs summarise the old design and the reasons
for re-engineering SumatraTT.
The input to the SumatraTT 1.0 system consists of templates associated with
values to be substituted for the template arguments (parameters of the respective
data transformation).
By expanding the macros, this input is converted to an executable SumatraScript
program. There is a set of data-source drivers making it possible to access and
process data from a SumatraScript program. All of them follow a unified
programming interface.
[Figure 3: Schema of SumatraTT 1.0. A template plus metadata expands into a
SumatraScript program, which reads through a data-access input driver and writes
through a data-access output driver.]
The idea of templates was deprecated because it became hard to combine the
capabilities of multiple templates (e.g. filtering and adding a new attribute).
Another problem was inter-process communication between running
transformations (actually never fully implemented).
The new approach enables modules to exchange data with each other in a natural
way using Java 2 communication means. As the whole transformation runs in
several threads of the same JVM, mutual communication is easy even if the
transformation consists of several relatively independent tasks. This makes it
possible to share objects and send messages between individual tasks, as the
following sketch illustrates.
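
A minimal sketch, not the actual SumatraTT API: two module threads in the same
JVM exchange records through a shared bounded queue. The record strings and the
end-of-stream sentinel are illustrative (java.util.concurrent is used for brevity).

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    // Sentinel object marking the end of the record stream.
    static final String EOF = "<eof>";

    public static void main(String[] args) throws InterruptedException {
        // One bounded channel connecting two pipeline stages.
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(64);

        Thread source = new Thread(() -> {
            try {
                for (String rec : new String[]{"a;1", "b;2", "c;3"})
                    channel.put(rec);               // emit records
                channel.put(EOF);                   // signal end of stream
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread sink = new Thread(() -> {
            try {
                // Identity comparison with the shared sentinel object is intended.
                for (String rec = channel.take(); rec != EOF; rec = channel.take())
                    System.out.println("processed " + rec);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        source.start(); sink.start();
        source.join(); sink.join();
    }
}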
4 SumatraTT 2.0
SumatraTT 2.0 is a completely new version of the system [7, 8, 9]. Instead of using
a scripting language and templates, it uses modules and connections, see fig. 4.
The new design makes it possible to develop a rich set of modules. The modules
are stored in a tree structure, which organises them by their functionality. There is
a set of modules implementing external connectivity, mathematical calculations,
visualisation, etc.
Java 2 was selected as the implementation platform. This decision brought a
number of advantages, including independence of the operating system, seamless
integration of modules possibly running in different threads, easy integration with
other Java-based software, and the availability of both stable and emerging
standards (XML, CORBA modules, etc.).
SumatraTT makes heavy use of meta-data. It supports meta-data processing by an
extra channel added to each open data channel. Both the system and the modules
send commands (like run, stop, recordformat). There are also messages splitting
data into groups (groupbegin, groupend). Modules can send arbitrary messages,
and groups of modules can then communicate with each other by means of them.
In addition, every piece of data may optionally be accompanied by a meta-data
description, which includes information on data source reliability, known errors,
the definition of the data origin, the processing carried out so far, etc.
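
The command and group messages listed above suggest an interface of roughly the
following shape. This is an assumption made for illustration; the actual SumatraTT
type and method names are not given in the paper.

// Commands and group markers travel on a control channel parallel to the
// data stream. All names here are illustrative, not SumatraTT's own.
enum Control { RUN, STOP, RECORD_FORMAT, GROUP_BEGIN, GROUP_END }

interface Channel {
    void sendData(Object record);       // main data stream
    void sendControl(Control message);  // parallel meta-data channel
}

class GroupedWriter {
    // Emit a batch of records bracketed by group markers.
    static void writeGroup(Channel out, Object[] records) {
        out.sendControl(Control.GROUP_BEGIN);
        for (Object r : records) out.sendData(r);
        out.sendControl(Control.GROUP_END);
    }
}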
[Figure 4: SumatraTT 2.0: Module and Connection. Modules expose inputs and
outputs, are linked by connections carrying both data and metadata, and are driven
through processData calls.]
There are rumours about the slow execution of Java. This aspect of the Java
implementation of SumatraTT has been tested (without optimisation; the speed of
reading from files could be further improved by using memory-mapped files,
supported since Java 1.4). The experience is that reading and simple processing of
a 100 MB file takes about 40 seconds on a Pentium III at 800 MHz, which is
sufficient for any (at least academic) purpose. Such performance has been achieved
by a very careful design. Modules can run in their own threads; however, the
synchronisation overhead is too high in such a case. That is why sharing a thread
among multiple modules is supported in SumatraTT. In figure 5, only the data
source modules start threads. Another technique decreases the amount of data
stored in local copies by different modules. Modules are expected to modify only a
small part of the whole record; the unmodified part is only referenced, not copied.
[Figure 5: Thread Model]
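
The record-sharing technique just described can be pictured as a copy-on-write
wrapper: a derived record keeps a reference to the unmodified base and stores only
the fields it overrides. This is an illustrative sketch, not the actual SumatraTT
record classes.

import java.util.HashMap;
import java.util.Map;

// A derived record references the shared, unmodified base record and
// stores only the fields it overrides.
class DerivedRecord {
    private final Map<String, Object> base;                      // shared, never copied
    private final Map<String, Object> overrides = new HashMap<>();

    DerivedRecord(Map<String, Object> base) { this.base = base; }

    Object get(String field) {
        return overrides.containsKey(field) ? overrides.get(field) : base.get(field);
    }

    void set(String field, Object value) { overrides.put(field, value); }
}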
5 Modular Architecture
Modularity is the main advantage of the new system design. Modules are expected
to provide a small piece of functionality, which is carried out in a rigorous way.
This is similar to the idea of Unix programs, which can be arbitrarily combined to
manage a complex task. Simple modules are easier both to learn and to use. The
clearly defined functionality of a module brings an extra advantage: it is possible
to (semi-)formally describe the functionality of a module. This makes it possible to
create an ontology of modules, which will simplify the task of selecting the most
appropriate module when designing a particular data transformation. Complex
modules, in contrast, tend to fail in singular (unexpected) cases.
All modules are held separately in a directory tree (identical to the tree of
modules) as JAR files. The user can add and remove modules simply by adding and
removing files. A shared repository is also under development, which will provide
version management of modules.
SumatraTT provides an open architecture. Modules have to implement a simple
interface to be ready to participate in a data transformation. To simplify module
creation even more, the SumatraTT core provides four base classes as starting
points for the design of a new module. Such a new module is simply implemented
as a class inheriting from one of the base ones. Many modules included in the
standard SumatraTT distribution consist of less than 10 lines of source code.
This simplicity encourages advanced users to create their own modules for
specific purposes. It is easy to implement methods as SumatraTT modules and then
incorporate them into a larger data transformation, as sketched below.
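
The paper does not name the four base classes, so the following sketch invents a
hypothetical one to show the scale involved; the derived module fits well under
the ten-line mark quoted above.

// Hypothetical base class standing in for one of the four SumatraTT
// base classes (their real names are not given in the paper).
abstract class SimpleTransformModule {
    protected abstract Object processField(Object value);
}

// A complete module then fits comfortably under ten lines:
class UppercaseModule extends SimpleTransformModule {
    @Override
    protected Object processField(Object value) {
        // Upper-case every string field, pass everything else through.
        return (value instanceof String s) ? s.toUpperCase() : value;
    }
}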
Integration with third-party tools is becoming more and more important. As a pilot
project, we implemented a SumatraTT module providing data connectivity to a
neural network based data modelling software package. By replacing the data input
layer of the package with a SumatraTT module, the package became almost
independent of the input data format. SumatraTT starts the modelling tool and
takes care of the data transfer.
5.1 User-friendly Environment
One of the reasons for re-engineering SumatraTT was the necessity to give the user
a friendlier way of communicating with the system. The most important criterion
for designing the graphical user interface is simplicity of use. The whole data
transformation is represented by a diagram on the screen, which consists of icons
corresponding to functional modules. Data flow is defined by drawing a line
between the corresponding icons. This simple, intuitive way of defining the
transformation process encourages the user to experiment with new ways of data
transformation and exploration. A screenshot of the SumatraTT GUI is shown in
figure 6.
Data understanding is a very important part of data pre-processing, and
visualisation is its essential part. The requirements have already been described in
section 2. Integration of the visual modules with the environment simplifies the
whole process. Interactive modules are especially useful, as they provide feedback
directly to the transformation schema. In addition, SumatraTT allows the user to
download modules and automates the development of documentation.
Installing a module from the central server is transparent to the user. Version
control is fully supported.
Automated documentation addresses an issue which, though useful, is not very
popular among project developers. On request, the SumatraTT system generates a
set of HTML pages describing the transformation schema at a very detailed level.
The user can include additional comments both on individual modules and on the
whole schema. This feature helps teams share and re-use knowledge about a
particular transformation.
In addition to interactive usage, SumatraTT can be run from the command line. It
also allows transformation tasks to be run according to a time schedule. SumatraTT
also has a simple interface enabling communication with other Java programs,
either as a master or as a slave. In slave mode, the master can dynamically define
the transformation schema.
5.2 Data Processing
Data pre-processing systems are often characterised in terms of the following
functionalities:
- input/output connectivity,
- support for frequently executed tasks,
- support for specialised tasks,
- availability of a scripting language.
In SumatraTT, these functionalities are supported as follows:
Input/output connectivity: All standard types of data sources are supported.
They include text files, SQL databases, DBF, and XML files; support for Excel
files is under development. Plain text files can either be structured by pre-defined
markers such as tabs, commas, etc., or the structure can be imposed by a wizard.
Access to SQL databases is implemented using JDBC; therefore, all major SQL
databases, such as Oracle, DB2, Informix, and MS SQL, are supported (a sketch of
this approach follows below).
Modules accessing different kinds of data allow seamless integration of
heterogeneous data sources.
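
The following sketch shows what a JDBC-backed input module does internally;
the driver URL, credentials, table, and column names are placeholders, not taken
from SumatraTT.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Reading records through JDBC, as an input module would do internally.
public class JdbcSourceSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:oracle:thin:@//dbhost:1521/warehouse";  // placeholder
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT tank_id, level FROM readings")) {
            while (rs.next()) {
                // Each row would become one record on the module's output channel.
                System.out.println(rs.getString(1) + " " + rs.getDouble(2));
            }
        }
    }
}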
In addition to the main data stream, some modules may generate metadata (e.g.
the first-touch review) or protocols. A specialised output module may transform
them either into HTML format or into Prolog clauses.
Modules for frequent tasks form the core of SumatraTT. They implement common
actions of data pre-processing, such as filter setup, introduction of new attributes,
selection of attributes, selection of records, statistics, numerical calculations, etc.
These modules have an intuitive interface and simple functionality.
Modules for specialised tasks address domain-specific needs. They are usually
implemented for a specific application (e.g. medical data processing) and have
only limited re-usability outside their domain.
Scripting languages are used to implement functionalities not supported by the
modules mentioned in the previous bullet points. The execution of scripting
languages is generally slower than that of their compiled counterparts; however,
the decrease in speed is fully compensated by the flexibility. Experiments will be
carried out with a Java interpreter (BeanShell), and possibly with JavaScript, Perl,
and Python.
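
As an illustration of such a scripting hook, this snippet embeds the BeanShell
interpreter (the bsh library is assumed to be on the classpath; the variable names
are made up).

import bsh.Interpreter;

// Embedding the BeanShell interpreter: a user-supplied expression runs
// against a record field exposed as a script variable.
public class ScriptHookSketch {
    public static void main(String[] args) throws Exception {
        Interpreter bsh = new Interpreter();
        bsh.set("level", 4.2);                    // expose a record field
        bsh.eval("flag = level > 4.0 ? 1 : 0;");  // user-supplied expression
        System.out.println("flag = " + bsh.get("flag"));
    }
}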
6 State of the Art of SumatraTT 2.0
The SumatraTT methodology supports three types of users:
- A "naive" user defines the data transformation by building complex
transformations from the existing set of available modules. This user relies
completely on module reuse and GUI guidance and does not need any knowledge
of programming.
- An advanced user is able to substitute missing modules with code in the
scripting language. The application will be complete, but components written in
the scripting language may slow down the execution.
- Instead of developing code in the scripting language, a module developer
implements a new SumatraTT module in Java and can make it available to the
community of naive and advanced users.
SumatraTT provides the basic functionality necessary for the development of
modules. The core is stable, powerful enough, and fast. Currently, the module
repository consists of twenty tested modules. In addition, approximately ten
modules are under development by various developers and will be made available
to the user community soon. The current state of the SumatraTT environment and
a tutorial for all user levels, including module developers, can be found at [9].
There are specialised groups already using SumatraTT. Their applications include
knowledge management, medical data processing, processing of telecommunication
data, and neural networks.
7 Sample Application
SumatraTT 1.0, which implements some of the ideas presented in this paper, was
used in several real applications in both the data mining and data warehousing
domains. Populating a water supply OLAP database with data may serve as an
example of a SumatraTT application. It was implemented as a part of the GOAL
(Geographic Information On-line Analysis) project supported by the CEC INCO
COPERNICUS programme [10].
The data described the state of water tanks in a water supply chain. Every ten
minutes, the water level in each tank was measured and converted into text files
with time marks. The application includes three independent data transformation
problems: transformation of the date and time format, data matrix transformation,
and data aggregation.
8 Conclusion
Although data pre-processing is a very important part of the data mining process,
applications usually do not use a systematic approach and methodology. The
SumatraTT project is an attempt to provide both a methodology and a supporting
technology for data pre-processing. In this paper we have described the main
philosophy and principles of building data transformation applications.
The community using SumatraTT is growing. We believe that this paper will
contribute to the dissemination of our results and provide user feedback for future
development. SumatraTT is available for academic purposes at [9].
Acknowledgement
The research is supported by the EC IST RTD project Enabling Communities of
Interest to Promote Heritage of European Regions ("CIPHER").
9 References
1. Dorian Pyle. Data Preparation for Data Mining. Morgan Kaufmann Publishers,
Inc., San Francisco, CA, USA, 1999.
2. Katharina Morik. The Representation Race – Preprocessing for Handling Time
Phenomena. In ECML, Lecture Notes in Computer Science, pages 4–19.
Springer, 2000. Invited talk.
3. www.perl.org.
4. www.r-project.org.
5. Angie Voss, Thomas Gärtner, and Steve Moyle. Zeno for Rapid Collaboration in
Data Mining Projects. In Christophe Giraud-Carrier, Nada Lavrač, and Steve
Moyle, editors, Integrating Aspects of Data Mining, Decision Support and
Meta-Learning, pages 69–80. ECML/PKDD'01 workshop notes, September 2001.
6. Pete Chapman, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas
Reinartz, Colin Shearer, and Rüdiger Wirth. CRISP-DM 1.0: Step-by-step data
mining guide. CRISP-DM consortium, 2000.
7. Petr Aubrecht, Filip Železný, Petr Mikšovský, and Olga Štěpánková. SumatraTT:
Towards a universal data preprocessor. In Cybernetics and Systems 2002,
volume II, pages 818–823, Vienna, 2002. Austrian Society for Cybernetics
Studies.
8. Petr Aubrecht and Zdeněk Kouba. Metadata Driven Data Transformation. In SCI
2001, volume I, pages 332–336. International Institute of Informatics and
Systemics and IEEE Computer Society, 2001.
9. SumatraTT Official Homepage, http://krizik.felk.cvut.cz/Sumatra.
10. Petr Mikšovský and Zdeněk Kouba. Application A2 Specification. Technical
report TR11, INCO–COPERNICUS 977091 GOAL, Czech Technical University,
Department of Cybernetics, Technická 2, Prague 6, 1999.