A Universal Data Pre-processing System
Petr AUBRECHT, Zdeněk KOUBA
Department of Cybernetics, Faculty of Electrical Engineering,
Czech Technical University in Prague
Technická 2, 160 00 Prague 6
{aubrech,kouba}@labe.felk.cvut.cz
Abstract. The main research in data warehousing and data mining is focused
on the efficiency of the core algorithms (search and induction). Unfortunately,
these algorithms depend on the quality of the input data; noisy and erroneous
data can make them useless. Before processing, data needs to be investigated
and pre-processed.
This paper specifies a set of requirements on a data pre-processing system
which should be satisfied in order to provide a complete and usable system.
These requirements were taken into account when designing and
implementing the SumatraTT pre-processing system at the Czech Technical
University in Prague.
Its development was motivated by the aim to apply ideas of pre-processing
[1, 2] in real applications. Currently, there is no relevant system widely
accepted and used by the data mining community. SumatraTT aspires to provide
a unified environment that will be welcomed by this community.
SumatraTT serves as a reference implementation of our ideas and as a
universal platform for developing add-on modules in various target areas of
interest – data mining, data warehousing, medical information processing, etc.
Keywords: data preprocessing, GUI, Sumatra
1 Practice (Critique of the Current State)
Nowadays, there exists a huge amount of data stored typically in company archives
or databases of production systems. Whereas researchers have paid enormous
attention to the development of efficient (relational) database management systems
and data mining algorithms, the step in the middle, data pre-processing, suffers
from a lack of theory and widely available tools or systems.
Our experience shows that in practice there are the following three ways of
carrying out data pre-processing:
Manually: Data is edited manually in a spreadsheet. This is the easiest case
from the user's perspective and can be the fastest approach for non-repeatable
processing. However, if an additional batch of data becomes available, there is no
way to repeat the process in order to add it to the processed dataset. In addition,
there is no way to verify the correctness of the data pre-processing process.
Program: A special dedicated single-purpose program is developed, usually
in a high-level language like awk, sed, or Perl [3] (originally administration tools in
Unix systems), or R [4] (more focused on statistical computing and graphics),
or alternatively in C++ or Pascal. This way also lacks comprehensible documentation,
and it is difficult to port the solution to a co-worker's environment even if it is
based on the same computational platform (availability of awk or sed scripts on
Windows, endless installation of interpreters, etc.).
System: Use of a specialised system is the most convenient way, even though
it is demanding: such a system is usually very expensive and, to be used
effectively, requires a well-trained user. The systems vary in the capabilities they
provide – from MS DTS (Data Transformation Services, which restricts the
definition of a data transformation to simple SQL queries or Basic scripts) to
Clementine (which is complex, very expensive, and focused exclusively on data
mining).
There are reasons for avoiding the first two ways. For example, collaborative
data mining [5] excludes editing raw data manually and discourages writing
single-purpose data transformation programs. Making use of a specialised system
can be an advantage, especially if the system provides self-documenting features
to some extent, even if limited to writing a simple textual description of the
transformation. This enables partners to re-use transformations developed by
others by repeating the steps described in the documentation.
The third way, using specialised systems, has many advantages. The only obstacle
to an almost seamless exchange of transformation definitions is that the respective
tools/systems are not commonly available. Such systems can support data
understanding, both visual and configuration-driven data processing, reporting,
and other features described further.
The conclusion of this state-of-the-art overview is that there is a need for an open,
widely available, and feature-rich data pre-processing system. Its preferred features
include easy manipulation to compete with the manual way, enough power to win
over the program way, and a rich palette of capabilities to outstrip the system way.
Such a system should also provide a platform for collaboration.
2 Steps of Data Pre-processing
There already exists a data mining methodology, CRISP [6] (see figure 1), which
explicitly involves data pre-processing as a distinguished part of the data mining
process. In our opinion, the methodology splits the process into relatively separate
steps, while in practice each of these steps is influenced by the others; in particular,
there is a very tight interaction between adjacent steps.
This paper focuses on the second and third steps: data understanding and data
preparation. We consider these steps tightly coupled, because data understanding
is an incremental process which progresses during data preparation. It is common
that errors are reported during the data preparation process. These reports enrich
the user's knowledge of the data and influence the further course of data
preparation. This part should be carried out iteratively.
Of course, the whole data pre-processing process can be divided into steps.
However, it is important to emphasise that all these steps are run in cycles rather
than as a sequence of isolated steps.
[Figure 1: Original and modified CRISP schema. Both diagrams show the phases
Business Understanding, Data Understanding, Data Preparation, Modelling,
Evaluation, and Deployment; the modified schema marks Data Understanding and
Data Preparation as the area of interest of a data pre-processing system.]
[Figure 2: Sequence of Available Visualisations: first-touch review, static,
interactive, advanced.]
Data pre-processing steps:
Understanding: The main contribution of data pre-processing is data
understanding. No sensible results can be achieved without a detailed exploration
of the received data.
Visualisation – the most valuable way of data understanding is visual exploration.
People are often said to receive about 95 % of their information visually, and
graphics can provide a huge amount of information at once, including comparisons
of various values. The following kinds of visual exploration offer increasing detail
of the investigated data – from simple, automatically generated reports, over
interactive tools providing dynamic views of the data, to advanced techniques
which require certain knowledge of the data; see figure 2.
- first-touch review – a graphical report with statistical values, graphs, and
histograms on a single HTML page, with thumbnails of a graph and a histogram
for each field. A summary graph reflecting all the fields together can be
included, too. This kind of visualisation provides the fastest look at the data
with no demands on the user, because it is generated automatically (a minimal
sketch follows this list).
- static – common types of visual representation include a wide range of graphs
(2D, 3D, combined, pie, histograms, error bars, etc.). There is a sufficient offer
of third-party tools (like R, gnuplot, etc.), including interactive ones
(spreadsheets, Mathematica, etc.).
- interactive – dynamic data visualisation. It allows investigating both mutual
dependencies between data attributes and the dependency of attributes on time.
The interactivity enables the user to change the point of view, the form of
presentation, etc. A good example is a classic 3D graph with the capability to
change the subset of records and/or attributes, to add sub-graphs on the screen,
or to animate a time series. It should be noted that interactive visualisation can
feed back to the user – e.g. to set up filtering conditions for a part of the
pre-processing schema.
- advanced – the last group comprises techniques expressing data in an unusual
way, like graphs with parallel axes, graphs in radial coordinates, matrix
diagrams, or multiline graphs.
Reports, statistics – a textual form is necessary for completeness. It can catch
exceptional values invisible in graphs (e.g. the y-axis scale of a graph can mask
an exceptional value of the first record). In addition, the reports can be enriched
by including graphics from the previous part.
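
As a toy illustration of the first-touch review just described, the following Java
sketch computes per-field summary statistics and renders them as a single HTML
page. The field names and sample values are invented; the real SumatraTT module
also adds graph and histogram thumbnails, which are omitted here.

import java.util.List;

// Toy sketch of a first-touch review: per-field summary statistics
// rendered as one HTML page. Field names and values are invented.
public class FirstTouchSketch {
    public static void main(String[] args) {
        String[] fields = {"level", "flow"};
        List<double[]> columns = List.of(
                new double[]{4.1, 4.8, 3.9},   // sample values for "level"
                new double[]{0.2, 0.7, 0.4});  // sample values for "flow"
        StringBuilder html = new StringBuilder("<html><body><table border='1'>\n")
                .append("<tr><th>field</th><th>min</th><th>max</th><th>mean</th></tr>\n");
        for (int i = 0; i < fields.length; i++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY, sum = 0;
            for (double v : columns.get(i)) {
                min = Math.min(min, v);
                max = Math.max(max, v);
                sum += v;
            }
            html.append(String.format("<tr><td>%s</td><td>%.2f</td><td>%.2f</td><td>%.2f</td></tr>%n",
                    fields[i], min, max, sum / columns.get(i).length));
        }
        System.out.println(html.append("</table></body></html>"));
    }
}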
Modification: It is necessary to provide a set of data processing modules
which are powerful enough to allow arbitrary modifications. Limiting the
capabilities significantly restricts the usability of the tool. The best way is to offer
simple modules for typical actions (access to data sources, subsets of attributes,
filters, calculations, rules for handling missing/wrong data, etc.) together with a
language for implementing algorithms that fulfil the very specific remaining
requirements.
Reverse-modification: A requirement may arise to carry out a reverse
transformation after completing data preparation and data modelling, bringing
the resulting data sets (e.g. predicted data) back to the original form (format).
Support for reverse-modification can range from automatic design (in simple
cases) to verification/proof of a hand-made transformation based on checking the
equality of the original data with that obtained by reverse modification of the
transformed data.
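
A minimal sketch of the verification idea: apply the forward transformation and
the hand-made reverse to every record and check that the original is reproduced.
The Row type and the scaling transform are hypothetical stand-ins, not SumatraTT
classes.

import java.util.List;
import java.util.function.UnaryOperator;

public class ReverseCheck {
    // Hypothetical record type standing in for a data row.
    record Row(String id, double value) {}

    // True if reverse(forward(r)) reproduces every original row.
    static boolean verifyReverse(List<Row> original,
                                 UnaryOperator<Row> forward,
                                 UnaryOperator<Row> reverse) {
        return original.stream()
                       .allMatch(r -> r.equals(reverse.apply(forward.apply(r))));
    }

    public static void main(String[] args) {
        UnaryOperator<Row> scale   = r -> new Row(r.id(), r.value() * 2);  // forward
        UnaryOperator<Row> unscale = r -> new Row(r.id(), r.value() / 2);  // hand-made reverse
        System.out.println(verifyReverse(List.of(new Row("t1", 4.2)), scale, unscale)); // true
    }
}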
The pre-processing system shall support data understanding and modification
by providing mutual interoperability of all the tools. Using advanced visualisation
modules, the user will navigate graphs more easily. Interoperability of the
visualisation and modification tools will enable the user to set up the parameters
of a filter, to be used later in the modification phase, directly within an interactive
histogram provided by the visualisation tool. Such interoperability significantly
speeds up the pre-processing procedure and makes it much more interesting and
valuable.
A universal data pre-processing system called SumatraTT, which is based on
the ideas declared in this section, has been designed by the authors. Most of its
features have already been implemented and are available.
3 SumatraTT 1.0
The first version of SumatraTT was developed in C++. It includes an interpreter of
a proprietary scripting language (SumatraScript). The language supports the use of
templates (see fig. 3). The templates contain SumatraScript code enriched by
additional macros. The next paragraphs summarise the old design and the reasons
for re-engineering SumatraTT.
The input to the SumatraTT 1.0 system consists of templates associated with
values to be substituted for the template arguments (parameters of the respective
data transformation).
By expanding the macros, this input is converted to an executable SumatraScript
program. There is a set of data-source drivers making it possible to access and
process data from a SumatraScript program. All of them follow a unified
programming interface.
[Figure 3: Schema of SumatraTT 1.0. A template plus metadata expands into a
SumatraScript program, which reads through a data-access input driver and writes
through a data-access output driver.]
The idea of templates was deprecated because it became hard to combine the
capabilities of multiple templates (e.g. filtering and adding a new attribute).
Another problem was inter-process communication between running
transformations (actually never fully implemented).
The new approach enables modules to exchange data with each other in a natural
way using Java 2 communication means. As the whole transformation runs in
several threads of the same JVM, mutual communication is easy even if the
transformation consists of several relatively independent tasks. This makes it
possible to share objects and send messages between individual tasks, as the
following sketch illustrates.
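
A minimal sketch, not the actual SumatraTT API: two module threads in the same
JVM exchange records through a shared bounded queue. The record strings and the
end-of-stream sentinel are illustrative (java.util.concurrent is used for brevity).

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    // Sentinel object marking the end of the record stream.
    static final String EOF = "<eof>";

    public static void main(String[] args) throws InterruptedException {
        // One bounded channel connecting two pipeline stages.
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(64);

        Thread source = new Thread(() -> {
            try {
                for (String rec : new String[]{"a;1", "b;2", "c;3"})
                    channel.put(rec);               // emit records
                channel.put(EOF);                   // signal end of stream
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread sink = new Thread(() -> {
            try {
                // Identity comparison with the shared sentinel object is intended.
                for (String rec = channel.take(); rec != EOF; rec = channel.take())
                    System.out.println("processed " + rec);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        source.start(); sink.start();
        source.join(); sink.join();
    }
}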
4 SumatraTT 2.0
SumatraTT 2.0 is a completely new version of the system [7, 8, 9]. Instead of using
a scripting language and templates, it uses modules and connections, see fig. 4.
The new design makes it possible to develop a rich set of modules. The modules
are stored in a tree structure, which organises them by their functionality. There is
a set of modules implementing external connectivity, mathematical calculations,
visualisation, etc.
Java 2 was selected as the implementation platform. This decision brought a
number of advantages, including independence of the operating system, seamless
integration of modules possibly running in different threads, easy integration with
other Java-based software, and the availability of both stable and emerging
standards (XML, CORBA modules, etc.).
SumatraTT makes heavy use of meta-data. It supports meta-data processing by an
extra channel added to each open data channel. Both the system and the modules
send commands (like run, stop, recordformat). There are also messages splitting
data into groups (groupbegin, groupend). Modules can send arbitrary messages,
and groups of modules can then communicate with each other by means of them.
In addition, every piece of data may optionally be accompanied by a meta-data
description, which includes information on data source reliability, known errors,
the definition of the data origin, the processing carried out so far, etc.
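
The command and group messages listed above suggest an interface of roughly the
following shape. This is an assumption made for illustration; the actual SumatraTT
type and method names are not given in the paper.

// Commands and group markers travel on a control channel parallel to the
// data stream. All names here are illustrative, not SumatraTT's own.
enum Control { RUN, STOP, RECORD_FORMAT, GROUP_BEGIN, GROUP_END }

interface Channel {
    void sendData(Object record);       // main data stream
    void sendControl(Control message);  // parallel meta-data channel
}

class GroupedWriter {
    // Emit a batch of records bracketed by group markers.
    static void writeGroup(Channel out, Object[] records) {
        out.sendControl(Control.GROUP_BEGIN);
        for (Object r : records) out.sendData(r);
        out.sendControl(Control.GROUP_END);
    }
}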
[Figure 4: SumatraTT 2.0: Module and Connection. Modules expose inputs and
outputs, are linked by connections carrying both data and metadata, and are driven
through processData calls.]
There are rumours about the slow execution of Java. This aspect of the Java
implementation of SumatraTT has been tested (without optimisation; the speed of
reading from files could be further improved by using memory-mapped files,
supported since Java 1.4). The experience is that reading and simple processing of
a 100 MB file takes about 40 seconds on a Pentium III at 800 MHz, which is
sufficient for any (at least academic) purpose. Such performance has been achieved
by a very careful design. Modules can run in their own threads; however, the
synchronisation overhead is too high in such a case. That is why sharing a thread
among multiple modules is supported in SumatraTT. In figure 5, only the data
source modules start threads. Another technique decreases the amount of data
stored in local copies by different modules. Modules are expected to modify only a
small part of the whole record; the unmodified part is only referenced, not copied.
[Figure 5: Thread Model]
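
The record-sharing technique just described can be pictured as a copy-on-write
wrapper: a derived record keeps a reference to the unmodified base and stores only
the fields it overrides. This is an illustrative sketch, not the actual SumatraTT
record classes.

import java.util.HashMap;
import java.util.Map;

// A derived record references the shared, unmodified base record and
// stores only the fields it overrides.
class DerivedRecord {
    private final Map<String, Object> base;                      // shared, never copied
    private final Map<String, Object> overrides = new HashMap<>();

    DerivedRecord(Map<String, Object> base) { this.base = base; }

    Object get(String field) {
        return overrides.containsKey(field) ? overrides.get(field) : base.get(field);
    }

    void set(String field, Object value) { overrides.put(field, value); }
}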
5 Modular Architecture
Modularity is the main advantage of the new system design. Modules are expected
to provide a small piece of functionality, which is carried out in a rigorous way.
This is similar to the idea of Unix programs, which can be arbitrarily combined to
manage a complex task. Simple modules are easier both to learn and to use. The
clearly defined functionality of a module brings an extra advantage: it is possible
to (semi-)formally describe the functionality of a module. This makes it possible to
create an ontology of modules, which will simplify the task of selecting the most
appropriate module when designing a particular data transformation. Complex
modules, in contrast, tend to fail in singular (unexpected) cases.
All modules are held separately in a directory tree (identical to the tree of
modules) as JAR files. The user can add and remove modules simply by adding and
removing files. A shared repository is also under development, which will provide
version management of modules.
SumatraTT provides an open architecture. Modules have to implement a simple
interface to be ready to participate in a data transformation. To simplify module
creation even more, the SumatraTT core provides four base classes as starting
points for the design of a new module. Such a new module is simply implemented
as a class inheriting from one of the base ones. Many modules included in the
standard SumatraTT distribution consist of less than 10 lines of source code.
This simplicity encourages advanced users to create their own modules for
specific purposes. It is easy to implement methods as SumatraTT modules and then
incorporate them into a larger data transformation, as sketched below.
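
The paper does not name the four base classes, so the following sketch invents a
hypothetical one to show the scale involved; the derived module fits well under
the ten-line mark quoted above.

// Hypothetical base class standing in for one of the four SumatraTT
// base classes (their real names are not given in the paper).
abstract class SimpleTransformModule {
    protected abstract Object processField(Object value);
}

// A complete module then fits comfortably under ten lines:
class UppercaseModule extends SimpleTransformModule {
    @Override
    protected Object processField(Object value) {
        // Upper-case every string field, pass everything else through.
        return (value instanceof String s) ? s.toUpperCase() : value;
    }
}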
Integration with third-party tools is becoming more and more important. As a pilot
project, we implemented a SumatraTT module providing data connectivity to a
neural network based data modelling software package. By replacing the data input
layer of the package with a SumatraTT module, the package became almost
independent of the input data format. SumatraTT starts the modelling tool and
takes care of the data transfer.
5.1 User-friendly Environment
One of the reasons for re-engineering SumatraTT was the necessity to give the user
a friendlier way of communicating with the system. The most important criterion
for designing the graphical user interface is simplicity of use. The whole data
transformation is represented by a diagram on the screen, which consists of icons
corresponding to functional modules. Data flow is defined by drawing a line
between the corresponding icons. This simple, intuitive way of defining the
transformation process encourages the user to experiment with new ways of data
transformation and exploration. A screenshot of the SumatraTT GUI is shown in
figure 6.
Data understanding is a very important part of data pre-processing, and
visualisation is its essential part. The requirements have already been described in
section 2. Integration of the visual modules with the environment simplifies the
whole process. Interactive modules are especially useful, as they provide feedback
directly to the transformation schema. In addition, SumatraTT allows the user to
download modules and automates the development of documentation.
Installing a module from the central server is transparent to the user. Version
control is fully supported.
Automated documentation addresses an issue which, though useful, is not very
popular among project developers. On request, the SumatraTT system generates a
set of HTML pages describing the transformation schema at a very detailed level.
The user can include additional comments both on individual modules and on the
whole schema. This feature helps teams share and re-use knowledge about a
particular transformation.
In addition to interactive usage, SumatraTT can be run from the command line. It
also allows transformation tasks to be run according to a time schedule. SumatraTT
also has a simple interface enabling communication with other Java programs,
either as a master or as a slave. In slave mode, the master can dynamically define
the transformation schema.
5.2 Data Processing
Data pre-processing systems are often characterised in terms of the following
functionalities:
- input/output connectivity,
- support for frequently executed tasks,
- support for specialised tasks,
- availability of a scripting language.
In SumatraTT, these functionalities are supported as follows:
Input/output connectivity: All standard types of data sources are supported.
They include text files, SQL databases, DBF, and XML files; support for Excel
files is under development. Plain text files can either be structured by pre-defined
markers such as tabs, commas, etc., or the structure can be imposed by a wizard.
Access to SQL databases is implemented using JDBC; therefore, all major SQL
databases, such as Oracle, DB2, Informix, and MS SQL, are supported (a sketch of
this approach follows below).
Modules accessing different kinds of data allow seamless integration of
heterogeneous data sources.
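
The following sketch shows what a JDBC-backed input module does internally;
the driver URL, credentials, table, and column names are placeholders, not taken
from SumatraTT.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Reading records through JDBC, as an input module would do internally.
public class JdbcSourceSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:oracle:thin:@//dbhost:1521/warehouse";  // placeholder
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT tank_id, level FROM readings")) {
            while (rs.next()) {
                // Each row would become one record on the module's output channel.
                System.out.println(rs.getString(1) + " " + rs.getDouble(2));
            }
        }
    }
}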
In addition to the main data stream, some modules may generate metadata (e.g.
the first-touch review) or protocols. A specialised output module may transform
them either into HTML format or into Prolog clauses.
Modules for frequent tasks form the core of SumatraTT. They implement common
actions of data pre-processing, such as filter setup, introduction of new attributes,
selection of attributes, selection of records, statistics, numerical calculations, etc.
These modules have an intuitive interface and simple functionality.
Modules for specialised tasks address domain-specific needs. They are usually
implemented for a specific application (e.g. medical data processing) and have
only limited re-usability outside their domain.
Scripting languages are used to implement functionalities not supported by the
modules mentioned in the previous bullet points. The execution of scripting
languages is generally slower than that of their compiled counterparts; however,
the decrease in speed is fully compensated by the flexibility. Experiments will be
carried out with a Java interpreter (BeanShell), and possibly with JavaScript, Perl,
and Python.
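
As an illustration of such a scripting hook, this snippet embeds the BeanShell
interpreter (the bsh library is assumed to be on the classpath; the variable names
are made up).

import bsh.Interpreter;

// Embedding the BeanShell interpreter: a user-supplied expression runs
// against a record field exposed as a script variable.
public class ScriptHookSketch {
    public static void main(String[] args) throws Exception {
        Interpreter bsh = new Interpreter();
        bsh.set("level", 4.2);                    // expose a record field
        bsh.eval("flag = level > 4.0 ? 1 : 0;");  // user-supplied expression
        System.out.println("flag = " + bsh.get("flag"));
    }
}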
6 State of the Art of SumatraTT 2.0
The SumatraTT methodology supports three types of users:
- A "naive" user defines the data transformation by building complex
transformations from the existing set of available modules. This user relies
completely on module reuse and GUI guidance and does not need any knowledge
of programming.
- An advanced user is able to substitute missing modules with code in the
scripting language. The application will be complete, but components written in
the scripting language may slow down the execution.
- Instead of developing code in the scripting language, a module developer
implements a new SumatraTT module in Java and can make it available to the
community of naive and advanced users.
SumatraTT provides the basic functionality necessary for the development of
modules. The core is stable, powerful enough, and fast. Currently, the module
repository consists of twenty tested modules. In addition, approximately ten
modules are under development by various developers and will be made available
to the user community soon. The current state of the SumatraTT environment and
a tutorial for all user levels, including module developers, can be found at [9].
There are specialised groups already using SumatraTT. Their applications include
knowledge management, medical data processing, processing of telecommunication
data, and neural networks.
7 Sample Application
SumatraTT 1.0, which implements some of the ideas presented in this paper, was
used in several real applications in both the data mining and data warehousing
domains. Populating a water supply OLAP database with data may serve as an
example of a SumatraTT application. It was implemented as a part of the GOAL
(Geographic Information On-line Analysis) project supported by the CEC INCO
COPERNICUS programme [10].
The data described the state of water tanks in a water supply chain. Every ten
minutes, the water level in each tank was measured and converted into text files
with time marks. The application includes three independent data transformation
problems: transformation of the date and time format, data matrix transformation,
and data aggregation.
8 Conclusion
Although data pre-processing is a very important part of the data mining process,
applications usually do not use a systematic approach and methodology. The
SumatraTT project is an attempt to provide both a methodology and a supporting
technology for data pre-processing. In this paper we have described the main
philosophy and principles of building data transformation applications.
The community using SumatraTT is growing. We believe that this paper will
contribute to the dissemination of our results and provide user feedback for future
development. SumatraTT is available for academic purposes at [9].
Acknowledgement
The research is supported by the EC IST RTD project Enabling Communities of
Interest to Promote Heritage of European Regions ("CIPHER").
9 References
1. Dorian Pyle. Data Preparation for Data Mining. Morgan Kaufmann Publishers,
Inc., San Francisco, CA, USA, 1999.
2. Katharina Morik. The Representation Race – Preprocessing for Handling Time
Phenomena. In ECML, Lecture Notes in Computer Science, pages 4–19.
Springer, 2000. Invited talk.
3. www.perl.org.
4. www.r-project.org.
5. Angie Voss, Thomas Gärtner, and Steve Moyle. Zeno for Rapid Collaboration in
Data Mining Projects. In Christophe Giraud-Carrier, Nada Lavrač, and Steve
Moyle, editors, Integrating Aspects of Data Mining, Decision Support and
Meta-Learning, pages 69–80. ECML/PKDD'01 workshop notes, September 2001.
6. Pete Chapman, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas
Reinartz, Colin Shearer, and Rüdiger Wirth. CRISP-DM 1.0: Step-by-step data
mining guide. CRISP-DM consortium, 2000.
7. Petr Aubrecht, Filip Železný, Petr Mikšovský, and Olga Štěpánková. SumatraTT:
Towards a universal data preprocessor. In Cybernetics and Systems 2002,
volume II, pages 818–823, Vienna, 2002. Austrian Society for Cybernetics
Studies.
8. Petr Aubrecht and Zdeněk Kouba. Metadata Driven Data Transformation. In SCI
2001, volume I, pages 332–336. International Institute of Informatics and
Systemics and IEEE Computer Society, 2001.
9. SumatraTT Official Homepage, http://krizik.felk.cvut.cz/Sumatra.
10. Petr Mikšovský and Zdeněk Kouba. Application A2 Specification. Technical
report TR11, INCO–COPERNICUS 977091 GOAL, Czech Technical University,
Department of Cybernetics, Technická 2, Prague 6, 1999.