EXTENDING UML FOR MODELLING OF DATA MINING CASES

Prof. LADISLAV BURITA – VOJTECH ONDRYHAL

The article describes a possible approach to modelling data mining cases. The aim of the paper is to show how a standard modelling language can be used for the data mining process, so that data mining work stays compatible with other projects based on an incremental approach, especially those using the Unified Process and the UML language. Common UML elements such as use cases, classes, interfaces, components and nodes can be specialized by extension mechanisms, including stereotypes and named (tagged) values. A new set of UML elements that covers the whole project lifecycle is provided and described for the data mining process. An example of this approach is the data mining model element, which extends the UML element class with new named values such as input data, output data and model parameters. UML schemas, syntax, semantics and usage examples for the new elements are included in the paper.
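The extension mechanism mentioned in the abstract can be illustrated with a small sketch. It is not the authors' notation or tooling: the class names `Stereotype` and `ModelElement`, the stereotype name `DataMiningModel`, and the sample element `ChurnClassifier` are assumptions made for illustration. Only the idea of a class extended by named values such as input data, output data and model parameters comes from the text.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class Stereotype:
    """A UML stereotype: a named specialization of a base UML element."""
    name: str                           # hypothetical name, e.g. "DataMiningModel"
    base_element: str                   # UML element being extended, e.g. "Class"
    allowed_tags: Tuple[str, ...] = ()  # named (tagged) values the stereotype defines


@dataclass
class ModelElement:
    """A model element carrying a stereotype and its named values."""
    name: str
    stereotype: Stereotype
    tagged_values: Dict[str, Any] = field(default_factory=dict)

    def set_tag(self, tag: str, value: Any) -> None:
        # Only named values declared by the stereotype may be set.
        if tag not in self.stereotype.allowed_tags:
            raise KeyError(f"{tag!r} is not defined by <<{self.stereotype.name}>>")
        self.tagged_values[tag] = value

    def render(self) -> str:
        # Textual stand-in for the graphical UML notation.
        tags = ", ".join(f"{k}={v}" for k, v in sorted(self.tagged_values.items()))
        return f"<<{self.stereotype.name}>> {self.name} {{{tags}}}"


# Assumed example: a data mining model element extending the UML class.
dm_model = Stereotype(
    "DataMiningModel", "Class", ("input_data", "output_data", "model_parameters")
)
churn = ModelElement("ChurnClassifier", dm_model)
churn.set_tag("input_data", "CustomerDataSet")
churn.set_tag("output_data", "ChurnScore")
print(churn.render())
```

Restricting `set_tag` to the declared names mirrors the way a stereotype fixes which named values an element may carry; a modelling tool would enforce the same rule through its profile definition.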
Development Process Methodology

A methodology formally defines the process that you use to gather requirements, analyze them, and design an application that meets them. There are many methodologies, each differing in some way from the others, and there are many reasons why one methodology may be better than another for a particular project. For example, some are better suited to large enterprise applications, while others are built to design small embedded or safety-critical systems. Some methods better support large numbers of architects and designers working on the same project, while others work better when used by one person or a small group.

VÉDELMI INFOKOMMUNIKÁCIÓ

Unified Process and UML Language

The Unified Process and UML (Unified Modelling Language) are quickly becoming the de facto standards for the development process (software development methodology) within the object-oriented and component-based software communities.

"The Unified Modelling Language (UML) is a graphical language for visualizing, specifying, constructing, and documenting the artefacts of a software-intensive system. The UML offers a standard way to write a system's blueprints, including conceptual things such as business processes and system functions as well as concrete things such as programming language statements, database schemas, and reusable software components." [www-uml]

Figure 1 [RUP2000] shows the key concepts of the Rational Unified Process (RUP). The aim of the research is to reuse those concepts in building a data mining methodology.

Figure 1 Key Concepts of Rational Unified Process

Data mining development process methodologies

In the data mining world, we can recognize several methodologies for data mining projects. These are usually tightly connected with software producers such as SAS, SPSS, Oracle or Microsoft.
Among these approaches, CRISP-DM is probably the leading industry-independent methodology. The whole process is described by a four-level hierarchical process model consisting of sets of tasks at the following levels: phase, generic task, specialized task, process instance. Figure 2 shows the common representation of a data mining project based on CRISP-DM. The data lies at the centre of the process.

Figure 2 Project lifecycle according to CRISP-DM methodology

Integration

In a project where data mining technology is only part of the whole solution, an integrated environment has to be set up. The Unified Process and UML, as already mentioned, provide an environment that is already accepted within software development communities. The next part of the article introduces a possible approach to integrating data mining cases into corporate projects. All the main phases have been refactored, and models have been created according to the Unified Process guidelines. The following changes and additions have been made to the CRISP-DM methodology:

- Roles were introduced. A role is not explicitly defined in CRISP-DM. Roles help to assign responsibilities properly to persons. For example, the role Data Analyst is required in the Data Understanding workflow.
- Outputs and products of the phases have been transformed into artefacts.
- The number of independent deliverables was significantly reduced. Outputs from tasks were integrated and a list of suggested documents was created.
- Templates in HTML and RTF formats were defined for all documents.
- A modelling tool (Enterprise Architect) was used to model the data mining process. Subsequent documentation can be generated from such a tool, which unifies the outputs.
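The refactored process model (workflows with assigned roles and artefacts) can be sketched as a plain data structure. The workflow, role and artefact names below are read from the paper's process model and workflow diagrams; the `Workflow` class and the helper functions are illustrative assumptions, not part of the methodology or of Enterprise Architect.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Workflow:
    """One phase of the data mining process with its roles and artefacts."""
    name: str
    roles: Tuple[str, ...]
    artefacts: Tuple[str, ...]


# Names taken from the paper's diagrams; the structure itself is a sketch.
PROCESS_MODEL: List[Workflow] = [
    Workflow("Business Understanding",
             ("Business Analyst", "Project Manager"),
             ("Vision", "Requirements", "Terminology", "Project Plan")),
    Workflow("Data Understanding",
             ("Data Analyst",),
             ("Data Analysis Report",)),
    Workflow("Data Preparation",
             ("Data Designer",),
             ("Data Set", "Data Set Description")),
    Workflow("Modelling",
             ("Data Mining Engineer",),
             ("Test Design", "Model Description",
              "Model Parameters Settings", "Model")),
    Workflow("Evaluation",
             ("Business Analyst", "Project Manager", "Quality Insurance Manager"),
             ("Evaluation Report", "Final Report")),
    Workflow("Deployment",
             ("Deployment Manager", "Project Manager"),
             ("Deployment Plan", "Monitoring And Maintenance Plan")),
]


def workflows_for_role(role: str) -> List[str]:
    """List the workflows, in process order, in which a role takes part."""
    return [w.name for w in PROCESS_MODEL if role in w.roles]


def all_artefacts() -> List[str]:
    """Flatten the artefacts of all workflows into one list of deliverables."""
    return [a for w in PROCESS_MODEL for a in w.artefacts]


print(workflows_for_role("Project Manager"))
print(len(all_artefacts()), "artefacts in the process model")
```

Keeping roles and artefacts as data makes it straightforward to answer staffing questions (which workflows need a given role) and to generate a checklist of deliverables per phase.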
Figure 3 Data Mining Process Model Overview (package Data Mining Process Model, version 1.0, author Vojtěch Ondryhal). Packages (phases) and their deliverables:
+ Business Understanding: Vision, Requirements, Terminology, Project Plan
+ Data Understanding: Data Analysis Report
+ Data Preparation: Data Set, Data Set Description
+ Modelling: Test Design, Model Description, Model Parameters Settings, Model
+ Evaluation: Evaluation Report, Final Report
+ Deployment: Deployment Plan, Monitoring And Maintenance Plan

Business understanding

The artefacts produced during this work are:

- The vision document provides a first insight into the project. It includes the following parts: background, business objectives, business success criteria, inventory of resources, risks and contingencies, costs and benefits.
- The requirements document includes requirements, assumptions and constraints, data mining goals and data mining success criteria.
- The terminology repository (in the form of a document or a model glossary in a tool) contains the relevant business terminology and data mining terminology.
- The project plan document, for example in the form of a Gantt chart. The plan lists stages, duration, resources, inputs, outputs and dependencies, including an initial assessment of tools and techniques.

Figure 4 Business understanding workflow detail (roles: Business Analyst, Project Manager; activities: Determine Business Objectives, Assess Situation, Determine Data Mining Goal, Produce Project Plan; artefacts: Vision, Terminology, Requirements, Project Plan)

Data understanding

The artefact produced in this phase is the Data Analysis Report document, which contains a report on the initial data collection, a description of the data, and reports on data exploration and data quality.
Figure 5 Data understanding workflow detail (role: Data Analyst; activities: Collect Initial Data, Describe Data, Explore Data, Verify Data Quality; artefact: Data Analysis Report)

Data preparation

This phase creates the data sets that will be used for modelling in the next phases. Each activity displayed in Figure 6 provides a chapter in the Data Set Description document. The Data Set contains real data prepared as an input for modelling. The data are properly selected and cleaned; where necessary, new data items are constructed; and the data are merged and formatted.

Figure 6 Data preparation workflow detail (role: Data Designer; activities: Select Data, Clean Data, Construct Data, Integrate Data, Format Data; artefacts: Data Set Description, Data Set)

Modelling

At the start of the workflow, tests are created for model validation, training and testing. The model itself is run on the prepared data set to produce results. The Model Parameters Settings document lists the parameters required by the model and their values. A model usually behaves differently for different sets of values, so all variants of the settings should be captured and described.

Figure 7 Modelling workflow detail (role: Data Mining Engineer; activities: Select Modelling Technique, Generate Test Design, Build Model, Assess Model; artefacts: Test Design, Model Description, Model Parameters Settings, Model, Data Set)

Evaluation

The evaluation report indicates how well the results meet the business criteria defined in the Business Understanding phase. During evaluation, models are approved or rejected.
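Capturing every variant of the model parameter settings and then approving or rejecting models against a success criterion can be sketched as follows. This is a minimal illustration under stated assumptions: the `ModelVariant` class, the parameter names, the scores and the threshold are all hypothetical, and the paper does not prescribe a single numeric criterion.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ModelVariant:
    """One captured variant of the model parameter settings with its result."""
    name: str
    parameters: Dict[str, float]  # hypothetical parameter names and values
    score: float                  # assumed quality measure from the test design


def evaluate(variants: List[ModelVariant], success_threshold: float) -> Dict[str, str]:
    """Approve variants whose score meets the success criterion, reject the rest."""
    return {
        v.name: "approved" if v.score >= success_threshold else "rejected"
        for v in variants
    }


# Hypothetical variants; the values are made up for illustration only.
variants = [
    ModelVariant("tree-shallow", {"max_depth": 3, "min_leaf": 50}, score=0.71),
    ModelVariant("tree-deep", {"max_depth": 10, "min_leaf": 5}, score=0.83),
]

for name, decision in evaluate(variants, success_threshold=0.80).items():
    print(f"{name}: {decision}")
```

Recording the parameters next to each result keeps the settings document and the evaluation report consistent: every decision can be traced back to the exact variant that produced it.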
Figure 8 Evaluation phase workflow detail (roles: Business Analyst, Project Manager, Quality Insurance Manager; activities: Evaluate Results, Review Process, Determine Next Steps; artefacts: Evaluation Report, Final Report)

The final report contains a review of the whole process and checks whether all required activities have been finished. It also includes a list of possible actions in the project and the decisions on these actions.

Deployment

Deployment is the last workflow in the data mining development process. Deployment packages and a deployment plan for the target environment are created. The monitoring and maintenance plan defines the method of day-to-day result checking in order to assure the correctness of the produced results.

Figure 9 Deployment phase workflow details (roles: Deployment Manager, Project Manager; activities: Plan Deployment, Plan Monitoring And Maintenance, Review Project; artefacts: Deployment Plan, Monitoring And Maintenance Plan, Final Report)

Conclusion

The paper introduced a possible approach to modelling data mining cases based on UML and CRISP-DM. It provides an insight into more detailed work that includes a detailed description of deliverables, templates and examples. The methodology is based on prototypes that were tried out at the Communication and Information Systems Department of the University of Defence. The advantage of this approach is the unification of project administration (templates, work descriptions, etc.) with other development projects.

References

1. [www-ea] Enterprise Architect web site. http://www.sparxsystems.com.au/
2. [BOHTH05] Buřita L., Ondryhal V., Hodický J., Trunda M., Hlaváček M., Information Systems, University of Defence, 2005, U-3099 [in Czech language]
3. [CD01] CRISP-DM, Step by Step Data Mining Guide v. 1.0, CRISP-DM Consortium, http://www.crisp-dm.org/
4. [RUP2000] Rational Unified Process 2000, online documentation
5. [www-vo] Web pages of the author. http://dcs.unob.cz/~Vojtech.Ondryhal/ [in Czech language]
6. [www-uml] Unified Modelling Language Resource Page. http://www.uml.org/