EXTENDING UML FOR MODELLING OF DATA MINING CASES
Prof. LADISLAV BURITA–VOJTECH ONDRYHAL
The article describes a possible approach to modelling data mining cases. The aim of the
paper is to describe how a standard modelling language can be used for the data mining
process in order to achieve compatibility with other projects based on an incremental
approach, especially those using the Unified Process and UML. Common UML elements such as
use cases, classes, interfaces, components and nodes can be specialized by extension
mechanisms, including stereotypes and named values. A new set of UML elements covering
the whole project lifecycle is provided and described for the data mining process. One
example of this approach is the data mining model element, which extends the UML class
element with new named values such as input data, output data and model parameters.
UML schemas, syntax, semantics and usage examples for the new elements are included
in the paper.
The article describes one possible way of modelling data mining cases. Its aim is to
present the possibility of using general-purpose modelling languages for data mining
processes, thereby preserving compatibility with other projects, especially those
that use the Unified Process and the UML language. Common UML elements,
such as use cases, classes, components and nodes,
can be specialized by an extension mechanism comprising stereotypes and named values.
The resulting new set of UML elements is recommended for data mining processes that cover
the whole project lifecycle. As an example of this approach, consider a data mining model element that extends a UML class with new named values,
which may be input and output data, parameters, etc. The article contains the UML schemas
and syntax for the new elements, as well as examples of their use.
Development Process Methodology
A methodology formally defines the process that you use to gather requirements, analyze them, and design an application that meets them.
There are many methodologies, each differing in some way or ways
from the others. There are many reasons why one methodology may be
better than another for any particular project: For example, some are
better suited for large enterprise applications while others are built to
design small embedded or safety-critical systems. Some methods better
support large numbers of architects and designers working on the same
project, while others work better when used by one person or a small
group.
VÉDELMI INFOKOMMUNIKÁCIÓ
Unified Process and UML Language
The Unified Process and UML (Unified Modelling Language) are
quickly becoming the de facto standards for the development process (software development methodology) within the object-oriented and component-based software communities.
"The Unified Modelling Language (UML) is a graphical language for
visualizing, specifying, constructing, and documenting the artefacts of a
software-intensive system. The UML offers a standard way to write a system's blueprints, including conceptual things such as business processes
and system functions as well as concrete things such as programming language statements, database schemas, and reusable software components." [www-uml]
Figure 1 [RUP2000] shows the key concepts of the Rational Unified Process (RUP). The aim of the research is to reuse
those concepts in building a data mining methodology.
Figure 1 Key Concepts of Rational Unified Process
Data mining development process methodologies
In the data mining world, several methodologies for data mining projects
can be recognized. They are usually tightly connected with software
vendors such as SAS, SPSS, Oracle or Microsoft. Among
these approaches, the CRISP-DM methodology is probably the leader in the
field of industry-independent methodologies. The whole process is described by a four-level hierarchical process model consisting of sets of
tasks at the following levels: phase, generic task, specialized task and process instance.
Figure 2 shows the common representation of a data mining project
based on CRISP-DM. The data lie at the centre of the process.
Figure 2 Project lifecycle according to CRISP-DM methodology
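The four-level hierarchy described above can be sketched as nested data classes. This is only an illustrative sketch, not part of the CRISP-DM guide or of the authors' model: the class names and the `count_levels` helper are assumptions, and the specialized task and process instance in the example are hypothetical. Only the phase name and the generic task names are taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProcessInstance:
    """Level 4: record of actions and results in one concrete project."""
    note: str


@dataclass
class SpecializedTask:
    """Level 3: how a generic task is carried out in a specific situation."""
    name: str
    instances: List[ProcessInstance] = field(default_factory=list)


@dataclass
class GenericTask:
    """Level 2: a task general enough to apply to any data mining project."""
    name: str
    specialized: List[SpecializedTask] = field(default_factory=list)


@dataclass
class Phase:
    """Level 1: one phase of the data mining project lifecycle."""
    name: str
    tasks: List[GenericTask] = field(default_factory=list)


def count_levels(phases: List[Phase]) -> Dict[str, int]:
    """Count how many elements the model holds at each of the four levels."""
    generic = [t for p in phases for t in p.tasks]
    specialized = [s for t in generic for s in t.specialized]
    instances = [i for s in specialized for i in s.instances]
    return {
        "phase": len(phases),
        "generic task": len(generic),
        "specialized task": len(specialized),
        "process instance": len(instances),
    }


# Hypothetical content: only the phase and generic task names come from the paper.
preparation = Phase(
    "Data Preparation",
    [
        GenericTask(
            "Clean Data",
            [
                SpecializedTask(
                    "Replace missing values",  # assumed example
                    [ProcessInstance("Missing ages replaced by the median")],  # assumed
                )
            ],
        ),
        GenericTask("Format Data"),
    ],
)

print(count_levels([preparation]))
```

Keeping each level as its own type makes it explicit that a process instance always belongs to a specialized task, which in turn refines a generic task of one phase.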
Integration
In a project where data mining technology is only part of a whole solution, an integrated environment has to be set up. The Unified Process and UML,
as already mentioned, provide an environment already accepted within
software development communities. The next part of the article introduces a possible approach to integrating data mining cases into corporate projects.
All the main phases have been refactored, and models have been created according to
the Unified Process guidelines. The following changes
and additions have been made to the CRISP-DM methodology:
• Roles were introduced. A role is not explicitly defined in CRISP-DM. Roles help to assign responsibilities to persons properly.
For example, the role Data Analyst is required in the Data Understanding
workflow.
• Outputs and products of phases have been transformed into artefacts.
• The number of independent deliverables has been significantly reduced. Outputs from tasks were integrated, and a list of suggested documents
has been created. Templates in HTML and RTF formats were defined
for all documents.
• A modelling tool (Enterprise Architect) was used to model the data mining process. Subsequent documentation can be
generated from such a tool, which unifies the output.
[Figure 3: package diagram "Process model packages (phases)", package Data Mining Process Model, version 1.0, author Vojtěch Ondryhal. Phases and their deliverables:
Business Understanding: Vision, Requirements, Terminology, Project Plan
Data Understanding: Data Analysis Report
Data Preparation: Data Set, Data Set Description
Modelling: Test Design, Model Description, Model Parameters Settings, Model
Evaluation: Evaluation Report, Final Report
Deployment: Deployment Plan, Monitoring And Maintenance Plan]
Figure 3 Data Mining Process Model Overview
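The phases and deliverables of Figure 3 can also be written down as a simple lookup table, for example to check which phase owns a given document. The sketch below is an assumption about how such a table might look in code; the names `DELIVERABLES` and `phase_of` are hypothetical, while the phase and deliverable names follow the paper.

```python
from typing import Dict, List

# Phase -> deliverables, following the process model overview in Figure 3.
DELIVERABLES: Dict[str, List[str]] = {
    "Business Understanding": ["Vision", "Requirements", "Terminology", "Project Plan"],
    "Data Understanding": ["Data Analysis Report"],
    "Data Preparation": ["Data Set", "Data Set Description"],
    "Modelling": ["Test Design", "Model Description", "Model Parameters Settings", "Model"],
    "Evaluation": ["Evaluation Report", "Final Report"],
    "Deployment": ["Deployment Plan", "Monitoring And Maintenance Plan"],
}


def phase_of(deliverable: str) -> str:
    """Return the phase that lists the given deliverable."""
    for phase, items in DELIVERABLES.items():
        if deliverable in items:
            return phase
    raise KeyError(f"unknown deliverable: {deliverable}")


print(phase_of("Data Set Description"))
```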
Business understanding
The artefacts produced during this workflow are:
• The Vision document provides a first insight into the project. It includes
the following parts: background, business objectives, business
success criteria, inventory of resources, risks and contingencies,
costs and benefits.
• The Requirements document includes requirements, assumptions and
constraints, data mining goals and data mining success criteria.
• The Terminology repository (in the form of a document or a model glossary in
a tool) holds relevant business terminology and data mining terminology.
• The Project Plan document, for example in the form of a Gantt chart. The
plan lists stages, duration, resources, inputs, outputs and dependencies, including an initial assessment of tools and techniques.
[Figure 4: workflow detail diagram, package Business Understanding, version 1.0, author Vojtěch Ondryhal. Roles: Business Analyst, Project Manager. Activities: Determine Business Objectives, Assess Situation, Determine Data Mining Goal, Produce Project Plan. Artefacts: Vision, Terminology, Requirements, Project Plan.]
Figure 4 Business understanding workflow detail
Data understanding
The artefact produced in this phase is the Data Analysis Report document,
which contains a report on the initial data collection, a description of the data, and reports
on data exploration and data quality.
[Figure 5: workflow detail diagram, package Data Understanding, version 1.0, author Vojtěch Ondryhal. Role: Data Analyst. Activities: Collect Initial Data, Describe Data, Explore Data, Verify Data Quality. Artefact: Data Analysis Report.]
Figure 5 Data understanding workflow detail
Data preparation
This phase creates the data sets that will be used for
modelling in the next phases. Each activity displayed in Figure 6 provides a chapter in the
Data Set Description document. The Data Set contains real data prepared as
an input for modelling. The data are properly selected and cleaned, new data items are constructed where needed, and the data are merged and formatted.
[Figure 6: workflow detail diagram, package Data Preparation, version 1.0, author Vojtěch Ondryhal. Role: Data Designer. Activities: Select Data, Clean Data, Construct Data, Integrate Data, Format Data. Artefacts: Data Set, Data Set Description.]
Figure 6 Data preparation workflow detail
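The five activities of the data preparation workflow can be illustrated with a minimal sketch in plain Python. Everything concrete here is an assumption made for illustration: the record fields (`id`, `income`, `household`), the derived item `income_per_person`, the `regions` lookup and the function names do not come from the paper. Each function corresponds to one activity and would, in the methodology, be documented in its own chapter of the Data Set Description.

```python
from typing import Dict, List

Record = Dict[str, object]


def select_data(rows: List[Record], columns: List[str]) -> List[Record]:
    """Select Data: keep only the columns needed for modelling."""
    return [{c: r.get(c) for c in columns} for r in rows]


def clean_data(rows: List[Record]) -> List[Record]:
    """Clean Data: drop records that contain a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def construct_data(rows: List[Record]) -> List[Record]:
    """Construct Data: derive a new data item from existing ones."""
    return [{**r, "income_per_person": r["income"] / r["household"]} for r in rows]


def integrate_data(rows: List[Record], regions: Dict[int, str]) -> List[Record]:
    """Integrate Data: merge in a second source, matched on the record id."""
    return [{**r, "region": regions[r["id"]]} for r in rows if r["id"] in regions]


def format_data(rows: List[Record]) -> List[Record]:
    """Format Data: order the records by id and round the derived item."""
    ordered = sorted(rows, key=lambda r: r["id"])
    return [{**r, "income_per_person": round(r["income_per_person"], 2)} for r in ordered]


def prepare(rows: List[Record], regions: Dict[int, str]) -> List[Record]:
    """Run the five activities in sequence and return the prepared data set."""
    selected = select_data(rows, ["id", "income", "household"])
    cleaned = clean_data(selected)
    constructed = construct_data(cleaned)
    integrated = integrate_data(constructed, regions)
    return format_data(integrated)


# Hypothetical raw data; the record with a missing income is dropped by clean_data.
raw = [
    {"id": 2, "income": 3000.0, "household": 3, "note": "x"},
    {"id": 1, "income": 2500.0, "household": 2, "note": "y"},
    {"id": 3, "income": None, "household": 1, "note": "z"},
]
regions = {1: "north", 2: "south", 3: "east"}

for record in prepare(raw, regions):
    print(record)
```

The order used in `prepare` is one plausible sequence; a real project may iterate between the activities.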
Modelling
At the start of the workflow, tests are created for model validation, training and testing. The Model itself is run on the prepared data set to produce results. The Model
Parameters Settings document lists the parameters required by the model and their values. A model usually behaves differently for different sets of values. All variants of
the settings should be captured and described.
[Figure 7: workflow detail diagram, package Modelling, version 1.0, author Vojtěch Ondryhal. Role: Data Mining Engineer. Activities: Select Modelling Technique, Generate Test Design, Build Model, Assess Model. Artefacts: Test Design, Model Description, Model Parameters Settings, Model, Data Set.]
Figure 7 Modelling workflow detail
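The requirement to capture and describe all variants of the parameter settings could be supported by a small record-keeping class such as the one sketched below. The class, its methods and the example parameters (`max_depth`, `min_leaf_size` for a decision tree) are hypothetical and only illustrate the idea; the paper itself defines the Model Parameters Settings artefact as a document.

```python
from typing import Any, Dict, List


class ModelParametersSettings:
    """Keeps every variant of the parameter values tried for one model."""

    def __init__(self, model_name: str, required: List[str]) -> None:
        self.model_name = model_name
        self.required = list(required)
        self.variants: List[Dict[str, Any]] = []

    def add_variant(self, name: str, values: Dict[str, float], description: str) -> None:
        """Store one variant; refuse it if a required parameter has no value."""
        missing = sorted(set(self.required) - set(values))
        if missing:
            raise ValueError(f"missing parameters: {', '.join(missing)}")
        self.variants.append(
            {"name": name, "values": dict(values), "description": description}
        )

    def describe(self) -> List[str]:
        """Return one line per captured variant, in the order they were added."""
        lines = []
        for v in self.variants:
            params = ", ".join(f"{k}={v['values'][k]}" for k in self.required)
            lines.append(f"{self.model_name} / {v['name']}: {params} ({v['description']})")
        return lines


# Hypothetical model and parameter names, used only for illustration.
settings = ModelParametersSettings("decision tree", ["max_depth", "min_leaf_size"])
settings.add_variant("A", {"max_depth": 3, "min_leaf_size": 50}, "shallow tree")
settings.add_variant("B", {"max_depth": 8, "min_leaf_size": 10}, "deeper tree")

for line in settings.describe():
    print(line)
```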
Evaluation
The Evaluation Report indicates how the results meet the business criteria defined in the Business Understanding phase. During evaluation, models are
approved (or rejected).
[Figure 8: workflow detail diagram, package Evaluation, version 1.0, author Vojtěch Ondryhal. Roles: Business Analyst, Project Manager, Quality Insurance Manager. Activities: Evaluate Results, Review Process, Determine Next Steps. Artefacts: Evaluation Report, Final Report.]
Figure 8 Evaluation phase workflow detail
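The approve-or-reject decision of the evaluation workflow can be sketched as a comparison of measured results against the success criteria from the Business Understanding phase. The sketch assumes that each criterion can be expressed as a minimum value of a named measure; the criterion names and thresholds are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SuccessCriterion:
    """One success criterion expressed as a minimum value of a measure."""
    name: str
    minimum: float


def evaluate(results: Dict[str, float], criteria: List[SuccessCriterion]) -> Dict[str, object]:
    """Approve the model only if every criterion is met; list the unmet ones."""
    unmet = [
        c.name
        for c in criteria
        if results.get(c.name, float("-inf")) < c.minimum
    ]
    return {"approved": not unmet, "unmet": unmet}


# Hypothetical criteria and measured results.
criteria = [SuccessCriterion("precision", 0.80), SuccessCriterion("coverage", 0.50)]

print(evaluate({"precision": 0.85, "coverage": 0.60}, criteria))
print(evaluate({"precision": 0.85, "coverage": 0.40}, criteria))
```

A measure that is missing from the results counts as unmet, so an incomplete evaluation cannot lead to approval.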
The Final Report contains a review of the whole process and checks whether all
required activities have been finished. It also includes a list of possible
actions in the project and the decisions on these actions.
Deployment
Deployment is the last workflow in the data mining development process.
Deployment packages and a deployment plan for the target environment are
created. The Monitoring and Maintenance Plan defines a method of day-to-day
result checking in order to assure the correctness of the produced results.
[Figure 9: workflow detail diagram, package Deployment, version 1.0, author Vojtěch Ondryhal. Roles: Deployment Manager, Project Manager. Activities: Plan Deployment, Plan Monitoring And Maintenance, Review Project. Artefacts: Deployment Plan, Monitoring And Maintenance Plan, Final Report (connected by a relation labelled "finalize").]
Figure 9 Deployment phase workflow details
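The paper does not specify how the day-to-day result checking is performed. One simple possibility, sketched here purely as an assumption, is to compare a daily summary value of the deployed model's output (for example, the share of positive predictions) with a baseline and to flag days that deviate by more than a tolerance. The function name, the day labels and all numbers are hypothetical.

```python
from typing import Dict, List


def check_daily_results(daily_rates: Dict[str, float], baseline: float, tolerance: float) -> List[str]:
    """Return the days whose rate deviates from the baseline by more than the tolerance."""
    return [
        day
        for day, rate in sorted(daily_rates.items())
        if abs(rate - baseline) > tolerance
    ]


# Hypothetical daily shares of positive predictions.
rates = {"day-1": 0.21, "day-2": 0.35, "day-3": 0.19}

print(check_daily_results(rates, baseline=0.20, tolerance=0.05))
```

Days returned by the check would then be reviewed according to the Monitoring and Maintenance Plan.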
Conclusion
A possible approach to modelling data mining cases based on
UML and CRISP-DM was introduced in the paper. The paper provides an insight into more detailed work that includes a detailed description of
deliverables, templates and examples. The methodology is based on
prototypes that were tried out at the Communication and Information Systems Department of the University of Defence. The advantage of
this approach is the unification of project administration (templates,
work descriptions, etc.) with other development projects.
References
1. [www-ea] Enterprise Architect web site. http://www.sparxsystems.com.au/
2. [BOHTH05] Buřita L., Ondryhal V., Hodický J., Trunda M., Hlaváček M., Information Systems, University of Defence, 2005, U-3099 [in Czech language]
3. [CD01] CRISP-DM, Step by Step Data Mining Guide v. 1.0, CRISP-DM Consortium. http://www.crisp-dm.org/
4. [RUP2000] Rational Unified Process 2000, online documentation.
5. [www-vo] Web pages of the author. http://dcs.unob.cz/~Vojtech.Ondryhal/ [in Czech language]
6. [www-uml] Unified Modelling Language Resource Page. http://www.uml.org/