Cross-Industry Standard Process for Data Mining

October 2-3, 2015, İSTANBUL
Boğaziçi University
Project Management in Data Mining
Prof. Dr. M. Erdal Balaban
Istanbul University
Faculty of Business Administration
Avcılar, Istanbul - TURKEY
PRESENTATION OUTLINE
 What is Data Mining?
 Data Mining Environment
 Decision Making Process
 CRISP-DM Methodology
 Phases of Data Mining Process
 Flowchart of Data Mining Process (Proposal)
 Conclusions
October 2, 2015
2/17
What is Data Mining?
“Data mining is the process of discovering useful patterns
and trends in large data sets.” (Larose, 2014).
 Data mining makes a difference in the many areas where it is used:
   health care,
   banking,
   finance,
   insurance,
   telecommunications,
   manufacturing,
   retail,
   market research,
   and the public sector.
Data Mining Environment
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualizations, Information Science, and Other Disciplines]
Decision Making Process
DATA → INFORMATION → KNOWLEDGE → DECISIONS → ACTION
CRISP-DM Methodology
CRoss-Industry Standard
Process for Data Mining
(Shearer, 2000)
CRISP-DM focuses data
mining on rapid model
development and
deployment to optimize
decisions.
CRISP-DM
 The Cross-Industry Standard Process
for Data Mining (CRISP-DM) is the
dominant data-mining process
framework. It's an open standard;
anyone may use it. The following list
describes the various phases of the
process.
Business Understanding
  Determine Business Objectives
     Background
     Business Objectives
     Business Success Criteria
  Assess Situation
     Inventory of Resources
     Requirements, Assumptions, and Constraints
     Risks and Contingencies
     Terminology
     Costs and Benefits
  Determine Data Mining Goals
     Data Mining Goals
     Data Mining Success Criteria
  Produce Project Plan
     Project Plan
     Initial Assessment of Tools and Techniques

Data Understanding
  Collect Initial Data
     Initial Data Collection Report
  Describe Data
     Data Description Report
  Explore Data
     Data Exploration Report
  Verify Data Quality
     Data Quality Report

Data Preparation
  Data Set
     Data Set Description
  Select Data
     Rationale for Inclusion/Exclusion
  Clean Data
     Data Cleaning Report
  Construct Data
     Derived Attributes
     Generated Records
  Integrate Data
     Merged Data
  Format Data
     Reformatted Data

Modeling
  Select Modeling Technique
     Modeling Technique
     Modeling Assumptions
  Generate Test Design
     Test Design
  Build Model
     Parameter Settings
     Models
     Model Description
  Assess Model
     Model Assessment
     Revised Parameter Settings

Evaluation
  Evaluate Results
     Assessment of Data Mining Results w.r.t. Business Success Criteria
     Approved Models
  Review Process
     Review of Process
  Determine Next Steps
     List of Possible Actions
     Decision

Deployment
  Plan Deployment
     Deployment Plan
  Plan Monitoring and Maintenance
     Monitoring and Maintenance Plan
  Produce Final Report
     Final Report
     Final Presentation
  Review Project
     Experience Documentation

Tasks (indented) and outputs (bulleted) of the CRISP-DM reference model
Data Mining Phases (Proposal Flowchart)
[Flowchart. Boxes, in reading order: Define Project; Data Sources; Data Gathering; Data Understanding & Data Selection; Data Preprocessing (these data steps are grouped as Data Preparation and marked "Crucial Phase!"); Dataset; decision "Supervised Learning?" (No: Clustering Methods or Association Rules; Yes: Classification Methods); Training Dataset and Test Dataset; Selecting Algorithm & Model Building; Measuring Model Performance (grouped as Evaluation of Model Performance and marked "Crucial Phase!"); decision "Evaluate Model" (Low / High); Model Implementation; Knowledge Representation & Decision.]
Planning a data mining project
 Produce project plan: List the stages in the project,
together with their duration, required resources, and
relations to one another.
 Define the project
 Prepare data for data mining modeling
 Separate data into training and testing parts for
performance evaluation
 Apply alternative algorithms to build models and
evaluate each model's performance
 Implement the model to generate knowledge and
make a decision before action
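The project plan described above can be sketched as a small data structure. This is only an illustration: the stage names follow the slide, but the durations, resources, the `Stage` class, and the `schedule` function are hypothetical and not part of the original presentation. The sketch assumes each stage starts as soon as the stages it depends on have finished.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Stage:
    name: str
    duration_days: int                  # hypothetical estimate
    resources: list[str]                # hypothetical roles
    depends_on: list[str] = field(default_factory=list)


# Stage names follow the slide; durations and resources are invented examples.
PLAN = [
    Stage("Define the project", 5, ["project manager", "domain expert"]),
    Stage("Prepare data", 15, ["data engineer"], ["Define the project"]),
    Stage("Split into training and test sets", 1, ["data analyst"],
          ["Prepare data"]),
    Stage("Build and evaluate models", 10, ["data analyst"],
          ["Split into training and test sets"]),
    Stage("Implement the model", 5, ["data analyst", "decision maker"],
          ["Build and evaluate models"]),
]


def schedule(plan):
    """Return {stage name: (earliest start day, earliest finish day)}."""
    by_name = {stage.name: stage for stage in plan}
    graph = {stage.name: set(stage.depends_on) for stage in plan}
    times = {}
    # static_order() yields every stage after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        stage = by_name[name]
        start = max((times[dep][1] for dep in stage.depends_on), default=0)
        times[name] = (start, start + stage.duration_days)
    return times


if __name__ == "__main__":
    for name, (start, finish) in schedule(PLAN).items():
        print(f"{name:36s} day {start:3d} to day {finish:3d}")
```

Keeping the relations between stages explicit means a change in one estimate (for example, a longer data preparation) automatically shifts every later stage.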
Define project
 Understand the project objectives and
requirements in the first phase of
data mining
 List the assumptions made by the
project and list the constraints on the
project
 Construct a cost-benefit analysis for
the project
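A cost-benefit analysis can be as simple as comparing itemized totals. The sketch below is a minimal illustration, not a method from the presentation: the `cost_benefit` function, the item names, and all figures are hypothetical, and it assumes costs and benefits are expressed in the same currency over the same period.

```python
def cost_benefit(costs, benefits):
    """Summarize itemized costs and benefits (same currency, same period)."""
    total_cost = sum(costs.values())
    total_benefit = sum(benefits.values())
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return {
        "total_cost": total_cost,
        "total_benefit": total_benefit,
        "net_benefit": total_benefit - total_cost,
        "benefit_cost_ratio": total_benefit / total_cost,
    }


# Hypothetical figures, for illustration only.
costs = {"staff": 40_000, "software": 10_000}
benefits = {"reduced customer churn": 75_000}
summary = cost_benefit(costs, benefits)

if __name__ == "__main__":
    for key, value in summary.items():
        print(f"{key}: {value}")
```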
Prepare data for data mining
 Collect the data (or datasets),
 Select data,
 Explore data,
 Clean the data,
 Reformat data,
 Transform data.
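The six preparation steps above can be sketched as a small pipeline in plain Python. Everything here is an assumption made for illustration: the records in `RAW`, the field names, the choice to drop rows with unusable numeric values, and min-max scaling as the transformation. A real project would pick cleaning and transformation rules to suit its own data.

```python
# Collect: a tiny in-memory stand-in for data gathered from real sources.
RAW = [
    {"id": "1", "age": "34", "income": "52000", "segment": " Retail "},
    {"id": "2", "age": "",   "income": "61000", "segment": "banking"},
    {"id": "3", "age": "45", "income": "n/a",   "segment": "Retail"},
    {"id": "4", "age": "29", "income": "38000", "segment": "BANKING"},
]
NUMERIC = ("age", "income")


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def select(rows, fields):
    """Select: keep only the fields relevant to the mining goal."""
    return [{f: row[f] for f in fields} for row in rows]


def explore(rows):
    """Explore: count unusable values per numeric field."""
    return {f: sum(not is_number(row[f]) for row in rows) for f in NUMERIC}


def clean(rows):
    """Clean: drop rows whose numeric fields cannot be parsed."""
    return [row for row in rows if all(is_number(row[f]) for f in NUMERIC)]


def reformat(rows):
    """Reformat: convert text to numbers and normalize category labels."""
    return [{"age": float(r["age"]), "income": float(r["income"]),
             "segment": r["segment"].strip().lower()} for r in rows]


def transform(rows):
    """Transform: min-max scale each numeric field to the range [0, 1]."""
    if not rows:
        return []
    scaled = [dict(r) for r in rows]
    for f in NUMERIC:
        low = min(r[f] for r in rows)
        span = max(r[f] for r in rows) - low
        for out, r in zip(scaled, rows):
            out[f] = (r[f] - low) / span if span else 0.0
    return scaled


def prepare(rows):
    selected = select(rows, ("age", "income", "segment"))
    return transform(reformat(clean(selected)))


if __name__ == "__main__":
    print(explore(select(RAW, ("age", "income", "segment"))))
    for row in prepare(RAW):
        print(row)
```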
Separate the dataset for
performance evaluation
 Select the evaluation method
   Hold-out
   Cross-validation (k-fold CV)
   Bootstrapping
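The three evaluation methods differ only in how record indices are divided between training and testing. The sketch below shows one common way to generate those index sets with the standard library; the function names, the default test fraction, the number of folds, and the seeds are assumptions chosen for illustration.

```python
import random


def hold_out(n, test_fraction=0.3, seed=0):
    """Hold-out: one random split into training and test indices."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold(n, k=5, seed=0):
    """k-fold cross-validation: each fold serves as the test set once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def bootstrap(n, seed=0):
    """Bootstrapping: sample n training indices with replacement;
    records never drawn form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    drawn = set(train)
    test = [i for i in range(n) if i not in drawn]
    return train, test


if __name__ == "__main__":
    print("hold-out: ", hold_out(10))
    for train, test in k_fold(10, k=5):
        print("fold test:", sorted(test))
    print("bootstrap:", bootstrap(10))
```

Hold-out is the cheapest option, k-fold uses every record for testing exactly once, and bootstrapping is often preferred when the dataset is small.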
Apply alternative algorithms
and select the best model
 There are several techniques for the same data mining
problem type. Some techniques have specific
requirements on the form of data.
 Classification algorithms
   k-Nearest Neighbour (kNN)
   Naive Bayes
   Logistic Regression
   Decision Trees
   Support Vector Machines
   Artificial Neural Networks (ANNs)
 Clustering algorithms
 Association algorithms
 The generated models that meet the selected criteria
become approved models.
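Comparing alternative algorithms and approving those that meet the selected criterion can be sketched as follows. kNN and Naive Bayes come from the list above; the majority-class baseline is added here only as a reference point. The toy data, the every-third-record hold-out split, accuracy as the criterion, and the 0.9 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy data: two well-separated groups of 2-D points.
DATA = ([((x, y), "a") for x in (0, 1, 2) for y in (0, 1, 2)]
        + [((x, y), "b") for x in (6, 7, 8) for y in (6, 7, 8)])
# Hold-out split: every third record goes to the test set.
TRAIN = [rec for i, rec in enumerate(DATA) if i % 3 != 2]
TEST = [rec for i, rec in enumerate(DATA) if i % 3 == 2]


def knn(train, k=3):
    """k-nearest neighbours with Euclidean distance and majority vote."""
    def predict(point):
        nearest = sorted(train, key=lambda rec: math.dist(rec[0], point))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]
    return predict


def gaussian_naive_bayes(train):
    """Naive Bayes with an independent Gaussian per feature and class."""
    stats = {}
    for label in {lab for _, lab in train}:
        rows = [p for p, lab in train if lab == label]
        means = [sum(col) / len(col) for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / len(col) + 1e-9
                     for col, m in zip(zip(*rows), means)]
        stats[label] = (math.log(len(rows) / len(train)), means, variances)

    def predict(point):
        def log_posterior(label):
            log_prior, means, variances = stats[label]
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
                for v, m, var in zip(point, means, variances))
        return max(stats, key=log_posterior)
    return predict


def majority_baseline(train):
    """Reference model: always predict the most frequent training label."""
    label = Counter(lab for _, lab in train).most_common(1)[0][0]
    return lambda point: label


def accuracy(model, test):
    return sum(model(p) == lab for p, lab in test) / len(test)


CANDIDATES = {
    "kNN (k=3)": knn,
    "Gaussian Naive Bayes": gaussian_naive_bayes,
    "Majority-class baseline": majority_baseline,
}
THRESHOLD = 0.9  # hypothetical success criterion

scores = {name: accuracy(build(TRAIN), TEST)
          for name, build in CANDIDATES.items()}
approved = [name for name, acc in scores.items() if acc >= THRESHOLD]

if __name__ == "__main__":
    for name, acc in scores.items():
        print(f"{name:25s} accuracy = {acc:.2f}")
    print("Approved models:", approved)
```

On data this cleanly separated both classifiers pass and only the baseline is rejected; on real data the candidates would differ, which is the reason for trying more than one.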
Implement the model to make
a decision
 Creation of the model is generally not
the end of the project, even if the
purpose of the model is to increase
knowledge of the data.
 Apply the model within the
organization's decision-making
process and then put it into action.
CONCLUSIONS
1. Data mining techniques are important for discovering knowledge that is
meaningful and valuable for decision making.
2. A project management approach is important for successful data
mining.
3. Each phase of the data mining process is important, but the most important
phases are data preparation before modeling and evaluation of model
performance after modeling. These crucial phases are often
neglected or skipped in practice.
4. All phases and sub-operations should be planned and scheduled
using project management methods for successful data mining.
 Thank you very much for your
attention.
 Are there any questions or
suggestions?
[email protected]