Czech Technical University in Prague
Faculty of Information Technology
Department of Theoretical Computer Science
European Social Fund
Prague & EU: Investing in your future
MI-PDD – Data Preprocessing module (2010/2011)
Lecture 1: Introduction, CRISP-DM, Visualization
Pavel Kordík, FIT, Czech Technical University in Prague
Module organization
13 lectures
6 exercises (min. 30 of 60 points):
- test (20 points)
- semester project (40 points)
Exam (min. 20 of 40 points):
- test (40 points)
- optional examination (+5 points max)
Knowledge engineering master specialization

module (abbreviation) | dimension | completion | type of module | lecturer | recom. year
Statistics for Informatics (MIE-SPI) | 4+1 | z,zk | p-sz | prof. Blažek | 1.
Parallel Computer Architectures (MIE-PAR) | 3+1 | z,zk | p-sz | prof. Tvrdík | 1.
Systems Theory (MIE-TES) | 2+1 | z,zk | p-sz | prof. Moos | 1.
Data Preprocessing (MIE-PDD) | 2+1 | z,zk | pv-ob | Kordík, Ph.D. | 1.
Pattern Recognition (MIE-ROZ) | 2+1 | z,zk | pv-ob | doc. Haindl | 1.
Cybernality (MIE-KYB) | 2+0 | zk | p-hu | doc. Jirovský | 1.
Mathematics for Informatics (MIE-MPI) | 4+1 | z,zk | p-sz | doc. Šolcová | 1.
Functional and Logical Programming (MIE-FLP) | 2+1 | z,zk | pv-ob | Janoušek, Ph.D. | 1.
Advanced Database Systems (MIE-PDB) | 2+1 | z,zk | pv-ob | Valenta, Ph.D. | 1.
Advanced Information Systems (MIE-PIS) | 2+1 | z,zk | pv-ob | prof. Mišovič | 1.
elective module | 2+1 | z,zk | v | – | 1.
elective module | 2+1 | z,zk | v | – | 1.
Project Management (MIE-PRM) | 1+2 | z | p-em | Vala | 1.
Problems and Algorithms (MIE-PAA) | 3+1 | z,zk | pv-ob | Schmidt, Ph.D. | 2.
Computational Intelligence Methods (MIE-MVI) | 2+1 | z,zk | pv-ob | Kordík, Ph.D. | 2.
Knowledge Discovery from Databases (MIE-KDD) | 2+1 | z,zk | pv-ob | doc. Rauch | 2.
Master Project (MIE-MPR) | – | z | p-pr | – | 2.
elective module | 2+1 | z,zk | v | – | 2.
elective module | 2+1 | z,zk | v | – | 2.
Information Security (MIE-IBE) | 2+0 | zk | p-em | Čermák, CSc. | 2.
elective module | 2+1 | z,zk | v | – | 2.
IT Support to Business and CIO Role (MIE-CIO) | 3+0 | zk | p-em | prof. Dohnal | 2.
obligatory humanity module | – | zk | pv-hu | – | 2.
Master Thesis (MIE-DIP) | – | z | p-pr | – | 2.
Recommended elective modules
Get specialized!
Even more modules available.
Lectures Program
1. Introduction, CRISP-DM, visualization.
2. Data exploration, exploratory analysis techniques, descriptive statistics.
3. Methods to determine the relevance of features.
4. Problems with data – dimensionality, noise, outliers, inconsistency, missing values, non-numeric data.
5. Data cleaning, transformation, imputing, discretization, binning.
6. Reduction of data dimension.
7. Reduction of data volume, class balancing.
8. Feature extraction from text.
9. Feature extraction from documents and the web; preprocessing of structured data.
10. Feature extraction from time series.
11. Feature extraction from images.
12. Data preparation case studies.
13. Automation of data preprocessing.
Software
Pentaho Data Integration (Kettle)
Matlab
FAKE GAME
Books
Dorian Pyle: Data Preparation for Data Mining. Morgan Kaufmann, 1999.
Mamdouh Refaat: Data Preparation for Data Mining Using SAS. Morgan Kaufmann, 2006.
Tamraparni Dasu, Theodore Johnson: Exploratory Data Mining and Data Cleaning. Wiley, 2003.
Cross-Industry Standard Process for Data Mining – CRISP-DM
slides adapted from
CRISP-DM: Phases and tasks
Business Understanding (MI-KDD)
- Determine Business Objectives
- Assess Situation
- Determine Data Mining Goals
- Produce Project Plan

Data Understanding (MI-PDD)
- Collect Initial Data
- Describe Data
- Explore Data
- Verify Data Quality

Data Preparation
- Select Data
- Clean Data
- Construct Data
- Integrate Data
- Format Data

Modeling (MI-ROZ, MI-MVI)
- Select Modeling Technique
- Generate Test Design
- Build Model
- Assess Model

Evaluation (MI-KDD)
- Evaluate Results
- Review Process
- Determine Next Steps

Deployment
- Plan Deployment
- Plan Monitoring & Maintenance
- Produce Final Report
- Review Project
Phase 1. Business Understanding
Statement of Business Objective
Statement of Data Mining Objective
Statement of Success Criteria
Focuses on understanding the project objectives and requirements
from a business perspective, then converting this knowledge into a
data mining problem definition and a preliminary plan designed to
achieve the objectives.
Phase 1. Business Understanding
Determine business objectives
- thoroughly understand, from a business perspective, what the client
really wants to accomplish
- uncover important factors, at the beginning, that can influence the
outcome of the project
- neglecting this step means expending a great deal of effort producing
the right answers to the wrong questions
Assess situation
- more detailed fact-finding about all of the resources, constraints,
assumptions and other factors that should be considered
- flesh out the details
Phase 1. Business Understanding
Determine data mining goals
- a business goal states objectives in business terminology
- a data mining goal states project objectives in technical terms
ex) the business goal: “Increase catalog sales to existing customers.”
a data mining goal: “Predict how many widgets a customer will buy,
given their purchases over the past three years,
demographic information (age, salary, city) and
the price of the item.”
Produce project plan
- describe the intended plan for achieving the data mining goals and the
business goals
- the plan should specify the anticipated set of steps to be performed during
the rest of the project including an initial selection of tools and techniques
Phase 2. Data Understanding
• Explore the Data
• Verify the Quality
• Find Outliers
Starts with an initial data collection and proceeds with
activities in order to get familiar with the data, to identify
data quality problems, to discover first insights into the data
or to detect interesting subsets to form hypotheses for
hidden information.
Phase 2. Data Understanding
Collect initial data
- acquire within the project the data listed in the project resources
- includes data loading if necessary for data understanding
- possibly leads to initial data preparation steps
- if acquiring multiple data sources, integration is an additional issue,
either here or in the later data preparation phase
Describe data
- examine the “gross” or “surface” properties of the acquired data
- report on the results
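As a rough illustration of this task (not part of the original slides; the module's listed tools are Kettle, Matlab and FAKE GAME), the "gross" properties of a table can be inspected in a few lines of Python/pandas; the file name is hypothetical:

```python
import pandas as pd

# Hypothetical input file with the acquired data.
df = pd.read_csv("initial_data.csv")

# "Gross" / "surface" properties: size, attribute types, basic statistics.
print(df.shape)                      # number of records and attributes
print(df.dtypes)                     # data type of each attribute
print(df.describe(include="all"))    # counts, means, quartiles, top categories
```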
Phase 2. Data Understanding
Explore data
- tackles the data mining questions, which can be addressed using querying,
visualization and reporting including:
distribution of key attributes, results of simple aggregations
relations between pairs or small numbers of attributes
properties of significant sub-populations, simple statistical analyses
- may address directly the data mining goals
- may contribute to or refine the data description and quality reports
- may feed into the transformation and other data preparation needed
Verify data quality
- examine the quality of the data, addressing questions such as:
“Is the data complete?”, “Are there missing values in the data?”
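A minimal Python/pandas sketch of both tasks, assuming a hypothetical input file and hypothetical attribute names (amount, store, credit_limit):

```python
import pandas as pd

df = pd.read_csv("initial_data.csv")    # hypothetical input file

# Explore data: distribution of a key attribute and a simple aggregation.
print(df["amount"].describe())
print(df.groupby("store")["amount"].mean())

# Relation between a pair of attributes.
print(df[["amount", "credit_limit"]].corr())

# Verify data quality: completeness, missing values, duplicates.
print(df.isna().sum())        # missing values per attribute
print(df.duplicated().sum())  # duplicate records
```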
Phase 3. Data Preparation
Usually takes over 70% of the time
- Collection
- Assessment
- Consolidation and Cleaning
- Data selection
- Transformations
Covers all activities to construct the final dataset from the initial raw
data. Data preparation tasks are likely to be performed multiple times
and not in any prescribed order. Tasks include table, record and
attribute selection as well as transformation and cleaning of data for
modeling tools.
Phase 3. Data Preparation
Select data
- decide on the data to be used for analysis
- criteria include relevance to the data mining goals, quality and
technical constraints such as limits on data volume or data types
- covers selection of attributes as well as selection of records in a
table
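For illustration only (the file name, attribute names and the date threshold are assumptions), attribute and record selection might look like this in pandas:

```python
import pandas as pd

# Hypothetical acquired data with a purchase date column.
df = pd.read_csv("initial_data.csv", parse_dates=["purchase_date"])

# Attribute selection: keep only columns relevant to the data mining goal.
df = df[["purchase_date", "amount", "store", "credit_limit", "fraud"]]

# Record selection: limit the data volume, e.g. to the most recent period.
df = df[df["purchase_date"] >= "2002-01-01"]
```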
Clean data
- raise the data quality to the level required by the selected analysis
techniques
- may involve selection of clean subsets of the data, the insertion of
suitable defaults or more ambitious techniques such as the
estimation of missing data by modeling
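A small sketch of these options in Python (pandas and scikit-learn); all file and column names are hypothetical:

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("selected_data.csv")   # hypothetical output of the selection step

# Insertion of suitable defaults: a constant for categories, the median for numbers.
df["store"] = df["store"].fillna("unknown")
df["amount"] = df["amount"].fillna(df["amount"].median())

# More ambitious: estimate remaining missing numeric values by modeling (k-NN).
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df[numeric_cols])

# Or simply keep a clean subset of the data.
df_clean = df.dropna()
```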
Phase 3. Data Preparation
Construct data
- constructive data preparation operations such as the production of
derived attributes, entire new records or transformed values for
existing attributes
Integrate data
- methods whereby information is combined from multiple tables or
records to create new records or values
Format data
- formatting transformations refer to primarily syntactic modifications
made to the data that do not change its meaning, but might be
required by the modeling tool
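The three tasks in one short pandas sketch (table names, the join key and column names are hypothetical stand-ins):

```python
import pandas as pd

purchases = pd.read_csv("purchases.csv", parse_dates=["purchase_date"])
accounts = pd.read_csv("accounts.csv")

# Integrate data: combine information from multiple tables.
df = purchases.merge(accounts, on="account_id", how="left")

# Construct data: derived attributes computed from existing ones.
df["weekday"] = df["purchase_date"].dt.dayofweek
df["amount_to_limit"] = df["amount"] / df["credit_limit"]

# Format data: purely syntactic changes required by the modeling tool,
# e.g. renaming attributes and fixing the column order.
df = df.rename(columns={"purchase_date": "purch_dt"})
df = df[["account_id", "purch_dt", "weekday", "amount", "amount_to_limit"]]
```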
Phase 4. Modeling
Select the modeling technique
(based upon the data mining objective)
Build model
(Parameter settings)
Assess model (rank the models)
Various modeling techniques are selected and applied and their
parameters are calibrated to optimal values. Some techniques have
specific requirements on the form of data. Therefore, stepping back to
the data preparation phase is often necessary.
Phase 4. Modeling
Select modeling technique
- select the actual modeling technique that is to be used
ex) decision tree, neural network
- if multiple techniques are applied, perform this task separately for each
technique
Generate test design
- before actually building a model, generate a procedure or
mechanism to test the model’s quality and validity
ex) In classification, it is common to use error rates as quality
measures for data mining models. Therefore, typically separate the
dataset into train and test set, build the model on the train set and
estimate its quality on the separate test set
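A hedged sketch of such a test design with scikit-learn; the prepared file, the "fraud" target and the 70/30 split are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("prepared_data.csv")            # hypothetical prepared dataset
X, y = df.drop(columns=["fraud"]), df["fraud"]   # hypothetical target attribute

# Separate the dataset into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build the model on the train set, estimate its quality on the separate test set.
model = DecisionTreeClassifier().fit(X_train, y_train)
error_rate = 1.0 - model.score(X_test, y_test)   # error rate as the quality measure
print(f"test error rate: {error_rate:.3f}")
```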
Phase 4. Modeling
Build model
- run the modeling tool on the prepared dataset to create one or
more models
Assess model
- the data mining engineer interprets the models according to domain
knowledge, the data mining success criteria and the desired test design
- judges the success of the application of modeling and discovery
techniques more technically
- contacts business analysts and domain experts later in order to
discuss the data mining results in the business context
- considers only the models, whereas the evaluation phase also takes into
account all other results that were produced in the course of the project
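To illustrate the "rank the models" idea, here is a minimal scikit-learn sketch that builds two candidate models with different techniques and parameter settings and ranks them by test error; the dataset, target and parameters are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("prepared_data.csv")                       # hypothetical dataset
X, y = df.drop(columns=["fraud"]), df["fraud"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build several models with different techniques / parameter settings ...
candidates = {
    "decision tree": DecisionTreeClassifier(max_depth=5),
    "neural network": MLPClassifier(hidden_layer_sizes=(20,), max_iter=500),
}

# ... and rank them by their technical quality on the test design.
results = sorted(
    (1.0 - model.fit(X_train, y_train).score(X_test, y_test), name)
    for name, model in candidates.items()
)
for error, name in results:
    print(f"{name}: test error {error:.3f}")
```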
Phase 5. Evaluation
Evaluation of model
- how well it performed on test data
Methods and criteria
- depend on model type
Interpretation of model
- whether it is important or not, and easy or hard, depends on the algorithm
Thoroughly evaluate the model and review the steps executed to
construct the model to be certain it properly achieves the business
objectives. A key objective is to determine if there is some important
business issue that has not been sufficiently considered. At the end of
this phase, a decision on the use of the data mining results should be
reached.
Phase 5. Evaluation
Evaluate results
- assesses the degree to which the model meets the
business objectives
- seeks to determine if there is some business reason why
this model is deficient
- if time and budget constraints permit, test the model(s) on test
applications in the real application
- also assesses other data mining results generated
- unveil additional challenges, information or hints for
future directions
Phase 5. Evaluation
Review process
- do a more thorough review of the data mining engagement in order
to determine if there is any important factor or task that has
somehow been overlooked
- review the quality assurance issues
ex) “Did we correctly build the model?”
Determine next steps
- decides how to proceed at this stage
- decides whether to finish the project and move on to deployment if
appropriate or whether to initiate further iterations or set up new
data mining projects
- includes an analysis of the remaining resources and budget, which
influences the decisions
Phase 6. Deployment
Determine how the results need to be utilized
Who needs to use them?
How often do they need to be used?
Deploy Data Mining results by
Scoring a database, utilizing results as business rules,
interactive scoring on-line
The knowledge gained will need to be organized and presented in a way
that the customer can use it. However, depending on the requirements,
the deployment phase can be as simple as generating a report or as
complex as implementing a repeatable data mining process across the
enterprise.
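The simplest of these options, scoring a database in batch with a stored model, might look like this in Python; the file names, the score threshold and the business rule are assumptions:

```python
import joblib
import pandas as pd

# Load the deployed model and the customer database (hypothetical files).
model = joblib.load("deployed_model.joblib")
customers = pd.read_csv("customer_database.csv")

# Score every record with the model.
features = customers.drop(columns=["account_id"])
customers["score"] = model.predict_proba(features)[:, 1]

# Utilize the result as a simple business rule.
customers["flag_for_review"] = customers["score"] > 0.8
customers.to_csv("scored_customers.csv", index=False)
```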
Phase 6. Deployment
Plan deployment
- in order to deploy the data mining result(s) into the business, this task
takes the evaluation results and determines a strategy for deployment
- document the procedure for later deployment
Plan monitoring and maintenance
- important if the data mining results become part of the day-to-day
business and its environment
- helps to avoid unnecessarily long periods of incorrect usage of data
mining results
- needs a detailed plan of the monitoring process
- takes into account the specific type of deployment
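One possible (hypothetical) ingredient of such a monitoring plan is an automatic check that the deployed model's error has not drifted too far from its baseline; the function name and the tolerance below are assumptions, not part of CRISP-DM:

```python
# Alert when the deployed model's error rate degrades beyond a tolerance
# agreed in the monitoring plan (the 5-percentage-point tolerance is an assumption).
def model_still_healthy(current_error: float, baseline_error: float,
                        tolerance: float = 0.05) -> bool:
    """Return True while the deployed model still performs acceptably."""
    return current_error <= baseline_error + tolerance

if not model_still_healthy(current_error=0.18, baseline_error=0.10):
    print("Model performance degraded - schedule retraining or redeployment.")
```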
Phase 6. Deployment
Produce final report
- the project leader and his team write up a final report
- may be only a summary of the project and its experiences
- may be a final and comprehensive presentation of the data mining
result(s)
Review project
- assess what went right and what went wrong, what was done well
and what needs to be improved
Knowledge discovery from databases – the process

Understand the opportunity
- identify and define the business opportunity

Prepare data (typically about 75% of the process)
- profile and understand data
- derive attributes
- transform data
- create case set

Build models
- train models
- assess model performance

Use models
- deploy model
- monitor model performance
Example: transforming source data
[Slide figure: billions of raw Purchase records (PurchDt, Amt, Store, Acct) and millions of Account records (with attributes such as Size, Age, Tenure, CS, CR, CrLim) are joined and then aggregated and pivoted into purchase-history, item-summary and fraud-indicator attributes (e.g. Min, Max, Elec, Vid, Jewl, Frd?).]
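The "aggregate and pivot" step could be reproduced on a small scale with pandas; everything here (the input file and column names such as acct, amt, category) is a hypothetical stand-in for the source tables on the slide:

```python
import pandas as pd

# Hypothetical raw purchase transactions (billions of rows in the real case).
purchases = pd.read_csv("purchases.csv", parse_dates=["purch_dt"])

# Aggregate: per-account summaries of purchase behaviour.
account_summary = purchases.groupby("acct").agg(
    n_purchases=("amt", "size"),
    total_amt=("amt", "sum"),
    min_amt=("amt", "min"),
    max_amt=("amt", "max"),
)

# Pivot: spending per product category becomes one column per category.
category_pivot = purchases.pivot_table(
    index="acct", columns="category", values="amt", aggfunc="sum", fill_value=0
)

# Join the aggregates back so that every purchase row carries its
# account-level attributes - the basis of the modeling matrix.
modeling_matrix = (purchases
                   .merge(account_summary, left_on="acct", right_index=True)
                   .merge(category_pivot, left_on="acct", right_index=True))
```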
result: modeling matrix
[Slide figure: the resulting modeling matrix has one row per purchase and hundreds of attributes per row (the raw purchase fields plus the aggregated and pivoted account, purchase-history and item-summary attributes), containing a mix of fraud and no-fraud purchases.]
Example: modeling in SAS Enterprise Miner
Example: model export
- score converter node: generates Java model code
- reporter node: exports code and HTML report to the project directory
Example: model deployment tool
Task: copy model information to the Data Store.
Design: [slide diagram] Web browser client (HTTP, HTML) – Web App. on a Web Server – JDBC access – NonStop SQL/MX – file/registry access – SAS Open Metadata server – Data Store (file/SAS server); model export/registration from SAS Enterprise Miner and Mining Mart.
Example: real-time scoring
[Slide diagram: an Interaction Manager delivers Offers / Advice from a Rules Engine driven by Business Rules and Model Scores; the Model Scores come from a Scoring Engine that applies the Deployed Models to Model Aggregates; the Model Aggregates are produced by an Aggregation Engine from Customer Data and Aggregate Definitions.]
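A toy Python sketch of that interaction flow; all names and the dummy "model" are hypothetical and only meant to show how the engines chain together:

```python
def aggregation_engine(transactions: list) -> dict:
    """Summarize raw customer data into model aggregates."""
    return {"n_tx": len(transactions), "total": sum(transactions)}

def scoring_engine(aggregates: dict) -> float:
    """Apply a (dummy) deployed model to the aggregates."""
    return min(1.0, aggregates["total"] / 1000.0)

def rules_engine(score: float) -> str:
    """Combine the model score with a business rule to produce an offer / advice."""
    return "offer premium card" if score > 0.7 else "no offer"

# Interaction manager: one request-response cycle for a customer.
customer_transactions = [120.0, 340.5, 560.0]
print(rules_engine(scoring_engine(aggregation_engine(customer_transactions))))
```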
Visualization
Extremely useful in all CRISP-DM stages
Raw data visualization – detect problems (data inconsistency, outliers, errors)
Large, multivariate data
Model behavior visualization
Results visualization
Often the best interface for domain experts
Examples follow
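A minimal raw-data visualization sketch in Python (pandas and matplotlib; the input file is hypothetical), of the kind used to spot outliers and inconsistencies:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("initial_data.csv")   # hypothetical raw data

# Pairwise scatter plots: outliers show up as isolated points,
# inconsistencies as impossible value combinations.
pd.plotting.scatter_matrix(df.select_dtypes("number"), figsize=(8, 8))
plt.show()

# Box plots per attribute are a quick univariate outlier check.
df.select_dtypes("number").boxplot()
plt.show()
```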
Visual data exploration
[slide figures: examples of visual data exploration]
Glyphs
http://www.ii.uib.no/vis/publications/publication/2009/vids/lie09glyphBased3Dvisualization.html
Dedicated infrastructures
EVEREST (Exploratory Visualization Environment for REsearch in Science and
Technology) is a large-scale venue for data exploration and analysis.
Its main feature is a 27-projector PowerWall displaying 11,520 by 3,072 pixels,
an aggregate of 35 million pixels. The projectors are arranged in a 9×3 array,
each providing 3,500 lumens for a very bright display, so the wall offers a
tremendous amount of visual detail. The wall is integrated with the rest of the
computing center, creating a high-bandwidth data path between large-scale
high-performance computing and large-scale data visualization.
Next lecture
Visual data exploration
Statistical data description