Lluis Belanche + Alfredo Vellido
Data Mining II
An Introduction to Mining (2)
On dates & evaluation:
Lectures expected to end in the week of 14th-18th Dec
Likely essay deadline & presentation: 15th, 22nd Jan
What’s DATA MINING? A historicist viewpoint
DATA MINING as a methodology
CRISP: a DM methodology
CRoss-Industry Standard Process for Data Mining: neutral
methodology from the point of view of industry, tool and
application (free & non-proprietary)
Pete Chapman, Randy Kerber (NCR); Julian Clinton, Thomas Khabaza,
Colin Shearer (SPSS); Thomas Reinartz, Rüdiger Wirth (DaimlerChrysler)
CRISP-DM was conceived in 1996
DaimlerChrysler: leaders in industrial application, SPSS: leaders in
product development (Clementine, 1994), NCR: owners of large
(huge!) databases (Teradata)
Financed by the EU. Version 1.0 released officially in 1999
CRISP: Hierarchic structure of the methodology
CRISP: The virtuous loop of methodology phases
CRISP: Description of phases
Problem understanding: study of targets and requirements from the
business/problem viewpoint. Defining it as a DM problem.
Data understanding: data collection; getting to know the data, trying to
detect both quality problems and interesting features.
Data preparation: Preparing the data set to be modelled, starting from raw
data. This is an iterative and exploratory process. Selection of files, tables,
variables, record samples… plus data cleaning.
Modelling: Data analysis using modelling techniques of a sort that are
suitable for the problem at hand. Includes fiddling with the models, tuning
their parameters, etc.
Evaluation: All previous steps must be evaluated as a whole (as a unitary
process), and we must decide whether the deliverables so far meet the DM
challenge.
Implementation: All the knowledge acquired to this point must be organized
and presented to the “client” in a usable form. We must define, together with
this client, a protocol to reliably deploy the DM findings.
Use of DM methodologies (2004->2007)
Enterprise Miner™: SEMMA
The acronym SEMMA -- Sample, Explore, Modify, Model, Assess -- refers to
the core process of conducting data mining. Beginning with a statistically
representative sample of your data, SEMMA makes it easy to apply
exploratory statistical and visualization techniques, select and transform
the most significant predictive variables, model the variables to predict
outcomes, and confirm a model's accuracy.
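SEMMA begins with a statistically representative sample. As a rough illustration only (this is not Enterprise Miner's own procedure), the sketch below draws a stratified sample in plain Python. The `stratified_sample` helper, the record fields and the 20% fraction are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from each stratum given by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for group in strata.values():
        # Keep at least one record per stratum so rare classes are not lost.
        n = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, n))
    return sample


# Hypothetical data set: 90 "no" outcomes and 10 "yes" outcomes.
records = [{"id": i, "churn": "yes" if i % 10 == 0 else "no"} for i in range(100)]
sample = stratified_sample(records, key=lambda r: r["churn"], fraction=0.2)
```

Sampling within each stratum keeps the class proportions of the full data set, which a simple random sample of a small, unbalanced data set does not guarantee.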
Use of DM methodologies (2004->2007)
CRISP: Phases: Problem understanding
Determine problem goals: background; problem goals; success criteria.
Assess situation: inventory of resources; requirements, assumptions and
limitations; risks and contingencies; terminology; costs & benefits.
Determine DM goals: DM goals; DM success criteria.
Produce project plan: project plan; initial selection of tools.
DM application areas (’06->’09)
CRISP: Phases: Data understanding
Obtain initial data: initial data report.
Data description: data descriptive report.
Data exploration: data exploration report.
Data quality verification: data quality report.
METROFANG: a real story about data understanding (1)
METROFANG: a real story about data understanding (2)
[Plot: “caudal entrada” (inlet flow), values 0 to 350, samples 1 to 17671. Annotations: Missing data, Stationality, Outliers]
[Plot: “Par motor Secador A” (motor torque, dryer A), values 0 to 140, samples 1 to 17671. Annotations: Time Series, Weekend?, FORUM???]
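The plots flag missing data and outliers in sensor time series. A minimal sketch of how such a first check might look is given below. The readings are invented (not METROFANG data), `describe_series` is a hypothetical helper, and the median/MAD rule with a 3.5 threshold is one common robust choice rather than the method used in the project.

```python
import statistics


def describe_series(values, threshold=3.5):
    """Return the indices of missing readings and of robust outliers."""
    missing = [i for i, v in enumerate(values) if v is None]
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = statistics.median(abs(v - med) for v in present)
    outliers = []
    if mad > 0:
        for i, v in enumerate(values):
            # Modified z-score; 0.6745 rescales the MAD to match a standard deviation.
            if v is not None and 0.6745 * abs(v - med) / mad > threshold:
                outliers.append(i)
    return missing, outliers


# Hypothetical inlet-flow readings; None marks a missing sample.
flow = [201.0, 198.5, None, 203.2, 199.9, 340.0, 200.4, None, 197.8, 202.1]
missing, outliers = describe_series(flow)
```

Whether a flagged point is a sensor fault or a real event (a weekend, a special occasion) is a question for the domain expert, which is the point of the data understanding phase.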
Storing data (’07)
Poll
What did you use for data storage for significant data mining projects in the past year:
[142 voters, 284 votes]
Text files (e.g. tab or comma delim) (75) 52.8%
Data mining system format (SAS, SPSS, arff) (57) 40.1%
Excel (28) 19.7%
Oracle (25) 17.6%
SQL Server (15) 10.6%
MySQL (12) 8.5%
other format (10) 7.0%
other commercial DBMS (7) 4.9%
other free DBMS (4) 2.8%
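The percentages in this poll are shares of the 142 voters, not of the votes cast, so they add up to more than 100% (each voter could pick several options). The short check below recomputes them from the counts listed above; the variable names are illustrative.

```python
voters = 142

# Vote counts as listed in the poll results.
votes = {
    "Text files": 75,
    "Data mining system format": 57,
    "Excel": 28,
    "Oracle": 25,
    "SQL Server": 15,
    "mySQL": 12,
    "other format": 10,
    "other commercial DBMS": 7,
    "other free DBMS": 4,
}

# Share of voters who selected each option, in percent with one decimal.
shares = {name: round(100 * count / voters, 1) for name, count in votes.items()}
```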
CRISP: Phases: Data preparation
Data selection: arguments for selection.
Data cleaning: data cleaning report.
Reconstruct data: derived variables; generated observations.
Integrate data: integrated data.
Data formatting: data with new format.
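The data preparation tasks (selection, cleaning, constructing derived variables, integration, formatting) can be sketched on a toy example. Everything below is hypothetical: the two tables, their field names and the `over_50` variable are made up for illustration, and a real project would normally use a database or a data-frame library instead of plain lists.

```python
import csv
import io

# Hypothetical raw tables.
customers = [
    {"id": 1, "age": "34", "city": "Barcelona"},
    {"id": 2, "age": "", "city": "Girona"},  # missing age
    {"id": 3, "age": "51", "city": "Barcelona"},
]
purchases = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.5},
    {"customer_id": 3, "amount": 12.0},
]

# Data selection: keep only the variables needed for modelling.
selected = [{"id": c["id"], "age": c["age"]} for c in customers]

# Data cleaning: drop records with a missing age and convert types.
cleaned = [{"id": r["id"], "age": int(r["age"])} for r in selected if r["age"]]

# Construct data: add a derived variable.
for r in cleaned:
    r["over_50"] = r["age"] > 50

# Integrate data: join the total purchase amount per customer.
totals = {}
for p in purchases:
    totals[p["customer_id"]] = totals.get(p["customer_id"], 0.0) + p["amount"]
prepared = [dict(r, total_spent=totals.get(r["id"], 0.0)) for r in cleaned]

# Data formatting: write the result in the format the modelling tool expects.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "age", "over_50", "total_spent"])
writer.writeheader()
writer.writerows(prepared)
csv_text = buffer.getvalue()
```

In practice these steps are revisited several times, since problems found during modelling often send the analyst back to selection or cleaning.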
Is data preparation that important?
Common data types analyzed …(’07)
Compared to the 2005 KDnuggets Poll on “Types of data you analyzed/mined in
last 12 months”, the biggest increase was in anonymized data (perhaps an
indicator of the increasing importance of privacy issues).
Common data types analyzed …(’09)
How big is yours? … (’06->’09)
Data manipulation tools …(’07)
CRISP: Phases: Modelling
Select modelling technique: selected technique.
Create test design: test design.
Build model: parameter selection; model; model description.
Validate model: model validation.
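The modelling tasks (test design, parameter selection, model building, validation) can be illustrated with a small sketch. It assumes a synthetic one-dimensional data set, a hand-written k-nearest-neighbours classifier and a 60/20/20 split. None of these come from the slides; they only show how a test design keeps parameter tuning separate from the final validation.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def accuracy(train, test, k):
    """Fraction of test points whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


# Synthetic two-class data (an assumption for the example).
rng = random.Random(42)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
data += [(rng.gauss(2.0, 1.0), 1) for _ in range(100)]
rng.shuffle(data)

# Test design: separate sets for fitting, parameter selection and final validation.
train, valid, test = data[:120], data[120:160], data[160:]

# Parameter selection: pick the k with the best accuracy on the validation set.
candidates = [1, 3, 5, 7, 9]
scores = {k: accuracy(train, valid, k) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])

# Model validation: report accuracy on data not used for fitting or tuning.
test_accuracy = accuracy(train, test, best_k)
```

Only odd values of k are tried so that the majority vote cannot tie.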
CRISP: Selection of techniques
UNIVERSE OF TECHNIQUES (defined by tools)
TECHNIQUES SUITED TO A PROBLEM
POLITICAL REQUIREMENTS (business, executive)
LIMITATIONS: money, time, human resources; data types, knowledge
SELECTED TOOL(S)
Commonly used models/techniques (’05)…
Commonly used models/techniques (’07)…
CRISP: Phases: Evaluation
Evaluate results: evaluation of DM results; approved models.
Revise processes: revision of the process.
Determine next steps: list of possible actions; decisions.
CRISP: Phases: Deployment
Plan implementation: implementation plan.
Plan monitoring & maintenance: monitoring & maintenance plan.
Generate final report: final report; final presentation.
Revise project: documentation of experience.
How do you deploy it? (’06->’09)
Cloud computing: computing in which dynamically scalable and often
virtualized resources are provided as a service over the Internet.
An example: Google Apps.
Software popularity (’07)
Free vs. commercial: debate
Software popularity (’09)
Why?
Many changes have occurred in the business application of data mining since CRISP-DM 1.0
was published. Emerging issues and requirements include:
The availability of new types of data—text, Web, and attitudinal data, for example—along with
new techniques for pre-processing, analyzing, and combining them with related case data
Integration and deployment of results with operational systems such as call centers and Web
sites
Far more demanding requirements for scalability and for deployment into real-time
environments
The need to package analytical tasks for non-analytical end users and integrate these tasks in
business workflows
The need to seamlessly integrate the deployment of results and closed-loop feedback with
existing business processes
The need to mine large-scale databases in situ, rather than exporting an analytical dataset
Organizations’ increasing reliance on teams, making it important to educate greater numbers of
people on the processes and best practices associated with data mining and predictive analytics
In July 2006 the consortium announced that it was going to start the process of working towards a second
version of CRISP-DM. On 26 September 2006, the CRISP-DM SIG met to discuss potential enhancements
for CRISP-DM 2.0 and the subsequent roadmap. However, these efforts appear to be stalled. The SIG
has not met, updated the CRISP website, or communicated anything to members since early 2007.