Intelligent Choices
Preceding Data Analysis
Katharina Morik
Univ. Dortmund, www-ai.cs.uni-dortmund.de
• Knowledge Discovery in Databases (KDD)
• The Mining Mart approach
• Case studies
– Item sales
– Intensive care
1
The UCI Library Approach
• Learning task: classification
• Evaluation criteria: accuracy and coverage
• Data sets
– Small number of examples
– Small number of features
– All and only relevant features included
– No noise
2
KDD Task
• The learning task of the application needs to be mapped to a
formal learning task (classification, regression, clustering)
– "I want to predict sales 4 weeks ahead"
– "I want to know more about my best (worst) customers"
– "I want to detect fraud"
• Databases
– Very large number of records
– Very large number of features
– Relevant features missing
– Noise included
3
Observation
Experienced users can apply any learning system successfully to
any application, since they prepare the data well...
• The representation LE of the examples and the choice of a sample
determine the applicability of learning methods.
• A chain of data transformations (learning steps or manual
preprocessing) leads to the representation LE of the method that
delivers the desired result.
Experienced users remember prototypical successful
transformation/learning chains
4
The Real Process
[Diagram: the real KDD process. Raw data pass through a chain of representations LE1, LE2, ..., where each step's output LHi becomes the next step's example language LEi+1; one step in the chain (LHn-1 = LEn to LHn = LEn+1) is the learning/data mining step proper, and the final representation LEn+m feeds the performance system used by the users in the application.]
5
Intelligent Choices
80% of the KDD work is invested into:
• Choosing the learning task
• Sampling
• Feature generation, extraction, and selection
• Data cleaning
• Model selection or tuning the hypothesis space
• Defining appropriate evaluation criteria
6
The Mining Mart Approach
Best practice cases of preprocessing chains exist...
• Data, LE and LH are described on the meta level.
• The meta-level description is presented in application
terms.
• MiningMart users choose a case and apply the
corresponding transformation and learning chain to
their application.
... and more can be obtained!
7
Call for Participation
• MiningMart develops an operational meta-language for
describing data and operators.
• MiningMart prepares the first cases of KDD.
• MiningMart will present the case-base in the WWW.
• You may contribute to the endeavour!
– Apply the meta-language to your application and
deliver it as a positive example to the case-base; or
– apply a case of MiningMart to your data.
8
The Consortium
• Katharina Morik, Univ. Dortmund, D (Coordinator)
• Lorenza Saitta, Univ. Piemonte del Avogadro, I
• Pieter Adriaans, Perot Systems Netherland, NL
• Michael May, GMD, D
• Jörg-Uwe Kietz, SwissLife, CH
• Fabio Malabocchia, TILab, I
9
The Mining Mart System
[Diagram: the Mining Mart system. A human-computer interface presents KDD process tasks and problem models on top of a case-base of successful KDD processes. Meta-data describe the raw data. Manual pre-processing operators (time, multi-relation) and ML operators (time, parameters, features, description logic) are checked for applicability; their results are meta-data and augmented data.]
10
The Meta Model for Meta Data
• The Relational Model describes the database.
• The Conceptual Model describes the individuals and classes of the domain with their relations.
• The Case Model describes chains of preprocessing operators.
• The Execution Model generates SQL statements or calls to external tools.
11
Use of the Meta Model
• The meta model is stored in a database.
• The database manager delivers the relational model.
• The data analyst delivers the conceptual model.
• The KDD expert delivers or adjusts a case model.
First cases are delivered by the Mining Mart project.
• The system compiles the meta data into SQL statements
and calls to external tools, hence executing the case
model on the data.
12
Sales of Items of a Drugstore
[Chart: weekly sales (0 to 160) of ten drugstore items (Insect killers 1, Insect killers 2, Sun milk, Candles 1, Baby food 1, Beauty, Sweets, Self-tanning cream, Candles 2, Baby food 2) from week 01|96 to week 53|98; axes: Week, Sales.]
13
Learning Task 1:
Predict Sales of an Item
Given drug store sales data of 50 items in 20 shops over 104
weeks
predict the sales of an item such that
the prediction never underestimates the sales, and
the prediction overestimates by less than the rule of thumb does.
Observation: 90% of the items are sold fewer than 10 times a
week.
Requirement: prediction horizon is more than 4 weeks ahead.
14
Shop Application -- Data
Shop   Week   Item1  ...  Item50
Dm1    1      4      ...  12
Dm1    ...    ...    ...  ...
Dm1    104    9      ...  16
Dm2    1      3      ...  19
...    ...    ...    ...  ...
Dm20   104    12     ...  16

LE DB1:  I: T1 A1 ... A50; a set of multivariate time series
15
Preprocessing
• From shops to items: multivariate to univariate
LE1': i: t1 a1 ... tk ak
For all shops, for all items:
CREATE VIEW Univariate AS
SELECT shop, week, itemi
FROM Source
WHERE shop = 'dmj';
• Multiple learning
Dm1_Item1    1 4  ... 104 9
Dm1_Item50   1 12 ... 104 16
...
Dm20_Item50  1 14 ... 104 16
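For readers who prefer to see this step outside SQL, here is a minimal Python sketch of the same multivariate-to-univariate split; the file name and the column names (shop, week, item1 ... item50) are assumptions matching the table on the previous slide, not part of the original case.

# A sketch only: split the multivariate sales table into one univariate
# series per (shop, item), mirroring the per-shop CREATE VIEW above.
import pandas as pd

source = pd.read_csv("sales.csv")            # assumed columns: shop, week, item1 ... item50
item_columns = [c for c in source.columns if c.startswith("item")]

univariate = {}                               # e.g. univariate[("Dm1", "item1")]
for shop, group in source.groupby("shop"):
    for item in item_columns:
        series = group.sort_values("week").set_index("week")[item]
        univariate[(shop, item)] = series     # one series: t1 a1 ... tk ak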
16
Method 1 for Task 1:
Exponential Smoothing
• Univariate time series as input (LE1'),
• incremental method:
current hypothesis h and new observation o yield the next
hypothesis by h := (1 - λ) h + λ o,
where λ is given by the user,
• predicts the sales of the n-th next week by the last h.
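A minimal sketch of this method in Python, assuming the sales of one item in one shop are given as a plain list; the function name is hypothetical and the update uses the standard simple-smoothing form described above.

# Sketch of method 1: simple exponential smoothing, applied incrementally.
def exponential_smoothing_forecast(sales, lam=0.3):
    """Update h := (1 - lam) * h + lam * o for each new observation o;
    the last h is used as the prediction for any horizon."""
    h = sales[0]
    for o in sales[1:]:
        h = (1 - lam) * h + lam * o
    return h

# Example: forecast = exponential_smoothing_forecast([4, 6, 5, 7, 3], lam=0.3)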
17
Method 2 for Task 1:
SVM in the Regression Mode
• Multiple learning:
for each shop and each item, the support vector machine
learns a function which is then used for prediction.
• Asymmetric loss:
– underestimation is multiplied by 20,
i.e. 3 sales too few predicted -- 60 loss
– overestimation is counted as it is,
i.e. 3 sales too many predicted -- 3 loss
(Stefan Rüping 1999)
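The asymmetric loss can be made concrete with a small sketch; the factor 20 comes from this slide, everything else (function name, parameters) is an assumption for illustration.

# Sketch of the asymmetric evaluation loss described above.
def asymmetric_loss(y_true, y_pred, under_weight=20.0):
    """Each unit of underestimation costs under_weight, each unit of
    overestimation costs 1."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if p < t:                      # predicted too few sales
            total += under_weight * (t - p)
        else:                          # predicted too many sales
            total += p - t
    return total

# Example: asymmetric_loss([10], [7]) == 60.0, asymmetric_loss([10], [13]) == 3.0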
18
Further Preprocessing
• Obtaining many vectors from one series by sliding
windows
LH5: i: t1 a1 ... tw aw
move a window of size w by m steps
Dm1_Item1_1      1   4  ...   5  7
Dm1_Item1_2      2   4  ...   6  8
...
Dm1_Item1_100    100 6  ... 104  9
...
Dm20_Item50_100  100 12 ... 104 16
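A minimal sketch of the windowing step, assuming the series is a plain list of weekly sales; window size w, step m and the prediction horizon correspond to the parameters named above, while the function name is illustrative.

# Sketch of the sliding-window preprocessing: one (window, target) example
# per window position; the target lies `horizon` weeks after the window.
def sliding_windows(series, w=5, m=1, horizon=4):
    examples = []
    last_start = len(series) - w - horizon
    for start in range(0, last_start + 1, m):
        window = series[start:start + w]
        target = series[start + w + horizon - 1]
        examples.append((window, target))
    return examples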
19
Article 766933 (bag?)
[Plot: sales of article 766933 over time, marking the sliding window, the prediction horizon, and the value to predict.]
20
Comparison with
Exponential Smoothing
horizon   SVM      exp. smoothing
1         56.764    52.40
2         57.044    59.04
3         57.855    65.62
4         58.670    71.21
8         60.286    88.44
13        59.475   102.24
21
[Plot: loss as a function of the prediction horizon.]
22
Learning Task 2:
Learning Sequences
Are there typical sequences that are valid for all items?
– After a promotional action for an item, its sales decrease.
– Each decrease of sales is followed by an increase.
• Given a set of subsequent events
find frequent sequences.
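As a much simplified stand-in for a frequent-sequence miner, the sketch below only counts how often one trend label directly follows another across all item series; all names and the support threshold are illustrative, not the method used in the case.

# Sketch: support of two-element sequences (label a directly followed by b).
from collections import Counter

def frequent_pairs(event_sequences, min_support=0.5):
    counts = Counter()
    for labels in event_sequences:                 # e.g. ["increasing", "stable", ...]
        for pair in set(zip(labels, labels[1:])):  # count each pair once per item
            counts[pair] += 1
    n = len(event_sequences)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}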
23
From Sales Data to
Event Sequences
Multivariate time series
→ Univariate time series
→ ? → Subsequent events
→ ? → LHn-1 = LEn
→ ? → LHn: frequent event sequences
24
From Series to Sequences
• Given some time series
detect events (states, intervals)
An event is a triple (state, begin, finish).
The state might be a label or a (mean) value.
Typical labels are: increase, decrease, stable...
25
Unsupervised Methods
• All contiguous observations within one level (range)
form one event (Bauer).
• All contiguous observations with more or less the same
gradient form one event (Morik, Wessel).
• Clusters of subsequences form events (Das).
26
Moving Gradient
Determining the time intervals with a user-given tolerance threshold.
Abstracting into classes of gradients: increase, peak, decrease, stable, ...
[Sketch: observations over time points 1 to 12 grouped into gradient intervals (boundaries roughly at 1, 4, 5, 7, 10, 12).]
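A rough Python sketch of the moving-gradient summarization: label each one-step gradient within the user-given tolerance and merge runs of equal labels. The thresholds and names are illustrative, not the original implementation by Morik and Wessel.

# Sketch: summarize a series into (state, begin, finish, change) events.
def moving_gradient_events(series, tolerance=1.0):
    def label(delta):
        if delta > tolerance:
            return "increasing"
        if delta < -tolerance:
            return "decreasing"
        return "stable"

    if len(series) < 2:
        return []
    events, begin = [], 0
    current = label(series[1] - series[0])
    for t in range(2, len(series)):
        nxt = label(series[t] - series[t - 1])
        if nxt != current:                         # gradient class changes here
            events.append((current, begin, t - 1, series[t - 1] - series[begin]))
            begin, current = t - 1, nxt
    events.append((current, begin, len(series) - 1, series[-1] - series[begin]))
    return events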
27
Sales of Item 182830 in Shop 55
28
Summarizing Sales by
Tolerant Moving Gradient
(Wessel, Morik 1999)
29
From Subsequent Events
to Event Sequences
Multivariate time series
→ Univariate time series
→ moving gradient → Subsequent events
→ ? → LHn-1 = LEn
→ ? → LHn: frequent event sequences
30
Transformation into Facts
LE4:
stable(182830, 1, 33, 0).
decreasing(182830, 33, 34, -6).
stable(182830, 34, 39, 0).
increasing(182830, 39, 40, 7).
decreasing(182830, 40, 42, -5).
stable(182830, 42, 108, 0).
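The step from detected events to facts like these can be sketched as simple string formatting; the event representation assumed here is the one from the moving-gradient sketch earlier, and the function name is hypothetical.

# Sketch: write (state, begin, finish, change) events as Prolog-style facts.
def events_to_facts(item_id, events):
    return ["%s(%s,%s,%s,%s)." % (state, item_id, b, f, change)
            for state, b, f, change in events]

# Example: events_to_facts(182830, [("stable", 1, 33, 0), ("decreasing", 33, 34, -6)])
# -> ["stable(182830,1,33,0).", "decreasing(182830,33,34,-6)."]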
31
Summarizing Item 646152 in Shop 55
by Intolerant Moving Gradient
32
Corresponding Facts
increasing(646152, 1, 2, 3).
decreasing(646152, 2, 3, -11).
increasingPeak(646152, 3, 4, 22).
...
stable(646152, 25, 37, 0).
increasing(646152, 37, 38, 8).
decreasing(646152, 38, 39, -7).
stable(646152, 39, 40, 0).
increasing(646152, 40, 41, 7).
decreasing(646152, 41, 42, -8).
increasing(646152, 42, 43, 10).
stable(646152, 43, 48, -1).
Note the small time intervals.
33
Method 3 for Task 2:
Inductive Logic Programming
• Rules about sequences:
p1(I, Tb, Te, Ar), p2(I, Te, Te2, As)
→ p3(I, Te2, Te3, At)
• Results for sequences of sales trends:
increasing(Item, Tb, Te) → decreasing(Item, Te, Te2)
increasing(Item, Tb, Te), decreasing(Item, Te, Te2)
→ stable(Item, Te2, Te3)
34
Same Data -- Several Cases
• Predict sales of a particular item in a particular
shop
multivariate to univariate, multiple exponential smoothing OR
multivariate to univariate, sliding windows, multiple learning
with regression SVM
• Find relations between trends that are valid for
all sales in all shops
multivariate to univariate, summarizing, transformation into
facts, rule learning
35
Applications in Intensive Care
• On-line monitoring of intensive care patients
• High-dimensional data about the patient and the medication
• Measured every minute
• Stored in the Emtec database of patient records
• Learning when to intervene in which way
36
Patient G.C., male, 60 years old
Right hemihepatectomy
37
The Data
LE DB2:  i1: t1 a11 ... a1k
         i1: t2 a21 ... a2k
         ...
         i2: t1 a11 ... a1k
         ...
a set of rows for each patient: one row per minute
38
Preprocessing
• Chaining database rows
  i1: t1 a11 ... a1k, t2 a21 ... a2k, ...
• Multivariate to univariate
  i1: t1 a1, t2 a1, ..., tm a1
  i1: t1 a2, t2 a2, ..., tm a2
  ...
• Detecting level changes and outliers
39
Phase State Analysis
Time series y1, ..., yN  →  phase state: yt ↦ (yt, yt+1)

[Plots: yt over time t and yt+1 versus yt for a deterministic process, an AR(1) process with an additive outlier (AO), and a heart rate series HRt.]

(U. Gather, M. Bauer)
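The phase-state construction itself is a one-liner; a minimal sketch with an illustrative function name:

# Sketch: map a series y1 ... yN to its phase-state pairs (yt, yt+1).
def phase_state(series):
    return list(zip(series, series[1:]))

# Example: phase_state([80, 82, 81, 95]) -> [(80, 82), (82, 81), (81, 95)]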
40
Level Change Detection
level_change(pat4999, 50, 112, hr, up)
level_change(pat4999, 112, 164, hr, down)
level_change(pat4999, 10, 74, art, constant)
level_change(pat4999, 74, 110, art, down)
Computed feature:
comparing the norm value of a vital sign with its mean in a
time interval (± one standard deviation):
deviation(pat4999, 10, 74, art, up)
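A sketch of this computed feature, assuming the norm value and its standard deviation come from a separate table of norm values; the function name and thresholding details are illustrative.

# Sketch: compare the interval mean of a vital sign with its norm value
# (norm +/- one standard deviation) and emit a deviation fact, if any.
def deviation_fact(patient, begin, finish, sign, values, norm, norm_sd):
    mean = sum(values) / len(values)
    if mean > norm + norm_sd:
        direction = "up"
    elif mean < norm - norm_sd:
        direction = "down"
    else:
        return None                      # within the norm range: no deviation
    return "deviation(%s,%d,%d,%s,%s)." % (patient, begin, finish, sign, direction)

# Example: deviation_fact("pat4999", 10, 74, "art", [95, 102, 98], 85, 10)
# -> "deviation(pat4999,10,74,art,up)."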
41
Learning Task 3:
Recommend Interventions for Patients
Are there valid rules
for all multivariate time series,
such that therapeutic interventions follow from a
patient’s state?
42
Method 3:
Inductive Logic Programming
Given patient records in the form of facts:
• deviations -- time intervals
• therapeutic interventions -- time points
• types of vital signs (group1: hr, swi, co; group2: art, vr)
Learn rules about interventions:
group1(V), deviation(P, T1, T2, V, Dir)
→ noradrenaline(P, T2, Dir)
43
The Chain of Preprocessing Steps
LE DB2:
  i1: t1 a11 ... a1k
  i1: t2 a21 ... a2k
  ...
  i2: t1 a11 ... a1k
    | chaining db rows
  i1: t1 a11 ... ak1, t2 a12 ... ak2, ...
  i2: t1 a11 ... ak1, t2 a12 ... ak2, ...
    | multi- to univariate
  i1: t1 a11, t2 a12, ...
  i1: t1 a21, t2 a22, ...
  ...
    | level changes
  (i1, ti, tj, A)
  ...
    | computed feature
  (i1, ti, tj, A, D)
    | relational learning
  p1(I, Ti, Tj, A, D), p2(I, Tj, Tk, A, D)
  → p3(I, Tk, Dir)
44
Learning Task 4:
Predict Next Minute's
Intervention
Given a patient’s state at time ti,
learn whether and how to intervene at ti+1
Preprocessing:
• Selection of time points where an intervention was done
• Multiple to binary class
for each drug, form the concepts drug_up, drug_down
• Multiple learning for each binary class resulting in
classifiers for each drug and direction of dose change (SVM_light)
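A sketch of the multiple-to-binary-class step, assuming each selected time point is given as a pair of (feature vector, recorded dose changes per drug); the record format and names are assumptions, and the resulting example sets would then be handed to SVM_light.

# Sketch: form the binary concepts <drug>_up and <drug>_down from the
# intervention time points; one classifier is then learned per concept.
def binary_class_examples(records, drug):
    up, down = [], []
    for features, dose_changes in records:    # dose_changes: {drug: signed change}
        change = dose_changes.get(drug, 0)
        up.append((features, +1 if change > 0 else -1))
        down.append((features, +1 if change < 0 else -1))
    return up, down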
45
The Chain of Preprocessing Steps
LE DB2:
  i1: t1 a11 ... a1k
  i1: t2 a21 ... a2k
  ...
  i2: t1 a11 ... a1k
    | select time points with interventions
  i1: ti a1i ... aki
  i2: tj a1j ... akj
  ...
    | form binary classes
  a1_up+:   a2 ... ak
  a1_up-:   a2 ... ak
  ...
  a6_down+: a2 ... ak
  a6_down-: a2 ... ak
    | learning classifiers using SVM_light
  a1_up+:   w2 a2 ... wk ak
  ...
  a6_down+: w2 a2 ... wk ak
46
Same Data -- Several Cases
• Find time relations that express therapy protocols
chaining db rows, multivariate to univariate, level changes, deviations, RDT
• Predict intervention for a particular drug
select time points, multiple to binary class, SVM_light
47
Behind the Boxes
• DB schema indicating time attribute(s), granularity, ...
• Select statement in abstract form, instantiated by the db schema
• Creating views in abstract form, instantiated by the db schema and the learning task
• Syntactic transformation for the SVM
• Multiple learning control
• Calling SVM_light and writing results
48
Functionality of MD-Compiler
Manual preprocessing operators of M4 are very elementary.
Results of operators are mostly views.
[Diagram: base tables Table_x, Table_y, Table_z are turned into views created by the MD-Compiler; a RowSelection on each base table yields V_01, V_02, V_03; a MultiColumnFeatureConstruction on V_01 adds virtual columns vc_1, vc_2, vc_3; a FeatureSelection yields V_04, and a further RowSelection yields V_05.]
49
Several view definitions
Inline View versus Physical View-Object
CREATE VIEW V_01 (attrib_a, attrib_b, attrib_c) AS
SELECT x_id, x_a, x_b
FROM Table_x;

Inline view: stored as an SQL string in M4-Relational; its information comes from M4-Relational (BaseAttribute).
Physical view object: created by the MD-Compiler for reading data and executing statistics.
50
Several view definitions
Inline View versus Physical View-Object
[Diagram repeated from "Functionality of MD-Compiler": base tables, RowSelection views V_01 .. V_03, MultiColumnFeatureConstruction, FeatureSelection V_04, RowSelection V_05.]

CREATE VIEW V_05 (...) AS
SELECT ... FROM (SELECT ... FROM (SELECT ... FROM Table_x))

instead of

CREATE VIEW V_05 (...) AS SELECT ... FROM V_04;
51
Several view definitions
Materialized view: created by the MD-Compiler automatically in the background.
+ performance gain when selecting data from V_04 or V_07
+ all operation outputs can be realized as views
- additional storage needed

[Diagram: base tables Table_x, Table_y, Table_z feed views V_01 .. V_03; a materialized view Materialized_View_1 is inserted into the chain, from which views V_04 .. V_07 are derived.]
52
System-Architecture
[Diagram: system architecture. Components: business tables T1 .. T6; the Mining Mart database holding the M4-Relational Model, the M4-Conceptual Model and the M4-Case Model; an M4Relation editor, an M4Concept editor and an M4-Case editor (Java code, the case editor from UniDo); the MD Compiler; statistics in PL/SQL; and the time operators.]
53
Summary of Cases
Involving Time
DB schema indicating time attribute(s)
Syntactic transformations: sliding windows, summarizing windows, level changes, ...
Resulting representations LE1 ... LE4
Learning methods: inductive logic programming (RDT), SVM_light for classification, SVM for regression, exponential smoothing
54
Summary
• Preprocessing is the key issue in data analysis!
• Goal: support users in making intelligent choices
• Approach: cases of best practice
• View of a computer scientist:
– Scalability to very large databases
– Meta-data driven processing
• Case studies on analysing data involving time
55
MiningMart Approach
• Manager -- end-user
knows about the business
case
• Database manager
knows about the data
• Case designer -- power-user
expert in KDD
• Developer
supplies (learning) operators
56