Download Model Maintenance in Dynamic Environments

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Cluster analysis wikipedia , lookup

Mixture model wikipedia , lookup

Transcript
Model Maintenance in
Dynamic Environments
Venkatesh Ganti
(Joint work with Raghu Ramakrishnan,
Johannes Gehrke, Mong Li Lee)
Mining Environment
•
•
Data Mining
……
……
Data
Warehouse
Data repository for analysis
•
OLAP
•
•
DEMON
Data mining models
• Frequent itemsets
• Decision trees
• Clusters
• …
OLAP
• Aggregate queries
Repository updated regularly
Query workloads change
2
Two Parts of this Talk
1. Model Maintenance: Maintaining models
under systematic data evolution [ICDE 00]
2. Tuning samples: Maintaining samples for
approximate query answering with respect
to changing query workloads [VLDB00]
DEMON
3
Systematic Block Evolution
•
•
Data warehouses are updated with blocks of new data
Block: a set of tuples appended simultaneously to the data
warehouse
D
d
…
D+d
Result: a sequence of database snapshots
DEMON
4
Model Maintenance: Objective
•
•
•
Allow selection of interesting timevarying subsets to be modeled
Low response time to get the updated
model
Interesting classes of models
•
•
•
DEMON
Frequent itemsets (LITS)
Clusters
Decision trees (DT)
5
Subset selection: Data Span
•
Span of interest
•
•
Everything until now—Unrestricted window
Recently collected—Most recent window
• Unrestricted Window (UW)
• Model the entire database
DEMON
D1 D2 D3
M(D1+D2+D3)
D1 D2 D3 D4
M(D1+D2+D3+D4)
6
Data Span (contd.)
•
Most Recent Window (MRW) of size w
E.g., model data collected in the last 3
days
Sliding Windows
Models
•
DEMON
D1 D2 D3
M(D1+D2+D3)
D1 D2 D3 D4
M(D2+D3+D4)
D1
M(D3+D4+D5)
D2 D3
D4 D5
7
Block Selection Sequence
•
•
•
Maintain models on data collected on
alternate days within the last 4 weeks
Require fine granular selection
Block selection sequence (BSS)
•
A 0/1 sequence: a bit for each block in the
data span
•
•
DEMON
1--the block is selected for modeling
0--the block is not selected for modeling
8
BSS: UW
•
A sequence of 0/1 bits, one for each block
in the entire database
• E.g., select all blocks collected on alternate days
1
D1
DEMON
0 1
0 1
D2 D3 D4 D5
9
BSS: MRW
Two types of BSS w.r.t. MRW
•
1.
2.
Window-independent
Window-relative
Model data collected on Mondays within the
last 4 weeks
•
•
BSS: (1000000)*
1 0…0
1
D1 D2-D7 D8
M(D1+D8)
DEMON
0…0 1 ...
D9-D14
1 0…0
1
D1 D2-D7 D8
0…0 1 ...
D9-D14 D15
M(D1+D8+D15)
10
BSS: MRW (contd.)
•
•
•
Window-relative BSS
Model all data collected on alternate days
from the start in a window of size 3
BSS: 101
D1 D2 D3
[1
0
1]
D1 D2 D3
[1
0
D4
1]
D1 D2 D3 D4
[1
D5
0
1]
Here, each successive subset is disjoint from its predecessor
DEMON
11
Model Maintenance: Enumeration
LITS
Clustering
DT
UW:BSS
MRW:BSS
Includes both window-independent and
window-relative block selection sequences.
DEMON
12
Model Maintenance: Algorithms
LITS
UW:BSS
Clustering
DT
GEMM(A)
MRW:BSS
GEMM: GEneric Model Maintenance Algorithm for any
class of models that has an incremental maintenance
algorithm A under tuple insertions
DEMON
13
Maintenance under Insertions
•
Algorithm A
•
•
•
Such algorithms exist for
•
•
•
•
DEMON
Input: old dataset D, old model M(D), a block of
tuples d appended to D
Output: M(D+d) = A(D, d, M(D))
Frequent itemsets (ECUT, ECUT+, BORDERS, FUP)
Clusters (BIRCH)
Decision trees (BOAT)
Note: We do NOT require A to handle
deletions!
14
GEMM
•
Input
•
•
•
•
Output
•
DEMON
Data span (and window size for MRW)
BSS
A model-update algorithm A under tuple
insertions (deletions not required)
An efficient model maintenance algorithm
15
GEMM:MRW
•
Assume BSS is a sequence of 1’s and w=3
T3
D1 D2 D3
M(D1+D2+D3)
T4
D1 D2 D3 D4
M(D2+D3+D4)
T5
D1 D2 D3 D4 D5
M(D3+D4+D5)
• We already know parts of future windows
DEMON
16
GEMM: MRW (contd.)
Idea: Start building models for future windows
E.g.,: At T3, we maintain models on
<D1 + D2 + D3> (model required for window at T3)
<D2 + D3>
(partial model for window at T4)
<D3>
(partial model for window at T5)
Models at T3
M<D1 + D2 + D3>
M<D2 + D3>
M<D3>
DEMON
Models at T4
Immediate
Offline
M<D2 + D3 + D4> (for window at T2)
M<D3 + D4>
(for window at T3)
M<D4>
(for window at T4)
17
GEMM: Arbitrary BSS
1 0 1 0 1 ...
DD11 D2 D3 D4 D5
T3: Model on <1.D1 + 0.D2 + 1.D3>
T4: D4 is appended
Model on <0.D2 + 1.D3 + 0.D4>
T5: D5 is appended
Model on <1.D3 + 0.D4 + 1.D5>
Idea: We still know parts of future windows and the
corresponding BSS for each of them
E.g.,: At T3, we maintain models on
<1.D1 + 0.D2 + 1.D3> (model required at T3)
<0.D2 + 1.D3>
identical
<1.D3>
DEMON
18
GEMM: Resource Requirements
•
Response time to new model
•
Updating one model with the new block
•
•
•
Depends on the incremental algorithm
Space requirements
•
•
DEMON
Other updates offline
At most w models
Space required for a model is orders of
magnitude less than that for data!
19
Maintenance under Insertions
•
Algorithm A
•
•
•
Such algorithms exist for
•
•
•
•
DEMON
Input: old dataset D, old model M(D), a block of
tuples d appended to D
Output: M(D+d) = A(D, d, M(D))
Frequent itemsets (ECUT, ECUT+, BORDERS, FUP)
Clusters (BIRCH)
Decision trees (BOAT)
Note: We do NOT require A to handle
deletions!
20
Frequent Itemset Models
•
•
Set of customer transactions
Frequent itemset: a set of items purchased
together by “many” customers
TID a b c
1
2
3
DEMON
1 0 1
1 1 1
0 1 0
Minimum frequency threshold = 50%
{b}, {c}, {a,c} are frequent itemsets
21
Incremental Algorithm
[FAAM97,TBAR97]
D1 D2 D3
•
Input
•
•
•
•
Old dataset
Old set of frequent itemsets
New block D4
Steps
•
•
Detect if new itemsets become frequent
Count frequencies of a small number of itemsets
•
•
DEMON
D4
Current algorithms scan (D1+D2+D3) completely
Update model…
22
ECUT—New Counting Algorithm
•
•
Transformed data representation
Within each block Di
•
item x: sorted list of transaction identifiers
containing “x”—TID-list(x)
TID a b c
1
2
3
DEMON
1 0 1
0 1 1
0 1 0
TID-list(a) = {1}
TID-list(b) = {2,3}
TID-list(c) = {1,2}
Count({a,b}) = |TID-list(a) intersection TID-list(b)|
23
Experimental Comparison
Time to Update Model
ECUT
SCAN-TO-COUNT
50
45
Time (in seconds)
40
35
30
25
20
15
10
5
0
10
25
50
75
100
150
200
400
New Block Size (in 1000's)
DEMON
24
Comparing Count Times
Counting Times
400
350
Time (in Sec)
300
250
4M:ECUT
200
4M:SCAN
150
100
2M:ECUT
2M:SCAN
50
4M:ECUT+
2M:ECUT+
0
5
DEMON
15
25
35
45
55
65
75
85
95
105
#Itemsets
115
125
135
145
155
165
175
185
25
Summary of the first part
LITS
UW:BSS
Clustering
DT
GEMM
MRW:BSS
DEMON
Maintenance algorithms under tuple insertions
•Frequent itemsets
•ECUT, ECUT+
•Clusters
•BIRCH
•Decision Trees
•BOAT
26
Second Part of this Talk
1. Model Maintenance: Maintaining models
under systematic data evolution [ICDE 00]
2. Maintaining samples with respect to
changing query workloads [VLDB00]
DEMON
27
Random Samples for AQUA
Agg. query Q
Typical AQUA
approach
Exact answer
R
Approx. answer
Uniform
Random
sample
•
•
•
DEMON
S(R)
All tuples in R are assumed to be equally
important while drawing S(R)
In practice, queries exhibit locality
Consequence: S(R) wastes precious real estate
28
Problem
•
Given
•
•
•
•
DEMON
Relation R
Workload W: Q1,…,Qn
Goal: Dynamically tune “random sample of
R” w.r.t. W
Model to be maintained: a simple random
sample
29
ICICLES
•
•
R(Q): set of tuples in R required to answer Q
Random sample of R U R(Q1) U … U R(Qn)
•
Tuples required often are more likely to be in SW(R)
R
R(Q1)
Uniform
Random
Sample
SW(R)
ICICLE
DEMON
R(Qn)
30
Mail Order Dataset
AVG queries on a mail order dataset
Icicle
Icicle-complete
Static sample
0.3
Relative Error
0.25
0.2
0.15
0.1
0.05
0
1
2 3
4
5 6
7
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
#Queries
DEMON
31
Conclusions and Future Work
Static dataset Dynamic dataset
Workload
indifferent
Workload
sensitive
DEMON
Most DM
literature
ICICLES
+
??
DEMON
+
Related Effort
??
32
•
DEMON
Questions?
33