Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Model Maintenance in Dynamic Environments Venkatesh Ganti (Joint work with Raghu Ramakrishnan, Johannes Gehrke, Mong Li Lee) Mining Environment • • Data Mining …… …… Data Warehouse Data repository for analysis • OLAP • • DEMON Data mining models • Frequent itemsets • Decision trees • Clusters • … OLAP • Aggregate queries Repository updated regularly Query workloads change 2 Two Parts of this Talk 1. Model Maintenance: Maintaining models under systematic data evolution [ICDE 00] 2. Tuning samples: Maintaining samples for approximate query answering with respect to changing query workloads [VLDB00] DEMON 3 Systematic Block Evolution • • Data warehouses are updated with blocks of new data Block: a set of tuples appended simultaneously to the data warehouse D d … D+d Result: a sequence of database snapshots DEMON 4 Model Maintenance: Objective • • • Allow selection of interesting timevarying subsets to be modeled Low response time to get the updated model Interesting classes of models • • • DEMON Frequent itemsets (LITS) Clusters Decision trees (DT) 5 Subset selection: Data Span • Span of interest • • Everything until now—Unrestricted window Recently collected—Most recent window • Unrestricted Window (UW) • Model the entire database DEMON D1 D2 D3 M(D1+D2+D3) D1 D2 D3 D4 M(D1+D2+D3+D4) 6 Data Span (contd.) • Most Recent Window (MRW) of size w E.g., model data collected in the last 3 days Sliding Windows Models • DEMON D1 D2 D3 M(D1+D2+D3) D1 D2 D3 D4 M(D2+D3+D4) D1 M(D3+D4+D5) D2 D3 D4 D5 7 Block Selection Sequence • • • Maintain models on data collected on alternate days within the last 4 weeks Require fine granular selection Block selection sequence (BSS) • A 0/1 sequence: a bit for each block in the data span • • DEMON 1--the block is selected for modeling 0--the block is not selected for modeling 8 BSS: UW • A sequence of 0/1 bits, one for each block in the entire database • E.g., select all blocks collected on alternate days 1 D1 DEMON 0 1 0 1 D2 D3 D4 D5 9 BSS: MRW Two types of BSS w.r.t. MRW • 1. 2. Window-independent Window-relative Model data collected on Mondays within the last 4 weeks • • BSS: (1000000)* 1 0…0 1 D1 D2-D7 D8 M(D1+D8) DEMON 0…0 1 ... D9-D14 1 0…0 1 D1 D2-D7 D8 0…0 1 ... D9-D14 D15 M(D1+D8+D15) 10 BSS: MRW (contd.) • • • Window-relative BSS Model all data collected on alternate days from the start in a window of size 3 BSS: 101 D1 D2 D3 [1 0 1] D1 D2 D3 [1 0 D4 1] D1 D2 D3 D4 [1 D5 0 1] Here, each successive subset is disjoint from its predecessor DEMON 11 Model Maintenance: Enumeration LITS Clustering DT UW:BSS MRW:BSS Includes both window-independent and window-relative block selection sequences. DEMON 12 Model Maintenance: Algorithms LITS UW:BSS Clustering DT GEMM(A) MRW:BSS GEMM: GEneric Model Maintenance Algorithm for any class of models that has an incremental maintenance algorithm A under tuple insertions DEMON 13 Maintenance under Insertions • Algorithm A • • • Such algorithms exist for • • • • DEMON Input: old dataset D, old model M(D), a block of tuples d appended to D Output: M(D+d) = A(D, d, M(D)) Frequent itemsets (ECUT, ECUT+, BORDERS, FUP) Clusters (BIRCH) Decision trees (BOAT) Note: We do NOT require A to handle deletions! 14 GEMM • Input • • • • Output • DEMON Data span (and window size for MRW) BSS A model-update algorithm A under tuple insertions (deletions not required) An efficient model maintenance algorithm 15 GEMM:MRW • Assume BSS is a sequence of 1’s and w=3 T3 D1 D2 D3 M(D1+D2+D3) T4 D1 D2 D3 D4 M(D2+D3+D4) T5 D1 D2 D3 D4 D5 M(D3+D4+D5) • We already know parts of future windows DEMON 16 GEMM: MRW (contd.) Idea: Start building models for future windows E.g.,: At T3, we maintain models on <D1 + D2 + D3> (model required for window at T3) <D2 + D3> (partial model for window at T4) <D3> (partial model for window at T5) Models at T3 M<D1 + D2 + D3> M<D2 + D3> M<D3> DEMON Models at T4 Immediate Offline M<D2 + D3 + D4> (for window at T2) M<D3 + D4> (for window at T3) M<D4> (for window at T4) 17 GEMM: Arbitrary BSS 1 0 1 0 1 ... DD11 D2 D3 D4 D5 T3: Model on <1.D1 + 0.D2 + 1.D3> T4: D4 is appended Model on <0.D2 + 1.D3 + 0.D4> T5: D5 is appended Model on <1.D3 + 0.D4 + 1.D5> Idea: We still know parts of future windows and the corresponding BSS for each of them E.g.,: At T3, we maintain models on <1.D1 + 0.D2 + 1.D3> (model required at T3) <0.D2 + 1.D3> identical <1.D3> DEMON 18 GEMM: Resource Requirements • Response time to new model • Updating one model with the new block • • • Depends on the incremental algorithm Space requirements • • DEMON Other updates offline At most w models Space required for a model is orders of magnitude less than that for data! 19 Maintenance under Insertions • Algorithm A • • • Such algorithms exist for • • • • DEMON Input: old dataset D, old model M(D), a block of tuples d appended to D Output: M(D+d) = A(D, d, M(D)) Frequent itemsets (ECUT, ECUT+, BORDERS, FUP) Clusters (BIRCH) Decision trees (BOAT) Note: We do NOT require A to handle deletions! 20 Frequent Itemset Models • • Set of customer transactions Frequent itemset: a set of items purchased together by “many” customers TID a b c 1 2 3 DEMON 1 0 1 1 1 1 0 1 0 Minimum frequency threshold = 50% {b}, {c}, {a,c} are frequent itemsets 21 Incremental Algorithm [FAAM97,TBAR97] D1 D2 D3 • Input • • • • Old dataset Old set of frequent itemsets New block D4 Steps • • Detect if new itemsets become frequent Count frequencies of a small number of itemsets • • DEMON D4 Current algorithms scan (D1+D2+D3) completely Update model… 22 ECUT—New Counting Algorithm • • Transformed data representation Within each block Di • item x: sorted list of transaction identifiers containing “x”—TID-list(x) TID a b c 1 2 3 DEMON 1 0 1 0 1 1 0 1 0 TID-list(a) = {1} TID-list(b) = {2,3} TID-list(c) = {1,2} Count({a,b}) = |TID-list(a) intersection TID-list(b)| 23 Experimental Comparison Time to Update Model ECUT SCAN-TO-COUNT 50 45 Time (in seconds) 40 35 30 25 20 15 10 5 0 10 25 50 75 100 150 200 400 New Block Size (in 1000's) DEMON 24 Comparing Count Times Counting Times 400 350 Time (in Sec) 300 250 4M:ECUT 200 4M:SCAN 150 100 2M:ECUT 2M:SCAN 50 4M:ECUT+ 2M:ECUT+ 0 5 DEMON 15 25 35 45 55 65 75 85 95 105 #Itemsets 115 125 135 145 155 165 175 185 25 Summary of the first part LITS UW:BSS Clustering DT GEMM MRW:BSS DEMON Maintenance algorithms under tuple insertions •Frequent itemsets •ECUT, ECUT+ •Clusters •BIRCH •Decision Trees •BOAT 26 Second Part of this Talk 1. Model Maintenance: Maintaining models under systematic data evolution [ICDE 00] 2. Maintaining samples with respect to changing query workloads [VLDB00] DEMON 27 Random Samples for AQUA Agg. query Q Typical AQUA approach Exact answer R Approx. answer Uniform Random sample • • • DEMON S(R) All tuples in R are assumed to be equally important while drawing S(R) In practice, queries exhibit locality Consequence: S(R) wastes precious real estate 28 Problem • Given • • • • DEMON Relation R Workload W: Q1,…,Qn Goal: Dynamically tune “random sample of R” w.r.t. W Model to be maintained: a simple random sample 29 ICICLES • • R(Q): set of tuples in R required to answer Q Random sample of R U R(Q1) U … U R(Qn) • Tuples required often are more likely to be in SW(R) R R(Q1) Uniform Random Sample SW(R) ICICLE DEMON R(Qn) 30 Mail Order Dataset AVG queries on a mail order dataset Icicle Icicle-complete Static sample 0.3 Relative Error 0.25 0.2 0.15 0.1 0.05 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 #Queries DEMON 31 Conclusions and Future Work Static dataset Dynamic dataset Workload indifferent Workload sensitive DEMON Most DM literature ICICLES + ?? DEMON + Related Effort ?? 32 • DEMON Questions? 33