Download Model Maintenance in Dynamic Environments

Model Maintenance in Dynamic Environments Venkatesh Ganti (Joint work with Raghu Ramakrishnan, Johannes Gehrke, Mong Li Lee) Mining Environment • • Data Mining …… …… Data Warehouse Data repository for analysis • OLAP • • DEMON Data mining models • Frequent itemsets • Decision trees • Clusters • … OLAP • Aggregate queries Repository updated regularly Query workloads change 2 Two Parts of this Talk 1. Model Maintenance: Maintaining models under systematic data evolution [ICDE 00] 2. Tuning samples: Maintaining samples for approximate query answering with respect to changing query workloads [VLDB00] DEMON 3 Systematic Block Evolution • • Data warehouses are updated with blocks of new data Block: a set of tuples appended simultaneously to the data warehouse D d … D+d Result: a sequence of database snapshots DEMON 4 Model Maintenance: Objective • • • Allow selection of interesting timevarying subsets to be modeled Low response time to get the updated model Interesting classes of models • • • DEMON Frequent itemsets (LITS) Clusters Decision trees (DT) 5 Subset selection: Data Span • Span of interest • • Everything until now—Unrestricted window Recently collected—Most recent window • Unrestricted Window (UW) • Model the entire database DEMON D1 D2 D3 M(D1+D2+D3) D1 D2 D3 D4 M(D1+D2+D3+D4) 6 Data Span (contd.) • Most Recent Window (MRW) of size w E.g., model data collected in the last 3 days Sliding Windows Models • DEMON D1 D2 D3 M(D1+D2+D3) D1 D2 D3 D4 M(D2+D3+D4) D1 M(D3+D4+D5) D2 D3 D4 D5 7 Block Selection Sequence • • • Maintain models on data collected on alternate days within the last 4 weeks Require fine granular selection Block selection sequence (BSS) • A 0/1 sequence: a bit for each block in the data span • • DEMON 1--the block is selected for modeling 0--the block is not selected for modeling 8 BSS: UW • A sequence of 0/1 bits, one for each block in the entire database • E.g., select all blocks collected on alternate days 1 D1 DEMON 0 1 0 1 D2 D3 D4 D5 9 BSS: MRW Two types of BSS w.r.t. MRW • 1. 2. Window-independent Window-relative Model data collected on Mondays within the last 4 weeks • • BSS: (1000000)* 1 0…0 1 D1 D2-D7 D8 M(D1+D8) DEMON 0…0 1 ... D9-D14 1 0…0 1 D1 D2-D7 D8 0…0 1 ... D9-D14 D15 M(D1+D8+D15) 10 BSS: MRW (contd.) • • • Window-relative BSS Model all data collected on alternate days from the start in a window of size 3 BSS: 101 D1 D2 D3 [1 0 1] D1 D2 D3 [1 0 D4 1] D1 D2 D3 D4 [1 D5 0 1] Here, each successive subset is disjoint from its predecessor DEMON 11 Model Maintenance: Enumeration LITS Clustering DT UW:BSS MRW:BSS Includes both window-independent and window-relative block selection sequences. DEMON 12 Model Maintenance: Algorithms LITS UW:BSS Clustering DT GEMM(A) MRW:BSS GEMM: GEneric Model Maintenance Algorithm for any class of models that has an incremental maintenance algorithm A under tuple insertions DEMON 13 Maintenance under Insertions • Algorithm A • • • Such algorithms exist for • • • • DEMON Input: old dataset D, old model M(D), a block of tuples d appended to D Output: M(D+d) = A(D, d, M(D)) Frequent itemsets (ECUT, ECUT+, BORDERS, FUP) Clusters (BIRCH) Decision trees (BOAT) Note: We do NOT require A to handle deletions! 14 GEMM • Input • • • • Output • DEMON Data span (and window size for MRW) BSS A model-update algorithm A under tuple insertions (deletions not required) An efficient model maintenance algorithm 15 GEMM:MRW • Assume BSS is a sequence of 1’s and w=3 T3 D1 D2 D3 M(D1+D2+D3) T4 D1 D2 D3 D4 M(D2+D3+D4) T5 D1 D2 D3 D4 D5 M(D3+D4+D5) • We already know parts of future windows DEMON 16 GEMM: MRW (contd.) Idea: Start building models for future windows E.g.,: At T3, we maintain models on <D1 + D2 + D3> (model required for window at T3) <D2 + D3> (partial model for window at T4) <D3> (partial model for window at T5) Models at T3 M<D1 + D2 + D3> M<D2 + D3> M<D3> DEMON Models at T4 Immediate Offline M<D2 + D3 + D4> (for window at T2) M<D3 + D4> (for window at T3) M<D4> (for window at T4) 17 GEMM: Arbitrary BSS 1 0 1 0 1 ... DD11 D2 D3 D4 D5 T3: Model on <1.D1 + 0.D2 + 1.D3> T4: D4 is appended Model on <0.D2 + 1.D3 + 0.D4> T5: D5 is appended Model on <1.D3 + 0.D4 + 1.D5> Idea: We still know parts of future windows and the corresponding BSS for each of them E.g.,: At T3, we maintain models on <1.D1 + 0.D2 + 1.D3> (model required at T3) <0.D2 + 1.D3> identical <1.D3> DEMON 18 GEMM: Resource Requirements • Response time to new model • Updating one model with the new block • • • Depends on the incremental algorithm Space requirements • • DEMON Other updates offline At most w models Space required for a model is orders of magnitude less than that for data! 19 Maintenance under Insertions • Algorithm A • • • Such algorithms exist for • • • • DEMON Input: old dataset D, old model M(D), a block of tuples d appended to D Output: M(D+d) = A(D, d, M(D)) Frequent itemsets (ECUT, ECUT+, BORDERS, FUP) Clusters (BIRCH) Decision trees (BOAT) Note: We do NOT require A to handle deletions! 20 Frequent Itemset Models • • Set of customer transactions Frequent itemset: a set of items purchased together by “many” customers TID a b c 1 2 3 DEMON 1 0 1 1 1 1 0 1 0 Minimum frequency threshold = 50% {b}, {c}, {a,c} are frequent itemsets 21 Incremental Algorithm [FAAM97,TBAR97] D1 D2 D3 • Input • • • • Old dataset Old set of frequent itemsets New block D4 Steps • • Detect if new itemsets become frequent Count frequencies of a small number of itemsets • • DEMON D4 Current algorithms scan (D1+D2+D3) completely Update model… 22 ECUT—New Counting Algorithm • • Transformed data representation Within each block Di • item x: sorted list of transaction identifiers containing “x”—TID-list(x) TID a b c 1 2 3 DEMON 1 0 1 0 1 1 0 1 0 TID-list(a) = {1} TID-list(b) = {2,3} TID-list(c) = {1,2} Count({a,b}) = |TID-list(a) intersection TID-list(b)| 23 Experimental Comparison Time to Update Model ECUT SCAN-TO-COUNT 50 45 Time (in seconds) 40 35 30 25 20 15 10 5 0 10 25 50 75 100 150 200 400 New Block Size (in 1000's) DEMON 24 Comparing Count Times Counting Times 400 350 Time (in Sec) 300 250 4M:ECUT 200 4M:SCAN 150 100 2M:ECUT 2M:SCAN 50 4M:ECUT+ 2M:ECUT+ 0 5 DEMON 15 25 35 45 55 65 75 85 95 105 #Itemsets 115 125 135 145 155 165 175 185 25 Summary of the first part LITS UW:BSS Clustering DT GEMM MRW:BSS DEMON Maintenance algorithms under tuple insertions •Frequent itemsets •ECUT, ECUT+ •Clusters •BIRCH •Decision Trees •BOAT 26 Second Part of this Talk 1. Model Maintenance: Maintaining models under systematic data evolution [ICDE 00] 2. Maintaining samples with respect to changing query workloads [VLDB00] DEMON 27 Random Samples for AQUA Agg. query Q Typical AQUA approach Exact answer R Approx. answer Uniform Random sample • • • DEMON S(R) All tuples in R are assumed to be equally important while drawing S(R) In practice, queries exhibit locality Consequence: S(R) wastes precious real estate 28 Problem • Given • • • • DEMON Relation R Workload W: Q1,…,Qn Goal: Dynamically tune “random sample of R” w.r.t. W Model to be maintained: a simple random sample 29 ICICLES • • R(Q): set of tuples in R required to answer Q Random sample of R U R(Q1) U … U R(Qn) • Tuples required often are more likely to be in SW(R) R R(Q1) Uniform Random Sample SW(R) ICICLE DEMON R(Qn) 30 Mail Order Dataset AVG queries on a mail order dataset Icicle Icicle-complete Static sample 0.3 Relative Error 0.25 0.2 0.15 0.1 0.05 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 #Queries DEMON 31 Conclusions and Future Work Static dataset Dynamic dataset Workload indifferent Workload sensitive DEMON Most DM literature ICICLES + ?? DEMON + Related Effort ?? 32 • DEMON Questions? 33

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Model Maintenance in Dynamic Environments