Topic 4: Data Mining I
Advanced Database Technologies (Dr. N. Mamoulis)

Data mining (knowledge discovery in databases): the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.

Why Data Mining?
- The data explosion problem: automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
- We are drowning in data, but starving for knowledge!

Data Mining Applications
- Database analysis and decision support:
  - Market analysis and management: target marketing, customer relation management, market basket analysis, cross selling, market segmentation.
  - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis.
  - Fraud detection and management.
- Other applications:
  - Text mining (news groups, email, documents) and Web analysis.
  - Intelligent query answering (do you want to know more? do you want to see similar results? are you also interested in this?).

Market Analysis and Management (1)
- Where are the data sources for analysis? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
- Target marketing: find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time, e.g., conversion of a single to a joint bank account (marriage, etc.).
- Cross-market analysis: associations/correlations between product sales, and prediction based on the association information.

Market Analysis and Management (2)
- Customer profiling: data mining can tell you what types of customers buy what products (clustering or classification).
- Identifying customer requirements: identify the best products for different customers; use prediction to find what factors will attract new customers.
- Provision of summary information: various multidimensional summary reports; statistical summary information (data central tendency and variation).

Corporate Analysis and Risk Management
- Finance planning and asset evaluation: cash flow analysis and prediction; contingent claim analysis to evaluate assets; cross-sectional and time series analysis (financial-ratio, trend analysis, etc.).
- Resource planning: summarize and compare the resources and spending.
- Competition: monitor competitors and market directions; group customers into classes and use a class-based pricing procedure; set pricing strategy in a highly competitive market.
Fraud Detection and Management (1)
- Applications: widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach: use historical data to build models of fraudulent behavior, and use data mining to help identify similar instances.
- Examples:
  - Auto insurance: detect groups of people who stage accidents to collect on insurance.
  - Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network).
  - Medical insurance: detect professional patients and rings of doctors and rings of references.

Fraud Detection and Management (2)
- Detecting inappropriate medical treatment: the Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving AUD 1M/yr).
- Detecting telephone fraud: a telephone call model (destination of the call, duration, time of day or week) is used to analyze patterns that deviate from an expected norm. British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
- Retail: analysts estimate that 38% of retail shrink is due to dishonest employees.

Other Applications
- Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat.
- Astronomy: JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.
- Internet: IBM Surf-Aid applies data mining algorithms to Web access logs of market-related pages to discover customer preference and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Data Mining: A KDD Process
- Data mining is the core of the knowledge discovery process.
- [Figure: the KDD pipeline. Databases are cleaned and integrated into a Data Warehouse; selection yields the Task-relevant Data; Data Mining produces patterns, which pass through Pattern Evaluation to become knowledge.]

Steps of a KDD Process
- Learning the application domain: relevant prior knowledge and the goals of the application.
- Creating a target data set: data selection.
- Data cleaning and preprocessing (may take 60% of the effort!).
- Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation.
- Choosing the functions of data mining: summarization, classification, regression, association, clustering.
- Choosing the mining algorithm(s).
- Data mining: the search for patterns of interest.
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use of the discovered knowledge.

Why Data Preprocessing?
- Data in the real world is dirty:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data;
  - noisy: containing errors or outliers;
  - inconsistent: containing discrepancies in codes or names.
- No quality data, no quality mining results! Quality decisions must be based on quality data, and a data warehouse needs consistent integration of quality data.

Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
- Data integration: integration of multiple databases, data cubes, or files.
- Data transformation: normalization and aggregation.
- Data reduction: obtain a reduced representation in volume that produces the same or similar analytical results.
- Data discretization: part of data reduction, but of particular importance for numerical data.

Data Cleaning
- Data cleaning tasks: fill in missing values; identify outliers and smooth out noisy data; correct inconsistent data.

  Age | Income | Religion  | Gender
  ----|--------|-----------|-------
  23  | 24,200 | Muslim    | M
  39  | ?      | Christian | F
  45  | 45,390 | ?         | F

- Fill missing values using aggregate functions (e.g., the average) or probabilistic estimates over the global value distribution; a small sketch follows.
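As a concrete illustration of the aggregate-function approach, here is a minimal plain-Python sketch of mean imputation. The records mirror the table above, and the helper name fill_missing_with_mean is illustrative, not from the slides:

```python
# Minimal sketch: fill a missing numeric value with the column mean.
# None marks a missing value, as the "?" does in the table above.
records = [
    {"age": 23, "income": 24200, "religion": "Muslim",    "gender": "M"},
    {"age": 39, "income": None,  "religion": "Christian", "gender": "F"},
    {"age": 45, "income": 45390, "religion": None,        "gender": "F"},
]

def fill_missing_with_mean(rows, attr):
    """Replace None values of a numeric attribute by the mean of the known values."""
    known = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(known) / len(known)
    for r in rows:
        if r[attr] is None:
            r[attr] = mean
    return rows

fill_missing_with_mean(records, "income")
print(records[1]["income"])  # (24200 + 45390) / 2 = 34795.0
```

Categorical gaps (the missing Religion) would instead take the most frequent value or a probabilistic estimate, which this sketch does not cover.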
Data Integration
- Data integration combines data from multiple sources into a coherent store.
- Schema integration: integrate metadata from the different sources.
- Entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#.
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ; possible reasons are different representations or different scales, e.g., metric vs. British units.

Data Integration Example
- Data source 1 (salaries in GBP):

  ekey  | age | salary | gender
  10002 | 23  | 24,200 | M
  10003 | 39  | 32,000 | F
  10004 | 45  | 40,000 | F

- Data source 2 (salaries in USD):

  ekey  | age | salary | gender
  30004 | 43  | 60,234 | M
  30020 | 23  | 47,000 | F
  30025 | 56  | 93,030 | M

- Integration steps: 1. integrate the attribute names; 2. convert the data types (1 GBP = 1.5 USD).
- Integrated data:

  Emp-id | Age | Income | Gender
  10002  | 23  | 36,300 | M
  10003  | 39  | 48,000 | F
  10004  | 45  | 60,000 | F
  30004  | 43  | 60,234 | M
  30020  | 23  | 47,000 | F
  30025  | 56  | 93,030 | M

Data Transformation
- Smoothing: remove noise from the data.
- Aggregation: summarization, data cube construction.
- Generalization: concept hierarchy climbing.
- Normalization: scale values to fall within a small, specified range: min-max normalization; z-score normalization; normalization by decimal scaling (a sketch follows this list).
- Attribute/feature construction: new attributes constructed from the given ones.
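The three normalization schemes translate directly into code. A minimal sketch over plain Python lists of numbers (the function names are illustrative, not from the slides):

```python
import statistics

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score_normalize(values):
    """z-score normalization: subtract the mean, divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling_normalize(values):
    """Decimal scaling: divide by 10^j so that all values fall inside (-1, 1)."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

salaries = [36300, 48000, 60000, 60234, 47000, 93030]  # the integrated incomes above
print(min_max_normalize(salaries))
print(z_score_normalize(salaries))
print(decimal_scaling_normalize(salaries))
```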
Data Reduction Strategies
- A warehouse may store terabytes of data, so complex data analysis/mining may take a very long time to run on the complete data set.
- Data reduction obtains a reduced representation of the data set that is much smaller in volume and yet produces the same (or almost the same) analytical results.
- Strategies: data cube aggregation (reduce to summary data); dimensionality reduction (as in multimedia data); numerosity reduction (model the data using functions, or summarize it using histograms).

Discretization and Concept Hierarchy
- Discretization reduces the number of values of a given continuous attribute by dividing its range into intervals; interval labels then replace the actual data values. E.g., ages 0-10 → 1, ages 11-20 → 2, etc.
- Concept hierarchies reduce the data by collecting and replacing low-level concepts (such as numeric values of the attribute age) by higher-level concepts (such as young, middle-aged, or senior). E.g., ages 0-12 → child, ages 13-20 → teenager, etc.

Summary of the KDD Process
- All steps of the KDD process are important and crucial for the effectiveness of discovery.
- Data mining is the most interesting step, due to the challenge of automatic discovery of interesting information.
- Next, we present some data mining techniques, focusing on the discovery of association rules and the discovery of frequent patterns in long sequences.

Mining Association Rules: Basic Concepts
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Given: (1) a database of transactions, where (2) each transaction is a list of items (e.g., purchased by a customer in a visit).
- Find: all rules that correlate the presence of one set of items with that of another set of items. E.g., 98% of the people who purchase tires and auto accessories also get automotive services done.
- Rule form: "Body → Head [support, confidence]". Examples:
  - buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%]

Applications of Association Rules
- * ⇒ Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?).
- Home Electronics ⇒ * (what other products should the store stock up on?).
- Attached mailing in direct marketing.
- Detecting "ping-pong"ing of patients and faulty "collisions".

Formal Definition
- Given:
  - a (potentially unknown) set of literals I = {i1, i2, …, in}, called items;
  - a set of transactions D, where each transaction T ∈ D is a subset of I, i.e., T ⊆ I;
  - a support threshold s, 0 < s ≤ 1, and a confidence threshold c, 0 < c ≤ 1.
- Find: all rules of the form X → Y, where X ⊂ I, Y ⊂ I, X ∩ Y = ∅, such that (i) at least s% of the transactions in D contain X ∪ Y, and (ii) at least c% of the transactions in D that contain X also contain Y.

General Procedure for Mining Association Rules
1. Find all frequent itemsets F, where for each Z ∈ F at least s% of the transactions in D contain Z.
2. Generate association rules from F as follows: let X ∪ Y ∈ F and X ∈ F; then conf(X→Y) = supp(X ∪ Y)/supp(X). If conf(X→Y) ≥ c, the rule X→Y is generated, because supp(X→Y) = supp(X ∪ Y) ≥ s and conf(X→Y) ≥ c.
- Therefore mining association rules reduces to mining frequent itemsets and the correlations between them. (A small sketch of step 2 follows.)
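Step 2 can be written directly from the definition. A minimal sketch over the frequent itemsets of the small example used in the Apriori slides below (supports are absolute transaction counts; the function name is illustrative):

```python
from itertools import combinations

# Frequent itemsets of the running example (s = 2, absolute counts).
supp = {
    frozenset({1}): 2, frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({1, 3}): 2, frozenset({2, 3}): 2, frozenset({2, 5}): 3,
    frozenset({3, 5}): 2, frozenset({2, 3, 5}): 2,
}

def generate_rules(supp, min_conf):
    """Emit every rule X -> Y with X ∪ Y frequent and conf = supp(X∪Y)/supp(X) >= min_conf."""
    rules = []
    for itemset in supp:
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for body in combinations(itemset, r):
                x = frozenset(body)
                conf = supp[itemset] / supp[x]  # x is frequent by the Apriori property
                if conf >= min_conf:
                    rules.append((set(x), set(itemset - x), conf))
    return rules

for x, y, conf in generate_rules(supp, 0.7):
    print(f"{x} -> {y}  (conf = {conf:.2f})")
```

Note that the lookup supp[x] is always safe: every subset of a frequent itemset is itself frequent, so it appears in the table.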
Finding Frequent Itemsets: The Apriori Algorithm
- The most important part of the mining process is to find all itemsets with minimum support s in the database D. Apriori is an influential algorithm for this task.
- It is based on the following observation: an itemset l = {i1, i2, …, in} can be frequent only if all subsets of l are frequent. In other words, supp(l) ≥ s only if supp(l') ≥ s for all l' ⊂ l.
- Thus the frequent itemsets are found in a level-wise manner: first the frequent itemsets L1 with only one item are found; then candidate itemsets C2 of size 2 are generated from them and verified; and so on.
- Join step: Ck is generated by joining Lk-1 with itself.
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
- Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;

The Apriori Algorithm: Example (s = 2)
- Database D:

  TID | Items
  100 | 1 3 4
  200 | 2 3 5
  300 | 1 2 3 5
  400 | 2 5

- Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3. L1: {1}:2, {2}:3, {3}:3, {5}:3.
- C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}. Scan D → supports 1, 2, 1, 2, 3, 2. L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2.
- C3: {2 3 5}. Scan D → L3: {2 3 5}:2.

How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order.
- Step 1: self-join Lk-1:

  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Step 2: pruning:

  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck

Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}.
- Self-joining L3*L3: abcd from abc and abd; acde from acd and ace.
- Pruning: acde is removed because ade is not in L3.
- C4 = {abcd}. (A sketch of both steps follows.)
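Both steps translate almost literally into code. A minimal sketch of candidate generation, representing itemsets as sorted tuples (apriori_gen is an illustrative name, not from the slides):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets L_prev.
    Join step: merge two (k-1)-itemsets that share their first k-2 items.
    Prune step: drop a candidate if any of its (k-1)-subsets is not frequent."""
    L_sorted = sorted(tuple(sorted(s)) for s in L_prev)
    frequent = set(L_sorted)
    candidates = []
    for i in range(len(L_sorted)):
        for j in range(i + 1, len(L_sorted)):
            p, q = L_sorted[i], L_sorted[j]
            if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                c = p + (q[k - 2],)
                # Prune: every (k-1)-subset of c must be in L_prev.
                if all(s in frequent for s in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates

# The slide's example: L3 = {abc, abd, acd, ace, bcd} yields C4 = {abcd};
# acde is pruned because ade is not in L3.
L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3, 4))  # [('a', 'b', 'c', 'd')]
```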
How to Count the Supports of Candidates?
- Why is counting the supports of candidates a problem? The total number of candidates can be huge, and one transaction may contain many candidates.
- Method: candidate itemsets are stored in a hash-tree; a leaf node of the hash-tree contains a list of itemsets and counts; an interior node contains a hash table; a subset function finds all the candidates contained in a transaction.

Example of the Hash-Tree for C3
- [Figure: a hash-tree over the candidate 3-itemsets 124, 125, 145, 159, 234, 345, 356, 367, 368, 457, 458, 567, 689. The hash function is mod 3; interior nodes hash on the 1st, 2nd, and 3rd item in turn, and leaves hold the itemset lists with counts. For transaction 1 2 3 4 5, the subset function recursively looks for candidates starting with 1 (then 12X, 13X, 14X, ...), with 2 (2XX), and with 3 (3XX), hashing on the next item at each level.]

AprioriTid: Use D Only for the First Pass
- The database is not used after the 1st pass. Instead, a set Ck' is used at each step, where Ck' = <TID, {Xk}>: each Xk is a potentially frequent k-itemset in the transaction with id TID.
- At each step, Ck' is generated from Ck-1' during the pruning step of constructing Ck, and is used to compute Lk.
- For small values of k, Ck' can be larger than the database!

AprioriTid Example (s = 2)
- Database D is as in the Apriori example; L1, L2, L3 are also the same.
- C1':

  TID | Sets of itemsets
  100 | {{1},{3},{4}}
  200 | {{2},{3},{5}}
  300 | {{1},{2},{3},{5}}
  400 | {{2},{5}}

- C2':

  TID | Sets of itemsets
  100 | {{1 3}}
  200 | {{2 3},{2 5},{3 5}}
  300 | {{1 2},{1 3},{1 5},{2 3},{2 5},{3 5}}
  400 | {{2 5}}

- C3':

  TID | Sets of itemsets
  200 | {{2 3 5}}
  300 | {{2 3 5}}

Experiments with Apriori and AprioriTid
- Experiments show that Apriori is faster than AprioriTid in general, because the Ck' generated during the early phases of AprioriTid are very large.
- An AprioriHybrid method is proposed that uses Apriori in the early phases and AprioriTid in the later phases.

Is Apriori Fast Enough? Performance Bottlenecks
- The core of the Apriori algorithm: use frequent (k-1)-itemsets to generate candidate frequent k-itemsets; use database scans and pattern matching to collect counts for the candidates.
- The bottleneck of Apriori is candidate generation:
  - Huge candidate sets: 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
  - Multiple database scans: (n+1) scans are needed, where n is the length of the longest pattern.

Methods to Improve Apriori's Efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Sampling: mine a subset of the given data with a lower support threshold, plus a method to determine the completeness.
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.

Mining Frequent Patterns Without Candidate Generation
- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed but complete for frequent pattern mining, and avoiding costly database scans.
- Develop an efficient, FP-tree-based frequent pattern mining method: a divide-and-conquer methodology decomposes the mining task into smaller ones and avoids candidate generation (sub-database tests only!).

FP-tree Construction (min_support = 3)
- Transactions and their (ordered) frequent items:

  TID | Items bought           | (ordered) frequent items
  100 | f, a, c, d, g, i, m, p | f, c, a, m, p
  200 | a, b, c, f, l, m, o    | f, c, a, b, m
  300 | b, f, h, j, o          | f, b
  400 | b, c, k, s, p          | c, b, p
  500 | a, f, c, e, l, p, m, n | f, c, a, m, p

- Item frequencies: f:4, c:4, a:3, b:3, m:3, p:3.
- Steps: 1. scan the DB once and find the frequent 1-itemsets (single-item patterns); 2. order the frequent items in descending frequency order; 3. scan the DB again and construct the FP-tree. (A construction sketch follows.)
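A minimal sketch of the two-scan construction, assuming transactions are plain Python lists (the class and function names are illustrative). Ties in frequency, such as f and c here, may be broken by any fixed rule: the slides place f before c, while this sketch breaks ties alphabetically, which yields an equally valid FP-tree:

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fp_tree(transactions, min_support):
    """Scan 1: count items and keep the frequent ones.
    Scan 2: insert each filtered, frequency-ordered transaction as a path."""
    counts = defaultdict(int)
    for t in transactions:                       # scan 1: item frequencies
        for item in t:
            counts[item] += 1
    order = {i: n for i, n in counts.items() if n >= min_support}
    root = Node(None, None)
    header = defaultdict(list)                   # item -> its nodes (node-links)
    for t in transactions:                       # scan 2: build the tree
        path = sorted((i for i in t if i in order),
                      key=lambda i: (-order[i], i))  # descending frequency
        node = root
        for item in path:
            if item in node.children:
                node.children[item].count += 1   # shared prefix: bump the count
            else:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
    return root, header

# The slide's example database, min_support = 3.
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header = build_fp_tree(db, 3)
print(sorted((i, sum(n.count for n in header[i])) for i in header))
# [('a', 3), ('b', 3), ('c', 4), ('f', 4), ('m', 3), ('p', 3)]
```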
FP-tree Construction (cont'd)
- [Figure: the tree grows as each transaction is inserted. The final FP-tree has header table f:4, c:4, a:3, b:3, m:3, p:3 with node-links, and the branches root → f:4 → c:3 → a:3 → m:2 → p:2, with a:3 → b:1 → m:1 and f:4 → b:1, plus a second branch root → c:1 → b:1 → p:1.]

Benefits of the FP-tree Structure
- Completeness: it never breaks a long pattern of any transaction, and it preserves complete information for frequent pattern mining.
- Compactness: irrelevant information is reduced (infrequent items are gone); the frequency-descending ordering means more frequent items are more likely to be shared; the tree is never larger than the original database (not counting node-links and counts). Example: for the Connect-4 DB, the compression ratio can be over 100.

Mining Frequent Patterns Using the FP-tree
- General idea (divide and conquer): recursively grow frequent pattern paths using the FP-tree.
- Method: for each item, construct its conditional pattern base and then its conditional FP-tree; repeat the process on each newly created conditional FP-tree, until the resulting FP-tree is empty or contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern).
- Start with the last item in the order, i.e., p: follow its node-links and traverse only the paths containing p, accumulating all transformed prefix paths of p to form its conditional pattern base.
  - Conditional pattern base for p: fcam:2, cb:1. Constructing a new FP-tree from this base leads to only one branch, c:3. Thus we derive only one frequent pattern containing p: the pattern cp.
- Move to the next least frequent item in the order, i.e., m: follow its node-links and traverse only the paths containing m, accumulating the transformed prefix paths to form its conditional pattern base.
  - Conditional pattern base for m: fca:2, fcab:1. Constructing a new FP-tree from this base leads to the single path fca:3. From this we derive the frequent patterns fm, cm, am, fcm, fam, cam, fcam.

Properties of the FP-tree for Conditional Pattern Base Construction
- Node-link property: for any frequent item ai, all possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header.
- Prefix path property: to calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count carries the same count as node ai.

Conditional Pattern Bases for the Example

  Item | Conditional pattern base | Conditional FP-tree
  p    | {(fcam:2), (cb:1)}       | {(c:3)}|p
  m    | {(fca:2), (fcab:1)}      | {(f:3, c:3, a:3)}|m
  b    | {(fca:1), (f:1), (c:1)}  | empty
  a    | {(fc:3)}                 | {(f:3, c:3)}|a
  c    | {(f:3)}                  | {(f:3)}|c
  f    | empty                    | empty

(A sketch of the single-path case follows.)
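The single-path case, used above for m's conditional FP-tree, is easy to sketch: every combination of nodes on the path, extended with the suffix, is frequent, with support equal to the smallest count among the chosen nodes. A minimal sketch (the helper name is illustrative):

```python
from itertools import combinations

def patterns_from_single_path(path, suffix, min_support):
    """Enumerate the frequent patterns of a single-path conditional FP-tree:
    each combination of path nodes, joined with the suffix, is a pattern whose
    support is the minimum count along the chosen nodes."""
    results = {}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = frozenset(i for i, _ in combo) | suffix
            support = min(c for _, c in combo)
            if support >= min_support:
                results[items] = support
    return results

# Conditional FP-tree for m in the example: the single path f:3, c:3, a:3.
patterns = patterns_from_single_path([("f", 3), ("c", 3), ("a", 3)],
                                     frozenset({"m"}), 3)
for pat, s in patterns.items():
    print("".join(sorted(pat)), s)   # fm, cm, am, fcm, fam, cam, fcam, all with support 3
```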
Why Is Frequent Pattern Growth Fast?
- Performance studies show that FP-growth is an order of magnitude faster than Apriori, and also faster than tree-projection.
- Reasoning: no candidate generation and no candidate tests; a compact data structure; no repeated database scans; the basic operations are counting and FP-tree building.
- [Figure: FP-growth vs. Apriori run time (sec) as a function of the support threshold (0-3%) on data set T25I20D10K.]
- [Figure: FP-growth vs. TreeProjection run time (sec) as a function of the support threshold (0-2%) on data set T25I20D100K.]

Additional Issues on the FP-tree
- How can this method be used for large databases? Partition the database cleverly and build one FP-tree per partition, or make the FP-tree disk resident (like the B+-tree); however, node-link traversal and path accesses may then cause many random I/Os.
- Materializing the FP-tree: no need to build it for each query; however, the construction is based on a predefined minimum support threshold.
- Incremental updates on the tree: useful for incremental maintenance of association rules; however, this is only possible when the minimum support is 1. Another solution is to adjust the minimum support as the database is updated.

Finding Frequent Episodes in Event Sequences
- Problem definition: given a sequence S of events in time and a time window win, find all episodes that appear frequently in the windows of size win over the sequence.
- The frequency of an episode α is the number of windows in which α appears, divided by the total number of time windows:

  fr(α, S, win) = |{w ∈ W(S, win) : α occurs in w}| / |W(S, win)|

- The problem is similar to mining frequent itemsets in a database of transactions, but the different type of data (a sequence) makes it special.
- [Figure: an event sequence over time, e.g., A C E D B C F B A A C D E A A C D C E A C F D A B. For example, the episode "C appears after A" is very frequent in this sequence.]

Types of Episodes
- An episode is defined as a serial or parallel co-occurrence of event types.
- Serial episode: F appears after E.
- Parallel episode: A appears with B, in any order.
- Hybrid serial-parallel episode: there is no total order on the appearance of the elements, but there is a partial order (e.g., A and B in any order, both before F).

Main Algorithm
- Like Apriori: first find the frequent events (e.g., A, B, C); generate the candidate episodes for the next level (e.g., AB, AC, BC); prune candidates for which a sub-episode is not frequent at the previous level; count the frequencies of the candidates and find the frequent episodes; apply iteratively for the next-level episodes.
- The counting of frequencies is the main difference; the rest is like Apriori on transactional data.

Counting the Frequencies of Candidates
- Observation: consecutive windows (w1, w2, w3, w4) contain similar events, so we do not have to count the episodes in each window from scratch.
- While sliding the window, we update the frequencies of the episodes considering only the new events that enter the window and the events that are no longer in it. For example, when moving from w1 to w2 we update only the frequencies of the episodes containing B (the incoming event).

WINEPI
- This method is called WINEPI (mining frequent episodes using a sliding time window); a naive, non-incremental version of the count is sketched below.
- If we count parallel episodes (the order of events does not matter), the computation is simple; if we count serial episodes (the events are ordered within the episode), the computation is based on automata.
- Hybrid episodes (with partial event orders) are counted hierarchically: they are broken into serial or parallel sub-episodes, the frequencies of these are counted, and the frequency of the hybrid episode is computed by checking the co-existence of the parts.
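For parallel episodes, even the naive (non-incremental) count fits in a few lines. A sketch under the assumption that every width-win window overlapping the sequence is counted, following the definition of W(S, win) in Mannila et al.; the names are illustrative, and WINEPI itself updates the counts incrementally instead of rescanning each window:

```python
from collections import Counter

def winepi_frequency(events, episode, win):
    """Naive WINEPI frequency of a *parallel* episode: the fraction of
    width-`win` windows that contain every event type of the episode.
    `events` is a time-sorted list of (timestamp, event_type) pairs."""
    if not events:
        return 0.0
    t_min = events[0][0] - win + 1      # first window overlapping the sequence
    t_max = events[-1][0]               # last window start
    n_windows = t_max - t_min + 1
    hits = 0
    for start in range(t_min, t_max + 1):
        inside = Counter(e for t, e in events if start <= t < start + win)
        if all(inside[e] > 0 for e in episode):
            hits += 1
    return hits / n_windows

seq = [(1, "A"), (2, "C"), (3, "B"), (5, "A"), (6, "B"), (8, "C")]
print(winepi_frequency(seq, {"A", "B"}, win=3))  # 4 of 10 windows -> 0.4
```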
MINEPI
- Another method, MINEPI, follows a different strategy: it does not count frequencies in sliding windows, but is based on counting the minimal occurrences of episodes.
- A minimal occurrence of an episode α is a time interval w = [ts, te) such that (i) α occurs in w, and (ii) no subwindow of w contains α.
- The algorithm is again level-wise: the frequent episodes with k events are computed from the frequent episodes with k-1 events. At level k the algorithm computes the minimal occurrences of all episodes with k events; the level-(k+1) episodes are then found by performing a temporal join of the minimal occurrences of their sub-episodes of size k.
- Example: to find the frequency and the minimal occurrences of A→B→C, we join the minimal occurrences of A→B with those of B→C.

WINEPI vs. MINEPI
- MINEPI is very efficient at finding frequent episodes, but it may suffer from the excessive space needed to store the minimal occurrences; these may not fit in memory for large problems, and especially the early phases of MINEPI can be very expensive. However, the temporal join is much more efficient than the counting of WINEPI, and MINEPI is independent of the window size.
- In general, the two algorithms may produce different results because of their different definitions of frequency. WINEPI and MINEPI can be combined to derive frequent episodes efficiently.

References
- Jiawei Han and Micheline Kamber: Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
- Rakesh Agrawal and Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994.
- Jiawei Han, Jian Pei, and Yiwen Yin: Mining Frequent Patterns without Candidate Generation. ACM SIGMOD 2000.
- Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo: Discovery of Frequent Episodes in Event Sequences. Data Mining and Knowledge Discovery 1(3): 259-289, November 1997.
- V. Kumar and M. Joshi: Tutorial on High Performance Data Mining. http://www-users.cs.umn.edu/~mjoshi/hpdmtut/