CPT-S 580-06 Advanced Databases
Yinghui Wu, EME 49

Scalable Data Mining (a case study)

Data Mining

What is data mining? A tentative definition:
– Use of efficient techniques for the analysis of very large collections of data and the extraction of useful and possibly unexpected patterns in data
– Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Database Processing vs. Data Mining

Database processing:
– Query: well defined; expressed in SQL, SPARQL, XPath, ...
  • Find all credit applicants with last name of Smith.
  • Identify customers who have purchased more than $10,000 in the last month.
  • Find all my friends living in Seattle who like French restaurants.
– Output: precise; a subset of the database

Data mining:
– Query: poorly defined; no precise query language
  • Find all credit applicants who are poor credit risks. (classification)
  • Identify customers with similar buying habits. (clustering)
  • Find all my friends who frequently go to French restaurants if their friends do. (association rules)
– Output: fuzzy; not a subset of the database

Data Mining Models and Tasks
– Predictive tasks: use variables to predict unknown or future values of other variables.
– Descriptive tasks: find human-interpretable patterns that describe the data.

Association Rules

Association Rule Discovery: Definition

Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules that predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Association Rule Discovery: Applications

Marketing and sales promotion:
– Let the rule discovered be {Bagels, ...} --> {Potato Chips}
– Potato Chips as consequent: can be used to determine what should be done to boost its sales.
– Bagels in the antecedent: can be used to see which products would be affected if the store discontinues selling bagels.
– Bagels in the antecedent and Potato Chips in the consequent: can be used to see what products should be sold with Bagels to promote the sale of Potato Chips.

Definition: Frequent Itemset

Itemset
– A collection of one or more items, e.g. {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items
Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent itemset
– An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
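A minimal sketch of the two measures just defined, computed over the transaction table above (the data is from the slide; the function names are my own, illustrative ones):

```python
# Transactions from the slide's table (TIDs 1-5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```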
Definition: Association Rule

Association rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule evaluation metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example, over the table above: {Milk, Diaper} → {Beer}
  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!

Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules (with C(n,k) the binomial coefficient):

  R = Σ_{k=1}^{d−1} [ C(d,k) × Σ_{j=1}^{d−k} C(d−k,j) ] = 3^d − 2^{d+1} + 1

  If d = 6, R = 602 rules.

Mining Association Rules: Decoupling

Example rules from the table above:
{Milk, Diaper} → {Beer}    (s = 0.4, c = 0.67)
{Milk, Beer} → {Diaper}    (s = 0.4, c = 1.0)
{Diaper, Beer} → {Milk}    (s = 0.4, c = 0.67)
{Beer} → {Milk, Diaper}    (s = 0.4, c = 0.67)
{Diaper} → {Milk, Beer}    (s = 0.4, c = 0.5)
{Milk} → {Diaper, Beer}    (s = 0.4, c = 0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements

Mining Association Rules: Two-Step Approach
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still computationally expensive.

Frequent Itemset Generation

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database: match each of the N transactions (of width w) against every one of the M candidates
– Complexity ~ O(NMw) => expensive, since M = 2^d !!!
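The closed form above can be sanity-checked in a few lines (this check is mine, not from the slides; it uses only the standard library):

```python
from math import comb

def total_rules(d):
    """Count rules X -> Y over d items: pick a non-empty antecedent X
    (k items), then a non-empty consequent Y from the remaining d-k."""
    return sum(
        comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d)
    )

print(total_rules(6))       # 602
print(3**6 - 2**7 + 1)      # 602, the closed form 3^d - 2^(d+1) + 1
```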
Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemset increases
– Use a subsample of N transactions
Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction

Reducing the Number of Candidates: Apriori

Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

  ∀ X, Y: (X ⊆ Y) ⟹ s(X) ≥ s(Y)

– The support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support

Illustrating the Apriori Principle

[Figure: the itemset lattice over items A–E, from null down to ABCDE; once AB is found to be infrequent, all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned.]

Illustrating the Apriori Principle (minimum support = 3)

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:
Itemset           Count
{Bread, Milk}     3
{Bread, Beer}     2
{Bread, Diaper}   3
{Milk, Beer}      2
{Milk, Diaper}    3
{Beer, Diaper}    3

Triplets (3-itemsets):
Itemset                  Count
{Bread, Milk, Diaper}    2

If every itemset of size up to 3 is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.

Apriori Algorithm

Method (a sketch in code follows this section):
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
  • Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  • Prune candidate itemsets containing subsets of length k that are infrequent
  • Count the support of each candidate by scanning the DB
  • Eliminate candidates that are infrequent, leaving only those that are frequent
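A minimal Python sketch of the level-wise method above, using brute-force support counting rather than the hash-tree optimization discussed next (the names and structure are my own, not from the slides):

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise frequent itemset mining following the method above."""
    items = {i for t in transactions for i in t}
    # k = 1: frequent 1-itemsets
    frequent = [{frozenset([i]) for i in items
                 if sum(1 for t in transactions if i in t) >= minsup_count}]
    k = 1
    while frequent[-1]:
        # Generate length-(k+1) candidates from length-k frequent itemsets
        candidates = {a | b for a in frequent[-1] for b in frequent[-1]
                      if len(a | b) == k + 1}
        # Prune candidates with an infrequent length-k subset (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent[-1]
                             for s in combinations(c, k))}
        # Count support of the survivors with one scan of the database
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent.append({c for c, n in counts.items() if n >= minsup_count})
        k += 1
    return [level for level in frequent if level]

# The 5-transaction table above with minimum support count 3: prints the
# frequent 1- and 2-itemsets (no 3-itemset reaches support 3 in this table).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
for level in apriori(transactions, 3):
    print(sorted(sorted(s) for s in level))
```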
Apriori: Reducing the Number of Comparisons

Candidate counting:
– Scan the database of transactions to determine the support of each candidate itemset
– To reduce the number of comparisons, store the candidates in a hash structure
  • Instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets
[Figure: the N transactions matched against a hash structure with k buckets of candidates.]

Apriori: Implementation Using a Hash Tree

Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

We need:
• A hash function; the slide hashes items into the three groups {1,4,7}, {2,5,8}, and {3,6,9}
• A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)

[Figure: the resulting hash tree, with internal nodes branching on the groups 1,4,7 / 2,5,8 / 3,6,9 and the 15 candidates distributed across the leaves.]

To count the candidates contained in transaction {1 2 3 5 6}, its 3-subsets are enumerated incrementally (1+ {2 3 5 6}, 2+ {3 5 6}, 3+ {5 6}; then 12+ {3 5 6}, 13+ {5 6}, 15+ {6}; ...) and routed down the tree. The transaction is matched against only 11 of the 15 candidates.

Apriori: Alternative Search Methods

Traversal of the itemset lattice:
– General-to-specific vs. specific-to-general
  [Figure: the frequent itemset border in the lattice from null to {a1, a2, ..., an} under (a) general-to-specific, (b) specific-to-general, and (c) bidirectional traversal.]
– Breadth-first vs. depth-first
  [Figure: (a) breadth-first and (b) depth-first traversal of the lattice.]

Bottlenecks of Apriori

Candidate generation can result in huge candidate sets:
– 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
– To discover a frequent pattern of size 100, e.g. {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
Multiple scans of the database:
– Needs (n + 1) scans, where n is the length of the longest pattern

Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that the rule f → L − f satisfies the minimum confidence requirement.
– If {A,B,C,D} is a frequent itemset, the candidate rules are:
  ABC→D, ABD→C, ACD→B, BCD→A, AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB, A→BCD, B→ACD, C→ABD, D→ABC
– If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)

How to efficiently generate rules from frequent itemsets? (A sketch in code follows this section.)
– In general, confidence does not have an anti-monotone property: c(ABC→D) can be larger or smaller than c(AB→D)
– But the confidence of rules generated from the same itemset does have an anti-monotone property; e.g. for L = {A,B,C,D}:
  c(ABC→D) ≥ c(AB→CD) ≥ c(A→BCD)
• Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
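A small sketch of rule generation from a single frequent itemset, growing consequents level-wise and pruning by the anti-monotonicity just described (the code and its names are illustrative, not the slides' algorithm; the σ values are precomputed from the transaction table above):

```python
def generate_rules(itemset, support_count, minconf):
    """Emit confident rules X -> Y from one frequent itemset.
    support_count(S) must return sigma(S), e.g. from the Apriori pass."""
    itemset = frozenset(itemset)
    rules = []
    consequents = [frozenset([i]) for i in itemset]
    while consequents and len(next(iter(consequents))) < len(itemset):
        kept = []
        for y in consequents:
            x = itemset - y
            conf = support_count(itemset) / support_count(x)
            if conf >= minconf:
                rules.append((x, y, conf))
                kept.append(y)
            # else: every rule whose consequent contains y is pruned too
        # Merge surviving consequents into ones larger by one item
        consequents = {a | b for a in kept for b in kept
                       if len(a | b) == len(a) + 1}
    return rules

# sigma values for {Milk, Diaper, Beer} and its subsets in the table above
sigma = {
    frozenset({"Milk", "Diaper", "Beer"}): 2,
    frozenset({"Milk", "Diaper"}): 3, frozenset({"Milk", "Beer"}): 2,
    frozenset({"Diaper", "Beer"}): 3, frozenset({"Milk"}): 4,
    frozenset({"Diaper"}): 4, frozenset({"Beer"}): 3,
}
for x, y, conf in generate_rules({"Milk", "Diaper", "Beer"},
                                 lambda s: sigma[frozenset(s)], 0.6):
    print(sorted(x), "->", sorted(y), round(conf, 2))
# Recovers the four rules with c >= 0.6 from the decoupling example above.
```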
Association Rules in Graph Data

Association in Social Networks

Traditional association rules, X ⇒ Y, are defined over itemsets. Associations in social networks come with topological constraints. Example:
"If customers x and x′ are friends living in the same city c, there are at least 3 French restaurants in c that x and x′ both like, and if x′ visits a newly opened French restaurant y in c, then x may also visit y."
[Figure: a graph pattern linking x and x′ by a friend edge, both living in city c, both liking at least 3 French restaurants in c, with x′ visiting a new French restaurant y.]
Use case: identify customers for a new French restaurant.

Association via Graph Patterns

[Figure: two pattern examples. (1) Identify customers x for a newly released "Shakira" album via friends in Ecuador who blog about an article on the album. (2) Detect fake accounts via accounts x1 and x2 posting blogs with keywords like "claim a prize" about the same article.]

Question 1: How to define association rules via graph patterns?
Question 2: How to discover interesting rules?
Question 3: How to use the rules to identify customers?
These are more involved than rules for itemsets!

Graph Pattern Association Rules (GPARs)

A graph-pattern association rule (GPAR) R(x, y) has the form
  R(x, y): Q(x, y) ⇒ q(x, y)
• Q(x, y): a graph pattern, where x and y are two designated nodes
• q(x, y): an edge labeled q from x to y (a predicate)
• Q and q are the antecedent and consequent of R, respectively

Semantics: if there exists a match h that identifies vx and vy as matches of the designated nodes x and y in Q, respectively, then the consequent q(vx, vy) will likely hold.

Example (the French-restaurant rule above):
  R(x, French restaurant): Q(x, French restaurant) ⇒ like(x, French restaurant)
[Figure: the antecedent pattern, with x and x′ friends living in the same city and both liking at least 3 French restaurants there, and the consequent edge like(x, French restaurant).]

How should support be measured: by the number of isomorphic subgraphs in a single graph?

Support and Confidence

Support of R(x, y): Q(x, y) ⇒ q(x, y) is defined over the matches of the pattern in the single graph G.
[Figure: matches u1, u2, u3 of x in a graph with French restaurants Le Bernardin and Per Se in New York.]

Confidence of R(x, y) is defined under a local closed-world assumption.
– Candidates: the nodes x with at least one edge of type q but that are not matches for q(x, y).
[Figure: for the Shakira-album rule, a node v liking the Shakira album is "positive", a node v′ liking MJ's album is "negative", and a node v″ whose hobby is pop music is "unknown".]

Discovering GPARs

Mining GPARs for a particular event often leads to the same group of entities.
Diversified GPARs:
– a difference function on GPARs
– a bi-criteria diversification function F(S)
[Figure: two similar restaurant patterns over Le Bernardin, Per Se, and Patina, and Asian restaurants in New York and LA.]

The diversified mining problem: given a graph G, a predicate q(x, y), a support bound α, and positive integers k and d, find a set S of k nontrivial GPARs pertaining to q(x, y) such that (a) F(S) is maximized; and (b) for each GPAR R ∈ S, supp(R, G) ≥ α and the radius r(P_R, x) ≤ d.

Parallel GPAR Discovery

A parallel discovery algorithm:
• the coordinator Sc divides G into n−1 fragments, each assigned to a processor Si
• GPARs are discovered in parallel by bulk synchronous processing in d rounds
• Sc posts a set M of GPARs to each processor
• each processor generates GPARs locally by extending those in M
• new GPARs are collected and assembled by Sc in the barrier synchronization phase; Sc incrementally updates the top-k GPAR set Lk
Key questions: parallel scalability? load balancing? communication cost?
[Figure: in round i, the coordinator (1) distributes Mi to workers 1..n, (2) the workers locally expand Mi, and (3) synchronization assembles the new GPARs Mi+1 and updates Lk = {R1, R2, ..., Rk}.]

Identifying Entities Using GPARs

Given a set Σ of GPARs pertaining to the same q(x, y), a graph G, and a confidence threshold η, the set of entities identified by Σ is the set of nodes vx that match x in some GPAR R ∈ Σ with conf(R, G) ≥ η. (A schematic sketch appears at the end of this section.)

Entity Identification Algorithm

Scalability of the Discovery Algorithm

On average, 3.2 times faster.
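Following the definition of the entities identified by Σ, here is a schematic sketch of the application step (everything below is hypothetical scaffolding: matches and confidences are assumed precomputed, whereas in reality obtaining them requires expensive graph pattern matching):

```python
from typing import Iterable, Set, Tuple

class GPAR:
    """Abstract stand-in for a rule R(x, y): Q(x, y) => q(x, y),
    carrying precomputed matches of (x, y) and conf(R, G)."""
    def __init__(self, matches: Set[Tuple[str, str]], confidence: float):
        self.matches = matches
        self.confidence = confidence

def identify_entities(rules: Iterable[GPAR], eta: float) -> Set[str]:
    """Nodes vx matching x in some rule whose confidence is >= eta."""
    return {vx for r in rules if r.confidence >= eta
            for (vx, _vy) in r.matches}

# Two hypothetical rules pertaining to like(x, "NewBistro"):
rules = [GPAR({("Ann", "NewBistro"), ("Bob", "NewBistro")}, 0.8),
         GPAR({("Eve", "NewBistro")}, 0.4)]
print(identify_entities(rules, 0.7))  # {'Ann', 'Bob'} (in some order)
```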