CSG230 Summary
Donghui Zhang

What we learned?
1. Frequent pattern & association: frequent itemsets (Apriori, FP-growth), max and closed itemsets, association rules, essential rules, generalized itemsets, sequential patterns
2. Clustering: k-means, BIRCH (based on the CF-tree), DBSCAN, CURE
3. Classification: decision tree, naive Bayesian classifier, Bayesian network, neural net and SVM
4. Data warehousing: concept and schema; data cube and operations (roll-up, ...); cube computation by multi-way array aggregation; iceberg cube; dynamic data cube
5. Additional: lattices (of itemsets, g-itemsets, rules, cuboids); distance-based indexing

1. Frequent pattern & association

Basic Concepts: Frequent Patterns and Association Rules
An itemset is a set of items X = {x1, ..., xk}. Given a transaction database, find all rules X => Y with at least the minimum support and confidence.
Support s: the probability that a transaction contains X ∪ Y.
Confidence c: the conditional probability that a transaction containing X also contains Y.
Example transaction database:
  Tid  Items bought
  10   A, B, C
  20   A, C
  30   A, D
  40   B, E, F
With min_support = 50% and min_conf = 50%: A => C (50%, 66.7%) and C => A (50%, 100%).

From Mining Association Rules to Mining Frequent Patterns (i.e. Frequent Itemsets)
Given a frequent itemset X, how do we find association rules? Examine every non-empty proper subset S of X:
  confidence(S => X - S) = support(X) / support(S)
and compare it with min_conf. An optimization is possible (refer to exercises 6.1 and 6.2).

The Apriori Algorithm: An Example
Database TDB (min_support = 2): 10: A, C, D; 20: B, C, E; 30: A, B, C, E; 40: B, E.
1st scan, C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3  ->  L1: {A}:2, {B}:3, {C}:3, {E}:3
2nd scan, C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2  ->  L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
3rd scan, C3: {B,C,E}  ->  L3: {B,C,E}:2

Important Details of Apriori
How to generate candidates? Step 1: self-join Lk with itself; Step 2: prune candidates that have an infrequent subset. How to count the supports of candidates?

Example of Candidate Generation
L3 = {abc, abd, acd, ace, bcd}.
Self-joining L3*L3 gives abcd (from abc and abd) and acde (from acd and ace).
Pruning removes acde because ade is not in L3.
C4 = {abcd}.
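To make the join-and-prune step concrete, here is a minimal Python sketch (not from the course slides); the function name generate_candidates and the representation of itemsets as sorted tuples are my own choices.

```python
from itertools import combinations

def generate_candidates(Lk):
    """Apriori candidate generation: self-join L_k, then prune.

    Lk is a set of frequent k-itemsets, each a sorted tuple of items.
    Returns the candidate (k+1)-itemsets C_{k+1}.
    """
    k = len(next(iter(Lk)))
    # Step 1: self-join -- two k-itemsets that agree on their first k-1 items
    # are merged into one (k+1)-itemset.
    joined = set()
    for a in Lk:
        for b in Lk:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                joined.add(a + (b[-1],))
    # Step 2: prune -- drop any candidate that has a k-subset not in L_k.
    return {c for c in joined
            if all(sub in Lk for sub in combinations(c, k))}

# The slide's example: L3 = {abc, abd, acd, ace, bcd}  ->  C4 = {abcd}
L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(generate_candidates(L3))  # {('a','b','c','d')}; acde is pruned because ade is not in L3
```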
Construct an FP-tree from a Transaction Database (min_support = 3)
  TID  Items bought               (Ordered) frequent items
  100  f, a, c, d, g, i, m, p     f, c, a, m, p
  200  a, b, c, f, l, m, o        f, c, a, b, m
  300  b, f, h, j, o, w           f, b
  400  b, c, k, s, p              c, b, p
  500  a, f, c, e, l, p, m, n     f, c, a, m, p
Steps:
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).
2. Sort the frequent items in descending order of frequency; this gives the f-list. Header table: f:4, c:4, a:3, b:3, m:3, p:3, so F-list = f-c-a-b-m-p.
3. Scan the DB again and insert each transaction, with its frequent items in f-list order, into the FP-tree, sharing common prefixes.
The finished FP-tree has root {} and the prefix paths f:4-c:3-a:3-m:2-p:2, f:4-c:3-a:3-b:1-m:1, f:4-b:1, and c:1-b:1-p:1; the header table links all nodes carrying the same item.
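Below is a minimal sketch of the two-scan construction on the slide's five transactions. The class FPNode, the function build_fptree and the helper show are my own names, and ties among equally frequent items are broken by first appearance here, so node ordering may differ cosmetically from the slide's f-c-a-b-m-p layout.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fptree(transactions, min_support):
    """Build an FP-tree: one DB scan for item counts, a second to insert ordered transactions."""
    counts = Counter(item for t in transactions for item in t)
    flist = [i for i, c in counts.most_common() if c >= min_support]   # f-list order
    rank = {item: r for r, item in enumerate(flist)}
    root = FPNode(None)
    for t in transactions:
        # keep only frequent items, order them by the f-list, then walk/extend a prefix path
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, flist

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

transactions = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"), list("afcelpmn")]
root, flist = build_fptree(transactions, min_support=3)
print("f-list:", flist)   # frequent items in descending frequency (tie order may differ from the slide)
show(root)                # prints the tree's prefix paths and their counts
```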
Find Patterns Having p From p's Conditional Pattern Base
Starting at the frequent-item header table of the FP-tree, traverse the tree by following the links of each frequent item p, and accumulate all transformed prefix paths of p to form p's conditional pattern base:
  item  conditional pattern base
  c     f:3
  a     fc:3
  b     fca:1, f:1, c:1
  m     fca:2, fcab:1
  p     fcam:2, cb:1

Max-patterns
A frequent pattern {a1, ..., a100} has C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 * 10^30 frequent subpatterns!
A max-pattern is a frequent pattern that has no proper frequent super-pattern.
Example (min_sup = 2): Tid 10: A,B,C,D,E; 20: B,C,D,E; 30: A,C,D,F. BCDE and ACD are max-patterns; BCD is not a max-pattern.

Example: mining max-patterns with a set-enumeration tree
The search tree over {A,B,C,D,E,F} has root (ABCDEF) and children A (BCDE), B (CDE), C (DE), D (E), E ().
Item frequencies: A:2, B:2, C:3, D:3, E:2, F:1; ABCDE has frequency 0.
Node A: AB:1, AC:2, AD:2, AE:1, ACD:2 -> max pattern ACD.
Node B: BCDE:2 -> max pattern BCDE.
Result: the max patterns are ACD and BCDE.

A Critical Observation
  Rule      Support    Confidence
  A => BC   sup(ABC)   sup(ABC)/sup(A)
  AB => C   sup(ABC)   sup(ABC)/sup(AB)
  AC => B   sup(ABC)   sup(ABC)/sup(AC)
  A => B    sup(AB)    sup(AB)/sup(A)
  A => C    sup(AC)    sup(AC)/sup(A)
A => BC has support and confidence no larger than any of the other rules, independent of the TDB. Hence AB => C, AC => B, A => B and A => C are redundant with regard to A => BC. When mining association rules, a large percentage of the rules may be redundant.
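A quick way to see the redundancy is to enumerate all rules derivable from one frequent itemset together with their confidences. The sketch below is mine (the functions support and rules_from_itemset are not from the slides); it reuses the tiny transaction database from the max-pattern example.

```python
from itertools import combinations

# Toy transaction database from the max-pattern example (min_sup = 2).
TDB = [{'A','B','C','D','E'}, {'B','C','D','E'}, {'A','C','D','F'}]

def support(itemset):
    """Absolute support: the number of transactions containing every item of the itemset."""
    return sum(1 for t in TDB if itemset <= t)

def rules_from_itemset(X, min_conf=0.5):
    """Enumerate rules S => X - S for every non-empty proper subset S of the frequent itemset X."""
    X = frozenset(X)
    rules = []
    for r in range(1, len(X)):
        for S in map(frozenset, combinations(X, r)):
            conf = support(X) / support(S)
            if conf >= min_conf:
                rules.append((set(S), set(X - S), support(X), conf))
    return rules

for lhs, rhs, sup, conf in rules_from_itemset({'A','C','D'}):
    print(f"{sorted(lhs)} => {sorted(rhs)}  support={sup}  confidence={conf:.2f}")
# conf(A => CD) never exceeds conf(AC => D) or conf(AD => C), which is exactly the
# redundancy noted in the critical observation above.
```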
Formal Definition of Essential Rule
Definition 1: rule r1 implies rule r2 (denoted r1 => r2) if support(r1) <= support(r2) and confidence(r1) <= confidence(r2) independent of the TDB.
Definition 2: rule r1 is an essential rule if r1 is strong and there exists no rule r2 (r2 != r1) such that r2 => r1.

Example of a Lattice of Rules
The rules derivable from an itemset such as ABC form a lattice (A => BC, B => AC, C => AB, AB => C, AC => B, BC => A, A => B, ..., C => A).
Generate the child nodes by moving an item out of the consequent into the antecedent, or by deleting it from the consequent.
To find essential rules: start from each max itemset, browse the lattice top-down, and prune a sub-tree whenever a rule is confident.

Frequent Generalized Itemsets
Items are organized in a taxonomy; the TDB contains only leaf items of the taxonomy.
A g-itemset may contain g-items (generalized items), but it cannot contain an ancestor and a descendant at the same time.
A descendant g-item acts as a "superset": anyone who bought {milk, bread} also bought {milk}; anyone who bought {A} also bought {W} (A's ancestor).
How to find frequent g-itemsets? Browse (and prune) a lattice of g-itemsets. To get the children of a node, replace one item by its ancestor (and if that creates a conflict, remove the item instead).

What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences.
A sequence database:
  SID  sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.

Mining Sequential Patterns by Prefix Projections
Step 1: find the length-1 sequential patterns <a>, <b>, <c>, <d>, <e>, <f>.
Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets: those having prefix <a>, those having prefix <b>, ..., those having prefix <f>.

Finding Sequential Patterns with Prefix <a>
Only the projections w.r.t. <a> need to be considered. The <a>-projected database is <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>.
Find all length-2 sequential patterns having prefix <a> by checking the frequency of items such as a and _a: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>.
Then further partition into 6 subsets: those having prefix <aa>, ..., those having prefix <af>.

2. Clustering
k-means, BIRCH (based on the CF-tree), DBSCAN, CURE.

The k-Means Clustering Method
1. Pick k objects as the initial seed points.
2. Assign each object to the cluster with the nearest seed point.
3. Re-compute each seed point as the centroid (mean point) of its cluster.
4. Go back to step 2; stop when there are no more new assignments.
The result is not optimal. A counter-example?
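Here is a minimal sketch of those four steps (my own function name kmeans and made-up 2-D points, not from the slides); a poor choice of initial seeds is exactly what makes the result suboptimal.

```python
import math, random

def kmeans(points, k, seed=0):
    """Lloyd's k-means as on the slide: seed, assign, re-center, repeat until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                       # step 1: pick k objects as seeds
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                                    # step 2: assign to nearest seed
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)                # step 3: re-compute centroids
        ]
        if new_centroids == centroids:                      # step 4: stop when nothing moves
            return centroids, clusters
        centroids = new_centroids

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)   # two cluster centers; a bad seed choice can yield a suboptimal split
```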
BIRCH (1996)
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan and Livny (SIGMOD'96).
Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering.
Phase 1: scan the DB to build an initial in-memory CF-tree (a multi-level compression of the data that tries to preserve its inherent clustering structure).
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree.
Scales linearly: it finds a good clustering with a single scan and improves the quality with a few additional scans.
Weakness: handles only numeric data, and is sensitive to the order of the data records.

Clustering Feature Vector
Clustering feature: CF = (N, LS, SS), where N is the number of data points, LS = Σ_{i=1..N} X_i, and SS = Σ_{i=1..N} X_i².
Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), 244).

Some Characteristics of CF
Two CFs can be aggregated: given CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2), the combined cluster has CF = (N1+N2, LS1+LS2, SS1+SS2).
The centroid and the radius can both be computed from the CF. The centroid is the center of the cluster, and the radius is the average distance between an object and the centroid. How?
  x0 = (Σ_{i=1..N} x_i) / N = LS / N
  R  = sqrt( (Σ_{i=1..N} (x_i - x0)²) / N )
     = sqrt( (Σ_i (x_i² - 2 x_i·x0 + x0²)) / N )
     = sqrt( (SS - 2·(LS/N)·LS + N·(LS/N)²) / N )
     = sqrt( (N·SS - LS²) / N² )

CF-Tree in BIRCH
A clustering feature summarizes the statistics of a given sub-cluster: the 0th, 1st and 2nd moments of the sub-cluster from the statistical point of view. It registers the crucial measurements for computing clusters and uses storage efficiently.
A CF-tree is a height-balanced tree that stores the clustering features of a hierarchical clustering. A non-leaf node has descendants or "children", and it stores the sums of the CFs of its children.
A CF-tree has two parameters: the branching factor (the maximum number of children) and the threshold T (the maximum radius of the sub-clusters stored at the leaf nodes).

Insertion into a CF-Tree
To insert an object o, start at the root of the CF-tree.
To insert o into an index node, descend into the child whose centroid is closest to o.
To insert o into a leaf node: if an existing leaf entry can "absorb" it (i.e. the new radius <= T), let it; otherwise, create a new leaf entry. On overflow, split: choose the two entries whose centroids are farthest apart, assign them to two different groups, and assign each remaining entry to one of these groups.
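The additivity of CFs and the centroid/radius formulas are easy to check numerically. This sketch is mine (functions cf, merge and centroid_and_radius are not from the slides); it reproduces the slide's CF = (5, (16,30), 244) example and uses the vector form of the radius identity derived above.

```python
import math

def cf(points):
    """Clustering Feature of a set of d-dimensional points: (N, LS, SS)."""
    n = len(points)
    ls = tuple(sum(coords) for coords in zip(*points))   # LS = vector sum of the points
    ss = sum(x * x for p in points for x in p)            # SS = sum of squared coordinates (scalar)
    return n, ls, ss

def merge(cf1, cf2):
    """CFs are additive: merging two sub-clusters just adds the components."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2

def centroid_and_radius(cf_entry):
    """Centroid x0 = LS/N and radius R = sqrt((N*SS - |LS|^2) / N^2)."""
    n, ls, ss = cf_entry
    x0 = tuple(v / n for v in ls)
    r = math.sqrt((n * ss - sum(v * v for v in ls)) / (n * n))
    return x0, r

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(cf(pts))                                       # (5, (16, 30), 244), as on the slide
print(merge(cf(pts[:3]), cf(pts[3:])) == cf(pts))    # True: CFs are additive
print(centroid_and_radius(cf(pts)))                  # centroid (3.2, 6.0), radius 1.6
```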
Density-Based Clustering: Background
Density-reachable: a point p is density-reachable from a point q w.r.t. Eps and MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p such that each p_{i+1} is directly density-reachable from p_i.
Density-connected: a point p is density-connected to a point q w.r.t. Eps and MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.

DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
Discovers clusters of arbitrary shape in spatial databases with noise. Points are classified as core, border or outlier (figure parameters: Eps = 1 cm, MinPts = 5).

DBSCAN: The Algorithm
Arbitrarily select a point p.
Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all points have been processed.

Motivation for CURE
k-means does not perform well on the example dataset (clusters that are not compact and spherical), and AGNES with d_min has the single-link (chaining) effect.

CURE: The Basic Version
Initially, insert every object into a priority queue (PQ) as its own cluster.
Every cluster in the PQ stores (up to) C representative points and a pointer to its closest cluster, where the distance between two clusters is min{dist(rep1, rep2)} over their representatives.
While the PQ has more than k clusters, merge the top cluster with its closest cluster.

Representative Points
Step 1: choose up to C points. If a cluster has no more than C points, take all of them; otherwise choose the first point as the one farthest from the mean, and choose each subsequent point as the one farthest from the points already chosen.
Step 2: shrink each chosen point towards the mean: p' = p + α * (mean - p), with α in [0,1]; a larger α means shrinking more.
The reason for shrinking is to limit the effect of outliers, since faraway objects are shrunk more.
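A minimal sketch of the two representative-point steps follows (my own function name representatives and a made-up toy cluster, not from the slides): pick up to C scattered points, then shrink them towards the mean.

```python
import math

def representatives(cluster, C, alpha):
    """CURE representative points: pick up to C well-scattered points, then shrink towards the mean."""
    mean = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    if len(cluster) <= C:
        chosen = list(cluster)
    else:
        # first point: farthest from the mean; next points: farthest from those already chosen
        chosen = [max(cluster, key=lambda p: math.dist(p, mean))]
        while len(chosen) < C:
            chosen.append(max((p for p in cluster if p not in chosen),
                              key=lambda p: min(math.dist(p, q) for q in chosen)))
    # shrink: p' = p + alpha * (mean - p); outlying points move the most
    return [tuple(x + alpha * (m - x) for x, m in zip(p, mean)) for p in chosen]

cluster = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]   # (5, 5) is an outlier-ish point
print(representatives(cluster, C=3, alpha=0.3))
```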
It’s H ( X ) p1 log 2 p1 p2 log 2 p2 pm log 2 pm m p j log 2 p j j 1 H(X) = The entropy of X “High Entropy” means X is from a uniform (boring) distribution “Low Entropy” means X is from varied (peaks and valleys) distribution 2017年5月5日星期五 Data Mining: Concepts and Techniques 52 Specific Conditional Entropy X = College Major Definition of Conditional Entropy: Y = Likes “Gladiator” H(Y|X=v) = The entropy of Y among only those records in which X has value v X Y Example: Math Yes History No CS Yes • H(Y|X=Math) = 1 Math No • H(Y|X=History) = 0 Math No • H(Y|X=CS) = 0 CS Yes History No Math Yes 2017年5月5日星期五 Data Mining: Concepts and Techniques 53 Conditional Entropy Definition of general Conditional Y = Likes “Gladiator” Entropy: X = College Major H(Y|X) = The average conditional entropy of Y X Y Math Yes History No CS Yes Math No Math No CS Yes History No Math Yes 2017年5月5日星期五 = ΣjProb(X=vj) H(Y | X = vj) Example: vj Math History CS Prob(X=vj) 0.5 0.25 0.25 H(Y | X = vj) 1 0 0 H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5 Data Mining: Concepts and Techniques 54 Conditional entropy H(C|age) age buy no <=30 2 3 30…40 4 0 >40 3 2 H(C|age<=30) = 2/5 * lg(5/2) + 3/5 * lg(5/3) = 0.971 H(C|age in 30..40) = 1 * lg 1 + 0 * lg 1/0 = 0 H(C|age>40) = 3/5 * lg(5/3) + 2/5 * lg(5/2) = 0.971 H (C | age) 5 4 H (C | age 30) H (C | age (30,40]) 14 14 5 H (C | age 40) 0.694 14 2017年5月5日星期五 Data Mining: Concepts and Techniques 55 Select the attribute with lowest conditional entropy H(C|age) = 0.694 H(C|income) = 0.911 H(C|student) = 0.789 H(C|credit_rating) = 0.892 age? <=30 30..40 Select “age” to be the tree root! student? yes 2017年5月5日星期五 no yes no yes Data Mining: Concepts and Techniques >40 credit rating? excellent fair no yes 56 Bayesian Classification X: a data sample whose class label is unknown, e.g. X =(Income=medium, Credit_rating=Fair, Age=40). Hi: a hypothesis that a record belongs to class Ci, e.g. Hi = a record belongs to the “buy computer” class. P(Hi), P(X): probabilities. P(Hi/X): a conditional probability: among all records with medium income and fair credit rating, what’s the probability to buy a computer? This is what we need for classification! Given X, P(Hi/X) tells us the possibility that it belongs to some class. What if we need to determine a single class for X? 2017年5月5日星期五 Data Mining: Concepts and Techniques 57 Bayesian Theorem Another concept, P(X|Hi) : probability of observing the sample X, given that the hypothesis holds. E.g. among all people who buy computer, what percentage has the same value as X. We know P(X Hi) = P(Hi|X) P(X) = P(X|Hi) P(Hi), So P( X | H )P(H ) P(H | X ) i i P( X ) i We should assign X to the class Ci where P(Hi|X) is maximized, equivalent to maximize P(X|Hi) P(Hi). 2017年5月5日星期五 Data Mining: Concepts and Techniques 58 Naïve Bayes Classifier A simplified assumption: attributes are conditionally independent: n P( X | C i) P( x | C i) k k 1 The product of occurrence of say 2 elements x1 and x2, given the current class is C, is the product of the probabilities of each element taken separately, given the same class P([y1,y2],C) = P(y1,C) * P(y2,C) No dependence relation between attributes Greatly reduces the number of probabilities to maintain. 2017年5月5日星期五 Data Mining: Concepts and Techniques 59 Sample quiz questions 1. What data does naïve Baysian net maintain? 2. Given X =(age<=30, Income=medium, Student=yes Credit_rating=Fair) buy or not buy? 
Bayesian Classification
X: a data sample whose class label is unknown, e.g. X = (Income = medium, Credit_rating = fair, Age = 40).
Hi: the hypothesis that a record belongs to class Ci, e.g. Hi = "the record belongs to the buys-computer class".
P(Hi) and P(X) are probabilities. P(Hi | X) is a conditional probability: among all records with medium income and fair credit rating, what is the probability of buying a computer? This is what we need for classification: given X, P(Hi | X) tells us how likely it is that X belongs to each class. What if we need to determine a single class for X?

Bayesian Theorem
Another concept, P(X | Hi): the probability of observing the sample X given that the hypothesis holds, e.g. among all people who buy a computer, what percentage has the same attribute values as X.
Since P(X ∧ Hi) = P(Hi | X) P(X) = P(X | Hi) P(Hi), we get
  P(Hi | X) = P(X | Hi) P(Hi) / P(X)
We assign X to the class Ci for which P(Hi | X) is maximized, which is equivalent to maximizing P(X | Hi) P(Hi).

Naive Bayes Classifier
A simplifying assumption: the attributes are conditionally independent given the class:
  P(X | Ci) = Π_{k=1..n} P(xk | Ci)
That is, the probability of observing the attribute values together, given class Ci, is the product of the probabilities of each value taken separately given the same class; there is no dependence relation between attributes. This greatly reduces the number of probabilities to maintain.

Sample quiz questions
1. What data does a naive Bayesian classifier maintain?
2. Given X = (age <= 30, income = medium, student = yes, credit_rating = fair) and the training dataset above, buy or not buy?

Naive Bayesian Classifier: Example
Compute P(X | Ci) for each class from the training data:
  P(age <= 30 | buys_computer = yes) = 2/9 = 0.222    P(age <= 30 | buys_computer = no) = 3/5 = 0.6
  P(income = medium | yes) = 4/9 = 0.444              P(income = medium | no) = 2/5 = 0.4
  P(student = yes | yes) = 6/9 = 0.667                P(student = yes | no) = 1/5 = 0.2
  P(credit_rating = fair | yes) = 6/9 = 0.667         P(credit_rating = fair | no) = 2/5 = 0.4
For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
  P(X | yes) = 0.222 * 0.444 * 0.667 * 0.667 = 0.044
  P(X | no)  = 0.6 * 0.4 * 0.2 * 0.4 = 0.019
  P(X | yes) * P(yes) = 0.044 * 9/14 = 0.028
  P(X | no) * P(no)   = 0.019 * 5/14 = 0.007
So X belongs to the class buys_computer = yes. Pitfall: forgetting to multiply by P(Ci).

A Bayes Net Example
Assume five Boolean variables:
  T: the lecture started by 10:35
  L: the lecturer arrives late
  R: the lecture concerns robots
  M: the lecturer is Manuela
  S: it is sunny
T is only directly influenced by L (T is conditionally independent of R, M, S given L).
L is only directly influenced by M and S (L is conditionally independent of R given M and S).
R is only directly influenced by M (R is conditionally independent of L, S given M).
M and S are independent.

Making a Bayes Net
Step one: add variables. Just choose the variables you would like to include in the net.
Step two: add links. The link structure must be acyclic. If node X is given parents Q1, ..., Qn, you are promising that any variable that is a non-descendant of X is conditionally independent of X given {Q1, ..., Qn}. Here: M -> L, S -> L, M -> R, L -> T.
Step three: add a probability table for each node; the table for node X lists P(X | parent values) for each possible combination of parent values:
  P(M) = 0.6, P(S) = 0.3
  P(L | M ∧ S) = 0.05, P(L | M ∧ ~S) = 0.1, P(L | ~M ∧ S) = 0.1, P(L | ~M ∧ ~S) = 0.2
  P(R | M) = 0.3, P(R | ~M) = 0.6
  P(T | L) = 0.3, P(T | ~L) = 0.8

Computing with a Bayes Net
P(T ∧ ~R ∧ L ∧ ~M ∧ S)
  = P(T | ~R ∧ L ∧ ~M ∧ S) * P(~R ∧ L ∧ ~M ∧ S)
  = P(T | L) * P(~R ∧ L ∧ ~M ∧ S)
  = P(T | L) * P(~R | L ∧ ~M ∧ S) * P(L ∧ ~M ∧ S)
  = P(T | L) * P(~R | ~M) * P(L ∧ ~M ∧ S)
  = P(T | L) * P(~R | ~M) * P(L | ~M ∧ S) * P(~M ∧ S)
  = P(T | L) * P(~R | ~M) * P(L | ~M ∧ S) * P(~M | S) * P(S)
  = P(T | L) * P(~R | ~M) * P(L | ~M ∧ S) * P(~M) * P(S)
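Because the joint probability factors along the net, it can be evaluated directly from the CPTs. The sketch below is my own encoding of the slide's tables (the dictionaries and the joint helper are not from the slides).

```python
# CPTs from the lecture-example net: M and S are roots, L depends on (M, S),
# R depends on M, and T depends on L.
P_M, P_S = 0.6, 0.3
P_L = {(True, True): 0.05, (True, False): 0.1, (False, True): 0.1, (False, False): 0.2}
P_R = {True: 0.3, False: 0.6}
P_T = {True: 0.3, False: 0.8}

def joint(t, r, l, m, s):
    """P(T=t, R=r, L=l, M=m, S=s) as a product of the per-node conditional probabilities."""
    def pr(p, value):                 # probability of a Boolean value, given P(value is True) = p
        return p if value else 1 - p
    return (pr(P_M, m) * pr(P_S, s) *
            pr(P_L[(m, s)], l) *
            pr(P_R[m], r) *
            pr(P_T[l], t))

# The slide's query: P(T ∧ ~R ∧ L ∧ ~M ∧ S) = P(T|L) * P(~R|~M) * P(L|~M∧S) * P(~M) * P(S)
print(joint(t=True, r=False, l=True, m=False, s=True))
# 0.3 * (1 - 0.6) * 0.1 * (1 - 0.6) * 0.3 = 0.00144
```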
4. Data Warehousing
Concept and schema; data cube and operations (roll-up, ...); cube computation by multi-way array aggregation; iceberg cube; dynamic data cube.

What Is a Data Warehouse?
Defined in many different ways, but not rigorously:
A decision-support database that is maintained separately from the organization's operational database.
It supports information processing by providing a solid platform of consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." -- W. H. Inmon
Data warehousing: the process of constructing and using data warehouses.

Conceptual Modeling of Data Warehouses
Data warehouses are modeled with dimensions and measures.
Star schema: a fact table in the middle connected to a set of dimension tables.
Snowflake schema: a refinement of the star schema in which some dimensional hierarchies are normalized into sets of smaller dimension tables, forming a shape similar to a snowflake.
Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema.

A Data Cube
With dimensions product, quarter and country, the cuboids are:
  0-D (apex) cuboid: all
  1-D cuboids: product; quarter; country
  2-D cuboids: product, quarter; product, country; quarter, country
  3-D (base) cuboid: product, quarter, country

Multidimensional Data
Sales volume as a function of product, month and region. Dimensions: Product, Location, Time, each with a hierarchical summarization path:
  Product: Industry -> Category -> Product
  Location: Region -> Country -> City -> Office
  Time: Year -> Quarter -> Month -> Day, with Week as an alternative path to Day
Pick one node from each dimension hierarchy and you get a data cube. How many cubes? How many distinct cuboids?

Typical OLAP Operations
Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
Drill down (roll down): the reverse of roll-up, from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions.
Slice and dice: project and select.
Pivot (rotate): reorient the cube for visualization, e.g. turning a 3-D cube into a series of 2-D planes.
Quiz: starting from the cuboid [product, city, week], what OLAP operations produce the total sales for every month and every category in the "automobile" industry?
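Roll-up is essentially a group-by over a concept hierarchy. The sketch below is mine: the facts, the product-to-category hierarchy and the sales figures are all made up for illustration, and the function name roll_up is not from the slides.

```python
from collections import defaultdict

# Hypothetical base facts at the [product, city, month] level (all values made up).
facts = [
    ("sedan", "Boston",  "2003-01", 100),
    ("sedan", "Boston",  "2003-02", 120),
    ("sedan", "Chicago", "2003-01",  80),
    ("truck", "Boston",  "2003-01",  50),
]
# Hypothetical concept hierarchy for the product dimension: product -> category.
category_of = {"sedan": "car", "truck": "car"}

def roll_up(facts):
    """Roll up from [product, city, month] to [category, month]:
    climb the product hierarchy and aggregate away the city dimension."""
    totals = defaultdict(int)
    for product, city, month, sales in facts:
        totals[(category_of[product], month)] += sales
    return dict(totals)

print(roll_up(facts))
# {('car', '2003-01'): 230, ('car', '2003-02'): 120}
```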
OLAP Server Architectures
Relational OLAP (ROLAP): uses a relational or extended-relational DBMS to store and manage warehouse data, with OLAP middleware to support the missing pieces; includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services; greater scalability.
Multidimensional OLAP (MOLAP): an array-based multidimensional storage engine (sparse-matrix techniques); fast indexing to pre-computed summarized data.
Hybrid OLAP (HOLAP): user flexibility, e.g. low level relational, high level array.
Specialized SQL servers: specialized support for SQL queries over star/snowflake schemas.

Multi-way Array Aggregation for Cube Computation
The base cuboid is stored as a chunked multidimensional array, and the chunks are scanned in a fixed order (here the order ABC). While scanning, partial aggregates for the 2-D cuboids are kept in memory: with order ABC, the AB plane must be kept in full, the AC plane one line at a time, and the BC plane one point (chunk) at a time.

Multi-Way Array Aggregation: memory requirement
Let A have 40 values, B 400 values and C 4000 values, with each dimension split into 4 chunks, so one chunk holds 10 * 100 * 1000 = 1,000,000 cells.
Order ABC needs: AB plane 40 * 400 = 16,000; AC line 40 * (4000/4) = 40,000; BC point (400/4) * (4000/4) = 100,000; total 156,000 cells.
Order CBA needs: CB plane 4000 * 400 = 1,600,000; CA line 4000 * (40/4) = 40,000; BA point (400/4) * (40/4) = 1,000; total 1,641,000 cells -- about 10 times more!

Computing an Iceberg Cube Using BUC
BUC (Beyer & Ramakrishnan, SIGMOD'99). Bottom-up vs. top-down? It depends on how you view it!
It exploits the Apriori property: aggregate the data, then move to the next level; if minsup is not met, stop (prune that branch).

The Dynamic Data Cube [EDBT'00]
The data cube is divided recursively into quadrants, and each index entry stores aggregate (prefix-sum) information along the borders of its quadrant, so a range sum such as 16 + 12 + 8 + 6 = 42 is assembled from a few stored border values.
Query cost = update cost = O(log^2 n).

Dynamic Data Cube Summary
A balanced tree with fanout 4; the leaf nodes contain the original data cube.
Each index entry stores an X-border and a Y-border, and each border is stored as a binary tree that supports a 1-dimensional prefix-sum query and an update in O(log n) time.
Overall, the DDC supports a range-sum query and an update both in O(log^2 n) time.

5. Additional
Lattices (of itemsets, g-itemsets, rules, cuboids); distance-based indexing.

Problem Statement
Given a set S of objects and a metric distance function d(), the similarity search problem is: for an arbitrary query object q and a threshold ε, find { o | o ∈ S and d(o, q) < ε }.
The solution without an index is to compute d(q, o) for every o ∈ S -- not efficient!

An Example of the VP-tree
S = {o1, ..., o10}. Randomly pick o1 as the root (vantage point). Compute the distance between o1 and each oi and sort in increasing order of distance:
  o3: 5, o7: 6, o6: 18, o9: 34, o10: 96, o2: 102, o8: 111, o5: 300, o4: 401
Split at the median: the left subtree gets o3, o7, o6, o9 (distances up to 34) and the right subtree gets o10, o2, o8, o5, o4 (distances from 96); then build each subtree recursively.

Query Processing
Given a query object q, compute d(q, root). Intuitively, if it is small, search the left subtree; otherwise, search the right subtree.
In each index node, store maxDL = max{ d(root, oi) | oi in the left subtree } and minDR = min{ d(root, oi) | oi in the right subtree }.
Pruning conditions for a range query with threshold ε:
  prune the left subtree if d(q, root) - maxDL >= ε
  prune the right subtree if minDR - d(q, root) >= ε
Quiz: with maxDL = 10, minDR = 20, d(q, root) = 10 and ε = 10, which subtree(s) do we check? And for what ε do we have to check both subtrees?
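A minimal end-to-end sketch of the VP-tree idea follows. The class VPNode and the functions build and range_search are my own names; for illustration the objects are plain numbers with absolute difference as the (toy) metric, so the example data is not the slide's o1..o10. The build splits at the median distance and stores maxDL/minDR, and the search applies exactly the two pruning conditions above.

```python
class VPNode:
    def __init__(self, vantage, maxDL=None, minDR=None, left=None, right=None):
        self.vantage, self.maxDL, self.minDR, self.left, self.right = vantage, maxDL, minDR, left, right

def build(objects, dist):
    """Build a VP-tree: take the first object as the vantage point, sort the rest by
    distance to it, send the closer half left and the farther half right, and recurse."""
    if not objects:
        return None
    root, rest = objects[0], sorted(objects[1:], key=lambda o: dist(objects[0], o))
    if not rest:
        return VPNode(root)
    mid = len(rest) // 2
    left, right = rest[:mid], rest[mid:]
    return VPNode(root,
                  maxDL=dist(root, left[-1]) if left else None,
                  minDR=dist(root, right[0]),
                  left=build(left, dist), right=build(right, dist))

def range_search(node, q, eps, dist, out):
    """Report every object o with dist(q, o) < eps, pruning subtrees with the slide's conditions."""
    if node is None:
        return out
    d = dist(q, node.vantage)
    if d < eps:
        out.append(node.vantage)
    if node.left is not None and not (d - node.maxDL >= eps):    # prune left if d - maxDL >= eps
        range_search(node.left, q, eps, dist, out)
    if node.right is not None and not (node.minDR - d >= eps):   # prune right if minDR - d >= eps
        range_search(node.right, q, eps, dist, out)
    return out

dist = lambda a, b: abs(a - b)            # a toy metric on numbers
S = [50, 5, 6, 18, 34, 96, 102, 111, 300, 401]
tree = build(S, dist)
print(range_search(tree, q=100, eps=15, dist=dist, out=[]))   # [96, 102, 111]
```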
Summary
1. Frequent pattern & association
2. Clustering
3. Classification
4. Data warehousing
5. Additional