An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
Adam Kirsch, Michael Mitzenmacher (Harvard University); Andrea Pietracaprina, Geppino Pucci, Fabio Vandin (University of Padova); Eli Upfal (Brown University)
PODS 2009
Presented by Dongjoo Lee, IDS Lab., CSE, SNU
IDS Lab. 2009 Spring Seminar, Copyright 2009 by CEBT

Frequent Pattern Mining
- Given a dataset D of transactions over a set of items I, and a support threshold s, return all itemsets X ⊆ I with support at least s in D (i.e., contained in at least s transactions).
- [Diagram: transaction dataset D → mining algorithm (Apriori, FP-Growth, ...) → frequent itemsets, which feed association rules, correlations, sequences, episodes, classifiers, clusters, ...]
- Among all nCk possible itemsets of size k (k-itemsets), we are interested in the statistically significant ones: itemsets whose supports are significantly higher, in a statistical sense, than their expected supports in a dataset where individual items are placed in transactions independently.

Measuring the Statistical Significance of a Discovery: D vs. D̂
- D: dataset of t transactions over a set I of n items, where each transaction d_i ⊆ I
  - n(i): number of transactions that contain item i
  - f_i = n(i)/t: frequency of item i
- D̂: random dataset in which item i is included in any given transaction with probability f_i, independently of all other items and all other transactions
- Null hypothesis H0: the support s(X, D) in D is drawn from the same distribution as its support s(X, D̂) in D̂.
- Alternative hypothesis H1: the support s(X, D) in D is not drawn from that distribution; in particular, there is a positive correlation between the occurrences of the individual items in X.

Measuring the Statistical Significance of a Discovery: D vs. D̂ (cont.)
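The random dataset D̂ described above is easy to simulate. A minimal sketch in Python (the function names `generate_null_dataset` and `support` are mine, not from the paper):

```python
import random

def generate_null_dataset(freqs, t, seed=0):
    """Random dataset D-hat: item i appears in each of the t transactions
    independently with probability freqs[i]."""
    rng = random.Random(seed)
    return [{i for i, f in freqs.items() if rng.random() < f}
            for _ in range(t)]

def support(itemset, dataset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for txn in dataset if itemset <= txn)

# Toy run: under independence, E[support({a, b})] = 0.5 * 0.5 * 1000 = 250.
freqs = {"a": 0.5, "b": 0.5, "c": 0.1}
d_hat = generate_null_dataset(freqs, 1000)
print(support({"a", "b"}, d_hat))
```

Comparing the support of an itemset in the real D against its distribution in such a D̂ is exactly the hypothesis test set up on this slide.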
Example
- t = 1,000,000 transactions, |I| = 1,000 items, f_i = f_j = 1/1,000
- T_{i,j} = { d | d ∈ D, i ∈ d, j ∈ d }; Q_n = { {i,j} : |T_{i,j}| = n }
- Observed: support({i, j}) = 7 and |Q_7| = 300
- Under the null model: Pr(i and j co-occur in a transaction) = 0.000001, so E[|T_{i,j}|] = 0.000001 × 1,000,000 = 1
- By the binomial distribution Pr(k) = nCk p^k (1-p)^(n-k): Pr(|T_{i,j}| = 7) ≈ 0.0001
- There are 1000C2 = 499,500 pairs, so E[|Q_7|] ≈ 0.0001 × 499,500 ≈ 50, and yet Pr(|Q_7| = 300) ≤ 2^(-300)

Statistical Hypothesis Testing
- p-value: the probability of the observation if the null hypothesis is true
  - Observation in D: |Q_7| = 300, with Pr(|Q_7| = 300 | H0) ≤ 2^(-300)
- If the observation falls in the critical region (e.g., p-value ≤ 0.05), reject the null hypothesis.
- Significance level of the test: α = Pr(Type I error), the probability of rejecting H0 when it is true (false positive)
- Power of the test: 1 − β, where β = Pr(Type II error); the probability of correctly rejecting the null hypothesis

Multi-hypothesis Testing
- The outcome of one experiment is used to test a number of hypotheses simultaneously: Pr(|Q_{X1,k}| = s | H0_1) = p_{X1}, Pr(|Q_{X2,k}| = s | H0_2) = p_{X2}, ..., one hypothesis for each of the nCk itemsets.
- Significance level of multi-hypothesis testing:
  - Family-Wise Error Rate (FWER): the probability of incurring at least one Type I error in any of the individual tests; conservative — for large numbers of hypotheses, FWER-controlling techniques lead to tests with low power.
  - False Discovery Rate (FDR): less conservative.

False Discovery Rate Control
- FDR: the expected ratio of erroneous rejections among all rejections, FDR = E[V/R] (with V/R = 0 when R = 0)
  - V: number of Type I errors in the individual tests
  - R: total number of null hypotheses rejected by the multiple test
- FDR control: controls the expected proportion of incorrectly rejected null hypotheses.
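The binomial p-value used throughout these slides can be reproduced directly. A sketch (function names are mine), assuming the null model where an itemset occurs in a transaction with probability p_X:

```python
from math import comb

def binom_pmf(n, k, p):
    """Pr(k) = nCk * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def support_p_value(s, t, p_x):
    """p-value Pr(support >= s) when support ~ Binomial(t, p_x);
    for small s it is cheapest to sum the complementary lower tail."""
    return 1.0 - sum(binom_pmf(t, k, p_x) for k in range(s))

# Example from the slides: t = 10**6, f_i = f_j = 1/1000, so p_{ij} = 10**-6.
t, p_ij = 10**6, 1e-6
print(binom_pmf(t, 7, p_ij))       # ≈ 7.3e-5 (the slide rounds this to 0.0001)
print(support_p_value(7, t, p_ij)) # tail p-value for observing support >= 7
```

The multi-hypothesis problem arises because this p-value must be computed for every one of the nCk candidate itemsets at once.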
Standard FDR Control
- Consider all possible k-itemsets.
- Compute the p-value of each itemset X with support s, using the binomial distribution.
- Report the itemsets that satisfy the FDR constraint.

What Do the Authors Do?
1. Approximate the distribution of Q̂_{k,s} for supports at least a minimum support s_min: a Poisson approximation via the Chen–Stein method.
2. Find s_min such that the Poisson approximation of Q̂_{k,s} has error at most ϵ: a Monte Carlo method.
3. Establish a support threshold s* with a controlled FDR, reducing the number of false discoveries compared to the standard multi-comparison test.

Poisson Distribution
- If the expected number of occurrences in a certain interval is λ, then the probability that there are exactly k occurrences (k a non-negative integer, k = 0, 1, 2, ...) is
  f(k; λ) = λ^k e^(−λ) / k!
- [Plots: probability mass function and cumulative distribution function]

Poisson Approximation for Q̂_{k,s}
- Let Q_{k,s} be the number of itemsets of size k with support at least s with respect to D; Q̂_{k,s} is the corresponding random variable for D̂.
- Fix k and s. Define a collection of Bernoulli random variables { Z_X | X ⊂ I, |X| = k } such that Z_X = 1 if itemset X appears in at least s transactions of the random dataset D̂, and Z_X = 0 otherwise; let p_X = Pr(Z_X = 1).
- Let I(X) = { X′ | X ∩ X′ ≠ ø, |X′| = |X| }. If Y ∉ I(X), then Z_Y and Z_X are independent.

THEOREM 1. Let U be a Poisson random variable such that E[U] = E[Q̂_{k,s}] = λ < ∞. The variation distance between the distributions L(Q̂_{k,s}) of Q̂_{k,s} and L(U) of U is bounded in terms of quantities b1 and b2 [Chen–Stein bound; formula shown as an image in the slides].
- b1 and b2 are both decreasing in s. Therefore, if b1 + b2 < ϵ for a given s, the same upper bound holds for every s′ > s.
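Under the independence model, the Poisson mean λ = E[Q̂_{k,s}] is simply the sum of E[Z_X] = p_X over all k-itemsets X. A brute-force sketch for small I (function names are mine; for realistic instance sizes the paper does not enumerate all itemsets like this):

```python
from itertools import combinations
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    """f(k; lam) = lam^k * e^(-lam) / k!"""
    return lam**k * exp(-lam) / factorial(k)

def binom_tail(t, s, p):
    """Pr(Binomial(t, p) >= s), via the complementary lower tail."""
    return 1.0 - sum(comb(t, k) * p**k * (1 - p) ** (t - k) for k in range(s))

def expected_q(freqs, t, k, s):
    """lam = E[Q-hat_{k,s}] = sum over all k-itemsets X of p_X,
    where p_X = Pr(Binomial(t, prod_{i in X} f_i) >= s)."""
    lam = 0.0
    for itemset in combinations(freqs, k):
        p_x = 1.0
        for i in itemset:
            p_x *= freqs[i]
        lam += binom_tail(t, s, p_x)
    return lam

# Toy instance: 6 items of frequency 0.1, t = 1000 transactions, k = 2.
freqs = {c: 0.1 for c in "abcdef"}
lam = expected_q(freqs, t=1000, k=2, s=20)
print(lam, poisson_pmf(0, lam))  # Pr(no 2-itemset reaches support 20)
```

When s ≥ s_min, Theorem 1 says Q̂_{k,s} is well approximated by a Poisson variable with this mean λ.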
A Monte Carlo Method for Determining s_min
- [Algorithm shown as a figure in the slides.]

A Novel Multi-Hypothesis Testing
- Set the initial support value to s_min.
- Repeat, up to a maximum number of iterations: test the current support value; if s* is found, stop; otherwise set the next support value.

Novel Testing vs. Standard Testing

  Standard FDR testing                  | Novel testing
  --------------------------------------|--------------------------------------
  Constrains itemsets                   | Constrains the support s*
  Controls more hypotheses              | Controls fewer hypotheses
  Evaluates each itemset individually   | Evaluates the entire set of itemsets

Experimental Results
- Experiments on benchmark datasets
- Experiments on random datasets
- Comparison with the standard FDR test
- [Results shown as figures in the slides.]

Conclusion
- In a random dataset where items are placed independently in transactions, there is a minimum support s_min such that the number of k-itemsets with support at least s_min is well approximated by a Poisson distribution.
- The novel multi-hypothesis test incurs a small FDR.
- This is a first attempt at establishing a support threshold for the classical frequent itemset mining problem with a quantitative guarantee on the significance of the output.

Discussion
- The paper is hard to follow: understanding the content requires many related notions and substantial background knowledge.
- Pros
  - Good approximation
  - Lower FDR
  - Finds an appropriate support threshold by exploring the structure of the whole dataset
- Cons
  - Fails to find significant itemsets with small support
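The threshold search behind the novel test can be caricatured as follows. This is a hypothetical sketch (the function name, inputs, and stopping rule are my simplification, not the paper's exact procedure): scan supports from s_min upward and stop at the first s whose estimated FDR — the expected number of null-model discoveries divided by the number of discoveries observed in D — drops below the target α.

```python
def find_support_threshold(observed, expected, alpha):
    """Hypothetical sketch: return the smallest support s whose
    estimated FDR = expected[s] / observed[s] is at most alpha.

    observed[s]: number of k-itemsets with support >= s in the real dataset D
    expected[s]: Poisson mean lam_s of that count under the null model D-hat
    """
    for s in sorted(observed):
        r = observed[s]
        if r > 0 and expected[s] / r <= alpha:
            return s
    return None  # no support level meets the FDR target

# Toy numbers, invented for illustration only:
observed = {5: 100, 6: 40, 7: 30}
expected = {5: 50.0, 6: 10.0, 7: 0.5}
print(find_support_threshold(observed, expected, alpha=0.05))  # -> 7
```

Rejecting (i.e., reporting as significant) all k-itemsets with support at least the returned s* is what gives the single, global guarantee on the output, in contrast to the per-itemset p-values of the standard test.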