Statistical Learning Theory Meets Big Data
Randomized algorithms for frequent itemsets
Eli Upfal
Brown University
Data, data, data
● "In God we trust, all others (must) bring data"
  (Prof. W. E. Deming, Statistician, attributed)
● "Every two days now we create as much information as we did from the dawn of civilization up until 2003"
  (Eric Schmidt, Google Exec. Chairman, 2009)
● "Data explosion is bigger than Moore's Law"
  (Marissa Mayer, Yahoo! CEO, 2009)
Big Data
● Big data requires big storage and long execution times
● It's great to have big data, but do we need to process all of it?
● Trade off accuracy of results for execution speed
● Can approximation algorithms for data analysis be fast and still guarantee high-quality results?
Frequent Itemsets
● Market Basket Analysis
  – You own a grocery store
  – You have a copy of the receipts from sales
  – For each customer, you have the list of products she bought
● Interested in what groups of products are bought together (correlated items)
  – Business decisions
  – Optimization
Settings
● Transactional dataset D
● Each element of D is a transaction: a set of items from a domain I
● Sets of items are called itemsets
[Figure from Tan et al., Introduction to Data Mining: example transactional dataset]
Frequency
● Frequency of an itemset X in dataset D:
  – f_D(X): fraction of transactions of D containing X
  – Example (on the dataset in the figure above): Milk: 4/5, {Bread, Milk}: 3/5
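A minimal Python sketch of this definition. The dataset below is the standard example from Tan et al.'s textbook (its exact contents are an assumption recalled from that book, consistent with the 4/5 and 3/5 frequencies quoted above):

    # Example dataset (assumed to match the Tan et al. figure referenced above).
    D = [
        {"Bread", "Milk"},
        {"Bread", "Diapers", "Beer", "Eggs"},
        {"Milk", "Diapers", "Beer", "Cola"},
        {"Bread", "Milk", "Diapers", "Beer"},
        {"Bread", "Milk", "Diapers", "Cola"},
    ]

    def frequency(X, D):
        """f_D(X): fraction of transactions of D containing every item of X."""
        X = set(X)
        return sum(1 for t in D if X <= t) / len(D)

    print(frequency({"Milk"}, D))           # 0.8 = 4/5
    print(frequency({"Bread", "Milk"}, D))  # 0.6 = 3/5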
Frequent Itemsets
● Frequent itemsets mining with respect to a threshold θ in (0,1]:
  – Find all itemsets X with frequency f_D(X) ≥ θ, i.e., FI(D,θ) = { X : f_D(X) ≥ θ }
● Top-K frequent itemsets:
  – Find all itemsets at least as frequent as the k-th most frequent itemset
● Association rules X → Y:
  – The presence of X in a transaction is correlated with the presence of Y in the transaction
Frequent Itemsets mining
● Well-studied classical problem in data analysis
● There are algorithms to extract the exact collection of FIs:
  – APriori [AS94], FPGrowth [HPY00], Eclat [Zaki00]
● Running time depends on the number of transactions in the dataset and on the number of frequent itemsets:
  – 10^8 transactions is considered a "normal" size
  – 10^4 items → 2^(10^4) itemsets in total
  – A transaction with d items contains 2^d itemsets
How to speed up mining?
● Decrease the dependency on the number of transactions:
  1. create a random sample of the dataset
  2. extract the frequent itemsets from the sample, with a lowered frequency threshold
● Sampling forces us to accept approximate results
● Need to quantify what is acceptable
● Given ε, δ in (0,1), extract a collection W of itemsets such that:
  – itemsets with frequency ≥ θ must be in W
  – itemsets with frequency < θ − ε must not be in W
  – itemsets with frequency in [θ − ε, θ) may be in W
● Problem: choose the right sample size and the right lowered threshold
Observation
● Data mining is exploratory in nature
● Exact results are usually not needed: we want a good, realistic idea of the process
● Fast analysis is more important
  ... if it still guarantees good-enough results!
Choosing the lowered threshold
● If we have, for all itemsets X simultaneously, |f_S(X) − f_D(X)| ≤ ε/2, then mining the sample S at threshold θ − ε/2 does the trick:
  – any X with f_D(X) ≥ θ has f_S(X) ≥ θ − ε/2, so X is in W
  – any X with f_D(X) < θ − ε has f_S(X) < θ − ε/2, so X is not in W
● Need a sample size |S| such that, with probability at least 1 − δ, for all itemsets X we have |f_S(X) − f_D(X)| ≤ ε/2
Choosing the sample size
● Naïve approach:
  – Given an itemset X, the frequency of X in the sample S is distributed like a (rescaled) binomial: |S| · f_S(X) ~ Binomial(|S|, f_D(X))
  – Use the Chernoff bound to bound Pr( |f_S(X) − f_D(X)| > ε/2 )
  – Apply the union bound over all itemsets to find |S| such that the guarantee holds for all itemsets simultaneously with probability at least 1 − δ
● Problem: there is an exponential number of itemsets
  – The sample size would be too large (linear in the number of items)
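To see why the naïve bound is impractical, here is a back-of-the-envelope sketch. It uses Hoeffding's inequality in place of the Chernoff bound named on the slide (a simplifying assumption; the shape of the conclusion is the same): Pr(|f_S(X) − f_D(X)| > ε/2) ≤ 2·exp(−|S|ε²/2), so a union bound over all 2^n itemsets needs |S| ≥ (2/ε²)(n·ln 2 + ln(2/δ)).

    import math

    def naive_sample_size(n_items, eps, delta):
        """Hoeffding bound + union bound over all 2^n_items itemsets.
        Grows linearly in the number of items."""
        return math.ceil((2.0 / eps**2) * (n_items * math.log(2) + math.log(2.0 / delta)))

    # 10^4 items, eps = delta = 0.01: about 1.4 * 10^8 transactions, i.e.,
    # as large as a "normal" dataset, so sampling would gain nothing.
    print(naive_sample_size(10_000, 0.01, 0.01))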
Vapnik-Chervonenkis Dimension
● A combinatorial property of a collection of subsets of a domain
● Measures the "richness" or "expressivity" of the collection of subsets
● If we know the VC-dimension of a collection of subsets, we can compute the minimum sample size needed to approximate the sizes of all the subsets using the sample
Range spaces
● VC-dimension is defined on range spaces
● (B,R): range space
  – B: domain
  – R: collection of subsets of B (the ranges)
● No restrictions:
  – B can be infinite
  – R can be infinite
  – R can contain infinitely-large subsets of B
Vapnik-Chervonenkis Dimension
● Range space (B,R)
● For any C ⊆ B, define the projection P_R(C) = { r ∩ C : r ∈ R }
● C is shattered if P_R(C) = 2^C (every subset of C is obtained as r ∩ C for some range r)
● The VC-Dimension of (B,R) is the size of the largest shattered subset of B
Example of VC-Dimension
● B = ℝ² (the plane)
● R = all axis-aligned rectangles in ℝ²
● Shattering 4 points: easy
  – Take any 4 points such that no 3 of them are aligned
  – Need 16 rectangles, one per subset
● Shattering 5 points: impossible
  – Take any 5 points: one of them is contained in all rectangles containing the other four
  – Impossible to find a rectangle containing only the other four
● VC(B,R) = 4
[Figures: point configurations in the (x,y) plane with their shattering rectangles]
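This argument is easy to check mechanically. The sketch below (hypothetical helper names) brute-forces the shattering test; it relies on the observation that if any axis-aligned rectangle isolates a subset, then the subset's bounding box already does:

    from itertools import combinations

    def can_isolate(subset, others):
        """True if some axis-aligned rectangle contains `subset` and no point
        of `others`. The bounding box of `subset` is the minimal candidate."""
        if not subset:
            return True  # a tiny rectangle placed away from all points works
        xs = [p[0] for p in subset]
        ys = [p[1] for p in subset]
        x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
        return not any(x0 <= x <= x1 and y0 <= y <= y1 for (x, y) in others)

    def shattered(points):
        """True if every subset of `points` can be isolated by a rectangle."""
        pts = set(points)
        return all(can_isolate(set(s), pts - set(s))
                   for r in range(len(points) + 1)
                   for s in combinations(points, r))

    diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # 4 points, no 3 aligned
    print(shattered(diamond))             # True: VC-dim >= 4
    print(shattered(diamond + [(0, 0)]))  # False: the 5th point is trapped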
Approximating sizes of ranges
● Sampling Theorem:
  – Let (B,R) have VC(B,R) ≤ d. Given ε, δ in (0,1), let S be a collection of points of B sampled independently and uniformly at random, with
    |S| ≥ (c/ε²) (d + log(1/δ))
    for a universal constant c. Then, with probability at least 1 − δ,
    | |S ∩ r|/|S| − |r|/|B| | ≤ ε for all r in R simultaneously
● Can approximate the sizes of all r in R simultaneously
  – No need for a union bound
VC-Dimension for Frequent Itemsets
● B = dataset D (the set of transactions)
● For an itemset X, T_D(X) = transactions of D containing X, so f_D(X) = |T_D(X)| / |D|
● R = { T_D(X) : X is an itemset }
● To approximate all frequencies using the Sampling Theorem, we need an upper bound to VC(B,R)
Upper Bound to VC-Dim for FI's
● B = dataset D (set of transactions), R = { T_D(X) : X is an itemset }
● Theorem [the d-index of a dataset]:
  – Let d be the maximum integer such that D contains at least d transactions of length at least d. Then VC(B,R) ≤ d
The d-index of the dataset
● d-index d: the maximum integer such that D contains at least d transactions of length at least d
● Example: this dataset has d-index d = 3:
  – {bread, beer, milk, chips, coke, pasta}
  – {bread, coke, chips}
  – {milk, coffee, pasta}
  – {milk, coffee}
● Independent of |D|
● Can be computed with a single scan of the dataset
● d is an upper bound to the VC-dimension
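A sketch of the single-scan computation, h-index-style: a min-heap holds the lengths of the transactions currently counted toward d and never grows beyond d entries.

    import heapq

    def d_index(transactions):
        """Largest d such that at least d transactions have length >= d.
        Single scan; the heap holds the lengths currently counted toward d."""
        heap = []
        for t in transactions:
            length = len(t)
            if length > len(heap):        # this transaction could raise d
                heapq.heappush(heap, length)
                if heap[0] < len(heap):   # smallest counted length no longer qualifies
                    heapq.heappop(heap)
        return len(heap)

    D = [
        {"bread", "beer", "milk", "chips", "coke", "pasta"},
        {"bread", "coke", "chips"},
        {"milk", "coffee", "pasta"},
        {"milk", "coffee"},
    ]
    print(d_index(D))  # 3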
Choosing the right sample size
● Theorem:
  – Let ε, δ in (0,1)
  – Let D be a dataset and d be the maximum integer such that D contains at least d transactions of length at least d
  – Let S be a collection of transactions of D sampled independently and uniformly at random, with
    |S| ≥ (4c/ε²) (d + log(1/δ))
  – Then, with probability at least 1 − δ, |f_S(X) − f_D(X)| ≤ ε/2 for all itemsets X simultaneously
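A direct transcription of the theorem as a sketch. The value of the universal constant c is not given on the slides; c = 0.5, an estimate reported in the ε-approximation literature, is used here purely as an assumption:

    import math

    def vc_sample_size(d, eps, delta, c=0.5):
        """|S| >= (4c / eps^2) * (d + ln(1/delta)); d is the d-index of D.
        c = 0.5 is an assumed estimate of the universal constant."""
        return math.ceil((4.0 * c / eps**2) * (d + math.log(1.0 / delta)))

    # With d = 40 and eps = delta = 0.01, |S| is below 10^6: orders of magnitude
    # smaller than the naive bound, and independent of |D| and of the number of items.
    print(vc_sample_size(40, 0.01, 0.01))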
Algorithm
● To extract an approximate collection of frequent itemsets:
  1. Compute d for the dataset D
  2. Compute the sample size |S| (a function of d, ε, and δ)
  3. Create a random sample S of that size from D
  4. Extract the frequent itemsets from S using the lowered threshold θ − ε/2
● Theorem: with probability at least 1 − δ, the returned set W of itemsets satisfies the desired property:
  – itemsets with frequency ≥ θ are in W
  – itemsets with frequency < θ − ε are not in W
  – itemsets with frequency in [θ − ε, θ) may be in W
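Putting the pieces together, a compact self-contained sketch of the whole sequential algorithm. The miner is a plain Apriori without candidate pruning, chosen only for brevity (the algorithm is agnostic to this choice), and c = 0.5 is the same assumed constant as above:

    import math, random
    from itertools import combinations

    def apriori(transactions, theta):
        """All itemsets with frequency >= theta (plain Apriori, no pruning)."""
        n = len(transactions)
        result = {}
        candidates = {frozenset([i]) for t in transactions for i in t}
        k = 1
        while candidates:
            counts = {X: sum(1 for t in transactions if X <= t) for X in candidates}
            frequent = {X for X, c in counts.items() if c / n >= theta}
            result.update({X: counts[X] / n for X in frequent})
            k += 1
            candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        return result

    def approximate_fi(D, theta, eps, delta, c=0.5):
        # 1. d-index: largest d with at least d transactions of length >= d
        lengths = sorted((len(t) for t in D), reverse=True)
        d = sum(1 for i, L in enumerate(lengths) if L >= i + 1)
        # 2. sample size from the theorem (c = 0.5 is an assumption)
        size = math.ceil((4.0 * c / eps**2) * (d + math.log(1.0 / delta)))
        # 3. uniform sample with replacement
        S = random.choices(D, k=size)
        # 4. mine the sample at the lowered threshold
        return apriori(S, theta - eps / 2)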
A closer look...
● Expression of the sample size:
  |S| ≥ (4c/ε²) (d + log(1/δ))
● d: the maximum integer such that D contains at least d transactions of length at least d
  – does not depend on |D|
  – does not depend on the number of items
  – does not depend on the number of itemsets
● The time to mine S does not depend on |D| !!!
Proof (Intuition)
● For a set of k transactions to be shattered, each transaction must appear in 2^(k−1) different T_D(X)'s, where X is an itemset
● A transaction only appears in the T_D(X)'s of the itemsets X it contains
● A transaction of length w contains 2^w − 1 itemsets
● Need w ≥ k for the transaction to belong to a shattered set of size k
● To shatter k transactions, they must all have length ≥ k
● The maximum k for which this happens is an upper bound to the VC-dimension
Experiments
● The sample always fits into main memory
● The output always satisfies the required approximation guarantees
● Frequency accuracy even better than guaranteed
● Mining time significantly improved
Conclusions
We showed how the VC-dimension, a highly theoretical concept from statistical learning theory, can help in developing an efficient algorithm to approximate the collection of frequent itemsets, addressing one of the Big Data challenges through sampling.

Riondato & Upfal. Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML PKDD 2012.

PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce
Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, Eli Upfal
Goals
● Extract a high-quality approximation of FI(D,θ) by mining several small samples of D in parallel
● Adapt to the available computational resources
● Minimize data replication and communication
● Achieve high parallelism
Why parallelize?
● For a very high-quality approximation, the sample would still be large
● Sample creation time depends on the size of the dataset
● Parallelization allows us to achieve:
  – higher accuracy (lower ε)
  – higher confidence (lower δ)
  – faster results
● Makes sense when the dataset is stored on a distributed filesystem
MapReduce
● A framework for easy parallelization on large clusters of commodity machines
● Allows very fast analysis of large datasets
● Algorithms are expressed through two functions:
  – Map: (k,v) → (k1,v1), (k2,v2), ...
  – Reduce: (r, (w1,w2,...)) → (k1,v1), (k2,v2), ...
● It is expensive to move data around (shuffling)
  – Design goal: avoid data replication
The optimization framework
● Each sample should fit into local memory
● No more samples than the number of processors
● Exploit the available computational resources to maximize the quality of the approximation
● Constraints and objective function are expressed as a Mixed Integer Non-Linear Program (MINLP)
  – Convex feasibility region and convex objective function
  – Easy and fast to solve with a global MINLP solver
The Algorithm: PARMA
● Very simple design: 2 MapReduce rounds
● 1st round
  – Map: create the samples
  – Reduce: mine each sample
● 2nd round
  – Map: identity function
  – Reduce: aggregation/filtering of the results
1st Round: Sampling and Mining
● Parallelization of the sequential algorithm
● Map: sample creation
  – Random sampling with replacement ... in parallel!
  – Input: (transaction id, transaction)
  – Output: (sample id, transaction)
● Reduce: mining of each sample in parallel, using the lowered frequency threshold θ − ε/2
  – Can use any sequential exact mining algorithm (APriori, FPGrowth, ...)
  – Input: (sample id, (transaction1, transaction2, ...))
  – Output: (itemset, (frequency, confidence interval))
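A local, non-distributed simulation of round 1, only to make the key/value flow concrete. The routing shown (filling each sample's slots with uniform draws) is a simplification of PARMA's actual map stage, local outputs are simplified to frequency estimates, and `apriori` is the toy miner from the earlier sketch, standing in for FP-Growth:

    import random
    from collections import defaultdict

    def parma_round1(D, n_samples, sample_size, theta, eps):
        """Map: emit (sample id, transaction) pairs so that each sample is a
        with-replacement sample of D. Reduce: mine each sample at theta - eps/2."""
        # Map phase (simulated): route random transactions to each sample
        samples = defaultdict(list)
        for s in range(n_samples):
            for _ in range(sample_size):
                samples[s].append(random.choice(D))
        # Reduce phase (simulated; one miner per sample, run in parallel in
        # the real system): `apriori` stands in for any exact miner
        return {s: apriori(T, theta - eps / 2) for s, T in samples.items()}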
2nd Round: Aggregation
● Map: identity function
● Reduce: filtering and aggregation
  – Input: (itemset, ((freq1, conf_int1), (freq2, conf_int2), ...))
  – Output: (itemset, (frequency, confidence interval))
● An itemset is sent to the output only if it was frequent in a sufficient number of samples (determined by the optimization framework)
● Confidence interval: the smallest interval containing a sufficient number of the estimates freq1, freq2, ...
● Frequency: the median of the confidence interval
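A matching sketch of the round-2 reducer. The threshold `required` (the minimum number of samples in which an itemset must be frequent) comes from the optimization framework and is taken here as an input; the interval and median mechanics are one plausible reading of the slide, not PARMA's exact code:

    from collections import defaultdict

    def parma_round2(local_results, required):
        """Keep an itemset only if it was frequent in >= `required` samples;
        report the smallest interval covering `required` of its frequency
        estimates, and the median estimate inside that interval."""
        estimates = defaultdict(list)
        for freqs in local_results.values():   # round-2 map is the identity
            for X, f in freqs.items():
                estimates[X].append(f)
        output = {}
        for X, fs in estimates.items():
            if len(fs) < required:
                continue                        # not frequent in enough samples
            fs.sort()
            # smallest interval containing `required` of the estimates
            i0 = min(range(len(fs) - required + 1),
                     key=lambda i: fs[i + required - 1] - fs[i])
            window = fs[i0:i0 + required]
            output[X] = (window[required // 2], (window[0], window[-1]))
        return output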
Analysis
● Theorem: with probability at least 1 − δ, the output is an ε-approximation to FI(D,θ)
● Proof intuition: "boosting-like" approach
  – The outputs of the reducers in stage 1 are ε-approximations for the local samples, each with some probability 1 − φ
  – If an itemset appears in enough ε-approximations, then it is probably frequent in the dataset
● Consequences: better approximation, fewer false positives
Implementation
● PARMA is implemented as a Java library for Hadoop
  – Easy to run; possible integration with Apache Mahout
● FP-Growth is used for the sample-mining phase
  – PARMA is agnostic to the choice of mining algorithm
  – We used an existing Java implementation
● 1400 lines of code
● Available on GitHub
Experiments
● Run on Amazon Elastic MapReduce
  – Up to 16 machines, with 17GB of RAM
● Artificial datasets created using standard benchmark generators
  – Various parameters (transaction size, number of items, ...)
  – Size between 5 and 50 million transactions
● Compared the performance of PARMA with PFP and naïve distributed counting
  – PARMA performs significantly better (1.5x to 5x speed improvement) in all tests
Runtime Analysis
● Runtime does not grow with the size of the data
  – Because the individual sample size does not grow!
● Much faster than PFP, and faster as the data grows!
● PARMA gets faster with more machines, even with more data
Speedup
● Quasi-linear speedup
● Stage 1: very close to optimal
● Stage 2: no speedup
  – Its running time depends on the number of itemsets
  – No way to control the number of itemsets
  – Common to frequent itemsets mining algorithms
[Plot: speedup (1 to 7) vs. number of instances (2, 4, 8), with ideal, total, Stage 1, and Stage 2 curves]
Accuracy
● Output is always an ε-approximation to FI(D,θ)
● Output not much larger than the real collection
● Better frequency accuracy than guaranteed
● Smaller confidence intervals than guaranteed
Conclusions
We showed how to efficiently use MapReduce to mine a high-quality approximation of the frequent itemsets and association rules of a large dataset, with minimal data replication and quasi-linear speedup.
Publications
● Riondato, Upfal. Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML PKDD 2012.
● Riondato, DeBrabant, Fonseca, Upfal. PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce. CIKM 2012.