Statistical Learning Theory Meets Big Data
Randomized algorithms for frequent itemsets
Eli Upfal
Brown University
Data, data, data
● "In God we trust, all others (must) bring data"
  (Prof. W. E. Deming, Statistician, attributed)
● "Every two days now we create as much information as we did from the dawn of civilization up until 2003"
  (Eric Schmidt, Google Exec. Chairman, 2009)
● "Data explosion is bigger than Moore's Law"
  (Marissa Mayer, Yahoo! CEO, 2009)
Big Data
● Big data requires big storage and long execution times
● It's great to have big data, but do we need to process all of it?
● Trade off accuracy of results for execution speed
● Can approximation algorithms for data analysis be fast and still guarantee high-quality results?
Frequent Itemsets
● Market Basket Analysis
  – You own a grocery store
  – You have a copy of the receipts from sales
  – For each customer, you have the list of products she bought
● Interested in what groups of products are bought together (correlated items)
  – Business decisions
  – Optimization
Settings
● Transactional dataset D
● Each element of D is a transaction: a set of items from a domain I
● Sets of items are called itemsets
[Figure from Tan et al., Introduction to Data Mining: example transactional dataset]
Frequency
● Frequency of an itemset X in dataset D:
  – f_D(X): fraction of transactions of D containing X
  – Example (on the dataset in the figure above): Milk: 4/5, {Bread, Milk}: 3/5
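A minimal Python sketch of this definition. The dataset below is the standard example from Tan et al.'s textbook (its exact contents are an assumption recalled from that book, consistent with the 4/5 and 3/5 frequencies quoted above):

    # Example dataset (assumed to match the Tan et al. figure referenced above).
    D = [
        {"Bread", "Milk"},
        {"Bread", "Diapers", "Beer", "Eggs"},
        {"Milk", "Diapers", "Beer", "Cola"},
        {"Bread", "Milk", "Diapers", "Beer"},
        {"Bread", "Milk", "Diapers", "Cola"},
    ]

    def frequency(X, D):
        """f_D(X): fraction of transactions of D containing every item of X."""
        X = set(X)
        return sum(1 for t in D if X <= t) / len(D)

    print(frequency({"Milk"}, D))           # 0.8 = 4/5
    print(frequency({"Bread", "Milk"}, D))  # 0.6 = 3/5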
Frequent Itemsets
● Frequent itemsets mining with respect to a threshold θ in (0,1]:
  – Find all itemsets X with frequency f_D(X) ≥ θ, i.e., FI(D,θ) = { X : f_D(X) ≥ θ }
● Top-K frequent itemsets:
  – Find all itemsets at least as frequent as the k-th most frequent itemset
● Association rules X → Y:
  – The presence of X in a transaction is correlated with the presence of Y in the transaction
Frequent Itemsets mining
● Well-studied classical problem in data analysis
● There are algorithms to extract the exact collection of FIs:
  – APriori [AS94], FPGrowth [HPY00], Eclat [Zaki00]
● Running time depends on the number of transactions in the dataset and on the number of frequent itemsets:
  – 10^8 transactions is considered a "normal" size
  – 10^4 items → 2^(10^4) itemsets in total
  – A transaction with d items contains 2^d itemsets
How to speed up mining?
● Decrease the dependency on the number of transactions:
  1. create a random sample of the dataset
  2. extract the frequent itemsets from the sample, with a lowered frequency threshold
● Sampling forces us to accept approximate results
● Need to quantify what is acceptable
● Given ε, δ in (0,1), extract a collection W of itemsets such that:
  – itemsets with frequency ≥ θ must be in W
  – itemsets with frequency < θ − ε must not be in W
  – itemsets with frequency in [θ − ε, θ) may be in W
● Problem: choose the right sample size and the right lowered threshold
Observation
● Data mining is exploratory in nature
● Exact results are usually not needed: we want a good, realistic idea of the process
● Fast analysis is more important
  ... if it still guarantees good-enough results!
Choosing the lowered threshold
● If we have, for all itemsets X simultaneously, |f_S(X) − f_D(X)| ≤ ε/2, then mining the sample S at threshold θ − ε/2 does the trick:
  – any X with f_D(X) ≥ θ has f_S(X) ≥ θ − ε/2, so X is in W
  – any X with f_D(X) < θ − ε has f_S(X) < θ − ε/2, so X is not in W
● Need a sample size |S| such that, with probability at least 1 − δ, for all itemsets X we have |f_S(X) − f_D(X)| ≤ ε/2
Choosing the sample size
● Naïve approach:
  – Given an itemset X, the frequency of X in the sample S is distributed like a (rescaled) binomial: |S| · f_S(X) ~ Binomial(|S|, f_D(X))
  – Use the Chernoff bound to bound Pr( |f_S(X) − f_D(X)| > ε/2 )
  – Apply the union bound over all itemsets to find |S| such that the guarantee holds for all itemsets simultaneously with probability at least 1 − δ
● Problem: there is an exponential number of itemsets
  – The sample size would be too large (linear in the number of items)
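To see why the naïve bound is impractical, here is a back-of-the-envelope sketch. It uses Hoeffding's inequality in place of the Chernoff bound named on the slide (a simplifying assumption; the shape of the conclusion is the same): Pr(|f_S(X) − f_D(X)| > ε/2) ≤ 2·exp(−|S|ε²/2), so a union bound over all 2^n itemsets needs |S| ≥ (2/ε²)(n·ln 2 + ln(2/δ)).

    import math

    def naive_sample_size(n_items, eps, delta):
        """Hoeffding bound + union bound over all 2^n_items itemsets.
        Grows linearly in the number of items."""
        return math.ceil((2.0 / eps**2) * (n_items * math.log(2) + math.log(2.0 / delta)))

    # 10^4 items, eps = delta = 0.01: about 1.4 * 10^8 transactions, i.e.,
    # as large as a "normal" dataset, so sampling would gain nothing.
    print(naive_sample_size(10_000, 0.01, 0.01))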
Vapnik-Chervonenkis Dimension
● A combinatorial property of a collection of subsets of a domain
● Measures the "richness" or "expressivity" of the collection of subsets
● If we know the VC-dimension of a collection of subsets, we can compute the minimum sample size needed to approximate the sizes of all the subsets using the sample
Range spaces
● VC-dimension is defined on range spaces
● (B,R): range space
  – B: domain
  – R: collection of subsets of B (the ranges)
● No restrictions:
  – B can be infinite
  – R can be infinite
  – R can contain infinitely-large subsets of B
Vapnik-Chervonenkis Dimension
● Range space (B,R)
● For any C ⊆ B, define the projection P_R(C) = { r ∩ C : r ∈ R }
● C is shattered if P_R(C) = 2^C (every subset of C is obtained as r ∩ C for some range r)
● The VC-Dimension of (B,R) is the size of the largest shattered subset of B
Example of VC-Dimension
● B = ℝ² (the plane)
● R = all axis-aligned rectangles in ℝ²
● Shattering 4 points: easy
  – Take any 4 points such that no 3 of them are aligned
  – Need 16 rectangles, one per subset
● Shattering 5 points: impossible
  – Take any 5 points: one of them is contained in all rectangles containing the other four
  – Impossible to find a rectangle containing only the other four
● VC(B,R) = 4
[Figures: point configurations in the (x,y) plane with their shattering rectangles]
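This argument is easy to check mechanically. The sketch below (hypothetical helper names) brute-forces the shattering test; it relies on the observation that if any axis-aligned rectangle isolates a subset, then the subset's bounding box already does:

    from itertools import combinations

    def can_isolate(subset, others):
        """True if some axis-aligned rectangle contains `subset` and no point
        of `others`. The bounding box of `subset` is the minimal candidate."""
        if not subset:
            return True  # a tiny rectangle placed away from all points works
        xs = [p[0] for p in subset]
        ys = [p[1] for p in subset]
        x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
        return not any(x0 <= x <= x1 and y0 <= y <= y1 for (x, y) in others)

    def shattered(points):
        """True if every subset of `points` can be isolated by a rectangle."""
        pts = set(points)
        return all(can_isolate(set(s), pts - set(s))
                   for r in range(len(points) + 1)
                   for s in combinations(points, r))

    diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # 4 points, no 3 aligned
    print(shattered(diamond))             # True: VC-dim >= 4
    print(shattered(diamond + [(0, 0)]))  # False: the 5th point is trapped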
Approximating sizes of ranges
● Sampling Theorem:
  – Let (B,R) have VC(B,R) ≤ d. Given ε, δ in (0,1), let S be a collection of points of B sampled independently and uniformly at random, with
    |S| ≥ (c/ε²) (d + log(1/δ))
    for a universal constant c. Then, with probability at least 1 − δ,
    | |S ∩ r|/|S| − |r|/|B| | ≤ ε for all r in R simultaneously
● Can approximate the sizes of all r in R simultaneously
  – No need for a union bound
VC-Dimension for Frequent Itemsets
● B = dataset D (the set of transactions)
● For an itemset X, T_D(X) = transactions of D containing X, so f_D(X) = |T_D(X)| / |D|
● R = { T_D(X) : X is an itemset }
● To approximate all frequencies using the Sampling Theorem, we need an upper bound to VC(B,R)
Upper Bound to VC-Dim for FI's
● B = dataset D (set of transactions), R = { T_D(X) : X is an itemset }
● Theorem [the d-index of a dataset]:
  – Let d be the maximum integer such that D contains at least d transactions of length at least d. Then VC(B,R) ≤ d
The d-index of the dataset
● d-index d: the maximum integer such that D contains at least d transactions of length at least d
● Example: this dataset has d-index d = 3:
  – {bread, beer, milk, chips, coke, pasta}
  – {bread, coke, chips}
  – {milk, coffee, pasta}
  – {milk, coffee}
● Independent of |D|
● Can be computed with a single scan of the dataset
● d is an upper bound to the VC-dimension
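A sketch of the single-scan computation, h-index-style: a min-heap holds the lengths of the transactions currently counted toward d and never grows beyond d entries.

    import heapq

    def d_index(transactions):
        """Largest d such that at least d transactions have length >= d.
        Single scan; the heap holds the lengths currently counted toward d."""
        heap = []
        for t in transactions:
            length = len(t)
            if length > len(heap):        # this transaction could raise d
                heapq.heappush(heap, length)
                if heap[0] < len(heap):   # smallest counted length no longer qualifies
                    heapq.heappop(heap)
        return len(heap)

    D = [
        {"bread", "beer", "milk", "chips", "coke", "pasta"},
        {"bread", "coke", "chips"},
        {"milk", "coffee", "pasta"},
        {"milk", "coffee"},
    ]
    print(d_index(D))  # 3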
Choosing the right sample size
● Theorem:
  – Let ε, δ in (0,1)
  – Let D be a dataset and d be the maximum integer such that D contains at least d transactions of length at least d
  – Let S be a collection of transactions of D sampled independently and uniformly at random, with
    |S| ≥ (4c/ε²) (d + log(1/δ))
  – Then, with probability at least 1 − δ, |f_S(X) − f_D(X)| ≤ ε/2 for all itemsets X simultaneously
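A direct transcription of the theorem as a sketch. The value of the universal constant c is not given on the slides; c = 0.5, an estimate reported in the ε-approximation literature, is used here purely as an assumption:

    import math

    def vc_sample_size(d, eps, delta, c=0.5):
        """|S| >= (4c / eps^2) * (d + ln(1/delta)); d is the d-index of D.
        c = 0.5 is an assumed estimate of the universal constant."""
        return math.ceil((4.0 * c / eps**2) * (d + math.log(1.0 / delta)))

    # With d = 40 and eps = delta = 0.01, |S| is below 10^6: orders of magnitude
    # smaller than the naive bound, and independent of |D| and of the number of items.
    print(vc_sample_size(40, 0.01, 0.01))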
Algorithm
● To extract an approximate collection of frequent itemsets:
  1. Compute d for the dataset D
  2. Compute the sample size |S| (a function of d, ε, and δ)
  3. Create a random sample S of that size from D
  4. Extract the frequent itemsets from S using the lowered threshold θ − ε/2
● Theorem: with probability at least 1 − δ, the returned set W of itemsets satisfies the desired property:
  – itemsets with frequency ≥ θ are in W
  – itemsets with frequency < θ − ε are not in W
  – itemsets with frequency in [θ − ε, θ) may be in W
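Putting the pieces together, a compact self-contained sketch of the whole sequential algorithm. The miner is a plain Apriori without candidate pruning, chosen only for brevity (the algorithm is agnostic to this choice), and c = 0.5 is the same assumed constant as above:

    import math, random
    from itertools import combinations

    def apriori(transactions, theta):
        """All itemsets with frequency >= theta (plain Apriori, no pruning)."""
        n = len(transactions)
        result = {}
        candidates = {frozenset([i]) for t in transactions for i in t}
        k = 1
        while candidates:
            counts = {X: sum(1 for t in transactions if X <= t) for X in candidates}
            frequent = {X for X, c in counts.items() if c / n >= theta}
            result.update({X: counts[X] / n for X in frequent})
            k += 1
            candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        return result

    def approximate_fi(D, theta, eps, delta, c=0.5):
        # 1. d-index: largest d with at least d transactions of length >= d
        lengths = sorted((len(t) for t in D), reverse=True)
        d = sum(1 for i, L in enumerate(lengths) if L >= i + 1)
        # 2. sample size from the theorem (c = 0.5 is an assumption)
        size = math.ceil((4.0 * c / eps**2) * (d + math.log(1.0 / delta)))
        # 3. uniform sample with replacement
        S = random.choices(D, k=size)
        # 4. mine the sample at the lowered threshold
        return apriori(S, theta - eps / 2)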
A closer look...
● Expression of the sample size:
  |S| ≥ (4c/ε²) (d + log(1/δ))
● d: the maximum integer such that D contains at least d transactions of length at least d
  – does not depend on |D|
  – does not depend on the number of items
  – does not depend on the number of itemsets
● The time to mine S does not depend on |D| !!!
Proof (Intuition)
● For a set of k transactions to be shattered, each transaction must appear in 2^(k−1) different T_D(X)'s, where X is an itemset
● A transaction only appears in the T_D(X)'s of the itemsets X it contains
● A transaction of length w contains 2^w − 1 itemsets
● Need w ≥ k for the transaction to belong to a shattered set of size k
● To shatter k transactions, they must all have length ≥ k
● The maximum k for which this happens is an upper bound to the VC-dimension
Experiments
● The sample always fits into main memory
● The output always satisfies the required approximation guarantees
● Frequency accuracy even better than guaranteed
● Mining time significantly improved
Conclusions
We showed how the VC-dimension, a highly theoretical concept from statistical learning theory, can help in developing an efficient algorithm to approximate the collection of frequent itemsets, addressing one of the Big Data challenges through sampling.

Riondato & Upfal. Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML PKDD 2012.

PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce
Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, Eli Upfal
Goals
● Extract a high-quality approximation of FI(D,θ) by mining several small samples of D in parallel
● Adapt to the available computational resources
● Minimize data replication and communication
● Achieve high parallelism
Why parallelize?
● For a very high-quality approximation, the sample would still be large
● Sample creation time depends on the size of the dataset
● Parallelization allows us to achieve:
  – higher accuracy (lower ε)
  – higher confidence (lower δ)
  – faster results
● Makes sense when the dataset is stored on a distributed filesystem
MapReduce
● A framework for easy parallelization on large clusters of commodity machines
● Allows very fast analysis of large datasets
● Algorithms are expressed through two functions:
  – Map: (k,v) → (k1,v1), (k2,v2), ...
  – Reduce: (r, (w1,w2,...)) → (k1,v1), (k2,v2), ...
● It is expensive to move data around (shuffling)
  – Design goal: avoid data replication
The optimization framework
● Each sample should fit into local memory
● No more samples than the number of processors
● Exploit the available computational resources to maximize the quality of the approximation
● Constraints and objective function are expressed as a Mixed Integer Non-Linear Program (MINLP)
  – Convex feasibility region and convex objective function
  – Easy and fast to solve with a global MINLP solver
The Algorithm: PARMA
● Very simple design: 2 MapReduce rounds
● 1st round
  – Map: create the samples
  – Reduce: mine each sample
● 2nd round
  – Map: identity function
  – Reduce: aggregation/filtering of the results
1st Round: Sampling and Mining
● Parallelization of the sequential algorithm
● Map: sample creation
  – Random sampling with replacement ... in parallel!
  – Input: (transaction id, transaction)
  – Output: (sample id, transaction)
● Reduce: mining of each sample in parallel, using the lowered frequency threshold θ − ε/2
  – Can use any sequential exact mining algorithm (APriori, FPGrowth, ...)
  – Input: (sample id, (transaction1, transaction2, ...))
  – Output: (itemset, (frequency, confidence interval))
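A local, non-distributed simulation of round 1, only to make the key/value flow concrete. The routing shown (filling each sample's slots with uniform draws) is a simplification of PARMA's actual map stage, local outputs are simplified to frequency estimates, and `apriori` is the toy miner from the earlier sketch, standing in for FP-Growth:

    import random
    from collections import defaultdict

    def parma_round1(D, n_samples, sample_size, theta, eps):
        """Map: emit (sample id, transaction) pairs so that each sample is a
        with-replacement sample of D. Reduce: mine each sample at theta - eps/2."""
        # Map phase (simulated): route random transactions to each sample
        samples = defaultdict(list)
        for s in range(n_samples):
            for _ in range(sample_size):
                samples[s].append(random.choice(D))
        # Reduce phase (simulated; one miner per sample, run in parallel in
        # the real system): `apriori` stands in for any exact miner
        return {s: apriori(T, theta - eps / 2) for s, T in samples.items()}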
2nd Round: Aggregation
● Map: identity function
● Reduce: filtering and aggregation
  – Input: (itemset, ((freq1, conf_int1), (freq2, conf_int2), ...))
  – Output: (itemset, (frequency, confidence interval))
● An itemset is sent to the output only if it was frequent in a sufficient number of samples (determined by the optimization framework)
● Confidence interval: the smallest interval containing a sufficient number of the estimates freq1, freq2, ...
● Frequency: the median of the confidence interval
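A matching sketch of the round-2 reducer. The threshold `required` (the minimum number of samples in which an itemset must be frequent) comes from the optimization framework and is taken here as an input; the interval and median mechanics are one plausible reading of the slide, not PARMA's exact code:

    from collections import defaultdict

    def parma_round2(local_results, required):
        """Keep an itemset only if it was frequent in >= `required` samples;
        report the smallest interval covering `required` of its frequency
        estimates, and the median estimate inside that interval."""
        estimates = defaultdict(list)
        for freqs in local_results.values():   # round-2 map is the identity
            for X, f in freqs.items():
                estimates[X].append(f)
        output = {}
        for X, fs in estimates.items():
            if len(fs) < required:
                continue                        # not frequent in enough samples
            fs.sort()
            # smallest interval containing `required` of the estimates
            i0 = min(range(len(fs) - required + 1),
                     key=lambda i: fs[i + required - 1] - fs[i])
            window = fs[i0:i0 + required]
            output[X] = (window[required // 2], (window[0], window[-1]))
        return output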
Analysis
● Theorem: with probability at least 1 − δ, the output is an ε-approximation to FI(D,θ)
● Proof intuition: "boosting-like" approach
  – The outputs of the reducers in stage 1 are ε-approximations for the local samples, each with some probability 1 − φ
  – If an itemset appears in enough ε-approximations, then it is probably frequent in the dataset
● Consequences: better approximation, fewer false positives
Implementation
● PARMA is implemented as a Java library for Hadoop
  – Easy to run; possible integration with Apache Mahout
● FP-Growth is used for the sample-mining phase
  – PARMA is agnostic to the choice of mining algorithm
  – We used an existing Java implementation
● 1400 lines of code
● Available on GitHub
Experiments
● Run on Amazon Elastic MapReduce
  – Up to 16 machines, with 17GB of RAM
● Artificial datasets created using standard benchmark generators
  – Various parameters (transaction size, number of items, ...)
  – Size between 5 and 50 million transactions
● Compared the performance of PARMA with PFP and naïve distributed counting
  – PARMA performs significantly better (1.5x to 5x speed improvement) in all tests
Runtime Analysis
● Runtime does not grow with the size of the data
  – Because the individual sample size does not grow!
● Much faster than PFP, and faster as the data grows!
● PARMA gets faster with more machines, even with more data
Speedup
● Quasi-linear speedup
● Stage 1: very close to optimal
● Stage 2: no speedup
  – Its running time depends on the number of itemsets
  – No way to control the number of itemsets
  – Common to frequent itemsets mining algorithms
[Plot: speedup (1 to 7) vs. number of instances (2, 4, 8), with ideal, total, Stage 1, and Stage 2 curves]
Accuracy
● Output is always an ε-approximation to FI(D,θ)
● Output not much larger than the real collection
● Better frequency accuracy than guaranteed
● Smaller confidence intervals than guaranteed
Conclusions
We showed how to efficiently use MapReduce to mine a high-quality approximation of the frequent itemsets and association rules of a large dataset, with minimal data replication and quasi-linear speedup.
Publications
● Riondato, Upfal. Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML PKDD 2012.
● Riondato, DeBrabant, Fonseca, Upfal. PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce. CIKM 2012.