Efficient Data-Reduction Methods for On-line Association Rule Mining

H. Bronnimann (Polytechnic Univ, [email protected])
B. Chen (Exilixis, [email protected])
P. Haas (IBM Almaden, [email protected])
M. Dash, Y. Qiao, P. Scheuermann (Northwestern University, {manoranj,yiqiao,peters}@ece.nwu.edu)

NGDM’02
Motivation

• Volume of data in warehouses and on the Internet is growing faster than Moore’s Law
• Scalability is a major concern
• “Classical” algorithms require one or more scans of the database
• Need to adapt to streaming data
  • Data elements arrive on-line
  • Limited amount of memory
• One solution: execute the algorithm on a sample
  • A lossy compressed synopsis (sketch) of the data
Motivation

• Sampling methods
  • Advantage: can explicitly trade off accuracy and speed
  • Work best when tailored to the application
• Our contributions
  • Sampling methods for count datasets: a base set of items, where each data element is a vector of item counts
  • Application: association rule mining
Outline

• Motivation
• FAST
• Epsilon Approximation
• Experimental Results
• Data Stream Reduction
• Conclusion
The Problem

Generate a smaller subset S0 of a larger superset S such that the supports of 1-itemsets in S0 are close to those in S:

  minimize Dist(S0, S)  subject to  S0 ⊆ S, |S0| = n

The problem is NP-complete (by reduction from the One-In-Three SAT problem).

  Dist_1(S0, S) = ( |L1(S) \ L1(S0)| + |L1(S0) \ L1(S)| ) / |L1(S0) ∪ L1(S)|

  Dist_2(S0, S) = Σ_{A ∈ I1(S)} ( f(A; S0) − f(A; S) )²

  Dist_∞(S0, S) = max_{A ∈ I1(S)} | f(A; S0) − f(A; S) |

where
  I1(T) = set of all 1-itemsets in transaction set T
  L1(T) = set of frequent 1-itemsets in transaction set T
  f(A; T) = support of itemset A in transaction set T
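To make the three distance functions concrete, here is a minimal Python sketch (ours, not the authors’ code); transactions are modeled as sets of item ids, and all names are illustrative.

from collections import Counter

def supports(transactions):
    # Support of each item: the fraction of transactions containing it.
    counts = Counter(item for t in transactions for item in t)
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}

def dist_1(s0, s, min_support):
    # Symmetric difference of the frequent-1-itemset sets, over their union.
    l1_s0 = {a for a, f in supports(s0).items() if f >= min_support}
    l1_s = {a for a, f in supports(s).items() if f >= min_support}
    union = l1_s0 | l1_s
    return len(l1_s0 ^ l1_s) / len(union) if union else 0.0

def dist_2(s0, s):
    # Sum of squared support differences over all 1-itemsets of S.
    f0, f = supports(s0), supports(s)
    return sum((f0.get(a, 0.0) - fa) ** 2 for a, fa in f.items())

def dist_inf(s0, s):
    # Maximum absolute support difference over all 1-itemsets of S.
    f0, f = supports(s0), supports(s)
    return max((abs(f0.get(a, 0.0) - fa) for a in f), default=0.0)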
FAST-trim

• FAST-trim outline

Given a specified minimum support p and confidence c, the FAST-trim algorithm proceeds as follows:

1. Obtain a large simple random sample S from D.
2. Compute f(A; S) for each 1-itemset A.
3. Using the supports computed in Step 2, obtain a reduced sample S0 from S by trimming away outlier transactions.
4. Run a standard association-rule algorithm against S0, with minimum support p and confidence c, to obtain the final set of association rules.
FAST-trim

• FAST-trim algorithm: uses input parameter k to explicitly trade off speed and accuracy

Trimming phase:

while (|S0| > n) {
    divide S0 into disjoint groups of min(k, |S0|) transactions each;
    for each group G {
        compute f(A; S0) for each item A;
        set S0 = S0 − {t*}, where Dist(S0 − {t*}, S) = min_{t ∈ G} Dist(S0 − {t}, S);
    }
}

Note: removal of the outlier t* causes the maximum decrease, or minimum increase, in Dist(S0, S). A runnable sketch follows below.
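A minimal Python sketch of the trimming phase (ours, under the same transaction model as above); `dist` is any of the distance functions, e.g. dist_2.

def fast_trim(S, n, k, dist):
    # Greedily trim the sample down to n transactions, one removal per group.
    s0 = list(S)
    while len(s0) > n:
        # Divide the current sample into disjoint groups of (up to) k transactions.
        groups = [s0[i:i + k] for i in range(0, len(s0), k)]
        for group in groups:
            if len(s0) <= n:
                break
            # Remove the outlier t* whose removal causes the maximum decrease
            # (or minimum increase) in Dist(S0 - {t}, S).
            t_star = min(group, key=lambda t: dist([u for u in s0 if u is not t], S))
            s0 = [u for u in s0 if u is not t_star]
    return s0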
FAST-grow

• FAST-grow algorithm: selects representative transactions from S and adds them to the sample S0, which is initially empty

Growing phase (a sketch follows below):

while (|S0| < n) {
    divide S into disjoint groups of min(k, |S|) transactions each;
    for each group G {
        compute f(A; S0) for each item A;
        set S0 = S0 ∪ {t*}, where Dist(S0 ∪ {t*}, S) = min_{t ∈ G} Dist(S0 ∪ {t}, S);
    }
}
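The symmetric Python sketch for the growing phase (ours, same assumptions as fast_trim above):

def fast_grow(S, n, k, dist):
    # Greedily grow an n-transaction sample, one addition per group of S.
    s0 = []
    while len(s0) < n:
        # Divide S into disjoint groups of (up to) k transactions.
        groups = [S[i:i + k] for i in range(0, len(S), k)]
        for group in groups:
            if len(s0) >= n:
                break
            # Add the representative t* yielding the smallest Dist(S0 + {t}, S).
            t_star = min(group, key=lambda t: dist(s0 + [t], S))
            s0 = s0 + [t_star]
    return s0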
Epsilon Approximation (EA)

• Theory based on work in statistics on VC dimensions (Vapnik & Chervonenkis ’71) shows that the frequencies of a collection of subsets can be estimated simultaneously, provided the VC dimension is finite
• Applications in computational geometry and learning theory

Definition: a sample S0 of S1 is an ε-approximation iff the discrepancy satisfies Dist_∞(S0, S1) ≤ ε.
Epsilon Approximation (EA)

Halving method (sketched in code below):

• Deterministically halve the data to get sample S0
• Apply halving repeatedly (S1 → S2 → … → St = S0) so that Dist_∞(S0, S1) ≤ ε
• Each halving step introduces a discrepancy ε_i(n_i, m), where m = total number of items in the database and n_i = size of sub-sample S_i
• Halving stops with the maximum t such that ε_t = Σ_{i ≤ t} ε_i(n_i, m) ≤ ε
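A sketch of the outer halving loop in Python (ours; all names are illustrative). Here `halve` is a one-argument deterministic halving step (e.g. the coloring routine sketched after the penalty slides below, with its parameters fixed), and `eps_i(n_i, m)` is the per-step discrepancy bound, whose closed form comes from the VC-dimension theory and is treated here as a given function.

def ea_sample(S, m, eps, halve, eps_i):
    # Keep halving while the accumulated discrepancy bound stays <= eps.
    sample, total = list(S), 0.0
    while len(sample) >= 2:
        candidate = halve(sample)            # S_{i+1}: half-size sub-sample
        step = eps_i(len(candidate), m)      # discrepancy added by this halving
        if total + step > eps:
            break                            # stop at the maximum t with sum <= eps
        sample, total = candidate, total + step
    return sample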
Epsilon Approximation (EA)

How to compute a halving? Hyperbolic cosine method [Spencer]:

1. Color each transaction red (in sample) or blue (not in sample).
2. Maintain a penalty for each item that reflects the color balance:
   • the penalty is small if red/blue are approximately balanced;
   • the penalty shoots up exponentially when red dominates (the item is over-sampled) or blue dominates (the item is under-sampled).
3. Color transactions sequentially, keeping the penalty low.

Key property: on average the penalty does not increase, so one of the two colors does not increase the global penalty.
Epsilon Approximation (EA)

Penalty computation:

• Let Qi = penalty for item Ai; initialize Qi = 2
• Suppose that we have colored the first j transactions; then

  Qi = Qi^(j) = (1 + δi)^ri (1 − δi)^bi + (1 − δi)^ri (1 + δi)^bi

  where ri = ri(j) = number of red transactions containing Ai,
        bi = bi(j) = number of blue transactions containing Ai,
        δi = a parameter that controls how fast the penalty changes as a function of |ri − bi|
Epsilon Approximation (EA)

How to color transaction j+1:

• Compute the global penalty under both colorings:

  Q^(r) = Σi Qi^(j | r) = global penalty assuming transaction j+1 is red
  Q^(b) = Σi Qi^(j | b) = global penalty assuming transaction j+1 is blue

• Choose the color for which the global penalty is smaller

EA is inherently an on-line method.
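A Python sketch of one halving step via the hyperbolic cosine penalties (ours; the slides do not reproduce the choice of the per-item parameters δi, so `delta` is assumed to be a given mapping from item to δi):

def halve(transactions, delta):
    # Color transactions red/blue sequentially, keeping the global penalty low;
    # the red transactions form the half-size sub-sample.
    red_counts, blue_counts, red = {}, {}, []
    for t in transactions:
        # Compare global penalties for the two colorings of t. Only items in t
        # change, so it suffices to compare their partial penalty sums.
        def penalty(r_extra, b_extra):
            total = 0.0
            for item in t:
                d = delta[item]
                r = red_counts.get(item, 0) + r_extra
                b = blue_counts.get(item, 0) + b_extra
                total += (1 + d) ** r * (1 - d) ** b + (1 - d) ** r * (1 + d) ** b
            return total
        if penalty(1, 0) <= penalty(0, 1):   # red gives the smaller global penalty
            red.append(t)
            for item in t:
                red_counts[item] = red_counts.get(item, 0) + 1
        else:
            for item in t:
                blue_counts[item] = blue_counts.get(item, 0) + 1
    return red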
Performance Evaluation

• Synthetic data set
  • IBM QUEST project [AS94]
  • 100,000 transactions
  • 1,000 items
  • Number of maximal potentially large itemsets: 2,000
  • Average transaction length: 10
  • Average length of maximal large itemsets: 4
  • Minimum support: 0.77%
  • Length of the maximal large itemsets: 6
• Final sampling ratios: 0.76%, 1.51%, 3.0%, …, as dictated by the EA halvings
Experimental Results
[Figure: two panels, "Accuracy vs. Sample Ratio" and "Time (CPU sec) vs. Sample Ratio", comparing FAST_trim vs. EA/SRS; series: FAST_trim_D1, FAST_trim_D2, EA, SRS; sample ratios range from 0 to 0.3]
• At an 87% reduction in sample size, accuracy is: EA (99%), FAST_trim_D2 (97%), SRS (94.6%)
Experimental Results
[Figure: two panels, "Accuracy vs. Sample Ratio" and "Time (CPU sec) vs. Sample Ratio", comparing FAST_grow vs. EA/SRS; series: FAST_grow_D1, FAST_grow_D2, EA, SRS; sample ratios range from 0 to 0.3]
• FAST_grow_D2 is best for very small sampling ratios (< 2%)
• EA is best overall in accuracy
Data Stream Reduction

Data Stream Reduction (DSR):

• Maintain a representative sample of the data stream
• Assign more weight to recent data while partially keeping track of old data

The stream is divided into buckets 1, …, mS of NS/2 transactions each, for a total of mS · NS/2 transactions. To generate an NS-element sample, bucket k is halved (mS − k) times: the newest bucket (bucket mS) contributes NS/2 transactions, the next one NS/4, then NS/8, and so on, down to 1 for the oldest bucket.
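A sketch of the bucket weighting in Python (ours), assuming a one-argument halving step such as `lambda s: halve(s, delta)` built from the routine above:

def dsr_sample(buckets, halve_step):
    # buckets[0] is the oldest bucket, buckets[-1] the newest; bucket k
    # (1-based) is halved (m_s - k) times, so recent buckets keep more data.
    m_s = len(buckets)
    sample = []
    for k, bucket in enumerate(buckets, start=1):
        sub = list(bucket)
        for _ in range(m_s - k):
            if len(sub) < 2:
                break
            sub = halve_step(sub)
        sample.extend(sub)
    return sample   # roughly N_s transactions in total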
Data Stream Reduction

• Practical implementation: to avoid frequent halving, we use one buffer, and compute a new representative sample by applying EA when the buffer is full

[Figure: a buffer of NS transactions divided into regions that have undergone 0, 1, and 2 halvings; after EA is applied, the regions have undergone 1, 2, and 3 halvings, and the freed space is empty]
Data Stream Reduction

Problem: two users querying immediately before and after a halving operation see data that varies substantially.

Continuous DSR: the buffer is divided into chunks of ns transactions (boundaries at ns, 2ns, 3ns, …, NS − 2ns, NS − ns, NS). As each batch of ns new transactions arrives, the oldest chunk is halved first, so the sample changes gradually rather than all at once. A sketch follows below.
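A minimal sketch of one continuous-DSR step (ours, under stated assumptions: the buffer is a list of chunks, oldest first, and halving a chunk frees half of its space):

def continuous_dsr_step(buffer, new_chunk, halve_step, capacity):
    # Admit n_s new transactions, halving oldest chunks first to stay in capacity.
    size = sum(len(chunk) for chunk in buffer)
    i = 0
    while size + len(new_chunk) > capacity and i < len(buffer):
        freed = len(buffer[i]) - len(buffer[i]) // 2
        buffer[i] = halve_step(buffer[i])    # oldest chunk is halved first
        size -= freed
        i += 1
    buffer.append(list(new_chunk))
    return buffer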
Conclusion

• Two-stage sampling approach based on trimming outliers or selecting representative transactions
• Epsilon approximation: a deterministic method for repeatedly halving the data to obtain the final sample
• Can be used in conjunction with other non-sampling, count-based mining algorithms
• EA-based data stream reduction
• We are investigating how to evaluate the goodness of a representative subset
• Frequency information is to be used in the discrepancy function