Efficient Data-Reduction Methods for On-line Association Rule Mining

H. Bronnimann (Polytechnic Univ, [email protected])
B. Chen (Exilixis, [email protected])
P. Haas (IBM Almaden, [email protected])
M. Dash, Y. Qiao, P. Scheuermann (Northwestern University, {manoranj,yiqiao,peters}@ece.nwu.edu)

NGDM’02
Motivation

• Volume of data in warehouses and on the Internet is growing faster than Moore’s Law
• Scalability is a major concern
• “Classical” algorithms require one or more scans of the database
• Need to adapt to streaming data
  • Data elements arrive on-line
  • Limited amount of memory
• One solution: execute the algorithm on a sample
  • A lossy compressed synopsis (sketch) of the data
Motivation

• Sampling methods
  • Advantage: can explicitly trade off accuracy and speed
  • Work best when tailored to the application
• Our contributions
  • Sampling methods for count datasets: a base set of items, where each data element is a vector of item counts
  • Application: association rule mining
Outline

• Motivation
• FAST
• Epsilon Approximation
• Experimental Results
• Data Stream Reduction
• Conclusion
The Problem

Generate a smaller subset S0 of a larger superset S such that the supports of 1-itemsets in S0 are close to those in S:

  minimize Dist(S0, S)  subject to  S0 ⊆ S, |S0| = n

The problem is NP-complete (by reduction from the One-In-Three SAT problem).

  Dist_1(S0, S) = ( |L1(S) \ L1(S0)| + |L1(S0) \ L1(S)| ) / |L1(S0) ∪ L1(S)|

  Dist_2(S0, S) = Σ_{A ∈ I1(S)} ( f(A; S0) − f(A; S) )²

  Dist_∞(S0, S) = max_{A ∈ I1(S)} | f(A; S0) − f(A; S) |

where
  I1(T) = set of all 1-itemsets in transaction set T
  L1(T) = set of frequent 1-itemsets in transaction set T
  f(A; T) = support of itemset A in transaction set T
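To make the three distance functions concrete, here is a minimal Python sketch (ours, not the authors’ code); transactions are modeled as sets of item ids, and all names are illustrative.

from collections import Counter

def supports(transactions):
    # Support of each item: the fraction of transactions containing it.
    counts = Counter(item for t in transactions for item in t)
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}

def dist_1(s0, s, min_support):
    # Symmetric difference of the frequent-1-itemset sets, over their union.
    l1_s0 = {a for a, f in supports(s0).items() if f >= min_support}
    l1_s = {a for a, f in supports(s).items() if f >= min_support}
    union = l1_s0 | l1_s
    return len(l1_s0 ^ l1_s) / len(union) if union else 0.0

def dist_2(s0, s):
    # Sum of squared support differences over all 1-itemsets of S.
    f0, f = supports(s0), supports(s)
    return sum((f0.get(a, 0.0) - fa) ** 2 for a, fa in f.items())

def dist_inf(s0, s):
    # Maximum absolute support difference over all 1-itemsets of S.
    f0, f = supports(s0), supports(s)
    return max((abs(f0.get(a, 0.0) - fa) for a in f), default=0.0)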
FAST-trim

• FAST-trim outline

Given a specified minimum support p and confidence c, the FAST-trim algorithm proceeds as follows:

1. Obtain a large simple random sample S from D.
2. Compute f(A; S) for each 1-itemset A.
3. Using the supports computed in Step 2, obtain a reduced sample S0 from S by trimming away outlier transactions.
4. Run a standard association-rule algorithm against S0, with minimum support p and confidence c, to obtain the final set of association rules.
FAST-trim

• FAST-trim algorithm: uses input parameter k to explicitly trade off speed and accuracy

Trimming phase:

while (|S0| > n) {
    divide S0 into disjoint groups of min(k, |S0|) transactions each;
    for each group G {
        compute f(A; S0) for each item A;
        set S0 = S0 − {t*}, where Dist(S0 − {t*}, S) = min_{t ∈ G} Dist(S0 − {t}, S);
    }
}

Note: removal of the outlier t* causes the maximum decrease, or minimum increase, in Dist(S0, S). A runnable sketch follows below.
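A minimal Python sketch of the trimming phase (ours, under the same transaction model as above); `dist` is any of the distance functions, e.g. dist_2.

def fast_trim(S, n, k, dist):
    # Greedily trim the sample down to n transactions, one removal per group.
    s0 = list(S)
    while len(s0) > n:
        # Divide the current sample into disjoint groups of (up to) k transactions.
        groups = [s0[i:i + k] for i in range(0, len(s0), k)]
        for group in groups:
            if len(s0) <= n:
                break
            # Remove the outlier t* whose removal causes the maximum decrease
            # (or minimum increase) in Dist(S0 - {t}, S).
            t_star = min(group, key=lambda t: dist([u for u in s0 if u is not t], S))
            s0 = [u for u in s0 if u is not t_star]
    return s0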
FAST-grow

• FAST-grow algorithm: selects representative transactions from S and adds them to the sample S0, which is initially empty

Growing phase (a sketch follows below):

while (|S0| < n) {
    divide S into disjoint groups of min(k, |S|) transactions each;
    for each group G {
        compute f(A; S0) for each item A;
        set S0 = S0 ∪ {t*}, where Dist(S0 ∪ {t*}, S) = min_{t ∈ G} Dist(S0 ∪ {t}, S);
    }
}
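The symmetric Python sketch for the growing phase (ours, same assumptions as fast_trim above):

def fast_grow(S, n, k, dist):
    # Greedily grow an n-transaction sample, one addition per group of S.
    s0 = []
    while len(s0) < n:
        # Divide S into disjoint groups of (up to) k transactions.
        groups = [S[i:i + k] for i in range(0, len(S), k)]
        for group in groups:
            if len(s0) >= n:
                break
            # Add the representative t* yielding the smallest Dist(S0 + {t}, S).
            t_star = min(group, key=lambda t: dist(s0 + [t], S))
            s0 = s0 + [t_star]
    return s0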
Epsilon Approximation (EA)

• Theory based on work in statistics on VC dimensions (Vapnik & Chervonenkis ’71) shows that the frequencies of a collection of subsets can be estimated simultaneously, provided the VC dimension is finite
• Applications in computational geometry and learning theory

Definition: a sample S0 of S1 is an ε-approximation iff the discrepancy satisfies Dist_∞(S0, S1) ≤ ε.
Epsilon Approximation (EA)

Halving method (sketched in code below):

• Deterministically halve the data to get sample S0
• Apply halving repeatedly (S1 → S2 → … → St = S0) so that Dist_∞(S0, S1) ≤ ε
• Each halving step introduces a discrepancy ε_i(n_i, m), where m = total number of items in the database and n_i = size of sub-sample S_i
• Halving stops with the maximum t such that ε_t = Σ_{i ≤ t} ε_i(n_i, m) ≤ ε
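A sketch of the outer halving loop in Python (ours; all names are illustrative). Here `halve` is a one-argument deterministic halving step (e.g. the coloring routine sketched after the penalty slides below, with its parameters fixed), and `eps_i(n_i, m)` is the per-step discrepancy bound, whose closed form comes from the VC-dimension theory and is treated here as a given function.

def ea_sample(S, m, eps, halve, eps_i):
    # Keep halving while the accumulated discrepancy bound stays <= eps.
    sample, total = list(S), 0.0
    while len(sample) >= 2:
        candidate = halve(sample)            # S_{i+1}: half-size sub-sample
        step = eps_i(len(candidate), m)      # discrepancy added by this halving
        if total + step > eps:
            break                            # stop at the maximum t with sum <= eps
        sample, total = candidate, total + step
    return sample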
Epsilon Approximation (EA)

How to compute a halving? Hyperbolic cosine method [Spencer]:

1. Color each transaction red (in sample) or blue (not in sample).
2. Maintain a penalty for each item that reflects the color balance:
   • the penalty is small if red/blue are approximately balanced;
   • the penalty shoots up exponentially when red dominates (the item is over-sampled) or blue dominates (the item is under-sampled).
3. Color transactions sequentially, keeping the penalty low.

Key property: on average the penalty does not increase, so one of the two colors does not increase the global penalty.
Epsilon Approximation (EA)

Penalty computation:

• Let Qi = penalty for item Ai; initialize Qi = 2
• Suppose that we have colored the first j transactions; then

  Qi = Qi^(j) = (1 + δi)^ri (1 − δi)^bi + (1 − δi)^ri (1 + δi)^bi

  where ri = ri(j) = number of red transactions containing Ai,
        bi = bi(j) = number of blue transactions containing Ai,
        δi = a parameter that controls how fast the penalty changes as a function of |ri − bi|
Epsilon Approximation (EA)

How to color transaction j+1:

• Compute the global penalty under both colorings:

  Q^(r) = Σi Qi^(j | r) = global penalty assuming transaction j+1 is red
  Q^(b) = Σi Qi^(j | b) = global penalty assuming transaction j+1 is blue

• Choose the color for which the global penalty is smaller

EA is inherently an on-line method.
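A Python sketch of one halving step via the hyperbolic cosine penalties (ours; the slides do not reproduce the choice of the per-item parameters δi, so `delta` is assumed to be a given mapping from item to δi):

def halve(transactions, delta):
    # Color transactions red/blue sequentially, keeping the global penalty low;
    # the red transactions form the half-size sub-sample.
    red_counts, blue_counts, red = {}, {}, []
    for t in transactions:
        # Compare global penalties for the two colorings of t. Only items in t
        # change, so it suffices to compare their partial penalty sums.
        def penalty(r_extra, b_extra):
            total = 0.0
            for item in t:
                d = delta[item]
                r = red_counts.get(item, 0) + r_extra
                b = blue_counts.get(item, 0) + b_extra
                total += (1 + d) ** r * (1 - d) ** b + (1 - d) ** r * (1 + d) ** b
            return total
        if penalty(1, 0) <= penalty(0, 1):   # red gives the smaller global penalty
            red.append(t)
            for item in t:
                red_counts[item] = red_counts.get(item, 0) + 1
        else:
            for item in t:
                blue_counts[item] = blue_counts.get(item, 0) + 1
    return red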
Performance Evaluation

• Synthetic data set
  • IBM QUEST project [AS94]
  • 100,000 transactions
  • 1,000 items
  • Number of maximal potentially large itemsets: 2,000
  • Average transaction length: 10
  • Average length of maximal large itemsets: 4
  • Minimum support: 0.77%
  • Length of the maximal large itemsets: 6
• Final sampling ratios: 0.76%, 1.51%, 3.0%, …, as dictated by the EA halvings
Experimental Results
[Figure: two panels, "Accuracy vs. Sample Ratio" and "Time (CPU sec) vs. Sample Ratio", comparing FAST_trim vs. EA/SRS; series: FAST_trim_D1, FAST_trim_D2, EA, SRS; sample ratios range from 0 to 0.3]
• At an 87% reduction in sample size, accuracy is: EA (99%), FAST_trim_D2 (97%), SRS (94.6%)
Experimental Results
[Figure: two panels, "Accuracy vs. Sample Ratio" and "Time (CPU sec) vs. Sample Ratio", comparing FAST_grow vs. EA/SRS; series: FAST_grow_D1, FAST_grow_D2, EA, SRS; sample ratios range from 0 to 0.3]
• FAST_grow_D2 is best for very small sampling ratios (< 2%)
• EA is best overall in accuracy
Data Stream Reduction

Data Stream Reduction (DSR):

• Maintain a representative sample of the data stream
• Assign more weight to recent data while partially keeping track of old data

The stream is divided into buckets 1, …, mS of NS/2 transactions each, for a total of mS · NS/2 transactions. To generate an NS-element sample, bucket k is halved (mS − k) times: the newest bucket (bucket mS) contributes NS/2 transactions, the next one NS/4, then NS/8, and so on, down to 1 for the oldest bucket.
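A sketch of the bucket weighting in Python (ours), assuming a one-argument halving step such as `lambda s: halve(s, delta)` built from the routine above:

def dsr_sample(buckets, halve_step):
    # buckets[0] is the oldest bucket, buckets[-1] the newest; bucket k
    # (1-based) is halved (m_s - k) times, so recent buckets keep more data.
    m_s = len(buckets)
    sample = []
    for k, bucket in enumerate(buckets, start=1):
        sub = list(bucket)
        for _ in range(m_s - k):
            if len(sub) < 2:
                break
            sub = halve_step(sub)
        sample.extend(sub)
    return sample   # roughly N_s transactions in total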
Data Stream Reduction

• Practical implementation: to avoid frequent halving, we use one buffer, and compute a new representative sample by applying EA when the buffer is full

[Figure: a buffer of NS transactions divided into regions that have undergone 0, 1, and 2 halvings; after EA is applied, the regions have undergone 1, 2, and 3 halvings, and the freed space is empty]
Data Stream Reduction

Problem: two users querying immediately before and after a halving operation see data that varies substantially.

Continuous DSR: the buffer is divided into chunks of ns transactions (boundaries at ns, 2ns, 3ns, …, NS − 2ns, NS − ns, NS). As each batch of ns new transactions arrives, the oldest chunk is halved first, so the sample changes gradually rather than all at once. A sketch follows below.
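A minimal sketch of one continuous-DSR step (ours, under stated assumptions: the buffer is a list of chunks, oldest first, and halving a chunk frees half of its space):

def continuous_dsr_step(buffer, new_chunk, halve_step, capacity):
    # Admit n_s new transactions, halving oldest chunks first to stay in capacity.
    size = sum(len(chunk) for chunk in buffer)
    i = 0
    while size + len(new_chunk) > capacity and i < len(buffer):
        freed = len(buffer[i]) - len(buffer[i]) // 2
        buffer[i] = halve_step(buffer[i])    # oldest chunk is halved first
        size -= freed
        i += 1
    buffer.append(list(new_chunk))
    return buffer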
Conclusion

• Two-stage sampling approach based on trimming outliers or selecting representative transactions
• Epsilon approximation: a deterministic method for repeatedly halving the data to obtain the final sample
• Can be used in conjunction with other non-sampling, count-based mining algorithms
• EA-based data stream reduction
• We are investigating how to evaluate the goodness of a representative subset
• Frequency information is to be used in the discrepancy function