Implementation of
“A New Two-Phase Sampling Based Algorithm
for Discovering Association Rules”
CSCI 6405 Data Warehousing and Data Mining
Tokunbo Makanju
Adan Cosgaya
Faculty of Computer Science
Dalhousie University
Fall 2005
Overview

Introduction

Algorithm

Data Preparation

Experimental Results

Conclusions

References
Introduction

Datasets are getting larger

The time required to mine information from these
datasets increases as datasets get larger

Demand for faster rule mining
Solution: mine a sample of the original dataset
Algorithm


FAST (Finding Associations from Sampled Transactions)
2 versions



FAST-Trim
FAST-Grow
FAST outline:




Obtain a simple random sample S
Compute frequency for each 1-itemset
Obtain a reduced sample S0 from S by either trimming S (FAST-Trim)
or growing S0 from the empty set (FAST-Grow)
Run a standard association-rule algorithm against S0
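
The first two steps of the outline can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `simple_random_sample` and `item_supports`, the toy database `D`, and the fixed seed are hypothetical and do not come from the original implementation.

```python
import random
from collections import Counter

def simple_random_sample(transactions, ratio, seed=0):
    """Draw a simple random sample (without replacement) of the given ratio."""
    rng = random.Random(seed)
    size = round(len(transactions) * ratio)
    return rng.sample(transactions, size)

def item_supports(transactions):
    """Return {item: fraction of transactions that contain the item}."""
    if not transactions:
        return {}
    counts = Counter()
    for t in transactions:
        counts.update(set(t))  # count each item at most once per transaction
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}

# Toy database D (hypothetical data, for illustration only)
D = [[1, 2], [1, 3], [1, 2, 3], [2], [1], [2, 3]]
S = simple_random_sample(D, ratio=0.5)   # step 1: sample S
supports = item_supports(D)              # step 2, shown here on D itself
print(len(S), supports[3])               # 3 0.5
```

The supports are kept as fractions rather than raw counts so that samples of different sizes can be compared directly.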
Algorithm

Distance Functions
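$$\min_{S_0 \subseteq S,\; |S_0| = n} \mathrm{Dist}(S_0, S)$$

$$\mathrm{Dist}_1(S_0, S) = \frac{|L_1(S) \setminus L_1(S_0)| + |L_1(S_0) \setminus L_1(S)|}{|L_1(S_0) \cup L_1(S)|}$$

$$\mathrm{Dist}_2(S_0, S) = \sum_{A \in I_1(S)} \bigl(f(A; S_0) - f(A; S)\bigr)^2$$

$$\mathrm{Dist}_\infty(S_0, S) = \max_{A \in I_1(S)} \bigl|f(A; S_0) - f(A; S)\bigr|$$
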
I1(T) = set of all 1-itemsets in transaction set T
L1(T) = set of frequent 1-itemsets in transaction set T
f(A;T) = support of itemset A in transaction set T
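
As a rough illustration of the definitions above, the following Python sketch computes Dist2 and Dist∞ from 1-itemset supports. The helper names (`supports`, `dist2`, `dist_inf`) and the toy transaction sets are assumptions made for this example, and an item absent from S0 is treated as having support 0.

```python
from collections import Counter

def supports(transactions):
    """f(A;T) for every 1-itemset A in I1(T), as a dict {item: support}."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    n = len(transactions)
    return {a: c / n for a, c in counts.items()} if n else {}

def dist2(f_s0, f_s):
    """Sum of squared support differences over the 1-itemsets of S."""
    return sum((f_s0.get(a, 0.0) - f) ** 2 for a, f in f_s.items())

def dist_inf(f_s0, f_s):
    """Largest absolute support difference over the 1-itemsets of S."""
    return max((abs(f_s0.get(a, 0.0) - f) for a, f in f_s.items()), default=0.0)

# Toy sample S and candidate subsample S0 (hypothetical data)
S = [[1, 2], [1, 3], [1, 2], [2, 3]]
S0 = [[1, 2], [2, 3]]
f_s, f_s0 = supports(S), supports(S0)
print(dist2(f_s0, f_s), dist_inf(f_s0, f_s))   # 0.125 0.25
```

Here items 1 and 2 each differ in support by 0.25 between S0 and S, and item 3 matches exactly, which gives 0.0625 + 0.0625 = 0.125 for Dist2 and 0.25 for Dist∞.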
Algorithm

FAST-Grow Algorithm
obtain a simple random sample S from D;
compute f(A;S) for each 1-itemset A in I1(S);
set i = 0, S0(i) = ∅, minDist = ∞, and minStage = -1;
while (|S0(i)| < n) {
    divide S - S0(i) into disjoint groups of min(k, |S - S0(i)|) transactions each;
    for each group G {
        set S0(i) = S0(i) ∪ {t*}, where Dist(S0(i) ∪ {t*}, S) =
            min over t in G of Dist(S0(i) ∪ {t}, S);
    }
    compute f(A; S0(i)) for each 1-itemset A in I1(S0(i));
    if (Dist(S0(i), S) < minDist) {
        set minDist := Dist(S0(i), S) and minStage := i;
    }
    set S0(i + 1) := S0(i) and i := i + 1;
}
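
A minimal Python sketch of the growing loop, under several assumptions: it uses Dist2 as the distance, applies only the fixed-size stopping criterion (so the minDist/minStage bookkeeping is omitted), and stops inside a pass as soon as S0 reaches n transactions. The function names and the toy data are hypothetical and are not taken from the original implementation.

```python
import random
from collections import Counter

def dist2(counts0, size0, f_s):
    """Dist2 between a candidate subsample (item counts, size) and S."""
    return sum((counts0.get(a, 0) / size0 - f) ** 2 for a, f in f_s.items())

def fast_grow(S, n, k=10, seed=0):
    """Grow a subsample S0 of up to n transactions from S, greedily
    adding from each group the transaction that minimizes Dist2(S0, S)."""
    if not S:
        return []
    rng = random.Random(seed)
    s_counts = Counter(a for t in S for a in set(t))
    f_s = {a: c / len(S) for a, c in s_counts.items()}   # f(A;S)
    remaining = list(range(len(S)))   # indices of S - S0
    chosen = []                       # indices of S0
    counts0 = Counter()               # item counts within S0
    while len(chosen) < n and remaining:
        rng.shuffle(remaining)
        # Disjoint groups of at most k transactions (slices are copies)
        groups = [remaining[j:j + k] for j in range(0, len(remaining), k)]
        for group in groups:
            if len(chosen) >= n:
                break
            best, best_dist = None, float("inf")
            for idx in group:
                trial = counts0 + Counter(set(S[idx]))
                d = dist2(trial, len(chosen) + 1, f_s)
                if d < best_dist:
                    best, best_dist = idx, d
            chosen.append(best)
            counts0.update(set(S[best]))
            remaining.remove(best)
    return [S[i] for i in chosen]

# Toy sample (hypothetical data): both items have support 0.5 in S
S = [[1], [2], [1], [2]]
S0 = fast_grow(S, n=2, k=4)
print(sorted(S0))   # [[1], [2]]
```

In the toy run, whichever transaction is picked first, the second pick must be the other item, because that is the only choice that brings both supports in S0 to 0.5 and the distance to 0.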
Data Preparation



Downloaded from
fimi.cs.helsinki.fi/data/accidents.pdf
The data source for this dataset is the National
Institute of Statistics from the region of Flanders
in Belgium.
In total 572 unique attribute values can be found
in the dataset and an average of 45 attribute
values are recorded for each accident.
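
Statistics such as the number of unique attribute values and the average number of values per accident can be recomputed with a short script like the one below. It assumes the dataset is stored one transaction per line with whitespace-separated integer item IDs; the function name `read_transactions` and the inline sample text are hypothetical stand-ins for the real file.

```python
import io

def read_transactions(lines):
    """Parse one transaction per line, items as whitespace-separated integers."""
    transactions = []
    for line in lines:
        items = line.split()
        if items:                      # skip blank lines
            transactions.append([int(x) for x in items])
    return transactions

# Tiny stand-in for the dataset file (hypothetical contents)
sample_text = "1 2 3\n2 4\n\n1 4 5\n"
transactions = read_transactions(io.StringIO(sample_text))

distinct_items = {a for t in transactions for a in t}
avg_length = sum(len(t) for t in transactions) / len(transactions)
print(len(transactions), len(distinct_items), round(avg_length, 2))   # 3 5 2.67
```

For the real file, the same function can be called on an open file handle instead of the `io.StringIO` object.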
Experimental Results




Dataset with 340,183 transactions
Drew an initial simple random sample S of 30% of the dataset
Final sample ratios of 2.5%, 5%, 7.5% and 10%
Parameters:


Minimum Support = 0.77%
Size of group k = 10
Experimental Results
Results

Sampling ratio                 # of rules produced    % of Accuracy
2.5% (8,500 transactions)      2949                   27.64%
5% (17,010 transactions)       585                    100%
7.5% (25,500 transactions)     445                    100%
10% (34,020 transactions)      585                    100%

[Figure: bar chart of % of Accuracy by sampling ratio: 27.64% at 2.5%; 100% at 5%, 7.5%, and 10%]
Conclusions

No need to process a large input dataset

FAST-Grow can achieve high accuracy even
with a small sampling ratio of 5-10%

The algorithm performs better when
using the fixed-size stopping criterion
References

[1] B. Chen, P. Haas, and P. Scheuermann. A New Two-Phase Sampling Based
Algorithm for Discovering Association Rules. In Proceedings of the ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, 2002.

[2] H. Bronnimann, B. Chen, P. Haas, M. Dash, Y. Qiao, and P. Scheuermann.
Efficient Data-Reduction Methods for On-Line Association Rule Discovery.
Presented at the NSF Workshop on Next-Generation Data Mining (NGDM02), November 2002.

[3] K. Geurts. Traffic Accidents Data Set. fimi.cs.helsinki.fi/data/accidents.pdf.
Last accessed: 17/11/2005.

[4] C. Borgelt. GNU publicly available implementation of the Apriori algorithm.
http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html
Last accessed: 24/11/2005.
Thank you!
Questions?