Implementation of
“A New Two-Phase Sampling Based Algorithm
for Discovering Association Rules”
CSCI 6405 Data Warehousing and Data Mining
Tokunbo Makanju
Adan Cosgaya
Faculty of Computer Science
Dalhousie University
Fall 2005
Overview
Introduction
Algorithm
Data Preparation
Experimental Results
Conclusions
References
Introduction
Datasets keep getting larger
The time required to mine information from these
datasets grows with their size
Demand for faster rule mining is increasing
Solution: mine a sample of the original dataset
Algorithm
FAST (Finding Association in Sample Transactions)
2 versions
FAST-Trim
FAST-Grow
FAST outline:
Obtain a simple random sample S from the dataset D
Compute the support of each 1-itemset in S
Obtain a reduced sample S0 from S, either by trimming S (FAST-Trim)
or by growing S0 (FAST-Grow)
Run a standard association-rule algorithm against S0
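The first two steps of the outline can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the project: the function names, the fixed seed, and the toy database `D` are all assumptions made for the example.

```python
import random
from collections import Counter

def simple_random_sample(transactions, ratio, seed=0):
    """Draw a simple random sample (without replacement) of the given ratio."""
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    size = round(len(transactions) * ratio)
    return rng.sample(transactions, size)

def item_supports(transactions):
    """Return f(A; T) for every single item A: the fraction of
    transactions in T that contain A."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))  # count each item at most once per transaction
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}

# Toy database D (hypothetical data, not the accidents dataset)
D = [frozenset(t) for t in ([1, 2], [1, 3], [1, 2, 3], [2], [1], [3, 4], [1, 2], [2, 3])]
S = simple_random_sample(D, ratio=0.5)
supports_in_S = item_supports(S)
```

The supports computed on S are the reference values that the reduced sample S0 is later compared against.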
Algorithm
Distance Functions
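Choose the reduced sample S0 that solves
\[ \min_{S_0 \subseteq S,\ |S_0| = n} \mathrm{Dist}(S_0, S) \]
where Dist is one of:
\[ \mathrm{Dist}_1(S_0, S) = \frac{|L_1(S) \setminus L_1(S_0)| + |L_1(S_0) \setminus L_1(S)|}{|L_1(S_0)| + |L_1(S)|} \]
\[ \mathrm{Dist}_2(S_0, S) = \sum_{A \in I_1(S)} \bigl( f(A; S_0) - f(A; S) \bigr)^2 \]
\[ \mathrm{Dist}_\infty(S_0, S) = \max_{A \in I_1(S)} \bigl| f(A; S_0) - f(A; S) \bigr| \]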
I1(T) = set of all 1-itemsets in transaction set T
L1(T) = set of frequent 1-itemsets in transaction set T
f(A;T) = support of itemset A in transaction set T
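As a rough illustration, the three distance functions can be written over dictionaries that map each item to its support. The function names and the `min_support` parameter are assumptions for this sketch, and Dist_1 is normalised here by |L1(S0)| + |L1(S)|, which is also an assumption about the denominator.

```python
def frequent_items(supports, min_support):
    """L1(T): items whose support reaches the minimum support threshold."""
    return {a for a, f in supports.items() if f >= min_support}

def dist1(sup_s0, sup_s, min_support):
    """Size of the symmetric difference of the frequent 1-itemsets,
    normalised by |L1(S0)| + |L1(S)| (assumed normalisation)."""
    l_s0 = frequent_items(sup_s0, min_support)
    l_s = frequent_items(sup_s, min_support)
    denom = len(l_s0) + len(l_s)
    if denom == 0:
        return 0.0  # neither sample has a frequent item
    return (len(l_s - l_s0) + len(l_s0 - l_s)) / denom

def dist2(sup_s0, sup_s):
    """Sum of squared support differences over all items A in I1(S)."""
    return sum((sup_s0.get(a, 0.0) - f) ** 2 for a, f in sup_s.items())

def dist_inf(sup_s0, sup_s):
    """Largest absolute support difference over all items A in I1(S)."""
    return max((abs(sup_s0.get(a, 0.0) - f) for a, f in sup_s.items()), default=0.0)

# Hypothetical supports: item -> fraction of transactions containing it
sup_s = {"a": 0.5, "b": 0.25}
sup_s0 = {"a": 0.75}  # "b" never occurs in S0, so f(b; S0) = 0
```

Items missing from S0 are treated as having support 0, which is consistent with S0 being a subset of S.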
Algorithm
FAST-Grow Algorithm
Obtain a simple random sample S from D;
compute f(A;S) for each item A ∈ I1(S);
set i = 0, S0(i) = ∅, minDist = ∞, and minStage = -1;
while (|S0(i)| < n) {
  divide S − S0(i) into disjoint groups of min(k, |S − S0(i)|) transactions each;
  for each group G {
    set S0(i) := S0(i) ∪ {t*}, where t* ∈ G satisfies
      Dist(S0(i) ∪ {t*}, S) = min over t ∈ G of Dist(S0(i) ∪ {t}, S);
  }
  compute f(A; S0(i)) for each item A occurring in S0(i);
  if (Dist(S0(i), S) < minDist) {
    set minDist := Dist(S0(i), S) and minStage := i;
  }
  set S0(i+1) := S0(i) and i := i + 1;
}
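The growth loop can be sketched in Python as follows. This is a simplified reading of the pseudocode rather than the project's code: it uses the sum-of-squares distance Dist_2, stops as soon as S0 reaches n transactions (the fixed-size criterion), and leaves out the minDist/minStage bookkeeping. The function name `fast_grow` and its parameters are assumptions.

```python
import random
from collections import Counter

def fast_grow(S, n, k=10, seed=0):
    """Greedily grow a subsample S0 of n transactions from S.

    S is a list of sets of items. In each pass the transactions not yet
    chosen are split into disjoint groups of up to k, and from each group
    the transaction that brings the 1-itemset supports of S0 closest to
    those of S (Dist_2) is added to S0.
    """
    rng = random.Random(seed)
    n = min(n, len(S))

    # Target supports f(A; S) for every item A in S.
    totals = Counter()
    for t in S:
        totals.update(t)
    f_s = {a: c / len(S) for a, c in totals.items()}

    remaining = list(range(len(S)))  # indices, so duplicate transactions are fine
    rng.shuffle(remaining)
    chosen = []                      # indices of the transactions in S0
    counts = Counter()               # item counts inside S0

    def dist2_if_added(t):
        """Dist_2(S0 ∪ {t}, S) without actually modifying S0."""
        size = len(chosen) + 1
        return sum(((counts[a] + (a in t)) / size - f) ** 2 for a, f in f_s.items())

    while len(chosen) < n and remaining:
        groups = [remaining[j:j + k] for j in range(0, len(remaining), k)]
        for group in groups:
            if len(chosen) >= n:
                break
            best = min(group, key=lambda idx: dist2_if_added(S[idx]))
            chosen.append(best)
            counts.update(S[best])
        picked = set(chosen)
        remaining = [idx for idx in remaining if idx not in picked]
    return [S[idx] for idx in chosen]

# Toy sample S (hypothetical): half the transactions contain item 1, half item 2
S = [frozenset({1})] * 4 + [frozenset({2})] * 4
S0 = fast_grow(S, n=2, k=8)
```

On this toy input the second pick balances the first, so S0 ends up with one transaction of each kind and reproduces the 50/50 supports of S.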
Data Preparation
Downloaded from
fimi.cs.helsinki.fi/data/accidents.pdf
The data source for this dataset is the National
Institute of Statistics from the region of Flanders
in Belgium.
In total 572 unique attribute values can be found
in the dataset and an average of 45 attribute
values are recorded for each accident.
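The summary figures above can be checked with a short script. The sketch below assumes the usual FIMI repository layout, one transaction per line with whitespace-separated integer item ids; the file name in the comment is hypothetical.

```python
def parse_transactions(lines):
    """Parse one transaction per line; items are whitespace-separated integer ids."""
    return [frozenset(int(tok) for tok in line.split())
            for line in lines if line.strip()]

def describe(transactions):
    """Return (number of transactions, number of distinct items, average length)."""
    if not transactions:
        return 0, 0, 0.0
    items = set().union(*transactions)
    avg_len = sum(len(t) for t in transactions) / len(transactions)
    return len(transactions), len(items), avg_len

# Example with inline data; for the real dataset one would pass an open file:
#   with open("accidents.dat") as fh:   # hypothetical local file name
#       transactions = parse_transactions(fh)
sample_lines = ["1 2 3\n", "2 3\n", "\n"]
transactions = parse_transactions(sample_lines)
stats = describe(transactions)
```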
Experimental Results
Dataset D with 340,183 transactions
Initial simple random sample S: 30% of D
Final samples S0 at ratios of 2.5%, 5%, 7.5% and 10% of D
Parameters:
Minimum Support = 0.77%
Size of group k = 10
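To make these parameters concrete, the following sketch converts a sampling ratio into a sample size and the minimum support into an absolute occurrence count. The helper names are assumptions, and the exact sizes it computes differ slightly from the rounded transaction counts reported in the results.

```python
import math

N_TRANSACTIONS = 340_183  # dataset size
MIN_SUPPORT = 0.0077      # minimum support of 0.77%

def sample_size(ratio, n=N_TRANSACTIONS):
    """Number of transactions in a sample of the given ratio."""
    return round(n * ratio)

def min_support_count(size, min_support=MIN_SUPPORT):
    """Smallest number of occurrences that reaches the minimum support."""
    return math.ceil(min_support * size)

for ratio in (0.025, 0.05, 0.075, 0.10):
    size = sample_size(ratio)
    print(f"{ratio:.1%}: {size} transactions, "
          f"frequent if an itemset occurs at least {min_support_count(size)} times")
```

At the smallest ratio an itemset needs only a few dozen occurrences to count as frequent, so small fluctuations in the sample can change which itemsets pass the threshold.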
Experimental Results
Results
Sampling ratio                 # of rules produced   % of Accuracy
2.5% (8,500 transactions)      2949                  27.64%
5% (17,010 transactions)       585                   100%
7.5% (25,500 transactions)     445                   100%
10% (34,020 transactions)      585                   100%
[Figure: bar chart of % of Accuracy (y-axis) against Sampling ratio (x-axis): 27.64% at 2.5%, and 100% at 5%, 7.5% and 10%]
Conclusions
Mining a sample avoids processing the full input dataset
FAST-Grow reached 100% accuracy in these experiments
with small sampling ratios of 5-10%
The algorithm performs better when
using the fixed-size stopping criterion
References
[1] B. Chen, P. Haas, and P. Scheuermann. A new two-phase sampling based
algorithm for discovering association rules. In Proceedings of the ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, 2002.
[2] H. Brönnimann, B. Chen, P. Haas, M. Dash, Y. Qiao, and P. Scheuermann. Efficient
data-reduction methods for on-line association rule discovery. Presented at the
NSF Workshop on Next-Generation Data Mining (NGDM02), November 2002.
[3] K. Geurts. Traffic accidents data set. fimi.cs.helsinki.fi/data/accidents.pdf.
Last access: 17/11/2005.
[4] C. Borgelt. GNU publicly available implementation of the Apriori algorithm.
http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html.
Last access: 24/11/2005.
Thank you!
Questions?