An efficient hash based algorithm for mining closed frequent itemsets
Ms. Dhara Patel¹ and Prof. Ketan Sarvakar²
¹ ME (CSE) Student, UVPCE, Kherva, Gujarat, India
² Asst. Professor, UVPCE, Kherva, Gujarat, India
Abstract: Association rule discovery has emerged as an important problem in knowledge discovery and data mining. The association mining task consists of identifying the frequent itemsets and then forming conditional implication rules among them. Efficient algorithms to discover frequent patterns are crucial in data mining research. Finding frequent itemsets is computationally the most expensive step in association rule discovery, and it has therefore attracted significant research attention. In this paper we develop an improved procedure for generating frequent itemsets and present a result analysis with the wine dataset. The improved procedure is compared with the ILLT algorithm, and the time required to generate the itemsets is less.
I. Introduction
Association rule mining is one of the most important and well-researched techniques. It aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in transaction databases or other data repositories. It is widely used in various areas such as cross marketing, new product development, personalized service and commercial credit evaluation in e-business. The process of discovering all the association rules consists of two steps: 1) discovery of all frequent itemsets that have minimum support, and 2) generation of all rules from the discovered frequent itemsets that meet the confidence threshold. Most research has focused on efficient methods for finding frequent itemsets, because it is computationally the most expensive step; such methods attack the cost of candidate itemset generation by avoiding candidate generation altogether and by reducing the time needed to scan the database. However, they do not concentrate on mining frequent itemsets based on highly similar transactions in the database, as found in the real world, e.g. wholesale transactions and medical prescription transactions.
Since their introduction in 1993 by Agrawal et al. [1], the frequent itemset and association rule mining problems have received a great deal of attention. Many papers have been published presenting new algorithms and improvements on existing algorithms to solve these mining problems efficiently, because exploiting customer behaviour and making correct decisions requires analyzing huge amounts of data. For example, an association rule "beer ⇒ chips (80%)" states that four out of five customers who bought beer also bought chips. Such rules can be useful for decision making concerning promotions, store layout, product pricing and more [1].
The Apriori algorithm achieves a good reduction in the size of the candidate sets. However, when there are a large number of frequent patterns and/or long patterns, candidate generation-and-test methods may still suffer from generating huge numbers of candidates and taking many scans of large databases for frequency checking.
Large amounts of data are routinely collected in the course of day-to-day management in business, administration, banking, e-commerce, the delivery of social and health services, environmental protection, security and politics. With the tremendous growth of data, users expect more relevant and sophisticated information, which may be lying hidden in the data. Existing analysis and evaluation techniques cannot keep pace with this tremendous growth. Data mining is often described as a discipline for finding hidden information in databases. It involves different techniques and algorithms to discover useful knowledge lying hidden in the data. Association rule mining has been one of the most popular data mining subjects; it can be simply defined as finding interesting rules from collections of data. The first step in association rule mining is finding frequent itemsets. This is a very resource-consuming task, and for that reason it has been one of the most popular research fields in data mining.
At the same time, very large databases do exist in real life. In a medium-sized business, or in a company as big as Walmart, it is very easy to collect a few gigabytes of data. Terabytes of raw data are ubiquitously being recorded in commerce, science and government. The question of how to handle these databases is still one of the most difficult problems in data mining.
In this paper, Section 2 discusses related work on frequent itemset mining; Section 3 discusses the methodology for frequent itemset mining; Section 4 discusses the result analysis; finally, Section 5 concludes the paper.
II. Related Work
Generally, the methods for finding frequent itemsets can be divided into two approaches: candidate generation-and-test and pattern growth. A basic algorithm for candidate generation-and-test is Apriori, which builds candidate itemsets from the items and counts them against a pre-given threshold value to find the frequent itemsets in the database. This algorithm requires multiple database scans, as many as the length of the longest frequent itemset.
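To make the candidate generation-and-test idea concrete, the following minimal Python sketch illustrates the level-wise pattern; the function and variable names are ours, for illustration only, and it is not any of the cited implementations. One full pass over the database is made per level to count candidate supports, which is why the number of scans grows with the length of the longest frequent itemset.

from itertools import combinations

def apriori(transactions, min_sup):
    # Frequent 1-itemsets: items whose absolute support meets min_sup.
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items
               if sum(1 for t in transactions if i in t) >= min_sup]
    frequent, k = list(current), 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))]
        # Test step: one database scan per level to count candidate supports.
        current = [c for c in candidates
                   if sum(1 for t in transactions if c <= set(t)) >= min_sup]
        frequent.extend(current)
        k += 1
    return frequent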
Formally, as defined in [2], the problem of mining association rules is stated as follows: Let I = {i1, i2, …, im} be a set of items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Associated with each transaction is a unique identifier, called its transaction id TID. A transaction T contains X, a set of some items in I, if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = φ. The meaning of such a rule is that transactions in the database which contain the items in X tend to also contain the items in Y. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y. The problem is to mine all association rules that have support and confidence greater than the user-specified minimum support and minimum confidence, respectively. Conventionally, the problem of discovering all association rules is composed of the following two steps: 1) find the large itemsets that have transaction support above the minimum support, and 2) from the discovered large itemsets, generate the desired association rules.
The overall performance of mining association rules is determined by the first step: after the identification of the large itemsets, the corresponding association rules can be derived in a straightforward manner.
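As a brief worked example, using the data of Table 1 in Section III: for the rule i1 ⇒ i2, eight of the eighteen transactions contain both i1 and i2, so its support is s = 8/18 ≈ 44%; twelve transactions contain i1, and eight of those also contain i2, so its confidence is c = 8/12 ≈ 67%.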
All of these algorithms produce frequent itemsets on the basis of minimum support. The advantage of the Apriori algorithm is that it is a simple and easy algorithm for finding the frequent elements in the database; its disadvantages are that more search space is needed, the I/O cost increases, and the number of database scans increases, so candidate generation grows and the computational cost increases with it [1, 2]. The Apriori algorithm is quite successful for market basket analysis, in which the transactions are large but the number of frequent items generated is small.
The advantage of the Eclat algorithm is that it produces good results when the database is very large; its disadvantage is that it generates a larger number of candidates than Apriori. Vertical-layout-based algorithms claim to be faster than Apriori but require more memory than horizontal-layout-based ones, because they need to load the candidates, the database and the TID lists into main memory [3].
The advantages of the FP-Growth algorithm are that it makes only two passes over the dataset and, with no candidate generation, is faster than Apriori; its disadvantages are that the tree may not fit in memory and is expensive to build [4]. The advantages of the H-mine algorithm are that there is no candidate generation and it does not need to store any frequent pattern in memory; its disadvantages are that it requires more memory and offers no random access [5]. FP-Tree and H-mine perform better than all of the algorithms discussed above because they generate no candidate sets, but the pointers that must be stored in memory require a large memory space.
The advantages of the Frequent Item Graph (FIG) algorithm are a quick mining process without candidates and scanning the entire database only once; its disadvantage is that it requires a full scan of the frequent 2-itemsets that are used when building the graphical structure in the second phase [6]. The advantages of the Frequent Itemsets Algorithm for Similar Transactions (FIAST) are that it saves space and reduces time; its disadvantage is that it relies on AND operations to find itemsets, and although it tries to reduce I/O, space and time, its performance decreases for sparse datasets [7]. The advantages of the Indexed Limited Level Tree (ILLT) algorithm are that it makes it easy to find frequent itemsets for different support levels and that it scans the database only once; its disadvantage is candidate generation [8]. The ILLT algorithm performs better, but it requires a large memory space to store the tree structure.
In data mining, frequent itemset mining is widely acknowledged because it has many applications, such as correlation analysis, association rules based on frequent patterns, and sequential pattern tasks. Finding association rules over frequent itemsets is as important as any other data mining task. The major difficulty in frequent pattern mining is the resulting large number of patterns: as the minimum threshold becomes lower, an exponentially large number of itemsets are generated, so pruning becomes an important topic in mining frequent patterns. Therefore, the goal is to optimize the process of finding frequent patterns so that it is scalable and efficient and yields the important patterns [9].
III. Methodology
A large transactional database holds many items, so it surely contains various transactions that share the same set of items. The improved procedure takes advantage of these transactions to find the frequent itemsets and to prune off the candidate itemsets whose count is lower than the minimum support, resulting in efficient execution time.
The sampling method is popular in computational statistics; two important terms related to it are population and sample. The population is defined in keeping with the objectives of the study; a sample is a subset of the population. Usually, when the population is large, a scientifically chosen sample can be used to represent it, because the sample reflects the characteristics of the population from which it is drawn.
Usually, in data mining, the population is large, so the sampling method is appropriate. In the example given below, suppose that the sample S data in Table 2 is a carefully chosen sample of the population P in Table 1.
Table 1: Population Data
TID   List of item_IDs
1     i1, i2, i5
2     i2, i4
3     i2, i3
4     i1, i2, i4
5     i1, i3
6     i2, i3
7     i1, i3
8     i1, i2, i3, i5
9     i1, i2, i3
10    i1, i2, i5
11    i2, i4
12    i2, i3
13    i1, i2, i4
14    i1, i3
15    i2, i3
16    i1, i3
17    i1, i2, i3, i5
18    i1, i2, i3
Using the sampling method can save much time. If the sample is carefully chosen, it can represent the population, and then the table built from the sample can represent the one built from the population: the 2-itemsets with high frequency in the sample's table are liable to be the ones with high frequency in the population's table.
Table 2: Sample Data
TID   List of item_IDs
1     i1, i2, i5
2     i2, i4
3     i2, i3
4     i1, i2, i4
5     i1, i3
6     i2, i3
7     i1, i3
8     i1, i2, i3, i5
9     i1, i2, i3
Procedure
1.) Carefully draw a sample S from the population P, usually by random sampling.
2.) Process the sample S to obtain a table, denoted as table HS.
3.) Rank the table HS by the frequency of the column content, so that the addresses with high frequency come first and those with low frequency come last; this gives a new table HSR.
4.) Based on HSR, process the rest of the population P, i.e. P − S; when finished, this gives a table denoted as HP.
5.) Obtain the frequent itemsets according to the predetermined minimum support count.
Pseudo Code
Input: Database D, min_sup
Output: Frequent itemsets L
Procedure MiningFrequentItemsets
    For some transactions t ∈ D,
        Insert t into sample S;
    End for;
    Build a hash table HS from sample S;
    Rank HS by descending count to obtain a new hash table HSR;
    Based on HSR, count the rest of the database, P − S, into a hash table HP;
    L = ∅;
    For each item in HP,
        If count(item) ≥ min_sup,
            L = L ∪ {item};
        End if;
    End for;
    Return L;
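The pseudocode can be rendered as a short runnable sketch. The following Python fragment is our own illustration, not the paper's implementation; it is restricted to 2-itemsets as in the example below, and it assumes, following Tables 3 to 5, that the hash address of a 2-itemset is simply the sorted pair of its item indices.

from itertools import combinations
from collections import Counter

def pairs(transaction):
    # Every 2-itemset of a transaction, addressed by its sorted index pair.
    return (tuple(sorted(p)) for p in combinations(transaction, 2))

def mine_frequent_2itemsets(population, sample_size, min_sup):
    sample, rest = population[:sample_size], population[sample_size:]
    # Step 2: hash table HS built from the sample.
    hs = Counter(p for t in sample for p in pairs(t))
    # Step 3: rank by descending count to obtain HSR. In a real hash table this
    # places the likely-frequent addresses first; here it only orders the dict.
    hsr = dict(hs.most_common())
    # Step 4: continue counting the rest of the population (P - S) into HP.
    hp = Counter(hsr)
    hp.update(p for t in rest for p in pairs(t))
    # Step 5: keep every address whose count reaches min_sup.
    return {addr: c for addr, c in hp.items() if c >= min_sup}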
Example
Taking the data in Table 1 as an example, we show how the procedure works.
• Draw a sample S from the population P, shown in Table 2.
• Process the sample S to obtain a table, denoted as table HS, shown in Table 3.
Table 3: Table HS
Address   Count   Content
(1, 2)    4       {i1, i2}
(1, 5)    2       {i1, i5}
(2, 5)    2       {i2, i5}
(2, 4)    2       {i2, i4}
(2, 3)    4       {i2, i3}
(1, 4)    1       {i1, i4}
(1, 3)    4       {i1, i3}
(3, 5)    1       {i3, i5}
• Rank the table HS by the frequency of the column content, so that the addresses with high-frequency content come first and those with low-frequency content come last; this gives a new table HSR, shown in Table 4.
Table 4: Table HSR
Address   Count   Content
(1, 2)    4       {i1, i2}
(2, 3)    4       {i2, i3}
(1, 3)    4       {i1, i3}
(1, 5)    2       {i1, i5}
(2, 5)    2       {i2, i5}
(2, 4)    2       {i2, i4}
(1, 4)    1       {i1, i4}
(3, 5)    1       {i3, i5}
• Based on HSR, process the rest of the population P, i.e. P − S; when finished, this gives a table denoted as HP, shown in Table 5.
Table 5: Table HP
Address   Count   Content
(1, 2)    8       {i1, i2}
(2, 3)    8       {i2, i3}
(1, 3)    8       {i1, i3}
(1, 5)    4       {i1, i5}
(2, 5)    4       {i2, i5}
(2, 4)    4       {i2, i4}
(1, 4)    2       {i1, i4}
(3, 5)    2       {i3, i5}
• Obtain the frequent itemsets according to the predetermined minimum support count. If we set the support count to 6, we find that the 2-itemsets {i1, i2}, {i2, i3} and {i1, i3} are frequent.
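As a check, running the sketch given after the pseudocode on the data of Table 1 (transactions encoded as lists of item indices, with the first nine rows forming the sample S) reproduces this result:

population = [
    [1, 2, 5], [2, 4], [2, 3], [1, 2, 4], [1, 3], [2, 3], [1, 3], [1, 2, 3, 5], [1, 2, 3],
    [1, 2, 5], [2, 4], [2, 3], [1, 2, 4], [1, 3], [2, 3], [1, 3], [1, 2, 3, 5], [1, 2, 3],
]
print(mine_frequent_2itemsets(population, sample_size=9, min_sup=6))
# {(1, 2): 8, (2, 3): 8, (1, 3): 8}  -- i.e. {i1, i2}, {i2, i3}, {i1, i3}, as in Table 5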
IV. Result Analysis
In our experiments we chose datasets with different numbers of records to demonstrate the efficiency of the algorithm. Table 6 shows the datasets, taken from the UCI repository of machine learning databases [10].
Table 6: The characteristics of the datasets
Dataset                        Number of Records
agaricus-lepiota.data.txt      8124
balance-scale.data.txt         625
bridges.data.txt               108
flag.data.txt                  194
imports-85.data.txt            205
letter-recognition.data.txt    20000
machine.data.txt               209
tic-tac-toe.data.txt           958
wine.data.txt                  178
The experimental study reveals the performance of the efficient algorithm relative to the Apriori algorithm. The run time is the time taken to mine the frequent itemsets. The experimental results shown in Tables 7 to 15 reveal that the algorithm outperforms the Apriori algorithm; the same results are shown in Figures 1 to 9.
As is clear from the comparison, the efficient algorithm performs well at low support values for all the datasets shown in Table 6, but at higher support values its advantage over the Apriori algorithm diminishes slightly: the difference between the execution times of the efficient algorithm and the Apriori algorithm decreases in the later stages.
Table 7: Execution Time for Apriori Algorithm and Efficient Algorithm using Agaricus-lepiota dataset
Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            173.89                  134.73
50            80.78                   62.87
60            53.86                   43.81
Figure 1: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Agaricus-lepiota dataset
Table 8: Execution Time for Apriori Algorithm and Efficient Algorithm using Balance-scale dataset
Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            13.37                   11.51
50            6.21                    5.34
60            4.14                    3.71
Figure 2: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Balance-scale dataset
Table 9: Execution Time for Apriori Algorithm and Efficient Algorithm using Bridges dataset
Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            2.31                    1.89
50            1.07                    0.86
60            0.72                    0.58
Figure 3: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Bridges dataset
Table 10: Execution Time for Apriori Algorithm and Efficient Algorithm using Flag dataset
Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            4.15                    3.58
50            1.93                    1.62
60            1.29                    1.10
Figure 4: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Flag dataset
Table 11: Execution Time for Apriori Algorithm and Efficient Algorithm using Imports-85 dataset
Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            5.71                    4.73
50            2.45                    1.92
60            1.50                    1.17
Figure 5: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Imports-85 dataset
Table 12: Execution Time for Apriori Algorithm and Efficient Algorithm using Letter-recognition dataset
Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            406.69                  331.69
50            169.05                  130.34
60            99.44                   75.51
Figure 6: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Letter-recognition dataset
Table 13: Execution Time for Apriori Algorithm and Efficient Algorithm using Machine dataset
Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            4.47                    3.66
50            2.09                    1.79
60            1.39                    1.24
Figure 7: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Machine dataset
Table 14: Execution Time for Apriori Algorithm and Efficient Algorithm using Tic-tac-toe dataset
Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            23.59                   17.65
50            10.48                   7.80
60            6.67                    5.17
Figure 8: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Tic-tac-toe dataset
Table 15: Execution Time for Apriori Algorithm and Efficient Algorithm using Wine dataset
Support (%)   Apriori Algorithm (s)   Efficient Algorithm (s)
40            3.89                    3.28
50            1.81                    1.45
60            1.20                    0.96
Figure 9: Total Execution Time for Apriori Algorithm and Efficient Algorithm using Wine dataset
V. Conclusion
We chose datasets with different numbers of records to demonstrate the efficiency of the algorithm: nine datasets, namely agaricus-lepiota, balance-scale, bridges, flag, imports-85, letter-recognition, machine, tic-tac-toe and wine, from the UCI repository of machine learning databases.
The frequent pattern mining problem has been studied extensively; in creating our efficient algorithm we considered the factor of time consumption, which is affected by the approach used for finding the frequent itemsets. Work has been done to develop an efficient algorithm that is an improvement over the Apriori algorithm.
For the different datasets, the running time of our efficient algorithm outperformed the Apriori algorithm. The efficient algorithm performed well against the Apriori algorithm on the collected datasets at the lower support levels, and its running time also performed well at the higher support levels. Thus it saves much time and, as the results show, can be considered an efficient algorithm.
References
[1] Agrawal, R.; Imielinski, T.; Swami, A., "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM SIGMOD Conference, Washington DC, USA, 1993.
[2] Agrawal, R.; Srikant, R., "Fast Algorithms for Mining Association Rules," Proc. Int'l Conf. Very Large Data Bases (VLDB), pp. 487-499, September 1994.
[3] Borgelt, C., "Efficient Implementations of Apriori and Eclat," Proc. 1st IEEE ICDM Workshop on Frequent Item Set Mining Implementations, CEUR Workshop Proceedings 90, Aachen, Germany, 2003.
[4] Han, J.; Pei, J.; Yin, Y., "Mining Frequent Patterns without Candidate Generation," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD), 2000.
[5] Pei, J.; Han, J.; Lu, H.; Nishio, S.; Tang, S.; Yang, D., "H-mine: Hyper-Structure Mining of Frequent Patterns in Large Databases," Proc. Int'l Conf. Data Mining (ICDM), November 2001.
[6] Kumar, A.V.S.; Wahidabanu, R.S.D., "A Frequent Item Graph Approach for Discovering Frequent Itemsets," Proc. Int'l Conf. Advanced Computer Theory and Engineering (ICACTE '08), pp. 952-956, 20-22 Dec. 2008.
[7] Duemong, F.; Preechaveerakul, L.; Vanichayobon, S., "FIAST: A Novel Algorithm for Mining Frequent Itemsets," Proc. Int'l Conf. Future Computer and Communication (ICFCC 2009), pp. 140-144, 3-5 April 2009.
[8] Venkateswari, S.; Suresh, R.M., "An Efficient Algorithm for Discovery of Frequent Itemsets," Proc. Int'l Conf. Signal and Image Processing (ICSIP), pp. 531-533, 15-17 Dec. 2010.
[9] Pramod, S.; Vyas, O.P., "Survey on Frequent Item Set Mining Algorithms," International Journal of Computer Applications, pp. 86-91, 2010.
[10] Blake, C.L.; Merz, C.J., UCI Repository of Machine Learning Databases, Dept. of Information and Computer Science, University of California at Irvine, CA, USA, 1998.