An Improved Procedure for Frequent Pattern Mining in Transactional Database

Ms. Dhara Patel¹ and Prof. Ketan Sarvakar²
¹ ME (CSE Student), UVPCE, Kherva, Gujarat, India
² Asst. Professor, UVPCE, Kherva, Gujarat, India
Abstract: Association rule discovery has emerged as an important problem in knowledge discovery and data mining. The association mining task consists of identifying the frequent itemsets and then forming conditional implication rules among them. Efficient algorithms to discover frequent patterns are crucial in data mining research, since finding frequent itemsets is computationally the most expensive step in association rule discovery; it has therefore attracted significant research attention. In this paper we develop an improved procedure for generating frequent itemsets and analyse its results on the Wine dataset. The improved procedure is compared with the ILLT algorithm and requires less time to generate the itemsets.
I. Introduction
Association rule mining is one of the most important and well-researched data mining techniques. It aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in a transaction database or other data repositories. It is widely used in areas such as cross marketing, new product development, personalized service and commercial credit evaluation in e-business. The process of discovering all association rules consists of two steps: 1) discovery of all frequent itemsets that have minimum support, and 2) generation, from the discovered frequent itemsets, of all rules that meet the confidence threshold. Most research has focused on efficient methods for finding frequent itemsets, because it is computationally the most expensive step, and has attacked the cost of candidate itemset generation by avoiding candidate generation and by reducing the time needed to scan the database. However, this work does not concentrate on mining frequent itemsets from the highly similar transactions found in real-world databases, e.g. wholesale transactions and medical prescription transactions.
Since their introduction in 1993 by Agrawal et al., the frequent itemset and association rule mining problems have received a great deal of attention, and many papers presenting new algorithms and improvements on existing algorithms have been published to solve these mining problems efficiently. Analyzing huge amounts of data in this way helps to exploit customer behaviour and make correct decisions. For example, an association rule "beer ⇒ chips (60%)" states that 60% of the customers who bought beer also bought chips. Such rules can be useful for decision making about promotions, store layout, product pricing and other matters [1].
The Apriori algorithm achieves a good reduction in the size of the candidate sets. However, when there exists a large number of frequent patterns and/or long patterns, candidate generation-and-test methods may still generate huge numbers of candidates and require many scans of large databases for frequency checking.
Large amounts of data are collected routinely in the course of day-to-day management in business, administration, banking, e-commerce, the delivery of social and health services, environmental protection, security and politics. With this tremendous growth of data, users expect more relevant and sophisticated information, which may be lying hidden in the data, and existing analysis and evaluation techniques do not keep pace with that growth. Data mining is often described as a discipline for finding hidden information in databases; it involves different techniques and algorithms to discover useful knowledge lying hidden in the data. Association rule mining has been one of the most popular data mining subjects, and can be simply defined as finding interesting rules from collections of data. The first step in association rule mining is finding the frequent itemsets. This is a very resource-consuming task, and for that reason it has been one of the most popular research fields in data mining.
At the same time, very large databases do exist in real life. In a medium-sized business, or in a company as big as Walmart, it is very easy to collect a few gigabytes of data, and terabytes of raw data are being recorded ubiquitously in commerce, science and government. The question of how to handle these databases is still one of the most difficult problems in data mining.
In this paper, Section 2 discusses related work on frequent itemset mining; Section 3 presents the methodology for frequent itemset mining; Section 4 discusses the result analysis; finally, Section 5 concludes the paper.
II. Related Work
Generally, the methods for finding frequent itemsets can be divided into two approaches: candidate generation-and-test, and pattern growth. A basic candidate generation-and-test algorithm is Apriori, which builds candidate itemsets from the items and counts them against a pre-given threshold to find the frequent itemsets in the database. This algorithm requires multiple database scans, as many as the length of the longest frequent itemset.
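As an illustration of the generation-and-test idea, the following is a minimal Python sketch; the function name, data layout and structure are ours, not taken from [2], and it is meant as a sketch rather than an optimized implementation.

from itertools import combinations

def apriori(transactions, min_sup):
    # Level-wise candidate generation-and-test over a list of item sets.
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # Level 1: frequent 1-itemsets.
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= min_sup}
    all_freq, k = set(freq), 2
    while freq:
        # Join: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # Prune: every (k-1)-subset of a surviving candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # Test: one scan of the database per level to count support.
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) >= min_sup}
        all_freq |= freq
        k += 1
    return all_freq

Each pass of the while loop corresponds to one database scan, which is exactly why the number of scans grows with the length of the longest frequent itemset.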
Formally, as defined in [2], the problem of mining association rules is stated as follows. Let I = {i1, i2, …, im} be a set of items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Associated with each transaction is a unique identifier, called its transaction id (TID). A transaction T contains X, a set of some items in I, if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The meaning of such a rule is that transactions in the database which contain the items in X tend also to contain the items in Y. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y. The problem is to mine all association rules that have support and confidence greater than the user-specified minimum support and minimum confidence respectively. Conventionally, the problem of discovering all association rules is decomposed into two steps: 1) find the large itemsets that have transaction support above the minimum support, and 2) from the discovered large itemsets, generate the desired association rules.
The overall performance of mining association rules is determined by the first step. After the identification of the large itemsets, the corresponding association rules can be derived in a straightforward manner.
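To make the support and confidence definitions concrete, here is a small sketch; the helper names and the four-transaction toy database are ours, chosen only for illustration.

def support(D, itemset):
    # Fraction of transactions in D containing every item of `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in D) / len(D)

def confidence(D, X, Y):
    # Confidence of X => Y is support(X u Y) / support(X).
    return support(D, set(X) | set(Y)) / support(D, set(X))

D = [{"i1", "i2", "i5"}, {"i2", "i4"}, {"i2", "i3"}, {"i1", "i2", "i4"}]
print(support(D, {"i1", "i2"}))       # 0.5: 2 of the 4 transactions contain both
print(confidence(D, {"i1"}, {"i2"}))  # 1.0: every transaction with i1 also has i2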
All these algorithms produce frequent itemsets on the basis of minimum support. The advantage of the Apriori algorithm is that it is a simple and easy way to find the frequent elements of the database; its disadvantages are that more search space is needed, I/O cost increases, and the number of database scans grows, so candidate generation increases and with it the computational cost [1, 2]. Apriori is quite successful for market basket analysis, in which the transactions are large but the number of frequent items generated is small.
The advantage of the Eclat algorithm is that it produces good results when the database is very large; its disadvantage is that it generates a larger number of candidates than Apriori. Vertical-layout-based algorithms claim to be faster than Apriori but require more memory than horizontal-layout-based ones, because they need to load the candidates, the database and the TID lists into main memory [3].
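The vertical layout can be illustrated with a short sketch: each item is mapped to its TID list, and the support of an itemset is the size of the intersection of its members' TID sets. The data and names below are illustrative, not Eclat's actual implementation.

from collections import defaultdict

def vertical_layout(transactions):
    # Map each item to the set of TIDs of the transactions containing it.
    tids = defaultdict(set)
    for tid, t in enumerate(transactions, start=1):
        for item in t:
            tids[item].add(tid)
    return tids

transactions = [{"i1", "i2", "i5"}, {"i2", "i4"}, {"i2", "i3"}, {"i1", "i2", "i4"}]
tids = vertical_layout(transactions)
# Support of {i1, i2} is the size of the TID-list intersection {1, 4}.
print(len(tids["i1"] & tids["i2"]))  # 2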
The advantages of the FP-Growth algorithm are that it makes only two passes over the dataset and performs no candidate generation, making it faster than Apriori; its disadvantages are that the tree may not fit in memory and is expensive to build [4]. The advantages of the H-mine algorithm are that there is no candidate generation and no frequent pattern needs to be stored in memory; its disadvantages are that it requires more memory and offers no random access [5]. FP-tree and H-mine perform better than all of the algorithms discussed above because they generate no candidate sets, but the pointers they need to keep in memory require a large memory space.
The advantages of the Frequent Item Graph (FIG) algorithm are a quick mining process and a single scan of the entire database; its disadvantage is that it requires a full scan of the frequent 2-itemsets to be used when building the graphical structure in the second phase [6]. The advantage of the Frequent Itemsets Algorithm for Similar Transactions (FIAST) is that it saves space and reduces time; its disadvantage is that it uses AND operations to find the itemsets, and although it tries to reduce I/O, space and time, its performance decreases for sparse datasets [7]. The advantages of the Indexed Limited Level Tree (ILLT) algorithm are that it is easy to find frequent itemsets for different support levels and the database is scanned only once; its disadvantage is candidate generation [8]. The ILLT algorithm performs better, but requires a large memory space to store the tree structure.
In data mining, frequent itemset mining is widely recognised because many applications, such as correlation analysis, association rules based on frequent patterns, and sequential pattern tasks, build on it. Finding association rules from frequent itemsets is as important as the other data mining tasks. The major difficulty in frequent pattern mining is the large number of resulting patterns: as the minimum threshold is lowered, an exponentially large number of itemsets is generated, so pruning becomes an important topic in mining frequent patterns. The goal, therefore, is to optimize the process of finding frequent patterns so that it is scalable and efficient and yields the important patterns [9].
III. Methodology
A large transactional database holds many items, so it surely contains various transactions that consist of the same set of items. The improved procedure takes advantage of these transactions to find the frequent itemsets, pruning the candidate itemsets whose count is lower than the minimum support, which results in efficient execution time.
The sampling method is popular in computational statistics; two important terms related to it are population and sample. The population is defined in keeping with the objectives of the study; a sample is a subset of the population. Usually, when the population is large, a scientifically chosen sample can be used to represent the population, because the sample reflects the characteristics of the population from which it is drawn.
Usually, in data mining, the population is large, so the sampling method is appropriate. In the following example, suppose that the sample data S in Table 2 is a carefully chosen sample of some population P in Table 1.
Table 1: Population Data

TID | List of item_IDs
----|------------------
  1 | i1, i2, i5
  2 | i2, i4
  3 | i2, i3
  4 | i1, i2, i4
  5 | i1, i3
  6 | i2, i3
  7 | i1, i3
  8 | i1, i2, i3, i5
  9 | i1, i2, i3
 10 | i1, i2, i5
 11 | i2, i4
 12 | i2, i3
 13 | i1, i2, i4
 14 | i1, i3
 15 | i2, i3
 16 | i1, i3
 17 | i1, i2, i3, i5
 18 | i1, i2, i3
Using the sampling method can save much time: if the sample is carefully chosen, it can represent the population, so the table built from the sample can represent the table built from the population, and the 2-itemsets with high frequency in the sample's table are likely to be the ones with high frequency in the population's table.
Table 2: Sample Data

TID | List of item_IDs
----|------------------
  1 | i1, i2, i5
  2 | i2, i4
  3 | i2, i3
  4 | i1, i2, i4
  5 | i1, i3
  6 | i2, i3
  7 | i1, i3
  8 | i1, i2, i3, i5
  9 | i1, i2, i3
Procedure
1) Carefully draw a sample S from the population P, usually by random sampling.
2) Process the sample S to obtain a table, denoted table HS.
3) Rank the table HS by the frequency of the column content, so that the addresses with high frequency come first and those with low frequency last; this gives a new table HSR.
4) Based on HSR, process the remainder of the population P, i.e. P − S; when finished, this gives a table denoted HP.
5) Obtain the frequent itemsets according to the predetermined minimum support count.
Pseudo Code
Input: database D, min_sup
Output: frequent itemsets L
Procedure MiningFrequentItemsets
  For each transaction t in the sampled portion of D:
    insert t into sample S;
  End for;
  Build a hash table HS from the 2-itemsets of the transactions in S;
  Rank HS by count in descending order to obtain a new hash table HSR;
  Starting from HSR, process the remaining transactions P − S to obtain a hash table HP;
  For each entry in HP:
    If count(entry) ≥ min_sup:
      add the entry's itemset to L;
    End if;
  End for;
  Return L;
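To make the pseudocode concrete, the following is a runnable Python sketch of the procedure under a few assumptions of ours: the hash tables are represented as plain dictionaries keyed by sorted 2-itemset, the sample is taken as a prefix of the database (the paper recommends a carefully chosen, e.g. random, sample), and the ranking step orders the entries by their sample counts.

from itertools import combinations

def mine_frequent_2itemsets(D, min_sup, sample_size):
    # Step 1: draw a sample S from the population P.
    S, rest = D[:sample_size], D[sample_size:]
    # Step 2: hash table HS counts every 2-itemset occurring in S.
    HS = {}
    for t in S:
        for pair in combinations(sorted(t), 2):
            HS[pair] = HS.get(pair, 0) + 1
    # Step 3: rank HS by count (descending) to obtain HSR, so the
    # high-frequency addresses lie in the former positions.
    HSR = dict(sorted(HS.items(), key=lambda kv: kv[1], reverse=True))
    # Step 4: based on HSR, process the rest of the population P - S,
    # accumulating the remaining counts into HP.
    HP = dict(HSR)
    for t in rest:
        for pair in combinations(sorted(t), 2):
            HP[pair] = HP.get(pair, 0) + 1
    # Step 5: keep the 2-itemsets whose count reaches min_sup.
    return {pair: n for pair, n in HP.items() if n >= min_sup}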
Example
Taking the data in Table 1 as an example, we show how the procedure works.
1) Draw a sample S from the population P, shown in Table 2.
2) Process the sample S to obtain a table, denoted table HS, shown in Table 3.
Table 3: Table HS

Address | Count | Content
--------|-------|----------
(1, 2)  |   4   | {i1, i2}
(1, 5)  |   2   | {i1, i5}
(2, 5)  |   2   | {i2, i5}
(2, 4)  |   2   | {i2, i4}
(2, 3)  |   4   | {i2, i3}
(1, 4)  |   1   | {i1, i4}
(1, 3)  |   4   | {i1, i3}
(3, 5)  |   1   | {i3, i5}
3) Rank the table HS by the frequency of the column content, so that the addresses with high-frequency content come first and those with low-frequency content last; this gives a new table HSR, shown in Table 4.
Table 4: Table HSR

Address | Count | Content
--------|-------|----------
(1, 2)  |   4   | {i1, i2}
(2, 3)  |   4   | {i2, i3}
(1, 3)  |   4   | {i1, i3}
(1, 5)  |   2   | {i1, i5}
(2, 5)  |   2   | {i2, i5}
(2, 4)  |   2   | {i2, i4}
(1, 4)  |   1   | {i1, i4}
(3, 5)  |   1   | {i3, i5}
4) Based on HSR, process the remainder of the population P, i.e. P − S; when finished, this gives a table denoted HP, shown in Table 5.
Table 5: Table HP

Address | Count | Content
--------|-------|----------
(1, 2)  |   8   | {i1, i2}
(2, 3)  |   8   | {i2, i3}
(1, 3)  |   8   | {i1, i3}
(1, 5)  |   4   | {i1, i5}
(2, 5)  |   4   | {i2, i5}
(2, 4)  |   4   | {i2, i4}
(1, 4)  |   2   | {i1, i4}
(3, 5)  |   2   | {i3, i5}
5) Obtain the frequent itemsets according to the predetermined minimum support count. If we set the support count to 6, we find that the 2-itemsets {i1, i2}, {i2, i3} and {i1, i3} are frequent.
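As a check, running the sketch given after the pseudocode on the Table 1 transactions, with the first nine transactions as the sample and a support count of 6, reproduces exactly this result (in Table 1, transactions 10–18 repeat transactions 1–9, hence the * 2):

D = [{"i1","i2","i5"}, {"i2","i4"}, {"i2","i3"}, {"i1","i2","i4"}, {"i1","i3"},
     {"i2","i3"}, {"i1","i3"}, {"i1","i2","i3","i5"}, {"i1","i2","i3"}] * 2

print(mine_frequent_2itemsets(D, min_sup=6, sample_size=9))
# {('i1', 'i2'): 8, ('i2', 'i3'): 8, ('i1', 'i3'): 8}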
IV. Result Analysis
In our experiments we chose the Wine dataset, whose properties allow us to assess the efficiency of the procedure. The Wine dataset has 178 records and 14 columns. Table 6 describes the dataset, which comes from the UCI repository of machine learning databases [10].
Table 6: Characteristics of the dataset

Dataset       | Number of Records | Number of Columns
--------------|-------------------|-------------------
Wine.data.txt | 178               | 14
The experimental study compares the performance of the improved procedure with the ILLT algorithm. The run time is the time taken to mine the frequent itemsets. The execution times in Table 7 show that the improved procedure outperforms the ILLT algorithm; the same results are plotted in Figure 1.
As the comparison makes clear, the improved procedure performs well at low support values on the Wine dataset, which contains 178 transactions and 14 columns. At higher supports its advantage over the ILLT algorithm shrinks slightly: the difference between the execution times of the improved procedure and ILLT decreases in the later stages.
Table 7: Execution time for ILLT and the improved procedure on the Wine dataset

Support (%) | ILLT (s) | Improved Procedure (s)
------------|----------|------------------------
40          | 3.81     | 3.28
50          | 1.77     | 1.45
60          | 1.18     | 0.96
Figure 1: Total Execution Time for ILLT and Improved Procedure using Wine dataset
V. Conclusion
The frequent pattern mining problem has been studied extensively and many alternative approaches exist. The main factor considered in creating our improved procedure is time consumption, a factor that is directly affected by the approach taken to find the frequent itemsets. This work develops an improved procedure that improves on the ILLT algorithm.
On the Wine dataset, the running time of our improved procedure outperformed ILLT: it performed well against ILLT at the lower support levels and also at the higher support levels, where its margin over ILLT narrows. It thus saves much time and, as the results show, can be considered an efficient method.
References
[1] Agrawal, R.; Imielinski, T.; Swami, A. Mining Association Rules between Sets of Items in Large Databases. In Proc. ACM SIGMOD Conference, Washington, DC, USA, 1993.
[2] Agrawal, R.; Srikant, R. Fast Algorithms for Mining Association Rules. In Proc. Int'l Conf. on Very Large Data Bases (VLDB), pp. 487–499, September 1994.
[3] Borgelt, C. Efficient Implementations of Apriori and Eclat. In Proc. 1st IEEE ICDM Workshop on Frequent Item Set Mining Implementations, CEUR Workshop Proceedings 90, Aachen, Germany, 2003.
[4] Han, J.; Pei, J.; Yin, Y. Mining Frequent Patterns without Candidate Generation. In Proc. ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), 2000.
[5] Pei, J.; Han, J.; Lu, H.; Nishio, S.; Tang, S.; Yang, D. H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. In Proc. Int'l Conf. on Data Mining (ICDM), November 2001.
[6] Kumar, A.V.S.; Wahidabanu, R.S.D. A Frequent Item Graph Approach for Discovering Frequent Itemsets. In Proc. Int'l Conf. on Advanced Computer Theory and Engineering (ICACTE '08), pp. 952–956, 20–22 December 2008.
[7] Duemong, F.; Preechaveerakul, L.; Vanichayobon, S. FIAST: A Novel Algorithm for Mining Frequent Itemsets. In Proc. Int'l Conf. on Future Computer and Communication (ICFCC 2009), pp. 140–144, 3–5 April 2009.
[8] Venkateswari, S.; Suresh, R.M. An Efficient Approach for Discovery of Frequent Itemsets. In Proc. Int'l Conf. on Signal and Image Processing (ICSIP), pp. 531–533, 15–17 December 2010.
[9] Pramod, S.; Vyas, O.P. Survey on Frequent Itemset Mining Algorithms. International Journal of Computer Applications, pp. 86–91, 2010.
[10] Blake, C.L.; Merz, C.J. UCI Repository of Machine Learning Databases. Dept. of Information and Computer Science, University of California at Irvine, CA, USA, 1998.