Recent Advances in Electrical and Computer Engineering
Reducing the number of sub-trees for frequent
itemsets mining
Supatra Sahaphong and Gumpon Sritanratana
candidate itemsets have two major problems. First, the database
must be scanned multiple times to generate candidate sets, which
increases the I/O load and is time-consuming. Second, generating
huge candidate sets and calculating their support consumes a lot
of CPU time. These drawbacks were overcome by the next
generation of algorithms, beginning with the FP-growth
algorithm [3]. Mining frequent itemsets with the FP-growth
algorithm has the following advantages. The database is scanned
only twice, so less time is consumed. Candidate sets need not be
generated, so the I/O load is reduced. The FP-growth algorithm
explores the search space depth-first. It encodes the data set
in a compact data structure called an FP-tree and extracts
frequent patterns directly from this prefix tree [4]. Subsequent
research has improved on this idea. The H-mine algorithm [5]
uses array-based and trie-based data structures. The Patricia
Mine algorithm [6] stores the data sets in a compressed Patricia
trie. The FPgrowth* algorithm [7] reduces the FP-tree traversal
time by using an array technique. The SFI-Mine algorithm [8]
constructs its pattern base with a method different from that of
FP-growth and mines frequent itemsets with a new combination
method, without recursively constructing conditional FP-trees.
However, most FP-tree-based algorithms have the following
drawbacks. First, mining frequent itemsets from the FP-tree
generates a huge number of conditional FP-trees, which takes a
lot of time and space. Second, when the minimum support changes,
the algorithm may have to restart and scan the database twice
again. Many researchers have proposed ways to scan the database
only once. The Eclat algorithm [9] generates frequent patterns
by using the join step from the Apriori property. In [10], a new
data structure called the LIB-graph holds the data as the
database is scanned, and frequent patterns are discovered by
using recursive conditional FP-trees. The Sorted-List structure,
created from the Vertical Index List, holds the data from a
single database scan, and frequent itemsets are mined from it by
depth-first search [11]. Moreover, when the decision maker wants
to change the minimum support threshold, the algorithm can run
without rescanning the database [12].
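To make the scan-once, vertical style of mining concrete, the following minimal Python sketch performs a depth-first search in which supports are obtained by intersecting sets of transaction IDs. It is written in the spirit of the vertical approaches discussed here; it is neither the Eclat implementation of [9] nor the algorithm proposed in this paper. The function name `mine_vertical`, the toy database, and the threshold are assumptions made for illustration.

```python
from collections import defaultdict


def mine_vertical(transactions, minsup):
    """Return {frozenset(itemset): support} for all frequent itemsets.

    Illustrative sketch only: supports are obtained by intersecting
    transaction-ID sets, so the database is read exactly once.
    """
    # Single scan: build the vertical layout item -> set of transaction IDs.
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions, start=1):
        for item in items:
            tidsets[item].add(tid)

    frequent = {}

    def extend(prefix, prefix_tids, candidates):
        # candidates: list of (item, tidset) pairs that may extend the prefix.
        for i, (item, tids) in enumerate(candidates):
            new_tids = tids if prefix_tids is None else prefix_tids & tids
            if len(new_tids) < minsup:
                continue  # infrequent, so no superset can be frequent
            itemset = prefix | {item}
            frequent[frozenset(itemset)] = len(new_tids)
            # Depth-first: only later items may extend this itemset.
            extend(itemset, new_tids, candidates[i + 1:])

    extend(set(), None, sorted(tidsets.items()))
    return frequent


# Hypothetical toy database (not from the paper).
toy_db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = mine_vertical(toy_db, minsup=3)
```

In this toy run every single item and every pair reaches the threshold, while the triple {a, b, c} occurs in only two transactions and is pruned.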
This paper proposes a new algorithm to mine all frequent
itemsets. The features of the proposed algorithm are as
follows. The database is scanned only once to mine frequent
itemsets, and the algorithm mines frequent itemsets without
generating candidate sets. The decision maker can
Abstract—This paper aimed to develop a new algorithm to
mine all frequent itemsets from a transaction database. The new
mining algorithm, called the vertical index list (VIL) tree,
scans the database only once and does not generate any
candidate itemsets. The decision maker can change the
minimum support threshold without rescanning the
database. The VIL-tree algorithm uses a sorted VIL, so a large
number of the frequent itemsets are generated first. The
subsequent trees are smaller, which reduces tree construction
and traversal as well as the number of recursive mining steps.
Reducing node construction and sub-trees in turn reduces run
time and memory consumption. The run time and memory
consumption of the proposed algorithm are tested experimentally
against the frequent pattern (FP) growth algorithm. Both
algorithms are evaluated on benchmark synthetic datasets. The
experimental results demonstrate that VIL-tree performs better
than FP-growth in terms of run time and space consumption.
Keywords—Association rule mining, data mining, frequent
itemsets mining, knowledge discovery.
I. INTRODUCTION
Frequent itemsets mining is an essential step in association
rule mining. Association rule mining decomposes into two major
subtasks: first, generating all the frequent itemsets that
satisfy the minimum support threshold; second, extracting all
high-confidence rules from the frequent itemsets found in the
previous step. Our work focuses on the first subtask. The first
classic algorithm is Apriori, proposed in [1]. The Apriori
principle is “If an itemset is frequent, then all of its subsets
must also be frequent” [2]. The Apriori algorithm uses a
level-wise, breadth-first search approach to generate
association rules. It uses support-based pruning to control the
exponential growth of candidate itemsets. Algorithms that
generate and test
This work was supported in part by Ramkhamhaeng University,
Ramkhamhaeng Road, Bangkapi District, Bangkok 10240, Thailand.
Supatra Sahaphong is with the Department of Computer Science, Faculty
of Science, Ramkhamhaeng University, Ramkhamhaeng Road, Bangkapi
District, Bangkok 10240, Thailand (corresponding e-mail: [email protected],
[email protected]).
Gumpon Sritanratana is with the Department of Mathematics, Faculty of
Science, Buriram Rajabhat University, Muang, 31000, Thailand (e-mail:
[email protected]).
ISBN: 978-1-61804-228-6
change the minimum support threshold at any time without
rescanning the database. The proposed algorithm reduces
the number of sub-trees and loops in the mining steps. Therefore,
the proposed method can reduce both run time and space
consumption. In the experiments, the run time and memory
consumption of the VIL-tree and FP-growth algorithms are tested.
The proposed method still obtains the complete and correct set of
frequent itemsets. This paper is organized as follows. The prior
knowledge is presented in section II, followed by the approach
in section III. The results and discussion are shown in section
IV and, finally, the conclusion is given in section V.
II. PRIOR KNOWLEDGE
This section introduces basic concepts for mining
frequent itemsets. The following definitions are given by Han
et al. in [4, 12].
Let $I = \{x_1, x_2, \ldots, x_m\}$ be a set of items and let
$DB = \{T_1, T_2, \ldots, T_n\}$ be a transaction database.
Fig. 1 The instruction of VIL
III. THE APPROACH
This section introduces a new algorithm called VIL-tree.
The features of the proposed algorithm are as follows.
The transaction database is scanned only once to mine
frequent itemsets. The VIL-tree algorithm mines frequent
itemsets without generating candidate sets and reduces the
number of sub-trees and loops in the mining steps. Therefore, the
proposed method can reduce both run time and space
consumption.
The transaction database is scanned once to construct a VIL.
An itemset-tree structure, which is a general tree structure, is
then constructed from the VIL. It is called a vertical index
list-tree of itemsets and is denoted by VIL-tree (itemsets). It
is a finite set of one or more nodes. It consists of the root of
the tree, a set of item subtrees as the children of the root,
and a set of header tables. Each node in the tree comprises five
fields: two value fields, item-name and support, and three
pointer fields, same-item, parent and child. Each member of the
header table consists of two fields, item-name and head of node
link. Each node of the tree is of the form
(frequent itemset : support).
The algorithm in Fig. 2 shows how to construct the VIL-tree,
and the algorithm in Fig. 3 shows how to mine all frequent
itemsets.
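The node and header-table layout described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class names `VILTreeNode` and `HeaderEntry`, the modelling of the child pointer as a list, and the helper `add_child` are assumptions. The source names only the node fields item-name, support, same-item, parent and child, and the header fields item-name and head of node link.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class VILTreeNode:
    # Two value fields named in the text.
    item_name: Optional[str]
    support: int = 0
    # Three pointer fields named in the text.
    same_item: Optional["VILTreeNode"] = None  # next node carrying the same item
    parent: Optional["VILTreeNode"] = None
    # Assumption: the child pointer is modelled as a list of children.
    children: List["VILTreeNode"] = field(default_factory=list)


@dataclass
class HeaderEntry:
    item_name: str
    head: Optional[VILTreeNode] = None  # head of the node link for this item


def add_child(parent: VILTreeNode, item: str, support: int,
              header: Dict[str, HeaderEntry]) -> VILTreeNode:
    """Hypothetical helper: attach a node and thread it onto its node link."""
    node = VILTreeNode(item_name=item, support=support, parent=parent)
    parent.children.append(node)
    entry = header.setdefault(item, HeaderEntry(item_name=item))
    # Push the new node onto the front of the same-item chain.
    node.same_item = entry.head
    entry.head = node
    return node


# Tiny usage example with made-up items and supports.
header: Dict[str, HeaderEntry] = {}
root = VILTreeNode(item_name=None)
a = add_child(root, "a", 4, header)
b1 = add_child(a, "b", 3, header)
b2 = add_child(root, "b", 1, header)
```

Following `header["b"].head` and then the `same_item` pointers visits every node that carries item b, which is the kind of traversal a node link is meant to support.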
Here $T_1, T_2, \ldots, T_n$ are transactions that contain items
in $I$. The support, or supp (occurrence frequency), of a
pattern $A$, where $A$ is a set of items, is the number of
transactions in $DB$ that contain $A$. A pattern $A$ is frequent
if its support is no less than a predefined minimum support
threshold, minsup. Given a $DB$ and a minimum support threshold
minsup, the problem of finding the complete set of frequent
itemsets is called the frequent-itemsets mining problem.
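As a small worked illustration of these definitions, the Python sketch below counts support directly from a toy database and enumerates the frequent itemsets by brute force. The database contents, the threshold, and the function names `support` and `frequent_itemsets` are assumptions chosen for illustration. Exhaustive enumeration is exponential in the number of items and serves only to make the definitions concrete.

```python
from itertools import combinations


def support(pattern, db):
    """Number of transactions in db that contain every item of pattern."""
    pattern = set(pattern)
    return sum(1 for transaction in db if pattern <= transaction)


def frequent_itemsets(db, minsup):
    """Brute-force enumeration of all non-empty itemsets with support >= minsup."""
    items = sorted(set().union(*db))
    found = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = support(candidate, db)
            if count >= minsup:
                found[candidate] = count
    return found


# Hypothetical database DB = {T1, ..., T4} over I = {a, b, c}.
db = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
frequent = frequent_itemsets(db, minsup=2)
```

With minsup = 2, the patterns {a}, {b}, {c}, {a, b} and {a, c} are frequent, while {b, c} and {a, b, c} each occur in only one transaction.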
A data structure called a vertical index list (VIL), which was
introduced in [2, 9, 11], is summarized as follows. Let
$T_i \subseteq I$ be a transaction in $DB$, where
$i = 1, 2, \ldots, n$, and let $x_j$ be an item in $I$ for
$j = 1, 2, \ldots, m$. A vertical index list (or VIL) is the
structure constructed from a single scan of each $T_i$ in $DB$.
Each row in the VIL contains an item in $I$, the support of that
item, and the transactions in $DB$ that contain the item. The
transactions are written in ascending order of their
identification numbers. The items are written in descending
order of their support. The algorithm in Fig. 1 shows how to
construct the VIL.
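The VIL layout described above can be sketched in Python as follows. The code is only a minimal reading of the prose description, not a transcription of the algorithm in Fig. 1. The function names `build_vil` and `rows_meeting`, the tie-breaking by item name, and the toy transactions are assumptions.

```python
from collections import defaultdict


def build_vil(db):
    """Build a vertical index list from one pass over db.

    db maps transaction ID -> iterable of items. Each returned row is
    (item, support, tids) with tids ascending; rows are ordered by
    descending support (ties broken by item name, an assumption).
    """
    tid_lists = defaultdict(list)
    for tid in sorted(db):             # visit each transaction exactly once
        for item in set(db[tid]):      # ignore duplicate items in a transaction
            tid_lists[item].append(tid)
    rows = [(item, len(tids), tids) for item, tids in tid_lists.items()]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


def rows_meeting(vil, minsup):
    """Select the rows of frequent items; no database rescan is needed."""
    return [row for row in vil if row[1] >= minsup]


# Hypothetical transactions (not taken from the paper).
db = {1: ["a", "b", "c"], 2: ["b", "c"], 3: ["b", "d"], 4: ["a", "b"]}
vil = build_vil(db)
```

Because every row stores its support, changing the threshold only re-filters the list, for example `rows_meeting(vil, 2)` versus `rows_meeting(vil, 3)`. This is one plausible way to read the claim that the minimum support can change without rescanning the database.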
IV. RESULTS AND DISCUSSION
This section presents the experiments in which the run time
and memory consumption of the VIL-tree and FP-growth algorithms
are tested on two synthetic datasets with varying minimum
support thresholds. The experiments were performed on a machine
running Microsoft Windows 7 Home Premium with an Intel(R)
Core(TM) i5-2467M processor and 4 GB of RAM. All algorithms
were coded in the C language. The two synthetic datasets,
generated by the IBM Almaden Quest research group [13-14],
were used to present the experimental results. The datasets
are available from the FIMI repository, which resulted from the
workshops on frequent itemset mining implementations [15-16].
The two synthetic datasets are
T10I4D100K and T40I10D100K.
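For readers who want to rerun a comparison on these datasets, the sketch below shows one way to read them in Python. It assumes the plain-text layout used by the files in the FIMI repository, with one transaction per line and items given as whitespace-separated integers. The function name `parse_transactions` and the sample lines are made up for illustration; a real dataset file should be substituted for the sample.

```python
import io


def parse_transactions(lines):
    """Parse an iterable of text lines into a list of item-ID sets.

    Assumes one transaction per line, items as whitespace-separated
    integers; blank lines are skipped.
    """
    transactions = []
    for line in lines:
        fields = line.split()
        if fields:
            transactions.append({int(token) for token in fields})
    return transactions


# Made-up sample standing in for a dataset file such as T10I4D100K.
sample = io.StringIO("25 52 164\n39 120 124 205\n\n25 120\n")
transactions = parse_transactions(sample)
```

An open file object can be passed in place of `sample`, since the function only iterates over lines.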
In Fig. 4 and Fig. 5, when the minimum support is high, the
number of frequent itemsets is low; when the minimum support is
low, many frequent itemsets are obtained. In these experiments,
VIL-tree is faster than the FP-growth method at every tested
threshold because the frequent items in the VIL are processed in
descending order of support. Node construction and sub-trees
are reduced, resulting in a reduction in run time and
memory consumption. Fig. 6 and Fig. 7 show that the
memory consumption of VIL-tree is less than that of FP-growth at
every minimum support threshold. This is because node
construction and the size of the trees are reduced, so memory
consumption decreases.
Fig. 2 The construction of VIL-tree
Fig. 4 Run time of mining on T10I4D100K
Fig. 5 Run time of mining on T40I10D100K
Fig. 3 VIL-tree algorithm
ACKNOWLEDGMENT
We express our deep appreciation and gratitude to Associate
Professor Dr. Veera Boonjing of the Department of Mathematics and
Computer Science, Faculty of Science, King Mongkut’s
Institute of Technology Ladkrabang, Thailand, for his guidance
and suggestions. We would also like to express our appreciation
and gratitude to Associate Professor Dr. Tawesak Tanwandee
of Mahidol University, Thailand, for his help in proofreading
this paper.
REFERENCES
Fig. 6 Memory consumption of mining on T10I4D100K
[1] R. Agrawal, & R. Srikant: Fast Algorithms for Mining
Association Rules, Proceedings of the 20th International
Conference on Very Large Data Bases, Chile, September
1994, 487-499.
[2] P-N. Tan, M. Steinbach, & V. Kumar: Introduction to
Data Mining (Pearson Education Inc., 2006).
[3] J. Han, J. Pei, & Y. Yin: Mining Frequent Patterns without
Candidate Generation, Proceedings of the 2000 ACM
SIGMOD International Conference on Management of
Data, Texas, May 2000, 1-12.
[4] J. Han, J. Pei, Y. Yin, & R. Mao: Mining Frequent Patterns
without Candidate Generation: A Frequent-Pattern Tree
Approach, Data Mining and Knowledge Discovery (Springer),
vol. 8, no. 1, 2004, 53-87.
[5] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, & D. Yang:
H-Mine: Hyper-Structure Mining of Frequent Patterns in
Large Databases, Proceedings of the 2001 IEEE
International Conference on Data Mining, USA,
November 2001, 441-448.
[6] A. Pietracaprina, & D. Zandolin: Mining Frequent Itemsets
Using Patricia Tries, Proceedings of the 3rd IEEE
International Conference on Data Mining, Florida, USA,
November 2003.
[7] G. Grahne, & J. Zhu: Efficiently Using Prefix-Trees in
Mining Frequent Itemsets, Proceedings of the 3rd IEEE
International Conference on Data Mining, Florida, USA,
November 2003.
[8] S. Sahaphong, & V. Boonjing: The Combination
Approach to Frequent Itemsets Mining, Proceedings of
the 2008 International Conference on Convergence and
Hybrid Information Technology, Korea, November 2008,
565-570.
[9] M.J. Zaki: Scalable Algorithms for Association Mining,
IEEE Transactions on Knowledge and Data Engineering,
vol. 12, no. 3, 2000, 372-390.
[10] D. J. Chai, L. Jin, B. Hwang, & K. H. Ryu: Frequent
Pattern Mining Using Bipartite Graph, Proceedings of the
18th International Conference on Database and Expert
Systems Applications, Germany, August 2007, 182-186.
[11] S. Sahaphong: Frequent Itemsets Mining Using Vertical
Index List, Proceedings of the 2nd IEEE International
Conference on Computer Science and Information
Technology, China, August 2009, 480-484.
[12] J. Han, & M. Kamber: Data Mining: Concepts and
Techniques, Elsevier, Maryland Heights MO, 2006.
Fig. 7 Memory consumption of mining on T40I10D100K
V. CONCLUSION
This paper aimed to develop a new algorithm to mine all
frequent itemsets from a transaction database. The new mining
algorithm, called the vertical index list (VIL) tree, scans the
database only once and does not generate any candidate
itemsets. The decision maker can change the minimum support
threshold at any time without rescanning the database. The
run time and memory consumption of the proposed algorithm were
tested experimentally against the frequent pattern (FP) growth
algorithm. Both algorithms were evaluated on benchmark
synthetic datasets. The experimental results demonstrate that
VIL-tree performs better than FP-growth in terms
of run time and space consumption. The features of this
research are summarized as follows. The VIL-tree algorithm
scans the database only once. Moreover, it uses a sorted VIL,
so a large number of the frequent itemsets are generated first.
The next VIL-tree is smaller and its sub-trees are reduced,
which reduces the number of loops in the mining steps.
Therefore, run time and space consumption are reduced.
[13] Frequent Itemset Mining Dataset Repository,
“T10I4D100K”. Available: http://fimi.cs.helsinki.fi/data/
[14] Frequent Itemset Mining Dataset Repository,
“T40I10D100K”. Available: http://fimi.cs.helsinki.fi/data/
[15] Workshop on Frequent Itemset Mining Implementations.
Available: http://fimi.ua.ac.be/fimi03/
[16] Workshop on Frequent Itemset Mining Implementations.
Available: http://fimi.ua.ac.be/fimi04/
Supatra Sahaphong received her B.S. (Computer Science)
from Ramkhamhaeng University (RU), Thailand, in 1990, an
M.S. (Information Technology) from King Mongkut’s Institute
of Technology Ladkrabang (KMITL), Thailand, in 1998, and a
Ph.D. (Computer Science) from KMITL in 2011. From 1991
to 1999, she worked at the Institute of Computer, RU, and she
joined the Department of Computer Science, Faculty of Science,
RU in 1999, where she is currently an Assistant Professor. She
has been invited to be a program committee member and
organizing committee member of many international conferences as
well as a reviewer for many journals. She received the best
paper award from the Second International Workshop on
Frontiers of Information Technology, Applications and Tools.
Her research interests include data mining, knowledge
discovery, and image segmentation.
Gumpon Sritanratana is an Assistant Professor at the
Department of Mathematics, Faculty of Science, Buriram
Rajabhat University, Thailand. He received his Ph.D. in
Mathematics from Chiang Mai University, Thailand. He
worked at the Department of Mathematics, Mahidol University,
Thailand. He has been invited to be a program committee
member and organizing committee member of many international
conferences and a reviewer for many journals. His research
activities involve partial differential operators such as the
Laplacian, ultra-hyperbolic and domain operators. His
interest is in certain partial differential equations involving
tempered distribution solutions. Another area of interest involves
distribution theory and matrix calculus, especially some matrix
convolutions of functions and distributions.