ISSN (Print): 0974-6846; ISSN (Online): 0974-5645
Indian Journal of Science and Technology, Vol 8(24), DOI: 10.17485/ijst/2015/v8i24/80157, September 2015
Survey on Mining Association Rule with
Data Structures
Y. Jeya Sheela and S. H. Krishnaveni*
Department of Information Technology, Noorul Islam University, Kumaracoil, Kanyakumari - 629180, Tamil Nadu, India; [email protected], [email protected]
*Author for correspondence
Abstract
In the current trend, the development of various applications of data mining has gained much importance. Data mining is the process of extracting interesting, advantageous and understandable patterns from huge databases, and mining association rules from large databases is one of its important tasks. Association rule mining consists of two steps: finding the frequent item sets and generating rules from them. There are many algorithms for finding frequent item sets. Processing large databases to generate efficient association rules necessitates repeated scans, which increases the computing time. Data structures play a main role in reducing the complexity of these computational operations. In this paper we focus on the theory of the standard data structures used in mining proficient association rules, illustrated with examples.
Keywords: Data Mining, Data Structures, FP-Tree, Pre-Large Tree, Tries
1. Introduction
Data mining is a process of extracting interesting, hidden and useful patterns from large databases such as relational and transactional databases, data warehouses, XML repositories, etc. It is also known as Knowledge Discovery in Databases (KDD). In general there are three processes: pre-processing, the data mining process and post-processing. Real-world data is inconsistent and incorrect, with certain behaviors, and may contain many errors. Data pre-processing is the essential, proven technique that deals with such issues. During pre-processing, data goes through a sequence of steps: data cleaning, integration, transformation, reduction and data discretization. The main step in KDD is the data mining process, where various mining algorithms are applied to produce useful, unseen information. The third step, post-processing, assesses the mining outcome based on the users' necessities and domain knowledge. The result obtained is presented only if it is reasonable; otherwise some or all of these processes are repeated until a satisfying result is obtained. Finally the result can be presented in any of the following forms: raw data, tables, decision trees, rules, charts, data cubes or 3D graphics. There are many efficient mining algorithms. Association rule mining plays an important role in the mining process: it is used to find relationships among data items and frequent patterns in a large transactional database.
For example, in a supermarket, after a few items are bought, for instance milk, a list of related items such as butter (40%) and bread (25%) can be presented for additional purchasing. In this example the association rules are: when milk is bought, 40% of the time butter is bought together with it, and 25% of the time bread is bought along with it. Such rules help to make strong decisions in marketing management. Association rules have various applications in the areas of telecommunication networks, market and risk management, inventory control, etc. Association rule mining can be used along with data structures. A data structure is a way of organizing data in a database, and there are various efficient data structures. In Section 2, various existing data structures are studied; in Section 3, the data structures used in the generation of association rules, with their advantages and disadvantages, are studied; in Section 4, results and conclusions are discussed.
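The supermarket example above can be made concrete: support and confidence are just ratios of transaction counts. A minimal sketch in Python, using made-up transactions (not from the paper):

```python
# Compute support and confidence for the rule {milk} -> {butter}
# over a small, illustrative list of transactions.
transactions = [
    {"milk", "butter", "bread"},
    {"milk", "butter"},
    {"milk", "bread"},
    {"milk"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Of the transactions containing `lhs`, the fraction also containing `rhs`."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"milk"}, transactions))                 # 0.8
print(confidence({"milk"}, {"butter"}, transactions))  # 0.5
```

Here "butter 40%" in the text is a confidence figure: of the transactions containing milk, 40% also contain butter.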
2. Various Data Structures Used in the Data Mining Process
2.1 Trie Data Structure
A trie, also called a digital tree, radix tree or prefix tree, is an ordered tree data structure that is used to store string keys. The trie was initially introduced to store and efficiently retrieve words from a dictionary. A trie1 is a rooted, directed, tree-like structure. The root is at depth 0, and a node at depth i points to nodes at depth i + 1. A pointer, called a link, is tagged by a letter; a special letter * represents the null character. If node i points to node j, then node i is the parent of node j and node j is the child of node i. Each leaf node l denotes the word formed by concatenating the letters on the path from the root to l. If two words share their first k letters, then the first k steps on their paths are identical.
Let S be a set of words, S = {fear, fell, fit, full, fun, tell, talk, tap}.
Figure 1 shows the trie T that stores these words. Initially a root node is drawn. Every character of an input word is read one by one and added to the trie: if the character read has no matching node, a new node is constructed and added; when the last character of the word is reached, its node is marked as a leaf. The length of the longest word decides the trie depth. To search2 for a word, start from the root node and move ahead by matching its letters in sequence; the word is absent exactly when some node has no link with the corresponding tag. To insert a word, start from the root node and move ahead as if searching for it. If some node has no link tagged with the next letter L of the word, a new node and a link pointing to it (tagged L) are created. This process is iterated until the end of the word is reached.
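The insert and search procedures just described can be sketched in a few lines of Python (an illustrative implementation, using an end-of-word flag in place of the * null-character marker):

```python
# Minimal trie with insert and search: each link is tagged by a letter,
# and a flag marks nodes where a stored word ends.
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode (the tagged links)
        self.is_word = False  # True if a word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for letter in word:
            # Create a new node only when no link tagged `letter` exists.
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True

    def search(self, word):
        node = self.root
        for letter in word:
            if letter not in node.children:
                return False  # no link with the corresponding tag
            node = node.children[letter]
        return node.is_word

t = Trie()
for w in ["fear", "fell", "fit", "full", "fun", "tell", "talk", "tap"]:
    t.insert(w)
print(t.search("fell"))  # True
print(t.search("fel"))   # False (a prefix only, not a stored word)
```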
2.2 FP-tree
Mining association rules is based on finding frequent patterns and then generating rules from them. A pattern that occurs very often in a data set is a frequent pattern3. There are many algorithms to find frequent patterns or item sets, notably Apriori and FP-growth. The Apriori algorithm first generates candidate sets from the set of items and then checks whether they occur frequently; it requires many database scans and is an expensive method. The FP-growth algorithm instead first constructs a compact data structure called the FP-tree and then mines frequent patterns from it. This can be done in two passes over the database.
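The candidate-generate-and-test behaviour of Apriori noted above can be sketched as follows (an illustrative level-wise implementation, not the paper's code, run here on the transactions of Table 1):

```python
def apriori(transactions, min_count):
    """Level-wise Apriori: generate candidate k-item sets, then scan the
    whole database once per level to count them - the repeated scans
    that make the method expensive on large databases."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]  # level 1: single items
    k = 1
    while level:
        # One full database scan per level to count the candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        k += 1
        # Join step: unions of surviving sets that have exactly k items.
        level = [c for c in {a | b for a in survivors for b in survivors}
                 if len(c) == k]
    return frequent

# The transactions of Table 1 with minimum support count 3:
table1 = [{"X", "Y"}, {"Y", "Z", "W"}, {"X", "W", "U", "V"}, {"X", "U", "V"},
          {"X", "Y", "Z"}, {"X", "Y", "Z", "U"}, {"X"}, {"Y", "Z", "V"}]
freq = apriori(table1, 3)
print(freq[frozenset({"Y", "Z"})])  # 4
```

On this data the frequent item sets are the singles X, Y, Z, U, V and the pairs {X, Y}, {X, U} and {Y, Z}; W falls below the minimum support count.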
Example4: Find all frequent item sets in the database given in Table 1, taking the minimum support as 30%.
Step 1: Minimum support calculation
Number of transactions = 8
Minimum support = 30%
Minimum support count = 30/100 * 8 = 2.4, rounded up to 3
Step 2: Finding the frequency of occurrence and priority of items
Table 2 shows the frequency of occurrence and the priority of the items in the given transactional database.
Figure 1. Trie T stores the words fear, fell, fit, full, fun, tell, talk, tap.

Table 1. Given transactional dataset

| Tid | Items      |
|-----|------------|
| 1   | X, Y       |
| 2   | Y, Z, W    |
| 3   | X, W, U, V |
| 4   | X, U, V    |
| 5   | X, Y, Z    |
| 6   | X, Y, Z, U |
| 7   | X          |
| 8   | Y, Z, V    |
Table 2. Frequency of occurrence and priority of items

| Item | Frequency | Priority |
|------|-----------|----------|
| X    | 6         | 1        |
| Y    | 5         | 2        |
| Z    | 4         | 3        |
| U    | 3         | 4        |
| V    | 3         | 5        |
| W    | 2         | 6        |
For example, item X occurs in rows 1, 3, 4, 5, 6 and 7, i.e. six times in total, so the frequency of item X is 6. The frequencies of the remaining items are found in the same way, and the items are then prioritized by frequency: item X (frequency 6) has the highest priority and item W (frequency 2) the lowest. Items that do not satisfy the minimum support requirement are dropped.
Step 3: Ordering the items on the basis of priority
Table 3 shows the ordering of the items, based on the priorities shown in Table 2. The order of items is X, Y, Z, U, V, W.
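Steps 1-3 above can be reproduced in a few lines of Python on the transactions of Table 1 (an illustrative sketch; note that it drops the below-threshold item W from the ordered transactions, as Step 2 prescribes):

```python
import math
from collections import Counter

transactions = [
    {"X", "Y"}, {"Y", "Z", "W"}, {"X", "W", "U", "V"}, {"X", "U", "V"},
    {"X", "Y", "Z"}, {"X", "Y", "Z", "U"}, {"X"}, {"Y", "Z", "V"},
]

# Step 1: minimum support count = ceil(30% of 8 transactions) = 3.
min_count = math.ceil(0.30 * len(transactions))

# Step 2: frequency of each item; items below min_count (here W) are dropped.
freq = Counter(item for t in transactions for item in t)
kept = {item for item, n in freq.items() if n >= min_count}

# Step 3: order the surviving items of every transaction by descending
# frequency (ties broken alphabetically), giving the priority order X, Y, Z, U, V.
priority = sorted(kept, key=lambda i: (-freq[i], i))
ordered = [[i for i in priority if i in t] for t in transactions]

print(min_count)   # 3
print(priority)    # ['X', 'Y', 'Z', 'U', 'V']
print(ordered[2])  # ['X', 'U', 'V']  (W removed from row 3)
```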
Step 4: FP-tree construction
Figure 2 shows the FP-tree construction for the items in row 1 of Table 3. The root node is taken as NULL: the root is drawn first, and the items of row 1 (X, Y) are then inserted one after the other.
Figure 3 shows the FP-tree construction for the items in row 2 of Table 3 (Y, Z, W).
Figure 4 shows the FP-tree construction for the items in row 3 of Table 3 (X, U, V, W).
Figure 5 shows the FP-tree construction for the items in row 4 of Table 3 (X, U, V).
Figure 2. FP-Tree construction for the items in Row 1.
Figure 3. FP-Tree construction for the items in Row 2.
Figure 4. FP-Tree construction for the items in Row 3.
Table 3. Order of items

| Item | Frequency | Priority |
|------|-----------|----------|
| X    | 6         | 1        |
| Y    | 5         | 2        |
| Z    | 4         | 3        |
| U    | 3         | 4        |
| V    | 3         | 5        |
| W    | 2         | 6        |
Figure 5. FP-Tree construction for the items in Row 4.
Figure 6 shows the FP-tree construction for the items in row 5 of Table 3 (X, Y, Z).
Figure 7 shows the FP-tree construction for the items in row 6 of Table 3 (X, Y, Z, U).
Figure 8 shows the FP-tree construction for the items in row 7 of Table 3 (X).
Figure 9 shows the FP-tree construction for the items in row 8 of Table 3 (Y, Z, V).
Step 5: Validation
The constructed FP-tree is validated by checking the frequency of every item in the final tree of Figure 9 against the values in Table 2. If they match, the constructed FP-tree is right; here the values match, hence the FP-tree is valid.
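The insertion procedure of Steps 4-5 can be sketched as follows (a simplified FP-tree without the header-table side links, built from the ordered transactions with the below-threshold item W already removed):

```python
class FPNode:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}  # item -> FPNode

def build_fp_tree(ordered_transactions):
    """Insert each ordered transaction along a shared prefix path,
    incrementing counts on existing nodes and branching on new items."""
    root = FPNode(None)  # NULL root, as in Figure 2
    for transaction in ordered_transactions:
        node = root
        for item in transaction:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root

def item_frequencies(node, freq=None):
    """Validation step: sum the counts of every item over the whole tree."""
    if freq is None:
        freq = {}
    for child in node.children.values():
        freq[child.item] = freq.get(child.item, 0) + child.count
        item_frequencies(child, freq)
    return freq

rows = [["X", "Y"], ["Y", "Z"], ["X", "U", "V"], ["X", "U", "V"],
        ["X", "Y", "Z"], ["X", "Y", "Z", "U"], ["X"], ["Y", "Z", "V"]]
tree = build_fp_tree(rows)
print(item_frequencies(tree))  # matches Table 2 for the kept items
```

The validation of Step 5 is exactly the final check: the per-item counts summed over the tree equal the frequencies of Table 2.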
Figure 8. FP-Tree construction for the items in Row 7.
Figure 6. FP-Tree construction for the items in Row 5.
Figure 7. FP-Tree construction for the items in Row 6.
Figure 9. Final FP-Tree constructed after inserting the items of Row 8.

2.3 Pre-Large Tree
The concept of the pre-large tree5 can be used to generate association rules from large databases while minimizing the number of scans. The pre-large concept is defined with a lower support threshold and an upper support threshold. The upper support threshold is the same as the minimum support threshold set by the user. The support ratio of an item set must be larger than the upper support threshold for the item set to be considered large; if the support ratio of an item set is below the lower support threshold, it is considered small. Pre-large item sets store the items one by one in the growing mining process and minimize the movement of item sets from large to small and vice versa. The lower support threshold is based on the number of updated records permitted in the database; if this permitted number is exceeded, rescanning of the database is required, which increases the computing time. To insert data effectively, the pre-large concepts can be combined with the FP-tree to design the pre-large-tree structure.

Initially a pre-large tree is constructed from the actual database. The database is scanned to find the large and pre-large items, which are then sorted in descending order, and the pre-large tree is constructed based on this sorted sequence of items. Construction proceeds step by step, from the first to the last transaction; after processing all the transactions, the pre-large tree is complete. Two tables, the Header_Table and the Pre_Header_Table, are maintained, storing the frequency values of the large and pre-large items respectively.
An example of pre-large tree construction is given below. The given database (Table 4) contains 10 transactions and 9 items, denoted {m} to {u}. The lower support threshold Sl is set at 40% and the upper support threshold Su at 70%. The large items are {n}, {o}, {p} and {q}, and the pre-large items are {r}, {s} and {t}. With these items, the Header_Table and the Pre_Header_Table are constructed: the Header_Table contains the frequency values of the large items and the Pre_Header_Table those of the pre-large items. The pre-large tree is then constructed.
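The three-way split driven by the two thresholds can be sketched directly (an illustrative function with hypothetical item counts, not the counts of the paper's example):

```python
def classify(freq, n_transactions, lower, upper):
    """Split items into large (support >= upper), pre-large
    (lower <= support < upper) and small (support < lower)."""
    large, pre_large, small = [], [], []
    for item, count in freq.items():
        support = count / n_transactions
        if support >= upper:
            large.append(item)
        elif support >= lower:
            pre_large.append(item)
        else:
            small.append(item)
    return large, pre_large, small

# Hypothetical counts over 10 transactions, Sl = 40%, Su = 70%.
freq = {"a": 9, "b": 8, "c": 5, "d": 4, "e": 2}
large, pre, small = classify(freq, 10, 0.40, 0.70)
print(sorted(large))  # ['a', 'b']
print(sorted(pre))    # ['c', 'd']
print(sorted(small))  # ['e']
```

Only the small items are discarded; the pre-large band acts as a buffer so that record updates rarely turn a small item directly into a large one.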
Step 1: Finding the frequency and priority of items
Table 5 shows the frequency of each item of the given transaction item set. In the given database, 'n' occurs 8 times, so 'n' has the highest priority, while 'm' and 'u' occur only twice and have the lowest priorities.
Step 2: Ordering the items based on priority
In Table 6, the items of each transaction are arranged based on the priorities shown in Table 5, i.e. in descending order of frequency.
Step 3: Header_Table construction
Table 7 shows the Header_Table, which contains the frequency of occurrence of the large items in the given database.
Table 4. Given transaction item set

| Tid | Items            |
|-----|------------------|
| 1   | m, n, o, p, q, r |
| 2   | n, o, p, r       |
| 3   | n, o, p, q, t    |
| 4   | n, o, p, q, r    |
| 5   | n, o, p, r, s, u |
| 6   | n, o, p, q, t    |
| 7   | n, r, s          |
| 8   | o, p, q, t       |
| 9   | o, r, s, t, u    |
| 10  | m, n, o, q, s    |
Table 5. Frequency and priority of items

| Item | Frequency | Priority |
|------|-----------|----------|
| m    | 2         | 8        |
| n    | 8         | 1        |
| o    | 8         | 2        |
| p    | 6         | 3        |
| q    | 6         | 4        |
| r    | 6         | 5        |
| s    | 4         | 6        |
| t    | 4         | 7        |
| u    | 2         | 9        |
Table 6. Ordering and prioritizing of items

| Tid | Items            | Ordering         |
|-----|------------------|------------------|
| 1   | m, n, o, p, q, r | n, o, p, q, r, m |
| 2   | n, o, p, r       | n, o, p, r       |
| 3   | n, o, p, q, t    | n, o, p, q, t    |
| 4   | n, o, p, q, r    | n, o, p, q, r    |
| 5   | n, o, p, r, s, u | n, o, p, r, s, u |
| 6   | n, o, p, q, t    | n, o, p, q, t    |
| 7   | n, r, s          | n, r, s          |
| 8   | o, p, q, t       | o, p, q, t       |
| 9   | o, r, s, t, u    | o, r, s, t, u    |
| 10  | m, n, o, q, s    | n, o, q, s, m    |
Table 7. Header_Table

| Item | Frequency |
|------|-----------|
| n    | 8         |
| o    | 8         |
| p    | 6         |
| q    | 6         |
Step 4: Pre_Header_Table construction
Table 8 shows the Pre_Header_Table, which contains the frequency of occurrence of the pre-large items in the given database.
Step 5: Constructing the pre-large tree
In Figure 10, the Header_Table and the Pre_Header_Table store the frequency values of the large and pre-large items; the pre-large tree is constructed based on these values.
Table 8. Pre_Header_Table

| Item | Frequency |
|------|-----------|
| r    | 6         |
| s    | 4         |
| t    | 4         |
Figure 10. Pre-Large tree construction.

3. Role of Data Structures in Mining Association Rules

3.1 The Trie Data Structure for Finding Frequent Item Sets
Association rule mining finds the frequent item sets and then generates well-built rules from them. Algorithms such as Apriori (based on hash trees), frequent pattern growth (FP-growth) and the vertical data format approach are used to find frequent item sets and widely increase the speed of searching for items. Finding frequent item sets is one of the important applications in data mining, and in general hash trees are used for it. In this paper, a trie data structure1 replaces hash trees for this main data mining task. A trie is a pre-organized data structure that is used to store string keys.

Let I = {i1, i2, i3, ..., in} be the set of items. These items are paired to generate candidate sets, from which the frequent item sets are found. A trie searches the k-item-set pairs faster than hash trees. A hash tree depends on two parameters for good performance: the table size and the leaf_mm-size (the number of candidates a leaf stores). These parameters may not suit all datasets, whereas a trie does not depend on any such parameters, so it is very easy to work with. A trie also works much faster than hash trees at low support thresholds.

Experimental results show that the performance of the trie is very close to that of hash trees with constraints at high support thresholds, but that it outperforms hash trees at low thresholds. Tries are best suited for the proficient execution of candidate generation because candidates generated from pairs of item sets have the same parent, so the candidates can be obtained easily.

3.2 Mining Association Rules using TCOM
Association rule mining is one of the most significant parts of the data mining process. Its purpose is to find association relationships or correlations among a set of items.
In this paper, a proficient way to discover the legal association rules among occasionally occurring items is presented. A novel data structure called the Transactional Co-Occurrence Matrix (TCOM)6 is built from the actual transactional database in two passes. The counts of occurrence of item sets are then calculated, and legal association rules are mined based on the TCOM. The main advantage is that item sets can be randomly accessed and counted without examining the actual database or the TCOM again, which increases the effectiveness of the mining process. Discovering rule patterns is the eventual goal of association rule mining, and only a very small amount of memory is required for rule mining compared to the frequent pattern mining process. In most cases, discovering rules among occasionally occurring items is very advantageous. Experimental results show that this is a proficient and able method for mining massive transaction databases; it can be improved further by using closed item sets and efficient pruning techniques.
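The paper does not give TCOM's exact layout, but the underlying idea, pre-computing pairwise occurrence counts so that pairs can be counted without re-scanning the database, can be conveyed with a plain co-occurrence count table (a simplified stand-in, not the TCOM structure itself):

```python
from collections import defaultdict
from itertools import combinations

def co_occurrence_counts(transactions):
    """Count, for every unordered item pair, how many transactions
    contain both items. Pair counts can then be read off directly,
    with no further pass over the database."""
    counts = defaultdict(int)
    for t in transactions:
        for a, b in combinations(sorted(t), 2):
            counts[(a, b)] += 1
    return counts

transactions = [{"x", "y"}, {"x", "y", "z"}, {"y", "z"}]
counts = co_occurrence_counts(transactions)
print(counts[("x", "y")])  # 2
print(counts[("y", "z")])  # 2
```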
3.3 Pre-Large Tree for Mining Association Rules
The Frequent Pattern tree (FP-tree) is a capable data structure that mines association rules without generating candidate item sets. It condenses a database into a tree-like structure and stores only the large items. When the data are modified, however, it requires all transactions to be processed in batch. In this paper, an algorithm based on pre-large concepts5 is used to handle modified records of the actual database. This algorithm sets a lower and an upper support threshold to define the pre-large concepts, i.e. to prevent a small item from suddenly becoming large. The algorithm initially divides the items of the actual database into three parts (large, pre-large and small), checks their count differences as positive, zero or negative, and then processes each part separately. In general in data mining the minimum support threshold is set by the user; here the minimum support threshold is taken as the upper support threshold, while the lower threshold is based on the number of modified records permitted. If the number of modified records exceeds the permitted number, the algorithm scans the database again to get the final results; if it does not, considerable execution time is saved. The algorithm also uses pruning techniques to minimize the number of scans of the actual database, so it achieves an excellent execution time for maintaining the pre-large tree, especially when handling a small number of modified records.

Experimental results show that the pre-large-tree maintenance algorithm executes much faster than the batch FP-tree and FUFP-tree maintenance algorithms when handling modified records.
3.4 Mining Frequent Ordered Subtrees in a Tree-Structured Database
Mining frequent subtrees is a key research area in knowledge discovery from tree-structured databases. To find relationships among tree data items under a user-defined threshold, the frequently occurring substructures must first be found; this is known as frequent subtree mining (FSM)7. In this paper, a new method is used to find all the repeated ordered subtrees of a tree-structured database. The main idea is that the structural features of the input tree instances are extracted to create a transactional form, which facilitates the use of normal item set mining methods. The eventual aim is to mine frequent subtrees from input tree instances represented in a transactional database using a normal item set mining method. In this way the subtree enumeration process is avoided, and the subtrees can be regenerated in a post-processing phase. This enables more structured and more complicated tree data to be handled at much lower support thresholds. The method can find position-constrained subtrees: every node of a position-constrained subtree is annotated with the occurrence and embedding level of the corresponding nodes in the actual database tree. In addition, disconnected subtree relationships can also be specified through implicit linking nodes.

Experiments carried out on artificial and real-world datasets verify the estimated benefits of this method over competing methods in terms of effectiveness, mining abilities, and the insight given by the extracted patterns. The method can integrate any normal item set mining algorithm.
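The core idea, turning each tree into a transaction of structural features so that ordinary item set miners apply, can be illustrated very roughly (a loose sketch using parent-child edges as the features; the paper's structure-preserving schema is considerably richer):

```python
def tree_to_transaction(tree, parent=None):
    """Flatten a nested-tuple tree (label, [children]) into a set of
    parent->child edge items; each tree becomes one 'transaction'."""
    label, children = tree
    items = set()
    if parent is not None:
        items.add(f"{parent}->{label}")
    for child in children:
        items |= tree_to_transaction(child, label)
    return items

t1 = ("a", [("b", []), ("c", [("d", [])])])
t2 = ("a", [("c", [("d", [])])])
print(tree_to_transaction(t1))  # {'a->b', 'a->c', 'c->d'}
print(tree_to_transaction(t2))  # {'a->c', 'c->d'}
# Any frequent-item-set algorithm run over these transactions now
# surfaces the shared substructure {'a->c', 'c->d'}.
```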
3.5 Frequent Closed Enumeration Table (FCET) to Find Generalized Association Rules
Association rule mining is one of the important tasks in the field of data mining; it is used to find strong relationships among data items. The main aim of this paper is to use a well-organized data structure to find generalized association rules between items at different hierarchical levels of a taxonomy, under the assumption that the original frequent item sets and association rules were created beforehand. Consider a large transactional database in which every transaction contains a set of items, together with a taxonomy (tree-like structure) on the items; relationships among items at any level of the tree are to be found. All ancestors of every item in a transaction are appended to the transaction, and association rules can then be found from these enlarged transactions by any of the mining algorithms. The prime challenge in creating an efficient mining algorithm is how to use the original frequent item sets and association rules to directly create the new generalized association rules without scanning the database again and again. In this paper, a proficient data structure called the Frequent Closed Enumeration Table (FCET)8 is used to store the related information. It stores the maximal item sets (a maximal item set is a frequent item set none of whose immediate supersets is frequent) and derives the information of the subset item sets from them. Two algorithms, GMAR and GMFI, are used along with various pruning techniques to create the new generalized association rules. Experimental results show that the GMAR and GMFI algorithms are far better than the BASIC and Cumulate algorithms because they generate fewer candidate sets. GMAR removes a huge number of extraneous rules based on the minimum confidence. GMAR is always better than GMFI on a sparse database, while GMFI is better than GMAR on a dense database where the number of frequent item sets is large. The time complexity of finding the maximal item sets is O(log2 n), where n is the total number of maximal item sets. The memory requirement of the FCET is a little large, but by limiting the size of the maximal item sets, a large amount of memory is saved in the case of dense databases.
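The definition of a maximal item set quoted above can be checked mechanically (a small illustration with made-up frequent sets, not the FCET structure itself):

```python
def maximal_itemsets(frequent):
    """A frequent item set is maximal if no frequent proper superset
    of it exists in the collection."""
    return [s for s in frequent
            if not any(s < other for other in frequent)]

frequent = [frozenset({"x"}), frozenset({"y"}), frozenset({"z"}),
            frozenset({"x", "y"}), frozenset({"y", "z"})]
maxs = maximal_itemsets(frequent)
print(sorted(tuple(sorted(s)) for s in maxs))  # [('x', 'y'), ('y', 'z')]
```

Storing only {x, y} and {y, z} is enough to recover every frequent set as one of their subsets, which is why the FCET keeps the maximal sets and derives the subset information.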
4. Conclusion
In this paper we have studied the concepts of various data structures and the advantages of using them in various applications of data mining. Generating association rules from large transactional databases is an important task in the field of data mining. Various features of data structures such as the trie, FP-tree, pre-large tree, TCOM and FCET help minimize the complex computational tasks involved in generating rules. Data mining issues such as the memory required for processing voluminous data and the time complexity can be effectively addressed by selecting proper data structures.
5. References
1. Bodon F, Ronyai L. Trie: An alternative data structure for data mining algorithms. Mathematical and Computer Modelling; 2003.
2. Available from: http://www.geeksforgeeks.org/trie-insert-and-search
3. Available from: http://www.1.se.cuhk.edu.hk/~seem4630/tuto/Tutorial03.ppt
4. Available from: http://www.hareenlaks.blogspot.com/2011/06/fp-tree-example-how-to-identify.html
5. Lin C, Hong T. Maintenance of pre-large trees for data mining with updated records. Information Sciences; 2014.
6. Ding J, Yau SST. TCOM, an innovative data structure for mining association rules among infrequent items. Computers and Mathematics with Applications. 2009; 57(2):290-301.
7. Hadzic F, Hecker M, Tagarelli A. Ordered subtree mining via transactional mapping using a structure-preserving tree database schema. Information Sciences; 2015.
8. Wu C, Huang Y. Generalized association rule mining using an efficient data structure. Expert Systems with Applications; 2011.