International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 5, Issue 1, January 2015)
An Extensive Survey on Association Rule Mining Algorithms
Mihir R Patel1, Dipak Dabhi2
1,2
Assistant Professor, Department of Computer Engineering, CGPIT, Bardoli, India
Abstract— Association rule mining (ARM) aims at extracting hidden relations and interesting associations between the items in a transactional database. The purpose of this study is to highlight the fundamentals of association rule mining, the approaches to mining association rules, and the various algorithms, together with a comparison between them.

Keywords— Association Rule Mining; Bottom-Up Approach; Top-Down Approach.

I. INTRODUCTION

Association rule mining, one of the most important and well-researched techniques of data mining, was introduced by Agrawal et al. (1993). It aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in transaction databases or other data repositories. Let I = {I1, I2, …, Im} be a set of m distinct attributes, T be a transaction that contains a set of items such that T ⊆ I, and D be a database of transaction records. An association rule is an implication of the form X⇒Y, where X, Y ⊆ I are sets of items called itemsets and X ∩ Y = ∅. Here, X is called the antecedent and Y the consequent; the rule means X implies Y [1].

To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence. Since the database is large and users care only about frequently purchased items, thresholds of support and confidence are usually predefined by users to drop the rules that are not interesting or useful. The two thresholds are called minimal support and minimal confidence, respectively.

The rule X⇒Y holds in the transaction set D with support s, where the support s of an association rule X⇒Y is defined as the percentage/fraction of records that contain X∪Y relative to the total number of records in the database (n) [2]. The support-count of an itemset X, denoted X.count, in a data set D is the number of transactions in D that contain X. Then,

support(X⇒Y) = (X∪Y).count / n

The confidence of an association rule is defined as the percentage/fraction of the transactions that contain X∪Y relative to the number of records that contain X; if this percentage exceeds the confidence threshold, an interesting association rule X⇒Y can be generated:

confidence(X⇒Y) = (X∪Y).count / X.count

Association rule mining thus finds all association rules that satisfy the predefined minimum support and confidence in a given database. The problem is usually decomposed into two sub-problems [2]:
1. Find the itemsets whose occurrences exceed a predefined threshold in the database; those itemsets are called frequent or large itemsets.
2. Generate association rules from those large itemsets under the constraint of minimal confidence.
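As a minimal illustration of the support and confidence measures, the following sketch computes both on a toy transaction database (the item names and transactions are made-up examples, not data from the paper):

```python
# Hypothetical toy transaction database; each transaction is a set of items.
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(X, Y, transactions):
    """support(X ∪ Y) / support(X) for the rule X ⇒ Y."""
    return support(X | Y, transactions) / support(X, transactions)

X, Y = {"bread"}, {"milk"}
print(support(X | Y, D))   # 2/4 = 0.5
print(confidence(X, Y, D)) # (2/4) / (3/4) ≈ 0.667
```

A rule X⇒Y would be reported as interesting only if both values exceed the user-defined minimal support and minimal confidence thresholds.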
II. ASSOCIATION RULE MINING
Association rule mining approaches can be divided into two classes:
A. Bottom Up Approach
The bottom-up approach looks for frequent itemsets in the given dataset that satisfy the predefined constraint. It obtains large frequent itemsets through the combination and pruning of small frequent itemsets. The principle of the algorithm is: first calculate the support of all itemsets in the candidate itemset Ck obtained from Lk-1; if the support of an itemset is greater than or equal to the minimum support, the candidate k-itemset is a frequent k-itemset, i.e. it belongs to Lk; then combine all frequent k-itemsets into a new candidate itemset Ck+1, level by level, until the largest frequent itemsets are found.
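The level-wise procedure just described can be sketched in a few lines. This is an illustrative Apriori-style sketch under the assumptions above (candidate join and subset-based pruning), not the exact implementation of any cited algorithm:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Bottom-up, level-wise frequent itemset mining sketch.
    Returns {itemset: support_count} for all frequent itemsets."""
    n = len(transactions)

    def freq(cands):
        # keep only candidates whose relative support reaches minsup
        out = {}
        for c in cands:
            cnt = sum(1 for t in transactions if c <= t)
            if cnt / n >= minsup:
                out[c] = cnt
        return out

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    L = freq({frozenset([i]) for i in items})
    all_frequent = dict(L)
    k = 2
    while L:
        # join step: combine frequent (k-1)-itemsets into candidate k-itemsets
        cands = {a | b for a in L for b in L if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = freq(cands)        # Lk
        all_frequent.update(L)
        k += 1
    return all_frequent

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
F = apriori(D, minsup=0.5)
```

With this toy data, all singletons and all pairs are frequent at minsup = 0.5, while {a, b, c} (count 1 of 4) is pruned at level 3.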
A major challenge in mining frequent itemsets from a large data set is that such mining often generates a huge number of itemsets satisfying the minimum support threshold, especially when minsup is set low. This is because if an itemset is frequent, each of its subsets is frequent as well. It is therefore useful to identify a small representative set of itemsets from which all other frequent itemsets can be derived; such an approach is known as the top-down approach.
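The subset explosion mentioned above is easy to quantify on a contrived example (the data here is an illustrative assumption): a dataset whose transactions all contain the same 10 items makes every one of the 2^10 − 1 non-empty subsets frequent, yet a single maximal itemset represents them all.

```python
from itertools import combinations

# Every subset of a 10-item transaction is frequent when the
# transaction repeats, but one maximal itemset summarizes them all.
items = set("abcdefghij")          # 10 items
D = [set(items), set(items)]       # both transactions contain all 10 items

n_frequent = sum(1 for k in range(1, 11)
                 for _ in combinations(items, k))
print(n_frequent)  # 2**10 - 1 = 1023 frequent itemsets at any minsup <= 1.0
# yet the single maximal frequent itemset {a, ..., j} derives them all
```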
B. Top Down Approach
The top-down approach looks for more specific frequent itemsets rather than finding more general frequent itemsets. The number of frequent itemsets produced from a transaction data set can be very large, so it is useful to identify a small representative set of itemsets from which all other frequent itemsets can be derived. Two such representations are presented in this section:
1) Maximal Frequent Itemset
2) Closed Frequent Itemset
An itemset X is a maximal frequent itemset (or max-itemset) in a set S if X is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in S [2].
An itemset X is closed in a data set S if there exists no
proper super-itemset Y such that Y has the same support
count as X in S. An itemset X is a closed frequent itemset in
set S if X is both closed and frequent in S[2].
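The two definitions can be checked mechanically. The following sketch enumerates the frequent itemsets of a toy database (an illustrative assumption, not data from the paper) and then filters out the closed and maximal ones by the definitions above:

```python
from itertools import combinations

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
minsup_count = 2
items = {i for t in D for i in t}

def count(s):
    # support count: number of transactions containing itemset s
    return sum(1 for t in D if s <= t)

frequent = {frozenset(c): count(set(c))
            for k in range(1, len(items) + 1)
            for c in combinations(sorted(items), k)
            if count(set(c)) >= minsup_count}

# closed: no proper frequent superset has the same support count
closed = {X for X, sup in frequent.items()
          if not any(X < Y and frequent[Y] == sup for Y in frequent)}
# maximal: no proper frequent superset at all
maximal = {X for X in frequent if not any(X < Y for Y in frequent)}
```

Here {b} is frequent but not closed (its superset {a, b} has the same count), {a} is closed but not maximal, and every maximal itemset is also closed, as the definitions imply.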
The precise definition of a closed itemset, however, is based on Relations (1) and (2), given the functions:

f(T) = {i ∈ I | ∀ t ∈ T, i ∈ t}     (1)

which returns the set of items included in every transaction of the set T, and

g(I) = {t ∈ T | ∀ i ∈ I, i ∈ t}     (2)

which returns the set of transactions supporting a given itemset I (its tid-list). The composite function f∘g is called the Galois operator or closure operator.

Generator: an itemset p is a generator of a closed itemset y if p is one of the itemsets (there may be more than one) that determines y through the Galois closure operator: h(p) = y.

It is intuitive that the closure operator defines a set of equivalence classes over the lattice of frequent itemsets: two itemsets belong to the same equivalence class if and only if they have the same closure, i.e. their support is the same and is given by the same set of transactions. From the above definitions, the relationship between equivalence classes and closed itemsets is clear: the maximal itemsets of the equivalence classes are the closed itemsets, so mining all these maximal elements means mining all closed itemsets.

III. CLASSIFICATION OF ASSOCIATION RULE MINING ALGORITHMS

A. Classifying Different Frequent Itemset Mining Algorithms

Frequent itemset mining algorithms can be classified into two classes, candidate generation and pattern growth, within which further division is based upon the underlying data structures and dataset organization.

1) Candidate Generation: Candidate generation algorithms identify candidate itemsets before validating them with respect to the incorporated constraints, where the generation of candidates is based upon previously identified valid itemsets.

2) Pattern Growth: In contrast to the more prolific candidate generation techniques, pattern growth algorithms eliminate the need for candidate generation through the creation of complex hyper-structures (the data storage structures used in pattern growth algorithms). The structure comprises two linked structures, a pattern frame and an item list, which together provide a concise representation of the relevant information contained within D.

3) Based on Itemset Data Structure: As itemsets are generated, different data structures can be used to keep track of them. The most common approach seems to be a hash tree; alternatively, a trie or lattice may be used.

4) Based on Dataset Organization: A set of transactions can be stored in TID-itemset format (that is, {TID : itemset}), where TID is a transaction identifier and itemset is the set of items bought in transaction TID. This is known as the horizontal data format. Alternatively, data can be presented in item-TIDset format (that is, {item : TIDset}), where item is an item name and TIDset is the set of identifiers of the transactions containing the item. This is known as the vertical data format.

Table I summarizes and provides a means to briefly compare the various algorithms for mining frequent itemsets, based on the maximum number of scans, the data structures proposed, and the contribution.
Table I
Comparison of Frequent Itemset Mining Algorithms

| Algorithm | Scans | Data Structure | Type | Dataset | Contribution |
|---|---|---|---|---|---|
| AIS [1] | M+1 | Not specified | Candidate generation | Horizontal | First algorithm for association rule mining |
| Apriori [3] | M+1 | Lk-1: hash table; Ck: hash tree | Candidate generation | Horizontal | Direct count and prune |
| Apriori-Tid [4] | 1 | Lk-1: hash table; Ck: array indexed by TID; C̄k: sequential structure | Candidate generation | Horizontal | TID |
| DIC [5] | Depends on interval size | Trie | Candidate generation | Horizontal | Partitioning & checkpoints |
| Eclat [6] | 2 | Lattice | Candidate generation | Vertical | Prefix-based equivalence |
| FP-Growth [7] | 2 | FP-tree | Pattern growth | Horizontal | Trie-based structure |
| Generating Frequent Patterns with the Frequent Pattern List [8] | 2 | Frequent Pattern List (FPL) | Pattern growth | Horizontal | Frequent Pattern List and its operations |
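As a minimal illustration of the pattern growth idea behind the FP-Growth entry in Table I, the sketch below builds a shared prefix tree from frequency-sorted transactions. This is only the tree-construction step under my own simplifying assumptions, not the full FP-Growth algorithm of [7]:

```python
from collections import defaultdict

class Node:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, minsup_count):
    """Insert each transaction, with items sorted by descending frequency,
    into a shared prefix tree (the core compression idea of FP-Growth)."""
    freq = defaultdict(int)
    for t in transactions:
        for i in t:
            freq[i] += 1
    root = Node(None)
    for t in transactions:
        # keep frequent items only, most frequent first -> shared prefixes
        path = sorted((i for i in t if freq[i] >= minsup_count),
                      key=lambda i: (-freq[i], i))
        node = root
        for i in path:
            node = node.children.setdefault(i, Node(i))
            node.count += 1
    return root

D = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
tree = build_fp_tree(D, minsup_count=2)
```

Three transactions collapse into a tree whose root branch "a" carries count 3, so the whole database is represented without generating any candidates.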
B. Classifying Different Closed Frequent Itemset Mining Algorithms

Closed frequent itemset mining algorithms can be categorized as follows:

1) Type of Search Strategy: The closed itemset mining algorithms are classified in two classes according to their search strategy.
• Breadth-first-search strategy: In the breadth-first (or level-wise) strategy, all candidate k-itemsets are generated by set unions of two frequent (k-1)-itemsets.
• Depth-first-search strategy: Given a frequent k-itemset X, candidate (k+1)-itemsets are generated by adding an item i, i ∉ X, to X. If {X ∪ i} is also found to be frequent, the process is recursively repeated on {X ∪ i}.

2) Type of Generator Selection: We can identify all the closed itemsets by calculating the closure of each generator.
• Minimum Elements Method: Some algorithms choose the minimal elements (or key patterns) of each equivalence class as closure generators. Key patterns form a lattice, and this lattice can be traversed with a simple Apriori-like algorithm.
• Closure Climbing: Once a generator is found, its closure is calculated, and new generators are then created from the discovered closed itemset. In this way the generators are not necessarily minimal elements, and the computation of their closure is faster.

3) Type of Closure Calculation: The closure of the generators can be calculated either off-line or on-line. In the off-line case, the complete set of generators is retrieved first, and their closures are then calculated. In the on-line case, closures are calculated during the browsing: each time a new generator is mined, its closure is immediately computed.

Table II summarizes and provides a means to briefly compare the various algorithms for mining closed itemsets, based on their search strategy, data format, generator selection, and closure calculation.
Table II
Comparison of Closed Frequent Itemset Mining Algorithms

| Algorithm | Type of Search Strategy | Type of Data Format | Type of Generator Selection | Type of Closure Calculation |
|---|---|---|---|---|
| A-close [9] | Breadth-first search | Horizontal | Minimum elements | Off-line |
| Charm [10] | Depth-first search | Vertical | Closure climbing | On-line |
| Closet [11] | Depth-first search | Horizontal | Closure climbing | On-line |
| Closet+ [12] | Depth-first search | Horizontal | Closure climbing | On-line |
| Fp-Close [13] | Depth-first search | Horizontal | Closure climbing | On-line |
| DCI-Closed [14] | Depth-first search | Vertical | Closure climbing | On-line |

C. Classifying Different Maximal Frequent Itemset Mining Algorithms

Maximal frequent itemset mining algorithms can be classified into two classes, candidate generation and pattern growth, within which further division is based upon the underlying data structures and dataset organization.

1) Candidate Generation: Candidate generation algorithms identify candidate itemsets before validating them with respect to the incorporated constraints, where the generation of candidates is based upon previously identified valid itemsets.
2) Pattern Growth: In contrast to the more prolific candidate generation techniques, pattern growth algorithms eliminate the need for candidate generation through the creation of complex hyper-structures (the data storage structures used in pattern growth algorithms). The structure comprises two linked structures, a pattern frame and an item list, which together provide a concise representation of the relevant information contained within D.
3) Based on Itemset Data Structure: As itemsets are generated, different data structures can be used to keep track of them. The most common approach seems to be a hash tree; alternatively, a trie or lattice may be used.
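One of the structures mentioned above, a trie keyed on sorted items, can be sketched as follows. This is an illustrative assumption of how candidate itemsets might be stored, not the exact structure of any cited algorithm:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_itemset = False  # marks the end of a stored candidate

class ItemsetTrie:
    """Store candidate itemsets as sorted item paths in a trie,
    so itemsets sharing a prefix share storage."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, itemset):
        node = self.root
        for item in sorted(itemset):
            node = node.children.setdefault(item, TrieNode())
        node.is_itemset = True

    def contains(self, itemset):
        node = self.root
        for item in sorted(itemset):
            if item not in node.children:
                return False
            node = node.children[item]
        return node.is_itemset

trie = ItemsetTrie()
trie.insert({"b", "a"})
trie.insert({"a", "c"})
print(trie.contains({"a", "b"}))  # True
print(trie.contains({"b", "c"}))  # False
```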
Table III
Comparison of Maximal Frequent Itemset Mining Algorithms

| Algorithm | Data Structure | Dataset | Contribution |
|---|---|---|---|
| Pincer-Search [15] | Maximum frequent candidate set | Horizontal | Two-way search based on MFS |
| MaxMiner [16] | Maximum frequent candidate set | Horizontal | Upward-closure principle |
| A Counting Mining Algorithm Based on Matrix [17] | Associated matrix data structure | Vertical bitmap representation of the dataset | Counting operation and elimination method to remove rows and columns |
| Mafia [18] | Maximum frequent candidate set | Vertical bitmap representation of the dataset | Effective pruning mechanisms |
| Fpmax [19] | FP-tree | Horizontal | MFI-tree to keep track of all maximal frequent itemsets |
| MMFI [20] | FP-tree, matrix, array | Horizontal | Matrix: trims the conditional FP-tree and subsets of MFI; array: reduces the frequency of FP-tree traversal |
| Mining Maximal Frequent Itemsets with Frequent Pattern List [21] | Frequent Pattern List | Horizontal | Bit counting, bit-string trimming and migration |
4) Based on Dataset Organization: A set of transactions can be stored in TID-itemset format (that is, {TID : itemset}), where TID is a transaction identifier and itemset is the set of items bought in transaction TID. This is known as the horizontal data format. Alternatively, data can be presented in item-TIDset format (that is, {item : TIDset}), where item is an item name and TIDset is the set of identifiers of the transactions containing the item. This is known as the vertical data format.
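The two dataset organizations are easy to convert between. The sketch below (with made-up transactions) derives the vertical format from the horizontal one and shows why the vertical format is convenient: the support count of an itemset is simply the size of the intersection of its items' TID sets, as exploited by Eclat [6]:

```python
from collections import defaultdict

# Horizontal format: {TID : itemset}
horizontal = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}}

# Convert to vertical format: {item : TIDset}
vertical = defaultdict(set)
for tid, itemset in horizontal.items():
    for item in itemset:
        vertical[item].add(tid)

# Support count of {a, b} = size of the TID-set intersection
support_ab = len(vertical["a"] & vertical["b"])
print(support_ab)  # 2 (transactions 1 and 3)
```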
Table III summarizes and provides a means to briefly compare the various algorithms for mining maximal itemsets, based on the data structures proposed, the dataset organization, and the contribution.
IV. CONCLUSION

In this paper, we have discussed two approaches to association rule mining and a classification of association rule mining approaches. This paper gives a brief survey on the classification and comparison of frequent, closed, and maximal itemset mining algorithms.

REFERENCES

[1] Agrawal, R., Imielinski, T., and Swami, A. N., "Mining association rules between sets of items in large databases", Proc. 1993 ACM SIGMOD Int'l Conf. on Management of Data, pp. 207-216, 1993.
[2] Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Second Edition, Morgan Kaufmann Publishers, March 2006.
[3] Agrawal, R. and Srikant, R., "Fast algorithms for mining association rules", Proc. 20th Int'l Conf. on Very Large Data Bases, pp. 487-499, 1994.
[4] Agrawal, R. and Srikant, R., "Fast algorithms for mining association rules", Proc. 20th Int'l Conf. on Very Large Databases, New York: IEEE Press, pp. 487-499, 1994.
[5] S. Brin et al., "Dynamic itemset counting and implication rules for market basket analysis", Proc. ACM SIGMOD Int'l Conf. on Management of Data, pp. 255-264, 1997.
[6] M. J. Zaki, "Scalable algorithms for association mining", IEEE Transactions on Knowledge and Data Engineering, pp. 372-390, 2000.
[7] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation", Proc. SIGMOD, 2000.
[8] F. C. Tseng and C. C. Hsu, "Generating frequent patterns with the frequent pattern list", Proc. 5th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, pp. 376-386, April 2001.
[9] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal, "Discovering frequent closed itemsets for association rules", Proc. ICDT '99, pp. 398-416, 1999.
[10] M. J. Zaki and C. Hsiao, "CHARM: An efficient algorithm for closed itemset mining", Proc. SIAM Int'l Conf. on Data Mining, pp. 457-473, 2002.
[11] J. Pei, J. Han, and R. Mao, "CLOSET: An efficient algorithm for mining frequent closed itemsets", ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 21-30, 2000.
[12] J. Wang, J. Han, and J. Pei, "CLOSET+: Searching for the best strategies for mining frequent closed itemsets", Proc. Int'l Conf. on Knowledge Discovery and Data Mining, pp. 236-245, 2003.
[13] G. Grahne and J. Zhu, "Efficiently using prefix-trees in mining frequent itemsets", IEEE ICDM Workshop on Frequent Itemset Mining Implementations, 2003.
[14] C. Lucchese, S. Orlando, and R. Perego, "DCI_CLOSED: A fast and memory efficient algorithm to mine frequent closed itemsets", IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 1, pp. 21-35, 2006.
[15] D. Lin and Z. M. Kedem, "Pincer-Search: A new algorithm for discovering the maximum frequent itemset", EDBT Conference Proceedings, pp. 105-110, 1998.
[16] R. J. Bayardo, "Efficiently mining long patterns from databases", Proc. ACM SIGMOD Int'l Conf. on Management of Data, pp. 85-93, 1998.
[17] Haiwei Jin, "A counting mining algorithm of maximum frequent itemset based on matrix", Seventh Int'l Conf. on Fuzzy Systems and Knowledge Discovery (FSKD), Vol. 3, pp. 1418-1422, 2010.
[18] D. Burdick, M. Calimlim, J. Flannick, J. Gehrke, and T. Yiu, "MAFIA: A maximal frequent itemset algorithm", IEEE Transactions on Knowledge and Data Engineering, pp. 1490-1504, Nov. 2005.
[19] G. Grahne and J. F. Zhu, "High performance mining of maximal frequent itemsets", Proc. SIAM Int'l Conf. on High Performance Data Mining, pp. 135-143, 2003.
[20] Hui-ling Peng and Yun-xing Shu, "A new FP-tree-based algorithm MMFI for mining the maximal frequent itemsets", IEEE Int'l Conf. on Computer Science and Automation Engineering (CSAE), Vol. 2, pp. 61-65, 2012.
[21] Jin Qian and Feiyue Ye, "Mining maximal frequent itemsets with frequent pattern list", Fourth Int'l Conf. on Fuzzy Systems and Knowledge Discovery, Vol. 1, pp. 628-632, 2007.