IJCSC Volume 5 • Number 1 March 2014 pp.10-15 ISSN-0973-7391
Association Rule Mining: An Overview
Neesha Sharma 1, Dr. Chander Kant Verma 2
1 M.Tech Scholar, DCSA, Kurukshetra University, Kurukshetra
2 Assistant Professor, DCSA, Kurukshetra University, Kurukshetra
[email protected], [email protected]
Abstract: Association rule mining (ARM) has long been, and continues to be, an area of interest for many researchers. It is one of the important tasks of data mining and aims at discovering relationships among the items in a database. The objective of this paper is to review the basic concepts of the ARM technique along with recent related work in this field. The paper also discusses the issues and challenges of association rule mining and presents a small comparison of the performance of various association rule mining algorithms.
Keywords: Association rule mining, Weka, XLMiner.
1. Introduction
The process of knowledge discovery in databases (KDD) includes selection of data, its preprocessing, transformation, data mining, and interpretation. Data mining, the analysis step of KDD and an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets. The main goal of the data mining process is to extract information from a data set and transform it into a structure that can be understood for future use. The tasks involved in data mining are anomaly detection, association rule mining, clustering, classification, regression, and summarization [4].
Association rule mining is one of the important tasks of data mining, and indeed one of the most well-studied. It was first introduced in [13]. It discovers relationships among attributes in databases by producing if-then statements concerning attribute values [3]. A more formal definition of association rule mining is as follows. Let I = {I1, I2, ..., In} be a set of n binary attributes called items, and let D = {T1, T2, ..., Tm} be a set of transactions called the database. Every transaction in the database D has a unique transaction ID and contains a subset of the items in I. A rule in the database is defined as an implication of the form I1 => I2, where I1, I2 ⊆ I and I1 ∩ I2 = ∅. The sets of items (itemsets for short) I1 and I2 are called the antecedent (left-hand side) and the consequent (right-hand side) of the rule, respectively [6].
Association rule mining aims at finding, from a given database, the association rules that satisfy the minimum support and confidence thresholds specified by the user. The problem is generally decomposed into two subproblems [17].
Figure 1: Generating Association Rules
As shown in Figure 1, the first subproblem is to find those itemsets whose occurrence in the database exceeds a predefined threshold; these itemsets are called frequent or large itemsets. The second subproblem is to generate association rules from those large itemsets under the constraint of minimal confidence. Suppose one of the large itemsets is Lk = {I1, I2, ..., Ik}. Association rules from this itemset can be generated in the following way: the first rule is {I1, I2, ..., Ik-1} => {Ik}, and by checking the confidence specified by the user this rule is determined to be interesting or not. Further rules are then generated by deleting the last item from the antecedent and inserting it into the consequent, after which the confidences of the new rules are checked to determine their interestingness. This process is iterated while the antecedent is non-empty. Since the second subproblem is fairly straightforward, most research focuses on the first subproblem. The first subproblem, finding all frequent itemsets, can be further divided into two parts [17]: candidate large itemset generation and frequent itemset generation. The itemsets whose support is greater than the support threshold are called large or frequent itemsets, and the itemsets that are expected to be large or frequent are called candidate itemsets.
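The rule-generation step just described can be sketched in a few lines. The following Python fragment is only an illustration of the scheme above; the support helper, the function name and the toy transactions are assumptions, not taken from the paper. Starting from one large itemset, the last antecedent item is repeatedly moved into the consequent, and each resulting rule is kept if it meets the minimum confidence.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def rules_from_large_itemset(large_itemset, transactions, min_conf):
    """Yield (antecedent, consequent, confidence) triples that meet `min_conf`."""
    antecedent = list(large_itemset[:-1])
    consequent = [large_itemset[-1]]
    sup_lk = support(large_itemset, transactions)
    while antecedent:                              # iterate while the antecedent is non-empty
        conf = sup_lk / support(antecedent, transactions)
        if conf >= min_conf:
            yield tuple(antecedent), tuple(consequent), conf
        consequent.insert(0, antecedent.pop())     # move last antecedent item to the consequent

# Hypothetical usage with a tiny transaction database:
transactions = [{"I1", "I2", "I3"}, {"I1", "I2"}, {"I2", "I3"}, {"I1", "I3"}]
for lhs, rhs, conf in rules_from_large_itemset(["I1", "I2", "I3"], transactions, 0.2):
    print(lhs, "=>", rhs, round(conf, 2))
```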
There are two important basic measures of the interestingness of association rules: support (s) and confidence (c). They reflect, respectively, the usefulness and the certainty of discovered rules [10]. The support of an association rule is defined as the fraction of transactions that contain I1 ∪ I2 out of the total number of transactions in the database. If the support of a particular item is 0.1%, it means that only 0.1 percent of the transactions contain a purchase of this item [8]. The confidence of an association rule is defined as the fraction of transactions that contain I1 ∪ I2 out of the transactions that contain I1. Confidence is a measure of the strength of an association rule: if the confidence of a rule X => Y is 70%, it means that 70% of the transactions that contain X also contain Y [8].
To illustrate these concepts, a small example from the supermarket domain is used [6]. The set of items is I = {bread, jam, butter, milk}, and a small database (Table 1) contains the transactions (1 indicates that an item is present in a transaction and 0 that it is not). An example rule for the supermarket could be {bread, jam} => {butter}, meaning that if bread and jam are bought together, customers also buy butter.
Table 1: Sample Database to Find Association Rules

Transaction  Bread  Jam  Butter  Milk
T1             1     1     0      0
T2             1     1     1      0
T3             1     0     1      1
T4             0     1     1      0
T5             1     1     0      0
In the example database, the itemset {bread, jam, butter} has a support of 1/5 = 0.2, since it occurs in 20% of all transactions (1 out of 5 transactions). The rule {bread, jam} => {butter} has a confidence of 0.2/0.6 ≈ 0.33 in the database, meaning that for about one third of the transactions containing bread and jam the rule is correct (a third of the time a customer buys bread and jam, butter is bought as well).
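As a quick check of these figures, the short Python sketch below (an illustration, not part of the original paper) encodes the transactions of Table 1 as sets and recomputes the support and confidence of the rule {bread, jam} => {butter}.

```python
# Transactions of Table 1, encoded as sets of the items present in each.
transactions = [
    {"bread", "jam"},                  # T1
    {"bread", "jam", "butter"},        # T2
    {"bread", "butter", "milk"},       # T3
    {"jam", "butter"},                 # T4
    {"bread", "jam"},                  # T5
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

sup_all = support({"bread", "jam", "butter"})      # 1/5 = 0.2
conf = sup_all / support({"bread", "jam"})         # 0.2 / 0.6 ≈ 0.33
print(f"support = {sup_all}, confidence = {conf:.2f}")
```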
Another example of association rule mining is as follows [12]. In an online book store, tips are shown after a person purchases some books; for instance, once a person buys the book Data Mining Concepts and Techniques, a list of related books such as Database System (40%) and Data Warehouse (25%) is presented as a recommendation for further purchasing. In this example, the association rules are: when the book Data Mining Concepts and Techniques is bought, 40% of the time the book Database System is bought with it, and 25% of the time the book Data Warehouse is bought with it. The rules discovered from the transaction database of the book store can be used to rearrange the placement of related books, thus further strengthening those rules. They can also help the owner of the store shape marketing strategies; for example, a promotion of the book Data Mining Concepts and Techniques can increase the sales of the other two books mentioned.
2. Basic Association Rule Mining Algorithm
Some of the well-known algorithms for association rule mining are Apriori, FP-growth, AIS, Apriori-TID, Apriori Hybrid, partitioning algorithms, the Tertius algorithm, and many more. Various authors have compared the existing algorithms, and others have proposed new algorithms to remove the drawbacks of older ones.
In general, a set of items (such as the antecedent (LHS) or the consequent (RHS) of a rule) is called an itemset. The number of items contained in an itemset is called its length, and itemsets of length k are called k-itemsets. Generally, an association rule mining algorithm contains the following steps [17]:
a) The set of candidate k-itemsets is generated by 1-extensions of the large (k-1)-itemsets generated in the previous iteration.
b) Supports of the candidate k-itemsets are computed in a pass over the database.
c) Itemsets that do not have the minimum support are discarded, and the remaining itemsets are called large k-itemsets.
This process is repeated until no more large itemsets are found in the database. The most commonly used approach for finding association rules is based on the Apriori algorithm. The efficiency of the level-wise generation of frequent itemsets is improved by using the Apriori property, which states that all nonempty subsets of a frequent itemset must also be frequent [10].
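The level-wise procedure of steps a) to c), together with the Apriori-property pruning just mentioned, can be sketched as follows. This is a compact illustrative implementation, not the pseudocode of any cited algorithm; the function name and the toy database are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search for all itemsets whose support meets `min_support`."""
    n = len(transactions)
    # Large 1-itemsets.
    items = {i for t in transactions for i in t}
    large = [{frozenset([i]) for i in items
              if sum(i in t for t in transactions) / n >= min_support}]
    k = 2
    while large[-1]:
        prev = large[-1]
        # Step a): candidate k-itemsets are 1-extensions of the large (k-1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Apriori property: every (k-1)-subset of a candidate must itself be large.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Steps b) and c): count supports in one pass and keep candidates meeting min_support.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        large.append({c for c, cnt in counts.items() if cnt / n >= min_support})
        k += 1
    return [s for level in large for s in level]

# Hypothetical usage with the Table 1 transactions and a 40% support threshold:
db = [{"bread", "jam"}, {"bread", "jam", "butter"}, {"bread", "butter", "milk"},
      {"jam", "butter"}, {"bread", "jam"}]
print(frequent_itemsets(db, min_support=0.4))
```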
3. Related Work
3.1 Recent Advancements
An efficient algorithm was proposed in 2008 [5] to mine combined association rules on imbalanced datasets. Unlike conventional association rules, combined association rules are organized as a number of rule sets; within each rule set, the individual combined association rules consist of various types of attributes. A novel frequent pattern generation algorithm was proposed to discover the complex inter-rule and intra-rule relationships, and the data imbalance problem was also tackled.
An experiment was performed in 2009 [7] using the association rule mining technique to predict healthy and sick heart status from heart disease data. Three association rule mining algorithms, Apriori, Predictive Apriori and Tertius, were compared in the experiment, and the results clearly showed that the Apriori algorithm is best suited for this type of task.
An efficient version of the Apriori algorithm was proposed in 2010 [11] for mining multi-level association rules in large databases, finding the maximal frequent itemsets at lower levels of abstraction. The new algorithm, SC-BF Multilevel, is fast and efficient and requires only a single scan of the database for mining the complete frequent itemsets. Also in 2010, the results of the Apriori and K-Means algorithms were compared across their implementations in Weka and XLMiner [1]. For this comparison, transaction data of Sales Day (a super store) was used, and the results suggested some strategy changes for the organization that could lead to better sales.
In 2012 [18], work was done on student enrollment data, namely the courses the students would like to learn. Three data mining techniques, clustering (K-Means algorithm), classification (ADTree algorithm) and association rules (Apriori algorithm), were applied to the data to find the best combination of courses. A comparison showed that this combined approach is better than using association rule mining alone for such a task.
In 2013 [16], three association rule algorithms were considered by the author: Apriori, Predictive Apriori and Tertius. The author tried to find the best association rules using Weka, then discussed the limitations of the Apriori algorithm and the ways to improve it. The association rules generated by the three algorithms were compared, and Apriori was found to be faster than the other two. In the same year, the association rule mining technique was applied to a dataset on crimes against women [2], using the Weka tool. One dataset was taken from the UCI repository and another was collected from the session court of Sirsa; the Apriori algorithm was applied to the data to extract hidden information about the crime and the culprit, and a comparison between Apriori and Predictive Apriori found Apriori to be better and faster. Also in 2013, a kidney dataset containing 7 attributes and 157 instances was used to integrate clustering and association rule mining [14]; using Weka, some essential and interesting rules were determined for each cluster. Supermarket data was likewise examined in 2013 [8]: a comparison among the Apriori, FP-Growth and Tertius algorithms, applied to the supermarket data using Weka, showed that FP-Growth performs better than the other two algorithms.
3.2 Performance Review
The advantages and disadvantages of certain association rule mining algorithms are discussed below in tabular form
(Table 2):
3.3 Issues and Challenges
Although a lot of research work has been done in the field of association rule mining and various authors have proposed different algorithms, many issues and challenges still remain to be solved before the technique can be used to full advantage. The main drawbacks of association rule mining algorithms are [9]:
a) Obtaining non-interesting rules
b) A huge number of discovered rules
c) Low algorithm performance
Table 2: Performance Review of a Few Algorithms

1. AIS
Advantages:
a) An estimation is used in the algorithm to prune those candidate itemsets that have no hope to be large.
b) Memory management is used in AIS when memory is not enough.
Disadvantages:
a) Requires multiple scanning of the database.
b) Only rules of the kind X ∪ Y => Z can be generated; rules of the kind X => Y ∪ Z cannot be generated.
c) Too many candidate itemsets that finally turn out to be small are generated.

2. Apriori
Advantages:
a) Easy implementation.
b) It uses the Apriori property for pruning, therefore fewer itemsets are left for further support checking.
Disadvantages:
a) It explains only the presence or absence of an item in the database.
b) It scans the complete database multiple times.
c) It has a complex candidate generation process that uses most of the time, space and memory.
d) It works well only for small databases with a large support factor.

3. FP-Growth
Advantages:
a) It scales better than the Apriori algorithm.
b) It requires only two scans of the database, without any candidate generation.
c) It is not influenced by the support factor.
Disadvantages:
a) The resulting FP-tree is not unique for the same logical database.
b) It cannot be used in an interactive mining system.
c) It cannot be used for incremental mining.
Extracting all the association rules from a database requires counting all the possible combinations of attributes present in that database. Support and confidence factors can be used to retain only the interesting rules, i.e., those whose values for these factors exceed a threshold specified by the user. In most methods the confidence is determined only after the relevant support for the rules is computed. However, when the number of attributes is large, the computational time increases exponentially: for a database of m records and n attributes, assuming binary encoding of attributes in a record, enumerating the subsets of attributes requires about m · 2^n computational steps. For small values of n traditional algorithms are simple and efficient, but for large values of n the computation becomes infeasible [9].
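To get a feeling for this bound, the small sketch below (illustrative only; the record and attribute counts are hypothetical) evaluates m · 2^n for a few values of n and shows how quickly exhaustive enumeration becomes infeasible.

```python
def brute_force_steps(m, n):
    """Approximate number of (record, attribute-subset) checks for exhaustive
    support counting: every non-empty subset of n attributes, once per record."""
    return m * (2 ** n - 1)

# Hypothetical database of 100,000 records with growing numbers of attributes.
for n in (10, 20, 40):
    print(f"n = {n:2d}: about {brute_force_steps(m=100_000, n=n):.3e} steps")
```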
While the general association rule model described in section 2 suffices for many applications, it is not adequate
when the frequency of each item in the basket is variable and cannot be ignored[10].
The key element that makes association rule mining practical is minsup, i.e., the minimum support specified by the user, which is used to prune the uninteresting rules. But using only a single minsup implicitly assumes that all items in the database are of the same nature, which may not always be the case. For example, in the retail business customers frequently buy low-priced items, while higher-priced items may not be bought as frequently. In such a situation, if the minsup is set too high, the generated rules will involve only the low-priced items, which contribute less to the profit of the organization; on the other hand, if the minsup is set too low, too many meaningless frequent patterns will be generated and will overload the decision makers. This dilemma is called the rare item problem [19].
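A common formulation of the multiple-minimum-supports idea proposed to address this, in the line of work cited above [19], is to give each item its own minimum item support (MIS) and to treat an itemset as frequent when its support reaches the smallest MIS among its items. The sketch below assumes that formulation; the MIS values and data are hypothetical and are not taken from [19].

```python
def is_frequent(itemset, transactions, mis):
    """True if support(itemset) >= the smallest MIS over the items it contains."""
    itemset = set(itemset)
    support = sum(1 for t in transactions if itemset <= t) / len(transactions)
    return support >= min(mis[item] for item in itemset)

# Hypothetical data: rarer (e.g. higher-priced) items are given lower MIS values.
transactions = [{"bread", "jam"}, {"bread", "jam", "butter"},
                {"bread", "butter", "milk"}, {"jam", "butter"}, {"bread", "jam"}]
mis = {"bread": 0.4, "jam": 0.4, "butter": 0.3, "milk": 0.1}

print(is_frequent({"bread", "jam"}, transactions, mis))    # 0.6 >= 0.4 -> True
print(is_frequent({"butter", "milk"}, transactions, mis))  # 0.2 >= 0.1 -> True despite low support
```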
Association rule mining has been very successful in various fields, be they commercial, social or human activities. But the technique also poses a threat to privacy, since one can easily disclose others' information by using it. So, before a database is released, sensitive information must be hidden from unauthorized access. It has been recognized that one of the current technical challenges is the development of techniques that incorporate security and privacy issues. The association rule hiding problem aims at sanitizing the database in such a way that, through association rule mining, one will not be able to discover the sensitive rules and only the non-sensitive rules will be mined [15].
4. Conclusion
Association rule mining has gained a lot of attention in the field of research. The technique aims at finding dependencies among the attributes in a database and has therefore found applicability in many fields; one of the major areas of application is market basket analysis. Many algorithms have been proposed that mine association rules from a database based on the minsup and minconf specified by the user. This paper presents a review of association rule mining. It first defines what association rule mining is about, then presents a generalized association rule mining algorithm, and surveys the research work done by various authors in this field. While reviewing the literature it was found that algorithms which do not involve a candidate generation process are faster than those which do. Some of the issues and challenges related to this field have also been presented in this paper.
References
1. A.M. Khattak, A.M. Khan, Sungyoung Lee and Young-Koo Lee, "Analyzing Association Rule Mining and Clustering on Sales Day Data with XLMiner and Weka", International Journal of Database Theory and Application, Volume 3, No. 1, March 2010.
2. Divya Bansal and Lekha Bhambhu, "Execution of Apriori Algorithm of Data Mining Directed Towards Tumultuous Crimes Concerning Women", International Journal of Advanced Research in Computer Science and Software Engineering, ISSN: 2277 128X, Volume 3, Issue 9, September 2013.
3. Enrique Garcia, Cristobal Romero, Sebastian Ventura and Toon Calders, "Drawbacks and Solutions of Applying Association Rule Mining in Learning Management Systems", Proceedings of the International Workshop on Applying Data Mining in e-Learning, 2007.
4. http://en.wikipedia.org/wiki/Data_mining
5. Huaifeng Zhang, Yanchang Zhao, Longbing Cao and Chengqi Zhang, "Combined Association Rule Mining", PAKDD 2008, LNAI 5012, pp. 1069-1074, Springer-Verlag Berlin Heidelberg, 2008.
6. Ila Chandrakar and A. Mari Kirthima, "A Survey on Association Rule Mining Algorithms", International Journal of Mathematics and Computer Research, ISSN: 2320-7167, Volume 1, Issue 10, pp. 270-272, November 2013.
7. Jesmin Nahar, Kevin S. Tickle, Shawkat Ali and Yi-Ping Phoebe Chen, "Diagnosis Heart Disease Using an Association Rule Discovery Approach", Proceedings of the IASTED International Conference on Computational Intelligence (CI 2009), August 17-19, 2009, Honolulu, Hawaii, USA.
8. Jyoti Arora, Shelza and Sanjeev Rao, "An Efficient ARM Technique for Information Retrieval in Data Mining", International Journal of Engineering Research and Technology (IJERT), ISSN: 2278-0181, Volume 2, Issue 10, October 2013.
9. Maria N. Moreno, Saddys Segrera and Vivian F. Lopez, "Association Rules: Problems, Solutions and New Applications", Actas del III Taller Nacional de Mineria de Datos y Aprendizaje (TAMIDA 2005), pp. 317-323, ISBN: 84-9732-449-8, Thomson, 2005.
10. Nitin Gupta, Nitin Mangal, Kamal Tiwari and Pabitra Mitra, "Mining Quantitative Association Rules in Protein Sequences", Data Mining, LNAI 3755, pp. 273-281, Springer-Verlag Berlin Heidelberg, 2006.
11. Pratima Gautam and K.R. Pardasani, "Algorithm for Efficient Multilevel Association Rule Mining", International Journal on Computer Science and Engineering (IJCSE), ISSN: 0975-3397, Volume 02, No. 05, pp. 1700-1704, 2010.
12. Qiankun Zhao and Sourav S. Bhowmick, "Association Rule Mining: A Survey", Technical Report, CAIS, Nanyang Technological University, Singapore, No. 2003116, 2003.
13. Rakesh Agrawal, Tomasz Imielinski and Arun Swami, "Mining Association Rules Between Sets of Items in Large Databases", Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD '93), pp. 207-216, 1993.
14. Ritu Ganda, "Knowledge Discovery from Database Using an Integration of Clustering and Association Rule Mining", International Journal of Advanced Research in Computer Science and Software Engineering, ISSN: 2277 128X, Volume 3, Issue 9, September 2013.
15. Sanjay Keer and Anju Singh, "Hiding Sensitive Association Rule Using Clusters of Sensitive Association Rule", International Journal of Computer Science and Network (IJCSN), ISSN: 2277-5420, Volume 1, Issue 3, June 2012.
16. Shweta and Kanwal Garg, "Mining Efficient Association Rules Through Apriori Algorithm Using Attributes and Comparative Analysis of Various Association Rule Algorithms", International Journal of Advanced Research in Computer Science and Software Engineering, ISSN: 2277 128X, Volume 3, Issue 6, June 2013.
17. Sotiris Kotsiantis and Dimitris Kanellopoulos, "Association Rule Mining: A Recent Overview", GESTS International Transactions on Computer Science and Engineering, Vol. 32(1), pp. 71-82, 2006.
18. Sunita B. Aher and Lobo L.M.R.J., "Combination of Clustering, Classification and Association Rule Based Approach for Course Recommender System in E-learning", International Journal of Computer Applications (0975-8887), Volume 39, No. 7, February 2012.
19. Ya-Han Hu and Yen-Liang Chen, "Mining Association Rules with Multiple Minimum Supports: A New Mining Algorithm and a Support Tuning Mechanism", Elsevier B.V., 2004.