IJCST Vol. 2, Issue 3, September 2011
ISSN : 2229-4333(Print) | ISSN : 0976-8491(Online)
Mining Efficient Association Rules Through Apriori
Algorithm Using Attributes
Mamta Dhanda, Sonali Guglani, Gaurav Gupta
Dept. of Computer Science, RIMT – IET, Mandi Gobindgarh, Punjab, India
Abstract
Many algorithms have been proposed in data mining, each with a
different objective, and a lot of research has been done on these
data mining fields and algorithms. Extracting valuable data from
large datasets is an emerging problem. Apriori is an algorithm for
extracting association rules from a dataset. It is not efficient
on large datasets because it is time consuming. Over time, a number
of changes to Apriori have been proposed to improve its performance
in terms of running time and number of database passes. This paper
illustrates the disadvantages of the Apriori algorithm and the use
of attributes that can improve its efficiency.
Keywords
Frequent itemset, Apriori, profit, quantity, support.
I. Apriori Algorithm
Apriori is an algorithm for association rule mining. It
is an important data mining [9] model studied extensively by
the database and data mining community. It assumes that all data
are categorical. It was initially used for Market Basket Analysis [14]
to find how items purchased by customers are related. The problem
of finding association rules can be stated as follows:
Given a database of sales transactions, it is desirable to discover
the important associations [15,16] among different items such that
the presence of some items in a transaction implies the presence of
other items in the same transaction. An example of an association
rule is:
Contains(T, "baby food") → Contains(T, "diapers") [Support = 4%, Confidence = 40%]
The interpretation of such a rule is as follows:
• 40% of the transactions that contain baby food also contain
diapers;
• 4% of all transactions contain both of these items.
The calculations of Support (SUPP) and Confidence (CONF) are very
simple:
SUPP(A) = (Number of transactions containing item A) / (Total number of transactions in the database)
SUPP(A → B) = SUPP(A ∪ B) = (Number of transactions containing items A and B) / (Total number of transactions in the database)
CONF(A → B) = SUPP(A ∪ B) / SUPP(A)
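As a minimal sketch of these two measures, the following Python computes support and confidence over a small transaction list. The transactions and the function names `support` and `confidence` are invented for illustration and do not come from the paper.

```python
# Toy transaction database; the items are invented for illustration.
transactions = [
    {"baby food", "diapers", "milk"},
    {"baby food", "diapers"},
    {"baby food", "bread"},
    {"milk", "bread"},
    {"diapers", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """SUPP(A ∪ B) / SUPP(A); assumes A occurs in at least one transaction."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return support(both, transactions) / support(a, transactions)


s = support({"baby food", "diapers"}, transactions)
c = confidence({"baby food"}, {"diapers"}, transactions)
print(f"support = {s:.2f}, confidence = {c:.2f}")
# prints: support = 0.40, confidence = 0.67
```

Here two of the five transactions contain both items (support 0.40), and two of the three transactions containing baby food also contain diapers (confidence 0.67).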
The above association rule [15] is called single-dimensional because
it involves a single attribute or predicate (Contains). The main
problem is to find all association rules that satisfy the minimum
support (min_sup) and minimum confidence (min_conf) [13]
thresholds, which are provided by the user and/or domain experts. A
formal definition of an association rule is:
Let J = {i1, i2, …, im} be a set of items. Let D be the set of database
transactions, where each transaction T is a set of items such that
T ⊆ J. Each transaction is associated with an identifier, called
TID. Let A be a set of items. A transaction T is said to contain A
if and only if A ⊆ T. An association rule is an implication of the
form A ⇒ B [2,3], where A ⊆ J, B ⊆ J, and A ∩ B = ∅. The rule A
⇒ B holds in the transaction set D with support s, where s is the
percentage of transactions in D that contain A ∪ B (i.e., both A and
B). This is taken to be the probability P(A ∪ B). The rule A ⇒ B
has confidence c in the transaction set D if c is the percentage of
transactions in D containing A that also contain B. This is taken
to be the conditional probability P(B|A). A rule is frequent if
its support is greater than the minimum support threshold and
strong if its confidence is greater than the minimum confidence
threshold. Discovering all association rules is considered a
two-phase process:
1. Find all frequent itemsets having minimum support. The
search space for enumerating all frequent itemsets is on the
order of 2^n for n items.
2. Generate strong rules. Any association that satisfies the
threshold [7,11] is used to generate an association
rule.
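The second phase can be sketched as follows, assuming the first phase has already produced every frequent itemset together with its support. The dictionary `frequent`, the function name `generate_rules`, and the choice to treat the confidence threshold as inclusive are assumptions made for this illustration, not details given in the paper.

```python
from itertools import combinations

# Hypothetical output of phase 1: frequent itemsets with their supports.
frequent = {
    frozenset({"baby food"}): 0.6,
    frozenset({"diapers"}): 0.6,
    frozenset({"baby food", "diapers"}): 0.4,
}


def generate_rules(frequent, min_conf):
    """Return (antecedent, consequent, support, confidence) for strong rules."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), r)):
                # Every subset of a frequent itemset is itself frequent,
                # so its support is expected to be present in `frequent`.
                conf = supp / frequent[antecedent]
                if conf >= min_conf:  # threshold treated as inclusive here
                    rules.append((antecedent, itemset - antecedent, supp, conf))
    return rules


rules = generate_rules(frequent, min_conf=0.6)
for antecedent, consequent, supp, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent),
          f"(support {supp:.2f}, confidence {conf:.2f})")
```

No further database scan is needed in this phase, because each confidence is a ratio of supports that were already counted.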
The first phase in discovering all association rules is considered
to be the most important one because it is time consuming due to
the huge search space (the power set of the set of all items), whereas
the second phase can be accomplished in a straightforward manner.
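To give a feel for the size of that search space, the short sketch below enumerates every non-empty subset of a small item set. The item names and the helper `all_itemsets` are illustrative assumptions; a practical miner would avoid this brute-force enumeration.

```python
from itertools import combinations


def all_itemsets(items):
    """Yield every non-empty subset of `items` (the brute-force search space)."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


items = {"bread", "milk", "diapers", "baby food"}
candidates = list(all_itemsets(items))
print(len(candidates))  # 2**4 - 1 = 15 non-empty itemsets
```

With only 30 distinct items the same enumeration would already produce more than a billion itemsets, which is why pruning candidates matters.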
1.1 Pseudo-code for Apriori:
• Lk: set of frequent itemsets of size k (with min support)
• Ck: set of candidate itemsets of size k (potentially frequent
itemsets)
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are
        contained in t
    Lk+1 = candidates in Ck+1 with min_support
return ∪k Lk;
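The pseudo-code above can be turned into a small runnable sketch in Python. This is one possible reading, not the authors' implementation: the function name `apriori`, the join-and-prune form of candidate generation, and the toy transactions are assumptions, and no attention is paid to performance.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_of(candidates):
        # One database pass: count each candidate, keep those meeting min_support.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    # L1: frequent single items.
    current = frequent_of({frozenset([i]) for t in transactions for i in t})
    result = dict(current)
    k = 1
    while current:
        # Join step: unite pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k))
        }
        current = frequent_of(candidates)
        result.update(current)
        k += 1
    return result


transactions = [
    {"baby food", "diapers", "milk"},
    {"baby food", "diapers"},
    {"baby food", "bread"},
    {"milk", "bread"},
    {"diapers", "milk"},
]
frequent_itemsets = apriori(transactions, min_support=0.4)
for itemset, supp in frequent_itemsets.items():
    print(sorted(itemset), f"{supp:.2f}")
```

Each iteration of the `while` loop corresponds to one pass over the database, which is the cost the paper identifies as Apriori's main weakness on large datasets.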
II. Working of Apriori Algorithm Using Data Mining Tool
The following screenshots show the Apriori [4] algorithm running in
the Tanagra data mining tool, which presents the output of the
algorithm in GUI format.
Fig. 1: Importing dataset in Tanagra
www.ijcst.com
Fig. 2: Selection of attributes (discrete)
Fig. 3: Selection of support and confidence values
Fig. 4: Tanagra data mining tool producing association rules
III. Limitations of Association Rule Mining
1. End users of association rule mining (ARM) encounter problems
because the algorithm does not return results in a reasonable time.
2. It only reports the presence or absence of an item in the
transactional database [5].
3. It is not efficient on large datasets.
4. ARM treats all items in the database equally by considering only
the presence or absence of an item within a transaction
[8]. It does not take into account the significance of an item to
the user or business.
5. ARM fails to associate user objectives and business value
with the outcome of the ARM analysis.
ARM has a number of disadvantages. These can be addressed by using
attributes such as weight, quantity, and profit. The weight attribute
gives the user an estimate of how much of an item has been purchased
by the customer; the profit attribute calculates the profit ratio and
gives the total amount of profit an item generates.
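The paper does not give formulas for these attributes, so the following is only one possible way to attach quantity and profit to itemsets. The data, the per-unit profit table `unit_profit`, and the functions `itemset_support` and `itemset_profit` are all hypothetical.

```python
# Hypothetical data: each transaction maps an item to the quantity bought.
transactions = [
    {"baby food": 2, "diapers": 1},
    {"baby food": 1, "bread": 3},
    {"diapers": 2, "milk": 1},
]
# Hypothetical profit earned per unit sold.
unit_profit = {"baby food": 1.5, "diapers": 4.0, "bread": 0.5, "milk": 0.8}


def itemset_support(itemset, transactions):
    """Plain support: fraction of transactions containing all items."""
    hits = sum(1 for t in transactions if all(item in t for item in itemset))
    return hits / len(transactions)


def itemset_profit(itemset, transactions, unit_profit):
    """Total profit from `itemset` over the transactions that contain all of it."""
    total = 0.0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * unit_profit[item] for item in itemset)
    return total


for itemset in [{"baby food"}, {"diapers"}, {"baby food", "diapers"}]:
    print(sorted(itemset),
          f"support {itemset_support(itemset, transactions):.2f}",
          f"profit {itemset_profit(itemset, transactions, unit_profit):.2f}")
```

In this toy data, baby food and diapers have the same support but very different total profit, which is the kind of distinction plain presence/absence counting cannot make.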
IV. Conclusion
This work concludes that the Apriori algorithm can be applied to a
transactional database and that, using the measures of the Apriori
algorithm, frequent itemsets can be generated from the database.
Apriori is limited by the large number of database scans it requires.
The advantage of Apriori [9,10] is its easy implementation.
The efficiency of association rule mining can be improved by using
attributes such as profit and quantity [6], which give valuable
information to the customer as well as the business. Association
rule mining is applicable in a wide range of areas.
References
[1] J. Han, J. Pei, Y. Yin, "Mining Frequent Patterns without
Candidate Generation: A Frequent-Pattern Tree Approach".
In Proc. ACM-SIGMOD Int. Conf. Management of Data
(SIGMOD'04), pages 53-87, 2004.
[2] M. H. Marghny, A. A. Mitwaly, "Fast Algorithm for Mining
Association Rules". In Proc. of the First ICGST International
Conference on Artificial Intelligence and Machine Learning
AIML05, pages 36-40, Dec. 2005.
[3] R. Agrawal, T. Imielinski, A. N. Swami, "Mining association
rules between sets of items in large databases". In Proceedings
of the 1993 ACM SIGMOD International Conference on
Management of Data, 207-216, 1993.
[4] R. Agrawal, R. Srikant, "Fast algorithms for mining
association rules". In Proc. 20th Int. Conf. Very Large Data
Bases, 487-499, 1994.
[5] P. Tang, M. Turkia, "Parallelizing frequent itemset
mining with FP-trees". Technical Report,
titus.compsci.ualr.edu/~ptang/papers/par-fi.pdf, Department of
Computer Science, University of Arkansas at Little Rock, 2005.
[6] Tien Dung Do, Siu Cheung Hui, Alvis Fong, "Mining Frequent
Itemsets with Category-Based Constraints". Lecture Notes in
Computer Science, Volume 2843, 2003, pp. 76-86.
[7] J. Han, J. Pei, "Mining frequent patterns by pattern-growth:
methodology and implications". ACM SIGKDD
Explorations Newsletter 2, 2, 14-20, 2000.
[8] M. Hegland, "Algorithms for Association Rules". Lecture
Notes in Computer Science, Volume 2600, Jan 2003, Pages
226.
[9] R. J. Hilderman, H. J. Hamilton, "Knowledge Discovery
and Interest Measures". Kluwer Academic, Boston, 2002.
[10] J. Han, M. Kamber, "Data Mining". Morgan Kaufmann
Publishers, San Francisco, CA, 2001.
[11] Rajesh Natarajan, B. Shekar, "Data mining (DM): poster
papers: A relatedness-based data-driven approach to
determination of interestingness of association rules".
[12] Laks V. S. Lakshmanan, Carson Kai-Sang Leung, Raymond
T. Ng, "The segment support map: scalable mining of
frequent itemsets". ACM SIGKDD Explorations Newsletter,
Volume 2, Issue 2, December 2000.
[13] Jiawei Han, Jian Pei, Yiwen Yin, "Mining Frequent Patterns
without Candidate Generation". SIGMOD Conference 2000:
1-12.
[13] R. Agrawal, T. Imielinski, A. Swami, "Mining association
rules between sets of items in large databases". SIGMOD'93,
207-216, Washington, D.C.
[14] R. Agrawal, T. Imielinski, A. Swami, "Mining Association
Rules between Sets of Items in Large Databases". In
Proceedings of the 1993 International Conference on
Management of Data (SIGMOD 93), pages 207-216, May
1993.
[15] Jiawei Han, "Data Mining: Concepts and
Techniques" (textbook). Morgan Kaufmann, 2000.
[16] Rakesh Agrawal, Thomasz Imielinski, Arun Swami, "Mining
association rules between sets of items in large database". In
Proc. of the ACM SIGMOD Conference on Management
of Data, pp. 207-216, May 1993.
International Journal of Computer Science and Technology 343
Mamta Dhanda received her B-Tech degree
from Punjab Technical University in 2009.
Her M-Tech degree is from RIMT-IET,
Mandi Gobindgarh. She is now working
as an Assistant Professor in the Department
of Information Technology, RIMT-MAEC,
Mandi Gobindgarh. Her research interests
include data mining and its technologies,
and association rule mining.