Proceedings of the 2nd National Conference; INDIACom – 2008
Computing For Nation Development, February 08 – 09, 2008
Bharati Vidyapeeth’s Institute of Computer Applications and Management, New Delhi
Introducing the Fuzzy Association Rules
Roshie Nanda
University School of Information Technology, Guru Gobind Singh Indraprastha University,
Kashmere Gate, Delhi, E-Mail: [email protected]
ABSTRACT
Data mining is the analysis of large observational datasets to
find unsuspected relationships and to summarize the data in
novel ways that are both understandable and useful to the
owner and the users of the data. Data mining is used today in
various fields ranging from business to medicine. The various
data mining functionalities that can be applied in these fields
include characterization and discrimination, frequent pattern
mining, classification and prediction, cluster analysis and
outlier analysis. Association rules are the most widely used
form of frequent pattern and association rule mining is one of
the most important and interesting functionalities in data
mining. The current paper introduces the concepts of mining
association rules from fuzzy datasets, thus expanding the
expressiveness of conventional association rules.
KEYWORDS
Data mining, association rule, association rule mining, fuzzy
association rule.
INTRODUCTION
Knowledge plays a vital role in the lives of all human beings
and important pieces of knowledge hold the key to our future
on this planet. Without knowledge it is not feasible to proceed
further and gain success in any discipline. In today’s world,
there is a continuous search for data from which important
pieces of knowledge can be extracted. Such pieces of
knowledge are required by both the users and the non-users of
the data. Over the past few years many techniques have been
developed to extract data from large databases and other forms
of data repositories. These techniques have now been collected
to form a discipline called data mining [1-4]. Data mining has
become well accepted and has spread around the globe in a
short time. Data mining is also called knowledge discovery
in databases, or simply KDD.
The current paper introduces the concept of fuzzy association
rules by applying the ideas of association rule mining to fuzzy
data. A fuzzy association rule provides a smooth boundary
between the members and the non-members of a set. It is
preferred over a rule based on classical set theory, which draws
a sharp boundary between the members and the non-members
of a set.
DATA MINING
Data mining is a new yet widely used field for extracting
useful knowledge from huge quantities of structured and/or
unstructured data. Data mining is defined as the process of
‘extracting’ or ‘mining’ knowledge from large amounts of data
[1].
Copyright © INDIACom – 2008
The steps involved in discovering useful and understandable
knowledge are: data cleansing, which removes noise and
inconsistencies; data integration, which combines data from
various sources; data selection, which retains only the relevant
data; data transformation, which converts the data into a form
that can be mined easily; data mining, the most important step,
which extracts patterns from the data; pattern recognition,
which identifies the interesting patterns; and finally the
presentation of the discovered knowledge to the user (Figure 1).
[Figure 1, flow diagram: Large Database or Other Form of Data Repository → Data Cleansing → Data Integration → Data Selection → Data Transformation → Data Mining → Pattern Recognition → Knowledge Discovery]
Figure 1. Steps for discovering useful knowledge.
Data Mining Functionalities
The most common and widely used data mining tasks or
functionalities are as follows.
• Classification and Prediction. Classification is the
process of finding a model that describes and
distinguishes data classes so that the model can be used
to predict the class of data objects whose class label is
unknown. Prediction is similar to classification, but it is
used to predict missing or unavailable numerical data
values rather than class labels [1].
• Clustering. Clustering is the process of grouping data
objects that are similar in nature. Objects with a high
degree of similarity are grouped together within the
same cluster and dissimilar ones are kept in separate
clusters.
• Association Rule Mining. Association rule mining is
the process of finding simple probabilistic statements
about the co-occurrence of certain events in a dataset.
• Outlier Analysis. Outlier analysis is the process of
analyzing the data objects whose characteristics are
sharply dissimilar from those of the majority of the
data objects in the dataset. Most data mining
functionalities reject outliers as noise, but when
studied carefully, outliers can provide important
information.
Applications of Data Mining
Data mining is used in various fields [1-5] and some of the
domains of application of data mining are discussed below.
• Medicine. Data mining is widely used in the field
of medicine. It is used in the development of new
medicines, the treatment of cancer, the advancement
of AIDS therapies, the identification of sequence
patterns and gene functions in the human genome,
and DNA data analysis. Recent research in DNA
analysis has helped in identifying the causes of
various hidden diseases and in finding cures for
them.
• Fraud Detection. Fraud detection helps in identifying
unauthorized users of any computerized system and
restricting them from misusing the system. Many
credit card companies are now employing data mining
techniques to discover the abnormalities in the pattern
of the spending habits of their customers.
• Retail Industry. Retail data mining collects
information on the buying habits and shopping history
of customers, on sales transactions, and on the
transportation of goods across the chain of retail
outlets.
• Telecommunication Industry. The telecommunication
industry has evolved over the years to support a
diverse assortment of devices and services including
telephones, cellular phones, pagers, fax machines,
teleconferencing and Internet applications. Data
mining helps in understanding the business processes
of the telecommunication industry, determining
patterns in the industry, identifying the available
resources and making proper use of them.
ASSOCIATION RULE MINING
An association rule is a simple probabilistic statement about the
co-occurrence of certain events in a database, and is
particularly applicable to sparse transaction datasets [1]. The
interestingness of an association rule is quantified by two
measures known as support and confidence. The support of an
association rule is a measure of coverage denoted by the
number of instances for which the association rule predicts
correctly. On the other hand, the confidence of an association
rule is a measure of accuracy denoted by the ratio of the
number of instances that it predicts correctly to the number of
instances to which it applies.
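The two measures can also be written symbolically. The notation below is introduced here for illustration and is an assumption, not taken from the paper: X is the antecedent, Y is the consequent, T is the set of all transactions, and support is expressed as a fraction of T rather than as a raw count.

```latex
\[
\mathrm{support}(X \rightarrow Y)
  = \frac{\left|\{\, t \in T : X \cup Y \subseteq t \,\}\right|}{|T|},
\qquad
\mathrm{confidence}(X \rightarrow Y)
  = \frac{\left|\{\, t \in T : X \cup Y \subseteq t \,\}\right|}
         {\left|\{\, t \in T : X \subseteq t \,\}\right|}
\]
```

The numerator of both expressions counts the instances that the rule predicts correctly; the denominator of the confidence counts the instances to which the rule applies.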
Association rule mining [1,2,6-9] is one of the most
widely used functionalities in data mining. A common example
of association rule mining is the market basket analysis. Market
basket analysis is used in determining the buying habits of the
customers by looking at the various associations and
combinations of the items they have purchased together.
Let ‘Pen’ and ‘Notebook’ be two single items. Let
there be an association rule Pen → Notebook. Here, Pen is the
antecedent of the association rule and Notebook is the
consequent of the association rule. The given association rule
states that transactions in which Pen appears are likely to
contain Notebook as well.
Let there be a 5-itemset, an itemset consisting of 5
items, {Pen, Pencil, Sharpener, Rubber, Notebook}. Table 1
provides a set of transactions {T1, T2, T3, T4, ...} on this
5-itemset.
Table 1. A set of sample transactions.
Transactions   Items
T1             Pen, Pencil, Rubber
T2             Pen, Sharpener, Notebook, Rubber
T3             Pencil, Rubber, Notebook, Sharpener
T4             Pen, Pencil, Sharpener
...
Suppose items Pen and Notebook appear together in only 20%
of the transactions, but whenever Pen appears there is a 70%
chance that Notebook will also appear. The 20% chance of Pen
and Notebook appearing together in a transaction is called the
Support, and the 70% chance that Notebook occurs in a
transaction given that Pen appears in it is called the
Confidence. As the Confidence is high, it can be assumed that
customers who buy Pen are likely to buy Notebook as well. As
the Support is respectable, the association rule is significant.
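As a minimal sketch, the two measures can be computed for the rule Pen → Notebook over the four transactions that Table 1 actually lists. The function names are illustrative assumptions. Because the remaining rows of Table 1 are elided, the results differ from the 20% and 70% figures quoted above.

```python
# Transactions T1..T4 exactly as listed in Table 1 (the elided rows are not included).
transactions = {
    "T1": {"Pen", "Pencil", "Rubber"},
    "T2": {"Pen", "Sharpener", "Notebook", "Rubber"},
    "T3": {"Pencil", "Rubber", "Notebook", "Sharpener"},
    "T4": {"Pen", "Pencil", "Sharpener"},
}


def support(antecedent, consequent, transactions):
    """Fraction of all transactions that contain every item of the rule."""
    items = set(antecedent) | set(consequent)
    hits = sum(1 for t in transactions.values() if items <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    antecedent = set(antecedent)
    items = antecedent | set(consequent)
    applies = sum(1 for t in transactions.values() if antecedent <= t)
    hits = sum(1 for t in transactions.values() if items <= t)
    return hits / applies if applies else 0.0


rule_support = support({"Pen"}, {"Notebook"}, transactions)
rule_confidence = confidence({"Pen"}, {"Notebook"}, transactions)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

On these four rows, Pen appears in T1, T2 and T4 and occurs together with Notebook only in T2, so the support is 1/4 and the confidence is 1/3.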
Types of Association Rules
Researchers have formulated several types of association rules
to date [1]. The most important types are briefly discussed
next.
• Boolean Association Rule. Boolean association
rules depict whether an item is present or absent in a
transaction. For example, Pen → Notebook.
• Quantitative Association Rule. Quantitative
association rules partition the values of attributes or
items into different intervals. For example, Age (Tom,
“25…45”) ^ Salary (Tom, “100000…500000”) → Buy
(Tom, Mercedes). This quantitative association rule
means that Tom, aged between 25 and 45 and earning
a salary between Rs. 100000 and Rs. 500000,
purchases a Mercedes.
• Single-Dimensional Association Rule. In
single-dimensional association rules the attributes
reference only a single dimension. For example, Buys
(Pen) → Buys (Notebook).
• Multi-Dimensional Association Rule. In
multi-dimensional association rules the attributes
reference two or more dimensions. For example, Age
(Tom, “25…45”) ^ Salary (Tom, “100000…500000”)
→ Buy (Tom, Mercedes).
FUZZY ASSOCIATION RULE
Fuzzy association rules are preferred over classical rules
since they provide a smooth boundary between the members
and the non-members of a set. In a classical or traditional set
there are only two membership values, 0 and 1, with 0
indicating that the attribute is not a member of the set and 1
indicating that it is a member of the set. In fuzzy data the
membership value lies in the range from 0 to 1: 0 indicates
that an attribute is not a member of the set, values between 0
and 1 indicate partial membership, and 1 indicates that the
attribute is definitely a member of the set, that is, full
membership [10].
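The difference between a sharp and a smooth boundary can be sketched for a hypothetical fuzzy set ‘young’ defined on an age attribute. The cutoff of 30 and the breakpoints 25 and 35 are assumptions chosen only for illustration, as is the linear shape of the transition.

```python
def crisp_young(age, cutoff=30):
    """Classical set: membership is 1 up to the cutoff and 0 beyond it."""
    return 1.0 if age <= cutoff else 0.0


def fuzzy_young(age, full=25, none=35):
    """Fuzzy set: full membership up to `full`, no membership from `none`
    onwards, and a linear transition in between."""
    if age <= full:
        return 1.0
    if age >= none:
        return 0.0
    return (none - age) / (none - full)


for age in (24, 29, 30, 31, 36):
    print(age, crisp_young(age), fuzzy_young(age))
```

Between ages 30 and 31 the classical membership drops from 1 to 0, whereas the fuzzy membership changes only slightly, from 0.5 to about 0.4.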
Fuzzy association rules use linguistic variables. These
linguistic variables define the value of a variable both
qualitatively, by defining a symbol for a fuzzy set, and
quantitatively, by defining the meaning of the fuzzy set.
Let I be the collection of all items, defined as I = {i1, i2,
i3, … im}, and let T be the set of all transactions, T = {t1, t2,
t3, … tn}. Each attribute ik is associated with a collection of
fuzzy sets defined as Fik = {fik1, fik2, fik3, ... fikl}. For
example, the attribute ‘age’ may have the fuzzy sets {young,
adult, old}.
Fuzzy association rules take the form ‘If X is A then Y
is B’. In this rule, X = {x1, x2, x3, ... xp} and Y = {y1,
y2, y3, ... yq} are itemsets. X and Y are disjoint, so no
attribute is common to both. A is the collection of fuzzy sets
associated with X, represented as A = {fx1, fx2, fx3, ... fxp},
meaning that the attribute x1 has the fuzzy set fx1 in A.
Similarly, B is the collection of fuzzy sets associated with Y,
represented as B = {fy1, fy2, fy3, ... fyq}, meaning that the
attribute y1 has the fuzzy set fy1 in B. Here, ‘X is A’ is called
the antecedent of the rule and ‘Y is B’ is called the consequent
of the rule. When ‘X is A’ is satisfied then ‘Y is B’ is also
satisfied. Satisfaction, in this case, means that both the
support and the confidence conditions are met.
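The paper does not state how support and confidence are computed for a fuzzy rule. The sketch below uses one common formulation as an assumption: the degree to which a record satisfies ‘X is A’ is the minimum of its membership degrees, and these degrees are summed over the records in place of counts. The membership values, attribute names and function names are all hypothetical.

```python
# Hypothetical membership degrees for four records; these numbers are
# illustrative assumptions and do not come from the paper.
records = [
    {("Age", "adult"): 0.9, ("Salary", "high"): 1.0},
    {("Age", "adult"): 0.2, ("Salary", "high"): 0.0},
    {("Age", "adult"): 0.1, ("Salary", "high"): 0.8},
    {("Age", "adult"): 0.8, ("Salary", "high"): 0.0},
]


def degree(record, pairs):
    """Degree to which one record satisfies all (attribute, fuzzy set) pairs,
    taken here as the minimum of the individual membership degrees."""
    return min(record.get(pair, 0.0) for pair in pairs)


def fuzzy_support(records, antecedent, consequent):
    """Average degree to which the records satisfy antecedent and consequent."""
    pairs = list(antecedent) + list(consequent)
    return sum(degree(r, pairs) for r in records) / len(records)


def fuzzy_confidence(records, antecedent, consequent):
    """Summed joint degree divided by the summed antecedent degree."""
    pairs = list(antecedent) + list(consequent)
    joint = sum(degree(r, pairs) for r in records)
    applies = sum(degree(r, list(antecedent)) for r in records)
    return joint / applies if applies else 0.0


x_is_a = [("Age", "adult")]      # antecedent: 'Age is adult'
y_is_b = [("Salary", "high")]    # consequent: 'Salary is high'
s = fuzzy_support(records, x_is_a, y_is_b)
c = fuzzy_confidence(records, x_is_a, y_is_b)
print(f"fuzzy support = {s:.3f}, fuzzy confidence = {c:.3f}")
```

Other choices, such as multiplying the membership degrees instead of taking their minimum, are equally plausible; when every membership degree is 0 or 1 this computation reduces to the classical support and confidence.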
Table 2. An example of fuzzy dataset.
ID   Age     Degree    Salary
E1   Adult   M.Tech.   30000 (High)
E2   Old     B.A.      18000 (Normal)
E3   Young   B.Tech.   28000 (High)
E4   Adult   M.C.A.    10000 (Low)
Table 2 contains a sample fuzzy dataset. The value of the
attribute ik in the jth record is written tj[ik]. For example,
the value of salary in the third record is written t3[Salary],
which gives 28000. In Table 2, the attribute salary is described
by the fuzzy set Salary = {high, normal, low}, which divides the
salary range into low, normal and high. Salaries of Rs. 10,000
and below are low, salaries between Rs. 10,000 and Rs. 30,000
are normal, and salaries of Rs. 30,000 and above are high.
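The lookup tj[ik] and the salary labels can be sketched as follows. The membership shapes are assumptions, since the paper gives only the interval end points: ‘low’ falls linearly from 1 at Rs. 10,000 to 0 at Rs. 20,000, ‘high’ rises linearly from 0 at Rs. 20,000 to 1 at Rs. 30,000, and ‘normal’ takes the remainder. Each salary is then labeled with the fuzzy set in which it has the highest membership.

```python
# Table 2 as a list of records; t(3, "Salary") mirrors the notation t3[Salary].
table2 = [
    {"ID": "E1", "Age": "Adult", "Degree": "M.Tech.", "Salary": 30000},
    {"ID": "E2", "Age": "Old", "Degree": "B.A.", "Salary": 18000},
    {"ID": "E3", "Age": "Young", "Degree": "B.Tech.", "Salary": 28000},
    {"ID": "E4", "Age": "Adult", "Degree": "M.C.A.", "Salary": 10000},
]


def t(j, attribute):
    """Value of `attribute` in the j-th record (1-based, as in the text)."""
    return table2[j - 1][attribute]


def salary_memberships(salary, low_end=10000, mid=20000, high_start=30000):
    """Assumed piecewise-linear membership degrees for low, normal and high."""
    if salary <= low_end:
        low = 1.0
    elif salary >= mid:
        low = 0.0
    else:
        low = (mid - salary) / (mid - low_end)
    if salary >= high_start:
        high = 1.0
    elif salary <= mid:
        high = 0.0
    else:
        high = (salary - mid) / (high_start - mid)
    normal = 1.0 - low - high  # by construction the three degrees sum to 1
    return {"low": low, "normal": normal, "high": high}


def dominant_label(salary):
    """Linguistic label with the highest membership degree."""
    degrees = salary_memberships(salary)
    return max(degrees, key=degrees.get)


print(t(3, "Salary"))
for record in table2:
    print(record["ID"], record["Salary"], dominant_label(record["Salary"]))
```

Under these assumed shapes a salary of 28000 belongs to ‘high’ with degree 0.8 and to ‘normal’ with degree about 0.2, which is consistent with Table 2 listing it as High even though it lies below Rs. 30,000.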
CONCLUSION
The current paper begins with a tutorial on data mining with
special emphasis on association rules and association rule
mining. The paper introduces the fuzzy association rule using
the underlying concept of fuzzy sets and thus extends the
benefits of association rule mining to those datasets in which
conventional association rules cannot be mined.
FUTURE SCOPE
The concept of the fuzzy association rules can be extended to
various enhanced forms of association rules including
multilevel association rules, multidimensional association rules
and quantitative association rules.
ACKNOWLEDGEMENTS
The author would like to thank Anjana Gosain and Udayan
Ghose, both faculty members, University School of
Information Technology, Guru Gobind Singh Indraprastha
University for their guidance during the course of this study.
REFERENCES
[1] J. Han and M. Kamber – Data Mining: Concepts and Techniques; Second Edition; Morgan Kaufmann, 2006.
[2] I. H. Witten and E. Frank – Data Mining: Practical Machine Learning Tools and Techniques; Second Edition; Morgan Kaufmann, 2005.
[3] D. J. Hand, H. Mannila, and P. Smyth – Principles of Data Mining; MIT Press, 2001.
[4] P. Chakraborty, “An assortment of methods for mining dense substructures in a large graph”, In Proceedings of National Conference on Information Technology: Present Practices and Challenges, 2007.
[5] A. Kumar, “Applications of data mining”, In Proceedings of National Conference on Information Technology: Present Practices and Challenges, 2007.
[6] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases”, In Proceedings of ACM SIGMOD International Conference on Management of Data, 1993.
[7] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules”, In Proceedings of International Conference on Very Large Databases, 1994.
[8] J. Han and Y. Fu, “Discovery of multiple level association rules from large databases”, In Proceedings of International Conference on Very Large Databases, 1995.
[9] A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large databases”, In Proceedings of International Conference on Very Large Databases, 1995.
[10] H. Bandemer and W. Näther – Fuzzy Data Analysis; Kluwer Academic Publishers, 1992.