Proceedings of the 2nd National Conference; INDIACom – 2008
Computing For Nation Development, February 08 – 09, 2008
Bharati Vidyapeeth’s Institute of Computer Applications and Management, New Delhi

Introducing the Fuzzy Association Rules

Roshie Nanda
University School of Information Technology, Guru Gobind Singh Indraprastha University, Kashmere Gate, Delhi
E-Mail: [email protected]

ABSTRACT
Data mining is the analysis of large observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the owner and the users of the data. Data mining is used today in various fields ranging from business to medicine. The data mining functionalities that can be applied in these fields include characterization and discrimination, frequent pattern mining, classification and prediction, cluster analysis and outlier analysis. Association rules are the most widely used form of frequent pattern, and association rule mining is one of the most important and interesting functionalities in data mining. The current paper introduces the concepts of mining association rules from fuzzy datasets, thus expanding the expressiveness of conventional association rules.

KEYWORDS
Data mining, association rule, association rule mining, fuzzy association rule.

INTRODUCTION
Knowledge plays a vital role in the lives of all human beings, and important pieces of knowledge hold the key to our future on this planet. Without knowledge it is not feasible to proceed further and gain success in any discipline. In today’s world, there is a continuous search for data from which important pieces of knowledge can be extracted. Such pieces of knowledge are required by both the users and the non-users of the data. Over the past few years many techniques have been developed to extract data from large databases and other forms of data repositories. These techniques have now been collected to form a discipline called data mining [1-4].
Data mining has become well accepted and has spread around the globe in a short time. Data mining is also called knowledge discovery in databases, or simply KDD. The current paper introduces the concept of fuzzy association rules by applying the considerations of association rule mining to fuzzy data. A fuzzy association rule provides a smooth boundary between the members and the non-members of a set. It is preferred over classical set theory, which draws a sharp boundary between the members and the non-members of a set.

Copyright © INDIACom – 2008

DATA MINING
Data mining is a new yet widely used field for extracting useful knowledge from huge quantities of structured and/or unstructured data. Data mining is outlined as the process of ‘extracting’ or ‘mining’ knowledge from large amounts of data [1]. The steps involved in discovering useful and understandable knowledge are data cleansing, which removes noise and inconsistencies; data integration, which combines data from various sources; data selection, which keeps only the relevant and needed data; data transformation, which converts the data into a form that can be mined easily; data mining, the most important step, which extracts patterns from the data; pattern recognition, which identifies the interesting patterns; and, as the last step, the presentation of the discovered knowledge to the user (Figure 1).

Figure 1. Steps for discovering useful knowledge: Large Database or Other Form of Data Repository → Data Cleansing → Data Integration → Data Selection → Data Transformation → Data Mining → Pattern Recognition → Knowledge Discovery.

Data Mining Functionalities
The most common and widely used data mining tasks or functionalities are as follows.
• Classification and Prediction.
Classification is the process of finding a model that describes and distinguishes data classes so that the model can be used to predict the class of data objects whose class label is unknown. Prediction is similar to classification, but it is used to predict missing or unavailable numerical data values rather than class labels [1].
• Clustering. Clustering is the process of grouping data objects that are similar in nature. Objects with a high degree of similarity are grouped together within the same cluster, and dissimilar ones are kept in separate clusters.
• Association Rule Mining. Association rule mining is the process of finding simple probabilistic statements about the co-occurrence of certain events in a dataset.
• Outlier Analysis. Outlier analysis is the process of analyzing the data objects whose characteristics are sharply dissimilar from those of the majority of the data objects in the dataset. Most data mining functionalities reject outliers as noise, but when studied carefully, outliers can provide important information.

Applications of Data Mining
Data mining is used in various fields [1-5], and some of its application domains are discussed below.
• Medicine. Data mining is widely used in the field of medicine: in the development of new medicines, the treatment of cancer, the advancement of AIDS therapies, the identification of sequence patterns and gene functions in the human genome, and DNA data analysis. Recent research in DNA analysis has helped in identifying the causes of various hidden diseases and in finding cures for them.
• Fraud Detection. Fraud detection helps in identifying unauthorized users of a computerized system and restricting them from misusing the system.
Many credit card companies now employ data mining techniques to discover abnormalities in the spending patterns of their customers.
• Retail Industry. Retail data mining collects information on the buying habits and shopping history of customers, and on the sales transactions and transportation of goods across the chain of retail outlets.
• Telecommunication Industry. The telecommunication industry has evolved over the years to support a diverse assortment of devices and services including telephones, cellular phones, pagers, fax machines, teleconferencing and Internet applications. Data mining helps in understanding the business process in the telecommunication industry, determining various patterns in the industry, identifying the available resources and making proper use of them.

ASSOCIATION RULE MINING
An association rule is a simple probabilistic statement about the co-occurrence of certain events in a database, and is particularly applicable to sparse transaction datasets [1]. The interestingness of an association rule is quantified by two measures known as support and confidence. The support of an association rule is a measure of coverage, given by the number of instances for which the association rule predicts correctly. The confidence of an association rule is a measure of accuracy, given by the ratio of the number of instances that it predicts correctly to the number of instances to which it applies.

Association rule mining [1,2,6-9] is one of the most widely used functionalities in data mining. A common example of association rule mining is market basket analysis. Market basket analysis is used to determine the buying habits of customers by looking at the associations and combinations of the items they have purchased together. Let ‘Pen’ and ‘Notebook’ be two single items. Let there be an association rule Pen → Notebook.
Here, Pen is the antecedent of the association rule and Notebook is the consequent. The rule suggests that when Pen appears in a transaction, Notebook is likely to appear along with it. Let there be a 5-itemset, that is, an itemset consisting of 5 items: {Pen, Pencil, Sharpener, Rubber, Notebook}. Table 1 provides a set of transactions {T1, T2, T3, T4, ...} on this 5-itemset.

Table 1. A set of sample transactions.
Transactions    Items
T1              Pen, Pencil, Rubber
T2              Pen, Sharpener, Notebook, Rubber
T3              Pencil, Rubber, Notebook, Sharpener
T4              Pen, Pencil, Sharpener
...

Suppose the items Pen and Notebook appear together in only 20% of the transactions, but whenever Pen appears there is a 70% chance that Notebook will also appear. The 20% share of transactions in which Pen and Notebook appear together is called the Support, and the 70% chance that Notebook occurs in a transaction given that Pen occurs in it is called the Confidence. As the Confidence is high, it can be assumed that customers who buy Pen are likely to buy Notebook as well. As the Support is respectable, the association rule has good significance.

Types of Association Rules
Researchers have formulated several types of association rules to date [1]. The most important types are briefly discussed next.
• Boolean Association Rule. Boolean association rules depict whether an item is present or absent in a transaction. For example, Pen → Notebook.
• Quantitative Association Rule. Quantitative association rules partition the attributes or items by grouping their values into different intervals. For example, Age (Tom, “25…45”) ^ Salary (Tom, “100000…500000”) → Buy (Tom, Mercedes). This quantitative association rule means that Tom, being between 25 and 45 years of age and earning a salary between Rs. 100000 and Rs. 500000, purchases a Mercedes.
• Single-Dimensional Association Rule.
In single-dimensional association rules the attributes reference only a single dimension. For example, Buys (Pen) → Buys (Notebook).
• Multi-Dimensional Association Rule. In multi-dimensional association rules the attributes reference two or more dimensions. For example, Age (Tom, “25…45”) ^ Salary (Tom, “100000…500000”) → Buy (Tom, Mercedes).

FUZZY ASSOCIATION RULE
Fuzzy association rules are preferred over classical rules since they provide a smooth boundary between the members and the non-members of a set. In a classical or traditional set there are only two membership values, 0 and 1, with 0 indicating that the attribute is not a member of the set and 1 indicating that it is a member of the set. In fuzzy data there are three kinds of membership value: 0, indicating that an attribute is not a member of the set; values between 0 and 1, indicating that an attribute is partially a member of the set (partial membership); and 1, indicating that an attribute is definitely a member of the set (full membership) [10].

Fuzzy association rules use linguistic variables. These linguistic variables define the value of a variable both qualitatively, by defining a symbol for a fuzzy set, and quantitatively, by defining the meaning of the fuzzy set.

Let I = {i1, i2, i3, … im} be the collection of all items and T = {t1, t2, t3, … tn} be the set of all transactions. Each attribute ik is associated with fuzzy sets defined as Fik = {fik1, fik2, fik3, ... fikl}. For example, the attribute ‘age’ may have the fuzzy sets {young, adult, old}. Fuzzy association rules have the form ‘If X is A then Y is B’. In this rule, X = {x1, x2, x3, ... xp} and Y = {y1, y2, y3, ... yq} are the itemsets. X and Y are disjoint, so they have no attribute in common. A is the collection of fuzzy sets associated with X, represented as A = {fx1, fx2, fx3, ... fxp}, meaning that for the attribute x1 there is a fuzzy set fx1 in A.
Similarly, B is the collection of fuzzy sets associated with Y, represented as B = {fy1, fy2, fy3, ... fyq}, meaning that for the attribute y1 there is a fuzzy set fy1 in B. Here, ‘X is A’ is called the antecedent of the rule and ‘Y is B’ is called the consequent of the rule. When ‘X is A’ is satisfied, ‘Y is B’ is also satisfied. Satisfaction, in this case, means that both the support and the confidence conditions are met.

Table 2. An example of fuzzy dataset.
ID    Age      Degree     Salary
E1    Adult    M.Tech.    30000 (High)
E2    Old      B.A.       18000 (Normal)
E3    Young    B.Tech.    28000 (High)
E4    Adult    M.C.A.     10000 (Low)

Table 2 contains a sample fuzzy dataset. The value of the attribute ik in the jth record is written using the convention tj[ik]. For example, to determine the salary of the third record we write t3[Salary] and obtain the value 28000. In Table 2, the attribute salary is described using the fuzzy set Salary = {high, normal, low}, which divides the salary range into low, normal and high. Salaries of Rs. 10000 and below are low, salaries between Rs. 10000 and Rs. 30000 are normal, and salaries of Rs. 30000 and above are high. These boundaries are not sharp: a salary of Rs. 28000 lies inside the normal interval yet is labelled high in Table 2, which is consistent with a value close to a boundary holding partial membership in both neighbouring fuzzy sets.

CONCLUSION
The current paper begins with a tutorial on data mining with special emphasis on association rules and association rule mining. The paper introduces the fuzzy association rule using the underlying concept of fuzzy sets and thus extends the benefits of association rule mining to those datasets in which conventional association rules cannot be mined.

FUTURE SCOPE
The concept of fuzzy association rules can be extended to various enhanced forms of association rules, including multilevel association rules, multi-dimensional association rules and quantitative association rules.
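
ILLUSTRATIVE EXAMPLE
As a worked illustration of the support and confidence measures described under ASSOCIATION RULE MINING, the following Python sketch evaluates the rule Pen → Notebook on the four transactions listed in Table 1. The names transactions, support and confidence are illustrative choices and not part of the paper. Support is computed here as a fraction of all transactions, matching the percentage form used in the Pen and Notebook example. The rows elided by "..." in Table 1 are not guessed, so the values differ from the hypothetical 20% and 70% figures used in the text.

```python
from typing import FrozenSet, Iterable, List

# The four transactions listed in Table 1; the elided "..." rows are omitted.
transactions: List[FrozenSet[str]] = [
    frozenset({"Pen", "Pencil", "Rubber"}),                    # T1
    frozenset({"Pen", "Sharpener", "Notebook", "Rubber"}),     # T2
    frozenset({"Pencil", "Rubber", "Notebook", "Sharpener"}),  # T3
    frozenset({"Pen", "Pencil", "Sharpener"}),                 # T4
]


def support(itemset: Iterable[str], db: List[FrozenSet[str]]) -> float:
    """Fraction of transactions in db that contain every item of itemset."""
    if not db:
        return 0.0
    items = frozenset(itemset)
    return sum(1 for t in db if items <= t) / len(db)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               db: List[FrozenSet[str]]) -> float:
    """Share of transactions containing the antecedent that also contain
    the consequent; returns 0.0 when the antecedent never occurs."""
    ante = frozenset(antecedent)
    ante_support = support(ante, db)
    if ante_support == 0.0:
        return 0.0
    return support(ante | frozenset(consequent), db) / ante_support


rule_support = support({"Pen", "Notebook"}, transactions)
rule_confidence = confidence({"Pen"}, {"Notebook"}, transactions)
print(f"support(Pen -> Notebook)    = {rule_support:.2f}")     # 0.25
print(f"confidence(Pen -> Notebook) = {rule_confidence:.2f}")  # 0.33
```

On these four rows only T2 contains both items, so the support is 1/4; Pen occurs in T1, T2 and T4, so the confidence is 1/3.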

ACKNOWLEDGEMENTS
The author would like to thank Anjana Gosain and Udayan Ghose, both faculty members of the University School of Information Technology, Guru Gobind Singh Indraprastha University, for their guidance during the course of this study.

REFERENCES
[1] J. Han, and M. Kamber – Data Mining: Concepts and Techniques; Second Edition; Morgan Kaufmann, 2006.
[2] I. H. Witten, and E. Frank – Data Mining: Practical Machine Learning Tools and Techniques; Second Edition; Morgan Kaufmann, 2005.
[3] D. J. Hand, H. Mannila, and P. Smyth – Principles of Data Mining; MIT Press, 2001.
[4] P. Chakraborty, “An assortment of methods for mining dense substructures in a large graph”, In Proceedings of National Conference on Information Technology: Present Practices and Challenges, 2007.
[5] A. Kumar, “Applications of data mining”, In Proceedings of National Conference on Information Technology: Present Practices and Challenges, 2007.
[6] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases”, In Proceedings of ACM SIGMOD International Conference on Management of Data, 1993.
[7] R. Agrawal, and R. Srikant, “Fast algorithms for mining association rules”, In Proceedings of International Conference on Very Large Databases, 1994.
[8] J. Han, and Y. Fu, “Discovery of multiple level association rules from large databases”, In Proceedings of International Conference on Very Large Databases, 1995.
[9] A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large databases”, In Proceedings of International Conference on Very Large Databases, 1995.
[10] H. Bandemer, and W. Näther – Fuzzy data analysis; Kluwer Academic Publishers, 1992.