School of Computer Science and Software Engineering
Monash University
Bachelor of Computer Science Honours (1608), Clayton Campus

Literature Review Draft – 2003

Mining Negative Rules in Large Databases using Generalized Rule Discovery

Dhananjay R Thiruvady (18787886)
Supervisor: Professor Geoffrey I. Webb

Abstract

Rule discovery in data mining has recently received a lot of attention. Association rule discovery is an approach to mining a large number of rules from a database of transactions. The rules generated are based on strong associations between items within the transactions, where a strong association is defined by the minimum support constraint. Generalized Rule Discovery (GRD) is an alternative rule discovery method to association rule discovery, and the two share several features. GRD allows the user to choose which minimum constraints are used to generate rules, whereas in association rule discovery a minimum support must always be specified. The search method used by GRD is the OPUS algorithm for unordered search, an effective method for searching large unordered search spaces or, in the case of rule discovery, a large unordered space of rules. Mining negative rules has so far been researched with association rule discovery. Using GRD and applying the tidsets/diffsets approach, mining negative rules can be done effectively and efficiently. The rules generated can then be assessed for usefulness.

1. Association Rule Discovery

Mining association rules in large databases was first approached by Agrawal, Imielinski and Swami (Agrawal, Imielinski, Swami, 1993). A database of transactions is the training data from which rules are generated. The aim of association rule discovery is to find rules with strong associations between items from the training data. A rule is of the form A => B, where A is known as the antecedent and B as the consequent of the rule. Both A and B are itemsets. An itemset can be a single item (e.g. water) or a set of items (e.g.
water and chips). The rule implies that if itemset A occurs in a transaction then itemset B is likely to occur in the same transaction of the training data. If the confidence of water => chips is 80%, then the rule implies that 80% of customers who bought water also bought chips (Toivonen, 1996).

To mine association rules from a large database, minimum constraints are defined (Srikant, Vu, Agrawal, 1997). The number of rules that can be generated can get very large, e.g. for 1000 items there are 2^1000 possible itemsets. The rules generated are based upon a minimum support constraint. The data space can be very large, so the minimum support constraint is needed to prune the data space before rules are generated. The support of a rule is the frequency with which the rule occurs in the training data. For example, if 25 out of 100 transactions in the training data contain Pepsi, then the support of Pepsi is 0.25. The itemsets within the rules generated are considered to be frequent itemsets; Pepsi is frequent in the previous example if the minimum support is at most 0.25.

Further constraints (e.g. confidence and lift) can be applied to the rules that are generated. Such additional constraints are measures of interest (interestingness). Some measures that can be applied are:

1) Confidence (A => B) = support (A => B) / support (A)
2) Lift (A => B) = support (A => B) / (support (A) x support (B))
3) Leverage (A => B) = support (A => B) - support (A) x support (B)

Confidence is a common measure of interest for generating association rules in association rule discovery.

Discovering association rules can be thought of as a three part process:

1) Search the data space to find all the items or itemsets whose support is greater than the user specified minimum support. These itemsets are the frequent itemsets.
2) Generate interesting rules based on the frequent itemsets. A rule is said to be interesting if its confidence is above a user specified minimum confidence.
3) Remove (prune) all rules which are not interesting from the rule space.
(Srikant, Agrawal, 1997)

Further constraints can be applied to generate rules specific to a user's needs. However, this can only be done once the rules are available from the first two parts of the rule generation process.

In association rule discovery, frequent itemset generation is the part which most needs improvement in efficiency. When the training data has thousands of transactions, finding the frequent itemsets can take a lot of time. Therefore, most research related to association rule discovery has aimed to improve the frequent itemset generation process, that is, part one of the three part process.

There are several algorithms which have been developed to mine association rules quickly. Some of them are:

1) Apriori

The Apriori algorithm (Srikant, Agrawal, 1994) has proved to be an efficient algorithm for mining association rules. It has been used effectively and has become the standard algorithm for association rule discovery. Apriori follows a two step process to generate rules. The first step is to find all frequent itemsets. The second step is to generate association rules from these frequent itemsets. The first step limits the number of itemsets which are considered for the antecedent and consequent of the rules, and information about the itemset frequencies is maintained. An itemset is frequent if it satisfies the minimum support constraint. Some variants of the Apriori approach have shown that very few passes through the training data may be necessary to generate association rules (Savasere, Omiecinski, Navathe, 1995), (Toivonen, 1996).

2) Other algorithms and search methods

Several algorithms have been proposed to solve the task of generating frequent itemsets efficiently. They include (Savasere, Omiecinski, Navathe, 1995), (Zaki, Parthasarathy, Ogihara, Li, 1997), (Park, Chen, Yu, 1995).
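The interest measures and the three part process of Section 1 can be illustrated with a minimal Python sketch. The toy transactions and all function names are hypothetical, and the frequent itemset search is a simple level-wise search in the spirit of Apriori rather than a faithful reimplementation of any of the cited algorithms.

```python
from itertools import combinations

# Toy training data (hypothetical): each transaction is a set of items.
transactions = [
    {"water", "chips"},
    {"water", "chips", "pepsi"},
    {"water", "chips"},
    {"water", "chips"},
    {"water"},
    {"pepsi"},
    {"chips", "pepsi"},
    {"bread"},
]

def support(itemset, data):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in data if itemset <= t) / len(data)

def confidence(a, b, data):
    return support(set(a) | set(b), data) / support(a, data)

def lift(a, b, data):
    return support(set(a) | set(b), data) / (support(a, data) * support(b, data))

def leverage(a, b, data):
    return support(set(a) | set(b), data) - support(a, data) * support(b, data)

def frequent_itemsets(data, min_support):
    """Part 1: level-wise search. A k-itemset is only counted if all of
    its (k-1)-subsets were found to be frequent on the previous level."""
    items = sorted({i for t in data for i in t})
    level = [frozenset([i]) for i in items if support([i], data) >= min_support]
    frequent = list(level)
    while level:
        prev = set(level)
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates
                 if all(frozenset(s) in prev
                        for s in combinations(c, len(c) - 1))
                 and support(c, data) >= min_support]
        frequent.extend(level)
    return frequent

def rules(data, min_support, min_confidence):
    """Parts 2 and 3: form rules from the frequent itemsets and keep only
    those that reach the minimum confidence."""
    kept = []
    for itemset in frequent_itemsets(data, min_support):
        for k in range(1, len(itemset)):
            for a in combinations(sorted(itemset), k):
                b = itemset - set(a)
                if confidence(a, b, data) >= min_confidence:
                    kept.append((frozenset(a), b))
    return kept
```

On this toy data, `confidence({"water"}, {"chips"}, transactions)` evaluates to 0.8, which corresponds to the "80% of customers who bought water also bought chips" reading of a rule. Calling `rules(transactions, 0.5, 0.8)` keeps only rules built from itemsets that passed the minimum support test first, which is the restriction that Section 2 contrasts with GRD.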
Webb (Webb, 2000) argues that for some applications a direct search for rules may be more effective than the two part process of the Apriori algorithm. The algorithm presented maintains the data in memory, from which the association rules can be generated through constraints.

2. Generalized Rule Discovery

Generalized rule discovery (GRD) was developed by Webb and Zhang (Webb, Zhang, 2002). GRD's aims are very similar to those of association rule discovery. As with association rule discovery, GRD searches through the training data and generates rules. These rules are generated based on constraints specified by a user, which need not include minimum support.

In some applications minimum support may not be a relevant criterion for generating rules. For example, if a user wants the interesting rules with the highest leverage, association rule discovery will first apply a minimum support constraint to get the frequent itemsets. The leverage constraint will then only be applied to rules formed from the frequent itemsets. However, there may be itemsets which are not frequent but which satisfy the minimum leverage constraint when they are part of a rule. As a result, several interesting rules may not be generated.

The advantage of the GRD approach is that it generates rules based on minimum constraints defined by a user (Webb, Zhang, 2002). The rules can be generated based on minimum support of the itemsets, but this is not an essential criterion for generating rules. Other constraints for generating interesting rules include minimum confidence, minimum leverage, and minimum lift. In addition to the constraints, a user can specify the number of rules to be generated. GRD will then generate the rules which maximize the measure of interest (e.g. leverage) subject to the constraints.

The GRD approach has been implemented in the GRD program. The GRD program and the algorithm used by it are described below.

1) The GRD program

The GRD program was developed by Webb and Zhang at Deakin University, Australia.
The training data input to the program consists of a header file with information about the data and a data file with all the transactions. The user then specifies the constraints, the number of rules to be generated, the number of cases and the maximum number of items for the antecedent. The rules are then generated from the training data, each with its leverage value. The GRD program has proved successful at finding rules with the highest possible leverage (Webb, Zhang, 2002).

An example input:

-number-of-cases=5822 -minimum-strength=0.8 -minimumsupports=0.01 -max-number-of-associations=1000 -maximum-LHS-size=4

Here -number-of-cases=5822 is the total number of cases from the training data which will be searched for rules; -minimum-strength=0.8 and -minimumsupports=0.01 are the constraints applied to generate the rules; -max-number-of-associations=1000 limits GRD to finding at most 1000 associations; and -maximum-LHS-size=4 specifies that the antecedent can comprise at most 4 items.

2) OPUS search algorithm

The algorithm used by GRD is the Optimized Pruning for Unordered Search (OPUS) algorithm (Webb, 1995). This algorithm was originally developed for classification rule discovery. It is an algorithm that is guaranteed to find the target it seeks. The search space can often be very complex. For such a search space a heuristic search is usually employed; however, a heuristic search does not necessarily find its target, and heuristic algorithms may also introduce a bias.

The OPUS algorithm is an efficient search method that prunes parts of the search space that cannot result in interesting rules. Once an infrequent itemset has been discovered, the search space is restructured (pruned). Restructuring the search space and pruning uninteresting rules allows very fast access to the rules which satisfy the minimum constraints.

3. Mining Negative Rules

The main interest in association rule discovery has been to mine rules with strong associations. Such associations are also called positive associations. Finding positive associations is useful for making predictions about the training data. Mining negative rules has been given some attention and could prove to be useful. In a large database the number of negative rules generated can be very large, and many of them may not be of interest to a user.

Brin, Motwani and Silverstein (Brin, Motwani, Silverstein, 1997) first discussed mining negative associations between two negative itemsets. Savasere, Omiecinski and Navathe (Savasere, Omiecinski, Navathe, 1998) first generate positive rules, from which negative rules are then mined. The result is that fewer but more interesting negative rules are mined.

Negative associations are associations between itemsets which occur in a transaction and itemsets which do not occur in the same transaction. Assume that A and B are itemsets, and that B is a single itemset. Then the rules to be mined can be of the form:

1) A => B (A implies B, as used in association rules)
2) A => ~B (A implies not B)
3) ~A => B (not A implies B)
4) ~A => ~B (not A implies not B)
(Wu, Zhang, Zhang)

The rules above specify concrete relationships between the itemsets, in contrast to (Savasere, Omiecinski, Navathe, 1998), who look at the rule A =/> B. An example of a useful negative rule is: "60% of customers who buy Potato Chips do not buy bottled water" (Savasere, Omiecinski, Navathe, 1998). This information can be used by the manager of a store to improve the store's marketing strategy (Savasere, Omiecinski, Navathe, 1998).

4. Tidsets and Diffsets

In a database of transactions, each transaction has itemsets associated with it. If the data are stored horizontally (See Fig. 1) then search algorithms do not perform as efficiently on the training data as they do when the data are stored vertically (See Fig. 2).
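As a baseline for the discussion that follows, the four rule forms of Section 3 can be evaluated directly on horizontally stored transactions by scanning every transaction. The sketch below is only an illustration under stated assumptions: the toy data and the names `holds` and `rule_stats` are hypothetical, and the negation ~A is taken to mean that the transaction does not contain the whole itemset A.

```python
# Horizontally stored toy data (hypothetical): one set of items per transaction.
transactions = [
    {"chips", "pepsi"},
    {"chips"},
    {"chips", "water"},
    {"chips", "water"},
    {"chips"},
    {"water"},
    {"water", "bread"},
    {"bread"},
    {"pepsi"},
    {"water", "pepsi"},
]

def holds(itemset, transaction, negated):
    """True if the (possibly negated) itemset condition holds for a transaction."""
    present = set(itemset) <= transaction
    return not present if negated else present

def rule_stats(a, b, data, neg_a=False, neg_b=False):
    """Support and confidence of A => B, A => ~B, ~A => B or ~A => ~B,
    computed with a full scan of the horizontal data."""
    n_a = sum(1 for t in data if holds(a, t, neg_a))
    n_ab = sum(1 for t in data
               if holds(a, t, neg_a) and holds(b, t, neg_b))
    support = n_ab / len(data)
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence

# chips => ~water: 3 of the 5 chips transactions contain no water.
support_neg, confidence_neg = rule_stats({"chips"}, {"water"},
                                         transactions, neg_b=True)
```

Here `confidence_neg` is 0.6, the same shape of statement as the potato chips and bottled water example. Every call rescans all transactions, which is the cost that a vertical representation is meant to avoid.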
In horizontal mining each transaction is stored with the itemsets that occur in it.

Transaction 1: [ Itemset A, Itemset B, Itemset C, Itemset D ]
Transaction 2: [ Itemset A, Itemset D ]
Transaction 3: [ Itemset A, Itemset B, Itemset C ]

Fig. 1: Horizontal mining

Vertical mining has outperformed horizontal mining when searching for rules in the training data. This is because data that are not necessary are automatically pruned, and the frequency of an itemset can be calculated quickly to test it against the minimum support.

A reference tidset is defined for a set of items. Zaki and Gouda define the reference tidset as a class. The class has a set of transactions associated with it. The class items may occur in, say, three transactions (1, 2 and 3) out of a total of, say, a hundred transactions. Each itemset is stored with the transactions of the class that it is contained in (e.g. 1 and 3). The transaction sets for the itemsets in this example can be seen in Fig. 2, where Itemsets B and C occur in transactions 1 and 3. The transaction set that belongs to an itemset is known as a tidset. The tidset for a class is known as a prefix tidset.

The itemsets within the class are then tested to see if they satisfy a minimum support constraint. Those itemsets that do not are omitted, as they are considered infrequent. The itemsets that are frequent will occur in most of the transactions in the class. From Fig. 2 it can be seen that each itemset is in at least two out of three transactions.

Itemset A      Itemset B      Itemset C      Itemset D
Transaction 1  Transaction 1  Transaction 1  Transaction 1
Transaction 2  Transaction 3  Transaction 3  Transaction 2
Transaction 3

Fig. 2: Vertical mining for a given Class

Zaki and Gouda (Zaki, Gouda, 2001) proposed that each itemset should be stored with its diffset rather than its tidset in the class that the itemset appears in. A diffset is the set of transactions of a given class that an itemset does not occur in.
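The relationship between the horizontal layout, tidsets and diffsets can be sketched in a few lines of Python. The transaction contents below are an assumption based on reading Fig. 1 and Fig. 2 (Itemset D in transactions 1 and 2), and the function names are hypothetical; this is an illustration of the representations, not the data structures used by Zaki and Gouda or by GRD.

```python
# Horizontal layout: transaction id -> itemsets occurring in it.
# Contents are assumed from Fig. 1 and Fig. 2.
horizontal = {
    1: {"A", "B", "C", "D"},
    2: {"A", "D"},
    3: {"A", "B", "C"},
}

def to_vertical(data):
    """Invert transaction -> itemsets into itemset -> tidset."""
    tidsets = {}
    for tid, itemsets in data.items():
        for itemset in itemsets:
            tidsets.setdefault(itemset, set()).add(tid)
    return tidsets

def to_diffsets(tidsets, class_tidset):
    """Diffset of each itemset: transactions of the class it does not occur in."""
    return {name: class_tidset - tids for name, tids in tidsets.items()}

def support_count(name, diffsets, class_tidset):
    """Support within the class, recovered from the diffset alone."""
    return len(class_tidset) - len(diffsets[name])

class_tidset = set(horizontal)            # prefix tidset of the class
tidsets = to_vertical(horizontal)
diffsets = to_diffsets(tidsets, class_tidset)

# Total number of stored transaction ids in each representation.
tidset_entries = sum(len(t) for t in tidsets.values())
diffset_entries = sum(len(d) for d in diffsets.values())
```

With these three transactions the tidsets hold nine transaction ids in total while the diffsets hold three, and `support_count` shows that the support of an itemset is still available from its diffset and the size of the class.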
Since the itemsets are frequent within the class, the tidset of an itemset is likely to be large, i.e. to contain most of the transactions in the class. Therefore, the diffset of an itemset is much smaller. The same information is contained in both representations, and the diffsets approach saves a lot of memory. From Fig. 3 it can be seen that for a given class the size of the diffset representation is a lot smaller than that of the tidset representation. The class contains three transactions.

Itemset A      Itemset B      Itemset C      Itemset D
(none)         Transaction 2  Transaction 2  Transaction 3

Fig. 3: Diffsets for itemsets in Fig. 2

Within a given class, tidsets for an itemset are computed by GRD. The aim now is to calculate the tidset for the negation of an itemset: if A is an itemset, then the aim is to find the tidset of ~A. For each form of a rule, GRD can calculate the negation of an itemset using the diffsets approach, where Diffset (X, Y) denotes the transactions in X that are not in Y:

Reference tidset = T
Tidset for antecedent = A
Tidset for consequent = C
Tidset for antecedent and consequent = A ^ C

1. For A => C, GRD computes:
   a. Tidset (A)
   b. Tidset (C)
   c. Tidset (A ^ C)
2. For A => ~C, GRD can compute:
   a. Tidset (A)
   b. Tidset (~C) = Diffset (T, C)
   c. Tidset (A ^ ~C) = Diffset (A, C)
3. For ~A => C, GRD can compute:
   a. Tidset (~A) = Diffset (T, A)
   b. Tidset (C)
   c. Tidset (~A ^ C) = Diffset (C, A)
4. For ~A => ~C, GRD can compute:
   a. Tidset (~A) = Diffset (T, A)
   b. Tidset (~C) = Diffset (T, C)
   c. Tidset (~A ^ ~C) = Diffset (Diffset (T, A), C), i.e. the transactions in T that contain neither A nor C

5. Bibliography

Agrawal, R, Imielinski, T, Swami, A N (1993), Mining Association Rules between Sets of Items in Large Databases, ACM SIGMOD International Conference on Management of Data, 207-216.
Brin, S, Motwani, R, Silverstein, C (1997), Beyond market baskets: generalizing association rules to correlations, 265-276.
Park, J S, Chen, M-S, Yu, P S (1995), An Effective Hash Based Algorithm for Mining Association Rules, ACM SIGMOD International Conference on Management of Data, 175-186.
Savasere, A, Omiecinski, E, Navathe, S (1998), Mining for Strong Negative Associations in a Large Database of Customer Transactions, ICDE, 449-502.
Savasere, A, Omiecinski, E, Navathe, S (1995), An Efficient Algorithm for Mining Association Rules in Large Databases, The VLDB Journal, 432-444.
Srikant, R, Vu, Q, Agrawal, R (1997), Mining Association Rules with Item Constraints, Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining, KDD, AAAI Press, 67-73.
Srikant, R, Agrawal, R (1997), Mining generalized association rules, Future Generation Computer Systems, vol. 13, 161-180.
Srikant, R, Agrawal, R (1994), Fast Algorithms for Mining Association Rules, Morgan Kaufmann, Proc. 20th Int. Conf. Very Large Data Bases, VLDB, 487-499.
Toivonen, H (1996), Sampling Large Databases for Association Rules, Morgan Kaufmann, 134-145.
Webb, G I (2000), Efficient search for association rules, Knowledge Discovery and Data Mining, 99-107.
Webb, G I (in press), Association rules, The Handbook of Data Mining, In N. Ye (Ed), Lawrence Erlbaum.
Webb, G I, Zhang, S (2002), Beyond Association Rules: Generalized Rule Discovery, Kluwer Academic Publishers.
Webb, G I (1995), OPUS: An Efficient Admissible Algorithm for Unordered Search, Journal of Artificial Intelligence, 3, 431-465.
Wu, X, Zhang, C, Zhang, S (), Mining both Positive and Negative Association Rules, Machine Learning: proceedings of the nineteenth international conference, 658-665.
Zaki, M J, Gouda, K (2001), Fast Vertical Mining Using Diffsets, 01-1, available at citeseer.nj.nec.com/zaki01fast.html.
Zaki, M J, Parthasarathy, S, Ogihara, M, Li, W (1997), New Algorithms for Fast Discovery of Association Rules, Technical Report 651, Computer Science Department, University of Rochester.
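As a closing illustration of the tidset computations listed in Section 4 for the four rule forms, the following Python sketch derives the tidsets of negated itemsets from set differences. It is a sketch under stated assumptions, not the GRD implementation: the names `diffset` and `rule_tidsets` are hypothetical, `diffset(x, y)` is taken to mean the transactions in x that are not in y, and for the doubly negated form both A and C are removed from T, which is the set of transactions containing neither itemset.

```python
def diffset(x, y):
    """Transactions in tidset x that are not in tidset y."""
    return x - y

def rule_tidsets(T, A, C, neg_a=False, neg_c=False):
    """Tidsets of the antecedent, the consequent and both together,
    for A => C and its three negated forms."""
    antecedent = diffset(T, A) if neg_a else A
    consequent = diffset(T, C) if neg_c else C
    if neg_a and neg_c:
        both = diffset(diffset(T, A), C)   # in neither A nor C
    elif neg_a:
        both = diffset(C, A)               # in C but not in A
    elif neg_c:
        both = diffset(A, C)               # in A but not in C
    else:
        both = A & C                       # in both A and C
    return antecedent, consequent, both

# Hypothetical example: a class of three transactions and two tidsets.
T = {1, 2, 3}
A = {1, 3}
C = {1, 2}

# Support and confidence of A => ~C, counted within the class.
ante, cons, both = rule_tidsets(T, A, C, neg_c=True)
support_a_not_c = len(both) / len(T)
confidence_a_not_c = len(both) / len(ante)
```

In every case the third tidset equals the intersection of the first two, which gives a simple consistency check on the set differences. Only the tidsets of A and C and the reference tidset T are needed, so no extra pass over the training data is required for the negated forms.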