School of Computer Science and Software Engineering
Monash University
Bachelor of Computer Science Honours (1608), Clayton Campus
Literature Review Draft – 2003
Mining Negative Rules in Large Databases using
Generalized Rule Discovery
Dhananjay R Thiruvady
(18787886)
Supervisor: Professor Geoffrey I. Webb
Abstract
Rule discovery in data mining has recently received a great deal of attention. Association rule discovery is an approach to mining a large number of rules from a database of transactions. The rules generated are based on strong associations between items within the transactions, where strong associations are defined by a minimum support constraint. Generalized Rule Discovery (GRD) is an alternative rule discovery method to association rule discovery. GRD and association rule discovery share several features, but GRD allows the user to specify which minimum constraints are used to generate rules, whereas a minimum support must be specified in association rule discovery. The search method used by GRD is the OPUS algorithm for unordered search, an effective method for searching large unordered search spaces or, in the case of rule discovery, a large unordered space of rules.
Mining negative rules has been researched with association rule discovery. Using GRD and applying the tidsets/diffsets approach, mining negative rules can be done effectively and efficiently. The rules generated can then be assessed for usefulness.
1. Association Rule Discovery
Mining association rules in large databases was first approached by Agrawal, Imielinski and Swami (Agrawal, Imielinski, Swami, 1993). A database of transactions is the training data from which rules are generated. The aim of association rule discovery is to find rules with strong associations between items from the training data.
A rule is of the form A => B, where A is known as the antecedent and B as the consequent of the rule. Both A and B are itemsets. An itemset can be a single item (e.g. water) or a set of items (e.g. water and chips). The rule implies that if itemset A occurs in a transaction then itemset B is likely to occur in the same transaction of the training data. If water => chips holds with 80% confidence, the rule implies that 80% of customers who bought water also bought chips (Toivonen, 1996).
To mine association rules from a large database, minimum constraints are defined (Srikant, Vu, Agrawal, 1997). The number of candidate rules can be enormous, e.g. for 1000 items there are 2^1000 possible itemsets. The rules generated are based upon a minimum support constraint. Because the data space can be very large, the minimum support constraint is needed to prune the space from which rules are generated.
The support of a rule is the frequency with which the rule occurs in the training data. For example, if 25 transactions out of 100 in the training data contain Pepsi, then the support of Pepsi is 0.25. The itemsets within the rules generated are considered to be frequent itemsets; Pepsi is frequent in the previous example. Further constraints (e.g. confidence and lift) can be applied to the rules that are generated; such constraints are measures of interest.
Some constraints that can be applied as measures of interestingness are:
1) Confidence (A => B) = support (A => B) / support (A)
2) Lift (A => B) = support (A => B) / (support (A) x support (B))
3) Leverage (A => B) = support (A => B) - support (A) x support (B)
Confidence is a common measure of interest for generating rules in association rule discovery.
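As a rough illustration, these measures can be computed directly from a set of transactions. The following minimal Python sketch is illustrative only; the function names and toy data are not from the GRD program or any cited paper.

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b, transactions):
    return support(a | b, transactions) / support(a, transactions)

def lift(a, b, transactions):
    return support(a | b, transactions) / (support(a, transactions) * support(b, transactions))

def leverage(a, b, transactions):
    return support(a | b, transactions) - support(a, transactions) * support(b, transactions)

transactions = [{"water", "chips"}, {"water"}, {"chips"}, {"water", "chips"}]
print(confidence({"water"}, {"chips"}, transactions))  # 2/3 of water-buyers also bought chips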
Discovering association rules can be thought of as a three part process:
1) Search the data space to find all the items or itemsets whose support is greater than the user-specified minimum support. These itemsets are the frequent itemsets.
2) Generate interesting rules based on the frequent itemsets. A rule is said to be interesting if its confidence is above a user-specified minimum confidence.
3) Remove (prune) all rules which are not interesting from the rule space.
(Srikant, Agrawal, 1997)
Further constraints can be applied to generate rules specific to a user’s needs. However,
this can only be done when the rules are already available from the first two parts of the
rule generation process.
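As an illustration of parts two and three, the following Python sketch generates every rule whose confidence clears a user-specified minimum from a table of frequent itemset supports. The supports are hard-coded toy values; in practice they would come from part one of the process.

from itertools import combinations

supports = {  # frequent itemsets found in part one (illustrative values)
    frozenset({"water"}): 0.75,
    frozenset({"chips"}): 0.75,
    frozenset({"water", "chips"}): 0.50,
}

def rules(supports, min_confidence):
    for itemset, supp in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = supp / supports[antecedent]
                if conf >= min_confidence:  # prune uninteresting rules
                    yield antecedent, itemset - antecedent, conf

for a, b, conf in rules(supports, 0.6):
    print(set(a), "=>", set(b), round(conf, 2))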
In association rule discovery, the frequent itemset generation process is the part that most needs improvement in efficiency. When the training data has thousands of transactions, finding frequent itemsets can take a lot of time. Therefore, most research related to association rule discovery has been directed at improving the frequent itemset generation process, i.e. part one of the three part process. Several algorithms have been developed to mine association rules quickly. Some of them are:
1) Apriori
The Apriori algorithm has proved to be an efficient algorithm for mining association rules. It has been used effectively and has become the standard algorithm for association rule discovery.
Apriori follows a two-step process to generate rules. The first step is to find all frequent itemsets; the second is to generate association rules from these frequent itemsets. The first step limits the number of itemsets considered for the antecedent and consequent of the rules, and information about itemset frequency is maintained. An itemset is frequent if it satisfies the minimum support constraint.
Some variants of the Apriori approach have shown that very few passes through the training data may be necessary to generate association rules (Savasere, Omiecinski, Navathe, 1995), (Toivonen, 1996).
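The level-wise idea behind the first step can be sketched as follows. This is a minimal illustration of the general approach, not the published Apriori implementation: size k+1 candidates are built only from frequent size-k itemsets, since every subset of a frequent itemset must itself be frequent.

from itertools import chain

def apriori(transactions, min_support):
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = set(chain.from_iterable(transactions))
    level = {frozenset({i}) for i in items if support(frozenset({i})) >= min_support}
    frequent, k = set(level), 1
    while level:
        # Join frequent size-k itemsets to form size-(k+1) candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        level = {c for c in candidates if support(c) >= min_support}
        frequent |= level
        k += 1
    return frequent

print(apriori([{"water", "chips"}, {"water"}, {"water", "chips"}], 0.5))
# {water}, {chips} and {water, chips} are all frequent at min_support = 0.5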
2) Other algorithms and search methods
There have been several algorithms proposed to solve the task of generating frequent itemsets efficiently, including (Savasere, Omiecinski, Navathe, 1995), (Zaki, Parthasarathy, Ogihara, Li, 1997) and (Park, Chen, Yu, 1995).
Webb (Webb, 2000) argues that for some applications a direct search may be more effective than the two-step process of the Apriori algorithm. The algorithm presented maintains the data in memory, from which the association rules can be generated directly under the given constraints.
2. Generalized Rule Discovery
Generalized Rule Discovery was developed by Webb and Zhang (Webb, Zhang, 2002). GRD's aims are very similar to those of association rule discovery. As with association rule discovery, GRD searches through the training data and generates rules. These rules are generated based on constraints specified by a user, but those constraints need not include minimum support.
In some applications minimum support may not be a relevant criterion for generating rules. For example, if a user wants the rules with the highest leverage, association rule discovery will first apply a minimum support constraint to get the frequent itemsets. The leverage constraint is then only applied to rules built from those frequent itemsets. However, there may be itemsets which are not frequent but which satisfy the minimum leverage constraint when they are part of a rule. As a result several interesting rules may not be generated.
The advantage of the GRD approach is that it generates rules based on minimum constraints defined by a user (Webb, Zhang, 2002). The rules can be generated based on minimum support of the itemsets, but support is not an essential criterion for generating rules. Other constraints for generating interesting rules include minimum confidence, minimum leverage and minimum lift. In addition to the constraints, a user can specify the number of rules to be generated, and GRD will generate the rules that best satisfy the constraints.
The GRD approach has been implemented in the GRD program. The GRD program and
the algorithm used by it are described below.
1) The GRD program
The GRD program was developed by Webb and Zhang at Deakin University, Australia. The training data input to the program includes a header file with information about the data and a data file with all the transactions. The user then specifies the constraints, the number of rules to be generated, the number of cases and the maximum number of items in the antecedent. The rules are then generated from the training data together with their leverage values. The GRD program has proved successful at finding rules with the highest possible leverage (Webb, Zhang, 2002).
An example input: -number-of-cases=5822 -minimum-strength=0.8 -minimum-supports=0.01 -max-number-of-associations=1000 -maximum-LHS-size=4.
Here -number-of-cases=5822 gives the total number of cases in the training data which will be searched for rules. -minimum-strength=0.8 and -minimum-supports=0.01 are the constraints applied to generate the rules. -max-number-of-associations=1000 limits GRD to finding only 1000 associations, and -maximum-LHS-size=4 specifies that the antecedent can comprise at most 4 items.
2) OPUS search algorithm
The algorithm used by GRD is the Optimized Pruning for Unordered Search (OPUS) algorithm for unordered search (Webb, 1995). The algorithm was originally developed for classification rule discovery and can be used for that purpose. It is an algorithm that is guaranteed to find the target it seeks.
The search space can often be very complex. For such a search space, a heuristic search is usually employed. However, a heuristic search does not necessarily find its target, and heuristic algorithms may also introduce a bias.
The OPUS algorithm is an efficient search method that prunes parts of the search space that cannot yield interesting rules. Once an infrequent itemset has been discovered, the search space is restructured (pruned). Restructuring the search space and pruning uninteresting rules allows very fast access to the rules which satisfy the minimum constraints.
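The pruning idea can be illustrated with a short branch-and-bound sketch. This is a hedged illustration of the general principle rather than Webb's OPUS algorithm itself: antecedents for a fixed consequent are explored depth-first, a heap holds the k best rules by leverage, and a branch is abandoned when an optimistic bound on the leverage of any superset antecedent (here supp(A ^ C) x (1 - supp(C))) cannot beat the current k-th best rule. All names and data are illustrative.

import heapq

def search(transactions, consequent, items, k):
    n = len(transactions)

    def supp(s):
        return sum(1 for t in transactions if s <= t) / n

    supp_c = supp(consequent)
    top = []  # min-heap of (leverage, antecedent) for the best k rules so far

    def expand(antecedent, remaining):
        if antecedent:
            lev = supp(antecedent | consequent) - supp(antecedent) * supp_c
            entry = (lev, sorted(antecedent))
            if len(top) < k:
                heapq.heappush(top, entry)
            elif lev > top[0][0]:
                heapq.heapreplace(top, entry)
            # Optimistic value: no superset of this antecedent can achieve
            # leverage above supp(A ^ C) * (1 - supp(C)), so the whole branch
            # is pruned once that bound cannot beat the k-th best rule.
            if len(top) == k and supp(antecedent | consequent) * (1 - supp_c) <= top[0][0]:
                return
        for i, item in enumerate(remaining):
            expand(antecedent | {item}, remaining[i + 1:])

    expand(frozenset(), sorted(i for i in items if i not in consequent))
    return sorted(top, reverse=True)

data = [{"water", "chips"}, {"water"}, {"water", "chips", "beer"}, {"beer"}]
print(search(data, frozenset({"chips"}), {"water", "beer", "chips"}, 2))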
3. Mining Negative Rules
The main interest in association rule discovery has been to mine rules with strong associations. Such associations are also called positive associations. Finding positive associations is useful for making predictions about the training data.
Mining negative rules has been given some attention and could prove to be useful. In a large database the number of negative rules generated can be very large, and most of them may not be of interest to a user. Brin, Motwani and Silverstein (Brin, Motwani, Silverstein, 1997) first discussed mining negative associations between two itemsets. Savasere, Omiecinski and Navathe (Savasere, Omiecinski, Navathe, 1998) generate positive rules first and mine negative rules from them; the result is that fewer but more interesting negative rules are mined.
Negative associations are associations between itemsets which occur in a transaction and those itemsets that do not occur in the same transactions. Assume that A and B are itemsets, where B is a single itemset. Then the rules to be mined can be of the form:
1) A => B (A implies B, as used in association rules)
2) A => ~B (A implies not B)
3) ~A => B (not A implies B)
4) ~A => ~B (not A implies not B)
(Wu, Zhang, Zhang, 2002)
The rules above specify concrete relationships between the itemsets, in contrast to (Savasere, Omiecinski, Navathe, 1998), who consider rules of the form A =/> B.
An example of a useful negative rule: "60% of customers who buy Potato Chips do not buy bottled water" (Savasere, Omiecinski, Navathe, 1998). This information can be used by the manager of a store to improve the store's marketing strategy (Savasere, Omiecinski, Navathe, 1998).
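The supports of all four rule forms can be derived from positive supports alone, which is part of what makes mining negative rules from positive rules feasible. The following small sketch (not from any of the cited papers) shows the arithmetic; the supports in the example are invented so that the confidence matches the 60% figure quoted above.

def negative_supports(supp_a, supp_b, supp_ab):
    # supp(A and not B) = supp(A) - supp(A and B), and so on.
    return {
        "A => B": supp_ab,
        "A => ~B": supp_a - supp_ab,
        "~A => B": supp_b - supp_ab,
        "~A => ~B": 1 - supp_a - supp_b + supp_ab,
    }

# Illustrative figures: 50% of transactions contain potato chips, 30%
# bottled water, 20% both; then supp(chips and not water) = 0.3, i.e.
# 60% of chip-buyers do not buy bottled water (0.3 / 0.5).
print(negative_supports(0.5, 0.3, 0.2))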
4. Tidsets and Diffsets
In a database of transactions, each transaction has itemsets associated with it. If the data are stored horizontally (see Fig. 1) then search algorithms over the training data are not as efficient as with vertical mining (see Fig. 2). In horizontal mining each transaction is stored with the itemsets that occur in it.
Transaction 1: [Itemset A, Itemset B, Itemset C, Itemset D]
Transaction 2: [Itemset A, Itemset D]
Transaction 3: [Itemset A, Itemset B, Itemset C]
Fig. 1: Horizontal mining
Vertical mining has outperformed horizontal mining when searching for rules in the training data. This is because unnecessary data are automatically pruned, and the frequency of an itemset can be calculated quickly to test the minimum support constraint.
A reference tidset is defined for a set of items. Zaki and Gouda define the reference tidset as a class. The class has a set of transactions associated with it. The class items may occur in, say, three transactions (1, 2 and 3) out of a total of, say, one hundred. Each itemset is stored with the transactions it is contained in for a particular class (e.g. 1 and 3). The transaction sets for the itemsets in this example can be seen in Fig. 2; itemsets B and C occur in transactions 1 and 3.
The transaction set that belongs to an itemset is known as a tidset. The tidset for a class is known as a prefix tidset. The itemsets within the class are then tested to see whether they satisfy a minimum support constraint; those that do not are omitted as they are considered infrequent. The itemsets that are frequent will occur in most of the transactions in the class. From Fig. 2 it can be seen that each itemset is in at least two out of three transactions.
Itemset A: Transaction 1, Transaction 2, Transaction 3
Itemset B: Transaction 1, Transaction 3
Itemset C: Transaction 1, Transaction 3
Itemset D: Transaction 1, Transaction 2
Fig. 2: Vertical mining for a given class
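The vertical layout of Fig. 2 is straightforward to build from the horizontal layout of Fig. 1. The sketch below is illustrative; the support of a combined itemset is simply the size of a tidset intersection.

from collections import defaultdict

def to_vertical(transactions):
    """Map each item to the set of transaction ids it occurs in."""
    tidsets = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            tidsets[item].add(tid)
    return tidsets

# The data from Fig. 1, with itemsets A-D written as single labels.
transactions = {1: ["A", "B", "C", "D"], 2: ["A", "D"], 3: ["A", "B", "C"]}
tidsets = to_vertical(transactions)
print(tidsets["B"] & tidsets["C"])  # {1, 3}: support of {B, C} is |{1, 3}| / 3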
Zaki and Gouda (Zaki, Gouda, 2001) proposed that each itemset should be stored with its diffset rather than its tidset within the class that the tidset appears in. A diffset is the set of transactions within a given class in which an itemset does not occur. Since the itemsets are frequent within the class, the size of an itemset's tidset is likely to be large, i.e. most of the transactions in the class, and therefore the size of its diffset is much smaller. The same information is contained in both representations, so the diffsets approach saves a lot of memory.
From Fig. 3 it can be seen that for a given class the size of the diffset representation is a lot smaller than that of the tidset representation. The class contains three transactions, and each diffset contains at most one of them:
Itemset A: (none)
Itemset B: Transaction 2
Itemset C: Transaction 2
Itemset D: Transaction 3
Fig. 3: Diffsets for itemsets in Fig. 2
Within a given class, the tidsets for itemsets are computed by GRD. The aim now is to calculate the tidset for the negation of an itemset: if A is an itemset, the aim is to find the tidset of ~A.
For each form of a rule, GRD can calculate the negation of an itemset using the diffsets approach. Let:
- Reference tidset = T
- Tidset for antecedent = A
- Tidset for consequent = C
- Tidset for antecedent and consequent = A ^ C
1. For A => C, GRD computes:
a. Tidset (A)
b. Tidset (C)
c. Tidset (A ^ C)
2. For A => ~C, GRD can compute:
a. Tidset (A)
b. Tidset (~C) = Diffset (T, C)
c. Tidset (A ^ ~C) = Diffset (A, C)
3. For ~A => C, GRD can compute:
a. Tidset (~A) = Diffset (T, A)
b. Tidset (C)
c. Tidset (~A ^ C) = Diffset (C, A)
4. For ~A => ~C, GRD can compute:
a. Tidset (~A) = Diffset (T, A)
b. Tidset (~C) = Diffset (T, C)
c. Tidset (~A ^ ~C) = Diffset (Diffset (T, A), C)
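These computations map directly onto set operations. The sketch below mirrors the set algebra with plain Python sets standing in for tidsets; the values are illustrative, not GRD internals.

T = {1, 2, 3, 4, 5, 6}   # reference tidset for the class
A = {1, 2, 3, 4}         # tidset of the antecedent
C = {3, 4, 5}            # tidset of the consequent

print(A & C)        # Tidset(A ^ C)                               -> {3, 4}
print(T - C)        # Tidset(~C)     = Diffset(T, C)              -> {1, 2, 6}
print(A - C)        # Tidset(A ^ ~C) = Diffset(A, C)              -> {1, 2}
print(C - A)        # Tidset(~A ^ C) = Diffset(C, A)              -> {5}
print((T - A) - C)  # Tidset(~A ^ ~C) = Diffset(Diffset(T, A), C) -> {6}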
5. Bibliography
Agrawal, R, Imielinski, T and Swami, A N (1993), Mining Association Rules between Sets of Items in Large Databases, ACM SIGMOD International Conference on Management of Data, 207-216.
Brin, S, Motwani, R and Silverstein, C (1997), Beyond Market Baskets: Generalizing Association Rules to Correlations, ACM SIGMOD International Conference on Management of Data, 265-276.
Park, J S, Chen, M-S and Yu, P S (1995), An Effective Hash Based Algorithm for Mining Association Rules, ACM SIGMOD International Conference on Management of Data, 175-186.
Savasere, A, Omiecinski, E and Navathe, S (1998), Mining for Strong Negative Associations in a Large Database of Customer Transactions, ICDE, 494-502.
Savasere, A, Omiecinski, E and Navathe, S (1995), An Efficient Algorithm for Mining Association Rules in Large Databases, Proc. 21st Int. Conf. Very Large Data Bases (VLDB), 432-444.
Srikant, R, Vu, Q and Agrawal, R (1997), Mining Association Rules with Item Constraints, Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining (KDD), AAAI Press, 67-73.
Srikant, R and Agrawal, R (1997), Mining Generalized Association Rules, Future Generation Computer Systems, vol. 13, 161-180.
Srikant, R and Agrawal, R (1994), Fast Algorithms for Mining Association Rules, Proc. 20th Int. Conf. Very Large Data Bases (VLDB), Morgan Kaufmann, 487-499.
Toivonen, H (1996), Sampling Large Databases for Association Rules, Proc. 22nd Int. Conf. Very Large Data Bases (VLDB), Morgan Kaufmann, 134-145.
Webb, G I (2000), Efficient Search for Association Rules, Knowledge Discovery and Data Mining (KDD), 99-107.
Webb, G I (in press), Association Rules, in N. Ye (Ed), The Handbook of Data Mining, Lawrence Erlbaum.
Webb, G I and Zhang, S (2002), Beyond Association Rules: Generalized Rule Discovery, Kluwer Academic Publishers.
Webb, G I (1995), OPUS: An Efficient Admissible Algorithm for Unordered Search, Journal of Artificial Intelligence Research, 3, 431-465.
Wu, X, Zhang, C and Zhang, S (2002), Mining Both Positive and Negative Association Rules, Machine Learning: Proceedings of the Nineteenth International Conference (ICML), 658-665.
Zaki, M J and Gouda, K (2001), Fast Vertical Mining Using Diffsets, Technical Report 01-1, available at citeseer.nj.nec.com/zaki01fast.html.
Zaki, M J, Parthasarathy, S, Ogihara, M and Li, W (1997), New Algorithms for Fast Discovery of Association Rules, Technical Report 651, Computer Science Department, University of Rochester.