Measuring Association Rules
Shan “Maggie” Duanmu
Project for CSCI 765
Dec 9th 2002

Outline
The problems
Our solutions
Work to do

Definitions
Association rule: Association rule mining searches for interesting relationships among items in a given data set. Such relationships are typically expressed as an association rule of the form X=>Y, where X and Y are sets of items. The rule can be read as: whenever a transaction T contains X, it probably also contains Y.
Metrics: This probability is defined as the percentage of transactions containing Y in addition to X, relative to the number of transactions containing X. It is called confidence (or strength). While confidence represents the certainty of a rule, support represents its usefulness [1]. Formally, the support of a rule is the percentage of transactions containing both X and Y, relative to the total number of transactions in the database.
Interesting rules: A rule is considered interesting if its confidence and support exceed certain thresholds. Such thresholds are generally assumed to be given by domain experts.

The Problems
While the support-confidence framework has been widely used for measuring the interestingness of association rules, it is known that:
1. The resulting rules may be misleading [4-8]. A rule with high support and high confidence may still not indicate that X and Y are dependent.
2. The use of support and confidence thresholds for pruning may obscure important rules.
3. Many unimportant rules may remain in the resulting rule set.

Many metrics…
To address the problems with the support-confidence framework, many other metrics have been proposed: interest, conviction, Gini index, Laplace, phi-coefficient, collective strength, reliability, …. So far, we can find at least 21 metrics in the literature. What to choose?
P. Tan, V. Kumar, J. Srivastava, “Selecting the Right Interestingness Measure for Association Patterns,” ACM SIGKDD ’02, 2002.

Our Solutions
Six principles plus a partial order, in contrast to the prior total order or the partial order of the support-confidence framework:
1. Implication
2. Correlation
3. Novelty
4. Utility
5. Top-N-rules
6. Efficiency

Implication principle
Principle 1 (implication principle): If a set of measures is defined to reflect the interestingness of an association rule X=>Y, then at least one measure m_i(X=>Y) in the set should satisfy the constraint m_i(X=>Y) > m_i(Y=>X) when P(X) < P(Y).

Correlation principle
Principle 2 (correlation principle): If a set of measures is defined to reflect the interestingness of an association rule X=>Y, then at least one measure m_i(X=>Y) in the set should be directly proportional to the covariance of X and Y.

Novelty principle
Principle 3 (novelty principle): If a set of measures is defined to reflect the interestingness of an association rule X=>Y, then for a given P(XY), at least one measure m_i in the set should reflect its novelty. The novelty measure m_i should be inversely proportional to p = max{P(X), P(Y)}.

Utility principle
Principle 4 (utility principle): If a set of measures is defined to reflect the interestingness of an association rule X=>Y, then at least one measure m_i in the set should reflect its utility, i.e., m_i is a monotonically increasing function of P(XY).

Top-N-rule principle
Principle 5 (top-N-rule principle): If a synthetic measure is defined to sort the rules for presenting the top N rules to users, then it is desirable that this measure obey principles 1-4.

Efficiency principle
Principle 6 (efficiency principle): If a set of measures is defined to reflect the interestingness of an association rule, then it is desirable that the thresholds used with those measures help reduce computational complexity.
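
As a rough illustration of the definitions and of Principles 1, 2 and 4, the following Python sketch computes support, confidence and the covariance P(XY) - P(X)P(Y) on a small made-up transaction list. The function names and the toy data are assumptions introduced here for illustration; they do not come from the slides.

```python
def prob(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def support(x, y, transactions):
    """Support of X=>Y: fraction of transactions containing both X and Y."""
    return prob(frozenset(x) | frozenset(y), transactions)


def confidence(x, y, transactions):
    """Confidence of X=>Y: P(XY) / P(X)."""
    return support(x, y, transactions) / prob(x, transactions)


def covariance(x, y, transactions):
    """Covariance of the indicator variables of X and Y: P(XY) - P(X)P(Y)."""
    return support(x, y, transactions) - prob(x, transactions) * prob(y, transactions)


# Hypothetical toy database of five transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "milk", "eggs"}),
    frozenset({"milk"}),
    frozenset({"milk", "eggs"}),
    frozenset({"eggs"}),
]

X, Y = {"bread"}, {"milk"}

# Here P(X) = 0.4 < P(Y) = 0.8, so Principle 1 asks for some measure
# with m(X=>Y) > m(Y=>X); confidence behaves that way on this data.
print("support(X=>Y)    =", support(X, Y, transactions))
print("confidence(X=>Y) =", confidence(X, Y, transactions))
print("confidence(Y=>X) =", confidence(Y, X, transactions))
print("covariance(X, Y) =", covariance(X, Y, transactions))
```

Because confidence divides the same P(XY) by P(X) in one direction and by P(Y) in the other, it is larger for X=>Y whenever P(X) < P(Y), which is why it can serve as an implication measure. Support is P(XY) itself, so it is trivially monotone in P(XY) as the utility principle requires.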
Partial results
Table: which of the measures Support, Confidence, Interest, Conviction and Reliability satisfy each of the principles Implication, Correlation, Novelty and Utility. Several entries are qualified "when positively correlated" and one "when negatively related".

A few conclusions
No measure is absolutely better than the others for obtaining the top-N rules.
When using a synthetic measure such as reliability or conviction, support is still an important utility measure. Interest should still be used as a novelty measure in order to fully characterize rules.
Interest can be used not only as a good correlation measure but also as a good novelty measure. It is always 1 when the rule contains no novel information.
When Interest is used as a synthetic measure for ranking rules, confidence should be included in addition to support, because Interest is a poor measure for examining implication.
While we may have three alternative frameworks for fully characterizing rules (support-confidence-interest, support-conviction-interest, support-reliability-interest), the support-confidence-interest framework is best. The other two work well only when rules are positively correlated.

Partial Order
Instead of the support-confidence framework, we suggest:
Support-confidence-interest framework
Support-conviction-interest framework
Support-reliability-interest framework
Other frameworks? Which is the best?

Work to Do
Evaluate the frameworks with realistic application data (image data, KDD Cup data, Skyrocket data, …; criticized for lack of supporting applications).
Efficiency principle? P-tree algorithms and other algorithms for comparison.
Other possible frameworks?
Ours are for objective metrics; how can subjective metrics be combined for top-N rules?
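
The conclusions about Interest can be checked numerically with a small Python sketch. The slides do not state the formulas they use, so this assumes the commonly used definitions: interest (also called lift) as P(XY) / (P(X)P(Y)) and conviction as P(X)P(not Y) / P(X and not Y). The function names and the probabilities are hypothetical values chosen for illustration.

```python
def confidence(p_x, p_xy):
    """Confidence of X=>Y: P(XY) / P(X)."""
    return p_xy / p_x


def interest(p_x, p_y, p_xy):
    """Interest (lift), assumed definition: P(XY) / (P(X) * P(Y))."""
    return p_xy / (p_x * p_y)


def conviction(p_x, p_y, p_xy):
    """Conviction of X=>Y, assumed definition: P(X) * P(not Y) / P(X and not Y)."""
    p_x_not_y = p_x - p_xy
    if p_x_not_y == 0:
        # X never occurs without Y, so the rule is never violated.
        return float("inf")
    return p_x * (1 - p_y) / p_x_not_y


# Hypothetical case 1: X and Y independent, P(XY) = P(X) * P(Y).
print("interest, independent case:", interest(0.5, 0.4, 0.2))

# Hypothetical case 2: P(X) = 0.4 < P(Y) = 0.8, P(XY) = 0.4.
p_x, p_y, p_xy = 0.4, 0.8, 0.4
print("interest(X=>Y)   =", interest(p_x, p_y, p_xy))
print("interest(Y=>X)   =", interest(p_y, p_x, p_xy))
print("confidence(X=>Y) =", confidence(p_x, p_xy))
print("confidence(Y=>X) =", confidence(p_y, p_xy))
print("conviction(X=>Y) =", conviction(p_x, p_y, p_xy))
print("conviction(Y=>X) =", conviction(p_y, p_x, p_xy))
```

Under these assumed definitions, interest is symmetric in X and Y, so it cannot distinguish X=>Y from Y=>X. That is consistent with pairing it with confidence when implication matters. It also equals 1 when X and Y are independent, matching the remark that it is 1 when a rule carries no novel information.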