Measuring Association Rules
Shan “Maggie” Duanmu
Project for CSCI 765
Dec 9th 2002
Outline
 The problems
 Our solutions
 Work to do
Definitions
 Association rule: Association rule mining searches for
interesting relationships among items in a given data set. Such
interesting relationships are typically expressed in an
association rule in the form of X=>Y, where X and Y are sets of
items. It can be read as: whenever a transaction T contains X,
it probably also contains Y.
 Metrics: The probability is defined as the percentage of
transactions containing Y in addition to X with respect to the
overall number of transactions containing X. This probability is
called confidence (or strength). While the confidence measure
represents the certainty of a rule, support is used to represent
the usefulness of the rule [1]. Formally, the support of a rule is
defined as the percentage of transactions containing both X and
Y with respect to the number of transactions in the database.
 Interesting rules: A rule is considered to be interesting if its
confidence and support exceed certain thresholds. Such
thresholds are generally assumed to be given by domain
experts.
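The two definitions above can be sketched in a few lines of Python. The transaction list and the helper names `support` and `confidence` are illustrative assumptions, not part of the original slides; exact fractions are used so the results are easy to check by hand.

```python
from fractions import Fraction

# Hypothetical toy database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    hits = sum(1 for t in db if itemset <= t)
    return Fraction(hits, len(db))

def confidence(x, y, db):
    """Fraction of the transactions containing x that also contain y.

    Assumes x occurs in at least one transaction.
    """
    return support(x | y, db) / support(x, db)

X, Y = {"bread"}, {"milk"}
print("support(X=>Y)    =", support(X | Y, transactions))    # 3/5
print("confidence(X=>Y) =", confidence(X, Y, transactions))  # 3/4
```

Here bread appears in 4 of 5 transactions and bread with milk in 3 of 5, which gives the support 3/5 and confidence 3/4.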
The Problems
 While the support-confidence framework has been
widely used for measuring the interestingness of
association rules, it has known problems:
1. The resulting rules may be misleading [4-8]. A rule
with high support and high confidence may still not
indicate that X and Y are dependent.
2. The use of support and confidence thresholds for
pruning may obscure important rules.
3. Many unimportant rules may remain in the
resulting rule set.
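Problem 1 can be illustrated with made-up numbers. In the sketch below, the counts and both thresholds are assumptions chosen for illustration: the rule X=>Y clears both thresholds, yet the presence of X lowers the chance of seeing Y, because P(Y|X) is smaller than P(Y).

```python
from fractions import Fraction

# Hypothetical counts over 100 transactions (illustrative numbers only).
n = 100
n_xy = 20        # contain both X and Y
n_x_only = 5     # contain X but not Y
n_y_only = 70    # contain Y but not X

p_x = Fraction(n_xy + n_x_only, n)    # P(X)  = 1/4
p_y = Fraction(n_xy + n_y_only, n)    # P(Y)  = 9/10
p_xy = Fraction(n_xy, n)              # P(XY) = 1/5

support = p_xy                        # 1/5
confidence = p_xy / p_x               # 4/5

# Assumed thresholds; in practice they come from domain experts.
min_support, min_confidence = Fraction(1, 10), Fraction(7, 10)
passes = support >= min_support and confidence >= min_confidence

# The rule is misleading if it passes although P(Y|X) < P(Y).
misleading = passes and confidence < p_y
print(passes, misleading)
```

With these numbers the rule has 80% confidence, but Y already occurs in 90% of all transactions, so X and Y are negatively correlated.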
Many metrics…
 To address the problems with the support-confidence
framework, many other metrics have been
proposed: interest, conviction, Gini index,
Laplace, phi-coefficient, collective strength,
reliability, …. So far, we can find at least 21
metrics in the literature. What to choose???
 P. Tan, V. Kumar, J. Srivastava, “Selecting the
Right Interestingness Measure for Association
Patterns.” ACM SIGKDD ’02, 2002.
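The slides do not define the measures they compare later (interest, conviction, reliability). The sketch below uses definitions commonly found in the literature, which are assumptions here: interest as P(XY)/(P(X)P(Y)), conviction as P(X)P(not Y)/P(X and not Y), and reliability as P(Y|X) - P(Y). The input probabilities are hypothetical.

```python
from fractions import Fraction

def interest(p_x, p_y, p_xy):
    """Interest (lift): P(XY) / (P(X) P(Y)); equals 1 under independence."""
    return p_xy / (p_x * p_y)

def conviction(p_x, p_y, p_xy):
    """Conviction: P(X) P(not Y) / P(X and not Y).

    Returns None when the rule never fails (confidence 1), where the
    ratio is unbounded.
    """
    p_x_not_y = p_x - p_xy
    if p_x_not_y == 0:
        return None
    return p_x * (1 - p_y) / p_x_not_y

def reliability(p_x, p_y, p_xy):
    """Reliability, assumed here to be P(Y|X) - P(Y)."""
    return p_xy / p_x - p_y

# Hypothetical probabilities: P(X) = 1/4, P(Y) = 9/10, P(XY) = 1/5.
p = (Fraction(1, 4), Fraction(9, 10), Fraction(1, 5))
print(interest(*p), conviction(*p), reliability(*p))  # 8/9 1/2 -1/10
```

Under these assumed definitions, interest and conviction below 1 and a negative reliability all signal a negatively correlated rule.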
Our Solutions
Six principles plus a partial order, in contrast to the prior
total order or partial order of the support-confidence
framework:
1. Implication
2. Correlation
3. Novelty
4. Utility
5. Top-N-rules
6. Efficiency
Implication principle
 Principle 1 (implication principle): If a set of
measures is defined to reflect the
interestingness of an association rule X=>Y, then
at least one measure mi(X=>Y) in the set
should satisfy the constraint
mi(X=>Y) > mi(Y=>X) when P(X) < P(Y).
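A quick check of this constraint, as a sketch with hypothetical probabilities: confidence is asymmetric and satisfies the inequality whenever P(XY) > 0, while interest (assumed here to be P(XY)/(P(X)P(Y))) is symmetric in X and Y and therefore cannot satisfy it.

```python
from fractions import Fraction

# Hypothetical probabilities with P(X) < P(Y).
p_x, p_y, p_xy = Fraction(1, 5), Fraction(1, 2), Fraction(3, 20)

def confidence(p_antecedent, p_both):
    """P(consequent | antecedent)."""
    return p_both / p_antecedent

def interest(p_a, p_b, p_both):
    """Assumed definition: P(AB) / (P(A) P(B))."""
    return p_both / (p_a * p_b)

conf_xy = confidence(p_x, p_xy)        # X=>Y: 3/4
conf_yx = confidence(p_y, p_xy)        # Y=>X: 3/10
int_xy = interest(p_x, p_y, p_xy)      # 3/2
int_yx = interest(p_y, p_x, p_xy)      # 3/2, same value in both directions

print(conf_xy > conf_yx)   # True: confidence distinguishes the direction
print(int_xy > int_yx)     # False: interest does not
```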
Correlation principle
 Principle 2 (correlation principle): If a set of
measures is defined to reflect the
interestingness of an association rule X=>Y,
then at least one measure mi(X=>Y) in the set
should be directly proportional to the
covariance of X and Y.
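As a worked example of this principle, write I_X and I_Y for the indicator variables of "transaction contains X" and "transaction contains Y" (notation introduced here, not in the slides). Assuming interest is defined as P(XY)/(P(X)P(Y)), it relates to the covariance as follows:

```latex
\begin{align*}
\operatorname{cov}(I_X, I_Y) &= E[I_X I_Y] - E[I_X]\,E[I_Y] = P(XY) - P(X)\,P(Y) \\
\operatorname{interest}(X \Rightarrow Y) - 1 &= \frac{P(XY)}{P(X)\,P(Y)} - 1
  = \frac{\operatorname{cov}(I_X, I_Y)}{P(X)\,P(Y)}
\end{align*}
```

So for fixed P(X) and P(Y), interest minus 1 is directly proportional to the covariance, and interest equals 1 exactly when the covariance is zero.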
Novelty principle
 Principle 3 (novelty principle): If a set of
measures is defined to reflect the
interestingness of an association rule X=>Y,
then for a given P(XY), at least one measure
mi in the set should reflect its novelty. The
novelty measure mi should be inversely
proportional to p=max{P(X),P(Y)}.
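One way to see this principle at work, again assuming interest is P(XY)/(P(X)P(Y)): hold P(XY) and the smaller marginal fixed and let the larger marginal grow. The numbers below are hypothetical and chosen to remain valid probabilities.

```python
from fractions import Fraction

def interest(p_x, p_y, p_xy):
    """Assumed definition: P(XY) / (P(X) P(Y))."""
    return p_xy / (p_x * p_y)

p_xy = Fraction(1, 10)   # held fixed, as the principle requires
p_x = Fraction(1, 5)     # the smaller marginal, also held fixed here

# The larger marginal p = max{P(X), P(Y)} = P(Y) takes increasing values.
larger_marginals = [Fraction(1, 4), Fraction(1, 2), Fraction(4, 5)]
values = [interest(p_x, p_y, p_xy) for p_y in larger_marginals]
print(values)            # interest falls as p grows: 2, 1, 5/8

# Inverse proportionality: interest * p stays constant.
products = [v * p_y for v, p_y in zip(values, larger_marginals)]
print(products)          # 1/2 each time
```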
Utility principle
 Principle 4 (utility principle): If a set of
measures is defined to reflect the
interestingness of an association rule X=>Y,
then at least one measure mi in the set
should reflect its utility, i.e., mi is a monotone
increasing function with respect to P(XY).
Top-N-rule principle
 Principle 5 (top-N-rule principle): If a
synthetic measure is defined to sort the rules
for presenting the top N rules to users, then it
is desirable that this measure obeys
principles 1-4.
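Selecting the top N rules under a single measure amounts to sorting by that measure. The sketch below uses hypothetical rules and the assumed interest definition P(XY)/(P(X)P(Y)); it shows that the choice of measure changes which rules reach the top.

```python
import heapq
from fractions import Fraction as F

# Hypothetical rules: name -> (P(X), P(Y), P(XY)).
rules = {
    "A=>B": (F(1, 5), F(1, 2), F(3, 20)),
    "C=>D": (F(1, 2), F(4, 5), F(9, 20)),
    "E=>F": (F(1, 10), F(1, 10), F(1, 20)),
}

def confidence(p_x, p_y, p_xy):
    return p_xy / p_x

def interest(p_x, p_y, p_xy):
    return p_xy / (p_x * p_y)

def top_n(measure, n):
    """Names of the n rules with the largest value of measure, best first."""
    return heapq.nlargest(n, rules, key=lambda name: measure(*rules[name]))

print(top_n(confidence, 2))   # ['C=>D', 'A=>B']
print(top_n(interest, 2))     # ['E=>F', 'A=>B']
```

The rule ranked first by confidence does not appear in the top two by interest, and the reverse also holds.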
Efficiency principle
 Principle 6 (efficiency principle): If a set of
measures is defined to reflect the
interestingness of an association rule, then it
is desirable that thresholds used with those
measures help reduce computational
complexity.
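A familiar instance of a threshold that reduces computation is the support threshold with Apriori-style pruning: any itemset that contains an infrequent item is itself infrequent, so it never needs to be counted. The sketch below is a generic illustration of that idea, not the P-tree approach mentioned later in these slides; the data and the threshold are assumptions.

```python
from itertools import combinations

# Hypothetical transactions.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "d"},
    {"a", "b", "c"},
]
min_count = 3  # assumed support threshold, as an absolute count

def count(itemset):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

items = sorted({i for t in transactions for i in t})
frequent_1 = [frozenset([i]) for i in items if count(frozenset([i])) >= min_count]

# Candidate pairs are built only from frequent single items, so any pair
# containing an infrequent item (here "d") is never counted at all.
kept_items = sorted(i for s in frequent_1 for i in s)
candidates_2 = [frozenset(p) for p in combinations(kept_items, 2)]
frequent_2 = [c for c in candidates_2 if count(c) >= min_count]

all_pairs = len(list(combinations(items, 2)))
print(len(candidates_2), "of", all_pairs, "pairs counted")   # 3 of 6
print(sorted(sorted(s) for s in frequent_2))                 # [['a', 'b'], ['a', 'c']]
```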
Partial results
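Principles satisfied by each measure (X = satisfied, ? = not recoverable from the source layout)

              Support   Confidence   Interest   Conviction                       Reliability
Implication             X                       X (when positively correlated)   X (when positively correlated)
Correlation                          X          ?                                ?
Novelty                              X          ?                                ?
Utility       X

The source shows three further marks in the Correlation and Novelty rows (X, X, and "X (when negatively related)") whose columns cannot be determined.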
A few conclusions
 No measure is absolutely better than the others for obtaining the
top-N rules.
 When using a synthetic measure such as reliability or
conviction, support is still an important utility measure. Interest
should still be used as a novelty measure in order to fully
characterize rules.
 Interest can be used not only as a good correlation measure but
also as a good novelty measure. It is always 1
when the rule contains no novel information.
 When Interest is used as a synthetic measure for ranking rules,
confidence should also be included in addition to support,
because Interest is a poor measure for examining implication.
 While we may have three alternative frameworks for fully
characterizing rules (support-confidence-interest,
support-conviction-interest, support-reliability-interest), the
support-confidence-interest framework is best. The other two work well
only when rules are positively correlated.
Partial Order
Instead of the support-confidence framework, we
suggest:
 Support-confidence-interest framework
 Support-conviction-interest framework
 Support-reliability-interest framework
 Other frameworks???
 Which is the best???
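The slides do not spell out how the partial order is defined. One plausible reading, offered here only as an assumption, is componentwise dominance: rule r precedes rule s if r is at least as good on every measure of the framework and strictly better on at least one. The sketch below applies that reading to hypothetical (support, confidence, interest) scores.

```python
from fractions import Fraction as F

# Hypothetical rules scored as (support, confidence, interest).
scores = {
    "r1": (F(3, 20), F(3, 4), F(3, 2)),
    "r2": (F(9, 20), F(9, 10), F(9, 8)),
    "r3": (F(1, 10), F(1, 2), F(1, 1)),
}

def dominates(a, b):
    """True if a is at least as good as b on every measure and differs from b."""
    return all(x >= y for x, y in zip(a, b)) and a != b

def maximal(scores):
    """Rules not dominated by any other rule (the top of the partial order)."""
    return sorted(
        name for name, s in scores.items()
        if not any(dominates(other, s) for other in scores.values())
    )

print(dominates(scores["r1"], scores["r3"]))   # True
print(dominates(scores["r2"], scores["r1"]))   # False: r1 has higher interest
print(maximal(scores))                         # ['r1', 'r2']
```

Under this reading r1 and r2 are incomparable, so neither can be ranked above the other without an additional synthetic measure.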
Work to Do
 Evaluate the frameworks with realistic
application data (image data, KDD Cup data,
Skyrocket data, …); the work has been criticized
for its lack of supporting applications
 Efficiency principle? P-tree algorithms and
other algorithms for comparison
 Other possible frameworks?
 Ours are for objective metrics; how can we
combine subjective metrics for top-N rules?