Data Mining I
Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Keith E. Emmert
Tarleton State University
October 2, 2012

Outline
- Basic Concepts
- Frequent Itemset Mining Methods: Apriori
- Which Patterns Are Interesting? Pattern Evaluation Methods

Frequent Patterns
A frequent pattern is a pattern that appears often in a data set.
- A frequent itemset is a set of items that often appears together in a data set (computer, monitor, surge protector, and a computer game).
- A frequent sequential pattern is a subsequence that often appears in a data set (first a computer, then computer game #1, then computer game #2).
- A frequent structured pattern is a substructure (subgraph, subtree, etc.) that often appears in a data set.

Market Basket Analysis
Market basket analysis uses customer purchases to develop associations between the different items purchased. This allows "big brother" to produce more effective
- product placement and catalog design (computer games are placed near computers). Note that product placement in a store, in a catalog, and online may all differ, since different types of people shop in different ways;
- cross-marketing strategies: products and services of other companies that complement yours (you only sell computers, so partner with a company that sells cool computer games!);
- customer shopping behavior analyses (a computer was purchased, so they're going to buy a game: print computer game coupons upon checkout!).

Some Basic Terms
Let I = {I1, ..., Im} be a (universal) set of items.
- A transaction, T ⊆ I, is a random variable.
- D is the set of all transactions in a given time period.
- A ⊆ T is an itemset. If |A| = k, then it is called a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset, and is also known as the frequency, support count, count, or absolute support of the itemset.

We wish to study which itemsets (subsets of transactions) imply that other itemsets are also obtained. For itemsets A and B, we use the following notation:
- Pr(A ∪ B) denotes the probability that a transaction contains the union of sets A and B, that is, it contains all items in both A and B.
- Pr(A or B) denotes the probability that a transaction contains A, B, or both.

Association Rules: Support
Suppose A, B ⊆ I are itemsets, A, B ≠ ∅, and A ∩ B = ∅. We wish to study "customers" who obtain A and also tend to obtain B. This is denoted by A ⇒ B and is called an association rule.

Support(A ⇒ B) = Pr(A ∪ B) is the percentage of transactions in D that contain both A and B. So,

Support(A ⇒ B) = (# transactions that contain both A and B) / |D|.

Note that Support(A ⇒ B) = Support(B ⇒ A) = Support(A ∪ B). We shall also call this the relative support of an itemset.
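
A minimal sketch of these definitions in Python; the four-transaction database below is a hypothetical toy example, not one from the slides:

    # Support of a rule A => B: the fraction of transactions that
    # contain every item of A and every item of B.
    D = [{"computer", "monitor"}, {"computer", "game"},
         {"computer", "monitor", "game"}, {"monitor"}]

    def support_count(itemset, transactions):
        """Absolute support: # of transactions containing the itemset."""
        return sum(1 for t in transactions if itemset <= t)

    def support(A, B, transactions):
        """Relative support of A => B, i.e. Pr(A u B)."""
        return support_count(A | B, transactions) / len(transactions)

    print(support({"computer"}, {"game"}, D))  # 2/4 = 0.5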

Association Rules: Confidence
Define Confidence(A ⇒ B) = Pr(B | A), the percentage of transactions in D containing A that also contain B, and recall that for any itemset Q, the number of transactions that contain Q is called

Occurrence Frequency(Q) = Support(Q) = Support Count(Q).

So, we know

Confidence(A ⇒ B) = Pr(B | A)
    = (# transactions that contain both A and B) / (# transactions that contain A)
    = Support(A ∪ B) / Support(A)
    = Support Count(A ∪ B) / Support Count(A).

In general, Confidence(A ⇒ B) ≠ Confidence(B ⇒ A).
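
Continuing the sketch, confidence is a ratio of support counts (same hypothetical toy database as before):

    def support_count(itemset, transactions):
        """Absolute support: # of transactions containing the itemset."""
        return sum(1 for t in transactions if itemset <= t)

    def confidence(A, B, transactions):
        """Confidence(A => B) = Support Count(A u B) / Support Count(A)."""
        return support_count(A | B, transactions) / support_count(A, transactions)

    D = [{"computer", "monitor"}, {"computer", "game"},
         {"computer", "monitor", "game"}, {"monitor"}]
    print(confidence({"computer"}, {"game"}, D))  # 2/3, about 0.667
    print(confidence({"game"}, {"computer"}, D))  # 2/2 = 1.0, not symmetric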

Example
Suppose that A contains a cat and B contains catnip. Then we can write

A ⇒ B [Support = 4%, Confidence = 40%]

and conclude that
- 4% of all transactions included both a cat and their drug of choice;
- 40% of all transactions that included a cat also included their drug of choice.

Of course, this is not a commutative operator, so B ⇒ A would have the same support but a completely different confidence.

Frequent Itemsets and Strong Association Rules
- An itemset, Q, is called a frequent itemset if Support(Q) = Pr(Q) (the # of transactions that contain Q divided by the number of transactions) is at least a specified minimum support threshold. Equivalently, the relative support of Q is at least a prespecified minimum support threshold if and only if its absolute support is at least the corresponding minimum support count threshold.
- The set of frequent k-itemsets is denoted by Lk.
- For two itemsets A and B, if Support(A ⇒ B) is at least a minimum support threshold and Confidence(A ⇒ B) is at least a minimum confidence threshold, then the association rule A ⇒ B is called strong.

When are Association Rules Strong?
Recall that the support count of an itemset is the number of transactions that contain the itemset. If we know
1. the number of transactions, |D|,
2. Support Count(A),
3. Support Count(B),
4. Support Count(A ∪ B),

then we can compute
- Confidence(A ⇒ B) = Support Count(A ∪ B) / Support Count(A),
- Confidence(B ⇒ A) = Support Count(A ∪ B) / Support Count(B),
- Support(A ⇒ B) = Support Count(A ∪ B) / |D|,
- Support(B ⇒ A) = Support(A ⇒ B).

It is easy to derive the rules A ⇒ B as well as B ⇒ A and check whether or not they are strong.
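
A short sketch of that computation; the counts and thresholds below are hypothetical:

    def rule_strength(n_D, cnt_A, cnt_B, cnt_AB, min_sup, min_conf):
        """Derive A => B and B => A from the four counts and test strength."""
        sup = cnt_AB / n_D            # Support(A => B) = Support(B => A)
        conf_ab = cnt_AB / cnt_A      # Confidence(A => B)
        conf_ba = cnt_AB / cnt_B      # Confidence(B => A)
        return (sup >= min_sup and conf_ab >= min_conf,
                sup >= min_sup and conf_ba >= min_conf)

    # Hypothetical counts: 10 transactions, A in 5, B in 4, both in 3.
    print(rule_strength(10, 5, 4, 3, min_sup=0.2, min_conf=0.7))  # (False, True)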

Association Rule Mining and a Problem
A "simple" two-step process:
1. Find all frequent itemsets: by definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, minSup.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence.

The problem:
- If I is a frequent itemset, then each nonempty A ⊆ I is frequent.
- Hence, P(I)\{∅} contains all frequent item subsets of I.
- Note that |P(I)\{∅}| = 2^|I| − 1 is the number of frequent itemsets contained in I, a possibly large number.

Some More Terms
Let Q be an itemset, and D the set of transactions. Suppose a minimum support threshold is fixed.
- Q is closed in D if ∄R such that Q ⊊ R ⊆ D and Support Count(Q) = Support Count(R).
- Q is a closed frequent itemset in D if Q is closed and frequent.
- Q is a maximal frequent itemset in D if Q is frequent and ∄R ⊆ D such that Q ⊊ R and R is frequent.
- C is the set of closed frequent itemsets of D.
- M is the set of maximal frequent itemsets of D.

Note that
- C, along with the support count of each itemset in C, allows us to derive the whole set of frequent itemsets;
- however, M with the support counts of each itemset in M does NOT (in general) allow us to derive the set of frequent itemsets.

Example
Let the (universal) set of items be I = {a1, ..., a100} and define D = {Q = {a1, ..., a100}, R = {a1, ..., a50}} to be the transaction set. Let the minimum support count be 1.
- Support Count(Q) = 1.
- Support Count(R) = 2, because R ⊆ R and R ⊆ Q.
- Clearly both are frequent itemsets.
- Q is closed because ∄Y ⊆ D such that Q ⊊ Y.
- R is closed because ∄Y ⊆ D such that R ⊊ Y and Support Count(R) = Support Count(Y).
- So, C = {Q, R}.
- Since R ⊊ Q and Q is a frequent itemset, R is not a maximal frequent itemset.
- Clearly Q is a maximal frequent itemset.
- So, M = {Q}.

Example Continued
Recall: I = {a1, ..., a100}, D = {Q = {a1, ..., a100}, R = {a1, ..., a50}}, and the minimum support count is 1, so that
- Support Count(Q) = 1 and Support Count(R) = 2 (because R ⊆ R and R ⊆ Q),
- C = {Q, R} and M = {Q}.

Hence,
- For any itemset U ⊆ Q, if ∃aj ∈ U with j > 50, then Support Count(U) = 1 (U is contained only in Q); otherwise Support Count(U) = 2 (U ⊆ R). So, knowing the support counts of C yields the support counts of all frequent itemsets.
- Note we can't compute Support Count({a1, a2}) given only the support count of M.
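
A brute-force sketch of these definitions, scaled down from a1, ..., a100 to four items so the power set can be enumerated (minimum support count 1, as above):

    from itertools import combinations

    D = [frozenset({"a1", "a2", "a3", "a4"}),   # stand-in for Q
         frozenset({"a1", "a2"})]               # stand-in for R
    min_count = 1

    def count(s):
        return sum(1 for t in D if s <= t)

    items = sorted(set().union(*D))
    all_itemsets = [frozenset(c) for r in range(1, len(items) + 1)
                    for c in combinations(items, r)]
    frequent = [s for s in all_itemsets if count(s) >= min_count]

    # Closed: no proper superset with the same support count.
    closed = [s for s in frequent
              if not any(s < r and count(s) == count(r) for r in frequent)]
    # Maximal: no proper frequent superset.
    maximal = [s for s in frequent if not any(s < r for r in frequent)]

    print([sorted(s) for s in closed])   # [['a1','a2'], ['a1','a2','a3','a4']]
    print([sorted(s) for s in maximal])  # [['a1','a2','a3','a4']]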

Apriori
- The Apriori ("prior knowledge") algorithm was developed by Rakesh Agrawal and Ramakrishnan Srikant (appeared in Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, pages 487-499, Santiago, Chile, September 1994) and is used in forming frequent itemsets for Boolean association rules.
- Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
- Corollary: if I is not frequent, then for all items a, I ∪ {a} is not frequent.

The Basic Idea
Suppose we have Lk−1, the frequent (k−1)-itemsets, for k ≥ 2. Assume that the items within each transaction are sorted in lexicographic order.

Goal: find Lk.
- The join step, Ck = Lk−1 ⋈ Lk−1: join Lk−1 with itself to form candidate k-itemsets, where ⋈ means:
  - l1, l2 ∈ Lk−1 are joinable if l1[n] = l2[n] for n = 1, ..., k−2 and l1[k−1] < l2[k−1];
  - the condition l1[k−1] < l2[k−1] ensures there are no duplicates;
  - the new k-itemset is {l1[1], l1[2], ..., l1[k−1], l2[k−1]}.
- The prune step: note that Lk ⊆ Ck, so some candidate k-itemsets may not be frequent.
  - Any (k−1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
  - Any candidate k-itemset in Ck with a (k−1)-subset that is not in Lk−1 can't be frequent and must be removed!

Example
Suppose our transaction database is as given below, with minSup = 2. Find the frequent itemsets.

TID    Items
T001   1, 3, 4
T002   2, 3, 5
T003   1, 2, 3, 5
T004   2, 5

C1 is:
Itemset   Support
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

Pruning {4} (support 1 < minSup) gives L1:
Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

Example (continued)
With the same database and minSup = 2, join L1 with itself.

C2 is:
Itemset   Support
{1, 2}    1
{1, 3}    2
{1, 5}    1
{2, 3}    2
{2, 5}    3
{3, 5}    2

Pruning the itemsets with support below 2 gives L2:
Itemset   Support
{1, 3}    2
{2, 3}    2
{2, 5}    3
{3, 5}    2

Note that C2 contains all possible joins because every 1-subset of each candidate is a frequent 1-itemset.

Example (continued)
With the same database and minSup = 2, join L2 with itself.

C3 is:
Itemset      Support
{2, 3, 5}    2

Since its support meets minSup, L3 is:
Itemset      Support
{2, 3, 5}    2

Note that C3 does not contain
- {1, 2, 3}, because {1, 2} is not a frequent 2-itemset;
- {1, 2, 5}, because {1, 2} is not a frequent 2-itemset;
- {1, 3, 5}, because {1, 5} is not a frequent 2-itemset.

Further joins are not possible, that is, C4 = L3 ⋈ L3 = ∅, and the Apriori algorithm terminates.

Apriori Algorithm
input : D, a database of transactions;
        minSup, the minimum support count threshold.
output: L, frequent itemsets in D.
begin
    L1 = findFrequent1Itemsets(D)
    for (k = 2; Lk−1 ≠ ∅; k++) do
        Ck = apriori.gen(Lk−1)
        for each transaction t ∈ D do
            Ct = subset(Ck, t)        /* get candidates contained in t */
            for each c ∈ Ct do
                c.count++
            end
        end
        Lk = {c ∈ Ck | c.count ≥ minSup}
    end
    return L = ∪k Lk
end

Procedure apriori.gen
input : Lk−1, frequent (k−1)-itemsets.
output: Ck, candidate k-itemsets.
begin
    for each l1 ∈ Lk−1 do
        for each l2 ∈ Lk−1 do
            if (l1[1] = l2[1]) ∧ ... ∧ (l1[k−2] = l2[k−2]) ∧ (l1[k−1] < l2[k−1]) then
                c = l1 ⋈ l2                  /* join step: new candidate */
                if has.infrequent.subset(c, Lk−1) then
                    delete(c)                /* prune step */
                else
                    add c to Ck
                end
            end
        end
    end
    return Ck
end

Procedure has.infrequent.subset
input : c, candidate k-itemset;
        Lk−1, frequent (k−1)-itemsets.
output: TRUE or FALSE
begin
    for each (k−1)-subset s of c do
        if s ∉ Lk−1 then
            return TRUE
        end
    end
    return FALSE
end
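
The three procedures above, condensed into a runnable Python sketch. Itemsets are frozensets, and the lexicographic prefix join is replaced by a simpler set-union join, which, combined with the prune step, yields the same candidates; the database is the four-transaction example from the earlier slides:

    from itertools import combinations

    def has_infrequent_subset(c, L_prev):
        """TRUE if some (k-1)-subset of candidate c is not frequent."""
        return any(frozenset(s) not in L_prev
                   for s in combinations(c, len(c) - 1))

    def apriori_gen(L_prev, k):
        """Join L_{k-1} with itself, then prune by the Apriori property."""
        C = set()
        for l1 in L_prev:
            for l2 in L_prev:
                u = l1 | l2
                if len(u) == k and not has_infrequent_subset(u, L_prev):
                    C.add(u)
        return C

    def apriori(D, min_sup):
        """Return every frequent itemset together with its support count."""
        D = [frozenset(t) for t in D]
        counts = {}
        for t in D:                               # frequent 1-itemsets
            for i in t:
                counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
        L = {s: c for s, c in counts.items() if c >= min_sup}
        result, k = dict(L), 2
        while L:
            C = apriori_gen(set(L), k)            # join + prune
            counts = {c: sum(1 for t in D if c <= t) for c in C}
            L = {s: c for s, c in counts.items() if c >= min_sup}
            result.update(L)
            k += 1
        return result

    D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    for s, c in sorted(apriori(D, 2).items(),
                       key=lambda x: (len(x[0]), sorted(x[0]))):
        print(sorted(s), c)
    # [1] 2, [2] 3, [3] 3, [5] 3, [1,3] 2, [2,3] 2, [2,5] 3, [3,5] 2, [2,3,5] 2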

Generating Strong Association Rules A ⇒ B
- Assume that a minimum support, minSup, and a minimum confidence, minConf, are defined, and D is the set of transactions.
- We want
  Support(A ⇒ B) = Support Count(A ∪ B) / |D| ≥ minSup,
  Confidence(A ⇒ B) = Support Count(A ∪ B) / Support Count(A) ≥ minConf.
- For each frequent itemset I (consider each frequent itemset from L2, L3, ...), generate all nonempty subsets of I.
- For every nonempty subset s of I, output the rule s ⇒ (I\s) if Support Count(I) / Support Count(s) ≥ minConf (see the sketch below).
- Support(A ⇒ B) ≥ minSup holds automatically for all such rules, since we use subsets of frequent itemsets.
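
A sketch of this rule-generation step, fed with the support counts that Apriori produced for the running example (minConf = 0.7):

    from itertools import combinations

    def strong_rules(sup_count, min_conf):
        """Emit s => (I - s) for each frequent I and nonempty proper s."""
        rules = []
        for I, cnt_I in sup_count.items():
            if len(I) < 2:
                continue
            for r in range(1, len(I)):
                for subset in combinations(I, r):
                    s = frozenset(subset)
                    conf = cnt_I / sup_count[s]
                    if conf >= min_conf:
                        rules.append((sorted(s), sorted(I - s), conf))
        return rules

    sup = {frozenset(k): v for k, v in {
        (1,): 2, (2,): 3, (3,): 3, (5,): 3, (1, 3): 2,
        (2, 3): 2, (2, 5): 3, (3, 5): 2, (2, 3, 5): 2}.items()}
    for a, b, conf in strong_rules(sup, 0.7):
        print(a, "=>", b, round(conf, 2))
    # [1]=>[3] 1.0, [2]=>[5] 1.0, [5]=>[2] 1.0, [2,3]=>[5] 1.0, [3,5]=>[2] 1.0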

Example: Association Rules from L2
Suppose our transaction database is as given below, with minSup = 2 and minConf = 0.7.

TID    Items
T001   1, 3, 4
T002   2, 3, 5
T003   1, 2, 3, 5
T004   2, 5

L1 itemsets and supports: {1}: 2; {2}, {3}, {5}: 3 each.
L2 itemsets and supports: {1, 3}: 2, {2, 3}: 2, {2, 5}: 3, {3, 5}: 2.

Rule           Confidence
{1} ⇒ {3}      2/2 = 100%
{2} ⇒ {3}      2/3 = 66.7%
{2} ⇒ {5}      3/3 = 100%
{3} ⇒ {5}      2/3 = 66.7%
{3} ⇒ {1}      2/3 = 66.7%
{3} ⇒ {2}      2/3 = 66.7%
{5} ⇒ {2}      3/3 = 100%
{5} ⇒ {3}      2/3 = 66.7%

{1} ⇒ {3}, {2} ⇒ {5}, and {5} ⇒ {2} are strong association rules from L2.

Example: Association Rules from L3
With the same database, minSup = 2, and minConf = 0.7, recall the supports: L1: {1}: 2; {2}, {3}, {5}: 3. L2: {1, 3}: 2, {2, 3}: 2, {2, 5}: 3, {3, 5}: 2. L3: {2, 3, 5}: 2.

Rules from the L3 itemset {2, 3, 5}:

Rule              Confidence
{2} ⇒ {3, 5}      2/3 = 66.7%
{3} ⇒ {2, 5}      2/3 = 66.7%
{5} ⇒ {2, 3}      2/3 = 66.7%
{2, 3} ⇒ {5}      2/2 = 100%
{2, 5} ⇒ {3}      2/3 = 66.7%
{3, 5} ⇒ {2}      2/2 = 100%

Hence, {2, 3} ⇒ {5} and {3, 5} ⇒ {2} are strong.

When Strong Association Rules Go Bad
Suppose
- the number of transactions is |D| = 10000;
- the number of transactions containing computer games is 6000;
- the number containing videos is 7500;
- the number containing both computer games and videos is 4000;
- Apriori generated, based on minSup = 0.3 and minConf = 0.6, the association rule

  Computer Games ⇒ Videos [Sup = 40%, Conf = 66%]

- This is a strong association rule!
- But the probability of purchasing videos is 7500/10000 = 75%.
- This is bad: computer games and videos are actually negatively correlated, since purchasing computer games decreases the likelihood of purchasing videos.

Correlation: Lift
The idea is that the occurrence of A (based on the transactions in which A appears) is independent of the occurrence of B if Pr(A ∪ B) = Pr(A)Pr(B). Lift is defined to be

lift(A, B) = Pr(A ∪ B) / (Pr(A)Pr(B)) = Pr(B | A) / Pr(B) = Confidence(A ⇒ B) / Support(B),

where Support(B) is the relative support.
- If lift(A, B) < 1, then the occurrence of A is negatively correlated with the occurrence of B.
- If lift(A, B) > 1, then the occurrence of A is positively correlated with the occurrence of B.
- If lift(A, B) = 1, then A and B are independent of each other; any association rule between these variables may not be useful.
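
Plugging the computer-games/videos numbers from the previous slide into this formula, as a quick check:

    p_games, p_videos = 6000 / 10000, 7500 / 10000
    p_both = 4000 / 10000
    lift = p_both / (p_games * p_videos)   # Pr(A u B) / (Pr(A) Pr(B))
    print(round(lift, 3))                  # 0.889 < 1: negatively correlated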

A Tiny Example
Suppose you have noted the following transactions. Our minimum support threshold is 3 and our minimum confidence threshold is 50%.

TID    Items
T001   {1, 2, 3, 4}
T002   {1, 2}
T003   {2, 3, 4}
T004   {2, 3}
T005   {1, 2, 4}
T006   {3, 4}
T007   {2, 4}

L1     Support      L2        Support
{1}    3            {1, 2}    3
{2}    6            {2, 3}    3
{3}    4            {2, 4}    4
{4}    5            {3, 4}    3

Note that no 3-itemset made the minimum support threshold of 3, so L3 = ∅.

Example - The Association Rules Sorted by Lift
Using lift(A, B) = Conf(A ⇒ B) / (Support(B)/|D|) with |D| = 7:

A ⇒ B            Conf(A ⇒ B)   Support(B)/|D|   Lift(A, B)
{1} ⇒ {2}        3/3           6/7              7/6
{2} ⇒ {1}        1/2           3/7              7/6
{1} ⇒ {2, 4}     2/3           4/7              7/6
{2, 4} ⇒ {1}     1/2           3/7              7/6
{3} ⇒ {4}        3/4           5/7              21/20
{4} ⇒ {3}        3/5           4/7              21/20
{4} ⇒ {2}        4/5           6/7              14/15
{2} ⇒ {4}        2/3           5/7              14/15
{1} ⇒ {4}        2/3           5/7              14/15
{1, 2} ⇒ {4}     2/3           5/7              14/15
{2, 3} ⇒ {4}     2/3           5/7              14/15
{3} ⇒ {2}        3/4           6/7              7/8
{2} ⇒ {3}        1/2           4/7              7/8
{2, 4} ⇒ {3}     1/2           4/7              7/8
{3, 4} ⇒ {2}     2/3           6/7              7/9

Example - Some Conclusions
- {1} ⇒ {2} has the highest lift ratio (7/6) and 100% confidence.
  - Offering a coupon for item 1 may entice the buyer to purchase item 2.
  - The price of item 2 might be slightly increased to further enhance revenue from the bundled purchase.
  - Buying a cat is followed by buying cat litter.
- {2} ⇒ {1} has the same lift (lift is symmetric) but only 50% confidence.
  - So, coupons for item 2 will probably not entice someone to purchase item 1!
  - Buying cat litter need not be followed by buying a cat.

Other Pattern Evaluation Measures: χ² Test for Independence
- χ²: recall that we need the observed and expected values. The statistic is

  χ² = Σ (observed − expected)² / expected.

- The hypotheses are
  H0: the variables are independent;
  Ha: the variables are related.
- Degrees of freedom: df = (Rows − 1)(Columns − 1).
- Reject H0 when χ² > χ²α.
- Some statisticians suggest that each cell count in a contingency table should be at least 5.
The typical contingency table for A ⇒ B is shown below. Ā represents transactions that do not contain itemset A, and B̄ represents transactions that do not contain B. Parentheses hold the expected counts.

       A          Ā
B      c1 (d1)    c2 (d2)
B̄      c3 (d3)    c4 (d4)

- Note that for χ², a value close to zero indicates a lack of evidence that the variables are related; they may be independent (fail to reject H0).
- Larger values (exceeding the critical value χ²α) indicate dependence (reject H0).
- If an observed value is smaller than its expected value, then there is a negative correlation in the rule A ⇒ B.
- If an observed value is larger than its expected value, then there is a positive correlation in the rule A ⇒ B.
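
A short sketch of the statistic for a 2×2 contingency table; the counts used here are those of data set D4 from the milk/coffee example later in these slides:

    def chi2_2x2(o11, o12, o21, o22):
        """Chi-square statistic for a 2x2 contingency table."""
        n = o11 + o12 + o21 + o22
        rows = (o11 + o12, o21 + o22)
        cols = (o11 + o21, o12 + o22)
        obs = ((o11, o12), (o21, o22))
        expected = lambda i, j: rows[i] * cols[j] / n
        return sum((obs[i][j] - expected(i, j)) ** 2 / expected(i, j)
                   for i in range(2) for j in range(2))

    print(round(chi2_2x2(1000, 1000, 1000, 100000)))  # 24740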

Other Pattern Evaluation Measures: All Confidence
Given two itemsets A and B,

All Conf(A, B) = Support(A ∪ B) / max{Support(A), Support(B)}
             = min{Pr(A | B), Pr(B | A)}
             = min{Conf(B ⇒ A), Conf(A ⇒ B)}.

This is the minimum confidence of the two association rules A ⇒ B and B ⇒ A.

This measure is anti-monotonic (the Apriori property): if an itemset can't pass a minimum association threshold, none of its supersets will pass it. Very useful for pruning.

Other Pattern Evaluation Measures: Max Confidence
Given two itemsets A and B,

Max Conf(A, B) = Support(A ∪ B) / min{Support(A), Support(B)}
             = max{Pr(A | B), Pr(B | A)}
             = max{Conf(B ⇒ A), Conf(A ⇒ B)}.

This is the maximum confidence of the two association rules A ⇒ B and B ⇒ A.

This measure is monotonic: if an itemset's association is no less than γ, then all of its supersets' associations will be no less than γ.

Other Pattern Evaluation Measures: Kulczynski
Given two itemsets A and B,

Kulc(A, B) = (1/2)(Pr(A | B) + Pr(B | A))
           = (1/2)(Conf(B ⇒ A) + Conf(A ⇒ B)).

This is the average confidence of the two association rules A ⇒ B and B ⇒ A, i.e. their arithmetic mean.

It has neither the monotonicity nor the anti-monotonicity property.

Other Pattern Evaluation Measures: Cosine
Given two itemsets A and B,

Cosine(A, B) = Pr(A ∪ B) / √(Pr(A)Pr(B))
            = √(Pr(A | B) Pr(B | A))
            = √(Conf(B ⇒ A) Conf(A ⇒ B)).

This is called a harmonized lift measure: the square root in the denominator ensures that the cosine value is influenced only by the supports of A, B, and A ∪ B. It is also the geometric mean of the two confidences. (Note that the generalized mean M⁻¹(A, B) below gives the harmonic mean.)

It has neither the monotonicity nor the anti-monotonicity property.
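
The four measures side by side in one short sketch; the sample counts are those of data set D5 from the null-transaction example below (Support(m) = 11000, Support(c) = 1100, Support(mc) = 1000):

    from math import sqrt

    def measures(sup_a, sup_b, sup_ab):
        """AllConf, MaxConf, Kulc, Cosine from (absolute) supports."""
        conf_ba = sup_ab / sup_b      # Pr(A | B) = Conf(B => A)
        conf_ab = sup_ab / sup_a      # Pr(B | A) = Conf(A => B)
        return {"all_conf": min(conf_ba, conf_ab),
                "max_conf": max(conf_ba, conf_ab),
                "kulc": (conf_ba + conf_ab) / 2,
                "cosine": sqrt(conf_ba * conf_ab)}

    print(measures(11000, 1100, 1000))
    # all_conf 0.09, max_conf 0.91, kulc 0.5, cosine 0.29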

Mathematical Generalized Mean
See "Re-examination of interestingness measures in pattern mining: a unified framework" by Wu, Chen, & Han, 2009.

The mathematical generalized mean is defined by

M^k(A, B) = ((Pr(A | B)^k + Pr(B | A)^k) / 2)^(1/k),

where M is the mathematical generalized mean and k ∈ ℝ.

Theorem. For all k ∈ ℝ, M^k satisfies the following properties:
P1 M^k ∈ [0, 1];
P2 M^k monotonically decreases with Support(A) (or Support(B)) when Support(A ∪ B) and Support(B) (or Support(A)) remain constant;
P3 M^k is symmetric under item permutations;
P4 M^k is invariant to scaling, i.e. multiplying Support(A ∪ B), Support(A), and Support(B) by a common scaling factor does not affect the measure.

Using the Mathematical Generalized Mean

Theorem. Recall that for two itemsets A and B, the mathematical generalized mean is defined by

M^k(A, B) = ((Pr(A | B)^k + Pr(B | A)^k) / 2)^(1/k),

where k ∈ ℝ. The following hold:
- All Conf(A, B) = lim_{k→−∞} M^k(A, B) = min{Pr(A | B), Pr(B | A)};
- Cosine(A, B) = lim_{k→0} M^k(A, B) = √(Pr(A | B) Pr(B | A));
- Kulc(A, B) = lim_{k→1} M^k(A, B) = (Pr(A | B) + Pr(B | A)) / 2;
- Max Conf(A, B) = lim_{k→∞} M^k(A, B) = max{Pr(A | B), Pr(B | A)}.
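
A quick numerical check of these limits, using the D5 confidences Pr(m | c) ≈ 0.909 and Pr(c | m) ≈ 0.091 from the example below:

    def gen_mean(p1, p2, k):
        """M^k(A,B) = ((Pr(A|B)^k + Pr(B|A)^k) / 2)^(1/k)."""
        return ((p1 ** k + p2 ** k) / 2) ** (1 / k)

    p1, p2 = 0.909, 0.091
    print(gen_mean(p1, p2, -50))    # ~0.092 -> All Confidence (k -> -inf)
    print(gen_mean(p1, p2, 1e-9))   # ~0.288 -> Cosine         (k -> 0)
    print(gen_mean(p1, p2, 1))      # 0.500  -> Kulczynski     (k = 1)
    print(gen_mean(p1, p2, 50))     # ~0.896 -> Max Confidence (k -> +inf)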

Common Properties for All Confidence, Max Confidence, Kulczynski, and Cosine Measures
- The measures are influenced by Pr(A | B) and Pr(B | A) (that is, by the supports of A, B, and A ∪ B), but not by the total number of transactions.
- They return values between 0 and 1, with values closer to 1 indicating a stronger correlation.
- For all itemsets A and B,

  All Conf(A, B) ≤ Cosine(A, B) ≤ Kulc(A, B) ≤ Max Conf(A, B).

Null Transactions
A null-transaction is a transaction that does not contain any of the itemsets being examined. A measure is called null-invariant if it is not influenced by null-transactions.
- Often, null-transactions vastly outnumber the individual purchases being examined.
- Lift and χ² are sensitive to null-transactions.
- The All Confidence, Max Confidence, Kulczynski, and Cosine measures remove the influence of null-transactions and, hence, are null-invariant.

Example
Suppose we have milk and coffee under study. Let m denote milk and c denote coffee. Then c̄ denotes transactions that do not contain coffee, m̄ denotes transactions that do not contain milk, and m̄c̄ denotes null-transactions that contain neither milk nor coffee.

         milk   m̄ (no milk)   Σrow
coffee   mc     m̄c            c
c̄        mc̄     m̄c̄            c̄
Σcol     m      m̄

Data Set   mc      m̄c     mc̄      m̄c̄
D1         10000   1000   1000    100000
D2         10000   1000   1000    100
D3         100     1000   1000    100000
D4         1000    1000   1000    100000
D5         1000    100    10000   100000
D6         1000    10     100000  100000

Example (continued)

For the six data sets above:
- mc is positively associated in D1 and D2 since, for either set, Support(mc) = 10000 > Support(m̄c) = 1000 and Support(mc) = 10000 > Support(mc̄) = 1000.
- mc is negatively associated in D3 since Support(mc) = 100 < Support(m̄c) = 1000 and Support(mc) = 100 < Support(mc̄) = 1000.
- mc is neutral in D4 since Support(mc) = 1000 = Support(m̄c) and Support(mc) = 1000 = Support(mc̄).

Lift and χ² - Sensitive to Null Transactions
Data Set   mc      m̄c     mc̄      m̄c̄      χ²      lift
D1         10000   1000   1000    100000   90557   9.26
D2         10000   1000   1000    100      0       1
D3         100     1000   1000    100000   670     8.44
D4         1000    1000   1000    100000   24740   25.75
D5         1000    100    10000   100000   8173    9.18
D6         1000    10     100000  100000   965     1.97

- mc is positively associated in D1 and D2, negatively associated in D3, and neutral in D4.
- However, D1 and D2 have dramatically different results under χ² and lift! This is due to sensitivity to the null-transactions m̄c̄.
- For D3, a negative association is not indicated.
- For D4, a very strong positive association is indicated!

All Confidence and Max Confidence
With the six data sets from the previous slides:
- The association of mc is positive in D1 and D2, negative in D3, and neutral in D4.
- D5: Support(mc)/Support(c) = 1000/(1000 + 100) = 0.909, so c ⇒ m should occur, while Support(mc)/Support(m) = 1000/(1000 + 10000) = 0.091, so m ⇒ c probably won't occur.
- D6: Support(mc)/Support(c) = 1000/(1000 + 10) = 0.990, so c ⇒ m should occur, while Support(mc)/Support(m) = 1000/(1000 + 100000) = 0.0099, so m ⇒ c probably won't occur.

All Confidence and Max Confidence (continued)
Data Set   mc      m̄c     mc̄      m̄c̄      AllConf   MaxConf
D1         10000   1000   1000    100000   0.91      0.91
D2         10000   1000   1000    100      0.91      0.91
D3         100     1000   1000    100000   0.09      0.09
D4         1000    1000   1000    100000   0.5       0.5
D5         1000    100    10000   100000   0.09      0.91
D6         1000    10     100000  100000   0.01      0.99

- The association of mc is positive in D1 and D2, negative in D3, and neutral in D4.
- D5: c ⇒ m should occur and m ⇒ c probably won't occur.
- D6: c ⇒ m should occur and m ⇒ c probably won't occur.
- AllConf and MaxConf agree on D1 through D4.
- AllConf and MaxConf give opposite results for D5 and D6.

Kulczynski and Cosine
Data Set   mc      m̄c     mc̄      m̄c̄      Kulc   Cosine
D1         10000   1000   1000    100000   0.91   0.91
D2         10000   1000   1000    100      0.91   0.91
D3         100     1000   1000    100000   0.09   0.09
D4         1000    1000   1000    100000   0.5    0.5
D5         1000    100    10000   100000   0.5    0.29
D6         1000    10     100000  100000   0.5    0.10

- The association of mc is positive in D1 and D2, negative in D3, and neutral in D4.
- D5: c ⇒ m probably yes, and m ⇒ c probably no.
- D6: c ⇒ m probably yes, and m ⇒ c probably no.
- Kulc and Cosine agree on D1 through D4.
- Kulc views (perhaps correctly?) D5 and D6 as neutral.
- Cosine views D5 and D6 as having a negative association.

When are A and B "Controversial"?
To measure how far "out of whack" A and B are, we want a function IR(A, B) with the following properties:
Q1 IR(A, B) ∈ [0, 1], with
   IR(A, B) = 0 ⟺ Pr(A | B) = Pr(B | A), and
   IR(A, B) = 1 ⟺ |Pr(A | B) − Pr(B | A)| = 1.
Q2 IR(A, B) = IR(B, A) (symmetry).
Q3 IR(A, B) monotonically decreases as Support(AB) increases while Support(AB̄) and Support(ĀB) are fixed. (IR decreases when A and B co-occur in more transactions.)
Q4 If Support(AB) and Support(ĀB) are fixed, then
   if Sup(AB̄) ≥ Sup(ĀB), IR increases as Sup(AB̄) increases;
   if Sup(AB̄) ≤ Sup(ĀB), IR increases as Sup(AB̄) decreases.
   Thus, IR increases as Sup(AB̄) and Sup(ĀB) move apart. Similar results hold for Sup(ĀB) due to symmetry.

Tiny Set Theory
For any two sets U and V, we have U = (U ∩ V) ∪ (U ∩ V̄). Hence,
- Support(A) = Support(AB) + Support(AB̄),
- Support(B) = Support(AB) + Support(ĀB),
- Support(A) − Support(B) = Support(AB̄) − Support(ĀB).

Imbalance Ratio
The imbalance ratio, which assesses the imbalance of the two itemsets A and B, is

IR(A, B) = |Support(A) − Support(B)| / (Support(A) + Support(B) − Support(A ∪ B))
         = |Support(AB̄) − Support(ĀB)| / (Support(AB) + Support(AB̄) + Support(ĀB)).

- The numerator is the (absolute) difference between the supports of the itemsets A and B.
- The denominator is the number of transactions containing A or B (or both).
- If Conf(A ⇒ B) = Conf(B ⇒ A), then IR(A, B) = 0. Otherwise, the further the two confidences are out of agreement, the larger the ratio.
- IR is independent of the number of null-transactions as well as of the total number of transactions.

Imbalance Ratio (continued)
Data Set   mc      m̄c     mc̄      m̄c̄      IR      Kulc
D1         10000   1000   1000    100000   0       0.91
D2         10000   1000   1000    100      0       0.91
D3         100     1000   1000    100000   0       0.09
D4         1000    1000   1000    100000   0       0.5
D5         1000    100    10000   100000   0.89    0.5
D6         1000    10     100000  100000   0.99    0.5

- The association of mc is positive in D1 and D2, negative in D3, and neutral in D4.
- D5: c ⇒ m probably yes, and m ⇒ c probably no.
- D6: c ⇒ m probably yes, and m ⇒ c probably no.
- D5: IR(m, c) = |(1000 + 10000) − (1000 + 100)| / ((1000 + 10000) + (1000 + 100) − 1000) = 9900/11100 ≈ 0.89.
  D6: IR(m, c) ≈ 0.99. Both indicate an imbalance.
- Thus, Kulc and IR together give a better picture, with D4 being neutral and D5, D6 being unbalanced because c ⇒ m is "yes" and m ⇒ c is "no".
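
A sketch computing Kulc and IR directly from the three non-null cells of the contingency table; note that the null cell m̄c̄ never enters, which is exactly the null-invariance of both measures:

    def kulc_ir(both, only_c, only_m):
        """Kulczynski and imbalance ratio from support counts:
        both = Support(mc), only_c = Support(m-bar c), only_m = Support(m c-bar)."""
        sup_m, sup_c = both + only_m, both + only_c
        kulc = (both / sup_m + both / sup_c) / 2
        ir = abs(sup_m - sup_c) / (sup_m + sup_c - both)
        return round(kulc, 2), round(ir, 2)

    print(kulc_ir(1000, 100, 10000))    # D5: (0.5, 0.89)
    print(kulc_ir(1000, 10, 100000))    # D6: (0.5, 0.99)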

Homework
Chapter 6, Problems: 8, 11, 14