Data Mining
Association Analysis: Basic Concepts and Algorithms

Lecture Notes for Chapter 6
Introduction to Data Mining
by Tan, Steinbach, Kumar

Alternative Methods for Frequent Itemset Generation

- Traversal of Itemset Lattice
  – General-to-specific vs Specific-to-general
  – Equivalence Classes
  – Breadth-first vs Depth-first
FP-growth Algorithm

- Uses a compressed representation of the database in the form of an FP-tree.
- Once the FP-tree has been constructed, a recursive divide-and-conquer approach is used to mine the frequent itemsets.
- Requires only two passes over the database.
FP-tree Construction: Pass 1 and Pass 2

Pass 1 scans the database once and counts the support of each item (Minsup = 2):

TID  Items        Item  Count
 1   ABE          A     6
 2   BCF          B     7
 3   BD           C     6
 4   ABC          D     2
 5   AC           E     2
 6   BC           F     1
 7   AC
 8   ABCE
 9   ABD

F is infrequent (count 1 < Minsup) and is discarded. In Pass 2 the items of each
transaction are reordered in decreasing order of support (B, A, C, D, E) before
the transaction is inserted into the FP-tree:

TID  Items
 1   BAE
 2   BC
 3   BD
 4   BAC
 5   AC
 6   BC
 7   AC
 8   BACE
 9   BAD
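A small sketch of these two passes in plain Python (my own illustration, not the authors' code); the transaction list and Minsup value are taken from the example above.

    from collections import Counter

    MINSUP = 2
    transactions = ["ABE", "BCF", "BD", "ABC", "AC", "BC", "AC", "ABCE", "ABD"]

    # Pass 1: count the support of every item.
    support = Counter(item for t in transactions for item in t)   # {'B': 7, 'A': 6, 'C': 6, ...}

    # Pass 2: drop infrequent items (here only F) and sort the remaining items of
    # each transaction by decreasing support, breaking ties alphabetically.
    def reorder(t):
        kept = [item for item in t if support[item] >= MINSUP]
        return "".join(sorted(kept, key=lambda item: (-support[item], item)))

    print([reorder(t) for t in transactions])
    # ['BAE', 'BC', 'BD', 'BAC', 'AC', 'BC', 'AC', 'BACE', 'BAD']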
FP-tree Construction

The reordered transactions are inserted into the FP-tree one at a time, starting
from the root node (null). Each node stores an item and a count; a new
transaction either follows an existing path, incrementing the counts along it,
or branches off and creates new nodes with count 1.

[Figures: snapshots of the FP-tree after each transaction is read]

- After TID 1 (BAE): null – B:1 – A:1 – E:1
- After TID 2 (BC):  B becomes B:2 and gains a new child C:1
- After TID 3 (BD):  B becomes B:3 and gains a new child D:1
- After TID 4 (BAC): B:4, A:2, and A gains a new child C:1
- After TID 5 (AC):  a new branch A:1 – C:1 is created directly under the root
- After TID 6 (BC):  B:5, and its child C becomes C:2
- After TID 7 (AC):  the root-level branch becomes A:2 – C:2
- After TID 8 (BACE): B:6, A:3, the C under A becomes C:2 and gains a child E:1
- After TID 9 (BAD): B:7, A:4, and A gains a new child D:1

The final FP-tree contains the following paths from the root:
  B:7 – A:4 – E:1
  B:7 – A:4 – C:2 – E:1
  B:7 – A:4 – D:1
  B:7 – C:2
  B:7 – D:1
  A:2 – C:2
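A minimal sketch of how such an FP-tree could be built (my own illustration; the FPNode class and its method names are not from the lecture notes). The transactions are assumed to already be in the decreasing-support order produced by Pass 2.

    class FPNode:
        def __init__(self, item, parent=None):
            self.item = item        # item label; None for the root
            self.count = 0          # number of transactions passing through this node
            self.parent = parent
            self.children = {}      # item -> FPNode

        def insert(self, items):
            """Insert one reordered transaction below this node."""
            if not items:
                return
            first, rest = items[0], items[1:]
            child = self.children.get(first)
            if child is None:
                child = self.children[first] = FPNode(first, parent=self)
            child.count += 1
            child.insert(rest)

    def show(node, depth=0):
        for child in node.children.values():
            print("  " * depth + f"{child.item}:{child.count}")
            show(child, depth + 1)

    root = FPNode(None)
    for t in ["BAE", "BC", "BD", "BAC", "AC", "BC", "AC", "BACE", "BAD"]:
        root.insert(list(t))
    show(root)   # prints the tree above: B:7, A:4, E:1, C:2, E:1, D:1, C:2, D:1, A:2, C:2

A full FP-growth implementation additionally keeps a header table with node links (a chain of all nodes carrying the same item), which is what makes it cheap to find all paths containing a given item during mining; that bookkeeping is omitted here for brevity.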
FP-tree Construction: Reverse Order

If the items in each transaction are instead sorted in increasing order of
support (the reverse of the order used above), the transactions become:

TID  Items
 1   EAB
 2   CB
 3   DB
 4   CAB
 5   CA
 6   CB
 7   CA
 8   ECAB
 9   DAB

Item supports are unchanged (B:7, A:6, C:6, D:2, E:2), but the resulting FP-tree
has many more branches with smaller node counts: less compression! Sorting the
items in decreasing order of support tends to produce a smaller, more heavily
shared tree.

[Figure: FP-tree built from the reverse ordering, compared with the tree above]
FP-growth: Finding Frequent Itemsets

- After the FP-tree has been constructed, it is used to generate all frequent itemsets.
- We generate all frequent itemsets ending in E, D, C, A, and B, respectively.
- To generate all frequent itemsets ending in E, we first generate all those ending in DE, CE, AE, and BE.

FP-growth: Patterns Ending in E

[Figure: the FP-tree with the paths containing E highlighted]

- The path B:7 – A:4 – E:1 gives the conditional pattern BA:1.
- The path B:7 – A:4 – C:2 – E:1 gives the conditional pattern BAC:1.

Conditional Pattern Tree for E

- Conditional pattern base: BA:1, BAC:1
- Item counts within the base: B:2, A:2, C:1
- C is infrequent, so it can be removed: CE and all its supersets are infrequent.
- The remaining conditional tree is the single path B:2 – A:2.
- Output βE, where β is any subset of {B, A}. The support of such a pattern is the minimum item support appearing in β.
- Hence, we output: BAE:2, BE:2 and AE:2.
FP-growth: Patterns Ending in C

[Figure: the FP-tree with the paths containing C highlighted]

- The path B – A – C:2 gives the conditional pattern BA:2.
- The path B – C:2 gives the conditional pattern B:2.
- The path A – C:2 gives the conditional pattern A:2.

Conditional FP-Tree for C

- Conditional pattern base: BA:2, B:2, A:2
- Item counts within the base: B:4, A:4
- The conditional tree (B:4 – A:2 plus a separate A:2 branch) is not a single path, so:
  (1) Output iC, where i is any item in the tree, with the support of i. Hence, we output: BC:4, AC:4.
  (2) Recurse.

Conditional FP-Tree for AC

- Conditional pattern base: B:2
- The tree is the single path B:2: output BAC:2.
FP-growth Algorithm

FP-growth(T(α), α):
    IF T(α) is a single path P THEN
        FOR ALL β ⊆ P: output β ∪ α, with support equal to the minimum support
        in T(α) of the items occurring in β
    ELSE
        FOR EACH item i in T(α):
            output β = {i} ∪ α, with the support of i in T(α)
            construct T(β)
            IF T(β) is not empty THEN call FP-growth(T(β), β)
        END FOR
    END IF

Initial call: FP-growth(T(∅), ∅).
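Below is a small self-contained sketch of this recursion (my own illustration, not the authors' implementation). For brevity it mines conditional pattern bases directly, as lists of (prefix, count) pairs, instead of materializing each conditional FP-tree T(β); the divide-and-conquer structure is otherwise the same.

    from collections import defaultdict

    MINSUP = 2

    def mine(pattern_base, suffix, results):
        """pattern_base: list of (items_tuple, count); suffix: tuple of items."""
        # Count item support within the conditional pattern base.
        support = defaultdict(int)
        for items, count in pattern_base:
            for item in items:
                support[item] += count
        # Every frequent item i yields the pattern {i} U suffix.
        for item, sup in sorted(support.items()):
            if sup < MINSUP:
                continue
            new_suffix = (item,) + suffix
            results[new_suffix] = sup
            # Build the conditional pattern base for the new suffix:
            # keep only the part of each prefix that precedes `item`.
            conditional = []
            for items, count in pattern_base:
                if item in items:
                    conditional.append((items[: items.index(item)], count))
            if conditional:
                mine(conditional, new_suffix, results)

    # Transactions from the running example, already reordered by decreasing
    # support (B, A, C, D, E) with the infrequent item F removed.
    transactions = ["BAE", "BC", "BD", "BAC", "AC", "BC", "AC", "BACE", "BAD"]
    results = {}
    mine([(tuple(t), 1) for t in transactions], (), results)
    for itemset, sup in sorted(results.items()):
        print("".join(itemset), sup)

Running it on the nine example transactions prints, among others, AE 2, BE 2, BAE 2, AC 4, BC 4 and BAC 2, matching the patterns derived by hand above.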
Pattern Evaluation

- Association rule algorithms tend to produce too many rules
  – many of them are uninteresting or redundant
  – e.g., {A,B,C} → {D} is redundant if {A,B} → {D} has the same support & confidence
- Interestingness measures can be used to prune or rank the derived patterns
- In the original formulation of association rules, support & confidence are the only measures used
Computing Interestingness Measures

Given a rule X → Y, the information needed to compute rule interestingness can
be obtained from a contingency table.

Contingency table for X → Y:

          Y     ¬Y
  X      f11   f10    f1+
  ¬X     f01   f00    f0+
         f+1   f+0    |T|

  f11: support of X and Y
  f10: support of X and ¬Y
  f01: support of ¬X and Y
  f00: support of ¬X and ¬Y

These counts are used to define various measures: support, confidence, lift,
Gini, J-measure, etc.

Drawback of Confidence

          Coffee   ¬Coffee
  Tea       15        5       20
  ¬Tea      75        5       80
            90       10      100

Association Rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75,
but P(Coffee) = 0.9 and P(Coffee | ¬Tea) = 75/80 = 0.9375.
⇒ Although the confidence is high, the rule is misleading.

Statistical Independence

- Population of 1000 students
  – 600 students know how to swim (S)
  – 700 students know how to bike (B)
  – 420 students know how to swim and bike (S,B)
- P(S∧B) = 420/1000 = 0.42
- P(S) × P(B) = 0.6 × 0.7 = 0.42
- P(S∧B) = P(S) × P(B) ⇒ statistical independence
- P(S∧B) > P(S) × P(B) ⇒ positively correlated
- P(S∧B) < P(S) × P(B) ⇒ negatively correlated

Statistical-based Measures

Measures that take statistical dependence into account:

  Lift = P(Y | X) / P(Y)

  Interest = P(X, Y) / ( P(X) P(Y) )

  PS = P(X, Y) − P(X) P(Y)

  φ-coefficient = ( P(X, Y) − P(X) P(Y) ) / sqrt( P(X) [1 − P(X)] P(Y) [1 − P(Y)] )
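A small helper (my own sketch, not from the lecture notes) that computes these measures from the four contingency-table counts f11, f10, f01, f00, applied to the Tea → Coffee table above.

    from math import sqrt

    def measures(f11, f10, f01, f00):
        n = f11 + f10 + f01 + f00
        p_x, p_y, p_xy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
        return {
            "confidence": p_xy / p_x,            # P(Y|X)
            "lift": p_xy / (p_x * p_y),          # equals Interest = P(X,Y)/(P(X)P(Y))
            "PS": p_xy - p_x * p_y,              # Piatetsky-Shapiro
            "phi": (p_xy - p_x * p_y) /
                   sqrt(p_x * (1 - p_x) * p_y * (1 - p_y)),
        }

    # Tea -> Coffee: f11 = 15, f10 = 5, f01 = 75, f00 = 5
    print(measures(15, 5, 75, 5))
    # confidence = 0.75, lift = 0.8333..., PS = -0.03, phi = -0.25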
Example: Lift/Interest

          Coffee   ¬Coffee
  Tea       15        5       20
  ¬Tea      75        5       80
            90       10      100

Association Rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9.
⇒ Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)

Statistical independence: if P(X,Y) = P(X) P(Y), then Lift = 1.

Drawback of Lift & Interest

        Y    ¬Y                    Y    ¬Y
  X    10     0     10       X    90     0     90
  ¬X    0    90     90       ¬X    0    10     10
       10    90    100             90    10    100

  Lift = 0.1 / (0.1 × 0.1) = 10        Lift = 0.9 / (0.9 × 0.9) = 1.11

In both tables X and Y always occur together, yet the rarer pattern (left) gets
a much higher lift than the more common pattern (right).
Comparing Different Measures

- There are many measures proposed in the literature.
- Some measures are good for certain applications, but not for others.
- What criteria should we use to determine whether a measure is good or bad?

10 examples of contingency tables:

  Example   f11    f10    f01    f00
  E1       8123     83    424   1370
  E2       8330      2    622   1046
  E3       9481     94    127    298
  E4       3954   3080      5   2961
  E5       2886   1363   1320   4431
  E6       1500   2000    500   6000
  E7       4000   2000   1000   3000
  E8       4000   2000   2000   2000
  E9       1720   7121      5   1154
  E10        61   2483      4   7452

[Figure: rankings of the 10 contingency tables under various measures]
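As an illustration (my own sketch, not from the lecture notes), the following ranks the ten tables by two of the measures; the two orderings differ, which is exactly the point of the comparison.

    from math import sqrt

    tables = {
        "E1": (8123, 83, 424, 1370),    "E2": (8330, 2, 622, 1046),
        "E3": (9481, 94, 127, 298),     "E4": (3954, 3080, 5, 2961),
        "E5": (2886, 1363, 1320, 4431), "E6": (1500, 2000, 500, 6000),
        "E7": (4000, 2000, 1000, 3000), "E8": (4000, 2000, 2000, 2000),
        "E9": (1720, 7121, 5, 1154),    "E10": (61, 2483, 4, 7452),
    }

    def lift(f11, f10, f01, f00):
        n = f11 + f10 + f01 + f00
        return (f11 / n) / (((f11 + f10) / n) * ((f11 + f01) / n))

    def phi(f11, f10, f01, f00):
        n = f11 + f10 + f01 + f00
        px, py, pxy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
        return (pxy - px * py) / sqrt(px * (1 - px) * py * (1 - py))

    for name, measure in (("lift", lift), ("phi", phi)):
        ranking = sorted(tables, key=lambda e: measure(*tables[e]), reverse=True)
        print(name, ranking)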
Property under Variable Permutation

Does M(A,B) = M(B,A)? In other words, is the measure unchanged when the roles
of A and B (the rows and columns of the contingency table) are swapped?

       B    ¬B              A    ¬A
  A    p     q         B    p     r
  ¬A   r     s         ¬B   q     s

Example: Confidence
  c(A → B) = P(B|A) = σ(AB) / σ(A) = p / (p+q)
  c(B → A) = P(A|B) = σ(AB) / σ(B) = p / (p+r)
Hence, confidence is not symmetric.

- Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
- Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.
Property under Row/Column Scaling

Grade-Gender Example (Mosteller, 1968):

          Male   Female                  Male   Female
  High      2       3      5     High      4      30     34
  Low       1       4      5     Low       2      40     42
            3       7     10               6      70     76

The second table is obtained from the first by scaling the Male column by
k1 = 2 and the Female column by k2 = 10.

Mosteller: the underlying association should be independent of the relative
number of male and female students in the samples.

Example: cross-product ratio

  cpr = f(H,M) f(L,F) / ( f(H,F) f(L,M) )

After column scaling:

  cpr = k1 f(H,M) k2 f(L,F) / ( k2 f(H,F) k1 f(L,M) ) = f(H,M) f(L,F) / ( f(H,F) f(L,M) )

So the cross-product ratio is invariant under row/column scaling.
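A quick numerical check (my own sketch) that the cross-product ratio of the Grade-Gender table is unchanged when the Male column is scaled by k1 = 2 and the Female column by k2 = 10.

    def cpr(h_m, h_f, l_m, l_f):
        # cross-product ratio f(H,M) f(L,F) / ( f(H,F) f(L,M) )
        return (h_m * l_f) / (h_f * l_m)

    print(cpr(2, 3, 1, 4))                     # 2*4 / (3*1)   = 2.666...
    print(cpr(2 * 2, 3 * 10, 1 * 2, 4 * 10))   # 4*40 / (30*2) = 2.666...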
Property under Inversion Operation

[Figure: binary item vectors (a), (b), (c) over transactions 1..N and items A–F;
vector (b) is obtained from (a) by inverting every bit]

Under inversion the 0s and 1s of an item column are swapped, i.e., co-presence
and co-absence are exchanged.

Example: φ-Coefficient

The φ-coefficient is analogous to the correlation coefficient for continuous
variables.

        Y    ¬Y                      Y    ¬Y
  X    60    10     70         X    20    10     30
  ¬X   10    20     30         ¬X   10    60     70
       70    30    100              30    70    100

  φ = (0.6 − 0.7 × 0.7) / sqrt(0.7 × 0.3 × 0.7 × 0.3) = 0.5238
  φ = (0.2 − 0.3 × 0.3) / sqrt(0.3 × 0.7 × 0.3 × 0.7) = 0.5238

The φ coefficient is the same for both tables, even though the first table is
dominated by co-occurrences of X and Y and the second by their co-absences.

Invariant under Null Addition?

       B    ¬B              B    ¬B
  A    p     q         A    p     q
  ¬A   r     s         ¬A   r    s + k

Null addition adds k transactions in which neither A nor B were bought.

Example: Confidence
  c(A → B) = P(B|A) = σ(AB) / σ(A) = p / (p + q)
Hence confidence is invariant under null addition.

- Invariant measures: support, cosine, Jaccard, etc.
- Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc.
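A quick check (my own sketch) of the two claims above: the φ-coefficient is identical for the two tables, and confidence does not change when null transactions (f00) are added.

    from math import sqrt

    def phi(f11, f10, f01, f00):
        n = f11 + f10 + f01 + f00
        px, py, pxy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
        return (pxy - px * py) / sqrt(px * (1 - px) * py * (1 - py))

    def confidence(f11, f10, f01, f00):
        return f11 / (f11 + f10)   # does not depend on f00

    print(phi(60, 10, 10, 20), phi(20, 10, 10, 60))                        # both 0.5238...
    print(confidence(60, 10, 10, 20), confidence(60, 10, 10, 20 + 1000))   # both 0.857...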
Different Measures have Different Properties

[Table: symbol, measure, and range for 21 measures of association (Correlation Φ,
Lambda λ, Odds ratio α, Yule's Q, Yule's Y, Cohen's κ, Mutual Information M,
J-Measure J, Gini Index G, Support s, Confidence c, Laplace L, Conviction V,
Interest I, IS (cosine), Piatetsky-Shapiro's PS, Certainty factor F, Added value AV,
Collective strength S, Jaccard ζ, Klosgen's K), with Yes/No entries indicating
which of the properties P1–P3 and O1–O4 each measure satisfies]

Subjective Interestingness Measure

- Objective measure:
  – Rank patterns based on statistics computed from data
  – e.g., the 21 measures of association above (support, confidence, Laplace, Gini, mutual information, Jaccard, etc.)
- Subjective measure:
  – Rank patterns according to the user's interpretation
  – A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin)
  – A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin)
Interestingness via Unexpectedness

- Need to model the expectation of users (domain knowledge).

[Figure: patterns classified by whether they are expected (+) or not expected (−)
to be frequent versus whether they are actually found (+) or not found (−) to be
frequent; the matching combinations are the expected patterns, the mismatched
combinations are the unexpected patterns]

- Need to combine the expectation of users with evidence from the data (i.e., the extracted patterns).