Association Analysis: Basic Concepts and Algorithms

Dr. Hui Xiong
Rutgers University
Introduction to Data Mining


Association Analysis: Basic Concepts and Algorithms

Basic Concepts

Association Rule Mining

• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions:

  TID  Items
  1    Bread, Milk
  2    Bread, Diaper, Beer, Eggs
  3    Milk, Diaper, Beer, Coke
  4    Bread, Milk, Diaper, Beer
  5    Bread, Milk, Diaper, Coke

Example of Association Rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

• Itemset
  – A collection of one or more items
    · Example: {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

  TID  Items
  1    Bread, Milk
  2    Bread, Diaper, Beer, Eggs
  3    Milk, Diaper, Beer, Coke
  4    Bread, Milk, Diaper, Beer
  5    Bread, Milk, Diaper, Coke

Definition: Association Rule

• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
  – Support (s)
    · Fraction of transactions that contain both X and Y
  – Confidence (c)
    · Measures how often items in Y appear in transactions that contain X

  TID  Items
  1    Bread, Milk
  2    Bread, Diaper, Beer, Eggs
  3    Milk, Diaper, Beer, Coke
  4    Bread, Milk, Diaper, Beer
  5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} ⇒ {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67

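The rule metrics above can be checked with a few lines of code. The following is a minimal sketch (not part of the original slides); the transaction list mirrors the market-basket table, and the function names are illustrative.

# Minimal sketch: support and confidence of {Milk, Diaper} -> {Beer}
# over the five example transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)              # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions) # 2/3 = 0.67
print(f"support = {s:.2f}, confidence = {c:.2f}")
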
Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold
• Brute-force approach:
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf thresholds
  ⇒ Computationally prohibitive!

Computational Complexity

• Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:

      R = Σ_{k=1}^{d−1} [ C(d,k) × Σ_{j=1}^{d−k} C(d−k,j) ]
        = 3^d − 2^(d+1) + 1

    If d = 6, R = 602 rules

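As a quick sanity check of the closed form above, the sketch below (not from the slides) enumerates the double sum directly and compares it with 3^d − 2^(d+1) + 1; math.comb is assumed to be available (Python 3.8+).

# Verify R = sum_{k=1}^{d-1} C(d,k) * sum_{j=1}^{d-k} C(d-k,j) = 3^d - 2^(d+1) + 1
from math import comb

def rule_count(d):
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

for d in (3, 6, 10):
    assert rule_count(d) == 3**d - 2**(d + 1) + 1
print(rule_count(6))  # 602, as on the slide
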
Mining Association Rules

  TID  Items
  1    Bread, Milk
  2    Bread, Diaper, Beer, Eggs
  3    Milk, Diaper, Beer, Coke
  4    Bread, Milk, Diaper, Beer
  5    Bread, Milk, Diaper, Coke

Example of Rules:
  {Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements


Mining Association Rules

• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ minsup
  2. Rule Generation
     – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive

Frequent Itemset Generation

[Figure: the itemset lattice over items {A, B, C, D, E}, from the null set at the top through all 1-itemsets, 2-itemsets, and so on down to ABCDE.]

Given d items, there are 2^d possible candidate itemsets.


Frequent Itemset Generation

• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database

  TID  Items
  1    Bread, Milk
  2    Bread, Diaper, Beer, Eggs
  3    Milk, Diaper, Beer, Coke
  4    Bread, Milk, Diaper, Beer
  5    Bread, Milk, Diaper, Coke

  – Match each transaction against every candidate
  – Complexity ~ O(NMw) ⇒ expensive since M = 2^d !!!


Frequent Itemset Generation Strategies

• Reduce the number of candidates (M)
  – Complete search: M = 2^d
  – Use pruning techniques to reduce M
• Reduce the number of transactions (N)
  – Reduce size of N as the size of itemset increases
  – Used by DHP and vertical-based mining algorithms
• Reduce the number of comparisons (NM)
  – Use efficient data structures to store the candidates or transactions
  – No need to match every candidate against every transaction

Reducing Number of Candidates

• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent
• Apriori principle holds due to the following property of the support measure:

    ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  – Support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support


Illustrating Apriori Principle

[Figure: itemset lattice in which one itemset is found to be infrequent; all of its supersets are pruned.]


Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets):
  Item    Count
  Bread   4
  Coke    2
  Milk    4
  Beer    3
  Diaper  4
  Eggs    1

Pairs (2-itemsets):
(No need to generate candidates involving Coke or Eggs)
  Itemset         Count
  {Bread,Milk}    3
  {Bread,Beer}    2
  {Bread,Diaper}  3
  {Milk,Beer}     2
  {Milk,Diaper}   3
  {Beer,Diaper}   3

Triplets (3-itemsets):
  Itemset              Count
  {Bread,Milk,Diaper}  3

If every subset is considered: 6C1 + 6C2 + 6C3 = 41
With support-based pruning: 6 + 6 + 1 = 13

Apriori Algorithm

• Method:
  – Let k = 1
  – Generate frequent itemsets of length 1
  – Repeat until no new frequent itemsets are identified
    · Generate length-(k+1) candidate itemsets from length-k frequent itemsets
    · Prune candidate itemsets containing subsets of length k that are infrequent
    · Count the support of each candidate by scanning the DB
    · Eliminate candidates that are infrequent, leaving only those that are frequent

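The method above can be written down compactly. The following is an unoptimized sketch, assuming in-memory transactions and no hash-tree support counting; all names are illustrative, and candidate generation simply unions pairs of frequent k-itemsets.

# Unoptimized level-wise sketch of the method above (no hash tree).
from itertools import combinations

def apriori(transactions, minsup_count):
    items = {frozenset([i]) for t in transactions for i in t}
    counts = {c: sum(1 for t in transactions if c <= t) for c in items}
    frequent = {c for c, n in counts.items() if n >= minsup_count}
    all_frequent = dict((c, counts[c]) for c in frequent)
    k = 1
    while frequent:
        # Generate length-(k+1) candidates from length-k frequent itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Prune candidates that contain an infrequent length-k subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Count support by scanning the transactions
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= minsup_count}
        all_frequent.update((c, counts[c]) for c in frequent)
        k += 1
    return all_frequent

transactions = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]
print(apriori(transactions, minsup_count=3))

Run on the five market-basket transactions with a minimum support count of 3, this reproduces the frequent itemsets of the "Illustrating Apriori Principle" slide, including {Bread, Milk, Diaper} with count 3.
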
Reducing Number of Comparisons

• Candidate counting:
  – Scan the database of transactions to determine the support of each candidate itemset
  – To reduce the number of comparisons, store the candidates in a hash structure
    · Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

  TID  Items
  1    Bread, Milk
  2    Bread, Diaper, Beer, Eggs
  3    Milk, Diaper, Beer, Coke
  4    Bread, Milk, Diaper, Beer
  5    Bread, Milk, Diaper, Coke

Data Mining
Association Analysis: Basic Concepts and Algorithms

Algorithms and Complexity

© Tan, Steinbach, Kumar


Factors Affecting Complexity of Apriori

• Choice of minimum support threshold
  – lowering the support threshold results in more frequent itemsets
  – this may increase the number of candidates and the max length of frequent itemsets
• Dimensionality (number of items) of the data set
  – more space is needed to store the support count of each item
  – if the number of frequent items also increases, both computation and I/O costs may also increase
• Size of database
  – since Apriori makes multiple passes, run time of the algorithm may increase with the number of transactions
• Average transaction width
  – transaction width increases with denser data sets
  – this may increase the max length of frequent itemsets and traversals of the hash tree (the number of subsets in a transaction increases with its width)

Compact Representation of Frequent Itemsets

• Some itemsets are redundant because they have identical support as their supersets

[Table: a binary data set with 15 transactions over 30 items A1–A10, B1–B10, C1–C10; transactions 1–5 contain all of A1–A10, transactions 6–10 contain all of B1–B10, and transactions 11–15 contain all of C1–C10.]

• Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k)
• Need a compact representation

Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent.

[Figure: itemset lattice showing the border between frequent and infrequent itemsets; the maximal itemsets sit just above the border, the infrequent itemsets below it.]


Closed Itemset

• An itemset is closed if none of its immediate supersets has the same support as the itemset

  TID  Items
  1    {A,B}
  2    {B,C,D}
  3    {A,B,C,D}
  4    {A,B,D}
  5    {A,B,C,D}

  Itemset  Support        Itemset    Support
  {A}      4              {A,B,C}    2
  {B}      5              {A,B,D}    3
  {C}      3              {A,C,D}    2
  {D}      4              {B,C,D}    3
  {A,B}    4              {A,B,C,D}  2
  {A,C}    2
  {A,D}    3
  {B,C}    3
  {B,D}    4
  {C,D}    3

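A brute-force sketch (not from the slides) that recovers the closed and maximal frequent itemsets for this small example directly from the definitions; it enumerates every itemset, so it is only meant for tiny data sets.

# Identify closed and maximal frequent itemsets from the transactions above.
from itertools import combinations

transactions = [{"A","B"}, {"B","C","D"}, {"A","B","C","D"}, {"A","B","D"}, {"A","B","C","D"}]
items = sorted(set().union(*transactions))
support = {frozenset(c): sum(1 for t in transactions if set(c) <= t)
           for k in range(1, len(items) + 1) for c in combinations(items, k)}

minsup = 2
frequent = {s: n for s, n in support.items() if n >= minsup}
for s, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    supersets = [x for x in support if len(x) == len(s) + 1 and s < x]
    is_closed = all(support[x] < n for x in supersets)      # no superset has equal support
    is_maximal = all(x not in frequent for x in supersets)  # no superset is frequent
    print("".join(sorted(s)), n, "closed" if is_closed else "", "maximal" if is_maximal else "")
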
Maximal vs Closed Itemsets

[Figure: itemset lattice over items A–E for five transactions (ABC, ABCD, BCE, ACDE, DE), with each node annotated by the IDs of the transactions that contain it; itemsets supported by no transaction are marked.]


Maximal vs Closed Frequent Itemsets

Minimum support = 2

[Figure: the same annotated lattice, with nodes marked "closed but not maximal" or "closed and maximal".]

# Closed = 9
# Maximal = 4

Maximal vs Closed Itemsets

[Figure: relationship between the sets of frequent, closed frequent, and maximal frequent itemsets.]


Alternative Methods for Frequent Itemset Generation

• Traversal of Itemset Lattice
  – General-to-specific vs Specific-to-general


Alternative Methods for Frequent Itemset Generation

• Traversal of Itemset Lattice
  – Equivalent Classes


Alternative Methods for Frequent Itemset Generation

• Traversal of Itemset Lattice
  – Breadth-first vs Depth-first

Alternative Methods for Frequent Itemset Generation

• Representation of Database
  – horizontal vs vertical data layout

Horizontal data layout:
  TID  Items
  1    A,B,E
  2    B,C,D
  3    C,E
  4    A,C,D
  5    A,B,C,D
  6    A,E
  7    A,B
  8    A,B,C
  9    A,C,D
  10   B

Vertical data layout (TID-lists):
  A: 1,4,5,6,7,8,9
  B: 1,2,5,7,8,10
  C: 2,3,4,8,9
  D: 2,4,5,9
  E: 1,3,6

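Converting between the two layouts is straightforward; the sketch below builds the vertical TID-lists from the horizontal table above, assuming everything fits in memory. Support counting then becomes TID-list intersection.

# Build a vertical (TID-list) layout from the horizontal layout above.
from collections import defaultdict

horizontal = {1: {"A","B","E"}, 2: {"B","C","D"}, 3: {"C","E"}, 4: {"A","C","D"},
              5: {"A","B","C","D"}, 6: {"A","E"}, 7: {"A","B"}, 8: {"A","B","C"},
              9: {"A","C","D"}, 10: {"B"}}

vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)

print(sorted(vertical["A"]))               # TID-list of A
print(len(vertical["A"] & vertical["B"]))  # support count of {A, B} via intersection
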
Rule Generation

• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L – f satisfies the minimum confidence requirement
  – If {A,B,C,D} is a frequent itemset, candidate rules:
      ABC → D,  ABD → C,  ACD → B,  BCD → A,
      A → BCD,  B → ACD,  C → ABD,  D → ABC,
      AB → CD,  AC → BD,  AD → BC,
      BC → AD,  BD → AC,  CD → AB
• If |L| = k, then there are 2^k – 2 candidate association rules (ignoring L → ∅ and ∅ → L)


Rule Generation

• How to efficiently generate rules from frequent itemsets?
  – In general, confidence does not have an anti-monotone property:
      c(ABC → D) can be larger or smaller than c(AB → D)
  – But confidence of rules generated from the same itemset has an anti-monotone property
  – e.g., L = {A,B,C,D}:
      c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
    · Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule


Rule Generation for Apriori Algorithm

Lattice of rules

[Figure: lattice of rules generated from one frequent itemset; a low-confidence rule is marked and the rules below it are pruned.]


Rule Generation for Apriori Algorithm

• Candidate rule is generated by merging two rules that share the same prefix in the rule consequent
  – join(CD ⇒ AB, BD ⇒ AC) would produce the candidate rule D ⇒ ABC
  – Prune rule D ⇒ ABC if its subset AD ⇒ BC does not have high confidence

Association Analysis: Basic Concepts and Algorithms

Pattern Evaluation

© Tan, Steinbach, Kumar


Effect of Support Distribution

• Many real data sets have a skewed support distribution

[Figure: support distribution of a retail data set.]


Effect of Support Distribution

• How to set the appropriate minsup threshold?
  – If minsup is too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
  – If minsup is too low, it is computationally expensive and the number of itemsets is very large

Cross-Support Patterns

• A cross-support pattern involves items with widely varying degrees of support
  – Example: {caviar, milk}
• How to avoid such patterns?

[Figure: items sorted by support (%); caviar has very low support, milk very high support.]


Cross-Support Patterns

Observation:
  conf(caviar → milk) is very high, but
  conf(milk → caviar) is very low

Therefore
  min( conf(caviar → milk), conf(milk → caviar) ) is also very low

[Figure: same sorted-support plot as before.]

h-Confidence

h-confidence: [definition shown as a formula on the slide; not recovered in the transcript]

• Advantages of h-confidence:
  1. Eliminates cross-support patterns such as {caviar, milk}
  2. The min function has the anti-monotone property
     – The algorithm can be applied to efficiently discover low-support, high-confidence patterns

[Figure: same sorted-support plot as before.]

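The h-confidence formula itself did not survive in this transcript. As a labeled assumption, the sketch below uses the common definition hconf(P) = supp(P) / max_{i∈P} supp({i}), i.e. the smallest confidence of any rule {i} → P − {i}, and shows that it is low for a cross-support pattern such as {caviar, milk}.

# Assumption: hconf(P) = supp(P) / max_{i in P} supp({i})
def h_confidence(itemset, transactions):
    n = len(transactions)
    supp = lambda s: sum(1 for t in transactions if s <= t) / n
    return supp(itemset) / max(supp({i}) for i in itemset)

# Toy data: milk is in almost every basket, caviar in very few.
baskets = [{"milk", "bread"}] * 95 + [{"milk", "caviar"}] * 3 + [{"caviar"}] * 2
print(round(h_confidence({"caviar", "milk"}, baskets), 3))  # low value -> cross-support pattern
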
Pattern Evaluation

• Association rule algorithms can produce a large number of rules
  – many of them are uninteresting or redundant
  – Redundant if {A,B,C} → {D} and {A,B} → {D} have the same support & confidence
• Interestingness measures can be used to prune/rank the patterns
  – In the original formulation, support & confidence are the only measures used


Application of Interestingness Measure

[Figure: the knowledge discovery pipeline. Data → Selection → Selected Data → Preprocessing → Preprocessed Data (a product × feature matrix) → Mining → Patterns → Postprocessing with interestingness measures → Knowledge.]

Computing Interestingness Measure

• Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table

Contingency table for X → Y:

           Y     ¬Y
  X        f11   f10   | f1+
  ¬X       f01   f00   | f0+
           f+1   f+0   | |T|

  f11: support of X and Y
  f10: support of X and ¬Y
  f01: support of ¬X and Y
  f00: support of ¬X and ¬Y

• Used to define various measures
  – support, confidence, lift, Gini, J-measure, etc.


Drawback of Confidence

           Coffee  ¬Coffee
  Tea        15       5     |  20
  ¬Tea       75       5     |  80
             90      10     | 100

Association Rule: Tea → Coffee

  Confidence = P(Coffee | Tea) = 15/20 = 0.75
  but P(Coffee) = 0.9

⇒ Although confidence is high, the rule is misleading
⇒ P(Coffee | ¬Tea) = 75/80 = 0.9375


Statistical Independence

• Population of 1000 students
  – 600 students know how to swim (S)
  – 700 students know how to bike (B)
  – 420 students know how to swim and bike (S,B)
  – P(S∧B) = 420/1000 = 0.42
  – P(S) × P(B) = 0.6 × 0.7 = 0.42
  – P(S∧B) = P(S) × P(B)  ⇒ statistical independence
  – P(S∧B) > P(S) × P(B)  ⇒ positively correlated
  – P(S∧B) < P(S) × P(B)  ⇒ negatively correlated

Statistical-based Measures

• Measures that take into account statistical dependence:

    Lift = P(Y | X) / P(Y)

    Interest = P(X, Y) / ( P(X) P(Y) )

    PS = P(X, Y) − P(X) P(Y)

    φ-coefficient = ( P(X, Y) − P(X) P(Y) ) / sqrt( P(X)[1 − P(X)] P(Y)[1 − P(Y)] )

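A small sketch of these measures computed from a 2×2 contingency table (f11, f10, f01, f00 as defined two slides earlier); it is illustrative only, and the example counts are taken from the swim/bike statistical-independence slide.

# Lift, Piatetsky-Shapiro (PS) and phi-coefficient from a 2x2 contingency table.
from math import sqrt

def measures(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    p_x, p_y, p_xy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
    lift = (p_xy / p_x) / p_y                 # = P(Y|X) / P(Y), equals Interest here
    ps = p_xy - p_x * p_y                     # Piatetsky-Shapiro
    phi = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return lift, ps, phi

# Swim/bike example: 420 do both, 180 swim only, 280 bike only, 120 neither.
print(measures(420, 180, 280, 120))           # lift = 1.0, PS = 0, phi = 0: independent
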
Example: Lift/Interest

           Coffee  ¬Coffee
  Tea        15       5     |  20
  ¬Tea       75       5     |  80
             90      10     | 100

Association Rule: Tea → Coffee

  Confidence = P(Coffee | Tea) = 0.75
  but P(Coffee) = 0.9
  ⇒ Lift = 0.75 / 0.9 = 0.8333  (< 1, therefore negatively associated)

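For completeness, the confidence-versus-lift contrast above in code form (a sketch using the same contingency counts):

# Confidence vs. lift for Tea -> Coffee.
f11, f10, f01, f00 = 15, 5, 75, 5          # (Tea,Coffee), (Tea,no Coffee), ...
n = f11 + f10 + f01 + f00
confidence = f11 / (f11 + f10)             # P(Coffee | Tea)          = 0.75
lift = confidence / ((f11 + f01) / n)      # P(Coffee|Tea) / P(Coffee) ≈ 0.83
print(confidence, round(lift, 4))          # high confidence, yet lift < 1
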
Drawback of Lift & Interest

        Y    ¬Y                        Y    ¬Y
  X     10    0  |  10           X    90     0  |  90
  ¬X     0   90  |  90           ¬X    0    10  |  10
        10   90  | 100                90    10  | 100

  Lift = 0.1 / (0.1 × 0.1) = 10        Lift = 0.9 / (0.9 × 0.9) = 1.11

Statistical independence: if P(X,Y) = P(X)P(Y), then Lift = 1

There are lots of measures proposed in the literature

[Table listing the many interestingness measures proposed in the literature omitted.]


Comparing Different Measures

10 examples of contingency tables:

  Example   f11    f10    f01    f00
  E1        8123     83    424   1370
  E2        8330      2    622   1046
  E3        9481     94    127    298
  E4        3954   3080      5   2961
  E5        2886   1363   1320   4431
  E6        1500   2000    500   6000
  E7        4000   2000   1000   3000
  E8        4000   2000   2000   2000
  E9        1720   7121      5   1154
  E10         61   2483      4   7452

[Figure: rankings of these contingency tables using various measures.]

Property under Variable Permutation

  M(A,B):       A    ¬A          M(B,A):       B    ¬B
         B      p     r                 A      p     q
         ¬B     q     s                 ¬A     r     s

Does M(A,B) = M(B,A)?

Symmetric measures:
  · support, lift, collective strength, cosine, Jaccard, etc.
Asymmetric measures:
  · confidence, conviction, Laplace, J-measure, etc.

Property under Row/Column Scaling

Grade-Gender Example (Mosteller, 1968):

         Male  Female                 Male  Female
  High     2      3   |  5     High     4     30   | 34
  Low      1      4   |  5     Low      2     40   | 42
           3      7   | 10              6     70   | 76

(the second table scales the Male column by 2x and the Female column by 10x)

Mosteller: Underlying association should be independent of the relative number of male and female students in the samples.

Property under Inversion Operation

[Figure: binary vectors labeled A–F over transactions 1..N, grouped into panels (a), (b), and (c), illustrating the inversion operation (every 0 flipped to 1 and vice versa).]

Example: φ-Coefficient

• φ-coefficient is analogous to the correlation coefficient for continuous variables

        Y    ¬Y                        Y    ¬Y
  X     60   10  |  70           X     20   10  |  30
  ¬X    10   20  |  30           ¬X    10   60  |  70
        70   30  | 100                 30   70  | 100

  Left table:  φ = (0.6 − 0.7 × 0.7) / sqrt(0.7 × 0.3 × 0.7 × 0.3) = 0.5238
  Right table: φ = (0.2 − 0.3 × 0.3) / sqrt(0.7 × 0.3 × 0.7 × 0.3) = 0.5238

φ-coefficient is the same for both tables

Property under Null Addition

        A    ¬A                        A    ¬A
  B     p     r                  B     p     r
  ¬B    q     s                  ¬B    q     s + k

Invariant measures:
  · support, cosine, Jaccard, etc.
Non-invariant measures:
  · correlation, Gini, mutual information, odds ratio, etc.

Different Measures have Different Properties

[Table summarizing which properties each interestingness measure satisfies omitted.]


Simpson’s Paradox

  c({HDTV = Yes} → {Exercise Machine = Yes}) = 99 / 180 = 55%
  c({HDTV = No} → {Exercise Machine = Yes}) = 54 / 120 = 45%

⇒ Customers who buy HDTV are more likely to buy exercise machines


Simpson’s Paradox

College students:
  c({HDTV = Yes} → {Exercise Machine = Yes}) = 1 / 10 = 10%
  c({HDTV = No} → {Exercise Machine = Yes}) = 4 / 34 = 11.8%

Working adults:
  c({HDTV = Yes} → {Exercise Machine = Yes}) = 98 / 170 = 57.7%
  c({HDTV = No} → {Exercise Machine = Yes}) = 50 / 86 = 58.1%

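A small sketch that reproduces the reversal: within each stratum HDTV buyers are slightly less likely to buy an exercise machine, yet the pooled counts point the other way. The group names are taken from the slide.

# Simpson's paradox: stratified vs. aggregated confidence.
groups = {  # stratum -> {HDTV status: (# who buy exercise machine, group size)}
    "college students": {"HDTV=Yes": (1, 10),   "HDTV=No": (4, 34)},
    "working adults":   {"HDTV=Yes": (98, 170), "HDTV=No": (50, 86)},
}

aggregate = {"HDTV=Yes": [0, 0], "HDTV=No": [0, 0]}
for stratum, table in groups.items():
    for hdtv, (buyers, total) in table.items():
        print(f"{stratum:16s} {hdtv}: {buyers / total:.1%}")
        aggregate[hdtv][0] += buyers
        aggregate[hdtv][1] += total

for hdtv, (buyers, total) in aggregate.items():
    print(f"aggregated       {hdtv}: {buyers / total:.1%}")  # 55% vs 45%: direction flips
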
Simpson’s Paradox

• Observed relationship in data may be influenced by the presence of other confounding factors (hidden variables)
  – Hidden variables may cause the observed relationship to disappear or reverse its direction!
• Proper stratification is needed to avoid generating spurious patterns

Extensions of Association Analysis

Dr. Hui Xiong
Rutgers University
Introduction to Data Mining


Association Analysis: Advanced Concepts

Extensions of Association Analysis to Continuous and Categorical Attributes and Multi-level Rules


Continuous and Categorical Attributes

How to apply association analysis to non-asymmetric binary variables?

Example of Association Rule:
  {Gender=Male, Age ∈ [21,30)} → {No of hours online ≥ 10}


Handling Categorical Attributes

• Example: Internet Usage Data

  {Level of Education=Graduate, Online Banking=Yes} → {Privacy Concerns = Yes}

Handling Categorical Attributes

• Introduce a new "item" for each distinct attribute-value pair


Handling Categorical Attributes

• Some attributes can have many possible values
  – Many of their attribute values have very low support
    · Potential solution: aggregate the low-support attribute values

Handling Categorical Attributes

• Distribution of attribute values can be highly skewed
  – Example: 85% of survey participants own a computer at home
    · Most records have Computer at home = Yes
    · Computation becomes expensive; many frequent itemsets involve the binary item (Computer at home = Yes)
    · Potential solution:
      – discard the highly frequent items
      – use alternative measures such as h-confidence
• Computational Complexity
  – Binarizing the data increases the number of items
  – But the width of the "transactions" remains the same as the number of original (non-binarized) attributes
  – Produces more frequent itemsets, but the maximum size of a frequent itemset is limited to the number of original attributes

Handling Continuous Attributes

• Different methods:
  – Discretization-based
  – Statistics-based
  – Non-discretization based
    · minApriori
• Different kinds of rules can be produced:
  – {Age ∈ [21,30), No of hours online ∈ [10,20)} → {Chat Online = Yes}
  – {Age ∈ [21,30), Chat Online = Yes} → No of hours online: μ=14, σ=4

Discretization-based Methods

[Figure illustrating discretization-based methods omitted.]


Discretization-based Methods

• Unsupervised:
  – Equal-width binning
  – Equal-depth binning
  – Cluster-based
• Supervised discretization

  Continuous attribute v    v1   v2   v3   v4   v5   v6   v7   v8   v9
  Chat Online = Yes          0    0   20   10   20    0    0    0    0
  Chat Online = No         150  100    0    0    0  100  100  150  100
                           [---- bin1 ----][---- bin2 ----][---- bin3 ----]

Discretization Issues

• Interval width

[Figure illustrating the effect of different interval widths omitted.]


Discretization Issues

• Interval too wide (e.g., bin size = 30)
  – May merge several disparate patterns
    · Patterns A and B are merged together
  – May lose some of the interesting patterns
    · Pattern C may not have enough confidence
• Interval too narrow (e.g., bin size = 2)
  – Pattern A is broken up into two smaller patterns
    · Can recover the pattern by merging adjacent subpatterns
  – Pattern B is broken up into smaller patterns
    · Cannot recover the pattern by merging adjacent subpatterns
• Potential solution: use all possible intervals
  – Start with narrow intervals
  – Consider all possible mergings of adjacent intervals

Discretization Issues

• Execution time
  – If the range is partitioned into k intervals, there are O(k²) new items
  – If an interval [a,b) is frequent, then all intervals that subsume [a,b) must also be frequent
    · E.g.: if {Age ∈ [21,25), Chat Online=Yes} is frequent, then {Age ∈ [10,50), Chat Online=Yes} is also frequent
  – Improve efficiency:
    · Use maximum support to avoid intervals that are too wide
    · Modify the algorithm to exclude candidate itemsets containing more than one interval of the same attribute


Discretization Issues

• Redundant rules
    R1: {Age ∈ [18,20), Age ∈ [10,12)} → {Chat Online=Yes}
    R2: {Age ∈ [18,23), Age ∈ [10,20)} → {Chat Online=Yes}
  – If both rules have the same support and confidence, prune the more specific rule (R1)

Statistics-based Methods

• Example:
    {Income > 100K, Online Banking=Yes} → Age: μ=34
• Rule consequent consists of a continuous variable, characterized by its statistics
  – mean, median, standard deviation, etc.
• Approach:
  – Withhold the target attribute from the rest of the data
  – Extract frequent itemsets from the rest of the attributes
    · Binarize the continuous attributes (except for the target attribute)
  – For each frequent itemset, compute the corresponding descriptive statistics of the target attribute
    · A frequent itemset becomes a rule by introducing the target variable as the rule consequent
  – Apply a statistical test to determine the interestingness of the rule


Statistics-based Methods

  Frequent Itemsets:                     Association Rules:
  {Male, Income > 100K}                  {Male, Income > 100K} → Age: μ = 30
  {Income < 30K, No hours ∈ [10,15)}     {Income < 40K, No hours ∈ [10,15)} → Age: μ = 24
  {Income > 100K, Online Banking = Yes}  {Income > 100K, Online Banking = Yes} → Age: μ = 34
  ….                                     ….


Statistics-based Methods

• How to determine whether an association rule is interesting?
  – Compare the statistics for the segment of population covered by the rule vs the segment of population not covered by the rule:

      A ⇒ B: μ   versus   A ⇒ B: μ’

  – Statistical hypothesis testing:

      Z = (μ’ − μ − Δ) / sqrt( s1²/n1 + s2²/n2 )

    · Null hypothesis: H0: μ’ = μ + Δ
    · Alternative hypothesis: H1: μ’ > μ + Δ
    · Z has zero mean and variance 1 under the null hypothesis

Statistics-based Methods

• Example:
    r: Browser=Mozilla ∧ Buy=Yes → Age: μ=23
  – Rule is interesting if the difference between μ and μ’ is more than 5 years (i.e., Δ = 5)
  – For r, suppose n1 = 50, s1 = 3.5
  – For r’ (complement): n2 = 250, s2 = 6.5

      Z = (μ’ − μ − Δ) / sqrt( s1²/n1 + s2²/n2 )
        = (30 − 23 − 5) / sqrt( 3.5²/50 + 6.5²/250 )
        = 3.11

  – For a 1-sided test at the 95% confidence level, the critical Z-value for rejecting the null hypothesis is 1.64
  – Since Z is greater than 1.64, r is an interesting rule

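The Z statistic on this slide can be recomputed directly; the sketch below simply mirrors the numbers used above (μ' = 30, μ = 23, Δ = 5).

# Z statistic for comparing the rule segment against its complement.
from math import sqrt

def z_stat(mu_prime, mu, delta, s1, n1, s2, n2):
    return (mu_prime - mu - delta) / sqrt(s1**2 / n1 + s2**2 / n2)

z = z_stat(mu_prime=30, mu=23, delta=5, s1=3.5, n1=50, s2=6.5, n2=250)
print(round(z, 2), z > 1.64)   # 3.11 > 1.64 -> reject H0, the rule is interesting
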
Min-Apriori

Document-term matrix:

  TID  W1  W2  W3  W4  W5
  D1    2   2   0   0   1
  D2    0   0   1   2   2
  D3    2   3   0   0   0
  D4    0   0   1   0   1
  D5    1   1   1   0   2

Example: W1 and W2 tend to appear together in the same document


Min-Apriori

• Data contains only continuous attributes of the same "type"
  – e.g., frequency of words in a document
• Potential solution:
  – Convert into a 0/1 matrix and then apply existing algorithms
    · loses word frequency information
  – Discretization does not apply, as users want associations among words, not ranges of words


Min-Apriori

• How to determine the support of a word?
  – If we simply sum up its frequency, the support count will be greater than the total number of documents!
    · Normalize the word vectors – e.g., using L1 norms
    · Each word then has a support equal to 1.0

Normalize:

  TID  W1  W2  W3  W4  W5          TID  W1    W2    W3    W4    W5
  D1    2   2   0   0   1          D1   0.40  0.33  0.00  0.00  0.17
  D2    0   0   1   2   2    →     D2   0.00  0.00  0.33  1.00  0.33
  D3    2   3   0   0   0          D3   0.40  0.50  0.00  0.00  0.00
  D4    0   0   1   0   1          D4   0.00  0.00  0.33  0.00  0.17
  D5    1   1   1   0   2          D5   0.20  0.17  0.33  0.00  0.33

Min-Apriori

• New definition of support:

    sup(C) = Σ_{i ∈ T} min_{j ∈ C} D(i, j)

  TID  W1    W2    W3    W4    W5
  D1   0.40  0.33  0.00  0.00  0.17
  D2   0.00  0.00  0.33  1.00  0.33
  D3   0.40  0.50  0.00  0.00  0.00
  D4   0.00  0.00  0.33  0.00  0.17
  D5   0.20  0.17  0.33  0.00  0.33

Example:
  Sup(W1, W2, W3) = 0 + 0 + 0 + 0 + 0.17 = 0.17

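A sketch of the min-based support, using the normalized document-term matrix from the slide; it reproduces the 0.17 value above.

# Min-Apriori support: sum over documents of the minimum normalized frequency
# of the words in the candidate itemset.
D = {  # normalized document-term matrix (rows = documents)
    "D1": {"W1": 0.40, "W2": 0.33, "W3": 0.00, "W4": 0.00, "W5": 0.17},
    "D2": {"W1": 0.00, "W2": 0.00, "W3": 0.33, "W4": 1.00, "W5": 0.33},
    "D3": {"W1": 0.40, "W2": 0.50, "W3": 0.00, "W4": 0.00, "W5": 0.00},
    "D4": {"W1": 0.00, "W2": 0.00, "W3": 0.33, "W4": 0.00, "W5": 0.17},
    "D5": {"W1": 0.20, "W2": 0.17, "W3": 0.33, "W4": 0.00, "W5": 0.33},
}

def min_support(itemset):
    return sum(min(row[w] for w in itemset) for row in D.values())

print(round(min_support({"W1"}), 2))              # 1.0
print(round(min_support({"W1", "W2"}), 2))        # 0.9
print(round(min_support({"W1", "W2", "W3"}), 2))  # 0.17
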
Anti-monotone property of Support

  TID  W1    W2    W3    W4    W5
  D1   0.40  0.33  0.00  0.00  0.17
  D2   0.00  0.00  0.33  1.00  0.33
  D3   0.40  0.50  0.00  0.00  0.00
  D4   0.00  0.00  0.33  0.00  0.17
  D5   0.20  0.17  0.33  0.00  0.33

Example:
  Sup(W1) = 0.4 + 0 + 0.4 + 0 + 0.2 = 1
  Sup(W1, W2) = 0.33 + 0 + 0.4 + 0 + 0.17 = 0.9
  Sup(W1, W2, W3) = 0 + 0 + 0 + 0 + 0.17 = 0.17

Concept Hierarchies

  Food
    Bread: Wheat, White
    Milk: Skim, 2% (brands: Foremost, Kemps)
  Electronics
    Computers: Desktop, Laptop, Accessory (Printer, Scanner)
    Home: TV, DVD

Multi-level Association Rules

• Why should we incorporate the concept hierarchy?
  – Rules at lower levels may not have enough support to appear in any frequent itemsets
  – Rules at lower levels of the hierarchy are overly specific
    · e.g., skim milk → white bread, 2% milk → wheat bread, skim milk → wheat bread, etc. are all indicative of an association between milk and bread


Multi-level Association Rules

• How do support and confidence vary as we traverse the concept hierarchy?
  – If X is the parent item for both X1 and X2, then
      σ(X) ≤ σ(X1) + σ(X2)
  – If σ(X1 ∪ Y1) ≥ minsup, and X is the parent of X1, Y is the parent of Y1,
    then σ(X ∪ Y1) ≥ minsup, σ(X1 ∪ Y) ≥ minsup, σ(X ∪ Y) ≥ minsup
  – If conf(X1 ⇒ Y1) ≥ minconf,
    then conf(X1 ⇒ Y) ≥ minconf

Multi-level Association Rules

• Approach 1:
  – Extend the current association rule formulation by augmenting each transaction with higher-level items (see the sketch below)

    Original transaction: {skim milk, wheat bread}
    Augmented transaction: {skim milk, wheat bread, milk, bread, food}

• Issues:
  – Items that reside at higher levels have much higher support counts
    · if the support threshold is low, too many frequent patterns involving items from the higher levels
  – Increased dimensionality of the data

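The sketch referred to above: augmenting a transaction with all ancestors of its items. The parent map is illustrative and only covers the milk/bread branch of the concept hierarchy.

# Approach 1: augment each transaction with its items' ancestors.
parent = {"skim milk": "milk", "2% milk": "milk", "wheat bread": "bread",
          "white bread": "bread", "milk": "food", "bread": "food"}

def augment(transaction):
    augmented = set(transaction)
    for item in transaction:
        while item in parent:        # walk up the hierarchy
            item = parent[item]
            augmented.add(item)
    return augmented

print(augment({"skim milk", "wheat bread"}))
# {'skim milk', 'wheat bread', 'milk', 'bread', 'food'}
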
Multi-level Association Rules

• Approach 2:
  – Generate frequent patterns at the highest level first
  – Then generate frequent patterns at the next highest level, and so on
• Issues:
  – I/O requirements will increase dramatically because we need to perform more passes over the data
  – May miss some potentially interesting cross-level association patterns


Association Analysis: Advanced Concepts

Subgraph Mining


Frequent Subgraph Mining

• Extends association analysis to finding frequent subgraphs
• Useful for Web Mining, computational chemistry, bioinformatics, spatial data sets, etc.

[Figure: a web-graph example with pages such as Homepage, Research, Artificial Intelligence, Databases, and Data Mining linked to each other.]

Graph Definitions

[Figure: (a) a labeled graph with vertex labels a, b, c and edge labels p, q, r, s, t; (b) a subgraph of it; (c) an induced subgraph of it.]


Representing Transactions as Graphs

• Each transaction is a clique of items

  Transaction Id  Items
  1               {A,B,C,D}
  2               {A,B,E}
  3               {B,C}
  4               {A,B,D,E}
  5               {B,C,D}

  TID = 1:  [clique on the items A, B, C, D]


Representing Graphs as Transactions

[Figure: three labeled graphs G1, G2, G3 over vertices a–e with edge labels p, q, r.]

        (a,b,p)  (a,b,q)  (a,b,r)  (b,c,p)  (b,c,q)  (b,c,r)  …  (d,e,r)
  G1       1        0        0        0        0        1    …      0
  G2       1        0        0        0        0        0    …      0
  G3       0        0        1        1        0        0    …      0

Challenges

• Node may contain duplicate labels
• Support and confidence
  – How to define them?
• Additional constraints imposed by pattern structure
  – Support and confidence are not the only constraints
  – Assumption: frequent subgraphs must be connected
• Apriori-like approach:
  – Use frequent k-subgraphs to generate frequent (k+1)-subgraphs
    · What is k?


Challenges…

• Support:
  – number of graphs that contain a particular subgraph
• Apriori principle still holds
• Level-wise (Apriori-like) approach:
  – Vertex growing:
    · k is the number of vertices
  – Edge growing:
    · k is the number of edges


Vertex Growing

  M_G1 = | 0  p  p  q |      M_G2 = | 0  p  p  0 |      M_G3 = | 0  p  p  q  0 |
         | p  0  r  0 |             | p  0  r  0 |             | p  0  r  0  0 |
         | p  r  0  0 |             | p  r  0  r |             | p  r  0  0  r |
         | q  0  0  0 |             | 0  0  r  0 |             | q  0  0  0  ? |
                                                               | 0  0  r  ?  0 |


Edge Growing

[Figure illustrating edge growing omitted.]

Apriori-like Algorithm

• Find frequent 1-subgraphs
• Repeat
  – Candidate generation
    · Use frequent (k-1)-subgraphs to generate candidate k-subgraphs
  – Candidate pruning
    · Prune candidate subgraphs that contain infrequent (k-1)-subgraphs
  – Support counting
    · Count the support of each remaining candidate
  – Eliminate candidate k-subgraphs that are infrequent

In practice, it is not as easy. There are many other issues.

Example: Dataset

        (a,b,p)  (a,b,q)  (a,b,r)  (b,c,p)  (b,c,q)  (b,c,r)  …  (d,e,r)
  G1       1        0        0        0        0        1    …      0
  G2       1        0        0        0        0        0    …      0
  G3       0        0        1        1        0        0    …      0
  G4       0        0        0        0        0        0    …      0


Example

[Figure: worked example omitted from the transcript.]


Candidate Generation

• In Apriori:
  – Merging two frequent k-itemsets will produce a candidate (k+1)-itemset
• In frequent subgraph mining (vertex/edge growing):
  – Merging two frequent k-subgraphs may produce more than one candidate (k+1)-subgraph

Multiplicity of Candidates (Vertex Growing)

  M_G1 = | 0  p  p  q |      M_G2 = | 0  p  p  0 |      M_G3 = | 0  p  p  0  q |
         | p  0  r  0 |             | p  0  r  0 |             | p  0  r  0  0 |
         | p  r  0  0 |             | p  r  0  r |             | p  r  0  r  0 |
         | q  0  0  0 |             | 0  0  r  0 |             | 0  0  r  0  ? |
                                                               | q  0  0  ?  0 |


Multiplicity of Candidates (Edge growing)

• Case 1: identical vertex labels

[Figure: merging two graphs with identically labeled vertices (a, b, c, e) can produce more than one candidate.]


Multiplicity of Candidates (Edge growing)

• Case 2: Core contains identical labels

  Core: the (k-1)-subgraph that is common between the joined graphs

[Figure: merging two graphs whose common core contains identically labeled vertices.]


Multiplicity of Candidates (Edge growing)

• Case 3: Core multiplicity

[Figure: merging two graphs whose core occurs multiple times within each graph (vertices labeled a and b).]


Candidate Generation by Edge Growing

• Given: graphs G1 and G2, each consisting of a shared core plus one extra edge (to vertices labeled a–b in G1 and c–d in G2)

• Case 1: a ≠ c and b ≠ d
    G3 = Merge(G1, G2): one candidate
    [Figure: the core with both new edges a–b and c–d attached.]


Candidate Generation by Edge Growing

• Case 2: a = c and b ≠ d
    G3 = Merge(G1, G2): two candidates
    [Figure omitted.]


Candidate Generation by Edge Growing

• Case 3: a ≠ c and b = d
    G3 = Merge(G1, G2): two candidates
    [Figure omitted.]


Candidate Generation by Edge Growing

• Case 4: a = c and b = d
    G3 = Merge(G1, G2): three candidates
    [Figure omitted.]

Graph Isomorphism

• A graph is isomorphic to another graph if the two are topologically equivalent


Graph Isomorphism

• A test for graph isomorphism is needed:
  – During the candidate generation step, to determine whether a candidate has already been generated
  – During the candidate pruning step, to check whether its (k-1)-subgraphs are frequent
  – During candidate counting, to check whether a candidate is contained within another graph


Graph Isomorphism

[Figure: one graph with vertices A(1)–A(4) and B(5)–B(8) drawn under two different vertex orderings, together with the two resulting adjacency matrices.]

• The same graph can be represented in many ways

Graph Isomorphism

• Use canonical labeling to handle isomorphism
  – Map each graph into an ordered string representation (known as its code) such that two isomorphic graphs will be mapped to the same canonical encoding
  – Example:
    · Lexicographically largest adjacency matrix

      | 0  0  1  0 |                  | 0  1  1  1 |
      | 0  0  1  1 |                  | 1  0  1  0 |
      | 1  1  0  1 |                  | 1  1  0  0 |
      | 0  1  1  0 |                  | 1  0  0  0 |

      String: 011011                  Canonical: 111100

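A brute-force sketch of canonical labeling for a tiny graph: try every vertex ordering and keep the lexicographically largest code. Reading the upper triangle column by column is an assumption chosen so that the left matrix above yields the string 011011; real implementations avoid this factorial search.

# Canonical code by brute force over all vertex orderings (tiny graphs only).
from itertools import permutations

def code(adj, order):
    """Read the upper triangle column by column under the given vertex order."""
    return "".join(str(adj[order[i]][order[j]])
                   for j in range(1, len(order)) for i in range(j))

def canonical_code(adj):
    vertices = range(len(adj))
    return max(code(adj, p) for p in permutations(vertices))

# Adjacency matrix from the left-hand example above (its code is "011011").
A = [[0, 0, 1, 0],
     [0, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 1, 1, 0]]
print(canonical_code(A))   # "111100" for this graph
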