Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Data Mining:
Concepts and Techniques
©Jiawei Han and Micheline Kamber
(modified for I211)
April 11, 2010
Data Mining: Concepts and Techniques
1
Mining Association Rules in
L
Large
Databases
D t b
Association rule mining
Mining single-dimensional Boolean association rules
from transactional databases
From association mining to correlation analysis
Summary
April 11, 2010
Data Mining: Concepts and Techniques
2
What Is Association Mining?
Association rule mining:
Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in
transaction databases
databases, relational databases,
databases and other
information repositories.
Applications:
Basket data analysis, cross-marketing, catalog design,
etc.
Examples.
Rule form: “Body → Ηead [support, confidence]”.
buys(x, “diapers”) → buys(x, “beer”) [0.5%, 60%]
major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%,
75%]
April 11, 2010
Data Mining: Concepts and Techniques
3
Association Rule: Basic Concepts
Given: ((1)) database of transactions,, (2)
( ) each transaction is
a list of items (purchased by a customer in a visit)
Find: all rules that correlate the presence of one set of
items with that of another set of items
E.g., 98% of people who purchase tires and auto
accessories also get automotive services done
A li ti
Applications
* ⇒ Maintenance Agreement (What the store should
do to boost Maintenance Agreement
g
sales))
Home Electronics ⇒ * (What other products should
the store stock up?)
Attached mailing in direct marketing
Detecting “ping-pong”ing of patients, faulty “collisions”
April 11, 2010
Data Mining: Concepts and Techniques
4
Rule Measures: Support and
Confidence
Customer
buys both
Customer
buys beer
Customer
buys diaper
Find all the rules X & Y ⇒ Z with
minimum
i i
confidence
fid
and
d supportt
support, s, probability that a
transaction contains {{X,, Y,, Z}}
confidence, c, conditional
probability that a transaction
having {X, Y} also contains Z
Transaction ID Items Bought
2000
ABC
A,B,C
Let minimum support 50%, and
minimum confidence 50%, we have
1000
A,C
A ⇒ C (50%, 66.6%)
4000
A,D
C ⇒ A (50%, 100%)
5000
B,E,F
April 11, 2010
Data Mining: Concepts and Techniques
5
Association Rule Mining: A Road Map
Boolean vs. quantitative associations (Based on the types of values
handled)
buys(x, “SQLServer”) ^ buys(x, “DMBook”) → buys(x, “DBMiner”)
[0.2%, 60%]
age(x, “30..39”) ^ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%]
Single dimension vs. multiple dimensional associations (see ex. Above)
Single level vs.
vs multiple
multiple-level
level analysis
What brands of beer are associated with what brands of diapers?
Various extensions
Correlation,
C
l ti
causality
lit analysis
l i
Association does not necessarily imply causality
Constraints enforced
April 11, 2010
E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?
Data Mining: Concepts and Techniques
6
Mining Association Rules in
L
Large
Databases
D t b
Association rule mining
Mining single-dimensional Boolean association rules
from transactional databases
From association mining to correlation analysis
Summary
April 11, 2010
Data Mining: Concepts and Techniques
7
Mining
g Association Rules—An Example
p
Transaction ID
2000
1000
4000
5000
Items Bought
A,B,C
A,C
A,D
B,E,F
Min. supportt 50%
Mi
Min. confidence 50%
Frequent Itemset Support
{A}
75%
{B}
50%
{C}
50%
{A,C}
50%
For rule A ⇒ C :
support = support({A, C}) = 50%
confidence = support({A, C})/support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
April 11, 2010
Data Mining: Concepts and Techniques
8
The Apriori
p
Algorithm
g
Database D
TID
100
200
300
400
Items
134
235
1235
25
itemset sup.
{1}
2
C1
{2}
3
Scan D
{3}
3
{4}
1
{5}
3
C2
L2 itemset sup
...
{1 3}
{2 3}
{2 5}
{3 5}
April 11, 2010
2
2
3
2
itemset sup
{1 2}
1
{1 3}
2
{1 5}
1
{2 3}
2
{2 5}
3
{3 5}
2
L1 itemset sup.
{1}
{2}
{3}
{5}
2
3
3
3
C2 itemset
{1 2}
Scan D
Data Mining: Concepts and Techniques
{1
{1
{2
{2
{3
3}
5}
3}
5}
5}
9
Example of Generating Candidates
L3 = {abc, abd, acd, ace, bcd}
Self-joining:
Self
joining: L3*LL3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
April 11, 2010
Data Mining: Concepts and Techniques
10
Is Apriori Fast Enough? — Performance
B ttl
Bottlenecks
k
Th core off the
The
h Apriori
A i i algorithm:
l ih
Use frequent (k – 1)-itemsets to generate candidate frequent kitemsets
Use database scan and pattern matching to collect counts for the
candidate itemsets
The bottleneck of Apriori: candidate generation
Huge candidate sets:
104 frequent 1-itemset will generate 107 candidate 2-itemsets
To discover a frequent pattern of size 100, e.g., {a1, a2, …,
a100}, one needs to generate 2100 ≈ 1030 candidates.
Multiple
u t p e sca
scanss o
of database
database:
April 11, 2010
Needs (n +1 ) scans, n is the length of the longest pattern
Data Mining: Concepts and Techniques
11
Mining Association Rules in
L
Large
Databases
D t b
Association rule mining
Mining single
single-dimensional
dimensional Boolean association rules
from transactional databases
Mining multilevel association rules from transactional
databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
S
Summary
April 11, 2010
Data Mining: Concepts and Techniques
12
Interestingness Measurements
Objective measures
Two popular measurements:
support; and
April 11, 2010
confidence
Subjective measures (Silberschatz & Tuzhilin
Tuzhilin,
KDD95)
A rule (pattern) is interesting if
it is unexpected (surprising to the user);
and/or
actionable (the user can do something with it)
Data Mining: Concepts and Techniques
13
Criticism to Support and Confidence
Example 1: (Aggarwal & Yu,
Yu PODS98)
Among 5000 students
3000 play basketball
3750 eat cereal
2000 both play basket ball and eat cereal
play basketball ⇒ eat cereal [40%, 66.7%] is misleading
because the overall p
percentage
g of students eating
g cereal is 75%
which is higher than 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is far more
accurate,, although
g with lower support
pp
and confidence
basketball not basketball sum(row)
cereal
2000
1750
3750
nott cereall
1000
250
1250
sum(col.)
3000
2000
5000
April 11, 2010
Data Mining: Concepts and Techniques
14
Criticism to Support and Confidence
(Cont )
(Cont.)
Example 2:
X and Y: positively correlated,
X and Z, negatively related
support and confidence of
X=>Z dominates
We need a measure of dependent
or correlated events
corrA, B
P( A∩ B)
=
P( A) P( B)
X 1 1 1 1 0 0 0 0
Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1
Rule Support Confidence
X=>Y 25%
50%
X=>Z 37.50%
75%
P(B|A)/P(B) iis also
l called
ll d the
th lift
of rule A => B
April 11, 2010
Data Mining: Concepts and Techniques
15
Other Interestingness Measures: Interest
Interest (correlation, lift)
P( A ∧ B)
P ( A) P ( B )
taking both P(A) and P(B) in consideration
P(A^B)=P(B)*P(A), if A and B are independent events
A and B negatively correlated, if the value is less than 1;
otherwise A and B positively correlated
X 1 1 1 1 0 0 0 0
Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1
April 11, 2010
It
Itemset
t
S
Support
t
I t
Interest
t
X,Y
X,Z
Y,Z
25%
37.50%
12.50%
2
0.9
0.57
Data Mining: Concepts and Techniques
16
Mining Association Rules in
L
Large
Databases
D t b
Association rule mining
Mining single-dimensional Boolean association rules
from transactional databases
From association mining to correlation analysis
Summary
April 11, 2010
Data Mining: Concepts and Techniques
17
Summaryy
Association rule mining
probably the most significant contribution from the
database communityy in KDD
A large number of papers have been published
Many interesting issues have been explored
An interesting research direction
April 11, 2010
Association analysis in other types of data: spatial
data, multimedia data, time series data, etc.
Data Mining: Concepts and Techniques
18