Data Mining:
Concepts and Techniques
— Chapter 5 —
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
Chapter 5: Mining Frequent Patterns, Association and Correlations
 • Basic concepts and a road map
 • Efficient and scalable frequent itemset mining methods
 • Mining various kinds of association rules
 • From association mining to correlation analysis
 • Constraint-based association mining
 • Summary
What Is Frequent Pattern Analysis?
 • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
 • First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
 • Motivation: finding inherent regularities in data
   – What products were often purchased together? — Beer and diapers?!
   – What are the subsequent purchases after buying a PC?
   – What kinds of DNA are sensitive to this new drug?
   – Can we automatically classify web documents?
 • Applications
   – Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
Why Is Freq. Pattern Mining Important?
 • Discloses an intrinsic and important property of data sets
 • Forms the foundation for many essential data mining tasks
   – Association, correlation, and causality analysis
   – Sequential and structural (e.g., sub-graph) patterns
   – Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
   – Classification: associative classification
   – Cluster analysis: frequent pattern-based clustering
   – Data warehousing: iceberg cube and cube-gradient
   – Semantic data compression: fascicles
   – Broad applications
Basic Concepts: Frequent Patterns and Association Rules

 Transaction-id | Items bought
 10             | A, B, D
 20             | A, C, D
 30             | A, D, E
 40             | B, E, F
 50             | B, C, D, E, F

(Figure: Venn diagram of customers buying beer, diapers, or both.)

 • Itemset: a set of one or more items
 • k-itemset: X = {x1, …, xk}
 • (Absolute) support, or support count, of X: the frequency or number of occurrences of itemset X
 • (Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
 • An itemset X is frequent if X's support is no less than a minsup threshold
Basic Concepts: Frequent Patterns and Association Rules

(Same transaction database as on the previous slide.)

 • Find all the rules X ⇒ Y with minimum support and confidence
   – support, s: probability that a transaction contains X ∪ Y
   – confidence, c: conditional probability that a transaction having X also contains Y
 • Let supmin = 50%, confmin = 50%
   – Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}
   – Association rules:
     A ⇒ D (60%, 100%)
     D ⇒ A (60%, 75%)
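A quick way to sanity-check these numbers is to recompute them directly. The following minimal Python sketch (my own, not part of the slides) derives the support and confidence figures above from the five example transactions:

    transactions = [
        {"A", "B", "D"},            # 10
        {"A", "C", "D"},            # 20
        {"A", "D", "E"},            # 30
        {"B", "E", "F"},            # 40
        {"B", "C", "D", "E", "F"},  # 50
    ]

    def support(itemset):
        """Fraction of transactions containing every item of `itemset`."""
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(lhs, rhs):
        """P(rhs | lhs): support of the union divided by support of the lhs."""
        return support(lhs | rhs) / support(lhs)

    print(support({"A", "D"}))       # 0.6  -> A => D has 60% support
    print(confidence({"A"}, {"D"}))  # 1.0  -> A => D has 100% confidence
    print(confidence({"D"}, {"A"}))  # 0.75 -> D => A has 75% confidence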
Closed Patterns and Max-Patterns
 • A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
 • Solution: mine closed patterns and max-patterns instead
 • An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
 • An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
 • A closed pattern is a lossless compression of frequent patterns
   – Reduces the # of patterns and rules
Closed Patterns and Max-Patterns
 • Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, Min_sup = 1
   – What is the set of closed itemsets?
     <a1, …, a100>: 1
     <a1, …, a50>: 2
   – What is the set of max-patterns?
     <a1, …, a100>: 1
   – What is the set of all patterns?
     All 2^100 − 1 nonempty subsets of {a1, …, a100} — far too many to enumerate!
Two-step approach
 • Two-step approach:
   1. Frequent itemset generation
      – Generate all itemsets whose support ≥ minsup
   2. Rule generation
      – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
 • Frequent itemset generation is still computationally expensive
A Road Map
 • Based on the completeness of patterns to be mined
   – Complete set of frequent itemsets
   – Closed frequent itemsets
   – Maximal frequent itemsets (top-k frequent itemsets)
 • Based on the levels of abstraction involved in the rule set
   – Single-level vs. multilevel association rules
     buys(X, "computer") → buys(X, "HP_printer")
     buys(X, "laptop_computer") → buys(X, "HP_printer")
 • Based on the number of data dimensions involved in the rule
   – Single-dimensional association rules
     buys(X, "computer") → buys(X, "antivirus_software")
   – Multidimensional association rules
     age(X, "30…39") ∧ income(X, "42K…48K") → buys(X, "high_resolution_TV")
A Road Map
 • Based on the types of values handled in the rule
   – Boolean association rules
   – Quantitative association rules
 • Based on the kinds of rules to be mined
   – Association rules
   – Strong gradient relationships
 • Based on the kinds of patterns to be mined
   – Frequent item patterns
   – Sequential patterns
   – Structured patterns
Chapter 5: Mining Frequent Patterns, Association and Correlations
 • Basic concepts and a road map
 • Efficient and scalable frequent itemset mining methods
 • Mining various kinds of association rules
 • From association mining to correlation analysis
 • Constraint-based association mining
 • Summary
Scalable Methods for Mining Frequent Patterns
 • The downward closure property of frequent patterns
   – Any subset of a frequent itemset must be frequent
   – If {beer, diaper, nuts} is frequent, so is {beer, diaper}
   – i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
 • Scalable mining methods: three major approaches
   – Apriori (Agrawal & Srikant @VLDB'94)
   – Frequent pattern growth (FP-growth — Han, Pei & Yin @SIGMOD'00)
   – Vertical data format approach (CHARM — Zaki & Hsiao @SDM'02)
Apriori: A Candidate Generation-and-Test Approach
 • Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @VLDB'94; Mannila, et al. @KDD'94)
 • Method:
   – Initially, scan the DB once to get the frequent 1-itemsets
   – Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
   – Test the candidates against the DB
   – Terminate when no frequent or candidate set can be generated
The Apriori Algorithm — An Example (Supmin = 2)

Database TDB:
 Tid | Items
 10  | A, C, D
 20  | B, C, E
 30  | A, B, C, E
 40  | B, E

1st scan → C1:
 Itemset | sup
 {A}     | 2
 {B}     | 3
 {C}     | 3
 {D}     | 1
 {E}     | 3

L1 (prune {D}):
 Itemset | sup
 {A}     | 2
 {B}     | 3
 {C}     | 3
 {E}     | 3

C2 (from L1 ⋈ L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
 Itemset | sup
 {A, B}  | 1
 {A, C}  | 2
 {A, E}  | 1
 {B, C}  | 2
 {B, E}  | 3
 {C, E}  | 2

L2:
 Itemset | sup
 {A, C}  | 2
 {B, C}  | 2
 {B, E}  | 3
 {C, E}  | 2

C3: {B, C, E}; 3rd scan → L3:
 Itemset   | sup
 {B, C, E} | 2
The Apriori Algorithm
 • Pseudo-code:
   Ck: candidate itemsets of size k
   Lk: frequent itemsets of size k

   L1 = {frequent items};
   for (k = 1; Lk != ∅; k++) do begin
     Ck+1 = candidates generated from Lk;
     for each transaction t in database do
       increment the count of all candidates in Ck+1 that are contained in t
     Lk+1 = candidates in Ck+1 with min_support
   end
   return ∪k Lk;
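The pseudo-code translates almost line for line into Python. Below is a minimal runnable sketch (my own rendering, not the book's reference implementation); it assumes transactions are given as sets and relies on an apriori_gen helper, sketched after the candidate-generation slide below:

    def apriori(transactions, min_support_count):
        """Return {frozenset: support count} for all frequent itemsets."""
        # L1: frequent 1-itemsets from one scan of the database
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {s: c for s, c in counts.items() if c >= min_support_count}
        frequent = dict(Lk)
        while Lk:
            Ck = apriori_gen(set(Lk))              # candidates of size k+1
            counts = {c: 0 for c in Ck}
            for t in transactions:                 # one scan counts them all
                for c in Ck:
                    if c <= t:
                        counts[c] += 1
            Lk = {s: c for s, c in counts.items() if c >= min_support_count}
            frequent.update(Lk)
        return frequent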
Important Details of Apriori
 • How to generate candidates?
   – Step 1: self-join Lk
   – Step 2: pruning
 • How to count supports of candidates?
 • Example of candidate generation
   – L3 = {abc, abd, acd, ace, bcd}
   – Self-joining: L3 ⋈ L3
     abcd from abc and abd
     acde from acd and ace
   – Pruning:
     acde is removed because ade is not in L3
   – C4 = {abcd}
How to Generate Candidates?
 • Suppose the items in Lk-1 are listed in an order
 • Step 1: self-join Lk-1
   insert into Ck
   select p.item1, p.item2, …, p.itemk-1, q.itemk-1
   from Lk-1 p, Lk-1 q
   where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
 • Step 2: pruning
   forall itemsets c in Ck do
     forall (k-1)-subsets s of c do
       if (s is not in Lk-1) then delete c from Ck
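A direct Python rendering of the join and prune steps (my own sketch; it assumes itemsets are frozensets of mutually comparable items):

    from itertools import combinations

    def apriori_gen(Lk):
        """Generate candidate (k+1)-itemsets from the frequent k-itemsets Lk."""
        Lk = list(Lk)
        k = len(Lk[0]) if Lk else 0
        candidates = set()
        # Step 1: self-join -- merge pairs that agree on their first k-1 items
        for i in range(len(Lk)):
            for j in range(i + 1, len(Lk)):
                a, b = sorted(Lk[i]), sorted(Lk[j])
                if a[:-1] == b[:-1] and a[-1] != b[-1]:
                    candidates.add(Lk[i] | Lk[j])
        # Step 2: prune -- every k-subset of a candidate must itself be frequent
        Lk_set = set(Lk)
        return {c for c in candidates
                if all(frozenset(s) in Lk_set for s in combinations(c, k))}

Running it on L3 = {abc, abd, acd, ace, bcd} from the slide above yields C4 = {abcd}: acde is produced by the join but removed by the prune because ade is not in L3.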
How to Count Supports of Candidates?
 • Why is counting supports of candidates a problem?
   – The total number of candidates can be very huge
   – One transaction may contain many candidates
 • Method:
   – Candidate itemsets are stored in a hash tree
   – A leaf node of the hash tree contains a list of itemsets and counts
   – An interior node contains a hash table
   – Subset function: finds all the candidates contained in a transaction
Example: Counting Supports of Candidates

(Figure: a hash tree storing candidate 3-itemsets such as 234, 567, 145, 136, 345, 124, 457, 125, 458, 159, 356, 357, 689, 367, 368. The subset function hashes items into three branches (1,4,7 / 2,5,8 / 3,6,9) and recursively splits the transaction 1 2 3 5 6 into 1+2356, 12+356, 13+56, … to reach exactly those leaves whose candidates the transaction may contain.)
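In memory-rich settings the hash tree can be replaced by plain hashing. The sketch below (my simplification, not the slides' hash-tree structure) implements the same subset function with a dictionary: it enumerates the k-subsets of each transaction and looks them up among the candidates:

    from itertools import combinations

    def count_candidates(transactions, Ck, k):
        """Count each candidate k-itemset in Ck over the transactions."""
        counts = {c: 0 for c in Ck}
        for t in transactions:
            # Only the k-subsets of t can match a candidate in Ck
            for subset in combinations(sorted(t), k):
                key = frozenset(subset)
                if key in counts:
                    counts[key] += 1
        return counts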
Challenges of Frequent Pattern Mining
 • Challenges
   – Multiple scans of the transaction database
   – Huge number of candidates
   – Tedious workload of support counting for candidates
 • Improving Apriori: general ideas
   – Reduce the passes of transaction database scans
   – Shrink the number of candidates
   – Facilitate the support counting of candidates
Partition: Scan Database Only Twice
 • Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
   – Scan 1: partition the database and find local frequent patterns
   – Scan 2: consolidate the global frequent patterns
 • A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95
DHP: Reduce the Number of Candidates
 • A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
   – Candidates: a, b, c, d, e
   – Hash entries: {ab, ad, ae}, {bd, be, de}, …
   – Frequent 1-itemsets: a, b, d, e
   – ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold
 • J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95
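A rough sketch of DHP's extra hashing pass (my own rendering; the bucket count and hash function are arbitrary assumptions): while scanning for 1-itemsets, every 2-subset of each transaction is also hashed into a small table, and a pair survives into C2 only if its bucket total reaches the threshold:

    from itertools import combinations

    NUM_BUCKETS = 101  # assumption: any fixed table size works

    def dhp_bucket_counts(transactions):
        buckets = [0] * NUM_BUCKETS
        for t in transactions:
            for pair in combinations(sorted(t), 2):
                buckets[hash(pair) % NUM_BUCKETS] += 1
        return buckets

    def may_be_frequent(pair, buckets, min_support_count):
        # Necessary, not sufficient: a bucket's total >= the pair's true count
        key = tuple(sorted(pair))
        return buckets[hash(key) % NUM_BUCKETS] >= min_support_count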
Sampling for Frequent Patterns
 • Select a sample of the original database; mine frequent patterns within the sample using Apriori
 • Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked
   – Example: check abcd instead of ab, ac, …, etc.
 • Scan the database again to find missed frequent patterns
 • H. Toivonen. Sampling large databases for association rules. In VLDB'96
DIC (Dynamic Itemset Counting): Reduce Number of Scans
 • Once both A and D are determined frequent, the counting of AD begins
 • Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins

(Figure: the itemset lattice over {A, B, C, D}, from {} up to ABCD, next to a timeline of transactions. Apriori finishes counting all 1-itemsets before starting 2-itemsets; DIC starts counting an itemset as soon as all of its subsets are known to be frequent, so 1-, 2-, and 3-itemset counting overlap within the same scans.)

 • S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97
Bottleneck of Frequent-pattern Mining
 • Multiple database scans are costly
 • Mining long patterns needs many passes of scanning and generates lots of candidates
   – To find the frequent itemset i1i2…i100
     # of scans: 100
     # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30!
 • Bottleneck: candidate generation and test
 • Can we avoid candidate generation?
Step 2: How to Generate Association Rules from Frequent Itemsets
 • Confidence is generated by:
   1: confidence(A ⇒ B) = P(B | A)
   2: P(B | A) = support_count(A ∪ B) / support_count(A)
 • For each frequent itemset l, generate all nonempty proper subsets s of l; for each such s, output the rule s ⇒ (l − s) if support_count(l) / support_count(s) ≥ min_conf
 • Because the rules are generated from frequent itemsets, each automatically satisfies minimum support
Mining Frequent Patterns Without Candidate Generation
 • Grow long patterns from short ones using local frequent items
   – "abc" is a frequent pattern
   – Get all transactions having "abc": DB|abc
   – "d" is a local frequent item in DB|abc ⇒ abcd is a frequent pattern
Construct FP-tree from a Transaction Database (min_support = 3)

 TID | Items bought              | (ordered) frequent items
 100 | {f, a, c, d, g, i, m, p}  | {f, c, a, m, p}
 200 | {a, b, c, f, l, m, o}     | {f, c, a, b, m}
 300 | {b, f, h, j, o, w}        | {f, b}
 400 | {b, c, k, s, p}           | {c, b, p}
 500 | {a, f, c, e, l, p, m, n}  | {f, c, a, m, p}

 1. Scan the DB once, find the frequent 1-itemsets (single-item patterns)
 2. Sort frequent items in frequency-descending order, giving the f-list
 3. Scan the DB again, construct the FP-tree

Header table (item: frequency, with a head pointer into the tree): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
F-list = f-c-a-b-m-p

(Figure: the resulting FP-tree. From the root {}: f:4 → c:3 → a:3, where a:3 branches to m:2 → p:2 and to b:1 → m:1; f:4 also has a child b:1; a second root branch is c:1 → b:1 → p:1. The header-table links chain together all nodes carrying the same item.)
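The two-scan construction above fits in a few lines of Python. This is a compact sketch (my own rendering; the header-table "links" are kept as per-item node lists rather than chained pointers):

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}

    def build_fptree(transactions, min_support_count):
        # Scan 1: count items, keep the frequent ones, sort into the f-list
        freq = {}
        for t in transactions:
            for item in t:
                freq[item] = freq.get(item, 0) + 1
        freq = {i: c for i, c in freq.items() if c >= min_support_count}
        flist = sorted(freq, key=lambda i: (-freq[i], i))
        rank = {item: r for r, item in enumerate(flist)}
        # Scan 2: insert each transaction's ordered frequent items into the tree
        root = FPNode(None, None)
        header = {item: [] for item in flist}   # item -> all nodes carrying it
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in rank), key=rank.get):
                if item not in node.children:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header[item].append(child)
                node = node.children[item]
                node.count += 1
        return root, header, flist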
Find Patterns Having P From P-conditional Database
 • Start at the frequent-item header table of the FP-tree
 • Traverse the FP-tree by following the links of each frequent item p
 • Accumulate all the transformed prefix paths of item p to form p's conditional pattern base

(FP-tree and header table as on the previous slide.)

Conditional pattern bases:
 item | cond. pattern base
 c    | f:3
 a    | fc:3
 b    | fca:1, f:1, c:1
 m    | fca:2, fcab:1
 p    | fcam:2, cb:1
From Conditional Pattern-bases to Conditional FP-trees
 • For each pattern base
   – Accumulate the count for each item in the base
   – Construct the FP-tree for the frequent items of the pattern base

 m-conditional pattern base: fca:2, fcab:1
 m-conditional FP-tree: {} → f:3 → c:3 → a:3 (b does not reach min_support and is dropped)
 All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-tree
 • Cond. pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
 • Cond. pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
 • Cond. pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3
A Special Case: Single Prefix Path in FP-tree
 • Suppose a (conditional) FP-tree T has a shared single prefix path P
 • Mining can be decomposed into two parts
   – Reduction of the single prefix path into one node
   – Concatenation of the mining results of the two parts

(Figure: a tree whose single prefix path {} → a1:n1 → a2:n2 → a3:n3 fans out into branches b1:m1, C1:k1, C2:k2, C3:k3 is split into the prefix-path part {} → a1:n1 → a2:n2 → a3:n3 and the branching part r1 rooted above b1:m1, C1:k1, C2:k2, C3:k3.)
Mining Frequent Patterns With FP-trees
 • Idea: frequent pattern growth
   – Recursively grow frequent patterns by pattern and database partition
 • Method
   – For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
   – Repeat the process on each newly created conditional FP-tree
   – Until the resulting FP-tree is empty, or it contains only one path — a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern
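Building on the build_fptree sketch above, the whole recursion can be rendered as follows (my own sketch, not the paper's implementation; conditional pattern bases are materialized as replicated prefix paths for simplicity):

    def fpgrowth(transactions, min_support_count, suffix=frozenset()):
        root, header, flist = build_fptree(transactions, min_support_count)
        patterns = {}
        for item in reversed(flist):             # least frequent item first
            support = sum(n.count for n in header[item])
            pattern = suffix | {item}
            patterns[pattern] = support
            # Conditional pattern base: the prefix path of every node carrying
            # `item`, replicated node.count times so the counts carry over
            cond_db = []
            for node in header[item]:
                path, p = [], node.parent
                while p.item is not None:
                    path.append(p.item)
                    p = p.parent
                cond_db.extend([path] * node.count)
            if cond_db:
                patterns.update(fpgrowth(cond_db, min_support_count, pattern))
        return patterns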
Scaling FP-growth by DB Projection
 • FP-tree cannot fit in memory? — DB projection
 • First partition a database into a set of projected DBs
 • Then construct and mine an FP-tree for each projected DB
 • Parallel projection vs. partition projection techniques
   – Parallel projection is space costly
FP-Growth vs. Apriori: Scalability With the Support Threshold

(Figure: run time (sec.) vs. support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime against D1 Apriori runtime; run time shown from 0 to 100 sec. over thresholds from 0 to 3%.)
FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

(Figure: runtime (sec.) vs. support threshold (%) on data set T25I20D100K, comparing D2 FP-growth against D2 TreeProjection; runtime shown from 0 to 140 sec. over thresholds from 0 to 2%.)
Why Is FP-Growth the Winner?
 • Divide-and-conquer:
   – Decompose both the mining task and the DB according to the frequent patterns obtained so far
   – Leads to focused search of smaller databases
 • Other factors
   – No candidate generation, no candidate test
   – Compressed database: the FP-tree structure
   – No repeated scan of the entire database
   – Basic ops — counting local frequent items and building sub-FP-trees; no pattern search and matching
CHARM: Mining by Exploring Vertical Data Format
 • Horizontal format: {TID: {I1, I2, …}}
 • Vertical format: {Ii: {TID1, TID2, …}}
   – tid-list: the list of transaction-ids containing an itemset
 • Method:
   1. Transform the horizontally formatted data into the vertical format by scanning the data set once
   2. Starting with k = 1, use the frequent k-itemsets to construct the candidate (k+1)-itemsets based on the Apriori property; the TID_set of each candidate is obtained by intersecting the TID_sets of the frequent k-itemsets. Repeat, with k incremented by 1 each time, until no frequent itemsets or candidate itemsets can be found
Example 5.6: Mining frequent itemsets using the vertical data format
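As a stand-in for the example's tables, here is a minimal Eclat-style sketch of the method (my own rendering, not CHARM itself), run on the vertical form of the TDB from the Apriori example, with min_sup = 2; the initial tid-sets are assumed to be already filtered to frequent items:

    def eclat(items, min_support_count, prefix=frozenset(), out=None):
        """`items` is a list of (item, tidset) pairs; returns {itemset: tidset}."""
        out = {} if out is None else out
        for i, (item, tids) in enumerate(items):
            new_prefix = prefix | {item}
            out[new_prefix] = tids
            # Extend only with later items, intersecting tid-sets as we go
            suffix = [(nxt, tids & nxt_tids)
                      for nxt, nxt_tids in items[i + 1:]
                      if len(tids & nxt_tids) >= min_support_count]
            if suffix:
                eclat(suffix, min_support_count, new_prefix, out)
        return out

    vertical = [("A", {10, 30}), ("B", {20, 30, 40}),
                ("C", {10, 20, 30}), ("E", {20, 30, 40})]
    print(eclat(vertical, 2))   # includes {B, C, E} with tidset {20, 30}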
CHARM: Mining by Exploring Vertical Data Format
 • Using diffsets to accelerate mining
   – Only keep track of the differences of tids
   – t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
   – Diffset(XY, X) = {T2}
 • Eclat/MaxEclat (Zaki et al. @KDD'97), VIPER (P. Shenoy et al. @SIGMOD'00), CHARM (Zaki & Hsiao @SDM'02)
Visualization of Association Rules: Plane Graph
(Figure: plane-graph visualization of association rules.)

Visualization of Association Rules: Rule Graph
(Figure: rule-graph visualization of association rules.)

Visualization of Association Rules (SGI/MineSet 3.0)
(Figure: screenshot of association-rule visualization in SGI/MineSet 3.0.)
Chapter 5: Mining Frequent Patterns, Association and Correlations
 • Basic concepts and a road map
 • Efficient and scalable frequent itemset mining methods
 • Mining various kinds of association rules
 • From association mining to correlation analysis
 • Constraint-based association mining
 • Summary
Mining Various Kinds of Association Rules
 • Mining multilevel associations
 • Mining multidimensional associations
 • Mining quantitative associations
 • Mining interesting correlation patterns
Mining Multiple-Level Association Rules
 • Items often form hierarchies
 • Flexible support settings
   – Items at a lower level are expected to have lower support
 • Exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95)

 Uniform support: Level 1 min_sup = 5%; Level 2 min_sup = 5%
 Reduced support: Level 1 min_sup = 5%; Level 2 min_sup = 3%
 Example hierarchy: Milk [support = 10%] → 2% Milk [support = 6%], Skim Milk [support = 4%]
Multi-level Association: Redundancy Filtering
 • Some rules may be redundant due to "ancestor" relationships between items
 • Example
   – milk ⇒ wheat bread [support = 8%, confidence = 70%]
   – 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
 • We say the first rule is an ancestor of the second rule
 • A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor; e.g., if about one quarter of the milk sold is 2% milk, the expected support of the second rule is 8% × 1/4 = 2%, so the rule adds no new information
Mining Multi-Dimensional Association
 • Single-dimensional rules:
   buys(X, "milk") ⇒ buys(X, "bread")
 • Multi-dimensional rules: ≥ 2 dimensions or predicates
   – Inter-dimension assoc. rules (no repeated predicates)
     age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
   – Hybrid-dimension assoc. rules (repeated predicates)
     age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
 • Categorical attributes: finite number of possible values, no ordering among values
 • Quantitative attributes: numeric, implicit ordering among values
Chapter 5: Mining Frequent Patterns, Association and Correlations
 • Basic concepts and a road map
 • Efficient and scalable frequent itemset mining methods
 • Mining various kinds of association rules
 • From association mining to correlation analysis
 • Constraint-based association mining
 • Summary
Strong Rules Are Not Necessarily Interesting
 • If Min_Sup = 30% and Min_Con = 60%, then buys(X, "computer games") ⇒ buys(X, "videos") [support = 40%, confidence = 66%] qualifies as a strong rule
 • But P(video) = 75% > 66%!!! Buying games actually lowers the probability of buying videos, so the strong rule is misleading
A New Augmented Support-Confidence Framework for Association Rules
 Correlation measures:
 • Lift
 • χ² computation
 • All_confidence
 • Cosine
Lift

 lift = P(A ∪ B) / (P(A) P(B))

 • Lift < 1 → A and B are negatively correlated
 • Lift = 1 → A and B are independent
 • Lift > 1 → A and B are positively correlated
χ² Computation

 χ² = Σ (observed − expected)² / expected, summed over the cells of the contingency table of A and B
All_confidence & Cosine

 all_conf(A, B) = sup(A ∪ B) / max{sup(A), sup(B)}
 cosine(A, B) = sup(A ∪ B) / √(sup(A) × sup(B))

 Cosine is similar to lift, with an important difference: because of the square root, the cosine value is influenced only by the supports of A, B, and A ∪ B, and not by the total number of transactions!
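The difference is easy to see numerically. Here is a sketch computing all three measures from absolute support counts (my own rendering; `n` is the total number of transactions):

    import math

    def measures(n, sup_a, sup_b, sup_ab):
        p_a, p_b, p_ab = sup_a / n, sup_b / n, sup_ab / n
        lift = p_ab / (p_a * p_b)
        all_conf = sup_ab / max(sup_a, sup_b)
        cosine = sup_ab / math.sqrt(sup_a * sup_b)  # n cancels out
        return lift, all_conf, cosine

    # Adding unrelated transactions changes n, and hence lift, but leaves
    # all_confidence and cosine untouched -- they are null-invariant.
    print(measures(100, 60, 75, 40))
    print(measures(1000, 60, 75, 40))  # same all_conf and cosine, different lift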
Which Measures Should Be Used?
 • Lift and χ² are not good measures for correlations in large transactional DBs
 • all_conf or coherence could be good measures (Omiecinski @TKDE'03)
   – Both all_conf and coherence have the downward closure property
   – Efficient algorithms can be derived for mining (Lee et al. @ICDM'03sub)
Chapter 5: Mining Frequent Patterns, Association and Correlations
 • Basic concepts and a road map
 • Efficient and scalable frequent itemset mining methods
 • Mining various kinds of association rules
 • From association mining to correlation analysis
 • Constraint-based association mining
 • Summary
Constraint-based (Query-Directed) Mining
 • Finding all the patterns in a database autonomously? — unrealistic!
   – The patterns could be too many but not focused!
 • Data mining should be an interactive process
   – The user directs what is to be mined using a data mining query language (or a graphical user interface)
 • Constraint-based mining
   – User flexibility: provides constraints on what is to be mined
   – System optimization: explores such constraints for efficient mining — constraint-based mining
Constraints in Data Mining
 • Knowledge type constraint:
   – classification, association, etc.
 • Data constraint — using SQL-like queries
   – find product pairs sold together in stores in Chicago in Dec. '02
 • Dimension/level constraint
   – in relevance to region, price, brand, customer category
 • Rule (or pattern) constraint
   – small sales (price < $10) triggers big sales (sum > $200)
 • Interestingness constraint
   – strong rules: min_support ≥ 3%, min_confidence ≥ 60%
Anti-Monotonicity in Constraint Pushing
 • Anti-monotonicity
   – When an itemset S violates the constraint, so does any of its supersets
   – sum(S.Price) ≤ v is anti-monotone
   – sum(S.Price) ≥ v is not anti-monotone
 • Example. C: range(S.profit) ≤ 15 is anti-monotone
   – Itemset ab violates C
   – So does every superset of ab

TDB (min_sup = 2):
 TID | Transaction
 10  | a, b, c, d, f
 20  | b, c, d, f, g, h
 30  | a, c, d, e, f
 40  | c, e, f, g

 Item | Profit
 a    | 40
 b    | 0
 c    | -20
 d    | 10
 e    | -30
 f    | 30
 g    | 20
 h    | -10
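A small sketch of how such a constraint is pushed into mining (my own rendering, using the profit table above): because range(S.profit) ≤ 15 is anti-monotone, a violating candidate can be discarded before any support counting, and none of its supersets ever needs to be generated:

    profit = {"a": 40, "b": 0, "c": -20, "d": 10,
              "e": -30, "f": 30, "g": 20, "h": -10}

    def satisfies_range(itemset, v=15):
        vals = [profit[i] for i in itemset]
        return max(vals) - min(vals) <= v

    def prune_candidates(Ck):
        # Safe only for anti-monotone constraints: once violated, the
        # constraint can never become satisfied again by adding items
        return {c for c in Ck if satisfies_range(c)}

    print(satisfies_range({"a", "b"}))  # False: range = 40 - 0 = 40 > 15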
Monotonicity for Constraint Pushing
 • Monotonicity
   – When an itemset S satisfies the constraint, so does any of its supersets
   – sum(S.Price) ≥ v is monotone
   – min(S.Price) ≤ v is monotone
 • Example. C: range(S.profit) ≥ 15
   – Itemset ab satisfies C
   – So does every superset of ab

(TDB, with min_sup = 2, and the profit table are as on the previous slide.)
Succinctness
 • Succinctness:
   – Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
   – Idea: whether an itemset S satisfies constraint C can be determined based on the selection of items, without looking at the transaction database
   – min(S.Price) ≤ v is succinct
   – sum(S.Price) ≥ v is not succinct
 • Optimization: if C is succinct, C is pre-counting pushable
Convertible Constraints
 • avg(X) ≥ 25 is convertible anti-monotone w.r.t. the item-value-descending order R: <a, f, g, d, b, h, c, e>
   – If an itemset af violates a constraint C, so does every itemset with af as a prefix, such as afd
 • avg(X) ≥ 25 is convertible monotone w.r.t. the item-value-ascending order R⁻¹: <e, c, h, b, d, g, f, a>
   – If an itemset d satisfies a constraint C, so do itemsets df and dfa, which have d as a prefix
 • Thus, avg(X) ≥ 25 is strongly convertible

(Item profits as in the table on the anti-monotonicity slide.)
Chapter 5: Mining Frequent Patterns, Association and Correlations
 • Basic concepts and a road map
 • Efficient and scalable frequent itemset mining methods
 • Mining various kinds of association rules
 • From association mining to correlation analysis
 • Constraint-based association mining
 • Summary
Frequent-Pattern Mining: Summary
 • Frequent pattern mining — an important task in data mining
 • Scalable frequent pattern mining methods
   – Apriori (candidate generation & test)
   – Projection-based (FP-growth, CLOSET+, …)
   – Vertical format approach (CHARM, …)
 • Mining a variety of rules and interesting patterns
 • Constraint-based mining
 • Mining sequential and structured patterns
 • Extensions and applications
Frequent-Pattern Mining: Research Problems
 • Mining fault-tolerant frequent, sequential, and structured patterns
   – Patterns allowing limited faults (insertion, deletion, mutation)
 • Mining truly interesting patterns
   – Surprising, novel, concise, …
 • Application exploration
   – E.g., DNA sequence analysis and bio-pattern classification
   – "Invisible" data mining
Ref: Basic Concepts of Frequent Pattern Mining
 • (Association rules) R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93.
 • (Max-pattern) R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98.
 • (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99.
 • (Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95.