Mining Frequent Patterns, Associations
Data Mining Techniques
1
Outline
• What is association rule mining and frequent
pattern mining?
• Methods for frequent-pattern mining
• Constraint-based frequent-pattern mining
• Frequent-pattern mining: achievements,
promises and research problems
Data Mining Techniques
2
Market Basket Analysis
This basket contains an assortment of products.
What did one customer purchase at one time?
What merchandise are customers buying, and when?
Market basket analysis is a process that analyzes customer buying habits.
Data Mining Techniques
3
How Can Market Basket Analysis
Help?
• Customers: who are they? Why do they make
certain purchases?
• Merchandise: which products tend to be
purchased together? Which are most
amenable to promotion? Does the brand of a
product make a difference?
• Usage:
– Store layout;
– Product layout;
– Coupon issues;
Data Mining Techniques
4
Association Rules from Market
Basket Analysis
 Method:
Transaction 1: Frozen pizza, cola,
milk
Transaction 2: Milk, potato chips
Transaction 3: Cola, frozen pizza
Transaction 4: Milk, pretzels
Transaction 5: Cola, pretzels
This hints that frozen pizza and cola may sell well together,
and should be placed side by side in the convenience store.
Results (co-occurrence counts):

               Frozen Pizza  Milk  Cola  Potato Chips  Pretzels
Frozen Pizza        2          1     2        0            0
Milk                1          3     1        1            1
Cola                2          1     3        0            1
Potato Chips        0          1     0        1            0
Pretzels            0          1     1        0            2
we could derive the association rules:
If a customer purchases Frozen Pizza, then they will probably purchase Cola.
If a customer purchases Cola, then they will probably purchase Frozen Pizza.
Data Mining Techniques
5
Use of Rule Associations
• Coupons, discounts
– Don’t give discounts on 2 items that are frequently bought together.
Use the discount on 1 to “pull” the other
• Product placement
– Offer correlated products to the customer at the same time.
Increases sales
• Timing of cross-marketing
– Send camcorder offer to VCR purchasers 2-3 months after VCR
purchase
• Discovery of patterns
– People who bought X, Y and Z (but not any pair) bought W over
half the time
Data Mining Techniques
6
What are Frequent Patterns?
• Frequent patterns: patterns (itemsets,
subsequences, substructures, etc.) that occur
frequently in a database [AIS93]
For example:
– A set of items, such as milk and bread, that appear
frequently together in a transaction data set is a
frequent itemset
– A subsequence, such as buying first a PC, then a
digital camera, and then a memory card, if it occurs
frequently in a shopping history database, is a frequent
sequential pattern
– A substructure can refer to different structural forms,
such as a subgraph, subtree, or sublattice
Data Mining Techniques
7
Motivation
• Frequent pattern mining: finding regularities in
data
– What products were often purchased together? –beer
and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to a new drug?
– Can we automatically classify web documents based
on frequent key-word combinations?
Data Mining Techniques
8
Why Is Freq. Pattern Mining
Important?
• Forms the foundation for many essential data mining
tasks
– Association, correlation, and causality analysis
– Sequential, structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series, and
stream data
– Classification: associative classification
– Cluster analysis: frequent pattern-based clustering
– Data warehousing: iceberg cube and cube-gradient
– Semantic data compression: fascicles
– Broad applications: Basket data analysis, cross-marketing,
catalog design, sale campaign analysis, web log (click stream)
analysis, …
Data Mining Techniques
9
A Motivating Example
• Market basket analysis (customers shopping
behavior analysis)
– Which groups or sets of items are customers likely to
purchase on a given trip to the store?
– Results can be used to plan marketing or advertising
strategies, or in the design of a new catalog.
– These patterns can be presented in the form of
association rules, for example:
• computer ⇒ antivirus_software [support = 2%, confidence = 60%]
Data Mining Techniques
10
Basic Concepts
• I is the set of items {i1, i2, …, id}
• A transaction T is a set of items: T = {ia, ib, …, it},
T ⊆ I. Each transaction is associated with an
identifier, called a TID.
• D, the task-relevant data, is a set of transactions
D = {T1, T2, …, Tn}.
• An association rule is of the form
A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
Data Mining Techniques
11
Rule Measures: Support and
Confidence
• Itemset X = {x1, …, xk} is called a k-itemset
• support, s: the probability that a transaction in D contains
X ∪ Y, P(X ∪ Y) — relative support; the number of
transactions in D that contain the itemset — absolute
support
• confidence, c: the conditional probability that a transaction in D
containing X also contains Y, P(Y | X):
c = sup(X ∪ Y) / sup(X)
• Frequent itemset: if the support of an itemset X satisfies a
predefined minimum support threshold, then X is a
frequent itemset
Data Mining Techniques
12
An Example

TID   Items bought
10    A, B, D
20    A, C, D
30    A, D, E
40    B, E, F
50    B, C, D, E, F

Let sup_min = 50%, conf_min = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A ⇒ D (60%, 100%)
D ⇒ A (60%, 75%)
Data Mining Techniques
13
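As a quick check of the numbers above, here is a minimal Python sketch (illustrative only, not part of the original slides) that recomputes the support and confidence of A ⇒ D and D ⇒ A from the five transactions:

# Support and confidence for A => D and D => A on the example transactions.
transactions = [
    {"A", "B", "D"},            # TID 10
    {"A", "C", "D"},            # TID 20
    {"A", "D", "E"},            # TID 30
    {"B", "E", "F"},            # TID 40
    {"B", "C", "D", "E", "F"},  # TID 50
]

def support(itemset):
    # Relative support: fraction of transactions containing the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

sup_AD = support({"A", "D"})              # 0.6  (60%)
conf_A_to_D = sup_AD / support({"A"})     # 1.0  (100%)
conf_D_to_A = sup_AD / support({"D"})     # 0.75 (75%)
print(sup_AD, conf_A_to_D, conf_D_to_A)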
Problem Definition
• Given I={i1, i2,…, im}, D={t1, t2, …, tn} ,and the
minimum support and confidence thresholds,
– frequent pattern mining problem is to find all frequent
patterns in the D
– association rule mining problem is to identify all
strong association rules X Y, that must satisfy
minimum support and minimum confidence
Data Mining Techniques
14
Frequent Pattern Mining: A Road
Map (Ⅰ)
• Based on the types of values in the rule
– Boolean associations: involve associations between
the presence and absence of items
• buys(x, “SQLServer”) ∧ buys(x, “DMBook”)
⇒ buys(x, “DM Software”) [0.2%, 60%]
– Quantitative associations: describe associations
between quantitative items or attributes
• age(x, “30..39”) ∧ income(x, “42..48K”)
⇒ buys(x, “PC”)
Data Mining Techniques
15
Frequent Pattern Mining: A Road
Map (Ⅱ)
• Based on the number of data dimensions
involved in the rule
– Single-dimensional associations: the items or attributes
in an association rule reference only one dimension
• buys(x, “computer”) ⇒ buys(x, “printer”)
– Multidimensional associations: reference two or
more dimensions, such as age, income, and buys
• age(x, “30..39”) ∧ income(x, “42..48K”)
⇒ buys(x, “PC”)
Data Mining Techniques
16
Frequent Pattern Mining: A Road
Map (Ⅲ)
• Based on the levels of abstraction involved in
the rule set
– Single-level analysis
• buys(x, “computer”) ⇒ buys(x, “printer”)
– Multiple-level analysis
• What brands of computers are associated with what brands
of digital cameras?
• buys(x, “laptop_computer”) ⇒ buys(x, “HP_printer”)
Data Mining Techniques
17
Multiple-Level Association Rules
• Items often form hierarchies
TID   Items Purchased
1     IBM-ThinkPad-R40/P4M, Symantec-Norton-Antivirus-2003
2     Microsoft-Office-Proffesional-2003, Microsoft-
3     logiTech-Mouse, Fellows-Wrist-Rest
…     …

Concept hierarchy:
Level 0: all
Level 1: computer, software, printer & camera, accessory
Level 2: laptop, desktop | office, antivirus | printer, camera | mouse, pad
Level 3: IBM, Dell | Microsoft | …
Data Mining Techniques
18
Frequent Pattern Mining: A Road
Map (Ⅳ)
• Based on the completeness of patterns to be
mined
– Complete set of frequent itemsets
– Closed frequent itemsets
– Maximal frequent itemsets
– Constrained frequent itemsets
– Approximate frequent itemsets
– …
(Figure: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets)
Data Mining Techniques
19
Outline
• What is association rule mining and frequent
pattern mining?
• Methods for frequent-pattern mining
• Constraint-based frequent-pattern mining
• Frequent-pattern mining: achievements,
promises and research problems
Data Mining Techniques
20
Frequent Pattern Mining Methods
• Apriori and its variations/improvements
• Mining frequent-patterns without candidate
generation
• Mining max-patterns and closed itemsets
• Mining multi-dimensional, multi-level
frequent patterns with flexible support
constraints
• Interestingness: correlation and causality
Data Mining Techniques
21
Data Representation
• Transactional vs. binary; horizontal vs. vertical

Transactional (horizontal) layout:
TID   Items
10    a, c, d
20    b, c, e
30    a, b, c, e
40    b, e

Binary layout:
TID   a  b  c  d  e
10    1  0  1  1  0
20    0  1  1  0  1
30    1  1  1  0  1
40    0  1  0  0  1

Vertical layout:
Item   TIDs
a      10, 30
b      20, 30, 40
c      10, 20, 30
d      10
e      20, 30, 40
Data Mining Techniques
22
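A small Python sketch (illustrative only) of turning the horizontal layout above into the vertical layout:

# Convert the horizontal (TID -> items) layout into the vertical (item -> TIDs) layout.
horizontal = {10: {"a", "c", "d"}, 20: {"b", "c", "e"},
              30: {"a", "b", "c", "e"}, 40: {"b", "e"}}

vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

print(vertical)  # e.g. 'c' -> {10, 20, 30}, 'e' -> {20, 30, 40}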
Apriori: A Candidate Generation-and-Test Approach
• Apriori is a seminal algorithm proposed by R.
Agrawal & R. Srikant [VLDB’94]
• Apriori consists of two phases:
– Generate length (k+1) candidate itemsets from
length k frequent itemsets
• Join step
• Prune step
– Test the candidates against DB
Data Mining Techniques
23
Apriori-Based Mining
• Method:
– Initially, scan DB once to get frequent 1-itemset
– Generate length (k+1) candidate itemsets from length
k frequent itemsets
– Test the candidates against DB
– Terminate when no frequent or candidate set can be
generated
Data Mining Techniques
24
Apriori Property
• Apriori pruning property: If there is any itemset
which is infrequent, its superset should not be
generated/tested!
– No superset of any infrequent itemset should be
generated or tested
– Many item combinations can be pruned!
Data Mining Techniques
25
Illustrating Apriori Principle
The whole process of frequent pattern mining
can be seen as a search in the itemset lattice,
starting from the null set:
1-itemsets: A, B, C, D, E
2-itemsets: AB, AC, AD, AE, BC, BD, BE, CD, CE, DE
3-itemsets: ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE
4-itemsets: ABCD, ABCE, ABDE, ACDE, BCDE
5-itemset: ABCDE
Once an itemset is found to be infrequent, all of its
supersets in the lattice are pruned.
Data Mining Techniques
26
Apriori Algorithm—An Example (sup_min = 2)

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

2nd scan → C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B,C,E}
3rd scan → L3: {B,C,E}:2
Data Mining Techniques
27
The Apriori Algorithm
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1
        that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
Data Mining Techniques
28
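A minimal Python sketch of the level-wise loop above (illustrative only, not the original implementation; it uses a simple union-based join and a subset-based prune):

from itertools import combinations

def apriori(transactions, min_support):
    # Level-wise Apriori: generate length-(k+1) candidates from length-k
    # frequent itemsets, then test the candidates against the database.
    counts = {}
    for t in transactions:                       # frequent 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_support}
    frequent, k = set(Lk), 1
    while Lk:
        # Join step: unions of pairs of k-itemsets that give a (k+1)-itemset.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Test step: one scan of the database counts the surviving candidates.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_support}
        frequent |= Lk
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, 2))   # matches the worked example on the earlier slide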
Important Details of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• How to count supports of candidates?
Data Mining Techniques
29
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
Data Mining Techniques
30
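The join and prune steps can be sketched in Python as follows (itemsets are kept as sorted tuples; this only mirrors the SQL-style pseudocode above and is not a definitive implementation):

from itertools import combinations

def generate_candidates(Lk_minus_1):
    # Lk_minus_1: set of (k-1)-itemsets, each represented as a sorted tuple.
    prev = sorted(Lk_minus_1)
    Ck = set()
    # Step 1: self-join -- two (k-1)-itemsets join if they agree on the
    # first k-2 items and the last item of p precedes the last item of q.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                Ck.add(p + (q[-1],))
    # Step 2: prune -- drop candidates having an infrequent (k-1)-subset.
    return {c for c in Ck
            if all(s in Lk_minus_1 for s in combinations(c, len(c) - 1))}

# Example: L2 = {(A,C), (B,C), (B,E), (C,E)} -> C3 = {(B,C,E)}
print(generate_candidates({("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")}))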
How to Count Supports of
Candidates?
• Why counting supports of candidates a problem?
– The total number of candidates can be very huge
– One transaction may contain many candidates
• Method:
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and
counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in
a transaction
Data Mining Techniques
31
Counting Supports of Candidates Using
Hash Tree
• Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4},
{5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
• You need:
– Hash function
– Max leaf size: max number of itemsets stored in a
leaf node (if number of candidate itemsets exceeds
max leaf size, split the node)
Data Mining Techniques
32
Generate Hash Tree
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7},
{3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
Hash function: 1,4,7 | 2,5,8 | 3,6,9
Split nodes with more than 3 candidates using the second item.
Hashing on the first item gives three children of the root:
bucket {1,4,7}: 145, 124, 457, 125, 458, 159, 136
bucket {2,5,8}: 234, 567
bucket {3,6,9}: 356, 357, 689, 345, 367, 368
Data Mining Techniques
33
Generate Hash Tree
(same 15 candidates, hash function 1,4,7 | 2,5,8 | 3,6,9)
Now split the overfull nodes using the second item:
under first-item bucket {1,4,7}: {145}, {124, 457, 125, 458, 159}, {136}
under first-item bucket {2,5,8}: {234, 567}
under first-item bucket {3,6,9}: {345}, {356, 357, 689}, {367, 368}
Data Mining Techniques
34
Generate Hash Tree
(same 15 candidates, hash function 1,4,7 | 2,5,8 | 3,6,9)
Now, split the node {124, 457, 125, 458, 159} similarly, using the third item:
{124, 457}, {125, 458}, {159}
Data Mining Techniques
35
Subset Operation
Given a (lexicographically ordered) transaction t, say {1, 2, 3, 5, 6}, how can
we enumerate the possible subsets of size 3?
Level 1 (fix the first item): 1 (remaining 2 3 5 6), 2 (remaining 3 5 6), 3 (remaining 5 6)
Level 2 (fix the second item): 12 (3 5 6), 13 (5 6), 15 (6), 23 (5 6), 25 (6), 35 (6)
Level 3 (subsets of 3 items): 123, 125, 126, 135, 136, 156, 235, 236, 256, 356
Data Mining Techniques
36
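For comparison, the same enumeration can be written directly in Python (illustrative sketch):

from itertools import combinations

# Enumerate the size-3 subsets of the ordered transaction t = {1, 2, 3, 5, 6}.
t = [1, 2, 3, 5, 6]
for subset in combinations(t, 3):
    print(subset)   # (1, 2, 3), (1, 2, 5), ..., (3, 5, 6) -- 10 subsets in total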
Subset Operation Using Hash Tree
Transaction 1 2 3 5 6 is first split at the root using the hash function
(1,4,7 | 2,5,8 | 3,6,9): 1 + 2356, 2 + 356, 3 + 56, and each part is routed
to the matching child of the root of the hash tree built above.
Data Mining Techniques
37
Subset Operation Using Hash Tree
At the next level, the remaining items are hashed again:
1 + 2356 expands to 12 + 356, 13 + 56, 15 + 6, and each part follows
the matching child of the corresponding node.
Data Mining Techniques
38
Subset Operation Using Hash Tree
Continuing this recursive hashing down to the leaves, the transaction
1 2 3 5 6 is matched against 11 out of the 15 candidates.
Data Mining Techniques
39
How the Hash Tree Works
• Suppose t = {1, 2, 3, 4, 5}
• All size-3 subsets of t must begin with 1, 2, or 3
• Therefore, at the root we must hash on 1, 2, and 3 separately
• Once we reach a child of the root, we hash again; repeat the process
until the algorithm reaches the leaves, then check whether each candidate
in the leaf is a subset of the transaction and increment its count if it is
• In the example, 6 of 9 leaf nodes are visited and 11 of 15 itemsets are matched
Data Mining Techniques
40
Generating Association Rules From
Frequent Itemsets
• For each frequent itemset l, generate all nonempty
subsets of l
• For every nonempty subset s of l, output the rule s ⇒ (l − s) if
sup_count(l) / sup_count(s) ≥ min_conf
(the confidence of a rule A ⇒ B is c = sup_count(A ∪ B) / sup_count(A))

TID   Item_IDs
T10   I1, I2, I5
T20   I2, I4
T30   I2, I3
T40   I1, I2, I4
T50   I1, I3
T60   I2, I3
T70   I1, I3
T80   I1, I2, I3, I5
T90   I1, I2, I3

– Example: Suppose l = {I1, I2, I5}. The nonempty proper subsets of l are
{I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, and {I5}. The association rules are:
I1 ∧ I2 ⇒ I5   c = 2/4 = 50%
I1 ∧ I5 ⇒ I2   c = 2/2 = 100%
I2 ∧ I5 ⇒ I1   c = 2/2 = 100%
I1 ⇒ I2 ∧ I5   c = 2/6 = 33%
I2 ⇒ I1 ∧ I5   c = 2/7 = 29%
I5 ⇒ I1 ∧ I2   c = 2/2 = 100%
Data Mining Techniques
41
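A small Python sketch of this rule-generation step (the support counts are taken from the table above; the helper name rules_from_itemset is made up for this example):

from itertools import combinations

def rules_from_itemset(l, sup_count, min_conf):
    # For every nonempty proper subset s of l, emit s => (l - s) when
    # sup_count(l) / sup_count(s) >= min_conf.
    l = frozenset(l)
    rules = []
    for r in range(1, len(l)):
        for s in combinations(l, r):
            s = frozenset(s)
            conf = sup_count[l] / sup_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

# Absolute support counts from the nine-transaction table above.
counts = {frozenset(x): c for x, c in [
    (("I1",), 6), (("I2",), 7), (("I5",), 2),
    (("I1", "I2"), 4), (("I1", "I5"), 2), (("I2", "I5"), 2),
    (("I1", "I2", "I5"), 2)]}
print(rules_from_itemset({"I1", "I2", "I5"}, counts, 0.7))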
Efficient Implementation of Apriori in
SQL
• Hard to get good performance out of pure SQL
(SQL-92) based approaches alone
• Make use of object-relational extensions like
UDFs, BLOBs, Table functions etc.
– Get orders of magnitude improvement
• S. Sarawagi, S. Thomas, and R. Agrawal,
1998
Data Mining Techniques
42
Challenges of Frequent Itemset
Mining
• The core of the Apriori algorithm
– Use frequent (k–1)-itemsets to generate candidate frequent k-itemsets
– Use database scans to collect counts for the candidate itemsets
• Challenges
– Multiple scans of the transaction database are costly
• Needs (n + 1) scans, where n is the length of the longest pattern
– Huge number of candidates, especially when the support threshold is set low
• 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets
• To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one
needs to generate 2^100 ≈ 10^30 candidates
– Tedious workload of support counting for candidates
Data Mining Techniques
43
Outline
• Methods for improving Apriori
• An interesting approach – FP-growth
Data Mining Techniques
44
Methods to Improve Apriori’s
Efficiency
• Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink the number of candidates
– Facilitate support counting of candidates
Data Mining Techniques
45
DIC: Reduce Number of Scans
• DIC (Dynamic Itemset Counting ): tries to reduce
the number of passes over the database by
dividing the database into intervals of a specific
size
• Intuitively, DIC works like a train running over the
data with stops at intervals M transactions apart
(M is a parameter)
• S. Brin, R. Motwani, J. Ullman, and S. Tsur.
“Dynamic itemset counting and implication rules
for market basket data”. In SIGMOD’97
Data Mining Techniques
46
DIC: Reduce Number of Scans
• Candidate 1-itemsets are generated
• Once both A and D are determined frequent, the counting of AD begins
• Once all length-2 subsets of BCD are determined frequent, the counting
of BCD begins
(Figure: itemset lattice from {} up to ABCD. Apriori counts 1-itemsets, then
2-itemsets, then 3-itemsets in separate passes over the transactions, while
DIC starts counting an itemset such as AD or BCD as soon as all of its
subsets are known to be frequent.)
Data Mining Techniques
47
DIC: An Example
• A transaction database TDB with 40,000 transactions; support
threshold=100; M =10,000
– If itemset a and b get support counts greater than 100 in the first 10,000
transactions, DIC will start counting 2-itemset ab after the first 10,000
transactions
– Similarly, if ab, ac and bc are contained in at least 100 transactions among
the second 10,000 transactions, DIC will start counting 3-itemset abc after
20,000 transactions
– Once DIC gets to the end of the transaction database TDB, it will stop
counting the 1-itemsets and go back to the start of the database and count
the 2 and 3-itemsets
– After the first 10,000 transactions, DIC will finish counting ab, and after
20,000 transactions, it will finish counting abc
By overlapping the counting of different lengths of
itemsets, DIC can save some database scans
Data Mining Techniques
48
DHP: Reduce the Number of
Candidates
• DHP (Direct Hashing and Pruning ): reduces the
number of candidate itemsets
• J. Park, M. Chen, and P. Yu. “An effective hash-based algorithm for mining association rules”. In
SIGMOD’95
Data Mining Techniques
49
DHP: Reduce the Number of
Candidates
• In the k-th scan, DHP counts not only length-k
candidates, but also buckets of length-(k+1)
potential candidates
• A k-itemset whose corresponding hashing bucket
count is below the threshold cannot be frequent
– Candidates: a, b, c, d, e
– Hash entries: {ab, ad, ae}, {bd, be, de}, …
– Frequent 1-itemsets: a, b, d, e
– ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is
below the support threshold
Data Mining Techniques
50
Compare Apriori & DHP
DHP
Data Mining Techniques
51
DHP: Database Trimming
Data Mining Techniques
52
Example: DHP
Data Mining Techniques
53
Example: DHP
Data Mining Techniques
54
Partition: A Two Scan Method
• Partition: requires just two database scans to
mine the frequent itemsets
• A. Savasere, E. Omiecinski, and S. Navathe,
“An efficient algorithm for mining association
rules in large databases”. VLDB’95
Data Mining Techniques
55
A Two Scan Method: Partition
• Partition the database into n partitions, such that
each partition can be held into main memory
• Itemset X is frequent ⇒ X must be frequent in at
least one partition
– Scan 1: partition the database and find local frequent
patterns
– Scan 2: consolidate global frequent patterns
• Can all local frequent itemsets be held in main
memory? A sometimes too strong assumption
Data Mining Techniques
56
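A rough Python sketch of the two-scan idea (illustrative only; local mining is done here by brute-force subset enumeration rather than Apriori, just to keep the sketch short):

from collections import Counter
from itertools import combinations

def partition_mine(transactions, min_sup_ratio, n_parts=2):
    # Scan 1: mine each partition in memory; the union of local frequent
    # itemsets forms the global candidates.  Scan 2: count the candidates
    # over the whole database and keep the globally frequent ones.
    transactions = [set(t) for t in transactions]
    size = len(transactions)
    step = (size + n_parts - 1) // n_parts

    def local_frequent(part):
        counts = Counter()
        for t in part:
            for r in range(1, len(t) + 1):
                for s in combinations(sorted(t), r):
                    counts[frozenset(s)] += 1          # brute-force local mining
        return {x for x, c in counts.items() if c / len(part) >= min_sup_ratio}

    candidates = set()
    for i in range(0, size, step):                      # scan 1
        candidates |= local_frequent(transactions[i:i + step])
    counts = Counter()
    for t in transactions:                              # scan 2
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return {c for c in candidates if counts[c] / size >= min_sup_ratio}

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(partition_mine(tdb, 0.5))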
Partitioning
Data Mining Techniques
57
Sampling for Frequent Patterns
• Sampling : selects a sample of original database,
mine frequent patterns within sample using Apriori
• H. Toivonen. “Sampling large databases for
association rules”. In VLDB’96
Data Mining Techniques
58
Sampling for Frequent Patterns
• Scan database once to verify frequent itemsets
found in sample, only borders of closure of
frequent patterns are checked
– Example: check abcd instead of ab, ac, …, etc.
• Scan database again to find missed frequent
patterns
• Trade off some degree of accuracy against
efficiency
Data Mining Techniques
59
Eclat
• Eclat: uses the vertical database layout and uses
the intersection based approach to compute the
support of an itemset
• M. J. Zaki. “Scalable algorithms for association
mining”. IEEE Transactions on Knowledge and Data Engineering, 2000
Data Mining Techniques
60
Eclat – An Example
• Transform the horizontally formatted data to the
vertical format
Horizontal data layout:
TID   Item_IDs
T10   I1, I2, I5
T20   I2, I4
T30   I2, I3
T40   I1, I2, I4
T50   I1, I3
T60   I2, I3
T70   I1, I3
T80   I1, I2, I3, I5
T90   I1, I2, I3

Vertical data layout:
Itemset   TID_set
I1        {T10, T40, T50, T70, T80, T90}
I2        {T10, T20, T30, T40, T60, T80, T90}
I3        {T30, T50, T60, T70, T80, T90}
I4        {T20, T40}
I5        {T10, T80}
Data Mining Techniques
61
Eclat – An Example
• The frequent k-itemsets can be used to construct
the candidate (k+1)-itemsets
• Determine the support of any (k+1)-itemset by intersecting the
TID-lists of two of its k-subsets

2-itemsets:
{I1, I2}   {T10, T40, T80, T90}
{I1, I3}   {T50, T70, T80, T90}
{I1, I4}   {T40}
{I1, I5}   {T10, T80}
{I2, I3}   {T30, T60, T80, T90}
{I2, I4}   {T20, T40}
{I2, I5}   {T10, T80}
{I3, I5}   {T80}

3-itemsets:
{I1, I2, I3}   {T80, T90}
{I1, I2, I5}   {T10, T80}

Advantage: very fast support counting
Disadvantage: intermediate TID-lists may become too large for memory
Data Mining Techniques
62
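A minimal Python sketch of Eclat-style support counting via TID-list intersection, using the vertical layout above:

# The TID-list of a (k+1)-itemset is the intersection of the TID-lists
# of two of its k-subsets; its support is the length of that list.
tid = {
    "I1": {"T10", "T40", "T50", "T70", "T80", "T90"},
    "I2": {"T10", "T20", "T30", "T40", "T60", "T80", "T90"},
    "I5": {"T10", "T80"},
}
tid_I1_I2 = tid["I1"] & tid["I2"]          # {T10, T40, T80, T90}, support 4
tid_I1_I2_I5 = tid_I1_I2 & tid["I5"]       # {T10, T80}, support 2
print(len(tid_I1_I2), len(tid_I1_I2_I5))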
Apriori-like Advantage
• Uses large itemset property
• Easily parallelized
• Easy to implement
Data Mining Techniques
63
Apriori-Like Bottleneck
• Multiple database scans are costly
• Mining long patterns needs many passes of
scanning and generates lots of candidates
– To find the frequent itemset i1i2…i100
• # of scans: 100
• # of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27×10^30 !
• Bottleneck: candidate generation and test
• Can we avoid candidate generation?
Data Mining Techniques
64
Mining Frequent Patterns Without
Candidate Generation
• Grow long patterns from short ones using
local frequent items
– “abc” is a frequent pattern
– Get all transactions having “abc”: DB|abc
– “d” is a local frequent item in DB|abc  abcd is
a frequent pattern
Data Mining Techniques
65
Compress Database by FP-tree
TID   Items bought                  (ordered) frequent items
100   {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
200   {a, b, c, f, l, m, o}         {f, c, a, b, m}
300   {b, f, h, j, o, w}            {f, b}
400   {b, c, k, s, p}               {c, b, p}
500   {a, f, c, e, l, p, m, n}      {f, c, a, m, p}

min_support = 3
1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order (the L-list)
3. Scan DB again, construct the FP-tree

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
F-list = f-c-a-b-m-p
After inserting the first transaction, the tree is the single path
{} → f:1 → c:1 → a:1 → m:1 → p:1
Data Mining Techniques
66
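A compact Python sketch of the two-scan FP-tree construction described above (illustrative; the class and function names are made up for this example):

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # First scan: find frequent items and order them by descending frequency.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    f_list = [i for i, c in sorted(counts.items(), key=lambda x: -x[1])
              if c >= min_sup]
    # Second scan: insert each transaction's frequent items, in f_list order,
    # into the tree; shared prefixes reuse existing nodes.
    root = FPNode(None, None)
    node_links = {i: [] for i in f_list}        # header table: item -> nodes
    for t in transactions:
        node = root
        for item in [i for i in f_list if i in t]:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                node_links[item].append(child)
            child.count += 1
            node = child
    return root, node_links

tdb = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, header = build_fp_tree(tdb, min_sup=3)
print([i for i in header])   # f, c, a, b, m, p (ties may order differently)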
Compress Database by FP-tree
TID   (ordered) frequent items
100   {f, c, a, m, p}
200   {f, c, a, b, m}
300   {f, b}
400   {c, b, p}
500   {f, c, a, m, p}

Header table: f:4, c:4, a:3, b:3, m:3, p:3
After inserting the second transaction {f, c, a, b, m}, the shared prefix f-c-a
is reused: {} → f:2 → c:2 → a:2, where a:2 branches into m:1 → p:1 and b:1 → m:1
Data Mining Techniques
67
Compress Database by FP-tree
TID   (ordered) frequent items
100   {f, c, a, m, p}
200   {f, c, a, b, m}
300   {f, b}
400   {c, b, p}
500   {f, c, a, m, p}

Header table: f:4, c:4, a:3, b:3, m:3, p:3
(Figure: intermediate FP-tree after inserting further transactions; each new
transaction either extends a shared prefix, incrementing the counts along
that path, or starts a new branch.)
Data Mining Techniques
68
Compress Database by FP-tree
TID   (ordered) frequent items
100   {f, c, a, m, p}
200   {f, c, a, b, m}
300   {f, b}
400   {c, b, p}
500   {f, c, a, m, p}

Header table: f:4, c:4, a:3, b:3, m:3, p:3
After the fourth transaction {c, b, p}, a new branch c:1 → b:1 → p:1 hangs
directly off the root, alongside f:3 → c:2 → a:2 → {m:1 → p:1, b:1 → m:1}
and f:3 → b:1
Data Mining Techniques
69
Compress Database by FP-tree
TID   (ordered) frequent items
100   {f, c, a, m, p}
200   {f, c, a, b, m}
300   {f, b}
400   {c, b, p}
500   {f, c, a, m, p}

Header table: f:4, c:4, a:3, b:3, m:3, p:3
The completed FP-tree:
{} → f:4 → c:3 → a:3 → {m:2 → p:2, b:1 → m:1}
          → b:1
{} → c:1 → b:1 → p:1
Data Mining Techniques
70
Benefits of the FP-tree
• Completeness
– Preserve complete information for frequent pattern
mining
– Never break a long pattern of any transaction
• Compactness
– Reduce irrelevant info—infrequent items are gone
– Items in frequency descending order: the more
frequently occurring, the more likely to be shared
– Never larger than the original database (not counting
node-links and the count fields)
– For Connect-4 DB, compression ratio could be over 100
Data Mining Techniques
71
Partition Patterns and
Databases
• Frequent patterns can be partitioned into
subsets according to the F-list f-c-a-b-m-p:
– Patterns containing p
– Patterns having m but no p
– …
– Patterns having c but no a, b, m, or p
– Pattern f
• The partitioning is complete and does not have
any overlap
Data Mining Techniques
72
Find Patterns Having P From P-conditional Database
• Starting at the frequent-item header table in the FP-tree
• Traverse the FP-tree by following the node-links of each frequent item p
• Accumulate all of the transformed prefix paths of item p to form p’s
conditional pattern base

(FP-tree and header table as constructed above: f:4, c:4, a:3, b:3, m:3, p:3)

Conditional pattern bases:
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
Data Mining Techniques
73
From Conditional Pattern-bases to
Conditional FP-trees
• For each pattern base
– Accumulate the count for each item in the base
– Construct the FP-tree for the frequent items of the
pattern base

p-conditional pattern base: fcam:2, cb:1
Only c is frequent in it (c:3), so the p-conditional FP-tree is the single
node c:3, and all frequent patterns relating to p are: p, pc
Data Mining Techniques
74
Recursive Mining
• Patterns having m but no p can be mined
recursively
(FP-tree and header table as above)

m-conditional pattern base: fca:2, fcab:1
The m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3,
and all frequent patterns relating to m are:
m, fm, cm, am, fcm, fam, cam, fcam
Data Mining Techniques
75
Optimization
• Optimization: enumerate patterns from a single-branch FP-tree
– Enumerate all combinations
– Support = that of the last item
• m, fm, cm, am
• fcm, fam, cam
• fcam
(m-conditional FP-tree: {} → f:3 → c:3 → a:3)
Data Mining Techniques
76
A Special Case: Single Prefix Path
in FP-tree
• A (projected) FP-tree may have a single prefix path
– Reduce the single prefix into one node
– Join the mining results of the two parts
(Figure: a tree whose single prefix path a1:n1 → a2:n2 → a3:n3 leads into a
multi-branch part rooted at r1 with nodes b1:m1, C1:k1, C2:k2, C3:k3; it is
split into the prefix path, whose sub-path combinations are enumerated,
plus the branching part, and the two mining results are joined.)
Data Mining Techniques
77
FP-Growth
• Idea: Frequent pattern growth
– Recursively grow frequent patterns by pattern and
database partition
• Method
– For each frequent item, construct its conditional
pattern-base, and then its conditional FP-tree
– Repeat the process on each newly created conditional
FP-tree
– Until the resulting FP-tree is empty, or it contains only
one path—single path will generate all the combinations
of its sub-paths, each of which is a frequent pattern
Data Mining Techniques
78
Scaling Up FP-growth by
Database Projection
• What if FP-tree cannot fit in memory?—Database
projection
– Partition a database into a set of projected Databases
– Construct and mine FP-tree for each projected
Database
• Heuristic: Projected database shrinks quickly in many
applications
– Such a process can be recursively applied to any
projected database if its FP-tree still cannot fit in main
memory
How?
Data Mining Techniques
79
Partition-based Projection
• Parallel projection needs a lot of disk space
• Partition projection saves it

Tran. DB: fcamp, fcabm, fb, cbp, fcamp
p-proj DB: fcam, cb, fcam
m-proj DB: fcab, fca, fca
b-proj DB: f, cb, …
a-proj DB: fc, …
c-proj DB: f, …
f-proj DB: …
am-proj DB: fc, fc, fc
cm-proj DB: f, f, f
…
Data Mining Techniques
80
FP-Growth vs. Apriori: Scalability With
the Support Threshold
Data set T25I20D10K: the average transaction size and the average maximal
potentially frequent itemset size are set to 25 and 20, respectively, while the
number of transactions in the dataset is set to 10K [AS94].
(Figure: run time in seconds vs. support threshold (%) on data set D1,
comparing the FP-growth runtime with the Apriori runtime.)
Data Mining Techniques
81
FP-Growth vs. Tree-Projection:
Scalability with the Support Threshold
Data set T25I20D100K
(Figure: run time in seconds vs. support threshold (%) on data set D2,
comparing the FP-growth runtime with the TreeProjection runtime.)
Data Mining Techniques
82
Why Is FP-Growth Efficient?
• Divide-and-conquer:
– decompose both the mining task and DB according to
the frequent patterns obtained so far
– leads to focused search of smaller databases
• Other factors
– no candidate generation, no candidate test
– compressed database: FP-tree structure
– no repeated scan of entire database
– basic ops—counting local freq items and building sub
FP-tree, no pattern search and matching
Data Mining Techniques
83
Major Costs in FP-Growth
• Poor locality of FP-trees
– Low hit rate of cache
• Building FP-trees
– A stack of FP-trees
• Redundant information
– Transaction abcd appears in a-, ab-, abc-, ac-, …, cprojected databases and FP-trees.
• Can we avoid the redundancy?
Data Mining Techniques
84
Implications of the Methodology
• Mining closed frequent itemsets and max-patterns
– CLOSET (DMKD’00)
• Constraint-based mining of frequent patterns
– Convertible constraints (KDD’00, ICDE’01)
• Computing iceberg data cubes with complex
measures
– H-tree and H-cubing algorithm (SIGMOD’01)
Data Mining Techniques
85
Closed Frequent Itemsets
• An itemset X is closed if none of its immediate
supersets has the same support as X.
• An itemset X is not closed if at least one of its
immediate supersets has the same support count
as X.
– For example
• Database: {(1,2,3,4),(1,2,3,4,5,6)}
• Itemset (1,2) is not a closed itemset
• Itemset (1,2,3,4) is a closed itemset
• An itemset is a closed frequent itemset if it is
closed and its support satisfies the minimum
support threshold.
Data Mining Techniques
86
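A tiny Python sketch of the closedness check on the two-transaction example above:

# An itemset X is closed if no immediate superset has the same support as X.
db = [{1, 2, 3, 4}, {1, 2, 3, 4, 5, 6}]
items = set().union(*db)

def support(itemset):
    return sum(itemset <= t for t in db)

def is_closed(itemset):
    return all(support(itemset | {i}) < support(itemset)
               for i in items - itemset)

print(is_closed({1, 2}))        # False: {1, 2, 3} has the same support (2)
print(is_closed({1, 2, 3, 4}))  # True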
Benefits of closed frequent
itemsets
• It reduces the number of redundant patterns generated
– A frequent itemset {a1, a2, …, a100} contains
(100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27×10^30
frequent sub-itemsets
• It has the same power as frequent itemset mining
• It improves not only the efficiency but also the
effectiveness of mining
Data Mining Techniques
87
Mining Closed Frequent Itemsets
(Ⅰ)
• Itemset merging: if Y appears in every occurrence of X, then
Y is merged with X
– For example, the projected conditional database for prefix itemset {I5:2}
is {{I2,I1}, {I2,I1,I3}}. Itemset {I2,I1} can be merged with {I5} to form the
closed itemset {I5,I2,I1:2}
• Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), then X and all
of X’s descendants in the set enumeration tree can be
pruned
– For example, suppose a transaction database has only the two transactions
⟨a1, a2, …, a100⟩ and ⟨a1, a2, …, a50⟩ and min_sup = 2. The projection on
item a1 gives ⟨a1, a2, …, a50⟩ : 2. Thus the mining of closed frequent
itemsets in this data set terminates after mining a1’s projected database.
Data Mining Techniques
88
Mining Closed Frequent
Itemsets(Ⅱ)
• Item skipping: if a local frequent item has the same
support in several header tables at different levels, one
can prune it from the header tables at higher levels
– For example, for a transaction database with only the two transactions
⟨a1, a2, …, a100⟩ and ⟨a1, a2, …, a50⟩ and min_sup = 2: because a2 in a1’s
projected database has the same support as a2 in the global header table,
a2 can be pruned from the global header table.
• Efficient subset checking – closure checking
– Superset checking: checks whether a new frequent itemset is a
superset of some already-found closed itemset with the same
support
– Subset checking
Data Mining Techniques
89
Mining Closed Frequent Itemsets
• J. Pei, J. Han & R. Mao. CLOSET: An Efficient
Algorithm for Mining Frequent Closed Itemsets",
DMKD'00.
Data Mining Techniques
90
Maximal Frequent Itemsets
• An itemset is maximal frequent if none of its immediate supersets is
frequent
• Despite providing a compact representation, maximal frequent
itemsets do not contain the support information of their subsets.
– For example, the support of the maximal frequent itemsets
{a, c, e}, {a, d}, and {b,c,d,e} do not provide any hint about the support of
their subsets.
• An additional pass over the data set is therefore needed to determine
the support counts of the nonmaximal frequent itemsets.
• It might be desirable to have a minimal representation of frequent
itemsets that preserves the support information.
– Such representation is the set of the closed frequent itemsets.
Data Mining Techniques
91
Maximal vs Closed Itemsets
All maximal frequent itemsets
are closed because none
of the maximal frequent
itemsets can have the same
support count as their
immediate supersets.
(Figure: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets)
Data Mining Techniques
92
MaxMiner: Mining Max-patterns
Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F

• 1st scan: find frequent items
– A, B, C, D, E
• 2nd scan: find support for the potential max-patterns
– AB, AC, AD, AE, ABCDE
– BC, BD, BE, BCDE
– CD, CE, CDE, DE
• Since BCDE is a max-pattern, there is no need to check
BCD, BDE, CDE in a later scan
• R. Bayardo. “Efficiently mining long patterns from
databases”. In SIGMOD’98
Data Mining Techniques
93
Further Improvements of Mining
Methods
• AFOPT (Liu, et al. [KDD’03])
– A “push-right” method for mining condensed frequent
pattern (CFP) tree
• Carpenter (Pan, et al. [KDD’03])
– Mine data sets with small rows but numerous columns
– Construct a row-enumeration tree for efficient mining
Data Mining Techniques
94
Mining Various Kinds of Association
Rules
• Mining multilevel association
• Miming multidimensional association
• Mining quantitative association
• Mining interesting correlation patterns
Data Mining Techniques
95
Multiple-Level Association Rules
• Items often form hierarchies
TID   Items Purchased
1     IBM-ThinkPad-R40/P4M, Symantec-Norton-Antivirus-2003
2     Microsoft-Office-Proffesional-2003, Microsoft-
3     logiTech-Mouse, Fellows-Wrist-Rest
…     …

Concept hierarchy:
Level 0: all
Level 1: computer, software, printer & camera, accessory
Level 2: laptop, desktop | office, antivirus | printer, camera | mouse, pad
Level 3: IBM, Dell | Microsoft | …
Data Mining Techniques
96
Multiple-Level Association Rules
• Flexible support settings
– Items at the lower level are expected to have lower
support
• Exploration of shared multi-level mining (Agrawal
& Srikant [VLDB’95], Han & Fu [VLDB’95])

Example: Milk [support = 10%] at level 1; 2% Milk [support = 6%] and
Skim Milk [support = 4%] at level 2.
Uniform support: min_sup = 5% at level 1 and level 2
Reduced support: min_sup = 5% at level 1, min_sup = 3% at level 2
Data Mining Techniques
97
Multi-level Association:
Redundancy Filtering
• Some rules may be redundant due to “ancestor”
relationships between items.
– Example
• laptop computer  HP printer [support = 8%, confidence = 70%]
• IBM laptop computer  HP printer [support = 2%, confidence =
72%]
• We say the first rule is an ancestor of the second
rule.
• A rule is redundant if its support is close to the
“expected” value, based on the rule’s ancestor.
Data Mining Techniques
98
Multi-Dimensional Association
• Single-dimensional rules:
buys(X, “computer”) ⇒ buys(X, “printer”)
• Multi-dimensional rules: ≥ 2 dimensions or predicates
– Inter-dimension assoc. rules (no repeated predicates)
age(X, ”19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
– Hybrid-dimension assoc. rules (repeated predicates)
age(X, ”19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
• Categorical attributes: finite number of possible values, no
ordering among values—data cube approach
• Quantitative attributes: numeric, implicit ordering among
values—discretization, clustering, and gradient approaches
Data Mining Techniques
99
Multi-Dimensional Association
Techniques can be categorized by how numerical
attributes, such as age or salary, are treated
1. Quantitative attributes are discretized using predefined
concept hierarchies – Static and predetermined
•
A concept hierarchy for income, such as “0…20k”, “21k…30k”,
and so on.
2. Quantitative attributes are discretized or clustered into
“bins” based on the distribution of the data – Dynamic,
referred as quantitative association rules
Data Mining Techniques
100
Quantitative Association Rules
• Proposed by Lent, Swami and Widom ICDE’97
• Numeric attributes are dynamically discretized
– Such that the confidence or compactness of the rules mined is
maximized
• 2-D quantitative association rules: Aquan1  Aquan2  Acat
• Example
Data Mining Techniques
101
Quantitative Association Rules
• ARCS (Association Rule Clustering System): cluster
adjacent association rules to form general rules using a 2-D grid
– Binning: partition the ranges of quantitative attributes into intervals
• Equal-width
• Equal-frequency
• Clustering-based
– Finding frequent predicate sets: once the 2-D array containing the
count distribution for each category is set up, it can be scanned to
find the frequent predicate sets
– Clustering the association rules, e.g.,
age(X, ”34-35”) ∧ income(X, ”30-50K”) ⇒ buys(X, ”high resolution TV”)
Data Mining Techniques
102
Correlation Analysis
min_sup: 30%, min_conf: 60%
• play basketball ⇒ eat cereal [40%, 66%] is misleading
– The overall % of students eating cereal is 75% > 66%.
• play basketball ⇒ not eat cereal [20%, 34%] is more
accurate, although with lower support and confidence
• Measure of dependent/correlated events: lift
lift = P(A ∪ B) / (P(A) P(B))

             Basketball   Not basketball   Sum (row)
Cereal          2000           1750           3750
Not cereal      1000            250           1250
Sum (col.)      3000           2000           5000

lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) ≈ 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) ≈ 1.33
Data Mining Techniques
103
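A short Python sketch reproducing the lift values above from the contingency table:

# Lift of (basketball, cereal) and (basketball, not cereal).
n = 5000
basketball, cereal = 3000, 3750
basketball_and_cereal = 2000
basketball_and_not_cereal = 1000

def lift(p_ab, p_a, p_b):
    return p_ab / (p_a * p_b)

print(lift(basketball_and_cereal / n, basketball / n, cereal / n))            # ~0.89
print(lift(basketball_and_not_cereal / n, basketball / n, (n - cereal) / n))  # ~1.33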
Outline
• What is association rule mining and frequent
pattern mining?
• Methods for frequent-pattern mining
• Constraint-based frequent-pattern mining
• Frequent-pattern mining: achievements, promises
and research problems
Data Mining Techniques
104
Constraint-based (Query-Directed)
Mining
• Finding all the patterns in a database
autonomously? — unrealistic!
– The patterns could be too many but not focused!
• Data mining should be an interactive process
– User directs what to be mined using a data mining
query language (or a graphical user interface)
• Constraint-based mining
– User flexibility: provides constraints on what to be
mined
– System optimization: explores such constraints for
efficient mining—constraint-based mining
Data Mining Techniques
105
Constraints
• Constraints can be classified into five categories:
– Antimonotone
– Monotone
– Succinct
– Convertible
– Inconvertible
Data Mining Techniques
106
Anti-Monotone in Constraint
Pushing
• Anti-monotone
– When an itemset S violates the
constraint, so does any of its supersets
– sum(S.Price) ≤ v is anti-monotone
– sum(S.Price) ≥ v is not anti-monotone
• Example. C: range(S.profit) ≤ 15 is
anti-monotone
– Itemset ab violates C
– So does every superset of ab

TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
Data Mining Techniques
107
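A small Python sketch (illustrative only) of checking the anti-monotone constraint range(S.profit) ≤ 15 against the profit table above; once an itemset violates it, all of its supersets can be pruned:

# Anti-monotone check: if an itemset violates range(S.profit) <= 15,
# every superset violates it too, so the whole branch can be pruned.
profit = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}

def satisfies_range(itemset, v=15):
    values = [profit[i] for i in itemset]
    return max(values) - min(values) <= v

print(satisfies_range({"a", "b"}))        # False: range = 40, so ab and all
print(satisfies_range({"a", "b", "c"}))   # of its supersets (e.g. abc) are pruned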
Monotone for Constraint Pushing
• Monotone
– When an itemset S satisfies the
constraint, so does any of its
supersets
– sum(S.Price) ≥ v is monotone
– min(S.Price) ≤ v is monotone
• Example. C: range(S.profit) ≥ 15
– Itemset ab satisfies C
– So does every superset of ab

TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
Data Mining Techniques
108
Succinctness
• Succinctness:
– Given A1, the set of items satisfying a succinctness
constraint C, then any set S satisfying C is based on
A1 , i.e., S contains a subset belonging to A1
– Idea: Without looking at the transaction database,
whether an itemset S satisfies constraint C can be
determined based on the selection of items
– min(S.Price)  v is succinct
– sum(S.Price)  v is not succinct
• Optimization: If C is succinct, C is pre-counting
Data Mining Techniques
109
pushable
Converting “Tough” Constraints
• Convert tough constraints into
anti-monotone or monotone constraints by
properly ordering the items
• Examine C: avg(S.profit) ≥ 25
– Order items in value-descending
order
• <a, f, g, d, b, h, c, e>
– If an itemset afb violates C
• So do afbh and afb* (any itemset with afb as a prefix)
• It becomes anti-monotone!

TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
Data Mining Techniques
110
Strongly Convertible Constraints
• avg(X) ≥ 25 is convertible anti-monotone
w.r.t. the item-value-descending order R: <a,
f, g, d, b, h, c, e>
– If an itemset af violates a constraint C, so
does every itemset with af as a prefix, such as
afd
• avg(X) ≥ 25 is convertible monotone
w.r.t. the item-value-ascending order R⁻¹: <e,
c, h, b, d, g, f, a>
– If an itemset d satisfies a constraint C, so
do the itemsets df and dfa, which have d as
a prefix
• Thus, avg(X) ≥ 25 is strongly convertible

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
Data Mining Techniques
111
Can Apriori Handle Convertible
Constraints?
• A convertible constraint that is neither monotone, anti-monotone,
nor succinct cannot be pushed deep into an Apriori mining
algorithm
– Within the level-wise framework, no direct
pruning based on the constraint can be made
– Itemset df violates constraint C: avg(X) ≥ 25
– Since adf satisfies C, Apriori needs df to
assemble adf, so df cannot be pruned
• But it can be pushed into the frequent-pattern
growth framework!

Item   Value
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
Data Mining Techniques
112
Mining With Convertible
Constraints
• C: avg(X) ≥ 25, min_sup = 2
• List the items in every transaction in value-descending
order R: <a, f, g, d, b, h, c, e>
– C is convertible anti-monotone w.r.t. R
• Scan TDB once
– Remove infrequent items
• Item h is dropped
– Itemsets a and f are good, …
• Projection-based mining
– Impose an appropriate order on item projection
– Many tough constraints can be converted into
(anti-)monotone ones

Item   Value
a      40
f      30
g      20
d      10
b      0
h      -10
c      -20
e      -30

TDB (min_sup = 2)
TID   Transaction
10    a, f, d, b, c
20    f, g, d, b, c
30    a, f, d, c, e
40    f, g, h, c, e
Data Mining Techniques
113
Handling Multiple Constraints
• Different constraints may require different or even
conflicting item-ordering
• If there exists an order R s.t. both C1 and C2 are
convertible w.r.t. R, then there is no conflict between
the two convertible constraints
• If there exists conflict on order of items
– Try to satisfy one constraint first
– Then using the order for the other constraint to mine frequent
itemsets in the corresponding projected database
Data Mining Techniques
114
What Constraints Are Convertible?
Constraint                                          Convertible anti-monotone   Convertible monotone   Strongly convertible
avg(S) ≤ v, ≥ v                                     Yes                          Yes                    Yes
median(S) ≤ v, ≥ v                                  Yes                          Yes                    Yes
sum(S) ≤ v (items could be of any value, v ≥ 0)     Yes                          No                     No
sum(S) ≤ v (items could be of any value, v ≤ 0)     No                           Yes                    No
sum(S) ≥ v (items could be of any value, v ≥ 0)     No                           Yes                    No
sum(S) ≥ v (items could be of any value, v ≤ 0)     Yes                          No                     No
……
Data Mining Techniques
115
Constraint-Based Mining—A
General Picture
Constraint                        Antimonotone   Monotone      Succinct
v ∈ S                             no             yes           yes
S ⊇ V                             no             yes           yes
S ⊆ V                             yes            no            yes
min(S) ≤ v                        no             yes           yes
min(S) ≥ v                        yes            no            yes
max(S) ≤ v                        yes            no            yes
max(S) ≥ v                        no             yes           yes
count(S) ≤ v                      yes            no            weakly
count(S) ≥ v                      no             yes           weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0)        yes            no            no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)        no             yes           no
range(S) ≤ v                      yes            no            no
range(S) ≥ v                      no             yes           no
avg(S) θ v, θ ∈ {=, ≤, ≥}         convertible    convertible   no
support(S) ≥ ξ                    yes            no            no
support(S) ≤ ξ                    no             yes           no
Data Mining Techniques
116
A Classification of Constraints
(Figure: the space of constraints, showing antimonotone, monotone, and
succinct constraints; convertible anti-monotone and convertible monotone
constraints, whose overlap is the strongly convertible constraints; and the
remaining inconvertible constraints.)
Data Mining Techniques
117
Outline
• What is association rule mining and frequent
pattern mining?
• Methods for frequent-pattern mining
• Constraint-based frequent-pattern mining
• Frequent-pattern mining: achievements, promises
and research problems
Data Mining Techniques
118
Frequent-Pattern Mining: Summary
• Frequent pattern mining—an important task in data mining
• Scalable frequent pattern mining methods
– Apriori (candidate generation & test)
– Projection-based (FP-growth, CLOSET+, …)
– Vertical format approach (CHARM, …)
• Mining a variety of rules and interesting patterns
• Constraint-based mining
• Mining sequential and structured patterns
• Extensions and applications
Data Mining Techniques
119
Frequent-Pattern Mining: Research
Problems
• Mining fault-tolerant frequent, sequential and
structured patterns
– Patterns allow limited faults (insertion, deletion,
mutation)
• Mining truly interesting patterns
– Surprising, novel, concise, …
• Application exploration
– E.g., DNA sequence analysis and bio-pattern
classification
– “Invisible” data mining
Data Mining Techniques
120
Assignment (Ⅰ)
• A database has five transactions.
Suppose min_sup = 60% and
min_conf = 80%.
– Find all frequent itemsets using
Apriori and FP-growth, respectively.
Compare the efficiency of the two
mining processes.
– List all of the strong association
rules.

TID   Items_list
T1    {m, o, n, k, e, y}
T2    {d, o, n, k, e, y}
T3    {m, a, k, e}
T4    {m, u, c, k, y}
T5    {c, o, k, i, e}
Data Mining Techniques
121
Assignment (Ⅱ)
• Frequent itemset mining often generates a huge number
of frequent itemsets. Discuss effective methods that can
be used to reduce the number of frequent itemsets
while still preserving most of the information.
• The price of each item in a store is nonnegative. The
store manager is only interested in rules of the form:
“one free item may trigger $200 of total purchases in the
same transaction.” State how to mine such rules
efficiently.
Data Mining Techniques
122
Thank you !
Data Mining Techniques
123