Association Rules: Algorithms:
FP-growth
Alípio Jorge, DCC-FC, Universidade do Porto
Data Mining 2
Alípio Jorge
FP-Growth [Han, Pei, Yin]
An algorithm more efficient than APRIORI
Compresses the database into a more efficient structure
Reuses pattern fragments
Reduces the number of accesses to the DB
Reduces memory and computation needs
A divide-and-conquer method reduces the search space
About one order of magnitude faster than APRIORI, according to the authors' experiments
Shortcomings of APRIORI
generation of long patterns
{a1, a2, …, an}
How many sub-patterns are needed?
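A quick way to see the problem, as a sketch: if {a1, a2, …, an} is frequent, every non-empty subset of it is frequent as well, so a level-wise method such as APRIORI has to generate and count all 2^n - 1 of them. The helper name `nonempty_subsets` is illustrative and does not come from the slides.

```python
from itertools import combinations

def nonempty_subsets(items):
    """Yield every non-empty subset of `items` as a tuple."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)

pattern = ["a1", "a2", "a3", "a4"]
subsets = list(nonempty_subsets(pattern))
print(len(subsets), 2 ** len(pattern) - 1)  # the two counts agree

# The number of sub-patterns doubles with every extra item.
for n in (10, 20, 30):
    print(n, 2 ** n - 1)
```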
Compressing a DB
Consider the database
abc
bc
cd
cdf
bd
d
what's its size?
can we compress it?
we only want frequent sets, not individual transactions
Recall APRIORI-TID
FP-Tree
Principles
1. We only need the frequent items. We go through the DB once to count the frequency of all items.
2. We keep the item frequencies in a compact data structure to avoid new visits to the DB.
3. If different transactions share a frequent item set, we can partially merge them. Frequent items must be arranged in a pre-defined order.
4. If two transactions share a prefix (according to the order), we can use a structure based on prefixes. We must, however, keep the counts. That is easier if the items are sorted by decreasing frequency.
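The first pass and the ordering step can be sketched as follows. This is a minimal illustration, not the author's code: it reuses the small database from the "Compressing a DB" slide, assumes minsup = 2 (an absolute count chosen here only for the example), and breaks frequency ties alphabetically, which is also an assumption.

```python
from collections import Counter

def frequent_item_order(db, minsup):
    """One scan of the DB: count items, keep the frequent ones, fix an order."""
    counts = Counter(item for t in db for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= minsup}
    # Decreasing frequency; ties broken alphabetically (an assumption).
    order = sorted(frequent, key=lambda item: (-frequent[item], item))
    return frequent, order

def project(transaction, order):
    """Drop infrequent items and list the rest in the pre-defined order."""
    present = set(transaction)
    return [item for item in order if item in present]

db = ["abc", "bc", "cd", "cdf", "bd", "d"]
frequent, order = frequent_item_order(db, minsup=2)
projected = [project(t, order) for t in db]

print(frequent)   # support of each frequent item
print(order)      # the pre-defined item order
print(projected)  # transactions that now share prefixes such as ['c', ...]
```

After projection, four of the six transactions start with c, which is what lets a prefix-based structure merge them.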
FP-Growth
Example: data and frequent items (minsup = 3, i.e. 0.6 of the transactions)
FP-Tree
From example
FP-Tree
Definition
an FP-Tree consists of:
root node
set of item prefix subtrees
frequent item header table
each node in the item prefix subtree has 3 fields:
item-name
count (of the prefix)
node-link (pointer to next node with same name)
each entry in the frequent-item-header table has 2 fields:
item name
head of node-link (pointer to first node with that name)
FP-Tree construction algorithm
Input: DB, minsup ms
Output: FP-Tree
Method
Scan DB once
collect frequent items F
Sort F in support descending order
Create root of Tree and, for each transaction T of DB
insert freq items of T in Tree sorted by descending freq
Inserting items <p1, p2, …> in Tree
if root of Tree has a child N with p1, increase counter of node
else create a new child N with p1 and count 1
insert <p2,…> in subtree with root N
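The construction method above might look like this in Python. It is a sketch under stated assumptions: the class and function names are illustrative, the database is the one from the "Compressing a DB" slide with an assumed minsup of 2, ties in frequency are broken alphabetically, and new nodes are prepended to their node-link chain for simplicity (the order of the chain does not matter for mining).

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item = item        # item-name
        self.count = 0          # count of the prefix ending at this node
        self.parent = parent
        self.children = {}      # item -> child Node
        self.link = None        # node-link: next node with the same item

def build_fp_tree(db, minsup):
    # Scan 1: collect the frequent items F, sorted by descending support.
    counts = Counter(item for t in db for item in set(t))
    order = sorted((i for i in counts if counts[i] >= minsup),
                   key=lambda i: (-counts[i], i))
    root = Node(None, None)
    header = {item: None for item in order}   # item -> head of node-link
    # Scan 2: insert the sorted frequent items of each transaction.
    for t in db:
        present, node = set(t), root
        for item in (i for i in order if i in present):
            child = node.children.get(item)
            if child is None:                 # create a new child
                child = Node(item, node)
                node.children[item] = child
                child.link = header[item]     # prepend to the chain
                header[item] = child
            child.count += 1                  # increase the counter
            node = child
    return root, header

def show(node, depth=0):
    """Return the tree as indented 'item:count' lines (root omitted)."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(show(child, depth + 1))
    return lines

db = ["abc", "bc", "cd", "cdf", "bd", "d"]
root, header = build_fp_tree(db, minsup=2)
print("\n".join(show(root)))
```

Six transactions collapse into five nodes here because the c prefix is shared.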
Analysis
FP-tree construction takes 2 scans of DB:
frequent item identification
tree construction
inserting a transaction T is O(|freq(T)|)
freq(T) is the frequent item projection of transaction T
Completeness
the FP-Tree contains all information we need about frequent sets
Compactness
size of Tree <= SUM over T of |freq(T)|
height of Tree = MAX over T of |freq(T)|
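The two compactness bounds can be checked without building the tree, because every node of a prefix tree corresponds to one distinct non-empty prefix of some projected transaction. The sketch below relies on that observation; the function names, the example database (taken from the "Compressing a DB" slide), the assumed minsup of 2 and the alphabetical tie-break are all illustrative choices.

```python
from collections import Counter

def project_db(db, minsup, ascending=False):
    """Keep frequent items only, sorted by descending (or ascending) support."""
    counts = Counter(item for t in db for item in set(t))
    order = sorted((i for i in counts if counts[i] >= minsup),
                   key=lambda i: (-counts[i], i))
    if ascending:
        order.reverse()
    projected = []
    for t in db:
        present = set(t)
        projected.append([i for i in order if i in present])
    return projected

def tree_size_and_height(projected):
    """Node count (root excluded) = number of distinct non-empty prefixes."""
    prefixes = {tuple(t[:k]) for t in projected for k in range(1, len(t) + 1)}
    height = max((len(t) for t in projected), default=0)
    return len(prefixes), height

db = ["abc", "bc", "cd", "cdf", "bd", "d"]
projected = project_db(db, minsup=2)
size, height = tree_size_and_height(projected)
bound = sum(len(t) for t in projected)       # SUM over T of |freq(T)|
print(size, "<=", bound, "| height", height)

# Same measurement with the opposite item order, for comparison.
size_asc, _ = tree_size_and_height(project_db(db, minsup=2, ascending=True))
print(size_asc)
```

Calling `project_db` with `ascending=True` on other databases is one way to experiment with how the item order affects the size of the tree.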
Analysis
how much space does APRIORI need?
may grow exponentially
How does frequency ordering influence compactness?
Exercise
Compare Apriori and FP-Tree on the following data
DB={{a,b,c,d},{a,b,c,d},{a,e},{a,b}}, minsup=50%
How much space is needed?
APRIORI: number of candidate sets
FP-Growth: size of Tree + size of Table
Generate Tree now considering increasing frequency ordering
compare size of both trees
Consider DB={adef, bdef, cdef, a, a, a, b, b, b, c, c, c}
build tree with decreasing freq ord
can you find a different ordering that achieves a more compact Tree?
Where are we now?
We have a compact representation of the DB
How can we obtain the frequency of a pattern?
How can we obtain ALL frequent patterns?
Obtaining frequent patterns
Algorithm FP-growth
Properties
all freq sets with item A can be obtained from the Tree by
following node-links starting in the table (e.g. freq({b,c})?)
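As an illustration of this property, the frequency of a pattern such as {b,c} can be read off the tree by starting at the header entry of the pattern's last item (in the tree order), following its node-links, and adding the counts of the nodes whose prefix path contains the remaining items. The sketch below assumes the database from the "Compressing a DB" slide and minsup = 2; names such as `pattern_frequency` are hypothetical.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}
        self.link = None   # node-link to the next node with the same item

def build_fp_tree(db, minsup):
    counts = Counter(item for t in db for item in set(t))
    order = sorted((i for i in counts if counts[i] >= minsup),
                   key=lambda i: (-counts[i], i))
    root, header = Node(None, None), {}
    for t in db:
        present, node = set(t), root
        for item in (i for i in order if i in present):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                child.link = header.get(item)   # prepend to the chain
                header[item] = child
            child.count += 1
            node = child
    return header, order

def pattern_frequency(pattern, header, order):
    """Support of `pattern`, or None if it contains an infrequent item."""
    pattern = set(pattern)
    if not pattern <= set(order):
        return None            # the tree keeps no counts for such items
    last = max(pattern, key=order.index)   # deepest item of the pattern
    rest = pattern - {last}
    total, node = 0, header[last]
    while node is not None:                # follow the node-links
        on_path, parent = set(), node.parent
        while parent.item is not None:     # climb the prefix path
            on_path.add(parent.item)
            parent = parent.parent
        if rest <= on_path:
            total += node.count
        node = node.link
    return total

db = ["abc", "bc", "cd", "cdf", "bd", "d"]
header, order = build_fp_tree(db, minsup=2)
print(pattern_frequency("bc", header, order))
```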
FP-Growth
1. Consider each item A at a time (start with the least frequent)
2. Obtain the long patterns involving A (conditional pattern base)
3. Obtain an FP-Tree from those patterns (conditional FP-Tree)
4. Obtain patterns of size 2 from the conditional FP-Tree and A
5. For each size 2 pattern, "repeat" from step 2
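A sketch of how a conditional pattern base could be collected from the tree. Two assumptions are made and should be read as such: the data are taken to be the five-transaction example of the Han et al. paper (the slide's own table did not survive in this transcript), and the item order f, c, a, b, m, p is passed in explicitly to match that paper. The header table is simplified to a list of nodes per item instead of a node-link chain.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(db, order):
    """Insert the frequent items of each transaction following `order`."""
    root = Node(None, None)
    header = {item: [] for item in order}   # item -> nodes carrying it
    for transaction in db:
        present, node = set(transaction), root
        for item in (i for i in order if i in present):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return header

def conditional_pattern_base(header, item):
    """Prefix paths of every node with `item`, each with that node's count."""
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append(("".join(reversed(path)), node.count))
    return base

def conditional_frequent_items(base, minsup):
    """Items that stay frequent inside a conditional pattern base."""
    counts = Counter()
    for path, count in base:
        for item in path:
            counts[item] += count
    return {i: c for i, c in counts.items() if c >= minsup}

# Assumed example data (Han et al.); single-letter items.
db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
header = build_tree(db, order="fcabmp")

base_p = conditional_pattern_base(header, "p")
base_m = conditional_pattern_base(header, "m")
print(base_p, conditional_frequent_items(base_p, minsup=3))
print(base_m, conditional_frequent_items(base_m, minsup=3))
```

The conditional FP-Tree of an item is then just an FP-Tree built from its pattern base, restricted to the items that remain frequent.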
FP-Growth
Patterns with "p" (minsup=3)
Conditional pattern base: fcam:2, cb:1
Conditional FP-Tree
only c is frequent
Tree is [root -> c:3]
Size 2 patterns: cp:3
Patterns with "p": p:3, cp:3
Now we don't have to worry about "p"
all patterns with "p" have been derived
FP-Growth
Patterns with "m" (minsup=3)
Conditional pattern base: fca:2, fcab:1
Conditional FP-Tree
Table: [f:3, c:3, a:3]
Tree is [root -> f:3 -> c:3 -> a:3]
Size 2 patterns: am:3, cm:3, fm:3
extend am (next slide)
FP-Growth
Patterns with "am" (minsup=3)
Now the Tree is
Table: [f:3, c:3]
Tree is [root -> f:3 -> c:3]
Conditional pattern base: fc:3
Conditional FP-Tree
Tree is [root -> f:3 -> c:3]
Size 3 patterns: cam:3, fam:3
extend cam
fam is not extendable
FP-Growth
Patterns with "cam" (minsup=3)
Now the Tree is [root -> f:3]
Conditional pattern base: f:3
Conditional FP-Tree
Tree is [root -> f:3]
Size 4 patterns: fcam:3
no longer patterns are available
FP-Growth (Exercise)
Continuing…
Extend
cm
fm
And then find patterns with
b
a
c
f
Properties
To obtain the frequent patterns with suffix A, we only need
to consider the prefix paths of the nodes with A.
the count of each prefix path is that of its A node
E.g. with b
Properties
an item set B∪{A} is frequent in DB iff B is frequent
in the conditional pattern base of A.
That’s why we just need the tree
Single-path tree (special case)
An FP-Tree with a single path can be mined by
enumerating all the combinations of the subpaths.
The count of each subpath is the minimum of the counts of the nodes in it
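The single-path case is easy to sketch. The path used below is the conditional FP-Tree of "m" quoted on an earlier slide (f:3, c:3, a:3); the function name and the representation of a path as a list of (item, count) pairs are assumptions made for the illustration, as is the use of single-letter item names.

```python
from itertools import combinations

def mine_single_path(path, suffix=""):
    """Enumerate every combination of nodes on a single path.

    `path` is a list of (item, count) pairs from the root downwards.
    The support of a combination is the minimum count among its nodes.
    """
    patterns = {}
    for size in range(1, len(path) + 1):
        for combo in combinations(path, size):
            items = "".join(item for item, _ in combo) + suffix
            patterns[items] = min(count for _, count in combo)
    return patterns

# Conditional FP-Tree of "m": root -> f:3 -> c:3 -> a:3
patterns = mine_single_path([("f", 3), ("c", 3), ("a", 3)], suffix="m")
for items, support in patterns.items():
    print(f"{items}:{support}")
```

A path of k nodes yields 2^k - 1 patterns, all without building any further conditional trees.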
Single prefix-path Tree (special case)
Can be mined more efficiently.
Separate single prefix-path from multi-path part
Mine separately
Combine
Finally: FP-Growth algorithm
Input: DB (as a Tree), minsup
Output: the set of frequent patterns
Method: call FP-growth(Tree, null)
Procedure FP-growth(Tree, A)
  If Tree has a single prefix-path P and a multipath part Q
    Generate patterns from combinations in P+A (freq-pattern(P))
  Else let Q be Tree
  For each item Ai in Q
    Generate pattern B=Ai+A
    Construct B cond-pattern-base and cond FP-Tree Tree_B
    If Tree_B is not empty call freq-pattern(B)=FP-growth(Tree_B, B)
  Return freq-pattern(P)+freq-pattern(Q)+freq-pattern(P) × freq-pattern(Q)
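A compact Python sketch of the recursive procedure. It is a simplified variant, not the procedure exactly as stated: the single prefix-path shortcut is left out, so every tree is treated as the multipath part Q. The input is a list of (items, count) pairs so that the same function handles the original DB and the conditional pattern bases. The data are assumed to be the five-transaction example of the Han et al. paper, and all names are illustrative.

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted, minsup):
    """FP-tree from (items, count) pairs; returns header table and order."""
    counts = Counter()
    for items, n in weighted:
        for item in set(items):
            counts[item] += n
    order = sorted((i for i in counts if counts[i] >= minsup),
                   key=lambda i: (-counts[i], i))
    root, header = Node(None, None), defaultdict(list)
    for items, n in weighted:
        present, node = set(items), root
        for item in (i for i in order if i in present):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += n
            node = child
    return header, order

def fp_growth(weighted, minsup, suffix=frozenset()):
    """Return {pattern: support} for all frequent patterns containing suffix."""
    header, order = build_tree(weighted, minsup)
    patterns = {}
    for item in reversed(order):               # least frequent item first
        pattern = suffix | {item}              # B = Ai + A
        patterns[pattern] = sum(n.count for n in header[item])
        base = []                              # conditional pattern base of B
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        patterns.update(fp_growth(base, minsup, pattern))
    return patterns

# Assumed example data (Han et al.), minsup = 3.
db = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
patterns = fp_growth([(t, 1) for t in db], minsup=3)
for items, support in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(items)), support)
```

The recursion stops by itself: an empty pattern base produces an empty tree, so no further patterns are generated for that branch.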
Analysis
It can be shown that
The algorithm finds the complete set of frequent patterns in
the given transaction database DB.
An FP-Tree is usually much smaller than DB
It scans the FP-Tree of DB once for each frequent item A
Generates a small pattern base B_A
Pattern mining is done recursively on the small B_A
Mining operations consist of
Prefix count adjustment
Counting local frequent items
Pattern fragment concatenation
More…
Scaling FP-growth by not assuming the Tree fits main
memory.
Divide the database adequately and join the patterns obtained
Project the original DB into different DBs
Experimental evaluation
Compare FP-growth with Apriori
Compare FP-growth with TreeProjection [Agarwal et al 2001]
Reimplement competing algorithms
Synthetic and real data sets
Experimental evaluation
Assess compactness of FP-Trees
By measuring their size on connect-4 (and other data)
Experimental evaluation
Scalability study
By measuring time of competing algorithms on different data
sets (here sparse artificial data set and (dense) connect-4)
Exercises (on paper)
Consider DB= {{a,b,c,d},{a,b,c,d},{f,c,d},{e,c,d},{c}},
Obtain the frequent patterns using FP-Growth, minsup=1
Advanced exercises
Implement your own version of building an FP-Tree from
a database of transactions
If you did it in R look at the FP-Tree of the greengrocers
dataset
Try Borgelt’s implementation of FP-Growth
http://www.borgelt.net/fpgrowth.html
Compare it with Borgelt’s implementation of Apriori
References
Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao. 2004.
Mining Frequent Patterns without Candidate Generation: A
Frequent-Pattern Tree Approach. Data Min. Knowl. Discov. 8,
1 (January 2004)
Ramesh C. Agarwal, Charu C. Aggarwal, and V.V.V. Prasad. 2001. A
tree projection algorithm for generation of frequent item sets. J.
Parallel Distrib. Comput. 61, 3 (March 2001)
Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast Algorithms for
Mining Association Rules in Large Databases. In Proceedings of the
20th International Conference on Very Large Data Bases (VLDB '94)
Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu
Toivonen, and A. Inkeri Verkamo. 1996. Fast discovery of association
rules. In Advances in knowledge discovery and data mining, Usama M.
Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy
Uthurusamy (Eds.)