Download Mining Frequent Patterns from Very Dimensional Data: A Top

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Mining Frequent Patterns from Very High
Dimensional Data: A Top-Down Row
Enumeration Approach
Hongyan Liu
Jiawei Han
Dong Xin
Zheng Shao2
From SDM06
Yi-Husi Lee
1
Outline

Introduction

Preliminaries

Algorithm

Experimental

Conclusion
Yi-Husi Lee
2
Introduction

What is very high data ?
Different from transactional data set, very high dimensional
data like microarray data usually does not have somany rows
(samples) but have a large number of columns (genes).
Yi-Husi Lee
3
Preliminaries

Table and Transposed Table
Let minsup = 2,
Table T
Transposed Table TT
Yi-Husi Lee
4
Preliminaries(cont.)

Closure

Closed itemset and closed rowset
Given an itemset I ⊆ I, a rowset S ⊆ S and R ⊆ S × I is a relation,
we define
r(I) = { ri ∈ S | ∀ij ∈ I , (ri, ij) ∈ R }
i(S) = { ij ∈ I | ∀ri ∈ S , (ri, ij) ∈ R }
then
C(I) = i(r(I))
C(S) = r(i(S))
An itemset I is called a closed itemset iff I = C(I).
Likewise, a rowset S is called a closed rowset iff
S=C(S).
Yi-Husi Lee
5
Preliminaries(cont.)

Example
EX1:{b1, c2}
r({b1, c2}) = {2, 4}
i({2, 4}) = {b1, c2, d2}
Then C({b1, c2}) = {b1, c2, d2} ≠ {b1, c2}
Therefore, {b1, c2} is not a closed itemset.
EX2:{1,2}
i({1, 2}) = {a1, b1}
r({a1, b1}) = {1, 2, 3}
Then C({1, 2}) = {1, 2, 3} ≠ {1,2}
So rowset {1, 2} is not a Closed rowset, but apparently {1, 2, 3} is.
Yi-Husi Lee
6
Preliminaries(cont.)

Bottom-up Search Strategy
By bottom-up we mean that along every search the row
enumeration space from small rowsets to large ones.
It is monotonic in terms of bottom-up search order, it is hard to
prune the row enumeration search space early.
Memory cost for this kind of bottom-up search is also big.
As the minsup increase , the time need to complete the mining
process can not decrease rapidly.
Yi-Husi Lee
7
Preliminaries(cont.)

Top-down Search Strategy
A top-down search method and a novel row enumeration tree are
propose to take advantage of the pruning power of minimum
support threshold.
A new method, called closeness-checking, is developed to check
efficiently and effectively whether a pattern is closed. Unlike
other existing closeness-checking methods, it does not need to
scan the mining data set, nor the result set, and is easy to
integrate with the top-down search process.
Yi-Husi Lee
8
Preliminaries(cont.)

X-excluded transposed table
Given a rowset x = {ri1, ri2, …, rik} with an order such that ri1 > ri2
… > rik, a minimum support threshold minsup and its parent table
TT|p, an x-excluded transposed table TT|x is a table in which each
tuple contains rids less than any of rids in x, and at the same time
contains all of the rids greater than any of rids in x. Rowset x is
called an excluded rowset.
Yi-Husi Lee
9
Preliminaries(cont.)

Procedure to get X-excluded transposed
table
Example: find table TT|5, TT|54 & TT|4, let the minsup=2
The original transposed table TT|Φ
Yi-Husi Lee
10
Preliminaries(cont.)

EX1: TT|5
TT|5 is easy to find, for each tuple in table TT|Φ containing rid 5,
delete 5 and copy it to TT|5, then chick if the size of every tuple
is great than minsup. If not delete it from TT|5
TT|5
Yi-Husi Lee
11
Preliminaries(cont.)

EX2: TT|54
First: Each tuple in table TT|5 is a candidate tuple of TT|54 as
there is no rid greater than 5
Second: After excluding rids not less than 4, the table is show
below
Last: Since tuple c2 only contains one rid, it does not satisfy
minsup and delete it from TT|54
after prune
TT|54 without prune
Yi-Husi Lee
TT|54
12
Preliminaries(cont.)

EX3:TT|4
TT|4 is not the same of TT|54 and TT|5, because it is not
contains the greatest rid. Follow the definition (and at the same
time contains all of the rids greater than any of rids in x), so
we can find the tuple which has both 4 and 5.
But tuple a2 does not satisfy minsup after excluding rid 4, so
only tuple left in TT|4. Although the current size of tuple c2 in
TT|4 is 1, but it actual size is 2 since it contains rid 5 which is
not listed in TT|4.
after prune
Yi-Husi Lee
13
Algorithm

Closeness-Checking Method
To avoid generating all frequent itemsets during the mining
process, it is important to perform the closeness checking as
early as possible during mining.

Lemma1.
Given an itemset I ⊆ I and a rowset S ⊆ S, the following two
equations hold:
r(I) = r(i(r(I)))
i(S) = i(r(i(S)))
Yi-Husi Lee
14
Algorithm(cont.)


Lemma2.
In transposed table TT, a rowset S ⊆ S is closed iff it can be
represented by an intersection of a set of tuples, that is:
∃I ⊆ I, s.t. S = r(I) = ∩j r({ij})
where ij ∈ I, and I = {i1, i2, …, il}
Lemma3.
Given a rowset S ⊆ S, in transposed table TT, for every tuple ij
containing S, which means ij ∈ i(S), if S ≠ ∩jr({ij}), where ij ∈
i(S), then S is not closed.
Yi-Husi Lee
15
Algorithm(cont.)

Skip-rowset and merge of x-excluded
transposed table
In order to speed up this kind of checking, we add some additional
information for each x-excluded transposed table.
TT|54
TT|54 with skip rowset
Yi-Husi Lee
TT|54 after merge
16
Algorithm(cont.)

TD-Closed
n: the number of rowset
FCP: frequent closed patterns
excludedsize: is the number
of rids excluded from current
table
cMinsup: A dynamically changing
minimum support threshold
Yi-Husi Lee
17
Algorithm(cont.)

Pruning strategy 1: If excludedSize is equal to or greater than
(n–minsup), then there is no need to do any further recursive call
of TopDownMine. ExcludedSize is the number of rids excluded
from current table. If it is not less than (n–minsup), the size of
each rowset in current transposed table must be less than minsup,
so these rowsets are impossible to become large.
Yi-Husi Lee
18
Algorithm(cont.)


Pruning strategy 2: If an x-excluded transposed table contains
only one tuple, it is not necessary to do further recursive call to
deal with its child transposed tables. Of course, before return
according to pruning strategy 2, the current rowset S might be
a closed rowset, so if the skip-rowset is empty, we need to
output it first.
EX1.(pruning strategy 2)
For table TT|4, there is only one tuple in this table. After we
check this tuple to see if it is closed (it is not closed apparently
as its skip-rowset is not empty), we do not need to produce any
child excluded transposed table from it.
Yi-Husi Lee
19
Algorithm(cont.)


Pruning strategy 3: Each tuple t containing rid y in TT|x will be
deleted from TT|x∪y if size of t (that is the number of rids t
contains) equals cMinsup.
We can get these two tables, TT|x∪y and TT|x’, bysteps as follows.
First, for each tuple t in TT|x containing rid y, delete y from it,
and copy it to TT|x’.And then check if the size of t is less than
cMinsup. Ifnot, keep it and in the meantime put y in the
skiprowset,otherwise get rid of it from TT|x. Finally TT|x becomes
TT|x∪y, and at the same time TT|x’ is obtained.
Yi-Husi Lee
20
Algorithm(cont.)

Ex2.(pruning strategy 2)
Let cMinsup=2
TT|x’
TT|x
TT|x∪y
Yi-Husi Lee
21
Algorithm(cont.)



Pruning strategy 4: The step is the major output step. If there
is a tuple with empty skip-rowset and the largest size, say k, and
containing rid k in TT|x∪y, then output it. Since this tuple has the
largest size among all of the tuples in TT|x∪y, there is no other
tuple that will contribute to it.Therefore, it can be output.
EX3.(pruning strategy 3)
In table TT|54, itemset a1b1 corresponding to rowset {1,2,3} can
be output as its size is 3 which is larger than the size of the
other two tuples, and it also contains the largest rid 3.
Pruning strategy 5: It conducts two recursive calls to deal with
TT|x∪y and TT|x’ respectively.
Yi-Husi Lee
22
Experimental

Synthetic Data Set
Yi-Husi Lee
23
Experimental(cont.)
Yi-Husi Lee
24
Experimental(cont.)

Real Data Sets
Yi-Husi Lee
25
Conclusion



A new algorithm, TD-Close, is designed and implemented for
mining a complete set of frequent closed itemsets from high
dimensional data.
Several pruning strategies are developed to speed up searching.
Both analysis and experimental study show that these methods
are effective and useful.
Future work includes integrating top-down row enumeration
method and column row enumeration method for frequent
pattern mining from both long and deep large datasets, and
mining classification rules based on association rules using topdown searching strategy.
Yi-Husi Lee
26