Lecture Notes for Chapter 7
Sequential Pattern Mining
Frequent Subgraph Mining
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Sequence Data
Sequence Database:

Object | Timestamp | Events
A      | 10        | 2, 3, 5
A      | 20        | 6, 1
A      | 23        | 1
B      | 11        | 4, 5, 6
B      | 17        | 2
B      | 21        | 7, 8, 1, 2
B      | 28        | 1, 6
C      | 14        | 1, 8, 7

[Timeline figure: the events of objects A, B, and C plotted on a shared timeline from t = 10 to t = 35]
Sequence Databases
• A sequence database consists of ordered elements or events
• Transaction databases vs. sequence databases:

A transaction database:

TID | itemsets
10  | a, b, d
20  | a, c, d
30  | a, d, e
40  | b, e, f

A sequence database:

SID | sequences
10  | <a(abc)(ac)d(cf)>
20  | <(ad)c(bc)(ae)>
30  | <(ef)(ab)(df)cb>
40  | <eg(af)cbc>
Examples of Sequence Data
Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web Data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Event data | History of events generated by a given sensor | Events triggered by a sensor at time t | Types of alarms generated by sensors
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C

[Figure: a generic sequence E1 E2 E1 E3 E2 … illustrating the sequence, its elements (transactions), and its events (items)]
Formal Definition of a Sequence
• A sequence is an ordered list of elements (transactions):
  s = < e1 e2 e3 … >
  – Each element contains a collection of events (items):
    ei = {i1, i2, …, ik}
  – Each element is attributed to a specific time or location
• Length of a sequence, |s|, is given by the number of elements of the sequence
• A k-sequence is a sequence that contains k events (items)
Examples of Sequences
• Web sequence:
  < {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >
• Sequence of initiating events causing the nuclear accident at Three Mile Island:
  (http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm)
  < {clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pumps trip} {main waterpump trips} {main turbine trips} {reactor pressure increases} >
• Sequence of books checked out at a library:
  < {Fellowship of the Ring} {The Two Towers} {Return of the King} >
Formal Definition of a Subsequence
• A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin

Data sequence          | Subsequence   | Contain?
< {2,4} {3,5,6} {8} >  | < {2} {3,5} > | Yes
< {1,2} {3,4} >        | < {1} {2} >   | No
< {2,4} {2,4} {2,5} >  | < {2} {4} >   | Yes

• The support of a subsequence w is defined as the fraction of data sequences that contain w
• A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup)
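The containment test in the table above is easy to implement: a greedy left-to-right scan over the data sequence suffices. A minimal Python sketch (the representation as lists of sets and the function names are ours, not from the slides):

    def contains(data_seq, subseq):
        # Greedy check: is subseq contained in data_seq?
        # Both sequences are represented as lists of sets of events.
        i = 0
        for elem in subseq:
            # advance to the next data element that is a superset of elem
            while i < len(data_seq) and not elem <= data_seq[i]:
                i += 1
            if i == len(data_seq):
                return False
            i += 1  # later subseq elements must match strictly later positions
        return True

    def support(db, w):
        # fraction of data sequences in db that contain w
        return sum(contains(s, w) for s in db) / len(db)

    # the three rows of the table above
    assert contains([{2, 4}, {3, 5, 6}, {8}], [{2}, {3, 5}])   # Yes
    assert not contains([{1, 2}, {3, 4}], [{1}, {2}])          # No
    assert contains([{2, 4}, {2, 4}, {2, 5}], [{2}, {4}])      # Yes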
Sequential Pattern Mining: Definition
• Given:
  – a database of sequences
  – a user-specified minimum support threshold, minsup
• Task:
  – Find all subsequences with support ≥ minsup
Sequential Pattern Mining: Challenge
• Given a sequence: <{a b} {c d e} {f} {g h i}>
  – Examples of subsequences: <{a} {c d} {f} {g}>, <{c d e}>, <{b} {g}>, etc.
• How many k-subsequences can be extracted from a given n-sequence?
  – Example: <{a b} {c d e} {f} {g h i}> has n = 9 events
  – k = 4: picking the marked events in Y_ _YY _ _ _Y yields <{a} {d e} {i}>
  – Answer: C(n, k) = C(9, 4) = 126
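The count can be checked by brute force: because all nine events here are distinct, every choice of k events yields a distinct k-subsequence once the chosen events are regrouped by element. A sketch (representation ours):

    from itertools import combinations
    from math import comb

    seq = [{'a', 'b'}, {'c', 'd', 'e'}, {'f'}, {'g', 'h', 'i'}]
    # flatten into (element index, item) pairs
    events = [(ei, item) for ei, elem in enumerate(seq) for item in sorted(elem)]

    n, k = len(events), 4
    subseqs = set()
    for pick in combinations(events, k):
        groups = {}
        for ei, item in pick:            # regroup chosen events by element
            groups.setdefault(ei, set()).add(item)
        subseqs.add(tuple(frozenset(groups[ei]) for ei in sorted(groups)))

    assert len(subseqs) == comb(n, k) == 126   # C(9, 4)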
Sequential Pattern Mining: Example
Object | Timestamp | Events
A      | 1         | 1, 2, 4
A      | 2         | 2, 3
A      | 3         | 5
B      | 1         | 1, 2
B      | 2         | 2, 3, 4
C      | 1         | 1, 2
C      | 2         | 2, 3, 4
C      | 3         | 2, 4, 5
D      | 1         | 2
D      | 2         | 3, 4
D      | 3         | 4, 5
E      | 1         | 1, 3
E      | 2         | 2, 4, 5

Minsup = 50%

Examples of frequent subsequences:
< {1,2} >        s = 60%
< {2,3} >        s = 60%
< {2,4} >        s = 80%
< {3} {5} >      s = 80%
< {1} {2} >      s = 80%
< {2} {2} >      s = 60%
< {1} {2,3} >    s = 60%
< {2} {2,3} >    s = 60%
< {1,2} {2,3} >  s = 60%
Challenges on Sequential Pattern Mining
• A huge number of possible sequential patterns are hidden in databases
• A mining algorithm should
  – find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  – be highly efficient and scalable, involving only a small number of database scans
  – be able to incorporate various kinds of user-specific constraints
Studies on Sequential Pattern Mining
• Concept introduction and an initial Apriori-like algorithm
  – Agrawal & Srikant. Mining sequential patterns [ICDE'95]
• Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal [EDBT'96])
• Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. [KDD'00]; Pei et al. [ICDE'01])
• Vertical format-based mining: SPADE (Zaki [Machine Learning'00])
• Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim [VLDB'99]; Pei, Han, Wang [CIKM'02])
• Mining closed sequential patterns: CloSpan (Yan, Han & Afshar [SDM'03])
Methods for sequential pattern mining
• Apriori-based Approaches
  – AprioriAll
  – GSP
• Sequence-Enumeration Tree/Lattice-based
  – SPADE
  – SPAM
• Pattern-Growth-based Approaches
  – FreeSpan
  – PrefixSpan
The Apriori Property of Sequential Patterns
• A basic property: Apriori (Agrawal & Srikant '94)
  – If a sequence S is not frequent, then none of the supersequences of S is frequent
  – E.g., if <hb> is infrequent, then so are <hab> and <(ah)b>
• Major strength: candidate pruning by Apriori
Extracting Sequential Patterns
• Given n events: i1, i2, i3, …, in
• Candidate 1-subsequences:
  <{i1}>, <{i2}>, <{i3}>, …, <{in}>
• Candidate 2-subsequences:
  <{i1, i2}>, <{i1, i3}>, …, <{i1} {i1}>, <{i1} {i2}>, …, <{in-1} {in}>
• Candidate 3-subsequences:
  <{i1, i2, i3}>, <{i1, i2, i4}>, …, <{i1, i2} {i1}>, <{i1, i2} {i2}>, …,
  <{i1} {i1, i2}>, <{i1} {i1, i3}>, …, <{i1} {i1} {i1}>, <{i1} {i1} {i2}>, …
Generalized Sequential Pattern (GSP)
• Step 1:
  – Make the first pass over the sequence database D to yield all the 1-element frequent sequences
• Step 2: repeat until no new frequent sequences are found
  – Candidate Generation: merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items
  – Candidate Pruning: prune candidate k-sequences that contain infrequent (k-1)-subsequences
  – Support Counting: make a new pass over the sequence database D to find the support for these candidate sequences
  – Candidate Elimination: eliminate candidate k-sequences whose actual support is less than minsup
Candidate Generation
• Base case (k=2):
  – Merging two frequent 1-sequences <{i1}> and <{i2}> will produce two candidate 2-sequences: <{i1} {i2}> and <{i1 i2}>
• General case (k>2):
  – A frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2
  – The resulting candidate after merging is given by the sequence w1 extended with the last event of w2:
    – If the last two events in w2 belong to the same element, then the last event in w2 becomes part of the last element in w1
    – Otherwise, the last event in w2 becomes a separate element appended to the end of w1
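The merge condition and the two extension rules translate directly into code. A sketch (our own representation: a sequence is a list of lists with the items of each element in a fixed order; the k = 2 base case above is handled separately); it reproduces the first two examples on the next slide:

    def drop_first(s):
        # remove the first event of sequence s
        head, rest = s[0], s[1:]
        return ([head[1:]] + rest) if len(head) > 1 else rest

    def drop_last(s):
        # remove the last event of sequence s
        init, tail = s[:-1], s[-1]
        return (init + [tail[:-1]]) if len(tail) > 1 else init

    def merge(w1, w2):
        # merge two frequent (k-1)-sequences into a candidate k-sequence,
        # or return None if the merge condition fails
        if drop_first(w1) != drop_last(w2):
            return None
        last = w2[-1][-1]
        if len(w2[-1]) > 1:                  # last two events of w2 share an element
            return w1[:-1] + [w1[-1] + [last]]
        return w1 + [[last]]                 # last event becomes its own element

    assert merge([[1], [2, 3], [4]], [[2, 3], [4, 5]]) == [[1], [2, 3], [4, 5]]
    assert merge([[1], [2, 3], [4]], [[2, 3], [4], [5]]) == [[1], [2, 3], [4], [5]]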
Candidate Generation Examples
• Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4 5}> will produce the candidate sequence <{1} {2 3} {4 5}> because the last two events in w2 (4 and 5) belong to the same element
• Merging the sequences w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4} {5}> will produce the candidate sequence <{1} {2 3} {4} {5}> because the last two events in w2 (4 and 5) do not belong to the same element
• We do not have to merge the sequences w1 = <{1} {2 6} {4}> and w2 = <{1} {2} {4 5}> to produce the candidate <{1} {2 6} {4 5}>, because if the latter is a viable candidate, then it can be obtained by merging w1 with <{2 6} {4 5}>
GSP Example
Frequent 3-sequences:
< {1} {2} {3} >
< {1} {2 5} >
< {1} {5} {3} >
< {2} {3} {4} >
< {2 5} {3} >
< {3} {4} {5} >
< {5} {3 4} >

Candidate Generation (merged 4-sequences):
< {1} {2} {3} {4} >
< {1} {2 5} {3} >
< {1} {5} {3 4} >
< {2} {3} {4} {5} >
< {2 5} {3 4} >

Candidate Pruning leaves:
< {1} {2 5} {3} >
Another GSP Example

Seq. ID | Sequence
10      | <(bd)cb(ac)>
20      | <(bf)(ce)b(fg)>
30      | <(ah)(bf)abf>
40      | <(be)(ce)d>
50      | <a(bd)bcb(ade)>

Given support threshold min_sup = 2
Finding Length-1 Sequential Patterns
• Initial candidates: <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
• Scan the database once, count support for the candidates (min_sup = 2)

Seq. ID | Sequence
10      | <(bd)cb(ac)>
20      | <(bf)(ce)b(fg)>
30      | <(ah)(bf)abf>
40      | <(be)(ce)d>
50      | <a(bd)bcb(ade)>

Cand | Sup
<a>  | 3
<b>  | 5
<c>  | 4
<d>  | 3
<e>  | 3
<f>  | 2
<g>  | 1
<h>  | 1
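The first GSP pass is a plain per-sequence item count. A sketch over the database above (each sequence written as a list of sets; names ours):

    from collections import Counter

    db = {
        10: [{'b','d'}, {'c'}, {'b'}, {'a','c'}],                   # <(bd)cb(ac)>
        20: [{'b','f'}, {'c','e'}, {'b'}, {'f','g'}],               # <(bf)(ce)b(fg)>
        30: [{'a','h'}, {'b','f'}, {'a'}, {'b'}, {'f'}],            # <(ah)(bf)abf>
        40: [{'b','e'}, {'c','e'}, {'d'}],                          # <(be)(ce)d>
        50: [{'a'}, {'b','d'}, {'b'}, {'c'}, {'b'}, {'a','d','e'}], # <a(bd)bcb(ade)>
    }

    support = Counter()
    for seq in db.values():
        for item in set().union(*seq):   # each sequence counts an item at most once
            support[item] += 1

    min_sup = 2
    patterns = {item: c for item, c in support.items() if c >= min_sup}
    print(sorted(patterns.items()))
    # [('a', 3), ('b', 5), ('c', 4), ('d', 3), ('e', 3), ('f', 2)]  -- g, h drop out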
Generating Length-2 Candidates
51 length-2 candidates:

36 candidates of the form <x y> (x, y drawn from the 6 frequent items):

     <a>  <b>  <c>  <d>  <e>  <f>
<a>  <aa> <ab> <ac> <ad> <ae> <af>
<b>  <ba> <bb> <bc> <bd> <be> <bf>
<c>  <ca> <cb> <cc> <cd> <ce> <cf>
<d>  <da> <db> <dc> <dd> <de> <df>
<e>  <ea> <eb> <ec> <ed> <ee> <ef>
<f>  <fa> <fb> <fc> <fd> <fe> <ff>

15 candidates of the form <(xy)>:

<(ab)> <(ac)> <(ad)> <(ae)> <(af)>
<(bc)> <(bd)> <(be)> <(bf)>
<(cd)> <(ce)> <(cf)>
<(de)> <(df)>
<(ef)>

Without the Apriori property there would be 8*8 + 8*7/2 = 92 candidates; Apriori prunes 44.57% of the candidates.
Finding Length-2 Sequential Patterns
• Scan the database one more time, collect the support count for each length-2 candidate
• There are 19 length-2 candidates which pass the minimum support threshold
  – They are length-2 sequential patterns
The GSP Mining Process
Seq. ID | Sequence
10      | <(bd)cb(ac)>
20      | <(bf)(ce)b(fg)>
30      | <(ah)(bf)abf>
40      | <(be)(ce)d>
50      | <a(bd)bcb(ade)>

min_sup = 2

1st scan: 8 candidates (<a> <b> <c> <d> <e> <f> <g> <h>), 6 length-1 seq. patterns
2nd scan: 51 candidates (<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>), 19 length-2 seq. patterns, 10 candidates not in the DB at all
3rd scan: 46 candidates, 19 length-3 seq. patterns, 20 candidates not in the DB at all (e.g., <abb> <aab> <aba> <baa> <bab> …)
4th scan: 8 candidates (<abba> <(bd)bc> …), 6 length-4 seq. patterns
5th scan: 1 candidate (<(bd)cba>), 1 length-5 seq. pattern; this candidate cannot pass the support threshold
The GSP Algorithm
• Benefits from the Apriori pruning
  – Reduces search space
• Bottlenecks
  – Scans the database multiple times
  – Generates a huge set of candidate sequences, especially 2-item candidate sequences
  – Inefficient for mining long sequential patterns:
    – A long pattern grows from short patterns
    – An exponential number of short candidates
• There is a need for more efficient mining methods
The SPADE Algorithm
• SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki (2001)
• A vertical-format sequential pattern mining method
• A sequence database is mapped to a large set of items, each with a list of <SID, EID> pairs (vertical data format)
• Sequential pattern mining is performed by
  – growing the subsequences (patterns) one item at a time via Apriori candidate generation
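A sketch of the vertical mapping and of the temporal join that grows patterns from id-lists (our own minimal representation; SPADE additionally organizes these joins into equivalence classes to limit memory use):

    from collections import defaultdict

    def to_vertical(db):
        # map a horizontal sequence database to id-lists: item -> [(SID, EID)]
        idlists = defaultdict(list)
        for sid, seq in db.items():
            for eid, elem in enumerate(seq, start=1):
                for item in elem:
                    idlists[item].append((sid, eid))
        return idlists

    def s_join(prefix_idlist, item_idlist):
        # sequence join: keep occurrences of the item that fall strictly
        # after an occurrence of the prefix in the same sequence
        ends = defaultdict(list)
        for sid, eid in prefix_idlist:
            ends[sid].append(eid)
        return [(sid, eid) for sid, eid in item_idlist
                if any(e < eid for e in ends.get(sid, ()))]

    db = {1: [{'a'}, {'b'}, {'a', 'c'}], 2: [{'b'}, {'a'}]}
    v = to_vertical(db)
    print(s_join(v['a'], v['b']))   # [(1, 2)]: <a b> occurs only in SID 1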
SPAM (Sequential pattern mining)
• S = ({a, b, c}, {a, b})
• Sequence length: Length(S) = 5
• Sequence size: Size(S) = 2
• Sequence-extended sequence: S' = ({a, b, c}, {a, b}, {a})
• Itemset-extended sequence: S' = ({a, b, c}, {a, b, d})
SPAM (Sequential pattern mining)
• Max Size = 3
• Items = {a, b}

[Figure: the lexicographic sequence tree rooted at {}. Level 1 holds a and b; each node is grown either by a sequence-extension (append a new element, e.g. a to a,a) or an itemset-extension (add an item to the last element, e.g. a to {a,b}), down to level 6 with a,{a,b},{a,b}.]
SPAM (Sequential pattern mining)
• Max Size = 3
• Items = {a, b}

[Figure: the same tree fully expanded through level 6, including the subtrees under b and {a,b}, e.g. {a,b} to {a,b},a / {a,b},b / {a,b},{a,b} and onward to {a,b},{a,b},{a,b}.]
SPAM (Sequential pattern mining)
• Pruning
• Items = {a, b, c, d}

[Figure: pruning in the subtree under a: the S-extensions a,a / a,b / a,c / a,d with their further extensions (a,a,a … a,b,d) and I-extensions (a,{a,b} … a,{b,d}); extensions found infrequent are pruned together with their subtrees.]
Data Representation – BitMap

[Figure: bitmap representation of the sequence database]
S-type
S = {a}
S’={a},{b}
S’={a},{c}
…
I-type
S = {a}
S’={a, b}
S’={a, c}
…
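These S- and I-extensions are what SPAM's bitmaps make cheap. A per-sequence sketch (assumption: one Python int per data sequence, with bit i set iff the current prefix can end at element i; the actual implementation packs all sequences into one long bitmap):

    def s_step(prefix_bits, item_bits, n_elems):
        # S-extension: the new item must occur in a strictly later element.
        # Set every bit after the first set bit of the prefix, then AND.
        first = next((i for i in range(n_elems) if prefix_bits >> i & 1), None)
        if first is None:
            return 0
        later = ((1 << n_elems) - 1) & ~((1 << (first + 1)) - 1)  # bits > first
        return later & item_bits

    def i_step(prefix_bits, item_bits):
        # I-extension: the new item must occur in the same element
        return prefix_bits & item_bits

    # sequence <(a)(ab)(b)>: bit i = item occurs in element i
    a_bits, b_bits = 0b011, 0b110
    assert s_step(a_bits, b_bits, 3) == 0b110   # {a},{b} can end at elements 1, 2
    assert i_step(a_bits, b_bits) == 0b010      # {a, b} occurs at element 1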
Small Database

[Figure: performance of SPADE, SPAM, and PrefixSpan on a small database and on a medium-sized database]

Large Database

[Figure: performance of the three algorithms on a large database]
PrefixSpan (Prefix-Projected Sequential Pattern Growth)
• PrefixSpan
  – Projection-based
  – But only prefix-based projection: fewer projections and quickly shrinking sequences
• J. Pei, J. Han, et al. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01.
Prefix and Suffix (Projection)
• <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
• Given sequence <a(abc)(ac)d(cf)>:

Prefix | Suffix (Prefix-Based Projection)
<a>    | <(abc)(ac)d(cf)>
<aa>   | <(_bc)(ac)d(cf)>
<ab>   | <(_c)(ac)d(cf)>
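Projection by one additional prefix item can be sketched as follows (representation ours: elements as tuples with items in lexicographic order; a leading '_' marks the remainder of a split element, as in the table above). Note this sketch matches the item anywhere; a full implementation distinguishes itemset extensions (matching inside a '_' element) from sequence extensions:

    def project(seq, item):
        # suffix of seq w.r.t. one more prefix item, or None if absent
        for i, elem in enumerate(seq):
            items = elem[1:] if elem and elem[0] == '_' else elem
            if item in items:
                rest = items[items.index(item) + 1:]  # remainder of the element
                suffix = list(seq[i + 1:])
                if rest:
                    suffix.insert(0, ('_',) + rest)
                return suffix
        return None

    s = [('a',), ('a','b','c'), ('a','c'), ('d',), ('c','f')]  # <a(abc)(ac)d(cf)>
    s_a = project(s, 'a')     # prefix <a>:  [('a','b','c'), ('a','c'), ('d',), ('c','f')]
    print(project(s_a, 'a'))  # prefix <aa>: [('_','b','c'), ('a','c'), ('d',), ('c','f')]
    print(project(s_a, 'b'))  # prefix <ab>: [('_','c'), ('a','c'), ('d',), ('c','f')]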
Mining Sequential Patterns by Prefix Projections
• Step 1: find length-1 sequential patterns
  – <a>, <b>, <c>, <d>, <e>, <f>
• Step 2: divide the search space. The complete set of seq. patterns can be partitioned into 6 subsets:
  – The ones having prefix <a>
  – The ones having prefix <b>
  – …
  – The ones having prefix <f>

SID | sequence
10  | <a(abc)(ac)d(cf)>
20  | <(ad)c(bc)(ae)>
30  | <(ef)(ab)(df)cb>
40  | <eg(af)cbc>
Finding Seq. Patterns with Prefix <a>
• Only need to consider projections w.r.t. <a>
  – <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
• Find all the length-2 seq. patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
  – Further partition into 6 subsets:
    – Having prefix <aa>
    – …
    – Having prefix <af>

SID | sequence
10  | <a(abc)(ac)d(cf)>
20  | <(ad)c(bc)(ae)>
30  | <(ef)(ab)(df)cb>
40  | <eg(af)cbc>
Completeness of PrefixSpan
SDB (min_sup = 2):

SID | sequence
10  | <a(abc)(ac)d(cf)>
20  | <(ad)c(bc)(ae)>
30  | <(ef)(ab)(df)cb>
40  | <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
L1: <a>: 4, <b>: 4, <c>: 4, <d>: 3, <e>: 3, <f>: 3

Having prefix <a>: the <a>-projected database is
<(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>

Scanning the <a>-projected database once:
a: 2, b: 4, c: 4, d: 2, e: 1, f: 2
(_b): 2, (_c): 1, (_d): 1, (_e): 1, (_f): 1

Length-2 sequential patterns with prefix <a>:
L2: <aa>: 2, <ab>: 4, <(ab)>: 2, <ac>: 4, <ad>: 2, <af>: 2

Having prefix <aa>: the <aa>-projected db is <(_bc)(ac)d(cf)>, …
Having prefix <af>: the <af>-projected db is <cb>

Scanning the <ab>-projected database once:
a: 2, c: 2, d: 1, f: 1, (_c): 2
L3: <a(bc)>: 2, <aba>: 2, <abc>: 2

Having prefix <b>: the <b>-projected database … (and similarly for prefixes <c>, …, <f>)
The Algorithm of PrefixSpan
• Input: a sequence database S and the minimum support threshold min_sup
• Output: the complete set of sequential patterns
• Method: call PrefixSpan(<>, 0, S)
• Subroutine PrefixSpan(α, l, S|α)
• Parameters:
  – α: a sequential pattern
  – l: the length of α
  – S|α: the α-projected database if α ≠ <>; otherwise, the sequence database S
The Algorithm of PrefixSpan(2)
Method:
1. Scan S|α once, find the set of frequent items b such that
   a) b can be assembled to the last element of α to form a sequential pattern (itemset extension, e.g. <(ab)>), or
   b) <b> can be appended to α to form a sequential pattern (sequence extension, e.g. <ab>).
2. For each frequent item b, append it to α to form a sequential pattern α', and output α'.
3. For each α', construct the α'-projected database S|α', and call PrefixSpan(α', l+1, S|α').
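Put together, the recursion is compact. A runnable sketch for the simplified case where every element is a single item (handling itemset elements adds the '_' bookkeeping shown earlier; all names are ours):

    from collections import Counter

    def prefixspan(db, min_sup):
        # db: list of item sequences; returns {pattern (tuple): support}
        results = {}

        def mine(prefix, projdb):
            counts = Counter()
            for s in projdb:                 # count items once per sequence
                for item in set(s):
                    counts[item] += 1
            for item, sup in counts.items():
                if sup < min_sup:
                    continue
                pat = prefix + (item,)
                results[pat] = sup           # output the pattern
                # projected database: suffixes after the first occurrence
                # of item in each sequence
                newdb = [s[s.index(item) + 1:] for s in projdb if item in s]
                mine(pat, [s for s in newdb if s])

        mine((), db)
        return results

    db = [list("abcacd"), list("abd"), list("bca"), list("abc")]
    print(prefixspan(db, min_sup=3))
    # supports: <a>: 4, <b>: 4, <c>: 3, <a b>: 3, <b c>: 3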
Efficiency of PrefixSpan
• No candidate sequence needs to be generated
• Projected databases keep shrinking
• Major cost of PrefixSpan: constructing projected databases
  – Can be improved by bi-level projections
Optimization in PrefixSpan
• Single-level vs. bi-level projection
  – Bi-level projection with 3-way checking may reduce the number and size of projected databases
• Physical projection vs. pseudo-projection
  – Pseudo-projection may reduce the effort of projection when the projected database fits in main memory
• Parallel projection vs. partition projection
  – Partition projection may avoid the blowup of disk space
Performance on Data Set C10T8S8I8

[Figure: runtime comparison on data set C10T8S8I8]

Performance on Data Set Gazelle

[Figure: runtime comparison on data set Gazelle]

Effect of Pseudo-Projection

[Figure: effect of pseudo-projection on runtime]
Timing Constraints (I)
[Diagram: pattern <{A B} {C} {D E}> with the timing constraints annotated: the gap between consecutive elements must be ≤ xg (max-gap) and > ng (min-gap), and the span from the first to the last element must be ≤ ms (maximum span)]

xg = 2, ng = 0, ms = 4

Data sequence                         | Subsequence       | Contain?
< {2,4} {3,5,6} {4,7} {4,5} {8} >     | < {6} {5} >       | Yes
< {1} {2} {3} {4} {5} >               | < {1} {4} >       | No
< {1} {2,3} {3,4} {4,5} >             | < {2} {3} {5} >   | Yes
< {1,2} {3} {2,3} {3,4} {2,4} {4,5} > | < {1,2} {5} >     | No
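Checking containment under these constraints needs backtracking: a greedy match can commit to an occurrence that makes a later gap too large. A sketch (names ours), treating element positions as timestamps as the table above does:

    def contains_timed(data, times, sub, xg, ng, ms):
        # containment under max-gap xg, min-gap ng and maximum span ms;
        # data: list of sets, times: their timestamps, sub: list of sets
        def search(k, prev_i, first_t):
            if k == len(sub):
                return True
            for i in range(prev_i + 1, len(data)):
                if not sub[k] <= data[i]:
                    continue
                if prev_i >= 0:
                    gap = times[i] - times[prev_i]
                    if gap <= ng or gap > xg:      # min-gap / max-gap
                        continue
                start = times[i] if first_t is None else first_t
                if times[i] - start > ms:          # maximum span
                    break
                if search(k + 1, i, start):
                    return True
            return False
        return search(0, -1, None)

    # the four rows of the table above (xg = 2, ng = 0, ms = 4)
    assert contains_timed([{2,4},{3,5,6},{4,7},{4,5},{8}], [1,2,3,4,5],
                          [{6},{5}], 2, 0, 4)
    assert not contains_timed([{1},{2},{3},{4},{5}], [1,2,3,4,5],
                              [{1},{4}], 2, 0, 4)
    assert contains_timed([{1},{2,3},{3,4},{4,5}], [1,2,3,4],
                          [{2},{3},{5}], 2, 0, 4)
    assert not contains_timed([{1,2},{3},{2,3},{3,4},{2,4},{4,5}], [1,2,3,4,5,6],
                              [{1,2},{5}], 2, 0, 4)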
Mining Sequential Patterns with Timing Constraints
• Approach 1:
  – Mine sequential patterns without timing constraints
  – Postprocess the discovered patterns
• Approach 2:
  – Modify GSP to directly prune candidates that violate the timing constraints
  – Question: does the Apriori principle still hold?
Apriori Principle for Sequence Data
Object | Timestamp | Events
A      | 1         | 1, 2, 4
A      | 2         | 2, 3
A      | 3         | 5
B      | 1         | 1, 2
B      | 2         | 2, 3, 4
C      | 1         | 1, 2
C      | 2         | 2, 3, 4
C      | 3         | 2, 4, 5
D      | 1         | 2
D      | 2         | 3, 4
D      | 3         | 4, 5
E      | 1         | 1, 3
E      | 2         | 2, 4, 5

Suppose: xg = 1 (max-gap), ng = 0 (min-gap), ms = 5 (maximum span), minsup = 60%

<{2} {5}> has support = 40%, but its supersequence <{2} {3} {5}> has support = 60%.

The problem exists because of the max-gap constraint: a supersequence can satisfy the gap constraint where its subsequence cannot, so the Apriori principle no longer holds. (There is no such problem if max-gap is infinite.)
Contiguous Subsequences
• s is a contiguous subsequence of w = <e1 e2 … ek> if any of the following conditions hold:
  1. s is obtained from w by deleting an item from either e1 or ek
  2. s is obtained from w by deleting an item from any element ei that contains more than 2 items
  3. s is a contiguous subsequence of s' and s' is a contiguous subsequence of w (recursive definition)
• Examples: s = < {1} {2} >
  – is a contiguous subsequence of < {1} {2 3} >, < {1 2} {2} {3} >, and < {3 4} {1 2} {2 3} {4} >
  – is not a contiguous subsequence of < {1} {3} {2} > and < {2} {1} {3} {2} >
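Rules 1 and 2 give the immediate contiguous (k-1)-subsequences, which is exactly what the modified pruning step on the next slide enumerates. A sketch (frozensets keep the results hashable; names ours):

    def contiguous_subsequences(w):
        # all (k-1)-subsequences obtained by deleting one item from the
        # first or last element, or from an element with more than 2 items
        out = set()
        for i, elem in enumerate(w):
            if i in (0, len(w) - 1) or len(elem) > 2:
                for item in elem:
                    smaller = elem - {item}
                    if smaller:
                        out.add(w[:i] + (smaller,) + w[i+1:])
                    else:                      # the element vanished entirely
                        out.add(w[:i] + w[i+1:])
        return out

    w = (frozenset({1}), frozenset({2, 3}), frozenset({4}))   # < {1} {2 3} {4} >
    for s in contiguous_subsequences(w):
        print([sorted(e) for e in s])
    # two results: < {2 3} {4} > and < {1} {2 3} >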
Modified Candidate Pruning Step
• Without the maxgap constraint:
  – A candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent
• With the maxgap constraint:
  – A candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent
Timing Constraints (II)
[Diagram: pattern <{A B} {C} {D E}> with the timing constraints annotated: the gap between consecutive elements must be ≤ xg (max-gap) and > ng (min-gap); the events of one element must occur within the window size ws; the span from the first to the last element must be ≤ ms (maximum span)]

xg = 2, ng = 0, ws = 1, ms = 5

Data sequence                     | Subsequence     | Contain?
< {2,4} {3,5,6} {4,7} {4,6} {8} > | < {3} {5} >     | No
< {1} {2} {3} {4} {5} >           | < {1,2} {3} >   | Yes
< {1,2} {2,3} {3,4} {4,5} >       | < {1,2} {3,4} > | Yes
Modified Support Counting Step
• Given a candidate pattern <{a, c}>, any data sequence that contains
  <… {a c} …>,
  <… {a} … {c} …> (where time({c}) – time({a}) ≤ ws), or
  <… {c} … {a} …> (where time({a}) – time({c}) ≤ ws)
  will contribute to the support count of the candidate pattern
Other Formulation
• In some domains, we may have only one very long time series
  – Examples:
    – monitoring network traffic events for attacks
    – monitoring telecommunication alarm signals
• Goal is to find frequent sequences of events in the time series
  – This problem is also known as frequent episode mining

[Figure: a single long stream of events E1–E5 with the occurrences of the pattern <E1> <E3> highlighted]
General Support Counting Schemes
[Figure: one object's timeline with occurrences of p and q at timestamps 1 through 7; the counted pattern is <(p) (q)>]

Assume: xg = 2 (max-gap), ng = 0 (min-gap), ws = 0 (window size), ms = 2 (maximum span)

Method  | Support Count
COBJ    | 1
CWIN    | 6
CMINWIN | 4
CDIST_O | 8
CDIST   | 5
Frequent Subgraph Mining
• Extend association rule mining to finding frequent subgraphs
• Useful for Web mining, computational chemistry, bioinformatics, spatial data sets, etc.

[Figure: a Web graph whose nodes are pages (Homepage, Research, Artificial Intelligence, Databases, Data Mining) and whose edges are the links between them]
Graph Definitions
[Figure: (a) a labeled graph with vertex labels a, b, c and edge labels p, q, r, s, t; (b) a subgraph of (a); (c) an induced subgraph of (a)]
Representing Transactions as Graphs
• Each transaction is a clique of items

Transaction Id | Items
1              | {A,B,C,D}
2              | {A,B,E}
3              | {B,C}
4              | {A,B,D,E}
5              | {B,C,D}

[Figure: the clique graph for TID = 1: vertices A, B, C, D with an edge between every pair]
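Building the clique for a transaction is a one-liner with itertools; a short sketch (names ours):

    from itertools import combinations

    def transaction_to_clique(items):
        # one node per item, one edge between every pair of items
        nodes = sorted(items)
        return nodes, list(combinations(nodes, 2))

    nodes, edges = transaction_to_clique({'A', 'B', 'C', 'D'})   # TID = 1 above
    print(edges)
    # [('A','B'), ('A','C'), ('A','D'), ('B','C'), ('B','D'), ('C','D')]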
Representing Graphs as Transactions
[Figure: three labeled graphs G1, G2, G3 over vertices a, b, c, d, e with edge labels p, q, r]

Each graph becomes a transaction over edge features (vertex label, vertex label, edge label):

   | (a,b,p) | (a,b,q) | (a,b,r) | (b,c,p) | (b,c,q) | (b,c,r) | … | (d,e,r)
G1 |    1    |    0    |    0    |    0    |    0    |    1    | … |    0
G2 |    1    |    0    |    0    |    0    |    0    |    0    | … |    0
G3 |    0    |    0    |    1    |    1    |    0    |    0    | … |    0
Challenges
• A graph may contain nodes with duplicate labels
• Support and confidence
  – How to define them?
• Additional constraints imposed by pattern structure
  – Support and confidence are not the only constraints
  – Assumption: frequent subgraphs must be connected
• Apriori-like approach:
  – Use frequent k-subgraphs to generate frequent (k+1)-subgraphs
    – What is k?
Challenges…
• Support:
  – number of graphs that contain a particular subgraph
• Apriori principle still holds
• Level-wise (Apriori-like) approach:
  – Vertex growing: k is the number of vertices
  – Edge growing: k is the number of edges
Vertex Growing
[Figure: G1 and G2 share a core of three a-labeled vertices (edges p, p, r); G1 additionally has an e-vertex attached by a q-edge, G2 a d-vertex attached by an r-edge; G3 = join(G1, G2) contains the core plus both extra vertices.]

The join in terms of adjacency matrices:

M_G1 =
  0 p p q
  p 0 r 0
  p r 0 0
  q 0 0 0

M_G2 =
  0 p p 0
  p 0 r 0
  p r 0 r
  0 0 r 0

M_G3 = join(M_G1, M_G2) =
  0 p p 0 q
  p 0 r 0 0
  p r 0 r 0
  0 0 r 0 0
  q 0 0 0 0
Edge Growing
[Figure: edge growing: G1 and G2, which share a common core, are joined so that G3 = join(G1, G2) extends the core by one edge rather than by one vertex.]
Apriori-like Algorithm
• Find frequent 1-subgraphs
• Repeat
  – Candidate generation
    – Use frequent (k-1)-subgraphs to generate candidate k-subgraphs
  – Candidate pruning
    – Prune candidate subgraphs that contain infrequent (k-1)-subgraphs
  – Support counting
    – Count the support of each remaining candidate
  – Candidate elimination
    – Eliminate candidate k-subgraphs that are infrequent
• In practice, it is not that easy; there are many other issues
Example: Dataset
[Figure: the four data graphs G1–G4 over vertices a, b, c, d, e with edge labels p, q, r]

   | (a,b,p) | (a,b,q) | (a,b,r) | (b,c,p) | (b,c,q) | (b,c,r) | … | (d,e,r)
G1 |    1    |    0    |    0    |    0    |    0    |    1    | … |    0
G2 |    1    |    0    |    0    |    0    |    0    |    0    | … |    0
G3 |    0    |    0    |    1    |    1    |    0    |    0    | … |    0
G4 |    0    |    0    |    0    |    0    |    0    |    0    | … |    0
Example
Minimum support count = 2

[Figure: the frequent 1-subgraphs (k=1, single vertices a, b, c, d, e), the frequent 2-subgraphs (k=2, single labeled edges), and the candidate 3-subgraphs (k=3) generated from them; one candidate is pruned.]
Candidate Generation
• In Apriori:
  – Merging two frequent k-itemsets will produce a candidate (k+1)-itemset
• In frequent subgraph mining (vertex/edge growing):
  – Merging two frequent k-subgraphs may produce more than one candidate (k+1)-subgraph
Multiplicity of Candidates (Vertex Growing)
[Figure: the same vertex-growing join as before, but the edge between the two new vertices (d and e) is undetermined, marked "?".]

M_G3 = join(M_G1, M_G2) =
  0 p p 0 q
  p 0 r 0 0
  p r 0 r 0
  0 0 r 0 ?
  q 0 0 ? 0

The "?" entry is not determined by G1 or G2: the two new vertices may or may not be connected, and any edge label is possible, so vertex growing can produce more than one candidate from a single merge.
Multiplicity of Candidates (Edge growing)
• Case 1: identical vertex labels

[Figure: edge growing where the two graphs have identical vertex labels (a, b, c, e); the merge can attach the new edges in different ways, producing more than one candidate.]
Multiplicity of Candidates (Edge growing)
• Case 2: core contains identical labels

[Figure: the shared core consists of identically labeled (a) vertices, so the two graphs can be overlaid in several ways, each producing a different candidate.]

Core: the (k-1)-subgraph that is common to the two joined graphs
Multiplicity of Candidates (Edge growing)
• Case 3: core multiplicity

[Figure: the (k-1)-subgraph core itself occurs multiple times within each graph, so the join can align the cores in several ways, again producing multiple candidates.]
Adjacency Matrix Representation
     A(1) A(2) A(3) A(4) B(5) B(6) B(7) B(8)
A(1)  1    1    1    0    1    0    0    0
A(2)  1    1    0    1    0    1    0    0
A(3)  1    0    1    1    0    0    1    0
A(4)  0    1    1    1    0    0    0    1
B(5)  1    0    0    0    1    1    1    0
B(6)  0    1    0    0    1    1    0    1
B(7)  0    0    1    0    1    0    1    1
B(8)  0    0    0    1    0    1    1    1

     A(1) A(2) A(3) A(4) B(5) B(6) B(7) B(8)
A(1)  1    1    0    1    0    1    0    0
A(2)  1    1    1    0    0    0    1    0
A(3)  0    1    1    1    1    0    0    0
A(4)  1    0    1    1    0    0    0    1
B(5)  0    0    1    0    1    0    1    1
B(6)  1    0    0    0    0    1    1    1
B(7)  0    1    0    0    1    1    1    0
B(8)  0    0    0    1    1    1    0    1

[Figure: the two drawings of the same graph whose different vertex numberings give these two matrices]

• The same graph can be represented in many ways
Graph Isomorphism
• Two graphs are isomorphic if they are topologically equivalent (identical up to a relabeling of the vertices)

[Figure: two drawings of the same graph with vertex labels A and B]
Graph Isomorphism
• A test for graph isomorphism is needed:
  – During candidate generation, to determine whether a candidate has already been generated
  – During candidate pruning, to check whether its (k-1)-subgraphs are frequent
  – During candidate counting, to check whether a candidate is contained within another graph
Graph Isomorphism
• Use canonical labeling to handle isomorphism
  – Map each graph into an ordered string representation (known as its code) such that two isomorphic graphs are mapped to the same canonical encoding
  – Example: use the lexicographically largest adjacency matrix

One adjacency matrix of the graph:
  0 0 1 0
  0 0 1 1
  1 1 0 1
  0 1 1 0
String: 0010001111010110

The canonical (lexicographically largest) matrix:
  0 1 1 1
  1 0 1 0
  1 1 0 0
  1 0 0 0
Canonical string: 0111101011001000
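For small graphs the canonical code can be computed by brute force over all vertex permutations; real miners prune this search aggressively and extend the code to vertex and edge labels. A sketch on the unlabeled 0/1 matrices above:

    from itertools import permutations

    def canonical_code(adj):
        # lexicographically largest row-major string over all vertex
        # permutations; O(n!) brute force, fine only for tiny graphs
        n = len(adj)
        best = ''
        for perm in permutations(range(n)):
            code = ''.join(str(adj[perm[i]][perm[j]])
                           for i in range(n) for j in range(n))
            if code > best:
                best = code
        return best

    g1 = [[0,0,1,0], [0,0,1,1], [1,1,0,1], [0,1,1,0]]   # string 0010001111010110
    g2 = [[0,1,1,1], [1,0,1,0], [1,1,0,0], [1,0,0,0]]   # a relabeling of g1
    assert canonical_code(g1) == canonical_code(g2) == '0111101011001000'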