Download Association Rule Mining

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Sequential Pattern Mining
CS 685: Special Topics in Data Mining
The UNIVERSITY of KENTUCKY
Sequential Pattern Mining
Why sequential pattern mining?
GSP algorithm
PrefixSpan
2
CS685: Special Topics in Data Mining
Sequence Data
Timeline
10
Sequence Database:
Object
A
A
A
B
B
B
B
C
Timestamp
10
20
23
11
17
21
28
14
Events
2, 3, 5
6, 1
1
4, 5, 6
2
7, 8, 1, 2
1, 6
1, 8, 7
15
20
25
30
35
Object A:
2
3
5
6
1
1
Object B:
4
5
6
2
7
8
1
2
1
6
Object C:
1
7
8
3
CS685: Special Topics in Data Mining
Examples of Sequence Data
Sequence
Database
Sequence
Element
(Transaction)
Event
(Item)
Customer
Purchase history of a given
customer
A set of items bought by
a customer at time t
Books, diary products,
CDs, etc
Web Data
Browsing activity of a
particular Web visitor
Home page, index
page, contact info, etc
Event data
History of events generated
by a given sensor
A collection of files
viewed by a Web visitor
after a single mouse click
Events triggered by a
sensor at time t
Genome
sequences
DNA sequence of a
particular species
An element of the DNA
sequence
Bases A,T,G,C
Element
(Transaction)
E1
E2
E1
E3
E2
E2
E3
E4
Types of alarms
generated by sensors
Event
(Item)
Sequence
4
CS685: Special Topics in Data Mining
Formal Definition of a
Sequence
A sequence is an ordered list of elements
(transactions)
s = < e1 e2 e3 … >
Each element contains a collection of events (items)
ei = {i1, i2, …, ik}
Each element is attributed to a specific time or location
Length of a sequence, |s|, is given by the number of
elements of the sequence
A k-sequence is a sequence that contains k events
(items)
5
CS685: Special Topics in Data Mining
What Is Sequential Pattern
Mining?
Given a set of sequences, find the complete
set of frequent subsequences
A sequence database
SID
sequence
10
<a(abc)(ac)d(cf)>
20
<(ad)c(bc)(ae)>
30
<(ef)(ab)(df)cb>
40
<eg(af)cbc>
A
sequence
: < (ef) (ab) (df) c b >
An element may contain a set of items.
Items within an element are unordered
and we list them alphabetically.
<a(bc)dc> is a subsequence
of <a(abc)(ac)d(cf)>
Given support threshold min_sup =2, <(ab)c> is a
sequential pattern
6
CS685: Special Topics in Data Mining
Sequential Pattern Mining:
Definition
Given:
a database of sequences
a user-specified minimum support threshold,
minsup
Task:
Find all subsequences with support ≥ minsup
7
CS685: Special Topics in Data Mining
Sequential Pattern Mining:
Challenge
Given a sequence: <{a b} {c d e} {f} {g h i}>
Examples of subsequences:
<{a} {c d} {f} {g} >, < {c d e} >, < {b} {g} >, etc.
How many k-subsequences can be extracted from a
given n-sequence?
<{a b} {c d e} {f} {g h i}> n = 9
Answer :
k=4:
Y _ _ Y Y _ _ _Y
<{a}
8
{d e}
{i}>
n 9
      126
 k   4
CS685: Special Topics in Data Mining
Challenges on Sequential
Pattern Mining
A huge number of possible sequential
patterns are hidden in databases
A mining algorithm should
Find the complete set of patterns satisfying the
minimum support (frequency) threshold
Be highly efficient, scalable, involving only a
small number of database scans
Be able to incorporate various kinds of userspecific constraints
9
CS685: Special Topics in Data Mining
A Basic Property of
Sequential Patterns: Apriori
A basic property: Apriori (Agrawal & Sirkant’94)
If a sequence S is not frequent
Then none of the super-sequences of S is frequent
E.g, <hb> is infrequent  so do <hab> and <(ah)b>
Seq. ID
10
20
30
40
50
10
Sequence
<(bd)cb(ac)>
<(bf)(ce)b(fg)>
<(ah)(bf)abf>
<(be)(ce)d>
<a(bd)bcb(ade)>
Given support threshold
min_sup =2
CS685: Special Topics in Data Mining
Basic Algorithm : Breadth
First Search (GSP)
L=1
While (ResultL != NULL)
Candidate Generate
Prune
Test
L=L+1
11
CS685: Special Topics in Data Mining
Finding Length-1 Sequential
Patterns
Initial candidates: all singleton sequences
<a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Scan database once, count support for
candidates
min_sup =2
Seq. ID
10
20
30
40
50
12
Sequence
<(bd)cb(ac)>
<(bf)(ce)b(fg)>
<(ah)(bf)abf>
<(be)(ce)d>
<a(bd)bcb(ade)>
Cand
<a>
<b>
<c>
<d>
<e>
<f>
<g>
<h>
Sup
3
5
4
3
3
2
1
1
CS685: Special Topics in Data Mining
Generating Length-2
Candidates
51 length-2
Candidates
<a>
<a>
<b>
<c>
<d>
<e>
<f>
13
<a>
<b>
<c>
<d>
<e>
<f>
<a>
<aa>
<ab>
<ac>
<ad>
<ae>
<af>
<b>
<ba>
<bb>
<bc>
<bd>
<be>
<bf>
<c>
<ca>
<cb>
<cc>
<cd>
<ce>
<cf>
<d>
<da>
<db>
<dc>
<dd>
<de>
<df>
<e>
<ea>
<eb>
<ec>
<ed>
<ee>
<ef>
<f>
<fa>
<fb>
<fc>
<fd>
<fe>
<ff>
<b>
<c>
<d>
<e>
<f>
<(ab)>
<(ac)>
<(ad)>
<(ae)>
<(af)>
<(bc)>
<(bd)>
<(be)>
<(bf)>
<(cd)>
<(ce)>
<(cf)>
<(de)>
<(df)>
<(ef)>
Without Apriori
property,
8*8+8*7/2=92
candidates
Apriori prunes
44.57% candidates
CS685: Special Topics in Data Mining
The Mining Process
5th scan: 1 cand. 1 length-5 seq.
pat.
Cand. cannot pass
sup. threshold
<(bd)cba>
Cand. not in DB at all
4th scan: 8 cand. 6 length-4 seq. <abba> <(bd)bc> …
pat.
3rd scan: 46 cand. 19 length-3 seq. <abb> <aab> <aba> <baa> <bab> …
pat. 20 cand. not in DB at all
2nd scan: 51 cand. 19 length-2 seq.
<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
pat. 10 cand. not in DB at all
1st scan: 8 cand. 6 length-1 seq.
<a> <b> <c> <d> <e> <f> <g> <h>
pat.
Seq. ID
Sequence
min_sup =2
14
10
<(bd)cb(ac)>
20
<(bf)(ce)b(fg)>
30
<(ah)(bf)abf>
40
<(be)(ce)d>
50
<a(bd)bcb(ade)>
CS685: Special Topics in Data Mining
Candidate Generate-and-test: Drawbacks
A huge set of candidate sequences generated.
Especially 2-item candidate sequence.
Multiple Scans of database needed.
Inefficient for mining long sequential patterns.
A long pattern grow up from short patterns
The number of short patterns is exponential to
the length of mined patterns.
15
May 22, 2017
15
Data Mining: Concepts and Techniques
CS685: Special Topics in Data Mining
Bottlenecks of GSP
A huge set of candidates could be generated
1,000 frequent length-1 sequences generate s huge number of length-2
candidates!
1000 1000 
Multiple scans of database in mining
1000  999
 1,499,500
2
The length of each candidate grows by one at each database scan.
Mining long sequential patterns
Needs an exponential number of short candidates
A length-100 sequential pattern needs 1030
candidate sequences!
16
100  100
30



2

1

10

 
i 1  i 
100
CS685: Special Topics in Data Mining
Pattern Growth (prefixSpan)
Prefix and Suffix (Projection)
<a>, <aa>, <a(ab)> and <a(abc)> are prefixes
of sequence <a(abc)(ac)d(cf)>
Given sequence <a(abc)(ac)d(cf)>
17
Prefix
Suffix (Prefix-Based Projection)
<a>
<aa>
<a(ab)>
<(abc)(ac)d(cf)>
<(_bc)(ac)d(cf)>
<(_c)(ac)d(cf)>
CS685: Special Topics in Data Mining
Mining Sequential Patterns by Prefix
Projections
Step 1: find length-1 sequential patterns
<a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide search space. The complete set of seq. pat. can be
partitioned into 6 subsets:
The ones having prefix <a>;
The ones having prefix <b>;
…
The ones having prefix <f>
18
SID
10
20
30
40
sequence
<a(abc)(ac)d(cf)>
<(ad)c(bc)(ae)>
<(ef)(ab)(df)cb>
<eg(af)cbc>
CS685: Special Topics in Data Mining
Finding Seq. Patterns with Prefix <a>
Only need to consider projections w.r.t. <a>
<a>-projected database: <(abc)(ac)d(cf)>,
<(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 seq. pat. Having prefix <a>: <aa>, <ab>,
<(ab)>, <ac>, <ad>, <af>
Further partition into 6 subsets
Having prefix <aa>;
…
Having prefix <af>
19
SID
10
20
30
40
sequence
<a(abc)(ac)d(cf)>
<(ad)c(bc)(ae)>
<(ef)(ab)(df)cb>
<eg(af)cbc>
CS685: Special Topics in Data Mining
Completeness of PrefixSpan
SDB
SID
sequence
10
<a(abc)(ac)d(cf)>
20
<(ad)c(bc)(ae)>
30
<(ef)(ab)(df)cb>
40
<eg(af)cbc>
Length-1 sequential patterns
<a>, <b>, <c>, <d>, <e>, <f>
Having prefix <c>, …, <f>
Having prefix <a>
Having prefix <b>
<a>-projected database
<(abc)(ac)d(cf)>
<(_d)c(bc)(ae)>
<(_b)(df)cb>
<(_f)cbc>
Having prefix <aa>
<aa>-proj. db
20
<b>-projected database
Length-2 sequential
patterns
<aa>, <ab>, <(ab)>,
<ac>, <ad>, <af>
…
……
Having prefix <af>
…
<af>-proj. db
CS685: Special Topics in Data Mining
Efficiency of PrefixSpan
No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing
projected databases
Can be improved by pseudo-projections
21
CS685: Special Topics in Data Mining
Speed-up by Pseudo-projection
Major cost of PrefixSpan: projection
Postfixes of sequences often appear
repeatedly in recursive projected databases
When (projected) database can be held in main memory,
use pointers to form projections
s=<a(abc)(ac)d(cf)>
<a>
Pointer to the sequence
Offset of the postfix
s|<a>: ( , 2)
<(abc)(ac)d(cf)>
<ab>
s|<ab>: ( , 4)
22
<(_c)(ac)d(cf)>
CS685: Special Topics in Data Mining
Pseudo-Projection vs. Physical Projection
Pseudo-projection avoids physically copying postfixes
Efficient in running time and space when
database can be held in main memory
However, it is not efficient when database cannot fit in
main memory
Disk-based random accessing is very costly
Suggested Approach:
Integration of physical and pseudo-projection
Swapping to pseudo-projection when the data
set fits in memory
23
CS685: Special Topics in Data Mining
Related documents