Stream Sequential Pattern Mining
with Precise Error Bounds
Luiz F. Mendes, Bolin Ding, Jiawei Han
ICDM 2008
Outline
• Introduction
• Problem Definition
• SS-BE Method (Stream Sequence miner using Bounded Error)
• SS-MB Method (Stream Sequence miner using Memory Bounds)
• Experimental Results
• Conclusions
Introduction
• A data stream is an unbounded sequence in which new
elements are generated continuously.
• Memory usage is restricted, meaning that we cannot store all
the stream data in memory.
• Two methods are proposed : SS-BE and SS-MB.
• Both break the data stream into fixed-size batches and perform
sequential pattern mining on each batch.
• Both store the patterns found in a lexicographic tree structure.
• The two methods use different pruning strategies to restrict
the memory usage.
Problem Definition
• A data stream of sequences is an arbitrarily large list of
sequences.
• A sequence s contains another sequence s’ if s’ is a
subsequence of s.
• count(s) : the number of sequences that contain s.
• supp(s) : count(s) divided by the total number of sequences
seen.
• If supp(s) ≥ σ, where σ is a user-supplied minimum support
threshold, then we say that s is a frequent sequence, or
a sequential pattern.
Cont.
• Example.
• Suppose the length of our data stream is only 3 sequences :
S1 = <a, b, c>, S2 = <a, c>, and S3 = <b, c>.
Let us assume that σ = 0.5, so s is frequent when supp(s) ≥ 0.5.
• sequence s     count(s)   supp(s)
  <a>            2          2/3 ≈ 0.67 ≥ 0.5
  <b>            2          2/3 ≈ 0.67 ≥ 0.5
  <c>            3          3/3 = 1 ≥ 0.5
  <a, c>         2          2/3 ≈ 0.67 ≥ 0.5
  <b, c>         2          2/3 ≈ 0.67 ≥ 0.5
  <a, b>         1          1/3 ≈ 0.33 < 0.5
  <a, b, c>      1          1/3 ≈ 0.33 < 0.5
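The definitions of containment, count(s) and supp(s) can be checked on this three-sequence stream with a short Python sketch. The function names `contains`, `count` and `supp` are illustrative choices, and sequences are modeled as tuples of single items, as in the example.

```python
def contains(s, sub):
    """True if `sub` is a subsequence of `s` (same order, gaps allowed)."""
    it = iter(s)
    # `item in it` advances the iterator, so order is respected.
    return all(item in it for item in sub)

def count(stream, sub):
    """Number of sequences in the stream that contain `sub`."""
    return sum(contains(s, sub) for s in stream)

def supp(stream, sub):
    """count(sub) divided by the number of sequences seen."""
    return count(stream, sub) / len(stream)

stream = [("a", "b", "c"), ("a", "c"), ("b", "c")]
sigma = 0.5
candidates = [("a",), ("b",), ("c",), ("a", "c"), ("b", "c"),
              ("a", "b"), ("a", "b", "c")]

# Keep only the candidates whose support reaches the threshold.
frequent = {c: count(stream, c) for c in candidates if supp(stream, c) >= sigma}
print(frequent)
```

Only <a, b> and <a, b, c> fall below the threshold, matching the table.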
Cont.
• SS-BE and SS-MB use a lexicographic tree T0 to store the
subsequences seen in the data stream.
• Example. (Cont.)
• Sequential patterns are : <a>:2, <b>:2, <c>:3, <a, c>:2, <b, c>:2.
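• T0 (each path from the root spells one sequential pattern) :
  root
  ├─ a:2
  │  └─ c:2
  ├─ b:2
  │  └─ c:2
  └─ c:3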
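As a rough illustration of how such a tree can hold the patterns of this example, the sketch below stores each pattern as a path from the root. The nested-dict node layout and the helper names `build_tree` and `paths` are assumptions made for this sketch, not the data structure used by the authors.

```python
def build_tree(patterns):
    """Build a prefix tree: one node per item, one path per pattern."""
    root = {"count": None, "children": {}}
    for seq in sorted(patterns):
        node = root
        for item in seq:
            node = node["children"].setdefault(
                item, {"count": 0, "children": {}})
        node["count"] = patterns[seq]
    return root

def paths(node, prefix=()):
    """List (sequence, count) pairs in lexicographic preorder."""
    out = []
    for item in sorted(node["children"]):
        child = node["children"][item]
        seq = prefix + (item,)
        out.append((seq, child["count"]))
        out.extend(paths(child, seq))
    return out

patterns = {("a",): 2, ("b",): 2, ("c",): 3, ("a", "c"): 2, ("b", "c"): 2}
t0 = build_tree(patterns)
for seq, cnt in paths(t0):
    print("<" + ", ".join(seq) + ">:" + str(cnt))
```

Patterns that share a prefix, such as <a> and <a, c>, share the nodes of that prefix.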
SS-BE Method
• Example.
• Input values :
(1) a stream of sequences D = S1, S2,…
(2) minimum support threshold σ = 0.75
(3) significance threshold ϵ = 0.5
(4) batch length L = 4
(5) batch support threshold α = 0.4
(6) pruning period δ = 2
• Batch B1 : <a, b, c>, <a, c>, <a, b> and <b, c>.
• Batch B2 : <a, b, c, d>, <c, a, b>, <d, a, b> and <a, e, b>.
SS-BE Method (Cont.)
• Batch B1 : <a, b, c>, <a, c>, <a, b> and <b, c>,
with minimum support α = 0.4.
• Sequential patterns are : <a>:3, <b>:3, <c>:3, <a, b>:2, <a, c>:2,
and <b, c>:2.
• After batch 1 : every node in the tree has a batchCount of 1.
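The per-batch mining step can be sketched as follows. The slides do not say which miner is run on each batch, so this brute-force enumeration is only a stand-in, practical for short sequences like these. The name `mine_batch` and the use of ⌈αL⌉ as the minimum count are assumptions made for illustration.

```python
from itertools import combinations
from math import ceil

def mine_batch(batch, min_count):
    """Return {subsequence: count} for subsequences in >= min_count sequences."""
    counts = {}
    for seq in batch:
        # Collect each distinct subsequence of this sequence once.
        subs = set()
        for r in range(1, len(seq) + 1):
            subs.update(combinations(seq, r))
        for s in subs:
            counts[s] = counts.get(s, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_count}

alpha, L = 0.4, 4
b1 = [("a", "b", "c"), ("a", "c"), ("a", "b"), ("b", "c")]
b2 = [("a", "b", "c", "d"), ("c", "a", "b"), ("d", "a", "b"), ("a", "e", "b")]

min_count = ceil(alpha * L)  # support >= 0.4 over 4 sequences means count >= 2
patterns_b1 = mine_batch(b1, min_count)
patterns_b2 = mine_batch(b2, min_count)
print(patterns_b1)
print(patterns_b2)
```

`itertools.combinations` keeps the items in their original order, so every tuple it yields is a subsequence of the input.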
SS-BE Method (Cont.)
• Batch B2 : <a, b, c, d>, <c, a, b>, <d, a, b> and <a, e, b>,
with minimum support α = 0.4.
• Sequential patterns are : <a>:4, <b>:4, <c>:2, <d>:2, and <a, b>:4.
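• After batch 2 (count, with batchCount in parentheses) :
<a>:7 (2), <b>:7 (2), <c>:5 (2), <a, b>:6 (2),
<d>:2 (1), <a, c>:2 (1), and <b, c>:2 (1).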
SS-BE Method (Cont.)
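• For each node, B is the number of batches elapsed since the last
pruning before the node was inserted in the tree. Here B = 2.
• B’ = B − batchCount : nodes with batchCount = 1 have B’ = 1, and
nodes with batchCount = 2 have B’ = 0.
• Pruning : a node is removed if count + B’(⌈αL⌉ − 1) ≤ ϵBL,
which here is count + B’ ≤ 4.
• <d>, <a, c> and <b, c> : 2 + 1 ≤ 4 holds => pruned.
• <c> : 5 + 0 ≤ 4 does not hold => kept, as are <a>, <b> and <a, b>.
• After pruning : <a>:7, <b>:7, <c>:5, and <a, b>:6 remain.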
• Output all sequences in the tree whose count satisfies
count ≥ ( σ − ϵ )N = (0.75 − 0.5) × 8 = 2, where N is the number
of sequences seen.
• The output sequences and counts are:
<a>:7, <b>:7, <c>:5, and <a, b>:6.
(<c> is a false positive : its count of 5 is below σN = 6.)
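The whole SS-BE example can be replayed with the sketch below. It is a simplified reading of the slides, not the authors' implementation: the tree is a flat dict keyed by pattern, the batch miner is a brute-force stand-in, the `tid` field (batch in which a node was inserted) is a hypothetical helper used to derive B, and removing a pruned node's descendants along with it is an assumption.

```python
from itertools import combinations
from math import ceil

def mine_batch(batch, min_count):
    """Brute-force stand-in for the per-batch sequential pattern miner."""
    counts = {}
    for seq in batch:
        subs = set()
        for r in range(1, len(seq) + 1):
            subs.update(combinations(seq, r))
        for s in subs:
            counts[s] = counts.get(s, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_count}

def ss_be(batches, sigma, eps, L, alpha, delta):
    tree = {}    # pattern -> [count, tid, batchCount]
    n_seen = 0
    for k, batch in enumerate(batches, start=1):
        n_seen += len(batch)
        for s, c in mine_batch(batch, ceil(alpha * L)).items():
            if s in tree:
                tree[s][0] += c
                tree[s][2] += 1
            else:
                tree[s] = [c, k, 1]
        if k % delta == 0:    # pruning period reached
            doomed = []
            for s, (cnt, tid, batch_count) in tree.items():
                last_prune = delta * ((tid - 1) // delta)  # assumed reading of B
                B = k - last_prune
                b_prime = B - batch_count
                if cnt + b_prime * (ceil(alpha * L) - 1) <= eps * B * L:
                    doomed.append(s)
            for s in list(tree):
                # Assumption: descendants of a pruned node go with it.
                if any(s[:len(d)] == d for d in doomed):
                    del tree[s]
    threshold = (sigma - eps) * n_seen
    return {s: v[0] for s, v in tree.items() if v[0] >= threshold}

b1 = [("a", "b", "c"), ("a", "c"), ("a", "b"), ("b", "c")]
b2 = [("a", "b", "c", "d"), ("c", "a", "b"), ("d", "a", "b"), ("a", "e", "b")]
result = ss_be([b1, b2], sigma=0.75, eps=0.5, L=4, alpha=0.4, delta=2)
print(result)
```

With the input values of the example, the pruning test reduces to count + B’ ≤ 4, and the four output patterns match the slide.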
SS-MB Method
• Example.
• Input values :
(1) a stream of sequences D = S1, S2,…
(2) minimum support threshold σ = 0.75
(3) significance threshold ϵ = 0.5
(4) batch length L = 4
(5) maximum number of nodes in the tree m = 7
• Batch B1 : <a, b, c>, <a, c>, <a, b> and <b, c>.
• Batch B2 : <a, b, c, d>, <c, a, b>, <d, a, b> and <a, e, b>.
SS-MB Method (Cont.)
• Batch B1 : <a, b, c>, <a, c>, <a, b> and <b, c>,
with minimum support ϵ = 0.5.
• Sequential patterns are : <a>:3, <b>:3, <c>:3, <a, b>:2, <a, c>:2,
and <b, c>:2.
• After batch 1 :
SS-MB Method (Cont.)
• Batch B2 : <a, b, c, d>, <c, a, b>, <d, a, b> and <a, e, b>,
with minimum support ϵ = 0.5.
• Sequential patterns are : <a>:4, <b>:4, <c>:2, <d>:2, and <a, b>:4.
• After batch 2 : the new node <d> is inserted with
count 2 + min = 2 (min is still 0 at this point).
SS-MB Method (Cont.)
• A variable min keeps track of the largest count of any node
that has been removed from the tree (initially set to 0).
• m = 7
• Pruning after batch 2 : the sequence <b, c> is removed, and min = 2.
• Output all sequences in the tree whose count satisfies
count > ( σ − ϵ )N = (0.75 − 0.5) × 8 = 2, where N is the number
of sequences seen.
• The output sequences and counts are:
<a>:7, <b>:7, <c>:5, and <a, b>:6.
• If min ≤ ( σ − ϵ )N, the algorithm guarantees that there are no
false negatives. Here min = 2 ≤ 2, so the guarantee holds.
(<c> is a false positive.)
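The SS-MB example can be replayed in the same spirit. This is a sketch under several assumptions that the slides do not spell out: the tree is a flat dict keyed by pattern, the root is counted toward the node limit m, only leaf patterns are candidates for removal, and ties between minimum-count leaves are broken in a way chosen only so that this run removes <b, c> as on the slide. The names `ss_mb` and `min_removed` are illustrative.

```python
from itertools import combinations
from math import ceil

def mine_batch(batch, min_count):
    """Brute-force stand-in for the per-batch sequential pattern miner."""
    counts = {}
    for seq in batch:
        subs = set()
        for r in range(1, len(seq) + 1):
            subs.update(combinations(seq, r))
        for s in subs:
            counts[s] = counts.get(s, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_count}

def ss_mb(batches, sigma, eps, L, m):
    tree = {}          # pattern -> count
    min_removed = 0    # the variable "min" of the slides
    n_seen = 0
    for batch in batches:
        n_seen += len(batch)
        for s, c in mine_batch(batch, ceil(eps * L)).items():
            # New nodes start at batch count + min; existing nodes add up.
            tree[s] = tree[s] + c if s in tree else c + min_removed
        while len(tree) + 1 > m:    # assumption: +1 counts the root
            leaves = [s for s in tree
                      if not any(t != s and t[:len(s)] == s for t in tree)]
            lowest = min(tree[s] for s in leaves)
            # Assumed tie-break: longest pattern, then lexicographically last.
            victim = max((s for s in leaves if tree[s] == lowest),
                         key=lambda s: (len(s), s))
            min_removed = max(min_removed, tree.pop(victim))
    threshold = (sigma - eps) * n_seen
    output = {s: c for s, c in tree.items() if c > threshold}
    return output, min_removed, min_removed <= threshold

b1 = [("a", "b", "c"), ("a", "c"), ("a", "b"), ("b", "c")]
b2 = [("a", "b", "c", "d"), ("c", "a", "b"), ("d", "a", "b"), ("a", "e", "b")]
output, min_removed, no_false_negatives = ss_mb(
    [b1, b2], sigma=0.75, eps=0.5, L=4, m=7)
print(output, min_removed, no_false_negatives)
```

The output does not depend on the tie-break here, because all three minimum-count leaves have count 2 and the output test requires a count strictly above 2.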
Experimental Results
Conclusions
• In this paper we propose two effective methods for mining
sequential patterns from data streams : SS-BE and SS-MB.
• The running time of each algorithm scales linearly as the
number of sequences grows.
• The maximum memory usage is restricted in both cases
through the pruning strategies adopted.
• These properties make both methods effective choices for
stream sequential pattern mining.