Stream Sequential Pattern Mining with Precise Error Bounds
Luiz F. Mendes; Bolin Ding; Jiawei Han
ICDM 2008

Outline
• Introduction
• Problem Definition
• SS-BE Method (Stream Sequence miner using Bounded Error)
• SS-MB Method (Stream Sequence miner using Memory Bounds)
• Experimental Results
• Conclusions

Introduction
• A data stream is an unbounded sequence in which new elements are generated continuously.
• Memory usage is restricted, meaning that we cannot store all the stream data in memory.
• Two methods are presented: SS-BE and SS-MB.
• Both break the data stream into fixed-size batches and perform sequential pattern mining on each batch.
• Both store the results in a lexicographic tree structure.
• They use different pruning strategies to restrict memory usage.

Problem Definition
• A data stream of sequences is an arbitrarily large list of sequences.
• A sequence s contains another sequence s' if s' is a subsequence of s.
• count(s): the number of sequences that contain s.
• supp(s): count(s) divided by the total number of sequences seen.
• If supp(s) ≥ σ, where σ is a user-supplied minimum support threshold, then we say that s is a frequent sequence, or a sequential pattern.

Problem Definition (Cont.)
• Example.
• Suppose the length of our data stream is only 3 sequences: S1 = <a, b, c>, S2 = <a, c>, and S3 = <b, c>. Let us assume we are given σ = 0.5, so s is a sequential pattern if supp(s) ≥ 0.5.

  Sequence     count(s)   supp(s)
  <a>          2          2/3 ≈ 0.67 ≥ 0.5
  <b>          2          2/3 ≈ 0.67 ≥ 0.5
  <c>          3          3/3 = 1 ≥ 0.5
  <a, c>       2          2/3 ≈ 0.67 ≥ 0.5
  <b, c>       2          2/3 ≈ 0.67 ≥ 0.5
  <a, b>       1          1/3 ≈ 0.33 < 0.5
  <a, b, c>    1          1/3 ≈ 0.33 < 0.5

Problem Definition (Cont.)
• SS-BE and SS-MB use a lexicographic tree T0 to store the subsequences seen in the data stream.
• Example (Cont.)
• The sequential patterns are: <a>:2, <b>:2, <c>:3, <a, c>:2, <b, c>:2.
• T0: [tree figure not reproduced]

SS-BE Method
• Example.
  Input values                                   Value
  (1) a stream of sequences D = S1, S2, …
  (2) minimum support threshold σ                0.75
  (3) significance threshold ϵ                   0.5
  (4) batch length L                             4
  (5) batch support threshold α                  0.4
  (6) pruning period δ                           2

• Batch B1: <a, b, c>, <a, c>, <a, b> and <b, c>.
• Batch B2: <a, b, c, d>, <c, a, b>, <d, a, b> and <a, e, b>.

SS-BE Method (Cont.)
• Batch B1: <a, b, c>, <a, c>, <a, b> and <b, c>, mined with minimum support α = 0.4.
• The sequential patterns are: <a>:3, <b>:3, <c>:3, <a, b>:2, <a, c>:2, and <b, c>:2.
• After batch 1: each node is inserted with a batchCount of 1. [Tree figure not reproduced.]

SS-BE Method (Cont.)
• Batch B2: <a, b, c, d>, <c, a, b>, <d, a, b> and <a, e, b>, mined with minimum support α = 0.4.
• The sequential patterns are: <a>:4, <b>:4, <c>:2, <d>:2, and <a, b>:4.
• After batch 2: [tree figure not reproduced; each node is annotated with its batchCount, which is 2 for four nodes and 1 for three nodes]

SS-BE Method (Cont.)
• B = 2, where B is the number of batches elapsed since the last pruning before the node was inserted in the tree.
• B' = B − batchCount. For batchCount = 1, B' = 1; for batchCount = 2, B' = 0.
• Pruning condition: a node is removed if count + B' ≤ 4.
• After pruning: a node with count 2 and B' = 1 satisfies 2 + 1 ≤ 4 and is pruned; a node with count 5 and B' = 0 does not satisfy 5 + 0 ≤ 4 and is kept.
• Output all sequences whose count satisfies count ≥ (σ − ϵ)N = (0.75 − 0.5) × 8 = 2.
• The output sequences and counts are: <a>:7, <b>:7, <c>:5, and <a, b>:6. (<c> is a false positive.)

SS-MB Method
• Example.

  Input values                                   Value
  (1) a stream of sequences D = S1, S2, …
  (2) minimum support threshold σ                0.75
  (3) significance threshold ϵ                   0.5
  (4) batch length L                             4
  (5) maximum number of nodes in the tree m      7

• Batch B1: <a, b, c>, <a, c>, <a, b> and <b, c>.
• Batch B2: <a, b, c, d>, <c, a, b>, <d, a, b> and <a, e, b>.

SS-MB Method (Cont.)
• Batch B1: <a, b, c>, <a, c>, <a, b> and <b, c>, mined with minimum support ϵ = 0.5.
• The sequential patterns are: <a>:3, <b>:3, <c>:3, <a, b>:2, <a, c>:2, and <b, c>:2.
• After batch 1: [tree figure not reproduced]

SS-MB Method (Cont.)
• Batch B2: <a, b, c, d>, <c, a, b>, <d, a, b> and <a, e, b>, mined with minimum support ϵ = 0.5.
• The sequential patterns are: <a>:4, <b>:4, <c>:2, <d>:2, and <a, b>:4.
• After batch 2: a newly inserted node is given the count 2 + min = 2. [Tree figure not reproduced.]

SS-MB Method (Cont.)
• The variable min keeps track of the largest count of any node that has been removed from the tree (initially set to 0).
• m = 7.
• Pruning: the sequence <b, c> is removed, and min = 2.
• After batch 2 and pruning: [tree figure not reproduced]
• Output all sequences whose count satisfies count > (σ − ϵ)N = (0.75 − 0.5) × 8 = 2.
• The output sequences and counts are: <a>:7, <b>:7, <c>:5, and <a, b>:6. (<c> is a false positive.)
• If min ≤ (σ − ϵ)N (here 2 ≤ 2), then the algorithm guarantees that there are no false negatives.

Experimental Results
[Figures not reproduced.]

Conclusions
• In this paper we propose two effective methods for mining sequential patterns from data streams: SS-BE and SS-MB.
• The running time of each algorithm scales linearly as the number of sequences grows.
• The maximum memory usage is restricted in both cases through the pruning strategies adopted.
• These properties make both methods effective choices for stream sequential pattern mining.
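
Worked example (sketch)

The running example from the SS-BE and SS-MB slides can be replayed with a short Python sketch. It is not the authors' implementation and rests on several assumptions. The per-batch miner is a brute-force stand-in that enumerates every subsequence, and the tree is a flat dictionary rather than a lexicographic tree. The general SS-BE pruning test count + B'(⌈αL⌉ − 1) ≤ ϵBL is assumed; for the slide values it reduces to count + B' ≤ 4. For SS-MB, the node limit m is assumed to include the root, and ties between equal counts are broken so that <b, c> is removed as on the slide. The function names mine_batch, ss_be and ss_mb are hypothetical, and items are assumed to be single characters.

```python
from itertools import combinations
from math import ceil

def mine_batch(batch, min_supp):
    """Brute-force stand-in for the per-batch miner: {pattern: count}."""
    counts = {}
    for seq in batch:
        # Distinct non-empty subsequences of seq; using a set means each
        # sequence contributes at most 1 to count(s).
        subs = {"".join(seq[i] for i in idx)
                for r in range(1, len(seq) + 1)
                for idx in combinations(range(len(seq)), r)}
        for s in subs:
            counts[s] = counts.get(s, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_supp * len(batch)}

def ss_be(batches, sigma, eps, L, alpha, delta):
    tree = {}        # pattern -> [count, batchCount, last pruning before insertion]
    last_prune = 0
    for k, batch in enumerate(batches, start=1):
        for s, c in mine_batch(batch, alpha).items():
            node = tree.setdefault(s, [0, 0, last_prune])
            node[0] += c   # accumulate the batch count
            node[1] += 1   # one more batch in which s was frequent
        if k % delta == 0:  # prune every delta batches
            for s, (count, batch_count, base) in list(tree.items()):
                B = k - base
                # Assumed general form of the slide's "count + B' <= 4".
                slack = (B - batch_count) * (ceil(alpha * L) - 1)
                if count + slack <= eps * B * L:
                    del tree[s]
            last_prune = k
    n = L * len(batches)
    return {s: node[0] for s, node in tree.items() if node[0] >= (sigma - eps) * n}

def ss_mb(batches, sigma, eps, L, m):
    tree, min_removed = {}, 0   # min_removed plays the role of "min"
    for batch in batches:
        for s, c in mine_batch(batch, eps).items():
            # New nodes start at batch count + min; existing ones accumulate.
            tree[s] = tree[s] + c if s in tree else c + min_removed
        while len(tree) + 1 > m:  # assumption: m counts the root as a node
            # Lowest count first; the tie-break (longest, then lexicographically
            # last) is an assumption chosen to match the slide.
            victim = max(tree, key=lambda s: (-tree[s], len(s), s))
            min_removed = max(min_removed, tree.pop(victim))
    n = L * len(batches)
    output = {s: c for s, c in tree.items() if c > (sigma - eps) * n}
    return output, min_removed <= (sigma - eps) * n

B1 = ["abc", "ac", "ab", "bc"]
B2 = ["abcd", "cab", "dab", "aeb"]
be = ss_be([B1, B2], sigma=0.75, eps=0.5, L=4, alpha=0.4, delta=2)
mb, no_false_negatives = ss_mb([B1, B2], sigma=0.75, eps=0.5, L=4, m=7)
print(sorted(be.items()))  # [('a', 7), ('ab', 6), ('b', 7), ('c', 5)]
print(sorted(mb.items()), no_false_negatives)
```

Under these assumptions both methods return <a>:7, <b>:7, <c>:5 and <a, b>:6 on the two example batches, which agrees with the slides. Calling mine_batch(["abc", "ac", "bc"], 0.5) likewise reproduces the five sequential patterns of the Problem Definition example. The brute-force enumeration is exponential in the sequence length, so it is only suitable for toy inputs like these.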