
Stream Sequential Pattern Mining with Precise Error Bounds
Luiz F. Mendes, Bolin Ding, Jiawei Han
ICDM 2008

Outline
• Introduction
• Problem Definition
• SS-BE Method (Stream Sequence miner using Bounded Error)
• SS-MB Method (Stream Sequence miner using Memory Bounds)
• Experimental Results
• Conclusions

Introduction
• A data stream is an unbounded sequence in which new elements are generated continuously.
• Memory usage is restricted, meaning that we cannot store all the stream data in memory.
• Two methods: SS-BE and SS-MB.
• Both break the data stream into fixed-size batches and perform sequential pattern mining on each batch.
• Both store the results in a lexicographic tree structure.
• They use different pruning strategies to restrict memory usage.

Problem Definition
• A data stream of sequences is an arbitrarily large list of sequences.
• A sequence s contains another sequence s′ if s′ is a subsequence of s.
• count(s): the number of sequences that contain s.
• supp(s): count(s) divided by the total number of sequences seen.
• If supp(s) ≥ σ, where σ is a user-supplied minimum support threshold, then s is a frequent sequence, or a sequential pattern.

Problem Definition (Cont.)
• Example.
• Suppose the data stream consists of only 3 sequences: S1 = <a, b, c>, S2 = <a, c>, and S3 = <b, c>, and that σ = 0.5. A sequence s is a sequential pattern if supp(s) ≥ σ.

sequence s    count(s)    supp(s)
<a>           2           2/3 ≈ 0.67 ≥ 0.5
<b>           2           2/3 ≈ 0.67 ≥ 0.5
<c>           3           3/3 = 1 ≥ 0.5
<a, c>        2           2/3 ≈ 0.67 ≥ 0.5
<b, c>        2           2/3 ≈ 0.67 ≥ 0.5
<a, b>        1           1/3 ≈ 0.33 < 0.5
<a, b, c>     1           1/3 ≈ 0.33 < 0.5

Problem Definition (Cont.)
• SS-BE and SS-MB use a lexicographic tree T0 to store the subsequences seen in the data stream.
• Example (Cont.)
• Sequential patterns are: <a>:2, <b>:2, <c>:3, <a, c>:2, <b, c>:2.
• T0: (tree figure not preserved)

SS-BE Method
• Example.
Input values:
(1) a stream of sequences D = S1, S2, …
(2) minimum support threshold σ = 0.75
(3) significance threshold ϵ = 0.5
(4) batch length L = 4
(5) batch support threshold α = 0.4
(6) pruning period δ = 2
• Batch B1: <a, b, c>, <a, c>, <a, b>, and <b, c>.
• Batch B2: <a, b, c, d>, <c, a, b>, <d, a, b>, and <a, e, b>.

SS-BE Method (Cont.)
• Batch B1: <a, b, c>, <a, c>, <a, b>, and <b, c>, with minimum support α = 0.4.
• Sequential patterns are: <a>:3, <b>:3, <c>:3, <a, b>:2, <a, c>:2, and <b, c>:2.
• After batch 1: each node has a batchCount of 1 (tree figure not preserved).

SS-BE Method (Cont.)
• Batch B2: <a, b, c, d>, <c, a, b>, <d, a, b>, and <a, e, b>, with minimum support α = 0.4.
• Sequential patterns are: <a>:4, <b>:4, <c>:2, <d>:2, and <a, b>:4.
• After batch 2: each node has a batchCount of 1 or 2 (tree figure not preserved).

SS-BE Method (Cont.)
• B = 2 (B: the number of batches elapsed since the last pruning before the node was inserted in the tree).
• B′ = B − batchCount; with batchCount = 1 or 2, B′ = 1 or 0, respectively.
• Pruning condition: a node is pruned if count + B′ ≤ 4.
• Pruning examples: 2 + 1 ≤ 4 holds, so the node is pruned; 5 + 0 ≤ 4 does not hold, so the node is kept.
• Output all sequences with count ≥ (σ − ϵ)N = (0.75 − 0.5) × 8 = 2.
• The output sequences and counts are: <a>:7, <b>:7, <c>:5, and <a, b>:6 (false positive: <c>, since 5/8 = 0.625 < σ).

SS-MB Method
• Example.
Input values:
(1) a stream of sequences D = S1, S2, …
(2) minimum support threshold σ = 0.75
(3) significance threshold ϵ = 0.5
(4) batch length L = 4
(5) maximum number of nodes in the tree m = 7
• Batch B1: <a, b, c>, <a, c>, <a, b>, and <b, c>.
• Batch B2: <a, b, c, d>, <c, a, b>, <d, a, b>, and <a, e, b>.

SS-MB Method (Cont.)
• Batch B1: <a, b, c>, <a, c>, <a, b>, and <b, c>, with minimum support ϵ = 0.5.
• Sequential patterns are: <a>:3, <b>:3, <c>:3, <a, b>:2, <a, c>:2, and <b, c>:2.
• After batch 1: (tree figure not preserved)

SS-MB Method (Cont.)
• Batch B2: <a, b, c, d>, <c, a, b>, <d, a, b>, and <a, e, b>, with minimum support ϵ = 0.5.
• Sequential patterns are: <a>:4, <b>:4, <c>:2, <d>:2, and <a, b>:4.
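The per-batch pattern lists for B1 and B2 can be reproduced with a small brute-force sketch. This is only an illustration: it enumerates every subsequence of every sequence, which stands in for whatever sequential pattern miner is actually run on each batch, and the name `mine_batch` is hypothetical. It assumes each item is a single symbol and that a sequence contributes at most 1 to a pattern's count.

```python
from collections import Counter
from itertools import combinations


def mine_batch(batch, min_support):
    """Return {pattern: count} for subsequences with support >= min_support.

    Brute-force stand-in for a real sequential pattern miner (assumption).
    """
    counts = Counter()
    for seq in batch:
        # combinations() keeps the original order, so each tuple is a
        # subsequence; the set ensures one vote per sequence per pattern.
        subs = {c for k in range(1, len(seq) + 1) for c in combinations(seq, k)}
        counts.update(subs)
    n = len(batch)
    return {p: c for p, c in counts.items() if c / n >= min_support}


B1 = [("a", "b", "c"), ("a", "c"), ("a", "b"), ("b", "c")]
B2 = [("a", "b", "c", "d"), ("c", "a", "b"), ("d", "a", "b"), ("a", "e", "b")]

patterns_b1 = mine_batch(B1, 0.4)
patterns_b2 = mine_batch(B2, 0.4)
```

With L = 4, the thresholds α = 0.4 and ϵ = 0.5 both require a count of at least 2, which is why the SS-BE and SS-MB walkthroughs list the same patterns per batch. Applied to the three-sequence stream from the Problem Definition example with σ = 0.5, the same function returns the five sequential patterns listed there.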
• After batch 2: the new node <d> is inserted with count 2 + min = 2 + 0 = 2 (tree figure not preserved).

SS-MB Method (Cont.)
• A variable min keeps track of the largest count of any node that has been removed from the tree (initially set to 0).
• m = 7.
• Pruning: the sequence <b, c> is removed, and min = 2.
• After batch 2: output all sequences with count > (σ − ϵ)N = (0.75 − 0.5) × 8 = 2.
• The output sequences and counts are: <a>:7, <b>:7, <c>:5, and <a, b>:6 (false positive: <c>).
• If min = 2 ≤ (σ − ϵ)N = 2, then the algorithm guarantees that there are no false negatives.

Experimental Results
(figures not preserved)

Conclusions
• In this paper we propose two effective methods for mining sequential patterns from data streams: SS-BE and SS-MB.
• The running time of each algorithm scales linearly as the number of sequences grows.
• The maximum memory usage is restricted in both cases through the pruning strategies adopted.
• These properties make both methods effective choices for stream sequential pattern mining.
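The SS-MB walkthrough can be traced with a minimal sketch. Several things here are assumptions rather than details from the slides: a flat dictionary stands in for the lexicographic tree, the function names `ss_mb_update` and `ss_mb_output` are hypothetical, the root is assumed to count toward m (so the capacity for patterns is m − 1 = 6, which reproduces the single removal in the example), and ties among lowest-count nodes are broken arbitrarily, so the removed node may differ from the <b, c> shown in the slides.

```python
def ss_mb_update(tree, batch_patterns, capacity, min_removed):
    """Merge one batch's patterns into `tree`, then prune to `capacity`.

    `tree` is a dict {pattern: count} standing in for the lexicographic
    tree (assumption). Returns the updated `min_removed`.
    """
    for pattern, count in batch_patterns.items():
        if pattern in tree:
            tree[pattern] += count
        else:
            # A new node is credited with `min`, the largest count of any
            # node removed so far, since it may have been pruned earlier.
            tree[pattern] = count + min_removed
    # Remove lowest-count nodes until the size limit holds (ties arbitrary).
    while len(tree) > capacity:
        victim = min(tree, key=tree.get)
        min_removed = max(min_removed, tree.pop(victim))
    return min_removed


def ss_mb_output(tree, sigma, eps, n):
    """Return the patterns whose count exceeds (sigma - eps) * n."""
    threshold = (sigma - eps) * n
    return {p: c for p, c in tree.items() if c > threshold}


# Per-batch patterns copied from the walkthrough (mined with eps = 0.5).
batch1 = {("a",): 3, ("b",): 3, ("c",): 3,
          ("a", "b"): 2, ("a", "c"): 2, ("b", "c"): 2}
batch2 = {("a",): 4, ("b",): 4, ("c",): 2, ("d",): 2, ("a", "b"): 4}

tree, min_removed = {}, 0
for patterns in (batch1, batch2):
    # capacity = m - 1, assuming the root occupies one of the m = 7 nodes
    min_removed = ss_mb_update(tree, patterns, capacity=6, min_removed=min_removed)

result = ss_mb_output(tree, sigma=0.75, eps=0.5, n=8)
no_false_negatives = min_removed <= (0.75 - 0.5) * 8
```

Whichever count-2 node is removed, the output is the same, because none of those nodes passes the strict threshold count > 2.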
