Course on Data Mining (581550-4):
Seminar Meetings
P   Ass. Rules    16.11.
P   Clustering    02.11.
M   Episodes      23.11.
P   KDD Process   09.11.
M   Text Mining   30.11.

M = Seminar by Mika
P = Seminar by Pirjo

Home Exam
Course on Data Mining (581550-4):
Seminar Meetings
Today 09.11.2001
• Rakesh Agrawal and Ramakrishnan Srikant: Mining
Sequential Patterns. Int'l Conference on Data
Engineering, 1995.
• F. Masseglia, P. Poncelet and M. Teisseire:
Incremental Mining of Sequential Patterns in Large
Databases. 16èmes Journées Bases de Données
Avancées, 2000.
Mining Sequential Patterns
Rakesh Agrawal and Ramakrishnan Srikant
IBM Almaden Research Center, USA
Published in ICDE'95 (Int'l Conf. on Data Engineering)
Data Mining course Autumn 2001/University of Helsinki
Summary by Mika Klemettinen
Mining Sequential Patterns
• Problem statement:
  • Database D with customer transactions
  • Customer-id, transaction time, items purchased
  • Quantities of items purchased are NOT considered
• Definitions:
  • Itemset: a non-empty set of items, (i1 i2 i3 …)
  • Sequence: an ordered list of itemsets, <s1 s2 s3 …>
  • A sequence <a1 a2 … an> is contained in <b1 b2 … bm> if there exist i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin
  • E.g., <(3)(4 5)(8)> is contained in <(7)(3 8)(9)(4 5 6)(8)>, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8)
  • However, note that sequence <(3)(5)> is NOT contained in <(3 5)> (and vice versa)
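As a concrete illustration of this containment test, here is a minimal Python sketch (mine, not from the paper; itemsets are Python sets and sequences are lists of sets):

    def is_contained(a, b):
        # True if sequence a = [a1, a2, ...] is contained in sequence b:
        # each itemset ai must be a subset of some itemset of b, and the
        # matching positions in b must be strictly increasing.
        pos = 0
        for itemset in a:
            while pos < len(b) and not set(itemset) <= set(b[pos]):
                pos += 1
            if pos == len(b):
                return False
            pos += 1   # the next itemset of a must match a later itemset of b
        return True

    # The two examples above:
    print(is_contained([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))  # True
    print(is_contained([{3}, {5}], [{3, 5}]))                                    # False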
Mining Sequential Patterns
• Customer sequence: a sequence of transactions ("shopping baskets") of a customer, ordered by the transaction times Ti:
  <itemset(T1) itemset(T2) … itemset(Tn)>
• A customer supports a sequence s if s is contained in the customer sequence for this customer
• The support for a sequence is defined as the fraction of total customers who support this sequence
• Task: Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support. Each such maximal sequence represents a sequential pattern
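Building on the sketch above (and using the example customer sequences of the next slide), the support of a sequence could be computed as follows; is_contained is the helper from the previous sketch:

    def support(seq, customer_sequences):
        # Fraction of customers whose customer sequence contains seq;
        # each customer is counted at most once.
        hits = sum(1 for cseq in customer_sequences if is_contained(seq, cseq))
        return hits / len(customer_sequences)

    customers = [
        [{30}, {90}],                      # customer 1
        [{10, 20}, {30}, {40, 60, 70}],    # customer 2
        [{30, 50, 70}],                    # customer 3
        [{30}, {40, 70}, {90}],            # customer 4
        [{90}],                            # customer 5
    ]
    print(support([{30}, {90}], customers))   # 0.4, i.e. customers 1 and 4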
Mining Sequential Patterns
Customer Id   Transaction time   Items bought
1             June 25, 1993      30
1             June 30, 1993      90
2             June 10, 1993      10, 20
2             June 15, 1993      30
2             June 20, 1993      40, 60, 70
...           ...                ...

Customer Id   Customer sequence
1             <(30)(90)>
2             <(10 20)(30)(40 60 70)>
3             <(30 50 70)>
4             <(30)(40 70)(90)>
5             <(90)>

Min. support 25% => 2 customers:
<(30)(90)> (1&4) and <(30)(40 70)> (2&4) are maximal
Mining Sequential Patterns
• Definitions:
• Length of a sequence is the number of itemsets in the sequence
• A sequence of length k is called a k-sequence
• A sequence concatenated from sequences x and y is denoted by x.y
• The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction
• An itemset with minimum support is called a large itemset or litemset
• Each itemset in a large sequence must have minimum support, i.e., any large sequence
must be a list of litemsets (Apriori trick!)
• Three algorithms, all for sequential patterns:
• AprioriSome
• AprioriAll
• DynamicSome
Mining Sequential Patterns
• Mining of sequential patterns:
• 1. Sort Phase
  • Sort according to customer Id and transaction time
• 2. Litemset Phase
  • Find the large itemsets in an Apriori fashion, but, as in MaxFreq, the support count is incremented only once even if the customer buys the same set of items in two different transactions
  • The large itemsets are mapped to a set of contiguous integers (e.g., (30), (40), (70), (40 70) and (90) become 1, 2, 3, 4 and 5); checking equality is then fast (constant time)!
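A rough Python sketch of this counting step (my own simplification: only itemsets of size 1 and 2 are generated, whereas a full implementation would grow them Apriori-style):

    from itertools import combinations

    def litemsets(customer_sequences, min_support):
        # Count itemsets of size 1 and 2, incrementing the count at most once
        # per customer, then map the large ones (litemsets) to small integers.
        n = len(customer_sequences)
        counts = {}
        for cseq in customer_sequences:
            seen = set()                       # itemsets bought by this customer
            for transaction in cseq:
                items = sorted(transaction)
                for size in (1, 2):
                    seen.update(combinations(items, size))
            for itemset in seen:               # one increment per customer
                counts[itemset] = counts.get(itemset, 0) + 1
        large = sorted(s for s, c in counts.items() if c / n >= min_support)
        return {itemset: i + 1 for i, itemset in enumerate(large)}

    # With the example database and min. support 25%, this maps (30), (40),
    # (70), (40 70) and (90) to the integers 1..5.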
Mining Sequential Patterns
• 3. Transformation Phase
  • There is a need to repeatedly check which large itemsets are contained in customer sequences
  • To make this fast, each customer sequence is transformed to a list of large itemsets
  • Then the large itemsets are mapped to integers

CId   Original seq.             Transformed                       Mapping
1     (30)(90)                  {(30)}{(90)}                      {1}{5}
2     (10 20)(30)(40 60 70)     {(30)}{(40),(70),(40 70)}         {1}{2,3,4}
3     (30 50 70)                {(30),(70)}                       {1,3}
4     (30)(40 70)(90)           {(30)}{(40),(70),(40 70)}{(90)}   {1}{2,3,4}{5}
5     (90)                      {(90)}                            {5}
Mining Sequential Patterns
• 4. Sequence Phase
  • The large itemsets are used to find the desired sequences
  • AprioriAll:
    – Based on the normal Apriori algorithm (see the candidate-generation sketch after this list)
    – Counts all the large sequences
    – Prunes the non-maximal ones away in the "Maximal phase"
  • *Some:
    – Avoids counting sequences that are contained in longer sequences by counting the longer ones first; also avoids having to count many subsequences because their supersequences are not large
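For the AprioriAll branch described above, the candidate generation step can be sketched as the usual Apriori join-and-prune over transformed sequences, here written as tuples of litemset ids (a Python sketch that simplifies the paper's exact join condition):

    def generate_candidates(large_prev):
        # Join: two (k-1)-sequences sharing the same first k-2 litemsets are
        # combined into a k-sequence; order matters, so both extensions appear.
        large_prev = set(large_prev)
        candidates = set()
        for p in large_prev:
            for q in large_prev:
                if p[:-1] == q[:-1]:
                    candidates.add(p + (q[-1],))
        # Prune: every (k-1)-subsequence of a candidate must itself be large.
        def subseqs_large(c):
            return all(c[:i] + c[i + 1:] in large_prev for i in range(len(c)))
        return {c for c in candidates if subseqs_large(c)}

    # Joining <1 2> and <1 3> yields <1 2 3>; <1 3 2> is pruned because <3 2>
    # is not large.
    print(generate_candidates({(1, 2), (1, 3), (2, 3)}))   # {(1, 2, 3)}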
Mining Sequential Patterns
– Forward phase: find all large sequences of certain lengths
– Backward phase: find all remaining large sequences
– AprioriSome: use only large sequences from the previous pass to generate candidates and validate their supports (i.e., check whether they are frequent or not)
– DynamicSome: generate candidates on-the-fly, based on large sequences found in the previous passes and the customer sequences read from the database
• 5. Maximal Phase
  • Find the maximal sequences among the large sequences
  • In practice, starting from the largest sequences, delete all their subsequences
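A minimal sketch of this pruning, reusing is_contained from the earlier sketch (sequences are again lists of itemsets; after the transformation the itemsets are just singleton litemset ids):

    def maximal_sequences(large_sequences):
        # Keep only sequences not contained in a longer (already kept) large
        # sequence; longer sequences are processed first, as on the slide.
        maximal = []
        for seq in sorted(large_sequences, key=len, reverse=True):
            if not any(is_contained(seq, m) for m in maximal):
                maximal.append(seq)
        return maximal

    large = [[{1}, {2}, {3}, {4}], [{1}, {2}, {3}], [{1}, {3}, {5}],
             [{4}, {5}], [{1}, {2}], [{3}, {5}]]
    print(maximal_sequences(large))
    # [[{1}, {2}, {3}, {4}], [{1}, {3}, {5}], [{4}, {5}]]
    # i.e. the maximal sequences of the AprioriAll example on the next slide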
Mining Sequential Patterns
• AprioriAll:
  • Find all large sequences "normally"
  • Prune the non-maximal ones away starting from <1 2 3 4> by deleting all its subsequences (<1 2 3>, <1 2 4>, <1 3 4>, <2 3 4>, <1 2>, <1 3>, …, <4>), then take the remaining <1 3 5> and prune all its subsequences, …
  • The maximal large sequences are <1 2 3 4>, <1 3 5> and <4 5>
Mining Sequential Patterns
• AprioriSome:
  • Count only sequences of, e.g., lengths 1, 2, 4 and 6 in the "forward phase" and count sequences of lengths 3 and 5 in the "backward phase"
  • Note: in the forward phase, candidates for all lengths are generated (a sketch follows below):
    • If the large sequences Lk-1 of length k-1 were counted, then generate the new candidates Ck based on them
    • If the large sequences of length k-1 were NOT counted, then generate the new candidates Ck based on the candidates Ck-1
  • In the backward phase: delete all sequences of length k from the candidate collection if they are contained in some longer large sequence Li (i > k)
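Very roughly, the forward phase just described could be organized as follows. This is a sketch under my reading of the slide: generate_candidates is the join-and-prune helper sketched earlier, count_support is an assumed callable that scans the transformed database, and lengths_to_count stands for the lengths picked by the next() heuristic discussed on the next slide.

    def forward_phase(count_support, large_1, min_count, lengths_to_count, max_len):
        # Candidates are generated for every length, but only the chosen
        # lengths are actually counted against the database.
        large = {1: set(large_1)}          # L_k, only for the counted lengths
        cands = {1: set(large_1)}          # C_k for every length
        for k in range(2, max_len + 1):
            # generate C_k from L_{k-1} if it was counted, otherwise from C_{k-1}
            source = large[k - 1] if (k - 1) in large else cands[k - 1]
            cands[k] = generate_candidates(source)
            if not cands[k]:
                break
            if k in lengths_to_count:
                large[k] = {c for c in cands[k] if count_support(c) >= min_count}
        return large, cands

The backward phase would then count the skipped lengths, after first deleting candidates contained in some longer large sequence.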
Mining Sequential Patterns
• The function "next" determines the next sequence length to be counted: this is based on the assumption that if, e.g., almost all sequences of length k are large (frequent), then many of the sequences of length k+1 are also large (frequent). E.g.,
  • Most of the sequences are large (85%) => next round is k+5
  • ...
  • Not many of the sequences are large (67%) => next round is k+1 (AprioriAll)
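As an illustration only (the exact thresholds are defined in the paper; the ones below are placeholders chosen to match the two data points on the slide), the heuristic has roughly this shape:

    def next_length(k, hit_ratio):
        # hit_ratio = fraction of the counted length-k candidates that turned
        # out to be large. Thresholds below are illustrative, not the paper's.
        if hit_ratio >= 0.85:      # almost all candidates were large: jump ahead
            return k + 5
        if hit_ratio >= 0.80:
            return k + 4
        if hit_ratio >= 0.75:
            return k + 3
        if hit_ratio > 0.67:
            return k + 2
        return k + 1               # few were large: fall back to AprioriAll behaviour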
Mining Sequential Patterns
• DynamicSome:
  • In the initialization phase, count only sequences up to and including the length given by the step variable
    • E.g., if step is 3, count sequences of length 1, 2 and 3
  • In the forward phase, we generate sequences of length 2 × step, 3 × step, 4 × step, etc. on-the-fly, based on previous passes and the customer sequences in the database
    • E.g., while generating sequences of length 9 with a step size of 3: while passing the data, if sequences s6 ∈ L6 and s3 ∈ L3 are both contained in the customer sequence c at hand, and they do not overlap in c, then s6.s3 (in general, sk.sj) is a candidate (k+j)-sequence
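A sketch of this on-the-fly generation (my reading of the slide; transformed customer sequences are lists of sets of litemset ids, and the sequences in L_k and L_j are tuples of litemset ids):

    def match_end(seq, customer_seq, start=0):
        # Position just after the earliest match of seq (a tuple of litemset
        # ids) in customer_seq (a list of sets of ids), or None if no match.
        pos = start
        for lid in seq:
            while pos < len(customer_seq) and lid not in customer_seq[pos]:
                pos += 1
            if pos == len(customer_seq):
                return None
            pos += 1
        return pos

    def on_the_fly_candidates(customer_seq, large_k, large_j):
        # Whenever some sk in L_k and some sj in L_j both occur in the customer
        # sequence without overlapping (sj entirely after sk), the concatenation
        # sk.sj becomes a candidate (k+j)-sequence.
        candidates = set()
        for sk in large_k:
            end = match_end(sk, customer_seq)
            if end is None:
                continue
            for sj in large_j:
                if match_end(sj, customer_seq, start=end) is not None:
                    candidates.add(sk + sj)
        return candidates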
Mining Sequential Patterns
• In the intermediate phase, generate the candidate sequences for the skipped lengths
  • E.g., if we have counted L6 and L3, and L9 turns out to be empty: we generate C7 and C8, count C8 followed by C7 after deleting non-maximal sequences, and repeat the process for C4 and C5
• The backward phase is identical to that of AprioriSome
• Then we go on and spare a few thoughts on incremental mining of sequential
patterns
Incremental Mining of Sequential Patterns in Large
Databases
F. Masseglia, P. Poncelet and M. Teisseire
Laboratoire PRiSM & LIRMM UMR CNRS, France
Published in BDA'00 (Bases de Données Avancées)
Data Mining course Autumn 2001/University of Helsinki
Summary by Mika Klemettinen
Incremental Mining of Sequential Patterns
• Problem setting:
• Let us consider an original and an incremental customer transaction database
• For the original database, the frequent patterns have already been computed
• The increment database may contain new customers and new transactions for both old and new customers
• To compute the set of sequential patterns in the updated database, we want to avoid counting everything from scratch
• Some main things one has to consider:
  • Discover all sequential patterns that were NOT frequent in the original database but become frequent with the increment
  • Examine all transactions in the original database that can be extended to become frequent
  • Old frequent sequences may become invalid when adding a customer or customers
Incremental Mining of Sequential Patterns
• Definitions are basically the same as in the "Mining Sequential Patterns" paper
• Again, the problem is to find all (maximal) sequences whose support is greater than a specified threshold (minimum support)
• Additional definitions:
  • DB is the original database, minSupp is the minimum support
  • db is the increment database
  • U = DB ∪ db is the updated database containing all sequences from DB and db
  • LDB is the set of frequent sequences in DB
  • The task is to find the frequent sequences in U, denoted LU, with respect to minSupp
• An example database is presented on the next slide
Incremental Mining of Sequential Patterns

[Figure: example database]
Incremental Mining of Sequential Patterns
• First problem (Figure 1): Append new transactions to customers already existing in the original database
  • Suppose that we have a minSupp threshold of 50%
  • In the original database, the frequent (maximal) sequences LDB are {<(10 20)(30)>, <(10 20)(40)>}
  • New transactions are appended to customers C2 and C3
  • Sequences <(60)(90)> and <(10 20)(50 70)> become frequent
    • Customers C3 and C4 contain the first one, thus its support is 50%
    • Customers C1, C2, and C3 contain <(10 20)>; the increments for C2 and C3 make the second one frequent, since customers C1 and C2 contain it, so its support is 50%
  • Sequences <(10 20)(30)(50 60)(80)> and <(10 20)(40)(50 60)(80)> become frequent, since <(50 60)(80)> is frequent in db and was added to the rows already containing the frequent sequences <(10 20)(30)> and <(10 20)(40)>
Incremental Mining of Sequential Patterns
• Second problem (Figure 2): Append new customers and new transactions to the original database
  • Suppose again that we have a minSupp threshold of 50%
  • When one new customer is added to the database, a frequent sequence must be observed for 3 customers (previously 2)
  • In the original database, the frequent (maximal) sequences LDB used to be {<(10 20)(30)>, <(10 20)(40)>}, but the set is now just {<(10 20)>}
    • Sequences <(10 20)(30)> and <(10 20)(40)> occur only for customers C2 and C3
    • Sequence <(10 20)> occurs for C1, C2, and C3
  • By introducing the increment database db, LU becomes {<(10 20)(50)>, <(10)(70)>, <(10)(80)>, <(40)(80)>, <(60)>}
    • E.g., sequence <(10 20)(50)> is in the original database only for C1, and is thus not frequent; as item 50 becomes frequent with the increment database, the sequence also matches C2 and C3
Incremental Mining of Sequential Patterns
• Algorithm (ISE): The incremental mining is decomposed into two subproblems (k = length of the longest frequent sequences in DB)
  • Find all new frequent sequences of size j ≤ (k+1). During this phase, three kinds of frequent sequences are considered:
    • Sequences in DB can become frequent since they have sufficient support with the increment
    • There can be new frequent sequences appearing in the increment db but not in the original DB
    • Sequences in DB can become frequent when items of db are added to them
  • Find all new frequent sequences of size j > (k+1)
    • This is a straightforward application of an Apriori-like algorithm, since all frequent (k+1)-sequences have been discovered in the previous phase
Incremental Mining of Sequential Patterns
• First iteration (1):
  • Make a pass on db and count the support of the individual items of db
  • Provide 1-candExt, the 1-sequences occurring in db
  • Determine which items of db are frequent in U => L1db
  • Prune out the sequences that used to be frequent in LDB but are no longer frequent in U
Incremental Mining of Sequential Patterns
• First iteration (2):
  • Create candidate sequences of length 2 by joining L1db with L1db => 2-candExt
  • Generate from LDB the set of frequent sub-sequences
  • Scan U to find the frequent 2-sequences among 2-candExt and the frequent sub-sequences occurring before items of L1db
Incremental Mining of Sequential Patterns
• First iteration (3):
  • freqSeed <= the frequent sub-sequences occurring before an item of L1db, appended with that item
  • 2-freqExt <= the frequent 2-sequences from 2-candExt
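Pulling the three steps above together, a rough Python sketch of the first iteration under my reading of the slides. It deliberately simplifies: sequences are flattened to tuples of single items, so within-transaction itemsets and several details of the paper are glossed over, and db_items is assumed to be the set of items occurring in the increment db.

    def seq_support(U, seq):
        # Fraction of customers in U (a dict: customer id -> list of item sets)
        # whose customer sequence contains seq (a tuple of single items).
        def contains(cseq):
            pos = 0
            for item in seq:
                while pos < len(cseq) and item not in cseq[pos]:
                    pos += 1
                if pos == len(cseq):
                    return False
                pos += 1
            return True
        return sum(contains(c) for c in U.values()) / len(U)

    def ise_first_iteration(U, db_items, LDB, min_supp):
        # Items of the increment db that are frequent in the updated database U.
        L1db = {i for i in db_items if seq_support(U, (i,)) >= min_supp}
        # Drop old frequent sequences that are no longer frequent in U.
        LDB = {s for s in LDB if seq_support(U, s) >= min_supp}
        # 2-candExt = L1db joined with L1db, validated on U  ->  2-freqExt.
        freq_ext2 = {(i, j) for i in L1db for j in L1db
                     if seq_support(U, (i, j)) >= min_supp}
        # Frequent sub-sequences of the old patterns that occur before an item
        # of L1db, extended with that item  ->  freqSeed (here: prefixes only).
        prefixes = {s[:k] for s in LDB for k in range(1, len(s) + 1)}
        freqSeed = {p + (i,) for p in prefixes for i in L1db
                    if seq_support(U, p + (i,)) >= min_supp}
        return L1db, freq_ext2, freqSeed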
Incremental Mining of Sequential Patterns
• j-th iteration with j ≤ (k+1):

  while (j-freqExt ≠ ∅ AND j ≤ (k+1)) do
      candInc <= Generate candidates from freqSeed and j-freqExt ;
      j++ ;
      j-candExt <= Generate candidate j-sequences from (j-1)-freqExt ;
      Scan db for j-candExt ;
      if (j-candExt ≠ ∅ AND candInc ≠ ∅) then
          Scan U for j-candExt and candInc ;
      endif
      j-freqExt <= frequent j-sequences ;
      freqInc <= freqInc + candidates from candInc verifying the support on U ;
  enddo
  LU <= LDB ∪ { max. freq. sequences in freqSeed ∪ freqInc ∪ freqExt } ;
Incremental Mining of Sequential Patterns
• j-th iteration with j > (k+1):

  Apply an Apriori-style algorithm until all frequent sequences are discovered ;
  LU <= LU ∪ { max. freq. sequences obtained from the previous step } ;

• On the next slide, the processes in the first iteration and in the j-th iterations with j > (k+1) are summarized
• Optimization in "candInc <= Generate candidates from freqSeed and j-freqExt":
  • Consider two sequences (s ∈ freqSeed, s' ∈ freqExt) such that an item i ∈ L1db is the last item of s and the first item of s'
  • Do not append s' ∈ freqExt to s ∈ freqSeed if there exists an item j ∈ L1db such that j is in s' and j is not preceded by s
Incremental Mining of Sequential Patterns

[Figure: summary of the processes in the first iteration and in the j-th iterations]
Unofficial Evaluation (Personal Views…)
• Mining Sequential Patterns:
• Paper comes from one of the top research groups in data mining area (IBM Almaden
Data Mining group led by Rakesh Agrawal)
• Quite well-written paper: Good language, clear examples and presentation => rather
"easy to read"
• Simple ideas, not very "break-through" ideas (at least this is the interpretation now); quite a good international conference
• One has to remember: this was written back in 1995
• Incremental Mining of Sequential Patterns in Large Databases
• Paper comes from a not so well-known French research group
• Good: Lots of examples
• Bad: Language is not always as good as it could be & definitions are sometimes
somewhat "blurry", maybe too many abbreviations used
• Probably not very "break-through" ideas, national DB conference
• Remember: this is from the year 2000 - rather new!