Data Mining
Presented by: Kevin Seng
CS632 - Data Mining, 4/3/01

Papers

Rakesh Agrawal and Ramakrishnan Srikant:
- Fast Algorithms for Mining Association Rules
- Mining Sequential Patterns

Outline

For each paper:
- Present the problem.
- Describe the algorithms:
  - Intuition
  - Design
  - Performance

Market Basket Introduction

- Retailers are able to collect massive amounts of sales data (basket data):
  - Bar-code technology
  - E-commerce
- Sales data generally includes customer id, transaction date, and items bought.

Market Basket Problem

- It would be useful to find association rules between items in transactions.
  - e.g. 75% of the people who buy spaghetti also buy tomato sauce.
- Given a set of basket data, how can we efficiently find the set of association rules?

Formal Definition (1)

- L = {i1, i2, …, im} is the set of items.
- Database D is a set of transactions.
- Transaction T is a set of items such that T ⊆ L.
  - A unique identifier, TID, is associated with each transaction.

Formal Definition (2)

- T contains X, a set of some items in L, if X ⊆ T.
- Association rule X ⇒ Y, where X ⊂ L, Y ⊂ L, and X ∩ Y = ∅.
  - Confidence: % of transactions which contain X which also contain Y.
  - Support: % of transactions in D which contain X ∪ Y.

Formal Definition (3)

Given a set of transactions D, we want to generate all association rules that have support and confidence greater than the user-specified minimum support (minsup) and minimum confidence (minconf).

Problem Decomposition

Two sub-problems:
- Find all itemsets that have transaction support above minsup.
  - These itemsets are called large itemsets.
- From all the large itemsets, generate the set of association rules that have confidence above minconf.

Second Sub-problem

Straightforward approach:
- For every large itemset l, find all non-empty subsets of l.
- For every such subset a, output a rule of the form a ⇒ (l − a) if the ratio of support(l) to support(a) is at least minconf (a sketch follows below).
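
A minimal Python sketch of this step (the names gen_rules and support are illustrative, not from the paper; support maps each itemset to its support count):

    from itertools import combinations

    def gen_rules(large_itemsets, support, minconf):
        """Output rules a => (l - a) with confidence >= minconf."""
        rules = []
        for l in large_itemsets:                       # l: frozenset
            for r in range(1, len(l)):                 # all non-empty proper subsets
                for a in map(frozenset, combinations(l, r)):
                    conf = support[l] / support[a]     # conf(a => l - a)
                    if conf >= minconf:
                        rules.append((a, l - a, conf))
        return rules

Every subset a of a large itemset l is itself large, so support[a] is always available.
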
Discovering Large Itemsets

- Done with multiple passes over the data.
- First pass: find all individual items that are large (have minimum support).
- Subsequent passes, using the large itemsets found in the previous pass:
  - Generate candidate itemsets.
  - Count support for each candidate itemset.
  - Eliminate itemsets that do not have minimum support.

Algorithm

L1 = {large 1-itemsets};
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = apriori-gen(Lk-1);        // New candidates
    forall transactions t ∈ D do   // Counting support
        Ct = subset(Ck, t);        // Candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ minsup};
end
Answer = ∪k Lk;
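
A sketch of the same loop in Python, assuming an apriori_gen like the one outlined after the join slide below; the hash-tree subset() function is replaced here by a naive containment test:

    from collections import defaultdict

    def apriori(transactions, minsup):
        """transactions: list of frozensets; minsup: absolute count.
        Returns a dict mapping k to the set of large k-itemsets."""
        counts = defaultdict(int)
        for t in transactions:                  # pass 1: large 1-itemsets
            for item in t:
                counts[frozenset([item])] += 1
        L = {1: {i for i, c in counts.items() if c >= minsup}}
        k = 2
        while L[k - 1]:
            Ck = apriori_gen(L[k - 1])          # new candidate k-itemsets
            counts = defaultdict(int)
            for t in transactions:              # one pass over D per k
                for c in Ck:
                    if c <= t:                  # candidate contained in t
                        counts[c] += 1
            L[k] = {c for c in Ck if counts[c] >= minsup}
            k += 1
        return L
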
Candidate Generation

AIS and SETM algorithms:
- Use the transactions in the database to generate new candidates.
- But this generates many candidates which we know beforehand are not large!

Apriori Algorithms

- Generate candidates using only the large itemsets found in the previous pass, without considering the database.
- Intuition: any subset of a large itemset must be large.

Apriori Candidate Generation

- Takes in Lk-1 and returns Ck.
- Two steps:
  - Join large itemsets Lk-1 with Lk-1.
  - Prune out all itemsets in the joined result which contain a (k−1)-subset not found in Lk-1.

Candidate Generation (Join)

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2,
      p.itemk-1 < q.itemk-1
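
The same two steps in Python (a sketch; L_prev holds the large (k−1)-itemsets as frozensets):

    def apriori_gen(L_prev):
        """Join L(k-1) with itself, then prune."""
        # Join: merge pairs that agree on their first k-2 items (sorted order).
        sorted_sets = [tuple(sorted(s)) for s in L_prev]
        joined = set()
        for p in sorted_sets:
            for q in sorted_sets:
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    joined.add(frozenset(p + (q[-1],)))
        # Prune: every (k-1)-subset of a candidate must itself be large.
        return {c for c in joined
                if all(c - {i} in L_prev for i in c)}

On the L3 of the next slide this returns only {1 2 3 4}, matching the example.
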
Candidate Gen. (Example)

L3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }

Join  → C4 = { {1 2 3 4}, {1 3 4 5} }
Prune → C4 = { {1 2 3 4} }

({1 3 4 5} is pruned because its 3-subsets {1 4 5} and {3 4 5} are not in L3.)

Counting Support

- Need to count the number of transactions which support a given itemset.
- For efficiency, use a hash-tree.
  - Subset function.

Subset Function (Hash-tree)

- Candidate itemsets are stored in a hash-tree.
- Leaf node: contains a list of itemsets.
- Interior node: contains a hash table.
  - Each bucket of the hash table points to another node.
  - The root is at depth 1.
  - Interior nodes at depth d point to nodes at depth d+1.

Hash-tree Example (1)

[Figure: C3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} } stored in a hash-tree. The root (depth 1) hashes on the first item; a node at depth d hashes on the d-th item. One leaf holds {1 2 3} and {1 2 4}, another holds {1 3 4} and {1 3 5}, a third holds {2 3 4}.]

Using the hash-tree

- If we are at a leaf: find all itemsets in the leaf contained in the transaction.
- If we are at an interior node: hash on each remaining element of the transaction.
- At the root node: hash on all elements of the transaction.

A sketch of the structure follows below.
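
A compact Python sketch of such a hash-tree (the bucket count, leaf capacity, and all names are illustrative assumptions; itemsets are sorted tuples):

    class HashTreeNode:
        def __init__(self, depth=1, max_leaf=3, nbuckets=7):
            self.depth, self.max_leaf, self.nbuckets = depth, max_leaf, nbuckets
            self.children = {}    # bucket -> child node (when interior)
            self.itemsets = []    # stored candidates (when leaf)
            self.is_leaf = True

        def _child(self, itemset):
            # A node at depth d routes on the d-th item of the itemset.
            b = hash(itemset[self.depth - 1]) % self.nbuckets
            if b not in self.children:
                self.children[b] = HashTreeNode(self.depth + 1,
                                                self.max_leaf, self.nbuckets)
            return self.children[b]

        def insert(self, itemset):
            if not self.is_leaf:
                self._child(itemset).insert(itemset)
                return
            self.itemsets.append(itemset)
            if len(self.itemsets) > self.max_leaf and self.depth <= len(itemset):
                self.is_leaf = False          # overflow: become an interior node
                for s in self.itemsets:
                    self._child(s).insert(s)
                self.itemsets = []

        def subsets(self, t, start=0, out=None):
            """Collect stored candidates contained in transaction t."""
            out = set() if out is None else out
            if self.is_leaf:
                out.update(s for s in self.itemsets if set(s) <= set(t))
            else:
                for i in range(start, len(t)):   # hash on each remaining element
                    b = hash(t[i]) % self.nbuckets
                    if b in self.children:
                        self.children[b].subsets(t, i + 1, out)
            return out

With the candidates from the example above:

    tree = HashTreeNode()
    for c in [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]:
        tree.insert(c)
    tree.subsets((1, 2, 3, 4))   # -> the four candidates contained in it
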
Hash-tree Example (2)

[Figure: transactions D = { {1 2 3 4}, {2 3 5} } walked through the same hash-tree to find the candidate itemsets each one contains.]

AprioriTid (1)

- Does not use the transactions in the database for counting itemset support.
- Instead stores the transactions as sets of possible large itemsets, Ck.
  - Each member of Ck is of the form <TID, {Xk}>, where Xk is a possible large k-itemset.
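
A sketch of one pass in Python (names are illustrative). Per the paper, a candidate c is contained in a transaction exactly when both (k−1)-itemsets that were joined to form c appear in that transaction's entry:

    def apriori_tid_pass(Ck_bar_prev, Ck, minsup):
        """Ck_bar_prev: list of (tid, set of (k-1)-itemset frozensets).
        Ck: candidate k-itemsets from apriori_gen.
        Returns the next (Ck_bar, Lk)."""
        counts = {c: 0 for c in Ck}
        Ck_bar = []
        for tid, prev_sets in Ck_bar_prev:
            present = set()
            for c in Ck:
                last, second_last = sorted(c)[-1], sorted(c)[-2]
                # c is in this transaction iff both of its generators are.
                if c - {last} in prev_sets and c - {second_last} in prev_sets:
                    present.add(c)
                    counts[c] += 1
            if present:               # transactions with no candidates vanish
                Ck_bar.append((tid, present))
        Lk = {c for c in Ck if counts[c] >= minsup}
        return Ck_bar, Lk
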
AprioriTid (2)

Advantage of Ck:
- If a transaction does not contain any candidate k-itemset, then it has no entry in Ck.
- The number of entries in Ck may be less than the number of transactions in D.
  - Especially for large k.
  - Speeds up counting!

AprioriTid (3)

However:
- For small k, each entry in Ck may be larger than its corresponding transaction.
- The usual space vs. time tradeoff.

AprioriTid (4) Example

[Figure: worked example of the Ck representation across passes.]

Observation

- When Ck does not fit in main memory, we see a large jump in execution time.
- AprioriTid beats Apriori only when Ck can fit in main memory.

AprioriHybrid

- It is not necessary to use the same algorithm for all the passes.
- Combine the two algorithms (a sketch of the switch test follows):
  - Start with Apriori.
  - When Ck can fit in main memory, switch to AprioriTid.
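
A sketch of the switching test, assuming the paper's size estimate for Ck (the sum of candidate supports plus the number of transactions); the memory budget is expressed in entries rather than bytes for simplicity:

    def should_switch_to_tid(candidate_counts, num_transactions, mem_budget):
        """Switch to AprioriTid once the estimated size of Ck fits in memory."""
        estimated = sum(candidate_counts.values()) + num_transactions
        return estimated < mem_budget
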
Performance (1)

- Measured performance by running the algorithms on generated synthetic data.
- Used the following parameters:

[Table of synthetic-data parameters omitted.]

Performance (2), (3)

[Charts comparing execution times of the algorithms omitted.]

Mining Sequential Patterns (1)

- Sequential patterns are ordered lists of itemsets.
- Market basket examples:
  - Customers typically rent "Star Wars", then "The Empire Strikes Back", then "Return of the Jedi".
  - "Fitted sheets and pillow cases", then "comforter", then "drapes and ruffles".

Mining Sequential Patterns (2)

- Looks at sequences of transactions, as opposed to single transactions.
- Groups transactions based on customer id.
  - Customer sequence.

Formal Definition (1)

- Given a database D of customer transactions.
- Each transaction consists of: customer id, transaction-time, items purchased.
  - No customer has more than one transaction with the same transaction-time.

Formal Definition (2)

- Itemset i = (i1 i2 … im), where each ij is an item.
- Sequence s = <s1 s2 … sn>, where each sj is an itemset.
- Sequence <a1 a2 … an> is contained in <b1 b2 … bm> if there exist integers i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin.
  - e.g. <(3) (4 5) (8)> is contained in <(7) (3 8) (9) (4 5 6) (8)>.
- A sequence s is maximal if it is not contained in any other sequence.
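
The containment test in Python (a sketch; a greedy left-to-right scan suffices, because matching each ai to the earliest possible bij never rules out a match for later elements):

    def contains(a, b):
        """True if sequence a (list of sets) is contained in sequence b."""
        j = 0
        for ai in a:
            while j < len(b) and not ai <= b[j]:   # find a bj with ai ⊆ bj
                j += 1
            if j == len(b):
                return False
            j += 1                                 # indices strictly increase
        return True

    # The example above:
    contains([{3}, {4, 5}, {8}],
             [{7}, {3, 8}, {9}, {4, 5, 6}, {8}])   # True
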
Formal Definition (3)

- A customer supports a sequence s if s is contained in the customer sequence for this customer.
- Support of a sequence: % of customers who support the sequence.
  - For mining association rules, support was a % of transactions.

Formal Definition (4)

- Given a database D of customer transactions, find the maximal sequences among all sequences that have a certain user-specified minimum support.
- Sequences that have support above minsup are large sequences.

Algorithm: Sort Phase

- Customer id: major key.
- Transaction-time: minor key.
- Converts the original transaction database into a database of customer sequences.

Algorithm: Litemset Phase (1)

- Litemset phase: find all large itemsets.
- Why? Because each itemset in a large sequence has to be a large itemset.

Algorithm: Litemset Phase (2)

- To get all large itemsets we can use the Apriori algorithms discussed earlier.
- Need to modify support counting.
  - For sequential patterns, support is measured by the fraction of customers.

Algorithm: Litemset Phase (3)

- Each large itemset is then mapped to a set of contiguous integers.
- Used to compare two large itemsets in constant time.

Algorithm: Transformation (1)

- Need to repeatedly determine which of a given set of large sequences are contained in a customer sequence.
- Represent transactions as sets of large itemsets.
- A customer sequence now becomes a list of sets of itemsets (sketched below).
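
A sketch of the transformation in Python (names are illustrative; litemset ids are just positions in an arbitrary enumeration):

    def transform(customer_seqs, large_itemsets):
        """customer_seqs: dict customer -> list of transactions (frozensets).
        Replaces each transaction by the set of ids of the large itemsets
        it contains; empty transactions (and customers) are dropped."""
        lid = {l: n for n, l in enumerate(large_itemsets)}   # itemset -> id
        out = {}
        for cust, seq in customer_seqs.items():
            new_seq = []
            for t in seq:
                ids = {lid[l] for l in lid if l <= t}   # litemsets inside t
                if ids:
                    new_seq.append(ids)
            if new_seq:
                out[cust] = new_seq
        return out, lid
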
Algorithm: Transformation (2)

[Figure: example of a transformed customer-sequence database.]

Algorithm: Sequence Phase (1)

- Use the set of large itemsets to find the desired sequences.
- Similar structure to the Apriori algorithms used to find large itemsets:
  - Use a seed set to generate candidate sequences.
  - Count support for each candidate.
  - Eliminate candidate sequences which are not large.

Algorithm: Sequence Phase (2)

Two types of algorithms:
- Count-all: counts all large sequences, including non-maximal sequences.
  - AprioriAll
- Count-some: tries to avoid counting non-maximal sequences by counting longer sequences first.
  - AprioriSome
  - DynamicSome

Algorithm: Maximal Phase (1)

Find the maximal sequences among the set of all large sequences S (n = length of the longest sequence):

for (k = n; k > 1; k--) do
    forall k-sequences sk do
        delete from S all subsequences of sk

Algorithm: Maximal Phase (2)

- Use a hash-tree to find all subsequences of sk in S.
- Similar to the subset function used in finding large itemsets.
- S is stored in the hash-tree.

AprioriAll (1)

[Figure: AprioriAll pseudocode; its structure mirrors the Apriori algorithm, sketched below.]
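
A Python sketch of the loop (assuming the seq_candidate_gen join outlined on the next slide, and customer sequences already transformed into lists of litemset-id sets):

    def apriori_all(customer_seqs, L1, minsup):
        """customer_seqs: list of sequences (lists of sets of litemset ids).
        L1: large 1-sequences as 1-tuples; minsup: number of customers."""
        L = {1: set(L1)}
        k = 2
        while L[k - 1]:
            Ck = seq_candidate_gen(L[k - 1])
            counts = {c: 0 for c in Ck}
            for seq in customer_seqs:
                for c in Ck:
                    if seq_contains(seq, c):   # customer supports candidate
                        counts[c] += 1
            L[k] = {c for c in Ck if counts[c] >= minsup}
            k += 1
        return L                               # maximal phase filters this

    def seq_contains(seq, cand):
        """cand (tuple of litemset ids) is contained in seq (list of id sets)."""
        j = 0
        for x in cand:
            while j < len(seq) and x not in seq[j]:
                j += 1
            if j == len(seq):
                return False
            j += 1
        return True
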
AprioriAll (2)

- A hash-tree is used for counting.
- Candidate generation is similar to candidate generation for large itemsets.
- Except that order matters, so we no longer have the condition:
  p.itemk-1 < q.itemk-1
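
In Python (a sketch; sequences are tuples of litemset ids):

    def seq_candidate_gen(L_prev):
        """Join L(k-1) with itself, then prune. Order matters, so p joins q
        whenever they agree on their first k-2 elements; both extensions
        <... p[-1] q[-1]> and <... q[-1] p[-1]> are generated."""
        joined = {p + (q[-1],)
                  for p in L_prev for q in L_prev
                  if p[:-1] == q[:-1]}
        # Prune: dropping any element must leave a sequence in L(k-1).
        return {c for c in joined
                if all(c[:i] + c[i + 1:] in L_prev for i in range(len(c)))}

    # With the L3 from the example on the next slide:
    L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
    seq_candidate_gen(L3)   # {(1, 2, 3, 4)}
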
AprioriAll (3)

Example of candidate generation:

L3 = { <1 2 3>, <1 2 4>, <1 3 4>, <1 3 5>, <2 3 4> }
Join  → C4 = { <1 2 3 4>, <1 2 4 3>, <1 3 4 5>, <1 3 5 4> }
Prune → C4 = { <1 2 3 4> }

AprioriAll (4)

[Figure: worked example of AprioriAll on a small customer-sequence database.]

Count-some Algorithms

- Try to avoid counting non-maximal sequences by counting longer sequences first.
- Two phases:
  - Forward phase: find all large sequences of certain lengths.
  - Backward phase: find all remaining large sequences.

AprioriSome (1)

- Determines which lengths to count using a next() function.
- next() takes as a parameter the length of the sequences counted in the last pass.
  - next(k) = k + 1 is the same as AprioriAll.
- Balances the tradeoff between:
  - counting non-maximal sequences, and
  - counting extensions of small candidate sequences.

AprioriSome (2)

- hitk = |Lk| / |Ck|
- Intuition: as hitk increases, the time wasted by counting extensions of small candidates decreases, so we can skip further ahead (sketched below).
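
A sketch of the step function in Python, with the thresholds as I read them from the paper:

    def next_len(k, hit_k):
        """Choose the next sequence length to count in the forward phase.
        hit_k = |Lk| / |Ck|: the higher it is, the further we skip ahead."""
        if hit_k < 0.666:
            return k + 1      # degenerates to AprioriAll
        elif hit_k < 0.75:
            return k + 2
        elif hit_k < 0.80:
            return k + 3
        elif hit_k < 0.85:
            return k + 4
        else:
            return k + 5
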
AprioriSome (3)

[Figure: AprioriSome forward-phase pseudocode.]

AprioriSome (4)

Backward phase. For all lengths which were skipped:
- Delete sequences in the candidate set which are contained in some large sequence.
- Count the remaining candidates and find all sequences with minimum support.
- Also delete large sequences found in the forward phase which are non-maximal.

AprioriSome (5)

[Figure: AprioriSome backward-phase pseudocode.]

AprioriSome (6)

Example, forward phase (next(k) = 2k, minsup = 2):

[Figure: candidate 3-sequences C3 and the large 3-sequences counted.]

AprioriSome (7)

Example, backward phase:

[Figure: candidate 3-sequences C3 and the large 3-sequences counted in the backward phase.]

Performance (1)

- Used generated synthetic datasets again.
- Parameters for the data:

[Table of synthetic-data parameters omitted.]

Performance (2)

- DynamicSome generates too many candidates.
- AprioriSome does a little better than AprioriAll.
  - It avoids counting many non-maximal sequences.

Performance (3)

The advantage of AprioriSome is reduced for two reasons:
- AprioriSome generates more candidates.
- Candidates remain memory resident even if a pass is skipped, so the heuristic cannot always be followed.

Wrap up

Just presented two classic papers on data mining:
- Association Rules
- Sequential Patterns