Association Rules Mining
with SQL
Kirsten Nelson
Deepen Manek
November 24, 2003
1
Organization of Presentation





Overview – Data Mining and RDBMS
Loosely-coupled data and programs
Tightly-coupled data and programs
Architectural approaches
Methods of writing efficient SQL




Candidate generation, pruning, support counting
K-way join, SubQuery, GatherJoin, Vertical, Hybrid
Integrating taxonomies
Mining sequential patterns
2
Early data mining applications



Most early mining systems were developed
largely on file systems, with specialized data
structures and buffer management strategies
devised for each
All data was read into memory before
beginning computation
This limits the amount of data that can be
mined
3
Advantage of SQL and RDBMS




Make use of database indexing and query
processing capabilities
More than a decade spent on making these
systems robust, portable, scalable, and
concurrent
Exploit underlying SQL parallelization
For long-running algorithms, use
checkpointing and space management
4
Organization of Presentation





Overview – Data Mining and RDBMS
Loosely-coupled data and programs
Tightly-coupled data and programs
Architectural approaches
Methods of writing efficient SQL




Candidate generation, pruning, support counting
K-way join, SubQuery, GatherJoin, Vertical, Hybrid
Integrating taxonomies
Mining sequential patterns
5
Use of Database in Data
Mining

“Loose coupling” of application and data




 How would you write an Apriori program?
 Use SQL statements in an application
 Use a cursor interface to read through records sequentially for each pass
 Still two major performance problems:
   Copying of records from the database to application memory
   A process context switch for each record retrieved
6
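As a rough sketch (not taken from the papers), one pass of such a loose-coupling loop might look like the embedded-SQL pseudocode below, in the same style as the tightly-coupled code on slides 9-11; the table sales and the host variables :tid and :item are assumptions:

exec sql declare c cursor for select TID, ITEMID from sales order by TID;
exec sql open c;
loop:
  exec sql fetch c into :tid, :item;            /* each record is copied out to the application */
  if SQLCODE = 100 then goto done;              /* no more rows */
  feed (:tid, :item) to the in-memory support counter;  /* mining logic stays in the application */
  goto loop;
done:
  exec sql close c;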
Organization of Presentation





Overview – Data Mining and RDBMS
Loosely-coupled data and programs
Tightly-coupled data and programs
Architectural approaches
Methods of writing efficient SQL




Candidate generation, pruning, support counting
K-way join, SubQuery, GatherJoin, Vertical, Hybrid
Integrating taxonomies
Mining sequential patterns
7
Tightly-coupled applications




Push computations into the database system
to avoid performance degradation
Take advantage of user-defined functions
(UDFs)
Does not require changes to database
software
Two types of UDFs we will use:


Ones that are executed only a few times,
regardless of the number of rows
Ones that are executed once for each selected
row
8
Tight-coupling using UDFs
Procedure TightlyCoupledApriori():
begin
exec sql connect to database;
exec sql select allocSpace() into :blob from onerecord;
exec sql select * from sales where GenL1(:blob, TID, ITEMID) = 1;
notDone := true;
9
Tight-coupling using UDFs
while notDone do {
exec sql select aprioriGen(:blob) into :blob from onerecord;
exec sql select * from sales where itemCount(:blob, TID, ITEMID) = 1;
exec sql select GenLk(:blob) into :notDone from onerecord
}
10
Tight-coupling using UDFs
exec sql select getResult(:blob) into :resultBlob from onerecord;
exec sql select deallocSpace(:blob) from onerecord;
compute Answer using resultBlob;
end
11
Organization of Presentation





Overview – Data Mining and RDBMS
Loosely-coupled data and programs
Tightly-coupled data and programs
Architectural approaches
Methods of writing efficient SQL




Candidate generation, pruning, support counting
K-way join, SubQuery, GatherJoin, Vertical, Hybrid
Integrating taxonomies
Mining sequential patterns
12
Methodology



Comparison of alternatives done for association rule mining on IBM DB2
Only consider generation of frequent itemsets using
Apriori algorithm
Five alternatives considered:





Loose-coupling through SQL cursor interface – as described earlier
UDF tight-coupling – as described earlier
Stored-procedure to encapsulate mining algorithm
Cache-mine – caching data and mining on the fly
SQL implementations to force processing in the database

Consider two classes of implementations


SQL-92 – four different implementations
SQL-OR (with object relational extensions) – six implementations
13
Architectural Options

Stored procedure




Apriori algorithm encapsulated as a stored procedure
Implication: runs in the same address space as the DBMS
Mined results stored back into the DBMS.
Cache-mine





Variation of stored-procedure
Read entire data once from DBMS, temporarily cache data in
a lookaside buffer on a local disk
Cached data is discarded when execution completes
Disadvantage – requires additional disk space for caching
Use Intelligent Miner’s “space” option
14
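As a minimal sketch of how the stored-procedure option might be invoked (the procedure name, its parameter, and the result table are assumptions, not from the papers):

exec sql call AprioriSP(:minsup);              -- Apriori runs inside the DBMS address space
exec sql select * from FrequentItemsets;       -- mined results are read back from the DBMS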
Organization of Presentation





Overview – Data Mining and RDBMS
Loosely-coupled data and programs
Tightly-coupled data and programs
Architectural approaches
Methods of writing efficient SQL




Candidate generation, pruning, support counting
K-way join, SubQuery, GatherJoin, Vertical, Hybrid
Integrating taxonomies
Mining sequential patterns
15
Terminology

Use the following terminology
 T: table of items
   {tid, item} pairs
   Data is normally sorted by transaction id
 Ck: candidate k-itemsets
   Obtained by joining and pruning the frequent itemsets from the previous iteration
 Fk: frequent itemsets of length k
   Obtained from Ck and T
16
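For concreteness, a minimal sketch of how these tables might be declared for k = 2 (column names follow the slides; the types are assumptions):

create table T  (tid integer, item integer);                  -- one row per (transaction, item) pair
create table C2 (item1 integer, item2 integer);               -- candidate 2-itemsets
create table F2 (item1 integer, item2 integer, cnt integer);  -- frequent 2-itemsets with support count

In general Ck has columns item1, …, itemk and Fk adds a support count.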
Candidate Generation in SQL –
join step

Generate Ck from Fk-1 by joining Fk-1 with itself
insert into Ck select I1.item1,…,I1.itemk-1,I2.itemk-1
from Fk-1 I1,Fk-1 I2
where I1.item1 = I2.item1 and
…
I1.itemk-2 = I2.itemk-2 and
I1.itemk-1 < I2.itemk-1
17
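Instantiated for k = 4 (the setting of the example on the next slide), the join step reads:

insert into C4 select I1.item1, I1.item2, I1.item3, I2.item3
from F3 I1, F3 I2
where I1.item1 = I2.item1 and
      I1.item2 = I2.item2 and
      I1.item3 < I2.item3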
Candidate Generation Example
 F3 is {{1,2,3},{1,2,4},{1,3,4},{1,3,5},{2,3,4}}
 C4 is {{1,2,3,4},{1,3,4,5}}
Table F3 (I1) and Table F3 (I2) each contain the rows (item1, item2, item3):
(1,2,3), (1,2,4), (1,3,4), (1,3,5), (2,3,4)
18
Pruning

Modify the candidate generation algorithm to ensure that, for each itemset in Ck, all k of its
subsets of length (k-1) are in Fk-1



Do a k-way join, skipping itemn-2 when joining with the nth table
(2<n≤k)
Create primary index (item1, …, itemk-1) on Fk-1 to efficiently process
k-way join
For k=4, this becomes
insert into C4 select I1.item1, I1.item2, I1.item3, I2.item3
from F3 I1, F3 I2, F3 I3, F3 I4
where I1.item1 = I2.item1 … and I1.item3 < I2.item3 and
      I1.item2 = I3.item1 and I1.item3 = I3.item2 and I2.item3 = I3.item3 and
      I1.item1 = I4.item1 and I1.item3 = I4.item2 and I2.item3 = I4.item3
19
Pruning Example


Evaluate the join with I3 using the previous example
C4 is {{1,2,3,4}} – the candidate {1,3,4,5} is dropped because its subset {3,4,5} (obtained by skipping item1) is not in F3
Tables F3 (I1), F3 (I2), and F3 (I3) each contain the rows (item1, item2, item3):
(1,2,3), (1,2,4), (1,3,4), (1,3,5), (2,3,4)
20
Support counting using SQL

Two different approaches

Use the SQL-92 standard


Use ‘standard’ SQL syntax such as joins and subqueries
to find support of itemsets
Use object-relational extensions of SQL (SQL-OR)


User Defined Functions (UDFs) & table functions
Binary Large Objects (BLOBs)
21
Support Counting using SQL-92

4 different methods, two of which detailed in
the papers



K-way Joins
SubQuery
Other methods not discussed because of
unacceptable performance


3-way join
2 Group-Bys
22
SQL-92: K-way join


Obtain Fk by joining Ck with table T of (tid,item)
Perform group by on the itemset
insert into Fk select item1,…,itemk, count(*)
from Ck, T t1, …, T tk
where t1.item = Ck.item1 and … and
      tk.item = Ck.itemk and
      t1.tid = t2.tid and … and
      tk-1.tid = tk.tid
group by item1,…,itemk
having count(*) > :minsup
23
K-way join example


C3={B,C,E} and minimum support required is 2
Insert into F3 {B,C,E,2}
24
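For k = 3, the template on slide 23 instantiates to the following query (a sketch; here C3 holds the single candidate {B,C,E}):

insert into F3 select item1, item2, item3, count(*)
from C3, T t1, T t2, T t3
where t1.item = C3.item1 and t2.item = C3.item2 and t3.item = C3.item3 and
      t1.tid = t2.tid and t2.tid = t3.tid
group by item1, item2, item3
having count(*) > :minsup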
K-way join: Pass-2 optimization


When calculating C2, no pruning is required after we join
F1 with itself
Don’t calculate and materialize C2 – in the K-way join algorithm, replace C2 with the join of F1 with itself
insert into F2 select I1.item1, I2.item1,count(*)
from F1 I1, F1 I2, T t1, T t2
where I1.item1 < I2.item1 and
t1.item = I1.item1 and t2.item = I2.item1 and
t1.tid = t2.tid
group by I1.item1,I2.item1
having count(*) > :minsup
25
SQL-92: SubQuery based


Split support counting into cascade of k subqueries
nth subquery Qn finds all tids that match the distinct
itemsets formed by the first n items of Ck
insert into Fk select item1, …, itemk, count(*)
from (Subquery Qk) t
Group by item1, item2 … , itemk having count(*) > :minsup
Subquery Qn (for any n between 1 and k):
select item1, …, itemn, tid
from T tn, (Subquery Qn-1) as rn-1,
     (select distinct item1, …, itemn from Ck) as dn
where rn-1.item1 = dn.item1 and … and rn-1.itemn-1 = dn.itemn-1
      and rn-1.tid = tn.tid and tn.item = dn.itemn
26
Example of SubQuery based

Using previous example from class

C3 = {B,C,E}, minimum support = 2

Q0: No subquery Q0

Q1 in this case becomes
select item1, tid
From T t1,
(select distinct item1 from C3) as d1
where t1.item = d1.item1
27
Example of SubQuery based cnt’d

Q2 becomes
select item1, item2, tid from T t2, (Subquery Q1) as r1,
(select distinct item1, item2 from C3) as d2 where r1.item1 =
d2.item1 and r1.tid = t2.tid and t2.item = d2.item2
28
Example of SubQuery based cnt’d

Q3 becomes
select item1,item2,item3, tid from T t3, (Subquery Q2) as r2,
(select distinct item1,item2,item3 from C3) as d3
where r2.item1 = d3.item1 and r2.item2 = d3.item2 and
r2.tid = t3.tid and t3.item = d3.item3
29
Example of SubQuery based cnt’d


Output of Q3 is
(item1, item2, item3, tid):
(B, C, E, 20)
(B, C, E, 30)
Insert statement becomes
insert into F3 select item1, item2, item3, count(*)
from (Subquery Q3) t
group by item1, item2 ,item3 having count(*) > :minsup

Insert the row {B,C,E,2}

For Q2, pass-2 optimization can be used
30
Performance Comparisons of
SQL-92 approaches

Used Version 5 of DB2 UDB and RS/6000 Model 140



200 MHz CPU, 256 MB main memory, 9 GB of disk space, transfer rate of 8 MB/sec
 Used 4 different datasets based on real-world data
 Built the following indexes, which are not included in any cost calculations



Composite index (item1, …, itemk) on Ck
k different indices on each of the k items in Ck
(item,tid) and (tid,item) indexes on the data table T
31
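As a sketch for k = 2, the indexes listed above might be created as follows (the index names are made up for illustration):

create index ck_all     on C2 (item1, item2);   -- composite index on all items of Ck
create index ck_item1   on C2 (item1);          -- one index per item column of Ck
create index ck_item2   on C2 (item2);
create index t_item_tid on T (item, tid);       -- data-table indexes
create index t_tid_item on T (tid, item);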
Performance Comparisons of
SQL-92 approaches


 Datasets:
Dataset     # Records (millions)   # Transactions (millions)   # Items (thousands)   Avg # Items
Dataset-A   2.5                    0.57                        85                    4.4
Dataset-B   7.5                    2.5                         15.8                  2.62
Dataset-C   6.6                    0.21                        15.8                  31
Dataset-D   14                     1.44                        480                   9.62
 Best performance among the SQL-92 approaches was obtained by the SubQuery approach
 Even SubQuery was only comparable to loose-coupling in some cases, and failed to complete in others
   For Dataset-C at a support of 2%, SubQuery outperforms loose-coupling, but when support is decreased to 1%, SubQuery takes 10 times as long to complete
   Lower support increases the size of Ck and Fk at each step, causing the join to process more rows
32
Support Counting using SQL with
object-relational extensions

6 different methods, four of which detailed in the
papers





GatherJoin
GatherCount
GatherPrune
Vertical
Other methods not discussed because of
unacceptable performance


Horizontal
SBF
33
SQL Object-Relational Extension:
GatherJoin

Generates all possible k-item combinations of items
contained in a transaction and joins them with Ck


An index is created on all items of Ck
Uses the following table functions


Gather: Outputs records {tid,item-list}, with item-list being a
BLOB or VARCHAR containing all items associated with the
tid
Comb-K: returns all k-item combinations from the
transaction

Output has k attributes T_itm1, …, T_itmk
34
GatherJoin
insert into Fk select item1,…, itemk, count(*)
from Ck,
(select t2.T_itm1,…,t2.T_itmk from T,
table(Gather(T.tid,T.item)) as t1,
table(Comb-K(t1.tid,t1.item-list)) as t2)
where t2.T_itm1 = Ck.item1 and … and
t2.T_itmk = Ck.itemk
group by Ck.item1,…,Ck.itemk
having count(*) > :minsup
35
Example of GatherJoin
 t1 (output from Gather) looks like:
(tid, item-list): (10, {A,C,D}), (20, {B,C,E}), (30, {A,B,C,E}), (40, {B,E})
 t2 (generated by Comb-K from t1) will be joined with C3 to obtain F3
   1 row from Tid 10, 1 row from Tid 20, 4 rows from Tid 30
   Insert {B,C,E,2}
36
GatherJoin: Pass 2 optimization


When calculating C2, no pruning is required after we join
F1 with itself
Don’t calculate and materialize C2 - replace C2 with a join
to F1 before the table function

Gather is only passed frequent 1-itemset rows
insert into F2 select t2.T_itm1, t2.T_itm2, count(*) from F1 I1,
(select t2.T_itm1,t2.T_itm2 from T, table(Gather(T.tid,T.item)) as t1,
table(Comb-K(t1.tid,t1.item-list)) as t2 where T.item = I1.item1)
group by t2.T_itm1,t2.T_itm2
having count(*) > :minsup
37
Variations of GatherJoin – GatherCount


Perform the GROUP BY inside the table
function Comb-K for pass 2 optimization
Output of the table function Comb-K



Not the candidate frequent itemsets (Ck)
But the actual frequent itemsets (Fk) along with
the corresponding support
Use a 2-dimensional array to store possible
frequent itemsets in Comb-K

May lead to excessive memory use
38
Variations of GatherJoin – GatherPrune


Push the join with Ck into the table function
Comb-K
Ck is converted into a BLOB and passed as an
argument to the table function.

The BLOB must be passed on each invocation of Comb-K – once per row of table T
39
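A hypothetical sketch of what GatherPrune might look like, assuming a host variable :ckBlob holding Ck serialized as a BLOB and a Comb-K variant that emits only the combinations present in that BLOB (names and signature are assumptions, not from the papers):

insert into Fk select t2.T_itm1, …, t2.T_itmk, count(*)
from T, table(Gather(T.tid, T.item)) as t1,
     table(Comb-K(:ckBlob, t1.tid, t1.item-list)) as t2
group by t2.T_itm1, …, t2.T_itmk
having count(*) > :minsup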
SQL Object-Relational
Extension: Vertical



For each item, create a BLOB containing the tids of the transactions the item appears in
 Use function Gather to generate {item, tid-list} pairs, storing results in table TidTable
 Tid-lists are all in the same sorted order
Use function Intersect to compare two
different tid-lists and extract common values
Pass-2 optimization can be used for Vertical


Similar to K-way join method
Upcoming example does not show optimization
40
Vertical
insert into Fk select item1, …, itemk, count(tid-list) as cnt
from (Subquery Qk) t where cnt > :minsup
Subquery Qn (for any n between 2 and k)
Select item1, …, itemn,
Intersect(rn-1.tid-list, tn.tid-list) as tid-list
from TidTable tn, (Subquery Qn-1) as rn-1,
     (select distinct item1, …, itemn from Ck) as dn
where rn-1.item1 = dn.item1 and … and
rn-1.itemn-1 = dn.itemn-1 and
tn.item = dn.itemn
Subquery Q1: (select * from TidTable)
41
Example of Vertical

Using previous example from class


C3 = {B,C,E}, minimum support = 2
Q1 is TidTable
(item, tid-list): (A, {10,30}), (B, {20,30,40}), (C, {10,20,30}), (D, {10}), (E, {20,30,40})
42
Example of Vertical cnt’d

Q2 becomes
Select item1, item2, Intersect(r1.tid-list, t2.tid-list) as tid-list
from TidTable t2, (Subquery Q1) as r1,
(select distinct item1, item2 from C3) as d2
where r1.item1 = d2.item1 and t2.item = d2.item2
43
Example of Vertical cnt’d

Q3 becomes
select item1, item2, item3, intersect(r2.tid-list, t3.tid-list) as tid-list
from TidTable t3, (Subquery Q2) as r2,
(select distinct item1, item2, item3 from C3) as d3
where r2.item1 = d3.item1 and r2.item2 = d3.item2 and
t3.item = d3.item3
44
Performance Comparisons
using SQL-OR
 Datasets:
Dataset     # Records (millions)   # Transactions (millions)   # Items (thousands)   Avg # Items
Dataset-A   2.5                    0.57                        85                    4.4
Dataset-B   7.5                    2.5                         15.8                  2.62
[Charts: total time in seconds for Gcnt, Gjoin, Gprun, and Vert, broken down into Prep, Pass 1, Pass 2, Pass 3, and Pass 4; Dataset-A at supports of 0.5%, 0.35%, and 0.20%, Dataset-B at supports of 0.10%, 0.03%, and 0.01%.]
45
Performance Comparisons
using SQL-OR
 Datasets:
Dataset     # Records (millions)   # Transactions (millions)   # Items (thousands)   Avg # Items
Dataset-C   6.6                    0.21                        15.8                  31
Dataset-D   14                     1.44                        480                   9.62
[Charts: total time in seconds for Gcnt, Gjoin, Gprun, and Vert, broken down into Prep, Pass 1, Pass 2, Pass 3, and Pass 4; Dataset-C at supports of 2.0%, 1.0%, and 0.25%, Dataset-D at supports of 0.2%, 0.07%, and 0.02%.]
46
Performance comparison of SQL
object-relational approaches

Vertical has best overall performance, sometimes an
order of magnitude better than other 3 approaches



Pass-2 optimization has huge impact on performance
of GatherJoin


Majority of the time is spent transforming the data into {item, tid-list} pairs
Vertical spends too much time on the second pass
For Dataset-B with support of 0.1 %, running time for Pass 2
went from 5.2 hours to 10 minutes
Comb-K in GatherJoin generates a large number of potential frequent itemsets that must be processed
47
Hybrid approach

Previous charts and algorithm analysis show



Vertical spends too much time on pass 2 compared to other
algorithms, especially when the support is decreased
GatherJoin degrades when the # of frequent items per
transaction increases
To improve performance, use a hybrid algorithm



Use Vertical for most cases
When size of candidate itemset is too large, GatherJoin is a
good option if number of frequent items per transaction (Nf)
is not too large
When Nf is large, GatherCount may be the only good option
48
Architecture Comparisons

Compare five alternatives

Loose-Coupling, Stored-procedure





Basically the same except for the address space the program runs in
 Because the performance difference is small, the following charts show only the stored procedure
Cache-Mine
UDF tight-coupling
Best SQL approach (Hybrid)
49
Performance Comparisons of
Architectures
Datasets:
Dataset     # Records (millions)   # Transactions (millions)   # Items (thousands)   Avg # Items
Dataset-A   2.5                    0.57                        85                    4.4
Dataset-B   7.5                    2.5                         15.8                  2.62
[Charts: total time in seconds for Cache, Sproc, UDF, and SQL, broken down into Pass 1, Pass 2, Pass 3, and Pass 4; Dataset-A at supports of 0.5%, 0.35%, and 0.2%, Dataset-B at supports of 0.1%, 0.03%, and 0.01%.]
50
Performance Comparisons of
Architectures cnt’d
Datasets:
Dataset     # Records (millions)   # Transactions (millions)   # Items (thousands)   Avg # Items
Dataset-C   6.6                    0.21                        15.8                  31
Dataset-D   14                     1.44                        480                   9.62
[Charts: total time in seconds for Cache, Sproc, UDF, and SQL, broken down into Pass 1, Pass 2, Pass 3, and Pass 4; Dataset-C at supports of 2.0%, 1.00%, and 0.25%, Dataset-D at supports of 0.2%, 0.07%, and 0.02%.]
51
Performance Comparisons of
Architectures cnt’d

Cache-Mine is the best or close to the best
performance in all cases


Between 0.8 and 2 times as fast as the SQL approach
Stored procedure is the worst

Difference between Cache-Mine and Stored-procedure is directly related to the number of passes through the data



Passes increase when the support goes down
May need to make multiple passes if all candidates cannot fit in
memory
UDF time per pass decreases 30-50% compared to
stored procedure because of tighter coupling with DB
52
Performance Comparisons of
Architectures cnt’d

SQL approach comes in second in performance to
Cache-Mine




Somewhat better than Cache-Mine for high support values
1.8 – 3 times better than Stored-procedure/loose-coupling
approach, getting better when support value decreases
Cost of converting to Vertical format is less than cost of
converting to binary format in Cache-Mine
For second pass through data, SQL approach takes much
more time than Cache-Mine, particularly when we decrease
the support
53
Organization of Presentation





Overview – Data Mining and RDBMS
Loosely-coupled data and programs
Tightly-coupled data and programs
Architectural approaches
Methods of writing efficient SQL




Candidate generation, pruning, support counting
K-way join, SubQuery, GatherJoin, Vertical, Hybrid
Integrating taxonomies
Mining sequential patterns
54
Taxonomies - example
Beverages → Soft Drinks, Alcoholic Drinks
  Soft Drinks → Pepsi, Coke
  Alcoholic Drinks → Beer
Snacks → Pretzels, Chocolate Bar
Example rule: Soft Drinks ⇒ Pretzels with 30% confidence, 2% support
Parent / Child table:
(Beverages, Soft Drinks), (Beverages, Alcoholic Drinks), (Soft Drinks, Pepsi), (Soft Drinks, Coke),
(Alcoholic Drinks, Beer), (Snacks, Pretzels), (Snacks, Chocolate Bar)
55
Taxonomy augmentation


Algorithms similar to previous slides
Requires two additions to algorithm



Pruning itemsets containing an item and its
ancestor
Pre-computing the ancestors for each item
Will also consider support counting
56
Pruning items and ancestors
In the second pass we will join F1 with
F1 to give C2
 This will give, for example:
beverages,pepsi
snacks,coke
pretzels,chocolate bar
 But beverages,pepsi is redundant!

57
Pruning items and ancestors
The following modification to the SQL
statement eliminates such redundant
combinations from being selected:
insert into C2 (select I1.item1,
I2.item1 from F1 I1, F1 I2
where I1.item1 < I2.item1) except
(select ancestor, descendant from
Ancestor union
select descendant, ancestor from
Ancestor)

58
Pre-computing ancestors

An ancestor table is created


Format (ancestor, descendant)
Use the transitive closure operation
insert into Ancestor
with R-Tax (ancestor, descendant) as
  (select parent, child from Tax
   union all
   select p.ancestor, c.child from R-Tax p, Tax c
   where p.descendant = c.parent)
select ancestor, descendant from R-Tax
59
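Applied to the example taxonomy of slide 55, this query would fill Ancestor with the direct parent–child pairs plus the transitive ones:
(Beverages, Soft Drinks), (Beverages, Alcoholic Drinks), (Soft Drinks, Pepsi), (Soft Drinks, Coke),
(Alcoholic Drinks, Beer), (Snacks, Pretzels), (Snacks, Chocolate Bar),
(Beverages, Pepsi), (Beverages, Coke), (Beverages, Beer)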
Support Counting

Extensions to handle taxonomies



Straightforward, but
Non-trivial
Need an extended transaction table


For example, if we have {coke, pretzels}
We add also {soft drinks, pretzels},
{beverages, pretzels}, {coke, snacks},
{soft drinks, snacks}, {beverages, snacks}
60
Extended transaction table
Can be obtained by the following SQL
Query to generate T*
select item, tid from T union
select distinct A.ancestor as item, T.tid
from T, Ancestor A
where A.descendant = T.item
The “select distinct” clause prevents duplicate rows for a common ancestor – e.g. a transaction
containing both pepsi and coke should have beverages added only once

61
Pipelining of Query


No need to actually build T*
Make following modification to SQL:
insert into Fk with T*(tid, item) as (Query for T*)
select item1,…,itemk,count(*)
from Ck, T* t1, …, T* tk
where t1.item = Ck.item1 and … and
      tk.item = Ck.itemk and
      t1.tid = t2.tid and … and
      tk-1.tid = tk.tid
group by item1,…,itemk
having count(*) > :minsup
62
Organization of Presentation





Overview – Data Mining and RDBMS
Loosely-coupled data and programs
Tightly-coupled data and programs
Architectural approaches
Methods of writing efficient SQL




Candidate generation, pruning, support counting
K-way join, SubQuery, GatherJoin, Vertical, Hybrid
Integrating taxonomies
Mining sequential patterns
63
Sequential patterns


Similar to papers covered on Nov 17
Input is sequences of transactions



E.g. ((computer,modem),(printer))
Similar to association rules, but dealing with
sequences as opposed to sets
Can also specify maximum and minimum time
gaps, as well as sliding time windows

Max-gap, min-gap, window-size
64
Input and output formats

Input has three columns:




Sequence identifier (sid)
Transaction time (time)
Item identifier (item)
Output format is a collection of frequent
sequences, in a fixed-width table


(item1, eno1,…,itemk, enok, len)
For smaller lengths, extra column values are set to
NULL
65
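A minimal sketch of these formats as tables (the data-sequence table is called D on slide 69; the output table name, the column types, and the maximum length of 3 are assumptions):

create table D (sid integer, time integer, item integer);      -- input data-sequences
create table FreqSeq (item1 integer, eno1 integer,
                      item2 integer, eno2 integer,
                      item3 integer, eno3 integer,
                      len integer);                             -- frequent sequences; unused columns are NULL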
GSP algorithm



Similar to algorithms shown earlier
Each Ck stores items and element times, but needs no len column – every candidate in Ck has
fixed length k
Candidates are generated in two steps

Join – join Fk-1 with itself



Sequence s1 joins with s2 if the subsequence obtained by
dropping the first item of s1 is the same as the one obtained by
dropping the last item of s2
When generating C2, we need to generate sequences where
both of the items appear as a single element as well as two
separate elements
Prune

All candidate sequences that have a non-frequent contiguous
(k-1) subsequence are deleted
66
GSP – Join SQL
insert into Ck
select I1.item1, I1.eno1, ..., I1.itemk-1, I1.enok-1,
       I2.itemk-1, I1.enok-1 + I2.enok-1 – I2.enok-2
from Fk-1 I1, Fk-1 I2
where I1.item2 = I2.item1 and ... and I1.itemk-1 = I2.itemk-2 and
      I1.eno3 – I1.eno2 = I2.eno2 – I2.eno1 and ... and
      I1.enok-1 – I1.enok-2 = I2.enok-2 – I2.enok-3
67
GSP – Prune SQL



Write as a k-way join, similar to before
There are at most k contiguous subsequences
of length (k-1) for which Fk-1 needs to be
checked for membership
Note that all (k-1) subsequences may not be
contiguous because of the max-gap
constraint between consecutive elements.
68
GSP – Support Counting


In each pass, we use the candidate table Ck
and the input data-sequences table D to
count the support
K-way join


We use select distinct before the group by to
ensure that only distinct data-sequences are
counted
We have additional predicates between sequence
numbers to handle the special time elements
69
GSP – Support Counting SQL
(Ck.enoj = Ck.enoi and abs(dj.time – di.time) ≤ window-size) or
(Ck.enoj = Ck.enoi + 1 and dj.time – di.time ≤ max-gap and dj.time – di.time > min-gap) or
(Ck.enoj > Ck.enoi + 1)
70
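Combining slide 69 with the predicate above, the support-counting query for length-k candidates might look roughly like this (a sketch, not the papers' exact formulation; D is the data-sequences table):

insert into Fk select item1, eno1, …, itemk, enok, count(*)
from (select distinct Ck.item1, Ck.eno1, …, Ck.itemk, Ck.enok, d1.sid
      from Ck, D d1, …, D dk
      where d1.item = Ck.item1 and … and dk.item = Ck.itemk and
            d1.sid = d2.sid and … and dk-1.sid = dk.sid and
            <predicates of the form shown above for every pair (di, dj), i < j>) t
group by item1, eno1, …, itemk, enok
having count(*) > :minsup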
References
1. Developing Tightly-Coupled Data Mining Applications on a Relational Database System
   Rakesh Agrawal, Kyuseok Shim, 1996
2. Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications
   Sunita Sarawagi, Shiby Thomas, Rakesh Agrawal, 1998
   Refers to 1) above
3. Mining Generalized Association Rules and Sequential Patterns Using SQL Queries
   Shiby Thomas, Sunita Sarawagi, 1998
   Refers to 1) and 2) above
71