Association Rules Mining
with SQL
Kirsten Nelson
Deepen Manek
November 24, 2003
1
Organization of Presentation
Overview – Data Mining and RDBMS
Loosely-coupled data and programs
Tightly-coupled data and programs
Architectural approaches
Methods of writing efficient SQL
Candidate generation, pruning, support counting
K-way join, SubQuery, GatherJoin, Vertical, Hybrid
Integrating taxonomies
Mining sequential patterns
2
Early data mining applications
Most early mining systems were developed
largely on file systems, with specialized data
structures and buffer management strategies
devised for each
All data was read into memory before
beginning computation
This limits the amount of data that can be
mined
3
Advantage of SQL and RDBMS
Make use of database indexing and query
processing capabilities
More than a decade spent on making these
systems robust, portable, scalable, and
concurrent
Exploit underlying SQL parallelization
For long-running algorithms, use
checkpointing and space management
4
Use of Database in Data Mining
“Loose coupling” of application and data
How would you write an Apriori program?
Use SQL statements in an application
Use a cursor interface to read through records sequentially for each pass
Still two major performance problems:
Copying of records from the database to application memory
A process context switch for each record retrieved
6
Tightly-coupled applications
Push computations into the database system
to avoid performance degradation
Take advantage of user-defined functions
(UDFs)
Does not require changes to database
software
Two types of UDFs we will use:
Ones that are executed only a few times,
regardless of the number of rows
Ones that are executed once for each selected
row
8
Tight-coupling using UDFs
Procedure TightlyCoupledApriori():
begin
  exec sql connect to database;
  exec sql select allocSpace() into :blob from onerecord;
  exec sql select * from sales where GenL1(:blob, TID, ITEMID) = 1;
  notDone := true;
9
Tight-coupling using UDFs
  while notDone do {
    exec sql select aprioriGen(:blob) into :blob from onerecord;
    exec sql select * from sales where itemCount(:blob, TID, ITEMID) = 1;
    exec sql select GenLk(:blob) into :notDone from onerecord;
  }
10
Tight-coupling using UDFs
  exec sql select getResult(:blob) into :resultBlob from onerecord;
  exec sql select deallocSpace(:blob) from onerecord;
  compute Answer using resultBlob;
end
11
Methodology
Comparison performed for association rules mining on IBM DB2
Only consider generation of frequent itemsets using
Apriori algorithm
Five alternatives considered:
Loose-coupling through SQL cursor interface – as described earlier
UDF tight-coupling – as described earlier
Stored-procedure to encapsulate mining algorithm
Cache-mine – caching data and mining on the fly
SQL implementations to force processing in the database
Consider two classes of implementations
SQL-92 – four different implementations
SQL-OR (with object relational extensions) – six implementations
13
Architectural Options
Stored procedure
Apriori algorithm encapsulated as a stored procedure
Implication: runs in the same address space as the DBMS
Mined results stored back into the DBMS.
Cache-mine
Variation of stored-procedure
Read entire data once from DBMS, temporarily cache data in
a lookaside buffer on a local disk
Cached data is discarded when execution completes
Disadvantage – requires additional disk space for caching
Use Intelligent Miner’s “space” option
14
Terminology
Use the following terminology:
T: transaction table of {tid, item} pairs
Data is normally sorted by transaction id
Ck: candidate k-itemsets
Obtained by joining and pruning the frequent itemsets of the previous iteration
Fk: frequent itemsets of length k
Obtained from Ck and T
16
Candidate Generation in SQL – join step
Generate Ck from Fk-1 by joining Fk-1 with itself:

insert into C_k
select I1.item_1, ..., I1.item_{k-1}, I2.item_{k-1}
from F_{k-1} I1, F_{k-1} I2
where I1.item_1 = I2.item_1 and
      ...
      I1.item_{k-2} = I2.item_{k-2} and
      I1.item_{k-1} < I2.item_{k-1}
Candidate Generation Example
F3 is {{1,2,3},{1,2,4},{1,3,4},{1,3,5},{2,3,4}}
C4 is {{1,2,3,4},{1,3,4,5}}

Table F3 (both copies I1 and I2):

item1  item2  item3
1      2      3
1      2      4
1      3      4
1      3      5
2      3      4
18
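The join step above can be checked with a small runnable sketch. SQLite stands in for DB2 here (an assumption; the deck targets DB2 UDB), using the F3 from this slide:

```python
import sqlite3

# Sketch of the candidate-generation join step, SQLite standing in for DB2.
conn = sqlite3.connect(":memory:")
conn.execute("create table F3 (item1, item2, item3)")
conn.executemany("insert into F3 values (?, ?, ?)",
                 [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)])

# Self-join F3 on the first k-2 items to generate the candidate 4-itemsets C4.
c4 = conn.execute("""
    select I1.item1, I1.item2, I1.item3, I2.item3
    from F3 I1, F3 I2
    where I1.item1 = I2.item1
      and I1.item2 = I2.item2
      and I1.item3 < I2.item3
""").fetchall()
print(sorted(c4))  # → [(1, 2, 3, 4), (1, 3, 4, 5)]
```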
Pruning
Modify the candidate generation algorithm to ensure that all subsets of length (k-1) of each candidate in Ck are in Fk-1
Do a k-way join, skipping item_{n-2} when joining with the nth table (2 < n ≤ k)
Create a primary index (item_1, ..., item_{k-1}) on Fk-1 to efficiently process the k-way join
For k=4, this becomes:

insert into C4
select I1.item1, I1.item2, I1.item3, I2.item3
from F3 I1, F3 I2, F3 I3, F3 I4
where I1.item1 = I2.item1 and I1.item2 = I2.item2 and I1.item3 < I2.item3 and
      I1.item2 = I3.item1 and I1.item3 = I3.item2 and I2.item3 = I3.item3 and
      I1.item1 = I4.item1 and I1.item3 = I4.item2 and I2.item3 = I4.item3
19
Pruning Example
Evaluate join with I3 using previous example
C4 is {1,2,3,4}
Each of the three copies of Table F3 (I1, I2, I3) contains:

item1  item2  item3
1      2      3
1      2      4
1      3      4
1      3      5
2      3      4
20
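The pruning join can be verified the same way. This sketch (again with SQLite standing in for DB2) shows that {1,3,4,5} is dropped because its subset {3,4,5} is not frequent:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table F3 (item1, item2, item3)")
conn.executemany("insert into F3 values (?, ?, ?)",
                 [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)])

c4 = conn.execute("""
    select I1.item1, I1.item2, I1.item3, I2.item3
    from F3 I1, F3 I2, F3 I3, F3 I4
    where I1.item1 = I2.item1 and I1.item2 = I2.item2
      and I1.item3 < I2.item3
      -- join with I3: subset {item2, item3, item4} must be in F3
      and I1.item2 = I3.item1 and I1.item3 = I3.item2 and I2.item3 = I3.item3
      -- join with I4: subset {item1, item3, item4} must be in F3
      and I1.item1 = I4.item1 and I1.item3 = I4.item2 and I2.item3 = I4.item3
""").fetchall()
print(c4)  # → [(1, 2, 3, 4)]; {1,3,4,5} is pruned since {3,4,5} is not in F3
```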
Support counting using SQL
Two different approaches
Use the SQL-92 standard
Use ‘standard’ SQL syntax such as joins and subqueries
to find support of itemsets
Use object-relational extensions of SQL (SQL-OR)
User Defined Functions (UDFs) & table functions
Binary Large Objects (BLOBs)
21
Support Counting using SQL-92
Four different methods, two of which are detailed in the papers
K-way Joins
SubQuery
Other methods not discussed because of
unacceptable performance
3-way join
2 Group-Bys
22
SQL-92: K-way join
Obtain Fk by joining Ck with k copies of the table T of (tid, item) pairs
Perform a group by on the itemset:

insert into F_k
select item_1, ..., item_k, count(*)
from C_k, T t_1, ..., T t_k
where t_1.item = C_k.item_1 and ... and
      t_k.item = C_k.item_k and
      t_1.tid = t_2.tid and ... and
      t_{k-1}.tid = t_k.tid
group by item_1, ..., item_k
having count(*) > :minsup
23
K-way join example
C3 = {B,C,E} and the minimum support required is 2
Transactions 20 (B,C,E) and 30 (A,B,C,E) both contain B, C, and E, so insert {B,C,E,2} into F3
24
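A runnable sketch of this pass, with SQLite in place of DB2 and the four-transaction class example (the same data shown later in the GatherJoin example). The example requires support ≥ 2, so the having clause is written with >=:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table T (tid, item)")
conn.executemany("insert into T values (?, ?)", [
    (10, "A"), (10, "C"), (10, "D"),
    (20, "B"), (20, "C"), (20, "E"),
    (30, "A"), (30, "B"), (30, "C"), (30, "E"),
    (40, "B"), (40, "E"),
])
conn.execute("create table C3 (item1, item2, item3)")
conn.execute("insert into C3 values ('B', 'C', 'E')")

# 3-way join of T with itself, constrained by the candidate itemset.
f3 = conn.execute("""
    select C3.item1, C3.item2, C3.item3, count(*)
    from C3, T t1, T t2, T t3
    where t1.item = C3.item1 and t2.item = C3.item2 and t3.item = C3.item3
      and t1.tid = t2.tid and t2.tid = t3.tid
    group by C3.item1, C3.item2, C3.item3
    having count(*) >= 2
""").fetchall()
print(f3)  # → [('B', 'C', 'E', 2)]
```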
K-way join: Pass-2 optimization
When calculating C2, no pruning is required after we join F1 with itself
Don't calculate and materialize C2 – replace C2 in the 2-way join algorithm with a join of F1 with itself:

insert into F2
select I1.item1, I2.item1, count(*)
from F1 I1, F1 I2, T t1, T t2
where I1.item1 < I2.item1 and
      t1.item = I1.item1 and t2.item = I2.item1 and
      t1.tid = t2.tid
group by I1.item1, I2.item1
having count(*) > :minsup
25
SQL-92: SubQuery based
Split support counting into a cascade of k subqueries
The nth subquery Qn finds all tids that match the distinct itemsets formed by the first n items of Ck

insert into F_k
select item_1, ..., item_k, count(*)
from (Subquery Q_k) t
group by item_1, item_2, ..., item_k
having count(*) > :minsup

Subquery Q_n (for any n between 1 and k):

select item_1, ..., item_n, tid
from T t_n, (Subquery Q_{n-1}) as r_{n-1},
     (select distinct item_1, ..., item_n from C_k) as d_n
where r_{n-1}.item_1 = d_n.item_1 and ... and
      r_{n-1}.item_{n-1} = d_n.item_{n-1} and
      r_{n-1}.tid = t_n.tid and t_n.item = d_n.item_n
26
Example of SubQuery based
Using previous example from class
C3 = {B,C,E}, minimum support = 2
Q0: there is no subquery Q0
Q1 in this case becomes:

select item1, tid
from T t1,
     (select distinct item1 from C3) as d1
where t1.item = d1.item1
27
Example of SubQuery based cnt’d
Q2 becomes:

select item1, item2, tid
from T t2, (Subquery Q1) as r1,
     (select distinct item1, item2 from C3) as d2
where r1.item1 = d2.item1 and r1.tid = t2.tid and t2.item = d2.item2
28
Example of SubQuery based cnt’d
Q3 becomes:

select item1, item2, item3, tid
from T t3, (Subquery Q2) as r2,
     (select distinct item1, item2, item3 from C3) as d3
where r2.item1 = d3.item1 and r2.item2 = d3.item2 and
      r2.tid = t3.tid and t3.item = d3.item3
29
Example of SubQuery based cnt’d
Output of Q3 is:

Item1  Item2  Item3  Tid
B      C      E      20
B      C      E      30
Insert statement becomes
insert into F3
select item1, item2, item3, count(*)
from (Subquery Q3) t
group by item1, item2, item3
having count(*) > :minsup
Insert the row {B,C,E,2}
For Q2, pass-2 optimization can be used
30
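The full subquery cascade can be run end-to-end. This sketch expresses Q1–Q3 as common table expressions in SQLite (an assumption for readability; the deck's SQL nests the subqueries directly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table T (tid, item)")
conn.executemany("insert into T values (?, ?)", [
    (10, "A"), (10, "C"), (10, "D"),
    (20, "B"), (20, "C"), (20, "E"),
    (30, "A"), (30, "B"), (30, "C"), (30, "E"),
    (40, "B"), (40, "E"),
])
conn.execute("create table C3 (item1, item2, item3)")
conn.execute("insert into C3 values ('B', 'C', 'E')")

# Each Qn narrows the tids matching the first n items of the candidate.
f3 = conn.execute("""
    with Q1 as (
        select d1.item1, t1.tid
        from T t1, (select distinct item1 from C3) as d1
        where t1.item = d1.item1),
    Q2 as (
        select d2.item1, d2.item2, t2.tid
        from T t2, Q1 r1, (select distinct item1, item2 from C3) as d2
        where r1.item1 = d2.item1 and r1.tid = t2.tid and t2.item = d2.item2),
    Q3 as (
        select d3.item1, d3.item2, d3.item3, t3.tid
        from T t3, Q2 r2, (select distinct item1, item2, item3 from C3) as d3
        where r2.item1 = d3.item1 and r2.item2 = d3.item2
          and r2.tid = t3.tid and t3.item = d3.item3)
    select item1, item2, item3, count(*) from Q3
    group by item1, item2, item3 having count(*) >= 2
""").fetchall()
print(f3)  # → [('B', 'C', 'E', 2)]
```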
Performance Comparisons of
SQL-92 approaches
Used Version 5 of DB2 UDB and RS/6000 Model 140
200 Mhz CPU, 256 MB main memory, 9 GB of disk space,
Transfer rate of 8 MB/sec
Used 4 different item sets based on real-world data
Built the following indexes, which are not included in
any cost calculations
Composite index (item1, …, itemk) on Ck
k different indices on each of the k items in Ck
(item,tid) and (tid,item) indexes on the data table T
31
Performance Comparisons of
SQL-92 approaches
Datasets    # records (millions)  # Transactions (millions)  # Items (thousands)  Avg # Items
Dataset-A   2.5                   0.57                       85                   4.4
Dataset-B   7.5                   2.5                        15.8                 2.62
Dataset-C   6.6                   0.21                       15.8                 31
Dataset-D   14                    1.44                       480                  9.62
Best performance was obtained by the SubQuery approach
Even so, SubQuery was only comparable to loose-coupling in some cases, and failed to complete in others
For Dataset-C at a support of 2%, SubQuery outperforms loose-coupling, but decreasing the support to 1% makes SubQuery take 10 times as long to complete
Lower support increases the size of Ck and Fk at each step, causing the join to process more rows
32
Support Counting using SQL with
object-relational extensions
Six different methods, four of which are detailed in the papers
GatherJoin
GatherCount
GatherPrune
Vertical
Other methods not discussed because of
unacceptable performance
Horizontal
SBF
33
SQL Object-Relational Extension:
GatherJoin
Generates all possible k-item combinations of items
contained in a transaction and joins them with Ck
An index is created on all items of Ck
Uses the following table functions
Gather: Outputs records {tid,item-list}, with item-list being a
BLOB or VARCHAR containing all items associated with the
tid
Comb-K: returns all k-item combinations from the
transaction
Output has k attributes T_itm1, …, T_itmk
34
GatherJoin
insert into F_k
select item_1, ..., item_k, count(*)
from C_k,
     (select t2.T_itm1, ..., t2.T_itmk from T,
      table(Gather(T.tid, T.item)) as t1,
      table(Comb-K(t1.tid, t1.item-list)) as t2)
where t2.T_itm1 = C_k.item_1 and ... and
      t2.T_itmk = C_k.item_k
group by C_k.item_1, ..., C_k.item_k
having count(*) > :minsup
35
Example of GatherJoin
t1 (output from Gather) looks like:

Tid  Item-List
10   A,C,D
20   B,C,E
30   A,B,C,E
40   B,E

t2 (generated by Comb-K from t1) will be joined with C3 to obtain F3:
1 row from Tid 10, 1 row from Tid 20, 4 rows from Tid 30
Insert {B,C,E,2}
36
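Since Gather and Comb-K are DB2 table-function extensions, this sketch mimics them in plain Python; the function shapes are assumptions based on the slide's description:

```python
from collections import defaultdict
from itertools import combinations

# (tid, item) rows from the class example.
T = [(10, "A"), (10, "C"), (10, "D"),
     (20, "B"), (20, "C"), (20, "E"),
     (30, "A"), (30, "B"), (30, "C"), (30, "E"),
     (40, "B"), (40, "E")]

def gather(rows):
    """Stand-in for Gather: collapse (tid, item) rows into (tid, item-list)."""
    by_tid = defaultdict(list)
    for tid, item in rows:
        by_tid[tid].append(item)
    return sorted(by_tid.items())

def comb_k(item_list, k):
    """Stand-in for Comb-K: all k-item combinations of one transaction."""
    return list(combinations(sorted(item_list), k))

C3 = {("B", "C", "E")}
support = defaultdict(int)
for tid, items in gather(T):
    for combo in comb_k(items, 3):
        if combo in C3:          # the join with Ck
            support[combo] += 1

F3 = [c + (n,) for c, n in support.items() if n >= 2]
print(F3)  # → [('B', 'C', 'E', 2)]
```

Note that tid 40 contributes no rows: with only 2 items it has no 3-item combinations, matching the slide's row counts.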
GatherJoin: Pass 2 optimization
When calculating C2, no pruning is required after we join
F1 with itself
Don’t calculate and materialize C2 - replace C2 with a join
to F1 before the table function
Gather is only passed frequent 1-itemset rows
insert into F2
select t2.T_itm1, t2.T_itm2, count(*)
from F1 I1, T,
     table(Gather(T.tid, T.item)) as t1,
     table(Comb-K(t1.tid, t1.item-list)) as t2
where T.item = I1.item1
group by t2.T_itm1, t2.T_itm2
having count(*) > :minsup
37
Variations of GatherJoin GatherCount
Perform the GROUP BY inside the table
function Comb-K for pass 2 optimization
Output of the table function Comb-K
Not the candidate frequent itemsets (Ck)
But the actual frequent itemsets (Fk) along with
the corresponding support
Use a 2-dimensional array to store possible
frequent itemsets in Comb-K
May lead to excessive memory use
38
Variations of GatherJoin GatherPrune
Push the join with Ck into the table function
Comb-K
Ck is converted into a BLOB and passed as an
argument to the table function.
The BLOB must be passed on each invocation of Comb-K – once per row of table T
39
SQL Object-Relational
Extension: Vertical
For each item, create a BLOB containing the tids of the transactions it appears in
Use function Gather to generate {item, tid-list} pairs, storing the results in table TidTable
Tid-lists are all in the same sorted order
Use function Intersect to compare two different tid-lists and extract common values
Pass-2 optimization can be used for Vertical
Similar to K-way join method
Upcoming example does not show optimization
40
Vertical
insert into F_k
select item_1, ..., item_k, count(tid-list) as cnt
from (Subquery Q_k) t
where cnt > :minsup

Subquery Q_n (for any n between 2 and k):

select item_1, ..., item_n,
       Intersect(r_{n-1}.tid-list, t_n.tid-list) as tid-list
from TidTable t_n, (Subquery Q_{n-1}) as r_{n-1},
     (select distinct item_1, ..., item_n from C_k) as d_n
where r_{n-1}.item_1 = d_n.item_1 and ... and
      r_{n-1}.item_{n-1} = d_n.item_{n-1} and
      t_n.item = d_n.item_n

Subquery Q_1: (select * from TidTable)
41
Example of Vertical
Using previous example from class
C3 = {B,C,E}, minimum support = 2
Q1 is TidTable:

Item  Tid-List
A     10,30
B     20,30,40
C     10,20,30
D     10
E     20,30,40
42
Example of Vertical cnt’d
Q2 becomes
select item1, item2, Intersect(r1.tid-list, t2.tid-list) as tid-list
from TidTable t2, (Subquery Q1) as r1,
     (select distinct item1, item2 from C3) as d2
where r1.item1 = d2.item1 and t2.item = d2.item2
43
Example of Vertical cnt’d
Q3 becomes
select item1, item2, item3, Intersect(r2.tid-list, t3.tid-list) as tid-list
from TidTable t3, (Subquery Q2) as r2,
     (select distinct item1, item2, item3 from C3) as d3
where r2.item1 = d3.item1 and r2.item2 = d3.item2 and
      t3.item = d3.item3
44
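The Intersect UDF over BLOB tid-lists can be mimicked with Python sets. This sketch runs the Q1 → Q2 → Q3 cascade for C3 = {B,C,E} on the TidTable from the example:

```python
# Python sets stand in for the BLOB tid-lists and the Intersect UDF.
tidtable = {"A": {10, 30}, "B": {20, 30, 40}, "C": {10, 20, 30},
            "D": {10}, "E": {20, 30, 40}}

def intersect(a, b):
    """Stand-in for the Intersect UDF over two sorted tid-lists."""
    return a & b

q1 = tidtable["B"]                 # Q1: tid-list of the first item
q2 = intersect(q1, tidtable["C"])  # Q2: tids containing {B, C}
q3 = intersect(q2, tidtable["E"])  # Q3: tids containing {B, C, E}
print(sorted(q3), len(q3))  # → [20, 30] 2, so ('B','C','E',2) enters F3
```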
Performance Comparisons
using SQL-OR
Datasets    # records (millions)  # Transactions (millions)  # Items (thousands)  Avg # Items
Dataset-A   2.5                   0.57                       85                   4.4
Dataset-B   7.5                   2.5                        15.8                 2.62

[Charts: running time in seconds of Gcnt, Gjoin, Gprun, and Vert on Data Set A (supports 0.5%, 0.35%, 0.20%) and Data Set B (supports 0.10%, 0.03%, 0.01%), with each bar broken down into Prep, Pass 1, Pass 2, Pass 3, and Pass 4.]
45
Performance Comparisons
using SQL-OR
Datasets    # records (millions)  # Transactions (millions)  # Items (thousands)  Avg # Items
Dataset-C   6.6                   0.21                       15.8                 31
Dataset-D   14                    1.44                       480                  9.62

[Charts: running time in seconds of Gcnt, Gjoin, Gprun, and Vert on Data Set C (supports 2.0%, 1.0%, 0.25%) and Data Set D (supports 0.2%, 0.07%, 0.02%), with each bar broken down into Prep, Pass 1, Pass 2, Pass 3, and Pass 4.]
46
Performance comparison of SQL
object-relational approaches
Vertical has the best overall performance, sometimes an order of magnitude better than the other three approaches
The majority of its time is spent transforming the data into {item, tid-list} pairs
Pass-2 optimization has a huge impact on the performance of GatherJoin
Vertical spends too much time on the second pass
For Dataset-B with support of 0.1%, the running time for Pass 2 went from 5.2 hours to 10 minutes
Comb-K in GatherJoin generates a large number of potential frequent itemsets that must be processed
47
Hybrid approach
Previous charts and algorithm analysis show
Vertical spends too much time on pass 2 compared to other
algorithms, especially when the support is decreased
GatherJoin degrades when the # of frequent items per
transaction increases
To improve performance, use a hybrid algorithm
Use Vertical for most cases
When size of candidate itemset is too large, GatherJoin is a
good option if number of frequent items per transaction (Nf)
is not too large
When Nf is large, GatherCount may be the only good option
48
Architecture Comparisons
Compare five alternatives
Loose-Coupling, Stored-procedure
Basically the same except for the address space the program runs in
Because of limited difference in performance, focus
solely on stored procedure in following charts
Cache-Mine
UDF tight-coupling
Best SQL approach (Hybrid)
49
Performance Comparisons of
Architectures
Datasets    # records (millions)  # Transactions (millions)  # Items (thousands)  Avg # Items
Dataset-A   2.5                   0.57                       85                   4.4
Dataset-B   7.5                   2.5                        15.8                 2.62

[Charts: running time in seconds of Cache, Sproc, UDF, and SQL on Data Set A (supports 0.5%, 0.35%, 0.2%) and Data Set B (supports 0.1%, 0.03%, 0.01%), with each bar broken down into Pass 1 through Pass 4.]
50
Performance Comparisons of
Architectures cnt’d
Datasets    # records (millions)  # Transactions (millions)  # Items (thousands)  Avg # Items
Dataset-C   6.6                   0.21                       15.8                 31
Dataset-D   14                    1.44                       480                  9.62

[Charts: running time in seconds of Cache, Sproc, UDF, and SQL on Data Set C (supports 2.0%, 1.00%, 0.25%) and Data Set D (supports 0.2%, 0.07%, 0.02%), with each bar broken down into Pass 1 through Pass 4.]
51
Performance Comparisons of
Architectures cnt’d
Cache-Mine gives the best or close to the best performance in all cases
Factor of 0.8 to 2 times faster than the SQL approach
Stored procedure is the worst
The difference between Cache-Mine and Stored-procedure is directly related to the number of passes through the data
Passes increase when the support goes down
Multiple passes may be needed if all candidates cannot fit in memory
UDF time per pass decreases 30-50% compared to stored procedure because of tighter coupling with the DB
52
Performance Comparisons of
Architectures cnt’d
SQL approach comes in second in performance to
Cache-Mine
Somewhat better than Cache-Mine for high support values
1.8 – 3 times better than Stored-procedure/loose-coupling
approach, getting better when support value decreases
Cost of converting to Vertical format is less than cost of
converting to binary format in Cache-Mine
For second pass through data, SQL approach takes much
more time than Cache-Mine, particularly when we decrease
the support
53
Taxonomies - example
Taxonomy:
Beverages → Soft Drinks → {Pepsi, Coke}
Beverages → Alcoholic Drinks → {Beer}
Snacks → {Pretzels, Chocolate Bar}

Example rule: Soft Drinks → Pretzels with 30% confidence, 2% support

Parent            Child
Beverages         Soft Drinks
Beverages         Alcoholic Drinks
Soft Drinks       Pepsi
Soft Drinks       Coke
Alcoholic Drinks  Beer
Snacks            Pretzels
Snacks            Chocolate Bar
55
Taxonomy augmentation
Algorithms similar to previous slides
Requires two additions to algorithm
Pruning itemsets containing an item and its
ancestor
Pre-computing the ancestors for each item
Will also consider support counting
56
Pruning items and ancestors
In the second pass we will join F1 with
F1 to give C2
This will give, for example:
beverages,pepsi
snacks,coke
pretzels,chocolate bar
But beverages,pepsi is redundant!
57
Pruning items and ancestors
The following modification to the SQL
statement eliminates such redundant
combinations from being selected:
insert into C2
(select I1.item1, I2.item1
 from F1 I1, F1 I2
 where I1.item1 < I2.item1)
except
(select ancestor, descendant from Ancestor
 union
 select descendant, ancestor from Ancestor)
58
Pre-computing ancestors
An ancestor table is created
Format (ancestor, descendant)
Use the transitive closure operation
insert into Ancestor
with R-Tax (ancestor, descendant) as
  (select parent, child from Tax
   union all
   select p.ancestor, c.child
   from R-Tax p, Tax c
   where p.descendant = c.parent)
select ancestor, descendant from R-Tax
59
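SQLite's "with recursive" can play the role of DB2's recursive query. A sketch over the taxonomy from the earlier example slide (R-Tax is renamed RTax, since hyphens are not valid in the identifier):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Tax (parent, child)")
conn.executemany("insert into Tax values (?, ?)", [
    ("Beverages", "Soft Drinks"), ("Beverages", "Alcoholic Drinks"),
    ("Soft Drinks", "Pepsi"), ("Soft Drinks", "Coke"),
    ("Alcoholic Drinks", "Beer"),
    ("Snacks", "Pretzels"), ("Snacks", "Chocolate Bar"),
])
conn.execute("create table Ancestor (ancestor, descendant)")

# Transitive closure of the parent/child relation via a recursive CTE.
conn.execute("""
    with recursive RTax(ancestor, descendant) as (
        select parent, child from Tax
        union all
        select p.ancestor, c.child
        from RTax p, Tax c
        where p.descendant = c.parent)
    insert into Ancestor select ancestor, descendant from RTax
""")
rows = conn.execute("""
    select descendant from Ancestor
    where ancestor = 'Beverages' order by 1
""").fetchall()
print([r[0] for r in rows])
# → ['Alcoholic Drinks', 'Beer', 'Coke', 'Pepsi', 'Soft Drinks']
```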
Support Counting
Extensions to handle taxonomies are straightforward, but non-trivial
Need an extended transaction table
For example, if we have {coke, pretzels}
We add also {soft drinks, pretzels},
{beverages, pretzels}, {coke, snacks},
{soft drinks, snacks}, {beverages, snacks}
60
Extended transaction table
Can be obtained by the following SQL query to generate T*:

select item, tid from T
union
select distinct A.ancestor as item, T.tid
from T, Ancestor A
where A.descendant = T.item
The “select distinct” clause avoids duplicates when two items share an ancestor – e.g. we don't want (beverages, tid) added twice when a transaction contains both pepsi and coke
61
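A sketch of the T* query, with SQLite standing in for DB2 and a hand-built Ancestor table for one transaction containing Pepsi, Coke, and Pretzels:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table T (tid, item)")
conn.executemany("insert into T values (?, ?)",
                 [(100, "Pepsi"), (100, "Coke"), (100, "Pretzels")])
conn.execute("create table Ancestor (ancestor, descendant)")
conn.executemany("insert into Ancestor values (?, ?)", [
    ("Soft Drinks", "Pepsi"), ("Beverages", "Pepsi"),
    ("Soft Drinks", "Coke"), ("Beverages", "Coke"),
    ("Snacks", "Pretzels"),
])

# T*: each transaction is extended with the ancestors of its items;
# Beverages appears once even though both Pepsi and Coke map to it.
tstar = conn.execute("""
    select item, tid from T
    union
    select distinct A.ancestor as item, T.tid
    from T, Ancestor A
    where A.descendant = T.item
""").fetchall()
print(sorted(tstar))
```

The result holds the 3 original items plus 3 distinct ancestors, all under tid 100.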
Pipelining of Query
No need to actually build T*
Make following modification to SQL:
insert into F_k
with T*(tid, item) as (Query for T*)
select item_1, ..., item_k, count(*)
from C_k, T* t_1, ..., T* t_k
where t_1.item = C_k.item_1 and ... and
      t_k.item = C_k.item_k and
      t_1.tid = t_2.tid and ... and
      t_{k-1}.tid = t_k.tid
group by item_1, ..., item_k
having count(*) > :minsup
62
Sequential patterns
Similar to papers covered on Nov 17
Input is sequences of transactions
E.g. ((computer,modem),(printer))
Similar to association rules, but dealing with
sequences as opposed to sets
Can also specify maximum and minimum time
gaps, as well as sliding time windows
Max-gap, min-gap, window-size
64
Input and output formats
Input has three columns:
Sequence identifier (sid)
Transaction time (time)
Item identifier (item)
Output format is a collection of frequent
sequences, in a fixed-width table
(item1, eno1,…,itemk, enok, len)
For smaller lengths, extra column values are set to
NULL
65
GSP algorithm
Similar to algorithms shown earlier
Each Ck has transactions and times, but no length –
has fixed length of k
Candidates are generated in two steps
Join – join Fk-1 with itself
Sequence s1 joins with s2 if the subsequence obtained by
dropping the first item of s1 is the same as the one obtained by
dropping the last item of s2
When generating C2, we need to generate sequences where
both of the items appear as a single element as well as two
separate elements
Prune
All candidate sequences that have a non-frequent contiguous
(k-1) subsequence are deleted
66
GSP – Join SQL
insert into C_k
select I1.item_1, I1.eno_1, ..., I1.item_{k-1}, I1.eno_{k-1},
       I2.item_{k-1}, I1.eno_{k-1} + I2.eno_{k-1} - I2.eno_{k-2}
from F_{k-1} I1, F_{k-1} I2
where I1.item_2 = I2.item_1 and ... and
      I1.item_{k-1} = I2.item_{k-2} and
      I1.eno_3 - I1.eno_2 = I2.eno_2 - I2.eno_1 and
      ... and
      I1.eno_{k-1} - I1.eno_{k-2} = I2.eno_{k-2} - I2.eno_{k-3}
67
GSP – Prune SQL
Write as a k-way join, similar to before
There are at most k contiguous subsequences
of length (k-1) for which Fk-1 needs to be
checked for membership
Note that all (k-1) subsequences may not be
contiguous because of the max-gap
constraint between consecutive elements.
68
GSP – Support Counting
In each pass, we use the candidate table Ck
and the input data-sequences table D to
count the support
K-way join
We use select distinct before the group by to
ensure that only distinct data-sequences are
counted
We have additional predicates between sequence
numbers to handle the special time elements
69
GSP – Support Counting SQL
(C_k.eno_j = C_k.eno_i and
 abs(d_j.time - d_i.time) ≤ window-size)
or
(C_k.eno_j = C_k.eno_i + 1 and
 d_j.time - d_i.time ≤ max-gap and
 d_j.time - d_i.time > min-gap)
or
(C_k.eno_j > C_k.eno_i + 1)
70
References
1. Developing Tightly-Coupled Data Mining Applications on a Relational Database System. Rakesh Agrawal, Kyuseok Shim, 1996.
2. Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications. Sunita Sarawagi, Shiby Thomas, Rakesh Agrawal, 1998. (Refers to 1 above.)
3. Mining Generalized Association Rules and Sequential Patterns Using SQL Queries. Shiby Thomas, Sunita Sarawagi, 1998. (Refers to 1 and 2 above.)
71