I-Shou University
Institute of Information Management
Master's Thesis

Maintenance of Discovered
Informative Rule Sets

Student: Kuan-Wei Huang
Advisor: Dr. Tien-Chin Wang
Co-advisor: Dr. Shyue-Liang Wang

June 2003
Maintenance of Discovered
Informative Rule Sets

Student: Kuan-Wei Huang
Advisor: Dr. Tien-Chin Wang
Co-advisor: Dr. Shyue-Liang Wang

A Thesis
Submitted to the Department of Information Management
I-Shou University
in Partial Fulfillment of the Requirements
for the Degree of Master
in
Information Management
June 2003
Kaohsiung, Taiwan
Maintenance of Discovered Informative Rule Sets

Student: Kuan-Wei Huang
Advisor: Dr. Tien-Chin Wang
Co-advisor: Dr. Shyue-Liang Wang
Institute of Information Management, I-Shou University

ABSTRACT (CHINESE, translated)

The goal of this research is to study efficient methods for maintaining informative rule sets (IRS). In terms of confidence, an informative rule set can make the same predictions as the full association rule set, while the number of rules is far smaller. Prediction means that, given a set of rules describing the shopping behavior of a group of customers over a period of time and part of the purchases of a particular customer, we wish to predict the other purchases of that customer. Maintenance of an informative rule set means that, given a transaction database and its informative rule set, when the database receives insertions, deletions, or modifications, we wish to maintain the informative rule set as efficiently as possible.

Based on the Fast Update (FUP) technique for association rules, this research proposes two efficient algorithms for maintaining informative rule sets. When data are inserted into the database, the incremental insertion algorithm maintains the informative rule set efficiently; when data are deleted from the database, the incremental deletion algorithm maintains the informative rule set efficiently. Numerical comparisons with the non-incremental algorithm show that the proposed methods require fewer database scans, fewer candidate rules, and less execution time.

Keywords: data mining, prediction, informative rule set, incremental discovery, maintenance
Maintenance of Discovered
Informative Rule Sets
Student: Kuan-Wei Huang
Advisor: Dr. Tien-Chin Wang
Co-advisor: Dr. Shyue-Liang Wang
Institute of Information Management
I-Shou University
ABSTRACT
The goal of this research is to study the efficient maintenance of a discovered informative rule set (IRS) when new transaction data are added to and/or deleted from the original transaction database. An informative rule set is the smallest subset of the association rule set that can make the same prediction sequence according to confidence priority. Prediction is a process in which, for example, given a set of rules that describe the shopping behavior of the customers in a store over time, and some purchases made by a particular customer, we wish to predict what other purchases will be made by that customer. The problem of maintenance of a discovered informative rule set is that, given a transaction database and its informative rule set, when the database receives insertions, deletions, or modifications, we wish to maintain the discovered informative rule set as efficiently as possible.

Based on the Fast Updating technique (FUP) for the updating of discovered association rules, we present here two algorithms to maintain the discovered IRS. The proposed incremental insertion algorithm maintains the discovered IRS efficiently under database insertion. The proposed incremental deletion algorithm maintains the discovered IRS efficiently under database deletion. Numerical comparison with the non-incremental informative rule set approach demonstrates that our proposed techniques require fewer database scans, fewer candidate rules, and less computation time to maintain the discovered informative rule set.
Keywords: data mining, prediction, informative rule set, incremental, maintenance
ACKNOWLEDGEMENTS

This thesis could not have been completed without the help of many people. First, I would like to thank my two advisors, Professor Tien-Chin Wang and Professor Shyue-Liang Wang, for their great encouragement and guidance during these two years of graduate study, which gave me a deeper understanding of this field, and for their timely corrections and instruction whenever I made mistakes, without which I could not have completed this research.

I would also like to thank my oral defense committee members, Professor 林建宏 and Professor 吳志宏 of Shu-Te University, for their guidance during the oral examination; their suggestions made the content of this thesis more complete, and I offer them my heartfelt thanks.

Next, I thank the many senior students, classmates, and junior students who accompanied me through these two years; your company and encouragement made these two years of study more colorful and more joyful.

Finally, I thank my family; their quiet support throughout my studies allowed me to complete my schoolwork free of worry.

Kuan-Wei Huang, I-Shou University
Contents

ABSTRACT (CHINESE)
ABSTRACT (ENGLISH)
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
  1.1 Background
  1.2 Motivation
  1.3 Thesis Organization
CHAPTER 2 LITERATURE SURVEY
  2.1 Association Rules for Prediction
  2.2 Informative Rule Set for Prediction
  2.3 Maintenance of Association Rules
CHAPTER 3 DISCOVERY OF INFORMATIVE RULE SET: INCREMENTAL INSERTION
  3.1 Problem Description
  3.2 Notations
  3.3 Algorithm
  3.4 Example
CHAPTER 4 DISCOVERY OF INFORMATIVE RULE SET: INCREMENTAL DELETION
  4.1 Problem Description
  4.2 Algorithm
  4.3 Example
CHAPTER 5 EXPERIMENT RESULTS
  5.1 Incremental Insertion Results
  5.2 Incremental Deletion Results
CHAPTER 6 CONCLUSION
REFERENCES
List of Figures

Figure 2.1 A simple database D
Figure 2.2 A transaction database after deletion and insertion
Figure 3.1 A simple database D
Figure 3.2 New data set △+
Figure 3.3 A candidate tree over the set of items {a, b, c, d}
Figure 4.1 A simple database D
Figure 4.2 Deleted data set △-
Figure 4.3 A candidate tree over the set of items {a, b, c, d}
Figure 5.1 Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 2,000 records
Figure 5.2 Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 10,000 records
Figure 5.3 Running time comparison of incremental and non-incremental approaches under various incremental data sizes, minimum support 10%
Figure 5.4 Running time comparison of incremental and non-incremental approaches under various incremental data sizes, minimum support 2%
Figure 5.5 Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 2,000 records
Figure 5.6 Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 10,000 records
Figure 5.7 Running time comparison of incremental and non-incremental approaches under various incremental data sizes, minimum support 10%
List of Tables

Table 2.1 A Summary of Recent Association Rule Mining Approaches
Table 2.2 Association Rule Set Obtained from Figure 2.1
Table 2.3 Informative Rule Set Obtained from Figure 2.1
Table 2.4 A Summary of Recent Association Rule Maintenance Approaches
Table 3.1 Comparisons of Non-Incremental and Incremental Insertion Approaches
Table 4.1 Comparisons of Non-Incremental and Incremental Deletion Approaches
CHAPTER 1
INTRODUCTION
1.1 Background
The discovery of association rules in transaction databases is an important
data-mining problem because of its wide application in many areas, such as market
basket analysis, decision support, financial forecast, collaborative recommendation,
and prediction.
Prediction is a process, for example, given a set of rules that describe the
shopping behavior of the customers in a store over time, and some purchases made by
a particular customer, we wish to predict what other purchases will be made by that
customer.
Many techniques have been proposed for prediction in the past. In addition to the classical decision-tree induction approach, there are Bayesian classification, neural networks, nearest-neighbor classifiers, case-based reasoning, genetic algorithms, rough sets, fuzzy sets, and data mining approaches.
For the data mining approach, the association rule set is usually used for prediction. However, traditional association rule algorithms typically generate a large number of rules, most of which are unnecessary when used for prediction. Enhancements that simplify the association rule set [2], directly and indirectly, have therefore been studied extensively. Most indirect algorithms simplify the set by post-pruning and re-organization of association rules. The direct algorithms attempt to reduce the number of association rules directly, for example, the constraint association rule sets [17][20], non-redundant rule sets [18][19], and informative rule sets [12].

In this work, we are particularly interested in improving the efficiency of mining informative rule sets when the transaction database is updated, i.e., when a small transaction data set is added to and/or deleted from the original database. This problem is referred to as the maintenance of the discovered informative rule set.
1.2 Motivation
One possible approach to the maintenance problem is to re-run the data-mining algorithm on the whole updated database. However, this approach has an obvious disadvantage: in the case of association rules, all the computation done initially to find the old large itemsets is wasted and has to be redone from scratch. In the case of IRS, the support counts of candidate itemsets likewise have to be re-computed from scratch for the updated database. Therefore, more efficient algorithms that compute the large itemsets in the updated database by utilizing the information in the old large itemsets are quite desirable.
1.3 Thesis Organization
The rest of this thesis is organized as follows. Chapter 2 reviews association rules and informative rule sets for prediction, as well as the maintenance of association rules. Chapter 3 presents the proposed incremental insertion algorithm. Chapter 4 presents the incremental deletion algorithm. Chapter 5 shows the experimental results of the proposed algorithms. Conclusions and future work are finally given in Chapter 6.
CHAPTER 2
LITERATURE SURVEY
In this chapter, we review association rules and informative rule sets for prediction, as well as the maintenance of association rules. In section 2.1, we review the basic concept of association rule mining and summarize related research in recent years [1][2][17]. In section 2.2, we review the concept of informative rule sets from transaction databases [12]. In section 2.3, we review recent work on the maintenance of discovered association rules [3][7][8][10][16].
2.1 Association Rules for Prediction
The association rule is the most widely studied pattern in the field of data mining. The following table summarizes some of the recent activities.
Method | Author | Subject | Year
Association Rules | David W. Cheung | Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique (FUP) | 96'
Association Rules | David W. Cheung | Maintenance of Discovered Knowledge: A Case in Multi-level Association Rules (FUP*) | 96'
Association Rules | David W. Cheung | A General Incremental Technique for Maintaining Discovered Association Rules (FUP2) | 97'
Association Rules | L.P. Chen | Efficient Graph-Based Algorithms for Discovering and Maintaining Association Rules in Large Databases (DUP) | 01'
Association Rules | Necip Fazil Ayan | An Efficient Algorithm to Update Large Itemsets with Early Pruning (UWEP) | 99'
Association Rules | Shiby Thomas | An Efficient Algorithm for Incremental Updating of Association Rules in Large Databases | 97'
Association Rules | T.P. Hong | Incremental Data Mining Using Pre-large Itemsets | 00'
Non-redundant Association Rules | Zaki | An Efficient Algorithm for Closed Association Rule Mining | 99'
Non-redundant Association Rules | Zaki | Generating Non-Redundant Association Rules | 00'
Non-redundant Association Rules | Yves Bastide | Mining Minimal Non-Redundant Association Rules Using Frequent Closed Itemsets | 00'
Constraint-based Association Rules | David W. Cheung | A Fast Distributed Algorithm for Mining Association Rules | 96'
Constraint-based Association Rules | Jiawei Han, Agrawal | Mining Association Rules with Item Constraint | 97'
Constraint-based Association Rules | Chunhua Wang | Distributed Mining for Association Rules with Item Constraints | 00'
Association Rules for Prediction | Jiuyong Li | Mining the Smallest Association Rule Set for Predictions | 01'

Table 2.1 A summary of recent association rule mining approaches
Here we briefly review the basic concept of an association rule. Let I = {i1, i2, …, im} be a set of items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Each transaction is associated with a unique identifier, called its TID. An association rule is an implication of the form X => Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = Ø. The rule X => Y has support s if s% of the transactions in D contain X ∪ Y, and it has confidence c if c% of the transactions in D that contain X also contain Y. This is the original definition of an association rule.
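Support and confidence as defined above can be computed directly from a transaction list; a minimal Python sketch (the helper names and the inline data are illustrative, not from the thesis):

```python
# Support and confidence of a rule X => Y over a list of transactions.
def support(transactions, itemset):
    """Fraction of transactions containing every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """supp(X union Y) / supp(X): the fraction of transactions containing
    the antecedent that also contain the consequent."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

# The six transactions of Figure 2.1.
D = [set("abc"), set("abc"), set("abc"), set("abd"), set("acd"), set("bcd")]

print(round(support(D, {"a", "b"}), 2))       # 0.67
print(round(confidence(D, {"a"}, {"b"}), 2))  # 0.8
```

These are exactly the 67% support and 80% confidence of the rule a => b discussed in Example 1 below.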
Informally, prediction using an association rule set can be described as follows. For a given association rule set R and an itemset P, the prediction for P from R is a sequence of items Q. The sequence Q is generated by using the rules in R in descending order of confidence. For each rule r that matches P (i.e., each rule whose antecedent is a subset of P), the consequent of r is added to Q. After adding a consequent to Q, all rules whose consequents are in Q are removed from R. The following example shows the association rules of a simple data set and their application to prediction.
Example 1. Consider the small database shown in Figure 2.1, with minimum support 0.5 and minimum confidence 0.5. For the rule a => b, 67% is the support of the rule, the percentage of transactions that contain both a and b; 80% is the confidence of the rule, which means that 80% of the transactions that contain a also contain b. A set of 12 association rules can be found, as shown in Table 2.2.
TID  Items
1    abc
2    abc
3    abc
4    abd
5    acd
6    bcd

Figure 2.1 A simple database D
     AR      Support  Confidence
1    a=>b    0.67     0.8
2    a=>c    0.67     0.8
3    b=>a    0.67     0.8
4    b=>c    0.67     0.8
5    c=>a    0.67     0.8
6    c=>b    0.67     0.8
7    ab=>c   0.5      0.75
8    ac=>b   0.5      0.75
9    bc=>a   0.5      0.75
10   a=>bc   0.5      0.6
11   b=>ac   0.5      0.6
12   c=>ab   0.5      0.6

Table 2.2 Association rule set obtained from Figure 2.1
For prediction, given an itemset P = {a, b}, the predicted sequence of items will be Q = {b, c, a}. It can be observed that not all association rules are used to produce the predicted sequence Q.
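The prediction procedure above can be sketched as follows. This simplified version skips items already in Q rather than removing rules from R, which yields the same sequence; the function and variable names are illustrative:

```python
# Prediction with an association rule set: scan the rules in descending
# order of confidence; for each rule whose antecedent is a subset of P,
# append its consequent items to Q, skipping items already predicted.
def predict(rules, P):
    """rules: list of (antecedent, consequent, confidence) tuples."""
    Q = []
    for antecedent, consequent, _ in sorted(rules, key=lambda r: -r[2]):
        if antecedent <= P:
            for item in consequent:
                if item not in Q:
                    Q.append(item)
    return Q

# A subset of the rules of Table 2.2 (supports omitted for brevity).
rules = [({"a"}, {"b"}, 0.8), ({"a"}, {"c"}, 0.8), ({"b"}, {"a"}, 0.8),
         ({"b"}, {"c"}, 0.8), ({"c"}, {"a"}, 0.8), ({"c"}, {"b"}, 0.8),
         ({"a", "b"}, {"c"}, 0.75)]

print(predict(rules, {"a", "b"}))  # ['b', 'c', 'a']
```

With P = {a, b}, the rules a=>b, a=>c, and b=>a fire in confidence order, reproducing the sequence Q = {b, c, a} of the text; the lower-confidence rule ab=>c adds nothing.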
2.2 Informative Rule Set for Prediction
Basically, an informative rule set is the smallest subset of the association rule set that can make the same prediction sequence according to confidence priority. The definition of the informative rule set introduced in [12] is given as follows.

Definition 2.1. Let RA be an association rule set and RA1 the set of single-target rules in RA. A set RI is informative over RA if (1) RI ⊆ RA1; (2) for every r ∈ RI there does not exist r' ∈ RI such that r' ⊂ r and conf(r') ≥ conf(r); and (3) for every r'' ∈ RA1 − RI, there exists r ∈ RI such that r ⊂ r'' and conf(r) ≥ conf(r'').
A top-down level-wise search algorithm using a candidate tree was proposed in [12] for the efficient discovery of the informative rule set. Consider again the database shown in Figure 2.1. The informative rule set for the same minimum support and confidence will be the first 6 association rules in Table 2.2, i.e., RD = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b}, which can make the same prediction sequence as the whole association rule set. The result is shown in Table 2.3.
     AR      Support  Confidence
1    a=>b    0.67     0.8
2    a=>c    0.67     0.8
3    b=>a    0.67     0.8
4    b=>c    0.67     0.8
5    c=>a    0.67     0.8
6    c=>b    0.67     0.8

Table 2.3 Informative rule set obtained from Figure 2.1
The confidence priority of the informative rule set can be further illustrated by the following. Assume the following two rules exist:

(1) Purchasing PRINTER => Purchasing PRINTING PAPER (confidence 80%)
(2) Purchasing (PRINTER and PRINTING INK) => Purchasing PRINTING PAPER (confidence 60%)

We can predict with 80% confidence that a customer purchasing a printer also purchases printing paper. The rule that a customer purchasing both a printer and printing ink will also purchase printing paper has lower confidence; that rule is therefore redundant.
Example 2. For Table 2.2, consider the rule set {a => c (0.67, 0.8), b => c (0.67, 0.8), ab => c (0.5, 0.75)}, where the numbers in parentheses are the support and confidence, respectively. Every transaction identified by the rule ab => c is also identified by rule a => c or b => c with higher confidence, so ab => c can be omitted from the informative rule set without losing predictive capability.

Example 3. For Table 2.2, consider the rule set {a => b (0.67, 0.8), a => c (0.67, 0.8), a => bc (0.5, 0.6)}. Rules a => b and a => c provide predictions b and c with higher confidence than rule a => bc, so rule a => bc can be omitted from the informative rule set.
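The pruning behind Examples 2 and 3 can be illustrated with a naive filter over single-target rules: a rule is dropped when a more general rule (one whose antecedent is a proper subset) predicts the same consequent with at least equal confidence. This is only an illustration of Definition 2.1, not the candidate-tree algorithm of [12]:

```python
def prune_informative(rules):
    """Keep a single-target rule only if no more general rule with the
    same consequent has confidence >= its own (condition (3) of
    Definition 2.1). rules: list of (antecedent, consequent, conf)."""
    kept = []
    for ant, cons, conf in rules:
        redundant = any(a2 < ant and c2 == cons and conf2 >= conf
                        for a2, c2, conf2 in rules)
        if not redundant:
            kept.append((ant, cons, conf))
    return kept

# Example 2: ab => c is covered by a => c and b => c (higher confidence).
rules = [({"a"}, {"c"}, 0.8), ({"b"}, {"c"}, 0.8),
         ({"a", "b"}, {"c"}, 0.75)]
print(len(prune_informative(rules)))  # 2: ab => c has been pruned
```

Here `a2 < ant` is Python's proper-subset test on sets, which captures "r' is more general than r".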
2.3 Maintenance of Association Rules
The problem of maintaining a discovered association rule set in large databases has been studied extensively in the past [3][7][8][10][16]. Maintenance of a discovered association rule set is a process by which, given a transaction database and its association rule set, when the database receives insertions, deletions, or modifications, we wish to maintain the discovered association rules as efficiently as possible. The following figure shows the relationship between a given transaction database D and the inserted set △+ and deleted set △-.
Figure 2.2 A transaction database after deletion and insertion
The following table summarizes some incremental approaches of recent years, including incremental insertion, incremental deletion, and incremental modification.
Serial | Subject | Author | Year | From | Increment
1 | Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique (FUP) | Cheung et al. | 96' | In Proceedings of the International Conference on Data Engineering, New Orleans, Louisiana, pp. 106-114 | insert
2 | Maintenance of Discovered Knowledge: A Case in Multi-level Association Rules (FUP*) | Cheung et al. | 96' | In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 307-310 | insert
3 | Maintenance of Discovered Association Rules: When to Update? | Cheung et al. | 97' | DMKD 1997 | insert
4 | Efficient Mining of Association Rules in Distributed Databases (DMA) | Cheung et al. | 96' | IEEE Transactions on Knowledge and Data Engineering | -
5 | A General Incremental Technique for Maintaining Discovered Association Rules (FUP2) | Cheung et al. | 97' | In Proceedings of the International Conference on Database Systems for Advanced Applications, 1-4 April 1997 | insert, delete
6 | An Efficient Algorithm for Incremental Updating of Association Rules in Large Databases | Shiby Thomas et al. | 97' | In Proc. KDD 1997 | update
7 | Efficient Graph-Based Algorithms for Discovering and Maintaining Knowledge in Large Databases | L.P. Chen et al. | 99' | In 3rd Pacific-Asia Conference, PAKDD-99 Proceedings, Beijing, China, pp. 409-419 | update
8 | An Efficient Approach for Incremental Association Rules Mining | L.P. Chen et al. | 99' | Methodologies for Knowledge Discovery and Data Mining, 3rd Pacific-Asia Conference, PAKDD-99 | update
9 | An Efficient Algorithm to Update Large Itemsets with Early Pruning (UWEP) | Necip Fazil Ayan et al. | 99' | In Proc. 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 1999 | insert
10 | Incremental Data Mining Using Pre-large Itemsets | T.P. Hong et al. | 02' | Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002) | update
11 | Efficient Graph-Based Algorithms for Discovering and Maintaining Association Rules in Large Databases (DUP) | L.P. Chen et al. | 01' | Knowledge and Information Systems (2001) 3: 338-355 | update

Table 2.4 A summary of recent association rule maintenance approaches
CHAPTER 3
DISCOVERY OF INFORMATIVE
RULE SET: INCREMENTAL
INSERTION
This chapter presents an incremental insertion algorithm for the maintenance of
discovered informative rule set when a small transaction data set is added to the
original database.
The proposed approach is based on the Fast UPdating technique
(FUP) [7] for updating the discovered association rules.
In the following, we describe the problem statement, the notations used in this work, the top level of the proposed incremental insertion algorithm for the informative rule set, the main functions of the algorithm, and an example to demonstrate the proposed approach.
3.1 Problem Description
The problem of maintenance of a discovered informative rule set is that, given a transaction database and its informative rule set, when the database receives insertions, deletions, or modifications, we wish to maintain the discovered informative rule set as efficiently as possible. For example, consider the database shown in Figure 3.1 and its informative rule set RD = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b}. When the new data set in Figure 3.2 is inserted, the updated informative rule set will be RD+ = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b, d=>a, d=>b, a=>d, b=>d}.
TID  Items
1    abc
2    abc
3    abc
4    abd
5    acd
6    bcd

Figure 3.1 A simple database D

TID  Items
7    abcd
8    abd

Figure 3.2 New data set △+
In this chapter, we will consider the maintenance of the discovered informative rule set under insertion only.
3.2 Notations
The following notation will be used in this chapter.

D: original database
D-: updated database after deletion, D − △-
△+: inserted data set
D+: updated database, D ∪ △+
I: set of items in a database (e.g., ID, ID+)
TN: number of transactions in a database (e.g., TND, TND+)
X: an itemset, X ⊆ I
s: minimum support
c: minimum confidence
X.supp: support of X (e.g., X.suppD, X.supp△+, X.suppD+)
Ck: set of candidate k-itemsets (e.g., CkD, Ck△+)
Lk: set of large k-itemsets (e.g., LkD, Lk△+)
T: candidate tree (e.g., TD, T△+)
Tk: the k-th level of T
R: informative rule set (e.g., RD, R△+, RD+)
3.3 Algorithm
This section presents the proposed incremental insertion algorithm for the maintenance of the discovered informative rule set. Based on the concept of the informative rule miner [12] and the FUP technique [7], the proposed algorithm generates the updated informative rule set directly, level by level, and stores it in a candidate tree. A candidate tree is an extended set enumeration tree such as Figure 3.3, where the set of items is {a, b, c, d}. Each node in the candidate tree stores two sets {A, Z}: A is an itemset, the identity set of the node, and Z is a subset of the identity itemset, called the potential target set, in which every item can be the consequent of an association rule. For example, the node {{abc}, {ab}} is the set of candidates of two rules, namely bc=>a and ac=>b.
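The node layout described above might be represented as follows; the class and field names are an illustrative sketch, not the thesis's actual data structure:

```python
# A candidate-tree node holds an identity itemset A and a potential
# target set Z (items of A that may still appear as rule consequents).
class CandidateNode:
    def __init__(self, identity, targets):
        self.identity = frozenset(identity)  # the itemset A
        self.targets = frozenset(targets)    # Z, a subset of A
        self.children = []                   # next level of the tree

    def candidate_rules(self):
        """Each target item z yields the candidate rule (A - {z}) => z."""
        return [(self.identity - {z}, z) for z in self.targets]

# The node {{abc}, {ab}} from the text: candidates bc => a and ac => b.
node = CandidateNode({"a", "b", "c"}, {"a", "b"})
for antecedent, target in sorted(node.candidate_rules(), key=lambda r: r[1]):
    print("".join(sorted(antecedent)), "=>", target)
# bc => a
# ac => b
```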
Utilizing the FUP technique to update the supports of 1-itemsets, the proposed algorithm first updates the level-one nodes of the candidate tree TD. It then generates the candidate nodes of the next level. While the candidate nodes are not empty, it proceeds to update the support counts of the candidate nodes using the FUP technique, then prunes the unnecessary candidate rules or nodes. This process repeats until no candidate node is generated. The following is the top level of the proposed incremental insertion algorithm for the informative rule set.
Figure 3.3 A candidate tree over the set of items {a, b, c, d}

Input:  database D,
        the informative rule set RD,
        the candidate tree TD,
        the new data set △+
Output: the informative rule set RD+

1.  RD+ = RD
2.  Update supports of 1-itemsets
3.  Update level one of candidate tree TD
4.  Generate candidate nodes of next level
5.  While (candidate nodes are not empty)
6.      Update supports of candidate nodes
7.      Prune the candidate rules or nodes
8.      Include qualified rule sets in RD+
9.      Generate candidate nodes of next level
10. Return rule set RD+

The functions for generating candidate nodes in steps 4 and 9 and for pruning the candidate nodes in step 7 are introduced in [12]. The update-support functions in steps 2 and 6 are given as follows.
Function: Update supports of 1-itemsets
(1) Initialization: W = L1D, C1D+ = Ø, L1D+ = Ø
(2) For all X ∈ I△+:                                   // scan △+
        If X ∈ W, then update the support of X on D+   // X.suppD known
        Else if X ∉ C1D+, then C1D+ = C1D+ ∪ {X}, and calculate the support of X on △+
(3) For all X ∈ W:
        If X.suppD+ ≥ s*(|D| + |△+|), then L1D+ = L1D+ ∪ {X}
(4) For all X ∈ C1D+:
        If X.supp△+ < s*|△+|, then C1D+ = C1D+ − {X}
(5) For all X ∈ C1D+:                                  // scan D; X is a non-frequent itemset in D
        If X.suppD+ ≥ s*(|D| + |△+|), then L1D+ = L1D+ ∪ {X}
Function: Update supports of candidate nodes
(1) Initialization: W = TkD, TkD+ = Ø
(2) CkD+ = Candidate-Rule-Generator(Tk-1D+) − TkD      // Apriori generator function
(3) For all nodes ni ∈ W:
        If ni contains a non-frequent subset in Tk-1D+, then W = W − {ni}
(4) For all ni ∈ W:
        Calculate the support of ni on D+              // ni.suppD known; scan △+
        If ni.suppD+ ≥ s*(|D| + |△+|), then TkD+ = TkD+ ∪ {ni}
(5) For all ni ∈ CkD+:
        If ni.supp△+ < s*|△+|, then CkD+ = CkD+ − {ni}
(6) For all ni ∈ CkD+:                                 // scan D
        If ni.suppD+ ≥ s*(|D| + |△+|), then TkD+ = TkD+ ∪ {ni}
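The 1-itemset update follows the FUP idea: itemsets already large in D need only one scan of △+ to refresh their counts, while new candidates are first pruned locally against △+ and only the survivors trigger a scan of the old database. A Python sketch under simplified assumptions (absolute counts, illustrative names):

```python
def update_large_1itemsets(D, delta_plus, old_counts, s):
    """FUP-style update of the large 1-itemsets after inserting delta_plus.
    old_counts: item -> count in D, known only for items large in D.
    Returns the set of items large in the updated database."""
    n_new = len(D) + len(delta_plus)
    inc_counts = {}
    for t in delta_plus:                    # one scan of delta_plus
        for item in t:
            inc_counts[item] = inc_counts.get(item, 0) + 1
    large = set()
    # Items large in D: updated count is known without rescanning D.
    for item, old in old_counts.items():
        if old + inc_counts.get(item, 0) >= s * n_new:
            large.add(item)
    # New candidates: keep only those frequent within delta_plus itself,
    # then verify the survivors with a single scan of the old database.
    for item, c in inc_counts.items():
        if item not in old_counts and c >= s * len(delta_plus):
            count_D = sum(1 for t in D if item in t)
            if count_D + c >= s * n_new:
                large.add(item)
    return large

# Figures 3.1 and 3.2 with s = 0.5.
D = [set("abc"), set("abc"), set("abc"), set("abd"), set("acd"), set("bcd")]
delta_plus = [set("abcd"), set("abd")]
old_counts = {"a": 5, "b": 5, "c": 5, "d": 3}  # large 1-itemsets in D
print(sorted(update_large_1itemsets(D, delta_plus, old_counts, 0.5)))
# ['a', 'b', 'c', 'd']
```

On the example of section 3.4 this reproduces the updated counts a:7, b:7, c:6, d:5, all above the new threshold of 4.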
3.4 Example
This section gives an example to show that the proposed algorithm can be used to find the informative rule set incrementally.
Input:  database D in Figure 3.1,
        the informative rule set RD = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b},
        the candidate tree TD,
        the new data set △+ in Figure 3.2
Output: the informative rule set RD+
1. RD+=RD
2. Update supports of 1-itemsets: a.supD+ = 7, b.supD+ = 7, c.supD+ = 6, d.supD+ = 5.
3. The level-one candidate nodes are a, b, c, d, with candidates {{a}, {a}}, {{b}, {b}}, {{c}, {c}}, {{d}, {d}}.
4. The candidate nodes in level two are nodes b, c, d, which are descendants of node a of level one. Other nodes are shown in Figure 3.3.
5. While (candidate nodes are not empty)
6. Update the supports of the candidate nodes of level two, as shown in Figure 3.3.
7. Node d, a descendant of c, is pruned.
8. RD+ = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b, d=>a, d=>b, a=>d, b=>d}.
9. Generate the candidate nodes of level three, as shown in Figure 3.3.
10. The process repeats the while loop until there is no candidate node, as shown in Figure 3.3.

The resulting IRS is RD+ = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b, d=>a, d=>b, a=>d, b=>d}.
In the following, we summarize in Table 3.1 the candidate rule sets and the number of database scans for the non-incremental and incremental insertion algorithms for the discovery of the informative rule set. The candidate rule sets for the non-incremental approach number 4, 6, and 1 for levels 1, 2, and 3, respectively, whereas the incremental approach has only 0, 3, and 0, respectively. In addition, there is some saving in database scanning for the incremental approach, as shown in Table 3.1.
                  Ck                Scanning of Database
          IRS        IIRS           IRS           IIRS
Level 1    4          0             D & △+        △+
Level 2    6          3             D & △+        D & △+
Level 3    1          0             D & △+        △+

Table 3.1 Comparisons of non-incremental (IRS) and incremental (IIRS) insertion approaches
CHAPTER 4
DISCOVERY OF INFORMATIVE
RULE SET: INCREMENTAL
DELETION
This chapter presents an incremental deletion algorithm for the maintenance of
discovered informative rule set when a small transaction data set is removed from the
original database. The proposed approach is based on the Fast Updating2 technique
(FUP2) [8] for updating the discovered association rules.
In the following, we describe the problem statement, the top level of the proposed incremental deletion algorithm for the informative rule set, the main functions of the algorithm, and an example to demonstrate the proposed approach.
4.1 Problem Description
The problem of maintenance of a discovered informative rule set is that, given a transaction database and its informative rule set, when the database receives insertions, deletions, or modifications, we wish to maintain the discovered informative rule set as efficiently as possible. For example, consider the database shown in Figure 4.1 and its informative rule set RD = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b}. When the data set in Figure 4.2 is deleted, the updated informative rule set will be RD- = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b, d=>a, d=>b, a=>d, b=>d}.
TID  Items
1    abc
2    abc
3    abc
4    abd
5    acd
6    bcd

Figure 4.1 A simple database D

TID  Items
1    abc
2    abc

Figure 4.2 Deleted data set △-
In this chapter, we will consider the maintenance of the discovered informative rule set under deletion only.
4.2 Algorithm
This section presents the proposed incremental deletion algorithm for the maintenance of the discovered informative rule set. Based on the concept of the informative rule miner [12] and the FUP2 technique [8], the proposed algorithm generates the updated informative rule set directly, level by level, and stores it in a candidate tree. A candidate tree is an extended set enumeration tree such as Figure 4.3, where the set of items is {a, b, c, d}.
Figure 4.3 A candidate tree over the set of items {a, b, c, d}

Utilizing the FUP2 technique to update the supports of 1-itemsets, the proposed algorithm first updates the level-one nodes of the candidate tree TD. It then generates the candidate nodes of the next level. While the candidate nodes are not empty, it proceeds to update the support counts of the candidate nodes using the FUP2 technique, then prunes the unnecessary candidate rules or nodes. This process repeats until no candidate node is generated. The following is the top level of the proposed incremental deletion algorithm for the informative rule set.
Input:  database D,
        the informative rule set RD,
        candidate tree TD,
        removed data set △-
Output: the informative rule set RD-

1.  RD- = RD
2.  Update supports of 1-itemsets
3.  Update level one of candidate tree TD
4.  Generate candidate nodes of next level
5.  While (candidate nodes are not empty)
6.      Update supports of candidate nodes
7.      Prune the candidate rules or nodes
8.      Include qualified rule sets in RD-
9.      Generate candidate nodes of next level
10. Return rule set in RD-
The functions for generating candidate nodes in steps 4 and 9 and for pruning the candidate nodes in step 7 are introduced in [12]. The support-update functions in steps 2 and 6 are given as follows.
Function: Update supports of 1-itemsets
(1) Initialization: W = L1^D, C1^D- = ∅, L1^D- = ∅
(2) For all X ∈ I(△-)
        If X ∈ W
            then update support of X on D-              // X.supp^D known; scan △-
        Else If X ∉ C1^D-
            then C1^D- = C1^D- ∪ {X}, and calculate support of X on △-
(3) For all X ∈ W
        If X.supp^D- ≥ s·(|D| - |△-|), then L1^D- = L1^D- ∪ {X}
(4) For all X ∈ C1^D-
        If X.supp^△- > s·|△-|, then C1^D- = C1^D- - {X}
(5) For all X ∈ C1^D-                                   // scan D-; X is a non-frequent itemset in D
        If X.supp^D- ≥ s·(|D| - |△-|), then L1^D- = L1^D- ∪ {X}
Function: Update supports of candidate nodes
(1) Initialization: W = Tk^D, Tk^D- = ∅
(2) Ck^D- = Candidate-Rule-generator(T(k-1)^D-) - Tk^D   // Apriori generator function
(3) For all nodes ni ∈ W
        If ni contains a non-frequent subset in T(k-1)^D-
            then W = W - {ni}
(4) For all ni ∈ W
        Calculate support of ni on D-                    // ni.supp^D known; scan △-
        If ni.supp^△- < s·|△-|, then Tk^D- = Tk^D- ∪ {ni}
(5) For all ni ∈ Ck^D-
        If ni.supp^△- > s·|△-|, then Ck^D- = Ck^D- - {ni}   // ni cannot become frequent in D-
(6) For all ni ∈ Ck^D-                                   // scan D-
        If ni.supp^D- ≥ s·(|D| - |△-|), then Tk^D- = Tk^D- ∪ {ni}
4.3 Example
This section gives an example to show how the proposed algorithm finds the informative rule set incrementally.
Input:  database D in Figure 4.1,
        the informative rule set RD = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b},
        the candidate tree TD,
        the deleted data set △- in Figure 4.2
Output: the informative rule set RD-
1.  RD- = RD
2.  Update supports of 1-itemsets: a.supp^D- = 3, b.supp^D- = 3, c.supp^D- = 3, d.supp^D- = 3.
3.  The level-one nodes a, b, c, d have candidates {{a}, {a}}, {{b}, {b}}, {{c}, {c}}, {{d}, {d}}.
4.  The candidate nodes in level two are nodes b, c, d, which are descendants of node a of level one. Other nodes are shown in Figure 4.3.
5.  While (candidate nodes are not empty)
6.  Update supports of candidate nodes of level two, as shown in Figure 4.3.
7.  Node d, a descendant of node c, is pruned.
8.  RD- = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b, d=>a, d=>b, a=>d, b=>d}.
9.  Generate candidate nodes of level three, as shown in Figure 4.3.
10. The process repeats the while loop until there is no candidate node, as shown in Figure 4.3.
The resulting IRS is RD- = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b, d=>a, d=>b, a=>d, b=>d}.
In the following, we summarize in Table 4.1 the candidate rule sets and the number of database scans for the non-incremental and incremental deletion algorithms for the discovery of the informative rule set. The candidate rule sets for the non-incremental approach are 4, 6, and 4 for levels 1, 2, and 3, respectively, whereas the incremental approach has only 0, 2, and 4. In addition, the incremental approach saves some database scanning, as shown in Table 4.1.
             Ck                   Scanning of database
             IRS      IIRS        IRS          IIRS
Level 1      4        0           D- & △-      △-
Level 2      6        2           D- & △-      D- & △-
Level 3      4        4           D- & △-      D- & △-
Table 4.1   Comparisons of non-incremental (IRS) and incremental (IIRS) deletion approaches
CHAPTER 5
EXPERIMENTAL RESULTS
In this chapter, we present the experimental results of comparing the incremental insertion and deletion algorithms proposed in Chapters 3 and 4 with the non-incremental approach proposed in [12].
All programs are written in C++ and run on the same 600 MHz Pentium III PC with 256 MB of memory running the Windows XP operating system.
We ran the algorithms on a transaction data set of 50,000 records, generated by the synthetic data generator of QUEST from the IBM Almaden Research Center. In this data set, there are 100 different items, and the average itemset length is 7.8, with maximum length 16 and minimum length 2.
5.1 Incremental Insertion Results
In the following, we present the processing times of the incremental insertion
algorithm and the non-incremental algorithm under various minimum supports and
incremental data sets.
(Figure omitted: running time in seconds versus minimum support, 2%-20%, for the non-incremental and incremental approaches.)
Figure 5.1   Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 2,000 records
Figures 5.1 and 5.2 show the running times for both approaches under various minimum supports. The size of the original database D is 40,000 records. The minimum confidence is set at 20%. The sizes of the inserted data sets are 2,000 and 10,000 records, respectively. We can observe that for small minimum supports, the incremental approach performs better than the non-incremental approach. For minimum supports greater than 20%, the two approaches perform similarly. This is due to the fact that the number of candidate itemsets and rules becomes small, and about the same for both approaches, as the minimum support increases.
(Figure omitted: running time in seconds versus minimum support, 2%-20%, for the non-incremental and incremental approaches.)
Figure 5.2   Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 10,000 records
Figures 5.3 and 5.4 show the running times for both approaches under various incremental data sizes. The size of the original database D is 40,000 records. The minimum confidence is set at 20%. The minimum supports are 10% and 2%, respectively. We can observe that for various incremental data sizes, the incremental approach performs better than the non-incremental approach for both minimum supports. In fact, the ratio of processing times is around 1.67 for all incremental data sizes.

(Figure omitted: running time in seconds versus incremental size, 2k-10k, for the non-incremental and incremental approaches.)
Figure 5.3   Running time comparison of incremental and non-incremental approaches under various incremental data sizes, minimum support 10%

(Figure omitted: running time in seconds versus incremental size, 2k-10k, for the non-incremental and incremental approaches.)
Figure 5.4   Running time comparison of incremental and non-incremental approaches under various incremental data sizes, minimum support 2%
5.2 Incremental Deletion Results
In the following, we present the processing times of the incremental deletion algorithm and the non-incremental algorithm under various minimum supports and deleted data sets.
(Figure omitted: running time in seconds versus minimum support, 2%-20%, for the non-incremental and incremental approaches.)
Figure 5.5   Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 2,000 records
Figures 5.5 and 5.6 show the running times for both approaches under various minimum supports. The size of the original database D is 50,000 records. The minimum confidence is set at 20%. The sizes of the deleted data sets are 2,000 and 10,000 records, respectively. We can observe that for small minimum supports, the incremental approach performs better than the non-incremental approach. For minimum supports greater than 20%, the two approaches perform similarly. This is due to the fact that the number of candidate itemsets and rules becomes small, and about the same for both approaches, as the minimum support increases.
(Figure omitted: running time in seconds versus minimum support, 2%-20%, for the non-incremental and incremental approaches.)
Figure 5.6   Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 10,000 records
Figures 5.7 and 5.8 show the running times for both approaches under various incremental data sizes. The size of the original database D is 50,000 records. The minimum confidence is set at 20%. The minimum supports are 10% and 6%, respectively. We can observe that for various incremental data sizes, the incremental approach performs better than the non-incremental approach for both minimum supports.
(Figure omitted: running time in seconds versus incremental size, 2k-10k, for the non-incremental and incremental approaches.)
Figure 5.7   Running time comparison of incremental and non-incremental approaches under various incremental data sizes, minimum support 10%
(Figure omitted: running time in seconds versus incremental size, 2k-10k, for the non-incremental and incremental approaches.)
Figure 5.8   Running time comparison of incremental and non-incremental approaches under various incremental data sizes, minimum support 6%
CHAPTER 6
CONCLUSION AND FUTURE WORK
The informative rule set is the smallest subset of the association rule set such that the same prediction sequence by confidence priority can be achieved. The problem of maintenance of discovered informative rule sets includes incremental insertion, incremental deletion, and incremental modification. In this thesis, we have studied the problem of maintaining the discovered informative rule set under insertion and deletion. For each of these two types of maintenance, we have proposed efficient search algorithms, based on the FUP techniques, to maintain the discovered informative rule sets. In addition, numerical comparisons of the proposed incremental insertion and incremental deletion approaches with the non-incremental approach have been shown. The experimental results show that the proposed approaches are more efficient than the non-incremental approach.
Although the proposed incremental approach works well, it is just a beginning, and there is still much work to be done on this topic. Maintenance of discovered informative rule sets under incremental modification has not yet been carried out. In addition, other maintenance approaches besides the FUP techniques should be considered and compared.
REFERENCES
[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proc. of the ACM SIGMOD Conference on Management of Data, Washington, DC, May 1993, 207-216.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Data Bases, Santiago, Chile, September 1994, 487-499.
[3] Necip Fazil Ayan, "An Efficient Algorithm to Update Large Itemsets with Early Pruning", Proc. of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 1999, 287-291.
[4] Yves Bastide, Nicolas Pasquier, Rafik Taouil, Gerd Stumme, and Lotfi Lakhal, "Mining Minimal Non-Redundant Association Rules Using Frequent Closed Itemsets", Proc. of the DOOD 2000 Conference, LNCS, Springer-Verlag, July 2000, 972-986.
[5] L.P. Cheng, "Efficient Graph-Based Algorithms for Discovering and Maintaining Association Rules in Large Databases", Knowledge and Information Systems, London, 2001, 338-355.
[6] David W. Cheung, Vincent T. Ng, and Benjamin W. Tam, "Maintenance of Discovered Knowledge: A Case in Multi-level Association Rules", Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, 1996, 307-310.
[7] David W. Cheung, Jiawei Han, Vincent T. Ng, and C.Y. Wong, "Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique", Proc. of the International Conference on Data Engineering, New Orleans, Louisiana, 1996, 106-114.
[8] David W. Cheung, S.D. Lee, and Benjamin Kao, "A General Incremental Technique for Maintaining Discovered Association Rules", Proc. of the International Conference on Database Systems for Advanced Applications, Melbourne, April 1-4, 1997, 185-194.
[9] Jun-Hui Her, Sung-Hae Jun, Jun-Heyog Choi, and Jung-Hyun Lee, "A Bayesian Neural Network Model for Dynamic Web Document Clustering", Proc. of the IEEE Region 10 Conference, Vol. 2, December 1999, 1415-1418.
[10] T.P. Hong, "Incremental Data Mining Using Pre-large Itemsets", Proc. of the 2002 IEEE International Conference on Data Mining, Maebashi City, Japan, December 9-12, 2002.
[11] Guanling Lee, K.L. Lee, and Arbee L.P. Chen, "Efficient Graph-Based Algorithms for Discovering and Maintaining Association Rules in Large Databases", Knowledge and Information Systems 3, 2001, 338-355.
[12] Jiuyong Li, Hong Shen, and Rodney Topor, "Mining the Smallest Association Rule Set for Predictions", Proc. of the 2001 IEEE International Conference on Data Mining, California, USA, December 2001.
[13] G. Piatetsky-Shapiro, "Discovery, Analysis and Presentation of Strong Rules", Knowledge Discovery in Databases, AAAI/MIT Press, 1991, 229-248.
[14] Mei-Ling Shyu, Shu-Ching Chen, and Chi-Min Shu, "Affinity-Based Probabilistic Reasoning and Document Clustering on the WWW", Proc. of the 24th Annual International Computer Software and Applications Conference, Taipei, Taiwan, October 25-28, 2000, 149.
[15] Chunhua Wang, Houkuan Huang, and Honglian Li, "A Fast Distributed Mining Algorithm for Association Rules with Item Constraints", Proc. of the 2000 IEEE International Conference on Systems, Man, and Cybernetics, Vol. 1-5, 2000, 1900-1905.
[16] G.I. Webb, "Efficient Search for Association Rules", Proc. of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-00), Boston, MA, August 20-23, 2000, 99-107.
[17] Show-Jane Yen and Arbee L.P. Chen, "An Efficient Approach to Discovering Knowledge from Large Databases", Proc. of the IEEE/ACM International Conference on Parallel and Distributed Information Systems, 1996, 8-18.
[18] Mohammed J. Zaki and Ching-Jui Hsiao, "An Efficient Algorithm for Closed Association Rule Mining", Proc. of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, August 2000, 34-43.
[19] M.J. Zaki, "Generating Non-Redundant Association Rules", Proc. of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, 2000.
[20] Tian Zhang, Raghu Ramakrishnan, and Miron Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases", Proc. of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, 1996, 103-114.