National Science Council, Executive Yuan: Research Project Final Report

利用準大項目集之漸進式資料挖掘
Incremental Data Mining Using Pre-large Itemsets

Project type: □ Individual project □ Integrated project
Project number: NSC 89-2213-E-214-056
Project period: August 1, 2000 to July 31, 2001
Principal investigator: Professor 洪宗貝
Co-principal investigator: Assistant Professor 陶幼慧

This final report includes the following required attachments:
□ One report on an overseas business trip or study visit
□ One report on a business trip or study visit to mainland China
□ One report on attending an international academic conference, with a copy of each presented paper
□ One foreign research report for an international cooperative research project

Host institution: I-Shou University
October 20, 2001
National Science Council Research Project Final Report

利用準大項目集之漸進式資料挖掘
Incremental Data Mining Using Pre-large Itemsets

Project number: NSC 89-2213-E-214-056
Project period: August 1, 2000 to July 31, 2001
Principal investigator: 洪宗貝, Department of Information Management, I-Shou University
Co-principal investigator: 陶幼慧, Department of Information Management, I-Shou University
Project participants: 王慶堯, 林桂英, 王乾隆, 李詠騏, Graduate Institute of Information Engineering, I-Shou University
1、Chinese Abstract

In this project, we propose the concept of "pre-large itemsets" and design a novel, efficient incremental mining algorithm. Pre-large itemsets are defined by two support thresholds, a lower support threshold and an upper support threshold, and are used to reduce the processing of the original database and to save the cost of maintaining the mined knowledge. The proposed algorithm does not need to process or rescan the original database unless the amount of newly inserted data is too large. Moreover, the algorithm has the property that the larger the database grows, the better the obtained results become, which is especially useful for real-world database applications.

Keywords: data mining, association rule, large itemset, pre-large itemset, incremental mining.
English Abstract

In this project, we propose the concept of pre-large itemsets and design a novel, efficient incremental mining algorithm based on it. Pre-large itemsets are defined using two support thresholds, a lower support threshold and an upper support threshold, to reduce rescanning of the original database and to save maintenance costs. The proposed algorithm does not need to rescan the original database until a certain number of new transactions have been processed. As the database grows, the allowed number of new transactions grows as well; our proposed approach therefore becomes increasingly efficient. This characteristic is especially useful for real applications.

Keywords: data mining, association rule, large itemset, pre-large itemset, incremental mining.
2、Background and Purpose
In the past, many algorithms for mining
association rules from transactions were
proposed, most of which were executed in
level-wise processes. That is, itemsets
containing single items were processed first,
then itemsets with two items were processed,
then the process was repeated, continuously
adding one more item each time, until some
criteria were met. These algorithms usually
considered the database size static and
focused on batch mining. In real-world
applications, however, new records are
usually inserted into databases, and designing
a mining algorithm that can maintain
association rules as a database grows is thus
critically important.
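The level-wise process described above can be sketched in Python. This is a generic Apriori-style miner, not the algorithm proposed in this report; the function name and data layout are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Generic level-wise mining: find all itemsets whose support
    ratio meets min_support, adding one item per pass."""
    n = len(transactions)
    tx = [frozenset(t) for t in transactions]
    items = {i for t in tx for i in t}
    candidates = [frozenset([i]) for i in items]  # pass 1: 1-itemsets
    large, k = {}, 1
    while candidates:
        # Count each candidate k-itemset over the whole database.
        counts = {c: sum(1 for t in tx if c <= t) for c in candidates}
        level = {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}
        large.update(level)
        # Form (k+1)-candidates whose k-subsets are all large (pruning).
        prev = list(level)
        unions = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        candidates = [c for c in unions
                      if all(frozenset(s) in level for s in combinations(c, k))]
        k += 1
    return large  # maps frozenset -> count
```

Note that every pass scans the whole database; this is the batch behaviour whose cost the incremental approach below is designed to avoid.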
When new records are added to databases,
the original association rules may become
invalid, or new implicitly valid rules may
appear in the resulting updated databases
[7][8][9][11][12]. In these situations,
conventional batch-mining algorithms must
re-process the entire updated databases to
find final association rules. Two drawbacks may exist for conventional batch-mining algorithms in maintaining database knowledge:
(a) Nearly the same computation time as
that spent in mining from the original
database is needed to cope with each new
transaction. If the original database is large,
much computation time is wasted in
maintaining association rules whenever new
transactions are generated.
(b) Information previously mined from the
original database, such as large itemsets and
association rules, provides no help in the
maintenance process.
Cheung and his co-workers proposed an incremental mining algorithm, called FUP (Fast UPdate algorithm) [7], for incrementally maintaining mined association rules and avoiding the shortcomings mentioned above. The FUP algorithm
modifies the Apriori mining algorithm [3]
and adopts the pruning techniques used in the
DHP (Direct Hashing and Pruning) algorithm
[10]. It first calculates large itemsets mainly
from newly inserted transactions, and
compares them with the previous large
itemsets from the original database.
According to the comparison results, FUP
determines whether re-scanning the original
database is needed, thus saving some time in
maintaining the association rules. Although
the FUP algorithm can indeed improve
mining performance for incrementally
growing databases, original databases still
need to be scanned when necessary. In this
paper, we thus propose a new mining
algorithm based on two support thresholds to
further reduce the need for rescanning
original databases. Since rescanning the database takes much computation time, the proposed algorithm can thus reduce the maintenance cost.
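The two support thresholds can be illustrated with a small sketch; the function `classify` and its parameter names are illustrative, not part of the report:

```python
def classify(count, total, s_low, s_up):
    """Classify an itemset by its support ratio: at or above the upper
    threshold it is large, between the two thresholds it is pre-large,
    below the lower threshold it is small."""
    ratio = count / total
    if ratio >= s_up:
        return "large"
    if ratio >= s_low:
        return "pre-large"
    return "small"
```

For example, with a lower threshold of 3% and an upper threshold of 6%, an itemset appearing in 4 of 100 transactions is pre-large: frequent enough to track, not frequent enough to generate rules.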
3、Results and Discussions

In this project, we propose the concept of pre-large itemsets to solve the problem represented by case 3: the situation in which a candidate itemset is large for the new transactions but is not recorded in the large itemsets already mined from the original database in FUP. A pre-large itemset is not truly large, but promises to be large in the future. A lower support threshold and an upper support threshold are used to realize this concept. The upper support threshold is the same as that used in conventional mining algorithms: the support ratio of an itemset must exceed the upper support threshold for the itemset to be considered large. The lower support threshold defines the lowest support ratio for an itemset to be treated as pre-large. An itemset with a support ratio below the lower threshold is thought of as a small itemset. Pre-large itemsets act like buffers in the incremental mining process and are used to reduce the movement of itemsets directly from large to small and vice versa.

Considering an original database and newly inserted transactions under the two support thresholds, an itemset may fall into one of the nine cases illustrated in Figure 1.

                        New transactions
Original database    Large     Pre-large   Small
Large itemsets       Case 1    Case 2      Case 3
Pre-large itemsets   Case 4    Case 5      Case 6
Small itemsets       Case 7    Case 8      Case 9

Figure 1: Nine cases arising from adding new transactions to existing databases

Cases 1, 5, 6, 8 and 9 above will not affect the final association rules according to the weighted average of the counts. Cases 2 and 3 may remove existing association rules, and cases 4 and 7 may add new association rules. If we retain all large and pre-large itemsets with their counts after each pass, then cases 2, 3 and 4 can be handled easily. Also, in the maintenance phase, the ratio of new transactions to old transactions is usually very small, and this is even more apparent as the database grows larger. An itemset in case 7 cannot possibly be large for the entire updated database as long as the number of new transactions is small compared to the number of transactions in the original database. This is shown in the following theorem.

Theorem 1: Let S_l and S_u be respectively the lower and the upper support thresholds, and let d and t be respectively the numbers of the original and new transactions. If t ≤ (S_u − S_l)d / (1 − S_u), then an itemset that is small (neither large nor pre-large) in the original database but is large in the newly inserted transactions is not large for the entire updated database.

The details of the proposed maintenance algorithm are described below. A variable c is used to record the number of new transactions since the last rescan of the original database.

The proposed maintenance algorithm:
INPUT: A lower support threshold S_l, an upper support threshold S_u, a set of large itemsets and pre-large itemsets in the original database consisting of (d+c) transactions, and a set of t new transactions.
OUTPUT: A set of final association rules for the updated database.
STEP 1: Calculate the safety number f of new transactions according to Theorem 1 as follows:
f = ⌊(S_u − S_l)d / (1 − S_u)⌋.
STEP 2: Set k = 1, where k records the number of items in the itemsets currently being processed.
STEP 3: Find all candidate k-itemsets C_k and their counts from the new transactions.
STEP 4: Divide the candidate k-itemsets into three parts according to whether they are large, pre-large or small in the original database.
STEP 5: For each itemset I in the originally large k-itemsets L_k^D, do the following substeps:
Substep 5-1: Set the new count S^U(I) = S^T(I) + S^D(I).
Substep 5-2: If S^U(I)/(d+t+c) ≥ S_u, then assign I as a large itemset, set S^D(I) = S^U(I) and keep I with S^D(I); otherwise, if S^U(I)/(d+t+c) ≥ S_l, then assign I as a pre-large itemset, set S^D(I) = S^U(I) and keep I with S^D(I); otherwise, neglect I.
STEP 6: For each itemset I in the originally pre-large k-itemsets P_k^D, do the following substeps:
Substep 6-1: Set the new count S^U(I) = S^T(I) + S^D(I).
Substep 6-2: If S^U(I)/(d+t+c) ≥ S_u, then assign I as a large itemset, set S^D(I) = S^U(I) and keep I with S^D(I); otherwise, if S^U(I)/(d+t+c) ≥ S_l, then assign I as a pre-large itemset, set S^D(I) = S^U(I) and keep I with S^D(I); otherwise, neglect I.
STEP 7: For each itemset I in the candidate itemsets that is neither in the originally large itemsets L_k^D nor in the pre-large itemsets P_k^D, do the following substeps:
Substep 7-1: If I is in the large itemsets L_k^T or pre-large itemsets P_k^T from the new transactions, then put it in the rescan-set R, which is used when rescanning in STEP 8 is necessary.
Substep 7-2: If I is small for the new transactions, then do nothing.
STEP 8: If t + c ≤ f or R is empty, then do nothing; otherwise, rescan the original database to determine whether the itemsets in the rescan-set R are large or pre-large.
STEP 9: Form candidate (k+1)-itemsets C_(k+1) from the finally large and pre-large k-itemsets (L_k^U ∪ P_k^U) that appear in the new transactions.
STEP 10: Set k = k + 1.
STEP 11: Repeat STEPs 4 to 10 until no new large or pre-large itemsets are found.
STEP 12: Modify the association rules according to the modified large itemsets.
STEP 13: If t + c > f, then set d = d + t + c and set c = 0; otherwise, set c = t + c.
After STEP 13, the final association rules for the updated database have been determined.
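Theorem 1 and the count-update substeps can be sketched as follows, assuming the report's definitions of d, t, c, the two thresholds, and the stored counts; the function names are illustrative:

```python
from math import floor

def safety_number(d, s_low, s_up):
    """Safety number f from Theorem 1: up to f new transactions can be
    absorbed (t + c <= f) without rescanning the original database."""
    return floor((s_up - s_low) * d / (1 - s_up))

def update_itemset(count_db, count_new, d, t, c, s_low, s_up):
    """Substeps 5-1/5-2: add the count from the new transactions to the
    stored count and reclassify against the updated database size."""
    total = count_db + count_new
    ratio = total / (d + t + c)
    if ratio >= s_up:
        return "large", total
    if ratio >= s_low:
        return "pre-large", total
    return "small", None  # neglected: the count is no longer kept
```

For example, with d = 10000, a lower threshold of 0.02 and an upper threshold of 0.04, f = ⌊200/0.96⌋ = 208, so up to 208 new transactions can be handled between rescans; the bound grows linearly with d, which is why the approach gets cheaper as the database grows.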
4、Self-Evaluation

In this paper, we have proposed the concept of pre-large itemsets, and designed a novel, efficient, incremental mining algorithm based on it. Using two user-specified upper and lower support thresholds, the pre-large itemsets act as a gap to avoid small itemsets becoming large in the updated database when transactions are inserted. Our proposed algorithm also retains the features of the FUP algorithm [7][11]. Moreover, the proposed algorithm can effectively handle cases in which itemsets are small in the original database but large in the newly inserted transactions, although it does need additional storage space to record the pre-large itemsets. Note that the FUP algorithm needs to rescan databases to handle such cases. The proposed algorithm does not require rescanning of the original database until a number of new transactions, determined from the two support thresholds and the size of the database, have been processed. If the size of the database grows larger, then the number of new transactions allowed before rescanning will be larger too. Therefore, as the database grows, our proposed approach becomes increasingly efficient. This characteristic is especially useful for real-world applications. The contents of this paper have been published in Intelligent Data Analysis, Vol. 5, No. 2, pp. 111-129, 2001.

5、References

[1] R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large databases," The ACM SIGMOD Conference, pp. 207-216, Washington DC, USA, 1993.
[2] R. Agrawal, T. Imielinski and A. Swami, "Database mining: a performance perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, pp. 914-925, 1993.
[3] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," The International Conference on Very Large Data Bases, pp. 487-499, 1994.
[4] R. Agrawal and R. Srikant, "Mining sequential patterns," The Eleventh IEEE International Conference on Data Engineering, pp. 3-14, 1995.
[5] R. Agrawal, R. Srikant and Q. Vu, "Mining association rules with item constraints," The Third International Conference on Knowledge Discovery in Databases and Data Mining, pp. 67-73, Newport Beach, California, 1997.
[6] M.S. Chen, J. Han and P.S. Yu, "Data mining: An overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883, 1996.
[7] D.W. Cheung, J. Han, V.T. Ng and C.Y. Wong, "Maintenance of discovered association rules in large databases: An incremental updating approach," The Twelfth IEEE International Conference on Data Engineering, pp. 106-114, 1996.
[8] D.W. Cheung, S.D. Lee and B. Kao, "A general incremental technique for maintaining discovered association rules," Proceedings of Database Systems for Advanced Applications, pp. 185-194, Melbourne, Australia, 1997.
[9] M.Y. Lin and S.Y. Lee, "Incremental update on sequential patterns in large databases," The Tenth IEEE International Conference on Tools with Artificial Intelligence, pp. 24-31, 1998.
[10] J.S. Park, M.S. Chen and P.S. Yu, "Using a hash-based method with transaction trimming for mining association rules," IEEE Transactions on Knowledge and Data Engineering, Vol. 9, No. 5, pp. 812-825, 1997.
[11] N.L. Sarda and N.V. Srinivas, "An adaptive algorithm for incremental mining of association rules," The Ninth International Workshop on Database and Expert Systems, pp. 240-245, 1998.
[12] S. Zhang, "Aggregation and maintenance for database mining," Intelligent Data Analysis, Vol. 3, No. 6, pp. 475-490, 1999.