National Dong Hwa University
Department of Computer Science and Information Engineering
Master's Thesis

Efficient Partial Multiple Periodic Patterns Mining Without
Redundant Rules

Graduate Student: 楊文博
Advisor: Dr. 李官陵
June 2003
Acknowledgement
From the initial idea to the final draft, this thesis owes its completion to countless teachers and friends, and my gratitude is beyond words. First, I would like to thank my advisor, Professor 李官陵, who cared about every aspect of this work, from shaping the direction of my thinking and providing materials to revising the manuscript; it is under this guidance that the thesis could be completed so fully and smoothly.
I would also like to thank the three members of my oral examination committee, Professor 羅壽之, Professor 陳良弼, and Professor 徐嘉連, whose suggestions and comments made this thesis more complete.
In addition, I would like to thank all my labmates and junior classmates; with their help I could devote myself fully to research, and with their encouragement this thesis could be finished on schedule.
Finally, I would like to thank my family, whose support allowed me to graduate smoothly. I especially thank my grandmother, who passed away on the day of my oral defense; with her blessing, I was able to pass the examination.
1
Abstract
Partial periodic patterns mining is an interesting problem in data mining. In previous studies, both the full and the partial multiple periodic patterns mining problems have been considered. The proposed methods, however, may produce redundant information and are inefficient.
In this thesis, a novel concept and new parameters are proposed to improve the performance of partial multiple periodic patterns mining. Moreover, the proposed method does not produce information that is redundant with respect to the user's expectation. Instead of mining every period, we only check the necessary periods and use this information for further mining. Instead of considering the whole database, the information needed for mining partial periodic patterns is transformed into a bit vector that can be stored in main memory. Therefore, our approach scans the database at most twice. A set of simulations is also performed to show the benefit of our approach.
Keywords: Data Mining, Partial Periodic Patterns Mining, Multiple Periodic Patterns Mining, Time Series Analysis.
Chinese Abstract
Partial periodic patterns mining is an interesting part of data mining. In previous studies, mining algorithms for both full and partial periodic patterns have been explored. However, the previous methods are inefficient and produce redundant rules.
In this thesis, we propose a new concept to improve the performance of partial multiple periodic patterns mining algorithms; moreover, the proposed algorithm does not produce redundant rules. Instead of mining all periods, we only examine the periods that really need to be mined, and we use the information from already-mined periods to skip the mining of certain periods. Furthermore, we transform the database into a bit matrix, required by the algorithm, that fits in main memory. Therefore, the algorithm scans the complete database at most twice. Finally, we present a series of simulation experiments to verify the efficiency gain of the algorithm.
Keywords: time-related databases, partial periodic patterns mining, multiple periodic patterns mining, data mining.
Table of Contents
Acknowledgement…………………………………………………………………….1
English Abstract……………………………………………………………………….2
Chinese Abstract………………………………………………………………………3
Chapter
1. Introduction
1.1 Motivation and Objective……………………………………………….5
1.2 Method and Achievement……………………………………………….5
1.3 Thesis Organization……………………………………………………..6
2. Background
2.1 Introduction to Data Mining…………………………………………….7
2.2 Association Rule Mining………………………………………………..8
2.3 Periodic Patterns Mining…………………………………………….....10
2.3.1 Full Periodic Patterns Mining…………………………………..12
2.3.2 Partial Periodic Patterns Mining………………………………..12
2.3.3 Multiple Periodic Patterns Mining and Our Approach…………13
3. Mining Algorithm for Multiple Periodic Patterns
3.1 Data Pre-Processing……………………………………………….……15
3.2 Partial Single Periodic Patterns Mining…………………………….…..16
3.3 Partial Multiple Periodic Patterns Mining………………………….…...18
3.3.1 Prime Period Mining (PPM)…………………………………....20
3.3.2 Composite Period Mining (CPM)……………………………....23
3.4 Data Post-Processing……………………………………………………24
4. Experiment and Efficiency Analysis
4.1 Simulation Platform…………………………………………………….26
4.2 Efficiency Analysis……………………………………………………..28
5. Conclusion and Future Works
5.1 Conclusion……………………………………………………………...29
5.2 Future Works……………………………………………………………29
Reference
CHAPTER 1
INTRODUCTION
1.1 Motivation and Objective
In previous studies of periodic patterns mining, the focus has been on the efficiency of mining full/partial single periodic patterns[6][12]. In [6], multiple periodic patterns are mined by repeating the procedure for mining single periodic patterns. This is inefficient, and the result contains too many redundant rules. Therefore, we propose a method that mines partial multiple periodic patterns efficiently and generates a result without redundant rules.
1.2 Method and Achievement
In this thesis, our objective is to mine partial multiple periodic patterns (multiple periodic patterns for short) efficiently and to generate a concise set of rules. To achieve this objective, we propose two algorithms: the Prime Period Mining Algorithm (PPM) and the Composite Period Mining Algorithm (CPM). The Prime Period Mining Algorithm is responsible for mining the periods to which the pruning properties cannot be applied, and the Composite Period Mining Algorithm applies the pruning properties to check the remaining periods. Because the Composite Period Mining Algorithm only checks the itemsets found not large by the Prime Period Mining Algorithm, the output of our approach does not contain any redundant rules. At the end of the thesis, a set of experiments is presented to show the efficiency gain of our approach.
1.3 Thesis Organization
The remainder of this thesis is organized as follows. In chapter 2, we will
investigate the background knowledge and related works of periodic patterns mining.
The method for mining multiple periodic patterns is discussed in chapter 3. The
experimental evaluation is presented in chapter 4. Finally, we present our conclusions
in chapter 5 and identify directions for future research.
CHAPTER 2
BACKGROUND
2.1 Introduction to Data Mining
In this knowledge-economy age, it is easy for any company to collect and keep huge amounts of data. At the same time, high-speed computation has made it feasible to analyze these data; this analysis is called data mining. Simply stated, data mining refers to extracting or "mining" knowledge from large amounts of data. Many other terms carry a similar meaning, such as knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. However, the most widely used term is Knowledge Discovery in Databases, or KDD. No matter which term we use, they all consist of the same iterative sequence of the following steps:
1. Data cleaning: remove noise and inconsistent data.
2. Data integration: multiple data sources may be combined.
3. Data selection: data relevant to the analysis task are selected from the database.
4. Data transformation: data are transformed into forms appropriate for the mining task, for example by performing summary or aggregation operations.
5. Data mining: the essential process where intelligent methods are applied in order to extract patterns.
6. Pattern evaluation: identify the interesting patterns representing knowledge, based on some interestingness measures.
7. Knowledge presentation: visualization and knowledge representation techniques are used to present the mined knowledge to the user.
The purpose of data mining is to help policymakers make good and correct decisions. Therefore, for different application domains, there are several functionalities in data mining, including concept description[4][7], association analysis[1][18][12][13][10][17][2][11], classification/prediction[3][14], cluster analysis[9][8], and outlier and evolution analysis[8]. In the following discussion, we focus on association analysis.
2.2 Association Rule Mining
In the association analysis of a transaction database, data records are stored in transaction form, where each transaction is a set of items. Under this supposition, the problem of discovering association rules is defined as finding relationships between the occurrences of items within transactions[1]. For example, an association rule might be "bread → milk, support = 10%, confidence = 90%", which means that 10% of all transactions contain both items, and 90% of the transactions that contain the item "bread" also contain the item "milk". Each association rule should have a measure of certainty associated with it that assesses the validity of the rule; it is called confidence. The support of an association rule refers to the percentage of task-relevant transactions for which the rule is true. Therefore, there are two important parameters in the data mining process: the first is the minimum support, and the second is the minimum confidence. A large itemset is an itemset that satisfies the minimum support. A strong association rule is a rule of the form "A → B" over a large itemset that satisfies the minimum confidence. The support of an itemset is the fraction of transactions that contain the itemset. The confidence of a rule X → Y is the fraction of transactions containing X that also contain Y. The association rule X → Y holds if X ∪ Y is large and the confidence of the rule exceeds a given threshold, the minimum confidence. Furthermore, an itemset that contains k items is a k-itemset. For example, the set {A, B} is a 2-itemset.
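These definitions can be illustrated with a small sketch (the transactions are toy data, not from the thesis):

```python
# Toy sketch of support and confidence over a hypothetical transaction list.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk"},
    {"bread"},
    {"eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """Fraction of transactions containing X that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

print(support({"bread", "milk"}, transactions))       # 0.4
print(confidence({"bread"}, {"milk"}, transactions))  # 2/3
```

Here "bread → milk" has support 2/5 and confidence 2/3, since two of the three transactions containing bread also contain milk.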
There are two major problems in association rule mining. The first is the huge number of candidate itemsets, and the second is the number of scans over the transaction database. In [1], Agrawal et al. propose an algorithm called Apriori, which employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets, denoted L1, is found. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database. Obviously, as the transaction database grows larger, the efficiency of the Apriori algorithm degrades drastically. Many variations of the Apriori algorithm have been proposed that focus on improving the efficiency of the original algorithm, for example hash-based techniques, transaction reduction, partitioning, sampling, and dynamic itemset counting.
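The level-wise search can be sketched as follows (a minimal illustrative Python version, not the optimized implementation of [1]):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: large (k-1)-itemsets generate candidate k-itemsets."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-itemsets.
    current = [frozenset([i]) for i in items
               if sum(1 for t in transactions if i in t) / n >= min_support]
    all_large = list(current)
    k = 2
    while current:
        # Join step: combine large (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be large.
        prev = set(current)
        candidates = [c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))]
        # One scan of the database counts each surviving candidate.
        current = [c for c in candidates
                   if sum(1 for t in transactions if c <= t) / n >= min_support]
        all_large.extend(current)
        k += 1
    return all_large

tx = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(apriori(tx, 0.6))  # three 1-itemsets and three 2-itemsets; {A,B,C} is not large
```

Each pass of the `while` loop corresponds to one full database scan, which is exactly the cost the bit-vector techniques of Chapter 3 aim to avoid.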
The discovery of association relationships among huge amounts of data is useful in selective marketing, decision analysis, and business management. A popular area of application is market basket analysis, which studies the buying habits of customers by searching for sets of items that are frequently purchased together. Although association rules are useful, they should not be used directly for prediction without further analysis or domain knowledge. They are, however, a helpful starting point for further exploration, making them a popular tool for understanding data.
2.3 Periodic Patterns Mining
In early studies of association analysis, the transactions are regarded as a single data segment, and no attention is paid to segmenting the data over different time intervals[6][12][10]. In the real world, we may need to find, for instance, that bread and milk are sold together every couple of days or weeks. For this reason, association analysis over a suitable time interval better matches users' expectations. The issue of time-interval association analysis is generally called periodic patterns mining[12][6][10].
The periodic patterns mining problem can be categorized into two issues. One is full periodic patterns mining, where every point in time contributes to the cyclic behavior of the time series. The other, more general one is called partial periodic patterns mining, which specifies the behavior of the time series at some, but not all, points in the time series. Furthermore, the problem covering both issues is how to mine multiple periodic patterns efficiently, and that is the focus of this thesis. The definition of periodic patterns is given below.
Assume a time series database D of n time units has been collected. Let Di denote the unit time database for the i-th time unit. Thus, the time series database is represented as

D = D1 ∪ D2 ∪ D3 ∪ ... ∪ Dn.

We define the large-1 matrix as D' = {{L1}1, {L1}2, {L1}3, ..., {L1}n}, where {L1}i denotes the set of large 1-itemsets in Di. We also define a periodic pattern "S[p,o][periodic support]", which appears with period length p and offset o from timestamp i = 1. We use p to denote the period length of S and o to denote the offset inside period p. For example, "AB[2,0][90%]" represents that pattern "AB" is frequent in 90% of {Di | (i−1) mod 2 = 0}. A frequent periodic pattern is a periodic pattern that satisfies the periodic minimum support. For example, if the periodic minimum support is no larger than 90%, "AB[2,0][90%]" is a frequent periodic pattern; otherwise, it is not.

The feature list of periodic pattern S[p,o] is represented as

f_S[p,o] = <f_i>, f_i ∈ {0, 1},

an ordered list over the timestamps i = p·k + o + 1 for all integers k with 0 ≤ k < ⌈n/p⌉, where for f_i, f_j ∈ f_S[p,o], if i < j then f_i appears before f_j. Moreover, f_i is set to 1 if every item of S is contained in {L1}i, and f_i = 0 otherwise. For example, let D' = {{ABC}, {BCD}, {ACD}, {DEF}}; then f_AB[2,0] = <10>.
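Under the definitions above, the feature list can be sketched as follows (0-based list indices stand in for the timestamps, so position p·k + o here corresponds to timestamp p·k + o + 1):

```python
def feature_list(pattern, p, o, d_prime):
    """f_{S[p,o]}: bit k is 1 iff every item of the pattern is large in the
    unit time database at position p*k + o of the large-1 matrix d_prime."""
    return [1 if pattern <= d_prime[i] else 0
            for i in range(o, len(d_prime), p)]

# The example from the text: D' = {{ABC}, {BCD}, {ACD}, {DEF}}.
d_prime = [{"A", "B", "C"}, {"B", "C", "D"}, {"A", "C", "D"}, {"D", "E", "F"}]
print(feature_list({"A", "B"}, 2, 0, d_prime))  # [1, 0], i.e. <10>
```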
2.3.1 Full Periodic Patterns Mining
The problem of full periodic patterns mining was first addressed by Ozden et al. in [12], and the definition of a full periodic pattern is as follows. In the full periodic patterns mining problem, the periodic minimum support is equal to 100%. Therefore, a pattern S[p,o] is a frequent full periodic pattern if and only if f_i = 1 for all f_i ∈ f_S[p,o].
By applying this characteristic of full periodic patterns to prune irrelevant data, an efficient algorithm, the interleaved algorithm, is proposed in [12]. The strict constraint of full periodic patterns makes the proposed algorithm very efficient. However, it also prevents this approach from being generalized to the more general problem, partial periodic patterns mining.
2.3.2 Partial Periodic Patterns Mining
Partial periodic patterns mining has been discussed thoroughly in [6][10]. By definition, a pattern S[p,o] is a frequent partial periodic pattern if and only if

|{f_i ∈ f_S[p,o] : f_i = 1}| / ⌈n/p⌉ ≥ periodic minimum support,

where |·| indicates the number of f_i that satisfy the predicate.
In [6], Han et al. consider the efficient mining of partial periodic patterns, for a single period as well as for a set of periods. Some interesting properties related to partial periodic patterns, including the Apriori property and the max-subpattern hit set property, are explored in the proposed methods. The main contribution of Han's study is the speedy mining process for single partial periodic patterns.
An interesting method for mining partial periodic patterns is proposed in [10]. Different from [6] and our approach, the method proposed by Li et al. only mines the periodic patterns that satisfy a specified calendar schema[10].
2.3.3 Multiple Periodic Patterns Mining and Our Approach
In previous studies, the problem of multiple periodic patterns mining has received little attention. Only in [6] do Han et al. propose mining multiple periodic patterns, by repeating the same procedure as for mining single periodic patterns. Thus, with regard to multiple periodic patterns mining, the algorithm of Han's study is slow. Furthermore, many redundant patterns are still mined. For example, if AB[2,0] is a frequent periodic pattern, AB[4,0] may still be mined. Why is this kind of pattern redundant? Take a supermarket for example, and suppose we have the rules "we need to stock milk every two days" and "we need to stock milk every four days". Clearly the second rule is redundant, because stocking milk every two days already covers stocking milk every four days.
In this thesis, we present some properties that speed up multiple periodic patterns mining. Given a predefined periodic minimum support and a maximum period length, the set of frequent partial periodic patterns whose period length is no larger than the maximum period length is mined.
The complete mining process is divided into three steps. The first step is data pre-processing, which converts the original transaction database into the large-1 database. The second step is the multiple periodic patterns mining process. This step contains two major procedures, "Prime Period Mining" and "Composite Period Mining". As implied by the names, PPM deals with the prime number periods, and CPM deals with the composite number periods. An un-frequent itemset list is constructed to store the un-frequent itemsets (discussed in Chapter 3). All un-frequent itemsets of prime number periods must be further checked by CPM, based on the period expansion property. The final step is data post-processing. In this step, the unit time databases are scanned to check whether the candidate itemsets generated by the multiple periodic patterns mining process are large in the unit time databases.
CHAPTER 3
MINING ALGORITHM FOR MULTIPLE PERIODIC PATTERNS
3.1 Data Pre-Processing
Let D be the transaction database, and let a transaction be t = (TIME, ITEMSET). ITEMSET is the set of items that a customer purchases, and TIME is the purchase time. Assume that the time attribute is in a calendar format such as (year, month, day)[10]. In the transformation phase, the time attribute needs to be aggregated to a time unit that users are interested in. Referring to Figure 1, the redundant portion of the time attribute in each transaction is ignored. In this case, we consider the daily databases separately.
Time    Items
Day 1   ACD
Day 1   BDE
...     ...
Day 2   DEF
...     ...
Day N   GEF

Figure 1. Time series database
In our algorithm, the transactions are partitioned based on the time attribute, and each partition forms a unit time database. In each unit time database, the Apriori algorithm[1] is executed to generate the large 1-itemsets. We then build a matrix called the large-1 matrix, denoted M, to store these large 1-itemsets. Referring to Figure 2, M(i,j) is set to "1" if item j is a large item in unit time database i, and to "0" otherwise.
Time   A  B  C  D  E
1      1  0  1  1  0
2      0  1  0  1  0
3      0  0  1  1  0
...    .  .  .  .  .
N      1  1  0  0  1

Figure 2. Large-1 matrix
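The pre-processing step can be sketched as follows; for brevity the sketch marks an item large by its plain support within the unit time database rather than running a full Apriori pass, and the function and variable names are illustrative:

```python
from collections import Counter, defaultdict

def large1_matrix(transactions, items, min_support):
    """Build M: one row per time unit, M[i][j] = 1 iff item j is large in
    unit time database i. `transactions` is a list of (time_unit, itemset)."""
    by_unit = defaultdict(list)
    for time_unit, itemset in transactions:
        by_unit[time_unit].append(itemset)
    matrix = {}
    for unit, tx in sorted(by_unit.items()):
        counts = Counter(item for t in tx for item in t)
        # Item j is large in this unit iff its support reaches min_support.
        matrix[unit] = [1 if counts[j] / len(tx) >= min_support else 0
                        for j in items]
    return matrix

tx = [(1, {"A", "C", "D"}), (1, {"B", "D", "E"}), (2, {"D", "E", "F"})]
print(large1_matrix(tx, list("ABCDEF"), 0.5))
# {1: [1, 1, 1, 1, 1, 0], 2: [0, 0, 0, 1, 1, 1]}
```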
3.2 Mining Single Periodic Patterns
Let the large-1 matrix be the matrix shown in Figure 2. As illustrated above, row i of the large-1 matrix records the large 1-items in unit time database i. Assume we are interested in periodic patterns of period length P. According to the offsets of the periodic patterns, the P-length periodic patterns can be partitioned into P sets, i.e., offsets from 0 to (P−1). In our approach, D' is accordingly partitioned into P sets, D_0^P, D_1^P, ..., D_{P−1}^P, where D_o^P keeps the information related to periodic patterns with period length P and offset o. In each set, we run an Apriori-like algorithm[1] to get all frequent periodic patterns.
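The partition of D' by offset is a simple stride over the matrix rows (a sketch; row i of the list corresponds to unit time database i+1):

```python
def partition_by_offset(rows, p):
    """Split the large-1 matrix rows into the p sets D_0^p ... D_{p-1}^p,
    one per offset, by taking every p-th row starting at each offset."""
    return [rows[o::p] for o in range(p)]

rows = ["Day1", "Day2", "Day3", "Day4", "Day5", "Day6"]
print(partition_by_offset(rows, 2))
# [['Day1', 'Day3', 'Day5'], ['Day2', 'Day4', 'Day6']]
```

With p = 2, offset 0 selects Day 1, Day 3, Day 5, ..., which is exactly how D_0^2 in Figure 3 is formed.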
Time     A  B  C  D  E  F
Day 1    1  1  1  0  0  1
Day 3    0  1  1  0  1  0
Day 5    0  0  1  0  1  1
Day 7    0  0  0  1  1  1
Day 9    0  0  0  1  1  1
Day 11   1  1  1  0  0  1

Figure 3. D_0^2
For example, Figure 3 shows D_0^2, the related large 1-itemset matrix of period length 2 and offset 0. In the Apriori-like algorithm, the candidate 2-itemsets (C2) are generated by joining pairs of large 1-itemsets, and L2 is generated by scanning D_0^2 again. The process is repeated until no more candidate itemsets are generated. In order to utilize D_0^2 efficiently, scanning D_0^2 is replaced by Boolean operations. Take pattern "BE" for example: we apply "AND" to f_B[2,0] = <110001> and f_E[2,0] = <011110> to obtain f_BE[2,0] = <010000>. Assume the periodic minimum support count is 2; it is obvious that BE is not large. Furthermore, all frequent periodic patterns found by single periodic patterns mining must be checked again; the details of the check process are discussed in Section 3.4.
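The Boolean substitution for scanning follows directly from the example (bits kept as a Python list here; a packed-integer AND behaves the same way):

```python
# Feature lists of B and E in period [2,0], read off Figure 3.
f_b = [1, 1, 0, 0, 0, 1]
f_e = [0, 1, 1, 1, 1, 0]

# Bitwise AND replaces a scan of D_0^2; the popcount is the support count.
f_be = [b & e for b, e in zip(f_b, f_e)]
print(f_be, sum(f_be))  # [0, 1, 0, 0, 0, 0] 1  -> below the threshold of 2

# Equivalently with the bits packed into an integer:
print(0b110001 & 0b011110 == 0b010000)  # True
```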
3.3 Mining Multiple Periodic Patterns
Multiple periodic patterns mining is an extension of single periodic patterns mining. In previous studies, the mining algorithm for multiple periodic patterns executes the single periodic pattern mining process again and again until all preferred period lengths have been covered. We observe that if the multiple periodic patterns are mined period by period, not only the cost of mining but also the redundancy of the patterns is too high to bear. A pattern is said to be redundant if and only if it is contained by another pattern. The redundant pattern is defined in the following.
Definition [Period Contain Property]  Let period [p,o] denote the period with period length p and offset o. A period [p',o'] is said to be contained by period [p,o] if and only if p' = m·p for some integer m, and o' mod p = o.
By this definition, if period [p',o'] is contained by [p,o], then f_S[p',o'] ⊆ f_S[p,o].
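The containment test is a one-liner; a small sketch (illustrative names):

```python
def contains(p, o, p2, o2):
    """True iff period [p2,o2] is contained by period [p,o]:
    p2 is a multiple of p, and o2 mod p equals o."""
    return p2 % p == 0 and o2 % p == o

print(contains(2, 0, 4, 0))  # True
print(contains(2, 0, 4, 2))  # True  (2 mod 2 == 0)
print(contains(2, 0, 3, 0))  # False (3 is not a multiple of 2)
```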
Definition [Redundant Pattern]  A pattern S[p',o'] is a redundant pattern if and only if there exists a frequent periodic pattern S[p,o] such that [p',o'] is contained by period [p,o].
For example, suppose we have the rules "we need to stock milk every two days" and "we need to stock milk every four days". Clearly the second rule is redundant, because stocking milk every two days already covers stocking milk every four days. It is obvious that these redundant patterns have to be pruned.
In our approach, the cost of mining partial multiple periodic patterns is reduced from two viewpoints. From the viewpoint of redundant patterns, if we avoid generating redundant patterns, both the process and the result of mining become concise. The pruning of redundant patterns involves two cases: first, S' is a frequent periodic pattern, and second, S' is not a frequent periodic pattern. In the first case, S' is a redundant pattern; the information contained by S' is also contained by other frequent partial periodic patterns, so there is no need to mine it out. In the second case, S' is not a frequent periodic pattern, so of course there is no need to mine it out. From the viewpoint of periods, the data collected while mining a previous period can supply information for mining the periods contained by that period. These two observations are the starting point of our pruning properties.
The following lemma shows how the data collected while mining a previous period p can be used to speed up the mining of the periods contained by p.

Lemma 1 [Lower-bound property]
Let n be the number of unit time databases, p be the period length, and m be the periodic minimum support. If the count of periodic pattern S[p,o], C_S[p,o], is less than m·⌈n/(t·p)⌉ for an integer t, then S cannot be a frequent periodic pattern with period length t·p and offset o.

Proof
Let C_S[p,o] be the count of periodic pattern S[p,o], and suppose C_S[p,o] < m·⌈n/(t·p)⌉. According to the period contain property, f_S[t·p,o] ⊆ f_S[p,o]
⇒ C_S[t·p,o] ≤ C_S[p,o]
⇒ C_S[t·p,o] < m·⌈n/(t·p)⌉
⇒ S cannot be a frequent periodic pattern with period length t·p and offset o. ∎
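A small numeric illustration of the lower bound (the counts are hypothetical):

```python
import math

# n unit time databases, base period p, periodic minimum support m,
# expansion factor t: a pattern counted fewer than m * ceil(n/(t*p))
# times in [p,o] cannot be frequent in [t*p, o].
n, p, m, t = 12, 2, 0.75, 2
threshold = m * math.ceil(n / (t * p))  # 0.75 * 3 = 2.25
count_s_2_0 = 2                          # hypothetical count of S[2,0]
print(count_s_2_0 < threshold)           # True -> S[4,0] can be pruned
```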
Lemma 1 guarantees that if the count of a pattern in period p is lower than the threshold, the pattern has no chance of being frequent in the periods contained by [p,o]. This is one of the pruning properties of our approach. Following the definition of redundant patterns, we discover that only the un-frequent itemsets of the prime number periods need to be checked in the composite number periods. Therefore, in order to avoid generating redundant patterns and to achieve the best pruning effect, we first mine the periods to which Lemma 1 cannot be applied; these are obviously the prime number periods. Because every composite number is a product of primes, we can use the information gathered by PPM to prune the candidate itemsets in the composite number periods.
3.3.1 Prime Period Mining (PPM)
A prime period is a period whose length is a prime number; it cannot be composed from other periods, so the pruning properties cannot be applied when mining prime periods. The single periodic pattern mining algorithm is used to mine the prime number periods. However, some modifications are made to adapt it to multiple partial periodic patterns mining. The modified algorithm is shown below.
Procedure PPM(D: Large-1 matrix)
  Scan D to find out L1;
  Let C1' = all 1-itemsets of D;
  For all {a} in C1' - L1
    If count of {a} < (n / Max_period) * periodic_min_sup
      Discard {a} from C1';
  For (k = 2; ; k++)
    Join Ck-1' to generate Ck;
    Scan D to find out Lk;
    For all {a} in Ck - Lk
      If count of {a} < (n / Max_period) * periodic_min_sup
        Discard {a} from Ck;
    Let Ck' = Ck;
    If Ck' = {}, goto End;
  End
  Output(frequent pattern list);
  Output(un-frequent itemset list);
End
The value (n / Max_period_length) · periodic minimum support is a threshold called the pruning minimum support. The reason why we use the pruning minimum support to prune candidate itemsets is based on the indefinite property on partial periodicity.
Property 2 [Indefinite property on partial periodicity]
If pattern S[p,o] is a frequent partial periodic pattern, pattern S may not be a frequent partial periodic pattern in the periods contained by period [p,o]. If pattern S[p,o] is not a frequent partial periodic pattern, pattern S may still be a frequent partial periodic pattern in the periods contained by period [p,o].
Property 2 follows from the nature of the partial periodic pattern definition. Suppose pattern S[2,0] is not a frequent partial periodic pattern: let f_S[2,0] = <101010101000> and the periodic minimum support be 50%. However, f_S[4,0] = <111110>, so pattern S[4,0] is obviously a frequent periodic pattern. In the other case, suppose pattern S[2,0] is a frequent partial periodic pattern: let f_S[2,0] = <011111> and the periodic minimum support be 75%. We have f_S[4,0] = <011>, so pattern S[4,0] is obviously not a frequent periodic pattern.
According to Property 2, the frequency of a pattern in the current period cannot be exactly predicted from the previous periods, so the candidate pruning method of [1] must be modified. The candidate itemsets are divided into two parts: the un-frequent itemsets and the pruning itemsets. The PPM algorithm therefore uses two thresholds: the periodic minimum support and the pruning minimum support. An itemset whose count reaches the periodic minimum support count is a frequent periodic pattern and is put into the frequent pattern list. An itemset whose count is lower than the pruning minimum support count is a pruning itemset and is pruned. The remaining itemsets, whose counts lie between the pruning minimum support count and the periodic minimum support count, are candidate itemsets and cannot be pruned. The pruning minimum support is set to (n / Max_period_length) · periodic minimum support, because this is the minimum count a pattern needs in order to be a frequent pattern in a period whose length is no larger than the maximum period length. Any itemset whose count is beyond this threshold has a chance to become a frequent periodic pattern; therefore, these un-frequent itemsets are joined together with the frequent itemsets to generate the candidate itemsets of the next mining pass.
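The two-threshold partition described above can be sketched as follows (illustrative names; the counts are hypothetical):

```python
def classify(counts, periodic_min_count, pruning_min_count):
    """Split itemsets into frequent patterns, un-frequent (kept) candidates,
    and pruned itemsets, using the two PPM thresholds."""
    frequent, unfrequent, pruned = [], [], []
    for itemset, count in counts.items():
        if count >= periodic_min_count:
            frequent.append(itemset)    # goes to the frequent pattern list
        elif count >= pruning_min_count:
            unfrequent.append(itemset)  # kept for the expansion periods
        else:
            pruned.append(itemset)      # below the pruning minimum support
    return frequent, unfrequent, pruned

counts = {"AB": 9, "CD": 5, "EF": 1}
print(classify(counts, periodic_min_count=8, pruning_min_count=3))
# (['AB'], ['CD'], ['EF'])
```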
3.3.2 Composite Period Mining (CPM)
A composite number period is composed of combinations of its factor periods. Therefore, we can use the information gathered by the prime period mining procedure to evaluate the composite periods, instead of running the prime period mining procedure again. This is called period expansion, and it is based on Property 3. Period expansion is depicted in Figure 4.
[2,0]
 ├─ [4,0]
 │   ├─ [8,0]
 │   └─ [8,4]
 └─ [4,2]
     ├─ [8,2]
     └─ [8,6]

Figure 4. Period expansion
Property 3 [Period expansion property]
Let S[p,o] be a candidate itemset, [p,o] be the base period, and [p',o'] be the expansion period. Then [p,o] and [p',o'] share the same related large-1 matrix D_o^p of S if and only if period [p',o'] is contained by [p,o].
The main concept of period expansion is that the support counts of the itemsets mined in the base period can serve as the foundation for the expansion period. Because the expansion period and the base period share the same related large-1 matrix, once we have the support count of an itemset in the base period, we can predict, based on Lemma 1, whether the itemset can be frequent in the expansion period. If the support count of the itemset is lower than the threshold, the itemset is put into the un-frequent itemset list again, waiting for examination in the next round of period expansion. If it is beyond the threshold, the feature list of the itemset is generated and checked by the Boolean operation discussed above to confirm whether it is a frequent periodic pattern. If the itemset is a frequent periodic pattern, it is put into the frequent itemset list; otherwise, it is put into the un-frequent itemset list.
The pruning of redundant periodic patterns is also achieved by period expansion. During the expansion, the itemsets we check are those in the un-frequent itemset list, so no redundant periodic patterns are generated in the expansion periods. For example, suppose we have a frequent periodic pattern BE[2,0] and an un-frequent pattern CD[2,0]. Patterns BE and CD are put into the frequent itemset list and the un-frequent itemset list, respectively. When mining the expansion periods, CD is checked in periods [4,0] and [4,2] to determine whether it is a frequent periodic pattern. Clearly, the redundant patterns are pruned automatically. Because we only check the un-frequent itemsets of the base period, redundant itemsets do not appear in the expansion periods. The check process has linear time complexity, so efficiency is guaranteed.
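The expansion periods of Figure 4 can be enumerated with a small sketch (factor 2 per level, as in the figure):

```python
def expand(p, o):
    """Expansion periods of base period [p,o] with factor 2: the two
    periods of length 2p contained by [p,o]."""
    return [(2 * p, o), (2 * p, o + p)]

level = [(2, 0)]
for _ in range(2):  # [2,0] -> length-4 periods -> length-8 periods
    level = [e for (p, o) in level for e in expand(p, o)]
print(level)  # [(8, 0), (8, 4), (8, 2), (8, 6)]
```

Both generated offsets satisfy the period contain property: for [2p, o] and [2p, o+p], the offset modulo p equals o.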
3.4 Data Post-Processing
Whether we are interested in single or multiple periodic patterns, so far we only have a list of candidate periodic patterns. Without this final check step, we cannot ensure that these patterns are truly frequent periodic patterns. The candidate periodic patterns mined by PPM and CPM are only "frequent patterns" in the large 1-itemset matrix, not in the unit time databases, and we cannot guarantee that the frequent patterns in the large-1 matrix are also frequent in the unit time databases. Therefore, in the final step of our algorithm, the unit time databases are scanned to determine whether the candidates generated by PPM and CPM are frequent itemsets.
Suppose we have a candidate periodic pattern S[p,o]. We must check whether pattern S is large in the unit time databases with timestamps i = k·p + o. If the pattern is large in some unit time databases, we judge it against the periodic minimum support count again. Applying this check process to all candidate periodic patterns, we obtain all the final frequent periodic patterns.
CHAPTER 4
EXPERIMENT AND EFFICIENCY ANALYSIS
4.1 Simulation Platform
In this section, a set of simulations is performed to show the benefit of our approach. A comparison between Han's method (the repeated approach) and our approach is also made. The results show a notable improvement in efficiency by our approach. The test data is generated by the IBM Synthetic Data Generator[6]. The parameters used to generate the synthetic data and run the experiments are shown in Table 1. Take D100 δ100 T10 I4 N1 S0.8 for example: D100 represents 100K transactions per unit time database, δ100 means there are 100K unit time databases, T10 indicates that the average transaction size in a unit time database is 10, I4 means the average length of a frequent pattern in a unit time database is 4, N1 means the number of items in a unit time database is 1K, and S0.8 means the periodic minimum support is 0.8.
Notation  Meaning                                        Default  Range
D         Number of transactions per unit time database  100K     -
δ         Number of unit time databases                  100K     50K~200K
T         Average size of transaction                    10       10~20
I         Average length of large itemset                4        3~7
N         Number of items                                1,000    0.1K~1K
P         Maximal period length                          30       2~100
S         Periodic minimum support                       0.8      0.8~1

Table 1. Simulation parameters
4.2
Efficiency Analysis
Figure 5 shows the effect of the maximal period length under D100δ100T10I4N1S0.8. The improvement of our approach is not so obvious when the maximal period length is smaller than 30. The reason is that, for numbers smaller than 30, the ratio of the number of primes to the number of non-primes is high, and as mentioned above, the pruning properties can only be applied to periodic patterns with non-prime period lengths.

[Figure 5: Execution time (s) of Han's approach and CPS versus maximal period length (0~100), D100δ100T10I4N1S0.8]
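The prime-density argument can be checked numerically. The following standalone sketch (my own illustration, not code from the thesis) counts the fraction of candidate period lengths that are prime, showing that the fraction is noticeably higher below 30 than below 100, so the non-prime pruning applies less often for short maximal period lengths.

```python
def is_prime(n):
    """Trial-division primality test; sufficient for small period lengths."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def prime_ratio(max_period):
    """Fraction of candidate period lengths 2..max_period that are prime."""
    periods = range(2, max_period + 1)
    primes = sum(1 for p in periods if is_prime(p))
    return primes / len(periods)

print(round(prime_ratio(30), 2))   # 10 primes among 2..30  -> 0.34
print(round(prime_ratio(100), 2))  # 25 primes among 2..100 -> 0.25
```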
[Figure 6: Execution time (s) of Han's approach and CPS versus number of unit time databases (50,000~200,000), δ100T10I4N1P30S0.8]
The effect of the number of unit time databases δ on mining efficiency is shown in Figure 6. As shown in the result, the execution time needed for both approaches increases as the number of unit time databases increases. Moreover, our approach performs much better than Han's approach as the number of unit time databases increases.
[Figure 7: Execution time (s) of Han's approach and CPS versus periodic minimum support (80~100%), D100δ100T20I4N0.1P30]
Figure 7 shows the effect of the periodic minimum support. As shown in the result, even for a large periodic minimum support, our approach outperforms Han's approach.
[Figure 8: Execution time (s) of Han's approach and CPS versus average length of large itemset (3~7), D100δ100T10N0.1P30S0.8]
Figure 8 shows the effect of the average length of a frequent pattern in a unit time database. It shows that both approaches are affected by the average length of the frequent pattern; however, our approach outperforms Han's approach, especially when the average length of the frequent pattern is large.
CHAPTER 5
CONCLUSION AND FUTURE WORKS
5.1 Conclusion
In this thesis, an efficient mining method for multiple partial periodic patterns has been studied. We have also explored the full and partial periodic pattern mining issues. The main objective, and the difference from previous studies [6][10][12], is that our proposed method is efficient and avoids generating redundant periodic patterns.
By studying some interesting properties related to multiple partial periodic pattern mining, such as the indefinite property of partial periodicity, the lower-bound property, and the period expansion property, an efficient multiple partial periodic pattern mining algorithm is proposed. The experiments show that the proposed algorithm offers excellent performance.
5.2 Future Works
In the future, there are still many issues regarding multiple periodic pattern mining, such as mining multiple periodic association rules, query-based mining of multiple partial periodic patterns, and applying distributed computing environments to improve the efficiency of our approach. We will continue studying these problems and report our progress in the future.
References
[1] R. Agrawal and R. Srikant. "Fast Algorithms for Mining Association Rules." In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487–499, Santiago, Chile, September 1994.
[2] R. Agrawal and R. Srikant. "Mining Sequential Patterns." In Proc. 1995 Int. Conf. Data Engineering, pages 3–14, Taipei, Taiwan, March 1995.
[3] L. Breiman, J. Friedman, R. Olshen, and C. Stone. "Classification and Regression Trees." Monterey, CA: Wadsworth International Group, 1984.
[4] Y. Cai, N. Cercone, and J. Han. "Attribute-Oriented Induction in Relational Databases." In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 213–228, 1991.
[5] Z. Huang. "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values." Data Mining and Knowledge Discovery, 2:283–304, 1998.
[6] J. Han, G. Dong, and Y. Yin. "Efficient Mining of Partial Periodic Patterns in Time Series Database." In Proc. Fifteenth International Conference on Data Engineering, 1999.
[7] J. Han and Y. Fu. "Exploration of the Power of Attribute-Oriented Induction in Data Mining." In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 399–421, 1996.
[8] E. Knorr and R. Ng. "A Unified Notion of Outliers: Properties and Computation." In Proc. 1997 Int. Conf. Knowledge Discovery and Data Mining (KDD'97), pages 219–222, Newport Beach, CA, August 1997.
[9] L. Kaufman and P. J. Rousseeuw. "Finding Groups in Data: An Introduction to Cluster Analysis." New York: John Wiley & Sons, 1990.
[10] Y. Li, P. Ning, X. S. Wang, and S. Jajodia. "Discovering Calendar-Based Temporal Association Rules." In Proc. Eighth International Symposium on Temporal Representation and Reasoning (TIME 2001), 2001.
[11] R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. "Exploratory Mining and Pruning Optimizations of Constrained Association Rules." In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 13–24, Seattle, Washington, June 1998.
[12] B. Özden, S. Ramaswamy, and A. Silberschatz. "Cyclic Association Rules." In Proc. 1998 Int. Conf. Data Engineering (ICDE'98), pages 412–421, Orlando, FL, February 1998.
[13] J. S. Park, M. S. Chen, and P. S. Yu. "An Effective Hash-Based Algorithm for Mining Association Rules." In Proc. 1995 ACM SIGMOD Conference, pages 175–186, San Jose, California, USA, May 1995.
[14] J. R. Quinlan. "Induction of Decision Trees." Machine Learning, 1:81–106, 1986.
[15] J. R. Quinlan. "C4.5: Programs for Machine Learning." San Mateo, CA: Morgan Kaufmann, 1993.
[16] R. Srikant and R. Agrawal. "Mining Generalized Association Rules." In Proc. 21st International Conference on Very Large Data Bases, pages 407–419, Zurich, Switzerland, September 1995.
[17] R. Srikant and R. Agrawal. "Mining Quantitative Association Rules." In Proc. 1996 ACM SIGMOD International Conference on Management of Data, pages 1–12, Montreal, Canada, June 1996.
[18] A. Savasere, E. Omiecinski, and S. Navathe. "An Efficient Algorithm for Mining Association Rules in Large Databases." In Proc. 21st International Conference on Very Large Data Bases, pages 432–444, Zurich, Switzerland, September 1995.
[19] I. H. Toroslu and M. Kantarcioglu. "Mining Cyclic Patterns." In DaWaK 2001 (LNCS 2114), pages 83–92, Munich, Germany, September 2001.