Comparing Methods of Mining Partial Periodic Patterns in
Multidimensional Time Series Databases
Meghan Callahan
Advisor: George Kollios
Department of Computer Science
Boston University
[email protected]
Abstract
Methods to efficiently find patterns in
periodic one-dimensional time series databases
have been heavily examined in recent data
mining research. The issue at hand is whether these algorithms can be adapted to find such patterns in multidimensional periodic time series datasets by applying classification techniques to reduce the dimensionality.
This project will explore two solutions to the
problem of representing multidimensional values
as one-dimensional data: one grid-based and
one clustering-based. The two classification
methods, each with an algorithm to find partial
periodic patterns, will be compared based on
efficiency, accuracy, and scalability of the
approaches.
1. Introduction
Time series datasets are used in many
practical applications, including finance,
weather, and economics. These datasets show
trends in values over time, and are important for
decision-making and the estimation of upcoming
events and occurrences. The datasets often are
of a periodic nature, which aids in the accurate
prediction of future data. However, the
discernment of a pattern in these databases is not
a trivial task and has become a well-researched
data mining problem.
Early data mining research of time series
databases focused on finding full, or perfect,
periodic patterns. This involves every data value
adding to the overall periodicity of the dataset.
For example, the amount sold each day by a
business adds to the overall sales cycle of the
business for the fiscal year.
In practice, perfectly periodic patterns
hardly occur. A relaxed version of these
patterns, called partially periodic patterns, can be
as meaningful for certain applications. Partial
patterns are periodic over a portion of the
database, yet may not be periodic across the
entire database. Continuing the sales example, a
partial pattern may show that the sales in the
month of December fluctuate each year, yet the
other months do not have such a pattern; this
“looser” pattern information may be useful in the
real world, e.g. for estimating the amount of
production needed for the December sales. In
[9], an algorithm was presented which efficiently
finds partial patterns by leveraging the Apriori
property, which was first used to find sequential
patterns in [1, 2]. This work has been extended
to handle the incremental addition of values to
the dataset [3]. Berberidis et al [4] also extend
this work by showing its applicability to period
detection as well as pattern mining.
In all of these studies, there has been work
with two-dimensional datasets, i.e. a value in one
dimension and a time measure. The algorithms
found in the aforementioned papers use two-dimensional time series databases, where each data point has a value dimension and a time dimension, yet they do not examine the applicability of these methods to multidimensional sets. In this paper, the goal is to
present methods to efficiently and accurately
detect partial patterns in multi-dimensional
datasets, i.e. multiple numerical dimensions and
a time dimension.
However, finding efficient methods to mine
meaningful data from multiple dimensions is a
difficult problem in itself. The ability to do this
depends on finding algorithms for reducing the
dimensionality of the database. In [8], Faloutsos
and Lin present the FastMap algorithm, which
maps objects of higher dimension into a lower-dimensional space while preserving the dissimilarity of the objects in both spaces. The GEMINI method described in [11] provides a similar kind of reduction. Other studies have shown that the Discrete Fourier Transform [7] and the Discrete Wavelet Transform [5] also perform efficient and accurate dimensionality reduction. These techniques are predominantly
used in the indexing of the multidimensional
points.
In this project, we find a dimensionality
reduction technique capable of mapping multidimensional data points into a one-dimensional
value by using classification. Classification is a
method of assigning categorical labels to data
points [9]. These labels can be assigned via
supervised or unsupervised learning. Supervised
learning predetermines a set of classes; each
class has its own label and represents a certain
range of data measurements or observations.
Unsupervised learning has no such
predetermined classes; instead, the goal is to find
the existence of these classes based on the data
values themselves. Clustering is an example of
unsupervised learning.
This project consists of a comparison of two
methods to discover patterns in multidimensional
time series datasets by using classification as
dimensionality reduction. The first focuses on
clustering the multi-dimensional points to
determine a pattern. The other involves the
creation of a labeled grid to classify the points to
discern a pattern over time.
Both of these approaches then use the max-subpattern hit set method described in [9], which
takes advantage of the Apriori property [1] and
the max-subpattern hit set property. The
proposed algorithms only require 4 scans of the
database: 2 scans to perform dimensionality
reduction and classification, and 2 more scans to
implement the max-subpattern algorithm.
In the remaining sections of the paper, these
ideas will be further developed. The next section
will more clearly define the problem statement
and define the elements used in the project.
Sections 3 and 4 will delve into more details of
the methods and algorithms used, and section 5
will discuss the overall implementation of these
methods. A report of the experiments and
analysis of the results is provided in section 6,
before a conclusion and thoughts for future research are presented in the final section.
2. Problem Statement
We want to be able to find some pattern in a
periodic multidimensional time series database S.
The terminology discussed in this section will be
used for the rest of the paper.
2.1 Pattern Terminology
Assume that S = D1, D2, …, Dn, where each value Di is a set of features derived from the dataset at time instant i. A pattern is a string s = s1, s2, …, sp over an alphabet of the features L ∪ {*}, where the character * can assume any value in L. The L-length of a pattern equals the number of letters in s from L; furthermore, a pattern of L-length i is called an i-pattern.
A pattern s has a subpattern s' = s1', s2', …, sp' if s and s' have the same length and si' = si or si' = * for every position i. The actual length of s, denoted |s|, is the period of the pattern s; a period segment then takes the form D(i·|s|)+1 … D(i·|s|)+|s|, where i ∈ [0, m) and m is the maximum number of periods of length |s| in S.
The frequency count, or support, of a
pattern is the number of occurrences of the
pattern in the dataset, while the confidence of a
pattern is the ratio of the frequency count to the
number of period segments m. A pattern s is frequent in S if its confidence is at least a set threshold, min_conf, which changes depending upon the mining application.
As an example of these concepts, consider
the pattern s = a*b* with period 4. In the feature
series acbbadbecadcadba, its frequency count is
3. The value m equals 4 in this series; therefore,
the confidence of s is 0.75.
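To make these definitions concrete, the following is a minimal Python sketch (the function names are ours, not from the paper) that computes the frequency count and confidence of a pattern such as a*b* over a one-dimensional feature series:

```python
def matches(segment, pattern):
    """A period segment matches a pattern if every non-* letter agrees."""
    return all(p == '*' or p == s for s, p in zip(segment, pattern))

def frequency_and_confidence(series, pattern):
    """Count the period segments of length |pattern| that match the pattern."""
    period = len(pattern)
    m = len(series) // period                                # maximum number of periods
    segments = [series[i * period:(i + 1) * period] for i in range(m)]
    count = sum(matches(seg, pattern) for seg in segments)   # frequency count
    return count, count / m                                  # (frequency, confidence)

# The example from the text: s = a*b* over the series acbbadbecadcadba.
print(frequency_and_confidence("acbbadbecadcadba", "a*b*"))  # -> (3, 0.75)
```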
2.2 Classification and Dimensionality Reduction Terminology
A projection is one method to map d-dimensional points into k-dimensional points, where k << d. In data mining, it is useful to lower the dimensionality of the values being mined. This reduces the effect of the “curse of dimensionality,” which causes skewed pairwise distances and makes meaningful similarity queries difficult.
A cluster is a set of points in a space
deemed similar to each other and dissimilar to all
other points in the space. The similarity of two
points is determined by a similarity function. In
this project, the similarity function used is the
Euclidean distance function. Clustering a dataset
means partitioning the set of data points into non-overlapping groups, such that points within a group are similar to one another and dissimilar to points in other groups. Clusters
are a form of unsupervised classification in data
mining; there is no predefinition of the classes
the clusters represent.
A grid is a structure of d dimensions into which the d-dimensional data points are mapped. The grid represents the range (xmax − xmin, ymax − ymin) when d = 2, where xmax, xmin, ymax, and ymin are the maximum and minimum values of the data points in a given dimension. The grid is divided into cells of length

(xmax − xmin) / c    and    (ymax − ymin) / c

where c is the number of cells desired in each dimension. Each cell of the grid has a categorical value, i.e. a value l ∈ L ∪ {*}. A data point is mapped into a particular cell if the values of its coordinates fall within the range of values that the cell represents. Each point in the dataset is then assigned the label of the cell into which it is mapped.

3. Methods of Classification
The following two methods to reduce the
dimensionality and classify the multidimensional
points of a dataset are explored in this paper.
The goal of both methods is to convert multidimensional, numerical points over a time value
into a categorical, single dimensional value,
which is shared by similar points over time.
3.1 Clustering Approach
Motivation. In a periodic time series database,
data points occurring around the same time in
each period will be similar by virtue of the fact
that there is a period. Therefore, if the time
element of a d-dimensional data point is
disregarded, the other points which occurred at
about the same time in a period will be close to
the data point in (d-1)- dimensional space. In
this space, the points can be clustered, and a
categorical value can be assigned to each cluster.
Projection Algorithm.
1.
Project each d-dimensional point into (d − 1)-dimensional space by disregarding the time element of the point.
2.
Cluster the points of the (d − 1)-dimensional space.

Clustering. To efficiently perform the clustering, the k-means heuristic algorithm is used [6, 12]. This algorithm is given the value k, the number of clusters to create, and the set of n points of the dataset. It then produces a set of k clusters, each of which has a representative mean; thus the name k-means. It then assigns each data point to the cluster with the closest mean. If any point changes clusters, the means of the clusters are recomputed.

k-Means Algorithm.
1.
Place the n points into the k clusters, such that each cluster is non-empty.
2.
Compute the mean of each cluster. This value is called the centroid.
3.
Assign each data point to the cluster with the closest centroid, using the Euclidean distance as the measure of closeness.
4.
Repeat the algorithm from Step 2 until there are no points changing clusters in Step 3.

Figure 1 [6]. This diagram shows an example of the k-means algorithm. The filled circles indicate the values of the centroids, which do not need to be actual data points. The figure in (a) shows the initial assignment of the points. The figure labeled (b) shows the final cluster assignment after the point labeled 4 moves from Cluster 1 into Cluster 2, since it is closer to the centroid of Cluster 2.
A globally optimal set of k clusters may or
may not be found depending on the initial
assignment of the clusters in Step 1. However,
this algorithm will find a locally optimal set of k clusters, which improves the original partitioning and approximates the global
optimum. Such a set can be found in O(tkn)
steps, where n is the number of data points in the
dataset and t is the number of iterations of steps
2 through 4. For most datasets, this algorithm is
considered efficient, since typically k is chosen
such that k << n and the algorithm converges to
an optimized set quickly.
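As a concrete illustration, here is a minimal Python sketch of the k-means procedure described above, using the Euclidean distance as the similarity function; the initialization (sampling k points as starting centroids) and all names are illustrative assumptions, not the project's actual implementation:

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, max_iter=100):
    # Step 1 (variant): seed the k clusters with k distinct points as initial centroids.
    centroids = random.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        changed = False
        # Step 3: assign every point to the cluster with the closest centroid.
        for i, p in enumerate(points):
            best = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            if best != assignment[i]:
                assignment[i], changed = best, True
        # Step 2 (repeated): recompute the centroid of each non-empty cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if not changed:   # Step 4: stop once no point changes clusters.
            break
    return assignment, centroids

# Example usage on a few 2-dimensional (time-stripped) points.
labels, centers = k_means([(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5), (0.8, 1.1)], k=2)
```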
This choice of k is critical to the
meaningfulness of the algorithm and is not an
inconsequential detail. In the k-means algorithm,
the number of clusters must be known a priori;
therefore, the algorithm does not dynamically
determine the best number of clusters. If k is
chosen to be too large a priori, the clusters will
be sparse and trivial since they will contain few
values. Conversely, if k is chosen to be too
small, the clusters will contain a large number of
points. Either way, the resulting classification
may not be as meaningful as desired. This will
be discussed further in Section 3.3.
The mean of the cluster represents the points
within it. If there are outliers present in the
dataset, and thus in the clusters, the k-means
algorithm distorts the value of the centroid. For
instance, in Figure 1, the point labeled 4 can be
thought of as an outlier in Cluster 1. As such,
the centroid of Cluster 1 in Figure 1(a) is shifted
toward the right, rather than being in the
proximity of the majority of the points in the
cluster.
In the example, the next iteration of the k-means algorithm eliminates this discrepancy by
reassigning the cluster of the outlier point to a
closer centroid, namely the representative of
Cluster 2. However, some cases exist where the
outlier may not be reassigned and thus continues
to distort the value of the mean of the cluster. If
the centroid values are affected by noise, then the
overall clustering does not accurately represent
the dataset.
Even with the above issues, this algorithm is
more efficient than simpler methods of
clustering, such as agglomerative and divisive
clustering which, respectively, build and divide
clusters one point at a time. The k-means
algorithm is efficient and provides a meaningful
approximation of the optimal set of k clusters.
3.2 Grid Approach
The grid approach uses a grid to classify the
data points of the multidimensional database
over time. As in the projection method discussed
in the previous section, this approach aims to
reduce the dimensions and assign a categorical
value to each data point. This method also aims
to assign similar points the same categorical
value in order to highlight the periodicity of the
time series database.
In the projection approach, however, k-means clustering is used to dynamically
determine which points are similar. The grid
approach uses a more structured method of
clustering; each grid cell can be thought of as a
cluster. The grid cell is thus a predefined cluster.
It contains only the points with coordinates
falling in the range represented by the cell.
Therefore no algorithm is needed to determine if
a point is in the best possible cell; a point only
has one cell to which it can belong.
Algorithm.
1.
Project each d-dimensional point into
(d − 1)-dimensional space by disregarding the time element of the point.
2.
For each of the (d – 1) dimensions of
each point, compare the value of the
coordinate to the range of values each
cell represents in that dimension.
3.
Assign the point the label of the cell
containing the point in all of the (d–1)
dimensions.
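A minimal sketch of this labeling step, assuming the per-dimension minimum and maximum values are already known from a first scan of the database; the cell label is returned as a tuple of cell indices, which can then be mapped to a character from L, and every name here is illustrative:

```python
def grid_label(point, mins, maxs, c):
    """Locate, in each of the (d-1) dimensions, the cell of width (max - min) / c
    that contains the coordinate; the tuple of cell indices is the point's label."""
    cell = []
    for x, lo, hi in zip(point, mins, maxs):
        width = (hi - lo) / c                      # assumes hi > lo in every dimension
        idx = min(int((x - lo) // width), c - 1)   # clamp the maximum value into the last cell
        cell.append(idx)
    return tuple(cell)

# Example: a 2-dimensional point with the ranges of Figure 2 (xmax = ymax = 50,
# xmin = ymin = 0) and an arbitrary choice of c = 5 cells per dimension.
print(grid_label((19, 44), mins=(0, 0), maxs=(50, 50), c=5))   # -> (1, 4)
```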
The number of cells needed in the grid must
be determined a priori and must make the
placement of points into the cells meaningful.
This problem is similar to that of choosing a
value for k in the k-means algorithm. Too few cells in the grid give a large number of points the same categorical value, while too many cells make the grid sparse, so points which are relatively similar may be assigned different values. Again, both cases may reduce the
meaningfulness of the assigned values, which
will be further discussed in Section 3.3.
Figure 2. A sample grid used to classify data points. For instance, the grid cell labeled ‘b’ contains the point (19, 44) if the grid is defined by the values xmax = 50, xmin = 0, ymax = 50, and ymin = 0.

In terms of actual implementation, the grid is only a logical structure. Creating such a grid as a multidimensional array would require O(c^d n) space, where c is the number of cells in each dimension and d is the dimensionality of the n points being assigned a grid value. For large values of d, this could be a large structure. As a logical structure, only the maximum range value of each cell in each dimension needs to be stored, rather than storing the n points in the grid. Taking this approach lowers the space requirement to O(cd).

3.3 Meaningfulness of Classification

The goal of performing these dimensionality reduction and classification techniques was to convert numerical, multidimensional data points into a categorical, single-dimensional value. The two approaches described above accomplish this goal by assigning each data point a character value that is shared with data points deemed similar in multiple dimensions. Both approaches assign the categorical value with the assumption that each dimension is equally weighted in determining the resulting value.

In the discussion of each approach, the issue of selecting the number of clusters and the number of cells was stated to be critical to the meaningfulness of the categorical value assignments. An assignment is meaningful if it properly represents the dataset, i.e. reveals any patterns that may be present, indicates the presence of noise, etc.

How meaningful an assignment is with a given selection of k or c is dependent on the database itself. For instance, if a dataset were dense and clustered, more, smaller cells or clusters would aid in distinguishing a distinctive pattern. On the other hand, if a dataset were sparse, fewer, larger cells or clusters would reduce the variability of the assignment and reveal which points are relatively similar. The best choice of k or c requires analysis of the set; it is difficult to determine an optimal value for all databases a priori. In this project, we assume that an approximate value of the best choice of k or c suffices to reveal the patterns of the dataset.
4. Methods of Partial Pattern Mining
Partial Pattern Mining requires the use of
several properties and algorithms. The following
sections describe and analyze each one,
introduce the data structure used, and present the
overall partial pattern mining algorithm.
4.1 Apriori Property
A key property behind partial pattern
mining, and the efficient mining of association
rules, is the Apriori Property defined in [1].
This property states that if any subset of an
itemset is not frequent, then the itemset itself is
not frequent. The number of frequent (i + 1)-patterns depends on the number of frequent i-patterns, rather than on all the possible patterns existing in a dataset.
By leveraging this property, the space of all
possible frequent patterns is reduced as soon as
an infrequent pattern is found. This, in turn,
reduces the amount of time needed to find all of
the frequent patterns in a given database.
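As a small illustration of how the property is applied during candidate generation, the sketch below represents a candidate i-pattern by the set of (offset, symbol) pairs it fixes and prunes it whenever one of its (i-1)-subpatterns is not already frequent; the representation and names are our own:

```python
from itertools import combinations

def apriori_prune(candidate, frequent_smaller):
    """Return True if the candidate can be discarded because some immediate
    (i-1)-subpattern of it is not in the frequent set of the previous level."""
    return any(frozenset(sub) not in frequent_smaller
               for sub in combinations(candidate, len(candidate) - 1))

# Example: the 3-pattern fixing {a at offset 0, b at 2, e at 3} is pruned here
# because its 2-subpattern {b at 2, e at 3} was not found frequent.
f2 = {frozenset({(0, 'a'), (2, 'b')}), frozenset({(0, 'a'), (3, 'e')})}
print(apriori_prune({(0, 'a'), (2, 'b'), (3, 'e')}, f2))   # -> True
```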
4.2 Candidate Generation
To derive the partial patterns from a
database, candidate i-patterns must be derived
and then tested to see if they are frequent
throughout the database. Using the Apriori
property above, we can eliminate possible partial
(i+1)-patterns if any subset is not a frequent i-pattern, for values of i ≥ 1.
The first set of candidates to generate is the
set of 1-patterns, denoted as C 1 (the set of the
candidate 1-patterns). This is done by scanning
the entire database and collecting the values
present in the set with L-length = 1. A frequency
count is maintained for each value collected; if
the value is already in the set C 1 when it is
encountered in the database scan, its frequency
count is incremented. Upon completion of the
scan, an element of C1 is added to the set F1, the
set of all frequent 1-patterns, if the value of its
frequency count is greater than or equal to
min_conf * m, where m is the maximum number
of periods and min_conf is the minimum
confidence level.
The total number of candidate subpatterns that are generated is (|F1| choose 2) + (|F1| choose 3) + … + (|F1| choose |F1|) = 2^|F1| − |F1| − 1. Since the set of frequent 1-patterns needs to be kept and requires |F1| space, the total space needed to store all the subpatterns is 2^|F1| − 1 in the worst case.
Generation of the frequent pattern
candidates for the sets F2, F3, …, Fp depends upon the set F(i – 1). If this set is non-empty, then the set Ci can be created by computing the (i-way) join of the set F(i – 1) with itself. The
frequency counts are then gathered, and
candidates are added in an Apriori-like manner
to the set Fi as in the generation of the set F1. Up
to p frequent partial pattern sets can be created,
where p is the period of the time series database.
However, a set F(i+1) will not be generated if the
set Fi is empty.
To perform the candidate generation, a
simplistic version of this algorithm will scan the
database p times in the worst case. There is a
single scan to determine the set F1 and then (p-1)
subsequent scans of the database to create the
remaining frequent pattern sets. If the database
is large, as is often the assumption, these scans
become very expensive.
The algorithm
presented in Section 4.4 aims to correct this large
cost by taking advantage of the Max-Subpattern
Hit Set Property, described in the next section,
and its novel algorithms to reduce the number of
database scans needed to derive the sets of
frequent patterns.
Candidate Generation Algorithm.
1.
Scan the database once to populate the
set F1 by finding all of the frequent 1-patterns of length p, where p is the
period. A frequent 1-pattern will have
L-length = 1 and a frequency count
equal to or exceeding min_conf * m,
where m is the maximum number of
periods of length p in the database.
2.
Find the frequent i-patterns of length p
for values of i from 2 to p by
performing an Apriori-based method of
eliminating candidates of the join of the
set F(i – 1) to itself.
3.
If a set Fi is empty, there is no need to
continue. Else, repeat Step 2.
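A hedged sketch of Step 1, treating the classified database as a string of symbols and representing a 1-pattern by the (offset, symbol) pair it fixes within the period; this representation is our simplification, not the project's code:

```python
from collections import Counter

def frequent_1_patterns(series, p, min_conf):
    """One scan of the classified series: count, for every offset within the period,
    how often each symbol occurs, and keep the counts that reach min_conf * m."""
    m = len(series) // p                           # maximum number of periods
    counts = Counter()
    for seg in range(m):
        segment = series[seg * p:(seg + 1) * p]
        for offset, symbol in enumerate(segment):
            counts[(offset, symbol)] += 1
    threshold = min_conf * m
    return {key: n for key, n in counts.items() if n >= threshold}

# Example: period 4 over the series from Section 2.1 with min_conf = 0.75.
print(frequent_1_patterns("acbbadbecadcadba", p=4, min_conf=0.75))
# -> {(0, 'a'): 3, (2, 'b'): 3}, i.e. the frequent 1-patterns a*** and **b*.
```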
Analysis. Arguably, the set of frequent 1-patterns F1 is a large frequent set, as it contains every possible pattern of L-length = 1 in the
entire database; the generation of this set cannot
take advantage of the Apriori property to reduce
the number of candidate patterns examined.
In turn, the generation of the candidate set
C2 is the largest and most expensive candidate
set to create. The 2-way join is performed on the large frequent set F1 and yields (|F1| choose 2) candidates. As the number of F1 patterns
increases, the size of C2 drastically increases.
4.3 Max-Subpattern Hit Set Property
As presented in [9], the Max-Subpattern Hit
Set Property is useful in reducing the
computation time of the Fi sets for values of i >
1. The motivation for such an improvement rests
in the large number of candidate patterns that
will need to be generated and counted, while
scanning the database up to p times; if an
algorithm can speed up this step, the running
time of the entire candidate generation and
pattern mining algorithm is decreased.
This property relies on the discovery of
max-patterns and hit subpatterns. A candidate,
frequent max-pattern is defined to be the
maximal pattern generated from the set of
frequent 1-patterns F1. This max-pattern is called Cmax. For instance, F1 = {a****, *b***, **c**, ***d*} would yield Cmax = abcd*. A position in the max-pattern may have more than one possible value. If the 1-pattern *f*** is added to F1, then Cmax = a{b, f}cd*.
If a subpattern of Cmax is the maximal subpattern in a given period segment Si of S, we say it is a hit subpattern in Si. The set of all
such hit subpatterns in a time series S is called
the hit set, H, of that time series. This set is
useful in deriving the entire set of partial patterns
if the frequency counts of all the hit maximal
subpatterns of Cmax are known.
The size of the hit set, |H|, is bounded by the
maximum number of periods in S (i.e. |H| ≤ m)
since each period segment can generate only one
hit subpattern. The size of |H| is also bounded by the number of subpatterns which can be generated; in the previous section, we have shown that this value is 2^|F1| − 1. Since H is bounded by these two quantities, we can say that |H| ≤ min{m, 2^|F1| − 1}, where m is the maximum number of periods and F1 is the set of all frequent 1-patterns in the database S.
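The following sketch illustrates both ideas: building the candidate max-pattern Cmax from the frequent 1-patterns (as a list of allowed symbols per position) and computing the hit of a period segment, i.e. its maximal subpattern of Cmax. The representation is an illustrative simplification:

```python
def build_cmax(f1, p):
    """Collect, for each period offset, the symbols appearing in frequent 1-patterns;
    e.g. {a****, *b***, **c**, ***d*, *f***} yields a{b,f}cd*."""
    cmax = [set() for _ in range(p)]
    for offset, symbol in f1:
        cmax[offset].add(symbol)
    return cmax                       # '*' positions are simply the empty sets

def hit(segment, cmax):
    """Keep the segment's symbol where Cmax allows it, '*' elsewhere."""
    return ''.join(s if s in allowed else '*' for s, allowed in zip(segment, cmax))

# Continuing the running example (a*** and **b* frequent, period 4):
cmax = build_cmax({(0, 'a'), (2, 'b')}, p=4)   # [{'a'}, set(), {'b'}, set()]
print(hit("adbe", cmax))                        # -> 'a*b*'
print(hit("cadc", cmax))                        # -> '****', an empty hit
```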
4.4 Max-Subpattern Hit Set Method
The property described in the above section
can be used to find the set of all partially
periodic patterns, with period p, present in the
time series S. The algorithm incorporates the
ideas of candidate generation, discussed in
Section 4.2, and the max-subpattern tree data
structure, which is described in the next section.
To efficiently access the values of the hit
count and show relationships between subpatterns, we need to use the max-subpattern tree
from [9]. This tree will assist in the derivation of
the set of frequent subpatterns.
The max-subpattern tree contains nodes with
four elements: a subpattern, a frequency count, a
pointer to its parent, and a set of pointers to its immediate children. A node is a child if its
subpattern differs from that of its parent by one
non-* letter. The pattern stored at the root node
of the tree is the candidate max-subpattern C max;
the rest of the tree defines the subpatterns of
Cmax.
The conditions for a subpattern to be added
to the max-subpattern tree are as follows:
1.
The subpattern contained in a node must be
at least a 2-pattern; else, the subpattern is
already included in the set F1.
2.
A node w may have a set of children if a
subpattern exists which differs from w by
having one non-* letter present. The child
pointers of a node are referred to by the
value of the non-* letter which is missing
from its parent. For example, in Figure 3,
the link from the node ab1*d* to the node
a**d* would be labeled by the value b1,
since b 1 is missing from the child
subpattern.
3.
To be present in the tree, the subpattern of a
node, or one of its descendants, must be in
the hit set of S . If not in this set, the
subpattern cannot be frequent in S. Notice
the subpattern ab2*** in Figure 3. It is
never added to the max-subpattern tree,
since it is not in the hit set of S (i.e. its
frequency count is 0).
Algorithm.
1.
Generate the set of frequent 1-patterns
of length p, denoted F1, by scanning the
database S.
2.
Using the set F1, generate the candidate max-subpattern Cmax and make it the root of the max-subpattern tree (Section 4.5).
3.
Scan S again. For each period segment, calculate its hit, the maximal subpattern of Cmax that it contains. If the hit is non-empty, add this max-subpattern into the hit set with a frequency count of 1; if it is already in the set, increment its frequency count. The implementation of the hit set as the max-subpattern tree is discussed in Section 4.5.
4.
Using the hit subpatterns in the hit set,
derive the frequent patterns using the
algorithm detailed in Section 4.6.
Insertion Algorithm. The following steps are
performed to insert a max-subpattern w, found in
the current period segment, into the max-subpattern tree.
1.
Compare w to the candidate max-subpattern, Cmax, contained in the root
node. Find the correct child link by
checking which non-* letters are
missing from the subpatterns in order
from left to right (position-wise
difference).
2.
If a node containing the subpattern w is
found, increment the frequency count of
that node. Else, create a new node (initialize its count to 1) and its ancestor nodes along the path to the root (count 0), and insert them into the proper place in the tree.

For example, in Figure 3, if the first max-subpattern found for a period segment was ab1***, the node ab1*** is added with count 1, as well as the node a{b1,b2}*d*, the root, with count 0, and the node ab1*d*, the direct child of the root and parent of the max-subpattern, with count 0.

This algorithm scans the database only twice: once in Step 1 to generate the 1-patterns and again in Step 3 to build the hit set. This is a large improvement over the earlier technique discussed in Section 4.2, which was dependent upon the value of the period p.

4.5 Max-Subpattern Tree Structure

Figure 3. Max-Subpattern tree. The root stores the value of the candidate max-subpattern Cmax. Each child stores a subpattern of the root which is hit in the time series S. The frequency count of the subpattern of a node is shown above or below it. The missing non-* letter labels the links to the child nodes.
The height of the max-subpattern tree
depends on the L-length of Cmax. If the L-length
is x, then the tree will have height (x – 1) since a
node at the bottom-most level must have at least two non-* letters. At every insertion, there will
be at most (x – 1) nodes created, and at least 0
nodes created. Each insertion adds a subpattern
found in S, which is an element of the hit set H.
As a result, the total number of nodes in the tree
is less than (x|H|).
In the insertion algorithm, the tree is
traversed by following the child link labeled by
the first non-* letter which differs from the
current subpattern. This means that some parent
to child links will not be created even though
two nodes are legitimately related. Such links
are shown in Figure 3 as dashed lines. Instead of
searching the tree for all of the possible child
pointers, a node will only link to nodes inserted
under it.
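A simplified sketch of the tree node and the insertion walk, assuming for brevity that every position of Cmax carries a single frequent letter, so that subpatterns are plain strings over the letters and *; the class and function names are ours, not the paper's:

```python
class Node:
    """A max-subpattern tree node: subpattern, count, parent link, and child links
    labeled by the non-* letter (and its position) dropped from the parent."""
    def __init__(self, pattern, parent=None):
        self.pattern, self.count, self.parent, self.children = pattern, 0, parent, {}

def insert(root, w):
    """Walk from Cmax toward w, dropping the missing non-* letters left to right;
    intermediate ancestors are created with count 0 and only w's count is incremented."""
    node = root
    for pos, (have, want) in enumerate(zip(root.pattern, w)):
        if have != '*' and want == '*':            # a letter of Cmax is missing from w
            label = (pos, have)
            if label not in node.children:
                child_pattern = node.pattern[:pos] + '*' + node.pattern[pos + 1:]
                node.children[label] = Node(child_pattern, parent=node)
            node = node.children[label]
    node.count += 1

root = Node("abcd*")                   # Cmax, with a single letter per position
for w in ["a*cd*", "abcd*", "a*cd*"]:  # hits found while scanning the period segments
    insert(root, w)
# root.count == 1, and the node a*cd* (reached via the link labeled 'b') has count 2.
```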
Reachable Ancestors. The set of reachable ancestors of a node w is the set of all nodes in the tree whose subpatterns are proper superpatterns of the subpattern of w, i.e. whose patterns contain the pattern of w. A node w can compute this set
by performing the following algorithm.
1.
Derive a list, wm, of non-* letters which
are missing from the subpattern of w
when compared to C max (i.e. the
position-wise difference).
2.
The set of linked ancestors is those nodes whose missing letters form a proper prefix of wm. Non-linked ancestors have missing letters forming a proper sublist of wm but not a prefix.
For example, say we want to find the
reachable ancestors for the node *b1*d* from
Figure 3. The missing letters are {a, b2} from
Cmax. The set of linked ancestors is then
•
Missing ∅: a{b1,b2}*d*
•
Missing a: *{b1,b2}*d*
•
Missing a then b2: a{b1,b2}*d*
This method reverses the way the subpatterns are
inserted into the tree to avoid traversing the same
node multiple times. The set of not-reachable
ancestors can be found by looking at any other
sublist of {a, b2}, e.g. b2, which gives ab1*d*.
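A minimal sketch of the prefix test behind linked ancestors, again under the simplifying assumption that each position of Cmax holds a single letter so patterns are plain strings; names and examples are illustrative:

```python
def missing_letters(pattern, cmax):
    """The position-wise difference: non-* letters of Cmax absent from the pattern."""
    return [(i, c) for i, (c, p) in enumerate(zip(cmax, pattern)) if c != '*' and p == '*']

def is_linked_ancestor(ancestor, node, cmax):
    """An ancestor is reachable through existing links exactly when its missing-letter
    list is a proper prefix of the node's missing-letter list."""
    a, n = missing_letters(ancestor, cmax), missing_letters(node, cmax)
    return len(a) < len(n) and n[:len(a)] == a

cmax = "abcd*"
print(is_linked_ancestor("*bcd*", "**cd*", cmax))   # True: missing [a] is a prefix of [a, b]
print(is_linked_ancestor("a*cd*", "**cd*", cmax))   # False: missing [b] is a sublist, not a prefix
```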
4.6 Frequent Partial Pattern Derivation
The Apriori property and the Max-Subpattern Hit Set property, combined with the candidate generation algorithm, provide a way to enumerate
all the frequent i-patterns in S.
Given the period p of the time series
database S, we can generate the set of all
frequent 1-patterns, use the max-subpattern hit
method to build the max-subpattern tree, and
then perform joins on the set of frequent (i–1)-patterns to create the set of frequent i-patterns.
The frequency of each i-pattern is
determined by the sum of the frequency count of
the corresponding subpattern in the tree, if
present, and the frequency counts of the
reachable ancestors of that subpattern. The
Apriori property is applied here to prune those i-patterns which have frequency less than min_conf × m.
Algorithm.
1.
Derive the set of frequent 1-patterns F1 from a scan of S.
2.
Create the max-subpattern tree T by using the Insertion Algorithm on the max-subpattern of each period segment.
3.
To derive the frequent k-patterns, where k > 1, repeat the following steps until the derived set Fk is empty:
a.
Perform a k-way join on frequent
patterns of L -length (k – 1) to
generate the candidate k-patterns.
b.
Compute the set of reachable
ancestors for each k-pattern.
c.
Scan T to find the frequency counts
of the k-pattern and its reachable
ancestors.
d.
Generate the list Fk by pruning the
candidate k-patterns with counts
less than min_conf x m.
The most expensive step of the algorithm
above is the k-way join step, as further discussed
in the candidate generation algorithm.
Finding the frequency counts of the
candidate subpatterns requires looking at (x – 1)
nodes in the max-subpattern tree, where x is the
L-length of the candidate max-subpattern C max.
The set of reachable ancestors of a subpattern
will be along one path from the root to the node
containing the subpattern; therefore, only one
path will need to be traversed to find the
frequency count of a subpattern. Since the
largest path in the tree is of length equal to the
height, only (x – 1) nodes will be examined.
The algorithm will not create the set Fi+1 if
Fi does not contain any i-patterns. This is due to
the Apriori property. Therefore, once a set Fi is
empty, no more frequent partial patterns exist in
the database. All of the patterns have been
mined.
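The same derivation can be sketched without the tree by applying the hit set property directly: a candidate's frequency is the summed count of the hit max-subpatterns that contain it, and candidates below min_conf · m are pruned. This is an illustrative simplification of the tree-based counting, not the data structure the algorithm actually uses:

```python
from collections import Counter

def is_subpattern(sub, sup):
    """sub is a subpattern of sup if every non-* letter of sub matches sup at that position."""
    return all(a == '*' or a == b for a, b in zip(sub, sup))

def frequent_patterns(hit_counts, candidates, min_conf, m):
    """Sum, for each candidate, the counts of the hit max-subpatterns containing it,
    then keep only the candidates whose frequency reaches min_conf * m."""
    freq = Counter()
    for cand in candidates:
        for hit_pattern, count in hit_counts.items():
            if is_subpattern(cand, hit_pattern):
                freq[cand] += count
    return {c: n for c, n in freq.items() if n >= min_conf * m}

# Running example: the only hit is a*b* with count 3 over m = 4 period segments.
print(frequent_patterns({"a*b*": 3}, ["a*b*", "a**a"], min_conf=0.75, m=4))   # -> {'a*b*': 3}
```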
5. Implementation
In this project, an algorithm is implemented that incorporates all of the concepts defined
above in order to mine partial periodicity in a
multidimensional time series. The following is
the algorithm, shown in pseudocode; its
components are discussed in the previous
sections.
Multidimensional Partial Pattern Mining
Algorithm.
1.
For each d -dimensional point in the
database, perform dimensionality reduction by classifying the point using
one of the proposed methods.
2.
Represent the database as a list of all
the classified one-dimensional values of
each point, rather than as a set of d-dimensional points.
3.
Generate the frequent 1-patterns by
scanning this list, as described in
Section 4.
4.
Build the max-subpattern tree from
Section 4 by scanning the list again.
5.
Derive the frequent patterns, as described in the algorithm in Section 4.
The preceding algorithm will only scan the
database four times. Two scans are required to
classify the multidimensional points, and two
additional scans will be used to derive the
frequent partial patterns of the database.
The grid-based classification of the points
requires a scan of the database to determine the
size of the grid in every dimension d, i.e. to
derive the values dmax and dmin. The dataset is then scanned again to assign each point the one-dimensional value of the grid.
The clustering approach to classification
scans the database once to roughly cluster the
points. After the k-means algorithm, the cluster
value is assigned to each point as the database is
scanned a second time.
The discovery of the set F1 and the creation
of the max-subpattern tree each require one scan
of the database. Without the max-subpattern
tree, the frequency counts of subpatterns in the
sets Fi, where i ≥ 2, would be found by scanning the database up to p times, where p is the length of the period. Thus, needing only one scan to create the tree is a significant speedup if the database and the size of the derived patterns are large.
6. Experiments
This section outlines the experiments
performed to determine the validity of the
proposed algorithm. The experiments were
designed to compare the projection-based
clustering classification approach to the grid-based classification approach on the basis of
efficiency and scalability.
Based upon the results of the experiments,
the grid-based approach classifies and finds
comparable patterns in the multi-dimensional
points faster than the clustering approach. As the
number of data points increases, the discrepancy
in efficiency also increases.
Test Databases. To perform the experiments,
large time series databases are needed. These
were created by using a data generation
algorithm, which chooses values for the
dimensions at random with some guarantee of
periodicity. The algorithm ensures that there is
some noise in the dataset as well.
6.1 Analysis
Classification Efficiency. As seen in the table
below, as the value of c increases, the grid
algorithm is able to classify the points in
constant time. However, the clustering approach
requires time, linear in the choice of k, to classify
the points. This is due to the nature of the
classification algorithms. Clustering is a form of
unsupervised learning and is computed
dynamically, while the grid is statically determined a priori from the maximum and minimum values of each dimension.
Data points    Clustering                     Grid
               k=4     k=16     k=25          c=4     c=16     c=25
500             84      179      328           33       33       33
5000           651     2261     6998          314      314      315
50000        12766    41898   154697         3134     3134     3137

Table 1. The mean observed running time of each algorithm for the number of multidimensional data values in the left-hand column. The values across the top are the choices of k and c.
Scalability Comparison. As the number of data
points in a database increases, the time to
classify them also increases. As seen in Table 1,
both running times increase linearly. However,
the clustering method runs significantly slower
than the grid approach.
This cost is also the penalty of determining
classes dynamically. As the number of points
increases, there are more points to cluster. The
k-means algorithm continues to run as long as a
point changes clusters; more points increase the
probability of changes occurring. This will make
the algorithm run longer. Thus, the classification
of the points takes longer.
The grid approach, on the other hand, has
increasing execution time as a result of scanning
a longer database. The grid itself is a static
entity that assigns a value in constant time; the cost of classifying a single point is not dependent on the number of data points. This
makes it much more scalable than the clustering
approach.
However, the choices of c and k are vital to the patterns generated in the partial pattern
mining phase. The c used in the grid may
produce many empty cells, while none of the k
clusters will be empty at the end of the
algorithm. The clustering approach may then
produce a more accurate classification of the data
points, which would yield more meaningful
partial patterns.
Pattern Mining Performance. As seen in [9],
the algorithm to mine the partial patterns is
efficient. When paired with the classification
methods, the algorithm is still efficient and
scalable as the number of data points increases,
as seen in the graph below.
Figure 4. This graph shows the total time needed to mine the partial patterns of the data points, when classified by the clustering or the grid approach. [Plot: Time (seconds) versus Number of Clusters/Grid Cells (4, 9, 16, 25), with series Grid50K, Grid100K, Clust50K, and Clust100K.]

The above graph shows that the grid-based classification method scales better than the clustering approach as the number of data points and clusters increases.

7. Conclusions

Partial pattern mining algorithms yield interesting information about periodic time series databases, which cannot be found using techniques for deriving full periodic patterns. If the efficient methods of finding partial patterns can be applied to multidimensional databases, more interesting data could be found.

To find partial patterns from such databases, algorithms to reduce the dimensionality must be used. This paper presented two methods of dimensionality reduction by using classification methods. One approach involves clustering and the other uses a grid to assign a single-dimensional, categorical value to each multidimensional data point. The values of each point are then used to find partial patterns using the method described in [9].

The experiments show this overall method to be both scalable and efficient. The use of the grid-based classification allows the algorithm to achieve higher efficiency and scalability than the clustering approach.

8. References

[1] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the VLDB Conference, Santiago, Chile, September 1994.

[2] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proceedings of the 1995 International Conference on Data Engineering, Taipei, Taiwan, March 1995.

[3] W. Aref, M. Elfeky, and A. Elmagarmid. Incremental, Online and Merge Mining of Partial Periodic Patterns in Time-Series Databases. Purdue Technical Report, 2001.

[4] C. Berberidis et al. Multiple and Partial Periodicity Mining in Time Series Databases. In F. van Harmelen (ed.), Proceedings of the 15th European Conference on Artificial Intelligence (ECAI 2002), IOS Press, Amsterdam, 2002.

[5] K. Chan and A. Fu. Efficient Time-Series Matching by Wavelets. In Proceedings of the 1999 International Conference on Data Engineering, Sydney, Australia, March 1999.

[6] V. Faber. Clustering and the Continuous k-Means Algorithm. Los Alamos Science, Number 22, 1994, pages 138-144.

[7] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast Subsequence Matching in Time-Series Databases. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, May 1994.

[8] C. Faloutsos and K. Lin. FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, 1995.

[9] J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Databases. In Proceedings of the 1999 International Conference on Data Engineering, Sydney, Australia, March 1999.

[10] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000. ISBN 1-55860-489-8.

[11] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowledge and Information Systems, Springer-Verlag, 2001, pages 263-286.

[12] J. B. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, 1:281-297, 1967.