Time-Series Similarity Problems and Well-Separated Geometric Sets
Bela Bollobas
[email protected]
Gautam Das
[email protected]
Dimitrios Gunopulos
[email protected]
Heikki Mannila
[email protected]
Abstract
Given a pair of nonidentical complex objects, defining (and determining) how similar they are to each other is a nontrivial problem. In data mining applications, one frequently needs to determine the similarity between two time series. We analyze a model of time-series similarity that allows outliers and different scaling functions. We present deterministic and randomized algorithms for computing this notion of similarity. The algorithms are based on nontrivial tools and methods from computational geometry. In particular, we use properties of families of well-separated geometric sets. The randomized algorithm has provably good performance and also works extremely efficiently in practice.
1 Introduction
Being able to measure the similarity between objects is a crucial issue in many data retrieval and data mining applications; see [9] for a general discussion on similarity queries. Typically, the task is to define a function Sim(X, Y), where X and Y are two objects of a certain class, and the function value represents how "similar" they are to each other. For complex objects, designing such functions, and algorithms to compute them, is by no means trivial.
Time series are an important class of complex data objects: they arise in many applications. Examples of time series databases are stock price indices, volume of product sales, telecommunication data, one-dimensional medical signals, audio data, and environmental measurement sequences. In data mining applications, it is often necessary to search within a series database for those series that are similar to a given query series. This primitive is needed, for example, for prediction and clustering purposes. While the statistical literature on time series is vast, it has not studied similarity notions that would be appropriate for data mining applications.
(Author affiliations: B. Bollobas: Institute for Advanced Study, Princeton, NJ 08540, and University of Memphis, Department of Mathematical Sciences, Memphis, TN 38152, USA. G. Das: University of Memphis, Department of Mathematical Sciences, Memphis, TN 38152, USA. D. Gunopulos: IBM Almaden RC K55/B1, 650 Harry Rd., San Jose, CA 95120, USA. H. Mannila: University of Helsinki, Department of Computer Science, P.O. Box 26, FIN-00014 Helsinki, Finland.)
In this paper we present and analyze a similarity function for time series. We present new deterministic and randomized algorithms that are both practical and have provable performance bounds. Interestingly, these algorithms are based on several subtle geometric properties of the problems, in particular properties of certain well-separated geometric sets, which are interesting in their own right. In addition to our theoretical analysis, we present several implementation results.
Next we describe the similarity notion used here; related work is considered at the end of the next section. Suppose we are given two sequences X = x_1, ..., x_n and Y = y_1, ..., y_n. A simple starting point would be to measure the similarity of X and Y by using their L_p-distance as points of R^n. For time series, this way of measuring similarity or distance is not appropriate, since the sequences can have outliers, different scaling factors, and baselines.
Outliers are values that are measurement errors and should be omitted when comparing the sequence against others. A grossly outlying value can cause two otherwise identical series to have a large distance. A more reasonable similarity notion is based on the longest common subsequence concept, where intuitively X and Y are considered similar if they exhibit similar behavior for a large part of their length. More formally, let X' = x_{i_1}, ..., x_{i_l} and Y' = y_{j_1}, ..., y_{j_l} be the longest subsequences in X and Y respectively, where (a) for 1 ≤ k ≤ l-1, i_k < i_{k+1} and j_k < j_{k+1}, and (b) for 1 ≤ k ≤ l, x_{i_k} = y_{j_k}. We define Sim(X, Y) to be l/n. Note that it is not necessary for the two given sequences to have the same length, because the shorter sequence can always be padded with dummy numbers.
There are several shortcomings in the above model. The two sequences may have different scaling factors and baselines. For example, two stock indices could be essentially similar (i.e., they react similarly to changing market conditions) even though one fluctuates near $30 while the other fluctuates near $100. In practice there should also be some allowable tolerance when comparing elements from both sequences (even after one sequence has been appropriately scaled or otherwise transformed to resemble the other sequence). The following similarity function overcomes these problems.
The similarity function Sim_{δ,ε}(X, Y):
Let δ > 0 be an integer constant, 0 < ε < 1 a real constant, and f a linear function (of the form f: y = ax + b) belonging to the (infinite) family of linear functions L. Given two sequences X = x_1, ..., x_n and Y = y_1, ..., y_n, let X' = (x_{i_1}, ..., x_{i_l}) and Y' = (y_{j_1}, ..., y_{j_l}) be the longest subsequences in X and Y respectively such that
1. for 1 ≤ k ≤ l-1, i_k < i_{k+1} and j_k < j_{k+1},
2. for 1 ≤ k ≤ l, |i_k - j_k| ≤ δ, and
3. for 1 ≤ k ≤ l, y_{j_k}/(1+ε) ≤ f(x_{i_k}) ≤ y_{j_k}(1+ε).
Let S_{f,δ,ε}(X, Y) be defined as l/n. Then Sim_{δ,ε}(X, Y) is defined as max_{f in L} S_{f,δ,ε}(X, Y).
Thus, when Sim_{δ,ε}(X, Y) is close to 1, the two sequences are considered to be very similar. The constant δ ensures that the positions of each matched pair of elements are not too far apart. In practice this is a reasonable assumption, and it also helps in designing very efficient algorithms.
The linear function (or transformation) f allows us to detect similarity between two sequences with different base values and scaling factors. Note that an algorithm trying to compute the similarity will have to find which linear function to use (i.e., the values of a and b) so as to maximize l. The tolerance ε allows us to "approximately" match an element in X (after transformation) with an element in Y. We observe that our similarity function is not necessarily symmetric; a symmetric definition of similarity is max{Sim_{δ,ε}(X, Y), Sim_{δ,ε}(Y, X)}.
Given two sequences of length n, the longest common subsequence can be found in O(n^2) time by a well-known dynamic programming algorithm [3]. This algorithm can easily be modified to compute S_{f,δ,ε}(X, Y) in O(nδ) time, where f is a given linear function. Note that this is essentially linear time. We refer to this algorithm as the LCSS algorithm.
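The modified dynamic program can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the name lcss_score and the assumption that all values of Y are positive (so that y/(1+ε) ≤ y(1+ε)) are ours. For clarity the sketch fills the full O(n^2) table; restricting the computation to the diagonal band |i - j| ≤ δ yields the O(nδ) bound.

```python
def lcss_score(X, Y, f, delta, eps):
    """Compute S_{f,delta,eps}(X, Y): the normalized length of the longest
    matching subsequence under transformation f, displacement bound delta,
    and tolerance eps.  Assumes the values of Y are positive."""
    n = len(X)
    # dp[i][j] = longest match using the first i elements of X, first j of Y
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            fx = f(X[i - 1])
            # Elements may be matched only when their positions differ by at
            # most delta and the transformed value lies in the tolerance band.
            if abs(i - j) <= delta and Y[j - 1] / (1 + eps) <= fx <= Y[j - 1] * (1 + eps):
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][n] / n
```

For example, with f(x) = 2x mapping X = [1, 2, 100, 4] onto Y = [2, 4, 6, 8], the outlying value 100 is skipped and the score is 3/4.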
To design algorithms to compute Sim_{δ,ε}(X, Y), the main task is to locate a finite set of all fundamentally different linear transformations and run LCSS on each one.
Here we present algorithms that are based on a thorough analysis of the geometric properties of the problem. These algorithms have provable performance bounds, and some of them are very efficient in practice.
2 Main results
We summarize our results below.
1. Cubic-time exact algorithm:
Sim_{δ,ε}(X, Y) can be computed in O(n^3 δ^3) time.
Proof: (Sketch)
Let V = {(x_i, y_j) | x_i in X, y_j in Y, |i - j| ≤ δ} be a set of points in the xy-plane. Clearly |V| = O(nδ). For every point p = (x_i, y_j) in V, consider a vertical line segment L_p = (p_1, p_2), where p_1 = (x_i, y_j(1+ε)) and p_2 = (x_i, y_j/(1+ε)). Let R = {L_p | p in V}. Consider the linear transformation f: y = ax + b. Let R_f be the segments of R stabbed by the infinite line f. Let V_R be the set of endpoints of all vertical segments in R. Thus |V_R| = 2|V| = O(nδ). Consider any linear function f that does not pass through any point in V_R. Clearly, if we perturb f slightly, R_f will not change. Recall that L is the (infinite) family of all linear functions. Let L' be the family of linear functions such that f' is in L' iff f' passes through two points of V_R. It is easy to see that for any linear function f in L, there is a function f' in L' such that R_f = R_{f'}.
Thus, our algorithm first computes L' (whose size is O(n^2 δ^2)), then runs LCSS on each function, and finally outputs the normalized length of the longest sequence found. This gives a total running time of O(n^3 δ^3).
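The construction of the candidate family L' can be sketched as follows. This is our own illustrative Python sketch (the name candidate_functions and the (a, b) coefficient representation of a line are ours), assuming positive-valued sequences; the exact algorithm would then run LCSS once per candidate and report the best normalized length.

```python
from itertools import combinations

def candidate_functions(X, Y, delta, eps):
    """Enumerate L': all non-vertical lines y = a*x + b passing through two
    endpoints of the vertical segments L_p.  The number of endpoints is
    O(n*delta), so |L'| = O(n^2 * delta^2)."""
    n = len(X)
    endpoints = []
    # V = {(x_i, y_j) : |i - j| <= delta}; each point of V contributes the
    # two endpoints (x_i, y_j*(1+eps)) and (x_i, y_j/(1+eps)) of its segment.
    for i in range(n):
        for j in range(max(0, i - delta), min(n, i + delta + 1)):
            endpoints.append((X[i], Y[j] * (1 + eps)))
            endpoints.append((X[i], Y[j] / (1 + eps)))
    funcs = set()
    for (x1, y1), (x2, y2) in combinations(endpoints, 2):
        if x1 != x2:                       # skip vertical lines
            a = (y2 - y1) / (x2 - x1)
            funcs.add((a, y1 - a * x1))    # store as coefficients (a, b)
    return funcs
```

Collecting the lines in a set deduplicates coincident candidates; with ε = 0 the true scaling line y = 2x, for instance, appears among the candidates for X = (1, 2) and Y = (2, 4).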
2. Quadratic-time approximation algorithm:
Let 0 < β < 1 be any desired small constant. A linear transformation f and the corresponding S_{f,δ,ε}(X, Y) can be computed in O(n^2 δ^2 + n δ^3 / β^2) time such that Sim_{δ,ε}(X, Y) - S_{f,δ,ε}(X, Y) ≤ β.
3. Linear-time randomized approximation algorithm:
Let 0 < β < 1 be any desired small constant. A linear transformation f and the corresponding S_{f,δ,ε}(X, Y) can be computed in O(n δ^3 / β^2) running time such that Sim_{δ,ε}(X, Y) - S_{f,δ,ε}(X, Y) ≤ β with high probability.
We note that the randomized approximation algorithm is very simple: it randomly selects a constant number of linear functions in L', and runs LCSS for each one.
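A minimal sketch of this scheme, assuming positive-valued sequences; the function names, the parameter K, and the fixed seed are our own. It samples K random lines through pairs of segment endpoints and keeps the one with the best LCSS score, computed here by a plain O(n^2) dynamic program for brevity.

```python
import random

def lcss(X, Y, a, b, delta, eps):
    # S_{f,delta,eps} for f(x) = a*x + b, via the standard LCSS recurrence.
    n = len(X)
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            fx = a * X[i - 1] + b
            if abs(i - j) <= delta and Y[j - 1] / (1 + eps) <= fx <= Y[j - 1] * (1 + eps):
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][n] / n

def randomized_similarity(X, Y, delta, eps, K=500, seed=0):
    """Approximate Sim_{delta,eps}(X, Y) by trying K random candidate lines,
    each through two random endpoints of the vertical tolerance segments."""
    rng = random.Random(seed)
    n = len(X)

    def random_endpoint():
        i = rng.randrange(n)
        j = rng.randrange(max(0, i - delta), min(n, i + delta + 1))
        y = Y[j] * (1 + eps) if rng.random() < 0.5 else Y[j] / (1 + eps)
        return X[i], y

    best_score, best_f = 0.0, None
    for _ in range(K):
        (x1, y1), (x2, y2) = random_endpoint(), random_endpoint()
        if x1 == x2:
            continue                       # vertical line: not a function
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        s = lcss(X, Y, a, b, delta, eps)
        if s > best_score:
            best_score, best_f = s, (a, b)
    return best_score, best_f
```

On clean data such as Y = 2X, the sampler typically recovers a near-perfect score well before K reaches the theoretical bound.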
While developing these algorithms we discovered several geometric results which are of independent interest. We describe some of these results below.
Let S_i Δ S_j represent the symmetric difference between two sets. Let V be a finite set and 2^V be its power set. Let k > 0 be an integer. A family of finite sets S ⊆ 2^V is k-separated if for all S_i, S_j in S, |S_i Δ S_j| > k. It is known that there exist k-separated families S where |S| = 2^(cn), where c depends on k/n [4]. However, we can get much better upper bounds if we only consider certain kinds of geometric sets.
Consider the following set system. Let R be a fixed set of n line segments in the plane. Given any infinite line L, let R_L be the set of line segments of R intersected (or stabbed) by L. (If R_L = R, then L is known as a transversal of R.) Let the family S consist of the distinct stabbed sets R_L of R, for all possible infinite lines L. The following result and its corollaries are crucial for our algorithms.
4. k-separated stabbed sets:
The maximum size of a k-separated family S' ⊆ S is Θ(n^2/k^2).
In particular, if k = αn for some constant 0 < α < 1, the size of any k-separated family is O(1). It is this property that is used in our approximation algorithms. We note here that the last result can also be derived from the cutting theorem of [6, 5], but our proof is simpler and more direct.
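The set system can be made concrete with a small sketch, our own illustration rather than anything from the paper: it computes the stabbed sets R_L for candidate lines through pairs of segment endpoints (which suffice to realize every distinct stabbed set, as in the proof sketch above) and greedily extracts a k-separated subfamily, using Python's `^` operator for symmetric difference.

```python
from itertools import combinations

def greedy_k_separated(segments, k):
    """segments: list of vertical segments ((x, y_lo), (x, y_hi)).
    Returns a k-separated family of stabbed sets, built greedily over the
    candidate lines through pairs of segment endpoints."""
    endpoints = [p for seg in segments for p in seg]
    stabbed_sets = []
    for (x1, y1), (x2, y2) in combinations(endpoints, 2):
        if x1 == x2:
            continue                       # vertical candidate line: skip
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # A vertical segment is stabbed iff the line's height at its x lies
        # within [y_lo, y_hi].
        s = frozenset(i for i, ((x, lo), (_, hi)) in enumerate(segments)
                      if lo <= a * x + b <= hi)
        stabbed_sets.append(s)
    family = []
    for s in stabbed_sets:
        # Keep s only if it differs from every kept set in more than k elements.
        if all(len(s ^ t) > k for t in family):
            family.append(s)
    return family
```

With k = 0 this simply collects the distinct stabbed sets; larger k prunes the family, illustrating why a k-separated family stays small.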
Related work: There has been some recent work on the problem of defining similarity between time series; due to lack of space we only give some references. The problem was introduced to the data mining community by papers [1, 8]. The similarity measures considered in [2, 7, 12] use the concept of longest common subsequence. In [1], a fingerprint method is used, where the discrete Fourier transform is employed to reduce the dimensions of each sequence, and the resulting fingerprints are compared. In [10], feature extraction techniques are used; see [11] for a general discussion on fingerprinting techniques.
3 Experimental results
We experimented with the exact algorithm and the randomized approximation algorithm using three collections of sequences. One consisted of quarterly indicators of the status of the Finnish economy (67 sequences, 85 points), another of measurements of traffic data, error counts and call counts at 15-minute intervals (17 different phone lines, i.e., 51 sequences, 478 points each), and the third one of stock prices at the NYSE. Especially in the phone line data, outliers are truly a problem: there are values in the sequences that differ by a factor of 2-5 from all the other values. In most cases these are outliers, but in some cases not; removing them permanently from the data is not possible.
Overall, the algorithms behaved as predicted by the theoretical analysis. The exact cubic-time algorithm was far too slow to use for any but the smallest sequences and displacements. The randomized algorithm proved to be very efficient. Moreover, it produced approximations to the true similarity that are very close to the correct values. For example, in Table 1 we see that for varying ε, the randomized algorithm got to within 1 of the true optimum already after 500 randomly chosen linear functions. Table 2 shows the time needed for this analysis. The randomized algorithm works extremely fast, checking 500 random linear functions in about 6.5 seconds, and this time is almost independent of the displacement. The results for the other sequence types were similar, and they are omitted from this version of the paper.
References
[1] R. Agrawal, C. Faloutsos and A. Swami. Efficient Similarity Search in Sequence Databases. In Proc. of the 4th Intl. Conf. on Foundations of Data Organization and Algorithms (FODO'93), 1993.
[2] R. Agrawal, K.-I. Lin, H. S. Sawhney and K. Shim. Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases. In Proc. of the 21st Intl. Conf. on Very Large Data Bases (VLDB'95), pp. 490-501.
ε      δ   LCSS   K=10   K=50   K=100   K=200   K=500   K=1000
0.10   0   85     80.6   83.1   83.4    83.2    84.0    84.0
0.10   1   85     79.2   83.2   83.7    83.9    84.7    84.9
0.05   0   81     77.7   80.1   80.2    80.2    80.5    80.9
0.05   1   81     78.1   80.3   80.3    80.6    81.0    80.9
0.01   0   56     49.5   53.4   53.8    54.7    55.3    55.3
0.01   1   57     50.2   52.5   53.6    54.4    54.8    55.6
Table 1: True similarity between sequences (the LCSS column) and the length of the longest common subsequence found by using K randomly chosen linear functions; averages over 10 trials. Data: two series of 85 points about the Finnish national economy.
ε      δ   LCSS   exact   K=10   K=50   K=100   K=500   K=1000
0.10   0   85     361     0.16   0.67   1.3     6.5     13.0
0.10   1   85     3211    0.15   0.68   1.3     6.5     13.0
0.05   0   81     358     0.15   0.67   1.4     6.5     13.0
0.05   1   81     3236    0.14   0.57   1.1     5.6     11.3
0.01   0   56     364     0.15   0.67   1.3     6.5     13.0
0.01   1   57     3269    0.11   0.58   1.1     5.6     11.2
Table 2: Time in seconds (measured on a Pentium machine with 32 MB of memory) used by the exact algorithm (the "exact" column) and the randomized approximation algorithm for K randomly chosen linear functions; averages over 10 trials. Data: as above.
[3] A. V. Aho. Algorithms for Finding Patterns in Strings. In Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, Elsevier, 1990, pp. 255-400.
[4] B. Bollobas. Combinatorics. Cambridge University Press, 1986.
[5] B. Chazelle. Cutting Hyperplanes for Divide-and-Conquer. Discrete Comput. Geom., 9, 2, 1993, pp. 145-158.
[6] B. Chazelle and E. Welzl. Quasi-Optimal Range Searching in Spaces of Finite VC-Dimension. Discrete Comput. Geom., 4, 1989, pp. 467-489.
[7] G. Das, D. Gunopulos and H. Mannila. Finding Similar Time Series. Manuscript, 1996.
[8] C. Faloutsos, M. Ranganathan and Y. Manolopoulos. Fast Subsequence Matching in Time-Series Databases. In SIGMOD'94, 1994.
[9] H. V. Jagadish, A. O. Mendelzon and T. Milo. Similarity-Based Queries. In Proc. of the 14th Symp. on Principles of Database Systems (PODS'95), 1995, pp. 36-45.
[10] H. Shatkay and S. Zdonik. Approximate Queries and Representations for Large Data Sequences. In ICDE'96, 1996.
[11] D. A. White and R. Jain. Algorithms and Strategies for Similarity Retrieval. Technical Report VCL-96-101, Visual Computing Laboratory, UC Davis, 1996.
[12] N. Yazdani and Z. M. Ozsoyoglu. Sequence Matching of Images. In Proc. of the 8th Intl. Conf. on Scientific and Statistical Database Management, 1996, pp. 53-62.