Time-Series Similarity Problems and Well-Separated Geometric Sets

Bela Bollobas* ([email protected])
Gautam Das† ([email protected])
Dimitrios Gunopulos‡ ([email protected])
Heikki Mannila§ ([email protected])

* Institute for Advanced Study, Princeton, NJ 08540, and University of Memphis, Department of Mathematical Sciences, Memphis, TN 38152, USA
† University of Memphis, Department of Mathematical Sciences, Memphis, TN 38152, USA
‡ IBM Almaden RC K55/B1, 650 Harry Rd., San Jose, CA 95120, USA
§ University of Helsinki, Department of Computer Science, P.O. Box 26, FIN-00014 Helsinki, Finland

Abstract

Given a pair of nonidentical complex objects, defining (and determining) how similar they are to each other is a nontrivial problem. In data mining applications, one frequently needs to determine the similarity between two time series. We analyze a model of time-series similarity that allows outliers and different scaling functions. We present deterministic and randomized algorithms for computing this notion of similarity. The algorithms are based on nontrivial tools and methods from computational geometry. In particular, we use properties of families of well-separated geometric sets. The randomized algorithm has provably good performance and also works extremely efficiently in practice.

1 Introduction

Being able to measure the similarity between objects is a crucial issue in many data retrieval and data mining applications; see [9] for a general discussion of similarity queries. Typically, the task is to define a function Sim(X, Y), where X and Y are two objects of a certain class, and the function value represents how "similar" they are to each other. For complex objects, designing such functions, and algorithms to compute them, is by no means trivial.

Time series are an important class of complex data objects that arise in many applications. Examples of time-series databases are stock price indices, volumes of product sales, telecommunication data, one-dimensional medical signals, audio data, and environmental measurement sequences. In data mining applications, it is often necessary to search a series database for those series that are similar to a given query series. This primitive is needed, for example, for prediction and clustering purposes. While the statistical literature on time series is vast, it has not studied similarity notions that would be appropriate for data mining applications.

In this paper we present and analyze a similarity function for time series. We present new deterministic and randomized algorithms that are both practical and have provable performance bounds. Interestingly, these algorithms are based on several subtle geometric properties of the problem, in particular properties of certain well-separated geometric sets, which are interesting in their own right. In addition to our theoretical analysis, we present several implementation results. Next we describe the similarity notion used here; related work is considered at the end of the next section.

Suppose we are given two sequences X = x_1, ..., x_n and Y = y_1, ..., y_n. A simple starting point would be to measure the similarity of X and Y by their L_p distance as points of R^n. For time series, this way of measuring similarity or distance is not appropriate, since the sequences can have outliers and different scaling factors and baselines. Outliers are values that are measurement errors and should be omitted when comparing the sequence against others; a single grossly outlying value can cause two otherwise identical series to have a large distance. A more reasonable similarity notion is based on the longest-common-subsequence concept, where intuitively X and Y are considered similar if they exhibit similar behavior for a large part of their length.
More formally, let X' = x_{i_1}, ..., x_{i_l} and Y' = y_{j_1}, ..., y_{j_l} be the longest subsequences in X and Y respectively, where (a) for 1 ≤ k ≤ l−1, i_k < i_{k+1} and j_k < j_{k+1}, and (b) for 1 ≤ k ≤ l, x_{i_k} = y_{j_k}. We define Sim(X, Y) to be l/n. Note that it is not necessary for the two given sequences to have the same length, because the shorter sequence can always be padded with dummy numbers.

There are several shortcomings in the above model. The two sequences may have different scaling factors and baselines. For example, two stock indices could be essentially similar (i.e., they react similarly to changing market conditions) even though one fluctuates near $30 while the other fluctuates near $100. In practice there should also be some allowable tolerance when comparing elements from both sequences (even after one sequence has been appropriately scaled or otherwise transformed to resemble the other sequence). The following similarity function overcomes these problems.

The similarity function Sim_{δ,ε}(X, Y): Let δ > 0 be an integer constant, 0 < ε < 1 a real constant, and f a linear function (of the form f: y = ax + b) belonging to the (infinite) family of linear functions L. Given two sequences X = x_1, ..., x_n and Y = y_1, ..., y_n, let X' = (x_{i_1}, ..., x_{i_l}) and Y' = (y_{j_1}, ..., y_{j_l}) be the longest subsequences in X and Y respectively such that

1. for 1 ≤ k ≤ l−1, i_k < i_{k+1} and j_k < j_{k+1},
2. for 1 ≤ k ≤ l, |i_k − j_k| ≤ δ, and
3. for 1 ≤ k ≤ l, y_{j_k}/(1 + ε) ≤ f(x_{i_k}) ≤ y_{j_k}(1 + ε).

Let S_{f,δ,ε}(X, Y) be defined as l/n. Then Sim_{δ,ε}(X, Y) is defined as max_{f ∈ L} {S_{f,δ,ε}(X, Y)}.

Thus, when Sim_{δ,ε}(X, Y) is close to 1, the two sequences are considered to be very similar. The constant δ ensures that the positions of each matched pair of elements are not too far apart. In practice this is a reasonable assumption, and it also helps in designing very efficient algorithms. The linear function (or transformation) f allows us to detect similarity between two sequences with different base values and scaling factors.
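For a fixed f, computing S_{f,δ,ε} is a longest-common-subsequence computation in which x_i may be matched to y_j exactly when conditions 2 and 3 hold. A minimal Python sketch (our own illustration, not the authors' code; it fills the full O(n^2) table rather than only the δ-band, and it assumes positive values in Y so that condition 3 reads as written):

```python
def s_f(X, Y, f, delta, eps):
    """S_{f,delta,eps}(X, Y) for one fixed linear function f (a sketch).
    Positive y-values are assumed, as in condition 3."""
    n = len(X)
    # dp[i][j]: longest valid matching of the prefixes X[:i], Y[:j]
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            fx, y = f(X[i - 1]), Y[j - 1]
            # conditions 2 (band) and 3 (tolerance) decide whether x_i ~ y_j
            if abs(i - j) <= delta and y / (1 + eps) <= fx <= y * (1 + eps):
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][n] / n
```

Restricting j to the band i − δ ≤ j ≤ i + δ (the only cells where a match is possible) brings the running time down to O(nδ).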
Note that an algorithm trying to compute the similarity will have to find out which linear function to use (i.e., the values of a and b) so as to maximize l. The tolerance ε allows us to "approximately" match an element in X (after transformation) with an element in Y. We observe that our similarity function is not necessarily symmetric; a symmetric definition of similarity is max{Sim_{δ,ε}(X, Y), Sim_{δ,ε}(Y, X)}.

Given two sequences of length n, the longest common subsequence can be found in O(n^2) time by a well-known dynamic programming algorithm [3]. This algorithm can easily be modified to compute S_{f,δ,ε}(X, Y) in O(nδ) time, where f is a given linear function; note that this is essentially linear time. We refer to this algorithm as the LCSS algorithm. To design algorithms to compute Sim_{δ,ε}(X, Y), the main task is to locate a finite set of all fundamentally different linear transformations and run LCSS on each one. Here we present algorithms that are based on a thorough analysis of the geometric properties of the problem. These algorithms have provable performance bounds, and some of them are very efficient in practice.

2 Main results

We summarize our results below.

1. Cubic-time exact algorithm: Sim_{δ,ε}(X, Y) can be computed in O(n^3 δ^3) time.

Proof: (Sketch) Let V = {(x_i, y_j) | x_i ∈ X, y_j ∈ Y, |i − j| ≤ δ} be a set of points in the xy-plane. Clearly |V| = O(nδ). For every point p = (x_i, y_j) ∈ V, consider a vertical line segment L_p = (p_1, p_2), where p_1 = (x_i, y_j(1 + ε)) and p_2 = (x_i, y_j/(1 + ε)). Let R = {L_p | p ∈ V}. Consider the linear transformation f: y = ax + b, and let R_f be the segments of R stabbed by the infinite line f. Let V_R be the set of endpoints of all vertical segments in R; thus |V_R| = 2|V| = O(nδ). Consider any linear function f that does not pass through any point of V_R: clearly, if we perturb f slightly, R_f does not change. Recall that L is the (infinite) family of all linear functions.
Let L' be the family of linear functions such that f' ∈ L' if and only if f' passes through two points of V_R. It is easy to see that for any linear function f ∈ L, there is a function f' ∈ L' such that R_{f'} = R_f. Thus, our algorithm first computes L' (whose size is O(n^2 δ^2)), then runs LCSS on each function, and finally outputs the normalized length of the longest sequence found. This gives a total running time of O(n^3 δ^3).

2. Quadratic-time approximation algorithm: Let 0 < β < 1 be any desired small constant. A linear transformation f and the corresponding S_{f,δ,ε}(X, Y) can be computed in O(n^2 δ^2 + nδ^3/β^2) time such that Sim_{δ,ε}(X, Y) − S_{f,δ,ε}(X, Y) ≤ β.

3. Linear-time randomized approximation algorithm: Let 0 < β < 1 be any desired small constant. A linear transformation f and the corresponding S_{f,δ,ε}(X, Y) can be computed in O(nδ^3/β^2) time such that Sim_{δ,ε}(X, Y) − S_{f,δ,ε}(X, Y) ≤ β with high probability. We note that the randomized approximation algorithm is very simple: it randomly selects a constant number of linear functions in L', and runs LCSS for each one.

While developing these algorithms we discovered several geometric results which are of independent interest; we describe some of them below. Let S_i △ S_j represent the symmetric difference between two sets. Let V be a finite set and 2^V be its power set, and let k > 0 be an integer. A family of finite sets S ⊆ 2^V is k-separated if for all S_i, S_j ∈ S, |S_i △ S_j| > k. It is known that there exist k-separated families S with |S| = 2^{αn}, where α depends on k/n [4]. However, we can get much better upper bounds if we only consider certain kinds of geometric sets. Consider the following set system. Let R be a fixed set of n line segments in the plane. Given any infinite line L, let R_L be the set of line segments of R intersected (or stabbed) by L. (If R_L = R, then L is known as a transversal of R.) Let the family S consist of the distinct stabbed sets R_L of R, over all possible infinite lines L.
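To make this set system concrete, the following small sketch (our own illustration; the segment representation and names are assumptions) enumerates the candidate lines through pairs of segment endpoints — the finite family L' from the proof of result 1 — and collects the distinct stabbed sets R_L they produce:

```python
import itertools

def stabbed_sets(segments):
    """segments: list of vertical segments (x, y_lo, y_hi).
    Enumerate lines through pairs of segment endpoints (the family L')
    and return the distinct stabbed sets R_L as frozensets of indices."""
    endpoints = [(x, y) for x, lo, hi in segments for y in (lo, hi)]
    families = set()
    for (x1, y1), (x2, y2) in itertools.combinations(endpoints, 2):
        if x1 == x2:
            continue  # the pair defines a vertical line, not y = ax + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        stabbed = frozenset(i for i, (x, lo, hi) in enumerate(segments)
                            if lo <= a * x + b <= hi)
        families.add(stabbed)
    return families
```

For three unit-height segments at x = 0, 1, 2, for example, only the stabbed sets {0, 1}, {1, 2}, and {0, 1, 2} arise: no line can stab the outer two segments while missing the middle one.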
The following result and its corollaries are crucial for our algorithms.

4. k-separated stabbed sets: The maximum size of a k-separated family S' ⊆ S is O(n^2/k^2). In particular, if k = γn for some constant 0 < γ < 1, the size of any k-separated family is O(1).

It is this property that is used in our approximation algorithms. We note here that the last result can also be derived from the cutting theorem of [6, 5], but our proof is simpler and direct.

Related work: There has been some recent work on the problem of defining similarity between time series; due to lack of space we only give some references. The problem was introduced to the data mining community by papers [1, 8]. The similarity measures considered in [2, 7, 12] use the concept of longest common subsequence. In [1], a fingerprint method is used, where the discrete Fourier transform is employed to reduce the dimensionality of each sequence, and the resulting fingerprints are compared. In [10], feature extraction techniques are used; see [11] for a general discussion of fingerprinting techniques.

3 Experimental results

We experimented with the exact algorithm and the randomized approximation algorithm using three collections of sequences. One consisted of quarterly indicators of the status of the Finnish economy (67 sequences, 85 points each), another of measurements of traffic data, error counts, and call counts at 15-minute intervals (17 different phone lines, i.e., 51 sequences, 478 points each), and the third of stock prices at the NYSE. Especially in the phone line data, outliers are truly a problem: there are values in the sequences that differ by a factor of 2-5 from all the other values. In most cases these are outliers, but in some cases not; removing them permanently from the data is not possible. Overall, the algorithms behaved as predicted by the theoretical analysis. The exact O(n^3 δ^3) algorithm was far too slow to use for any but the smallest sequences and displacements.
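The randomized procedure evaluated in the experiments is easy to state in full: take K random pairs of segment endpoints, use the line through each pair as a candidate f, and keep the best LCSS score. A self-contained Python sketch (our own illustration, not the authors' implementation; for brevity the scorer fills the full table instead of only the δ-band, and positive values in Y are assumed):

```python
import random

def lcss_score(X, Y, f, delta, eps):
    # LCSS recurrence with the delta-band and epsilon-tolerance tests
    n = len(X)
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            fx, y = f(X[i - 1]), Y[j - 1]
            if abs(i - j) <= delta and y / (1 + eps) <= fx <= y * (1 + eps):
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][n] / n

def sim_randomized(X, Y, delta, eps, K, seed=0):
    """Best LCSS score over K lines through random endpoint pairs."""
    rng = random.Random(seed)
    n = len(X)
    # endpoints of the vertical segments L_p, one pair per point of V
    pts = [(X[i], Y[j] * s)
           for i in range(n)
           for j in range(max(0, i - delta), min(n, i + delta + 1))
           for s in (1 + eps, 1 / (1 + eps))]
    best = 0.0
    for _ in range(K):
        (x1, y1), (x2, y2) = rng.sample(pts, 2)
        if x1 == x2:
            continue  # vertical candidate line; skip it
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        best = max(best, lcss_score(X, Y, lambda x: a * x + b, delta, eps))
    return best
```

Each candidate costs one LCSS run, so the total work grows linearly in both n and K, which matches the near-constant per-function times reported in Table 2.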
The randomized algorithm proved to be very efficient. Moreover, it produced approximations to the true similarity that are very close to the correct values. For example, in Table 1 we see that for varying ε, the randomized algorithm got to within 1 of the true optimum already after 500 randomly chosen linear functions. Table 2 shows the time needed for this analysis. The randomized algorithm works extremely fast, checking 500 random linear functions in about 6.5 seconds, and this time is almost independent of the displacement. The results for the other sequence types were similar, and they are omitted from this version of the paper.

ε      δ   LCSS   K=10   K=50   K=100  K=200  K=500  K=1000
0.10   0   85     80.6   83.1   83.4   83.2   84.0   84.0
0.10   1   85     79.2   83.2   83.7   83.9   84.7   84.9
0.05   0   81     77.7   80.1   80.2   80.2   80.5   80.9
0.05   1   81     78.1   80.3   80.3   80.6   81.0   80.9
0.01   0   56     49.5   53.4   53.8   54.7   55.3   55.3
0.01   1   57     50.2   52.5   53.6   54.4   54.8   55.6

Table 1: True similarity between sequences (LCSS) and the length of the longest common subsequence found by using K randomly chosen linear functions; averages over 10 trials. Data: two series of 85 points about the Finnish national economy.

ε      δ   LCSS   exact   K=10   K=50   K=100  K=500  K=1000
0.10   0   85     361     0.16   0.67   1.3    6.5    13.0
0.10   1   85     3211    0.15   0.68   1.3    6.5    13.0
0.05   0   81     358     0.15   0.67   1.4    6.5    13.0
0.05   1   81     3236    0.14   0.57   1.1    5.6    11.3
0.01   0   56     364     0.15   0.67   1.3    6.5    13.0
0.01   1   57     3269    0.11   0.58   1.1    5.6    11.2

Table 2: Time in seconds (measured on a Pentium machine with 32 MB of memory) used by the exact algorithm and by the randomized approximate algorithm for K randomly chosen linear functions; averages over 10 trials. Data: as above.

References

[1] R. Agrawal, C. Faloutsos and A. Swami. Efficient Similarity Search in Sequence Databases. In Proc. of the 4th Intl. Conf. on Foundations of Data Organization and Algorithms (FODO'93), 1993.

[2] R. Agrawal, K.-I. Lin, H. S. Sawhney and K. Shim. Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases. In Proc. of the 21st Intl. Conf. on Very Large Data Bases (VLDB'95), pp. 490-501.

[3] A. V. Aho. Algorithms for Finding Patterns in Strings. In Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, Elsevier, 1990, pp. 255-400.

[4] B. Bollobas. Combinatorics. Cambridge University Press, 1986.

[5] B. Chazelle. Cutting hyperplanes for divide-and-conquer. Discrete Comput. Geom., 9(2), 1993, pp. 145-158.

[6] B. Chazelle and E. Welzl. Quasi-optimal range searching in spaces of finite VC-dimension. Discrete Comput. Geom., 4, 1989, pp. 467-489.

[7] G. Das, D. Gunopulos and H. Mannila. Finding Similar Time Series. Manuscript, 1996.

[8] C. Faloutsos, M. Ranganathan and Y. Manolopoulos. Fast Subsequence Matching in Time-Series Databases. In SIGMOD'94, 1994.

[9] H. V. Jagadish, A. O. Mendelzon and T. Milo. Similarity-Based Queries. In Proc. of the 14th Symp. on Principles of Database Systems (PODS'95), 1995, pp. 36-45.

[10] H. Shatkay and S. Zdonik. Approximate Queries and Representations for Large Data Sequences. In ICDE'96, 1996.

[11] D. A. White and R. Jain. Algorithms and Strategies for Similarity Retrieval. Technical Report VCL-96-101, Visual Computing Laboratory, UC Davis, 1996.

[12] N. Yazdani and Z. M. Ozsoyoglu. Sequence Matching of Images. In Proc. of the 8th Intl. Conf. on Scientific and Statistical Database Management, 1996, pp. 53-62.