Fair Use Agreement

This agreement covers the use of all slides on this CD-ROM; please read carefully.

• You may freely use these slides for teaching, if:
  • You send me an email telling me the class number/university in advance.
  • My name and email address appear on the first slide (if you are using all or most of the slides), or on each slide (if you are just taking a few slides).
• You may freely use these slides for a conference presentation, if:
  • You send me an email telling me the conference name in advance.
  • My name appears on each slide you use.
• You may not use these slides for tutorials, or in a published work (tech report/conference paper/thesis/journal, etc.). If you wish to do this, email me first; it is highly likely I will grant you permission.

(c) Eamonn Keogh, [email protected]

Everything You Know About Dynamic Time Warping is Wrong

Chotirat Ann Ratanamahatana and Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
[email protected]

Outline of Talk

• Introduction to dynamic time warping (DTW)
• Why DTW is important
• Introduction/review of the LB_Keogh solution
• Three popular beliefs about DTW
• Why popular belief 1 is wrong
• Why popular belief 2 is wrong
• Why popular belief 3 is wrong
• Conclusions

Dynamic Time Warping

Here is a simple visual example to help you develop an intuition for DTW. We are looking at nuclear power data.

[Figure: two nuclear power time series aligned point-to-point under Euclidean distance, and nonlinearly under DTW. The DTW alignment is excellent!]

Let us compare Euclidean distance and DTW on some problems.

[Figure: sample time series from each test dataset: Leaves, Faces, Gun, Trace, Control Chart, Sign language, 2-Patterns, and Word Spotting.]

Results: Error Rate

Using 1-nearest-neighbor, leave-one-out evaluation!

Dataset          Euclidean    DTW
Word Spotting    4.78         1.10
Sign language    28.70        25.93
GUN              5.50         1.00
Nuclear Trace    11.00        0.00
Leaves           33.26        4.07
(4) Faces        6.25         2.68
Control Chart    7.5          0.33
2-Patterns       1.04         0.00

How is DTW Calculated?

Every possible warping between two time series is a path through the matrix. We want the best one:

DTW(Q, C) = min{ sqrt( Σ_{k=1}^{K} w_k ) }

where w is a warping path through the matrix and w_k is the cost stored in its k-th cell.

This recursive function gives us the minimum-cost path:

γ(i, j) = d(q_i, c_j) + min{ γ(i-1, j-1), γ(i-1, j), γ(i, j-1) }

Important note: the time series can be of different lengths.

Global Constraints

• Slightly speed up the calculations
• Prevent pathological warpings

[Figure: the Sakoe-Chiba Band, which restricts the warping path to a band around the matrix diagonal.]

In general, it is hard to speed up a single DTW calculation. However, if we have to make many DTW calculations (which is almost always the case), we can potentially speed up the whole process by lower bounding. Keep in mind that the lower-bounding trick works for any situation where you have an expensive calculation that can be lower bounded (string edit distance, graph edit distance, etc.).

I will explain how lower bounding works in a generic fashion in the next two slides, then show concretely how lower bounding makes dealing with massive time series under DTW possible…

Lower Bounding I

Assume that we have two functions:

• DTW(A, B): the true DTW function, which is very slow.
• lower_bound_distance(A, B): the lower bound function, which is very fast.

By definition, for all A, B we have:

lower_bound_distance(A, B) ≤ DTW(A, B)

Lower Bounding II

We can speed up similarity search under DTW by using a lower-bounding function:

Algorithm Lower_Bounding_Sequential_Scan(Q)
1.  best_so_far = infinity;
2.  for all sequences in database
3.      LB_dist = lower_bound_distance(Ci, Q);
4.      if LB_dist < best_so_far
5.          true_dist = DTW(Ci, Q);
6.          if true_dist < best_so_far
7.              best_so_far = true_dist;
8.              index_of_best_match = i;
9.          endif
10.     endif
11. endfor

Try to use the cheap lower-bounding calculation as often as possible. Only do the expensive, full calculation when it is absolutely necessary.

Lower Bound of Keogh

Build an envelope around the query Q using the allowed warping range r (the Sakoe-Chiba Band width):

U_i = max(q_{i-r} : q_{i+r})
L_i = min(q_{i-r} : q_{i+r})

The lower bound is the distance from the candidate C to the nearest face of the envelope:

LB_Keogh(Q, C) = sqrt( Σ_{i=1}^{n} { (c_i - U_i)² if c_i > U_i;  (c_i - L_i)² if c_i < L_i;  0 otherwise } )
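To make the preceding slides concrete, here is a minimal Python sketch of band-constrained DTW, following the γ recursion and the Sakoe-Chiba Band described above. This is my illustration, not the authors' code; the name dtw_band and the equal-length assumption are choices made for the sketch.

    import numpy as np

    def dtw_band(Q, C, r):
        """Constrained DTW via gamma(i,j) = d(q_i,c_j) + min(...),
        with a Sakoe-Chiba Band of half-width r (a minimal sketch)."""
        n = len(Q)
        assert len(C) == n, "this sketch assumes equal-length series"
        gamma = np.full((n + 1, n + 1), np.inf)
        gamma[0, 0] = 0.0
        for i in range(1, n + 1):
            # Only visit cells inside the band around the diagonal.
            for j in range(max(1, i - r), min(n, i + r) + 1):
                d = (Q[i - 1] - C[j - 1]) ** 2            # squared pointwise cost
                gamma[i, j] = d + min(gamma[i - 1, j - 1],  # match
                                      gamma[i - 1, j],      # insertion
                                      gamma[i, j - 1])      # deletion
        return np.sqrt(gamma[n, n])

For example, dtw_band(Q, C, r=int(0.1 * len(Q))) corresponds to the 10% Sakoe-Chiba Band that is common in the data mining literature.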
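A matching sketch of LB_Keogh and of the numbered Lower_Bounding_Sequential_Scan algorithm above; again this is illustrative rather than the talk's actual code (lb_keogh and lb_scan are assumed names, and lb_scan reuses dtw_band from the previous sketch).

    def lb_keogh(Q, C, r):
        """LB_Keogh: distance from candidate C to the envelope [L, U]
        built around query Q with warping range r (a minimal sketch)."""
        n = len(Q)
        total = 0.0
        for i in range(n):
            lo, hi = max(0, i - r), min(n, i + r + 1)
            U_i, L_i = max(Q[lo:hi]), min(Q[lo:hi])   # envelope around Q
            if C[i] > U_i:
                total += (C[i] - U_i) ** 2
            elif C[i] < L_i:
                total += (C[i] - L_i) ** 2            # contributes 0 otherwise
        return total ** 0.5

    def lb_scan(Q, database, r):
        """Lower_Bounding_Sequential_Scan: call full DTW only when the
        cheap lower bound cannot prune the candidate."""
        best_so_far, index_of_best_match = float("inf"), None
        for i, C in enumerate(database):
            if lb_keogh(Q, C, r) < best_so_far:       # cheap test first
                true_dist = dtw_band(Q, C, r)         # expensive, rarely needed
                if true_dist < best_so_far:
                    best_so_far, index_of_best_match = true_dist, i
        return index_of_best_match, best_so_far

Because lb_keogh never overestimates the true DTW distance, pruning on it can never discard the true nearest neighbor; the scan is exact, just faster.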
Important Note

The LB_Keogh lower bound only works for time series of the same length, and with constraints. However, we can always normalize the length of one of the time series.

[Figure: two time series of different lengths, before and after re-interpolating one to the length of the other.]

Popular Belief 1

The ability of DTW to handle sequences of different lengths is a great advantage, and therefore the simple lower bound that requires different-length sequences to be re-interpolated to equal lengths is of limited utility.

Examples:

• "Time warping enables sequences with similar patterns to be found even when they are of different lengths."
• "(DTW is) a more robust distance measure than Euclidean distance in many situations, where sequences may have different lengths."
• "(DTW) can be used to measure similarity between sequences of different lengths."

Popular Belief 2

Constraining the warping paths is a necessary evil that we inherited from the speech processing community to make DTW tractable, and that we should find ways to speed up DTW with no (or larger) constraints.

Examples:

• "LB_Keogh cannot be applied when the warping path is not constrained."
• "Search techniques for wide constraints are required."

Popular Belief 3

There is a need for (and room for) improvements in the speed of DTW for data mining applications.

Examples:

• "DTW incurs a heavy CPU cost."
• "DTW is limited to only small time series datasets."
• "(DTW's) quadratic cost makes its application on databases of long time series very expensive."
• "(DTW suffers from) serious performance degradation in large databases."

Popular Belief 1: Is this true?

These claims are surprising in that they are not supported by any empirical results in the papers in question. Furthermore, an extensive literature search through more than 500 papers dating back to the 1960s failed to produce any theoretical or empirical results to suggest that simply making the sequences have the same length has any detrimental effect. Let us test this.

A Simple Experiment I

For all datasets which naturally have sequences of different lengths, let us compare the 1-nearest-neighbor classification rate, for all possible warping constraints:

• After simply re-normalizing the lengths (a sketch of this step follows the results below).
• Using DTW's "wonderful" ability to support different-length queries.

The latter case has at least five "flavors"; to be fair, we try all of them and report only the best.

A Simple Experiment II

[Figure: 1-nearest-neighbor accuracy (%) vs. warping window size (%) for the Face, Leaf, and Trace datasets, for equal-length and variable-length DTW.]

A two-tailed t-test with a 0.05 significance level between each variable-length and equal-length pair indicates that there is no statistically significant difference between the accuracy of the two sets of experiments.
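For concreteness, here is a minimal sketch of the length re-normalization used above, assuming simple linear interpolation (the slides do not specify the interpolation scheme, and the function name is my own):

    import numpy as np

    def renormalize_length(series, target_len):
        """Re-interpolate a time series to target_len points by linear
        interpolation, so equal-length tools such as LB_Keogh apply."""
        series = np.asarray(series, dtype=float)
        old_x = np.linspace(0.0, 1.0, num=len(series))
        new_x = np.linspace(0.0, 1.0, num=target_len)
        return np.interp(new_x, old_x, series)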
Popular Belief 1 is a Myth!

The ability of DTW to handle sequences of different lengths is NOT a great advantage. So while Wong and Wong claim in IDEAS-03 that "DTW is useful to measure similarity between sequences of different lengths", we must recall that two Wongs don't make a right.

Popular Belief 2

Constraining the warping paths is a necessary evil that we inherited from the speech processing community to make DTW tractable, and that we should find ways to speed up DTW with no (or larger) constraints.

Is this true? The vast majority of data mining researchers have used a Sakoe-Chiba Band with a 10% width for the global constraint, but the last year has seen many papers that advocate wider constraints, or none at all.

A Simple Experiment

For all classification datasets, let us compare the 1-nearest-neighbor classification rate for all possible warping constraints. If popular belief 2 is correct, the accuracy should grow for wider constraints. In particular, the accuracy should get better for values greater than 10%.

[Figure: accuracy (%) vs. width of warping window W. Accuracy peaks at a narrow window and falls off as W grows.]

Warping width W that achieves maximum accuracy:

• FACE: 2%
• GUNX: 3%
• LEAF: 8%
• Control Chart: 4%
• TRACE: 3%
• 2-Patterns: 3%
• WordSpotting: 3%

Popular Belief 2 is a Myth!

Constraining the warping paths WILL give higher accuracy for classification/clustering/query by content. This result can be summarized by the Keogh-Ratanamahatana Maxim: "a little warping is a good thing, but too much warping is a bad thing".

Popular Belief 3

There is a need for (and room for) improvements in the speed of DTW for data mining applications.

Is this true? Do papers published since the introduction of LB_Keogh really speed up DTW data mining?

A Simple Experiment

Let's do some experiments! We will measure the average fraction of the n² matrix that we must calculate, for a one-nearest-neighbor search. We will do this for every possible value of W, the warping window width. By testing this way, we are deliberately ignoring implementation details like index structure, buffer size, etc. (A code sketch of this measurement follows the two plots below.)

This plot tells us that although DTW is O(n²), after we set the warping window for maximum accuracy for this problem, we only have to do 6% of the work, and if we use the LB_Keogh lower bound, we only have to do 0.3% of the work!

[Figure: Nuclear Trace dataset. Fraction of warping matrix accessed vs. warping window size (%), with no lower bound and with LB_Keogh; zoom-in near the maximum-accuracy window.]

The same experiment on a second dataset tells us that, at the maximum-accuracy window, we only have to do 6% of the work, and with the LB_Keogh lower bound, only 0.21% of the work!

[Figure: Gun dataset. Same experiment as above; zoom-in near the maximum-accuracy window.]
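A minimal sketch of how such a measurement can be set up, reusing dtw_band and lb_keogh from the earlier sketches. The cost accounting (pruned candidates cost nothing, unpruned candidates cost only the band cells) and the synthetic random-walk data are my own simplifications, not the talk's experimental setup:

    import numpy as np

    def fraction_of_matrix_accessed(database, r):
        """For a leave-one-out 1-NN search, estimate the fraction of
        warping-matrix cells actually computed, relative to n*n per pair."""
        n = len(database[0])
        # Cells inside a Sakoe-Chiba Band of half-width r.
        band_cells = sum(min(n, i + r) - max(1, i - r) + 1
                         for i in range(1, n + 1))
        computed = total = 0
        for qi, Q in enumerate(database):
            best = float("inf")
            for ci, C in enumerate(database):
                if ci == qi:
                    continue                      # leave-one-out
                total += n * n                    # the full O(n^2) matrix
                if lb_keogh(Q, C, r) < best:      # pruned candidates cost ~0
                    computed += band_cells        # unpruned: band cells only
                    best = min(best, dtw_band(Q, C, r))
        return computed / total

    # Illustrative run on synthetic random walks (not the talk's datasets).
    rng = np.random.default_rng(0)
    db = [np.cumsum(rng.normal(size=128)) for _ in range(50)]
    print(fraction_of_matrix_accessed(db, r=12))  # roughly a 10% band

On real data, as the next slides report, the measured fraction shrinks further as the database grows, because best_so_far becomes small quickly and the lower bound prunes more candidates.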
The results in the previous slides are pessimistic! As the size of the dataset gets larger, the lower bounds become more important and can prune a larger fraction of the data. From a similarity search/classification point of view, DTW is linear!

[Figure: Gun dataset. Fraction of warping matrix accessed vs. warping window size (%) for databases of 2, 6, 12, 24, 50, 100, and 200 instances; the fraction accessed shrinks steadily as the database grows.]

Let us consider larger datasets…

On a (still small, by data mining standards) dataset of 40,960 objects, just ten lines of code (LB_Keogh) eliminates 99.369% of the CPU effort!

[Figure: amortized percentage of the calculations required vs. size of the database (number of objects), with no lower bound and with LB_Keogh.]

Popular Belief 3 is a Myth!

There is NO need for (and NO room for) improvements in the speed of DTW for data mining applications. We are very close to the asymptotic limit of speedup for DTW. The time taken to search a terabyte of data is about the same for Euclidean distance or DTW.

Conclusions

We have shown that there is much misunderstanding about dynamic time warping, an important data mining tool. These misunderstandings have led to much wasted research effort, which is a pity, because there are several important DTW problems to be solved (see paper).

Are there other major misunderstandings about other data mining problems?

Questions?

All datasets and code used in this tutorial can be found at www.cs.ucr.edu/~eamonn/TSDMA/index.html