Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures

Hui Ding, Goce Trajcevski, Peter Scheuermann
Dept. of EECS, Northwestern University
hdi117,goce,[email protected]

Xiaoyue Wang, Eamonn Keogh
Dept. of CS, U. of California, Riverside
xwang,[email protected]

34th VLDB Conference, Auckland, New Zealand
August 26, 2008

Motivation and Summary of Findings

Motivation:
* Time series are ubiquitous
* Key aspects for achieving effectiveness and efficiency: representation methods and similarity measures
* Consolidate the large amount of existing research efforts
* We conducted the largest (by a huge margin) set of time series data mining experiments

Summary of findings:
* The tightness of lower bounding (and thus the pruning power and indexing effectiveness) of different representation methods for time series data, for the most part, makes very little difference on various data sets
* Elastic measures (e.g., DTW, LCSS, EDR and ERP) can achieve significantly lower classification error ratios than other measures
* With a large training data set, Euclidean distance is competitive with elastic measures such as DTW (thus, in most cases, getting more data helps more than fussing with distance measures)

Comparison of Time Series Representation Methods

* 8 representation methods: SAX, DFT, DWT, DCT, PAA, CHEB, APCA, IPLA
* Use tightness of lower bounds (TLB) as a metric for comparison: TLB = LowerBoundDist / TrueEuclideanDist
* The tightness of lower bounding (pruning power, effectiveness of the indexing) of different representation methods, for the most part, makes little difference on various data sets

[Figures: TLB on an ECG data set (foetal_ecg, excerpt), TLB on a bursty data set, TLB on a periodic data set; methods plotted: SAX, DCT, APCA, DFT, PAA/DWT, CHEB, IPLA]

Comparison of Time Series Similarity Measures - Findings

* Compared 9 similarity measures: Euclidean, L1, Linf, DISSIM, TQuEST, DTW, EDR, ERP, LCSS, Swale and SpADe on 38 diverse data sets
* Used 1-Nearest Neighbor classification to evaluate the accuracy of the underlying measures
* Used stratified cross-validation to minimize the impact of the class distribution of the data sets
* As training set size increases, Euclidean distance quickly becomes as effective as elastic measures (e.g., DTW, EDR)
* Edit-distance based measures are, for the most part, as effective as DTW (but require more effort for tuning). However, they are not vastly superior, as some have suggested
* Some measures (e.g., DISSIM, TQuEST) which were claimed to be vastly superior to simpler methods are in fact no better, or worse

Example: Impact of Training Data Set Size

[Figure: Euclidean vs. DTW on the CBF data set and the Two-Pat data set; y-axis: Out-of-Sample Error Rate; x-axis: Increasingly Large Training Sets]

If a large training set is available, Euclidean may be as good as DTW, and is the fastest one can get…

Visualizing Classification Accuracy Using Scatter Plot (1)

[Scatter plot: Euclidean distance vs. L1 norm and Linf norm]
[Scatter plot: DTW distance vs. Euclidean distance]

Visualizing Classification Accuracy Using Scatter Plot (2)

[Scatter plot: LCSS distance vs. Euclidean and DTW distance]
[Scatter plot: ERP distance vs. Euclidean and DTW distance]

Visualizing Classification Accuracy Using Scatter Plot (3)

[Scatter plot: DISSIM distance vs. Euclidean and DTW distance]
* It has been claimed that DISSIM “efficiently retrieves similar trajectories in cases where related work fails”
* However, on average it is no better than DTW

[Scatter plot: TQuEST distance vs. Euclidean and DTW distance]
* It has been claimed that “DTW is the only competitor that achieves roughly similar accuracy (to TQuEST)”
* However, DTW and even Euclidean distance are significantly better than TQuEST on average

Visualizing Classification Accuracy Using Scatter Plot (4)

* Both SpADe and Swale have been proposed as being significantly better than Euclidean distance and DTW
* However, they are both about as good as Euclidean distance on average (shown to the left), and slightly worse than DTW on average

Conclusions & Future Work

* We attempted to consolidate existing work on representation methods and similarity measures for time series data
* Future extensions include:
  * Conducting statistical analysis to investigate relationships among different similarity measures and presenting a correlation-based comparison
  * Investigating (meta) properties of the data sets that could yield favorable effectiveness of some (or other) similarity measure
  * Anything else you suggest!
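
The TLB metric used in the representation comparison, TLB = LowerBoundDist / TrueEuclideanDist, can be illustrated with a minimal sketch for one of the eight methods, PAA. This is not the code used in the experiments: the function names (`euclidean`, `paa`, `paa_lower_bound`, `tlb`, `random_walk`) are hypothetical, the random-walk input is a synthetic stand-in for the real data sets, and the series length and segment counts in the demo are illustrative choices only. The sketch assumes the series length is divisible by the number of segments.

```python
import math
import random


def euclidean(a, b):
    """True Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    if len(series) % segments != 0:
        # Simplifying assumption of this sketch, not a limit of PAA itself.
        raise ValueError("series length must be divisible by segments")
    width = len(series) // segments
    return [sum(series[i * width:(i + 1) * width]) / width
            for i in range(segments)]


def paa_lower_bound(a, b, segments):
    """Distance on PAA coefficients, scaled by sqrt(segment width).

    By the Cauchy-Schwarz inequality this never exceeds euclidean(a, b).
    """
    width = len(a) // segments
    return math.sqrt(width) * euclidean(paa(a, segments), paa(b, segments))


def tlb(a, b, segments):
    """TLB = LowerBoundDist / TrueEuclideanDist (1.0 means a perfect bound)."""
    true_dist = euclidean(a, b)
    if true_dist == 0.0:
        return 1.0  # identical series: the bound is trivially tight
    return paa_lower_bound(a, b, segments) / true_dist


def random_walk(length, rng):
    """Synthetic series (hypothetical stand-in for a real data set)."""
    value, walk = 0.0, []
    for _ in range(length):
        value += rng.gauss(0.0, 1.0)
        walk.append(value)
    return walk


if __name__ == "__main__":
    rng = random.Random(42)
    pairs = [(random_walk(480, rng), random_walk(480, rng)) for _ in range(50)]
    for segments in (4, 6, 8, 10):
        mean_tlb = sum(tlb(a, b, segments) for a, b in pairs) / len(pairs)
        print(f"segments={segments:2d}  mean TLB={mean_tlb:.3f}")
```

A TLB close to 1 means the reduced representation prunes almost as well as the full Euclidean distance would. Averaging over many pairs, as in the demo loop, is one simple way to compare representations at a given number of coefficients.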