Fair Use Agreement
This agreement covers the use of all slides on this CD-ROM; please read carefully.
• You may freely use these slides for teaching, if:
  • You send me an email telling me the class number/university in advance.
  • My name and email address appear on the first slide (if you are using all or most of the slides), or on each slide (if you are just taking a few slides).
• You may freely use these slides for a conference presentation, if:
  • You send me an email telling me the conference name in advance.
  • My name appears on each slide you use.
• You may not use these slides for tutorials, or in a published work (tech report/conference paper/thesis/journal, etc.). If you wish to do this, email me first; it is highly likely I will grant you permission.
(c) Eamonn Keogh, [email protected]
Everything you know about
Dynamic Time Warping is Wrong
Chotirat Ann Ratanamahatana
Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
[email protected]
Eamonn Keogh
Outline of Talk
• Introduction to dynamic time warping (DTW)
• Why DTW is important
• Introduction/review of the LB_Keogh solution
• Three popular beliefs about DTW
• Why popular belief 1 is wrong
• Why popular belief 2 is wrong
• Why popular belief 3 is wrong
• Conclusions
Here is a simple visual example to help you develop an intuition for DTW. We are looking at nuclear power data.

[Figure: the same two nuclear power time series aligned under Euclidean distance and under Dynamic Time Warping; the DTW alignment is excellent]
Let us compare Euclidean Distance and DTW on some problems
[Figure: sample time series from the test datasets: Leaves, Faces, Gun, Trace, Control, Sign language, 2-Patterns, Word Spotting]
Results: Error Rate
Dataset          Euclidean    DTW
Word Spotting    4.78         1.10
Sign language    28.70        25.93
GUN              5.50         1.00
Nuclear Trace    11.00        0.00
Leaves           33.26        4.07
Faces (4)        6.25         2.68
Control Chart    7.5          0.33
2-Patterns       1.04         0.00

Using 1-nearest-neighbor, leave-one-out evaluation!
Every possible warping between two time series is a path through the matrix. We want the best one…
How is DTW Calculated?

[Figure: two time series Q and C and the warping matrix between them; every possible warping is a path through the matrix, the warping path w]

$DTW(Q,C) = \min\left\{ \sqrt{\textstyle\sum_{k=1}^{K} w_k} \Big/ K \right\}$

where $w_k$ is the $k$-th element of the warping path $w$, and the $K$ in the denominator compensates for warping paths of different lengths.

This recursive function gives us the minimum cost path:

$\gamma(i,j) = d(q_i, c_j) + \min\{\, \gamma(i-1,j-1),\ \gamma(i-1,j),\ \gamma(i,j-1) \,\}$
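To make the recursion concrete, here is a minimal Python sketch of the dynamic-programming computation (the function and variable names are mine, not from the talk; the /K path-length normalization in the formula above is omitted for simplicity):

```python
import numpy as np

def dtw(q, c):
    """Minimal DTW sketch: fill the cumulative-cost matrix gamma using
    gamma(i,j) = d(q_i, c_j) + min of the three neighboring cells."""
    n, m = len(q), len(c)
    gamma = np.full((n + 1, m + 1), np.inf)  # pad with an "infinite" border
    gamma[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2          # squared point-wise cost
            gamma[i, j] = d + min(gamma[i - 1, j - 1],   # match
                                  gamma[i - 1, j],       # shift in q
                                  gamma[i, j - 1])       # shift in c
    return np.sqrt(gamma[n, m])
```

Note that nothing in the sketch requires q and c to have the same length, which is the point made on the next slide.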
Important note
The time series can be of different lengths.

[Figure: a warping path w between two time series Q and C of different lengths]
Global Constraints
• Slightly speed up the calculations
• Prevent pathological warpings
[Figure: the Sakoe-Chiba Band global constraint on the warping matrix of Q and C]
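A Sakoe-Chiba Band of half-width r is a small change to the sketch above: restrict the inner loop so that only cells within r of the diagonal are filled (again a sketch with names of my own; it assumes the two series have equal, or nearly equal, lengths):

```python
import numpy as np

def dtw_band(q, c, r):
    """DTW restricted to a Sakoe-Chiba Band of half-width r cells."""
    n, m = len(q), len(c)
    gamma = np.full((n + 1, m + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = max(1, i - r), min(m, i + r)  # only cells inside the band
        for j in range(lo, hi + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            gamma[i, j] = d + min(gamma[i - 1, j - 1],
                                  gamma[i - 1, j],
                                  gamma[i, j - 1])
    return np.sqrt(gamma[n, m])
```

Besides preventing pathological warpings, the band means only about 2rn of the n² cells are ever touched, which is the modest speed-up mentioned above.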
In general, it’s hard to speed up a single DTW calculation
However, if we have to make many DTW calculations (which is almost always the case), we can potentially speed up the whole process by lower bounding.

Keep in mind that the lower-bounding trick works for any situation where you have an expensive calculation that can be lower bounded (string edit distance, graph edit distance, etc.).

I will explain how lower bounding works in a generic fashion in the next two slides, then show concretely how lower bounding makes dealing with massive time series under DTW possible…
Lower Bounding I
Assume that we have two functions:
• DTW(A,B): the true DTW function, which is very slow…
• lower_bound_distance(A,B): the lower bound function, which is very fast…

By definition, for all A and B, we have:

lower_bound_distance(A,B) ≤ DTW(A,B)
Lower Bounding II
We can speed up similarity search under DTW
by using a lower bounding function
Algorithm Lower_Bounding_Sequential_Scan(Q)
    best_so_far = infinity;
    for all sequences Ci in database
        LB_dist = lower_bound_distance(Ci, Q);
        if LB_dist < best_so_far
            true_dist = DTW(Ci, Q);
            if true_dist < best_so_far
                best_so_far = true_dist;
                index_of_best_match = i;
            endif
        endif
    endfor

Try to use the cheap lower-bounding calculation as often as possible. Only do the expensive, full calculation when it is absolutely necessary.
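A direct Python transcription of this scan might look like the following (a sketch; dtw and lower_bound_distance stand for any pair of functions satisfying the lower-bounding property above):

```python
import math

def lower_bounding_sequential_scan(query, database, dtw, lower_bound_distance):
    """Nearest-neighbor search under DTW, skipping the expensive DTW
    call whenever the cheap lower bound already exceeds the best so far."""
    best_so_far = math.inf
    index_of_best_match = None
    for i, candidate in enumerate(database):
        lb_dist = lower_bound_distance(candidate, query)  # cheap
        if lb_dist < best_so_far:                         # only now pay for DTW
            true_dist = dtw(candidate, query)
            if true_dist < best_so_far:
                best_so_far = true_dist
                index_of_best_match = i
    return index_of_best_match, best_so_far
```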
Lower Bound of Keogh
[Figure: the query Q enclosed in its upper and lower envelopes U and L, whose width is set by the Sakoe-Chiba Band; the candidate C is compared against the envelope]

$U_i = \max(q_{i-r} : q_{i+r})$
$L_i = \min(q_{i-r} : q_{i+r})$

$LB\_Keogh(Q,C) = \sqrt{ \sum_{i=1}^{n} \begin{cases} (c_i - U_i)^2 & \text{if } c_i > U_i \\ (c_i - L_i)^2 & \text{if } c_i < L_i \\ 0 & \text{otherwise} \end{cases} }$
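In code, the envelope and the bound take only a few lines (the "ten lines of code" mentioned later in this talk). A sketch, assuming equal-length NumPy arrays and a band half-width r; the names are mine:

```python
import numpy as np

def lb_keogh(q, c, r):
    """LB_Keogh: build the band envelope around q, then sum the squared
    amounts by which c escapes the envelope. Assumes len(q) == len(c)."""
    n = len(q)
    U = np.array([q[max(0, i - r): i + r + 1].max() for i in range(n)])  # upper envelope
    L = np.array([q[max(0, i - r): i + r + 1].min() for i in range(n)])  # lower envelope
    above = np.where(c > U, (c - U) ** 2, 0.0)  # c pokes above the envelope
    below = np.where(c < L, (c - L) ** 2, 0.0)  # c dips below the envelope
    return np.sqrt(np.sum(above + below))
```

By construction lb_keogh(q, c, r) ≤ dtw_band(q, c, r), so it can be plugged straight into the sequential scan above.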
Important Note
The LB_Keogh lower bound only works for time
series of the same length, and with constraints.
However, we can always normalize the length of one of the time series
[Figure: reinterpolating C so that Q and C have the same length]
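Re-normalizing the length is a one-liner with linear interpolation; a sketch using NumPy (np.interp resamples c onto evenly spaced points; the function name is mine):

```python
import numpy as np

def reinterpolate(c, target_len):
    """Linearly resample time series c to target_len points."""
    old_x = np.linspace(0.0, 1.0, num=len(c))
    new_x = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(new_x, old_x, np.asarray(c, dtype=float))
```

For example, reinterpolate(c, len(q)) makes C the same length as Q before applying LB_Keogh.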
Popular Belief 1
The ability of DTW to handle sequences of
different lengths is a great advantage, and
therefore the simple lower bound that
requires different length sequences to be
reinterpolated to equal lengths is of limited
utility.
Examples
“Time warping enables sequences with similar patterns to be found
even when they are of different lengths”
“ (DTW is) a more robust distance measure than Euclidean distance
in many situations, where sequences may have different lengths”
“(DTW) can be used to measure similarity between sequences of
different lengths”
Popular Belief 2
Constraining the warping paths is a necessary evil
that we inherited from the speech processing
community to make DTW tractable, and that we
should find ways to speed up DTW with no (or
larger) constraints.
Examples
“LB_Keogh cannot be applied when the
warping path is not constrained”.
“search techniques for wide constraints are
required”
Popular Belief 3
There is a need for (and room for) improvements
in the speed of DTW for data mining applications.
Examples
• “DTW incurs a heavy CPU cost”
• “DTW is limited to only small time series datasets”
• “(DTW) quadratic cost makes its application on databases of long time series very expensive”
• “(DTW suffers from) serious performance degradation in large databases”
Popular Belief 1
The ability of DTW to handle sequences of
different lengths is a great advantage, and
therefore the simple lower bound that requires
different length sequences to be reinterpolated to
equal lengths is of limited utility.
Is this true?
These claims are surprising in that they are not supported by any
empirical results in the papers in question. Furthermore, an extensive
literature search through more than 500 papers dating back to the
1960’s failed to produce any theoretical or empirical results to
suggest that simply making the sequences have the same length has
any detrimental effect. Let us test this.
A Simple Experiment I
For all datasets that naturally have different lengths, let us compare the 1-nearest-neighbor classification rate, for all possible warping constraints:
• After simply re-normalizing the lengths.
• Using DTW's “wonderful” ability to support sequences of different lengths.
The latter case has at least five “flavors”; to be fair, we try all and report only the best.
A Simple Experiment II
[Figure: 1-nearest-neighbor accuracy (%) vs. warping window size (%) on the Face, Leaf, and Trace datasets, comparing the variable-length and re-normalized equal-length versions of each dataset]
A two-tailed t-test with a 0.05 significance level between each variable-length and equal-length pair indicates that there is no statistically significant difference between the accuracy of the two sets of experiments.
Popular Belief 1 is a Myth!
The ability of DTW to handle sequences of different lengths is NOT a great advantage.
So while Wong and Wong claim in IDEAS-03
“DTW is useful to measure similarity between
sequences of different lengths”, we must recall that
two Wongs don’t make a right.
Popular Belief 2
Constraining the warping paths is a necessary evil
that we inherited from the speech processing
community to make DTW tractable, and that we
should find ways to speed up DTW with no (or
larger) constraints.
Is this true?
The vast majority of data mining researchers have used a Sakoe-Chiba Band with a 10% width for the global constraint, but the last year has seen many papers that advocate wider constraints, or none at all.
A Simple Experiment
For all classification datasets, let us compare the 1-nearest-neighbor classification rate, for all possible warping constraints.
If Popular Belief 2 is correct, the accuracy should grow for wider constraints. In particular, the accuracy should get better for values greater than 10%.
Accuracy vs. Width of Warping Window
[Figure: 1-nearest-neighbor accuracy (%) vs. warping window width W (%), from 1% to 100%, for seven datasets]

Warping width that achieves maximum accuracy (W: warping width):

Dataset         W
FACE            2%
GUNX            3%
LEAF            8%
Control Chart   4%
TRACE           3%
2-Patterns      3%
WordSpotting    3%
Popular Belief 2 is a myth!
Constraining the warping paths WILL give higher
accuracy for classification/clustering/query by
content.
This result can be summarized by the Keogh-Ratanamahatana Maxim:
“a little warping is a good thing, but too
much warping is a bad thing”.
Popular Belief 3
There is a need for (and room for) improvements
in the speed of DTW for data mining applications.
Is this true?
Do papers published since the introduction of LB_Keogh really
speed up DTW data mining?
A Simple Experiment
Let's do some experiments!
We will measure the average fraction of the n² matrix that we must calculate, for a one-nearest-neighbor search. We will do this for every possible value of W, the warping window width.
By testing this way, we are deliberately ignoring implementation details, like index structure, buffer size, etc…
Fraction of warping matrix accessed
This plot tells us that although DTW is O(n²), after we set the warping window for maximum accuracy for this problem, we only have to do 6% of the work, and if we use the LB_Keogh lower bound, we only have to do 0.3% of the work!

[Figure: fraction of the warping matrix accessed vs. warping window size (%) on the Nuclear Trace dataset, with no lower bound and with LB_Keogh; the zoom-in marks the maximum-accuracy window]
Fraction of warping matrix accessed
This plot tells us that although DTW is O(n²), after we set the warping window for maximum accuracy for this problem, we only have to do 6% of the work, and if we use the LB_Keogh lower bound, we only have to do 0.21% of the work!

[Figure: fraction of the warping matrix accessed vs. warping window size (%) on the Gun dataset, with no lower bound and with LB_Keogh; the zoom-in marks the maximum-accuracy window]
The results in the previous slides are pessimistic! As the size of the dataset gets larger, the lower bounds become more important and can prune a larger fraction of the data. From a similarity search/classification point of view, DTW is linear!

[Figure: fraction of the warping matrix accessed vs. warping window size (%) on the Gun dataset, for databases of 2, 6, 12, 24, 50, 100, and 200 instances, with a zoom-in at the maximum-accuracy window]
Let us consider larger datasets… On a (still small, by data mining standards) dataset of 40,960 objects, just ten lines of code (LB_Keogh) eliminates 99.369% of the CPU effort!

[Figure: amortized percentage of the calculations required vs. size of database (number of objects), with no lower bound and with LB_Keogh]
Popular Belief 3 is a Myth
There is NO need for (and NO room for)
improvements in the speed of DTW for data mining
applications.
We are very close to the asymptotic limit of speed-up for DTW. The time taken to search a terabyte of data is about the same for Euclidean Distance or DTW.
Conclusions
We have shown that there is much misunderstanding
about dynamic time warping, an important data
mining tool.
These misunderstandings have led to much wasted
research effort, which is a pity, because there are
several important DTW problems to be solved (see
paper).
Are there other major misunderstandings about
other data mining problems?
Questions?
All datasets and code used in this tutorial can be found at
www.cs.ucr.edu/~eamonn/TSDMA/index.html