Disk Aware Discord Discovery:
Finding Unusual Time Series in Terabyte Sized
Datasets
Dragomir Yankov, Eamonn Keogh – Computer Science & Eng. Dept., University of California, Riverside
Umaa Rebbapragada – Dept. of Computer Science, Tufts University
Best paper winner: ICDM 2007
Time Series Data Mining Group
Outline
• What inspired the current work
• The time series discord detection problem
• An efficient algorithm for mining disk resident discords
– Detecting range-based discords
– Detecting the top k discords
• Experimental results
– Evaluating the effectiveness of the discord definition
– Scalability of the discord detection algorithm
A motivating example
• Myriads of telescopes around the world constantly record valuable astronomical data, e.g. star light-curves
• A light-curve is a real-valued time series of light magnitude measurements derived from telescopic images
[Image: Eclipsed binary: Sirius A&B. Credit: Chandra X-ray observatory]
[Movie: by kind permission of Prof. Richard W. Pogge, OSU]
A motivating example (cont)
• The American Association of Variable Star Observers has
a database of over 10.5 million variable star brightness
measurements going back over ninety years
• Over 400,000 new variable star brightness
measurements are added to the database every year
• Many of the observations are noisy or are preprocessed
inaccurately prior to storing
• Efficient, unsupervised methods for cleaning the data are
required
A motivating example (cont)
• Data are inherently non-convex and hard to model
probabilistically.
• Anomalies should be defined with respect to the non-linear manifolds defined by the light-curve time series (true for many time series datasets)
Definitions and assumptions
• Notation
– time series: $T = (t_1, \ldots, t_m)$
– subsequence: $C_i = (t_p, \ldots, t_{p+n-1})$, where $n \le m$ and $1 \le p \le m - n + 1$
– time series database: $S = \{C_i\}$
[Figure: Nasdaq Composite (Oct06-Oct07) as the time series T, with subsequences $C_{i_1}, C_{i_2}, \ldots, C_{i_j}$ of length n marked]
• Function $Dist(C_i, C_j)$ (may not be a metric) defines an ordering for the elements in S
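As a minimal sketch of this notation, assuming T is held in a NumPy array (the helper name `subsequences` is hypothetical), all m - n + 1 subsequences of length n can be enumerated with a sliding window:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def subsequences(T, n):
    """Return a 2-D array whose row p-1 is the subsequence (t_p, ..., t_{p+n-1}).

    The result has m - n + 1 rows, matching 1 <= p <= m - n + 1.
    Requires NumPy >= 1.20 for sliding_window_view.
    """
    T = np.asarray(T, dtype=float)
    return sliding_window_view(T, n)


# Toy series with m = 6 and n = 3, so there are 6 - 3 + 1 = 4 subsequences.
S = subsequences([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], 3)
print(S.shape)
```

The rows are read-only views into T, so no data is copied until a subsequence is modified or normalized.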
Time series discords
• Most-significant discord – the subsequence $C_i \in S$ with maximal distance $Dist(C_i, C_j)$ to its nearest neighbor $C_j \in S$
[Figure: the discord $C_i$ and its nearest neighbor $C_j$, separated by $Dist(C_i, C_j)$]
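The definition can be illustrated with a brute-force search. This is only a sketch for small in-memory sets, assuming Euclidean distance as Dist (the slides allow any distance function) and using the hypothetical name `most_significant_discord`:

```python
import numpy as np


def euclidean(a, b):
    """Euclidean distance, used here as one possible choice of Dist."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))


def most_significant_discord(S, dist=euclidean):
    """Return (index, nn_distance) of the element farthest from its nearest neighbor.

    Quadratic in |S|; S must contain at least two elements.
    """
    best_idx, best_nn = -1, -np.inf
    for i, ci in enumerate(S):
        # Nearest-neighbor distance of ci, excluding ci itself.
        nn = min(dist(ci, cj) for j, cj in enumerate(S) if j != i)
        if nn > best_nn:
            best_idx, best_nn = i, nn
    return best_idx, best_nn


# Three points close together and one far away from all of them.
S = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]]
idx, nn = most_significant_discord(S)
print(idx, nn)
```

Every pair of elements is compared, which is the quadratic cost that a disk-resident dataset cannot afford.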
Generalized discord definitions
• Most-significant k-th NN discord – the subsequence $C_i \in S$ with maximal distance $Dist(C_i, C_j)$ to its k-th nearest neighbor $C_j \in S$
[Figure: $C_i$ and its k-th nearest neighbor $C_j$ for $k = 2$]
Generalized discord definitions
• Most-significant k-NN discord – the subsequence $C_i \in S$ with maximal distance to its k nearest neighbors in S
[Figure: $C_i$ and its k nearest neighbors $C_j$ for $k = 2$]
The algorithm utilizes the first of these discord definitions for its computational efficiency and intuitive interpretation
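The two generalized scores can be compared on a toy example. The sketch below assumes Euclidean distance and, because the slide does not say how the distances to the k nearest neighbors are combined, aggregates them by their mean; the function name `knn_discord_scores` is hypothetical.

```python
import numpy as np


def knn_discord_scores(S, k):
    """For each element return (distance to its k-th NN, mean distance to its k NNs).

    Assumes 1 <= k <= len(S) - 1. Builds the full pairwise distance matrix,
    so it is only suitable for small in-memory sets.
    """
    X = np.asarray(S, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)          # an element is not its own neighbor
    nearest = np.sort(D, axis=1)[:, :k]  # the k smallest distances in each row
    return nearest[:, -1], nearest.mean(axis=1)


# Two unusual points (10 and 11) that sit next to each other.
S = [[0.0], [1.0], [2.0], [10.0], [11.0]]
kth, avg = knn_discord_scores(S, k=2)
print(int(np.argmax(kth)), int(np.argmax(avg)))
```

In this toy set every element has a nearest neighbor at distance 1, so the plain nearest-neighbor definition cannot separate the two unusual points, while both k = 2 scores can.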
Disk aware discord detection
• Detecting discords is harder than finding similar patterns
– anytime algorithms can quickly detect similarities
– anomalies require $O(|S|^2)$ computation time
• Indexing is not a solution
– time series are high dimensional
– dimensionality reduction is often inadequate
– a linear scan is faster than random disk accesses to 10% of the data
We are looking for an algorithm that performs two disk scans and an “approximately linear” number of computations
Discord detection algorithm
• Phase 1 – candidates selection phase
– r is the discord range; $S = (C_1, C_2, C_3, C_4, C_5, \ldots)$ is scanned once while the candidate set C is built
– $C_1$: C is empty, so $C_1$ is added; $C = \{C_1\}$
– $C_2$: $Dist(C_2, C_1) \ge r$, so $C_2$ is added; $C = \{C_1, C_2\}$
– $C_3$: $Dist(C_3, C_1) < r$, so $C_1$ is removed and $C_3$ is not added; $C = \{C_2\}$
– $C_4$: $Dist(C_4, C_2) \ge r$, so $C_4$ is added; $C = \{C_2, C_4\}$
– $C_5$: $Dist(C_5, C_2) \ge r$ and $Dist(C_5, C_4) \ge r$, so $C_5$ is added; $C = \{C_2, C_4, C_5\}$
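The candidate selection pass can be sketched as follows. This is a simplified in-memory illustration, not the authors' code: it assumes Euclidean distance, represents the disk scan as an iterable of (id, series) pairs, and uses the hypothetical name `select_candidates`. The toy values are made up so that the run mirrors a five-element walkthrough with elements C1 to C5.

```python
import numpy as np


def dist(a, b):
    """Euclidean distance, used here as one possible choice of Dist."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))


def select_candidates(scan, r):
    """Phase 1: one sequential pass over (id, series) pairs.

    Returns a dict {id: series} of candidates. Any candidate found closer
    than r to the current element is dropped, and the current element is
    kept only if no such candidate exists.
    """
    candidates = {}
    for sid, s in scan:
        is_candidate = True
        for cid in list(candidates):  # copy the keys: entries are deleted below
            if dist(s, candidates[cid]) < r:
                del candidates[cid]
                is_candidate = False
        if is_candidate:
            candidates[sid] = s
    return candidates


# Toy "disk" of five one-point series, scanned in order.
scan = [("C1", [0.0]), ("C2", [10.0]), ("C3", [0.9]), ("C4", [20.0]), ("C5", [1.8])]
print(sorted(select_candidates(scan, r=1.0)))
```

Here C5 survives the pass even though C3 lies within r of it, because C3 was never stored as a candidate. Such false positives are what a second scan has to remove.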
Discord detection algorithm
• Phase 2 – candidates refinement phase
– r is the discord range; S is scanned a second time and each element is compared against the candidates left by phase 1, $C = \{C_2, C_4, C_5\}$
– $C_1$: is $Dist(C_1, C_j) \ge r$ for every $C_j \in C$? Yes, so C is unchanged
– $C_3$: $Dist(C_3, C_5) < r$, so $C_5$ is removed; $C = \{C_2, C_4\}$
• Upon completion sort the candidates list C
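A sketch of the refinement pass, under the same assumptions as before (Euclidean distance, the scan modeled as an iterable of (id, series) pairs, hypothetical name `refine_candidates`). Sorting by decreasing nearest-neighbor distance is an assumption about what the final sort orders by.

```python
import numpy as np


def dist(a, b):
    """Euclidean distance, used here as one possible choice of Dist."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))


def refine_candidates(scan, candidates, r):
    """Phase 2: second pass over (id, series) pairs.

    Drops every candidate that has a neighbor closer than r and tracks the
    nearest-neighbor distance of the survivors. Returns a list of
    (id, nn_distance) sorted by decreasing nn_distance.
    """
    nn = {cid: np.inf for cid in candidates}
    for sid, s in scan:
        for cid in list(nn):  # copy the keys: entries are deleted below
            if cid == sid:
                continue      # skip the trivial self-match
            d = dist(s, candidates[cid])
            if d < r:
                del nn[cid]   # false positive from the selection pass
            elif d < nn[cid]:
                nn[cid] = d
    return sorted(nn.items(), key=lambda kv: kv[1], reverse=True)


# Same toy data as a five-element walkthrough; candidates as left by a first pass.
scan = [("C1", [0.0]), ("C2", [10.0]), ("C3", [0.9]), ("C4", [20.0]), ("C5", [1.8])]
candidates = {"C2": [10.0], "C4": [20.0], "C5": [1.8]}
discords = refine_candidates(scan, candidates, r=1.0)
print(discords)
```

Only the candidate list has to stay in memory during this pass; the data itself is read sequentially once more.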
Correctness of the algorithm
• The candidates set C contains all discords at
distance at least r from their NN, plus some other
elements
• The refinement phase removes from C all false
positives, and no real discord is pruned
• Correctness: the range discord algorithm detects all
discords and only the discords with respect to the
specified range r
Finding a good range parameter
• Selecting too large an r may result in an empty discord set, while too small an r can render the algorithm inefficient
• Computing the nearest neighbor distance distribution (NNDD) is expensive
• NNDD depends on the number of examples in the data
Approximating NNDD
• Intuition – though the relative volume in the upper tail decreases, the absolute number of discords cut by r remains sufficient when adding more data
• Detecting the top k discords
1. Select a uniformly random sample $S' \subset S$
2. Compute the top k discords in $S'$
3. Order their NN distances as: $d_1 \ge d_2 \ge \ldots \ge d_k$
4. Set $r = d_k$
5. Run the disk aware algorithm with range parameter r
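Steps 1 to 4 can be sketched as below, assuming Euclidean distance, a sample small enough for a brute-force search, and equal-length series stored as rows of an array. The name `estimate_range` and its parameters are hypothetical.

```python
import numpy as np


def estimate_range(S, k, sample_size, seed=0):
    """Estimate the range parameter r from a uniform random sample of S.

    Assumes 1 <= k <= sample_size and 2 <= sample_size <= len(S).
    """
    X = np.asarray(S, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: uniform sample without replacement.
    idx = rng.choice(len(X), size=sample_size, replace=False)
    sample = X[idx]
    # Step 2: NN distance of every sampled element within the sample.
    D = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nn = D.min(axis=1)
    # Steps 3 and 4: d_1 >= ... >= d_k are the k largest NN distances; r = d_k.
    top_k = np.sort(nn)[::-1][:k]
    return float(top_k[-1])


S = [[0.0], [1.0], [2.0], [10.0], [20.0]]
r = estimate_range(S, k=2, sample_size=5)
print(r)
```

Each sampled element's NN distance within $S'$ is at least its NN distance within S, so $r = d_k$ may turn out too large and step 5 may return fewer than k discords; in that case r would have to be lowered and the scan repeated.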
Experimental evaluation
We performed two sets of experiments
1. Experiments showing the utility of the time series discord
definition
2. Experiments showing the scalability of the disk aware
discord detection algorithm
Experimental evaluation – utility of the discord definition
• Star light-curve data from the Optical Gravitational Lensing Experiment (OGLE)
• Three classes of light-curves
– Eclipsed binaries
– Cepheids
– RR Lyrae variables
[Figure: the top two discords in each class, shown next to typical examples]
Experimental evaluation – utility of the discord definition
• MSN web queries made in 2002
[Figure: patterns dominated by a weekly cycle; anticipated bursts]
• The most significant discord using rotation invariant Euclidean distance
[Figure: periodicity 29.5 days – the length of a synodic month]
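The slide does not define the rotation invariant Euclidean distance. A common reading, assumed here, is the minimum Euclidean distance over all circular shifts of one of the two series; the function name is hypothetical.

```python
import numpy as np


def rotation_invariant_euclidean(a, b):
    """Minimum Euclidean distance between a and any circular shift of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("expected two 1-D series of equal length")
    # Try every circular shift of b and keep the smallest distance.
    return float(min(np.linalg.norm(a - np.roll(b, s)) for s in range(len(a))))


a = [1.0, 2.0, 3.0, 4.0]
b = [3.0, 4.0, 1.0, 2.0]  # the same pattern, shifted by two positions
print(rotation_invariant_euclidean(a, b))
```

Under such a distance two series with the same periodic shape but a different phase are treated as identical, so a discord has to differ in shape or period rather than merely in phase.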
Experimental evaluation – utility of the discord definition
• Anomaly detection in video sequences (multivariate data): our method achieves 100% accuracy on the planted anomalous trajectories
• Adapting the method as a data cleaning procedure
[Figure: the top one discord shown with only one of the existing clusters]
Experimental evaluation – utility of the discord definition
• Population growth data – we studied the growth rate of 206 countries for the last 25 years, looking for the most dramatic 5 year event
[Figure: the top 2 discords with a set of 10 representative countries for contrast]
Experimental evaluation – scalability of the disk aware algorithm
• We generated 3 data sets of size up to 0.35Tb of random walk time series
• Six non-random walk time series were planted, we looked for the top 10 discords
[Figure: two of the planted series (top) were among the top 10 discords]
• Time efficiency on the three random walk data sets:

Examples       Disk size   I/O time   Total time
1 million      3.57 Gb     27min      41min
10 million     35.7 Gb     4h 30min   7h 52min
100 million    0.35 Tb     45h        90h 33min
Experimental evaluation – scalability of the disk aware algorithm
• Time efficiency (Heterogeneous data):

Examples       Disk size   I/O time   Total time
1.2 million    1.17 Gb     15min      16min

• Main memory requirement for different thresholds
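Experimental evaluation – scalability of the disk aware algorithm
• Parallelizing the algorithm (m computers):
– Candidate selection phase: S is split into $S_1, S_2, \ldots, S_m$; computer i scans $S_i$ and produces $C_i$; the candidate set is $C = \bigcup_{i=1}^{m} C_i$
– Candidate refinement phase: computer i refines C against $S_i$ and produces a reduced $C_i$; the final discord set is $C = \bigcap_{i=1}^{m} C_i$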
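The parallel scheme can be simulated sequentially. The sketch assumes Euclidean distance, a round-robin split of S into m partitions, a union of the per-partition candidate sets after selection, and an intersection of the per-partition survivors after refinement. All function names are hypothetical, and each loop over partitions stands in for work that would run on a separate computer.

```python
import numpy as np


def dist(a, b):
    """Euclidean distance, used here as one possible choice of Dist."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))


def select(part, r):
    """Candidate selection on one partition: [(id, series)] -> {id: series}."""
    cands = {}
    for sid, s in part:
        keep = True
        for cid in list(cands):
            if dist(s, cands[cid]) < r:
                del cands[cid]
                keep = False
        if keep:
            cands[sid] = s
    return cands


def refine(part, cands, r):
    """Refinement on one partition: ids of candidates with no neighbor closer than r in it."""
    alive = set(cands)
    for sid, s in part:
        for cid in list(alive):
            if cid != sid and dist(s, cands[cid]) < r:
                alive.discard(cid)
    return alive


def parallel_discords(S, r, m):
    """Simulate m computers; returns the sorted indices of the range discords."""
    items = list(enumerate(S))
    parts = [items[i::m] for i in range(m)]  # S_1, ..., S_m
    C = {}
    for part in parts:
        C.update(select(part, r))            # C = union of the C_i
    survivors = [refine(part, C, r) for part in parts]
    return sorted(set.intersection(*survivors))


S = [[0.0], [10.0], [0.9], [20.0], [1.8]]
print(parallel_discords(S, r=1.0, m=2))
```

A candidate is compared only against its own partition during selection, so fewer false positives are pruned early and the combined candidate set grows with m.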
Experimental evaluation – scalability of the disk aware algorithm
• Parallelizing the algorithm (dataset: one million random walks):
[Figure: the runtime overhead for 8 computers is approximately 30%. This is due to the increased candidate set size |C| at the end of phase 1]
Conclusion
• Discords provide an effective definition of rare time series patterns.
• The presented disk aware algorithm meets all the requirements of a good off-the-shelf data mining tool:
– The results are interpretable
– It is extremely efficient and largely scalable
– It is very easy to implement (“8 lines in Matlab”)
• It allows for straightforward parallel and online extensions
Acknowledgements
• We would like to thank:
– Dr. Pavlos Protopapas (Harvard University) – light-curve dataset
– Dr. Michail Vlachos (IBM Watson) – MSN web query data
– Dr. Longin Jan Latecki (Temple University) – Trajectory dataset 1
– Dr. Andrew Naftel (University of Manchester) – Trajectory dataset 2
also
– Dr. Jessica Lin (George Mason University) and
– Dr. Ada Fu (Chinese University of Hong Kong) – for useful discussions
All datasets and the code can be downloaded from:
http://www.cs.ucr.edu/~dyankov/projects/
THANK YOU!