Disk Aware Discord Discovery: Finding Unusual Time Series in Terabyte Sized Datasets
Dragomir Yankov, Eamonn Keogh – Computer Science & Eng. Dept., University of California, Riverside
Umaa Rebbapragada – Dept. of Computer Science, Tufts University
Best paper winner: ICDM 2007
Time Series Data Mining Group

Outline
• What inspired the current work
• The time series discord detection problem
• An efficient algorithm for mining disk resident discords
  – Detecting range-based discords
  – Detecting the top k discords
• Experimental results
  – Evaluating the effectiveness of the discord definition
  – Scalability of the discord detection algorithm

A motivating example
• Myriads of telescopes around the world constantly record valuable astronomical data, e.g. star light-curves
• A light-curve is a real-valued time series of light magnitude measurements derived from telescopic images
[Figure: Eclipsed binary: Sirius A&B. Image: Chandra X-ray observatory. Movie: by kind permission of Prof. Richard W. Pogge, OSU]

A motivating example (cont)
• The American Association of Variable Star Observers has a database of over 10.5 million variable star brightness measurements going back over ninety years
• Over 400,000 new variable star brightness measurements are added to the database every year
• Many of the observations are noisy or are preprocessed inaccurately prior to storing
• Efficient, unsupervised methods for cleaning the data are required

A motivating example (cont)
• Data are inherently non-convex and hard to model probabilistically.
• Anomalies should be defined with respect to the non-linear manifolds defined by the light-curve time series (true for many time series datasets)

Definitions and assumptions
• Notation
  – time series: $T = (t_1, \ldots, t_m)$
  – subsequence: $C_i = (t_p, \ldots, t_{p+n-1})$, with $n \le m$ and $1 \le p \le m - n + 1$
  – time series database: $S = \{C_i\}$
• Function $Dist(C_i, C_j)$ (may not be a metric) defines an ordering for the elements in $S$
[Figure: Nasdaq Composite (Oct06-Oct07), with subsequences of length n marked]

Time series discords
• Most-significant discord – the subsequence $C_i \in S$ with maximal distance $Dist(C_i, C_j)$ to its nearest neighbor $C_j \in S$

Generalized discord definitions
• Most-significant k-th NN discord – the subsequence $C_i \in S$ with maximal distance $Dist(C_i, C_j)$ to its k-th nearest neighbor $C_j \in S$
[Figure: example with k = 2]

Generalized discord definitions
• Most-significant k-NN discord – the subsequence $C_i \in S$ with maximal distance to its k nearest neighbors in $S$
[Figure: example with k = 2]
The algorithm utilizes the first of these discord definitions for its computational efficiency and intuitive interpretation

Disk aware discord detection
• Detecting discords is harder than finding similar patterns
  – anytime algorithms can quickly detect similarities
  – anomalies require $O(|S|^2)$ computation time
• Indexing is not a solution
  – time series are high dimensional
  – dimensionality reduction is often inadequate
  – a linear scan is faster than random disk accesses to 10% of the data
We are looking for an algorithm that performs two disk scans and an “approximately linear” number of computations

Discord detection algorithm
• Phase 1 – candidate selection phase (r – discord range)
  – The candidate set $C$ starts empty and the database $S = \{C_1, C_2, C_3, C_4, C_5, \ldots\}$ is scanned sequentially
  – $C_1$ enters $C$; then $Dist(C_2, C_1) \ge r$, so $C_2$ enters $C$ as well
Discord detection algorithm
• Phase 1 – candidate selection phase (cont)
  – $Dist(C_3, C_1) < r$: $C_1$ is removed from $C$ and $C_3$ is not added, leaving $C = \{C_2\}$
  – $Dist(C_4, C_2) \ge r$: $C_4$ is added, giving $C = \{C_2, C_4\}$
  – $Dist(C_5, C_2) \ge r$ and $Dist(C_5, C_4) \ge r$: $C_5$ is added, giving $C = \{C_2, C_4, C_5, \ldots\}$

Discord detection algorithm
• Phase 2 – candidate refinement phase (r – discord range)
  – $S$ is scanned a second time and every element is compared with every remaining candidate $C_j$, starting with $Dist(C_1, C_j)$ against $r$
  – $Dist(C_3, C_5) < r$: $C_5$ is removed, leaving $C = \{C_2, C_4, \ldots\}$
  – Upon completion sort the candidates list $C$

Correctness of the algorithm
• The candidates set $C$ contains all discords at distance at least $r$ from their NN, plus some other elements
• The refinement phase removes from $C$ all false positives, and no real discord is pruned
• Correctness: the range discord algorithm detects all discords and only the discords with respect to the specified range $r$

Finding a good range parameter
• Selecting too large an $r$ may result in an empty discord set, while too small an $r$ can render the algorithm inefficient
• Computing the nearest neighbor distance distribution (NNDD) is expensive
• NNDD depends on the number of examples in the data

Approximating NNDD
• Intuition – though the relative volume in the upper tail decreases, the absolute number of discords cut by $r$ remains sufficient when adding more data
• Detecting the top k discords
  1. Select a uniformly random sample $S' \subset S$
  2. Compute the top k discords in $S'$
  3. Order their NN distances as: $d_1 \ge d_2 \ge \ldots \ge d_k$
  4. Set $r = d_k$
  5. Run the disk aware algorithm with range parameter $r$

Experimental evaluation
We performed two sets of experiments
  1. Experiments showing the utility of the time series discord definition
  2. Experiments showing the scalability of the disk aware discord detection algorithm

Experimental evaluation – utility of the discord definition
• Star light-curve data from the Optical Gravitational Lensing Experiment (OGLE)
• Three classes of light-curves
  – Eclipsed binaries
  – Cepheids
  – RR Lyrae variables
[Figure: top two discords in each class, next to typical examples]

Experimental evaluation – utility of the discord definition
• MSN web queries made in 2002
[Figure: patterns dominated by a weekly cycle; anticipated bursts]
• The most significant discord using rotation invariant Euclidean distance
[Figure: periodicity 29.5 days – the length of a synodic month]

Experimental evaluation – utility of the discord definition
• Anomaly detection in video sequences (multivariate data)
  – our method achieves 100% accuracy on the planted anomalous trajectories
• Adapting the method as a data cleaning procedure
[Figure: the top one discord shown with only one of the existing clusters]

Experimental evaluation – utility of the discord definition
• Population growth data – we studied the growth rate of 206 countries for the last 25 years, looking for the most dramatic 5 year event
[Figure: the top 2 discords with a set of 10 representative countries for contrast]

Experimental evaluation – scalability of the disk aware algorithm
• We generated 3 data sets of size up to 0.35 TB of random walk time series
• Six non-random walk time series were planted, and we looked for the top 10 discords
[Figure: two of the planted series (top) were among the top 10 discords]
• Time efficiency on the three random walk data sets:

  Examples      Disk size   I/O time   Total time
  1 million     3.57 GB     27min      41min
  10 million    35.7 GB     4h 30min   7h 52min
  100 million   0.35 TB     45h        90h 33min

Experimental evaluation – scalability of the disk aware algorithm
• Time efficiency (heterogeneous data):

  Examples      Disk size   I/O time   Total time
  1.2 million   1.17 GB     15min      16min

• Main memory requirement for different thresholds
[Figure: main memory requirement for different thresholds]

Experimental evaluation – scalability of the disk aware algorithm
• Parallelizing the algorithm (m computers):
  – The database is partitioned as $S = S_1 \cup S_2 \cup \ldots \cup S_m$, one partition per computer
  – Candidate selection phase: computer $i$ scans $S_i$ and produces $C_i$; the combined candidate set is $C = \bigcup_{i=1,m} C_i$
  – Candidate refinement phase: computer $i$ refines $C$ against $S_i$ and produces $C_i$; the final discord set is $C = \bigcap_{i=1,m} C_i$

Experimental evaluation – scalability of the disk aware algorithm
• Parallelizing the algorithm (dataset: one million random walks):
  – The runtime overhead for 8 computers is approximately 30%. This is due to the increased candidate set size $|C|$ at the end of phase 1

Conclusion
• Discords provide an effective definition of rare time series patterns.
• The presented disk aware algorithm meets all the requirements of a good off-the-shelf data mining tool:
  – The results are interpretable
  – It is extremely efficient and largely scalable
  – Very easy to implement (“8 lines in Matlab”)
• Allows for straightforward parallel and online extensions

Acknowledgements
• We would like to thank:
  – Dr. Pavlos Protopapas (Harvard University) – light-curve dataset
  – Dr. Michail Vlachos (IBM Watson) – MSN web query data
  – Dr. Longin Jan Latecki (Temple University) – trajectory dataset 1
  – Dr. Andrew Naftel (University of Manchester) – trajectory dataset 2
• Also, for useful discussions:
  – Dr. Jessica Lin (George Mason University)
  – Dr. Ada Fu (Chinese University of Hong Kong)

All datasets and the code can be downloaded from: http://www.cs.ucr.edu/~dyankov/projects/

THANK YOU!
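
Sketch: the most-significant discord by brute force
The "Time series discords" definition can be checked on small data with a direct O(|S|^2) search, which is the cost the disk aware algorithm is designed to avoid. The block below is only an illustrative sketch: Euclidean distance stands in for the generic Dist, the database is an in-memory list of equal-length series, and the names `euclidean`, `nn_distance`, `most_significant_discord` and the toy values are assumptions made here, not code from the talk.

```python
import math


def euclidean(a, b):
    """Euclidean distance; stands in for the deck's generic Dist (assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nn_distance(i, database, dist=euclidean):
    """Distance from database[i] to its nearest neighbor, itself excluded."""
    return min(dist(database[i], c) for j, c in enumerate(database) if j != i)


def most_significant_discord(database, dist=euclidean):
    """Brute-force search: index and NN distance of the element whose
    nearest neighbor is farthest away. Needs O(|S|^2) distance calls."""
    distances = [nn_distance(i, database, dist) for i in range(len(database))]
    best = max(range(len(database)), key=distances.__getitem__)
    return best, distances[best]


# Toy database with made-up values (not data from the talk).
toy = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0), (5, 5, 5, 5)]
index, distance = most_significant_discord(toy)
print(index, round(distance, 3))  # prints: 3 9.539
```

The quadratic loop is fine for a few thousand series in memory; it is shown only to make the definition concrete.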
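
Sketch: the two-phase range discord algorithm
The candidate selection and candidate refinement phases from the "Discord detection algorithm" slides can be written down roughly as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: `scan` is a hypothetical callable that returns a fresh iterator over the data (standing in for one sequential disk pass), Euclidean distance stands in for Dist, and the function names and toy values are invented for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance; stands in for the deck's generic Dist (assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_candidates(scan, r, dist=euclidean):
    """Phase 1: one sequential pass that builds a superset of the discords."""
    candidates = {}  # position in the scan -> series
    for i, s in enumerate(scan()):
        is_candidate = True
        for j in list(candidates):
            if dist(s, candidates[j]) < r:
                # Both elements have a neighbor closer than r: neither is a discord.
                del candidates[j]
                is_candidate = False
        if is_candidate:
            candidates[i] = s
    return candidates


def refine_candidates(scan, candidates, r, dist=euclidean):
    """Phase 2: second pass that prunes false positives and records NN distances."""
    nn = {j: math.inf for j in candidates}
    for i, s in enumerate(scan()):
        for j in list(nn):
            if i == j:
                continue  # do not compare a candidate with itself
            d = dist(s, candidates[j])
            if d < r:
                del nn[j]  # a neighbor within r exists: false positive
            elif d < nn[j]:
                nn[j] = d
    # Sort the surviving candidates, most significant discord first.
    return sorted(nn.items(), key=lambda item: item[1], reverse=True)


def range_discords(scan, r, dist=euclidean):
    """Two passes over the data: candidate selection, then refinement."""
    return refine_candidates(scan, select_candidates(scan, r, dist), r, dist)


# Toy data with made-up values; the lambda plays the role of a disk scan.
data = [(0, 0), (0, 2), (0, -2), (10, 10), (-7, 8)]
print(range_discords(lambda: iter(data), r=3))
```

On this toy input the element (0, -2) survives phase 1 because its close neighbor (0, 0) was already pruned when (0, 2) arrived; phase 2 then removes it, which is the false-positive case the "Correctness of the algorithm" slide refers to.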
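
Sketch: choosing the range parameter from a sample
Steps 1 to 4 of "Detecting the top k discords" on the "Approximating NNDD" slide amount to computing nearest neighbor distances on a small in-memory sample and taking the k-th largest as r. The sketch below assumes Euclidean distance and a brute-force computation on the sample; `estimate_range`, its `seed` parameter and the toy values are hypothetical names and numbers introduced here for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance; stands in for the deck's generic Dist (assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nn_distances(database, dist=euclidean):
    """Brute-force nearest neighbor distance of every element (itself excluded)."""
    return [
        min(dist(a, b) for j, b in enumerate(database) if j != i)
        for i, a in enumerate(database)
    ]


def estimate_range(database, sample_size, k, seed=0, dist=euclidean):
    """Steps 1-4: sample S', take its top k NN distances, return r = d_k."""
    rng = random.Random(seed)                    # fixed seed for repeatability
    sample = rng.sample(database, sample_size)   # step 1: uniform sample S'
    ordered = sorted(nn_distances(sample, dist), reverse=True)  # d_1 >= d_2 >= ...
    return ordered[k - 1]                        # step 4: r = d_k


# Toy data with made-up values (not data from the talk).
data = [(0, 0), (0, 2), (0, -2), (10, 10), (-7, 8)]
r = estimate_range(data, sample_size=len(data), k=2)
print(round(r, 3))
```

Step 5 would pass the returned r to the disk aware algorithm. Because the slides note that the NNDD depends on the number of examples, an r taken from a small sample may be too large for the full data; if a run returns fewer than k discords, a smaller r would presumably have to be tried.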