Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
iSAX: Indexing and Mining Terabyte Sized Time Series Jin Shieh, Eamonn Keogh Computer Science & Eng. Dept. University of California, Riverside Time Series Data Mining Group Outline • Introduction – Motivating example • iSAX representation – Indexing time series • Experimental evaluation • Conclusion Time Series Data Mining Group Introduction • Our work extends a popular symbolic representation of time series to allow for the indexing and retrieval of millions of time series • Symbolic Aggregate approXimation (SAX) – Represent a time series T of length n in w-dimensional i space using PAA ti wn T j • Where the ith element of is: n w j wn ( i 1) 1 – Then discretize into a vector of symbols • Breakpoints map to a small alphabet a of symbols 3 3 2 3 A time series T 2 2 PAA(T,4) iSAX(T,4,4) 00 1 1 1 01 0 0 0 -1 -1 -2 -2 10 11 -1 -3 0 4 8 12 16 Time Series Data Mining Group -3 -2 -3 0 4 8 12 16 0 4 8 12 16 Introduction (cont.) • SAX is lower bounding – Given a SAX representations Ta, Sa a lower bound to the Euclidean distance is: MINDIST(Ta, Sa) n w dist (t , s ) w i 1 2 i i – dist(ti,si) is the smallest distance between the breakpoints that characterize each symbol, 0 if they overlap Time Series Data Mining Group Motivating Example • Why not just index using SAX? • For example: index 1,000,000 time series using SAX – Choose SAX parameters • cardinality = 8, wordlength = 4 • 84 = 4,096 possible SAX word labels – Place time series which map to the same label in the same file on disk – Compute label for query and retrieve matching file • Time series in file likely to be good approximate matches – Average label occupancy 1,000,000/4,096 = ~244 (reasonable) Time Series Data Mining Group Motivating Example (cont.) • In practice, the distribution of time series to SAX word labels is not uniform! – Empty – Disproportionate percentage of the dataset • Ideal condition: We want to give a threshold th, and have the number of entries n mapped to a label to be 1 ≤ n ≤ th – Favor larger n • How can we achieve this? We need to make SAX more flexible Time Series Data Mining Group iSAX Representation • SAX uses a single hard-coded cardinality – Unable to differentiate only on dimensions of interest • We will show that the indexing problem can be solved if we extend SAX to allow: – Different cardinalities within a single word – Comparison of words with different cardinalities • We call this extension indexable SAX (iSAX) Time Series Data Mining Group iSAX Representation (cont.) • Multi-resolution property – Readily convert to any lower resolution that differs by a power of two {12,13, 6, 1} = {1100,1101,0110,0001} { 6, 6, 3, 0} = {110 ,110 ,011 ,000 } { 3, 3, 1, 0} = {11 ,11 ,01 ,00 } { 1, 1, 0, 0} = {1 ,1 ,0 ,0 } • Lower bounding distance between iSAX words enforced through examination of both sets of breakpoints • iSAX offers a bit aware, quantized, multi-resolution representation with variable granularity Time Series Data Mining Group Indexing with iSAX • Split a set of time series represented by a common iSAX word into mutually exclusive subsets (using multi-resolution property): – Increase cardinality along dimensions d, word length w, 1≤d≤w – Fan-out rate bound by 2d • Iterative doubling – Given a base cardinality b, cardinality at i-th increase is b*2i – Alignment of breakpoints overlap • Allows for index structures which are hierarchical, with nonoverlapping regions, and a controlled fan-out rate Time Series Data Mining Group Indexing with iSAX (cont.) • Simple tree-based index (base cardinality b, word length w, threshold th) – Hierarchically subdivides SAX space until entries in each subspace falls within th • Leaf nodes point to index files on disk • Internal nodes designate a split in SAX space • Approximate Search – Similar time series often represented by same iSAX word – Traverse index until leaf • Match iSAX representation at each level • Apply heuristics if no match • Exact Search – Leverage approximate search – Prune search space • Lower bounding distance Time Series Data Mining Group Experimental Evaluation • We conduct experiments to identify characteristics of the iSAX representation: – Tightness of the lower bound – Indexing performance on massive datasets – Applicability to data-mining algorithms Time Series Data Mining Group Tightness of Lower Bounds • TLB = LowerBoundDist(T’,S’) / EuclideanDist(T,S) • For a given dataset – Time series length [480, 960, 1440, 1920] – Bytes available for representation [16, 24, 32, 40] – Results similar across thirty datasets iSAX, DCT, ACPA, DFT, PAA/DWT, CHEB, IPLA Koski ECG 0.8 0 500 1000 0.4 0.2 480 960 1440 0 1920 TLB 0.6 Time series Length Koski ECG dataset Time Series Data Mining Group 40 bytes 32 bytes 24 bytes 16 bytes Indexing Performance on Massive Datasets • Indexed random walk datasets of [1, 2, 4, 8] million time series of length 256 – Parameters: b = 4, w = 8, th = 100 – Generated [39,255, 57,365, 92,209, 162,340] index files Percentage of Queries • Approximate Search (1000 queries): At least 1 from top 100 100 80 At least 1 from top 10 60 40 Outside top 1000 1 from top 1 (true nearest neighbor) 20 0 1m 2m Size of Random Walk Database 4m 8m Exact Search (100 queries): Avg. Disk Accesses/Query Avg. Time/Query (min) 1M 2M 4M 8M Exact Search 3.8 5.8 9.0 14.1 Sequential Scan 71.5 104.8 168.8 297.6 Time Series Data Mining Group 1M 2M 4M 8M Exact Search 2115.3 3172.5 4925.3 7719.1 Sequential Scan 39255 57365 92209 162340 Data Mining • Definition: Time Series Set Difference (TSSD) (A,B). Given two collections of time series A and B, the time series set difference is the subsequence in A whose distance from its nearest neighbor in B is maximal • Electrocardiogram dataset from a 45 year old male subject with suspected sleep-disordered breathing – 7.2 hours as reference set B (1,000,000 time series) – 8 minutes 39 seconds as “novel” set A (20,000 time series) where the patient woke up 4 0 90 -4 0 50 100 109 150 200 250 The Time Series Set Difference discovered between ECGs recorded during a waking cycle and the previous 7.2 hours (respiration pattern change in accordance with change in sleep stages) Time Series Data Mining Group Data Mining (cont.) • Solutions: 1) Sequential scan A across B 2) Exact search each entry in A using index on B 3) Leverage approximate and exact search • • Order A by approximate search distance in a queue Perform exact search using index on B in descending distance – Suspend if distance becomes lower than next entry in the queue – If search completes, return as TSSD Distance Computations Disk Accesses Est. Time 1) Sequential Scan 20,000,000,000 31,196 6.25 days 2) Exact Search 325,604,200 5,676,400 1.04 days 3) Leveraged 2,365,553 43,779 34 minutes Time Series Data Mining Group Conclusion • Introduced the iSAX representation and shown how it can be used for indexing time series • Demonstrated scalability and efficacy on massive datasets • Showed how approximate and exact search can be used in conjunction to produce exact results on data mining problems Time Series Data Mining Group THANK YOU! Time Series Data Mining Group