SAX: a Novel Symbolic Representation of Time Series
Presenter: Arif Bin Hossain
Authors: Jessica Lin, Eamonn Keogh, Li Wei, Stefano Lonardi
Slides incorporate materials kindly provided by Prof. Eamonn Keogh

Time Series
• A time series is a sequence of data points, measured typically at successive times spaced at uniform time intervals. [Wiki]
[Figure: example time series plot; x-axis from 0 to 8000, y-axis from 0 to 30]
• Examples:
• Economic, sales, and stock market forecasting
• EEG, ECG, BCI analysis

Problems
• Join: Given two data collections, link items occurring in each
• Annotation: Obtain additional information from given data
• Query by content: Given a large data collection, find the k most similar objects to an object of interest
• Clustering: Given an unlabeled dataset, arrange its objects into groups by their mutual similarity

Problems (Cont.)
• Classification: Given a labeled training set, classify future unlabeled examples
• Anomaly Detection: Given a large collection of objects, find the one that is most different from all the rest
• Motif Finding: Given a large collection of objects, find the pair that is most similar

Data Mining Constraints
For example, suppose you have one gig of main memory and want to do k-means clustering…
• Clustering ¼ gig of data: 100 sec
• Clustering ½ gig of data: 200 sec
• Clustering 1 gig of data: 400 sec
• Clustering 1.1 gigs of data: a few hours
Bradley, M. Fayyad, & Reina: Scaling Clustering Algorithms to Large Databases. KDD 1998: 9-15

Generic Data Mining
• Create an approximation of the data that fits in main memory yet retains the essential features of interest
• Approximately solve the problem at hand in main memory
• Make (hopefully very few) accesses to the original data on disk to confirm the solution

Some Common Approximations

Why Symbolic Representation?
• Reduce dimension
• Numerosity reduction
• Hashing
• Suffix trees
• Markov models
• Stealing ideas from the text processing / bioinformatics community

Symbolic Aggregate ApproXimation (SAX)
• Lower bounding of Euclidean distance
• Lower bounding of the DTW distance
• Dimensionality reduction
• Numerosity reduction
[Figure: example SAX word baabccbc]

SAX
• Allows a time series of arbitrary length n to be reduced to a string of arbitrary length w (w << n)
• Notations
  C    A time series, C = c1, …, cn
  Ć    A Piecewise Aggregate Approximation of a time series, Ć = ć1, …, ćw
  Ĉ    A symbolic representation of a time series, Ĉ = ĉ1, …, ĉw
  w    Number of PAA segments representing C
  a    Alphabet size

How to obtain SAX?
• Step 1: Reduce dimension by PAA
• A time series C of length n can be represented in a w-dimensional space by a vector Ć = ć1, …, ćw
• The ith element is calculated by
$$\acute{c}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w}i} c_j$$
• Example: reduce the dimension from 20 to 5. The 2nd element will be
$$\acute{c}_2 = \frac{5}{20} \sum_{j=5}^{8} c_j$$

How to obtain SAX?
• The data is divided into w equal-sized frames.
• The mean value of the data falling within each frame is calculated.
• The vector of these values becomes the PAA.
[Figure: time series C and its PAA Ć; x-axis from 0 to 120]

How to obtain SAX?
• Step 2: Discretization
• Normalize Ć to have a Gaussian distribution
• Determine breakpoints that will produce a equal-sized areas under the Gaussian curve, where a is the alphabet size
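
The two steps above (PAA, then discretization against Gaussian breakpoints) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names `paa`, `breakpoints`, and `sax_word` are hypothetical, the sketch assumes n is a multiple of w, and it z-normalizes the raw series before PAA, which is one common reading of the "Normalize" bullet.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev


def paa(series, w):
    """Step 1: mean of each of w equal-sized frames (assumes w divides n)."""
    values = list(series)
    n = len(values)
    if w <= 0 or n % w != 0:
        raise ValueError("this sketch assumes n is a positive multiple of w")
    frame = n // w
    return [fmean(values[i * frame:(i + 1) * frame]) for i in range(w)]


def breakpoints(a):
    """The a - 1 cut points giving a equal-area regions under N(0, 1)."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(k / a) for k in range(1, a)]


def sax_word(series, w, a):
    """Step 2: z-normalize, apply PAA, then map each segment to a letter."""
    values = list(series)
    mean, std = fmean(values), pstdev(values)
    if std == 0:
        raise ValueError("a constant series cannot be z-normalized")
    normalized = [(v - mean) / std for v in values]  # assumption: normalize first
    cuts = breakpoints(a)
    # Segments below the first cut map to 'a', the next region to 'b', and so on.
    # Only meaningful for a <= 26 in this sketch.
    return "".join(chr(ord("a") + bisect_right(cuts, s)) for s in paa(normalized, w))


print(paa(range(1, 21), 5))          # [2.5, 6.5, 10.5, 14.5, 18.5]
print(sax_word(range(1, 21), 5, 3))  # aabcc
```

With the series 1, 2, …, 20 and w = 5, the second PAA element is the mean of c5 through c8, which is 6.5, matching the worked example for the 2nd element.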
[Figure: PAA segments mapped to the symbols a, b, c; Words: 8, Alphabet: 3, resulting word baabccbc; x-axis from 0 to 120]

Distance Measure
• Given 2 time series Q and C
• Euclidean distance
• Distance after transforming the subsequence to PAA

Distance Measure
• Define MINDIST after transforming to the symbolic representation
• MINDIST lower bounds the true distance between the original time series

Numerosity Reduction
• Subsequences are extracted by a sliding window
• Consecutive subsequences are mostly repetitive
• Suppose the sliding window finds aabbcc
• If the next subsequence is also aabbcc, just store the position
• This optimization depends on the data, but typically yields a reduction factor of 2 or 3
• Example: space shuttle telemetry with subsequence length 32

Experimental Validation
• Clustering
  • Hierarchical
  • Partitional
• Classification
  • Nearest neighbor
  • Decision tree
• Motif discovery

Hierarchical Clustering
• The sample dataset consists of 3 series from each of the decreasing trend, upward shift, and normal classes

Partitional Clustering (k-means)
• Assign each point to the one of k clusters whose center is nearest
• Each iteration tries to minimize the sum of squared intra-cluster error

Nearest Neighbor Classification
• SAX beats Euclidean distance due to the smoothing effect of dimensionality reduction

Decision Tree Classification
• Since decision trees are expensive to use with high-dimensional datasets, the Regression Tree [Geurts.2001] is a better approach for data mining on time series

Motif Discovery
• Implemented the random projection algorithm of Tompa and Buhler [ICMB2001]
• Hashes subsequences into buckets using a random subset of their features as a key

New Version: iSAX
• Use binary numbers for labeling the words
• Different alphabet sizes (cardinalities) within a word
• Comparison of words with different cardinalities

Thank you
Questions?
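
As a supplement to the Distance Measure slides, whose equations did not survive in this transcript, here is a hedged Python sketch of MINDIST. It assumes the usual definition from the SAX literature: sqrt(n/w) times the root of the summed squared per-symbol distances, where equal or adjacent symbols are at distance 0 and other pairs are separated by the gap between their breakpoints. The names `symbol_dist` and `mindist` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def symbol_dist(r, c, cuts):
    """Distance between two symbol indices (0 = 'a', 1 = 'b', ...).

    Equal or adjacent symbols are at distance 0; otherwise the distance is
    the gap between the breakpoints that separate the two regions.
    """
    if abs(r - c) <= 1:
        return 0.0
    return cuts[max(r, c) - 1] - cuts[min(r, c)]


def mindist(word_q, word_c, n, a):
    """Lower-bounding distance between two SAX words of equal length w.

    n is the length of the original series and a is the alphabet size.
    """
    if len(word_q) != len(word_c):
        raise ValueError("both words must have the same length w")
    w = len(word_q)
    # Same Gaussian breakpoints as in the discretization step.
    cuts = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    total = sum(
        symbol_dist(ord(q) - ord("a"), ord(c) - ord("a"), cuts) ** 2
        for q, c in zip(word_q, word_c)
    )
    return sqrt(n / w) * sqrt(total)


print(mindist("baabccbc", "baabccbc", n=128, a=3))  # 0.0
```

Because adjacent symbols contribute nothing, two words that differ only by neighboring letters (for example ab versus ba) also have a MINDIST of 0. This is the price of guaranteeing that the measure never overestimates the true distance.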