Time Series II

Syllabus
Nov 4        Introduction to data mining
Nov 5        Association Rules
Nov 10, 14   Clustering and Data Representation
Nov 17       Exercise session 1 (Homework 1 due)
Nov 19       Classification
Nov 24, 26   Similarity Matching and Model Evaluation
Dec 1        Exercise session 2 (Homework 2 due)
Dec 3        Combining Models
Dec 8, 10    Time Series Analysis
Dec 15       Exercise session 3 (Homework 3 due)
Dec 17       Ranking
Jan 13       Review
Jan 14       EXAM
Feb 23       Re-EXAM

Last time…
• What is a time series?
• How do we compare time series data?

Today…
• What is the structure of time series data?
• Can we represent this structure compactly and accurately?
• How can we search streaming time series?

Time series summarization
(The original slide shows the same time series summarized under six representations: DFT, DWT, PAA, APCA, PLA, and SAX; the SAX version is the symbol string "aabbbccb".)

Why Summarization?
• We can reduce the length of the time series
• We should lose as little information as possible
• We can process the summarized series faster

Discrete Fourier Transform (DFT)
Basic idea (Joseph Fourier, 1768-1830): represent the time series as a linear combination of sines and cosines, i.e., transform the data from the time domain to the frequency domain. This highlights the periodicities, and we keep only the first n/2 coefficients.
Why n/2 coefficients? Because the coefficients are symmetric (see below).

Excellent free Fourier primer:
Hagit Shatkay, "The Fourier Transform - a Primer", Technical Report CS95-37, Department of Computer Science, Brown University, 1995. http://www.ncbi.nlm.nih.gov/CBBresearch/Postdocs/Shatkay/

Why DFT?
A: Many real sequences are periodic.
Q: Such as?
A: Sales patterns follow the seasons; the economy follows a 50-year cycle (or is it 10?); temperature follows daily and yearly cycles. Many real signals follow (multiple) cycles.

How does it work?
• Decompose the signal into a sum of sine and cosine waves.
• How do we assess the 'similarity' of $x = \{x_0, x_1, \dots, x_{n-1}\}$ with a discrete wave $s = \{s_0, s_1, \dots, s_{n-1}\}$?
• Consider the waves with frequency $f = 0, 1, 2, \dots$ (frequency = 1/period; e.g., $\sin(t \cdot 2\pi/n)$ for $f = 1$), and use the inner product (~cosine similarity).
• The 'basis' functions (cosine and sine at $f = 1, 2, \dots$) are actually n-dimensional vectors, orthogonal to each other.
• The DFT is, in effect, the collection of all the similarities of x with the basis functions.

Since $e^{j\varphi} = \cos(\varphi) + j \sin(\varphi)$, we finally have, with $j = \sqrt{-1}$:

$X_f = \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} x_t \, e^{-j 2\pi t f / n}$

and the inverse DFT:

$x_t = \frac{1}{\sqrt{n}} \sum_{f=0}^{n-1} X_f \, e^{+j 2\pi t f / n}$

Each $X_f$ is a complex number: $X_f = a + bj$, where a is the real part and b is the imaginary part. Examples: $10 + 5j$, $4.5 - 4j$.

SYMMETRY property of the coefficients: $X_f = (X_{n-f})^*$, where "*" denotes the complex conjugate, $(a + bj)^* = a - bj$. Thus we use only the first n/2 numbers.

DFT: Amplitude spectrum
• Amplitude: $A_f^2 = \mathrm{Re}(X_f)^2 + \mathrm{Im}(X_f)^2$
• Intuition: the strength of frequency f in the signal.
(The original slide plots a periodic series and its amplitude spectrum, which spikes at frequency f = 12.)
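To make the transform concrete, here is a minimal sketch using numpy's FFT; the signal and its two frequencies are made up for illustration and are not from the slides. It checks the conjugate-symmetry property and reads the dominant frequencies off the amplitude spectrum:

```python
import numpy as np

n = 128
t = np.arange(n)
# Toy signal with cycles at frequencies 3 and 9 (illustrative values)
x = np.sin(2 * np.pi * 3 * t / n) + 0.5 * np.sin(2 * np.pi * 9 * t / n)

X = np.fft.fft(x) / np.sqrt(n)        # DFT with the 1/sqrt(n) convention above

# Symmetry: X_f equals the complex conjugate of X_{n-f}
assert np.allclose(X[3], np.conj(X[n - 3]))

A = np.abs(X)                          # amplitude spectrum A_f
print(np.argsort(A[: n // 2])[-2:])    # the two strongest frequencies: [9 3]
```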
Example
(Four slides show a series of about 250 points reconstructed from its first 1, 2, 7, and 20 Fourier coefficients; the approximation improves visibly with every added coefficient.)

DFT: Amplitude spectrum
We can achieve excellent approximations with only very few frequencies! So what?
• We can reduce the dimensionality of each time series by representing it with its k most dominant frequencies.
• Each frequency needs two numbers (real part and imaginary part).
• Hence, a time series of length n can be represented using 2k real numbers, where k << n.

Raw Data
The graphic shows a time series C with n = 128 points. The raw data used to produce the graphic is also reproduced as a column of numbers (0.4995, 0.5264, 0.5523, 0.5761, 0.5973, …; just the first 30 or so points are shown).

Fourier Coefficients
We can decompose the data into 64 pure sine waves using the Discrete Fourier Transform (just the first few sine waves are shown). The Fourier coefficients are likewise reproduced as a column of numbers (1.5698, 1.0485, 0.7160, 0.8406, 0.3709, …; just the first 30 or so are shown).

Truncated Fourier Coefficients
Keeping only the first 8 coefficients (1.5698, 1.0485, 0.7160, 0.8406, 0.3709, 0.4670, 0.2667, 0.1928) gives an approximation C' of C. With n = 128 and N = 8, the compression ratio is 1/16: we have discarded 15/16 of the data.

Sorted, Truncated Fourier Coefficients
Instead of taking the first few coefficients, we could take the best coefficients, i.e., the ones with the largest amplitude (here 1.5698, 1.0485, 0.7160, 0.8406, 0.2667, 0.1928, 0.1438, 0.1416).
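Both truncation strategies are a few lines with numpy's real-input FFT, which handles the conjugate symmetry for us. This is a hedged sketch: the function names and the random-walk test series are my own, not from the slides.

```python
import numpy as np

def reconstruct_first(x, k):
    """Reconstruct x from its first k (complex) Fourier coefficients."""
    X = np.fft.rfft(x)                  # symmetry: only n/2 + 1 coefficients
    X[k:] = 0                           # discard everything after the first k
    return np.fft.irfft(X, len(x))

def reconstruct_best(x, k):
    """Reconstruct x from its k largest-amplitude Fourier coefficients."""
    X = np.fft.rfft(x)
    X[np.argsort(np.abs(X))[:-k]] = 0   # zero out all but the k strongest
    return np.fft.irfft(X, len(x))

x = np.cumsum(np.random.randn(128))     # a smooth-ish random-walk series
for k in (1, 2, 7, 20):                 # as in the reconstruction figures
    err = np.linalg.norm(x - reconstruct_first(x, k))
    print(k, round(err, 2))             # error shrinks as k grows
```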
Discrete Fourier Transform…recap
Pros and cons of DFT as a time series representation.
Pros:
• Good ability to compress most natural signals
• Fast, off-the-shelf DFT algorithms exist, O(n log(n))
Cons:
• Difficult to deal with sequences of different lengths

Piecewise Aggregate Approximation (PAA)
Basic idea: represent the time series as a sequence of box basis functions, each box having the same length.
Computation: a time series X of length n can be represented in the N-dimensional space as $\bar{X} = (\bar{x}_1, \dots, \bar{x}_N)$, where

$\bar{x}_i = \frac{N}{n} \sum_{j = \frac{n}{N}(i-1)+1}^{\frac{n}{N} i} x_j$

i.e., each $\bar{x}_i$ is the mean of the i-th of N equal-length segments.
Keogh, Chakrabarti, Pazzani & Mehrotra, KAIS (2000); Byoung-Kee Yi & Christos Faloutsos, VLDB (2000)

Piecewise Aggregate Approximation (PAA): Example
Let X = [1 3 -1 4 4 4 5 3 7]. X can be mapped from its original dimension n = 9 to a lower dimension, e.g., N = 3, as follows:
[1 3 -1 | 4 4 4 | 5 3 7]  →  [1 4 5]

Piecewise Aggregate Approximation (PAA)
Pros and cons of PAA as a time series representation.
Pros:
• Extremely fast to calculate
• As efficient as the other approaches (empirically)
• Supports queries of arbitrary lengths
• Can support any Minkowski metric
• Supports non-Euclidean measures
• Simple! Intuitive!
Cons:
• If visualized directly, looks aesthetically unpleasing

Symbolic ApproXimation (SAX)
• Similar in principle to PAA: uses segments to represent the data series
• Represents segments with symbols (rather than real numbers): small memory footprint

Creating SAX
• Input: a time series (the blue curve in the original slide)
• Output: the SAX representation of the input time series (the red string), e.g., baabccbc

The Process (STEP 1)
Represent a time series T of length n with w segments using Piecewise Aggregate Approximation:

$PAA(T, w) = \bar{T} = (\bar{t}_1, \dots, \bar{t}_w)$, where $\bar{t}_i = \frac{w}{n} \sum_{j = \frac{n}{w}(i-1)+1}^{\frac{n}{w} i} T_j$

(The slide shows a time series T of length 16 and its PAA(T,4).)

The Process (STEP 2)
Discretize the PAA vector into a vector of symbols, using breakpoints to map the averages to a small alphabet of size a.
(The slide shows PAA(T,4) discretized into iSAX(T,4,4), with the four symbols written in binary as 00, 01, 10, 11.)

Symbol Mapping
• Each average value from the PAA vector is replaced by a symbol from an alphabet.
• An alphabet size a of 5 to 8 is recommended: {a,b,c,d,e} up to {a,b,c,d,e,f,g,h}.
• Given an average value, we need a symbol. This is achieved by using the normal distribution from statistics:
– Assuming our input series is normalized, we can use the normal distribution as the data model.
– We divide the area under the normal distribution into a equal-sized areas, where a is the alphabet size.
– Each such area is bounded by breakpoints.

SAX Computation, in pictures
(The slide, taken from Eamonn's tutorial on SAX, shows the series C being converted to the SAX word baabccbc.)

Finding the Breakpoints
Breakpoints for different alphabet sizes can be structured as a lookup table. When a = 3:
• Average values below -0.43 are replaced by 'A'
• Average values between -0.43 and 0.43 are replaced by 'B'
• Average values above 0.43 are replaced by 'C'

        a=3     a=4     a=5
  b1   -0.43   -0.67   -0.84
  b2    0.43    0.00   -0.25
  b3            0.67    0.25
  b4                    0.84
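The whole PAA-plus-breakpoints pipeline fits in a few lines. The sketch below is illustrative (the helper names are mine, and it assumes the series length divides evenly into the w segments); it reproduces the [1 4 5] PAA example above before discretizing:

```python
import numpy as np

# Breakpoints from the lookup table above (equiprobable regions of N(0,1))
BREAKPOINTS = {3: [-0.43, 0.43],
               4: [-0.67, 0.0, 0.67],
               5: [-0.84, -0.25, 0.25, 0.84]}

def paa(x, w):
    """PAA: the mean of each of w equal-length segments
    (assumes len(x) is divisible by w, for simplicity)."""
    return x.reshape(w, -1).mean(axis=1)

def sax(x, w, a):
    """SAX word of length w over an alphabet of size a.
    Assumes x is already z-normalized, as SAX expects."""
    symbols = np.searchsorted(BREAKPOINTS[a], paa(x, w))
    return "".join(chr(ord("a") + s) for s in symbols)

X = np.array([1, 3, -1, 4, 4, 4, 5, 3, 7], dtype=float)
print(paa(X, 3))                    # [1. 4. 5.], as in the PAA example
Z = (X - X.mean()) / X.std()        # z-normalize before discretizing
print(sax(Z, 3, 3))                 # -> "abc"
```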
The GEMINI Framework
• Raw data: original, full-dimensional space
• Summarization: reduced-dimensionality space
• Searching in the original space is costly
• Searching in the reduced space is faster:
– less data, indexing techniques available, lower bounding
• Lower bounding enables us to
– prune the search space: throw away data series based on their reduced-dimensionality representation
– guarantee the correctness of the answer:
• no false negatives
• false positives are filtered out based on the raw data

GEMINI Solution: quick filter-and-refine
• extract m features (numbers, e.g., average)
• map each series into a point in the m-dimensional feature space
• organize the points
• retrieve the answer using a NN query
• discard false alarms

Generic Search using Lower Bounding
• Run the simplified query against the simplified DB to obtain an answer superset: no false negatives!
• Verify the superset against the original DB to remove the false positives.
• What remains is the final answer set.

GEMINI: contractiveness
• GEMINI works when: $D_{feature}(F(x), F(y)) \le D(x, y)$
• Note that the closer the feature distance is to the actual one, the better.

Streaming Algorithms
• Similarity search is the bottleneck for most time series data mining algorithms, including streaming algorithms.
• Scaling such algorithms can be tedious when the target time series length becomes very large!
• Fast similarity search allows us to solve higher-level time series data mining problems, e.g., similarity search in data streams and motif discovery, at scales that would otherwise be untenable.

Fast Serial Scan
A streaming algorithm for fast and exact search in very large data streams: slide the query across the incoming data stream.

Z-normalization
• Needed when we are interested in detecting trends, not absolute values.
• For streaming data:
– each subsequence of interest should be z-normalized before being compared to the z-normalized query
– otherwise the trends are lost
• Z-normalization guarantees:
– offset invariance
– scale/amplitude invariance

Pre-Processing: z-Normalization
• Data series encode trends, and we are usually interested in identifying similar trends; but absolute values may mask this similarity: two data series v1 and v2 can have very similar trends yet a large distance.
• Each value is transformed as $z_i = \frac{x_i - \mu}{\sigma}$:
• Zero mean:
– compute the mean $\mu$ of the sequence
– subtract the mean from every value of the sequence
• Standard deviation one:
– compute the standard deviation $\sigma$ of the sequence
– divide every value of the sequence by the stddev
• When to z-normalize: when interested in trends. When not to z-normalize: when interested in absolute values.
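A minimal z-normalization sketch; the two toy series are invented here purely to illustrate the offset and amplitude invariance claimed above:

```python
import numpy as np

def z_normalize(x):
    """Zero mean, unit standard deviation: z_i = (x_i - mean) / std."""
    return (x - x.mean()) / x.std()

# Two series with the same trend at a different offset and scale
v1 = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
v2 = 10 * v1 + 100

print(np.linalg.norm(v1 - v2))                             # large raw distance
print(np.linalg.norm(z_normalize(v1) - z_normalize(v2)))   # ~0 after normalizing
```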
Proposed Method: UCR Suite
• An algorithm for similarity search in large data streams
• Supports both ED and DTW search
• Works for both z-normalized and un-normalized data series
• A combination of various optimizations

Squared Distance + LB
• Use the squared distance: since the square root is monotonic, we can compare candidates with

$ED^2(Q, C) = \sum_{i=1}^{n} (q_i - c_i)^2$

and skip the square root entirely.
• Lower bounding: LB_Yi, LB_Kim, LB_Keogh.

Lower Bounds
• LB_Yi: uses max(Q) and min(Q)
• LB_Kim: uses the first, last, minimum, and maximum points
• LB_Keogh: uses an envelope (U, L) around the query Q

Early Abandoning
• Early abandoning of ED: while accumulating $\sum_{i=1}^{k} (q_i - c_i)^2$, abandon as soon as the partial sum exceeds the best-so-far (bsf) distance.
• Early abandoning of LB_Keogh: the same idea applies while summing the envelope distances, where (U, L) is an envelope of Q.
• Early abandoning of DTW: fully calculate LB_Keogh first; then, while computing DTW, stop if dtw_dist ≥ bsf.
• Earlier early abandoning of DTW using LB_Keogh: compute DTW on a prefix of the series and keep the LB_Keogh contribution of the remaining suffix; stop if dtw_dist + lb_keogh ≥ bsf. (R denotes the warping window.)

Early Abandoning Z-Normalization
• Do the normalization only when needed (just in time).
• Every subsequence needs to be normalized before it is compared to the query.
• Online mean and std calculation is needed: keep a buffer of size m and compute a running mean and standard deviation, normalizing each value as $z_i = \frac{x_i - \mu}{\sigma}$.
(The original slide shows the pseudocode for this computation.)

Reordering Early Abandoning
• We don't have to compute ED or LB from left to right; order the points by expected contribution.
• Idea: order by the absolute height of the (z-normalized) query points. This step is performed only once for the query and can save about 30%-50% of the calculations.
• Intuition: the query will be compared to many data stream points during a search. Since the candidates are z-normalized, the distribution of many candidates will be Gaussian with a mean of zero; the sections of the query that are farthest from the mean (zero) will, on average, have the largest contributions to the distance measure.

Different Envelopes
• Reversing the query/data role in LB_Keogh: build the envelope on the candidate C instead of the query Q.
• Makes LB_Keogh tighter; still much cheaper than DTW; the envelope can be calculated online.

Cascading Lower Bounds
• At least 18 lower bounds of DTW have been proposed (e.g., LB_Kim, LB_KimFL, LB_Yi, LB_PAA, LB_FTW, LB_Ecorner, and LB_Keogh in both roles).
• Use only the lower bounds that sit on the skyline of the tightness (LB/DTW) vs. computation cost trade-off: from the O(1) LB_KimFL, to the O(n) LB_Keogh, to O(nR) early-abandoning DTW, to the full DTW.
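Putting several of these ideas together, here is a simplified, illustrative sketch of nearest-neighbor search under squared ED with just-in-time z-normalization, query reordering, and early abandoning. This is not the actual UCR Suite (which also maintains running statistics, cascades DTW lower bounds, etc.); all names and the toy stream are assumptions.

```python
import numpy as np

def nn_search(q, stream, m):
    """Nearest neighbor of query q among all length-m subsequences of stream,
    under squared ED on z-normalized data, with reordering + early abandoning."""
    q = (q - q.mean()) / q.std()
    order = np.argsort(-np.abs(q))        # once per query: biggest |q_i| first
    bsf, best = np.inf, -1
    for s in range(len(stream) - m + 1):
        c = stream[s : s + m]
        c = (c - c.mean()) / c.std()      # just-in-time z-normalization
        total = 0.0
        for i in order:                   # early-abandoning squared ED
            total += (q[i] - c[i]) ** 2
            if total >= bsf:
                break                     # cannot beat the best-so-far
        else:
            bsf, best = total, s          # completed the sum: new best-so-far
    return best, bsf

stream = np.cumsum(np.random.randn(10_000))   # toy random-walk "stream"
query = stream[4321 : 4321 + 64].copy()
print(nn_search(query, stream, 64))           # finds position 4321, distance 0.0
```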
Experimental Result: Random Walk
Varying the size of the data:

              Million (Seconds)   Billion (Minutes)   Trillion (Hours)
  UCR-ED      0.034               0.22                3.16
  SOTA-ED     0.243               2.40                39.80
  UCR-DTW     0.159               1.83                34.09
  SOTA-DTW    2.447               38.14               472.80

Code and data are available at: www.cs.ucr.edu/~eamonn/UCRsuite.html

Experimental Result: ECG
• Data: one year of electrocardiograms, 8.5 billion data points.
• Query: an idealized Premature Ventricular Contraction (PVC, aka skipped beat) of length 421 (warping window R = 21, i.e., 5%).

  UCR-ED      4.1 minutes
  SOTA-ED     66.6 minutes
  UCR-DTW     18.0 minutes
  SOTA-DTW    49.2 hours

~30,000X faster than real time!

Up next…
Nov 4        Introduction to data mining
Nov 5        Association Rules
Nov 10, 14   Clustering and Data Representation
Nov 17       Exercise session 1 (Homework 1 due)
Nov 19       Classification
Nov 24, 26   Similarity Matching and Model Evaluation
Dec 1        Exercise session 2 (Homework 2 due)
Dec 3        Combining Models
Dec 8, 10    Time Series Analysis
Dec 15       Exercise session 3 (Homework 3 due)
Dec 17       Ranking
Jan 13       No Lecture
Jan 14       EXAM
Feb 23       Re-EXAM