Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
0 20 40 60 80 100 120 DFT 0 20 40 60 80 100 120 DWT 0 20 40 60 80 100 120 SVD 0 20 40 60 80 100 120 APCA 0 20 40 60 80 100120 PAA 0 20 40 60 80 100 120 PLA 0 20 40 60 80 100 120 CHEB A Different Approach… All the previous representations have been real valued, but think of what you can do with discrete data that you cannot do (or do easily) with real valued data… Markov Models, Suffix Trees, Hashing, Relevance Feedback, Kolmogorov Complexity etc There are many symbolic representations in the literature, but none lower bound, and they are typically ad hoc, high dimensionally and generally not useful for data mining. There is now a symbolic representation of time series that allows… • Lower bounding of Euclidean distance • Dimensionality Reduction • Numerosity Reduction We call our representation SAX Symbolic Aggregate ApproXimation baabccbc How do we obtain SAX? C C 0 40 60 80 100 120 c First convert the time series to PAA representation, then convert the PAA to symbols It takes linear time 20 c c b b a 0 20 b a 40 60 80 100 baabccbc 120 Visual Comparison 3 2 1 0 -1 -2 -3 DFT f e d c b a PLA Haar APCA A raw time series of length 128 is transformed into the word “ffffffeeeddcbaabceedcbaaaaacddee.” – We can use more symbols to represent the time series since each symbol requires fewer bits than real-numbers (float, double) SAX is Good! For classification, clustering and indexing of time series, SAX is as good or better than… • Fourier Transforms • Wavelets • The raw data! But I am not going to show you this today! (See Jessica Lin’s DMKD 2003 paper…) SAX is Great! SAX lets us do things that are difficult or impossible with other representations. • Finding motifs in time series (ICDM 02, SIGKDD 03) • Visualizing massive time series (SIGKDD04, VLDB 04) • Cluster from streams (ICDM 03, KAIS 04) • Kolmogorov complexity data mining (SIGKDD 04) • The papers above are just from my group, there are now a few dozen groups around the world using SAX…. The Joy of SAX SAX Ideas Idea I: A lite-weight, but incredibly useful tool call time series bitmaps. To explain time series bitmaps, we begin with a digression into DNA… The DNA of two species… TGGCCGTGCTAGGCCCCACCCCTACCTTGC GTCCCCGCAAGCTCATCTGCGCGAACCAGA ACGCCCACCACCCTTGGGTTGAAATTAAGG GGCGGTTGGCAGCTTCCCAGGCGCACGTA CTGCGAATAAATAACTGTCCGCACAAGGAG CCGACGATAGTCGACCCTCTCTAGTCACGA CTACACACAGAACCTGTGCTAGACGCCATG GATAAGCTAACACAAAAACATTTCCCACTAC TGCTGCCCGCGGGCTACCGGCCACCCCTG CTCAGCCTGGCGAAGCCGCCCTTCA CCGTGCTAGGGCCACCTACCTTGGTCC CCGCAAGCTCATCTGCGCGAACCAGAA GCCACCACCTTGGGTTGAAATTAAGGA GCGGTTGGCAGCTTCCAGGCGCACGTA CTGCGAATAAATAACTGTCCGCACAAG AGCCGACGATAAAGAAGAGAGTCGACC CTCTAGTCACGACCTACACACAGAACC GTGCTAGACGCCATGAGATAAGCTAAC C T A G 0.20 0.24 0.26 0.30 CCGTGCTAGGGCCACCTACCTTGGTCCG CCGCAAGCTCATCTGCGCGAACCAGAA GCCACCACCTTGGGTTGAAATTAAGGAG GCGGTTGGCAGCTTCCAGGCGCACGTA CTGCGAATAAATAACTGTCCGCACAAGG AGCCGACGATAAAGAAGAGAGTCGACC CTCTAGTCACGACCTACACACAGAACCT GTGCTAGACGCCATGAGATAAGCTAACA CC CT TC TT C T CA CG TA TG TC CCC CCT CTC CCA CCG CTA CAC CAT CAA AC AT GC GT A G AA AG GA GG CCGTGCTAGGGCCACCTACCTTGGTCC CCGCAAGCTCATCTGCGCGAACCAGAA GCCACCACCTTGGGTTGAAATTAAGGA GCGGTTGGCAGCTTCCAGGCGCACGT CTGCGAATAAATAACTGTCCGCACAAG AGCCGACGATAAAGAAGAGAGTCGAC CTCTAGTCACGACCTACACACAGAACC GTGCTAGACGCCATGAGATAAGCTAAC 1 0.02 0.04 0.09 0.04 CA 0.03 0.07 0.02 AC AT 0.11 0.03 AA AG 0 CCGTGCTAGGCCCCACCCCTACCTTGC GTCCCCGCAAGCTCATCTGCGCGAACC GAACGCCCACCACCCTTGGGTTGAAAT AAGGAGGCGGTTGGCAGCTTCCCAGG CACGTACCTGCGAATAAATAACTGTCC ACAAGGAGCCCGACGATAGTCGACCCT TCTAGTCACGACCTACACACAGAACCT TGCTAGACGCCATGAGATAAGCTAACA OK. Given any DNA string I can make a colored bitmap, so what? CCGTGCTAGGCCCCACCCCTACCTTGC GTCCCCGCAAGCTCATCTGCGCGAACC GAACGCCCACCACCCTTGGGTTGAAAT AAGGAGGCGGTTGGCAGCTTCCCAGG CACGTACCTGCGAATAAATAACTGTCC ACAAGGAGCCCGACGATAGTCGACCCT TCTAGTCACGACCTACACACAGAACCT TGCTAGACGCCATGAGATAAGCTAACA Two Questions Indian rhinoceros.dna • Can we do something similar for time series? white rhinoceros.dna rhesus monkey.dna pygmy chimpanzee.dna sperm whale.dna • Would it be useful? Indian elephant.dna hippopotamus.dna chimpanzee.dna Human.dna African elephant.dna orangutan.dna pygmy sperm whale.dna Can we do make bitmaps for time series? Yes, with SAX! accbabcdbcabdbcadbacbdbdcadbaacb… a c b d aa ac ca cc ab ad cb cd ba bc da dc bb bd db dd aaa aab aba aac aad abc aca acb acc Time Series Bitmap While they are all example of EEGs, example_a.dat is from a normal trace, whereas the others contain examples of spike-wave discharges. We can further enhance the time series bitmaps by arranging the thumbnails by “cluster”, instead of arranging by date, size, name etc We can achieve this with MDS. normal9.txt normal8.txt normal5.txt normal1.txt normal10.txt normal11.txt normal15.txt normal14.txt normal13.txt normal7.txt normal2.txt normal16.txt normal18.txt normal4.txt normal3.txt normal12.txt normal6.txt normal17.txt ventricular depolarization “plateau” stage repolarizatio nrecovery initial rapid repolarization 0 100 200 300 phase 400 normal9.txt 500 normal8.txt normal5.txt normal1.txt normal10.txtnormal11.txt normal15.txtnormal14.txt normal13.txtnormal7.txt normal2.txt normal16.txtnormal18.txt normal4.txt normal3.txt normal12.txt normal17.txt normal6.txt 0 100 200 300 400 500 Some of the data are not heartbeats! They are the action potential of a normal pacemaker cell We can test how much useful information is retained in the bitmaps by using only the bitmaps for clustering/classification/anomaly detection 20 19 17 18 16 8 7 We can test how much useful information is retained in the bitmaps by using only the bitmaps for clustering/classification/anomaly detection 10 9 6 15 14 12 13 Data Key Cluster 1 (datasets 1 ~ 5): BIDMC Congestive Heart Failure Database (chfdb): record chf02 Start times at 0, 82, 150, 200, 250, respectively 11 5 Cluster 2 (datasets 6 ~ 10): BIDMC Congestive Heart Failure Database (chfdb): record chf15 Start times at 0, 82, 150, 200, 250, respectively 4 Cluster 3 (datasets 11 ~ 15): 3 2 Long Term ST Database (ltstdb): record 20021 Start times at 0, 50, 100, 150, 200, respectively Cluster 4 (datasets 16 ~ 20): 1 MIT-BIH Noise Stress Test Database (nstdb): record 118e6 Start times at 0, 50, 100, 150, 200, respectively 24 23 25 26 20 19 16 15 14 13 12 11 10 9 8 7 22 21 6 5 4 3 30 29 28 27 18 17 2 1 We can test how much useful information is retained in the bitmaps by using only the bitmaps for clustering/classification/anomaly detection Here is a Premature Ventricular Contraction (PVC) Here the bitmaps are almost the same. Here the bitmaps are very different. This is the most unusual section of the time series, and it coincidences with the PVC. Annotations by a cardiologist Premature ventricular contraction Supraventricular escape beat Premature ventricular contraction Time Series Bitmaps Summary The first paper to describe Time Series Bitmaps appeared in SDM 05. There are lots of possible ideas for extensions/ commercialization. Time series bitmaps could be one of the few contributions of data mining to make a real world impact, because there is essentially no barrier to adoption. “The greatest value of a picture is when it forces us to notice what we never expected to see” John Turkey Exploring data analysis. Addison-Wesley, Reading MA, 1977. Using SAX to Visualize Time Series Motivation of VizTree 010110010111100110100100001000 101001101101011100001010101110 1111100011011011011111101001100 100100011010001111001101101000 101111000101101001101100110100 000010011000100111000001110100 1100101100001010010 10001000101001000101010100001 010100010101110111101011010010 11101001010100111010101010010 10010101011101010100101010101 101010100101100101110111101000 111000010100001001110101000111 00001010101100101110101 Here are two sets of bit strings. Which set is generated by a human and which one is generated by a computer? VizTree 010110010111100110100100001000101 001101101011100001010101110111110 001101101101111110100110010010001 101000111100110110100010111100010 110100110110011010000001001100010 011100000111010011001011000010100 10 0 1 10001000101001000101010100001010 100010101110111101011010010111010 010101001110101010100101001010101 110101010010101010110101010010110 010111011110100011100001010000100 111010100011100001010101100101110 101 0 1 0 0 1 1 Lets put the sequences into a depth limited tree, such that the frequencies of all triplets are encoded in the thickness of branches… “humans usually try to fake randomness by alternating patterns” VizTree The “trick” on the previous slide only works for discrete data, but time series are real valued. Details 2 Zoom in But we can SAX up a time series to make it discrete! Overview VisTree • Convert the time series to SAX • Push the data in a depth-limited suffix tree • Encode the frequencies as the line thickness Overview, zoom & filter, details on demand Details 1 SAX for Motif Discovery SAX allows Motif Discovery! Winding Dataset ( The angular speed of reel 2 ) 0 50 0 1000 150 0 2000 Informally, motifs are reoccurring patterns… 2500 Motif Discovery To find these 3 motifs would require about 6,250,000 calls to the Euclidean distance function. A 0 500 20 1500 2000 B 40 60 80 100 120 140 0 20 C (The angular speed of reel 2) 1000 A 0 Winding Dataset B 2500 C 40 60 80 100 120 140 0 20 40 60 80 100 120 140 Why Find Motifs? · Mining association rules in time series requires the discovery of motifs. These are referred to as primitive shapes and frequent patterns. · Several time series classification algorithms work by constructing typical prototypes of each class. These prototypes may be considered motifs. · Many time series anomaly/interestingness detection algorithms essentially consist of modeling normal behavior with a set of typical shapes (which we see as motifs), and detecting future patterns that are dissimilar to all typical shapes. · In robotics, Oates et al., have introduced a method to allow an autonomous agent to generalize from a set of qualitatively different experiences gleaned from sensors. We see these “experiences” as motifs. · In medical data mining, Caraca-Valente and Lopez-Chavarrias have introduced a method for characterizing a physiotherapy patient’s recovery based of the discovery of similar patterns. Once again, we see these “similar patterns” as motifs. • Animation and video capture… (Tanaka and Uehara, Zordan and Celly) T Trivial Matches Space Shuttle STS - 57 Telemetry C 0 100 200 3 00 400 ( Inertial Sensor ) 500 600 70 0 800 900 100 0 Definition 1. Match: Given a positive real number R (called range) and a time series T containing a subsequence C beginning at position p and a subsequence M beginning at q, if D(C, M) R, then M is called a matching subsequence of C. Definition 2. Trivial Match: Given a time series T, containing a subsequence C beginning at position p and a matching subsequence M beginning at q, we say that M is a trivial match to C if either p = q or there does not exist a subsequence M’ beginning at q’ such that D(C, M’) > R, and either q < q’< p or p < q’< q. Definition 3. K-Motif(n,R): Given a time series T, a subsequence length n and a range R, the most significant motif in T (hereafter called the 1-Motif(n,R)) is the subsequence C1 that has highest count of non-trivial matches (ties are broken by choosing the motif whose matches have the lower variance). The Kth most significant motif in T (hereafter called the K-Motif(n,R) ) is the subsequence CK that has the highest count of non-trivial matches, and satisfies D(CK, Ci) > 2R, for all 1 i < K. OK, we can define motifs, but how do we find them? The obvious brute force search algorithm is just too slow… Our algorithm is based on a hot idea from bioinformatics, random projection* and the fact that SAX allows use to lower bound discrete representations of time series. * J Buhler and M Tompa. Finding motifs using random projections. In RECOMB'01. 2001. A simple worked example of our motif discovery algorithm The next 4 slides T 0 500 ( m= 1000) 1000 C1 C^1 a c b a ^ S 1 a 2 b : : : : 58 a : : 985 b c c : : c : c b a : : c : c a b : : a : c a = 3 {a,b,c} n = 16 w=4 Assume that we have a time series T of length 1,000, and a motif of length 16, which occurs twice, at time T1 and time T58. A mask {1,2} was randomly chosen, so the values in columns {1,2} were used to project matrix into buckets. Collisions are recorded by incrementing the appropriate location in the collision matrix A mask {2,4} was randomly chosen, so the values in columns {2,4} were used to project matrix into buckets. Once again, collisions are recorded by incrementing the appropriate location in the collision matrix We can calculate the expected values in the matrix, assuming there are NO patterns… k i E( k , a, w, d , t ) 1- 2 i 0 w d t w a 1 i a i 1 a w i 1 Suppose E(k,a,w,d,t) = 2 2 2 : 1 3 58 27 : 3 2 1 2 2 1 985 0 1 2 1 1 2 : 58 3 : 985 A Simple Experiment Lets imbed two motifs into a random walk time series, and see if we can recover them C A D B 0 0 20 40 60 200 80 100 120 400 0 20 40 600 60 80 100 800 120 1000 1200 Planted Motifs C A B D “Real” Motifs 0 20 40 60 80 100 120 0 20 40 60 80 100 120 Some Examples of Real Motifs Motor 1 (DC Current) 0 500 1000 1500 2000 Astrophysics (Photon Count) 250 0 350 0 450 0 550 0 650 0 Motifs in Music jingle • • • • Single channel (mono) 225000 samples at sample rate of 6000 samples/sec, 32bits per sample. Pre-processing: Absolute-valued and down-sampled to total of 600 samples and new sample rate of 16 samples/sec. 400 projections with instance length equal to 2 seconds of sample. w=16, a=8. Jingle is highly repetitive, these motifs were found: How Fast can we find Motifs? Seconds 10k 8k Brute Force 6k TS-P 4k 2k 0 1000 2000 3000 4000 Length of Time Series 5000 The sun is setting on all other symbolic representations of time series, we have seen SAX for discord discovery, anomaly detection, clustering and visualization