Fair Use Agreement
This agreement covers the use of all slides on this CD-ROM; please read carefully.
• You may freely use these slides for teaching, if:
  • You send me an email telling me the class number / university in advance.
  • My name and email address appear on the first slide (if you are using all or most of the slides), or on each slide (if you are just taking a few slides).
• You may freely use these slides for a conference presentation, if:
  • You send me an email telling me the conference name in advance.
  • My name appears on each slide you use.
• You may not use these slides for tutorials, or in a published work (tech report / conference paper / thesis / journal etc.). If you wish to do this, email me first; it is highly likely I will grant you permission.
(c) Eamonn Keogh, [email protected]

Towards Parameter Free Data Mining
Stefano Lonardi, Chotirat Ann Ratanamahatana, Eamonn Keogh
Computer Science & Engineering Department, University of California, Riverside

Outline of Talk
• Why parameter-laden data mining algorithms are troublesome…
• A parameter-free/lite approach
• Empirical evaluation
  • Parameter-free/lite clustering
  • Parameter-free/lite anomaly detection
  • Parameter-free/lite classification
• Conclusions

Parameters: Anecdotal Observations
• A data mining paper I once reviewed had 14 parameters…
• More than 5 parameters seems the norm for most papers…
• Some algorithms have only 4 parameters, but the algorithm is tested on synthetic data, created by an algorithm with 3 or 4 parameters…
Is a surfeit of parameters a bad thing?

Parameter-laden algorithms are often hard to implement
By definition, parameter-laden algorithms tend to be hard to implement. But don’t we want people to use our work? This fact (combined with general laziness) probably accounts for the general dearth of strawman comparisons in most papers. See Keogh & Kasetty, SIGKDD 2002.
Parameter-laden algorithms make it hard to reproduce others’ results
We re-implement many other papers (63 for this paper alone). Yet we are rarely able to reproduce others’ results, even when using the same datasets. One reason is that we typically don’t know how someone set the 5, 6, 7, 8, 9 parameters used in the original paper. Are irreproducible results scientifically meaningful?

Parameter-laden algorithms make it hard not to overfit
Overfitting is a problem in data mining. If we have 5 to 10 parameters to set, avoiding overfitting is nearly impossible. This makes it hard to evaluate the contribution of a paper. So you got 95% accuracy on the “bull/bear” classification problem. Did you…
• Spend 2 minutes adjusting the parameters? (I am somewhat impressed)
• Spend 2 weeks adjusting the parameters? (I am not impressed)
Yet often we don’t know which! (See Pedro Domingos’ POE papers)

Parameter-laden algorithms are nearly useless to the non-specialist
If we ask a cardiologist / COE / lawyer / school administrator to use an algorithm with 10 parameters, then the best we can hope for is that they give up in frustration!

The Central Claim of this Paper
A very simple (12 lines of code) parameter-free algorithm can outperform many (most? all?) other algorithms, for many data mining tasks.

Quick Review
The Kolmogorov complexity K(x) of a string x is defined as the length of the shortest program capable of producing x on a universal computer, such as a Turing machine. The conditional Kolmogorov complexity K(x|y) of x given y is defined as the length of the shortest program that computes x when y is given as an auxiliary input to the program. The function K(xy) is the length of the shortest program that outputs y concatenated to x.

The Similarity Metric — Ming Li et al.
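K(x) itself is uncomputable, and the next slides stand a real compressor in for it. A quick sanity check of that tactic, sketched here in Python with zlib (an assumption; any off-the-shelf compressor would illustrate the same point):

```python
import random
import zlib

random.seed(42)
regular = b"0101" * 1000                                   # a short description exists
noisy = bytes(random.getrandbits(8) for _ in range(4000))  # no short description exists

# Compressed size is a computable stand-in for Kolmogorov complexity K(x):
print(len(zlib.compress(regular, 9)))  # tiny: well under 100 bytes
print(len(zlib.compress(noisy, 9)))    # close to the raw 4000 bytes
```

A highly regular string compresses to almost nothing, while pseudo-random bytes barely compress at all, mirroring the low/high Kolmogorov complexity of the two strings.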
Ming Li

This is the optimal similarity measure, but intractable…

    d_k(x, y) = [ K(x|y) + K(y|x) ] / K(xy)

This is a cheap approximation…

    d_c(x, y) = [ C(x|y) + C(y|x) ] / C(xy)

C(x) means the size of x compressed by an off-the-shelf compression algorithm like WinZip, StuffIt, etc.

Ming Li, Xin Chen, Xin Li, Bin Ma, Paul M. B. Vitányi: The similarity metric. SODA 2003: 863-872

The (very) Minor Extension Proposed Here
This is the optimal similarity measure, but intractable…

    d_k(x, y) = [ K(x|y) + K(y|x) ] / K(xy)

This is a cheap approximation…

    d_c(x, y) = [ C(x|y) + C(y|x) ] / C(xy)

This is a cheaper approximation…

    CDM(x, y) = C(xy) / [ C(x) + C(y) ]

Note we flipped the numerator and denominator to go from similarity to dissimilarity.

Compression-Based Distance Measure

    CDM(x, y) = C(xy) / [ C(x) + C(y) ]

function dist = CDM(A, B)
save A.txt A -ASCII             % Save variable A as A.txt
zip('A.zip', 'A.txt');          % Compress A.txt
A_file = dir('A.zip');          % Get file information
save B.txt B -ASCII             % Save variable B as B.txt
zip('B.zip', 'B.txt');          % Compress B.txt
B_file = dir('B.zip');          % Get file information
A_n_B = [A; B];                 % Concatenate A and B
save A_n_B.txt A_n_B -ASCII     % Save A_n_B.txt
zip('A_n_B.zip', 'A_n_B.txt');  % Compress A_n_B.txt
A_n_B_file = dir('A_n_B.zip');  % Get file information
% Return CDM dissimilarity
dist = A_n_B_file.bytes / (A_file.bytes + B_file.bytes);

But wait! Isn’t the choice of compression algorithm itself a parameter?
No! We are trying to approximate the optimal Kolmogorov compression, so the best compression algorithm to use is the one that gives the smallest file size.

The clustering achieved by CDM on 16,300 symbols from the mitochondrial DNA of 12 primates, and one “outlier” species.
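For readers without MATLAB, the same measure can be sketched in a few lines of Python, with zlib standing in for the off-the-shelf compressor (an assumption; per the slide above, whichever compressor yields the smallest files is the right choice):

```python
import zlib

def cdm(x: bytes, y: bytes) -> float:
    """CDM(x, y) = C(xy) / (C(x) + C(y)); lower means more similar."""
    c = lambda s: len(zlib.compress(s, 9))
    return c(x + y) / (c(x) + c(y))

# Similar strings compress well together, so their CDM is lower:
print(cdm(b"abcabc" * 100, b"abcabc" * 100))  # well below 1
print(cdm(b"abcabc" * 100, b"zyxzyx" * 100))  # close to 1
```

When x and y are near-identical, compressing their concatenation costs little more than compressing x alone, driving the ratio toward 1/2; unrelated strings give a ratio near 1.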
[Figure: CDM dendrogram of primate mtDNA — Ring-Tailed Lemur, Malayan Flying Lemur, Capuchin, Gibbon, Sumatran Orangutan, Orangutan, Gorilla, Human, Pygmy Chimpanzee, Chimpanzee, Barbary Ape, Baboon, plus the outlier Oyster; taxonomy annotations: prosimians, hominoids, pongines, anthropoids, panines, cercopithecoids, catarrhines, greater apes, primates; one taxonomy level marked controversial.]
“This is the correct clustering for these species” — Sang-Hee Lee (noted physical anthropologist)

[Figure: the clustering achieved on the text from various Yahoo portals (Jan 15th, 2004): USA, Germany, Sweden, Norway, Denmark, Italy, Catalan, Brazil, Spain, Mexico, Argentina. The smallest webpage had 1,615 characters, excluding white spaces.]

[Figure: the clustering achieved on the text from the first fifty chapters of Genesis in Maori, English, Latin, Italian, Dutch, French, German, Norwegian and Danish.]

But DNA strings and natural language text are inherently discrete. Much of data mining is on real-valued data, like images, video and time series. Yeah, what about time series?

What about Time Series? Part I
In the example below, series 1 and 3 are very similar ECGs, yet their raw decimal encodings share almost no common text for a compressor to exploit:
1: 0.13812500000000 0.04875000000000 0.10375000000000 0.17875000000000 0.24093750000000 …
2: 0.51250000000000 0.50000000000000 0.50000000000000 0.47562500000000 0.45125000000000 …
3: 0.49561523437690 0.49604248046834 0.49653076171875 0.49706481933594 0.49750732421875 …

What about Time Series? Part II
[Figure: a time series of 120 points discretized into the SAX word “baabccbc” over the three-symbol alphabet {a, b, c}.]
SAXIFY! In the SAX representation, every bit has about the same importance…
1: 0.13812500000000 0.04875000000000 0.10375000000000 … → baabccbc
(series 2 and 3 are discretized the same way)

How well does CDM time series clustering work?
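The “SAXIFY” step can be sketched as follows — a minimal, hypothetical rendition of SAX (z-normalize, average into frames, then cut the standard normal into equiprobable regions); the breakpoints below are for the slide’s three-symbol alphabet, and the input length is assumed divisible by the word length:

```python
import math

# Breakpoints that cut a standard normal into 3 equiprobable regions
BREAKPOINTS = [-0.43, 0.43]

def saxify(series, word_length=8):
    """Sketch of SAX: z-normalize, piecewise-aggregate, then discretize."""
    n = len(series)
    mu = sum(series) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in series) / n) or 1.0
    z = [(v - mu) / sd for v in series]
    # Piecewise Aggregate Approximation: the mean of each of word_length frames
    frame = n // word_length
    paa = [sum(z[i * frame:(i + 1) * frame]) / frame for i in range(word_length)]
    # Map each frame mean to 'a', 'b' or 'c' by counting breakpoints below it
    return "".join("abc"[sum(v > b for b in BREAKPOINTS)] for v in paa)

print(saxify(list(range(64))))  # a rising ramp -> "aaabbccc"
```

Once the series is a short string, the CDM machinery above applies to it unchanged.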
• We compared to 51 other measures (we re-implemented every time series measure from the SIGKDD, SIGMOD, ICDM, ICDE, VLDB, ICML, SDM, PKDD and PAKDD conferences in the last decade).
• We optimized the parameters of the above.
• We tested on diverse problems:
  • Homogeneous datasets: all ECGs, etc.
  • Heterogeneous datasets: a mixed bag of 18 diverse datasets.
• We used a formal metric to score the clusterings (but here we will just show a few pictures).

CDM does best (by a huge margin) on the heterogeneous data problems.
[Figure: CDM clustering of the 18 paired heterogeneous datasets — Reel 2 tension / angular speed; Koski ECG fast 1, 2 / slow 1, 2; dryer hot gas exhaust / fuel flow rate; Ocean 1, 2; evaporator vapor flow / feed flow; furnace cooling input / heating input; Great Lakes (Ontario / Erie); buoy sensor east / north salinity; sunspots 1749–1869 / 1869–1990; exchange rates (German Mark / Swiss Franc); foetal ECG thoracic / abdominal; Balloon1 / Balloon2 (lagged); Dutch and Italian power demand, Jan–March / April–June.]

CDM does best (by a huge margin) on the homogeneous data problems.
[Figure: clusterings of ECG records from the MIT-BIH Noise Stress Test Database (nstdb: record 118e6), the Long Term ST Database (ltstdb: record 20021), and the BIDMC Congestive Heart Failure Database (chfdb: records chf02 and chf15) — the second best clustering (of 51 approaches) shown side by side with CDM’s clustering.]

OK, you can cluster stuff. What about other data mining problems, like classification, motif discovery, anomaly detection? Yes, what about anomaly detection?

Parameter-Free Anomaly Detection
The connection between the notion of similarity and anomalies is inherent in the English idiom.
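The slide does not spell out the formal scoring metric; a crude stand-in is to check that every within-class CDM distance is smaller than every between-class one. The toy strings below are hypothetical, and zlib again stands in for the compressor:

```python
import zlib

def cdm(x: bytes, y: bytes) -> float:
    c = lambda s: len(zlib.compress(s, 9))
    return c(x + y) / (c(x) + c(y))

# Two toy "classes"; members of a class share long repeated substrings
classes = {
    "ecg":  [b"pqrst-complex-" * 40, b"complex-pqrst-" * 40],
    "tide": [b"high-low-tide-" * 40, b"tide-high-low-" * 40],
}

within, between = [], []
for name, items in classes.items():
    within.append(cdm(items[0], items[1]))
    for other, others in classes.items():
        if other != name:
            between.append(cdm(items[0], others[0]))

# Prints True when the classes separate perfectly under CDM
print(max(within) < min(between))
```

Any clustering built on these distances (e.g. hierarchical linkage) will then recover the two groups.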
When confronted with an anomalous object or occurrence, people usually exclaim “I have never seen anything like it!” or “It is like nothing you have ever seen”.

A Parameter-Free Anomaly Detection Algorithm

function loc_of_anomaly = kolmogorov_anomaly(data)
loc_of_anomaly = 1;
while size(data,1) > 2
    if CDM(data(1:floor(end/2),:), data) < CDM(data(ceil(end/2):end,:), data)
        loc_of_anomaly = loc_of_anomaly + size(data,1) / 2;
        data = data(ceil(end/2):end,:);
    else
        data = data(1:floor(end/2),:);
    end
end

A problem from SIGKDD 2003…
[Figure: a 1,200-point time series, with the regions flagged by five approaches, A–E.]
A) Our approach
B) Support Vector Machine (SVM) based approach (6 parameters)
C) Immunology (IMM) inspired approach (5 parameters)
D) Association Rule (AR) based approach (5 parameters)
E) TSA-tree wavelet based approach (3 parameters)
For each experiment we spent one hour of CPU time, and one hour of human time trying to find the best parameters, and only reported the best results.

What happens when we change the anomaly?
[Figure: the same five approaches, A–E, on the modified 1,200-point series.]

A small excerpt from dataset 118e06 from the MIT-BIH Noise Stress Test Database. The full dataset is 21,600 data points long. Here, we show only a subsection (points 2000–5000) containing the two most interesting events detected by our algorithm. The gray markers are independent annotations by a cardiologist indicating Premature Ventricular Contractions.

Aerospace Corp Problems
The results of using our algorithm on various datasets from the Aerospace Corp collection. The bolder the line, the stronger the anomaly. Note that because of the way we plotted these, there is a tendency to locate the beginning of the anomaly as opposed to the most anomalous part.
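A Python transcription of the twelve-line algorithm above, with a zlib-based CDM standing in for the MATLAB CDM (an assumption). At each step it recurses into whichever half looks less like the whole remaining sequence:

```python
import zlib

def cdm(x: bytes, y: bytes) -> float:
    c = lambda s: len(zlib.compress(s, 9))
    return c(x + y) / (c(x) + c(y))

def kolmogorov_anomaly(data: bytes) -> int:
    """Binary-search for the anomalous region: keep whichever half
    is less similar (larger CDM) to the whole remaining sequence."""
    loc_of_anomaly = 0
    while len(data) > 2:
        mid = len(data) // 2
        if cdm(data[:mid], data) < cdm(data[mid:], data):
            loc_of_anomaly += mid   # the anomaly lies in the second half
            data = data[mid:]
        else:
            data = data[:mid]
    return loc_of_anomaly

loc = kolmogorov_anomaly(b"ab" * 512)
```

In the talk this is applied to SAX strings of the time series rather than raw samples; how sharply it localizes an anomaly depends on the data and the compressor.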
[Figure: four Aerospace Corp time series (L-1v, L-1t, L-1n, L-1u), roughly 1,000 points each, with the detected anomalies drawn in bold.]

[Figure: hand position over ~600 frames of an actor drawing a gun, with annotations: hand resting at side; hand above holster; actor misses holster; briefly swings gun at target, but does not aim; aiming at target; laughing and flailing hand.]
The parameter-free algorithm generalizes to multi-dimensional time series, without changing a single line of code!

What about classification? Yes, what about classification?

Parameter-Free (Structural) Time Series Classification

                 Euclidean   DTW       CDM
ECG: signal 1    42.25 %     16.25 %    6.25 %
ECG: signal 2    47.50 %     11.25 %    7.50 %
Gun: 2 classes    5.00 %      0.00 %    0.00 %
Gun: 4 classes   37.50 %     12.50 %    5.00 %

What are the limitations of parameter-free algorithms?
The data cannot be arbitrarily small, but the amazing clustering results shown here only used 1,000 datapoints per object!

Conclusions
Parameter-laden algorithms are bad!
• They are often hard to implement
• They make it hard to reproduce others’ results
• They make it difficult to judge the significance of a contribution
• They make it hard not to overfit
• They are next to useless to the non-specialist
Parameter-free/lite algorithms are good!
• They require little coding effort
• They make it trivial for others to reproduce your results
• They make it nearly impossible to overfit
• They have a real chance of being used by the non-specialist
You should use a parameter-free/lite approach before doing anything more complex!

Questions?
All datasets and code used in this talk can be found at www.cs.ucr.edu/~eamonn/TSDMA/index.html
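Results like those in the table come from distance-based classification; a one-nearest-neighbor sketch under CDM, where zlib again stands in for the compressor and the training strings are hypothetical stand-ins for SAX-discretized time series:

```python
import zlib

def cdm(x: bytes, y: bytes) -> float:
    c = lambda s: len(zlib.compress(s, 9))
    return c(x + y) / (c(x) + c(y))

def classify_1nn(query: bytes, train: list) -> str:
    """Label the query with the class of its CDM-nearest training example."""
    return min(train, key=lambda ex: cdm(query, ex[0]))[1]

# Hypothetical discretized training strings, one per class
train = [(b"heartbeat-normal-" * 40, "ECG"),
         (b"gunpoint-motion--" * 40, "Gun")]

# The query shares long substrings with the ECG exemplar
print(classify_1nn(b"normal-heartbeat-" * 40, train))  # -> ECG
```

Because CDM needs no parameters, the whole classifier has none either; only the choice of compressor remains, and the slide argues that choice resolves itself (pick whichever compresses smallest).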