Discovering the Intrinsic Cardinality and Dimensionality of Time Series using MDL

Bing Hu, Thanawin Rakthanmanon, Yuan Hao, Scott Evans, Stefano Lonardi, Eamonn Keogh
Department of Computer Science & Engineering
Reported by Wang Yawen

Outline
- Introduction
- Definitions and Notation
- MDL Modeling of Time Series
- Algorithm
- Experimental Evaluation
- Complexity
- Conclusion

Introduction
- Choose the best representation and abstraction level
- Discover the natural intrinsic representation model, dimensionality and alphabet cardinality of a time series
- Select the best parameters for particular algorithms
- An important sub-routine in algorithms for classification, clustering and outlier discovery
- Minimal Description Length (MDL) framework

Introduction
- Dimension reduction
  - Discrete Fourier Transform (DFT)
  - Discrete Wavelet Transform (DWT)
  - Adaptive Piecewise Constant Approximation (APCA)
  - Piecewise Linear Approximation (PLA)
- Choose the best abstraction level and/or representation of the data for a given task/dataset
- Useful in its own right to understand/describe the data, and an important sub-routine in algorithms for classification, clustering and outlier discovery

Introduction
- Actual cardinality: 14, 500, 62
- Intrinsic cardinality: 2, 2, 12

Introduction
- Objective
  - Not simply to save memory
  - There is increasing interest in using specialized hardware for data mining, but the complexity of implementing data mining algorithms in hardware typically grows superlinearly with the cardinality of the alphabet
  - Some data mining algorithms benefit from having the data represented in the lowest meaningful cardinality

Introduction
- Objective
  - Most time series indexing algorithms critically depend on the ability to reduce the dimensionality or the cardinality of the time series, and on searching over the compacted representation in main memory
  - Remove the spurious precision induced by a cardinality/dimensionality that is too high in resource-limited devices
  - Create very simple outlier detection models

Introduction
- MDL framework
  - Automatically discover the parameters that reflect the intrinsic model/cardinality/dimensionality of the data
  - Without requiring external information or an expensive cross-validation search

Definitions and Notations
- MDL is defined for discrete values
- Reduce the original number of possible values to a manageable amount
- The quantization makes no perceptible difference

Definitions and Notations
- DL(T): how many bits it takes to represent a time series T

Definitions and Notations
- Convert a given time series to another representation or model
- DFT, APCA, PLA

Definitions and Notations
- DL(H): model cost
- DL(T|H): correction cost (description cost or error term)
- DL(T|H) = DL(T-H)

MDL Modeling of Time Series
- APCA
- Mean 8
- 16 possible values, so DL(H) = log2(16) = 4 bits

Algorithm
- Discover the intrinsic cardinality and dimensionality of an input time series
- Find the right model or data representation for the given time series

Algorithm
- APCA
  - Constant lines
  - Dimensionality: m/2
  - d constant segments
  - d-1 pointers to indicate the offset of the end of each segment

Algorithm
- PLA
  - Starting value
  - Ending value
  - Ending offset

Algorithm
- DFT
  - Linear combination of sine waves
  - Half set of all coefficients
  - Subsets of the half set of coefficients to approximately regenerate T
    - Sort by absolute value
    - Use top-d coefficients
    - Inverse DFT
  - Constant bits (32 bits) for the max and min values of the real parts and of the imaginary parts

Experimental Evaluation
- A detailed example on a famous problem
- Baseline
  - L-Method: explain the residual error vs. size-of-model curve using all possible pairs of two regression lines [10]
  - Bayesian Information Criterion based method [4]

Experimental Evaluation
- An example application in physiology

Experimental Evaluation
- An example application in astronomy
- Anomaly detector

Experimental Evaluation
- An example application in cardiology

Experimental Evaluation
- An example application in geosciences

Complexity
- Space complexity
  - Linear in the size of the original data
- Time complexity
  - O(m log2 m)

Conclusion
- Simple methodology based on MDL
- Robustly specify the intrinsic model, cardinality and dimensionality of time series data from a wide variety of domains
- General and parameter-free
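
As a supplement to the Definitions and Algorithm slides, the trade-off between the model cost DL(H) and the correction cost DL(T|H) = DL(T-H) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the slides do not give the formula for DL, so the sketch assumes an ideal entropy coder; it uses equal-length constant segments as a simplified stand-in for APCA (which adapts segment lengths); and all function names are hypothetical. The model cost follows the APCA slide: d segment values plus d-1 end-of-segment pointers.

```python
import math
from collections import Counter


def description_length(values):
    """Bits needed for a sequence of integers, assuming an ideal entropy coder."""
    m = len(values)
    counts = Counter(values)
    entropy = -sum((c / m) * math.log2(c / m) for c in counts.values())
    return m * entropy


def piecewise_constant(series, d):
    """Hypothesis H: d equal-length segments, each replaced by its rounded mean.

    Assumption: equal-length segments stand in for APCA's adaptive segments.
    """
    m = len(series)
    bounds = [i * m // d for i in range(d + 1)]
    model = []
    for start, end in zip(bounds, bounds[1:]):
        segment = series[start:end]
        model.extend([round(sum(segment) / len(segment))] * len(segment))
    return model


def mdl_cost(series, d, cardinality):
    """DL(H) + DL(T|H) for a d-segment piecewise-constant hypothesis."""
    m = len(series)
    model = piecewise_constant(series, d)
    # d segment values at log2(cardinality) bits each, plus d-1 pointers
    # at log2(m) bits each (kept to mirror the APCA slide).
    model_cost = d * math.log2(cardinality) + (d - 1) * math.log2(m)
    residual = [t - h for t, h in zip(series, model)]  # T - H
    return model_cost + description_length(residual)


def intrinsic_dimensionality(series, cardinality):
    """Return the d in 1..m/2 with the smallest total description length."""
    candidates = range(1, len(series) // 2 + 1)
    return min(candidates, key=lambda d: mdl_cost(series, d, cardinality))


# A two-level series with small alternating noise, values in 0..15.
series = [2, 3] * 10 + [10, 11] * 10
best_d = intrinsic_dimensionality(series, cardinality=16)
```

For this toy series `best_d` comes out as 2: a single segment leaves a large residual to encode, while more than two segments pay for extra values and pointers without shrinking the residual.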
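
The DFT steps listed in the Algorithm section (take the half set of coefficients, sort by absolute value, keep the top d, apply the inverse DFT) can be illustrated in the same way. The sketch below is a naive O(m^2) transform written with the standard library only, for clarity rather than speed; the function names are hypothetical, and the quantization of coefficients mentioned on the slide is not modeled.

```python
import cmath
import math


def half_dft(series):
    """DFT coefficients for k = 0..m/2 (the half set that suffices for real input)."""
    m = len(series)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / m) for n, x in enumerate(series))
        for k in range(m // 2 + 1)
    ]


def reconstruct_top_d(series, d):
    """Approximately regenerate the series from its d largest-magnitude coefficients."""
    m = len(series)
    coeffs = half_dft(series)
    # Sort coefficient indices by absolute value and keep the top d.
    keep = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)[:d]
    approx = []
    for n in range(m):
        total = 0.0
        for k in keep:
            term = (coeffs[k] * cmath.exp(2j * math.pi * k * n / m)).real
            # The DC term (and the Nyquist term for even m) appears once in the
            # full spectrum; every other kept coefficient has a conjugate twin.
            weight = 1 if k == 0 or (m % 2 == 0 and k == m // 2) else 2
            total += weight * term
        approx.append(total / m)
    return approx


# A constant offset plus one sine wave: two coefficients describe it fully.
series = [3 + 2 * math.cos(2 * math.pi * 2 * n / 16) for n in range(16)]
approx = reconstruct_top_d(series, 2)
```

Here two coefficients (the constant offset and the single frequency) regenerate the series up to floating-point error, while `reconstruct_top_d(series, 1)` keeps only the offset and returns a flat line.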