Discovering the Intrinsic Cardinality and Dimensionality of Time Series using MDL
Bing Hu, Thanawin Rakthanmanon, Yuan Hao, Scott Evans, Stefano Lonardi, Eamonn Keogh
Department of Computer Science & Engineering
Reported by Wang Yawen

Outline
Introduction
Definitions and Notation
MDL Modeling of Time Series
Algorithm
Experimental Evaluation
Complexity
Conclusion

Introduction
Choose the best representation and abstraction level.
Discover the natural intrinsic representation model, dimensionality and alphabet cardinality of a time series.
Select the best parameters for particular algorithms.
An important sub-routine in algorithms for classification, clustering and outlier discovery.
Minimal Description Length (MDL) framework.

Dimension reduction:
Discrete Fourier Transform (DFT)
Discrete Wavelet Transform (DWT)
Adaptive Piecewise Constant Approximation (APCA)
Piecewise Linear Approximation (PLA)
Choose the best abstraction level and/or representation of the data for a given task/dataset.
Useful in its own right to understand/describe the data, and an important sub-routine in algorithms for classification, clustering and outlier discovery.

Actual cardinality: 14, 500, 62
Intrinsic cardinality: 2, 2, 12

Objective:
Not simply to save memory.
There is increasing interest in using specialized hardware for data mining, but the complexity of implementing data mining algorithms in hardware typically grows superlinearly with the cardinality of the alphabet.
Some data mining algorithms benefit from having the data represented in the lowest meaningful cardinality.
Most time series indexing algorithms critically depend on the ability to reduce the dimensionality or the cardinality of the time series, and on searching over the compacted representation in main memory.
Remove the spurious precision induced by a cardinality/dimensionality that is too high in resource-limited devices.
Create very simple outlier detection models.

MDL framework:
Automatically discover the parameters that reflect the intrinsic model/cardinality/dimensionality of the data.
No external information or expensive cross-validation search is required.

Definitions and Notation
MDL is defined for discrete values.
Reduce the original number of possible values to a manageable amount.
The quantization makes no perceptible difference.
DL(T): how many bits it takes to represent a time series T.
Convert a given time series to another representation or model: DFT, APCA, PLA.
DL(H): model cost
DL(T|H): correction cost (description cost or error term)
DL(T|H) = DL(T-H)

MDL Modeling of Time Series
APCA: mean 8; 16 possible values, so DL(H) = 4 bits.

Algorithm
Discover the intrinsic cardinality and dimensionality of an input time series.
Find the right model or data representation for the given time series.

APCA:
Constant lines
Dimensionality: m/2
d constant segments
d-1 pointers to indicate the offset of the end of each segment

PLA:
Starting value
Ending value
Ending offset

DFT:
Linear combination of sine waves
Half set of all coefficients
Subsets of half the coefficients to approximately regenerate T
Sort by absolute value
Use top-d coefficients
Inverse DFT
Constant bits (32 bits) for the max and min values of the real parts and of the imaginary parts

Experimental Evaluation
A detailed example on a famous problem
Baselines:
L-Method: explain the residual error vs. size-of-model curve using all possible pairs of two regression lines [10]
Bayesian Information Criterion based method [4]
An example application in physiology
An example application in astronomy: anomaly detector
An example application in cardiology
An example application in geosciences

Complexity
Space complexity: linear in the size of the original data
Time complexity: O(m log2 m)

Conclusion
Simple methodology based on MDL
Robustly specify the intrinsic model, cardinality and dimensionality of time series data from a wide variety of domains
General and parameter-free
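
The trade-off between model cost DL(H) and correction cost DL(T|H) = DL(T-H) described in the slides can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the description length of a sequence is its length times the empirical entropy of its values, and that each of the d-1 segment pointers costs log2(m) bits; neither formula survives in the transcript. The function names and the toy series are hypothetical.

```python
import math
from collections import Counter


def description_length(values):
    """Bits to encode a sequence of integers.

    Assumed definition: length times empirical entropy (in bits).
    """
    n = len(values)
    counts = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return n * entropy


def piecewise_constant(series, ends):
    """Build an APCA-like model: one rounded mean per segment.

    `ends` holds the exclusive end index of each segment; the last
    entry must equal len(series).
    """
    model, start = [], 0
    for end in ends:
        segment = series[start:end]
        level = round(sum(segment) / len(segment))
        model.extend([level] * len(segment))
        start = end
    return model


def reduced_description_length(series, ends, cardinality):
    """DL(H) + DL(T - H) for a piecewise-constant hypothesis H."""
    m, d = len(series), len(ends)
    # Model cost: d segment values plus d-1 end-of-segment pointers.
    # The log2(m) bits per pointer is an assumption of this sketch.
    model_bits = d * math.log2(cardinality) + (d - 1) * math.log2(m)
    model = piecewise_constant(series, ends)
    residual = [t - h for t, h in zip(series, model)]
    return model_bits + description_length(residual)


# Hypothetical toy series with 16 possible values (4 bits per value):
# a noisy step from level 2 to level 12.
series = [2, 2, 3, 2, 2, 2, 1, 2, 12, 12, 12, 13, 12, 12, 11, 12]

flat_cost = reduced_description_length(series, [16], cardinality=16)
step_cost = reduced_description_length(series, [8, 16], cardinality=16)

# The hypothesis with the smaller total cost is preferred under MDL.
best = "two segments" if step_cost < flat_cost else "one segment"
print(f"one segment: {flat_cost:.2f} bits, two segments: {step_cost:.2f} bits")
print("preferred:", best)
```

On this toy series the two-segment hypothesis pays more for the model but far less for the residual, so its total cost is lower. The same comparison, repeated over candidate dimensionalities and cardinalities, is the kind of search the Algorithm slides describe.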