Download Discovering the Intrinsic Cardinality and Dimensionality of Time

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Mixture model wikipedia , lookup

Cluster analysis wikipedia , lookup

Expectation–maximization algorithm wikipedia , lookup

K-means clustering wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

K-nearest neighbors algorithm wikipedia , lookup

Transcript
Discovering the Intrinsic
Cardinality and Dimensionality
of Time Series using MDL
BING HU
THANAWIN RAKTHANMANON
YUAN HAO SCOTT EVANS1
STEFANO LONARDI
EAMONN KEOGH
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
REPORTED BY WANG YAWEN
Outline

Introduction

Definitions and Notation

MDL Modeling of Time Series

Algorithm

Experimental Evaluation

Complexity

Conclusion
Introduction

Choose the best representation and abstraction
level

Discover the natural intrinsic representation model,
dimensionality and alphabet cardinality of a time
series


Select the best parameters for particular algorithms

An important sub-routine in algorithms for classification,
clustering and outlier discovery
Minimal Description Length(MDL) fame work
Introduction

Dimension reduction

Discrete Fourier Transform(DFT)

Discrete Wavelet Transform(DWT)

Adaptive Piecewise Constant Approximation(APCA)

Piecewise Linear Approximation(PLA)

Choose the best abstraction level and/or
representation of the data for a given task/dataset

Useful in its own right to understand/describe the
data and an important sub-routine in algorithms for
classification, clustering and outlier discovery
Introduction

Actual cardinality: 14, 500, 62

Intrinsic cardinality: 2, 2, 12
Introduction

Objective

Not simply save memory

Increasing interest in using specialized hardware for
data mining, but the complexity of implementing data
mining algorithms in hardware typically grows super
linearly with the cardinality of the alphabet

Some data mining benefit from having the data
represented in the lowest meaningful cardinality
Introduction

Objective

Most time series indexing algorithms critically depend
on the ability to reduce the dimensionality or the
cardinality of the time series, and searching over the
compacted representation in main memory

Remove the spurious precision induced by a
cardinality/dimensionally that is too high in resourcelimited devices

Create very simple outlier detection models
Introduction

MDL framework

Automatically discover the parameters that reflect the
intrinsic model/cardinality/dimensionally of the data

Without requiring external information or expensive
cross validation search
Definitions and Notations

MDL is defined for discrete values

Reduce the original number of possible
values to a manageable amount

The quantization makes no perceptible
difference
Definitions and Notations
Definitions and Notations

How many bits it takes to represent a time series T
Definitions and Notations

Convert a given time series to other representation
or model

DFT, APCA, PLA
Definitions and Notations

DL(H): model cost

DL(T|H): correction cost(description cost or error term)

DL(T|H) = DL(T-H)
MDL Modeling of Time Series
MDL Modeling of Time Series

APCA

Mean 8

16 possible values, DL(H) = 4
MDL Modeling of Time Series
MDL Modeling of Time Series
Algorithm

Discover the intrinsic cardinality and dimensionality
of an input time series

Find the right model or data representation for the
given time series
Algorithm
Algorithm

APCA

Constant lines

Dimensionality: m/2

d constant segments

d-1 pointers to
Indicate the offset of the
end of each segment
Algorithm

PLA

Starting value

Ending value

Ending offset
Algorithm

DFT

Linear combination of sine waves

Half set of all coefficients

Subsets of half coef to
approximately regenerate T


Sort by absolute value

Use top-d coefficients

inverseDFT
Constant bits(32 bits) for max and
min value of the real parts and of
the imaginary parts Hence
Experimental Evaluation

A detailed example on a famous problem

Baseline

L-Method: explain the residual error vs. size-of-model curve
using all possible pairs of two regression lines10

Bayesian Information Criterion based method4
Experimental Evaluation

An example application in physiology
Experimental Evaluation

An example application in astronomy

Anomaly detector
Experimental Evaluation

An example application in cardiology
Experimental Evaluation

An example application in geosciences
Complexity

Space complexity


Linear in the size of the original data
Time complexity

O(mlog2m)
Conclusion

Simple methodology based on MDL

Robustly specify the intrinsic model, cardinality and
dimensionality of time series data from a wide variety of
domains

General and parameter-free