International Journal of Computer Engineering and Applications, Volume IX, Issue V, May 2015
www.ijcea.com ISSN 2321-3469
TIME SERIES DATA MINING RESEARCH PROBLEM, ISSUES,
MODELS, TRENDS AND TOOLS
Dr.Ilango Velchamy1 Dr.Uma Ilango2 Nitya Ramesh3
1
Professor, New Horizon College of Engineering, Bangalore
2
Professor, Christ University, Bangalore
3
Assistant Professor, New Horizon College of Engineering, Bangalore
ABSTRACT:
A time series is a collection of observations made chronologically. The increasing use of
time series data has initiated a great deal of research and development in the field of data
mining. The abundance of research on time series data mining in the last decade can, however,
hamper the entry of interested researchers because of its complexity. In this paper, a
comprehensive review of existing time series data mining research is given. The work is broadly
categorized by representation, problems, applications, and suitable computing packages, and
state-of-the-art research issues are also highlighted. The primary objective of this paper is to
serve as a glossary for interested researchers, giving an overall picture of current time series
data mining developments and helping them identify potential directions for further
investigation.
Keywords: Trends, Models, Representations, Terminology and Computing
[1] INTRODUCTION
Time-series is a data type represented by a sequence of data points sampled at successive
times and usually found as the sequence of real or integer numbers with or without attached time
stamps. Today more data is being collected than ever before. A great deal of this data is in the
form of time series data. Time series data is a collection of observations that is recorded
over some period at regular or irregular intervals. Time series data are becoming
extremely valuable to the operations of modern organizations. Time-series data arise
naturally from observations and reflect the evolution of some subject or the development of some
phenomenon in time. Since a time series is often the only way to store valuable and frequently
non-reproducible temporal information, time-series data are ubiquitous and important not
only in every scientific field but also in everyday life. According to [21], “The time-series plot is the
most frequently used form of graphic design. With one dimension marching along to the regular
rhythm of seconds, minutes, hours, days, weeks, months, years, or millennia, the natural ordering
of the time scale gives this design a strength and efficiency of interpretation found in no other
graphic arrangement.” Recently, the increasing use of temporal data, in particular time series data,
has initiated various research and development attempts in the field of data mining. Time series
is an important class of temporal data objects, and it can be easily obtained from scientific and
financial applications (e.g. electrocardiogram (ECG), daily temperature, weekly sales totals, and
prices of mutual funds and stocks). A time series is a collection of observations made
chronologically. Time series data are characteristically large in size, high dimensional,
and continuously updated. Moreover, time series data, being numerical and continuous in
nature, are usually considered as a whole rather than as individual numerical fields.
Therefore, unlike traditional databases, where similarity search is exact-match based, similarity
search in time series data is typically carried out in an approximate manner. There are various
kinds of time series research, for example, finding similar time series [4] [13],
subsequence searching in time series [24], dimensionality reduction [36, 46], and segmentation
[3]. These problems have been studied in considerable detail by both the database and pattern
recognition communities for different domains of time series data [38]. In the context of time
series data mining, the fundamental problem is how to represent the time series data. One
common approach is to transform the time series to another domain for dimensionality
reduction, followed by an indexing mechanism. Similarity measurement between time series
or time series subsequences, and segmentation, are two further core tasks underlying many
time series mining tasks. Based on the time series representation, different mining tasks can be
found in the literature, and they can be roughly classified into four fields: pattern discovery and
clustering, classification, rule discovery, and summarization. Some of the research concentrates
on one of these fields, while other work may address more than one. In this paper, a
comprehensive review of existing time series data mining research is given. The remainder of
the paper is organized as follows: Section 2 presents categories of time series models. Section 3
reviews the components of time series. Section 4 discusses recent research trends and issues.
Section 5 reviews various time series representation techniques. Section 6 compares computing
software packages and discusses important research problems and terminology.
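To make the contrast with exact-match search concrete, here is a minimal sketch (an illustration, not a method from any cited work): z-normalizing series before computing Euclidean distance makes the comparison approximate in the sense of matching shape rather than raw values. The helper names znorm and euclidean are our own.

```python
import numpy as np

# Minimal sketch of approximate whole-series matching: z-normalize so that
# offset and scale differences are ignored, then rank by Euclidean distance.
def znorm(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def euclidean(a, b):
    return float(np.linalg.norm(znorm(a) - znorm(b)))

query = [1.0, 2.0, 3.0, 2.0, 1.0]
candidates = {
    "scaled_copy": [10.0, 20.0, 30.0, 20.0, 10.0],  # same shape, different scale
    "reversed":    [3.0, 2.0, 1.0, 2.0, 3.0],
}
best = min(candidates, key=lambda k: euclidean(query, candidates[k]))
print(best)  # "scaled_copy": shape similarity, not exact value match
```

An exact-match search would find nothing here; the approximate view correctly pairs the query with its rescaled copy.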
[2] CATEGORIES OF TIME SERIES MODELS
Deterministic models: trigonometric model, linear trend, polynomial trend, exponential curve,
logistic curve, Gompertz curve.
Stochastic models: adjustment model, filter model, auto-prediction model, explanatory model.
Time domain models: stationary random time series, autocorrelation, autocovariance,
cross-correlation, white noise.
Frequency domain models: Fourier transformation, discrete-time Fourier transformation, fast
Fourier transformation, spectral distribution function, windowing, filter, cross spectrum.
Figure 1. Categories of Time Series Models
We can group all time series techniques into four broad categories: deterministic models,
stochastic models, time domain models, and frequency domain models, as shown in Figure 1. In
a deterministic model, the output is fully determined by the parameter values and the initial
conditions. Stochastic models possess some inherent randomness: the same set of parameter
values and initial conditions will lead to an ensemble of different outputs. The time domain
model represents a time series as a function of time. Its main concern is to explore whether the
time series has a trend (rising or declining) and, if so, to fit a forecasting model. The frequency
domain model is based on the assumption that the most regular, and hence predictable, behavior
of a time series is likely to be periodic. Thus, the main concern of this approach is to determine
the periodic components embedded in the time series. The choice between these models depends
essentially upon the types of questions being asked in different fields of study. For example,
economists have relatively greater interest in the time domain, whereas communication
engineers have greater interest in the frequency domain. However, combining the two
approaches, time-domain analysis (including Box-Jenkins ARIMA analysis) and spectral-domain
analysis (including Fourier, or spectral, analysis), would yield a better understanding of the data.
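As an illustration of the frequency-domain view (a sketch with synthetic data, not drawn from the paper), a discrete Fourier transform can reveal the periodic component of a series once the trend is removed:

```python
import numpy as np

# Illustrative sketch: a synthetic monthly-style series with a linear trend,
# a 12-step seasonal cycle, and noise.
rng = np.random.default_rng(0)
n = 240
t = np.arange(n)
series = 10 + 0.05 * t + 3 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, n)

# Remove the linear trend, then inspect the magnitude spectrum.
detrended = series - np.polyval(np.polyfit(t, series, 1), t)
spectrum = np.abs(np.fft.rfft(detrended))
freqs = np.fft.rfftfreq(n, d=1.0)

dominant = freqs[np.argmax(spectrum[1:]) + 1]   # skip the zero-frequency term
print(round(1 / dominant))                       # dominant period: 12
```

A time-domain analysis of the same series would instead fit the trend directly; as noted above, the two views are complementary.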
[3] TIME SERIES COMPONENTS
A time series is a time-ordered sequence of observations taken at regular intervals
(e.g., hourly, daily, weekly, monthly, quarterly, annually). The data may be
measurements of demand, earnings, profits, shipments, accidents, output, precipitation,
productivity, or the consumer price index. Forecasting techniques based on time-series
data are made on the assumption that future values of the series can be estimated from
past values. Although no attempt is made to identify variables that influence the series,
these methods are widely used, often with quite satisfactory results. Analysis of
time-series data requires the analyst to identify the underlying behavior of the series. This
can often be accomplished by merely plotting the data and visually examining the plot. One
or more patterns might appear: trends, seasonal variations, cycles, or variations around
an average. In addition, there will be random and perhaps irregular variations. These
behaviors can be described as follows. Trend refers to a long-term upward or downward
movement in the data; population shifts, changing incomes, and cultural changes often
account for such movements. Seasonality refers to short-term, fairly regular variations
generally related to factors such as the calendar or time of day. Restaurants,
supermarkets, and theaters experience weekly and even daily “seasonal” variations.
Cycles are wavelike variations of more than one year’s duration. These are often related
to a variety of economic, political, and even agricultural conditions. Irregular variations
are due to unusual circumstances such as severe weather conditions, strikes, or a major
change in a product or service. They do not reflect typical behavior, and their inclusion
in the series can distort the overall picture. Whenever possible, these should be identified
and removed from the data. Random variations are residual variations that remain after
all other behaviors have been accounted for.
These behaviors are illustrated in Figure 2. The small “bumps” in the plots represent
random variability.
Figure 2. Time Series Behaviors
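The additive view of these components can be sketched in a few lines. This is an illustrative classical decomposition, not a method from the paper; the helper name decompose_additive is our own. The trend is estimated with a centered moving average and the seasonal component with per-phase averages of the detrended values.

```python
import numpy as np

def decompose_additive(x, period):
    """Classical additive decomposition: x = trend + seasonal + residual."""
    x = np.asarray(x, dtype=float)
    # Trend: centered moving average; even periods use a 2 x (period) average.
    kernel = np.ones(period) / period
    if period % 2 == 0:
        kernel = np.convolve(kernel, [0.5, 0.5])
    valid = np.convolve(x, kernel, mode="valid")
    trend = np.full(len(x), np.nan)          # edges have no centered average
    offset = (len(kernel) - 1) // 2
    trend[offset:offset + len(valid)] = valid
    # Seasonal: average the detrended values at each position in the cycle.
    detrended = x - trend
    phase_means = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    phase_means -= phase_means.mean()        # seasonal effects sum to zero
    seasonal = np.tile(phase_means, len(x) // period + 1)[:len(x)]
    residual = x - trend - seasonal
    return trend, seasonal, residual

t = np.arange(48, dtype=float)
x = 0.5 * t + np.tile([2.0, -1.0, -2.0, 1.0], 12)   # linear trend + period-4 cycle
trend, seasonal, residual = decompose_additive(x, period=4)
print(np.nanmax(np.abs(residual)) < 1e-8)  # True: both components recovered
```

On this noise-free example the interior residual is zero, because a centered moving average over one full period annihilates a zero-sum seasonal pattern exactly.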
[4] RESEARCH TRENDS AND ISSUES
Time-series data mining has been an ever growing and stimulating field of study that has
continuously raised challenges and research issues over the past decade. We discuss in the
following open research issues and trends in time-series for the next decade.
Stream analysis. The last decade of hardware and network research has witnessed an explosion
of streaming technologies, driven by continuous advances in bandwidth capabilities. Streams are
continuously generated measurements that have to be processed at massive and fluctuating data
rates. Analyzing and mining such data flows are computationally demanding tasks. Several
papers review research issues for data stream mining [26] or management [27]. Algorithms
designed for static datasets have usually not been sufficiently optimized to handle such
continuous volumes of data. Many models have already been extended to handle data streams,
such as clustering [20], classification [28], segmentation [51], and anomaly detection [18].
Novel techniques will be required, designed specifically to cope with ever-flowing data streams.

Convergence and hybrid approaches. Many new tasks can be derived through a relatively easy
combination of already existing tasks. For instance, [57] proposed three approaches, polynomial,
DFT, and probabilistic, to predict unknown values that have not yet been fed into the system and
to answer queries based on forecast data. This work shows that future research can rely on the
convergence of several tasks, which could potentially lead to powerful hybrid approaches.

Embedded systems and resource-constrained environments. With advances in hardware
miniaturization, new requirements are imposed on analysis techniques and algorithms.
Embedded systems have very limited memory and cannot have permanent access to it.
Furthermore, sensor networks usually generate huge amounts of streaming data. There is thus a
vital need to design space-efficient techniques, in terms of both memory consumption and
number of accesses. An interesting solution has recently been proposed in [80].

User interaction. Time-series data mining is increasingly dedicated to application-specific
systems. The ultimate goal of such methods is to mine for higher-order knowledge and propose
a set of solutions to the user. It therefore seems natural to include a user-interaction scheme to
allow dynamic exploration and refinement of the solutions. An early proposal by [30] allows
relevance feedback to improve the querying process: from the best results of a query, the user
is able to assign positive or negative influences to the series.

Exhaustive benchmarking. A wide range of systems and algorithms has been proposed over the
past few years. Individual proposals are usually submitted together with specific datasets and
evaluation methods that prove the superiority of the new algorithm. There is still a need for a
common and exhaustive benchmarking system to perform objective testing. Another highly
challenging task is to develop a procedure for real-time accuracy evaluation.

Link to shape analysis. Shape analysis has also been a matter of discussion over the past few
years. There is an astonishing resemblance between the tasks that have been examined, such as
query by content [11], classification [32], clustering [58], segmentation [70], and even motif
discovery [77]. As a matter of fact, there is a deeper connection between these two fields, as
recent work shows the numerous inherent links between them. [10] studied the problem of
classifying ordered sequences of digital images. As presented earlier, [80] proposed to extract a
time series from the contour of an image, introducing time-series shapelets that represent the
most informative part of an image and allow one to easily discriminate between image classes.
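The constraints of stream analysis discussed above can be made concrete with a small sketch (our own illustration, not from the cited works): Welford's online algorithm maintains a running mean and variance in O(1) memory, so each new point of an unbounded stream can be scored as it arrives.

```python
# Sketch of a streaming-friendly anomaly check (illustrative, not from the paper):
# Welford's online algorithm keeps a running mean/variance in constant memory,
# so every point of an unbounded stream can be scored on arrival.
class StreamingZScore:
    def __init__(self, threshold=3.0):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0            # running sum of squared deviations
        self.threshold = threshold

    def update(self, x):
        """Score x against the history seen so far, then fold it in."""
        anomalous = False
        if self.n >= 2:
            std = (self.m2 / (self.n - 1)) ** 0.5
            anomalous = std > 0 and abs(x - self.mean) / std > self.threshold
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

detector = StreamingZScore()
stream = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.1, 50.0]
flags = [detector.update(x) for x in stream]
print(flags)  # only the final point is flagged
```

Welford's update is numerically stable and needs only three scalars of state, which is exactly the space-efficiency that the resource-constrained settings above demand.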
[5] REPRESENTATION
Many time series representation techniques have been investigated, each offering different
trade-offs among the desirable properties discussed at the end of this section. A classification
of time series representations is depicted in Figure 3.
Figure 3. Time Series Representations
In order to perform such a classification, we follow the taxonomy of [55] by dividing
representations into three categories, namely nondata adaptive, data adaptive, and model based.
Nondata Adaptive: In nondata-adaptive representations, the parameters of the transformation
remain the same for every time series regardless of its nature. The first nondata-adaptive
representations were drawn from spectral decompositions. The DFT was used in the seminal
work of [4]. It projects the time series onto a basis of sine and cosine functions [22] in the real
domain; the resulting representation is a set of sinusoidal coefficients. Instead of using a fixed
set of basis functions, the DWT uses scaled and shifted versions of a mother wavelet function
[9]. This gives a multiresolution decomposition in which low frequencies are measured over
larger intervals, thus providing better accuracy [65]. A large number of wavelet functions have
been used in the literature, such as the Haar wavelet [15]. The Discrete Cosine Transform
(DCT), which uses only a cosine basis, has also been applied to time-series mining [33].
However, it has been shown that it does not offer any advantage over the previously cited
decompositions [55]. Other approaches more specific to time series have been proposed. The
Piecewise Aggregate Approximation (PAA) introduced by [43] represents a series through the
mean values of consecutive fixed-length segments. An extension of PAA with a multiresolution
property (MPAA) has been proposed in [53]. [2] suggested extracting a sequence of
amplitude-level-wise local features (ALF) to represent the characteristics of local structures; it
was shown in [19] that this proposal provides weak results. Random projections have been used
for representation in [30]; in this case, each time series enters a convolution product with k
random vectors drawn from a multivariate standard distribution. This approach has recently
been combined with spectral decompositions by [68] for the purpose of answering statistical
queries over streams.
Data Adaptive: In data-adaptive representations, the parameters of a transformation are
modified depending on the data available. By adding a data-sensitive selection step, almost all
nondata-adaptive methods can become data adaptive. For spectral decompositions, this usually
consists of selecting a subset of the coefficients, an approach that has been applied to the DFT
[76] and the DWT [75]. A data-adaptive version of PAA has been proposed in [60], with vector
quantization being used to create a codebook of recurrent subsequences. This idea has been
adapted to allow for multiple resolution levels; however, this approach has only been tested on
smaller datasets. A similar approach has been undertaken in [74], with a codebook based on
motion vectors being created to spot gestures.
However, it has been shown to be computationally less efficient than SAX. Several inherently
data-adaptive representations have also been used. SVD has been proposed in [33] and later
enhanced for streams [67]. However, SVD requires the computation of eigenvalues for large
matrices and is therefore far more expensive than the other schemes mentioned. It has recently
been adapted to find multiscale patterns in time-series streams [63]. PLA [72] is a widely used
approach for the segmentation task; the set of polynomial coefficients can be obtained either by
interpolation [39] or by regression. Many derivatives of this technique have been introduced.
APCA [45] uses constant approximations per segment instead of polynomial fitting. Indexable
PLA has been proposed by [17] to speed up the indexing process. [24] put forward an approach
based on PLA to answer queries about the recent past with greater precision than older data,
calling such representations amnesic. The method of using a segmentation algorithm as a
representational tool has been extensively investigated; the underlying idea is that segmenting a
time series can be equated with representing the most salient features of a series while
considerably reducing its dimensionality. [77] proposed a pattern-based representation of time
series: the input series is approximated by a set of concave and convex patterns to improve the
subsequence matching process. [81] proposed a pattern representation of time series to extract
outlier values and noise. The polynomial shape space representation [23] is a subspace
representation consisting of trend-aspect estimators of a time series. [8] put forward a two-level
approach to recognize gestures, describing individual trajectories with key points and then
characterizing gestures through the global properties of the trajectories.
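Several of the representations above (PAA, MPAA, APCA) build on the idea of replacing fixed-length segments with their means. A minimal PAA sketch, assuming for simplicity that the segment count divides the series length:

```python
import numpy as np

# Illustrative sketch of Piecewise Aggregate Approximation (PAA):
# a series of length n is reduced to w mean values over equal-length segments.
def paa(series, n_segments):
    series = np.asarray(series, dtype=float)
    if len(series) % n_segments != 0:
        raise ValueError("this simple sketch assumes n_segments divides len(series)")
    return series.reshape(n_segments, -1).mean(axis=1)

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 11.0, 9.0])
print(paa(x, 2))  # two segment means: 2.5 and 10.5
```

The eight-point series is reduced to two coefficients, which is exactly the kind of dimensionality reduction the indexing mechanisms discussed earlier rely on.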
Instead of producing a numeric output, it is also possible to discretize the data into symbols.
This conversion into a symbolic representation also offers the advantage of implicitly
performing noise removal through complexity reduction. A relational tree representation is used
in [7]: nonterminal nodes of the tree correspond to valleys and terminal nodes to peaks in the
time series. The Symbolic Aggregate approximation (SAX) [59], based on the same underlying
idea as PAA, uses equal-frequency histograms on sliding windows to create a sequence of short
words. An extension of this approach, called indexable Symbolic Aggregate approximation
(iSAX) [73], has been proposed to make fast indexing possible by providing zero overlap at leaf
nodes. The grid-based representation [1] places a two-dimensional grid over the time series; the
final representation is a bit string describing which values were kept and which bins they were
in. Another possibility is to discretize the series into a binary string, a technique called clipping
[66]. Each bit indicates whether the series is above or below the average, so the series can be
manipulated very efficiently. In [6] this is done using the median as the clipping threshold.
Clipped series offer the advantage of allowing direct comparison with raw series, thus providing
a tighter lower-bounding metric. Recently, a very interesting approach based on primitives
called shapelets has been proposed in [79]: subsequences that are maximally representative of a
class and thus fully discriminate between classes through the use of a dictionary. This approach
can be considered a step towards bridging the gap between time series and shape analysis.
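A SAX-style discretization can be sketched as follows. This is our own hedged illustration (the function name sax_words is an assumption, not from [59]): the series is z-normalized, reduced with PAA, and each segment mean is mapped to a letter using the standard Gaussian breakpoints for a four-symbol alphabet.

```python
import numpy as np

# Hedged sketch of SAX-style discretization: z-normalize, PAA-reduce, then map
# each segment mean to a symbol via Gaussian breakpoints (4-letter alphabet).
def sax_words(series, n_segments=4, breakpoints=(-0.6745, 0.0, 0.6745)):
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()                      # z-normalize
    means = x.reshape(n_segments, -1).mean(axis=1)    # PAA step
    symbols = np.searchsorted(breakpoints, means)     # bin index per segment
    return "".join("abcd"[i] for i in symbols)

x = [2.0, 2.1, 1.9, 2.0, 8.0, 8.2, 7.9, 8.1]
print(sax_words(x, n_segments=2))  # "ad": low first half, high second half
```

The resulting short word can be indexed and compared cheaply, which is what makes extensions such as iSAX suitable for fast indexing.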
Model Based: The model-based approach assumes that the observed time series has been
produced by an underlying model; the goal is thus to find the parameters of such a model as a
representation. Two time series are then considered similar if they have been produced by the
same set of parameters driving the underlying model. Several parametric temporal models may
be considered, including statistical modeling by feature extraction [61], ARMA models [31],
and Markov Chains (MCs) [70]. MCs are simpler than Hidden Markov Models (HMMs), so
they fit shorter series well, but their expressive power is far more limited. The time-series
bitmaps introduced in [34] can also be considered a model-based representation, even though
they mainly aim at providing a visualization of time series.
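The model-based view can be illustrated with a toy sketch (our own assumption for illustration, not a method from the cited works): each series is represented by a single fitted AR(1) coefficient, and series are compared in parameter space rather than value space.

```python
import numpy as np

# Illustrative model-based representation: fit an AR(1) coefficient by least
# squares and compare series through their fitted parameters.
def ar1_coefficient(x):
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1])

rng = np.random.default_rng(1)

def simulate_ar1(phi, n=2000):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

a, b = simulate_ar1(0.8), simulate_ar1(0.78)
c = simulate_ar1(-0.5)
# a and b were generated by nearly the same parameters; c was not.
print(abs(ar1_coefficient(a) - ar1_coefficient(b))
      < abs(ar1_coefficient(a) - ar1_coefficient(c)))  # True
```

Under this representation, similarity means "plausibly generated by the same model," which is the notion the paragraph above describes.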
As mentioned earlier, time series are essentially high-dimensional data. Defining algorithms
that work directly on the raw time series would therefore be computationally too expensive.
The main motivation of representations is thus to emphasize the essential characteristics of the
data in a concise way. Additional benefits gained are efficient storage, speedup of
processing, as well as implicit noise removal. These basic properties lead to the following
requirements for any representation: significant reduction of the data dimensionality,
emphasis on fundamental shape characteristics on both local and global scales, low
computational cost for computing the representation, good reconstruction quality from the
reduced representation, insensitivity to noise or implicit noise handling.
[6] COMPARISON OF COMPUTING PACKAGES
Important computer packages for time series analysis include Splus, Matlab, Mathematica, SAS,
SPSS, TSP, and STAMP. Here, three important time series packages based on ARIMA
modeling have been reviewed, and their features are compared in Table 1. SAS and SPSS
each have a single program for ARIMA time-series modeling. SPSS has additional programs for
producing time-series graphs and for forecasting seasonally adjusted time series. SAS has a
variety of programs for time-series analysis, including a special one for seasonally adjusted time
series. SYSTAT also has a time-series program, but it does not support intervention analysis.
SPSS Package: SPSS has a number of time-series procedures, only a few of which are
relevant to the ARIMA models considered here. ACF produces autocorrelation and partial
autocorrelation plots, CCF produces cross-correlation plots, and TSPLOT produces the
time-series plot itself. All of the plots are produced in high-resolution form; ACF and CCF plots
are also shown in low-resolution form with numerical values. This is the only program that
permits alternative specification of standard errors for autocorrelation plots, with an
independence model possible (IND) as well as the usual Bartlett's approximation (MA) used by
other programs.
SPSS ARIMA is available for modeling time series with a basic set of features. Several options
for iteration are available, as well as two methods of parameter estimation. Initial values for
parameters can be specified, but there is no provision for centering the series (although the
mean may be subtracted in a separate transformation). There are built-in functions for a log
(base 10 or natural) transform of the series. Differencing is awkward for lag greater than 1. A
differenced variable is created, and then the differencing parameter is omitted from the model
specification (along with the constant). Residuals (and their upper and lower confidence values)
and predicted values are automatically written to the existing data file, and tests are available to
compare models (log-likelihood, AIC, and SBC).
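The effect of differencing, which the packages above handle with varying degrees of convenience, can be sketched in a few lines (an illustration independent of any package):

```python
import numpy as np

# Illustrative note on differencing: a lag-1 difference reduces a linear trend
# to a constant increment; differencing at the seasonal lag cancels a
# repeating seasonal pattern.
t = np.arange(24, dtype=float)
x = 2.0 * t + np.tile([5.0, 0.0, -5.0, 0.0], 6)   # trend + period-4 seasonality

d1 = np.diff(x)          # lag-1 difference: trend contributes a constant 2.0
d4 = x[4:] - x[:-4]      # lag-4 (seasonal) difference: seasonality cancels
print(np.allclose(d4, 8.0))  # True: only the trend increment (2.0 * 4) remains
```

Because the differenced series is what actually gets modeled, forecasts produced from it are on the differenced scale, which is the interpretation difficulty noted below for SYSTAT.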
Table 1. Comparison of Computing Packages

Feature                                                             | SPSS ARIMA | SAS ARIMA | SYSTAT SERIES
--------------------------------------------------------------------|------------|-----------|--------------
Specify intervention variable                                       | No         | Yes       | Yes
Specify additional continuous variable(s)                           | No         | Yes       | No
Include constant or not                                             | Yes        | Yes       | Yes
Specify maximum number of iterations                                | MXITER     | MAXIT     | Yes
Specify tolerance                                                   | PAREPS     | SINGULAR  | No
Specify change in sum of squares                                    | SSQPCT     | No        | No
Parameter estimate stopping criterion                               | No         | CONVERGE  | Yes
Specify maximum lambda                                              | MXLAMB     | No        | No
Specify delta                                                       | No         | DELTA     | No
Define maximum number of psi weights                                | No         | No        | No
Conditional least squares estimation method                         | CLS        | CLS       | Yes
Unconditional estimation method                                     | EXACT      | ULS       | No
Maximum likelihood estimation method                                | No         | ML        | No
Options regarding display of iteration details                      | Yes        | PRINTALL  | No
User-specified initial values for AR and MA parameters and constant | Yes        | INITVAL   | No
Request that mean be subtracted from each observation               | No         | CENTER    | No
Options for forecasting                                             | Yes        | Yes       | Yes
Specify size of confidence interval                                 | CINPCT     | ALPHA     | No
Erase current time-series model                                     | No         | CLEAR     | CLEAR SERIES
Use a previously defined model without respecification              | APPLY      | No        | No
Request ACF plot with options                                       | Yes^a      | Default   | ACF
Request PACF plot with options                                      | Yes^a      | Default   | PACF
Request cross-correlation plot with options                         | Yes^b      | Default   | CCF
Request inverse autocorrelation plot                                | No         | Default   | No
Request residual autocorrelation plots                              | No         | PLOT      | No
Request log transforms of series                                    | Yes        | No^e      | Yes
Request time series plot with options                               | Yes^c      | No^f      | TPLOT
Estimate missing data                                               | No^d       | Yes       | Yes
Specify analysis by groups                                          | No         | BY        | No
Specify number of lags in plots                                     | MXAUTO^a   | NLAG      | No
Specify method for estimating standard errors of autocorrelations   | SERROR^a   | No        | No
Specify method for estimating variance                              | No         | NODF      | No
Request search for outliers in the solution                         | No         | Yes       | No
Number of observations or residuals (after differencing)            | Yes        | Yes       | No
SAS System: SAS ARIMA also is a full-featured modeling program, with three estimation
methods as well as several options for controlling the iteration process. SAS ARIMA has the
most options for saving things to data files. Inverse autocorrelation plots and (optionally)
autocorrelation plots of residuals are available in addition to the usual ACF, PACF, and
cross-correlation plots that are produced by default. However, a plot of the raw time series itself
must be requested outside PROC ARIMA. The autocorrelation checks for white noise and
residuals are especially handy for diagnosing a model and models may be compared using
AIC and SBC. Missing data are estimated under some conditions.
SYSTAT System: Except for graphs and built-in transformations, SYSTAT SERIES is a
bare-bones program. The time-series plot can be edited and annotated; however, ACF, PACF, and CCF
plots have no numerical values. Maximum number of iterations and a convergence criterion can
be specified. Iteration history is provided, along with a final value of error variance. Other than
that, only the parameter estimates, standard errors, and 95% confidence interval are shown.
There is no provision for intervention analysis or any other input variables. SYSTAT does
differencing at lags greater than 1 in a manner similar to that of SPSS. Differenced values are
produced as a new variable (in a new file), and then the differenced values are used in the
model. Thus, forecast values are not in the scale of the original data but in the scale of the
differenced values, making interpretation more difficult. SYSTAT does, however, provide a plot
showing values for the known (differenced) series as well as forecast values.

Table 2. Time Series Applications, Algorithms and Tools
Time series data provide useful information about the physical, biological, social, or
economic systems generating them. Types of time series data, various applications, existing
algorithms, and software tools for time series analysis are described in Table 2. Table 3
describes important problems of time series research that may be taken up for further
investigation. Some important terminology of time series is explained in Table 4.
Table 3. Time Series Problem and Descriptions
Problems
Description
Scientific problems include: smoothing, prediction,
association, index numbers, feedback and control.
Theoretical Issues: The usual cautions apply
with regard to causal inference in time-series
analysis. But causal inference is much weaker in
any design that falls short of the requirements of
a true experiment.
Statistical problems include: explanatory in a
model, estimation of parameters such as hidden
frequencies, uncertainty computation, goodness of
fit and testing.
Special problems include:
Missing values,
censoring,
measurement
error,
irregular
sampling, feedback, outliers, shocks, signalgenerated noise, trading days, festivals, changing
seasonal pattern, measurement error, aliasing, data
observed in two series at different time points.
Discovering Time Series Motifs without all those
hard-to-set parameters:
Clustering streaming time series: The problem is
NOT to do this fast, the problem is to do this in a
meaningful way.
Time Series Joins: The problem is NOT to do this
fast, the problem is to do this in a meaningful way.
Practical Issues: Outliers among scores are
sought before modeling and among the residuals
once the model is developed.
Normality of Distributions of Residuals: A
model is developed and then normality of
residuals is evaluated in time-series analysis.
Examine the normalized plot of residuals for the
model before evaluating an intervention.
Transform the DV if residuals are non normal.
Homogeneity of Variance and Zero Mean of
Residuals: After the model is developed,
examine plots of standardized residuals versus
predicted values to assess homogeneity of
variance over time. Consider transforming the
DV if the width of the plot varies over the
predicted values.
Independence of Residuals: During the diagnostic phase, once the model is developed and residuals are computed, there should be no remaining autocorrelations or partial autocorrelations at various lags in the ACFs and PACFs. Remaining autocorrelations at various lags signal other possible patterns in the data that have not been properly modeled.
Absence of Outliers: Outliers are observations that are highly inconsistent with the remainder of the time-series data. They can greatly affect the results of the analysis and must be dealt with. They sometimes show up in the original plot of the DV against time, but are often more noticeable after initial modeling is complete.
Other major issues are involved: data representation, similarity measurement, and indexing methods.
Understanding the "why" in time series classification and clustering: given that two time series are clustered or classified together, automatically construct an explanation of why.
Building tools to visualize massive time series: how can we visually summarize massive time series so that regularities, outliers, anomalies, etc., become visible?
Classifying time series with an eager learner: as we have seen, DTW is essentially linear; nevertheless, 1-nearest neighbor needs to visit every instance. Can we do better?
Weighted time series representations: it is well known in the machine learning community that weighting features can greatly improve accuracy in classification and clustering tasks.
We need to give more attention to problems with real, demonstrated applications: music, motion capture, video, web logs, etc.
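The eager-learner question above is usually posed against the 1-NN/DTW baseline. A minimal DTW distance, plus the lazy 1-NN classifier it supports, can be sketched as follows (a naive O(nm) dynamic program, not an optimized or lower-bounded implementation):

```python
def dtw(a, b):
    """Dynamic time warping distance between two sequences
    (naive O(len(a) * len(b)) dynamic program)."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible warping steps
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def nn_classify(query, labeled_series):
    """1-nearest-neighbor under DTW: the 'lazy' baseline that
    must visit every training instance at query time."""
    return min(labeled_series, key=lambda sl: dtw(query, sl[0]))[1]
```

The cost of visiting every training instance at query time is exactly what an eager alternative would try to avoid.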
Table 4. Important Terminology of Time Series

Observation: The DV score at one time period. The score can be from a single case or an aggregate score from numerous cases.
Random shock: The random component of a time series. The shocks are reflected by the residuals (or errors) after an adequate model is identified.
ARIMA (p, d, q): The acronym for an auto-regressive integrated moving average model. The three terms to be estimated in the model are auto-regressive (p), integrated (trend, d), and moving average (q).
Auto-regressive terms (p): The number of terms in the model that describe the dependency among successive observations. Each term has an associated correlation coefficient that describes the magnitude of the dependency. For example, a model with two auto-regressive terms (p = 2) is one in which an observation depends on (is predicted by) two previous observations.
Moving average terms (q): The number of terms that describe the persistence of a random shock from one observation to the next. A model with two moving average terms (q = 2) is one in which an observation depends on two preceding random shocks.
Trend terms (d): The terms needed to make a nonstationary time series stationary. A model with two trend terms (d = 2) has to be differenced twice to make it stationary. The first difference removes linear trend, the second difference removes quadratic trend, and so on.
Lag: The number of time periods between two observations.
Differencing: Calculating differences among pairs of observations at some lag to make a nonstationary series stationary.
Stationary and nonstationary series: Stationary series vary around a constant mean level, neither decreasing nor increasing systematically over time, with constant variance. Nonstationary series have systematic trends, such as linear, quadratic, and so on. A nonstationary series that can be made stationary by differencing is called "nonstationary in the homogeneous sense."
Autocorrelation: Correlations among sequential scores at different lags.
Autocorrelation function (ACF): The pattern of autocorrelations in a time series at numerous lags; the correlation at lag 1, then the correlation at lag 2, and so on.
Partial autocorrelation function (PACF): The pattern of partial autocorrelations in a time series at numerous lags after partialing out the effects of autocorrelations at intervening lags.
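The differencing and autocorrelation entries above translate directly into code. The sketch below differences a series (one pass removes linear trend, two passes remove quadratic trend) and computes the lag-k sample autocorrelation; it is illustrative only, and textbook ACF estimators differ slightly in their denominators:

```python
import statistics

def difference(series, d=1):
    """Apply d-th order differencing: each pass replaces the series
    with successive differences, removing one polynomial trend order."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def acf(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    mu = statistics.mean(series)
    denom = sum((x - mu) ** 2 for x in series)
    num = sum((series[t] - mu) * (series[t + lag] - mu)
              for t in range(len(series) - lag))
    return num / denom
```

For instance, `difference([1, 3, 5, 7, 9])` yields the constant series `[2, 2, 2, 2]`, showing a single difference removing a linear trend; in the diagnostic phase, `acf` applied to residuals should be near zero at all lags greater than zero.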
Conclusion
After almost two decades of research in time-series data mining, an incredible wealth of systems and algorithms has been proposed. The ubiquitous nature of time series led to an extension of the scope of applications simultaneously with the development of more mature and efficient solutions to deal with problems of increasing computational complexity. We have reviewed throughout this article time-series data analysis models, representations, problems and issues, and compared the computational efficiency of the software packages. As for most scientific research, trying to find the solution to a problem often leads to raising more questions than finding answers. We have thus outlined several trends and research directions
as well as open issues for the near future. It seems clear that time series research will continue on all the existing topics, as the assumptions on which existing solutions are based prove inadequate. Further, it can be anticipated that more researchers from nontraditional areas will become interested in time series as they realize that the data they have collected, or will be collecting, are correlated in time. Researchers can be expected to be even more concerned with the topics of nonlinearity, conditional heteroscedasticity, inverse problems, long memory, long tails, uncertainty estimation, inclusion of explanatory variables, new analytic models, and the properties of estimates when the model is not true. The motivation for the last is that time series with unusual structure seem to appear steadily. More efficient, more robust and more applicable solutions will be found for existing problems. Techniques will be developed for dealing with special difficulties such as missing data and outliers. Better approximations to the distributions of time-series-based statistics will be developed.