International Journal of Computer Engineering and Applications, Volume IX, Issue V, May 2015 www.ijcea.com ISSN 2321-3469

TIME SERIES DATA MINING RESEARCH PROBLEM, ISSUES, MODELS, TRENDS AND TOOLS

Dr. Ilango Velchamy1, Dr. Uma Ilango2, Nitya Ramesh3
1 Professor, New Horizon College of Engineering, Bangalore
2 Professor, Christ University, Bangalore
3 Assistant Professor, New Horizon College of Engineering, Bangalore

ABSTRACT: A time series is a collection of observations made chronologically. The increasing use of time series data has initiated a great deal of research and development in the field of data mining. The abundance of research on time series data mining in the last decade could hamper the entry of interested researchers, due to its complexity. In this paper, a comprehensive review of existing time series data mining research is given. The work is broadly categorized into representations, problems, applications, and suitable computing packages, and state-of-the-art research issues are also highlighted. The primary objective of this paper is to give interested researchers an overall picture of current time series data mining development and help them identify potential directions for further investigation.

Keywords: Trends, Models, Representations, Terminology and Computing

[1] INTRODUCTION

A time series is a data type represented by a sequence of data points sampled at successive times, usually a sequence of real or integer numbers with or without attached time stamps. Today more data is being collected than ever before, and a great deal of it is in the form of time series: collections of observations recorded over regular or irregular time intervals. Time series data is becoming extremely valuable to the operations of modern organizations.
Time-series data arises naturally from observations and reflects the evolution of some subject or the development of some phenomenon in time. Since a time series is often the only way to store valuable and frequently nonreproducible temporal information, time-series data is ubiquitous and important not only in every scientific field but also in everyday life. According to [21], “The time-series plot is the most frequently used form of graphic design. With one dimension marching along to the regular rhythm of seconds, minutes, hours, days, weeks, months, years, or millennia, the natural ordering of the time scale gives this design a strength and efficiency of interpretation found in no other graphic arrangement.” Recently, the increasing use of temporal data, in particular time series data, has initiated various research and development efforts in the field of data mining. Time series is an important class of temporal data objects, and it can be easily obtained from scientific and financial applications (e.g., electrocardiograms (ECG), daily temperatures, weekly sales totals, and prices of mutual funds and stocks). The nature of time series data includes large data size, high dimensionality, and continuous updates. Moreover, time series data, characterized by its numerical and continuous nature, is usually considered as a whole rather than as individual numerical fields. Therefore, unlike traditional databases where similarity search is exact-match based, similarity search in time series data is typically carried out in an approximate manner. There are various kinds of time series research, for example, finding similar time series [4][13], subsequence searching in time series [24], dimensionality reduction [36][46], and segmentation [3].
These problems have been studied in considerable detail by both the database and pattern recognition communities for different domains of time series data [38]. In the context of time series data mining, the fundamental problem is how to represent the time series data. One common approach is to transform the time series to another domain for dimensionality reduction, followed by an indexing mechanism. Moreover, similarity measurement between time series or subsequences, and segmentation, are core components of many time series mining tasks. Based on the time series representation, different mining tasks can be found in the literature, and they can be roughly classified into four fields: pattern discovery and clustering, classification, rule discovery, and summarization. Some of the research concentrates on one of these fields, while other work may span more than one. In this paper, a comprehensive review of existing time series data mining research is given. The remainder of this paper is organized as follows. Section 2 discusses categories of time series models. Time series components are reviewed in Section 3. Recent research trends and issues in time series data analysis are discussed in Section 4. Section 5 reviews time series representation techniques. Section 6 compares computing software packages and discusses open problems and important terminology, and the conclusion follows.
[2] CATEGORIES OF TIME SERIES MODELS

Figure 1. Categories of Time Series Models:
Deterministic: trigonometric model, linear trend, polynomial trend, exponential curve, logistic curve, Gompertz curve.
Stochastic: adjustment model, filter model, auto-prediction model, explanatory model.
Time domain: stationary random time series, autocorrelation, autocovariance, cross-correlation, white noise.
Frequency domain: Fourier transformation, discrete-time Fourier transformation, fast Fourier transformation, spectral distribution function, windowing filter, cross spectrum.

We can group time series techniques into four broad categories (deterministic models, stochastic models, time domain models, and frequency domain models), as shown in Figure 1. In a deterministic model, the output is fully determined by the parameter values and the initial conditions. Stochastic models possess some inherent randomness: the same set of parameter values and initial conditions will lead to an ensemble of different outputs. The time domain model represents a time series as a function of time. Its main concern is to explore whether the time series has a trend (rising or declining) and, if so, to fit a forecasting model. The frequency domain model is based on the assumption that the most regular, and hence predictable, behavior of a time series is likely to be periodic. Thus, the main concern of this approach is to determine the periodic components embedded in the time series. The choice between these models depends essentially upon the types of questions being asked in different fields of study. For example, economists have relatively greater interest in the time domain, whereas communication engineers have greater interest in the frequency domain.
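To make the frequency-domain idea concrete, the sketch below (illustrative synthetic data, numpy assumed; not drawn from the surveyed works) recovers the periodic component of a series from its spectrum:

```python
import numpy as np

# A synthetic series with a linear trend and a period-12 seasonal cycle.
n = 240
t = np.arange(n)
series = 0.05 * t + 3.0 * np.sin(2 * np.pi * t / 12)

# Remove the linear trend first, so it does not mask the periodicity.
detrended = series - np.polyval(np.polyfit(t, series, 1), t)

# The dominant non-zero frequency in the spectrum reveals the period.
spectrum = np.abs(np.fft.rfft(detrended))
freqs = np.fft.rfftfreq(n, d=1.0)
dominant = freqs[1:][np.argmax(spectrum[1:])]  # skip the DC component
period = 1.0 / dominant
print(round(period))  # -> 12
```

This is exactly the frequency-domain concern described above: detecting periodic structure that is hard to see as a function of time.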
However, combining these two approaches would yield a better understanding of the data. The two broad families in practice are time-domain methods (including Box-Jenkins ARIMA analysis) and spectral-domain methods (including Fourier, or spectral, analysis).

[3] TIME SERIES COMPONENTS

A time series is a time-ordered sequence of observations taken at regular intervals (e.g., hourly, daily, weekly, monthly, quarterly, annually). The data may be measurements of demand, earnings, profits, shipments, accidents, output, precipitation, productivity, or the consumer price index. Forecasting techniques based on time-series data assume that future values of the series can be estimated from past values. Although no attempt is made to identify variables that influence the series, these methods are widely used, often with quite satisfactory results. Analysis of time-series data requires the analyst to identify the underlying behavior of the series. This can often be accomplished by merely plotting the data and visually examining the plot. One or more patterns might appear: trends, seasonal variations, cycles, or variations around an average, along with random and perhaps irregular variations. These behaviors can be described as follows. Trend refers to a long-term upward or downward movement in the data; population shifts, changing incomes, and cultural changes often account for such movements. Seasonality refers to short-term, fairly regular variations generally related to factors such as the calendar or time of day; restaurants, supermarkets, and theaters experience weekly and even daily “seasonal” variations. Cycles are wavelike variations of more than one year’s duration, often related to a variety of economic, political, and even agricultural conditions. Irregular variations are due to unusual circumstances such as severe weather conditions, strikes, or a major change in a product or service.
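As a hedged illustration of these components, here is a minimal additive decomposition (assuming series = trend + seasonality + residual; the moving-average approach and all names are illustrative, not from the paper's references):

```python
import numpy as np

# Illustrative monthly series: trend + seasonality + noise (additive model assumed).
rng = np.random.default_rng(1)
period = 12
t = np.arange(10 * period)
series = 0.2 * t + 5.0 * np.sin(2 * np.pi * t / period) + rng.normal(0, 0.5, len(t))

# Trend estimate: centered moving average over one full seasonal period.
kernel = np.ones(period) / period
trend = np.convolve(series, kernel, mode="same")

# Seasonal component: average detrended value at each position within the period.
detrended = series - trend
seasonal = np.array([detrended[i::period].mean() for i in range(period)])

# Residual: what remains after removing trend and seasonality.
residual = detrended - np.tile(seasonal, len(t) // period)
print(residual.std() < series.std())  # most variability is explained
```

The moving-average trend estimate is distorted near the series boundaries; production decomposition routines handle those edges more carefully.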
Such irregular variations do not reflect typical behavior, and their inclusion in the series can distort the overall picture; whenever possible, they should be identified and removed from the data. Random variations are residual variations that remain after all other behaviors have been accounted for. These behaviors are illustrated in Figure 2, where the small “bumps” in the plots represent random variability.

Figure 2. Time Series Behaviors

[4] RESEARCH TRENDS AND ISSUES

Time-series data mining has been an ever-growing and stimulating field of study that has continuously raised challenges and research issues over the past decade. We discuss in the following the open research issues and trends in time series for the next decade. Stream analysis. The last decade of hardware and network research has witnessed an explosion of streaming technologies alongside continuous advances in bandwidth capabilities. Streams are continuously generated measurements that have to be processed at massive and fluctuating data rates. Analyzing and mining such data flows are computationally demanding tasks. Several papers review research issues for data stream mining [26] or management [27]. Algorithms designed for static datasets have usually not been sufficiently optimized to handle such continuous volumes of data. Many models have already been extended to handle data streams, such as clustering [20], classification [28], segmentation [51], and anomaly detection [18]. Novel techniques will be required, designed specifically to cope with ever-flowing data streams. Convergence and hybrid approaches. Many new tasks can be derived through a relatively easy combination of already existing tasks.
For instance, [57] proposed three approaches (polynomial, DFT, and probabilistic) to predict unknown values that have not yet been fed into the system and to answer queries based on forecast data. This work shows that future research can rely on the convergence of several tasks, which could potentially lead to powerful hybrid approaches. Embedded systems and resource-constrained environments. With the advances in hardware miniaturization, new requirements are imposed on analysis techniques and algorithms. Embedded systems have very limited memory and cannot rely on permanent access to storage. Furthermore, sensor networks usually generate huge amounts of streaming data, so there is a vital need to design space-efficient techniques, in terms of memory consumption as well as the number of accesses. An interesting solution has recently been proposed in [80]. User interaction. Time-series data mining is increasingly dedicated to application-specific systems. The ultimate goal of such methods is to mine for higher-order knowledge and propose a set of solutions to the user. It therefore seems natural to include a user interaction scheme to allow for dynamic exploration and refinement of the solutions. An early proposal by [30] allows for relevance feedback in order to improve the querying process: from the best results of a query, the user is able to assign positive or negative influences to the series. Exhaustive benchmarking. A wide range of systems and algorithms has been proposed over the past few years. Individual proposals are usually submitted together with specific datasets and evaluation methods that prove the superiority of the new algorithm. There is still a need for a common and exhaustive benchmarking system to perform objective testing.
Another highly challenging task is to develop a procedure for real-time accuracy evaluation. Link to shape analysis. Shape analysis has also been a matter of discussion over the past few years. There is an astonishing resemblance between the tasks that have been examined, such as query by content [11], classification [32], clustering [58], segmentation [70], and even motif discovery [77]. As a matter of fact, there is a deeper connection between these two fields, as recent work shows the numerous inherent links between them. [10] studied the problem of classifying ordered sequences of digital images. As presented earlier, [80] proposed to extract a time series from the contour of an image; they introduced time-series shapelets, which represent the most informative part of an image and make it easy to discriminate between image classes.

[5] REPRESENTATION

Many time series representation techniques have been investigated, each offering different trade-offs among the desirable properties of a representation. The classification of time series representations is depicted in Figure 3.

Figure 3. Time Series Representations

To perform this classification, we follow the taxonomy of [55] by dividing representations into three categories: nondata adaptive, data adaptive, and model based. Nondata Adaptive: In nondata-adaptive representations, the parameters of the transformation remain the same for every time series regardless of its nature. The first nondata-adaptive representations were drawn from spectral decompositions. The DFT was used in the seminal work of [4]. It projects the time series onto a basis of sine and cosine functions [22] in the real domain; the resulting representation is a set of sinusoidal coefficients.
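A minimal sketch of such a DFT-based reduction (function names are illustrative; this is not the implementation of [4]) keeps only the first k Fourier coefficients and still reconstructs the series closely:

```python
import numpy as np

def dft_reduce(series, k):
    """Keep only the first k DFT coefficients as the reduced representation."""
    return np.fft.rfft(series)[:k]

def dft_reconstruct(coeffs, n):
    """Approximate the original length-n series from the truncated coefficients."""
    full = np.zeros(n // 2 + 1, dtype=complex)
    full[:len(coeffs)] = coeffs
    return np.fft.irfft(full, n)

rng = np.random.default_rng(0)
t = np.arange(128) / 128.0
signal = np.sin(2 * np.pi * 3 * t) + 0.1 * rng.standard_normal(128)

coeffs = dft_reduce(signal, k=8)            # 8 complex values instead of 128 reals
approx = dft_reconstruct(coeffs, len(signal))
print(np.max(np.abs(signal - approx)) < 0.5)  # close fit despite the reduction
```

Truncating the spectrum also discards most of the high-frequency noise, which is why reconstruction from so few coefficients can look smoother than the original.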
Instead of using a fixed set of basis functions, the DWT uses scaled and shifted versions of a mother wavelet function [9]. This gives a multiresolution decomposition where low frequencies are measured over larger intervals, thus providing better accuracy [65]. A large number of wavelet functions have been used in the literature, such as the Haar wavelet [15]. The Discrete Cosine Transform (DCT) uses only a cosine basis; it has also been applied to time-series mining [33]. However, it has been shown to offer no advantage over the previously cited decompositions [55]. Other approaches more specific to time series have been proposed. The Piecewise Aggregate Approximation (PAA) introduced by [43] represents a series through the mean values of consecutive fixed-length segments. An extension of PAA with a multiresolution property (MPAA) has been proposed in [53]. [2] suggested extracting a sequence of amplitude-level-wise local features (ALF) to represent the characteristics of local structures; it was shown in [19] that this proposal provided weak results. Random projections have been used for representation in [30]; in this case, each time series enters a convolution product with k random vectors drawn from a multivariate standard normal distribution. This approach has recently been combined with spectral decompositions by [68] for the purpose of answering statistical queries over streams. Data Adaptive: This approach implies that the parameters of a transformation are modified depending on the data available. By adding a data-sensitive selection step, almost all nondata-adaptive methods can become data adaptive. For spectral decompositions, this usually consists of selecting a subset of the coefficients; it has been applied to the DFT [76] and the DWT [75]. A data-adaptive version of PAA has been proposed in [60], with vector quantization being used to create a codebook of recurrent subsequences. This idea has been adapted to allow for multiple resolution levels.
However, this approach has only been tested on smaller datasets. A similar approach was undertaken in [74], with a codebook based on motion vectors being created to spot gestures; however, it has been shown to be computationally less efficient than SAX. Several inherently data-adaptive representations have also been used. SVD was proposed in [33] and later enhanced for streams [67]. However, SVD requires the computation of eigenvalues for large matrices and is therefore far more expensive than the other schemes mentioned; it has recently been adapted to find multiscale patterns in time-series streams [63]. PLA [72] is a widely used approach for the segmentation task. The set of polynomial coefficients can be obtained either by interpolation [39] or by regression. Many derivatives of this technique have been introduced. APCA [45] uses constant approximations per segment instead of polynomial fitting. Indexable PLA has been proposed by [17] to speed up the indexing process. [24] put forward an approach based on PLA to answer queries about the recent past with greater precision than about older data, calling such representations amnesic. The method of using a segmentation algorithm as a representational tool has been extensively investigated; the underlying idea is that segmenting a time series can be equated with representing its most salient features while considerably reducing its dimensionality. [77] proposed a pattern-based representation of time series, in which the input series is approximated by a set of concave and convex patterns to improve the subsequence matching process. [81] proposed a pattern representation of time series to extract outlier values and noise. The polynomial shape space representation [23] is a subspace representation consisting of trend aspect estimators of a time series.
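Among the representations above, PAA is simple enough to sketch directly: each fixed-length segment is replaced by its mean. This is an illustrative implementation, not the code of [43]:

```python
import numpy as np

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of consecutive segments."""
    series = np.asarray(series, dtype=float)
    segments = np.array_split(series, n_segments)  # handles non-divisible lengths
    return np.array([seg.mean() for seg in segments])

series = [2.0, 4.0, 6.0, 8.0, 1.0, 1.0, 5.0, 7.0]
print(paa(series, 4))  # -> [3. 7. 1. 6.]
```

Eight points collapse to four segment means, which is the dimensionality reduction that makes PAA attractive for indexing.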
[8] put forward a two-level approach to recognize gestures, describing individual trajectories with key points and then characterizing gestures through the global properties of the trajectories. Instead of producing a numeric output, it is also possible to discretize the data into symbols. This conversion into a symbolic representation also offers the advantage of implicitly performing noise removal through complexity reduction. A relational tree representation is used in [7]: nonterminal nodes of the tree correspond to valleys and terminal nodes to peaks in the time series. The Symbolic Aggregate approximation (SAX) [59], based on the same underlying idea as PAA, uses equal-frequency histograms on sliding windows to create a sequence of short words. An extension of this approach, called indexable Symbolic Aggregate approximation (iSAX) [73], has been proposed to make fast indexing possible by providing zero overlap at leaf nodes. The grid-based representation [1] places a two-dimensional grid over the time series; the final representation is a bit string describing which values were kept and which bins they were in. Another possibility is to discretize the series to a binary string, a technique called clipping [66]: each bit indicates whether the series is above or below the average, so that the series can be manipulated very efficiently. In [6] this is done using the median as the clipping threshold. Clipped series offer the advantage of allowing direct comparison with raw series, thus providing a tighter lower-bounding metric. Recently, a very interesting approach has been proposed in [79]; it is based on primitives called shapelets, that is, subsequences that are maximally representative of a class and thus fully discriminate between classes through the use of a dictionary. This approach can be considered a step towards bridging the gap between time series and shape analysis.
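Clipping is equally easy to illustrate. A minimal sketch (using the mean as the threshold, as described above; names are mine, not from [66]):

```python
import numpy as np

def clip_series(series):
    """Clipped (binary) representation: 1 where the series is above its mean, else 0."""
    series = np.asarray(series, dtype=float)
    return (series > series.mean()).astype(np.uint8)

series = [1.0, 5.0, 2.0, 8.0, 0.0, 6.0]
bits = clip_series(series)
print(bits)  # mean is about 3.67 -> [0 1 0 1 0 1]
```

Two clipped series can then be compared with cheap bitwise operations, which is the efficiency advantage the text refers to.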
Model Based: The model-based approach assumes that the observed time series has been produced by an underlying model; the goal is then to find the parameters of such a model and use them as the representation. Two time series are considered similar if they have been produced by the same set of parameters driving the underlying model. Several parametric temporal models may be considered, including statistical modeling by feature extraction [61], ARMA models [31], and Markov Chains (MCs) [70]. MCs are simpler than Hidden Markov Models (HMMs), so they fit shorter series well, but their expressive power is far more limited. The time-series bitmaps introduced in [34] can also be considered a model-based representation, although they mainly aim at providing a visualization of time series. As mentioned earlier, time series are essentially high-dimensional data, and defining algorithms that work directly on the raw series would be computationally too expensive. The main motivation of representations is thus to emphasize the essential characteristics of the data in a concise way. Additional benefits are efficient storage, faster processing, and implicit noise removal. These basic properties lead to the following requirements for any representation: significant reduction of the data dimensionality; emphasis on fundamental shape characteristics on both local and global scales; low computational cost for computing the representation; good reconstruction quality from the reduced representation; and insensitivity to noise or implicit noise handling.

[6] COMPARISON OF COMPUTING PACKAGES

Important computer packages for time series analysis include S-PLUS, MATLAB, Mathematica, SAS, SPSS, TSP, and STAMP.
Here, three important time series packages based on ARIMA modeling are reviewed, and their features are compared in Table 1. SAS and SPSS each have a single program for ARIMA time-series modeling. SPSS has additional programs for producing time-series graphs and for forecasting seasonally adjusted time series. SAS has a variety of programs for time-series analysis, including a special one for seasonally adjusted time series. SYSTAT also has a time-series program, but it does not support intervention analysis. SPSS Package: SPSS has a number of time-series procedures, only a few of which are relevant to ARIMA modeling. ACF produces autocorrelation and partial autocorrelation plots, CCF produces cross-correlation plots, and TSPLOT produces the time-series plot itself. All of the plots are produced in high-resolution form; ACF and CCF plots are also shown in low-resolution form with numerical values. This is the only program that permits alternative specification of standard errors for autocorrelation plots, with an independence model possible (IND) as well as the usual Bartlett’s approximation (MA) used by other programs. SPSS ARIMA is available for modeling time series with a basic set of features. Several options for iteration are available, as well as two methods of parameter estimation. Initial values for parameters can be specified, but there is no provision for centering the series (although the mean may be subtracted in a separate transformation). There are built-in functions for a log (base 10 or natural) transform of the series. Differencing is awkward for lags greater than 1: a differenced variable is created, and then the differencing parameter is omitted from the model specification (along with the constant). Residuals (and their upper and lower confidence values) and predicted values are automatically written to the existing data file, and tests are available to compare models (log-likelihood, AIC, and SBC). Table 1.
Comparison of Computing Packages

Feature | SPSS ARIMA | SAS ARIMA | SYSTAT SERIES
Input:
Specify intervention variable | No | Yes | Yes
Specify additional continuous variable(s) | No | Yes | No
Include constant or not | Yes | Yes | Yes
Specify maximum number of iterations | MXITER | MAXIT | Yes
Specify tolerance | PAREPS | SINGULAR | No
Specify change in sum of squares | SSQPCT | No | No
Parameter estimate stopping criterion | No | CONVERGE | Yes
Specify maximum lambda | MXLAMB | No | No
Specify delta | No | DELTA | No
Define maximum number of psi weights | No | No | No
Conditional least squares estimation method | CLS | CLS | Yes
Unconditional estimation method | EXACT | ULS | No
Maximum likelihood estimation method | No | ML | No
Options regarding display of iteration details | Yes | PRINTALL | No
User-specified initial values for AR and MA parameters and constant | Yes | INITVAL | No
Request that mean be subtracted from each observation | No | CENTER | No
Options for forecasting | Yes | Yes | Yes
Specify size of confidence interval | CINPCT | ALPHA | No
Erase current time-series model | CLEAR | CLEAR | CLEAR SERIES
Use a previously defined model without respecification | APPLY | No | No
Request ACF plot with options | Yes(a) | Default | ACF
Request PACF plot with options | Yes(a) | Default | PACF
Request cross-correlation plot with options | Yes(b) | Default | CCF
Request inverse autocorrelation plot | No | Default | No
Request residual autocorrelation plots | No | PLOT | No
Request log transforms of series | Yes | No(e) | Yes
Request time series plot with options | Yes(c) | No(f) | TPLOT
Estimate missing data | No(d) | Yes | Yes
Specify analysis by groups | No | BY | No
Specify number of lags in plots | MXAUTO(a) | NLAG | No
Specify method for estimating standard errors of autocorrelations | SERROR(a) | No | No
Specify method for estimating variance | No | NODF | No
Request search for outliers in the solution | No | Yes | No
Number of observations or residuals (after differencing) | Yes | Yes | No

SAS System: SAS ARIMA is also a full-featured modeling program, with
three estimation methods as well as several options for controlling the iteration process. SAS ARIMA has the most options for saving results to data files. Inverse autocorrelation plots and (optionally) autocorrelation plots of residuals are available in addition to the usual ACF, PACF, and cross-correlation plots that are produced by default. However, a plot of the raw time series itself must be requested outside PROC ARIMA. The autocorrelation checks for white noise and residuals are especially handy for diagnosing a model, and models may be compared using AIC and SBC. Missing data are estimated under some conditions. SYSTAT System: Except for graphs and built-in transformations, SYSTAT SERIES is a bare-bones program. The time series plot can be edited and annotated; however, ACF, PACF, and CCF plots have no numerical values. The maximum number of iterations and a convergence criterion can be specified. Iteration history is provided, along with a final value of the error variance. Other than that, only the parameter estimates, standard errors, and 95% confidence intervals are shown. There is no provision for intervention analysis or any other input variables. SYSTAT does differencing at lags greater than 1 in a manner similar to that of SPSS: differenced values are produced as a new variable (in a new file) and then used in the model. Thus, forecast values are not in the scale of the original data but in the scale of the differenced values, making interpretation more difficult. SYSTAT does, however, provide a plot showing values for the known (differenced) series as well as forecast values. Time series data provide useful information about the physical, biological, social, or economic systems generating the time series.

Table 2. Time Series Applications, Algorithms and Tools
Types of time series data, various applications, existing algorithms, and software tools for time series analysis are described in Table 2. Table 3 explains important open problems of time series research that may be taken up for further study, and some important time series terminology is explained in Table 4.

Table 3. Time Series Problems and Descriptions

Scientific problems include smoothing, prediction, association, index numbers, feedback, and control.
Theoretical issues: The usual cautions apply with regard to causal inference in time-series analysis, but causal inference is much weaker in any design that falls short of the requirements of a true experiment.
Statistical problems include explanatory variables in a model, estimation of parameters such as hidden frequencies, uncertainty computation, goodness of fit, and testing.
Special problems include missing values, censoring, measurement error, irregular sampling, feedback, outliers, shocks, signal-generated noise, trading days, festivals, changing seasonal patterns, aliasing, and data observed in two series at different time points.
Discovering time series motifs without all those hard-to-set parameters.
Clustering streaming time series: The problem is not to do this fast; the problem is to do this in a meaningful way.
Time series joins: The problem is not to do this fast; the problem is to do this in a meaningful way.
Practical issues: Outliers among scores are sought before modeling, and among the residuals once the model is developed.
Normality of distributions of residuals: A model is developed and then the normality of residuals is evaluated. Examine the normalized plot of residuals for the model before evaluating an intervention; transform the DV if residuals are nonnormal.
Homogeneity of variance and zero mean of residuals: After the model is developed, examine plots of standardized residuals versus predicted values to assess homogeneity of variance over time. Consider transforming the DV if the width of the plot varies over the predicted values.
Understanding the “why” in time series classification and clustering: Given that two time series are clustered/classified together, automatically construct an explanation of why.
Independence of residuals: During the diagnostic phase, once the model is developed and residuals are computed, there should be no remaining autocorrelations or partial autocorrelations at various lags in the ACFs and PACFs. Remaining autocorrelations at various lags signal other possible patterns in the data that have not been properly modeled.
Building tools to visualize massive time series: How can we visually summarize massive time series, such that regularities, outliers, anomalies, etc., become visible?
Absence of outliers: Outliers are observations that are highly inconsistent with the remainder of the time-series data. They can greatly affect the results of the analysis and must be dealt with. They sometimes show up in the original plot of the DV against time, but are often more noticeable after initial modeling is complete.
Classifying time series with an eager learner: As we have seen, DTW is essentially linear; nevertheless, 1-nearest neighbor needs to visit every instance. Can we do better?
Weighted time series representations: It is well known in the machine learning community that weighting features can greatly improve accuracy in classification and clustering tasks.
Other major issues include data representation, similarity measurement, and indexing methods.
We need to give more attention to problems with real, demonstrated applications: music, motion capture, video, web logs, etc.
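Several of these open problems involve DTW-based similarity. For reference, here is a minimal sketch of the classic dynamic-programming DTW distance (illustrative and unoptimized; real systems add windowing and lower bounds):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW distance."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# A warped (stretched) copy of a series is DTW-identical to the original.
x = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
warped = [0.0, 1.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]  # same shape, one value repeated
print(dtw_distance(x, warped))  # -> 0.0: warping absorbs the stretch
```

The quadratic cost of this recurrence, multiplied by a 1-nearest-neighbor scan over every training instance, is exactly the expense that the "eager learner" problem above asks us to avoid.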
Table 4. Important Terminology of Time Series

Observation: The DV score at one time period. The score can be from a single case or an aggregate score from numerous cases.
Random shock: The random component of a time series. The shocks are reflected by the residuals (or errors) after an adequate model is identified.
ARIMA (p, d, q): The acronym for an auto-regressive integrated moving average model. The three terms to be estimated in the model are auto-regressive (p), integrated (trend, d), and moving average (q).
Auto-regressive terms (p): The number of terms in the model that describe the dependency among successive observations. Each term has an associated correlation coefficient that describes the magnitude of the dependency. For example, a model with two auto-regressive terms (p = 2) is one in which an observation depends on (is predicted by) two previous observations.
Moving average terms (q): The number of terms that describe the persistence of a random shock from one observation to the next. A model with two moving average terms (q = 2) is one in which an observation depends on two preceding random shocks.
Lag: The time periods between two observations.
Differencing: Calculating differences among pairs of observations at some lag to make a nonstationary series stationary.
Stationary and nonstationary series: Stationary series vary around a constant mean level, neither decreasing nor increasing systematically over time, with constant variance. Nonstationary series have systematic trends, such as linear, quadratic, and so on. A nonstationary series that can be made stationary by differencing is called "nonstationary in the homogeneous sense."
Trend terms (d): The terms needed to make a nonstationary time series stationary. A model with two trend terms (d = 2) has to be differenced twice to make it stationary. The first difference removes linear trend, the second difference removes quadratic trend, and so on.
Autocorrelation: Correlations among sequential scores at different lags.
Autocorrelation function (ACF): The pattern of autocorrelations in a time series at numerous lags: the correlation at lag 1, then the correlation at lag 2, and so on.
Partial autocorrelation function (PACF): The pattern of partial autocorrelations in a time series at numerous lags after partialing out the effects of autocorrelations at intervening lags.

CONCLUSION

After almost two decades of research in time-series data mining, an incredible wealth of systems and algorithms has been proposed. The ubiquitous nature of time series has extended the scope of applications, while more mature and efficient solutions have been developed to deal with problems of increasing computational complexity. Throughout this article we have reviewed time-series data analysis models, representations, problems and issues, and compared the computational efficiency of the software packages. As in most scientific research, trying to solve a problem often raises more questions than it answers. We have thus outlined several trends and research directions, as well as open issues, for the near future. It seems clear that time series research will continue on all the existing topics, as the assumptions on which existing solutions are based repeatedly prove inadequate. It can further be anticipated that more researchers from nontraditional areas will become interested in time series as they realize that the data they have collected, or will be collecting, are correlated in time.
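As an illustrative sketch (ours, not from the original tables) of the trend and differencing terminology in Table 4: a series with a quadratic trend requires d = 2, since the first difference still trends while the second difference is constant.

```python
import numpy as np

t = np.arange(6, dtype=float)
quadratic = 3.0 * t**2            # nonstationary: quadratic trend
first = np.diff(quadratic)        # d = 1 removes only linear trend
second = np.diff(quadratic, n=2)  # d = 2 removes the quadratic trend

print(first)   # 3, 9, 15, 21, 27 -- still increasing, not yet stationary
print(second)  # 6, 6, 6, 6 -- constant, so the twice-differenced series is stationary
```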
Researchers can be expected to be even more concerned with the topics of nonlinearity, conditional heteroscedasticity, inverse problems, long memory, long tails, uncertainty estimation, inclusion of explanatory variables, new analytic models, and the properties of estimates when the model is not true. The motivation for the last is that time series with unusual structure appear steadily. More efficient, more robust and more applicable solutions will be found for existing problems. Techniques will be developed for dealing with special difficulties such as missing data and outliers, and better approximations to the distributions of time-series-based statistics will be developed.