Demand Forecast for Short Life Cycle Products

Mario José Basallo Triana

December, 2012

Above all, to God. To my parents Luis Mario and Luz Ángela for their sacrifice, dedication, and love. To my teachers, for showing me the way.

Acknowledgements

I wish to thank professors Jesus Andrés Rodríguez Sarasty and Hernán Darío Benitez Restrepo, directors of this project, for their advice, assistance, and support. I thank the Pontificia Universidad Javeriana for providing the resources needed to carry out this project. This project was supported by the Pontificia Universidad Javeriana through the project Gestión de Inventarios, 020100292. I thank everyone who, in one way or another, contributed to the completion of this project.

Intelligence, give me the exact name of things! ... Let my word be the thing itself, created anew by my soul. Through me, let all who do not know them go to things; through me, let all who already forget them go to things; through me, let the very ones who love them go to things ... Intelligence, give me the exact name, and yours, and his, and mine, of things!

Juan Ramón Jiménez, Eternidades (1918)

This work builds on ideas proposed by professor Jesus Andrés Rodríguez Sarasty for forecasting the demand of short life cycle products.

CONTENTS

1. Introduction
2. Problem statement
   2.1 Forecasting the demand of short life cycle products
       2.1.1 Problem formulation
   2.2 Fundamental research hypothesis
3. Objectives and Scope
   3.1 General objective
   3.2 Specific objectives
   3.3 Scope of the research
4. Literature Review
   4.1 Forecast based on growth models
   4.2 Forecast based on similarity
   4.3 Forecast based on machine learning models
   4.4 Discussion of the current methods
   4.5 Conclusions
5. Time series analysis
   5.1 The datasets of time series
       5.1.1 Real datasets
       5.1.2 Synthetic dataset
   5.2 Stationarity test
       5.2.1 Unit-root test
   5.3 Clustering of time series
       5.3.1 Some insights on clustering time series
       5.3.2 The clustering algorithm
       5.3.3 Fuzzy cluster validity indices
       5.3.4 Clustering results
       5.3.5 Clustering results for the real datasets
   5.4 Conclusions of the chapter
6. Regression Methods
   6.1 Multiple linear regression
   6.2 Support vector regression
   6.3 Artificial Neural Networks
   6.4 Tuning parameters
       6.4.1 Response surface methodology for tuning parameters
7. Experimental procedure
   7.1 Collection and analysis of data
   7.2 Clustering
   7.3 Parameter tuning
   7.4 Forecasts evaluation
   7.5 Some computational aspects
   7.6 Results of the tuning parameters procedure
       7.6.1 Tuning parameters for SVR machines
       7.6.2 Tuning parameters for ANN machines
   7.7 Conclusions of the chapter
8. Results
   8.1 Forecasting results using multiple linear regression
       8.1.1 Multiple linear regression results with clustering
       8.1.2 Conclusions of the MLR case
   8.2 Forecasting results using support vector regression
       8.2.1 Support vector regression results with clustering
       8.2.2 Conclusions of the SVR case
   8.3 Forecasting results using artificial neural networks
       8.3.1 Artificial neural network results with clustering
       8.3.2 Conclusions of the ANN case and other cases
   8.4 Comparison of forecasting methods
9. Conclusions
A. Results for the SD1 dataset using the correct partition
B. The effect of the clustering algorithm
   B.1 Multiple linear regression case
   B.2 Support vector regression case
C. Variance of cumulative and non-cumulative data

LIST OF FIGURES

2.1 Short life cycle product time series.
2.2 Generalized pattern of a short life cycle product time series.
5.1 The datasets of short time series.
5.2 Representation of the SD1 dataset by its 3 PIPs.
5.3 Representation of the real datasets by their 3 PIPs.
6.1 Illustration of the linear ε-SVR with soft margin.
6.2 Feed-forward neural network.
6.3 Experimental designs.
7.1 Experimental factors.
7.2 Framework of the forecasting procedures.
7.3 Some iterations of the proposed tuning parameters procedure.
8.1 Multiple linear regression results, complete datasets.
8.2 Support vector regression results, complete datasets.
8.3 Artificial neural network results, complete datasets.
8.4 Forecasting results for some time series.
8.5 Pairwise comparison of regression methods.
8.6 Mean absolute error results for each regression method.
C.1 Variance of cumulative and non-cumulative data.

LIST OF TABLES

5.1 Model parameters of synthetic time series.
5.2 Validation results for the SD1 dataset using the FSTS algorithm.
5.3 Optimal number of clusters for the real datasets.
7.1 Optimal SVR parameters, non-cumulative data.
7.2 Optimal SVR parameters, cumulative data.
7.3 Optimal ANN parameters, non-cumulative data.
7.4 Optimal ANN parameters, cumulative data.
8.1 MLR results, complete datasets.
8.2 MLR results, partitioned non-cumulative data.
8.3 MLR results, partitioned cumulative data.
8.4 SVR results, complete datasets.
8.5 SVR results, partitioned non-cumulative data.
8.6 SVR results, partitioned cumulative data.
8.7 ANN results, complete datasets.
8.8 ANN results, partitioned non-cumulative data.
8.9 ANN results, partitioned cumulative data.
8.10 Summary evaluation of the experimental treatments.
A.1 Preliminary multiple linear regression results for the SD1 dataset.
B.1 Comparison of FCM, FMLE and FSTS algorithms for MLR.
B.2 Comparison of FCM, FMLE and FSTS algorithms for SVR.

LIST OF ALGORITHMS

1 Perceptually Important Points procedure.
2 Kaufman initialization method.
3 Fuzzy C-Means (FCM) clustering algorithm.
4 Fuzzy Short Time Series (FSTS) clustering algorithm.
5 Fuzzy Maximum Likelihood Estimation (FMLE) algorithm.
6 Tuning parameters procedure.
7 Proposed tuning parameters procedure.

ABSTRACT

Accurate forecasting of the demand of short life cycle products is a subject of special interest for many companies and researchers. However, common forecasting approaches are not appropriate for this type of product due to the characteristics of their demand. This work proposes a method to forecast the demand of short life cycle products. Clustering techniques will be used to obtain natural groups in the time series; this analysis makes it possible to extract information relevant to the forecasting method. The results of the proposed method will be compared to other approaches for forecasting the demand of short life cycle products. Several time series datasets of different types of products are considered.

Keywords: Short life cycle product, time series, forecast, cluster analysis, forecast performance.
ONE

INTRODUCTION

Short life cycle products (SLCP) are characterized by a demand that occurs only during a short time period, after which they become obsolete¹; in some cases this leads to very short demand time series (Rodríguez & Vidal, 2009). High technology products (e.g., computers, consumer electronics, video games) and fashion products (e.g., toys, apparel, textbooks) are typical examples of SLCPs (Kurawarwala & Matsuo, 1998; Thomassey & Happiette, 2007; Rodríguez & Vidal, 2009). The period of demand can vary from a few years to a few weeks, as can be seen in the Colombian textbook industry (Rodríguez & Vidal, 2009).

The dynamics of new product demand are generally characterized by relatively slow growth in the introduction stage, followed by a stage of rapid growth; afterwards the demand stabilizes and the product enters a stage of maturity; finally the demand declines and the product is usually replaced by another product (Trappey & Wu, 2008; Meade & Islam, 2006).

In many industries, particularly in the technology sector, SLCPs are becoming increasingly common (Zhu & Thonemann, 2004). This phenomenon is driven by the continuous introduction of new products as a consequence of highly competitive markets. In this context, the competitive advantage of a company is determined largely by its ability to manage frequent entries and exits of products (Wu et al., 2009).

The demand for a SLCP is highly uncertain and volatile, particularly in the introduction stage (Wu & Aytac, 2008). Additionally, the demand pattern is transient, non-stationary and non-linear (Rodríguez, 2007); these characteristics hinder the analysis and forecasting of such demand.

¹ Short life cycle products are different from perishable products, which generally deteriorate with time.
On the other hand, the operations management of this type of product is also difficult: high technology and investment are usually required, manufacturing and distribution lead times are usually long, and the risk of excess or shortage of inventory during the life cycle is high (Rodríguez & Vidal, 2009). An accurate prediction of demand reduces production and distribution costs, as well as the excess or shortage of inventory. It is for this reason that an accurate forecast is important for a company.

This work proposes an efficient and effective forecasting method for short life cycle product demand. To this end, we consider regression methods to forecast the demand of a SLCP. These methods can obtain forecasts at early stages of the product life cycle, an important advantage over current forecasting methods and a valuable property for a company. In order to improve forecasting performance, different strategies are considered; these strategies involve clustering techniques and the use of cumulative and non-cumulative data.

This document is organized as follows: Chapter 2 states the problem and formulates different forecasting strategies (hypotheses) aimed at improving forecast performance. Chapter 3 presents the objectives of this work and the scope of the research. Chapter 4 discusses a classification and description of the current methods used to predict the demand of short life cycle products. Chapter 5 describes the different time series datasets analyzed in this work and illustrates the use of different clustering techniques. Chapter 6 discusses the regression methods and the parameter tuning procedures followed. Chapter 7 explains in detail the experimental procedure followed in this work. Chapter 8 discusses the results and Chapter 9 concludes this document.
TWO

PROBLEM STATEMENT

2.1 Forecasting the demand of short life cycle products

Several circumstances make demand forecasting for short life cycle products complex. On one hand, the demand time series of such products appear to be non-stationary, non-linear, and transient (Fig. 2.1, left). On the other, historical data are scarce: once introduced to the market, these products have a short sales period, so there is little or no historical information on their sales. This is a severe drawback because, although forecasting methods for non-linear time series exist, such techniques generally require large amounts of data to obtain accurate forecasts.

Many non-linear models have been proposed in the field of time series analysis¹. These models, like linear models, require large amounts² of data to obtain accurate results. Furthermore, the proposed non-linear models employ explicit parametric forms, and it is usually hard to justify a priori the appropriateness of such explicit models in real applications. The use of non-parametric regression analysis (such as support vector regression and artificial neural networks) can be an effective data-driven approach (Peña et al., 2001).

Traditional forecasting methods³ cannot be used in this context because short life cycle product time series do not meet the assumptions required by these methods, or there is not enough information available to obtain accurate estimates of their parameters (Kurawarwala & Matsuo, 1998; Wu & Aytac, 2008). These difficulties make it necessary to develop forecasting methods specifically designed for this type of product. Forecasting procedures for short life cycle product time series should address all these difficulties and cope with the high uncertainty and volatility of demand typical of these products.

¹ Some non-linear time series models are the Bilinear model, the Threshold Autoregressive model, the Smooth Transition model, and the Markov Switching model; see Tsay (2005); Peña (2005).
² At least 50 periods of historical data.
³ Methods such as moving average, exponential smoothing, ARIMA, and others. In general, most of these methods are suitable for forecasting conventional products with a long history or a stable demand pattern.

Fig. 2.1: Short life cycle product time series (real and smoothed versions shown). Left: typical demand pattern for a short life cycle product. Right: cumulative demand time series.

2.1.1 Problem formulation

In order to state the problem, define $x_t$ as the value of the time series at time $t$ and $\hat{x}_t$ as the forecast value for the same period. We set $\hat{x}_t = F$, where $F$ is some functional relationship that must be chosen according to some criterion. One suitable criterion is an error metric (forecasting error): let $e_t = f(x_t - \hat{x}_t)$ be such a metric. Then, given a time series $\{x_t\}$, $t = 1, \ldots, T$, we wish to minimize

$$E = \sum_{t=1}^{T} e_t \qquad (2.1)$$

The problem can be stated as follows: given a time series $\{x_t\}$, the forecasting problem consists in defining the $F$ for which Eq. 2.1 is minimized. One option is $F = f(t, \Theta)$, where $\Theta$ is a set of parameters, which in the context of product innovation corresponds to the parameters of a growth model (see Chapter 4). Another option is, for example, $F = f(x_{t-1}, \ldots, x_{t-p})$, where $p$ is a lag parameter. Finally, $F$ can be defined as $F = f(F_1, \ldots, F_n)$, where $n$ is the number of different forecasts obtained with different forecast models.
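The formulation above can be made concrete with a short sketch. Here $F$ is taken to be a lag-based predictor and $e_t$ the absolute error; both choices, and the demand values, are illustrative assumptions rather than the method prescribed by this work.

```python
import numpy as np

def forecast_error(x, forecast_fn, p):
    """Total forecast error E = sum of e_t, with e_t = |x_t - x_hat_t|.

    forecast_fn maps the last p observations to a one-step forecast;
    the choice of forecast_fn plays the role of F in the text.
    """
    errors = []
    for t in range(p, len(x)):
        x_hat = forecast_fn(x[t - p:t])
        errors.append(abs(x[t] - x_hat))
    return sum(errors)

# Hypothetical bell-shaped demand series; F = mean of the last p values.
x = np.array([3.0, 5.0, 9.0, 14.0, 12.0, 7.0, 4.0])
E = forecast_error(x, np.mean, p=2)
```

Minimizing Eq. 2.1 then amounts to choosing the `forecast_fn` (and its parameters) that makes `E` as small as possible.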
If we define $X_t = \sum_{i=1}^{t} x_i$ as the cumulative demand, the resulting time series is smoother than the non-cumulative demand time series (see Fig. 2.1, right). Since this fact can be used to improve forecast results (Wu et al., 2009), we forecast the cumulative demand and then recover the non-cumulative demand as $x_t = X_t - X_{t-1}$. This observation motivates the following research hypothesis:

Forecasts based on cumulative demand are more accurate than forecasts based on non-cumulative demand.

In order to cope with the scarcity of historical data on product demand, some references in the SLCP literature work with time series of similar products which have already been introduced to the market (see Chapter 4). In this case, a clustering technique may be used to analyze the available information and to organize it according to natural clusters (Wu et al., 2009; Li et al., 2012). The forecasts may then be carried out according to the results of that cluster analysis. Therefore we also consider the following research hypothesis:

Forecasts based on data obtained by means of cluster analysis improve forecast performance.

Finally, a time series of a SLCP may have more complex shapes (see Fig. 2.2) that are not necessarily similar to the typical bell-shaped pattern in Fig. 2.1 (left). In this situation the forecasting process becomes more difficult, and we need forecast models that can deal with any pattern of SLCP time series. One way to do this is to use a machine learning model to capture the complex functional relationship of the time series (Zhang et al., 1998)¹. Accordingly, we assume the following research hypothesis:

The use of machine learning models in the forecasting process improves the forecast performance.

¹ The methodological procedure for forecasting using machine learning models is based on Rodríguez & Vidal (2009) and Rodríguez (2007).
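The cumulative-demand transformation behind the first hypothesis is easy to state in code; the demand values below are illustrative only.

```python
import numpy as np

# Noisy bell-shaped demand series (illustrative values).
x = np.array([2.0, 6.0, 11.0, 15.0, 12.0, 7.0, 3.0])

# Cumulative demand X_t = x_1 + ... + x_t is smoother and monotone.
X = np.cumsum(x)

# After forecasting on the cumulative scale, the non-cumulative demand
# is recovered by differencing: x_t = X_t - X_{t-1} (with X_0 = 0).
x_back = np.diff(X, prepend=0.0)

assert np.allclose(x_back, x)
```

Because `cumsum` and `diff` are exact inverses here, any forecasting gain comes purely from working on the smoother cumulative scale.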
Fig. 2.2: Non-bell-shaped pattern of a short life cycle product time series.

2.2 Fundamental research hypothesis

It is possible to determine a forecasting procedure for SLCPs that, based on similar time series of available data, can obtain a minimum forecast error and guarantee a reliable basis for planning activities.

THREE

OBJECTIVES AND SCOPE

3.1 General objective

Propose a forecasting method based on machine learning models to forecast the demand of short life cycle products.

3.2 Specific objectives

• Develop a clustering method for short life cycle product time series to extract relevant information for the forecast process.
• Design the forecasting method for the demand of short life cycle products using multiple linear regression, support vector machines and/or artificial neural networks.
• Evaluate the performance of the forecast models using appropriate metrics and statistical tests.

3.3 Scope of the research

In this work we use or consider:

• Sales data rather than demand data, because demand data are usually not available.
• A validation and comparison of the forecasting methods using at least two different datasets of demand time series.
• No implementation in a company and no development of software.
• Only time series that follow the behavior of the demand of short life cycle products.
• The pattern of the time series rather than the product or the type of product being considered.

Note: Due to the scarcity of real data for this project, one of the databases used in this work contains demand time series of textbooks and school products. Some of these products cannot be considered strictly as short life cycle products¹. However, we use these time series because they follow the demand pattern of a short life cycle product.
¹ The demand for textbooks and school products is mainly seasonal, but during the season it follows a short life cycle pattern.

FOUR

LITERATURE REVIEW

This chapter reviews the current state of research in forecasting the demand (sales) of short life cycle products. The review provides a general summary of work on forecasting short life cycle products and guidelines for future research.

4.1 Forecast based on growth models

Product growth models are widely used in the analysis of the diffusion of innovations; for this reason, diffusion models can be used to forecast the demand of SLCPs. This approach requires determining a set of parameters, so a parameter estimation procedure is needed. A review of forecasting methods based on diffusion models for short life cycle products is presented below. An extensive literature review on the use of diffusion models can be found in Meade & Islam (2006).

Trappey & Wu (2008) present a comparison of the time-varying extended logistic, simple logistic, and Gompertz models. The study analyzes time series of electronic products. Linear and non-linear least squares methods are used to determine the model parameters. The authors found that the time-varying extended logistic model had the best fit and prediction capability in almost all tested cases; however, this forecast procedure fails to converge in some situations¹. The Gompertz model had the second best forecasting error.

Kurawarwala & Matsuo (1998) analyze three models to forecast the demand of short life cycle products: the linear growth model, the Bass model, and the seasonal trend Bass model². Forecasts are performed using demand data from personal computer time series. Non-linear least squares estimation is used to determine the parameters.

¹ Cumulative time series of SLCPs converge to an upper limit; see Fig. 2.1 (right) in Chapter 2.
² The authors integrate elements of seasonality into the Bass diffusion model.
Performance measures such as the sum of squared errors (SSE), the RMSE and the MAD show that the seasonal trend Bass model achieves the minimum forecast error.

Tseng & Hu (2009) propose the quadratic interval Bass model to forecast new product sales diffusion. Fuzzy regression estimates the parameters of the Bass model, which is tested with different datasets. The proposed model is compared with the Gompertz, logistic and quadratic Gompertz models and analyzed by means of the confidence interval length as performance measure. The authors conclude that the proposed method is suitable in cases of insufficient data and should not be used when sufficient data is available.

Wu & Aytac (2008) propose a forecast procedure based on the use of time series of similar products (leading indicators), Bayesian updating, and combined forecasts of different diffusion models. A prior forecast is made with several growth models. Then a sampling distribution is obtained from the forecasts of the different time series of similar products (leading indicators), which are obtained by means of the mentioned growth models. Finally, Bayesian updating is performed and the final forecast is obtained as a combination of the different growth models in the posterior results. The main advantage of the method is the systematic reduction of the variance of the forecasts. The method is tested on semiconductor demand time series. A similar work is presented in Wu et al. (2009).

Zhu & Thonemann (2004) propose an adaptive forecast procedure for short life cycle products. The authors use the Bass diffusion model and propose to update its parameters using a Bayesian approach. Prior estimation of the parameters is made using non-linear least squares. The forecasts are performed using datasets of personal computers and analyzed using the MAD. The results show that the proposed method performs better than double exponential smoothing and the Bass model.
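All of the growth-model approaches above hinge on parameter estimation. As an illustration, the discrete-time Bass recurrence $x_t = a + bX_{t-1} + cX_{t-1}^2$ (introduced later, in Section 5.1.2) is linear in $(a, b, c)$ given the cumulative demand $X_{t-1}$, so ordinary least squares can be sketched in place of the non-linear least squares used by the cited works; this substitution, and the generated data, are our own simplifications.

```python
import numpy as np

def fit_discrete_bass(x):
    """Least-squares estimate of (a, b, c) in x_t = a + b*X_{t-1} + c*X_{t-1}^2.

    The model is linear in (a, b, c), so at least 3 observations are needed;
    this is why an initial forecast is only possible from the 4th period on.
    """
    x = np.asarray(x, dtype=float)
    Xc = np.concatenate(([0.0], np.cumsum(x)[:-1]))  # X_{t-1}, with X_0 = 0
    A = np.column_stack([np.ones_like(Xc), Xc, Xc ** 2])
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    return coef  # (a, b, c)

# Sanity check on noiseless data generated exactly from a=5, b=0.4, c=-0.004.
a, b, c = 5.0, 0.4, -0.004
x = []
for _ in range(10):
    X = sum(x)
    x.append(a + b * X + c * X ** 2)
a_hat, b_hat, c_hat = fit_discrete_bass(x)
```

On noiseless data the fit recovers the parameters essentially exactly; with real demand data the same least-squares step yields the approximate parameter estimates that the forecast is then built on.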
4.2 Forecast based on similarity

The lack or scarcity of information in short life cycle product time series can be compensated by information on similar products for which there is sufficient history or which have already completed their life cycle. In this section we review forecast models that directly use similar products or time series to obtain forecasts for a SLCP.

Szozda (2010) proposes an analogous forecasting method whose purpose is to find the most similar time series. Calibration and adjustment of the time series are performed if necessary, in order to maximize the similarity measure. The forecast is the value of the most similar time series at a specified period of time. Datasets of different new products on European markets were used. The forecast method was analyzed using the mean squared error (MSE), and the results show that the proposed method performs well, obtaining a forecast error of less than 10% on average.

Thomassey & Fiordaliso (2006) propose a forecast procedure based on clustering (unsupervised learning) and classification (supervised learning) to carry out early forecasts of new products. First, natural groups in the time series are obtained by means of a clustering procedure; then a classification procedure assigns new products to a specific cluster. A forecast of the sales profile is given by the centroid of the cluster to which the new product belongs. Datasets of textile fashion products are used. The forecasts were analyzed using the RMSE, the Mean Absolute Percentage Error (MAPE), and the Median Absolute Percentage Error (MdAPE). A similar work is presented by Thomassey & Happiette (2007).

4.3 Forecast based on machine learning models

Machine learning methods such as neural networks are widely used in forecasting activities (Zhang et al.
, 1998). Additionally, techniques such as clustering and classification are important to extract relevant information from time series data and are therefore of great utility in finding similar time series, as can be seen in Section 4.2. In this section we review machine learning methods used in forecasting the demand of short life cycle products.

Xu & Zhang (2008) use a Support Vector Machine (SVM) to forecast the demand of short life cycle products under conditions of data deficiency. The authors take into account factors such as the past values of current demand, the forecast given by the Bass model, and seasonal factors. A dataset of computer products was used. The forecasts were analyzed using the RMSE and MAD. The results show that the proposed model outperforms the Bass model.

Meade & Islam (2006) compare a multilayer feed-forward neural network accommodated for prediction and a controlled recurrent neural network to predict short time series. The authors use datasets available in the literature and find that the feed-forward neural network accommodated for prediction performs better in one-step-ahead forecasts, but in two-step-ahead forecasts the controlled recurrent neural network outperforms it.

The capability of artificial neural networks to account for non-linear relationships, and the fact that no assumptions are made about the time series characteristics, are attractive features for forecasting. However, a drawback of neural networks is that they have many parameters to set, and there is no standard procedure for this task that ensures good network performance. The lack of a systematic approach to neural network model building is probably the primary cause of inconsistencies in reported findings (Zhang et al., 2001).
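The machine-learning forecasters reviewed here (and the MLR, SVR and ANN models of Chapter 6) share a common setup: the series is turned into training pairs of lagged past values and the next value. A minimal numpy-only sketch, with multiple linear regression standing in for the learner; the lag p = 2 and the demand values are illustrative assumptions.

```python
import numpy as np

def make_lagged(series, p):
    """Build (inputs, targets): each row of X holds p consecutive past values."""
    X = np.array([series[t - p:t] for t in range(p, len(series))])
    y = series[p:]
    return X, y

# Hypothetical life-cycle-shaped training series.
train = np.array([1.0, 3.0, 7.0, 12.0, 15.0, 13.0, 9.0, 5.0, 2.0])
X, y = make_lagged(train, p=2)

# Multiple linear regression x_t ~ w0 + w1*x_{t-2} + w2*x_{t-1},
# fitted by least squares; an SVR or ANN would replace this step.
A = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# One-step-ahead forecast from the last p observations.
x_hat = w[0] + w[1] * train[-2] + w[2] * train[-1]
```

Swapping the least-squares line for a different regressor changes the model class but not the lagged-input framing of the forecasting problem.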
Design of Experiments (DOE) and/or Response Surface Methodology (RSM) are techniques that can be used to tune the parameters of neural network models (Zhang et al., 2001; Bashiri & Geranmayeh, 2011; Chiu et al., 1994; Madadlou et al., 1994; Balestrassi et al., 1994). However, the results obtained by response surface methodology are not necessarily the best (Wang & Wan, 2009; Jian et al., 2009).

4.4 Discussion of the current methods

From our literature review it is possible to identify some limitations of the current forecasting methods for SLCP demand. These limitations mainly concern the capability of forecasting the demand along the complete life cycle, the appropriateness of the models used, and the effective use of historical information related to other SLCPs. We describe these limitations in more detail.

• Capability of forecasting along the complete life cycle: This is an important issue because many methods, such as diffusion models, require part of the historical information for parameter tuning. For example, the Bass model (see Section 5.1.2) requires determining three parameters ($a$, $b$ and $c$; see Eq. 5.1), so it is necessary to reserve at least 3 data points to estimate them. Hence, if we use the Bass diffusion model, the initial forecast can only be made at the 4th period.

• Appropriateness of the model: Many models used in forecasting SLCP demand employ explicit parametric forms that can, at best, be regarded as rough approximations of the underlying non-linear behavior of the time series being studied. It is necessary to justify a priori the appropriateness of the model used.

• Effective use of information related to other SLCPs: Many forecasting methods for SLCP demand, such as diffusion models, do not exploit information related to historical patterns of SLCP demand.

These drawbacks, and the hypotheses outlined in Chapter 2, motivate and justify the work presented in this document.
4.5 Conclusions

This chapter presented a review of forecast models for short life cycle products, describing models based on diffusion, on similarities between time series, and on machine learning methods. Bayesian learning is a common technique that generally improves forecast results. We noted some limitations of the current methods used in forecasting SLCP demand.

FIVE

TIME SERIES ANALYSIS

This chapter provides an analysis of the different time series datasets. Initially, a univariate time series analysis is addressed; this analysis includes testing for stationarity and for non-linearity. Our experience with the univariate analysis did not yield conclusive results, since some of the techniques used here¹ are not well defined for short time series. To complement the analysis, a multiple time series analysis is conducted, using clustering techniques to extract natural groups in the time series data. Classifying information by means of clustering makes it possible to extract relevant information from the data that can be valuable in forecasting tasks.

5.1 The datasets of time series

5.1.1 Real datasets

Demand dataset of textbooks and school products

This dataset corresponds to the weekly sales of textbooks and school products of a Colombian company. It comprises a total of 726 different time series with a maximum length of 13 periods². We will refer to this dataset as RD1. Fig. 5.1 (a) shows some of these time series.

¹ Such as maximum likelihood estimation of ARMA models and some non-linearity tests for time series.
² Note the presence of very short time series. This fact imposes constraints on the univariate time series analysis, as will be seen later.
Citations of articles dataset

This dataset corresponds to the annual citations received by scientific articles published in several knowledge fields. The articles were published between 1996 and 1997 and are available in the Scopus database. The dataset comprises a total of 600 time series of annual citations covering the years 1996–2014, with a maximum length of 18 periods. We will refer to this dataset as RD2. Fig. 5.1 (c) shows some of these time series. Note that although these time series show a downward trend in the last years, they have not yet completed their life cycle. This does not occur in dataset RD1, where the lifetime of the time series has been completed.

Patent to patent citations dataset

This dataset corresponds to annual patent-to-patent citations in the United States Patent Office (USPTO). The source of the data is Hall et al. (2001), where the authors study patent citations covering the years 1975–1999. In this work we consider patents that mainly belong to technological fields, and we preprocess the information to obtain the annual citations of patents granted between 1976 and 1978³. The time series cover the periods from 1976 to 1999, with a maximum length of 23 periods, and the total number of patents analyzed was 600. We will refer to this dataset as RD3. Fig. 5.1 (d) shows some of these time series.

³ The patent citations are analyzed according to their granting year. We consider the granting year as the year of birth of the patent and the start of the citation counts.

5.1.2 Synthetic dataset

Bass diffusion model

One of the first attempts to model the life cycle of products was the Bass diffusion model (Bass, 1969), which can be expressed using the following recurrence relation

x_t = a + b \sum_{i=1}^{t-1} x_i + c \left( \sum_{i=1}^{t-1} x_i \right)^2,  (5.1)

where a = pm, b = q − p and c = −q/m. Here p, p ∈ (0, 1), is called the innovation coefficient, and q, q ∈ (0, 1), is called the imitation coefficient
and m = \sum_{i=1}^{\infty} x_i represents the total demand/sales over the life cycle of the product; see Bass (1969). Here t = 0, 1, \ldots and x_0 = a = pm. The recurrence relation (5.1) can be used to obtain the following nonlinear autoregressive process

x_t = a + b \sum_{i=1}^{t-1} x_i + c \left( \sum_{i=1}^{t-1} x_i \right)^2 + \epsilon_t,

where \epsilon_t is an independent and identically distributed random variable. We refer to this relation as the Bass process and construct synthetic time series that follow it, with {\epsilon_t} normally distributed with mean 0 and variance \sigma^2. Four different groups of time series are generated, each containing 150 time series, for a total of 600. The length of the time series is set to a maximum of 25 periods. The parameters of the process are considered normally distributed random variables with means and standard deviations (\sigma) given in Tab. 5.1. We will refer to this dataset as SD1. Fig. 5.1 (b) shows 7 different time series of this dataset.

Tab. 5.1: Mean values of the model parameters for each group of synthetic time series. Each parameter is considered a random variable. Note: values of p and q outside the range [0, 1] were not considered in the process.

Group   p       q       m
I       0.043   0.425   193.1
II      0.002   0.496   119.5
III     0.031   0.220   105.2
IV      0.015   0.549   193.2
σ       0.071   0.001   14.29

5.2 Stationarity test

Stationarity implies some regularity conditions required by several hypothesis tests and by the parameter estimation of linear ARMA models (Peña, 2005, Chapter 10). A time series {x_t} is said to be weakly stationary if its mean \mu, its variance \gamma_0 and the covariance \gamma_\ell between y_t and y_{t-\ell} are time invariant, where \ell is an arbitrary integer (see Chapter 2 of Tsay (2005)). Strong stationarity implies that the probability distribution function related
to the values of the time series is time invariant. Strong stationarity is a key concept for characterizing SLCP time series. It is clear that the probability distribution function of such time series varies with time. In fact, at the initial and final stages of the life cycle we might expect less variability in the time series, whereas the variability is expected to be greater around the peak of demand/sales. The mean value of the time series also depends on the stage of the life cycle, i.e. the mean demand/sales at the beginning and at the end of the life cycle is expected to be close to zero.

Fig. 5.1: The datasets of short life cycle product time series. (a) RD1 dataset. (b) SD1 dataset. (c) RD2 dataset. (d) RD3 dataset.

5.2.1 Unit-root test

A non-stationary time series is said to be an Autoregressive Integrated Moving Average, ARIMA(p, d, q), process if, after applying the difference operator⁴ \nabla^d y_t, the resulting time series is stationary. The original time series is then called unit-root nonstationary (Tsay, 2005, Chapter 2). Consider the autoregressive AR(1) process

y_t = \phi_0 + \phi_1 y_{t-1} + a_t,

where a_t is an independent and identically distributed random variable, e.g. a white noise sequence, which is by definition a stationary process. If \phi_1 = 1 then

y_t = \phi_0 + y_{t-1} + a_t,  \quad \nabla y_t = \phi_0 + a_t.

Given that after applying the difference operator to the original time series we obtain a stationary time series, we conclude that the process is nonstationary. Hence, testing for unit-root nonstationarity amounts to testing whether \phi_1 = 1 or \phi_1 < 1. The above analysis can be generalized to ARMA models; however, we omit the details (see Chapter 9 of Peña (2005)).

⁴ The difference is defined as \nabla y_t = y_t − y_{t−1}. Letting x_t = y_t − y_{t−1}, then \nabla^2 y_t = x_t − x_{t−1} = y_t − 2y_{t−1} + y_{t−2}.
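The logic above can be sketched by estimating \phi_1 with ordinary least squares on the pairs (y_{t−1}, y_t): for a pure random walk the estimate lies close to 1, while for a stationary AR(1) process it lies well below 1. This is an illustration only (the actual augmented Dickey–Fuller test also relies on the distribution of a studentized test statistic), and the function name is ours:

```python
import random

def estimate_phi1(y):
    """OLS estimate of phi1 in y_t = phi0 + phi1 * y_{t-1} + a_t."""
    x, z = y[:-1], y[1:]              # (y_{t-1}, y_t) pairs
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    cov = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

random.seed(0)
noise = [random.gauss(0, 1) for _ in range(500)]

# Unit-root case (phi1 = 1): a random walk; differencing it recovers the noise.
rw = [0.0]
for a in noise:
    rw.append(rw[-1] + a)

# Stationary AR(1) with phi1 = 0.5.
ar = [0.0]
for a in noise:
    ar.append(0.5 * ar[-1] + a)

print(estimate_phi1(rw))   # close to 1 (unit root)
print(estimate_phi1(ar))   # well below 1
```

For very short series, such as those in our datasets, this estimate is unreliable, which is precisely the difficulty reported below.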
Generally, the null hypothesis of the test is that the series is non-stationary. This implies, possibly, differencing the original series, using the difference operator, to obtain a stationary process. We use the augmented Dickey–Fuller test to test for unit-root nonstationarity, with a significance level of 0.01. Performing the Dickey–Fuller test requires constructing a suitable linear ARMA model beforehand, whose parameters are usually estimated by maximum likelihood.

Results of the unit root test

Our experience in testing for unit-root nonstationarity gives no conclusive results. This is confirmed when performing parameter estimation of ARMA models, where the algorithms used produce undefined or invalid results⁵. We assume that the time series discussed here are non-stationary, since they do not satisfy the regularity assumptions required for estimating the parameters of ARMA models.

⁵ For parameter estimation we use exact maximum likelihood estimation via the Kalman filter. This algorithm requires the calculation of a log-likelihood function, but in some cases the argument of the logarithm is negative, which yields undefined (complex) results. The same circumstances arise in some nonlinear tests.

5.3 Clustering of time series

Data mining methods are helpful to find and describe patterns hidden in datasets. Clustering is a major technique in the data mining, statistics, and pattern recognition fields; the clustering problem is also referred to as unsupervised classification and is identified as unsupervised learning. Clustering consists in finding natural groups in data: an element of a cluster shares common characteristics with the other elements of that cluster, but is significantly different from the elements of other clusters.
In the context of time series data⁶, clustering becomes the task of finding time series with similar characteristics. Several characteristics can be of interest. For example, one characteristic of a cluster may be that each time series in it is generated by approximately the same generating process. Another characteristic of interest may be that the time series of a cluster are highly dependent.

⁶ A review of the literature related to time series data mining can be found in Fu (2010).

5.3.1 Some insights on clustering time series

Given a set of unlabeled time series, it is often desirable to determine groups of similar (according to some meaning of similar) time series. There are three approaches for clustering time series (see Liao (2005)).

• Raw-data-based approach. This approach uses the complete time series as a feature vector. The clustering is carried out in a high-dimensional space if the time series are long, which can be problematic due to the curse of dimensionality.

• Feature-based approach. This approach extracts some relevant features of the time series. Techniques such as feature extraction or selection, dimensionality reduction, or others can be used to extract information from the data.

• Model-based approach. Here the features are obtained using some model. For example, when dealing with SLCPs, we can fit the Bass diffusion model to each time series and use the parameters of the model as features.

There are three main objectives in clustering time series (Zhang et al., 2011): obtain similarity in time, by analyzing series that vary in a similar way at each time step; obtain similarity in shape, by grouping together time series with common shape features; and obtain similarity in change, by analyzing common trends in the data, i.e. similar variation at each time step.

In general it is not desirable to work directly with the raw data.
The reasons are that such time series are highly noisy (Liao, 2005) and that, as the dimensionality increases, all objects become essentially equidistant to each other (the curse of dimensionality), so classification and clustering lose their meaning (Ratanamahatana et al., 2010). Transformation of the raw data can therefore improve efficiency, by reducing the dimension of the data, or improve the clustering itself, by smoothing the trend and giving prominence to typical features (Zhang et al., 2011).

Dimensionality reduction by perceptually important points

Perceptually Important Points (PIP) is a simple method for reducing the dimension of a time series while preserving its salient (representative) points. In this method all data points are ordered according to their importance; if a time series of length T must be reduced to dimension n, n < T, the top n points of the list are selected. The method starts by selecting the initial and final points of the original time series as the first PIPs. The next PIP is the point with maximum distance to the first two PIPs. Each subsequent PIP is the point with maximum distance to its two adjacent PIPs. The procedure continues until n points have been selected. The distance from a point to its two adjacent PIPs is understood as the vertical distance between the point and the line connecting its two adjacent PIPs (Fu, 2010). Let the test point be p_3 = (x_3, y_3) and let its adjacent PIPs be p_1 = (x_1, y_1) and p_2 = (x_2, y_2); then the vertical distance between p_3 and the line connecting p_1 and p_2 is given in Eq. 5.2 (see Chung et al. (2002)):

d(p_3, p_c) = |y_c − y_3| = \left| y_1 + (y_2 − y_1) \frac{x_c − x_1}{x_2 − x_1} − y_3 \right|,  (5.2)

where x_c = x_3. The general procedure of the PIP method is shown in Algorithm 1.

Hard clustering or fuzzy clustering?
It is known that hard clustering⁷ often does not reflect the description of real data, where boundaries between subgroups may be fuzzy, and numerous problems in the life sciences are better tackled by decision making in a fuzzy environment. A fuzzy clustering then becomes the best option (Gath & Geva, 1989a).

⁷ Hard clustering refers to the process of assigning each element to one, and only one, cluster. In fuzzy clustering, by contrast, each element belongs to every cluster with a certain degree of membership.

Algorithm 1: Perceptually Important Points procedure.
Data: Input sequence {a_t}, t = 1, ..., T; required length n.
Result: Reduced sequence {q_t}, t = 1, ..., n.
1 Set q_1 = a_1 and q_n = a_T;
2 l = 2;
3 while l < n do
4   Select the point a_j with maximum distance to its adjacent points in {q_t};
5   Add a_j to {q_t};
6   l = l + 1;
7 end

Initialization method

A clustering procedure is generally based on the following two major steps⁸:

1. Obtain an initial partition. The partition can be obtained randomly or by a more sophisticated method.

2. Iteratively obtain new partitions, improving the clustering until some termination criterion is met.

The initialization method can improve the clustering performance. The idea is to chain several clustering algorithms, each more sophisticated than the previous one. For example, the Expectation-Maximization algorithm can be initialized with the results of the k-means algorithm, and the k-means algorithm can be initialized at a random partition (Bradley & Fayyad, 1998). Bradley & Fayyad (1998) developed a procedure for computing refined initial centroids for the k-means algorithm based on an efficient technique for estimating the modes of a distribution. Another initialization method based on refined seeds or centroids can be found in Gath & Geva (1989a).
The characteristic of that initialization is that the initial seeds are chosen randomly⁸.

⁸ Referring mainly to partitional clustering algorithms.

Peña et al. (1999) present a comparison of the performance of four initialization methods for the k-means algorithm: random initialization, the Forgy approach, the MacQueen approach, and the Kaufman approach. Based on the statistical properties of the squared error, the authors found that the Kaufman initialization method outperforms the other methods with respect to the effectiveness and robustness of the k-means algorithm, as well as its convergence speed. Given these results, we describe the Kaufman initialization method in more detail. The method is shown in Algorithm 2 (see Kaufman & Rousseeuw (1990); Peña et al. (1999)); here, d(x, y) corresponds to some distance measure between vectors x and y.

Algorithm 2: Kaufman initialization method.
Data: Dataset of time series X; number of clusters K.
Result: The set of cluster centroids {v_1, ..., v_K}, or an initial partition.
1 Select as the first seed the most centrally located instance;
2 k = 1;
3 while k < K do
4   for every non-selected instance w_i do
5     for every non-selected instance w_j do
6       Calculate C_ji = max{D_j − d_ji, 0}, where d_ji = d(w_i, w_j) and D_j = min_s {d_js}, s being one of the selected seeds;
7     end
8     Calculate the gain of selecting w_i as \sum_j C_ji;
9   end
10  Select the instance l where l = argmax_i \sum_j C_ji;
11  k = k + 1;
12 end
13 To obtain a partition, assign each non-selected instance to the cluster represented by the nearest seed.

Distance measure

In partitional clustering it is necessary to measure the similarity between two objects. For this purpose some distance measure is considered. In general, the use of the Euclidean distance is not necessarily the best option.
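Algorithm 2 above can be illustrated with a small plain-Python sketch for generic feature vectors with the Euclidean distance (the function names are ours, and this is a simplified illustration rather than the thesis implementation):

```python
def dist(x, y):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def kaufman_seeds(X, K):
    """Kaufman initialization: choose K seeds from the dataset X."""
    n = len(X)
    # First seed: the most centrally located instance
    # (minimum total distance to all other instances).
    seeds = [min(range(n), key=lambda i: sum(dist(X[i], X[j]) for j in range(n)))]
    while len(seeds) < K:
        best_i, best_gain = None, -1.0
        for i in range(n):
            if i in seeds:
                continue
            gain = 0.0
            for j in range(n):
                if j in seeds or j == i:
                    continue
                Dj = min(dist(X[j], X[s]) for s in seeds)  # distance to nearest seed
                gain += max(Dj - dist(X[i], X[j]), 0.0)    # C_ji
            if gain > best_gain:
                best_i, best_gain = i, gain
        seeds.append(best_i)
    return [X[s] for s in seeds]

# Two well-separated 1-D groups: one seed lands in each group.
seeds = kaufman_seeds([[0.0], [1.0], [2.0], [10.0], [10.5]], 2)
print(seeds)  # → [[2.0], [10.0]]
```

The gain term rewards candidates that are much closer to other instances than the already-selected seeds, which is what spreads the seeds across well-separated groups.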
The Euclidean distance is not suitable when two sequences have different scales; in this case it is necessary to normalize the data (Ratanamahatana et al., 2010). The Euclidean distance also does not take into account the temporal order or the length of the sampling intervals when the time series have unevenly distributed points (Möller-Levet et al., 2003, 2005). The so-called Dynamic Time Warping distance can be used when the time series do not line up on the horizontal scale; however, the time required to compute it is high (Ratanamahatana et al., 2010). When the time series have unevenly distributed points, Möller-Levet et al. (2003) use the so-called Short Time Series distance, defined in Eq. 5.3. This distance measure also takes into account the shape of the time series considered:

d^2(y, v) = \sum_{r=1}^{T-1} \left( \frac{y_{r+1} - y_r}{t_{r+1} - t_r} - \frac{v_{r+1} - v_r}{t_{r+1} - t_r} \right)^2.  (5.3)

On the other hand, Pascazio et al. (2007) propose a clustering procedure based on the Hausdorff distance as a similarity measure between cluster elements. The authors use the Hausdorff distance in hierarchical clustering and recommend this tool for the analysis of complex sets with complicated (and even fractal-like) structures. The method is applied to financial time series.

5.3.2 The clustering algorithm

Fuzzy C-Means Clustering Algorithm (FCM)

The partitional clustering algorithms considered here are based on the compactness of clusters and the separation between clusters. The total sum of the distances between data points and their cluster centroids is often the figure of merit. The fuzzy clustering problem can be formulated by minimizing the function given in Eqs. 5.4, where K is the number of clusters, u_{ij} ∈ [0, 1] is the degree of membership of the jth object in the ith cluster, and m is the fuzzifier, which influences the performance of the clustering algorithm. Y = {y_1, \ldots, y_N} ⊂ ℜ^T are the feature data, V = {v_1,
\ldots, v_K} ⊂ ℜ^T are the cluster centroids, and U = [u_{ij}]_{K×N} is the fuzzy partition matrix.

Minimize

J_m(Y, V, U) = \sum_{j=1}^{N} \sum_{i=1}^{K} u_{ij}^m \, d^2(y_j, v_i),  (5.4a)

subject to

\sum_{i=1}^{K} u_{ij} = 1, \quad \forall j = 1, \ldots, N,  (5.4b)

with u_{ij} non-negative for all i, j. The optimal u_{ij} is determined by equating to zero the derivative of the Lagrangian of the optimization problem and solving the resulting system. The result is shown in Eq. 5.5:

u_{ij} = \left[ \sum_{k=1}^{K} \left( \frac{d(y_j, v_i)}{d(y_j, v_k)} \right)^{1/(m-1)} \right]^{-1}.  (5.5)

The optimization of v_i follows the same procedure, and the result is shown in Eq. 5.6:

v_i = \frac{\sum_{j=1}^{N} u_{ij}^m y_j}{\sum_{j=1}^{N} u_{ij}^m}.  (5.6)

Clearly, Eqs. 5.5 and 5.6 are coupled, so it is not possible to obtain closed-form solutions. One way to proceed is to follow an iterative algorithm that alternates the estimates of u_{ij} and v_i; this is the FCM clustering algorithm (see Theodoridis & Koutroumbas (2006), p. 602). Algorithm 3 shows the standard FCM algorithm. It requires an initial set of cluster centroids, which we obtain with the Kaufman initialization method (see Algorithm 2).

Fuzzy Short Time Series Clustering Algorithm (FSTS)

In time series clustering it is usually helpful to group the data according to their shape or other characteristics, as discussed before; a special-purpose clustering algorithm can then be used. In this work we consider the fuzzy clustering algorithm proposed by Möller-Levet et al. (2003, 2005), which is the same as the fuzzy c-means algorithm but uses the short time series distance given in Eq. 5.3 rather than the standard Euclidean distance.

Algorithm 3: Fuzzy C-Means (FCM) clustering algorithm.
Data: Time series matrix X; number of clusters K; fuzzifier parameter m; termination tolerance ε; number of time series n; length of the time series T.
Result: The set of cluster centroids
{v_1, \ldots, v_K}; partition matrix U.
1 Obtain the initial partition using the Kaufman initialization method given in Algorithm 2;
2 Get the initial partition matrix U^0;
3 l = 0;
4 while ‖U^l − U^{l−1}‖ ≥ ε do
5   Compute the cluster prototypes using Eq. 5.6;
6   Compute the distance of each datum to each cluster centroid using the Euclidean distance;
7   Update the partition matrix using Eq. 5.5;
8   l = l + 1;
9 end

The optimization problem of FSTS is the same as that of the FCM algorithm, given in Eqs. 5.4, and the membership value u_{ij} is given by Eq. 5.5; the memberships must, however, be calculated using the short time series distance. The optimization of v_i is quite different from that of FCM clustering; the system of equations obtained after differentiating and equating to zero is shown in Eq. 5.7:

a_r v_{i,r−1} + b_r v_{i,r} + c_r v_{i,r+1} = m_{ir},  (5.7)

where a_r = −(t_{r+1} − t_r)^2, b_r = −(a_r + c_r), c_r = −(t_r − t_{r−1})^2, d_r = (t_{r+1} − t_r)^2, e_r = −(d_r + f_r), f_r = (t_r − t_{r−1})^2, and

m_{ir} = \frac{\sum_{j=1}^{N} u_{ij}^m (d_r y_{j,r−1} + e_r y_{j,r} + f_r y_{j,r+1})}{\sum_{j=1}^{N} u_{ij}^m}.

This yields an underdetermined system of equations. However, by adding two fixed time points with value 0 at the beginning and at the end of each time series (such points do not alter the results), the system can be solved (see Möller-Levet et al. (2003) and Möller-Levet et al. (2005) for details). The resulting system is tridiagonal and can be easily solved recursively using the so-called tridiagonal matrix algorithm (TDMA); Möller-Levet et al. (2003) also give a closed-form recursive equation for its solution. The fuzzy clustering algorithm for short time series is shown in Algorithm 4.

Algorithm 4: Fuzzy short time series (FSTS) clustering algorithm.
Data: Time series matrix X; number of clusters K; fuzzifier parameter m; termination tolerance ε; number of time series n; length of the time series T.
Result: The set of cluster centroids {v_1, \ldots, v_K}; partition matrix U.
1 Add two fixed time points at the beginning and the end of the time series: X = [[0]_{n×1}, X, [0]_{n×1}];
2 Obtain the initial partition using the Kaufman initialization method given in Algorithm 2;
3 Get the initial partition matrix U^0;
4 l = 0;
5 while ‖U^l − U^{l−1}‖ ≥ ε do
6   Compute the cluster prototypes by setting v_{i,1} = 0 and v_{i,T+2} = 0 and solving the system given in Eq. 5.7 to obtain the K centroids;
7   Compute the distance of each datum to each cluster using Eq. 5.3;
8   Update the partition matrix using Eq. 5.5;
9   l = l + 1;
10 end

Fuzzy Maximum Likelihood Estimation (FMLE) Clustering Algorithm

The following analysis is based on Benitez et al. (2013). Consider the clusters as random events, each occurring in the sample space X with a positive probability P(i). By the theorem of total probability,

P(X) = \sum_{i=1}^{K} P(X|i) P(i),

where P(X|i) is the conditional probability that event (cluster) i occurs given X. If it is assumed that P(X|i) is the Gaussian N(v_i, Σ_i), then P(X) can be seen as a Gaussian mixture. The above equation can be written as a likelihood function with parameters Θ = {P(i), v_i, Σ_i}:

P(X) = P(X; Θ) = \prod_{j=1}^{N} P(y_j|Θ) = \prod_{j=1}^{N} \sum_{i=1}^{K} P(y_j|i) P(i) = L(Θ; X).  (5.8)

Given that P(y_j|i) is Gaussian, it can be shown that

P(y_j|i) = \frac{1}{\sqrt{(2π)^T |Σ_i|}} \exp\left[ -\frac{1}{2} (y_j - v_i)' Σ_i^{-1} (y_j - v_i) \right].

The parameters Θ are obtained by maximizing the likelihood function given in Eq. 5.8. This is done by equating to zero the derivative (with respect to Θ) of the likelihood function and solving (for each parameter) the resulting system of equations. The results of the optimization problem are shown in Eqs. 5.9:

P(i) = \frac{1}{N} \sum_{j=1}^{N} P(i|y_j),  (5.9a)

v_i = \frac{\sum_{j=1}^{N} P(i|y_j) y_j}{\sum_{j=1}^{N} P(i|y_j)},  (5.9b)

Σ_i = \frac{\sum_{j=1}^{N} P(i|y_j) (y_j - v_i)(y_j - v_i)'}{\sum_{j=1}^{N} P(i|y_j)}.  (5.9c)
Note that the expression for v_i is similar to the one found for the FCM clustering algorithm in Eq. 5.6: the posterior probability⁹ P(i|y_j) plays the role of the degree of membership u_{ij}^m. The term P(i|y_j) is calculated as follows:

P(i|y_j) = \left[ \sum_{k=1}^{K} \frac{d_e^2(y_j, v_i)}{d_e^2(y_j, v_k)} \right]^{-1},  (5.10a)

where

d_e^2(y_j, v_i) = \frac{\sqrt{|Σ_i|}}{P(i)} \exp\left[ \frac{1}{2} (y_j - v_i)' Σ_i^{-1} (y_j - v_i) \right].  (5.10b)

⁹ The probability of selecting the ith cluster given the jth feature vector.

The FMLE algorithm thus uses an exponential distance measure d_e^2 based on maximum likelihood estimation. The characteristics of the FMLE clustering algorithm make it suitable for partitioning the data into hyper-ellipsoidal clusters (see Gath & Geva (1989b)). The FMLE algorithm is shown in Algorithm 5.

Algorithm 5: Fuzzy Maximum Likelihood Estimation (FMLE) clustering algorithm.
Data: Time series matrix X; number of clusters K; termination tolerance ε; number of time series n; length of the time series T.
Result: The set of cluster centroids {v_1, \ldots, v_K}; posterior probabilities P(i|y_j).
1 Obtain the initial partition using the Kaufman initialization method given in Algorithm 2;
2 Compute the posterior probabilities given in Eq. 5.10a;
3 l = 0;
4 while ‖U^l − U^{l−1}‖ ≥ ε do
5   Compute the cluster prototypes using Eq. 5.9b;
6   Compute the parameters given in Eqs. 5.9a and 5.9c;
7   Update the posterior probabilities given in Eq. 5.10a;
8   l = l + 1;
9 end

5.3.3 Fuzzy cluster validity indices

In general, in most clustering algorithms the number of clusters must be specified beforehand, and selecting a different number of clusters results in a different partition. For this reason it is necessary to evaluate several partitions. The problem of finding the optimal number of clusters is called cluster validity. The selection of the appropriate partition must be done
according to a performance index. The total distance to the cluster centroids (the objective value of the clustering problem, see Eq. 5.4) is not the best option, because this metric tends to decrease as the number of clusters increases. In general, a fuzzy cluster validity index must consider both the partition matrix and the data set itself (Wang & Zhang, 2007).

Wang & Zhang (2007) carry out an evaluation of different fuzzy cluster validity indices using eight synthetic data sets and eight well-known data sets. This work shows that none of the indices considered identifies the correct number of clusters in all data sets, but some indices obtain good results. The authors find that the PBMF index (see Pakhira et al. (2004)) fails only once to detect the correct number of clusters over the sixteen datasets. Other indices, such as the PCAES index, the Granularity-Dissimilarity (GD) index, and the SC index, also obtain good results. In the following we describe the validity indices used in this work.

Xie and Beni index, V_XB

The XB index is defined in Eq. 5.11:

V_{XB} = \frac{\sum_{i=1}^{K} \sum_{j=1}^{n} u_{ij} \, d^2(y_j, v_i)}{n \min_{i,j} \{d^2(v_i, v_j)\}}.  (5.11)

The index focuses on two properties: compactness and separation. The numerator of Eq. 5.11 measures the compactness and the denominator measures the separation between clusters. The validation problem becomes finding the partition k for which k = argmin_{c=2,...,n−1} V_{XB}(c).

PBMF index, V_PBMF

The PBMF index (see Pakhira et al. (2004)) is defined in Eq. 5.12:

V_{PBMF} = \left( \frac{1}{K} \times \frac{E_1}{J_m} \times D_K \right)^2,  (5.12)

where E_1 = \sum_{j=1}^{n} d(y_j, v), D_K = \max_{i,j=1}^{K} d(v_i, v_j), and J_m is interpreted as the objective value of the clustering problem

J_m(y, v, u) = \sum_{j=1}^{N} \sum_{i=1}^{K} u_{ij}^m \, d^2(y_j, v_i).
The index comprises three factors. The first factor, 1/K, indicates the divisibility of a K-cluster system; it decreases as K increases and helps avoid convergence toward overly large K. The second factor, E_1/J_m, relates the sum of intra-cluster distances of the complete dataset taken as a single cluster (E_1) to that of the K-cluster system (the objective value J_m); this factor measures the compactness of the cluster system and is required to increase (Pakhira et al., 2004). The third factor, D_K, is the maximum inter-cluster separation, and it is also required to increase. The validation problem becomes finding the partition k for which k = argmax_{c=2,...,n−1} V_{PBMF}(c).

PCAES index, V_PCAES

The Partition Coefficient and Exponential Separation (PCAES) index (see Wu & Yang (2005)) is defined in Eq. 5.13:

V_{PCAES} = \sum_{i=1}^{K} \sum_{j=1}^{n} \frac{u_{ij}^2}{u_M} - \sum_{i=1}^{K} \exp\left( -\min_{k \neq i} \left\{ \frac{d^2(v_i, v_k)}{β_T} \right\} \right),  (5.13)

where u_M = \min_{i=1,...,K} \left\{ \sum_{j=1}^{n} u_{ij}^2 \right\}, β_T = \sum_{i=1}^{K} d^2(v_i, \bar{v}), and \bar{v} = \sum_{j=1}^{n} y_j / n. The first term of the index measures the compactness of the cluster system relative to the most compact cluster; the second term is an exponential-type inter-cluster separation. The validation problem becomes finding the partition k for which k = argmax_{c=2,...,n−1} V_{PCAES}(c).

SC index, V_SC

The SC index (see Zahid et al. (1999)) is defined in Eq. 5.14:

V_{SC} = SC_1(K) − SC_2(K),  (5.14)

where

SC_1(K) = \frac{\sum_{i=1}^{K} d^2(v_i, \bar{v}) / K}{\sum_{i=1}^{K} \left( \sum_{j=1}^{n} u_{ij}^m d^2(y_j, v_i) / \sum_{j=1}^{n} u_{ij} \right)},

and

SC_2(K) = \frac{\sum_{i=1}^{K-1} \sum_{k=i+1}^{K} \left( \sum_{j=1}^{n} (\min\{u_{ij}, u_{kj}\})^2 / \sum_{j=1}^{n} \min\{u_{ij}, u_{kj}\} \right)}{\sum_{j=1}^{n} (\max_{i=1,...,K}\{u_{ij}\})^2 / \sum_{j=1}^{n} \max_{i=1,...,K}\{u_{ij}\}}.

The first and second terms of Eq. 5.14 measure the fuzzy compactness and the fuzzy inter-cluster separation, respectively, considering geometrical properties of the data and the membership function; the index yields a fuzzy compactness/fuzzy separation degree.
The validation problem becomes finding the partition k for which k = argmax_{c=2,...,n−1} V_{SC}(c).

5.3.4 Clustering results

Experiments were conducted in order to validate the clustering results and the fuzzy cluster validity indices. We use dataset SD1 described in Section 5.1.2, which was generated according to four classes or groups. The idea is to cluster the time series using Algorithm 4 and to validate the clustering using the indices described in Section 5.3.3 in order to determine the number of clusters K; the results are compared with the correct number of clusters. In these experiments we used the FSTS clustering algorithm. The fuzzifier parameter of the clustering algorithm was set to m = 1.65, and the convergence tolerance was set to ε = 10⁻⁵. The algorithm used the short time series distance given in Eq. 5.3. The fuzzy cluster indices were evaluated for different numbers of clusters, from a minimum of K_min = 2 up to a maximum of K_max = √n, where n is the number of time series (n = 600, see Section 5.1.2). The cluster validity indices were also calculated using the short time series distance. The results were obtained using a raw-data-based approach in which the complete time series was considered as a feature vector, and the dimensionality of the time series was reduced using Algorithm 1. The optimal number of clusters for the different cluster validity indices and dimensionalities is shown in Tab. 5.2. The PCAES index obtained the most accurate results, considering that the correct number of clusters is 4. In Fig. 5.2, where the SD1 dataset has been reduced to 3 features, the presence of four fuzzy clusters can be seen. This reduction of an SLCP time series to 3 PIPs in fact amounts to selecting the initial sales, the peak sales, and the final sales of the product.

Tab.
5.2: Validation results for the SD1 dataset using the FSTS algorithm (optimal number of clusters found by each index at each dimensionality*).

Index   Raw data   13   6    3
XB      19         17   2    23
PBMF    24         22   2    15
PCAES   6          21   24   2
SC      19         3    24   24

*Number of periods in the time series.

5.3.5 Clustering results for the real datasets

The raw datasets and the PCAES index were used to investigate the optimal number of clusters for the real datasets (see Section 5.1.1); the results are shown in Tab. 5.3. Fig. 5.3 shows the representation of the real datasets by their 3 PIPs. Note that these datasets do not have a clustering structure as clear as that of the synthetic dataset SD1. On the other hand, all real datasets take integer values. This implies that clustering tendency tests such as the Hopkins test (see Banerjee & Davé (2004)) and the Cox-Lewis test may erroneously conclude that the data follow a clustering structure¹⁰. The reason is the integrality of the data: for example, in Fig. 5.3 (c) the data may appear to form clusters at the values y_1 = {1, 2, 3, 4, 5, 6}, when in fact these are simply the only integer values that the dataset can take.

Tab. 5.3: Optimal number of clusters for the real datasets.

Dataset   Number of clusters
RD1       2
RD2       2
RD3       5

¹⁰ The Hopkins test, for example, tests the hypothesis that the data are randomly (uniformly) generated within their convex hull against the hypothesis that the data form clusters, i.e. are not generated in a completely random manner. The test is carried out using Monte Carlo simulations in which synthetic data are uniformly generated within the convex hull of the real data; the synthetic and real data are then compared in order to contrast the clustering tendency.

Fig. 5.2: Representation of the SD1 dataset by its 3 PIPs.
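As a concrete illustration of the alternating updates shared by the clustering algorithms of Section 5.3.2, the following is a minimal fuzzy c-means sketch in plain Python (Euclidean distance, squared distances in the membership update, fuzzifier m = 2; the function name is ours). For FSTS, the distance of Eq. 5.3 and the prototype update of Eq. 5.7 would replace the corresponding steps.

```python
def fcm(X, V, m=2.0, iters=30):
    """Minimal fuzzy c-means: alternate membership (Eq. 5.5 style)
    and centroid (Eq. 5.6) updates for a fixed number of iterations."""
    K, N, T = len(V), len(X), len(X[0])
    for _ in range(iters):
        # Squared distances from every point to every centroid.
        d2 = [[sum((x - v) ** 2 for x, v in zip(X[j], V[i])) for j in range(N)]
              for i in range(K)]
        # Membership update; a point sitting exactly on a centroid
        # gets crisp membership to avoid division by zero.
        U = [[0.0] * N for _ in range(K)]
        for j in range(N):
            zero = [i for i in range(K) if d2[i][j] == 0.0]
            if zero:
                U[zero[0]][j] = 1.0
            else:
                for i in range(K):
                    U[i][j] = 1.0 / sum((d2[i][j] / d2[k][j]) ** (1.0 / (m - 1.0))
                                        for k in range(K))
        # Centroid update (Eq. 5.6).
        for i in range(K):
            w = [U[i][j] ** m for j in range(N)]
            sw = sum(w)
            V[i] = [sum(w[j] * X[j][t] for j in range(N)) / sw for t in range(T)]
    return V, U

# Two separated 1-D groups; memberships become nearly crisp.
X = [[0.0], [0.2], [10.0], [10.2]]
V, U = fcm(X, [[1.0], [9.0]])
```

After a few iterations the centroids settle near the group means (about 0.1 and 10.1) and the memberships approach 0 or 1, with each column of U still summing to 1 as required by Eq. 5.4b.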
5.4 Conclusions of the chapter

The univariate analysis of time series in this chapter revealed the difficulties of performing unit-root stationarity tests and some non-linearity tests on these time series. The time series do not satisfy the regularity conditions required by these tests; therefore, the time series are non-stationary. Difficulties also arise because the time series are very short, which implies that the parameter estimation process of ARMA models generates inaccurate results. In conclusion, our experience showed that it is difficult to analyze time series of SLCPs using traditional ARIMA or ARMA models.

On the other hand, the multivariate analysis allowed finding groups in the time series data. We present several clustering algorithms which will be used later in the forecasting framework. In this chapter we tested the results for the FSTS algorithm only and evaluated different cluster validity indices. Most of the cluster validity results for the synthetic dataset do not identify the correct partition of the data; however, the PCAES index produces good partitions. It was shown that the synthetic dataset follows a clustering structure of 4 groups. The real datasets have no clear clustering structure.

Fig. 5.3: Representation of the real datasets by its 3 PIPs. (a) RD1 dataset; (b) RD2 dataset; (c) RD3 dataset.

SIX

REGRESSION METHODS

Given that this work proposes the use of machine learning models in forecasting tasks of SLCP demand, we consider the analysis of models such as multiple linear regression, support vector regression, and artificial neural networks. In this chapter we briefly describe the theoretical background of these methods.
Regression methods are parameter dependent; therefore it is necessary to define a method for tuning their parameters. This work employs the response surface methodology to tune parameters, since it is an efficient approach for process optimization.

Forecasting methods use historical information to predict what will happen in the future. We can refer to this as the problem of learning from examples, as stated by Vapnik (1999) in the context of statistical learning theory and machine learning. The problem of forecasting can then be stated as a problem of learning, specifically when the functional relationship yt = f(yt−1, yt−2, . . . , yt−p) must be learned from historical information. The problem of finding the function that best predicts the value yt is also called the problem of regression. This chapter presents the regression-based methods used to forecast the demand of short life cycle products. For convenience we refer to the argument of the function as the p-dimensional vector y = (yt−1, yt−2, . . . , yt−p)′ and to the result of the function as y, meaning that y ≡ yt. The functional relationship can then be written as y = f(y), assuming that we have available historical information (y1, y1), . . . , (yn, yn), called the training set S, with yi ∈ X ⊆ R^p and yi ∈ Y ⊆ R.

The forecasts based on linear regression discussed here were first proposed by Rodríguez & Vidal (2009); Rodríguez (2007) to predict the demand of SLCPs. In this study we compare the performance of this method with nonlinear models such as support vector regression and artificial neural networks.

6.1 Multiple linear regression

Here we assume that the function y = f(y) is linear, that is,

y = f(y) = w′y + w0.    (6.1)

In terms of forecasting, it is assumed that the values of a time series are linear functions of some of their past values. This equation is completely determined by defining the parameters w0 and w given in Eq. 6.1.
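Anticipating the least squares solution derived next, fitting Eq. 6.1 can be sketched in a few lines of numpy; the data below is synthetic and purely illustrative.

```python
import numpy as np

# Sketch of fitting y = w'y + w0 (Eq. 6.1) by ordinary least squares:
# each row of Y_hat is a lag vector extended with a constant 1 so that
# the bias w0 is estimated together with w. Synthetic, noise-free data.
rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 3))              # 50 samples, p = 3 lags
y = Y @ np.array([0.5, -0.2, 0.1]) + 2.0  # known linear relation, w0 = 2

Y_hat = np.hstack([Y, np.ones((Y.shape[0], 1))])       # append the 1 column
w_hat = np.linalg.solve(Y_hat.T @ Y_hat, Y_hat.T @ y)  # normal equations
# w_hat[:3] recovers w and w_hat[3] recovers w0 (no noise here)
```

With noisy demand data the recovered coefficients would only approximate the true ones, but the computation is the same.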
The idea is to obtain the values of these parameters that minimize the regression error. In general, the least squares approach is used to solve the linear regression problem by minimizing the sum of squared errors (deviations). The total sum of squared errors is defined as the loss function L, also known as the square loss function (Cristianini & Shawe-Taylor, 2000):

L(w, w0) = Σ_{i=1}^{n} (yi − f(yi))² = Σ_{i=1}^{n} (yi − w′yi − w0)².    (6.2)

The above equation can be expressed in matrix notation by setting ŵ = (w′, w0)′ and (see Cristianini & Shawe-Taylor (2000)) stacking

Ŷ = (ŷ1′; ŷ2′; . . . ; ŷn′), where ŷi = (yi′, 1)′.

The loss function (Eq. 6.2) in matrix notation becomes

L(ŵ) = (y − Ŷŵ)′(y − Ŷŵ).

Taking the derivatives of the loss function with respect to ŵ, equating them to zero, and solving for ŵ, we obtain the solution of the least squares problem:

ŵ = (Ŷ′Ŷ)^{−1} Ŷ′y.    (6.3)

6.2 Support vector regression

Support vector machines (SVM) are commonly used in forecasting tasks and time series analysis. Some literature related to forecasting with SVMs can be found in Pai et al. (2010), Yang et al. (2007), and Hu et al. (2011). In particular, Xu & Zhang (2008) use the ε-SVR to forecast the demand of a short life cycle product, which is the interest of this work; however, those authors take into account exogenous variables in the learning process rather than past values of the time series. In regression problems there are mainly two kinds of SVMs: the so-called ν-SVR (nu-Support Vector Regression, see Schölkopf et al. (–)) and the ε-SVR. In this work we consider the Epsilon Support Vector Regression, or ε-SVR, proposed by Vapnik (Smola & Schölkopf, 2004), and a more detailed description of this kind of SVM is presented. Consider the simplest case, in which we need to perform a linear regression on some dataset, as shown in Fig. 6.1.
The idea is to obtain a regression line contained in a tube of width 2ε that contains all (or the greatest possible number of) data points and is as flat as possible. The reason for this is to avoid overfitting (see (Lin, 2006, page 48)). It can be shown that this objective is equivalent to minimizing the theoretical maximum of the generalization error (see Cristianini & Shawe-Taylor (2000); Vapnik (1999)).

Fig. 6.1: Illustration of the linear ε-SVR with soft margin.

Let the regression line be given by Eq. 6.1. The flatness of the regression function is determined by the value of the parameter w, and is obtained by minimizing the norm ∥w∥² (see Lin (2006); Smola & Schölkopf (2004)). Then, the following convex optimization problem is obtained.

Minimize:
    (1/2)∥w∥².

Subject to:
    yi − w′yi − w0 ≤ ε,
    w′yi + w0 − yi ≤ ε.

The constraints of the above problem establish that all data points must lie inside a tube of width 2ε. However, it is possible to relax these constraints by adding slack variables. The resulting optimization problem is called the soft margin problem.

Soft margin problem

The soft margin problem allows some error (data points outside the tube), measured by the slack variables ξi and ξi∗ for data point yi. The addition of slack variables makes it necessary to penalize their magnitude (the error) in the objective function according to some penalty cost C. The resulting optimization problem is shown in Eqs. 6.4.

Minimize:
    (1/2)∥w∥² + C Σ_{i=1}^{n} (ξi + ξi∗).    (6.4a)

Subject to:
    yi − w′yi − w0 ≤ ε + ξi,    (6.4b)
    w′yi + w0 − yi ≤ ε + ξi∗,    (6.4c)
    ξi, ξi∗ ≥ 0.    (6.4d)

The optimization problem 6.4 can be solved more easily in its dual formulation. More important yet is the fact that the dual problem provides the key for extending the SVM to nonlinear regression problems (see Smola & Schölkopf (2004)).
The Lagrangian of the optimization problem is

L = (1/2)∥w∥² + C Σ_{i=1}^{n} (ξi + ξi∗)
    − Σ_{i=1}^{n} αi (ε + ξi − yi + w′yi + w0)
    − Σ_{i=1}^{n} αi∗ (ε + ξi∗ − w′yi − w0 + yi)    (6.5)
    − Σ_{i=1}^{n} (ηi ξi + ηi∗ ξi∗),

where L is the Lagrangian and αi, αi∗, ηi, ηi∗ are Lagrange multipliers or dual variables; it is required that αi, αi∗, ηi, ηi∗ ≥ 0. The partial derivatives of L with respect to the primal variables (w, w0, ξi, ξi∗) have to vanish for optimality:

∂w0 L = Σ_{i=1}^{n} (αi∗ − αi) = 0,    (6.6a)
∂w L = w − Σ_{i=1}^{n} (αi − αi∗) yi = 0,    (6.6b)
∂ξi L = C − αi − ηi = 0,    (6.6c)
∂ξi∗ L = C − αi∗ − ηi∗ = 0.    (6.6d)

Substituting the results given in Eqs. 6.6 into the primal optimization problem 6.4 yields the dual optimization problem.

Maximize:
    −(1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} (αi − αi∗)(αj − αj∗) yi′yj − ε Σ_{i=1}^{n} (αi + αi∗) + Σ_{i=1}^{n} (αi − αi∗) yi,    (6.7a)

Subject to:
    Σ_{i=1}^{n} (αi − αi∗) = 0, and αi, αi∗ ∈ [0, C].    (6.7b)

Kernel methods

The SVR discussed in the previous sections can solve linear regression problems; this is a limitation, because many real problems are nonlinear. The concept of a kernel is of great importance for dealing with nonlinearities. The basic idea is to map the data points y in an input space X to a vector space H, called the feature space, via a nonlinear mapping ϕ(·): X → H. The purpose is to translate the nonlinear structure of the data in the space X into linear structure in a higher dimensional space H (see Pérez-Cruz & Bousquet (2004)). When a suitable mapping is used, the data ϕ(yi), ∀i = 1, . . . , n, appear linear, and the regression function is determined using the data in the feature space to obtain y = w′ϕ(y) + w0. This technique is known as the Kernel Trick. There are mappings for which the inner product ϕ′(x)ϕ(y) can be computed directly from x and y without explicitly computing ϕ(x) and ϕ(y). Such an inner product is written in the simpler form K(x, y).
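In practice a kernelized ε-SVR is available off the shelf; a hedged sketch using scikit-learn's SVR with an RBF kernel follows (all parameter values and the synthetic data are illustrative, not the thesis's tuned settings).

```python
import numpy as np
from sklearn.svm import SVR

# Sketch of a kernelized epsilon-SVR: C and epsilon correspond to the
# penalty constant and tube width of problem 6.4, and `gamma` plays the
# role of the kernel parameter. Synthetic one-dimensional data.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 4, size=(80, 1)), axis=0)
y = np.sin(X).ravel()

model = SVR(kernel="rbf", C=10.0, epsilon=0.05, gamma=0.5)
model.fit(X, y)
pred = model.predict(X)
mean_abs_err = float(np.mean(np.abs(pred - y)))  # should be small here
```

The solver works on the dual problem internally, which is why only kernel evaluations K(yi, yj) are needed, never the explicit feature maps.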
The objective of the SVR optimization problem given in Eq. 6.7a is then modified to Eq. 6.8. Note that by using the kernel trick it is not necessary to compute the maps ϕ(x) and ϕ(y) directly. This is the key of kernel methods such as SVMs, and another reason for using the dual optimization problem rather than the primal:

−(1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} (αi − αi∗)(αj − αj∗) K(yi, yj) − ε Σ_{i=1}^{n} (αi + αi∗) + Σ_{i=1}^{n} (αi − αi∗) yi.    (6.8)

It is important to note that the objective 6.8 is a quadratic function, and for the maximization to be well posed it must be concave, which implies that the kernel matrix [Kij] (i.e., the Hessian matrix of the objective function) must be positive semidefinite.

Gaussian Kernel: The Gaussian kernel is a kernel function commonly used in the literature, and for this reason it is the kernel used in this work (Eq. 6.9). The Gaussian kernel requires defining the parameter σ:

K(x, y) = e^{−∥x−y∥² / (2σ²)}.    (6.9)

6.3 Artificial Neural Networks

In this work we focus on feed-forward neural networks (or multilayer perceptrons). An illustration of such a network is shown in Fig. 6.2: a network with two inputs, two hidden layers with three neurons (nodes) in hidden layer 1 and two neurons in hidden layer 2, and one output. The input nodes are connected forward to the hidden neurons, and these neurons are connected forward to the output neuron. The connection between neurons i and j is associated with a weight wij.

Fig. 6.2: Feed-forward neural network with two hidden layers, two inputs and one output.

When an input vector y is presented to the network, each element (input) of the vector is propagated through the network, being affected by the weights of the connections (see Tsay (2005)). The information that neuron j of the first hidden layer receives is a linear combination of the inputs and the weights.
This information is processed through an activation function that defines the output of the neuron as follows:

h_j^1 = f_j ( w_{0j}^1 + Σ_{i→j} w_{ij}^1 y_i ),

where w_{0j}^1 is a constant term called the bias term, and the summation i → j means summing over all input nodes. The activation function for hidden neurons is usually the tangent sigmoid or the logistic function. Given the logistic function

f_j(z) = e^z / (1 + e^z),

the output of the jth neuron of the first hidden layer is

h_j^1 = exp( w_{0j}^1 + Σ_{i→j} w_{ij}^1 y_i ) / ( 1 + exp( w_{0j}^1 + Σ_{i→j} w_{ij}^1 y_i ) ).

If there are H hidden layers, then the output of the jth neuron of the last hidden layer is

h_j^H = f_j ( w_{0j}^H + Σ_{i→j} w_{ij}^H h_i^{H−1} ),

and this value corresponds to one input of the output layer neuron. Generally, an output layer neuron has a linear activation function or a Heaviside function. In the case of a linear activation function,

o = w_0^o + Σ_i w_i^o h_i^H.

The idea of using feed-forward neural networks is, given the pairs (y_l, y_l), i.e. the input y_l and output y_l patterns, to determine the values of the weights w_{ij} and biases w_{0j} that generate outputs o_l as close as possible to y_l^1. This can be done by minimizing some fitting criterion, such as the least squares error

s² = Σ_{l=1}^{n} (o_l − y_l)²;

the process of training a neural network thus becomes a nonlinear optimization problem. Several algorithms have been proposed for this problem. A well-known one is the back propagation (BP) algorithm, which is based on the gradient descent method. Other optimization algorithms, such as Levenberg-Marquardt, are also commonly used.

6.4 Tuning parameters

Machine learning models have a set of parameters that must be tuned beforehand.
In the case of support vector regression models it is necessary to determine the values of parameters such as the penalty constant C (which penalizes the slack variables ξi and ξi∗), the width of the tube ε, and the Gaussian kernel parameter σ. In the case of artificial neural network models, the most important parameters to determine are the number of hidden layers and the number of neurons per hidden layer.

^1 While at the same time ensuring the generalization capability of the neural network.

The tuning process must be carried out according to the generalization error criterion, so it is necessary to estimate this error with some predefined method. Two possible ways of estimating it are the following (see Chapelle et al. (2002)):

• Cross validation error: The data is divided into two subsets according to some proportion. One subset (the training set) is used in the training process, and the remaining subset (the validation set) is used to estimate the regression error. The process is repeated choosing all different training and validation sets possible from the whole dataset without repetition. The resulting validation error is an estimate of the generalization error.

• Leave one out error: One data point is selected from the whole dataset for validation (error estimation), using the remaining data in the training process. The process is repeated until all data points have been chosen for validation.

As can be seen, the leave one out estimates are more computationally expensive; hence the cross validation error is commonly used in practice (see Hsu et al. (2010)). The parameters must be tuned to reach the minimum regression (validation) error. A search strategy can be carried out in two steps: in the first step, a coarse grid is constructed in the parameter space and the error is evaluated at each point of the grid.
Then the point with the minimum error is kept. In the second step, a fine grid is constructed centered on the best solution of the first step, and the point of the fine grid with minimum cross validation error is selected. However, such a strategy is very time consuming, and it is necessary to consider more efficient approaches, as discussed below.

6.4.1 Response surface methodology for tuning parameters

The process of tuning parameters is in fact an optimization problem. Some characteristics that make it a difficult problem are:

• The objective value (generalization error) is actually a random variable.
• The evaluation of the objective function is very time consuming.
• The objective function is unknown in practical terms.

On the other hand, this optimization problem has the advantages that it involves a small number of variables and that, in general, the expected objective function is not highly nonlinear. This makes it possible to approximate the function using a low order polynomial. The objective function can be written as

E(p) = f(p) + ϵ,

where f(p) is an unknown function of the parameter vector p, p ∈ P, P is the parameter space, and ϵ is an independent and identically distributed random variable. For simplicity, a second order polynomial can approximate this objective function, and linear regression methods compute this polynomial. The goal is to optimize the polynomial and estimate the optimal set of parameters p∗. This optimal point is added to the original sample and the optimization process is repeated. The optimization concludes when it is no longer possible to improve the objective value. Gönen & Alpaydin (2011) used this method for tuning the parameters of a support vector machine.
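One iteration of this idea (fit a quadratic to sampled errors, then take its minimizer as the next candidate p∗) can be sketched as follows; the error surface below is synthetic and exactly quadratic, for illustration only.

```python
import numpy as np

# Sketch of one RSM iteration: fit the second order model of Eq. 6.10
# to sampled (parameter, error) pairs by least squares, then take the
# stationary point of the fitted quadratic as the next candidate p*.
def design_row(p1, p2):
    # features of a full quadratic model in two parameters
    return [1.0, p1, p2, p1 * p2, p1 ** 2, p2 ** 2]

rng = np.random.default_rng(4)
P = rng.uniform(-2, 2, size=(20, 2))               # sampled parameter points
err = (P[:, 0] - 0.5) ** 2 + (P[:, 1] + 1.0) ** 2  # true minimum at (0.5, -1)

A = np.array([design_row(p1, p2) for p1, p2 in P])
beta = np.linalg.lstsq(A, err, rcond=None)[0]      # fitted coefficients

# stationary point of the fitted quadratic: solve gradient = 0
H = np.array([[2 * beta[4], beta[3]], [beta[3], 2 * beta[5]]])
p_star = np.linalg.solve(H, -beta[1:3])
```

In the real procedure the error values are noisy cross-validation estimates, so p∗ is only an approximation and the fit-and-optimize step is repeated with an enlarged sample, as described above.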
Initial sample for the fit: The regression method requires a sample to fit the second order model; Gönen & Alpaydin (2011) use design of experiments (DOE) and response surface methodology (RSM) for this task. The authors use the Koshal design (see Myers & Montgomery (2002), page 384), which is very efficient because it requires only a small sample of data; however, a more robust design such as the central composite design (CCD) can be used (and this is the design commonly used in the literature). Fig. 6.3 shows a two dimensional sample for the Koshal design and the central composite design; it is evident that the Koshal design is the more economical approach. Once the experiment is carried out using the sample points given by the experimental design, the following quadratic function is obtained by linear regression:

Ê = β0 + Σ_{i=1}^{k} βi pi + Σ_{i=1}^{k−1} Σ_{j=i+1}^{k} βij pi pj + Σ_{i=1}^{k} βii pi²,    (6.10)

where the β are the model parameters and k is the dimensionality of the parameter space, i.e. the number of parameters.

Fig. 6.3: Two dimensional sample for a Koshal design (a) and a central composite design (b).

Given that the objective value is a random variable, replicating the experiment at each sample point may give better estimates; thus it is necessary to define the number of replications R in the algorithm. Algorithm 6 shows a procedure for tuning parameters using response surface methodology, based on the work of Gönen & Alpaydin (2011). In this algorithm the optimization of problem 6.10 is restricted to some operability region of the parameters, ℓ.

Proposed metaheuristic for the tuning parameters procedure

Algorithm 6 adds a new solution to the current sample at each iteration, which increases the number of sample points. This can be ineffective in the sense that some of these points may not actually contribute to obtaining a good fit.
Another reason is that, near the optimum, the objective function resembles a quadratic function, so better results are obtained using points near that optimum. This cannot necessarily be achieved with Algorithm 6, due to the bias induced by sample points far from the optimum. On the other hand, Algorithm 6 is completely deterministic, and this may cause rapid convergence to a local minimum. According to the discussion presented above, we propose a parameter tuning procedure based on the algorithm of Gönen & Alpaydin (2011).

Algorithm 6: Tuning parameters procedure.
Data: Number of replications per sample point, R; threshold, ε; dimensions of the experimental design, δ; operability region of parameters, ℓ.
Result: Parameters p∗
1. Build the design matrix with dimensions δ, using some experimental design;
2. Perform the experiment for each sample point R times and obtain the validation errors;
3. Fit a second order model for the generalization error function;
4. Solve the quadratic optimization problem given in Eq. 6.10, subject to the operability region ℓ, and get the optimum p∗0;
5. t = 0;
6. while p∗t − p∗t−1 ≥ ε do
7.    Perform the experiment R times for p∗t and obtain the validation errors;
8.    Fit a second order model using all information (sample points) available;
9.    Solve the quadratic optimization problem 6.10, subject to the operability region ℓ, and get the optimum p∗t+1;
10.   t = t + 1;
11. end

The proposed algorithm considers a fixed number of sample points at each iteration, some of which are obtained in a stochastic manner. The initial sample points are obtained in the same way as in Algorithm 6, giving a total of N sample points; then a new sample point is obtained by optimizing the fitted second order model, giving a total of N + 1 sample points.
From these N + 1 sample points we select the N − n best solutions and randomly generate n new solutions, obtaining a fixed sample size of N again. The n new solutions are generated according to a Gaussian distribution with mean equal to the mean of the current N − n best solutions and covariance matrix

Σ = diag(σ1², σ2², . . . , σk²),    (6.11)

where σi² is the variance of parameter i in the current sample of N − n points. The reason for using a Gaussian distribution is that the generated points tend to form a spherical group around the mean, and such a spherical sample can improve the fit of the second order model. The process is repeated using the new N sample points. The proposed algorithm is shown in Algorithm 7.

Algorithm 7: Proposed tuning parameters procedure.
Data: Number of replications per sample point, R; threshold, ε; dimensions of the experimental design, δ; operability region of parameters, ℓ; number of random solutions, n.
Result: Parameters p∗
1. Build the design matrix with dimensions δ, using some experimental design;
2. Perform the experiment for each sample point R times and obtain the validation errors;
3. Fit a second order model for the generalization error function;
4. Solve the quadratic optimization problem given in Eq. 6.10, subject to the operability region ℓ, and get the optimum p∗0;
5. t = 0;
6. while p∗t − p∗t−1 ≥ ε do
7.    Perform the experiment R times for p∗t and obtain the validation errors;
8.    Select the N − n best solutions from the current sample;
9.    Generate n new sample points using a Gaussian distribution with mean equal to the mean of the current N − n points and covariance matrix given in Eq.
6.11; if the generated points lie outside the operability region, they must be projected onto that region;
10. Fit a second order model using the N sample points available;
11. Solve the quadratic optimization problem 6.10, subject to the operability region ℓ and subject to the constraints

    min_{j=1,...,N} {p_ij^cur} ≤ pi ≤ max_{j=1,...,N} {p_ij^cur},   ∀i = 1, . . . , k,

    where p_ij^cur is the jth sample of the ith parameter in the current sample;
12. Get the optimum p∗t+1;
13. t = t + 1;
14. end

SEVEN

EXPERIMENTAL PROCEDURE

The purpose of this chapter is to give a complete description of the experimental methodology used in this work. The experiments involve different factors that influence the performance of the forecasting methods; these factors and the methodological framework are described below.

This study aims to solve the SLCP demand/sales forecasting problem as set out in Chapter 2, where we established some hypotheses (strategies) that could improve forecast performance. In order to investigate the set of hypotheses presented in Section 2.1.1, it is necessary to test the effect on the forecasts of using cumulative or non-cumulative data, the effect of partitioning or not partitioning the data, and the effect of the regression model used (see Chapter 6). We refer to these as experimental factors. Fig. 7.1 illustrates each of the experimental factors with their respective levels. There are 3 experimental factors: the regression method with 3 options (levels), the clustering usage with 2 options, and the type of data with 2 options. This leads us to evaluate 3 · 2 · 2 = 12 experimental treatments, which are the combinations of the options (levels) of the factors shown in Fig. 7.1. For example, a particular treatment is to evaluate the forecasts obtained by means of multiple linear regression (MLR), without partitioning the data, and using non-cumulative data.
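The full set of treatments is simply the Cartesian product of the factor levels, which can be enumerated directly (the level names below mirror Fig. 7.1):

```python
from itertools import product

# The 3 x 2 x 2 = 12 experimental treatments described above are all
# combinations of the factor levels of Fig. 7.1.
methods = ["MLR", "SVR", "ANN"]
clustering = ["clustering", "no clustering"]
data_type = ["cumulative", "non-cumulative"]

treatments = list(product(methods, clustering, data_type))
# e.g. ("MLR", "no clustering", "non-cumulative") is the example treatment
```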
We use a general procedure that standardizes the steps from data collection to final forecast evaluation. The framework of the forecasting procedures is shown in Fig. 7.2.

Fig. 7.1: Experimental factors.

7.1 Collection and analysis of data

Usually, forecasting methods try to predict the values of a time series using its past values. For a short life cycle product (SLCP), such historical information is usually unavailable or scarce. Thus, typical forecasting methods (such as moving average, exponential smoothing, and others) cannot be applied, or their forecasting results are very poor. Given the scarcity of historical demand information for a new product, useful information can be obtained from products that have completed, or are completing, their life cycle. We assume that the life cycle pattern can be learned from previous products in order to predict the demand of a new product. To do so, it is necessary to find large enough datasets of time series containing patterns similar to the time series we try to forecast; in a company this task amounts to finding the demand time series of older SLCPs. When such information is available, it is necessary to clean the data of noise and meaningless information. In this study, each of the considered datasets (see Section 5) is split into two subsets (see Fig. 7.2). The first subset, named the training set, is used in the training process and corresponds to 80% of the whole dataset.
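The 80/20 split can be sketched as follows; the demand series below are synthetic and purely illustrative.

```python
import numpy as np

# Sketch of the 80/20 split described above: each row of `series` is
# one product's demand time series; 80% of the series go to training
# and the remainder simulates real-time (test) series.
rng = np.random.default_rng(5)
series = rng.poisson(lam=20, size=(600, 24))   # 600 series, 24 periods

idx = rng.permutation(len(series))             # random assignment of series
cut = int(0.8 * len(series))
train_set, test_set = series[idx[:cut]], series[idx[cut:]]
```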
The training set allows the forecasting machines (see Chapter 6) to be prepared beforehand via a training procedure. The second subset, called the test set, is used to evaluate the forecast performance of each forecasting procedure. This subset simulates the real-time information available once the SLCP has been introduced to the market, i.e. once the time series to be forecasted appears. Note that the training process is carried out once, beforehand, and it is not necessary to repeat it each time real-time information becomes available; this saves considerable computation time.

Fig. 7.2: Framework of the forecasting procedures. The stages are: (1) collect and analyze data (preprocessing and data cleaning; historical sales profiles of SLCP demand as training-validation data, real-time data as testing data; obtain cumulative data if required); (2) perform clustering if required (use Alg. 1, Alg. 3, Alg. 4, or Alg. 5; validate the clustering results; get the clusters and cluster centroids); (3) perform classification if required (classify the time series of the real-time data into the cluster with the nearest centroid); (4) tune the parameters of the machine, MLR, SVR or ANN (use n-fold cross validation to estimate the error; search for the machine parameters using Alg. 6 or Alg. 7; train the machine using the optimal parameter values, with the cluster dataset if clustering is used and the whole dataset otherwise); (5) evaluate the forecasts (perform the forecasts; if cumulative data is used, obtain the correct forecasts by differencing; evaluate the forecast performance).

7.2 Clustering

This operation is performed if required by the experimental treatment; the clustering process is carried out using the training set (either with cumulative or non-cumulative data).
In this study the FCM, FMLE, and FSTS clustering algorithms (see Section 5.3.2) are considered. It is necessary to perform a clustering validation in order to tune clustering parameters such as the number of clusters K and the fuzzifier parameter m. This process finds groups in the data such that the data within a cluster share similar characteristics while differing significantly from the remaining clusters. The information that emerges from this stage, namely the partition of the data and the cluster centroids, is used in the following stages: the partition of the data is used for training, and the cluster centroids enable the classification of real-time data (time series) into their most similar cluster.

After obtaining the clusters, it is necessary to relate the real-time information (time series) to these groups; the goal is that any real-time time series can be classified into one of the previously established clusters. In this work a minimum distance classification is performed: considering the real-time time series up to time t, yt, and the cluster centroids up to time t, vti, ∀i = 1, . . . , K, the real-time time series is classified into the cluster k for which

k = argmin_{i=1,...,K} {d(yt, vti)}.    (7.1)

In other words, the real-time time series up to time t is classified into the cluster with the nearest centroid up to time t. The selected cluster provides the data to be used in the subsequent training process.

7.3 Parameter tuning

The parameter tuning procedure requires an estimate of the forecast error. The forecast error is estimated with a 5-fold cross-validation on the training dataset^1, as described in Section 6.4. Hence, the objective is to reach the minimum possible cross validation error by using the parameter search procedure described in Algorithm 6 or Algorithm 7.
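The 5-fold cross-validation error estimate just mentioned can be sketched as follows; a linear model and synthetic data stand in for the tuned forecasting machines, for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Sketch of a 5-fold cross-validation error estimate: each fold serves
# once as validation set and the average validation error estimates the
# generalization (forecast) error.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.5, -0.5, 0.2]) + rng.normal(scale=0.1, size=100)

errors = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
cv_error = float(np.mean(errors))   # approximates the 0.1^2 noise variance here
```

During tuning, this estimate is recomputed for each candidate parameter setting, and the setting with the smallest cv_error is kept.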
^1 Data for training can come from cumulative data, non-cumulative data, cluster data, or any combination, depending on the experimental treatment considered.

The proposed forecasting method requires training the regression methods (MLR, ANN, and SVR) for each period to be predicted. For example, suppose that the training set consists of 3 time series of length 5, as shown below:

y11, y12, y13, y14, y15;
y21, y22, y23, y24, y25;
y31, y32, y33, y34, y35,

and consider that the time lag is set to p = 3. Then, to forecast a real-time time series at period 4, it is necessary to train a machine using the training data of periods 1 to 3 as inputs (regressors) and the data of period 4 as output (response). The same procedure is followed to forecast the other periods. Note that for period 2 we use only the training data of period 1 as regressor, because there is only one lag period available; similarly, to forecast period 3 we use the training data of periods 1 to 2 as regressors. It is impossible to forecast period 1, since there are no lag periods before it; a practical way to forecast the first period is to take an average over the training set.

According to the above discussion, obtaining an estimate of the forecast error by cross validation requires training T − 1 machines, where T is the length of the time series. This is a time consuming task if T is large, mainly because a parameter search must be conducted^2. To avoid this problem, in this work we use only 5 equally spaced periods of the time series for training and then obtain an estimate of the forecast error. Once the optimal values of the parameters are obtained, the machine is trained again using these values and considering all T − 1 periods. The results obtained allow forecasting any real-time time series.
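The per-period construction of training data described above can be sketched as follows; the small demand matrix is illustrative only.

```python
import numpy as np

# Sketch of the per-period training scheme: to train the machine that
# forecasts period t, use the (at most p) periods preceding t in every
# training series as regressors and period t as the response.
def training_data(series, t, p=3):
    first = max(1, t - p)            # earliest periods have fewer lags
    X = series[:, first - 1:t - 1]   # periods first..t-1 (0-based slicing)
    y = series[:, t - 1]             # period t as the output
    return X, y

train = np.array([[10, 12, 15, 14, 9],
                  [8, 11, 13, 12, 7],
                  [9, 13, 16, 15, 10.0]])
X4, y4 = training_data(train, t=4)   # regressors: periods 1-3; response: period 4
```

With t = 2 the same function returns only period 1 as regressor, matching the discussion of the shortened lag window for the earliest periods.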
7.4 Forecasts evaluation

At this stage the forecasts of the real-time time series are obtained using the results of the machine training procedure as discussed before. The forecast is performed for each time series of the real-time (test) dataset. It is noteworthy that when the forecasts are based on cumulative data, it is necessary to transform the data back correctly by differencing³.

² This is particularly true when support vector regression or artificial neural networks are considered. The case of multiple linear regression, however, is very efficient.

The Root Mean Square Error (RMSE) evaluates forecast performance and is calculated as follows:

    RMSE = sqrt( (1/T) Σ_{t=1}^{T} (y_t − ŷ_t)² ),    (7.2)

where ŷ_t is the forecasted value at period t. This work also uses the Mean Absolute Error (MAE) because, as will be shown later, this metric provides an absolute comparison point. The MAE is calculated as follows:

    MAE = (1/T) Σ_{t=1}^{T} | (y_t − ŷ_t) / ŷ_t |.    (7.3)

7.5 Some computational aspects

The algorithms were implemented in Matlab 2011b. We used the Neural Network Toolbox of Matlab to train the feed-forward neural networks. For the support vector machine case we used the LIBSVM Toolbox for Matlab due to Chang & Lin (2011); from this toolbox, epsilon support vector regression was considered in this work. All algorithms presented earlier in this work were programmed in Matlab on the Windows operating system and on a 32 GB, 3.40 GHz machine.

7.6 Results of the parameter tuning procedure

7.6.1 Tuning parameters for SVR machines

As previously mentioned, the parameter tuning procedure takes into account the cross-validation estimation of the forecasting error for a given training set. In the case of an SVM it is necessary to define the values of the width of the tube ε, the penalty constant C, and the Gaussian kernel parameter σ, as described in Section 6.2.
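The two error metrics of Eqs. (7.2) and (7.3) above can be computed as follows (an illustrative Python sketch; note that the MAE of Eq. (7.3), as written, scales each absolute error by the forecasted value):

```python
import numpy as np

def rmse(y, y_hat):
    """Root Mean Square Error, Eq. (7.2)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    """Mean Absolute Error as defined in Eq. (7.3): absolute errors
    scaled by the forecasted values (a relative error)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs((y - y_hat) / y_hat)))
```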
Previous works on the search for such parameters establish that an exponentially growing sequence of the parameters is a practical method to identify good values (Hsu et al., 2010; Lin, 2006). For this reason the following search space is defined:

    ε ∈ [2⁻⁵, 2⁴];
    C ∈ [2⁻², 2¹⁵];
    γ = 1/(2σ²) ∈ [2⁻¹⁶, 2⁵].

³ To obtain the non-cumulative demand we use the difference between cumulative demands, x_t = X_t − X_{t−1}, where X_t is the cumulative demand at period t and x_t is the non-cumulative demand.

This search region will be referred to as the operability region, ℓ, of the parameters, which is considered large enough. For the support vector machines (SVM) the proposed search procedure described in Algorithm 7 is used as the search method. In this work a Central Composite Design (CCD, see Section 6.4.1) is used to get the initial sample points. The distance between the axial points of the design is given by the extreme values of each parameter within the operability region. The design is chosen to be rotatable (see Myers & Montgomery (2002)), which implies that the relation α = k^{1/4} must be satisfied, where α is a normalized mean distance between the axial points⁴ and k is the number of parameters to be considered, in this case k = 3. In this work only R = 1 replicate is considered, and the threshold or stopping criterion (see Section 6.4.1) of the algorithm is set to ε = 0.005. Finally, the proposed parameter tuning procedure requires removing the n worst solutions and generating n new random solutions; in this work we set n = 5 given that N = 15 (see Section 6.4.1).

The performance of Algorithm 7 at different iterations is shown in Fig. 7.3 for the different datasets. The figure shows that, in general, the algorithm converges to a (possibly local) optimum. The results show that the performance of the proposed method is promising.
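A minimal sketch of such an exponentially growing candidate grid over the operability region (illustrative Python; the exponent step of 2 is our assumption, in the spirit of the coarse grids recommended by Hsu et al.):

```python
import itertools
import numpy as np

# Candidate values grow exponentially over the operability region.
eps_grid = 2.0 ** np.arange(-5, 5, 2)    # eps   in [2^-5, 2^4]
C_grid   = 2.0 ** np.arange(-2, 16, 2)   # C     in [2^-2, 2^15]
gam_grid = 2.0 ** np.arange(-16, 6, 2)   # gamma = 1/(2*sigma^2) in [2^-16, 2^5]

# Every (eps, C, gamma) combination is a candidate to be scored by
# the cross-validation error.
candidates = list(itertools.product(eps_grid, C_grid, gam_grid))
```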
An interesting fact is that the training process is little sensitive to the tube width parameter ε, as shown for the real datasets at the intermediate iteration. Tab. 7.1 shows the results of the parameter tuning procedure for non-cumulative data, and Tab. 7.2 shows the results for cumulative data, both at the optimal number of lags. Evidently the search time is much larger when cumulative data is considered.

⁴ No details are presented on such standardization; the reader can refer to Myers & Montgomery (2002).

Tab. 7.1: SVR results of the parameter tuning procedure for non-cumulative data.

    Dataset   p*   log2 ε   log2 C   log2 γ   Search time (s)
    RD1        2   −2.50      9.86   −14.64   167
    SD1       16   −1.89      5.68   −10.35   237
    RD2        6    1.16     12.35   −19.12   118
    RD3       10   −3.59     10.93   −15.26   223

    *Optimal number of lags, see Fig. 8.2.

Tab. 7.2: SVR results of the parameter tuning procedure for cumulative data.

    Dataset   p*   log2 ε   log2 C   log2 γ   Search time (s)
    RD1        1   −3.19     14.90   −19.92      439
    SD1       18   −3.01     14.53   −15.99   22 554
    RD2        1    0.00     15.00   −23.22      163
    RD3        1   −9.48     13.80   −20.35    3 042

    *Optimal number of lags, see Fig. 8.2.

7.6.2 Tuning parameters for ANN machines

This work uses the feed-forward neural network, or multilayer perceptron, as regression method, given the relative simplicity of this neural network model. Such networks require adjusting many parameters: the learning rate, the number of neurons per hidden layer, the number of hidden layers, the activation functions per neuron, the training algorithm, the number of iterations of the training algorithm, the weight initialization procedure, and possibly many others that affect the performance of the machine. The interest here, however, is focused on determining the number of neurons per hidden layer and the number of hidden layers; the other parameters are set in a relatively arbitrary manner according to the author's experience.
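The resulting architecture search space (up to 2 hidden layers, up to 28 neurons each, where a layer with 0 neurons is omitted) can be enumerated as follows (an illustrative Python sketch; the helper name is ours):

```python
def candidate_architectures(max_neurons=28):
    """Enumerate the hidden-layer configurations considered in the
    search: (n1,) for a single hidden layer, (n1, n2) for two hidden
    layers. A second layer with 0 neurons is simply not generated."""
    archs = []
    for n1 in range(1, max_neurons + 1):
        archs.append((n1,))
        for n2 in range(1, max_neurons + 1):
            archs.append((n1, n2))
    return archs
```

This yields 28 one-layer and 28 × 28 two-layer configurations, 812 candidates in total, each to be scored by the cross-validation error.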
In this sense the Levenberg-Marquardt training algorithm with Bayesian regularization is considered, the learning rate is set to 0.01, the number of runs (iterations) is set to 300, the weights are initialized to zero, a tangent sigmoid activation function is used for the hidden neurons, and a linear activation function is used for the output neuron. Our experience working with artificial neural networks evidences a high computational cost, which may be a consequence of the use of Bayesian regularization in the training algorithm. This training algorithm implies longer processing times, but good results are obtained, which was the reason to select it.

Accordingly, it is necessary to determine the number of neurons per hidden layer and the number of hidden layers. For this purpose we consider up to 2 hidden layers and up to 28 neurons per hidden layer; a hidden layer with 0 neurons indicates that such layer is not considered. The parameter tuning procedure for ANN does not show a clear convergence, as is the case with the SVM. The fact that the parameters are integers can make the search procedure less effective.

Tab. 7.3: ANN results of the parameter tuning procedure for non-cumulative data.

    Dataset   p*   Neurons in hidden layer 1   Neurons in hidden layer 2   Search time (s)
    RD1        4    3                           0                          267
    SD1       25    1                           4                          309
    RD2       23    2                           0                          339
    RD3       24   14                          28                          291

    *Optimal number of lags, see Fig. 8.3.

Tab. 7.4: ANN results of the parameter tuning procedure for cumulative data.

    Dataset   p*   Neurons in hidden layer 1   Neurons in hidden layer 2   Search time (s)
    RD1       11    7                           0                          785
    SD1        1    1                           0                          582
    RD2        1    1                           0                          892
    RD3        1   28                           0                          611

    *Optimal number of lags, see Fig. 8.3.

The results of the search are summarized in Tab. 7.3 for the non-cumulative case and in Tab. 7.4 for the cumulative case. Note that the search time is
longer as compared with the SVM case; also note that such time is larger for the cumulative case than for the non-cumulative case.

7.7 Conclusions of the chapter

This chapter presented a general theoretical framework for the regression methods to be used in the forecasting framework, as will be seen later. This chapter also considered the problem of tuning the parameters of learning machines such as neural networks and support vector machines. We consider the cross-validation regression error as the figure of merit for parameter tuning. The metaheuristic procedure proposed for tuning the parameters generates reasonable results.

Fig. 7.3: Sample points given by the proposed parameter tuning procedure at different iterations: start, intermediate, and final iteration. (a) RD1 dataset; (b) SD1 dataset; (c) RD2 dataset; (d) RD3 dataset. These results are obtained for a number of lags of p = 2; we omit the results for other values of p.

EIGHT

RESULTS

8.1 Forecasting results using multiple linear regression

In this situation the parameters of the regression model are obtained using Eq. 6.3 of Section 6.1 on the training set (see Chapter 7). Then the test set is used to investigate the performance of the method by means of the Root Mean Square Error (RMSE) metric. The results obtained using non-cumulative and cumulative data are shown in Fig. 8.1 for different numbers of lags.
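A minimal sketch of the MLR fit used here (illustrative Python; the ordinary least-squares solution is assumed as a stand-in for Eq. 6.3, and the function names are ours):

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least-squares estimate of the MLR coefficients
    (intercept first), computed on the lagged training inputs X."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_mlr(beta, X):
    """Forecast the next period from the lagged inputs X."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ beta
```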
It is shown that the use of non-cumulative data improves the forecasting results in terms of mean RMSE and variability of the forecast error, as shown by the error lines. This result is confirmed in practically all datasets. Tab. 8.1 shows a comparison of the results obtained using non-cumulative and cumulative data in the forecast process. As observed, the use of non-cumulative data improves the forecast results in all cases, as demonstrated with the Kruskal-Wallis test¹. As expected, the processing time (training and forecasting) is smaller using non-cumulative data. This is reasonable since the use of cumulative data requires the transformation of cumulative results into non-cumulative forecasts, spending additional processing time. On average, the use of cumulative data increases the processing time to about 3 times the time required for non-cumulative data. These results have an important implication: the use of cumulative data does not necessarily improve the forecast performance; this fact contradicts the hypothesis presented in Section 2.1.1.

¹ The Kruskal-Wallis test is a statistical procedure that performs a non-parametric one-way analysis of variance by comparing the medians of the experimental treatments or variables.

Fig. 8.1: Multiple linear regression results for non-cumulative (blue line) and cumulative (red line) data and for different numbers of lags (p). The black points correspond to the optimum. The error lines correspond to 10 % of the standard deviation. (a) RD1 dataset. (b) SD1 dataset. (c) RD2 dataset. (d) RD3 dataset.

It is notable that the optimal number of lags for the SD1 (Bass process synthetic dataset) is very large; this implies a strong linear correlation or lag dependence.
On the other hand, the optimal number of lags is relatively short for the other datasets, and for cumulative data only one lag is required.

Tab. 8.1: Multiple linear regression results using complete datasets.

    Dataset     p*   Mean RMSE   S. Deviation   Time (s)   p-value**
    RD1          2       4.069          2.885       0.29
    RD1 cum.     1       5.503          4.513       0.59        0.00
    SD1         24       1.611          0.355       0.67
    SD1 cum.    24       2.324          0.348       4.2         0.00
    RD2          5      16.504          8.674       0.41
    RD2 cum.     1      21.592         12.673       0.71        0.00
    RD3          3       2.259          1.508       0.34
    RD3 cum.     1       2.440          1.537       1.29        0.02

    *Optimal number of lags, see Fig. 8.1.
    **p-value of the Kruskal-Wallis test for each non-cumulative/cumulative pair; values near 0 imply median differences.

8.1.1 Multiple linear regression results with clustering

Preliminary study

In a preliminary study we carried out experiments using the SD1 dataset and considering the correct partition of the data, which consists of four groups (see Section 5.1.2). The purpose of this experiment is to evaluate the effect of partitioning the data on the forecast performance given that the correct partition of the data is known, as is the case for the SD1 dataset. According to the Kruskal-Wallis test, in this experiment again the use of non-cumulative data yields better forecasts than the use of cumulative data. The results are shown in Tab. A.1 of Appendix A. For non-cumulative data the mean RMSE is smaller when clustering is used in the forecasting process; however, the methods produce statistically equal results according to the p-value of the Kruskal-Wallis test. On the other hand, for cumulative data the clustering does not improve the results and the difference between both methods is statistically significant. An interesting result is that the RMSE variability increases when partitioning the data. This can be explained because less data is used in the regression process, so the variability of the estimations increases.
However, partitioning the data apparently reduces the process time.

Use of clustering algorithms

This section assesses the effect of partitioning the data on the forecast process for the real datasets. When forecasting with clustering, factors such as the number of clusters K, the fuzzifier parameter m of the clustering algorithm, and the number of lags p may affect the forecast performance. Keeping this in mind, and to avoid potential bias in the results, we evaluated several values of such parameters beforehand by a grid search. Tab. B.1 of Appendix B shows the forecasting results obtained with several clustering algorithms for the optimal values of the clustering and lag parameters. The results show that the FMLE clustering algorithm outperforms the other algorithms in the RD1 and RD3 datasets. The FSTS does better than the other clustering algorithms for SD1 and cumulative RD3, and the FCM performs better than the other clustering algorithms for the non-cumulative RD3 dataset. An interesting result related to the FSTS clustering algorithm is that the optimal number of clusters for each non-cumulative dataset is the same as the optimal number of clusters found by the PCAES index (see Tab. 5.3 in Section 5.3.3).

The effects of clustering are evaluated with the optimal values of the clustering and lag parameters given in Tab. B.1 (Appendix B). The FMLE clustering algorithm is selected to perform the evaluation because it presents good results in most datasets. According to the results with non-cumulative data (Tab. 8.2), the RMSE of the SD1 dataset is smaller, with a statistically significant difference, when partitioning the data. This is not the case for the real datasets, for which there is no statistical difference in the error metric.
In fact the mean RMSE for RD1 and RD2 is greater when clustering is used, which may be interpreted as a negative effect of the clustering process. However, in the SD1 and RD3 datasets the mean RMSE is smaller when clustering is used. Thus, in some cases clustering improves the results, but in other cases it has the opposite effect. A possible explanation is that clustering works well for data in which a cluster structure is evident; when there is no evident clustering tendency in the data, clustering may actually not improve the forecast performance. It is important to note that clustering reduces the size of the training set, limiting the amount of data available for parameter estimation.

According to the results shown in Tab. 8.2, the standard deviation of the error for the SD1 and RD3 datasets² is smaller than the corresponding standard deviation of datasets RD1 and RD2. This apparently indicates that if clustering improves the results, the variability of the forecasts is smaller. It is evident that the processing time increases when clustering is performed: from our results this increase is of about 6 times the time required without clustering in the non-cumulative case, and of about 3 times in the cumulative case. This, however, does not consider the time required to adjust the clustering and lag parameters. The results obtained when cumulative data is used are shown in Tab. 8.3.

² Datasets that have smaller mean RMSE when clustering is used.

Tab. 8.2: Optimal results for multiple linear regression partitioning the data, non-cumulative data.

    Dataset   Mean RMSE   S. Deviation   Cluster time (s)   Forecast time (s)   p-value*
    RD1           4.070          2.917               2.79                0.30       0.96
    SD1           1.420          0.320               1.92                0.52       0.00
    RD2          16.897          9.244               1.42                0.23       0.94
    RD3           2.247          1.406               1.44                0.37       0.91

    *p-value of the Kruskal-Wallis test evaluating the effect of partitioning or not partitioning the data.
The reader may compare these results with the results shown in Tab. 8.1 for the non-cumulative case. These results do not improve the forecast performance, as already mentioned: in general, there is an increase in the mean error, the standard deviation of the error, and the processing time.

Tab. 8.3: Optimal results for multiple linear regression partitioning the data, cumulative data.

    Dataset   Mean RMSE   S. Deviation   Cluster time (s)   Forecast time (s)   p-value*
    RD1           5.009          3.437               2.82                0.42       0.62
    SD1           1.833          0.456               1.59                0.70       0.00
    RD2          18.944         10.259               1.71                0.43       0.02
    RD3           2.623          1.677               1.53                0.45       0.12

    *p-value of the Kruskal-Wallis test evaluating the effect of partitioning or not partitioning the data. The reader may compare these results with the results shown in Tab. 8.1 for the cumulative case.

8.1.2 Conclusions of the MLR case

From the results obtained here we can conclude that the use of cumulative data does not improve the forecast performance and increases the processing time. Partitioning the data does not necessarily improve the forecast results; however, there may be a marked improvement if the data has an evident clustering structure. It is necessary to note that data partitioning increases the time required for processing and parameter tuning. We have stated the hypothesis that partitioning the data could improve prediction performance; however, our datasets failed to properly validate this fact. A plausible explanation is that the real datasets do not have a clustering structure, as observed in Fig. 5.3, Section 5.3.4.

8.2 Forecasting results using support vector regression

The idea of using support vector machines is to capture the nonlinear process in the data in order to obtain better forecasts. An evaluation of the lag parameter is shown in Fig. 8.2. It is shown that the lag dependence of the non-cumulative case is stronger than in linear regression (see Fig. 8.1).
For SVR the use of non-cumulative data improves the forecasting results, as expected.

Fig. 8.2: Support vector regression results for non-cumulative (blue line) and cumulative (red line) data and for different numbers of lags (p). The black points correspond to the optimum. (a) RD1 dataset. (b) SD1 dataset. (c) RD2 dataset. (d) RD3 dataset.

Tab. 8.4 shows the results considering the optimal number of lags (as shown in Fig. 8.2). As observed, the use of non-cumulative data improves the forecast results in all cases, as demonstrated with the Kruskal-Wallis test. On the other hand, the processing time (training and forecasting) is smaller using non-cumulative data in almost all cases. Finally, the standard deviation of the RMSE measures is smaller when non-cumulative data is used. This corroborates the fact that the use of cumulative data does not improve the performance of the forecast.

Tab. 8.4: Support vector regression results using complete datasets.

    Dataset     p*   Mean RMSE   S. Deviation   Train time (s)   Forecast time (s)   p-value**
    RD1          2       3.983          3.157             0.29                1.21
    RD1 cum.     1       5.131          4.096             2.62                1.10        0.00
    SD1         16       1.298          0.269             0.51                2.68
    SD1 cum.    18       1.792          0.451             54.2                2.83        0.00
    RD2          6      16.409         10.489             0.32                1.50
    RD2 cum.     1      18.417         10.593             0.33                1.41        0.02
    RD3         10       2.154          1.473             0.52                2.15
    RD3 cum.     1       2.506          1.661             1.96                2.07        0.00

    *Optimal number of lags, see Fig. 8.2.
    **p-value of the Kruskal-Wallis test for each non-cumulative/cumulative pair; values near 0 imply median differences.

8.2.1 Support vector regression results with clustering

In order to evaluate the effect of clustering on the forecast performance without any possible bias, we performed a grid search for the clustering and lag parameters, as in the case of multiple linear regression. Tab.
B.2 of Appendix B shows the optimal values of the clustering and lag parameters according to the different clustering algorithms. It is shown that the FMLE clustering algorithm improves the results in almost all cases; for this reason the FMLE algorithm is selected for the analysis. Tab. 8.5 shows the results for the non-cumulative case. As observed, there is no improvement in the mean RMSE when clustering is used; in fact, the variability of the measurements increases. According to the Kruskal-Wallis test, the measurements are not statistically different; this shows that clustering has no effect on the results when SVR is used.

Tab. 8.5: Optimal results for support vector regression partitioning the data, non-cumulative data.

    Dataset   Mean RMSE   S. Deviation   Cluster time (s)   Train time (s)   Forecast time (s)   p-value*
    RD1           3.985          3.185               5.66             0.36                1.58       0.90
    SD1           1.301          0.273               4.59             0.54                3.91       0.99
    RD2          16.437         10.926               2.42             0.24                2.19       1.00
    RD3           2.172          1.661               1.58             1.85                2.49       0.82

    *p-value of the Kruskal-Wallis test evaluating the effect of partitioning or not partitioning the data. The reader may compare these results with the results shown in Tab. 8.4 for the non-cumulative case.

According to the results in Tab. 8.6, there are no improvements compared with the case of non-cumulative data. On the other hand, the clustering process could achieve lower values of the mean RMSE in the RD1 and SD1 datasets, and the variability of the measurements is smaller for the SD1 dataset. However, there is no statistical difference between the results with and without clustering.

Tab. 8.6: Optimal results for support vector regression partitioning the data, cumulative data.

    Dataset   Mean RMSE   S. Deviation   Cluster time (s)   Train time (s)   Forecast time (s)   p-value*
    RD1           5.081          3.838               2.86             0.31                1.41       0.98
    SD1           1.740          0.457               4.06             3.16                3.56       0.27
    RD2          18.91          10.96                1.40             0.21                1.37       0.64
    RD3           2.506          1.661               1.58             1.85                2.49       1.00

    *p-value of the Kruskal-Wallis test evaluating the effect of partitioning or not partitioning the data. The reader may compare these results with the results shown in Tab. 8.4 for the cumulative case.

8.2.2 Conclusions of the SVR case

The use of SVR machines generates results very similar to those obtained with MLR. There is strong evidence that the use of cumulative data does not improve the forecast performance and requires more computation time. The use of clustering does not show improvement in the forecasting performance, even for the synthetic dataset. It is important to note that the use of SVR machines significantly increases the computation time, mainly due to parameter tuning.

8.3 Forecasting results using artificial neural networks

Artificial neural networks are used as an alternative to support vector machines in order to capture the nonlinear structure in the data. In the evaluation of the lag parameter for ANN, some datasets such as SD1 and RD3 showed strong lag dependence in the non-cumulative data. The results show that the use of non-cumulative data improves the forecast performance, as expected; in fact the results for the cumulative case are very unstable. Tab. 8.7 shows the results considering the optimal number of lags (as shown in Fig. 8.3); there is enough statistical evidence to indicate that the use of cumulative data does not improve the results and, in general, such use increases the processing time, the error, and the variability of the measurements. Fig.
8.3: Artificial neural network results for non-cumulative (blue line) and cumulative (red line) data and for different numbers of lags (p). The black points correspond to the optimum. (a) RD1 dataset. (b) SD1 dataset. (c) RD2 dataset. (d) RD3 dataset.

Tab. 8.7: Artificial neural network results using complete datasets.

    Dataset     p*   Mean RMSE   S. Deviation   Train time (s)   Forecast time (s)   p-value**
    RD1          4       4.066          2.824             43.6               26.69
    RD1 cum.    11       5.545          4.287             16.1               35.3         0.00
    SD1         23       1.163          0.571              326               48.3
    SD1 cum.     1      12.874          0.587             27.1               63.9         0.00
    RD2          3      18.225         13.811             30.5               31.4
    RD2 cum.     7      23.228         11.699             30.0               31.5         0.00
    RD3          2       2.194          1.345              293               51.7
    RD3 cum.     1       2.961          2.064             63.7               41.7         0.00

    *Optimal number of lags, see Fig. 8.3.
    **p-value of the Kruskal-Wallis test for each non-cumulative/cumulative pair; values near 0 imply median differences.

8.3.1 Artificial neural network results with clustering

Clustering effects on forecast performance are evaluated in the same manner as in the previous cases. In the non-cumulative case there is no statistical evidence of improvement when using clustering for the real datasets. In the case of the synthetic dataset, however, the use of clustering actually improves the forecast performance (Tab. 8.7). In the SD1 and RD2 datasets the variability of the RMSE decreases when clustering is used. The results for the cumulative data case are given in Tab. 8.9 and do not show any improvement for the real datasets. For the synthetic dataset, however, there is a significant improvement in the forecast performance, as in the results for non-cumulative data.

8.3.2 Conclusions of the ANN case and other cases

The conclusions for the case of neural networks are similar to those obtained for multiple linear regression and support vector regression: the use of cumulative data does not improve the forecast performance, and the use of clustering has no clear effect on the forecast performance according to the real datasets.
As additional information about the results obtained so far, Tab. 8.10 presents a summary of the results for each experimental treatment. Note that in no case does the use of cumulative data improve the forecast performance.

Tab. 8.8: Optimal results for artificial neural network partitioning the data, non-cumulative data.

    Dataset   Mean RMSE   S. Deviation   Cluster time (s)   Train time (s)   Forecast time (s)   p-value*
    RD1           4.172          3.034               1.93             60.4                26.8       0.85
    SD1           1.294          0.439               4.65              276                51.8       0.00
    RD2          19.139         12.933               2.50             7.92                31.2       0.21
    RD3           2.194          1.345               2.48              282                47.8       1.00

    *p-value of the Kruskal-Wallis test evaluating the effect of partitioning or not partitioning the data. The reader may compare these results with the results shown in Tab. 8.7 for the non-cumulative case.

On the other hand, the use of clustering seems not to have any effect on the forecast performance. According to these results, the idea of using clustering for forecasting loses validity in practice, since this procedure does not improve prediction performance and has a high computational cost, especially in the clustering validation procedure.

The possibility of improving forecasting performance is one of the main reasons for using cumulative data, given the noise reduction (i.e., curve smoothing). Moreover, many diffusion models consider cumulative time series to facilitate parameter tuning tasks. In contrast to this idea, our results show that prediction performance is not improved; hence it is important to answer why the use of cumulative data is not effective in our forecasting framework. A first answer is that the use of cumulative data increases the range of the time series: even though the cumulative time series looks smooth, the error associated to the forecasts increases due to the increase in range. In Fig.
C.1 of Appendix C, the variances of the datasets at each time period are shown for cumulative and non-cumulative data. The use of cumulative data evidences a remarkable increase in the variability of the time series; this provides a clear picture of what happens with the use of cumulative data. The increase in variability implies a poor performance of the forecasting method.

Tab. 8.9: Optimal results for artificial neural network partitioning the data, cumulative data.

    Dataset   Mean RMSE   S. Deviation   Cluster time (s)   Train time (s)   Forecast time (s)   p-value*
    RD1           5.634          4.400               2.93             29.8                26.6       0.90
    SD1           1.886          0.543               4.11             78.2                44.7       0.00
    RD2          36.43          48.84                1.76             61.3                30.2       0.47
    RD3           2.983          1.957               1.62             59.9                40.7       0.26

    *p-value of the Kruskal-Wallis test evaluating the effect of partitioning or not partitioning the data. The reader may compare these results with the results shown in Tab. 8.7 for the cumulative case.

8.4 Comparison of forecasting methods

In order to have a visual illustration of the forecasting results we have selected a time series from each dataset to show the forecasts of each regression method; Fig. 8.4 shows such forecasts. The forecasts fit the time series relatively well. It is interesting to note the neural network forecast for the SD1 dataset: in this case the predictions are considerably good, and this holds for this dataset in general. On the other hand, Fig. 8.5 shows the p-values of the pairwise comparison of the regression methods according to the Kruskal-Wallis test. For the real datasets there is no difference between the regression methods at a confidence level of at least 90 %. It is clear that multiple linear regression and artificial neural networks produce very similar results for the real datasets.
In contrast, in the case of the synthetic dataset, the regression methods are statistically different with confidence levels well above 95 %, and in this case artificial neural networks are the undisputed winners. The absence of statistical differences between the regression methods for the real datasets highlights multiple linear regression as a very efficient method to forecast the demand of SLCPs within the proposed forecasting framework discussed in this work (see Chapter 7).

Tab. 8.10: Summary evaluation of the experimental treatments according to the RMSE measurements.

    Regression method   Dataset   Use of cumulative data   Use of clustering
    MLR                 RD1       No                       No effect
                        SD1       No                       Yes
                        RD2       No                       No effect
                        RD3       No                       No effect
    SVR                 RD1       No                       No effect
                        SD1       No                       No effect
                        RD2       No                       No effect
                        RD3       No                       No effect
    ANN                 RD1       No                       No effect
                        SD1       No                       No
                        RD2       No                       No effect
                        RD3       No                       No effect

An evaluation of the forecasting and regression methods is performed by considering the mean absolute error of the forecasts at each time period; Fig. 8.6 shows such results. An interesting result is that the mean absolute error is larger at the beginning and at the end of the time series for the datasets whose time series have clearly completed their life cycle. This fact is not obvious for the RD2 dataset because many of the time series contained in that dataset have not yet completed their life cycle, as already mentioned in Section 5.1.1. This implies that the absolute error decreases around the peak sales, or around the peak of the time series. As to the comparison of the regression methods, they achieve similar results at the beginning and during the peak of the time series. In contrast, there are noticeable differences between the regression methods at the end of the life cycle. This is most evident for the RD3 dataset, in which the results at the end of the life cycle look very unstable, in particular for SVR and ANN.
However, given the above results, it is difficult to establish which regression method performs better for a given set of periods.

Fig. 8.4: Forecasting results for selected time series (observed data together with the MLR, SVR, and ANN forecasts). (a) RD1 dataset; (b) SD1 dataset; (c) RD2 dataset; (d) RD3 dataset. Note: the forecast of the first period was obtained as the average of the training set.

Fig. 8.5: Pairwise comparison of the regression methods (MLR-SVR, MLR-ANN, SVR-ANN) on each dataset according to the Kruskal-Wallis test.

Fig. 8.6: Mean absolute error results for each regression method and each time period. (a) RD1 dataset; (b) SD1 dataset; (c) RD2 dataset; (d) RD3 dataset. Note that the greatest mean absolute errors are found at the beginning and the end of the life cycle.

9. Conclusions

This work addressed the problem of forecasting the demand of short life cycle products (SLCPs) using multiple linear regression and machine learning methods such as support vector machines (SVMs) and artificial neural networks (ANNs). The regression methods and the methodology followed in this study show a clear advantage over other forecasting methods proposed in the literature, because they make it possible to obtain forecasts at early stages of the product life cycle. In fact, the methods discussed in this work require only the information available from the first demand/sales period, while other methods proposed in the literature require demand/sales information from at least three previous periods. This feature is possible due to the effective use of the demand time series of similar products that have already completed their life cycles.
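The idea of forecasting a new product from its first observed period, using a regression fitted on the histories of similar completed products, can be sketched as follows. This is a simplified single-lag illustration of the framework, not the thesis implementation, and the demand histories are invented:

```python
def fit_lag1(histories):
    """Least-squares fit of y_t = a + b * y_{t-1} over all consecutive
    pairs taken from the histories of similar, completed products."""
    pairs = [(s[t - 1], s[t]) for s in histories for t in range(1, len(s))]
    n = float(len(pairs))
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    # slope = cov(x, y) / var(x); assumes the lagged values are not all equal
    b = sum((x - mx) * (y - my) for x, y in pairs) / \
        sum((x - mx) ** 2 for x, _ in pairs)
    return my - b * mx, b

def roll_forecast(first_period, horizon, a, b):
    """Forecast a new product from its first observed demand only."""
    y = [first_period]
    for _ in range(horizon - 1):
        y.append(a + b * y[-1])
    return y

# invented histories of completed products; each grows by one unit per period
histories = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
a, b = fit_lag1(histories)
forecast = roll_forecast(5.0, 3, a, b)
```

The thesis uses p lags and several regression methods; the single lag here only keeps the sketch short.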
This work considered different strategies (hypotheses) aimed at improving the forecast performance of the regression methods. From the results obtained in this work we can conclude that:

• The use of cumulative data does not improve forecast performance, and the results provide clear evidence of this statement. A possible explanation is the systematic increase of the variance of the cumulative time series, which results from summing random variables. Although a cumulative time series is smooth, its values at the different periods hide a larger variance than the non-cumulative values, and this increase in variance generates poor forecasting results. In addition, the use of cumulative data increases the processing time, showing that cumulative data are of little benefit for forecasting within the framework proposed in this work.

• The effect of clustering on the forecasting results is not clear. Our experience with clustering as a method to extract relevant information for the forecasting process shows that, apparently, an improvement in forecast performance is possible when the data exhibit a clear clustering structure. Unfortunately, none of the real datasets in this study shows such a structure. We believe that clustering techniques may be a valuable tool in the development of effective forecasting methods; however, the corresponding analysis is beyond the scope of this work. A possible direction is to forecast using the degree of membership of each pattern to each cluster, which is expected to improve the results.

• Non-linear regression methods do not show a significant improvement in forecasting performance for most of the datasets. Multiple linear regression produced results statistically equal to those of support vector regression and artificial neural networks at a confidence level of at least 90 %.
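The membership-based direction mentioned in the second conclusion relies on fuzzy partitions such as those produced by FCM. A minimal sketch of the standard FCM membership computation for fixed cluster centers (the points and centers below are invented, and the iterative center update is omitted):

```python
def fcm_memberships(points, centers, m=2.0):
    """Fuzzy c-means membership degrees u_ik of each point to each
    cluster, for a fuzzifier m > 1 and fixed cluster centers."""
    memberships = []
    for x in points:
        # Euclidean distance from the point to every center
        d = [sum((xj - cj) ** 2 for xj, cj in zip(x, c)) ** 0.5
             for c in centers]
        if min(d) == 0.0:  # the point coincides with a center
            memberships.append([1.0 if di == 0.0 else 0.0 for di in d])
            continue
        e = 2.0 / (m - 1.0)
        memberships.append([
            1.0 / sum((d[i] / d[j]) ** e for j in range(len(centers)))
            for i in range(len(centers))
        ])
    return memberships

# invented 2-D patterns and two fixed centers
u = fcm_memberships([(0.5, 0.0), (1.0, 0.0)], [(0.0, 0.0), (2.0, 0.0)])
```

The memberships of each pattern sum to one, so they could weight cluster-specific forecasts as suggested above.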
This allows us to conclude that multiple linear regression is an efficient and effective method to forecast the demand of SLCPs. To sum up, according to the results and analysis presented in this document, applying MLR with non-cumulative data and without clustering is the best option to obtain low prediction errors.

APPENDIX A

RESULTS FOR THE SD1 DATASET USING THE CORRECT PARTITION

Tab. A.1 shows the effect on forecasting performance of partitioning the data of the SD1 dataset according to its correct partition. The forecasting process is carried out with the optimal number of lags for the complete dataset (see Fig. 8.1 and Tab. 8.1). Tab. A.1 shows that the clustering algorithm improves the results obtained on the SD1 dataset.

Tab. A.1: Preliminary multiple linear regression results for the SD1 dataset considering all data and their correct partition.

NON-CUMULATIVE DATA
Dataset      p*   Mean RMSE   S. Deviation   Time (s)
SD1          24   1.6099      0.3564         0.7431
SD1 clust.   24   1.6005      0.5225         0.4888
p-value**: 0.1114

CUMULATIVE DATA
Dataset      p*   Mean RMSE   S. Deviation   Time (s)
SD1          24   2.3227      0.3475         5.0256
SD1 clust.   24   2.4926      0.6497         4.2385
p-value**: 0.0687

*Optimal number of lags, see Fig. 8.1.
**p-value of the Kruskal-Wallis test; values near 0 imply median differences.

APPENDIX B

THE EFFECT OF THE CLUSTERING ALGORITHM

B.1 Multiple linear regression case

In this section, the STS, FCM and FMLE algorithms cluster the datasets in order to investigate the effect of partitioning on prediction performance (see Section 5.3.2). We investigate the effect of the clustering algorithm, the number of clusters, the value of the fuzzifier parameter, and the number of lags on the forecasting performance. The number of partitions K ranges from K = 2 to K = 8 clusters, the number of lags is up to p = 10 (for the SD1 dataset, however, we search over the lags p = {18, . . .
, 24} due to its strong lag dependence), and the fuzzifier parameter takes the values m = {1.1, . . . , 2.0, 2.25, 2.5, 2.75, 3}. The results are shown in Tab. B.1.

Tab. B.1: Comparison of the FCM, FMLE and FSTS algorithms for MLR. Each cell reports RMSE (K, m, p).

                 NON-CUMULATIVE DATA                                            CUMULATIVE DATA
Dataset  FCM                  FMLE                FSTS                 FCM                 FMLE                FSTS
RD1      4.084 (2, 1.5, 2)    4.070 (5, 2.8, 4)   4.154 (2, 1.3, 2)    5.048 (2, 1.3, 1)   5.009 (4, 1.1, 1)   5.064 (2, 1.2, 1)
SD1      1.431 (4, 2, 24)     1.420 (5, 2, 24)    1.413 (6, 1.7, 20)   1.841 (7, 1.7, 18)  1.832 (6, 1.3, 18)  1.7861 (8, 1.1, 15)
RD2      16.897 (2, 2.25, 6)  17.143 (2, 1.3, 5)  17.137 (2, 1.2, 5)   19.896 (3, 3, 1)    18.944 (4, 1.1, 1)  19.449 (2, 1.4, 1)
RD3      2.247 (2, 1.5, 2)    2.215 (2, 1.4, 3)   2.240 (5, 1.8, 1)    2.615 (2, 2.5, 1)   2.623 (2, 1.4, 1)   2.583 (3, 3, 1)

B.2 Support vector regression case

After evaluating the effect of the clustering and lag parameters for the SVR case we obtained the results shown in Tab. B.2. As can be observed, the use of the FMLE clustering algorithm improves the results on almost all datasets.

Tab. B.2: Comparison of the FCM, FMLE and FSTS algorithms for SVR. Each cell reports RMSE (K, m, p).

                 NON-CUMULATIVE DATA                                            CUMULATIVE DATA
Dataset  FCM                  FMLE                FSTS                 FCM                 FMLE                FSTS
RD1      4.030 (2, 1.8, 4)    3.985 (8, 2.5, 5)   4.051 (2, 1.3, 2)    5.148 (2, 1.2, 1)   5.082 (4, 1.1, 1)   5.152 (2, 1.6, 1)
SD1      1.373 (3, 1.5, 17)   1.301 (8, 1.6, 17)  1.320 (2, 1.6, 16)   1.808 (8, 1.2, 17)  1.740 (8, 1.4, 17)  1.723 (8, 1.4, 17)
RD2      16.437 (8, 2.25, 5)  16.878 (2, 1.4, 4)  17.276 (2, 1.2, 6)   19.492 (2, 2, 1)    18.911 (2, 2.5, 1)  19.855 (2, 2.5, 1)
RD3      2.172 (6, 1.6, 5)    2.178 (2, 1.4, 5)   2.188 (2, 1.8, 5)    2.579 (2, 2.5, 1)   2.506 (3, 1.5, 1)   2.564 (2, 2.75, 1)

APPENDIX C

VARIANCE OF CUMULATIVE AND NON-CUMULATIVE DATA

This appendix assesses the variability of non-cumulative and cumulative time series by considering the variances at each time period.
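The decomposition var[Y_t] = var[Y_{t-1}] + var[y_t] + 2 cov[Y_{t-1}, y_t] used in this appendix can be checked numerically with population (ddof = 0) moments, under which it holds exactly. A minimal sketch with invented cross-sectional values:

```python
def pvar(xs):
    """Population variance (ddof = 0)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pcov(xs, ys):
    """Population covariance (ddof = 0)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# invented cross-section: one value per time series in a dataset
Y_prev = [5.0, 8.0, 6.0, 9.0]  # cumulative demand up to period t-1
y_t = [2.0, 1.0, 3.0, 2.0]     # demand in period t
Y_t = [a + b for a, b in zip(Y_prev, y_t)]  # cumulative demand up to t

lhs = pvar(Y_t)
rhs = pvar(Y_prev) + pvar(y_t) + 2.0 * pcov(Y_prev, y_t)
```

Unless the increments are strongly negatively correlated with the accumulated demand, the variance of the cumulative series grows with t, which is the effect shown in Fig. C.1.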
The variances are calculated using the complete datasets; for example, the variance of the first period of the RD1 dataset is the variance of the values of all the time series in that period. We assume that these values follow the same probability distribution, which is not necessarily true given the diverse nature of the time series. This analysis therefore provides only a provisional idea of the variances of cumulative and non-cumulative time series. The variance of a non-cumulative time series is calculated in the usual manner; the variance of a cumulative time series is calculated according to the following expression:

var[Y_t] = var[Y_{t-1}] + var[y_t] + 2 cov[Y_{t-1}, y_t],     (C.1)

where y_t is the current value of the time series and Y_{t-1} is the cumulative value of the time series at period t - 1. Note the significant increase in variability when cumulative data is used.

Fig. C.1: Variance of cumulative and non-cumulative data. (a) RD1 dataset; (b) SD1 dataset; (c) RD2 dataset; (d) RD3 dataset.