Demand Forecast for Short Life
Cycle Products
Mario José Basallo Triana
December, 2012
First of all, to God.
To my parents Luis Mario and Luz Ángela, for their sacrifice, dedication, and love.
To my teachers, for showing me the way.
Acknowledgments
I want to thank professors Jesus Andrés Rodríguez Sarasty and Hernán Darío Benitez Restrepo, directors of this project, for their advice, assistance, and support.
I thank the Pontificia Universidad Javeriana for providing me with the resources needed to carry out this project. This project was supported by the Pontificia Universidad Javeriana through the project Gestión de Inventarios, 020100292.
I thank all the people who, in one way or another, contributed to the completion of this project.
Intelligence, give me
the exact name of things!
... Let my word be
the thing itself,
created anew by my soul.
Let all those who do not know them
go through me to things;
let all those who have forgotten them
go through me to things;
let even those who love them
go through me to things ...
Intelligence, give me
the exact name, and yours,
and theirs, and mine, of things!
Juan Ramón Jiménez, Eternidades (1918)
This work is based on the ideas proposed by professor Jesus Andrés Rodríguez Sarasty for forecasting the demand of short life cycle products.
CONTENTS

1. Introduction
2. Problem statement
   2.1 Forecasting the demand of short life cycle products
       2.1.1 Problem formulation
   2.2 Fundamental research hypothesis
3. Objectives and Scope
   3.1 General objective
   3.2 Specific objectives
   3.3 Scope of the research
4. Literature Review
   4.1 Forecast based on growth models
   4.2 Forecast based on similarity
   4.3 Forecast based on machine learning models
   4.4 Discussion of the current methods
   4.5 Conclusions
5. Time series analysis
   5.1 The datasets of time series
       5.1.1 Real datasets
       5.1.2 Synthetic dataset
   5.2 Stationarity test
       5.2.1 Unit-root test
   5.3 Clustering of time series
       5.3.1 Some insights on clustering time series
       5.3.2 The clustering algorithm
       5.3.3 Fuzzy cluster validity indices
       5.3.4 Clustering results
       5.3.5 Clustering results for the real datasets
   5.4 Conclusions of the chapter
6. Regression Methods
   6.1 Multiple linear regression
   6.2 Support vector regression
   6.3 Artificial Neural Networks
   6.4 Tuning parameters
       6.4.1 Response surface methodology for tuning parameters
7. Experimental procedure
   7.1 Collection and analysis of data
   7.2 Clustering
   7.3 Parameter tuning
   7.4 Forecasts evaluation
   7.5 Some computational aspects
   7.6 Results of the tuning parameters procedure
       7.6.1 Tuning parameters for SVR machines
       7.6.2 Tuning parameters for ANN machines
   7.7 Conclusions of the chapter
8. Results
   8.1 Forecasting results using multiple linear regression
       8.1.1 Multiple linear regression results with clustering
       8.1.2 Conclusions of the MLR case
   8.2 Forecasting results using support vector regression
       8.2.1 Support vector regression results with clustering
       8.2.2 Conclusions of the SVR case
   8.3 Forecasting results using artificial neural networks
       8.3.1 Artificial neural network results with clustering
       8.3.2 Conclusions of the ANN case and other cases
   8.4 Comparison of forecasting methods
9. Conclusions
A. Results for the SD1 dataset using the correct partition
B. The effect of the clustering algorithm
   B.1 Multiple linear regression case
   B.2 Support vector regression case
C. Variance of cumulative and non-cumulative data
LIST OF FIGURES

2.1 Short life cycle product time series.
2.2 Generalized pattern of a short life cycle product time series.
5.1 The datasets of short time series.
5.2 Representation of the SD1 data set by its 3 PIPs.
5.3 Representation of the real datasets by its 3 PIPs.
6.1 Illustration of the linear ε-SVR with soft margin.
6.2 Feed-forward neural network.
6.3 Experimental designs.
7.1 Experimental factors.
7.2 Framework of the forecasting procedures.
7.3 Some iterations of the proposed tuning parameters procedure.
8.1 Multiple linear regression results, complete datasets.
8.2 Support vector regression results, complete datasets.
8.3 Artificial neural network results, complete datasets.
8.4 Forecasting results for some time series.
8.5 Pairwise comparison of regression methods.
8.6 Mean absolute error results for each regression method.
C.1 Variance of cumulative and non-cumulative data.
LIST OF TABLES

5.1 Model parameters of synthetic time series.
5.2 Validation results for SD1 dataset using FSTS algorithm.
5.3 Optimal number of clusters for the real datasets.
7.1 Optimal SVR parameters, non-cumulative data.
7.2 Optimal SVR parameters, cumulative data.
7.3 Optimal ANN parameters, non-cumulative data.
7.4 Optimal ANN parameters, cumulative data.
8.1 MLR results, complete datasets.
8.2 MLR results, partitioned non-cumulative data.
8.3 MLR results, partitioned cumulative data.
8.4 SVR results, complete datasets.
8.5 SVR results, partitioned non-cumulative data.
8.6 SVR results, partitioned cumulative data.
8.7 ANN results, complete datasets.
8.8 ANN results, partitioned non-cumulative data.
8.9 ANN results, partitioned cumulative data.
8.10 Summary evaluation of the experimental treatments.
A.1 Preliminary multiple linear regression results for SD1 dataset.
B.1 Comparison of FCM, FMLE and FSTS algorithms for MLR.
B.2 Comparison of FCM, FMLE and FSTS algorithms for SVR.
LIST OF ALGORITHMS

1 Perceptually Important Points Procedure.
2 Kaufman initialization method.
3 Fuzzy C-Means (FCM) clustering algorithm.
4 Fuzzy short time series (FSTS) clustering algorithm.
5 Fuzzy Maximum Likelihood Estimation (FMLE) algorithm.
6 Tuning parameters procedure.
7 Proposed tuning parameters procedure.
ABSTRACT
Accurate demand forecasting for short life cycle products is a subject of special interest for many companies and researchers. However, common forecasting approaches are not appropriate for this type of product because of the characteristics of its demand. This work proposes a method to forecast the demand of short life cycle products. Clustering techniques are used to obtain natural groups in the time series; this analysis allows relevant information to be extracted for the forecasting method. The results of the proposed method are compared with other approaches for forecasting the demand of short life cycle products. Several time series datasets of different types of products are considered.
Keywords: Short life cycle product, time series, forecast, cluster analysis, forecast performance.
ONE
INTRODUCTION
Short life cycle products (SLCP) are characterized by a demand that only occurs during a short time period, after which they become obsolete[1]; this leads in some cases to very short demand time series (Rodríguez & Vidal, 2009). High technology products (e.g., computers, consumer electronics, video games) and fashion products (e.g., toys, apparel, textbooks) are typical examples of SLCPs (Kurawarwala & Matsuo, 1998; Thomassey & Happiette, 2007; Rodríguez & Vidal, 2009). The period of demand can vary from a few years to a few weeks, as can be seen in the Colombian textbook industry (Rodríguez & Vidal, 2009).

[1] Short life cycle products are different from perishable products, which generally deteriorate with time.
The dynamics of new product demand is generally characterized by relatively slow growth in the introduction stage, followed by a stage of rapid growth; afterwards the demand stabilizes and the product enters a maturity stage; finally the demand declines and the product is usually replaced by another product (Trappey & Wu, 2008; Meade & Islam, 2006).
In many industries, particularly in the technology sector, SLCPs are becoming increasingly common (Zhu & Thonemann, 2004). This phenomenon is driven by the continuous introduction of new products as a consequence of highly competitive markets. In this context, the competitive advantage of a company is determined largely by its ability to manage frequent entries and exits of products (Wu et al., 2009).
The demand of a SLCP is highly uncertain and volatile, particularly in the introduction stage (Wu & Aytac, 2008). Additionally, the demand pattern is transient, non-stationary and non-linear (Rodríguez, 2007); these characteristics hinder the analysis and forecasting of such a demand. On the other hand, the operations management of this type of product is also difficult because high technology and investment are usually required, the manufacturing and distribution lead times are usually long, and the risk of excess or shortage of inventory during the life cycle is high (Rodríguez & Vidal, 2009). An accurate prediction of demand reduces production and distribution costs as well as the excess or shortage of inventory. It is for this reason that an accurate forecast is important for a company.
This work proposes an efficient and effective forecasting method for short life cycle product demand. To this end, we consider regression methods to forecast the demand of a SLCP. These methods are able to obtain forecasts at early stages of the product life cycle, achieving an important advantage over current forecasting methods; this also represents an important advantage for a company. In order to improve the forecasting performance, different strategies are considered; such strategies involve clustering techniques and the use of cumulative and non-cumulative data.
This document is organized as follows: Chapter 2 states the problem and formulates different forecasting strategies (hypotheses) aimed at improving forecast performance. Chapter 3 presents the objectives of this work and the scope of the research. Chapter 4 discusses a classification and description of the current methods used to predict the demand of short life cycle products. Chapter 5 describes the different time series datasets analyzed in this work and illustrates the use of different clustering techniques. Chapter 6 discusses regression methods and the parameter tuning procedures followed. Chapter 7 explains in detail the experimental procedure followed in this work. Chapter 8 discusses the results and Chapter 9 concludes this document.
TWO
PROBLEM STATEMENT
2.1 Forecasting the demand of short life cycle products
There are different situations that make demand forecasting of short life cycle products complex. On one hand, the demand time series of such products appear to be non-stationary, non-linear, and transient (Fig. 2.1, left). Another problem related to the demand forecasting of short life cycle products is the scarcity of historical data: once introduced to the market, these products have a short period of sales, so there is little or no historical information related to their sales. This is a severe inconvenience because, in spite of the existence of forecasting methods for non-linear time series, such techniques generally require large amounts of data in order to obtain accurate forecasts. Many non-linear models have been proposed in the field of time series analysis[1]. These models, in the same way as linear models, require large amounts[2] of data to obtain accurate results. Furthermore, the proposed nonlinear models employ explicit parametric forms; however, it is usually hard to justify a priori the appropriateness of such explicit models in real applications. The use of non-parametric regression analysis (such as support vector regression and artificial neural networks) can be an effective data-driven approach (Peña et al., 2001).

Traditional forecasting methods[3] cannot be used in this context because short life cycle product time series do not meet the assumptions required by these methods, or there is not sufficient information available to obtain accurate estimates of their parameters (Kurawarwala & Matsuo, 1998; Wu & Aytac, 2008). These difficulties make it necessary to develop forecasting methods specifically designed for this type of product. The forecasting procedures for short life cycle product time series should address all these difficulties and cope with the high uncertainty and volatility of demand typical of these products.

Fig. 2.1: Short life cycle product time series. Left: Typical demand pattern for a short life cycle product. Right: Cumulative demand time series.

[1] Some nonlinear time series models are the Bilinear model, the Threshold Autoregressive model, the Smooth Transition model, and the Markov Switching model; see Tsay (2005); Peña (2005).
[2] At least 50 periods of historical data.
[3] Methods such as moving average, exponential smoothing, ARIMA, and others. In general, most of these methods are suitable for forecasting conventional products with a long history or a stable demand pattern.
2.1.1 Problem formulation
In order to state the problem, let us define $x_t$ as the value of the time series at time $t$ and $\hat{x}_t$ as the forecasted value for the same period. Then we set $\hat{x}_t = F$, where $F$ is some functional relationship. We need to define $F$ according to some criterion; one suitable criterion is to define $F$ according to some error metric (forecasting error). Let $e_t = f(x_t - \hat{x}_t)$ be such a metric. Then, given a time series $\{x_t\}$, $t = 1, \dots, T$, we wish to minimize Eq. 2.1:

$$E = \sum_{t=1}^{T} e_t \qquad (2.1)$$
The problem can be stated as follows:

Given a time series $\{x_t\}$, the forecasting problem consists in defining $F$ for which Eq. 2.1 is minimized.

One option is to set $F = f(t, \Theta)$, where $\Theta$ is some set of parameters, which in the context of product innovation corresponds to the parameters of a growth model (see Chapter 4). Another way of defining $F$ is, for example, $F = f(x_{t-1}, \dots, x_{t-p})$, where $p$ is a lag parameter. Finally, $F$ can be defined as $F = f(F_1, \dots, F_n)$, where $n$ is the number of different forecasts obtained with different forecast models.
If we define $X_t = \sum_{i=1}^{t} x_i$ as the cumulative demand, the resulting time series is smoother than the non-cumulative demand time series (see Fig. 2.1, right). As this fact can be used to improve the forecast results (Wu et al., 2009), we forecast the cumulative demand and then recover the non-cumulative demand as $x_t = X_t - X_{t-1}$. This situation motivates the following research hypothesis:

The forecast with cumulative demand is more accurate than the forecast with non-cumulative demand.
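As a minimal illustration of this transformation (a sketch, not part of the thesis experiments, with purely illustrative values), the cumulative series and the recovery of non-cumulative forecasts can be written in a few lines of Python:

```python
import numpy as np

def to_cumulative(x):
    """X_t = sum_{i<=t} x_i: the smoother, monotonically increasing series."""
    return np.cumsum(x)

def to_noncumulative(X):
    """Recover x_t = X_t - X_{t-1} from a (forecasted) cumulative series."""
    return np.diff(X, prepend=0.0)

# Toy bell-shaped life cycle demand (illustrative values only).
x = np.array([2.0, 5.0, 9.0, 12.0, 10.0, 6.0, 3.0, 1.0])
X = to_cumulative(x)
assert np.allclose(to_noncumulative(X), x)   # the transformation is exactly invertible
```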
In order to cope with the scarcity of historical data on the product demand, some references in the SLCP literature work with time series of similar products that have already been introduced to the market (see Chapter 4). In this case, a clustering technique may be used to analyze the available information and to organize it into natural clusters (Wu et al., 2009; Li et al., 2012). The forecasts may then be carried out according to the results of that cluster analysis. Therefore, we also consider the following research hypothesis:

Forecasts based on data obtained by means of cluster analysis improve forecast performance.
Finally, a time series of a SLCP may have more complex shapes (see Fig. 2.2) that are not necessarily similar to the typical bell-shaped pattern in Fig. 2.1 (left). In this situation the forecasting process becomes more difficult, and we need to define forecast models that can deal with any pattern of SLCP time series. One way to do this is to use a machine learning model to capture the complex functional relationship of the time series (Zhang et al., 1998); the methodological procedure for forecasting using machine learning models is based on Rodríguez & Vidal (2009) and Rodríguez (2007). Accordingly, we assume the following research hypothesis:

The use of machine learning models in the forecasting process improves the forecast performance.

Fig. 2.2: Non bell-shaped pattern of a short life cycle product time series.
2.2 Fundamental research hypothesis
It is possible to determine a forecasting procedure for SLCPs that, based on
similar time series of available data, can obtain a minimum forecast error
and guarantee a reliable basis for planning activities.
THREE
OBJECTIVES AND SCOPE
3.1 General objective
Propose a forecasting method based on machine learning models to forecast
the demand of short life cycle products.
3.2 Specific objectives
• Develop a clustering method for short life cycle product time series to extract relevant information for the forecasting process.
• Design the forecasting method for the demand of short life cycle products using multiple linear regression, support vector machines and/or
artificial neural networks.
• Evaluate the performance of forecast models using appropriate metrics
and statistical tests.
3.3 Scope of the research
In this work we use or consider the following:

• Sales data rather than demand data, because demand data is usually not available.
• A validation and comparison of the forecasting methods using at least two different datasets of demand time series.
• No implementation in a company is carried out, and no software product is developed.
• The analysis only considers time series that follow the behavior of the demand of short life cycle products.
• This work focuses on the pattern of the time series rather than on the product or the type of product being considered.

Note: Due to the scarcity of real data for this project, one of the databases used in this work contains time series of demand of textbooks and school products. Some of these products cannot be considered strictly as short life cycle products[1]. However, we use these time series because they follow the pattern of the demand of a short life cycle product.

[1] The demand of textbooks and school products is mainly related to the season, but during the season such demand follows a short life cycle pattern.
FOUR
LITERATURE REVIEW
This chapter presents a review of the current status of research in forecasting the demand (sales) of short life cycle products. The review is intended to give a general summary of the work on forecasting short life cycle products and to provide guidelines for future research.
4.1 Forecast based on growth models
Product growth models are widely used in the analysis of the diffusion of innovations. For this reason, diffusion models can be used to forecast the demand of SLCPs. This approach requires determining a set of parameters, so a parameter estimation procedure is required. A review of forecasting methods based on diffusion models for short life cycle products is presented below. An extensive literature review of the use of diffusion models can be found in Meade & Islam (2006).
Trappey & Wu (2008) present a comparison of the time-varying extended logistic, simple logistic, and Gompertz models. The study analyzes time series of electronic products. Linear and non-linear least squares methods are used to determine the model parameters. The authors found that the time-varying extended logistic model had the best fit and prediction capability for almost all tested cases; however, this forecast procedure cannot converge in some situations[1]. The Gompertz model had the second best forecasting error.

[1] Cumulative time series of SLCPs converge to an upper limit; see Fig. 2.1 (right) in Chapter 2.
Kurawarwala & Matsuo (1998) analyze three models to forecast the demand of short life cycle products: the linear growth model, the Bass model, and the seasonal trend Bass model[2]. Forecasts are performed using demand time series of personal computers. Non-linear least squares estimation is used to determine the parameters. Performance measures such as the sum of squared errors (SSE), the RMSE and the MAD show that the seasonal trend Bass model reaches the minimum forecast error.

[2] The authors consider integrating elements of seasonality in the Bass diffusion model.
Tseng & Hu (2009) propose the quadratic interval Bass model to forecast new product sales diffusion. Fuzzy regression estimates the parameters of the Bass model, which is tested with different datasets. The proposed model is compared with the Gompertz, Logistic and quadratic Gompertz models and analyzed by means of the confidence interval length as a performance measure. The authors conclude that the proposed method is suitable in cases of insufficient data and should not be used when sufficient data is available.
Wu & Aytac (2008) propose a forecast procedure based on the use of time series of similar products (leading indicators), Bayesian updating and combined forecasts of different diffusion models. An a priori forecast is made with several growth models. Then a sampling distribution is obtained from the forecasts of different time series of similar products (leading indicators), which are obtained by means of the mentioned growth models. Finally, Bayesian updating is performed and the final forecast is obtained as a combination of the different growth models in the a posteriori results. The main advantage of the method is the systematic reduction of the variance of the forecasts. The method is tested on semiconductor demand time series. A similar work is presented in Wu et al. (2009).
Zhu & Thonemann (2004) propose an adaptive forecast procedure for short life cycle products. The authors use the Bass diffusion model and propose to update the parameters of that model using a Bayesian approach. Prior estimation of the parameters is made using non-linear least squares. The forecasts are performed using datasets of personal computers and analyzed using the MAD. The results show that the proposed method performs better than double exponential smoothing and the Bass model.
4.2 Forecast based on similarity
The lack or scarcity of information in short life cycle product time series is compensated by the existence of information on similar products for which there is sufficient history or which have completed their life cycle. In this section we review forecast models that directly use similar products or time series to obtain forecasts of a SLCP.
Szozda (2010) proposes an analogous forecasting method. The purpose of the method is to find the most similar time series. Calibration and adjustment of the time series is performed if necessary, in order to maximize the similarity measure. The forecast is the value of the most similar time series at the specified period of time. Datasets of different new products on European markets were used. The forecasting method was analyzed using the mean squared error (MSE), and the results show that the proposed method presents good performance, obtaining a forecast error of less than 10% on average.
Thomassey & Fiordaliso (2006) propose a forecast procedure based on clustering (unsupervised learning) and classification (supervised learning) to carry out early forecasts of new products. First, natural groups in the time series are obtained by means of a clustering procedure; then a classification procedure assigns new products to a specific cluster. A forecast of the sales profile is given by the centroid of the cluster to which the new product belongs. Datasets of textile fashion products are used. The forecasts were analyzed using the RMSE, the Mean Absolute Percentage Error (MAPE), and the Median Absolute Percentage Error (MdAPE). A similar work is presented by Thomassey & Happiette (2007).
4.3 Forecast based on machine learning models
Machine learning methods such as neural networks are widely used in forecasting activities (Zhang et al., 1998). Additionally, techniques such as clustering and classification are important for extracting relevant information from time series data and are therefore of great utility in finding similar time series, as can be seen in Section 4.2. In this section we present a review of machine learning methods used in forecasting activities for short life cycle products.
Xu & Zhang (2008) use a Support Vector Machine (SVM) to forecast the demand of short life cycle products in conditions of data deficiency. The authors take into account factors such as the past values of the current demand, the forecast given by the Bass model, and seasonal factors. A dataset of computer products was utilized. The forecasts were analyzed using the RMSE and MAD. The results show that the proposed model outperforms the Bass model.
Meade & Islam (2006) compare a multilayer feed-forward neural network accommodated for prediction and a controlled recurrent neural network to predict short time series. The authors use datasets available in the literature and find that the feed-forward neural network accommodated for prediction performs better in one-step-ahead forecasts, but for two-step-ahead forecasts the controlled recurrent neural network improves on the feed-forward neural network.
The capability of artificial neural networks to account for non-linear relationships, and the fact that no assumptions are made about the time series characteristics, are interesting features for forecasting. However, a drawback of neural networks is that they have many parameters to set up and there is no standard procedure for this task that ensures good network performance. The lack of a systematic approach to neural network model building is probably the primary cause of inconsistencies in reported findings (Zhang et al., 2001).
Design of Experiments (DOE) and/or Response Surface Methodology (RSM) are techniques that can be used to tune the parameters of neural network models (Zhang et al., 2001; Bashiri & Geranmayeh, 2011; Chiu et al., 1994; Madadlou et al., 1994; Balestrassi et al., 1994). However, the results obtained by response surface methodology are not necessarily the best (Wang & Wan, 2009; Jian et al., 2009).
4.4 Discussion of the current methods
From our literature review it is possible to describe some limitations of the current forecasting methods for SLCP demand. These limitations are related mainly to the capability of forecasting the demand along the complete life cycle, the appropriateness of the models used, and the effective use of historical information related to other SLCPs. We describe these limitations in more detail below.
• Capability of forecasting along the complete life cycle: This is an important issue because many methods, such as diffusion models, require part of the historical information for tuning their parameters. For example, the Bass model (see Section 5.1.2) requires determining three parameters ($a$, $b$ and $c$; see Eq. 5.1), so it is necessary to reserve at least 3 data points to estimate them. Therefore, if we use the Bass diffusion model, the initial forecast can only be made at the 4th period.
• Appropriateness of the model: Many models used in forecasting the demand of SLCPs employ explicit parametric forms that can, at best, be regarded as rough approximations of the underlying non-linear behavior of the time series being studied. It is necessary to justify a priori the appropriateness of the model used.
• Effective use of information related to other SLCPs: Many forecasting methods for SLCP demand, such as diffusion models, do not exploit information related to historical patterns of SLCP demand.
These drawbacks and the hypotheses outlined in Chapter 2 motivate and justify the work presented in this document.
4.5 Conclusions
This chapter presented a review of the forecast models for short life cycle products, with a description of forecasting models based on diffusion models, similarities between time series, and machine learning methods. The use of Bayesian learning is a common technique that generally improves the forecast results. We noted some limitations of the current methods used in forecasting SLCP demand.
FIVE
TIME SERIES ANALYSIS
This chapter provides an analysis of the different datasets of time series. Initially, a univariate time series analysis is addressed; this analysis includes testing for stationarity and testing for nonlinearity in time series. Our experience with the univariate time series analysis does not allow us to achieve conclusive results, since some of the techniques used here[1] have undefined results on short time series. To complement the analysis, an analysis of multiple time series is conducted, using clustering techniques to extract natural groups in the time series data. Classifying information by means of clustering makes it possible to extract relevant information from the data that can be valuable in forecasting tasks.

[1] Such as maximum likelihood estimation of ARMA models and some nonlinear tests for time series.
5.1 The datasets of time series
5.1.1 Real datasets
Demand dataset of textbooks and school products
This dataset corresponds to the weekly sales of textbooks and school products of a Colombian company. The dataset comprises a total of 726 different time series, which have a maximum length of 13 periods[2]. We will refer to this dataset as RD1. Fig. 5.1 (a) shows some of these time series.

[2] It is important to note the presence of very short time series. This fact imposes constraints on the univariate time series analysis, as will be seen later.
Citations of articles dataset
This dataset corresponds to the annual citations received by scientific articles published in several knowledge fields. The articles were published between 1996 and 1997 and are available in the Scopus database. The dataset comprises a total of 600 time series of annual citations covering the years 1996–2014, with a maximum length of 18 periods. We will refer to this dataset as RD2. Fig. 5.1 (c) shows some of these time series. Note that although these time series show a downward trend in the last years, they have not yet completed their life cycle. This does not occur in dataset RD1, where the lifetime of the time series has been completed.
Patent to Patent citations dataset
This dataset corresponds to annual patent-to-patent citations in the United States Patent and Trademark Office (USPTO). The source of the data is Hall et al. (2001), where the authors study patent citations covering the years 1975–1999. In this work we consider patents that mainly belong to technological fields, and we preprocess the information to obtain the annual citations of patents granted between 1976 and 1978[3]. The time series cover the periods from 1976 to 1999, with a maximum length of 23 periods, and the total number of patents analyzed was 600. We will refer to this dataset as RD3. Fig. 5.1 (d) shows some of these time series.

[3] The patent citations are analyzed according to their grant year. We consider the grant year as the year of birth of the patent and the start of the citation counts.
5.1.2 Synthetic dataset
Bass diffusion model
One of the first attempts to model the life cycle of products was the Bass diffusion model (Bass, 1969), which can be expressed using the following recurrence relation:

$$x_t = a + b \sum_{i=1}^{t-1} x_i + c \left( \sum_{i=1}^{t-1} x_i \right)^{2}, \qquad (5.1)$$

where $a = pm$, $b = q - p$ and $c = -q/m$. Here $p \in (0, 1)$ is called the innovation coefficient, $q \in (0, 1)$ is called the imitation coefficient, and $m = \sum_{i=1}^{\infty} x_i$ represents the total demand/sales over the life cycle of the product; see Bass (1969). With $t = 0, \dots, \infty$ and $x_0 = a = mp$, the recurrence relation 5.1 can be used to obtain the following nonlinear autoregressive process:

$$x_t = a + b \sum_{i=1}^{t-1} x_i + c \left( \sum_{i=1}^{t-1} x_i \right)^{2} + \epsilon_t,$$

where $\epsilon_t$ is an independent and identically distributed random variable. We refer to this relation as the Bass process and construct synthetic time series that follow it, with $\{\epsilon_t\}$ normally distributed with mean 0 and variance $\sigma^2$. Four different groups of time series are generated, each group containing 150 time series, for a total of 600. The length of the time series is set to a maximum of 25 periods. The parameters of the process are considered as normally distributed random variables with the means and standard deviation ($\sigma$) given in Tab. 5.1. We will refer to this dataset as SD1. Fig. 5.1 (b) shows 7 different time series of this dataset.
Tab. 5.1: Mean values of the model parameters for each group of synthetic time series. Each parameter is considered as a random variable. Note: values of p and q outside the range [0, 1] were not considered in the process.

Group |   p   |   q   |   m
  I   | 0.043 | 0.425 | 193.1
  II  | 0.002 | 0.496 | 119.5
  III | 0.031 | 0.220 | 105.2
  IV  | 0.015 | 0.549 | 193.2
  σ   | 0.071 | 0.001 | 14.29
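A minimal sketch of how such synthetic series can be generated from the Bass process defined above (the parameter values below are illustrative, not the exact group means of Tab. 5.1):

```python
import numpy as np

def bass_process(p, q, m, T, sigma, rng):
    """One Bass-process series of length T: Eq. 5.1 plus i.i.d. Gaussian noise."""
    a, b, c = p * m, q - p, -q / m
    x = np.zeros(T)
    x[0] = a                               # x_0 = a = m * p
    for t in range(1, T):
        S = x[:t].sum()                    # cumulative demand up to period t-1
        x[t] = a + b * S + c * S**2 + rng.normal(0.0, sigma)
    return x

rng = np.random.default_rng(0)
# One group of 150 series with parameters drawn around illustrative means.
group = [bass_process(p=rng.normal(0.03, 0.005), q=rng.normal(0.40, 0.05),
                      m=rng.normal(150.0, 15.0), T=25, sigma=1.0, rng=rng)
         for _ in range(150)]
```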
5.2 Stationarity test
Stationarity implies some regularity conditions required for hypothesis tests and for the parameter estimation of linear ARMA models (Peña, 2005, Chapter 10). A time series $\{y_t\}$ is said to be weakly stationary if its mean $\mu_t$, its variance $\gamma_0$ and the covariance $\gamma_\ell$ between $y_t$ and $y_{t-\ell}$ are time invariant, where $\ell$ is an arbitrary integer (see Chapter 2 of Tsay (2005)). Strong stationarity implies that the probability distribution function of the values of the time series is time invariant. Strong stationarity is a key concept that allows characterizing SLCP time series. It is clear that the probability distribution function of such time series varies with time: at the initial and final stages of the life cycle we might expect less variability in the time series, whereas the variability is expected to be greater around the peak demand/sales. The mean value of the time series also depends on the stage of the life cycle; i.e., the mean demand/sales at the beginning and at the end of the life cycle is expected to be close to zero.

Fig. 5.1: The datasets of short life cycle product time series. (a) RD1 dataset. (b) SD1 dataset. (c) RD2 dataset. (d) RD3 dataset.
5.2.1 Unit-root test
A non-stationary time series is said to be an Autoregressive Integrated Moving Average, ARIMA(p, d, q), process if, after applying the difference operator $\nabla^d y_t$, the resulting time series is stationary[4]. The original time series is then called unit-root non-stationary (Tsay, 2005, Chapter 2). Consider the autoregressive AR(1) process

$$y_t = \phi_0 + \phi_1 y_{t-1} + a_t,$$

where $a_t$ is an independent and identically distributed random variable, e.g., a white noise sequence, which is by definition a stationary process. If $\phi_1 = 1$ then

$$y_t = \phi_0 + y_{t-1} + a_t, \qquad \nabla y_t = \phi_0 + a_t.$$

Given that after applying the difference operator to the original time series we obtain a stationary time series, we conclude that the process is non-stationary. Thus, testing for a unit root amounts to testing whether $\phi_1 = 1$ or $\phi_1 < 1$. The above analysis can be generalized to ARMA models; however, we omit the details (see Chapter 9 of Peña (2005)). Generally, the null hypothesis of the test is that the series is non-stationary. This implies, possibly, differencing the original series with the difference operator to obtain a stationary process. We use the augmented Dickey-Fuller test to test for unit-root nonstationarity, with a significance level of 0.01. Performing the Dickey-Fuller test requires constructing a suitable linear ARMA model beforehand; the parameters of such a model are usually estimated by maximum likelihood.

[4] The difference is defined as $\nabla y_t = y_t - y_{t-1}$. Let $x_t = y_t - y_{t-1}$; then $\nabla^2 y_t = x_t - x_{t-1} = y_t - 2y_{t-1} + y_{t-2}$.
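As an illustration of how this test can be run in practice, here is a short sketch using the adfuller function from statsmodels (an assumption made for illustration; the thesis does not state which software was used for this step):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

def is_unit_root_nonstationary(y, alpha=0.01):
    """Augmented Dickey-Fuller test; H0: the series has a unit root."""
    # A small fixed lag order is used because SLCP series are very short.
    stat, pvalue, *_ = adfuller(y, maxlag=1, autolag=None)
    # Failing to reject H0 (p >= alpha) -> treat the series as non-stationary.
    return pvalue >= alpha

# Toy bell-shaped demand series (illustrative values only).
y = np.array([1.0, 3.0, 7.0, 12.0, 16.0, 15.0, 11.0, 7.0, 4.0, 2.0, 1.0, 0.0])
print(is_unit_root_nonstationary(y))
```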
Results of the unit-root test
Our experience in testing for unit-root nonstationarity gives no conclusive results. This is confirmed when performing parameter estimation of ARMA models, where the algorithms used produce undefined or invalid results[5]. We assume that the time series discussed here are non-stationary, since they do not satisfy the regularity assumptions required for estimating the parameters of ARMA models.

[5] For parameter estimation we use exact maximum likelihood estimation via the Kalman filter; this algorithm requires computing a log-likelihood function, but in some cases the argument of the logarithm is negative. The same circumstance is found in some nonlinear tests. This implies undefined (complex) results.
5.3 Clustering of time series
Data mining methods are helpful to find and describe patterns hidden in datasets. Clustering is a major technique in the data mining, statistics and pattern recognition fields; the clustering problem is also referred to as non-supervised classification and is identified as non-supervised learning. Clustering consists in finding natural groups in the data: an element of a cluster possesses characteristics common to the other elements of that cluster, but is significantly different from the elements of other clusters. In the context of time series data[6], clustering becomes the task of finding time series with similar characteristics. There are several characteristics that can be of interest. For example, one characteristic of a cluster may be that its time series are generated by approximately the same generating process; another characteristic of interest may be that the time series of a cluster are highly dependent.

[6] A review of the literature related to time series data mining can be found in Fu (2010).
5.3.1 Some insights on clustering time series
Given a set of unlabeled time series, it is often desirable to determine groups
of similar (according to our meaning of similar) time series. There are three approaches for clustering time series (see Liao (2005)):
• Raw-data-based approach. This approach uses the complete time series as the feature vector. The clustering is carried out in a high-dimensional space if the time series are long, which can be problematic because of the curse of dimensionality.
• Feature-based approach. This approach extracts some relevant features of the time series. Techniques such as feature extraction or selection, dimensionality reduction, or others can be used to extract information from the data.
• Model-based approach. Here the features are obtained using some model. For example, if we deal with SLCPs, we can fit the Bass diffusion model to the time series and take the parameters of that model as the features.
There are three main objectives in clustering time series (Zhang et al., 2011): obtain similarity in time, by analyzing series that vary in a similar way at each time step; obtain similarity in shape, by grouping time series with common shape features; and obtain similarity in change, by analyzing common trends in the data, i.e., similar variation at each time step.
In general it is not desirable to work directly with the raw data. The reasons are that such time series are highly noisy (Liao, 2005) and, as dimensionality increases, all objects become essentially equidistant to each other (this is known as the curse of dimensionality), so classification and clustering lose their meaning (Ratanamahatana et al., 2010). Transformation of the raw data can therefore improve efficiency by reducing the dimension of the data, or improve the clustering results by smoothing the trend and giving prominence to typical features (Zhang et al., 2011).
Dimensionality reduction by perceptually important points
Perceptually Important Points (PIP) is a simple method for reducing the dimension of a time series while preserving salient (representative) points. In this method all data points are ranked according to their importance; if a time series of length $T$ must be reduced to dimension $n$, $n < T$, the top-$n$ points of the list are selected. The method starts by selecting the initial and final points of the original time series as the first PIPs. The next PIP is selected as the point with maximum distance to the first PIPs; each subsequent PIP is selected as the point with maximum distance to its two adjacent PIPs. The procedure continues until $n$ points have been selected. The distance of a point to its two adjacent PIPs is understood as the vertical distance between the point and the line connecting its two adjacent PIPs (Fu, 2010). Let the test point be $p_3 = (x_3, y_3)$ and let its corresponding adjacent points be $p_1 = (x_1, y_1)$ and $p_2 = (x_2, y_2)$; then the vertical distance between $p_3$ and the line connecting $p_1$ and $p_2$ is given in Eq. 5.2 (see Chung et al. (2002)).
$$d(p_3, p_c) = |y_c - y_3| = \left| y_1 + (y_2 - y_1)\,\frac{x_c - x_1}{x_2 - x_1} - y_3 \right|, \qquad (5.2)$$

where $x_c = x_3$. The general procedure of the PIPs is shown in Algorithm 1.

Algorithm 1: Perceptually Important Points Procedure.
Data: Input sequence {a_t}, t = 1, ..., T; required length n.
Result: Reduced sequence {q_t}, t = 1, ..., n.
  Set q_1 = a_1 and q_n = a_T;
  l = 2;
  while l < n do
      Select the point a_j with maximum distance to its adjacent points in {q_t};
      Add a_j to {q_t};
      l = l + 1;
  end

Hard clustering or fuzzy clustering?
It is known that hard clustering[7] often does not reflect the description of the real data, where boundaries between subgroups might be fuzzy, and numerous problems in the life sciences are better tackled by decision making in a fuzzy environment. Fuzzy clustering therefore becomes the better option (Gath & Geva, 1989a).

[7] Hard clustering refers to the process of assigning each element to one, and only one, cluster. In fuzzy clustering, on the other hand, each element belongs to each cluster with a certain degree of membership.
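A minimal Python sketch of Algorithm 1 and of the vertical distance of Eq. 5.2 (an illustrative implementation that indexes points by position; it is not the code used in the thesis):

```python
import numpy as np

def vertical_distance(p1, p2, p3):
    """Vertical distance (Eq. 5.2) between point p3 and the line through p1 and p2."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    yc = y1 + (y2 - y1) * (x3 - x1) / (x2 - x1)
    return abs(yc - y3)

def pip_reduce(a, n):
    """Indices of the n perceptually important points of the series a (Algorithm 1)."""
    T = len(a)
    pips = [0, T - 1]                                  # first PIPs: initial and final points
    while len(pips) < n:
        best_i, best_d = None, -1.0
        for i in range(T):
            if i in pips:
                continue
            left = max(p for p in pips if p < i)       # adjacent PIP on the left
            right = min(p for p in pips if p > i)      # adjacent PIP on the right
            d = vertical_distance((left, a[left]), (right, a[right]), (i, a[i]))
            if d > best_d:
                best_i, best_d = i, d
        pips.append(best_i)
    return sorted(pips)

a = np.array([0.0, 2.0, 6.0, 11.0, 14.0, 12.0, 7.0, 3.0, 1.0, 0.0])
print(pip_reduce(a, 3))     # indices of 3 PIPs, as used to represent the datasets
```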
Initialization method
A clustering procedure is generally based on the following two major steps[8]:

1. Obtain an initial partition. That partition can be obtained randomly or by a more sophisticated method.
2. Iteratively obtain new partitions, improving the clustering until some termination criterion is met.

The initialization method can improve the clustering performance. The idea of the initialization is to use several clustering algorithms, each more sophisticated than the previous one. For example, the Expectation-Maximization algorithm can be initialized with the results of the k-means algorithm, and the k-means algorithm can be initialized at a random partition (Bradley & Fayyad, 1998). An initialization method is proposed by Bradley & Fayyad (1998); the authors developed a procedure for computing refined initial centroids for the k-means algorithm based on an efficient technique for estimating the modes of a distribution. Another initialization method based on refined seeds or centroids can be found in Gath & Geva (1989a); the characteristic of such initialization is that the initial seeds are chosen randomly.

[8] Referring mainly to partitional clustering algorithms.
Peña et al. (1999) present a comparison of the performance of four initialization methods for the k-means algorithm: random initialization, the Forgy approach, the MacQueen approach and the Kaufman approach. Based on the statistical properties of the squared error, the authors found that the Kaufman initialization method outperforms the other methods with respect to the effectiveness and robustness of the k-means algorithm, and its convergence speed.

Given the above results, we describe the Kaufman initialization method in more detail. The Kaufman initialization method is shown in Algorithm 2 (see Kaufman & Rousseeuw (1990); Peña et al. (1999)). Here, $d(x, y)$ corresponds to some distance measure between the vectors $x$ and $y$.
Algorithm 2: Kaufman initialization method.
Data: Dataset of time series X; number of clusters K.
Result: The set of cluster centroids {v_1, ..., v_K}, or an initial partition.
  Select as the first seed the most centrally located instance;
  k = 1;
  while k < K do
      for every non-selected instance w_i do
          for every non-selected instance w_j do
              Calculate C_ji = max{D_j − d_ji, 0}, where d_ji = d(w_i, w_j) and D_j = min_s{d_js}, s being one of the selected seeds;
          end
          Calculate the gain of selecting w_i as Σ_j C_ji;
      end
      Select as the next seed the instance l with l = argmax_i Σ_j C_ji;
      k = k + 1;
  end
  To obtain a partition, assign each non-selected instance to the cluster represented by the nearest seed.
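A compact NumPy sketch of Algorithm 2, assuming Euclidean distance between series and taking the medoid as the most centrally located instance (a sketch, not the thesis implementation):

```python
import numpy as np

def kaufman_init(X, K):
    """Kaufman seeds for K clusters; X holds one time series per row."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    seeds = [int(np.argmin(dist.sum(axis=1)))]        # most centrally located instance
    while len(seeds) < K:
        others = [i for i in range(n) if i not in seeds]
        D = dist[:, seeds].min(axis=1)                 # D_j: distance to the nearest seed
        gains = [sum(max(D[j] - dist[j, i], 0.0) for j in others if j != i)
                 for i in others]                      # gain of selecting each candidate
        seeds.append(others[int(np.argmax(gains))])
    return X[seeds]                                    # initial cluster centroids

X = np.random.default_rng(1).random((30, 13))          # 30 toy series of length 13
print(kaufman_init(X, K=3).shape)                      # (3, 13)
```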
Distance measure
In partitional clustering it is necessary to measure the similarity between two objects, and for this purpose some distance measure is considered. In general, the use of the Euclidean distance is not necessarily the best option. Euclidean distance is not appropriate when two sequences have different scales; in this case it is necessary to normalize the data (Ratanamahatana et al., 2010). Euclidean distance also does not take into account the temporal order and the length of the sampling intervals if the time series considered have unevenly distributed points (Möller-Levet et al., 2003, 2005). The so-called Dynamic Time Warping distance can be used when the time series do not line up on the horizontal scale; however, the time required to compute this distance is high (Ratanamahatana et al., 2010).
When the time series have unevenly distributed points, Möller-Levet et al. (2003) use the so-called Short Time Series (STS) distance, defined in Eq. 5.3. This distance measure also takes into account the shape of the time series considered:

$$d^2(y, v) = \sum_{r=1}^{T-1} \left( \frac{y_{r+1} - y_r}{t_{r+1} - t_r} - \frac{v_{r+1} - v_r}{t_{r+1} - t_r} \right)^2. \qquad (5.3)$$
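A direct translation of Eq. 5.3 into Python (a sketch; for evenly spaced series the denominators are all equal to one):

```python
import numpy as np

def sts_distance(y, v, t):
    """Short time series distance of Eq. 5.3: compares slopes on each interval."""
    dy = np.diff(y) / np.diff(t)
    dv = np.diff(v) / np.diff(t)
    return np.sqrt(np.sum((dy - dv) ** 2))

t = np.arange(6, dtype=float)
y = np.array([1.0, 4.0, 9.0, 12.0, 8.0, 3.0])
v = y + 2.0                       # same shape, shifted level
print(sts_distance(y, v, t))      # 0.0: identical shapes regardless of level
```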
On the other hand, Pascazio et al. (2007) propose a clustering procedure based on the Hausdorff distance as a similarity measure between cluster elements. The authors use the Hausdorff distance in hierarchical clustering and recommend this tool for the analysis of complex sets with complicated (and even fractal-like) structures. The method is applied to financial time series.
5.3.2 The clustering algorithm
Fuzzy C-Means Clustering Algorithm (FCM)
The idea of the partitional clustering algorithms considered here is based on the compactness of the clusters and the separation between clusters. The total sum of the distances between the data points and their cluster centroids is often used as a figure of merit. The fuzzy clustering problem can be formulated by minimizing the function given in Eqs. 5.4, where $K$ is the number of clusters, $u_{ij}$ is the grade of membership of the $j$th object in the $i$th cluster, $u_{ij} \in [0, 1]$, and $m$ is the fuzzifier, which has an influence on the performance of the clustering algorithm. $Y = \{y_1, \dots, y_N\} \subset \Re^T$ are the feature data, $V = \{v_1, \dots, v_K\} \subset \Re^T$ are the cluster centroids, and $U = [u_{ij}]_{K \times N}$ is the fuzzy partition matrix.

Minimize:
$$J_m(Y, V, U) = \sum_{j=1}^{N} \sum_{i=1}^{K} u_{ij}^m\, d^2(y_j, v_i), \qquad (5.4a)$$

subject to:
$$\sum_{i=1}^{K} u_{ij} = 1, \qquad \forall j = 1, \dots, N, \qquad (5.4b)$$

and $u_{ij}$ must be non-negative for all $i, j$. The optimal $u_{ij}$ is determined by equating to zero the derivative of the Lagrangian of the optimization problem and solving the resulting system. The result is shown in Eq. 5.5:

$$u_{ij} = \left[ \sum_{k=1}^{K} \left( \frac{d^2(y_j, v_i)}{d^2(y_j, v_k)} \right)^{1/(m-1)} \right]^{-1}. \qquad (5.5)$$

The optimization of $v_i$ follows the same procedure and the result is shown in Eq. 5.6:

$$v_i = \frac{\sum_{j=1}^{N} u_{ij}^m\, y_j}{\sum_{j=1}^{N} u_{ij}^m}. \qquad (5.6)$$

It is clear that Eqs. 5.5 and 5.6 are coupled and that it is not possible to obtain closed-form solutions. One way to proceed is to follow an iterative algorithm to obtain the estimates of $u_{ij}$ and $v_i$; such an algorithm is called the FCM clustering algorithm (see Theodoridis & Koutroumbas (2006), p. 602). Algorithm 3 shows the standard FCM algorithm, which requires an initial set of cluster centroids; we use the Kaufman initialization method (see Algorithm 2) to obtain such centroids.
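Algorithm 3 below iterates these two coupled updates; as a complement, here is a minimal NumPy sketch of the same iteration with Euclidean distance (random initialization is used instead of the Kaufman seeds only for brevity):

```python
import numpy as np

def fcm(Y, K, m=2.0, eps=1e-5, max_iter=200, seed=0):
    """Fuzzy C-Means: returns centroids V (K x T) and memberships U (K x N)."""
    rng = np.random.default_rng(seed)
    U = rng.random((K, len(Y)))
    U /= U.sum(axis=0)                                   # memberships sum to 1 per object
    for _ in range(max_iter):
        V = (U**m @ Y) / (U**m).sum(axis=1, keepdims=True)            # Eq. 5.6
        d = np.linalg.norm(Y[None, :, :] - V[:, None, :], axis=2)     # d(y_j, v_i)
        d = np.maximum(d, 1e-12)                         # avoid division by zero
        U_new = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))).sum(axis=1)  # Eq. 5.5
        if np.abs(U_new - U).max() < eps:
            return V, U_new
        U = U_new
    return V, U

Y = np.random.default_rng(2).random((60, 13))            # 60 toy series of length 13
V, U = fcm(Y, K=3)
```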
Algorithm 3: Fuzzy C-Means (FCM) clustering algorithm.
Data: Time series matrix X; number of clusters K; fuzzifier parameter m; termination tolerance ε; number of time series n; length of the time series T.
Result: The set of cluster centroids {v_1, ..., v_K}; partition matrix U.
  Obtain the initial partition using the Kaufman initialization method given in Algorithm 2;
  Get the initial partition matrix U^0;
  l = 0;
  while ||U^l − U^{l−1}|| ≥ ε do
      Compute the cluster prototypes using Eq. 5.6;
      Compute the distance of each data point to each cluster centroid using the Euclidean distance;
      Update the partition matrix using Eq. 5.5;
      l = l + 1;
  end

Fuzzy Short Time Series Clustering Algorithm (FSTS)
In time series clustering tasks it is usually helpful to group the data according to their shape or other characteristics, as discussed before; then a special-purpose clustering algorithm might be used. In this work we consider the fuzzy clustering algorithm proposed by Möller-Levet et al. (2003, 2005); this algorithm is the same as the fuzzy c-means algorithm but uses the short time series distance given in Eq. 5.3 rather than the standard Euclidean distance. The optimization problem is the same as for the FCM algorithm, given in Eqs. 5.4, and the value of $u_{ij}$ for the FSTS clustering algorithm is given in Eq. 5.5; such membership values must be calculated using the short time series distance. The optimization of $v_i$ is quite different from that of FCM clustering; the resulting system of equations after differentiating and equating to zero is shown in Eqs. 5.7.
$$a_r v_{i,r-1} + b_r v_{i,r} + c_r v_{i,r+1} = m_{ir}, \qquad (5.7)$$

where

$$a_r = -(t_{r+1} - t_r)^2, \quad b_r = -(a_r + c_r), \quad c_r = -(t_r - t_{r-1})^2,$$
$$d_r = (t_{r+1} - t_r)^2, \quad e_r = -(d_r + f_r), \quad f_r = (t_r - t_{r-1})^2,$$

and

$$m_{ir} = \frac{\sum_{j=1}^{N} u_{ij}^m (d_r y_{j,r-1} + e_r y_{j,r} + f_r y_{j,r+1})}{\sum_{j=1}^{N} u_{ij}^m}.$$

This yields an underdetermined system of equations. However, by adding two fixed time points with a value of 0 at the beginning and at the end of the time series (such points do not alter the results) it is possible to solve the system (see Möller-Levet et al. (2003) and Möller-Levet et al. (2005) for details). The resulting system is tridiagonal and can be easily solved recursively using the so-called tridiagonal matrix algorithm (TDMA); Möller-Levet et al. (2003) also show a closed-form recursive equation to solve such a system. The fuzzy clustering algorithm for short time series is shown in Algorithm 4.
Algorithm 4: Fuzzy short time series (FSTS) clustering algorithm.
Data: Time series matrix X; number of clusters K; fuzzifier parameter m; termination tolerance ε; number of time series n; length of the time series T.
Result: The set of cluster centroids {v_1, ..., v_K}; partition matrix U.
  Add two fixed time points at the beginning and the end of the time series: X = [[0]_{n×1}, X, [0]_{n×1}];
  Obtain the initial partition using the Kaufman initialization method given in Algorithm 2;
  Get the initial partition matrix U^0;
  l = 0;
  while ||U^l − U^{l−1}|| ≥ ε do
      Compute the cluster prototypes by setting v(i, 1) = 0 and v(i, T + 2) = 0 and solving the system given in Eq. 5.7 to obtain the K centroids;
      Compute the distance of each data point to each cluster using Eq. 5.3;
      Update the partition matrix using Eq. 5.5;
      l = l + 1;
  end
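The centroid update inside Algorithm 4 amounts to solving the tridiagonal system of Eq. 5.7; below is a sketch of that single step with scipy, assuming the coefficients a_r, b_r, c_r and the right-hand side m_{ir} have already been computed (the values used are illustrative only):

```python
import numpy as np
from scipy.linalg import solve_banded

def solve_centroid(a, b, c, rhs):
    """Solve a_r v_{r-1} + b_r v_r + c_r v_{r+1} = rhs_r with v_0 = v_{T+1} = 0 (Eq. 5.7)."""
    T = len(rhs)
    ab = np.zeros((3, T))            # banded storage: upper, main and lower diagonals
    ab[0, 1:] = c[:-1]               # c_r multiplies v_{r+1}
    ab[1, :] = b                     # b_r multiplies v_r
    ab[2, :-1] = a[1:]               # a_r multiplies v_{r-1}
    return solve_banded((1, 1), ab, rhs)

# Evenly spaced time points give a_r = c_r = -1 and b_r = 2 (illustrative example).
T = 5
a, c, b = -np.ones(T), -np.ones(T), 2.0 * np.ones(T)
rhs = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
print(solve_centroid(a, b, c, rhs))
```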
Fuzzy Maximum Likelihood Estimation (FMLE) Clustering Algorithm
The following analysis is based on Benitez et al. (2013). For this purpose
consider that the clusters are considered as random events each that occur
in the sample space X with a positive probability P(i). By the theorem of
26
5. Time series analysis
5.3. Clustering of time series
total probability
P(X ) =
K
∑
P(X |i)P(i),
i=1
where P(X |j) is the conditional probability that given X the event (cluster)
j occurs. If it is assumed that P(X |j)P(i) is the Gaussian N (vi , Σi ), then
the function P(X ) can be seen as a Gaussian mixtures. The above equation
can be written as the likelihood function with parameters Θ = {P(i), vi , Σi }
P(X ) = P(X ; Θ) =
N
∏
P(yj |Θ) =
j=1
N ∑
K
∏
P(yj |i)P(i) = L(Θ; X ).
(5.8)
j=1 i=1
Given that P(yj |i) is Gaussian, it can be shown that
[
]
1
1
′ −1
P(yj |i) = √
exp − (yj − vi ) Σ (yj − vi ) .
2
(2π)T |Σi |
The parameters Θ are obtained by maximizing the likelihood function given
in Eq. 5.8. This is done equating to zero the derivative (with respect to
Θ) of the likelihood function and solving (for each parameter) the resulting
system of equations. The results of the optimization problem are shown in
Eq. 5.9.
\[
P(i) = \frac{1}{N} \sum_{j=1}^{N} P(i \mid y_j), \tag{5.9a}
\]
\[
v_i = \frac{\sum_{j=1}^{N} P(i \mid y_j)\, y_j}{\sum_{j=1}^{N} P(i \mid y_j)}, \tag{5.9b}
\]
\[
\Sigma_i = \frac{\sum_{j=1}^{N} P(i \mid y_j)\,(y_j - v_i)(y_j - v_i)'}{\sum_{j=1}^{N} P(i \mid y_j)}. \tag{5.9c}
\]
Note that the expression for vi is similar to that found for the FCM clustering algorithm in Eq. 5.6; the posterior probability P(i|yj) (the probability of selecting the ith cluster given the jth feature vector) plays the role of the degree of membership uij^m. The term P(i|yj) is calculated as follows:
\[
P(i \mid y_j) = \left[ \sum_{k=1}^{K} \frac{d_e^2(y_j, v_i)}{d_e^2(y_j, v_k)} \right]^{-1}, \tag{5.10a}
\]
where
\[
d_e^2(y_j, v_i) = \frac{\sqrt{|\Sigma_i|}}{P(i)} \exp\left[ \frac{1}{2} (y_j - v_i)' \Sigma_i^{-1} (y_j - v_i) \right]. \tag{5.10b}
\]
The FMLE algorithm uses the exponential distance measure d_e^2, which is based on maximum likelihood estimation. The characteristics of the FMLE clustering algorithm make it suitable for partitioning the data into hyper-ellipsoidal clusters (see Gath & Geva (1989b)). The FMLE algorithm is shown in Algorithm 5.
Algorithm 5: Fuzzy Maximum Likelihood Estimation (FMLE) clustering algorithm.
Data: Time series matrix X; number of clusters K; termination tolerance ε; number of time series n; length of the time series T.
Result: The set of cluster centroids {v1, ..., vK}; posterior probabilities P(i|yj).
1. Obtain the initial partition using the Kaufman initialization method given in Algorithm 2;
2. Compute the posterior probabilities given in Eq. 5.10a;
3. l = 0;
4. while ||Ul − Ul−1|| ≥ ε do
5.     Compute the cluster prototypes using Eq. 5.9b;
6.     Compute the parameters given in Eqs. 5.9a and 5.9c;
7.     Update the posterior probabilities using Eq. 5.10a;
8.     l = l + 1;
9. end
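For illustration, the following Python/NumPy sketch performs one FMLE iteration in the spirit of Eqs. 5.9-5.10 (priors, centroids, covariances and the posterior update with the exponential distance). It is a simplified sketch rather than the thesis implementation; the small regularization term added to each covariance is an assumption to keep the matrices invertible for short, similar series.

import numpy as np

def fmle_step(Y, W, reg=1e-6):
    # Y: (N, T) data matrix, one time series per row.
    # W: (K, N) current posterior probabilities P(i | y_j).
    K, N = W.shape
    T = Y.shape[1]
    priors = W.mean(axis=1)                                  # Eq. 5.9a
    centroids = (W @ Y) / W.sum(axis=1, keepdims=True)       # Eq. 5.9b
    covs = []
    d2 = np.zeros((K, N))
    for i in range(K):
        diff = Y - centroids[i]                              # (N, T)
        cov = (W[i][:, None] * diff).T @ diff / W[i].sum()   # Eq. 5.9c
        cov += reg * np.eye(T)                               # assumed regularization
        covs.append(cov)
        quad = np.einsum('nt,ts,ns->n', diff, np.linalg.inv(cov), diff)
        # exponential distance of Eq. 5.10b
        d2[i] = np.sqrt(np.linalg.det(cov)) / priors[i] * np.exp(0.5 * quad)
    W_new = (1.0 / d2) / (1.0 / d2).sum(axis=0, keepdims=True)  # Eq. 5.10a
    return priors, centroids, covs, W_new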
5.3.3 Fuzzy cluster validity indices
In general, in most clustering algorithms the number of clusters must be specified beforehand, and selecting a different number of clusters results in a different partition. For this reason it is necessary to evaluate several partitions; the problem of finding the optimal number of clusters is called cluster validity. The selection of the appropriate partition must be done according to a performance index. The total distance to the cluster centroids (the objective value of the clustering problem, see Eq. 5.4) is not the best option because this metric tends to decrease as the number of clusters increases. In general, a fuzzy cluster validity index must consider both the partition matrix and the data set itself (Wang & Zhang, 2007).
Wang & Zhang (2007) evaluate different fuzzy cluster validity indices using eight synthetic data sets and eight well-known data sets. Their work shows that none of the considered indices identifies the correct number of clusters in all data sets, although some indices perform well. The authors find that the PBMF index (see Pakhira et al. (2004)) fails only once to detect the correct number of clusters over the sixteen datasets. Other indices such as the PCAES index, the Granularity-Dissimilarity (GD) index, and the SC index also obtain good results. In the following we describe the validity indices used in this work.
Xie and Beni index, VXB
The XB index is defined in Eq. 5.11
\[
V_{XB} = \frac{\sum_{i=1}^{K} \sum_{j=1}^{n} u_{ij}\, d^2(y_j, v_i)}{n \min_{i \neq j} \{ d^2(v_i, v_j) \}}. \tag{5.11}
\]
The index captures two properties: compactness and separation. The numerator of Eq. 5.11 measures compactness and the denominator measures the separation between clusters, so compact and well-separated partitions yield small values. The validation problem becomes finding the partition for which k = argmin_{c=2,...,n−1} V_XB(c).
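A minimal Python/NumPy sketch of the index is shown below, assuming Euclidean distances and memberships arranged as a K × n matrix; the exponent m on the memberships is exposed as a parameter because some formulations of the index weight the memberships, while Eq. 5.11 as written corresponds to m = 1.

import numpy as np

def xie_beni(Y, V, U, m=1.0):
    # Xie-Beni index of Eq. 5.11: compactness over separation.
    # Y: (n, T) data, V: (K, T) centroids, U: (K, n) memberships.
    # Smaller values indicate a better partition.
    n = Y.shape[0]
    d2 = ((Y[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)      # (K, n) squared distances
    compactness = ((U ** m) * d2).sum()
    vdist2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # centroid-to-centroid
    np.fill_diagonal(vdist2, np.inf)                             # ignore i = j
    separation = vdist2.min()
    return compactness / (n * separation)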
PBMF index, VPBMF
The PBMF (see Pakhira et al. (2004)) index is defined in Eq. 5.12
\[
V_{PBMF} = \left( \frac{1}{K} \times \frac{E_1}{J_m} \times D_K \right)^2, \tag{5.12}
\]
here,
\[
E_1 = \sum_{j=1}^{n} d(y_j, \bar{v}) \quad \text{and} \quad D_K = \max_{i,j=1}^{K} d(v_i, v_j),
\]
where \bar{v} denotes the centroid of the complete dataset.
J_m is interpreted as the objective value of the clustering problem,
\[
J_m(y, v, u) = \sum_{j=1}^{N} \sum_{i=1}^{K} u_{ij}^m\, d^2(y_j, v_i).
\]
The index comprises three factors. The first factor, 1/K, indicates the divisibility of a K-cluster system; it decreases as K increases and helps avoid convergence towards ever larger K. The second factor, E1/Jm, compares the sum of intra-cluster distances of the complete dataset taken as a single cluster with that of the K-cluster system (the objective value); it measures the compactness of the cluster system, and larger values are desirable (Pakhira et al., 2004). The third factor, DK, is the maximum inter-cluster separation, and larger values are also desirable. The validation problem becomes finding the partition for which k = argmax_{c=2,...,n−1} V_PBMF(c).
PCAES index, VPCAES
The Partition Coefficient and Exponential Separation (PCAES) index (see
Wu & Yang (2005)) is defined in Eq. 5.13
\[
V_{PCAES} = \sum_{i=1}^{K} \sum_{j=1}^{n} \frac{u_{ij}^2}{u_M} - \sum_{i=1}^{K} \exp\left( - \min_{k \neq i} \left\{ \frac{d^2(v_i, v_k)}{\beta_T} \right\} \right), \tag{5.13}
\]
where u_M = \min_{i=1,\dots,K} \sum_{j=1}^{n} u_{ij}^2, \; \beta_T = \sum_{i=1}^{K} d^2(v_i, \bar{v}), and \bar{v} = \sum_{j=1}^{n} y_j / n.
The first term of the index measures the compactness of the cluster system relative to the most compact cluster; the second term is an exponential-type measure of inter-cluster separation. The validation problem becomes finding the partition for which k = argmax_{c=2,...,n−1} V_PCAES(c).
SC index, VSC
The SC index (see Zahid et al. (1999)) is defined in Eq. 5.14
\[
V_{SC} = SC_1(K) - SC_2(K), \tag{5.14}
\]
where
\[
SC_1(K) = \frac{\sum_{i=1}^{K} d^2(v_i, \bar{v}) / K}{\sum_{i=1}^{K} \left( \sum_{j=1}^{n} u_{ij}^m\, d^2(y_j, v_i) \big/ \sum_{j=1}^{n} u_{ij} \right)}
\]
and
\[
SC_2(K) = \frac{\sum_{i=1}^{K-1} \sum_{k=i+1}^{K} \left( \left( \sum_{j=1}^{n} \min\{u_{ij}, u_{kj}\} \right)^2 \big/ \sum_{j=1}^{n} \min\{u_{ij}, u_{kj}\} \right)}{\left( \sum_{j=1}^{n} \max_{i=1,\dots,K}\{u_{ij}\} \right)^2 \big/ \sum_{j=1}^{n} \max_{i=1,\dots,K}\{u_{ij}\}}.
\]
The first and second terms of Eq. 5.14 measure, respectively, a fuzzy compactness and a fuzzy inter-cluster separation that consider both geometrical properties of the data and the membership function, so the index expresses a fuzzy compactness/separation trade-off. The validation problem becomes finding the partition for which k = argmax_{c=2,...,n−1} V_SC(c).
5.3.4 Clustering results
Experiments were conducted in order to test the clustering results and the fuzzy cluster validity indices. We use the dataset SD1 described in Section 5.1.2, which was generated according to four classes or groups. The idea is to cluster the time series using Algorithm 4 and to validate the clustering using the indices described in Section 5.3.3 in order to determine the number of clusters K; the results are then compared with the correct number of clusters.
In these experiments we used the FSTS clustering algorithm. The fuzzifier parameter m of the clustering algorithm (see Algorithm 4) was set to m = 1.65 and the convergence criterion was set to ε = 10^{-5}. The algorithm used the short time series distance given in Eq. 5.3. The fuzzy cluster indices were evaluated for different numbers of clusters, from a minimum of Kmin = 2 up to a maximum of Kmax = √n, where n is the number of time series (n = 600, see Section 5.1.2). The cluster validity indices were also calculated using the short time series distance.
The results were obtained using a raw-data-based approach in which the complete time series is considered as a feature vector, and the dimensionality of the time series was reduced using Algorithm 1. The optimal number of clusters for the different cluster validity indices and different dimensionalities is shown in Tab. 5.2. The PCAES index obtained the most accurate results, considering that the correct number of clusters is 4. Fig. 5.2, where the SD1 dataset is reduced to 3 features, shows the presence of four fuzzy clusters. This reduction to 3 PIPs for a SLCP is obtained by selecting the initial sales, the peak sales, and the final sales of the product.
Tab. 5.2: Validation results for SD1 dataset using FSTS algorithm.

Index  | Dimensionality*
       | Raw data |  13 |  6  |  3
XB     |    19    |  17 | 22  | 23
PBMF   |    24    |  21 |  3  | 15
PCAES  |     6    |   2 |  2  |  2
SC     |    19    |  24 | 24  | 24

*Number of periods in the time series.
5.3.5 Clustering results for the real datasets
The raw datasets and the PCAES index were used to investigate the optimal number of clusters for the real datasets (see Section 5.1.1); the results are shown in Tab. 5.3. Fig. 5.3 shows the representation of the real datasets by their 3 PIPs. These datasets do not show as clear a clustering structure as the synthetic dataset SD1. Moreover, all real datasets take integer values, which implies that clustering tendency tests such as the Hopkins test (see Banerjee & Davé (2004)) and the Cox-Lewis test may erroneously conclude that the data follow a clustering structure¹⁰. The reason is the integrality of the data: for example, in Fig. 5.3 (c) the data may appear to form clusters at the values y1 = {1, 2, 3, 4, 5, 6}, when in fact these are simply the only integer values that this dataset can take.
Tab. 5.3: Optimal number of clusters for the real datasets.

Dataset | Number of clusters
RD1     | 2
RD2     | 2
RD3     | 5
¹⁰ The Hopkins test, for example, tests the hypothesis that the data are randomly (uniformly) generated within their convex hull against the hypothesis that the data form clusters (i.e., are not generated in a completely random manner). The test is carried out using Monte Carlo simulations in which synthetic data are uniformly generated within the convex hull of the real data and then compared with the real data in order to contrast the clustering tendency.
[Figure] Fig. 5.2: Representation of the SD1 data set by its 3 PIPs (axes y1, y2, y3).
5.4 Conclusions of the chapter
The univariate analysis of time series in this chapter revealed the difficulties of performing unit-root stationarity tests and some non-linear tests on these series: the time series do not satisfy the regularity conditions required by such tests and are non-stationary. Difficulties also arise because the time series are very short, which implies that the estimation of the parameters of ARMA models gives inaccurate results. In conclusion, our experience showed that it is difficult to analyze time series of SLCPs using traditional ARMA or ARIMA models.
On the other hand, the multivariate analysis allowed finding groups in the time series data. We presented several clustering algorithms which will be used in the forecasting framework, as will be seen later. In this chapter we tested only the FSTS algorithm and evaluated different cluster validity indices. The cluster validity results do not always identify the correct partition of the data, but the PCAES index produced good partitions: it was shown that the synthetic dataset follows a clustering structure of 4 groups, whereas the real datasets have no clear clustering structure.
[Figure] Fig. 5.3: Representation of the real datasets by their 3 PIPs. (a) RD1 dataset; (b) RD2 dataset; (c) RD3 dataset.
SIX
REGRESSION METHODS
Given that this work proposes the use of machine learning models for forecasting SLCP demand, we consider models such as multiple linear regression, support vector regression, and artificial neural networks. In this chapter we briefly describe the theoretical background of these methods. Regression methods depend on parameters, so it is necessary to define a method for tuning them; this work employs the response surface methodology to tune parameters since it is an efficient approach for process optimization.
Forecasting methods use historical information to predict what will happen in the future. We can refer to this as the problem of learning from examples, as stated by Vapnik (1999) in the context of statistical learning theory and machine learning. The problem of forecasting can then be stated as a learning problem, specifically when the functional relationship yt = f(yt−1, yt−2, ..., yt−p) must be learned from historical information. The problem of finding the function that best predicts the value yt is also called the regression problem. This chapter presents the regression-based methods used to forecast the demand of short life cycle products. For convenience we refer to the argument of the function as the p-dimensional vector y = (yt−1, yt−2, ..., yt−p)' and to the result of the function as y, meaning that y ≡ yt. The functional relationship can then be written as y = f(y), assuming that we have available the historical information
\[
(\mathbf{y}_1, y_1), \dots, (\mathbf{y}_n, y_n),
\]
which is called the training set S, with yi ∈ X ⊆ R^p and yi ∈ Y ⊆ R.
The forecasts based on linear regression discussed here were first proposed by Rodríguez & Vidal (2009) and Rodríguez (2007) to predict the demand of SLCPs. In this study we compare the performance of this method with nonlinear models such as support vector regression and artificial neural networks.
6.1 Multiple linear regression
Here we assume that the function y = f(y) is linear, that is,
\[
y = f(\mathbf{y}) = \mathbf{w}'\mathbf{y} + w_0. \tag{6.1}
\]
In terms of forecasting, it is assumed that the values of a time series are linear functions of some of their past values. The model is completely determined by the parameters w0 and w given in Eq. 6.1, and the idea is to choose these parameters so as to minimize the regression error. In general, the least squares approach is used to solve the linear regression problem by minimizing the sum of squared errors (deviations). The total sum of squared errors defines the loss function L, also known as the square loss function (Cristianini & Shawe-Taylor, 2000):
\[
L(\mathbf{w}, w_0) = \sum_{i=1}^{n} \left( y_i - f(\mathbf{y}_i) \right)^2 = \sum_{i=1}^{n} \left( y_i - \mathbf{w}'\mathbf{y}_i - w_0 \right)^2, \tag{6.2}
\]
The above equation can be expressed in matrix notation by setting \hat{\mathbf{w}} = (\mathbf{w}', w_0)' and (see Cristianini & Shawe-Taylor (2000))
\[
\hat{Y} = \begin{pmatrix} \hat{\mathbf{y}}_1' \\ \hat{\mathbf{y}}_2' \\ \vdots \\ \hat{\mathbf{y}}_n' \end{pmatrix}, \quad \text{where } \hat{\mathbf{y}}_i = (\mathbf{y}_i', 1)'.
\]
The loss function (Eq. 6.2) in matrix notation becomes
\[
L(\hat{\mathbf{w}}) = (\mathbf{y} - \hat{Y}\hat{\mathbf{w}})'(\mathbf{y} - \hat{Y}\hat{\mathbf{w}});
\]
taking the derivative of the loss function with respect to \hat{\mathbf{w}}, equating it to zero, and solving for \hat{\mathbf{w}} gives the solution of the least squares problem,
\[
\hat{\mathbf{w}} = \left( \hat{Y}'\hat{Y} \right)^{-1} \hat{Y}'\mathbf{y}. \tag{6.3}
\]
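As a minimal illustration (not the thesis's Matlab code), the following Python/NumPy sketch fits Eq. 6.3 with a numerically stable least-squares routine and produces a one-step forecast from the p most recent observations; the function names are assumptions of this sketch.

import numpy as np

def fit_mlr(Y_lagged, y_target):
    # Y_lagged: (n, p) matrix of lagged values (one row per training series).
    # y_target: (n,) vector of the values to be predicted.
    Yb = np.hstack([Y_lagged, np.ones((Y_lagged.shape[0], 1))])  # intercept column
    w_hat, *_ = np.linalg.lstsq(Yb, y_target, rcond=None)        # Eq. 6.3 via lstsq
    return w_hat[:-1], w_hat[-1]

def predict_mlr(w, w0, y_lags):
    # Forecast y_t = w'y + w0 from the p most recent observations.
    return y_lags @ w + w0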
6.2 Support vector regression
Support vector machines (SVMs) are commonly used in forecasting tasks and time series analysis; related literature can be found in Pai et al. (2010), Yang et al. (2007), and Hu et al. (2011). In particular, Xu & Zhang (2008) use the ε-SVR to forecast the demand of a short life cycle product, which is the interest of this work; however, the authors use exogenous variables in the learning process rather than past values of the time series.
In regression problems there are mainly two kinds of SVMs: the so-called ν-SVR (nu-Support Vector Regression, see Schölkopf et al. (–)) and the ε-SVR. In this work we consider the Epsilon Support Vector Regression, or ε-SVR, proposed by Vapnik (Smola & Schölkopf, 2004), and present a more detailed description of this kind of SVM. Consider the simplest case in which we need to perform a linear regression on some dataset, as shown in Fig. 6.1. The idea is to obtain a regression line contained in a tube of width 2ε that contains all (or the greatest possible number of) data points and that is as flat as possible. The reason for doing this is to avoid overfitting (see (Lin, 2006, page 48)); it can be shown that this objective is equivalent to minimizing the theoretical maximum of the generalization error (see Cristianini & Shawe-Taylor (2000); Vapnik (1999)).
[Figure] Fig. 6.1: Illustration of the linear ε-SVR with soft margin (regression line, ε-tube, and slack variables ξ, ξ*).
Let the regression line be given by Eq. 6.1. The flatness of the regression function is determined by the value of the parameter w and is achieved by minimizing the norm ∥w∥² (see Lin (2006); Smola & Schölkopf (2004)). The following convex optimization problem is then obtained.
Minimize:
\[
\frac{1}{2}\|\mathbf{w}\|^2.
\]
Subject to:
\[
y_i - \mathbf{w}'\mathbf{y}_i - w_0 \le \varepsilon, \qquad \mathbf{w}'\mathbf{y}_i + w_0 - y_i \le \varepsilon.
\]
The constraints of the above problem establish that all data points must lie inside a tube of width 2ε. However, it is possible to relax these constraints by adding slack variables; the resulting optimization problem is called the soft margin problem.
Soft margin problem
The soft margin problem allows some error (data points outside the tube), measured by the slack variables ξi and ξi* for data point yi. Adding slack variables makes it necessary to penalize their magnitude (the error) in the objective function according to some penalty cost C. The resulting optimization problem is shown in Eqs. 6.4.
Minimize:
\[
\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n} (\xi_i + \xi_i^*). \tag{6.4a}
\]
Subject to:
\[
y_i - \mathbf{w}'\mathbf{y}_i - w_0 \le \varepsilon + \xi_i, \tag{6.4b}
\]
\[
\mathbf{w}'\mathbf{y}_i + w_0 - y_i \le \varepsilon + \xi_i^*, \tag{6.4c}
\]
\[
\xi_i, \xi_i^* \ge 0. \tag{6.4d}
\]
The optimization problem 6.4 can be solved more easily in its dual formulation. More importantly, the dual problem provides the key for extending the SVM to nonlinear regression problems (see Smola & Schölkopf (2004)). The Lagrangian of the optimization problem is
\[
\begin{aligned}
L = {} & \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}(\xi_i + \xi_i^*) \\
& - \sum_{i=1}^{n} \alpha_i \left( \varepsilon + \xi_i - y_i + \mathbf{w}'\mathbf{y}_i + w_0 \right) \\
& - \sum_{i=1}^{n} \alpha_i^* \left( \varepsilon + \xi_i^* - \mathbf{w}'\mathbf{y}_i - w_0 + y_i \right) \\
& - \sum_{i=1}^{n} \left( \eta_i \xi_i + \eta_i^* \xi_i^* \right),
\end{aligned} \tag{6.5}
\]
where L is the Lagrangian and αi, αi*, ηi, ηi* are the Lagrange multipliers (dual variables), which are required to satisfy αi, αi*, ηi, ηi* ≥ 0. The partial derivatives of L with respect to the primal variables (w, w0, ξi, ξi*) have to vanish at optimality:
\[
\partial_{w_0} L = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i) = 0, \tag{6.6a}
\]
\[
\partial_{\mathbf{w}} L = \mathbf{w} - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\,\mathbf{y}_i = 0, \tag{6.6b}
\]
\[
\partial_{\xi_i} L = C - \alpha_i - \eta_i = 0, \tag{6.6c}
\]
\[
\partial_{\xi_i^*} L = C - \alpha_i^* - \eta_i^* = 0. \tag{6.6d}
\]
Substituting the results given in Eqs. 6.6 into the primal optimization problem 6.4 yields the dual optimization problem.
Maximize:
\[
-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, \mathbf{y}_i'\mathbf{y}_j - \varepsilon\sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i, \tag{6.7a}
\]
Subject to:
\[
\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0, \quad \text{and} \quad \alpha_i, \alpha_i^* \in [0, C]. \tag{6.7b}
\]
Kernel methods
The SVR discussed in the previous sections solves linear regression problems, which is a limitation because many real problems are nonlinear. The concept of a kernel is of great importance in dealing with nonlinearities. The basic idea is to map the data points y in an input space X to a vector space H, called the feature space, via a nonlinear mapping ϕ(·): X → H; the purpose is to translate the nonlinear structure of the data in X into linear structure in the higher-dimensional space H (see Pérez-Cruz & Bousquet (2004)). When a suitable mapping is used, the data ϕ(yi), ∀i = 1, ..., n, appear linear, and the regression function is determined in the feature space to obtain y = w'ϕ(y) + w0. For some mappings the inner product ϕ(x)'ϕ(y) can be computed directly from x and y without explicitly computing ϕ(x) and ϕ(y); this is known as the kernel trick, and such an inner product is written more simply as K(x, y). The objective of the SVR optimization problem given in Eq. 6.7a is then modified to Eq. 6.8. Note that with the kernel trick it is not necessary to compute the maps ϕ(x) and ϕ(y) directly; this is the key of kernel methods such as SVMs and another reason for using the dual rather than the primal optimization problem.
\[
-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(\mathbf{y}_i, \mathbf{y}_j) - \varepsilon\sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, y_i. \tag{6.8}
\]
It is important to note that the objective 6.8 represents a quadratic function
and such function must be convex, which implies that the kernel matrix [Kij ]
(i.e. the Hessian matrix of the objective function) must be positive definite.
Gaussian kernel: The Gaussian kernel is a kernel function commonly used in the literature and, for this reason, is the kernel used in this work (Eq. 6.9). It requires defining the parameter σ:
\[
K(\mathbf{x}, \mathbf{y}) = e^{-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}}. \tag{6.9}
\]
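For illustration only, the following Python sketch trains an ε-SVR with a Gaussian (RBF) kernel on toy lag data using scikit-learn, whose SVR implementation wraps LIBSVM; the thesis implementation itself uses the LIBSVM toolbox in Matlab, and the toy data and parameter values below are assumptions of this sketch (gamma corresponds to 1/(2σ²) of Eq. 6.9).

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 3))                       # toy lagged demand values
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 0.1, 100)

model = SVR(kernel='rbf', C=2.0**5, epsilon=2.0**-3, gamma=2.0**-4)
model.fit(X, y)
y_hat = model.predict(X[:5])                                # one-step-ahead forecasts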
6.3 Artificial Neural Networks
In this work we focus on feed-forward neural networks (multilayer perceptrons). An illustration of such a network is shown in Fig. 6.2: a network with two inputs, two hidden layers with three neurons (nodes) in hidden layer 1 and two neurons in hidden layer 2, and one output. The input nodes are connected forward to the hidden neurons, and these neurons are connected forward to the output neuron.
[Figure] Fig. 6.2: Feed-forward neural network with two hidden layers, two inputs, and one output (connection weights wij).
Each connection between neurons i and j is associated with a weight wij. When an input vector y is presented to the network, each element (input) of the vector is propagated through the network, weighted by the connection weights (see Tsay (2005)). The information received by neuron j of the first hidden layer is a linear combination of the inputs and the weights; this information is processed through an activation function that defines the output of the neuron as follows:
\[
h_j^1 = f_j\left( w_{0j}^1 + \sum_{i \to j} w_{ij}^1\, y_i \right),
\]
where w_{0j}^1 is a constant term called the bias and the summation i → j runs over all input nodes. The activation function for hidden neurons is usually the tangent sigmoid or the logistic function. Given the logistic function
\[
f_j(z) = \frac{e^z}{1 + e^z},
\]
the output of the jth neuron of the first hidden layer is
\[
h_j^1 = \frac{\exp\left( w_{0j}^1 + \sum_{i \to j} w_{ij}^1\, y_i \right)}{1 + \exp\left( w_{0j}^1 + \sum_{i \to j} w_{ij}^1\, y_i \right)}.
\]
If there are H hidden layers, then the output of the jth neuron of the last hidden layer is
\[
h_j^H = f_j\left( w_{0j}^H + \sum_{i \to j} w_{ij}^H\, h_i^{H-1} \right),
\]
and this value corresponds to one input to the output-layer neuron. Generally, an output-layer neuron has a linear or a Heaviside activation function. In the case of a linear activation function,
\[
o = w_0^o + \sum_{i} w_i^o\, h_i^H.
\]
The idea of using feed-forward neural networks is, given the pairs (yl, yl), i.e. the input patterns yl and output patterns yl, to determine the weights wij and biases w0j that generate outputs ol as close as possible to yl (while preserving the generalization capability of the network). This can be done by minimizing some fitting criterion, such as the least squares error
\[
s^2 = \sum_{l=1}^{n} (o_l - y_l)^2,
\]
so the process of training a neural network becomes a nonlinear optimization problem. Several algorithms have been proposed for this problem; a well-known one is the back propagation (BP) algorithm, which is based on gradient descent, and other optimization algorithms such as Levenberg-Marquardt are also commonly used.
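The forward propagation described above can be sketched in a few lines of Python/NumPy (an illustration, not the thesis's Neural Network Toolbox setup); the toy 2-3-2-1 architecture and the zero-initialized weights below are assumptions for the example.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))   # equivalent to e^z / (1 + e^z)

def forward(y, weights, biases):
    # weights: list of (n_in, n_out) matrices; biases: list of (n_out,) vectors.
    h = np.asarray(y, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = logistic(h @ W + b)                       # hidden layers
    return float(h @ weights[-1] + biases[-1])        # linear output neuron

weights = [np.zeros((2, 3)), np.zeros((3, 2)), np.zeros((2, 1))]
biases = [np.zeros(3), np.zeros(2), np.zeros(1)]
print(forward([1.0, 2.0], weights, biases))           # 0.0 with zero weights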
6.4 Tuning parameters
Machine learning models have a set of parameters that must be tuned beforehand. In the case of support vector regression it is necessary to determine parameters such as the penalty constant C (which weights the penalization due to the slack variables ξi and ξi*), the width of the tube ε, and the Gaussian kernel parameter σ. In the case of artificial neural network models, the most important parameters to determine are the number of hidden layers and the number of neurons per hidden layer.
The tuning process must be carried out according to the so-called generalization regression error criterion, so it is necessary to estimate this error with some predefined method. Two possible ways of estimating it are the following (see Chapelle et al. (2002)):
• Cross validation error: the data is divided into two subsets according to some proportion. One subset (the training set) is used in the training process and the remaining subset (the validation set) is used to estimate the regression error. The process is repeated over all possible training/validation splits of the whole dataset without repetition, and the resulting validation error is taken as an estimate of the generalization error.
• Leave one out error: one data point is selected from the whole dataset for validation (error estimation) while the remaining data are used in the training process. The process is repeated until every data point has been chosen for validation.
The leave one out estimate is more computationally expensive, hence the cross validation error is commonly used in practice (see Hsu et al. (2010)). The parameters must be tuned so as to reach the minimum regression (validation) error. A search strategy can be carried out in two steps: in the first step a coarse grid is constructed in the parameter space, the error is evaluated at each point of the grid, and the point with the minimum error is kept; in the second step a fine grid is constructed, centered at the best solution of the previous step, and the point of the fine grid with minimum cross validation error is selected, as sketched below. Such a strategy is, however, very time consuming, and it is necessary to consider more efficient approaches as discussed next.
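A minimal sketch of this coarse-then-fine strategy with 5-fold cross validation is shown below, using scikit-learn's SVR for concreteness; the grid limits and step sizes are illustrative assumptions, not the values used in the thesis.

import itertools
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def grid_search(X, y, grid):
    # Return the (C, epsilon, gamma) triple with the lowest 5-fold CV RMSE.
    best, best_err = None, np.inf
    for C, eps, gamma in itertools.product(*grid):
        model = SVR(C=C, epsilon=eps, gamma=gamma)
        scores = cross_val_score(model, X, y, cv=5,
                                 scoring='neg_root_mean_squared_error')
        err = -scores.mean()
        if err < best_err:
            best, best_err = (C, eps, gamma), err
    return best, best_err

coarse = [2.0 ** np.arange(-2, 16, 4),    # C
          2.0 ** np.arange(-5, 5, 3),     # epsilon
          2.0 ** np.arange(-16, 6, 4)]    # gamma
# (C0, e0, g0), _ = grid_search(X, y, coarse)           # given training arrays X, y
# fine = [C0 * 2.0 ** np.linspace(-2, 2, 5),
#         e0 * 2.0 ** np.linspace(-1.5, 1.5, 5),
#         g0 * 2.0 ** np.linspace(-2, 2, 5)]            # refined grid around the best point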
6.4.1 Response surface methodology for tuning parameters
The process of tuning parameters is in fact an optimization problem. Some characteristics that make this optimization problem difficult are:
• The objective value (generalization error) is actually a random variable.
• The evaluation of the objective function is very time consuming.
• The objective function is unknown in practical terms.
On the other hand, an advantage of this optimization problem is that it has a small number of variables and that, in general, the expected objective function is not highly nonlinear, which makes it possible to approximate it with a low-order polynomial. Initially the objective function can be written as
\[
E(\mathbf{p}) = f(\mathbf{p}) + \epsilon,
\]
where f(p) is an unknown function of the parameter vector p ∈ P, P is the parameter space, and ϵ is an independent and identically distributed random variable. For simplicity, a second order polynomial is used to approximate this objective function; linear regression methods compute this polynomial. The goal is to optimize the polynomial and estimate the optimal set of parameters p*. This optimal point is added to the original sample and the optimization process is repeated; the optimization concludes when it is not possible to improve the objective value. Gönen & Alpaydin (2011) used this method for tuning the parameters of a support vector machine.
Initial sample for the fit: The regression method requires a sample to fit the second order model. Gönen & Alpaydin (2011) use design of experiments (DOE) and response surface methodology (RSM) for this task; they use the Khosal design (see Myers & Montgomery (2002), p. 384), which is very efficient because it requires only a small sample, although a more robust design such as the central composite design (CCD) can also be used (and is the design commonly used in the literature). Fig. 6.3 shows a two-dimensional sample for the Khosal design and the central composite design; the Khosal design is clearly the more economical of the two.
Once the experiment is carried out using the sample points given by the experimental design, the following quadratic function is obtained by linear regression:
\[
\hat{E} = \beta_0 + \sum_{i=1}^{k} \beta_i\, p_i + \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \beta_{ij}\, p_i p_j + \sum_{i=1}^{k} \beta_{ii}\, p_i^2, \tag{6.10}
\]
where the β coefficients are the model parameters and k is the dimensionality of the parameter space (the number of parameters).
[Figure] Fig. 6.3: Two-dimensional sample for a Khosal design (a) and a central composite design (b).
Given that the objective value is a random variable, replications of the experiment at each sample point may give better estimates, so it is necessary to define the number of replications R in the algorithm. Algorithm 6 shows a procedure for tuning parameters using response surface methodology based on the work of Gönen & Alpaydin (2011); in this algorithm the optimization of problem 6.10 is restricted to some operability region ℓ of the parameters. A sketch of the surrogate fit-and-minimize step is given below.
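The following Python sketch fits the second-order model of Eq. 6.10 to sampled (parameter point, cross-validation error) pairs and minimizes it inside box bounds; it illustrates only the surrogate step, and the use of SciPy's bounded minimizer is an implementation assumption of this sketch.

import numpy as np
from scipy.optimize import minimize

def quad_features(P):
    # Second-order model terms of Eq. 6.10: 1, p_i, p_i*p_j (i < j), p_i^2.
    P = np.atleast_2d(P)
    n, k = P.shape
    cols = [np.ones(n)]
    cols += [P[:, i] for i in range(k)]
    cols += [P[:, i] * P[:, j] for i in range(k) for j in range(i + 1, k)]
    cols += [P[:, i] ** 2 for i in range(k)]
    return np.column_stack(cols)

def fit_and_minimize(P, err, bounds):
    # P: (N, k) sampled parameter points; err: (N,) CV errors;
    # bounds: sequence of (low, high) pairs defining the operability region.
    beta, *_ = np.linalg.lstsq(quad_features(P), err, rcond=None)
    surrogate = lambda p: float(quad_features(p) @ beta)
    x0 = P[np.argmin(err)]                      # start from the best sampled point
    res = minimize(surrogate, x0, bounds=bounds)
    return res.x, surrogate(res.x)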
Proposed metaheuristic for the tuning parameters procedure
Algorithm 6 adds a new solution to the current sample at each iteration, which keeps increasing the number of sample points. This can be ineffective in the sense that some of these points may not actually contribute to a good fit. Moreover, near the optimum the objective function resembles a quadratic function, so better results are obtained using points near that optimum; this is not necessarily achieved with Algorithm 6 because of the bias induced by sample points far from the optimum. In addition, Algorithm 6 is completely deterministic, which may cause rapid convergence to a local minimum. According to the discussion above, we propose a tuning parameters procedure based on the algorithm of Gönen & Alpaydin (2011).
Algorithm 6: Tuning parameters procedure.
Data: Number of replications per sample point, R; threshold, ε; dimensions of the experimental design, δ; operability region of the parameters, ℓ.
Result: Parameters p*
1. Build the design matrix with dimensions δ, using some experimental design;
2. Perform the experiment R times for each sample point and obtain the validation errors;
3. Fit a second order model for the generalization error function;
4. Solve the quadratic optimization problem given in Eq. 6.10, subject to the operability region ℓ, and get the optimum p*_0;
5. t = 0;
6. while ||p*_t − p*_{t−1}|| ≥ ε do
7.     Perform the experiment R times for p*_t and obtain the validation errors;
8.     Fit a second order model using all available information (sample points);
9.     Solve the quadratic optimization problem 6.10, subject to the operability region ℓ, and get the optimum p*_{t+1};
10.    t = t + 1;
11. end
The proposed algorithm considers a fixed number of sample points at each iteration, some of which are obtained in a stochastic manner. The initial sample points are obtained in the same manner as in Algorithm 6, giving a total of N sample points; a new sample point is then obtained by optimizing the fitted second order model, for a total of N + 1 sample points. From these N + 1 sample points we keep the N − n best solutions and randomly generate n new solutions to restore the fixed sample size N. The n new solutions are generated from a Gaussian distribution with mean equal to the mean of the current N − n best solutions and covariance
matrix of
\[
\Sigma = \begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_k^2
\end{pmatrix}, \tag{6.11}
\]
where σi² is the variance of parameter i in the current sample of N − n points. The reason for using a Gaussian distribution is that the generated points tend to form a spherical group around the mean, and such a sample can improve the fit of the second order model. The process is then repeated using the new N sample points. The proposed procedure is shown in Algorithm 7.
Algorithm 7: Proposed tuning parameters procedure.
Data: Number of replications per sample point, R; threshold, ε; dimensions of the experimental design, δ; operability region of the parameters, ℓ; number of random solutions, n.
Result: Parameters p*
1. Build the design matrix with dimensions δ, using some experimental design;
2. Perform the experiment R times for each sample point and obtain the validation errors;
3. Fit a second order model for the generalization error function;
4. Solve the quadratic optimization problem given in Eq. 6.10, subject to the operability region ℓ, and get the optimum p*_0;
5. t = 0;
6. while ||p*_t − p*_{t−1}|| ≥ ε do
7.     Perform the experiment R times for p*_t and obtain the validation errors;
8.     Select the N − n best solutions from the current sample;
9.     Generate n new sample points using a Gaussian distribution with mean equal to the mean of the current N − n points and the covariance matrix given in Eq. 6.11;
10.    If any generated point lies outside the operability region, project it onto that region;
11.    Fit a second order model using the N available sample points;
12.    Solve the quadratic optimization problem 6.10, subject to the operability region ℓ and to the constraints
\[
\min_{j=1,\dots,N}\{p^{cur}_{ij}\} \le p_i \le \max_{j=1,\dots,N}\{p^{cur}_{ij}\}, \quad \forall i = 1, \dots, k,
\]
       where p^{cur}_{ij} is the jth sample of the ith parameter in the current sample; get the optimum p*_{t+1};
13.    t = t + 1;
14. end
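The stochastic resampling of steps 8-10 of Algorithm 7 can be sketched as follows in Python/NumPy; clipping to the operability region is used here as a simple form of the projection mentioned in the algorithm, which is an assumption of this sketch.

import numpy as np

def gaussian_candidates(P_best, n_new, bounds, rng=None):
    # P_best: (N - n, k) retained best solutions;
    # bounds: (k, 2) array of [low, high] limits per parameter.
    rng = np.random.default_rng() if rng is None else rng
    mean = P_best.mean(axis=0)
    std = P_best.std(axis=0, ddof=1)          # square roots of the diagonal of Eq. 6.11
    cand = rng.normal(mean, std, size=(n_new, P_best.shape[1]))
    low, high = np.asarray(bounds).T
    return np.clip(cand, low, high)           # project back into the operability region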
SEVEN
EXPERIMENTAL PROCEDURE
The purpose of this chapter is to give a complete description of the experimental methodology used in this work. The experiments involve different factors that influence the performance of the forecasting methods; these factors and the methodological framework are described below.
This study aims to solve the SLCP demand/sales forecasting problem stated in Chapter 2, where we established some hypotheses (strategies) that could improve forecast performance. In order to investigate the hypotheses presented in Section 2.1.1, it is necessary to test the effect on the forecasts of using cumulative or non-cumulative data, of partitioning or not partitioning the data, and of the regression model used (see Chapter 6). We refer to these as experimental factors. Fig. 7.1 illustrates the experimental factors with their respective levels: the regression method with 3 options (levels), the clustering usage with 2 options, and the type of data with 2 options. This leads to 3 · 2 · 2 = 12 experimental treatments, which are the combinations of the levels of the factors shown in Fig. 7.1. For example, one particular treatment evaluates the forecasts obtained by multiple linear regression (MLR), without partitioning the data, and using non-cumulative data. We use a general procedure that standardizes the steps from data collection to the final forecasting evaluation; the framework of the forecasting procedures is shown in Fig. 7.2.
[Figure] Fig. 7.1: Experimental factors — forecasting method (MLR, SVR, ANN), clustering usage (yes, with FCM, FMLE, or FSTS; no), and type of data (cumulative, non-cumulative).
7.1 Collection and analysis of data
Forecasting methods usually try to predict the values of a time series using its past values. For a short life cycle product (SLCP) such historical information is typically not available, or the available information is scarce; thus, typical forecasting methods (such as moving averages, exponential smoothing, and others) cannot be applied, or their forecasting results are very poor. Given the scarcity of historical demand information for a new product, useful information can be obtained from products that have completed, or are completing, their life cycle. We assume that the life cycle pattern can be learnt from previous products in order to predict the demand of a new product. It is therefore necessary to find large enough datasets of time series containing patterns similar to the time series we try to forecast; in a company, this task amounts to finding demand time series of older SLCPs. When such information is available it is necessary to clean the data of noise or meaningless records.
In this study each of the considered datasets (see Section 5) is split into two subsets (see Fig. 7.2). The first subset, named the training set, is used in the training process and corresponds to 80 % of the whole
dataset. The training set allows the forecasting machines (see Chapter 6) to be prepared beforehand via a training procedure. The second subset, called the test set, is used to evaluate the forecast performance of each forecasting procedure; it simulates the real-time information available once the SLCP has been introduced to the market, i.e., once the time series to be forecasted appears. Note that the training process is carried out once, beforehand, and it is not necessary to repeat it each time real-time information becomes available; this saves considerable computation time.
[Figure] Fig. 7.2: Framework of the forecasting procedures — (1) collect and analyze data: preprocessing and cleaning of historical sales profiles of SLCP demand (training/validation data) and real-time data (testing data), with cumulative transformation if required; (2) perform clustering if required (Alg. 1, 3, 4, or 5), validate the clustering results, and obtain the clusters and cluster centroids; (3) perform classification if required: assign each real-time series to the cluster with the nearest centroid; (4) tune the parameters of the machine (MLR, SVR, or ANN) using n-fold cross validation and Alg. 6 or 7, then train the machine with the optimal parameter values on the cluster dataset (if clustering is used) or on the whole dataset; (5) evaluate the forecasts: perform the forecasts, difference the results if cumulative data is used, and evaluate forecast performance.
7.2 Clustering
This operation is performed when required by an experimental treatment; the clustering is carried out on the training set (either with cumulative or non-cumulative data). In this study the FCM, FMLE, and FSTS clustering algorithms (see Section 5.3.2) are considered. A clustering validation is needed in order to tune clustering parameters such as the number of clusters K and the fuzzifier parameter m. This process produces groups in which the data of the same cluster share similar characteristics but differ significantly from the data of the remaining clusters. The information that emerges from this stage, the partition of the data and the cluster centroids, is used in the following stages: the partition of the data is used for training, and the cluster centroids make it possible to classify real-time data (time series) into their most similar cluster.
After obtaining the clusters, it is necessary to relate the real-time information (time series) to these groups; the goal is to classify each real-time series into one of the previously established clusters. In this work a minimum distance classification is performed: considering the real-time series up to time t, yt, and the cluster centroids up to time t, vti, ∀i = 1, ..., K, the real-time series is classified into the cluster k for which
\[
k = \operatorname*{argmin}_{i=1,\dots,K} \{ d(y_t, v_{ti}) \}. \tag{7.1}
\]
In other words, the real-time series up to time t is classified into the cluster whose centroid (up to time t) is nearest. The selected cluster provides the data to be used in the subsequent training process.
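A minimal Python/NumPy sketch of this rule is shown below; Euclidean distance is used for illustration, although the same scheme applies with the short time series distance used elsewhere in this work, and the function name is an assumption.

import numpy as np

def classify_partial(series_t, centroids, t=None):
    # series_t: observed values of the real-time series up to time t.
    # centroids: (K, T) matrix of cluster centroids.
    y = np.asarray(series_t, dtype=float)
    t = len(y) if t is None else t
    d = np.linalg.norm(centroids[:, :t] - y[:t], axis=1)   # distances up to time t
    return int(np.argmin(d))                               # nearest-centroid cluster (Eq. 7.1)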
7.3 Parameter tuning
The tuning parameters procedure requires an estimate of the forecast error. The forecast error is estimated with a 5-fold cross validation on the training dataset¹, as described in Section 6.4; the objective is to reach the minimum possible cross validation error by using the parameter search procedure described in Algorithm 6 or Algorithm 7.
¹ Data for training can come from cumulative data, non-cumulative data, cluster data, or any combination, depending on the experimental treatment considered for analysis.
The proposed forecasting method requires training a regression model (MLR, ANN, or SVR) for each period to be predicted. For example, suppose that the training set consists of 3 time series of length 5, as shown below:
y11, y12, y13, y14, y15;
y21, y22, y23, y24, y25;
y31, y32, y33, y34, y35,
and consider that the time lag is set to p = 3. To forecast a real-time series at period 4 it is necessary to train a machine using the training data of periods 1 to 3 as inputs (regressors) and the data of period 4 as output (response); the same procedure is followed to forecast the other periods. Note that to forecast period 2 only the training data of period 1 can be used as regressor, because only one lagged period is available; similarly, to forecast period 3 the training data of periods 1 to 2 are used as regressors. It is impossible to obtain forecasts for period 1 since there are no lagged periods beforehand; a practical way to forecast the first period is to take an average over the training set. A sketch of this construction is given below.
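The following Python/NumPy sketch builds the regressor matrix and the response vector for a given target period from a matrix of training series; the handling of early periods with fewer than p lags follows the convention described above, and the function name is illustrative.

import numpy as np

def lagged_training_data(series_matrix, target_period, p):
    # series_matrix: (n, T) matrix of training time series.
    # target_period: 1-indexed period to be forecasted; p: maximum number of lags.
    Y = np.asarray(series_matrix, dtype=float)
    start = max(0, target_period - 1 - p)       # use fewer lags for early periods
    X = Y[:, start:target_period - 1]           # lagged periods as regressors
    y = Y[:, target_period - 1]                 # target period as response
    return X, y

# Example with the 3 x 5 training set above and p = 3:
# X, y = lagged_training_data(train, 4, p=3)   # X uses periods 1-3, y is period 4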
According to the above discussion, obtaining an estimate of the forecast error by cross validation requires training T − 1 machines, where T is the length of the time series. This is a time consuming task when T is large, mainly because a parameter search must also be conducted (particularly for support vector regression and artificial neural networks; multiple linear regression, in contrast, is very efficient). To avoid this problem, in this work only 5 equally spaced periods of the time series are used for training when estimating the forecast error. Once the optimal parameter values are obtained, the machine is trained again with these values considering all T − 1 periods; the results then allow forecasting any real-time series.
7.4 Forecasts evaluation
At this stage the forecasts for the real-time series are obtained using the results of the machine training procedure discussed before. The forecast is performed for each time series of the real-time (test) dataset. Note that when the forecasts are based on cumulative data it is necessary to transform the results by differencing (the non-cumulative demand is recovered as xt = Xt − Xt−1, where Xt is the cumulative demand at period t). The Root Mean Square Error (RMSE) evaluates the forecast performance and is calculated as follows:
\[
RMSE = \sqrt{ \frac{1}{T} \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 }, \tag{7.2}
\]
where ŷt is the forecasted value at period t. This work also uses the Mean Absolute Error (MAE) because, as will be shown later, this metric provides an absolute point of comparison. The MAE is calculated as follows:
\[
MAE = \frac{1}{T} \sum_{t=1}^{T} \left| \frac{y_t - \hat{y}_t}{\hat{y}_t} \right|. \tag{7.3}
\]
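Both metrics can be computed in a few lines of Python/NumPy, as sketched below; note that Eq. 7.3 as written normalizes each absolute error by the forecasted value.

import numpy as np

def rmse(y, y_hat):
    # Root Mean Square Error, Eq. 7.2.
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    # Error metric of Eq. 7.3 (absolute error relative to the forecast).
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs((y - y_hat) / y_hat)))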
7.5 Some computational aspects
The algorithms were implemented in Matlab 2011b. We use the Neural Network Toolbox of Matlab to train the feed-forward neural networks. For the support vector machine case we use the LIBSVM toolbox for Matlab by Chang & Lin (2011), from which the Epsilon Support Vector Regression is considered in this work. All algorithms presented in this work were programmed in Matlab under the Windows operating system on a 32 GB, 3.40 GHz machine.
7.6 Results of the tuning parameters procedure
7.6.1 Tuning parameters for SVR machines
As previously mentioned, the tuning parameters procedure relies on the cross validation estimate of the forecasting error for a given training set. In the case of an SVM it is necessary to define the width of the tube ε, the penalty constant C, and the Gaussian kernel parameter σ, as described in Section 6.2. Previous work on the search for such parameters establishes that exponentially growing sequences of the parameters are a practical way to identify good values (Hsu et al., 2010; Lin, 2006). For this reason the following search space is defined:
\[
\varepsilon \in \left[ 2^{-5}, 2^{4} \right]; \qquad C \in \left[ 2^{-2}, 2^{15} \right]; \qquad \gamma = \frac{1}{2\sigma^2} \in \left[ 2^{-16}, 2^{5} \right].
\]
This search region is taken as the operability region ℓ of the parameters and is considered large enough. For the support vector machines the proposed search procedure described in Algorithm 7 is used as the search method. In this work a central composite design (CCD, see Section 6.4.1) is used to obtain the initial sample points; the distance between the axial points of the design is given by the extreme values of each parameter in the operability region. The design is chosen to be rotatable (see Myers & Montgomery (2002)), which implies that the relation
\[
\alpha = k^{1/4}
\]
must be satisfied, where α is a normalized mean distance between the axial points⁴ and k is the number of parameters considered, in this case k = 3. Only R = 1 replicate is considered, and the threshold or stopping criterion of the algorithm (see Section 6.4.1) is set to ε = 0.005. Finally, the proposed tuning parameters procedure requires removing the n worst solutions and generating n new random solutions; in this work we set n = 5, given that N = 15 (see Section 6.4.1).
The performance of Algorithm 7 at different iterations is shown in Fig. 7.3 for the different datasets. The example shows that, in general, the algorithm converges to a (possibly local) optimum, and the results indicate that the performance of the proposed method is promising. An interesting observation is that the training process is not very sensitive to the bandwidth parameter ε, as can be seen for the real datasets at the intermediate iteration. Tab. 7.1 shows the results of the tuning parameters procedure for non-cumulative data, and Tab. 7.2 shows the results for cumulative data, in both cases for the optimal number of lags. Evidently the search time is much larger when cumulative data is considered.
⁴ No details are presented on such standardization; the reader may refer to Myers & Montgomery (2002).
Tab. 7.1: SVR results of the tuning parameters procedure for non-cumulative data.

Dataset | p* | log2 ε | log2 C | log2 γ  | Search time (s)
RD1     |  2 | −2.50  |  9.86  | −14.64  | 167
SD1     | 16 | −1.89  |  5.68  | −10.35  | 237
RD2     |  6 |  1.16  | 12.35  | −19.12  | 118
RD3     | 10 | −3.59  | 10.93  | −15.26  | 223

*Optimal number of lags, see Fig. 8.2.
Tab. 7.2: SVR results of the tuning parameters procedure for cumulative data.

Dataset | p* | log2 ε | log2 C | log2 γ  | Search time (s)
RD1     |  1 | −3.19  | 14.90  | −19.92  | 439
SD1     | 18 | −3.01  | 14.53  | −15.99  | 22 554
RD2     |  1 |  0.00  | 15.00  | −23.22  | 163
RD3     |  1 | −9.48  | 13.80  | −20.35  | 3 042

*Optimal number of lags, see Fig. 8.2.
7.6.2 Tuning parameters for ANN machines
This work uses the feed-forward neural network (multilayer perceptron) as regression method, given the relative simplicity of this neural network model. Such networks require adjusting many parameters, such as the learning rate, the number of neurons per hidden layer, the number of hidden layers, the activation function of each neuron, the training algorithm and its number of iterations, the weight initialization procedure, and possibly several others that affect the performance of the machine.
The interest here, however, is focused on determining the number of neurons per hidden layer and the number of hidden layers; the other parameters are set in a relatively arbitrary manner according to the author's experience. The Levenberg-Marquardt training algorithm with Bayesian regularization is used, the learning rate is set to 0.01, the number of runs (iterations) is set to 300, the weights are initialized to zero, and a tangent sigmoid activation function is used for the hidden neurons while a linear activation function is used for the output neuron. Our experience with artificial neural networks evidences a high computational cost, possibly a consequence of the Bayesian regularization in the training algorithm; this algorithm implies longer processing times but gives good results, which is why it was selected.
Accordingly, it is necessary to determine the number of neurons per hidden layer and the number of hidden layers. For this purpose we consider up to 2 hidden layers and up to 28 neurons per hidden layer; a hidden layer with 0 neurons indicates that the layer is not used. The tuning parameters procedure for the ANN does not show as clear a convergence as in the SVM case; the fact that the parameters are integers can make the search procedure less effective.
Tab. 7.3: ANN results of the tuning parameters procedure for non-cumulative data.

Dataset | p* | Neurons in hidden layer 1 | Neurons in hidden layer 2 | Search time (s)
RD1     |  4 | 25 |  0 | 267
SD1     | 23 | 24 |  4 | 309
RD2     |  3 |  1 |  0 | 339
RD3     |  2 | 14 | 28 | 291

*Optimal number of lags, see Fig. 8.3.
Tab. 7.4: ANN results of the tuning parameters procedure for cumulative data.

Dataset | p* | Neurons in hidden layer 1 | Neurons in hidden layer 2 | Search time (s)
RD1     | 11 |  1 | 0 | 785
SD1     |  1 |  1 | 0 | 582
RD2     |  7 |  1 | 0 | 892
RD3     |  1 | 28 | 0 | 611

*Optimal number of lags, see Fig. 8.3.
The results of the search are summarized in Tab. 7.3 for the non-cumulative case and in Tab. 7.4 for the cumulative case. Note that the search time is longer than in the SVM case, and that it is larger for the cumulative case than for the non-cumulative case.
7.7 Conclusions of the chapter
This chapter presented the general theoretical framework of the regression methods to be used in the forecasting framework, as will be seen later. It also considered the problem of tuning parameters for learning machines such as neural networks and support vector machines, using the cross-validation regression error as the figure of merit. The metaheuristic procedure proposed for tuning parameters generates reasonable results.
[Figure] Fig. 7.3: Sample points given by the proposed tuning parameters procedure at different iterations (start, intermediate, and final), plotted in the (log2 C, log2 ε, log2 γ) parameter space. (a) RD1 dataset; (b) SD1 dataset; (c) RD2 dataset; (d) RD3 dataset. These results are obtained for a number of lags p = 2; results for other values of p are omitted.
EIGHT
RESULTS
8.1 Forecasting results using multiple linear regression
In this situation the parameters of the regression model are obtained from the training set using Eq. 6.3 of Section 6.1 (see Chapter 7), and the test set is used to investigate the performance of the method by means of the Root Mean Square Error (RMSE). The results obtained using non-cumulative and cumulative data are shown in Fig. 8.1 for different numbers of lags. The use of non-cumulative data improves the forecasting results in terms of mean RMSE and in terms of the variability of the forecast error, as shown by the error lines; this result is confirmed in practically all datasets. Tab. 8.1 compares the results obtained using non-cumulative and cumulative data in the forecast process. The use of non-cumulative data improves the forecast results in all cases, as demonstrated by the Kruskal-Wallis test¹. As expected, the processing time (training and forecasting) is smaller with non-cumulative data, which is reasonable since the use of cumulative data requires transforming the cumulative results into non-cumulative forecasts, spending additional processing time; on average, the use of cumulative data increases the processing time to about 3 times that required for non-cumulative data. These results have an important implication: the use of cumulative data does not necessarily improve forecast performance, which contradicts the hypothesis presented in Section 2.1.1.
¹ The Kruskal-Wallis test is a statistical procedure that performs a non-parametric one-way analysis of variance by comparing the medians of the experimental treatments or variables.
[Figure] Fig. 8.1: Multiple linear regression results for non-cumulative (blue line) and cumulative (red line) data and for different numbers of lags (p). The black points correspond to the optimum; the error lines correspond to 10 % of the standard deviation. (a) RD1 dataset; (b) SD1 dataset; (c) RD2 dataset; (d) RD3 dataset.
It is notable that the optimal number of lags for the SD1 (Bass process
synthetic dataset) is very large; this implies a strong linear correlation or
lag dependence. On the other hand, the optimal number of lags is relatively
short for the other datasets, and for cumulative data only one lag is required.
8.1.1 Multiple linear regression results with clustering
Preliminary study
Initially, in a preliminary study, we carried out experiments using the SD1 dataset and the correct partition of the data, which consists of four groups (see Section 5.1.2). The purpose of this experiment is to evaluate the effect of partitioning the data on forecast performance when the correct partition of the data is known, as is the case for the SD1 dataset.
Tab. 8.1: Multiple linear regression results using complete datasets.

Dataset  | p* | Mean RMSE | S. Deviation | Time (s) | p-value**
RD1      |  2 |  4.069 |  2.885 | 0.29 | 0.00
RD1 cum. |  1 |  5.503 |  4.513 | 0.59 |
SD1      | 24 |  1.611 |  0.355 | 0.67 | 0.00
SD1 cum. | 24 |  2.324 |  0.348 | 4.2  |
RD2      |  5 | 16.504 |  8.674 | 0.41 | 0.00
RD2 cum. |  1 | 21.592 | 12.673 | 0.71 |
RD3      |  3 |  2.259 |  1.508 | 0.34 | 0.02
RD3 cum. |  1 |  2.440 |  1.537 | 1.29 |

*Optimal number of lags, see Fig. 8.1.
**p-value of the Kruskal-Wallis test comparing the non-cumulative and cumulative cases; values near 0 imply median differences.
According to the Kruskal-Wallis test, in this experiment the use of non-cumulative data again outperforms the use of cumulative data. The results are shown in Tab. A.1 of Appendix A. For non-cumulative data the mean RMSE is smaller when clustering is used in the forecasting process, although the two methods produce statistically equal results according to the p-value of the Kruskal-Wallis test. For cumulative data, on the other hand, clustering does not improve the results, and the difference between the two methods is statistically significant. An interesting result is that the RMSE variability increases when the data is partitioned; this can be explained because a smaller amount of data is used in the regression process, which increases the variability of the estimates. Partitioning the data, however, apparently reduces the processing time.
Use of clustering algorithms
This section assesses the effect of partitioning the data in the forecast process
on the real datasets. When forecasting with clustering, there are factors such
as the number of clusters K, the fuzzyfier parameter m of the clustering algorithm, and the number of lags p that may affect the forecast performance.
Keeping this in mind and to avoid potential bias in the results, we evaluated
(by a grid search) several values of such parameters beforehand. Tab. B.1
of Appendix B shows the forecasting results obtained with several clustering
algorithms, for optimal values of the clustering and lag parameters. The
62
8. Results
8.1. Forecasting results using multiple linear regression
The results show that the FMLE clustering algorithm outperforms the other algorithms on the RD1 and RD2 datasets. FSTS does better than the other clustering algorithms for SD1 and for the cumulative RD3 data, and FCM performs best for the non-cumulative RD3 dataset. An interesting result related to the FSTS clustering algorithm is that the optimal number of clusters for each non-cumulative dataset coincides with the optimal number of clusters found by the PCAES index (see Tab. 5.3 in Section 5.3.3).
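The grid search described above can be outlined as follows; cluster() and forecast_rmse() are placeholders for the fuzzy clustering and forecasting routines of this framework, so the sketch only illustrates the search loop, not the actual implementation:

from itertools import product

K_values = range(2, 9)                   # number of clusters, K = 2, ..., 8
m_values = [1.1, 1.5, 2.0, 2.5, 3.0]     # fuzzifier parameter (subset shown)
p_values = range(1, 11)                  # number of lags

def grid_search(train_series, cluster, forecast_rmse):
    """Return the best mean RMSE and the (K, m, p) combination that achieves it."""
    best = (float("inf"), None)
    for K, m, p in product(K_values, m_values, p_values):
        partition = cluster(train_series, K, m)           # e.g. FCM, FMLE, or FSTS
        rmse = forecast_rmse(train_series, partition, p)  # mean RMSE of the forecasts
        if rmse < best[0]:
            best = (rmse, (K, m, p))
    return best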
The effect of clustering is evaluated with the optimal values of the clustering and lag parameters given in Tab. B.1 (Appendix B). The FMLE clustering algorithm is selected for the evaluation because it presents good results on most datasets. According to the results with non-cumulative data (Tab. 8.2), the RMSE for the SD1 dataset is smaller, with a statistically significant difference, when the data are partitioned. This is not the case for the real datasets, for which there is no statistical difference in the error metric. In fact, the mean RMSE for RD1 and RD2 is greater when clustering is used, which may be interpreted as a negative effect of the clustering process, whereas for the SD1 and RD3 datasets the mean RMSE is smaller when clustering is used. Thus, in some cases clustering improves the results while in others it has the opposite effect. A possible explanation is that clustering works well for data with an evident cluster structure; when the data show no clear clustering tendency, clustering may not improve forecast performance. It is also important to note that clustering reduces the size of the training set, limiting the amount of data available for parameter estimation.
According to the results shown in Tab. 8.2, the standard deviation of the error for the SD1 and RD3 datasets² is smaller than the corresponding standard deviation of the RD1 and RD2 datasets. This apparently indicates that when clustering improves the results, the variability of the forecasts is also smaller. It is evident that the processing time increases when clustering is performed: from our results, this increase is of about 6 times the time required without clustering in the non-cumulative case, and about 3 times in the cumulative case, without considering the time required to tune the clustering and lag parameters.
The results obtained with cumulative data are shown in Tab. 8.3.
² Datasets that have a smaller mean RMSE when clustering is used.
Tab. 8.2: Optimal results for multiple linear regression partitioning the data, non-cumulative data.

Dataset   Mean RMSE   S. Deviation   Cluster Time (s)   Forecast Time (s)   p-value*
RD1           4.070          2.917               2.79                0.30       0.96
SD1           1.420          0.320               1.92                0.52       0.00
RD2          16.897          9.244               1.42                0.23       0.94
RD3           2.247          1.406               1.44                0.37       0.91

*p-value of the Kruskal-Wallis test evaluating the effect of partitioning the data or
not. The reader may compare these results with those shown in Tab. 8.1 for the
non-cumulative case.
These results do not improve the forecast performance, as already mentioned. In general, there is an increase in the mean error, the standard deviation of the error, and the processing time.
Tab. 8.3: Optimal results for multiple linear regression partitioning the data, cumulative data.

Dataset   Mean RMSE   S. Deviation   Cluster Time (s)   Forecast Time (s)   p-value*
RD1           5.009          3.437               2.82                0.42       0.62
SD1           1.833          0.456               1.59                0.70       0.00
RD2          18.944         10.259               1.71                0.43       0.02
RD3           2.623          1.677               1.53                0.45       0.12

*p-value of the Kruskal-Wallis test evaluating the effect of partitioning the data or
not. The reader may compare these results with those shown in Tab. 8.1 for the
cumulative case.
8.1.2 Conclusions of the MLR case
From the results obtained here we can conclude that the use of cumulative data does not improve forecast performance and increases processing time. Partitioning the data does not necessarily improve the forecast results, although there may be a marked improvement when the data have an evident clustering structure. It should also be noted that partitioning increases the time required for processing and parameter tuning. We stated the
hypothesis that partitioning the data could improve prediction performance; however, our datasets did not properly validate it. A plausible explanation is that the real datasets do not have a clustering structure, as observed in Fig. 5.3, Section 5.3.4.
8.2 Forecasting results using support vector regression
The idea of using support vector machines is to capture the nonlinear structure of the data in order to obtain better forecasts. An evaluation of the lag parameter is shown in Fig. 8.2; the lag dependence in the non-cumulative case is stronger than for linear regression (see Fig. 8.1). For SVR, the use of non-cumulative data improves the forecasting results, as expected.
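As an illustration of this regression step, the following sketch fits a support vector regression model to lagged demand vectors using scikit-learn's SVR (which wraps LIBSVM); the data and hyperparameter values are synthetic placeholders, not those tuned in this thesis:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((200, 6))                       # 200 lag vectors with p = 6 lags (synthetic)
y = np.sin(X.sum(axis=1)) + 0.1 * rng.standard_normal(200)

model = SVR(kernel="rbf", C=10.0, epsilon=0.01, gamma="scale")
model.fit(X, y)                                # training time grows with the sample size
y_hat = model.predict(X[:5])                   # one-step-ahead forecasts for 5 lag vectors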
Fig. 8.2: Support vector regression results for non-cumulative (blue line) and cumulative (red line) data and for different numbers of lags (p). The black points correspond to the optimum. (a) RD1 dataset. (b) SD1 dataset. (c) RD2 dataset. (d) RD3 dataset.
Tab. 8.4 shows the results for the optimal number of lags (shown in Fig. 8.2). As observed, the use of non-cumulative data improves the forecast results in all cases, as demonstrated by the Kruskal-Wallis test. Moreover, the processing time (training and forecasting) is lower
with non-cumulative data in almost all cases. Finally, the standard deviation of the RMSE measurements is smaller when non-cumulative data is used. This corroborates that the use of cumulative data does not improve the performance of the forecast.
Tab. 8.4: Support vector regression results using complete datasets.

Dataset      p*   Mean RMSE   S. Deviation   Train Time (s)   Forecast Time (s)   p-value**
RD1           2       3.983          3.157             0.29               1.21        0.00
RD1 cum.      1       5.131          4.096             2.62               1.10
SD1          16       1.298          0.269             0.51               2.68        0.00
SD1 cum.     18       1.792          0.451             54.2               2.83
RD2           6      16.409         10.489             0.32               1.50        0.02
RD2 cum.      1      18.417         10.593             0.33               1.41
RD3          10       2.154          1.473             0.52               2.15        0.00
RD3 cum.      1       2.506          1.661             1.96               2.07

*Optimal number of lags, see Fig. 8.2.
**p-value of the Kruskal-Wallis test comparing the non-cumulative and cumulative
results of each dataset; values near 0 imply median differences.
8.2.1 Support vector regression results with clustering
In order to evaluate the effect of clustering on forecast performance without any possible bias, we performed a grid search over the clustering and lag parameters, as in the multiple linear regression case. Tab. B.2 of Appendix B shows the optimal values of the clustering and lag parameters for the different clustering algorithms. The FMLE clustering algorithm performs best in almost all cases, so it is selected for the analysis. Tab. 8.5 shows the results for the non-cumulative case. As observed, there is no improvement in the mean RMSE when clustering is used; in fact, the variability of the measurements increases. According to the Kruskal-Wallis test, the measurements are not statistically different, which indicates that clustering has no effect on the results when SVR is used.
According to the results in Tab. 8.6, there is no improvement with respect to the non-cumulative case. On the other hand, the clustering
Tab. 8.5: Optimal results for support vector regression partitioning the data, non-cumulative data.

Dataset   Mean RMSE   S. Deviation   Cluster Time (s)   Train Time (s)   Forecast Time (s)   p-value*
RD1           3.985          3.185               5.66             0.36               1.58       0.90
SD1           1.301          0.273               4.59             0.54               3.91       0.99
RD2          16.437         10.926               2.42             0.24               2.19       1.00
RD3           2.172          1.661               1.58             1.85               2.49       0.82

*p-value of the Kruskal-Wallis test evaluating the effect of partitioning the data or
not. The reader may compare these results with those shown in Tab. 8.4 for the
non-cumulative case.
process achieves lower mean RMSE values for the RD1 and SD1 datasets, and the variability of the measurements is smaller for the SD1 dataset. However, there is no statistically significant difference between the results with and without clustering.
Tab. 8.6: Optimal results for support vector regression partitioning the data, cumulative data.

Dataset   Mean RMSE   S. Deviation   Cluster Time (s)   Train Time (s)   Forecast Time (s)   p-value*
RD1           5.081          3.838               2.86             0.31               1.41       0.98
SD1           1.740          0.457               4.06             3.16               3.56       0.27
RD2          18.91          10.96                1.40             0.21               1.37       0.64
RD3           2.506          1.661               1.58             1.85               2.49       1.00

*p-value of the Kruskal-Wallis test evaluating the effect of partitioning the data or
not. The reader may compare these results with those shown in Tab. 8.4 for the
cumulative case.
8.2.2 Conclusions of the SVR case
The use of SVR generates results very similar to those obtained with MLR. There is strong evidence that the use of cumulative data does not improve forecast performance and requires more computation time. The
use of clustering does not improve forecasting performance, not even for the synthetic dataset. It is important to note that the use of SVR significantly increases the computation time, mainly due to parameter tuning.
8.3 Forecasting results using artificial neural networks
Artificial neural networks are used as an alternative to support vector machines to capture the nonlinear structure of the data. In the evaluation of the lag parameter for the ANN, some datasets, such as SD1 and RD3, showed a strong lag dependence for non-cumulative data. The results show that the use of non-cumulative data improves forecast performance, as expected; in fact, the results for the cumulative case are very unstable. Tab. 8.7 shows the results for the optimal number of lags (shown in Fig. 8.3). There is enough statistical evidence to indicate that the use of cumulative data does not improve the results and, in general, it increases the processing time, the error, and the variability of the measurements.
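For illustration, a small feed-forward network (multilayer perceptron) can be trained on the same lagged representation as follows; the architecture and training settings are illustrative and do not correspond to the networks tuned in this study:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.random((300, 4))                          # 300 lag vectors, p = 4 lags (synthetic)
y = X @ np.array([0.5, 0.3, 0.1, 0.1]) + 0.05 * rng.standard_normal(300)

ann = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",
                   max_iter=2000, random_state=0)
ann.fit(X, y)                                     # training dominates the run time
forecast = ann.predict(X[:1])                     # forecast for a single lag vector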
Fig. 8.3: Artificial neural network results for non-cumulative (blue line) and cumulative (red line) data and for different numbers of lags (p). The black points correspond to the optimum. (a) RD1 dataset. (b) SD1 dataset. (c) RD2 dataset. (d) RD3 dataset.
68
8. Results
8.3. Forecasting results using artificial neural networks
Tab. 8.7: Artificial neural network results using complete datasets.

Dataset      p*   Mean RMSE   S. Deviation   Train Time (s)   Forecast Time (s)   p-value**
RD1           4       4.066          2.824             43.6              26.69        0.00
RD1 cum.     11       5.545          4.287             16.1              35.3
SD1          23       1.163          0.571              326              48.3         0.00
SD1 cum.      1      12.874          0.587             27.1              63.9
RD2           3      18.225         13.811             30.5              31.4         0.00
RD2 cum.      7      23.228         11.699             30.0              31.5
RD3           2       2.194          1.345              293              51.7         0.00
RD3 cum.      1       2.961          2.064             63.7              41.7

*Optimal number of lags, see Fig. 8.3.
**p-value of the Kruskal-Wallis test comparing the non-cumulative and cumulative
results of each dataset; values near 0 imply median differences.
8.3.1 Artificial neural network results with clustering
The effect of clustering on forecast performance is evaluated in the same manner as in the previous cases. In the non-cumulative case there is no statistical evidence of improvement when clustering is used for the real datasets. For the synthetic dataset, however, the use of clustering actually improves the forecast performance (Tab. 8.8). For the SD1 and RD2 datasets the variability of the RMSE decreases when clustering is used.
The results for the cumulative data case are given in Tab. 8.9 and do not show any improvement for the real datasets. For the synthetic dataset, however, there is a significant improvement in forecast performance, as in the non-cumulative case.
8.3.2 Conclusions of the ANN case and other cases
The conclusions for the case of neural networks are similar to those obtained for multiple linear regression and support vector regression: the use of cumulative data does not improve forecast performance, and the use of clustering has no clear effect on forecast performance for the real datasets. As additional information, Tab. 8.10 presents a summary of the results for each experimental treatment. Note that in no case does the use of cumulative data improve the forecast performance.
Tab. 8.8: Optimal results for artificial neural network partitioning the data, non-cumulative data.

Dataset   Mean RMSE   S. Deviation   Cluster Time (s)   Train Time (s)   Forecast Time (s)   p-value*
RD1           4.172          3.034               1.93             60.4               26.8       0.85
SD1           1.294          0.439               4.65              276               51.8       0.00
RD2          19.139         12.933               2.50             7.92               31.2       0.21
RD3           2.194          1.345               2.48              282               47.8       1.00

*p-value of the Kruskal-Wallis test evaluating the effect of partitioning the data or
not. The reader may compare these results with those shown in Tab. 8.7 for the
non-cumulative case.
On the other hand, the use of clustering seems to have no effect on forecast performance. According to these results, the idea of using clustering for forecasting loses validity in practice, since the procedure does not improve prediction performance and has a high computational cost, especially in the cluster validation procedure.
The possibility of improving forecasting performance is one of the main reasons for using cumulative data, given the noise reduction (i.e., curve smoothing) it provides. Moreover, many diffusion models work with cumulative time series to facilitate parameter tuning. In contrast to this idea, our results show that prediction performance is not improved, so it is important to explain why the use of cumulative data is not effective in our forecasting framework. A first answer is that cumulating increases the range of the time series; even though the cumulative series looks smooth, the error associated with the forecasts increases because of this larger range. Fig. C.1 of Appendix C shows the variances of the datasets at each time period for cumulative and non-cumulative data. The use of cumulative data shows a remarkable increase in the variability of the time series, which provides a clear picture of what happens: the increase in variability leads to poor performance of the forecasting method.
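For reference, unrolling the recursion used later in Eq. (C.1) of Appendix C gives the standard variance decomposition of the cumulative series,

var[Y_t] = var[y_1] + var[y_2] + ... + var[y_t] + 2 Σ_{i<j≤t} cov[y_i, y_j],

so even for uncorrelated period demands the variance of Y_t grows with t, and it grows faster when the demands are positively correlated.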
Tab. 8.9: Optimal results for artificial neural network partitioning the data, cumulative data.

Dataset   Mean RMSE   S. Deviation   Cluster Time (s)   Train Time (s)   Forecast Time (s)   p-value*
RD1           5.634          4.400               2.93             29.8               26.6       0.90
SD1           1.886          0.543               4.11             78.2               44.7       0.00
RD2          36.43          48.84                1.76             61.3               30.2       0.47
RD3           2.983          1.957               1.62             59.9               40.7       0.26

*p-value of the Kruskal-Wallis test evaluating the effect of partitioning the data or
not. The reader may compare these results with those shown in Tab. 8.7 for the
cumulative case.
8.4 Comparison of forecasting methods
To provide a visual illustration of the forecasting results, we selected one time series from each dataset and show the forecasts obtained with each regression method in Fig. 8.4. The forecasts fit the time series relatively well; the neural network forecast for the SD1 dataset is particularly good, which holds for this dataset in general. Fig. 8.5 shows the p-values of the pairwise comparison of regression methods according to the Kruskal-Wallis test. For the real datasets there is no difference between regression methods at the 90% confidence level; in particular, multiple linear regression and artificial neural networks produce very similar results. In contrast, for the synthetic dataset the regression methods are statistically different at confidence levels well above 95%, and artificial neural networks are the clear winners. The absence of statistical differences between regression methods for the real datasets highlights multiple linear regression as a very efficient method to forecast the demand of SLCPs within the forecasting framework discussed in this work (see Chapter 7).
A further evaluation of the regression methods considers the Mean Absolute Error of the forecasts at each time period; Fig. 8.6 shows these results. An interesting result is that the mean absolute error is larger at the beginning and at the end of the time series for the datasets whose time series have clearly completed their life cycle. This is
Tab. 8.10: Summary evaluation of the experimental treatments according to the measurements of RMSE.

Regression method   Dataset   Use of cumulative data   Use of clustering
MLR                 RD1       No                       No effect
                    SD1       No                       Yes
                    RD2       No                       No effect
                    RD3       No                       No effect
SVR                 RD1       No                       No effect
                    SD1       No                       No effect
                    RD2       No                       No effect
                    RD3       No                       No effect
ANN                 RD1       No                       No effect
                    SD1       No                       No
                    RD2       No                       No effect
                    RD3       No                       No effect
not obvious for the RD2 dataset because many of its time series have not yet completed their life cycle, as mentioned in Section 5.1.1. The absolute error thus decreases around the sales peak of the time series. Regarding the comparison of the regression methods, they achieve similar results at the beginning and around the peak of the time series, whereas there are noticeable differences at the end of the life cycle. This is most evident for the RD3 dataset, for which the results at the end of the life cycle look very unstable, particularly for SVR and ANN. Given these results, however, it is difficult to establish which regression method performs better for a given set of periods.
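The per-period error summary of Fig. 8.6 amounts to the following computation, sketched here with hypothetical arrays (one row per test series, one column per period):

import numpy as np

actual = np.random.default_rng(3).random((30, 10))        # demands of 30 test series
predicted = actual + 0.1 * np.random.default_rng(4).standard_normal((30, 10))

mae_per_period = np.abs(actual - predicted).mean(axis=0)  # one MAE value per period t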
Fig. 8.4: Forecasting results for some time series. (a) RD1 dataset; (b) SD1
dataset; (c) RD2 dataset; (d) RD3 dataset. Note: the forecast of the
first period was obtained as the average of the training set.
Fig. 8.5: Pairwise comparison of regression methods according to the Kruskal-Wallis test.
Fig. 8.6: Mean absolute error results for each regression method and each time
period. (a) RD1 Dataset; (b) SD1 dataset; (c) RD2 dataset; (d) RD3
dataset. Note that the greatest mean absolute errors are found at the
beginning and the end of the life cycle.
NINE
CONCLUSIONS
This work addressed the problem of forecasting the demand of short life cycle products (SLCPs) using multiple linear regression and machine learning methods such as SVM and ANN. The regression methods and the methodology followed in this study show a clear advantage over other forecasting methods proposed in the literature, because forecasts can be obtained at early stages of the product life cycle. In fact, the methods discussed in this work only require the information available after the first demand/sales period, whereas other methods proposed in the literature require demand/sales information from at least three previous periods. This feature is possible due to the effective use of the demand time series of similar products that have already completed their life cycle.
This work considered different strategies (hypotheses) aimed at improving the forecast performance of the regression methods. From the results obtained we can conclude that:
• The use of cumulative data does not improve forecast performance; the results show clear evidence of this. A possible explanation is the systematic increase of the variance of the cumulative time series resulting from the sum of random variables. Although the cumulative time series is smooth, its values at different periods hide a larger variance than the non-cumulative values, and this increase in variance generates poor forecasting results. In addition, the use of cumulative data increases the processing time, showing that cumulative data is of little benefit for forecasting within the framework proposed in this work.
• The effect of clustering on the forecasting results is not clear. Our experience using clustering to extract relevant information for the forecasting process shows that, apparently, an improvement in forecast performance is possible if the data show a clear clustering structure. Unfortunately, none of our real datasets shows a clear clustering structure. We think that clustering techniques may be a valuable tool in the development of effective forecasting methods, but the corresponding analysis is beyond the scope of this work. A possible direction is to forecast using the degree of membership of each pattern to each cluster, which is expected to improve the results.
• Nonlinear regression methods do not show a significant improvement in forecasting performance for most of the datasets. Multiple linear regression showed results statistically equal to those of support vector regression and artificial neural networks at a confidence level of at least 90%. This allows us to conclude that multiple linear regression is an efficient and effective method to forecast the demand of an SLCP.
To sum up, according to the results and analysis presented in this document, the application of MLR for forecasting, with non-cumulative data and
without clustering, is the best method to obtain low prediction errors.
APPENDIX A
RESULTS FOR THE SD1 DATASET USING THE CORRECT PARTITION
Tab. A.1 shows the effect on forecasting performance of partitioning the data of the SD1 dataset according to its correct partition. The forecasting process is carried out with the optimal number of lags found for the complete dataset (see Fig. 8.1 and Tab. 8.1). Tab. B.1 shows that the clustering algorithm improves the results obtained on the SD1 dataset.
Tab. A.1: Preliminary multiple linear regression results for the SD1 dataset considering all data and their correct partition.

NON-CUMULATIVE DATA
Dataset       p*   Mean RMSE   S. Deviation   Time (s)   p-value**
SD1           24      1.6099         0.3564     0.7431      0.1114
SD1 clust.    24      1.6005         0.5225     0.4888

CUMULATIVE DATA
Dataset       p*   Mean RMSE   S. Deviation   Time (s)   p-value**
SD1           24      2.3227         0.3475     5.0256      0.0687
SD1 clust.    24      2.4926         0.6497     4.2385

*Optimal number of lags, see Fig. 8.1.
**p-value of the Kruskal-Wallis test; values near 0 imply median differences.
APPENDIX B
THE EFFECT OF THE CLUSTERING ALGORITHM
B.1 Multiple linear regression case
In this section, the FSTS, FCM and FMLE algorithms are used to cluster the datasets in order to investigate the effect of partitioning on prediction performance (see Section 5.3.2). We investigate the effect of the clustering algorithm, the number of clusters, the value of the fuzzifier parameter, and the number of lags on forecasting performance. The number of partitions varies from K = 2 to K = 8 clusters, the number of lags goes up to p = 10 (for the SD1 dataset, however, we search over p = {18, . . . , 24} because of its strong lag dependence), and the fuzzifier parameter takes the values m = {1.1, . . . , 2.0, 2.25, 2.5, 2.75, 3}. The results are shown in Tab. B.1.
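Written out explicitly, this parameter grid is as follows (a sketch; the SD1-specific lag range is handled as a special case, as stated above):

K_grid = list(range(2, 9))                                     # K = 2, ..., 8
m_grid = [round(1.1 + 0.1 * i, 1) for i in range(10)] + [2.25, 2.5, 2.75, 3.0]
p_grid = list(range(1, 11))                                    # up to 10 lags
p_grid_sd1 = list(range(18, 25))                               # 18, ..., 24 for SD1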
B.2 Support vector regression case
After evaluating the effect of the clustering and lag parameters for the SVR case, we obtained the results shown in Tab. B.2. As observed, the use of the FMLE clustering algorithm improves the results for almost all datasets.
Tab. B.1: Comparison of FCM, FMLE and FSTS algorithms for MLR.

                    NON-CUMULATIVE DATA              CUMULATIVE DATA
Dataset            FCM      FMLE     FSTS          FCM      FMLE     FSTS
RD1      RMSE    4.084     4.070    4.154        5.048     5.009    5.064
         K           2         5        2            2         4        2
         m         1.5       2.8      1.3          1.3       1.1      1.2
         p           2         4        2            1         1        1
SD1      RMSE    1.431     1.420    1.413        1.841     1.832    1.7861
         K           4         5        6            7         6        8
         m           2         2      1.7          1.7       1.3      1.1
         p          24        24       20           18        18       15
RD2      RMSE   17.137    16.897   17.143       19.896    18.944   19.449
         K           2         2        2            3         4        2
         m         1.2      2.25      1.3            3       1.1      1.4
         p           5         6        5            1         1        1
RD3      RMSE    2.215     2.247    2.240        2.615     2.623    2.583
         K           2         2        5            2         2        3
         m         1.4       1.5      1.8          2.5       1.4        3
         p           3         2        1            1         1        1
Tab. B.2: Comparison of FCM, FMLE and FSTS algorithms for SVR.

                    NON-CUMULATIVE DATA              CUMULATIVE DATA
Dataset            FCM      FMLE     FSTS          FCM      FMLE     FSTS
RD1      RMSE    4.030     3.985    4.051        5.148     5.082    5.152
         K           2         8        2            2         4        2
         m         1.8       2.5      1.3          1.2       1.1      1.6
         p           4         5        2            1         1        1
SD1      RMSE    1.373     1.301    1.320        1.808     1.740    1.723
         K           3         8        2            8         8        8
         m         1.5       1.6      1.6          1.2       1.4      1.4
         p          17        17       16           17        17       17
RD2      RMSE   17.276    16.437   16.878       19.492    18.911   19.855
         K           2         8        2            2         2        2
         m         1.2      2.25      1.4            2       2.5      2.5
         p           6         5        4            1         1        1
RD3      RMSE    2.178     2.172    2.188        2.579     2.506    2.564
         K           2         6        2            2         3        2
         m         1.4       1.6      1.8          2.5       1.5     2.75
         p           5         5        5            1         1        1
APPENDIX C
VARIANCE OF CUMULATIVE AND NON-CUMULATIVE DATA
This appendix assesses the variability of non-cumulative and cumulative time series by considering the variances at each time period. The variances are calculated using the complete datasets; for example, the variance of the first period of the RD1 dataset is the variance of the values of all time series in that period. We assume that these values follow the same probability distribution, which is not necessarily true given the diverse nature of the time series, so this analysis only provides a provisional idea of the variances of cumulative and non-cumulative time series. The variance of the non-cumulative time series is calculated in the usual manner, and the variance of the cumulative time series is calculated according to the expression

var[Y_t] = var[Y_{t-1}] + var[y_t] + 2 cov[Y_{t-1}, y_t],          (C.1)

where y_t is the current value of the time series and Y_{t-1} is the cumulative value of the time series at period t − 1. Note the significant increase in variability when cumulative data is used (Fig. C.1).
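The computation behind Fig. C.1 can be sketched as follows; data stands for a hypothetical array with one row per time series and one column per period:

import numpy as np

data = np.random.default_rng(2).random((50, 12))        # 50 series, 12 periods (synthetic)

var_noncum = data.var(axis=0, ddof=1)                    # variance of y_t across series
var_cum = np.cumsum(data, axis=1).var(axis=0, ddof=1)    # variance of Y_t across series
# var_cum typically dwarfs var_noncum because the covariance terms in Eq. (C.1)
# accumulate with t.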
Fig. C.1: Variance of cumulative and non-cumulative data. (a) RD1 Dataset; (b)
SD1 dataset; (c) RD2 dataset; (d) RD3 dataset.