The Synergy between Classical and Soft-Computing Techniques for Time Series Prediction

I. Rojas¹, F. Rojas¹, H. Pomares¹, L.J. Herrera¹, J. González¹, and O. Valenzuela²

¹ University of Granada, Department of Computer Architecture and Technology, E.T.S. Computer Engineering, 18071 Granada, Spain
² University of Granada, Department of Applied Mathematics, Science Faculty, Granada, Spain
Abstract. This paper presents a new method for extracting valuable process information from input-output data, using a pseudo-gaussian basis function neural network with regression weights. The proposed methodology produces a dynamical radial basis function network, able to modify the number of neurons within the hidden layer. Another important characteristic of the proposed neural system is that the activation of the hidden neurons is normalized, which, as described in the bibliography, provides better performance than non-normalized activation. The effectiveness of the method is illustrated through the development of dynamical models for a very well known benchmark, the synthetic Mackey-Glass time series.
1 Introduction
RBF networks form a special neural network architecture consisting of three layers, namely the input, hidden and output layers. The input layer is only used to connect the network to its environment. Each node in the hidden layer has an associated centre, which is a vector with dimension equal to that of the network input data. Finally, the output layer is linear and serves as a summation unit:
F_{RBF}(x_n) = \sum_{i=1}^{K} w_i \, \phi_i(x_n, c_i, \sigma_i) \qquad (1)
where the radial basis functions φ_i are nonlinear functions, usually Gaussian functions [9]. An alternative is to calculate the weighted average F*_RBF of the radial basis functions with the addition of lateral connections between the radial
radial basis function with the addition of lateral connections between the radial
neurons. In normalized RBF neural networks, the output activity is normalized
by the total input activity in the hidden layer.
The use of the second method has been presented in different studies as
an approach which, due to its normalization properties, is very convenient and
provides better performance than the weighted sum method for function approximation problems. In terms of smoothness, the weighted average provides better
performance than the weighted sum [8,3].
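For illustration, the sketch below computes both variants for a toy network with ordinary Gaussian basis functions: the plain weighted sum of Eq. (1) and the normalized (weighted-average) output F*_RBF. The parameter values are purely illustrative and the code is a minimal sketch, not the implementation used in the experiments.

```python
import numpy as np

def gaussian_activations(x, centers, sigmas):
    """phi_i(x) = exp(-||x - c_i||^2 / sigma_i) for every hidden unit i."""
    d2 = np.sum((centers - x) ** 2, axis=1)      # squared distance to each centre
    return np.exp(-d2 / sigmas)

def rbf_weighted_sum(x, centers, sigmas, weights):
    """Eq. (1): F_RBF(x) = sum_i w_i * phi_i(x)."""
    return np.dot(weights, gaussian_activations(x, centers, sigmas))

def rbf_normalized(x, centers, sigmas, weights):
    """Weighted average F*_RBF(x): output normalized by the total hidden activity."""
    phi = gaussian_activations(x, centers, sigmas)
    return np.dot(weights, phi) / np.sum(phi)

# toy usage: 3 hidden units in a 2-D input space (illustrative values only)
centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
sigmas  = np.array([0.5, 0.5, 0.5])
weights = np.array([1.0, -0.5, 2.0])
x = np.array([0.8, 0.9])
print(rbf_weighted_sum(x, centers, sigmas, weights))
print(rbf_normalized(x, centers, sigmas, weights))
```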
Assuming that training data (x_i, y_i), i = 1, 2, . . . , D, are available and have to be approximated, the RBF network training problem can be formulated as an optimization problem, where the normalized root mean squared error (NRMSE) between the true outputs and the network predictions must be minimized with respect to both the network structure (the number of nodes K in the hidden layer) and the network parameters (centres, sigmas and output weights):
NRMSE = \sqrt{\frac{\overline{e^2}}{\sigma_z^2}} \qquad (2)
where σ_z^2 is the variance of the output data, and \overline{e^2} is the mean-square error between the obtained and the desired output. The development of a single procedure that minimizes the above error, taking into account both the structure and the parameters that define the system, is rather difficult using traditional optimization techniques. Most approaches presented in the bibliography consider a fixed RBF network structure and decompose the optimization of the parameters into two steps: in the first step the centres of the nodes are obtained (different paradigms can be used, such as clustering techniques, genetic algorithms, etc.), and in the second step the connection weights are calculated using simple linear regression.
Finally, a sequential learning algorithm is presented to adapt the structure of
the network, in which it is possible to create new hidden units and also to detect
and remove inactive units.
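As a small aside, the NRMSE criterion of Eq. (2) can be computed directly from the desired and predicted outputs; the sketch below assumes both are available as NumPy arrays.

```python
import numpy as np

def nrmse(y_true, y_pred):
    """Eq. (2): sqrt(mean squared error / variance of the desired output)."""
    mse = np.mean((y_true - y_pred) ** 2)
    return float(np.sqrt(mse / np.var(y_true)))
```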
In this paper we propose to use a pseudo-gaussian function for the nonlinear
function within the hidden unit. The output of a hidden neuron is computed as:
\phi_i(x) = \prod_v \varphi_{i,v}(x^v)

\varphi_{i,v}(x^v) =
\begin{cases}
\exp\!\left(-\dfrac{(x^v - c_i^v)^2}{\sigma_{i,-}^v}\right) & -\infty < x^v \le c_i^v \\[4pt]
\exp\!\left(-\dfrac{(x^v - c_i^v)^2}{\sigma_{i,+}^v}\right) & c_i^v < x^v < \infty
\end{cases}
\qquad (3)
The index i runs over the number of neurons (K) while v runs over the dimension of the input space (v ∈ [1, D]). The behaviour of classical gaussian functions
and the new PG-RBF in two dimensions is illustrated in Fig. 1 and Fig. 2.
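The asymmetric activation of Eq. (3) can be written compactly by selecting σ_{i,-} or σ_{i,+} per dimension according to the side of the centre; the sketch below is a minimal illustration with made-up parameter values.

```python
import numpy as np

def pseudo_gaussian_activation(x, c, sigma_minus, sigma_plus):
    """Eq. (3): product over input dimensions of an asymmetric Gaussian,
    using sigma_minus to the left of the centre and sigma_plus to the right."""
    sigma = np.where(x <= c, sigma_minus, sigma_plus)   # side-dependent width
    return float(np.prod(np.exp(-(x - c) ** 2 / sigma)))

# one neuron in a 2-D input space (illustrative values only)
c = np.array([0.5, 0.5])
s_minus = np.array([0.2, 0.3])
s_plus  = np.array([0.8, 0.1])
print(pseudo_gaussian_activation(np.array([0.3, 0.7]), c, s_minus, s_plus))
```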
The weights connecting the activation of the hidden units with the output of
the neural system, instead of being single parameters, are functions of the input
variables. Therefore, the wi are given by:
w_i = \sum_v b_i^v x^v + b_i^0 \qquad (4)

where b_i^v are single parameters.
The behaviour of the new PGBF in two dimensions is illustrated in Fig. 1.
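Combining Eqs. (3) and (4) gives the normalized network output with input-dependent weights. The sketch below reuses pseudo_gaussian_activation from the previous sketch and assumes a hypothetical dictionary `net` that stores per-neuron centres, widths and regression coefficients; it is an illustrative reading of the model, not the authors' code.

```python
import numpy as np

def regression_weight(x, b_v, b_0):
    """Eq. (4): w_i(x) = sum_v b_i^v x^v + b_i^0, a linear function of the input."""
    return float(np.dot(b_v, x) + b_0)

def pgrbf_forward(x, net):
    """Normalized PG-RBF output and the vector of hidden activations."""
    if not net["centers"]:
        return 0.0, np.zeros(0)
    phi = np.array([pseudo_gaussian_activation(x, c, sm, sp)   # Eq. (3), sketched above
                    for c, sm, sp in zip(net["centers"], net["sig_minus"], net["sig_plus"])])
    w = np.array([regression_weight(x, bv, b0)                 # Eq. (4)
                  for bv, b0 in zip(net["b"], net["b0"])])
    return float(np.dot(w, phi) / (phi.sum() + 1e-12)), phi    # small guard against underflow
```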
Therefore, the structure of the neural system proposed is modified using a pseudo-gaussian function (PG) in which two scaling parameters σ are introduced, which eliminate the symmetry restriction and provide the neurons in the hidden layer with greater flexibility for function approximation. Other important characteristics of the proposed neural system are that the activation of the hidden neurons is normalized and that, instead of using a single parameter for the output weights, these are functions of the input variables, which leads to a significant reduction in the number of hidden units compared with the classical RBF network.

Fig. 1. 3-D behaviour of a pseudo-gaussian function for two inputs

Fig. 2. Contour of a pseudo-gaussian function for two inputs
2 Sequential Learning Using PGBF Network
Learning in the PGBF consists in determining the minimum necessary number of neuron units, and in adjusting the parameters of each individual hidden neuron, given a set of data (x_n, y_n) [5]. The sequential learning algorithm starts with only one hidden node and creates additional neurons based on the novelty (innovation) of the observations, which arrive sequentially. The decision as to whether a datum should be deemed novel is based on the following conditions:
e_n = \left\| y_n - F^{*}_{RBF} \right\| > \xi, \qquad \beta_{max} = \max_i (\phi_i) < \zeta \qquad (5)
If both conditions are satisfied, then the datum is considered to be novel and therefore a new hidden neuron is added to the network. This process continues until a maximum number of hidden neurons is reached. The parameters ξ and ζ are thresholds to be selected appropriately for each problem. The first condition states that the error must be significant, and the second deals with the activation of the nonlinear neurons. The parameters of the new hidden node are determined initially as follows:
K = K + 1

b_K^v = \begin{cases} y_n - F^{*}_{RBF} & \text{if } v = 0 \\ 0 & \text{otherwise} \end{cases}

c_K = x_n \quad (c_K^v = x_n^v, \ \forall v \in [1, D])

\sigma_{K,+}^v = \sigma_{K,-}^v = \gamma \, \sigma_{init} \min_{i=1,\dots,K-1} \left\| x_n - c_i \right\| \qquad (6)
where γ is an overlap factor that determines the amount of overlap of the data
considered as novel and the nearest centre of a neuron. If an observation has no
novelty then the existing parameters of the network are adjusted by a gradient
descent algorithm to fit that observation. We propose a pruning strategy that can detect and remove hidden neurons which, although active initially, may subsequently end up contributing little to the network output. A more streamlined neural network can then be constructed as learning progresses. For this purpose, three cases will be considered:
– (a) Pruning the hidden units that make very little contribution to the overall network output for the whole data set. Pruning removes a hidden unit i when θ_i = Σ_{n=1}^{N} φ_i(x_n) < χ_1, where χ_1 is a threshold.
– (b) Pruning hidden units which have a very small activation region. These units obviously represent overtrained learning. A neuron i having very low values of σ_{i,+}^v + σ_{i,-}^v in the different dimensions of the input space will be removed: Σ_v (σ_{i,+}^v + σ_{i,-}^v) < χ_2.
– (c) Pruning hidden units which have a very similar activation to other neurons in the neural system. To achieve this, we define the vectors ψ_i, i = 1, . . . , K, such that ψ_i = [φ_i(x_1), φ_i(x_2), . . . , φ_i(x_N)], where N is the number of input/output vectors presented. As a guide to determine when two neurons present similar behaviour, this can be expressed in terms of the inner product ψ_i · ψ_j > χ_3. If the inner product is near one then ψ_i and ψ_j are both attempting to do nearly the same job (they possess a very similar activation level for the same input values). In this case, they directly compete, in the sense that only one of these neurons is selected and the other one is removed.
If any of these conditions is fulfilled for a particular neuron, the neuron is automatically removed.
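The growing and pruning rules can be made concrete as follows. The sketch reuses the hypothetical `net` dictionary introduced earlier; the thresholds are free parameters, and criterion (c) is implemented as a similarity test on normalized activation profiles (pruning when the inner product is close to one), following the intuition stated above. This is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def is_novel(error, beta_max, xi, zeta):
    """Conditions (5): large prediction error and low maximum hidden activation."""
    return abs(error) > xi and beta_max < zeta

def allocate_unit(net, x_n, error, gamma, sigma_init):
    """Eq. (6): new centre at x_n, bias coefficient set to the residual, widths
    proportional to the distance to the nearest existing centre."""
    if net["centers"]:
        d_min = min(np.linalg.norm(x_n - c) for c in net["centers"])
    else:
        d_min = 1.0                                   # first unit: assumed fallback width
    width = gamma * sigma_init * d_min
    net["centers"].append(x_n.copy())
    net["sig_minus"].append(np.full(x_n.size, width))
    net["sig_plus"].append(np.full(x_n.size, width))
    net["b"].append(np.zeros(x_n.size))               # b_K^v = 0 for v != 0
    net["b0"].append(error)                           # b_K^0 = y_n - F*_RBF

def prune_mask(activations, sig_minus, sig_plus, chi1, chi2, chi3):
    """Pruning criteria (a)-(c): low total contribution, collapsed widths, or
    near-duplicate activation profiles. `activations` is an N x K matrix."""
    theta = activations.sum(axis=0)                    # (a) sum_n phi_i(x_n)
    widths = (sig_minus + sig_plus).sum(axis=1)        # (b) sum_v (sigma+ + sigma-)
    keep = (theta >= chi1) & (widths >= chi2)
    psi = activations / (np.linalg.norm(activations, axis=0) + 1e-12)
    gram = psi.T @ psi                                 # (c) pairwise profile similarity
    for i in range(gram.shape[0]):
        for j in range(i):
            if keep[i] and keep[j] and gram[i, j] > chi3:
                keep[i] = False                        # remove one of two similar neurons
    return keep
```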
The final algorithm is summarized below:
Step 1: Initially, no hidden neurons exist.
Step 2: Set n = 0, K = 0, h = 1, where n, K and h are the number of patterns
presented to the network, the number of hidden neurons and the number of
learning cycles, respectively. Set the effective radius ζ_h. Set the maximum number of hidden neurons MaxNeuron.
Step 3: For each observation (x_n, y_n) compute:
a) the overall network output:

F_{RBF}(x_n) = \frac{\sum_{i=1}^{K} w_i \, \phi_i(x_n, c_i, \sigma_i)}{\sum_{i=1}^{K} \phi_i(x_n, c_i, \sigma_i)} = \frac{Num}{Den} \qquad (7)
b) the parameters required for the evaluation of the novelty of the observation: the error e_n = \left\| y_n - F^{*}_{RBF} \right\| and the maximum degree of activation β_max. If ((e_n > ξ) and (β_max < ζ_h) and (K < MaxNeuron)), allocate a new hidden unit with parameters:
K = K + 1

b_K^v = \begin{cases} y_n - F^{*}_{RBF} & \text{if } v = 0 \\ 0 & \text{otherwise} \end{cases}

c_K = x_n \quad (c_K^v = x_n^v, \ \forall v \in [1, D])

\sigma_{K,+}^v = \sigma_{K,-}^v = \gamma \, \sigma_{init} \min_{i=1,\dots,K-1} \left\| x_n - c_i \right\| \qquad (8)
else apply the parameter learning for all the hidden nodes:
\Delta c_i^v = -\frac{\partial E}{\partial c_i^v} = -\frac{\partial E}{\partial F^{*}_{RBF}} \frac{\partial F^{*}_{RBF}}{\partial \phi_i} \frac{\partial \phi_i}{\partial c_i^v} = (y_n - F^{*}_{RBF}) \frac{w_i - y_n}{Den} \, 2 \left[ \frac{x_n^v - c_i^v}{\sigma_{i,-}^v} e^{-\frac{(x_n^v - c_i^v)^2}{\sigma_{i,-}^v}} U(x_n^v; -\infty, c_i^v) + \frac{x_n^v - c_i^v}{\sigma_{i,+}^v} e^{-\frac{(x_n^v - c_i^v)^2}{\sigma_{i,+}^v}} U(x_n^v; c_i^v, \infty) \right] \qquad (9)

\Delta \sigma_{i,-}^v = -\frac{\partial E}{\partial \sigma_{i,-}^v} = -\frac{\partial E}{\partial F^{*}_{RBF}} \frac{\partial F^{*}_{RBF}}{\partial \phi_i} \frac{\partial \phi_i}{\partial \sigma_{i,-}^v} = (y_n - F^{*}_{RBF}) \frac{w_i - y_n}{Den} \left( \frac{x_n^v - c_i^v}{\sigma_{i,-}^v} \right)^2 e^{-\frac{(x_n^v - c_i^v)^2}{\sigma_{i,-}^v}} U(x_n^v; -\infty, c_i^v)

\Delta \sigma_{i,+}^v = -\frac{\partial E}{\partial \sigma_{i,+}^v} = -\frac{\partial E}{\partial F^{*}_{RBF}} \frac{\partial F^{*}_{RBF}}{\partial \phi_i} \frac{\partial \phi_i}{\partial \sigma_{i,+}^v} = (y_n - F^{*}_{RBF}) \frac{w_i - y_n}{Den} \left( \frac{x_n^v - c_i^v}{\sigma_{i,+}^v} \right)^2 e^{-\frac{(x_n^v - c_i^v)^2}{\sigma_{i,+}^v}} U(x_n^v; c_i^v, \infty) \qquad (10)

\Delta b_i^v = -\frac{\partial E}{\partial b_i^v} = -\frac{\partial E}{\partial F^{*}_{RBF}} \frac{\partial F^{*}_{RBF}}{\partial Num} \frac{\partial Num}{\partial w_i} \frac{\partial w_i}{\partial b_i^v} = (y_n - F^{*}_{RBF}) \frac{1}{Den} \phi_i(x_n) \, x_n^v

\Delta b_i^0 = -\frac{\partial E}{\partial b_i^0} = -\frac{\partial E}{\partial F^{*}_{RBF}} \frac{\partial F^{*}_{RBF}}{\partial Num} \frac{\partial Num}{\partial w_i} \frac{\partial w_i}{\partial b_i^0} = (y_n - F^{*}_{RBF}) \frac{1}{Den} \phi_i(x_n) \qquad (11)

where E is the squared prediction error for the current observation and U(x; a, b) denotes the indicator function, equal to 1 when x lies in the interval (a, b) and 0 otherwise.
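A compact reading of the update rules (9)-(11) for a single observation is sketched below, again using the hypothetical `net` dictionary; the learning rate `lr` and the clamping of the widths to positive values are assumptions added for numerical safety rather than details taken from the paper.

```python
import numpy as np

def gradient_step(x_n, y_n, net, lr=0.05):
    """One gradient-descent update of centres, widths and regression coefficients
    for a single observation, following Eqs. (9)-(11)."""
    C  = np.array(net["centers"])      # K x D
    Sm = np.array(net["sig_minus"])    # K x D
    Sp = np.array(net["sig_plus"])     # K x D
    B  = np.array(net["b"])            # K x D
    b0 = np.array(net["b0"])           # K

    left  = x_n <= C                                   # which side of each centre
    sigma = np.where(left, Sm, Sp)
    diff  = x_n - C
    g     = np.exp(-(diff ** 2) / sigma)               # per-dimension activation
    phi   = g.prod(axis=1)                             # Eq. (3)
    w     = B @ x_n + b0                               # Eq. (4)
    den   = phi.sum() + 1e-12
    r     = y_n - np.dot(w, phi) / den                 # prediction residual

    common = (r * (w - y_n) / den)[:, None]            # shared factor in Eqs. (9)-(10)
    C  += lr * common * 2.0 * (diff / sigma) * g       # Eq. (9)
    dS = common * (diff / sigma) ** 2 * g              # Eq. (10), side selected by U(.)
    Sm = np.maximum(Sm + lr * np.where(left, dS, 0.0), 1e-6)   # keep widths positive
    Sp = np.maximum(Sp + lr * np.where(~left, dS, 0.0), 1e-6)
    B  += lr * (r * phi / den)[:, None] * x_n          # Eq. (11)
    b0 += lr * (r * phi / den)

    net.update(centers=list(C), sig_minus=list(Sm), sig_plus=list(Sp),
               b=list(B), b0=list(b0))
    return r
```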
Step 4: If all the training patterns have been presented, then increment the number of learning cycles (h = h + 1) and check the criteria for pruning hidden units:

\theta_i = \sum_{n=1}^{N} \phi_i(x_n) < \chi_1

\sum_v \left( \sigma_{i,+}^v + \sigma_{i,-}^v \right) < \chi_2

\psi_i \cdot \psi_j > \chi_3, \quad \forall j \ne i \qquad (12)
Step 5: If the network shows satisfactory performance (NRMSE < π*) then stop. Otherwise go to Step 3.
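Putting Steps 1-5 together, a possible training skeleton is sketched below, reusing the helper functions from the earlier sketches (pseudo_gaussian_activation, pgrbf_forward, is_novel, allocate_unit, gradient_step, prune_mask, nrmse). The constant threshold `zeta` stands in for the decaying effective radius ζ_h, and all default values are assumptions, not settings reported in the paper.

```python
import numpy as np

def train_pgrbf(X, Y, xi, zeta, chi1, chi2, chi3, gamma=0.8, sigma_init=1.0,
                max_neurons=30, target_nrmse=0.01, max_cycles=100):
    """Skeleton of Steps 1-5: grow on novel observations, otherwise adapt by
    gradient descent, prune after each cycle, stop when NRMSE is small enough."""
    net = {"centers": [], "sig_minus": [], "sig_plus": [], "b": [], "b0": []}
    for h in range(max_cycles):                                       # learning cycles
        preds = []
        for x_n, y_n in zip(X, Y):
            f, phi = pgrbf_forward(x_n, net)
            preds.append(f)
            beta_max = phi.max() if phi.size else 0.0
            if is_novel(y_n - f, beta_max, xi, zeta) and len(net["centers"]) < max_neurons:
                allocate_unit(net, x_n, y_n - f, gamma, sigma_init)   # grow (Eqs. 5-6)
            elif net["centers"]:
                gradient_step(x_n, y_n, net)                          # adapt (Eqs. 9-11)
        if net["centers"]:
            A = np.array([[pseudo_gaussian_activation(x, c, sm, sp)
                           for c, sm, sp in zip(net["centers"], net["sig_minus"],
                                                net["sig_plus"])]
                          for x in X])                                # N x K activations
            keep = prune_mask(A, np.array(net["sig_minus"]), np.array(net["sig_plus"]),
                              chi1, chi2, chi3)                       # prune (Eq. 12)
            for key in net:
                net[key] = [p for p, k in zip(net[key], keep) if k]
        if nrmse(np.array(Y), np.array(preds)) < target_nrmse:        # stop (Step 5)
            break
    return net
```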
3 Using GA to Tune the Free Parameters of the Sequential Learning Algorithms
Stochastic algorithms, such as simulated annealing (SA) or genetic algorithms (GA), are increasingly used for combinatorial optimization problems in diverse fields, and particularly in time series [6]. The main advantage of a GA over SA is that it works on a set of potential solutions instead of a single one; however, in particular applications, the major drawback lies in the difficulty of designing a crossover operator that generates feasible solutions with respect to the problem constraints. Since this last difficulty was not encountered in this study, a GA [2,4] has been retained.
Genetic algorithms are search methods based upon the biological principles of natural selection and survival of the fittest introduced by Charles Darwin in his seminal work "The Origin of Species" (1859). They were rigorously introduced by [2]. GAs consist of a population of individuals that are possible solutions, and each of these individuals receives a reward, known as "fitness", that quantifies its suitability for solving the problem. In ordinary applications, the fitness is simply the objective function. Individuals with better than average fitness receive greater opportunities to cross. On the other hand, low-fitness individuals will have less chance to reproduce until they become extinct. Consequently, the
good features of the best individuals are disseminated over the generations. In
other words, the most promising areas of the search space are explored, making
the GA converge to the optimal or near optimal solution.
The ‘reproduction’ process by which the new individuals are derived consists
in taking the chromosomes of the parents and subjecting them to crossover and
mutation operations. The symbols (genes) from parents are combined into new
chromosomes and afterwards, randomly selected symbols from these new chromosomes are altered in a simulation of the genetic recombination and mutation
process of nature. The key ideas are thus the concept of a population of individual solutions being processed together and symbol configurations conferring
greater fitness being combined in the ‘offspring’, in the hope of producing even
better solutions.
As far as this paper is concerned, the main advantages of a GA strategy lie in:
– 1. The increased likelihood of finding the global minimum in a situation
where local minima may abound.
– 2. The flexibility of the approach, whereby the search for better solutions can be tailored to the problem at hand by, for example, choosing the genetic representation to suit the nature of the function being optimized.
In the sequential learning algorithm proposed in Section 2, there exist different parameters that must be tuned in order to obtain an optimal solution. These parameters are ξ, ζ, χ1, χ2 and χ3. One possibility is to set them by trial and error; the second possibility is to use the GA as an optimization tool that decides the best values for these parameters. With the scheme described above, the relations between the different paradigms are depicted in Fig. 3:
Fig. 3. Block diagram of the different paradigms used in the complete algorithm
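The paper does not specify the GA operators, so the sketch below shows one plausible wiring: a small real-coded GA (tournament selection, blend crossover, Gaussian mutation) whose fitness evaluates the sequential learning algorithm with a candidate set of thresholds ξ, ζ, χ1, χ2, χ3. All operator choices, default values and the commented fitness function are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ga_tune(fitness, bounds, pop_size=20, generations=30, p_mut=0.2):
    """Minimal real-coded GA: tournament selection, blend crossover, Gaussian
    mutation. `bounds` is a list of (low, high) pairs, one per free parameter."""
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    pop = rng.uniform(lo, hi, size=(pop_size, len(bounds)))
    fit = np.array([fitness(ind) for ind in pop])
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            i, j = rng.integers(pop_size, size=2)
            a = pop[i] if fit[i] > fit[j] else pop[j]          # tournament parent 1
            i, j = rng.integers(pop_size, size=2)
            b = pop[i] if fit[i] > fit[j] else pop[j]          # tournament parent 2
            alpha = rng.uniform(size=len(bounds))
            child = alpha * a + (1 - alpha) * b                # blend crossover
            mask = rng.uniform(size=len(bounds)) < p_mut
            noise = 0.1 * (hi - lo) * rng.normal(size=len(bounds))
            child = np.where(mask, child + noise, child)       # Gaussian mutation
            children.append(np.clip(child, lo, hi))
        pop = np.array(children)
        fit = np.array([fitness(ind) for ind in pop])
    return pop[np.argmax(fit)]

# hypothetical fitness for the five free parameters xi, zeta, chi1, chi2, chi3:
# def fitness(params):
#     net = train_pgrbf(X_train, Y_train, *params)
#     preds = np.array([pgrbf_forward(x, net)[0] for x in X_val])
#     return -nrmse(Y_val, preds)      # higher fitness = lower validation error
```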
4 Application to Time Series Prediction
In this section we attempt short-term prediction, by means of the algorithm presented in the previous sections, on the Mackey-Glass time series data. The Mackey-Glass chaotic time series is generated from the following delay differential equation:
\frac{dx(t)}{dt} = \frac{a \, x(t - \tau)}{1 + x(t - \tau)^{10}} - b \, x(t) \qquad (13)
When τ > 17, the equation shows chaotic behaviour. Higher values of τ yield
higher dimensional chaos. To make the comparisons with earlier work fair, we chose the parameters n = 4 (number of inputs) and P = 6 (prediction step).
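Eq. (13) can be integrated numerically with a simple Euler scheme and a circular buffer holding the delayed values x(t − τ). The text does not list a and b, so the usual benchmark values a = 0.2 and b = 0.1 (with τ = 17) are assumed below; this is a sketch of the data generation, not the exact procedure used in the paper.

```python
import numpy as np

def mackey_glass(n_samples, a=0.2, b=0.1, tau=17, dt=0.1, x0=1.2, discard=500):
    """Integrate Eq. (13) with an Euler scheme and a circular buffer for x(t - tau);
    samples are taken at unit time steps and the first `discard` are dropped."""
    per_sample = int(round(1.0 / dt))          # integration steps per output sample
    history = int(round(tau / dt))             # buffer length for the delayed term
    buf = np.full(history, x0)                 # constant initial history
    x = x0
    series = []
    for k in range((n_samples + discard) * per_sample):
        if k % per_sample == 0:
            series.append(x)                   # record x(t) at unit time intervals
        x_tau = buf[k % history]               # x(t - tau) from the circular buffer
        buf[k % history] = x                   # overwrite the slot with the current x(t)
        x = x + dt * (a * x_tau / (1.0 + x_tau ** 10) - b * x)
    return np.array(series[discard:])
```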
Fig. 4. Prediction step = 6 and number of neurons = 12. (a) Result of the original and predicted Mackey-Glass time series (which are indistinguishable). (b) Prediction error
Table 1. Comparison results of the prediction error of different methods for prediction step equal to 6 (500 training data).

Method                                                 Prediction Error (RMSE)
Auto Regressive Model                                  0.19
Cascade Correlation NN                                 0.06
Back-Prop. NN                                          0.02
6th-order Polynomial                                   0.04
Linear Predictive Method                               0.55
Kim and Kim (Genetic Algorithm and Fuzzy System) [6]:
  5 MFs                                                0.049206
  7 MFs                                                0.042275
  9 MFs                                                0.037873
ANFIS and Fuzzy System (16 rules) [6]                  0.007
Classical RBF (with 23 neurons) [1]                    0.0114
Our Approach (with 12 neurons)                         0.0036 ± 0.0008
To compare our approach with earlier works, we chose the parameters presented in [6]. The experiment was performed 25 times, and we show graphically one result that is close to the average error obtained. Fig. 4(a) shows the predicted and desired values (dashed and continuous lines, respectively) for both training and test data. As they are practically identical, the difference can only be seen on a finer scale (Fig. 4(b)). Table 1 compares the prediction accuracy of different computational paradigms presented in the bibliography for this benchmark problem (including our proposal), covering various fuzzy system structures, neural systems and genetic algorithms [4,5,6,7] (each reference uses a different number of decimals for the prediction error; we take exactly the values presented).
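For completeness, the sketch below assembles the usual four-input, six-step-ahead regression task from the generated series and evaluates the RMSE reported in Table 1. The embedding lags [x(t−18), x(t−12), x(t−6), x(t)] and the threshold values in the commented usage are assumptions, not values stated in the paper.

```python
import numpy as np

def make_dataset(series, lags=(18, 12, 6, 0), step=6):
    """Inputs [x(t-18), x(t-12), x(t-6), x(t)], target x(t+6), following the
    usual Mackey-Glass benchmark setup."""
    start = max(lags)
    X, Y = [], []
    for t in range(start, len(series) - step):
        X.append([series[t - l] for l in lags])
        Y.append(series[t + step])
    return np.array(X), np.array(Y)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# hypothetical end-to-end usage, combining the earlier sketches:
# series = mackey_glass(1000)
# X, Y = make_dataset(series)
# X_tr, Y_tr, X_te, Y_te = X[:500], Y[:500], X[500:], Y[500:]   # 500 training data
# net = train_pgrbf(X_tr, Y_tr, xi=0.02, zeta=0.5, chi1=1e-3, chi2=1e-3, chi3=0.999)
# preds = np.array([pgrbf_forward(x, net)[0] for x in X_te])
# print(rmse(Y_te, preds))
```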
5 Conclusion
This article describes a new structure to create a RBF neural network that
uses regression weights to replace the constant weights normally used. These
regression weights are assumed to be functions of the input variables. In this
way the number of hidden units within a RBF neural network is reduced. A
new type of nonlinear function is proposed: the pseudo-gaussian function. With
this, the neural system gains flexibility, as the neurons possess an activation
field that does not necessarily have to be symmetric with respect to the centre
or to the location of the neuron in the input space. In addition to this new
structure, we propose a sequential learning algorithm, which is able to adapt the
structure of the network. This algorithm makes it possible to create new hidden
units and also to detect and remove inactive units. We have presented conditions
to increase or decrease the number of neurons, based on the novelty of the data
and on the overall behaviour of the neural system, respectively. The feasibility
of the evolution and learning capability of the resulting algorithm for the neural
network is demonstrated by predicting time series.
Acknowledgements. This work has been partially supported by the Spanish
CICYT Project DPI2001-3219.
References
1. Cho, K.B., Wang, B.H.: Radial basis function based adaptive fuzzy systems and their applications to system identification and prediction. Fuzzy Sets and Systems, vol. 83 (1995) 325–339
2. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan
Press. (1975)
3. Benaim, M.: On functional approximation with normalized Gaussian units. Neural Computation, vol. 6 (1994)
4. Goldberg, D.E.: Genetic Algorithms in Search, Optimization & Machine Learning.
Addison-Wesley (1989).
5. Karayiannis, N.B., Mi, G.W.: Growing Radial Basis Neural Networks: Merging Supervised and Unsupervised Learning with Network Growth Techniques. IEEE Transactions on Neural Networks, vol. 8, no. 6 (1997) 1492–1506
6. Kim, D., Kim, C.: Forecasting time series with genetic fuzzy predictor ensemble. IEEE Transactions on Fuzzy Systems, vol. 5, no. 4 (1997) 523–535
7. González, J., Rojas, I., Ortega, J., Pomares, H., Fernández, F.J., Díaz, A.F.: Multiobjective evolutionary optimization of the size, shape and position parameters of radial basis function networks for function approximation. IEEE Transactions on Neural Networks, vol. 14, no. 6 (2003)
8. Nowlan, S.: Maximum likelihood competitive learning. Proc. Neural Inform. Process.
Systems. (1990) 574–582
9. Rojas, I., Anguita, M., Ros, E., Pomares, H., Valenzuela, O., Prieto, A.: What are the main factors involved in the design of a Radial Basis Function Network? 6th European Symposium on Artificial Neural Networks, ESANN'98, April 22-24 (1998) 1–6