The Synergy between Classical and Soft-Computing Techniques for Time Series Prediction

I. Rojas¹, F. Rojas¹, H. Pomares¹, L.J. Herrera¹, J. González¹, and O. Valenzuela²
¹ University of Granada, Department of Computer Architecture and Technology, E.T.S. Computer Engineering, 18071 Granada, Spain
² University of Granada, Department of Applied Mathematics, Science Faculty, Granada, Spain

Abstract. A new method for extracting valuable process information from input–output data is presented in this paper, using a pseudo-gaussian basis function neural network with regression weights. The proposed methodology produces a dynamical radial basis function network, able to modify the number of neurons within its hidden layer. Another important characteristic of the proposed neural system is that the activation of the hidden neurons is normalized, which, as described in the bibliography, provides better performance than non-normalized activation. The effectiveness of the method is illustrated through the development of dynamical models for a very well known benchmark, the synthetic Mackey-Glass time series.

1 Introduction

RBF networks form a special neural network architecture consisting of three layers, namely the input, hidden and output layers. The input layer is only used to connect the network to its environment. Each node in the hidden layer has an associated centre, which is a vector with dimension equal to that of the network input data. Finally, the output layer is linear and serves as a summation unit:

$$F_{RBF}(x_n) = \sum_{i=1}^{K} w_i\,\phi_i(x_n, c_i, \sigma_i) \qquad (1)$$

where the radial basis functions $\phi_i$ are nonlinear functions, usually gaussian functions [9]. An alternative is to calculate the weighted average $F^{*}_{RBF}$ of the radial basis functions, with the addition of lateral connections between the radial neurons. In normalized RBF neural networks, the output activity is normalized by the total input activity in the hidden layer.
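For illustration, a minimal numeric sketch of the two output schemes — the weighted sum of Eq. (1) and the normalized weighted average $F^{*}_{RBF}$ — assuming isotropic gaussian units; the names are illustrative, not the authors' code:

```python
import numpy as np

def rbf_outputs(x, centers, sigmas, weights):
    """Return the weighted-sum (Eq. 1) and normalized RBF outputs for one input x."""
    # gaussian activation of each hidden unit
    phi = np.exp(-np.sum((x - centers) ** 2, axis=1) / sigmas ** 2)
    weighted_sum = np.dot(weights, phi)        # F_RBF, Eq. (1)
    weighted_avg = weighted_sum / np.sum(phi)  # F*_RBF, normalized by total activity
    return weighted_sum, weighted_avg

centers = np.array([[0.0], [1.0]])   # two hidden units in a 1-D input space
sigmas = np.array([0.5, 0.5])
weights = np.array([1.0, 3.0])
f_sum, f_avg = rbf_outputs(np.array([0.5]), centers, sigmas, weights)
# midway between two equally wide units, the normalized output is the mean of their weights
```

Note that the normalized output is a convex combination of the weights $w_i$, so it stays inside their range; this boundedness is one aspect of the smoothness advantage discussed next.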
The use of the second method has been presented in different studies as an approach which, due to its normalization properties, is very convenient and provides better performance than the weighted sum for function approximation problems. In terms of smoothness, the weighted average outperforms the weighted sum [8,3].

R. Monroy et al. (Eds.): MICAI 2004, LNAI 2972, pp. 30–39, 2004. © Springer-Verlag Berlin Heidelberg 2004

Assuming that training data $(x_i, y_i)$, $i = 1, 2, \dots, D$ are available and have to be approximated, the RBF network training problem can be formulated as an optimization problem in which the normalized root mean squared error (NRMSE) between the true outputs and the network predictions must be minimized with respect to both the network structure (the number of nodes K in the hidden layer) and the network parameters (centres, sigmas and output weights):

$$NRMSE = \sqrt{\frac{\overline{e^2}}{\sigma_z^2}} \qquad (2)$$

where $\sigma_z^2$ is the variance of the output data and $\overline{e^2}$ is the mean-square error between the obtained and the desired output. Developing a single procedure that minimizes the above error, taking into account both the structure and the parameters that define the system, is rather difficult using traditional optimization techniques. Most approaches presented in the bibliography consider a fixed RBF network structure and decompose the optimization of the parameters into two steps: in the first step the centres of the nodes are obtained (different paradigms can be used, such as clustering techniques, genetic algorithms, etc.), and in the second step the connection weights are calculated using simple linear regression. Finally, a sequential learning algorithm is presented to adapt the structure of the network, in which it is possible to create new hidden units and also to detect and remove inactive units.
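The NRMSE of Eq. (2) is straightforward to compute; a minimal sketch (function name illustrative), using the population variance of the targets:

```python
import numpy as np

def nrmse(y_true, y_pred):
    """Normalized root mean squared error: sqrt(mean squared error / target variance)."""
    y_true = np.asarray(y_true, dtype=float)
    mse = np.mean((y_true - np.asarray(y_pred, dtype=float)) ** 2)
    return np.sqrt(mse / np.var(y_true))

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = nrmse(y, y)                           # exact predictions give 0.0
baseline = nrmse(y, np.full_like(y, y.mean()))  # predicting the mean gives 1.0
```

NRMSE = 1 thus corresponds to the trivial predictor that always outputs the target mean, which makes values well below 1 meaningful.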
In this paper we propose to use a pseudo-gaussian function for the nonlinear function within the hidden unit. The output of a hidden neuron is computed as:

$$\phi_i(x) = \prod_{v} \varphi_{i,v}(x^v), \qquad
\varphi_{i,v}(x^v) = \begin{cases}
e^{-(x^v - c_i^v)^2/\sigma_{i,-}^v}, & -\infty < x^v \le c_i^v \\
e^{-(x^v - c_i^v)^2/\sigma_{i,+}^v}, & c_i^v < x^v < \infty
\end{cases} \qquad (3)$$

The index $i$ runs over the number of neurons (K), while $v$ runs over the dimensions of the input space ($v \in [1, D]$). The behaviour of classical gaussian functions and of the new PG-RBF in two dimensions is illustrated in Fig. 1 and Fig. 2. The weights connecting the activation of the hidden units with the output of the neural system, instead of being single parameters, are functions of the input variables. Therefore, the $w_i$ are given by:

$$w_i = \sum_{v} b_i^v x^v + b_i^0 \qquad (4)$$

where the $b_i^v$ are single parameters. Therefore, the structure of the proposed neural system is modified using a pseudo-gaussian function (PG) in which two scaling parameters $\sigma$ are introduced; these eliminate the symmetry restriction and provide the neurons in the hidden layer with greater flexibility for function approximation.

Fig. 1. 3-D behaviour of a pseudo-gaussian function for two inputs
Fig. 2. Contour of a pseudo-gaussian function for two inputs

Other important characteristics of the proposed neural system are that the activation of the hidden neurons is normalized and that, instead of being single parameters, the output weights are functions of the input variables, which leads to a significant reduction in the number of hidden units compared with the classical RBF network.

2 Sequential Learning Using the PGBF Network

Learning in the PGBF network consists in determining the minimum necessary number of hidden units and in adjusting the parameters of each individual hidden neuron, given a set of data $(x_n, y_n)$ [5].
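This learning scheme, developed in detail below, can be sketched in a toy single-output form that combines the pseudo-gaussian activation of Eq. (3), the regression weights of Eq. (4), the normalized output, and a novelty-driven growth step. It is a deliberately simplified sketch, not the authors' implementation: centres and widths are frozen after allocation, only the regression-weight parameters are adapted, pruning is omitted, and the class name, thresholds and learning rate are our illustrative choices:

```python
import numpy as np

class PGRBF:
    """Simplified sketch of a sequentially grown, normalized pseudo-gaussian RBF net."""

    def __init__(self, dim, xi=0.1, zeta=0.5, gamma=1.0, lr=0.05):
        self.dim, self.xi, self.zeta, self.gamma, self.lr = dim, xi, zeta, gamma, lr
        self.centers, self.sig_m, self.sig_p = [], [], []
        self.b, self.b0 = [], []  # regression-weight parameters, Eq. (4)

    def _phi(self, x):
        """Pseudo-gaussian activations of all hidden units, Eq. (3)."""
        out = []
        for c, sm, sp in zip(self.centers, self.sig_m, self.sig_p):
            s = np.where(x <= c, sm, sp)  # asymmetric width on each side of the centre
            out.append(np.prod(np.exp(-(x - c) ** 2 / s)))
        return np.array(out)

    def predict(self, x):
        """Normalized output: weighted average of the regression weights."""
        if not self.centers:
            return 0.0
        phi = self._phi(x)
        w = np.array([np.dot(bv, x) + b0 for bv, b0 in zip(self.b, self.b0)])
        return float(np.dot(w, phi) / (np.sum(phi) + 1e-12))

    def observe(self, x, y):
        """One sequential step: grow on novelty, otherwise adapt the output weights."""
        phi = self._phi(x)
        err = y - self.predict(x)
        if abs(err) > self.xi and (phi.size == 0 or phi.max() < self.zeta):
            # novel datum: allocate a new unit centred on it
            dists = [np.linalg.norm(x - c) for c in self.centers]
            width = self.gamma * (min(dists) if dists else 1.0)
            self.centers.append(x.copy())
            self.sig_m.append(np.full(self.dim, width))
            self.sig_p.append(np.full(self.dim, width))
            self.b.append(np.zeros(self.dim))
            self.b0.append(err)  # the new unit absorbs the residual error
        elif self.centers:
            # gradient step on the regression-weight parameters only
            den = np.sum(phi) + 1e-12
            for i in range(len(self.centers)):
                self.b[i] = self.b[i] + self.lr * err * phi[i] / den * x
                self.b0[i] = self.b0[i] + self.lr * err * phi[i] / den

net = PGRBF(dim=1)
for _ in range(20):  # a few learning cycles over a toy 1-D target
    for t in np.linspace(0.0, 1.0, 21):
        net.observe(np.array([t]), float(np.sin(2 * np.pi * t)))
```

Run on this sine toy problem, the network grows only a handful of units; the full scheme below additionally adapts centres and widths by gradient descent and prunes inactive units.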
The sequential learning algorithm starts with only one hidden node and creates additional neurons based on the novelty (innovation) of the observations, which arrive sequentially. The decision as to whether a datum should be deemed novel is based on the following conditions:

$$e_n = y_n - F^{*}_{RBF}(x_n) > \xi, \qquad \beta_{max} = \max_i \phi_i(x_n) < \zeta \qquad (5)$$

If both conditions are satisfied, then the datum is considered novel and a new hidden neuron is therefore added to the network. This process continues until a maximum number of hidden neurons is reached. The parameters $\xi$ and $\zeta$ are thresholds to be selected appropriately for each problem. The first condition states that the error must be significant, and the second deals with the activation of the nonlinear neurons. The parameters of the new hidden node are initially determined as follows:

$$K = K + 1; \qquad b_K^v = \begin{cases} y_n - F^{*}_{RBF}(x_n), & v = 0 \\ 0, & \text{otherwise} \end{cases}$$
$$c_K = x_n \;(c_K^v = x_n^v,\ \forall v \in [1, D]); \qquad \sigma_{K,+}^v = \sigma_{K,-}^v = \gamma\,\sigma_{init} \min_{i=1,\dots,K-1} \lVert x_n - c_i \rVert \qquad (6)$$

where $\gamma$ is an overlap factor that determines the amount of overlap between the datum considered novel and the nearest neuron centre. If an observation has no novelty, then the existing parameters of the network are adjusted by a gradient descent algorithm to fit that observation. We propose a pruning strategy that can detect and remove hidden neurons which, although active initially, may subsequently end up contributing little to the network output. A more streamlined neural network can then be constructed as learning progresses. For this purpose, three cases are considered:
– (a) Pruning the hidden units that make very little contribution to the overall network output for the whole data set. Pruning removes a hidden unit $i$ when $\theta_i = \sum_{n=1}^{N} \phi_i(x_n) < \chi_1$, where $\chi_1$ is a threshold.
– (b) Pruning hidden units which have a very small activation region. These units obviously represent overtrained learning.
A neuron $i$ having very low values of $\sigma_{i,+}^v + \sigma_{i,-}^v$ in the different dimensions of the input space will be removed: $\sum_v (\sigma_{i,+}^v + \sigma_{i,-}^v) < \chi_2$.
– (c) Pruning hidden units whose activation is very similar to that of other neurons in the neural system. To achieve this, we define for each hidden neuron $i$ the vector $\psi_i = [\phi_i(x_1), \phi_i(x_2), \dots, \phi_i(x_N)]$, where $N$ is the number of input/output vectors presented. As a guide to determine when two neurons present similar behaviour, this can be expressed in terms of the inner product $\psi_i \cdot \psi_j > \chi_3$. If the inner product is near one, then $\psi_i$ and $\psi_j$ are both attempting to do nearly the same job (they possess a very similar activation level for the same input values). In this case they directly compete, in the sense that only one of these neurons is selected and the other one is removed.
If any of these conditions is fulfilled for a particular neuron, the neuron is automatically removed.

The final algorithm is summarized below:

Step 1: Initially, no hidden neurons exist.
Step 2: Set $n = 0$, $K = 0$, $h = 1$, where $n$, $K$ and $h$ are the number of patterns presented to the network, the number of hidden neurons and the number of learning cycles, respectively. Set the effective radius $\zeta_h$. Set the maximum number of hidden neurons $MaxNeuron$.
Step 3: For each observation $(x_n, y_n)$ compute:
a) the overall network output:

$$F^{*}_{RBF}(x_n) = \frac{\sum_{i=1}^{K} w_i\,\phi_i(x_n, c_i, \sigma_i)}{\sum_{i=1}^{K} \phi_i(x_n, c_i, \sigma_i)} = \frac{Num}{Den} \qquad (7)$$

b) the parameters required for the evaluation of the novelty of the observation: the error $e_n = y_n - F^{*}_{RBF}(x_n)$ and the maximum degree of activation $\beta_{max}$.
If (($e_n > \xi$) and ($\beta_{max} < \zeta_h$) and ($K < MaxNeuron$)), allocate a new hidden unit with parameters:

$$K = K + 1; \qquad b_K^v = \begin{cases} y_n - F^{*}_{RBF}(x_n), & v = 0 \\ 0, & \text{otherwise} \end{cases}$$
$$c_K = x_n \;(c_K^v = x_n^v,\ \forall v \in [1, D]); \qquad \sigma_{K,+}^v = \sigma_{K,-}^v = \gamma\,\sigma_{init} \min_{i=1,\dots,K-1} \lVert x_n - c_i \rVert \qquad (8)$$

else apply the parameter learning to all the hidden nodes:

$$\Delta c_i^v = -\frac{\partial E}{\partial c_i^v} = (y_n - F^{*}_{RBF})\,\frac{w_i - F^{*}_{RBF}}{Den}\left[\frac{2(x_n^v - c_i^v)}{\sigma_{i,-}^v}\,e^{-(x_n^v - c_i^v)^2/\sigma_{i,-}^v}\,U(x_n^v; -\infty, c_i^v) + \frac{2(x_n^v - c_i^v)}{\sigma_{i,+}^v}\,e^{-(x_n^v - c_i^v)^2/\sigma_{i,+}^v}\,U(x_n^v; c_i^v, \infty)\right] \qquad (9)$$

$$\Delta \sigma_{i,-}^v = -\frac{\partial E}{\partial \sigma_{i,-}^v} = (y_n - F^{*}_{RBF})\,\frac{w_i - F^{*}_{RBF}}{Den}\left(\frac{x_n^v - c_i^v}{\sigma_{i,-}^v}\right)^2 e^{-(x_n^v - c_i^v)^2/\sigma_{i,-}^v}\,U(x_n^v; -\infty, c_i^v)$$
$$\Delta \sigma_{i,+}^v = -\frac{\partial E}{\partial \sigma_{i,+}^v} = (y_n - F^{*}_{RBF})\,\frac{w_i - F^{*}_{RBF}}{Den}\left(\frac{x_n^v - c_i^v}{\sigma_{i,+}^v}\right)^2 e^{-(x_n^v - c_i^v)^2/\sigma_{i,+}^v}\,U(x_n^v; c_i^v, \infty) \qquad (10)$$

$$\Delta b_i^v = -\frac{\partial E}{\partial b_i^v} = (y_n - F^{*}_{RBF})\,\frac{1}{Den}\,\phi_i(x_n)\,x_n^v; \qquad \Delta b_i^0 = -\frac{\partial E}{\partial b_i^0} = (y_n - F^{*}_{RBF})\,\frac{1}{Den}\,\phi_i(x_n) \qquad (11)$$

where $U(x; a, b)$ denotes the indicator function of the interval $(a, b)$: 1 if $x$ lies in the interval and 0 otherwise.

Step 4: When all the training patterns have been presented, increment the number of learning cycles ($h = h + 1$) and check the criteria for pruning hidden units:

$$\theta_i = \sum_{n=1}^{N} \phi_i(x_n) < \chi_1; \qquad \sum_v \left(\sigma_{i,+}^v + \sigma_{i,-}^v\right) < \chi_2; \qquad \psi_i \cdot \psi_j > \chi_3,\ \forall j \ne i \qquad (12)$$

Step 5: If the network shows satisfactory performance ($NRMSE < \pi^{*}$), then stop. Otherwise, go to Step 3.

3 Using GA to Tune the Free Parameters of the Sequential Learning Algorithm

Stochastic algorithms such as simulated annealing (SA) or genetic algorithms (GA) are increasingly used for combinatorial optimization problems in diverse fields, and particularly in time series [6]. The main advantage of GA over SA is that it works on a set of potential solutions instead of a single one; however, in particular applications, the major drawback lies in the difficulty of designing a crossover operator that generates feasible solutions with respect to the problem constraints.
Insofar as this last point did not arise in this study, a GA [2,4] was retained. Genetic algorithms are search methods based upon the biological principles of natural selection and survival of the fittest, introduced by Charles Darwin in his seminal work "The Origin of Species" (1859); they were rigorously introduced by [2]. GAs maintain a population of individuals, each a possible solution, and each individual receives a reward, known as "fitness", that quantifies its suitability for solving the problem. In ordinary applications, the fitness is simply the objective function. Individuals with better than average fitness receive greater opportunities to cross; low-fitness individuals, on the other hand, have less chance to reproduce, until they become extinct. Consequently, the good features of the best individuals are disseminated over the generations; in other words, the most promising areas of the search space are explored, making the GA converge to an optimal or near-optimal solution. The 'reproduction' process by which new individuals are derived consists in taking the chromosomes of the parents and subjecting them to crossover and mutation operations: the symbols (genes) from the parents are combined into new chromosomes and, afterwards, randomly selected symbols from these new chromosomes are altered, in a simulation of the genetic recombination and mutation processes of nature. The key ideas are thus a population of individual solutions processed together, and symbol configurations conferring greater fitness being combined in the 'offspring', in the hope of producing even better solutions.

As far as this paper is concerned, the main advantages of a GA strategy lie in:
– 1. The increased likelihood of finding the global minimum in a situation where local minima may abound.
– 2.
The flexibility of the approach, whereby the search for better solutions can be tailored to the problem at hand by, for example, choosing the genetic representation to suit the nature of the function being optimized.

In the sequential learning algorithm proposed in Section 2, there exist different parameters that should be tuned in order to obtain an optimal solution: $\xi$, $\zeta$, $\chi_1$, $\chi_2$ and $\chi_3$. One possibility is to do this by trial and error; another is to use the GA as an optimization tool that must decide the best values for these parameters. The relations between the different paradigms, combined in the way described above, are shown in figure 3:

Fig. 3. Block diagram of the different paradigms used in the complete algorithm

4 Application to Time Series Prediction

In this section we attempt short-term prediction, by means of the algorithm presented above, on the Mackey-Glass time series data. The Mackey-Glass chaotic time series is generated by the following delay differential equation:

$$\frac{dx(t)}{dt} = \frac{a\,x(t-\tau)}{1 + x(t-\tau)^{10}} - b\,x(t) \qquad (13)$$

When $\tau > 17$, the equation shows chaotic behaviour; higher values of $\tau$ yield higher-dimensional chaos. To make the comparisons with earlier work fair, we chose the parameters $n = 4$ and $P = 6$.

Fig. 4. Prediction step = 6 and number of neurons = 12. (a) Result of the original and predicted Mackey-Glass time series (which are indistinguishable). (b) Prediction error

Table 1. Comparison results of the prediction error of different methods for prediction step equal to 6 (500 training data).

Method                                                Prediction Error (RMSE)
Auto Regressive Model                                 0.19
Cascade Correlation NN                                0.06
Back-Prop. NN                                         0.02
6th-order Polynomial                                  0.04
Linear Predictive Method                              0.55
Kim and Kim (Genetic Algorithm and Fuzzy System) [6]
  5 MFs                                               0.049206
  7 MFs                                               0.042275
  9 MFs                                               0.037873
ANFIS and Fuzzy System (16 rules) [6]                 0.007
Classical RBF (with 23 neurons) [1]                   0.0114
Our Approach (with 12 neurons)                        0.0036 ± 0.0008

To compare our approach with earlier works, we chose the parameters presented in [6]. The experiment was performed 25 times, and we show graphically one result that is close to the average error obtained. Fig. 4a) shows the predicted and desired values (dashed and continuous lines, respectively) for both training and test data; as they are practically identical, the difference can only be appreciated on a finer scale (Fig. 4b)). Table 1 compares the prediction accuracy of different computational paradigms presented in the bibliography for this benchmark problem (including our proposal), for various fuzzy system structures, neural systems and genetic algorithms [4,5,6,7] (each reference uses a different number of decimals for the prediction error; we quote exactly the values presented).

5 Conclusion

This article describes a new structure to create an RBF neural network that uses regression weights in place of the constant weights normally used. These regression weights are assumed to be functions of the input variables; in this way the number of hidden units within the RBF neural network is reduced. A new type of nonlinear function is proposed: the pseudo-gaussian function. With it, the neural system gains flexibility, as the neurons possess an activation field that does not necessarily have to be symmetric with respect to the centre or to the location of the neuron in the input space. In addition to this new structure, we propose a sequential learning algorithm, which is able to adapt the structure of the network.
This algorithm makes it possible to create new hidden units and also to detect and remove inactive units. We have presented conditions to increase or decrease the number of neurons, based on the novelty of the data and on the overall behaviour of the neural system, respectively. The feasibility of the evolution and learning capability of the resulting algorithm for the neural network is demonstrated by predicting time series.

Acknowledgements. This work has been partially supported by the Spanish CICYT Project DPI2001-3219.

References
1. Cho, K.B., Wang, B.H.: Radial basis function based adaptive fuzzy systems and their applications to system identification and prediction. Fuzzy Sets and Systems, vol. 83 (1995) 325–339
2. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press (1975)
3. Benaim, M.: On functional approximation with normalized Gaussian units. Neural Computation, vol. 6 (1994)
4. Goldberg, D.E.: Genetic Algorithms in Search, Optimization & Machine Learning. Addison-Wesley (1989)
5. Karayiannis, N.B., Weiqun Mi, G.: Growing Radial Basis Neural Networks: Merging Supervised and Unsupervised Learning with Network Growth Techniques. IEEE Transactions on Neural Networks, vol. 8, no. 6 (1997) 1492–1506
6. Kim, D., Kim, C.: Forecasting time series with genetic fuzzy predictor ensemble. IEEE Transactions on Fuzzy Systems, vol. 5, no. 4 (November 1997) 523–535
7. González, J., Rojas, I., Ortega, J., Pomares, H., Fernández, F.J., Díaz, A.F.: Multiobjective evolutionary optimization of the size, shape and position parameters of radial basis function networks for function approximation. IEEE Transactions on Neural Networks, vol. 14, no. 6 (2003)
8. Nowlan, S.: Maximum likelihood competitive learning. Proc. Neural Information Processing Systems (1990) 574–582
9. Rojas, I., Anguita, M., Ros, E., Pomares, H., Valenzuela, O., Prieto, A.: What are the main factors involved in the design of a Radial Basis Function Network? 6th European Symposium on Artificial Neural Networks, ESANN'98, April 22–24 (1998) 1–6