International Conference on Mathematical and Statistical Modeling in Honor of Enrique Castillo. June 28-30, 2006

A new initialization method for neural networks using sensitivity analysis

Bertha Guijarro-Berdiñas, Oscar Fontenla-Romero, Beatriz Pérez-Sánchez and Amparo Alonso-Betanzos*
Department of Computer Science, University of A Coruña
* Correspondence to: Bertha Guijarro-Berdiñas. Facultad de Informática, Campus de Elviña s/n, 15071, Universidad de A Coruña, Spain.

Abstract

Learning methods for feedforward neural networks find the network's optimal parameters through a gradient descent mechanism that starts from an initial state of the parameters. This initial state influences both the convergence speed and the error that is finally achieved. In this paper we present an initialization method for two-layer feedforward neural networks based on sensitivity analysis, which uses a linear procedure to obtain the weights of each layer. First, random values are assigned to the outputs of the first layer; these initial values are then updated using sensitivity formulas, and finally the weights are computed by solving a linear system of equations. The method has the advantage of reaching a good solution in just one epoch at a low computational cost. In this paper we explore the use of this method as an initialization procedure on several data sets and learning algorithms, comparing its performance with that of other well-known initialization methods.

Key Words: Supervised learning, sensitivity analysis, linear optimization, initialization method, least squares.

1 Introduction

Most learning methods for feedforward neural networks find the optimal parameters or weights of the network by means of a gradient descent mechanism (using first- and/or second-order information) that tries to minimize an error function. These methods start from an initial state of the parameters, which determines the starting point on the surface of the function being optimized. This initial state therefore significantly influences both the speed at which the optimum is reached (convergence speed) and the minimum that is finally attained (a local or a global minimum). Thus, several solutions for the appropriate initialization of the weights have been proposed. Nguyen and Widrow assign to each hidden processing element an approximate portion of the range of the desired response (Nguyen and Widrow (1990)), and Drago and Ridella use statistically controlled activation weight initialization, which aims to prevent neurons from saturating during the adaptation process by estimating the maximum value that the weights should initially take (Drago and Ridella (1992)). In addition, Ridella et al. (1997) proposed an analytical technique to initialize the weights of a multilayer perceptron with Vector Quantization (VQ) prototypes, given the equivalence between circular backpropagation networks and VQ classifiers.

Sensitivity analysis is a very useful technique for deriving how, and how much, the solution to a given problem depends on the data (see, for example, Castillo et al. (1997, 1999, 2000)). In this work, we show that sensitivity formulas can also be used to initialize the weights of a two-layer feedforward neural network. The proposed method is based on the use of the sensitivities of each layer's parameters with respect to its inputs and outputs, and on the use of independent systems of linear equations for each layer, to obtain the initial values of its parameters.
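Since the experiments in Section 3 compare against the Nguyen-Widrow method mentioned above, a commonly used formulation of that initialization is sketched here for reference. This is a minimal sketch, not code from the paper: the scale factor 0.7 K^{1/I}, the rescaling of each hidden unit's weight vector, the uniform bias range, and the assumption of inputs scaled to [-1, 1] follow the usual textbook presentation, and the function and variable names are illustrative.

```python
import numpy as np

def nguyen_widrow_init(n_inputs, n_hidden, seed=0):
    """A common formulation of Nguyen-Widrow initialization for one hidden layer,
    assuming inputs scaled to [-1, 1] and tanh-like activations."""
    rng = np.random.default_rng(seed)
    beta = 0.7 * n_hidden ** (1.0 / n_inputs)                 # scale factor
    W = rng.uniform(-0.5, 0.5, size=(n_hidden, n_inputs))     # small random weights
    W *= beta / np.linalg.norm(W, axis=1, keepdims=True)      # rescale so ||w_k|| = beta
    b = rng.uniform(-beta, beta, size=n_hidden)               # biases spread over the active region
    return W, b
```

The other baseline used later, plain random initialization, simply draws all weights from a small uniform or normal distribution.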
2 Proposed algorithm

Consider the two-layer feedforward neural network in Figure 1, where I is the number of inputs x_{is}, J the number of outputs y_{js}, K the number of hidden units with outputs z_{ks}, x_{0s} = 1, z_{0s} = 1, w are the weights, f the activation functions and S the number of data samples. The superscripts (1) and (2) are used to refer to the first and the second layer, respectively.

Figure 1: Two-layer feedforward neural network.

This network can be considered to be composed of two one-layer neural networks. Usually, the weights are updated using the mean squared error (MSE) as the cost or error function, which estimates the error of the system by comparing the actual and the desired outputs. In this work, assuming that the outputs z of the intermediate layer are known, a new cost function for this network is defined as (Castillo et al. (2006))

Q(z) = Q^{(1)}(z) + Q^{(2)}(z) = \sum_{s=1}^{S}\left[\sum_{k=1}^{K}\left(\sum_{i=0}^{I} w_{ki}^{(1)} x_{is} - f_k^{(1)^{-1}}(z_{ks})\right)^{2} + \sum_{j=1}^{J}\left(\sum_{k=0}^{K} w_{jk}^{(2)} z_{ks} - f_j^{(2)^{-1}}(y_{js})\right)^{2}\right].   (2.1)

This cost function is based on the sum of squared errors obtained, independently, by the hidden and output layers. In the proposed algorithm, the values of the z_{ks} are initially drawn from a uniform distribution on the interval [0.05, 0.95]. After that, the sensitivities of the cost function with respect to the z_{ks} are calculated as

\frac{\partial Q}{\partial z_{ks}} = \frac{\partial Q^{(1)}}{\partial z_{ks}} + \frac{\partial Q^{(2)}}{\partial z_{ks}} = -\frac{2\left(\sum_{i=0}^{I} w_{ki}^{(1)} x_{is} - f_k^{(1)^{-1}}(z_{ks})\right)}{f_k^{(1)\prime}(z_{ks})} + 2\sum_{j=1}^{J}\left(\sum_{r=0}^{K} w_{jr}^{(2)} z_{rs} - f_j^{(2)^{-1}}(y_{js})\right) w_{jk}^{(2)},   (2.2)

with k = 1, ..., K, since z_{0s} = 1 for all s. Next, the values of the intermediate outputs z are modified using the Taylor series approximation

Q(z + \Delta z) = Q(z) + \sum_{k=1}^{K}\sum_{s=1}^{S} \frac{\partial Q(z)}{\partial z_{ks}} \Delta z_{ks} \approx 0,   (2.3)

which leads to the increment

\Delta z = -\frac{Q(z)}{\|\nabla Q\|^{2}} \nabla Q,   (2.4)

which is used to update the z_{ks} values. Thus, using the outputs z_{ks} of the intermediate layer, we can learn the weights w_{ki}^{(1)} and w_{jk}^{(2)} of each layer independently with the learning method for one-layer neural networks proposed in Castillo et al. (2002). This method obtains the optimal weights by solving a linear system of equations, and it has two advantages: a) it always obtains the global optimum of the cost function for the given data, and b) it is a very fast procedure.

3 Experimental results

In this section, the performance of the proposed method is analyzed. Several experiments were carried out to compare this method with the Nguyen-Widrow initialization method, one of the most popular, and with a random initialization of the weights. The experiments covered two kinds of problems: classification and regression. For the classification problems, the Galaxy Dim and the Mushroom data sets were used; both were obtained from the Data Mining Institute (DMI) (http://www.cs.wisc.edu/dmi). For the regression problems, the Henon and Ikeda time series were employed. For every experiment, 30 simulations with different initial states were run.
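To make the initialization procedure of Section 2 concrete before turning to the individual experiments, the following Python sketch implements one pass of Equations (2.1)-(2.4) for a network with logistic activations in both layers. It is a minimal reading of the paper, not the authors' implementation: the choice of the logistic activation, the interpretation of f_k^{(1)'}(z_{ks}) as the derivative evaluated at the pre-image of z_{ks} (which equals z(1-z) for the logistic), the clipping that keeps z inside the domain of f^{-1}, and all names are assumptions of this sketch.

```python
import numpy as np

def logistic_inv(z):
    # f^{-1} for the logistic activation f(a) = 1 / (1 + exp(-a))
    return np.log(z / (1.0 - z))

def solve_layer(inputs, targets):
    """Least-squares weights of one linear layer so that W @ inputs ~= targets.
    inputs: (n_in + 1, S) with a leading row of ones for the bias; targets: (n_out, S)."""
    W, *_ = np.linalg.lstsq(inputs.T, targets.T, rcond=None)
    return W.T

def sensitivity_init(X, Y, K, seed=0):
    """Sensitivity-based initialization sketch for a two-layer feedforward network.
    X: (I, S) inputs, Y: (J, S) desired outputs in (0, 1), K: number of hidden units."""
    rng = np.random.default_rng(seed)
    I, S = X.shape
    Xb = np.vstack([np.ones((1, S)), X])                 # prepend x_{0s} = 1

    # Step 1: random intermediate outputs z_{ks} ~ U[0.05, 0.95]
    Z = rng.uniform(0.05, 0.95, size=(K, S))

    # Step 2: solve the two independent linear systems for the current z (Eq. 2.1)
    W1 = solve_layer(Xb, logistic_inv(Z))                # (K, I + 1)
    Zb = np.vstack([np.ones((1, S)), Z])                 # prepend z_{0s} = 1
    W2 = solve_layer(Zb, logistic_inv(Y))                # (J, K + 1)

    # Step 3: sensitivities dQ/dz_{ks} (Eq. 2.2)
    e1 = W1 @ Xb - logistic_inv(Z)                       # first-layer residuals
    e2 = W2 @ Zb - logistic_inv(Y)                       # second-layer residuals
    dfz = Z * (1.0 - Z)                                  # logistic derivative at f^{-1}(z)
    grad = -2.0 * e1 / dfz + 2.0 * (W2[:, 1:].T @ e2)    # bias column of W2 excluded

    # Step 4: Taylor-series update of z (Eqs. 2.3-2.4)
    Q = np.sum(e1 ** 2) + np.sum(e2 ** 2)
    Z = Z - (Q / np.sum(grad ** 2)) * grad
    Z = np.clip(Z, 0.05, 0.95)                           # sketch safeguard: keep f^{-1}(z) defined

    # Step 5: recompute the initial weights from the updated z
    W1 = solve_layer(Xb, logistic_inv(Z))
    W2 = solve_layer(np.vstack([np.ones((1, S)), Z]), logistic_inv(Y))
    return W1, W2
```

The returned weight matrices would then serve as the starting point for a conventional learning algorithm, such as the Stochastic Backpropagation and Scaled Conjugate Gradient methods used in the experiments below.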
Moreover, the Scaled Conjugate Gradient (Moller (1993)) and the Stochastic Backpropagation (Bishop (1995)) methods were used to learn from the given initial values: the former for the regression problems and the latter for the classification tasks, as they are among the most recommended methods for each kind of problem.

3.1 Galaxy Dim data set

In this two-class problem, a training set of 2000 instances and a test set of 1000 examples were used. The network topology includes 100 hidden neurons, 14 inputs and 2 binary outputs. Figure 2 contains the mean learning curves achieved by Stochastic Backpropagation on this data set using the different initialization methods (random, Nguyen-Widrow and the proposed one). It shows the accuracy on the training (solid lines) and test (dashed lines) sets at every iteration of the learning process. As can be observed, with the presented method the learning algorithm shows better convergence speed (the curves rise faster) and also reaches a higher accuracy. Moreover, in this experiment the Nguyen-Widrow initialization seems to get the learning method stuck in a local minimum, and it achieves worse results than even the random initialization.

Figure 2: Learning curves for the Galaxy data set.

3.2 Mushroom data set

This is also a two-class problem, with a training set of 5000 instances and a test set of 2000 examples. The network topology has 100 hidden neurons, 22 inputs and 2 binary outputs. Figure 3 shows the mean learning curves obtained by the learning algorithm using the different initialization methods. As in the previous case, it contains the accuracy on the training and test sets at each iteration. As can be noticed, the Nguyen-Widrow initialization method gives the learning algorithm the best convergence speed. Nevertheless, the proposed method also performs well, much better than the random initialization of the weights.

Figure 3: Learning curves for the mushroom data set.

3.3 Henon time series

This is the first regression problem used in these experiments. The training set contains 5000 data points and the test set 30000 examples. The network topology has 10 hidden neurons, 7 inputs and 1 continuous output. Figure 4 shows the mean learning curves obtained by the learning algorithm using the aforementioned initialization methods. In this case, the curves correspond to the mean squared error (MSE) on the training (solid lines) and test (dashed lines) sets at each iteration; note that both axes of the figure are on a logarithmic scale. The differences between the training and test curves are almost imperceptible. In this data set the differences are not visually significant, but at the last iteration the MSEs obtained on the test set are 3.97 × 10^{-5}, 3.31 × 10^{-5} and 1.37 × 10^{-5} for the random, Nguyen-Widrow and proposed methods, respectively.

Figure 4: Learning curves for the Henon time series.

3.4 Ikeda time series

In this second regression problem, a training set of 5000 data points and a test set of 30000 examples were used. The network topology has 10 hidden neurons, 12 inputs and 1 continuous output.
As in the previous example, Figure 5 shows the evolution of the mean squared error during the learning process for both the training and test data sets. The behavior is similar to that of the Henon time series, the final MSEs being 2.40 × 10^{-2}, 2.04 × 10^{-2} and 1.49 × 10^{-2} for the random, Nguyen-Widrow and proposed methods, respectively.

Figure 5: Learning curves for the Ikeda time series.

4 Conclusions

In this work a new initialization method for two-layer feedforward neural networks has been proposed. The main conclusions that can be drawn are:

• In the experiments carried out, the proposed method significantly improves the performance of a learning algorithm compared with a random initialization, both in convergence speed and in final error.

• The differences are not so remarkable when the presented method is compared with the Nguyen-Widrow algorithm. However, in three of the four experiments the sensitivity-based method obtains a lower error in fewer iterations.

• The sensitivities of the sum of squared errors with respect to the outputs of the intermediate layer allow an efficient initialization method to be applied.

• Finally, as the method is based on solving linear systems of equations, it offers an interesting combination of speed and simplicity.

5 Acknowledgements

This work has been partially funded by project TIC2003-00600 of the Ministerio de Ciencia y Tecnología, Spain (partially supported by FEDER funds), and project PGIDT04PXIC10502PN of the Xunta de Galicia.

References

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York.

Castillo, E., Cobo, A., Gutiérrez, J. M., and Pruneda, R. E. (1999). Working with differential, functional and difference equations using functional networks. Applied Mathematical Modelling, 23(2):89–107.

Castillo, E., Cobo, A., Gutiérrez, J. M., and Pruneda, R. E. (2000). Functional networks: A new neural network based methodology. Computer-Aided Civil and Infrastructure Engineering, 15(2):90–106.

Castillo, E., Fontenla-Romero, O., Alonso-Betanzos, A., and Guijarro-Berdiñas, B. (2002). A global optimum approach for one-layer neural networks. Neural Computation, 14(6):1429–1449.

Castillo, E., Guijarro-Berdiñas, B., Fontenla-Romero, O., and Alonso-Betanzos, A. (2006). A very fast learning method for neural networks based on sensitivity analysis. Journal of Machine Learning Research, in press.

Castillo, E., Gutiérrez, J. M., and Hadi, A. (1997). Sensitivity analysis in discrete Bayesian networks. IEEE Transactions on Systems, Man and Cybernetics, 26(7):412–423.

Drago, G. P. and Ridella, S. (1992). Statistically controlled activation weight initialization (SCAWI). IEEE Transactions on Neural Networks, 3:899–905.

Moller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6:525–533.
Nguyen, D. and Widrow, B. (1990). Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights. Proceedings of the International Joint Conference on Neural Networks, 3:21–26.

Ridella, S., Rovetta, S., and Zunino, R. (1997). Circular backpropagation networks for classification. IEEE Transactions on Neural Networks, 8(1):84–97.