International Conference on Mathematical and Statistical Modeling
in Honor of Enrique Castillo. June 28-30, 2006
A new initialization method for neural networks
using sensitivity analysis
Bertha Guijarro-Berdiñas, Oscar Fontenla-Romero,
Beatriz Pérez-Sánchez and Amparo Alonso-Betanzos∗
Department of Computer Science
University of A Coruña.
∗ Correspondence to: Bertha Guijarro-Berdiñas. Facultad de Informatica. Campus de Elviña s/n, 15071. Universidad de A Coruña. SPAIN
Abstract
The learning methods for feedforward neural networks find the network's optimal parameters through a gradient descent mechanism that starts from an initial state of the parameters. This initial state influences both the convergence speed and the error that is finally achieved. In this paper, we present a sensitivity-analysis-based initialization method for two-layer feedforward neural networks, which uses a linear procedure to obtain the weights of each layer. First, random values are assigned to the outputs of the first layer; next, these initial values are updated based on sensitivity formulas; and finally the weights are calculated by solving a linear system of equations. This new method has the advantage of achieving a good solution in just one epoch while requiring little computational time. In this paper, we explore the use of this method as an initialization procedure on several data sets and with several learning algorithms, and compare its performance with that of other well-known initialization methods.
Key Words: Supervised learning, Sensitivity analysis, linear optimization, initialization method, least squares.
1 Introduction
Most of the learning methods for feedforward neural networks find the optimal parameters or weights of the network through a gradient descent mechanism (using first- and/or second-order information) that tries to minimize an error function. These methods start from an initial state of the parameters, which determines the starting point on the surface of the function being optimized. In this way, this initial state significantly influences both the speed with which the optimum is reached (convergence speed) and the minimum that is finally attained (a local or a global minimum). Thus, several solutions for the appropriate initialization of the weights have been proposed.
Nguyen and Widrow assign each hidden processing element an approximate portion of the range of the desired response (Nguyen and Widrow (1990)), and Drago and Ridella use statistically controlled activation weight initialization, which aims to prevent neurons from saturating during the adaptation process by estimating the maximum value that the weights should initially take (Drago and Ridella (1992)). Also, in Ridella et al. (1997), an analytical technique was proposed to initialize the weights of a multilayer perceptron with Vector Quantization (VQ) prototypes, given the equivalence between circular backpropagation networks and VQ classifiers. Sensitivity analysis is a very useful technique for deriving how, and how much, the solution to a given problem depends on the data (see, for example, Castillo et al. (1997, 1999, 2000)). In this work, we show that sensitivity formulas can also be used to initialize the weights of a two-layer feedforward neural network. The proposed method is based on the sensitivities of each layer's parameters with respect to its inputs and outputs, and on an independent system of linear equations for each layer, which is solved to obtain the initial values of its parameters.
2 Proposed algorithm
Consider the two-layer feedforward neural network in Figure 1, where I is the number of inputs x_{is}, J the number of outputs y_{js}, K the number of hidden units with outputs z_{ks}, x_{0s} = 1, z_{0s} = 1, w are the weights, f the activation function, and S the number of data samples. The superscripts (1) and (2) are used to refer to the first and the second layer.
Figure 1: Two-layer feedforward neural network (inputs x_{is}, hidden-layer outputs z_{ks}, outputs y_{js}, weights w_{ki}^{(1)} and w_{jk}^{(2)}).
This network can be considered to be composed of two one-layer neural networks. Usually, the weights are updated using the mean squared error (MSE) as the cost or error function, which estimates the error of the system by comparing the actual and the desired outputs. In this work, assuming that the outputs z of the intermediate layer are known, a new cost function for this network is defined as (Castillo et al. (2006))
Q(z) = Q^{(1)}(z) + Q^{(2)}(z) = \sum_{s=1}^{S} \left[ \sum_{k=1}^{K} \left( \sum_{i=0}^{I} w_{ki}^{(1)} x_{is} - (f_k^{(1)})^{-1}(z_{ks}) \right)^{2} + \sum_{j=1}^{J} \left( \sum_{k=0}^{K} w_{jk}^{(2)} z_{ks} - (f_j^{(2)})^{-1}(y_{js}) \right)^{2} \right].    (2.1)
This cost function is based on the sum of squared errors obtained, independently, by the hidden and output layers.
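To make the cost function concrete, the following sketch evaluates Q(z) of equation (2.1) for a given assignment of the intermediate outputs. It is only an illustrative reading of the formula: the NumPy data layout, the function names, and the choice of a logistic activation (whose inverse is needed here and whose derivative is needed below) are assumptions, not part of the original text.

import numpy as np

# Assumed logistic activation; the method only requires an invertible,
# differentiable f, so this particular choice is an illustration.
def f(u):
    return 1.0 / (1.0 + np.exp(-u))

def f_inv(z):
    return np.log(z / (1.0 - z))

def cost_Q(W1, W2, X, Y, Z):
    """Evaluate Q(z) = Q^(1)(z) + Q^(2)(z) as in Eq. (2.1).

    X: (S, I+1) inputs including the bias column x_0s = 1
    Z: (S, K)   intermediate outputs (the bias z_0s = 1 is added here)
    Y: (S, J)   desired outputs, assumed to lie in the range of f
    W1: (K, I+1) first-layer weights, W2: (J, K+1) second-layer weights
    """
    Zb = np.hstack([np.ones((Z.shape[0], 1)), Z])   # prepend z_0s = 1
    Q1 = np.sum((X @ W1.T - f_inv(Z)) ** 2)         # hidden-layer squared error
    Q2 = np.sum((Zb @ W2.T - f_inv(Y)) ** 2)        # output-layer squared error
    return Q1 + Q2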
In the proposed algorithm, the values of the z_{ks} are initially drawn from a uniform distribution on the interval [0.05, 0.95]. After that, the sensitivities of the cost function with respect to the z_{ks} are calculated as:
\frac{\partial Q}{\partial z_{ks}} = \frac{\partial Q^{(1)}}{\partial z_{ks}} + \frac{\partial Q^{(2)}}{\partial z_{ks}} = -\frac{2 \left( \sum_{i=0}^{I} w_{ki}^{(1)} x_{is} - (f_k^{(1)})^{-1}(z_{ks}) \right)}{(f_k^{(1)})'(z_{ks})} + 2 \sum_{j=1}^{J} \left( \sum_{r=0}^{K} w_{jr}^{(2)} z_{rs} - (f_j^{(2)})^{-1}(y_{js}) \right) w_{jk}^{(2)},    (2.2)
with k = 1, \ldots, K, since z_{0s} = 1, \forall s.
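As an illustration of equation (2.2), the sketch below computes the matrix of sensitivities ∂Q/∂z_{ks} for all samples at once. It reuses the hypothetical f_inv and data layout of the previous sketch, and the factor z(1 - z) in the denominator is the derivative f' expressed in terms of its output, which is specific to the assumed logistic activation.

def sensitivities(W1, W2, X, Y, Z):
    """Return the (S, K) matrix of dQ/dz_ks following Eq. (2.2)."""
    Zb = np.hstack([np.ones((Z.shape[0], 1)), Z])
    # Hidden-layer term: -2 (sum_i w_ki x_is - f^{-1}(z_ks)) / f'(z_ks),
    # where, for the assumed logistic f, the derivative at output z is z (1 - z).
    term1 = -2.0 * (X @ W1.T - f_inv(Z)) / (Z * (1.0 - Z))
    # Output-layer term: 2 sum_j (sum_r w_jr z_rs - f^{-1}(y_js)) w_jk;
    # the bias column of W2 is dropped because z_0s is constant.
    term2 = 2.0 * (Zb @ W2.T - f_inv(Y)) @ W2[:, 1:]
    return term1 + term2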
Next, the values of the intermediate outputs z are modified using the
Taylor series approximation:
Q(z + \Delta z) = Q(z) + \sum_{k=1}^{K} \sum_{s=1}^{S} \frac{\partial Q(z)}{\partial z_{ks}} \Delta z_{ks} \approx 0,    (2.3)
which, taking the increment \Delta z in the direction of the negative gradient and solving the resulting condition for its magnitude, leads to
\Delta z = -\frac{Q(z)}{\| \nabla Q \|^{2}} \nabla Q,    (2.4)
which is used to update the z_{ks} values. Thus, using the outputs z_{ks} of the intermediate layer, we can learn the weights w_{ki}^{(1)} and w_{jk}^{(2)} of each layer independently, applying the learning method for one-layer neural networks proposed in Castillo et al. (2002). This method obtains the optimal weights by solving a linear system of equations, and it has two advantages: a) it always obtains the global optimum of the cost function for the given data, and b) it is a very fast procedure.
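Putting the pieces together, one possible reading of the whole initialization procedure is sketched below: draw the z_{ks} at random, solve each layer by linear least squares, update z with the increment of equation (2.4), and solve both layers again. The call to numpy.linalg.lstsq stands in for the one-layer method of Castillo et al. (2002), and the clipping of z, like all names used here, is an added assumption (to keep the inverse activation defined), not part of the original formulation.

def initialize_weights(X, Y, K, rng=np.random.default_rng(0)):
    """Sensitivity-based initialization sketch for a two-layer network.

    X: (S, I+1) inputs with bias column, Y: (S, J) desired outputs in (0, 1).
    Returns initial weight matrices W1 (K, I+1) and W2 (J, K+1).
    """
    S = X.shape[0]
    # Step 1: random intermediate outputs on the interval [0.05, 0.95].
    Z = rng.uniform(0.05, 0.95, size=(S, K))

    def solve_layers(Z):
        # Each layer reduces to a linear least-squares problem on the
        # inverted targets (stand-in for the one-layer method of
        # Castillo et al. (2002), which yields the global optimum).
        W1 = np.linalg.lstsq(X, f_inv(Z), rcond=None)[0].T
        Zb = np.hstack([np.ones((S, 1)), Z])
        W2 = np.linalg.lstsq(Zb, f_inv(Y), rcond=None)[0].T
        return W1, W2

    W1, W2 = solve_layers(Z)
    # Step 2: move z along -grad Q so that Q(z + dz) is approximately zero,
    # i.e. dz = -Q(z) grad Q / ||grad Q||^2 as in Eqs. (2.3)-(2.4).
    G = sensitivities(W1, W2, X, Y, Z)
    Z = Z - cost_Q(W1, W2, X, Y, Z) * G / np.sum(G ** 2)
    Z = np.clip(Z, 0.05, 0.95)   # added safeguard, not in the original method
    # Step 3: recompute both layers from the updated intermediate outputs.
    return solve_layers(Z)

Under these assumptions, a call such as initialize_weights(X, Y, K=100) would provide the starting weights that are then handed to a conventional learning algorithm, as is done with Scaled Conjugate Gradient and Stochastic Backpropagation in the experiments of the next section.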
3 Experimental results
In this section, the performance of the proposed method is analyzed. Several experiments were carried out in order to compare this method with the Nguyen-Widrow initialization method, one of the most popular, and with a random initialization of the weights. The experiments covered two kinds of problems: classification and regression. For the classification problems, the Galaxy Dim and Mushroom data sets were used; these were obtained from the Data Mining Institute (DMI) (http://www.cs.wisc.edu/dmi). For the regression problems, the Henon and Ikeda time series were employed. For every experiment, 30 different simulations, using different initial states, were run. Moreover, the Scaled Conjugate Gradient (Moller (1993)) and the Stochastic Backpropagation (Bishop (1995)) methods were used to learn from the given initial values. The former was used for the regression problems and the latter for the classification tasks, as they are among the most recommended methods for each kind of problem.
3.1 Galaxy Dim data set
In this two-class problem, a training set of 2000 instances and a test set
of 1000 examples were used. The network topology includes 100 hidden
neurons, 14 inputs and 2 binary outputs.
Figure 2 contains the mean learning curves achieved by Stochastic Backpropagation for this data set using the different initialization methods (random, Nguyen-Widrow and the proposed one).
Figure 2: Learning curves for the Galaxy data set (accuracy (%) versus iteration for the random, Nguyen-Widrow and proposed initializations).
It includes the accuracy for the training (solid lines) and test (dashed lines) data sets at every iteration of the learning process. As can be observed, the learning algorithm presents a better convergence speed (the curves rise faster) and also a higher accuracy with the presented method. Moreover, in this experiment the Nguyen-Widrow initialization seems to trap the learning method in a local minimum, achieving worse results even than the random initialization.
3.2 Mushroom data set
This is also a two-class problem, with a training set of 5000 instances and a
test set of 2000 examples. The network topology has 100 hidden neurons,
22 inputs and 2 binary outputs.
Figure 3 shows the mean learning curves obtained for the learning algorithm using the different initialization methods. As in the previous case, it contains the accuracy for the training and test data sets at each iteration. As can be noticed, the Nguyen-Widrow initialization method allows the learning algorithm to obtain the best convergence speed. Nevertheless, the proposed method also presents a good performance, which is much better than that of the random initialization of the weights.
Figure 3: Learning curves for the Mushroom data set (accuracy (%) versus iteration for the random, Nguyen-Widrow and proposed initializations).
3.3 Henon time series
This is the first regression problem used in these experiments. The training
set contains 5000 data points and the test set 30000 examples. The network
topology has 10 hidden neurons, 7 inputs and 1 continuous output.
Figure 4 shows the mean learning curves obtained for the learning algorithm using the mentioned initialization methods. In this case, the curves correspond to the mean squared error (MSE) for the training (solid lines) and test (dashed lines) data sets at each iteration. As can be noticed, the differences between the training and test curves are almost imperceptible. It is important to remark that in this case the axes of the figure are on a logarithmic scale. For this data set the differences are not visually significant, but at the last iteration the MSEs obtained for the test set are 3.97 × 10^{-5}, 3.31 × 10^{-5} and 1.37 × 10^{-5} for the random, the Nguyen-Widrow and the proposed method, respectively.
3.4 Ikeda time series
In this second regression problem, a training set of 5000 data points and a test set of 30000 examples were used. The network topology has 10 hidden neurons, 12 inputs and 1 continuous output. As in the previous example, Figure 5 shows the evolution of the mean squared error during the learning process for both the training and test data sets.
Figure 4: Learning curves for the Henon time series (mean squared error versus iteration, logarithmic scale).
Figure 5: Learning curves for the Ikeda time series (mean squared error versus iteration, logarithmic scale).
In this example, the behavior is similar to that of the Henon time series, the final MSEs being 2.40 × 10^{-2}, 2.04 × 10^{-2} and 1.49 × 10^{-2} for the random, the Nguyen-Widrow and the proposed method, respectively.
4 Conclusions
In this work, a new initialization method for two-layer feedforward neural networks has been proposed. The main conclusions that can be drawn are:
• Over the experiments made, when the proposed method is compared with a random initialization, it significantly improves the performance of the learning algorithm, both in convergence speed and in final error.
• The differences are not so remarkable when the presented method is compared with the Nguyen-Widrow algorithm. However, in three of the four experiments the sensitivity-based method allows a lower error to be obtained in fewer iterations.
• The sensitivities of the sum of squared errors with respect to the outputs of the intermediate layer allow an efficient initialization method to be applied.
• Finally, as the method is based on solving a linear system of equations, it offers an interesting combination of speed and simplicity.
5 Acknowledgements
This work has been partially funded by the project TIC2003-00600 of the
Ministerio de Ciencia y Tecnología, Spain (partially supported by FEDER
funds) and the project PGIDT04PXIC10502PN of the Xunta de Galicia.
References
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford
University Press, New York.
Castillo, E., Cobo, A., Gutiérrez, J. M., and Pruneda, R. E.
(1999). Working with differential, functional and difference equations
using functional networks. Applied Mathematical Modelling, 23(2):89–
107.
Castillo, E., Cobo, A., Gutiérrez, J. M., and Pruneda, R. E. (2000). Functional networks: a new neural network based methodology. Computer-Aided Civil and Infrastructure Engineering, 15(2):90–106.
Castillo, E., Fontenla-Romero, O., Alonso-Betanzos, A., and Guijarro-Berdiñas, B. (2002). A global optimum approach for one-layer neural networks. Neural Computation, 14(6):1429–1449.
Castillo, E., Guijarro-Berdiñas, B., Fontenla-Romero, O., and
Alonso-Betanzos, A. (2006). A very fast learning method for neural
networks based on sensitivity analysis. Journal of Machine Learning
Research, (in press).
Castillo, E., Gutiérrez, J. M., and Hadi, A. (1997). Sensitivity analysis in discrete Bayesian networks. IEEE Transactions on Systems, Man and Cybernetics, 26(7):412–423.
Drago, G. P. and Ridella, S. (1992). Statistically controlled activation
weight initialization (SCAWI). IEEE Transactions on Neural Networks,
3:899–905.
Moller, M. F. (1993). A scaled conjugate gradient algorithm for fast
supervised learning. Neural Networks, 6:525–533.
Nguyen, D. and Widrow, B. (1990). Improving the learning speed of
2-layer neural networks by choosing initial values of the adaptive weights.
Proceedings of the International Joint Conference on Neural Networks,
3:21–26.
Ridella, S., Rovetta, S., and Zunino, R. (1997). Circular backpropagation networks for classification. IEEE Transactions on Neural Networks, 8(1):84–97.