Using Chaos to Characterize and Train Neural Networks
Oguz Yetkin*
J.C. Sprott†
Department of Physics, University of Wisconsin
Madison, WI 53706 USA
Abstract
A neural network becomes a dynamical system when the outputs generated by the
network are used as inputs for the next time step. In this way, the network can be used to
generate time series. Due to the nonlinearity of the neurons, these networks are also
capable of chaotic behavior (sensitive dependence on initial conditions). We present
ways to characterize the degree of chaos in a given network, demonstrate that it is
influenced greatly by network size, and use this information to aid in training. Our
results indicate that training by simulated annealing can be improved by rejecting highly
chaotic solutions encountered by the training algorithm, but is hindered if all chaotic
solutions are rejected.
* [email protected]
[email protected], to whom all correspondence should be addressed
1. Introduction
Since their invention, artificial neural networks have been important in artificial
intelligence, engineering applications, and in helping biologists model the nervous
system. These systems are capable of exhibiting chaotic behavior—this has immediate
consequences for engineers attempting to design systems which must behave predictably
given a specified range of inputs, and perhaps more subtle ones for biologists. An
important class of applications affected by the sensitivity to initial conditions that chaotic
behavior implies involves time series prediction. Neural networks can be trained to
replicate time series without prior knowledge of the system that produced the dynamical
behavior. For example, such networks have been used to predict future blood glucose
levels given current levels of glucose and insulin. Neural networks can also reproduce
the dynamics of chaotic systems such as the Hénon map (Hénon, 1976).
It is therefore important to determine what percent of networks with a given
architecture and distribution of weights can be expected to exhibit chaos. Previous
studies have investigated the frequency of chaos in feedforward networks and determined
that large networks are almost 100% chaotic (Albers et al., 1996). While feedforward
nets have important applications in machine learning and engineering, recurrent neural
networks more closely resemble the architecture of certain structures such as the brain,
and are capable of representing arbitrary nonlinear dynamical systems (Seidl and Lorenz,
1991; Hornik et al., 1990).
A recurrent neural network can be made into a dynamical system by using the output
vector produced by one iteration as the input vector for the next one—the value of this
vector changes as time is advanced. The sensitivity of the network to initial conditions
can be quantified by numerically calculating the largest Lyapunov exponent (LE)
(Frederickson et al., 1983). In addition to estimating the fraction of randomly connected
networks that can be expected to exhibit chaos, we have also discovered certain LE
constraints that can be imposed upon the training algorithm to improve performance. Of
specific practical interest is the finding that a small amount of chaos appears to be
necessary for the algorithm.
2.1 Network Architecture
We used recurrent, fully connected neural networks of various sizes to generate time
series. The largest Lyapunov exponent of each network was then calculated.
The recurrent neural network is defined as follows:
Let W be an NxM matrix of connection weights between neurons (initially assigned
Gaussian random numbers with zero mean and unit variance) where N is the number of
neurons and M is the number of inputs to each neuron. For a fully connected network
(and for all the networks in this paper), M equals N. We can make this simplification
because an N x N square matrix can be used to represent any artificial neural network of
this type containing N neurons.
Let x be an NxN matrix representing N outputs over N time steps,
such that xi,n is the value of the ith neuron at time step n (while this matrix could contain
any number of time steps, we have chosen N steps in order to arrive at a unique solution).
Let s be a real, positive scalar (usually set to 1).
The output of a neuron i at time n+1 (i.e., xi,n+1) is dependent on the values of all neurons
at time n and the weight matrix W in the following manner:
x_{i,n+1} = tanh( (s/√M) Σ_{j=1}^{M} W_{i,j} x_{j,n} )        (1)
The weights are normalized by the square root of M so that the mean square input to the
neurons is roughly independent of M.
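As an illustration, the update rule of equation (1) can be sketched in a few lines of Python (a minimal sketch using NumPy; the function names `step` and `iterate` are ours, not from the paper):

```python
import numpy as np

# Minimal sketch of equation (1); the names `step` and `iterate` are illustrative.
def step(W, x, s=1.0):
    """One time step: x_{i,n+1} = tanh((s/sqrt(M)) * sum_j W_ij * x_{j,n})."""
    M = W.shape[1]
    return np.tanh(s * (W @ x) / np.sqrt(M))

def iterate(W, x0, n_steps, s=1.0):
    """Feed each output vector back in as the next input to build a time series."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(step(W, xs[-1], s))
    return np.array(xs)

rng = np.random.default_rng(0)
N = 16
W = rng.normal(0.0, 1.0, size=(N, N))  # Gaussian weights: zero mean, unit variance
x0 = rng.normal(0.0, 1.0, size=N)
series = iterate(W, x0, n_steps=N)
print(series.shape)  # prints (17, 16): the initial state plus N iterates
```

Feeding the output back as the next input is what turns the static network into the dynamical system studied below.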
2.2 Quantification of Chaos in the Network
It is possible to determine whether the aforementioned system is chaotic by calculating its
largest Lyapunov exponent (Frederickson et al., 1983). The LE is a measure of sensitivity
to initial conditions; a positive LE implies chaotic behavior.
The LE for a network represented by an NxN weight matrix W was calculated
numerically as follows:
Let X be an N dimensional vector representing the outputs of the network in a given time
step.
Every element of X was assigned a random initial value (Gaussian distribution with zero
mean and unit variance).
A copy, Xc, of the vector X representing the recurrent network's outputs was generated,
with every component perturbed by a small positive number such that the Euclidean
distance between X and Xc equals d0. At each time step, equation (1) was solved twice to
obtain new values for X and Xc, and the distance d1 between the new values was
calculated. Xc was then readjusted to lie in the direction of the vector Xc - X but at
separation d0, by adjusting each component Xc_j of Xc according to:
Xc_j = X_j + (d0/d1)(Xc_j - X_j)        (2)
The average of log|d1/d0| over many iterations gives the largest Lyapunov exponent.
We used 200 iterations (sufficient for two-digit accuracy) and a d0 of 10^-8.
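The procedure above can be sketched as follows (a minimal implementation under the stated d0 and iteration count; the function name `lyapunov_exponent` and the choice of an equal perturbation in every component are our own assumptions):

```python
import numpy as np

# Sketch of the largest-LE estimate described above. The equal per-component
# initial perturbation and the function name are illustrative assumptions.
def lyapunov_exponent(W, s=1.0, n_iter=200, d0=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    N, M = W.shape
    x = rng.normal(size=N)        # random initial state
    xc = x + d0 / np.sqrt(N)      # perturbed copy with |xc - x| = d0
    total = 0.0
    for _ in range(n_iter):
        x = np.tanh(s * (W @ x) / np.sqrt(M))
        xc = np.tanh(s * (W @ xc) / np.sqrt(M))
        d1 = np.linalg.norm(xc - x)
        total += np.log(d1 / d0)
        # Equation (2): pull xc back to distance d0 along the separation vector.
        xc = x + (d0 / d1) * (xc - x)
    return total / n_iter

rng = np.random.default_rng(1)
print(lyapunov_exponent(rng.normal(size=(32, 32))))
```

A positive result indicates exponential divergence of nearby trajectories, i.e., chaos.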
2.3 Probability of Chaos
Albers et al. (1996) showed that untrained feedforward networks with randomly
chosen weights have a probability of chaos that approaches unity as the number of
weights and connections is increased. We have extended this work by testing the
probability of chaos in fully connected recurrent networks.
The effect of the scaling parameter s and the network size N on the frequency of
chaos and the average Lyapunov exponent was determined using 30 networks with
different, randomly chosen weights. Figure 1 shows the effect of the scaling parameter.
Figure 1: Percent chaos and average LE vs. s. (a) N = 16; (b) N = 32. Note that the larger
network is 100% chaotic for most s values.
For large values of N, the networks are almost always chaotic for a wide range of s
values. This is similar to the behavior of feedforward networks, except that the transition
to 100% chaos with respect to the parameters s and N is more abrupt for recurrent
networks. The decrease in percent chaos as s gets larger (observed in figure 1) is due to
the saturation of the neurons (all outputs are driven to either 1 or –1).
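The experiment can be sketched as follows (reusing the LE estimate of section 2.2; `percent_chaos` and its defaults are our own names and choices):

```python
import numpy as np

def largest_le(W, s=1.0, n_iter=200, d0=1e-8, rng=None):
    # Largest-LE estimate as in section 2.2.
    if rng is None:
        rng = np.random.default_rng(0)
    N = W.shape[0]
    x = rng.normal(size=N)
    xc = x + d0 / np.sqrt(N)
    total = 0.0
    for _ in range(n_iter):
        x = np.tanh(s * (W @ x) / np.sqrt(N))
        xc = np.tanh(s * (W @ xc) / np.sqrt(N))
        d1 = np.linalg.norm(xc - x)
        total += np.log(d1 / d0)
        xc = x + (d0 / d1) * (xc - x)
    return total / n_iter

def percent_chaos(N, s=1.0, n_nets=30, seed=0):
    """Fraction of randomly weighted N-neuron networks with a positive LE."""
    rng = np.random.default_rng(seed)
    les = np.array([largest_le(rng.normal(size=(N, N)), s=s, rng=rng)
                    for _ in range(n_nets)])
    return 100.0 * np.mean(les > 0.0), les.mean()

pct, mean_le = percent_chaos(16, s=1.0)
```

Sweeping s and N with this routine reproduces the kind of curves shown in Figure 1.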
These findings indicate that, for a large recurrent network, chaos is the rule rather than
the exception. This has consequences for time series prediction as well as training.
While the training algorithm alters the LE of an individual network, the average LE of a
large number of networks that are trained agrees with the values given in figure 1, and is
not affected by the nature of the training set.
2.4 Effect of Chaos on Training
To determine the relationship between training and chaos, we used a variant of the
simulated annealing algorithm to train networks and measured the LE of the network
during training.
2.4.1 Implementation of the Training Algorithm
Simulated annealing was implemented as follows:
All weights of the connections between neurons (in the matrix W) were initialized
to zero. These weights were then perturbed by adding a Gaussian random number with
zero mean and variance T (initially chosen to be 1), divided by the square root
of M, the number of connections per neuron. An error term, e, was initialized to a very
high value. The value T can be thought of as an effective temperature.
The training set for periodic time series consisted of N identical sine waves (one per
neuron) sampled at N time points with different initial phases. For a sample chaotic time
series, we used values generated by N logistic maps in a similar fashion. The logistic map
is a very simple chaotic system and is given by Xn+1=4Xn(1-Xn) (May, 1976).
Using a sine wave in this fashion (one per neuron) results in a training set that is an NxN
matrix, y, with each column representing a sine wave. To calculate the error in training,
the time series generated by the neurons in the system are also placed in an NxN matrix,
x. The RMS distance between these matrices yields the error term e and can be
calculated as:
e = [ (1/N²) Σ_{i=1}^{N} Σ_{n=1}^{N} (y_{i,n} - x_{i,n})² ]^{1/2}        (3)
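The training-set construction and the error of equation (3) might be sketched as follows (the phase spacing of the sine waves is an assumption not specified in the text, and the function names are ours):

```python
import numpy as np

def sine_training_set(N):
    """N sine waves sampled at N points, one column per neuron.

    The phase spacing (2*pi*i/N for column i) is an assumed choice.
    """
    n = np.arange(N)[:, None]
    phases = (2.0 * np.pi * np.arange(N) / N)[None, :]
    return np.sin(2.0 * np.pi * n / N + phases)

def rms_error(y, x):
    """Equation (3): e = [ (1/N^2) * sum_{i,n} (y_{i,n} - x_{i,n})^2 ]^(1/2)."""
    N = y.shape[0]
    return np.sqrt(np.sum((y - x) ** 2)) / N
```

A chaotic training set is built the same way, filling each column with logistic-map iterates instead of sine values.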
At each iteration, the weight matrix W of the network is perturbed by Gaussian random
numbers with variance T/M.
The network is then allowed to generate a time series consisting of NxN points (N time
steps for each of the N neurons), and the error is compared to the best error found so far.
If the new error is smaller, it is assigned to e, the new weights are kept, and T is doubled
(if it is less than 1). Otherwise, T is decreased by a cooldown factor and the procedure is
repeated (we used a cooldown factor of 0.995).
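The annealing loop described above can be sketched as follows (a minimal version; the initial condition for the generated series and the function names are our own assumptions):

```python
import numpy as np

def anneal(y, s=1.0, n_steps=500, cooldown=0.995, seed=0):
    """Train W on the N x N target matrix y by simulated annealing."""
    rng = np.random.default_rng(seed)
    N = y.shape[0]
    M = N                      # fully connected: M = N inputs per neuron
    W = np.zeros((N, N))       # weights start at zero
    T = 1.0                    # effective temperature

    def run(W):
        # Generate N time steps; starting from the first training row
        # is an assumed initial condition.
        x = np.empty_like(y)
        x[0] = y[0]
        for n in range(1, N):
            x[n] = np.tanh(s * (W @ x[n - 1]) / np.sqrt(M))
        return x

    def error(W):
        # RMS error of equation (3).
        return np.sqrt(np.sum((y - run(W)) ** 2)) / N

    e = error(W)
    for _ in range(n_steps):
        # Perturb the weights with Gaussian noise of variance T/M.
        Wc = W + rng.normal(0.0, np.sqrt(T / M), size=(N, N))
        ec = error(Wc)
        if ec < e:             # keep improvements only
            W, e = Wc, ec
            if T < 1.0:        # reheat on success (only while T is below 1)
                T *= 2.0
        else:
            T *= cooldown      # cool down on failure
    return W, e

# Hypothetical usage on 8 phase-shifted sine waves:
y = np.sin(2.0 * np.pi * (np.arange(8)[:, None] + np.arange(8)[None, :]) / 8.0)
W, e = anneal(y, n_steps=300)
```

Because only improvements are accepted, the returned error never exceeds that of the initial zero-weight network.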
2.4.2 Relation of Training to Chaos
The LE of the network was calculated at each improvement of the training algorithm.
This allowed the characterization of the path taken through LE space by various instances
of the training algorithm. Figure 2 shows three typical paths. Only the instances of the
algorithm where the network LE remains near zero converged to a good solution.
Figure 2: LE vs. steps during training on 16 points of 16 out of phase sine waves for 3 instances of the
algorithm. The instance whose LE vs. time graph is in the middle performed well. The others converged
to unacceptable errors. The variance of the LEs is higher in the positive region—this is because perturbing
a network with a highly positive LE has a greater effect (on the LE of the new network obtained) than
perturbing a network with a lower LE by the same amount.
There are certain networks for which the error converges much faster (using far fewer
CPU cycles) than average. Examination of LE vs. time graphs of several networks
revealed that the networks that converge the slowest also pass through regions with
highly positive LEs. However, constraining the LE to be negative (i.e., rejecting all
solutions with positive LEs during training) did not yield better performance. After some
experimentation, we found that allowing the LE to be slightly positive causes the error to
converge faster (see figure 3).
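The LE-based rejection can be expressed as an acceptance filter applied to each candidate weight matrix before the error comparison (a sketch reusing the LE estimate of section 2.2; the bounds come from Figure 3, and the function names are illustrative):

```python
import numpy as np

def largest_le(W, s=1.0, n_iter=200, d0=1e-8, seed=0):
    # Largest-LE estimate as in section 2.2.
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    x = rng.normal(size=N)
    xc = x + d0 / np.sqrt(N)
    total = 0.0
    for _ in range(n_iter):
        x = np.tanh(s * (W @ x) / np.sqrt(N))
        xc = np.tanh(s * (W @ xc) / np.sqrt(N))
        d1 = np.linalg.norm(xc - x)
        total += np.log(d1 / d0)
        xc = x + (d0 / d1) * (xc - x)
    return total / n_iter

def le_acceptable(W, lo=-1.0, hi=0.005):
    """Reject candidate weights whose largest LE falls outside [lo, hi]."""
    return lo <= largest_le(W) <= hi
```

In the annealing loop, a candidate matrix failing this test would be discarded outright, before its error is even compared.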
Figure 3: Error vs. steps during training. A network constrained to an LE in [-1, 0.005]
performs better than an unconstrained network (5 cases, averaged).
The sine wave experiment was repeated, replacing the 16 out-of-phase sine waves in the
training set with values from the logistic map (Xn+1 = 4Xn(1 - Xn)).
2.5 Dynamics of Trained Networks
While networks produced by the training algorithm that have positive LEs may perform
well for the duration of the training set, they invariably do poorly on any prediction
beyond it, as illustrated by Figure 4.
Training does not cause the LE of a network to converge to a certain value, even
though the LE changes at every step (this can be observed in figure 2). However,
networks trained on both periodic and chaotic systems (such as the sine wave or the
logistic map) that converge to a small error value have LEs that are close to zero. This
provides further justification for our proposed method of discarding networks with LEs
that are too high or too low. If a network is trained on time series generated by a chaotic
system (such as the system of logistic maps), the error converges more slowly. It is
possible to find periodic or chaotic solutions that replicate the training set, but the nature
of the behavior after that is not always predictable. This is a fundamental problem with
fitting a potentially chaotic system to replicate time series. Chaotic systems can exhibit a
phenomenon called intermittency where the observed behavior is apparently periodic for
a long time (compared to the typical time period of the system) and abruptly switches to
chaotic behavior (Hilborn, 1994). It is therefore possible for a training algorithm to find
such a region of apparent periodicity that matches the training set, even when the
dynamics of the system is actually chaotic. Increasing the number of steps used in the
calculation of the LE will help reduce the chances of mistaking a chaotic network for a
periodic one.
It must be noted, however, that a negative LE only guarantees a periodic solution, not
necessarily the solution in the training set. Even if the first N points generated by the
network match the training set perfectly, the long-term behavior may still be different.
Figure 5 shows some examples of this: the time course generated by the network may be
attracted to a fixed point or to a cyclic orbit with a very different period.
Figure 5: Data generated by networks with negative LEs.
Networks trained on chaotic systems (such as the logistic map) generate data that may or
may not be chaotic, but do not necessarily resemble the dynamics or the LE of the
original system in the long run.
Since the likelihood of chaos in a network depends mostly on the network size, the
percent chaos data is important in knowing which parameters to choose, depending on
whether the desired network is chaotic or periodic.
3. Conclusion
It is not surprising that eliminating networks with extremely positive LEs during
training on periodic data saves time by not exploring solutions which are far from
optimal. These highly chaotic nets are unlikely to generate the time series needed or to be
stable in the long run. What is somewhat surprising, however, is that excluding all
networks with positive LEs (even when training on a periodic solution) reduces
performance. Apparently, going through regions in weight space with slightly positive
LEs causes the training algorithm to explore different solutions faster without straying
too far from the actual solution.
Another possible explanation is that the LE for the sine wave is zero, so
constraining the network to produce solutions around this number improves training.
This does not seem to be the case, however, since constraining the LE to be around zero
also improves training for chaotic networks. As noted earlier, these networks do not
replicate the long term behavior of the chaotic systems, even though they have the
potential to do so.
Biological neural networks might be exploiting this phenomenon as well, since
they are usually large enough that chaos is nearly certain, but are unlikely to have high
LEs, since that would destroy their long-term memory. Further studies could look for
chaos in networks that model well-known biological systems, such as the neuroanatomy
of C. elegans.
References
Albers, D.J., Sprott, J.C., and Dechert, W.D. (1996). "Dynamical Behavior of Artificial
Neural Networks with Random Weights," in Intelligent Engineering Systems Through
Artificial Neural Networks, Vol. 6, ed. C.H. Dagli, M. Akay, C.L.P. Chen, and B.R.
Fernandez.
Frederickson, P., Kaplan, J.L., Yorke, E.D., and Yorke, J.A. (1983). "The Lyapunov
Dimension of Strange Attractors," Journal of Differential Equations, Vol. 49, pp. 185-207.
Hénon, M. (1976). "A Two-dimensional Mapping with a Strange Attractor," Communications
in Mathematical Physics, Vol. 50, pp. 69-77.
Hilborn, R.C. (1994). Chaos and Nonlinear Dynamics: An Introduction for Scientists and
Engineers. Oxford University Press, New York, p. 296.
Hornik, K., Stinchcombe, M., and White, H. (1990). "Universal Approximation of an
Unknown Mapping and its Derivatives Using Multilayer Feedforward Networks," Neural
Networks, Vol. 3, pp. 551-560.
May, R. (1976). "Simple Mathematical Models with Very Complicated Dynamics," Nature,
Vol. 261, pp. 459-467.
Seidl, D. and Lorenz, D. (1991). A structure by which a recurrent neural network can
approximate a nonlinear dynamic system. In Proceedings of the International Joint
Conference on Neural Networks, volume 2, pp. 709-714.