Using Chaos to Characterize and Train Neural Networks

Oguz Yetkin* and J.C. Sprott†
Department of Physics, University of Wisconsin, Madison, WI 53706 USA

Abstract

A neural network becomes a dynamical system if the outputs generated by the network are used as inputs for the next time step. In this way, the network can be used to generate time series. Due to the nonlinearity in the neurons, these networks are also capable of chaotic behavior (sensitive dependence on initial conditions). We present ways to characterize the degree of chaos in a given network, demonstrate that it is influenced greatly by network size, and use this information to aid in training. Our results indicate that training by simulated annealing can be improved by rejecting highly chaotic solutions encountered by the training algorithm, but is hindered if all chaotic solutions are rejected.

* [email protected]
† [email protected], to whom all correspondence should be addressed

1. Introduction

Since their invention, artificial neural networks have been important in artificial intelligence, in engineering applications, and in helping biologists model the nervous system. These systems are capable of exhibiting chaotic behavior, which has immediate consequences for engineers attempting to design systems that must behave predictably given a specified range of inputs, and perhaps more subtle ones for biologists. An important class of applications affected by the sensitivity to initial conditions that chaotic behavior implies is time series prediction. Neural networks can be trained to replicate time series without prior knowledge of the system that produced the dynamical behavior. For example, such networks have been used to predict future blood glucose levels given current levels of glucose and insulin. Neural networks can also reproduce the dynamics of chaotic systems such as the Hénon map (Hénon, 1976).
It is therefore important to determine what percentage of networks with a given architecture and distribution of weights can be expected to exhibit chaos. Previous studies have investigated the frequency of chaos in feedforward networks and determined that large networks are almost 100% chaotic (Albers et al., 1996). While feedforward nets have important applications in machine learning and engineering, recurrent neural networks more closely resemble the architecture of certain structures such as the brain, and are capable of representing arbitrary nonlinear dynamical systems (Seidl and Lorenz, 1991; Hornik et al., 1990). A recurrent neural network can be made into a dynamical system by using the output vector produced by one iteration as the input vector for the next one; the value of this vector changes as time is advanced. The sensitivity of the network to initial conditions can be quantified by numerically calculating the largest Lyapunov exponent (LE) (Frederickson et al., 1983).

In addition to estimating the fraction of randomly connected networks that can be expected to exhibit chaos, we have also discovered certain LE constraints that can be imposed upon the training algorithm to improve its performance. Of specific practical interest is the finding that a small amount of chaos appears to be necessary for the algorithm.

2.1 Network Architecture

We used recurrent, fully connected neural networks of various sizes to generate time series. The largest Lyapunov exponent of each network was then calculated. The recurrent neural network is defined as follows: Let W be an N x M matrix of connection weights between neurons (initially assigned Gaussian random numbers with zero mean and unit variance), where N is the number of neurons and M is the number of inputs to each neuron. For a fully connected network (and for all the networks in this paper), M equals N.
We can make this simplification because an N x N square matrix can represent any artificial neural network of this type containing N neurons. Let x be an N x N matrix representing N outputs over N time steps, such that x_{i,n} is the value of the ith neuron at time step n (while this matrix could contain any number of time steps, we have chosen N steps in order to arrive at a unique solution). Let s be a real, positive scalar (usually set to 1). The output of neuron i at time n+1 (i.e., x_{i,n+1}) depends on the values of all neurons at time n and the weight matrix W in the following manner:

x_{i,n+1} = tanh( (s/√M) ∑_{j=1}^{M} W_{i,j} x_{j,n} )     (1)

The weights are normalized by the square root of M so that the mean square input to the neurons is roughly independent of M.

2.2 Quantification of Chaos in the Network

It is possible to determine whether the aforementioned system is chaotic by calculating its largest Lyapunov exponent (Frederickson et al., 1983). The LE is a measure of sensitivity to initial conditions; a positive LE implies chaotic behavior. The LE for a network represented by an N x N weight matrix W was calculated numerically as follows: Let X be an N-dimensional vector representing the outputs of the network in a given time step. Every element of X was assigned a random initial value (Gaussian distribution, centered on zero with a variance of 1). For each iteration n of the calculation, a copy (Xc) of the vector X representing the recurrent network's outputs was generated. Every component of Xc was perturbed by a small positive number such that the Euclidean distance between X and Xc equals d0. At each time step, equation (1) was solved twice to obtain new values for X and Xc. The distance d1 between the new values of X and Xc was then calculated. The position of Xc was readjusted after each iteration to lie in the direction of the vector Xc - X, but with separation d0, by adjusting each component Xc_j of Xc according to:

Xc_j = X_j + (d0/d1) (Xc_j - X_j)     (2)

The average of log|d1/d0| over many iterations gives the largest Lyapunov exponent. We used 200 iterations (sufficient for two-digit accuracy) and a d0 of 10^-8.

2.3 Probability of Chaos

Albers et al. (1996) showed that untrained feedforward networks with randomly chosen weights have a probability of chaos that approaches unity as the number of weights and connections is increased. We have extended this work by testing the probability of chaos in fully connected recurrent networks. The effect of the scaling parameter s and the size of the network N on the frequency of chaos and the average Lyapunov exponent was determined using 30 networks with different, randomly chosen weights. Figure 1 shows the effects of the scaling parameter.

Figure 1: Percent chaos and average LE vs. s. (1a) N=16; (1b) N=32. Note that the larger network is 100% chaotic for most s values.

For large values of N, the networks are almost always chaotic for a wide range of s values. This is similar to the behavior of feedforward networks, except that the transition to 100% chaos with respect to the parameters s and N is more abrupt for recurrent networks. The decrease in percent chaos as s gets larger (observed in figure 1) is due to the saturation of the neurons (all outputs are driven to either 1 or -1). These findings indicate that, for a large recurrent network, chaos is the rule rather than the exception. This has consequences for time series prediction as well as training. While the training algorithm alters the LE of an individual network, the average LE of a large number of trained networks agrees with the values given in figure 1 and is not affected by the nature of the training set.
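As a concrete illustration (not the authors' original code), the network iteration of equation (1) and the LE procedure of section 2.2 can be sketched in Python with NumPy. All function names and the choice of random seed are ours:

```python
import numpy as np

def step(W, x, s=1.0):
    """One iteration of equation (1): x_{i,n+1} = tanh((s/sqrt(M)) * sum_j W_{i,j} x_{j,n})."""
    M = W.shape[1]
    return np.tanh(s * (W @ x) / np.sqrt(M))

def largest_le(W, s=1.0, d0=1e-8, n_iter=200, seed=0):
    """Largest Lyapunov exponent via the perturbed-copy method of section 2.2:
    iterate X and a copy Xc kept at distance d0, and average log|d1/d0|."""
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    X = rng.standard_normal(N)            # random initial outputs
    Xc = X + d0 / np.sqrt(N)              # copy at Euclidean distance d0
    log_sum = 0.0
    for _ in range(n_iter):
        X, Xc = step(W, X, s), step(W, Xc, s)
        d1 = np.linalg.norm(Xc - X)
        log_sum += np.log(d1 / d0)
        Xc = X + (d0 / d1) * (Xc - X)     # renormalize per equation (2)
    return log_sum / n_iter

# Heavily contracting weights give a clearly negative LE; large random
# networks (e.g. N = 32, unit-variance weights) are typically chaotic.
le_contracting = largest_le(0.1 * np.eye(4))
le_random = largest_le(np.random.default_rng(1).standard_normal((32, 32)))
```

The renormalization step keeps the separation small so that the estimate tracks the local linearized dynamics rather than the attractor size.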
2.4 Effect of Chaos on Training

To determine the relationship between training and chaos, we used a variant of the simulated annealing algorithm to train networks and measured the LE of the network during training.

2.4.1 Implementation of the Training Algorithm

Simulated annealing was implemented as follows: All weights of the connections between neurons (in the matrix W) were initialized to zero. These weights were then perturbed by adding a Gaussian random number with zero mean and variance T (initially chosen to be 1), divided by the square root of M, the number of connections per neuron. An error term e was initialized to a very high value. The value T can be thought of as an effective temperature.

The training set for periodic time series consisted of N identical sine waves (one per neuron) sampled at N time points with different initial phases. For a sample chaotic time series, we used values generated by N logistic maps in a similar fashion. The logistic map is a very simple chaotic system and is given by X_{n+1} = 4X_n(1 - X_n) (May, 1976). Using a sine wave in this fashion (one per neuron) results in a training set that is an N x N matrix, y, with each column representing a sine wave. To calculate the error in training, the time series generated by the neurons in the system are also placed in an N x N matrix, x. The RMS distance between these matrices yields the error term e and is calculated as:

e = [ (1/N²) ∑_{i=1}^{N} ∑_{n=1}^{N} (y_{i,n} - x_{i,n})² ]^{1/2}     (3)

At each iteration, the weight matrix W of the network is perturbed by Gaussian random numbers with variance T/M. The network is then allowed to generate a time series consisting of N x N points (N time points for each of N neurons), and the error is compared to the best error known so far. If the new error is less, it is assigned to e, the new weights are kept, and T is doubled (if less than 1). If not, T is decreased by a cooldown factor and the procedure is repeated.
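The annealing loop just described can be sketched as follows (again our own Python with NumPy, not the authors' code). The starting state x0 (here taken from the first column of the target) and the cap on T when reheating are our assumptions, since the text does not specify them:

```python
import numpy as np

def generate_series(W, x0, n_steps, s=1.0):
    """Run the network of equation (1) for n_steps; one column per time step."""
    M = W.shape[1]
    x, out = x0.copy(), np.empty((W.shape[0], n_steps))
    for n in range(n_steps):
        x = np.tanh(s * (W @ x) / np.sqrt(M))
        out[:, n] = x
    return out

def rms_error(x, y):
    """Equation (3): RMS distance between generated matrix x and target y."""
    N = y.shape[0]
    return np.sqrt(np.sum((y - x) ** 2)) / N

def anneal(y, n_iter=2000, cooldown=0.995, seed=0):
    """Simulated annealing on the weight matrix as described in section 2.4.1."""
    rng = np.random.default_rng(seed)
    N = y.shape[0]
    W = np.zeros((N, N))                       # weights start at zero
    x0 = y[:, 0]                               # initial state: our assumption
    T, e_best = 1.0, np.inf
    for _ in range(n_iter):
        # Perturb weights with Gaussian noise of variance T/M (M = N here).
        W_trial = W + rng.normal(0.0, np.sqrt(T), (N, N)) / np.sqrt(N)
        e = rms_error(generate_series(W_trial, x0, N), y)
        if e < e_best:
            W, e_best = W_trial, e
            T = min(1.0, 2.0 * T)              # reheat on improvement (capped at 1)
        else:
            T *= cooldown                      # cool down on rejection
    return W, e_best

# Target: N out-of-phase sine waves sampled at N points, as in the paper.
N = 8
t = np.arange(N)
y = np.sin(2 * np.pi * t[None, :] / N + 2 * np.pi * np.arange(N)[:, None] / N)
W, e_best = anneal(y, n_iter=1000)
```

Replacing y with columns generated by the logistic map X_{n+1} = 4X_n(1 - X_n) gives the chaotic training set used in the paper.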
(We used a cooldown factor of 0.995.)

2.4.2 Relation of Training to Chaos

The LE of the network was calculated at each improvement of the training algorithm. This allowed the characterization of the path taken through LE space by various instances of the training algorithm. Figure 2 shows three typical paths. Only the instances of the algorithm where the network LE remains near zero converged to a good solution.

Figure 2: LE vs. steps during training on 16 points of 16 out-of-phase sine waves for 3 instances of the algorithm. The instance whose LE vs. time graph is in the middle performed well. The others converged to unacceptable errors.

The variance of the LEs is higher in the positive region; this is because perturbing a network with a highly positive LE has a greater effect (on the LE of the new network obtained) than perturbing a network with a lower LE by the same amount. There are certain networks for which the error converges much faster (using the fewest CPU cycles) than average. Examination of LE vs. time graphs of several networks revealed that the networks that converge the slowest also pass through regions with highly positive LEs. However, constraining the LE to be negative (i.e., rejecting all solutions with positive LEs during training) did not yield better performance. After some experimentation, we found that allowing the LE to be slightly positive causes the error to converge faster (see figure 3).

Figure 3: Error vs. steps (0 to 10000) for constrained and unconstrained training. A network constrained to an LE in [-1, 0.005] performs better than an unconstrained network (5 cases, averaged).

The sine wave experiment was repeated, replacing the 16 out-of-phase sine wave values in the training set with values from the logistic map (X_{n+1} = 4X_n(1 - X_n)).
2.5 Dynamics of Trained Networks

While networks produced by the training algorithm that have positive LEs may perform well for the duration of the training set, they invariably do poorly on any prediction beyond it, as illustrated by Figure 4.

Training does not cause the LE of a network to converge to a particular value, even though the LE changes at every step (this can be observed in figure 2). However, networks trained on both periodic and chaotic systems (such as the sine wave or the logistic map) that converge to a small error value have LEs that are close to zero. This provides further justification for our proposed method of discarding networks with LEs that are too high or too low.

If a network is trained on a time series generated by a chaotic system (such as the system of logistic maps), the error converges more slowly. It is possible to find periodic or chaotic solutions that replicate the training set, but the behavior beyond it is not always predictable. This is a fundamental problem with fitting a potentially chaotic system to replicate a time series. Chaotic systems can exhibit a phenomenon called intermittency, in which the observed behavior is apparently periodic for a long time (compared to the typical time period of the system) and then abruptly switches to chaotic behavior (Hilborn, 1994). It is therefore possible for a training algorithm to find such a region of apparent periodicity that matches the training set, even when the dynamics of the system are actually chaotic. Increasing the number of steps used in the calculation of the LE helps reduce the chances of mistaking a chaotic network for a periodic one. It must be noted, however, that a negative LE guarantees only a periodic solution, not necessarily the solution in the training set. Even if the first N points generated by the network match the training set perfectly, the long-term behavior may still be different. Figure 5 shows some examples of this.
The time course generated by the network may be attracted to a fixed point or to a cyclic orbit with a very different period.

Figure 5: Data generated by networks with negative LEs.

Networks trained on chaotic systems (such as the logistic map) generate data that may or may not be chaotic, but do not necessarily resemble the dynamics or the LE of the original system in the long run. Since the likelihood of chaos in a network depends mostly on the network size, the percent-chaos data are important in knowing which parameters to choose, depending on whether the desired net is chaotic or periodic.

3. Conclusion

It is not surprising that eliminating networks with extremely positive LEs during training on periodic data saves time by not exploring solutions that are far from optimal. These highly chaotic nets are unlikely to generate the time series needed or to be stable in the long run. What is somewhat surprising, however, is that excluding all networks with positive LEs (even when training toward a periodic solution) reduces performance. Apparently, passing through regions in weight space with slightly positive LEs lets the training algorithm explore different solutions faster without straying too far from the actual solution. Another possible explanation is that the LE of the sine wave is zero, so constraining the network to produce solutions around this value improves training. This does not seem to be the case, however, since constraining the LE to be around zero also improves training for chaotic networks. As noted earlier, these networks do not replicate the long-term behavior of the chaotic systems, even though they have the potential to do so. Biological neural networks might be exploiting this phenomenon as well, since they are usually large enough to be 100% chaotic, but are unlikely to have high LEs, since that would destroy their long-term memory.
Further studies could look for chaos in networks that model well-known biological systems, such as the neuroanatomy of C. elegans.

References

Albers, D.J., Sprott, J.C., and Dechert, W.D. (1996). "Dynamical Behavior of Artificial Neural Networks with Random Weights." In Intelligent Engineering Systems Through Artificial Neural Networks, Volume 6, ed. C.H. Dagli, M. Akay, C.L.P. Chen, and B.R. Fernandez.

Frederickson, P., Kaplan, J.L., Yorke, E.D., and Yorke, J.A. (1983). "The Lyapunov Dimension of Strange Attractors." Journal of Differential Equations, Vol. 49, pp. 185-207.

Hénon, M. (1976). "A Two-dimensional Mapping with a Strange Attractor." Communications in Mathematical Physics, Vol. 50, pp. 69-77.

Hilborn, R.C. (1994). Chaos and Nonlinear Dynamics: An Introduction for Scientists and Engineers. Oxford University Press, New York, p. 296.

Hornik, K., Stinchcombe, M., and White, H. (1990). "Universal Approximation of an Unknown Mapping and its Derivatives Using Multilayer Feedforward Networks." Neural Networks, Vol. 3, pp. 551-560.

May, R. (1976). "Simple Mathematical Models with Very Complicated Dynamics." Nature, Vol. 261, pp. 459-467.

Seidl, D. and Lorenz, D. (1991). "A Structure by which a Recurrent Neural Network Can Approximate a Nonlinear Dynamic System." In Proceedings of the International Joint Conference on Neural Networks, Volume 2, pp. 709-714.