Neural Networks

WHY ARTIFICIAL NEURAL NETWORKS?
• Characteristics of the human brain that are not present in von Neumann or modern parallel computers include:
• massive parallelism,
• distributed representation and computation,
• learning ability,
• generalization ability,
• adaptivity,
• inherent contextual information processing,
• fault tolerance, and
• low energy consumption.
• It is hoped that devices based on biological neural networks will possess some of these desirable characteristics.

ANNs
• Inspired by biological neural networks, ANNs are massively parallel computing systems consisting of an extremely large number of simple processors with many interconnections.
• ANN models attempt to use some "organizational" principles believed to be used in the human brain.

Brief historical review
• ANN research has experienced three periods of extensive activity:
• The first peak, in the 1940s, was due to McCulloch and Pitts' work on the formal neuron.
• The second occurred in the 1960s with Rosenblatt's perceptron convergence theorem and Minsky and Papert's work showing the limitations of a simple perceptron. Minsky and Papert's results dampened the enthusiasm of most researchers, and the lull lasted almost 20 years.
• Since the early 1980s, ANNs have received considerable renewed interest. The major developments include
• Hopfield's energy approach in 1982, and
• the back-propagation learning algorithm for multilayer perceptrons (multilayer feed-forward networks), first proposed by Werbos and then popularized by Rumelhart et al. in 1986.

Biological neural networks
• A neuron (or nerve cell) is a special biological cell that processes information. It is composed of a cell body, or soma, and two types of outreaching tree-like branches: the axon and the dendrites.
• A neuron receives signals (impulses) from other neurons through its dendrites (receivers) and transmits signals generated by its cell body along the axon (transmitter), which eventually branches into strands and substrands.
• At the terminals of these strands are the synapses.
• A synapse is an elementary structural and functional unit between two neurons (an axon strand of one neuron and a dendrite of another).
• The human brain contains about 10^11 neurons, which is approximately the number of stars in the Milky Way.
• Neurons are massively connected, much more complex and dense than telephone networks.
• Each neuron is connected to 10^3 to 10^4 other neurons.
• In total, the human brain contains approximately 10^14 to 10^15 interconnections.
• Complex perceptual decisions such as face recognition are typically made by humans within a few hundred milliseconds.
• These decisions are made by a network of neurons whose operational speed is only a few milliseconds. This implies that
• the computations cannot take more than about 100 serial stages, and
• the brain runs parallel programs that are about 100 steps long for such perceptual tasks. This is known as the hundred step rule.

Computational models of neurons
• This mathematical neuron computes a weighted sum of its n input signals x_j, j = 1, 2, ..., n.
• It generates an output of 1 if this sum exceeds a certain threshold U; otherwise, an output of 0 results.
• Mathematically, y = θ(Σ_j w_j x_j − U), where θ(·) is the unit step function and w_j is the synapse weight associated with the jth input.
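The threshold unit above can be written in a few lines of code. The sketch below is illustrative only, assuming Python; the variable names and the example weights and threshold are my own choices, not values from the slides.

```python
def threshold_neuron(x, w, U):
    """McCulloch-Pitts style unit: output 1 if the weighted sum of the
    inputs exceeds the threshold U, otherwise 0."""
    s = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if s > U else 0

# Example: a 2-input unit with weights (1, 1) and threshold 1.5
# behaves like a logical AND gate.
print(threshold_neuron([1, 1], [1, 1], 1.5))  # 1
print(threshold_neuron([1, 0], [1, 1], 1.5))  # 0
```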
• For simplicity of notation, we often consider the threshold U as another weight w_0 = −U attached to the neuron with a constant input x_0 = 1.

Activation Functions

The Sigmoid
• The standard sigmoid function is the logistic function, defined by g(v) = 1 / (1 + e^(−βv)), where β is the slope parameter (a small numerical sketch of this function appears below, after the learning-paradigms overview).

Network architectures
• ANNs can be viewed as weighted directed graphs in which artificial neurons are nodes and directed edges (with weights) are connections between neuron outputs and neuron inputs. There are two main classes:
• feed-forward networks, in which the graphs have no loops, and
• recurrent (or feedback) networks, in which loops occur because of feedback connections.
• Different connectivities yield different network behaviors.
• Feed-forward networks are
• static, that is, they produce only one set of output values rather than a sequence of values from a given input, and
• memory-less in the sense that their response to an input is independent of the previous network state.
• Recurrent, or feedback, networks are
• dynamic systems.
• When a new input pattern is presented, the neuron outputs are computed. Because of the feedback paths, the inputs to each neuron are then modified, which leads the network to enter a new state.
• Different network architectures require appropriate learning algorithms.

Learning
• A learning process in the ANN context can be viewed as the problem of updating network architecture and connection weights so that a network can efficiently perform a specific task.
• The network usually must learn the connection weights from available training patterns.
• Performance is improved over time by iteratively updating the weights in the network.
• ANNs' ability to automatically learn from examples makes them attractive and exciting.
• ANNs appear to learn underlying rules (like input-output relationships) from the given collection of representative examples.
• This is one of the major advantages of neural networks over traditional expert systems.

Learning algorithm
• To understand or design a learning process, you must have
1. a learning paradigm: a model of the environment in which a neural network operates, i.e., you must know what information is available to the network, and
2. learning rules: you must understand how network weights are updated, i.e., which learning rules govern the updating process.
• A learning algorithm refers to a procedure in which learning rules are used for adjusting the weights.

Learning paradigms
• Supervised learning: the network is provided with the correct answer (output) for every input pattern - learning with a "teacher". Weights are determined to allow the network to produce answers as close as possible to the known correct answers.
• Reinforcement learning is a variant of supervised learning in which the network is provided with only a critique on the correctness of network outputs, not the correct answers themselves.
• Unsupervised learning: the network explores the underlying structure in the data, or correlations between patterns in the data, and organizes patterns into categories from these correlations - learning without a teacher.
• Hybrid learning: part of the weights are determined through supervised learning, while the others are obtained through unsupervised learning - it combines supervised and unsupervised learning.
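As promised above, here is a minimal numerical sketch of the logistic activation, assuming Python; the slope values and sample points are arbitrary illustrative choices of mine, not values from the slides.

```python
import math

def sigmoid(v, beta=1.0):
    """Logistic activation g(v) = 1 / (1 + exp(-beta * v));
    beta controls the slope around v = 0."""
    return 1.0 / (1.0 + math.exp(-beta * v))

for v in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(v, round(sigmoid(v), 3))            # smooth, saturating output in (0, 1)
    print(v, round(sigmoid(v, beta=5.0), 3))  # larger beta -> closer to a step function
```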
Learning theory
• Learning theory must address three fundamental and practical issues associated with learning from samples: capacity, sample complexity, and computational complexity.
• Capacity: how many patterns can be stored, and what functions and decision boundaries a network can form.
• Sample complexity: the number of training patterns needed to train the network to guarantee a valid generalization.
• Too few patterns may cause "over-fitting" (wherein the network performs well on the training data set, but poorly on independent test patterns drawn from the same distribution as the training patterns).
• Computational complexity: the time required for a learning algorithm to estimate a solution from training patterns.
• Many existing learning algorithms have high computational complexity.

Learning rules
• There are four basic types of learning rules: error-correction, Boltzmann, Hebbian, and competitive learning.

ERROR-CORRECTION RULES
• During the learning process, the actual output y generated by the network may not equal the desired output d.
• The basic principle of error-correction learning rules is to use the error signal (d − y) to modify the connection weights so as to gradually reduce this error.
• The perceptron learning rule is based on this error-correction principle.
• A perceptron consists of a single neuron with adjustable weights w_j, j = 1, 2, ..., n, and threshold U (threshold activation function).
• Given an input vector x = (x_1, x_2, ..., x_n)^t, the net input to the neuron is v = Σ_j w_j x_j − U.
• The output y of the perceptron is +1 if v > 0, and 0 otherwise.
• In a two-class classification problem, the perceptron assigns an input pattern to one class if y = 1, and to the other class if y = 0.
• The linear equation Σ_j w_j x_j − U = 0 defines the decision boundary (a hyperplane) that halves the space.

Perceptron learning algorithm
1) Randomly initialize the weights w_1, w_2, ..., w_n and the threshold.
2) Present an input vector x = (x_1, x_2, ..., x_n)^t and evaluate the output y of the neuron.
3) Update the weights according to w_j(t + 1) = w_j(t) + η (d − y) x_j, where d is the desired output, t is the iteration number, and η is the gain (step size), 0.0 < η < 1.0. (A code sketch of this procedure appears after the XOR discussion below.)
• Note that learning occurs only when the perceptron makes an error.
• The perceptron convergence theorem: Rosenblatt proved that when training patterns are drawn from two linearly separable classes, the perceptron learning procedure converges after a finite number of iterations.
• In practice, you do not know whether the patterns are linearly separable.
• Many variations of this learning algorithm have been proposed in the literature.
• Other activation functions, which lead to different learning characteristics, can also be used.
• The back-propagation learning algorithm is also based on the error-correction principle.

Perceptrons and Boolean Functions
• If the inputs are all 0's and 1's and the outputs are all 0's and 1's, a single perceptron can learn functions such as x1 AND x2 and x1 OR x2.
• What about the exclusive-or function? f(x1, x2) = x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)

XOR problem
• Desired: build an ANN that will produce Y = X1 XOR X2 on inputs X1 and X2.
• Problem: there is no single line that can "cut" the X1-X2 space into the two required regions, so a single-layer neural net cannot be used.
• Solution: use a multilayer network.
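To make the error-correction idea concrete, here is a minimal sketch of the perceptron learning rule above, assuming Python; the AND training set, the learning rate of 0.1, and the fixed epoch count are illustrative choices of mine, not values from the slides. On a linearly separable problem such as AND the procedure converges; on XOR it never can, which is why a multilayer network is needed.

```python
def perceptron_train(patterns, eta=0.1, epochs=20):
    """patterns: list of (x, d) pairs, x a tuple of inputs, d the desired 0/1 output.
    Uses the bias trick: w[0] plays the role of the threshold, with constant input x0 = 1."""
    n = len(patterns[0][0])
    w = [0.0] * (n + 1)                       # [w0, w1, ..., wn]
    for _ in range(epochs):
        for x, d in patterns:
            xb = (1,) + x                     # prepend the constant input x0 = 1
            v = sum(wj * xj for wj, xj in zip(w, xb))
            y = 1 if v > 0 else 0
            # weights change only when the perceptron makes an error (d != y)
            w = [wj + eta * (d - y) * xj for wj, xj in zip(w, xb)]
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(perceptron_train(AND))   # learns a separating line for AND
```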
HEBBIAN RULE
• The oldest learning rule is Hebb's postulate of learning. Hebb based it on the following observation from neurobiological experiments:
• If neurons on both sides of a synapse are activated synchronously and repeatedly, the synapse's strength is selectively increased.
• Mathematically, the Hebbian rule can be described as Δw_ij = η y_j x_i, where x_i and y_j are the output values of neurons i and j, respectively, which are connected by the synapse w_ij, and η is the learning rate. Note that x_i is the input to the synapse.
• An important property of this rule is that learning is done locally, i.e., the change in synapse weight depends only on the activities of the two neurons connected by it.
• This significantly simplifies the complexity of the learning circuit in a VLSI implementation.
• A single neuron trained using the Hebbian rule exhibits orientation selectivity.
• In the accompanying illustration, the points depicted are drawn from a two-dimensional Gaussian distribution and used to train the neuron.
• The weight vector of the neuron is initialized to w_0.
• As the learning proceeds, the weight vector moves progressively closer to the direction w of maximal variance in the data.
• w is the eigenvector of the covariance matrix of the data corresponding to the largest eigenvalue.

BOLTZMANN LEARNING
• The Boltzmann machine (named by its inventors in honour of a 19th-century scientist) is a symmetric recurrent network consisting of binary units (+1 for "on" and −1 for "off").
• Symmetric means the weight on the connection from unit i to unit j is equal to the weight on the connection from unit j to unit i.
• A subset of the neurons, called visible, interact with the environment; the rest, called hidden, do not. (K denotes the number of visible neurons and L the number of hidden neurons.)
• Each neuron is a stochastic unit that generates an output (or state) according to the Boltzmann distribution of statistical mechanics.
• Boltzmann machines operate in two modes:
• Clamped: visible neurons are clamped onto specific states determined by the environment; and
• Free-running: both visible and hidden neurons are allowed to operate freely. The hidden neurons always operate freely.
• Boltzmann learning is a stochastic learning rule derived from information-theoretic and thermodynamic principles.
• The objective of Boltzmann learning is to adjust the connection weights so that the states of the visible units satisfy a particular desired probability distribution.
• According to the Boltzmann learning rule, the change in the connection weight w_ij is given by Δw_ij = η (ρ_ij^+ − ρ_ij^-), where η is the learning rate, and ρ_ij^+ and ρ_ij^- are the correlations between the states of units i and j when the network operates in the clamped mode and the free-running mode, respectively.

Summary of the Boltzmann Machine Learning Procedure
1. Initialization: set the weights to random numbers in [−1, 1].
2. Clamping Phase: present the net with the mapping it is supposed to learn by clamping the input and output units to patterns. For each pattern, perform simulated annealing on the hidden units at a sequence T_0, T_1, ..., T_final of temperatures. At the final temperature, collect statistics to estimate the correlations ρ_ij^+.
3. Free-Running Phase: repeat the calculations performed in step 2, but this time clamp only the input units. Hence, at the final temperature, estimate the correlations ρ_ij^-.
4. Updating of Weights: update the weights using the learning rule Δw_ij = η (ρ_ij^+ − ρ_ij^-), where η is a learning rate parameter.
5. Iterate until Convergence: iterate steps 2 to 4 until the learning procedure converges, with no more changes taking place in the synaptic weights w_ji for all j, i.
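The weight update of step 4 can be sketched in a few lines. This is a highly simplified illustration, assuming Python: it takes as given the state samples collected at the final temperature in the clamped and free-running phases (the simulated-annealing sampling of steps 2 and 3 is not shown), and all function and variable names are my own, not from the slides.

```python
def estimate_correlations(samples, n_units):
    """samples: list of state vectors (entries +1/-1) collected at the final
    temperature; returns the matrix of average pairwise products <s_i * s_j>."""
    rho = [[0.0] * n_units for _ in range(n_units)]
    for s in samples:
        for i in range(n_units):
            for j in range(n_units):
                rho[i][j] += s[i] * s[j]
    return [[v / len(samples) for v in row] for row in rho]

def boltzmann_update(w, clamped_samples, free_samples, eta=0.01):
    """Step 4 of the procedure: delta_w_ij = eta * (rho_plus_ij - rho_minus_ij)."""
    n = len(w)
    rho_plus = estimate_correlations(clamped_samples, n)
    rho_minus = estimate_correlations(free_samples, n)
    for i in range(n):
        for j in range(n):
            if i != j:                       # no self-connections
                w[i][j] += eta * (rho_plus[i][j] - rho_minus[i][j])
    return w
```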
Alternative Boltzmann Architecture
• Alternatively, the visible units may be viewed as divided into input and output units.
• In this case the Boltzmann machine performs association under the supervision of a teacher, with the input units receiving information from the environment and the output units reporting the outcome for that input pattern.

Boltzmann vs Hopfield
Similarities:
1. Processing units have binary states (±1).
2. Connections between units are symmetric.
3. Units are picked at random, one at a time, for updating.
4. Units have no self-feedback.
Differences:
1. The Boltzmann machine permits the use of hidden neurons.
2. The Boltzmann machine uses stochastic neurons with a probabilistic firing mechanism, whereas the standard Hopfield net uses neurons based on the McCulloch-Pitts model with a deterministic firing mechanism.
3. The Boltzmann machine may also be trained by a probabilistic form of supervision.

COMPETITIVE LEARNING RULES
• In competitive learning, the output units compete among themselves for activation. As a result, only one output unit is active at any given time. This phenomenon is known as winner-take-all.
• Competitive learning has been found to exist in biological neural networks.
• Competitive learning often clusters or categorizes the input data. Similar patterns are grouped by the network and represented by a single unit. This grouping is done automatically based on data correlations.
• The simplest competitive learning network consists of a single layer of output units.
• Each output unit i in the network connects to all the input units x_j via weights w_ij, j = 1, 2, ..., n.
• Each output unit also connects to all the other output units via inhibitory weights, but has a self-feedback connection with an excitatory weight.
• A simple competitive learning rule can be stated as Δw_i*j = η (x_j − w_i*j) for the winner unit i* (the unit whose weight vector is closest to the input), and Δw_ij = 0 for all other units i.
• Note that only the weights of the winner unit get updated.
• The effect of this learning rule is to move the stored pattern in the winner unit (its weights) a little bit closer to the input pattern.
• In the standard illustration of this rule, all input vectors are assumed to have been normalized to unit length, and the weight vectors of three output units are randomly initialized; their initial and final positions on the sphere after competitive learning are marked as Xs.
• Each of the three natural groups (clusters) of patterns is discovered by an output unit whose weight vector points to the center of gravity of the discovered group.
• You can see from the competitive learning rule that the network will not stop learning (updating weights) unless the learning rate η is 0.
• A particular input pattern can fire different output units at different iterations during learning.
• The system is said to be stable if no pattern in the training data changes its category after a finite number of learning iterations.
• One way to achieve stability is to force the learning rate to decrease gradually towards 0 as the learning process proceeds. However, this artificial freezing of learning costs the network its plasticity, the ability to adapt to new data. This is known as Grossberg's stability-plasticity dilemma in competitive learning.
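A minimal sketch of the winner-take-all update above, assuming Python with NumPy; the number of output units, the decaying learning-rate schedule, and the use of the dot product to pick the winner among unit-length vectors are illustrative choices of mine, not prescriptions from the slides.

```python
import numpy as np

def competitive_learning(X, n_units=3, eta=0.1, epochs=50, seed=0):
    """X: array of shape (n_samples, n_features), rows normalized to unit length.
    Returns the learned weight vectors (one prototype per output unit)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_units, X.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)       # random unit-length prototypes
    for epoch in range(epochs):
        rate = eta / (1 + epoch)                         # decaying rate, for stability
        for x in X:
            winner = np.argmax(W @ x)                    # unit closest to x (largest dot product)
            W[winner] += rate * (x - W[winner])          # move only the winner toward x
    return W
```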
• The most well-known example of competitive learning is vector quantization for data compression.
• It has been widely used in speech and image processing for efficient storage, transmission, and modeling.
• Its goal is to represent a set or distribution of input vectors with a relatively small number of prototype vectors (weight vectors), called a codebook. Once a codebook has been constructed and agreed upon by both the transmitter and the receiver, you need only transmit or store the index of the prototype corresponding to the input vector.
• Given an input vector, its corresponding prototype can be found by searching for the nearest prototype in the codebook.

Well-known learning algorithms
(The original slides summarize the well-known learning algorithms in a table that is not reproduced in this transcript.)

SUMMARY
• Learning rules based on error-correction can be used for training feed-forward networks.
• Hebbian learning rules have been used for all types of network architectures.
• Each learning algorithm is designed for training a specific architecture; when we discuss a learning algorithm, a particular network architecture is implied.
• Each algorithm can perform only a few tasks well.
• Other well-known algorithms include Adaline, Madaline, linear discriminant analysis, Sammon's projection, and principal component analysis.

Multilayer Networks
• The class of functions representable by a single perceptron, Out(x) = g(w · x) = g(Σ_j w_j x_j), is limited.
• A multilayer network instead computes Out(x) = g(Σ_j W_j g(Σ_k w_jk x_k)).
• This is a nonlinear function of a linear combination of nonlinear functions of linear combinations of the inputs.

A 1-HIDDEN LAYER NET
• Example with N_INPUTS = 2 and N_HIDDEN = 3: each hidden unit computes v_i = g(Σ_{k=1..N_INS} w_ik x_k), and the output unit computes Out = g(Σ_{k=1..N_HID} W_k v_k).

OTHER NEURAL NETS
(The original slides illustrate further architectures in figures only.)

Multilayer perceptron
• The most popular class of multilayer feed-forward networks is the multilayer perceptron.
• Each computational unit employs either the thresholding function or the sigmoid function.
• Multilayer perceptrons can form arbitrarily complex decision boundaries and represent any Boolean function.
• The development of the back-propagation learning algorithm for determining the weights of a multilayer perceptron has made these networks the most popular among researchers and users of neural networks.
• We denote by w_ij(l) the weight on the connection from the ith unit in layer (l−1) to the jth unit in layer l.
• Let {(x(1), d(1)), (x(2), d(2)), ..., (x(p), d(p))} be a set of p training patterns (input-output pairs), where x(i) ∈ R^n is the input vector in the n-dimensional pattern space and d(i) ∈ [0, 1]^m, an m-dimensional hypercube. For classification purposes, m is the number of classes.
• The squared-error cost function most frequently used in the ANN literature is E = (1/2) Σ_{i=1..p} ||d(i) − y(i)||^2, where y(i) is the network output for input x(i).

Back-propagation
• The back-propagation algorithm is a gradient-descent method to minimize the squared-error cost function E.

GRADIENT DESCENT
• Suppose we have a scalar function f(w) of a single weight w and we want to find a local minimum. Assume our current weight is w.
• GRADIENT DESCENT RULE: w ← w − η ∂f(w)/∂w, where η is called the LEARNING RATE, a small positive number, e.g. η = 0.05.

Gradient Descent in "m" Dimensions
• Given f(w) with w ∈ R^m, the gradient ∇f(w) = (∂f/∂w_1, ..., ∂f/∂w_m)^t points in the direction of steepest ascent.
• GRADIENT DESCENT RULE: w ← w − η ∇f(w). Equivalently, w_j ← w_j − η ∂f(w)/∂w_j, where w_j is the jth weight ("just like a linear feedback system").
• Applied to the squared error Σ_i (y_i − w · x_i)^2, this yields w_j ← w_j + η Σ_i (y_i − w · x_i) x_ij: A RULE KNOWN BY MANY NAMES.
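To illustrate the gradient descent rule above, here is a minimal sketch, assuming Python; the quadratic objective, starting point, and iteration count are arbitrary choices of mine, not values from the slides.

```python
def gradient_descent(grad, w0, eta=0.05, steps=100):
    """Repeatedly apply the rule w <- w - eta * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Example: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_min)  # converges close to 3.0
```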
Back-propagation algorithm
1- Initialize the weights to small random values.
2- Randomly choose an input pattern x(u).
3- Propagate the signal forward through the network.
4- Compute the deltas δ_i^L in the output layer (where O_i = y_i^L): δ_i^L = g′(h_i^L) [d_i^u − y_i^L], where h_i^l represents the net input to the ith unit in the lth layer and g′ is the derivative of the activation function g.
5- Compute the deltas for the preceding layers by propagating the errors backwards: δ_i^l = g′(h_i^l) Σ_j w_ij^(l+1) δ_j^(l+1), for l = L−1, ..., 1.
6- Update the weights using Δw_ji^l = η δ_i^l y_j^(l−1).
7- Go to step 2 and repeat for the next pattern until the error in the output layer is acceptably low, or a prespecified number of iterations is reached.

Backpropagation algorithm (instance-based)
1. Randomize the weights {w_s} to small random values (both positive and negative) to ensure that the network is not saturated by large weight values.
2. Select an instance t, that is, the vector {x_k(t)}, k = 1, ..., N_inp (a pair of input and output patterns), from the training set.
3. Apply the input vector to the network input.
4. Calculate the network output vector {z_k(t)}, k = 1, ..., N_out.
5. Calculate the error for each of the outputs k, k = 1, ..., N_out, as the difference between the desired output and the network output (for simplicity we will denote the resulting total error simply as E).
6. Calculate the necessary weight updates Δw_s in a way that minimizes this error (discussed below).
7. Adjust the weights of the network by Δw_s.
8. Repeat steps 2 to 7 for each instance (pair of input-output vectors) in the training set until the error for the entire system (the error E defined above, or the error on a cross-validation set) is acceptably low, or the pre-defined number of iterations is reached.

Backpropagation algorithm
• Often it is reasonable not to update the weights immediately after processing each instance, but to accumulate (sum up) the necessary changes across a subset of training instances (called an epoch) and only then update the weights. This allows for faster convergence (Smith 1993). (See the short sketch below contrasting the two update schemes.)
• An epoch can be part of, or the whole of, the training set. After the whole training set is processed (this sequence of steps is called an iteration), the whole process is repeated again in an iterative fashion until the total error is acceptably low.
• The number of such iterations may sometimes be as high as several thousand.

Backpropagation algorithm (epoch-based, with cumulative updates)
1-6. As above.
7. Add the calculated weight updates {Δw_s} to the accumulated total updates {ΔW_s}.
8. Repeat steps 2 to 7 for the several instances comprising an epoch.
9. Adjust the weights {w_s} of the network by the accumulated updates {ΔW_s}.
10. Repeat steps 2 to 9 until all instances in the training set are processed. This constitutes one iteration.
11. Repeat the iteration of steps 2 to 10 until the error for the entire system (the error E defined above, or the error on a cross-validation set) is acceptably low, or the predefined number of iterations is reached.

Backpropagation
• In a single-layer network, each neuron adjusts its weights according to what output was expected of it and the output it gave. This can be expressed mathematically by the perceptron delta rule, w ← w + η (d − y) x, where w is the array of weights and x is the array of inputs.

The Sigmoid (logistic) function
• One of the more popular activation functions used with back-propagation nets is the sigmoid (logistic) function, g(x) = 1 / (1 + e^(−x)).

The perceptron learning rule
• w ← w + η (d_i − y_i) x, where w is the array of weights, x is the array of inputs, and η is the learning rate; y_i and d_i are the actual and desired outputs, respectively.
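As noted above, here is a minimal sketch of the difference between per-instance (on-line) updates and epoch-based (cumulative) updates, assuming Python. The function `backprop_deltas` stands for whatever routine computes the weight changes for one instance (steps 3 to 6 above); it is only a placeholder name of mine, and the weights are treated as one flat list for simplicity.

```python
def train_online(W, data, backprop_deltas, eta):
    """Instance-based: apply the weight changes after every training pair."""
    for x, d in data:
        dW = backprop_deltas(W, x, d, eta)            # steps 3-6 for one instance
        W = [w + dw for w, dw in zip(W, dW)]          # step 7: adjust immediately
    return W

def train_epoch(W, data, backprop_deltas, eta):
    """Epoch-based: accumulate the changes over the whole epoch, then apply once."""
    total = [0.0] * len(W)
    for x, d in data:
        dW = backprop_deltas(W, x, d, eta)
        total = [t + dw for t, dw in zip(total, dW)]  # step 7: accumulate
    return [w + t for w, t in zip(W, total)]          # step 9: one update per epoch
```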
• For the sigmoid activation, the deltas for the output layer are calculated as δ = y (1 − y) (d − y), since g′(h) = g(h)(1 − g(h)).

Calculating delta for the hidden layers
• We have to know the effect on the output of the neuron if a weight is to change.
• Therefore, we need to know the derivative of the error with respect to that weight.
• It can be shown that for neuron q in hidden layer p, the delta is δ_q^p = y_q^p (1 − y_q^p) Σ_j w_qj^(p+1) δ_j^(p+1).
• Each delta value for a hidden layer requires that the delta values for the layer after it be calculated first.

Backpropagation example
• The network has N_INPUTS = 2 and N_HIDDEN = 2, with a constant bias input of 1 feeding each layer. Weights W1(k, i) connect input k to hidden unit i, and weights W2(k, 1) connect hidden unit k to the single output unit.
1- Initialize the weights of layer 1 and layer 2 to small random values.
2- Randomly choose an input pattern X(u).
3- Propagate the signal forward through the network:
Layer 1: X2(i) = g(Σ_{k=0,1,2} W1(k, i) X(k)), with g(x) = 1 / (1 + e^(−x)).
Layer 2: Out(x) = X3(1) = g(Σ_{k=0,1,2} W2(k, 1) X2(k)).
4- Compute δ_i^L in the output layer (O_i = y_i^L): δ_i^L = g′(h_i^L) [d_i^u − y_i^L]. For the sigmoid this gives d3(1) = x3(1) (1 − x3(1)) (d − x3(1)).
5- Compute the deltas for the preceding layers by propagating the errors backwards: δ_i^l = g′(h_i^l) Σ_j w_ij^(l+1) δ_j^(l+1), for l = L−1, ..., 1.
6- Update the weights using Δw_ji^l = η δ_i^l y_j^(l−1). Taking η as 0.05, the weight from the constant input x1(0) into hidden unit 1, for example, is updated by η · x1(0) · d2(1).
7- Go to step 2 and repeat for the next pattern until the error in the output layer is acceptably low, or a prespecified number of iterations is reached.
• Run the entire process again on the next set of training data.
• Slowly, as the training data is fed in and the network is retrained a few thousand times, the network settles to certain weight values. (A complete, runnable sketch of this example appears at the end of these notes.)

APPLICATIONS
• To successfully work with real-world problems, you must deal with numerous design issues, including the network model, network size, activation function, learning parameters, and number of training samples.
• Typical application areas include:
• Pattern classification
• Clustering
• Function approximation
• Prediction
• Optimization
• Content-addressable memory
• Control
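As promised in the back-propagation example, here is a minimal, self-contained sketch of a 2-2-1 sigmoid network trained with on-line back-propagation, assuming Python. The XOR task, the learning rate, the iteration count, and the initialization range are illustrative choices of mine, not values from the slides (the slides only mention η = 0.05 in the single update shown above).

```python
import math, random

def g(x):                                              # sigmoid activation
    return 1.0 / (1.0 + math.exp(-x))

def train_xor(eta=0.5, iterations=20000, seed=1):
    """2-2-1 sigmoid network trained with on-line back-propagation on XOR.
    W1[k][i]: weight from input k (k=0 is the bias) to hidden unit i;
    W2[k]: weight from hidden unit k (k=0 is the bias) to the output unit."""
    random.seed(seed)
    W1 = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]
    W2 = [random.uniform(-0.5, 0.5) for _ in range(3)]
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    for _ in range(iterations):
        x, d = random.choice(data)                     # step 2: pick a pattern
        x1 = [1.0, x[0], x[1]]                         # input layer with bias
        x2 = [1.0] + [g(sum(W1[k][i] * x1[k] for k in range(3))) for i in range(2)]
        x3 = g(sum(W2[k] * x2[k] for k in range(3)))   # step 3: forward pass
        d3 = x3 * (1 - x3) * (d - x3)                  # step 4: output delta
        d2 = [0.0] + [x2[i] * (1 - x2[i]) * W2[i] * d3 for i in (1, 2)]  # step 5: hidden deltas
        for k in range(3):                             # step 6: weight updates
            W2[k] += eta * d3 * x2[k]
            for i in range(2):
                W1[k][i] += eta * d2[i + 1] * x1[k]
        # step 7: in a fuller version, stop once the error is acceptably low
    return W1, W2

W1, W2 = train_xor()
for x, d in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    x1 = [1.0, x[0], x[1]]
    x2 = [1.0] + [g(sum(W1[k][i] * x1[k] for k in range(3))) for i in range(2)]
    # outputs should be close to 0, 1, 1, 0; with an unlucky random seed,
    # re-run with another seed or more iterations
    print(x, d, round(g(sum(W2[k] * x2[k] for k in range(3))), 2))
```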