MITM 613
Intelligent System
Chapter 8: Neural networks
Abdul Rahim Ahmad
Chapter Eight : Neural Networks
8.1 Introduction
8.2 Neural network applications
Nonlinear estimation
Classification
Clustering
Content-addressable memory
8.3 Nodes and interconnections
8.4 Single and multilayer perceptrons
Network topology
Perceptrons as classifiers
Training a perceptron
Hierarchical perceptrons
Some practical considerations
8.5 The Hopfield network
8.6 MAXNET
8.7 The Hamming network
8.8 Adaptive Resonance Theory (ART) networks
8.9 Kohonen self-organizing networks
8.10 Radial basis function networks
Artificial neural networks
Artificial Neural Networks (ANN)
 ANN - A family of techniques for numerical learning.
 Consist of many nonlinear computational elements which
form the network nodes or neurons, linked by weighted
interconnections.
 Analogous in structure to the biological neurological
system, but much simpler and effective at certain tasks,
such as classification.
 Generally neural network is taken to mean artificial
neural network.
Artificial neural networks
 Using neural networks is described as connectionism.
 Each node in a neural network may have several inputs,
each of which has an associated weighting.
 The node performs a simple computation on its input
values, which are single integers or real numbers, to
produce a single numerical value as its output.
 The output from a node can either form an input to other
nodes or be part of the output from the network as a
whole.
Artificial neural networks
 Overall effect -> a pattern of numbers is generated at its
outputs in response to a pattern of numbers at its
inputs.
 These patterns of numbers are one-dimensional arrays
known as vectors, e.g., (0.1, 1.0, 0.2).
 Each neuron performs its computation independently.
 Outputs from some neurons may form the inputs to
others.
 Thus, neural networks have a highly parallel structure,
allowing them to explore many competing hypotheses
simultaneously.
Artificial neural networks
 Parallelism offers the chance to take advantage of parallel-processing computers.
 ANNs can also run on conventional serial computers, although processing takes longer.
 ANNs are tolerant of the failure of individual neurons or interconnections.
 ANN performance degrades gracefully if localized failures occur within the network.
 The weights on the node interconnections, together with
the overall topology, define the output vector that is
derived by the network from a given input vector.
The Weights
 In supervised learning:
 Examples are presented along with the corresponding desired
output vectors.
 Weights are adjusted with each iteration until the actual output
for each input is close to the desired vector.
 In unsupervised learning:
 Examples are presented without any corresponding desired
output vectors.
 Weights are adjusted in accordance with naturally occurring
patterns in the data, using a suitable training algorithm.
 The output vector represents the position of the input vector
within the discovered patterns of the data.
 When presented with noisy or incomplete data, an ANN
produces an approximate answer rather than an incorrect one.
 When presented with unfamiliar data within the range of
its previously seen examples, an ANN will generally produce
a reasonable output, interpolated between the example
outputs.
 However, an ANN is unable to extrapolate reliably beyond
the range of the previously seen examples.
 Fuzzy logic can also be used for interpolation; therefore, ANNs
and fuzzy logic are alternative solutions to engineering
problems and may be combined in a hybrid system.
ANN applications
 ANN can be applied to many tasks.
 ANN associates input vector (x1, x2, … xn) with
output vector (y1, y2, … ym)
 The function linking the input and output may be
unknown and can be highly nonlinear.
 (A linear function is one that can be represented as f(x) = mx + c, where
m and c are constants; a nonlinear one may include higher-order terms in x,
or trigonometric or logarithmic functions of x.)
Application 1: Nonlinear estimation
 An ANN can determine values of variables that
cannot be measured easily but are known to depend on other,
more accessible variables.
 The measurable variables form the network input vector
and the unknown variables constitute the output vector.
 In Nonlinear estimation, the network is initially trained
using a set of examples known as the training data.
 Supervised learning is used;
 i.e., each example in the training data comprises two vectors: an input
vector and its corresponding desired output vector.
 This assumes that some values for the less accessible variable
have been obtained to form the desired outputs.
Application 1: Nonlinear estimation
 During training, the network learns to associate the
example input vectors with their desired output vectors.
 When it is subsequently presented with a previously
unseen input vector, the network is able to interpolate
between similar examples in the training data to generate
an output vector.
Classification
 The output vector classifies the input into one of a set of known
possible classes.
 Example: speech recognition system:
 Classify input into 3 different words: yes, no, and maybe.
 Input: Preprocessed digitized sound of the words
 Output: (0, 0, 1) for yes
(0, 1, 0) for no
(1, 0, 0) for maybe.
 During training, the network learns to associate similar input
vectors with a particular output vector.
 When it is subsequently presented with a previously unseen input
vector, the network selects the output vector that offers the
closest match.
Clustering
 Unsupervised learning
 Input vectors are clustered into N groups (N is an integer; it may be
prespecified or may be allowed to grow according to the diversity of
the data).
 Example: In speech recognition
 Input: only spoken words
 Training: cluster together examples that are similar to each other (e.g., according
to different words or voices).
 Once the clusters have formed, a second neural network is trained to associate
each cluster with a particular desired output.
 The overall system then becomes a classifier, where the first network is
unsupervised and the second one is supervised.
 Clustering is useful for data compression and is an important aspect of data
mining, i.e., finding patterns in complex data.
Content-addressable memory
 A form of unsupervised learning.
 No desired output vectors are associated with the training data. During
training, each example input vector becomes stored in a dispersed form
through the network.
 When a previously unseen vector is subsequently presented to the
network, it is treated as though it were an incomplete or error-ridden
version of one of the stored examples.
 So the network regenerates the stored example that most closely
resembles the presented vector.
 This can be thought of as a type of classification, where each of the
examples in the training data belongs to a separate class, and each
represents the ideal vector for that class.
Nodes and interconnections
 Node or neuron is a simple computing element having an input side and an output side.
 Each node may have directional connections to many other nodes at both its input and output sides.
 Each input xi is multiplied by its associated weight wi.
 Typically, the node’s role is to sum each of its weighted inputs and add a bias term w0 to form an intermediate quantity called the activation, a.
 It then passes the activation through a nonlinear function ft known as the transfer function or activation function. The figure shows the function of a single neuron.
Nodes and interconnections
 The behavior of a neural network depends on its topology, the weights, the bias terms, and the transfer function.
 The weights and biases can be learned, and the learning behavior of a network depends on the chosen training algorithm.
 Typically a sigmoid function is used as the transfer function.
 For each neuron, the activation is given by
a = w0 + w1x1 + w2x2 + … + wnxn
where n is the number of inputs and the bias term w0 is defined separately for each node.
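As a minimal illustration of this computation, the Python sketch below (the names and values are illustrative, not from the slides) computes a single neuron's activation and passes it through a sigmoid transfer function.

import math

def neuron_output(inputs, weights, bias):
    # activation: bias plus the weighted sum of the inputs
    a = bias + sum(w * x for w, x in zip(weights, inputs))
    # sigmoid transfer function maps the activation into (0, 1)
    return 1.0 / (1.0 + math.exp(-a))

# example: three inputs, arbitrary weights and bias
print(neuron_output([0.1, 1.0, 0.2], [0.5, -0.3, 0.8], 0.1))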
Typical Transfer Functions
 Non-linear transfer function:
Sigmoid function
Ramp function
Step function
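A small sketch of the three transfer functions listed above; the exact ramp limits are not given in the slides, so the 0-to-1 clipping below is an assumption.

import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def step(a):
    # hard threshold at zero
    return 1.0 if a >= 0 else 0.0

def ramp(a):
    # linear between 0 and 1, clipped outside that range (assumed limits)
    return min(1.0, max(0.0, a))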
MultiLayer Perceptron (MLP)
 The neurons are organized in layers.
 Each neuron is totally connected to the
neurons in the layers above and below,
but not to the neurons in the same
layer.
 These networks are also called feed
forward networks.
 MLPs can be used either for
classification or as nonlinear
estimators.
 The number of nodes in each layer and the
number of layers are determined by
the network builder, often on a trial-and-error basis.
 There is always an input layer and an
output layer; the number of nodes in
each is determined by the number of
inputs and outputs being considered.
MultiLayer Perceptron (MLP)
 Can have any number of
hidden layers between the input
and output layers.
 The hidden layers have no obvious meaning
associated with them.
 If there are no hidden layers, the
network is a single layer
perceptron (SLP).
 The network shown has:
 Three input nodes
 Two hidden layers with four
nodes each.
 One output layer of two nodes.
 Short form name is 3–4–4–2
MLP.
MultiLayer Perceptron (MLP)
 Data are fed forward from the input
layer, through the hidden layers, to the
output layer.
 The inputs to a node are the outputs
from each node in the previous layer;
nodes in the input layer receive the
external inputs instead.
 At each node except input layer, the
data are weighted, summed, added
to the bias, and then passed
through the transfer function.
 In the counting of layers, the input
nodes are not included since they do
not perform any processing.
 The network in the figure is a three
layer MLP.
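The feed-forward computation just described can be sketched as follows: an illustrative forward pass through a 3–4–4–2 MLP with random weights and a sigmoid transfer function (not code from the lecture).

import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer_output(inputs, weights, biases):
    # one output per node: bias + weighted sum, passed through the transfer function
    return [sigmoid(b + sum(w * x for w, x in zip(ws, inputs)))
            for ws, b in zip(weights, biases)]

def make_layer(n_inputs, n_nodes):
    weights = [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_nodes)]
    biases = [random.uniform(-1, 1) for _ in range(n_nodes)]
    return weights, biases

# a 3-4-4-2 MLP: the input layer does no processing, so only 3 weight layers exist
layers = [make_layer(3, 4), make_layer(4, 4), make_layer(4, 2)]

x = [0.2, 0.7, -0.1]              # input vector
for weights, biases in layers:
    x = layer_output(x, weights, biases)
print(x)                          # output vector from the two output nodes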
Perceptrons as classifiers
 Normally there is one input node for each element of the input
vector and one output node for each element of the output
vector.
 Each output node would usually represent a particular class.
 A typical representation for a class would be:
 ~1 for the node of that class and ~0 for the rest.
 For the case where the input does not fall into any class, the winning node must exceed a
predetermined threshold such as 0.5.
 Other representations are possible, such as two output nodes representing
four classes, e.g., (0,0), (0,1), (1,0), and (1,1).
Linear classifiers
 Example: single layer perceptron
 Input: 2 neurons
 Output: 3 classes
 Each class has 1 dividing line
 Linearly separable
 The output of a node, prior to application of the transfer function, is given by:
a = w0 + w1x1 + w2x2
 The dividing criterion is assumed to be a = 0, corresponding to an output of 0.5 after the application of the sigmoid transfer function.
 Thus the hyperplane that separates the two regions is given by:
w0 + w1x1 + w2x2 = 0
 In the form of a straight line (as illustrated in the sketch below):
x2 = -(w1/w2)x1 - (w0/w2)
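To make the dividing line concrete, the sketch below evaluates one output node of a single layer perceptron with two inputs; the bias and weight values are illustrative assumptions.

import math

w0, w1, w2 = -0.5, 1.0, 1.0     # assumed bias and weights for one output node

def node_output(x1, x2):
    a = w0 + w1 * x1 + w2 * x2            # activation
    return 1.0 / (1.0 + math.exp(-a))     # sigmoid output

# points on either side of the line x2 = -(w1/w2)x1 - (w0/w2)
print(node_output(0.1, 0.1))    # below the line -> output < 0.5
print(node_output(0.6, 0.6))    # above the line -> output > 0.5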
Nonlinear classifiers
 Multilayer perceptron:
 one hidden layer
 a differentiable, i.e., smooth, transfer function such as the sigmoid function
 The first layer divides the state space with straight lines (or hyperplanes).
 The 2nd layer forms multifaceted regions by Boolean combinations (AND, OR, and NOT) of the linearly separated regions.
 To perform any nonlinear mapping or classification with an MLP:
 With a sigmoid transfer function, one hidden layer is needed.
 With a step transfer function, no more than two hidden layers are required.
 Learning cannot be guaranteed; the final topology involves trial and error: start small, then expand.
Training a perceptron
 Training separates the regions in state space by adjusting the weights
and biases.
 The difference between the generated value and the desired value is the
error.
 The overall error is expressed as the root mean square (RMS) of
the individual errors, both -ve and +ve (see the sketch below).
 Training minimizes the RMS error by altering the weights and biases, through
many passes of the training data.
 This search for the weights and biases that give the minimum RMS
error is an optimization problem, with the RMS error as the cost function.
 When the RMS error falls within a small range, we say that the network
has converged.
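As a small illustration of the cost function described above, the following Python sketch (illustrative only, not from the lecture) computes the RMS error between a desired and an actual output vector.

import math

def rms_error(desired, actual):
    # root mean square of the individual errors, whatever their sign
    errors = [d - a for d, a in zip(desired, actual)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

print(rms_error([1.0, 0.0, 0.0], [0.8, 0.1, 0.2]))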
Training Algorithm
 Most common: the back-error propagation (BP) algorithm
(or generalized delta rule).
 A gradient-proportional descent technique requiring a
continuous and differentiable transfer function, such as the
sigmoid.
 For the sigmoid function f(x) = 1/(1 + e^-x),
the derivative is f'(x) = f(x)(1 - f(x)).
Nomenclature for BP algorithm
BP Training Algorithm
 In BP, the bias is treated as a weight on an extra input that is fixed at 1.
 The delta rule determines the modifications to the weights as follows:
Δwji = η δAj xBi
for all nodes j in layer A and all nodes i in layer B (B = A + 1), where xBi is the output of node i in layer B and η is the learning rate.
 Neurons in the output and hidden layers have an error term,
δ. When the sigmoid transfer function is used, δAj is
given by:
δAj = xAj (1 - xAj) (dAj - xAj) for output nodes
δAj = xAj (1 - xAj) Σk δk wkj for hidden nodes, where the sum runs over the nodes k fed by node j.
BP Training Algorithm (cont.)
 The learning rate, η, is applied to the calculated values of
δAj and should be about 0.35.
 Sometimes a momentum coefficient, α, is included.
 The momentum term forces changes in the weights to be
dependent on previous weight changes, as sketched below.
 The momentum coefficient must be in the range 0–1.
 Some suggest setting α to 0.0 for the first few training
passes and then increasing it to 0.9.
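The weight update just described can be sketched as follows. This is a generic illustration of the generalized delta rule with a momentum term for a sigmoid network; the function names and example values are assumptions, not notation from the slides.

def update_weight(w, delta_j, x_i, prev_change, eta=0.35, alpha=0.9):
    # generalized delta rule with momentum: the new change depends on the
    # error term, the input, and the previous weight change
    change = eta * delta_j * x_i + alpha * prev_change
    return w + change, change

def delta_output(y, d):
    # error term for an output node with a sigmoid transfer function
    return y * (1.0 - y) * (d - y)

def delta_hidden(y, downstream_deltas, downstream_weights):
    # error term for a hidden node, back-propagated from the nodes it feeds
    return y * (1.0 - y) * sum(dl * wl for dl, wl in zip(downstream_deltas, downstream_weights))

print(update_weight(0.5, delta_output(0.8, 1.0), 0.6, prev_change=0.0))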
BP algorithm
 2 stages
 Gather error term
 Update weights
 Repeat as many times as required
Hierarchical Perceptrons
 In complex problems it is
recommended to divide the
MLP into several smaller
MLPs arranged in a
hierarchy.
 Each MLP is independent
of the others and can be
trained separately or in
parallel.
Some Practical Considerations
 Stop training when the RMS error remains constant, so as not to
over-train the network (an over-trained network is expert at giving the correct output
for the training data, but not for new data).
 Some reasons for over-training:
 too many cycles of training
 an over-complex network (too many hidden layers or neurons)
 To avoid over-training:
 divide the data into training, testing, and validation sets
 use the leave-one-out method
 use scaled data
Effects of Over-training
Hopfield network
 One layer: nodes are used for both input and output.
 Used as a content-addressable memory.
 Input: binary (1 and -1).
 Output: binary.
 Transfer function ft is a step nonlinearity.
 If the network has Nn nodes, then the input and output each comprise Nn binary digits.
 Weights and biases are set from the training examples; wij is the weighting on the connection from node i to node j, wi0 is the
bias on node i, and xik is the ith digit of example k. There are no circular
connections from a node to itself, hence wij = 0 where i = j.
Learning in Hopfield
 Setting the weights constitutes the learning phase; it results in
the examples being stored in a distributed fashion in the
network.
 A new input initially forms the output too (as the nodes are
used for both input and output).
 The node function is performed on each node in parallel.
If this is repeated many times, the output is
progressively modified and converges on the stored example
that most closely resembles the initial input (a sketch follows).
 The number of examples (Ne) should not exceed 0.15 Nn.
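A minimal sketch of storing and recalling patterns in a Hopfield network. The weight-setting rule used here is the standard Hebbian one (wij summed over the examples as xik*xjk, with wii = 0 and zero biases), which is an assumption since the slides' own equations are not reproduced above.

def hopfield_train(examples):
    # examples: list of patterns of +1/-1 values, all of length Nn
    n = len(examples[0])
    w = [[0.0] * n for _ in range(n)]
    for x in examples:
        for i in range(n):
            for j in range(n):
                if i != j:                  # no circular connections
                    w[i][j] += x[i] * x[j]
    return w

def hopfield_recall(w, pattern, steps=10):
    x = list(pattern)
    n = len(x)
    for _ in range(steps):                  # repeat the node function many times
        x = [1 if sum(w[i][j] * x[j] for j in range(n)) >= 0 else -1
             for i in range(n)]
    return x

stored = [[1, -1, 1, -1, 1, -1], [1, 1, 1, -1, -1, -1]]
w = hopfield_train(stored)
print(hopfield_recall(w, [1, -1, 1, -1, 1, 1]))   # noisy version of the first example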
MAXNET
 Topology: identical to the
Hopfield network.
 Difference: the weights on the
circular interconnections, wii,
are not always zero.
 Used to recognize which of its
inputs has the highest value.
 Used in conjunction with an MLP
to select the output node that
generates the highest value.
 The interconnection weights are
set so that each node excites itself and inhibits all the others; a common choice is wii = 1 and wij = -ε for i ≠ j, where ε is a small positive value (less than 1/Nn).
Circular connections from
a node to itself are
allowed in the MAXNET,
but are disallowed in the
Hopfield network
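A minimal sketch of the MAXNET competition using the excitatory/inhibitory weights described above; the stopping condition and the default value of ε below are illustrative assumptions.

def maxnet(values, epsilon=None, max_iter=100):
    y = list(values)
    n = len(y)
    eps = epsilon if epsilon is not None else 1.0 / (n + 1)   # assumed: epsilon < 1/Nn
    for _ in range(max_iter):
        # each node keeps its own value and is inhibited by all the others
        y = [max(0.0, y[i] - eps * (sum(y) - y[i])) for i in range(n)]
        if sum(1 for v in y if v > 0) <= 1:   # stop once a single node remains active
            break
    return y

print(maxnet([0.3, 0.9, 0.7, 0.2]))   # only the node with the highest input stays non-zero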
Comparison
Hamming network
 Has two parts:
 a 2-layer feed-forward
network, used to compare
the input vector with each of
the stored examples, awarding a
matching score to each
example (see the sketch below)
 a MAXNET, used to pick out
the example that has attained
the highest score
 The overall effect is to
categorize the input vector.
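The matching-score part of the Hamming network can be sketched as follows. The scoring scheme (number of agreeing digits) and the example patterns are assumptions for illustration; in the full network a MAXNET, not the index/max call below, selects the winner.

def matching_scores(input_vec, examples):
    # score = number of positions in which the input agrees with the stored example
    return [sum(1 for a, b in zip(input_vec, ex) if a == b) for ex in examples]

examples = [[1, 1, 1, 1, -1, -1],
            [-1, -1, 1, 1, 1, 1],
            [1, -1, 1, -1, 1, -1]]
scores = matching_scores([1, 1, 1, -1, -1, -1], examples)
print(scores)                         # [5, 1, 4]
print(scores.index(max(scores)))      # a MAXNET performs this selection in the real network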
Adaptive Resonance Theory (ART)
 Unsupervised; comprises 2-way
interconnections between input
nodes and a MAXNET.
 Classifies the incoming data into
clusters.
 The 1st example is stored as a model
pattern; the 2nd example is
compared with the 1st and either joins the same
cluster or is stored as a new model pattern.
 How the differences are measured is determined by the closeness measure.
 Each new example is compared with all
current model patterns in parallel (a simplified sketch follows).
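The cluster-or-create behaviour described above can be illustrated with a deliberately simplified, ART-like loop. This is not the ART algorithm of the next slide: the Euclidean closeness measure and the fixed vigilance threshold are assumptions made purely for illustration.

def art_like_clustering(examples, vigilance=0.5):
    prototypes = []   # one stored model pattern per cluster
    labels = []
    for x in examples:
        # compare the new example with all current model patterns
        distances = [sum((a - b) ** 2 for a, b in zip(x, p)) ** 0.5 for p in prototypes]
        if distances and min(distances) <= vigilance:
            labels.append(distances.index(min(distances)))   # join the closest cluster
        else:
            prototypes.append(list(x))                        # store as a new model pattern
            labels.append(len(prototypes) - 1)
    return labels

print(art_like_clustering([[0.1, 0.1], [0.2, 0.1], [0.9, 0.8], [0.85, 0.9]]))   # [0, 0, 1, 1]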
ART algorithm
Kohonen Self-organizing Networks
 Unsupervised - also called self-organizing maps (SOMs).
 Topology:
 Processing nodes arranged in a 2-D array (Kohonen layer)
 1-D layer of input nodes, each input node connected to each
node in the Kohonen layer.
 Used to cluster together similar patterns.
 Learning involves competition between the neurons to
respond to a particular input vector.
 Weights of the “winner” are set to generate a high output (~1).
 Weights on nearby neurons (the neighborhood) are also adjusted to be high.
 Weights on the “losers” are unchanged, as sketched below.
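One competitive training step can be sketched as follows. The Euclidean distance measure, the learning rate, and the square neighborhood are illustrative assumptions, not values from the slides.

def som_train_step(grid_weights, x, learning_rate=0.5, radius=1):
    # grid_weights[r][c] is the weight vector of the Kohonen-layer neuron at (r, c)
    rows, cols = len(grid_weights), len(grid_weights[0])
    # find the "winner": the neuron whose weight vector is closest to the input
    winner = min(((r, c) for r in range(rows) for c in range(cols)),
                 key=lambda rc: sum((w - xi) ** 2
                                    for w, xi in zip(grid_weights[rc[0]][rc[1]], x)))
    # move the winner and its neighborhood towards the input; "losers" are unchanged
    for r in range(rows):
        for c in range(cols):
            if abs(r - winner[0]) <= radius and abs(c - winner[1]) <= radius:
                grid_weights[r][c] = [w + learning_rate * (xi - w)
                                      for w, xi in zip(grid_weights[r][c], x)]
    return winner

grid = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]   # 3x3 Kohonen layer, 2-D inputs
print(som_train_step(grid, [0.4, 0.6]))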
Kohonen Self-organizing Networks
 When the trained network is presented with an input pattern, one neuron
in the Kohonen layer will produce an output larger than the others, and is
said to have fired. When a second similar pattern is presented, the same
neuron or one in its neighborhood will fire.
 As similar patterns cause topologically close neurons to fire, clustering of
similar patterns is achieved.
Kohonen Self-organizing Networks
 This can be demonstrated by training the network using pairs of Cartesian
coordinates: the distribution of the firing neurons corresponds with
the Cartesian coordinates represented by the inputs.
 Thus, if the input elements fall in the range between –1 and 1, then an input
vector of (–0.9, 0.9) will cause a neuron close to one corner of the Kohonen
layer to fire, while an input vector of (0.9, –0.9) would cause a neuron close to
the opposite corner to fire.
 Can form part of a hybrid network for supervised learning.
 Can pass coordinates of the firing neuron in a SOM to an MLP
 learning takes place in two distinct phases
 First, the Kohonen self-organizing network learns, without supervision, to
associate regions in the pattern space with clusters of neurons in the
Kohonen layer.
 Second, an MLP learns to associate the coordinates of the firing neuron in
the Kohonen layer with the desired class.
Radial Basis Function (RBF) networks
 Unsupervised and
feedforward.
 The overall architecture is
similar to a 3-layer
perceptron (i.e., an MLP with
one hidden layer).
 The input and output
neurons are similar to those of a
perceptron.
 Each neuron in the hidden
layer computes a symmetrical
function of its inputs, a radial basis
function (RBF).
RBF networks (cont.)
 The input neurons simply feed the input data into the
nodes above.
 The neurons in the output layer produce the weighted
sum of their inputs, passed through a linear transfer
function.
 For an input vector (x1, x2, … xn), a neuron i in the
hidden layer produces an output, yi, given by:
yi = fr(ri), where ri = sqrt( Σj (xj - wij)² )
 where wij are the weights on the inputs to neuron i, and fr is a
radial basis function (RBF); ri is therefore the Euclidean distance
between the input vector and the neuron's weight vector (see the sketch below).
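The hidden-layer computation just described can be sketched as follows, assuming the Gaussian RBF introduced on the next slide; the output bias and all example values are illustrative assumptions.

import math

def rbf_hidden_output(x, centre, sigma):
    # r: Euclidean distance from the neuron's weight vector (its centre) to the input
    r = math.sqrt(sum((xj - wj) ** 2 for xj, wj in zip(x, centre)))
    # Gaussian radial basis function of that distance
    return math.exp(-(r ** 2) / (2 * sigma ** 2))

def rbf_network_output(x, centres, sigmas, output_weights, output_bias=0.0):
    # output neuron: weighted sum of the hidden-layer outputs, linear transfer function
    hidden = [rbf_hidden_output(x, c, s) for c, s in zip(centres, sigmas)]
    return output_bias + sum(w * h for w, h in zip(output_weights, hidden))

print(rbf_network_output([0.2, 0.3],
                         centres=[[0.0, 0.0], [1.0, 1.0]],
                         sigmas=[0.5, 0.5],
                         output_weights=[1.0, -1.0]))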
Gaussian RBF
 The most commonly used
RBF is a Gaussian
function:
fr(ri) = exp( -ri² / (2σi²) )
 where σi is the standard
deviation of the distribution
described by the function.
 Each neuron, i, in the
hidden layer has its own
separate value for σi.
Training
 If the set of weights (wi1, wi2, … win) on a
given neuron i is treated as the coordinates of a
point in pattern space, then ri is the Euclidean
distance from there to the point represented by
the input vector (x1, x2, … xn).
 During unsupervised learning, the network
adjusts the weights (the centers in an RBF
network) so that each point (wi1, wi2, … win)
represents the center of a cluster of data points
in pattern space.
Training (cont.)
 The sizes of the clusters are defined by adjusting the
variables σi (or equivalent variables if an RBF
other than the Gaussian is used). Data points
within a certain range, e.g., 2σi from a cluster
center, might be deemed members of the cluster.
 An RBF network can be thought of as drawing
circles around clusters in 2-D space, or
hyperspheres in n-D space.
 One such cluster can be identified for each
neuron in hidden layer.
Training (cont.)
 For a Gaussian function in 2-D space,
it can be seen that a fixed
output value (e.g., 0.5) defines
a circle in the pattern space.
 Hidden layer: unsupervised
learning, forming clusters.
 Output layer: supervised
learning, associating each cluster
with a particular class.
 Several circular
clusters of varying center and
size can be associated with a single class.