THE UNIVERSITY OF BIRMINGHAM
Degree of B.Sc. with Honours
Artificial Intelligence and Computer Science. Second Examination
Computer Science/Software Engineering. Second Examination
Joint Honours Degree of B.Sc. with Honours
Mathematics and Artificial Intelligence. Second Examination
Psychology and Artificial Intelligence. Second Examination
06 02360
(SEM 2A2)
Introduction to Neural Networks
August 2000 (RESIT) 2 hours
[Answer ALL Questions]
1.
Answer the following questions:
(a)
Among the various non-linear activation functions, why do we prefer to use the logistic function in back-propagation networks?
[10%]
ANSWER: The primary reason is the ease and efficiency of computing derivatives. BP is in essence a gradient descent optimisation algorithm: it requires the computation of derivatives of the error function with respect to the weights. As BP training involves thousands of epochs through a training set, and each presentation of a training pattern leads to a calculation of derivatives unless the error is zero, it is of great practical importance to be able to compute the derivatives of an activation function efficiently. Because the derivatives of a logistic function can easily be computed from its function values, we prefer to use it in neural networks.
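To illustrate (a minimal Python sketch, not part of the original paper): the derivative of the logistic function satisfies sigma'(u) = sigma(u)(1 - sigma(u)), so during back-propagation it can be obtained from the stored activation with a single multiplication, without any further calls to exp().

```python
import numpy as np

def logistic(u):
    # Logistic activation: sigma(u) = 1 / (1 + exp(-u))
    return 1.0 / (1.0 + np.exp(-u))

def logistic_derivative(o):
    # Derivative from the activation value alone:
    # sigma'(u) = sigma(u) * (1 - sigma(u)) = o * (1 - o)
    return o * (1.0 - o)

u = np.array([-2.0, 0.0, 2.0])
o = logistic(u)                 # the forward pass stores these values anyway
print(logistic_derivative(o))   # derivatives reuse them at negligible cost
```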
(b)
One of the useful “tricks” in training a back-propagation network with a sigmoidal activation function is to initialise all weights at random in a small range, e.g., in [-0.5, 0.5] or [-1.0, 1.0]. Please explain why this is beneficial to network learning. [10%]
ANSWER: It is important to initialise the weights in a small range because large weights tend to make a neuron's output saturate and thus unable to learn. When the inputs are large negative or positive values, the output of a sigmoidal function is nearly constant; in other words, the output is very insensitive to changes in the input. When the inputs to a sigmoidal function are small (in terms of absolute values), its output is most responsive to input changes. In this case, a neural network is able to adapt its weights and learn faster.
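A small illustrative sketch of both points (the numbers are chosen only for demonstration): weights are drawn uniformly from a small range, and the derivative of the logistic function collapses as the input magnitude grows.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.uniform(-0.5, 0.5, size=(3, 2))   # small random initialisation

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

# sigma'(u) = sigma(u)(1 - sigma(u)) shrinks rapidly as |u| grows:
for u in [0.5, 5.0, 15.0]:
    o = logistic(u)
    print(f"u = {u:4.1f}   output = {o:.7f}   derivative = {o * (1 - o):.2e}")
# A saturated output (near 0 or 1) gives a near-zero derivative, and the
# BP weight updates, which are proportional to it, effectively vanish.
```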
(c)
Most data sets obtained from real-world problems contain noise. Why isn’t a
neural network that achieves zero training error the best neural network? Describe
one method that you would use to find a better neural network.
[10%]
ANSWER: A neural network that achieves zero training error would have learned all the noise in the data set, i.e., over-fitted the training data. Such over-fitting leads to poor generalisation of the learned network, which would perform poorly on unseen data. The ultimate goal of learning is to achieve the best generalisation, not the minimum training error. One method that can be used to improve generalisation is to stop training early using a validation set. We first divide all the available data into a training set and a validation set. The training set is used for training as before; the validation set is used to decide when to stop training. In the initial stages of training, the network's errors on both the training and validation sets are expected to decrease. However, at a certain point, the error on the validation set will start increasing again. This is when training should be stopped. In essence, we use the network's performance on the validation set to estimate its generalisation ability. A generic sketch of this procedure is given below.
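A generic early-stopping loop in Python (a sketch only: the step, val_error, get_weights and set_weights callables are assumed to be supplied by the surrounding training code, and the patience parameter, which tolerates brief fluctuations in the validation error, is likewise an illustrative choice):

```python
from typing import Any, Callable

def train_with_early_stopping(step: Callable[[], None],
                              val_error: Callable[[], float],
                              get_weights: Callable[[], Any],
                              set_weights: Callable[[Any], None],
                              max_epochs: int = 1000,
                              patience: int = 10) -> float:
    # `step` runs one BP epoch over the training set;
    # `val_error` returns the current error on the validation set.
    best, best_w, bad_epochs = float("inf"), get_weights(), 0
    for _ in range(max_epochs):
        step()
        e = val_error()
        if e < best:                        # validation error still falling
            best, best_w, bad_epochs = e, get_weights(), 0
        else:                               # validation error rising or flat
            bad_epochs += 1
            if bad_epochs >= patience:
                break                       # stop: generalisation degrading
    set_weights(best_w)                     # restore the best network seen
    return best
```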
(d)
Can a back-propagation training algorithm always find a global minimum of the
error function for a feed-forward neural network? Why or why not?
[5%]
ANSWER: Not always, because (1) the BP algorithm may get stuck in a local minimum due to its gradient descent nature; (2) the network may be inappropriately designed (e.g., too small) for the task; and/or (3) the parameters of the BP algorithm may be set inappropriately, so that the algorithm never converges.
2.
The following figure shows a back-propagation network with all weights given. There are no biases (thresholds) in this network. A training pattern with input (0.2, 0.3, 0.4) is presented. Assume that the logistic function with gain parameter 1 is used for all hidden and output nodes.
(a)
What are the outputs from hidden units H1 and H2?
[5%]
(b)
Assume that the outputs from O1 and O2 are the same as the target outputs, but the output from O3 is different from its target output. List all the weights which will be unchanged after the presentation of this training pattern.
[10%]
(Note: You only need to give correct expressions for questions (a) and (b). Calculation
of exponential functions is not required.)
[Figure: a feed-forward network with inputs x1 = 0.2, x2 = 0.3, x3 = 0.4, hidden units H1 and H2, and output units O1, O2, O3. Weights: w1 = 1.1, w2 = 1.2, w3 = 1.3, w4 = 1.4, w5 = 1.5, w6 = 1.6, w7 = 1.7, w8 = 1.8, w9 = 1.9, w10 = 2.0, w11 = 2.1.]
ANSWER:
(a)
O_H1 = 1 / (1 + e^(-u1)), where u1 = 0.2*1.1 + 0.3*1.3 + 0.4*1.5
O_H2 = 1 / (1 + e^(-u2)), where u2 = 0.2*1.2 + 0.3*1.4 + 0.4*1.6
(b)
w2, w4, w6, w7, w8, w10, w11. They do not change because there is no error signal back-propagated to them.
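As a quick numerical check, a short Python sketch of the forward pass to the hidden units (the weight groupings follow the expressions above; evaluating the exponentials was not required in the exam):

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

x = [0.2, 0.3, 0.4]
w_h1 = [1.1, 1.3, 1.5]                  # w1, w3, w5: inputs -> H1
w_h2 = [1.2, 1.4, 1.6]                  # w2, w4, w6: inputs -> H2

u1 = sum(xi * wi for xi, wi in zip(x, w_h1))   # 0.22 + 0.39 + 0.60 = 1.21
u2 = sum(xi * wi for xi, wi in zip(x, w_h2))   # 0.24 + 0.42 + 0.64 = 1.30
print(logistic(u1), logistic(u2))              # outputs of H1 and H2
```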
3.
What are the similarities and differences between a back-propagation network with Gaussian activation functions and a radial basis function network?
[10%]
ANSWER:
First, both of them are feed-forward NNs. Second, their training algorithms are very different: BP training relies on error back-propagation while RBF training does not, and BP training applies to all layers while RBF training applies only to the hidden-to-output layer. The input-to-hidden layer in an RBF network is not trained; it is determined by the centres of the Gaussian functions. Third, a BP network uses hyperplanes to form decision boundaries while an RBF network uses hyperellipses. Fourth, BP usually has a non-linear output layer while RBF has a linear one. In practice, BP training can be very slow while RBF training is relatively fast.
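To make the structural difference concrete, here is a minimal RBF forward pass in Python (a sketch under illustrative assumptions: the centres, widths and output weights shown are arbitrary numbers; in practice the centres would come from, e.g., clustering, and only the linear output weights would be trained):

```python
import numpy as np

def rbf_forward(x, centres, widths, output_weights):
    # Hidden layer: Gaussian activations around fixed centres
    # (not trained by back-propagation), one per hidden unit.
    d2 = np.sum((centres - x) ** 2, axis=1)      # squared distances to centres
    hidden = np.exp(-d2 / (2.0 * widths ** 2))   # Gaussian activations
    return output_weights @ hidden               # linear output layer

centres = np.array([[0.0, 0.0], [1.0, 1.0]])     # e.g., chosen by clustering
widths = np.array([0.5, 0.5])
output_weights = np.array([[0.3, -0.2]])         # the only trained weights
print(rbf_forward(np.array([0.2, 0.8]), centres, widths, output_weights))
```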
4.
The Kohonen learning algorithm is an unsupervised learning algorithm.
(a) What does the algorithm learn in unsupervised learning, given that there are no target outputs that can be used to calculate an error?
[7%]
(b) Why do we need to define a neighbourhood for each output node in Kohonen
networks? How does it help learning?
[8%]
ANSWER:
(a) In unsupervised learning, the algorithm learns the relationships among the data from the inputs only. A similarity measure, e.g., the Euclidean distance, is often defined and used in learning. The algorithm learns to cluster similar data together into groups/clusters. Learning here is not based on minimising an error between target and actual outputs, but on maximising the similarity of data points within a cluster and the dissimilarity of data points between clusters. Unsupervised learning algorithms are closely related to clustering algorithms in statistical pattern recognition.
(b) Winner-takes-all is simple and works well for some problems. However, when a winner is only slightly better than its closest competitors, winner-takes-all does not work well. The introduction of neighbourhoods can remedy this problem. In addition, neighbourhoods encourage the formation of feature maps in the output layer of Kohonen networks, i.e., similar units end up close to each other spatially. Neighbourhoods also make learning faster; the neighbourhood size is decreased as learning progresses. A sketch of one learning step is given below.
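One Kohonen learning step in Python (illustrative only: the 1-D output grid, the Gaussian neighbourhood function and all constants are assumptions made for the sketch, not part of the question):

```python
import numpy as np

def kohonen_step(weights, x, lr, radius):
    # Winner: the output node whose weight vector is closest to the input.
    winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    for j in range(len(weights)):
        d = abs(j - winner)                    # distance on the output grid
        if d <= radius:                        # inside the neighbourhood
            h = np.exp(-d * d / (2.0 * max(radius, 1e-9) ** 2))
            weights[j] += lr * h * (x - weights[j])   # move towards input

rng = np.random.default_rng(1)
w = rng.random((10, 2))                        # 10 output nodes, 2-D inputs
for t, x in enumerate(rng.random((200, 2))):
    frac = 1.0 - t / 200                       # both shrink as learning goes on
    kohonen_step(w, x, lr=0.5 * frac, radius=3.0 * frac)
```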
5.
The post code recognition problem can be described as follows.
(a) Each post code consists of six or seven hand-written letters/digits. The letters are the 26 upper-case English letters, i.e., 'A' .. 'Z'. The digits are '0' .. '9'.
(b) Each letter/digit is represented by a 32x32 binary image.
(c) The total available data consist of 10000 images. They are in the format of
0111011 ... 0100000 B
0100000 ... 1111110 H
1100111 ... 0011010 7
... ...
where the first 1024 bits represent 32x32 pixel values and the last column indicates the
correct class.
You are required to design a feed-forward artificial neural network (or several networks)
to recognise letters/digits. Your answer should include
(1) Input representation, output representation, and network (or networks) architecture
(including the number of hidden layers, hidden nodes and connectivity pattern).
(2) Any pre-processing of the input data and/or post-processing of the output data that
should be carried out.
(3) The training algorithm that should be used.
You must justify all your design choices.
[25%]
ANSWER:
There is no fixed answer to this question. However, the answer (including the design
and its justification) should include the following for marks to be awarded. (The answer
may be illustrated by a number of graphs which show the network structure and
connectivity.)
The best design for this problem is to first separate letters from digits using one neural network (NN1), and then classify the individual letters and digits using two further networks (NNL and NND, respectively).
The raw 32x32 images should not be used as inputs to the neural networks (NNs) because 32*32=1024 inputs is too large in comparison with the number of training patterns available (10000). Pre-processing or weight-sharing must be used to reduce the number of inputs to the NN. For example, the 32x32 raw images can be reduced to 8x8 inputs by averaging the pixel values in non-overlapping 4x4 windows, as sketched below.
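A possible implementation of this pre-processing step in Python (a sketch; the non-overlapping 4x4 window averaging follows the description above):

```python
import numpy as np

def downsample_32x32_to_8x8(image):
    # Average the pixel values inside non-overlapping 4x4 windows,
    # turning 1024 binary inputs into 64 real-valued inputs in [0, 1].
    image = np.asarray(image, dtype=float).reshape(32, 32)
    return image.reshape(8, 4, 8, 4).mean(axis=(1, 3))

bits = np.random.default_rng(0).integers(0, 2, size=(32, 32))
print(downsample_32x32_to_8x8(bits).shape)     # (8, 8)
```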
For NN1, there are 8x8 inputs and one output. The network must have a hidden layer
since the problem is nonlinear. The number of hidden nodes is determined empirically.
For this design, the number of hidden nodes is set as 4. The network is strictly-layered
and fully-connected.
Pre-processing is necessary for NN1: first, the original 32x32 binary data will be transformed into 64 real values. The target output values should not be 0 (for letters) and 1 (for digits); they should be greater than 0 and smaller than 1, e.g., 0.3 for letters and 0.7 for digits, due to the use of logistic functions.
The output of NN1 is used to determine whether NNL or NND should be used for the
next step.
NNL also has 8x8 inputs. It has 26 outputs, one for each letter. NNL must have a hidden
layer since the problem is non-linear. The number of hidden nodes should be greater
than the number of hidden nodes used in NN1 because NNL performs a more difficult
task. For this design, the number of hidden nodes is set (empirically) as 5. The target
output of NNL should not be a vector of 26 binary values. It should be a vector of 26
real values, one of which is close to 1 (e.g., 0.7) and the rest close to 0 (e.g., 0.3). The
network is strictly layered and fully-connected.
As 64*5*26=8320 is rather close to 10000, NNL has a high risk of over-fitting the training data if nothing is done to reduce the network size further or to obtain more training data. For this design, we will add small Gaussian noise to the existing training data to generate more training data, as sketched below. Each existing pattern will generate three new patterns, making the total number of training patterns 40000. 37000 will be used for training while 3000 will be used as a validation set for early stopping, in order to improve generalisation.
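A sketch of this augmentation step in Python (the noise standard deviation sigma=0.05 is an illustrative choice that would be tuned in practice; the class labels are simply repeated alongside the noisy copies):

```python
import numpy as np

def augment_with_noise(patterns, copies=3, sigma=0.05, seed=0):
    # Add small Gaussian noise to each pattern `copies` times and keep
    # the originals, multiplying the training set size by (copies + 1).
    rng = np.random.default_rng(seed)
    noisy = [patterns + rng.normal(0.0, sigma, size=patterns.shape)
             for _ in range(copies)]
    return np.concatenate([patterns] + noisy, axis=0)

data = np.random.default_rng(1).random((10000, 64))   # pre-processed inputs
augmented = augment_with_noise(data)                   # 40000 patterns
train, val = augmented[:37000], augmented[37000:]      # early-stopping split
```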
NND is similar to NNL, but with 10 outputs.
Any algorithm which works with continuous variables can be used as the training
algorithm for the above three NNs, but not the Perceptron learning algorithm.