BIRKBECK (University of London)
MSc/MRes EXAMINATION FOR INTERNAL STUDENTS
MSc IN ADVANCED INFORMATION SYSTEMS
MSc IN WEB INFORMATION MANAGEMENT
MRes IN COMPUTER SCIENCE
SCHOOL OF COMPUTER SCIENCE AND INFORMATION SYSTEMS
COMPUTATIONAL INTELLIGENCE AND VISUALISATION
OFFICIAL MODULE NUMBER: COIY022P
DATE OF EXAMINATION: Friday 2nd June 2006
DURATION OF PAPER: 14.30-16.30

There are six questions on this paper. Candidates should attempt any FOUR of them. Calculators are permitted.

COIY022P Page 1 of 7 © Birkbeck College 2006

Question 1. 1D distributions

1.1. Consider a feature x whose range is between 1 and 10. When the range of x is divided into 9 bins (in this case, intervals of length one: [1,2), [2,3), ..., [9,10]), the x frequencies in the corresponding bins are: 10, 20, 10, 20, 30, 20, 40, 20, 30. Please answer these questions:

1.1.1 How many observations of x are available? (1 mark)
1.1.2 What can be said about the value of the median of x? (3 marks)
1.1.3 Provide the minimum and maximum estimates of the average of x. (5 marks)
1.1.4 What can be said of the 20% quantiles of x? (4 marks)
1.1.5 What is the distribution of x when the number of bins is 3? What is the qualitative variance (Gini coefficient) for this distribution? (7 marks)

1.2. Given a triangular fuzzy set defined by the triple (1, 5, 6), draw a graph of its membership function. (1 mark)

1.3. Given three triangular fuzzy sets defined by the triples (0, 1, 2), (0, 2, 3), and (1, 3, 4), determine the corresponding central triangular fuzzy set. (4 marks)

Answer:

1.1.1 There are 200 observations.

1.1.2 The median lies between the 100th and 101st values in sorted order, that is, in the 6th bin: between 6 and 7.

1.1.3 The minimum estimate of the mean is computed with the minimal values in the bins:
(1*10 + 2*20 + 3*10 + 4*20 + 5*30 + 6*20 + 7*40 + 8*20 + 9*30)/200 = 5.7.
The maximum estimate is calculated with the same formula with all bin values increased by 1, which leads to 5.7 + 1 = 6.7.

1.1.4 20% of 200 is 40.
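The binned-distribution arithmetic asked for in 1.1.1-1.1.5 can be checked with a short Python sketch (bin edges and frequencies as given in the question; the variable names are illustrative):

```python
# Frequencies in the nine unit-width bins [1,2), [2,3), ..., [9,10]
freqs = [10, 20, 10, 20, 30, 20, 40, 20, 30]
left_edges = list(range(1, 10))

# 1.1.1: total number of observations
n = sum(freqs)

# 1.1.3: minimum estimate of the mean uses the left bin edges;
# the maximum estimate shifts every value up by the bin width (1)
mean_min = sum(e * f for e, f in zip(left_edges, freqs)) / n
mean_max = mean_min + 1

# 1.1.5: rebin into three bins of three and compute the Gini index
three_bins = [sum(freqs[i:i + 3]) for i in range(0, 9, 3)]
gini = 1 - sum((f / n) ** 2 for f in three_bins)
```

Running this reproduces the model answers: 200 observations, mean estimates 5.7 and 6.7, the three-bin distribution (40, 70, 90), and a Gini index of 0.635.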
That means that the 20% quantile at the left end of x is 4, while that at the right end must be in the 8th bin, that is, between 8 and 9.

1.1.5 The three-bin distribution will be 40, 70, 90 or, in relative frequencies, 0.2, 0.35, 0.45, which leads to the Gini index 1 - 0.2^2 - 0.35^2 - 0.45^2 = 0.635.

1.2 The graph of the triangular fuzzy set (1, 5, 6) is a triangle: the membership function rises linearly from 0 at x = 1 to 1 at x = 5 and falls linearly back to 0 at x = 6. [The original figure did not survive extraction.]

1.3 The central triangular fuzzy set is defined by the average values of the triples: (1/3, 2, 3).

Question 2. Multivariate data

Consider a data table for 6 students and 2 features, as follows:

Student   Mark   Occupation
   1       60       IT
   2       80       IT
   3       80       IT
   4       60       AN
   5       40       AN
   6       40       AN

2.1. Data standardisation: Pre-process the data into a quantitative format and standardise it using the averages and ranges. (5 marks)
2.2. Euclidean distances squared: Compute the between-entity distance matrix and draw an edge-weighted graph whose vertices correspond to entities and edges to distances. (5 marks)
2.3. Compute the data scatter and determine the contributions of the features to it. (7 marks)
2.4. After the data have been quantified, you must have different columns (features) for the IT and AN occupations. Can you determine the correlation coefficient between these two features? (4 marks) What is the inner product of these two features? (4 marks)

Answer:

2.1 The pre-processed and standardised data:

        Mark    IT     AN
   1     0      0.5   -0.5
   2     0.5    0.5   -0.5
   3     0.5    0.5   -0.5
   4     0     -0.5    0.5
   5    -0.5   -0.5    0.5
   6    -0.5   -0.5    0.5

because the means of the original table are (60, 0.5, 0.5) and the ranges (40, 1, 1).

2.2 The Euclidean squared distance between, for example, entities 1 and 2 is
d(1,2) = (0 - 0.5)^2 + (0.5 - 0.5)^2 + (-0.5 - (-0.5))^2 = 0.25 + 0 + 0 = 0.25.
The full matrix of squared distances, which weights the edges of the distance graph, is:

        2      3      4      5      6
  1    0.25   0.25   2      2.25   2.25
  2           0      2.25   3      3
  3                  2.25   3      3
  4                         0.25   0.25
  5                                0

2.3 The data scatter is the sum of all entries in 2.1 squared, that is, 16*0.25 = 4.
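The standardisation and data-scatter arithmetic of 2.1 and 2.3 can be sketched in Python (the occupation feature is quantified as 0/1 dummy columns, as in the answer; names are illustrative):

```python
# Original table: Mark plus 0/1 dummy columns for the IT and AN occupations
data = [
    [60, 1, 0],
    [80, 1, 0],
    [80, 1, 0],
    [60, 0, 1],
    [40, 0, 1],
    [40, 0, 1],
]
n, m = len(data), len(data[0])

means = [sum(row[j] for row in data) / n for j in range(m)]
ranges = [max(row[j] for row in data) - min(row[j] for row in data)
          for j in range(m)]

# Standardise by subtracting the mean and dividing by the range
std = [[(row[j] - means[j]) / ranges[j] for j in range(m)] for row in data]

# Data scatter = sum of all squared entries; contributions are per-column sums
scatter = sum(x ** 2 for row in std for x in row)
contrib = [sum(row[j] ** 2 for row in std) for j in range(m)]
```

This reproduces the means (60, 0.5, 0.5), the ranges (40, 1, 1), the scatter of 4, and the feature contributions 1, 1.5, and 1.5.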
The Mark feature contributes 4*0.25 = 1 to the scatter, and each of the other two contributes 6*0.25 = 1.5. These contributions can be expressed as 25%, 37.5%, and 37.5%.

2.4 Features IT and AN in 2.1 sum to zero, which means they are linearly related, so their correlation coefficient is -1. The inner product of the quantified 0/1 columns is 0, since no student belongs to both occupations.

Question 3. Neural networks

3.1. What is an artificial neuron? (5 marks)
3.2. Explain the concept of the perceptron and its relation to the gradient algorithm. (12 marks)
3.3. What are the main steps of the back-propagation algorithm for a neural network with one hidden layer? (8 marks)

Answer:

3.1 An artificial neuron is a system implementing a mapping of the vector of its inputs x = (x1, ..., xM) into a value f(w1*x1 + w2*x2 + ... + wM*xM + w0), where w1, ..., wM are the (wiring) weights, w0 is the bias, and f is the neuron activation function, such as the sign function or the sigmoid s(x) = 1/(1 + exp(-x)).

3.2 The perceptron is an artificial neuron that implements the following algorithm for learning weights to recognise a pattern, that is, to minimise the error of prediction of target values u that are equal to either 1 or -1 and are associated with input feature vectors x over a number of instances of known pairs (x, u):

0. Initialise the weights w randomly or to zero.
1. For each training instance (xi, ui):
   a. compute ůi = sign(<w, xi>);
   b. if ůi ≠ ui, update the weights according to w(new) = w(old) + η(ui - ůi)xi, where η, a real between 0 and 1, is the so-called learning rate.
2. Stop at convergence.

The gradient optimisation (steepest ascent/descent, or hill-climbing) of a function f(x) of a multidimensional variable works as follows: given an initial state x0, perform a sequence of iterations, each of which updates the old x value as

x(new) = x(old) ± η*grad(f(x(old))),

where grad(f(x)) is the vector of partial derivatives of f with respect to the components of x.
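The perceptron loop in 3.2 can be sketched as follows (the bias is folded in as a constant input of 1, and the tiny training set is a hypothetical, linearly separable example):

```python
def train_perceptron(X, u, eta=0.5, epochs=100):
    """Learn weights w so that sign(<w, x>) matches the targets u."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        mistakes = 0
        for xi, ui in zip(X, u):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) >= 0 else -1
            if pred != ui:
                # Perceptron rule: w(new) = w(old) + eta*(ui - pred)*xi
                w = [wj + eta * (ui - pred) * xj for wj, xj in zip(w, xi)]
                mistakes += 1
        if mistakes == 0:   # convergence: a full pass with no errors
            break
    return w

# Hypothetical separable data; the last component is the constant bias input
X = [[1, 2, 1], [2, 3, 1], [-1, -2, 1], [-2, -1, 1]]
u = [1, 1, -1, -1]
w = train_perceptron(X, u)
```

On linearly separable data such as this, the loop terminates with weights that classify every training instance correctly, as the perceptron convergence theorem guarantees.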
It is known from calculus that the vector grad(f(x)) shows the direction of steepest rise of f at the point x. Thus + is used for maximisation of f(x), and - for minimisation. The value η controls the length of the change: it should be small (to guarantee not overjumping the slope), but not too small (to guarantee that changes still occur when grad(f(x(old))) becomes small; indeed, grad(f(x(old))) = 0 at the optimum point). It can be proven that the partial derivative of the quadratic error criterion with respect to wt, in the case when only one incoming entity (xi, ui) is considered, is equal to -2(ui - ûi)xit, which is similar to the perceptron learning rule. Thus, the perceptron is similar to gradient optimisation, except that the continuous ûi is replaced in it by the discrete ůi = sign(ûi).

3.3 The back-propagation algorithm is an implementation of the gradient optimisation method in the framework of a multilayer neural network. It starts with random weights and runs a pre-specified number of epochs (or until convergence), processing entities (x, u) in a random order. Given an entity (xi, ui), it is first fed forward through the net to produce a computed output value u' and the error e = ui - u'. Then this error is back-propagated along the net topology to compute the gradient of the weights, which is then used to update the weights.

COIY022P Page 4 of 7 (footer)

Question 4. MST and single linkage clustering

4.1. Find a maximum spanning tree in the following similarity graph using Prim's algorithm, stating the order in which the edges are added to the tree. (17 marks)

[Similarity graph figure over vertices A-H; the full set of edge weights did not survive extraction. The weights referenced in the answer are AC = 10, AD = 10, DB = 11, DG = 8, GH = 10, HE = 8, and EF = 12.]

4.2. What is the total length of the tree? In what sense is the algorithm "greedy"? (3 marks)
4.3. Find a three-cluster single linkage partition by cutting the MST found above. (5 marks)

Answer:

4.1 A possible answer when starting from A. The maximum link from A is either AC or AD (weight 10), of which we select AC.
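Prim's procedure for a maximum spanning tree can be sketched generically in Python; since the original figure did not survive, the sketch is run on a small hypothetical graph rather than the one in the question:

```python
def prim_max_spanning_tree(vertices, weights, start):
    """Grow a maximum spanning tree from `start`, always adding the
    heaviest edge that connects the tree to a new vertex."""
    in_tree = {start}
    tree = []           # edges in the order they are added
    while len(in_tree) < len(vertices):
        # Among edges with exactly one endpoint in the tree, take the heaviest
        best = max(
            ((u, v, wt) for (u, v), wt in weights.items()
             if (u in in_tree) != (v in in_tree)),
            key=lambda e: e[2],
        )
        tree.append(best)
        in_tree.update({best[0], best[1]})
    return tree

# Hypothetical example graph (not the one in the exam figure)
verts = {"A", "B", "C", "D"}
wts = {("A", "B"): 3, ("A", "C"): 5, ("B", "C"): 4, ("B", "D"): 2, ("C", "D"): 6}
tree = prim_max_spanning_tree(verts, wts, "A")
total = sum(e[2] for e in tree)
```

On this example the tree acquires AC (5), then CD (6), then BC (4), for a total weight of 15, mirroring the one-entity-at-a-time greedy growth described in the answer.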
The maximum link from {A, C} is AD (weight 10). The maximum link from {A, C, D} to the rest is DB (weight 11). The next maximum links are DG (8), GH (10), HE (8), and EF (12). This leads to the MST with edges AC, AD, DB, DG, GH, HE, and EF. [The original figure highlighting these edges did not survive extraction.]

4.2 The total length is 69. The algorithm is greedy because at each step it considers the acquisition of only one entity, in the best possible way.

4.3 The MST must be cut at its two shortest links, which are DG (8) and HE (8), leading to the three single linkage clusters ABCD, GH, and EF.

Question 5. K-Means clustering

Consider a data table of 7 entities (1, 2, ..., 7) and 2 features (F1, F2):

Entity   F1   F2
  1      25    0
  2      15    2
  3      18    1
  4      10    2
  5      22    1
  6      25    0
  7      25    1

5.1. Standardise the data with the feature averages and ranges. (4 marks)
5.2. Set K = 2 and choose initial seeds of the two clusters so that they are as far from each other as possible. Assign entities to the seeds with the Minimum distance rule. (8 marks)
5.3. Calculate the centroids of the found clusters; compare them with the initial seeds. (4 marks)
5.4. Is there any chance that the found clusters are final in the K-Means process? (3 marks)
5.5. Take one of the clusters found in 5.2 and determine the relative feature contributions to the cluster. Comment on the results. (6 marks)

Answers:

5.1 The means are 20 and 1; the ranges, 15 and 2. Subtracting the means and dividing by the ranges, one obtains the standardised data:

Entity    F1      F2
  1      0.33   -0.5
  2     -0.33    0.5
  3     -0.13    0
  4     -0.67    0.5
  5      0.13    0
  6      0.33   -0.5
  7      0.33    0

5.2 Entities (rows) 1 and 4 are farthest from each other, with the distance d(1,4) = 2. Taking them as the initial seeds, the Minimum distance rule assigns entities 3, 5, 6, 7 to seed 1, and entity 2 to seed 4. This produces the clusters {1,3,5,6,7} and {2,4}.

5.3 The centroid of cluster {1,3,5,6,7} is (0.2, -0.2) (the seed was (0.33, -0.5)); the centroid of cluster {2,4} is (-0.5, 0.5) (the seed was (-0.67, 0.5)).
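The seed-assign-recentre computation of 5.1-5.3 can be sketched in Python (data as in the question; function names are illustrative):

```python
F1 = [25, 15, 18, 10, 22, 25, 25]
F2 = [0, 2, 1, 2, 1, 0, 1]

def standardise(col):
    """Subtract the mean and divide by the range."""
    m = sum(col) / len(col)
    r = max(col) - min(col)
    return [(v - m) / r for v in col]

rows = list(zip(standardise(F1), standardise(F2)))

def d2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Seeds: entities 1 and 4 (indices 0 and 3), the farthest pair
seeds = [rows[0], rows[3]]
clusters = {0: [], 1: []}
for i, row in enumerate(rows):
    clusters[min((0, 1), key=lambda k: d2(row, seeds[k]))].append(i)

def centroid(idx):
    return tuple(sum(rows[i][j] for i in idx) / len(idx) for j in range(2))

c1, c2 = centroid(clusters[0]), centroid(clusters[1])
```

This reproduces the clusters {1,3,5,6,7} and {2,4} (as 0-based indices) and the centroids (0.2, -0.2) and (-0.5, 0.5).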
5.4 The found clusters are indeed final, because applying the Minimum distance rule to the entities with the centroids in 5.3 leads to the same clusters as in 5.2.

5.5 The relative feature contributions are proportional to the squared centroid values; the centroids in 5.3 give the same relative contribution weights to both features in both clusters. This means that the features have the same degree of variation within the clusters.

Question 6. Nature-inspired algorithms

6.1. Explain the structure of a genetic algorithm (GA). (5 marks)
6.2. Explain the main steps of a GA for K-Means clustering in the setting of cluster label strings. (10 marks)
6.3. Explain the main steps of the evolutionary algorithm for K-Means clustering. (10 marks)

Answer:

6.1 A genetic algorithm is defined by a population comprising a number of structured entities, called chromosomes, that evolve by imitating the following biological mechanisms:

1. Selection
2. Cross-over
3. Mutation

These mechanisms are applied to carry the population from one iteration to the next. The initial population is typically selected randomly. The evolution stops when the population's fitness no longer changes or when a pre-specified threshold on the number of iterations is reached.

6.2 A partition S = {S1, ..., SK} of the entity set is represented by a "chromosome", which is the string of cluster labels assigned to the entities in the order i = 1, ..., N.
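The label-string representation lends itself to one-point crossover; a minimal sketch (the cut point is fixed here for reproducibility, whereas in the GA it would be drawn at random):

```python
def one_point_crossover(a, b, cut):
    """Swap the tails of two label strings after position `cut`."""
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

# Two hypothetical label strings over N = 8 entities and K = 3 clusters
parent1 = "12333112"
parent2 = "23121321"
child1, child2 = one_point_crossover(parent1, parent2, 3)
```

Each child inherits the head of one parent and the tail of the other, so every position still carries a valid cluster label from 1 to K.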
If, for instance, N = 8 and the entities are e1, e2, e3, e4, e5, e6, e7, e8, then the string 12333112 represents the partition S with three classes, S1 = {e1, e6, e7}, S2 = {e2, e8}, and S3 = {e3, e4, e5}, which can easily be seen from the diagram:

e1  e2  e3  e4  e5  e6  e7  e8
 1   2   3   3   3   1   1   2

The main steps are:

(1) Initial setting: randomly generate strings s1, ..., sP of N integers from 1, ..., K and compute the value of the K-Means criterion for each.
(2) Selection: randomly select mating pairs.
(3) Cross-over: for each of the mating pairs, generate a random number r between 0 and 1; if r is smaller than a pre-specified probability p (typically, p is taken to be about 0.7-0.8), then perform a crossover; otherwise the mates themselves are taken as the result.
(4) Mutation: randomly alter a character in each chromosome.
(5) Elitist survival: store the best fitting chromosome and put this record chromosome into the population in place of the worst one.
(6) Halt: typically, a limit on the number of iterations. If it has not been reached, go to Step 2; otherwise, halt.

6.3 In an evolutionary K-Means algorithm, a chromosome is represented by the set of K centroids c1, c2, ..., cK, which can be considered a string of K*V real ("float") numbers, where V is the number of features. In contrast to the GA representation, the length of the string here does not depend on the number of entities, which can be an advantage when the number of entities is massive. Furthermore, each centroid in the string is analogous to a gene in a chromosome. Computations are performed similarly to those in the GA, with the following steps:

(1) Initial setting: randomly generate strings of K centroids and compute the value of the K-Means criterion for each.
(2) Selection: randomly select mating pairs.
(3) Cross-over: for each of the mating pairs, generate a random number r between 0 and 1;
if r is smaller than a pre-specified probability p (typically, p is taken to be about 0.7-0.8), then perform a crossover; otherwise the mates themselves are taken as the result.
(4) Mutation: add small normally distributed noise to each chromosome.
(5) Elitist survival: store the best fitting chromosome and put this record chromosome into the population in place of the worst one.
(6) Halt: typically, a limit on the number of iterations. If it has not been reached, go to Step 2; otherwise, halt.
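The centroid-string mutation of step (4) in 6.3 can be sketched as follows (K, V, and the noise scale are illustrative choices, not values from the exam):

```python
import random

def mutate_centroids(chromosome, sigma=0.05, rng=random):
    """Add small normally distributed noise to every float in the
    K*V centroid string."""
    return [c + rng.gauss(0.0, sigma) for c in chromosome]

# A hypothetical chromosome: K = 2 centroids over V = 2 features,
# stored as a flat string of K*V floats
chromosome = [0.2, -0.2, -0.5, 0.5]
mutant = mutate_centroids(chromosome)
```

Unlike the character-flip mutation of the label-string GA, this perturbs every coordinate slightly, which suits the continuous centroid representation.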