Ch. 9 Unsupervised Learning
Stephen Marsland, Machine Learning: An Algorithmic Perspective. CRC 2009
Based on slides from Stephen Marsland and some slides from the Internet
Collected and modified by Longin Jan Latecki, Temple University, [email protected]
159.302 3.1 Stephen Marsland

Introduction
  Suppose we don’t have good training data
  Hard and boring to generate targets
  Don’t always know target values
  Biologically implausible to have targets?
  Two cases:
    Know when we’ve got it right
    No external information at all

Unsupervised Learning
  We have no external error information
  No task-specific error criterion
  Generate internal error
  Must be general
  Usual method is to cluster data together according to activation of neurons
  Competitive learning

Competitive Learning
  Set of neurons compete to fire
  Neuron that ‘best matches’ the input (has the highest activation) fires
  Winner-take-all
  Neurons ‘specialise’ to recognise some input
  Grandmother cells

The k-Means Algorithm
  Suppose that you know the number of clusters, but not what the clusters look like
  How do you assign each data point to a cluster?
    Position k centers at random in the space
    Assign each point to its nearest center according to some chosen distance measure
    Move the center to the mean of the points that it represents
    Iterate

k-means Clustering
  [Figure: Euclidean distance between two points, drawn from the differences x1 - x2 and y1 - y2 along the x and y axes]

The k-Means Algorithm
  4 means
  [Figure: data points grouped around four means]
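The four k-means steps from the slides (position the centers, assign each point to its nearest center, move each center to the mean of its points, iterate) can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's implementation: the function name `kmeans`, the `initial_centers` parameter, the stopping rule and the sample points are all assumptions made for the example, and Euclidean distance is used as the "chosen distance measure". The slides position the centers at random; fixed starting centers are used here so that the run is repeatable.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, initial_centers, iterations=100):
    """Return (centers, assignments) after running k-means from the given start."""
    centers = list(initial_centers)
    k = len(centers)
    assignments = None
    for _ in range(iterations):
        # Assign each point to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centers[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the centers would not move
        assignments = new_assignments
        # Move each center to the mean of the points it represents.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignments


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]

# Both starting centers sit in the first group; iteration pulls one of them away.
centers, assignments = kmeans(points, initial_centers=[(0.0, 0.0), (0.1, 0.2)])
print(assignments)  # [0, 0, 0, 1, 1, 1]
print(centers)
```

With data this cleanly separated the poor start is harmless. On harder data a different start can end in a different final clustering, which is why running from several starting points is a common precaution.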
The k-Means Algorithm
  These are local minima solutions
  [Figure: the data points grouped around a different set of means]

The k-Means Algorithm
  More perfectly valid, wrong solutions
  [Figure: further alternative groupings of the data points]

The k-Means Algorithm
  If you don’t know the number of means the problem is worse
  [Figure: the data points grouped with a different number of means]

The k-Means Algorithm
  One solution is to run the algorithm for many values of k
    Pick the one with lowest error
    Up to overfitting
  Run the algorithm from many starting points
    Avoids local minima?
  What about noise?
    Median instead of mean?

k-Means Neural Network
  Neuron activation measures distance between input and neuron position in weight space

Weight Space
  Imagine we plot neuronal positions according to their weights
  [Figure: neurons plotted in weight space with axes w1, w2, w3]

k-Means Neural Network
  Use winner-take-all neurons
  Winning neuron is the one closest to input
  Best-matching cluster
  How do we do training?
  Update weights - move neuron positions
  Move winning neuron towards current input
  Ignore the rest

Normalisation
  Suppose the weights are:
    (0.2, 0.2, -0.1)
    (0.15, -0.15, 0.1)
    (10, 10, 10)
  The input is (0.2, 0.2, -0.1)
  [Figure: weight space with axes w1, w2, w3]

Normalisation
  For a perfect match with first neuron:
    0.2*0.2 + 0.2*0.2 + -0.1*-0.1 = 0.09
    0.15*0.2 + -0.15*0.2 + 0.1*-0.1 = -0.01
    10*0.2 + 10*0.2 + 10*-0.1 = 3
  Can only compare activations if the weights are about the same size

Normalisation
  Make the distance between each neuron and the origin be 1
  All neurons lie on the unit hypersphere
  Need to stop the weights growing unboundedly

k-Means Neural Network
  Normalise inputs too
  Then use: [equation missing]
  That’s it
  Simple and easy

Vector Quantisation (VQ)
  Think about the problem of data compression
  Want to store a set of data (say, sensor readings) in as small an amount of memory as possible
  We don’t mind some loss of accuracy
  Could make a codebook of typical data and index each data point by reference to a codebook entry
  Thus, VQ is a coding method that maps each data point x to the closest codeword, i.e., we encode x by replacing it with the closest codeword.

Outline of Vector Quantization of Images
  S.R.Subramanya
  [Figure: outline of vector quantization of images]

Vector Quantisation
  The Codebook...
    Index:    0     1     2     3     4
    Codeword: 10110 01001 11010 11100 11001
  … is sent to the receiver
  At least 30 bits

Vector Quantisation
  The data...
    01001 11100 11101 00101 11110
  … is encoded...
  …and sent: 1 (3 bits)

Vector Quantisation
  The data...
    01001 11100 11101 00101 11110
  … is encoded...
  …and sent: 3 (3 bits)

Vector Quantisation
  The data...
    01001 11100 11101 00101 11110
  … is encoded...
    ?
Vector Quantisation
  The data...
    01001 11100 11101 00101 11110
  … is encoded...
    ?
  Pick the nearest according to some measure

Vector Quantisation
  The data...
    01001 11100 11101 00101 11110
  … is encoded...
    ?
  Pick the nearest according to some measure
  And send … 3 bits, but information is lost

Vector Quantisation
  The data...
    01001 11100 11101 00101 11110
  … is sent as 13313
  … which takes 15 bits instead of 30
  Of course, sending the codebook is inefficient for this data, but if there was a lot more information, the cost would have been reduced

Vector Quantisation
  The problem is that we have only sent 2 different pieces of data, 01001 and 11100 (codewords 1 and 3), instead of the 5 we had.
  If the codebook had been picked more carefully, this would have been a lot better
  How can you pick the codebook?
  Usually k-means is used for Learning Vector Quantisation

Voronoi Tessellation
  Join neighbouring points
  Draw lines equidistant to each pair of points
  These are perpendicular to the lines joining the points

Two Dimensional Voronoi Diagram
  Codewords in 2-dimensional space. Input vectors are marked with an x, codewords are marked with red circles, and the Voronoi regions are separated with boundary lines.
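The codebook example from the Vector Quantisation slides can be replayed in code. The sketch below assumes Hamming distance as the "some measure" used to pick the nearest codeword, and breaks ties by taking the lowest codebook index; both choices are assumptions, since the slides leave the measure open. The function names `hamming`, `encode` and `decode` are made up for this illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def encode(data, codebook):
    """Replace each data word by the index of its nearest codeword.

    Ties go to the lowest index, because min() returns the first minimum.
    """
    return [
        min(range(len(codebook)), key=lambda i: hamming(word, codebook[i]))
        for word in data
    ]


def decode(indices, codebook):
    """Look each received index up in the codebook (this is where loss shows)."""
    return [codebook[i] for i in indices]


# Codebook and data exactly as they appear on the slides.
codebook = ["10110", "01001", "11010", "11100", "11001"]
data = ["01001", "11100", "11101", "00101", "11110"]

indices = encode(data, codebook)
print(indices)                    # [1, 3, 3, 1, 0]
print(decode(indices, codebook))  # ['01001', '11100', '11100', '01001', '10110']
```

The first four indices agree with the 13313 shown on the slides. The last word, 11110, is at Hamming distance 1 from codewords 0, 2 and 3 alike, so the index sent for it depends entirely on the tie-breaking rule: the slides pick 3, this sketch picks 0. Each codeword effectively owns the region of inputs that are closer to it than to any other codeword, which is the Voronoi picture described above.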
Self Organizing Maps
  Self-organizing maps (SOMs) are a data visualization technique invented by Professor Teuvo Kohonen
  Also called Kohonen Networks, Competitive Learning, Winner-Take-All Learning
  Generally reduces the dimensions of data through the use of self-organizing neural networks
  Useful for data visualization; humans cannot visualize high dimensional data so this is often a useful technique to make sense of large data sets

Neurons in the Brain
  Although heterogeneous, at a low level the brain is composed of neurons
  A neuron receives input from other neurons (generally thousands) through its synapses
  Inputs are approximately summed
  When the input exceeds a threshold the neuron sends an electrical spike that travels from the body, down the axon, to the next neuron(s)

Feature Maps
  [Figure: feature map with regions labelled low pitch, higher pitch, high pitch]

Feature Maps
  Sounds that are similar (‘close together’) excite neurons that are near to each other
  Sounds that are very different excite neurons that are a long way off
  This is known as topology preservation
  The ordering of the inputs is preserved
  If possible (perfectly topology-preserving)

Topology Preservation
  [Figure: inputs mapped to outputs]

Topology Preservation
  [Figure]

Self-Organizing Maps (Kohonen Maps)
  Common output-layer structures:
    One-dimensional (completely interconnected for determining “winner” unit)
    Two-dimensional (connections omitted, only neighborhood relations shown)
  [Figure: the two structures, with the neighborhood of neuron i marked]
  November 24, 2009, Introduction to Cognitive Science, Lecture 21: Self-Organizing Maps

The Self-Organising Map
  [Figure: inputs connected to the map]

Neuron Connections?
  We don’t actually need the inhibitory connections
  Just use a neighbourhood of positive connections
  How large should this neighbourhood be?
  Early in learning, network is unordered
    Big neighbourhood
  Later on, just fine-tuning network
    Small neighbourhood

The Self-Organising Map
  The weight vectors are randomly initialised
  Input vectors are presented to the network
  Each neuron's activation depends on the Euclidean distance between the input and its weight vector; the closest neuron wins
  The winning node has its weight vector moved closer to the input
  So do the neighbours of the winning node
  Over time, the network self-organises so that the input topology is preserved

Self-Organisation
  Global ordering from local interactions
  Each neuron sees its neighbours
  The whole network becomes ordered
  Understanding self-organisation is part of complexity science
  Appears all over the place

Basic “Winner Take All” Network
  Two layer network
  Input units, output units, each input unit is connected to each output unit
  [Figure: input layer I1, I2, I3 connected to output layer O1, O2 by weights Wi,j]

Basic Algorithm (the same as k-Means Neural Network)
  Initialize Map (randomly assign weights)
  Loop over training examples
  Assign input unit values according to the values in the current example
  Find the “winner”, i.e. the output unit that most closely matches the input units, using some distance metric, e.g.
  For all output units j=1 to m and input units i=1 to n, find the one that minimizes: $\sum_{i=1}^{n} (W_{ij} - I_i)^2$
  Modify weights on the winner to more closely match the input: $W_{ij}^{t+1} = W_{ij}^{t} + c\,(X_i^{t} - W_{ij}^{t})$
  where c is a small positive learning constant that usually decreases as the learning proceeds

Result of Algorithm
  Initially, some output nodes will randomly be a little closer to some particular type of input
  These nodes become “winners” and the weights move them even closer to the inputs
  Over time nodes in the output become representative prototypes for examples in the input
  Note there is no supervised training here
  Classification: Given new input, the class is the output node that is the winner

Typical Usage: 2D Feature Map
  In typical usage the output nodes form a 2D “map” organized in a grid-like fashion and we update weights in a neighborhood around the winner
  [Figure: input layer I1, I2, I3, … connected to a 5 x 5 grid of output nodes O11 to O55]

Modified Algorithm
  Initialize Map (randomly assign weights)
  Loop over training examples
  Assign input unit values according to the values in the current example
  Find the “winner”, i.e. the output unit that most closely matches the input units, using some distance metric
  Modify weights on the winner to more closely match the input
  Modify weights in a neighborhood around the winner so the neighbors on the 2D map also become closer to the input
  Over time this will tend to cluster similar items closer on the map

Unsupervised Learning in SOMs
  For n-dimensional input space and m output neurons:
  (1) Choose random weight vector $w_i$ for neuron i, i = 1, ..., m
  (2) Choose random input x
  (3) Determine winner neuron k: $\|w_k - x\| = \min_i \|w_i - x\|$ (Euclidean distance)
  (4) Update all weight vectors of all neurons i in the neighborhood of neuron k: $w_i := w_i + \eta \cdot h(i, k) \cdot (x - w_i)$ ($w_i$ is shifted towards x)
  (5) If convergence criterion met, STOP.
      Otherwise, narrow neighborhood function h and learning parameter η and go to (2).

The Self-Organising Map
  Before training (large neighbourhood)
  [Figure]

The Self-Organising Map
  After training (small neighbourhood)
  [Figure]

Updating the Neighborhood
  Node O44 is the winner
  Color indicates scaling to update neighbors
  $W_{ij}^{t+1} = W_{ij}^{t} + c\,(X_i^{t} - W_{ij}^{t})$
  [Figure: 5 x 5 grid of output nodes O11 to O55, colored by update scale c = 1, c = 0.75, c = 0.5]

Selecting the Neighborhood
  Typically, a “Sombrero Function” or Gaussian function is used
  [Figure: neighborhood strength plotted against distance]
  Neighborhood size usually decreases over time to allow initial “jockeying for position” and then “fine-tuning” as algorithm proceeds

Color Example
  http://davis.wpi.edu/~matt/courses/soms/applet.html

Kohonen Network Examples
  Document Map: http://websom.hut.fi/websom/milliondemo/html/root.html

Poverty Map
  http://www.cis.hut.fi/research/somresearch/worldmap.html

SOM for Classification
  A generated map can also be used for classification
  Human can assign a class to a data point, or use the strongest weight as the prototype for the data point
  For a new test case, calculate the winning node and classify it as the class it is closest to

Network Size
  We have to predetermine the network size
  Big network
    Each neuron represents exact feature
    Not much generalisation
  Small network
    Too much generalisation
    No differentiation
  Try different sizes and pick the best
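The SOM training loop described in these slides (random weights, random input, find the winner, shift the winner and its grid neighbours towards the input, shrink the neighbourhood and the learning rate) can be sketched as follows. This is only an illustration under stated assumptions: a Gaussian neighbourhood is used, the learning rate and neighbourhood width decay linearly, and a fixed number of steps stands in for the unspecified convergence criterion. The names `train_som` and `winner_of`, the default parameter values and the three-colour data set are all hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def winner_of(weights, x):
    """Grid position of the neuron whose weight vector is closest to x."""
    return min(weights, key=lambda node: euclidean(weights[node], x))


def train_som(data, rows, cols, steps=2000, eta0=0.5, sigma0=None, seed=0):
    """Train a rows x cols map; returns {(row, col): weight list}."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2
    # (1) Random weight vector for every neuron on the grid.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(steps):
        # Narrow the learning rate and neighbourhood width as training proceeds.
        decay = 1.0 - t / steps
        eta = eta0 * decay
        sigma = max(sigma0 * decay, 0.5)
        # (2) Random input.
        x = rng.choice(data)
        # (3) Winner neuron k.
        k = winner_of(weights, x)
        # (4) Shift every neuron towards x, scaled by its grid distance to k.
        for node, w in weights.items():
            grid_dist2 = (node[0] - k[0]) ** 2 + (node[1] - k[1]) ** 2
            h = math.exp(-grid_dist2 / (2 * sigma ** 2))  # Gaussian h(i, k)
            for i in range(dim):
                w[i] += eta * h * (x[i] - w[i])
    # (5) Stop after a fixed number of steps (stand-in convergence criterion).
    return weights


# Hypothetical data: three pure colours as RGB triples.
colours = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
som = train_som(colours, rows=5, cols=5)
for colour in colours:
    print(colour, "wins at", winner_of(som, colour))
```

The Gaussian gives the winner a scale of 1 and its neighbours progressively smaller scales, in the same spirit as the c = 1, 0.75, 0.5 colouring on the "Updating the Neighborhood" slide. Classification of a new input then amounts to calling `winner_of` and looking up the class attached to that node.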