Download Clustering178winter07

Clustering
Instructor: Max Welling
ICS 178 Machine Learning & Data Mining

Unsupervised Learning
• In supervised learning we were given attributes and targets (e.g. class labels). In unsupervised learning we are only given attributes.
• Our task is to discover structure in the data.
• Example: the data may be structured in clusters.
[Figure: example clustering] Is this a good clustering?

Why Discover Structure?
• Often, the result of an unsupervised learning algorithm is a new representation for the same data. This new representation should be more meaningful and could be used for further processing (e.g. classification).
• Clustering: The new representation is now given by the label of the cluster to which the data-point belongs. This tells us which data-cases are similar to each other.
• The new representation is smaller and hence more convenient computationally.
• Clustering: Each data-case is now encoded by its cluster label. This is a lot cheaper than storing its attribute values.
• CF (collaborative filtering): We can group the users into user-communities and/or the movies into movie genres. If we need to predict something we simply pick the average rating in the group.

Clustering: K-means
• We iterate two operations:
  1. Update the assignment of data-cases to clusters.
  2. Update the location of each cluster.
• Denote by $z_i \in \{1, 2, 3, \ldots, K\}$ the cluster to which data-case $i$ is assigned.
• Denote by $\mu_c \in \mathbb{R}^d$ the position of cluster $c$ in a $d$-dimensional space.
• Denote by $x_i \in \mathbb{R}^d$ the location of data-case $i$.
• Then iterate until convergence:
  1. For each data-case, compute the distances to each cluster and pick the closest one:
     $z_i = \arg\min_c \|x_i - \mu_c\| \quad \forall i$
  2. For each cluster location, compute the mean location of all data-cases assigned to it:
     $\mu_c = \frac{1}{N_c} \sum_{i \in S_c} x_i \quad \forall c$
     where $N_c$ is the number of data-cases in cluster $c$ and $S_c$ is the set of data-cases assigned to cluster $c$.

K-means
• Cost function: $C = \sum_{i=1}^{N} \|x_i - \mu_{z_i}\|^2$
• Each step in K-means decreases this cost function or leaves it unchanged.
• Often initialization is very important since there are very many local minima in $C$.
  Relatively good initialization: place cluster locations on K randomly chosen data-cases.
• How to choose K? Add a complexity term:
  $C' = C + \frac{1}{2}\,[\#\,\text{parameters}]\,\log(N)$
  and minimize also over K.

Vector Quantization
• K-means divides the space up into a Voronoi tessellation.
• Every point on a tile is summarized by the code-book vector "+". This clearly allows for data compression!

Mixtures of Gaussians
• K-means assigns each data-case to exactly 1 cluster. But what if clusters are overlapping? Maybe we are uncertain as to which cluster a data-case really belongs to.
• The mixtures of Gaussians algorithm assigns data-cases to clusters with a certain probability.

MoG Clustering
$\mathcal{N}[x; \mu, \Sigma] = \frac{1}{(2\pi)^{d/2} \sqrt{\det(\Sigma)}} \exp\left[-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right]$
[Figure: Gaussian density contours] Covariance determines the shape of these contours.
• Idea: fit these Gaussian densities to the data, one per cluster.

EM Algorithm: E-step
$r_{ic} = \frac{\pi_c \, \mathcal{N}[x_i; \mu_c, \Sigma_c]}{\sum_{c'=1}^{K} \pi_{c'} \, \mathcal{N}[x_i; \mu_{c'}, \Sigma_{c'}]}$
• $r_{ic}$ is the probability that data-case $i$ belongs to cluster $c$ (the "responsibility" of cluster $c$ for data-case $i$).
• $\pi_c$ is the a priori probability of being assigned to cluster $c$.
• Note that if the Gaussian has high probability on data-case $i$ (i.e. the bell-shape is on top of the data-case) then it claims high responsibility for this data-case.
• The denominator just normalizes the responsibilities so that they sum to 1: $\sum_{c=1}^{K} r_{ic} = 1 \quad \forall i$

EM Algorithm: M-Step
$N_c = \sum_i r_{ic}$ : total responsibility claimed by cluster $c$.
$\pi_c = \frac{N_c}{N}$ : expected fraction of data-cases assigned to this cluster.
$\mu_c = \frac{1}{N_c} \sum_i r_{ic} \, x_i$ : weighted sample mean, where every data-case is weighted according to the probability that it belongs to that cluster.
$\Sigma_c = \frac{1}{N_c} \sum_i r_{ic} \, (x_i - \mu_c)(x_i - \mu_c)^T$ : weighted sample covariance.

EM-MoG
• EM stands for "expectation maximization". We won't go through the derivation.
• If we are forced to decide, we should assign a data-case to the cluster which claims the highest responsibility.
• For a new data-case, we should compute responsibilities as in the E-step and pick the cluster with the largest responsibility.
• E and M steps should be iterated until convergence (which is guaranteed).
• Every step increases the following objective function (which is the total log-probability of the data under the model we are learning):
  $L = \sum_{i=1}^{N} \log\left(\sum_{c=1}^{K} \pi_c \, \mathcal{N}[x_i; \mu_c, \Sigma_c]\right)$

Agglomerative Hierarchical Clustering
• Define a "distance" between clusters (later).
• Initially, every data-case is its own cluster.
• At each iteration, compute the distances between all existing clusters (you can store distances and avoid their re-computation).
• Merge the two closest clusters into 1 single cluster.
• Update your "dendrogram".
[Figure: panels labeled "Every data-case is a cluster", "Iteration 1", "Iteration 2", "Iteration 3"]
• This way you build a hierarchy.
• Complexity: order $N^2$ (why?)

Dendrogram

Distances
$D_{\min}(C_i, C_j) = \min_{x \in C_i,\, x' \in C_j} \|x - x'\|$ : produces a minimal spanning tree.
$D_{\max}(C_i, C_j) = \max_{x \in C_i,\, x' \in C_j} \|x - x'\|$ : avoids elongated clusters.
$D_{\text{avg}}(C_i, C_j) = \frac{1}{N_i N_j} \sum_{x \in C_i} \sum_{x' \in C_j} \|x - x'\|$
$D_{\text{mean}}(C_i, C_j) = \|\mu_i - \mu_j\|$

Gene Expression Data
Micro-array Data
• The expression level of genes is tested under different experimental conditions.
• We would like to find the genes which co-express in a subset of conditions.
• Both genes and conditions are clustered and shown as dendrograms.

Exercise I
Imagine you have run a clustering algorithm on some data describing 3 attributes of cars: height, weight, length. You have found two clusters. An expert comes by and tells you that class 1 is really Ferraris while class 2 is Hummers.
• A new data-case (car) is presented, i.e. you get to see the height, weight, length. Describe how you can use the output of your clustering, including the information obtained from the expert, to classify the new car as a Ferrari or a Hummer. Be very precise: use an equation or pseudo-code to describe what to do.
• You add the new car to the dataset and run K-means starting at the converged assignments and cluster means obtained before. Is it possible that the assignments of the old data change due to the addition of the new data-case?

Exercise II
• We classify data according to the 3-nearest neighbors (3-NN) rule. Explain in detail how this works.
• Which decision surface do you think is smoother: the one for 1-NN or for 100-NN? Explain.
• Is k-NN a parametric or non-parametric method? Give an important property of non-parametric classification methods.
• We will do linear regression on data of the form $(X_n, Y_n)$ where $X_n$ and $Y_n$ are real values: $Y_n = A X_n + b + \epsilon_n$, where $A$, $b$ are parameters and $\epsilon_n$ is the noise variable.
• Provide the equation for the total Error of the data-items.
• We want to minimize the Error. With respect to what?
• You are given a new attribute $X_{new}$. What would you predict for $Y_{new}$?
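
Looking back at the K-means slides, the two alternating updates and the cost function can be sketched in a few lines of plain Python. This is a minimal illustration and not code from the course: the function names `kmeans` and `squared_distance`, the `max_iters` and `seed` parameters, and the choice to leave an empty cluster where it was are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(data, k, max_iters=100, seed=0):
    """Sketch of K-means; returns (assignments z, cluster means mu, cost C)."""
    rng = random.Random(seed)
    # Initialization from the slides: place the means on K random data-cases.
    means = [list(x) for x in rng.sample(data, k)]
    assignments = None
    for _ in range(max_iters):
        # Step 1: z_i = argmin_c ||x_i - mu_c|| for every data-case.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(x, means[c]))
            for x in data
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so the means will not move
        assignments = new_assignments
        # Step 2: mu_c = mean of the data-cases currently assigned to c.
        for c in range(k):
            members = [x for x, z in zip(data, assignments) if z == c]
            if members:  # assumption: an empty cluster keeps its old location
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    # Cost function C = sum_i ||x_i - mu_{z_i}||^2.
    cost = sum(squared_distance(x, means[z]) for x, z in zip(data, assignments))
    return assignments, means, cost


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    z, mu, cost = kmeans(points, k=2)
    print(z, mu, cost)
```

Because the result depends on the random initialization, a common practice is to run the sketch with several seeds and keep the run with the lowest cost.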
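
The E-step and M-step from the mixture-of-Gaussians slides can be illustrated in the same spirit. To stay within the standard library, the sketch below is restricted to one-dimensional data, so each covariance matrix reduces to a scalar variance; that restriction, the function names `gaussian_pdf` and `em_mog_1d`, the caller-supplied initial parameters, and the small variance floor are all assumptions of this example rather than part of the lecture.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_mog_1d(data, means, variances, priors, n_iters=50):
    """Sketch of EM for a 1-D mixture of Gaussians.

    Returns the fitted means, variances, priors, the responsibilities from
    the last E-step, and the log-likelihood recorded at every E-step.
    """
    n, k = len(data), len(means)
    means, variances, priors = list(means), list(variances), list(priors)
    log_likelihoods = []
    resp = []
    for _ in range(n_iters):
        # E-step: r_ic = pi_c N[x_i] / sum_c' pi_c' N[x_i].
        resp = []
        log_lik = 0.0
        for x in data:
            weighted = [
                priors[c] * gaussian_pdf(x, means[c], variances[c])
                for c in range(k)
            ]
            total = sum(weighted)
            resp.append([w / total for w in weighted])
            log_lik += math.log(total)  # objective L from the slides
        log_likelihoods.append(log_lik)
        # M-step: weighted counts, priors, means and variances.
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            priors[c] = n_c / n
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / n_c
            # Assumption: floor the variance so a cluster cannot collapse
            # onto a single data-case.
            variances[c] = max(var, 1e-6)
    return means, variances, priors, resp, log_likelihoods


if __name__ == "__main__":
    xs = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
    print(em_mog_1d(xs, means=[0.0, 10.0], variances=[1.0, 1.0], priors=[0.5, 0.5]))
```

To make a hard decision for a data-case, one would pick the index of the largest entry in its row of responsibilities, as the EM-MoG slide suggests.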
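
The four cluster distances from the "Distances" slide, together with the merge loop of agglomerative clustering, can also be written out directly. The sketch below is deliberately naive: it recomputes every pairwise cluster distance at each iteration instead of storing them as the slides recommend, so it is slower than the approach described there. The function names and the `n_clusters` stopping parameter are assumptions of this example.

```python
import math
from itertools import combinations


def d_min(a, b):
    """Smallest distance between any pair of points, one from each cluster."""
    return min(math.dist(x, y) for x in a for y in b)


def d_max(a, b):
    """Largest distance between any pair of points, one from each cluster."""
    return max(math.dist(x, y) for x in a for y in b)


def d_avg(a, b):
    """Average distance over all pairs of points, one from each cluster."""
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def d_mean(a, b):
    """Distance between the two cluster means."""
    mean_a = [sum(col) / len(a) for col in zip(*a)]
    mean_b = [sum(col) / len(b) for col in zip(*b)]
    return math.dist(mean_a, mean_b)


def agglomerate(data, linkage, n_clusters=1):
    """Merge the closest pair of clusters until n_clusters remain.

    Returns the final clusters and the list of merge distances, which is
    the information a dendrogram displays.
    """
    clusters = [[x] for x in data]  # initially every data-case is a cluster
    merge_distances = []
    while len(clusters) > n_clusters:
        # Naive search over all pairs (no cached distances).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        merge_distances.append(linkage(clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merge_distances


if __name__ == "__main__":
    points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
    print(agglomerate(points, d_min, n_clusters=2))
```

Swapping `d_min` for `d_max`, `d_avg` or `d_mean` changes only the linkage argument, which makes it easy to compare how the choice of distance affects the hierarchy on the same data.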
