Supervised vs Unsupervised

• Supervised learning: the "teacher" Y (the response variable).
  – The goal is to predict Y from X, which is equivalent to estimating f(x) = E[Y | X = x], under squared-error loss for a quantitative response or 0/1 loss for a binary response.
• Unsupervised learning: understand the data-generating process for X.
• P(X, Y) = P(Y | X) × P(X)
  – P(Y | X): focus of supervised learning.
  – P(X): focus of unsupervised learning.
• The goal is clear in supervised learning, and the measure of performance is well defined; that is not the case for unsupervised learning.

Clustering

• Goal: group or segment objects into subsets or clusters, such that those within each cluster are more similar to one another than objects assigned to different clusters.
• A key issue: the choice of similarity/dissimilarity measure between two objects.
• Partitioning methods: K-means and K-medoids.
• Hierarchical methods: agglomerative, divisive or hybrid.
• Spectral clustering.
• Model-based clustering, such as mixture models,

    f(x) = \sum_{k=1}^{K} \pi_k \, p(x \mid \theta_k).

Inputs for Clustering

• X_{n×p}: data matrix, where rows stand for objects/samples and columns stand for variables/features.
• D_{n×n}: dissimilarity matrix, where d(i, j) = d(j, i) measures the difference between objects i and j.

Some clustering methods process the data matrix along with their own similarity measures, while others operate on the dissimilarity matrix. But we can always transform the input from one form to the other:

• X ⟹ D: easy.
• D ⟹ X: e.g., multi-dimensional scaling (MDS).

Multi-dimensional Scaling (MDS)

Given a dissimilarity matrix D_{n×n}, find values z_1, ..., z_n ∈ R^k to minimize the following stress function

    S(z_1, \ldots, z_n) = \sum_{i \ne j} \big( d_{ij} - \| z_i - z_j \| \big)^2.

That is, MDS finds a k-dimensional representation of the data that preserves the pairwise distances as well as possible.

K-means Clustering

• Input: X_{n×p} (data matrix) and K (number of clusters).
• K-means clustering algorithm:
  0. Start with some initial guess for the K cluster centers (centroids).
  1. For each data point, find the closest cluster center (partition step).
  2. Replace each centroid by the average of the data points in its partition (update centroids).
  3. Repeat steps 1 and 2 until convergence.
• Goal: partition the data into K groups so that the dissimilarities between points in the same cluster are smaller than those between points in different clusters.
• The dissimilarity measure is the squared Euclidean distance. (What about the un-squared Euclidean distance?)
• Optimize the following objective function (within-cluster sum of squares):

    \Omega(C_1^K, m_1^K) = \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - m_k \|^2.

K-means converges to a local minimum of Ω by iterating the following two steps.

• Step 1 (Update Centers): given the partition C_1, ..., C_K,

    \min_{m_1^K} \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - m_k \|^2.

• Step 2 (Update Partition): given the cluster centers m_1, ..., m_K,

    \min_{C_1^K} \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - m_k \|^2.

Some Practical Issues

• Try many random starting centroids and choose the solution with the smallest within-cluster SS (see the R sketch below).
• How to choose K?
• Should we normalize each dimension?
• If we use other dissimilarity measures, such as different measures for numerical, ordinal and categorical variables, can we still use the K-means algorithm?
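To make this concrete, here is a minimal R sketch on simulated data; the data, K = 3, and the number of random starts are illustrative assumptions, not part of the slides. Base R's kmeans() with the nstart argument implements the "many random starting centroids" advice, and cmdscale() gives a classical-MDS reduction of a dissimilarity matrix (classical scaling, a close relative of the least-squares stress criterion shown above rather than that exact objective).

```r
## Minimal sketch (base R) on simulated 2-d data with three groups.
set.seed(1)
X <- rbind(cbind(rnorm(50, 0), rnorm(50, 0)),
           cbind(rnorm(50, 3), rnorm(50, 3)),
           cbind(rnorm(50, 0), rnorm(50, 6)))

## nstart = 25: run 25 random initializations and keep the solution with
## the smallest within-cluster sum of squares (the objective Omega above).
fit <- kmeans(X, centers = 3, nstart = 25)
fit$tot.withinss        # within-cluster SS of the retained solution
table(fit$cluster)      # cluster sizes

## X => D is easy; D => X (approximately) via MDS.
D <- dist(X)            # Euclidean dissimilarity matrix
Z <- cmdscale(D, k = 2) # 2-d configuration recovered from D
```

If the columns of X are on very different scales, standardizing them first (e.g., scale(X)) is one answer to the normalization question raised above.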
Dissimilarity Measures

    d(x_i, x_j) = \sum_{l=1}^{p} w_l(i, j) \, d(x_{il}, x_{jl}).    (1)

• Numerical variables: d(x_{il}, x_{jl}) = g(|x_{il} − x_{jl}|), where g is a monotone function.
• Interval variables: replace x_{il} by x_{il} / (range of the lth variable), and then treat it as numerical.
• Ordinal variables (e.g., rank data, say 1 : M): replace x_{il} by (x_{il} − 1/2) / M, and then treat it as an interval-valued variable.
• Categorical variables: d(x_{il}, x_{jl}) = 1(x_{il} ≠ x_{jl}).
• w_l(i, j) = 0 if x_{il} or x_{jl} is missing.
• w_l(i, j) = 0 if x_{il} = x_{jl} = 0 when the lth variable is asymmetric binary.

Other choices of dissimilarity measures, which cannot be expressed as (1): check ?dist and ?daisy in R.

K-medoids Clustering

• Input: D_{n×n} (pairwise dissimilarity matrix) and K (number of clusters).
• Cluster centers (medoids) are restricted to be data points:

    m_k = \arg\min_{x_i : C(i) = k} \sum_{j : C(j) = k} d(x_i, x_j).    (2)

• Iterate the following two steps:
  – Partition: C(i) = \arg\min_{1 \le k \le K} d(x_i, m_k).
  – Update medoids based on (2).

Partition Around Medoids (PAM)

Recall that the K-medoids algorithm iteratively solves

    \min_{C, \, m_{1:K}} \sum_{k=1}^{K} \sum_{i : C(i) = k} d(x_i, m_k).

PAM (Kaufman and Rousseeuw, 1990) minimizes the above objective function by iteratively performing the swap x_i ↔ x_j that decreases the objective function the most, where x_i ∈ {m_1, ..., m_K} and x_j ∉ {m_1, ..., m_K}, until convergence.

How Many Clusters?

• Difficult, because it is NOT a well-defined problem.
• Can we cast it as a model selection problem and use "fitting error" plus "penalty" (increasing with K) to select K?
  – Possible for the mixture-model approach, where the error is −2 log Lik.
  – For ordinary clustering algorithms, what is the error?
  – Can we cast it as a classification problem with the label being the cluster membership? A tricky issue: the meaning of the so-called "cluster 1" changes when the data change.
  – A better choice is the association matrix A: A_{ij} = 1 if i and j are in the same cluster, and 0 otherwise.

A couple of heuristic approaches (an R sketch follows below):

• Gap statistic
• Silhouette statistic
• Prediction strength

Gap Statistic

• Many measures of goodness for clusterings are based on the tightness of the clusters:

    SS(K) = \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - m_k \|^2.

• Gap statistic (Tibshirani et al., 2001):

    G(K) = E_0\big[\log SS(K)\big] - \log SS(K)
         \approx \frac{1}{B} \sum_{b=1}^{B} \log SS_b(K) - \log SS(K).

• One-standard-error type rule:

    K^* = \arg\min_K \{ K : G(K) \ge G(K+1) - s_{K+1} \},

  where s_K = sd_0(\log SS(K)) \sqrt{1 + 1/B}.

Silhouette Statistic

• The silhouette statistic (Rousseeuw, 1987) of the ith observation measures how well it fits in its own cluster versus how well it fits in its next-closest cluster.
• Adapted to K-means, define a(i) = ||x_i − m_k||^2 and b(i) = ||x_i − m_l||^2, where i ∈ C_k and C_l is the next-closest cluster to x_i. Then its silhouette is

    s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}.

• In general,
  – a(i) = average dissimilarity between i and all other points in the same cluster;
  – d(i, C) = average dissimilarity between i and the points in C;
  – b(i) = \min_C d(i, C), where the minimum is over clusters C not containing i.
• 1 ≥ s(i) ≥ −1; usually s(i) > 0, but it can be slightly below 0.
• s(i) ≈ 1: object i is well classified.
• s(i) ≈ 0: object i lies intermediate between two clusters.
• s(i) ≈ −1: object i is badly classified.
• The larger the (average) silhouette, the better the cluster.
• Silhouette Coefficient (SC): the average silhouette width. A rule of thumb from Anja et al., "Clustering in an Object-Oriented Environment":
  – SC = 0.71–1.00: a strong structure has been found.
  – SC = 0.51–0.70: a reasonable structure has been found.
  – SC = 0.26–0.50: the structure is weak and could be artificial; try additional methods.
  – SC < 0.26: no substantial structure has been found.
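The following sketch uses the cluster package to tie together K-medoids/PAM, the average silhouette width, and the gap statistic on the simulated X from the earlier chunk; the range of K values, B = 50 reference sets, and nstart = 20 are arbitrary choices for illustration.

```r
## Sketch using the 'cluster' package; X is the simulated matrix above.
library(cluster)

## K-medoids (PAM) on a dissimilarity matrix; daisy() also handles
## mixed numerical/ordinal/categorical variables (cf. ?daisy).
D   <- daisy(as.data.frame(X))
fit <- pam(D, k = 3)            # medoids are actual observations
fit$id.med                      # indices of the medoid observations

## Average silhouette width for K = 2, ..., 8; larger is better.
avg.sil <- sapply(2:8, function(k) pam(D, k)$silinfo$avg.width)

## Gap statistic (Tibshirani et al., 2001) with K-means as the base
## clustering method; choose K by the one-standard-error type rule.
gap <- clusGap(X, FUNcluster = kmeans, nstart = 20, K.max = 8, B = 50)
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
```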
Prediction Strength

• Divide the data into training and test sets.
• Cluster the test set into K clusters.
• Cluster the training set into K clusters.
• Predict pairwise co-memberships in the test set, using the clustering rule obtained from the training set.
• Prediction strength (Tibshirani et al., 2005) is the proportion of pairwise co-memberships that are correctly predicted.
• For each K, repeat the CV procedure above and calculate the average prediction strength.

More precisely:

• Suppose we have training and test sets, X_tr and X_te, and a clustering operator C(X, k).
• If X has n observations, let D[C(·), X] be an n × n (co-membership) matrix with D_{ij} = 1 if observations i and j are in the same cluster.
• Let A_1, ..., A_k denote the indices of the test observations in the k test clusters.
• Let n_1, ..., n_k denote the numbers of observations in these clusters.
• The prediction strength of C(·, k) is defined to be

    \min_{j \in \{1, \ldots, k\}} \frac{1}{n_j (n_j - 1)} \sum_{i \ne i' \in A_j} I\big[ D(C[X_{tr}, k], X_{te})_{i i'} = 1 \big].

Hierarchical Clustering Methods

• Input:
  – Pairwise dissimilarity (or distance) between observations.
  – A rule to calculate the dissimilarity between (disjoint) groups of observations.
• Output: a hierarchical clustering result:
  – n single-point clusters at the lowest level;
  – one cluster at the highest level;
  – clusters at one level are created by splitting/merging clusters at the next higher/lower level.
• Bottom-up (agglomerative) clustering starts with all observations separate, and successively joins the closest groups of observations until all observations are in a single group.
• Dissimilarity (or distance) between groups of observations:
  – Single linkage (nearest neighbor)
  – Complete linkage (furthest neighbor)
  – Group average
• Top-down (divisive) clustering starts with all observations together, and successively divides them into two groups until all observations are separate.
• Hybrid (Chipman and Tibshirani, 2006).
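For completeness, a base-R sketch of agglomerative clustering with the three linkage rules listed above, again on the simulated X; cutting the tree at K = 3 is an illustrative choice. Divisive clustering is available as cluster::diana(), and the prediction-strength criterion is implemented, for example, by prediction.strength() in the fpc package.

```r
## Agglomerative clustering in base R with the three linkages above.
D <- dist(X)
hc.single   <- hclust(D, method = "single")    # nearest-neighbor linkage
hc.complete <- hclust(D, method = "complete")  # furthest-neighbor linkage
hc.average  <- hclust(D, method = "average")   # group-average linkage

plot(hc.average)             # dendrogram: n leaves at the bottom, one root
cutree(hc.average, k = 3)    # cut the tree to obtain K = 3 clusters

## Top-down (divisive) clustering: cluster::diana(D)
```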