Sequential K-Means Algorithms
General Problem
o Unsupervised learning
o Classification = clustering
o Static distribution (in contrast with HMMs)
o Large amounts of data
o Problems: clusters with strange shapes

Applications
o Clustering gene expression data
  Cluster analysis and display of genome-wide expression patterns. M. B. Eisen et al. (1998) PNAS 95, 14863-14868
  Distinct types of diffuse large B-cell lymphoma identified by gene-expression profiling. A. Alizadeh et al. (2000) Nature 403, 503-511
  Revealing modular organization in the yeast transcriptional network. J. Ihmels et al. (2002) Nature Genet. 31, 370-377
  Cluster analysis of gene expression data. E. Domany (2003) J. Stat. Phys. 110, 1117-1139
o Clustering genes into families (clustering graphs)
  An efficient algorithm for large-scale detection of protein families. A. Enright et al. (2002) Nucl. Acids Res. 30, 1575-1584
  Curvature of co-links uncovers hidden thematic layers in the World Wide Web. J.-P. Eckmann and E. Moses (2002) PNAS 99, 5825-5829
  A genomic perspective on protein families. R. L. Tatusov et al. (1997) Science 278, 631-637
o Clustering transcription factor binding sites
  Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics. E. van Nimwegen et al. (2002) PNAS 99, 7323-7328

Typical Distance Metrics
o Genetic distance (mutation distance)
o Amino-acid distance
o Euclidean distance
o Manhattan distance
o Bit match

Distance Measures: Minkowski Metric
Suppose two objects x and y both have p features, x = (x_1, x_2, ..., x_p) and y = (y_1, y_2, ..., y_p). The Minkowski metric is defined by

  d(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^r \right)^{1/r}

1. r = 2 (Euclidean distance): d(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^2 \right)^{1/2}
2. r = 1 (Manhattan distance): d(x, y) = \sum_{i=1}^{p} |x_i - y_i|
3. r = \infty ("sup" distance): d(x, y) = \max_{1 \le i \le p} |x_i - y_i|

Hierarchical vs. Non-hierarchical Methods
o Non-hierarchical methods divide a dataset of N objects into M clusters, with or without overlap.
o Hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains.

Non-hierarchical methods
o Partitioning methods - classes are mutually exclusive.
o Clumping methods - overlap is allowed.

Hierarchical methods
o Divisive methods begin with all objects in a single cluster and, at each of N-1 steps, divide some cluster into two smaller clusters, until each object resides in its own cluster.
o Agglomerative (fusion) methods build the hierarchy up in a series of N-1 agglomerations, or fusions, of pairs of objects, beginning with the un-clustered dataset.

Non-hierarchical Methods: Partitioning Methods
o Partitioning methods generally result in a set of M clusters, with each object belonging to one cluster. Each cluster may be represented by a centroid or a cluster representative.
o The centroid is usually the arithmetic mean of the attribute vectors of all objects within the cluster.
o If the number of clusters is large, the centroids can be further clustered to produce a hierarchy within the dataset.
o Partitioning methods are divided according to the number of passes over the data: single pass or multiple passes.
o Basic partitioning methods: the single-pass method and K-means (very widely used).

Single Pass Algorithm
1. Make the first object the centroid of the first cluster.
2. For the next object, calculate the similarity, S, with each existing cluster centroid, using some similarity coefficient.
3. If the highest calculated S is greater than some specified threshold value, add the object to the corresponding cluster and re-determine the centroid; otherwise, use the object to initiate a new cluster.
4. If any objects remain to be clustered, return to step 2.
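A minimal sketch of the single-pass procedure above, assuming vector objects compared with cosine similarity; the similarity coefficient, the threshold value, and the function name are illustrative choices, not part of the original notes.

import numpy as np

def single_pass_cluster(objects, threshold=0.9):
    # Single-pass clustering: each object joins the most similar existing
    # centroid, or starts a new cluster if no similarity exceeds the threshold.
    centroids = []   # one centroid per cluster
    members = []     # indices of the objects in each cluster
    for idx, x in enumerate(objects):
        x = np.asarray(x, dtype=float)
        if centroids:
            # similarity S of x with every existing centroid (here: cosine)
            sims = [x @ c / (np.linalg.norm(x) * np.linalg.norm(c) + 1e-12)
                    for c in centroids]
            best = int(np.argmax(sims))
            if sims[best] > threshold:
                members[best].append(idx)
                # re-determine the centroid as the mean of its members
                centroids[best] = np.mean([objects[i] for i in members[best]], axis=0)
                continue
        # otherwise the object initiates a new cluster
        centroids.append(x)
        members.append([idx])
    return centroids, members

Because each object is seen only once, the result depends on the input order, which is exactly the disadvantage noted next.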
Single Pass - Advantages and Disadvantages
o As its name implies, this method requires only one pass through the dataset; the time requirements are typically of order O(N log N) for order O(log N) clusters. This makes it a very efficient clustering method.
o A disadvantage is that the resulting clusters are not independent of the order in which the documents are processed, with the first clusters formed usually being larger than those created later in the clustering run.

K-means: The Algorithm
o Given a set of numeric points in d-dimensional space, and an integer k, the algorithm generates k (or fewer) clusters as follows:
  1. Assign all points to a cluster at random.
  2. Compute the centroid of each cluster.
  3. Reassign each point to the nearest centroid.
  4. If any centroid changed, go back to step 2.

K-means: Sample Application
o Gene clustering.
o Given a series of microarray experiments measuring the expression of a set of genes at regular time intervals in a common cell line.
o Normalization allows comparisons across microarrays.
o Produce clusters of genes which vary in similar ways over time.
o Hypothesis: genes which vary in the same way may be co-regulated and/or participate in the same pathway.

K-means: Weaknesses
o Must choose the parameter k in advance, or try many values.
o Data must be numerical and must be compared via Euclidean distance (there is a variant called the k-medians algorithm that addresses these concerns).
o The algorithm works best on data which contains spherical clusters; clusters with other geometry may not be found.
o The algorithm is sensitive to outliers - points which do not belong in any cluster. These can distort the centroid positions and ruin the clustering.

Adaptive K-means Method
o In order to increase the stability of the cluster analysis, specific or adaptive weights could be applied in the distance formula rather than ordinary weights.

Fuzzy K-means
o The clusters produced by the k-means procedure are sometimes called "hard" or "crisp" clusters, since any feature vector x either is or is not a member of a particular cluster. This is in contrast to "soft" or "fuzzy" clusters, in which a feature vector x can have a degree of membership in each cluster.
o The fuzzy k-means procedure allows each feature vector x to have a degree of membership in cluster i.

Fuzzy K-means Algorithm
1. Make initial guesses for the means m_1, m_2, ..., m_k.
2. Until there are no changes in any mean:
   a. Use the estimated means to find the degree of membership u(j, i) of x_j in cluster i; for example, if dist(j, i) = \exp(-\|x_j - m_i\|^2), one might use u(j, i) = dist(j, i) / \sum_j dist(j, i).
   b. For i from 1 to k, replace m_i with the fuzzy mean of all of the examples for cluster i:

      m_i = \frac{\sum_j u(j, i)^2 \, x_j}{\sum_j u(j, i)^2}

Sequential K-means
o Another way to modify the k-means procedure is to update the means one example at a time, rather than all at once. This is particularly attractive when we acquire the examples over a period of time and want to start clustering before we have seen all of them.
o The algorithm is similar to the regular k-means algorithm.

Sequential K-means Algorithm
1. Make initial guesses for the means m_1, m_2, ..., m_k.
2. Set the counts n_1, n_2, ..., n_k to zero.
3. Until interrupted:
   a. Acquire the next example, x.
   b. If m_i is closest to x, increment n_i and replace m_i by m_i + (1/n_i)(x - m_i).
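A minimal Python sketch of the sequential update just listed. Taking the first k examples as the initial means and the function name are assumptions made here for illustration; the update m_i <- m_i + (1/n_i)(x - m_i) is the one given in the algorithm.

import numpy as np

def sequential_kmeans(stream, k):
    # Online (sequential) k-means: update one mean per incoming example.
    stream = iter(stream)
    # initial guesses for the means: here, the first k examples (an assumption)
    means = [np.array(next(stream), dtype=float) for _ in range(k)]
    counts = [0] * k                          # n_1, ..., n_k start at zero
    for x in stream:                          # "until interrupted"
        x = np.asarray(x, dtype=float)
        i = int(np.argmin([np.linalg.norm(x - m) for m in means]))  # closest mean m_i
        counts[i] += 1                        # increment n_i
        means[i] += (x - means[i]) / counts[i]   # m_i <- m_i + (1/n_i)(x - m_i)
    return means

With this update each mean remains the running average of the examples assigned to it so far, so clustering can begin before all examples have arrived.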
K-medoids Clustering Method
o K-medoids: the medoid is the most centrally located object in a cluster.
o Arbitrarily choose k objects as the initial medoids.
o Until no change, do:
  - (Re)assign each object to the cluster whose medoid is nearest.
  - Randomly select a non-medoid object o', and compute the total cost, S, of swapping a medoid o with o'.
  - If S < 0, swap o with o' to form the new set of k medoids.
o The total cost of a clustering is \sum_{\text{clusters}} \sum_{\text{points } p \text{ in cluster}} d(p, o), where o is the cluster's medoid; S is the change in this total cost caused by the swap.

Principal Component Analysis
o Problem: many types of data have too many attributes to be visualized or manipulated conveniently. For example, a single microarray experiment may have 6,000-8,000 genes.
o PCA is a method for reducing the number of attributes (dimensions) of numerical data while attempting to preserve the cluster structure.
o After PCA, we hopefully get the same clusters as we would if we clustered the data before PCA.
o After PCA, plots of the data should still have the clusters falling into obvious groups.
o By using PCA to reduce the data to 2 or 3 dimensions, off-the-shelf geometry viewers can be used to visualize the data.
o Consider the data as an m-by-m matrix in which cell (i, j) is the covariance between attributes i and j. The eigenvectors corresponding to the d largest eigenvalues of this matrix are the "principal components". By projecting the data onto these vectors, one obtains d-dimensional points.
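A minimal sketch of the covariance/eigenvector recipe just described, reducing an n x m data matrix to d dimensions. The use of NumPy, the function name, and the default d = 2 are assumptions made for illustration.

import numpy as np

def pca_project(X, d=2):
    # Reduce an n x m data matrix to d dimensions via its principal components.
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # center each attribute
    cov = np.cov(Xc, rowvar=False)             # m x m covariance between attributes
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigendecomposition of the symmetric matrix
    top = np.argsort(eigvals)[::-1][:d]        # indices of the d largest eigenvalues
    return Xc @ eigvecs[:, top]                # project onto the principal components

The resulting 2- or 3-dimensional points can be handed to an off-the-shelf viewer, and clusters found before and after the projection can be compared as described above.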