Course: Intelligent Information Processing (《智能信息处理》)
Lecture 4: Fuzzy Information Processing Techniques (4)
Principles of Fuzzy Clustering
October 17, 2008
(Friday, periods 3-4, Room 理教110)
1
Fuzzy Clustering
What’s clustering?
Some concepts
Clustering Algorithms
K-means method
Fuzzy C-means (FCM) clustering method
Hierarchical Clustering Algorithms
Mixture of Gaussians
Homework
2
What’s clustering ?
Clustering can be considered the most important
unsupervised learning problem, it deals with
finding a structure in a collection of unlabeled
data.
Definition of clustering
The process of organizing objects into groups whose
members
are
similar
in
some
way.
A cluster is a collection of objects which are
“similar” between them and are “dissimilar” to
the objects belonging to other clusters.
3
a graphical example of clustering
4
It is easy to identify the four clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are "close" according to a given distance (in this case, geometric distance). This is called distance-based clustering.
Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if the cluster defines a concept common to all of them.
5
Vehicle Example

Vehicle | Top speed (km/h) | Colour | Air resistance | Weight (kg)
V1      | 220              | red    | 0.30           | 1300
V2      | 230              | black  | 0.32           | 1400
V3      | 260              | red    | 0.29           | 1500
V4      | 140              | gray   | 0.35           | 800
V5      | 155              | blue   | 0.33           | 950
V6      | 130              | white  | 0.40           | 600
V7      | 100              | black  | 0.50           | 3000
V8      | 105              | red    | 0.60           | 2500
V9      | 110              | gray   | 0.55           | 3500
6
Vehicle Clusters
[Scatter plot of Weight (kg) against Top speed (km/h): the vehicles fall into three groups, labelled Sports cars, Medium market cars, and Lorries.]
7
Terminology
[The same Weight (kg) vs. Top speed (km/h) plot, annotated with the basic terms: feature, feature space, object (data point), label, and cluster.]
8
The Goals of Clustering
To determine the intrinsic grouping in a
set of unlabeled data.
How to decide what constitutes a good
clustering?
9
The Goals of Clustering(2)
It can be shown that there is no absolute "best" criterion that would be independent of the final aim of the clustering.
Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs.
10
The Goals of Clustering(3)
For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection).
11
Rich Applications of Clustering
Pattern Recognition
Spatial Data Analysis
 Create thematic maps in GIS by clustering feature spaces
 Detect spatial clusters and use them in other spatial mining tasks
Image Processing
Economic Science (especially market research)
WWW
 Document classification
 Cluster Weblog data to discover groups of similar access
patterns
12
Examples of Clustering
Applications
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 Land use: Identification of areas of similar land use in an earth
observation database
 Insurance: Identifying groups of motor insurance policy holders with a
high average claim cost
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
13
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
14
Requirements of
a clustering algorithm
scalability;
dealing with different types of attributes;
discovering clusters with arbitrary shape;
minimal requirements for domain knowledge to
determine input parameters;
ability to deal with noise and outliers;
insensitivity to order of input records;
ability to handle high dimensionality;
interpretability and usability.
15
Quality: What Is Good Clustering?
A good clustering method will produce high
quality clusters with
 high intra-class similarity
 low inter-class similarity
The quality of a clustering result depends on both
the similarity measure used by the method and its
implementation
The quality of a clustering method is also
measured by its ability to discover some or all of
the hidden patterns
16
Problems
 current clustering techniques do not address all the
requirements adequately (and concurrently);
 dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity;
 the effectiveness of the method depends on the definition
of “distance” (for distance-based clustering);
 if an obvious distance measure doesn’t exist we must
“define” it, which is not always easy, especially in multidimensional spaces;
 the result of the clustering algorithm (that in many cases
can be arbitrary itself) can be interpreted in different
ways.
17
Clustering Algorithms
Clustering algorithms may be classified as
listed below:
Exclusive Clustering
Overlapping Clustering
Hierarchical Clustering
Probabilistic Clustering
18
Exclusive Clustering
Data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster it cannot be included in another cluster.
A simple example of this is shown in the figure below, where the separation of points is achieved by a straight line in a two-dimensional plane.
19
20
Overlapping clustering
Overlapping clustering uses fuzzy sets to
cluster data, so that each point may belong
to two or more clusters with different
degrees of membership.
In this case, data will be associated with an appropriate membership value.
21
Hierarchical Clustering
A hierarchical clustering algorithm is based on the union of the two nearest clusters. The initial condition is realized by setting every datum as its own cluster. After a few iterations, the desired final clusters are reached.
22
Probabilistic Clustering
Probabilistic clustering uses a completely
probabilistic approach for clustering the
data in hand.
23
Four most used clustering
algorithms
K-means
Fuzzy C-means
Hierarchical clustering
Mixture of Gaussians
24
Distance Measure
An important component of a clustering
algorithm is the distance measure between
data points.
If the components of the data instance
vectors are all in the same physical units
then it is possible that the simple Euclidean
distance metric is sufficient to successfully
group similar data instances.
However, even in this case the Euclidean
distance can sometimes be misleading.
25
26
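To make the scale issue concrete, here is a minimal sketch (NumPy, not part of the original slides; the helper name `euclidean` and the example vectors are illustrative) of how raw Euclidean distance on the vehicle features is dominated by weight, and how standardizing each feature first can help.

```python
import numpy as np

# Two features on very different scales: top speed (km/h) and weight (kg),
# taken from the vehicle example above.
v1 = np.array([220.0, 1300.0])   # sports car (V1)
v3 = np.array([260.0, 1500.0])   # sports car (V3)
v7 = np.array([100.0, 3000.0])   # lorry (V7)

def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

# Raw distances are dominated by weight, because its numeric range is larger.
print(euclidean(v1, v3), euclidean(v1, v7))

# One common remedy: standardize each feature (zero mean, unit variance)
# before computing distances, so no single unit dominates.
X = np.vstack([v1, v3, v7])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(euclidean(Xs[0], Xs[1]), euclidean(Xs[0], Xs[2]))
```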
K-Means Clustering
 K-means (MacQueen, 1967) is one of the simplest
unsupervised learning algorithms that solve the well
known clustering problem.
 The procedure follows a simple and easy way to classify a
given data set through a certain number of clusters
(assume k clusters) fixed a priori.
 The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because a different location leads to a different result; the better choice is therefore to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we re-calculate k new centroids as the barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data-set points and the nearest new centroid. A loop has thus been generated; as a result of this loop, the k centroids change their location step by step until no more changes are made, i.e. the centroids do not move any more.
Finally, this algorithm aims at minimizing an objective function, in this case a squared error function:
J = \sum_{j=1}^{k} \sum_{i=1}^{n} \| x_i^{(j)} - c_j \|^2
where \| x_i^{(j)} - c_j \|^2 is the squared distance between a data point x_i^{(j)} assigned to cluster j and the cluster centre c_j.
27
Partitioning Algorithms: Basic
Concept
 Partitioning method: Construct a partition of a database D of
n objects into a set of k clusters, such that the sum of squared
distances is minimized:
E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2
 Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67): Each cluster is represented by the center of
the cluster
 k-medoids (Kaufman & Rousseeuw’87): Each cluster is represented by
one of the objects in the cluster
29
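As an illustration of the partitioning criterion above, the following small sketch (NumPy assumed; the helper name `sse` is hypothetical) computes the sum of squared distances of every object to the centroid of its cluster for a given partition.

```python
import numpy as np

def sse(clusters):
    """Sum of squared distances of each point to its cluster centroid.

    `clusters` is a list of (n_m, d) arrays, one per cluster K_m; the
    centroid C_m of each cluster is taken to be its mean point.
    """
    total = 0.0
    for pts in clusters:
        centroid = pts.mean(axis=0)
        total += np.sum((pts - centroid) ** 2)
    return total

# Toy 2-D partition with two clusters.
partition = [np.array([[1.0, 1.0], [1.2, 0.8]]),
             np.array([[5.0, 5.0], [5.5, 4.5], [4.8, 5.2]])]
print(sse(partition))
```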
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the nearest seed point
 Go back to Step 2; stop when no new assignments are made (a code sketch follows below)
30
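A minimal sketch of these steps, assuming NumPy; the function name `kmeans` and its defaults are hypothetical, and it is meant to illustrate the procedure rather than be an optimized implementation.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch following the steps above.

    X is an (n, d) data matrix and k the number of clusters fixed a priori.
    Returns the final centroids and the cluster index of each point.
    """
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest seed point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no assignment changed: the centroids have stopped moving
        labels = new_labels
        # Recompute each centroid as the barycenter (mean) of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```

For data such as the vehicle example, one would typically standardize the features first (as in the distance-measure sketch) before calling something like `kmeans(X, k=3)`.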
The K-Means Clustering Method
Example
[Figure: with K = 2, two objects are arbitrarily chosen as the initial cluster centers; each object is assigned to the most similar center; the cluster means are updated; objects are then reassigned and the means updated again, until no assignment changes.]
31
Comments on the K-Means
Method
 Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
 Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
 Comment: Often terminates at a local optimum. The global
optimum may be found using techniques such as: deterministic
annealing and genetic algorithms
 Weakness
 Applicable only when mean is defined, then what about categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
32
Fuzzy C-Means Clustering
 Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition. It is based on minimization of the following objective function:
J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^m \, \| x_i - c_j \|^2, \qquad 1 \le m < \infty
 where m is any real number greater than 1, u_{ij} is the degree of membership of x_i in cluster j, x_i is the i-th item of the d-dimensional measured data, c_j is the d-dimensional center of the cluster, and ||·|| is any norm expressing the similarity between any measured datum and the center.
33
Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the membership u_{ij} and the cluster centers c_j updated by
u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\|x_i - c_j\|}{\|x_i - c_k\|} \right)^{2/(m-1)}}, \qquad c_j = \frac{\sum_{i=1}^{N} u_{ij}^m \, x_i}{\sum_{i=1}^{N} u_{ij}^m}
This iteration stops when \max_{ij} |u_{ij}^{(k+1)} - u_{ij}^{(k)}| < \varepsilon, where \varepsilon is a termination criterion between 0 and 1 and k is the iteration step.
34
FCM’s Steps
1. Initialize the membership matrix U = [u_{ij}], U^{(0)}.
2. At step k: calculate the center vectors C^{(k)} = [c_j] using U^{(k)}.
3. Update U^{(k)} to U^{(k+1)}.
4. If ||U^{(k+1)} - U^{(k)}|| < \varepsilon then STOP; otherwise return to step 2 (a code sketch follows below).
35
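A minimal sketch of steps 1-4, assuming NumPy; the function name `fcm` and its default parameters are hypothetical, and the update rules are the standard FCM formulas given above.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-3, max_iter=100, seed=0):
    """Minimal fuzzy c-means sketch following steps 1-4 above.

    X: (N, d) data, c: number of clusters, m: fuzziness coefficient (> 1).
    Returns the (N, c) membership matrix U and the (c, d) cluster centers.
    """
    rng = np.random.default_rng(seed)
    # Step 1: initialize U = [u_ij] with random rows that sum to 1.
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 2: compute the center vectors c_j from the current U.
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Step 3: update U from the distances to the new centers.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)          # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        # Step 4: stop when the change in U drops below eps.
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return U, centers
```

This is also essentially the algorithm the homework asks you to implement for a 2-D data set.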
Remarks
As already told, data are bound to each
cluster by means of a Membership
Function, which represents the fuzzy
behavior of this algorithm. To do that, we
simply have to build an appropriate matrix
named U whose factors are numbers
between 0 and 1, and represent the degree
of membership between data and centers of
clusters.
36
A 1-D example
37
matrix U
Now, instead of using a graphical representation, we introduce a matrix U whose entries are the values taken from the membership functions:
[example matrices (a) and (b) shown on the slide]
The number of rows and columns depends on how many data and clusters we are considering. More exactly, we have C = 2 columns (C = 2 clusters) and N rows.
38
Other properties
• Every entry of U lies between 0 and 1.
• For every datum, the memberships over all clusters sum to 1: \sum_{j=1}^{C} u_{ij} = 1.
39
A 1-D application of the FCM
Figures below show the membership value for each datum
and for each cluster.
40
In the simulation, we have used a fuzziness coefficient m = 2 and we have also imposed that the algorithm terminate when the maximum change in membership falls below a threshold \varepsilon.
The first picture shows the initial condition, where the fuzzy distribution depends on the particular position of the clusters. No step has been performed yet, so the clusters are not identified very well.
Now we can run the algorithm until the stop condition is verified. The figure below shows the final condition, reached at the 8th step with m = 2 and \varepsilon = 0.3:
41
Is it possible to do better?
Certainly: we could use a higher accuracy, but we would also have to pay for it with a greater computational effort.
In the figure below we can see a better result obtained with the same initial conditions and \varepsilon = 0.01, but 37 steps were needed!
42
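Assuming the `fcm` sketch given earlier, the accuracy/effort trade-off described above could be explored like this (illustrative usage only):

```python
import numpy as np

# Synthetic 1-D data with two well-separated groups.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 0.5, 50), rng.normal(3, 0.5, 50)])[:, None]

U_coarse, _ = fcm(x, c=2, m=2.0, eps=0.3)    # looser threshold: fewer steps
U_fine, _   = fcm(x, c=2, m=2.0, eps=0.01)   # tighter threshold: more steps
```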
Hierarchical Clustering
Algorithms
Given a set of N items to be clustered, and an N*N distance
(or similarity) matrix, the basic process of hierarchical
clustering (defined by S.C. Johnson in 1967) is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one fewer cluster.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
43
Algorithm Steps
1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s),
according to
d[(r),(s)] = min d[(i),(j)]
where the minimum is over all pairs of clusters in the current clustering.
3. Increment the sequence number : m = m +1. Merge clusters (r) and (s) into a single
cluster to form the next clustering m. Set the level of this clustering to
L(m) = d[(r),(s)]
4. Update the proximity matrix, D, by deleting the rows and columns corresponding to
clusters (r) and (s) and adding a row and column corresponding to the newly formed
cluster. The proximity between the new cluster, denoted (r,s) and old cluster (k) is
defined in this way:
d[(k),(r,s)] = min{ d[(k),(r)], d[(k),(s)] }
5. If all objects are in one cluster, stop. Else, go to step 2.
44
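The following naive sketch (NumPy assumed; the helper name `single_link` is hypothetical) follows steps 1-5 directly, shrinking the proximity matrix after each merge; it is meant to illustrate the procedure, not to be efficient.

```python
import numpy as np

def single_link(D, labels):
    """Naive sketch of the agglomerative single-link steps listed above.

    D: symmetric (N, N) distance matrix; labels: names of the N items.
    Returns the merge history as (label_r, label_s, level) tuples.
    """
    D = D.astype(float)
    labels = list(labels)
    merges = []
    while len(labels) > 1:
        # Step 2: find the least dissimilar pair of clusters (r), (s).
        np.fill_diagonal(D, np.inf)
        r, s = np.unravel_index(np.argmin(D), D.shape)
        r, s = min(r, s), max(r, s)
        merges.append((labels[r], labels[s], D[r, s]))
        # Step 4: proximity of the new cluster (r,s) to any old cluster (k)
        # is d[(k),(r,s)] = min{ d[(k),(r)], d[(k),(s)] }.
        D[r, :] = np.minimum(D[r, :], D[s, :])
        D[:, r] = D[r, :]
        labels[r] = labels[r] + "/" + labels[s]
        # Delete the row and column of cluster (s).
        D = np.delete(np.delete(D, s, axis=0), s, axis=1)
        del labels[s]
    return merges
```

Applied to the Italian-cities distance matrix of the next slides, it reproduces the sequence of merges shown there (MI/TO first, then NA/RM, and so on).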
agglomerative / divisive
This kind of hierarchical clustering is
called agglomerative because it merges
clusters iteratively.
There is also a divisive hierarchical
clustering which does the reverse by
starting with all objects in one cluster and
subdividing them into smaller pieces.
45
Example
a hierarchical clustering of distances in
kilometers between some Italian cities
46
Input distance matrix (km)

      BA    FI    MI    NA    RM    TO
BA     0   662   877   255   412   996
FI   662     0   295   468   268   400
MI   877   295     0   754   564   138
NA   255   468   754     0   219   869
RM   412   268   564   219     0   669
TO   996   400   138   869   669     0
47
MI and TO are merged into MI/TO

        BA    FI  MI/TO    NA    RM
BA       0   662    877   255   412
FI     662     0    295   468   268
MI/TO  877   295      0   754   564
NA     255   468    754     0   219
RM     412   268    564   219     0
48
NA and RM are merged into a new NA/RM cluster

        BA    FI  MI/TO  NA/RM
BA       0   662    877    255
FI     662     0    295    268
MI/TO  877   295      0    564
NA/RM  255   268    564      0
49
Finally, after BA merges with NA/RM (at distance 255) and then FI joins them (at 268), only two clusters remain:

              BA/FI/NA/RM  MI/TO
BA/FI/NA/RM            0    295
MI/TO                295      0
50
Hierarchical tree
[Dendrogram of the six cities: MI and TO join at 138, NA and RM at 219, BA joins NA/RM at 255, FI joins them at 268, and finally MI/TO joins the rest at 295.]
51
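For comparison, the same single-link clustering of the city distances can be obtained with SciPy; this is an illustrative usage sketch (plotting the dendrogram additionally requires matplotlib).

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
], dtype=float)

# linkage() expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D), method="single")
print(Z)                      # merge levels: 138, 219, 255, 268, 295
dendrogram(Z, labels=cities)  # the hierarchical tree shown on the slide
```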
Clustering as a Mixture of
Gaussians
This is a model-based approach, which consists in using certain models for the clusters and attempting to optimize the fit between the data and the model.
Each cluster can be mathematically represented
by a parametric distribution, like a Gaussian
(continuous) or a Poisson (discrete). The entire
data set is therefore modeled by a mixture of
these distributions. An individual distribution
used to model a specific cluster is often referred
to as a component distribution
52
A mixture model with high likelihood tends to have
the following traits:
 component distributions have high “peaks” (data in one
cluster are tight);
 the mixture model “covers” the data well (dominant
patterns in the data are captured by component
distributions).
Main advantages of model-based clustering:
 well-studied statistical inference techniques available;
 flexibility in choosing the component distribution;
 obtain a density estimation for each cluster;
 a “soft” classification is available.
53
Mixture of Gaussians
54
The algorithm works in the following way:
• it chooses one component (one of the Gaussians) at random, with probability equal to that component's mixing weight;
• it samples a point from the chosen Gaussian.
Suppose we have a sample x1, x2, ..., xN generated in this way, while the parameters of the mixture (the centres of the Gaussians) are unknown.
We can then write the likelihood of the sample; what we really want to maximise is the probability of the data given the centres of the Gaussians.
55
This probability is the basis for writing the likelihood function.
Now we should maximise the likelihood function by taking its derivatives with respect to the parameters and setting them to zero, but doing this directly would be too difficult.
That is why we use a simpler iterative procedure, the EM (Expectation-Maximization) algorithm.
56
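As a hedged illustration of this EM-based approach (not from the original slides), scikit-learn's GaussianMixture fits such a mixture and returns the "soft" memberships mentioned earlier; the synthetic data below are purely illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data drawn from two Gaussian components.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
               rng.normal(loc=[4, 4], scale=1.0, size=(100, 2))])

# Fit a 2-component mixture with the EM algorithm.
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X)

print(gmm.means_)            # estimated component centres
print(gmm.weights_)          # estimated mixing probabilities
soft = gmm.predict_proba(X)  # "soft" cluster membership of each point
```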
References
 Tariq Rashid: "Clustering". http://www.cs.bris.ac.uk/home/tr1690/documentation/fuzzy_clustering_initial_report/node11.html
 Osmar R. Zaïane: "Principles of Knowledge Discovery in Databases - Chapter 8: Data Clustering". http://www.cs.ualberta.ca/~zaiane/courses/cmput690/slides/Chapter8/index.html
 Pier Luca Lanzi: "Ingegneria della Conoscenza e Sistemi Esperti - Lezione 2: Apprendimento non supervisionato". http://www.elet.polimi.it/upload/lanzi/corsi/icse/2002/Lezione%202%20%20Apprendimento%20non%20supervisionato.pdf
 J. C. Dunn (1973): "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics 3: 32-57.
 J. C. Bezdek (1981): "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York.
 Hans-Joachim Mucha and Hizir Sofyan: "Nonhierarchical Clustering". http://www.quantlet.com/mdstat/scripts/xag/html/xaghtmlframe149.html
 A. P. Dempster, N. M. Laird, and D. B. Rubin (1977): "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, vol. 39, 1:1-38.
 Jia Li: "Data Mining - Clustering by Mixture Models". http://www.stat.psu.edu/~jiali/course/stat597e/notes/mix.pd
57
Homework
1. Why do we need cluster analysis? What is it useful for?
2. List some application areas of clustering algorithms and briefly explain them.
3. Implement the FCM algorithm and use it on a 2-D data clustering problem; give the experimental results.
58
Thank you!
59